C API #53

alejandro-colomar · 2020-02-12T16:01:05Z

Hi,

Recently I wanted to write something similar to pandas in C. I had been working on it for a few weeks when it occurred to me that someone had probably done something similar already, and then I found this.

Do you think it could be ported to C, or at least given a C interface that links to the C++ code? If so, maybe I could help you.

Kind regards,
Alex.

hosseinmoein · 2020-02-12T17:57:56Z

That is an interesting question.
This library is as much about interface as it is about functionality. It is designed so that additional functionality can easily be added. It is meant to be a coherent container, so it is inherently a C++ thing.
Can you put an interface on top of it to make it C-usable? Yes.
But why? And what type of interface are you envisioning for C?
You can always use it as is in a predominantly C application. But I am curious to hear what you are envisioning.

Best,
HM

alejandro-colomar · 2020-02-12T18:24:53Z

As of now, I have a data structures library (with dynamic arrays, dynamic buffers, linked lists, and binary search trees) and I'm using it in the test program to build my DataFrame emulator.

First I did a working program where I hard-coded many things, and now I'm transforming the program into a library by generalizing it.

The API I envisioned is this one:

enum	Alx_DataFrame_Type {
	ALX_DF_S64 = 1,
	ALX_DF_DBL,
	ALX_DF_STR
};

struct	Alx_DataFrame_Cell {
	union {
		int64_t			z;
		double			r;
		struct Alx_DynBuf	*s;
	};
	int	err;
};

struct	Alx_DataFrame_Row {
	struct Alx_LinkedList	*cells;
	int			err;
};

union	Alx_DataFrame_Desc {
	struct	Alx_DataFrame_Desc_Txt {
		int	uniq;
		int	top;
		int	freq;
	} txt;	/* named member: a tagged struct without a declarator declares no member */
	struct	Alx_DataFrame_Desc_Num {
		double	mean;
		double	std;
		double	min;
		double	q_25;
		double	q_50;
		double	q_75;
		double	max;
	} num;
};

struct	Alx_DataFrame_Col {
	int				type;   /* enum Alx_DataFrame_Type */
	struct Alx_DynBuf		*hdr;   /* column header string */
	cmp_f				*cmp;   /* user custom comparison function for the data (if NULL, standard comparisons are done) */
	bool				ltd_values;   /* limited set of possible values? */
	struct Alx_BST			*values;   /* values; either stored by the parser, or passed by the user if limited set of values is true */
	struct Alx_DataFrame_Desc	*desc;
};

struct	Alx_DataFrame {
	struct Alx_LinkedList	*cols;
	struct Alx_LinkedList	*rows;
};

int	alx_df_init		(struct Alx_DataFrame **df);
void	alx_df_deinit		(struct Alx_DataFrame *df);
int	alx_df_add_col		(struct Alx_DataFrame *restrict df,
				 int type, const char *restrict hdr,
				 cmp_f *cmp,
				 struct Alx_BST *restrict values);
int	alx_df_parse		(struct Alx_DataFrame *restrict df,
				 FILE *restrict istream);
int	alx_df_drop_row		(struct Alx_DataFrame *df,
				 ptrdiff_t nrow);
int	alx_df_drop_col		(struct Alx_DataFrame *df,
				 ptrdiff_t ncol);
int	alx_df_dropna		(struct Alx_DataFrame *df);
int	alx_df_sort		(struct Alx_DataFrame *df,
				 ptrdiff_t ncol);
int	alx_df_sort_bwd		(struct Alx_DataFrame *df,
				 ptrdiff_t ncol);
int	alx_df_describe		(struct Alx_DataFrame *df);
int	alx_df_fprn_data	(FILE *restrict ostream,
				 struct Alx_DataFrame *restrict df);
int	alx_df_fprn_desc	(FILE *restrict ostream,
				 struct Alx_DataFrame *restrict df);

The dataframe would consist of a linked list of rows, and a linked list with column configurations and descriptions.

The rows are also linked lists of cells, which in the end contain the data in dynamic buffers.

A simple program using it would be the following (it's a prototype and may contain errors; also, I didn't care about error handling):

enum Fields {
	FLDS_ID,
	FLDS_NAME,
	FLDS_AGE,
	FLDS_HEIGHT,

	FIELDS
};
const char *const hdrs[FIELDS] = {
	[FLDS_ID]	= "id",
	[FLDS_NAME]	= "name",
	[FLDS_AGE]	= "age",
	[FLDS_HEIGHT]	= "height"
};
const int types[FIELDS] = {
	[FLDS_ID]	= ALX_DF_S64,
	[FLDS_NAME]	= ALX_DF_STR,
	[FLDS_AGE]	= ALX_DF_S64,
	[FLDS_HEIGHT]	= ALX_DF_DBL
};

int main(void)
{
	struct Alx_DataFrame	*df;
	FILE			*fp;
	FILE			*less;

	fp = fopen("file.csv", "r");

	alx_df_init(&df);
	for (ptrdiff_t i = 0; i < FIELDS; i++)
		alx_df_add_col(df, types[i], hdrs[i], NULL, NULL);
	alx_df_parse(df, fp);

	alx_df_sort_bwd(df, FLDS_AGE);	/* oldest first */
	less	= popen("less -S", "w");
	alx_df_fprn_data(less, df);	/* print data with less(1) */
	pclose(less);

	alx_df_describe(df);		/* calculate description */
	less	= popen("less -S", "w");
	alx_df_fprn_desc(less, df);	/* print description with less(1) */
	pclose(less);

	alx_df_dropna(df);		/* drop rows with invalid values */
	less	= popen("less -S", "w");
	alx_df_fprn_data(less, df);	/* print data with less(1) */
	pclose(less);
	alx_df_describe(df);		/* need to calculate description again */
	less	= popen("less -S", "w");
	alx_df_fprn_desc(less, df);	/* print description with less(1) */
	pclose(less);

	alx_df_deinit(df);
	fclose(fp);

	return	0;
}

alejandro-colomar · 2020-02-12T18:42:50Z

The problem with what I have now, as I see it, is that I have zillions of mallocs, and I'm concerned about performance. Maybe your library could be faster.

Nevertheless, as it's relatively easy and simple, I'll first finish my library just to measure its performance. It'll take me some time, though. If yours relies on arrays, it will probably be much faster. That's why I thought of porting or wrapping it to C.

hosseinmoein · 2020-02-12T19:05:55Z

So, I followed a few principles in this library:

1. I must support any type, either built-in or user-defined, without needing new code
2. Never chase pointers à la linked lists, including virtual function calls
3. Have all column data in contiguous memory space
4. Never use more space than you need (i.e. unions)
5. Avoid copying data as much as possible. Unfortunately, sometimes you have to
6. Use multi-threading, but only when it makes sense

alejandro-colomar · 2020-02-12T19:10:43Z

Regarding 2 & 3:

I first tried to do that, but I don't know how to do it, and I don't know if it is possible in C. How do you store all data from a column contiguously, if every field can have a different type? Do you use templates for that in C++?

Would you know how to do a C interface for your library similar to what I wrote? I don't know much about the internals of your library (I don't know much C++). I could help in the C code.

hosseinmoein · 2020-02-12T19:12:59Z

Yes, this library relies very heavily on templates. I am not sure how/if that is possible in C.
Columns could be of different types. But each element in a given column is of the same type.

I suggest you look at my documentation and code to get some ideas. You could just use it as is in your apps.

hosseinmoein added enhancement question labels Feb 12, 2020

hosseinmoein closed this as completed Mar 2, 2020
