C API #53

Closed
alejandro-colomar opened this issue Feb 12, 2020 · 6 comments

@alejandro-colomar

Hi,

Recently I wanted to write something similar to pandas in C. I had been working on it for a few weeks when it occurred to me that someone had probably done something similar already, and then I found this.

Do you think it could be ported to C, or that one could at least write an interface that lets C code link against the C++ code? If so, maybe I could help you.

Kind regards,
Alex.

@hosseinmoein
Owner

That is an interesting question.
This library is as much about interface as it is about functionality. It is designed so that additional functionality can easily be added. It is meant to be a coherent container, so it is inherently a C++ thing.
Can you put an interface on top of it to make it C-usable? Yes.
But why? And what type of interface are you envisioning for C?
You can always use it as is in a predominantly C application. But I am curious to hear what you are envisioning.

Best,
HM
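
One common shape for such an interface is an opaque handle: the C header only declares an incomplete struct type plus functions that take a pointer to it. The sketch below illustrates that pattern and nothing more. Every name in it (`struct cdf`, `cdf_create`, `cdf_push_dbl`, and so on) is hypothetical and is not part of this library. In a real wrapper the functions would be defined in a C++ translation unit with `extern "C"` linkage and would forward to the C++ code; here they are stubbed in plain C so the example compiles by itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* --- What a public C header could declare (all names hypothetical) --- */
struct cdf;			/* opaque handle; layout hidden from C callers */

struct cdf	*cdf_create(void);
void		cdf_destroy(struct cdf *df);
int		cdf_push_dbl(struct cdf *df, double x);
size_t		cdf_count(const struct cdf *df);
double		cdf_mean(const struct cdf *df);

/* --- Stand-in implementation so the sketch compiles as plain C --- */
struct cdf {
	size_t	n;
	double	sum;
};

struct cdf *cdf_create(void)
{
	struct cdf	*df = malloc(sizeof(*df));

	if (df != NULL) {
		df->n = 0;
		df->sum = 0.0;
	}
	return df;
}

void cdf_destroy(struct cdf *df)
{
	free(df);
}

int cdf_push_dbl(struct cdf *df, double x)
{
	if (df == NULL)
		return -1;
	df->n++;
	df->sum += x;
	return 0;
}

size_t cdf_count(const struct cdf *df)
{
	return df ? df->n : 0;
}

double cdf_mean(const struct cdf *df)
{
	return (df && df->n) ? df->sum / (double)df->n : 0.0;
}

/* Caller side: only the declarations at the top are needed. */
static double cdf_demo(void)
{
	struct cdf	*df = cdf_create();
	double		mean;

	if (df == NULL)
		return -1.0;
	cdf_push_dbl(df, 1.0);
	cdf_push_dbl(df, 2.0);
	cdf_push_dbl(df, 3.0);
	assert(cdf_count(df) == 3);
	mean = cdf_mean(df);
	cdf_destroy(df);
	return mean;
}
```

Because callers only ever hold a `struct cdf *`, the implementation behind it can change (or be written in another language) without touching C code that uses the header.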

@alejandro-colomar
Author

alejandro-colomar commented Feb 12, 2020

As of now, I have a data structures library (with dynamic arrays, dynamic buffers, linked lists, and binary search trees) and I'm using it in the test program to build my DataFrame emulator.

First I did a working program where I hard-coded many things, and now I'm transforming the program into a library by generalizing it.

The API I envisioned is this one:

enum	Alx_DataFrame_Type {
	ALX_DF_S64 = 1,
	ALX_DF_DBL,
	ALX_DF_STR
};

struct	Alx_DataFrame_Cell {
	union {
		int64_t			z;
		double			r;
		struct Alx_DynBuf	*s;
	};
	int	err;
};

struct	Alx_DataFrame_Row {
	struct Alx_LinkedList	*cells;
	int			err;
};

union	Alx_DataFrame_Desc {
	struct	Alx_DataFrame_Desc_Txt {
		int	uniq;
		int	top;
		int	freq;
	} txt;	/* description of a string column */
	struct	Alx_DataFrame_Desc_Num {
		double	mean;
		double	std;
		double	min;
		double	q_25;
		double	q_50;
		double	q_75;
		double	max;
	} num;	/* description of a numeric column */
};

struct	Alx_DataFrame_Col {
	int				type;   /* enum Alx_DataFrame_Type */
	struct Alx_DynBuf		*hdr;   /* column header string */
	cmp_f				*cmp;   /* user custom comparison function for the data (if none, standard comparisons are done) */
	bool				ltd_values;   /* limited set of possible values? */
	struct Alx_BST			*values;   /* values; either stored by the parser, or passed by the user if limited set of values is true */
	struct Alx_DataFrame_Desc	*desc;
};

struct	Alx_DataFrame {
	struct Alx_LinkedList	*cols;
	struct Alx_LinkedList	*rows;
};

int	alx_df_init		(struct Alx_DataFrame **df);
void	alx_df_deinit		(struct Alx_DataFrame *df);
int	alx_df_add_col		(struct Alx_DataFrame *restrict df,
				 int type, const char *restrict hdr,
				 cmp_f *cmp,
				 struct Alx_BST *restrict values);
int	alx_df_parse		(struct Alx_DataFrame *restrict df,
				 FILE *restrict istream);
int	alx_df_drop_row		(struct Alx_DataFrame *df,
				 ptrdiff_t nrow);
int	alx_df_drop_col		(struct Alx_DataFrame *df,
				 ptrdiff_t ncol);
int	alx_df_dropna		(struct Alx_DataFrame *df);
int	alx_df_sort		(struct Alx_DataFrame *df,
				 ptrdiff_t ncol);
int	alx_df_sort_bwd		(struct Alx_DataFrame *df,
				 ptrdiff_t ncol);
int	alx_df_describe		(struct Alx_DataFrame *df);
int	alx_df_fprn_data	(FILE *restrict ostream,
				 struct Alx_DataFrame *restrict df);
int	alx_df_fprn_desc	(FILE *restrict ostream,
				 struct Alx_DataFrame *restrict df);

The dataframe would consist of a linked list of rows, and a linked list with column configurations and descriptions.

The rows are also linked lists of cells, which in the end contain the data in dynamic buffers.

A simple program using it would be the following (it's a prototype; it may have errors; also, I didn't care about error handling):

enum Fields {
	FLDS_ID,
	FLDS_NAME,
	FLDS_AGE,
	FLDS_HEIGHT,

	FIELDS
};
const char *const hdrs[FIELDS] = {
	[FLDS_ID]	= "id",
	[FLDS_NAME]	= "name",
	[FLDS_AGE]	= "age",
	[FLDS_HEIGHT]	= "height"
};
const int types[FIELDS] = {
	[FLDS_ID]	= ALX_DF_S64,
	[FLDS_NAME]	= ALX_DF_STR,
	[FLDS_AGE]	= ALX_DF_S64,
	[FLDS_HEIGHT]	= ALX_DF_DBL
};

int main(void)
{
	struct Alx_DataFrame	*df;
	FILE			*fp;
	FILE			*less;

	fp = fopen("file.csv", "r");

	alx_df_init(&df);
	for (ptrdiff_t i = 0; i < FIELDS; i++)
		alx_df_add_col(df, types[i], hdrs[i], NULL, NULL);
	alx_df_parse(df, fp);

	alx_df_sort_bwd(df, FLDS_AGE);	/* oldest first */
	less	= popen("less -S", "w");
	alx_df_fprn_data(less, df);	/* print data with less(1) */
	pclose(less);

	alx_df_describe(df);		/* calculate description */
	less	= popen("less -S", "w");
	alx_df_fprn_desc(less, df);	/* print description with less(1) */
	pclose(less);

	alx_df_dropna(df);		/* drop rows with invalid values */
	less	= popen("less -S", "w");
	alx_df_fprn_data(less, df);	/* print data with less(1) */
	pclose(less);
	alx_df_describe(df);		/* need to calculate description again */
	less	= popen("less -S", "w");
	alx_df_fprn_desc(less, df);	/* print description with less(1) */
	pclose(less);

	alx_df_deinit(df);
	fclose(fp);

	return	0;
}

@alejandro-colomar
Author

alejandro-colomar commented Feb 12, 2020

The problem with what I have now, as I see it, is that I have zillions of mallocs, and I'm concerned about performance. Maybe your library could be faster.

Nevertheless, as it's relatively easy and simple, I'll first finish my library just to measure its performance. It'll take me some time, though. If yours relies on arrays, it will probably be much faster. That's why I thought of porting it to C or wrapping it for C.
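
To make the malloc concern concrete, here is a minimal sketch of an array-backed alternative, assuming a single column of doubles. The names (`struct dbl_vec`, `dbl_vec_push`, `dbl_vec_demo`) are made up for illustration and belong to neither library. The buffer doubles its capacity whenever it fills up, so the number of allocations grows with the logarithm of the element count instead of one allocation per cell.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable array of doubles; not part of either library. */
struct dbl_vec {
	double	*data;
	size_t	len;
	size_t	cap;
	size_t	nalloc;		/* realloc() calls so far, kept for illustration */
};

/* Append x, doubling the capacity when the buffer is full.  0 on success. */
static int dbl_vec_push(struct dbl_vec *v, double x)
{
	if (v->len == v->cap) {
		size_t	cap = v->cap ? 2 * v->cap : 16;
		double	*p = realloc(v->data, cap * sizeof(*p));

		if (p == NULL)
			return -1;
		v->data = p;
		v->cap = cap;
		v->nalloc++;
	}
	v->data[v->len++] = x;
	return 0;
}

static void dbl_vec_free(struct dbl_vec *v)
{
	free(v->data);
	*v = (struct dbl_vec){0};
}

/* Push n values and return how many allocations that took. */
static size_t dbl_vec_demo(size_t n)
{
	struct dbl_vec	v = {0};
	size_t		nalloc;

	for (size_t i = 0; i < n; i++) {
		if (dbl_vec_push(&v, (double)i) != 0)
			break;
	}
	assert(v.len <= v.cap);
	nalloc = v.nalloc;
	dbl_vec_free(&v);
	return nalloc;
}
```

With the starting capacity of 16 chosen here, `dbl_vec_demo(1000000)` performs 17 allocations (capacities 16, 32, ..., 1048576), whereas a layout with one node per cell needs at least one allocation per value. Whether that translates into a measurable speedup for a given workload still has to be benchmarked.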

@hosseinmoein
Owner

hosseinmoein commented Feb 12, 2020

So, I followed a few principles in this library:

  1. It must support any type, either built-in or user-defined, without needing new code
  2. Never chase pointers à la linked lists, including virtual function calls
  3. Have all column data in contiguous memory
  4. Never use more space than you need (i.e. unions)
  5. Avoid copying data as much as possible. Unfortunately, sometimes you have to
  6. Use multi-threading, but only when it makes sense

@alejandro-colomar
Author

alejandro-colomar commented Feb 12, 2020

Regarding 2 & 3:

I first tried to do that, but I don't know how to do it, and I don't know if it is possible in C. How do you store all data from a column contiguously, if every field can have a different type? Do you use templates for that in C++?

Would you know how to write a C interface for your library similar to what I wrote? I don't know much about the internals of your library (I don't know much C++). I could help with the C code.

@hosseinmoein
Owner

hosseinmoein commented Feb 12, 2020

Yes, this library relies very heavily on templates. I am not sure how/if that is possible in C.
Columns could be of different types. But each element in a given column is of the same type.

I suggest you look at my documentation and code to get some ideas. You could just use it as is in your apps.
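
The point that each element in a given column has the same type is what makes contiguous storage workable without templates. A rough C sketch, under the assumption that only two element types are needed: each column carries one type tag and one homogeneous array, so the type is tested once per column rather than once per cell. The names (`struct column`, `enum col_type`, `column_mean`) are hypothetical and say nothing about how the C++ library is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum col_type { COL_S64, COL_DBL };	/* hypothetical type tags */

/* One column: a single type tag plus one contiguous array of that type. */
struct column {
	enum col_type	type;
	size_t		len;
	void		*data;	/* int64_t[len] or double[len], per 'type' */
};

static int column_init(struct column *c, enum col_type type, size_t len)
{
	size_t	elsize = (type == COL_S64) ? sizeof(int64_t) : sizeof(double);

	c->data = calloc(len ? len : 1, elsize);
	if (c->data == NULL)
		return -1;
	c->type = type;
	c->len = len;
	return 0;
}

static void column_deinit(struct column *c)
{
	free(c->data);
	c->data = NULL;
	c->len = 0;
}

/* Mean of a column: the tag is tested once, then the loop walks one array. */
static double column_mean(const struct column *c)
{
	double	sum = 0.0;

	assert(c->data != NULL);
	if (c->len == 0)
		return 0.0;
	if (c->type == COL_S64) {
		const int64_t	*v = c->data;

		for (size_t i = 0; i < c->len; i++)
			sum += (double)v[i];
	} else {
		const double	*v = c->data;

		for (size_t i = 0; i < c->len; i++)
			sum += v[i];
	}
	return sum / (double)c->len;
}

/* Build a 4-element integer column and return its mean. */
static double column_demo(void)
{
	static const int64_t	ages[] = {20, 30, 40, 50};
	struct column		age;
	int64_t			*v;
	double			mean;

	if (column_init(&age, COL_S64, 4) != 0)
		return -1.0;
	v = age.data;
	for (size_t i = 0; i < 4; i++)
		v[i] = ages[i];
	mean = column_mean(&age);
	column_deinit(&age);
	return mean;
}
```

A data frame in this style would be an array of such columns, and a row would be the set of elements sharing an index across them. String columns and user-defined types need more machinery than this sketch shows.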
