[NeurIPS] How to "ingest" multiple datasets made of .xz files as data/samples and space-separated .txt files as ground truth #693

Open
4ndr3aR opened this issue Jun 11, 2024
4ndr3aR commented Jun 11, 2024

Hi! I think I have a similar problem to the one in #679. Our dataset contains .xz files as data/samples (point-cloud text files saved with np.savetxt(fd, points, fmt='%.5e') and then xz-compressed for efficiency) and .txt files as ground truth (space-separated). The first line of each ground-truth file is a sort of header: a single int giving the number of lines to read from the file. It is followed by an arbitrary number of lines, each with 7 or 8 space-separated columns. A simplified read/write sketch is included at the end of this comment.

Honestly, I don't see an "easy" way to represent all of this in Croissant in a meaningful way, or at least I can't figure out how to proceed. Since it's a tool designed to "ingest" ML datasets, I would have expected a vocabulary closer to the discipline (dataset, subset, sample, ground truth, etc.).

I've uploaded two single file_objects, one as ground truth and one as data/sample. The interface then asks for the names of the fields and lets me specify a regular expression (which is actually a good idea for grabbing e.g. the header/number of lines), but it gives no feedback about what is really happening or about what a given input would produce.

I think I'll give up for the moment. The idea is good, but the tool doesn't seem usable yet, at least not for non-standard cases like our dataset.
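For concreteness, here is a simplified sketch of how the two kinds of files are written and read on our side (the file names, array shape, and the read_gt helper are just placeholders for illustration):

```python
import lzma
from io import StringIO

import numpy as np

# --- data/sample file: a point cloud saved as text, then xz-compressed ---
points = np.random.rand(1000, 3)             # placeholder point cloud
buf = StringIO()
np.savetxt(buf, points, fmt='%.5e')          # same call used to produce the dataset
with lzma.open('sample_0001.xz', 'wt') as fd:
    fd.write(buf.getvalue())

# reading a sample back
with lzma.open('sample_0001.xz', 'rt') as fd:
    points_back = np.loadtxt(fd)

# --- ground-truth .txt file: header line with the row count,
# --- then 7 or 8 space-separated columns per row ---
def read_gt(path):
    with open(path) as fd:
        n_rows = int(fd.readline())          # header: number of rows to read
        return [[float(x) for x in fd.readline().split()]
                for _ in range(n_rows)]
```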

@pierrot0 pierrot0 self-assigned this Jun 12, 2024