
Feature Request: Support TSV and JSON file formats as input data #66

Closed
serv opened this issue Feb 13, 2019 · 10 comments
Labels
feature New feature or request

Comments

serv commented Feb 13, 2019

CSV is fine for numerical data, but when you have text data that may contain `,` and `"`, escaping the column values can be tricky, and it is harder for CSV parsers to identify the delimiting commas.

Could you add support for the TSV and JSON data file formats, which suffer from these problems much less?
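For illustration (a hypothetical sketch, not part of this project), Python's standard `csv` and `json` modules show why CSV escaping is fiddly while JSON is unambiguous:

```python
import csv
import io
import json

# A text field containing both the delimiter (,) and the quote character (").
row = {"label": "positive", "text": 'She said, "hello, world"'}

# CSV must quote the field and double every embedded quote.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["label", "text"])
writer.writeheader()
writer.writerow(row)
# buf now holds:
#   label,text
#   positive,"She said, ""hello, world"""

# JSON escaping is unambiguous and handled by any JSON library.
encoded = json.dumps(row)
assert json.loads(encoded) == row
```

A parser that splits naively on commas would break the CSV line into the wrong number of columns, which is exactly the failure mode described above.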

@w4nderlust (Collaborator)

Thank you for your suggestion.
TSV is a no-brainer and will come really soon!
For JSON, the structure may be arguable (column-wise? row-wise?). I will have to put some thought into the best solution, but I am definitely considering it.

@w4nderlust w4nderlust added the feature New feature or request label Feb 13, 2019
dshlai commented Feb 14, 2019

pandas has `to_json` and `read_json` methods on the DataFrame that support several user-specified orientations: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html

So on the command line we could have a similar “orient” option for the user to indicate which orientation the file uses.

It can also embed a schema in the output JSON file. Would that work?
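As a sketch of what that could look like (the pandas calls are as documented; the command-line flag itself is hypothetical):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"text": ["good movie", "bad movie"], "label": [1, 0]})

# "records" orientation: a list with one JSON object per row.
records_json = df.to_json(orient="records")

# "split" orientation: separate "columns", "index" and "data" entries.
split_json = df.to_json(orient="split")

# Reading back requires the matching orient, which is what a
# command-line --orient option would pass through to read_json.
df2 = pd.read_json(StringIO(records_json), orient="records")
df3 = pd.read_json(StringIO(split_json), orient="split")
```

The `orient="table"` variant is the one that embeds a schema alongside the data.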

@w4nderlust (Collaborator)

Thanks for pointing it out @dshlai. Reading this, it looks like pandas has several more I/O options than I was aware of. So yes, adding support for most of the formats that pandas supports is definitely feasible without too many problems. It will be coming soon.

loretoparisi commented Feb 14, 2019

@w4nderlust so the good old read_json could be used for that!

@w4nderlust w4nderlust added this to Feature Requests in Ludwig Development Feb 18, 2020
@jeffin07

@w4nderlust Is this issue still open, or is anyone working on it?

ifokeev (Contributor) commented Apr 21, 2020

@jeffin07 all work now is around TF2, so if you're interested in this feature, you could provide the PR. Thanks.

@w4nderlust (Collaborator)

@jeffin07 it is still open and needed. We are also planning a reorganization of the preprocessing pipeline: https://github.com/uber/ludwig/tree/preprocessing_strategy is a branch with an example of how the preprocessing would look after refactoring, which will easily enable this feature, but I haven't spent time on it yet as I am full steam on TF2.
Would you be interested in helping out with the refactoring and adding this feature? Adding it without the refactoring would actually be more difficult and lead to ad hoc code, while the proposed refactoring creates a more generic solution for any kind of data format.
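One way such a generic solution could look (a hypothetical sketch; `DATA_READERS` and `load_dataset` are illustrative names, not the actual API in that branch):

```python
import pandas as pd

# Hypothetical registry mapping a data format name to a reader function.
DATA_READERS = {
    "csv": lambda path: pd.read_csv(path),
    "tsv": lambda path: pd.read_csv(path, sep="\t"),
    "json": lambda path: pd.read_json(path, orient="records"),
}


def load_dataset(path, data_format=None):
    """Dispatch on an explicit format, falling back to the file extension."""
    fmt = (data_format or path.rsplit(".", 1)[-1]).lower()
    try:
        reader = DATA_READERS[fmt]
    except KeyError:
        raise ValueError(f"Unsupported data format: {fmt}")
    return reader(path)
```

Under this design, supporting a new format means registering one more entry in the table rather than threading ad hoc branches through the preprocessing code.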

@jeffin07

@w4nderlust Yes, I would love to help with the refactoring, which will also help me understand the project better. I will check out the branch you specified. Can you give me some guidance to get started?

@w4nderlust (Collaborator)

Sure, definitely.
What I suggest is that you first take a dataset and train a model with Ludwig. Take one of the examples on the website, maybe text classification. Put breakpoints everywhere in the preprocessing.py script to see what actually happens during preprocessing: for instance, how metadata is obtained, how preprocessing parameters are used, and how the final data transformation is performed. You'll notice that each feature type has its own classes that implement those steps. To begin with, I would suggest looking at just a couple of them, for instance sequence (which is medium-complex) and category (which is medium-easy). Numerical and binary are the easiest, while images and audio are the most complex at the moment.
Another thing you will notice is the use of caching, with HDF5 files for processed data and JSON files for metadata.
After you have an understanding of how the whole process works, you'll realize what makes the current design kind of tricky to extend with additional data formats.
Finally, if you look at the branch that I pointed you to, you'll see a sketch of the design that I would like to follow to make preprocessing more flexible, with pluggable data formats and preprocessing strategies.
After you get to that point you'll surely have a lot of questions. Feel free to reach out to me privately and I can answer all of them (that's just to spare the GitHub issue a flood of posts).
After you have a clear picture we can define specific tasks together.
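To make the per-feature structure concrete, here is an illustrative sketch (not Ludwig's actual classes or method names) of the pattern described above, where each feature type implements its own metadata extraction and data transformation:

```python
# Illustrative feature classes; names and method signatures are hypothetical.
class CategoryFeature:
    """Medium-easy: builds a vocabulary, then maps strings to indices."""

    def get_metadata(self, column):
        # Metadata pass: collect a sorted vocabulary over the raw column.
        vocab = sorted(set(column))
        return {"str2idx": {s: i for i, s in enumerate(vocab)}}

    def transform(self, column, metadata):
        # Transformation pass: map each value using the collected metadata.
        return [metadata["str2idx"][s] for s in column]


class BinaryFeature:
    """Easiest: no metadata needed, values map directly to 0/1."""

    def get_metadata(self, column):
        return {}

    def transform(self, column, metadata):
        return [1 if v else 0 for v in column]


feat = CategoryFeature()
col = ["cat", "dog", "cat"]
meta = feat.get_metadata(col)
indices = feat.transform(col, meta)  # [0, 1, 0] with the alphabetical vocab
```

The two-pass split (metadata first, transformation second) is also what makes the HDF5/JSON caching mentioned above possible: the JSON file stores the metadata, the HDF5 file stores the transformed data.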

@w4nderlust (Collaborator)

This has been recently added. Closing.

Ludwig Development automation moved this from Feature Requests to Done Sep 29, 2020
6 participants