
Feature Request: Support TSV and JSON file formats as input data #66

Closed
serv opened this issue Feb 13, 2019 · 10 comments
Labels
feature New feature or request

Comments

serv commented Feb 13, 2019

CSV is fine for numerical data, but when you have text data that may contain `,` and `"`, escaping the column values can be tricky, and it is harder for CSV parsers to identify the delimiting commas.

Could you add support for the TSV and JSON data file formats, which suffer from these problems much less?
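For illustration (a hypothetical sketch, not part of this project), Python's standard `csv` and `json` modules show why CSV escaping is fiddly while JSON is unambiguous:

```python
import csv
import io
import json

# A text field containing both the delimiter (,) and the quote character (").
row = {"label": "positive", "text": 'She said, "hello, world"'}

# CSV must quote the field and double every embedded quote.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["label", "text"])
writer.writeheader()
writer.writerow(row)
# buf now holds:
#   label,text
#   positive,"She said, ""hello, world"""

# JSON escaping is unambiguous and handled by any JSON library.
encoded = json.dumps(row)
assert json.loads(encoded) == row
```

A parser that splits naively on commas would break the CSV line into the wrong number of columns, which is exactly the failure mode described above.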

@w4nderlust (Collaborator)

Thank you for your suggestion.
TSV is a no-brainer and will come really soon!
For JSON, the structure may be arguable (column-wise? row-wise?). I will have to put some thought into the best solution, but I am definitely considering it.

@w4nderlust w4nderlust added the feature New feature or request label Feb 13, 2019
dshlai commented Feb 14, 2019

pandas has `to_json` and `read_json` methods on the DataFrame that support several user-specified orientations: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html

So on the command line we could have a similar “orient” option for the user to indicate which orientation the file uses.

It can also embed a schema in the output JSON file. Would that work?
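As a sketch of what that could look like (the pandas calls are as documented; the command-line flag itself is hypothetical):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"text": ["good movie", "bad movie"], "label": [1, 0]})

# "records" orientation: a list with one JSON object per row.
records_json = df.to_json(orient="records")

# "split" orientation: separate "columns", "index" and "data" entries.
split_json = df.to_json(orient="split")

# Reading back requires the matching orient, which is what a
# command-line --orient option would pass through to read_json.
df2 = pd.read_json(StringIO(records_json), orient="records")
df3 = pd.read_json(StringIO(split_json), orient="split")
```

The `orient="table"` variant is the one that embeds a schema alongside the data.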

@w4nderlust (Collaborator)

Thanks for pointing it out @dshlai. Reading this, it looks like pandas has several more I/O options than I was aware of. So yes, adding support for most of the formats that pandas supports is definitely feasible without too many problems. It will be coming soon.

loretoparisi commented Feb 14, 2019

@w4nderlust so the good old read_json could be used for that!

@w4nderlust w4nderlust added this to Feature Requests in Ludwig Development Feb 18, 2020
@jeffin07

@w4nderlust Is this issue still open, or is anyone working on it?

ifokeev (Contributor) commented Apr 21, 2020

@jeffin07 all work now is around TF2, so if you're interested in this feature, you could provide the PR. Thanks.

@w4nderlust (Collaborator)

@jeffin07 it is still open and needed. We are also planning a reorganization of the preprocessing pipeline: https://github.com/uber/ludwig/tree/preprocessing_strategy is a branch with an example of how the preprocessing would look after refactoring, which will easily enable this feature, but I haven't spent time on it yet as I am full steam on TF2.
Would you be interested in helping out with the refactoring and adding this feature? Adding it without the refactoring would actually be more difficult and lead to ad hoc code, while the proposed refactoring creates a more generic solution for any kind of data format.
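One way such a generic solution could look (a hypothetical sketch; `DATA_READERS` and `load_dataset` are illustrative names, not the actual API in that branch):

```python
import pandas as pd

# Hypothetical registry mapping a data format name to a reader function.
DATA_READERS = {
    "csv": lambda path: pd.read_csv(path),
    "tsv": lambda path: pd.read_csv(path, sep="\t"),
    "json": lambda path: pd.read_json(path, orient="records"),
}


def load_dataset(path, data_format=None):
    """Dispatch on an explicit format, falling back to the file extension."""
    fmt = (data_format or path.rsplit(".", 1)[-1]).lower()
    try:
        reader = DATA_READERS[fmt]
    except KeyError:
        raise ValueError(f"Unsupported data format: {fmt}")
    return reader(path)
```

Under this design, supporting a new format means registering one more entry in the table rather than threading ad hoc branches through the preprocessing code.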

@jeffin07

@w4nderlust Yes, I would love to help with the refactoring, which will also help me understand the project better. I will check out the branch you specified. Can you give me some guidance to get started?

@w4nderlust (Collaborator)

Sure, definitely.
What I suggest is that you first take a dataset and train a model with Ludwig. Take one of the examples on the website, maybe text classification. Put breakpoints everywhere in the preprocessing.py script to see what actually happens during preprocessing: for instance, how metadata is obtained, how preprocessing parameters are used, and how the final data transformation is performed. You'll notice that each feature type has its own classes that implement those steps. To begin with, I would suggest looking at just a couple of them, for instance sequence (which is medium-complex) and category (which is medium-easy). Numerical and binary are the easiest, while images and audio are the most complex at the moment.
Another thing you will notice is the use of caching, with HDF5 files for processed data and JSON files for metadata.
After you have an understanding of how the whole process works, you'll realize what makes the current design kind of tricky to extend with additional data formats.
Finally, if you look at the branch that I pointed you to, you'll see a sketch of the design that I would like to follow to make preprocessing more flexible, with pluggable data formats and preprocessing strategies.
After you get to that point you'll surely have a lot of questions. Feel free to reach out to me privately and I can answer all of them (that's just to spare the GitHub issue a flood of posts).
After you have a clear picture we can define specific tasks together.
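To make the per-feature structure concrete, here is an illustrative sketch (not Ludwig's actual classes or method names) of the pattern described above, where each feature type implements its own metadata extraction and data transformation:

```python
# Illustrative feature classes; names and method signatures are hypothetical.
class CategoryFeature:
    """Medium-easy: builds a vocabulary, then maps strings to indices."""

    def get_metadata(self, column):
        # Metadata pass: collect a sorted vocabulary over the raw column.
        vocab = sorted(set(column))
        return {"str2idx": {s: i for i, s in enumerate(vocab)}}

    def transform(self, column, metadata):
        # Transformation pass: map each value using the collected metadata.
        return [metadata["str2idx"][s] for s in column]


class BinaryFeature:
    """Easiest: no metadata needed, values map directly to 0/1."""

    def get_metadata(self, column):
        return {}

    def transform(self, column, metadata):
        return [1 if v else 0 for v in column]


feat = CategoryFeature()
col = ["cat", "dog", "cat"]
meta = feat.get_metadata(col)
indices = feat.transform(col, meta)  # [0, 1, 0] with the alphabetical vocab
```

The two-pass split (metadata first, transformation second) is also what makes the HDF5/JSON caching mentioned above possible: the JSON file stores the metadata, the HDF5 file stores the transformed data.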

@w4nderlust (Collaborator)

This has been recently added. Closing.

Ludwig Development automation moved this from Feature Requests to Done Sep 29, 2020
6 participants