add from_pandas and from_dict #350

lhoestq · 2020-07-07T15:03:53Z

I added two new methods to the Dataset class:

from_pandas() to create a dataset from a pandas dataframe
from_dict() to create a dataset from a dictionary (keys = columns)

They use the pa.Table.from_pandas and pa.Table.from_pydict functions to do so.
It is also possible to specify the feature types via features=... if there are ambiguities (null/nan values); otherwise the Arrow schema is inferred from the data automatically by pyarrow.

One question that I have right now:

Should we also add a save() method that would write the dataset to disk? Right now, if we create a Dataset using those two new methods, the data is kept in RAM. Then, to reload it, we could call the from_file() method.

thomwolf

Really great!

add from_pandas and from_dict

ced2602

lhoestq requested a review from thomwolf July 7, 2020 15:03

thomwolf approved these changes Jul 8, 2020

lhoestq merged commit 249d933 into master Jul 8, 2020

lhoestq deleted the add-from-pandas-and-dict-methods branch July 8, 2020 14:14

lhoestq mentioned this pull request Jul 20, 2020

from_dict delete? #414

Closed
