
Use parquet for storing the dataset #2

Open
rth opened this issue Aug 14, 2020 · 0 comments

rth commented Aug 14, 2020

Thanks for making this historical data available!

It might be worth switching to Parquet (supported in pandas); the dataset would be much smaller and faster to load:

  • historique_stations.csv: 1.5 GB uncompressed, 424 MB zip-compressed
  • historique_stations.parquet (with snappy compression): 78 MB, and probably less after converting the dates and GPS coordinates to appropriate dtypes (see the sketch below)

Stored with pandas.DataFrame.to_parquet:

df.to_parquet("historique_stations.parquet", compression="snappy")

This would require adding pyarrow as a dependency.
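
For reference, a minimal sketch of the full conversion. The column names used for the timestamp and GPS coordinates here (date, latitude, longitude) are assumptions; adjust them to the actual CSV header.

import pandas as pd

# Load the CSV export (adjust the path and column names to the actual file).
df = pd.read_csv("historique_stations.csv")

# Assumed columns: parse the timestamp and downcast the coordinates,
# which should shrink the Parquet file further.
df["date"] = pd.to_datetime(df["date"])
df[["latitude", "longitude"]] = df[["latitude", "longitude"]].astype("float32")

# Write with snappy compression (requires pyarrow, e.g. pip install pyarrow).
df.to_parquet("historique_stations.parquet", compression="snappy")

Reading the dataset back is then a one-liner: df = pd.read_parquet("historique_stations.parquet").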

rth changed the title from "Use parquet for for storing the dataset" to "Use parquet for storing the dataset" on Aug 14, 2020