
Use parquet for storing the dataset #2

Open
rth opened this issue Aug 14, 2020 · 0 comments

rth commented Aug 14, 2020

Thanks for making this historical data available!

It might be worth switching to Parquet (supported in pandas); the dataset would be much smaller and faster to load:

  • historique_stations.csv: 1.5 GB uncompressed, 424 MB zip-compressed
  • historique_stations.parquet (with snappy compression): 78 MB, and probably less after converting the dates and GPS coordinates to appropriate dtypes (see the sketch below)

Stored with pandas.DataFrame.to_parquet:

df.to_parquet("historique_stations.parquet", compression="snappy")

This would require adding pyarrow as a dependency.
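
For reference, a minimal sketch of the full conversion. The column names used for the timestamp and GPS coordinates here (date, latitude, longitude) are assumptions; adjust them to the actual CSV header.

import pandas as pd

# Load the CSV export (adjust the path and column names to the actual file).
df = pd.read_csv("historique_stations.csv")

# Assumed columns: parse the timestamp and downcast the coordinates,
# which should shrink the Parquet file further.
df["date"] = pd.to_datetime(df["date"])
df[["latitude", "longitude"]] = df[["latitude", "longitude"]].astype("float32")

# Write with snappy compression (requires pyarrow, e.g. pip install pyarrow).
df.to_parquet("historique_stations.parquet", compression="snappy")

Reading the dataset back is then a one-liner: df = pd.read_parquet("historique_stations.parquet").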

rth changed the title from "Use parquet for for storing the dataset" to "Use parquet for storing the dataset" on Aug 14, 2020