compress files for export with gzip #451 #453
Conversation
    Use the copy statement to write data from postgres to a csv file.
    """

    temp_file = "temp.csv"
@laurentS I'm pretty sure that there is a better solution here. How can we avoid using a temp file?
How about using the magic of pandas, something like:

    pd.read_sql_query(sql_query).to_csv(filename, compression="gzip")

I haven't tested, but you get the idea. The first call returns a DataFrame, the second one saves it. And there should be some parameter to do batches of rows, so that pandas can manage the memory a bit, if the result of the query is too big.
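For reference, a rough sketch of what that batched pandas approach could look like (untested; the connection string, query, and filename below are placeholders, not the project's actual ones):

```python
import gzip

import pandas as pd
import sqlalchemy

# placeholder connection and query
engine = sqlalchemy.create_engine("postgresql://user:password@localhost:5432/mapswipe")
sql_query = "SELECT * FROM results WHERE project_id = 42"

# chunksize turns read_sql_query into an iterator of DataFrames,
# so a large result set never has to fit into memory at once
with gzip.open("results_42.csv.gz", "wt", newline="") as f_out:
    for i, chunk in enumerate(pd.read_sql_query(sql_query, engine, chunksize=100_000)):
        # write the header only once, for the first chunk
        chunk.to_csv(f_out, header=(i == 0), index=False)
```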
also, for more general stuff, python has tempfile.TemporaryFile which is probably a bit more robust than creating your own files.
I quickly tested using pandas, but couldn't get to a solution easily. That's why I kept the existing logic, but now use a proper temporary file instead.
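A rough sketch of what that kept approach could look like with tempfile, assuming a psycopg2 connection and a COPY-based export like the one in the hunk above (the function and variable names are made up):

```python
import gzip
import shutil
import tempfile

import psycopg2


def copy_query_to_gzipped_csv(dsn, sql_query, out_filename):
    """Export a query to csv via COPY, then write a gzipped copy to out_filename."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # NamedTemporaryFile takes care of unique naming and cleanup
        with tempfile.NamedTemporaryFile(suffix=".csv") as tmp:
            cur.copy_expert(f"COPY ({sql_query}) TO STDOUT WITH CSV HEADER", tmp)
            tmp.flush()
            tmp.seek(0)
            with gzip.open(out_filename, "wb") as f_out:
                shutil.copyfileobj(tmp, f_out)
```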
laurentS left a comment
I ended up writing a lot of comments, sorry. I think it works, I've mostly suggested ways to clean up and keep the code shorter, but you can easily ignore :)
    groups_filename = f"{DATA_PATH}/api/groups/groups_{project_id}.csv"
    agg_results_filename = f"{DATA_PATH}/api/agg_results/agg_results_{project_id}.csv"
    agg_results_by_user_id_filename = f"{DATA_PATH}/api/users/users_{project_id}.csv"
    results_filename = f"{DATA_PATH}/api/results/results_{project_id}.csv.gz"
Are the users of these files more likely to be comfortable with zip instead of gzip? I know gzip is usually a bit better at compressing, but I'm not sure how well supported it is on windows for instance.
That's a valid point. For now I assume that gzip will work for people, so I will leave it as it is. If someone has problems using the files, I would tackle it then.
    agg_results_df.to_csv(
        agg_results_filename,
        index_label="idx",
        compression="gzip"
    )
In the docs, it looks like the default value for compression is `infer`. Does that mean it guesses from the filename extension? Just wondering if that could help make the code slightly more flexible...
I adjusted the code. You are right that it is inferred from the filename.
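For illustration, with the default `compression="infer"` the codec follows the file extension, so the same call covers both cases (the DataFrame and filenames here are dummy values):

```python
import pandas as pd

agg_results_df = pd.DataFrame({"project_id": [42], "result_count": [1234]})  # dummy data

# compression is inferred from the suffix: ".gz" gives gzip, anything else plain csv
agg_results_df.to_csv("agg_results_42.csv.gz", index_label="idx")  # gzip-compressed
agg_results_df.to_csv("agg_results_42.csv", index_label="idx")     # uncompressed
```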
    with gzip.open(filename, 'rb') as f_in:
        with open(csv_file, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
Why not use shutil.copy(filename, csv_file) here?
I think the content of the zipped file first needs to be unzipped and then put into the csv file. I'm not sure the "normal" shutil.copy command will do this.
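A minimal illustration of the difference (the filenames are made up): shutil.copy copies the bytes verbatim, so the target would still be gzip-compressed, while gzip.open plus copyfileobj decompresses on the way:

```python
import gzip
import shutil

# copies the compressed bytes as-is; "temp.csv" would not be readable as plain csv
shutil.copy("results_42.csv.gz", "temp.csv")

# decompresses while copying, so "temp.csv" ends up as plain csv
with gzip.open("results_42.csv.gz", "rb") as f_in, open("temp.csv", "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)
```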
    Check if file is compressed.
    """
    csv_file = "temp.csv"
See my other comment about temp files. But I also don't really understand why you need a tempfile at all. Does ogr2ogr modify the csv_file?
ogr2ogr reads the csv file and converts it into a geojson file. For this purpose we need an uncompressed csv file. This is why I create the temporary csv file here and uncompress the content later. I added a more detailed comment to the function.
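Roughly, the flow described here looks like this; the paths and the geometry column name are illustrative, not the ones in the actual code:

```python
import gzip
import shutil
import subprocess

tmp_csv_file = "temp.csv"
tmp_geojson_file = "temp.geojson"

# decompress the export so ogr2ogr sees a plain csv file
with gzip.open("results_42.csv.gz", "rb") as f_in, open(tmp_csv_file, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

# convert csv to geojson; GEOM_POSSIBLE_NAMES tells the GDAL csv driver
# which column holds the (WKT) geometry
subprocess.run(
    [
        "ogr2ogr",
        "-f", "GeoJSON",
        tmp_geojson_file,
        tmp_csv_file,
        "-oo", "GEOM_POSSIBLE_NAMES=geom",
    ],
    check=True,
)
```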
    csv_file = "temp.csv"
    geojson_file = "temp.geojson"
    outfile = filename.replace(".csv", f"_{geometry_field}.geojson")
    filename_without_path = csv_file.split("/")[-1].replace(".csv", "")
and while I'm on all these cool built-in functions, python has pathlib which has tons of methods to play with filenames, paths and stuff. What you're doing here is probably fine as you don't really need portability, but always good to know about them :)
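For example, with the temp file name from the hunk above:

```python
from pathlib import Path

csv_path = Path("temp.csv")
csv_path.name                     # "temp.csv"  (filename without directories)
csv_path.stem                     # "temp"      (filename without extension)
csv_path.with_suffix(".geojson")  # Path("temp.geojson")
```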
This is not used anymore; since we know the name of the temporary csv file, that is used instead.
    json.dump(json_data, fout)

    # remove temp files
    os.remove(csv_file)
If csv_file and geojson_file are both temp files, I would make that clear in their variable names, like tmp_csv_file, etc. This got me confused until I reached the end of the function.
I changed the names for both variables.
Hey @laurentS,

Great! Looks good to me :)
This PR uses compression for the files that are exported.
Not all files are compressed, only those that are relatively big and not used much. In particular, we do not compress the tasking manager geometries files, which are used by project managers most frequently.
However, the biggest files are `results`, `tasks` and `agg_results`, and these (along with `groups` and `users`) will be compressed from now on.

Once this PR is merged we need to run the `generate-stats` command for all projects. Then we should remove all "old" files that didn't use compression.