compress files for export with gzip #451 #453
Conversation
    Use the copy statement to write data from postgres to a csv file.
    """

    temp_file = "temp.csv"
@laurentS I'm pretty sure that there is a better solution here. How can we avoid using a temp file?
How about using the magic of pandas, something like:

    pd.read_sql_query(sql_query).to_csv(filename, compression="gzip")

I haven't tested, but you get the idea. The first call returns a DataFrame, the second one saves it. And there should be some parameter to do batches of rows, so that pandas can manage the memory a bit, if the result of the query is too big.
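For reference, a rough sketch of what that batched pandas approach could look like (untested; the connection string, query, and filename below are placeholders, not the project's actual ones):

```python
import gzip

import pandas as pd
import sqlalchemy

# placeholder connection and query
engine = sqlalchemy.create_engine("postgresql://user:password@localhost:5432/mapswipe")
sql_query = "SELECT * FROM results WHERE project_id = 42"

# chunksize turns read_sql_query into an iterator of DataFrames,
# so a large result set never has to fit into memory at once
with gzip.open("results_42.csv.gz", "wt", newline="") as f_out:
    for i, chunk in enumerate(pd.read_sql_query(sql_query, engine, chunksize=100_000)):
        # write the header only once, for the first chunk
        chunk.to_csv(f_out, header=(i == 0), index=False)
```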
also, for more general stuff, python has tempfile.TemporaryFile which is probably a bit more robust than creating your own files.
I quickly tested using pandas, but couldn't get to a solution easily. That's why I kept the existing logic, but now use a proper temporary file instead.
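A rough sketch of what that kept approach could look like with tempfile, assuming a psycopg2 connection and a COPY-based export like the one in the hunk above (the function and variable names are made up):

```python
import gzip
import shutil
import tempfile

import psycopg2


def copy_query_to_gzipped_csv(dsn, sql_query, out_filename):
    """Export a query to csv via COPY, then write a gzipped copy to out_filename."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # NamedTemporaryFile takes care of unique naming and cleanup
        with tempfile.NamedTemporaryFile(suffix=".csv") as tmp:
            cur.copy_expert(f"COPY ({sql_query}) TO STDOUT WITH CSV HEADER", tmp)
            tmp.flush()
            tmp.seek(0)
            with gzip.open(out_filename, "wb") as f_out:
                shutil.copyfileobj(tmp, f_out)
```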
laurentS left a comment
I ended up writing a lot of comments, sorry. I think it works, I've mostly suggested ways to clean up and keep the code shorter, but you can easily ignore :)
    groups_filename = f"{DATA_PATH}/api/groups/groups_{project_id}.csv"
    agg_results_filename = f"{DATA_PATH}/api/agg_results/agg_results_{project_id}.csv"
    agg_results_by_user_id_filename = f"{DATA_PATH}/api/users/users_{project_id}.csv"
    results_filename = f"{DATA_PATH}/api/results/results_{project_id}.csv.gz"
Are the users of these files more likely to be comfortable with zip instead of gzip? I know gzip is usually a bit better at compressing, but I'm not sure how well supported it is on windows for instance.
That's a valid point. For now I assume that gzip will work for people, so I will leave it as it is. If someone has problems using the files, I would tackle it then.
    agg_results_df.to_csv(
        agg_results_filename,
        index_label="idx",
        compression="gzip"
    )
In the docs, it looks like the default value for compression is `infer`. Does that mean it guesses from the filename extension? Just wondering if that could help make the code slightly more flexible...
I adjusted the code. You are right that it is inferred from the filename.
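For illustration, with the default `compression="infer"` the codec follows the file extension, so the same call covers both cases (the DataFrame and filenames here are dummy values):

```python
import pandas as pd

agg_results_df = pd.DataFrame({"project_id": [42], "result_count": [1234]})  # dummy data

# compression is inferred from the suffix: ".gz" gives gzip, anything else plain csv
agg_results_df.to_csv("agg_results_42.csv.gz", index_label="idx")  # gzip-compressed
agg_results_df.to_csv("agg_results_42.csv", index_label="idx")     # uncompressed
```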
    with gzip.open(filename, 'rb') as f_in:
        with open(csv_file, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
Why not use shutil.copy(filename, csv_file) here?
I think the content of the zipped file first needs to be unzipped and then put into the csv file. I'm not sure the "normal" shutil.copy command will do this.
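A minimal illustration of the difference (the filenames are made up): shutil.copy copies the bytes verbatim, so the target would still be gzip-compressed, while gzip.open plus copyfileobj decompresses on the way:

```python
import gzip
import shutil

# copies the compressed bytes as-is; "temp.csv" would not be readable as plain csv
shutil.copy("results_42.csv.gz", "temp.csv")

# decompresses while copying, so "temp.csv" ends up as plain csv
with gzip.open("results_42.csv.gz", "rb") as f_in, open("temp.csv", "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)
```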
    Check if file is compressed.
    """
    csv_file = "temp.csv"
See my other comment about temp files. But I also don't really understand why you need a tempfile at all. Does ogr2ogr modify the csv_file?
ogr2ogr reads the csv file and converts it into a geojson file. For this purpose we need an uncompressed csv file. This is why I create the temporary csv file here and uncompress the content later. I added a more detailed comment to the function.
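Roughly, the flow described here looks like this; the paths and the geometry column name are illustrative, not the ones in the actual code:

```python
import gzip
import shutil
import subprocess

tmp_csv_file = "temp.csv"
tmp_geojson_file = "temp.geojson"

# decompress the export so ogr2ogr sees a plain csv file
with gzip.open("results_42.csv.gz", "rb") as f_in, open(tmp_csv_file, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

# convert csv to geojson; GEOM_POSSIBLE_NAMES tells the GDAL csv driver
# which column holds the (WKT) geometry
subprocess.run(
    [
        "ogr2ogr",
        "-f", "GeoJSON",
        tmp_geojson_file,
        tmp_csv_file,
        "-oo", "GEOM_POSSIBLE_NAMES=geom",
    ],
    check=True,
)
```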
    csv_file = "temp.csv"
    geojson_file = "temp.geojson"
    outfile = filename.replace(".csv", f"_{geometry_field}.geojson")
    filename_without_path = csv_file.split("/")[-1].replace(".csv", "")
and while I'm on all these cool built-in functions, python has pathlib which has tons of methods to play with filenames, paths and stuff. What you're doing here is probably fine as you don't really need portability, but always good to know about them :)
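For example, with the temp file name from the hunk above:

```python
from pathlib import Path

csv_path = Path("temp.csv")
csv_path.name                     # "temp.csv"  (filename without directories)
csv_path.stem                     # "temp"      (filename without extension)
csv_path.with_suffix(".geojson")  # Path("temp.geojson")
```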
This is not used anymore; since we know the name of the temporary csv file, that is used instead.
    json.dump(json_data, fout)

    # remove temp files
    os.remove(csv_file)
If csv_file and geojson_file are both temp files, I would make that clear in their variable names, like tmp_csv_file, etc. This got me confused until I reached the end of the function.
I changed the names for both variables.
Hey @laurentS,

Great! Looks good to me :)
This PR uses compression for the files that are exported.
Not all files are compressed, only those that are relatively big and not used much. In particular, we do not compress the tasking manager geometries files, which are used by project managers most frequently.
However, the biggest files are `results`, `tasks` and `agg_results`, and these (along with `groups` and `users`) will be compressed from now on.

Once this PR is merged we need to run the `generate-stats` command for all projects. Then we should remove all "old" files that didn't use compression.