Conversation

@Hagellach37
Member

This PR uses compression for the files that are exported.

Not all files are compressed, only those that are relatively big and not used much. In particular, we do not compress the tasking manager geometries files, which are the ones used most frequently by project managers.

However, the biggest files are results, tasks, and agg_results, and these (along with groups and users) will be compressed from now on.

Once this PR is merged, we need to run the generate-stats command for all projects. Then we should remove all "old" files that did not use compression.

docker-compose run -d mapswipe_workers mapswipe_workers --verbose generate-stats-all-projects
rm -rf mapswipe-data/api/agg_results/*.csv
rm -rf mapswipe-data/api/agg_results/*.geojson
rm -rf mapswipe-data/api/tasks/*.csv
rm -rf mapswipe-data/api/results/*.csv
rm -rf mapswipe-data/api/groups/*.csv
rm -rf mapswipe-data/api/users/*.csv

Use the copy statement to write data from postgres to a csv file.
"""

temp_file = "temp.csv"
Member Author

@laurentS I'm pretty sure that there is a better solution. How can we avoid using a temp file here?

Member

How about using the magic of pandas, something like:

pd.read_sql_query(sql_query).to_csv(filename, compression="gzip")

I haven't tested, but you get the idea. The first call returns a DataFrame, the second one saves it. And there should be a parameter to process the rows in batches, so that pandas can manage memory a bit if the result of the query is too big.
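
A minimal sketch of that batched idea (untested; the connection string, query and output filename below are just placeholders):

import gzip

import pandas as pd
import sqlalchemy

# Placeholder connection string, query and output path, only to illustrate the idea.
engine = sqlalchemy.create_engine("postgresql://user:password@localhost:5432/mapswipe")
sql_query = "SELECT * FROM results"
filename = "results_example.csv.gz"

# Open the gzip file once and stream the query result into it chunk by chunk,
# so the whole result set never has to sit in memory at the same time.
with engine.connect() as conn, gzip.open(filename, "wt", newline="") as f_out:
    for i, chunk in enumerate(pd.read_sql_query(sql_query, conn, chunksize=10000)):
        # Write the csv header only once, for the first chunk.
        chunk.to_csv(f_out, header=(i == 0), index=False)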

Member

Also, for more general stuff, Python has tempfile.TemporaryFile, which is probably a bit more robust than creating your own files.
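
A tiny sketch of the idea, using the named variant since a real path on disk is needed when handing the file to an external tool such as ogr2ogr (this is not the code in this PR, and the csv content is made up):

import tempfile

# The temporary file is created in the system temp directory and removed
# automatically when the with-block exits.
with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=True) as tmp_csv:
    tmp_csv.write("task_id,result\n18-122-100,1\n")  # placeholder content
    tmp_csv.flush()
    print(tmp_csv.name)  # this path can be passed to a subprocess, e.g. ogr2ogr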

Member Author

I quickly tested using pandas, but couldn't get to a solution easily. That's why I keep the existing logic, but now use a proper temporary file instead.

Hagellach37 requested a review from laurentS on March 3, 2021, 13:22
Member

@laurentS left a comment

I ended up writing a lot of comments, sorry. I think it works; I've mostly suggested ways to clean up and keep the code shorter, but you can easily ignore them :)

groups_filename = f"{DATA_PATH}/api/groups/groups_{project_id}.csv"
agg_results_filename = f"{DATA_PATH}/api/agg_results/agg_results_{project_id}.csv"
agg_results_by_user_id_filename = f"{DATA_PATH}/api/users/users_{project_id}.csv"
results_filename = f"{DATA_PATH}/api/results/results_{project_id}.csv.gz"
Member

Are the users of these files more likely to be comfortable with zip instead of gzip? I know gzip is usually a bit better at compressing, but I'm not sure how well it is supported on Windows, for instance.

Member Author

That's a valid point. For now I have assumed that this will work for people, so I think I will leave it as it is and use gzip. If someone has problems using the files, I will tackle it then.
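
For reference, reading the gzipped exports from Python stays a one-liner on any platform; a small sketch (the filename is only an example):

import pandas as pd

# pandas infers gzip compression from the ".gz" suffix, so consumers can read
# the exported file directly without unpacking it first.
df = pd.read_csv("results_example.csv.gz")
print(df.head())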

agg_results_df.to_csv(
agg_results_filename,
index_label="idx",
compression="gzip"
Member

In the docs, it looks like the default value for compression is "infer". Does that mean it guesses from the filename extension? Just wondering if that can help make the code slightly more flexible...

Member Author

I adjusted the code. You are right that it is inferred from the filename.
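
A small sketch of what the adjusted call looks like with the default compression="infer" (the DataFrame, column names and filename here are only examples):

import pandas as pd

# With the default compression="infer", to_csv picks gzip purely from the ".gz"
# suffix of the target filename, so no explicit compression argument is needed.
agg_results_df = pd.DataFrame({"task_id": ["18-122-100"], "yes_share": [0.8]})
agg_results_df.to_csv("agg_results_example.csv.gz", index_label="idx")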


with gzip.open(filename, 'rb') as f_in:
with open(csv_file, 'wb') as f_out:
shutil.copyfileobj(f_in, f_out)
Member

Why not use shutil.copy(filename, csv_file) here?

Member Author

I think the content of the gzipped file first needs to be decompressed and then written to the csv file. I'm not sure the "normal" shutil.copy command would do this.

Check if file is compressed.
"""
csv_file = "temp.csv"
Member

See my other comment about temp files. But I also don't really understand why you need a tempfile at all. Does ogr2ogr modify the csv_file?

Member Author

ogr2ogr reads the csv file and converts it into a geojson file. For this purpose we need an uncompressed csv file. This is why I create the temporary csv file here and decompress the content into it later. I added a more detailed comment to the function.
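
Roughly, the flow is the following; this is just a sketch with placeholder filenames, and the ogr2ogr options may differ from the ones in the actual code:

import gzip
import shutil
import subprocess

# Placeholder names; the real ones are built from DATA_PATH and project_id.
tmp_csv_file = "temp.csv"
tmp_geojson_file = "temp.geojson"

# The stored export is gzipped, so decompress it into a plain temporary csv first.
with gzip.open("agg_results_example.csv.gz", "rb") as f_in:
    with open(tmp_csv_file, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

# Convert the plain csv into geojson; GEOM_POSSIBLE_NAMES tells the GDAL csv
# driver which column holds the geometry (option values here are illustrative).
subprocess.run(
    [
        "ogr2ogr", "-f", "GeoJSON", tmp_geojson_file, tmp_csv_file,
        "-oo", "GEOM_POSSIBLE_NAMES=geom",
    ],
    check=True,
)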

csv_file = "temp.csv"
geojson_file = "temp.geojson"
outfile = filename.replace(".csv", f"_{geometry_field}.geojson")
filename_without_path = csv_file.split("/")[-1].replace(".csv", "")
Member

And while I'm on all these cool built-in modules, Python has pathlib, which has tons of methods for working with filenames and paths. What you're doing here is probably fine as you don't really need portability, but it's always good to know about them :)
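
For illustration, the pathlib equivalent of that split/replace would be something like this (the path is a placeholder):

from pathlib import Path

# .stem drops both the directory part and the final suffix in one go,
# i.e. the pathlib version of csv_file.split("/")[-1].replace(".csv", "").
filename_without_path = Path("mapswipe-data/api/results/results_123.csv").stem
print(filename_without_path)  # -> results_123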

Member Author

This is not used anymore. Since we know the name of the temporary csv file, that is used instead.

json.dump(json_data, fout)

# remove temp files
os.remove(csv_file)
Member

If csv_file and geojson_file are both temp files, I would make that clear in their variable names, like tmp_csv_file, etc. This got me confused until I reached the end of the function here.

Member Author

I changed the names for both variables.

@Hagellach37
Member Author

Hey @laurentS,
thanks for the comments. I adjusted some parts of the code and kept others. I tried to provide more details on my thoughts in the comments. Maybe you can go over them again whenever you have time and then I will merge things.

@laurentS
Member

laurentS commented Mar 4, 2021

Great! Looks good to me :)

Hagellach37 merged commit 11a3e4b into dev on Mar 4, 2021
Hagellach37 deleted the zip-files branch on March 4, 2021, 21:39
Hagellach37 mentioned this pull request on Mar 4, 2021