Code used to facilitate an investigation into how file size affects querying speed on AWS Athena.
Run inside a `tmux` session with the following command, substituting the number of rows each time:

`rm -rf /data/rp1615 && mkdir /data/rp1615 && cd ~/Documents && ./main 100000000 && rm -rf /data/rp1615 && exit`

All tables had 100000000 rows. The query tested was `SELECT count(ax) FROM row100 WHERE ax > 0`.
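This README does not say how each query was timed. One way to collect such timings programmatically is sketched below, assuming a boto3 Athena client is passed in. The function name `time_athena_query`, the database name, and the S3 output location are all hypothetical placeholders, not taken from this project.

```python
import time


def time_athena_query(client, query, database, output_location, poll_seconds=0.5):
    """Run `query` on Athena and return (state, engine_seconds, wall_seconds).

    `client` is expected to behave like boto3.client("athena").
    `database` and `output_location` are placeholders supplied by the caller.
    """
    started = time.monotonic()
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = response["QueryExecutionId"]

    # Poll until Athena reports a terminal state.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    wall_seconds = time.monotonic() - started
    # Engine time as reported by Athena itself, if present in the response.
    engine_ms = execution.get("Statistics", {}).get("EngineExecutionTimeInMillis")
    engine_seconds = None if engine_ms is None else engine_ms / 1000.0
    return state, engine_seconds, wall_seconds


# Example call (assumed names, requires AWS credentials and boto3):
#   import boto3
#   client = boto3.client("athena")
#   time_athena_query(client, "SELECT count(ax) FROM row100 WHERE ax > 0",
#                     "my_database", "s3://my-bucket/athena-results/")
```

The engine time reported by Athena excludes client-side polling delay, so it is likely the more comparable figure across runs; the wall-clock time is returned as well for reference.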
Results:
Rows per file | Time taken /s |
---|---|
100 | 201.15 |
1000 | 19.53 |
10000 | 4.07 |
100000 | 2.79 |
1000000 | 2.35 |
10000000 | 2.27 |
100000000 | 2.83 |
As the results show, query time is significantly longer when small files are used (201.15 s at 100 rows per file versus 2.27 s at 10,000,000 rows per file), due to the additional overhead of creating new connections to S3 and reading additional metadata. The single-file case (the largest entry, 100,000,000 rows per file) is also slower than the multi-file cases with 100,000 or more rows per file, as the query cannot be parallelised.
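To make the small-file overhead concrete, the sketch below derives the number of files behind each measurement and the slowdown relative to the fastest run. It uses only the figures from the results table; the file count assumes every file holds exactly the stated number of rows, and the function name `summarise` is hypothetical.

```python
TOTAL_ROWS = 100_000_000

# (rows per file, query time in seconds), copied from the results table.
RESULTS = [
    (100, 201.15),
    (1_000, 19.53),
    (10_000, 4.07),
    (100_000, 2.79),
    (1_000_000, 2.35),
    (10_000_000, 2.27),
    (100_000_000, 2.83),
]


def summarise(results, total_rows=TOTAL_ROWS):
    """Return file count and slowdown versus the fastest run for each entry."""
    best = min(seconds for _, seconds in results)
    summary = []
    for rows_per_file, seconds in results:
        summary.append({
            "rows_per_file": rows_per_file,
            # Assumption: every file is full, so files = total rows / rows per file.
            "files": total_rows // rows_per_file,
            "seconds": seconds,
            "slowdown": seconds / best,
        })
    return summary


if __name__ == "__main__":
    for row in summarise(RESULTS):
        print(f"{row['rows_per_file']:>11,} rows/file  {row['files']:>9,} files  "
              f"{row['seconds']:>7.2f} s  {row['slowdown']:>6.1f}x")
```

Under that assumption, the 100-rows-per-file table consists of 1,000,000 files, while the final entry is a single file.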
Additional information can be found on the AWS Big Data Blog and Upsolver.