
kine-dmd/athena-query-speed-experiment


Athena Query Speed Test

Code used to facilitate an investigation into how file size affects querying speed on AWS Athena.

Running instructions

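Run the following command inside a tmux session, substituting the number of rows on each run. The command clears and recreates the working directory /data/rp1615, runs ./main from ~/Documents with the row count as its argument, removes the working directory again, and then exits the session.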

rm -rf /data/rp1615 && mkdir /data/rp1615 && cd ~/Documents && ./main 100000000 && rm -rf /data/rp1615 && exit

Method

Every table contained 100,000,000 rows; only the number of rows per file varied between tables. The query tested was SELECT count(ax) FROM row100 WHERE ax > 0.
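The README does not say how the timings were collected. One way to reproduce such a measurement is sketched below, under the assumption that the query is submitted through the Athena API and that the engine execution time reported by Athena is the quantity of interest. The function name `time_athena_query`, the `database` name and the `output_location` are hypothetical. The `client` argument is expected to behave like `boto3.client("athena")`.

```python
import time


def time_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Run `sql` on Athena and return (final_state, engine_seconds).

    `client` is assumed to expose the boto3 Athena methods
    start_query_execution and get_query_execution.
    `database` and `output_location` are placeholders supplied by the caller.
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until Athena reports a terminal state for the query.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    # Statistics may be missing (for example on a failed query), so fall back to None.
    millis = execution.get("Statistics", {}).get("EngineExecutionTimeInMillis")
    seconds = millis / 1000.0 if millis is not None else None
    return state, seconds
```

With boto3 installed and credentials configured, a call might look like `time_athena_query(boto3.client("athena"), "SELECT count(ax) FROM row100 WHERE ax > 0", "my_database", "s3://my-bucket/athena-results/")`, where the database and bucket names are placeholders. Whether the figures in the results table are engine times or wall-clock times is not stated in the source.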

Results

| Rows per file | Time taken (s) |
| --- | --- |
| 100 | 201.15 |
| 1,000 | 19.53 |
| 10,000 | 4.07 |
| 100,000 | 2.79 |
| 1,000,000 | 2.35 |
| 10,000,000 | 2.27 |
| 100,000,000 | 2.83 |
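To put these timings in context, the short sketch below derives the number of files behind each table and the slowdown relative to the fastest configuration. It assumes every file holds exactly the stated number of rows, so the file count is simply the total row count divided by the rows per file. The helper name `summarise` is illustrative and not part of the repository.

```python
# Measurements copied from the results table: rows per file -> query time in seconds.
TOTAL_ROWS = 100_000_000  # every table held this many rows (see Method)

RESULTS = {
    100: 201.15,
    1_000: 19.53,
    10_000: 4.07,
    100_000: 2.79,
    1_000_000: 2.35,
    10_000_000: 2.27,
    100_000_000: 2.83,
}


def summarise(results, total_rows):
    """Return (rows_per_file, file_count, seconds, slowdown_vs_fastest) tuples.

    file_count assumes every file is completely full (an assumption, not
    something the README states).
    """
    fastest = min(results.values())
    return [
        (rows, total_rows // rows, secs, secs / fastest)
        for rows, secs in sorted(results.items())
    ]


if __name__ == "__main__":
    for rows, files, secs, slowdown in summarise(RESULTS, TOTAL_ROWS):
        print(f"{rows:>11,} rows/file  {files:>9,} files  {secs:>7.2f} s  {slowdown:6.1f}x")
```

Under that assumption, the 100-rows-per-file table consists of 1,000,000 files and is roughly 89 times slower than the fastest configuration (10,000,000 rows per file, 10 files).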

Results analysis

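[Figure: queryTimes]

As shown in the graph, the query time is far longer when small files are used: at 100 rows per file the query took 201.15 s, compared with 2.27 s at 10,000,000 rows per file. This is due to the additional overhead of creating new connections to S3 and reading additional metadata for every file. The single-file case (100,000,000 rows per file, 2.83 s) is also slower than the fastest multi-file configurations, as a query over a single file cannot be parallelised.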

Further reading

Additional information can be found on the AWS Big Data Blog and Upsolver.
