on a 10 node i3en.xlarge cluster, after importing the dataset to local disk:

| s4 query | seconds |
|---|---|
| count rides by passengers | 4.152 |
| count rides by date | 5.958 |
| sum distance by date | 11.391 |
| top n by distance | 4.159 |
| sort by distance | 162.052 |

| presto query | seconds |
|---|---|
| count rides by passengers | 7.834 |
| count rides by date | 14.138 |
| sum distance by date | 15.497 |
| top n by distance | 8.965 |
| sort by distance | 700.628 |

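
a rough sketch of how the two tables compare, dividing each presto time by the matching s4 time. the `speedup` helper is hypothetical and not part of either tool; it only assumes `awk` is available, and the numbers are copied from the tables above.

```shell
#!/usr/bin/env bash
# speedup SLOW FAST: print SLOW / FAST rounded to two decimals.
# hypothetical helper for illustration, not part of s4 or cli-aws.
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

# arguments are presto seconds, then s4 seconds, from the tables above
speedup 7.834 4.152      # count rides by passengers -> 1.89
speedup 14.138 5.958     # count rides by date       -> 2.37
speedup 15.497 11.391    # sum distance by date      -> 1.36
speedup 8.965 4.159      # top n by distance         -> 2.16
speedup 700.628 162.052  # sort by distance          -> 4.32
```

these are single runs, so the ratios are indicative rather than precise.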
install cli-aws
make sure the region is us-east-1, since that is where the taxi data lives
>> aws-zones | grep us-east-1
launch a 10 node emr cluster on spot instances, which costs about $1.50/hour
>> cluster_id=$(aws-emr-new --count 10 test-cluster)
wait for the cluster to become ready
>> time aws-emr-wait-for-state $cluster_id
396.704 seconds
pull the csv data from s3 to hdfs
>> time aws-emr-ssh $cluster_id --cmd 's3-dist-cp --src="s3://nyc-tlc/trip data/" --srcPattern=".*yellow.*" --dest=/taxi_csv/'
292.210 seconds
create the tables
>> time aws-emr-hive -i $cluster_id schema.hql
convert csv to orc
>> time aws-emr-presto -i $cluster_id csv_to_orc.pql
309.197 seconds
run queries
>> aws-emr-presto -i $cluster_id count_rides_by_passengers.pql
7.834 seconds
>> aws-emr-presto -i $cluster_id count_rides_by_date.pql
14.138 seconds
>> aws-emr-presto -i $cluster_id sum_distance_by_date.pql
15.497 seconds
>> aws-emr-presto -i $cluster_id top_n_by_distance.pql
8.965 seconds
>> aws-emr-hive -i $cluster_id sort_by_distance.hql
>> aws-emr-presto -i $cluster_id sort_by_distance.pql
700.628 seconds
delete the cluster
>> aws-emr-rm $cluster_id
launch a 10 node s4 cluster on spot instances, which costs about $1.50/hour
>> time num=10 bash scripts/new_cluster.sh s4-cluster
223.798 seconds
tunnel cluster internal traffic through a cluster node via ssh
>> bash scripts/connect_to_cluster.sh s4-cluster
pull csv data from s3 and convert to bsv
>> time bash schema.sh
124.032 seconds
run queries
>> bash count_rides_by_passengers.sh
3.074 seconds
>> bash count_rides_by_date.sh
4.743 seconds
>> bash sum_distance_by_date.sh
10.419 seconds
>> bash top_n_by_distance.sh
1.804 seconds
>> bash sort_by_distance.sh
149.200 seconds
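
a small sketch that adds up the recorded setup times before the first query can run. the `sum_seconds` helper is hypothetical and only assumes `awk`. steps with no recorded time (creating the hive tables, opening the ssh tunnel) are left out, so both totals are lower bounds.

```shell
#!/usr/bin/env bash
# sum_seconds N...: print the sum of the given durations with three decimals.
# hypothetical helper for illustration, not part of s4 or cli-aws.
sum_seconds() {
    awk 'BEGIN { t = 0; for (i = 1; i < ARGC; i++) t += ARGV[i]; printf "%.3f\n", t }' "$@"
}

# emr: wait for cluster + s3-dist-cp + csv to orc
sum_seconds 396.704 292.210 309.197   # -> 998.111

# s4: new cluster + pull csv and convert to bsv
sum_seconds 223.798 124.032           # -> 347.830
```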