Running queries on Open City Model using AWS Athena
AWS Athena provides a great option for running queries against the Open City Model if your use case has an analytical need. In this example we'll explore how you can setup Athena to query the Open City Model data directly where it lives, without the need to download it.
The only thing you need to run this example is an AWS account
I. Setup AWS Athena with Open City Model data for querying
The first thing we need to do is log in to AWS and navigate to the AWS Athena Console. From here we will define a basic schema representing the Open City Model data that we want to be able to query from Athena.
NOTE: all of these steps simply involve running the provided SQL in the Athena
Query Editor tab.
- Create a new
databasein Athena to represent Open City Model data.
CREATE DATABASE opencitymodel;
- Define a
tablerepresenting the Open City Model buildings data that we want to access
CREATE EXTERNAL TABLE `opencitymodel.jun2019`( `hash` string, `ubid` string, `grid` string, `lat` double, `lon` double, `area` double, `height` double, `height_source` string, `height_predict` double, `fp` string, `fp_source` string) PARTITIONED BY ( `state` string, `county` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3://opencitymodel/2019-jun/parquet/' TBLPROPERTIES ( 'parquet.compress'='SNAPPY', 'has_encrypted_data'='false')
- Add our table partitions to Athena
Our data is partitioned by
county to make it optimal for querying by region if you are not interested in looking at the entire USA. This step simply informs Athena about what specific partitions exist so it can optimize data access for our queries when we run them. You'll have to be a bit patient with this step because it can take several minutes (10m to be more precise).
MSCK REPAIR TABLE opencitymodel.jun2019
II. Run queries against Open City Model
Now we get to the fun part. With our schema established we can now simply run any queries that we want and Athena handles the heavy lifting.
- Simple query to see the data
This is just a simple one to get us going and validate everything is working. We'll simply list a random 10 buildings to get a sense for what is included in the data.
SELECT * FROM opencitymodel.jun2019 LIMIT 10;
- Export a list all buildings in a given lat/lon bounding box
In this example we use a simple geospatial query to find the tallest buildings that are within a given bounding box and then we could decide to export that data into a CSV file if we wanted.
In this example we are looking at the buildings in a small neighborhood of San Francisco, Ca.
SELECT ubid, height, area, lat, lon FROM opencitymodel.jun2019 WHERE lat BETWEEN 37.800921453 AND 37.8047694146 AND lon BETWEEN -122.4090803069 AND -122.4028448283 AND state = 'California' AND county = '06075' ORDER BY height DESC;
- Calculate the total surface area (in hectares) used by buildings in each California county
We can also run some interesting aggregations based on attributes such as building height or area. In this case we query to determine how much land area is used for buildings in each of California's counties.
SELECT county, round(sum(area)/10000) as area_hectares FROM opencitymodel.jun2019 WHERE state = 'California' GROUP BY county ORDER BY area_hectares;
From the results we can see that that California has a pretty wide spectrum of land use with Los Angeles county using a massive 69.2k hectares of land for buildings while just a bit farther north Alpine county has a meager 34 hectares of land for buildings.