# Data gathering and preprocessing

This notebook covers data gathering and preprocessing. It lets you download the data, unzip it where necessary, and prepare it for visualization and modeling.

In this study, we used open data from the following sources:
* London Datastore (London shape files);
* Office for National Statistics (London population);
* Transport for London (metro traffic);
* Wikimedia Commons (metro station locations);
* OpenStreetMap (amenities).

The links to the data sets can be found in the __References__ section. We also used results from Verma et al. (2020), who proposed a spatio-temporal clustering of London metro stations based on their traffic. We will use the results of this clustering in the second notebook.

To gather the data, we used URLs pointing to the data providers' websites. Note that the __data sets can be updated__ by the corresponding agencies, so some discrepancies are possible: new variables may become available, or a data set may have fewer attributes. To gather the amenities data, we used the Python package OSMnx (Boeing, 2017).
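The download-and-unzip pattern described above can be sketched as a small helper. This is a minimal sketch under stated assumptions, not the notebook's own code: the function names `extract_zip_bytes` and `download_and_extract` are hypothetical, and the standard-library `urllib.request.urlopen` stands in for `requests.get` so that the block has no third-party dependencies.

```python
import io
import zipfile
from pathlib import Path
from urllib.request import urlopen


def extract_zip_bytes(content: bytes, directory: str) -> list:
    """Unpack an in-memory zip archive into `directory`, creating it if needed.

    Returns the names of the archive members so the caller can check
    whether the provider has changed the contents of the data set.
    """
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        archive.extractall(path=target)
        return archive.namelist()


def download_and_extract(url: str, directory: str, timeout: float = 60.0) -> list:
    """Download a zip archive from `url` (hypothetical helper) and unpack it."""
    with urlopen(url, timeout=timeout) as response:
        return extract_zip_bytes(response.read(), directory)
```

Because the agencies can update their files, comparing the returned member names against the ones seen in an earlier run is one simple way to notice that a data set has changed.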

The notebook is split into three sections: Data Gathering, Data Preprocessing, and References. Each section consists of subsections covering different data sets.

In [1]:
import io
import os
import zipfile
from datetime import datetime

import geopandas as gpd
import osmnx as ox
import pandas as pd
import requests
from tqdm import tqdm

ImportError: cannot import name 'CRS' from 'pyproj' (C:\Users\Mikhail\AppData\Local\Continuum\miniconda3\envs\spacetimegeo\lib\site-packages\pyproj\__init__.py)

## 1. Data gathering

### London shape files

In [12]:
url = 'https://data.london.gov.uk/download/statistical-gis-boundary-files-london/9ba8c833-6370-4b11-abdc-314aa020d5e0/statistical-gis-boundaries-london.zip'
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))

# Create the target directory first, then report it
directory = "data/raw/geometry/"
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f'Successfully created new directory {directory}')

z.extractall(path=directory)
print(f'Downloading date: {datetime.today().strftime("%d-%m-%Y %H:%M:%S")}')

Successfully created new directory data/raw/geometry/
Downloading date: 26-04-2020 17:41:56


### London population

In [10]:
url = 'https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fpopulationandmigration%2fpopulationestimates%2fdatasets%2fcensusoutputareaestimatesinthelondonregionofengland%2fmid2017/sape20dt10amid2017coaunformattedsyoaestimateslondon.zip'
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))

# Create the target directory first, then report it
directory = "data/raw/population/"
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f'Successfully created new directory {directory}')

z.extractall(path=directory)
print(f'Downloading date: {datetime.today().strftime("%d-%m-%Y %H:%M:%S")}')

Successfully created new directory data/raw/population/
Downloading date: 26-04-2020 17:39:00


### OpenStreetMap amenities

In [39]:
geometry = gpd.read_file("data/raw/geometry/statistical-gis-boundaries-london/ESRI/OA_2011_London_gen_MHW.shp")
geometry = geometry.to_crs(epsg=4326)
print(f'Initial number of polygons: {geometry.shape[0]}.')

Initial number of polygons: 25053.


Since the initial number of polygons is high, downloading the amenities data for each of them would take many hours. Aggregating them into fewer but larger polygons reduces the download time significantly.

In [40]:
# Dissolve output areas into LSOA boundaries (Polygon or MultiPolygon geometries)
polygons = geometry.dissolve(by="LSOA11NM")['geometry']
n_polygons = polygons.shape[0]
print(f'Resulting number of polygons: {n_polygons}.')

Resulting number of polygons: 4835.
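To put a number on the reduction, the two counts printed above can be compared directly. This is only a rough sketch: it counts requests, and it assumes (without measurement) that download time scales roughly with the number of requests, which ignores that larger polygons return more data per request.

```python
# Counts taken from the outputs of the two cells above.
n_output_areas = 25053  # polygons before dissolving
n_lsoas = 4835          # polygons after dissolving by LSOA11NM

# How many original polygons each dissolved polygon replaces, on average.
reduction_factor = n_output_areas / n_lsoas

# Share of download requests avoided by querying the dissolved polygons.
share_saved = 1 - n_lsoas / n_output_areas

print(f"{reduction_factor:.1f}x fewer requests ({share_saved:.0%} avoided)")
```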


In [41]:
amenities = []
n_fails = 0
for polygon in tqdm(polygons, total=n_polygons):
    try:
        amenities.append(ox.pois.pois_from_polygon(polygon))
    except Exception:
        # Count the failure and move on; note that this also hides setup errors (e.g. a failed import)
        n_fails += 1

print(f"Unable to collect amenities from {n_fails} polygons.")

100%|██████████████████████████████████████████████████████████████████████████| 4835/4835 [00:00<00:00, 539175.26it/s]

Unable to collect amenities from 4835 polygons.
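All 4835 polygons failing in a fraction of a second suggests that the calls raised immediately instead of reaching the network. One plausible cause, given the `ImportError` in the first cell, is that `ox` was never defined. Tallying failures by exception type makes such a systematic problem visible. The sketch below is a minimal illustration, not part of the original notebook: `fetch_all` and `broken_fetch` are hypothetical names, and `broken_fetch` merely imitates a call that fails because of an undefined name.

```python
from collections import Counter
from typing import Callable, Iterable


def fetch_all(items: Iterable, fetch: Callable):
    """Apply `fetch` to every item, keeping results and tallying failures by exception type."""
    results = []
    errors = Counter()
    for item in items:
        try:
            results.append(fetch(item))
        except Exception as err:
            # Keep going, but remember what kind of error occurred.
            errors[type(err).__name__] += 1
    return results, errors


def broken_fetch(polygon):
    """Hypothetical stand-in for the download call that fails like an undefined name would."""
    raise NameError("name 'ox' is not defined")


results, errors = fetch_all(range(3), broken_fetch)
print(errors)
```

If a single exception type accounts for every failure, the problem is more likely in the environment or the call itself than in individual polygons.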





In [None]:
# get downloaded POIs into a geodataframe
# amenities = gpd.GeoDataFrame(pd.concat(amenities, ignore_index=True))

In [None]:
# save amenity data as CSV
# amenities.to_csv(data_path + "amenities.csv")

In [None]:
# keep selected columns and save amenity data as GeoJSON
# amenities = amenities[["osmid", "geometry", "amenity", "name", "building", "address"]]
# amenities.to_file(data_path + "amenities_LSOA.json", driver="GeoJSON", encoding="utf-8")

## 2. Data preprocessing

## 3. References
1. London Datastore (2019). Statistical GIS Boundary Files for London. Retrieved from https://data.london.gov.uk/dataset/statistical-gis-boundary-files-london
2. Office for National Statistics (2019). Census Output Area population estimates – London, England (supporting information). Retrieved from https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/censusoutputareaestimatesinthelondonregionofengland
3. Transport for London (2020). Transport for London API. Retrieved from https://api-portal.tfl.gov.uk/docs
4. Wikimedia Commons (2020). London Underground geographic maps/CSV. Retrieved from https://commons.wikimedia.org/wiki/London_Underground_geographic_maps/CSV
5. OpenStreetMap contributors (2020). Amenities. Retrieved from https://www.openstreetmap.org
6. Verma, T., Sirenko, M., Kornecki, I., Cunningham, S., Araujo, N. A. M. (2020). Temporal demand profiles of mobility reveal the spatial structure of a city. Manuscript in preparation.
7. Boeing, G. (2017). OSMnx: New Methods for Acquiring, Constructing, Analyzing, and Visualizing Complex Street Networks. Computers, Environment and Urban Systems 65, 126-139. doi:10.1016/j.compenvurbsys.2017.05.004