# Data gathering and preprocessing

This notebook covers data gathering and preprocessing: it downloads the data, unzips it where necessary, and prepares it for further visualization and analysis.  

In this study, we used open data from the following sources:
1. London Datastore (London shape files);
2. Office for National Statistics (London population);
3. Transport for London (metro traffic);
4. Wikimedia Commons (metro station locations);
5. OpenStreetMap (points of interest).

The links to the data sets can be found in the References section.

To gather the data, we used URLs pointing to the data providers' websites. Note that the __data sets can be updated__ by the corresponding agencies, so some discrepancies are possible: new variables may become available, or a data set may have fewer attributes.
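Because the providers may update their files, it can help to store a fingerprint of each download next to the data. The following is a minimal sketch and not part of the original pipeline: the helper name `describe_download`, the record fields, and the example URL are hypothetical. It computes a SHA-256 digest of the raw bytes so that a later run can tell whether the upstream file has changed.

```python
import hashlib
from datetime import datetime


def describe_download(url, payload):
    """Return a small provenance record for downloaded bytes (hypothetical helper)."""
    return {
        "url": url,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
        # Same timestamp format as the "Download date" printed by the notebook cells
        "downloaded_at": datetime.today().strftime("%d-%m-%Y %H:%M:%S"),
    }


# Illustrative payload; in the notebook this would be `r.content`
record = describe_download("https://example.org/data.zip", b"example bytes")
print(record["sha256"], record["size_bytes"])
```

Comparing the stored digest with the digest of a fresh download is a cheap way to detect that a provider has republished a file.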

The following figure describes the data preprocessing pipeline:

<img src="../figures/data-preprocessing-pipeline.png" width="800"/>

The notebook is split into three sections: Data Gathering, Data Preprocessing and References. Each of the sections consists of subsections covering different data sets.


In [3]:
import requests, zipfile, io, os
from datetime import datetime
import pandas as pd
import geopandas as gpd
from tqdm import tqdm
from pyrosm import OSM
import shutil

## 1. Data gathering

__London shape files__

In [4]:
url = 'https://data.london.gov.uk/download/statistical-gis-boundary-files-london/9ba8c833-6370-4b11-abdc-314aa020d5e0/statistical-gis-boundaries-london.zip'
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))

directory = "../data/raw/geometry/"
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f'Successfully created new directory {directory}')

z.extractall(path=directory)
print(f'Download date: {datetime.today().strftime("%d-%m-%Y %H:%M:%S")}')

Download date: 10-06-2020 23:53:21
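The download cells in this section repeat the same create-directory-and-extract pattern. As a sketch only (the helper name `extract_zip_bytes` is hypothetical and is not used elsewhere in this notebook), the pattern could be wrapped in one function that takes the downloaded bytes, e.g. `r.content`, and a target directory:

```python
import io
import os
import tempfile
import zipfile


def extract_zip_bytes(payload, directory):
    """Extract an in-memory zip archive into `directory` and return the member names."""
    # exist_ok=True makes the call safe whether or not the directory already exists
    os.makedirs(directory, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(path=directory)
        return archive.namelist()


# Build a tiny archive in memory so the example runs without network access
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("hello.txt", "hello")

target = os.path.join(tempfile.mkdtemp(), "raw", "demo")
names = extract_zip_bytes(buffer.getvalue(), target)
print(names)
```

Using `os.makedirs(..., exist_ok=True)` removes the need for a separate `os.path.exists` check.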


__London population__

In [5]:
url = 'https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fpopulationandmigration%2fpopulationestimates%2fdatasets%2fcensusoutputareaestimatesinthelondonregionofengland%2fmid2017/sape20dt10amid2017coaunformattedsyoaestimateslondon.zip'
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))

directory = "../data/raw/population/"
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f'Successfully created new directory {directory}')
    
z.extractall(path=directory)
print(f'Download date: {datetime.today().strftime("%d-%m-%Y %H:%M:%S")}')

Download date: 10-06-2020 23:53:24


__OpenStreetMap POIs__

In [6]:
%%time
url = 'https://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf'
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed

directory = "../data/raw/pois/"
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f'Successfully created new directory {directory}')
    
with open(directory + 'greater-london-latest.osm.pbf', 'wb') as f:
    f.write(r.content)
print(f'Download date: {datetime.today().strftime("%d-%m-%Y %H:%M:%S")}')

Download date: 10-06-2020 23:53:30
Wall time: 5.86 s


In [7]:
%%time
# Initialize the OSM parser object
osm = OSM('../data/raw/pois/greater-london-latest.osm.pbf')

# Read all points of interest (POIs) from the extract
pois = osm.get_pois()

# Gather info about POI type (combines the tag info from "amenity" and "shop")
pois["poi_type"] = pois["amenity"]
pois["poi_type"] = pois["poi_type"].fillna(pois["shop"])

Wall time: 2min 15s


In [8]:
pois.shape

(117549, 122)

In [9]:
directory = "../data/processed/pois/"
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f'Successfully created new directory {directory}')
    
pois.to_csv(directory + "pois.csv", index=False)

## 2. Data preprocessing

__Connect shape files and population__

In [16]:
%%time
# Load population data
population = pd.read_excel("../data/raw/population/SAPE20DT10a-mid-2017-coa-unformatted-syoa-estimates-london.xlsx", 
                           sheet_name="Mid-2017 Persons", skiprows=4)  # the first 4 rows have irrelevant information, so skip them

# Rename a column
population.rename({"All Ages": "total_population"}, axis=1, inplace=True)

In [17]:
%%time
# Merge geometry with population, then aggregate to boroughs and subdistricts
geometry = gpd.read_file('../data/raw/geometry/statistical-gis-boundaries-london/ESRI/OA_2011_London_gen_MHW.shp')
merged = pd.merge(geometry, population[["OA11CD", "total_population"]], on="OA11CD")                            
boroughs = merged.dissolve("LAD11NM", aggfunc="sum", as_index=False)                            
subdistricts = merged.dissolve("WD11CD_BF", aggfunc="sum", as_index=False)

# Define the directory to store the data
directory = "../data/processed/population/"

# Create it if needed
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f'Successfully created new directory {directory}')
                
# Save the data
boroughs.to_file(directory + 'boroughs.json', driver='GeoJSON')
subdistricts.to_file(directory + 'subdistricts.json', driver='GeoJSON')
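In the cell above, `dissolve(..., aggfunc="sum")` unions the geometries within each group and sums the numeric attributes. The attribute half of that operation behaves like an ordinary pandas group-by, which can be sketched on toy data (the frame `areas_demo` and its values are made up for illustration; only the column names come from the notebook):

```python
import pandas as pd

# Toy output areas: output area code, borough name, population (values are made up)
areas_demo = pd.DataFrame({
    "OA11CD": ["E001", "E002", "E003"],
    "LAD11NM": ["Camden", "Camden", "Hackney"],
    "total_population": [300, 250, 400],
})

# Sum the population of all output areas that belong to the same borough
borough_population = (
    areas_demo.groupby("LAD11NM", as_index=False)["total_population"].sum()
)
print(borough_population)
```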

__POIs__

In [19]:
# load POIs data
pois = pd.read_csv("../data/processed/pois/pois.csv", low_memory=False)
pois = gpd.GeoDataFrame(pois, geometry=gpd.points_from_xy(pois['lon'], pois['lat']))
pois.crs = "EPSG:4326"  # WGS 84 longitude/latitude
pois.shape

(117549, 122)

Sometimes the geometry of a POI is of _MultiPolygon_ or _Polygon_ type. For uniformity, let's convert every geometry to _Point_ type by replacing it with its centroid. 

In [20]:
# Replace each geometry with its centroid
pois['geometry'] = pois['geometry'].apply(lambda geom: geom.centroid)

In "pois_categories.csv" we introduced a __subjective categorization__ of POIs into a set of categories. Let's assign these categories to the POIs that we've collected.

In [21]:
# Load categories and merge them with POIs data
pois_categories = pd.read_csv("../data/external/pois_categories.csv")
pois = pd.merge(pois, pois_categories, left_on='poi_type', right_on="pois")
pois.drop('amenity', axis=1, inplace=True)
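`pd.merge` performs an inner join by default, so POIs whose `poi_type` is missing from the category file are dropped by the cell above. A minimal sketch for checking how many rows that affects, using made-up toy data (the frames `pois_demo` and `categories_demo` are hypothetical; only the column names come from the notebook):

```python
import pandas as pd

# Toy stand-ins for the POI table and the category lookup (illustrative values only)
pois_demo = pd.DataFrame({"poi_type": ["cafe", "bakery", "bench", None]})
categories_demo = pd.DataFrame(
    {"pois": ["cafe", "bakery"], "pois_category": ["food", "food"]}
)

# A left join with indicator=True keeps every POI and flags the unmatched ones
checked = pd.merge(
    pois_demo, categories_demo,
    left_on="poi_type", right_on="pois",
    how="left", indicator=True,
)
unmatched = checked[checked["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(checked)} POIs have no category")
```

Inspecting `unmatched["poi_type"].value_counts()` would then show which POI types the categorization does not cover.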

In [22]:
# Remove amenities tagged 'misc'
pois = pois[pois['pois_category'] != "misc"]

In [23]:
%%time
# Merge POIs with geometry
resolution = 'boroughs'
# resolution = 'subdistricts'
# Boroughs are identified by LAD11NM, subdistricts by WD11CD_BF
if resolution == 'subdistricts':
    column_id = 'WD11CD_BF'
elif resolution == 'boroughs':
    column_id = 'LAD11NM'
polygons = gpd.read_file(f'../data/processed/population/{resolution}.json')
polygons.to_crs(epsg=4326, inplace=True)

Wall time: 3.48 s
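The `if`/`elif` chain in the cell above leaves `column_id` undefined when `resolution` is misspelled. One possible alternative, shown here as a sketch (the names `ID_COLUMNS` and `id_column_for` are hypothetical), is a dictionary lookup that fails with an explicit message:

```python
# Column that identifies each polygon at a given resolution (names from the cell above)
ID_COLUMNS = {"boroughs": "LAD11NM", "subdistricts": "WD11CD_BF"}


def id_column_for(resolution):
    """Return the identifier column for `resolution`, or raise on an unknown value."""
    try:
        return ID_COLUMNS[resolution]
    except KeyError:
        raise ValueError(
            f"Unknown resolution {resolution!r}; expected one of {sorted(ID_COLUMNS)}"
        ) from None


print(id_column_for("boroughs"))  # LAD11NM
```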


In [24]:
# Assign each POI to the polygon it intersects
# (newer GeoPandas releases rename the `op` argument to `predicate`)
pois_in_polygon = gpd.sjoin(pois, polygons, how="inner", op="intersects")

In [25]:
# Count POIs per category in each borough/subdistrict
pois_counts = pois_in_polygon.groupby(['pois_category', column_id]).agg(len)
pois_counts = pois_counts.reset_index()
pois_counts = pois_counts.pivot(index="pois_category", columns=column_id, values="name")
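The counting step above can be sketched on toy data as follows. This is an alternative formulation, not the notebook's own code: it uses `groupby(...).size()` with `unstack`, which should produce the same category-by-area layout while filling empty combinations with 0 instead of NaN. The frame `joined_demo` and its values are made up for illustration.

```python
import pandas as pd

# Toy spatial-join result (illustrative values; column names follow the notebook)
joined_demo = pd.DataFrame({
    "pois_category": ["food", "food", "health", "food"],
    "LAD11NM": ["Camden", "Camden", "Camden", "Hackney"],
    "name": ["A", None, "C", "D"],
})

# size() counts rows per group, so POIs without a name are still counted
counts = (
    joined_demo.groupby(["pois_category", "LAD11NM"])
    .size()
    .unstack("LAD11NM", fill_value=0)
)
print(counts)
```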

In [26]:
# Define the directory to store the data
directory = "../data/processed/pois/"

# Create it if needed
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f'Successfully created new directory {directory}')

# Save the data
pois_counts.to_csv(directory + f'pois_counts_{resolution}.csv')

## 3. References
1. London Datastore (2019). Statistical GIS Boundary Files for London. Retrieved from https://data.london.gov.uk/dataset/statistical-gis-boundary-files-london
2. Office for National Statistics (2019). Census Output Area population estimates – London, England (supporting information). Retrieved from https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/censusoutputareaestimatesinthelondonregionofengland
3. Transport for London (2020). Transport for London API. Retrieved from https://api-portal.tfl.gov.uk/docs
4. Wikimedia Commons (2020). London Underground geographic maps/CSV. Retrieved from https://commons.wikimedia.org/wiki/London_Underground_geographic_maps/CSV
5. Geofabrik (2020). OpenStreetMap data extract for Greater London. Retrieved from https://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf