### Step 1: Data gathering 

__Step goal__: Download and store the datasets used in this study.

__Step overview__:
1. London demographic data;
2. London shape files;
3. Counts data;
4. Metro stations and lines.

#### Introduction

All data are __open access__ and available from the official websites of the publishing agencies. Note that these agencies may update the data sets, so some discrepancies are possible: new variables may become available, or a data set may have fewer attributes.
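
Because the published files can change between releases, it may help to check a downloaded table against the columns the later steps rely on. The snippet below is a minimal sketch of such a check, not part of the original workflow: the function name `missing_columns` and the column names in the usage note are hypothetical placeholders.

```python
import csv


def missing_columns(csv_path, expected_columns):
    """Return the expected column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Read only the first row; an empty file yields an empty header
        header = next(csv.reader(f), [])
    present = {name.strip() for name in header}
    return [name for name in expected_columns if name not in present]
```

For example, `missing_columns(path, ["area_code", "population"])` returns an empty list when both (assumed) columns are present, and the names of any that are not.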

In [1]:
import requests, zipfile, io
from datetime import datetime
import os
import pandas as pd
from bs4 import BeautifulSoup as bs

1. London demographic data

In [9]:
url = 'https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fpopulationandmigration%2fpopulationestimates%2fdatasets%2fcensusoutputareaestimatesinthelondonregionofengland%2fmid2017/sape20dt10amid2017coaunformattedsyoaestimateslondon.zip'
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))

# Create the target directory if it does not exist yet
directory = "../data/raw/population/"
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f'Successfully created new directory {directory}')

z.extractall(path=directory)
print(f'Downloading date: {datetime.today().strftime("%d-%m-%Y %H:%M:%S")}')

Downloading date: 09-04-2020 18:54:38
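
The printed download date is lost once the notebook is re-run. One way to keep it, sketched here as an assumption rather than as part of the original study, is to store a small provenance record next to the raw files. The function names and the `manifest.json` file name in the usage note are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(url, content):
    """Describe one download: source URL, UTC timestamp, size, and SHA-256 digest."""
    return {
        "url": url,
        "downloaded_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "size_bytes": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),
    }


def write_manifest(records, path):
    """Save a list of provenance records as a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
```

With the variables from the cell above, this could be called as `write_manifest([provenance_record(url, r.content)], directory + "manifest.json")`. The digest makes it possible to tell later whether an agency has replaced the file.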


2. London shape files

In [10]:
url = 'https://data.london.gov.uk/download/statistical-gis-boundary-files-london/9ba8c833-6370-4b11-abdc-314aa020d5e0/statistical-gis-boundaries-london.zip'
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))

# Create the target directory if it does not exist yet
directory = "../data/raw/geometry/london/"
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f'Successfully created new directory {directory}')

z.extractall(path=directory)
print(f'Downloading date: {datetime.today().strftime("%d-%m-%Y %H:%M:%S")}')

Downloading date: 09-04-2020 18:54:43


3. Counts data

In [11]:
url = 'http://tfl.gov.uk/tfl/syndication/feeds/counts.zip?app_id=&app_key='
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))

# Create the target directory if it does not exist yet
directory = "../data/raw/counts/"
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f'Successfully created new directory {directory}')

z.extractall(path=directory)
print(f'Downloading date: {datetime.today().strftime("%d-%m-%Y %H:%M:%S")}')

Downloading date: 09-04-2020 18:54:43
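
The three ZIP downloads above follow the same pattern, so they could share one helper. The following is a minimal sketch under that assumption; the names `extract_zip_bytes` and `download_and_extract` and the `timeout` default are illustrative choices, not part of the original notebook.

```python
import io
import os
import zipfile


def extract_zip_bytes(content, directory):
    """Extract an in-memory ZIP archive into `directory` and return the member names."""
    os.makedirs(directory, exist_ok=True)  # no error if the directory already exists
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        archive.extractall(path=directory)
        return archive.namelist()


def download_and_extract(url, directory, timeout=60):
    """Download a ZIP archive from `url` and extract it into `directory`."""
    import requests  # imported here so extract_zip_bytes works without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail on HTTP errors instead of unzipping an error page
    return extract_zip_bytes(response.content, directory)
```

Each cell would then reduce to a single call such as `download_and_extract(url, "../data/raw/counts/")`. If the server returns something other than a ZIP archive, `zipfile.ZipFile` raises `zipfile.BadZipFile`.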


4. Station locations and lines

In [98]:
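# Wikimedia Commons page with the station, route, and line tables (reference 4)
url = 'https://commons.wikimedia.org/wiki/London_Underground_geographic_maps/CSV'
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
soup = bs(r.content, 'lxml')
pre = soup.select('pre')  # the CSV tables are read from the page's <pre> blocks

file_names = ['stations.csv', 'routes.csv', 'lines.csv']

# Create the target directory if it does not exist yet
directory = "../data/raw/geometry/metro_stations/"
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f'Successfully created new directory {directory}')

# Save the <pre> blocks in page order under the names listed above
for i, p in enumerate(pre):
    df = pd.DataFrame([x.split(',') for x in p.text.split('\n')])
    df.to_csv(directory + file_names[i])
print(f'Downloading date: {datetime.today().strftime("%d-%m-%Y %H:%M:%S")}')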

Downloading date: 09-04-2020 19:34:10
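
Splitting each line on `','` breaks any field that contains a quoted comma and keeps blank lines as empty rows. As a hedged alternative, the sketch below parses the text of one `<pre>` block with the standard `csv` module; `parse_csv_block` is a hypothetical helper and the sample text in the usage note is made up.

```python
import csv
import io


def parse_csv_block(text):
    """Parse CSV text into (header, rows), skipping blank lines."""
    reader = csv.reader(io.StringIO(text.strip()))
    rows = [row for row in reader if row]  # csv.reader yields [] for blank lines
    if not rows:
        return [], []
    return rows[0], rows[1:]
```

For the made-up input `'id,name\n1,"Alpha, North"\n\n2,Beta\n'`, the header is `['id', 'name']` and the quoted name stays in a single field. The result can be passed on as `pd.DataFrame(rows, columns=header)`.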


### References
1. Office for National Statistics (2019). Census Output Area population estimates – London, England (supporting information). Retrieved from https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/censusoutputareaestimatesinthelondonregionofengland
2. London Datastore (2019). Statistical GIS Boundary Files for London. Retrieved from https://data.london.gov.uk/dataset/statistical-gis-boundary-files-london
3. Transport for London (2020). Transport for London API. Retrieved from https://api-portal.tfl.gov.uk/docs
4. Wikimedia Commons (2020). London Underground geographic maps/CSV. Retrieved from https://commons.wikimedia.org/wiki/London_Underground_geographic_maps/CSV