## Compute OA and LSOA centroids
This notebook computes the centroid of each Output Area (OA) and Lower Layer Super Output Area (LSOA) and stores them as datasets for later use. The centroids are computed from the ONS Postcode Directory (February 2019). The same dataset also provides a lookup between OAs/LSOAs and Travel to Work Areas (TTWAs), which is stored alongside the centroids.

In future, it would be better to compute each LSOA centroid as a population-weighted average of its OA centroids (weighting over OAs rather than postcodes, because I think census population data is available at the OA level).
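
The weighted average described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the toy table, the `population` column and the helper `weighted_lsoa_centroids` are hypothetical and do not come from the ONS files used in this notebook. Like the rest of the notebook, it averages latitude and longitude directly.

```python
import pandas as pd

# Hypothetical toy input: one row per OA with its centroid, parent LSOA
# and an assumed census population count.
oa_centroids = pd.DataFrame(
    {
        "oa11": ["OA_A", "OA_B", "OA_C"],
        "lsoa11": ["LSOA_1", "LSOA_1", "LSOA_2"],
        "lat": [50.0, 52.0, 55.0],
        "long": [-1.0, -3.0, -2.0],
        "population": [100, 300, 250],
    }
)


def weighted_lsoa_centroids(df):
    """Population-weighted mean of the OA centroids within each LSOA."""
    # Multiply each coordinate by the OA population
    weighted = df[["lat", "long"]].multiply(df["population"], axis=0)
    weighted["lsoa11"] = df["lsoa11"]
    weighted["population"] = df["population"]
    # Sum the weighted coordinates and the weights per LSOA, then normalise
    sums = weighted.groupby("lsoa11").sum()
    return sums[["lat", "long"]].div(sums["population"], axis=0)


lsoa_weighted = weighted_lsoa_centroids(oa_centroids)
print(lsoa_weighted)
```

In this toy example the centroid of `LSOA_1` is pulled towards `OA_B`, which holds three quarters of the population: it lands at latitude 51.5 and longitude -2.5 rather than at the unweighted midpoint (51.0, -2.0).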

In [1]:
import urllib.request, json
import requests
import pandas as pd
import numpy as np
import pickle
import time
import matplotlib.pyplot as plt
import os

In [2]:
# relevant folders
folder1 = '/Users/stefgarasto/Local-Data/'
folder2 = '/Users/stefgarasto/Google Drive/Documents/data/'
folder3 = '/Users/stefgarasto/Google Drive/Documents/results/'

In [3]:
# relevant files
ons_pc_file = folder2 + 'ONS/ONS-Postcode-Directory-Latest-Centroids.csv'
pop_density_file = folder2 + 'ONS/population_density1.csv'
ttwa_file = folder2 + 'ONS/Travel_to_Work_Areas_December_2011_Boundaries.csv'

In [4]:
# first, load the list of all TTWA
ttwa_data = pd.read_csv(ttwa_file)
# first column is ttwa codes, second column is ttwa names

In [5]:
# load the lookup table between postcodes, OAs, LSOAs and TTWAs
ons_pc = pd.read_csv(ons_pc_file)
# only keep relevant columns
ons_pc = ons_pc[['X', 'Y', 'objectid', 'pcd', 'pcd2', 'pcds', 'ttwa', 'oa01',
       'lsoa01', 'msoa01', 'oac01', 'oa11', 'lsoa11', 'msoa11',
       'oac11', 'lat', 'long']]
# show a snippet
ons_pc.head()

,X,Y,objectid,pcd,pcd2,pcds,ttwa,oa01,lsoa01,msoa01,oac01,oa11,lsoa11,msoa11,oac11,lat,long
0,-2.073663,57.13791,1001,AB1 3QF,AB1 3QF,AB1 3QF,S22000047,S00000139,S01000082,S02000015,6B1,S00089106,S01006631,S02001258,7A1,57.137926,-2.073655
1,-2.078354,57.137306,1002,AB1 3QH,AB1 3QH,AB1 3QH,S22000047,S00000137,S01000084,S02000012,2B1,S00089104,S01006627,S02001257,7B3,57.137321,-2.078346
2,-2.07923,57.137287,1003,AB1 3QJ,AB1 3QJ,AB1 3QJ,S22000047,S00000137,S01000084,S02000012,2B1,S00089104,S01006627,S02001257,7B3,57.137303,-2.079222
3,-2.080454,57.137825,1004,AB1 3QL,AB1 3QL,AB1 3QL,S22000047,S00001535,S01000084,S02000012,5C2,S00090590,S01006627,S02001257,3C2,57.137841,-2.080446
4,-2.080287,57.137241,1005,AB1 3QN,AB1 3QN,AB1 3QN,S22000047,S00001535,S01000084,S02000012,5C2,S00090590,S01006627,S02001257,3C2,57.137257,-2.080279


In [6]:
# group by LSOAs and get centroids
# Note: each centroid is the unweighted mean of the centroids of all postcodes in that LSOA. Ideally, we would
# instead compute a population-weighted average. However, census data is at most at the OA level, so the best
# we could do is compute the OA centroids first, then average those weighted by population density (TODO-next)
lsoa_list = []
lsoa_lat = []
lsoa_long = []
t0 = time.time()
lsoa_groups = ons_pc.groupby('lsoa11')
# get average lat and long
lsoa_data = lsoa_groups[['lat','long']].agg(np.mean)
# get the unique TTWA for each LSOA (see check done below)
lsoa_data = lsoa_data.join(lsoa_groups['ttwa'].agg(pd.Series.unique))

# check that each LSOA always belongs to a single TTWA
# No LSOA spans more than one TTWA, so there is no need to run this check more than once
#for name,group in lsoa_groups: 
#    if not group['ttwa'].describe()['unique'] == 1:
#        print('LSOA {} is in more than one TTWA'.format(name))
print('Done in {:4f} s'.format(time.time() - t0))
lsoa_data['ttwa'].describe()

Done in 5.404177 s


count         42621
unique          230
top       E30000234
freq           4777
Name: ttwa, dtype: object
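
The commented-out loop above visits every LSOA group in turn. A vectorised alternative is sketched below on a hypothetical toy table (`toy_pc` is made up and only reuses the column names of `ons_pc`); it is one possible way to run the same check, not the one used in this notebook. Note that `nunique` ignores missing values by default.

```python
import pandas as pd

# Hypothetical toy lookup reusing the ons_pc column names
toy_pc = pd.DataFrame(
    {
        "lsoa11": ["L1", "L1", "L2", "L2"],
        "ttwa": ["T1", "T1", "T1", "T2"],
    }
)

# Number of distinct TTWAs seen within each LSOA
ttwa_counts = toy_pc.groupby("lsoa11")["ttwa"].nunique()

# LSOAs that map to more than one TTWA (expected to be empty on the real data)
ambiguous = ttwa_counts[ttwa_counts > 1]
print(ambiguous.index.tolist())
```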

In [7]:
# group by OAs and get centroids
# Note: each centroid is the unweighted mean of the centroids of all postcodes in that OA. We will use these only
# for calls to the lmiforall API, since they are free. We can then average distances and occupation breakdowns
# across LSOAs if needed
oa_list = []
oa_lat = []
oa_long = []
t0 = time.time()
oa_groups = ons_pc.groupby('oa11')
# get average lat and long
oa_data = oa_groups[['lat','long']].agg(np.mean)
# get the unique TTWA and LSOA for each OA (if they are not unique it should throw an error)
oa_data = oa_data.join(oa_groups[['ttwa','lsoa11']].agg(pd.Series.unique))
print('Done in {:4f} s'.format(time.time() - t0))
oa_data[['lsoa11','ttwa']].describe()

Done in 57.190902 s


,lsoa11,ttwa
count,232034,232034
unique,42621,230
top,S01010390,E30000234
freq,14,24840


In [8]:
# save the extracted datasets, unless they have already been saved
exists = os.path.isfile(folder3 + 'PIN/oa_centroids_dictionary.pickle')
if not exists:
    print('Saving the OA data')
    oa_data.to_pickle(folder3 + 'PIN/oa_centroids_dictionary.pickle')
exists = os.path.isfile(folder3 + 'PIN/lsoa_centroids_dictionary.pickle')
if not exists:
    print('Saving the LSOA data')
    lsoa_data.to_pickle(folder3 + 'PIN/lsoa_centroids_dictionary.pickle')

# release the ons_pc dataset - it is not needed anymore
ons_pc = None

Saving the OA data
Saving the LSOA data
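
To use the saved centroids later on, the pickles can be read back with `pd.read_pickle`. The round trip below is a minimal sketch on a hypothetical one-row table written to a temporary directory, so it does not touch the real files under `folder3`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for oa_data: centroids indexed by OA code
toy = pd.DataFrame(
    {"lat": [51.5], "long": [-0.1], "ttwa": ["T1"]},
    index=pd.Index(["OA_A"], name="oa11"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "oa_centroids_dictionary.pickle")
    # Same call as in the cell above, on the toy data
    toy.to_pickle(path)
    # Read the DataFrame back, index and dtypes included
    reloaded = pd.read_pickle(path)

print(reloaded.equals(toy))
```

Pickle files should only be loaded from trusted sources, and they may not load across very different pandas versions.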


In [None]:
'''
Check memory usage of all variables
'''

import sys

# These are the usual ipython objects, including this one you are creating
ipython_vars = ['In', 'Out', 'exit', 'quit', 'get_ipython', 'ipython_vars']

# Get a sorted list of the objects and their sizes
sorted([(x, sys.getsizeof(globals().get(x))) for x in dir() if not x.startswith('_') 
        and x not in sys.modules and x not in ipython_vars], key=lambda x: x[1], reverse=True)
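
For the DataFrames themselves, pandas can also report memory per column. The sketch below uses a hypothetical toy table and compares the shallow estimate with the deep one, which additionally counts the contents of object (string) columns.

```python
import pandas as pd

# Hypothetical toy table with one string column and one numeric column
toy = pd.DataFrame({"ttwa": ["E30000234"] * 1000, "lat": [51.5] * 1000})

# Bytes per column (index included), without and with inspecting object contents
shallow = toy.memory_usage(deep=False).sum()
deep = toy.memory_usage(deep=True).sum()

print("shallow: {} bytes, deep: {} bytes".format(shallow, deep))
```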
