<a href="https://colab.research.google.com/github/jamesrichardbunting/neurodegeneration_pollution/blob/main/104_postcode_pollution_harmonisation_lookup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Data prep

## 104. Postcode / pollution lookup (2/2)

In Notebook 103, I generated an Ordnance Survey (OS) National Grid reference code for every postcode and every pollution map cell, identifying the 1 km grid square in which each lies.

In this notebook, I will use the Grid Reference codes as lookup values between the two datasets, appending postcodes to every pollution cell.
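
For reference, here is a minimal sketch of one common way to derive a 1 km grid reference such as `NJ9406` from eastings and northings. The function name `grid_ref_1km` is hypothetical and this is not necessarily the method used in Notebook 103. It does reproduce the `Grid_ref` values shown in the outputs below.

```python
def grid_ref_1km(easting: int, northing: int) -> str:
    """Return the 1 km OS National Grid reference for a point (hypothetical helper)."""
    # Which 100 km square the point falls in
    e100k, n100k = easting // 100_000, northing // 100_000

    # Positions of the two grid letters in a 25-letter alphabet (the grid omits 'I')
    first = (19 - n100k) - (19 - n100k) % 5 + (e100k + 10) // 5
    second = (19 - n100k) * 5 % 25 + e100k % 5
    letters = "ABCDEFGHJKLMNOPQRSTUVWXYZ"

    # Whole-kilometre offsets within the 100 km square
    e_km = (easting % 100_000) // 1000
    n_km = (northing % 100_000) // 1000
    return f"{letters[first]}{letters[second]}{e_km:02d}{n_km:02d}"


# Coordinates taken from the first rows of each dataset below
print(grid_ref_1km(394235, 806529))
print(grid_ref_1km(449500, 1204500))
```

Because both datasets are reduced to the same 1 km code, a postcode and a pollution cell share a `Grid_ref` exactly when the postcode lies inside that cell.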


In [1]:
# Import packages
import pandas as pd
import numpy as np

In [49]:
# Load datasets into working variables
postcodes = pd.read_csv('/content/final_postcodes.csv', 
                        dtype={'Postcode' : 'str',
                               'Eastings' : 'Int64',
                               'Northings' : 'Int64',
                               'Grid_ref' : 'str'})

# The annual PM2.5 columns run from '2002' to '2019'
year_dtypes = {str(year): np.float64 for year in range(2002, 2020)}

pm25_long = pd.read_csv('/content/pm25_long_grid.csv',
                        dtype={'Eastings': 'Int64',
                               'Northings': 'Int64',
                               **year_dtypes,
                               'Grid_ref': 'str'})

In [50]:
# Check the files have loaded properly
print(postcodes.head())
print(pm25_long.head())

   Unnamed: 0 Postcode  Eastings  Northings Grid_ref
0           0  AB101AB    394235     806529   NJ9406
1           1  AB101AF    394235     806529   NJ9406
2           2  AB101AG    394230     806469   NJ9406
3           3  AB101AH    394235     806529   NJ9406
4           4  AB101AL    394296     806581   NJ9406
   Unnamed: 0  Eastings  Northings  ...      2018      2019  Grid_ref
0           0    449500    1204500  ...  2.986548  2.809464    HP4904
1           1    450500    1204500  ...  2.986847  2.809300    HP5004
2           2    451500    1204500  ...  2.998164  2.820699    HP5104
3           3    452500    1204500  ...  3.003749  2.826752    HP5204
4           4    453500    1204500  ...  3.021582  2.842594    HP5304

[5 rows x 22 columns]


In [51]:
# Remove unneeded columns
postcodes = postcodes.drop(['Unnamed: 0', 'Eastings', 'Northings'], axis=1)
pm25_long = pm25_long.drop(['Unnamed: 0', 'Eastings', 'Northings'], axis=1)

# Check the output
print(postcodes.head())
print(pm25_long.head())

  Postcode Grid_ref
0  AB101AB   NJ9406
1  AB101AF   NJ9406
2  AB101AG   NJ9406
3  AB101AH   NJ9406
4  AB101AL   NJ9406
       2002      2003      2004  ...      2018      2019  Grid_ref
0  5.508577  5.904274  4.958202  ...  2.986548  2.809464    HP4904
1  5.509383  5.905135  4.961749  ...  2.986847  2.809300    HP5004
2  5.511036  5.906586  4.964243  ...  2.998164  2.820699    HP5104
3  5.512560  5.908208  4.967636  ...  3.003749  2.826752    HP5204
4  5.514749  5.914227  4.981335  ...  3.021582  2.842594    HP5304

[5 rows x 19 columns]


### Group postcodes by Grid Reference

My first job is to group postcodes that lie in the same 1 km square. This leaves one row per grid reference, which makes the eventual lookup between dataframes faster.

In [58]:
# Group postcodes by 'Grid_ref' value
trans_postcodes = postcodes.groupby('Grid_ref')

# Convert to list type
trans_postcodes = trans_postcodes['Postcode'].apply(list)

# Reset index
trans_postcodes = trans_postcodes.reset_index()

In [62]:
# Check output 
print(trans_postcodes.head())
print(trans_postcodes.shape)

  Grid_ref            Postcode
0   HP5303  [ZE2 9DD, ZE2 9DE]
1   HP5304  [ZE2 9BB, ZE2 9BZ]
2   HP5803           [ZE2 9DW]
3   HP5900  [ZE2 9DN, ZE2 9EH]
4   HP6000           [ZE2 9DL]
(123036, 2)
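
It may be worth confirming that grouping has not lost any postcodes. The sketch below runs that check on a small toy dataframe (`toy_postcodes` is a stand-in built from values in the output above, not the real dataset); the same assertions could be applied to `postcodes` and `trans_postcodes`.

```python
import pandas as pd

# Toy stand-in for the `postcodes` dataframe
toy_postcodes = pd.DataFrame({
    "Postcode": ["ZE2 9DD", "ZE2 9DE", "ZE2 9BB", "ZE2 9BZ", "ZE2 9DW"],
    "Grid_ref": ["HP5303", "HP5303", "HP5304", "HP5304", "HP5803"],
})

# Same grouping steps as above, chained
toy_grouped = (toy_postcodes.groupby("Grid_ref")["Postcode"]
               .apply(list)
               .reset_index())

# Total length of all lists should equal the original number of rows
n_listed = toy_grouped["Postcode"].str.len().sum()
assert n_listed == len(toy_postcodes)

# Each grid reference should appear exactly once after grouping
assert toy_grouped["Grid_ref"].is_unique

# Summary of how many postcodes fall in each cell
print(toy_grouped["Postcode"].str.len().describe())
```

Note that `groupby` drops rows whose key is missing by default, so the first assertion would fail if any postcode had no `Grid_ref`.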


In [63]:
print(f"""
Grouping has reduced the postcode dataset to {trans_postcodes.shape[0] / postcodes.shape[0]:.1%} of its original row count
""")


Grouping has reduced the postcode dataset to 7.2% of its original row count



### Merge postcode and pollution datasets

I can now merge the two datasets according to matching Grid References.


In [65]:
# Merge dataframes on the 'Grid_ref' column using the default inner join
pm25_long_postcodes = pd.merge(pm25_long,
                               trans_postcodes,
                               on='Grid_ref')
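
An inner join silently drops pollution cells that contain no postcodes, and postcode groups that have no pollution cell. As a hedged sketch of how that could be audited, the toy example below uses an outer join with `indicator=True` to count unmatched rows, and `validate="one_to_one"` to confirm each `Grid_ref` appears once on each side. The dataframes `toy_pm25` and `toy_grouped` are illustrative stand-ins, and their mismatched rows are contrived for the demonstration.

```python
import pandas as pd

# Toy pollution cells: HP4904 has no postcodes in the toy postcode table
toy_pm25 = pd.DataFrame({
    "2019": [2.842594, 2.841708, 2.809464],
    "Grid_ref": ["HP5304", "HP5303", "HP4904"],
})

# Toy grouped postcodes: HP5803 has no pollution cell in the toy pollution table
toy_grouped = pd.DataFrame({
    "Grid_ref": ["HP5303", "HP5304", "HP5803"],
    "Postcode": [["ZE2 9DD", "ZE2 9DE"], ["ZE2 9BB", "ZE2 9BZ"], ["ZE2 9DW"]],
})

# Outer join keeps unmatched rows; `_merge` records which side each row came from
audit = pd.merge(toy_pm25, toy_grouped, on="Grid_ref", how="outer",
                 indicator=True, validate="one_to_one")
print(audit["_merge"].value_counts())

# Rows present on both sides are what the inner join would return
matched = audit[audit["_merge"] == "both"].drop(columns="_merge")
print(matched)
```

Rows marked `left_only` are pollution cells without postcodes; rows marked `right_only` are postcode groups without pollution data.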

In [74]:
# Check the output 
pm25_long_postcodes.head()

| | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | Grid_ref | Postcode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.514749 | 5.914227 | 4.981335 | 3.540267 | 5.957929 | 5.11791 | 4.255414 | 4.8291 | 6.699368 | 7.217539 | 6.061306 | 5.838717 | 6.206523 | 4.814907 | 3.379198 | 3.355659 | 3.021582 | 2.842594 | HP5304 | [ZE2 9BB, ZE2 9BZ] |
| 1 | 5.517454 | 5.91695 | 4.989788 | 3.543163 | 5.961709 | 5.111147 | 4.248519 | 4.824398 | 6.697355 | 7.215009 | 6.058562 | 5.827984 | 6.19236 | 4.799379 | 3.373939 | 3.353308 | 3.020325 | 2.841708 | HP5303 | [ZE2 9DD, ZE2 9DE] |
| 2 | 5.519502 | 5.9461 | 5.493266 | 4.052172 | 6.479181 | 5.63884 | 4.753676 | 5.486241 | 6.694517 | 7.237729 | 6.024529 | 5.809897 | 6.172776 | 4.772047 | 3.369199 | 3.35114 | 3.012448 | 2.832444 | HU5198 | [ZE2 9DQ] |
| 3 | 5.523847 | 5.976072 | 5.592174 | 4.096052 | 6.529981 | 5.661076 | 4.818233 | 5.560419 | 6.763672 | 7.320381 | 6.134806 | 5.903045 | 6.286397 | 4.879957 | 3.418933 | 3.417825 | 3.082657 | 2.897314 | HU5498 | [ZE2 9DF] |
| 4 | 5.519471 | 5.959595 | 5.542004 | 4.074066 | 6.504464 | 5.649133 | 4.771908 | 5.506301 | 6.722859 | 7.264177 | 6.044989 | 5.845307 | 6.21894 | 4.807485 | 3.377687 | 3.365021 | 3.030917 | 2.847821 | HU5297 | [ZE2 9DG] |
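
Each row now holds one pollution cell with a list of the postcodes inside it. If a one-row-per-postcode lookup is needed later, one possible approach (an assumption on my part, not a step this notebook performs) is to expand the lists with `DataFrame.explode`. The sketch uses a toy dataframe, `toy_merged`, built from two rows and two year columns of the output above.

```python
import pandas as pd

# Toy stand-in for two rows of the merged dataframe
toy_merged = pd.DataFrame({
    "2018": [3.021582, 3.020325],
    "2019": [2.842594, 2.841708],
    "Grid_ref": ["HP5304", "HP5303"],
    "Postcode": [["ZE2 9BB", "ZE2 9BZ"], ["ZE2 9DD", "ZE2 9DE"]],
})

# One row per postcode, each carrying its cell's pollution values
by_postcode = toy_merged.explode("Postcode").set_index("Postcode")

# Look up the 2019 value for a single postcode
print(by_postcode.loc["ZE2 9DD", "2019"])
```

Postcodes in the same cell receive identical values, since the pollution data has no finer resolution than 1 km.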
