<a href="https://colab.research.google.com/github/jamesrichardbunting/neurodegeneration_pollution/blob/main/103_postcode_pollution_harmonisation_gr.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Data prep

## 103. Postcode / pollution lookup

I now have collated postcode and pollution source files (see Notebooks 101 and 102). 

Both files contain 'x' and 'y' coordinates, which are the eastings and northings of each record on the Ordnance Survey National Grid, also known as the British National Grid (BNG). 

Through a simple transformation of these coordinates I can identify the 1km Grid Reference in which each pollution cell and postcode observation lies. This shared reference will allow me to match postcodes to each pollution map cell through a simple lookup. 

An explanation of this transformation can be found [here](https://www.le.ac.uk/ar/arcgis/OS_coords.html). In this notebook I will generate Grid References for each postcode and pollution cell, then append postcodes to each matching pollution cell. 
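As a minimal sketch of the transformation for a single coordinate pair (not the function used later in this notebook), the arithmetic can be written as below. The helper name `one_km_grid_ref` and the two-entry `SQUARES` lookup are assumptions made for illustration; a full lookup would need every 100km square.

```python
# Hypothetical lookup: only the two 100km squares needed for this example,
# keyed by (easting index, northing index) in units of 100km
SQUARES = {(3, 8): 'NJ', (4, 12): 'HP'}

def one_km_grid_ref(easting, northing):
    """Return the 1km OS grid reference for one BNG coordinate pair (in metres)."""
    e_km = easting // 1000    # whole kilometres east of the grid origin
    n_km = northing // 1000   # whole kilometres north of the grid origin
    square = SQUARES[(e_km // 100, n_km // 100)]   # two-letter 100km square
    # The remaining two digits per axis locate the 1km cell inside that square
    return f"{square}{e_km % 100:02d}{n_km % 100:02d}"

print(one_km_grid_ref(394235, 806529))    # NJ9406
print(one_km_grid_ref(449500, 1204500))   # HP4904
```

Both results agree with the Grid_ref values shown in the outputs of this notebook for the same coordinates.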


In [1]:
# Import packages
import pandas as pd
import numpy as np
import os
import glob

In [None]:
# Define a function to convert Eastings and Northings to the corresponding 1km square OS Grid Reference 

def grid_ref(data):
  
  d = {'00' : 'SV',
       '10' : 'SW',
       '20' : 'SX',
       '30' : 'SY',
       '40' : 'SZ',
       '50' : 'TV',
       '11' : 'SR',
       '21' : 'SS',
       '31' : 'ST',
       '41' : 'SU',
       '51' : 'TQ',
       '61' : 'TR',
       '12' : 'SM',
       '22' : 'SN',
       '32' : 'SO',
       '42' : 'SP',
       '52' : 'TL',
       '62' : 'TM',
       '23' : 'SH',
       '33' : 'SJ',
       '43' : 'SK',
       '53' : 'TF',
       '63' : 'TG',
       '24' : 'SC', 
       '34' : 'SD',
       '44' : 'SE',
       '54' : 'TA',
       '15' : 'NW',
       '25' : 'NX',
       '35' : 'NY',
       '45' : 'NZ',
       '16' : 'NR',
       '26' : 'NS',
       '36' : 'NT',
       '46' : 'NU',
       '07' : 'NL',
       '17' : 'NM',
       '27' : 'NN',
       '37' : 'NO',
       '08' : 'NF',
       '18' : 'NG',
       '28' : 'NH',
       '38' : 'NJ',
       '48' : 'NK',
       '09' : 'NA',
       '19' : 'NB',
       '29' : 'NC',
       '39' : 'ND',
       '110' : 'HW',
       '210' : 'HX',
       '310' : 'HY',
       '410' : 'HZ',
       '311' : 'HT',
       '411' : 'HU',
       '412' : 'HP', 
       '01' : 'NaN',
       '02' : 'NaN',
       '03' : 'NaN',
       '04' : 'NaN',
       '05' : 'NaN',
       '06' : 'NaN',
       '13' : 'NaN',
       '14' : 'NaN',
       '60' : 'NaN',
       '64' : 'NaN',
       '55' : 'NaN',
       '56' : 'NaN',
       '47' : 'NaN',
       '49' : 'NaN'}

  for i in range(data.shape[0]):
    
    outp = ''
    
    str_var = str(data['Eastings'][i])  
    
    if len(str_var) == 6:
        prefix = str_var[0]
        east = str_var[1:3]
    elif len(str_var) == 7: 
        prefix = str_var[0:2]
        east = str_var[2:4]
    elif len(str_var) == 5:
        prefix = '0'
        east = str_var[0:2]
    elif len(str_var) == 4:
        prefix = '0'
        east = '0' + str_var[0]
        
    str_var = str(data['Northings'][i])
    
    if len(str_var) == 6:
        prefix += str_var[0] 
        north = str_var[1:3]
    elif len(str_var) == 7: 
        prefix += str_var[0:2]
        north = str_var[2:4]
    elif len(str_var) == 5:
        prefix += '0'
        north = str_var[0:2]
    elif len(str_var) == 4:
        prefix += '0'
        north = '0' + str_var[0]
    
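    # Look up the two-letter 100km square; treat unknown prefixes as off-grid
    square = d.get(prefix, 'NaN')
    if square.lower() == 'nan':
        data.loc[i, 'Grid_ref'] = 'NaN'
    else:
        data.loc[i, 'Grid_ref'] = square + east + north

  # Convert the 'NaN' placeholder strings to real missing values so that dropna() removes them
  data['Grid_ref'] = data['Grid_ref'].where(data['Grid_ref'] != 'NaN')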
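
The function above writes one cell at a time with `.loc`, which can be slow on tables with many rows. A vectorised alternative might look like the sketch below. It is not the function used in this notebook: `grid_ref_vectorised`, the `squares` argument and the three-row sample are all hypothetical, and it returns a Series instead of modifying the DataFrame in place. It assumes `squares` uses the same string keys as the lookup dictionary above.

```python
import pandas as pd

def grid_ref_vectorised(data, squares):
    """Return a Series of 1km grid references; rows with no matching square are NaN."""
    # Whole kilometres east and north of the grid origin
    e_km = data['Eastings'].astype('int64') // 1000
    n_km = data['Northings'].astype('int64') // 1000
    # Same key format as the lookup dictionary: easting index then northing index
    key = (e_km // 100).astype(str) + (n_km // 100).astype(str)
    letters = key.map(squares)  # keys missing from the mapping become NaN
    digits = (e_km % 100).astype(str).str.zfill(2) + (n_km % 100).astype(str).str.zfill(2)
    return letters + digits     # a missing square stays missing after concatenation

# Hypothetical three-row sample; the last point falls outside the listed squares
sample = pd.DataFrame({'Eastings': [394235, 449500, 50000],
                       'Northings': [806529, 1204500, 150000]})
sample['Grid_ref'] = grid_ref_vectorised(sample, {'38': 'NJ', '412': 'HP'})
print(sample)
```

Because unmatched squares come back as real missing values, a later `dropna()` would remove them without any further conversion.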

### Postcode data

In [None]:
# Load postcode data into a working variable
postcodes = pd.read_csv('/content/postcodes.csv')

In [None]:
# Call the function on the postcodes dataset
grid_ref(postcodes)

In [None]:
# Check the output
postcodes.head()

Unnamed: 0,Postcode,Eastings,Northings,Grid_ref
0,AB101AB,394235,806529,NJ9406
1,AB101AF,394235,806529,NJ9406
2,AB101AG,394230,806469,NJ9406
3,AB101AH,394235,806529,NJ9406
4,AB101AL,394296,806581,NJ9406


In [None]:
# Export transformed postcodes as .CSV file for later use
postcodes.to_csv('final_postcodes.csv')

In [12]:
# Reload the transformed postcode data into a working variable
postcodes = pd.read_csv('/content/final_postcodes.csv')

### Pollution data (PM2.5)

In [None]:
# Load pollution data into a working variable
pm25_long = pd.read_csv('/content/pm25_long.csv',
                        dtype={'Eastings': 'Int64',
                               'Northings': 'Int64',
                               # Annual columns '2002' to '2019' are all floats
                               **{str(year): np.float64 for year in range(2002, 2020)}}
                        )

I know from working with the pollution dataset in a previous notebook that it contains a large number of NaN values. 

These usually fall in cells off the coast, where no pollution prediction is needed, so the affected rows should be removed. 



In [None]:
# Total number of rows
n_rows = pm25_long.shape[0]

# Total number of rows containing NaN values
na_rows = n_rows - pm25_long.dropna().shape[0]

print(f"""
The total number of rows in the PM25 dataset is: {n_rows}
The number of rows with NaNs in the PM25 dataset is: {na_rows}
This is equivalent to {na_rows / n_rows:.1%}
""")


The total number of rows in the PM25 dataset is: 281802
The number of rows with NaNs in the PM25 dataset is: 37890
This is equivalent to 13.4%



So 13.4% of the records contain at least one NaN value. I will remove these and continue with the remaining 86.6% of rows, which are more likely to match the postcode dataset. 
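
A minimal sketch of the same check on a toy table is shown below. The `toy` DataFrame, its values and the `year_cols` list are invented for illustration. Restricting `dropna` to the year columns is an assumption about intent: it avoids dropping rows because of a missing value in some unrelated column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the pollution table: two year columns instead of eighteen
toy = pd.DataFrame({'Eastings': [449500, 450500, 451500],
                    'Northings': [1204500, 1204500, 1204500],
                    '2002': [5.51, np.nan, 5.51],
                    '2003': [5.90, np.nan, np.nan]})

year_cols = ['2002', '2003']
has_nan = toy[year_cols].isna().any(axis=1)   # True where any year is missing
print(f"{has_nan.sum()} of {len(toy)} rows ({has_nan.mean():.1%}) contain a NaN")

# Drop only on the year columns, then renumber the remaining rows from zero
clean = toy.dropna(subset=year_cols).reset_index(drop=True)
print(clean)
```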

In [None]:
pm25_long = pm25_long.dropna()
pm25_long.reset_index(drop=True, inplace=True)
pm25_long.head()

Unnamed: 0,Eastings,Northings,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,449500,1204500,5.508577,5.904274,4.958202,3.520028,5.928725,5.092914,4.222562,4.795578,6.662558,7.1809,6.003246,5.761731,6.117139,4.733212,3.349245,3.325448,2.986548,2.809464
1,450500,1204500,5.509383,5.905135,4.961749,3.523079,5.932654,5.095246,4.224603,4.796798,6.664222,7.182184,6.002618,5.761755,6.117988,4.733607,3.348644,3.325449,2.986847,2.8093
2,451500,1204500,5.511036,5.906586,4.964243,3.528449,5.943837,5.109222,4.235309,4.80697,6.675422,7.193728,6.016079,5.793636,6.1559,4.767183,3.360124,3.33361,2.998164,2.820699
3,452500,1204500,5.51256,5.908208,4.967636,3.531709,5.947665,5.113245,4.244409,4.815354,6.684472,7.202428,6.033532,5.801564,6.161029,4.769278,3.364254,3.340188,3.003749,2.826752
4,453500,1204500,5.514749,5.914227,4.981335,3.540267,5.957929,5.11791,4.255414,4.8291,6.699368,7.217539,6.061306,5.838717,6.206523,4.814907,3.379198,3.355659,3.021582,2.842594


In [None]:
# Call the function on the pollution dataset
grid_ref(pm25_long)

In [None]:
# Check the output
pm25_long.head()

Unnamed: 0,Eastings,Northings,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,Grid_ref
0,449500,1204500,5.508577,5.904274,4.958202,3.520028,5.928725,5.092914,4.222562,4.795578,6.662558,7.1809,6.003246,5.761731,6.117139,4.733212,3.349245,3.325448,2.986548,2.809464,HP4904
1,450500,1204500,5.509383,5.905135,4.961749,3.523079,5.932654,5.095246,4.224603,4.796798,6.664222,7.182184,6.002618,5.761755,6.117988,4.733607,3.348644,3.325449,2.986847,2.8093,HP5004
2,451500,1204500,5.511036,5.906586,4.964243,3.528449,5.943837,5.109222,4.235309,4.80697,6.675422,7.193728,6.016079,5.793636,6.1559,4.767183,3.360124,3.33361,2.998164,2.820699,HP5104
3,452500,1204500,5.51256,5.908208,4.967636,3.531709,5.947665,5.113245,4.244409,4.815354,6.684472,7.202428,6.033532,5.801564,6.161029,4.769278,3.364254,3.340188,3.003749,2.826752,HP5204
4,453500,1204500,5.514749,5.914227,4.981335,3.540267,5.957929,5.11791,4.255414,4.8291,6.699368,7.217539,6.061306,5.838717,6.206523,4.814907,3.379198,3.355659,3.021582,2.842594,HP5304


I should check for NaN values again as this dataset contains some locations that fall outside of the OS Grid Reference system.

In [14]:
# Total number of rows
n_rows = pm25_long.shape[0]

# Total number of rows containing NaN values
na_rows = n_rows - pm25_long.dropna().shape[0]

print(f"""
The total number of rows in the PM25 dataset is: {n_rows}
The number of rows with NaNs in the PM25 dataset is: {na_rows}
This is equivalent to {na_rows / n_rows:.1%}
""")


The total number of rows in the PM25 dataset is: 243912
The number of rows with NaNs in the PM25 dataset is: 8279
This is equivalent to 3.4%



Again, I will remove them. 

In [21]:
pm25_long = pm25_long.dropna()
pm25_long.reset_index(drop=True, inplace=True)
pm25_long.head()

Unnamed: 0,Eastings,Northings,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,Grid_ref
0,449500,1204500,5.508577,5.904274,4.958202,3.520028,5.928725,5.092914,4.222562,4.795578,6.662558,7.1809,6.003246,5.761731,6.117139,4.733212,3.349245,3.325448,2.986548,2.809464,HP4904
1,450500,1204500,5.509383,5.905135,4.961749,3.523079,5.932654,5.095246,4.224603,4.796798,6.664222,7.182184,6.002618,5.761755,6.117988,4.733607,3.348644,3.325449,2.986847,2.8093,HP5004
2,451500,1204500,5.511036,5.906586,4.964243,3.528449,5.943837,5.109222,4.235309,4.80697,6.675422,7.193728,6.016079,5.793636,6.1559,4.767183,3.360124,3.33361,2.998164,2.820699,HP5104
3,452500,1204500,5.51256,5.908208,4.967636,3.531709,5.947665,5.113245,4.244409,4.815354,6.684472,7.202428,6.033532,5.801564,6.161029,4.769278,3.364254,3.340188,3.003749,2.826752,HP5204
4,453500,1204500,5.514749,5.914227,4.981335,3.540267,5.957929,5.11791,4.255414,4.8291,6.699368,7.217539,6.061306,5.838717,6.206523,4.814907,3.379198,3.355659,3.021582,2.842594,HP5304


In [24]:
# Export transformed pollution data as .CSV file for later use
pm25_long.to_csv('pm25_long_grid.csv')
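
With both tables carrying a Grid_ref column, the lookup described at the start of this notebook could be sketched as a join on that column. The example below is only an illustration under stated assumptions: `postcodes_toy`, `pm25_toy`, the postcode 'ZZ999ZZ', the square 'XX0000' and the pollution values are all invented, and a left join is one possible choice of join type.

```python
import pandas as pd

# Hypothetical miniatures of the two transformed tables (values are invented)
postcodes_toy = pd.DataFrame({'Postcode': ['AB101AB', 'AB101AF', 'ZZ999ZZ'],
                              'Grid_ref': ['NJ9406', 'NJ9406', 'XX0000']})
pm25_toy = pd.DataFrame({'Grid_ref': ['NJ9406', 'HP4904'],
                         '2019': [6.1, 2.8]})

# Left join keeps every postcode; many postcodes may share one pollution cell,
# and validate= raises an error if a Grid_ref appears twice in the pollution table
lookup = postcodes_toy.merge(pm25_toy, on='Grid_ref', how='left',
                             validate='many_to_one', indicator=True)

# The indicator column flags postcodes that found no pollution cell
unmatched = lookup[lookup['_merge'] == 'left_only']
print(lookup)
print(f"{len(unmatched)} postcode(s) without a matching cell")
```

Inspecting the unmatched rows before dropping them is a cheap way to spot postcodes whose coordinates fall outside the modelled grid.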