# 1. Data wrangling

## 103. Postcode / pollution lookup

Having collated the raw postcode and pollution source files in Notebooks 101 and 102, I can now add postcode data to the pollution maps via a lookup on their shared 'x' and 'y' variables. 

In this notebook I will add postcode data to each 1km grid square of the PM2.5 pollution dataset.
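To make the lookup concrete, here is a minimal sketch on toy data. The frames `grid` and `codes`, their coordinates and the postcodes are all invented for illustration and are not taken from the real files. It shows two things worth keeping in mind with a left join on 'x' and 'y': a grid square shared by several postcodes is repeated once per postcode, and a grid square with no postcode is kept with NaN in the 'Postcode' column.

```python
import pandas as pd

# Hypothetical toy data: two grid squares and three made-up postcodes
grid = pd.DataFrame({'x': [500, 1500], 'y': [500, 500], '2019': [9.1, 10.4]})
codes = pd.DataFrame({'Postcode': ['AA1 1AA', 'AA1 1AB', 'BB2 2BB'],
                      'x': [500, 500, 9500],
                      'y': [500, 500, 9500]})

# A left join keeps every grid square; squares without a match get NaN in 'Postcode'
toy_lookup = pd.merge(grid, codes, how='left', on=['x', 'y'])
print(toy_lookup)
```

In this toy case the first grid square appears twice (one row per matching postcode) and the second appears once with a missing postcode.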


In [1]:
# Import packages
import pandas as pd
import numpy as np
import os
import glob

In [4]:
# Load postcode data into a working variable
postcodes = pd.read_csv('/content/postcodes.csv', 
                        dtype={'Postcode': 'str',
                               'x': 'Int64',
                               'y': 'Int64'}
                        )

In [9]:
# Load PM25 data into a working variable
year_dtypes = {str(year): np.float64 for year in range(2002, 2020)}  # columns '2002' to '2019'
pm25_long = pd.read_csv('/content/pm25_long.csv',
                        dtype={'ukgridcode': 'Int64',
                               'x': 'Int64',
                               'y': 'Int64',
                               **year_dtypes}
                        )

I know from Notebook 102 that the PM25 dataset contains NaN values. 

I should check the number of rows containing NaNs. 

In [27]:
# Count all rows, and the rows containing at least one NaN
n_rows = pm25_long.shape[0]
na_rows = n_rows - pm25_long.dropna().shape[0]

print(f"""
The total number of rows in the PM25 dataset is: {n_rows}
The number of rows with NaNs in the PM25 dataset is: {na_rows}
This is equivalent to {na_rows / n_rows:.1%}
""")


The total number of rows in the PM25 dataset is: 281803
The number of rows with NaNs in the PM25 dataset is: 37891
This is equivalent to 13.4%
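A per-column breakdown can complement the row count by showing which years are affected. The sketch below uses a small invented frame, `toy_pm25`, as a stand-in for `pm25_long`; its values are not from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pm25_long with a few missing values
toy_pm25 = pd.DataFrame({'x': [500, 1500, 2500],
                         'y': [500, 500, 500],
                         '2018': [9.0, np.nan, 8.2],
                         '2019': [np.nan, np.nan, 8.0]})

# Rows with at least one NaN, counted directly
rows_with_nan = toy_pm25.isna().any(axis=1).sum()

# NaN count per column, to see which years are affected
nan_per_column = toy_pm25.isna().sum()

print(rows_with_nan)
print(nan_per_column)
```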



I will proceed with the subset of rows that contain no NaNs for now. If more training data is needed later, I may return to include rows with only a small number of NaN values and fill them using imputation.
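As a rough sketch of what that later imputation step might look like, the example below keeps rows with at most one missing year and fills the gap by linear interpolation across the year columns. The frame `toy_pm25`, the threshold `max_nans` and the choice of linear interpolation are all assumptions made for illustration; other imputation methods may suit the real data better.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pm25_long: one row per grid square, one column per year
toy_pm25 = pd.DataFrame({'x': [500, 1500, 2500],
                         'y': [500, 500, 500],
                         '2017': [9.0, 10.0, np.nan],
                         '2018': [np.nan, 10.5, np.nan],
                         '2019': [8.0, 11.0, np.nan]})
year_cols = ['2017', '2018', '2019']
max_nans = 1  # assumed threshold for "a small number" of NaNs

# Keep rows with no more than max_nans missing yearly values
few_nans = toy_pm25[toy_pm25[year_cols].isna().sum(axis=1) <= max_nans].copy()

# Fill the gaps by linear interpolation across the years within each row
few_nans[year_cols] = few_nans[year_cols].interpolate(axis=1, limit_direction='both')
print(few_nans)
```

Here the third row is dropped because every year is missing, and the gap in the first row is filled with the midpoint of its neighbouring years.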



In [25]:
# Keep only the rows of the PM25 dataset that contain no NaN values
nn_pm25_long = pm25_long.dropna(how='any', axis=0)

In [28]:
# Left-join postcodes onto the PM25 grid squares using the shared 'x' and 'y' columns
nn_pm25_long = pd.merge(nn_pm25_long, postcodes, how='left', on=['x', 'y'])

In [32]:
print(f"""
The total number of rows in the new dataset is: {nn_pm25_long.shape[0]}
The number of rows with NaNs in the new dataset is: {nn_pm25_long.shape[0] - nn_pm25_long.dropna().shape[0]}
""")


The total number of rows in the new dataset is: 243912
The number of rows with NaNs in the new dataset is: 243909
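Since the PM25 subset had no NaNs before the merge, the output above suggests that almost no grid squares found a matching postcode. One way to investigate is sketched below, using invented toy frames `grid` and `codes` in place of `nn_pm25_long` and `postcodes`. The `indicator=True` option of `pd.merge` records whether each row matched, and comparing the coordinate summaries of the two frames may show whether the 'x' and 'y' values are on the same scale or grid. The cause in the real data is not known here, so this is a diagnostic idea rather than a fix.

```python
import pandas as pd

# Hypothetical toy data standing in for nn_pm25_long and postcodes
grid = pd.DataFrame({'x': [500, 1500, 2500], 'y': [500, 500, 500]})
codes = pd.DataFrame({'Postcode': ['AA1 1AA', 'BB2 2BB'],
                      'x': [500, 1732], 'y': [500, 641]})

# indicator=True adds a '_merge' column recording where each row came from
checked = pd.merge(grid, codes, how='left', on=['x', 'y'], indicator=True)
match_counts = checked['_merge'].value_counts()
print(match_counts)

# Comparing coordinate ranges can show whether the two keys are on the same scale
print(grid[['x', 'y']].describe())
print(codes[['x', 'y']].describe())
```

A large 'left_only' count, as in this toy case, points to grid squares whose exact 'x' and 'y' pair does not appear in the postcode table.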

