<a href="https://colab.research.google.com/github/jamesrichardbunting/neurodegeneration_pollution/blob/main/102_pollution_data_wrangling_pm25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Data wrangling

## 102 Combine the pollution datasets

Pollution data is provided as open data by the Department for Environment, Food and Rural Affairs (Defra). 

The data I am working with are modelled (i.e., predicted) background pollution maps, provided at 1km x 1km resolution across the UK. In this phase of the analysis I am concerned only with the PM2.5 pollutant.

Modelled values of PM2.5 go back to 2002 and each year's predictions (up to 2019) are provided in an individual .CSV file. 

In this notebook I will collate the yearly predictions, producing a time-series dataset, suitable for longitudinal analysis. 


In [1]:
# Import packages
import pandas as pd
import numpy as np
import os
import glob

I will define a simple function to identify all pollution files in the working directory, saving the filenames to a new variable. 

In [2]:
# Define a function to identify all pollution files in the working directory
def pollution_finder():
  pattern = 'map*.csv' # Glob pattern to search for (note that all pollution files use 'map' as a prefix)
  pollution_files = glob.glob(pattern) # Search the working directory for matching files, saving the file names
  return pollution_files # Return the list of matching files


In [28]:
# Call the function and save output to a new variable
pollution_files = pollution_finder()

In [22]:
# Check the output
pollution_files

['mappm252013g.csv',
 'mappm2505ac.csv',
 'mappm252008g.csv',
 'mappm252011g.csv',
 'mappm252004g.csv',
 'mappm252019g.csv',
 'mappm252016g.csv',
 'mappm252007g.csv',
 'mappm252002 (1).csv',
 'mappm252010g.csv',
 'mappm252018g.csv',
 'mappm252006gh.csv',
 'mappm252015g.csv',
 'mappm252012g.csv',
 'mappm252009g.csv',
 'mappm252017g.csv',
 'mappm252014g.csv',
 'mappm252003grav.csv']

These files are not in chronological order. They need to be, so that the yearly data are combined in the correct sequence.

Automatic sorting will not produce perfect results because of inconsistencies in the way the files have been named, but I can fix any mistakes manually. 


In [29]:
# Sort the list of filenames
pollution_files = sorted(pollution_files)

In [24]:
# Check the output
pollution_files

['mappm2505ac.csv',
 'mappm252002 (1).csv',
 'mappm252003grav.csv',
 'mappm252004g.csv',
 'mappm252006gh.csv',
 'mappm252007g.csv',
 'mappm252008g.csv',
 'mappm252009g.csv',
 'mappm252010g.csv',
 'mappm252011g.csv',
 'mappm252012g.csv',
 'mappm252013g.csv',
 'mappm252014g.csv',
 'mappm252015g.csv',
 'mappm252016g.csv',
 'mappm252017g.csv',
 'mappm252018g.csv',
 'mappm252019g.csv']

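Only the file for 2005 has been incorrectly sorted: its year is abbreviated to '05' in the filename, so it sorts ahead of every file whose year starts with '20'. I will fix this manually.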

In [25]:
# Define a simple function to move a list element from one position to another (modifies the list in place)
def list_rearranger(lst, rem_pos, ins_pos):
  lst.insert(ins_pos, lst.pop(rem_pos))

In [30]:
# Call the function on the list of pollution filenames to move the 2005 file to its correct position
list_rearranger(pollution_files, 0, 3)

In [31]:
# Check the output
print(pollution_files)

['mappm252002 (1).csv', 'mappm252003grav.csv', 'mappm252004g.csv', 'mappm2505ac.csv', 'mappm252006gh.csv', 'mappm252007g.csv', 'mappm252008g.csv', 'mappm252009g.csv', 'mappm252010g.csv', 'mappm252011g.csv', 'mappm252012g.csv', 'mappm252013g.csv', 'mappm252014g.csv', 'mappm252015g.csv', 'mappm252016g.csv', 'mappm252017g.csv', 'mappm252018g.csv', 'mappm252019g.csv']
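
As an alternative to the manual fix above, the files could be sorted on a year parsed from each filename. The sketch below is only an illustration: `year_from_filename` is a hypothetical helper, and it assumes every filename starts with `mappm25` followed by either a four-digit year or a two-digit year from the 2000s.

```python
import re

def year_from_filename(name):
    """Return the year encoded in a 'mappm25...' filename as an integer."""
    # Read the run of digits that follows the 'mappm25' prefix
    digits = re.match(r"mappm25(\d+)", name).group(1)
    # Assumption: a two-digit year such as '05' belongs to the 2000s
    return int(digits) if len(digits) == 4 else 2000 + int(digits)

# A few of the filenames listed above, deliberately out of order
example_files = ['mappm252013g.csv', 'mappm2505ac.csv', 'mappm252002 (1).csv']

# Sort on the parsed year rather than on the raw string
sorted_files = sorted(example_files, key=year_from_filename)
print(sorted_files)
```

Sorting on a parsed key avoids hard-coded list positions, which would silently go wrong if a file were added or removed.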


Great. Now the filenames are sorted I can collate them in the correct order.

Let's first view the structure of these files to understand how we should perform the collation. 


In [35]:
# Load file and print the first 10 rows
pm25_2002 = pd.read_csv('/content/mappm252002 (1).csv')
pm25_2002.head(10)



Unnamed: 0,pm2.5,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,2002,,,
1,annual mean,,,
2,ug m-3,,,
3,,,,
4,ukgridcode,x,y,pm252002
5,54291,460500,1221500,MISSING
6,54292,461500,1221500,MISSING
7,54294,463500,1221500,MISSING
8,54979,458500,1220500,MISSING
9,54980,459500,1220500,MISSING


The first 4 rows are given over to metadata and can be subsetted out during the collation process. 

The 5th row contains the column headers. I will leave this in for now so I can confirm the collation has taken place in the correct order.

The first 3 columns contain the 'gridcode', easting and northing values for each 1km square. This information is consistent across all yearly files so can be subsetted out of the collation process and added to the combined file afterwards.
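
The claim that the grid information is identical across years can be checked before the identifier columns are dropped. The following is a minimal sketch, not part of the original workflow: `read_grid`, `same_grid` and `fake_file` are hypothetical helpers, and the two in-memory files are made-up stand-ins that mimic the layout shown above (five lines above the real header row).

```python
import io
import pandas as pd

ID_COLS = ["ukgridcode", "x", "y"]

def read_grid(source):
    # Skip the five lines above the real header row and keep only the identifier columns
    return pd.read_csv(source, skiprows=5, usecols=ID_COLS)

def same_grid(sources):
    """Return True if every source lists the same grid squares in the same order."""
    grids = [read_grid(s) for s in sources]
    return all(grids[0].equals(g) for g in grids[1:])

def fake_file(year, codes):
    # Build a tiny in-memory CSV with the same layout as the Defra files (illustrative values only)
    lines = ["pm2.5,,,", f"{year},,,", "annual mean,,,", "ug m-3,,,", ",,,",
             f"ukgridcode,x,y,pm25{year}"]
    lines += [f"{code},460500,1221500,MISSING" for code in codes]
    return io.StringIO("\n".join(lines))

grids_match = same_grid([fake_file(2002, [54291, 54292]), fake_file(2003, [54291, 54292])])
grids_differ = same_grid([fake_file(2002, [54291, 54292]), fake_file(2003, [54291, 54294])])
print(grids_match, grids_differ)
```

With the real data, the list of filenames would be passed to `same_grid` in place of the in-memory files.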

In [45]:
# Define a function that accepts a list of filenames, collates them on column 4 only and returns the combined file
def pollution_combiner(filenames):
  # Read only the 4th column (index 3) of each file, skipping the metadata rows, and place the columns side by side
  comb_pollution = pd.concat([pd.read_csv(file, skiprows=4, usecols=[3]) for file in filenames], ignore_index=True, axis=1)
  return comb_pollution

In [46]:
# Call the function and save output to a new variable
pm25_long = pollution_combiner(pollution_files)

In [48]:
# Check that this was successful
pm25_long.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,pm252002,pm252003grav,pm252004g,pm2505ac,pm252006gh,pm252007g,pm252008g,pm252009g,pm252010g,pm252011g,pm252012g,pm252013g,pm252014g,pm252015g,pm252016g,pm252017g,pm252018g,pm252019g
1,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING
2,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING
3,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING
4,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING


This has worked as expected. 

I can now remove the first row and add appropriate column headers.  

In [49]:
# Define headers
yearly_headers = ['2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019']


In [50]:
# Subset out the first row
pm25_long = pm25_long.iloc[1: , :]

# Add headers
pm25_long.columns = yearly_headers

# Check output is correct
pm25_long.head()

Unnamed: 0,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
1,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING
2,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING
3,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING
4,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING
5,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING
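
Because missing predictions are stored as the string 'MISSING', the yearly columns are read as text rather than numbers. One possible way to handle this before analysis is sketched below, using a small made-up frame in place of the real data; note that `errors="coerce"` turns any non-numeric entry into `NaN`, not only 'MISSING'.

```python
import pandas as pd

# Illustrative stand-in for the combined dataset (values are made up)
raw = pd.DataFrame({
    "2002": ["MISSING", "7.1", "6.4"],
    "2003": ["MISSING", "MISSING", "6.9"],
})

# Convert every column to numeric, replacing unparseable entries with NaN
numeric = raw.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)
print(numeric.isna().sum())
```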


In [None]:
# Convert the first row variable to a list 
yearly_headers = yearly_headers.values.tolist()
yearly_headers

[['pm252019g',
  'pm252003grav',
  'pm2505ac',
  'pm252008g',
  'pm252012g',
  'pm252014g',
  'pm252011g',
  'pm252007g',
  'pm252017g',
  'pm252015g',
  'pm252018g',
  'pm252010g',
  'pm252002',
  'pm252016g',
  'pm252006gh',
  'pm252009g',
  'pm252004g',
  'pm252013g']]

In [None]:
# Because this has been converted from a DataFrame, it is a 'list within a list', so select only the inner list
yearly_headers = yearly_headers[0]
yearly_headers

['pm252019g',
 'pm252003grav',
 'pm2505ac',
 'pm252008g',
 'pm252012g',
 'pm252014g',
 'pm252011g',
 'pm252007g',
 'pm252017g',
 'pm252015g',
 'pm252018g',
 'pm252010g',
 'pm252002',
 'pm252016g',
 'pm252006gh',
 'pm252009g',
 'pm252004g',
 'pm252013g']

In [None]:
# Define a simple function to extract the year (characters 5 to 8) from these values
def date_extractor(lst):
    for i in range(len(lst)):
        lst[i] = lst[i][4:8]
    return lst

In [None]:
# Apply the function and check the results
yearly_headers = date_extractor(yearly_headers)
yearly_headers

['2019',
 '2003',
 '05ac',
 '2008',
 '2012',
 '2014',
 '2011',
 '2007',
 '2017',
 '2015',
 '2018',
 '2010',
 '2002',
 '2016',
 '2006',
 '2009',
 '2004',
 '2013']

This has worked but, because of an inconsistency in the way the year was recorded for 2005 (it was abbreviated to '05'), I will have to fix this date manually.

In [None]:
# Fix the date format for 2005 and check the result
yearly_headers[2] = '2005'
yearly_headers

['2019',
 '2003',
 '2005',
 '2008',
 '2012',
 '2014',
 '2011',
 '2007',
 '2017',
 '2015',
 '2018',
 '2010',
 '2002',
 '2016',
 '2006',
 '2009',
 '2004',
 '2013']

In [None]:
# Apply these years as header columns to the PM2.5 dataset
comb_pollution.columns = yearly_headers
comb_pollution.head()

Unnamed: 0,2019,2003,2005,2008,2012,2014,2011,2007,2017,2015,2018,2010,2002,2016,2006,2009,2004,2013
1,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING
2,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING
3,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING
4,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING
5,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING


In [None]:
# Reorder the columns so the years run chronologically
comb_pollution = comb_pollution.reindex(sorted(comb_pollution.columns), axis=1)
comb_pollution.head()

Unnamed: 0,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
1,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING
2,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING
3,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING
4,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING
5,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING,MISSING
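
The identifier columns set aside earlier (gridcode, easting and northing) still need to be attached to the combined file. A minimal sketch of one way to do this is shown below; `grid` and `yearly` are made-up stand-ins, and the index on `yearly` starts at 1 to mirror the effect of dropping the first row above.

```python
import pandas as pd

# Illustrative stand-ins for the identifier columns and the yearly values (values are made up)
grid = pd.DataFrame({
    "ukgridcode": [54291, 54292],
    "x": [460500, 461500],
    "y": [1221500, 1221500],
})
yearly = pd.DataFrame(
    {"2002": ["MISSING", "MISSING"], "2003": ["MISSING", "MISSING"]},
    index=[1, 2],
)

# Reset the index first: concat aligns rows on index labels, not on position
combined = pd.concat([grid, yearly.reset_index(drop=True)], axis=1)
print(combined)
```

Without the `reset_index` call, the two frames would be aligned on mismatched labels and the result would contain extra, partly empty rows.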
