<a href="https://colab.research.google.com/github/jamesrichardbunting/neurodegeneration_pollution/blob/main/101_postcode_data_wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Data wrangling

## 1.1 Combine the postcode datasets

Postcode data is provided as open data by the Office for National Statistics (ONS). 

All 1.7 million UK postcodes are provided, along with their administrative area, the NHS ward in which they lie, and their 'easting' and 'northing' coordinates. Eastings and northings correspond to the x and y coordinates, respectively, on a grid reference map of the UK.
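
Because eastings and northings sit on the same grid, the straight-line distance between two postcodes follows from Pythagoras. The sketch below is a minimal illustration, assuming the coordinates are in metres (as on the Ordnance Survey National Grid); `grid_distance` is a hypothetical helper and not part of the original workflow.

```python
import math

def grid_distance(easting_a, northing_a, easting_b, northing_b):
    """Straight-line distance between two grid points, in grid units (assumed metres)."""
    return math.hypot(easting_b - easting_a, northing_b - northing_a)

# Two coordinate pairs of the kind found in the postcode files
distance = grid_distance(394235, 806529, 394230, 806469)
print(f"{distance:.1f} m")
```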

These data are provided as 120 individual .CSV files. 

In this notebook, I will collate these files into a single, more easily searchable file.  

In [None]:
# Import packages
import pandas as pd
import numpy as np
import os
import glob

I will define a simple function to identify all .CSV files in the working directory and save the filenames to a new variable. 

In [None]:
# Define a function to identify all .CSV files in the working directory
def csv_finder():
  extension = '.csv' # File extension to search for
  csv_files = glob.glob(f"*{extension}") # List the files in the working directory whose names end in .csv
  return csv_files # Return the list of CSV filenames

In [None]:
# Call the function and save output to a new variable
csv_files = csv_finder()

In [None]:
# Check the output
print(csv_files)

['ox.csv', 'm.csv', 'wr.csv', 'bh.csv', 'ss.csv', 'tr.csv', 'sg.csv', 'wv.csv', 'eh.csv', 'e.csv', 'hx.csv', 'tn.csv', 'bs.csv', 'td.csv', 'iv.csv', 'ca.csv', 'co.csv', 'po.csv', 'ml.csv', 'me.csv', 'fk.csv', 'dh.csv', 'ab.csv', 'sk.csv', 'dl.csv', 'wf.csv', 'sa.csv', 'br.csv', 'nr.csv', 'sw.csv', 'ct.csv', 'dg.csv', 'la.csv', 'bl.csv', 'ha.csv', 'en.csv', 'cm.csv', 'sn.csv', 'lu.csv', 'ls.csv', 'b.csv', 'cr.csv', 'ex.csv', 'tf.csv', 'ub.csv', 'sy.csv', 'le.csv', 'hd.csv', 'hg.csv', 'so.csv', 'ld.csv', 'ng.csv', 'ne.csv', 'rg.csv', 'ky.csv', 'wa.csv', 'bn.csv', 'fy.csv', 'ba.csv', 'pl.csv', 'tw.csv', 'np.csv', 'kt.csv', 'sm.csv', 'll.csv', 'w.csv', 'ws.csv', 'ip.csv', 'ts.csv', 'g.csv', 'pa.csv', 'ol.csv', 'ph.csv', 'de.csv', 'wc.csv', 'st.csv', 'rm.csv', 'tq.csv', 'hp.csv', 'sp.csv', 'ig.csv', 'ec.csv', 'cv.csv', 'ka.csv', 'bb.csv', 'dd.csv', 'ch.csv', 'cf.csv', 'gu.csv', 'hu.csv', 'ln.csv', 'cb.csv', 'kw.csv', 'wn.csv', 'nw.csv', 'ze.csv', 'al.csv', 'cw.csv', 'da.csv', 'n.csv', 'pe.c

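The filenames are not in alphabetical order, because `glob` does not guarantee any particular ordering. Sorting them means the combined file will list the postcode areas alphabetically, which makes it easier to search later.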

In [None]:
# Sort the filenames and check the output
csv_files = sorted(csv_files)
print(csv_files)

['ab.csv', 'al.csv', 'b.csv', 'ba.csv', 'bb.csv', 'bd.csv', 'bh.csv', 'bl.csv', 'bn.csv', 'br.csv', 'bs.csv', 'ca.csv', 'cb.csv', 'cf.csv', 'ch.csv', 'cm.csv', 'co.csv', 'cr.csv', 'ct.csv', 'cv.csv', 'cw.csv', 'da.csv', 'dd.csv', 'de.csv', 'dg.csv', 'dh.csv', 'dl.csv', 'dn.csv', 'dt.csv', 'dy.csv', 'e.csv', 'ec.csv', 'eh.csv', 'en.csv', 'ex.csv', 'fk.csv', 'fy.csv', 'g.csv', 'gl.csv', 'gu.csv', 'ha.csv', 'hd.csv', 'hg.csv', 'hp.csv', 'hr.csv', 'hs.csv', 'hu.csv', 'hx.csv', 'ig.csv', 'ip.csv', 'iv.csv', 'ka.csv', 'kt.csv', 'kw.csv', 'ky.csv', 'l.csv', 'la.csv', 'ld.csv', 'le.csv', 'll.csv', 'ln.csv', 'ls.csv', 'lu.csv', 'm.csv', 'me.csv', 'mk.csv', 'ml.csv', 'n.csv', 'ne.csv', 'ng.csv', 'nn.csv', 'np.csv', 'nr.csv', 'nw.csv', 'ol.csv', 'ox.csv', 'pa.csv', 'pe.csv', 'ph.csv', 'pl.csv', 'po.csv', 'pr.csv', 'rg.csv', 'rh.csv', 'rm.csv', 's.csv', 'sa.csv', 'se.csv', 'sg.csv', 'sk.csv', 'sl.csv', 'sm.csv', 'sn.csv', 'so.csv', 'sp.csv', 'sr.csv', 'ss.csv', 'st.csv', 'sw.csv', 'sy.csv', 'ta.cs
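
One caveat with matching on `*.csv`: once the combined file has been exported to the same directory, rerunning the search would pick it up as well. The sketch below shows one way to guard against that while also sorting the result. `find_input_csvs` and its `exclude` parameter are hypothetical names, and the use of `pathlib` instead of `glob` is my assumption, not the approach taken in this notebook.

```python
import tempfile
from pathlib import Path

def find_input_csvs(directory=".", exclude=("postcodes.csv",)):
    """Return sorted paths of .csv files in `directory`, skipping any filenames in `exclude`."""
    return sorted(
        str(path)
        for path in Path(directory).glob("*.csv")
        if path.name not in exclude
    )

# Demonstrate on a temporary directory holding a few empty stand-in files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("ox.csv", "ab.csv", "postcodes.csv", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [Path(p).name for p in find_input_csvs(tmp)]

print(found)
```

With the default `directory="."` the function returns bare filenames, as `csv_finder` does.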

I now need to collate these files into a single, searchable .CSV file. 

I will define a simple function to iterate through the list of filenames, returning a collated file. 

These postcode files contain some columns that are not needed for the rest of my analysis, so I will keep only the columns I need (postcode, easting and northing) during the collation.

In [None]:
# Define a function that accepts a list of filenames, reads columns 0, 2 and 3 from each, and returns the combined DataFrame
def postcode_combiner(filenames):
  comb_postcodes = pd.concat([pd.read_csv(i, header=None, usecols=[0,2,3]) for i in filenames], ignore_index=True)
  return comb_postcodes
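
To see what `header=None` and `usecols=[0,2,3]` do in the function above, here is a small illustration on in-memory data. The unused columns in these two rows are made up for the example and do not reflect the real layout of the ONS files.

```python
import io
import pandas as pd

# Hypothetical rows: postcode, unused code, easting, northing, unused code
raw = io.StringIO(
    "AB101AB,10,394235,806529,S92000003\n"
    "AB101AG,10,394230,806469,S92000003\n"
)

# header=None treats the first row as data rather than column names;
# usecols keeps only the columns at positions 0, 2 and 3
sample = pd.read_csv(raw, header=None, usecols=[0, 2, 3])
print(sample)
```

The retained columns keep their original positions as labels (0, 2 and 3), which is why proper column headers still have to be assigned afterwards.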

In [None]:
# Call the function and save output to a new variable
postcodes = postcode_combiner(csv_files)

In [None]:
# Check the output
postcodes.head()

Unnamed: 0,0,2,3
0,AB101AB,394235,806529
1,AB101AF,394235,806529
2,AB101AG,394230,806469
3,AB101AH,394235,806529
4,AB101AL,394296,806581


In [None]:
# Check the output
postcodes.shape

(1719485, 3)
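
The shape confirms the row and column counts, but a few further checks could catch problems such as missing coordinates or a postcode that appears in more than one file. A sketch, in which `summarise_postcodes` is a hypothetical helper and the small frame stands in for the real `postcodes` DataFrame:

```python
import pandas as pd

def summarise_postcodes(df):
    """Count rows, missing values and repeated postcodes (first column holds the postcode)."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_postcodes": int(df.iloc[:, 0].duplicated().sum()),
    }

# Stand-in data with one repeated postcode and one missing northing
sample = pd.DataFrame({
    0: ["AB101AB", "AB101AF", "AB101AB"],
    2: [394235, 394235, 394235],
    3: [806529, None, 806529],
})
summary = summarise_postcodes(sample)
print(summary)
```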

This has worked as expected. The last step is to add column headers and export the data as a .CSV file that I can use in further notebooks.

In [None]:
# Add column headers
postcodes.columns = ["Postcode", "x", "y"]

In [None]:
# Export as a CSV file
postcodes.to_csv('postcodes.csv', index=False)
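
In a later notebook, the exported file can be read back and indexed by postcode so that individual postcodes can be looked up directly. A sketch, assuming the column names set above; it writes a small stand-in file to a temporary directory so that the real `postcodes.csv` is left untouched.

```python
import os
import tempfile
import pandas as pd

# Stand-in for the exported file, using the same column names
sample = pd.DataFrame({
    "Postcode": ["AB101AB", "AB101AG"],
    "x": [394235, 394230],
    "y": [806529, 806469],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "postcodes_sample.csv")
    sample.to_csv(path, index=False)
    # Index by postcode so a single postcode can be looked up by label
    lookup = pd.read_csv(path, index_col="Postcode")

easting, northing = lookup.loc["AB101AG", ["x", "y"]]
print(easting, northing)
```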