This notebook looks at employment data for the area where SRCDC operates (the Grangetown, Canton, and Riverside areas of Cardiff). The data comes from the 2011 UK census.

In [3]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import numpy as np

The data is arranged by output area (OA). OAs are small areas of about 150 people each, and they are combined into lower super output areas (LSOAs) of about 1500 people. The LSOA codes for the areas we are interested in are in the lists below (taken from https://gov.wales/docs/statistics/lsoamaps/lsoa.htm). Note that South Grangetown and North Riverside are much more affluent than the South Riverside, North Grangetown, and Canton areas where SRCDC operates, so we exclude the OAs in those LSOAs from the analysis.

In [4]:
# LSOA codes for the areas where SRCDC operates.
grangetown_codes = ['W01001759', 'W01001760', 'W01001761', 'W01001762', 'W01001764', 'W01001765',
                    'W01001766', 'W01001767', 'W01001768', 'W01001946']
canton_codes = ['W01001709', 'W01001710', 'W01001711', 'W01001712', 'W01001713', 'W01001714',
                'W01001715', 'W01001716', 'W01001717']
riverside_codes = ['W01001855', 'W01001856', 'W01001857', 'W01001862']
lsoa_codes = np.concatenate([grangetown_codes, canton_codes, riverside_codes])

# More affluent LSOAs, excluded from the analysis.
south_grangetown = ['W01001945', 'W01001947']
north_riverside = ['W01001858', 'W01001859', 'W01001860', 'W01001861']

To get the codes of the OAs in each LSOA, I scrape the webpage for that LSOA, which has the form http://statistics.data.gov.uk/doc/statistical-geography/[[LSOA_CODE]]. The loop visits every candidate code from W01000001 to W01001999. The OA codes are saved both as a list, for easy looping, and as a dictionary keyed by LSOA code, so that we can later work backwards and get the LSOA from an OA code.

In [82]:
oa_codes_dict = {}
oa_codes_list = []
# Quote the attribute value so that the selector is valid CSS.
link_form = 'a[href*="http://statistics.data.gov.uk/id/statistical-geography/W00"]'
for idx in range(1, 2000):
    if idx % 100 == 0:
        print(idx)  # progress indicator
    # Build the LSOA code: 'W0100' followed by the zero-padded 4-digit index.
    index = 'W0100' + str(idx).zfill(4)
    oa_page = requests.get("http://statistics.data.gov.uk/doc/statistical-geography/" + index)
    soup = BeautifulSoup(oa_page.content, 'html.parser')
    # Select the links once; each link text is an OA code (empty list if none are found).
    oa_codes = [link.getText() for link in soup.select(link_form)]
    oa_codes_dict[index] = oa_codes
    oa_codes_list.extend(oa_codes)
    

100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
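
The scraping loop builds each candidate LSOA code by zero-padding the index to four digits and prefixing it with 'W0100'. As a minimal sketch of that one step in isolation, the same codes can be produced with `str.zfill`. The helper name `make_lsoa_code` is hypothetical and not part of the notebook.

```python
def make_lsoa_code(idx):
    """Return the LSOA code for an integer index, e.g. 7 -> 'W01000007'."""
    # 'W0100' prefix plus the index padded with leading zeros to 4 digits.
    return 'W0100' + str(idx).zfill(4)


# A few codes, including one that appears in grangetown_codes (W01001759).
for idx in (1, 42, 1759):
    print(idx, make_lsoa_code(idx))
```

Keeping the padding in one small function would mean the two scraping loops in this notebook could share it instead of repeating the string manipulation.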


In [85]:
np.save('lsoa_to_oa_dict.npy',oa_codes_dict)

To load this dict run:

In [89]:
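# The dict is stored as a pickled object array, so recent NumPy versions need allow_pickle=True.
lsoa_dict = np.load('lsoa_to_oa_dict.npy', allow_pickle=True).item()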
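
`np.save` stores a dictionary as a pickled object array, which is why it has to be unwrapped with `.item()` on loading. As an alternative sketch (not what this notebook does), a plain dictionary of lists can also be written as JSON with the standard library. The file name and the OA codes below are made up for illustration.

```python
import json
import os
import tempfile

# Made-up mapping in the same shape as oa_codes_dict (LSOA code -> list of OA codes).
example_dict = {
    'W01001759': ['W00000001', 'W00000002'],
    'W01001760': ['W00000003'],
}

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'lsoa_to_oa_dict.json')
with open(path, 'w') as f:
    json.dump(example_dict, f)

# Reading the file back gives an ordinary dict, with no unpickling step.
with open(path) as f:
    loaded = json.load(f)

print(loaded == example_dict)
```

JSON only works here because the keys are strings and the values are plain lists of strings; NumPy arrays would need converting to lists first.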

Invert this dictionary to find the LSOA that any OA is in.

In [94]:
inv_dict = {oa: lsoa for lsoa, oa_list in oa_codes_dict.items() for oa in oa_list}

In [101]:
np.save('oa_to_lsoa_dict.npy',inv_dict)
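
To make the inversion concrete, here is a small worked example on toy data. The OA codes are invented placeholders, and the names `toy_dict`, `toy_inv`, `wanted`, and `wanted_oas` are hypothetical. Only the dictionary comprehension mirrors the notebook.

```python
# Made-up mapping in the same shape as oa_codes_dict (LSOA code -> list of OA codes).
toy_dict = {
    'W01001759': ['W00000001', 'W00000002'],
    'W01001760': ['W00000003'],
}

# Same comprehension as in the notebook: one entry per OA, pointing at its LSOA.
toy_inv = {oa: lsoa for lsoa, oa_list in toy_dict.items() for oa in oa_list}
print(toy_inv['W00000003'])

# The forward mapping can also collect every OA for a chosen set of LSOAs.
wanted = ['W01001759']
wanted_oas = [oa for lsoa in wanted for oa in toy_dict.get(lsoa, [])]
print(wanted_oas)
```

The inversion assumes each OA belongs to exactly one LSOA. If an OA code appeared under two LSOAs, the later entry would silently overwrite the earlier one.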

Next, map LSOA codes to LSOA names.

In [119]:
lsoa_to_name_dict = {}
for idx in range(1, 2000):
    if idx % 100 == 0:
        print(idx)  # progress indicator
    # Build the LSOA code: 'W0100' followed by the zero-padded 4-digit index.
    index = 'W0100' + str(idx).zfill(4)
    oa_page = requests.get("http://statistics.data.gov.uk/doc/statistical-geography/" + index)
    soup = BeautifulSoup(oa_page.content, 'html.parser')
    try:
        # The LSOA name is the second "|"-separated field of the page title.
        lsoa_to_name_dict[index] = soup.select("title")[0].getText().split("|")[1].strip()
    except IndexError:
        # No title, or no "|" in it: skip this code.
        continue

100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900


In [122]:
np.save('lsoa_to_name_dict.npy',lsoa_to_name_dict)
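
The name lookup relies on the page title containing at least two "|"-separated fields, with the LSOA name in the second one. Here is a minimal sketch of that parsing step on its own, using a hypothetical helper `name_from_title` and invented title strings. The real page titles may be laid out differently.

```python
def name_from_title(title):
    """Return the second '|'-separated field of a title, or None if there is none."""
    parts = title.split("|")
    if len(parts) < 2:
        # Mirrors the IndexError branch in the loop: nothing to extract.
        return None
    return parts[1].strip()


# Invented titles, for illustration only.
print(name_from_title("W01001759 | Cardiff 001A | Example site"))
print(name_from_title("Page not found"))
```

Returning `None` rather than raising makes the "no name found" case explicit, so a caller can decide whether to skip the code or record it for a later retry.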