# Manipulate London Household Income

Part of [london-data](https://github.com/jamesdamillington/london-data), by [jamesdamillington](https://github.com/jamesdamillington)

## Aim
Tidy [Household Income Estimates for Small Areas data](https://data.london.gov.uk/dataset/household-income-estimates-small-areas) from `.xlsx` to a simple `.csv` that can be readily joined to [ONS Geographies data](https://github.com/jamesdamillington/london-data/tree/main/data/geographies/census).

In [1]:
from datetime import date
print(f'Last tested: {date.today()}')

Last tested: 2022-08-25


In [15]:
import pandas as pd
from pathlib import Path

Read [Household Income Estimates for Small Areas data](https://data.london.gov.uk/dataset/household-income-estimates-small-areas) (downloaded 2022-09-15) 
> Source: Greater London Authority licensed under the Open Government Licence v.2.0  

In [6]:
econ_path = Path("../data/inputs/economic/")
econ_filepath = econ_path/"gla-household-income-estimates.xlsx"

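Read the mean household income estimates (columns A to P of the _LSOA11_ sheet) from the Excel workbook and assign tidy column names. Each financial year is labelled by its starting year, so 2001/02 becomes 2001. The median estimates (columns R to AC of the same sheet) are read the same way in the next cell.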

In [12]:
mean_pd = pd.read_excel(econ_filepath, sheet_name="LSOA11", usecols="A:P", header=None, skiprows=3)
mean_pd.columns = ["LSOA11CD", "LSOA11NM", "LAD11CD", "LAD11NM",
                   "HHI_mean_2001", "HHI_mean_2002", "HHI_mean_2003", "HHI_mean_2004",
                   "HHI_mean_2005", "HHI_mean_2006", "HHI_mean_2007", "HHI_mean_2008",
                   "HHI_mean_2009", "HHI_mean_2010", "HHI_mean_2011", "HHI_mean_2012"]  # so 2001/02 becomes 2001
mean_pd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4835 entries, 0 to 4834
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   LSOA11CD       4835 non-null   object
 1   LSOA11NM       4835 non-null   object
 2   LAD11CD        4835 non-null   object
 3   LAD11NM        4835 non-null   object
 4   HHI_mean_2001  4835 non-null   int64 
 5   HHI_mean_2002  4835 non-null   int64 
 6   HHI_mean_2003  4835 non-null   int64 
 7   HHI_mean_2004  4835 non-null   int64 
 8   HHI_mean_2005  4835 non-null   int64 
 9   HHI_mean_2006  4835 non-null   int64 
 10  HHI_mean_2007  4835 non-null   int64 
 11  HHI_mean_2008  4835 non-null   int64 
 12  HHI_mean_2009  4835 non-null   int64 
 13  HHI_mean_2010  4835 non-null   int64 
 14  HHI_mean_2011  4835 non-null   int64 
 15  HHI_mean_2012  4835 non-null   int64 
dtypes: int64(12), object(4)
memory usage: 604.5+ KB
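The column names above are typed out by hand, which is easy to get wrong when the same pattern is repeated for another statistic. As a minimal sketch (not part of the original notebook), the names could instead be generated from the year range. The helper `hhi_columns` and the constant `ID_COLS` are hypothetical names introduced here for illustration.

```python
# Identifier columns used in the notebook's tables.
ID_COLS = ["LSOA11CD", "LSOA11NM", "LAD11CD", "LAD11NM"]


def hhi_columns(stat, first_year=2001, last_year=2012):
    """Return the ID columns plus one HHI_<stat>_<year> name per starting year.

    Hypothetical helper: 'year' is the first year of the financial year,
    so 2001/02 is labelled 2001, matching the notebook's convention.
    """
    years = range(first_year, last_year + 1)
    return ID_COLS + [f"HHI_{stat}_{year}" for year in years]


mean_cols = hhi_columns("mean")
median_cols = hhi_columns("median")
print(len(mean_cols), mean_cols[4], mean_cols[-1])
```

The resulting lists could then be assigned to `mean_pd.columns` and `median_pd.columns` in place of the literal lists.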


In [13]:
median_pd = pd.read_excel(econ_filepath, sheet_name="LSOA11", usecols="A:D,R:AC", header=None, skiprows=3)
median_pd.columns = ["LSOA11CD", "LSOA11NM", "LAD11CD", "LAD11NM",
                     "HHI_median_2001", "HHI_median_2002", "HHI_median_2003", "HHI_median_2004",
                     "HHI_median_2005", "HHI_median_2006", "HHI_median_2007", "HHI_median_2008",
                     "HHI_median_2009", "HHI_median_2010", "HHI_median_2011", "HHI_median_2012"]  # so 2001/02 becomes 2001
median_pd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4835 entries, 0 to 4834
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   LSOA11CD         4835 non-null   object
 1   LSOA11NM         4835 non-null   object
 2   LAD11CD          4835 non-null   object
 3   LAD11NM          4835 non-null   object
 4   HHI_median_2001  4835 non-null   int64 
 5   HHI_median_2002  4835 non-null   int64 
 6   HHI_median_2003  4835 non-null   int64 
 7   HHI_median_2004  4835 non-null   int64 
 8   HHI_median_2005  4835 non-null   int64 
 9   HHI_median_2006  4835 non-null   int64 
 10  HHI_median_2007  4835 non-null   int64 
 11  HHI_median_2008  4835 non-null   int64 
 12  HHI_median_2009  4835 non-null   int64 
 13  HHI_median_2010  4835 non-null   int64 
 14  HHI_median_2011  4835 non-null   int64 
 15  HHI_median_2012  4835 non-null   int64 
dtypes: int64(12), object(4)
memory usage: 604.5+ KB
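Both tables report 4835 rows, but `.info()` alone does not show that they cover the same areas or that each LSOA code appears only once. The sketch below is one possible sanity check before writing to file. `check_same_areas` is a hypothetical helper, and the toy frames use placeholder codes and made-up values rather than real estimates.

```python
import pandas as pd


def check_same_areas(mean_df, median_df, key="LSOA11CD"):
    """Raise ValueError unless both tables hold the same set of unique key values."""
    for name, df in (("mean", mean_df), ("median", median_df)):
        if df[key].duplicated().any():
            raise ValueError(f"duplicate {key} values in {name} table")
    if set(mean_df[key]) != set(median_df[key]):
        raise ValueError(f"{key} values differ between tables")
    return True


# Toy stand-ins: placeholder codes and invented numbers, for illustration only.
toy_mean = pd.DataFrame({"LSOA11CD": ["X001", "X002"], "HHI_mean_2001": [100, 200]})
toy_median = pd.DataFrame({"LSOA11CD": ["X002", "X001"], "HHI_median_2001": [90, 180]})
print(check_same_areas(toy_mean, toy_median))
```

In the notebook itself the call would be `check_same_areas(mean_pd, median_pd)`.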


Write to file

In [16]:
out_path = Path("../data/economic/")
out_path.mkdir(parents=True, exist_ok=True)  # create the output directory if it is missing

In [17]:
out_mean = out_path/"london-2001-2012-HHI-mean.csv"
mean_pd.to_csv(out_mean,index=False)

In [18]:
out_median = out_path/"london-2001-2012-HHI-median.csv"
median_pd.to_csv(out_median,index=False)

In [19]:
# Save the workbook's Metadata sheet as plain text alongside the CSVs
with open(out_path/"metadata_HHI.txt", 'w') as file:
    pd.read_excel(econ_filepath, sheet_name="Metadata").to_string(file, index=False)
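
The aim stated at the top is a CSV that joins readily to ONS Geographies data. The sketch below shows what such a join might look like, keyed on `LSOA11CD`. It is an illustration under assumptions: the in-memory CSV stands in for one of the files written above, the `geog` table and its `MSOA11CD` column are hypothetical, and all codes and values are placeholders rather than real data.

```python
import io

import pandas as pd

# Stand-in for a CSV written above: placeholder codes, made-up values.
hhi_csv = io.StringIO(
    "LSOA11CD,LSOA11NM,HHI_mean_2001\n"
    "X001,Area A,100\n"
    "X002,Area B,200\n"
)
hhi = pd.read_csv(hhi_csv)

# Hypothetical geographies lookup; the real ONS table's columns are assumed.
geog = pd.DataFrame(
    {"LSOA11CD": ["X001", "X002", "X003"], "MSOA11CD": ["M01", "M01", "M02"]}
)

# Left join keeps every geography row; validate raises if keys are duplicated.
joined = geog.merge(hhi, on="LSOA11CD", how="left", validate="one_to_one")

# Count geography rows that found no matching income estimate.
print(joined["HHI_mean_2001"].isna().sum())
```

A left join from the geographies side makes unmatched areas visible as missing values instead of silently dropping them.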