# Small Area Population Estimates


**NOTE - the source file is too big for a git repo, so you'll need to download the following and unzip it into /sources**
`https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/lowersuperoutputareamidyearpopulationestimates/mid2017/sape20dt1mid2017lsoasyoaestimatesformatted.zip`

### Requirements

Extract the tabs Mid-2017 Persons, Mid-2017 Males and Mid-2017 Females.

#### Observations & Dimensions

The `observations` should be apparent: they are the population counts in the body of each tab.

The required dimensions are:

* **Geography** - take this from "area codes"
* **Time** - get the time from the large bold text
* **Age** - "All Ages" as well as all the bolded numbers to the right
* **Gender** - from the tab name
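
As a rough illustration of the target layout, the sketch below reshapes a toy stand-in for one tab into one row per observation, with a column for each of the four dimensions. The area codes and counts are made up for the example (they are not ONS data), and `melt` is used here only to show the shape; it is not necessarily the approach used further down.

```python
import pandas as pd

# Toy stand-in for one tab (made-up codes and counts, not ONS data).
wide = pd.DataFrame(
    {"All Ages": [30, 50], "0": [10, 20], "1": [20, 30]},
    index=pd.Index(["AREA_A", "AREA_B"], name="Area Codes"),
)

# One row per (area, age) pair; Time and Gender are constant for a given tab.
tidy = (
    wide.reset_index()
    .melt(id_vars="Area Codes", var_name="Age", value_name="OBS")
    .rename(columns={"Area Codes": "Geography"})
    .assign(Time="2017", Gender="All")
)
print(tidy)
```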

-----
Notes:

* Haven't tried it but it's very likely (read: is) too big for databaker. You'll need to use pandas directly.

In [8]:
import pandas as pd
import numpy as np

filepath = './sources/SAPE20DT1-mid-2017-lsoa-syoa-estimates-formatted.XLS'

In [9]:
# (sheet name, gender label, year) for each tab to extract
tabs = [
    ("Mid-2017 Persons", "All", "2017"),
    ("Mid-2017 Females", "Females", "2017"),
    ("Mid-2017 Males", "Males", "2017"),
]

In [10]:
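tidied_sheets = []
for tab, gender, year in tabs:
    # header=4: column names are on the fifth row; index_col=0: first column becomes the index
    df = pd.read_excel(io=filepath, sheet_name=tab, header=4, index_col=0)
    # Drop the columns that are not needed as dimensions or observations
    df.drop(labels=['Area Names', 'Unnamed: 2'], axis=1, inplace=True)
    # Label every remaining column with this tab's gender and year
    df.columns = pd.MultiIndex.from_product([[gender], [year], df.columns])
    tidied_sheets.append(df)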
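
To see what the `MultiIndex.from_product` step does to the column headers, here is a small sketch on a made-up two-area frame. The area codes and counts are invented for illustration; only the relabelling call mirrors the loop above.

```python
import pandas as pd

# Made-up stand-in for a sheet after the unwanted columns are dropped.
df = pd.DataFrame(
    {"All Ages": [30, 50], "0": [10, 20]},
    index=pd.Index(["AREA_A", "AREA_B"], name="Area Codes"),
)

# Each plain column name becomes a (gender, year, age) tuple.
df.columns = pd.MultiIndex.from_product([["Females"], ["2017"], df.columns])
labelled_columns = df.columns.tolist()
print(labelled_columns)
```

Because the gender and year lists each hold a single value, the product has exactly one entry per original column, so the data itself is untouched.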

In [11]:
tabs

[('Mid-2017 Persons', 'All', '2017'),
 ('Mid-2017 Females', 'Females', '2017'),
 ('Mid-2017 Males', 'Males', '2017')]

In [12]:
# Join the sheets side by side on area code, then unstack to one row per observation
datacube = pd.concat(tidied_sheets, axis=1, join="inner").unstack().reset_index()

In [13]:
datacube

Unnamed: 0,level_0,level_1,level_2,Area Codes,0
0,All,2017,All Ages,E06000047,523662
1,All,2017,All Ages,E01020634,1632
2,All,2017,All Ages,E01020635,1329
3,All,2017,All Ages,E01020636,1725
4,All,2017,All Ages,E01020654,1826
...,...,...,...,...,...
9687871,Males,2017,90+,W01001636,1
9687872,Males,2017,90+,W01001657,8
9687873,Males,2017,90+,W01001658,8
9687874,Males,2017,90+,W01001912,3


In [14]:
datacube.columns = ["Gender", "Time", "Age", 'Geography', "OBS"]
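
The positional rename above relies on the columns coming out in a known order. One possible alternative, sketched here on made-up data, is to name the column levels when the `MultiIndex` is built and to name the value column in `reset_index`, so that no rename is needed afterwards. The `toy_sheet` helper, the area codes and the counts are hypothetical; the index is also named `Geography` up front, which the real sheets would need an extra step for.

```python
import pandas as pd

def toy_sheet(gender, counts):
    """Build a made-up one-column sheet with named column levels."""
    df = pd.DataFrame(
        {"All Ages": counts},
        index=pd.Index(["AREA_A", "AREA_B"], name="Geography"),
    )
    df.columns = pd.MultiIndex.from_product(
        [[gender], ["2017"], df.columns], names=["Gender", "Time", "Age"]
    )
    return df

sheets = [toy_sheet("All", [30, 50]), toy_sheet("Females", [16, 24])]

# The level names carry through unstack(), and name= labels the value column.
cube = pd.concat(sheets, axis=1, join="inner").unstack().reset_index(name="OBS")
print(cube.columns.tolist())
```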

In [15]:
datacube

Unnamed: 0,Gender,Time,Age,Geography,OBS
0,All,2017,All Ages,E06000047,523662
1,All,2017,All Ages,E01020634,1632
2,All,2017,All Ages,E01020635,1329
3,All,2017,All Ages,E01020636,1725
4,All,2017,All Ages,E01020654,1826
...,...,...,...,...,...
9687871,Males,2017,90+,W01001636,1
9687872,Males,2017,90+,W01001657,8
9687873,Males,2017,90+,W01001658,8
9687874,Males,2017,90+,W01001912,3


In [16]:
column_order = ["OBS", "Geography","Time", "Age","Gender"]

In [17]:
datacube = datacube.reindex(columns=column_order)
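
With the cube in this shape, a quick consistency check becomes possible. The sketch below assumes that the `All` figure for each area and age should equal `Females` plus `Males` (an assumption about the source data, not something verified here) and runs on a made-up cube with the same column layout.

```python
import pandas as pd

# Made-up cube in the same column order as above (not ONS data).
cube = pd.DataFrame({
    "OBS": [30, 16, 14, 50, 24, 26],
    "Geography": ["AREA_A"] * 3 + ["AREA_B"] * 3,
    "Time": ["2017"] * 6,
    "Age": ["All Ages"] * 6,
    "Gender": ["All", "Females", "Males"] * 2,
})

# Put the three gender values side by side for each (area, year, age).
by_gender = cube.pivot_table(
    index=["Geography", "Time", "Age"], columns="Gender", values="OBS", aggfunc="sum"
)

# Rows where the parts do not add up to the total (assumed relationship).
mismatches = by_gender[by_gender["Females"] + by_gender["Males"] != by_gender["All"]]
print(len(mismatches))
```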

In [18]:
datacube

Unnamed: 0,OBS,Geography,Time,Age,Gender
0,523662,E06000047,2017,All Ages,All
1,1632,E01020634,2017,All Ages,All
2,1329,E01020635,2017,All Ages,All
3,1725,E01020636,2017,All Ages,All
4,1826,E01020654,2017,All Ages,All
...,...,...,...,...,...
9687871,1,W01001636,2017,90+,Males
9687872,8,W01001657,2017,90+,Males
9687873,8,W01001658,2017,90+,Males
9687874,3,W01001912,2017,90+,Males
