# House affordability

In this notebook we compile data about housing affordability in the UK.

We will use [House Price Index](https://www.gov.uk/government/publications/about-the-uk-house-price-index/about-the-uk-house-price-index#monthly-revision) data, which is based on residential housing transactions recorded by the Land Registry. We can normalise this data to capture affordability using median annual salaries from the Annual Survey of Hours and Earnings (ASHE), which we collect elsewhere.
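As a rough sketch of the normalisation described above, the snippet below divides an average house price by a median annual salary to obtain a price-to-salary ratio. The column names (`average_price`, `median_salary`), the area names and all values are made up for illustration; they are not taken from the HPI or ASHE data.

```python
import pandas as pd

# Toy inputs: hypothetical areas and values, for illustration only
prices = pd.DataFrame({
    'area': ['Area A', 'Area B'],
    'year': [2018, 2018],
    'average_price': [300000.0, 150000.0],
})
salaries = pd.DataFrame({
    'area': ['Area A', 'Area B'],
    'year': [2018, 2018],
    'median_salary': [30000.0, 25000.0],
})

# Join prices and salaries on area and year, then compute the ratio
afford = prices.merge(salaries, on=['area', 'year'], how='inner')
afford['price_to_salary'] = afford['average_price'] / afford['median_salary']
```

A higher ratio means that a house costs more years of median salary, so the area is less affordable.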

Our strategy is to: 

* Collect the data from the [UK House Price Index data downloads page](https://www.gov.uk/government/statistical-data-sets/uk-house-price-index-data-downloads-august-2019)
* Process it into the geographies we are interested in (if possible)
* Create indicators.

## Preamble

In [None]:
%run ../notebook_preamble.ipy

In [None]:
def make_dirs(name,dirs = ['raw','processed']):
    '''
    Utility that creates directories to save the data
    
    '''
    
    for d in dirs:
        if name not in os.listdir(f'../../data/{d}'):
            os.mkdir(f'../../data/{d}/{name}')
            
def flat_freq(a_list):
    '''
    Return value counts for categories in a nested list
    
    '''
    return(pd.Series([x for el in a_list for x in el]).value_counts())

        

def flatten_list(a_list):
    
    return([x for el in a_list for x in el])

        

In [None]:
def save_data(df,name,path,today=today_str):
    '''
    Utility to save processed data quicker
    
    Arguments:
        df (df) is the dataframe we want to save
        name (str) is the name of the file
        path (str) is the path where we want to save the file
        today (str) is the day when the data is saved
    
    '''
    
    #Use the `today` argument rather than the global so the default can be overridden
    df.to_csv(f'{path}/{today}_{name}.csv')
    

In [None]:
#dirs

if 'housing' not in os.listdir('../../data/raw'):
    os.makedirs('../../data/raw/housing')

if 'housing' not in os.listdir('../../data/processed/'):
    os.makedirs('../../data/processed/housing')

## 1. Collect data

We collect the data from the Land Registry. The variables are described in the [data dictionary](https://www.gov.uk/government/publications/about-the-uk-house-price-index/about-the-uk-house-price-index#data-tables).

In [None]:
housing_url = 'http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/UK-HPI-full-file-2019-08.csv?utm_medium=GOV.UK&utm_source=datadownload&utm_campaign=full_fil&utm_term=9.30_16_10_19'

hous = pd.read_csv(housing_url)

In [None]:
hous.head()

We have monthly data by local authority district (LAD).

## 2. Process data

In [None]:
hous.shape

In [None]:
#Create a year variable to assess that coverage
hous['year'] = [int(x.split('/')[-1]) for x in hous['Date']]

In [None]:
hous['year'].value_counts().plot.bar(figsize=(10,5))

We seem to have good coverage since around the mid 2000s. Is this linked to geography?

In [None]:
#The first letter of an area code tells us whether a row refers to England, Scotland or Wales
hous['nation'] = [x[0] for x in hous['AreaCode']]

In [None]:
pd.crosstab(hous['nation'],hous['year']).T.plot()

Yes, nations are added at various points in time. We will focus on the 2010s. Note that 2019 doesn't seem to be complete. 

And what is that K?

In [None]:
hous.loc[hous['nation']=='K'].head()

The `K` prefix turns out to be the England and Wales aggregate.
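To make the prefixes easier to read in tables and plots, they could be mapped to labels. The sketch below assumes the usual GSS convention for the first letter of an area code; the dictionary `NATION_LABELS` and the helper `nation_label` are hypothetical names introduced here, and the label used for `K` is our own wording for aggregates that span more than one nation.

```python
# Assumed mapping from the first letter of an area code to a label.
# 'K' marks aggregates spanning more than one nation
# (for example the England and Wales total found above).
NATION_LABELS = {
    'E': 'England',
    'W': 'Wales',
    'S': 'Scotland',
    'N': 'Northern Ireland',
    'K': 'Multi-nation aggregate',
}


def nation_label(area_code):
    '''Return a readable label for an area code, or 'Unknown' if the prefix is not mapped.'''
    return NATION_LABELS.get(area_code[:1], 'Unknown')


# Example codes used only to exercise the helper
labels = [nation_label(c) for c in ['E09000001', 'W06000015', 'K04000001']]
```

With the dataframe loaded, the same mapping could be applied with `hous['AreaCode'].str[0].map(NATION_LABELS)`.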

In [None]:
#Subset to focus on the most recent period

#Take a copy so that adding columns later does not trigger a SettingWithCopyWarning
hous_recent = hous.loc[hous['year']>=2010].copy()

## Transformation

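Since the data is available at the LAD level, we need to aggregate it into NUTS2 areas. Average prices cannot be summed across areas, so we multiply `AveragePrice` by `SalesVolume` to obtain the total value of sales in each LAD, which can be added up.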
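The toy example below sketches why the multiplication helps: the total value of sales can be summed across LADs, and dividing it by the summed volume would give a volume-weighted average price. `AveragePrice` and `SalesVolume` are the column names used in the data; the `nuts2` column and all values are made up for illustration.

```python
import pandas as pd

# Toy LAD-level rows; the 'nuts2' labels and all values are made up
lads = pd.DataFrame({
    'nuts2': ['X', 'X', 'Y'],
    'AveragePrice': [200000.0, 100000.0, 150000.0],
    'SalesVolume': [10, 30, 20],
})

# Total value of sales in each LAD (price x number of transactions)
lads['total_sales'] = lads['AveragePrice'] * lads['SalesVolume']

# Sum value and volume within each NUTS2 area
agg = lads.groupby('nuts2')[['total_sales', 'SalesVolume']].sum()

# Volume-weighted average price: total value divided by total volume
agg['weighted_price'] = agg['total_sales'] / agg['SalesVolume']
```

The cells that follow keep only the summed `total_sales`; the final division is an optional extra step that would recover a price per transaction.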

In [None]:
hous_recent['total_sales'] = hous_recent['AveragePrice']*hous_recent['SalesVolume']

In [None]:
house_year = hous_recent.groupby(['RegionName','AreaCode','year'])['total_sales'].sum()

In [None]:
house_year_wide = house_year.reset_index(drop=False).pivot_table(index=['RegionName','AreaCode'],columns='year',values='total_sales',aggfunc='sum')

In [None]:
house_year_wide.head()

In [None]:
house_year_wide[2018].sort_values(ascending=False).head(n=20)

So the data seem to include, in fact, NUTS2 areas. Now we need to pull them out.

**We will get NUTS2 codes for 2018 and 2015 and look for them in the data**

We use the codes from [Open Geography Portal](https://geoportal.statistics.gov.uk/search?collection=Dataset&sort=name&tags=NAC_NUTS2)

In [None]:
#Here is the lookup

lad_nuts_lookup = pd.read_csv('https://opendata.arcgis.com/datasets/2a2548641a294734ba4fdb689b12d955_0.csv')

In [None]:
house_year_nuts = pd.merge(house_year.reset_index(drop=False),
                           lad_nuts_lookup[['LAD16CD','LAD16NM','NUTS318CD','NUTS318NM','NUTS218CD','NUTS218NM']],
                           left_on='AreaCode',
                           right_on='LAD16CD',how='left')

In [None]:
house_year_nuts.head()

In [None]:
print(len(house_year))

print(len(house_year_nuts))

There are some missing locations in the merge. What are they?
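One way to list unmatched rows directly is the `indicator` option of `merge`, sketched here on toy data. The frames `left` and `lookup` and their values are illustrative stand-ins for the housing table and the LAD to NUTS lookup.

```python
import pandas as pd

# Toy data: two LAD codes plus one non-LAD aggregate code
left = pd.DataFrame({'AreaCode': ['E06000001', 'E06000002', 'K04000001'],
                     'total_sales': [1.0, 2.0, 3.0]})
lookup = pd.DataFrame({'LAD16CD': ['E06000001', 'E06000002'],
                       'NUTS218CD': ['UKC1', 'UKC1']})

# indicator=True adds a '_merge' column recording where each row was found
merged = left.merge(lookup, left_on='AreaCode', right_on='LAD16CD',
                    how='left', indicator=True)

# Rows present only in the left table had no match in the lookup
unmatched = merged.loc[merged['_merge'] == 'left_only', 'AreaCode'].tolist()
```

This avoids relying on missing values in a particular lookup column to spot the failed matches.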

In [None]:
house_year_nuts.loc[house_year_nuts['NUTS218NM'].isna()]['RegionName'].value_counts().head()

They are the non-LAD regions in the data, so it is fine to drop them.

In [None]:
house_year_nuts = house_year_nuts.dropna(axis=0)

And now we calculate the NUTS2 estimates

In [None]:
house_year_nuts_2 = house_year_nuts.groupby(['NUTS218CD','NUTS218NM','year'])['total_sales'].sum()

In [None]:
house_year_nuts_2_wide = house_year_nuts_2.reset_index(drop=False).pivot_table(index=['NUTS218CD','NUTS218NM'],columns='year',values='total_sales').sort_values(
    2018,ascending=False)

In [None]:
#Mean total value of sales per NUTS2 area by year
house_year_nuts_2_wide.describe().loc['mean'].plot()

We note much less activity in 2019 because the House Price Index data is published with a lag and the year is incomplete. We will therefore remove 2019 from the analysis and save the data.
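A simple way to check whether a year is complete is to count the distinct months observed in it. The sketch below does this on toy dates, assuming the `dd/mm/yyyy` format implied by the way the year was extracted from `Date` earlier; the dates themselves are made up.

```python
import pandas as pd

# Toy dates in the dd/mm/yyyy format assumed for the 'Date' column
dates = pd.DataFrame({'Date': [f'01/{m:02d}/2018' for m in range(1, 13)]
                              + [f'01/{m:02d}/2019' for m in range(1, 9)]})

parsed = pd.to_datetime(dates['Date'], format='%d/%m/%Y')

# Number of distinct months observed in each year
months_per_year = parsed.dt.month.groupby(parsed.dt.year).nunique()

# Years with fewer than 12 months of data are incomplete
incomplete_years = months_per_year[months_per_year < 12].index.tolist()
```

This only detects missing months; it would not reveal months that are present but still subject to revision.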

In [None]:
house_final = house_year_nuts_2.reset_index(drop=False).query('year != 2019')

## 3. Save data

In [None]:
save_data(house_final,name='nuts_house_prices',path='../../data/processed/housing/')