### `The concept`
Different statistics are normalised in the range 0-1 and then mean averaged. A score of 0 corresponds to something less likely to cause poverty (e.g. extremely low rate  of unemployment).

The highest and lowest values will score 1 and 0, respectively, and everything else will fall somewhere in between. We do this across 6 metrics for economic insecurity and their average score is taken. 

These metrics are: economic inactivity, unemployment, mean weekly wage, median weekly wage, claimants as a proportion of residents aged 16-64 and proportion of households that are fuel poor. In the interest of simplicity, each metric has equal weighting. 

### `Assumptions/Disclaimers`
Where data is unavailable, the default is to use the value for the parent (containing) geography. For example, if Leeds had missing data then the value for Yorkshire & the Humber would be used. 

When the final average is calculated, if the data is for a given statistic is unavailable, that statistic does not contribute to the index score. We drop rows that have fewer than 3 values. This is to avoid places getting an index score with too few data points contributing to the average.

In general, it shouldn't be the case that geographies at the same level could have different statistics contributing to the index score. The output csv file contains a list of the metrics that were per place.

Finally, this is a proof of concept. The aim of this model is not to create a new measure of poverty, but to see how different places rank relative to eachother across a range of different metrics. We are testing out new ways of combining different datasets to provide useful insights. The results of this model should not be seen as a measure of poverty or taken at absolute face value.

### Notes
- We want to create a simple normalisation algorithm for our modelling. 
- We will map highest/lowest values for a category from 0-1. (Or, we could normalise to average value of containing geography.)
- Then calculate a mean average. 
- Easy to add weights in if necessary. 
- We will say that 0 = least likely to cause poverty (e.g. low house price to wage ratio) and 1 = most likely to cause poverty (e.g. high rate of unemployment).

For a set $$X$$ of values containing $$x_i$$ individual values, the normalised values
$$x_{normalised} = \frac{x_i - X_{min}}{X_{max}-X_{min}}$$

In [29]:
import os
import pandas as pd
import numpy as np
os.chdir(r"C:\Users\LukeStrange\Code\jrf-insight")

Load the geography lookup tree

In [30]:
geography_tree = pd.read_csv('data/geo/geography_tree.csv', usecols=['parent', 'child'])
geography_tree.describe()

Unnamed: 0,parent,child
count,1646,1646
unique,86,1644
top,E06000057,E05014284
freq,66,2


Define several functions:
- `transform_data` attempts to reshape the data into a consitent format. It gets the most recent date's value per geography.
- `pivot_and_concatenate_dataframes` - pivots the frames to have one variable per column and geo_codes as the index.
- `fill_gaps_parent_value` - fills any NaN values with the parent value, if it exists.
- `normalise` - normalises the values from 0-1 (with 1 corresponding to something more likely to cause poverty) 
-  `which_variables` - determine which variables (non-NaNs) were used to calculate the score. 

In [31]:
def transform_data(data, variable_name):
    '''
    Select a given variable from the dataset.
    Get the most recent date per geography using idxmax(). 
    **Requires date format to be pandas datetime**.
    '''
    assert isinstance(variable_name, str), 'Variable name must be type: str'
    try:
        filtered_data = data[data.variable_name == variable_name].copy()
    except:
        return print('The variable', variable_name, 'is not present in the data.')
    try:
        filtered_data['date'] = pd.to_datetime(filtered_data['date'])
    except:
        return print('Cannot convet date column to pandas datetime.')
    filtered_data = filtered_data.loc[filtered_data.groupby(['geography_code'])[
        'date'].idxmax()]
    # only using necessary columns
    filtered_data = filtered_data.loc[:, [
        'date', 'geography_code', 'geography_name', 'variable_name', 'value']].set_index('date')
    return filtered_data

In [32]:
def pivot_and_concatenate_dataframes(dataframes):
    '''
    Pivot the frames into a wide format and concatenate
    them into a single dataframe.
    '''
    pivoted_frames = []
    for frame in dataframes:
        colnames = frame.columns.to_list()
        assert 'geography_code' in colnames
        assert 'value' in colnames
        assert 'variable_name' in colnames
        pivoted_frames.append(frame.pivot(index='geography_code', columns='variable_name', values='value'))
    result = pd.concat(pivoted_frames, axis=1)
    return result

In [33]:
def fill_gaps_parent_value(data, lookup):
    ''' 
    Fill any data gaps with the value of the parent, 
    if it exists.
    '''
    # copy the frame so we don't later the thing we iterate over
    data_copy = data.copy()
    for col in data.columns.to_list():
        na_rows = data[data.isnull().any(axis=1)]
        for i, row in na_rows.iterrows():
            try:
                parent_code = lookup[lookup.child == row.name]['parent'].values[0]
            except:
                print('No parent found for', row.name, 'for the variable', col)
                continue
            # if the parent isn't present in the geo tree, go to the next row with na values.
            parents_value = data[data.index == parent_code][col].iloc[0]
            #print(parent_code, parents_value)
            data_copy.at[i, col] = parents_value

    return data_copy

In [34]:
def normalise(data, highest_score_causes_poverty):
    '''
    Takes the range of values and maps the max/min
    to 0-1. A value of 1 means something is more likely
    to cause poverty. This would correspond to 
    highest_score_cause_poverty == True.
    '''
    variable_names = data.columns.to_list()
    assert len(variable_names) == len(highest_score_causes_poverty), 'Length of highest_score_cause_poverty does not match the number of variables'
    normalised_columns = []
    if 'geography_code' in variable_names:
        variable_names.remove('geography_code')
    if 'date' in variable_names:
        variable_names.remove('date')
    i = 0
    for variable in variable_names:
        mx, mn = data[f'{variable}'].max(), data[f'{variable}'].min()
        range = mx - mn 
        if highest_score_causes_poverty[i] == True:
            data[f'Normalised {variable}'] = (data[f'{variable}'] - mn) / range
        else:
            data[f'Normalised {variable}'] = abs((data[f'{variable}'] - mx) / range)
        normalised_columns.append(f'Normalised {variable}')
        i += 1
    data['mean_normalised_score'] = data[normalised_columns].mean(axis=1)
    
    return data

In [35]:
# Read data
labour_market = pd.read_csv('data/labour-market/labour-market.csv')
weekly_earnings = pd.read_csv('data/ashe/weekly-earnings.csv')
claimants = pd.read_csv('data/claimant-count/claimant-count.csv')
fuel_poverty = pd.read_csv('data/fuel-poverty/fuel-poverty.csv')

In [36]:
# Get the individual variables as their own dataframes
economic_inactivity = transform_data(labour_market, '% who are economically inactive - aged 16-64')
unemployment = transform_data(labour_market, 'Unemployment rate - aged 16-64')
mean_weekly_wage = transform_data(weekly_earnings, 'mean_weekly_wage')
median_weekly_wage = transform_data(weekly_earnings, 'median_weekly_wage')
claimants = transform_data(claimants, 'Claimants as a proportion of residents aged 16-64')
fuel_poverty = transform_data(fuel_poverty, 'Proportion of households fuel poor (%)')

variable_key = {
    '% who are economically inactive - aged 16-64': '1',
    'Unemployment rate - aged 16-64': '2',
    'mean_weekly_wage': '3',
    'median_weekly_wage': '4',
    'Claimants as a proportion of residents aged 16-64': '5',
    'Proportion of households fuel poor (%)': '6'
}

In [37]:
# Pivot concatenate the dataframes
data = pivot_and_concatenate_dataframes([economic_inactivity, unemployment, mean_weekly_wage, median_weekly_wage, claimants, fuel_poverty])

In [38]:
# Fill the gaps where possible
data = fill_gaps_parent_value(data, geography_tree)

# we drop rows that have fewer than 3 values. This is to avoid places getting an index score with too few data points contributing to the avrage.
data = data.dropna(thresh=3)

No parent found for E11000001 for the variable % who are economically inactive - aged 16-64
No parent found for E11000002 for the variable % who are economically inactive - aged 16-64
No parent found for E11000003 for the variable % who are economically inactive - aged 16-64
No parent found for E11000006 for the variable % who are economically inactive - aged 16-64
No parent found for E11000007 for the variable % who are economically inactive - aged 16-64
No parent found for E05000670 for the variable % who are economically inactive - aged 16-64
No parent found for E05000671 for the variable % who are economically inactive - aged 16-64
No parent found for E05000672 for the variable % who are economically inactive - aged 16-64
No parent found for E05000673 for the variable % who are economically inactive - aged 16-64
No parent found for E05000674 for the variable % who are economically inactive - aged 16-64
No parent found for E05000675 for the variable % who are economically inactive -

In [39]:
def which_variables(data):
    input_variables = data.apply(lambda x: x.dropna().index.to_list(), axis=1)
    return input_variables
input_variables = which_variables(data)

In [40]:
# Normalise and show the result
data = normalise(data, [True, True, False, False, True, True])

# Add the input variables and map them to the code.
data['input_variables'] = input_variables
data['input_variables'] = data['input_variables'].apply(lambda x: [variable_key.get(item, item) for item in x])


In [41]:
code_to_name = pd.read_csv('data/geo/geography_code_name_only.csv')
code_to_name.rename(columns={'code':'geography_code', 'name': 'geography_name'}, inplace=True)
data = data.merge(code_to_name, on='geography_code').set_index('geography_code')
data[['mean_normalised_score', 'input_variables']].head()

Unnamed: 0_level_0,mean_normalised_score,input_variables
geography_code,Unnamed: 1_level_1,Unnamed: 2_level_1
E06000001,0.551879,"[1, 2, 3, 4, 5, 6]"
E06000002,0.679635,"[1, 2, 3, 4, 5, 6]"
E06000003,0.562172,"[1, 2, 3, 4, 5, 6]"
E06000004,0.416625,"[1, 2, 3, 4, 5, 6]"
E06000005,0.394855,"[1, 2, 3, 4, 5, 6]"


In [42]:
data[['mean_normalised_score', 'input_variables', 'geography_name']].to_csv('playground/modelling/economic_insecurity.csv')