### Notes

- We want to create a normalisation algorithm for our modelling.
- For each category, we will map the lowest and highest values onto the range 0-1. (Alternatively, we could normalise relative to the average value of the containing geography.)
- We then calculate the mean of the normalised values.
- Weights are easy to add if necessary.
- We will say that 0 = least likely to cause poverty (e.g. a low house-price-to-wage ratio) and 1 = most likely to cause poverty (e.g. a high rate of unemployment).
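
As a rough illustration of the "add weights" point, the mean of the normalised values can be swapped for a weighted mean. This is only a sketch: the scores, the column names `metric_a` and `metric_b`, and the weights are all made up for the example, and it assumes there are no missing values.

```python
import pandas as pd

# Hypothetical normalised scores (already on a 0-1 scale) for three places.
scores = pd.DataFrame(
    {"metric_a": [0.0, 0.5, 1.0], "metric_b": [1.0, 0.5, 0.0]},
    index=["place 1", "place 2", "place 3"],
)

# Hypothetical weights: metric_a counts three times as much as metric_b.
weights = pd.Series({"metric_a": 3.0, "metric_b": 1.0})

# Unweighted mean across metrics for each place.
unweighted = scores.mean(axis=1)

# Weighted mean: scale each column by its weight, sum across the row,
# then divide by the total weight so the result stays on the 0-1 scale.
weighted = scores.mul(weights, axis=1).sum(axis=1) / weights.sum()

print(pd.DataFrame({"unweighted": unweighted, "weighted": weighted}))
```

With equal weights the two columns would be identical, so the unweighted mean is just the special case where every metric has the same weight.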

In [64]:
import pandas as pd
import numpy as np

array = np.random.randint(10, size=(3, 5))

In [65]:
df = pd.DataFrame({'A':array[0], 'B':array[1], 'C': array[2]}, index=['place 1', 'place 2', 'place 3', 'place 4', 'place 5'])
colnames = df.columns.to_list()
df

| | A | B | C |
|---|---|---|---|
| place 1 | 0 | 7 | 4 |
| place 2 | 4 | 2 | 0 |
| place 3 | 4 | 1 | 9 |
| place 4 | 9 | 2 | 2 |
| place 5 | 7 | 3 | 6 |


For a set $$X$$ containing values $$x_i$$, the normalised value of each $$x_i$$ is
$$x_{i,normalised} = \frac{x_i - X_{min}}{X_{max}-X_{min}}$$
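
To make the formula concrete, here is a small worked sketch using column B from the example table above. The variable names are chosen for illustration. It also shows the reversed form, where the lowest raw value maps to 1, which is simply one minus the standard form.

```python
import pandas as pd

# Column B from the example table above.
b = pd.Series([7, 2, 1, 2, 3], index=[f"place {n}" for n in range(1, 6)])

b_min, b_max = b.min(), b.max()

# Standard min-max normalisation: the highest raw value maps to 1.
high_is_one = (b - b_min) / (b_max - b_min)

# Reversed normalisation: the lowest raw value maps to 1.
low_is_one = (b_max - b) / (b_max - b_min)

print(pd.DataFrame({"B": b, "high_is_one": high_is_one, "low_is_one": low_is_one}))
```

Here the minimum is 1 and the maximum is 7, so place 1 (B = 7) scores (7 - 1) / 6 = 1 and place 3 (B = 1) scores 0 in the standard form.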

In [66]:
def normalise(df, highest_score_wins):
    '''
    Take a dataframe with different metrics as columns and geographies as rows.
    Normalise the values from 0-1 where 1 is most likely to cause poverty and 0 is
    least likely to cause poverty (this will depend on the stat/measure).
    Mean-average the normalised values.

    Params
    ------
        df: dataframe
        highest_score_wins: list of bools, one per metric column (in column order).
            True maps the highest raw value to 1; False maps the lowest raw value to 1.

    Returns
    -------
        original dataframe plus normalised value columns and mean-average column.
    '''
    colnames = df.columns.to_list()
    if 'geography_code' in colnames:
        colnames.remove('geography_code')
    norm_cols = []
    # pair each metric column with its flag
    for col, high_wins in zip(colnames, highest_score_wins):
        mx, mn = df[col].max(), df[col].min()
        value_range = mx - mn  # avoid shadowing the built-in range()
        if high_wins:
            df[f'Normalised {col}'] = (df[col] - mn) / value_range
        else:
            df[f'Normalised {col}'] = (mx - df[col]) / value_range
        norm_cols.append(f'Normalised {col}')
    # mean across the normalised columns (pandas skips NaN by default)
    df['mean_norm_score'] = df[norm_cols].mean(axis=1)
    return df

normalise(df, [False, True, True])

| | A | B | C | Normalised A | Normalised B | Normalised C | mean_norm_score |
|---|---|---|---|---|---|---|---|
| place 1 | 0 | 7 | 4 | 1.0 | 1.0 | 0.444444 | 0.814815 |
| place 2 | 4 | 2 | 0 | 0.555556 | 0.166667 | 0.0 | 0.240741 |
| place 3 | 4 | 1 | 9 | 0.555556 | 0.0 | 1.0 | 0.518519 |
| place 4 | 9 | 2 | 2 | 0.0 | 0.166667 | 0.222222 | 0.12963 |
| place 5 | 7 | 3 | 6 | 0.222222 | 0.333333 | 0.666667 | 0.407407 |


In [79]:
# data columns to use from place_data.json
usecols = ['geography_code', 'economic_inactivity_16_64', 'percent_in_low_income', 'unemployment_rate_16_64']

# columns to rank
cols_to_norm = ['economic_inactivity_16_64', 'percent_in_low_income', 'unemployment_rate_16_64']

# read the data
data = pd.read_json(r"C:\Users\LukeStrange\Code\jrf-insight\data\interim\place_data.json")

# select only necessary columns and "regions"
data = data.loc[:, usecols]
# .copy() so normalise() can add columns without a SettingWithCopyWarning
data = data.loc[data.geography_code.str.startswith(('E120', 'E129', 'E06', 'E07', 'E08'))].copy()
#data.loc[1643, 'unemployment_rate_16_64'] = 3.8
# for all of these metrics a LOW value is less likely to cause poverty
# note: False maps the lowest raw value to 1 and the highest to 0 (see the output below)
highest_score_wins = [False, False, False]
returned = normalise(data, highest_score_wins)
returned


| | geography_code | economic_inactivity_16_64 | percent_in_low_income | unemployment_rate_16_64 | Normalised economic_inactivity_16_64 | Normalised percent_in_low_income | Normalised unemployment_rate_16_64 | mean_norm_score |
|---|---|---|---|---|---|---|---|---|
| 1560 | E06000001 | 27.3 | NaN | 6.2 | 0.263415 | NaN | 0.468085 | 0.365750 |
| 1561 | E06000002 | 29.3 | NaN | 6.8 | 0.165854 | NaN | 0.404255 | 0.285054 |
| 1562 | E06000003 | 28.5 | NaN | 2.4 | 0.204878 | NaN | 0.872340 | 0.538609 |
| 1563 | E06000004 | 23.6 | NaN | 3.6 | 0.443902 | NaN | 0.744681 | 0.594292 |
| 1564 | E06000005 | 20.9 | NaN | 3.1 | 0.575610 | NaN | 0.797872 | 0.686741 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1631 | E08000037 | 26.6 | NaN | 3.8 | 0.297561 | NaN | 0.723404 | 0.510483 |
| 1640 | E12000001 | 26.0 | 21.0 | 4.7 | 0.326829 | 0.0 | 0.627660 | 0.318163 |
| 1641 | E12000002 | 23.3 | 19.0 | 4.1 | 0.458537 | 1.0 | 0.691489 | 0.716675 |
| 1642 | E12000003 | 22.6 | 19.0 | 3.6 | 0.492683 | 1.0 | 0.744681 | 0.745788 |


#### Tuesday

There is an interesting problem of what to do when data is missing. Do we use the average (mean or median) value of all geographies contained in the parent geography, or do we use the value of the parent geography itself? We could make this an optional filter in the function and see how the answers change when we use each one.
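
One way to compare these options is a small helper with a switch between them. The sketch below is illustrative only: `fill_missing`, the `parent_code` column, and all of the codes and values are hypothetical and do not come from `place_data.json`. It assumes each geography has a single parent recorded in the same table. Note that in the output above, `mean_norm_score` is already computed from the remaining columns when one is missing, because pandas skips NaN in `mean` by default.

```python
import numpy as np
import pandas as pd


def fill_missing(df, col, method="children_mean",
                 parent_col="parent_code", code_col="geography_code"):
    """Fill NaN in `col` using the parent geography (hypothetical helper).

    method:
        'children_mean'   - mean of the geographies sharing the same parent
        'children_median' - median of the geographies sharing the same parent
        'parent_value'    - the parent geography's own value
    """
    out = df.copy()
    if method == "parent_value":
        # Look up each row's parent code and take that parent's own value.
        parent_values = out.set_index(code_col)[col]
        fill = out[parent_col].map(parent_values)
    elif method in ("children_mean", "children_median"):
        agg = "mean" if method == "children_mean" else "median"
        # Aggregate over siblings; NaN values are ignored by the aggregation.
        fill = out.groupby(parent_col)[col].transform(agg)
    else:
        raise ValueError(f"unknown method: {method}")
    # Only rows where `col` is NaN are changed.
    out[col] = out[col].fillna(fill)
    return out


# Made-up example: one region with four local authorities, one value missing.
example = pd.DataFrame({
    "geography_code": ["REGION", "LA1", "LA2", "LA3", "LA4"],
    "parent_code": ["COUNTRY", "REGION", "REGION", "REGION", "REGION"],
    "value": [20.0, 18.0, np.nan, 26.0, 19.0],
})

by_mean = fill_missing(example, "value", method="children_mean")
by_median = fill_missing(example, "value", method="children_median")
by_parent = fill_missing(example, "value", method="parent_value")

for name, result in [("mean", by_mean), ("median", by_median), ("parent", by_parent)]:
    print(name, result.loc[result.geography_code == "LA2", "value"].iloc[0])
```

Running the normalisation on each filled dataframe in turn would show how sensitive the final scores are to the choice.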