### Notes

- We want to create a normalisation algorithm for our modelling.
- For each category, we will map the lowest and highest values onto the range 0-1. (Alternatively, we could normalise to the average value of the containing geography.)
- Then take the mean of the normalised values.
- Weights are easy to add if necessary.
- We will say that 0 = least likely to cause poverty (e.g. low house price to wage ratio) and 1 = most likely to cause poverty (e.g. high rate of unemployment).
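
The weighting idea in the notes could be sketched as below. This is only an illustration: `weighted_mean_score` and its `weights` argument are hypothetical names, not part of the notebook, and it assumes the input frame holds nothing but normalised (0-1) metric columns.

```python
import numpy as np
import pandas as pd

def weighted_mean_score(norm_df, weights):
    """Weighted mean of normalised columns, skipping NaNs row by row.

    norm_df: dataframe containing only normalised metric columns (assumption).
    weights: one non-negative weight per column, in column order (hypothetical).
    """
    w = pd.Series(weights, index=norm_df.columns, dtype=float)
    present = norm_df.notna()
    # missing values contribute nothing to the numerator...
    weighted_sum = (norm_df.fillna(0) * w).sum(axis=1)
    # ...and their weights are left out of the denominator
    weight_total = (present * w).sum(axis=1)
    # a row with no data at all gives 0 / 0 = NaN
    return weighted_sum / weight_total
```

With all weights equal this reduces to the plain row mean over whichever metrics are present.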

In [14]:
import pandas as pd
import numpy as np

array = np.random.randint(10, size=(3, 5))
nans = [1, np.nan, np.nan, 2, 3]
nans

[1, nan, nan, 2, 3]

In [15]:
df = pd.DataFrame({'A':array[0], 'B':array[1], 'C': array[2], 'D': nans}, index=['place 1', 'place 2', 'place 3', 'place 4', 'place 5'])
colnames = df.columns.to_list()
df

Unnamed: 0,A,B,C,D
place 1,9,8,7,1.0
place 2,5,7,2,
place 3,2,4,7,
place 4,2,7,9,2.0
place 5,9,6,4,3.0


For a set $$X$$ of values $$x_i$$, the normalised value of each $$x_i$$ is
$$x_{i,\,normalised} = \frac{x_i - X_{min}}{X_{max}-X_{min}}$$
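
For measures where a low raw value should score 1, the `normalise` function below takes the absolute value of the distance from the maximum instead. Written out, that flipped form is just one minus the formula above:

$$\left|\frac{x_i - X_{max}}{X_{max}-X_{min}}\right| = \frac{X_{max} - x_i}{X_{max}-X_{min}} = 1 - \frac{x_i - X_{min}}{X_{max}-X_{min}}$$

As a worked example using the toy dataframe above: column B has $$X_{min} = 4$$ and $$X_{max} = 8$$, so place 2 ($$x_i = 7$$) scores

$$\frac{7 - 4}{8 - 4} = 0.75$$

and column A, when flipped, has $$X_{min} = 2$$ and $$X_{max} = 9$$, so place 2 ($$x_i = 5$$) scores

$$\frac{9 - 5}{9 - 2} = \frac{4}{7} \approx 0.571$$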

In [16]:
def normalise(df, highest_score_wins):
    '''
    Take a dataframe with different metrics as columns and geographies as rows.
    Normalise the values from 0-1 where 1 is most likely to cause poverty and 0 is
    least likely to cause poverty (this will depend on the stat/measure).
    Mean-average the normalised values.

    Params
    ------
        df: dataframe. Modified in place.
        highest_score_wins: list of bools, one per metric column in column order.
            True if the highest raw value should score 1, False if the lowest should.

    Returns
    -------
        original dataframe plus normalised value columns and mean-average column.
    '''
    # identifier columns are not metrics, so leave them out
    colnames = [c for c in df.columns if c not in ('geography_code', 'ancestors')]
    norm_cols = []
    for col, high_wins in zip(colnames, highest_score_wins):
        mx, mn = df[col].max(), df[col].min()
        value_range = mx - mn  # named to avoid shadowing the built-in range()
        if high_wins:
            df[f'Normalised {col}'] = (df[col] - mn) / value_range
        else:
            # flipped scale: the lowest raw value scores 1
            df[f'Normalised {col}'] = (mx - df[col]) / value_range
        norm_cols.append(f'Normalised {col}')
    # mean() skips NaN, so each row is averaged over the metrics it actually has
    df['mean_norm_score'] = df[norm_cols].mean(axis=1)
    return df

normalise(df, [False, True, True, False])

Unnamed: 0,A,B,C,D,Normalised A,Normalised B,Normalised C,Normalised D,mean_norm_score
place 1,9,8,7,1.0,0.0,1.0,0.714286,1.0,0.678571
place 2,5,7,2,,0.571429,0.75,0.0,,0.440476
place 3,2,4,7,,1.0,0.0,0.714286,,0.571429
place 4,2,7,9,2.0,1.0,0.75,1.0,0.5,0.8125
place 5,9,6,4,3.0,0.0,0.5,0.285714,0.0,0.196429
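
The `normalise` function above adds columns to the dataframe it is given and divides by zero when a column is constant. A variant that avoids both might look like the sketch below. `normalise_columns` and its `high_is_worse` mapping are hypothetical names introduced here for illustration, and returning NaN for a constant column is an assumption about the desired behaviour, not something the notes specify.

```python
import numpy as np
import pandas as pd

def normalise_columns(df, high_is_worse):
    """Return a new frame of min-max scores plus their row mean.

    high_is_worse: dict mapping column name -> bool (hypothetical argument).
        True if a high raw value should score 1, False if a low value should.
    """
    cols = list(high_is_worse)
    values = df[cols].astype(float)  # astype returns a copy, df is untouched
    mn, mx = values.min(), values.max()
    # a constant column has zero range; score it NaN rather than divide by zero
    value_range = (mx - mn).replace(0, np.nan)
    scores = (values - mn) / value_range
    for col in cols:
        if not high_is_worse[col]:
            # flip the scale so the lowest raw value scores 1
            scores[col] = 1 - scores[col]
    # NaNs are skipped, so each row is averaged over the metrics it has
    scores["mean_norm_score"] = scores[cols].mean(axis=1)
    return scores
```

Keying the direction flags by column name means they cannot drift out of step with the column order, which a positional list of bools can.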


In [21]:
# data columns to use from place_data.json
usecols = ['geography_code', 'economic_inactivity_16_64', 'median_weekly_wage', 
                'mean_weekly_wage', 'Claimants as a proportion of residents aged 16-64', 
                'imd_average_score', 'households_low_income_no_savings', 'ancestors']

# columns to rank
cols_to_norm = ['economic_inactivity_16_64', 'median_weekly_wage', 
                'mean_weekly_wage', 'Claimants as a proportion of residents aged 16-64', 
                'imd_average_score', 'households_low_income_no_savings']

# read the data
data = pd.read_json(r"C:\Users\LukeStrange\Code\jrf-insight\data\interim\place_data.json")

# select only necessary columns and "regions"
data = data.loc[:, usecols]
data = data.loc[data.geography_code.str.startswith(('E120', 'E08'))]#, 'E129', 'E06', 'E07', 'E08'))]
#data.loc[1643, 'unemployment_rate_16_64'] = 3.8
# one flag per column in cols_to_norm: True where a HIGH value is more likely to cause
# poverty, False for the wage columns, where a HIGH value is less likely to
highest_score_wins = [True, False, False, True, True, True]
#data.loc[1603, 'percent_in_low_income'] = 75.0
normalise(data, highest_score_wins)


Unnamed: 0,geography_code,economic_inactivity_16_64,median_weekly_wage,mean_weekly_wage,Claimants as a proportion of residents aged 16-64,imd_average_score,households_low_income_no_savings,ancestors,Normalised economic_inactivity_16_64,Normalised median_weekly_wage,Normalised mean_weekly_wage,Normalised Claimants as a proportion of residents aged 16-64,Normalised imd_average_score,Normalised households_low_income_no_savings,mean_norm_score
1603,E08000001,26.2,495.4,584.2,5.6,30.691,,"[E47000001, E12000002, E12999901]",0.65942,0.740319,0.536941,0.714286,0.542499,,0.638693
1604,E08000002,18.5,551.4,614.0,4.3,23.682,,"[E47000001, E12000002, E12999901]",0.101449,0.31511,0.381894,0.342857,0.282116,,0.284685
1605,E08000003,27.9,480.7,540.1,6.2,40.005,,"[E47000001, E12000002, E12999901]",0.782609,0.851936,0.766389,0.885714,0.888513,,0.835032
1606,E08000004,24.0,495.6,556.7,6.5,33.155,,"[E47000001, E12000002, E12999901]",0.5,0.7388,0.680021,0.971429,0.634037,,0.704857
1607,E08000005,30.9,472.3,536.8,5.9,34.415,,"[E47000001, E12000002, E12999901]",1.0,0.915718,0.783559,0.8,0.680846,,0.836024
1608,E08000006,25.0,535.8,591.2,5.2,34.21,,"[E47000001, E12000002, E12999901]",0.572464,0.433561,0.50052,0.6,0.67323,,0.555955
1609,E08000007,17.1,550.0,654.6,3.1,20.826,,"[E47000001, E12000002, E12999901]",0.0,0.32574,0.170656,0.0,0.176016,,0.134482
1610,E08000008,21.6,489.1,550.2,4.7,31.374,,"[E47000001, E12000002, E12999901]",0.326087,0.788155,0.71384,0.457143,0.567873,,0.570619
1611,E08000009,22.8,592.9,687.4,3.1,16.088,,"[E47000001, E12000002, E12999901]",0.413043,0.0,0.0,0.0,0.0,,0.082609
1612,E08000010,19.3,499.9,559.6,3.7,25.713,,"[E47000001, E12000002, E12999901]",0.15942,0.70615,0.664932,0.171429,0.357567,,0.4119


#### Tuesday

There is an interesting problem of what to do when data is missing. Do we use the average (mean or median) value of all geographies contained in the parent geography, or do we use the value of the parent geography itself? We could make this an optional filter in the function and see how the answers change when we use each one.

If data is missing for a set of geographies, e.g. at ward level, it will be missing for all geographies of that type.

The data will be filled in the following order of precedence:
- The average of all geographies at the same level with the same parent geography.
- The value of the parent geographies, starting with the smallest and going up through the levels.
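
The precedence above could be sketched roughly as follows. `fill_missing` is a hypothetical helper, not the notebook's own code. It assumes `geography_code` values are unique, that `ancestors` is a list ordered from the smallest containing geography upwards, and that the first ancestor is the immediate parent. It fills only the cells that are actually missing.

```python
import numpy as np
import pandas as pd

def fill_missing(data, cols):
    """Fill NaNs in `cols`: sibling mean first, then the nearest ancestor's value."""
    out = data.copy()
    # assumption: ancestors[0] is the immediate parent; rows with no
    # ancestors are grouped together under an empty-string key
    parent = out["ancestors"].apply(lambda a: a[0] if len(a) else "")
    # snapshot of the original values, looked up by geography code
    by_code = data.set_index("geography_code")
    for col in cols:
        # 1. mean of the geographies sharing a parent (NaN if all are missing)
        sibling_mean = out.groupby(parent)[col].transform("mean")
        out[col] = out[col].fillna(sibling_mean)
        # 2. for anything still missing, walk up the ancestors, nearest first
        for idx in out.index[out[col].isna()]:
            for ancestor in out.at[idx, "ancestors"]:
                if ancestor in by_code.index and pd.notna(by_code.at[ancestor, col]):
                    out.at[idx, col] = by_code.at[ancestor, col]
                    break
    return out
```

Swapping steps 1 and 2, or skipping one of them, would give the "optional filter" mentioned above for comparing the two strategies.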

In [9]:
data_copy = data.copy()
for col in cols_to_norm:
    # rows with a missing value in ANY column (not only `col`), so `col` is
    # overwritten for every such row, even where it already had a value
    na_rows = data_copy[data_copy.isnull().any(axis=1)]
    for i, row in na_rows.iterrows():
        found = False
        # ancestors are assumed to be listed smallest first; stop at the first one present
        for ancestor in row.ancestors:
            val = data_copy.loc[data_copy.geography_code == ancestor, col]
            if val.empty:
                continue
            data_copy.at[i, col] = val.iloc[0]
            found = True
            break
        if not found:
            print(f'no data exists at any level for the measure: {col}')
data_copy

Unnamed: 0,geography_code,economic_inactivity_16_64,median_weekly_wage,mean_weekly_wage,Claimants as a proportion of residents aged 16-64,imd_average_score,households_low_income_no_savings,ancestors
1603,E08000001,23.3,504.5,579.2,4.2,,400000.0,"[E47000001, E12000002, E12999901]"
1604,E08000002,23.3,504.5,579.2,4.2,,400000.0,"[E47000001, E12000002, E12999901]"
1605,E08000003,23.3,504.5,579.2,4.2,,400000.0,"[E47000001, E12000002, E12999901]"
1606,E08000004,23.3,504.5,579.2,4.2,,400000.0,"[E47000001, E12000002, E12999901]"
1607,E08000005,23.3,504.5,579.2,4.2,,400000.0,"[E47000001, E12000002, E12999901]"
1608,E08000006,23.3,504.5,579.2,4.2,,400000.0,"[E47000001, E12000002, E12999901]"
1609,E08000007,23.3,504.5,579.2,4.2,,400000.0,"[E47000001, E12000002, E12999901]"
1610,E08000008,23.3,504.5,579.2,4.2,,400000.0,"[E47000001, E12000002, E12999901]"
1611,E08000009,23.3,504.5,579.2,4.2,,400000.0,"[E47000001, E12000002, E12999901]"
1612,E08000010,23.3,504.5,579.2,4.2,,400000.0,"[E47000001, E12000002, E12999901]"


In [6]:
data_copy2 = data.copy()
data_copy2['ancestors'] = data_copy2['ancestors'].apply(tuple)
for col in cols_to_norm:
    #df = data.loc[f'{col}'].copy()
    df_dropped = data_copy2.dropna(subset=[f'{col}'])
    mean_avg = df_dropped.groupby("ancestors")[f'{col}'].mean(numeric_only=True).reset_index()
    null_indices = data_copy2[data_copy2.isnull().any(axis=1)].index
    null_ancestors = data_copy2[data_copy2.isnull().any(axis=1)]['ancestors']
    for idx, null_value in zip(null_indices, null_ancestors):
        #print(data_copy2.loc[idx, f'{col}'])
        if pd.isna(data_copy2.loc[idx, f'{col}']):
            print(data_copy2.loc[idx, f'{col}'])
            #print(mean_avg.loc[mean_avg['ancestors'] == null_value])
        #print(mean_avg)
        #print(mean_avg.loc[mean_avg['ancestors']==null_ancestor, f'{col}'])
        #data_copy2.loc[idx, f'{col}'] = mean_avg.loc[mean_avg['ancestors']==null_ancestor, f'{col}']

    if col == 'percent_in_low_income':
        break
    #print(mean_avg)
data_copy2


nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan


Unnamed: 0,geography_code,economic_inactivity_16_64,percent_in_low_income,unemployment_rate_16_64,ancestors
1603,E08000001,26.2,,6.0,"(E47000001, E12000002, E12999901)"
1604,E08000002,18.5,,2.7,"(E47000001, E12000002, E12999901)"
1605,E08000003,27.9,,6.0,"(E47000001, E12000002, E12999901)"
1606,E08000004,24.0,,3.1,"(E47000001, E12000002, E12999901)"
1607,E08000005,30.9,,2.8,"(E47000001, E12000002, E12999901)"
1608,E08000006,25.0,,7.1,"(E47000001, E12000002, E12999901)"
1609,E08000007,17.1,,5.1,"(E47000001, E12000002, E12999901)"
1610,E08000008,21.6,,3.7,"(E47000001, E12000002, E12999901)"
1611,E08000009,22.8,,4.5,"(E47000001, E12000002, E12999901)"
1612,E08000010,19.3,,5.9,"(E47000001, E12000002, E12999901)"


In [7]:
['1', '2', '3'] == ['1', '2', '3']

True