### `Creating a multi-variable rank: Introduction`

We are interested in `triangulating` data for the JRF project. This means using multiple data sources to come up with new insights and information. Our first, `very simple` example of this is a multi-variable rank. Much of our data can be ranked against the same statistic for other places at the same level of geography (in practice we could also compare against, say, the average value of the containing geography).

Whether a high rank is good or bad is arbitrary, and depends on the context of each statistic used. For example, in the context of poverty, a low level of unemployment is good, but low wages are bad. We will therefore assume the following:
- A high rank (e.g. 1st out of 10) corresponds to a statistic that is `least likely to cause poverty`. For example, an unemployment rate of 0% would rank 1st.
- A low rank (e.g. 10th out of 10) corresponds to a statistic that is `most likely to cause poverty`. For example, a house-price-to-wage ratio of 100,000 would rank 10th.
- Each statistic is `equally likely` to cause poverty.
- We can create an aggregated rank using a `mean-average` of the ranked statistics.
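
The assumptions above can be sketched with two statistics that point in opposite directions. This is a minimal illustration only: the places, the column names and the figures are made up, and `median_wage` is a hypothetical statistic that does not appear in the project data.

```python
import pandas as pd

# Hypothetical figures for three made-up places; not taken from the project data.
toy = pd.DataFrame(
    {
        "unemployment_rate": [3.0, 5.5, 4.2],  # lower is better
        "median_wage": [31000, 26000, 28500],  # higher is better
    },
    index=["Place X", "Place Y", "Place Z"],
)

# ascending=True gives rank 1 to the smallest value (suits unemployment).
toy["rank_unemployment"] = toy["unemployment_rate"].rank(method="min", ascending=True)
# ascending=False gives rank 1 to the largest value (suits wages).
toy["rank_wage"] = toy["median_wage"].rank(method="min", ascending=False)

# Equal weighting: the aggregated score is the plain mean of the ranks.
toy["mean_rank"] = toy[["rank_unemployment", "rank_wage"]].mean(axis=1)
print(toy)
```

In both rank columns, 1st always means "least likely to cause poverty", which is what makes the ranks comparable enough to average.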

**Disclaimer**: this is a proof of concept and an initial attempt at triangulating poverty data. In practice, we know that some factors will affect poverty more than others. As we develop our model we will incorporate more features to make the ranking more accurate.
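
One possible way to relax the equal-weighting assumption later is a weighted mean of the ranks. The sketch below is not part of the current model: the weights are invented for illustration, and the places and rank values are hypothetical.

```python
import numpy as np
import pandas as pd

# Ranks for three hypothetical places (1 = least likely to cause poverty).
ranks = pd.DataFrame(
    {"Rank unemployment": [1.0, 3.0, 2.0], "Rank low_income": [2.0, 1.0, 3.0]},
    index=["Place X", "Place Y", "Place Z"],
)

# Hypothetical weights: assumed for illustration only, not estimated from data.
weights = pd.Series({"Rank unemployment": 1.0, "Rank low_income": 3.0})

# Weighted mean of the ranks in each row; equal weights recover the plain mean.
ranks["weighted_mean_rank"] = np.average(
    ranks[list(weights.index)].to_numpy(), axis=1, weights=weights.to_numpy()
)
ranks["overall_rank"] = ranks["weighted_mean_rank"].rank(method="min")
print(ranks)
```

With equal weights Place X would come first; with these assumed weights Place Y overtakes it, which shows how sensitive the overall rank can be to the choice of weights.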

### `Creating a multi-variable rank: Toy model`

In [1]:
import pandas as pd
import numpy as np

# create some dummy data
array = np.random.randint(10, size=(3, 5))

In [2]:
# Create a "lowest score wins" rank. Set ascending=False for any column where the highest score should win.
df = pd.DataFrame({'A':array[0], 'B':array[1], 'C': array[2]})
colnames = df.columns.to_list()
ascent_array = [True, True, True] # one flag per column: True ranks the lowest value 1st, False ranks the highest value 1st.

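def multivariable_rank(df, colnames, ascent_array, method='min'):
    """Rank each column in colnames, then rank the mean of those ranks."""
    # one ascending flag is needed per ranked column
    assert len(colnames) == len(ascent_array)
    df = df.copy()  # avoid modifying the caller's DataFrame (or a slice of one)
    rank_colnames = []
    for col, ascending in zip(colnames, ascent_array):
        # method decides the rank of tied elements
        # ascending is the part we will need to change depending on the statistic in future
        df[f'Rank {col}'] = df[col].rank(method=method, ascending=ascending)
        rank_colnames.append(f'Rank {col}')
    # mean of the per-column ranks, then rank that mean (1 = best overall)
    df['mean_rank'] = df[rank_colnames].mean(axis=1)
    df['overall_rank'] = df['mean_rank'].rank(method='min', ascending=True)
    return df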

# we then calculate the mean of the ranks, and rank the mean.
df = multivariable_rank(df, colnames, ascent_array)
df

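| | A | B | C | Rank A | Rank B | Rank C | mean_rank | overall_rank |
|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 6 | 2 | 2.0 | 4.0 | 3.0 | 3.0 | 4.0 |
| 1 | 9 | 3 | 0 | 5.0 | 1.0 | 1.0 | 2.333333 | 1.0 |
| 2 | 7 | 5 | 0 | 4.0 | 3.0 | 1.0 | 2.666667 | 3.0 |
| 3 | 4 | 4 | 2 | 2.0 | 2.0 | 3.0 | 2.333333 | 1.0 |
| 4 | 2 | 8 | 7 | 1.0 | 5.0 | 5.0 | 3.666667 | 5.0 |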
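
The `method` argument controls how ties are ranked, and column C in the output above contains two pairs of ties. As a small sketch, the same values can be ranked with three of the tie-handling options that `pandas` provides:

```python
import pandas as pd

# Column C from the toy output, which contains two pairs of tied values.
c = pd.Series([2, 0, 0, 2, 7], name="C")

tie_handling = pd.DataFrame(
    {
        "C": c,
        "min": c.rank(method="min"),          # ties share the lowest rank in the group
        "average": c.rank(method="average"),  # ties share the mean of their ranks
        "dense": c.rank(method="dense"),      # like min, but with no gaps after ties
    }
)
print(tie_handling)
```

With `method='min'`, two places tied for 1st are both ranked 1 and the next place is ranked 3, which matches the `Rank C` column in the toy output.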


### `Applying the toy model to poverty data`

In [3]:
# data columns to use from place_data.json
usecols = ['geography_code', 'economic_inactivity_16_64', 'percent_in_low_income', 'unemployment_rate_16_64']

# columns to rank
cols_to_rank = ['economic_inactivity_16_64', 'percent_in_low_income', 'unemployment_rate_16_64']

# read the data
data = pd.read_json(r"C:\Users\LukeStrange\Code\jrf-insight\data\interim\place_data.json")

# select only necessary columns and "regions"
data = data.loc[:, usecols]
data = data.loc[data.geography_code.str.startswith('E120')]

# for this example, a LOW value of every metric is less likely to cause poverty, so we set ascending to True for each column.
ascent_array = [True, True, True]

data = multivariable_rank(data, colnames=cols_to_rank, ascent_array=ascent_array)
data


| | geography_code | economic_inactivity_16_64 | percent_in_low_income | unemployment_rate_16_64 | Rank economic_inactivity_16_64 | Rank percent_in_low_income | Rank unemployment_rate_16_64 | mean_rank | overall_rank |
|---|---|---|---|---|---|---|---|---|---|
| 1640 | E12000001 | 26.0 | 21.0 | 4.7 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| 1641 | E12000002 | 23.3 | 19.0 | 4.1 | 2.0 | 1.0 | 2.0 | 1.666667 | 2.0 |
| 1642 | E12000003 | 22.6 | 19.0 | 3.6 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
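
Geography codes are easy to mislabel by hand, so it may help to attach region names in code before reading off the result. The sketch below is an assumption about how that could be done. The `region_names` lookup and the `result` frame are hypothetical helpers, with the rank values copied from the output above rather than read from `place_data.json`.

```python
import pandas as pd

# ONS codes for the three regions that appear in the output.
region_names = {
    "E12000001": "North East",
    "E12000002": "North West",
    "E12000003": "Yorkshire and The Humber",
}

# Values copied from the output table; in the notebook this would be `data`.
result = pd.DataFrame(
    {
        "geography_code": ["E12000001", "E12000002", "E12000003"],
        "overall_rank": [3.0, 2.0, 1.0],
    }
)
result["region"] = result["geography_code"].map(region_names)

# Rank 1 is least likely to be impoverished; the largest rank is most likely.
best = result.loc[result["overall_rank"].idxmin(), "region"]
worst = result.loc[result["overall_rank"].idxmax(), "region"]
print(f"Best ranked: {best}; worst ranked: {worst}")
```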


### `Results`

Based on the rate of economic inactivity, the rate of unemployment and the percentage of people in low-income poverty, `Yorkshire & the Humber (E12000003)` is the least impoverished region in the North of England, while `North East (E12000001)` is the most.