# Cogo Data Science technical interview task: Visualise and cluster

From the introductory document:

>## What will the format be?

>We have prepared a coding task that we would like you to prepare for the interview. You will find the data and the data dictionary attached. The data consists of 2070 rows of Cogo app users (1 row per user), their demographic data as well as their sustainability values (badges) and the climate actions they committed to in the app (actions). If a person has selected a sustainability badge, or committed to a climate action, it is marked as 1 , otherwise 0.

>We don’t expect you to spend more than 2 hours on the task!

>This is an exploratory exercise and we want you to have fun with the data. We’d like you to work in Python, maybe Jupyter Notebook so you can take us through the code and results on the day of the interview.

>### Here are a couple of suggestions you could be looking at:
>- Exploratory Analysis: Spend some time exploring the data, using some data
visualisation and descriptive statistics. Tell us about interesting things you come
across.
>- Clustering: Can you identify any clusters in the Badge data, or Action data? Can you
infer one from the other? If you find any clusters in the data, how would you interpret
those?
>- Application: Based on your findings, what are some recommendations you would
take to the product team?

## What is in this notebook?

This notebook contains an analysis of the relationships between the demographic, action and badge variables provided in `Interview2_DataChallenge.csv`.

If I have time I might do some unsupervised machine learning to provide some deeper insight.

In [86]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px

from lib import USER_DATA_PATH

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
user_data = pd.read_pickle(USER_DATA_PATH)
user_data

Unnamed: 0,person_id,gender,age,action_no_red_meat,action_vegan,action_vegetarian,action_composting,action_using_renewable_energy,action_buying_secondhand,action_meat_free_monday,...,badge_carb_con,badge_vegan,badge_inv,badge_carb_neutral,badge_sus_sourced,badge_cruel_free,badge_fair_trade,badge_org,badge_liv_wage,badge_sup_charities
0,0008afd9-12df-491b-8883-a559cb451daf,Female,"[18.0, 29.0]",0,0,0,0,0,1,0,...,1,1,0,1,0,1,0,0,0,0
1,000c11c0-85fc-4a77-bf3c-69be1ee99b4b,Female,"[31.0, 40.0]",0,1,0,0,0,0,0,...,0,1,0,0,1,1,0,1,0,0
2,00218abc-126b-4bc7-b7c7-f65999f70d1d,Male,"[31.0, 40.0]",0,1,0,1,1,0,0,...,1,1,0,1,1,1,1,1,1,0
3,002d9ea9-d9c7-4461-a019-92375df223f9,Female,"[18.0, 29.0]",0,1,0,1,1,1,0,...,1,1,0,1,1,1,1,1,1,1
4,0097a8dd-df9a-49dd-92e8-c84509e23a06,Male,"[31.0, 40.0]",1,0,0,0,1,1,0,...,1,0,0,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2065,ffd86f00-4f08-487d-bfea-838c805a8dbc,Other/Unknown,,0,1,0,0,0,1,0,...,1,1,0,1,1,1,0,0,1,0
2066,ffe84c90-d1ee-44ea-872d-d2357293ceab,Female,"[18.0, 29.0]",0,0,0,0,0,1,0,...,1,0,0,1,1,1,0,0,0,1
2067,ffeab5eb-2ab5-4840-9605-b99fd92167aa,Female,"[18.0, 29.0]",0,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,1,1
2068,fffb6c15-e0a7-4c96-a784-73fdb58fd83f,Other/Unknown,,0,1,0,1,1,1,0,...,0,0,0,1,1,0,1,1,0,0


## Visualise the relationships between variables

If this data was primarily continuous then tradition would dictate that I would use a `seaborn.pairplot` at this stage, to visualise the relationships between the columns across all combinations of pairs.

Instead, I'm going to (try to) compute the ratio of conditional probabilities which, like a correlation for continuous variables, is a measure of the association between binary variables. I'll then visualise that with a heatmap. Hopefully that will be a bit more informative.

**I have my fingers crossed that this information will prove useful for serving action suggestions to users based on their demographic information and badges.**

First we will encode both the `gender` and `age` with a one-hot encoding so that they are binary variables as well.

In [56]:
# here we replace the NA values in `age` with 'age_unknown' because having NAs in the column names doesn't feel like a good idea
# we'll then turn those intervals into strings so I can index against them a bit more easily
gender_age_onehot = pd.concat([
    pd.get_dummies(user_data.gender, prefix = 'gender'),
    pd.get_dummies(user_data.age.astype('object').fillna('unknown').astype(str), prefix = 'age')],
    axis = 1)
gender_age_onehot

Unnamed: 0,gender_Female,gender_Male,gender_Other/Unknown,"age_[18.0, 29.0]","age_[31.0, 40.0]","age_[41.0, 55.0]","age_[56.0, 100.0]",age_unknown
0,1,0,0,1,0,0,0,0
1,1,0,0,0,1,0,0,0
2,0,1,0,0,1,0,0,0
3,1,0,0,1,0,0,0,0
4,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...
2065,0,0,1,0,0,0,0,1
2066,1,0,0,1,0,0,0,0
2067,1,0,0,1,0,0,0,0
2068,0,0,1,0,0,0,0,1


In [57]:
# just to check that worked, the sum in axis = 1 should all equal 2
assert (gender_age_onehot.sum(axis = 1) == 2).all()

Now we can concatenate our one-hot encodings with the actions and badges to get our fully binarised data.

We will also remove the `badge_inv` column because it is not going to provide us with any insight into behaviour (in this sample nobody chose that badge).

In [71]:
action_names = [n for n in user_data.columns if n.startswith('action')]
badge_names = [n for n in user_data.columns if n.startswith('badge')]

user_data_onehot = pd.concat([gender_age_onehot, user_data[action_names + badge_names]], axis = 1)
user_data_onehot = user_data_onehot.drop('badge_inv', axis = 1)

## The conditional probability ratio: an example

As an example I'll compute the conditional probability of `action_vegan=1` given `age_[18.0, 29.0]=0` and the conditional probability of `action_vegan=1` given `age_[18.0, 29.0]=1`.

In LaTeX notation:

Let the events:

$ (\text{action_vegan} = 1) = A $

$ (\text{age_[18.0, 29.0]} = 1) = B $

and

$ (\text{age_[18.0, 29.0]} = 0) = \neg B $

Then the two conditional probabilities are:

$ P(A | \neg B) = \frac{P(A \cap \neg B)}{P(\neg B)}$

and:

$ P(A | B) = \frac{P(A \cap B)}{P(B)}$

The ratio:

$ \frac{P(A | B)}{P(A | \neg B)} $

then tells us how much _more likely_ a user aged under 30 is to commit to a vegan diet than a user outside that bracket (older users and users of unknown age).

Interpretation:
* If the conditional probability ratio is equal to 1 then the two conditional probabilities are equal: a user is equally likely to choose $A$ whether $B$ or $\neg B$ holds
* If the conditional probability ratio is above 1 then the user is more likely to choose $A$ when $B$ holds
* If the conditional probability ratio is below 1 then the user is more likely to choose $A$ when $\neg B$ holds
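
To make the interpretation concrete, here is a minimal sketch using made-up counts (they are not taken from the Cogo data). The helper name `ratio_from_counts` is hypothetical, and exact fractions are used only so that the arithmetic is easy to check by hand.

```python
from fractions import Fraction


def ratio_from_counts(n_a_and_b, n_b, n_a_and_not_b, n_not_b):
    """Return P(A | B) / P(A | not B) computed from raw user counts."""
    p_a_given_b = Fraction(n_a_and_b, n_b)
    p_a_given_not_b = Fraction(n_a_and_not_b, n_not_b)
    return p_a_given_b / p_a_given_not_b


# Invented toy numbers: 30 of 100 users with B chose A,
# and 20 of 100 users without B chose A.
ratio = ratio_from_counts(30, 100, 20, 100)
print(ratio, float(ratio))  # 3/2 1.5
```

A ratio of 1.5 here falls under the second bullet: in the toy data, a user for whom $B$ holds is one and a half times as likely to choose $A$.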

In [72]:
under_30 = user_data_onehot['age_[18.0, 29.0]']
action_vegan = user_data_onehot['action_vegan']

contingency_table = pd.crosstab(under_30, action_vegan)
conditional_prob = contingency_table[1] / contingency_table.sum(axis = 1)
conditional_prob_ratio = conditional_prob[1] / conditional_prob[0]
conditional_prob_ratio

1.5738093194885439

We see that a user who is under 30 is about 1.6 times _more_ likely to commit to becoming vegan than a user outside that age bracket.

I'm going to apply this to identify how likely a user is to choose an action given their badges and demographics.

In [99]:
def compute_conditional_prob_ratio(event_A, given_event_B):
    """ Compute the conditional probability ratio: P(event_A = 1 | given_event_B = 1) / P(event_A =1 | given_event_B = 0)
    """
    contingency_table = pd.crosstab(given_event_B, event_A)
    conditional_prob = contingency_table[1] / contingency_table.sum(axis = 1)
    ratio = conditional_prob[1] / conditional_prob[0]
    if np.isinf(ratio):
        return pd.NA
    return ratio

# test the example above
compute_conditional_prob_ratio(user_data_onehot['action_vegan'], user_data_onehot['age_[18.0, 29.0]'])

1.5738093194885439
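
The function above assumes that both variables take both values. A column with a single value (such as `badge_inv`, which nobody selected) would most likely raise a `KeyError` on the crosstab lookup, and a 0/0 case would give `NaN` rather than `inf`. The sketch below is one possible defensive variant, not the version used in this notebook: the name `safe_conditional_prob_ratio` is hypothetical, and it returns `NaN` whenever the ratio is undefined.

```python
import numpy as np
import pandas as pd


def safe_conditional_prob_ratio(event_a, given_event_b):
    """P(A = 1 | B = 1) / P(A = 1 | B = 0), or NaN when it is undefined."""
    a = pd.Series(event_a).astype(bool)
    b = pd.Series(given_event_b).astype(bool)
    n_b, n_not_b = b.sum(), (~b).sum()
    # B is always true or always false: one conditional probability does not exist
    if n_b == 0 or n_not_b == 0:
        return np.nan
    p_a_given_b = (a & b).sum() / n_b
    p_a_given_not_b = (a & ~b).sum() / n_not_b
    # Nobody without B chose A: the ratio would be infinite (or 0/0)
    if p_a_given_not_b == 0:
        return np.nan
    return p_a_given_b / p_a_given_not_b


# Toy data (invented): 2 of 3 users with B chose A, 1 of 3 without B chose A
toy_a = [1, 1, 0, 0, 1, 0]
toy_b = [1, 1, 1, 0, 0, 0]
toy_ratio = safe_conditional_prob_ratio(toy_a, toy_b)
print(toy_ratio)
```

Counting with boolean masks avoids depending on which labels happen to appear in the contingency table.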

In [91]:
# I would love to use the corr() method for this, but it always returns a symmetric dataframe, and this metric is not symmetric
conditional_prob_ratios = pd.DataFrame(
    index = pd.Index(user_data_onehot.drop(action_names, axis = 1).columns, name = 'given_event'),
    columns = pd.Index(action_names, name = 'event'))

# `.items()` replaces `.iteritems()`, which was removed in pandas 2.0
for event_name, event_col in user_data_onehot[action_names].items():
    for given_event_name, given_event_col in user_data_onehot.drop(action_names, axis = 1).items():
        conditional_prob_ratios.loc[given_event_name, event_name] = compute_conditional_prob_ratio(event_col, given_event_col)

conditional_prob_ratios

event,action_no_red_meat,action_vegan,action_vegetarian,action_composting,action_using_renewable_energy,action_buying_secondhand,action_meat_free_monday,action_use_public_transport,action_drive_electric_vehicle
given_event,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
gender_Female,0.972124,1.207328,1.113053,1.011926,0.683404,1.49068,0.961441,0.776549,0.690265
gender_Male,1.156738,0.799802,0.919125,1.115696,1.470917,0.606157,0.420146,1.043204,1.035361
gender_Other/Unknown,0.862654,0.982379,0.943889,0.850629,1.047798,0.961113,2.135807,1.341844,1.556088
"age_[18.0, 29.0]",0.797117,1.573809,1.302752,0.698585,0.553818,1.317455,0.382987,0.727675,0.389826
"age_[31.0, 40.0]",1.189593,0.726655,0.742515,1.309763,1.541526,0.792265,0.649701,0.98068,2.227545
"age_[41.0, 55.0]",1.65081,0.525159,0.733206,1.553754,1.485705,0.762517,2.898879,1.380419,1.46162
"age_[56.0, 100.0]",1.556851,0.304274,0.94825,1.970052,2.408364,0.651537,0.0,0.0,0.0
age_unknown,0.800054,0.903209,0.988242,0.928571,1.118347,0.927829,2.049261,1.395242,1.366174
badge_reduce_waste,1.050621,0.940462,1.296698,1.12518,0.969389,1.374928,0.673368,0.750865,0.351322
badge_coop,1.09398,1.031276,0.979061,1.171997,1.196973,1.258278,0.968085,1.316596,1.584139


I don't have a lot of time to interpret this, so let's summarise the highest 10 and lowest 10 as these are good candidates for describing the strongest relationships and potentially feeding into a prediction model of what actions people might choose.

In [96]:
conditional_prob_ratios_melted = (
    pd.melt(
        conditional_prob_ratios,
        value_name = 'cond_prob',
        ignore_index = False)
    .reset_index()
    .sort_values(by = 'cond_prob')
    .loc[:, ['event', 'given_event', 'cond_prob']] 
)

### Strongest positive relationships

The ten largest values of the ratio $ \frac{P(A | B)}{P(A | \neg B)} $

In [98]:
conditional_prob_ratios_topten = conditional_prob_ratios_melted.iloc[-10:, :]
conditional_prob_ratios_topten

Unnamed: 0,event,given_event,cond_prob
66,action_composting,"age_[56.0, 100.0]",1.970052
127,action_meat_free_monday,age_unknown,2.049261
139,action_meat_free_monday,badge_sup_charities,2.071704
122,action_meat_free_monday,gender_Other/Unknown,2.135807
177,action_drive_electric_vehicle,badge_org,2.176735
164,action_drive_electric_vehicle,"age_[31.0, 40.0]",2.227545
173,action_drive_electric_vehicle,badge_carb_neutral,2.229656
86,action_using_renewable_energy,"age_[56.0, 100.0]",2.408364
125,action_meat_free_monday,"age_[41.0, 55.0]",2.898879
32,action_vegan,badge_vegan,9.786578


Perhaps unsurprisingly, a user who chooses `badge_vegan` is almost 10 times more likely to choose `action_vegan` than a user who does not choose `badge_vegan`.

Users in the age bracket `[41, 55]` are close to 3 times more likely to choose `action_meat_free_monday` than those outside of this bracket.

People in the oldest age bracket `[56, 100]` are 2.4 times more likely to commit to `action_using_renewable_energy`.

People who chose `badge_carb_neutral` are 2.2 times more likely to commit to `action_drive_electric_vehicle`.

Indeed, the top ten is really dominated by the actions `action_meat_free_monday` and `action_drive_electric_vehicle`.

Let's explore this further by digging into the combinations of the `given_event` columns.

**They are already doing it, versus they want to do it**

In [107]:
event_name = 'action_drive_electric_vehicle'
event = user_data_onehot[event_name]

# get the `given_event`s paired with this action in the top-ten frame (three of them here)
given_event_names = conditional_prob_ratios_topten.loc[
    conditional_prob_ratios_topten.event == event_name,
    'given_event']

# not needed for this action: for `action_meat_free_monday` we would drop `age_unknown` here,
# as it is mutually exclusive with `age_[41.0, 55.0]`
# given_event_names = given_event_names.iloc[1:]

# now the `given_event` becomes when _all_ of these events are true for the user 
given_event = user_data_onehot[given_event_names].all(axis = 1)

compute_conditional_prob_ratio(event, given_event)

3.555147058823529

In [108]:
pd.crosstab(given_event, event)

action_drive_electric_vehicle,0,1
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1
False,1918,16
True,132,4


In [112]:
conditional_prob_ratios_bottom_ten = conditional_prob_ratios_melted.iloc[:10].copy()
conditional_prob_ratios_bottom_ten['cond_prob'] = 1. / conditional_prob_ratios_bottom_ten['cond_prob']
conditional_prob_ratios_bottom_ten

Unnamed: 0,event,given_event,cond_prob
146,action_use_public_transport,"age_[56.0, 100.0]",inf
166,action_drive_electric_vehicle,"age_[56.0, 100.0]",inf
126,action_meat_free_monday,"age_[56.0, 100.0]",inf
132,action_meat_free_monday,badge_vegan,9.748495
26,action_vegan,"age_[56.0, 100.0]",3.286517
168,action_drive_electric_vehicle,badge_reduce_waste,2.84639
123,action_meat_free_monday,"age_[18.0, 29.0]",2.611055
163,action_drive_electric_vehicle,"age_[18.0, 29.0]",2.565247
12,action_no_red_meat,badge_vegan,2.384981
121,action_meat_free_monday,gender_Male,2.380123


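Users in the `[56, 100]` age bracket never committed to `action_use_public_transport`, `action_drive_electric_vehicle` or `action_meat_free_monday`: their ratio is 0, so the inverse shown here is infinite.

Two interpretations here:
* They don't want to do it
* They are doing it anyway

For a vegan, every Monday is a meat-free Monday, which would explain why users with `badge_vegan` are almost 10 times less likely to choose `action_meat_free_monday`.

On the other hand, I'm not super surprised that `gender_Male` users are 2.4 times less likely than users of other genders to choose `action_meat_free_monday`. Is that because they are already vegetarian, or because they don't want to stop eating meat?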

It would be useful to have information about what users are already doing, rather than what their values are.

That would certainly make these trends more interpretable.