In [1]:
import os
import numpy as np
import pandas as pd
import plotly.io as pio
import plotly.graph_objects as go
from sklearn.utils import resample

pio.templates.default = "plotly_white"

In [2]:
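# On the first run only, move the working directory one level up so that
# the project-local `_aux` package can be imported.
try:
    _ = first_run
except NameError:
    first_run = True
    os.chdir(os.path.dirname(os.getcwd()))
    from _aux import functions as func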

# Load data

In [3]:
# Read features and labels once, then split the rows by label.
train = pd.read_csv(
    "../data/train/X_train.csv",
    index_col=0,
    usecols=["row_id", "age", "name_in_email"],
).join(pd.read_csv("../data/train/y_train.csv", index_col=0))

default = train.query("default == 1")
not_default = train.query("default == 0")

## 1. Age
Be it due to less financial stability, more impulsive behaviour or a diminished ability to weigh consequences, common knowledge tells us that younger customers should be more likely to default on their payments than their older counterparts. However, time and time again, common knowledge has proved to be a rather flimsy ally when making decisions and predictions. Next, we test the null hypothesis that "customers who default are the same age as those who don't" against the alternative hypothesis that "customers who default are more likely to be younger".


In [4]:
fig = go.Figure()

fig.add_trace(
    go.Histogram(
        x=not_default.age.sample(1000, replace=True, random_state=42),
        name="not_default",
        histfunc="count",
        # histnorm='probability',
        xbins=dict(start=18, end=100, size=5),
    )
)

fig.add_trace(
    go.Histogram(
        x=default.age.sample(1000, replace=True, random_state=42),
        name="default",
        histfunc="count",
        # histnorm='probability',
        xbins=dict(start=18, end=100, size=5),
    )
)

fig.update_layout(title="Are youngsters more likely to default?", barmode="overlay")

fig.update_traces(opacity=0.6)
fig.show()

Although the histograms for a random sample of each label suggest that customers who default are younger, a visual comparison is not rigorous enough to draw any conclusion. Hence, we run a bootstrap analysis of the difference in means between the two groups to obtain more robust evidence.

The rationale behind our test is the following. Given that two samples are randomly drawn from the same population*, what is the distribution of the difference of their means? With that distribution in hand, we can look at the difference observed between the mean of the "default" sample and the mean of the "not default" sample and judge how likely such a result is to happen by random chance. The more unlikely the result is to happen by chance, the more confident we are that it is significant.

\*referring to the original sample, not the actual population from which it is drawn.
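
As a compact restatement of the test (the notation here is introduced only for illustration), let $B$ be the number of bootstrap iterations, $\bar{x}^{(1)}_b$ and $\bar{x}^{(2)}_b$ the means of the two samples drawn from the pooled data in iteration $b$, and $d_{\text{obs}}$ the observed difference in mean age between the "default" and "not default" samples. The one-sided p-value is then estimated as the share of bootstrap differences that fall below the observed one:

$$
d_b = \bar{x}^{(1)}_b - \bar{x}^{(2)}_b,
\qquad
d_{\text{obs}} = \bar{x}_{\text{default}} - \bar{x}_{\text{not default}},
\qquad
\hat{p} = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\left[ d_b < d_{\text{obs}} \right]
$$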

In [5]:
num_iterations = 100_000
sample1 = []
sample2 = []
# Each iteration pools balanced subsamples of both classes, then draws two
# bootstrap samples from that pool to simulate the null hypothesis.

for i in range(num_iterations):
    np.random.seed(i)
    combined = np.concatenate(
        (
            default.age.sample(1_000, replace=True),
            not_default.age.sample(1_000, replace=True),
        ),
        axis=0,
    )
    sample1.append(resample(combined, n_samples=500))
    sample2.append(resample(combined, n_samples=500))

diff_bootstrap_means = np.mean(sample1, axis=1) - np.mean(sample2, axis=1)

observed_difference = np.mean(
    default.age.sample(1_000, replace=True, random_state=42)
) - np.mean(not_default.age.sample(1_000, replace=True, random_state=42))

p_value = (
    diff_bootstrap_means[diff_bootstrap_means < observed_difference].shape[0]
    / num_iterations
)

We choose to combine fixed-size random subsamples of the "default" and "not default" samples to balance both classes. The rationale is that if we combine them as they are, "default" will make up less than 15% of observations, which would lead to an unfair comparison of the difference in means. Bootstrap samples would mostly be composed of "not default" observations, which could make the observed difference look artificially more unlikely.
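
The balancing idea can be tried on synthetic data. The following is a minimal sketch, not the notebook's implementation: the function name `balanced_bootstrap_pvalue`, its parameters, and the two normally distributed age samples are all assumptions made for illustration, and far fewer iterations are used to keep it quick.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two age samples (assumed values, not the real data).
# The "default" group is deliberately much smaller, mimicking the class imbalance.
default_age = rng.normal(loc=35, scale=10, size=300)
not_default_age = rng.normal(loc=40, scale=10, size=20_000)


def balanced_bootstrap_pvalue(a, b, n_balance=1_000, n_draw=500, n_iter=2_000, seed=0):
    """One-sided p-value for mean(a) < mean(b) under a balanced pooled null."""
    rng = np.random.default_rng(seed)
    # Observed difference between equal-sized resamples of each class.
    observed = rng.choice(a, n_balance).mean() - rng.choice(b, n_balance).mean()
    diffs = np.empty(n_iter)
    for i in range(n_iter):
        # Pool equal-sized subsamples so neither class dominates the null distribution.
        pooled = np.concatenate((rng.choice(a, n_balance), rng.choice(b, n_balance)))
        # Two samples drawn with replacement from the same pool share one distribution.
        diffs[i] = rng.choice(pooled, n_draw).mean() - rng.choice(pooled, n_draw).mean()
    # Share of null differences that are more extreme (lower) than the observed one.
    return observed, np.mean(diffs < observed)


observed, p_value = balanced_bootstrap_pvalue(default_age, not_default_age)
print(f"observed difference: {observed:.2f}, p-value: {p_value:.4f}")
```

Using a dedicated `np.random.default_rng` generator instead of reseeding the global state on every iteration keeps the run reproducible with a single seed.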

In [6]:
fig = go.Figure()

fig.add_trace(
    go.Histogram(
        x=diff_bootstrap_means,
        name="sample_difference",
        histfunc="count",
    )
)

fig.add_vline(
    x=observed_difference,
    line_width=3,
    line_color="red",
    line_dash="dash",
    annotation_text="Observed",
)

fig.update_layout(
    title="Distribution of difference in means for 2 bootstrap samples",
    barmode="overlay",
)

fig.update_traces(opacity=0.75)
fig.update_xaxes(range=[-5, 5])
fig.show()

As we can see, the observed difference is very unlikely to happen by chance, even in a balanced population* sample. This should give us confidence to reject the null hypothesis that "customers who default are the same age as those who don't" in favor of the alternative hypothesis that "customers who default are more likely to be younger". Indeed, common sense does prevail at times. We will definitely use this variable as a feature for our models.

## 2. Name in email
Right off the bat, there isn't much intuition on how a person's good or bad taste in email handles relates to the likelihood of them defaulting on payments. So, it seems to be a good idea to look at how observations are distributed among the possible classes.

In [7]:
default_email_dist = default.name_in_email.value_counts()
not_default_email_dist = not_default.name_in_email.value_counts()

fig = go.Figure()

fig.add_trace(
    go.Bar(
        x=not_default_email_dist.index.to_list(),
        y=(not_default_email_dist.values / np.sum(not_default_email_dist)),
        name="not_default",
    )
)

fig.add_trace(
    go.Bar(
        x=default_email_dist.index.to_list(),
        y=(default_email_dist.values / np.sum(default_email_dist)),
        name="default",
    )
)

fig.update_layout(
    title="Distribution of observations across 'name_in_email' classes",
    barmode="group",
    yaxis_title="Percentage",
    yaxis_tickformat="%",
)

fig.update_traces(opacity=0.75)
fig.show()

The plot above shows what percentage of observations fall into each category, considering the "default" and "not_default" samples independently. There still seems to be no clear distinction between how defaulting and non-defaulting clients choose their email handles. However, we can change the perspective from which we look at the problem and pose it as follows:
- Are the differences in default proportions across categories due to chance, or are they statistically significant?


In [11]:
cat_profile = (
    pd.concat(
        [
            default[["name_in_email", "default"]],
            not_default[["name_in_email", "default"]],
        ]
    )
    .groupby("name_in_email")
    .agg(
        default=("default", "sum"),
        not_default=("default", func.complement),
        count=("default", "count"),
    )
    .transform(lambda s: s.astype(int))
)

cat_profile

| name_in_email | default | not_default | count |
|---|---|---|---|
| F | 93 | 6919 | 7012 |
| F+L | 365 | 28727 | 29092 |
| F1+L | 67 | 5188 | 5255 |
| Initials | 0 | 17 | 17 |
| L | 21 | 913 | 934 |
| L1+F | 132 | 11421 | 11553 |
| Nick | 114 | 5854 | 5968 |
| no_match | 238 | 11911 | 12149 |
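
The helper `func.complement` lives in the project's `_aux` module and is not shown in this notebook. Judging from the table, where `not_default` equals `count` minus `default`, it presumably counts the zeros in a 0/1 column. A hypothetical stand-in, run on a toy frame with made-up rows, could look like this:

```python
import pandas as pd


def complement(labels: pd.Series) -> int:
    """Count the zeros in a 0/1 label column.

    Hypothetical stand-in for func.complement, whose source is not shown.
    """
    return int(labels.count() - labels.sum())


# Toy data (made up) with the same two columns used in the aggregation.
toy = pd.DataFrame(
    {
        "name_in_email": ["F", "F", "L", "L", "L"],
        "default": [1, 0, 0, 0, 1],
    }
)

profile = toy.groupby("name_in_email").agg(
    default=("default", "sum"),
    not_default=("default", complement),
    count=("default", "count"),
)
print(profile)
```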


In [9]:
fig = go.Figure()

fig.add_trace(
    go.Bar(
        x=cat_profile.index.to_list(),
        y=(cat_profile["not_default"] / cat_profile["count"]),
        name="not_default",
    )
)

fig.add_trace(
    go.Bar(
        x=cat_profile.index.to_list(),
        y=(cat_profile["default"] / cat_profile["count"]),
        name="default",
    )
)

fig.update_layout(
    title="Proportion of defaults within each 'name_in_email' class",
    barmode="relative",
    yaxis_title="Percentage",
    yaxis_tickformat="%",
)

fig.update_traces(opacity=0.75)
fig.show()

It still looks like there is no significant difference between the proportions across classes, but it could also be that the imbalance between "default" and "not_default" is hiding such a difference. Thus, we employ a K-proportions test[\[pdf\]][1] to test the null hypothesis that "the probability with which defaults happen is the same across all categories" against the alternative hypothesis that "the probability with which defaults happen differs across some categories".

[1]: https://web.williams.edu/Mathematics/sjmiller/public_html/BrownClasses/162/Handouts/StatsTests04.pdf
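
The source of `func.test_k_prop` is not shown here, so the following is only a sketch of the standard chi-square computation for comparing K proportions, assuming that is what the helper does. It uses the counts from the `cat_profile` table above and NumPy alone.

```python
import numpy as np

# Observed counts per category, copied from the cat_profile table.
categories = ["F", "F+L", "F1+L", "Initials", "L", "L1+F", "Nick", "no_match"]
observed = np.array(
    [
        [93, 6919],
        [365, 28727],
        [67, 5188],
        [0, 17],
        [21, 913],
        [132, 11421],
        [114, 5854],
        [238, 11911],
    ]
)  # columns: default, not_default

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# Expected counts if every category shared the overall default rate.
expected = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic, summed over every cell of the table.
chi_square = ((observed - expected) ** 2 / expected).sum()

# (rows - 1) * (columns - 1) degrees of freedom.
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# The critical value at the 5% level could be obtained with
# scipy.stats.chi2.ppf(0.95, dof) if SciPy is available.
print(f"chi-square = {chi_square:.3f} with {dof} degrees of freedom")
```

The null hypothesis is rejected when the statistic exceeds the critical value of the chi-square distribution at the chosen significance level.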

In [10]:
_, result_df = func.test_k_prop(cat_profile)

result_df

Using 7 degrees of freedom
Reject null hypothesis with 53.08205304843534 > 14.067140449340167


| name_in_email | default | not_default | count | expected_default | expected_not_default | chi_default | chi_not_default |
|---|---|---|---|---|---|---|---|
| F | 93 | 6919 | 7012 | 100.338427 | 6911.661573 | 0.536709 | 0.007792 |
| F+L | 365 | 28727 | 29092 | 416.292859 | 28675.707141 | 6.319968 | 0.091749 |
| F1+L | 67 | 5188 | 5255 | 75.196582 | 5179.803418 | 0.893444 | 0.01297 |
| Initials | 0 | 17 | 17 | 0.243262 | 16.756738 | 0.243262 | 0.003531 |
| L | 21 | 913 | 934 | 13.365101 | 920.634899 | 4.361484 | 0.063317 |
| L1+F | 132 | 11421 | 11553 | 165.318005 | 11387.681995 | 6.714873 | 0.097482 |
| Nick | 114 | 5854 | 5968 | 85.399278 | 5882.600722 | 9.578551 | 0.139054 |
| no_match | 238 | 11911 | 12149 | 173.846485 | 11975.153515 | 23.674183 | 0.343684 |


Results show that the probability with which defaults happen differs across some categories, and is notably higher when the category is "no_match". This opens the possibility of creating a binary feature (match or no match) instead of keeping a categorical one. Let's examine it:
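
One way to look at the binary version is to collapse the counts of the `cat_profile` table into two groups and compare default rates. This is a sketch under the assumption that every category other than "no_match" counts as a match; the names `labels` and `binary_profile` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Counts copied from the cat_profile table.
cat_profile = pd.DataFrame(
    {
        "default": [93, 365, 67, 0, 21, 132, 114, 238],
        "count": [7012, 29092, 5255, 17, 934, 11553, 5968, 12149],
    },
    index=["F", "F+L", "F1+L", "Initials", "L", "L1+F", "Nick", "no_match"],
)

# Hypothetical binary feature: every category except "no_match" is a match.
labels = np.where(cat_profile.index == "no_match", "no_match", "match")

# Sum the counts within each of the two groups, then compute default rates.
binary_profile = cat_profile.groupby(labels).sum()
binary_profile["default_rate"] = binary_profile["default"] / binary_profile["count"]

print(binary_profile)
```

With these counts, the default rate comes out at roughly 1.96% for "no_match" against roughly 1.32% for all matching categories combined. Note that collapsing also hides the elevated rates of "L" and "Nick" seen in the table, so whether the binary feature is preferable remains a modelling choice.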

---

Despite the rather counterintuitive result, our exploration has provided some evidence that "name_in_email" is a suitable candidate feature for our models. Furthermore, results suggest that common sense prevails when it comes to defaults and age, which makes "age" a good candidate feature as well. Hence, we move forward with 2 candidate features from the "personal" variables:
- name_in_email
- age

Next, we move on to exploring variables related to account status.