# Even: Data Assignment

## Config

Setting up the environment for the analysis.


**Note**: This notebook uses Python 3.9 via `miniconda`. The environment can be created with `conda create -n even python=3.9`.

In [None]:
%load_ext nb_black

In [None]:
import pytz

import datetime as dt
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
IST_TZ = pytz.timezone("Asia/Kolkata")
# localize() resolves the correct +05:30 offset; passing a pytz zone as tzinfo= would not
DATETIME_NOW_IST = IST_TZ.localize(dt.datetime(year=2023, month=3, day=12))
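
A short check, using only `pytz` and the standard library, of why a `pytz` timezone is usually attached with `localize` rather than passed as `tzinfo`: the constructor route picks up the zone's historical local mean time offset instead of the modern IST offset. The variable names below are illustrative.

```python
import datetime as dt

import pytz

ist = pytz.timezone("Asia/Kolkata")

# Passing a pytz zone straight to the constructor uses the zone's first
# recorded offset (local mean time), not the offset valid on that date.
via_tzinfo = dt.datetime(2023, 3, 12, tzinfo=ist)

# localize() resolves the offset that applies to that specific date.
via_localize = ist.localize(dt.datetime(2023, 3, 12))

print(via_tzinfo.utcoffset())    # historical offset, differs from 5:30:00
print(via_localize.utcoffset())  # 5:30:00
```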

In [None]:
def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)

    percentile_.__name__ = "percentile_%s" % n
    return percentile_
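
A minimal sketch of how the helper is meant to be used, on a made-up frame: because each closure gets a distinct `__name__`, pandas can take several percentiles in one `agg` call and label the output columns accordingly. The toy data and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)

    percentile_.__name__ = "percentile_%s" % n
    return percentile_


# toy data, not taken from the dataset
toy_df = pd.DataFrame(
    {"group": ["a", "a", "a", "b", "b"], "value": [1, 2, 3, 10, 20]}
)

# the output columns are named after each function's __name__
summary = toy_df.groupby("group")["value"].agg(
    [percentile(25), "median", percentile(75)]
)
print(summary.columns.tolist())
```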

## Data Preparation

* In this section we're preparing the data for analyses. 
* We're also trying to validate the sanity of data and see how closely it resembles real-life facts.

In [None]:
sign_up_info_df = pd.read_csv("data/data_science_task_dataset.csv", index_col=[0])
sign_up_info_df.reset_index(drop=True, inplace=True)
sign_up_info_df.reset_index(names=["id"], inplace=True)

In [None]:
sign_up_info_df.head()

In [None]:
# assuming times are in IST
sign_up_info_df["signup_time"] = pd.to_datetime(
    sign_up_info_df["signup_time"]
).dt.tz_localize(IST_TZ)
sign_up_info_df["payment_time"] = pd.to_datetime(
    sign_up_info_df["payment_time"]
).dt.tz_localize(IST_TZ)

sign_up_info_df["plan_months"] = sign_up_info_df["plan_months"].astype(int)
sign_up_info_df["payment_amount"] = sign_up_info_df["payment_amount"].astype(float)
sign_up_info_df["is_early_bird"] = sign_up_info_df["is_early_bird"].astype(bool)

Wherever `payment_time` is missing, `payment_amount` should be `NA`, i.e. the payment is still pending. A value of `0` would mean that the customer paid `0`.

In [None]:
sign_up_info_df.loc[sign_up_info_df["payment_time"].isna(), "payment_amount"] = np.nan
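
To see why the distinction matters, here is a small illustration on made-up numbers: aggregations such as `mean` skip `NaN` but treat `0` as a real payment, so leaving pending payments at `0` would drag the average down.

```python
import numpy as np
import pandas as pd

# toy payments: two real payments and one pending sign-up
with_zero = pd.Series([1000.0, 3000.0, 0.0])
with_nan = pd.Series([1000.0, 3000.0, np.nan])

print(with_zero.mean())  # pending payment counted as a payment of 0
print(with_nan.mean())   # pending payment ignored
print(with_nan.count())  # only non-missing payments are counted
```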

In [None]:
sign_up_info_df["signup_to_payment_time"] = (
    sign_up_info_df["payment_time"] - sign_up_info_df["signup_time"]
)
sign_up_info_df["num_members"] = sign_up_info_df["ages"].apply(
    lambda x: len(x.split(", "))
)

In [None]:
# splitting each family member to a unique column
sign_up_info_exploded_df = sign_up_info_df.assign(
    **{
        "plans": sign_up_info_df["plans"].str.split(", "),
        "genders": sign_up_info_df["genders"].str.split(", "),
        "ages": sign_up_info_df["ages"].str.split(", "),
    }
).explode(column=["ages", "genders", "plans"])
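
A toy illustration of the step above, with made-up rows (the gender codes are assumed): splitting the comma-separated strings into lists and calling `explode` on several columns at once yields one row per family member while repeating the sign-up level columns. Multi-column `explode` needs pandas 1.3 or newer, and the lists in each row must have equal lengths.

```python
import pandas as pd

# toy sign-ups: the first one covers two members, the second covers one
toy_df = pd.DataFrame(
    {
        "id": [0, 1],
        "ages": ["34, 31", "52"],
        "genders": ["M, F", "F"],
        "plans": ["PLUS, PLUS", "LITE"],
    }
)

list_columns = ["ages", "genders", "plans"]
toy_exploded_df = toy_df.assign(
    **{col: toy_df[col].str.split(", ") for col in list_columns}
).explode(column=list_columns)

print(toy_exploded_df)
```

Note that the exploded values are still strings, which is why the ages are cast to `int` before any numeric work.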

### Validating with an external source

* We can verify whether the ages in this sample align with those in real life. 
* Age is one factor in insurance that has a range of permissible values.
* Are there any outliers?
* E.g. are people over 100 or below 0 looking for insurance?

In [None]:
sign_up_info_exploded_df["ages"] = sign_up_info_exploded_df["ages"].astype(int)

In [None]:
fig, ax = plt.subplots(figsize=(16, 6), dpi=120)
sns.countplot(
    x="ages",
    data=sign_up_info_exploded_df,
    order=np.arange(
        start=sign_up_info_exploded_df["ages"].min(),
        stop=sign_up_info_exploded_df["ages"].max() + 1,  # arange excludes stop
        step=1,
    ),
    ax=ax,
)
ax.set_title("Count Plot of ages")
ax.set_xlabel("Age (in years)")
ax.set_ylabel("Count of people")
plt.show()

In [None]:
sign_up_info_exploded_df["age_bins"] = pd.cut(
    sign_up_info_exploded_df["ages"],
    bins=[0, 17, 35, 50, 65],
    include_lowest=True,
    precision=0,
)
sign_up_info_exploded_df["age_bins"].value_counts(normalize=True, dropna=False) * 100
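
A short sketch of how `pd.cut` treats the bin edges used above, on hand-picked ages: intervals are closed on the right, `include_lowest=True` makes the first interval include `0`, and anything above the last edge becomes `NaN` (which `value_counts(dropna=False)` would surface as a separate row).

```python
import pandas as pd

toy_ages = pd.Series([0, 17, 18, 65, 70])
toy_bins = pd.cut(toy_ages, bins=[0, 17, 35, 50, 65], include_lowest=True)

# codes give the index of the interval each age fell into; -1 means no bin
print(toy_bins.cat.codes.tolist())
```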

From the above we can learn that:

1. The age limit for insurance of 65 years is in line with the [IRDAI](https://www.tataaig.com/knowledge-center/health-insurance/age-limit-for-health-insurance).
2. A significant part of the population purchasing plans at Even is between 17 and 35.

Hence an early look doesn't show any concerning signs.

## Attributing month-wise conversion

* Sign-up data spans the first seven months of the year 2022.
* Are we seeing a drop in conversion over any month?
* What are we doing differently in those months we see a drop?
* Finally, can we use insights from the well-performing months and see how they could be extended towards improving sub-par months?

In [None]:
sign_up_info_df["signup_day"] = sign_up_info_df["signup_time"].dt.dayofyear

day_wise_conv_df = (
    (
        sign_up_info_df[sign_up_info_df["payment_time"].notnull()]
        .groupby(["signup_day"])["id"]
        .count()
        / sign_up_info_df.groupby(["signup_day"])["id"].count()
    )
    .reset_index()
    .rename(columns={"id": "signup_to_payment_conversion"})
)
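
An equivalent way to express the same ratio, sketched on made-up rows: the mean of a boolean "has paid" flag per group is the conversion rate. One difference worth noting is that a group with no payments comes out as `0.0` here, whereas dividing two grouped counts leaves it as `NaN` because the group is missing from the numerator.

```python
import pandas as pd

# toy sign-ups: day 1 has one payment out of two, day 2 has none
toy_df = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "signup_day": [1, 1, 2],
        "payment_time": pd.to_datetime(["2022-01-02", None, None]),
    }
)

conversion = (
    toy_df["payment_time"]
    .notna()
    .groupby(toy_df["signup_day"])
    .mean()
    .rename("signup_to_payment_conversion")
    .reset_index()
)
print(conversion)
```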


In [None]:
fig, ax = plt.subplots(figsize=(16, 6), dpi=120)
sns.lineplot(
    data=day_wise_conv_df, x="signup_day", y="signup_to_payment_conversion", ax=ax
)
plt.grid()

The above looks like organic growth after the introductory offer was removed.

In [None]:
sign_up_info_df["signup_month"] = sign_up_info_df["signup_time"].dt.month
sign_up_info_exploded_df["signup_month"] = sign_up_info_exploded_df[
    "signup_time"
].dt.month

month_wise_conv_df = (
    (
        sign_up_info_df[sign_up_info_df["payment_time"].notnull()]
        .groupby(["signup_month"])["id"]
        .count()
        / sign_up_info_df.groupby(["signup_month"])["id"].count()
    )
    .reset_index()
    .rename(columns={"id": "signup_to_payment_conversion"})
)

In [None]:
fig, ax = plt.subplots(figsize=(16, 6), dpi=120)
sns.barplot(
    data=month_wise_conv_df, x="signup_month", y="signup_to_payment_conversion", ax=ax
)
plt.grid()

Our conversion caps out at 20%, with months 1, 2, 6, and 7 performing the best. Is there a difference in the composition of the following across months:

* User demographics: 
    - Age, Family Size, Gender
    - Are customers of a particular demographic finding Even's value proposition more useful?
* Plans: 
    - The selected plan and/or plan months
    - Are we offering more lucrative plans over certain months?
* Prices:
    - Are the prices better over certain months?
    - **Note**: Since we only have pricing data for those who made a purchase, our insights suffer partially due to survivorship bias. 

### Distribution of Ages across months

How does the age data look across different age bins, i.e. is there a disproportionate volume of a certain age group in certain months?

In [None]:
sign_up_info_exploded_df["age_bins"] = pd.cut(
    sign_up_info_exploded_df["ages"],
    bins=[0, 10, 20, 30, 40, 50, 65],
    include_lowest=True,
    precision=0,
)

age_bins_composition_by_sign_up_month_df = sign_up_info_exploded_df.pivot_table(
    index=["signup_month"],
    columns=["age_bins"],
    values="id",
    aggfunc="count",
    margins=True,
)
age_bins_composition_by_sign_up_month_df.div(
    age_bins_composition_by_sign_up_month_df.iloc[:, -1], axis=0
)
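
The same row-wise shares can also be obtained with `pd.crosstab` and `normalize="index"`, shown here on made-up rows (the bin labels are placeholders) as an alternative sketch rather than a replacement for the pivot above.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "signup_month": [1, 1, 1, 2, 2],
        "age_bins": ["0-20", "21-40", "21-40", "21-40", "41-65"],
    }
)

# each row sums to 1, i.e. the share of every age bin within a month
shares = pd.crosstab(toy_df["signup_month"], toy_df["age_bins"], normalize="index")
print(shares.round(2))
```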

* From the above, the distribution of age bins looks consistent across months.
* The differences are within 1 percentage point of each other.
* The only deviation seems to be the composition of the younger ages in month 1.

In [None]:
ages_by_signup_month = sns.FacetGrid(
    data=sign_up_info_exploded_df,
    col="signup_month",
    col_wrap=3,
    height=4,
    sharex="col",
    sharey="row",
)
ages_by_signup_month.map(
    sns.histplot,
    "ages",
    kde=True,
    color="green",
    stat="density",
)
ages_by_signup_month.fig.subplots_adjust(top=0.9)
ages_by_signup_month.fig.suptitle(
    "Histogram of the distribution of ages by sign-up month"
)
plt.show()

* We can see through the table and the distribution plots above that the ages are distributed similarly across months.
* Interesting point:
    - The distribution of ages is slightly right-skewed with a thicker tail on the right.
    - That means insurance is still popular among the older ages, and people seldom look for insurance for kids on Even.
    - *Could insurance for kids (< 18 years of age) be a new market to explore?*

### Distribution by Gender

In [None]:
gender_composition_by_sign_up_month_df = sign_up_info_exploded_df.pivot_table(
    index=["signup_month"],
    columns=["genders"],
    values="id",
    aggfunc="count",
    margins=True,
)

gender_composition_by_sign_up_month_df.div(
    gender_composition_by_sign_up_month_df.iloc[:, -1], axis=0
).drop(gender_composition_by_sign_up_month_df.tail(1).index)

All the months follow the same distribution among genders. There's a 50-50 split.

### Distribution by Plans

In [None]:
plan_composition_by_sign_up_month_df = sign_up_info_exploded_df.pivot_table(
    index=["signup_month"],
    columns=["plans"],
    values="id",
    aggfunc="count",
    margins=True,
)

plan_composition_by_sign_up_month_df.div(
    plan_composition_by_sign_up_month_df.iloc[:, -1], axis=0
).drop(plan_composition_by_sign_up_month_df.tail(1).index)

In [None]:
plan_months_composition_by_sign_up_month_df = sign_up_info_exploded_df.pivot_table(
    index=["signup_month"],
    columns=["plan_months"],
    values="id",
    aggfunc="count",
    margins=True,
)

plan_months_composition_by_sign_up_month_df.div(
    plan_months_composition_by_sign_up_month_df.iloc[:, -1], axis=0
).drop(plan_months_composition_by_sign_up_month_df.tail(1).index)

* There's a similar trend across plans with a slight deviation for the first two months.
    - `PLUS` is overwhelmingly popular among customers, possibly due to tax savings under 80D.
* Furthermore, we see a similar trend across plan months as well.
    - Customers usually prefer taking a 12-month plan. These might be those investing for the long run.
    - Those selecting the 3-month plan are likely either trying Even out or looking to file a claim.

### Pricing

Could January (1) and February (2) have high conversion due to the **Early Bird** offer?

In [None]:
early_bird_sign_up_info_df = sign_up_info_df[sign_up_info_df["is_early_bird"]]
early_bird_sign_up_info_df["payment_time"].isna().sum()

In [None]:
early_bird_sign_up_info_df["signup_time"].min(), early_bird_sign_up_info_df[
    "signup_time"
].max()

In [None]:
# localize() gives the correct +05:30 offset; tzinfo= with a pytz zone would not
sign_up_info_df.loc[
    (
        sign_up_info_df["signup_time"]
        >= IST_TZ.localize(dt.datetime(year=2022, month=1, day=1))
    )
    & (
        sign_up_info_df["signup_time"]
        < IST_TZ.localize(dt.datetime(year=2022, month=3, day=1))
    )
]["payment_amount"].isna().sum()

An early bird is anyone who signed up in January or February and has paid for their plan. Anyone who signed up in January or February isn't automatically considered an early bird.

Is conversion higher in the first two months due to a reduced *Early Bird* price?

In [None]:
sign_up_info_df[sign_up_info_df["payment_time"].notnull()].groupby(["signup_month"])[
    "payment_amount"
].agg(["mean", percentile(25), "median", percentile(75), percentile(90), "count"])

* From the above, it's clear that customers in months 1 and 2 paid lower prices on average than those in other months. The higher conversion can plausibly be attributed to:
    - Even being a new product that customers were willing to try
    - Lower prices
    - Claims being easier to file initially
* Interestingly, even though prices were higher in months 6 and 7, the conversion recovered to its initial numbers.
    - *Why is this the case?*
    - Typically when a product goes through its life cycle ([Source](https://corporatefinanceinstitute.com/resources/management/product-life-cycle/)), there's a plateau and then a drop in conversion.
    - There either seems to be a pivot in the product, or customers are looking for something else.

In [None]:
paid_sign_up_exploded_info_df = sign_up_info_exploded_df.loc[
    sign_up_info_exploded_df["payment_time"].notnull()
]

Could this be because prices for a certain plan or plan duration were lower?

In [None]:
plan_month_pricing_info_df = (
    paid_sign_up_exploded_info_df.groupby(["signup_month", "plan_months"])[
        "payment_amount"
    ]
    .agg(["mean"])
    .reset_index()
)

fig, ax = plt.subplots(figsize=(16, 6), dpi=120)
sns.lineplot(
    data=plan_month_pricing_info_df,
    x="signup_month",
    y="mean",
    hue="plan_months",
    ax=ax,
)
ax.set_title("Mean prices paid by plan months")
ax.set_xlabel("Sign up month")
ax.set_ylabel("Mean price paid for plan")
plt.grid()
plt.show()

It is clear that the prices for all the plan months increased from the third month onwards. However, there doesn't seem to be a decline in prices in any plans in the last two months.

In [None]:
plan_pricing_info_df = (
    paid_sign_up_exploded_info_df.groupby(["signup_month", "plans"])["payment_amount"]
    .agg(["mean"])
    .reset_index()
)

fig, ax = plt.subplots(figsize=(16, 6), dpi=120)
sns.lineplot(
    data=plan_pricing_info_df,
    x="signup_month",
    y="mean",
    hue="plans",
    ax=ax,
)
ax.set_title("Mean prices paid by plans")
ax.set_xlabel("Sign up month")
ax.set_ylabel("Mean price paid for plan")
plt.grid()
plt.show()

In [None]:
plan_and_months_pricing_info_df = (
    paid_sign_up_exploded_info_df.groupby(["signup_month", "plans", "plan_months"])[
        "payment_amount"
    ]
    .agg(["mean"])
    .reset_index()
)

fig, ax = plt.subplots(figsize=(16, 6), dpi=120)
sns.lineplot(
    data=plan_and_months_pricing_info_df,
    x="signup_month",
    y="mean",
    style="plans",
    hue="plan_months",
    ax=ax,
    palette=sns.color_palette("hls", 3),
)
ax.set_title("Mean prices paid by plan and plan months")
ax.set_xlabel("Sign up month")
ax.set_ylabel("Mean price paid for plan")
plt.grid()
plt.show()

* Although pricing increased for the `PLUS` plan, pricing for the `LITE` plan seems to have dropped, specifically for the `LITE, 6-month` plan.
* However, the proportion of people signing up for `LITE` doesn't seem to have increased.


### Alternative Reasons

* Although the higher conversion in the starting months can be attributed to pricing, the later months don't seem to follow the same trend.
* Customers are still converting and paying for higher prices in the later months. 
* Hence it's inconclusive why conversion improved later on.

This could be due to:

1. Web page improvements:
    - Clear and compelling messaging
    - Easier sign-up process
2. Marketing Channels:
    - Increased promotion on social media
    - Trust indicators through celebrities, customer reviews, or social proof.
    - Better tutorials on how Even works
3. Competitor analysis:
    - Competitors might've increased prices, but Even held the same prices across the later months
4. Additional Benefits:
    - Even could've rolled out plans with more benefits.
5. (Unlikely) Measurement error:
    - We've lost customer data over certain months.

## Can we start an abandonment campaign?

Can we target customers who're dropping off through other means? When would be the best time to intervene?

In [None]:
# a sign-up counts as paid when a payment time is recorded
sign_up_info_df["did_payment"] = sign_up_info_df["payment_time"].notna()

In [None]:
(
    sign_up_info_df.groupby(["did_payment"])["id"].count() / sign_up_info_df.shape[0]
).reset_index().rename(columns={"id": "percentage"})

In [None]:
(
    sign_up_info_df.groupby(["did_payment", "signup_month"])["id"].count()
    / sign_up_info_df.groupby(["signup_month"])["id"].count()
).reset_index().rename(columns={"id": "percentage"}).pivot_table(
    index="signup_month", columns="did_payment", values="percentage"
)

Over 80% of customers who sign up don't go on to pay for Even. Furthermore, as described previously, these drop-offs are higher for months 3, 4, and 5.

In [None]:
sign_up_info_df.loc[
    sign_up_info_df["signup_to_payment_time"].notnull(), "signup_to_payment_time"
].describe(percentiles=[0.25, 0.5, 0.75, 0.95, 0.99])

75% of paying customers complete their payment within a week of signing up. We can run experiments on customers who sign up and still haven't paid after a week.
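
A minimal sketch of how such a cohort could be selected, assuming a fixed reference date and a 7-day threshold (both are illustrative choices, and the toy rows are made up): keep sign-ups with no payment whose sign-up is older than the threshold.

```python
import pandas as pd

# hypothetical reference point and threshold for the experiment
reference_time = pd.Timestamp("2022-07-31", tz="Asia/Kolkata")
threshold = pd.Timedelta(days=7)

toy_df = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "signup_time": pd.to_datetime(["2022-07-01", "2022-07-28", "2022-07-10"]),
        "payment_time": pd.to_datetime([None, None, "2022-07-11"]),
    }
)
toy_df["signup_time"] = toy_df["signup_time"].dt.tz_localize("Asia/Kolkata")

is_unpaid = toy_df["payment_time"].isna()
is_stale = (reference_time - toy_df["signup_time"]) > threshold
abandoned_ids = toy_df.loc[is_unpaid & is_stale, "id"].tolist()
print(abandoned_ids)
```

Only the first toy sign-up qualifies: the second is unpaid but still inside the 7-day window, and the third has already paid.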

## User Sign Up to Payment Time

In [None]:
sign_up_info_df["signup_to_payment_time_days"] = sign_up_info_df[
    "signup_to_payment_time"
] / pd.to_timedelta(1, unit="D")
payment_sign_up_info_df = sign_up_info_df.loc[
    sign_up_info_df["signup_to_payment_time"].notnull()
].reset_index(drop=True)

### Tabular Format

Given that the range of the distribution is really wide and over 50% of the data lies within 1 day, a table is more suitable than a plot.

In [None]:
payment_to_sign_up_time_df = payment_sign_up_info_df.groupby(["signup_month"])[
    "signup_to_payment_time"
].agg(
    [
        "mean",
        percentile(25),
        "median",
        percentile(75),
        percentile(95),
        percentile(99),
    ]
)

payment_to_sign_up_time_df.style.background_gradient().set_properties(
    **{"font-size": "11px"}
)

In [None]:
payment_sign_up_info_df["signup_to_payment_time_sec"] = payment_sign_up_info_df[
    "signup_to_payment_time"
] / np.timedelta64(1, "s")
payment_sign_up_info_df["log_signup_to_payment_time_sec"] = np.log(
    payment_sign_up_info_df["signup_to_payment_time_sec"]
)

fig, ax = plt.subplots(figsize=(16, 6), dpi=120)
sns.histplot(
    data=payment_sign_up_info_df,
    x="log_signup_to_payment_time_sec",
    ax=ax,
)
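# place ticks at the natural log of round durations so the labels are exact
tick_seconds = [60, 600, 3600, 6 * 3600, 86400, 7 * 86400, 40 * 86400]
ax.set_xticks(np.log(tick_seconds))
ax.set_xticklabels(
    ["1 min", "10 mins", "1 hr", "6 hrs", "1 day", "7 days", "40 days"]
)
plt.xlabel("Signup to payment time (natural-log scale)")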
plt.grid()
plt.show()
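
Because the x-axis is the natural log of seconds, a position `t` on it corresponds to `exp(t)` seconds. A small helper (the name `humanize_seconds` is made up for this sketch) makes it easy to check what a given position means, or to go the other way and find the positions of round durations with `np.log`.

```python
import numpy as np


def humanize_seconds(seconds):
    """Render a duration in seconds using its largest sensible unit."""
    for unit, size in [("days", 86400), ("hrs", 3600), ("mins", 60)]:
        if seconds >= size:
            return "%.1f %s" % (seconds / size, unit)
    return "%.0f secs" % seconds


# what a few positions on a natural-log axis stand for
for position in [4, 8, 12, 16]:
    print(position, humanize_seconds(np.exp(position)))

# positions of round durations: 1 minute, 1 hour, 1 day, 7 days
round_positions = np.log([60, 3600, 86400, 7 * 86400]).round(2)
print(round_positions)
```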

## Mining the Underlying Price

### Background

As discussed earlier, the above data suffers from survivorship bias, since prices are only observed for customers who paid. It therefore represents the price a customer is willing to pay for a plan rather than the price they should be paying.

### Objective of a pricing model

* Charge a premium that is proportional to the rate at which an individual or family would claim.
* The overall sum of premiums shouldn't be less than the estimated cost of covering all individuals (including existing ones).
* A customer's premium shouldn't exceed their coverage. Otherwise, it's pointless for a customer to get insurance. If it does, then they should be ineligible for insurance.

A model that's mining the underlying pricing might not be bound by the above constraints.
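
One way to write the three objectives down, using notation introduced here purely for illustration: let $p_i$ be the premium for customer (or family) $i$, $\lambda_i$ their expected claim rate, $c_i$ their expected cost per claim, and $C_i$ their coverage limit. Reading "estimated cost of covering all individuals" as the expected total claim cost is an assumption.

```latex
\begin{aligned}
p_i &\propto \lambda_i && \text{(premium proportional to the claim rate)} \\
\sum_i p_i &\ge \sum_i \lambda_i c_i && \text{(premiums cover the estimated cost)} \\
p_i &\le C_i && \text{(premium never exceeds coverage)}
\end{aligned}
```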

### Factors

Setting the above aside, the factors that should majorly contribute to determining prices are as follows:
* Age
* Gender
* Plan type and duration
* Size of plan (Number of members)

Additional factors not captured in the data include:
* Newness to the platform - for new customer offers
* Customer reactivation journey 
    - for those customers who had previous plans
    - can we increase their coverage?
* Location
* Claims history: those with a history of filing claims may be seen as having a higher risk
* Credit Score: A lower credit score may indicate a higher risk of filing claims.
* Coverage cost: As treatments get more expensive over time.

### Feature Engineering

Given the factors currently available in the dataset, the features could be engineered as follows.

### Age

* This is simple for individuals, as higher age would lead to higher prices
* However, this is a complicated affair for families.
    - in a family of four, how would you consider the ages of all family members?
    - one such way is to create a vector of ages, and the model can learn the weights
    - a vector of ages would also incorporate the size of the family
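
A minimal sketch of one such encoding, assuming a maximum family size of 4 and padding with 0 (both are illustrative assumptions, as is the helper name `age_vector`): ages are sorted from oldest to youngest so that each position has a stable meaning, and the number of filled positions carries the family size.

```python
import numpy as np

MAX_MEMBERS = 4  # assumed upper bound on family size


def age_vector(ages, max_members=MAX_MEMBERS):
    """Return a fixed-length vector of ages, oldest first, padded with zeros."""
    ordered = sorted(ages, reverse=True)[:max_members]
    return np.array(ordered + [0] * (max_members - len(ordered)))


couple = age_vector([31, 34])
family = age_vector([45, 8, 42, 12])
print(couple)
print(family)
```

Padding with 0 makes an infant of age 0 indistinguishable from an empty slot, so a separate `num_members` feature or a different padding value may be preferable.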
    
### Gender

* This factor should be coupled with age to determine pricing, as gender alone wouldn't be able to determine coverage.
* For example:
    - Women of reproductive age might require additional coverage for pregnancy
    - Men around certain ages might experience male-pattern baldness that requires additional coverage

### Plan and Plan Duration

* Since both plan and plan months are categorical, we can one-hot encode these variables.
* The algorithm should be able to account for volume pricing:
    - every additional 3 or 6 months shouldn't necessarily increase the price by 100%, but by around 80%.
    - a slightly higher price for a better service creates a price anchor, which is why customers tend to choose `PLUS` more often than `LITE`.
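
A short sketch of the one-hot encoding on made-up rows, using `pd.get_dummies`. The plan names and durations follow the ones mentioned above; the rows themselves are invented.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "plans": ["PLUS", "LITE", "PLUS"],
        "plan_months": [12, 3, 6],
    }
)

# one indicator column per (variable, level) pair
encoded_df = pd.get_dummies(toy_df, columns=["plans", "plan_months"], dtype=int)
print(encoded_df.columns.tolist())
```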
    
### Model Selection

Given the above features, we require a model that can weight inputs such as a vector of ages while also capturing non-linear decision boundaries between gender and age. An ensemble method based on boosting might work well. Even an overfitted model might be acceptable here, as values outside the observed range of the above three factors are rare. Furthermore, the current objective is to learn how pricing can be mined from pre-existing factors.