# Even: Data Assignment

## Config

Setting up the environment for the analysis.


* This notebook uses Python 3.9 via `miniconda`. The environment can be created with `conda create -n even python=3.9`.

In [None]:
# black formatter
%load_ext nb_black

In [None]:
import pytz

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
IST_TZ = pytz.timezone("Asia/Kolkata")

## Data Preparation & Exploration

In [None]:
sign_up_info_df = pd.read_csv("data/data_science_task_dataset.csv", index_col=[0])
sign_up_info_df.reset_index(drop=True, inplace=True)
sign_up_info_df.reset_index(names=["id"], inplace=True)

In [None]:
# assuming times are in IST
sign_up_info_df["signup_time"] = pd.to_datetime(
    sign_up_info_df["signup_time"]
).dt.tz_localize(IST_TZ)
sign_up_info_df["payment_time"] = pd.to_datetime(
    sign_up_info_df["payment_time"]
).dt.tz_localize(IST_TZ)

sign_up_info_df["plan_months"] = sign_up_info_df["plan_months"].astype(int)
sign_up_info_df["payment_amount"] = sign_up_info_df["payment_amount"].astype(float)
sign_up_info_df["is_early_bird"] = sign_up_info_df["is_early_bird"].astype(bool)

Wherever `payment_time` is missing, `payment_amount` should be `NA`, i.e., the payment is still pending.

In [None]:
sign_up_info_df.loc[sign_up_info_df["payment_time"].isna(), "payment_amount"] = np.nan

* Each row represents a sign up, which may include multiple family members:
    - the members are represented by multiple comma-separated values in `genders`, `ages`, and `plans`.
    - all other columns are shared by the family members.
    - the person who originally signed up is the first member.
* When `payment_time` is null, the customer hasn't paid for the plan yet.
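Splitting the multi-valued columns only works if `genders`, `ages`, and `plans` list the same number of members in every row. The following is a minimal sketch of such a check, not part of the original analysis. The toy data, the plan names (`plus`, `lite`), and the helper `member_counts` are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows mimicking the comma-separated layout described above.
toy_df = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "genders": ["M, F", "F", "M, F, M"],
        "ages": ["34, 31", "27", "45, 41"],  # row 2 is missing one age
        "plans": ["plus, plus", "lite", "plus, plus, lite"],
    }
)


def member_counts(df, columns=("genders", "ages", "plans")):
    """Count the comma-separated entries per row for each multi-valued column."""
    return pd.DataFrame(
        {col: df[col].str.split(", ").str.len() for col in columns}, index=df.index
    )


counts = member_counts(toy_df)
# A row is consistent when every column reports the same number of members.
is_consistent = counts.nunique(axis=1) == 1
inconsistent_ids = toy_df.loc[~is_consistent, "id"].tolist()
print(inconsistent_ids)
```

Any `id` reported here would need fixing or dropping before the columns are exploded together.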

The permissible categories are as follows:

In [None]:
# splitting each family member into its own row
sign_up_info_exploded_df = sign_up_info_df.assign(
    **{
        "plans": sign_up_info_df["plans"].str.split(", "),
        "genders": sign_up_info_df["genders"].str.split(", "),
        "ages": sign_up_info_df["ages"].str.split(", "),
    }
).explode(column=["ages", "genders", "plans"])

In [None]:
sign_up_info_exploded_df.groupby(["plans", "plan_months"])["id"].count()

In [None]:
sign_up_info_exploded_df["genders"].value_counts(dropna=False)

In [None]:
sign_up_info_exploded_df["ages"] = sign_up_info_exploded_df["ages"].astype(int)

fig, ax = plt.subplots(figsize=(16, 6), dpi=120)
sns.countplot(
    x="ages",
    data=sign_up_info_exploded_df,
    order=np.arange(
        start=sign_up_info_exploded_df["ages"].min(),
        stop=sign_up_info_exploded_df["ages"].max() + 1,  # arange excludes stop
        step=1,
    ),
    ax=ax,
)
ax.set_title("Count Plot of ages")
ax.set_xlabel("Age (in years)")
ax.set_ylabel("Count of people")
plt.show()

In [None]:
sign_up_info_exploded_df["age_bins"] = pd.cut(
    sign_up_info_exploded_df["ages"],
    bins=[0, 17, 35, 50, 65],
    include_lowest=True,
    precision=0,
)
sign_up_info_exploded_df["age_bins"].value_counts(normalize=True, dropna=False) * 100

1. The age limit of 65 years for insurance is in line with the [IRDAI](https://www.tataaig.com/knowledge-center/health-insurance/age-limit-for-health-insurance).
2. A major part of the population purchasing plans at Even falls in the (17, 35] age bin.

The span from the earliest `signup_time` to the latest `payment_time` gives the period this data covers.

In [None]:
print(
    f"""From: {sign_up_info_df["signup_time"].min()}"""
    + f"""\nTo: {sign_up_info_df["payment_time"].max()}"""
    + f"""\nRange: {sign_up_info_df["payment_time"].max() - sign_up_info_df["signup_time"].min()}"""
)

i.e., roughly a year of data.

In [None]:
sign_up_info_df.sample(5)

## Crucial Points

What do you think are 2 crucial data breakdowns or plots to be shown if you were presenting this data to the wider team? If you think multiple options are possible, feel free to say why you picked those 2.

Possible Questions

- [ ] Are certain plans more popular among certain ages? Families?
- [ ] Are certain plan months popular among certain ages? Families?
- [ ] What's the age group that's signing up the most for Even?
- [ ] Which group signs up the fastest at Even?
- [ ] When is Even receiving the most sign ups?
- [ ] Are early bird sign ups more?
- [ ] What's the typical price people are looking for in insurance at sign up? (compare it with market rates)
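One way to approach the first question is a row-normalised cross-tabulation of plan by age bin. This is a sketch under assumptions: the toy data and plan names (`plus`, `lite`) are hypothetical, and the bins simply reuse the edges from the exploration above.

```python
import pandas as pd

# Hypothetical member-level rows, shaped like sign_up_info_exploded_df.
toy_members = pd.DataFrame(
    {
        "ages": [16, 25, 30, 28, 45, 60, 33, 48],
        "plans": ["lite", "lite", "plus", "lite", "plus", "plus", "lite", "plus"],
    }
)
toy_members["age_bins"] = pd.cut(
    toy_members["ages"], bins=[0, 17, 35, 50, 65], include_lowest=True, precision=0
)

# Row-normalised shares: within each age bin, what fraction picks each plan?
plan_share_by_age = pd.crosstab(
    toy_members["age_bins"], toy_members["plans"], normalize="index"
)
print(plan_share_by_age)
```

Normalising by row keeps the large 17-35 group from dominating the comparison. The same table could be drawn with `sns.heatmap` if a plot is preferred.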

## User Sign Up to Payment Time

Consider the fields `signup_time` and `payment_time`. They stand for the time a given user (who then may add multiple family members) signed up and then paid, respectively. In a single plot, how can you best show the distribution of time "deltas" between the sign up time and payment time (i.e. how long it takes for people to pay once they have signed up)? What is the best way to condense the relevant information and insights? Remember it needs to be a single, static plot, which ideally should not need to be magnified to make sense.
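A possible starting point, sketched under assumptions rather than offered as the final answer: compute the delay in hours for paid sign ups, then draw a histogram with log-spaced bins so that delays of minutes and delays of weeks fit on one static axis. The toy data and the helpers `signup_to_payment_hours` and `plot_delta_distribution` are hypothetical. The choice of a log scale is also an assumption about the data being right-skewed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real notebook would use sign_up_info_df instead.
toy_df = pd.DataFrame(
    {
        "signup_time": pd.to_datetime(
            ["2022-01-01 10:00", "2022-01-01 12:00", "2022-01-02 09:00", "2022-01-03 08:00"]
        ),
        "payment_time": pd.to_datetime(
            ["2022-01-01 10:30", "2022-01-03 12:00", None, "2022-01-13 08:00"]
        ),
    }
)


def signup_to_payment_hours(df):
    """Return the sign-up-to-payment delay in hours for rows that have paid."""
    delta = df["payment_time"] - df["signup_time"]
    return (delta.dt.total_seconds() / 3600).dropna()


def plot_delta_distribution(hours, ax=None):
    """Histogram of delays on a log-scaled x axis, with the median marked."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    if ax is None:
        _, ax = plt.subplots(figsize=(10, 5))
    positive = hours[hours > 0]  # a log axis cannot show zero or negative delays
    bins = np.logspace(np.log10(positive.min()), np.log10(positive.max()), 30)
    ax.hist(positive, bins=bins)
    ax.set_xscale("log")
    ax.axvline(positive.median(), linestyle="--", color="black", label="median")
    ax.set_xlabel("Hours from sign up to payment (log scale)")
    ax.set_ylabel("Number of sign ups")
    ax.legend()
    return ax


hours = signup_to_payment_hours(toy_df)
print(hours.tolist())
```

Unpaid sign ups have no delta and are dropped here, so their share would need to be reported separately, for example in the plot title.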

## Mining the Underlying Price

You are given the payment amounts but you don't know what the underlying price function is, and what its inputs are (though you can assume they are a subset of the given fields). If you had to treat this as a prediction problem, what kind of model would you use? **PLEASE DO NOT ACTUALLY ATTEMPT MODELLING**. Base your answer on any data exploration you did (and feel free to show plots/stats), but what we are looking for is simply a discussion of what may be some of the modelling challenges here and how to pick a model which can overcome them.
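Before choosing a model, one exploratory check is whether the price looks deterministic given a candidate set of inputs. The idea is to group paid sign ups by those inputs and count the distinct amounts in each group. The sketch below is exploration only, not modelling. The toy values, the plan names, and the helper `distinct_prices_per_group` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; column names follow the notebook, values are made up.
toy_df = pd.DataFrame(
    {
        "plans": ["plus", "plus", "lite, lite", "lite, lite", "plus"],
        "ages": ["30", "30", "40, 38", "40, 38", "52"],
        "plan_months": [12, 12, 6, 6, 12],
        "is_early_bird": [False, True, False, False, False],
        "payment_amount": [1000.0, 900.0, 1500.0, 1500.0, 1800.0],
    }
)


def distinct_prices_per_group(df, candidate_inputs):
    """Count distinct paid amounts for every combination of candidate inputs."""
    paid = df.dropna(subset=["payment_amount"])
    return paid.groupby(list(candidate_inputs))["payment_amount"].nunique()


without_flag = distinct_prices_per_group(toy_df, ["plans", "ages", "plan_months"])
with_flag = distinct_prices_per_group(
    toy_df, ["plans", "ages", "plan_months", "is_early_bird"]
)
print(without_flag.max(), with_flag.max())
```

If some subset of fields gives at most one distinct amount per group, the price may be a rule-based function of that subset. That would point towards models that handle discrete interactions well, such as tree-based ones. Groups with several amounts suggest a missing input or noise.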