# Survival Analysis
What is the motivation for conducting this analysis?


## Conditions and Assumptions

Users who never logged any transactions are classified as **not-activated** and are excluded from the calculation.

study period: ['2018-06-01', '2025-02-01')

`survival_time = t1 - t0`

`churned`: `is_agree` is `False` 

churn event: user blocked/unfollowed. Users who have not blocked/unfollowed but have logged no transactions in the 365 days before the end of the observation period are treated as **churned** by default.

right censored: same as `churned` is `False`

if `churned` and `user_ts > last_entry`, make end time = `user_ts`

if not `churned`, make end time = `tsl[1]`, the end of the observation period

? what does it mean when `user_ts` < `last_entry` but `churned` is `True`
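
The churn and censoring rules above can be sketched on a few made-up rows. This is only an illustration under stated assumptions, not the notebook's pipeline: the column names mirror `is_agree` and `last_entry` from the tidy data, the three rows are invented, and `obs_end` is a stand-in name for `tsl[1]`.

```python
import pandas as pd

# Observation end, taken from the study period stated above (tsl[1]).
obs_end = pd.Timestamp("2025-02-01")

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "is_agree": [True, True, False],
    "last_entry": pd.to_datetime(["2025-01-15", "2023-03-01", "2024-11-20"]),
})

# Inactivity is measured from the last logged transaction to the observation end.
inactive = (obs_end - toy["last_entry"]) > pd.Timedelta(days=365)

# Churned: explicitly blocked/unfollowed, or silent for more than 365 days.
toy["churned"] = ~toy["is_agree"] | inactive

# Everyone else is right censored at the observation end.
toy["censored"] = ~toy["churned"]
print(toy)
```

The second row shows the default rule: the user never blocked or unfollowed, yet counts as churned because of the long gap since the last entry.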

## Data Source


In [None]:
import pandas as pd

td = pd.read_feather('../data/tidy.feather')

# study period
tsl = pd.to_datetime(['2018-06-01', '2025-02-01'])

# Initial Setup and Analysis

In [None]:
td['days_since'] = tsl[1] - td.last_entry

# set churned flag
td['churned'] = ~td.is_agree
i = td.is_agree & (td.days_since > pd.Timedelta(days=365))
td.loc[i, 'churned'] = True
td.groupby('churned').size()

# churned vs. semi-churned:
# semi-churned users blocked but still made entries after the user_ts flag
td['churned_'] = ~td.is_agree & (td.user_ts < td.last_entry)

In [None]:
# data set is highly censored -> biased observations
td.groupby('is_agree').size()

In [None]:
# time elapsed since the last transaction entry until the
# observation cut off period tsl[1]
td['days_since'] = tsl[1] - td.last_entry
x = td['days_since'].dt.total_seconds() / 3600 / 24 # days
_ = x.plot.hist()

In [None]:
x = td.loc[~td.churned, 'days_since'].dt.total_seconds() / 3600 / 24 # days
_ = x.plot.hist()

In [None]:
# data issue?
td.query("user_ts.isna()").shape

Calculate `survival_time`

In [None]:
# Is left censoring necessary? No.
(td.user_ts < tsl[0]).sum()

In [None]:
# calculate start time
td['t0'] = td[['user_ts', 'first_entry']].min(axis=1)

Calculate end time and `survival_time`

`tsl[1]` is the observation end time  

if `is_agree` is `False`, set the user end time to the larger of
`last_entry` and `user_ts` (the user record timestamp `ts` from the acc_user table)
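
The start and end time rules can be checked on a small example before running them on the full data. The sketch below uses invented rows (the third has a missing `user_ts` to show how the row-wise `max` skips it); `obs_end` and `survival_days` are illustrative names, not columns from the tidy data.

```python
import pandas as pd

obs_end = pd.Timestamp("2025-02-01")  # stands in for tsl[1]

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "is_agree": [True, False, False],
    "user_ts": pd.to_datetime(["2019-01-01", "2020-06-01", None]),
    "first_entry": pd.to_datetime(["2019-01-03", "2020-01-01", "2021-01-01"]),
    "last_entry": pd.to_datetime(["2024-12-01", "2020-03-01", "2021-02-01"]),
})

# Start: the earlier of the account timestamp and the first logged entry.
toy["t0"] = toy[["user_ts", "first_entry"]].min(axis=1)

# End: observation end by default, overridden for users who blocked/unfollowed
# with the later of user_ts and last_entry (missing values are skipped).
blocked = ~toy["is_agree"]
toy["t1"] = obs_end
toy.loc[blocked, "t1"] = toy.loc[blocked, ["user_ts", "last_entry"]].max(axis=1)

toy["survival_days"] = (toy["t1"] - toy["t0"]).dt.days
print(toy[["t0", "t1", "survival_days"]])
```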

In [None]:
td['t1'] = tsl[1]
td.loc[~td.is_agree, 't1'] = td.loc[~td.is_agree, ['user_ts', 'last_entry']].max(axis=1, skipna=True)
td['survival_time'] = td.t1 - td.t0

In [None]:
# how many have an end time at or after the end of the observation period
td.loc[~td.is_agree & (td.t1 >= tsl[1])].shape[0]

In [None]:
# what portion of users have churned?
(~td.is_agree).sum() / td.shape[0]

In [None]:
# how many have churned before the end of the observation period
td[~td.is_agree & (td.t1 < tsl[1])].shape

In [None]:
# ... and what is the percentage?
td[~td.is_agree & (td.t1 < tsl[1])].shape[0] / td.shape[0]

In [None]:
# how many have supposedly blocked but still made entries
# vs. true churned(?) user_ts > last_entry
(td[(~td.is_agree) & (td.user_ts < td.last_entry)].shape[0],
td[(~td.is_agree) & (td.user_ts > td.last_entry)].shape[0])

Calculating Churn

In [None]:
# is_agree is False (blocked/unfollowed) but entries were still made after user_ts (churned_ is True)
td.loc[~td.is_agree & td.churned_,
       ['user_ts', 'last_entry', 'tenure', 'days_active',
        'days_since', 'survival_time', 't1', 't0',
        'churned_', 'user_id']]

In [None]:
td.churned.sum() / td.shape[0], (~td.is_agree).sum() / td.shape[0]

In [None]:
td[['tenure', 'days_active', 'days_since', 'survival_time', 'nbr_entry']].describe()

In [None]:
# spread between tenure and active days
(td.tenure.dt.days - td.days_active).describe()

_**Observations:**_
- 75% of users had been active for 12 separate days or less (`days_active`)
- 75% of users had not made any entries for >1143 days (`days_since`)

In [None]:
# what do you observe among the 25% most recently active users (days_since below the first quartile)?

t0 = td.days_since.quantile(.25) # 1146 days since last entry
print('days_since =', t0.days)
(td.loc[td.days_since < t0, ['days_since', 'tenure', 'days_active', 'nbr_entry']].describe())

In [None]:
td[td.user_id == 'U000046b3786c997220a07872c5191c37']

# Kaplan-Meier Estimator
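
For reference, the Kaplan-Meier (product-limit) estimate multiplies, at each distinct event time, the fraction of at-risk users who survive that time; censored users leave the risk set without counting as events. The following is a minimal pure-Python sketch of that calculation on invented durations. The function name `km_curve` is hypothetical, and the notebook itself relies on `lifelines.KaplanMeierFitter`.

```python
from collections import Counter

def km_curve(durations, events):
    """Return [(time, survival probability)] at each distinct event time."""
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # events and censorings both leave the risk set
    at_risk = len(durations)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Users censored at time t are still counted as at risk at t.
            surv *= 1 - d / at_risk
            curve.append((t, surv))
        at_risk -= exits[t]
    return curve

# Invented data: durations in days, True = churn observed, False = right censored.
durations = [5, 8, 8, 12, 20]
events = [True, True, False, True, False]
curve = km_curve(durations, events)
print(curve)
```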

In [None]:
from lifelines import KaplanMeierFitter
import matplotlib.pyplot as plt

kmf = KaplanMeierFitter()
kmf.fit(durations=td.survival_time.dt.total_seconds() / 3600 / 24, # in days
        event_observed=td.churned)

ax = kmf.plot_survival_function()
ax.set_title("Kaplan-Meier Survival Curve")
ax.set_xlabel("days")
ax.set_ylabel("Survival Probability")
plt.show()

In [None]:
median_time = kmf.median_survival_time_
print(f"Median Survival Time: {median_time} days")

In [None]:
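# Find the first duration at which survival drops to or below a given probability
target_prob = 0.5  # survival probability
below = kmf.survival_function_.index[kmf.survival_function_["KM_estimate"] <= target_prob]
closest_time = below[0] if len(below) else float("inf")  # inf: curve never reaches target_prob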

print(f"Duration corresponding to survival probability {target_prob}: {closest_time}")

# Cox Proportional Hazards Model
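
As a reminder of what the fit below estimates, the Cox model assumes a hazard of the following form, where $h_0(t)$ is an unspecified baseline hazard and $x$ holds the covariates (here `days_active`, `nbr_entry` and `fq_median`):

$$h(t \mid x) = h_0(t)\,\exp(\beta^\top x)$$

Under this assumption, the `exp(coef)` column of the `lifelines` summary reads as the hazard ratio for a one-unit increase in a covariate with the others held fixed.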

In [None]:
from lifelines import CoxPHFitter

td['duration'] = td.survival_time.dt.total_seconds() / 3600 / 24  # days
df = td[[
    'duration',
    'churned',
    'days_active',
    'nbr_entry',
    'fq_median'
]]

cph = CoxPHFitter()
cph.fit(df, duration_col='duration', event_col='churned')
cph.print_summary()

In [None]:
td.info()