# Retention Exploratory Analysis

---

In [1]:
import pandas as pd
import numpy as np
import ficast as fc
from plotly import graph_objects as go
from scipy.stats import linregress

Load dependent data.

In [2]:
data = pd.read_csv('data/ke_data.csv')

In [3]:
data.head()

| Unnamed: 0 | First Loan Local Disbursement Month | Months Since First Loan Disbursed | Count First Loans | Count Borrowers | Count Loans | Total Amount | Total Interest Assessed | Total Rollover Charged | Total Rollover Reversed | Default Rate Amount 7D | ... | Default Rate Amount 51D | Ftbs | Borrower Retention | Loans Per Borrower | Loan Size | Interest Rate | Total Revenue Of Originations | Loans Per Original | Originations Per Original | Revenue Per Original |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2020-09 | 0 | 7801 | 7801 | 13156 | 48361000 | 6540240 | 681325 | 81520 | 0.155382 | ... | 0.113031 | 10314 | 0.756351 | 1.68645 | 3675.965339 | 0.135238 | 0.147641 | 1.275548 | 4688.869498 | 692.267307 |
| 1 | 2020-09 | 17 | 0 | 366 | 373 | 7424000 | 1062480 | 0 | 274 | NaN | ... | NaN | 10314 | 0.035486 | 1.019126 | 19903.48525 | 0.143114 | 0.143077 | 0.036164 | 719.798332 | 102.986814 |
| 2 | 2020-09 | 2 | 0 | 3661 | 4310 | 31461000 | 4297310 | 401077 | 30617 | 0.139719 | ... | 0.094792 | 10314 | 0.354954 | 1.177274 | 7299.535963 | 0.136592 | 0.148367 | 0.417879 | 3050.319953 | 452.566415 |
| 3 | 2020-09 | 10 | 0 | 1362 | 1562 | 28365000 | 3750870 | 279970 | 0 | 0.109661 | ... | 0.074028 | 10314 | 0.132054 | 1.146843 | 18159.41101 | 0.132236 | 0.142106 | 0.151445 | 2750.145433 | 390.812488 |
| 4 | 2020-09 | 14 | 0 | 1054 | 1181 | 24426000 | 3562270 | 239453 | 0 | 0.108947 | ... | 0.045235 | 10314 | 0.102191 | 1.120493 | 20682.47248 | 0.145839 | 0.155642 | 0.114505 | 2368.237347 | 368.598313 |


In [4]:
# instantiate model
ps = fc.Model(data, market='ke', fcast_method='powerslope')

# generate features
ps.generate_features()

# forecast data and save as model attribute
ps.forecast = ps.forecast_features(ps.data)

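# backtest data and save as attributes
# note: if backtest is a regular method, this assignment rebinds ps.backtest to the
# result on this instance, so re-create the model before running this cell again
ps.backtest, ps.backtest_report = ps.backtest(ps.data, months=6)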

Backtesting 6 months.
8 cohorts will be backtested.


### Exploratory Analysis

The goal of this exploratory analysis is to take a deep dive into **Count Borrowers** and its derived metrics, **retention** and **survival**. Any insights from this deep dive will feed into a performance evaluation of our forecasting methodologies and, hopefully, drive their optimization.

#### Count Borrowers

In [8]:
ps.plot_cohorts('Count Borrowers', data='raw')

The plot above shows *Count Borrowers* for each cohort vs. *Month Since Disbursement*. The count decreases monotonically over time, and the shape looks similar across cohorts. However, because each cohort starts from a different count, the raw counts can't tell us whether one cohort is losing customers faster than another. For that, we use retention.
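As a rough sketch of that normalization: in the `data.head()` output above, `Borrower Retention` matches `Count Borrowers` divided by `Ftbs` to six decimal places. The snippet below assumes that this ratio is the definition (and that `Ftbs` is the cohort's fixed base size); it re-types the five sample rows rather than loading the CSV, and `retention_check` is a name made up for this example.

```python
import numpy as np
import pandas as pd

# The five 2020-09 rows shown by data.head(), restricted to the columns used here.
sample = pd.DataFrame({
    'Months Since First Loan Disbursed': [0, 17, 2, 10, 14],
    'Count Borrowers': [7801, 366, 3661, 1362, 1054],
    'Ftbs': [10314, 10314, 10314, 10314, 10314],
    'Borrower Retention': [0.756351, 0.035486, 0.354954, 0.132054, 0.102191],
})

# Assumed definition: share of the cohort's Ftbs that is still borrowing in a given month.
sample['retention_check'] = sample['Count Borrowers'] / sample['Ftbs']

# Compare the recomputed ratio with the column that ships in the data.
print(sample.sort_values('Months Since First Loan Disbursed'))
print(np.allclose(sample['retention_check'], sample['Borrower Retention'], atol=1e-5))
```

Dividing by a per-cohort base puts every cohort on the same scale, which is what makes the cross-cohort comparison in the next section possible.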

#### Is retention changing over time?

Retention rates have dropped from earlier to more recent cohorts. In the retention curves plotted below, the knee of the curve sits visibly lower for more recent cohorts. The drop is easier to see if we pick a single *Month Since Disbursement* and look at a vertical slice of the data.

In [9]:
ps.plot_cohorts('borrower_retention', data='raw')

In [10]:
month = 1

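# .copy() so that adding a column below doesn't write to a slice of ps.data
df = ps.data[ps.data['Months Since First Loan Disbursed']==month][['cohort', 'borrower_retention']].copy()
# cohort order; assumes rows are already sorted from earliest to most recent cohort
df['month'] = np.arange(1, len(df)+1)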

slope, intercept, r, p, se = linregress(x=df.month, y=df.borrower_retention)
fit = slope*df['month'] + intercept

fig = go.Figure([
    go.Scatter(name='data', x=df.month, y=df.borrower_retention, mode='markers'),
    go.Scatter(name='regression', x=df.month, y=fit, mode='lines')
])

fig.update_layout(xaxis=dict(title='Cohort Month'), yaxis=dict(title='Retention'))

fig.show()
print(f'R-squared: {round(r**2, 3)}')
print(f'Slope: {round(slope, 3)}')

R-squared: 0.826
Slope: -0.008


In [11]:
print(f'{round(100*slope*12, 2)}%')

-9.01%


The plot above shows retention vs. cohort month: a vertical slice of the previous plot at **1 month after disbursement**. Cohort month 1 corresponds to the earliest cohort in the data, **2020-09**. Here the drop in retention from earlier to more recent cohorts is much clearer. Since consecutive cohorts are one month apart, multiplying the slope by 12 gives the change in retention across a year of cohorts: the fitted line puts retention at **1 month after disbursement** about 9 percentage points lower for the 2021-09 cohort than for the 2020-09 cohort. The trend holds for later months after disbursement, which can be checked by changing the `month` variable in the cell above.
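Rather than editing `month` by hand, the same fit can be repeated for every month since disbursement in one pass. The following is a minimal sketch, not part of `ficast`: `trend_by_month` is a hypothetical helper, the demo frame is made-up data (not the notebook's), and sorting on `cohort` assumes that column orders chronologically (for example `YYYY-MM` strings).

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress


def trend_by_month(df, metric, month_col='Months Since First Loan Disbursed'):
    """Regress `metric` on cohort order, separately for each month since disbursement."""
    rows = []
    for month, grp in df.groupby(month_col):
        if len(grp) < 3:  # too few cohorts for a meaningful fit
            continue
        grp = grp.sort_values('cohort')  # earliest cohort first
        x = np.arange(1, len(grp) + 1)   # cohort order: 1, 2, 3, ...
        res = linregress(x, grp[metric].to_numpy())
        rows.append({
            'month': month,
            'slope': res.slope,
            'r_squared': res.rvalue ** 2,
            'annual_change_pp': 100 * 12 * res.slope,  # percentage points per 12 cohorts
        })
    return pd.DataFrame(rows).set_index('month')


# Made-up data: 12 monthly cohorts whose retention falls by exactly 0.0075 per cohort.
cohorts = list(pd.period_range('2020-09', periods=12, freq='M').astype(str))
demo = pd.DataFrame({
    'cohort': cohorts * 2,
    'Months Since First Loan Disbursed': np.repeat([1, 2], len(cohorts)),
    'borrower_retention': np.concatenate([
        0.50 - 0.0075 * np.arange(12),
        0.35 - 0.0075 * np.arange(12),
    ]),
})

trends = trend_by_month(demo, 'borrower_retention')
print(trends.round(3))
```

On the notebook's data the call would presumably be `trend_by_month(ps.data, 'borrower_retention')`, assuming `ps.data` carries the `cohort` and `borrower_retention` columns used in the cell above.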

#### Is survival changing over time?

In [12]:
ps.plot_cohorts('borrower_survival', data='raw')

The survival curves above show spread, but is there a trend between earlier and more recent cohorts as we just saw in retention?

In [13]:
month = 2

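# .copy() so that adding a column below doesn't write to a slice of ps.data
df = ps.data[ps.data['Months Since First Loan Disbursed']==month][['cohort', 'borrower_survival']].copy()
# cohort order; assumes rows are already sorted from earliest to most recent cohort
df['month'] = np.arange(1, len(df)+1)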

slope, intercept, r, p, se = linregress(x=df.month, y=df.borrower_survival)
fit = slope*df['month'] + intercept

fig = go.Figure([
    go.Scatter(name='data', x=df.month, y=df.borrower_survival, mode='markers'),
    go.Scatter(name='regression', x=df.month, y=fit, mode='lines')
])

fig.update_layout(xaxis=dict(title='Cohort Month'), yaxis=dict(title='Survival'))

fig.show()
print(f'R-squared: {round(r**2, 3)}')
print(f'Slope: {round(slope, 3)}')

R-squared: 0.276
Slope: -0.004


The plot above is again a vertical slice, this time of the survival plot. Here we look at **2 months after disbursement** because survival in the 1st month is exactly the same as retention. The picture is very different from retention. The fit is much worse (R-squared of 0.276 vs. 0.826): there does seem to be a general downward trend, but the data has a lot more spread and a weaker dependence on cohort month. The slope is also about half as steep (-0.004 vs. -0.008). At later months, the trend is even weaker.

#### Is the initial cohort size changing month-to-month?

In [14]:
first_loans = ps.data[ps.data['Count First Loans'] != 0][['First Loan Local Disbursement Month', 
                                                          'Count Borrowers']]
first_loans = first_loans.set_index('First Loan Local Disbursement Month')

fig = go.Figure(go.Bar(x=first_loans.index, y=first_loans['Count Borrowers']))
fig.update_layout(xaxis=dict(title='Cohort'), yaxis=dict(title='Count Borrowers'))

fig.show()

To rule out cohort size as a confounding factor, I want to make sure the sample size doesn't differ systematically between cohorts. The plot above shows the initial borrower count for each cohort. There is some month-to-month variation, and possibly some seasonality, but no consistent trend over time.
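The visual read of "no consistent trend" could also be backed by the same kind of regression used for retention. The sketch below is one possible check, not something the notebook runs: `size_trend` is a hypothetical helper, and both input series are made-up numbers chosen to show the two extremes.

```python
import numpy as np
from scipy.stats import linregress


def size_trend(sizes):
    """Linear trend of initial cohort size against cohort order."""
    sizes = np.asarray(sizes, dtype=float)
    x = np.arange(1, len(sizes) + 1)
    res = linregress(x, sizes)
    return {
        # slope expressed relative to the average cohort size
        'pct_change_per_cohort': 100 * res.slope / sizes.mean(),
        'r_squared': res.rvalue ** 2,
        'p_value': res.pvalue,  # test of the null hypothesis that the slope is zero
    }


# Made-up cohort sizes, not values from the notebook's data.
flat = size_trend([100, 120, 100, 120, 100])     # fluctuates, no drift
growing = size_trend([100, 110, 120, 130, 140])  # steady growth
print(flat)
print(growing)
```

With the real data, the input would presumably be `first_loans['Count Borrowers']` from the cell above; a slope near zero with a large p-value would support treating cohort size as stable.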

#### Conclusions & Discussion

1. Overall retention (regardless of *Month Since Disbursement*) is decreasing over time: earlier cohorts had higher retention across most or all months since first disbursement.
2. Survival doesn't show the same change: its downward trend is much weaker and noisier.
3. Cohort sizes vary, but there's no trend between earlier and later cohorts.

#### What does all of this mean with respect to forecasting retention & survival?

Because we don't see a significant change in survival from cohort to cohort, we can be confident that the survival curve behaves similarly across cohorts. Whatever forecasting methodology we apply should therefore be consistently valid within each cohort and over time.

Because retention changes strongly from cohort to cohort, however, it's not valid to apply one model across all cohorts. In other words, we should model each cohort individually.

## Forecasting Analysis