# pLTV Python Model - MVP

## TO DO LIST

1. Add all forecasting methods to model. - DONE
2. Add back testing to model. - DONE
3. Change plotting method for backtest_report.
4. **COMPLETE MVP**
5. Revisit loan size forecast, errors seem large.
6. Refactor code into modules.
7. Create functions for each forecast method (for plug and play).
8. Explore why it seems that retention in previous month changes with time for all cohorts.

#### Resources

GitHub repo
https://github.com/kliao-tala/pLTV

Fader & Hardie paper on sBG model
https://drive.google.com/file/d/1tfMiERon1HgWo8dDJddwSzwc_AXK7LCA/view?usp=sharing
https://faculty.wharton.upenn.edu/wp-content/uploads/2012/04/Fader_hardie_jim_07.pdf

YouTube tutorial on pulling Look data
https://www.youtube.com/watch?v=EKwtLBnwXHk&list=PLXwS3L4W3KR2fhnQa-sLyajPZ9-L00KGX&index=1

Looker Python SDK & API
https://inventure.looker.com/extensions/marketplace_extension_api_explorer::api-explorer/4.0

---

In [2]:
import numpy as np

import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

from scipy.optimize import curve_fit, minimize
from scipy.special import logsumexp

import plotly
from plotly import graph_objects as go
import plotly.io as pio

# change default plotly theme
pio.templates.default = "plotly_white"

import sbg

In [3]:
inputs = pd.read_csv('data/pltv_inputs.csv')
inputs = inputs.set_index('market')
data = pd.read_csv('data/ke_data.csv')
pltv_expected = pd.read_csv('data/pltv_expected.csv')

## Data Munging

In [4]:
# fix date inconsistencies
data = data.replace({'2021-9': '2021-09', '2021-8': '2021-08', '2021-7': '2021-07', 
              '2021-6': '2021-06', '2021-5': '2021-05', '2021-4': '2021-04', '2020-9': '2020-09'})

# sort by months since first disbursement
data = data.sort_values(['First Loan Local Disbursement Month', 
                         'Months Since First Loan Disbursed'])

# remove all columns calculated through looker
data = data.loc[:,:"Default Rate Amount 51D"]
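
The hard-coded replacement map above only covers the month labels seen so far. As a minimal sketch, assuming the labels always have the form `YYYY-M` or `YYYY-MM`, the same fix can be written generically; `pad_month` is a hypothetical helper and not part of the notebook.

```python
import pandas as pd

def pad_month(labels: pd.Series) -> pd.Series:
    """Zero-pad single-digit months, e.g. '2021-9' -> '2021-09' (hypothetical helper)."""
    # a label ending in '-<one digit>' gains a leading zero; padded labels are left untouched
    return labels.str.replace(r"-(\d)$", r"-0\1", regex=True)

# made-up labels covering both the broken and the already-correct form
months = pd.Series(["2021-9", "2021-10", "2020-09", "2021-4"])
padded = pad_month(months)
print(padded.tolist())
```

Applied to the cohort column only, this would avoid touching other columns that happen to contain matching strings.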

In [5]:
data.head()

| | First Loan Local Disbursement Month | Months Since First Loan Disbursed | Count First Loans | Count Borrowers | Count Loans | Total Amount | Total Interest Assessed | Total Rollover Charged | Total Rollover Reversed | Default Rate Amount 7D | Default Rate Amount 30D | Default Rate Amount 51D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-09 | 0 | 7801 | 7801 | 13156 | 48361000 | 6540240 | 681325 | 81520 | 0.155382 | 0.121192 | 0.113031 |
| 15 | 2020-09 | 1 | 0 | 4481 | 5697 | 34490000 | 4660880 | 416544 | 32387 | 0.130661 | 0.101823 | 0.095738 |
| 2 | 2020-09 | 2 | 0 | 3661 | 4310 | 31461000 | 4297310 | 401077 | 30617 | 0.139719 | 0.103958 | 0.094792 |
| 9 | 2020-09 | 3 | 0 | 3050 | 3599 | 30482000 | 4178400 | 343062 | 17629 | 0.125111 | 0.089089 | 0.084399 |
| 10 | 2020-09 | 4 | 0 | 2549 | 2985 | 29303000 | 3964590 | 300262 | 920 | 0.11372 | 0.08675 | 0.081658 |


## pLTV Model

### sBG probabilistic model

sBG model assumptions:
1. The propensity of one customer to drop out is independent of the behavior of every other customer.
2. Individual customer retention rates are unchanged over time.
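3. Observed retention increases with time because of aggregation over customers with heterogeneous behavior: customers with a high propensity to churn drop out early, leaving a progressively more loyal population.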
4. Model applies for customer relationships in "discrete-time" and "contractual" settings.
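
The assumptions above lead to closed-form expressions in the Fader & Hardie paper: P(T=t) = B(alpha+1, beta+t-1)/B(alpha, beta) and S(t) = B(alpha, beta+t)/B(alpha, beta), where B is the beta function. A minimal sketch of these formulas follows; the function names `churn_prob` and `survival` and the parameter values are illustrative choices, not the notebook's own code, which uses an equivalent recursion.

```python
import numpy as np
from scipy.special import betaln

def churn_prob(t, alpha, beta):
    """P(T = t): probability that a customer churns in period t (t = 1, 2, ...)."""
    return np.exp(betaln(alpha + 1, beta + t - 1) - betaln(alpha, beta))

def survival(t, alpha, beta):
    """S(t): probability that a customer is still active after period t."""
    return np.exp(betaln(alpha, beta + t) - betaln(alpha, beta))

# illustrative parameter values, not fitted to any data
alpha, beta = 1.0, 1.0
t = np.arange(1, 7)
S = survival(t, alpha, beta)

# period-over-period retention r(t) = S(t) / S(t-1), with S(0) = 1
r = S / np.concatenate(([1.0], S[:-1]))
print(np.round(r, 3))
```

Even though each individual's churn probability is constant, the aggregate retention `r` rises with `t`, which is the effect described in assumption 3.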

In [6]:
# initial guesses @ alpha and beta
alpha = 1
beta = 1

In [7]:
def p(t, alpha, beta):
    """
    Probability that a customer fails to take out another loan (probability to churn).
    For the derivation of this equation, see the original Fader & Hardie paper. This 
    recursion formula takes two constants, alpha and beta, which are fit to actual data.
    It then allows you to compute the probability of churn for a given time period, t.
    
    Parameters
    ----------
    t : int
        Time period.
    alpha : float
        Fitting parameter.
    beta : float
        Fitting parameter.
    
    Returns
    -------
    P : float
        Probability of churn.
    """
    
    eps = 1e-50
    
    if alpha + beta < eps:
        if t==1:
            return alpha/(eps)
        else:
            return p(t-1, alpha, beta) * (beta+t-2)/(eps+t-1)
    else:
        if t==1:
            return alpha/(alpha + beta)
        else:
            return p(t-1, alpha, beta) * (beta+t-2)/(alpha+beta+t-1)
    
def s(t, alpha, beta):
    """
    Survival function: the probability that a customer has survived to time t.
    For the derivation of this equation, see the original Fader & Hardie paper. This 
    recursion formula takes two constants, alpha and beta, which are fit to actual data.
    It also requires computation of P (probability of a customer churning).
    
    Parameters
    ----------
    t : int
        Time period.
    alpha : float
        Fitting parameter.
    beta : float
        Fitting parameter.
    
    Returns
    -------
    S : float
        Probability of survival.
    """
    
    if t==1:
        return 1 - p(t, alpha, beta)
    else:
        return s(t-1, alpha, beta) - p(t, alpha, beta)
    
def log_likelihood(params, c):
    """
    Computes the *negative* log-likelihood of the probability distribution of customers
    still being active at time t. For a derivation of the log-likelihood, see Appendix A
    in the original Fader & Hardie paper. The function computes the log-likelihood at 
    every time step, t, leading up to the last time period T. The final value is simply
    the sum of the log-likelihood computed at each time step. In the end, we return the 
    negative of the log-likelihood so that we can use scipy's minimize function to optimize
    for the values of alpha and beta.
    
    Parameters
    ----------
    params : array
        Array containing alpha and beta values.
    c : array
        Array containing borrower count for a given cohort.
    
    Returns
    -------
    ll : float
        Negative log-likelihood value.
    """
        
    alpha, beta = params
    eps = 1e-50
    
    # initialize log-likelihood (ll) value at 0
    ll=0
    
    # for each time period in the *actual* data, compute ll and add it to the running total
    for t in c[1:].index:
        # if P is less than epsilon, replace it with epsilon.
        if p(t, alpha, beta) < eps:
            ll += (c[t-1]-c[t])*np.log(eps)
        else:
            ll += (c[t-1]-c[t])*np.log(p(t, alpha, beta))
    
    # add the final term which is associated with customers who are still active at the end
    # of the final period.
    
    # replace the argument of the np.log() function with epsilon if smaller than epsilon.
    if s((len(c)-1)-1, alpha, beta)-p(len(c)-1, alpha, beta) < eps:
        ll += c.iloc[-1]*np.log(eps)
    else:
        ll += c.iloc[-1]*np.log(s((len(c)-1)-1, alpha, beta)-p(len(c)-1, alpha, beta))
    
    return -ll
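
One way to sanity-check the likelihood is to fit it on synthetic counts generated from known parameters. The sketch below is a vectorized, non-recursive variant using the closed-form sBG expressions; `neg_log_likelihood`, `sbg_survival`, and the "true" parameter values are assumptions made for illustration and are not the functions defined above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

def sbg_survival(t, alpha, beta):
    """Closed-form sBG survival probability S(t)."""
    return np.exp(betaln(alpha, beta + t) - betaln(alpha, beta))

def neg_log_likelihood(params, counts):
    """Negative sBG log-likelihood for active-customer counts at periods 0..T."""
    alpha, beta = params
    t = np.arange(1, len(counts))
    # log P(T = t) for every observed period
    log_p = betaln(alpha + 1, beta + t - 1) - betaln(alpha, beta)
    # log S(T) for customers still active at the end of the window
    log_s = betaln(alpha, beta + t[-1]) - betaln(alpha, beta)
    lost = counts[:-1] - counts[1:]
    return -(np.sum(lost * log_p) + counts[-1] * log_s)

# synthetic cohort generated from made-up parameters
true_alpha, true_beta = 0.7, 3.0
periods = np.arange(1, 9)
true_curve = sbg_survival(periods, true_alpha, true_beta)
counts = 10_000 * np.concatenate(([1.0], true_curve))

# strictly positive lower bounds keep the beta function finite
result = minimize(neg_log_likelihood, x0=np.array([1.0, 1.0]), args=(counts,),
                  bounds=((1e-6, 1e5), (1e-6, 1e5)))
fitted_curve = sbg_survival(periods, *result.x)
print(result.x)
```

If the fitted survival curve does not track the synthetic one closely, the likelihood or the optimizer settings deserve a second look before trusting fits on real cohorts.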

In [8]:
# KSH to USD conversion factor
ksh_usd = 0.00925
late_fee = 0.08 # % fee on defaults

In [9]:
# model parameters
min_months = 4

In [90]:
class Model:
    """
    sBG model class containing all functionality for creating, analyzing, and backtesting
    the sBG model.
    
    Parameters
    ----------
    data : pandas DataFrame
        Raw data pulled from Looker.
    market : str
        Market code (e.g. 'ke').
        
    Methods
    -------
    clean_data
        Performs all data cleaning steps and returns the cleaned data.
        
    borrower_retention(cohort_data)
        Computes borrower retention.
    
    """
    
    def __init__(self, data, market):
        self.data = data
        self.market = market
        self.forecast_cols = ['Count Borrowers', 'borrower_retention', 'borrower_survival', 'loan_size', 
                'loans_per_borrower', 'Count Loans', 'Total Amount', 'interest_rate', 'default_rate_7dpd',
                'default_rate_51dpd', 'default_rate_365dpd', 'loans_per_original', 
                'origination_per_original', 'revenue_per_original', 'cm$_per_original',
                'opex_per_original', 'ltv_per_original', 'cm%_per_original']
        
        self.clean_data()
        
        
    def clean_data(self):
        # fix date inconsistencies
        self.data = self.data.replace({'2021-9': '2021-09', '2021-8': '2021-08', \
                                       '2021-7': '2021-07', '2021-6': '2021-06', \
                                       '2021-5': '2021-05', '2021-4': '2021-04', \
                                       '2020-9': '2020-09'})

        # sort by months since first disbursement
        self.data = self.data.sort_values(['First Loan Local Disbursement Month', 
                                 'Months Since First Loan Disbursed'])

        # remove all columns calculated through looker
        self.data = self.data.loc[:,:"Default Rate Amount 51D"]
        
        # add more convenient cohort column
        self.data['cohort'] = self.data['First Loan Local Disbursement Month']
        
        
    # --- DATA FUNCTIONS --- #
    def borrower_retention(self, cohort_data):
        return cohort_data['Count Borrowers']/cohort_data['Count Borrowers'].max()

    
    def borrower_survival(self, cohort_data):
        return cohort_data['Count Borrowers']/cohort_data['Count Borrowers'].shift(1)
    
    
    def loans_per_borrower(self, cohort_data):
        return cohort_data['Count Loans']/cohort_data['Count Borrowers']
    
    
    def loan_size(self, cohort_data, to_usd):
        df = cohort_data['Total Amount']/cohort_data['Count Loans']
        if to_usd:
            df *= ksh_usd
        return df
    
    
    def interest_rate(self, cohort_data):
        return cohort_data['Total Interest Assessed']/cohort_data['Total Amount']
    
    
    def default_rate(self, cohort_data, period=7):
        if period==7:
            return cohort_data['Default Rate Amount 7D'].fillna(0)
        
        elif period==51:
            default_rate = cohort_data['Default Rate Amount 51D']

            recovery_rate_51 = float(inputs.loc[self.market, 'recovery_7-30'] + \
                                     inputs.loc[self.market, 'recovery_30-51'])

            ## fill null 51dpd values with 7dpd values based on recovery rates
            derived_51dpd = (cohort_data['Count Loans']*(cohort_data['default_rate_7dpd']) - \
                cohort_data['Count Loans']*(cohort_data['default_rate_7dpd'])*recovery_rate_51)/ \
                cohort_data['Count Loans']
            
            return default_rate.fillna(derived_51dpd)
        
        elif period==365:
            # no observed 365dpd column exists in the raw data, so start from an all-NaN series
            default_rate = np.nan*cohort_data['Default Rate Amount 51D']

            recovery_rate_365 = float(inputs.loc[self.market, 'recovery_51_'])

            ## fill null 365dpd values with 51dpd values based on recovery rates
            derived_365dpd = (cohort_data['Count Loans']*(cohort_data['default_rate_51dpd']) - \
                cohort_data['Count Loans']*(cohort_data['default_rate_51dpd'])* \
                recovery_rate_365)/cohort_data['Count Loans']

            return default_rate.fillna(derived_365dpd)
        
        
    def loans_per_original(self, cohort_data):
        return cohort_data['Count Loans']/cohort_data['Count Borrowers'].max()
    
    
    def origination_per_original(self, cohort_data, to_usd):
        df = cohort_data['Total Amount']/cohort_data['Count Borrowers'].max()
        if to_usd:
            df *= ksh_usd
        return df
    
    
    def revenue_per_original(self, cohort_data, to_usd):
        interest_revenue = cohort_data['origination_per_original']*cohort_data['interest_rate']
        
        # late_fee is the % fee we charge to defaulted customers
        revenue = interest_revenue + (cohort_data['origination_per_original'] + interest_revenue) * \
            cohort_data['default_rate_7dpd']*late_fee
        
        # note that origination_per_original is already in USD so no conversion is necessary
        return revenue
    
    
    def credit_margin(self, cohort_data):
        return cohort_data['revenue_per_original'] - \
                (cohort_data['origination_per_original'] + cohort_data['revenue_per_original'])* \
                cohort_data['default_rate_365dpd']
    
    
    def opex_per_original(self, cohort_data):
        opex_cost_per_loan = float(inputs.loc[self.market, 'opex cost per loan'])
        cost_of_capital = float(inputs.loc[self.market, 'cost of capital'])/12
        
        return opex_cost_per_loan*cohort_data['loans_per_original'] + \
            cost_of_capital*cohort_data['origination_per_original']
    
    
    def ltv_per_original(self, cohort_data):
        return cohort_data['cm$_per_original'] - cohort_data['opex_per_original']
    
    
    def credit_margin_percent(self, cohort_data):
        return cohort_data['ltv_per_original']/cohort_data['revenue_per_original']
        
        
    def generate_features(self, to_usd=True):
        """
        Generate all features required for pLTV model.
        """
        cohorts = []

        # for each cohort
        for cohort in self.data.loc[:,'First Loan Local Disbursement Month'].unique():
            # omit the last two months of incomplete data
            cohort_data = self.data[self.data['First Loan Local Disbursement Month']==cohort].iloc[:-2,:]

            # call data functions to generate calculated features
            cohort_data['borrower_retention'] = self.borrower_retention(cohort_data)
            cohort_data['borrower_survival'] = self.borrower_survival(cohort_data)
            cohort_data['loans_per_borrower'] = self.loans_per_borrower(cohort_data)
            cohort_data['loan_size'] = self.loan_size(cohort_data, to_usd)
            cohort_data['interest_rate'] = self.interest_rate(cohort_data)
            cohort_data['default_rate_7dpd'] = self.default_rate(cohort_data, period=7)
            cohort_data['default_rate_51dpd'] = self.default_rate(cohort_data, period=51)
            cohort_data['default_rate_365dpd'] = self.default_rate(cohort_data, period=365)
            cohort_data['loans_per_original'] = self.loans_per_original(cohort_data)
            cohort_data['origination_per_original'] = self.origination_per_original(cohort_data, to_usd)
            cohort_data['revenue_per_original'] = self.revenue_per_original(cohort_data, to_usd)
            cohort_data['cm$_per_original'] = self.credit_margin(cohort_data)
            cohort_data['opex_per_original'] = self.opex_per_original(cohort_data)
            cohort_data['ltv_per_original'] = self.ltv_per_original(cohort_data)
            cohort_data['cm%_per_original'] = self.credit_margin_percent(cohort_data)
            
            # reset the index and append the data
            cohorts.append(cohort_data.reset_index(drop=True))

        self.cohorts = cohorts
        self.data = pd.concat(cohorts, axis=0)
    
    
    def plot_cohorts(self, param, data='raw'):
        """
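        Generate a plot of a specific parameter, one trace per cohort.
        
        Parameters
        ----------
        param : str
            Column to plot. For data='backtest_report', a '<column>-<metric>' name
            such as 'borrower_retention-mpe'.
        data : str
            Dataset to plot: 'raw', 'forecast', 'backtest', or 'backtest_report'.
            The first three produce scatter plots; 'backtest_report' produces a bar chart.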
        """
        
        if data == 'raw' or data == 'forecast' or data == 'backtest':
            curves = []

            if data == 'forecast':
                for cohort in self.forecast.cohort.unique():
                    c_data = self.forecast[self.forecast.cohort==cohort]
                    for dtype in c_data.data_type.unique():
                        output = c_data[c_data.data_type==dtype][param]

                        output.name = cohort + '-' + dtype

                        curves.append(output)

            elif data == 'backtest':
                for cohort in self.backtest_data.cohort.unique():
                    c_data = self.backtest_data[self.backtest_data.cohort==cohort]

                    # append raw data
                    output = self.data[self.data.cohort==cohort][param]
                    output.name = cohort + '-actual'

                    curves.append(output)

                    # append forecast
                    output = c_data[c_data.data_type=='forecast'][param]
                    output.name = cohort + '-forecast'

                    curves.append(output)

            elif data == 'raw':
                for cohort in self.data.cohort.unique():
                    output = self.data[self.data.cohort==cohort][param]

                    output.name = cohort

                    curves.append(output)

            traces = []

            for cohort in curves:
                if 'forecast' in cohort.name:
                    traces.append(go.Scatter(name=cohort.name, x=cohort.index, y=cohort, mode='lines',
                                            line=dict(width=3, dash='dash')))
                else:
                    if cohort.notnull().any():
                        traces.append(go.Scatter(name=cohort.name, x=cohort.index, y=cohort, mode='markers+lines',
                                            line=dict(width=2)))

            fig = go.Figure(traces)
            fig.update_layout(title=f'{param} - {data.upper()}',
                              xaxis=dict(title='Month Since Disbursement'),
                              yaxis=dict(title=param))

            fig.show()
        
        elif data == 'backtest_report':
            curves = []
            for cohort in self.backtest_report.cohort.unique():
                c_data = self.backtest_report[self.backtest_report.cohort==cohort]
                output = c_data[param]

                output.name = cohort

                curves.append(output)
                
            traces = []
            for cohort in curves:
                traces.append(go.Bar(name=cohort.name, x=cohort.index, y=cohort))

            metric = param.split('-')[1].upper()
            fig = go.Figure(traces)
            fig.update_layout(title=f'{self.backtest_months} Month Backtest - {metric}',
                              xaxis=dict(title='Month Since Disbursement'),
                              yaxis=dict(title=param))

            fig.show()
        
        
    # --- FORECAST FUNCTIONS --- #
    def forecast_features(self, data, method='', months=24, to_usd=True):
        """
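        Forecasts "Count Borrowers" with the sBG model out to the input number of months,
        then forecasts the remaining features from it. The actual and forecasted values
        for all cohorts are returned as a single dataframe.
        
        Parameters
        ----------
        data : pandas DataFrame
            Cohort data with generated features (see generate_features).
        method : str
            Currently unused.
        months : int
            Number of months to forecast to.
        to_usd : bool
            Whether origination amounts are converted to USD.
        
        Returns
        -------
        pandas DataFrame
            Actuals and forecasts for every cohort with at least min_months data points.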
        """
        
        # initialize alpha and beta, optimized later by model
        alpha = beta = 1
        
        # list to hold individual cohort forecasts
        forecast_dfs = []

        # range of desired time periods
        times = list(range(1, months+1))
        times_dict = {i:i for i in times}
        
        for cohort in data.cohort.unique():
            # data for current cohort
            c_data = data[data.cohort==cohort]
            
            # starting cohort size
            n = c_data.loc[0, 'Count Borrowers']

            # only for cohorts with at least 4 data points
            if len(c_data) >= min_months:
                c = c_data['Count Borrowers']
                
                def power_law(t, A, B):
                    pass

                # define bounds for alpha and beta (must be positive)
                bounds = ((0,1e5), (0,1e5))
                
                # use scipy's minimize function on log_likelihood to optimize alpha and beta
                results = minimize(log_likelihood, np.array([alpha,beta]), args=(c,), bounds=bounds)

                
                # list to hold forecasted values 
                forecast = []
                for t in times:
                    forecast.append(n*s(t, results.x[0], results.x[1]))

                # convert list to dataframe
                forecast = pd.DataFrame(forecast, index=times, columns=['Count Borrowers'])

                # null df used to extend original cohort df to desired number of forecast months
                dummy_df = pd.DataFrame(np.nan, index=range(0,months+1), columns=['null'])

                # create label column to denote actual vs forecast data
                c_data['data_type'] = 'actual'

                # extend cohort df
                c_data = pd.concat([c_data, dummy_df], axis=1).drop('null', axis=1)
                
                # fill missing values in each col
                c_data.cohort = c_data.cohort.ffill()
                c_data['First Loan Local Disbursement Month'] = \
                    c_data['First Loan Local Disbursement Month'].ffill()
                c_data['Months Since First Loan Disbursed'] = \
                    c_data['Months Since First Loan Disbursed'].fillna(times_dict).astype(int)
                c_data['Count First Loans'] = c_data['Count First Loans'].ffill()
                

                # label forecasted data
                c_data.data_type = c_data.data_type.fillna('forecast')

                # fill in the forecasted data
                c_data['Count Borrowers'] = c_data['Count Borrowers'].fillna(forecast['Count Borrowers'])
                
                # add retention & survival features
                c_data['borrower_retention'] = self.borrower_retention(c_data)
                c_data['borrower_survival'] = self.borrower_survival(c_data)
                
                
                # forecast loan size
                for i in c_data[c_data.loan_size.isnull()].index:
                    c_data.loc[i, 'loan_size'] = c_data.loc[i-1, 'loan_size'] * \
                    pltv_expected.loc[i,'loan_size']/pltv_expected.loc[i-1, 'loan_size']
                
                
                # forecast loans_per_borrower
                for i in c_data[c_data.loans_per_borrower.isnull()].index:
                    c_data.loc[i, 'loans_per_borrower'] = c_data.loc[i-1, 'loans_per_borrower'] * \
                    pltv_expected.loc[i,'loans_per_borrower']/pltv_expected.loc[i-1, 'loans_per_borrower']
                
                
                # forecast Count Loans
                c_data['Count Loans'] = c_data['Count Loans'].fillna((c_data['loans_per_borrower'])*c_data['Count Borrowers'])
                
                
                # forecast Total Amount
                c_data['Total Amount'] = c_data['Total Amount'].fillna((c_data['loan_size']/ksh_usd)*c_data['Count Loans'])
                
                
                # forecast Interest Rate
                for i in c_data[c_data.interest_rate.isnull()].index:
                    c_data.loc[i, 'interest_rate'] = c_data.loc[i-1, 'interest_rate'] * \
                    pltv_expected.loc[i,'interest_rate']/pltv_expected.loc[i-1, 'interest_rate']
                
                
                # forecast default rate 7dpd
                default = c_data.default_rate_7dpd.dropna()
                default.index = np.arange(1, len(default)+1)
                
                def func(t, A, B):
                    return A*(t**B)

                params, covs = curve_fit(func, default.index, default)
                    
                t = list(range(1, months+2))
                fit = func(t, params[0], params[1])
                fit = pd.Series(fit, index=t).reset_index(drop=True)
                
                c_data['default_rate_7dpd'] = c_data['default_rate_7dpd'].fillna(fit)
                
                
                # derive 51dpd and 365 dpd from 7dpd
                c_data['default_rate_51dpd'] = self.default_rate(c_data, period=51)
                c_data['default_rate_365dpd'] = self.default_rate(c_data, period=365)
                
                # compute remaining columns from forecasts
                c_data['loans_per_original'] = self.loans_per_original(c_data)
                c_data['origination_per_original'] = self.origination_per_original(c_data, to_usd)
                c_data['revenue_per_original'] = self.revenue_per_original(c_data, to_usd)
                c_data['cm$_per_original'] = self.credit_margin(c_data)
                c_data['opex_per_original'] = self.opex_per_original(c_data)
                c_data['ltv_per_original'] = self.ltv_per_original(c_data)
                c_data['cm%_per_original'] = self.credit_margin_percent(c_data)
                
                
                
                # add the forecasted data for the cohort to a list, aggregating all cohort forecasts
                forecast_dfs.append(c_data)

        return pd.concat(forecast_dfs)
    
    
    def backtest(self, data, months=4, metric='mape'):
        """
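        Backtest forecasted values against actuals. The last `months` months of each cohort
        are held out, the remaining data is forecast, and the forecast is compared with the
        held-out actuals.
        
        Parameters
        ----------
        data : pandas DataFrame
            Cohort data with generated features.
        months : int
            Number of most recent months to hold out.
        metric : str
            Currently unused; rmse, me, mape, and mpe are all computed.
        
        Returns
        -------
        backtest_data : pandas DataFrame
            Limited data plus forecasts and per-column errors.
        backtest_report : pandas DataFrame
            One row per cohort with a '<column>-<metric>' entry per error metric.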
        """
        
        # print the number of cohorts that will be backtested.
        cohort_count = 0
        for cohort in data.cohort.unique():
            if len(data[data.cohort==cohort]) - months >= min_months:
                cohort_count += 1
        
        self.backtest_months = months
        print(f'Backtesting {months} months.')
        print(f'{cohort_count} cohorts will be backtested.')
        
        def compute_error(actual, forecast, metric='mape'):
            """
            Test forecast performance against actuals using method defined by metric.
            """
            
            # root mean squared error
            if metric=='rmse':
                error = np.sqrt((1/len(actual))*sum((forecast[:len(actual)] - actual)**2))
            # mean absolute error
            elif metric=='mae':
                error = np.mean(abs(forecast[:len(actual)] - actual))
            # mean error
            elif metric=='me':
                error = np.mean(forecast[:len(actual)] - actual)
            # mean absolute percent error
            elif metric=='mape':
                error = (1/len(actual))*sum(abs(forecast[:len(actual)] - actual)/actual)
            # mean percent error
            elif metric=='mpe':
                error = (1/len(actual))*sum((forecast[:len(actual)] - actual)/actual)
            return error
        
        
        # --- Generate limited data --- #
        
        metrics = ['rmse', 'me', 'mape', 'mpe']
        
        limited_data = []
        backtest_report = []
        for cohort in data.cohort.unique():
            # data for current cohort
            c_data = data[data.cohort==cohort]

            # only backtest if remaining data has at least 4 data points
            if len(c_data) - months >= min_months:
                # limit data
                c_data = c_data.iloc[:len(c_data)-months,:]
                
                # forecast the limited data
                c_data = self.forecast_features(c_data)
                
                # get forecast overlap with actuals
                actual = self.data[self.data['First Loan Local Disbursement Month']==cohort]

                start = c_data[c_data.data_type=='forecast'].index.min()
                stop = actual.index.max()

                # compute errors
                backtest_report_cols = []
                errors = []
                for col in self.forecast_cols:
                    # error
                    c_data.loc[start:stop, f'error-{col}'] = c_data.loc[start:stop, col] - actual.loc[start:stop, col]
                    # % error
                    c_data.loc[start:stop, f'%error-{col}'] = (c_data.loc[start:stop, col] - actual.loc[start:stop, col])/ \
                        actual.loc[start:stop, col]
                    
                    for metric in metrics:
                        error = compute_error(actual.loc[start:stop, col], c_data.loc[start:stop, col],
                                             metric=metric)
                    
                        backtest_report_cols += [f'{col}-{metric}']
                    
                        errors.append(error)
                        
                backtest_report.append(pd.DataFrame.from_dict({cohort: errors}, orient='index', 
                                                                 columns=backtest_report_cols))
                limited_data.append(c_data)
                
        backtest_data = pd.concat(limited_data)
        backtest_report = pd.concat(backtest_report, axis=0)
        backtest_report['cohort'] = backtest_report.index
        
        self.backtest_data = backtest_data
        self.backtest_report = backtest_report
                
        return backtest_data, backtest_report
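
The error metrics used in the backtest can be checked by hand on a tiny example. The sketch below is a standalone, vectorized version of the same formulas; `forecast_errors` is a hypothetical helper and the numbers are made up.

```python
import numpy as np

def forecast_errors(actual, forecast):
    """Return RMSE, MAE, ME, MAPE, and MPE of a forecast against actuals."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    diff = forecast - actual
    return {
        "rmse": float(np.sqrt(np.mean(diff ** 2))),    # root mean squared error
        "mae": float(np.mean(np.abs(diff))),           # mean absolute error
        "me": float(np.mean(diff)),                    # mean error (signed bias)
        "mape": float(np.mean(np.abs(diff) / actual)), # mean absolute percent error
        "mpe": float(np.mean(diff / actual)),          # mean percent error (signed)
    }

# made-up values: one over-forecast, one under-forecast, one exact
actual = [100.0, 80.0, 50.0]
forecast = [110.0, 72.0, 50.0]
errors = forecast_errors(actual, forecast)
print(errors)
```

Here the over- and under-forecasts are each 10% off, so the signed MPE cancels to zero while MAPE does not. This is why the signed and absolute metrics are worth reading together.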
    

### Run Model

In [91]:
# create a model object
m = Model(data, market='ke')

# generate features
m.generate_features()

# generate forecasts and save as model attribute
m.forecast = m.forecast_features(m.data)

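# backtest
# note: assigning the result to m.backtest replaces the bound method on this instance,
# so m.backtest() cannot be called a second time without re-creating the model
m.backtest, m.backtest_report = m.backtest(m.data, months=6)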

Backtesting 6 months.
8 cohorts will be backtested.


In [89]:
# visualize cohorts for a given feature
m.plot_cohorts('borrower_retention', data='forecast')

In [42]:
inputs

| market | opex cost per loan | cost of capital | recovery_7-30 | recovery_30-51 | recovery_51_ | max_monthly_borrower_retention |
|---|---|---|---|---|---|---|
| ke | 1.32 | 0.13 | 0.2 | 0.06 | 0.12 | 0.95 |
| ph | 1.08 | 0.13 | 0.06 | 0.03 | 0.04 | 0.97 |
| mx | 1.37 | 0.13 | 0.077 | 0.016 | 0.02 | 0.97 |


In [48]:
c1 = m.data[m.data.cohort=='2021-08']['Count Borrowers']/ \
    m.data[m.data.cohort=='2021-08']['Count Borrowers'].max()

c1.index = np.arange(1, len(c1)+1)

c1

1    1.000000
2    0.483835
3    0.389948
4    0.302697
5    0.229705
Name: Count Borrowers, dtype: float64
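
The series above is retention relative to the cohort's first month. The `Model` class also defines `borrower_survival` as the month-over-month ratio of borrower counts. A minimal sketch of the relationship, using the five values printed above (the variable names are illustrative):

```python
import pandas as pd

# retention relative to the first month, copied from the output above
retention = pd.Series([1.000000, 0.483835, 0.389948, 0.302697, 0.229705],
                      index=range(1, 6), name="retention")

# month-over-month ratio; the first month has no predecessor, so it is NaN
month_over_month = retention / retention.shift(1)
print(month_over_month.round(3))
```

Because both quantities divide by a borrower count, the cohort size cancels and the month-over-month ratio can be computed directly from the retention curve.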

In [88]:
def power_fit(t, a, b):
    return a*t**b
    
# fit actuals and extract a & b params
popt, pcov = curve_fit(power_fit, c1.index, c1)

# generate the full range of times to forecast over
times = np.arange(1, 25)

a = popt[0]
b = popt[1]

# if there is less than 6 months of actuals, scale data. 
if 6 - len(c1) > 0:
    b = b + .2*(6 - len(c1)-1)

# get max survival from inputs
max_survival = inputs.loc['ke', 'max_monthly_borrower_retention']

# take the ratio of the power fit between the current and previous time periods
power_slope = power_fit(times, a=a, b=b)/power_fit(times-1, a=a, b=b)
# the first value divides by power_fit(0), which is infinite for negative b; replace it with 1
power_slope[0] = 1

# apply max survival condition
power_slope_capped = np.array([i if i < max_survival else max_survival for i in power_slope])
# only need values for times we're going to forecast for.
power_slope_capped = power_slope_capped[len(c1):]
power_slope_capped = pd.Series(power_slope_capped, index=[t for t in times[len(c1):]])

c1_fcast = c1.copy()
for t in times[len(c1):]:
    c1_fcast.loc[t] = c1_fcast[t-1] * power_slope_capped[t]
    
fig = go.Figure([
    go.Scatter(x=c1.index, y=c1, mode='markers+lines'),
    go.Scatter(x=times, y=c1_fcast)
])
fig.show()


divide by zero encountered in power



Liang's adjusted power fit.

In [104]:
# code up Liang's power forecast
# given a single cohort data, return the forecasted feature

def power_fcast(c_data, param='borrower_retention'):
    
    # copy so that re-indexing does not modify the caller's data
    c = c_data[param].copy()
    c.index = np.arange(1, len(c)+1)
    
    def power_fit(t, a, b):
        return a*t**b

    # fit actuals and extract a & b params
    popt, pcov = curve_fit(power_fit, c.index, c)

    # generate the full range of times to forecast over
    times = np.arange(1, 25)

    a = popt[0]
    b = popt[1]

    # if there is less than 6 months of actuals, scale data. 
    if 6 - len(c) > 0:
        b = b + .2*(6 - len(c)-1)

    # get max survival from inputs
    max_survival = inputs.loc['ke', 'max_monthly_borrower_retention']

    # take the ratio of the power fit between the current and previous time periods
    power_slope = power_fit(times, a=a, b=b)/power_fit(times-1, a=a, b=b)
    # the first value divides by power_fit(0), which is infinite for negative b; replace it with 1
    power_slope[0] = 1

    # apply max survival condition
    power_slope_capped = np.array([i if i < max_survival else max_survival for i in power_slope])
    # only need values for times we're going to forecast for.
    power_slope_capped = power_slope_capped[len(c):]
    power_slope_capped = pd.Series(power_slope_capped, index=[t for t in times[len(c):]])

    c_fcast = c.copy()
    for t in times[len(c):]:
        c_fcast.loc[t] = c_fcast[t-1] * power_slope_capped[t]
    
    return c_fcast.reset_index(drop=True)
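
The "divide by zero encountered in power" warning comes from evaluating the power law at t = 0. Since the ratio of consecutive fitted values reduces to (t/(t-1))**b, the forecast can be built only for t >= 2, which avoids that evaluation. The sketch below is one possible variant under that assumption. It uses the 2021-08 retention values shown earlier and the 0.95 cap from the inputs table, and it omits the adjustment to `b` for short cohorts.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_fit(t, a, b):
    return a * t ** b

# retention actuals for the 2021-08 cohort, copied from the output shown earlier
actuals = np.array([1.000000, 0.483835, 0.389948, 0.302697, 0.229705])
t_obs = np.arange(1.0, len(actuals) + 1)

# fit the power law to the actuals
(a, b), _ = curve_fit(power_fit, t_obs, actuals)

# cap on month-over-month survival ('max_monthly_borrower_retention' for ke)
max_survival = 0.95

# forecast periods start after the last actual, so t - 1 is never zero
t_fcast = np.arange(len(actuals) + 1.0, 25.0)
# ratio of consecutive fitted values; a cancels, leaving (t / (t - 1)) ** b
ratio = np.minimum((t_fcast / (t_fcast - 1)) ** b, max_survival)

# chain the capped ratios forward from the last actual value
forecast = actuals[-1] * np.cumprod(ratio)
full_curve = np.concatenate([actuals, forecast])
print(np.round(full_curve, 4))
```

Chaining ratios from the last actual keeps the forecast continuous with the observed data, whereas evaluating the fitted curve directly could introduce a jump at the first forecast month.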

In [105]:
c_data = m.data[m.data.cohort=='2021-02']
c_data

| | First Loan Local Disbursement Month | Months Since First Loan Disbursed | Count First Loans | Count Borrowers | Count Loans | Total Amount | Total Interest Assessed | Total Rollover Charged | Total Rollover Reversed | Default Rate Amount 7D | ... | default_rate_7dpd | default_rate_51dpd | default_rate_365dpd | loans_per_original | origination_per_original | revenue_per_original | cm$_per_original | opex_per_original | ltv_per_original | cm%_per_original |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-02 | 0 | 9331 | 9331 | 14950 | 50418000 | 6739290 | 776812 | 0 | 0.16989 | ... | 0.16989 | 0.1343 | 0.118184 | 1.602186 | 49.980334 | 7.45088 | 0.663446 | 2.656339 | -1.992893 | -0.267471 |
| 1 | 2021-02 | 1 | 0 | 5004 | 6525 | 39642000 | 5406950 | 613134 | 0 | 0.170855 | ... | 0.170855 | 0.136472 | 0.120095 | 0.699282 | 39.297878 | 5.970415 | 0.533915 | 1.348779 | -0.814864 | -0.136484 |
| 2 | 2021-02 | 2 | 0 | 3728 | 4444 | 34882000 | 4836720 | 504590 | 0 | 0.15965 | ... | 0.15965 | 0.129584 | 0.114034 | 0.476262 | 34.579198 | 5.297619 | 0.750298 | 1.003274 | -0.252975 | -0.047753 |
| 3 | 2021-02 | 3 | 0 | 3100 | 3681 | 34823000 | 4812110 | 400478 | 1333 | 0.125845 | ... | 0.125845 | 0.100928 | 0.088817 | 0.394491 | 34.520711 | 5.165905 | 1.641065 | 0.894703 | 0.746362 | 0.144478 |
| 4 | 2021-02 | 4 | 0 | 2589 | 2990 | 33393000 | 4613900 | 370717 | 0 | 0.12257 | ... | 0.12257 | 0.096024 | 0.084502 | 0.320437 | 33.103124 | 4.943292 | 1.728311 | 0.781594 | 0.946717 | 0.191515 |
| 5 | 2021-02 | 5 | 0 | 2271 | 2604 | 33427000 | 4622690 | 391779 | 0 | 0.128788 | ... | 0.128788 | 0.090115 | 0.079301 | 0.27907 | 33.136829 | 4.971186 | 1.949191 | 0.727354 | 1.221836 | 0.245784 |
| 6 | 2021-02 | 6 | 0 | 2034 | 2344 | 33123000 | 4620360 | 320267 | 0 | 0.106834 | ... | 0.106834 | 0.07674 | 0.067531 | 0.251206 | 32.835468 | 4.900032 | 2.351711 | 0.687309 | 1.664402 | 0.339672 |
| 7 | 2021-02 | 7 | 0 | 1821 | 2043 | 31164000 | 4307950 | 345493 | 0 | 0.122152 | ... | 0.122152 | 0.097034 | 0.08539 | 0.218948 | 30.893473 | 4.614182 | 1.5822 | 0.62369 | 0.95851 | 0.207731 |
| 8 | 2021-02 | 8 | 0 | 1673 | 1918 | 31267000 | 4313410 | 352118 | 0 | 0.124799 | ... | 0.124799 | 0.089373 | 0.078648 | 0.205551 | 30.995579 | 4.628115 | 1.82637 | 0.607113 | 1.219256 | 0.263446 |
| 9 | 2021-02 | 9 | 0 | 1475 | 1683 | 28778000 | 4123540 | 269868 | 0 | 0.103689 | ... | 0.103689 | 0.069172 | 0.060871 | 0.180367 | 28.528186 | 4.358298 | 2.356465 | 0.547139 | 1.809326 | 0.415145 |
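
The derived columns in the first row of this table can be reproduced by hand from the raw columns, the `ke` row of the inputs table, and the constants `ksh_usd` and `late_fee`. The sketch below walks through that arithmetic with the displayed (rounded) values, so small differences in the last digits are expected. The variable names are illustrative.

```python
# raw values for cohort 2021-02, month 0, as displayed in the table
count_borrowers_max = 9331
count_loans = 14950
total_amount = 50418000
total_interest = 6739290
default_rate_7dpd = 0.16989
default_rate_51dpd = 0.1343

# constants and 'ke' inputs used by the notebook
ksh_usd = 0.00925
late_fee = 0.08
opex_cost_per_loan = 1.32
cost_of_capital = 0.13 / 12   # annual rate converted to monthly
recovery_51_ = 0.12

# per-original quantities, following the Model data functions
loans_per_original = count_loans / count_borrowers_max
origination = total_amount / count_borrowers_max * ksh_usd
interest_rate = total_interest / total_amount

# revenue = interest plus late fees on the defaulted balance
interest_revenue = origination * interest_rate
revenue = interest_revenue + (origination + interest_revenue) * default_rate_7dpd * late_fee

# 365dpd default rate derived from 51dpd via the recovery rate
default_rate_365dpd = default_rate_51dpd * (1 - recovery_51_)

credit_margin = revenue - (origination + revenue) * default_rate_365dpd
opex = opex_cost_per_loan * loans_per_original + cost_of_capital * origination
ltv = credit_margin - opex

print(round(revenue, 4), round(credit_margin, 4), round(opex, 4), round(ltv, 4))
```

The first-month LTV is negative because operating cost per original borrower exceeds the credit margin. Later rows of the table show it turning positive from month 3 onward.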


In [107]:
c = c_data['borrower_retention']
c_fcast = power_fcast(c_data)

fig = go.Figure([
    go.Scatter(x=c.index, y=c, mode='markers+lines'),
    go.Scatter(x=times, y=c_fcast)
])
fig.show()


divide by zero encountered in power



In [92]:
# visualize cohorts for a given feature
m.plot_cohorts('borrower_retention-mpe', data='backtest_report')