# Project Idea 5: Survival Time Analysis
- A loan which gets repaid within a few months has no value to the investor.
- It is even a negative for the investor because of the **opportunity cost** of investing.
- A loan which defaults shortly after issuance is even worse. It often induces
a larger loss.
- Thus it is very important for the loan investor to have an accurate expectation of
the loan performance (in terms of duration or net PnL).
It is even more important when the original investor sells the loan on
the secondary market: a correct expectation of the remaining term of the loan and
of the default rate is crucial for pricing the loan correctly.
https://www.lendingclub.com/investing/investor-education/are-lendingclub-notes-liquid
- Build models to predict loan duration (last_payment_date - issuance_date)
for charged off/default loans and for good loans which terminate with 'fully paid' status.
- Build models to predict total profits/losses (or total principal/interest)

- Show evidence linking loan duration and PnL, which may justify
training a joint model with multiple regression targets.

- Modeling the survival of loans is an important topic with counterparts in many domains.
In health care/health insurance it is usually called survival analysis.
In modeling product life expectancy, engineers use a different term: reliability analysis.
The same line of thinking applies to financial instrument default or marketing/customer retention.
- Note that there is an add-on survival analysis package, 'scikit-survival', built on top of **sklearn**.
There are also many R packages dedicated to survival analysis; see
https://cran.r-project.org/web/views/Survival.html for a survey.
- You can either start from scratch or adapt the traditional approach of
survival/hazard analysis. In the latter approach, you need to read research papers effectively.
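
As a starting point for the survival/hazard route, the Kaplan-Meier estimator fits in a few lines. The sketch below is a minimal pure-Python version on hypothetical toy data. It assumes that charge-off is the event of interest and that every other loan is censored at its last observed month; `kaplan_meier` is an illustrative name. In practice a library such as lifelines (`KaplanMeierFitter`) or scikit-survival would normally be used instead.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)      # events and censored loans both leave the risk set
    at_risk = len(durations)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            surv *= 1.0 - d / at_risk   # S(t) = S(t-) * (1 - d_t / n_t)
            curve.append((t, surv))
        at_risk -= exits[t]
    return curve

# Hypothetical toy data: months on book, and 1 = charged off, 0 = censored
durations = [3, 5, 5, 8, 12, 12, 20, 36]
events = [1, 1, 0, 1, 0, 1, 0, 0]

curve = kaplan_meier(durations, events)
for month, s in curve:
    print(month, round(s, 3))
```

Each printed pair is a month and the estimated probability that a loan has not charged off by then.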

## 1. Importing Data


In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
from datetime import datetime as dt
from datetime import timedelta

In [2]:
accepted = pd.read_csv('accepted_2007_to_2018Q4.csv')




In [3]:
accepted.shape

(2260701, 151)

In [4]:
accepted.sample(5)

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
2253903,91159542,,20000.0,20000.0,20000.0,60 months,12.79,452.92,C,C1,...,,,Cash,N,,,,,,
1694935,99420568,,20175.0,20175.0,20175.0,36 months,13.49,684.55,C,C2,...,,,Cash,N,,,,,,
1935516,1134180,,4500.0,4500.0,4500.0,36 months,7.51,140.0,A,A3,...,,,Cash,N,,,,,,
1457632,142054292,,1000.0,1000.0,1000.0,36 months,26.31,40.46,E,E4,...,,,Cash,N,,,,,,
227875,54513928,,12000.0,12000.0,12000.0,36 months,6.89,369.93,A,A3,...,,,Cash,Y,Apr-2017,COMPLETE,Mar-2017,3403.53,40.0,1.0


## 2. Data Pre-Processing


In [5]:
accepted['issue_date'] = pd.to_datetime(accepted['issue_d'])
accepted['issue_d'] = accepted['issue_date'].apply(lambda x: str(x)[:7])
accepted['last_pymnt_date'] = pd.to_datetime(accepted['last_pymnt_d'])
accepted['paid_months'] = (accepted['last_pymnt_date'].dt.year - accepted['issue_date'].dt.year) * 12 + accepted['last_pymnt_date'].dt.month - accepted['issue_date'].dt.month


In [6]:
accepted[['issue_date','loan_status','paid_months','last_pymnt_date']].sample(10)

Unnamed: 0,issue_date,loan_status,paid_months,last_pymnt_date
254426,2015-06-01,Fully Paid,17.0,2016-11-01
1576464,2018-04-01,Current,11.0,2019-03-01
2076627,2017-11-01,Current,16.0,2019-03-01
1567305,2018-05-01,Current,10.0,2019-03-01
1359783,2018-12-01,Current,3.0,2019-03-01
2007044,2016-07-01,Charged Off,4.0,2016-11-01
1790963,2013-10-01,Fully Paid,20.0,2015-06-01
1584321,2018-04-01,Current,11.0,2019-03-01
1214248,2014-08-01,Fully Paid,36.0,2017-08-01
321276,2015-04-01,Fully Paid,36.0,2018-04-01
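
The `paid_months` formula counts calendar months between the two dates and ignores the day of the month. A minimal standard-library sketch of the same arithmetic, checked against three rows of the sample above (`months_between` is a hypothetical helper name):

```python
from datetime import date

def months_between(issue, last_payment):
    """Whole calendar months from issue date to last payment date."""
    return (last_payment.year - issue.year) * 12 + (last_payment.month - issue.month)

# (issue_date, last_pymnt_date, paid_months) taken from the sample output
rows = [
    (date(2015, 6, 1), date(2016, 11, 1), 17),
    (date(2016, 7, 1), date(2016, 11, 1), 4),
    (date(2014, 8, 1), date(2017, 8, 1), 36),
]

computed = [months_between(issue, last) for issue, last, _ in rows]
print(computed)
```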


In [7]:
accepted.isna().sum()

id                             0
member_id                2260701
loan_amnt                     33
funded_amnt                   33
funded_amnt_inv               33
                          ...   
settlement_percentage    2226455
settlement_term          2226455
issue_date                    33
last_pymnt_date             2460
paid_months                 2460
Length: 154, dtype: int64

In [8]:
accepted['loan_status'].value_counts(dropna=False)

Fully Paid                                             1076751
Current                                                 878317
Charged Off                                             268559
Late (31-120 days)                                       21467
In Grace Period                                           8436
Late (16-30 days)                                         4349
Does not meet the credit policy. Status:Fully Paid        1988
Does not meet the credit policy. Status:Charged Off        761
Default                                                     40
NaN                                                         33
Name: loan_status, dtype: int64

In [9]:
loans = accepted.loc[accepted['loan_status'].isin(['Fully Paid', 'Charged Off', 'Default'])]


In [10]:
loans.shape

(1345350, 154)

In [11]:
loans['loan_status'].value_counts(normalize=True,dropna=False)

Fully Paid     0.80035
Charged Off    0.19962
Default        0.00003
Name: loan_status, dtype: float64

In [12]:
missing_fractions = loans.isnull().mean().sort_values(ascending=False)


In [13]:
missing_fractions.head()

member_id                                     1.000000
next_pymnt_d                                  0.999970
orig_projected_additional_accrued_interest    0.997204
hardship_type                                 0.995722
hardship_reason                               0.995722
dtype: float64

In [14]:
keep_list = ['addr_state', 'annual_inc', 'application_type', 'dti', 'earliest_cr_line', 'emp_length', 'emp_title',
             'fico_range_high', 'fico_range_low', 'grade', 'home_ownership', 'id', 'initial_list_status', 
             'installment', 'int_rate', 'issue_d', 'loan_amnt', 'loan_status', 'mort_acc', 'open_acc', 'pub_rec', 
             'pub_rec_bankruptcies', 'purpose', 'revol_bal', 'revol_util', 'sub_grade', 'term', 'title', 'total_acc',
             'verification_status', 'zip_code','paid_months']

In [15]:
loans = loans[keep_list]

In [16]:
loans.shape

(1345350, 32)

In [17]:
loans.isna().sum()

addr_state                  0
annual_inc                  0
application_type            0
dti                       374
earliest_cr_line            0
emp_length              78516
emp_title               85791
fico_range_high             0
fico_range_low              0
grade                       0
home_ownership              0
id                          0
initial_list_status         0
installment                 0
int_rate                    0
issue_d                     0
loan_amnt                   0
loan_status                 0
mort_acc                47281
open_acc                    0
pub_rec                     0
pub_rec_bankruptcies      697
purpose                     0
revol_bal                   0
revol_util                857
sub_grade                   0
term                        0
title                   16659
total_acc                   0
verification_status         0
zip_code                    1
paid_months              2313
dtype: int64

In [18]:
# Drop rows with missing paid_months or zip_code
loans = loans.dropna(axis=0, subset=['paid_months', 'zip_code'])


In [19]:
loans.drop('id', axis=1, inplace=True)

In [20]:
loans.drop('grade', axis=1, inplace=True)

In [21]:
loans.shape

(1343036, 30)

In [22]:
loans['emp_title'].describe()

count     1257498
unique     377868
top       Teacher
freq        21236
Name: emp_title, dtype: object

In [23]:
loans.drop('emp_title', axis=1, inplace=True)

In [24]:
loans['emp_length'].replace(to_replace='10+ years', value='10 years', inplace=True)


In [25]:
loans['emp_length'].replace('< 1 year', '0 years', inplace=True)


In [26]:
def emp_length_to_int(s):
    if pd.isnull(s):
        return s
    else:
        return np.int8(s.split()[0])

In [27]:
loans['emp_length'] = loans['emp_length'].apply(emp_length_to_int)


In [28]:
loans['emp_length'].value_counts(dropna=False).sort_index()


0.0     107868
1.0      88345
2.0     121536
3.0     107403
4.0      80387
5.0      83980
6.0      62615
7.0      59523
8.0      60594
9.0      50857
10.0    441647
NaN      78281
Name: emp_length, dtype: int64
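
The two `replace` calls and `emp_length_to_int` can also be folded into a single parser. The following is a sketch using only the standard library, with `parse_emp_length` as a hypothetical name. It assumes the only string shapes are those seen in this column ('< 1 year', 'N years', '10+ years') and passes missing values through as `None`.

```python
import re

def parse_emp_length(s):
    """Map an emp_length string to an integer number of years (0 to 10)."""
    if s is None:
        return None
    if s.strip().startswith('<'):
        return 0                      # '< 1 year' is treated as 0 years
    match = re.search(r'\d+', s)      # '10+ years' -> 10, '3 years' -> 3
    return int(match.group()) if match else None

examples = ['< 1 year', '1 year', '3 years', '10+ years', None]
parsed = [parse_emp_length(s) for s in examples]
print(parsed)
```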

In [29]:
loans.drop('zip_code', axis=1, inplace=True)

In [30]:
loans.groupby('addr_state')['loan_status'].value_counts(normalize=True).loc[:,'Charged Off'].sort_values()

addr_state
DC    0.130836
ME    0.137148
VT    0.138543
IA    0.142857
OR    0.143031
NH    0.144831
WV    0.154320
CO    0.154557
WA    0.156230
SC    0.161133
KS    0.166385
WY    0.166838
MT    0.167627
UT    0.169676
CT    0.172622
RI    0.177524
IL    0.179870
WI    0.182357
GA    0.182664
ID    0.187796
MA    0.189220
CA    0.194341
AZ    0.195106
AK    0.195543
DE    0.195548
MN    0.195987
TX    0.196830
VA    0.198030
HI    0.200919
MI    0.201825
ND    0.203252
OH    0.204275
NC    0.206237
PA    0.206658
KY    0.208597
NJ    0.210017
MD    0.211462
MO    0.211702
SD    0.211803
NM    0.212381
TN    0.212805
FL    0.213047
IN    0.213249
NV    0.217215
NY    0.219013
LA    0.230580
OK    0.232942
AL    0.235046
AR    0.239581
NE    0.249930
MS    0.259428
Name: loan_status, dtype: float64
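
The chained `groupby(...).value_counts(normalize=True)` call computes, for each state, the share of loans in each status, and `.loc[:, 'Charged Off']` then selects the charged-off share. A pure-Python sketch of the same computation on hypothetical toy records (`charged_off_rate_by_state` is an illustrative name):

```python
from collections import Counter, defaultdict

def charged_off_rate_by_state(records):
    """records: iterable of (state, loan_status). Returns {state: rate}, sorted ascending."""
    counts = defaultdict(Counter)
    for state, status in records:
        counts[state][status] += 1
    # A state with no charged-off loans gets 0.0 here
    rates = {s: c['Charged Off'] / sum(c.values()) for s, c in counts.items()}
    return dict(sorted(rates.items(), key=lambda kv: kv[1]))

# Hypothetical toy records, not taken from the dataset
records = [
    ('MS', 'Fully Paid'), ('MS', 'Charged Off'),
    ('DC', 'Fully Paid'), ('DC', 'Fully Paid'), ('DC', 'Fully Paid'), ('DC', 'Charged Off'),
]

rates = charged_off_rate_by_state(records)
print(rates)
```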

In [31]:
loans.columns

Index(['addr_state', 'annual_inc', 'application_type', 'dti',
       'earliest_cr_line', 'emp_length', 'fico_range_high', 'fico_range_low',
       'home_ownership', 'initial_list_status', 'installment', 'int_rate',
       'issue_d', 'loan_amnt', 'loan_status', 'mort_acc', 'open_acc',
       'pub_rec', 'pub_rec_bankruptcies', 'purpose', 'revol_bal', 'revol_util',
       'sub_grade', 'term', 'title', 'total_acc', 'verification_status',
       'paid_months'],
      dtype='object')

In [32]:
loans.drop('revol_bal',axis=1,inplace=True)

In [33]:
loans['fico_score'] = 0.5*loans['fico_range_low'] + 0.5*loans['fico_range_high']


In [34]:
loans.drop(['fico_range_high', 'fico_range_low'], axis=1, inplace=True)


In [35]:
loans.drop('title',axis=1,inplace=True)

In [36]:
loans.shape

(1343036, 25)

In [37]:
# loans = pd.get_dummies(loans, columns=['sub_grade', 'home_ownership', 'verification_status', 'purpose', 'addr_state', 'initial_list_status', 'application_type'], drop_first=True)


In [38]:
loans.shape

(1343036, 25)

In [39]:
loans['term'] = loans['term'].apply(lambda s: np.int8(s.split()[0]))


In [40]:
loans.isna().sum()

addr_state                  0
annual_inc                  0
application_type            0
dti                       370
earliest_cr_line            0
emp_length              78281
home_ownership              0
initial_list_status         0
installment                 0
int_rate                    0
issue_d                     0
loan_amnt                   0
loan_status                 0
mort_acc                47194
open_acc                    0
pub_rec                     0
pub_rec_bankruptcies      697
purpose                     0
revol_util                847
sub_grade                   0
term                        0
total_acc                   0
verification_status         0
paid_months                 0
fico_score                  0
dtype: int64

In [41]:
loans['earliest_cr_line'] = loans['earliest_cr_line'].apply(lambda s: int(s[-4:]))


In [42]:
loans['issue_d'] = pd.to_datetime(loans['issue_d'])


In [43]:
loans.drop('issue_d', axis=1, inplace =True)

In [44]:
loans.isna().sum()

addr_state                  0
annual_inc                  0
application_type            0
dti                       370
earliest_cr_line            0
emp_length              78281
home_ownership              0
initial_list_status         0
installment                 0
int_rate                    0
loan_amnt                   0
loan_status                 0
mort_acc                47194
open_acc                    0
pub_rec                     0
pub_rec_bankruptcies      697
purpose                     0
revol_util                847
sub_grade                   0
term                        0
total_acc                   0
verification_status         0
paid_months                 0
fico_score                  0
dtype: int64

In [45]:
def acc_paid(row):
    # Encode the target: 1 for 'Fully Paid', 0 for 'Charged Off' or 'Default'
    return 1 if row['loan_status'] == 'Fully Paid' else 0

In [46]:
loans['loan_status'] = loans.apply(acc_paid, axis=1)



## 3. Creating Dummy Variables

In [59]:
loans = pd.get_dummies(loans, columns=['sub_grade', 'home_ownership', 'verification_status',
                                       'purpose', 'addr_state', 'initial_list_status', 'application_type'],
                       drop_first=True)

In [60]:
loans.shape

(1294684, 123)

In [61]:
loans.isna().sum()

annual_inc                    0
dti                           0
earliest_cr_line              0
emp_length                    0
installment                   0
                             ..
addr_state_WI                 0
addr_state_WV                 0
addr_state_WY                 0
initial_list_status_w         0
application_type_Joint App    0
Length: 123, dtype: int64
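
To make the effect of `drop_first=True` concrete: a categorical column with k levels becomes k-1 indicator columns, and the first level in sorted order is the implicit baseline. The sketch below mimics that behaviour in pure Python for a single column; `one_hot_drop_first` is a hypothetical helper, not the pandas implementation.

```python
def one_hot_drop_first(values, prefix):
    """Return (column_names, rows) with the first sorted category dropped."""
    categories = sorted(set(values))
    kept = categories[1:]                      # the first level becomes the baseline
    columns = [f"{prefix}_{c}" for c in kept]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return columns, rows

values = ['Individual', 'Joint App', 'Individual']
cols, rows = one_hot_drop_first(values, 'application_type')
print(cols)
print(rows)
```

The resulting column name matches `application_type_Joint App` in the output above; a row of all zeros stands for the dropped level, `Individual`.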

## 4. Train/Test Split

In [62]:
# Imports for the train/test split, the random forest model, and evaluation metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics


In [63]:
loans.emp_length.fillna(5, inplace=True)

In [64]:
loans = loans.dropna()

In [65]:
X = loans.drop('loan_status', axis=1)
Y = loans['loan_status']

In [66]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

In [55]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=2)
# fit_resample replaces fit_sample, which was removed from recent imbalanced-learn releases
X_train_res, Y_train_res = sm.fit_resample(X_train, Y_train)
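
SMOTE balances the classes by creating synthetic minority samples: it picks a minority point, picks one of its nearest minority neighbours, and interpolates between the two. The sketch below illustrates only that interpolation step in pure Python on hypothetical 2-D points; `smote_like` is an illustrative name, not the imbalanced-learn implementation.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself), by squared distance
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        u = rng.random()                      # position along the segment x -> nb
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Hypothetical minority-class points
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Resampled arrays only affect a model that is fit on `X_train_res` and `Y_train_res`; the classifiers in this notebook are fit on `X_train` and `Y_train`.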

## Random Forest Classifier

In [67]:
clf = RandomForestClassifier(max_depth=2, random_state=0, class_weight='balanced')
clf.fit(X_train, Y_train)


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
                       criterion='gini', max_depth=2, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)
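
`class_weight='balanced'` weights each class inversely to its frequency, using `n_samples / (n_classes * count_per_class)`. The training-set class counts are not shown in this notebook, so the sketch below uses the test-set counts that appear in the confusion matrices (78157 charged-off or defaulted loans, 310249 fully paid) purely as an illustration; `balanced_weights` is a hypothetical helper.

```python
def balanced_weights(class_counts):
    """class_counts: {label: count}. Returns {label: weight} using the 'balanced' heuristic."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}

# Illustration only: test-set class counts, not the training-set counts
counts = {0: 78157, 1: 310249}
weights = balanced_weights(counts)
print({label: round(w, 3) for label, w in weights.items()})
```

With these weights both classes contribute the same total weight, so the minority class is no longer outvoted simply because it is rarer.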

In [68]:
y_pred=clf.predict(X_test)

In [69]:
clf.score(X_train, Y_train)

0.6540432405950491

In [70]:
print("Accuracy:",metrics.accuracy_score(Y_test, y_pred))

Accuracy: 0.654173725431636


In [71]:
Y_pred = clf.predict(X_test)
print('Accuracy of Random Forest classifier on test set: {:.3f}'.format(clf.score(X_test, Y_test)))

Accuracy of Random Forest classifier on test set: 0.654


In [72]:
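from sklearn.metrics import confusion_matrix
# Use a separate name so the imported function is not overwritten
cm = confusion_matrix(Y_test, Y_pred)
print(cm)
# Rows are true classes, columns are predicted classes.
# Precision for class 1 (Fully Paid): 197987 / (197987 + 22059) ≈ 0.90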

[[ 56098  22059]
 [112262 197987]]
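
The headline metrics can be recomputed directly from this confusion matrix. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and treats class 1, Fully Paid, as the positive class.

```python
# Confusion matrix printed above for the random forest
cm = [[56098, 22059],
      [112262, 197987]]

(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)          # of loans predicted Fully Paid, the share that were
recall = tp / (tp + fn)             # of loans actually Fully Paid, the share that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The accuracy agrees with the 0.654 reported by `clf.score`, which confirms the row and column reading of the matrix.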


In [60]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, y_pred)
print(cm)
# Precision, TP / (TP + FP), is the metric of interest here



[[ 47144  31013]
 [ 86873 223376]]


In [59]:
from sklearn.pipeline import Pipeline
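# Imputer was removed from sklearn.preprocessing; SimpleImputer in sklearn.impute replaces it
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler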
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV


In [60]:
from sklearn.linear_model import SGDClassifier


In [64]:
loans.shape

(1294684, 123)

In [393]:
loans.shape

(1218136, 123)

In [443]:
loans1 = loans.drop('paid_months',axis=1)

In [444]:
loans1.shape

(1218136, 122)

## Logistic Regression

In [88]:
X.shape

(1294684, 122)

In [90]:
Y

0          1
1          1
2          1
4          1
5          1
          ..
2260688    1
2260690    1
2260691    0
2260692    1
2260697    0
Name: loan_status, Length: 1294684, dtype: int64

In [70]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [71]:
LogisticRegression().get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [100]:
X3_train, X3_test, Y3_train, Y3_test = train_test_split(X, Y, test_size=0.3, random_state=42)
logreg = LogisticRegression(C=1,class_weight='balanced')
logreg.fit(X3_train, Y3_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1, class_weight='balanced', dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
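
The convergence warning above suggests scaling the data. Standardization, which is what `StandardScaler` does, subtracts each column's mean and divides by its population standard deviation. A minimal standard-library sketch on five `annual_inc` values taken from this notebook's test-set printout (`standardize` is a hypothetical helper):

```python
from statistics import mean, pstdev

def standardize(column):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    return [(x - mu) / sigma for x in column]

annual_inc = [70000.0, 42000.0, 125000.0, 60000.0, 50000.0]
z = standardize(annual_inc)
print([round(v, 3) for v in z])
```

With scikit-learn, the usual pattern is to fit the scaler on the training split only and reuse it on the test split, for example by chaining `StandardScaler` and `LogisticRegression` in a `Pipeline`.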

In [101]:
Y_pred = logreg.predict(X3_test)
print('Accuracy of logistic regression classifier on test set: {:.3f}'.format(logreg.score(X3_test, Y3_test)))

Accuracy of logistic regression classifier on test set: 0.662


In [102]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y3_test, Y_pred)
print(cm)

[[ 53317  24840]
 [106538 203711]]


## K-Nearest Neighbors

In [71]:
from sklearn.neighbors import KNeighborsClassifier


In [85]:
kne = KNeighborsClassifier()

In [86]:
kne.fit(X_train, Y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [88]:
kne.score(X_train,Y_train)

0.8319367787809039

In [89]:
Y1_pred = kne.predict(X_test)
print('Accuracy of KNN classifier on test set: {:.3f}'.format(kne.score(X_test, Y_test)))

Accuracy of KNN classifier on test set: 0.773


In [92]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y1_pred)
print(cm)

[[ 12700  65457]
 [ 22618 287631]]


## Neural Network using Keras

In [93]:
from numpy import loadtxt
from keras.models import Sequential
from keras.layers import Dense

Using TensorFlow backend.


In [135]:
# define the keras model
model = Sequential()
model.add(Dense(10, input_dim=122, activation='sigmoid'))
model.add(Dense(10, activation='sigmoid'))
model.add(Dense(5, activation='sigmoid'))
model.add(Dense(3, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))

In [136]:
...
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [137]:
# fit the keras model on the dataset
model.fit(X_train, Y_train, epochs=10, batch_size=200)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x1a8c7c1610>

In [138]:
# evaluate the keras model
_, accuracy = model.evaluate(X_train, Y_train)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 80.02


In [139]:
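# predict_classes was removed from recent Keras releases; threshold the sigmoid output at 0.5 instead
Y2_pred = (model.predict(X_test) > 0.5).astype('int32').ravel()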

In [114]:
print("X=%s, Predicted=%s" % (X_test, Y2_pred))

X=         annual_inc    dti  earliest_cr_line  emp_length  installment  \
1823815     70000.0  15.86              1992         5.0       337.19   
1835593     42000.0  21.86              2004         3.0       492.43   
373510     125000.0  27.16              1989        10.0       785.49   
1719040     60000.0  19.33              2008         5.0       361.48   
1170722     50000.0  20.60              1998         5.0       206.76   
...             ...    ...               ...         ...          ...   
1681213     83283.0  19.91              2002        10.0      1187.57   
541602      72000.0  22.20              1994         5.0       597.95   
562359     120000.0  14.24              1998         7.0       361.38   
1449984     85000.0  17.10              2005        10.0       495.58   
1095331     90000.0   9.88              2008         2.0       672.84   

         int_rate  loan_amnt  mort_acc  open_acc  pub_rec  ...  addr_state_TX  \
1823815     13.05    10000.0       3.0  

In [148]:
from sklearn.metrics import confusion_matrix
# KNN confusion matrix, shown again for comparison with the neural network
cm = confusion_matrix(Y_test, Y1_pred)
print(cm)

[[ 12700  65457]
 [ 22618 287631]]


In [149]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y2_pred)
print(cm)

[[     0  78157]
 [     0 310249]]
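
Accuracy alone hides how differently these models treat the minority class. The sketch below puts the four confusion matrices printed in this notebook side by side and compares each model with the trivial baseline that always predicts class 1 (Fully Paid). It again assumes scikit-learn's layout: rows are true classes, columns are predicted classes.

```python
# Confusion matrices copied from the outputs in this notebook
matrices = {
    'random forest':        [[56098, 22059], [112262, 197987]],
    'logistic regression':  [[53317, 24840], [106538, 203711]],
    'k-nearest neighbours': [[12700, 65457], [22618, 287631]],
    'neural network':       [[0, 78157], [0, 310249]],
}

def summarize(cm):
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        'accuracy': (tn + tp) / total,
        'recall_class_0': tn / (tn + fp),   # share of charged-off/default loans caught
        'recall_class_1': tp / (tp + fn),   # share of fully paid loans caught
    }

summary = {name: summarize(cm) for name, cm in matrices.items()}
baseline = 310249 / (78157 + 310249)        # always predict Fully Paid

for name, s in summary.items():
    print(f"{name:22s} acc={s['accuracy']:.3f} "
          f"recall0={s['recall_class_0']:.3f} recall1={s['recall_class_1']:.3f}")
print(f"{'majority baseline':22s} acc={baseline:.3f}")
```

The neural network's accuracy equals the majority baseline because it predicts class 1 for every test loan, so its recall on charged-off loans is zero.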
