# <center>Standard Bank Tech Impact Challenge: Xente Credit Scoring Challenge</center>

### <center>My 7th-Place Code Solution</center>

#### Summary Of The Challenge:

This challenge was hosted on <a href="https://zindi.africa/competitions/sbtic-xente-credit-scoring-challenge">Zindi</a>. The objective was to build a machine learning model that predicts which individuals are most likely to default on their loans, based on their loan repayment behaviour and e-commerce transaction activity.

The resulting models and solutions will help Xente refine its credit decision processes and assess the creditworthiness of new and existing clients more adequately. For Xente, this may result in improved profitability and financial sustainability, while for Xente’s clients, increased creditworthiness would enhance their access to credit and contribute to an improved livelihood.

This challenge was hosted by Xente, in association with Standard Bank.

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

In [2]:
train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')
masked = pd.read_csv('unlinked_masked_final.csv')
sub = pd.read_csv('sample_submission.csv')

In [3]:
train.dropna(axis = 0, inplace = True)
test.dropna(axis = 0, inplace = True)

In [4]:
train.head(2)

Unnamed: 0,CustomerId,TransactionStartTime,Value,Amount,TransactionId,BatchId,SubscriptionId,CurrencyCode,CountryCode,ProviderId,ProductId,ProductCategory,ChannelId,TransactionStatus,IssuedDateLoan,AmountLoan,Currency,LoanId,PaidOnDate,IsFinalPayBack,InvestorId,DueDate,LoanApplicationId,PayBackId,ThirdPartyId,IsThirdPartyConfirmed,IsDefaulted
15,CustomerId_233,2018-10-22 16:04:25,5000.0,-5000.0,TransactionId_2632,BatchId_775,SubscriptionId_4,UGX,256,ProviderId_1,ProductId_7,airtime,ChannelId_1,1,2018-10-22 16:04:22,5375.0,UGX,LoanId_317,2018-10-30 06:49:57,1.0,InvestorId_3,2018-11-21 16:03:32,LoanApplicationId_1629,PayBackId_1719,ThirdPartyId_1010,0.0,0.0
17,CustomerId_305,2018-10-23 13:12:23,500.0,-500.0,TransactionId_1297,BatchId_2016,SubscriptionId_1,UGX,256,ProviderId_1,ProductId_7,airtime,ChannelId_1,1,2018-10-23 13:12:21,543.0,UGX,LoanId_1619,2018-10-23 13:18:42,1.0,InvestorId_2,2018-11-22 13:12:16,LoanApplicationId_136,PayBackId_725,ThirdPartyId_1566,0.0,0.0


In [5]:
print(train.columns)
print(test.columns)

Index(['CustomerId', 'TransactionStartTime', 'Value', 'Amount',
       'TransactionId', 'BatchId', 'SubscriptionId', 'CurrencyCode',
       'CountryCode', 'ProviderId', 'ProductId', 'ProductCategory',
       'ChannelId', 'TransactionStatus', 'IssuedDateLoan', 'AmountLoan',
       'Currency', 'LoanId', 'PaidOnDate', 'IsFinalPayBack', 'InvestorId',
       'DueDate', 'LoanApplicationId', 'PayBackId', 'ThirdPartyId',
       'IsThirdPartyConfirmed', 'IsDefaulted'],
      dtype='object')
Index(['CustomerId', 'TransactionStartTime', 'Value', 'Amount',
       'TransactionId', 'BatchId', 'SubscriptionId', 'CurrencyCode',
       'CountryCode', 'ProviderId', 'ProductId', 'ProductCategory',
       'ChannelId', 'TransactionStatus', 'IssuedDateLoan', 'LoanId',
       'InvestorId', 'LoanApplicationId', 'ThirdPartyId'],
      dtype='object')


In [6]:
# New features by aggregation: for each ID column, the number of rows
# (transactions) in the same dataset that share that ID
train['customer_subscriptions'] = train['SubscriptionId'].map(train.groupby('SubscriptionId')['CustomerId'].count())
test['customer_subscriptions'] = test['SubscriptionId'].map(test.groupby('SubscriptionId')['CustomerId'].count())

train['customer_providers'] = train['ProviderId'].map(train.groupby('ProviderId')['CustomerId'].count())
test['customer_providers'] = test['ProviderId'].map(test.groupby('ProviderId')['CustomerId'].count())

train['customer_products'] = train['ProductId'].map(train.groupby('ProductId')['CustomerId'].count())
test['customer_products'] = test['ProductId'].map(test.groupby('ProductId')['CustomerId'].count())

train['customer_loanids'] = train['LoanId'].map(train.groupby('LoanId')['CustomerId'].count())
test['customer_loanids'] = test['LoanId'].map(test.groupby('LoanId')['CustomerId'].count())
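
Each feature above maps an ID to the number of rows that share it, which amounts to a frequency encoding. Note that `count()` here counts transactions, not distinct customers. The following is a minimal sketch on a made-up toy frame (`toy` and its values are hypothetical), which also shows an equivalent one-step form using `transform`:

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the groupby + map pattern.
toy = pd.DataFrame({
    "CustomerId": ["c1", "c2", "c3", "c4", "c5"],
    "ProviderId": ["p1", "p1", "p2", "p1", "p2"],
})

# Number of rows per provider, then broadcast back onto every row.
counts = toy.groupby("ProviderId")["CustomerId"].count()
toy["customer_providers"] = toy["ProviderId"].map(counts)

# Same result in one step, without building the intermediate Series.
via_transform = toy.groupby("ProviderId")["CustomerId"].transform("count")

print(toy)
```

Because the counts are computed separately on `train` and `test`, the same ID can receive different values in the two datasets.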

In [9]:
train['TransactionStartTime'] = pd.to_datetime(train['TransactionStartTime'])
test['TransactionStartTime'] = pd.to_datetime(test['TransactionStartTime'])

In [10]:
train['year'] =  train['TransactionStartTime'].dt.year
test['year'] =  test['TransactionStartTime'].dt.year

train['month'] =  train['TransactionStartTime'].dt.month
test['month'] =  test['TransactionStartTime'].dt.month

train['day'] =  train['TransactionStartTime'].dt.day
test['day'] =  test['TransactionStartTime'].dt.day
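
As a small illustration of the `.dt` accessor used above, the sketch below extracts the same parts from a single made-up row. The `hour` and `dayofweek` columns are assumptions that the original notebook does not use; they are shown only as further parts that could be tried.

```python
import pandas as pd

# Hypothetical single-row frame; the format mirrors TransactionStartTime.
toy = pd.DataFrame({"TransactionStartTime": ["2018-10-22 16:04:25"]})
toy["TransactionStartTime"] = pd.to_datetime(toy["TransactionStartTime"])

ts = toy["TransactionStartTime"].dt
toy["year"] = ts.year
toy["month"] = ts.month
toy["day"] = ts.day

# Extra parts, not part of the original feature set (assumption).
toy["hour"] = ts.hour
toy["dayofweek"] = ts.dayofweek  # Monday == 0

print(toy.iloc[0])
```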

In [11]:
train.head(2)

Unnamed: 0,CustomerId,TransactionStartTime,Value,Amount,TransactionId,BatchId,SubscriptionId,CurrencyCode,CountryCode,ProviderId,ProductId,ProductCategory,ChannelId,TransactionStatus,IssuedDateLoan,AmountLoan,Currency,LoanId,PaidOnDate,IsFinalPayBack,InvestorId,DueDate,LoanApplicationId,PayBackId,ThirdPartyId,IsThirdPartyConfirmed,IsDefaulted,customer_subscriptions,customer_providers,customer_products,customer_loanids,year,month,day
15,CustomerId_233,2018-10-22 16:04:25,5000.0,-5000.0,TransactionId_2632,BatchId_775,SubscriptionId_4,UGX,256,ProviderId_1,ProductId_7,airtime,ChannelId_1,1,2018-10-22 16:04:22,5375.0,UGX,LoanId_317,2018-10-30 06:49:57,1.0,InvestorId_3,2018-11-21 16:03:32,LoanApplicationId_1629,PayBackId_1719,ThirdPartyId_1010,0.0,0.0,1,1479,386,1,2018,10,22
17,CustomerId_305,2018-10-23 13:12:23,500.0,-500.0,TransactionId_1297,BatchId_2016,SubscriptionId_1,UGX,256,ProviderId_1,ProductId_7,airtime,ChannelId_1,1,2018-10-23 13:12:21,543.0,UGX,LoanId_1619,2018-10-23 13:18:42,1.0,InvestorId_2,2018-11-22 13:12:16,LoanApplicationId_136,PayBackId_725,ThirdPartyId_1566,0.0,0.0,143,1479,386,1,2018,10,23


In [12]:
train.drop(['CurrencyCode', 'CountryCode', 'TransactionStatus', 'IssuedDateLoan', 'PayBackId','IsThirdPartyConfirmed',
           'AmountLoan', 'Currency', 'LoanId','PaidOnDate', 'IsFinalPayBack','DueDate','ProviderId', 'ChannelId'], axis = 1, inplace = True)
test.drop(['CurrencyCode', 'CountryCode', 'TransactionStatus', 'IssuedDateLoan', 'LoanId','ProviderId', 'ChannelId'], axis = 1, inplace = True)

In [13]:
# Number of transactions per customer
m1 = train.groupby('CustomerId')['TransactionId'].count()
m2 = test.groupby('CustomerId')['TransactionId'].count()
train['no_trans'] = train['CustomerId'].map(m1)
test['no_trans'] = test['CustomerId'].map(m2)

#train['no_trans1'] = train['CustomerId'].map(train['CustomerId'].value_counts().to_dict())

In [14]:
train = pd.get_dummies(train, columns = ['ProductCategory', 'ProductId', 'InvestorId'])
test = pd.get_dummies(test, columns = ['ProductCategory', 'ProductId', 'InvestorId'])

In [15]:
train.drop(['CustomerId', 'TransactionStartTime','BatchId', 'SubscriptionId', 
            'LoanApplicationId', 'ThirdPartyId','ProductCategory_movies'], axis = 1, inplace = True)
test.drop(['CustomerId', 'TransactionStartTime','BatchId', 'SubscriptionId', 
            'LoanApplicationId', 'ThirdPartyId',], axis = 1, inplace = True)

In [16]:
train.head(2)

Unnamed: 0,Value,Amount,TransactionId,IsDefaulted,customer_subscriptions,customer_providers,customer_products,customer_loanids,year,month,day,no_trans,ProductCategory_airtime,ProductCategory_data_bundles,ProductCategory_financial_services,ProductCategory_retail,ProductCategory_tv,ProductCategory_utility_bill,ProductId_ProductId_1,ProductId_ProductId_10,ProductId_ProductId_13,ProductId_ProductId_15,ProductId_ProductId_16,ProductId_ProductId_17,ProductId_ProductId_18,ProductId_ProductId_2,ProductId_ProductId_3,ProductId_ProductId_4,ProductId_ProductId_5,ProductId_ProductId_6,ProductId_ProductId_7,ProductId_ProductId_8,ProductId_ProductId_9,InvestorId_InvestorId_1,InvestorId_InvestorId_2,InvestorId_InvestorId_3
15,5000.0,-5000.0,TransactionId_2632,0.0,1,1479,386,1,2018,10,22,40,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
17,500.0,-500.0,TransactionId_1297,0.0,143,1479,386,1,2018,10,23,5,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0


### Interactions

In [17]:
train['prov_prod'] = train['customer_providers'] + train['customer_products']

test['prov_prod'] = test['customer_providers'] + test['customer_products']

train['3int'] = train['customer_providers'] + train['customer_products'] + train['customer_subscriptions']
test['3int'] = test['customer_providers'] + test['customer_products'] + test['customer_subscriptions']

In [16]:
train.head(2)

Unnamed: 0,Value,Amount,TransactionId,IsDefaulted,customer_subscriptions,customer_providers,customer_products,customer_loanids,year,month,day,no_trans,ProductCategory_airtime,ProductCategory_data_bundles,ProductCategory_financial_services,ProductCategory_retail,ProductCategory_tv,ProductCategory_utility_bill,ProductId_ProductId_1,ProductId_ProductId_10,ProductId_ProductId_13,ProductId_ProductId_15,ProductId_ProductId_16,ProductId_ProductId_17,ProductId_ProductId_18,ProductId_ProductId_2,ProductId_ProductId_3,ProductId_ProductId_4,ProductId_ProductId_5,ProductId_ProductId_6,ProductId_ProductId_7,ProductId_ProductId_8,ProductId_ProductId_9,InvestorId_InvestorId_1,InvestorId_InvestorId_2,InvestorId_InvestorId_3,prov_prod,3int
15,5000.0,-5000.0,TransactionId_2632,0.0,1,1479,386,1,2018,10,22,40,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1865,1866
17,500.0,-500.0,TransactionId_1297,0.0,143,1479,386,1,2018,10,23,5,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1865,2008


In [17]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix, recall_score, accuracy_score, roc_auc_score, auc, roc_curve, classification_report
from sklearn.preprocessing import StandardScaler, RobustScaler, Normalizer, MinMaxScaler, PolynomialFeatures
from sklearn.pipeline import make_pipeline

sd = StandardScaler()
nm = Normalizer()
minmax = MinMaxScaler()
rb = RobustScaler()
poly = PolynomialFeatures()

In [18]:
# Keep the transaction IDs aside before dropping them from the feature matrix
train_trans = train['TransactionId']
test_trans = test['TransactionId']
X = train.drop(['TransactionId', 'IsDefaulted','ProductId_ProductId_16','ProductId_ProductId_17',
                'ProductId_ProductId_2','InvestorId_InvestorId_3',], axis = 1)
y = train.IsDefaulted

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size= 0.20, random_state = 9)
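
If defaults are rare, as the `scale_pos_weight` setting used for XGBoost in this notebook suggests, a purely random split can leave the hold-out set with very few positives. The sketch below shows the `stratify` option of `train_test_split` on made-up data; this is a possible variation, not what the original cell does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 90 negatives and 10 positives.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

# stratify keeps the class ratio (10% positives) in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=9, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```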

In [19]:
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

In [20]:
xgb = XGBClassifier(base_score = 0.5, n_estimators =200,colsample_bylevel=0.8,gamma=0.04,max_depth=10,
                    learning_rate = 0.04,scale_pos_weight = 5, min_child_weight = 6,seed = 12)
cb = CatBoostClassifier()

In [40]:
cb_model = CatBoostClassifier(n_estimators=800, eval_metric='AUC', max_depth=5, learning_rate=0.1,
                              od_wait=50, subsample=0.9, bootstrap_type='Bernoulli', metric_period=20,
                              # l2_leaf_reg=5, bagging_temperature=0.85, random_strength=100,
                              use_best_model=True)

In [41]:
cb_model.fit(x_train,y_train, eval_set=(x_test, y_test))



0:	test: 0.9104238	best: 0.9104238 (0)	total: 25.1ms	remaining: 20s
20:	test: 0.9764225	best: 0.9764225 (20)	total: 591ms	remaining: 21.9s
40:	test: 0.9791801	best: 0.9802831 (36)	total: 1.13s	remaining: 20.9s
60:	test: 0.9816619	best: 0.9817538 (55)	total: 1.7s	remaining: 20.6s
80:	test: 0.9829488	best: 0.9833165 (75)	total: 2.23s	remaining: 19.8s
100:	test: 0.9832246	best: 0.9836842 (87)	total: 2.74s	remaining: 19s
120:	test: 0.9846953	best: 0.9848791 (113)	total: 3.25s	remaining: 18.2s
140:	test: 0.9854306	best: 0.9860741 (131)	total: 3.79s	remaining: 17.7s
160:	test: 0.9858902	best: 0.9860741 (131)	total: 4.32s	remaining: 17.1s
180:	test: 0.9866256	best: 0.9868094 (169)	total: 4.85s	remaining: 16.6s
200:	test: 0.9869014	best: 0.9869014 (200)	total: 5.38s	remaining: 16s
220:	test: 0.9873610	best: 0.9874529 (215)	total: 5.9s	remaining: 15.5s
240:	test: 0.9876367	best: 0.9876367 (240)	total: 6.45s	remaining: 15s
260:	test: 0.9870852	best: 0.9876367 (240)	total: 6.98s	remaining: 14.4s


<catboost.core.CatBoostClassifier at 0xc0a3ac8>

In [23]:
pipe2 = make_pipeline(xgb)

pipe2.fit(x_train, y_train)

Pipeline(memory=None,
     steps=[('xgbclassifier', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=0.8,
       colsample_bytree=1, gamma=0.04, learning_rate=0.04,
       max_delta_step=0, max_depth=10, min_child_weight=6, missing=None,
       n_estimators=200, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=5, seed=12, silent=True, subsample=1))])

In [23]:
pred = (cb_model.predict_proba(x_test)[:,1]).astype('float')

In [24]:
roc_auc_score(y_test, pred)

0.9858902472653736

In [26]:
fpr, tpr, thresholds = roc_curve(y_test, pred,pos_label = 1)
auc(fpr, tpr)

0.9876367313172165
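
For a fixed set of labels and scores, `roc_auc_score` and `auc` applied to the output of `roc_curve` compute the same area. The two different values recorded above (about 0.9859 and 0.9876) therefore most likely come from different states of `cb_model` or `pred`, for example cells run out of order. A quick check on made-up labels and scores:

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and scores, only for the comparison.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Direct computation of the area under the ROC curve.
direct = roc_auc_score(y_true, scores)

# Same area obtained by integrating the explicit curve.
fpr, tpr, _ = roc_curve(y_true, scores, pos_label=1)
via_curve = auc(fpr, tpr)

print(direct, via_curve)
```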

In [39]:
#cross validation
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold, ShuffleSplit
kfold = KFold(n_splits = 5, shuffle = True, random_state = 0)
skfold = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 0)
shuffle = ShuffleSplit(test_size = 0.5,train_size = 0.5, n_splits = 5)

val = cross_val_score(pipe2, x_train, y_train,scoring = 'roc_auc', cv = skfold)
print(val.mean())
val

0.971653778320445


array([0.970194  , 0.97592593, 0.98174603, 0.96245421, 0.96794872])

In [44]:
#Grid Search
from sklearn.model_selection import GridSearchCV
params = {'learning_rate': [0.1,0.02,0.03,0.04,0.05,0.25], 'max_depth': [10, 20, 30, 40,100]}
grid = GridSearchCV(xgb, params, scoring = 'roc_auc', cv = skfold )
grid.fit(X,y)

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=0, shuffle=True),
       error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=0.8,
       colsample_bytree=1, gamma=0.04, learning_rate=0.04,
       max_delta_step=0, max_depth=10, min_child_weight=6, missing=None,
       n_estimators=200, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=5, seed=12, silent=True, subsample=1),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'learning_rate': [0.1, 0.02, 0.03, 0.04, 0.05, 0.25], 'max_depth': [10, 20, 30, 40, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

In [45]:
print("best parameter: {}".format(grid.best_params_))
print("best score: {}".format(grid.best_score_))
print("estimator: \n{}".format(grid.best_estimator_))

best parameter: {'learning_rate': 0.1, 'max_depth': 10}
best score: 0.9820286033862308
estimator: 
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=0.8,
       colsample_bytree=1, gamma=0.04, learning_rate=0.1, max_delta_step=0,
       max_depth=10, min_child_weight=6, missing=None, n_estimators=200,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=5, seed=12, silent=True,
       subsample=1)


## Predicting On Test

In [32]:
xtest = test.drop(['TransactionId','ProductCategory_ticket','ProductId_ProductId_11','ProductId_ProductId_14',
                   'ProductId_ProductId_19',],axis = 1)

In [33]:
print(xtest.shape)
print(X.shape)

(478, 32)
(1479, 32)
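
Matching shapes confirm only that the column counts agree, not that the names and order do. One way to make the test matrix follow the training columns without hand-picking the dummy columns to drop is `DataFrame.reindex`. The sketch below uses hypothetical toy frames (`toy_train` and `toy_test` are made up) and is an alternative to the manual drops above, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical frames whose categories only partly overlap.
toy_train = pd.DataFrame({"ProductCategory": ["airtime", "movies", "tv"],
                          "Value": [1.0, 2.0, 3.0]})
toy_test = pd.DataFrame({"ProductCategory": ["airtime", "ticket"],
                         "Value": [4.0, 5.0]})

train_enc = pd.get_dummies(toy_train, columns=["ProductCategory"])
test_enc = pd.get_dummies(toy_test, columns=["ProductCategory"])

# Keep exactly the training columns, in the training order:
# categories unseen in test become all-zero columns, and
# test-only categories (here "ticket") are discarded.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))
```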


In [34]:
pred_test = cb_model.predict_proba(xtest)[:,1]
pred_test1 = pipe2.predict_proba(xtest)[:,1]

In [35]:
# CatBoost predictions
d = {'TransactionId': test['TransactionId'], 'IsDefaulted': pred_test}
sub = pd.DataFrame(data = d)
# XGBoost predictions
d1 = {'TransactionId': test['TransactionId'], 'IsDefaulted': pred_test1}
sub1 = pd.DataFrame(data = d1)
sub.head()

| | TransactionId | IsDefaulted |
|---|---|---|
| 0 | TransactionId_925 | 0.039724 |
| 1 | TransactionId_1080 | 0.000863 |
| 2 | TransactionId_2315 | 0.001206 |
| 3 | TransactionId_1466 | 0.000454 |
| 4 | TransactionId_337 | 0.000388 |


In [36]:
sub.to_csv('cb_model.csv', index = False)
sub1.to_csv('stack.csv', index = False)

### Applying Stacking

In [51]:
sub7 = pd.read_csv('cb_model.csv')
sub8 = pd.read_csv('stack.csv')

In [52]:
g = sub7.drop(['TransactionId'], axis = 1)
h = sub8.drop(['TransactionId'], axis = 1)

In [53]:
stack = (0.95 * h + 0.05 * g )
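
The weighted average above is more commonly called blending; stacking usually means training a meta-model on the base models' predictions. The arithmetic also relies on both files listing rows in the same order. The sketch below shows a helper that aligns on `TransactionId` before averaging; the function name `blend` and the toy frames are hypothetical.

```python
import pandas as pd

def blend(frames, weights, key="TransactionId", target="IsDefaulted"):
    """Weighted average of prediction frames, aligned on `key`."""
    if len(frames) != len(weights):
        raise ValueError("one weight per frame is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    out = frames[0][[key]].copy()
    out[target] = 0.0
    for frame, weight in zip(frames, weights):
        # Look up each prediction by ID so row order does not matter.
        aligned = frame.set_index(key)[target].reindex(out[key]).to_numpy()
        out[target] += weight * aligned
    return out

# Hypothetical predictions; the second frame lists rows in another order.
xgb_like = pd.DataFrame({"TransactionId": ["t1", "t2"], "IsDefaulted": [0.2, 0.8]})
cb_like = pd.DataFrame({"TransactionId": ["t2", "t1"], "IsDefaulted": [0.6, 0.4]})

blended = blend([xgb_like, cb_like], [0.95, 0.05])
print(blended)
```

With the 0.95 / 0.05 weights used in this notebook, the result stays very close to the first model's predictions.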

In [54]:
stack['TransactionId'] = sub1['TransactionId']
stack = stack[['TransactionId', 'IsDefaulted']]
stack.head().round(2)

| | TransactionId | IsDefaulted |
|---|---|---|
| 0 | TransactionId_925 | 0.04 |
| 1 | TransactionId_1080 | 0.0 |
| 2 | TransactionId_2315 | 0.0 |
| 3 | TransactionId_1466 | 0.0 |
| 4 | TransactionId_337 | 0.0 |


In [55]:
stack.to_csv('stack1.csv', index = False)