# Micro-Credit Defaulter Model
Problem Statement: A Microfinance Institution (MFI) is an organization that offers financial services to low-income populations. Microfinance services (MFS) are especially useful for reaching unbanked poor families living in remote areas with few sources of income. The services provided by MFIs include Group Loans, Agricultural Loans, Individual Business Loans and so on.

Many MFIs, experts and donors support the idea of using mobile financial services (MFS), which they consider more convenient, efficient and cost-saving than the traditional high-touch model long used to deliver microfinance services. Although the MFI industry focuses primarily on low-income families and is very useful in such areas, the implementation of MFS has been uneven, with both significant challenges and successes. Today, microfinance is widely accepted as a poverty-reduction tool, representing $70 billion in outstanding loans and a global outreach of 200 million clients.

We are working with one such client in the telecom industry, a fixed wireless telecommunications network provider. They have launched various products and have built their business and organization on the budget operator model, offering better products at lower prices to all value-conscious customers through a strategy of disruptive innovation that focuses on the subscriber. They understand the importance of communication and how it affects a person’s life, and therefore focus on providing their services and products to low-income families and poor customers who can use them in their hour of need.

They are collaborating with an MFI to provide micro-credit on mobile balances, to be paid back in 5 days. A consumer is considered a defaulter if they fail to pay back the loaned amount within 5 days. For a loan amount of 5 (in Indonesian Rupiah), the payback amount should be 6 (in Indonesian Rupiah), while for a loan amount of 10 (in Indonesian Rupiah), the payback amount should be 12 (in Indonesian Rupiah).

The sample data is provided to us from our client database. It is hereby given to you for this exercise. To improve the selection of customers for credit, the client wants predictions that can guide further investment and improve customer selection.

Exercise: Build a model that predicts, as a probability for each loan transaction, whether the customer will pay back the loaned amount within 5 days of issuance of the loan. Label ‘1’ indicates that the loan has been paid, i.e. non-defaulter, while label ‘0’ indicates that the loan has not been paid, i.e. defaulter.
Points to Remember:

• There are no null values in the dataset.
• There may be some customers with no loan history.
• The dataset is imbalanced. Label ‘1’ has approximately 87.5% records, while, label ‘0’ has approximately 12.5% records.
• For some features, there may be values which might not be realistic. You may have to observe them and treat them with a suitable explanation.
• You might come across outliers in some features which you need to handle as per your understanding. Keep in mind that data is expensive and we cannot lose more than 7-8% of the data.
Find Enclosed the Data Description File and The Sample Data for the Modeling Exercise.
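
The repayment terms in the problem statement (5 repaid as 6, 10 repaid as 12, both within 5 days) can be written down as a small helper. This is only an illustrative sketch: the 20% fee rate is inferred from those two examples, and the names `FEE_RATE`, `payback_amount` and `is_defaulter` are hypothetical, not part of the client's data or the original notebook.

```python
from fractions import Fraction

# Assumption: a flat 20% fee, inferred from the two examples 5 -> 6 and 10 -> 12.
FEE_RATE = Fraction(1, 5)
REPAYMENT_WINDOW_DAYS = 5  # stated in the problem statement


def payback_amount(loan_amount):
    """Amount due for a given loan under the assumed flat fee."""
    return loan_amount * (1 + FEE_RATE)


def is_defaulter(days_to_repay):
    """A customer defaults if the loan is never repaid (None) or repaid late."""
    return days_to_repay is None or days_to_repay > REPAYMENT_WINDOW_DAYS


print(payback_amount(5), payback_amount(10))
print(is_defaulter(3), is_defaulter(7), is_defaulter(None))
```

Exact fractions are used so the two documented amounts come out as whole numbers rather than floating-point approximations.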



In [None]:
#Import

#Generic
import numpy as np,pandas as pd, matplotlib.pyplot as plt, seaborn as sns, joblib
from matplotlib.ticker import FormatStrFormatter

#Statistics
from scipy.stats import zscore

#Scaler
from sklearn.preprocessing import StandardScaler,MinMaxScaler

#Skewness
from sklearn.preprocessing import PowerTransformer

#Train Test Split
from sklearn.model_selection import train_test_split

#Resample
from sklearn.utils import resample

#Feature Selection
from sklearn.feature_selection import SelectKBest,chi2,f_classif, VarianceThreshold
from statsmodels.stats.outliers_influence import variance_inflation_factor

#Decomposition
from sklearn.decomposition import PCA

#Cross Validation
from sklearn.model_selection import cross_val_score

#Hypertune Parameters
from sklearn.model_selection import GridSearchCV

#Classification Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
import xgboost as xgb

#Classification Metrics
from sklearn.metrics import classification_report,confusion_matrix,f1_score,accuracy_score,recall_score,precision_score
from sklearn.metrics import auc,roc_curve

In [None]:
df=pd.read_csv("Desktop/QUERY/internship fliprobo/Micro-Credit-Project/Micro Credit Project/Data file.csv")

In [None]:
#Check head
df.head()

In [None]:

#Delete first column of index
df=df.drop('Unnamed: 0',axis=1)

In [None]:
#Check info
df.info()

In [None]:
df.describe()

In [None]:
sum(df.duplicated())

In [None]:
#Drop Duplicates
df=df.drop_duplicates()

In [None]:
#Check Object type columns
df.select_dtypes('object').columns

In [None]:
#Drop pcircle
df=df.drop('pcircle',axis=1)

In [None]:
#Change dtype of pdate to datetime64
df['pdate']=pd.to_datetime(df['pdate'])

In [None]:
#Check count of Label
plt.figure(figsize=(10,6))
sns.countplot(x='label',data=df,palette='Set2')
plt.ylim(0,200000)
plt.savefig('Desktop/QUERY/internship fliprobo/Micro-Credit-Project/Micro Credit Project//1.Unsampled_Label.jpeg',dpi=300)
plt.show()
#As we can see the label is highly imbalanced
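
To make the imbalance concrete, the sketch below uses a toy label list with the same 87.5% / 12.5% split quoted in the points to remember. The data here is synthetic and the variable names are illustrative. It shows why plain accuracy is a weak yardstick on such data: a model that always predicts the majority class already scores 87.5%.

```python
from collections import Counter

# Synthetic labels with the class shares quoted above (not the real dataset).
labels = [1] * 875 + [0] * 125

counts = Counter(labels)
shares = {label: count / len(labels) for label, count in counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline_accuracy = max(shares.values())

print(shares)
print(majority_baseline_accuracy)
```

Any model evaluated later should be compared against this baseline, which is one reason the notebook tracks F1, recall and precision alongside accuracy.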

In [None]:

#Plot heatmap to check correlation
plt.figure(figsize=(10,6))
sns.heatmap(df.select_dtypes('number').corr(),cmap='viridis')
plt.savefig('Desktop/QUERY/internship fliprobo/Micro-Credit-Project/Micro Credit Project//2.Correlation_Heatmap.jpeg',dpi=300)
plt.show()


In [None]:
#Plot barplot to check correlation
plt.figure(figsize=(10,6))
df.select_dtypes('number').corr()['label'].drop('label').sort_values().plot(kind='bar')
plt.savefig('Desktop/QUERY/internship fliprobo/Micro-Credit-Project/Micro Credit Project//3.Correlation_Barplot.jpeg',dpi=300)
plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x='label',y='cnt_ma_rech30',data=df,palette='viridis')
plt.ylim(0,20)
plt.savefig('Desktop/QUERY/internship fliprobo/Micro-Credit-Project/Micro Credit Project//4.Box_Plot.jpeg',dpi=300)
plt.show()

In [None]:
#Check value counts of pdate
df['pdate'].value_counts()

In [None]:
#Extract new column of month
df['pmonth']=df['pdate'].dt.month

In [None]:
#No need to extract year as there is only one distinct value
df['pdate'].dt.year.nunique()

In [None]:
#Drop pdate column
df=df.drop(['pdate'],axis=1)


In [None]:
df=df.drop('msisdn',axis=1)

In [None]:
#Check Skewness and Detect Outlier
q1=df.quantile(q=0.25)
q3=df.quantile(q=0.75)
#Create IQR Range
IQR=q3-q1
lower_bound = q1 - 1.5*IQR
higer_bound = q3 + 1.5*IQR

In [None]:
#Create function for Outlier Detection
#inp=False: only report the outliers; inp=True: also return the filtered DataFrame
def remove_outlier(df,col,inp):
    raw=df.shape[0]
    #Keep rows whose value lies within the IQR bounds computed above
    filtered=df[(df[col]>=lower_bound[col]) & (df[col]<=higer_bound[col])]
    prcsd=filtered.shape[0]

    outliers=raw-prcsd
    percent_change=(outliers/raw)*100

    print("{} outliers are detected for column {} with percent change being {}".format(outliers,col,percent_change))
    #Return the filtered frame only when removal is requested, otherwise the input unchanged
    return filtered if inp else df

In [None]:
#Run Function for each column
for x in df.columns:
    remove_outlier(df,str(x),False)
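
The IQR rule used above can be checked by hand on a tiny array. The sketch below is not the notebook's method for treating outliers: it shows clipping (winsorizing) to the IQR bounds as one possible alternative to dropping rows, which may matter given the 7-8% data-loss limit. The toy values and variable names are assumptions for illustration.

```python
import numpy as np

# Toy feature with one extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Share of rows flagged as outliers, to compare against a data-loss budget.
is_outlier = (values < lower) | (values > upper)
outlier_share = is_outlier.mean()

# Clipping keeps every row and only caps the extreme values.
clipped = np.clip(values, lower, upper)

print(lower, upper, outlier_share)
print(clipped)
```

Whether to drop, clip or keep flagged values depends on the feature; the printed share makes it easy to see when dropping would exceed the allowed loss.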

In [None]:
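pt=PowerTransformer()
#Transform only the features whose absolute skewness exceeds 0.55
for x in df.columns.drop('label'):
    if abs(df[x].skew())>0.55:
        #fit_transform returns a 2-D array, so flatten it before assigning it back to the column
        df[x]=pt.fit_transform(df[[x]]).ravel()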
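
As a rough illustration of what the power transform does to a skewed feature, the sketch below applies `PowerTransformer` (Yeo-Johnson by default) to synthetic log-normal data. It also fits the transformer on a training slice only and reuses it on the held-out slice, which is one way to avoid letting test data influence the fitted parameters. The column name `amount` and the split sizes are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature (not the real dataset).
rng = np.random.default_rng(101)
feature = pd.DataFrame({"amount": rng.lognormal(mean=0.0, sigma=1.0, size=2000)})
train, test = feature.iloc[:1500], feature.iloc[1500:]

pt = PowerTransformer()  # Yeo-Johnson, with standardization
train_t = pt.fit_transform(train).ravel()  # learn lambda on the training slice
test_t = pt.transform(test).ravel()        # reuse the same lambda on held-out rows

skew_before = train["amount"].skew()
skew_after = pd.Series(train_t).skew()

print(skew_before, skew_after)
```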

In [None]:
df.skew()

In [None]:
#Resample the data as it is highly imbalanced
df_minority=df[df['label']==0]
df_majority=df[df['label']==1]

df_minority_upsampled=resample(df_minority,replace=True,n_samples=50000,random_state=101)

df_upsampled=pd.concat([df_majority,df_minority_upsampled],axis=0)

In [None]:
df_minority=df_upsampled[df_upsampled['label']==0]
df_majority=df_upsampled[df_upsampled['label']==1]

df_majority_downsampled=resample(df_majority,replace=False,n_samples=150000,random_state=101)

df_downsampled=pd.concat([df_minority,df_majority_downsampled],axis=0)
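
The two resampling steps above can be traced on a toy frame with the same proportions scaled down (25 minority rows upsampled to 50, 175 majority rows downsampled to 150). The data and names below are illustrative only. The sketch also computes the share of original rows discarded by the downsampling step, which can be compared against the data-loss limit.

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame: 25 defaulters (label 0) and 175 non-defaulters (label 1).
toy = pd.DataFrame({"feature": range(200), "label": [0] * 25 + [1] * 175})
minority = toy[toy["label"] == 0]
majority = toy[toy["label"] == 1]

# Upsample the minority class with replacement.
minority_up = resample(minority, replace=True, n_samples=50, random_state=101)
# Downsample the majority class without replacement.
majority_down = resample(majority, replace=False, n_samples=150, random_state=101)

balanced = pd.concat([minority_up, majority_down], axis=0)
counts = balanced["label"].value_counts().to_dict()

# Share of the original rows that the downsampling step throws away.
dropped_share = (len(majority) - len(majority_down)) / len(toy)

print(counts, dropped_share)
```

Because upsampling with replacement duplicates rows, copies of the same row can end up on both sides of a later train/test split. Splitting first and resampling only the training part is one way to avoid that.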

In [None]:
#As the data is expensive, we cannot afford to lose more than 7-8% of it, so we cannot simply downsample the majority class until the classes are balanced
#Also, we cannot upsample the minority class from 25000 to 150000, as that would create many redundant rows
#Hence we combine a moderate upsampling of the minority class (to 50000) with a moderate downsampling of the majority class (to 150000)

In [None]:
#Check resampled data
df_downsampled['label'].value_counts()

In [None]:
#Check count of Label
plt.figure(figsize=(10,6))
sns.countplot(x='label',data=df_downsampled,palette='Set2')
plt.ylim(0,200000)
plt.savefig('Desktop/QUERY/internship fliprobo/Micro-Credit-Project/Micro Credit Project//5.Sampled_Data.jpeg',dpi=300)
plt.show()
#As we can see, the data is now balanced enough to train a model

In [None]:
vif_df=pd.DataFrame()
vif_df['Features']=df_downsampled.columns
vif_df['VIF']=[variance_inflation_factor(df_downsampled.values,x) for x in range(len(df_downsampled.columns))]

In [None]:
vif_df.sort_values(by='VIF')
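
For reference, the variance inflation factor of feature i is 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on all other features. The sketch below computes it with plain NumPy on synthetic data, using the textbook form with an intercept. That differs slightly from `statsmodels`' `variance_inflation_factor`, which regresses on the columns exactly as passed and does not add a constant. The helper name `vif` and the toy columns are hypothetical.

```python
import numpy as np


def vif(X, i):
    """Textbook VIF of column i: regress it on the other columns plus an intercept."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    r_squared = 1.0 - residuals.var() / y.var()
    return 1.0 / (1.0 - r_squared)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)              # independent of a
c = a + 0.05 * rng.normal(size=500)   # nearly a copy of a

X = np.column_stack([a, b, c])
vifs = [vif(X, i) for i in range(X.shape[1])]
print(vifs)
```

Columns `a` and `c` get a very large VIF because each is almost fully explained by the other, while `b` stays close to 1. Dropping one column from such a pair is the idea behind the feature removal in the next cells.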

In [None]:
df_downsampled_copy=df_downsampled.copy()

In [None]:
df_downsampled_copy=df_downsampled_copy.drop(['medianamnt_loans30','medianamnt_ma_rech90','amnt_loans90','sumamnt_ma_rech30','sumamnt_ma_rech90',
                         'cnt_ma_rech90','daily_decr90','amnt_loans30'],axis=1)

In [None]:
df_downsampled_copy.shape

In [None]:

#Scale data
scale=MinMaxScaler()
X=df_downsampled_copy.drop('label',axis=1)
y=df_downsampled_copy['label']

In [None]:
X=scale.fit_transform(X)

In [None]:
# #Remove Features with variance less than 0.001
# select=VarianceThreshold(threshold=0.001)
# X=select.fit_transform(X)

In [None]:

#Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
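
A possible variation on the split above, sketched on synthetic data: passing `stratify=y` keeps the class ratio identical in the train and test parts, and fitting the scaler on the training part only keeps test-set statistics out of preprocessing. This is an alternative ordering, not what the notebook does, and the toy arrays are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic features with a 25% / 75% label split (not the real dataset).
rng = np.random.default_rng(101)
X = rng.normal(size=(200, 3))
y = np.array([0] * 50 + [1] * 150)

# stratify=y preserves the label proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y)

# Learn min/max on the training part, then apply the same mapping to the test part.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(y_train.mean(), y_test.mean())
```

With this ordering, scaled test values may fall slightly outside [0, 1], which is expected because the test rows were not used to compute the range.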

In [None]:
#Create Function which will evaluate model's performance
model_list=[]
score_list=[]
def model_sel(mod):
    model=mod
    model.fit(X_train,y_train)
    predict=model.predict(X_test)
    f1score=f1_score(y_test,predict)
    model_list.append(str(mod))
    score_list.append(round(f1score,3))
    print("****************** Metrics *********************")
    print()
    print("Accuracy of the model is {}".format(accuracy_score(y_test,predict)))
    print("Recall of the model is {}".format(recall_score(y_test,predict)))
    print("Precision of the model is {}".format(precision_score(y_test,predict)))
    print("F1 score of the model is {}".format(f1score))
    print()
    print("************** Confusion Matrix ****************")
    print()
    print(confusion_matrix(y_test,predict))
    print()
    print("*********** Classification Report **************")
    print()
    print(classification_report(y_test,predict))
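
The metrics printed by `model_sel` all derive from the confusion matrix. The worked example below uses made-up counts to show the arithmetic. Note that scikit-learn's `f1_score`, `recall_score` and `precision_score` treat label 1 as the positive class by default, so the scores above describe non-defaulters; passing `pos_label=0` would score the defaulter class instead.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 true positives, 10 false positives, 20 false negatives.
precision, recall, f1 = precision_recall_f1(tp=80, fp=10, fn=20)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```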

In [None]:
#Run model for Logistic Regression
model_sel(LogisticRegression(max_iter=3000))

In [None]:

#Run Model for RandomForestClassifier
model_sel(RandomForestClassifier())

In [None]:

#Run Model for AdaBoostClassifier
model_sel(AdaBoostClassifier())

In [None]:
#Run Model for Support Vector Machines
model_sel(SVC())            

In [None]:
#Run Model for KNeighbors
model_sel(KNeighborsClassifier())

In [None]:
#Create XGboost Dataset
D_train = xgb.DMatrix(X_train, label=y_train)
D_test = xgb.DMatrix(X_test, label=y_test)

#Create Parameters for XGboost
param = {
    'eta': 0.6, 
    'max_depth': 30,  
    'objective': 'multi:softprob',  
    'num_class': 2}  # the label is binary (0 = defaulter, 1 = non-defaulter)
steps = 80  # The number of training iterations

#Train the model
model = xgb.train(param, D_train, steps)

#Perform Prediction
preds = model.predict(D_test)

#Choose best Prediction
best_preds = np.asarray([np.argmax(line) for line in preds])

print("****************** Metrics *********************")
print()
print("Accuracy of the model is {}".format(accuracy_score(y_test,best_preds)))
print("Recall of the model is {}".format(recall_score(y_test,best_preds)))
print("Precision of the model is {}".format(precision_score(y_test,best_preds)))
print("F1 score of the model is {}".format(f1_score(y_test,best_preds)))
print()
print("************** Confusion Matrix ****************")
print()
print(confusion_matrix(y_test,best_preds))
print()
print("*********** Classification Report **************")
print()
print(classification_report(y_test,best_preds))

model_list.append('XGBoost')
score_list.append(round(f1_score(y_test,best_preds),3))

In [None]:
fig,ax=plt.subplots(1,1,figsize=(10,6))
splot=sns.barplot(x=model_list,y=score_list,palette='twilight_r',ax=ax)
for p in splot.patches:
    splot.annotate(format(p.get_height(), '.3f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0.00, 9.00), 
                   textcoords = 'offset points')
ax.yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
ax.set_xlabel('Model')
ax.set_ylabel('F1Score')
plt.xticks(rotation=20)
plt.tight_layout()
plt.savefig('Images//6.Model_Performance.jpeg',dpi=300)
plt.show()

- From the above graph we can see that RandomForestClassifier is working very well
- So we will try to tune its hyperparameters

In [None]:
#Instantiate object for RandomForest to optimize parameters
model=RandomForestClassifier()
model.fit(X_train,y_train)
predict=model.predict(X_test)

In [None]:

cross_val_score(model,X_train,y_train,cv=4).mean()

In [None]:
fpr,tpr,threshold=roc_curve(y_test,predict)
auc(fpr,tpr)

In [None]:
threshold.sort()

In [None]:
plt.figure(figsize=(10,6))
plt.plot(fpr,tpr,marker='o',markerfacecolor='red',markersize=10,linestyle='-.')
plt.plot(threshold)
plt.ylim(0,1.3)
plt.xlim(0,1.3)
plt.savefig('Images//7.AUC_ROC_Curve.jpeg',dpi=300)
plt.show()
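
Since the exercise asks for a probability per loan, it may be worth computing the ROC curve from predicted probabilities rather than from hard 0/1 predictions. The sketch below uses synthetic data and a logistic regression stand-in, both assumptions for illustration. Hard labels yield only a few thresholds, whereas `predict_proba` yields a full curve.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary problem (not the real dataset).
X, y = make_classification(n_samples=500, n_features=5, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

hard = clf.predict(X_test)               # 0/1 labels
proba = clf.predict_proba(X_test)[:, 1]  # probability of label 1

_, _, thr_hard = roc_curve(y_test, hard)
_, _, thr_proba = roc_curve(y_test, proba)
auc_proba = roc_auc_score(y_test, proba)

print(len(thr_hard), len(thr_proba), auc_proba)
```

The same pattern applies to any classifier that exposes `predict_proba`, including `RandomForestClassifier`.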

In [None]:
#Hypertune Parameter
params={'n_estimators':[100,130,150,170,190,210,230,250,270,290,310,330]}
gscv=GridSearchCV(model,params)

In [None]:
gscv.fit(X_train,y_train)

In [None]:
best_param=gscv.best_params_
best_param['n_estimators']
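
By default `GridSearchCV` ranks candidates with the estimator's own `score` method, which is accuracy for classifiers. Since the models above were compared by F1, a search along the following lines might be more consistent. The small grid, `cv=3` and the synthetic data are assumptions chosen to keep the sketch fast; they are not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, mildly imbalanced binary problem (not the real dataset).
X, y = make_classification(n_samples=300, n_features=6, weights=[0.25, 0.75],
                           random_state=101)

param_grid = {"n_estimators": [10, 20]}  # deliberately tiny grid
search = GridSearchCV(
    RandomForestClassifier(random_state=101),
    param_grid,
    scoring="f1",  # rank candidates by F1 instead of the default accuracy
    cv=3,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```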

In [None]:

model=RandomForestClassifier(n_estimators=best_param['n_estimators'])
model.fit(X_train,y_train)
predict=model.predict(X_test)

In [None]:
print("****************** Metrics *********************")
print()
print("Accuracy of the model is {}".format(accuracy_score(y_test,predict)))
print("Recall of the model is {}".format(recall_score(y_test,predict)))
print("Precision of the model is {}".format(precision_score(y_test,predict)))
print("F1 score of the model is {}".format(f1_score(y_test,predict)))
print()
print("************** Confusion Matrix ****************")
print()
print(confusion_matrix(y_test,predict))
print()
print("*********** Classification Report **************")
print()
print(classification_report(y_test,predict))

In [None]:
df_predict=pd.DataFrame(pd.Series(predict))
df_test=pd.DataFrame(pd.Series(y_test))
df_predict=pd.concat([df_predict.reset_index().drop('index',axis=1),df_test.reset_index().drop('index',axis=1)],axis=1)
df_predict.columns=['Predicted','Original']
df_predict

In [None]:

df_predict.loc[df_predict['Predicted']==df_predict['Original'],'Result']=True
df_predict.loc[df_predict['Predicted']!=df_predict['Original'],'Result']=False

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(x='Result',data=df_predict,palette='twilight')
plt.savefig('Images//8.Result.jpeg',dpi=300)
plt.show()

In [None]:
joblib.dump(model,'Micro Credit Defaulter RF.obj') 

In [None]:
model= joblib.load('Micro Credit Defaulter RF.obj')
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
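
The saved model expects inputs that have already been scaled, so whoever loads it also needs the fitted scaler. One option, sketched below on synthetic data, is to wrap the scaler and the classifier in a single scikit-learn pipeline and persist that object instead. The file name `pipeline.joblib` and the toy data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (not the real dataset).
rng = np.random.default_rng(101)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Scaler and model travel together, so raw features can be passed at predict time.
pipeline = make_pipeline(
    MinMaxScaler(),
    RandomForestClassifier(n_estimators=20, random_state=101),
)
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipeline.joblib")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions exactly.
same_predictions = np.array_equal(pipeline.predict(X), restored.predict(X))
print(same_predictions)
```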