# Problem Statement Outline
### About the Data
* The bank's NPAs (Non-Performing Assets) have reached an all-time high.
* Its stock has fallen by 20% in the previous quarter alone.
* The majority of the NPAs were contributed by loan defaulters.
* Along with the bank, the investors perform due diligence on each requested loan application.


##Use machine learning to find these defaulters and devise a plan to reduce them.
##In this challenge, you will help this bank by predicting the probability that a member will default.


* Evaluation based on AUC-ROC score.
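
AUC-ROC is computed from how the model ranks members by predicted probability, so it is best fed scores rather than hard 0/1 labels. The plain-Python sketch below illustrates this on toy numbers; the helper name `auc_from_scores` and all values are illustrative assumptions, not part of the dataset. It counts, over every (defaulter, non-defaulter) pair, how often the defaulter receives the higher score, with ties counted as half.

```python
def auc_from_scores(y_true, y_score):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # defaulter ranked above non-defaulter
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (hypothetical values)
y_true = [0, 0, 1, 1, 1]
proba = [0.1, 0.3, 0.2, 0.45, 0.9]
# Thresholding at 0.5 throws away the ranking among low-scored members
labels = [1 if p >= 0.5 else 0 for p in proba]

auc_proba = auc_from_scores(y_true, proba)
auc_labels = auc_from_scores(y_true, labels)
print(auc_proba, auc_labels)
```

On these toy values the probabilities give a higher AUC than the thresholded labels, which is why the model cells are better scored with `predict_proba` output than with `predict` output.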

In [None]:
!unzip /content/drive/MyDrive/dataset.zip -d /content/drive/MyDrive/data/

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#About the Data
* **member_id**  unique ID assigned to each member
* **loan_amnt**	loan amount applied by the member
* **funded_amnt**	loan amount sanctioned by the bank
* **funded_amnt_inv**	loan amount sanctioned by the investors
* **term**	term of loan (in months)
* **batch_enrolled**	batch numbers allotted to members
* **int_rate**	interest rate (%) on loan
* **grade**	grade assigned by the bank
* **sub_grade**	sub-grade assigned by the bank
* **emp_title**	job / Employer title of member
* **emp_length**	employment length, where 0 means less than one year and 10 means ten or more years
* **home_ownership**	status of home ownership
* **annual_inc**	annual income ($) reported by the member
* **verification_status**	status of income verified by the bank
* **pymnt_plan**	indicates if any payment plan has started against loan
* **desc**	loan description provided by member
* **purpose**	purpose of loan
* **title**	loan title provided by member
* **zip_code**	first three digits of area zipcode of member

* **addr_state**	living state of member
* **dti**	ratio of member's total monthly debt repayment excluding mortgage divided by self reported monthly income
* **delinq_2yrs**	number of 30+ days delinquency in past 2 years
* **inq_last_6mths**	number of inquiries in last 6 months
* **mths_since_last_delinq**	number of months since last delinq
* **mths_since_last_record**	number of months since last public record
* **open_acc**	number of open credit lines in the member's credit file
* **pub_rec**	number of derogatory public records
* **revol_bal**	total credit revolving balance
* **revol_util**	amount of credit a member is using relative to revol_bal
* **total_acc**	total number of credit lines in the member's credit file
* **initial_list_status**	unique listing status of the loan - W(Waiting), F(Forwarded)
* **total_rec_int**	interest received till date
* **total_rec_late_fee**	Late fee received till date
* **recoveries**	post charge off gross recovery
* **collection_recovery_fee**	post charge off collection fee

* **collections_12_mths_ex_med**	number of collections in last 12 months excluding medical collections
* **mths_since_last_major_derog**	months since most recent 90 day or worse rating
* **application_type**	indicates whether the application is individual or joint
* **verification_status_joint**	indicates if the joint members income was verified by the bank
* **last_week_pay**	indicates how long (in weeks) a member has paid EMI after batch enrolled
* **acc_now_delinq**	number of accounts on which the member is delinquent
* **tot_coll_amt**	total collection amount ever owed
* **tot_cur_bal**	total current balance of all accounts
* **total_rev_hi_lim**	total revolving credit limit
* **loan_status**	status of loan amount, 1 = Defaulter, 0 = Non Defaulters



In [None]:
data=pd.read_csv('/content/drive/MyDrive/data/ML_Artivatic_dataset/train_indessa.csv')
test=pd.read_csv('/content/drive/MyDrive/data/ML_Artivatic_dataset/test_indessa.csv')

In [None]:
print('data',data.shape)
print('test',test.shape)

In [None]:
#finding null values 
data.isnull().sum()

In [None]:
test.isnull().sum()

In [None]:
data.describe()

In [None]:
data.describe(include=object)

##Let's get a report of the data.


In [None]:
!pip install pandas-profiling[notebook,html]
from pandas_profiling import ProfileReport
profile = ProfileReport(data, title='Report0')  # profile the training DataFrame loaded above
profile.to_file('your_report.html')  # saves an HTML page with an overview of the data when running on your local machine

Let's look at the defaulters and try to find the patterns that lead to default.

In [None]:
sns.swarmplot(x=data['loan_status'], y=data['annual_inc'])

In [None]:
sns.lineplot(x=data['dti'],y=data['loan_status'])

In [None]:
sns.barplot(x=data['funded_amnt'], y=data['term'],hue=data['loan_status'],ci=None)

**For larger loan amounts, clients prefer a longer repayment term. But lengthening the term or sanctioning a smaller amount makes no big difference to whether a person defaults.**

In [None]:
sns.lineplot(x=data['funded_amnt'], y=data['int_rate'],hue=data['loan_status'])

**For larger loan amounts at higher interest rates, more clients can be observed to default.**

In [None]:
sns.barplot(x=data['loan_status'], y=data['revol_bal'],ci=None)

**Clients who are regular with their payments have a higher revolving credit balance.**

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
sns.barplot(ax=ax,x=data['loan_status'], y=data['addr_state'],ci=None)

In [None]:
fig, ax = plt.subplots(figsize=(20,10))
sns.countplot(x=data['addr_state'],hue=data['loan_status'],saturation=0.75,ax=ax)

In [None]:
# fig, ax = plt.subplots(figsize=(20,10/))
sns.barplot(x=data['verification_status'],y=data['loan_status'],ci=None)

In [None]:
data1 = data.groupby('verification_status')['loan_status'].sum()
data1.plot.pie(autopct="%.1f%%", pctdistance=0.5)

* The bank has a large number of clients who were not verified before their loan was sanctioned.

In [None]:
new=data.groupby('verification_status')['loan_status']
new.value_counts()

* This shows that clients who are not verified are more likely to be defaulters.
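
A claim about likelihood is easier to check with the default rate per group than with raw counts, because groups differ in size. The sketch below uses a small hypothetical frame (the rows are made up for illustration; only the column names come from the data dictionary) to show how a sum and a mean over `loan_status` can tell different stories.

```python
import pandas as pd

# Toy rows, not taken from the real dataset
toy = pd.DataFrame({
    "verification_status": [
        "Verified", "Verified",
        "Not Verified", "Not Verified", "Not Verified",
        "Source Verified",
    ],
    "loan_status": [1, 0, 0, 0, 1, 0],
})

grouped = toy.groupby("verification_status")["loan_status"]
counts = grouped.sum()   # number of defaulters in each group
rates = grouped.mean()   # share of each group that defaulted

print(counts)
print(rates)
```

Here two groups have the same number of defaulters but different default rates, so on the real data `data.groupby('verification_status')['loan_status'].mean()` would be the more direct check.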

#Missing Value Analysis


In [None]:
###Dropping Columns and Rows
threshold = 50
#Dropping columns with missing value rate higher than threshold
cols = data.columns[(100 * data.isnull().sum() / len(data)).round(2) > threshold]
data.drop(columns=cols,inplace=True)
data.shape

In [None]:
data.columns[data.isnull().sum()>0]

#**Imputation using mean**
Using a traditional split, compute and merge trick to save DataFrame processing time. This can also be done with some newer algorithms, but those take time to configure, so we stick with the traditional approach.
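
One side effect of imputing each chunk separately is that the same column is filled with a different value in each chunk. The plain-Python sketch below shows this on a toy column; `impute_mean` is a hypothetical helper and the numbers are made up.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Toy column with two missing values
column = [10.0, None, 30.0, 100.0, None, 200.0]

# Split-compute-merge: each half is filled with its own mean
chunked = impute_mean(column[:3]) + impute_mean(column[3:])

# Single pass: one mean for the whole column
whole = impute_mean(column)

print(chunked)
print(whole)
```

Because the notebook shuffles the rows before splitting, the chunk means over 100,000 rows should sit close to the overall mean, so the difference is likely small in practice. The toy column is deliberately unshuffled to make the effect visible.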

In [None]:
data = data.sample(frac=1)

data_split_1 = data[:100000]
data_split_2 = data[100000:200000]
data_split_3 = data[200000:300000]
data_split_4 = data[300000:400000]
data_split_5 = data[400000:]

In [None]:
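#fill numeric NaNs in each chunk with that chunk's column means (reassigning avoids in-place edits on a slice)
data_split_1 = data_split_1.fillna(data_split_1.mean(numeric_only=True))
data_split_2 = data_split_2.fillna(data_split_2.mean(numeric_only=True))
data_split_3 = data_split_3.fillna(data_split_3.mean(numeric_only=True))
data_split_4 = data_split_4.fillna(data_split_4.mean(numeric_only=True))
data_split_5 = data_split_5.fillna(data_split_5.mean(numeric_only=True))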

In [None]:
print(data_split_1.shape)
print(data_split_2.shape)
print(data_split_3.shape)
print(data_split_4.shape)
print(data_split_5.shape)

In [None]:
#Merging the split dataframes into one aggregated dataframe
data = pd.concat([data_split_1, data_split_2, data_split_3, data_split_4, data_split_5], ignore_index=True)
data.shape

In [None]:
#Dropping rows with Missing values - rows that are not feasible for imputation
data.dropna(inplace=True)

#Correlation Analysis


In [None]:
def remove_collinear_features(x, threshold):
    
    # Don't want to remove features because of their correlation with loan_status
    y = x['loan_status']
    x = x.drop(columns = ['loan_status'])
    
    # Calculate the correlation matrix (numeric columns only)
    corr_matrix = x.corr(numeric_only=True)
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterate through the correlation matrix and compare correlations
    # (column i+1 is compared against every earlier column 0..i)
    for i in iters:
        for j in range(i + 1):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = abs(item.values)
            
            # If correlation exceeds the threshold
            if val >= threshold:
                # Print the correlated features and the correlation value
                # print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(col.values[0])

    # Drop one of each pair of correlated columns
    drops = set(drop_cols)
    x = x.drop(columns = list(drops))
    
    # Add the score back to the data
    x['loan_status'] = y
               
    return x

In [None]:
data=remove_collinear_features(data,0.6)
test_cols=data.columns #gathering columns for test data preparation

In [None]:
#Correlations between Features and Target

#Find all correlations and sort
correlations_df = data.corr(numeric_only=True)['loan_status'].sort_values()

# #Print the most negative correlations
print(correlations_df.head(15), '\n')

# #Print the most positive correlations
print(correlations_df.tail(15))


In [None]:
fig,ax=plt.subplots(figsize=(20,20))
correlation = data.corr(numeric_only=True)
sns.heatmap(correlation,xticklabels=True,yticklabels=True,ax=ax,annot=True)

##**2. Outlier analysis**
We will plot a boxplot to observe the outliers in our data and clip them in both the training and the testing data to avoid extreme variance.
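
The clipping rule used in the commented-out cell is the usual 1.5 × IQR fence: values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are pulled back to the fence. The sketch below applies it to a toy list with the standard library; `clip_iqr` and the numbers are illustrative assumptions, and on the real columns pandas' `quantile` with `np.clip` would play the same role.

```python
import statistics


def clip_iqr(values, k=1.5):
    """Clip values to the [Q1 - k*IQR, Q3 + k*IQR] range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


# Toy incomes with one extreme value (hypothetical numbers)
incomes = [30, 35, 40, 45, 50, 55, 60, 65, 500]
clipped = clip_iqr(incomes)
print(clipped)
```

Only the extreme value moves; everything inside the fences is left unchanged.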

In [None]:
# data.boxplot(figsize=(30,15))
# plt.show()

In [None]:
# cols = data.select_dtypes(include=[np.float]).columns
# for n in cols:
#   q1=data[n].quantile(.25)
#   q3=data[n].quantile(.75)
#   iqr=q3-q1
#   data[n]=np.clip(data[n],q1-1.5*iqr,q3+1.5*iqr)


#**Split in training and testing**

In [None]:
#one-hot encoding the low-cardinality categorical columns before splitting the given data into training and test sets to check on known labels
new = data.select_dtypes(include=[object]).columns
thres=10
for n in new:
  if (data[n].nunique()) < thres:
    data = pd.get_dummies(data,columns=[n],drop_first=True)


data.drop(columns=(data.select_dtypes(include=[object]).columns),inplace=True)

In [None]:
from sklearn.model_selection import train_test_split

#Separate out the features and targets
features = data.drop(columns='loan_status')
targets = pd.DataFrame(data['loan_status'])

#Split into 80% training and 20% testing set
X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size = 0.2, random_state = 42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
#Feature Scaling
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
#Convert y to one-dimensional array (vector)
y_train = np.array(y_train).reshape((-1, ))
y_test = np.array(y_test).reshape((-1, ))

In [None]:
#metrics
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics import roc_curve, roc_auc_score

#Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(C=0.4,max_iter=1000,solver='liblinear')
classifier.fit(X_train,y_train)
y_pred_lr = classifier.predict(X_test)

acc = accuracy_score(y_test, y_pred_lr)
prec = precision_score(y_test, y_pred_lr,average='weighted')
rec = recall_score(y_test, y_pred_lr,average='weighted')
f1 = f1_score(y_test, y_pred_lr, average='weighted')
roc_auc= roc_auc_score(y_test, classifier.predict_proba(X_test)[:, 1])  # AUC is scored on probabilities, not hard labels
results = pd.DataFrame([['Logistic Regression', acc,prec, rec, f1,roc_auc]],
               columns = ['Model', 'Accuracy', 'Precision','Recall', 'F1 Score','ROC_AUC'])

# results = results.append(model_results, ignore_index = True)
print(results)

#Decision Tree


In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=14) 
# training the classifier
clf.fit(X_train,y_train)
# do our predictions on the test
pred_dt = clf.predict(X_test)
# Predicting Test Set

acc = accuracy_score(y_test, pred_dt)
prec = precision_score(y_test, pred_dt,average='weighted')
rec = recall_score(y_test, pred_dt,average='weighted')
f1 = f1_score(y_test, pred_dt, average='weighted')
roc_auc= roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])  # AUC is scored on probabilities, not hard labels
model_results = pd.DataFrame([['Decision Tree', acc,prec, rec, f1,roc_auc]],
               columns = ['Model', 'Accuracy', 'Precision','Recall', 'F1 Score','ROC_AUC'])

results = pd.concat([results, model_results], ignore_index = True)
print(results)


#Random Forest Classifier


In [None]:
from sklearn.ensemble import RandomForestClassifier

clf_rf = RandomForestClassifier()
clf_rf.fit(X_train,y_train)

# Predicting Test Set
y_pred_rf = clf_rf.predict(X_test)
acc = accuracy_score(y_test, y_pred_rf)
prec = precision_score(y_test, y_pred_rf,average='weighted')
rec = recall_score(y_test, y_pred_rf,average='weighted')
f1 = f1_score(y_test, y_pred_rf, average='weighted')
roc_auc= roc_auc_score(y_test, clf_rf.predict_proba(X_test)[:, 1])  # AUC is scored on probabilities, not hard labels
model_results = pd.DataFrame([['Random forest classifier', acc,prec, rec, f1,roc_auc]],
               columns = ['Model', 'Accuracy', 'Precision','Recall', 'F1 Score','ROC_AUC'])

results = pd.concat([results, model_results], ignore_index = True)
print(results)




#Ada Boost with RFC

In [None]:
from sklearn.ensemble import AdaBoostClassifier
clf=RandomForestClassifier()
abc = AdaBoostClassifier(estimator=clf,n_estimators=50,learning_rate=1)  # `estimator` replaces `base_estimator` from scikit-learn 1.2 onward
# Train Adaboost Classifier
model = abc.fit(X_train,y_train)

#Predict the response for test dataset
y_pred_abc = model.predict(X_test)

acc = accuracy_score(y_test, y_pred_abc)
prec = precision_score(y_test, y_pred_abc,average='weighted')
rec = recall_score(y_test, y_pred_abc,average='weighted')
f1 = f1_score(y_test, y_pred_abc, average='weighted')
roc_auc= roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])  # AUC is scored on probabilities, not hard labels
model_results = pd.DataFrame([['Adaboost ', acc,prec, rec, f1,roc_auc]],
               columns = ['Model', 'Accuracy', 'Precision','Recall', 'F1 Score','ROC_AUC'])

results = pd.concat([results, model_results], ignore_index = True)
print(results)




#XG Boost


In [None]:
import xgboost as xgb
model =xgb.XGBClassifier(learning_rate=0.06,colsample_bytree = 0.6, subsample = 0.8,n_estimators=200,max_depth=3, gamma=0)
model.fit(X_train,y_train)
y_pred_xg = model.predict(X_test)

acc = accuracy_score(y_test, y_pred_xg)
prec = precision_score(y_test, y_pred_xg,average='weighted')
rec = recall_score(y_test, y_pred_xg,average='weighted')
f1 = f1_score(y_test, y_pred_xg, average='weighted')
roc_auc= roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])  # AUC is scored on probabilities, not hard labels
model_results = pd.DataFrame([['XG Boost', acc,prec, rec, f1,roc_auc]],
               columns = ['Model', 'Accuracy', 'Precision','Recall', 'F1 Score','ROC_AUC'])

results = pd.concat([results, model_results], ignore_index = True)
print(results)


#Bernoulli Naive Bayes

In [None]:
from sklearn.naive_bayes import BernoulliNB
model_bnb = BernoulliNB()
model_bnb.fit(X_train,y_train)
# Predicting Test Set
pred_bnb = model_bnb.predict(X_test)

acc = accuracy_score(y_test, pred_bnb)
prec = precision_score(y_test, pred_bnb,average='weighted')
rec = recall_score(y_test, pred_bnb,average='weighted')
f1 = f1_score(y_test, pred_bnb, average='weighted')
roc_auc= roc_auc_score(y_test, model_bnb.predict_proba(X_test)[:, 1])  # AUC is scored on probabilities, not hard labels
model_results = pd.DataFrame([['Bernoulli Naive Bayes', acc,prec, rec, f1,roc_auc]],
               columns = ['Model', 'Accuracy', 'Precision','Recall', 'F1 Score','ROC_AUC'])

results = pd.concat([results, model_results], ignore_index = True)
print(results)

In [None]:
result=pd.DataFrame(results)
result

#Best Accuracy

In [None]:
plt.figure(figsize=(8,5))
max_acc_index=results.Accuracy[results.Accuracy==results.Accuracy.max()].index[0]
plt.barh(results.Model,results.Accuracy,color='c')
plt.barh(results.Model[max_acc_index],results.Accuracy[max_acc_index],color='m')
plt.show()

#Best Precision

In [None]:
plt.figure(figsize=(8,5))
max_pre_index=results.Precision[results.Precision==results.Precision.max()].index[0]
plt.barh(results.Model,results.Precision,color='c')
plt.barh(results.Model[max_pre_index],results.Precision[max_pre_index],color='m')
plt.show()

#Best Recall

In [None]:
plt.figure(figsize=(8,5))
max_rc_index=results.Recall[results.Recall==results.Recall.max()].index[0]
plt.barh(results.Model,results.Recall,color='c')
plt.barh(results.Model[max_rc_index],results.Recall[max_rc_index],color='m')
plt.show()

#Best F1 score

In [None]:
plt.figure(figsize=(8,5))
max_f1_index=results['F1 Score'][results['F1 Score']==results['F1 Score'].max()].index[0]
plt.barh(results.Model,results['F1 Score'],color='c')
plt.barh(results.Model[max_f1_index],results['F1 Score'][max_f1_index],color='m')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
max_auc_index=results['ROC_AUC'][results['ROC_AUC']==results['ROC_AUC'].max()].index[0]
plt.barh(results.Model,results['ROC_AUC'],color='c')
plt.barh(results.Model[max_auc_index],results['ROC_AUC'][max_auc_index],color='m')
plt.show()

#Checking on the actual test data 

###Missing Value Analysis

In [None]:
test.shape

In [None]:
#including the same columns from the train data after eliminating the missing value columns
test=test[['member_id', 'loan_amnt', 'funded_amnt', 'term', 'batch_enrolled',
       'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length',
       'home_ownership', 'annual_inc', 'verification_status', 'pymnt_plan',
       'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs',
       'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util',
       'initial_list_status', 'total_rec_int', 'total_rec_late_fee',
       'recoveries', 'collection_recovery_fee', 'collections_12_mths_ex_med',
       'application_type', 'last_week_pay', 'acc_now_delinq', 'tot_coll_amt',
       'tot_cur_bal']]
       

In [None]:
#splitting the test data to fill NaN values

test = test.sample(frac=1)
test_split_1 = test[:100000]
test_split_2 = test[100000:200000]
test_split_3 = test[200000:300000]
test_split_4 = test[300000:]
# data_split_5 = data[400000:]

In [None]:
test_split_1 = test_split_1.fillna(test_split_1.mean(numeric_only=True))
test_split_2 = test_split_2.fillna(test_split_2.mean(numeric_only=True))
test_split_3 = test_split_3.fillna(test_split_3.mean(numeric_only=True))
test_split_4 = test_split_4.fillna(test_split_4.mean(numeric_only=True))
# data_split_5.fillna(data_split_5.mean(), inplace=True)

In [None]:
test = pd.concat([test_split_1, test_split_2, test_split_3, test_split_4], ignore_index=True)

In [None]:
#Dropping rows with Missing values - rows that are not feasible for imputation
test.dropna(inplace=True)

#Outlier Analysis

In [None]:
# cols = data.select_dtypes(include=[np.float]).columns
# for n in cols:
#   q1=data[n].quantile(.25)
#   q3=data[n].quantile(.75)
#   iqr=q3-q1
#   data[n]=np.clip(data[n],q1-1.5*iqr,q3+1.5*iqr)


Checking categorical variables
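
When train and test are one-hot encoded separately, a category that is absent from one of them produces a different set of dummy columns, and a model fitted on one layout cannot score the other. The sketch below shows one common way to align them with `reindex`; the two toy frames and the `term` values are illustrative assumptions.

```python
import pandas as pd

# Toy frames: the test set happens to lack the "60 months" category
train_toy = pd.DataFrame({"term": ["36 months", "60 months", "36 months"]})
test_toy = pd.DataFrame({"term": ["36 months", "36 months"]})

train_enc = pd.get_dummies(train_toy, columns=["term"])
test_enc = pd.get_dummies(test_toy, columns=["term"])

# Give the test frame exactly the training columns, in the same order;
# dummy columns that never appeared in test are filled with 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
print(list(test_aligned.columns))
```

Filling with 0 is only appropriate for dummy columns; a numeric feature missing from the test frame would need to be restored from the source data instead.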

In [None]:
new = test.select_dtypes(include=[object]).columns
thres=10
for n in new:
  if (test[n].nunique()) < thres:
    test = pd.get_dummies(test,columns=[n],drop_first=True)

In [None]:
test.drop(columns=(test.select_dtypes(include=[object]).columns),inplace=True)

In [None]:
test.shape

Scaling our data
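
Standardisation is meant to use the mean and standard deviation learned from the training data, so that test values are placed on the same scale the model was fitted on. The plain-Python sketch below contrasts that with refitting on the test data; the helper names and numbers are hypothetical, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
import statistics


def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training values."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def standardize(values, mean, std):
    """Apply previously learned statistics to any values."""
    return [(v - mean) / std for v in values]


train_col = [10.0, 20.0, 30.0]
test_col = [20.0, 40.0]

# Correct: statistics come from the training column only
mean, std = fit_standardizer(train_col)
test_scaled = standardize(test_col, mean, std)

# Refitting on the test column gives the same raw values a different meaning
test_refit = standardize(test_col, *fit_standardizer(test_col))

print(test_scaled)
print(test_refit)
```

With the training statistics, a test value of 20 maps to 0 (the training mean); after refitting on the test column the same value maps to -1, which the trained model would misread.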

In [None]:
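#Feature Scaling
#align the test columns with the training features (dummy columns missing from test are filled with 0)
test = test.reindex(columns=features.columns, fill_value=0)
#reuse the scaler already fitted on the training data instead of refitting it on the test set
x_test = sc.transform(test)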

In [None]:
#the actual test data has no loan_status column, so y_test stays the hold-out labels from the train/test split above
y_test = np.array(y_test).reshape((-1, ))

#Let's fit our data and use two models with best scores

RANDOM FOREST CLASSIFIER

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier()
clf_rf.fit(X_train,y_train)

# Scoring on the hold-out split, because the actual test data has no labels
y_pred_rf = clf_rf.predict(X_test)
proba_rf = clf_rf.predict_proba(X_test)[:, 1]
acc = accuracy_score(y_test, y_pred_rf)
prec = precision_score(y_test, y_pred_rf,average='weighted')
rec = recall_score(y_test, y_pred_rf,average='weighted')
f1 = f1_score(y_test, y_pred_rf, average='weighted')
roc_auc= roc_auc_score(y_test, proba_rf)
model_results = pd.DataFrame([['Random forest classifier', acc,prec, rec, f1,roc_auc]],
               columns = ['Model', 'Accuracy', 'Precision','Recall', 'F1 Score','ROC_AUC'])

results = pd.concat([results, model_results], ignore_index = True)
print(results)

# Default probabilities for the actual test data
test_proba_rf = clf_rf.predict_proba(x_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, proba_rf)
plt.clf()
plt.plot(fpr, tpr)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve - Random Forest Classifier')
plt.show()



DECISION TREE

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=14) 
# training the classifier
clf.fit(X_train,y_train)
# do our predictions on the test
pred_dt = clf.predict(X_test)
# Predicting Test Set

acc = accuracy_score(y_test, pred_dt)
prec = precision_score(y_test, pred_dt,average='weighted')
rec = recall_score(y_test, pred_dt,average='weighted')
f1 = f1_score(y_test, pred_dt, average='weighted')
proba_dt = clf.predict_proba(X_test)[:, 1]  # probabilities on the hold-out split
roc_auc= roc_auc_score(y_test, proba_dt)
model_results = pd.DataFrame([['Decision Tree', acc,prec, rec, f1,roc_auc]],
               columns = ['Model', 'Accuracy', 'Precision','Recall', 'F1 Score','ROC_AUC'])

results = pd.concat([results, model_results], ignore_index = True)
print(results)


fpr, tpr, _ = roc_curve(y_test, proba_dt)
plt.clf()
plt.plot(fpr, tpr)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve - Decision Tree Classifier')
plt.show()


**Since the Decision Tree Classifier has a better ROC score, we use its predictions for the submission file.**


#Getting the predictions

In [None]:
#predicting default probabilities for the actual test data with the decision tree
submission=pd.DataFrame({'member_id':test['member_id'].values,'Predictions':clf.predict_proba(x_test)[:, 1]})
submission.to_csv('/Submission.csv',index=False)
                        