
### **Import the required libraries/modules.**

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


In [None]:
data=pd.read_csv("/content/drive/MyDrive/Datasets/credit_scoring/cs-test.csv")
sample=pd.read_csv("/content/drive/MyDrive/Datasets/credit_scoring/sampleEntry.csv")

## **Pre-processing and cleaning Data**

**`First, we’ll handle outliers and missing values in our dataset. After merging our datasets, we’ll identify columns that aren’t useful for our case study and remove them.`**


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101503 entries, 0 to 101502
Data columns (total 12 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   Unnamed: 0                            101503 non-null  int64  
 1   SeriousDlqin2yrs                      0 non-null       float64
 2   RevolvingUtilizationOfUnsecuredLines  101503 non-null  float64
 3   age                                   101503 non-null  int64  
 4   NumberOfTime30-59DaysPastDueNotWorse  101503 non-null  int64  
 5   DebtRatio                             101503 non-null  float64
 6   MonthlyIncome                         81400 non-null   float64
 7   NumberOfOpenCreditLinesAndLoans       101503 non-null  int64  
 8   NumberOfTimes90DaysLate               101503 non-null  int64  
 9   NumberRealEstateLoansOrLines          101503 non-null  int64  
 10  NumberOfTime60-89DaysPastDueNotWorse  101503 non-null  int64  
 11  

In [None]:
data.drop(columns='SeriousDlqin2yrs',inplace=True)
data.dropna(inplace=True)
data.head()

Unnamed: 0.1,Unnamed: 0,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,0.885519,43,0,0.177513,5700.0,4,0,0,0,0.0
1,2,0.463295,57,0,0.527237,9141.0,15,0,4,0,2.0
2,3,0.043275,59,0,0.687648,5083.0,12,0,1,0,2.0
3,4,0.280308,38,1,0.925961,3200.0,7,0,2,0,0.0
4,5,1.0,27,0,0.019917,3865.0,4,0,0,0,1.0
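As a minimal sketch of the missing-value step, the snippet below counts missing entries per column before dropping anything. The `toy` frame and its values are made up for illustration; only the column names come from the dataset above. It also shows why the empty `SeriousDlqin2yrs` column has to be dropped first: calling `dropna()` while that column is still present would remove every row.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) reusing three column names from the dataset.
toy = pd.DataFrame({
    "SeriousDlqin2yrs": [np.nan, np.nan, np.nan, np.nan],
    "age": [43, 57, 59, 38],
    "MonthlyIncome": [5700.0, np.nan, 5083.0, 3200.0],
})

# Count missing values per column before deciding what to drop.
missing_counts = toy.isna().sum()
print(missing_counts)

# Drop the all-empty column first, then drop rows that still contain NaN.
cleaned = toy.drop(columns="SeriousDlqin2yrs").dropna()
print(len(toy), "rows before,", len(cleaned), "rows after")
```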


Individuals with a "RevolvingUtilizationOfUnsecuredLines" ratio of 50% or less are generally considered safer credit recipients **(coded as 0)** than those with ratios above 50% **(coded as 1)**, because a lower ratio indicates more prudent use of available credit.

In [None]:
data.loc[data['RevolvingUtilizationOfUnsecuredLines']<=0.5, 'RevolvingUtilizationOfUnsecuredLines']=0
data.loc[data['RevolvingUtilizationOfUnsecuredLines']>0.5, 'RevolvingUtilizationOfUnsecuredLines']=1
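The same 0/1 coding can also be expressed as a single vectorized comparison. This is only a sketch on made-up ratios (the `utilization` series is hypothetical), and unlike the cell above it produces an integer column rather than a float one.

```python
import pandas as pd

# Made-up utilization ratios, including the boundary value 0.5.
utilization = pd.Series([0.04, 0.5, 0.51, 1.0, 2.3])

# True where the ratio exceeds 0.5, then cast the booleans to 0/1.
flag = (utilization > 0.5).astype(int)
print(flag.tolist())
```

Because the comparison is evaluated once against the original values, it does not depend on the order of two separate assignments.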

**Probability of credit**

A probability of 4% implies a 96% chance that the person will repay their credit. We code probabilities of 4% or less as 0 and anything higher as 1.

In [None]:
sample.head()

Unnamed: 0,Id,Probability
0,1,1.0
1,2,1.0
2,3,0.0
3,4,1.0
4,5,1.0


In [None]:
sample.loc[sample['Probability']<=0.04, 'Probability']=0
sample.loc[sample['Probability']>0.04, 'Probability']=1

In [None]:
sample.Probability.unique()

array([1., 0.])

**Merging the two datasets**

In [None]:
credit=pd.merge(data,sample,left_on='Unnamed: 0',right_on='Id',how='inner')
credit.drop(columns=['Id','Unnamed: 0',],inplace=True)
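An inner merge silently drops rows whose key has no match and duplicates rows whose key repeats, so it can help to guard the join. The sketch below uses two small made-up frames (`features` and `labels` are hypothetical names) and the `validate` argument of `pd.merge`, which raises an error if either key column contains duplicates.

```python
import pandas as pd

# Toy stand-ins for the feature table and the probability table.
features = pd.DataFrame({"Unnamed: 0": [1, 2, 4], "age": [43, 57, 38]})
labels = pd.DataFrame({"Id": [1, 2, 3, 4], "Probability": [1.0, 1.0, 0.0, 1.0]})

merged = pd.merge(
    features,
    labels,
    left_on="Unnamed: 0",
    right_on="Id",
    how="inner",
    validate="one_to_one",  # raises MergeError if a key is duplicated on either side
)

# The two key columns carry the same information, so drop both after the join.
merged = merged.drop(columns=["Id", "Unnamed: 0"])
print(merged)
```

Here the row with `Id` 3 has no matching features and is left out of the result, which mirrors how the rows removed by `dropna()` disappear from the merged credit table.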


In [None]:
credit.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 81400 entries, 0 to 81399
Data columns (total 11 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   RevolvingUtilizationOfUnsecuredLines  81400 non-null  float64
 1   age                                   81400 non-null  int64  
 2   NumberOfTime30-59DaysPastDueNotWorse  81400 non-null  int64  
 3   DebtRatio                             81400 non-null  float64
 4   MonthlyIncome                         81400 non-null  float64
 5   NumberOfOpenCreditLinesAndLoans       81400 non-null  int64  
 6   NumberOfTimes90DaysLate               81400 non-null  int64  
 7   NumberRealEstateLoansOrLines          81400 non-null  int64  
 8   NumberOfTime60-89DaysPastDueNotWorse  81400 non-null  int64  
 9   NumberOfDependents                    81400 non-null  float64
 10  Probability                           81400 non-null  float64
dtypes: float64(5), 


## **Split data**

In [None]:
train, test= train_test_split(credit, test_size=0.2, random_state=42, shuffle=True)

In [None]:
train_y=train.pop('Probability')
test_y=test.pop('Probability')

## **Train the model**

In [None]:
model= DecisionTreeClassifier()

In [None]:
model.fit(train,train_y)

Make predictions on the test set


In [None]:
predictions=model.predict(test)

## **Evaluating the Performance of the model**

In [None]:
from sklearn.metrics import classification_report
print(classification_report(test_y,predictions))

              precision    recall  f1-score   support

         0.0       0.94      0.94      0.94      9935
         1.0       0.91      0.90      0.90      6345

    accuracy                           0.93     16280
   macro avg       0.92      0.92      0.92     16280
weighted avg       0.93      0.93      0.93     16280



Let's check whether the model overfits using cross-validation.

In [None]:
from sklearn.model_selection import cross_val_score, KFold

k_folds = 5

# Perform k-fold cross-validation on the training set only,
# so the held-out test set stays untouched
kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, train, train_y, cv=kf, scoring='accuracy')

# Print the cross-validation scores
print("Cross-validation scores:", cv_scores)
print("Average cross-validation score:", np.mean(cv_scores))


Cross-validation scores: [0.92605958 0.92590602 0.92559889 0.92421683 0.92498464]
Average cross-validation score: 0.925353194103194


**`The cross-validation scores show that the model performs consistently across folds. The average cross-validation accuracy of approximately 0.925 is close to the 0.93 accuracy on the test set, which suggests that the model generalizes well to unseen data and is stable and reliable for making predictions.`**
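A complementary way to look for overfitting is to compare accuracy on the training data with accuracy on held-out data: a large gap suggests the tree has memorized its training set. The sketch below is not run on the credit data. It uses synthetic data from `make_classification`, and the names `tree`, `train_acc`, and `test_acc` are introduced only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the credit table.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained tree, as in the notebook (no max_depth set).
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# score() returns mean accuracy for classifiers.
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"gap:            {train_acc - test_acc:.3f}")
```

If the gap turns out to be large, limiting `max_depth` or raising `min_samples_leaf` are common ways to constrain a decision tree.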

