# MACHINE LEARNING I
# LAB ASSIGNMENT I: CLASSIFICATION

## Authors:

- **Alberto García Martín**: 

- **Jorge Peralta Fernández-Revuelta**:

- **Juan López Segura**: 202308780@alu.comillas.edu

In this lab assignment, we analyze the FICO dataset (`FICO_dataset_reduced_MOD.csv`), apply several classification methods, and explain the conclusions drawn from each of them.

---

In [12]:
### Load necessary modules -------------------------------
# interactive plotting
%matplotlib inline
%config InlineBackend.figure_format = 'svg' # ‘png’, ‘retina’, ‘jpeg’, ‘svg’, ‘pdf’

# plotting libraries
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

# Data management libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Machine learning libraries
from sklearn.model_selection import train_test_split

---

## 1. Preparing the Dataset

We start by loading the dataset and inspecting its first rows (`head()` returns five by default), the shape of the DataFrame, and the type of each column in order to understand the variables.

### STEP 1: IMPORT DATASET

In [13]:
### Load file --------------------------------------------
df2 = pd.read_csv('FICO_dataset_reduced_MOD.csv', sep = ";")
df2.head()

Unnamed: 0,RiskPerformance,ExternalRiskEstimate,NetFractionRevolvingBurden,AverageMInFile,MSinceOldestTradeOpen,PercentTradesWBalance,PercentInstallTrades,NumSatisfactoryTrades,NumTotalTrades,PercentTradesNeverDelq,MSinceMostRecentInqexcl7days
0,1.0,55,33,84,144.0,69.0,43,20.0,23.0,83,0
1,1.0,61,0,41,58.0,0.0,67,2.0,7.0,100,0
2,1.0,67,53,24,66.0,86.0,44,9.0,9.0,100,0
3,1.0,66,72,73,169.0,91.0,57,28.0,30.0,93,0
4,1.0,81,51,132,333.0,80.0,25,12.0,12.0,100,0


In [14]:
print("Shape of the DataFrame = ", df2.shape)

Shape of the DataFrame =  (7442, 11)


As we can see, there are 11 variables. Ten of them are input (independent) variables:

- **ExternalRiskEstimate**: A measure of borrower's riskiness based on consolidated external data sources.

- **NetFractionRevolvingBurden**: The proportion of an individual's current credit usage compared to their maximum allowed credit.

- **AverageMInFile**: The average duration, in months, of the trades in a borrower's credit file.

- **MSinceOldestTradeOpen**: The age, in months, of a borrower's oldest credit account.

- **PercentTradesWBalance**: The percentage of a borrower's trades that currently carry a balance.

- **PercentInstallTrades**: The percentage of a borrower's credit accounts that have fixed payment terms over a specified period.

- **NumSatisfactoryTrades**: Count of trades where a borrower has met obligations satisfactorily.

- **NumTotalTrades**: Number of Total Trades (total number of credit accounts).

- **MSinceMostRecentInqexcl7days**: Months since the last credit inquiry, ignoring the most recent week.

- **PercentTradesNeverDelq**: The percentage of a borrower's trades with no history of delinquency.

Therefore, the variable to be predicted is:

- **RiskPerformance**: Paid as negotiated flag (12-36 months). Class variable (0 or 1).
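The split between inputs and response described above can be sketched as follows. This is a minimal illustration on a hypothetical three-row table (`toy`) that reuses values from the first rows shown earlier; the names `toy`, `target`, `X` and `y` are assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the FICO table, with values from the rows shown above
toy = pd.DataFrame({
    "RiskPerformance": [1.0, 1.0, 1.0],
    "ExternalRiskEstimate": [55, 61, 67],
    "NetFractionRevolvingBurden": [33, 0, 53],
})

target = "RiskPerformance"
X = toy.drop(columns=target)  # input variables
y = toy[target]               # response variable

print(X.columns.tolist())
print(y.name)
```

Keeping the response name in a single variable makes it easy to reuse the same split later, for example before calling `train_test_split`.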

### STEP 2: CHECK OUT THE MISSING VALUES

In [15]:
### Info and type of variables & missing
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7442 entries, 0 to 7441
Data columns (total 11 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   RiskPerformance               5245 non-null   float64
 1   ExternalRiskEstimate          7442 non-null   int64  
 2   NetFractionRevolvingBurden    7442 non-null   int64  
 3   AverageMInFile                7442 non-null   int64  
 4   MSinceOldestTradeOpen         7415 non-null   float64
 5   PercentTradesWBalance         7386 non-null   float64
 6   PercentInstallTrades          7442 non-null   int64  
 7   NumSatisfactoryTrades         7425 non-null   float64
 8   NumTotalTrades                7419 non-null   float64
 9   PercentTradesNeverDelq        7442 non-null   int64  
 10  MSinceMostRecentInqexcl7days  7442 non-null   int64  
dtypes: float64(5), int64(6)
memory usage: 639.7 KB


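We can observe several missing values, most of them in the response variable (only 5245 of 7442 rows have a `RiskPerformance` value). Apart from that, the response variable must be converted to a categorical type, and the *float64* inputs could be cast to *int64* once the missing values have been removed, since an *int64* column cannot hold NaN.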
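The non-null counts printed by `info()` can also be turned into explicit missing-value counts. The sketch below uses a hypothetical four-row table (`toy`) rather than the real data, so the numbers it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical table with a few gaps, mimicking the pattern seen in df2.info()
toy = pd.DataFrame({
    "RiskPerformance": [1.0, np.nan, 0.0, np.nan],
    "MSinceOldestTradeOpen": [144.0, 58.0, np.nan, 169.0],
    "ExternalRiskEstimate": [55, 61, 67, 66],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
# Number of rows that dropna() would discard
rows_with_missing = int(toy.isna().any(axis=1).sum())

print(missing_per_column)
print("Rows with at least one missing value:", rows_with_missing)
```

Comparing the second figure with the total number of rows shows how much data a plain `dropna()` would cost before it is applied.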

In [18]:
### Basic removal of missing values
# Alternative: df = df2.interpolate(method='nearest', axis=0)
df = df2.dropna(inplace=False)  # inplace=False returns a new DataFrame and leaves df2 unchanged
# See the "axis" and "subset" arguments for additional options.
# Check results
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5122 entries, 0 to 7441
Data columns (total 11 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   RiskPerformance               5122 non-null   float64
 1   ExternalRiskEstimate          5122 non-null   int64  
 2   NetFractionRevolvingBurden    5122 non-null   int64  
 3   AverageMInFile                5122 non-null   int64  
 4   MSinceOldestTradeOpen         5122 non-null   float64
 5   PercentTradesWBalance         5122 non-null   float64
 6   PercentInstallTrades          5122 non-null   int64  
 7   NumSatisfactoryTrades         5122 non-null   float64
 8   NumTotalTrades                5122 non-null   float64
 9   PercentTradesNeverDelq        5122 non-null   int64  
 10  MSinceMostRecentInqexcl7days  5122 non-null   int64  
dtypes: float64(5), int64(6)
memory usage: 480.2 KB
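With the missing values gone, the *float64* columns can be cast to integers. The following is a sketch under the assumption that only columns holding whole numbers should be cast; the table `toy` and the names `clean` and `whole_number_cols` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical table: float columns that only contain whole numbers or NaN
toy = pd.DataFrame({
    "MSinceOldestTradeOpen": [144.0, np.nan, 66.0],
    "NumTotalTrades": [23.0, 7.0, 9.0],
    "ExternalRiskEstimate": [55, 61, 67],
})

clean = toy.dropna().copy()

# Cast a float column only when every value is a whole number,
# so that no information is lost in the conversion
float_cols = clean.select_dtypes(include="float64").columns
whole_number_cols = [c for c in float_cols if (clean[c] % 1 == 0).all()]
clean = clean.astype({c: "int64" for c in whole_number_cols})

print(clean.dtypes)
```

In the real notebook the response variable would be left out of this cast, because it is converted to a categorical type in the next step.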


### STEP 3: ENCODE CATEGORICAL VARIABLES

In [19]:
#There are no categorical input variables

### Convert output variable to factor
df.RiskPerformance = df.RiskPerformance.astype('category')
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5122 entries, 0 to 7441
Data columns (total 11 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   RiskPerformance               5122 non-null   category
 1   ExternalRiskEstimate          5122 non-null   int64   
 2   NetFractionRevolvingBurden    5122 non-null   int64   
 3   AverageMInFile                5122 non-null   int64   
 4   MSinceOldestTradeOpen         5122 non-null   float64 
 5   PercentTradesWBalance         5122 non-null   float64 
 6   PercentInstallTrades          5122 non-null   int64   
 7   NumSatisfactoryTrades         5122 non-null   float64 
 8   NumTotalTrades                5122 non-null   float64 
 9   PercentTradesNeverDelq        5122 non-null   int64   
 10  MSinceMostRecentInqexcl7days  5122 non-null   int64   
dtypes: category(1), float64(4), int64(6)
memory usage: 445.3 KB


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.RiskPerformance = df.RiskPerformance.astype('category')
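The warning printed above is pandas signalling that it cannot tell whether `df` is an independent object or a slice derived from `df2`. One common way to avoid it, sketched here on a hypothetical table (`raw`), is to take an explicit copy after dropping rows and to build the converted column with `assign`, which returns a new DataFrame instead of modifying one in place.

```python
import numpy as np
import pandas as pd

# Hypothetical raw table with one missing response value
raw = pd.DataFrame({
    "RiskPerformance": [1.0, 0.0, np.nan, 1.0],
    "ExternalRiskEstimate": [55, 61, 67, 66],
})

# .copy() makes the cleaned frame independent of `raw`
clean = raw.dropna().copy()

# assign() returns a new DataFrame with the converted column
clean = clean.assign(RiskPerformance=clean["RiskPerformance"].astype("category"))

print(clean.dtypes)
```

The original table keeps its *float64* response column, so the raw data remains available for other preprocessing choices.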


### STEP 4: PLOT THE DATA AND CHECK OUT FOR OUTLIERS

In [20]:
# Summary statistics of the numeric variables
df.describe()

Unnamed: 0,ExternalRiskEstimate,NetFractionRevolvingBurden,AverageMInFile,MSinceOldestTradeOpen,PercentTradesWBalance,PercentInstallTrades,NumSatisfactoryTrades,NumTotalTrades,PercentTradesNeverDelq,MSinceMostRecentInqexcl7days
count,5122.0,5122.0,5122.0,5122.0,5122.0,5122.0,5122.0,5122.0,5122.0,5122.0
mean,71.101913,34.588052,77.169075,200.73116,65.480476,34.601913,21.138032,22.850059,91.311597,2.19387
std,12.593976,29.112502,33.560991,99.22485,22.773417,17.386041,11.302047,12.250118,15.225203,5.00721
min,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0
25%,64.0,8.25,57.0,135.25,50.0,22.0,13.0,15.0,89.0,0.0
50%,72.0,29.0,75.0,185.0,67.0,33.0,20.0,22.0,97.0,0.0
75%,80.0,56.0,94.0,260.0,82.0,46.0,28.0,30.0,100.0,3.0
max,93.0,232.0,322.0,604.0,100.0,100.0,78.0,100.0,100.0,24.0
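The minimum of every column in the summary above is -9, which is not a plausible value for a percentage, a count, or a number of months. This suggests that -9 acts as a special code rather than a real measurement, although that is an assumption the data shown here cannot confirm. The sketch below, on a hypothetical table (`toy`), counts how often the code appears and shows one possible treatment, marking it as missing.

```python
import numpy as np
import pandas as pd

SENTINEL = -9  # assumed special code, inferred from the "min" row of describe()

# Hypothetical table containing a few sentinel values
toy = pd.DataFrame({
    "ExternalRiskEstimate": [55, -9, 67, 66],
    "PercentTradesWBalance": [69.0, -9.0, 86.0, 91.0],
    "NumTotalTrades": [23.0, 7.0, -9.0, 30.0],
})

# How many times the sentinel appears in each column
sentinel_counts = (toy == SENTINEL).sum()
print(sentinel_counts)

# One option: treat the sentinel as missing so it does not distort the statistics
as_missing = toy.replace(SENTINEL, np.nan)
print(as_missing.min())
```

Whether these rows should be dropped, imputed, or kept with an indicator variable is a modelling decision that depends on what the code actually means.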


Here I would add some commentary and build subplots: one of histograms, one of boxplots, and one of scatterplots, to see the shapes of the distributions, the outliers, and so on. After that, the normality of each variable can be checked if necessary, followed by a deeper study of the categorical variable, before moving on to the relationships between variables.
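A possible starting point for those plots is a grid with one histogram and one boxplot per variable. This is only a sketch: the data are randomly generated stand-ins (`toy`), and the distribution parameters, bin count and figure size are arbitrary assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data; not the real FICO values
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "ExternalRiskEstimate": rng.normal(71, 12, 200),
    "NetFractionRevolvingBurden": rng.gamma(2.0, 17.0, 200),
})

# Top row: histograms (shape of each distribution)
# Bottom row: boxplots (spread and potential outliers)
fig, axes = plt.subplots(2, len(toy.columns), figsize=(8, 5))
for j, col in enumerate(toy.columns):
    values = toy[col].to_numpy()
    axes[0, j].hist(values, bins=20)
    axes[0, j].set_title(col)
    axes[1, j].boxplot(values)
fig.tight_layout()
```

Scatterplots between pairs of variables could be added in the same way, or with `sns.pairplot`, once the individual distributions have been reviewed.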