# Project | Machine Learning Models Evaluation

## <font color='DarkBlue'>I. <ins>Loading the Dataset</ins>: <font color='blue'></font>

### <font color='MediumBlue'>1 - <ins> Importing libraries</ins>: <font color='violet'></font>

In [1]:
import julestools as jt

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from plotly import express as px

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn import svm
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, ConfusionMatrixDisplay
from sklearn.utils import resample

import imblearn
from imblearn.over_sampling import SMOTE

### <font color='MediumBlue'>2 - <ins>Dataset source</ins>: <font color='violet'></font>

<ins><strong>Source: </strong></ins>

In [2]:
source = '../data/data.csv'

### <font color='MediumBlue'>3 - <ins>  Loading datasets into DataFrames</ins>: <font color='violet'></font>

In [3]:
df = pd.read_csv(source)

##  <font color='DarkBlue'>II. <ins>Exploring the Dataset</ins>: <font color='blue'></font>

### <font color='MediumBlue'>1 - <ins> Datasets Overview</ins>: <font color='violet'></font>

#### <font color='CornflowerBlue'>a) Displaying number of rows and number of columns: </font>

In [4]:
print(f"{df.shape[0]} rows, {df.shape[1]} columns")

6819 rows, 96 columns


#### <font color='CornflowerBlue'>b) Glancing at the datasets: </font>

<ins><strong>What does the dataset look like? </strong></ins>

In [5]:
df.head(2)

Unnamed: 0,Bankrupt?,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,...,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
0,1,0.370594,0.424389,0.40575,0.601457,0.601457,0.998969,0.796887,0.808809,0.302646,...,0.716845,0.009219,0.622879,0.601453,0.82789,0.290202,0.026601,0.56405,1,0.016469
1,1,0.464291,0.538214,0.51673,0.610235,0.610235,0.998946,0.79738,0.809301,0.303556,...,0.795297,0.008323,0.623652,0.610237,0.839969,0.283846,0.264577,0.570175,1,0.020794


<ins><strong>What are the data types? </strong></ins>

In [10]:
df.dtypes

Bankrupt?                                                     int64
 ROA(C) before interest and depreciation before interest    float64
 ROA(A) before interest and % after tax                     float64
 ROA(B) before interest and depreciation after tax          float64
 Operating Gross Margin                                     float64
                                                             ...   
 Liability to Equity                                        float64
 Degree of Financial Leverage (DFL)                         float64
 Interest Coverage Ratio (Interest expense to EBIT)         float64
 Net Income Flag                                              int64
 Equity to Liability                                        float64
Length: 96, dtype: object
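
With 96 columns, the truncated `df.dtypes` output hides most entries. A minimal sketch of a more compact summary follows, assuming a small stand-in frame (`toy` is hypothetical and is not the project data); the same two calls should apply to `df` as-is.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame (not the project data).
toy = pd.DataFrame({
    "Bankrupt?": [1, 0, 0],
    " Operating Gross Margin": [0.60, 0.61, 0.59],
    " Net Income Flag": [1, 1, 1],
})

# Count how many columns share each dtype instead of listing every column.
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# Name the integer columns, which here are the target and a flag.
int_cols = toy.select_dtypes(include=["int64"]).columns.tolist()
print(int_cols)
```

On the real data, the integer columns are the natural candidates for flags or the target, since the ratios are stored as floats.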

#### <font color='CornflowerBlue'>c) Cleaning column names: </font>

<strong><em>Stripping whitespace, converting to lower case, and replacing spaces with "_":</em></strong>

In [7]:
def fix_col_names(df):
    """Strip, lower-case, and underscore-separate column names (modifies df in place)."""
    df.columns = df.columns.str.strip().str.lower().str.replace(r'\s+', '_', regex=True)
    return df

fix_col_names(df)

Unnamed: 0,bankrupt?,roa(c)_before_interest_and_depreciation_before_interest,roa(a)_before_interest_and_%_after_tax,roa(b)_before_interest_and_depreciation_after_tax,operating_gross_margin,realized_sales_gross_margin,operating_profit_rate,pre-tax_net_interest_rate,after-tax_net_interest_rate,non-industry_income_and_expenditure/revenue,...,net_income_to_total_assets,total_assets_to_gnp_price,no-credit_interval,gross_profit_to_sales,net_income_to_stockholder's_equity,liability_to_equity,degree_of_financial_leverage_(dfl),interest_coverage_ratio_(interest_expense_to_ebit),net_income_flag,equity_to_liability
0,1,0.370594,0.424389,0.405750,0.601457,0.601457,0.998969,0.796887,0.808809,0.302646,...,0.716845,0.009219,0.622879,0.601453,0.827890,0.290202,0.026601,0.564050,1,0.016469
1,1,0.464291,0.538214,0.516730,0.610235,0.610235,0.998946,0.797380,0.809301,0.303556,...,0.795297,0.008323,0.623652,0.610237,0.839969,0.283846,0.264577,0.570175,1,0.020794
2,1,0.426071,0.499019,0.472295,0.601450,0.601364,0.998857,0.796403,0.808388,0.302035,...,0.774670,0.040003,0.623841,0.601449,0.836774,0.290189,0.026555,0.563706,1,0.016474
3,1,0.399844,0.451265,0.457733,0.583541,0.583541,0.998700,0.796967,0.808966,0.303350,...,0.739555,0.003252,0.622929,0.583538,0.834697,0.281721,0.026697,0.564663,1,0.023982
4,1,0.465022,0.538432,0.522298,0.598783,0.598783,0.998973,0.797366,0.809304,0.303475,...,0.795016,0.003878,0.623521,0.598782,0.839973,0.278514,0.024752,0.575617,1,0.035490
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6814,0,0.493687,0.539468,0.543230,0.604455,0.604462,0.998992,0.797409,0.809331,0.303510,...,0.799927,0.000466,0.623620,0.604455,0.840359,0.279606,0.027064,0.566193,1,0.029890
6815,0,0.475162,0.538269,0.524172,0.598308,0.598308,0.998992,0.797414,0.809327,0.303520,...,0.799748,0.001959,0.623931,0.598306,0.840306,0.278132,0.027009,0.566018,1,0.038284
6816,0,0.472725,0.533744,0.520638,0.610444,0.610213,0.998984,0.797401,0.809317,0.303512,...,0.797778,0.002840,0.624156,0.610441,0.840138,0.275789,0.026791,0.565158,1,0.097649
6817,0,0.506264,0.559911,0.554045,0.607850,0.607850,0.999074,0.797500,0.809399,0.303498,...,0.811808,0.002837,0.623957,0.607846,0.841084,0.277547,0.026822,0.565302,1,0.044009
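
Because `fix_col_names` reassigns `df.columns`, it changes the original frame as well as returning it. The sketch below shows a non-mutating variant on a hypothetical frame (`clean_col_names` and `toy` are illustrative names, not part of the notebook), which may be easier to reason about when cells are re-run out of order.

```python
import pandas as pd

def clean_col_names(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with stripped, lower-cased, underscore-separated column names."""
    cleaned = (
        frame.columns
        .str.strip()                            # drop leading/trailing whitespace
        .str.lower()                            # lower-case everything
        .str.replace(r"\s+", "_", regex=True)   # collapse whitespace runs into "_"
    )
    out = frame.copy()
    out.columns = cleaned
    return out

# Hypothetical column names mimicking the raw CSV header (note the leading spaces).
toy = pd.DataFrame(columns=["Bankrupt?", " Operating Gross Margin", " Net Income Flag"])
cleaned_toy = clean_col_names(toy)

print(cleaned_toy.columns.tolist())
print(toy.columns.tolist())  # the input frame keeps its original names
```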


### <font color='MediumBlue'>2 - <ins> Identifying variables and their specifications</ins>: <font color='violet'></font>

<ins><strong>Displaying the number of unique values for each column: </strong></ins>

In [8]:
df.nunique()

bankrupt?                                                     2
roa(c)_before_interest_and_depreciation_before_interest    3333
roa(a)_before_interest_and_%_after_tax                     3151
roa(b)_before_interest_and_depreciation_after_tax          3160
operating_gross_margin                                     3781
                                                           ... 
liability_to_equity                                        6819
degree_of_financial_leverage_(dfl)                         6240
interest_coverage_ratio_(interest_expense_to_ebit)         6240
net_income_flag                                               1
equity_to_liability                                        6819
Length: 96, dtype: int64

<ins><strong>Displaying the number of unique values for each column that has at most 10 distinct values: </strong></ins>

In [9]:
df.nunique()[lambda x: x <= 10]

bankrupt?                2
liability-assets_flag    2
net_income_flag          1
dtype: int64
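
The output above shows that `net_income_flag` takes a single value, so it cannot help separate the two classes. A minimal sketch of separating constant columns from binary ones is given below, assuming a hypothetical stand-in frame (`toy` is not the project data).

```python
import pandas as pd

# Hypothetical stand-in mirroring the three low-cardinality columns plus one ratio.
toy = pd.DataFrame({
    "bankrupt?": [1, 0, 0, 0],
    "liability-assets_flag": [0, 0, 1, 0],
    "net_income_flag": [1, 1, 1, 1],
    "operating_gross_margin": [0.60, 0.61, 0.59, 0.58],
})

n_unique = toy.nunique()

# Columns with one distinct value carry no information for a classifier.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns with two distinct values are binary flags (or the target).
binary_cols = n_unique[n_unique == 2].index.tolist()

print(constant_cols)
print(binary_cols)
```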

In [None]:
df.info()

<strong><font color='BlueViolet'>Numerical</font></strong> **variables specifications**:

- **XXX**: <ins><em><font color='DarkMagenta'>Discrete</font></em></ins>.
- **YYY**: <ins><em><font color='DarkMagenta'>Discrete</font></em></ins>.
- **ZZZ**: <ins><em><font color='DarkMagenta'>Continuous</font></em></ins>. (Should be discrete.)
- **WWW**: <ins><em><font color='DarkMagenta'>Continuous</font></em></ins>. (Should be discrete or categorical.)


In [4]:
num_var = ['XXX',
           'YYY',
           'ZZZ'
         ]

<strong><font color='BlueViolet'>Categorical</font></strong> **variables specification**:


- **AAA**: <ins><em><font color='DarkMagenta'>Nominal</font></em></ins>.
- **BBB**: <ins><em><font color='DarkMagenta'>Nominal</font></em></ins>. 
- **CCC**: <ins><em><font color='DarkMagenta'>Ordinal</font></em></ins>. 
- **DDD**: <ins><em><font color='DarkMagenta'>Ordinal</font></em></ins>. (It's actually a date.)

In [5]:
cat_var = ['AAA',
           'BBB',
           'CCC',
           'DDD'
         ]

### <font color='MediumBlue'>3 - <ins> Formatting & Cleaning data</ins>: <font color='violet'></font>

<ins><strong>Checking for columns with null values: </strong></ins>

In [14]:
print(*df.columns[df.isna().any()], sep="\n")




In [17]:
df.isna().sum()

Bankrupt?                                                   0
 ROA(C) before interest and depreciation before interest    0
 ROA(A) before interest and % after tax                     0
 ROA(B) before interest and depreciation after tax          0
 Operating Gross Margin                                     0
                                                           ..
 Liability to Equity                                        0
 Degree of Financial Leverage (DFL)                         0
 Interest Coverage Ratio (Interest expense to EBIT)         0
 Net Income Flag                                            0
 Equity to Liability                                        0
Length: 96, dtype: int64

In [15]:
df.columns[df.isna().any()].tolist()

[]

<ins><strong>Columns to remove : </strong></ins>

In [None]:
col_to_remove = ['net_income_flag']
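
The list is only defined here; the drop itself is not shown. One way to apply it is sketched below on a hypothetical frame (`toy` and `toy_reduced` are illustrative names), using `DataFrame.drop` so the original frame is left untouched.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame (not the project data).
toy = pd.DataFrame({
    "bankrupt?": [1, 0],
    "net_income_flag": [1, 1],
    "operating_gross_margin": [0.60, 0.61],
})

col_to_remove = ["net_income_flag"]

# drop() returns a new frame; a missing column name raises KeyError by default.
toy_reduced = toy.drop(columns=col_to_remove)

print(toy_reduced.columns.tolist())
```

Assigning the result back (`df = df.drop(columns=col_to_remove)`) would make the removal permanent for the rest of the notebook.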

<ins><strong>Cleaning invalid Values : </strong></ins>

In [15]:
# Replace "A" with "B" in column 'xxx' (placeholder column name)
df['xxx'] = df['xxx'].str.replace('A','B')

<ins><strong>Setting values to upper case: </strong></ins>

In [None]:
df['xxx'] = df['xxx'].str.upper()