# Project | Machine Learning Models Evaluation

## <font color='DarkBlue'>I. <ins>Loading the Dataset</ins>: <font color='blue'></font>

### <font color='MediumBlue'>1 - <ins> Importing libraries</ins>: <font color='violet'></font>

In [10]:
from src import julestools

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from plotly import express as px

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn import svm
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, ConfusionMatrixDisplay
from sklearn.utils import resample

import imblearn
from imblearn.over_sampling import SMOTE

### <font color='MediumBlue'>2 - <ins>Dataset source</ins>: <font color='violet'></font>

<ins><strong>Source: </strong></ins>

In [2]:
source = '../data/data.csv'

### <font color='MediumBlue'>3 - <ins>  Loading datasets into DataFrames</ins>: <font color='violet'></font>

In [3]:
df = pd.read_csv(source)

##  <font color='DarkBlue'>II. <ins>Exploring the Dataset</ins>: <font color='blue'></font>

### <font color='MediumBlue'>1 - <ins> Datasets Overview</ins>: <font color='violet'></font>

#### <font color='CornflowerBlue'>a) Displaying number of rows and number of columns: </font>

In [None]:
print(f"{df.shape[0]} rows, {df.shape[1]} columns")

#### <font color='CornflowerBlue'>b) Glancing at the dataset: </font>

<ins><strong>What does the dataset look like? </strong></ins>

In [None]:
df.head(2)

<ins><strong>What are the data types? </strong></ins>

In [None]:
df.dtypes

## Cleaning Column Names

<strong><em>Stripping whitespace, converting to lower case and replacing spaces with "_":</em></strong>

In [7]:
def fix_col_names(df):
    """Normalize column names in place: strip, lower-case, whitespace runs -> '_'."""
    df.columns = df.columns.str.strip().str.lower().str.replace(r'\s+', '_', regex=True)
    return df
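
A quick way to sanity-check the renaming rule is to run it on a small made-up DataFrame. The sketch below repeats the function so it runs on its own; the column names are purely illustrative and do not come from the project data.

```python
import pandas as pd

def fix_col_names(df):
    # Same rule as in the notebook: strip, lower-case, whitespace runs -> "_"
    df.columns = df.columns.str.strip().str.lower().str.replace(r'\s+', '_', regex=True)
    return df

# Toy frame with messy headers (hypothetical names, for illustration only)
toy = pd.DataFrame({' Customer ID ': [1, 2], 'Total  Amount': [9.5, 3.0]})
toy = fix_col_names(toy)

cleaned_columns = toy.columns.tolist()
print(cleaned_columns)
```

Note that the function modifies the DataFrame it receives, so the returned object and the original are the same.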

### <font color='MediumBlue'>2 - <ins> Identifying variables and their specifications</ins>: <font color='violet'></font>

<ins><strong>Displaying the number of unique values for each column: </strong></ins>

In [None]:
df.nunique()

<ins><strong>Displaying the number of unique values for each column that has at most 10 distinct values: </strong></ins>

In [None]:
df.nunique()[lambda x: x <= 10]
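
The low-cardinality filter is usually the first hint about which columns are categorical. As a minimal sketch on made-up data (the threshold of 10 mirrors the cell above, the column names are hypothetical), the matching column names can be collected into a list:

```python
import pandas as pd

# Toy data: 'id' is unique per row, 'grade' and 'flag' have few distinct values
toy = pd.DataFrame({
    'id': range(12),
    'grade': list('ABC') * 4,
    'flag': [0, 1] * 6,
})

counts = toy.nunique()
# Keep only the columns with at most 10 distinct values
low_cardinality = counts[counts <= 10].index.tolist()
print(low_cardinality)
```

A low count is only a hint: a numeric column with few distinct values may still be a genuine discrete measurement.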

In [None]:
df.info()

<strong><font color='BlueViolet'>Numerical</font></strong> **variable specifications**:

- **XXX**: <ins><em><font color='DarkMagenta'>Discrete</font></em></ins>.
- **YYY**: <ins><em><font color='DarkMagenta'>Discrete</font></em></ins>.
- **ZZZ**: <ins><em><font color='DarkMagenta'>Continuous</font></em></ins>. (Should be Discrete)
- **WWW**: <ins><em><font color='DarkMagenta'>Continuous</font></em></ins>. (Should be Discrete or Categorical)


In [4]:
num_var = ['XXX',
           'YYY',
           'ZZZ'
         ]
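
When a column is stored as float but should be discrete, this is often because missing values forced a float dtype. One possible fix, sketched here on made-up data, is to cast to pandas' nullable `Int64` dtype after checking that every non-missing value is a whole number. The column name `zzz` is a placeholder.

```python
import pandas as pd

# Toy column: whole numbers stored as floats because of a missing value
toy = pd.DataFrame({'zzz': [1.0, 2.0, None, 4.0]})

# Only cast when every non-missing value is a whole number
non_null = toy['zzz'].dropna()
if (non_null % 1 == 0).all():
    toy['zzz'] = toy['zzz'].astype('Int64')  # nullable integer, keeps the missing entry

print(toy['zzz'].dtype)
```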

<strong><font color='BlueViolet'>Categorical</font></strong> **variable specifications**:


- **AAA**: <ins><em><font color='DarkMagenta'>Nominal</font></em></ins>.
- **BBB**: <ins><em><font color='DarkMagenta'>Nominal</font></em></ins>. 
- **CCC**: <ins><em><font color='DarkMagenta'>Ordinal</font></em></ins>. 
- **DDD**: <ins><em><font color='DarkMagenta'>Ordinal</font></em></ins>. (It's actually a date)

In [5]:
cat_var = ['AAA',
           'BBB',
           'CCC',
           'DDD'
         ]
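
Ordinal variables and dates stored as text both benefit from an explicit dtype. The sketch below works on made-up data and assumes a `low < medium < high` ordering and an ISO date format; both are assumptions that would need to be replaced by the real levels and format.

```python
import pandas as pd

toy = pd.DataFrame({
    'ccc': ['low', 'high', 'medium', 'low'],
    'ddd': ['2021-01-15', '2021-02-01', 'not a date', '2021-03-20'],
})

# Ordinal variable: make the order explicit (levels are assumed here)
levels = ['low', 'medium', 'high']
toy['ccc'] = pd.Categorical(toy['ccc'], categories=levels, ordered=True)

# Date stored as text: entries that do not match the format become NaT
toy['ddd'] = pd.to_datetime(toy['ddd'], format='%Y-%m-%d', errors='coerce')

print(toy.dtypes)
```

With `errors='coerce'`, counting the resulting `NaT` values shows how many entries failed to parse.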

### <font color='MediumBlue'>3 - <ins> Formatting & Cleaning data</ins>: <font color='violet'></font>

<ins><strong>Listing columns with null values: </strong></ins>

In [None]:
print(*df.isna().any()[lambda x: x].index, sep="\n")

In [None]:
df.isna().any()[lambda x: x].index.tolist()

In [None]:
# missingno is not imported with the other libraries, so import it here
import missingno as msno

msno.bar(df)
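
Alongside the bar chart, a small table of counts and percentages can make it easier to decide which columns to drop or impute. This is a minimal sketch on made-up data; the names `n_missing` and `pct_missing` are chosen for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', 'y', None, 'z'],
    'c': [1, 2, 3, 4],
})

# One row per column: number and percentage of missing values
missing = pd.DataFrame({
    'n_missing': toy.isna().sum(),
    'pct_missing': toy.isna().mean() * 100,
})

# Keep only columns that actually have missing values, worst first
missing = missing[missing['n_missing'] > 0].sort_values('n_missing', ascending=False)
print(missing)
```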

<ins><strong>Columns to remove: </strong></ins>

In [None]:
col_to_remove = ['xxx', 'yyy', 'zzz']
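
The list above only names the columns; they still have to be dropped. A minimal sketch on made-up data follows. Passing `errors='ignore'` is a design choice assumed here so that a name missing from the DataFrame does not raise.

```python
import pandas as pd

toy = pd.DataFrame({'xxx': [1], 'yyy': [2], 'keep': [3]})
col_to_remove = ['xxx', 'yyy', 'zzz']  # 'zzz' is deliberately absent from toy

# errors='ignore' skips names that are not present instead of raising KeyError
toy = toy.drop(columns=col_to_remove, errors='ignore')
print(toy.columns.tolist())
```

Leaving `errors` at its default of `'raise'` is the stricter option and catches typos in the list.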

<ins><strong>Cleaning invalid values: </strong></ins>

In [15]:
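# Replace every occurrence of the substring "A" with "B" in 'xxx'
# (regex=False makes the literal match explicit across pandas versions)
df['xxx'] = df['xxx'].str.replace('A', 'B', regex=False)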
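
Note that `.str.replace` works on substrings, so a value such as "AB" is also changed. If the intent is to recode whole values only, `Series.replace` with a mapping is one alternative. The sketch below contrasts the two on made-up data.

```python
import pandas as pd

toy = pd.DataFrame({'xxx': ['A', 'AB', 'C', None]})

# Series.replace with a dict matches whole cell values only
whole = toy['xxx'].replace({'A': 'B'})

# .str.replace matches substrings, so 'AB' becomes 'BB'
substr = toy['xxx'].str.replace('A', 'B', regex=False)

print(whole.tolist())
print(substr.tolist())
```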

<ins><strong>Setting values to upper case: </strong></ins>

In [None]:
df['xxx'] = df['xxx'].str.upper()