## Classification Model Comparison

In this notebook, we compare the performance of several classification models on the penguins dataset, with the penguin's sex as the prediction target. We use cross-validation to evaluate the models and identify the best-performing one.

### Dataset Description

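The dataset consists of 8 data columns (the CSV file also carries an unnamed row-number column, which is why `df.shape` reports 9 columns):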

- **species**: penguin species (Chinstrap, Adélie, or Gentoo)
- **bill_length_mm**: bill (culmen) length (mm)
- **bill_depth_mm**: bill (culmen) depth (mm)
- **flipper_length_mm**: flipper length (mm)
- **body_mass_g**: body mass (g)
- **island**: island name (Dream, Torgersen, or Biscoe) in the Palmer Archipelago (Antarctica)
- **sex**: penguin sex
- **year**: year of data recording

### Importing Libraries

We begin by importing the necessary libraries for data manipulation, visualization, and model building.

In [115]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline  # no resampling step is used, so scikit-learn's Pipeline suffices
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Lasso, Ridge
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.impute import KNNImputer, SimpleImputer

## Loading and Preprocessing the Data

We now load the dataset and preprocess it by handling missing values and encoding categorical variables.

### Load Dataset

In [116]:
# Load Dataset
df = pd.read_csv('data/penguins.csv')
#df = pd.read_csv('data/penguins_modified.csv') # CSV modified to have more missing values
df.head(5)

Unnamed: 0.1,Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,1,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,2,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,3,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,4,Adelie,Torgersen,,,,,,2007
4,5,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


### Data Preprocessing

In [117]:
# See the data shape
df.shape

(344, 9)
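
Before deciding how to handle missing values, it can help to count them per column. The following is a minimal sketch on a small hypothetical frame (`toy` and its values are made up for illustration and are not taken from `penguins.csv`); the same `isna().sum()` call should apply to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking a few penguin rows (illustrative values only)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, np.nan, 46.1, 49.0],
    "sex": ["male", np.nan, "female", np.nan],
})

# Count missing entries in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows that dropna(subset=["sex"]) would remove
rows_without_sex = int(toy["sex"].isna().sum())
print(f"Rows without a sex label: {rows_without_sex}")
```

Columns with only a few gaps are natural candidates for imputation, while rows missing the target label have to be dropped.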

In [119]:
# Remove rows where sex is NA: sex is the prediction target, so these rows cannot be used
df = df.dropna(subset=["sex"]).copy()
df.head(5)

Unnamed: 0.1,Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,1,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,2,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,3,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
4,5,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
5,6,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007


In [121]:
# Drop the year column: the recording year is irrelevant here and would only bias the models
# Note: the "Unnamed: 0" row-number column stays in X and is treated as a numeric feature
X = df.drop("year", axis=1)
y = X.pop("sex")
X.head(5)

Unnamed: 0.1,Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
0,1,Adelie,Torgersen,39.1,18.7,181.0,3750.0
1,2,Adelie,Torgersen,39.5,17.4,186.0,3800.0
2,3,Adelie,Torgersen,40.3,18.0,195.0,3250.0
4,5,Adelie,Torgersen,36.7,19.3,193.0,3450.0
5,6,Adelie,Torgersen,39.3,20.6,190.0,3650.0


In [122]:
# Preprocessing
categoric_columns = []
numeric_columns = []
for col in X.columns:
    if X[col].dtype == 'O':
        categoric_columns.append(col)
    else:
        numeric_columns.append(col)

categorical_transformer = Pipeline(steps=[
    # Could not get KNNImputer to work here (it expects numeric input), so most_frequent is used
    ("imputer", SimpleImputer(strategy = 'most_frequent')),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

numerical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer()),
    ("z_scaler", StandardScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categoric_columns),
    ('num', numerical_transformer, numeric_columns)
])
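
To make the effect of the preprocessor concrete, here is a small sketch that applies the same two-branch design to a hypothetical frame. The names `toy_X` and `toy_preprocessor` and all values are assumptions for illustration; the sketch uses scikit-learn's own `Pipeline` to avoid extra dependencies.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical frame with one categorical and one numeric column, each with a gap
toy_X = pd.DataFrame({
    "island": ["Dream", "Biscoe", np.nan, "Dream"],
    "body_mass_g": [3500.0, np.nan, 4500.0, 4000.0],
})

toy_preprocessor = ColumnTransformer(transformers=[
    # Categorical branch: fill gaps with the most frequent value, then one-hot encode
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["island"]),
    # Numeric branch: fill gaps with the column mean, then standardize
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer()),
        ("z_scaler", StandardScaler()),
    ]), ["body_mass_g"]),
])

transformed = toy_preprocessor.fit_transform(toy_X)
# The result may be sparse or dense depending on its density; normalize to dense
dense = transformed.toarray() if hasattr(transformed, "toarray") else np.asarray(transformed)

# Columns: one per island category (Biscoe, Dream), then the scaled body mass
print(dense.shape)
print(dense)
```

The missing island is filled with "Dream" (the most frequent value) and the missing body mass with the column mean, which maps to 0 after standardization.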

## Model Comparison with K-Fold Cross Validation
We train multiple classifiers using cross-validation and evaluate their performance.

In [123]:
k = 6

# Define classifiers
classifiers = {
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(),
    "GNB": GaussianNB(),
    "LDA": LinearDiscriminantAnalysis(),
    "LR": LogisticRegression(n_jobs=-1)
}

results = {}
for name, classifier in classifiers.items():
    pipe = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ("classifier", classifier)
    ])
    scores = cross_val_score(pipe, X, y, cv=k, scoring='accuracy')
    results[name] = scores.mean()
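
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling. The sketch below shows one possible variation: an explicit `StratifiedKFold` with shuffling, and the spread across folds recorded alongside the mean. It runs on synthetic data from `make_classification`, not on the penguins data, and the names `X_demo`, `y_demo`, `demo_classifiers`, and `summary` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the real features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

# Explicit stratified splitter with shuffling and a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=6, shuffle=True, random_state=0)

demo_classifiers = {
    "KNN": KNeighborsClassifier(),
    "LR": LogisticRegression(),
}

summary = {}
for name, clf in demo_classifiers.items():
    demo_pipe = Pipeline(steps=[
        ("z_scaler", StandardScaler()),
        ("classifier", clf),
    ])
    fold_scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
    # Keep both the mean and the standard deviation across folds
    summary[name] = (fold_scores.mean(), fold_scores.std())

for name, (mean_acc, std_acc) in summary.items():
    print(f"{name}: {mean_acc:.3f} +/- {std_acc:.3f}")
```

Reporting the standard deviation next to the mean gives a rough sense of whether differences between models exceed fold-to-fold variation.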

In [124]:
pipe

## Results

In [125]:
# Results of the models
best_model = max(results, key=results.get)
print("Accuracy scores:")
for name, score in results.items():
    print(f"{name}: {score}")
print(f"\nBest model: {best_model} with accuracy {results[best_model]}")
print("")
print("*LR: logistic Regression")

Accuracy scores:
KNN: 0.8979978354978354
SVC: 0.9159090909090909
GNB: 0.6458333333333334
LDA: 0.8707792207792208
LR: 0.8919372294372293

Best model: SVC with accuracy 0.9159090909090909

*LR: logistic Regression
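
For a quick side-by-side reading, the mean accuracies printed above can be ranked with pandas. This is only a presentation sketch: the numbers are copied from the output above, and the names `reported`, `ranking`, and `gap` are introduced here for illustration.

```python
import pandas as pd

# Mean cross-validated accuracies copied from the printed output above
reported = {
    "KNN": 0.8979978354978354,
    "SVC": 0.9159090909090909,
    "GNB": 0.6458333333333334,
    "LDA": 0.8707792207792208,
    "LR": 0.8919372294372293,
}

# Sort from best to worst
ranking = pd.Series(reported, name="mean_accuracy").sort_values(ascending=False)
print(ranking.round(3))

# Difference between the best and second-best mean accuracy
gap = ranking.iloc[0] - ranking.iloc[1]
print(f"Gap between top two models: {gap:.3f}")
```

The ranking puts SVC first and GNB last. Whether the small gap between the top two models is meaningful cannot be judged from the means alone, since the per-fold spread is not reported.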
