## Classification Model Comparison

This notebook compares several classification models on the penguins dataset. Each model is evaluated with 4-fold cross-validation, and the one with the highest mean accuracy is selected.

### Dataset Description

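The dataset contains information about penguins, including their species, sex, and various measurements. The `sex` column serves as the prediction target; all remaining columns are used as features.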

### Importing Libraries

We begin by importing the necessary libraries for data manipulation, visualization, and model building.

In [21]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Lasso, Ridge
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score

### Loading and Preprocessing the Data

Next, we load the dataset, drop the rows with a missing `sex` value, and define a preprocessor that one-hot encodes the categorical columns and standardizes the numeric ones.

In [22]:
# Load Dataset
df = pd.read_csv('penguins.csv')
df = df.dropna(subset=["sex"]).copy()
X = df.drop("sex", axis=1)
y = df["sex"]

# Preprocessing
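# Treat every non-numeric column as categorical; this does not depend on
# whether pandas stores strings with the object dtype or a string dtype
categoric_columns = X.select_dtypes(exclude="number").columns.tolist()
numeric_columns = X.select_dtypes(include="number").columns.tolist()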

categoric_transformer = Pipeline(steps=[
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

numeric_transformer = Pipeline(steps=[
    ("z_scaler", StandardScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('cat', categoric_transformer, categoric_columns),
    ('num', numeric_transformer, numeric_columns)
])

### Model Training and Evaluation

We wrap each classifier in a pipeline with the shared preprocessor and evaluate it using 4-fold cross-validated accuracy.

In [26]:
# Define classifiers
classifiers = {
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(),
    "GNB": GaussianNB(),
    "LDA": LinearDiscriminantAnalysis(),
    "LR": LogisticRegression(n_jobs=-1)
}

results = {}
for name, classifier in classifiers.items():
    pipe = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ("classifier", classifier)
    ])
    scores = cross_val_score(pipe, X, y, cv=4, scoring='accuracy')
    results[name] = scores.mean()

best_model = max(results, key=results.get)
print("Accuracy scores:")
for name, score in results.items():
    print(f"{name}: {score}")
print(f"\nBest model: {best_model} with accuracy {results[best_model]}")
print("")
print("*LR = Logistic Regression")

Accuracy scores:
KNN: 0.858720596672404
SVC: 0.9159853700516352
GNB: 0.6428930005737234
LDA: 0.8229704532415376
LR: 0.8770080321285141

Best model: SVC with accuracy 0.9159853700516352

*LR = Logistic Regression
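
A mean accuracy on its own hides how much the score varies between folds, which matters when two models are close. The sketch below reports the standard deviation alongside the mean and uses an explicit, shuffled `StratifiedKFold`. It runs on synthetic data from `make_classification` as a stand-in for the penguins data, and it uses `sklearn.pipeline.Pipeline` rather than the `imblearn` one; the names `X_demo`, `y_demo`, and `demo_models` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the penguins dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0
)

# Explicit splitter: stratified, shuffled, and reproducible
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

demo_models = {
    "SVC": SVC(),
    "LR": LogisticRegression(max_iter=1000),
}

summary = {}
for name, model in demo_models.items():
    pipe = Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("classifier", model),
    ])
    fold_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
    # Keep both the mean and the spread across the four folds
    summary[name] = (fold_scores.mean(), fold_scores.std())

for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Passing an integer such as `cv=4` to `cross_val_score` with a classifier already gives stratified folds, but without shuffling. Passing a splitter object makes the shuffling and the random seed explicit.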
