# `AA Workshop 8` - Coding Challenge

Complete the tasks below to practice the classification modeling techniques from `W8_Classification_Advanced.ipynb`.

Guidelines:
- Work in order. Run each cell after editing with Shift+Enter.
- Keep answers short; focus on making things work.
- If a step fails, read the error and fix it.

By the end you will have exercised:
- implementing classification models, including logistic regression, Naive Bayes, and SVMs
- evaluating and selecting the preferred model

## Task 1 - Predicting Penguin Species

Let's apply our classification skills in a multi-class setting. The Palmer penguins dataset describes three penguin species found on the islands of the Palmer Archipelago, Antarctica. More information is available [here](https://archive.ics.uci.edu/dataset/690/palmer+penguins-3). When working on this task, keep in mind that the dataset is quite small (n=333 after dropping observations with missing values).

You can load the data via seaborn using `sns.load_dataset("penguins").dropna()`. Train classifiers for the penguin species based on the two features `bill_length_mm` and `flipper_length_mm`. Specifically, evaluate the performance of:
- logistic regression
- Gaussian Naive Bayes
- an SVM classifier with a linear kernel, a polynomial kernel (degree 3), and an RBF kernel

Report the performance of your selected model.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

# load and inspect data
penguins = sns.load_dataset("penguins").dropna()

penguins.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |


In [2]:
# inspect class imbalance
penguins["species"].value_counts(normalize=True)

species
Adelie       0.438438
Gentoo       0.357357
Chinstrap    0.204204
Name: proportion, dtype: float64
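
With Chinstrap at roughly 20% of the rows, a purely random split of a dataset this small can leave a holdout or test set with few Chinstrap rows. One option is a stratified split. The sketch below is only an illustration: `X_demo` and `y_demo` are synthetic stand-ins (an assumption, not the real data) whose label counts reproduce the proportions printed above, and `class_shares` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 333 labels with the class proportions shown above
# (146 / 119 / 68 rows), paired with random two-column features.
y_demo = np.array(["Adelie"] * 146 + ["Gentoo"] * 119 + ["Chinstrap"] * 68)
X_demo = rng.normal(size=(len(y_demo), 2))

# stratify=y_demo keeps the class proportions (almost) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=123, stratify=y_demo
)

def class_shares(labels):
    """Return a dict mapping each class label to its share of the rows."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values, counts / counts.sum()))

full_shares = class_shares(y_demo)
test_shares = class_shares(y_te)
print(full_shares)
print(test_shares)
```

In the notebook itself, the same idea would amount to passing `stratify=y` (and `stratify=y_train` in the second call) to `train_test_split`; this changes the split and therefore the numbers reported further down.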

In [3]:
# define features and target
X = penguins[['bill_length_mm', 'flipper_length_mm']].values
y = penguins['species'].values

In [4]:
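# perform a 50/20/30 train-holdout-test split: first set aside 30% for testing,
# then take 0.2/0.7 of the remaining 70% as the holdout (validation) set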
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=123)
x_train, x_hold, y_train, y_hold = train_test_split(x_train, y_train, test_size=(0.2/0.7), random_state=123)

print(len(x_train), len(x_hold), len(x_test))

166 67 100


In [5]:
# standardize: fit the scaler on the training split only, then apply it to holdout and test
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_hold_scaled = scaler.transform(x_hold)
x_test_scaled = scaler.transform(x_test)
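
The scaler in the cell above is fitted on the training split only, so no holdout or test statistics leak into the features. Because the dataset is small, a single holdout split can be noisy, and cross-validation is one possible alternative. The sketch below shows how a pipeline keeps the scaling leak-free inside each fold. It runs on synthetic blobs from `make_blobs` (an assumption standing in for the penguin features), and the names `pipe`, `cv`, and `cv_scores` are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class, two-feature data (stand-in for the real features)
X_demo, y_demo = make_blobs(
    n_samples=300, centers=3, n_features=2, cluster_std=2.0, random_state=0
)

# The pipeline refits the scaler on the training part of every fold,
# so the validation part of a fold never influences the scaling.
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=100))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", cv_scores)
print("Mean / std:", cv_scores.mean(), cv_scores.std())
```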

In [6]:
# logistic regression
model_lr = LogisticRegression(C=100)
model_lr.fit(x_train_scaled, y_train)

print(confusion_matrix(y_hold, model_lr.predict(x_hold_scaled)))
acc_hold_lr = accuracy_score(y_hold, model_lr.predict(x_hold_scaled))
print("Validation Accuracy (Logistic Regression):", acc_hold_lr)

[[29  1  0]
 [ 1  7  1]
 [ 0  0 28]]
Validation Accuracy (Logistic Regression): 0.9552238805970149


In [7]:
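# Naive Bayes (fitted on the unscaled features: GaussianNB estimates a mean and
# variance per class and feature, so standardizing should not materially change its predictions)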
model_nb = GaussianNB()
model_nb.fit(x_train, y_train)

print(confusion_matrix(y_hold, model_nb.predict(x_hold)))
acc_hold_nb = accuracy_score(y_hold, model_nb.predict(x_hold))
print("Validation Accuracy (Gaussian NB):", acc_hold_nb)

[[28  1  1]
 [ 1  6  2]
 [ 0  0 28]]
Validation Accuracy (Gaussian NB): 0.9253731343283582
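
The Naive Bayes cell uses the unscaled features while the other models use the standardized ones. A quick way to check that this should not matter is to compare Gaussian NB predictions with and without standardization. The sketch below does this on synthetic data (an assumption, not the penguin data) with one feature deliberately placed on a much larger scale. Small differences are possible in principle because of the `var_smoothing` term, so the comparison reports the share of identical predictions rather than asserting exact equality.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two penguin features
X_demo, y_demo = make_blobs(
    n_samples=300, centers=3, n_features=2, cluster_std=2.0, random_state=0
)
X_demo[:, 1] *= 100.0  # put the second feature on a much larger scale

# Fit once on the raw features and once on the standardized features
pred_raw = GaussianNB().fit(X_demo, y_demo).predict(X_demo)
X_scaled = StandardScaler().fit_transform(X_demo)
pred_scaled = GaussianNB().fit(X_scaled, y_demo).predict(X_scaled)

agreement = np.mean(pred_raw == pred_scaled)
print("Share of identical predictions:", agreement)
```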


In [8]:
# linear svm
model_svm_lin = SVC(kernel='linear', C=100)
model_svm_lin.fit(x_train_scaled, y_train)

print(confusion_matrix(y_hold, model_svm_lin.predict(x_hold_scaled)))
acc_hold_svm_lin = accuracy_score(y_hold, model_svm_lin.predict(x_hold_scaled))
print("Validation Accuracy (Linear SVM):", acc_hold_svm_lin)

[[27  2  1]
 [ 1  6  2]
 [ 0  0 28]]
Validation Accuracy (Linear SVM): 0.9104477611940298


In [9]:
# poly svm
model_svm_poly = SVC(kernel='poly', C=100, degree=3, coef0=1.0)
model_svm_poly.fit(x_train_scaled, y_train)

print(confusion_matrix(y_hold, model_svm_poly.predict(x_hold_scaled)))
acc_hold_svm_poly = accuracy_score(y_hold, model_svm_poly.predict(x_hold_scaled))
print("Validation Accuracy (Poly SVM):", acc_hold_svm_poly)

[[26  2  2]
 [ 1  6  2]
 [ 0  0 28]]
Validation Accuracy (Poly SVM): 0.8955223880597015


In [10]:
# rbf svm
model_svm_rbf = SVC(kernel='rbf', C=100)
model_svm_rbf.fit(x_train_scaled, y_train)

print(confusion_matrix(y_hold, model_svm_rbf.predict(x_hold_scaled)))
acc_hold_svm_rbf = accuracy_score(y_hold, model_svm_rbf.predict(x_hold_scaled))
print("Validation Accuracy (RBF SVM):", acc_hold_svm_rbf)

[[26  2  2]
 [ 1  6  2]
 [ 1  0 27]]
Validation Accuracy (RBF SVM): 0.8805970149253731


In [11]:
# compare holdout accuracy
scores = {
    "Logistic Regression": acc_hold_lr,
    "Gaussian NB": acc_hold_nb,
    "Linear SVM": acc_hold_svm_lin,
    "Poly SVM (deg 3)": acc_hold_svm_poly,
    "RBF SVM": acc_hold_svm_rbf
}

best_model_name = max(scores, key=scores.get)
best_model_name, scores[best_model_name]

('Logistic Regression', 0.9552238805970149)
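
Accuracy alone can hide weak performance on the smallest class. As a sketch, the per-class recall and precision can be read off the logistic regression confusion matrix printed in In [6]. `confusion_matrix` orders rows and columns by sorted label, which here means Adelie, Chinstrap, Gentoo; rows are true classes and columns are predictions. The variable names are illustrative.

```python
import numpy as np

# Holdout confusion matrix of the logistic regression, copied from In [6]
labels = ["Adelie", "Chinstrap", "Gentoo"]
cm_lr = np.array([[29, 1, 0],
                  [1, 7, 1],
                  [0, 0, 28]])

recall = cm_lr.diagonal() / cm_lr.sum(axis=1)     # correct / true count per class
precision = cm_lr.diagonal() / cm_lr.sum(axis=0)  # correct / predicted count per class
accuracy = cm_lr.trace() / cm_lr.sum()
balanced_accuracy = recall.mean()                 # unweighted mean of per-class recall

for name, r, p in zip(labels, recall, precision):
    print(f"{name:10s} recall={r:.3f} precision={p:.3f}")
print("accuracy:", accuracy)
print("balanced accuracy:", balanced_accuracy)
```

With only 9 Chinstrap rows in the holdout set, each misclassified Chinstrap moves that class's recall by about 11 percentage points, so the ranking of the models rests on a handful of observations.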

In [12]:
# select best performing model
models = {
    "Logistic Regression": model_lr,
    "Gaussian NB": model_nb,
    "Linear SVM": model_svm_lin,
    "Poly SVM (deg 3)": model_svm_poly,
    "RBF SVM": model_svm_rbf
}

best_model = models[best_model_name]

In [13]:
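# test performance of the model selected on the holdout set in In [12]
# (the output recorded below came from a run that reassigned best_model = model_svm_rbf,
# which bypasses that selection; rerun this cell to get the figure for the selected model)
test_acc = accuracy_score(y_test, best_model.predict(x_test_scaled))
print("Final Test Accuracy:", test_acc)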

Final Test Accuracy: 0.94
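
A test accuracy of 0.94 on 100 rows corresponds to 94 correct predictions, and with a test set this small the estimate carries noticeable uncertainty. As a rough illustration, the sketch below computes an approximate 95% Wilson score interval for a proportion. The helper `wilson_interval` is hypothetical and not part of the workshop material, and the interval treats the test rows as independent draws.

```python
import math

def wilson_interval(n_correct, n_total, z=1.96):
    """Approximate Wilson score interval for a proportion (z=1.96 for ~95%)."""
    p_hat = n_correct / n_total
    denom = 1 + z**2 / n_total
    center = (p_hat + z**2 / (2 * n_total)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / n_total + z**2 / (4 * n_total**2)
    ) / denom
    return center - half_width, center + half_width

# 94 correct out of 100 test rows, as reported above
lower, upper = wilson_interval(94, 100)
print(f"Approximate 95% interval: [{lower:.3f}, {upper:.3f}]")
```

The interval spans roughly ten percentage points, which is a reminder that differences of one or two points between models on this dataset should not be over-interpreted.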


---