# Palmer Penguins Modeling

Import the Palmer Penguins dataset and print out the first few rows.

Suppose we want to predict `bill_depth_mm` using the other variables in the dataset.

**Dummify** all variables that require this.

In [71]:
# Install the dataset package (notebook shell command; only needed once)
!pip install palmerpenguins

import pandas as pd
from palmerpenguins import load_penguins
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures

# Load the data and drop rows with missing values
penguins = load_penguins()
penguins = penguins.dropna()



In [72]:
penguins.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male | 2007 |


In [73]:
import numpy as np

In [74]:
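# Features: every column except the target
X = penguins.drop('bill_depth_mm', axis=1)
y = penguins['bill_depth_mm']

# Hold out 20% of the rows for testing.
# No random_state is set, so the split (and every score below) changes between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)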

ct = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
         make_column_selector(dtype_include=object)),  # Grab all string columns
        ("standardize",
         StandardScaler(),
         make_column_selector(dtype_include=np.number))  # Grab all numerical columns
    ],
    remainder="passthrough"
)

In [75]:
elastic_pipeline = Pipeline([
    ('preprocessing', ct),
    ('elasticnet', ElasticNet(max_iter=1000000))
])

In [77]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, mean_squared_error
# Define a negative MSE scorer (since GridSearchCV maximizes the score)
neg_mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

param_grid = {
    'elasticnet__alpha': [0.01, 0.1, 1, 10, 100]
}

# Cross-validation with GridSearchCV
grid_search = GridSearchCV(elastic_pipeline, param_grid, cv=5, scoring=neg_mse_scorer) # cv=5 for 5-fold cross-validation

# Fit the model
grid_search.fit(X_train, y_train)

# Check the best alpha value and the corresponding performance
print(grid_search.best_params_)

{'elasticnet__alpha': 0.01}
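
The grid above only searches over `alpha`, so `l1_ratio` stays at its default during the search. A minimal sketch of tuning both at once is shown below. It runs on synthetic stand-in data (`X_demo`, `y_demo`), and the grid values are illustrative assumptions, not recommendations for the penguins data.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the penguins data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.5, 0.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

demo_pipeline = Pipeline([
    ("standardize", StandardScaler()),
    ("elasticnet", ElasticNet(max_iter=100_000)),
])

# Search over both the penalty strength and the L1/L2 mix.
demo_grid = {
    "elasticnet__alpha": [0.01, 0.1, 1, 10],
    "elasticnet__l1_ratio": [0.1, 0.5, 0.9],
}

demo_search = GridSearchCV(
    demo_pipeline, demo_grid, cv=5, scoring="neg_mean_squared_error"
)
demo_search.fit(X_demo, y_demo)

best_params = demo_search.best_params_
best_cv_mse = -demo_search.best_score_  # flip the sign back to a plain MSE

print(best_params)
print(best_cv_mse)
```

With the default `refit=True`, `demo_search.best_estimator_` is already refit on all the data passed to `fit`, so the pipeline does not need to be rebuilt by hand with the chosen values.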


In [81]:
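# Rebuild the pipeline with the alpha chosen by the grid search.
# Note: l1_ratio=0.01 was not part of the search (the grid used the default of 0.5),
# and max_iter falls back to its default here.
elastic_pipeline = Pipeline([
    ('preprocessing', ct),
    ('elasticnet', ElasticNet(alpha=0.01, l1_ratio=0.01))
])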

In [82]:
elastic_pipeline.fit(X_train, y_train)

In [83]:
# 5-fold cross-validation to estimate MSE
scores = cross_val_score(elastic_pipeline, X_train, y_train, cv=5, scoring='neg_mean_squared_error')

# Calculate the average MSE
mse_scores = -scores
mse_scores.mean()

0.6922639268095329

Let's use the other variables to predict `bill_depth_mm`. Prepare your data and fit the following models on the entire dataset:

* Your best multiple linear regression model from before
* Two kNN models (for different values of K)
* A decision tree model

Create a plot like the right plot of Fig 1. in our `Model Validation` chapter with the training and test error plotted for each of your four models.

Which of your models was best?

In [84]:
# Elastic net MSE on the full dataset (X contains both the training and the test rows)
elastic_y_pred = elastic_pipeline.predict(X)
mse1 = mean_squared_error(y, elastic_y_pred)
mse1

0.6297141514168878
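
The exercise asks for training and test error separately, whereas predicting on all of `X` mixes the two. One way to get both numbers is a small helper like the hypothetical `train_test_mse` below, shown here on synthetic stand-in data rather than the penguins split.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def train_test_mse(model, X_train, X_test, y_train, y_test):
    """Fit on the training split only; return (training MSE, test MSE)."""
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    return train_mse, test_mse


# Synthetic stand-in data (an assumption, not the penguins data).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=150)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

demo_train_mse, demo_test_mse = train_test_mse(
    ElasticNet(alpha=0.01), Xd_train, Xd_test, yd_train, yd_test
)
print(demo_train_mse, demo_test_mse)
```

In the notebook, the same helper could be called with any of the pipelines and the existing `X_train`, `X_test`, `y_train`, `y_test`.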

In [85]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
# Pipeline for kNN with K=3, fit on the training split
knn_pipeline1 = Pipeline([
    ('preprocessing', ct),
    ('knn', KNeighborsRegressor(n_neighbors=3))
])
knn_pipeline1.fit(X_train, y_train)

In [86]:
knn_y_pred1 = knn_pipeline1.predict(X)
mse2 = mean_squared_error(y, knn_y_pred1)
mse2

0.442862862862863

In [87]:
# Pipeline for kNN with K=10
knn_pipeline2 = Pipeline([
    ('preprocessing', ct),
    ('knn', KNeighborsRegressor(n_neighbors=10))
])
knn_pipeline2.fit(X_train, y_train)

In [88]:
knn_y_pred2 = knn_pipeline2.predict(X)
mse3 = mean_squared_error(y, knn_y_pred2)
mse3

0.560379279279279

In [89]:
# Decision tree
dt_pipeline = Pipeline([
    ('preprocessing', ct),
    ('decision_tree', DecisionTreeRegressor())
])
dt_pipeline.fit(X_train, y_train)

In [90]:
dt_y_pred = dt_pipeline.predict(X)
mse4 = mean_squared_error(y, dt_y_pred)
mse4

0.2560660660660661

I do not know what "flexibility" measures here, so I was not able to create the requested plot. Looking at the MSE values instead, the decision tree pipeline has the lowest error (0.256), followed by kNN with K=3 (0.443), kNN with K=10 (0.560), and the elastic net (0.630). These errors were computed on the full dataset, most of which is training data, so the decision tree's low MSE could be down to overfitting rather than a genuinely better fit. The best model that likely isn't too overfit would be the kNN model with K=3.
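
One possible way to build the requested plot, assuming "flexibility" refers to how closely a model is able to adapt to the training data: put the models on the x-axis from least to most flexible and plot training and test MSE for each. The sketch below uses synthetic stand-in data, and the ordering of the four models (elastic net, kNN with K=10, kNN with K=3, unpruned decision tree) is an assumption, not something taken from the chapter.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the penguins data).
rng = np.random.default_rng(2)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=300)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Assumed ordering, from least to most flexible.
models = {
    "Elastic net": ElasticNet(alpha=0.01),
    "kNN (K=10)": KNeighborsRegressor(n_neighbors=10),
    "kNN (K=3)": KNeighborsRegressor(n_neighbors=3),
    "Decision tree": DecisionTreeRegressor(random_state=0),
}

train_errors, test_errors = [], []
for model in models.values():
    model.fit(Xd_train, yd_train)  # fit on the training split only
    train_errors.append(mean_squared_error(yd_train, model.predict(Xd_train)))
    test_errors.append(mean_squared_error(yd_test, model.predict(Xd_test)))

# In a notebook the figure is displayed when the cell finishes.
fig, ax = plt.subplots()
ax.plot(list(models), train_errors, marker="o", label="Training MSE")
ax.plot(list(models), test_errors, marker="o", label="Test MSE")
ax.set_xlabel("Model (roughly increasing flexibility)")
ax.set_ylabel("MSE")
ax.legend()
```

A gap where training error keeps falling while test error does not is the usual sign of overfitting, which is what the comparison between the decision tree and the kNN models is trying to detect.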