# Palmer Penguins Modeling

Import the Palmer Penguins dataset and print out the first few rows.

Suppose we want to predict `bill_depth_mm` using the other variables in the dataset.

Which variables would we need to **dummify**?

In [4]:
!pip install palmerpenguins
import palmerpenguins as pen
df = pen.load_penguins()
df.head()

Collecting palmerpenguins
  Downloading palmerpenguins-0.1.4-py3-none-any.whl.metadata (2.0 kB)
Downloading palmerpenguins-0.1.4-py3-none-any.whl (17 kB)
Installing collected packages: palmerpenguins
Successfully installed palmerpenguins-0.1.4


|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |


To predict `bill_depth_mm`, we need to dummify the categorical variables: `species`, `island`, and `sex`.
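As a rough illustration of what dummifying does, the sketch below applies `pd.get_dummies` to a small hand-made frame. The rows are illustrative stand-ins shaped like the penguins table, not values taken from the dataset.

```python
import pandas as pd

# Toy rows shaped like the penguins table (illustrative values only)
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Dream", "Dream"],
    "sex": ["male", "female", "female", "male"],
    "bill_length_mm": [39.1, 46.1, 49.5, 37.2],
})

# Each categorical column becomes one 0/1 indicator column per level;
# numeric columns such as bill_length_mm pass through unchanged
dummies = pd.get_dummies(toy, columns=["species", "island", "sex"], dtype=int)
print(dummies.columns.tolist())
```

For a linear model with an intercept, passing `drop_first=True` to `pd.get_dummies` drops one level per variable and avoids perfectly collinear columns.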

Let's use `bill_length_mm` to predict `bill_depth_mm`. Prepare your data and fit the following models on the entire dataset:

* Simple linear regression (i.e., straight-line) model
* Quadratic (degree 2 polynomial) model
* Cubic (degree 3 polynomial) model
* Degree 10 polynomial model

Make predictions for each model and plot your fitted models on the scatterplot.

In [7]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error


In [10]:
# need to use train vs test from Last chapter: fit vs complexity
lr = LinearRegression()

X = df[['bill_length_mm']]
y = df['bill_depth_mm']

X = X.dropna()  # Drop rows with NaN in X
y = y[X.index]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

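# Fit a fresh estimator: lr.fit() returns lr itself, not a copy,
# so reusing lr for later models would overwrite this fit
lr_fit1 = LinearRegression().fit(X_train, y_train)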

train_preds = lr_fit1.predict(X_train)
test_preds = lr_fit1.predict(X_test)

In [11]:
print("MSE1_train:", mean_squared_error(y_train, train_preds))
print("MSE1_test:", mean_squared_error(y_test, test_preds))

print("R2_1_train:", r2_score(y_train, train_preds))
print("R2_1_test:", r2_score(y_test, test_preds))


MSE1_train: 3.57631662243497
MSE1_test: 4.089223424876161
R2_1_train: 0.06449018247909888
R2_1_test: -0.021776832927049794


In [12]:
poly = PolynomialFeatures(degree=2)  # Degree 2 for X, X**2

X_2 = poly.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_2, y, test_size=0.2)

# Use a separate estimator so this fit is not shared with the other models
lr_fit2 = LinearRegression().fit(X_train, y_train)

train_preds2 = lr_fit2.predict(X_train)
test_preds2 = lr_fit2.predict(X_test)

In [13]:
print("MSE2_train:", mean_squared_error(y_train, train_preds2))
print("MSE2_test:", mean_squared_error(y_test, test_preds2))

print("R2_2_train:", r2_score(y_train, train_preds2))
print("R2_2_test:", r2_score(y_test, test_preds2))

MSE2_train: 3.2915591272392146
MSE2_test: 4.075721871747065
R2_2_train: 0.13522967529346275
R2_2_test: 0.03247091199887053


In [14]:
# Cubic model: expand bill_length_mm into 1, X, X**2, X**3
poly3 = PolynomialFeatures(degree=3)

X_3 = poly3.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_3, y, test_size=0.2)

# Use a separate estimator so this fit is not shared with the other models
lr_fit3 = LinearRegression().fit(X_train, y_train)

train_preds3 = lr_fit3.predict(X_train)
test_preds3 = lr_fit3.predict(X_test)


In [15]:
print("MSE3_train:", mean_squared_error(y_train, train_preds3))
print("MSE3_test:", mean_squared_error(y_test, test_preds3))

print("R2_3_train:", r2_score(y_train, train_preds3))
print("R2_3_test:", r2_score(y_test, test_preds3))

MSE3_train: 3.635153886521404
MSE3_test: 2.722951870753544
R2_3_train: 0.11331308807332552
R2_3_test: 0.09840670975803378


In [16]:
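# Degree 10 model: expand bill_length_mm into 1, X, ..., X**10
poly10 = PolynomialFeatures(degree=10)

X_10 = poly10.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_10, y, test_size=0.2)

# Use a separate estimator so this fit is not shared with the other models
lr_fit10 = LinearRegression().fit(X_train, y_train)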

train_preds10 = lr_fit10.predict(X_train)
test_preds10 = lr_fit10.predict(X_test)

In [17]:
print("MSE10_train:", mean_squared_error(y_train, train_preds10))
print("MSE10_test:", mean_squared_error(y_test, test_preds10))

print("R2_10_train:", r2_score(y_train, train_preds10))
print("R2_10_test:", r2_score(y_test, test_preds10))

MSE10_train: 3.3481646871879223
MSE10_test: 3.974872413976908
R2_10_train: 0.14487638696666305
R2_10_test: -0.09903987240513712
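Each model above draws its own random train/test split, so differences between their scores partly reflect which rows landed in the test set. The sketch below shows one possible way to compare degrees on a single shared split. It runs on synthetic stand-in data (not the penguins), and the seed, noise level, and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: a gently curved trend plus noise
rng = np.random.default_rng(1)
x = rng.uniform(30, 60, size=300).reshape(-1, 1)
y = 17 + 0.02 * (x[:, 0] - 45) ** 2 + rng.normal(0, 1, size=300)

# Split once, with a fixed seed, and reuse the same rows for every model
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

results = {}
for degree in (1, 2, 3, 10):
    # A new pipeline per degree: each model gets its own transformer and estimator
    model = make_pipeline(
        StandardScaler(),  # scaling keeps high powers numerically manageable
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(x_train, y_train)
    results[degree] = (
        mean_squared_error(y_train, model.predict(x_train)),
        mean_squared_error(y_test, model.predict(x_test)),
    )

for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree:>2}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

With a shared split, training error can only go down as the degree increases, so a widening gap between training and test error is the signal to watch for overfitting.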


In [20]:
import matplotlib.pyplot as plt

# Sort X for plotting
X_sorted = np.sort(X['bill_length_mm'])
X_sorted_df = pd.DataFrame({'bill_length_mm': X_sorted})

# Plot the scatterplot of the original data
plt.scatter(X['bill_length_mm'], y, label='Data')

# Fit each degree on the entire dataset with its own transformer and estimator,
# then draw its fitted curve over the sorted X values
models = {'Linear': (1, 'red'), 'Quadratic': (2, 'green'),
          'Cubic': (3, 'blue'), 'Degree 10': (10, 'yellow')}
for label, (degree, color) in models.items():
    model = Pipeline([
        ('scale', StandardScaler()),  # keeps high powers numerically stable
        ('poly', PolynomialFeatures(degree=degree)),
        ('ols', LinearRegression()),
    ])
    model.fit(X, y)
    plt.plot(X_sorted, model.predict(X_sorted_df), label=label, color=color)

plt.xlabel('Bill Length (mm)')
plt.ylabel('Bill Depth (mm)')
plt.title('Fitted Models for Bill Depth vs. Bill Length')
plt.legend()
plt.show()


* Are any of the models above underfitting the data? If so, which ones and how can you tell?
* Are any of the models above overfitting the data? If so, which ones and how can you tell?
* Which of the above models do you think fits the data best and why?

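The simple linear model is underfitting the data: its training R² is only about 0.06 and its test R² is slightly negative, so a straight line explains almost none of the variation in `bill_depth_mm`.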
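One way to back up answers like this with less dependence on a single random split is k-fold cross-validation. The following is a minimal sketch on synthetic stand-in data (not the penguins); the fold count, seed, and variable names are assumptions for illustration rather than part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: a curved trend plus noise
rng = np.random.default_rng(2)
x = rng.uniform(30, 60, size=300).reshape(-1, 1)
y = 17 + 0.02 * (x[:, 0] - 45) ** 2 + rng.normal(0, 1, size=300)

# The same shuffled folds are reused for every degree
folds = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}
for degree in (1, 2, 3, 10):
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    # scikit-learn reports negated MSE so that larger is better; flip the sign back
    scores = cross_val_score(model, x, y, cv=folds, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()

best_degree = min(cv_mse, key=cv_mse.get)
for degree, mse in cv_mse.items():
    print(f"degree {degree:>2}: cross-validated MSE = {mse:.3f}")
print("lowest cross-validated MSE at degree", best_degree)
```

A model whose cross-validated error is clearly higher than that of a slightly more flexible model is a candidate for underfitting, while one whose error rises again at high degree is a candidate for overfitting.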