# Abalone Dataset
Predicting the age of abalone from physical measurements

The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.

In [1]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


In [2]:
# Load the abalone dataset

df = pd.read_csv("train.csv")

# Display the first few rows of the dataset
df.head()


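|   | id | Sex | Length | Diameter | Height | Whole weight | Whole weight.1 | Whole weight.2 | Shell weight | Rings |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | F | 0.55 | 0.43 | 0.15 | 0.7715 | 0.3285 | 0.1465 | 0.24 | 11 |
| 1 | 1 | F | 0.63 | 0.49 | 0.145 | 1.13 | 0.458 | 0.2765 | 0.32 | 11 |
| 2 | 2 | I | 0.16 | 0.11 | 0.025 | 0.021 | 0.0055 | 0.003 | 0.005 | 6 |
| 3 | 3 | M | 0.595 | 0.475 | 0.15 | 0.9145 | 0.3755 | 0.2055 | 0.25 | 10 |
| 4 | 4 | I | 0.555 | 0.425 | 0.13 | 0.782 | 0.3695 | 0.16 | 0.1975 | 9 |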


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90615 entries, 0 to 90614
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              90615 non-null  int64  
 1   Sex             90615 non-null  object 
 2   Length          90615 non-null  float64
 3   Diameter        90615 non-null  float64
 4   Height          90615 non-null  float64
 5   Whole weight    90615 non-null  float64
 6   Whole weight.1  90615 non-null  float64
 7   Whole weight.2  90615 non-null  float64
 8   Shell weight    90615 non-null  float64
 9   Rings           90615 non-null  int64  
dtypes: float64(7), int64(2), object(1)
memory usage: 6.9+ MB


In [4]:
# Drop 'id' column as it's not informative for modeling
df = df.drop(columns=['id'])

# Rename the ambiguous weight columns to descriptive names
df = df.rename(columns={
    'Whole weight': 'Whole_weight',
    'Whole weight.1': 'Shucked_weight',
    'Whole weight.2': 'Viscera_weight'
})

In [5]:
# One-hot encode the 'Sex' categorical variable
df = pd.get_dummies(df, columns=['Sex'], drop_first=True)
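
To illustrate what `drop_first=True` does, here is a minimal sketch on a toy frame (the `toy` DataFrame is made up for illustration and is not part of the dataset). With three categories, only two indicator columns are kept; the dropped category (`F`) becomes the baseline, encoded as all-False.

```python
import pandas as pd

# Hypothetical toy frame with the three Sex categories (F, I, M)
toy = pd.DataFrame({"Sex": ["F", "I", "M", "I"]})

# Without drop_first: one indicator column per category
full = pd.get_dummies(toy, columns=["Sex"])

# With drop_first: the first category (F) is dropped and acts as the baseline
reduced = pd.get_dummies(toy, columns=["Sex"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

Dropping one level avoids perfectly collinear indicator columns, which matters for the unpenalized linear regression used later in the PCR pipelines.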

In [6]:
df.head()

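|   | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell weight | Rings | Sex_I | Sex_M |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.55 | 0.43 | 0.15 | 0.7715 | 0.3285 | 0.1465 | 0.24 | 11 | False | False |
| 1 | 0.63 | 0.49 | 0.145 | 1.13 | 0.458 | 0.2765 | 0.32 | 11 | False | False |
| 2 | 0.16 | 0.11 | 0.025 | 0.021 | 0.0055 | 0.003 | 0.005 | 6 | True | False |
| 3 | 0.595 | 0.475 | 0.15 | 0.9145 | 0.3755 | 0.2055 | 0.25 | 10 | False | True |
| 4 | 0.555 | 0.425 | 0.13 | 0.782 | 0.3695 | 0.16 | 0.1975 | 9 | True | False |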


In [7]:
# Define features and target
X = df.drop(columns='Rings')
y = df['Rings']


In [8]:
X.head()

|   | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell weight | Sex_I | Sex_M |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.55 | 0.43 | 0.15 | 0.7715 | 0.3285 | 0.1465 | 0.24 | False | False |
| 1 | 0.63 | 0.49 | 0.145 | 1.13 | 0.458 | 0.2765 | 0.32 | False | False |
| 2 | 0.16 | 0.11 | 0.025 | 0.021 | 0.0055 | 0.003 | 0.005 | True | False |
| 3 | 0.595 | 0.475 | 0.15 | 0.9145 | 0.3755 | 0.2055 | 0.25 | False | True |
| 4 | 0.555 | 0.425 | 0.13 | 0.782 | 0.3695 | 0.16 | 0.1975 | True | False |


In [9]:
y.head()

0    11
1    11
2     6
3    10
4     9
Name: Rings, dtype: int64

In [10]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [11]:
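# Standardize features so the Lasso penalty treats every coefficient on the same scale
# (the scaler is fit on the training set only, then applied to the test set)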
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
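
The fit-on-train, transform-on-test pattern can be checked on synthetic data. This is a small sketch with made-up arrays (`X_demo` and the other names are illustrative, not from the notebook): the training columns end up with mean 0 and standard deviation 1 exactly, while the test columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (assumption: 200 rows, 3 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)  # learns mean/std from the training rows
X_te_s = scaler_demo.transform(X_te)      # reuses the training mean/std

print("train mean/std:", X_tr_s.mean(axis=0).round(3), X_tr_s.std(axis=0).round(3))
print("test  mean/std:", X_te_s.mean(axis=0).round(3), X_te_s.std(axis=0).round(3))
```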

# Model 1 - Lasso Regularization for Variable Selection

In [12]:
# Fit Lasso with 5-fold cross-validation to choose the regularization strength (alpha)
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_train_scaled, y_train)


In [13]:
# Predict and compute test MSE
y_pred_lasso = lasso_cv.predict(X_test_scaled)
lasso_mse = mean_squared_error(y_test, y_pred_lasso)

In [14]:
# Extract coefficients to see which features the Lasso kept (non-zero)
lasso_coef = pd.Series(lasso_cv.coef_, index=X.columns)

lasso_coef_nonzero = lasso_coef[lasso_coef != 0]

lasso_mse, lasso_coef_nonzero

(4.099687530813719,
 Length           -0.004216
 Diameter          0.572922
 Height            0.753473
 Whole_weight      1.489079
 Shucked_weight   -3.140700
 Viscera_weight   -0.604988
 Shell weight      2.778902
 Sex_I            -0.340432
 Sex_M            -0.006242
 dtype: float64)

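**Model 1: Lasso Regression (Regularization for Variable Selection)**

**Test MSE: 4.10**

Coefficients at the cross-validated alpha (features were standardized, so magnitudes are comparable):

Length: −0.004 (nearly zero)

Diameter: +0.573

Height: +0.753

Whole_weight: +1.489

Shucked_weight: −3.141

Viscera_weight: −0.605

Shell weight: +2.779

Sex_I: −0.340

Sex_M: −0.006 (nearly zero)

All nine coefficients are non-zero, so the Lasso did not drop any feature outright; it only shrank Length and Sex_M almost to zero. The largest magnitudes belong to Shucked_weight, Shell weight, and Whole_weight, which suggests these are the most informative variables for predicting the number of rings.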
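
How many coefficients the Lasso zeroes out depends on the penalty strength alpha. The sketch below uses synthetic data (all names and values here are illustrative assumptions, not the abalone data) in which only two of six features drive the target, and shows the alpha picked by `LassoCV` alongside the non-zero count at a few fixed alphas.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of six features influence the target
rng = np.random.default_rng(0)
n = 500
X_demo = rng.normal(size=(n, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=n)
X_demo = StandardScaler().fit_transform(X_demo)

# Cross-validated choice of alpha, as in the cell above
cv_model = LassoCV(cv=5, random_state=42).fit(X_demo, y_demo)
print("alpha chosen by CV:", cv_model.alpha_)
print("non-zero coefficients at that alpha:", int(np.sum(cv_model.coef_ != 0)))

# Larger alphas shrink harder and zero out more coefficients
counts = {}
for alpha in (0.01, 0.1, 1.0):
    coef = Lasso(alpha=alpha).fit(X_demo, y_demo).coef_
    counts[alpha] = int(np.sum(coef != 0))
    print(f"alpha={alpha}: {counts[alpha]} non-zero coefficients")
```

In the notebook, `lasso_cv.alpha_` would show the penalty selected for the abalone data; a small value there would be consistent with all nine coefficients surviving.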

# Model 2 - Principal Components Regression (PCR)

In [15]:
# Apply PCA and build PCR model using a pipeline
pca_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('regressor', LinearRegression())
])



In [16]:
# Fit PCR model
pca_pipeline.fit(X_train, y_train)



In [17]:
# Predict and compute test MSE
y_pred_pcr = pca_pipeline.predict(X_test)
pcr_mse = mean_squared_error(y_test, y_pred_pcr)



In [18]:
# Determine number of components explaining 95% of variance
pca_model = pca_pipeline.named_steps['pca']
explained_variance = np.cumsum(pca_model.explained_variance_ratio_)
n_components_95 = np.argmax(explained_variance >= 0.95) + 1

pcr_mse, n_components_95


(4.095322631071489, 3)

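✅ **Model 2: Principal Components Regression (PCR)**

**Test MSE: 4.10** (very close to Lasso)

Because PCA() keeps all 9 components here, this pipeline only rotates the standardized features, so the fit is equivalent to ordinary least squares and involves no dimensionality reduction yet.

Number of principal components needed to explain ≥95% of the variance: 3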
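
When every component is kept, PCA is just a rotation of the standardized features, so PCR should give the same predictions as plain linear regression. A minimal sketch on synthetic data (the arrays and pipeline names are illustrative assumptions) confirms this numerically.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with some correlation between the first two features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
X_demo[:, 1] += 0.8 * X_demo[:, 0]
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=300)

# Plain OLS on standardized features
ols = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])

# PCR that keeps all components (PCA() with no n_components)
pcr_full = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("regressor", LinearRegression()),
])

pred_ols = ols.fit(X_demo, y_demo).predict(X_demo)
pred_pcr = pcr_full.fit(X_demo, y_demo).predict(X_demo)

# Largest absolute difference between the two sets of predictions
max_diff = np.max(np.abs(pred_ols - pred_pcr))
print("max prediction difference:", max_diff)
```

The difference is at the level of floating-point noise, which is why PCR only becomes a distinct model once some components are discarded.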



**Summary Comparison**

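| Model                  | Test MSE | Variable Selection Insight                                           |
| ---------------------- | -------- | -------------------------------------------------------------------- |
| Lasso                  | 4.10     | Partial – no feature dropped; Length and Sex_M shrunk to nearly zero |
| PCR (all 9 components) | 4.10     | No – based on orthogonal components                                  |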
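
MSE is in squared rings, so taking the square root gives an error in the same units as the target. A small arithmetic sketch using the two test MSE values reported above:

```python
import math

# Test MSE values copied from the notebook outputs above
lasso_mse = 4.099687530813719
pcr_mse = 4.095322631071489

# RMSE is the square root of MSE, expressed in rings
lasso_rmse = math.sqrt(lasso_mse)
pcr_rmse = math.sqrt(pcr_mse)

print(f"Lasso RMSE: {lasso_rmse:.2f} rings")
print(f"PCR RMSE:   {pcr_rmse:.2f} rings")
```

Both models are therefore off by roughly two rings on a typical test sample.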


# Tuning PCR: Top 3 Principal Components

In [19]:
# Create a new pipeline using only the top 3 principal components
pcr_top3_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=3)),
    ('regressor', LinearRegression())
])

# Fit the model
pcr_top3_pipeline.fit(X_train, y_train)

# Predict and compute test MSE
y_pred_pcr_top3 = pcr_top3_pipeline.predict(X_test)
pcr_top3_mse = mean_squared_error(y_test, y_pred_pcr_top3)

pcr_top3_mse


5.897541500919276

Using only the top 3 principal components:

PCR Test MSE: 5.90

📉 This is worse than using all components (MSE ≈ 4.10). PCA ranks components by how much variance they explain in the features, not by their relationship with Rings, so the first 3 components capture 95% of the feature variance yet miss much of the predictive signal.
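
The gap between explained variance and predictive signal can be reproduced on a toy problem. In this sketch (synthetic data; every name is an illustrative assumption) two features are almost identical, and the target depends only on their small difference, which ends up in the low-variance second component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two highly correlated features; the target depends on their difference
rng = np.random.default_rng(0)
n = 1000
base = rng.normal(size=n)
x1 = base + 0.1 * rng.normal(size=n)
x2 = base + 0.1 * rng.normal(size=n)
X_demo = np.column_stack([x1, x2])
y_demo = 10.0 * (x1 - x2) + 0.1 * rng.normal(size=n)

# Fit PCR with 1 and then 2 components and record the in-sample R^2
r2_by_k = {}
for k in (1, 2):
    pcr = Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=k)),
        ("regressor", LinearRegression()),
    ])
    pcr.fit(X_demo, y_demo)
    r2_by_k[k] = pcr.score(X_demo, y_demo)

# Share of feature variance captured by the first component (from the k=2 fit)
first_pc_share = pcr.named_steps["pca"].explained_variance_ratio_[0]

print("variance explained by PC1:", round(first_pc_share, 3))
print("R^2 with 1 component:", round(r2_by_k[1], 3))
print("R^2 with 2 components:", round(r2_by_k[2], 3))
```

Here the first component explains nearly all of the feature variance but almost none of the target, which is an exaggerated version of what happens with 3 components on the abalone data.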

# Tuning PCR: Top 5 Principal Components

In [20]:
# Create a new pipeline using the top 5 principal components
pcr_top5_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=5)),
    ('regressor', LinearRegression())
])

# Fit the model
pcr_top5_pipeline.fit(X_train, y_train)

# Predict and compute test MSE
y_pred_pcr_top5 = pcr_top5_pipeline.predict(X_test)
pcr_top5_mse = mean_squared_error(y_test, y_pred_pcr_top5)

pcr_top5_mse


4.838098294329311

Using the top 5 principal components:

PCR Test MSE: 4.84

📊 This is a substantial improvement over using only 3 components (MSE ≈ 5.90), but still not as good as using all components (MSE ≈ 4.10).

# Tuning PCR: Top 6 Principal Components

In [21]:
# Create a new pipeline using the top 6 principal components
pcr_top6_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=6)),
    ('regressor', LinearRegression())
])

# Fit the model
pcr_top6_pipeline.fit(X_train, y_train)

# Predict and compute test MSE
y_pred_pcr_top6 = pcr_top6_pipeline.predict(X_test)
pcr_top6_mse = mean_squared_error(y_test, y_pred_pcr_top6)

pcr_top6_mse


4.151776410349206

Using the top 6 principal components:

PCR Test MSE: 4.15

📈 This result is nearly as good as using all components (MSE ≈ 4.10), and much better than using just 3 or 5.

✅ Final PCR Model

**Model: Principal Components Regression (PCR)**

**# of Components: 6**

**Test MSE: 4.15**

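This balances dimensionality reduction (6 components instead of 9) with predictive performance nearly equal to the full-component model. Note that the number of components was chosen by comparing test-set MSE, so the reported 4.15 is likely slightly optimistic; choosing it by cross-validation on the training set would give a cleaner estimate.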
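
Instead of refitting a pipeline by hand for each component count and comparing test MSE, the number of components could be selected by cross-validation on the training set alone. The following is a sketch under stated assumptions: it uses synthetic data with 9 features as a stand-in for `X` and `y`, and the names `pcr`, `search`, and `best_k` are illustrative rather than part of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 9 features (assumption: replace with the real X, y)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 9))
y_demo = X_demo @ rng.normal(size=9) + rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pcr = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("regressor", LinearRegression()),
])

# Try every component count from 1 to 9 with 5-fold CV on the training set
search = GridSearchCV(
    pcr,
    param_grid={"pca__n_components": list(range(1, 10))},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["pca__n_components"]
# The test set is touched once, after the component count is fixed
test_mse = mean_squared_error(y_te, search.predict(X_te))

print("components chosen by CV:", best_k)
print("test MSE at that setting:", test_mse)
```

With this setup the test set is used only for the final estimate, so the reported MSE is not influenced by the model-selection step.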