# Regression with an Abalone Dataset - Improved

## 1. Introduction

This notebook predicts the age of abalone from physical measurements. Age is normally determined by cutting the shell through the cone, staining it, and counting the rings under a microscope, a tedious and time-consuming task. The ring count (`Rings`) is therefore the target, and measurements that are easier to obtain (sex, length, diameter, height, and several weights) serve as predictors.

This notebook will follow a structured approach:
1. **Exploratory Data Analysis (EDA):** Understand the data and its characteristics.
2. **Feature Engineering:** Create new features to improve model performance.
3. **Model Selection:** Train and compare different regression models.
4. **Hyperparameter Tuning:** Fine-tune the best model.
5. **Submission:** Generate the submission file.

## 2. Load Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_log_error
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')

## 3. Load Data

In [None]:
train_df = pd.read_csv('Regression with an Abalone Dataset/train.csv')
test_df = pd.read_csv('Regression with an Abalone Dataset/test.csv')
sample_submission_df = pd.read_csv('Regression with an Abalone Dataset/my_submission.csv')

print('Train data shape:', train_df.shape)
print('Test data shape:', test_df.shape)

In [None]:
train_df.head()

In [None]:
test_df.head()

Some column names in the provided dataset are ambiguous: `Whole weight`, `Whole weight.1`, and `Whole weight.2`. Based on the original Abalone dataset description in the UCI repository, `Whole weight.1` corresponds to `Shucked weight` and `Whole weight.2` to `Viscera weight`; `Shell weight` is already named correctly. Let's rename them to be more descriptive, replacing spaces with underscores along the way. The target variable is `Rings`.

In [None]:
# Shared mapping so the train and test sets are renamed identically
rename_map = {
    'Whole weight': 'Whole_weight',
    'Whole weight.1': 'Shucked_weight',
    'Whole weight.2': 'Viscera_weight',
    'Shell weight': 'Shell_weight',
}

train_df.rename(columns=rename_map, inplace=True)
test_df.rename(columns=rename_map, inplace=True)

# Also, let's add the 'Rings' column to the test set with NaN values for consistency
test_df['Rings'] = np.nan
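
A quick check after renaming can catch a column that was missed or misspelled before the later cells fail on it. The sketch below is only an illustration: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the toy frame stands in for `train_df`.

```python
import pandas as pd

# Column names the later cells rely on (hypothetical constant for this sketch).
EXPECTED_COLUMNS = [
    'Sex', 'Length', 'Diameter', 'Height',
    'Whole_weight', 'Shucked_weight', 'Viscera_weight', 'Shell_weight',
]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]

# Toy frame standing in for train_df: one weight column was left un-renamed.
toy = pd.DataFrame(columns=[
    'id', 'Sex', 'Length', 'Diameter', 'Height',
    'Whole_weight', 'Shucked_weight', 'Whole weight.2', 'Shell_weight',
])
print(missing_columns(toy))
```

In the notebook, the same helper could be called on both `train_df` and `test_df`, with an empty list meaning every expected column is present.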

## 4. Exploratory Data Analysis (EDA)

### 4.1. Target Variable Distribution

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(train_df['Rings'], bins=28, kde=True)
plt.title('Distribution of Rings')
plt.show()

### 4.2. Numerical Features Distribution

In [None]:
numerical_features = train_df.select_dtypes(include=np.number).columns.tolist()
numerical_features.remove('id')
numerical_features.remove('Rings')

train_df[numerical_features].hist(bins=20, figsize=(15, 10), layout=(2, 4))
plt.tight_layout()
plt.show()

### 4.3. Categorical Feature Distribution

In [None]:
sns.countplot(x='Sex', data=train_df)
plt.title('Distribution of Sex')
plt.show()

### 4.4. Correlation Matrix

In [None]:
plt.figure(figsize=(12, 8))
corr_matrix = train_df[numerical_features + ['Rings']].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
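
The heatmap can be hard to read when many features are involved, so it may help to list the features by the strength of their correlation with the target. The following is a minimal sketch on made-up data; the toy values and the `noise` column are assumptions for illustration, not results from the Abalone data.

```python
import pandas as pd

# Toy data: Shell_weight rises with Rings, while noise does not follow it.
toy = pd.DataFrame({
    'Rings': [1, 2, 3, 4],
    'Shell_weight': [0.1, 0.2, 0.3, 0.4],
    'noise': [3, 1, 4, 1],
})

# Absolute Pearson correlation of each feature with the target, strongest first.
ranked = (
    toy.corr()['Rings']
    .drop('Rings')
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```

Applied to the notebook's data, `corr_matrix['Rings']` could be ranked the same way.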

## 5. Feature Engineering

In [None]:
def feature_engineer(df):
    # One-hot encode the 'Sex' column
    df = pd.get_dummies(df, columns=['Sex'], drop_first=True)

    # Size-based features: footprint area, approximate density, and a BMI-style index
    df['shell_area'] = df['Length'] * df['Diameter']
    df['approx_density'] = df['Whole_weight'] / (df['shell_area'] * df['Height'])
    df['bmi'] = df['Whole_weight'] / (df['Height']**2)

    # Ratio features
    df['length_dia_ratio'] = df['Length'] / df['Diameter']
    df['length_height_ratio'] = df['Length'] / df['Height']
    df['dia_height_ratio'] = df['Diameter'] / df['Height']
    df['shell_shuck_ratio'] = df['Shell_weight'] / df['Shucked_weight']

    # Whole weight not accounted for by the shucked, viscera, and shell components
    df['water_loss'] = df['Whole_weight'] - df['Shucked_weight'] - df['Viscera_weight'] - df['Shell_weight']

    # Division by zero (e.g. Height == 0) yields infinities; convert them to NaN for later imputation
    df = df.replace([np.inf, -np.inf], np.nan)

    return df

train_featured_df = feature_engineer(train_df.copy())
test_featured_df = feature_engineer(test_df.copy())

print('Train featured shape:', train_featured_df.shape)
print('Test featured shape:', test_featured_df.shape)
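
Several of the engineered features divide by `Height`, so any row with a height of zero produces an infinite value. The toy example below, with made-up numbers, shows what pandas does in that case and how the replacement step inside `feature_engineer` turns the result into a `NaN` that can be imputed later.

```python
import numpy as np
import pandas as pd

# Two made-up rows; the second has a Height of zero.
toy = pd.DataFrame({'Length': [0.5, 0.4], 'Height': [0.1, 0.0]})

# pandas does not raise on float division by zero; it returns inf.
toy['length_height_ratio'] = toy['Length'] / toy['Height']
n_inf = int(np.isinf(toy['length_height_ratio']).sum())

# Same replacement as in feature_engineer: infinities become NaN.
toy = toy.replace([np.inf, -np.inf], np.nan)
n_nan = int(toy['length_height_ratio'].isna().sum())

print('infinite values before replacement:', n_inf)
print('NaN values after replacement:', n_nan)
```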

## 6. Model Training and Evaluation

In [None]:
X = train_featured_df.drop(['id', 'Rings'], axis=1)
y = train_featured_df['Rings']
X_test_final = test_featured_df.drop(['id', 'Rings'], axis=1)

# Align columns - crucial for consistent feature sets
train_cols = X.columns
test_cols = X_test_final.columns

missing_in_test = set(train_cols) - set(test_cols)
for c in missing_in_test:
    X_test_final[c] = 0

missing_in_train = set(test_cols) - set(train_cols)
for c in missing_in_train:
    X[c] = 0

X_test_final = X_test_final[train_cols]

# Impute any remaining NaNs with training-set medians so both sets get the same fill values
train_medians = X.median()
X.fillna(train_medians, inplace=True)
X_test_final.fillna(train_medians, inplace=True)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
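
The alignment and imputation steps can also be written more compactly. The sketch below is one possible variant, not the notebook's own method: it uses `DataFrame.reindex` to give the test frame the training columns and fills missing values in both frames with medians computed from the training data only. The frames `train_X` and `test_X` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X and X_test_final; the test frame lacks the Sex_M column.
train_X = pd.DataFrame({'a': [1.0, np.nan, 3.0, 100.0], 'Sex_M': [1, 0, 1, 0]})
test_X = pd.DataFrame({'a': [np.nan, 2.0]})

# Give the test frame exactly the training columns, in the same order.
# fill_value only applies to columns that reindex has to create.
test_X = test_X.reindex(columns=train_X.columns, fill_value=0)

# Compute medians on the training data and reuse them for both frames.
medians = train_X.median()
train_X = train_X.fillna(medians)
test_X = test_X.fillna(medians)

print(test_X)
```

Reusing the training medians means test rows are filled with values derived from the same data the model was fitted on.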

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_final_scaled = scaler.transform(X_test_final)

### 6.1. Model Comparison

In [None]:
def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

models = {
    'RandomForest': RandomForestRegressor(random_state=42),
    'GradientBoosting': GradientBoostingRegressor(random_state=42),
    'XGBoost': XGBRegressor(random_state=42)
}

results = {}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_val_scaled)
    # Ring counts cannot be negative, so clip predictions at zero before scoring
    y_pred = np.clip(y_pred, 0, None)
    score = rmsle(y_val, y_pred)
    results[name] = score
    print(f'{name} RMSLE: {score:.5f}')

results_df = pd.DataFrame(list(results.items()), columns=['Model', 'RMSLE']).sort_values('RMSLE')
print('\nModel Comparison:')
print(results_df)
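
The `rmsle` helper above is the root mean squared error computed on `log1p`-transformed values, which is why it penalizes relative rather than absolute errors. A plain NumPy version makes this explicit. The function name `rmsle_numpy` is hypothetical and the inputs are made up.

```python
import numpy as np

def rmsle_numpy(y_true, y_pred):
    """Root mean squared error between log1p(y_pred) and log1p(y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Perfect predictions give an error of zero.
perfect = rmsle_numpy([5, 10, 15], [5, 10, 15])

# Doubling a small and a large target by the same factor yields similar errors,
# even though the absolute errors differ by a factor of ten.
small = rmsle_numpy([10], [20])
large = rmsle_numpy([100], [200])
print(perfect, small, large)
```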

### 6.2. Hyperparameter Tuning (for the best model)

In [None]:
# Let's assume XGBoost is the best model based on the above results
param_grid = {
    'n_estimators': [200, 300, 400],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [5, 7, 9],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0]
}

xgb = XGBRegressor(random_state=42)

random_search = RandomizedSearchCV(estimator=xgb, param_distributions=param_grid, 
                                   n_iter=20, cv=3, verbose=2, random_state=42, 
                                   n_jobs=-1, scoring='neg_mean_squared_log_error')

random_search.fit(X_train_scaled, y_train)

print('Best parameters found: ', random_search.best_params_)

best_xgb = random_search.best_estimator_

y_pred_tuned = best_xgb.predict(X_val_scaled)
y_pred_tuned[y_pred_tuned < 0] = 0
tuned_rmsle = rmsle(y_val, y_pred_tuned)

print(f'Tuned XGBoost RMSLE: {tuned_rmsle:.5f}')
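
One caveat with `scoring='neg_mean_squared_log_error'` is that the metric rejects invalid negative inputs, so a candidate model that produces a negative prediction during the search can fail to score. A possible workaround, sketched here as an assumption rather than as part of the original notebook, is a custom scorer that clips predictions before computing RMSLE. The names `clipped_rmsle` and `clipped_rmsle_scorer` are hypothetical.

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_squared_log_error

def clipped_rmsle(y_true, y_pred):
    """RMSLE after clipping negative predictions to zero."""
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(mean_squared_log_error(y_true, y_pred)))

# greater_is_better=False makes the scorer return the negated RMSLE,
# so a search that maximizes the score still minimizes the error.
clipped_rmsle_scorer = make_scorer(clipped_rmsle, greater_is_better=False)

# Made-up predictions, one of them negative.
y_true_demo = np.array([1.0, 2.0])
y_pred_demo = np.array([-0.5, 2.0])
print(clipped_rmsle(y_true_demo, y_pred_demo))
```

Passing `scoring=clipped_rmsle_scorer` to `RandomizedSearchCV` would then rank candidates by the same clipped RMSLE that is used for validation.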

## 7. Submission

In [None]:
final_predictions = best_xgb.predict(X_test_final_scaled)
final_predictions[final_predictions < 0] = 0

# Rings are whole-number counts, so round the float predictions to the nearest integer
final_predictions = np.round(final_predictions).astype(int)

submission_df = pd.DataFrame({'id': test_df['id'], 'Rings': final_predictions})
submission_df.to_csv('submission.csv', index=False)

print('Submission file created successfully!')
submission_df.head()
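
Before uploading, it can be worth confirming that the file has the expected shape and contents. The helper below is a minimal sketch; `validate_submission` is a hypothetical name, and the checks (column names, id order, no missing or negative values) are assumptions about what a valid file looks like.

```python
import numpy as np
import pandas as pd

def validate_submission(submission, expected_ids):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(submission.columns) != ['id', 'Rings']:
        return ['unexpected columns']
    problems = []
    if submission['id'].tolist() != list(expected_ids):
        problems.append('ids do not match the test set')
    if submission['Rings'].isna().any():
        problems.append('missing predictions')
    if (submission['Rings'] < 0).any():
        problems.append('negative predictions')
    return problems

# Made-up frames standing in for submission_df.
good = pd.DataFrame({'id': [1, 2, 3], 'Rings': [9, 10, 11]})
bad = pd.DataFrame({'id': [1, 2, 3], 'Rings': [9.0, np.nan, -1.0]})

print(validate_submission(good, [1, 2, 3]))
print(validate_submission(bad, [1, 2, 3]))
```

In the notebook, the call would be `validate_submission(submission_df, test_df['id'])`.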