<a href="https://colab.research.google.com/github/marcryanbritto/AbaloneAgePrediction/blob/main/%F0%9F%90%8CAbalone_Age_Prediction_Regression%F0%9F%93%92%F0%9F%93%88.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'abalone-dataset:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F37691%2F57419%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240819%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240819T013638Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D2c92621d2661d18241745d4fafcf5888647fcd2ff7149fdeae36965963e92b824e2a662f628efe07ca9171a8d9322b995d45213416e0447ad8c9add1f6fd36567e644bc1051a16bcb22f147b46a59815f3d3a5e018beaad186839cab8a83c373d1f97bacfb2aa27afd5c8e6c87d26a69169a3488b1d5bf7c81916d2c64931a5d98d78275a2bc9a0ff9bd73d9120a4f755500c3b7d3723f32c29fea93823b533aebaba829aef1e305bf6a42950db94c5742e15692693807d391e9978caac725d68790539f262e2f4d3f1748b0d8816f1a8588d1eb60140329a834f85efad91a40e34c0c7ee2e5bcc85e676fa28a5a61a3e56e74f8b65253856ca5cfaf9f5d55ca'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
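            else:
              tfile.flush()  # make sure buffered bytes are on disk before reopening by name
              # Bind the archive to its own name so the `tarfile` module is not shadowed
              with tarfile.open(tfile.name) as tar_archive:
                tar_archive.extractall(destination_path)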
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


Downloading abalone-dataset, 58677 bytes compressed
Downloaded and uncompressed: abalone-dataset
Data source import complete.


## Introduction

The abalone dataset contains measurements of physical features of abalone, along with their age (in terms of the number of rings on their shells). The columns in the dataset are:

- **Sex**: This column indicates the sex of the abalone and is of object data type. There are three possible values: 'M' for male, 'F' for female, and 'I' for infant.

- **Length**: This column contains the longest shell measurement in mm and is of float data type.

- **Diameter**: This column contains the measurement perpendicular to length in mm and is of float data type.

- **Height**: This column contains the height of the whole abalone in mm and is of float data type.

- **Whole weight**: This column contains the weight of the whole abalone in grams and is of float data type.

- **Shucked weight**: This column contains the weight of the meat of the abalone in grams and is of float data type.

- **Viscera weight**: This column contains the weight of the abalone's gut (after bleeding) in grams and is of float data type.

- **Shell weight**: This column contains the weight of the abalone's shell after being dried in grams and is of float data type.

- **Rings**: This column indicates the age of the abalone in terms of the number of rings on their shell, and is of integer data type.

These measurements can be used to predict the age of abalone without the tedious task of counting the rings through a microscope. By analyzing the relationship between these physical features and the ring count, we can build a predictive model to estimate age, which is useful in aquaculture, ecology, and fisheries management. Note that the values in this copy of the dataset are much smaller than raw millimetre or gram measurements would be (lengths range from 0.075 to 0.815 in the summary statistics), which indicates that the continuous features have been rescaled.

## Importing Dependencies

In [5]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import plotly.express as px
import plotly.graph_objs as go

## Loading Data

In [6]:
df = pd.read_csv('../input/abalone-dataset/abalone.csv')
df.head()

| | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   object 
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole weight    4177 non-null   float64
 5   Shucked weight  4177 non-null   float64
 6   Viscera weight  4177 non-null   float64
 7   Shell weight    4177 non-null   float64
 8   Rings           4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


## Data Exploration

**Convert Categorical Values into Numerical Values**

In [8]:
le=LabelEncoder()
df['Sex']=le.fit_transform(df['Sex'])

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   int64  
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole weight    4177 non-null   float64
 5   Shucked weight  4177 non-null   float64
 6   Viscera weight  4177 non-null   float64
 7   Shell weight    4177 non-null   float64
 8   Rings           4177 non-null   int64  
dtypes: float64(7), int64(2)
memory usage: 293.8 KB
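
The encoding step can be illustrated on a toy column. The sketch below is not part of the original notebook: the `toy` frame is a hypothetical stand-in for the `Sex` column. It shows that `LabelEncoder` assigns codes in sorted label order, and it shows one-hot encoding via `pd.get_dummies` as a possible alternative that does not impose an order on the categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the 'Sex' column (illustrative values only).
toy = pd.DataFrame({"Sex": ["M", "F", "I", "M"]})

le = LabelEncoder()
codes = le.fit_transform(toy["Sex"])

# classes_ holds the labels in sorted order, so the code of a label is its position.
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# Alternative: one indicator column per category, with no implied ordering.
one_hot = pd.get_dummies(toy["Sex"], prefix="Sex")
print(one_hot.columns.tolist())
```

Integer codes are usually harmless for tree-based models, while linear models such as Ridge or Lasso treat them as a numeric scale, which is one reason to consider the one-hot form.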


**Check for NaN values**

In [10]:
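df.isna().sum()    # NaN count per column; not displayed, because a cell only shows its last expression
df.isna().count()  # number of entries per column (NaN or not); this is the output shown below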

| | 0 |
|---|---|
| Sex | 4177 |
| Length | 4177 |
| Diameter | 4177 |
| Height | 4177 |
| Whole weight | 4177 |
| Shucked weight | 4177 |
| Viscera weight | 4177 |
| Shell weight | 4177 |
| Rings | 4177 |
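
The difference between the two calls in this cell is easy to miss, so here is a minimal sketch on a hypothetical two-column frame (`toy` is not part of the notebook). `isna().sum()` counts missing values, whereas `isna().count()` counts all entries of the boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in column 'a'.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [1.0, 2.0, 3.0]})

nan_counts = toy.isna().sum()    # missing values per column
row_counts = toy.isna().count()  # entries per column, missing or not

print(nan_counts.to_dict())
print(row_counts.to_dict())
```

For the abalone data, `df.info()` already reports 4177 non-null entries in every column, which is consistent with there being no missing values.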


In [11]:
df['Sex']=df['Sex'].astype('float')

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   float64
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole weight    4177 non-null   float64
 5   Shucked weight  4177 non-null   float64
 6   Viscera weight  4177 non-null   float64
 7   Shell weight    4177 non-null   float64
 8   Rings           4177 non-null   int64  
dtypes: float64(8), int64(1)
memory usage: 293.8 KB


In [13]:
df.describe()

| | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| count | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 |
| mean | 1.052909 | 0.523992 | 0.407881 | 0.139516 | 0.828742 | 0.359367 | 0.180594 | 0.238831 | 9.933684 |
| std | 0.82224 | 0.120093 | 0.09924 | 0.041827 | 0.490389 | 0.221963 | 0.109614 | 0.139203 | 3.224169 |
| min | 0.0 | 0.075 | 0.055 | 0.0 | 0.002 | 0.001 | 0.0005 | 0.0015 | 1.0 |
| 25% | 0.0 | 0.45 | 0.35 | 0.115 | 0.4415 | 0.186 | 0.0935 | 0.13 | 8.0 |
| 50% | 1.0 | 0.545 | 0.425 | 0.14 | 0.7995 | 0.336 | 0.171 | 0.234 | 9.0 |
| 75% | 2.0 | 0.615 | 0.48 | 0.165 | 1.153 | 0.502 | 0.253 | 0.329 | 11.0 |
| max | 2.0 | 0.815 | 0.65 | 1.13 | 2.8255 | 1.488 | 0.76 | 1.005 | 29.0 |


**Correlation between Input Variables**

In [14]:
df.corr()

| | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| Sex | 1.0 | -0.036066 | -0.038874 | -0.042077 | -0.021391 | -0.001373 | -0.032067 | -0.034854 | -0.034627 |
| Length | -0.036066 | 1.0 | 0.986812 | 0.827554 | 0.925261 | 0.897914 | 0.903018 | 0.897706 | 0.55672 |
| Diameter | -0.038874 | 0.986812 | 1.0 | 0.833684 | 0.925452 | 0.893162 | 0.899724 | 0.90533 | 0.57466 |
| Height | -0.042077 | 0.827554 | 0.833684 | 1.0 | 0.819221 | 0.774972 | 0.798319 | 0.817338 | 0.557467 |
| Whole weight | -0.021391 | 0.925261 | 0.925452 | 0.819221 | 1.0 | 0.969405 | 0.966375 | 0.955355 | 0.54039 |
| Shucked weight | -0.001373 | 0.897914 | 0.893162 | 0.774972 | 0.969405 | 1.0 | 0.931961 | 0.882617 | 0.420884 |
| Viscera weight | -0.032067 | 0.903018 | 0.899724 | 0.798319 | 0.966375 | 0.931961 | 1.0 | 0.907656 | 0.503819 |
| Shell weight | -0.034854 | 0.897706 | 0.90533 | 0.817338 | 0.955355 | 0.882617 | 0.907656 | 1.0 | 0.627574 |
| Rings | -0.034627 | 0.55672 | 0.57466 | 0.557467 | 0.54039 | 0.420884 | 0.503819 | 0.627574 | 1.0 |


## Exploratory Data Analysis

In [15]:
fig = px.scatter(df, x='Length', y='Rings',
                 color='Sex',
                 title='Relationship between Abalone Age and Length',
                 labels={'Length': 'Length (mm)', 'Rings': 'Age (rings)'})

fig.show()

In [16]:
fig = go.Figure(data=[go.Scatter3d(x=df['Length'], y=df['Diameter'], z=df['Height'],
                                   mode='markers', marker=dict(size=5, color=df['Rings'], colorscale='Viridis'))])
fig.update_layout(title='3D Scatter Plot of Abalone Length, Diameter, and Height',
                  scene=dict(xaxis_title='Length (mm)',
                             yaxis_title='Diameter (mm)',
                             zaxis_title='Height (mm)'))
fig.show()


In [17]:

fig = go.Figure(data=[go.Scatter(x=df['Length'], y=df['Diameter'], mode='markers',
                                 marker=dict(size=df['Whole weight'], sizemode='diameter', sizeref=0.1, color=df['Whole weight'], colorscale='Viridis', showscale=True))])

fig.update_layout(title='Bubble Chart of Abalone Length, Diameter, and Weight',
                  xaxis_title='Length (mm)',
                  yaxis_title='Diameter (mm)')
fig.show()


In [18]:
fig = go.Figure(data=[go.Violin(x=df['Sex'], y=df['Rings'], box_visible=True, points='all', jitter=0.05, marker=dict(size=1), line=dict(width=1), fillcolor='lightblue', opacity=0.6)])
fig.update_layout(title='Violin Plot of Abalone Age by Sex',
                  xaxis_title='Sex',
                  yaxis_title='Age (rings)')
fig.show()

In [19]:
corr_matrix = df.corr()
fig = go.Figure(data=go.Heatmap(z=corr_matrix.values, x=corr_matrix.index.values, y=corr_matrix.columns.values, colorscale='Viridis'))
fig.update_layout(title='Heatmap of Correlation Matrix for Abalone Dataset',
                  xaxis_title='Variable',
                  yaxis_title='Variable')

# Show the plot
fig.show()

**EDA Findings:**
- No sex category forms a majority: the encoded `Sex` column (F = 0, I = 1, M = 2) has a median of 1 and a mean of about 1.05, so males slightly outnumber females without making up more than half of the samples.
- Length and diameter are almost perfectly linearly related (correlation of about 0.99).
- Ring count rises with length, diameter, and height (correlations of about 0.56 to 0.57), so these measurements carry usable information about age.
- Whole weight is strongly correlated with length and diameter (about 0.93) and with height (about 0.82), and it also correlates with ring count (about 0.54), so it can serve as a predictor of age as well.
- The distribution of ring counts varies across sex categories, with females generally showing higher counts than males and infants.
- The ring count is the age measure in this dataset, so the data cannot confirm how reliable ring counting is; it can only show how well the other measurements track the ring count.
- The correlation matrix heatmap shows strong positive correlations among the physical measurements (roughly 0.77 to 0.99), but only moderate correlations between these measurements and the number of rings (roughly 0.42 to 0.63, highest for shell weight). This suggests that predicting the age of abalones from their physical measurements may require additional information, such as weather patterns and location (hence food availability).
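
Findings like these are easier to verify when the correlations with the target are ranked directly. The helper below is a sketch rather than part of the original notebook: `rank_by_target_correlation` is a hypothetical name, and the `toy` frame is synthetic data chosen so that one feature tracks `Rings` perfectly and the other not at all.

```python
import pandas as pd

def rank_by_target_correlation(frame, target):
    """Pearson correlation of each numeric column with `target`, strongest first."""
    corr = frame.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic stand-in data (not the abalone measurements).
toy = pd.DataFrame({
    "Noise": [0.3, 0.1, 0.4, 0.2],
    "Shell weight": [0.1, 0.2, 0.3, 0.4],
    "Rings": [5, 7, 9, 11],
})

ranked = rank_by_target_correlation(toy, "Rings")
print(ranked)
```

On the notebook's data the equivalent call would be `rank_by_target_correlation(df, "Rings")`.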

## Model Preparation

**Separate Predictors and Target**

In [20]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [21]:
X

array([[2.    , 0.455 , 0.365 , ..., 0.2245, 0.101 , 0.15  ],
       [2.    , 0.35  , 0.265 , ..., 0.0995, 0.0485, 0.07  ],
       [0.    , 0.53  , 0.42  , ..., 0.2565, 0.1415, 0.21  ],
       ...,
       [2.    , 0.6   , 0.475 , ..., 0.5255, 0.2875, 0.308 ],
       [0.    , 0.625 , 0.485 , ..., 0.531 , 0.261 , 0.296 ],
       [2.    , 0.71  , 0.555 , ..., 0.9455, 0.3765, 0.495 ]])

In [22]:
y

array([15,  7,  9, ...,  9, 10, 12])

**Split Data into Training and Testing**

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
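
As a quick sanity check of what an 80/20 split produces for 4177 rows and 8 predictors, the sketch below runs the same call on placeholder arrays of the same shape. The names `X_demo` and `y_demo` are hypothetical and hold zeros, not the abalone data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the notebook's X and y.
X_demo = np.zeros((4177, 8))
y_demo = np.zeros(4177)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The test size is rounded up, and the remaining rows go to training.
print(X_tr.shape, X_te.shape)
```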

**Make Pipeline**

In [24]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import GradientBoostingRegressor
pipelines={
'rf':make_pipeline(RandomForestRegressor(random_state=1234)),
'gb':make_pipeline(GradientBoostingRegressor(random_state=1234)),
'ridge':make_pipeline(Ridge(random_state=1234)),
'lasso':make_pipeline(Lasso(random_state=1234)),
'enet':make_pipeline(ElasticNet(random_state=1234)),
}

In [25]:
hyperparagrid={
'rf':{
'randomforestregressor__min_samples_split':[2,4,6],
'randomforestregressor__min_samples_leaf':[1,2,3]
},

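'gb':{
    # NOTE: for GradientBoostingRegressor, `alpha` is the quantile used only by the
    # 'huber' and 'quantile' losses; with the default squared-error loss it does not
    # change the fit, so this grid does not actually tune the model.
    'gradientboostingregressor__alpha':[0.001,0.005,0.01,0.05,0.1,0.5,0.99]
},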

'ridge':{
    'ridge__alpha':[0.001,0.005,0.01,0.05,0.1,0.5,0.99]
},
'lasso':{
    'lasso__alpha':[0.001,0.005,0.01,0.05,0.1,0.5,0.99]
},
'enet':{
    'elasticnet__alpha':[0.001,0.005,0.01,0.05,0.1,0.5,0.99]
}

}
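
The grid keys follow the `<step name>__<parameter>` convention, where `make_pipeline` derives each step name from the lowercased class name. The sketch below shows one way to list the valid keys for a pipeline. The `alt_grid` dictionary is a hypothetical alternative for gradient boosting and is not what the notebook ran.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(GradientBoostingRegressor(random_state=1234))

# make_pipeline names the step after the lowercased estimator class.
step_name = pipe.steps[0][0]

# Every key that may appear in a parameter grid for this step.
tunable = sorted(k for k in pipe.get_params() if k.startswith(step_name + "__"))

# Hypothetical alternative grid (an assumption, not taken from the notebook).
alt_grid = {
    f"{step_name}__learning_rate": [0.01, 0.05, 0.1],
    f"{step_name}__n_estimators": [100, 300],
}
print(step_name, len(tunable))
```

Checking keys against `pipe.get_params()` before starting a search catches misspelled parameter names early.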

In [27]:
from sklearn.model_selection import GridSearchCV
from sklearn.exceptions import NotFittedError
from sklearn.metrics import r2_score,mean_absolute_error

**Fit Models**

In [28]:
# Split original training data
#X_train_5fold, X_test_5fold, y_train_5fold, y_test_5fold = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

fit_models={}
for algo,pipeline in pipelines.items():
  for cv in [5, 10]:
    model=GridSearchCV(pipeline,hyperparagrid[algo],cv=cv,n_jobs=-1)
    try:
        #print('Start training for {}'.format(algo))
        print(f"Started training for {algo} with {cv} fold cross validation")
        model.fit(X_train,y_train)
        fit_models[f"{algo}_{cv}fold"]=model
    except NotFittedError as e:
        print(repr(e))


Started training for rf with 5 fold cross validation
Started training for rf with 10 fold cross validation
Started training for gb with 5 fold cross validation
Started training for gb with 10 fold cross validation
Started training for ridge with 5 fold cross validation
Started training for ridge with 10 fold cross validation
Started training for lasso with 5 fold cross validation
Started training for lasso with 10 fold cross validation
Started training for enet with 5 fold cross validation
Started training for enet with 10 fold cross validation
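
The loop above stores one fitted `GridSearchCV` object per algorithm and fold count. A reduced sketch of the same pattern on synthetic data shows what each stored object exposes afterwards. The data from `make_regression` and the three-value grid are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (not the abalone measurements).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=0
)

pipe = make_pipeline(Ridge())
grid = {"ridge__alpha": [0.01, 0.1, 1.0]}

results = {}
for cv in (5, 10):
    # With no `scoring` argument, regressors are scored by R^2.
    search = GridSearchCV(pipe, grid, cv=cv)
    search.fit(X_demo, y_demo)
    # best_score_ is the mean cross-validated score of the best setting.
    results[f"ridge_{cv}fold"] = (search.best_params_, search.best_score_)

for name, (params, score) in results.items():
    print(name, params, round(score, 4))
```

Scores from 5-fold and 10-fold searches are averages over different splits, so small differences between them should be read with caution.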


## Model Evaluation

**Predict our Target value**

In [29]:
for algo,model in fit_models.items():
    ya=model.predict(X_test)
    print('{} scores-R2:{} MAE:{}'.format(algo,r2_score(y_test,ya), mean_absolute_error(y_test,ya)))

rf_5fold scores-R2:0.5750197480126906 MAE:1.52702656577208
rf_10fold scores-R2:0.5718176064832873 MAE:1.53317381676338
gb_5fold scores-R2:0.5679242680278109 MAE:1.5236838865014883
gb_10fold scores-R2:0.5679242680278109 MAE:1.5236838865014883
ridge_5fold scores-R2:0.5275022255520712 MAE:1.6200109002341283
ridge_10fold scores-R2:0.5275022255520712 MAE:1.6200109002341283
lasso_5fold scores-R2:0.528090136172511 MAE:1.6169991592422972
lasso_10fold scores-R2:0.528090136172511 MAE:1.6169991592422972
enet_5fold scores-R2:0.5196433416161733 MAE:1.6370721768084187
enet_10fold scores-R2:0.5196433416161733 MAE:1.6370721768084187
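
To make the two reported metrics concrete, the sketch below computes MAE and R² by hand on four made-up ring counts and compares them with the scikit-learn functions used above. The values in `y_true` and `y_pred` are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up ring counts and predictions (illustrative only).
y_true = np.array([10, 8, 12, 9])
y_pred = np.array([9, 8, 14, 9])

# MAE: average absolute error, in the same unit as the target (rings).
mae_manual = np.mean(np.abs(y_true - y_pred))

# R^2: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Read this way, an MAE of about 1.5 means the predictions above are off by roughly one and a half rings on average.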


**Best Model: Gradient Boosting (by cross-validated R²)**

In [31]:
#best_model=fit_models['rf']

best_model = None
best_score = -float('inf')  # Initialize with a very low score
best_model_name = ""

for model_name, model in fit_models.items():
    # best_score_ is the mean cross-validated score of the best parameter setting
    # (R² here, the default scoring for regressors in GridSearchCV)
    best_r2_score = model.best_score_

    if best_r2_score > best_score:
        best_score = best_r2_score
        best_model = model
        best_model_name = model_name

print(f"The best model is {best_model_name} with an R2 score of {best_score:.4f}")

The best model is gb_10fold with an R2 score of 0.5456
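
The selection above ranks models by cross-validated score, which need not agree with the test-set ranking. As a sketch, the snippet below re-ranks the models using the test-set scores printed in the evaluation cell, rounded to four decimals. The `test_scores` dictionary is typed in by hand for illustration.

```python
# (R^2, MAE) on the test set, rounded from the evaluation output above.
test_scores = {
    "rf_5fold": (0.5750, 1.5270),
    "rf_10fold": (0.5718, 1.5332),
    "gb_5fold": (0.5679, 1.5237),
    "gb_10fold": (0.5679, 1.5237),
    "ridge_5fold": (0.5275, 1.6200),
    "ridge_10fold": (0.5275, 1.6200),
    "lasso_5fold": (0.5281, 1.6170),
    "lasso_10fold": (0.5281, 1.6170),
    "enet_5fold": (0.5196, 1.6371),
    "enet_10fold": (0.5196, 1.6371),
}

# Higher R^2 is better; lower MAE is better. Ties keep the first entry.
best_by_r2 = max(test_scores, key=lambda name: test_scores[name][0])
best_by_mae = min(test_scores, key=lambda name: test_scores[name][1])

print(best_by_r2, best_by_mae)
```

Random forest leads on test R² and gradient boosting on test MAE, with margins small enough that the two are effectively tied on this split.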


In [32]:
best_model

## Conclusion

- The physical measurements of abalones (such as length, diameter, and weight) can be used to estimate their age without the laborious and time-consuming ring counting method, but only approximately: on the held-out test set the best models reach an R² of about 0.57 and a mean absolute error of about 1.5 rings. A more accurate model would have important implications for the management and conservation of abalone populations, as it would provide a more efficient and cost-effective way to estimate age and track changes in population demographics over time.
- It is important to note that predicting the age of abalones from their physical measurements may require additional information, such as weather patterns and location, in order to achieve the highest possible accuracy. Therefore, future research should continue to explore and refine the predictive models for estimating the age of abalones using both physical measurements and environmental variables.
![](https://images.unsplash.com/photo-1619968987472-4d1b2784592e?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=2070&q=80)