# Predicting Abalone Age
CS 6140 Midterm - Problem 1
Author: Sid Nagaich

Problem: Using the Abalone Data Set, build a system to predict the age of an abalone from physical features using both a multi-class classifier and a regressor.

In [1]:
# import libraries
import numpy as np
import pandas as pd

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Preparing the Data
We add headers to the data since the .data file does not contain them. We use a pandas dataframe to read the data, and we create a dictionary of the frequencies of the number of rings. We discard the unique instances of abalones with a particular ring count, as we will not be able to use this data. We then inspect the data. 

In [2]:
# add headers to data
headers = ["sex", "length", "diameter", "height", "whole weight", 
                "shucked weight", "viscera weight", "shell weight", "rings"]

# read data
data = pd.read_csv("abalone.data", names=headers)

# find counts of ring numbers
counts = dict()
for n in data['rings'].tolist():
    if n in counts:
        counts[n] += 1
    else:
        counts[n] = n

# drop instances of rings that only occur once, as we need at least two occurences to train the model
to_drop = []
for k, v in counts.items():
    if (v == 1):
        to_drop.append(k)

data = data[~data['rings'].isin([to_drop])]

# print number of samples
print("Number of samples: %d" % len(data))

# show first 5 rows
data.head()

Number of samples: 4177


Unnamed: 0,sex,length,diameter,height,whole weight,shucked weight,viscera weight,shell weight,rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


We replace the 'sex' indicator (M/F/I) with a boolean matrix that indicates this information, visualized below:

In [3]:
# replace M/F/I indicator with M, F, I columns with a boolean value
for sex in "MFI":
    data[sex] = data["sex"] == sex

# delete sex column
del data["sex"]

# show new columns
data.head()

Unnamed: 0,length,diameter,height,whole weight,shucked weight,viscera weight,shell weight,rings,M,F,I
0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15,True,False,False
1,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7,True,False,False
2,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9,False,True,False
3,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10,True,False,False
4,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7,False,False,True


# Partitioning our Data

Since we do not have a huge amount of data, we are going to use an 75% / 25% split between training and testing data. Future considerations would be to use a further split for validation data or to stratify our splits.

In [4]:
# rings are what we want to predict
y = data.loc[:,'rings']

# remove rings from the data we will use to build our model
del data["rings"]

# grab new headers from coding 'sex' column as booleans
headers = [h for h in data.columns]

# x is data with which we will predict y
x = data.loc[:, headers]

In [5]:
from sklearn.model_selection import train_test_split

# shuffle data before splitting 75/25
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.25, random_state=13)

# First Model: Stochastic Gradient Descent (Regression)
We use SGD to create a regressor that will predict the number of rings. This model tends to perform better than the classifiers below, as the output is continuous (and therefore often not a discrete "wrong" value).

In [6]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.linear_model import SGDRegressor

# normalize data and use Stochastic Gradient Descent as regressor
sgd_reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3)).fit(train_x, train_y)


sgd_reg_predictions = sgd_reg.predict(test_x)

# show some predictions
i = 0
for pre in test_y:
    if (i > 10): break
    print("Actual No. Rings: " + str(pre) + " | Predicted No. Rings: " + str(sgd_reg_predictions[i]))
    i += 1

# model score
print("\nscore: " + str(sgd_reg.score(test_x, test_y)))

Actual No. Rings: 9 | Predicted No. Rings: 9.085132205857857
Actual No. Rings: 10 | Predicted No. Rings: 9.891641134553632
Actual No. Rings: 11 | Predicted No. Rings: 13.344254557050494
Actual No. Rings: 5 | Predicted No. Rings: 7.302057540654564
Actual No. Rings: 9 | Predicted No. Rings: 7.336480203017659
Actual No. Rings: 5 | Predicted No. Rings: 7.3634434221920815
Actual No. Rings: 8 | Predicted No. Rings: 9.31099503166225
Actual No. Rings: 8 | Predicted No. Rings: 7.834222766059426
Actual No. Rings: 10 | Predicted No. Rings: 8.973296027288553
Actual No. Rings: 6 | Predicted No. Rings: 6.841478589109435
Actual No. Rings: 9 | Predicted No. Rings: 11.236360607029923

score: 0.5542295132870674


# Second Model: Stochastic Gradient Descent (Classification)
We use SGD to create a classifier that will predict the number of rings. Classification for the number of rings proves to be an arduous task. While many guesses may be close to the correct number of rings, the exact classification does not correctly occur as often as we would like.

In [7]:
from sklearn.linear_model import SGDClassifier

# normalize data and use Stochastic Gradient Descent as classifier
sgd_clf = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000, tol=1e-3)).fit(train_x, train_y)

sgd_clf.score(test_x,test_y)

sgd_clf_predictions = sgd_clf.predict(test_x)
print(classification_report(test_y, sgd_clf_predictions))
print("score: " + str(sgd_clf.score(test_x, test_y)))

              precision    recall  f1-score   support

           3       0.20      0.67      0.31         3
           4       0.00      0.00      0.00        11
           5       0.00      0.00      0.00        36
           6       0.20      0.92      0.32        74
           7       0.12      0.01      0.02       105
           8       0.11      0.22      0.15       139
           9       0.24      0.27      0.26       164
          10       0.26      0.09      0.13       157
          11       0.09      0.01      0.01       125
          12       0.00      0.00      0.00        61
          13       0.09      0.16      0.12        49
          14       0.00      0.00      0.00        24
          15       1.00      0.03      0.06        31
          16       0.25      0.17      0.20        18
          17       0.08      0.29      0.13        14
          18       0.00      0.00      0.00        12
          19       0.00      0.00      0.00         5
          20       0.00    

# Third Model: Gradient Boosted Trees (Classification)
We use GBT to create a classifier that will predict the number of rings. We see similar issues with classifying this type of data.

In [8]:
from sklearn.ensemble import GradientBoostingClassifier

# use Gradient-Boosted Trees to fit a classifier model to our training data
clf = make_pipeline(StandardScaler(), GradientBoostingClassifier()).fit(train_x, train_y)

# use model to predict
clf_predictions = clf.predict(test_x)

# show P, R, F1 for each class (number of rings)
print(classification_report(test_y, clf_predictions))

# score the model on our test data
print("score: " + str(clf.score(test_x, test_y)))

              precision    recall  f1-score   support

           2       0.00      0.00      0.00         0
           3       0.00      0.00      0.00         3
           4       0.25      0.18      0.21        11
           5       0.39      0.31      0.34        36
           6       0.28      0.24      0.26        74
           7       0.27      0.32      0.30       105
           8       0.32      0.36      0.34       139
           9       0.25      0.35      0.29       164
          10       0.23      0.28      0.26       157
          11       0.23      0.16      0.19       125
          12       0.07      0.05      0.06        61
          13       0.12      0.08      0.10        49
          14       0.05      0.04      0.05        24
          15       0.00      0.00      0.00        31
          16       0.06      0.06      0.06        18
          17       0.00      0.00      0.00        14
          18       0.00      0.00      0.00        12
          19       0.00    

# Fourth Model: Gradient Boosted Trees (Regression)
We use GBT to create a regressor that will predict the number of rings. Regression again performs better than classification for this particular application.

In [9]:
from sklearn.ensemble import GradientBoostingRegressor

# use Gradient-Boosted Trees to fit a regression model to our training data
reg = make_pipeline(StandardScaler(), GradientBoostingRegressor()).fit(train_x, train_y)

# score the model on our test data
reg.score(test_x, test_y)

# use model to predict
reg_predictions = reg.predict(test_x)

# show some predictions
i = 0
for pre in test_y:
    if (i > 10): break
    print("Actual No. Rings: " + str(pre) + " | Predicted No. Rings: " + str(reg_predictions[i]))
    i += 1

# model score
print("\nscore: " + str(reg.score(test_x, test_y)))

Actual No. Rings: 9 | Predicted No. Rings: 9.167628827944247
Actual No. Rings: 10 | Predicted No. Rings: 9.68213210277419
Actual No. Rings: 11 | Predicted No. Rings: 14.200125880017742
Actual No. Rings: 5 | Predicted No. Rings: 6.780285828246221
Actual No. Rings: 9 | Predicted No. Rings: 6.921752552550982
Actual No. Rings: 5 | Predicted No. Rings: 6.877112072640991
Actual No. Rings: 8 | Predicted No. Rings: 9.115095815456975
Actual No. Rings: 8 | Predicted No. Rings: 7.183724646395479
Actual No. Rings: 10 | Predicted No. Rings: 9.419611540751522
Actual No. Rings: 6 | Predicted No. Rings: 7.399828419036418
Actual No. Rings: 9 | Predicted No. Rings: 14.231005225124584

score: 0.5599121049116715


# Discussion
Overall, the regression models perform better than classifiers. I believe that allowing some room for error (rather than hard cut-offs in classifications) allows the model to better predict age, which is in a sense a continuous measure. We normalized the data in all of or usages. In the case of SGD, normalized data provides faster convergence. In GBTs, normalized data prevents overfitting. I do not suspect the models to be overfit to the data, as their predictive power is mediocre at best, and the different models perform similarly. This is another benefit of having multiple models to compare.