# Lesson 8 Assignment - Abalone Age Determination

## Author - Kenji Oman

### Background
Age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope. Other measurements, which are easier to obtain, could be used to predict the age. According to the data provider, original data examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values have been scaled (by dividing by 200) for use with machine learning algorithms such as SVMs and ANNs.

The target field is “Rings”. Since the output is continuous, the problem can be handled with Support Vector Regression, or it can be recast as a binary Support Vector Classification by assigning examples that are younger than 11 years old to class ‘0’ and all remaining examples to class ‘1’.

Predict the age using the following attributes (name / data type / unit / description):
* Sex / nominal / -- / M, F, and I (infant)
* Length / continuous / mm / Longest shell measurement
* Diameter / continuous / mm / perpendicular to length
* Height / continuous / mm / with meat in shell
* Whole weight / continuous / grams / whole abalone
* Shucked weight / continuous / grams / weight of meat
* Viscera weight / continuous / grams / gut weight (after bleeding)
* Shell weight / continuous / grams / after being dried

See [UCI's Abalone Data set](https://archive.ics.uci.edu/ml/datasets/abalone) for more information.

## Tasks
Using the provided abalone.csv file, build an experiment with a support vector machine classifier and a support vector regressor. Complete the following tasks and answer the questions:

1. Convert the continuous output value to binary (0, 1) and build an SVC
2. Using your best guess for hyperparameters and kernel, what is the percentage of correctly classified results?
3. Test different kernels and hyperparameters, or consider using `sklearn.model_selection.GridSearchCV`. Which kernel performed best, and with what settings?
4. Show recall, precision and f-measure for the best model
5. Using the original data, with rings as a continuous variable, create an SVR model
6. Report on the predicted variance and the mean squared error

In [1]:
# Data set contains 4177 rows and 9 columns.
#URL = "https://library.startlearninglabs.uw.edu/DATASCI420/Datasets/abalone.csv"
# Since the internet connection can be spotty at times, use a local cache of the file
URL = "abalone.csv"

In [2]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC, SVR
from sklearn.metrics import (accuracy_score, explained_variance_score, f1_score,
                             mean_squared_error, precision_score, recall_score)

In [3]:
# First, load the data
df = pd.read_csv(URL)

# And set a binary version of Rings
df['BinaryRings'] = (df.Rings >= 11).astype(int)

# Also need a binary version of Sex (note: this groups F and I together)
df['IsMale'] = (df.Sex == 'M').astype(int)

# And drop the original sex column
df.drop(columns='Sex', inplace=True)
df.head()

| | Length | Diameter | Height | Whole Weight | Shucked Weight | Viscera Weight | Shell Weight | Rings | BinaryRings | IsMale |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 | 1 | 1 |
| 1 | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 | 0 | 1 |
| 2 | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 | 0 | 0 |
| 3 | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 | 0 | 1 |
| 4 | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 | 0 | 0 |
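
The `IsMale` flag treats female and infant abalone as a single group. If that distinction turns out to matter, one alternative is one-hot encoding. The following is only a sketch on a hypothetical frame (`toy` and its values are invented for illustration); it is not the encoding behind the results reported in this notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the abalone data (values are made up)
toy = pd.DataFrame({
    "Sex": ["M", "F", "I", "F"],
    "Length": [0.455, 0.53, 0.33, 0.44],
})

# One indicator column per category, so F and I stay distinguishable
encoded = pd.get_dummies(toy, columns=["Sex"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Each row then has exactly one of `Sex_F`, `Sex_I`, `Sex_M` set to 1.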


In [4]:
# Perform a train/test split
train_df, test_df = train_test_split(df, random_state=100)
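
When no size is given, `train_test_split` holds out 25% of the rows for testing. For a binary target with uneven classes, passing `stratify` keeps the class ratio nearly identical in both parts. A minimal sketch on synthetic labels (the frame `toy` and its class proportions are made up; only `random_state=100` is borrowed from the cell above):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with a minority positive class (made-up data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "label": (rng.random(1000) < 0.35).astype(int),
})

# Stratifying on the label keeps its class ratio nearly equal in both parts
train_part, test_part = train_test_split(
    toy, random_state=100, stratify=toy["label"]
)

print(len(train_part), len(test_part))
print(round(train_part["label"].mean(), 3), round(test_part["label"].mean(), 3))
```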

In [5]:
# Using default hyperparameters and kernel, find accuracy rate
clf = SVC()
clf.fit(train_df.drop(columns=['Rings', 'BinaryRings']), train_df.BinaryRings)
clf_accuracy = clf.score(test_df.drop(columns=['Rings', 'BinaryRings']), test_df.BinaryRings)
print('SVC Accuracy: {}%'.format(round(clf_accuracy * 100, 2)))

SVC Accuracy: 74.26%
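
An accuracy figure is easier to judge next to a majority-class baseline, which always predicts the most frequent label. The sketch below uses `DummyClassifier` on synthetic labels (the 70/30 split is invented for illustration and is not the abalone class balance).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with 70% zeros; the features are irrelevant to this baseline
y = np.array([0] * 70 + [1] * 30)
X = np.zeros((len(y), 1))

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_accuracy = baseline.score(X, y)
print('Baseline accuracy: {}%'.format(round(baseline_accuracy * 100, 2)))
```

In this notebook, the equivalent check would fit the baseline on the `train_df` features and `BinaryRings`, then score it on `test_df`.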


In [6]:
# Now, try a grid search with cross-validation to test different kernels and hyperparameters
# First, define the parameters to search over
gs_params = [
    {'kernel': ['linear'],
     'shrinking': [True, False],
     'class_weight': [None, 'balanced']},
    
    {'kernel': ['poly'],
     'degree': list(np.arange(1, 5)),
     'shrinking': [True, False],
     'class_weight': [None, 'balanced']},
    
    {'kernel': ['rbf'],
     'shrinking': [True, False],
     'class_weight': [None, 'balanced']},
    
    {'kernel': ['sigmoid'],
     'shrinking': [True, False],
     'class_weight': [None, 'balanced']}
]

# Then, run the grid search
gscv = GridSearchCV(clf, param_grid=gs_params, scoring='accuracy', cv=10)
gscv.fit(train_df.drop(columns=['Rings', 'BinaryRings']), train_df.BinaryRings)
gscv.best_estimator_

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

The best-performing model uses a linear kernel with the shrinking heuristic enabled and no class weighting. Note that `C` and `gamma` were not part of the search, so they remain at their defaults (`C=1.0`, `gamma='auto'`).
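
A possible extension is to search over `C` and `gamma` as well, and to standardize the features inside a pipeline so the scaler is refit on each cross-validation fold. The sketch below runs on synthetic data from `make_classification`; the grid values, the 5-fold setting, and the step names `scale` and `svc` are assumptions chosen for illustration, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the abalone features and binary target
X, y = make_classification(n_samples=300, n_features=8, random_state=100)

# Scaling inside the pipeline is refit on each CV training fold
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Illustrative grid; the value ranges are assumptions, not tuned choices
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10],
     "svc__gamma": ["scale", 0.01, 0.1]},
]

search = GridSearchCV(pipe, param_grid=param_grid, scoring="accuracy", cv=5)
search.fit(X, y)

# Winning settings and their mean cross-validated accuracy
print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_params_` and `best_score_` answer the "which kernel, with what settings" question more directly than reading the estimator's repr.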

In [7]:
# Want recall, precision, and F-measure for the best model
predictions = gscv.predict(test_df.drop(columns=['Rings', 'BinaryRings']))
print('Recall: {}%'.format(round(recall_score(test_df.BinaryRings, predictions) * 100, 2)))
print('Precision: {}%'.format(round(precision_score(test_df.BinaryRings, predictions) * 100, 2)))
print('F-measure: {}%'.format(round(f1_score(test_df.BinaryRings, predictions) * 100, 2)))

Recall: 51.52%
Precision: 74.4%
F-measure: 60.88%
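
All three scores follow from confusion-matrix counts: recall is TP / (TP + FN), precision is TP / (TP + FP), and the F-measure is their harmonic mean. The sketch below computes them by hand on made-up toy labels (not the abalone predictions), compares them with scikit-learn's scorers, and then checks that the F-measure reported above is consistent with the reported precision and recall.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hand-made toy labels, not the abalone predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)        # share of actual positives that were found
precision = tp / (tp + fp)     # share of predicted positives that were right
f_measure = 2 * precision * recall / (precision + recall)

# Manual values next to scikit-learn's scorers
print(recall, recall_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(f_measure, f1_score(y_true, y_pred))

# Harmonic mean of the precision and recall reported above (74.4% and 51.52%)
reported_f = 2 * 0.744 * 0.5152 / (0.744 + 0.5152)
print(round(reported_f * 100, 2))
```

The low recall relative to precision means the model misses many of the older abalone while being fairly reliable when it does predict class 1.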


In [8]:
# Now, make regression predictions, and give the variance of the predictions and the MSE
reg = SVR()
reg.fit(train_df.drop(columns=['Rings', 'BinaryRings']), train_df.Rings)
predicted = reg.predict(test_df.drop(columns=['Rings', 'BinaryRings']))
print('Predicted Variance: {}'.format(round(np.var(predicted), 2)))
print('Actual Variance: {}'.format(round(np.var(test_df.Rings), 2)))
print('MSE: {}'.format(round(mean_squared_error(test_df.Rings, predicted), 2)))
print('Explained Variance: {}%'.format(round(explained_variance_score(test_df.Rings, predicted) * 100, 2)))

Predicted Variance: 3.2
Actual Variance: 10.74
MSE: 6.17
Explained Variance: 45.82%


So, the variance of our predictions is 3.2, while the actual variance of the target "Rings" is 10.74, which means the predictions are spread far less widely than the true values. The model has an MSE of 6.17 and explains 45.82% of the actual variance.
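
The MSE and the explained variance are linked: the MSE equals the variance of the residuals plus the squared mean residual (the bias), while explained variance is 1 minus the residual variance divided by the target variance. The sketch below first checks both identities on synthetic data, then applies them to the rounded figures above. The resulting bias estimate is approximate because its inputs are rounded, and its sign cannot be recovered from these figures alone.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, mean_squared_error

# Synthetic targets and deliberately biased predictions (made-up data)
rng = np.random.default_rng(0)
y_true = rng.normal(loc=10.0, scale=3.0, size=500)
y_pred = y_true + rng.normal(loc=-0.5, scale=2.0, size=500)

residual = y_true - y_pred
mse = mean_squared_error(y_true, y_pred)

# MSE splits into residual variance plus squared mean residual (bias)
decomposed = np.var(residual) + np.mean(residual) ** 2
print(np.isclose(mse, decomposed))

# Explained variance ignores the bias term: 1 - Var(residual) / Var(y)
ev = explained_variance_score(y_true, y_pred)
print(np.isclose(ev, 1 - np.var(residual) / np.var(y_true)))

# Applying the same identities to the rounded figures reported above
actual_var, reported_mse, reported_ev = 10.74, 6.17, 0.4582
residual_var = actual_var * (1 - reported_ev)
bias_magnitude = np.sqrt(reported_mse - residual_var)
print(round(residual_var, 2), round(bias_magnitude, 2))
```

Under these assumptions, the residual variance is about 5.82 and the predictions are off by roughly 0.59 rings on average, which suggests a modest systematic offset on top of the unexplained spread.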