# Lesson 07
# Peter Lorenz

In this assignment, we use support vector machines to identify useful predictors for the age of abalones when data about their number of rings is unavailable.

## 0. Preparation

Import the required libraries:

In [1]:
import matplotlib as mpl
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt

Set global options:

In [2]:
# Display plots inline
%matplotlib inline

# Display multiple cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Suppress scientific notation
np.set_printoptions(suppress=True)
np.set_printoptions(precision=3)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

Declare utility functions:

Read and prepare the data:

In [5]:
# Internet location of the data set
url = "https://library.startlearninglabs.uw.edu/DATASCI420/2019/Datasets/Abalone.csv"

# Download the data into a dataframe object
abalone_data = pd.read_csv(url)

# Display shape and initial data
abalone_data.shape
abalone_data.head()

# Examine column types
abalone_data.info()

(4177, 9)

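|   | Sex | Length | Diameter | Height | Whole Weight | Shucked Weight | Viscera Weight | Shell Weight | Rings |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.225 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.226 | 0.1 | 0.049 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.257 | 0.141 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.215 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.089 | 0.04 | 0.055 | 7 |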


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   object 
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole Weight    4177 non-null   float64
 5   Shucked Weight  4177 non-null   float64
 6   Viscera Weight  4177 non-null   float64
 7   Shell Weight    4177 non-null   float64
 8   Rings           4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


One-hot encode the Sex column:

In [9]:
# Use pandas to one-hot encode categorical variables
abalone_data_enc = pd.get_dummies(abalone_data, columns=['Sex'], drop_first=True)

# Display shape and sample contents
abalone_data_enc.shape
abalone_data_enc.head()

(4177, 10)

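|   | Length | Diameter | Height | Whole Weight | Shucked Weight | Viscera Weight | Shell Weight | Rings | Sex_I | Sex_M |
|---|--------|----------|--------|--------------|----------------|----------------|--------------|-------|-------|-------|
| 0 | 0.455 | 0.365 | 0.095 | 0.514 | 0.225 | 0.101 | 0.15 | 15 | 0 | 1 |
| 1 | 0.35 | 0.265 | 0.09 | 0.226 | 0.1 | 0.049 | 0.07 | 7 | 0 | 1 |
| 2 | 0.53 | 0.42 | 0.135 | 0.677 | 0.257 | 0.141 | 0.21 | 9 | 0 | 0 |
| 3 | 0.44 | 0.365 | 0.125 | 0.516 | 0.215 | 0.114 | 0.155 | 10 | 0 | 1 |
| 4 | 0.33 | 0.255 | 0.08 | 0.205 | 0.089 | 0.04 | 0.055 | 7 | 1 | 0 |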


We are now ready to build our model.

## 1. Convert continuous output value to binary and build SVC
In this section we convert the output value (Rings) to a binary label (0, 1) and build an SVC. Class '0' represents specimens with fewer than 11 rings, and class '1' represents specimens with 11 or more rings.


In [24]:
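# Convert number of rings to binary using 11+ as the cutoff.
# right=False makes the bins [0, 11) and [11, 1000), so exactly 11 rings falls in class 1
is_old = np.array(pd.cut(abalone_data_enc['Rings'], [0, 11, 1000], right=False, labels=['0', '1'])).astype(int)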
is_old

array([1, 0, 0, ..., 0, 0, 1])
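
The SVC itself could be built along the following lines. This is only a minimal sketch, not the notebook's actual model: it uses a small synthetic stand-in for the abalone features (the names `size`, `weight`, and `rings` are hypothetical), and the RBF kernel with `C=1.0` is an assumed starting point rather than a tuned choice. Features are standardized first because SVMs are sensitive to feature scale.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the abalone data (NOT the real data set)
rng = np.random.default_rng(0)
n = 300
size = rng.uniform(0.1, 0.8, n)                 # hypothetical stand-in for Length
weight = size ** 3 + rng.normal(0, 0.02, n)     # hypothetical stand-in for Whole Weight
rings = np.clip(np.round(4 + 14 * size + rng.normal(0, 1.5, n)), 1, 29).astype(int)

X = np.column_stack([size, weight])
y = (rings >= 11).astype(int)                   # 1 = 11 or more rings, 0 = fewer

# Standardize the features, then fit an RBF-kernel SVC (assumed settings)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)

pred = clf.predict(X)
print(pred[:10])
```

With the real data, `X` would be the encoded feature columns without `Rings`, and `y` would be `is_old`.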

## 2. Determine percentage of correctly classified results
In this section we determine the percentage of correctly classified results using our best guess for hyperparameters and kernel.
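
One way to obtain this percentage is to hold out part of the data and score the predictions with `accuracy_score`. The sketch below assumes a 75/25 stratified split and an RBF kernel with default-like settings, and it runs on synthetic data from `make_classification` rather than the abalone data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (stand-in for the abalone features and labels)
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)

# Hold out 25% of the rows for testing, keeping the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Best-guess model: standardized features and an RBF kernel (assumed settings)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)

# Percentage of test specimens whose class was predicted correctly
y_pred = model.predict(X_test)
pct_correct = 100.0 * accuracy_score(y_test, y_pred)
print(f"Correctly classified: {pct_correct:.1f}%")
```

Scoring on held-out rows rather than the training rows gives a less optimistic estimate of how the model would do on unseen specimens.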

## 3. Test different hyperparameters for each kernel

In this section we use sklearn.model_selection.GridSearchCV to find the best hyperparameters for each kernel and determine which kernel performs best with which settings.
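
A per-kernel search might look like the sketch below. The grid values, the 5-fold cross-validation, and the `accuracy` scoring are assumptions chosen for illustration, and the data is synthetic. Passing a list of dictionaries lets each kernel have its own set of hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (stand-in for the abalone data)
X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# One sub-grid per kernel; the values are illustrative, not recommendations
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]},
    {"svc__kernel": ["poly"], "svc__C": [0.1, 1, 10], "svc__degree": [2, 3]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# Collect the best-scoring setting for each kernel from the CV results
best_per_kernel = {}
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    kernel = params["svc__kernel"]
    if kernel not in best_per_kernel or score > best_per_kernel[kernel][1]:
        best_per_kernel[kernel] = (params, score)

for kernel, (params, score) in best_per_kernel.items():
    print(f"{kernel}: {score:.3f} with {params}")
print("Overall best:", search.best_params_)
```

Keeping the scaler inside the pipeline means it is refit on each training fold, so the validation folds do not leak into the scaling.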

## 4. Show recall, precision and f-measure for the best model

In this section we show the recall, precision, and f-measure for the best model.
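
As a small worked illustration of the three metrics, the sketch below scores ten made-up labels and predictions (not model output) with `sklearn.metrics`. Precision is TP / (TP + FP), recall is TP / (TP + FN), and the f-measure (F1) is their harmonic mean.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration: 1 = 11 or more rings, 0 = fewer
y_true = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

# For binary labels the confusion matrix flattens to TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)   # tp / (tp + fp) = 3 / 4
recall = recall_score(y_true, y_pred)         # tp / (tp + fn) = 3 / 5
f1 = f1_score(y_true, y_pred)                 # 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For the best model, `y_true` and `y_pred` would be the held-out labels and the model's predictions for them.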

## 5. Create an SVR model using the original data

In this section we create an SVR model using the original data, with Rings treated as a continuous variable. This treatment has an inherent problem: the number of rings is not a real-valued quantity but a discrete count whose values happen to be ordered, so a regression model can predict fractional ring counts that never occur in the data.
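
A minimal SVR sketch follows, again on a synthetic stand-in rather than the abalone data. The names `size`, `weight`, and `rings` are hypothetical, and `C=10.0` and `epsilon=0.5` are assumed values, not tuned ones. The last line shows one possible way to map the real-valued predictions back to whole ring counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the abalone data (NOT the real data set)
rng = np.random.default_rng(1)
n = 300
size = rng.uniform(0.1, 0.8, n)                 # hypothetical stand-in for Length
weight = size ** 3 + rng.normal(0, 0.02, n)     # hypothetical stand-in for Whole Weight
rings = np.clip(np.round(4 + 14 * size + rng.normal(0, 1.5, n)), 1, 29)

X = np.column_stack([size, weight])
X_train, X_test, y_train, y_test = train_test_split(
    X, rings, test_size=0.25, random_state=0
)

# Standardize the features, then fit an RBF-kernel SVR (assumed settings)
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.5))
svr.fit(X_train, y_train)

# Predictions are real-valued even though ring counts are whole numbers
pred = svr.predict(X_test)
pred_rounded = np.clip(np.round(pred), 1, None).astype(int)
print(pred[:5])
print(pred_rounded[:5])
```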

## 6. Report on the predicted variance and the mean squared error

In this section we report the variance in Rings explained by the SVR predictions and the mean squared error.
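
Both quantities are available in `sklearn.metrics`. The sketch below assumes that "predicted variance" refers to the explained variance score, 1 - Var(y - ŷ) / Var(y), and uses five made-up ring counts and predictions rather than model output.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, mean_squared_error

# Made-up ring counts and real-valued predictions, for illustration only
y_true = np.array([7, 9, 10, 15, 8])
y_pred = np.array([8.0, 9.5, 9.0, 12.0, 8.5])

# Mean squared error: average of the squared residuals
mse = mean_squared_error(y_true, y_pred)

# Explained variance: share of the variance in y_true not left in the residuals
explained = explained_variance_score(y_true, y_pred)

print(f"MSE: {mse:.3f}")
print(f"Explained variance: {explained:.3f}")
```

Unlike R², the explained variance score ignores a constant offset in the predictions, so the two scores differ whenever the residuals do not average to zero.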

## Summary

TODO