# Lesson 07 Assignment

## Background

    Kennedy's oceanographic institute client pulled into port the other day with a ton (literally) of collected samples and corresponding data to process. Some of these data tasks are being distributed to others to work on; you've got the abalone (marine snails) data to classify and determine the age from physical characteristics. 

    Age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope. Other measurements, which are easier to obtain, could be used to predict the age. According to the data provider, original data examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values have been scaled (by dividing by 200) for use with machine learning algorithms such as SVMs.

    The target field is “Rings”. Since the output is continuous, the problem can be handled with Support Vector Regression, or it can be recast as binary Support Vector Classification by assigning examples younger than 11 years old to class ‘0’ and older examples to class ‘1’.
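
The thresholding described above can be sketched as follows. This is a minimal illustration on a made-up `toy` frame, not the notebook's own data; the names `toy`, `RING_THRESHOLD` and `Older` are hypothetical, and it assumes the cut-off is applied directly to the ring count, with exactly 11 rings falling into class 1.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Abalone data; only 'Rings' matters here.
toy = pd.DataFrame({"Rings": [4, 9, 10, 11, 15, 22]})

# Assumption: ring counts below 11 map to class 0, 11 and above to class 1.
RING_THRESHOLD = 11
toy["Older"] = (toy["Rings"] >= RING_THRESHOLD).astype(int)

print(toy)
print(toy["Older"].value_counts().sort_index())
```

The same comparison applied to the full `Rings` column would yield a 0/1 target suitable for an SVC.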

## Instructions

    It is recommended you complete the lab exercises for this lesson before beginning the assignment.

    Using the Abalone CSV file, create a new notebook to build an experiment using a support vector machine classifier and regression. Perform each of the following tasks and answer the questions:

    (1) Convert the continuous output value from continuous to binary (0,1) and build an SVC
    (2) Using your best guess for hyperparameters and kernel, what is the percentage of correctly classified results?
    (3) Test different kernels and hyperparameters or consider using sklearn.model_selection.GridSearchCV. Which kernel performed best with what settings?
    (4) Show recall, precision and f-measure for the best model

In [None]:
# Import packages

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split 
from sklearn import svm, metrics
from sklearn.metrics import classification_report
import category_encoders as ce

## (1) Convert the continuous output value from continuous to binary (0,1) and build an SVC

In [None]:
# Read the Abalone CSV from a local path

data = pd.read_csv("/Users/matt.denko/Downloads/Abalone.csv") 
data.columns = ['Sex',
'Length',
'Diameter',
'Height',
'Whole Weight',
'Shucked Weight',
'Viscera Weight',
'Shell Weight',
'Rings'] 
(nrows, ncols) = data.shape
print(data.columns)
print(data.describe())  # print explicitly; only the last expression in a cell is displayed
data.head()

In [None]:
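# Count missing values ("?" marks a missing entry), then remove incomplete rows

data = data.replace(to_replace="?", value=np.nan)
data_null = data.isnull().sum()
print(data_null)
print("There are", int((data_null > 0).sum()), "columns with missing data")
data = data.dropna()  # drop any rows that contain missing values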

In [None]:
# Define the target and features:

target_label = 'Rings'
feature_labels = [x for x in data.columns if x not in [target_label]]

# Get target and original x-matrix

y = data[target_label]
x = data[feature_labels].to_numpy()  # as_matrix() was removed in pandas 1.0

In [None]:
# Split dataset into training set and test set

X_train, X_test, Y_train, Y_test = train_test_split(x, y, 
                                  test_size=0.3,random_state=42) # 70% training and 30% test

# One-hot encode the categorical features

le =  ce.OneHotEncoder(return_df=False, handle_missing="ignore", handle_unknown="ignore")

# Fit the encoder on the training set only, then transform both sets

le.fit(X_train)
X_encoded_train = le.transform(X_train)
X_encoded_test = le.transform(X_test)
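
As an alternative to `category_encoders`, the encoding step could be sketched with scikit-learn's own preprocessing tools. This is only an illustration on a made-up four-row frame (`toy`, `preprocess`, `train_encoded` and `test_encoded` are hypothetical names), and the added `StandardScaler` is an assumption rather than part of the notebook above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows using a subset of the Abalone column names.
toy = pd.DataFrame({
    "Sex": ["M", "F", "I", "M"],
    "Length": [0.455, 0.530, 0.330, 0.440],
    "Height": [0.095, 0.135, 0.080, 0.125],
})
train, test = toy.iloc[:3], toy.iloc[3:]

# One-hot encode 'Sex' and standardize the numeric columns.
# sparse_threshold=0 forces a dense array as output.
preprocess = ColumnTransformer(
    [
        ("sex", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
        ("num", StandardScaler(), ["Length", "Height"]),
    ],
    sparse_threshold=0,
)

# Fit on the training rows only, then apply the same mapping to the test rows.
train_encoded = preprocess.fit_transform(train)
test_encoded = preprocess.transform(test)
print(train_encoded.shape, test_encoded.shape)
```

Fitting on the training rows alone keeps information from the test rows out of the encoder and the scaler.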

## (2) Using your best guess for hyperparameters and kernel, what is the percentage of correctly classified results?

In [None]:
# Set the parameters

cost = .9 # penalty parameter of the error term
gamma = 5 # defines the influence of input vectors on the margins

In [None]:
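# Test a LinearSVC

clf1 = svm.LinearSVC(C=cost).fit(X_encoded_train, Y_train)
Y_pred = clf1.predict(X_encoded_test)
print("LinearSVC")
print("Accuracy:", accuracy_score(Y_test, Y_pred))  # fraction correctly classified
# classification_report expects (y_true, y_pred), in that order
print(classification_report(Y_test, Y_pred))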

## (3) Test different kernels and hyperparameters or consider using sklearn.model_selection.GridSearchCV. Which kernel performed best with what settings?

In [None]:
# Test linear, rbf and poly kernels

for k in ('linear', 'rbf', 'poly'):
    clf = svm.SVC(gamma=gamma, kernel=k, C=cost).fit(X_encoded_train, Y_train)
    Y_pred = clf.predict(X_encoded_test)
    print(k)
    # classification_report expects (y_true, y_pred), in that order
    print(classification_report(Y_test, Y_pred))
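
The grid search mentioned in the task could look roughly like the sketch below. It runs on synthetic data from `make_classification` rather than the Abalone file, and the parameter grid, the 3-fold setting and the names `X_demo`, `y_demo`, `param_grid` and `search` are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic binary data standing in for the encoded Abalone features.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Hypothetical search space: every kernel/C/gamma combination is cross-validated.
param_grid = {
    "kernel": ["linear", "rbf", "poly"],
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.1, 1],
}
search = GridSearchCV(SVC(), param_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best settings:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)
print("Held-out accuracy:", search.score(X_te, y_te))
```

`best_params_` reports the winning kernel together with its `C` and `gamma`, which is the information question (3) asks for.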

In [None]:
# To make plotting easier, let's just use two features.

X = data.loc[:,('Height','Length')]
Y = y

h = .02  # step size in the mesh (the features are scaled, so use a fine step)
cost = .9  # cost
gamma = 10 # gamma 

# testing other kernels on unscaled data (for plotting the support vectors)

svc = svm.SVC(kernel='linear', C=cost).fit(X, Y)
rbf_svc = svm.SVC(kernel='rbf', gamma=gamma, C=cost).fit(X, Y)
poly_svc = svm.SVC(kernel='poly', gamma=gamma, degree=3, C=cost).fit(X, Y)
lin_svc = svm.LinearSVC(C=cost).fit(X, Y)

# create a mesh to plot in

x_min, x_max = X.iloc[:, 0].min() - 1, X.iloc[:, 0].max() + 1
y_min, y_max = X.iloc[:, 1].min() - 1, X.iloc[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# title for the plots

titles = ['SVC with linear kernel',
          'SVC with RBF kernel',
          'SVC with polynomial kernel',
          'LinearSVC (linear kernel)']

for i, model in enumerate((svc, rbf_svc, poly_svc, lin_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.subplot(2, 2, i + 1)
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z)
    plt.axis('off')
    
    # Plot also the training points
    plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=Y, cmap=plt.cm.Paired)
    plt.title(titles[i])

plt.show()

### Comments:

    The kernel that performed best was the polynomial kernel.

## (4) Show recall, precision and f-measure for the best model

In [None]:
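# Test linear, rbf and poly kernels
# (if the cells are run in order, gamma and cost hold the values set in the plotting cell)

for k in ('linear', 'rbf', 'poly'):
    clf = svm.SVC(gamma=gamma, kernel=k, C=cost).fit(X_encoded_train, Y_train)
    Y_pred = clf.predict(X_encoded_test)
    print(k)
    # classification_report expects (y_true, y_pred), in that order
    print(classification_report(Y_test, Y_pred))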

### Comments:

    The kernel that performed best was the polynomial kernel, with a micro-averaged F-measure, recall, and precision of .29. The linear and rbf kernels had lower precision, recall, and F-measure scores.
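
To report the three metrics for a single model as plain numbers, `precision_recall_fscore_support` can be used. The sketch below uses made-up binary labels (`y_true` and `y_pred` are hypothetical, not taken from the models above) and also shows why the argument order matters: swapping the two arguments swaps precision and recall.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels: three true positives, one false positive, no false negatives.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 1, 0]

# Correct order: true labels first, predictions second.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Swapped order: precision and recall trade places, F1 is unchanged.
swapped_p, swapped_r, swapped_f1, _ = precision_recall_fscore_support(
    y_pred, y_true, average="binary"
)
print(f"swapped: precision={swapped_p:.3f} recall={swapped_r:.3f} f1={swapped_f1:.3f}")
```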