# Classification -- Homework 3

### Basic Information:
Dataset Used: Iris Dataset. SVM vs. kNN Comparison

##### Split the dataset into 
- 70% for the training set
- 15% for the development set
- 15% for the test set
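
As a minimal, library-free sketch of the 70/15/15 split listed above: the helper name `split_indices`, its parameters, and the fixed seed are assumptions made for illustration and are not part of the notebook, which shuffles with `np.random.permutation` and slices the DataFrame instead.

```python
import random

def split_indices(n_rows, train_pct=70, dev_pct=15, seed=0):
    """Shuffle row indices and cut them into train / development / test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is repeatable

    # Integer arithmetic avoids floating-point rounding surprises
    n_train = n_rows * train_pct // 100
    n_dev = n_rows * dev_pct // 100

    train_idx = indices[:n_train]
    dev_idx = indices[n_train:n_train + n_dev]
    test_idx = indices[n_train + n_dev:]  # whatever is left over
    return train_idx, dev_idx, test_idx

train_idx, dev_idx, test_idx = split_indices(150)
print(len(train_idx), len(dev_idx), len(test_idx))  # prints: 105 22 23
```

With 150 rows, 15% is 22.5, so one of the two smaller sets ends up with 23 rows; in this sketch the extra row goes to the test set.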

Choose an appropriate evaluation metric for comparing the kNN results against the SVM techniques.
For kNN -- the best k value was found and reported after going through a range of numbers.
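
The notebook imports `accuracy_score` and `f1_score` from scikit-learn as candidate metrics. To show what those two numbers measure, here is a small library-free sketch; the function names `accuracy` and `macro_f1` and the toy labels are assumptions made for illustration, not results from this homework.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels, only to exercise the two functions
y_true = ['setosa', 'setosa', 'versicolor', 'virginica']
y_pred = ['setosa', 'versicolor', 'versicolor', 'virginica']
print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Because the Iris classes are balanced (50 rows each), accuracy and macro F1 tend to tell a similar story here; macro F1 matters more when the classes are uneven.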

In [3]:
# Mary B. Makarious
# Homework 3 -- Classification

In [4]:
# Import Packages 

import numpy as np
import pandas as pd # data analysis tools
import matplotlib.pyplot as plt # to visualize the data
import seaborn as sns
import math
from sklearn import svm, datasets
import sklearn.metrics as met
from sklearn.metrics import accuracy_score, f1_score

# Keep everything in Jupyter
%matplotlib inline 

# Suppress all warnings (this silences every warning, not only the Pandas ones)
import warnings
warnings.filterwarnings("ignore")

In [5]:
# Iris dataset that comes with the seaborn package
data_init = sns.load_dataset('iris')
data = data_init.reindex(np.random.permutation(data_init.index))

rows = len(data)
columns = len(data.keys())

# Split the data into the train, test, and development sets

percent_15_1_start = 0 # test set
percent_15_2_start = int(rows * 0.15) # development set 
percent_70_start = int(rows * 0.3) # train set

test = data[percent_15_1_start:percent_15_2_start].reset_index()
development = data[percent_15_2_start:percent_70_start].reset_index()
train = data[percent_70_start:rows].reset_index()

# Print out the Datasets 

print(train)
print('-'*16)
print(development)
print('-'*16)
print(test)

     index  sepal_length  sepal_width  petal_length  petal_width     species
0       63           6.1          2.9           4.7          1.4  versicolor
1       52           6.9          3.1           4.9          1.5  versicolor
2      147           6.5          3.0           5.2          2.0   virginica
3      110           6.5          3.2           5.1          2.0   virginica
4       97           6.2          2.9           4.3          1.3  versicolor
5      137           6.4          3.1           5.5          1.8   virginica
6       11           4.8          3.4           1.6          0.2      setosa
7       93           5.0          2.3           3.3          1.0  versicolor
8      101           5.8          2.7           5.1          1.9   virginica
9      130           7.4          2.8           6.1          1.9   virginica
10     127           6.1          3.0           4.9          1.8   virginica
11     144           6.7          3.3           5.7          2.5   virginica

## Problem 1 -- Training a Classifier 

- Use logistic regression or SVM implementations in scikit-learn.
- Use the default classifier parameters. 
- Evaluate your classifier on the development set.

SVC with 'poly' (polynomial) Kernel Used -- Default Parameters

In [24]:
# Default parameters:
# C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape=None, random_state=None

# Only the kernel is changed from its default ('rbf') to 'poly'
svc = svm.SVC(kernel='poly', C=1.0)
svc.fit(train[['petal_length', 'petal_width']], train['species'])

# Note: score() is called on the training rows here, so this is training accuracy.
# Pass the development rows to score() to evaluate on the development set.
print("SVM Prediction Accuracy = {0:5.1f}%".format(100.0 * svc.score(train[['petal_length', 'petal_width']], train['species'])))

SVM Prediction Accuracy =  97.1%


## Problem 2 -- Improving Model Performance

- Tweak the classifier parameters to improve your model’s performance on the development set.

##### Parameters Used:
- gamma: 1/n_features used when 'auto'
- max_iter: -1 means no limit
- probability: Probability estimates
- shrinking: Shrinking heuristic -- avoid heuristic so you can focus on accuracy
- tol: Tolerance for stopping
- degree: Degree of the polynomial function 
- random_state: Random number generator to use when shuffling the data for probability estimation
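
For reference, `gamma`, `degree`, and `coef0` all enter the polynomial kernel as documented by scikit-learn. The symbols below are notation chosen for this note rather than taken from the notebook: $d$ is `degree`, $\gamma$ is `gamma`, and $r$ is `coef0`.

```latex
K(x, x') = \left( \gamma \, \langle x, x' \rangle + r \right)^{d}
```

With `gamma='auto'` and the two petal features used here, $\gamma = 1/n_{\text{features}} = 1/2$, and `coef0` keeps its default of $0$, so raising `degree` from 3 to 6 changes the kernel from $(0.5\,\langle x, x' \rangle)^{3}$ to $(0.5\,\langle x, x' \rangle)^{6}$.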

In [25]:
svc = svm.SVC(kernel='poly', C=1.0, gamma='auto', max_iter=-1, probability=False, shrinking=False, degree=6, tol=0.0005, random_state=1000)
svc.fit(train[['petal_length', 'petal_width']], train['species'])

# Note: score() is called on the training rows here, so this is training accuracy.
# Pass the development rows to score() to evaluate on the development set.
print("SVM Prediction Accuracy w/ Parameter Changes = {0:5.1f}%".format(100.0 * svc.score(train[['petal_length', 'petal_width']], train['species'])))

SVM Prediction Accuracy w/ Parameter Changes =  99.0%


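Tweaking the parameters only raised the accuracy from 97.1% to 99.0%, so the prediction did not get much better. This could be because the Iris dataset is small and its classes are already well separated on the petal features, which leaves the default model little room to improve.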

## Problem 3 -- Implementing kNN

- kNN: k-nearest neighbors classifier.
- Tune the k value to achieve the best possible performance on the development set.

kNN works off of the 'distance' between data points, so the first thing to do is to compute the distance from the point being classified to every training point.
Then, in the case of the iris dataset, we count how many of each species there are among the k closest neighbors.
The species with the most neighbors is the one we assign to our point.
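
The three steps above (measure distances, take the k closest, take a majority vote) can be sketched compactly with the standard library. This is an illustration and not the notebook's own implementation: the name `knn_predict`, the tuple-based inputs, and the toy measurements are all assumptions, and ties are broken by whichever label was seen first among the neighbors, which differs from the fixed species order used in `classify`.

```python
import math
from collections import Counter

def knn_predict(point, training_points, training_labels, k):
    """Return the majority label among the k training points nearest to `point`."""
    # Pair each training label with its Euclidean distance to the query point
    distances = [
        (math.dist(point, other), label)
        for other, label in zip(training_points, training_labels)
    ]
    distances.sort(key=lambda pair: pair[0])  # nearest first

    # Majority vote over the k nearest labels
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up (petal_length, petal_width) pairs, for illustration only
toy_points = [(1.4, 0.2), (1.5, 0.3), (4.5, 1.5), (4.7, 1.4), (5.8, 2.2), (6.0, 2.5)]
toy_labels = ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica']

print(knn_predict((1.6, 0.2), toy_points, toy_labels, k=3))  # prints: setosa
print(knn_predict((5.5, 2.0), toy_points, toy_labels, k=3))  # prints: virginica
```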

In [41]:
def classify(point, dataset, k):
    # point is a one-row DataFrame; pull out its two petal measurements as floats
    point_length = float(point['petal_length'].iloc[0])
    point_width = float(point['petal_width'].iloc[0])

    # Euclidean distance from the point to every row of the dataset
    distances = []
    for i in range(0, len(dataset)):
        petal_length_delta = dataset['petal_length'][i] - point_length
        petal_width_delta = dataset['petal_width'][i] - point_width
        distances.append(math.sqrt(petal_length_delta ** 2 + petal_width_delta ** 2))

    # Row indices ordered from nearest to farthest (stable sort, so ties keep row order)
    sorted_indices = sorted(range(len(dataset)), key=lambda i: distances[i])

    # Count each species among the k nearest neighbors
    setosas = 0
    virginicas = 0
    versicolors = 0

    for i in sorted_indices[:k]:
        if dataset['species'][i] == 'setosa':
            setosas = setosas + 1
        elif dataset['species'][i] == 'virginica':
            virginicas = virginicas + 1
        elif dataset['species'][i] == 'versicolor':
            versicolors = versicolors + 1

    # Majority vote; ties go to setosa first, then virginica
    if setosas >= virginicas and setosas >= versicolors:
        species = 'setosa'
    elif virginicas >= setosas and virginicas >= versicolors:
        species = 'virginica'
    else:
        species = 'versicolor'

    return species

In [42]:
# Checks how many are correct vs how many are not
def get_accuracy(data_list, test_data):
    correct_species_list = test_data['species']
    tot_num = len(data_list)
    num_correct = 0
    
    for i in range(0, len(data_list)):
        if correct_species_list[i] == data_list[i]:
            num_correct = num_correct + 1
    
    return num_correct / tot_num * 100

In [43]:
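# Loop through k values and score each one on the development set.
# Note: k=0 means no neighbors vote, and because the print below sits outside
# the loop, only the final k (9) is reported. Start the range at 1 and indent
# the print to report every k.

for k in range(0, 10):
    k_Results = []
    for i in range(0, len(development)):
        k_Results.append(classify(development.iloc[[i]], train, k))
print('k value of: {} -- accuracy: {}%'.format(k, get_accuracy(k_Results, development)))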

k value of: 9 -- accuracy: 95.65217391304348%
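
Tuning k on the development set means scoring every candidate k and keeping the one with the highest accuracy. A self-contained sketch of that selection loop follows; `knn_predict`, `best_k`, and the toy measurements are hypothetical names and data introduced for illustration, not the notebook's code or results.

```python
import math
from collections import Counter

def knn_predict(point, train_points, train_labels, k):
    """Majority label among the k training points nearest to `point`."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(point, pair[0]),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def best_k(train_points, train_labels, dev_points, dev_labels, k_values):
    """Return (k, accuracy) for the k with the highest development-set accuracy."""
    best = None
    for k in k_values:
        predictions = [knn_predict(p, train_points, train_labels, k) for p in dev_points]
        correct = sum(1 for pred, truth in zip(predictions, dev_labels) if pred == truth)
        acc = correct / len(dev_labels)
        if best is None or acc > best[1]:  # strict '>' keeps the smallest k on ties
            best = (k, acc)
    return best

# Made-up (petal_length, petal_width) pairs, for illustration only
train_points = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2), (4.5, 1.5),
                (4.7, 1.4), (5.8, 2.2), (6.0, 2.5), (5.6, 2.1)]
train_labels = ['setosa'] * 3 + ['versicolor'] * 2 + ['virginica'] * 3
dev_points = [(1.6, 0.2), (4.6, 1.3), (5.9, 2.3)]
dev_labels = ['setosa', 'versicolor', 'virginica']

# k starts at 1, since k=0 would leave no neighbors to vote
print(best_k(train_points, train_labels, dev_points, dev_labels, range(1, 6)))
```

Only the training points act as neighbors and only the development points are scored, which keeps the test set untouched until the final comparison.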


## Problem 4 -- Comparing

- Compare your best model you built in step (2) to your best kNN model by evaluating them on the test set.

In [26]:
# Model in Problem 2
# Use Test Set
# Note: the model is re-fit on the test set here and then scored on that same
# data, so the accuracy below is not a held-out estimate. To evaluate on the
# test set, fit on train and call score() on test.

svc = svm.SVC(kernel='poly', C=1.0, gamma='auto', max_iter=-1, probability=False, shrinking=False, degree=6, tol=0.0005, random_state=1000)
svc.fit(test[['petal_length', 'petal_width']], test['species'])

print("SVM Prediction Accuracy w/ Parameter Changes = {0:5.1f}%".format(100.0 * svc.score(test[['petal_length', 'petal_width']], test['species'])))

SVM Prediction Accuracy w/ Parameter Changes = 100.0%


In [45]:
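# kNN Model using k=9
# Note: this classifies the development points using the test rows as the
# neighbor pool and scores them against the development labels. To evaluate on
# the test set, classify each test point against train and score against test.
k_Results = []
k = 9
for i in range(0, len(development)):
    k_Results.append(classify(development.iloc[[i]], test, k))
print('kNN Results when k=9: {}%'.format(get_accuracy(k_Results, development)))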

kNN Results when k=9: 95.65217391304348%


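In these runs the SVM scored 100.0% and kNN with k=9 scored 95.65%. The two numbers are not directly comparable, because the SVM was fit and scored on the same test rows while kNN was scored on the development rows. My thinking is that the SVM is likely to do well on this data because setosa is linearly separable from the other two species, even though versicolor and virginica overlap slightly.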