**DS 301: Applied Data Modeling and Predictive Analysis**

# Lab 6 – Support Vector Machine

Nok Wongpiromsarn, 8 August 2022

**Instructions:**
1. Construct a pandas dataframe from the Iris dataset.
2. Remove outliers. Here, an outlier is any value more than 1.5 times the IQR above the upper quartile or below the lower quartile.
3. Apply each of the following methods to deal with the missing values. Discuss the differences in the data obtained from these methods.
   - dropna
   - fillna
   - SimpleImputer
4. Use the dataframe obtained from dropna. Visualize the data to see which pairs of the 3 species (setosa, versicolor, virginica) are linearly separable.
5. Construct features X and labels y.
   - X contains only the petal length and petal width features.
   - y is a binary target such that it is 1 if the instance is Virginica and is 0 otherwise.
6. Train LinearSVC, SVC, and SGDClassifier to identify whether a given instance is Virginica. Use C = 1. Don't forget to scale your data!
7. Pick one of the 3 classifiers and report the following performance measures.
   1. training accuracy
   2. cross-validation accuracy
   3. confusion matrix
   4. precision
   5. recall
   6. F1
   7. AUC

### 1. Construct a pandas dataframe from the Iris dataset

In [None]:
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df_feature = pd.DataFrame(iris['data'], columns = iris.feature_names)
df_label = pd.DataFrame(iris['target'], columns = ['species'])
df = pd.concat([df_feature, df_label], axis=1)

df.info()

### 2. Remove outliers.

Here, an outlier is any value more than 1.5 times the IQR above the upper quartile or below the lower quartile.
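
The rule can be sketched on toy data as follows. This is a minimal illustration rather than the lab solution: the helper name `iqr_bounds`, the parameter `k`, and the toy values are assumptions made for this example and are not part of the Iris data.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data (not from Iris): 100.0 is an obvious outlier.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
lower, upper = iqr_bounds(toy)

# Keep values inside the fences; everything outside becomes NaN.
cleaned = toy.where(toy.between(lower, upper))
print(lower, upper)
print(cleaned)
```

The same helper could be applied to each feature column in turn, which is one way to handle both the upper and the lower fence with a single expression.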

**2.1 Use boxplot to determine outliers.**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns 

plt.figure(figsize=(26, 12))
sns.boxplot(data=df[iris.feature_names])

**2.2 Replace all the outliers with NaN.**

In [None]:
import numpy as np

Q3 = df['sepal width (cm)'].quantile(0.75)
Q1 = df['sepal width (cm)'].quantile(0.25)
IQR = Q3 - Q1

# Set the values of sepal width that are beyond 1.5 times the IQR above the upper quartile as NaN
df.loc[df['sepal width (cm)'] > Q3 + 1.5*IQR, 'sepal width (cm)'] = np.nan

# TODO: Set other outliers as NaN

# Call info to verify that NaN values show up as null
df.info()

### 3. Apply each of the following methods to deal with the missing values

- dropna
- fillna
- SimpleImputer
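
Before applying the three methods to the Iris dataframe, it may help to see how they differ on a tiny example. The sketch below uses a made-up two-column frame (`toy`, with columns `a` and `b`), and the fill value `0.0` and the `"median"` strategy are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (not from Iris) with one missing value in column "a".
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, 30.0]})

# dropna removes every row that has a NaN in the listed columns.
dropped = toy.dropna(subset=["a"])

# fillna keeps all rows and replaces NaN with a value we choose.
filled = toy.fillna({"a": 0.0})

# SimpleImputer learns a per-column statistic and returns a NumPy array,
# so the column names and the index are not carried over.
imputed = SimpleImputer(strategy="median").fit_transform(toy)

print(dropped.shape, filled.shape, type(imputed))
```

Note the differences in both the number of rows and the return type: `dropna` and `fillna` return dataframes, while `fit_transform` here returns an array.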

First, we identify all the rows that contain a null value.

In [None]:
rows_with_null = df.isnull().any(axis=1)
df[rows_with_null]

**3.1 dropna**

In [None]:
df_dropna = df.dropna(subset=["sepal width (cm)"])

# TODO: Use a combination of info(), head(), describe(), and df_dropna[rows_with_null]
# to see the difference between df and df_dropna


**3.2 fillna**

In [None]:
# TODO: Change val to some other value that is not the mean of df['sepal width (cm)']
val = df['sepal width (cm)'].mean()
df_fillna = df.copy()
df_fillna['sepal width (cm)'] = df['sepal width (cm)'].fillna(val)

# TODO: Use a combination of info(), head(), describe(), and df_fillna[rows_with_null] 
# to see the difference between df and df_fillna


**3.3 SimpleImputer**

In [None]:
from sklearn.impute import SimpleImputer

# TODO: Change strategy to something that is not "mean"
imputer = SimpleImputer(strategy="mean")
df_imputer = imputer.fit_transform(df)

# TODO: Check the type of df_imputer and call df_imputer[rows_with_null]
# to see the difference between df, df_fillna, and df_imputer


### 4. Visualize the data to see which pairs of the 3 species are linearly separable.

First, we separate the input by label to help with plotting.

In [None]:
X_setosa = df_dropna.loc[df_dropna['species'] == 0, iris.feature_names]
X_versicolor = df_dropna.loc[df_dropna['species'] == 1, iris.feature_names]
X_virginica = df_dropna.loc[df_dropna['species'] == 2, iris.feature_names]

**4.1 Scatter plot of sepal length vs. sepal width. Use a different color for each species.**

In [None]:
plt.plot(X_setosa['sepal length (cm)'], X_setosa['sepal width (cm)'], 'bs', label='setosa')
plt.plot(X_versicolor['sepal length (cm)'], X_versicolor['sepal width (cm)'], 'yo', label='versicolor')
plt.plot(X_virginica['sepal length (cm)'], X_virginica['sepal width (cm)'], 'kx', label='virginica')
plt.xlabel("Sepal length (cm)")
plt.ylabel("Sepal width (cm)")
plt.legend()
plt.show()

**4.2 Scatter plot of petal length vs. petal width. Use a different color for each species.**

In [None]:
#TODO

### 5. Construct features X and labels y. 

- X contains only the petal length and petal width features.
- y is a binary target such that it is 1 if the instance is Virginica and is 0 otherwise.
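
The general pattern for turning a multi-class label into a binary target can be sketched as below. The class names and labels are hypothetical (they are not the Iris species), so this shows the technique only: look up the index of the class of interest by name, then compare the label column against it.

```python
import numpy as np

# Hypothetical class names and integer labels, for illustration only.
target_names = np.array(["ant", "bee", "cat"])
labels = np.array([0, 2, 1, 2, 0])

# Find the integer code of the class of interest by name.
cat_index = int(np.where(target_names == "cat")[0][0])

# 1 where the instance belongs to that class, 0 otherwise.
y_toy = (labels == cat_index).astype(int)
print(cat_index, y_toy)
```

Looking the index up by name avoids hard-coding a number that might not match the class you intended.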

In [None]:
X = df_dropna[['petal length (cm)', 'petal width (cm)']] # petal length, petal width

# TODO: Set the correct value of species that corresponds to Virginica.
# Hint: Use iris['target_names'] to figure out the right index.
y = df_dropna['species'] == 1

### 6. Train LinearSVC, SVC, and SGDClassifier to identify whether a given instance is Virginica.

In [None]:
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

C = 1
alpha = 1 / (C * len(X))

# Construct the classifiers
lin_clf = LinearSVC(loss="hinge", C=C, random_state=42)
svm_clf = SVC(kernel="linear", C=C)
sgd_clf = SGDClassifier(loss="hinge", alpha=alpha, max_iter=1000, tol=1e-3, random_state=42)

# TODO: Scale the features and train the classifiers
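
One common way to combine scaling and training is a scikit-learn pipeline, sketched here on synthetic data rather than on the Iris features. The names `X_toy`, `y_toy`, and `model` and the two well-separated clusters are assumptions made for this example. As background for the cell above: `SGDClassifier` minimizes the average loss plus `alpha` times the penalty, so setting `alpha = 1 / (C * len(X))` should make its regularization roughly comparable to that of the SVM classifiers with the same `C`.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: two well-separated 2-D clusters of 20 points each.
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2)),
])
y_toy = np.array([0] * 20 + [1] * 20)

# The pipeline fits the scaler on the training data and then
# trains the classifier on the scaled features.
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1))
model.fit(X_toy, y_toy)

train_accuracy = model.score(X_toy, y_toy)
print(train_accuracy)
```

Using a pipeline also means the same scaling is applied automatically when calling `predict` or `score` on new data.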


### 7. Pick one of the 3 classifiers and report the following performance measures

- training accuracy
- cross-validation accuracy
- confusion matrix
- precision
- recall
- F1
- AUC
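
Most of these measures are available in `sklearn.metrics`. The sketch below computes them on hand-made labels, predictions, and scores (all hypothetical, chosen so the values are easy to verify by hand). Cross-validation accuracy is not shown here; it needs a separate call such as `cross_val_score` from `sklearn.model_selection` on a fitted model and data.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical ground truth, hard predictions, and decision scores.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.2, 0.6, 0.9, 0.8, 0.4]

accuracy = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# AUC is computed from continuous scores, not from hard 0/1 predictions.
auc = roc_auc_score(y_true, y_score)

print(accuracy, precision, recall, f1, auc)
print(cm)
```

For an SVM, the scores passed to `roc_auc_score` would typically come from the classifier's `decision_function` rather than from `predict`.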


In [None]:
# TODO