# DS 3000 - Assignment 11

**Student Name**: [Ji Yoon Rhee]

**Date**: [Apr 10, 2022]

Question 1:  (50 pts)

Create an SVM model (either classification or regression) to predict a target using a dataset of your choice. You can use: 1) a previous dataset that you are familiar with from this course, 2) a new dataset, or 3) one of the built-in datasets in sklearn. You may not use the same dataset that was covered in the lecture.

1. (5 pts) Perform any necessary data preparation steps and EDA.
2. (5 pts) Partition the data into train/test sets.
3. (10 pts) Normalize the numeric features and encode any categorical features. Ensure that you use sklearn to normalize/encode the training data, after which transform the test set (there is an example in the lecture).
4. (20 pts) Build the model that predicts the target variable, and use GridSearchCV to tune at least two hyperparameters, and perform cross validation. You'll need to read the documentation of the SVM algorithm to identify the hyperparameters.
5. (5 pts) Display: 1) the best model/parameters and 2) the best score that was obtained.
6. (5 pts) How do you feel about the model? Did it accurately predict the target? Evaluate the scores from GridSearchCV for the training set and the test set to diagnose any bias-variance problems.
 

Examples of datasets that you can use:

- Pima Indians Diabetes Database
- Iris Dataset
- Boston Housing dataset
- Red Wine Quality
...etc
Note: Keep the dataset small. The focus of this assignment is to experiment with tuning a model.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report

sns.set()

1. (5 pts) Perform any necessary data preparation steps and EDA.

In [2]:
# Perform any necessary data preparation steps and EDA.

##loading iris dataset:
data = load_iris()

##dividing the data into the input 'X' and the labels 'y':
X        = data['data'] #the observations
y        = data['target'] #the target

Because iris is a built-in sklearn dataset with four numeric features and no missing values, no further data preparation or extensive EDA is needed.
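A few quick checks can back up that claim about the dataset. The following is a minimal sketch, not part of the original submission: the names `eda_df`, `missing_counts`, `class_counts`, and `summary` are illustrative, and `load_iris(as_frame=True)` assumes scikit-learn 0.23 or newer.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load iris as a DataFrame (as_frame=True assumes scikit-learn >= 0.23)
iris = load_iris(as_frame=True)
eda_df = iris.frame  # four feature columns plus an integer 'target' column

# Count missing values per column
missing_counts = eda_df.isna().sum()

# Count observations per class to check that the labels are balanced
class_counts = eda_df['target'].value_counts().sort_index()

# Summary statistics show that the features sit on different scales,
# which is why they are standardized before fitting the SVM
summary = eda_df.drop(columns='target').describe()

print(missing_counts)
print(class_counts)
print(summary.loc[['mean', 'std', 'min', 'max']])
```

If the missing-value counts are all zero and the classes are evenly represented, skipping imputation and resampling is reasonable.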

2. (5 pts) Partition the data into train/test sets.

In [3]:
##loading data into dataframes:
df          = pd.DataFrame(X, columns=['sepal length', 'sepal width', 'petal length', 'petal width'])
df['class'] = [data['target_names'][idx] for idx in y]




##splitting the data into train and test 
X_train, X_test, y_train, y_test = train_test_split(X,              #the input features
                                                    y,              #the label
                                                    test_size=0.3,  #set aside 30% of the data as the test set
                                                    random_state=7, #reproduce the results
                                                    stratify=y      #preserve the distribution of the labels
                                                   )

X_train = X_train.copy()
X_test  = X_test.copy()

df.head()

| | sepal length | sepal width | petal length | petal width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


3. (10 pts) Normalize the numeric features and encode any categorical features. Ensure that you use sklearn to normalize/encode the training data, after which transform the test set (there is an example in the lecture).

In [4]:
#Normalize the numeric features and encode any categorical features.

##normalizing the numeric features:
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train) #scale the training set

There is no need to encode anything, since there are no categorical features in this dataset.
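The fit-on-train, transform-on-test pattern required by this step can be sketched end to end as follows. This is an illustrative, self-contained example rather than the notebook's own cell: the names `X_tr`, `X_te`, `X_tr_scaled`, `X_te_scaled`, and `demo_scaler` are hypothetical, and the split settings simply mirror the ones used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=7, stratify=y_all
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learn mean/std from the training set only
X_te_scaled = demo_scaler.transform(X_te)      # reuse the training statistics on the test set

# The training features end up with mean ~0 and standard deviation ~1;
# the test features are close to that but not exactly, because they were
# scaled with statistics learned from the training data
print(np.round(X_tr_scaled.mean(axis=0), 3))
print(np.round(X_tr_scaled.std(axis=0), 3))
print(np.round(X_te_scaled.mean(axis=0), 3))
```

Fitting the scaler on the training set alone keeps information about the test set out of the preprocessing step.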

4. (20 pts) Build the model that predicts the target variable, and use GridSearchCV to tune at least two hyperparameters, and perform cross validation. You'll need to read the documentation of the SVM algorithm to identify the hyperparameters.

In [5]:
#initialize the classifier

svm = SVC() #the default kernel is rbf
svm.fit(X_train, y_train) #fit the data

SVC()

In [6]:
#scale the new/test data with the scaler that was fitted on the training set
X_test = scaler.transform(X_test)

#predict the labels for the test set
y_pred   = svm.predict(X_test)

print('The predicted classes are: {}'.format(y_pred))

The predicted classes are: [1 1 2 0 1 0 2 2 2 0 0 2 0 1 1 1 2 1 2 2 0 1 2 1 1 0 0 2 0 2 1 1 1 0 0 2 1
 0 1 0 2 2 2 0 0]


In [7]:
#evaluate the prediction accuracy
print('The accuracy of the model is: {}'.format(svm.score(X_test, y_test)))

The accuracy of the model is: 0.9555555555555556


In [8]:
#get the classification report
target_names = ['setosa', 'versicolor', 'virginica']
print(classification_report(y_test, y_pred, target_names = target_names))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        15
  versicolor       0.93      0.93      0.93        15
   virginica       0.93      0.93      0.93        15

    accuracy                           0.96        45
   macro avg       0.96      0.96      0.96        45
weighted avg       0.96      0.96      0.96        45



5. (5 pts) Display: 1) the best model/parameters and 2) the best score that was obtained.

In [9]:
#set up the parameter grid: C takes the integer values 1 to 19, and gamma takes the integer values 1 to 9
params = {'C': np.arange(1, 20),
         'gamma': np.arange(1, 10)}

svm = SVC()

#initialize Grid Search with Cross Validation
svm_grid = GridSearchCV(estimator = svm, param_grid = params, cv=5, return_train_score = True)
svm_grid.fit(X_train, y_train)

print(svm_grid.best_params_) #print the best parameters
print(svm_grid.best_score_)  #print the best score

{'C': 1, 'gamma': 2}
0.9523809523809523
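Besides the best parameters and the best cross-validation score, the refitted best model itself can be displayed and scored on the held-out test set. The sketch below is a self-contained illustration under the same split and grid values as above; the names `grid_demo`, `best_model`, and `test_accuracy` are hypothetical and do not appear in the original notebook, and no particular output is claimed.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=7, stratify=y_all
)

# Scale with statistics learned from the training set only
demo_scaler = StandardScaler().fit(X_tr)
X_tr_s = demo_scaler.transform(X_tr)
X_te_s = demo_scaler.transform(X_te)

# Same grid values as above: C in 1..19 and gamma in 1..9
param_grid = {'C': list(range(1, 20)), 'gamma': list(range(1, 10))}
grid_demo = GridSearchCV(SVC(), param_grid, cv=5)
grid_demo.fit(X_tr_s, y_tr)

# With the default refit=True, best_estimator_ is the best configuration
# refitted on the whole training set
best_model = grid_demo.best_estimator_
test_accuracy = best_model.score(X_te_s, y_te)

print(best_model)
print(grid_demo.best_params_)
print(grid_demo.best_score_)
print(test_accuracy)
```

Comparing `best_score_` with the test-set accuracy shows whether the cross-validation estimate carries over to unseen data.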


6. (5 pts) How do you feel about the model? Did it accurately predict the target? Evaluate the scores from GridSearchCV for the training set and the test set to diagnose any bias-variance problems.

In [10]:
#display the cv results for the training folds and the held-out validation folds (GridSearchCV labels the latter 'test')
cv_results = pd.DataFrame(svm_grid.cv_results_)
cv_results = cv_results.sort_values('mean_test_score', ascending=False)
cv_results[['mean_train_score', 'std_train_score', 'mean_test_score', 'std_test_score']].head()

| | mean_train_score | std_train_score | mean_test_score | std_test_score |
|---|---|---|---|---|
| 9 | 0.995238 | 0.005832 | 0.952381 | 0.030117 |
| 1 | 0.992857 | 0.005832 | 0.952381 | 0.030117 |
| 10 | 1.0 | 0.0 | 0.942857 | 0.035635 |
| 0 | 0.985714 | 0.008909 | 0.942857 | 0.035635 |
| 2 | 1.0 | 0.0 | 0.942857 | 0.035635 |


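The two top-ranked parameter combinations tie at a mean cross-validation score of 0.952381, with mean training scores of 0.995238 and 0.992857. Note that GridSearchCV's `mean_test_score` is computed on the held-out folds of the training data, not on the separate test set. The gap between training and validation scores is about 0.04, and both scores are high, so the SVC does not appear to overfit or underfit the data noticeably. This suggests low bias and low variance. The default SVC fitted earlier scored 0.9556 on the held-out test set, which is in line with the cross-validation score.

Therefore, we can conclude that the model can accurately classify the target.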
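One way to make the bias-variance check explicit is to compute the gap between the mean training score and the mean validation score for every parameter combination. The following is a minimal sketch under stated assumptions: it uses a deliberately small, hypothetical grid (`C` in {1, 10}, `gamma` in {0.1, 1, 5}) rather than the grid tuned above, and the `gap` column and variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=7, stratify=y_all
)
X_tr_s = StandardScaler().fit_transform(X_tr)

# Hypothetical small grid, chosen only to keep the illustration short
small_grid = {'C': [1, 10], 'gamma': [0.1, 1, 5]}
gap_search = GridSearchCV(SVC(), small_grid, cv=5, return_train_score=True)
gap_search.fit(X_tr_s, y_tr)

results = pd.DataFrame(gap_search.cv_results_)

# A large positive gap points to overfitting (high variance);
# low scores on both sides point to underfitting (high bias)
results['gap'] = results['mean_train_score'] - results['mean_test_score']

cols = ['param_C', 'param_gamma', 'mean_train_score', 'mean_test_score', 'gap']
print(results.sort_values('mean_test_score', ascending=False)[cols])
```

Combinations whose training score is near 1.0 while the validation score drops are the ones most likely to be overfitting.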