
# **CSE 4820/5819**: Assignment 2 

Part II: Programming Assignment (Total 20pt)

**Due:** End of Day, November 7th (Sunday)

**Submission**: Please submit only your Jupyter notebook file (*.ipynb) through the HuskyCT assignment link.

**Filename**: PA2_[x].ipynb (x is your netid)

In this assignment you will build a classifier for the [wine dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-dataset) using support vector machines. Based on **Tutorial 2**, you will need to follow good machine learning modeling practices, such as splitting the dataset appropriately, standardizing or normalizing the features, and tuning hyperparameters properly.

Fill in the missing information, and experiment with the modeling process to arrive at a solution you are comfortable with.

## Load some libraries


In [None]:
import numpy as np

## Load the data

The wine dataset is included within the sklearn library. Start by importing the data.

In [None]:
from sklearn.datasets import load_wine
wine = load_wine()
X = wine.data
y = wine.target

We can see a description of the data, and even get some summary statistics.

Note that the features span very different ranges, which is a hint that we may want to normalize or standardize them later.

It is good practice to read the relevant information associated with the data that will be modeled. This can help inform you whether there may be problems along the way.
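
As a quick check of the point about feature ranges, the following minimal sketch prints the minimum and maximum of every feature and compares the widest range with the narrowest. The variable names (`feature_range`, `range_ratio`) are illustrative and not part of the assignment.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
X = wine.data

# Per-feature minimum and maximum across all samples.
feature_min = X.min(axis=0)
feature_max = X.max(axis=0)
feature_range = feature_max - feature_min

for name, lo, hi in zip(wine.feature_names, feature_min, feature_max):
    print(f"{name:<30s} min={lo:8.2f} max={hi:8.2f}")

# Ratio between the widest and the narrowest feature range.
range_ratio = feature_range.max() / feature_range.min()
print(f"widest range / narrowest range = {range_ratio:.0f}")
```

A large ratio means that distance-based models such as an RBF-kernel SVM would otherwise be dominated by the features with the largest numeric scale.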

In [None]:
print(wine.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

## Question 1 (1 Point):

When using an SVM to classify data, we must make sure there is no missing data. That is, no feature can have a `nan` value, because SVMs cannot handle missing data during the modeling process.

In the description of the data above (using wine.DESCR) we saw that **Missing Attribute Values: None** was documented.

Confirm that there are no missing values assuming that missing is recorded as `nan`.

In [None]:
# A float NaN never equals the string 'nan', so test with np.isnan instead.
if np.isnan(X).any():
  print("nan value present")
else:
  print("no nan values")

no nan values
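
One caveat when checking for missing values: a float `nan` never compares equal to the string `'nan'`, and it does not even compare equal to itself, so an equality test cannot detect it. The sketch below illustrates this on a small toy array (made up for this example) and shows `np.isnan` as the reliable test.

```python
import numpy as np

value = float("nan")

# A float NaN is never equal to the string 'nan' ...
equals_string = (value == "nan")
# ... and, per IEEE 754, it is not even equal to itself.
equals_itself = (value == value)
# np.isnan is the reliable element-wise test.
detected = bool(np.isnan(value))

# Vectorized check over a whole array with one injected NaN (toy data).
toy = np.array([[1.0, 2.0], [np.nan, 4.0]])
n_missing = int(np.isnan(toy).sum())

print(equals_string, equals_itself, detected, n_missing)
# prints: False False True 1
```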


## Question 2 (2 Points):

Split the data into training and test sets using the function `train_test_split`. Use 30% of the data for the test set, and set the parameter `random_state = 42` for reproducibility.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
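
As an optional variation (not required by the question), `train_test_split` also accepts a `stratify` argument that keeps the class proportions roughly equal in both splits. The sketch below assumes we want that behavior and prints the resulting shapes and class proportions. Note that passing `stratify=y` produces a different split than the plain call.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Same 70/30 split and seed, but stratified on the class labels (an assumption
# for illustration; the assignment only asks for test_size and random_state).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)

# Class proportions in the full data and in each split.
full_prop = np.bincount(y) / len(y)
train_prop = np.bincount(y_train) / len(y_train)
test_prop = np.bincount(y_test) / len(y_test)
print("full :", np.round(full_prop, 3))
print("train:", np.round(train_prop, 3))
print("test :", np.round(test_prop, 3))
```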

## Question 3 (4 Points):

Set up the pipeline for your model. Depending on how the data looks, you may want to use some type of normalization or standardization. The pipeline can include any necessary preprocessing steps along with the model.

Since this is a multi-class problem, ensure that this is taken into account when setting up the SVM by using a one-versus-the-rest configuration.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', SVC(decision_function_shape='ovr'))])
# 'scale': StandardScaler standardizes each feature to zero mean and unit variance.
# 'clf': SVC with a one-vs-rest ('ovr') decision function shape.
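
A note on the one-versus-the-rest requirement: according to the scikit-learn documentation, `SVC` always trains one-vs-one models internally, and `decision_function_shape='ovr'` only changes the shape of the decision function it returns. If a true one-vs-rest training scheme is wanted, one possible sketch is to wrap the SVM in `OneVsRestClassifier`. The name `ovr_pipe` is illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# Alternative pipeline: one binary SVM is trained per class.
ovr_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", OneVsRestClassifier(SVC(kernel="rbf"))),
])
ovr_pipe.fit(X, y)

# One fitted binary classifier per wine class.
n_binary_models = len(ovr_pipe.named_steps["clf"].estimators_)
print(n_binary_models)
# prints: 3
```

With this wrapper, grid-search parameter names gain an extra level, for example `clf__estimator__C` instead of `clf__C`.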

## Question 4 (5 Points):

Define which parameters you want to use in your grid search.

In [None]:
param_grid = {'clf__C': [0.01, 0.1, 1, 5, 100],
              'clf__gamma': [1, 5, 0.1],
              'clf__kernel': ['linear', 'rbf']}
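
The linear kernel does not use `gamma`, so a single dictionary that crosses every kernel with every `gamma` value contains redundant candidates. One way to avoid this, sketched below, is to pass a list of dictionaries, one per kernel. `ParameterGrid` is used here only to count the candidates, and the names `full_grid` and `split_grid` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.01, 0.1, 1, 5, 100]
gamma_values = [1, 5, 0.1]

# Single dictionary: every kernel is combined with every gamma.
full_grid = {
    "clf__C": C_values,
    "clf__gamma": gamma_values,
    "clf__kernel": ["linear", "rbf"],
}

# List of dictionaries: gamma is searched only where it has an effect.
split_grid = [
    {"clf__kernel": ["linear"], "clf__C": C_values},
    {"clf__kernel": ["rbf"], "clf__C": C_values, "clf__gamma": gamma_values},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)
# prints: 30 20
```

`GridSearchCV` accepts either form for its `param_grid` argument.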

## Question 5 (5 Points):

Instantiate `GridSearchCV` for selecting the best hyperparameters.

* Set up `GridSearchCV` with 5 fold cross-validation. 
* The `scoring` parameter tells `GridSearchCV` how to rank the parameter configurations from the grid searching procedure. Since we are interested in producing a classification model with the highest accuracy, set this parameter accordingly. To learn more about what might work here, please see [metrics and scoring](https://scikit-learn.org/stable/modules/model_evaluation.html).

In [None]:
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(pipe, param_grid, scoring='accuracy', cv=5, verbose=1)
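
When `cv` is an integer and the estimator is a classifier, `GridSearchCV` uses stratified folds without shuffling. The sketch below shows how the folds could instead be spelled out with an explicit `StratifiedKFold` object. The shuffling, the seed, and the reduced grid are assumptions made for illustration, not requirements of the question.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

# Explicit 5-fold stratified splitter (shuffle and seed are illustrative choices).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Small grid to keep the example fast.
small_grid = {"clf__C": [0.1, 1, 10]}
search = GridSearchCV(pipe, small_grid, scoring="accuracy", cv=cv)
search.fit(X_train, y_train)

print(search.best_params_, round(search.best_score_, 3))
```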

## Question 6 (1 Point):

Use the `fit` method to train the model:

In [None]:
grid.fit(X_train, y_train)

Fitting 5 folds for each of 30 candidates, totalling 150 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed:    0.4s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('scale',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('clf',
                                        SVC(C=1.0, break_ties=False,
                                            cache_size=200, class_weight=None,
                                            coef0=0.0,
                                            decision_function_shape='ovr',
                                            degree=3, gamma='scale',
                                            kernel='rbf', max_iter=-1,
                                            probability=False,
                                            random_state=None, shrinking=True,
                                            tol=0.001,

## Question 7 (1 Point):

Print out the best model score.

In [None]:
print("Best parameter (CV score=%0.3f):" % grid.best_score_)

Best parameter (CV score=0.984):
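
The cross-validation score is computed only on the training data. As a follow-up that the questions do not explicitly ask for, the held-out test set from Question 2 can give an independent estimate. The sketch below rebuilds the same split, pipeline, and grid so that it runs on its own, then scores the refitted best model on the test set.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.30, random_state=42
)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", SVC(decision_function_shape="ovr"))])
param_grid = {"clf__C": [0.01, 0.1, 1, 5, 100],
              "clf__gamma": [1, 5, 0.1],
              "clf__kernel": ["linear", "rbf"]}

grid = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
grid.fit(X_train, y_train)

# grid.score uses the best model refitted on the whole training set.
test_accuracy = grid.score(X_test, y_test)
print(f"CV accuracy:   {grid.best_score_:.3f}")
print(f"test accuracy: {test_accuracy:.3f}")

# Per-class precision, recall, and F1 on the test set.
print(classification_report(y_test, grid.predict(X_test),
                            target_names=wine.target_names))
```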


## Question 8 (1 Point):

Print out the parameter configuration associated with the best model.

In [None]:
print(grid.best_params_)

{'clf__C': 1, 'clf__gamma': 1, 'clf__kernel': 'linear'}
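
When the best kernel is linear, the reported `clf__gamma` should be read with care: the linear kernel ignores `gamma`, so candidates that differ only in `gamma` obtain the same score and are tied. The sketch below, which rebuilds the search so that it runs on its own, inspects `cv_results_` to make those ties visible. The helper names (`linear_scores`, `n_tied_for_best`) are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", SVC(decision_function_shape="ovr"))])
param_grid = {"clf__C": [0.01, 0.1, 1, 5, 100],
              "clf__gamma": [1, 5, 0.1],
              "clf__kernel": ["linear", "rbf"]}

grid = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
grid.fit(X_train, y_train)
results = grid.cv_results_

# Mean CV accuracy of the linear-kernel candidates, grouped by C.
linear_scores = {}
for params, score in zip(results["params"], results["mean_test_score"]):
    if params["clf__kernel"] == "linear":
        linear_scores.setdefault(params["clf__C"], []).append(round(float(score), 6))

for C, scores in sorted(linear_scores.items()):
    print(f"C={C:<6} scores across gamma values: {scores}")

# Number of candidates sharing the top rank.
n_tied_for_best = int(np.sum(results["rank_test_score"] == 1))
print("candidates tied for rank 1:", n_tied_for_best)
```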
