# Assignment 2: Exploratory Data Analysis and K Nearest Neighbors Classification

For this assignment you will perform exploratory data analysis to visualize the Wine dataset using Scikit Learn. You will also explore the bias/variance trade-off by applying k-nearest neighbors classification to the Wine dataset and varying the hyperparameter k.

Documentation for Scikit Learn:
+ The top-level documentation page is here: https://scikit-learn.org/stable/index.html
+ The API for the KNeighborsClassifier is here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
+ The User Guide for nearest neighbors classification is here: https://scikit-learn.org/stable/modules/neighbors.html#classification
+ Scikit Learn provides many Jupyter notebook examples on how to use the toolkit. These Jupyter notebook examples can be run on MyBinder: https://scikit-learn.org/stable/auto_examples/index.html

For more information about the Wine dataset, see this page https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html.

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from pandas import DataFrame

## Load Wine dataset

In [None]:
wine = datasets.load_wine()
X = wine.data
y = wine.target

## Part 1 Exploratory Data Analysis

### Dataset size

In [None]:
print("Number of instances in the wine dataset:", X.shape[0])
print("Number of features in the wine dataset:", X.shape[1])
print("The dimension of the data matrix X is", X.shape)

In [None]:
X

The `y` vector has length 178. It has three unique values: 0, 1 and 2. Each value represents a kind of wine.

In [None]:
y
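
The statement about `y` can be checked directly. The following is a minimal sketch, assuming NumPy is available alongside Scikit Learn; it reloads the dataset so that it runs on its own, and it counts how many instances carry each label.

```python
import numpy as np
from sklearn import datasets

wine = datasets.load_wine()
y = wine.target

# Distinct class labels and the number of instances carrying each label.
labels, counts = np.unique(y, return_counts=True)

for label, count in zip(labels, counts):
    # wine.target_names maps the integer label to its class name.
    print(f"class {label} ({wine.target_names[label]}): {count} instances")
```

The counts show whether the classes are balanced, which is worth keeping in mind when reading the accuracy numbers later in the assignment.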

### Descriptive statistics

Show the summary table of the Wine data, including min, max, median, and quantiles.

In [None]:
y.shape

In [None]:
dir(wine)

In [None]:
import numpy as np
import pandas as pd
wine_df = pd.DataFrame(data= np.c_[wine['data'], wine['target']],
                     columns= wine['feature_names'] + ['target'])

In [None]:
wine_df.describe()

In [None]:
wine.target_names

In [None]:
wine.feature_names

### (TODO) Part 1a Draw box plots

Draw four box plots, one for each of the attributes alcohol, malic_acid, ash, and alcalinity_of_ash. Use color to show the different target classes.

Some links to help you:

https://seaborn.pydata.org/generated/seaborn.boxplot.html

https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html
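
As a starting point, here is a minimal sketch of a per-class box plot using Matplotlib only. It deliberately uses `proline`, an attribute outside the four assigned ones, and the color names are arbitrary choices for illustration; adapt the idea to the required attributes (Seaborn would work equally well).

```python
import matplotlib.pyplot as plt
from sklearn import datasets

wine = datasets.load_wine()
X, y = wine.data, wine.target

# Illustrative choice: an attribute that is NOT one of the four assigned ones.
feature = "proline"
col = wine.feature_names.index(feature)

# One array of attribute values per target class.
groups = [X[y == label, col] for label in range(len(wine.target_names))]

fig, ax = plt.subplots(figsize=(5, 4))
# patch_artist=True draws the boxes as filled patches so they can be colored.
bp = ax.boxplot(groups, patch_artist=True)

# Give each class its own color (arbitrary palette).
colors = ["tab:blue", "tab:orange", "tab:green"]
for box, color in zip(bp["boxes"], colors):
    box.set_facecolor(color)

ax.set_xticklabels(wine.target_names)
ax.set_xlabel("target class")
ax.set_ylabel(feature)
plt.show()
```

Boxes whose ranges barely overlap across classes indicate an attribute that separates the classes well on its own.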

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
## Insert your code here...

### (TODO) Part 1b
Based on the box plots, if you were only allowed to choose one attribute, which attribute would you choose? Why?

#### Insert your answer for 1b here...

### (TODO) Part 1c Scatter plots

Generate [scatter plots](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) using each pair of the first 4 attributes (alcohol, malic_acid, ash, alcalinity_of_ash) as axes. You should generate $6 = {4 \choose 2}$ scatter plots.

Note: use the attribute with the smaller index as the x axis and the one with the larger index as the y axis. E.g., for the pair (alcohol, malic_acid), alcohol is the x axis and malic_acid is the y axis.
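
The six pairs and their axis order can be enumerated with the standard library. This is a small sketch of the pairing only (no plotting); the `features` list simply restates the four attribute names from the task.

```python
from itertools import combinations

features = ["alcohol", "malic_acid", "ash", "alcalinity_of_ash"]

# combinations() yields index pairs in increasing order, so the first index
# (x axis) is always smaller than the second (y axis).
pairs = list(combinations(range(len(features)), 2))

for i, j in pairs:
    print(f"x = {features[i]}, y = {features[j]}")
```

Looping over `pairs` while drawing into a 2 x 3 grid of subplots is one convenient way to lay out the six plots.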

In [None]:
## Insert your answer here...

### (TODO) Part 1d
If you were to draw linear decision boundaries to separate the classes, which scatter plot from 1c do you think will have the least error and which the most?

#### Insert your 1d answer here

### (TODO) Part 1e PCA
Scatter plots using two attributes of the data are equivalent to projecting the four-dimensional data down to two dimensions using an axis-parallel projection. Principal component analysis (PCA) is a technique to linearly project the data to lower dimensions that are not necessarily axis-parallel. Use PCA to project the data down to two dimensions.

Documentation for PCA:
+ API https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
+ User guide https://scikit-learn.org/stable/modules/decomposition.html#pca
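
The basic PCA workflow can be sketched as follows. The data here is synthetic (the array `X_demo` and its per-axis scales are made up for illustration and are not the Wine data); the point is only to show the `fit_transform` call and what the fitted object exposes.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in data: 200 points in 4 dimensions, with a different
# spread along each axis.
X_demo = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_demo)  # projected data, one row per point

print(X_2d.shape)
# Each row of components_ is a projection direction in the original 4-D space.
print(pca.components_.shape)
# Fraction of the total variance captured by each component.
print(pca.explained_variance_ratio_)
```

Because PCA follows the directions of largest variance, attributes measured on larger scales tend to dominate the components; whether to standardize the attributes first is a choice worth considering for your own solution.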

In [None]:
### Insert your code here

### (TODO) Part 1f
In the case of the Wine dataset, does PCA do a better job of separating the classes?

#### Insert your answer


## Part 2 K Nearest Neighbors

Split the dataset into train set and test set. Use 70 percent of the dataset for training, and use 30 percent for testing.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

In [None]:
print("Number of instances in the train set:", X_train.shape[0])
print("Number of instances in the test set:", X_test.shape[0])
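
The sizes printed above can be anticipated with a little arithmetic. The sketch below assumes that, for a fractional `test_size`, `train_test_split` rounds the test count up and assigns the remaining instances to the train set; compare the result with the printed values to confirm.

```python
from math import ceil

n_samples = 178   # instances in the Wine dataset
test_size = 0.30  # fraction requested for the test set

# Assumption: the test count is rounded up, the rest goes to training.
n_test = ceil(test_size * n_samples)
n_train = n_samples - n_test

print("train:", n_train, "test:", n_test)
```

Note that the split is therefore only approximately 70/30, since 178 is not evenly divisible.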

### (TODO) Part 2a Training a KNN classifier

Create a KNeighborsClassifier with `n_neighbors = 3`, and train the classifier using the train set.
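
For reference, the fit/predict/score pattern of a Scikit Learn classifier looks like the sketch below. It uses a made-up one-dimensional toy dataset (`X_toy`, `y_toy`) rather than the Wine train set, so it only illustrates the API calls.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data (hypothetical, not the Wine data): two well-separated groups.
X_toy = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
y_toy = [0, 0, 0, 1, 1, 1]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_toy, y_toy)  # for KNN, fitting essentially stores the samples

# Each query point is labeled by a majority vote of its 3 nearest neighbors.
pred = clf.predict([[1.5], [10.5]])
# score() returns the mean accuracy on the given data and labels.
acc = clf.score(X_toy, y_toy)

print(pred, acc)
```

The same three calls apply unchanged to `X_train`, `y_train`, `X_test`, and `y_test`.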

In [None]:
### Insert your answer here

### (TODO) Part 2b Tuning hyperparameter k
As we have seen in class, the hyperparameter k of K Nearest Neighbors classification affects the inductive bias. For this part, train multiple nearest neighbor classifier models and store the results in a DataFrame. Then plot training error and testing error versus N/k, where N = 100 and the values of k are given in the `k_list` below.
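
The two quantities needed for the plot can be sketched as follows. The list of k values is copied from the assignment; the helper `error_rate` is a hypothetical name introduced here, based on the usual definition of classification error as one minus accuracy.

```python
N = 100
k_list = [1, 3, 5, 7, 9, 11, 13, 15, 50]

# x axis of the plot: N/k. Smaller k gives a larger N/k, i.e. a more
# flexible model; larger k gives a smaller N/k, i.e. a smoother model.
complexity = [N / k for k in k_list]

def error_rate(accuracy):
    """Classification error is the complement of accuracy."""
    return 1.0 - accuracy

for k, c in zip(k_list, complexity):
    print(f"k = {k:2d}  N/k = {c:.2f}")
```

Since a classifier's `score` method returns accuracy, passing it through `error_rate` for both the train and the test set gives the two curves to plot against `complexity`.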

In [None]:
k_list = [1, 3, 5, 7, 9, 11, 13, 15, 50]

In [None]:
### Insert your code
# Use the `result` to store the DataFrame

### (TODO) Part 2c Decision boundaries

Plot decision boundaries of K Nearest Neighbors.

Use Scikit Learn's [DecisionBoundaryDisplay](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.DecisionBoundaryDisplay.html#sklearn.inspection.DecisionBoundaryDisplay) class to visualize the nearest neighbor boundaries as k is varied.

In [None]:
k_list = [1, 3, 5, 7, 9, 11, 13, 15, 50]

Simplify the problem by using only the first 2 attributes of the dataset.

In [None]:
X2 = wine.data[:, :2]

In [None]:
### Insert your code here
