# Day 10 Class Exercises: Supervised Machine Learning

## Background
For these class exercises, we will use the wine quality dataset, which can be found at this URL:
https://archive.ics.uci.edu/ml/datasets/wine+quality. We will use the supervised machine learning tools from the lessons to build a model that predicts wine quality from physicochemical measurements. The data for these exercises can be found in the `data` directory of this repository.

<span style="float:right; margin-left:10px; clear:both;">![Task](../media/new_knowledge.png)</span> Additionally, these class exercises introduce a few new things. When new knowledge is introduced, you'll see the icon shown on the right.

## Get Started
Import the NumPy, pandas, Matplotlib (with the matplotlib magic), Seaborn, and sklearn packages.

In [None]:
%matplotlib inline

# Data Management
import numpy as np
import pandas as pd

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Machine learning
from sklearn import model_selection
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

## Exercise 1. Review the data once more
Load the wine quality data used in the Seaborn class exercises from Day 9. As a reminder, you can read about this dataset in the file [../data/winequality.names](../data/winequality.names).

Next, read in the file named `winequality-red.csv`. This data, despite the `csv` suffix, is separated using a semicolon.

In [None]:
wine = pd.read_csv('../data/winequality-red.csv', sep=";")
wine.head()

How many samples (observations) do we have?

Are the data types for the columns in the dataframe appropriate for the type of data in each column?

Any missing values?

Any duplicated rows?
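The four questions above can each be answered with a single pandas call. The following is a minimal sketch of one possible approach. It uses a tiny made-up stand-in for `wine` so that it runs on its own; in the notebook, apply the same calls to the dataframe you loaded. The variable names `n_samples`, `n_missing`, and `n_duplicated` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `wine` dataframe (made-up values);
# in the notebook, use the dataframe loaded from winequality-red.csv.
wine = pd.DataFrame({
    'alcohol': [9.4, 9.8, 9.8, 9.4, np.nan],
    'pH': [3.51, 3.20, 3.20, 3.51, 3.30],
    'quality': [5, 5, 5, 5, 6],
})

n_samples, n_columns = wine.shape       # rows are samples, columns are variables
dtypes = wine.dtypes                    # data type of each column
n_missing = wine.isna().sum()           # count of missing values per column
n_duplicated = wine.duplicated().sum()  # rows identical to an earlier row

print(n_samples, n_columns)
print(dtypes)
print(n_missing)
print(n_duplicated)
```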

## Exercise 2: Explore the Dependent Data

The quality column contains our expected outcome. Because we want to predict this score, it is our dependent variable. Wines scored as 0 are considered very bad and wines scored as 10 are considered excellent. How many samples are there for each quality score?

As a reminder, view the quality distribution using a Seaborn barplot. Code similar to the following was used in Day 9 exercises. Adapt it here to fit your variables.

```python
qcounts = wine['quality'].value_counts(sort=False)
sns.barplot(x=qcounts.index, y=qcounts);
```

## Exercise 3: Explore the Independent Data

The independent data includes our physicochemical measurements. As a reminder, let's use a FacetGrid to review the range of values for each of these. Code similar to the following was used in Day 9 exercises. Adapt it here to fit your variables.
```python
# First Melt the data
wine_t = wine.melt(id_vars='quality', var_name='measurement')

# Now create a FacetGrid and add a boxplot to it.
g = sns.FacetGrid(wine_t, col='measurement', col_wrap=6, sharex=False)
g.map(sns.boxplot, 'value', order=None);
```

To get a sense of the distribution shape of each independent data column, use a violin plot as well. Code similar to the following was used in Day 9 exercises. Adapt it here to fit your variables.

```python
g = sns.FacetGrid(wine_t, col='measurement', col_wrap=6, sharex=False)
g.map(sns.violinplot, 'value', order=None);
```

Next, let's look for columns that might be correlated with other columns. Remember, collinear data can bias some supervised machine learning models, so we should consider removing columns that are highly correlated with others. Code similar to the following was used in Day 9 exercises. Adapt it here to fit your variables.

```python
# Limit the plot to only 500 points to help reduce overplotting
sns.pairplot(wine.sample(500), hue='quality', palette="tab10");
```

Perform correlation analysis on the data columns. Exclude the `quality` column from the correlation analysis. 
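One way to do this is with the pandas `DataFrame.corr` method, which computes pairwise Pearson correlations by default. The sketch below is only an illustration: it builds a small synthetic stand-in for `wine` (random values, not the real data) so that it runs on its own, and stores the result in `wine_cor`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `wine` (random values, not the real measurements).
rng = np.random.default_rng(0)
a = rng.normal(size=50)
wine = pd.DataFrame({
    'alcohol': a,
    'density': -a + rng.normal(scale=0.1, size=50),  # strongly anti-correlated
    'quality': rng.integers(3, 9, size=50),
})

# Drop the outcome column, then compute pairwise Pearson correlations.
wine_cor = wine.drop(columns='quality').corr()
print(wine_cor)
```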

In Day 9 exercises, we used the [seaborn.heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html) function to draw a heatmap of correlation values to help us identify columns that are highly correlated.  Code similar to the following was used in Day 9 exercises. Adapt it here to fit your variables.

```python
plt.figure(figsize=(10, 10))
sns.heatmap(wine_cor, vmin=-1, vmax=1, annot=True, square=True);
```

<span style="float:right; margin-left:10px; clear:both;">![Task](../media/new_knowledge.png)</span>You may be interested in grouping data columns by their similarity profiles. For this, use the Seaborn [seaborn.clustermap](https://seaborn.pydata.org/generated/seaborn.clustermap.html) function instead. It will order the data columns by similarity and provide a dendrogram on both the `x` and `y` axes to indicate relationships of similarity. The following code example will create this plot. Adapt it for your variables.

```python
sns.clustermap(wine_cor, vmin=-1, vmax=1);
```

## Exercise 4: Cleaning the Data

In summary, what important observations can we make from the exploration of both the dependent and independent variables in the data?

What type of cleaning decisions should be made?

Is the data tidy? Do we need to adjust it?

## Exercise 5: Use SML Classification Models 

First, separate out the outcome (dependent) variable and our observed (independent) data variables. Save these into variables named `X` and `Y`.
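A minimal sketch of this separation, assuming the outcome column is named `quality` as in the earlier exercises. The tiny dataframe here is a made-up stand-in so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the `wine` dataframe (made-up values).
wine = pd.DataFrame({
    'alcohol': [9.4, 9.8, 10.5, 11.2],
    'pH': [3.51, 3.20, 3.26, 3.16],
    'quality': [5, 5, 6, 7],
})

X = wine.drop(columns='quality')  # observed (independent) variables
Y = wine['quality']               # outcome (dependent) variable

print(X.shape, Y.shape)
```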

Normalize the observed data. Be sure to use the [normalization strategy](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) best suited to what you observed about the data.
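As one illustration, the sketch below uses `StandardScaler`, which rescales each column to zero mean and unit variance. This choice is an assumption, not the required answer: other scalers in `sklearn.preprocessing`, such as `RobustScaler` or `MinMaxScaler`, may suit your observations better. The array and the name `X_scaled` are made up for the example.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical stand-in for X: two measurements on very different scales.
X = np.array([
    [9.4, 0.9978],
    [9.8, 0.9968],
    [10.5, 0.9970],
    [11.2, 0.9955],
])

scaler = preprocessing.StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```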

Generate the training set such that 20% of the data is left for testing and 80% is used for training. Name the variables holding the training data `Xt` and `Yt`, respectively. Name the data used for testing/validation `Xv` and `Yv`.
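A minimal sketch of an 80/20 split using `model_selection.train_test_split`. The arrays are made-up stand-ins, and the `random_state` value is an arbitrary choice that only makes the split repeatable.

```python
import numpy as np
from sklearn import model_selection

# Hypothetical stand-ins for the normalized observations and the outcome.
X = np.arange(200).reshape(100, 2)
Y = np.arange(100) % 3

# test_size=0.2 holds back 20% of the rows for testing/validation.
Xt, Xv, Yt, Yv = model_selection.train_test_split(
    X, Y, test_size=0.2, random_state=10)

print(Xt.shape, Xv.shape)
```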

Create a k-fold cross-validation strategy object that splits the training data into 10 equal parts. Each model will use this object during evaluation.
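A minimal sketch using `model_selection.KFold`. Shuffling and the `random_state` value are assumptions made here for repeatability, and `Xt` is a made-up stand-in for the training data.

```python
import numpy as np
from sklearn import model_selection

# Hypothetical stand-in for the training observations (80 rows).
Xt = np.arange(160).reshape(80, 2)

# Ten folds; shuffle the rows before splitting them into folds.
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=10)

# Each fold holds out a different tenth of the training rows.
fold_sizes = [len(test_idx) for _, test_idx in kfold.split(Xt)]
print(fold_sizes)
```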

Use the following dictionary to store results. Each entry holds the 10 cross-validation scores for one model:
```python
results = {
    'LogisticRegression' : np.zeros(10),
    'LinearDiscriminantAnalysis' : np.zeros(10),
    'KNeighborsClassifier' : np.zeros(10),
    'DecisionTreeClassifier' : np.zeros(10),
    'GaussianNB' : np.zeros(10),
    'SVC' : np.zeros(10),
    'RandomForestClassifier': np.zeros(10)
}
```

Execute a Logistic Regression classifier model.
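The same pattern can be reused for every model in this exercise: create the classifier, score it with the k-fold object, and store the 10 scores under the model's name. The sketch below shows one possible way to do this with `model_selection.cross_val_score`. The synthetic data from `make_classification` is a stand-in for `Xt` and `Yt`, and `max_iter=1000` is an assumption made to give the solver room to converge.

```python
import numpy as np
from sklearn import model_selection
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data (not the wine data).
Xt, Yt = make_classification(n_samples=200, n_features=5, n_informative=3,
                             n_redundant=0, random_state=10)

results = {'LogisticRegression': np.zeros(10)}
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=10)

# One accuracy score per fold, stored under the model's name.
alg = LogisticRegression(max_iter=1000)
results['LogisticRegression'] = model_selection.cross_val_score(
    alg, Xt, Yt, cv=kfold, scoring='accuracy')

print(results['LogisticRegression'].mean())
```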

Execute a Linear Discriminant Analysis classifier model.

Execute a K Neighbors classifier model.

Execute a Decision Tree classifier model.

Execute a GaussianNB classifier model.

Execute a Support Vector Machine (SVC) classifier model.

Execute a Random Forest classifier model. This is new!

<span style="float:right; margin-left:10px; clear:both;">![Task](../media/new_knowledge.png)</span> You've already been introduced to classification trees. A random forest extends that idea: it fits a number of decision tree classifiers on various sub-samples of the dataset and then averages their results. This improves predictive accuracy and controls over-fitting. Learn more at the [sklearn.ensemble.RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) page.

Here's an example for use of the `RandomForestClassifier`:
```python
alg = RandomForestClassifier(n_estimators=100)
```

Plot the results of each of the models. Which performed best?
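Before plotting, it can help to collect the scores into a dataframe with one column per model. The sketch below is an illustration only: the scores are random placeholders, not real model results, and the names `scores`, `summary`, and `best` are made up for the example.

```python
import numpy as np
import pandas as pd

# Placeholder scores only (random numbers, NOT real model results);
# in the notebook, use the `results` dictionary filled in above.
rng = np.random.default_rng(10)
results = {
    'LogisticRegression': rng.uniform(0.5, 0.7, size=10),
    'RandomForestClassifier': rng.uniform(0.5, 0.7, size=10),
}

scores = pd.DataFrame(results)           # one column per model, one row per fold
summary = scores.agg(['mean', 'std']).T  # one row per model
best = summary['mean'].idxmax()          # model with the highest mean score

print(summary)
print(best)
```

A wide-form dataframe like `scores` can be passed directly to a Seaborn boxplot, for example `sns.boxplot(data=scores)`, to compare the spread of scores across models.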

## Exercise 6: Use the Model to Predict

Create a new object of the classifier that performed best.

Create a new model by fitting it with the training data (the same data we just used to evaluate all those different models).

Predict the wine quality for the testing data. Now that the model has been trained, it will predict a quality score for each sample in the smaller validation (testing) dataset. Save the result in a new variable named `predictions`.
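The three steps above (create, fit, predict) can be sketched as follows. The use of `RandomForestClassifier` here is only an assumption for illustration; substitute whichever classifier performed best for you. The data is a synthetic stand-in generated with `make_classification`.

```python
from sklearn import model_selection
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the wine observations and outcome.
X, Y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=10)
Xt, Xv, Yt, Yv = model_selection.train_test_split(
    X, Y, test_size=0.2, random_state=10)

# Assumed "best" classifier for this sketch; swap in your own choice.
alg = RandomForestClassifier(n_estimators=100, random_state=10)
alg.fit(Xt, Yt)                 # train on the training data
predictions = alg.predict(Xv)   # predict outcomes for the held-out data

print(predictions[:10])
```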

Briefly, let's view the contents of the predictions array.

What is the overall accuracy of the predictions?
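Overall accuracy is the fraction of predictions that match the true values, which `accuracy_score` computes directly. A minimal sketch, using short made-up lists in place of `Yv` and `predictions`:

```python
from sklearn.metrics import accuracy_score

# Made-up stand-ins for the true and predicted quality scores.
Yv = [5, 6, 5, 7, 6]
predictions = [5, 6, 6, 7, 5]

# Three of the five predictions match, so the accuracy is 3/5.
accuracy = accuracy_score(Yv, predictions)
print(accuracy)
```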

Create the confusion matrix and use the Seaborn [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html) function to explore how well the model worked. (Note: this may take a while to create.) For the heatmap, be sure to:
+ Show the values of the confusion matrix in the cells of the heatmap
+ Set the x-axis and y-axis labels.
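A minimal sketch of building the confusion matrix, using short made-up lists in place of `Yv` and `predictions`. Wrapping the matrix in a dataframe (named `cm_df` here for illustration) keeps the quality scores attached to the rows and columns.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up stand-ins for the true and predicted quality scores.
Yv = [5, 6, 5, 7, 6]
predictions = [5, 6, 6, 7, 5]

# Every quality score that appears in either list, in sorted order.
labels = sorted(set(Yv) | set(predictions))

# Rows are the true scores; columns are the predicted scores.
cm = confusion_matrix(Yv, predictions, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)

print(cm_df)
```

Passing `cm_df` to `sns.heatmap` with `annot=True` and `fmt='d'` shows the counts in the cells, and the returned axes object provides `set_xlabel` and `set_ylabel` for the axis labels.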

Finally, generate and print the classification report.
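A minimal sketch using `classification_report`, again with short made-up lists in place of `Yv` and `predictions`. Setting `zero_division=0` is an assumption made here to avoid warnings for quality scores that are never predicted.

```python
from sklearn.metrics import classification_report

# Made-up stand-ins for the true and predicted quality scores.
Yv = [5, 6, 5, 7, 6]
predictions = [5, 6, 6, 7, 5]

# Per-class precision, recall, and F1, plus overall accuracy.
report = classification_report(Yv, predictions, zero_division=0)
print(report)
```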