<a href="https://colab.research.google.com/github/martinahuang/CUS615/blob/master/assignment_MartinaHuang_Classification_Loading_Wine_Quality_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wine Quality Dataset 

## Data Description

### Red Wine Quality - Parameters
* fixed.acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
* volatile.acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at levels that are too high can lead to an unpleasant, vinegar taste
* citric.acid (g / dm^3): found in small quantities, citric acid can add 'freshness' and flavor to wines
* residual.sugar (g / dm^3): the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
* chlorides (sodium chloride - g / dm^3): the amount of salt in the wine 
* free.sulfur.dioxide (mg / dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine 
* total.sulfur.dioxide (mg / dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
* density (g / cm^3): the density of wine is close to that of water, depending on the percent alcohol and sugar content
* pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale 
* sulphates (potassium sulphate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
* alcohol (% by volume): the percent alcohol content of the wine 
* quality: quality score between 0 and 10

### Objectives

* To explore the physicochemical properties of red wine
* To determine an optimal machine learning model for red wine quality classification


In [0]:
# Import libraries
import pandas as pd
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt

# Sklearn modules.
from sklearn.model_selection import train_test_split


In [0]:
# Include any additional modules or libraries your code might need here.

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

In [0]:
%%capture 

# Execute this cell first to download the necessary data 
# This cell installs the sample data needed for this workshop in your Colab virtual environment
!wget -O /content/sample_data/wineQualityReds.csv https://raw.githubusercontent.com/christoforou/intro_to_pandas_lab/master/data/red-wine-dataset/wineQualityReds.csv


In [0]:
# Code to load the data from file. Here we use the pandas library to read the csv file. 
datafile = "./sample_data/wineQualityReds.csv"
wine_df = pd.read_csv(datafile)
wine_df.drop(wine_df.columns[0],axis=1,inplace=True)


In [0]:
# Split the data into a training and testing set using the sklearn function train_test_split
# Note that test_size=.25 holds out 25% of the rows for testing and random_state=42 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(wine_df.drop('quality',axis=1), wine_df['quality'], test_size=.25, random_state=42)
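
The effect of `test_size=.25` on the split sizes can be checked without the wine data. The sketch below is only an illustration: `row_ids` is a made-up stand-in for the 1599 rows (1199 + 400, the shapes reported in Challenge 1), and the split settings are the same as in the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 1599 rows of the wine data (an assumption for illustration).
row_ids = np.arange(1599)

# Same split settings as the notebook cell.
train_ids, test_ids = train_test_split(row_ids, test_size=0.25, random_state=42)

# Every row ends up in exactly one of the two sets.
print("train:", len(train_ids), "test:", len(test_ids))
```

Because 25% of 1599 is not a whole number, the test set size is rounded up and the training set receives the remaining rows.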


## Challenge 1
Use the variables `X_train`, `X_test`, `y_train`, and `y_test` to explore your data. In particular, calculate and display the following information.

* Number of samples in the training set in total and in each class.
* Number of samples in the testing set in total and in each class.
* Number of features in the dataset. 
* Number of classes in the dataset.
* IDs of the class labels.


In [11]:
# Your Solution here
X_train.shape



(1199, 11)

In [14]:
y_test.shape

(400,)

In [0]:
# The number of features is the number of columns in X_train
n_features = X_train.shape[1]

In [19]:
str(n_features)

'11'

In [24]:
n_samples_train, n_features = X_train.shape
n_samples_test, _ = X_test.shape
n_classes = len(np.unique(y_train))

# quality has several classes (not positive/negative), so count the samples per class label
print("Number of samples in training set: %d" % n_samples_train)
print("Training samples per class: " + str(y_train.value_counts().sort_index().to_dict()))
print("Number of samples in the testing set: %d" % n_samples_test)
print("Testing samples per class: " + str(y_test.value_counts().sort_index().to_dict()))
print("Number of features: " +  str(n_features))
print("Number of classes: " + str(n_classes))
print("IDs for class labels: " + str(np.unique(y_train)))

Number of samples in training set: 1199
Training samples per class: {3: 9, 4: 40, 5: 517, 6: 469, 7: 151, 8: 13}
Number of samples in the testing set: 400
Testing samples per class: {3: 1, 4: 13, 5: 164, 6: 169, 7: 48, 8: 5}
Number of features: 11
Number of classes: 6
IDs for class labels: [3 4 5 6 7 8]


# Challenge 2

Train a k-NN classifier using the `(X_train,y_train)` dataset and use the trained model to predict the underlying classes for the observations in the test dataset `X_test`. Store your prediction in a variable called `y_pred`.
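
To make the idea behind k-NN concrete, here is a from-scratch sketch on toy data. It assumes Euclidean distance and a simple majority vote; `knn_predict` is a hypothetical helper written for illustration, not part of sklearn, and its tie-breaking may differ from `KNeighborsClassifier`.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_query, k=3):
    """Hypothetical helper: label each query row by majority vote of its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for q in X_query:
        # Euclidean distance from the query point to every training point.
        dists = np.sqrt(((X_train - q) ** 2).sum(axis=1))
        # Indices of the k closest training points.
        nearest = np.argsort(dists)[:k]
        # Majority vote among the labels of those neighbors.
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Toy data: two well-separated clusters with made-up labels 5 and 6.
X_toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
y_toy = [5, 5, 5, 6, 6, 6]
toy_pred = knn_predict(X_toy, y_toy, [[0.05, 0.05], [5.0, 5.1]], k=3)
print(toy_pred)
```

In the notebook itself, `KNeighborsClassifier` does this work through `fit` and `predict`.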

In [0]:
model = KNeighborsClassifier(n_neighbors=3)

In [0]:
# Your solution 
model.fit(X_train,y_train)

y_pred = model.predict(X_test)

y_true = y_test

# Challenge 3

Evaluate the performance of your classifier. Calculate and display the following:
* the `confusion matrix`.
* the `normalized confusion matrix`.
* the probability of correct classification (accuracy score).
* the `precision`, `recall`, and `f1-score` for each class.

In [32]:
# Your solution 
from sklearn.metrics import confusion_matrix

print("\n This is the confusion matrix")
cnf_mx = metrics.confusion_matrix(y_true, y_pred)
print(cnf_mx)


 This is the confusion matrix
[[ 0  0  1  0  0  0]
 [ 0  1  3  8  1  0]
 [ 1 12 98 48  4  1]
 [ 0 15 74 62 18  0]
 [ 0  0 10 21 15  2]
 [ 0  0  1  2  2  0]]


In [33]:
print("\n This is the normalized confusion matrix.")
cnf_mx_joint = cnf_mx.astype('float')/ cnf_mx.sum()
print(cnf_mx_joint)


 This is the normalized confusion matrix.
[[0.     0.     0.0025 0.     0.     0.    ]
 [0.     0.0025 0.0075 0.02   0.0025 0.    ]
 [0.0025 0.03   0.245  0.12   0.01   0.0025]
 [0.     0.0375 0.185  0.155  0.045  0.    ]
 [0.     0.     0.025  0.0525 0.0375 0.005 ]
 [0.     0.     0.0025 0.005  0.005  0.    ]]
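
The matrix above is normalized by the total number of test samples, so all of its entries together sum to 1. A common alternative is to normalize each row by its own total, which makes the diagonal equal to the per-class recall. The sketch below assumes the confusion matrix printed earlier (copied in by hand) and the sklearn convention that rows are true classes and columns are predicted classes; `cnf_mx_by_row` is a name chosen here for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
labels = [3, 4, 5, 6, 7, 8]
cnf_mx = np.array([
    [0,  0,  1,  0,  0, 0],
    [0,  1,  3,  8,  1, 0],
    [1, 12, 98, 48,  4, 1],
    [0, 15, 74, 62, 18, 0],
    [0,  0, 10, 21, 15, 2],
    [0,  0,  1,  2,  2, 0],
])

# Divide each row by its own total so that every row sums to 1.
row_totals = cnf_mx.sum(axis=1, keepdims=True)
cnf_mx_by_row = cnf_mx / row_totals

np.set_printoptions(precision=3, suppress=True)
print(cnf_mx_by_row)

# The diagonal of the row-normalized matrix is the per-class recall.
for label, recall in zip(labels, np.diag(cnf_mx_by_row)):
    print("quality %d: recall %.3f" % (label, recall))
```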


In [34]:
# Accuracy.
#
acc = metrics.accuracy_score(y_true, y_pred)
# Display the output
print("Accuracy: %.3f" % acc)

Accuracy: 0.440


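In [42]:
# The labels are the six quality scores (3 to 8), so the two-name list
# ['Negative', 'Positive'] does not apply here; let sklearn use the class labels.
print(metrics.classification_report(y_true,y_pred))

              precision    recall  f1-score   support

           3       0.00      0.00      0.00         1
           4       0.04      0.08      0.05        13
           5       0.52      0.60      0.56       164
           6       0.44      0.37      0.40       169
           7       0.38      0.31      0.34        48
           8       0.00      0.00      0.00         5

    accuracy                           0.44       400
   macro avg       0.23      0.23      0.22       400
weighted avg       0.45      0.44      0.44       400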
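
The per-class precision, recall, and F1 scores can also be derived by hand from the confusion matrix printed in Challenge 3. The sketch below assumes that matrix (copied in by hand) with rows as true classes and columns as predicted classes; the variable names are chosen for illustration.

```python
import numpy as np

labels = [3, 4, 5, 6, 7, 8]
cnf_mx = np.array([
    [0,  0,  1,  0,  0, 0],
    [0,  1,  3,  8,  1, 0],
    [1, 12, 98, 48,  4, 1],
    [0, 15, 74, 62, 18, 0],
    [0,  0, 10, 21, 15, 2],
    [0,  0,  1,  2,  2, 0],
])

true_pos = np.diag(cnf_mx).astype(float)
predicted_totals = cnf_mx.sum(axis=0)  # column sums: how often each class was predicted
actual_totals = cnf_mx.sum(axis=1)     # row sums: how often each class actually occurs

precision = true_pos / predicted_totals
recall = true_pos / actual_totals
# F1 is the harmonic mean of precision and recall, written here as 2*TP / (predicted + actual).
f1 = 2 * true_pos / (predicted_totals + actual_totals)

for row in zip(labels, precision, recall, f1, actual_totals):
    print("quality %d: precision %.2f, recall %.2f, f1 %.2f, support %d" % row)
```

Every class is predicted at least once in this matrix, so none of the divisions hits a zero denominator; a matrix with an empty column would need a guard.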

# Challenge 4

The code below loads the same dataset, but treats it as a binary classification problem. That is, instead of classifying an observation into one of the quality scores (0 to 10), we consider all observations with a score above 5 as good and all observations with a score of 5 or below as bad.





In [0]:
# Code to load the data from file. Here we use the pandas library to read the csv file. 
datafile = "./sample_data/wineQualityReds.csv"
wine_df = pd.read_csv(datafile)
wine_df.drop(wine_df.columns[0],axis=1,inplace=True)

wine_df['quality'] = np.where(wine_df['quality']>5,"Good","Bad")
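
The thresholding done by `np.where` can be checked on a handful of scores. The values in `scores` below are made up for illustration; the rule is the same as in the cell above (strictly greater than 5 is "Good").

```python
import numpy as np

# Made-up quality scores covering both sides of the threshold.
scores = np.array([3, 4, 5, 6, 7, 8])

# Scores strictly above 5 become "Good"; 5 and below become "Bad".
binary_labels = np.where(scores > 5, "Good", "Bad")
print(binary_labels)

# Count how many samples fall into each class.
classes, counts = np.unique(binary_labels, return_counts=True)
class_counts = dict(zip(classes.tolist(), counts.tolist()))
print(class_counts)
```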

In [0]:
X_train, X_test, y_train, y_test = train_test_split(wine_df.drop('quality',axis=1), wine_df['quality'], test_size=.25, random_state=42)


## Challenge 4.1
Use the variables `X_train`, `X_test`, `y_train`, and `y_test` to explore your data. In particular, calculate and display the following information.
* Number of samples in the training set in total and in each class.
* Number of samples in the testing set in total and in each class.
* Number of features in the dataset. 
* Number of classes in the dataset.
* IDs of the class labels.




In [0]:
# Your Solution 


## Challenge 4.2 
Train a k-NN classifier using the `(X_train,y_train)` dataset and use the trained model to predict the underlying classes for the observations in the test dataset `X_test`. Store your prediction in a variable called `y_pred`.

In [0]:
# Your solution 


## Challenge 4.3
Evaluate the performance of your classifier. Calculate and display the following:
* the `confusion matrix`.
* the `normalized confusion matrix`.
* the probability of correct classification (accuracy score).
* the `precision`, `recall`, and `f1-score` for each class.

In [0]:
# Your Solution 


# Challenge 5

The k-NN classifier accepts a number of parameters. One of those parameters is the number K (i.e., the number of nearest neighbors to consider when making a prediction). Evaluate the classifier for different values of K and identify which configuration achieves the best performance on the testing set. Plot or print your results.
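
One possible shape for such a sweep is sketched below. It runs on synthetic stand-in data from `make_classification` (an assumption so the example is self-contained); in the notebook, the existing `X_train`, `X_test`, `y_train`, and `y_test` would be used instead. The candidate list `k_values` is also an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (an assumption); the notebook already defines these four variables.
X, y = make_classification(n_samples=400, n_features=11, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

k_values = [1, 3, 5, 7, 9, 11, 15]
scores = {}
for k in k_values:
    # Fit a fresh classifier for each K and score it on the held-out test set.
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    scores[k] = accuracy_score(y_test, model.predict(X_test))
    print("k=%2d  accuracy=%.3f" % (k, scores[k]))

# K with the highest test accuracy (the first one wins on ties).
best_k = max(scores, key=scores.get)
print("Best k on this test set:", best_k)
```

Selecting K on the test set follows the wording of the challenge, but it tends to give an optimistic accuracy estimate; cross-validation on the training set is a common alternative.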


In [0]:
# Your solution here.
