# Machine Learning Wine Quality 

This dataset was retrieved from [this page of data science projects](https://www.analyticsvidhya.com/blog/2018/05/24-ultimate-data-science-projects-to-boost-your-knowledge-and-skills/). The dataset can be used to practice outlier detection, feature selection, and working with unbalanced data.

From past studies: " The inputs include objective tests (e.g. PH values) and the output is based on sensory data
  (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality 
  between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model
  these datasets under a regression approach. The support vector machine model achieved the
  best results."
  
The classes are ordered and not balanced (e.g. there are many more normal wines than
   excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent
   or poor wines. Note: several of the attributes may be correlated, so it makes sense to apply some sort of
   feature selection.
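
Both points, class imbalance and correlated attributes, can be checked with a couple of pandas calls once the data is loaded. The sketch below uses a tiny hand-typed frame called `toy` (a hypothetical stand-in with illustrative values, not the full dataset) so that it runs on its own:

```python
import pandas as pd

# Tiny stand-in for the wine data; illustrative values only.
toy = pd.DataFrame({
    "free sulfur dioxide":  [11.0, 25.0, 15.0, 17.0, 11.0, 13.0],
    "total sulfur dioxide": [34.0, 67.0, 54.0, 60.0, 34.0, 40.0],
    "alcohol":              [9.4, 9.8, 9.8, 9.8, 9.4, 9.4],
    "quality":              [5, 5, 5, 6, 5, 7],
})

# Class balance: share of each quality score, most common first.
class_share = toy["quality"].value_counts(normalize=True)

# Pairwise Pearson correlation between the input attributes only.
corr = toy.drop(columns="quality").corr()

print(class_share)
print(corr.round(2))
```

On the real data, the same two calls on `df` would show how dominant the middle quality scores are and which attribute pairs move together.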

The goal here is to predict the quality of the wine from 11 attributes. First up is to read in the data and see whether there is any cleaning to do. There are two datasets: winequality-red and winequality-white. Since I am a fan of red wines, we'll start with that dataset.

In [1]:
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

%matplotlib inline

In [2]:
df = pd.read_csv('winequality-red.csv', sep=';')
df.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [3]:
df.describe()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


Those certainly look like numbers. It doesn't look like there are any null values, but checking anyway ...

In [5]:
df.isnull().any()

fixed acidity           False
volatile acidity        False
citric acid             False
residual sugar          False
chlorides               False
free sulfur dioxide     False
total sulfur dioxide    False
density                 False
pH                      False
sulphates               False
alcohol                 False
quality                 False
dtype: bool

This must be why this dataset is in the easy category. OK, now I'm going to model it with KNN classification using all the attributes, just to see what the result looks like.

## Classification without feature selection

We want to be able to predict the quality of the wine, so the 'quality' column will be the output, y, and the rest of the columns will be the inputs, x.

In [6]:
x  = df.drop("quality", axis=1)
y = df["quality"]

#### Scaling the Data
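The data needs to be scaled because KNN classifies by distance: without scaling, a feature with a large range (such as total sulfur dioxide) would dominate features with a small range (such as density). Many other machine learning estimators also behave badly when features do not look more or less like standard normally distributed data. StandardScaler standardizes each feature by removing the mean and scaling to unit variance.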

In [12]:
scaler = StandardScaler()
scaler.fit(x)
x = scaler.transform(x)
x

array([[-0.9199599 ,  0.3249455 , -1.42502606, ..., -0.21077233,
        -0.43400899,  0.00868398],
       [-0.45396155,  1.3377945 , -0.73215881, ..., -1.32165736,
         1.17257714, -0.88697007],
       [-0.42516706,  0.63557932, -0.94072157, ..., -0.70616423,
         0.91942022, -0.14795181],
       ...,
       [-0.82714143,  0.22463756,  0.90285255, ..., -0.30243896,
         0.12206049, -1.37766201],
       [-1.28974801,  0.70601761,  0.50430487, ..., -0.42076012,
        -0.72095797, -1.04290137],
       [-0.2425393 , -0.38672808,  1.30817897, ...,  1.55632891,
         0.06485062,  0.68869862]])
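
To make the transformation concrete, here is a minimal sketch showing that `StandardScaler` matches the by-hand formula `(x - mean) / std`, computed per column. The small array `X` is made up for illustration and is not taken from the dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
X = np.array([[7.4, 34.0],
              [7.8, 67.0],
              [11.2, 60.0],
              [7.9, 54.0]])

scaled = StandardScaler().fit_transform(X)

# The same result by hand: subtract each column's mean and divide by its
# population standard deviation (ddof=0, which is what StandardScaler uses).
by_hand = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, by_hand))
```

After scaling, every column has mean 0 and standard deviation 1, so no single attribute dominates the distance calculation.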

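#### Principal Component Analysis

PCA is used to reduce the number of features. Since several of the attributes may be correlated, we keep only as many principal components as are needed to explain 95% of the variance in the scaled data.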

In [8]:
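pca = PCA(0.95)          # keep enough components to explain 95% of the variance
pca.fit(x)
pca.n_components_        # number of components kept (not displayed, since it is mid-cell)
x = pca.transform(x)
x = pd.DataFrame(x)
x.head()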

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.61953 | 0.45095 | -1.774454 | 0.04374 | 0.067014 | -0.913921 | -0.161043 | -0.282258 | 0.005098 |
| 1 | -0.79917 | 1.856553 | -0.91169 | 0.548066 | -0.018392 | 0.929714 | -1.009829 | 0.762587 | -0.520707 |
| 2 | -0.748479 | 0.882039 | -1.171394 | 0.411021 | -0.043531 | 0.401473 | -0.539553 | 0.597946 | -0.086857 |
| 3 | 2.357673 | -0.269976 | 0.243489 | -0.92845 | -1.499149 | -0.131017 | 0.34429 | -0.455375 | 0.091577 |
| 4 | -1.61953 | 0.45095 | -1.774454 | 0.04374 | 0.067014 | -0.913921 | -0.161043 | -0.282258 | 0.005098 |
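
Passing a float such as `0.95` to `PCA` tells scikit-learn to keep the smallest number of components whose cumulative explained variance ratio exceeds that fraction. The sketch below illustrates this on synthetic, deliberately correlated data; the array `X`, the seed, and the noise level are assumptions made for the example, not values from the wine dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
# Six features: three independent ones plus three noisy copies of them,
# so the columns are deliberately correlated.
X = np.hstack([base, base + 0.1 * rng.normal(size=(200, 3))])

# Fit all components and accumulate the explained variance ratio.
full = PCA().fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components reaching 95% of the variance.
k = int(np.argmax(cumulative >= 0.95)) + 1

# PCA(0.95) should arrive at the same number on its own.
pca95 = PCA(0.95).fit(X)
print(k, pca95.n_components_)
```

On the wine data, the nine columns in the output above indicate that 9 of the 11 scaled attributes' worth of components were needed to reach the 95% threshold.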


In [9]:
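# Default test_size is 0.25; no random_state is set, so the split (and the accuracy below) varies between runs
x_train, x_test, y_train, y_test = train_test_split(x, y)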
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((1199, 9), (1199,), (400, 9), (400,))
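
Because the quality classes are unbalanced, a plain random split can leave the rare grades under-represented in the test set. One option is a stratified, seeded split. This is a minimal sketch on made-up labels; the arrays, the class proportions, and the `random_state` value are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Imbalanced made-up labels: 80% grade 5, 15% grade 6, 5% grade 7.
y = np.array([5] * 160 + [6] * 30 + [7] * 10)

# stratify=y keeps the class proportions roughly equal in both parts;
# random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(np.unique(y_test, return_counts=True))
```

With the notebook's variables, this would correspond to passing `stratify=y` and a fixed `random_state` to the existing `train_test_split(x, y)` call.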

In [10]:
knn = KNeighborsClassifier()     # the default n_neighbors is 5
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
acc_knn = round(accuracy_score(y_test,y_pred) * 100, 2)
acc_knn

55.25
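
In the cells above, the scaler and PCA are fitted on all rows before the train/test split, so the test rows influence the preprocessing. One variation worth considering is to wrap the steps in a scikit-learn pipeline so that scaling and PCA are fitted on the training rows only. The sketch below uses synthetic data from `make_classification` as a stand-in for the wine data; the sample size, class count, and seeds are assumptions, and the resulting accuracy says nothing about the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data: 11 features, 3 classes.
X, y = make_classification(
    n_samples=600, n_features=11, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale, reduce with PCA, then classify; each step is fitted on X_train only.
model = make_pipeline(StandardScaler(), PCA(0.95), KNeighborsClassifier())
model.fit(X_train, y_train)

# score() applies the fitted scaler and PCA to X_test before predicting.
accuracy = model.score(X_test, y_test)
print(round(accuracy * 100, 2))
```

The pipeline also keeps the preprocessing and the classifier together as one object, which makes it harder to forget a step when predicting on new wines.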