# Lesson 3 Assignment - Wine Classifier

## Author - Studentname

### Instructions
Your task for this assignment: design a simple, low-cost sensor that can distinguish between red wine and white wine.
Your sensor must correctly distinguish between red and white wine for at least 95% of the samples in a set of 6497 test samples of red and white wine.

Your technology is capable of sensing the following wine attributes:
- Fixed acidity  
- Free sulphur dioxide
- Volatile acidity  
- Total sulphur dioxide
- Citric acid  
- Sulphates
- Residual sugar  
- pH
- Chlorides  
- Alcohol
- Density




## Tasks
1. Read <a href="https://library.startlearninglabs.uw.edu/DATASCI420/Datasets/WineQuality.pdf">WineQuality.pdf</a>.
2. Use the provided RedWhiteWine.csv or RedWhiteWine.arff.
   Note: If needed, remove the quality attribute, which is not needed for this assignment.
3. Build an experiment using a Naive Bayes classifier.

Answer the following questions:
1. What is the percentage of correct classification results (using all attributes)?
2. What is the percentage of correct classification results (using a subset of the attributes)?
3. What is the AUC of your model?
4. What is the best AUC that you can achieve?
5. What is the minimum number of attributes needed, and which are they? Why?


In [3]:
URL = "https://library.startlearninglabs.uw.edu/DATASCI420/Datasets/RedWhiteWine.csv"

In [4]:
# Import libraries
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd


In [29]:
wine_data = pd.read_csv(URL)
wine_data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Class
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,1
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,1
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1


In [30]:
print(wine_data.dtypes)
print(wine_data.shape)
a = wine_data['Class']
print(a.value_counts())
print(wine_data.describe())

fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
Class                     int64
dtype: object
(6497, 13)
0    4898
1    1599
Name: Class, dtype: int64
       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    6497.000000       6497.000000  6497.000000     6497.000000   
mean        7.215307          0.339666     0.318633        5.443235   
std         1.296434          0.164636     0.145318        4.757804   
min         3.800000          0.080000     0.000000        0.600000   
25%         6.400000          0.230000     0.250000        1.800000   
50%         7.000000          0.290000     0.310000        3.000000   
75%         7.70
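
The class counts printed above (4,898 rows with `Class` 0 and 1,599 with `Class` 1) give a useful reference point for the 95% target: a classifier that always predicts the majority class already gets a fixed share of the samples right. The arithmetic is sketched below. The counts are copied from the output above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 4898, 1: 1599}

total = sum(class_counts.values())
# The label that occurs most often in the data.
majority_label = max(class_counts, key=class_counts.get)
# Accuracy of a trivial model that always predicts the majority label.
baseline_accuracy = class_counts[majority_label] / total

print("Total samples: %d" % total)
print("Majority class: %d" % majority_label)
print("Majority-class baseline accuracy = %.2f%%" % (100 * baseline_accuracy))
```

Any model worth keeping should beat this baseline, and the assignment's 95% requirement sits well above it.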

Scale the data: standardize each numeric feature to zero mean and unit standard deviation.


In [31]:
num_features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides',
                'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH',
                'sulphates', 'alcohol']
 
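# Keep each feature's mean and standard deviation so the scaling can be undone later
scaled_features = {}
for each in num_features:
    mean, std = wine_data[each].mean(), wine_data[each].std()
    scaled_features[each] = [mean, std]
    # Replace the column with its z-score: (value - mean) / std
    wine_data.loc[:, each] = (wine_data[each] - mean)/std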

print(wine_data.describe())

       fixed acidity  volatile acidity   citric acid  residual sugar  \
count   6.497000e+03      6.497000e+03  6.497000e+03    6.497000e+03   
mean    9.396824e-16     -2.652262e-14  4.807301e-14   -2.252111e-15   
std     1.000000e+00      1.000000e+00  1.000000e+00    1.000000e+00   
min    -2.634386e+00     -1.577208e+00 -2.192664e+00   -1.017956e+00   
25%    -6.288845e-01     -6.661100e-01 -4.722972e-01   -7.657389e-01   
50%    -1.660764e-01     -3.016707e-01 -5.940918e-02   -5.135217e-01   
75%     3.738663e-01      3.664680e-01  4.911081e-01    5.584015e-01   
max     6.698910e+00      7.533774e+00  9.230570e+00    1.268585e+01   

          chlorides  free sulfur dioxide  total sulfur dioxide       density  \
count  6.497000e+03         6.497000e+03          6.497000e+03  6.497000e+03   
mean   1.278966e-14        -6.367933e-17         -5.225926e-16  2.181060e-12   
std    1.000000e+00         1.000000e+00          1.000000e+00  1.000000e+00   
min   -1.342536e+00        -1.6

Now that the features are standardized, let's split the data into a training set (roughly 80% of the rows) and a test set.

In [43]:
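# Randomly assign roughly 80% of the rows to training and the rest to testing
msk = np.random.rand(wine_data.shape[0]) <= 0.8
# Separate the quality column. Note: it is used as the prediction target below,
# although the assignment asks for the red/white label stored in 'Class'.
quality_data = wine_data['quality']
wine_d = wine_data.drop(['quality'], axis=1)
# After dropping quality, columns 0:12 are the 11 scaled features plus 'Class'
wine_train = wine_d.iloc[msk, 0:12]
wine_train_target = quality_data.loc[msk]
wine_test = wine_d.iloc[~msk, 0:12]
wine_test_target = quality_data.loc[~msk]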


print(wine_train.describe())
print("wine train target is ", wine_train_target.describe())
print(wine_test_target.describe())


       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    5243.000000       5243.000000  5243.000000     5243.000000   
mean        0.000316          0.004276    -0.000635        0.000883   
std         0.999635          1.000111     0.997415        1.002197   
min        -2.634386         -1.577208    -2.192664       -1.017956   
25%        -0.628884         -0.666110    -0.472297       -0.765739   
50%        -0.166076         -0.301671    -0.059409       -0.513522   
75%         0.373866          0.366468     0.491108        0.558401   
max         6.698910          7.533774     9.230570       12.685846   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  5243.000000          5243.000000           5243.000000  5243.000000   
mean      0.004156             0.001276             -0.002124     0.007099   
std       1.008667             1.004708              1.004909     1.002768   
min      -1.342536            -1.663455         
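
The random mask above gives an approximately 80/20 split that changes on every run. As an alternative, `train_test_split` (imported earlier but not used) can produce a reproducible split that keeps the red/white ratio the same in both parts. The sketch below is one way to do that with `Class` as the label and only the listed feature columns as inputs. The helper name `split_wine`, the seed, and the small synthetic frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_wine(df, feature_cols, label_col="Class", test_size=0.2, seed=0):
    """Split df into train/test parts, using only feature_cols as inputs.

    Stratifying on the label keeps the class ratio equal in both parts.
    """
    X = df[feature_cols]
    y = df[label_col]
    return train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )


# Synthetic stand-in for the wine frame (NOT the real data), only to show the call.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "alcohol": rng.normal(size=200),
    "pH": rng.normal(size=200),
    "quality": rng.integers(3, 9, size=200),
    "Class": np.repeat([0, 1], [150, 50]),
})

X_train, X_test, y_train, y_test = split_wine(demo, ["alcohol", "pH"])
print(X_train.shape, X_test.shape)
print("Share of Class 1 in train/test:", y_train.mean(), y_test.mean())
```

With the real data, the call would presumably be `split_wine(wine_data, num_features)`, which leaves both `quality` and `Class` out of the feature matrix.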

Now for the actual fitting

In [46]:
from sklearn.naive_bayes import GaussianNB

# Fit Gaussian Naive Bayes on the training split and predict the test split
gnb = GaussianNB()
gnb_model = gnb.fit(wine_train, wine_train_target)
y_pred = gnb_model.predict(wine_test)

# Count the test rows whose predicted label differs from the target
misclassified_points = (wine_test_target != y_pred).sum()
n_test = wine_test.shape[0]
print("Number of mislabeled points out of a total %d points : %d"
      % (n_test, misclassified_points))
print("Accuracy = %.2f" % (100.0 * (n_test - misclassified_points) / n_test))

Number of mislabeled points out of a total 1254 points : 794
Accuracy = 36.68
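
The 36.68% above is the accuracy of predicting the `quality` score, because `wine_train_target` and `wine_test_target` are taken from the `quality` column, while `Class` is still among the input columns. The assignment asks for the red/white label instead, along with the AUC. The sketch below shows one way to fit `GaussianNB` on a binary label and report both metrics. The helper `evaluate_gnb` and the synthetic arrays are illustrative assumptions; the numbers it prints say nothing about the wine data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def evaluate_gnb(X_train, y_train, X_test, y_test):
    """Fit Gaussian Naive Bayes and return (accuracy, AUC) on the test split."""
    model = GaussianNB().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # AUC needs a score per sample: use the predicted probability of label 1.
    pos_col = list(model.classes_).index(1)
    scores = model.predict_proba(X_test)[:, pos_col]
    return accuracy_score(y_test, y_pred), roc_auc_score(y_test, scores)


# Synthetic stand-in (NOT the wine data): three features, only the first
# one carries information about the binary label.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 3))
X[:, 0] += 4.0 * y

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# All features, then a subset containing only the informative column.
acc_all, auc_all = evaluate_gnb(X_tr, y_tr, X_te, y_te)
acc_sub, auc_sub = evaluate_gnb(X_tr[:, [0]], y_tr, X_te[:, [0]], y_te)
print("All features:  accuracy = %.2f%%, AUC = %.3f" % (100 * acc_all, auc_all))
print("One feature:   accuracy = %.2f%%, AUC = %.3f" % (100 * acc_sub, auc_sub))
```

For the wine data, the inputs would be the scaled columns in `num_features` (or a subset of them, for questions 2 and 5) and the label would be `wine_data['Class']`. Whether that reaches the 95% target has to be checked by running it.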
