## Identify Wine Quality using different attributes

## Part 1 - DEFINE

### Dataset Information

### Citation

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: 

[@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf 
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib 

#### Summary

The reference above describes two datasets, one built from red wine samples and one from white wine samples. The inputs are objective laboratory tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (excellent).

#### Features

1 - fixed acidity
<br>
2 - volatile acidity
<br>
3 - citric acid
<br>
4 - residual sugar
<br>
5 - chlorides
<br>
6 - free sulfur dioxide
<br>
7 - total sulfur dioxide
<br>
8 - density
<br>
9 - pH
<br>
10 - sulphates
<br>
11 - alcohol
<br>

#### Target variable
12 - quality (score between 0 and 10)
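
A minimal sketch of how the eleven features and the target listed above could be separated once the data sits in a DataFrame. The three sample rows are copied from the first rows of the red wine file printed further down in this notebook; the names `sample`, `X` and `y` are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Column names as listed in the Features and Target variable sections.
columns = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol", "quality",
]

# First three rows of the red wine data, used here only as a stand-in.
rows = [
    [7.4, 0.70, 0.00, 1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4, 5],
    [7.8, 0.88, 0.00, 2.6, 0.098, 25.0, 67.0, 0.9968, 3.20, 0.68, 9.8, 5],
    [7.8, 0.76, 0.04, 2.3, 0.092, 15.0, 54.0, 0.9970, 3.26, 0.65, 9.8, 5],
]
sample = pd.DataFrame(rows, columns=columns)

X = sample.drop(columns="quality")  # features 1 to 11
y = sample["quality"]               # target (12)

print(X.shape, y.tolist())
```

Keeping the target in its own variable makes it harder to leak `quality` into the feature matrix by accident.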

In [1]:
# Import libraries
import pandas as pd
import sklearn as sk
import numpy as np
import matplotlib.pyplot as plt

# Author information
__author__ = "Viswanathan K S"
__email__ = "viswasiva2003@gmail.com"

## Part 2 - DISCOVER

In [2]:
# Find the Current Directory
import os
print(os.getcwd())

/home/ksviswa/DSDJ_Portfolio/wine-quality


### ---- 2 Load the data ----

In [3]:
# Load the red wine data into a pandas DataFrame (the file is semicolon-separated)
file_path = "/home/ksviswa/DSDJ_Portfolio/wine-quality/"

red_wine_data = pd.read_csv(file_path + "data/winequality-red.csv", sep=';')
print(red_wine_data.head())

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5 

In [4]:
# Load the white wine data the same way
white_wine_data = pd.read_csv(file_path + "data/winequality-white.csv", sep=';')
print(white_wine_data.head())

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.0              0.27         0.36            20.7      0.045   
1            6.3              0.30         0.34             1.6      0.049   
2            8.1              0.28         0.40             6.9      0.050   
3            7.2              0.23         0.32             8.5      0.058   
4            7.2              0.23         0.32             8.5      0.058   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 45.0                 170.0   1.0010  3.00       0.45   
1                 14.0                 132.0   0.9940  3.30       0.49   
2                 30.0                  97.0   0.9951  3.26       0.44   
3                 47.0                 186.0   0.9956  3.19       0.40   
4                 47.0                 186.0   0.9956  3.19       0.40   

   alcohol  quality  
0      8.8        6  
1      9.5        6  
2     10.1        6 

In [6]:
# Number of (rows, columns) in each dataset
print(red_wine_data.shape)
print(white_wine_data.shape)

(1599, 12)
(4898, 12)
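
Both files share the same 12 columns, so if the two datasets are ever analysed together, one option is to stack them and record the origin of each row. The sketch below is an assumption about a possible next step, not something the original notebook does; `combine_wines` and the `wine_type` column are hypothetical names, and the small demo frames only reuse a few values from the `head()` output above.

```python
import pandas as pd

def combine_wines(red: pd.DataFrame, white: pd.DataFrame) -> pd.DataFrame:
    """Stack the two frames and record each row's origin in 'wine_type'."""
    red_tagged = red.assign(wine_type="red")
    white_tagged = white.assign(wine_type="white")
    return pd.concat([red_tagged, white_tagged], ignore_index=True)

# Tiny stand-in frames; in the notebook these would be
# red_wine_data and white_wine_data.
red_demo = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})
white_demo = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1], "quality": [6, 6, 6]})

combined = combine_wines(red_demo, white_demo)
print(combined.shape)
print(combined["wine_type"].value_counts())
```

Applied to the full data, the shapes printed above imply a combined frame of 6497 rows and 13 columns.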


### ---- 3 Clean the data ----

In [9]:
# Check for missing values

print("Red Wine Missing Values : ")
print(red_wine_data.isnull().sum())
print("\n")

print("White Wine Missing Values : ")
print(white_wine_data.isnull().sum())
print("\n")

Red Wine Missing Values : 
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64


White Wine Missing Values : 
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
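
Missing values are not the only thing worth checking at this stage. In the `head()` output above, rows 0 and 4 of the red wine data and rows 3 and 4 of the white wine data agree in every column shown, so exact duplicate rows may also be present. Below is a minimal sketch of such a check on a stand-in frame built from three columns of the first five red wine rows; `count_duplicates` is a hypothetical helper, and whether duplicates should be dropped is left open, since identical measurements could come from distinct wines.

```python
import pandas as pd

def count_duplicates(df: pd.DataFrame) -> int:
    """Return the number of rows that exactly repeat an earlier row."""
    return int(df.duplicated().sum())

# Stand-in data: three columns from the first five red wine rows.
demo = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "volatile acidity": [0.70, 0.88, 0.76, 0.28, 0.70],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.51],
})

n_duplicates = count_duplicates(demo)
print(n_duplicates)

# One way to remove the repeats, keeping the first occurrence of each row.
deduplicated = demo.drop_duplicates().reset_index(drop=True)
print(deduplicated.shape)
```

In the notebook, the same helper could be called on `red_wine_data` and `white_wine_data` directly.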




### ---- 4 Explore the data (EDA) ----

In [16]:
# Use Describe to provide the statistics of the red wine and white wine data.

print("Red Wine Statistics : ")
print(red_wine_data.describe())
print("\n")

print("White Wine Statistics : ")
print(white_wine_data.describe())
print("\n")

# Correlation between each feature and the target.
# [:-1] drops the last entry, which is quality correlated with itself.
# Only the white wine data is printed; the red wine line is left disabled.
print("\n Correlation : All Features vs Target \n")

#print(red_wine_data.corr()['quality'][:-1].sort_values(ascending = False))
print(white_wine_data.corr()['quality'][:-1].sort_values(ascending = False))

# Correlation between features (full correlation matrix), currently disabled
#print("\n Correlation : All Features vs Features \n")
#print(white_wine_data.corr())


Red Wine Statistics : 
       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000   
mean      0.087467            15.874922             46.467792     0.996747   
std       0.047065            10.460157             32.895324     0.001887   
min       0.012000       