# Wine Quality Classification

### Objective

The objective for this project is to classify the quality of wine based on its physicochemical properties.

### Dataset description
The dataset is Wine Quality from the UCI Machine Learning Repository, available [here](https://archive.ics.uci.edu/dataset/186/wine+quality).  
There are 11 features, all of which are continuous variables. The target variable is `quality`.

### Data preprocessing
All preprocessing steps applied to the data are in `data-preprocessing.ipynb`.
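
The notebook itself is not reproduced here. As a rough sketch of what such a cleaning step might look like, the snippet below assumes the red and white files are concatenated with an `is_red_wine` flag, exact duplicate rows are dropped, a subset of columns is kept, and `quality` is turned into a binary `quality_label`. The function name `clean_wine` and the threshold (quality of 5 or lower maps to 1) are assumptions chosen to be consistent with the sample rows shown later in this section; they are not confirmed by the source.

```python
import pandas as pd

# Feature columns that appear in the cleaned file's header.
KEPT_COLUMNS = ["is_red_wine", "alcohol", "density", "volatile acidity", "chlorides"]


def clean_wine(red: pd.DataFrame, white: pd.DataFrame, threshold: int = 5) -> pd.DataFrame:
    """Hypothetical cleaning step; not the code in data-preprocessing.ipynb."""
    red = red.assign(is_red_wine=1)
    white = white.assign(is_red_wine=0)
    # Stack both wine types and drop exact duplicate rows (assumed step).
    combined = pd.concat([red, white], ignore_index=True).drop_duplicates()
    cleaned = combined[KEPT_COLUMNS].copy()
    # Assumed rule: label 1 when quality <= threshold, otherwise 0.
    cleaned["quality_label"] = (combined["quality"] <= threshold).astype(int)
    return cleaned.reset_index(drop=True)


# Small stand-in frames reusing a few rows from the raw files (needed columns only).
red_demo = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.4],
    "density": [0.9978, 0.998, 0.9978],
    "volatile acidity": [0.7, 0.28, 0.7],
    "chlorides": [0.076, 0.075, 0.076],
    "quality": [5, 6, 5],
})
white_demo = pd.DataFrame({
    "alcohol": [8.8, 9.5],
    "density": [1.001, 0.994],
    "volatile acidity": [0.27, 0.3],
    "chlorides": [0.045, 0.049],
    "quality": [6, 6],
})

cleaned_demo = clean_wine(red_demo, white_demo)
print(cleaned_demo)
```

With the real data, `red_demo` and `white_demo` would be replaced by the frames read from the two raw CSV files.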

The raw dataset contains two CSV files: one for red wine and one for white wine. The first five rows of each file are shown below:

In [11]:
import pandas as pd

In [12]:
red_wine_raw = pd.read_csv("../data/raw/winequality-red.csv", delimiter=";")
red_wine_raw.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [13]:
white_wine_raw = pd.read_csv("../data/raw/winequality-white.csv", delimiter=";")
white_wine_raw.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


The cleaned file, `winequality_cleaned.csv`, is located in the `data/cleaned` directory. The first five rows of the cleaned data are shown below:

In [14]:
cleaned = pd.read_csv("../data/cleaned/winequality_cleaned.csv")
cleaned.head()

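| | is_red_wine | alcohol | density | volatile acidity | chlorides | quality_label |
|---|---|---|---|---|---|---|
| 0 | 1 | 9.4 | 0.9978 | 0.7 | 0.076 | 1 |
| 1 | 1 | 9.8 | 0.9968 | 0.88 | 0.098 | 1 |
| 2 | 1 | 9.8 | 0.997 | 0.76 | 0.092 | 1 |
| 3 | 1 | 9.8 | 0.998 | 0.28 | 0.075 | 0 |
| 4 | 1 | 9.4 | 0.9978 | 0.66 | 0.075 | 1 |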


The cleaned dataset is split into training (`winequality_train.csv`) and testing (`winequality_test.csv`) datasets using an 80/20 split.
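
For illustration, an 80/20 split of this kind could be produced with scikit-learn's `train_test_split`. The source does not say how the split was actually made, so the use of scikit-learn, the fixed `random_state`, and the stratification on `quality_label` are all assumptions, and the small synthetic frame below stands in for the cleaned data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for the cleaned data (values are illustrative only).
demo = pd.DataFrame({
    "alcohol": [9.0 + 0.1 * i for i in range(20)],
    "quality_label": [i % 2 for i in range(20)],
})

# 80% of the rows go to training, 20% to testing.
# random_state and stratify are assumptions, not settings taken from the source.
train_demo, test_demo = train_test_split(
    demo,
    test_size=0.2,
    random_state=0,
    stratify=demo["quality_label"],
)

print(len(train_demo), len(test_demo))
```

Stratifying keeps the class proportions similar in both parts, which tends to make the test accuracy easier to compare with the cross-validation score.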

### Model construction
The models chosen for this project are logistic regression, support vector machine (SVM), and linear discriminant analysis (LDA).  
All functions used to run these models are in `model.py`.

In [15]:
from model import run_lda, run_logistic_regression, run_svm

train = pd.read_csv("../data/cleaned/winequality_train.csv")
X_train = train.drop(columns="quality_label")
y_train = train["quality_label"]

test = pd.read_csv("../data/cleaned/winequality_test.csv")
X_test = test.drop(columns="quality_label")
y_test = test["quality_label"]

We first run each model without any tuning. For SVM, this means cost = 1, gamma = 1, and degree = 2 (for the polynomial kernel).
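
Since `model.py` is not shown, the following is only a sketch of how an untuned run could produce the two numbers each call prints: a mean cross-validation score on the training set and an accuracy score on the test set. The helper name `evaluate`, the use of scikit-learn estimators, the 5-fold cross-validation, and the synthetic data are assumptions; only the SVM settings (cost = 1, gamma = 1, degree = 2) come from the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC


def evaluate(model, X_train, X_test, y_train, y_test, cv=5):
    """Hypothetical helper, not the code in model.py.

    Returns the mean cross-validated accuracy on the training set and the
    accuracy on the held-out test set.
    """
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv)
    model.fit(X_train, y_train)
    test_accuracy = accuracy_score(y_test, model.predict(X_test))
    return float(np.mean(cv_scores)), float(test_accuracy)


# Untuned SVM settings quoted in the text: cost = 1, gamma = 1, degree = 2.
# scikit-learn ignores gamma for the linear kernel and degree for non-poly kernels.
svm_params = {"C": 1.0, "gamma": 1.0, "degree": 2}
models = {
    "logistic regression": LogisticRegression(),
    "svm (linear)": SVC(kernel="linear", **svm_params),
    "svm (rbf)": SVC(kernel="rbf", **svm_params),
    "svm (poly)": SVC(kernel="poly", **svm_params),
    "lda": LinearDiscriminantAnalysis(),
}

# Synthetic stand-in data so the sketch runs without the CSV files.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = {name: evaluate(m, X_tr, X_te, y_tr, y_te) for name, m in models.items()}
for name, (cv_mean, acc) in results.items():
    print(f"{name}: CV {cv_mean:.2f}, test accuracy {acc:.2f}")
```

Cross-validating on the training set only keeps the test set untouched until the final accuracy check, which is the usual reason for reporting both numbers.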

In [16]:
run_logistic_regression(X_train, X_test, y_train, y_test)

Average cross-validation score: 0.74
Accuracy score: 0.73


In [17]:
run_svm(X_train, X_test, y_train, y_test, kernel='linear')

Average cross-validation score: 0.74
Accuracy score: 0.73


In [18]:
run_svm(X_train, X_test, y_train, y_test, kernel='rbf')

Average cross-validation score: 0.73
Accuracy score: 0.74


In [19]:
run_svm(X_train, X_test, y_train, y_test, kernel='poly')

Average cross-validation score: 0.74
Accuracy score: 0.74


In [20]:
run_lda(X_train, X_test, y_train, y_test)

Average cross-validation score: 0.74
Accuracy score: 0.74


### Result