<a href="https://colab.research.google.com/github/nyp-sit/sdaai-iti103/blob/master/session-4/classification_winequality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-class Classification

## Introduction

We will be using the wine quality data set for this exercise. It contains various chemical properties of wine, such as acidity, sugar, pH, and alcohol, as well as the wine's color. It also contains a quality score (from 3 to 9, with higher being better).

Using what you have learnt in the previous exercises, you will now build a classification model to predict the quality of the wine, given the various chemical properties and color.

## Getting the Data

You can download the data from the following link:

https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/Wine_Quality_Data.csv

In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data_url = "https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/Wine_Quality_Data.csv"
df = pd.read_csv(data_url)
df.head()




   fixed_acidity  volatile_acidity  ...  quality  color
0            7.4              0.70  ...        5    red
1            7.8              0.88  ...        5    red
2            7.8              0.76  ...        5    red
3           11.2              0.28  ...        6    red
4            7.4              0.70  ...        5    red

[5 rows x 13 columns]

## Data Exploration

Find out the following:
- How many samples do we have?
- Are there any missing values?
- Are there any categorical columns?
- How many different grades (qualities) of wine are there?

In [2]:
## Write your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         6497 non-null   float64
 1   volatile_acidity      6497 non-null   float64
 2   citric_acid           6497 non-null   float64
 3   residual_sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free_sulfur_dioxide   6497 non-null   float64
 6   total_sulfur_dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64  
 12  color                 6497 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 660.0 KB


In [3]:
df["color"].unique()




array(['red', 'white'], dtype=object)
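
The exploration questions above can also be answered one at a time rather than read off `df.info()`. The following is a minimal sketch, assuming a small hand-made DataFrame (`df_demo`, a hypothetical stand-in with made-up values) so that it runs without downloading the CSV. In the notebook, the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the wine DataFrame (values are made up);
# in the notebook, apply the same calls to `df` instead.
df_demo = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 9.4],
        "pH": [3.51, 3.20, 3.26, 3.16, 3.30, 3.51],
        "quality": [5, 5, 6, 7, 6, 5],
        "color": ["red", "red", "white", "white", "white", "red"],
    }
)

# How many samples do we have?
n_samples = len(df_demo)

# Are there any missing values? (count per column)
missing_per_column = df_demo.isna().sum()

# Are there any categorical columns? (everything that is not numeric)
categorical_columns = df_demo.select_dtypes(exclude="number").columns.tolist()

# How many different grades of wine are there, and how common is each?
quality_counts = df_demo["quality"].value_counts().sort_index()

print(n_samples)
print(missing_per_column)
print(categorical_columns)
print(quality_counts)
```

Looking at the per-grade counts is useful later on, because a target with very rare classes affects both the splitting strategy and how the metrics should be read.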

## Data Preparation

As part of data preparation, you will need to do some of the following:
- Encode any categorical columns if necessary
- Handle any missing values
- Scale the features if necessary
- Split the data set into train/validation/test sets

Decide whether you want to use K-fold cross-validation or set aside a dedicated validation set. Explain your choice.

Think about the splitting strategy: do you need a stratified split?
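
One way to approach the splitting question is sketched below. It assumes a dedicated validation set and a 60/20/20 split, and it uses randomly generated data (`X_demo`, `y_demo`, both hypothetical) in place of the wine features and labels. The `stratify` argument of `train_test_split` keeps the class proportions roughly the same in every split, and the scaler is fitted on the training split only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical data: 1000 samples, 4 features, imbalanced 3-class target.
X_demo = rng.normal(size=(1000, 4))
y_demo = rng.choice([5, 6, 7], size=1000, p=[0.3, 0.5, 0.2])

# Hold out 20% as the test set, then take 25% of the remainder as the
# validation set (0.25 * 0.8 = 0.2), which gives a 60/20/20 split.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

# Fit the scaler on the training split only, then reuse it for the other
# splits, so no information from validation/test data leaks into training.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_val_scaled = scaler.transform(X_val)
X_te_scaled = scaler.transform(X_te)

print(X_tr.shape, X_val.shape, X_te.shape)
```

Note that stratification requires at least two samples of every class in the data being split, so very rare grades may need to be merged or handled separately before a stratified split works.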

In [None]:
label_map = {"red": 0, "white": 1}

df["color"] = df["color"].map(label_map)

In [4]:
df["color"].head()




0    0
1    0
2    0
3    0
4    0
Name: color, dtype: int64

In [5]:
df["color"].value_counts()
# The classes are imbalanced: white wines outnumber red wines about 3 to 1




color
1    4898
0    1599
Name: count, dtype: int64

In [None]:
from sklearn.model_selection import train_test_split

X = df[
    [
        "fixed_acidity",
        "volatile_acidity",
        "citric_acid",
        "residual_sugar",
        "chlorides",
        "free_sulfur_dioxide",
        "total_sulfur_dioxide",
        "density",
        "pH",
        "sulphates",
        "alcohol",
        "quality",
    ]
]
# Note: the target in this cell is `color` (red vs. white), with `quality` used as a feature
y = df["color"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

## Build and validate your model

For this exercise, start with an SVM. You do not need to understand what the parameters mean at this point, as you will learn more during the ML Algorithms module.

What do you notice about the validation accuracy, recall, and precision? You can use `classification_report` to get more information about the performance on each class. Analyse the report and explain your result.

In [6]:
## Write your code here
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

svc = LinearSVC(random_state=42)
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.9815384615384616


In [7]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.95      0.96       341
           1       0.98      0.99      0.99       959

    accuracy                           0.98      1300
   macro avg       0.98      0.97      0.98      1300
weighted avg       0.98      0.98      0.98      1300
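
When reading a report like the one above, it can help to compare against a trivial baseline, because accuracy on imbalanced classes is easy to overestimate. The sketch below is an illustration rather than part of the exercise: it builds labels with the same class counts as the `support` column (341 and 959) and scores a `DummyClassifier` that always predicts the most frequent class. The feature array is a placeholder, since this strategy ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels with the same class counts as the support column of the report
y_true = np.array([0] * 341 + [1] * 959)

# Placeholder features: the "most_frequent" strategy ignores them
X_placeholder = np.zeros((len(y_true), 1))

# Fitting on the same labels is only for illustration; in practice the
# baseline would be fitted on the training split.
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_true)
y_base = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_true, y_base)
minority_recall = recall_score(y_true, y_base, pos_label=0)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"recall for class 0: {minority_recall:.3f}")
```

Always predicting class 1 already reaches an accuracy of 959/1300, about 0.74, while the recall for class 0 is zero. This is why the per-class recall and the macro average are more informative than accuracy alone.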



## Improve your model

Based on your analysis above, what do you think you can do to improve the model?

Try to implement ONE possible change to improve your model. Has the validation performance improved?

Now test it on your test set. Is the result similar to your validation result?
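
As one illustration of what a single change could look like, the sketch below compares a plain `LinearSVC` with one that uses `class_weight="balanced"`, scored with stratified 5-fold cross-validation on macro F1. The data comes from `make_classification` and is a hypothetical stand-in for the wine features and labels, and the choice of class weighting is an assumption made for the example, not the expected answer. Other changes, such as feature scaling, could be compared in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Hypothetical imbalanced data standing in for the wine features and labels
X_demo, y_demo = make_classification(
    n_samples=600,
    n_features=8,
    n_informative=5,
    weights=[0.75, 0.25],
    random_state=42,
)

models = {
    "baseline": LinearSVC(random_state=42),
    # The single change: weight classes inversely to their frequency
    "balanced": LinearSVC(class_weight="balanced", random_state=42),
}

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1_macro")
    scores[name] = fold_scores.mean()
    print(f"{name}: mean macro F1 = {scores[name]:.3f}")
```

Whichever change is chosen, the comparison should be made on validation data (or cross-validation folds), and the test set should be used only once at the end.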

In [None]:
## Write your code here