<a href="https://colab.research.google.com/github/nyp-sit/sdaai-iti103/blob/master/session-4/classification_winequality_solution_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-class Classification (Extra Exercise)

## Introduction

We will be using the wine quality data set for this exercise. This data set contains various chemical properties of wine, such as acidity, sugar, pH and alcohol, as well as color. It also contains a quality score (3 to 9, where higher is better).

Using what you have learnt in the previous exercises, you will now build a classification model to predict the quality of the wine, given the various chemical properties and color.

## Getting the Data

You can download the data from the following link:

https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/Wine_Quality_Data.csv


In [2]:
## Write your code here
import pandas as pd 

data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/Wine_Quality_Data.csv'
data = pd.read_csv(data_url)

## Data Exploration

Find out the following:
- How many samples do we have?
- Are there any missing values?
- Are there any categorical columns?
- How many different grades (qualities) of wine are there?
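
The four questions above map onto a handful of pandas calls. The sketch below runs them on a tiny stand-in frame with made-up values (`toy` and its contents are hypothetical, not the wine data), so that each result is easy to check by eye; the same calls should apply unchanged to `data`.

```python
import pandas as pd

# Hypothetical stand-in frame with the same kinds of columns as the wine data.
toy = pd.DataFrame({
    "alcohol": [9.4, 10.1, None, 12.8],   # one missing value on purpose
    "quality": [5, 6, 6, 7],
    "color": ["red", "white", "white", "red"],
})

n_samples = toy.shape[0]                      # number of rows (samples)
missing_per_column = toy.isnull().sum()       # count of missing values per column
# Non-numeric columns are the candidates for categorical encoding.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
n_grades = toy["quality"].nunique()           # number of distinct quality grades

print(n_samples, categorical_cols, n_grades)
print(missing_per_column)
```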

In [7]:
# different grades of wine 
data.quality.value_counts()

6    2836
5    2138
7    1079
4     216
8     193
3      30
9       5
Name: quality, dtype: int64

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         6497 non-null   float64
 1   volatile_acidity      6497 non-null   float64
 2   citric_acid           6497 non-null   float64
 3   residual_sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free_sulfur_dioxide   6497 non-null   float64
 6   total_sulfur_dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64  
 12  color                 6497 non-null   object 
dtypes: float64(11), int64(1), object(1)
memory usage: 660.0+ KB


In [4]:
## Write your code here
data.isnull().sum()

fixed_acidity           0
volatile_acidity        0
citric_acid             0
residual_sugar          0
chlorides               0
free_sulfur_dioxide     0
total_sulfur_dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
color                   0
dtype: int64

## Data Preparation

As part of data preparation, you will need to do some of the following:
- Encode any categorical columns if necessary
- Handle any missing values
- Scale the features if necessary
- Split the dataset into train/val/test

Decide if you want to do K-fold cross-validation or set aside a dedicated validation set. Explain your choice.

Think about the splitting strategy: do you need a stratified split?
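
To see what stratification buys you on imbalanced labels, here is a minimal sketch using `StratifiedKFold` on made-up labels (`X_toy` and `y_toy` are hypothetical, not the wine data). With 10 minority samples and 5 folds, every validation fold should receive the same share of the rare class, which a plain random split does not guarantee.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, 10 samples of class 1.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not matter for splitting

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count how many minority-class samples land in each validation fold.
minority_counts = [
    int(y_toy[val_idx].sum()) for _, val_idx in skf.split(X_toy, y_toy)
]
print(minority_counts)
```

The wine grades are similarly skewed (grade 9 has only 5 samples), which is one reason to keep class proportions equal across splits.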

In [17]:
## Write your code here

def combine(y):
    """Merge the rare extreme grades: 3 and 4 become 4, 8 and 9 become 8."""
    if y <= 4:
        return 4
    if y >= 8:
        return 8
    return y

In [33]:
from sklearn.model_selection import train_test_split

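# One-hot encode the categorical 'color' column; numeric columns pass through unchanged
X = pd.get_dummies(data.drop(['quality'], axis=1))
# Collapse the rare grades so every class has enough samples to stratify on
y = data['quality'].map(combine)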

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

In [34]:
y_test.value_counts()/len(y_test)

6    0.436154
5    0.329231
7    0.166154
4    0.037692
8    0.030769
Name: quality, dtype: float64

In [35]:
y_train.value_counts()/len(y_train)

6    0.436598
5    0.329036
7    0.166057
4    0.037906
8    0.030402
Name: quality, dtype: float64

## Build and validate your model

For this exercise, use RandomForestClassifier with the following parameters: n_estimators=40, max_depth=10. You do not need to understand what the parameters mean at this point, as you will learn more during the ML Algorithms module. (We are not using LogisticRegression because it does not perform as well on this dataset.)

What do you notice about the validation accuracy, recall and precision? You can use the classification report to see the performance of each class. Analyse the report and explain your result.
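
As a reminder of how to read the per-class numbers, the sketch below computes precision and recall by hand-checkable example. The labels are made up (`y_true_toy` and `y_pred_toy` are hypothetical): precision asks "of the samples predicted as this grade, how many really are?", while recall asks "of the samples that really are this grade, how many did we find?".

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: one grade-4 wine, two grade-5 wines, three grade-6 wines.
y_true_toy = [4, 5, 5, 6, 6, 6]
y_pred_toy = [5, 5, 5, 6, 6, 5]
grades = [4, 5, 6]

# average=None returns one score per grade, in the order given by `labels`.
# zero_division=0 reports 0 for a grade that was never predicted (grade 4 here).
precision = precision_score(
    y_true_toy, y_pred_toy, labels=grades, average=None, zero_division=0
)
recall = recall_score(
    y_true_toy, y_pred_toy, labels=grades, average=None, zero_division=0
)

for grade, p, r in zip(grades, precision, recall):
    print(f"grade {grade}: precision={p:.2f} recall={r:.2f}")
```

In this toy case grade 4 is never predicted, so its recall is 0 even though overall accuracy looks reasonable. This is the pattern to look for among the rare grades in the real report.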

In [37]:
## Write your code here
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([("scale", StandardScaler()),
                     ("clf", RandomForestClassifier(n_estimators=40, 
                                                       max_depth=10,
                                                       random_state=0))])

In [38]:
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions: each training sample is predicted by a model that did not see it (3-fold CV)
y_preds = cross_val_predict(pipeline, X_train, y_train, cv=3)


In [39]:
from sklearn.metrics import classification_report

print(classification_report(y_train, y_preds))

              precision    recall  f1-score   support

           4       0.56      0.03      0.05       197
           5       0.68      0.64      0.66      1710
           6       0.58      0.77      0.66      2269
           7       0.63      0.41      0.50       863
           8       0.92      0.14      0.24       158

    accuracy                           0.62      5197
   macro avg       0.67      0.40      0.42      5197
weighted avg       0.63      0.62      0.60      5197



## Improve your model

Based on your analysis above, what do you think you can do to improve the model? 

Try to implement ONE possible change to improve your model. Has the validation performance improved?

Now test it on your test set. Do you get a result similar to your validation result?
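
One change worth trying, given the low recall on the rare grades 4 and 8 in the report above (0.03 and 0.14), is to reweight the classes with `class_weight="balanced"`. The sketch below is only an illustration on synthetic data from `make_classification` (the names `X_demo`, `y_demo` and `macro_recall` are hypothetical, and the class proportions are invented); it compares macro-averaged recall with and without reweighting. Whether this helps on the wine data is something you need to verify yourself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_predict

# Synthetic, imbalanced 3-class problem (assumed proportions: 80% / 15% / 5%).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=0,
)

def macro_recall(class_weight):
    """Cross-validated macro recall for a forest with the given class_weight."""
    clf = RandomForestClassifier(
        n_estimators=40, max_depth=10,
        class_weight=class_weight, random_state=0,
    )
    preds = cross_val_predict(clf, X_demo, y_demo, cv=3)
    # Macro averaging gives every class equal weight, so rare classes count.
    return recall_score(y_demo, preds, average="macro")

baseline = macro_recall(None)
balanced = macro_recall("balanced")
print(f"macro recall, no weighting:       {baseline:.3f}")
print(f"macro recall, class_weight=balanced: {balanced:.3f}")
```

To apply the same idea here, you would pass `class_weight="balanced"` to the `RandomForestClassifier` inside `pipeline`, rerun `cross_val_predict` on the training set, and only then evaluate once on `X_test`.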

In [None]:
## Write your code here