**NOTE: This notebook is written for the Google Colab platform, which provides free hardware acceleration. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook using a local GPU.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, KBinsDiscretizer
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from torch.utils.data import TensorDataset
import torch.nn as nn
import torch

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
from class_utils.download import download_file_maybe_extract
download_file_maybe_extract("https://www.dropbox.com/s/8s0ivlo9yshhxkn/winequality.zip?dl=1", directory="data/winequality")

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

## Classifying Wine Quality using Neural Nets

In this notebook you can try out a neural classifier on another simple dataset.

**Note:** The example is purely illustrative. The dataset is well structured (the data is divided into columns with clear meanings, etc.) and would therefore probably be better approached with a different method in practice, possibly one based on decision trees. Artificial neural networks and deep learning are usually applied to problems with unstructured data such as images, audio or text.

### Loading and Preprocessing the Dataset

We will load our dataset from a CSV file:



In [None]:
df = pd.read_csv("data/winequality/winequality-white.csv")
df.head()

The description of the data comes in a separate file, should we need it:



In [None]:
with open("data/winequality/winequality", "r") as file:
    print(file.read())

#### The Splitting of the Dataset

Next, we split the dataset. This time the data is split into three parts rather than the usual two: training, validation and testing data in the ratio 70 : 5 : 25. The validation data will be used during training for regularization and model selection (for instance to tune the architecture and the amount of dropout, as in the task below). When splitting, we stratify by quality.



In [None]:
df_train_valid, df_test = train_test_split(
    df, test_size=0.25,
    stratify=df['quality'],
    random_state=4)
df_train, df_valid = train_test_split(
    df_train_valid, test_size=0.05/0.75,
    stratify=df_train_valid['quality'],
    random_state=4)
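
The second `test_size` follows from the target ratio: once 25 % of the rows are set aside for testing, a validation share of 5 % of the whole dataset corresponds to 0.05/0.75 of what remains. The sketch below checks this arithmetic on a synthetic table. The frame `toy` and its values are made up for illustration and have nothing to do with the real wine data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine table: 1000 rows, three quality levels.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "quality": rng.choice([5, 6, 7], size=1000, p=[0.3, 0.5, 0.2]),
})

# Step 1: hold out 25 % of all rows for testing.
toy_train_valid, toy_test = train_test_split(
    toy, test_size=0.25, stratify=toy["quality"], random_state=4)

# Step 2: 5 % of all rows is 0.05 / 0.75 of the remaining 75 %.
toy_train, toy_valid = train_test_split(
    toy_train_valid, test_size=0.05 / 0.75,
    stratify=toy_train_valid["quality"], random_state=4)

sizes = {name: len(part) for name, part in
         [("train", toy_train), ("valid", toy_valid), ("test", toy_test)]}
print(sizes)
```

The validation part may come out one row larger or smaller than exactly 5 %, because the requested fraction has to be rounded to a whole number of rows.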

#### Column Selection and Pipeline Creation

As usual, we determine which columns are numeric and which are categorical, and we create a pipeline object for preprocessing. In this dataset all input columns are numeric, so the list of categorical inputs is empty.



In [None]:
categorical_inputs = []
numeric_inputs = list(df.columns[:-1])
output = ["quality"]

input_preproc = make_column_transformer(
    (make_pipeline(
        SimpleImputer(strategy='constant', fill_value='MISSING'),
        OneHotEncoder()),
     categorical_inputs),
    
    (make_pipeline(
        SimpleImputer(),
        StandardScaler()),
     numeric_inputs)
)

Wine quality is given on a scale from 0 to 10 in our dataset (column `quality`). Given that our dataset contains a relatively large amount of noise, this scale might be too fine. We will instead distinguish three degrees of quality and perform the transformation to the new scale automatically, using the `KBinsDiscretizer` transformer from the `scikit-learn` package.



In [None]:
output_preproc = KBinsDiscretizer(3, encode='ordinal', strategy='quantile')
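
To get a feeling for what the quantile strategy does, here is a small sketch on a handful of hypothetical scores (the array `scores` is invented, not taken from the CSV file). The bin edges are learned from the data so that each bin receives a roughly equal share of the samples.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical quality scores, sorted for readability; one column, as the transformer expects.
scores = np.array([3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 8, 9],
                  dtype=float).reshape(-1, 1)

binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
labels = binner.fit_transform(scores).reshape(-1)

# The edges depend on the distribution of the fitted data.
print("bin edges:", binner.bin_edges_[0])
print("labels:   ", labels)
```

With `encode='ordinal'` each sample is replaced by the index of its bin (0, 1 or 2), which can then serve directly as a class label.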

#### Application of Pre-processing

Finally, we apply our transformers to the data. As usual, we make sure to apply the `transform` method and not the `fit_transform` method to our testing – and in this case also to our validation – data.



In [None]:
X_train = input_preproc.fit_transform(df_train[categorical_inputs+numeric_inputs])
Y_train = output_preproc.fit_transform(df_train[output]).reshape(-1)

X_valid = input_preproc.transform(df_valid[categorical_inputs+numeric_inputs])
Y_valid = output_preproc.transform(df_valid[output]).reshape(-1)

X_test = input_preproc.transform(df_test[categorical_inputs+numeric_inputs])
Y_test = output_preproc.transform(df_test[output]).reshape(-1)
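
The difference between `fit_transform` and `transform` can be illustrated with a standalone `StandardScaler`. This is only a sketch on random numbers: the arrays `train` and `test` are made up and stand in for the real splits.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=2.0, size=(200, 1))
test = rng.normal(loc=10.0, scale=2.0, size=(50, 1))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # mean and std are estimated here
test_scaled = scaler.transform(test)        # the same mean and std are reused

print("train mean after scaling:", train_scaled.mean())
print("test mean after scaling: ", test_scaled.mean())
```

The training mean is zero up to rounding error, while the test mean is only close to zero: the test data is scaled with statistics it did not contribute to, which is what keeps information about the test set out of the preprocessing.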

We also convert the data into the datatypes expected by PyTorch, i.e. 32-bit floats (inputs) and 64-bit integers (class labels).



In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

X_train = torch.as_tensor(X_train, dtype=torch.float32).to(device)
Y_train = torch.as_tensor(Y_train, dtype=torch.long).to(device)
X_valid = torch.as_tensor(X_valid, dtype=torch.float32).to(device)
Y_valid = torch.as_tensor(Y_valid, dtype=torch.long).to(device)
X_test = torch.as_tensor(X_test, dtype=torch.float32).to(device)
Y_test = torch.as_tensor(Y_test, dtype=torch.long).to(device)
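
A short sketch of why the conversion matters, using invented arrays `X_np` and `Y_np`: scikit-learn transformers typically return 64-bit floats, whereas the parameters of `nn.Linear` are 32-bit floats by default and `nn.CrossEntropyLoss` expects class indices as 64-bit integers.

```python
import numpy as np
import torch
import torch.nn as nn

# Made-up data: float64 inputs and float-valued class labels.
X_np = np.random.default_rng(0).normal(size=(4, 3))
Y_np = np.array([0.0, 2.0, 1.0, 2.0])

X = torch.as_tensor(X_np, dtype=torch.float32)
Y = torch.as_tensor(Y_np, dtype=torch.long)

layer = nn.Linear(3, 3)                    # float32 parameters by default
loss = nn.CrossEntropyLoss()(layer(X), Y)  # logits: float32, targets: int64
print(X.dtype, Y.dtype, loss.item())
```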

---
### Task 1

**Apply a neural classifier with dropout to the wine quality classification problem. In your training loop, log results on the validation set at each epoch. You can use the results on the validation set to tune your network architecture, the amount of dropout, etc.** 

**Hint: For the sizes of the linear layers, you can start with the following:**

* `num_inputs`;
* 64;
* 32;
* 16;
* `num_outputs`.
---
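
To show the general shape of such a solution without solving the task on the real data, here is a minimal sketch trained on random tensors. The class name `SketchNet`, the dropout probability, the optimizer, the number of epochs and the full-batch updates are all assumptions made for illustration; they are not the reference solution and would need tuning on the validation set.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class SketchNet(nn.Module):
    """Hypothetical example network; not the reference solution."""
    def __init__(self, num_inputs, num_outputs, p_drop=0.2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(num_inputs, 64), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(64, 32), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, num_outputs),  # raw logits for CrossEntropyLoss
        )

    def forward(self, x):
        return self.layers(x)

# Random stand-in data: 11 features, 3 classes.
X_tr, Y_tr = torch.randn(256, 11), torch.randint(0, 3, (256,))
X_va, Y_va = torch.randn(64, 11), torch.randint(0, 3, (64,))

net = SketchNet(11, 3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

loss_train, loss_valid = [], []
for epoch in range(20):
    net.train()                   # dropout is active during training
    optimizer.zero_grad()
    loss = criterion(net(X_tr), Y_tr)
    loss.backward()
    optimizer.step()
    loss_train.append(loss.item())

    net.eval()                    # dropout is disabled for validation
    with torch.no_grad():
        loss_valid.append(criterion(net(X_va), Y_va).item())

print(loss_train[-1], loss_valid[-1])
```

The two lists are named `loss_train` and `loss_valid` so that they match what the plotting cell further down expects. Switching between `train()` and `eval()` matters here because dropout behaves differently in the two modes.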


In [None]:
class Net(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        
    
    # ------




In [None]:
num_inputs = X_train.shape[1]
num_outputs = len(np.unique(Y_train.cpu()))
model = Net(num_inputs, num_outputs).to(device)

criterion = nn.CrossEntropyLoss()




# ------





In [None]:
plt.plot(loss_train, label="train")
plt.plot(loss_valid, label="valid")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.grid(ls='--')
plt.legend()

In [None]:
# accuracy on the train set
model.eval()
with torch.no_grad():
    y_train_logit = model(X_train)
    y_train = y_train_logit.argmax(dim=1)

Y_train_cpu = Y_train.cpu()
y_train_cpu = y_train.cpu()

cm = pd.crosstab(
    output_preproc.inverse_transform(
        Y_train_cpu.reshape(-1, 1)).reshape(-1),
    output_preproc.inverse_transform(
        y_train_cpu.reshape(-1, 1)).reshape(-1),
    rownames=['actual'],
    colnames=['predicted']
)
print(cm)

acc = accuracy_score(Y_train_cpu, y_train_cpu)
print("Accuracy on train = {}".format(acc))
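
The evaluation cells all follow the same pattern: take the class with the largest logit as the prediction, then tabulate predictions against the actual labels. A toy sketch with made-up logits for five samples (the tensors `logits` and `targets` are invented for illustration):

```python
import pandas as pd
import torch
from sklearn.metrics import accuracy_score

# Made-up logits for five samples and three classes.
logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.1, 0.2, 3.0],
                       [1.0, 0.9, 0.2],
                       [0.3, 2.2, 0.4]])
targets = torch.tensor([0, 1, 2, 1, 1])

predicted = logits.argmax(dim=1)  # index of the largest logit in each row

toy_cm = pd.crosstab(targets.numpy(), predicted.numpy(),
                     rownames=['actual'], colnames=['predicted'])
print(toy_cm)
print("Accuracy =", accuracy_score(targets.numpy(), predicted.numpy()))
```

Here four of the five predictions match the targets, so the accuracy is 0.8; the one error shows up off the diagonal of the table.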

In [None]:
# accuracy on the validation set
model.eval()
with torch.no_grad():
    y_valid_logit = model(X_valid)
    y_valid = y_valid_logit.argmax(dim=1)

Y_valid_cpu = Y_valid.cpu()
y_valid_cpu = y_valid.cpu()

cm = pd.crosstab(
    output_preproc.inverse_transform(
        Y_valid_cpu.reshape(-1, 1)).reshape(-1),
    output_preproc.inverse_transform(
        y_valid_cpu.reshape(-1, 1)).reshape(-1),
    rownames=['actual'],
    colnames=['predicted']
)
print(cm)

acc = accuracy_score(Y_valid_cpu, y_valid_cpu)
print("Accuracy on valid = {}".format(acc))

### Evaluation on the Test Set

Once your final model is ready, you can evaluate it on the test set.



In [None]:
# evaluate on the test set
model.eval()
with torch.no_grad():
    y_test_logit = model(X_test)
    y_test = y_test_logit.argmax(dim=1)

Y_test_cpu = Y_test.cpu()
y_test_cpu = y_test.cpu()

cm = pd.crosstab(
    output_preproc.inverse_transform(
        Y_test_cpu.reshape(-1, 1)).reshape(-1),
    output_preproc.inverse_transform(
        y_test_cpu.reshape(-1, 1)).reshape(-1),
    rownames=['actual'],
    colnames=['predicted']
)
print(cm)

acc = accuracy_score(Y_test_cpu, y_test_cpu)
print("Accuracy on test = {}".format(acc))

### Classification using XGBoost

Finally, just like in the regression notebook, since we are working with nice, structured, tabular data here, we are also going to fit an XGBoost model to our dataset for comparison. Chances are that the results will be on par with our neural model or even better at a fraction of the computational cost.



In [None]:
from xgboost import XGBClassifier
X_train_np = X_train.cpu().numpy()
Y_train_np = Y_train.cpu().numpy()
X_test_np = X_test.cpu().numpy()
Y_test_np = Y_test.cpu().numpy()

# use a separate name so that the neural `model` is not overwritten
xgb_model = XGBClassifier()
xgb_model.fit(X_train_np, Y_train_np);

In [None]:
y_test = xgb_model.predict(X_test_np)
cm = pd.crosstab(Y_test_np.reshape(-1),
                 y_test.reshape(-1),
                 rownames=['actual'],
                 colnames=['predicted'])
print(cm)

print("Accuracy on test: {}.".format(accuracy_score(
    Y_test_np, y_test
)))