# Task: Predicting the quality of white wine
This report is divided into two sections: 1) Methodology, and 2) Implementation. The first part explains the steps of the approach and the selected model. The second part contains the annotated Python code that implements the given methodology and predicts the labels for the test set.


### 1. Methodology
In this task we are asked to solve an ordinal regression problem. To do so, we fit 8 models: one threshold-based ordinal regression model and 7 classification models, namely Multinomial Naive Bayes, an SVM classifier, a deep neural network (Multi-layer Perceptron (MLP) classifier), a Decision Tree classifier, a Gradient Boosting classifier, a K-Nearest-Neighbors (KNN) classifier, and a Random Forest (RF) classifier.
To compare the performance of the models, we run 10-fold cross validation on the labeled dataset. Because labeled data are limited (only 2000 rows), we split them only into training and validation sets and hold out no separate test set, since a test-set result would not affect our model selection. The selected model is then used to predict the labels of the unlabeled test set (which has 2898 rows).

For the ordinal regression, MLP, Gradient Boosting, KNN, and RF models, some hyperparameters are tuned: multiple values are examined and the one that gives the highest cross-validated accuracy is kept. More specifically, for the ordinal regression, the hyperparameter alpha is set to 0.001. For the MLP, we use the activation function "relu", the solver "adam", a maximum of 5000 iterations, and 17 hidden layers with 20 units each. For the Gradient Boosting classifier, the number of estimators and the learning rate are set to 300 and 0.1, respectively. For KNN, K=300. Finally, for the RF, 400 trees and a maximum depth of 40 are used. For the other models, the default hyperparameters are used.
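The tuning procedure described above can be sketched with scikit-learn's `GridSearchCV`. This is only an illustration under stated assumptions: the report does not list which candidate values were examined, so the grid below is hypothetical, and synthetic data from `make_classification` stands in for the wine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the wine data (assumption): 300 rows, 6 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0)

# Hypothetical candidate values; the grid actually examined is not given in the report.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, 10]}

# Every combination is scored by cross-validated accuracy and the best one is kept.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```

The same pattern would apply to the other tuned models by swapping the estimator and the grid.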

After running the 10-fold cross validation and computing the average accuracy of each model, we find that the RF classifier performs best, with an average accuracy of 63.1%. The Gradient Boosting and Decision Tree classifiers come next, with average accuracies of 58.95% and 54.15%, respectively. None of the remaining models reaches an average accuracy of 50%. Surprisingly, even the ordinal regression model performs poorly and is only the fourth best model, with an average accuracy of 49.6%.

In the final step, the selected model (RF classifier) is trained again on the whole labeled dataset (which has 2000 rows) and is then used to predict the labels of the unlabeled test set of the task (which has 2898 rows). It is worth mentioning that our Random Forest classifier is a black box and is not interpretable. However, since the main objective of this task is to achieve the highest accuracy, RF is used.

Moreover, the scikit-learn library used in this task applies a Stratified K-Folds cross-validator by default when K-fold cross validation is requested with an integer number of folds for a classifier. Stratified K-Folds generates validation sets that all contain the same distribution of classes, or as close to it as possible.
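To make the stratification concrete, the following minimal sketch builds the splitter explicitly and counts the classes in each validation fold. The toy label vector (60, 30, and 10 samples of classes 1, 2, and 3) is an assumption chosen for illustration and is not the class distribution of the wine data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels (assumption): 60 of class 1, 30 of class 2, 10 of class 3.
y_demo = np.array([1] * 60 + [2] * 30 + [3] * 10)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not influence the split

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

fold_counts = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    # Count how many samples of each class land in this validation fold.
    counts = {int(c): int(np.sum(y_demo[val_idx] == c)) for c in (1, 2, 3)}
    fold_counts.append(counts)
    print(counts)
```

Each of the ten validation folds keeps the 6:3:1 ratio of the full label vector. An object such as `skf` can also be passed as the `cv` argument of `cross_val_score` in place of an integer.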


### 2. Implementation
This section provides the steps required to code the ordinal regression task of predicting the quality of white wine for the given datasets.

Install the mord package for ordinal regression.

In [None]:
!pip install mord

Collecting mord
  Downloading https://files.pythonhosted.org/packages/67/9d/c791c841501d9ff4ecb76b57f208dec6cf9f925109c59c995ddec80f9b32/mord-0.6.tar.gz
Building wheels for collected packages: mord
  Building wheel for mord (setup.py) ... done
  Created wheel for mord: filename=mord-0.6-cp36-none-any.whl size=6008 sha256=83fa348319dc4dc36ac99d85372a8034569a385732e4f06a1ee5a14ec29a6aa8
  Stored in directory: /root/.cache/pip/wheels/98/14/b2/244c2cec93a0c6edb29b488bd6b2710ded7e9d457033b86366
Successfully built mord
Installing collected packages: mord
Successfully installed mord-0.6


Import the required libraries and packages.

In [None]:
#Import pandas for DataFrame Structure and the corresponding operations
import pandas as pd

#Import train_test_split function
from sklearn.model_selection import train_test_split

#Import cross_val_score for obtaining the accuracy of the validation sets
#in K-Fold cross validation
from sklearn.model_selection import cross_val_score   

#For file upload and download in google colab
from google.colab import files

#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Importing the mean function from statistics module
#(for taking the average of a list)
from statistics import mean


#import mord for ordinal regression
import mord
#import Multinomial Naive Bayes model for classification
from sklearn.naive_bayes import MultinomialNB
#import svm for SVM classification
from sklearn import svm
#import MLPClassifier for Multi-layer Perceptron classification (deep learning)
from sklearn.neural_network import MLPClassifier
#import DecisionTreeClassifier for a tree classification
from sklearn.tree import DecisionTreeClassifier
#import GradientBoostingClassifier for Gradient Boosting Classification
from sklearn.ensemble import GradientBoostingClassifier
#import KNeighborsClassifier for KNN classification
from sklearn.neighbors import KNeighborsClassifier
#import RandomForestClassifier for Random Forest Classification
from sklearn.ensemble import RandomForestClassifier



Import the training dataset.
Note that cookies must be allowed for the following code to work in Google Colab.
If they are not already allowed, this [link](https://stackoverflow.com/questions/53581023/google-colab-file-download-failed-to-fetch-error) explains how to enable them.

In [None]:
uploaded = files.upload()

Saving datrain.txt to datrain.txt


Import the test dataset.

In [None]:
uploaded = files.upload()

Saving dateststudent.txt to dateststudent.txt


Load the training and test datasets into pandas DataFrames.

In [None]:
train = pd.read_csv('datrain.txt', sep=' ')
test = pd.read_csv('dateststudent.txt', sep=' ')

display(train.head())
display(test.head())

Unnamed: 0,fixedacidity,volatileacidity,citricacid,residualsugar,chlorides,freesulfurdioxide,totalsulfurdioxide,density,pH,sulphates,alcohol,y
0,7.3,0.32,0.35,1.4,0.05,8.0,163.0,0.99244,3.24,0.42,10.7,1
1,7.3,0.26,0.31,1.6,0.04,39.0,173.0,0.9918,3.19,0.51,11.4,2
2,8.3,0.25,0.49,16.8,0.048,50.0,228.0,1.0001,3.03,0.52,9.2,2
3,7.0,0.16,0.73,1.0,0.138,58.0,150.0,0.9936,3.08,0.3,9.2,1
4,5.8,0.18,0.37,1.2,0.036,19.0,74.0,0.98853,3.09,0.49,12.7,3


Unnamed: 0,fixedacidity,volatileacidity,citricacid,residualsugar,chlorides,freesulfurdioxide,totalsulfurdioxide,density,pH,sulphates,alcohol
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8
1,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1
3,6.2,0.32,0.16,7.0,0.045,30.0,136.0,0.9949,3.18,0.47,9.6
4,7.9,0.18,0.37,1.2,0.04,16.0,75.0,0.992,3.18,0.63,10.8


Now, we define the covariate matrix X and the target vector y for the training dataset. Note that the regular expression below keeps six of the eleven available covariates.

In [None]:
covariates= \
"fixedacidity|volatileacidity|citricacid|residualsugar|chlorides|freesulfurdioxide"
X = train.filter(regex=('('+covariates+')')) 
print(X.shape)

y = train['y']
print(y.shape)

(2000, 6)
(2000,)
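`DataFrame.filter(regex=...)` keeps every column whose name matches the pattern, so the selection above is equivalent to indexing with an explicit list of the six names. The sketch below checks this on a tiny stand-in frame; `train_demo` and `feature_cols` are hypothetical names, and the frame holds only a subset of the real columns.

```python
import pandas as pd

# Tiny stand-in frame with a subset of the training columns (assumption).
train_demo = pd.DataFrame({
    "fixedacidity": [7.3, 7.3],
    "volatileacidity": [0.32, 0.26],
    "citricacid": [0.35, 0.31],
    "residualsugar": [1.4, 1.6],
    "chlorides": [0.05, 0.04],
    "freesulfurdioxide": [8.0, 39.0],
    "totalsulfurdioxide": [163.0, 173.0],
    "alcohol": [10.7, 11.4],
    "y": [1, 2],
})

feature_cols = ["fixedacidity", "volatileacidity", "citricacid",
                "residualsugar", "chlorides", "freesulfurdioxide"]

# Regex-based selection, built the same way as in the cell above.
X_regex = train_demo.filter(regex="(" + "|".join(feature_cols) + ")")

# Explicit selection by column name.
X_list = train_demo[feature_cols]

print(list(X_regex.columns) == list(X_list.columns))
print(X_list.shape)
```

An explicit list makes it easier to see which covariates are left out, such as `totalsulfurdioxide` and `alcohol` here.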


Run 10-fold cross validation on the training dataset (with 2000 rows) using the ordinal regression model and the classification models.

In [None]:
#Implementing a Threshold-based model from the mord package for ordinal regression
mord2 = mord.LogisticIT(alpha=0.001)
print("The accuracy of mord.LogisticIT is: " 
      , mean(cross_val_score(mord2, X, y, cv=10)))

mnb = MultinomialNB()
print("The accuracy of MultinomialNB is: " 
      , mean(cross_val_score(mnb, X, y, cv=10)))

svmc = svm.SVC()
print("The accuracy of svm.SVC is: " 
      , mean(cross_val_score(svmc, X, y, cv=10)))

mlp = MLPClassifier(hidden_layer_sizes=
                    (20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20) 
                    , activation='relu', solver='adam', max_iter=5000)
print("The accuracy of MLPClassifier is: " 
      , mean(cross_val_score(mlp, X, y, cv=10)))

tree = DecisionTreeClassifier(random_state=0)
print("The accuracy of DecisionTreeClassifier is: " 
      , mean(cross_val_score(tree, X, y, cv=10)))

gbc = GradientBoostingClassifier(n_estimators=300, learning_rate=0.1)
print("The accuracy of GradientBoostingClassifier is: " 
      , mean(cross_val_score(gbc, X, y, cv=10)))

knn = KNeighborsClassifier(n_neighbors=300)
print("The accuracy of KNeighborsClassifier is: " 
      , mean(cross_val_score(knn, X, y, cv=10)))

rf = RandomForestClassifier(n_estimators=400, max_depth=40, random_state=0)
print("The accuracy of RandomForestClassifier is: " 
      , mean(cross_val_score(rf, X, y, cv=10)))

The accuracy of mord.LogisticIT is:  0.496
The accuracy of MultinomialNB is:  0.4305
The accuracy of svm.SVC is:  0.481
The accuracy of MLPClassifier is:  0.489
The accuracy of DecisionTreeClassifier is:  0.5415
The accuracy of GradientBoostingClassifier is:  0.5895
The accuracy of KNeighborsClassifier is:  0.4805
The accuracy of RandomForestClassifier is:  0.631
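The comparison above reports only the mean accuracy per model. As a hedged sketch, the same comparison can be written as a loop that also records the spread across folds, which helps judge whether gaps between models are meaningful. The data here is synthetic and only three of the eight models are included; the `models` and `results` dictionaries are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0)

# A subset of the models compared in the report, with small illustrative settings.
models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=15),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # cross_val_score returns one accuracy value per fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=10)
    results[name] = (scores.mean(), scores.std())

# Print the models from best to worst mean accuracy.
for name, (avg, sd) in sorted(results.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{name}: {avg:.3f} +/- {sd:.3f}")
```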


Now, we define the covariate matrix Xtest for the test dataset.
Please note that we could also do a variable selection (using the training dataset) before defining Xtest. Right now the same six covariates as in the training matrix X are used.

In [None]:
Testcovariates = \
"fixedacidity|volatileacidity|citricacid|residualsugar|chlorides|freesulfurdioxide"
Xtest = test.filter(regex=('('+Testcovariates+')')) 
print(Xtest.shape)

(2898, 6)


TEST SET PREDICTIONS: Here, we use the model with the highest validation accuracy in the 10-fold CV above, which is the random forest.

In [None]:
rf = RandomForestClassifier(n_estimators=400, max_depth=40, random_state=0)

#Train the model using the training set
rf.fit(X, y)

#Predict the response for train dataset
y_predtr = rf.predict(X)

#Predict the response for test dataset
ypred = rf.predict(Xtest)


#Model accuracy on the training set: how often the classifier is correct on the data it was fit on.
print("rf train Accuracy:",metrics.accuracy_score(y, y_predtr))


rf train Accuracy: 1.0
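A training accuracy of 1.0 mostly reflects that the fully grown trees have memorized the training rows, so it should not be read as an estimate of test performance. One way to obtain a more realistic figure from the same data is to score out-of-fold predictions, sketched below with `cross_val_predict` on synthetic data (an assumption; the variable names are illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the wine data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)

# Accuracy on the same rows the model was fit on (optimistic).
train_acc = accuracy_score(y_demo, rf_demo.fit(X_demo, y_demo).predict(X_demo))

# Each row is predicted by a model that did not see it during fitting.
oof_pred = cross_val_predict(rf_demo, X_demo, y_demo, cv=10)
oof_acc = accuracy_score(y_demo, oof_pred)

print("train accuracy:", train_acc)
print("out-of-fold accuracy:", oof_acc)
```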


In [None]:
print("rf test predictions:", ypred[0:19])

df = pd.DataFrame(ypred)

df.to_csv("Predictions.csv")

files.download('Predictions.csv')


rf test predictions: [2 1 1 1 1 1 1 2 1 2 1 3 1 3 2 1 1 1 2]
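The exported file above has an unnamed index column and a prediction column called `0`. A minimal sketch of writing the predictions with explicit column names follows; the names `y_pred` and `row` are hypothetical, and an in-memory buffer replaces the file so the round trip can be checked without Colab.

```python
import io

import numpy as np
import pandas as pd

# First five predictions from the output above, used as sample values.
ypred_demo = np.array([2, 1, 1, 1, 1])

# Give the prediction column a descriptive name (hypothetical: "y_pred").
pred_df = pd.DataFrame({"y_pred": ypred_demo})

# Write to an in-memory buffer; a file path such as "Predictions.csv" works the same way.
buffer = io.StringIO()
pred_df.to_csv(buffer, index_label="row")

# Read the CSV back to confirm that the values survive the round trip.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="row")
print(restored["y_pred"].tolist())
```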


This report was originally produced for an assignment in the PhD course Advanced Statistical Learning at the Data Science Department of HEC Montreal.