Assignment

After you have tested some classifiers/regressors on the wines dataset (homework 1), you need to improve your models. Your tasks for this homework are:

    Use gridsearch with cross validation to improve one of your previously submitted models (Homework 1) for each problem: regression, binary classification and multiclass classification. Select your parameters and scan their possible values properly (hint: use numpy). Report the test set score, best parameters, best cross-validation score and best estimator. What is the improvement with respect to the past results?

    For your binary classification problem, use gridsearch and cross-validation with SVC. What is the best kernel? Print the heat map of mean cross-validation score (hint: use the mglearn library), the ROC curve and report the AUC value.

    For your binary classification model, repeat the grid search with cross-validation using AUC as the evaluation metric. Does the result improve? Elaborate.

You can work in small groups, but you should submit your notebook individually. Deadline: 15 November at midnight.



In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

# Download red wine data to Pandas dataframe
redWine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",delimiter=";")
# Check if there are some null values
print("Are there any null values in redWine data frame? " + str(redWine.isnull().values.any()))
print("redWine shape is:")
print(redWine.shape)
# Add 1 as red wine type
redWine["type"] = 1
display(redWine.head())

# Download white wine data to Pandas dataframe
whiteWine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",delimiter=";")
# Check if there are some null values
print("Are there any null values in whiteWine data frame? " + str(whiteWine.isnull().values.any()))
print("whiteWine shape is:")
print(whiteWine.shape)
# Add 0 as white wine type
whiteWine["type"] = 0
# Merge redWine and whiteWine dataframe 
wine = pd.concat([redWine, whiteWine],sort=False)
print("the merged wine dataframe shape is:")
print(wine.shape)

In [None]:
from sklearn.model_selection import GridSearchCV
# Use alcohol level as label and convert dataframe to numpy array
X = wine.drop(['alcohol'], axis=1).values
y = wine['alcohol'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_grid ={'alpha':[0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
            'tol':[0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}
GCV = GridSearchCV(Ridge(random_state=30), param_grid, scoring='r2',n_jobs =-1,cv=5, return_train_score=True)
GCV.fit(X_train, y_train)
print("The best parameters are: "+ str(GCV.best_params_))
print("The best score is: "+str(GCV.best_score_))
print("Mean train scores are: "+ str(GCV.cv_results_['mean_train_score']))
print("Mean test scores are: "+ str(GCV.cv_results_['mean_test_score']))
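
The assignment hints at numpy for scanning parameter values. A minimal sketch, assuming synthetic data from `make_regression` in place of the wine features (so the numbers it prints are not the wine results): `np.logspace` builds the same decade-spaced values as the hand-written `alpha` list, and the fitted search exposes the best estimator and the test set score that the task asks for. The names `demo_grid`, `demo_search` and the `Xd_*`/`yd_*` arrays are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the wine features (assumption, for illustration only)
X_demo, y_demo = make_regression(n_samples=300, n_features=12, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=42)

# Eight decade-spaced values from 1e-5 to 1e2, same as the hand-written list
alphas = np.logspace(-5, 2, num=8)
demo_grid = {'alpha': alphas}

demo_search = GridSearchCV(Ridge(), demo_grid, scoring='r2', cv=5)
demo_search.fit(Xd_train, yd_train)

print("Best parameters:", demo_search.best_params_)
print("Best cross-validation R^2: {:.3f}".format(demo_search.best_score_))
print("Best estimator:", demo_search.best_estimator_)
# With refit=True (the default), score() uses the best estimator refit on the whole training set
test_r2 = demo_search.score(Xd_test, yd_test)
print("Test set R^2: {:.3f}".format(test_r2))
```

The same four quantities can be read from the `GCV` object above after it is fitted on the wine data.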

In [None]:
# Use binary classifier to predict wine type, namely white wine or red wine
from sklearn.neighbors import KNeighborsClassifier
X = wine.drop(['type'], axis=1).values
y = wine['type'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
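# leaf_size controls the speed and memory use of the tree-based neighbor search;
# it should not change the predictions, so scanning it mainly adds run time.
param_grid ={'n_neighbors':[1, 2, 3, 4, 5, 6, 7, 8],
            'p':[1,2],
            'leaf_size':[20, 25, 30, 35, 40, 45]}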
GCV = GridSearchCV(KNeighborsClassifier(), param_grid, scoring='accuracy',n_jobs =-1,cv=5)
GCV.fit(X_train, y_train)
print("The best parameters are: "+ str(GCV.best_params_))
print("The best accuracy is: "+str(GCV.best_score_))

In [None]:
# Use multiclass classifier RandomForestClassifier to predict wine quality
from sklearn.ensemble import RandomForestClassifier
X = wine.drop(['quality'], axis=1).values
y = wine['quality'].values
unique, counts = np.unique(y, return_counts=True)
display(dict(zip(unique, counts)))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_grid ={'n_estimators':[10, 50, 100, 150, 200, 250],
             'min_samples_split':[2, 3, 4, 5, 6],
             'min_samples_leaf':[1, 2, 3, 4, 5]}

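# Fix the forest's random seed for reproducible results and parallelize only the
# grid search (setting n_jobs=-1 at both levels can oversubscribe the CPU)
GCV = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, n_jobs=-1, cv=5)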
GCV.fit(X_train, y_train)
print("The best parameters are: "+ str(GCV.best_params_))
print("The best accuracy is: "+str(GCV.best_score_))

In [None]:
# Use SVC classifier to predict wine type
from sklearn.svm import SVC
X = wine.drop(['type'], axis=1).values
y = wine['type'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
param_grid = [{'kernel': ['rbf'],
               'C': [0.001, 0.01, 0.1, 1, 10, 100],
               'gamma': [0.001, 0.01, 0.1, 1, 10, 100]},
              # gamma is not used by the linear kernel, so only C is scanned here
              {'kernel': ['linear'],
               'C': [0.001, 0.01, 0.1, 1, 10, 100]}
              ]

GCV = GridSearchCV(SVC(), param_grid, n_jobs =-1,cv=5)
GCV.fit(X_train, y_train)
print("The best parameters are: "+ str(GCV.best_params_))
print("The best accuracy is: "+str(GCV.best_score_))

In [None]:
print("The best estimator is: {}".format(GCV.best_estimator_))

I also tried other kernels ('poly', 'sigmoid' and 'precomputed'), but they took too long to run, so I only compared 'rbf' and 'linear'.

In [None]:
# Run SVC with linear kernel again to get mean_test_score
param_grid = [{'kernel': ['linear'],
               'C': [0.001, 0.01, 0.1, 1, 10, 100],
               'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
              ]
GCV = GridSearchCV(SVC(), param_grid, n_jobs =-1,cv=5)
GCV.fit(X_train, y_train)
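
A note on this grid: in scikit-learn's `SVC`, `gamma` is the coefficient of the 'rbf', 'poly' and 'sigmoid' kernels, so with `kernel='linear'` it should have no effect on the fitted model, and the cross-validation scores are expected to vary only with `C`. A quick check on synthetic data (an assumption, standing in for the wine features; `X_toy`, `y_toy` and the two model names are hypothetical):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data (assumption, for illustration only)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Same C, two very different gamma values
svc_small_gamma = SVC(kernel='linear', C=1, gamma=0.001).fit(X_toy, y_toy)
svc_large_gamma = SVC(kernel='linear', C=1, gamma=100).fit(X_toy, y_toy)

# Compare the decision values of the two fitted models on the same points
same_decision = np.allclose(svc_small_gamma.decision_function(X_toy),
                            svc_large_gamma.decision_function(X_toy))
print("Identical decision values:", same_decision)
```

If this holds, every row of a C-by-gamma heat map for the linear kernel is constant across its columns.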

In [None]:
scores = np.array(GCV.cv_results_['mean_test_score'].reshape(6, 6))
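
The `reshape(6, 6)` above relies on the order in which the grid search enumerates its candidates. A small check with `ParameterGrid`, which iterates over a grid in the order used for `cv_results_`; the variable names below are introduced for illustration:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_values = [0.001, 0.01, 0.1, 1, 10, 100]
gamma_values = [0.001, 0.01, 0.1, 1, 10, 100]
order_grid = ParameterGrid({'kernel': ['linear'], 'C': C_values, 'gamma': gamma_values})

candidates = list(order_grid)
# Keys are iterated in sorted order ('C', 'gamma', 'kernel') and later keys vary
# fastest, so gamma cycles through all its values within each value of C.
first_row = [(p['C'], p['gamma']) for p in candidates[:6]]
print(first_row)

# Reshaping a length-36 vector in this order gives rows = C, columns = gamma
index = np.arange(len(candidates)).reshape(len(C_values), len(gamma_values))
print(index.shape)
```

This is why the heat map takes `C` on the y-axis and `gamma` on the x-axis.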

In [None]:
import mglearn
# plot the mean cross-validation scores
mglearn.tools.heatmap(scores, xlabel='gamma', xticklabels=param_grid[0]['gamma'],
                      ylabel='C', yticklabels=param_grid[0]['C'], cmap="viridis")

In [None]:
from sklearn.metrics import roc_curve
%matplotlib inline
import matplotlib.pyplot as plt

svc = SVC(C=10, gamma=0.001, kernel='linear').fit(X_train, y_train)
def plot_roc_curve():
    fpr, tpr, thresholds = roc_curve(y_test, svc.decision_function(X_test))
    close_zero = np.argmin(np.abs(thresholds))
    plt.plot(fpr, tpr, label="ROC Curve SVC")
    plt.xlabel("FPR")
    plt.ylabel("TPR (recall)")
    plt.plot(fpr[close_zero], tpr[close_zero], 'o',
             label="threshold zero SVC", fillstyle="none", c='k', mew=2)
    plt.legend(loc=4)

In [None]:
plot_roc_curve()

In [None]:
from sklearn.metrics import roc_auc_score
svc_auc = roc_auc_score(y_test, svc.decision_function(X_test))
print("AUC for SVC: {:.3f}".format(svc_auc))
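
For intuition about the value reported above: AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal numpy sketch on made-up labels and scores (hypothetical values, not model output; `pairwise_auc` is a helper written for this illustration, not a library function):

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print("Toy AUC: {:.2f}".format(toy_auc))  # prints Toy AUC: 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75. The pairwise comparison needs memory proportional to the number of pairs, so `roc_auc_score` remains the practical choice for real data.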

In [None]:
# using AUC scoring instead:
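# Use the same 5-fold cross-validation as the accuracy search so the two runs are comparable
grid = GridSearchCV(SVC(), param_grid, scoring="roc_auc", n_jobs=-1, cv=5)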
grid.fit(X_train, y_train)
print("\nGrid-Search with AUC")
print("Best parameters:", grid.best_params_)
print("Best cross-validation score (AUC): {:.3f}".format(grid.best_score_))
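print("Test set AUC: {:.3f}".format(
    roc_auc_score(y_test, grid.decision_function(X_test))))
# With scoring="roc_auc", grid.score() returns AUC, so compute accuracy explicitly
from sklearn.metrics import accuracy_score
print("Test set accuracy: {:.3f}".format(
    accuracy_score(y_test, grid.predict(X_test))))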

In [None]:
# We print how many wines are white and how many wines are red.
# 1-red wine; 0-white wine.
unique, counts = np.unique(y, return_counts=True)
display(dict(zip(unique, counts)))

With AUC as the scoring metric, the result (0.992) is slightly higher than the previous one (0.989).
The printed counts of red and white wines show that this dataset is imbalanced.
For imbalanced classification problems, AUC is usually a more informative metric than accuracy, because accuracy can be high for a model that mostly predicts the majority class.
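
To make the last point concrete, here is a small sketch with a constant predictor. It assumes the class counts of the full UCI files (1599 red, 4898 white); the counts in the test split will differ, and the "model" is hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Class counts of the full dataset: 1 = red, 0 = white (assumed from the UCI files)
y_all = np.concatenate([np.ones(1599, dtype=int), np.zeros(4898, dtype=int)])

# A useless model that always predicts "white" and gives every wine the same score
always_white = np.zeros_like(y_all)
constant_scores = np.zeros(len(y_all))

baseline_accuracy = accuracy_score(y_all, always_white)
baseline_auc = roc_auc_score(y_all, constant_scores)
print("Baseline accuracy: {:.3f}".format(baseline_accuracy))  # prints 0.754
print("Baseline AUC: {:.3f}".format(baseline_auc))            # prints 0.500
```

Accuracy rewards this model for the class imbalance alone, while AUC stays at the chance level of 0.5.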