# Exercise 6

## SVM & Regularization


For this homework we consider a set of observations on a number of red and white wine varieties, covering their chemical properties and their ranking by tasters. The wine industry has seen a recent growth spurt as social drinking is on the rise. The price of wine depends in part on a rather abstract concept, wine appreciation by tasters, whose opinions can vary widely. Another key factor in wine certification and quality assessment is physicochemical testing, which is laboratory-based and takes into account factors such as acidity, pH level, sugar content and other chemical properties. For the wine market, it would be of interest if the human perception of quality could be related to the chemical properties of the wine, so that the certification, quality assessment and assurance process is more controlled.

Two datasets are available: one on red wine, with 1599 different varieties, and one on white wine, with 4898 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 different properties of the wines. One of these is Quality, based on sensory data; the rest are chemical properties, including density, acidity, alcohol content, etc. All chemical properties are continuous variables. Quality is an ordinal variable with possible rankings from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters, and the final rank assigned is the median of their ranks.

A predictive model developed on this data is expected to give vineyards guidance on the quality and price they can expect for their produce, without heavy reliance on the volatility of wine tasters.

In [16]:
import pandas as pd
import numpy as np

In [17]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')

In [18]:
# DataFrame.append was removed in pandas 2.0; pd.concat builds the same combined table
data = pd.concat([data_w.assign(type='white'), data_r.assign(type='red')], ignore_index=True)
data.sample(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
6211,7.0,0.36,0.21,2.3,0.086,20.0,65.0,0.99558,3.4,0.54,10.1,6,red
359,6.9,0.24,0.34,4.7,0.04,43.0,161.0,0.9935,3.2,0.59,10.6,6,white
1640,7.8,0.49,0.49,7.0,0.043,29.0,149.0,0.9952,3.21,0.33,10.0,5,white
2777,7.6,0.2,0.36,1.9,0.043,24.0,111.0,0.99237,3.29,0.54,11.3,6,white
6073,6.5,0.61,0.0,2.2,0.095,48.0,59.0,0.99541,3.61,0.7,11.5,6,red


# Exercise 6.1

Show the frequency table of quality by type of wine.

In [19]:
tf = pd.DataFrame(data['quality'].groupby(data['type']).value_counts())
tf

type,quality,count
red,5,681
red,6,638
red,7,199
red,4,53
red,8,18
red,3,10
white,6,2198
white,5,1457
white,7,880
white,8,175
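
As an alternative view of the same counts, a cross-tabulation puts wine types in columns and quality levels in rows. The sketch below uses a small made-up table (`toy` is a hypothetical stand-in for `data`), so the numbers are illustrative only.

```python
import pandas as pd

# Toy data standing in for the combined wine table (values are made up).
toy = pd.DataFrame({
    "type": ["red", "red", "white", "white", "white"],
    "quality": [5, 6, 5, 6, 6],
})

# Rows are quality levels, columns are wine types, cells are counts.
freq = pd.crosstab(toy["quality"], toy["type"])
print(freq)
```

On the real data, the same call would be `pd.crosstab(data['quality'], data['type'])`.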


# SVM

# Exercise 6.2

* Standardize the features (not the quality)
* Create a binary target for each type of wine
* Create two linear SVMs, for the white and red wines respectively.
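
The three steps above can also be sketched as a single scikit-learn pipeline. This is a minimal illustration on synthetic data, not the notebook's own approach: `X_demo`, `y_demo` and `svm_pipe` are hypothetical names, and the scaler here is fitted on the training split only, whereas the cells below scale the full table before splitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Binary target from an ordinal score: 1 if quality >= 7, else 0.
quality_demo = np.array([5, 6, 7, 8])
target_demo = np.where(quality_demo >= 7, 1, 0)

# Synthetic stand-in for the 11 chemical features (assumption, not wine data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# The pipeline standardizes using training statistics, then fits a linear SVM.
svm_pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
svm_pipe.fit(X_tr, y_tr)
acc = svm_pipe.score(X_te, y_te)
print(acc)
```

Keeping the scaler inside the pipeline means the test rows never influence the means and standard deviations used for scaling.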


In [20]:
from sklearn import preprocessing

for r in data_r.loc[:, data_r.columns != 'quality'].columns:
    data_r[r]=preprocessing.scale(data_r[r])

In [21]:
# Standardizing the features (every column except quality)

from sklearn import preprocessing

data_restand = pd.DataFrame(index=data_r.index)
data_westand = pd.DataFrame(index=data_w.index)

for r in data_r.loc[:, data_r.columns != 'quality'].columns:
    data_restand[r]=preprocessing.scale(data_r[r])

for w in data_w.loc[:, data_w.columns != 'quality'].columns:
    data_westand[w]=preprocessing.scale(data_w[w])
    
data_restand['quality']=data_r['quality']
data_westand['quality']=data_w['quality']

In [22]:
data_restand.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,-0.52836,0.961877,-1.391472,-0.453218,-0.243707,-0.466193,-0.379133,0.558274,1.288643,-0.579207,-0.960246,5
1,-0.298547,1.967442,-1.391472,0.043416,0.223875,0.872638,0.624363,0.028261,-0.719933,0.12895,-0.584777,5
2,-0.298547,1.297065,-1.18607,-0.169427,0.096353,-0.083669,0.229047,0.134264,-0.331177,-0.048089,-0.584777,5
3,1.654856,-1.384443,1.484154,-0.453218,-0.26496,0.107592,0.4115,0.664277,-0.979104,-0.46118,-0.584777,6
4,-0.52836,0.961877,-1.391472,-0.453218,-0.243707,-0.466193,-0.379133,0.558274,1.288643,-0.579207,-0.960246,5
5,-0.52836,0.738418,-1.391472,-0.524166,-0.26496,-0.274931,-0.196679,0.558274,1.288643,-0.579207,-0.960246,5
6,-0.241094,0.403229,-1.08337,-0.666062,-0.392483,-0.083669,0.381091,-0.183745,-0.072005,-1.169337,-0.960246,5
7,-0.585813,0.682553,-1.391472,-0.949853,-0.477498,-0.083669,-0.774449,-1.137769,0.51113,-1.110324,-0.397043,7
8,-0.298547,0.291499,-1.288771,-0.382271,-0.307468,-0.657454,-0.865676,0.028261,0.316751,-0.520193,-0.866379,7
9,-0.470907,-0.155419,0.457144,2.526589,-0.349975,0.107592,1.688677,0.558274,0.251958,0.837107,0.072294,5


In [23]:
data_r.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,-0.52836,0.961877,-1.391472,-0.453218,-0.243707,-0.466193,-0.379133,0.558274,1.288643,-0.579207,-0.960246,5
1,-0.298547,1.967442,-1.391472,0.043416,0.223875,0.872638,0.624363,0.028261,-0.719933,0.12895,-0.584777,5
2,-0.298547,1.297065,-1.18607,-0.169427,0.096353,-0.083669,0.229047,0.134264,-0.331177,-0.048089,-0.584777,5
3,1.654856,-1.384443,1.484154,-0.453218,-0.26496,0.107592,0.4115,0.664277,-0.979104,-0.46118,-0.584777,6
4,-0.52836,0.961877,-1.391472,-0.453218,-0.243707,-0.466193,-0.379133,0.558274,1.288643,-0.579207,-0.960246,5
5,-0.52836,0.738418,-1.391472,-0.524166,-0.26496,-0.274931,-0.196679,0.558274,1.288643,-0.579207,-0.960246,5
6,-0.241094,0.403229,-1.08337,-0.666062,-0.392483,-0.083669,0.381091,-0.183745,-0.072005,-1.169337,-0.960246,5
7,-0.585813,0.682553,-1.391472,-0.949853,-0.477498,-0.083669,-0.774449,-1.137769,0.51113,-1.110324,-0.397043,7
8,-0.298547,0.291499,-1.288771,-0.382271,-0.307468,-0.657454,-0.865676,0.028261,0.316751,-0.520193,-0.866379,7
9,-0.470907,-0.155419,0.457144,2.526589,-0.349975,0.107592,1.688677,0.558274,0.251958,0.837107,0.072294,5


In [24]:
data_restand['QualityBin'] = np.where(data_r['quality']>=7,1,0)
data_westand['QualityBin'] = np.where(data_w['quality']>=7,1,0)

In [25]:
data_restand.groupby('QualityBin').size()

QualityBin
0    1382
1     217
dtype: int64

In [26]:
data_westand.groupby('QualityBin').size()

QualityBin
0    3838
1    1060
dtype: int64

In [27]:
data_restand.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,QualityBin
0,-0.52836,0.961877,-1.391472,-0.453218,-0.243707,-0.466193,-0.379133,0.558274,1.288643,-0.579207,-0.960246,5,0
1,-0.298547,1.967442,-1.391472,0.043416,0.223875,0.872638,0.624363,0.028261,-0.719933,0.12895,-0.584777,5,0
2,-0.298547,1.297065,-1.18607,-0.169427,0.096353,-0.083669,0.229047,0.134264,-0.331177,-0.048089,-0.584777,5,0
3,1.654856,-1.384443,1.484154,-0.453218,-0.26496,0.107592,0.4115,0.664277,-0.979104,-0.46118,-0.584777,6,0
4,-0.52836,0.961877,-1.391472,-0.453218,-0.243707,-0.466193,-0.379133,0.558274,1.288643,-0.579207,-0.960246,5,0
5,-0.52836,0.738418,-1.391472,-0.524166,-0.26496,-0.274931,-0.196679,0.558274,1.288643,-0.579207,-0.960246,5,0
6,-0.241094,0.403229,-1.08337,-0.666062,-0.392483,-0.083669,0.381091,-0.183745,-0.072005,-1.169337,-0.960246,5,0
7,-0.585813,0.682553,-1.391472,-0.949853,-0.477498,-0.083669,-0.774449,-1.137769,0.51113,-1.110324,-0.397043,7,1
8,-0.298547,0.291499,-1.288771,-0.382271,-0.307468,-0.657454,-0.865676,0.028261,0.316751,-0.520193,-0.866379,7,1
9,-0.470907,-0.155419,0.457144,2.526589,-0.349975,0.107592,1.688677,0.558274,0.251958,0.837107,0.072294,5,0


In [28]:
# Drop variables created by mistake in an earlier run (ignored if they are absent)
data_restand = data_restand.drop(["Quality_bin", "Response"], axis=1, errors="ignore")

In [29]:
data_restand.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,QualityBin
0,-0.52836,0.961877,-1.391472,-0.453218,-0.243707,-0.466193,-0.379133,0.558274,1.288643,-0.579207,-0.960246,5,0
1,-0.298547,1.967442,-1.391472,0.043416,0.223875,0.872638,0.624363,0.028261,-0.719933,0.12895,-0.584777,5,0
2,-0.298547,1.297065,-1.18607,-0.169427,0.096353,-0.083669,0.229047,0.134264,-0.331177,-0.048089,-0.584777,5,0
3,1.654856,-1.384443,1.484154,-0.453218,-0.26496,0.107592,0.4115,0.664277,-0.979104,-0.46118,-0.584777,6,0
4,-0.52836,0.961877,-1.391472,-0.453218,-0.243707,-0.466193,-0.379133,0.558274,1.288643,-0.579207,-0.960246,5,0
5,-0.52836,0.738418,-1.391472,-0.524166,-0.26496,-0.274931,-0.196679,0.558274,1.288643,-0.579207,-0.960246,5,0
6,-0.241094,0.403229,-1.08337,-0.666062,-0.392483,-0.083669,0.381091,-0.183745,-0.072005,-1.169337,-0.960246,5,0
7,-0.585813,0.682553,-1.391472,-0.949853,-0.477498,-0.083669,-0.774449,-1.137769,0.51113,-1.110324,-0.397043,7,1
8,-0.298547,0.291499,-1.288771,-0.382271,-0.307468,-0.657454,-0.865676,0.028261,0.316751,-0.520193,-0.866379,7,1
9,-0.470907,-0.155419,0.457144,2.526589,-0.349975,0.107592,1.688677,0.558274,0.251958,0.837107,0.072294,5,0


In [30]:
data_restand['QualityBin'] = data_restand['QualityBin'].astype(bool)
data_westand['QualityBin'] = data_westand['QualityBin'].astype(bool)

In [138]:
data_restand.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,QualityBin
0,-0.52836,0.961877,-1.391472,-0.453218,-0.243707,-0.466193,-0.379133,0.558274,1.288643,-0.579207,-0.960246,5,False
1,-0.298547,1.967442,-1.391472,0.043416,0.223875,0.872638,0.624363,0.028261,-0.719933,0.12895,-0.584777,5,False
2,-0.298547,1.297065,-1.18607,-0.169427,0.096353,-0.083669,0.229047,0.134264,-0.331177,-0.048089,-0.584777,5,False
3,1.654856,-1.384443,1.484154,-0.453218,-0.26496,0.107592,0.4115,0.664277,-0.979104,-0.46118,-0.584777,6,False
4,-0.52836,0.961877,-1.391472,-0.453218,-0.243707,-0.466193,-0.379133,0.558274,1.288643,-0.579207,-0.960246,5,False
5,-0.52836,0.738418,-1.391472,-0.524166,-0.26496,-0.274931,-0.196679,0.558274,1.288643,-0.579207,-0.960246,5,False
6,-0.241094,0.403229,-1.08337,-0.666062,-0.392483,-0.083669,0.381091,-0.183745,-0.072005,-1.169337,-0.960246,5,False
7,-0.585813,0.682553,-1.391472,-0.949853,-0.477498,-0.083669,-0.774449,-1.137769,0.51113,-1.110324,-0.397043,7,True
8,-0.298547,0.291499,-1.288771,-0.382271,-0.307468,-0.657454,-0.865676,0.028261,0.316751,-0.520193,-0.866379,7,True
9,-0.470907,-0.155419,0.457144,2.526589,-0.349975,0.107592,1.688677,0.558274,0.251958,0.837107,0.072294,5,False


In [140]:
# Build the mask from data_restand's own columns so its length matches
Xr = data_restand.loc[:, ~data_restand.columns.isin(['quality', 'QualityBin'])]
Yr = data_restand['QualityBin']

In [143]:
# split into training and testing sets
from sklearn.model_selection import train_test_split
Xr_train, Xr_test, Yr_train, Yr_test= train_test_split(Xr, Yr, random_state=1)

In [144]:
from sklearn.svm import SVC
clf_red = SVC(kernel='linear')
clf_red.fit(Xr_train, Yr_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [145]:
clf_red.coef_

array([[-1.19327504e-04, -4.42206314e-04,  4.75679170e-05,
        -5.47128269e-06, -1.47953495e-04,  1.84500758e-04,
        -4.27528793e-04,  2.28049730e-05, -2.10216798e-04,
         2.72930422e-04,  4.95445975e-04]])

In [148]:
y_r_pred_lin = clf_red.predict(Xr_test)

In [149]:
from sklearn.metrics import accuracy_score
accuracy_score(Yr_test, y_r_pred_lin)

0.8875

In [151]:
#Defining X, Y
# Build the mask from data_westand's own columns so its length matches
Xw = data_westand.loc[:, ~data_westand.columns.isin(['quality', 'QualityBin'])]
Yw = data_westand['QualityBin']

In [153]:
# split into training and testing sets
from sklearn.model_selection import train_test_split
Xw_train, Xw_test, Yw_train, Yw_test= train_test_split(Xw, Yw, random_state=1)

In [154]:
clf_white = SVC(kernel='linear')
clf_white.fit(Xw_train, Yw_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [155]:
clf_white.coef_

array([[ 1.06621034e-04, -5.04326021e-05, -7.67509805e-06,
         4.44633145e-04, -4.79234273e-05,  4.99520682e-05,
         5.29716954e-05, -6.59186565e-04,  9.77741297e-05,
         9.17373032e-05, -1.26354311e-04]])

In [156]:
y_w_pred_lin = clf_white.predict(Xw_test)

In [157]:
accuracy_score(Yw_test, y_w_pred_lin)

0.7763265306122449

# Exercise 6.3

Test the two SVMs using different kernels ('poly', 'rbf', 'sigmoid').
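
The kernel comparison can be written as one loop instead of one cell per kernel. A minimal sketch on synthetic two-class data follows; the dataset, the `scores` dictionary and the `gamma="scale"` setting are assumptions for illustration and do not reproduce the wine results.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic non-linear problem (assumption, not the wine data).
X_demo, y_demo = make_moons(n_samples=300, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    # Same preprocessing for every kernel so only the kernel changes.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, gamma="scale"))
    model.fit(X_tr, y_tr)
    scores[kernel] = accuracy_score(y_te, model.predict(X_te))

for kernel, score in scores.items():
    print(f"{kernel}: {score:.3f}")
```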


# SVMs for red wine

In [200]:
#poly to red
clf_redp = SVC(kernel='poly')
clf_redp.fit(Xr_train, Yr_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='poly', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [184]:
y_r_pred_poly = clf_redp.predict(Xr_test)
print(f"Accuracy score for kernel 'poly': {accuracy_score(Yr_test, y_r_pred_poly)}")

Accuracy score for kernel 'poly': 0.9025


In [185]:
#rbf to red
clf_redr = SVC(kernel='rbf')
clf_redr.fit(Xr_train, Yr_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [186]:
y_r_pred_rbf = clf_redr.predict(Xr_test)
print(f"Accuracy score for kernel 'rbf': {accuracy_score(Yr_test, y_r_pred_rbf)}")

Accuracy score for kernel 'rbf': 0.9


In [188]:
#Sigmoid to red
clf_reds = SVC(kernel='sigmoid')
clf_reds.fit(Xr_train, Yr_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='sigmoid', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [201]:
y_r_pred_sig = clf_reds.predict(Xr_test)
print(f"Accuracy score for kernel 'sigmoid': {accuracy_score(Yr_test, y_r_pred_sig)}")

Accuracy score for kernel 'sigmoid': 0.8475


# SVMs for white wine

In [193]:
#Poly to white
clf_wp = SVC(kernel='poly')
clf_wp.fit(Xw_train, Yw_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='poly', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [195]:
y_w_pred_poly = clf_wp.predict(Xw_test)
print(f"Accuracy score for kernel 'poly': {accuracy_score(Yw_test, y_w_pred_poly)}")

Accuracy score for kernel 'poly': 0.8024489795918367


In [196]:
#rbf to white
clf_wr = SVC(kernel='rbf')
clf_wr.fit(Xw_train, Yw_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [197]:
y_w_pred_rbf = clf_wr.predict(Xw_test)
print(f"Accuracy score for kernel 'rbf': {accuracy_score(Yw_test, y_w_pred_rbf)}")

Accuracy score for kernel 'rbf': 0.8244897959183674


In [198]:
#Sigmoid to white
clf_ws = SVC(kernel='sigmoid')
clf_ws.fit(Xw_train, Yw_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='sigmoid', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [203]:
y_w_pred_sig = clf_ws.predict(Xw_test)
print(f"Accuracy score for kernel 'sigmoid': {accuracy_score(Yw_test, y_w_pred_sig)}")

Accuracy score for kernel 'sigmoid': 0.7355102040816327


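Based on these results, RBF is the best kernel for white wine (0.824). For red wine, poly (0.9025) and RBF (0.9) are practically tied, so RBF is used for both wines in the next exercise.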

# Exercise 6.4
Using the best SVM, find the parameters that give the best performance.

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]
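
The same grid can be searched with cross-validation instead of a single test split, which avoids choosing parameters on the data later used to report accuracy. The sketch below is one possible setup on synthetic data; `search`, the 5-fold setting and the scaler step are assumptions, not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the standardized wine features (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# make_pipeline names the SVC step "svc", hence the "svc__" prefixes.
param_grid = {
    "svc__C": [0.1, 1, 10, 100, 1000],
    "svc__gamma": [0.01, 0.001, 0.0001],
}
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # accuracy of the refitted best model
```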

In [204]:
for c in [0.1, 1, 10, 100, 1000]:
    for g in [0.01, 0.001, 0.0001]:
        clf_rbf_cg = SVC(kernel='rbf', C=c, gamma=g)
        clf_rbf_cg.fit(Xr_train, Yr_train)
        y_r_pred_rbf_cg = clf_rbf_cg.predict(Xr_test)
        print(f"Accuracy score for kernel 'rbf' with C = {c} and gamma = {g}: {accuracy_score(Yr_test, y_r_pred_rbf_cg)}")
        print('-'*50)

Accuracy score for kernel 'rbf' with C = 0.1 and gamma = 0.01: 0.8875
--------------------------------------------------
Accuracy score for kernel 'rbf' with C = 0.1 and gamma = 0.001: 0.8875
--------------------------------------------------
Accuracy score for kernel 'rbf' with C = 0.1 and gamma = 0.0001: 0.8875
--------------------------------------------------
Accuracy score for kernel 'rbf' with C = 1 and gamma = 0.01: 0.8875
--------------------------------------------------
Accuracy score for kernel 'rbf' with C = 1 and gamma = 0.001: 0.8875
--------------------------------------------------
Accuracy score for kernel 'rbf' with C = 1 and gamma = 0.0001: 0.8875
--------------------------------------------------
Accuracy score for kernel 'rbf' with C = 10 and gamma = 0.01: 0.905
--------------------------------------------------
Accuracy score for kernel 'rbf' with

In [205]:
for c in [0.1, 1, 10, 100, 1000]:
    for g in [0.01, 0.001, 0.0001]:
        clf_rbf_cg = SVC(kernel='rbf', C=c, gamma=g)
        clf_rbf_cg.fit(Xw_train, Yw_train)
        y_w_pred_rbf_cg = clf_rbf_cg.predict(Xw_test)
        print(f"Accuracy score for kernel 'rbf' with C = {c} and gamma = {g}: {accuracy_score(Yw_test, y_w_pred_rbf_cg)}")
        print('-'*50)

Accuracy score for kernel 'rbf' with C = 0.1 and gamma = 0.01: 0.7763265306122449
--------------------------------------------------
Accuracy score for kernel 'rbf' with C = 0.1 and gamma = 0.001: 0.7763265306122449
--------------------------------------------------
Accuracy score for kernel 'rbf' with C = 0.1 and gamma = 0.0001: 0.7763265306122449
--------------------------------------------------
Accuracy score for kernel 'rbf' with C = 1 and gamma = 0.01: 0.7828571428571428
--------------------------------------------------
Accuracy score for kernel 'rbf' with C = 1 and gamma = 0.001: 0.7763265306122449
--------------------------------------------------
Accuracy score for kernel 'rbf' with C = 1 and gamma = 0.0001: 0.7763265306122449
--------------------------------------------------
Accuracy score for kernel 'rbf' with C = 10 and gamma = 0.01: 0.8122448979591836
--

# The parameters that give the best performance are:
# Red wine: C = 1000, gamma = 0.001
# White wine: C = 100, gamma = 0.01

# Exercise 6.5

Compare the results with other methods
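
Because the binary target is imbalanced, one more "other method" worth having in the comparison is a majority-class baseline. The sketch below is an illustration on synthetic imbalanced data; the dataset, `baseline` and the 85/15 class split are assumptions and not taken from the wine tables.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 85% negatives (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=1
)

# Always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
base_pred = baseline.predict(X_te)

base_acc = accuracy_score(y_te, base_pred)
base_f1 = f1_score(y_te, base_pred, zero_division=0)
print(f"baseline accuracy: {base_acc:.3f}, baseline F1: {base_f1:.3f}")
```

A model whose accuracy is close to this baseline has learned little beyond the class frequencies, even when the accuracy itself looks high.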

In [215]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9, solver='liblinear', multi_class='ovr')
logreg_l1 = LogisticRegression(C=0.1, penalty='l1', solver='liblinear', multi_class='ovr')
logreg_l2 = LogisticRegression(C=0.1, penalty='l2', solver='liblinear', multi_class='ovr')

# To Red Wine

In [210]:
logreg.fit(Xr_train, Yr_train)
logreg_l1.fit(Xr_train, Yr_train)
logreg_l2.fit(Xr_train, Yr_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [211]:
y_r_pred_logreg = logreg.predict(Xr_test)
y_r_pred_logreg_l1 = logreg_l1.predict(Xr_test)
y_r_pred_logreg_l2 = logreg_l2.predict(Xr_test)

In [224]:
print(f"Accuracy Score - Logistic Regression: {accuracy_score(Yr_test, y_r_pred_logreg)}")
print(f"Accuracy Score - Logistic Regression (L1, lasso): {accuracy_score(Yr_test, y_r_pred_logreg_l1)}")
print(f"Accuracy Score - Logistic Regression (L2, ridge): {accuracy_score(Yr_test, y_r_pred_logreg_l2)}")

Accuracy Score - Logistic Regression: 0.8825
Accuracy Score - Logistic Regression (L1, lasso): 0.895
Accuracy Score - Logistic Regression (L2, ridge): 0.885


Compared with the logistic regression models above (best accuracy 0.895), the RBF-kernel SVM with C = 1000 and gamma = 0.001 is the best method for red wine, with an accuracy of about 0.9.

# To white wine

In [218]:
logreg.fit(Xw_train, Yw_train)
logreg_l1.fit(Xw_train, Yw_train)
logreg_l2.fit(Xw_train, Yw_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [220]:
y_w_pred_logreg = logreg.predict(Xw_test)
y_w_pred_logreg_l1 = logreg_l1.predict(Xw_test)
y_w_pred_logreg_l2 = logreg_l2.predict(Xw_test)

In [226]:
print(f"Accuracy Score - Logistic Regression: {accuracy_score(Yw_test, y_w_pred_logreg)}")
print(f"Accuracy Score - Logistic Regression (L1, lasso): {accuracy_score(Yw_test, y_w_pred_logreg_l1)}")
print(f"Accuracy Score - Logistic Regression (L2, ridge): {accuracy_score(Yw_test, y_w_pred_logreg_l2)}")

Accuracy Score - Logistic Regression: 0.7951020408163265
Accuracy Score - Logistic Regression (L1, lasso): 0.7942857142857143
Accuracy Score - Logistic Regression (L2, ridge): 0.7918367346938775


Compared with the logistic regression models above (best accuracy 0.795), the RBF-kernel SVM with C = 100 and gamma = 0.01 is the best method for white wine, with an accuracy of about 0.82.

# Regularization

# Exercise 6.6


* Train a linear regression to predict wine quality (continuous)

* Analyze the coefficients

* Evaluate the RMSE
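
For reference, the RMSE is the square root of the mean squared difference between observed and predicted values, so it is expressed in the same units as quality. A small worked example with made-up numbers (`y_true` and `y_pred` are hypothetical):

```python
import numpy as np

# Made-up observed and predicted quality scores.
y_true = np.array([5.0, 6.0, 7.0, 5.0])
y_pred = np.array([5.5, 6.0, 6.0, 5.0])

# Squared errors are 0.25, 0, 1, 0; their mean is the MSE.
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
print(mse, rmse)
```

Here the MSE is 0.3125 and the RMSE is its square root, roughly 0.56 quality points.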

In [39]:
dataRegul = data.copy()  # copy, so columns added below do not modify data

In [40]:
dataRegul['type01']=dataRegul['type']
dataRegul['type01']=np.where(dataRegul['type01']=='white',0,1)
dataRegul['type01']=dataRegul['type01'].astype(bool)
dataRegul.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,type01
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white,False
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white,False
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white,False
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,False
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,False
5,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white,False
6,6.2,0.32,0.16,7.0,0.045,30.0,136.0,0.9949,3.18,0.47,9.6,6,white,False
7,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white,False
8,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white,False
9,8.1,0.22,0.43,1.5,0.044,28.0,129.0,0.9938,3.22,0.45,11.0,6,white,False


In [41]:
Y=dataRegul['quality']
X=dataRegul[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide','total sulfur dioxide', 'density', 'pH', 'sulphates','alcohol']]
#X=dataRegul.drop('quality', 'type', axis=1)

In [42]:
from sklearn.model_selection import train_test_split
test_size = 0.30
seed = 1
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,test_size=test_size, random_state=seed)

In [43]:
#Training a linear regression to predict wine quality (continuous)
from sklearn.linear_model import LinearRegression

lreg = LinearRegression()
lreg.fit(X_train, Y_train)
Est = lreg.predict(X_test)

In [44]:
lreg.coef_

array([ 8.17222598e-02, -1.35545890e+00, -1.85481405e-01,  4.64447086e-02,
       -4.62071948e-01,  5.05655172e-03, -2.36326899e-03, -5.97396097e+01,
        4.96358747e-01,  7.51750905e-01,  2.60259722e-01])

In [45]:
Vars=np.array(list(X))
coeficientes=lreg.coef_
result=pd.DataFrame(coeficientes, index=Vars)
result.rename(columns={list(result)[0]:'Coeficientes'}, inplace=True)
result

Unnamed: 0,Coeficientes
fixed acidity,0.081722
volatile acidity,-1.355459
citric acid,-0.185481
residual sugar,0.046445
chlorides,-0.462072
free sulfur dioxide,0.005057
total sulfur dioxide,-0.002363
density,-59.73961
pH,0.496359
sulphates,0.751751


In [49]:
from sklearn import metrics
print('MSE:',metrics.mean_squared_error(Y_test,Est))
print('RMSE:', np.sqrt(metrics.mean_squared_error(Y_test,Est)))

MSE: 0.5172435447318322
RMSE: 0.719196457674697


# Exercise 6.7

* Estimate a ridge regression with alpha equals 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE
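
Ridge regression adds an L2 penalty on the coefficients, so the length of the coefficient vector shrinks as alpha grows. The sketch below illustrates that on synthetic data. It scales the features with `StandardScaler` rather than the `normalize=True` argument used in the cells below, which was removed in scikit-learn 1.2; the two scalings are not numerically equivalent, so the alpha values are not directly comparable. All names here are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (assumption, not the wine data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Unpenalized least squares as the reference point.
ols_norm = np.linalg.norm(LinearRegression().fit(X_demo, y_demo).coef_)

# L2 norm of the ridge coefficients for increasing penalties.
norms = {}
for alpha in [0.1, 1.0, 100.0]:
    norms[alpha] = np.linalg.norm(Ridge(alpha=alpha).fit(X_demo, y_demo).coef_)

print("OLS:", ols_norm)
for alpha, norm in norms.items():
    print(f"alpha={alpha}: {norm}")
```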

In [50]:
#Estimating a ridge regression with alpha equals 0 to confirm it's same as linear regression and evaluating the RMSE
from sklearn.linear_model import Ridge
ridgereg = Ridge(alpha=0, normalize=True)
ridgereg.fit(X_train, Y_train)
Est1 = ridgereg.predict(X_test)
print('MSE:',metrics.mean_squared_error(Y_test,Est1))
print('RMSE:', np.sqrt(metrics.mean_squared_error(Y_test,Est1)))

MSE: 0.5172435447318321
RMSE: 0.7191964576746969


In [51]:
#Comparing the coefficients
print(ridgereg.coef_)
Vars1=np.array(list(X))
coeficientes1=ridgereg.coef_
# Name the column directly; renaming via list(result)[0] left it as 0
result1=pd.DataFrame(coeficientes1, index=Vars1, columns=['Coeficientes'])
result1

[ 8.17222598e-02 -1.35545890e+00 -1.85481405e-01  4.64447086e-02
 -4.62071948e-01  5.05655172e-03 -2.36326899e-03 -5.97396097e+01
  4.96358747e-01  7.51750905e-01  2.60259722e-01]


Unnamed: 0,Coeficientes
fixed acidity,0.081722
volatile acidity,-1.355459
citric acid,-0.185481
residual sugar,0.046445
chlorides,-0.462072
free sulfur dioxide,0.005057
total sulfur dioxide,-0.002363
density,-59.73961
pH,0.496359
sulphates,0.751751


In [52]:
#Estimating a ridge regression with alpha equals 0.1 and evaluating the RMSE
from sklearn.linear_model import Ridge
ridgereg2 = Ridge(alpha=0.1, normalize=True)
ridgereg2.fit(X_train, Y_train)
Est2 = ridgereg2.predict(X_test)
print('MSE:',metrics.mean_squared_error(Y_test,Est2))
print('RMSE:', np.sqrt(metrics.mean_squared_error(Y_test,Est2)))

MSE: 0.5205394958344858
RMSE: 0.7214842311752113


In [53]:
#Comparing the coefficients
print(ridgereg2.coef_)

[ 3.55608171e-02 -1.19532434e+00 -3.81606811e-02  2.72233554e-02
 -1.00443963e+00  3.83206663e-03 -1.73947247e-03 -3.06879459e+01
  2.78266499e-01  6.30644922e-01  2.57460816e-01]


In [54]:
Vars2=np.array(list(X))
coeficientes2=ridgereg2.coef_
# Name the column directly; renaming via list(result)[0] left it as 0
result2=pd.DataFrame(coeficientes2, index=Vars2, columns=['Coeficientes'])
result2

Unnamed: 0,Coeficientes
fixed acidity,0.035561
volatile acidity,-1.195324
citric acid,-0.038161
residual sugar,0.027223
chlorides,-1.00444
free sulfur dioxide,0.003832
total sulfur dioxide,-0.001739
density,-30.687946
pH,0.278266
sulphates,0.630645


In [55]:
#Estimating a ridge regression with alpha equals 1 and evaluating the RMSE
from sklearn.linear_model import Ridge
ridgereg3 = Ridge(alpha=1, normalize=True)
ridgereg3.fit(X_train, Y_train)
Est3 = ridgereg3.predict(X_test)
print('MSE:',metrics.mean_squared_error(Y_test,Est3))
print('RMSE:', np.sqrt(metrics.mean_squared_error(Y_test,Est3)))

MSE: 0.5751435696488125
RMSE: 0.7583822055196262


In [56]:
#Comparing the coefficients
print(ridgereg3.coef_)
Vars3=np.array(list(X))
coeficientes3=ridgereg3.coef_
# Name the column directly; renaming via list(result)[0] left it as 0
result3=pd.DataFrame(coeficientes3, index=Vars3, columns=['Coeficientes'])
result3

[ 1.16682699e-03 -5.93480545e-01  1.56313446e-01  5.75654646e-03
 -1.29437958e+00  1.22060477e-03 -5.93278420e-04 -2.23808275e+01
  9.61261785e-02  2.91604701e-01  1.37938675e-01]


Unnamed: 0,Coeficientes
fixed acidity,0.001167
volatile acidity,-0.593481
citric acid,0.156313
residual sugar,0.005757
chlorides,-1.29438
free sulfur dioxide,0.001221
total sulfur dioxide,-0.000593
density,-22.380828
pH,0.096126
sulphates,0.291605


# Exercise 6.8

* Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE
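
Lasso uses an L1 penalty, which sets coefficients exactly to zero once alpha is large enough relative to the scale of the features. The sketch below counts non-zero coefficients over a range of alphas on synthetic, standardized data. It is an illustration under assumptions (`nonzero` and the dataset are hypothetical), and it uses `StandardScaler` instead of `normalize=True`, which likely makes the penalty much stronger in effect for the same alpha.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic problem with 3 informative features out of 10 (assumption).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Number of coefficients the lasso keeps at each penalty strength.
nonzero = {}
for alpha in [0.01, 0.1, 1.0, 1000.0]:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_demo, y_demo)
    nonzero[alpha] = int(np.sum(lasso.coef_ != 0))

for alpha, count in nonzero.items():
    print(f"alpha={alpha}: {count} non-zero coefficients")
```

When every coefficient is zero, the model predicts the training mean for all rows, so the RMSE stops changing with alpha.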

In [57]:
#Estimating a lasso regression with alpha equals 0.01 and evaluating the RMSE
from sklearn.linear_model import Lasso
lassoreg = Lasso(alpha=0.01, normalize=True)
lassoreg.fit(X_train, Y_train)
EstLasso = lassoreg.predict(X_test)
print('MSE:',metrics.mean_squared_error(Y_test,EstLasso))
print('RMSE:', np.sqrt(metrics.mean_squared_error(Y_test,EstLasso)))

MSE: 0.7469306367883662
RMSE: 0.8642514893179915


In [58]:
#Comparing the coefficients
print(lassoreg.coef_)
VarsLa=np.array(list(X))
coeficientesLa=lassoreg.coef_
# Name the column directly; renaming via list(result)[0] left it as 0
resultLa=pd.DataFrame(coeficientesLa, index=VarsLa, columns=['Coeficientes'])
resultLa

[-0. -0.  0. -0. -0.  0. -0. -0.  0.  0.  0.]


Unnamed: 0,Coeficientes
fixed acidity,-0.0
volatile acidity,-0.0
citric acid,0.0
residual sugar,-0.0
chlorides,-0.0
free sulfur dioxide,0.0
total sulfur dioxide,-0.0
density,-0.0
pH,0.0
sulphates,0.0


In [59]:
#Estimating a lasso regression with alpha equals 0.1 and evaluating the RMSE
from sklearn.linear_model import Lasso
lassoreg1 = Lasso(alpha=0.1, normalize=True)
lassoreg1.fit(X_train, Y_train)
EstLasso1 = lassoreg1.predict(X_test)
print('MSE:',metrics.mean_squared_error(Y_test,EstLasso1))
print('RMSE:', np.sqrt(metrics.mean_squared_error(Y_test,EstLasso1)))

MSE: 0.7469306367883662
RMSE: 0.8642514893179915


In [60]:
#Comparing the coefficients
print(lassoreg1.coef_)
VarsLa1=np.array(list(X))
coeficientesLa1=lassoreg1.coef_
# Name the column directly; renaming via list(result)[0] left it as 0
resultLa1=pd.DataFrame(coeficientesLa1, index=VarsLa1, columns=['Coeficientes'])
resultLa1

[-0. -0.  0. -0. -0.  0. -0. -0.  0.  0.  0.]


Unnamed: 0,Coeficientes
fixed acidity,-0.0
volatile acidity,-0.0
citric acid,0.0
residual sugar,-0.0
chlorides,-0.0
free sulfur dioxide,0.0
total sulfur dioxide,-0.0
density,-0.0
pH,0.0
sulphates,0.0


In [61]:
#Estimating a lasso regression with alpha equals 1 and evaluating the RMSE
lassoreg2 = Lasso(alpha=1, normalize=True)
lassoreg2.fit(X_train, Y_train)
EstLasso2 = lassoreg2.predict(X_test)
print('MSE:',metrics.mean_squared_error(Y_test,EstLasso2))
print('RMSE:', np.sqrt(metrics.mean_squared_error(Y_test,EstLasso2)))

MSE: 0.7469306367883662
RMSE: 0.8642514893179915


In [62]:
#Comparing the coefficients
print(lassoreg2.coef_)
VarsLa2=np.array(list(X))
coeficientesLa2=lassoreg2.coef_
# Name the column directly; renaming via list(result)[0] left it as 0
resultLa2=pd.DataFrame(coeficientesLa2, index=VarsLa2, columns=['Coeficientes'])
resultLa2

[-0. -0.  0. -0. -0.  0. -0. -0.  0.  0.  0.]


Unnamed: 0,Coeficientes
fixed acidity,-0.0
volatile acidity,-0.0
citric acid,0.0
residual sugar,-0.0
chlorides,-0.0
free sulfur dioxide,0.0
total sulfur dioxide,-0.0
density,-0.0
pH,0.0
sulphates,0.0


# Exercise 6.9

* Create a binary target

* Train a logistic regression to predict wine quality (binary)

* Analyze the coefficients

* Evaluate the F1 score
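
The F1 score is the harmonic mean of precision and recall for the positive class, which makes it more informative than accuracy when few wines are labelled high quality. A minimal sketch on synthetic imbalanced data follows; the dataset, `clf_demo` and the 80/20 class split are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced problem: roughly 20% positives (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=1
)

clf_demo = make_pipeline(StandardScaler(), LogisticRegression())
clf_demo.fit(X_tr, y_tr)
pred = clf_demo.predict(X_te)

precision = precision_score(y_te, pred, zero_division=0)
recall = recall_score(y_te, pred, zero_division=0)
f1 = f1_score(y_te, pred, zero_division=0)

# F1 recomputed by hand as the harmonic mean of precision and recall.
denom = precision + recall
manual_f1 = 2 * precision * recall / denom if denom > 0 else 0.0
print(f1, manual_f1)
```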

In [76]:
#Creating a binary target
dataRegul['Quality2'] = np.where(dataRegul['quality']>=7,1,0)
dataRegul.head(20)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,type01,Quality2
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white,False,0
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white,False,0
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white,False,0
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,False,0
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,False,0
5,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white,False,0
6,6.2,0.32,0.16,7.0,0.045,30.0,136.0,0.9949,3.18,0.47,9.6,6,white,False,0
7,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white,False,0
8,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white,False,0
9,8.1,0.22,0.43,1.5,0.044,28.0,129.0,0.9938,3.22,0.45,11.0,6,white,False,0


In [77]:
dataRegul['Quality3']=dataRegul['Quality2'].astype(bool)
dataRegul.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,type01,Quality2,Quality3
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white,False,0,False
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white,False,0,False
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white,False,0,False
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,False,0,False
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,False,0,False
5,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white,False,0,False
6,6.2,0.32,0.16,7.0,0.045,30.0,136.0,0.9949,3.18,0.47,9.6,6,white,False,0,False
7,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white,False,0,False
8,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white,False,0,False
9,8.1,0.22,0.43,1.5,0.044,28.0,129.0,0.9938,3.22,0.45,11.0,6,white,False,0,False


In [78]:
# defining X and Y
X1=dataRegul[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide','total sulfur dioxide', 'density', 'pH', 'sulphates','alcohol']]
Y1=dataRegul['Quality3']

In [79]:
# split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test= train_test_split(X1, Y1, random_state=1)

In [80]:
#Training a logistic regression to predict wine quality (binary)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9,solver='liblinear',multi_class='auto')
logreg.fit(X_train, Y_train)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='auto', n_jobs=None, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [81]:
# Evaluating the coefficients
print(logreg.coef_)

[[ 1.85595398e-01 -3.95536715e+00 -5.68910242e-01  6.62701060e-02
  -1.19159675e+01  1.15718162e-02 -5.00798606e-03 -8.12200130e+00
   1.19433226e+00  1.69222128e+00  8.74781809e-01]]


In [89]:
# generate predicted class labels (predict returns labels, not probabilities)
Y_pred_prob = logreg.predict(X_test)
print(Y_pred_prob)

[False False False ...  True False False]


In [90]:
# calculate log loss on the hard class labels (not on predicted probabilities)
print(metrics.log_loss(Y_test, Y_pred_prob))

6.503941388994924
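
The log loss above is computed from the output of `predict`, which holds hard class labels, so each misclassified wine contributes a very large penalty and the value mostly tracks the error rate. Log loss is normally evaluated on predicted probabilities from `predict_proba`. The sketch below contrasts the two on synthetic data from `make_classification`; the data and the names `proba_pos`, `loss_from_proba` and `loss_from_labels` are assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the binary quality target (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=11,
                                     weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

clf = LogisticRegression(solver='liblinear')
clf.fit(X_tr, y_tr)

# Probability of the positive class for each test observation
proba_pos = clf.predict_proba(X_te)[:, 1]
# Hard 0/1 labels, as used in the cell above
labels = clf.predict(X_te)

loss_from_proba = log_loss(y_te, proba_pos)
loss_from_labels = log_loss(y_te, labels.astype(float))
print('log loss from probabilities:', loss_from_proba)
print('log loss from hard labels:  ', loss_from_labels)
```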


In [91]:
#Evaluating the F1Score
from sklearn.metrics import f1_score
f1_score(Y_test,Y_pred_prob, average='macro')

0.6139531200506811
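
With `average='macro'`, the reported value is the unweighted mean of the F1 scores of the two classes, not the F1 of the positive (high quality) class alone. The small worked example below, on made-up labels, shows the difference; the arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels: 6 negatives and 4 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

f1_per_class = f1_score(y_true, y_pred, average=None)   # one score per class
f1_macro = f1_score(y_true, y_pred, average='macro')    # mean of the two
f1_positive = f1_score(y_true, y_pred)                  # positive class only

print(f1_per_class, f1_macro, f1_positive)
```

Because the majority class is usually predicted well, the macro average can sit well above the F1 of the positive class when the target is imbalanced.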

# Exercise 6.10

* Estimate a regularized logistic regression using:
    * C = 0.01, 0.1 & 1.0
    * penalty = ['l1', 'l2']
* Compare the coefficients and the F1 score

In [94]:
# standardize X_train and X_test
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = X_train.astype(float)
X_test = X_test.astype(float)
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
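
The scaler above is fitted on the training set only and then applied to both sets, which keeps test information out of the fit. One way to bundle that pattern, sketched here under the assumption that a scikit-learn `Pipeline` is acceptable for the exercise, is `make_pipeline`; the synthetic data and the name `pipe` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# fit() learns the scaling from X_tr only; predict() reuses it on X_te
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=0.1, solver='liblinear'))
pipe.fit(X_tr, y_tr)
pred = pipe.predict(X_te)
print(pred[:10])
```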

In [100]:
# try C=0.01 with L1 penalty
logregre = LogisticRegression(C=0.01, penalty='l1',solver='liblinear',multi_class='auto')
logregre.fit(X_train_scaled, Y_train)
print(logregre.coef_)

[[ 0.         -0.26204351  0.          0.          0.          0.
   0.          0.          0.          0.01821347  0.77119408]]


In [114]:
# predict class labels and calculate log loss on them
Y_pred_prob0 = logregre.predict(X_test_scaled)
print(metrics.log_loss(Y_test, Y_pred_prob0))

6.631455893141906


In [115]:
f1_score(Y_test,Y_pred_prob0, average='macro')

0.5571650176085863

In [116]:
# try C=0.01 with L2 penalty
logregre1 = LogisticRegression(C=0.01, penalty='l2',solver='liblinear',multi_class='auto')
logregre1.fit(X_train_scaled, Y_train)
print(logregre1.coef_)

[[ 0.16983683 -0.33357089  0.02004888  0.2713157  -0.2104919   0.11264497
  -0.17296422 -0.27062207  0.13531634  0.20690045  0.65108468]]


In [117]:
# predict class labels and calculate log loss on them
Y_pred_prob1 = logregre1.predict(X_test_scaled)
print(metrics.log_loss(Y_test, Y_pred_prob1))

6.376409170689485


In [118]:
#Evaluating the F1Score
f1_score(Y_test,Y_pred_prob1, average='macro')

0.6130952380952381

In [119]:
# try C=0.1 with L1 penalty
logregre2 = LogisticRegression(C=0.1, penalty='l1',solver='liblinear',multi_class='auto')
logregre2.fit(X_train_scaled, Y_train)
print(logregre2.coef_)

[[ 0.29466619 -0.51679475 -0.01055185  0.45704498 -0.27267619  0.1419318
  -0.23170325 -0.3664849   0.22167297  0.2722101   0.83707606]]


In [120]:
# predict class labels and calculate log loss on them
Y_pred_prob2 = logregre2.predict(X_test_scaled)
print(metrics.log_loss(Y_test, Y_pred_prob2))

6.397666262616908


In [121]:
#Evaluating the F1Score
f1_score(Y_test,Y_pred_prob2, average='macro')

0.6181899647872207

In [122]:
# try C=0.1 with L2 penalty
logregre3 = LogisticRegression(C=0.1, penalty='l2',solver='liblinear',multi_class='auto')
logregre3.fit(X_train_scaled, Y_train)
print(logregre3.coef_)

[[ 0.4375043  -0.51431614 -0.04996592  0.62614408 -0.28064981  0.18325519
  -0.30025829 -0.60993186  0.30447318  0.30950057  0.73280542]]


In [123]:
# predict class labels and calculate log loss on them
Y_pred_prob3 = logregre3.predict(X_test_scaled)
print(metrics.log_loss(Y_test, Y_pred_prob3))

6.270139949030955


In [124]:
#Evaluating the F1Score
f1_score(Y_test,Y_pred_prob3, average='macro')

0.6337598717710704

In [125]:
# try C=1 with L1 penalty
logregre4 = LogisticRegression(C=1, penalty='l1',solver='liblinear',multi_class='auto')
logregre4.fit(X_train_scaled, Y_train)
print(logregre4.coef_)

[[ 0.56089238 -0.53374281 -0.06465129  0.79459032 -0.25915555  0.19339172
  -0.33296146 -0.83924961  0.3802626   0.34412044  0.67654706]]


In [126]:
# predict class labels and calculate log loss on them
Y_pred_prob4 = logregre4.predict(X_test_scaled)
print(metrics.log_loss(Y_test, Y_pred_prob4))

6.312651672586015


In [127]:
#Evaluating the F1Score
f1_score(Y_test,Y_pred_prob4, average='macro')

0.6363883617153203

In [128]:
# try C=1 with L2 penalty
logregre5 = LogisticRegression(C=1, penalty='l2',solver='liblinear',multi_class='auto')
logregre5.fit(X_train_scaled, Y_train)
print(logregre5.coef_)

[[ 0.56984526 -0.53471561 -0.06863037  0.80379798 -0.26259165  0.19763252
  -0.33905249 -0.85146694  0.38523459  0.34658632  0.67127941]]


In [129]:
# predict class labels and calculate log loss on them
Y_pred_prob5 = logregre5.predict(X_test_scaled)
print(metrics.log_loss(Y_test, Y_pred_prob5))

6.312651672586015


In [130]:
#Evaluating the F1Score
f1_score(Y_test,Y_pred_prob5, average='macro')

0.6363883617153203
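
In the runs above, the recorded macro F1 ranges from about 0.557 (C=0.01, L1, where only three coefficients remain non-zero) to about 0.636 (C=1, both penalties). The six fits can also be collected in one loop so that the coefficients and scores sit side by side. The sketch below does this on synthetic data; the data, the `summary` table and its column names are assumptions, not the notebook's results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the wine data (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=11,
                                     weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

rows = []
for C in [0.01, 0.1, 1.0]:
    for penalty in ['l1', 'l2']:
        clf = LogisticRegression(C=C, penalty=penalty, solver='liblinear')
        clf.fit(X_tr_s, y_tr)
        rows.append({
            'C': C,
            'penalty': penalty,
            # number of predictors the penalty has not removed
            'nonzero_coefs': int((clf.coef_ != 0).sum()),
            'f1_macro': f1_score(y_te, clf.predict(X_te_s), average='macro'),
        })

summary = pd.DataFrame(rows)
print(summary)
```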