# Exercise 6

## SVM & Regularization


For this homework we consider a set of observations on a number of red and white wine varieties, covering their chemical properties and their ranking by tasters. The wine industry has shown a recent growth spurt as social drinking is on the rise. The price of wine depends in part on a rather abstract concept, wine appreciation by tasters, whose opinions can vary widely, so pricing rests to some extent on a volatile factor. Another key factor in wine certification and quality assessment is physicochemical testing, which is laboratory-based and takes into account factors such as acidity, pH level, sugar content and other chemical properties. It would be of interest to the wine market if the quality perceived by human tasters could be related to the chemical properties of the wine, so that the certification, quality assessment and quality assurance process becomes more controlled.

Two datasets are available: one on red wine, with 1599 varieties, and one on white wine, with 4898 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 properties of the wines: one is Quality, based on sensory data, and the rest are chemical properties, including density, acidity, alcohol content, etc. All chemical properties are continuous variables. Quality is an ordinal variable with possible rankings from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters, and the final rank assigned is the median of their ranks.

A predictive model developed on these data is expected to give vineyards guidance on the quality and price to expect for their produce, without heavy reliance on the volatility of wine tasters.

In [3]:
import pandas as pd
import numpy as np

In [4]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')

In [5]:
# Label each dataset and stack them (DataFrame.append was removed in pandas 2.0)
data = pd.concat(
    [data_w.assign(type='white'), data_r.assign(type='red')],
    ignore_index=True,
)
data.sample(5)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 378 | 5.7 | 0.32 | 0.5 | 2.6 | 0.049 | 17.0 | 155.0 | 0.9927 | 3.22 | 0.64 | 10.0 | 6 | white |
| 3036 | 6.8 | 0.29 | 0.34 | 3.5 | 0.054 | 26.0 | 189.0 | 0.99489 | 3.42 | 0.58 | 10.4 | 5 | white |
| 3330 | 6.7 | 0.23 | 0.33 | 8.1 | 0.048 | 45.0 | 176.0 | 0.99472 | 3.11 | 0.52 | 10.1 | 6 | white |
| 1868 | 7.4 | 0.21 | 0.27 | 7.3 | 0.031 | 41.0 | 144.0 | 0.9932 | 3.15 | 0.38 | 11.8 | 7 | white |
| 20 | 6.2 | 0.66 | 0.48 | 1.2 | 0.029 | 29.0 | 75.0 | 0.9892 | 3.33 | 0.39 | 12.8 | 8 | white |


# Exercise 6.1

Show the frequency table of quality by type of wine.

In [6]:
pd.crosstab(index=data['quality'], columns=data['type'], margins=True )

| quality | red | white | All |
|---|---|---|---|
| 3 | 10 | 20 | 30 |
| 4 | 53 | 163 | 216 |
| 5 | 681 | 1457 | 2138 |
| 6 | 638 | 2198 | 2836 |
| 7 | 199 | 880 | 1079 |
| 8 | 18 | 175 | 193 |
| 9 | 0 | 5 | 5 |
| All | 1599 | 4898 | 6497 |


# SVM

# Exercise 6.2

* Standardize the features (not the quality), i.e. normalize them
* Create a binary target for each type of wine
* Create two linear SVMs, for the white and red wines respectively.
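
The three steps listed above can be sketched end to end as follows. This is only an illustration under stated assumptions: the data are synthetic (`X_demo`, `quality_demo` and the other `*_demo` names are hypothetical stand-ins, not the wine data), the cutoff `quality >= 6` for a "good" wine is an assumed choice, and `StandardScaler` plus `LinearSVC` is one reasonable way to do it rather than the method the exercise prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for 11 chemical features (hypothetical, not the wine data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 11))
quality_demo = np.clip(
    np.round(6 + X_demo[:, 10] + rng.normal(scale=0.5, size=400)), 3, 9
)

# Binary target: 1 for quality >= 6, 0 otherwise (assumed cutoff)
y_demo = (quality_demo >= 6).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The pipeline fits the scaler on the training split only,
# then applies the same scaling to the test split before the linear SVM.
linear_svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
linear_svm.fit(X_tr, y_tr)

demo_accuracy = linear_svm.score(X_te, y_te)
print(f"test accuracy: {demo_accuracy:.3f}")
```

For the wine data, the same pipeline would be fitted once on the red subset and once on the white subset.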


In [7]:
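# Drop the non-feature columns, leaving the 11 chemical properties
# (the positional axis argument of drop() was removed in pandas 2.0)
data_not = data.drop(columns=['type', 'quality'])
print(data_not)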

      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0               7.0             0.270         0.36           20.70      0.045   
1               6.3             0.300         0.34            1.60      0.049   
2               8.1             0.280         0.40            6.90      0.050   
3               7.2             0.230         0.32            8.50      0.058   
4               7.2             0.230         0.32            8.50      0.058   
5               8.1             0.280         0.40            6.90      0.050   
6               6.2             0.320         0.16            7.00      0.045   
7               7.0             0.270         0.36           20.70      0.045   
8               6.3             0.300         0.34            1.60      0.049   
9               8.1             0.220         0.43            1.50      0.044   
10              8.1             0.270         0.41            1.45      0.033   
11              8.6         

In [8]:
from sklearn.preprocessing import MinMaxScaler

# Rescale every feature to the [0, 1] range, keeping the original column names
scaler = MinMaxScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(data_not), columns=data_not.columns)
print(scaled_df)


      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0          0.264463          0.126667     0.216867        0.308282   0.059801   
1          0.206612          0.146667     0.204819        0.015337   0.066445   
2          0.355372          0.133333     0.240964        0.096626   0.068106   
3          0.280992          0.100000     0.192771        0.121166   0.081395   
4          0.280992          0.100000     0.192771        0.121166   0.081395   
5          0.355372          0.133333     0.240964        0.096626   0.068106   
6          0.198347          0.160000     0.096386        0.098160   0.059801   
7          0.264463          0.126667     0.216867        0.308282   0.059801   
8          0.206612          0.146667     0.204819        0.015337   0.066445   
9          0.355372          0.093333     0.259036        0.013804   0.058140   
10         0.355372          0.126667     0.246988        0.013037   0.039867   
11         0.396694         

In [9]:
# Binary wine-type indicator: 1 = white, 0 = red
data['type_bin'] = np.where(data['type'] == 'white', 1, 0)

In [10]:
# Re-attach quality, type and type_bin to the scaled features
data_col = data[['quality', 'type', 'type_bin']]
data_tot = scaled_df.join(data_col)

In [11]:
data_tot_w=data_tot[data_tot['type']=='white']
data_tot_r=data_tot[data_tot['type']=='red']

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Features are the 11 scaled chemical properties; the target is the quality score
X_r = data_tot_r[data_tot_r.columns[0:11]]
y_r = data_tot_r['quality']

X_w = data_tot_w[data_tot_w.columns[0:11]]
y_w = data_tot_w['quality']

# No random_state is set, so the splits (and the accuracies) change from run to run
X_r_train, X_r_test, y_r_train, y_r_test = train_test_split(X_r, y_r, test_size=0.3)
X_w_train, X_w_test, y_w_train, y_w_test = train_test_split(X_w, y_w, test_size=0.3)

# Fit one SVC per kernel and wine type; store test accuracy as (red, white)
accuracy = {}
for kernel in ['rbf', 'linear', 'poly', 'sigmoid']:
    model = SVC(kernel=kernel, gamma='auto')
    red_acc = (model.fit(X_r_train, y_r_train).predict(X_r_test) == y_r_test).mean()
    white_acc = (model.fit(X_w_train, y_w_train).predict(X_w_test) == y_w_test).mean()
    accuracy[kernel] = (red_acc, white_acc)

r_rbf, w_rbf = accuracy['rbf']
r_lin, w_lin = accuracy['linear']
r_poly, w_poly = accuracy['poly']
r_sig, w_sig = accuracy['sigmoid']


# Exercise 6.3

Test the two SVMs using the different kernels (‘poly’, ‘rbf’, ‘sigmoid’).


In [14]:
print('rbf Accuracy    ', 'red :',r_rbf,' white:',w_rbf)
print('poly Accuracy   ', 'red :',r_poly,'white:',w_poly)
print('linear Accuracy ', 'red :',r_lin,' white:',w_lin)
print('sigmoid Accuracy', 'red :',r_sig,' white:',w_sig)

rbf Accuracy     red : 0.6083333333333333  white: 0.5197278911564626
poly Accuracy    red : 0.4395833333333333 white: 0.46122448979591835
linear Accuracy  red : 0.5979166666666667  white: 0.5231292517006803
sigmoid Accuracy red : 0.58125  white: 0.4748299319727891
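
The accuracies above come from a single random train/test split, so small gaps between kernels may not be stable. One possible way to compare kernels more robustly is k-fold cross-validation. The sketch below assumes synthetic data from `make_classification` (the `X_demo`, `y_demo` and `kernel_scores` names are hypothetical) and uses `gamma='scale'`, which is an assumption and not the setting used in the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in with 11 features (hypothetical, not the wine data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=0
)

# Mean 5-fold cross-validated accuracy for each kernel
kernel_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = SVC(kernel=kernel, gamma="scale")
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5)
    kernel_scores[kernel] = fold_scores.mean()
    print(f"{kernel:8s} mean accuracy: {kernel_scores[kernel]:.3f}")
```

Averaging over folds gives each kernel several test sets instead of one, which makes the ranking less dependent on a lucky or unlucky split.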


# Exercise 6.4
Using the best SVM, find the parameters that give the best performance.

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]

In [20]:
# Five hand-picked (gamma, C) settings for the RBF kernel, c1 to c5
settings = [(0.01, 0.1), ('auto', 1), (0.001, 10), ('auto', 100), (0.0001, 1000)]

# Test accuracy of each setting, stored as (red, white)
results = []
for gamma, C in settings:
    model = SVC(kernel='rbf', gamma=gamma, C=C)
    red_acc = (model.fit(X_r_train, y_r_train).predict(X_r_test) == y_r_test).mean()
    white_acc = (model.fit(X_w_train, y_w_train).predict(X_w_test) == y_w_test).mean()
    results.append((red_acc, white_acc))

(c1r_rbf, c1w_rbf), (c2r_rbf, c2w_rbf), (c3r_rbf, c3w_rbf), \
    (c4r_rbf, c4w_rbf), (c5r_rbf, c5w_rbf) = results

# Exercise 6.5

Compare the results with other methods

In [21]:
print('rbf Accuracy c1    ', 'red :',c1r_rbf,' white:',c1w_rbf)
print('rbf Accuracy c2    ', 'red :',c2r_rbf,' white:',c2w_rbf)
print('rbf Accuracy c3    ', 'red :',c3r_rbf,' white:',c3w_rbf)
print('rbf Accuracy c4    ', 'red :',c4r_rbf,' white:',c4w_rbf)
print('rbf Accuracy c5    ', 'red :',c5r_rbf,' white:',c5w_rbf)

rbf Accuracy c1     red : 0.4395833333333333  white: 0.46122448979591835
rbf Accuracy c2     red : 0.6083333333333333  white: 0.5197278911564626
rbf Accuracy c3     red : 0.4395833333333333  white: 0.46122448979591835
rbf Accuracy c4     red : 0.6  white: 0.5265306122448979
rbf Accuracy c5     red : 0.60625  white: 0.5170068027210885


The methods produce similar results; nevertheless, the RBF kernel still gives the best predictions compared with the other methods (up to about 0.61 accuracy for red and 0.53 for white).
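
The parameter grid given in Exercise 6.4 can also be searched exhaustively rather than by hand-picking combinations. The following is a minimal sketch, assuming synthetic data (`X_demo`, `y_demo` are hypothetical) and 3-fold cross-validation as the selection criterion; it is not the procedure used in the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in with 11 features (hypothetical, not the wine data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=11, n_informative=5, random_state=0
)

# The grid from the exercise statement: 5 values of C x 3 values of gamma
param_grid = {"C": [0.1, 1, 10, 100, 1000], "gamma": [0.01, 0.001, 0.0001]}

# Every combination is scored by cross-validated accuracy
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

Selecting parameters by cross-validation on the training data keeps the held-out test set free for a final, unbiased accuracy estimate.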

# Regularization

# Exercise 6.6


* Train a linear regression to predict wine quality (continuous), i.e. predict quality as a continuous value from 3 to 9

* Analyze the coefficients

* Evaluate the RMSE

In [26]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

X=data_tot[data_tot.columns[0:11]]
y=data_tot['quality']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('Coefficients: \n', regr.coef_)
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

Coefficients: 
 [ 0.95752544 -2.03094504 -0.33645095  2.93642304 -0.26418949  1.51137144
 -1.01089897 -2.97662758  0.58360803  1.35627374  1.79220429]
Mean squared error: 0.51


Although the squared error is not especially small (an MSE of 0.51 corresponds to an RMSE of roughly 0.7 quality points), the data and variables are sufficient to model and predict the dependent variable.
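
The exercise asks for the RMSE, while the cell above prints the MSE. The RMSE is simply the square root of the MSE and is expressed in the same units as the target. A minimal sketch on synthetic data (all `*_demo` names and the coefficients 2.0 and -1.0 are hypothetical choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem with known noise level 0.5 (hypothetical data)
rng = np.random.default_rng(1)
X_demo = rng.uniform(size=(500, 3))
y_demo = (
    5.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]
    + rng.normal(scale=0.5, size=500)
)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
model = LinearRegression().fit(X_tr, y_tr)

mse = mean_squared_error(y_te, model.predict(X_te))
rmse = np.sqrt(mse)  # RMSE: square root of the MSE, in the units of the target

print("coefficients:", model.coef_)
print(f"MSE: {mse:.3f}  RMSE: {rmse:.3f}")
```

Because the noise here has standard deviation 0.5, an RMSE near that value indicates the model has recovered most of the structure it can.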

# Exercise 6.7

* Estimate a ridge regression with alpha equals 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [30]:
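from sklearn.linear_model import Ridge
# Note: the `normalize` argument was removed in scikit-learn 1.2, so this cell needs an older release
ridgereg = Ridge(alpha=0.1, normalize=True)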
ridgereg.fit(X_train, y_train)
y_pred = ridgereg.predict(X_test)
print('Coefficients: \n', ridgereg.coef_)
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

Coefficients: 
 [ 0.42515614 -1.78403445 -0.08730228  1.74303462 -0.5856916   1.15164668
 -0.74468497 -1.57156103  0.32195241  1.14369605  1.76032959]
Mean squared error: 0.52


In [31]:
ridgereg = Ridge(alpha=1, normalize=True)
ridgereg.fit(X_train, y_train)
y_pred = ridgereg.predict(X_test)
print('Coefficients: \n', ridgereg.coef_)
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

Coefficients: 
 [ 0.01706692 -0.88067166  0.24822433  0.37314046 -0.77082965  0.37504967
 -0.25022498 -1.16026335  0.10561067  0.53283731  0.94040336]
Mean squared error: 0.58
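
The ridge outputs above show the coefficients shrinking toward zero as alpha grows. The sketch below illustrates the same effect on synthetic data. It assumes a `StandardScaler` pipeline in place of `normalize=True`; the two are not numerically equivalent, so the alpha values here are not comparable with those in the cells above. The `*_demo` names and the true coefficient vector are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with known coefficients (hypothetical data)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=300)

# Unpenalized baseline; ols[-1] is the final estimator of the pipeline
ols = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)
coef_norms = {"ols": np.linalg.norm(ols[-1].coef_)}

# Larger alpha means a stronger L2 penalty and a smaller coefficient vector
for alpha in [0.1, 1, 100]:
    ridge = make_pipeline(StandardScaler(), Ridge(alpha=alpha)).fit(X_demo, y_demo)
    coef_norms[alpha] = np.linalg.norm(ridge[-1].coef_)

for name, norm in coef_norms.items():
    print(f"{name!s:>5}: ||coef|| = {norm:.4f}")
```

Ridge shrinks all coefficients smoothly but does not usually set any of them exactly to zero.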


# Exercise 6.8

* Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [32]:
from sklearn.linear_model import Lasso
lassoreg = Lasso(alpha=0.001, normalize=True)
lassoreg.fit(X_train, y_train)
y_pred = lassoreg.predict(X_test)  # score the lasso itself, not the earlier ridge predictions
print('Coefficients: \n', lassoreg.coef_)
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

Coefficients: 
 [ 0.         -1.33396819  0.          0.         -0.          0.
 -0.         -0.          0.          0.          1.77422724]


In [33]:
lassoreg = Lasso(alpha=1, normalize=True)
lassoreg.fit(X_train, y_train)
y_pred = lassoreg.predict(X_test)  # recompute predictions for this model
print('Coefficients: \n', lassoreg.coef_)
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

Coefficients: 
 [-0. -0.  0. -0. -0.  0. -0. -0.  0.  0.  0.]
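
To compare lasso fits fairly, predictions have to be recomputed from each fitted model before scoring. The sketch below does that for the three alpha values named in the exercise, on synthetic data where only two of six features matter. The `*_demo` names and `lasso_results` are hypothetical, and no `normalize` argument is used, so the alphas are not comparable with the cells above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: only the first two of six features carry signal (hypothetical)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(400, 6))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=3)

# For each alpha: (number of non-zero coefficients, test RMSE)
lasso_results = {}
for alpha in [0.01, 0.1, 1]:
    lasso = Lasso(alpha=alpha).fit(X_tr, y_tr)
    pred = lasso.predict(X_te)  # predictions come from the model just fitted
    n_nonzero = int(np.sum(lasso.coef_ != 0))
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    lasso_results[alpha] = (n_nonzero, rmse)
    print(f"alpha={alpha}: {n_nonzero} non-zero coefficients, RMSE={rmse:.3f}")
```

Unlike ridge, the L1 penalty can set coefficients exactly to zero, which is why the lasso doubles as a feature selector.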


# Exercise 6.9

* Create a binary target

* Train a logistic regression to predict wine quality (binary)

* Analyze the coefficients

* Evaluate the f1score

In [36]:
data_tot.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | type | type_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.264463 | 0.126667 | 0.216867 | 0.308282 | 0.059801 | 0.152778 | 0.37788 | 0.267785 | 0.217054 | 0.129213 | 0.115942 | 6 | white | 1 |
| 1 | 0.206612 | 0.146667 | 0.204819 | 0.015337 | 0.066445 | 0.045139 | 0.290323 | 0.132832 | 0.449612 | 0.151685 | 0.217391 | 6 | white | 1 |
| 2 | 0.355372 | 0.133333 | 0.240964 | 0.096626 | 0.068106 | 0.100694 | 0.209677 | 0.154039 | 0.418605 | 0.123596 | 0.304348 | 6 | white | 1 |
| 3 | 0.280992 | 0.1 | 0.192771 | 0.121166 | 0.081395 | 0.159722 | 0.414747 | 0.163678 | 0.364341 | 0.101124 | 0.275362 | 6 | white | 1 |
| 4 | 0.280992 | 0.1 | 0.192771 | 0.121166 | 0.081395 | 0.159722 | 0.414747 | 0.163678 | 0.364341 | 0.101124 | 0.275362 | 6 | white | 1 |


In [38]:
data_tot['quality'].value_counts()

6    2836
5    2138
7    1079
4     216
8     193
3      30
9       5
Name: quality, dtype: int64

In [40]:
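# Binary target stored in 'quality': 1 if quality >= 6, else 0
# (this overwrites the original scores, so run the cell only once)
data_tot['quality'] = np.where(data_tot['quality'] < 6, 0, 1)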

In [46]:
data_tot['quality'].value_counts()

1    4113
0    2384
Name: quality, dtype: int64

In [47]:
X=data_tot[data_tot.columns[0:11]]
y=data_tot['quality']

In [48]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [49]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9,solver='liblinear',multi_class='auto')
logreg.fit(X_train, y_train)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='auto', n_jobs=None, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [51]:
print('Coefficients: \n', logreg.coef_)
y_pred_prob = logreg.predict_proba(X_test)
print('predicted probabilities', y_pred_prob)

Coefficients: 
 [[ 1.54662929 -6.6579565  -1.54319534  6.20361567 -0.79202973  4.45561891
  -3.12472316 -3.87932198  0.90686793  3.93265128  5.82629683]]
predicted probabilities [[0.14432673 0.85567327]
 [0.09018548 0.90981452]
 [0.47616348 0.52383652]
 ...
 [0.02057451 0.97942549]
 [0.06943923 0.93056077]
 [0.08402093 0.91597907]]


In [57]:
from sklearn.metrics import log_loss
print('calculate log loss', log_loss(y_test, y_pred_prob))

calculate log loss 0.5107099065522029
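
The exercise asks for the F1 score, whereas the cell above reports the log loss. The F1 score is computed from hard class labels (the output of `predict`), not from the probabilities returned by `predict_proba`. A minimal sketch on synthetic data, where `X_demo`, `y_demo`, `y_hat` and `demo_f1` are hypothetical names:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem with 11 features (hypothetical, not the wine data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=11, n_informative=5, random_state=4
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=4)

# A very large C makes the L2 penalty negligible, as in the cell above
clf = LogisticRegression(C=1e9, solver="liblinear").fit(X_tr, y_tr)

y_hat = clf.predict(X_te)        # hard 0/1 labels, which f1_score expects
demo_f1 = f1_score(y_te, y_hat)  # harmonic mean of precision and recall

print(f"F1 score: {demo_f1:.3f}")
```

With the notebook's own variables, the equivalent call would be `f1_score(y_test, logreg.predict(X_test))`.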


# Exercise 6.10

* Estimate a regularized logistic regression using:
  * C = 0.01, 0.1 & 1.0
  * penalty = ['l1', 'l2']
* Compare the coefficients and the f1score

In [58]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = X_train.astype(float)
X_test = X_test.astype(float)
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [64]:
# C=0.01 with L1 penalty
logreg = LogisticRegression(C=0.01, penalty='l1',solver='liblinear',multi_class='auto')
logreg.fit(X_train_scaled, y_train)
print(logreg.coef_)

[[ 0.         -0.48570385  0.          0.05572712  0.          0.
   0.          0.          0.          0.11484858  0.86106117]]


In [65]:
y_pred_prob = logreg.predict_proba(X_test_scaled)
print(log_loss(y_test, y_pred_prob))

0.5283558582075295


In [66]:
# C=0.01 with L2 penalty
logreg = LogisticRegression(C=0.01, penalty='l2',multi_class='auto',solver='liblinear')
logreg.fit(X_train_scaled, y_train)
print(logreg.coef_)

[[ 0.09458229 -0.57751976 -0.05963997  0.30266515 -0.08513586  0.18570927
  -0.28711263 -0.17744673  0.08110469  0.26545366  0.811105  ]]


In [67]:
y_pred_prob = logreg.predict_proba(X_test_scaled)
print(log_loss(y_test, y_pred_prob))

0.5150941524449589


In [68]:
# C=0.1 with L1 penalty
logreg = LogisticRegression(C=0.1, penalty='l1',solver='liblinear',multi_class='auto')
logreg.fit(X_train_scaled, y_train)
print(logreg.coef_)

[[ 0.03431397 -0.72274389 -0.10396291  0.2870452  -0.04826651  0.23034865
  -0.34621275  0.          0.03658572  0.28394827  1.07545229]]


In [69]:
y_pred_prob = logreg.predict_proba(X_test_scaled)
print(log_loss(y_test, y_pred_prob))

0.5103820803732918


In [70]:
# C=0.1 with L2 penalty
logreg = LogisticRegression(C=0.1, penalty='l2',multi_class='auto',solver='liblinear')
logreg.fit(X_train_scaled, y_train)
print(logreg.coef_)

[[ 0.15028494 -0.71373756 -0.12352109  0.42530979 -0.05177741  0.26063747
  -0.39039379 -0.20897562  0.10652593  0.31893787  0.98154657]]


In [71]:
y_pred_prob = logreg.predict_proba(X_test_scaled)
print(log_loss(y_test, y_pred_prob))

0.5107464017663997


In [72]:
# C=1.0 with L1 penalty
logreg = LogisticRegression(C=1.0, penalty='l1',solver='liblinear',multi_class='auto')
logreg.fit(X_train_scaled, y_train)
print(logreg.coef_)

[[ 0.14534226 -0.73542897 -0.13151383  0.42673088 -0.04627053  0.26940738
  -0.40153886 -0.18799125  0.10213793  0.32203355  1.01751604]]


In [73]:
y_pred_prob = logreg.predict_proba(X_test_scaled)
print(log_loss(y_test, y_pred_prob))

0.5105970959632321


In [74]:
# C=1.0 with L2 penalty
logreg = LogisticRegression(C=1.0, penalty='l2',multi_class='auto',solver='liblinear')
logreg.fit(X_train_scaled, y_train)
print(logreg.coef_)

[[ 0.16275776 -0.73267209 -0.13346015  0.44864345 -0.04590714  0.27247935
  -0.40699637 -0.22212463  0.11268129  0.32702151  1.00216482]]


In [75]:
y_pred_prob = logreg.predict_proba(X_test_scaled)
print(log_loss(y_test, y_pred_prob))

0.5107066258331013
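
The six fits above can also be run in one loop that records, for each combination, how many coefficients are exactly zero and the F1 score the exercise asks for. This is a sketch under assumptions: the data are synthetic and the names `X_demo`, `y_demo` and `grid_results` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem with 11 features (hypothetical, not the wine data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=11, n_informative=5, random_state=5
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=5)

# Standardize with statistics from the training split only
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# For each (penalty, C): (number of zeroed coefficients, test F1 score)
grid_results = {}
for penalty in ["l1", "l2"]:
    for C in [0.01, 0.1, 1.0]:
        clf = LogisticRegression(C=C, penalty=penalty, solver="liblinear")
        clf.fit(X_tr_s, y_tr)
        n_zero = int(np.sum(clf.coef_ == 0))
        f1 = f1_score(y_te, clf.predict(X_te_s))
        grid_results[(penalty, C)] = (n_zero, f1)
        print(f"penalty={penalty} C={C}: {n_zero} zero coefficients, F1={f1:.3f}")
```

Smaller C means stronger regularization; with the L1 penalty this typically shows up as more coefficients set exactly to zero, in line with the C=0.01 L1 output above.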
