# Exercise 6

## SVM & Regularization


For this homework we consider a set of observations on a number of red and white wine varieties involving their chemical properties and ranking by tasters. Wine industry shows a recent growth spurt as social drinking is on the rise. The price of wine depends on a rather abstract concept of wine appreciation by wine tasters, opinion among whom may have a high degree of variability. Pricing of wine depends on such a volatile factor to some extent. Another key factor in wine certification and quality assessment is physicochemical tests which are laboratory-based and takes into account factors like acidity, pH level, presence of sugar and other chemical properties. For the wine market, it would be of interest if human quality of tasting can be related to the chemical properties of wine so that certification and quality assessment and assurance process is more controlled.

Two datasets are available of which one dataset is on red wine and have 1599 different varieties and the other is on white wine and have 4898 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 different properties of the wines one of which is Quality, based on sensory data, and the rest are on chemical properties of the wines including density, acidity, alcohol content etc. All chemical properties of wines are continuous variables. Quality is an ordinal variable with possible ranking from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters and the final rank assigned is the median rank given by the tasters.

A predictive model developed on this data is expected to provide guidance to vineyards regarding quality and price expected on their produce without heavy reliance on volatility of wine tasters.

In [1]:
import pandas as pd
import numpy as np

In [2]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')

In [3]:
data = data_w.assign(type = 'white')

data = data.append(data_r.assign(type = 'red'), ignore_index=True)
data.sample(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
6213,7.5,0.63,0.27,2.0,0.083,17.0,91.0,0.99616,3.26,0.58,9.8,6,red
4735,6.0,0.17,0.36,1.7,0.042,14.0,61.0,0.99144,3.22,0.54,10.8,6,white
1257,6.4,0.17,0.27,6.7,0.036,88.0,223.0,0.9948,3.28,0.35,10.2,6,white
5553,9.7,0.55,0.17,2.9,0.087,20.0,53.0,1.0004,3.14,0.61,9.4,5,red
1880,7.7,0.3,0.42,14.3,0.045,45.0,213.0,0.9991,3.18,0.63,9.2,5,white


# Exercise 6.1

Show the frecuency table of the quality by type of wine

In [4]:
x = pd.value_counts(data.type).to_frame().reset_index()
x.columns = ['Type','Count']
x

Unnamed: 0,Type,Count
0,white,4898
1,red,1599


# SVM

# Exercise 6.2

* Standarized the features (not the quality)
* Create a binary target for each type of wine
* Create two Linear SVM's for the white and red wines, repectively.


In [5]:
#Standarized the features (not the quality)
data['quality_bool'] = data['quality'] > 6
datared = data[data['type'] == 'red']
datawhite = data[data['type'] == 'white']

#Create a binary target for each type of wine
data['type_bin'] = data['type'].replace({'white':1, 'red':0})

datared.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,quality_bool
4898,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,False
4899,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red,False
4900,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red,False
4901,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red,False
4902,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,False


In [33]:
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,quality_bool,type_bin
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white,False,1
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white,False,1
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white,False,1
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,False,1
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,False,1


In [6]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

svm = SVC(kernel= 'linear')

y1 = datawhite['quality_bool']
x1 = datawhite.drop(['quality_bool', 'type', 'quality'],1)

ss = StandardScaler(with_mean=True, with_std=True)
ss.fit(x1.astype(np.float))
x1 = ss.transform(x1.astype(np.float))
ss.mean_, ss.scale_


svm.fit(x1,y1)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [7]:
svm2 = SVC(kernel= 'linear')

y2 = datared['quality_bool']
x2 = datared.drop(['quality_bool', 'type', 'quality'],1)

ss = StandardScaler(with_mean=True, with_std=True)
ss.fit(x2.astype(np.float))
x2 = ss.transform(x2.astype(np.float))
ss.mean_, ss.scale_

svm2.fit(x2,y2)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

# Exercise 6.3

Test the two SVM's using the different kernels (‘poly’, ‘rbf’, ‘sigmoid’)


In [8]:
# SVM for white wines

svm_white_poly = SVC(kernel= 'poly')
svm_white_poly.fit(x1,y1)
score_white_poly = svm_white_poly.score(x1,y1)

svm_white_rbf = SVC(kernel= 'rbf')
svm_white_rbf.fit(x1,y1)
score_white_rbf = svm_white_rbf.score(x1,y1)

svm_white_sigmoid = SVC(kernel= 'sigmoid')
svm_white_sigmoid.fit(x1,y1)
score_white_sigmoid = svm_white_sigmoid.score(x1,y1)

# SVM for red wines

svm_red_poly = SVC(kernel= 'poly')
svm_red_poly.fit(x2,y2)
score_red_poly = svm_red_poly.score(x2,y2)

svm_red_rbf = SVC(kernel= 'rbf')
svm_red_rbf.fit(x2,y2)
score_red_rbf = svm_red_rbf.score(x2,y2)

svm_red_sigmoid = SVC(kernel= 'sigmoid')
svm_red_sigmoid.fit(x2,y2)
score_red_sigmoid = svm_red_sigmoid.score(x2,y2)

print('The scores for the white wine SVM using the different kernels are ','poly:', score_white_poly,'rbf:', score_white_rbf,'sigmoid:', score_white_sigmoid   )
print('The scores for the red wine SVM using the different kernels are ','poly:', score_red_poly,'rbf:', score_red_rbf,'sigmoid:', score_red_sigmoid   )




The scores for the white wine SVM using the different kernels are  poly: 0.8156390363413638 rbf: 0.8409554920375664 sigmoid: 0.7133523887300939
The scores for the red wine SVM using the different kernels are  poly: 0.908692933083177 rbf: 0.8986866791744841 sigmoid: 0.8311444652908068


# Exercise 6.4
Using the best SVM find the parameters that gives the best performance

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]

In [9]:
white_list = []
red_list = []

for c in [0.1, 1, 10, 100, 1000]:
    for g in [0.01, 0.001, 0.0001]:
        svm_white_rbf = SVC(kernel= 'rbf',C=c, gamma=g)
        svm_white_rbf.fit(x1,y1)
        white_list.append([c,g,svm_white_rbf.score(x1,y1)]) 
        svm_red_poly = SVC(kernel= 'poly',C=c, gamma=g)
        svm_red_poly.fit(x2,y2)
        red_list.append([c,g, svm_red_poly.score(x2,y2)])

white_list.sort(key=lambda x: x[2]) 
red_list.sort(key=lambda x: x[2]) 
print(white_list)
print(red_list)
print("La mejor combinacion para ambos tipos es c=1000 y gamma=0.01")

[[0.1, 0.01, 0.7835851367905268], [0.1, 0.001, 0.7835851367905268], [0.1, 0.0001, 0.7835851367905268], [1, 0.001, 0.7835851367905268], [1, 0.0001, 0.7835851367905268], [10, 0.001, 0.7835851367905268], [10, 0.0001, 0.7835851367905268], [100, 0.0001, 0.7835851367905268], [1000, 0.0001, 0.7835851367905268], [1, 0.01, 0.8007349938750511], [100, 0.001, 0.8064516129032258], [1000, 0.001, 0.8182931808901592], [10, 0.01, 0.819314005716619], [100, 0.01, 0.8344222131482237], [1000, 0.01, 0.8468762760310331]]
[[0.1, 0.01, 0.8642901813633521], [0.1, 0.001, 0.8642901813633521], [0.1, 0.0001, 0.8642901813633521], [1, 0.01, 0.8642901813633521], [1, 0.001, 0.8642901813633521], [1, 0.0001, 0.8642901813633521], [10, 0.001, 0.8642901813633521], [10, 0.0001, 0.8642901813633521], [100, 0.001, 0.8642901813633521], [100, 0.0001, 0.8642901813633521], [1000, 0.001, 0.8642901813633521], [1000, 0.0001, 0.8642901813633521], [10, 0.01, 0.8655409631019387], [100, 0.01, 0.8799249530956847], [1000, 0.01, 0.9105691056

# Exercise 6.5

Compare the results with other methods

In [10]:
from sklearn import linear_model
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

reg_white = linear_model.LogisticRegression()
reg_white.fit(x1,y1)

reg_red = linear_model.LogisticRegression()
reg_red.fit(x2,y2)

predictions_white = reg_white.predict(x1)
predictions_red = reg_white.predict(x2)
print(accuracy_score(y1, predictions_white))
print(accuracy_score(y2, predictions_red))

print('''SVM es mejor para los dos tipos de vino:  
white_SVM=0.85 vs white_LOGIT=0.8
red_SVM=0.91 vs white_LOGIT=0.7986''')

0.8017558187015108
0.7986241400875547
SVM es mejor para los dos tipos de vino:  
white_SVM=0.85 vs white_LOGIT=0.8
red_SVM=0.91 vs white_LOGIT=0.7986


# Regularization

# Exercise 6.6


* Train a linear regression to predict wine quality (Continous)

* Analyze the coefficients

* Evaluate the RMSE

In [11]:
from sklearn.model_selection import train_test_split

lr_x = data.drop(['quality_bool', 'type', 'type_bin', 'quality'],1)
lr_y = data['quality']

#Standarize Xi
ss = StandardScaler(with_mean=True, with_std=True)
ss.fit(lr_x.astype(np.float))
lr_x = ss.transform(lr_x.astype(np.float))
ss.mean_, ss.scale_

#Standarize y
y_mean, y_std = lr_y.mean(), lr_y.std()
lr_y = (lr_y - y_mean)/ y_std

X_train, X_test, y_train, y_test = train_test_split(lr_x, lr_y, test_size=0.2, random_state=42)

lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

lr.coef_

array([ 0.09787725, -0.24841422, -0.01551171,  0.23856108, -0.01492416,
        0.12888738, -0.17405069, -0.17861212,  0.07367229,  0.1335276 ,
        0.36460781])

In [12]:
#RMSE
from sklearn import metrics
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test,y_pred )))

RMSE: 0.7906802961090872


# Exercise 6.7

* Estimate a ridge regression with alpha equals 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [29]:
# Ridge
from sklearn.linear_model import Ridge
ridgereg01 = Ridge(alpha=0.1, normalize = True)
ridgereg01.fit(X_train, y_train)
y_pred01 = ridgereg01.predict(X_test)
print('RMSE with alpha 0.1:', np.sqrt(metrics.mean_squared_error(y_test,y_pred01 )))

RMSE with alpha 0.1: 0.7928917226432389


In [30]:
ridgereg1 = Ridge(alpha=1, normalize = True)
ridgereg1.fit(X_train, y_train)
y_pred1 = ridgereg1.predict(X_test)
print('RMSE with alpha 1:', np.sqrt(metrics.mean_squared_error(y_test,y_pred1 )))

RMSE with alpha 1: 0.8360599085422661


# Exercise 6.8

* Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [24]:
# Lasso
from sklearn.linear_model import Lasso
from sklearn.metris import classification_report

lassoreg001 = Lasso(alpha=0.01)
lassoreg001.fit(X_train, y_train)
y_pred001 = lassoreg001.predict(X_test)
print('RMSE with alpha 0.01:', np.sqrt(metrics.mean_squared_error(y_test,y_pred001 )))

RMSE with alpha 0.01: 0.7928590039681798


In [19]:
lassoreg01 = Lasso(alpha=0.1)
lassoreg01.fit(X_train, y_train)
y_pred01 = lassoreg01.predict(X_test)
print('RMSE with alpha 0.1:', np.sqrt(metrics.mean_squared_error(y_test,y_pred01 )))

RMSE with alpha 0.1: 0.8214188635497303


In [32]:
lassoreg1 = Lasso(alpha=1)
lassoreg1.fit(X_train, y_train)
y_pred1 = lassoreg1.predict(X_test)
print('RMSE with alpha 1:', np.sqrt(metrics.mean_squared_error(y_test,y_pred1 )))


RMSE with alpha 1: 0.9681743922374527


# Exercise 6.9

* Create a binary target

* Train a logistic regression to predict wine quality (binary)

* Analyze the coefficients

* Evaluate the f1score

In [58]:
from sklearn.metrics import f1_score
y = data['quality_bool']
x = data.drop(['quality_bool', 'type', 'quality'],1)

reg = linear_model.LogisticRegression()
reg.fit(x,y)

predictions = reg.predict(x)
print('The accuaracy score is',accuracy_score(y, predictions),', which means this model has a good accuracy.')
print('The f1 score is',f1_score(y, predictions), 'which means the model wouldn\'t be the best, the closer to 1 the better.')

The accuaracy score is 0.8166846236724642 , which means this model has a good accuracy.
The f1 score is 0.32673827020915774 which means the model wouldn't be the best, the closer to 1 the better.


# Exercise 6.10

* Estimate a regularized logistic regression using:
* C = 0.01, 0.1 & 1.0
* penalty = ['l1, 'l2']
* Compare the coefficients and the f1score

In [59]:
from sklearn.preprocessing import StandardScaler
#scaler = StandardScaler()
#X_train = X_train.astype(float)
#X_test = X_test.astype(float)
#scaler.fit(X_train)
#X_train_scaled = scaler.transform(X_train)
#X_test_scaled = scaler.transform(X_test)

In [81]:
results = []

X=data[['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol']]
y=data['type_bin']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)  


X_train= preprocessing.scale(X_train)
X_test = preprocessing.scale(X_test)

for c in [0.01, 0.1, 1]:
    for p in ['l1', 'l2']:
        
        logreg = linear_model.LogisticRegression(C=c, penalty=p,solver='liblinear')
        logreg.fit(X_train, y_train)
        y_pred = logreg.predict(X_test)   
        results.append([c,p,f1_score(y, predictions)])

print(results)

[[0.01, 'l1', 0.1562152133580705], [0.01, 'l2', 0.1562152133580705], [0.1, 'l1', 0.1562152133580705], [0.1, 'l2', 0.1562152133580705], [1, 'l1', 0.1562152133580705], [1, 'l2', 0.1562152133580705]]
