# Exercise 6

## SVM & Regularization


For this homework we consider a set of observations on a number of red and white wine varieties, covering their chemical properties and their ranking by tasters. The wine industry has shown a recent growth spurt as social drinking is on the rise. The price of wine depends to some extent on a rather abstract concept, wine appreciation by tasters, whose opinions may vary widely. Another key factor in wine certification and quality assessment is physicochemical testing, which is laboratory-based and takes into account factors like acidity, pH level, presence of sugar and other chemical properties. For the wine market, it would be of interest if human tasting quality could be related to the chemical properties of wine, so that the certification, quality assessment and assurance process is more controlled.

Two datasets are available: one on red wine with 1599 varieties and one on white wine with 4898 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 different properties of the wines, one of which is Quality, based on sensory data; the rest are chemical properties, including density, acidity, alcohol content, etc. All chemical properties are continuous variables. Quality is an ordinal variable with possible rankings from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters and the final rank assigned is the median of the ranks they give.

A predictive model developed on this data is expected to give vineyards guidance on the quality and price they can expect for their produce, without heavy reliance on the variable opinions of wine tasters.

In [1]:
import pandas as pd
import numpy as np

In [2]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')

In [3]:
# DataFrame.append was removed in pandas 2.0; pd.concat builds the same combined table
data = pd.concat([data_w.assign(type='white'), data_r.assign(type='red')], ignore_index=True)
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white


# Exercise 6.1

Show the frequency table of quality by type of wine

In [4]:
Aux = data[["type","quality","density"]].groupby(["quality","type"]).count().unstack()
Aux

quality,density (red),density (white)
3,10.0,20.0
4,53.0,163.0
5,681.0,1457.0
6,638.0,2198.0
7,199.0,880.0
8,18.0,175.0
9,,5.0
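
A frequency table like this one can also be built with `pd.crosstab`, which counts rows directly and reports 0 rather than NaN for combinations that never occur. The following is a minimal sketch on a made-up frame (`toy` and its values are hypothetical, not the wine data):

```python
import pandas as pd

# Toy stand-in for the combined wine table (values are illustrative only).
toy = pd.DataFrame({
    "type":    ["red", "red", "white", "white", "white"],
    "quality": [5, 6, 5, 6, 6],
})

# Rows are quality scores, columns are wine types, cells are row counts.
freq = pd.crosstab(toy["quality"], toy["type"])
print(freq)
```

On the real data, the same call with `data["quality"]` and `data["type"]` should give integer counts without needing a helper column such as `density`.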


# SVM

# Exercise 6.2

* Standardize the features (not the quality)
* Create a binary target for each type of wine
* Create two linear SVMs for the white and red wines, respectively.


**Create a binary target for each type of wine**

In [5]:
data["quality2"]= data["quality"].map({3:'0',4:'0',5:'0',6:'1',7:'1',8:'1',9:'1'})
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,quality2
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white,1
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white,1
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white,1
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,1
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,1


In [6]:
data_blanco=data[data["type"]=="white"]
data_rojo=data[data["type"]!="white"]
#print(data_blanco[["quality2", "type"]].groupby("quality2").count())
#print(data_rojo[["quality2", "type"]].groupby("quality2").count())

**Standardize the features (not the quality)**

In [9]:
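from sklearn.preprocessing import Normalizer
# Note: Normalizer rescales each row (sample) to unit norm; it does not
# standardize the columns to zero mean and unit variance (StandardScaler does).
# All results below were obtained with Normalizer.
data_blanco_norm=Normalizer().fit_transform(data_blanco.loc[:,:"alcohol"])
data_rojo_norm=Normalizer().fit_transform(data_rojo.loc[:,:"alcohol"])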
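
The cell above uses `Normalizer`, which works row by row, whereas standardizing features usually means a column-wise transform such as `StandardScaler`. The sketch below contrasts the two on a small made-up matrix (`X` is hypothetical, not the wine data); it only illustrates the difference and does not reproduce the notebook's results.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Made-up feature matrix: 4 samples, 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 500.0]])

# StandardScaler works per column: zero mean, unit variance.
X_std = StandardScaler().fit_transform(X)

# Normalizer works per row: each sample is rescaled to unit L2 norm.
X_norm = Normalizer().fit_transform(X)

print(X_std.mean(axis=0), X_std.std(axis=0))
print(np.linalg.norm(X_norm, axis=1))
```

In a train/test setting the scaler would normally be fitted on the training split only and then applied to the test split.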

**Create two linear SVMs for the white and red wines, respectively.**

In [91]:
from sklearn.model_selection import train_test_split # Split into train and test
X_blanco=data_blanco_norm
Y_blanco=data_blanco["quality2"]
X_rojo=data_rojo_norm
Y_rojo=data_rojo["quality2"]
X_blanco_train, X_blanco_test, Y_blanco_train, Y_blanco_test = train_test_split(X_blanco, Y_blanco, test_size=0.30)
X_rojo_train, X_rojo_test, Y_rojo_train, Y_rojo_test = train_test_split(X_rojo, Y_rojo, test_size=0.30)

from sklearn.svm import SVC # "Support Vector Classifier"
from sklearn.metrics import accuracy_score
clf_blanco = SVC(kernel='linear')
clf_blanco.fit(X_blanco_train, Y_blanco_train)
y_blanco_predict = clf_blanco.predict(X_blanco_test)
blanco_accuracy=accuracy_score(Y_blanco_test, y_blanco_predict, normalize=True)
print(blanco_accuracy)

clf_rojo = SVC(kernel='linear')
clf_rojo.fit(X_rojo_train, Y_rojo_train)
y_rojo_predict = clf_rojo.predict(X_rojo_test)
rojo_accuracy=accuracy_score(Y_rojo_test, y_rojo_predict, normalize=True)
print(rojo_accuracy)

0.6714285714285714
0.61875


# Exercise 6.3

Test the two SVMs using the different kernels (‘poly’, ‘rbf’, ‘sigmoid’)


In [93]:
K=['linear','poly', 'rbf', 'sigmoid']
Acc = []
for k in range(len(K)):
    clf_blanco2 = SVC(kernel=K[k]) 
    clf_blanco2.fit(X_blanco_train, Y_blanco_train)
    y_blanco_predict2 = clf_blanco2.predict(X_blanco_test)
    blanco_accuracy2=accuracy_score(Y_blanco_test, y_blanco_predict2, normalize=True)
    Acc.append(blanco_accuracy2)
    print('Blanco_Acc_'+K[k]+':',blanco_accuracy2)

Blanco_Acc_linear: 0.6714285714285714
Blanco_Acc_poly: 0.6714285714285714
Blanco_Acc_rbf: 0.6714285714285714
Blanco_Acc_sigmoid: 0.6714285714285714


In [95]:
Acc = []
for k in range(len(K)):
    clf_rojo2 = SVC(kernel=K[k]) 
    clf_rojo2.fit(X_rojo_train, Y_rojo_train)
    y_rojo_predict2 = clf_rojo2.predict(X_rojo_test)
    rojo_accuracy2=accuracy_score(Y_rojo_test, y_rojo_predict2, normalize=True)
    Acc.append(rojo_accuracy2)
    print('Rojo_Acc_'+K[k]+':',rojo_accuracy2)

Rojo_Acc_linear: 0.61875
Rojo_Acc_poly: 0.5375
Rojo_Acc_rbf: 0.6270833333333333
Rojo_Acc_sigmoid: 0.6145833333333334


# Exercise 6.4
Using the best SVM, find the parameters that give the best performance

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]

In [101]:
C = [0.1, 1, 10, 100, 1000]
gamma = [0.01, 0.001, 0.0001]
cols=['type','C','gamma1','Accuracy']
Acc_tabla=pd.DataFrame(columns=cols,data=[])
Acc_val=[]
k=0
for i in range(len(C)):
    for j in range(len(gamma)):
        clf_blanco3 = SVC(kernel='rbf', C=C[i], gamma=gamma[j])
        clf_blanco3.fit(X_blanco_train, Y_blanco_train)
        y_blanco_predict3 = clf_blanco3.predict(X_blanco_test)
        blanco_accuracy3=accuracy_score(Y_blanco_test, y_blanco_predict3, normalize=True)
        Acc_val.append(blanco_accuracy3)
        Acc_tabla.loc[k] = ['Blanco',C[i],gamma[j],blanco_accuracy3]
        k+=1

Acc_val=[]
#k=WineAcc.shape[0]
#Red Wine
for i in range(len(C)):
    for j in range(len(gamma)):
        clf_rojo3 = SVC(kernel='rbf', C=C[i], gamma=gamma[j])
        clf_rojo3.fit(X_rojo_train, Y_rojo_train)
        y_rojo_predict3 = clf_rojo3.predict(X_rojo_test)
        rojo_accuracy3=accuracy_score(Y_rojo_test, y_rojo_predict3, normalize=True)
        Acc_val.append(rojo_accuracy3)
        Acc_tabla.loc[k] = ['Rojo',C[i],gamma[j],rojo_accuracy3]
        k+=1

print(Acc_tabla[Acc_tabla['Accuracy']==np.max(Acc_tabla['Accuracy'])])

    type       C  gamma1  Accuracy
27  Rojo  1000.0    0.01    0.6875


# Exercise 6.5

Compare the results with other methods

In [106]:
from sklearn.linear_model import LogisticRegression
logreg_blanco = LogisticRegression(C=1e9, solver='liblinear')
logreg_blanco.fit(X_blanco_train, Y_blanco_train)
y_blanco_predict4 = logreg_blanco.predict(X_blanco_test)
blanco_accuracy4=accuracy_score(Y_blanco_test, y_blanco_predict4, normalize=True)

logreg_rojo = LogisticRegression(C=1e9, solver='liblinear')
logreg_rojo.fit(X_rojo_train, Y_rojo_train)
y_rojo_predict4 = logreg_rojo.predict(X_rojo_test)
rojo_accuracy4=accuracy_score(Y_rojo_test, y_rojo_predict4, normalize=True)
print('Blanco_Acc_reglog:',blanco_accuracy4,'\n','Rojo_Acc_reglog:',rojo_accuracy4)

Blanco_Acc_reglog: 0.7414965986394558 
 Rojo_Acc_reglog: 0.7291666666666666


For both wine types, logistic regression performs considerably better than the SVM: in both cases the accuracy of the regression is higher than that of any of the parameter combinations tried for the SVM.

# Regularization

# Exercise 6.6


* Train a linear regression to predict wine quality (Continuous)

* Analyze the coefficients

* Evaluate the RMSE

In [109]:
from sklearn.linear_model import LinearRegression
linreg_blanco = LinearRegression(normalize=True)
linreg_blanco.fit(X_blanco_train, Y_blanco_train)
y_blanco_predict5 = linreg_blanco.predict(X_blanco_test)
print('Coeficientes:',linreg_blanco.coef_)
from sklearn import metrics
print('RMSE=',np.sqrt(metrics.mean_squared_error(Y_blanco_test, y_blanco_predict5)))

Coeficientes: [  -0.27102082 -112.29722231   -6.47547133    1.45233976  -60.89455299
    2.19863622    5.71887039 -212.79058155   18.7232475    24.93145285
   17.51296378]
RMSE= 0.4169574337986386


In [110]:
from sklearn.linear_model import LinearRegression
linreg_rojo = LinearRegression(normalize=True)
linreg_rojo.fit(X_rojo_train, Y_rojo_train)
y_rojo_predict5 = linreg_rojo.predict(X_rojo_test)
print('Coeficientes:',linreg_rojo.coef_)
from sklearn import metrics
print('RMSE=',np.sqrt(metrics.mean_squared_error(Y_rojo_test, y_rojo_predict5)))

Coeficientes: [  1.54639859 -18.00119284  -6.60698737   0.05709231 -11.39380057
   1.29406515   2.17824585 -37.43797798  -5.8826721   16.46610955
   6.38207564]
RMSE= 0.4362365794262937


# Exercise 6.7

* Estimate a ridge regression with alpha equals 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

**Estimate a ridge regression with alpha equals 0.1 and 1.**

In [111]:
from sklearn.linear_model import Ridge
alpha=[0.1, 1]
cols=['type','alpha','RMSE']
Blanco_reg_ridge=pd.DataFrame(columns=cols,data=[])
k=0
for i in range(len(alpha)):
    Blanco_regridge = Ridge(alpha=alpha[i], normalize=True)
    Blanco_regridge.fit(X_blanco_train, Y_blanco_train)
    y_blanco_predict6 = Blanco_regridge.predict(X_blanco_test)
    RMSE=np.sqrt(metrics.mean_squared_error(Y_blanco_test, y_blanco_predict6))
    Blanco_reg_ridge.loc[k] = ['white',alpha[i],RMSE]
    k+=1
Blanco_reg_ridge

Unnamed: 0,type,alpha,RMSE
0,white,0.1,0.434159
1,white,1.0,0.454102


In [112]:
Rojo_reg_ridge=pd.DataFrame(columns=cols,data=[])
k=0
for i in range(len(alpha)):
    Rojo_regridge = Ridge(alpha=alpha[i], normalize=True)
    Rojo_regridge.fit(X_rojo_train, Y_rojo_train)
    y_rojo_predict6 = Rojo_regridge.predict(X_rojo_test)
    RMSE=np.sqrt(metrics.mean_squared_error(Y_rojo_test, y_rojo_predict6))
    Rojo_reg_ridge.loc[k] = ['red',alpha[i],RMSE]
    k+=1
Rojo_reg_ridge

Unnamed: 0,type,alpha,RMSE
0,red,0.1,0.444994
1,red,1.0,0.466077


**Compare the coefficients with the linear regression**

In [116]:
print('Coeficientes_RegLin_blanco:',linreg_blanco.coef_)
print('Coeficientes_RegRidge_blanco:',Blanco_regridge.coef_)

Coeficientes_RegLin_blanco: [  -0.27102082 -112.29722231   -6.47547133    1.45233976  -60.89455299
    2.19863622    5.71887039 -212.79058155   18.7232475    24.93145285
   17.51296378]
Coeficientes_RegRidge_blanco: [  -0.42181962  -29.40727441    3.11687254   -0.23325702 -158.84817625
    0.42259903   -0.8118267     1.89182679    1.24307775   12.11433891
    1.15268509]


For white wine, most of the Ridge regression coefficients are much smaller in magnitude than those of the linear regression; the fifth coefficient (chlorides) is an exception.

In [117]:
print('Coeficientes_RegLin_rojo:',linreg_rojo.coef_)
print('Coeficientes_RegRidge_rojo:',Rojo_regridge.coef_)

Coeficientes_RegLin_rojo: [  1.54639859 -18.00119284  -6.60698737   0.05709231 -11.39380057
   1.29406515   2.17824585 -37.43797798  -5.8826721   16.46610955
   6.38207564]
Coeficientes_RegRidge_rojo: [ 0.10634099 -5.57172943  5.6763286   0.15540337 -8.78134652  0.3948138
 -0.13807881  0.30798622  0.07755105  4.39212675  0.20967816]


For red wine, no general statement can be made about whether the coefficients are larger or smaller from one regression to the other, because some are larger in the linear regression and others are larger in the Ridge regression.

**Evaluate the RMSE**

In [114]:
# select the best alpha with RidgeCV
from sklearn.linear_model import RidgeCV
ridgeregcv = RidgeCV(alphas=alpha, normalize=True, scoring='neg_mean_squared_error')
ridgeregcv.fit(X_blanco_train, Y_blanco_train)
ridgeregcv.alpha_
print(Blanco_reg_ridge[Blanco_reg_ridge['alpha']==ridgeregcv.alpha_])

    type  alpha      RMSE
0  white    0.1  0.434159


In [115]:
# select the best alpha with RidgeCV
from sklearn.linear_model import RidgeCV
ridgeregcv2 = RidgeCV(alphas=alpha, normalize=True, scoring='neg_mean_squared_error')
ridgeregcv2.fit(X_rojo_train, Y_rojo_train)
ridgeregcv2.alpha_
print(Rojo_reg_ridge[Rojo_reg_ridge['alpha']==ridgeregcv2.alpha_])

  type  alpha      RMSE
0  red    0.1  0.444994


For both wine types, the best model according to the RMSE is the one with alpha=0.1.
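
The cells above rely on `normalize=True`, an argument that recent scikit-learn releases no longer accept. One way to get a comparable workflow is to put a scaler and the estimator in a pipeline. The sketch below assumes `StandardScaler` as the scaler and uses synthetic data; it is not numerically identical to `normalize=True`, so the alpha values are not directly comparable with the ones reported here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features and target (not the real data).
X, y = make_regression(n_samples=200, n_features=11, noise=10.0, random_state=0)

# The scaler is learned in fit() and reapplied in predict(),
# so data passed to predict() never influences the scaling.
model = make_pipeline(StandardScaler(), RidgeCV(alphas=[0.1, 1.0]))
model.fit(X, y)

# make_pipeline names each step after its lowercased class name.
best_alpha = model.named_steps["ridgecv"].alpha_
print(best_alpha)
```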

# Exercise 6.8

* Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

**Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.**

In [123]:
from sklearn.linear_model import Lasso
alpha=[0.01,0.1, 1]
cols=['type','alpha','RMSE']
Blanco_reg_lasso=pd.DataFrame(columns=cols,data=[])
k=0
for i in range(len(alpha)):
    Blanco_reglasso = Lasso(alpha=alpha[i], normalize=True)
    Blanco_reglasso.fit(X_blanco_train, Y_blanco_train)
    y_blanco_predict7 = Blanco_reglasso.predict(X_blanco_test)
    RMSE=np.sqrt(metrics.mean_squared_error(Y_blanco_test, y_blanco_predict7))
    Blanco_reg_lasso.loc[k] = ['white',alpha[i],RMSE]
    k+=1
Blanco_reg_lasso

Unnamed: 0,type,alpha,RMSE
0,white,0.01,0.469779
1,white,0.1,0.469779
2,white,1.0,0.469779


In [124]:
Rojo_reg_lasso=pd.DataFrame(columns=cols,data=[])
k=0
for i in range(len(alpha)):
    Rojo_reglasso = Lasso(alpha=alpha[i], normalize=True)
    Rojo_reglasso.fit(X_rojo_train, Y_rojo_train)
    y_rojo_predict7 = Rojo_reglasso.predict(X_rojo_test)
    RMSE=np.sqrt(metrics.mean_squared_error(Y_rojo_test, y_rojo_predict7))
    Rojo_reg_lasso.loc[k] = ['red',alpha[i],RMSE]
    k+=1
Rojo_reg_lasso

Unnamed: 0,type,alpha,RMSE
0,red,0.01,0.498608
1,red,0.1,0.498608
2,red,1.0,0.498608


**Compare the coefficients with the linear regression**

In [127]:
print('Coeficientes_RegLin_blanco:',linreg_blanco.coef_)
print('Coeficientes_RegLasso_blanco:',Blanco_reglasso.coef_)
print('Intercepto_RegLasso_blanco:',Blanco_reglasso.intercept_)

Coeficientes_RegLin_blanco: [  -0.27102082 -112.29722231   -6.47547133    1.45233976  -60.89455299
    2.19863622    5.71887039 -212.79058155   18.7232475    24.93145285
   17.51296378]
Coeficientes_RegLasso_blanco: [ 0. -0.  0. -0. -0.  0. -0.  0.  0.  0.  0.]
Intercepto_RegLasso_blanco: 0.662485414235706


In this case none of the coefficients contributes to the Lasso regression: they are all zero, so the whole prediction comes from the intercept.

In [128]:
print('Coeficientes_RegLin_rojo:',linreg_rojo.coef_)
print('Coeficientes_RegLasso_rojo:',Rojo_reglasso.coef_)
print('Intercepto_RegLasso_rojo:',Rojo_reglasso.intercept_)

Coeficientes_RegLin_rojo: [  1.54639859 -18.00119284  -6.60698737   0.05709231 -11.39380057
   1.29406515   2.17824585 -37.43797798  -5.8826721   16.46610955
   6.38207564]
Coeficientes_RegLasso_rojo: [ 0. -0.  0.  0.  0.  0. -0.  0.  0.  0.  0.]
Intercepto_RegLasso_rojo: 0.5335120643431636


As with white wine, every Lasso coefficient for red wine is zero, so all of them are smaller in magnitude than the linear regression coefficients and the whole prediction comes from the intercept.

**Evaluate the RMSE**

In [130]:
print(Blanco_reg_lasso)
print(Rojo_reg_lasso)

    type  alpha      RMSE
0  white   0.01  0.469779
1  white   0.10  0.469779
2  white   1.00  0.469779
  type  alpha      RMSE
0  red   0.01  0.498608
1  red   0.10  0.498608
2  red   1.00  0.498608


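When evaluating the RMSE for this Lasso regression there is no need to choose a best alpha, because every value of alpha gives the same RMSE for each wine type. This is consistent with the coefficients shown above: they are all zero, so the model predicts the same constant (the intercept) whatever the alpha.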
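
An RMSE that does not change with alpha is what one would expect if every alpha tried already lies above the threshold at which Lasso sets all coefficients to zero. Whether that is what happened on the wine data is an assumption. The sketch below illustrates the threshold on synthetic data, using scikit-learn's Lasso objective, for which the threshold is `max|Xc^T yc| / n_samples` on centered data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in data (not the wine data).
X, y = make_regression(n_samples=200, n_features=11, noise=5.0, random_state=0)

# With an intercept, Lasso works on centered data; the smallest alpha that
# zeroes every coefficient is max|Xc^T yc| / n_samples.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
alpha_max = np.max(np.abs(Xc.T @ yc)) / len(y)

above = Lasso(alpha=1.01 * alpha_max).fit(X, y)
below = Lasso(alpha=0.5 * alpha_max).fit(X, y)

print(np.count_nonzero(above.coef_))   # count of surviving features above the threshold
print(np.count_nonzero(below.coef_))   # count of surviving features below the threshold
print(np.isclose(above.intercept_, y.mean()))
```

Above the threshold the fitted model reduces to the mean of the training target, which is why its error no longer depends on alpha.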

# Exercise 6.9

* Create a binary target

* Train a logistic regression to predict wine quality (binary)

* Analyze the coefficients

* Evaluate the f1score

**Train a logistic regression to predict wine quality (binary)**

In [238]:
from sklearn.metrics import classification_report, f1_score
logreg_blanco1 = LogisticRegression(C=1e9, solver='liblinear')
logreg_blanco1.fit(X_blanco_train, Y_blanco_train)
y_blanco_predict4 = logreg_blanco1.predict(X_blanco_test)
blanco_accuracy4=accuracy_score(Y_blanco_test, y_blanco_predict4, normalize=True)

logreg_rojo1 = LogisticRegression(C=1e9, solver='liblinear')
logreg_rojo1.fit(X_rojo_train, Y_rojo_train)
y_rojo_predict4 = logreg_rojo1.predict(X_rojo_test)
rojo_accuracy4=accuracy_score(Y_rojo_test, y_rojo_predict4, normalize=True)
print('Blanco_Acc_reglog:',blanco_accuracy4,'\n','Rojo_Acc_reglog:',rojo_accuracy4)

Blanco_Acc_reglog: 0.7414965986394558 
 Rojo_Acc_reglog: 0.7291666666666666


**Analyze the coefficients**

In [239]:
print('Coeficientes_RegLog_blanco:',logreg_blanco1.coef_)

Coeficientes_RegLog_blanco: [[  -19.88495194  -789.10087748    -1.11983376     8.95195832
   -161.36534433    12.53174856    35.01199787 -1027.08425218
     23.21045777   171.35069683   120.19785683]]


In [240]:
print('Coeficientes_RegLog_rojo:',logreg_rojo1.coef_)


Coeficientes_RegLog_rojo: [[   6.90032203 -119.7205067   -45.67074047    0.88240723 -109.09099471
     6.52305163   12.4536095  -157.59665331  -55.12641732  112.34507775
    39.43012093]]


**Evaluate the f1score**

In [241]:
print(classification_report(Y_blanco_test, y_blanco_predict4))

             precision    recall  f1-score   support

          0       0.65      0.46      0.54       483
          1       0.77      0.88      0.82       987

avg / total       0.73      0.74      0.73      1470



In [242]:
print(classification_report(Y_rojo_test, y_rojo_predict4))

             precision    recall  f1-score   support

          0       0.68      0.78      0.73       222
          1       0.79      0.68      0.73       258

avg / total       0.74      0.73      0.73       480
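
The f1-score column in these reports is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick check, the sketch below recomputes it from the rounded precision and recall printed in the white-wine report (`f1_from` is a hypothetical helper, not a scikit-learn function):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as printed in the white-wine report.
f1_class0 = f1_from(0.65, 0.46)
f1_class1 = f1_from(0.77, 0.88)

print(round(f1_class0, 2), round(f1_class1, 2))
```

Because the inputs are already rounded, a recomputed value can in general differ from the report in the last digit.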



# Exercise 6.10

* Estimate a regularized logistic regression using:
* C = 0.01, 0.1 & 1.0
* penalty = ['l1', 'l2']
* Compare the coefficients and the f1score

In [243]:
C=[0.01,0.1, 1]
penalty=['l1','l2']
cols=['wine','C','penalty','F1']
WWine_log_reg=pd.DataFrame(columns=cols,data=[])
k=0
for i in range(len(C)):
    for j in range(len(penalty)):
        logreg_blanco = LogisticRegression(C=C[i], penalty=penalty[j], solver='liblinear')
        logreg_blanco.fit(X_blanco_train, Y_blanco_train)
        y_blanco_predict8 = logreg_blanco.predict(X_blanco_test)
        blanco_accuracy8=f1_score(np.array(Y_blanco_test, dtype='float'), np.array(y_blanco_predict8, dtype='float'))
        WWine_log_reg.loc[k] = ['white',C[i],penalty[j],blanco_accuracy8]
        k+=1
WWine_log_reg

Unnamed: 0,wine,C,penalty,F1
0,white,0.01,l1,0.803419
1,white,0.01,l2,0.803419
2,white,0.1,l1,0.803419
3,white,0.1,l2,0.803419
4,white,1.0,l1,0.791064
5,white,1.0,l2,0.803169


In [244]:
C=[0.01,0.1, 1]
penalty=['l1','l2']
cols=['wine','C','penalty','F1']
WWine_log_reg=pd.DataFrame(columns=cols,data=[])
k=0
for i in range(len(C)):
    for j in range(len(penalty)):
        logreg_rojo = LogisticRegression(C=C[i], penalty=penalty[j], solver='liblinear')
        logreg_rojo.fit(X_rojo_train, Y_rojo_train)
        y_rojo_predict8 = logreg_rojo.predict(X_rojo_test)
        rojo_accuracy8=f1_score(np.array(Y_rojo_test, dtype='float'), np.array(y_rojo_predict8, dtype='float'))
        WWine_log_reg.loc[k] = ['Rojo',C[i],penalty[j],rojo_accuracy8]

        k+=1
WWine_log_reg


Unnamed: 0,wine,C,penalty,F1
0,Rojo,0.01,l1,0.0
1,Rojo,0.01,l2,0.699187
2,Rojo,0.1,l1,0.683502
3,Rojo,0.1,l2,0.679376
4,Rojo,1.0,l1,0.678899
5,Rojo,1.0,l2,0.690647


Unlike white wine, for red wine every result differs as the regression parameters change; here the best F1-score is obtained with C=0.01 and the l2 penalty, whereas for white wine the first four combinations give the same F1-score.
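
The F1-score of 0.0 for red wine with C=0.01 and the l1 penalty may be due to the strong l1 penalty driving the coefficients to zero, leaving a model that predicts a single class. This is an assumption and was not checked on the wine data. The sketch below shows the general effect on synthetic data by counting non-zero coefficients under each penalty:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the wine data).
X, y = make_classification(n_samples=300, n_features=11, random_state=0)

nonzero = {}
for penalty in ("l1", "l2"):
    # Small C means strong regularization for both penalties.
    clf = LogisticRegression(C=0.01, penalty=penalty, solver="liblinear")
    clf.fit(X, y)
    nonzero[penalty] = int(np.count_nonzero(clf.coef_))
    print(penalty, nonzero[penalty], "non-zero coefficients")
```

The l1 penalty can set coefficients exactly to zero, while l2 only shrinks them, so the l1 count is never expected to exceed the l2 count.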

In [246]:
logreg_blanco2 = LogisticRegression(C=0.01,penalty='l2' , solver='liblinear')
logreg_blanco2.fit(X_blanco_train, Y_blanco_train)
y_blanco_predict9 = logreg_blanco2.predict(X_blanco_test)


logreg_rojo2 = LogisticRegression(C=0.01,penalty='l2', solver='liblinear')
logreg_rojo2.fit(X_rojo_train, Y_rojo_train)
y_rojo_predict9 = logreg_rojo2.predict(X_rojo_test)

print('Coeficientes_RegLog_Blanco:',logreg_blanco1.coef_)
print('Coeficientes_RegLog_Blanco2:',logreg_blanco2.coef_)

Coeficientes_RegLog_Blanco: [[  -19.88495194  -789.10087748    -1.11983376     8.95195832
   -161.36534433    12.53174856    35.01199787 -1027.08425218
     23.21045777   171.35069683   120.19785683]]
Coeficientes_RegLog_Blanco2: [[ 2.86945911e-02 -1.03031391e-03  2.10508504e-03 -8.32462188e-03
  -1.99765636e-04  3.26336246e-01  2.37748524e-01  5.96623367e-03
   2.12961643e-02  3.93924603e-03  1.22924597e-01]]


In [247]:
print('Coeficientes_Reglog_rojo:',logreg_rojo1.coef_)
print('Coeficientes_Reglog_rojo:',logreg_rojo2.coef_)

Coeficientes_Reglog_rojo: [[   6.90032203 -119.7205067   -45.67074047    0.88240723 -109.09099471
     6.52305163   12.4536095  -157.59665331  -55.12641732  112.34507775
    39.43012093]]
Coeficientes_Reglog_rojo: [[ 0.13882872 -0.00311358  0.01045369  0.03243711  0.00055912  0.11061618
  -0.07986812  0.01195043  0.03787738  0.01350831  0.18768045]]


Comparing the logistic regression fitted first (C=1e9, effectively unregularized) with the one fitted using the parameters that gave the best results, the coefficients change for the variables, indicating more or less weight in each regression. The regularized coefficients are orders of magnitude smaller, but the changes in relative weight and sign do not follow a general pattern.