
## Workshop: Regularization and SVM


For this assignment we consider a set of observations on a number of red and white wine varieties, covering their chemical properties and their ratings by tasters. The wine industry has grown considerably in recent times as social drinking has increased. The price of wine depends in part on a rather abstract notion of wine appreciation by tasters, whose opinions can vary widely. Another key factor in wine certification and quality assessment is physicochemical testing, which is carried out in the laboratory and takes into account factors such as acidity, pH level, the presence of sugar, and other chemical properties. For the wine market, it would be valuable to relate the quality perceived by tasters to the chemical properties of the wine, so that the certification and quality assessment process is better controlled.

Two datasets are available: one for red wine, with 1,599 varieties, and one for white wine, with 4,898 varieties. All the wines are produced in a specific area of Portugal. Data are collected on 12 properties of the wines: one is quality, based on sensory data, and the rest are chemical properties such as density, acidity, and alcohol content. All the chemical properties are continuous variables. Quality is an ordinal variable with possible ratings from 1 (worst) to 10 (best). Each variety is tasted by three independent tasters, and the final rating assigned is the median of their ratings.

A predictive model built from these data is expected to guide vineyards on the quality and price they can expect for their products, without relying heavily on the volatility of tasters.


In [18]:
import pandas as pd
import numpy as np

In [24]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')

In [26]:
data_r.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [20]:
data = data_w.assign(type = 'white')

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
data = pd.concat([data, data_r.assign(type = 'red')], ignore_index=True)
data.sample(5)


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
4183,6.2,0.3,0.3,2.5,0.041,29.0,82.0,0.99065,3.31,0.61,11.8,7,white
5614,8.0,0.43,0.36,2.3,0.075,10.0,48.0,0.9976,3.34,0.46,9.4,5,red
2740,6.3,0.2,0.19,12.3,0.048,54.0,145.0,0.99668,3.16,0.42,9.3,6,white
4434,7.2,0.24,0.24,1.7,0.045,18.0,161.0,0.99196,3.25,0.53,11.2,6,white
4321,6.3,0.28,0.22,11.5,0.036,27.0,150.0,0.99445,3.0,0.33,10.6,6,white


In [21]:
data.quality.value_counts()

6    2836
5    2138
7    1079
4     216
8     193
3      30
9       5
Name: quality, dtype: int64

In [22]:
data.type.value_counts()

white    4898
red      1599
Name: type, dtype: int64

# Exercise 1

Show the frequency table of quality by wine type.

In [11]:
(data 
  .groupby("type")
  .quality.value_counts())

type   quality
red    5           681
       6           638
       7           199
       4            53
       8            18
       3            10
white  6          2198
       5          1457
       7           880
       8           175
       4           163
       3            20
       9             5
Name: quality, dtype: int64

In [14]:
pd.crosstab(index=data["type"],     
                      columns=[data["quality"]])

type \ quality,3,4,5,6,7,8,9
red,10,53,681,638,199,18,0
white,20,163,1457,2198,880,175,5


# Regularization

# Exercise 2

* Train a linear regression to predict wine quality (continuous)

* Analyze the coefficients

* Evaluate the RMSE

In [None]:
data['type'] = data.type.map({'red':0, 'white':1})

In [None]:
y = data['quality'].values
X = data[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'type']].values

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [None]:
print(linreg.coef_)

[ 9.76223931e-02 -1.55047309e+00 -1.36418739e-01  6.67473085e-02
 -7.67939579e-01  3.99814849e-03 -1.05686554e-03 -1.13045446e+02
  5.15890499e-01  7.01081951e-01  2.09971675e-01 -4.24352933e-01]
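The raw coefficient array is hard to read without the column names. The following is a minimal sketch of pairing each coefficient with its feature through a pandas Series. It assumes synthetic stand-in data and hypothetical feature names (`feat_a`, `feat_b`, `feat_c`) rather than the wine table.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): three features with known effects.
rng = np.random.default_rng(0)
feature_names = ["feat_a", "feat_b", "feat_c"]  # hypothetical names
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X_demo, y_demo)

# Index the coefficients by feature name and order them by absolute size.
coefs = pd.Series(model.coef_, index=feature_names)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs)
```

On the wine data the same idea would use the list of column names passed to `X`. Because the features are on very different scales (density varies only in the third decimal place), the magnitudes of the unscaled coefficients are not directly comparable.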


In [None]:
y_pred = linreg.predict(X_test)

In [None]:
from sklearn import metrics
import numpy as np
print("RMSE: ",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

RMSE:  0.7176907067288529


# Exercise 3

* Fit a Ridge regression with alpha equal to 0.1 and 1.
* Compare and analyze the Ridge coefficients against the linear regression
* Evaluate the RMSE

In [None]:
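from sklearn.linear_model import Ridge
# Note: the normalize argument was removed in scikit-learn 1.2, so this call needs an older release
ridgereg = Ridge(alpha=0.1, normalize=True)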
ridgereg.fit(X_train, y_train)
print(ridgereg.coef_)

[ 2.88566714e-02 -1.28227049e+00 -2.39357437e-02  2.97639425e-02
 -1.18663870e+00  3.80364255e-03 -1.28915306e-03 -3.82373057e+01
  2.08500233e-01  5.91549553e-01  2.50777308e-01 -1.56871332e-01]


In [None]:
y_pred = ridgereg.predict(X_test)


In [None]:
print("RMSE: ",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

RMSE:  0.7199798419799741


In [None]:
from sklearn.linear_model import Ridge
ridgereg = Ridge(alpha=1, normalize=True)
ridgereg.fit(X_train, y_train)
print(ridgereg.coef_)

[ 1.59777628e-03 -5.85134501e-01  1.48863414e-01  5.68749607e-03
 -1.27422144e+00  1.29552381e-03 -5.83083646e-04 -2.22851534e+01
  8.29715054e-02  3.00782372e-01  1.36293748e-01  4.29621834e-03]


In [None]:
y_pred = ridgereg.predict(X_test)

In [None]:
print("RMSE: ",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

RMSE:  0.7607146212084632


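COMPARISON AND ANALYSIS OF COEFFICIENTS

Both Ridge models shrink most coefficients relative to the linear regression; the density coefficient, for example, goes from about -113.0 to -38.2 with alpha = 0.1 and to -22.3 with alpha = 1. In terms of test RMSE, the better Ridge model is the one with alpha = 0.1 (0.7200 versus 0.7607 for alpha = 1), although neither improves on the unregularized linear regression (0.7177).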
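To illustrate the shrinkage effect in isolation, here is a hedged sketch on synthetic stand-in data (an assumption, not the wine table). It uses a `StandardScaler` plus `Ridge` pipeline, a swapped-in approach that is not numerically equivalent to `normalize=True`, so its alpha values are not directly comparable with the ones above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), with features on very different scales.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 0.01, 100.0])
y_demo = X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

coef_norms = {}
for alpha in [0.1, 1.0, 100.0, 10000.0]:
    # Scale inside the pipeline so the penalty treats all features alike.
    pipe = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    pipe.fit(X_demo, y_demo)
    coef = pipe.named_steps["ridge"].coef_
    coef_norms[alpha] = float(np.linalg.norm(coef))
    print(f"alpha={alpha:>8}: ||coef|| = {coef_norms[alpha]:.4f}")
```

The norm of the coefficient vector decreases as alpha grows, which is the pattern seen in the Ridge outputs of this exercise.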

# Exercise 4

* Fit a Lasso regression with alpha equal to 0.01, 0.1, and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [None]:
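from sklearn.linear_model import Lasso
# Note: alpha=0.001 is fitted here, although the exercise statement lists 0.01.
# The normalize argument was removed in scikit-learn 1.2, so this call needs an older release.
lassoreg = Lasso(alpha=0.001, normalize=True)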
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)

[ 0.         -0.88931213  0.          0.         -0.          0.
 -0.         -0.          0.          0.          0.25713438 -0.        ]


In [None]:
y_pred = lassoreg.predict(X_test)

In [None]:
print("RMSE: ",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

RMSE:  0.7456928791733419


In [None]:
from sklearn.linear_model import Lasso
lassoreg = Lasso(alpha=0.1, normalize=True)
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)

[-0. -0.  0. -0. -0.  0. -0. -0.  0.  0.  0.  0.]


In [None]:
y_pred = lassoreg.predict(X_test)

In [None]:
print("RMSE: ",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

RMSE:  0.8709345888926285


In [None]:
from sklearn.linear_model import Lasso
lassoreg = Lasso(alpha=1, normalize=True)
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)

[-0. -0.  0. -0. -0.  0. -0. -0.  0.  0.  0.  0.]


In [None]:
y_pred = lassoreg.predict(X_test)

In [None]:
print("RMSE: ",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

RMSE:  0.8709345888926285


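COMPARISON

In terms of test RMSE, the best Lasso model is the one with alpha = 0.001 (0.7457), the value fitted in place of the 0.01 listed in the exercise. It keeps only two nonzero coefficients, volatile acidity (-0.889) and alcohol (0.257). With alpha = 0.1 and alpha = 1 every coefficient is driven to zero, so both models predict a constant and share the same RMSE (0.8709). None of the Lasso fits improves on the linear regression (0.7177).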
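The behaviour described here, sparse coefficients at small alpha and an all-zero model at large alpha, can be reproduced with a small sketch. It assumes synthetic stand-in data and illustrative alpha values that are not the ones used in the exercise.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in data (assumption): only the first two features matter.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 + 1.5 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=300)

nonzero_counts = {}
for alpha in [0.01, 0.1, 10.0]:
    lasso = Lasso(alpha=alpha).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.count_nonzero(lasso.coef_))
    print(f"alpha={alpha}: {nonzero_counts[alpha]} nonzero coefficients")

# With every coefficient at zero, only the intercept is left,
# so the model predicts the training mean for any input.
null_model = Lasso(alpha=10.0).fit(X_demo, y_demo)
preds = null_model.predict(X_demo[:5])
print(preds, y_demo.mean())
```

A model whose coefficients are all zero carries no information from the features, which explains why two different alpha values can end up with identical RMSE.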

# Exercise 5

* Standardize the features (except wine quality)

* Create a binary target variable for each wine type

* Analyze the coefficients

* Evaluate with F1, AUC-ROC, and log-loss

In [27]:
data_r = data_r.assign(type = 'red')
data_w = data_w.assign(type = 'white')

In [28]:
X_r = data_r.drop(['quality', 'type'], axis = 1)
y_r = data_r['quality']
from sklearn.preprocessing import LabelEncoder
labelencoder_y_r = LabelEncoder()
y_r = labelencoder_y_r.fit_transform(y_r)
y_r

X_w = data_w.drop(['quality', 'type'], axis = 1)
y_w = data_w['quality']
from sklearn.preprocessing import LabelEncoder
labelencoder_y_w = LabelEncoder()
y_w = labelencoder_y_w.fit_transform(y_w)
y_w

array([3, 3, 3, ..., 3, 4, 3])

In [32]:
from sklearn.model_selection import train_test_split
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_r, y_r, test_size = 0.3, random_state = 0)
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(X_w, y_w, test_size = 0.3, random_state = 0)

In [33]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_r = sc.fit_transform(X_train_r)
X_test_r = sc.transform(X_test_r)

X_train_w = sc.fit_transform(X_train_w)
X_test_w = sc.transform(X_test_w)

In [34]:
from sklearn.svm import SVC
classifier_r = SVC(kernel = 'linear', random_state = 0)
classifier_r.fit(X_train_r, y_train_r)

classifier_w = SVC(kernel = 'linear', random_state = 0)
classifier_w.fit(X_train_w, y_train_w)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False)
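The exercise asks for a binary target and for F1, AUC-ROC, and log-loss, whereas the cells above keep the multi-class quality labels. Below is a minimal sketch of the binary workflow. It assumes synthetic stand-in data and a hypothetical cut-off of quality >= 7, since the exercise does not state a threshold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one wine table (assumption): 3 features, integer scores 3..9.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(1000, 3))
latent = 6 + 1.2 * X_demo[:, 0] - 0.8 * X_demo[:, 1] + rng.normal(scale=0.7, size=1000)
quality_demo = np.clip(np.rint(latent), 3, 9).astype(int)

# Hypothetical cut-off: the exercise does not state one.
y_bin = (quality_demo >= 7).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_bin, test_size=0.3, random_state=0, stratify=y_bin)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression().fit(scaler.transform(X_tr), y_tr)

proba = clf.predict_proba(scaler.transform(X_te))[:, 1]  # P(class 1)
pred = (proba >= 0.5).astype(int)

scores = {
    "f1": f1_score(y_te, pred),
    "auc_roc": roc_auc_score(y_te, proba),
    "log_loss": log_loss(y_te, proba),
}
print(scores)
```

F1 is computed from hard predictions, while AUC-ROC and log-loss need the predicted probabilities, so the sketch keeps both.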

# Exercise 6

* Fit a regularized logistic regression using:
* C = 0.01, 0.1 & 1.0
* penalty = ['l1', 'l2']
* Compare the coefficients and the F1 score.

Note: for C and penalty, run every possible combination of the two.

In [47]:
from sklearn.linear_model import LogisticRegression
logreg_r_001l1 = LogisticRegression(C=0.01, penalty='l1',solver='liblinear')
logreg_r_001l1.fit(X_train_r, y_train_r)
y_pred_r_001l1 = logreg_r_001l1.predict(X_test_r)

logreg_r_001l2 = LogisticRegression(C=0.01, penalty='l2',solver='liblinear')
logreg_r_001l2.fit(X_train_r, y_train_r)
y_pred_r_001l2 = logreg_r_001l2.predict(X_test_r)
print(logreg_r_001l1.coef_)
print("\n")
print(logreg_r_001l2.coef_)

[[ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.        ]
 [ 0.          0.02751837  0.          0.          0.          0.
   0.09568896  0.          0.          0.         -0.44191946]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.        ]
 [ 0.         -0.03036697  0.          0.          0.          0.
   0.          0.          0.          0.          0.21376175]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.        ]]


[[ 1.08784905e-02  6.15731737e-02 -7.62351654e-03 -1.54797090e-03
   2.46519675e-02 -2.66752597e-04 -2.01016674e-02 -1.77908080e-03
   1.98174666e-02 -7.06734295e-03  2.23015751e-03]
 [ 1.58198935e-03  1.28999683e-01 -7.857

In [48]:
logreg_r_01l1 = LogisticRegression(C=0.1, penalty='l1',solver='liblinear')
logreg_r_01l1.fit(X_train_r, y_train_r)
y_pred_r_01l1 = logreg_r_01l1.predict(X_test_r)

logreg_r_01l2 = LogisticRegression(C=0.1, penalty='l2',solver='liblinear')
logreg_r_01l2.fit(X_train_r, y_train_r)
y_pred_r_01l2 = logreg_r_01l2.predict(X_test_r)
print(logreg_r_01l1.coef_)
print("\n")
print(logreg_r_01l2.coef_)

[[ 0.          0.338891    0.          0.          0.          0.
   0.          0.          0.          0.          0.        ]
 [ 0.          0.495135    0.          0.          0.          0.
  -0.02989183  0.          0.06850841  0.          0.        ]
 [-0.06199971  0.24579868  0.         -0.03902332  0.10349935 -0.06518376
   0.49237662  0.          0.         -0.37105946 -0.81577781]
 [ 0.         -0.16697425 -0.06561349  0.         -0.01536912  0.0839927
  -0.3041385   0.01434631  0.          0.11882304  0.11813057]
 [ 0.11051327 -0.68115963  0.          0.05757191 -0.10423332  0.
  -0.17450709  0.          0.          0.32774722  0.77428364]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.42802444]]


[[ 0.09440392  0.28676027 -0.01596053 -0.00751322  0.12222222  0.02716955
  -0.11661408 -0.023818    0.11399427 -0.03407456  0.00377476]
 [ 0.07552104  0.42438916  0.05857471  0.11176416  0.09934833 -0.003646

In [49]:
logreg_r_1l1 = LogisticRegression(C=1, penalty='l1',solver='liblinear')
logreg_r_1l1.fit(X_train_r, y_train_r)
y_pred_r_1l1 = logreg_r_1l1.predict(X_test_r)


logreg_r_1l2 = LogisticRegression(C=1, penalty='l2',solver='liblinear')
logreg_r_1l2.fit(X_train_r, y_train_r)
y_pred_r_1l2 = logreg_r_1l2.predict(X_test_r)
print(logreg_r_1l1.coef_)
print("\n")
print(logreg_r_1l2.coef_)

[[ 0.          1.00343573 -0.14025681  0.          0.43310995  0.15500996
  -0.60294994  0.13941284  0.33217488  0.          0.        ]
 [ 0.07662218  0.68862998  0.12217693  0.20406729  0.14606722  0.00125185
  -0.39742741 -0.18115624  0.44230093  0.         -0.32996579]
 [-0.37557104  0.26205554  0.07229582 -0.18075093  0.15146849 -0.16887544
   0.61154478  0.30058227 -0.08735271 -0.49763167 -0.73679462]
 [-0.14834662 -0.26862457 -0.23917244 -0.1268596  -0.04558692  0.2155485
  -0.43649837  0.32446368 -0.14439845  0.13539343  0.3096397 ]
 [ 0.65391634 -0.75313436  0.          0.35745704 -0.24828709 -0.00705381
  -0.31891284 -0.52661127  0.30768008  0.56271529  0.62982712]
 [-0.45306215  0.          0.52922522 -0.16134754 -0.45476999  0.
  -0.41174272 -0.15301146 -0.49334066  0.34510344  1.04470398]]


[[ 4.77466802e-01  7.07523551e-01 -1.84272725e-01 -6.50846804e-02
   4.43192275e-01  3.01693523e-01 -5.89170025e-01  6.77000724e-04
   4.74455923e-01 -1.49253601e-01 -9.75552996e-02]
 

In [50]:
logreg_w_001l1 = LogisticRegression(C=0.01, penalty='l1',solver='liblinear')
logreg_w_001l1.fit(X_train_w, y_train_w)
y_pred_w_001l1 = logreg_w_001l1.predict(X_test_w)


logreg_w_001l2 = LogisticRegression(C=0.01, penalty='l2',solver='liblinear')
logreg_w_001l2.fit(X_train_w, y_train_w)
y_pred_w_001l2 = logreg_w_001l2.predict(X_test_w)
print(logreg_w_001l1.coef_)
print("\n")
print(logreg_w_001l2.coef_)

[[ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.        ]
 [ 0.          0.01263959  0.          0.          0.          0.
   0.          0.          0.          0.          0.        ]
 [ 0.          0.26403383  0.          0.          0.          0.
   0.          0.          0.          0.         -0.72275365]
 [ 0.         -0.19894215  0.          0.          0.          0.
   0.          0.          0.          0.          0.        ]
 [ 0.         -0.08232186  0.          0.          0.          0.
   0.          0.          0.          0.          0.58715501]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.08956002]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.        ]]


[[ 3.65446137e-02  1.65151767e-02 -7.93767184e-03  3.34567699e-03
   2.43040575e-02  1.6885418

In [51]:
logreg_w_01l1 = LogisticRegression(C=0.1, penalty='l1',solver='liblinear')
logreg_w_01l1.fit(X_train_w, y_train_w)
y_pred_w_01l1 = logreg_w_01l1.predict(X_test_w)


logreg_w_01l2 = LogisticRegression(C=0.1, penalty='l2',solver='liblinear')
logreg_w_01l2.fit(X_train_w, y_train_w)
y_pred_w_01l2 = logreg_w_01l2.predict(X_test_w)
print(logreg_w_01l1.coef_)
print("\n")
print(logreg_w_01l2.coef_)

[[ 0.06890872  0.          0.          0.          0.          0.
   0.02998304  0.          0.          0.          0.        ]
 [ 0.02823946  0.47664712  0.         -0.43362226  0.         -0.73885293
   0.          0.20529698  0.          0.         -0.43479798]
 [ 0.0571254   0.41954337  0.04540148 -0.19752113  0.02595598 -0.11865738
   0.09719706  0.         -0.04404472 -0.13432828 -1.06798765]
 [-0.03931973 -0.33480939  0.          0.07778067 -0.00557109  0.00522217
   0.          0.         -0.02231102  0.00606755  0.09358997]
 [ 0.         -0.39176056 -0.07231864  0.18056806 -0.2899718   0.07491936
   0.         -0.07415108  0.09472467  0.12955386  0.85459978]
 [ 0.         -0.04610691  0.          0.14653173  0.          0.22590035
   0.          0.          0.09541404  0.          0.878318  ]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.        ]]


[[ 0.20752703  0.0774364  -0.05131946  0.02549177  0.1

In [52]:
logreg_w_1l1 = LogisticRegression(C=1, penalty='l1',solver='liblinear')
logreg_w_1l1.fit(X_train_w, y_train_w)
y_pred_w_1l1 = logreg_w_1l1.predict(X_test_w)


logreg_w_1l2 = LogisticRegression(C=1, penalty='l2',solver='liblinear')
logreg_w_1l2.fit(X_train_w, y_train_w)
y_pred_w_1l2 = logreg_w_1l2.predict(X_test_w)
print(logreg_w_1l1.coef_)
print("\n")
print(logreg_w_1l2.coef_)

[[ 0.83376981  0.17318554 -0.15104167  0.14581155  0.27548482  0.18785
   0.45075913  0.          0.080807    0.          0.53960137]
 [-0.1193914   0.51379746 -0.05331249 -1.47554647 -0.03896258 -0.89789427
  -0.09763956  1.68268087 -0.08721981 -0.09029182  0.        ]
 [-0.02891865  0.42983799  0.05532866 -0.50436916  0.0154752  -0.14031753
   0.11185708  0.44664535 -0.13369033 -0.18376712 -0.8955301 ]
 [-0.06292669 -0.34954976 -0.00988784  0.0711016  -0.01199187  0.02569783
  -0.02174604  0.04482222 -0.04956427  0.01983233  0.13155432]
 [ 0.32509365 -0.43242382 -0.10991767  1.03583953 -0.2933775   0.0803729
   0.02087049 -1.35562587  0.3550928   0.22569718  0.30463356]
 [ 0.22315608 -0.14934577  0.0463274   0.85455607 -0.02863665  0.35531441
  -0.13257272 -0.76507799  0.36365386  0.03778616  0.72667767]
 [ 1.0532889   0.          0.0605861   0.         -1.65306049  0.
   0.          0.          1.07217155 -0.04701793  0.84348114]]


[[ 0.70188166  0.17027977 -0.18509061  0.16239634 

Looking at the coefficients as a whole, the stronger the penalty (the smaller the value of C), the smaller the coefficients. In addition, with the l1 (lasso) penalty most coefficients are exactly 0 when the penalty is very strong (C = 0.01), whereas the l2 penalty shrinks the coefficients without setting them to zero.
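The six C and penalty combinations can also be run in one loop that records the sparsity, the size of the coefficients, and the F1 score for each. This is a sketch under stated assumptions: the data come from `make_classification` with a binary target (not the wine quality labels), and the `results` dictionary layout is hypothetical.

```python
from itertools import product

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem (assumption) with 11 features, 4 of them informative.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=4, n_redundant=0,
    n_classes=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

results = {}
# product() yields every (C, penalty) pair, as the exercise note requires.
for C, penalty in product([0.01, 0.1, 1.0], ["l1", "l2"]):
    clf = LogisticRegression(C=C, penalty=penalty, solver="liblinear").fit(X_tr, y_tr)
    results[(C, penalty)] = {
        "nonzero": int(np.count_nonzero(clf.coef_)),
        "l2_norm": float(np.linalg.norm(clf.coef_)),
        "f1": float(f1_score(y_te, clf.predict(X_te))),
    }
    print(C, penalty, results[(C, penalty)])
```

Collecting the results in one structure makes the comparison across penalties easier than reading six printed coefficient matrices.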

# SVM

# Exercise 7

Using the same variables as in the logistic regression:

* Create a binary target for each wine type
* Create two linear SVMs, one for the white wines and one for the red wines.


In [53]:
X_r = data_r.drop(['quality', 'type'], axis = 1)
y_r = data_r['quality']
from sklearn.preprocessing import LabelEncoder
labelencoder_y_r = LabelEncoder()
y_r = labelencoder_y_r.fit_transform(y_r)
y_r

X_w = data_w.drop(['quality', 'type'], axis = 1)
y_w = data_w['quality']
from sklearn.preprocessing import LabelEncoder
labelencoder_y_w = LabelEncoder()
y_w = labelencoder_y_w.fit_transform(y_w)
y_w

array([3, 3, 3, ..., 3, 4, 3])

In [55]:
from sklearn.model_selection import train_test_split
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_r, y_r, test_size = 0.3, random_state = 0)
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(X_w, y_w, test_size = 0.3, random_state = 0)

In [56]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_r = sc.fit_transform(X_train_r)
X_test_r = sc.transform(X_test_r)

X_train_w = sc.fit_transform(X_train_w)
X_test_w = sc.transform(X_test_w)

In [57]:
from sklearn.svm import SVC
classifier_r = SVC(kernel = 'linear', random_state = 0)
classifier_r.fit(X_train_r, y_train_r)

classifier_w = SVC(kernel = 'linear', random_state = 0)
classifier_w.fit(X_train_w, y_train_w)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False)

# Exercise 8

* Test the two SVMs using the different kernels ('poly', 'rbf', 'sigmoid')
* Evaluate with F1, AUC-ROC, and log-loss


In [59]:
from sklearn.model_selection import cross_val_score
classifier_r_rbf = SVC(kernel = 'rbf', random_state = 0)
classifier_r_rbf.fit(X_train_r, y_train_r)
y_pred_r_rbf = classifier_r_rbf.predict(X_test_r)

accuracies_r_rbf = cross_val_score(estimator = classifier_r_rbf, X = X_train_r,
                             y = y_train_r, cv = 10)

classifier_r_poly = SVC(kernel = 'poly', random_state = 0)
classifier_r_poly.fit(X_train_r, y_train_r)
y_pred_r_poly = classifier_r_poly.predict(X_test_r)

accuracies_r_poly = cross_val_score(estimator = classifier_r_poly, X = X_train_r,
                             y = y_train_r, cv = 10)
classifier_r_sig = SVC(kernel = 'sigmoid', random_state = 0)
classifier_r_sig.fit(X_train_r, y_train_r)
y_pred_r_sig = classifier_r_sig.predict(X_test_r)

accuracies_r_sig = cross_val_score(estimator = classifier_r_sig, X = X_train_r,
                             y = y_train_r, cv = 10)

# cross_val_score reports accuracy by default for classifiers
print("Accuracy of the SVM with rbf kernel on the red wine data: " + str(accuracies_r_rbf.mean()))
print("Accuracy of the SVM with poly kernel on the red wine data: " + str(accuracies_r_poly.mean()))
print("Accuracy of the SVM with sigmoid kernel on the red wine data: " + str(accuracies_r_sig.mean()))



Accuracy of the SVM with rbf kernel on the red wine data: 0.6050112612612614
Accuracy of the SVM with poly kernel on the red wine data: 0.5817728442728443
Accuracy of the SVM with sigmoid kernel on the red wine data: 0.5219031531531532


In [60]:
classifier_w_rbf = SVC(kernel = 'rbf', random_state = 0)
classifier_w_rbf.fit(X_train_w, y_train_w)
y_pred_w_rbf = classifier_w_rbf.predict(X_test_w)

accuracies_w_rbf = cross_val_score(estimator = classifier_w_rbf, X = X_train_w,
                             y = y_train_w, cv = 10)

classifier_w_poly = SVC(kernel = 'poly', random_state = 0)
classifier_w_poly.fit(X_train_w, y_train_w)
y_pred_w_poly = classifier_w_poly.predict(X_test_w)

accuracies_w_poly = cross_val_score(estimator = classifier_w_poly, X = X_train_w,
                             y = y_train_w, cv = 10)

classifier_w_sig = SVC(kernel = 'sigmoid', random_state = 0)
classifier_w_sig.fit(X_train_w, y_train_w)
y_pred_w_sig = classifier_w_sig.predict(X_test_w)

accuracies_w_sig = cross_val_score(estimator = classifier_w_sig, X = X_train_w,
                             y = y_train_w, cv = 10)
# cross_val_score reports accuracy by default for classifiers
print("Accuracy of the SVM with rbf kernel on the white wine data: " + str(accuracies_w_rbf.mean()))
print("Accuracy of the SVM with poly kernel on the white wine data: " + str(accuracies_w_poly.mean()))
print("Accuracy of the SVM with sigmoid kernel on the white wine data: " + str(accuracies_w_sig.mean()))



Accuracy of the SVM with rbf kernel on the white wine data: 0.5688413209895488
Accuracy of the SVM with poly kernel on the white wine data: 0.53178695036912
Accuracy of the SVM with sigmoid kernel on the white wine data: 0.43785825106985155
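The cells above report cross-validated accuracy, while the exercise asks for F1, AUC-ROC, and log-loss. One possible way to obtain all three per kernel is sketched below. It assumes a synthetic binary problem and wraps each SVM in `CalibratedClassifierCV` so that probability estimates exist for log-loss; that wrapper is a choice made for this sketch and is not part of the notebook.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary problem (assumption), not the wine data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=5, random_state=0)

scoring = {"f1": "f1", "auc_roc": "roc_auc", "neg_log_loss": "neg_log_loss"}
kernel_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    # Calibration wraps the SVM so predict_proba is available for log-loss;
    # scaling sits inside the pipeline so each CV fold is scaled on its own.
    model = make_pipeline(
        StandardScaler(),
        CalibratedClassifierCV(SVC(kernel=kernel), cv=3),
    )
    cv = cross_validate(model, X_demo, y_demo, cv=5, scoring=scoring)
    kernel_scores[kernel] = {name: float(cv[f"test_{name}"].mean()) for name in scoring}
    print(kernel, kernel_scores[kernel])
```

scikit-learn reports log-loss as `neg_log_loss`, so values closer to zero are better.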


# Exercise 9
* Using the best SVM, find the parameters that give the best performance over the following hyperparameters:
'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]
* Evaluate each model with F1, AUC-ROC, and log-loss

Note: for C and gamma, run every possible combination of the two.

In [68]:
from sklearn.model_selection import GridSearchCV
parameters_r = [{'C': [0.1,1, 10, 100, 1000], 'kernel': ['rbf'],
               'gamma': [0.01, 0.001, 0.0001]}]
grid_search_r = GridSearchCV(estimator = classifier_r_rbf,
                           param_grid = parameters_r,
                           scoring = 'accuracy',
                           cv = 10,)
grid_search_r.fit(X_train_r, y_train_r)
best_accuracy_r = grid_search_r.best_score_
best_parameters_r = grid_search_r.best_params_
print(best_accuracy_r)
print(best_parameters_r)



0.6059202059202059
{'C': 1000, 'gamma': 0.01, 'kernel': 'rbf'}


In [69]:
parameters_w = [{'C': [0.1,1, 10, 100, 1000], 'kernel': ['rbf'],
               'gamma': [0.01, 0.001, 0.0001]}]
grid_search_w = GridSearchCV(estimator = classifier_w_rbf,
                           param_grid = parameters_w,
                           scoring = 'accuracy',
                           cv = 10,)
grid_search_w.fit(X_train_w, y_train_w)
best_accuracy_w = grid_search_w.best_score_
best_parameters_w = grid_search_w.best_params_
print(best_accuracy_w)
print(best_parameters_w)



0.5644604709051542
{'C': 1000, 'gamma': 0.01, 'kernel': 'rbf'}
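The grid searches above select on accuracy. A hedged sketch of the same C and gamma grid scored with F1 and AUC-ROC follows; it assumes a synthetic binary problem and hypothetical step names (`scale`, `svc`). Log-loss is left out because a plain `SVC` does not produce probability estimates by default.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary problem (assumption), not the wine data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=0)

# Step names "scale" and "svc" are hypothetical; they prefix the grid keys.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
param_grid = {
    "svc__C": [0.1, 1, 10, 100, 1000],
    "svc__gamma": [0.01, 0.001, 0.0001],
}

# Several metrics are tracked; refit="f1" picks the final model by F1.
search = GridSearchCV(
    pipe, param_grid,
    scoring={"f1": "f1", "auc_roc": "roc_auc"},
    refit="f1", cv=5)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])
print(n_candidates, search.best_params_)
```

`search.cv_results_` then holds `mean_test_f1` and `mean_test_auc_roc` for each of the 15 combinations, which covers the request to evaluate every model and not only the best one.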


# Exercise 10

Compare the results with all the previous models and choose: which algorithm, with which hyperparameters, performs best according to F1, AUC-ROC, and log-loss?

In [67]:
from sklearn.linear_model import LogisticRegression
logreg_r = LogisticRegression(solver='liblinear',C=1e9)
grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}
logreg_cv=GridSearchCV(logreg_r,grid,cv=10)
logreg_cv.fit(X_train_r,y_train_r)
                                        
best_accuracy_log_r = logreg_cv.best_score_
best_parameters_log_r = logreg_cv.best_params_         
                    
print(best_accuracy_log_r)
print(best_parameters_log_r)



0.5772844272844273
{'C': 0.1, 'penalty': 'l2'}


In [70]:
logreg_w = LogisticRegression(solver='liblinear',C=1e9)
grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}
logreg_cv=GridSearchCV(logreg_w,grid,cv=10)
logreg_cv.fit(X_train_w,y_train_w)
                                        
best_accuracy_log_w = logreg_cv.best_score_
best_parameters_log_w = logreg_cv.best_params_         

print(best_accuracy_log_w)
print(best_parameters_log_w)



0.5364584931717047
{'C': 1.0, 'penalty': 'l2'}


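In conclusion, for both wine types (white and red), cross-validated accuracy is higher with the SVM than with the logistic regression: 0.606 versus 0.577 for red wine (RBF kernel with C = 1000 and gamma = 0.01, against C = 0.1 with the l2 penalty) and 0.564 versus 0.536 for white wine (the same SVM settings, against C = 1.0 with the l2 penalty). The comparison rests on accuracy only; F1, AUC-ROC, and log-loss were not computed.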