# LNR - Practical 3
## Session 2: Automatic classification

Name:
- Guillermo Ferrando Muñoz

Exercise:

Implement a classifier for the data of the stereotype detection task
in DETESTS. It is important to train at least two different classifiers
and compare their results.

### Loading the data

We start by loading the DETESTS data, modified with the results obtained in the previous practical of this session: the original "sentence" column is replaced by the 100-component word-embedding vectors, and the "stereotype" column is added as the response variable. I load the data directly from the file "S2_P3.csv", which I generated and will include in the assignment submission.

In [1]:
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

In [2]:
df = pd.read_csv("S2_P3.csv")
df = df.drop(columns=['Unnamed: 0'])  # remove a useless column
df.head(10)

,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,stereotype
0,0.001318,0.003393,0.00234,-0.002296,0.004027,0.002442,0.000703,0.001406,-0.002482,0.000745,...,-0.001195,-0.000311,-0.000729,0.0015,-0.000136,-0.002057,0.004157,0.002924,0.003697,0
1,0.001163,0.00228,0.005396,-0.005706,0.003198,0.001721,0.00203,0.004187,-0.003248,0.007304,...,-0.002858,0.007483,-0.00148,0.00537,-0.002683,0.005325,-0.001532,0.000768,-0.002241,0
2,-0.001684,0.003351,-0.0003,0.000164,0.00015,-0.003695,0.001025,0.005774,0.001311,-0.000973,...,0.0003,0.000582,-0.004117,0.001826,0.00237,0.000677,-0.001318,-0.000238,0.002707,0
3,0.001581,-0.000432,0.002926,-0.001243,0.000831,-0.002557,0.003618,0.003611,0.00052,-0.000479,...,0.001939,-0.002411,-0.002368,0.001319,-0.000663,0.004117,-0.000327,0.003946,0.004706,0
4,-0.000496,-1.6e-05,0.00748,-0.002841,-0.001257,-0.001341,0.001828,0.00355,0.003153,0.001166,...,0.000466,0.005146,0.002355,0.002462,0.003463,0.002679,0.001108,-0.001922,0.000507,0
5,0.000717,0.001005,-0.000263,0.000679,0.001691,-0.003206,-0.002786,0.006954,-0.001253,0.001588,...,0.000191,-0.002558,-0.002527,0.007731,0.003119,-0.00392,0.001947,-0.00473,-0.002866,0
6,-0.004574,-0.001306,2e-05,-0.002135,-0.000803,0.000261,-0.000666,0.002697,0.000333,0.002623,...,-0.002448,0.001542,0.000938,0.005675,-0.000728,0.002071,-0.001132,0.002149,-0.001617,0
7,-0.007377,0.000183,0.004989,-0.006015,0.002102,-0.002705,-0.006664,0.009896,-0.009904,0.007932,...,0.008981,0.001597,5e-06,-0.006048,0.009836,0.00645,-0.00593,0.005494,0.005669,0
8,-0.000654,0.001425,0.00066,0.001815,-0.00299,-0.001975,0.000159,0.004599,-0.004364,0.000639,...,-0.00312,0.004801,-0.000113,0.003341,0.002509,0.002654,0.00055,0.001538,0.003378,0
9,0.004029,0.002029,0.004909,-0.000626,0.000693,-0.003865,0.001327,-0.008393,-9.7e-05,-0.005665,...,0.001387,0.005404,-0.005205,0.001415,0.006878,0.000329,-0.003058,-0.003888,0.002836,0


### Is stereotype imbalanced?

Once the data is loaded, we check whether "stereotype" (a binary class) is imbalanced. At first glance, most sentences seem to belong to class 0:

In [3]:
print("Values taken by the stereotype variable: ")
print(df["stereotype"].unique())  # it is binary
print()
print("Class counts for stereotype: ")
df["stereotype"].value_counts()

Values taken by the stereotype variable: 
[0 1]

Class counts for stereotype: 

0    2946
1     871
Name: stereotype, dtype: int64

77% of the data belongs to class 0 and 23% to class 1, so **there is indeed an imbalance**. We will take it into account when implementing the classifiers.
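
The percentages quoted here follow directly from the class counts. A minimal sketch in plain Python, with the counts copied from the `value_counts()` output (the variable names are illustrative only):

```python
# Class counts copied from the value_counts() output.
counts = {0: 2946, 1: 871}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
```

On a DataFrame, `df["stereotype"].value_counts(normalize=True)` gives the same proportions directly.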

### Applying SMOTE to balance the classes

SMOTE oversampling creates new synthetic observations of the minority class, in this case class 1.
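
The idea behind SMOTE can be sketched without any library: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the segment between them. The code below is a simplified illustration of that idea, not imblearn's implementation; the function name `smote_like` and the toy points are made up for the example.

```python
import random

def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=6, k=2)
print(new_points)
```

Every synthetic point lies between two existing minority points, so the new observations stay inside the region already covered by the minority class.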

In [3]:
#!pip install imblearn
import imblearn
from imblearn.over_sampling import SMOTE
from collections import Counter

We split the data into the features `X` and the labels `y`:

In [4]:
X = df.drop("stereotype", axis=1).values
y = df["stereotype"].values

We apply SMOTE:

In [5]:
oversample = SMOTE()
XSMOTE, ySMOTE = oversample.fit_resample(X, y)

In [6]:
Counter(ySMOTE)  # both classes now have the same number of observations

Counter({0: 2946, 1: 2946})

In [7]:
res = pd.DataFrame(XSMOTE)
labels = pd.DataFrame(ySMOTE).rename(columns={0: 'stereotype'})
s3_p3 = res.join(labels)

In [8]:
s3_p3.to_csv("S3_P3.csv")

After doing this, the last values of `y` are all of class 1, because the synthetic observations are appended after the original ones. That would be harmful if this block ended up as the test portion when splitting the data, so the data has to be shuffled.
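
One caveat with resampling the whole dataset before splitting is that synthetic class-1 points, interpolated from originals that may sit in the training set, also end up in the evaluation set. A common alternative is to split first and oversample only the training part. The sketch below shows that order of operations in plain Python. To stay dependency-free it uses random duplication as a stand-in for SMOTE, and the toy data and the helper name `oversample_minority` are assumptions for illustration. With imblearn, the same idea would amount to calling `fit_resample` on the training split only.

```python
import random
from collections import Counter

def oversample_minority(rows, seed=0):
    """Duplicate random minority rows until both classes have the same size.
    (Random oversampling, used here as a stand-in for SMOTE.)"""
    rng = random.Random(seed)
    counts = Counter(label for _, label in rows)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    pool = [row for row in rows if row[1] == minority]
    n_extra = counts[majority] - counts[minority]
    return rows + [rng.choice(pool) for _ in range(n_extra)]

# Toy data: (feature, label) pairs with roughly a 77/23 class imbalance.
rows = [(i, 0) for i in range(77)] + [(i, 1) for i in range(77, 100)]

# 1) Shuffle and split first ...
rng = random.Random(42)
rng.shuffle(rows)
cut = int(0.75 * len(rows))
train, evaluation = rows[:cut], rows[cut:]

# 2) ... then balance only the training part.
train_balanced = oversample_minority(train)

print("train:     ", Counter(label for _, label in train_balanced))
print("evaluation:", Counter(label for _, label in evaluation))
```

With this ordering the evaluation set contains only original observations and keeps the original class proportions, so the score reflects how the model would behave on unseen real sentences.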

## Model implementation

Brief general conclusions are given at the very end.

### Support vector machines

As the first classifier, we use an SVM.

In [9]:
import time
from sklearn import svm

tini = time.process_time()

# shuffle=True shuffles the data before splitting
X_train, X_eval, y_train, y_eval = train_test_split(
    XSMOTE, ySMOTE, test_size=0.25, shuffle=True, random_state=42)
clf = svm.SVC(kernel='rbf', gamma='scale', C=4)
clf.fit(X_train, y_train)
pred = clf.predict(X_eval)
score = f1_score(y_eval, pred, average='macro')

print("Score: ", score)
tfin = time.process_time()
print()
print("Took", round(tfin - tini, 2), "seconds")

Score:  0.8659209436235042

Took 2.59 seconds
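
The timings in this notebook come from `time.process_time()`, which reports CPU time used by the process rather than elapsed wall-clock time. The following small sketch (not part of the original notebook) contrasts it with `time.perf_counter()`; the sleep and the toy loop are only there to make the difference visible.

```python
import time

cpu_start = time.process_time()
wall_start = time.perf_counter()

time.sleep(0.2)                             # waiting: wall-clock time, almost no CPU
total = sum(i * i for i in range(200_000))  # computing: counts in both clocks

cpu_elapsed = time.process_time() - cpu_start
wall_elapsed = time.perf_counter() - wall_start

print("CPU time: ", round(cpu_elapsed, 2), "seconds")
print("Wall time:", round(wall_elapsed, 2), "seconds")
```

For comparing how long each model makes the user wait, `perf_counter()` may be the more natural choice; `process_time()` is useful when the machine is busy with other work.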


With the 'rbf' kernel and C=4 we get a score of 0.87, which is a good result. Let us now see what we get with the unbalanced data (without applying SMOTE):

In [10]:
tini = time.process_time()

# shuffle=True shuffles the data before splitting
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=42)
clf = svm.SVC(kernel='rbf', gamma='scale', C=4)
clf.fit(X_train, y_train)
pred = clf.predict(X_eval)
score = f1_score(y_eval, pred, average='macro')

print("Score: ", score)
tfin = time.process_time()
print()
print("Took", round(tfin - tini, 2), "seconds")

Score:  0.5501298867606974

Took 1.38 seconds


The score drops considerably (from 0.87 to 0.55); the imbalance seems to affect the SVM a lot. Let us try the unbalanced data once more, but with a different kernel.

In [25]:
tini = time.process_time()

# shuffle=True shuffles the data before splitting
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=42)
clf = svm.SVC(kernel="linear", C=4)
clf.fit(X_train, y_train)
pred = clf.predict(X_eval)
score = f1_score(y_eval, pred, average='macro')

print("Score: ", score)
tfin = time.process_time()
print()
print("Took", round(tfin - tini, 2), "seconds")

Score:  0.43922489724016445

Took 0.39 seconds


With a linear kernel, the results get even worse (0.44).
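
For reference, it helps to know what macro F1 a trivial model would get. The sketch below computes the macro F1 of a classifier that always predicts class 0, assuming (as an approximation) that the evaluation split keeps the class counts of the full dataset; the helper `f1` is written for this example.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Assumption: the evaluation split keeps the full-data class proportions.
n_class0, n_class1 = 2946, 871

# A model that always predicts class 0:
f1_class0 = f1(tp=n_class0, fp=n_class1, fn=0)  # every class-1 item is a false positive
f1_class1 = f1(tp=0, fp=0, fn=n_class1)         # class 1 is never predicted
macro_f1 = (f1_class0 + f1_class1) / 2

print("macro F1 of the always-0 baseline:", round(macro_f1, 3))
```

The baseline comes out at about 0.44, very close to the linear-kernel score, which suggests (but does not prove) that the linear SVM predicts class 0 almost everywhere on the unbalanced data.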

### Logistic regression

In [27]:
import time
from sklearn.linear_model import LogisticRegression

tini = time.process_time()

# shuffle=True shuffles the data before splitting
X_train, X_eval, y_train, y_eval = train_test_split(
    XSMOTE, ySMOTE, test_size=0.25, shuffle=True, random_state=42)
clf = LogisticRegression()
clf.fit(X_train, y_train)
pred = clf.predict(X_eval)
score = f1_score(y_eval, pred, average='macro')

print("Score: ", score)
tfin = time.process_time()
print()
print("Took", round(tfin - tini, 2), "seconds")

Score:  0.6057523755627248

Took 0.11 seconds


Without SMOTE (using `class_weight="balanced"` instead):

In [28]:
import time
from sklearn.linear_model import LogisticRegression

tini = time.process_time()

# shuffle=True shuffles the data before splitting
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=42)
clf = LogisticRegression(class_weight="balanced")
clf.fit(X_train, y_train)
pred = clf.predict(X_eval)
score = f1_score(y_eval, pred, average='macro')

print("Score: ", score)
tfin = time.process_time()
print()
print("Took", round(tfin - tini, 2), "seconds")

Score:  0.5642487923985414

Took 0.02 seconds
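
The `class_weight="balanced"` option reweights the classes instead of resampling them. scikit-learn documents the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula in plain Python to the class counts of the full dataset, as an approximation; the actual training split will have slightly different counts.

```python
# Class counts of the full dataset (the training split will differ slightly).
counts = {0: 2946, 1: 871}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.2f}")
```

Errors on class 1 therefore cost roughly 3.4 times as much as errors on class 0 (2946 / 871), which compensates for the imbalance without creating synthetic data.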


### Decision trees

In [30]:
import time
from sklearn import tree

tini = time.process_time()

# shuffle=True shuffles the data before splitting
X_train, X_eval, y_train, y_eval = train_test_split(
    XSMOTE, ySMOTE, test_size=0.25, shuffle=True, random_state=42)
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
pred = clf.predict(X_eval)
score = f1_score(y_eval, pred, average='macro')

print("Score: ", score)
tfin = time.process_time()
print()
print("Took", round(tfin - tini, 2), "seconds")

Score:  0.7184258643617021

Took 1.02 seconds


Without SMOTE (using `class_weight="balanced"` instead):

In [31]:
import time
from sklearn import tree

tini = time.process_time()

# shuffle=True shuffles the data before splitting
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=42)
clf = tree.DecisionTreeClassifier(class_weight="balanced")
clf.fit(X_train, y_train)
pred = clf.predict(X_eval)
score = f1_score(y_eval, pred, average='macro')

print("Score: ", score)
tfin = time.process_time()
print()
print("Took", round(tfin - tini, 2), "seconds")

Score:  0.5371050068186247

Took 0.41 seconds


### Neural network

In [11]:
import time
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score, accuracy_score

tini = time.process_time()

# shuffle=True shuffles the data before splitting
X_train, X_eval, y_train, y_eval = train_test_split(
    XSMOTE, ySMOTE, test_size=0.25, shuffle=True, random_state=42)
clf = MLPClassifier(random_state=1, max_iter=300)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_eval)

# Per-class F1
a = f1_score(y_eval, y_pred, average=None)
print("F1", a)

# Macro: compute the metric for each label and take the unweighted mean
# (this does not take label imbalance into account).
a = f1_score(y_eval, y_pred, average='macro')
print("macro F1", a)

# Micro: compute the metric globally by counting the total true positives,
# false negatives and false positives.
a = f1_score(y_eval, y_pred, average='micro')
print("micro F1", a)

print("Accuracy:", accuracy_score(y_eval, y_pred, normalize=True))

print()
print('Cross-validation metrics: ')

macro = cross_val_score(clf, X_train, y_train, cv=5, scoring='f1_macro')
print("macro F1", macro)

micro = cross_val_score(clf, X_train, y_train, cv=5, scoring='f1_micro')
print("micro F1", micro)

tfin = time.process_time()
print()
print("Took", round(tfin - tini, 2), "seconds")

F1 [0.61587811 0.71806945]
macro F1 0.6669737800385434
micro F1 0.6748133061778683
Accuracy: 0.6748133061778683

The recorded run stopped at the cross-validation step with `NameError: name 'cross_val_score' is not defined`, because the import was missing, so no cross-validation scores or timing are available for this cell.
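
The difference between the macro and micro averages printed by this cell can be reproduced by hand. The sketch below is a plain-Python illustration on made-up toy labels (it is not scikit-learn's implementation): macro F1 averages the per-class F1 values, while micro F1 pools the counts of all classes first.

```python
def per_class_counts(y_true, y_pred, label):
    """True positives, false positives and false negatives for one label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy labels, invented for the example.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
labels = sorted(set(y_true))

counts = [per_class_counts(y_true, y_pred, c) for c in labels]
per_class = [f1_from_counts(*c) for c in counts]

macro = sum(per_class) / len(per_class)                       # mean of per-class F1
micro = f1_from_counts(*(sum(col) for col in zip(*counts)))   # F1 of pooled counts
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print("per-class F1:", per_class)
print("macro F1:", macro)
print("micro F1:", micro, "accuracy:", accuracy)
```

When every sample has exactly one label, micro F1 coincides with accuracy, which is why the two values in the output of this cell are identical (0.6748). The reported macro F1 is likewise the mean of the two per-class values, (0.6159 + 0.7181) / 2 ≈ 0.667.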

Without SMOTE:

In [33]:
import time
from sklearn.neural_network import MLPClassifier

tini = time.process_time()

# shuffle=True shuffles the data before splitting
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=42)
clf = MLPClassifier(random_state=1, max_iter=300)
clf.fit(X_train, y_train)
pred = clf.predict(X_eval)
score = f1_score(y_eval, pred, average='macro')

print("Score: ", score)
tfin = time.process_time()
print()
print("Took", round(tfin - tini, 2), "seconds")

Score:  0.4666176558453198

Took 16.86 seconds




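### General conclusions

Of the four types of model tried, the best results came from the support vector machine with an RBF kernel (macro F1 of 0.87) and the decision tree (0.72), both trained on the data generated with SMOTE oversampling and both with modest time costs (2.59 and 1.02 seconds). The imbalance hurts every model, although not to the same degree: without SMOTE, macro F1 falls by about 0.32 for the SVM, 0.18 for the decision tree and 0.20 for the neural network, but only by about 0.04 for logistic regression. Drops of around 0.2 or more are quite striking. Note that the logistic regression and decision tree runs without SMOTE used `class_weight="balanced"`, so those two comparisons also include the effect of class weighting.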
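
The macro F1 scores reported in the sections above can be put side by side to see how much each model loses without SMOTE. The sketch below only rearranges numbers already printed in this notebook (rounded to four decimals); keep in mind that the logistic regression and decision tree runs without SMOTE used `class_weight="balanced"`, so they are not strictly like-for-like.

```python
# Macro F1 copied from the outputs above: (with SMOTE, without SMOTE).
scores = {
    "SVM (rbf)": (0.8659, 0.5501),
    "Logistic regression": (0.6058, 0.5642),
    "Decision tree": (0.7184, 0.5371),
    "Neural network": (0.6670, 0.4666),
}

# Absolute drop in macro F1 when SMOTE is not applied.
drops = {name: with_smote - without
         for name, (with_smote, without) in scores.items()}

for name, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} drop = {drop:.3f}")
```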