# Supervised classification

Supervised classification of text works, conceptually, much like classification in other Machine Learning problems with structured data:

1. Preprocess the data (for unstructured data, convert the texts to a TF-IDF representation).
2. Split the set of texts into training and test subsets.
3. Train the model on the training set.
4. Evaluate the model by predicting on the test set and comparing the predictions with the true labels.
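The four steps above can be sketched end to end on a tiny corpus. This is only an illustration: the texts and labels in `toy_texts` and `toy_labels` are made up (they are not taken from `hotel.csv`), and the choice of `MultinomialNB` and of the label values 3 and 5 simply mirrors what is used later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus (illustration only, not the hotel dataset)
toy_texts = [
    "excelente hotel muy limpio", "personal muy amable",
    "desayuno excelente", "muy buen hotel",
    "habitacion sucia y ruidosa", "mal servicio",
    "no lo recomiendo", "pesimo desayuno",
]
toy_labels = [5, 5, 5, 5, 3, 3, 3, 3]

# 1. Preprocess: turn the raw texts into a TF-IDF matrix
X = TfidfVectorizer().fit_transform(toy_texts)

# 2. Split into training and test subsets (25% held out for test)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, toy_labels, test_size=0.25, random_state=100, stratify=toy_labels
)

# 3. Train the model on the training subset only
model = MultinomialNB().fit(X_tr, y_tr)

# 4. Evaluate: predict on the test subset and compare with the true labels
y_hat = model.predict(X_te)
acc = accuracy_score(y_te, y_hat)
print(acc)
```

With so few documents the accuracy itself is meaningless; the point is only the order of the calls.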

In this notebook we apply the classification models covered in class. You can use a separate notebook for each model or keep them all in one. Follow the sequence of steps above, applying the appropriate functions at each step, for each of these models:

- Naive Bayes classifier
- SVM
- KNN
- Decision tree
- Random Forest

Which one works best? Which metrics did you base that on?
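One possible way to answer that question is to run every model through the same split and collect the same metrics for each. The sketch below assumes a hypothetical helper `compare_models` (not part of the original notebook), default or near-default hyperparameters, and synthetic count data standing in for a real BOW matrix. Accuracy and macro-averaged F1 are used as examples of metrics; other choices are equally defensible.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_models(X_train, X_test, y_train, y_test, seed=100):
    """Fit each model on the same split and return accuracy and macro F1."""
    models = {
        "Naive Bayes": MultinomialNB(),
        "SVM": SVC(kernel="linear", random_state=seed),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "Decision tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(random_state=seed),
    }
    results = {}
    for name, model in models.items():
        y_hat = model.fit(X_train, y_train).predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, y_hat),
            "macro_f1": f1_score(y_test, y_hat, average="macro", zero_division=0),
        }
    return results


# Synthetic non-negative counts standing in for a BOW matrix (assumption)
rng = np.random.default_rng(0)
X_demo = rng.poisson(lam=2.0, size=(80, 10))
y_demo = np.where(X_demo[:, 0] > X_demo[:, 1], 5, 3)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=100
)
results = compare_models(X_tr, X_te, y_tr, y_te)
for name, metrics in results.items():
    print(f"{name}: {metrics}")
```

In the notebook itself, the same helper could be called with `X_train, X_test, y_train, y_test` (BOW) and again with the TF-IDF split, so that all ten model and representation combinations are judged on the same test documents.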

In [11]:
## Library imports

import spacy
import pandas as pd
import sklearn

nlp_español = spacy.load('es_core_news_lg')  

## Random seed
random_num = 100

In [2]:
## Load the data

datos = pd.read_csv("/Users/juan/Documents/Juan's MacBook Pro/CUNEF/Quinto/Informacion no Estructurada/Practica 1/hotel.csv")
print(datos.head())
print(datos.shape)
print(datos.columns)

                                                text  label
0  Es un gran hotel; el mejor de Asunción. Buenas...      3
1  hola. no suelo criticar jamas lo que paso pero...      3
2  Escogi meses antes de mi boda una habitacion p...      3
3  Voy a se Lo mas equitativo posible; porque soy...      3
4  Esta es una experiencia de septiembre de 2016;...      3
(200, 2)
Index(['text', 'label'], dtype='object')


### 1.1. Preprocessing and normalization
We separate the documents from their categories. `docs` and `categs` are pandas Series. The categories have to be split off so that the documents alone can be used to build the TF-IDF matrix.

In [3]:
docs = datos.iloc[:,0] # extract column with review
categs = datos.iloc[:,-1] # extract column with sentiment

In [4]:
print("datos has type: ", type(datos))
print("docs has type: ", type(docs))
print("categs has type: ", type(categs))

datos has type:  <class 'pandas.core.frame.DataFrame'>
docs has type:  <class 'pandas.core.series.Series'>
categs has type:  <class 'pandas.core.series.Series'>


### 1.2. Building the BOW and TF-IDF matrices

Build the TF-IDF matrix for all the texts. It can also be derived from the BOW matrix.
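To illustrate the remark that TF-IDF can be derived from BOW: in scikit-learn, `TfidfTransformer` reweights an existing count matrix, and with default parameters the result should match what `TfidfVectorizer` produces directly from the raw texts. The corpus `toy_texts` below is made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus (illustration only)
toy_texts = [
    "el hotel es excelente",
    "el desayuno no es bueno",
    "personal muy amable en el hotel",
]

# Route 1: raw counts first, then reweight the counts into TF-IDF
bow = CountVectorizer().fit_transform(toy_texts)
tfidf_from_bow = TfidfTransformer().fit_transform(bow)

# Route 2: TF-IDF directly from the raw texts
tfidf_direct = TfidfVectorizer().fit_transform(toy_texts)

# Both routes should give the same matrix (same vocabulary order, same weights)
same = np.allclose(tfidf_from_bow.toarray(), tfidf_direct.toarray())
print(same)
```

The equivalence only holds when both routes use the same options, for example the same `max_features`.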

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [58]:
# Tokenize the documents and build the BOW matrix
# (max_features=20 keeps only the 20 most frequent terms in the corpus)
vectorizer = CountVectorizer(max_features=20)
BOW = vectorizer.fit_transform(datos['text'])

vocab = vectorizer.get_feature_names_out()
BOWdf = pd.DataFrame(BOW.toarray(), columns=vocab)
BOWdf

Unnamed: 0,con,de,del,desayuno,el,en,es,excelente,hotel,la,las,lo,los,muy,no,para,personal,que,un,una
0,0,1,0,0,7,2,5,1,2,2,0,1,0,1,0,1,0,2,3,2
1,4,17,4,0,11,8,4,0,4,19,5,4,3,2,6,2,1,16,4,2
2,0,4,0,0,2,1,0,0,1,1,1,0,0,1,0,1,0,3,0,1
3,0,2,0,0,1,1,1,0,0,3,0,1,0,0,0,0,0,2,1,1
4,5,20,6,0,16,7,1,0,6,25,4,5,2,2,14,8,2,34,5,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,1,1,0,0,0,1,0,1,0,0,0,0,1,3,0,0,2,0,0,0
196,2,6,3,1,6,1,1,3,1,6,0,3,1,0,0,2,2,4,1,3
197,0,4,0,1,0,0,0,1,0,0,0,0,0,5,0,0,1,0,0,0
198,1,3,2,0,3,0,1,0,1,4,0,1,1,0,0,0,0,3,0,0


In [59]:
# Build the TF-IDF matrix
tfidf_vec = TfidfVectorizer(max_features=20)
TFIDF = tfidf_vec.fit_transform(datos["text"])

vocab = tfidf_vec.get_feature_names_out()
TFIDFdf = pd.DataFrame(TFIDF.toarray(), columns=vocab)
TFIDFdf

Unnamed: 0,con,de,del,desayuno,el,en,es,excelente,hotel,la,las,lo,los,muy,no,para,personal,que,un,una
0,0.000000,0.081406,0.000000,0.000000,0.590522,0.177638,0.510090,0.125227,0.190191,0.163641,0.000000,0.133650,0.000000,0.087000,0.000000,0.135470,0.000000,0.201811,0.334947,0.278525
1,0.139741,0.415572,0.145796,0.000000,0.278658,0.213371,0.122540,0.000000,0.114225,0.466826,0.195419,0.160535,0.118024,0.052250,0.261945,0.081360,0.038330,0.484814,0.134108,0.083638
2,0.000000,0.582945,0.000000,0.000000,0.302050,0.159007,0.000000,0.000000,0.170244,0.146478,0.233006,0.000000,0.000000,0.155751,0.000000,0.242523,0.000000,0.541935,0.000000,0.249313
3,0.000000,0.361696,0.000000,0.000000,0.187411,0.197316,0.226639,0.000000,0.000000,0.545305,0.000000,0.296912,0.000000,0.000000,0.000000,0.000000,0.000000,0.448335,0.248035,0.309379
4,0.107597,0.301157,0.134711,0.000000,0.249669,0.115003,0.018871,0.000000,0.105540,0.378362,0.096299,0.123608,0.048467,0.032185,0.376490,0.200465,0.047222,0.634601,0.103260,0.154558
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,0.264004,0.184733,0.000000,0.000000,0.000000,0.201555,0.000000,0.284173,0.000000,0.000000,0.000000,0.000000,0.297299,0.592280,0.000000,0.000000,0.579323,0.000000,0.000000,0.000000
196,0.174738,0.366811,0.273465,0.094044,0.380122,0.066702,0.076614,0.282132,0.071416,0.368677,0.000000,0.301110,0.098388,0.000000,0.000000,0.203472,0.191720,0.303116,0.083847,0.313754
197,0.000000,0.556064,0.000000,0.213848,0.000000,0.000000,0.000000,0.213848,0.000000,0.000000,0.000000,0.000000,0.000000,0.742843,0.000000,0.000000,0.217978,0.000000,0.000000,0.000000
198,0.173546,0.364307,0.362131,0.000000,0.377528,0.000000,0.152183,0.000000,0.141857,0.488214,0.000000,0.199370,0.195433,0.000000,0.000000,0.000000,0.000000,0.451571,0.000000,0.000000


### 2. Preparing the training and test subsets

Split the data into train and test using `train_test_split`.
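A variation worth knowing, shown here as a sketch and not as what this notebook does: split the raw texts first and fit the vectorizer on the training texts only, so that no vocabulary or IDF statistics from the test documents reach the model. The corpus, the variable names, and the use of `stratify` to keep the class balance in both subsets are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus (illustration only, not the hotel dataset)
toy_texts = [
    "excelente hotel muy limpio", "personal muy amable",
    "desayuno excelente", "muy buen hotel",
    "habitacion sucia y ruidosa", "mal servicio",
    "no lo recomiendo", "pesimo desayuno",
]
toy_labels = [5, 5, 5, 5, 3, 3, 3, 3]

# Split the raw texts, keeping the class proportions in both subsets
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=100, stratify=toy_labels
)

# Learn the vocabulary and IDF weights from the training texts only
vec = TfidfVectorizer()
Xtr = vec.fit_transform(txt_train)

# Reuse the fitted vectorizer on the test texts (transform, not fit_transform)
Xte = vec.transform(txt_test)

print(Xtr.shape, Xte.shape)
```

Words that appear only in the test texts are simply ignored by `transform`, which is what a deployed model would face with unseen reviews.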

In [60]:
# Split with train_test_split, holding out 25% for test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(BOWdf, datos["label"], test_size = 0.25, random_state = random_num)

X_traintf, X_testtf, y_traintf, y_testtf = train_test_split(TFIDFdf, datos["label"], test_size = 0.25, random_state = random_num)



### 3. Training the model: Naive Bayes classifier (MultinomialNB)

In [61]:
# Train the NB classifier on the BOW features
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X_train, y_train)

In [62]:
clftf = MultinomialNB()
clftf.fit(X_traintf, y_traintf)

### 4. Model evaluation

Compute the confusion matrix to evaluate the model's performance, along with the accuracy (using the `score` method).
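As a quick check on how the two quantities relate, accuracy is the sum of the confusion matrix diagonal divided by the total number of test samples. The sketch below plugs in the BOW confusion matrix printed in this notebook and assumes scikit-learn's convention (rows are true labels, columns are predicted labels, classes in sorted order 3, 5); the per-class recall and precision lines are an addition for illustration.

```python
import numpy as np

# BOW confusion matrix as printed in this notebook (rows: true, cols: predicted)
cm = np.array([[15, 8],
               [11, 16]])

# Accuracy: correctly classified samples (diagonal) over all samples
accuracy = np.trace(cm) / cm.sum()

# Recall per class: diagonal over row totals (true samples of each class)
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# Precision per class: diagonal over column totals (predictions of each class)
precision_per_class = np.diag(cm) / cm.sum(axis=0)

print(round(accuracy, 2))
print(recall_per_class.round(3))
print(precision_per_class.round(3))
```

Here (15 + 16) / 50 = 0.62, which agrees with the value returned by `clf.score` for the BOW model.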

In [63]:
# Predict on the test set
y_pred = clf.predict(X_test)

y_predtf = clftf.predict(X_testtf)

In [64]:
# Confusion Matrix

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(clf.score(X_test, y_test))

[[15  8]
 [11 16]]
0.62


In [65]:
y_pred

array([5, 3, 5, 3, 5, 5, 3, 3, 3, 3, 5, 3, 5, 5, 3, 5, 5, 3, 5, 3, 3, 5,
       3, 5, 3, 5, 3, 3, 3, 5, 3, 5, 5, 3, 3, 3, 5, 3, 3, 5, 5, 5, 5, 3,
       3, 3, 5, 3, 5, 5])

In [66]:
print(confusion_matrix(y_testtf, y_predtf))
# Score the TF-IDF model (clftf) on the TF-IDF test set, not the BOW model (clf)
print(clftf.score(X_testtf, y_testtf))

[[22  1]
 [22  5]]
0.54


In [67]:
y_predtf

array([3, 3, 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 5, 3, 3, 3,
       3, 3, 5, 3, 3, 5])