## Sentiment Analysis

Este notebook presenta modelos para analisis de sentimientos para varios dominios

## 1. Importar Librerias

In [1]:
from glob import glob

import numpy as np
from tqdm import tqdm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB, MultinomialNB

from functions import create_sentiment_dataset, build_preprocess_pipeline, build_preprocess_pipeline_lexicon

## 2.0 Carga de datos

In [2]:
# Carga del data set - Cambiar segun sea necesario
files = glob('data/Multi Domain Sentiment/processed_acl/*/*')

In [3]:
df = create_sentiment_dataset(files)
df

Unnamed: 0,raw_text,label,text,folder,file
0,avid:1 your:1 horrible_book:1 wasted:1 use_it:...,negative,avid your horrible book wasted use it the...,books,negative.review
1,to_use:1 shallow:1 found:1 he_castigates:1 cas...,negative,to use shallow found he castigates castiga...,books,negative.review
2,avid:1 your:1 horrible_book:1 wasted:1 use_it:...,negative,avid your horrible book wasted use it the...,books,negative.review
3,book_seriously:1 we:1 days_couldn't:1 me_tell:...,negative,book seriously we days couldn't me tell st...,books,negative.review
4,"mass:1 only:1 he:2 help:1 ""jurisfiction"":1 lik...",negative,"mass only he help ""jurisfiction"" like wa...",books,negative.review
...,...,...,...,...,...
27672,the_last:1 well:1 gets:1 the_next:1 come:1 chi...,positive,the last well gets the next come china an...,kitchen,unlabeled.review
27673,through:1 them_ordered:1 so_cookies:1 won't_be...,positive,through them ordered so cookies won't be o...,kitchen,unlabeled.review
27674,i:1 is_great:1 god-daughter:1 get:1 cooking_it...,positive,i is great god-daughter get cooking it's ...,kitchen,unlabeled.review
27675,steel:5 the_edge:1 just_a:1 only_slightly:1 st...,negative,steel the edge just a only slightly straig...,kitchen,unlabeled.review


In [4]:
# Agrupar los reviews por categoria
df.groupby(['folder','file']).size().unstack().fillna(0)

file,negative.review,positive.review,unlabeled.review
folder,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
books,1000,1000,4465
dvd,1000,1000,3586
electronics,1000,1000,5681
kitchen,1000,1000,5945


### Train - Test splitting

In [5]:
# Los datos de entrenamiento consisten de los reviews que estan marcados como positivos o negativos
train_data = df[df.file!='unlabeled.review'].reset_index(drop=True)
# El conjunto de pruebas consiste de los reviews que no estan marcados
test_data = df[df.file=='unlabeled.review'].reset_index(drop=True)

## 3.0 Clasificador por categoria

En esta seccion se va a construir un clasificador por cada una de las 4 categorias (Books/DVD/electronics/kitchen)

### TF - IDF

En los siguientes clasificadores se utiliza `tf-idf` para vectorizar el texto.

#### Logistic Regression

In [6]:
# Por cada categoría se crea un modelo y se analizan las características más importantes
for cate in tqdm(train_data['folder'].unique()):
    cate_train_data = train_data[train_data['folder'] == cate]
    cate_test_data = test_data[test_data['folder'] == cate]
    
    # Construir y ajustar el pipeline de preprocesamiento
    tfidf_pipeline = build_preprocess_pipeline('tfidf').fit(cate_train_data['text'])
    X_train_tfidf_transformed = tfidf_pipeline.transform(cate_train_data['text'])
    
    # Clasificador de regresión logística
    logistic_estimator = LogisticRegression(n_jobs=-1, random_state=42, solver='saga')
    cate_lr = logistic_estimator.fit(X_train_tfidf_transformed, cate_train_data['label'])
    
    # Probar el modelo
    X_test_tfidf_transformed = tfidf_pipeline.transform(cate_test_data['text'])
    y_pred = cate_lr.predict(X_test_tfidf_transformed)
    print(f"Resultados de clasificación para la categoría {cate}:")
    print(classification_report(cate_test_data['label'], y_pred))
    
    # Obtener los nombres de las características del vectorizador
    feature_names = tfidf_pipeline.named_steps['vectorizer'].get_feature_names_out()
    coef = cate_lr.coef_[0]  # Coeficiente del modelo entrenado para la primera clase (positiva o negativa)

    # Obtener las 10 características más importantes
    top_features_indices = np.argsort(coef)[-10:]
    top_features = [(feature_names[i], coef[i]) for i in top_features_indices]

    print(f"Características más importantes para la categoría {cate}:")
    for feature, weight in top_features:
        print(f"{feature}: {weight:.4f}")
    print("\n" + "-"*50 + "\n")

 25%|█████████████████████                                                               | 1/4 [00:08<00:26,  8.93s/it]

Resultados de clasificación para la categoría books:
              precision    recall  f1-score   support

    negative       0.82      0.82      0.82      2201
    positive       0.82      0.83      0.83      2264

    accuracy                           0.82      4465
   macro avg       0.82      0.82      0.82      4465
weighted avg       0.82      0.82      0.82      4465

Características más importantes para la categoría books:
highly: 1.3807
favorite: 1.3809
recommend: 1.4730
love: 1.5758
loved: 1.5821
best: 1.7067
wonderful: 1.8561
easy: 1.8678
excellent: 2.5359
great: 2.5664

--------------------------------------------------



 50%|██████████████████████████████████████████                                          | 2/4 [00:17<00:17,  8.77s/it]

Resultados de clasificación para la categoría dvd:
              precision    recall  f1-score   support

    negative       0.85      0.79      0.82      1779
    positive       0.81      0.86      0.84      1807

    accuracy                           0.83      3586
   macro avg       0.83      0.83      0.83      3586
weighted avg       0.83      0.83      0.83      3586

Características más importantes para la categoría dvd:
fun: 1.3184
enjoy: 1.4308
season: 1.4876
family: 1.5869
loved: 1.6472
wonderful: 1.6526
excellent: 1.9895
love: 2.1912
best: 2.5460
great: 3.4329

--------------------------------------------------



 75%|███████████████████████████████████████████████████████████████                     | 3/4 [00:24<00:08,  8.01s/it]

Resultados de clasificación para la categoría electronics:
              precision    recall  f1-score   support

    negative       0.85      0.84      0.84      2824
    positive       0.84      0.85      0.85      2857

    accuracy                           0.84      5681
   macro avg       0.84      0.84      0.84      5681
weighted avg       0.84      0.84      0.84      5681

Características más importantes para la categoría electronics:
fast: 1.5724
highly: 1.9789
works: 2.0644
easy: 2.0722
good: 2.1233
best: 2.1383
perfect: 2.3296
excellent: 2.8242
price: 3.1311
great: 4.6393

--------------------------------------------------



100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:30<00:00,  7.71s/it]

Resultados de clasificación para la categoría kitchen:
              precision    recall  f1-score   support

    negative       0.85      0.87      0.86      2991
    positive       0.86      0.85      0.86      2954

    accuracy                           0.86      5945
   macro avg       0.86      0.86      0.86      5945
weighted avg       0.86      0.86      0.86      5945

Características más importantes para la categoría kitchen:
clean: 1.6883
little: 1.9548
works: 1.9760
ve: 2.0059
perfect: 2.2109
excellent: 2.2450
best: 2.6713
love: 3.6070
easy: 4.0915
great: 4.4956

--------------------------------------------------






### Naive Bayes

In [7]:
# Por cada categoria se crea un modelo
for cate in tqdm(train_data['folder'].unique()):
    cate_train_data = train_data[train_data['folder']==cate]
    cate_test_data = test_data[test_data['folder']==cate]
    
    # Pipeline de preprocesamiento de datos usado tambien para el notebook de 20N
    tfidf_pipeline = build_preprocess_pipeline('tfidf').fit(cate_train_data['text'])
    X_train_tfidf_transformed = tfidf_pipeline.transform(cate_train_data['text'])
    
    # Clasificador multinomial de naive bayes
    nb_estimator = MultinomialNB(alpha=1.0)

    cate_nb = nb_estimator.fit(X_train_tfidf_transformed, cate_train_data['label'])
    
    ## Probar el modelo y obtener las metricas
    X_test_transformed_tfidf = tfidf_pipeline.transform(cate_test_data['text'])
    y_pred = cate_nb.predict(X_test_transformed_tfidf)
    
    print(f'************* {cate} *************')
    print(classification_report(cate_test_data['label'], y_pred))

 25%|█████████████████████                                                               | 1/4 [00:09<00:29,  9.70s/it]

************* books *************
              precision    recall  f1-score   support

    negative       0.81      0.82      0.82      2201
    positive       0.83      0.82      0.82      2264

    accuracy                           0.82      4465
   macro avg       0.82      0.82      0.82      4465
weighted avg       0.82      0.82      0.82      4465



 50%|██████████████████████████████████████████                                          | 2/4 [00:19<00:19,  9.69s/it]

************* dvd *************
              precision    recall  f1-score   support

    negative       0.82      0.83      0.83      1779
    positive       0.83      0.82      0.83      1807

    accuracy                           0.83      3586
   macro avg       0.83      0.83      0.83      3586
weighted avg       0.83      0.83      0.83      3586



 75%|███████████████████████████████████████████████████████████████                     | 3/4 [00:26<00:08,  8.65s/it]

************* electronics *************
              precision    recall  f1-score   support

    negative       0.84      0.84      0.84      2824
    positive       0.84      0.84      0.84      2857

    accuracy                           0.84      5681
   macro avg       0.84      0.84      0.84      5681
weighted avg       0.84      0.84      0.84      5681



100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:32<00:00,  8.24s/it]

************* kitchen *************
              precision    recall  f1-score   support

    negative       0.86      0.83      0.85      2991
    positive       0.84      0.86      0.85      2954

    accuracy                           0.85      5945
   macro avg       0.85      0.85      0.85      5945
weighted avg       0.85      0.85      0.85      5945






Se observa un resultado postivo para cada uno de las categorias pues en todos la precision es mayor al 80% para ambos clasificadores

## TF

Para los siguientes clasificadores se utiliza la frecuencia de los terminos para vectorizar el texto para usar como entrada a los modelos

### Logistic Regression

In [8]:
# Por cada categoria se crea un modelo
for cate in tqdm(train_data['folder'].unique()):
    
    cate_train_data = train_data[train_data['folder']==cate]
    cate_test_data = test_data[test_data['folder']==cate]
    
    # Pipeline de preprocesamiento de datos usado tambien para el notebook de 20N
    cnt_pipeline = build_preprocess_pipeline('count').fit(cate_train_data['text'])
    X_train_cnt_transformed = cnt_pipeline.transform(cate_train_data['text'])
    
    # Clasificador de regresion logistica
    logistic_estimator = LogisticRegression(n_jobs=-1, random_state=42, 
                                            class_weight=None, solver='saga',
                                            max_iter=1000, penalty='l2',
                                            tol=1e-2, C=1
                                            )

    cate_lr = logistic_estimator.fit(X_train_cnt_transformed, cate_train_data['label'])
    
    ## Probar el modelo usando el conjunto de pruebas
    X_test_transformed_cnt = cnt_pipeline.transform(cate_test_data['text'])
    y_pred = cate_lr.predict(X_test_transformed_cnt)
    
    print(f'************* {cate} *************')
    print(classification_report(cate_test_data['label'], y_pred))

 25%|█████████████████████                                                               | 1/4 [00:10<00:30, 10.00s/it]

************* books *************
              precision    recall  f1-score   support

    negative       0.84      0.82      0.83      2201
    positive       0.83      0.85      0.84      2264

    accuracy                           0.83      4465
   macro avg       0.83      0.83      0.83      4465
weighted avg       0.83      0.83      0.83      4465



 50%|██████████████████████████████████████████                                          | 2/4 [00:18<00:18,  9.41s/it]

************* dvd *************
              precision    recall  f1-score   support

    negative       0.84      0.77      0.80      1779
    positive       0.79      0.85      0.82      1807

    accuracy                           0.81      3586
   macro avg       0.81      0.81      0.81      3586
weighted avg       0.81      0.81      0.81      3586

************* electronics *************


 75%|███████████████████████████████████████████████████████████████                     | 3/4 [00:26<00:08,  8.54s/it]

              precision    recall  f1-score   support

    negative       0.86      0.85      0.86      2824
    positive       0.85      0.86      0.86      2857

    accuracy                           0.86      5681
   macro avg       0.86      0.86      0.86      5681
weighted avg       0.86      0.86      0.86      5681



100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:33<00:00,  7.88s/it]

************* kitchen *************
              precision    recall  f1-score   support

    negative       0.86      0.86      0.86      2991
    positive       0.85      0.86      0.86      2954

    accuracy                           0.86      5945
   macro avg       0.86      0.86      0.86      5945
weighted avg       0.86      0.86      0.86      5945



100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:33<00:00,  8.34s/it]


### Naive Bayes

In [9]:
# Por cada categoria se crea un modelo
for cate in tqdm(train_data['folder'].unique()):
    
    cate_train_data = train_data[train_data['folder']==cate]
    cate_test_data = test_data[test_data['folder']==cate]
    
    # Pipeline de preprocesamiento de datos usado tambien para el notebook de 20N
    cnt_pipeline = build_preprocess_pipeline('count').fit(cate_train_data['text'])
    X_train_cnt_transformed = cnt_pipeline.transform(cate_train_data['text'])

    # Clasificador de Naive Bayes multinomial
    nb_estimator = MultinomialNB(alpha=1.0)

    cate_nb = nb_estimator.fit(X_train_tfidf_transformed, cate_train_data['label'])
    
    ## Probar el modelo con el conjunto de pruebas
    
    X_test_transformed_cnt = cnt_pipeline.transform(cate_test_data['text'])
    y_pred = cate_nb.predict(X_test_transformed_cnt)
    
    print(f'************* {cate} *************')
    print(classification_report(cate_test_data['label'], y_pred))

 25%|█████████████████████                                                               | 1/4 [00:10<00:30, 10.28s/it]

************* books *************
              precision    recall  f1-score   support

    negative       0.50      0.49      0.50      2201
    positive       0.52      0.53      0.52      2264

    accuracy                           0.51      4465
   macro avg       0.51      0.51      0.51      4465
weighted avg       0.51      0.51      0.51      4465



 50%|██████████████████████████████████████████                                          | 2/4 [00:19<00:18,  9.42s/it]

************* dvd *************
              precision    recall  f1-score   support

    negative       0.49      0.54      0.52      1779
    positive       0.50      0.45      0.47      1807

    accuracy                           0.49      3586
   macro avg       0.49      0.49      0.49      3586
weighted avg       0.49      0.49      0.49      3586



 75%|███████████████████████████████████████████████████████████████                     | 3/4 [00:26<00:08,  8.35s/it]

************* electronics *************
              precision    recall  f1-score   support

    negative       0.49      0.52      0.50      2824
    positive       0.49      0.45      0.47      2857

    accuracy                           0.49      5681
   macro avg       0.49      0.49      0.49      5681
weighted avg       0.49      0.49      0.49      5681



100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:32<00:00,  7.50s/it]

************* kitchen *************
              precision    recall  f1-score   support

    negative       0.87      0.84      0.86      2991
    positive       0.84      0.87      0.86      2954

    accuracy                           0.86      5945
   macro avg       0.86      0.86      0.86      5945
weighted avg       0.86      0.86      0.86      5945



100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:32<00:00,  8.09s/it]


Se observa un peor rendimiento usando `tf` especialmente usando Naive Bayes para la categoria de electronics y dvd. Lo que podria explicarse porque la terminologia usada para las resenas de estos productos no logra ser suficiente para el analisis de sentimientos.

## Lexicons

En los modelos siguientes se utiliza una representacion usando un puntaje de positivo/negativo a partir de un lexicon y en base al texto.

### Regresion lineal

In [10]:
# Por cada categoría se crea un modelo
for cate in tqdm(train_data['folder'].unique()):
    
    cate_train_data = train_data[train_data['folder'] == cate]
    cate_test_data = test_data[test_data['folder'] == cate]
    
    # Construir y ajustar el pipeline de preprocesamiento
    pipeline = build_preprocess_pipeline_lexicon('data/lexicon/SentiWordNet_3.0.0.txt')
    X_train_sentiment = pipeline.fit_transform(cate_train_data['text'])

    # Clasificador de regresion logistica
    logistic_estimator = LogisticRegression()
    cate_lr = logistic_estimator.fit(X_train_sentiment, cate_train_data['label'])
    
    # Probar el modelo con el conjunto de pruebas
    X_test_sentiment = pipeline.transform(cate_test_data['text'])
    y_pred = cate_lr.predict(X_test_sentiment)
    
    print(f'************* {cate} *************')
    print(classification_report(cate_test_data['label'], y_pred))

 25%|█████████████████████                                                               | 1/4 [00:04<00:14,  4.98s/it]

************* books *************
              precision    recall  f1-score   support

    negative       0.59      0.64      0.62      2201
    positive       0.62      0.57      0.60      2264

    accuracy                           0.61      4465
   macro avg       0.61      0.61      0.61      4465
weighted avg       0.61      0.61      0.61      4465



 50%|██████████████████████████████████████████                                          | 2/4 [00:09<00:09,  4.53s/it]

************* dvd *************
              precision    recall  f1-score   support

    negative       0.60      0.62      0.61      1779
    positive       0.62      0.59      0.60      1807

    accuracy                           0.61      3586
   macro avg       0.61      0.61      0.61      3586
weighted avg       0.61      0.61      0.61      3586



 75%|███████████████████████████████████████████████████████████████                     | 3/4 [00:12<00:04,  4.14s/it]

************* electronics *************
              precision    recall  f1-score   support

    negative       0.59      0.66      0.62      2824
    positive       0.62      0.56      0.59      2857

    accuracy                           0.61      5681
   macro avg       0.61      0.61      0.61      5681
weighted avg       0.61      0.61      0.61      5681



100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:16<00:00,  4.11s/it]

************* kitchen *************
              precision    recall  f1-score   support

    negative       0.64      0.67      0.65      2991
    positive       0.65      0.62      0.63      2954

    accuracy                           0.64      5945
   macro avg       0.64      0.64      0.64      5945
weighted avg       0.64      0.64      0.64      5945






### Naive Bayes

In [11]:
# Por cada categoría se crea un modelo
for cate in tqdm(train_data['folder'].unique()):
    # Extraer los datos de entrenamiento y prueba para la categoría específica
    cate_train_data = train_data[train_data['folder'] == cate]
    cate_test_data = test_data[test_data['folder'] == cate]
    
    # Construir y ajustar el pipeline de preprocesamiento basado en léxicos
    pipeline = build_preprocess_pipeline_lexicon('data/lexicon/SentiWordNet_3.0.0.txt')
    X_train_lex_transformed = pipeline.fit_transform(cate_train_data['text'])
    y_train = cate_train_data['label']
    
    # Entrenar el modelo Naive Bayes
    nb_estimator = GaussianNB()
    cate_nb = nb_estimator.fit(X_train_lex_transformed, y_train)
    
    # Transformar los datos de prueba y realizar predicciones
    X_test_lex_transformed = pipeline.transform(cate_test_data['text'])
    y_test = cate_test_data['label']
    y_pred = cate_nb.predict(X_test_lex_transformed)
    
    # Imprimir el reporte de clasificación para cada categoría
    print(f'************* {cate} *************')
    print(classification_report(y_test, y_pred))

 25%|█████████████████████                                                               | 1/4 [00:04<00:14,  4.89s/it]

************* books *************
              precision    recall  f1-score   support

    negative       0.59      0.66      0.62      2201
    positive       0.62      0.55      0.58      2264

    accuracy                           0.60      4465
   macro avg       0.60      0.60      0.60      4465
weighted avg       0.60      0.60      0.60      4465



 50%|██████████████████████████████████████████                                          | 2/4 [00:09<00:09,  4.57s/it]

************* dvd *************
              precision    recall  f1-score   support

    negative       0.58      0.71      0.64      1779
    positive       0.63      0.49      0.55      1807

    accuracy                           0.60      3586
   macro avg       0.60      0.60      0.59      3586
weighted avg       0.60      0.60      0.59      3586



 75%|███████████████████████████████████████████████████████████████                     | 3/4 [00:13<00:04,  4.21s/it]

************* electronics *************
              precision    recall  f1-score   support

    negative       0.58      0.71      0.64      2824
    positive       0.63      0.49      0.55      2857

    accuracy                           0.60      5681
   macro avg       0.60      0.60      0.59      5681
weighted avg       0.60      0.60      0.59      5681

************* kitchen *************


100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:16<00:00,  4.20s/it]

              precision    recall  f1-score   support

    negative       0.63      0.70      0.66      2991
    positive       0.65      0.58      0.62      2954

    accuracy                           0.64      5945
   macro avg       0.64      0.64      0.64      5945
weighted avg       0.64      0.64      0.64      5945






# 4.0 Clasificador para todas las categorias

Ahora construimos un solo clasificador para todas las categorias donde se determina si el review es positivo o negativo unicamente

### Preprocesamiento

In [12]:
# Se construye el pipeline de procesamiento para todo el conjunto de datos de entrenamiento
tfidf_pipeline = build_preprocess_pipeline('tfidf').fit(train_data['text'])
X_train_tfidf_transformed = tfidf_pipeline.transform(train_data['text'])

cnt_pipeline = build_preprocess_pipeline('count').fit(train_data['text'])
X_train_cnt_transformed = cnt_pipeline.transform(train_data['text'])

lex_pipeline = build_preprocess_pipeline_lexicon('data/lexicon/SentiWordNet_3.0.0.txt')
X_train_lex_transformed = lex_pipeline.fit_transform(train_data['text'])

## TF - IDF

En esta seccion usamos `tf-idf` como metodo de vectorizacion del texto

### Regresion Logistica

In [13]:
# Se utiliza clasificador de regresion logistica
logistic_estimator = LogisticRegression(n_jobs=-1, random_state=42, 
                                        class_weight=None, solver='saga',
                                        max_iter=1000, penalty='l2',
                                        tol=1e-2, C=1
                                        )

cate_lr = logistic_estimator.fit(X_train_tfidf_transformed, train_data['label'])

## Se prueba el modelo y arrojan los resultados 
X_test_transformed_tfidf = tfidf_pipeline.transform(test_data['text'])
y_pred = cate_lr.predict(X_test_transformed_tfidf)

print(classification_report(test_data['label'], y_pred))

              precision    recall  f1-score   support

    negative       0.85      0.85      0.85      9795
    positive       0.85      0.85      0.85      9882

    accuracy                           0.85     19677
   macro avg       0.85      0.85      0.85     19677
weighted avg       0.85      0.85      0.85     19677



In [14]:
# Obtener los nombres de las características del vectorizador
feature_names = tfidf_pipeline.named_steps['vectorizer'].get_feature_names_out()
coef = cate_lr.coef_[0]  # Hay un solo conjunto de coeficientes para clasificación binaria

# Obtener las 10 características más importantes (positivas y negativas)
top_positive_indices = np.argsort(coef)[-10:]  # Las características más importantes para la clase positiva
top_negative_indices = np.argsort(coef)[:10]   # Las características más importantes para la clase negativa

top_positive_features = [(feature_names[i], coef[i]) for i in top_positive_indices]
top_negative_features = [(feature_names[i], coef[i]) for i in top_negative_indices]

print("Características más importantes para clasificar como positivo:")
for feature, weight in top_positive_features:
    print(f"{feature}: {weight:.4f}")

print("\nCaracterísticas más importantes para clasificar como negativo:")
for feature, weight in top_negative_features:
    print(f"{feature}: {weight:.4f}")

Características más importantes para clasificar como positivo:
wonderful: 2.9428
works: 3.0035
highly: 3.0108
price: 3.2014
perfect: 3.7345
love: 4.4127
easy: 4.7173
best: 5.0155
excellent: 5.6955
great: 7.1480

Características más importantes para clasificar como negativo:
waste: -4.8423
bad: -4.5380
disappointed: -4.2791
worst: -4.1176
poor: -4.0052
boring: -3.6060
disappointing: -3.3968
terrible: -3.1828
disappointment: -3.0579
return: -2.9255


### Naive Bayes

In [15]:
# Clasificador de Naive Bayes multinomial
nb_estimator = MultinomialNB(alpha=1.0)

cate_nb = nb_estimator.fit(X_train_tfidf_transformed, train_data['label'])

## Se prueba el modelo y se imprimen los resultados
y_pred = cate_lr.predict(X_test_transformed_tfidf)

print(classification_report(test_data['label'], y_pred))

              precision    recall  f1-score   support

    negative       0.85      0.85      0.85      9795
    positive       0.85      0.85      0.85      9882

    accuracy                           0.85     19677
   macro avg       0.85      0.85      0.85     19677
weighted avg       0.85      0.85      0.85     19677



Ambos modelos arrojan modelos muy buenos donde la precision es del `0.85`

## TF

Ahora se utiliza una matriz con la frecuencia de los terminos como entrada de los modelos

### Regresion Logistica

In [16]:
# Clasificador de regresion logistica
logistic_estimator = LogisticRegression(n_jobs=-1, random_state=42, 
                                        class_weight=None, solver='saga',
                                        max_iter=1000, penalty='l2',
                                        tol=1e-2, C=1
                                        )

cate_lr = logistic_estimator.fit(X_train_cnt_transformed, train_data['label'])

## Probar el modelo
X_test_transformed_cnt = cnt_pipeline.transform(test_data['text'])
y_pred = cate_lr.predict(X_test_transformed_cnt)

print(classification_report(test_data['label'], y_pred))

              precision    recall  f1-score   support

    negative       0.86      0.84      0.85      9795
    positive       0.84      0.86      0.85      9882

    accuracy                           0.85     19677
   macro avg       0.85      0.85      0.85     19677
weighted avg       0.85      0.85      0.85     19677



### Naive Bayes

In [17]:
# Clasificador de Naive Bayes multinomial
logistic_estimator = MultinomialNB(alpha=1.0)

cate_lr = logistic_estimator.fit(X_train_cnt_transformed, train_data['label'])

## Test the model
y_pred = cate_lr.predict(X_test_transformed_cnt)

print(classification_report(test_data['label'], y_pred))

              precision    recall  f1-score   support

    negative       0.82      0.82      0.82      9795
    positive       0.83      0.83      0.83      9882

    accuracy                           0.83     19677
   macro avg       0.83      0.83      0.83     19677
weighted avg       0.83      0.83      0.83     19677



Con Naive Bayes se obtiene un resultado ligeramente peor comparado con el resto de los modelos, aunque no se evidencia la misma dificultad que al clasificar por categoria.

## Lexicons

Ahora usamos caracteristicas extraidas del lexicon que corresponden a un puntaje de positivo/negativo para cada review.

### Regresion logistica

In [18]:
# Clasificador de regresion logistica
logistic_estimator = LogisticRegression(n_jobs=-1, random_state=42, 
                                        class_weight=None, solver='saga',
                                        max_iter=1000, penalty='l2',
                                        tol=1e-2, C=1
                                        )

cate_lr = logistic_estimator.fit(X_train_lex_transformed, train_data['label'])

## Probar el modelo
X_test_transformed_lex = lex_pipeline.transform(test_data['text'])
y_pred = cate_lr.predict(X_test_transformed_lex)

print(classification_report(test_data['label'], y_pred))

              precision    recall  f1-score   support

    negative       0.63      0.58      0.61      9795
    positive       0.62      0.66      0.64      9882

    accuracy                           0.62     19677
   macro avg       0.62      0.62      0.62     19677
weighted avg       0.62      0.62      0.62     19677



Dado que el feature de score que se extrae del texto a partir del lexicon es bastante sencillo, no se obtiene los mismos resultados que usando bolsa de palabras, pero si existe la posibilidad de mejorar el modelo.

### Naive Bayes

In [19]:
# Construir y ajustar el pipeline de preprocesamiento basado en léxicos para el conjunto consolidado
pipeline = build_preprocess_pipeline_lexicon('data/lexicon/SentiWordNet_3.0.0.txt')
X_train_lex_transformed = pipeline.fit_transform(train_data['text'])
y_train = train_data['label']

# Entrenar el modelo Naive Bayes en todo el conjunto de datos de entrenamiento
nb_estimator = GaussianNB()
consolidated_nb = nb_estimator.fit(X_train_lex_transformed, y_train)

# Transformar los datos de prueba y realizar predicciones
X_test_lex_transformed = pipeline.transform(test_data['text'])
y_test = test_data['label']
y_pred = consolidated_nb.predict(X_test_lex_transformed)

# Imprimir el reporte de clasificación para el conjunto consolidado
print('************* Clasificador Consolidado *************')
print(classification_report(y_test, y_pred))

************* Clasificador Consolidado *************
              precision    recall  f1-score   support

    negative       0.59      0.70      0.64      9795
    positive       0.64      0.52      0.57      9882

    accuracy                           0.61     19677
   macro avg       0.61      0.61      0.61     19677
weighted avg       0.61      0.61      0.61     19677

