## Sentiment Analysis

Este notebook presenta modelos para analisis de sentimientos para varios dominios

## 1. Importar Librerias

In [1]:
from glob import glob
from functions import create_sentiment_dataset, build_preprocess_pipeline
from tqdm import tqdm

from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression


## 2.0 Carga de datos

In [2]:
# Carga del data set - Cambiar segun sea necesario
files = glob('data/Multi Domain Sentiment/processed_acl/*/*')

In [3]:
df = create_sentiment_dataset(files)
df

Unnamed: 0,raw_text,label,text,folder,file
0,avid:1 your:1 horrible_book:1 wasted:1 use_it:...,negative,avid your horrible book wasted use it the...,books,negative.review
1,to_use:1 shallow:1 found:1 he_castigates:1 cas...,negative,to use shallow found he castigates castiga...,books,negative.review
2,avid:1 your:1 horrible_book:1 wasted:1 use_it:...,negative,avid your horrible book wasted use it the...,books,negative.review
3,book_seriously:1 we:1 days_couldn't:1 me_tell:...,negative,book seriously we days couldn't me tell st...,books,negative.review
4,"mass:1 only:1 he:2 help:1 ""jurisfiction"":1 lik...",negative,"mass only he help ""jurisfiction"" like wa...",books,negative.review
...,...,...,...,...,...
27672,the_last:1 well:1 gets:1 the_next:1 come:1 chi...,positive,the last well gets the next come china an...,kitchen,unlabeled.review
27673,through:1 them_ordered:1 so_cookies:1 won't_be...,positive,through them ordered so cookies won't be o...,kitchen,unlabeled.review
27674,i:1 is_great:1 god-daughter:1 get:1 cooking_it...,positive,i is great god-daughter get cooking it's ...,kitchen,unlabeled.review
27675,steel:5 the_edge:1 just_a:1 only_slightly:1 st...,negative,steel the edge just a only slightly straig...,kitchen,unlabeled.review


In [4]:
# Agrupar los reviews por categoria
df.groupby(['folder','file']).size().unstack().fillna(0)

file,negative.review,positive.review,unlabeled.review
folder,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
books,1000,1000,4465
dvd,1000,1000,3586
electronics,1000,1000,5681
kitchen,1000,1000,5945


### Train - Test splitting

In [5]:
# Los datos de entrenamiento consisten de los reviews que estan marcados como positivos o negativos
train_data = df[df.file!='unlabeled.review'].reset_index(drop=True)
# El conjunto de pruebas consiste de los reviews que no estan marcados
test_data = df[df.file=='unlabeled.review'].reset_index(drop=True)

## 3.0 Clasificador por categoria

En esta seccion se va a construir un clasificador por cada una de las 4 categorias (Books/DVD/electronics/kitchen)

### TF - IDF

En los siguientes clasificadores se utiliza `tf-idf` para vectorizar el texto.

#### Logistic Regression

In [6]:
# Por cada categoria se crea un modelo
for cate in tqdm(train_data['folder'].unique()):
    
    cate_train_data = train_data[train_data['folder']==cate]
    cate_test_data = test_data[test_data['folder']==cate]
    
    # Pipeline de preprocesamiento de datos usado tambien para el notebook de 20N
    tfidf_pipeline = build_preprocess_pipeline('tfidf').fit(cate_train_data['text'])
    X_train_tfidf_transformed = tfidf_pipeline.transform(cate_train_data['text'])
    
    # Clasificador de regresion logistica
    logistic_estimator = LogisticRegression(n_jobs=-1, random_state=42, 
                                            class_weight=None, solver='saga',
                                            max_iter=1000, penalty='l2',
                                            tol=1e-2, C=1
                                            )

    cate_lr = logistic_estimator.fit(X_train_tfidf_transformed, cate_train_data['label'])
    
    ## Probar el modelo y obtener las metricas
    
    X_test_transformed_tfidf = tfidf_pipeline.transform(cate_test_data['text'])
    y_pred = cate_lr.predict(X_test_transformed_tfidf)
    
    print(f'************* {cate} *************')
    print(classification_report(cate_test_data['label'], y_pred))

 25%|██▌       | 1/4 [00:07<00:23,  7.75s/it]

************* books *************
              precision    recall  f1-score   support

    negative       0.82      0.82      0.82      2201
    positive       0.83      0.83      0.83      2264

    accuracy                           0.82      4465
   macro avg       0.82      0.82      0.82      4465
weighted avg       0.82      0.82      0.82      4465



 50%|█████     | 2/4 [00:15<00:15,  7.51s/it]

************* dvd *************
              precision    recall  f1-score   support

    negative       0.85      0.79      0.82      1779
    positive       0.81      0.86      0.84      1807

    accuracy                           0.83      3586
   macro avg       0.83      0.83      0.83      3586
weighted avg       0.83      0.83      0.83      3586



 75%|███████▌  | 3/4 [00:21<00:06,  6.83s/it]

************* electronics *************
              precision    recall  f1-score   support

    negative       0.85      0.84      0.84      2824
    positive       0.84      0.85      0.85      2857

    accuracy                           0.84      5681
   macro avg       0.85      0.84      0.84      5681
weighted avg       0.84      0.84      0.84      5681



100%|██████████| 4/4 [00:26<00:00,  6.26s/it]

************* kitchen *************
              precision    recall  f1-score   support

    negative       0.86      0.86      0.86      2991
    positive       0.86      0.85      0.86      2954

    accuracy                           0.86      5945
   macro avg       0.86      0.86      0.86      5945
weighted avg       0.86      0.86      0.86      5945



100%|██████████| 4/4 [00:26<00:00,  6.63s/it]


### Naive Bayes

In [7]:
# Por cada categoria se crea un modelo
for cate in tqdm(train_data['folder'].unique()):
    cate_train_data = train_data[train_data['folder']==cate]
    cate_test_data = test_data[test_data['folder']==cate]
    
    # Pipeline de preprocesamiento de datos usado tambien para el notebook de 20N
    tfidf_pipeline = build_preprocess_pipeline('tfidf').fit(cate_train_data['text'])
    X_train_tfidf_transformed = tfidf_pipeline.transform(cate_train_data['text'])
    
    # Clasificador multinomial de naive bayes
    nb_estimator = MultinomialNB(alpha=1.0)

    cate_nb = nb_estimator.fit(X_train_tfidf_transformed, cate_train_data['label'])
    
    ## Probar el modelo y obtener las metricas
    X_test_transformed_tfidf = tfidf_pipeline.transform(cate_test_data['text'])
    y_pred = cate_nb.predict(X_test_transformed_tfidf)
    
    print(f'************* {cate} *************')
    print(classification_report(cate_test_data['label'], y_pred))

 25%|██▌       | 1/4 [00:07<00:23,  7.84s/it]

************* books *************
              precision    recall  f1-score   support

    negative       0.81      0.82      0.82      2201
    positive       0.83      0.82      0.82      2264

    accuracy                           0.82      4465
   macro avg       0.82      0.82      0.82      4465
weighted avg       0.82      0.82      0.82      4465



 50%|█████     | 2/4 [00:14<00:14,  7.40s/it]

************* dvd *************
              precision    recall  f1-score   support

    negative       0.82      0.83      0.82      1779
    positive       0.83      0.83      0.83      1807

    accuracy                           0.83      3586
   macro avg       0.83      0.83      0.83      3586
weighted avg       0.83      0.83      0.83      3586



 75%|███████▌  | 3/4 [00:20<00:06,  6.46s/it]

************* electronics *************
              precision    recall  f1-score   support

    negative       0.84      0.84      0.84      2824
    positive       0.84      0.84      0.84      2857

    accuracy                           0.84      5681
   macro avg       0.84      0.84      0.84      5681
weighted avg       0.84      0.84      0.84      5681



100%|██████████| 4/4 [00:24<00:00,  6.23s/it]

************* kitchen *************
              precision    recall  f1-score   support

    negative       0.86      0.83      0.85      2991
    positive       0.84      0.86      0.85      2954

    accuracy                           0.85      5945
   macro avg       0.85      0.85      0.85      5945
weighted avg       0.85      0.85      0.85      5945






Se observa un resultado postivo para cada uno de las categorias pues en todos la precision es mayor al 80% para ambos clasificadores

## TF

Para los siguientes clasificadores se utiliza la frecuencia de los terminos para vectorizar el texto para usar como entrada a los modelos

### Logistic Regression

In [8]:
# Por cada categoria se crea un modelo
for cate in tqdm(train_data['folder'].unique()):
    
    cate_train_data = train_data[train_data['folder']==cate]
    cate_test_data = test_data[test_data['folder']==cate]
    
    # Pipeline de preprocesamiento de datos usado tambien para el notebook de 20N
    cnt_pipeline = build_preprocess_pipeline('count').fit(cate_train_data['text'])
    X_train_cnt_transformed = cnt_pipeline.transform(cate_train_data['text'])
    
    # Clasificador de regresion logistica
    logistic_estimator = LogisticRegression(n_jobs=-1, random_state=42, 
                                            class_weight=None, solver='saga',
                                            max_iter=1000, penalty='l2',
                                            tol=1e-2, C=1
                                            )

    cate_lr = logistic_estimator.fit(X_train_cnt_transformed, cate_train_data['label'])
    
    ## Probar el modelo usando el conjunto de pruebas
    X_test_transformed_cnt = cnt_pipeline.transform(cate_test_data['text'])
    y_pred = cate_lr.predict(X_test_transformed_cnt)
    
    print(f'************* {cate} *************')
    print(classification_report(cate_test_data['label'], y_pred))

 25%|██▌       | 1/4 [00:02<00:08,  2.79s/it]

************* electronics *************
              precision    recall  f1-score   support

    negative       0.86      0.85      0.86      2824
    positive       0.86      0.87      0.86      2857

    accuracy                           0.86      5681
   macro avg       0.86      0.86      0.86      5681
weighted avg       0.86      0.86      0.86      5681



 50%|█████     | 2/4 [00:05<00:05,  2.59s/it]

************* kitchen *************
              precision    recall  f1-score   support

    negative       0.86      0.85      0.86      2991
    positive       0.85      0.86      0.86      2954

    accuracy                           0.86      5945
   macro avg       0.86      0.86      0.86      5945
weighted avg       0.86      0.86      0.86      5945



 75%|███████▌  | 3/4 [00:08<00:03,  3.06s/it]

************* dvd *************
              precision    recall  f1-score   support

    negative       0.84      0.77      0.80      1779
    positive       0.79      0.85      0.82      1807

    accuracy                           0.81      3586
   macro avg       0.81      0.81      0.81      3586
weighted avg       0.81      0.81      0.81      3586



100%|██████████| 4/4 [00:13<00:00,  3.32s/it]

************* books *************
              precision    recall  f1-score   support

    negative       0.84      0.81      0.82      2201
    positive       0.82      0.84      0.83      2264

    accuracy                           0.83      4465
   macro avg       0.83      0.83      0.83      4465
weighted avg       0.83      0.83      0.83      4465






### Naive Bayes

In [8]:
# Por cada categoria se crea un modelo
for cate in tqdm(train_data['folder'].unique()):
    
    cate_train_data = train_data[train_data['folder']==cate]
    cate_test_data = test_data[test_data['folder']==cate]
    
    # Pipeline de preprocesamiento de datos usado tambien para el notebook de 20N
    cnt_pipeline = build_preprocess_pipeline('count').fit(cate_train_data['text'])
    X_train_cnt_transformed = cnt_pipeline.transform(cate_train_data['text'])
    
    # Clasificador de Naive Bayes multinomial
    nb_estimator = MultinomialNB(alpha=1.0)

    cate_nb = nb_estimator.fit(X_train_tfidf_transformed, cate_train_data['label'])
    
    ## Probar el modelo con el conjunto de pruebas
    
    X_test_transformed_cnt = cnt_pipeline.transform(cate_test_data['text'])
    y_pred = cate_nb.predict(X_test_transformed_cnt)
    
    print(f'************* {cate} *************')
    print(classification_report(cate_test_data['label'], y_pred))

 25%|██▌       | 1/4 [00:06<00:19,  6.53s/it]

************* books *************
              precision    recall  f1-score   support

    negative       0.52      0.50      0.51      2201
    positive       0.53      0.54      0.53      2264

    accuracy                           0.52      4465
   macro avg       0.52      0.52      0.52      4465
weighted avg       0.52      0.52      0.52      4465



 50%|█████     | 2/4 [00:12<00:12,  6.35s/it]

************* dvd *************
              precision    recall  f1-score   support

    negative       0.51      0.45      0.48      1779
    positive       0.51      0.56      0.54      1807

    accuracy                           0.51      3586
   macro avg       0.51      0.51      0.51      3586
weighted avg       0.51      0.51      0.51      3586



 75%|███████▌  | 3/4 [00:17<00:05,  5.62s/it]

************* electronics *************
              precision    recall  f1-score   support

    negative       0.50      0.52      0.51      2824
    positive       0.51      0.50      0.50      2857

    accuracy                           0.51      5681
   macro avg       0.51      0.51      0.51      5681
weighted avg       0.51      0.51      0.51      5681



100%|██████████| 4/4 [00:21<00:00,  5.47s/it]

************* kitchen *************
              precision    recall  f1-score   support

    negative       0.87      0.84      0.86      2991
    positive       0.84      0.88      0.86      2954

    accuracy                           0.86      5945
   macro avg       0.86      0.86      0.86      5945
weighted avg       0.86      0.86      0.86      5945






Se observa un peor rendimiento usando `tf` especialmente usando Naive Bayes para la categoria de electronics y dvd, lo que tambien puede sugerir que debido a que estas dos categorias podrian llegar a ser similares, solo el conteo de los terminos no es suficiente para capturar toda la informacion relevante a la hora de determinar si corresponde a DVD y electronics.

## Lexicons

# 4.0 Clasificador para todas las categorias

Ahora construimos un solo clasificador para todas las categorias donde se determina si el review es positivo o negativo unicamente

### Preprocesamiento

In [9]:
# Se construye el pipeline de procesamiento para todo el conjunto de datos de entrenamiento
tfidf_pipeline = build_preprocess_pipeline('tfidf').fit(train_data['text'])
X_train_tfidf_transformed = tfidf_pipeline.transform(train_data['text'])

cnt_pipeline = build_preprocess_pipeline('count').fit(train_data['text'])
X_train_cnt_transformed = cnt_pipeline.transform(train_data['text'])

## TF - IDF

En esta seccion usamos `tf-idf` como metodo de vectorizacion del texto

### Logistic Regression

In [11]:
# Se utiliza clasificador de regresion logistica
logistic_estimator = LogisticRegression(n_jobs=-1, random_state=42, 
                                        class_weight=None, solver='saga',
                                        max_iter=1000, penalty='l2',
                                        tol=1e-2, C=1
                                        )

cate_lr = logistic_estimator.fit(X_train_tfidf_transformed, train_data['label'])

## Se prueba el modelo y arrojan los resultados 
X_test_transformed_tfidf = tfidf_pipeline.transform(test_data['text'])
y_pred = cate_lr.predict(X_test_transformed_tfidf)

print(classification_report(test_data['label'], y_pred))

              precision    recall  f1-score   support

    negative       0.85      0.85      0.85      9795
    positive       0.85      0.85      0.85      9882

    accuracy                           0.85     19677
   macro avg       0.85      0.85      0.85     19677
weighted avg       0.85      0.85      0.85     19677



### Naive Bayes

In [12]:
# Clasificador de Naive Bayes multinomial
nb_estimator = MultinomialNB(alpha=1.0)

cate_nb = nb_estimator.fit(X_train_tfidf_transformed, train_data['label'])

## Se prueba el modelo y se imprimen los resultados
y_pred = cate_lr.predict(X_test_transformed_tfidf)

print(classification_report(test_data['label'], y_pred))

              precision    recall  f1-score   support

    negative       0.85      0.85      0.85      9795
    positive       0.85      0.85      0.85      9882

    accuracy                           0.85     19677
   macro avg       0.85      0.85      0.85     19677
weighted avg       0.85      0.85      0.85     19677



Ambos modelos arrojan modelos muy buenos donde la precision es del `0.85`

## TF

Ahora se utiliza una matriz con la frecuencia de los terminos como entrada de los modelos

### Logistic Regression

In [13]:
# Clasificador de regresion logistica
logistic_estimator = LogisticRegression(n_jobs=-1, random_state=42, 
                                        class_weight=None, solver='saga',
                                        max_iter=1000, penalty='l2',
                                        tol=1e-2, C=1
                                        )

cate_lr = logistic_estimator.fit(X_train_cnt_transformed, train_data['label'])

## Probar el modelo
X_test_transformed_cnt = cnt_pipeline.transform(test_data['text'])
y_pred = cate_lr.predict(X_test_transformed_cnt)

print(classification_report(test_data['label'], y_pred))

              precision    recall  f1-score   support

    negative       0.86      0.84      0.85      9795
    positive       0.84      0.86      0.85      9882

    accuracy                           0.85     19677
   macro avg       0.85      0.85      0.85     19677
weighted avg       0.85      0.85      0.85     19677



### Naive Bayes

In [14]:
# Clasificador de Naive Bayes multinomial
logistic_estimator = MultinomialNB(alpha=1.0)

cate_lr = logistic_estimator.fit(X_train_cnt_transformed, train_data['label'])

## Test the model
y_pred = cate_lr.predict(X_test_transformed_cnt)

print(classification_report(test_data['label'], y_pred))

              precision    recall  f1-score   support

    negative       0.82      0.82      0.82      9795
    positive       0.83      0.83      0.83      9882

    accuracy                           0.83     19677
   macro avg       0.83      0.83      0.83     19677
weighted avg       0.83      0.83      0.83     19677



Con Naive Bayes se obtiene un resultado ligeramente peor comparado con el resto de los modelos, aunque no se evidencia la misma dificultad que al clasificar por categoria.

## Lexicons