## Sentiment Analysis

This notebook presents sentiment analysis models for several domains.

## 1. Import Libraries

In [1]:
from glob import glob
from functions import create_sentiment_dataset, build_preprocess_pipeline, build_preprocess_pipeline_lexicon
from tqdm import tqdm

from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression


## 2.0 Data Loading

In [2]:
# Load the dataset - change the path as needed
files = glob('data/Multi Domain Sentiment/processed_acl/*/*')

In [3]:
df = create_sentiment_dataset(files)
df

Unnamed: 0,raw_text,label,text,folder,file
0,i_forget:1 is_no:1 no_special:1 old:2 messy:1 ...,negative,i forget is no no special old messy probl...,kitchen,unlabeled.review
1,lasted_less:1 a_chance:1 chance_to:1 the_motor...,negative,lasted less a chance chance to the motor g...,kitchen,unlabeled.review
2,cooper_cooler:1 bottles:1 i:1 cooler:1 (2-3_mi...,positive,cooper cooler bottles i cooler (2-3 mins ...,kitchen,unlabeled.review
3,the_idea:1 quick_marinate:1 to_clean-up.:1 con...,negative,the idea quick marinate to clean-up. contai...,kitchen,unlabeled.review
4,small_i:1 though_only:1 craft_i:1 full_grip:1 ...,negative,small i though only craft i full grip my h...,kitchen,unlabeled.review
...,...,...,...,...,...
27672,z:10 only:1 course_of:1 no:5 help:1 plenty:1 l...,positive,z only course of no help plenty like he...,dvd,positive.review
27673,well:1 i:1 interesting_as:1 raiders:1 liked_th...,positive,well i interesting as raiders liked this ...,dvd,positive.review
27674,this_movie:1 is_very:1 enjoys_a:1 you're:1 yet...,positive,this movie is very enjoys a you're yet ver...,dvd,positive.review
27675,episodes_ommitted:1 show:2 gareth's:1 america:...,positive,episodes ommitted show gareth's america te...,dvd,positive.review
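
The table suggests that each `raw_text` entry is a sequence of `feature:count` items and that `text` keeps the features with underscores turned into spaces. The sketch below shows one way such a line could be parsed. It is not the actual `create_sentiment_dataset` implementation, which is not shown here; the helper name `parse_processed_review` and the trailing `#label#:` item are assumptions about the file format.

```python
def parse_processed_review(line):
    """Split a 'feature:count ...' line into plain text and a label.

    Assumes (hypothetically) that the label is stored as a final
    '#label#:<value>' item on the same line.
    """
    tokens = []
    label = None
    for item in line.split():
        # rpartition keeps any ':' inside the feature itself intact
        feature, sep, value = item.rpartition(":")
        if not sep:
            continue  # skip items that do not follow the feature:count shape
        if feature == "#label#":
            label = value
        else:
            # n-gram features use '_' between words; counts are dropped
            tokens.append(feature.replace("_", " "))
    return " ".join(tokens), label


text, label = parse_processed_review(
    "i_forget:1 is_no:1 old:2 messy:1 #label#:negative"
)
print(text)   # i forget is no old messy
print(label)  # negative
```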


In [4]:
# Count the reviews per category and source file
df.groupby(['folder','file']).size().unstack().fillna(0)

folder,negative.review,positive.review,unlabeled.review
books,1000,1000,4465
dvd,1000,1000,3586
electronics,1000,1000,5681
kitchen,1000,1000,5945


### Train-Test Split

In [5]:
# Training data: reviews from the positive.review and negative.review files
train_data = df[df.file!='unlabeled.review'].reset_index(drop=True)
# Test set: reviews from the unlabeled.review files (these rows still carry a label, as the table above shows)
test_data = df[df.file=='unlabeled.review'].reset_index(drop=True)

## 3.0 Per-Category Classifier

This section builds one classifier for each of the 4 categories (books, dvd, electronics, kitchen).

### TF-IDF

The following classifiers use `tf-idf` to vectorize the text.
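
To make the weighting concrete, here is a minimal pure-Python sketch of the textbook formula `tf * log(N / df)`. It is only an illustration: what `build_preprocess_pipeline('tfidf')` does internally is not shown in this notebook, and library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their values differ. The function name `tfidf_vectors` and the toy documents are made up for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return vectors


docs = ["great blender great price", "terrible blender", "great movie"]
vectors = tfidf_vectors(docs)

# 'price' occurs in a single document, so it outweighs 'blender',
# which occurs in two of the three documents.
print(vectors[0])
```

Terms that appear in every document get a weight of zero under this formula, which is the intuition behind preferring `tf-idf` over raw counts for very common words.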

#### Logistic Regression

In [6]:
# Build one model per category
for cate in tqdm(train_data['folder'].unique()):
    
    cate_train_data = train_data[train_data['folder']==cate]
    cate_test_data = test_data[test_data['folder']==cate]
    
    # Preprocessing pipeline, also used in the 20N notebook
    tfidf_pipeline = build_preprocess_pipeline('tfidf').fit(cate_train_data['text'])
    X_train_tfidf_transformed = tfidf_pipeline.transform(cate_train_data['text'])
    
    # Logistic regression classifier
    logistic_estimator = LogisticRegression(n_jobs=-1, random_state=42, 
                                            class_weight=None, solver='saga',
                                            max_iter=1000, penalty='l2',
                                            tol=1e-2, C=1
                                            )

    cate_lr = logistic_estimator.fit(X_train_tfidf_transformed, cate_train_data['label'])
    
    # Test the model and report the metrics
    
    X_test_transformed_tfidf = tfidf_pipeline.transform(cate_test_data['text'])
    y_pred = cate_lr.predict(X_test_transformed_tfidf)
    
    print(f'************* {cate} *************')
    print(classification_report(cate_test_data['label'], y_pred))

 25%|██▌       | 1/4 [00:03<00:11,  3.82s/it]

************* kitchen *************
              precision    recall  f1-score   support

    negative       0.86      0.86      0.86      2991
    positive       0.86      0.85      0.86      2954

    accuracy                           0.86      5945
   macro avg       0.86      0.86      0.86      5945
weighted avg       0.86      0.86      0.86      5945



 50%|█████     | 2/4 [00:08<00:08,  4.16s/it]

************* electronics *************
              precision    recall  f1-score   support

    negative       0.85      0.84      0.84      2824
    positive       0.84      0.85      0.85      2857

    accuracy                           0.85      5681
   macro avg       0.85      0.85      0.85      5681
weighted avg       0.85      0.85      0.85      5681



 75%|███████▌  | 3/4 [00:14<00:05,  5.03s/it]

************* books *************
              precision    recall  f1-score   support

    negative       0.82      0.82      0.82      2201
    positive       0.82      0.83      0.83      2264

    accuracy                           0.82      4465
   macro avg       0.82      0.82      0.82      4465
weighted avg       0.82      0.82      0.82      4465



100%|██████████| 4/4 [00:20<00:00,  5.01s/it]

************* dvd *************
              precision    recall  f1-score   support

    negative       0.85      0.80      0.82      1779
    positive       0.81      0.86      0.84      1807

    accuracy                           0.83      3586
   macro avg       0.83      0.83      0.83      3586
weighted avg       0.83      0.83      0.83      3586






#### Naive Bayes

In [7]:
# Build one model per category
for cate in tqdm(train_data['folder'].unique()):
    cate_train_data = train_data[train_data['folder']==cate]
    cate_test_data = test_data[test_data['folder']==cate]
    
    # Preprocessing pipeline, also used in the 20N notebook
    tfidf_pipeline = build_preprocess_pipeline('tfidf').fit(cate_train_data['text'])
    X_train_tfidf_transformed = tfidf_pipeline.transform(cate_train_data['text'])
    
    # Multinomial Naive Bayes classifier
    nb_estimator = MultinomialNB(alpha=1.0)

    cate_nb = nb_estimator.fit(X_train_tfidf_transformed, cate_train_data['label'])
    
    # Test the model and report the metrics
    X_test_transformed_tfidf = tfidf_pipeline.transform(cate_test_data['text'])
    y_pred = cate_nb.predict(X_test_transformed_tfidf)
    
    print(f'************* {cate} *************')
    print(classification_report(cate_test_data['label'], y_pred))

 25%|██▌       | 1/4 [00:03<00:11,  3.72s/it]

************* kitchen *************
              precision    recall  f1-score   support

    negative       0.86      0.83      0.85      2991
    positive       0.84      0.86      0.85      2954

    accuracy                           0.85      5945
   macro avg       0.85      0.85      0.85      5945
weighted avg       0.85      0.85      0.85      5945



 50%|█████     | 2/4 [00:08<00:08,  4.13s/it]

************* electronics *************
              precision    recall  f1-score   support

    negative       0.84      0.84      0.84      2824
    positive       0.84      0.84      0.84      2857

    accuracy                           0.84      5681
   macro avg       0.84      0.84      0.84      5681
weighted avg       0.84      0.84      0.84      5681



 75%|███████▌  | 3/4 [00:14<00:05,  5.02s/it]

************* books *************
              precision    recall  f1-score   support

    negative       0.81      0.83      0.82      2201
    positive       0.83      0.82      0.82      2264

    accuracy                           0.82      4465
   macro avg       0.82      0.82      0.82      4465
weighted avg       0.82      0.82      0.82      4465



100%|██████████| 4/4 [00:19<00:00,  4.95s/it]

************* dvd *************
              precision    recall  f1-score   support

    negative       0.82      0.83      0.82      1779
    positive       0.83      0.83      0.83      1807

    accuracy                           0.83      3586
   macro avg       0.83      0.83      0.83      3586
weighted avg       0.83      0.83      0.83      3586






Results are good for every category: precision is above 80% for both classifiers in all four.
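
As a reminder of what the report columns mean, the sketch below recomputes precision, recall, and F1 for one class from a pair of label lists. The helper `precision_recall_f1` and the toy labels are illustrative and not part of the notebook; the cells above rely on `classification_report` instead.

```python
def precision_recall_f1(y_true, y_pred, target):
    """Compute precision, recall and F1 for the class `target`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels, made up for the example
y_true = ["negative", "negative", "positive", "positive", "positive"]
y_pred = ["negative", "positive", "positive", "positive", "negative"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, "positive")

# F1 is the harmonic mean of precision and recall. Using the rounded
# dvd/negative values from the logistic regression report (0.85, 0.80):
f1_dvd_negative = 2 * 0.85 * 0.80 / (0.85 + 0.80)
print(round(f1_dvd_negative, 2))
```

Because the report prints rounded values, recomputing F1 from the printed precision and recall is only approximate and can differ in the last digit for some rows.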

### TF

The following classifiers vectorize the text using raw term frequencies (`tf`), which are then fed to the models.
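
A term-frequency representation can be sketched in a few lines of plain Python. This is an illustration only and assumes whitespace tokenization; the real `build_preprocess_pipeline('count')` is not shown in this notebook and may tokenize or filter differently. The name `count_matrix` is made up for the example.

```python
from collections import Counter


def count_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [doc.split() for doc in docs]
    # Sorted vocabulary gives every term a stable column index
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


vocabulary, rows = count_matrix(["great blender great price", "terrible blender"])
print(vocabulary)  # ['blender', 'great', 'price', 'terrible']
print(rows)        # [[1, 2, 1, 0], [1, 0, 0, 1]]
```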

#### Logistic Regression

In [8]:
# Build one model per category
for cate in tqdm(train_data['folder'].unique()):
    
    cate_train_data = train_data[train_data['folder']==cate]
    cate_test_data = test_data[test_data['folder']==cate]
    
    # Preprocessing pipeline, also used in the 20N notebook
    cnt_pipeline = build_preprocess_pipeline('count').fit(cate_train_data['text'])
    X_train_cnt_transformed = cnt_pipeline.transform(cate_train_data['text'])
    
    # Logistic regression classifier
    logistic_estimator = LogisticRegression(n_jobs=-1, random_state=42, 
                                            class_weight=None, solver='saga',
                                            max_iter=1000, penalty='l2',
                                            tol=1e-2, C=1
                                            )

    cate_lr = logistic_estimator.fit(X_train_cnt_transformed, cate_train_data['label'])
    
    # Test the model on the test set
    X_test_transformed_cnt = cnt_pipeline.transform(cate_test_data['text'])
    y_pred = cate_lr.predict(X_test_transformed_cnt)
    
    print(f'************* {cate} *************')
    print(classification_report(cate_test_data['label'], y_pred))

 25%|██▌       | 1/4 [00:03<00:11,  3.80s/it]

************* kitchen *************
              precision    recall  f1-score   support

    negative       0.86      0.85      0.86      2991
    positive       0.85      0.86      0.86      2954

    accuracy                           0.86      5945
   macro avg       0.86      0.86      0.86      5945
weighted avg       0.86      0.86      0.86      5945



 50%|█████     | 2/4 [00:08<00:09,  4.58s/it]

************* electronics *************
              precision    recall  f1-score   support

    negative       0.86      0.85      0.86      2824
    positive       0.86      0.87      0.86      2857

    accuracy                           0.86      5681
   macro avg       0.86      0.86      0.86      5681
weighted avg       0.86      0.86      0.86      5681



 75%|███████▌  | 3/4 [00:15<00:05,  5.40s/it]

************* books *************
              precision    recall  f1-score   support

    negative       0.84      0.81      0.82      2201
    positive       0.82      0.84      0.83      2264

    accuracy                           0.83      4465
   macro avg       0.83      0.83      0.83      4465
weighted avg       0.83      0.83      0.83      4465



100%|██████████| 4/4 [00:21<00:00,  5.25s/it]

************* dvd *************
              precision    recall  f1-score   support

    negative       0.84      0.77      0.80      1779
    positive       0.79      0.85      0.82      1807

    accuracy                           0.81      3586
   macro avg       0.81      0.81      0.81      3586
weighted avg       0.81      0.81      0.81      3586






#### Naive Bayes

In [9]:
# Build one model per category
for cate in tqdm(train_data['folder'].unique()):
    
    cate_train_data = train_data[train_data['folder']==cate]
    cate_test_data = test_data[test_data['folder']==cate]
    
    # Preprocessing pipeline, also used in the 20N notebook
    cnt_pipeline = build_preprocess_pipeline('count').fit(cate_train_data['text'])
    X_train_cnt_transformed = cnt_pipeline.transform(cate_train_data['text'])

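    # Multinomial Naive Bayes classifier
    nb_estimator = MultinomialNB(alpha=1.0)

    # Fit on the count features built above. Fitting on X_train_tfidf_transformed here
    # would reuse a matrix left over from an earlier cell, which matches the near-chance
    # scores recorded below for kitchen, electronics and books.
    cate_nb = nb_estimator.fit(X_train_cnt_transformed, cate_train_data['label'])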
    
    # Test the model on the test set
    
    X_test_transformed_cnt = cnt_pipeline.transform(cate_test_data['text'])
    y_pred = cate_nb.predict(X_test_transformed_cnt)
    
    print(f'************* {cate} *************')
    print(classification_report(cate_test_data['label'], y_pred))

 25%|██▌       | 1/4 [00:03<00:11,  3.80s/it]

************* kitchen *************
              precision    recall  f1-score   support

    negative       0.51      0.45      0.48      2991
    positive       0.50      0.57      0.53      2954

    accuracy                           0.51      5945
   macro avg       0.51      0.51      0.51      5945
weighted avg       0.51      0.51      0.51      5945



 50%|█████     | 2/4 [00:08<00:08,  4.19s/it]

************* electronics *************
              precision    recall  f1-score   support

    negative       0.51      0.42      0.46      2824
    positive       0.51      0.61      0.56      2857

    accuracy                           0.51      5681
   macro avg       0.51      0.51      0.51      5681
weighted avg       0.51      0.51      0.51      5681



 75%|███████▌  | 3/4 [00:14<00:05,  5.08s/it]

************* books *************
              precision    recall  f1-score   support

    negative       0.49      0.48      0.48      2201
    positive       0.50      0.50      0.50      2264

    accuracy                           0.49      4465
   macro avg       0.49      0.49      0.49      4465
weighted avg       0.49      0.49      0.49      4465



100%|██████████| 4/4 [00:20<00:00,  5.00s/it]

************* dvd *************
              precision    recall  f1-score   support

    negative       0.80      0.85      0.83      1779
    positive       0.84      0.79      0.82      1807

    accuracy                           0.82      3586
   macro avg       0.82      0.82      0.82      3586
weighted avg       0.82      0.82      0.82      3586






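Performance with `tf` is worse, especially for Naive Bayes: kitchen, electronics, and books score near chance (accuracy between 0.49 and 0.51), while dvd reaches 0.82. One possible explanation is that the terminology used in these reviews is not informative enough for sentiment analysis. However, the recorded outputs are also what one would expect if the Naive Bayes model had been fit on `X_train_tfidf_transformed`, which at that point still holds the dvd features from the earlier TF-IDF cell, rather than on the count features, so these numbers should be rechecked before drawing conclusions.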

### Lexicons

The following models represent each text with positive/negative scores derived from a lexicon.
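
A lexicon-based representation of this kind can be sketched as follows. This is a simplified illustration and not the implementation of `build_preprocess_pipeline_lexicon`, which is not shown in this notebook. The tiny `TOY_LEXICON` and its scores are invented for the example; SentiWordNet itself assigns positive and negative scores per synset rather than per plain word.

```python
# Hypothetical mini-lexicon: word -> (positive score, negative score)
TOY_LEXICON = {
    "great": (0.75, 0.0),
    "good": (0.625, 0.0),
    "terrible": (0.0, 0.875),
    "messy": (0.0, 0.5),
}


def lexicon_features(text, lexicon):
    """Sum the positive and negative scores of every known word in `text`."""
    pos_total = 0.0
    neg_total = 0.0
    for token in text.lower().split():
        # Words missing from the lexicon contribute nothing
        pos, neg = lexicon.get(token, (0.0, 0.0))
        pos_total += pos
        neg_total += neg
    return [pos_total, neg_total]


features = lexicon_features("Great price but messy", TOY_LEXICON)
print(features)  # [0.75, 0.5]
```

Each review is reduced to two numbers here, which is far less information than a bag-of-words vector carries.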

#### Logistic Regression

In [10]:
# Build one model per category
for cate in tqdm(train_data['folder'].unique()):
    
    cate_train_data = train_data[train_data['folder'] == cate]
    cate_test_data = test_data[test_data['folder'] == cate]
    
    # Build and fit the preprocessing pipeline
    pipeline = build_preprocess_pipeline_lexicon('data/lexicon/SentiWordNet_3.0.0.txt')
    X_train_sentiment = pipeline.fit_transform(cate_train_data['text'])

    # Logistic regression classifier
    logistic_estimator = LogisticRegression()
    cate_lr = logistic_estimator.fit(X_train_sentiment, cate_train_data['label'])
    
    # Test the model on the test set
    X_test_sentiment = pipeline.transform(cate_test_data['text'])
    y_pred = cate_lr.predict(X_test_sentiment)
    
    print(f'************* {cate} *************')
    print(classification_report(cate_test_data['label'], y_pred))

 25%|██▌       | 1/4 [00:02<00:06,  2.10s/it]

************* kitchen *************
              precision    recall  f1-score   support

    negative       0.64      0.67      0.65      2991
    positive       0.65      0.62      0.63      2954

    accuracy                           0.64      5945
   macro avg       0.64      0.64      0.64      5945
weighted avg       0.64      0.64      0.64      5945



 50%|█████     | 2/4 [00:04<00:04,  2.24s/it]

************* electronics *************
              precision    recall  f1-score   support

    negative       0.59      0.66      0.62      2824
    positive       0.62      0.56      0.59      2857

    accuracy                           0.61      5681
   macro avg       0.61      0.61      0.61      5681
weighted avg       0.61      0.61      0.61      5681



 75%|███████▌  | 3/4 [00:07<00:02,  2.62s/it]

************* books *************
              precision    recall  f1-score   support

    negative       0.59      0.64      0.62      2201
    positive       0.62      0.57      0.60      2264

    accuracy                           0.61      4465
   macro avg       0.61      0.61      0.61      4465
weighted avg       0.61      0.61      0.61      4465



100%|██████████| 4/4 [00:10<00:00,  2.53s/it]

************* dvd *************
              precision    recall  f1-score   support

    negative       0.60      0.62      0.61      1779
    positive       0.62      0.59      0.60      1807

    accuracy                           0.61      3586
   macro avg       0.61      0.61      0.61      3586
weighted avg       0.61      0.61      0.61      3586






## 4.0 Classifier for All Categories

We now build a single classifier for all categories that only determines whether a review is positive or negative.

### Preprocessing

In [11]:
# Build the preprocessing pipelines on the full training set
tfidf_pipeline = build_preprocess_pipeline('tfidf').fit(train_data['text'])
X_train_tfidf_transformed = tfidf_pipeline.transform(train_data['text'])

cnt_pipeline = build_preprocess_pipeline('count').fit(train_data['text'])
X_train_cnt_transformed = cnt_pipeline.transform(train_data['text'])

lex_pipeline = build_preprocess_pipeline_lexicon('data/lexicon/SentiWordNet_3.0.0.txt')
X_train_lex_transformed = lex_pipeline.fit_transform(train_data['text'])

### TF-IDF

This section uses `tf-idf` to vectorize the text.

#### Logistic Regression

In [12]:
# Logistic regression classifier
logistic_estimator = LogisticRegression(n_jobs=-1, random_state=42, 
                                        class_weight=None, solver='saga',
                                        max_iter=1000, penalty='l2',
                                        tol=1e-2, C=1
                                        )

cate_lr = logistic_estimator.fit(X_train_tfidf_transformed, train_data['label'])

# Test the model and print the results
X_test_transformed_tfidf = tfidf_pipeline.transform(test_data['text'])
y_pred = cate_lr.predict(X_test_transformed_tfidf)

print(classification_report(test_data['label'], y_pred))

              precision    recall  f1-score   support

    negative       0.85      0.85      0.85      9795
    positive       0.85      0.85      0.85      9882

    accuracy                           0.85     19677
   macro avg       0.85      0.85      0.85     19677
weighted avg       0.85      0.85      0.85     19677



#### Naive Bayes

In [13]:
# Multinomial Naive Bayes classifier
nb_estimator = MultinomialNB(alpha=1.0)

cate_nb = nb_estimator.fit(X_train_tfidf_transformed, train_data['label'])

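# Test the model and print the results.
# Predict with the Naive Bayes model (cate_nb); predicting with cate_lr here
# would simply repeat the logistic regression report, as in the output recorded below.
y_pred = cate_nb.predict(X_test_transformed_tfidf)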

print(classification_report(test_data['label'], y_pred))

              precision    recall  f1-score   support

    negative       0.85      0.85      0.85      9795
    positive       0.85      0.85      0.85      9882

    accuracy                           0.85     19677
   macro avg       0.85      0.85      0.85     19677
weighted avg       0.85      0.85      0.85     19677



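Both reports show a precision of `0.85`. The Naive Bayes report is identical to the logistic regression one, which is what happens when predictions are taken from `cate_lr` instead of `cate_nb`, so it should be re-run before being read as an independent Naive Bayes result.

### TF

A term-frequency matrix is now used as input to the models.

#### Logistic Regression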

In [14]:
# Logistic regression classifier
logistic_estimator = LogisticRegression(n_jobs=-1, random_state=42, 
                                        class_weight=None, solver='saga',
                                        max_iter=1000, penalty='l2',
                                        tol=1e-2, C=1
                                        )

cate_lr = logistic_estimator.fit(X_train_cnt_transformed, train_data['label'])

# Test the model
X_test_transformed_cnt = cnt_pipeline.transform(test_data['text'])
y_pred = cate_lr.predict(X_test_transformed_cnt)

print(classification_report(test_data['label'], y_pred))

              precision    recall  f1-score   support

    negative       0.86      0.84      0.85      9795
    positive       0.84      0.86      0.85      9882

    accuracy                           0.85     19677
   macro avg       0.85      0.85      0.85     19677
weighted avg       0.85      0.85      0.85     19677



#### Naive Bayes

In [15]:
# Multinomial Naive Bayes classifier
nb_estimator = MultinomialNB(alpha=1.0)

cate_nb = nb_estimator.fit(X_train_cnt_transformed, train_data['label'])

# Test the model
y_pred = cate_nb.predict(X_test_transformed_cnt)

print(classification_report(test_data['label'], y_pred))

              precision    recall  f1-score   support

    negative       0.82      0.83      0.82      9795
    positive       0.83      0.83      0.83      9882

    accuracy                           0.83     19677
   macro avg       0.83      0.83      0.83     19677
weighted avg       0.83      0.83      0.83     19677



Naive Bayes scores slightly lower than the other models (0.83 accuracy versus 0.85), although it does not show the same difficulty seen when classifying per category.

### Lexicons

We now use features extracted from the lexicon: a positive/negative score for each review.

#### Logistic Regression

In [16]:
# Logistic regression classifier
logistic_estimator = LogisticRegression(n_jobs=-1, random_state=42, 
                                        class_weight=None, solver='saga',
                                        max_iter=1000, penalty='l2',
                                        tol=1e-2, C=1
                                        )

cate_lr = logistic_estimator.fit(X_train_lex_transformed, train_data['label'])

# Test the model
X_test_transformed_lex = lex_pipeline.transform(test_data['text'])
y_pred = cate_lr.predict(X_test_transformed_lex)

print(classification_report(test_data['label'], y_pred))

              precision    recall  f1-score   support

    negative       0.60      0.66      0.63      9795
    positive       0.63      0.58      0.60      9882

    accuracy                           0.62     19677
   macro avg       0.62      0.62      0.61     19677
weighted avg       0.62      0.62      0.61     19677



Because the score feature extracted from the text via the lexicon is quite simple, it does not match the bag-of-words results (0.62 accuracy versus roughly 0.85), but there is room to improve the model.