# Detecting Positive and Negative Reviews with BERT - Final Project

In this project, we apply natural language processing (NLP) with BERT (Bidirectional Encoder Representations from Transformers) to classify customer reviews of a clothing store as positive or negative. We use a dataset of previously labeled reviews to train and evaluate the model.

<center><img src="../img/fashion_boutique.webp" alt="" title="FashionBoutique" width="150" /></center>





## 1. Problem to solve

**Problem:** _Reduce returns at a local clothing store. As the store owner, I want to see why customers return garments, so that I can improve the garments' characteristics or learn what kind of clothing shoppers want the store to carry._

In [74]:
import pandas as pd
# tarfile is used to extract the data from the .tgz archive
import tarfile
import os
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [75]:
# Build the path to the original dataset
current_dir = os.getcwd()
print(current_dir)

file_path = os.path.join(current_dir, '..', 'data', 'raw', 'balanced_review.csv')
print(file_path)

C:\Users\JuanManuelMorenoRuiz\Desktop\Cursos\BOOTCAMP IA + BIG DATA\iabigdatareviews\notebooks
C:\Users\JuanManuelMorenoRuiz\Desktop\Cursos\BOOTCAMP IA + BIG DATA\iabigdatareviews\notebooks\..\data\raw\balanced_review.csv


In [76]:
# Load the original dataset
df = pd.read_csv(file_path)

# Show the first rows to understand the structure
print(df.head())
print(df.shape[0])

   overall                                         reviewText  \
0        5  I'm enjoying these, the fit is about right and...   
1        5  these are the coolest little board shorts.  Th...   
2        5  beautiful ring for the money, looks very expen...   
3        5  I love this dress! Fantastic quality, great st...   
4        5  These Ecco golf shoes are differently shaped t...   

              summary  
0    Very comfortable  
1   funky board short  
2          Five Stars  
3   I love this dress  
4  Still A Great Fit!  
792000


In [77]:
df = pd.read_csv(file_path)
# Keep only 'overall' and 'reviewText', which drops the 'summary' column
df = df[['overall', 'reviewText']]

# Check the updated structure of the dataset
print(df.head())

   overall                                         reviewText
0        5  I'm enjoying these, the fit is about right and...
1        5  these are the coolest little board shorts.  Th...
2        5  beautiful ring for the money, looks very expen...
3        5  I love this dress! Fantastic quality, great st...
4        5  These Ecco golf shoes are differently shaped t...


In [78]:
# Path of the new CSV file where the modified dataset will be saved

output_dir = os.path.join(current_dir, '..', 'data', 'modified' ,'balanced_review_modified.csv')

print(output_dir)

C:\Users\JuanManuelMorenoRuiz\Desktop\Cursos\BOOTCAMP IA + BIG DATA\iabigdatareviews\notebooks\..\data\modified\balanced_review_modified.csv


In [79]:
df.to_csv(output_dir, index=False)


In [80]:
overall_1_df = df.loc[df['overall'] == 1]


In [81]:
print(overall_1_df.sample(5))  # Show 5 random records


        overall                                         reviewText
311478        1  I'm a true size small...reading 2 other review...
261836        1  Not what I expected\nReally bad\nUncomfortable...
68177         1               Small and the crotch ripped quickly.
453575        1  Awesome coat and awesome price. Very warm and ...
45291         1                   It ripped within two days of use


In [82]:
print(df)

        overall                                         reviewText
0             5  I'm enjoying these, the fit is about right and...
1             5  these are the coolest little board shorts.  Th...
2             5  beautiful ring for the money, looks very expen...
3             5  I love this dress! Fantastic quality, great st...
4             5  These Ecco golf shoes are differently shaped t...
...         ...                                                ...
791995        1  Its just like every other shaper uncomfortable...
791996        1  The fit is fine and they're comfortable. Howev...
791997        1             correct size but pops open very easily
791998        1  Only rated this with one star because you didn...
791999        1   The product came without the tags on the side!!!

[792000 rows x 2 columns]


In [83]:
num_filas = df.shape[0]
print(num_filas)

792000


In [84]:
output_dir = os.path.join(current_dir, '..', 'data', 'modified' ,'balanced_review_modified_binario.csv')
print(output_dir)

C:\Users\JuanManuelMorenoRuiz\Desktop\Cursos\BOOTCAMP IA + BIG DATA\iabigdatareviews\notebooks\..\data\modified\balanced_review_modified_binario.csv


In [85]:
# Load the binary-labeled dataset (the cell that creates this file is not shown in this notebook)
df = pd.read_csv(output_dir)
#output_dir = os.path.join(current_dir, '..', 'data', 'modified' ,'balanced_review_modified_no_header.csv')
# Export without headers
#df.to_csv(output_dir, index=False, header=False)




In [86]:
df = pd.read_csv(output_dir)

In [87]:

# Split into training and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

In [88]:
print(f'Training set size: {len(train_df)}')
print(f'Test set size: {len(test_df)}')

Training set size: 633600
Test set size: 158400
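
The split above is purely random. If the binary labels turn out to be unevenly distributed, a stratified split keeps the same label ratio in both parts. The sketch below is only an illustration on a hypothetical toy frame (`toy_df` and its contents are assumptions, not the real reviews); it uses the `stratify` parameter of `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 12 negative (1) and 8 positive (2) reviews
toy_df = pd.DataFrame({
    "overall": [1] * 12 + [2] * 8,
    "reviewText": [f"review {i}" for i in range(20)],
})

# stratify= keeps the 60/40 label ratio in both the train and the test part
toy_train, toy_test = train_test_split(
    toy_df,
    test_size=0.25,
    random_state=42,
    stratify=toy_df["overall"],
)

print(toy_train["overall"].value_counts().sort_index())
print(toy_test["overall"].value_counts().sort_index())
```

For the real data, the equivalent call would pass `stratify=df['overall']` to the existing `train_test_split` line.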


In [89]:
output_dir_train = os.path.join(current_dir, '..', 'data', 'modified' ,'train.csv')
output_dir_test = os.path.join(current_dir, '..', 'data', 'modified' ,'test.csv')

In [90]:
train_df.to_csv(output_dir_train, index=False)
test_df.to_csv(output_dir_test, index=False)

print('The training and test sets have been saved as train.csv and test.csv')

The training and test sets have been saved as train.csv and test.csv


In [91]:
column_0 = train_df.iloc[:, 0]  # Get column 0 as a Series
numeric_values = pd.to_numeric(column_0, errors='coerce')
non_numeric_count = numeric_values.isna().sum()
print(f"Number of non-numeric elements in column 0: {non_numeric_count}")

Number of non-numeric elements in column 0: 0


In [93]:
#train_df = pd.read_csv(output_dir_train)
train_df = pd.read_csv(output_dir_train, header=0, names=['overall', 'reviewText'])

print(train_df.head())

   overall                                         reviewText
0        2  It is a beautiful top! Very soft and very comf...
1        1  It's much lighter than the other New Balance r...
2        1  I can only use this purse when I have very lit...
3        1  I like the style and comfort of the watch.  Th...
4        1  Lost my Driver License when the back pocket un...


In [95]:
column_0 = train_df.iloc[:, 0]  # Get column 0 as a Series
numeric_values = pd.to_numeric(column_0, errors='coerce')
non_numeric_count = numeric_values.isna().sum()
print(f"Number of non-numeric elements in column 0: {non_numeric_count}")

Number of non-numeric elements in column 0: 0


In [96]:
null_count = train_df.iloc[:, 0].isnull().sum()
print(f"Number of null values in column 0: {null_count}")

Number of null values in column 0: 0


In [97]:
numeric_values = pd.to_numeric(train_df.iloc[:, 0], errors='coerce')
non_numeric_count = numeric_values.isna().sum()

# Check for null values
null_count = train_df.iloc[:, 0].isnull().sum()

print(f"Number of non-numeric elements in column 0: {non_numeric_count}")
print(f"Number of null values in column 0: {null_count}")

Number of non-numeric elements in column 0: 0
Number of null values in column 0: 0
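
The checks above only cover the label column (column 0). The review text can also contain missing or non-string values, which a tokenizer cannot handle. As a minimal sketch, the helper below counts such rows; `text_column_report` and the toy frame are hypothetical names introduced for illustration.

```python
import pandas as pd

def text_column_report(frame: pd.DataFrame, column: str = "reviewText") -> dict:
    """Count nulls, non-string values, and blank strings in a text column."""
    values = frame[column]
    is_null = values.isna()
    is_str = values.map(lambda v: isinstance(v, str))
    is_blank = values.map(lambda v: isinstance(v, str) and v.strip() == "")
    return {
        "null": int(is_null.sum()),
        # values that are present but are not strings (e.g. numbers)
        "non_string": int((~is_str & ~is_null).sum()),
        "blank": int(is_blank.sum()),
    }

# Hypothetical toy data with one missing, one blank, and one numeric review
toy = pd.DataFrame({
    "overall": [1, 2, 1, 2],
    "reviewText": ["Great fit", None, "   ", 42],
})
print(text_column_report(toy))
```

Running `text_column_report(train_df)` on the real training frame would show whether the text column needs cleaning before tokenization.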


In [98]:
#test_df = pd.read_csv(output_dir_test)
test_df = pd.read_csv(output_dir_test, header=0, names=['overall', 'reviewText'])

print(test_df.head())






   overall                                         reviewText
0        2  My son was so happy with these shoes. I only g...
1        1  I bought a large and it does not quite do the ...
2        1  Color is not as advertised. It's way darker in...
3        1        Its definitely a tote sized bag not a purse
4        1  I'm 5'10 140 lbs. I am no means a medium or a ...


In [99]:
train_df['overall'] = (train_df['overall'] == 2).astype(int)


In [100]:
test_df['overall'] = (test_df['overall'] == 2).astype(int)


In [101]:
print(train_df.head())
print(test_df.head())

   overall                                         reviewText
0        1  It is a beautiful top! Very soft and very comf...
1        0  It's much lighter than the other New Balance r...
2        0  I can only use this purse when I have very lit...
3        0  I like the style and comfort of the watch.  Th...
4        0  Lost my Driver License when the back pocket un...
   overall                                         reviewText
0        1  My son was so happy with these shoes. I only g...
1        0  I bought a large and it does not quite do the ...
2        0  Color is not as advertised. It's way darker in...
3        0        Its definitely a tote sized bag not a purse
4        0  I'm 5'10 140 lbs. I am no means a medium or a ...
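
One caveat with `(overall == 2).astype(int)`: any value other than 2 becomes 0, so re-running the cell on already converted labels silently turns every label into 0. A stricter alternative, sketched here under the assumption that 1 means negative and 2 means positive in the binary CSV, is an explicit mapping that fails on unexpected values. `LABEL_MAP` and `to_binary_label` are hypothetical names.

```python
import pandas as pd

# Assumption: in the binary CSV, 1 = negative review and 2 = positive review
LABEL_MAP = {1: 0, 2: 1}

def to_binary_label(series: pd.Series) -> pd.Series:
    """Map 1/2 ratings to 0/1 labels and reject anything else."""
    unexpected = set(series.unique()) - set(LABEL_MAP)
    if unexpected:
        raise ValueError(f"Unexpected label values: {sorted(unexpected)}")
    return series.map(LABEL_MAP).astype(int)

print(to_binary_label(pd.Series([2, 1, 1])).tolist())
```

Calling the function a second time on its own output raises a `ValueError` instead of quietly producing all zeros.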


In [142]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset

In [143]:
try:
    import pyarrow
    import datasets

    print("Pyarrow version:", pyarrow.__version__)
    print("Datasets version:", datasets.__version__)

except ImportError as e:
    print(f"Error importing libraries: {e}")
    # Try to install the required libraries into the interpreter running this notebook
    import subprocess
    import sys
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'pyarrow==12.0.0', 'datasets==2.14.4'])
    import pyarrow
    import datasets

Pyarrow version: 12.0.0
Datasets version: 2.14.4


In [145]:
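# Rename the columns to the names used in the following cells ('label' and 'text')
train_df = train_df.rename(columns={'overall': 'label', 'reviewText': 'text'})
test_df = test_df.rename(columns={'overall': 'label', 'reviewText': 'text'})

train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)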

print(train_dataset[:5])
print(test_dataset)

{'label': [1, 0, 0, 0, 0], 'text': ['It is a beautiful top! Very soft and very comfortable. The only problem is it is a bit big in the bust area for me so I should have ordered a size down.', "It's much lighter than the other New Balance running shoes, but I like the weight.  It could have been a bit longer, as my longest toe nearly touches the end.  The width (across my toes) is very snug, and could have been  a little wider.", 'I can only use this purse when I have very little in my purse as the clasp does not like to stay closed.', 'I like the style and comfort of the watch.  The time is always off and the date is never right.  Time seems to get about 3 - 5 mins off every couple of days.  The date is always off.  I have to manually change the date every day.', 'Lost my Driver License when the back pocket unglued by itself.\n\nI wish they have used a better glue! :(\n\nNeed a new DL now! :X']}
Dataset({
    features: ['label', 'text'],
    num_rows: 158400
})


In [153]:
from transformers import BertTokenizer

# Load the tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)

# Tokenization function: pad or truncate every review to 512 tokens
def tokenize_function(examples):
    texts = examples['text']
    return tokenizer(texts, padding='max_length', truncation=True, max_length=512)


# Remove index columns added by the pandas conversion, if any
train_dataset = train_dataset.remove_columns([col for col in train_dataset.column_names if 'index' in col])
test_dataset = test_dataset.remove_columns([col for col in test_dataset.column_names if 'index' in col])


# Tokenize the training dataset
train_dataset = train_dataset.map(tokenize_function, batched=True)

# Tokenize the test dataset
test_dataset = test_dataset.map(tokenize_function, batched=True)


Map:   0%|▏                                                              | 2000/633600 [00:02<10:38, 989.30 examples/s]


ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
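
This `ValueError` is typically raised when the tokenizer receives something that is not a string, for example a missing review that was loaded as a null value. Whether that is the cause here is an assumption, since the notebook never inspects the text column. A minimal sketch of a row filter follows; `has_valid_text` and the toy rows are hypothetical.

```python
def has_valid_text(example: dict) -> bool:
    """Return True only for rows whose 'text' field is a non-empty string."""
    text = example["text"]
    return isinstance(text, str) and text.strip() != ""

# Hypothetical rows standing in for dataset examples
rows = [
    {"label": 1, "text": "Great fit"},
    {"label": 0, "text": None},
    {"label": 0, "text": "   "},
    {"label": 1, "text": 3.5},
]
kept = [row for row in rows if has_valid_text(row)]
print(kept)
```

With Hugging Face `datasets`, the same predicate could be applied as `train_dataset = train_dataset.filter(has_valid_text)` (and likewise for `test_dataset`) before calling `map`.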

In [None]:
# Apply the tokenization function to the datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

In [136]:

inputs = ['This is a sample text for tokenization.', 'How are you doing?']

# Tokenize the texts
tokens = [tokenizer.tokenize(text) for text in inputs]

print(tokens)

[['this', 'is', 'a', 'sample', 'text', 'for', 'token', '##ization', '.'], ['how', 'are', 'you', 'doing', '?']]


In [121]:

df_train = pd.DataFrame({
    'label': [0, 1, 1, 0],
    'text': ['Sample text 1', 'Another text for training', 'A third text', 'Last text']
})

df_test = pd.DataFrame({
    'label': [1, 0, 1],
    'text': ['Test text 1', 'Another test text', 'Last test text']
})


In [None]:
# Convert the DataFrames to Hugging Face datasets
train_dataset = Dataset.from_pandas(df_train)
test_dataset = Dataset.from_pandas(df_test)

In [146]:
# Show a few rows to verify the format
print("Train Dataset Sample:")
print(train_dataset[:5])
print("Test Dataset Sample:")
print(test_dataset[:5])

Train Dataset Sample:
{'label': [1, 0, 0, 0, 0], 'text': ['It is a beautiful top! Very soft and very comfortable. The only problem is it is a bit big in the bust area for me so I should have ordered a size down.', "It's much lighter than the other New Balance running shoes, but I like the weight.  It could have been a bit longer, as my longest toe nearly touches the end.  The width (across my toes) is very snug, and could have been  a little wider.", 'I can only use this purse when I have very little in my purse as the clasp does not like to stay closed.', 'I like the style and comfort of the watch.  The time is always off and the date is never right.  Time seems to get about 3 - 5 mins off every couple of days.  The date is always off.  I have to manually change the date every day.', 'Lost my Driver License when the back pocket unglued by itself.\n\nI wish they have used a better glue! :(\n\nNeed a new DL now! :X']}
Test Dataset Sample:
{'label': [1, 0, 0, 0, 0], 'text': ['My son was 

In [None]:
# Training configuration
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)
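
To get a feel for the cost of this configuration, the number of optimizer steps can be estimated from the batch size and the number of epochs. The arithmetic below assumes the full training split of 633,600 rows, a single device, and no gradient accumulation; those three conditions are assumptions, not settings shown in the notebook.

```python
import math

train_examples = 633_600   # size of the training split printed earlier
batch_size = 16            # per_device_train_batch_size
epochs = 3                 # num_train_epochs
num_devices = 1            # assumption: one GPU or CPU, no gradient accumulation

steps_per_epoch = math.ceil(train_examples / (batch_size * num_devices))
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 39600 118800
```

Training on a smaller random sample first is one way to check the pipeline end to end before committing to a run of this length.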


In [None]:
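# Load BERT with a two-class classification head (negative / positive)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Model training and evaluation
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)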

In [None]:
trainer.train()

In [None]:
# Evaluation
results = trainer.evaluate()
print(results)
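
Without a metrics function, `trainer.evaluate()` reports the evaluation loss and timing figures but no classification metrics. As a sketch, a function like the one below could be passed to `Trainer` through its `compute_metrics` argument. It assumes label 1 is the positive class; the toy logits and labels are hypothetical.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Turn (logits, labels) into accuracy, precision, recall, and F1 for class 1."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    accuracy = float(np.mean(preds == labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical logits for four reviews, columns = [negative, positive]
toy_logits = np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]])
toy_labels = np.array([0, 1, 1, 1])
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

In this notebook the function would be wired in as `Trainer(..., compute_metrics=compute_metrics)`, after which the dictionary returned by `trainer.evaluate()` also contains these values with an `eval_` prefix.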


In [None]:
# Save the trained model
model.save_pretrained("./sentiment_model")
tokenizer.save_pretrained("./sentiment_model")

In [None]:
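# Example: classify a new review with the saved model
from transformers import pipeline

# Labels are reported as LABEL_0 / LABEL_1 unless id2label is set in the model config
classifier = pipeline("text-classification", model="./sentiment_model", tokenizer="./sentiment_model")
example_review = "This is the best product I have ever used!"
prediction = classifier(example_review)
print(prediction)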