# NLP + Transformers Course

<img src="https://yaelmanuel.com/wp-content/uploads/2021/12/platzi-banner-logo-matematicas.png" width="500px">

---

# Mercado Libre Review Analysis 📦🔍

## 0) Dependencies 📚

In [59]:
!pip install transformers
!pip install wordcloud



In [60]:
import re
import io

import pandas as pd
import numpy as np

from transformers import pipeline
from wordcloud import WordCloud, STOPWORDS
from PIL import Image

## 1) NLP Pipelines 🙌

**Set up pretrained pipelines**

In [61]:
# Pipeline for our fine-tuned sentiment analysis model
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="cabustillo13/roberta-base-bne-platzi-project-nlp-con-transformers"
)

# Pipeline for Spanish NER
ner_pipeline = pipeline(
    "ner",
    model="mrm8488/bert-spanish-cased-finetuned-ner",
    tokenizer="mrm8488/bert-spanish-cased-finetuned-ner"
)

Device set to use cuda:0
Some weights of the model checkpoint at mrm8488/bert-spanish-cased-finetuned-ner were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


## 2) Functionality 🦾

**Input analysis**

In [None]:
# Function to clean the text
def clean(text):
    # Remove text in square brackets (e.g. tags)
    text = re.sub(r'\[.*?\]', '', text)

    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove HTML tags
    text = re.sub(r'<.*?>+', '', text)

    # Remove leading and trailing whitespace
    text = text.strip()

    return text
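
To see what each substitution in `clean` removes, the sketch below applies the same three patterns one at a time to a made-up string. The sample text and the `step*` names are illustrative assumptions, not part of the notebook.

```python
import re

# Made-up input containing a bracketed tag, an HTML tag, a URL and trailing spaces
sample = "[Promo] Great product<br/> details at https://example.com/item  "

step1 = re.sub(r'\[.*?\]', '', sample)              # drops "[Promo]"
step2 = re.sub(r'https?://\S+|www\.\S+', '', step1)  # drops the URL
step3 = re.sub(r'<.*?>+', '', step2)                 # drops "<br/>"
result = step3.strip()                               # trims both ends

print(repr(result))  # 'Great product details at'
```

Only the ends of the string are trimmed; any run of spaces left in the middle by a removed tag or URL stays in place.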

In [None]:
# Function to rebuild an entity from the NER tokens
def reconstruct_entity(ner_tokens):
    """
    Rebuilds an entity from a list of NER tokens.
    If a token starts with "##", it is joined to the previous token without a space.
    """
    entity = ""
    for token in ner_tokens:
        word = token['word']
        if word.startswith("##"):
            entity += word[2:]
        else:
            if entity:
                entity += " " + word
            else:
                entity += word
    return entity
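
The `##` rule comes from WordPiece tokenization, where a word can be split into a first piece plus continuation pieces marked with `##`. The sketch below applies the same joining rule to hand-written tokens; the token values and the helper name `merge_wordpieces` are hypothetical, chosen only to mimic the shape of the NER pipeline output.

```python
# Hand-written tokens shaped like NER pipeline output (values are made up)
sample_tokens = [
    {"word": "Mer", "entity": "B-ORG"},
    {"word": "##cado", "entity": "I-ORG"},
    {"word": "Libre", "entity": "I-ORG"},
]

def merge_wordpieces(tokens):
    """Join '##' continuation pieces to the previous word; separate other words with a space."""
    words = []
    for token in tokens:
        piece = token["word"]
        if piece.startswith("##"):
            if words:
                words[-1] += piece[2:]   # glue the continuation onto the last word
            else:
                words.append(piece[2:])  # stray continuation at the start
        else:
            words.append(piece)
    return " ".join(words)

merged = merge_wordpieces(sample_tokens)
print(merged)  # Mercado Libre
```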

In [None]:
# Function to process the NER output and merge its tokens into a single string
def process_ner_output(ner_results):
    """
    Processes the NER output, ignoring the entity type, and returns a dictionary
    with a single key "entities" whose value is the string rebuilt from all the tokens.
    """
    # Rebuild the entity string from every token in the list
    combined = reconstruct_entity(ner_results)
    return {"entities": combined}
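
Because `process_ner_output` ignores entity types, every entity found in a review ends up in one space-separated string. If separate entities were wanted instead, one possible alternative (not what this notebook does) is to start a new group at each token whose tag does not begin with `I-`. The sketch below assumes each token carries an `entity` key with BIO-style tags such as `B-ORG` and `I-ORG`; the helper name `group_by_bio` and the sample tokens are hypothetical.

```python
def group_by_bio(tokens):
    """Split tokens into separate entity strings using '##' pieces and I- tags."""
    groups = []
    for token in tokens:
        piece = token["word"]
        if piece.startswith("##") and groups:
            groups[-1] += piece[2:]        # WordPiece continuation of the last word
        elif token["entity"].startswith("I-") and groups:
            groups[-1] += " " + piece      # next word of the same entity
        else:
            # Anything else opens a new entity; strip a stray leading "##" if present
            groups.append(piece[2:] if piece.startswith("##") else piece)
    return groups

# Made-up tokens for illustration
sample = [
    {"word": "Sam", "entity": "B-ORG"},
    {"word": "##sung", "entity": "I-ORG"},
    {"word": "Buenos", "entity": "B-LOC"},
    {"word": "Aires", "entity": "I-LOC"},
]

entities = group_by_bio(sample)
print(entities)  # ['Samsung', 'Buenos Aires']
```

This sketch does not check that an `I-` tag matches the type of the entity before it, so it is a starting point rather than a full BIO decoder.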

In [None]:
# Function to analyze a single text
def analyze_text(input_text):
    input_text = clean(input_text)
    sentiment = sentiment_pipeline(input_text)
    ner_results = ner_pipeline(input_text)
    processed_ner = process_ner_output(ner_results)
    return sentiment, processed_ner
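
`analyze_text` returns two values: the sentiment result, which the pipeline delivers as a list of dictionaries with `label` and `score` keys, and the dictionary built by `process_ner_output`. The sketch below formats a pair of that shape into one line. The label, score and entity values are invented for illustration (the actual labels depend on the fine-tuned model), and `summarize` is a hypothetical helper.

```python
# Made-up values shaped like the two results returned by analyze_text
sentiment = [{"label": "POSITIVE", "score": 0.97}]
processed_ner = {"entities": "Samsung Mercado Libre"}

def summarize(sentiment, processed_ner):
    """Build a one-line summary from the top sentiment and the entity string."""
    top = sentiment[0]
    entities = processed_ner["entities"] or "none"
    return f"{top['label']} ({top['score']:.0%}) | entities: {entities}"

summary = summarize(sentiment, processed_ner)
print(summary)  # POSITIVE (97%) | entities: Samsung Mercado Libre
```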

In [None]:
# Function to analyze a CSV file
def analyze_csv(file_obj):
    # Depending on the Gradio version, the upload arrives as a path string
    # or as an object that exposes the path through .name
    df = pd.read_csv(getattr(file_obj, "name", file_obj))
    if "review_body" not in df.columns:
        return "Error: column 'review_body' not found.", None, None
    texts = df["review_body"].astype(str).tolist()

    # Clean each review
    cleaned_texts = [clean(text) for text in texts]

    # Run sentiment analysis and NER on each cleaned review
    sentiments = [sentiment_pipeline(text) for text in cleaned_texts]
    ner_all = [process_ner_output(ner_pipeline(text)) for text in cleaned_texts]

    # Collect the detected entities (the "entities" value) from each review
    ner_words = [ner_result["entities"] for ner_result in ner_all]

    # Join all entities into a single string
    combined_ner_text = " ".join(ner_words)

    # WordCloud.generate raises ValueError when it receives no words,
    # so skip the image when no entities were detected
    if not combined_ner_text.strip():
        return sentiments, ner_all, None

    # Generate a word cloud from the detected entities
    wc = WordCloud(stopwords=STOPWORDS, background_color="white", width=800, height=400).generate(combined_ner_text)
    buf = io.BytesIO()
    wc.to_image().save(buf, format="PNG")
    buf.seek(0)

    # Convert to a PIL image so Gradio can display it
    image = Image.open(buf)

    return sentiments, ner_all, image
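
To try the CSV path, a small input file with the required `review_body` column is enough. The sketch below writes one with the standard library only, so it runs without the notebook's dependencies. The reviews are made up (written in Spanish because the models are Spanish), and the file name `sample_reviews.csv` is an arbitrary choice.

```python
import csv
import os
import tempfile

# Made-up reviews, only for illustration
sample_reviews = [
    "El celular llegó rápido y funciona muy bien.",
    "El paquete llegó dañado y el vendedor nunca respondió.",
]

# Write a CSV whose only column is review_body, the name analyze_csv checks for
csv_path = os.path.join(tempfile.mkdtemp(), "sample_reviews.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["review_body"])
    writer.writeheader()
    writer.writerows({"review_body": review} for review in sample_reviews)

# Read the header back to confirm the required column is present
with open(csv_path, newline="", encoding="utf-8") as f:
    header = next(csv.reader(f))

print(csv_path, header)
```

The resulting file can then be uploaded through the CSV tab of the interface.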

## 3) Graphical Interface ✨

In [66]:
!pip install gradio



In [67]:
import gradio as gr

**Build the interface**

In [69]:
with gr.Blocks(theme=gr.themes.Citrus()) as demo:
    gr.Markdown("## Mercado Libre Review Analysis App 📦🔍")

    with gr.Tab("Text Analysis"):
        gr.Markdown("### Enter a review")
        text_input = gr.Textbox(label="Review Text", placeholder="Type the review here...")
        sentiment_output = gr.JSON(label="Sentiment Analysis")
        ner_output = gr.JSON(label="Recognized Entities (NER)")
        analyze_btn = gr.Button("Analyze Text")
        analyze_btn.click(analyze_text, inputs=text_input, outputs=[sentiment_output, ner_output])

    with gr.Tab("CSV Analysis"):
        gr.Markdown("### Upload a CSV file with a 'review_body' column")
        csv_input = gr.File(label="CSV File")
        csv_sentiment_output = gr.JSON(label="Sentiment Analysis (per Review)")
        csv_ner_output = gr.JSON(label="Recognized Entities (per Review)")
        wc_output = gr.Image(label="WordCloud (Entities)")
        analyze_csv_btn = gr.Button("Analyze CSV")
        analyze_csv_btn.click(analyze_csv, inputs=csv_input, outputs=[csv_sentiment_output, csv_ner_output, wc_output])


## 4) Ready! 🚀

**Launch the instance**

In [70]:
demo.launch(debug=True)

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://01f15f5aa6cfd03780.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://01f15f5aa6cfd03780.gradio.live




Export the versions of the required libraries

In [71]:
!pip3 freeze > requirements.txt
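
`pip3 freeze` records every package installed in the environment, which in a hosted notebook can be far more than this project uses. As an alternative sketch, the standard library's `importlib.metadata` can report versions for only the packages the notebook depends on directly. The package list below is an assumption based on the imports and installs above, with `torch` assumed as the backend for the pipelines; `pinned_lines` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed direct dependencies of this notebook (torch assumed as the model backend)
DIRECT_DEPENDENCIES = ["transformers", "torch", "wordcloud", "gradio", "pandas", "numpy", "Pillow"]

def pinned_lines(packages):
    """Return 'name==version' lines for the packages that are installed."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            continue  # skip packages missing from this environment
    return lines

requirements = pinned_lines(DIRECT_DEPENDENCIES)
print("\n".join(requirements))
```

Writing `requirements` to a file instead of printing it would give a shorter `requirements.txt`, at the cost of leaving transitive dependencies unpinned.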