<a href="https://colab.research.google.com/github/gmauricio-toledo/NLP-MCD/blob/main/13-LLM_Prompting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis using LLMs

In this notebook we perform sentiment analysis using an LLM from Hugging Face's `transformers` library. We will try several models and techniques for the task.

## Dataset

In [None]:
!gdown 18kGdlhOiQNS61wUK7uPbdquKL3XJrgzf

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import pandas as pd

imdb_df = pd.read_csv('IMDB.csv')
display(imdb_df)

y = LabelEncoder().fit_transform(imdb_df['sentiment'].values)

X_train_raw, X_test_raw, y_train, y_test = train_test_split(imdb_df['review'].values, y, test_size=0.2, random_state=642, stratify=y)
X_train_raw, X_val_raw, y_train, y_val = train_test_split(X_train_raw, y_train, test_size=0.25, random_state=473, stratify=y_train)
print(f"Training set size: {len(X_train_raw)}")
print(f"Validation set size: {len(X_val_raw)}")
print(f"Test set size: {len(X_test_raw)}")
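
The two chained `train_test_split` calls produce a 60/20/20 split: the first holds out 20% for testing, and the second takes 25% of the remaining 80% (20% of the total) for validation. A quick arithmetic check, assuming a hypothetical dataset of 50,000 rows (the actual size of `IMDB.csv` is not stated here):

```python
# Arithmetic behind the two chained splits.
# n_total is an assumed, illustrative size, not read from IMDB.csv.
n_total = 50_000

test_fraction = 0.2   # first split: test_size=0.2
val_fraction = 0.25   # second split: test_size=0.25, applied to the remaining 80%

n_test = round(n_total * test_fraction)
n_val = round((n_total - n_test) * val_fraction)
n_train = n_total - n_test - n_val

print(n_train, n_val, n_test)  # 30000 10000 10000
```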

## Preprocessing

Modern language models such as BERT and its successors do not need traditional text preprocessing (stopword removal, lemmatization, or stemming), and applying it is generally not recommended.

These models are designed to interpret context and language structure as they appear in raw text, including function words that carry contextual meaning.

It is still common, however, to strip non-linguistic artifacts such as HTML tags, escape codes, URLs, or irrelevant special characters.

Beyond that cleanup, preprocessing is generally limited to tokenization with the model's own tokenizer (for example, WordPiece for BERT), adding special tokens ([CLS], [SEP]), and padding or truncating sequences to a fixed length.

Keeping the rest of the original text intact lets the model make full use of its contextual and semantic capabilities.
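
To make the tokenization step concrete, here is a toy sketch of how special tokens, truncation, and padding combine into a fixed-length input. The ids and the helper `prepare_sequence` are hypothetical and chosen for illustration; a real tokenizer handles all of this internally, and the sketch assumes `max_len >= 2`.

```python
# Toy illustration of special tokens, truncation, and padding.
# The ids below are made up, not taken from any real vocabulary.
CLS_ID, SEP_ID, PAD_ID = 1, 2, 0

def prepare_sequence(token_ids, max_len):
    """Wrap token ids in [CLS] ... [SEP], truncate to max_len, then pad."""
    body = token_ids[: max_len - 2]          # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(ids)          # 1 = real token, 0 = padding
    n_pad = max_len - len(ids)
    return ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

short_ids, short_mask = prepare_sequence([7, 8, 9], max_len=8)
long_ids, long_mask = prepare_sequence(list(range(10, 30)), max_len=8)

print(short_ids, short_mask)  # padded on the right
print(long_ids, long_mask)    # truncated to fit
```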



In [None]:
import re
import html

def clean_text(text):
    # Decode HTML entities
    text = html.unescape(text)
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', ' ', text)
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
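
A quick check of the cleaning function on a made-up review snippet (the input string is an illustrative assumption, not a row from the dataset). The function is repeated so the example runs on its own:

```python
import html
import re

def clean_text(text):
    text = html.unescape(text)                 # decode HTML entities
    text = re.sub(r'<[^>]+>', ' ', text)       # drop HTML tags
    return re.sub(r'\s+', ' ', text).strip()   # collapse whitespace

raw = "I loved it!<br /><br />Great   acting &amp; a clever plot."
cleaned = clean_text(raw)
print(cleaned)  # I loved it! Great acting & a clever plot.
```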

In [None]:
train_docs = [clean_text(doc) for doc in X_train_raw]
test_docs = [clean_text(doc) for doc in X_test_raw]
val_docs = [clean_text(doc) for doc in X_val_raw]

## Sampling

In [None]:
num_training_docs = 300
num_validation_docs = 1000

sample_train_docs, _, sample_train_labels, _ = train_test_split(train_docs, y_train,
                                                                train_size=num_training_docs,
                                                                random_state=777,
                                                                stratify=y_train)

sample_val_docs, _, sample_val_labels, _ = train_test_split(val_docs, y_val,
                                                            train_size=num_validation_docs,
                                                            random_state=777,
                                                            stratify=y_val)

## Modelo

Let's try a few models:

* [Phi-3.5-mini-instruct](https://huggingface.co/microsoft/Phi-3.5-mini-instruct): Microsoft model with 3.8B parameters, a 128K-token context length, and a vocabulary of 32,064 tokens, trained in August 2024.
* [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct): Qwen model with 1.54B parameters, a 32,768-token context length, and multilingual support for over 29 languages.

Note that we now use the `AutoModelForCausalLM` class from Hugging Face Transformers, which is meant for generative language models. It automatically loads the correct architecture based on the model name.

Task types:
* Text generation
* Prompt completion
* Chatbots
* Next-token prediction
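
All of these tasks reduce to next-token prediction applied repeatedly. The toy sketch below shows greedy autoregressive generation with a hand-written score table standing in for the model. The tokens, scores, and the helper `greedy_generate` are invented for illustration; a real causal LM conditions on the whole prefix, not only on the last token.

```python
# Toy next-token "model": a hand-written table of next-token scores.
# All tokens and scores are made up for illustration.
NEXT_TOKEN_SCORES = {
    "<s>": {"the": 0.9, "movie": 0.1},
    "the": {"movie": 0.8, "the": 0.2},
    "movie": {"was": 0.7, "</s>": 0.3},
    "was": {"great": 0.6, "bad": 0.4},
    "great": {"</s>": 1.0},
    "bad": {"</s>": 1.0},
}

def greedy_generate(start="<s>", max_new_tokens=10):
    """Append the highest-scoring next token until </s> or the token limit."""
    tokens = [start]
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES[tokens[-1]]
        next_token = max(scores, key=scores.get)
        if next_token == "</s>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker

print(greedy_generate())                  # ['the', 'movie', 'was', 'great']
print(greedy_generate(max_new_tokens=2))  # ['the', 'movie']
```

The `max_new_tokens` argument plays the same role as in `model.generate`: it caps how many tokens are produced even if no end-of-sequence token appears.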

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name = "Qwen/Qwen2.5-1.5B-Instruct"
# model_name = "Qwen/Qwen-7B-Chat"
# model_name = "mosaicml/mpt-7b"
# model_name = "tiiuae/falcon-7b-instruct"
model_name = "microsoft/Phi-3.5-mini-instruct"
# model_name = "HuggingFaceH4/zephyr-7b-beta"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

We use the documents without removing stopwords, taking a stratified subsample of 200:

In [None]:
from sklearn.model_selection import train_test_split

docs, _, labels, _ = train_test_split(sample_val_docs,
                                      sample_val_labels,
                                      train_size=200,
                                      stratify=sample_val_labels,
                                      random_state=707)

## Sentiment Analysis

### Zero shot

With a GPU and 200 examples, this takes about 1 minute.

In [None]:
docs[0]

We will run this [prompt](https://claude.ai/share/3b52a333-a017-4de0-9970-201b824a2b52) iteratively, over only a small number of examples.

In [None]:
prompt = "Hi, what can you tell me about yourself?"
messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16
)
generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
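
`apply_chat_template` turns the list of role/content messages into a single string in the format the model expects and, with `add_generation_prompt=True`, appends the marker that opens the assistant's turn. The toy sketch below mimics that idea with invented `<<role>>` and `<<end>>` markers. The real template is model-specific and can be inspected by printing `text` in the cell above.

```python
def render_chat(messages, add_generation_prompt=True):
    """Render messages with invented <<role>> / <<end>> markers (illustration only)."""
    parts = [f"<<{m['role']}>>\n{m['content']}\n<<end>>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<<assistant>>\n")   # cue the model to start its reply
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
]
rendered = render_chat(messages)
print(rendered)
```

Without the trailing assistant marker, a causal LM might continue the user's message instead of answering it.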

In [None]:
responses = []

for k,sentence in enumerate(docs):
    prompt = "I want to perform a binary sentiment analysis task on the following text, determine if the sentiment is positive or negative. Respond only 'positive' or 'negative'. The text is: " + sentence
    messages = [
        {"role": "system", "content": "You are a helpful assistant performing binary sentiment analysis."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=16
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(f"{k+1}/{len(docs)}")
    responses.append(response)

print(responses)
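
With `max_new_tokens=16`, zero-shot replies may come back with different capitalization, punctuation, or extra words. A minimal sketch of a normalizer that maps each raw reply to a label follows. The helper name `normalize_response` and the sample replies are hypothetical, and treating a reply that mentions both labels as `unknown` is a design choice made here, not something the loop above does.

```python
def normalize_response(response):
    """Map a raw model reply to 'positive', 'negative', or 'unknown'."""
    text = response.strip().lower()
    has_pos = "positive" in text
    has_neg = "negative" in text
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return "unknown"   # empty, off-topic, or ambiguous reply

# Made-up replies, for illustration only.
raw_replies = ["Positive", " negative.", "The sentiment is positive", "I cannot tell"]
normalized = [normalize_response(r) for r in raw_replies]
print(normalized)  # ['positive', 'negative', 'positive', 'unknown']
```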

### Few shot

We take a few example documents:

In [None]:
sample_docs = sample_val_docs[:3].copy()
sample_labels = sample_val_labels[:3].copy()

In [None]:
print(sample_docs)
print(sample_labels)

In [None]:
len(docs)

This takes about 7 minutes.

In [None]:
responses = []

label_to_text = {0: "negative", 1: "positive"}

for k, sentence in enumerate(docs):
    prompt = ""
    for label, text in zip(sample_labels[1:], sample_docs[1:]):
        prompt += f"\n{text} // {label_to_text[label]}\n"
    prompt += f"\n{sentence} // "

    messages = [
        {"role": "system", "content": "You are a helpful assistant performing binary sentiment analysis. You must respond ONLY with 'positive' or 'negative', nothing else."},
        {"role": "user", "content": prompt}
    ]

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=3,  # Reduced, since we only need one word
        num_return_sequences=1,
        do_sample=False,   # Greedy decoding, for more deterministic output
        pad_token_id=tokenizer.eos_token_id
    )

    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    # Clean the response so that it is only "positive" or "negative"
    response = response.strip().lower()
    if "positive" in response:
        response = "positive"
    elif "negative" in response:
        response = "negative"
    else:
        # If the reply is unclear, fall back to a default value
        response = "unknown"

    print(f"{k+1}/{len(docs)}: {response}")
    responses.append(response)
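
To see what the model actually receives, the prompt construction can be isolated in a small function and printed. This is a sketch using the same `text // label` layout as the loop above. The helper name `build_few_shot_prompt` and the example reviews are hypothetical.

```python
def build_few_shot_prompt(examples, query, label_to_text):
    """Format (text, label) examples and the query as 'text // label' lines."""
    prompt = ""
    for text, label in examples:
        prompt += f"\n{text} // {label_to_text[label]}\n"
    prompt += f"\n{query} // "   # left open for the model to complete
    return prompt

label_to_text = {0: "negative", 1: "positive"}
# Made-up reviews, for illustration only.
examples = [("A dull, lifeless film.", 0), ("Wonderful from start to finish.", 1)]
few_shot_prompt = build_few_shot_prompt(examples, "I would watch it again.", label_to_text)
print(few_shot_prompt)
```

Because full reviews are long, each extra example adds many tokens to every prompt, which may explain part of the longer running time compared with the zero-shot loop.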

## Evaluation

In [None]:
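def encode(x):
    # Map a reply to its label id: 1 for "positive", 0 for anything else
    # (note that "unknown" replies are therefore counted as negative)
    if x.strip().lower() == "positive":
        return 1
    else:
        return 0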

y_pred = [encode(x) for x in responses]

Zero-shot:

* Phi-3.5-mini-instruct: 94%
* Qwen2.5 1.5B: ?

Few shot:
* Phi-3.5-mini-instruct: 91%
* Qwen2.5 1.5B: ?

In [None]:
len(docs), len(labels)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

print(f"Accuracy: {accuracy_score(labels, y_pred)}")
plt.figure()
sns.heatmap(confusion_matrix(labels, y_pred), annot=True)
plt.show()
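
Since `encode` returns 0 for any reply it does not recognize as positive, an `unknown` reply is scored as a negative prediction. A minimal sketch of an evaluation that also reports how often the model gave no usable answer follows. The helper `evaluate` and the sample data are hypothetical.

```python
from collections import Counter

def evaluate(true_labels, predicted_texts):
    """Return accuracy, (true, predicted) pair counts, and the share of 'unknown' replies."""
    text_to_label = {"negative": 0, "positive": 1}
    pairs = Counter()
    correct = 0
    unknown = 0
    for y_true, text in zip(true_labels, predicted_texts):
        y_hat = text_to_label.get(text)   # None for 'unknown'
        if y_hat is None:
            unknown += 1
        correct += int(y_hat == y_true)   # an unknown reply never counts as correct
        pairs[(y_true, text)] += 1
    n = len(true_labels)
    return correct / n, pairs, unknown / n

# Made-up labels and replies, for illustration only.
acc, pairs, unknown_rate = evaluate(
    [1, 0, 1, 0], ["positive", "negative", "unknown", "positive"]
)
print(acc, unknown_rate)  # 0.5 0.25
```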

## Further Explorations

This technique, *Dynamic Zero-Shot Categorization*, could also be applied to other tasks:
* Topic Modeling
* Information Extraction

---

🔴 Additional explorations
* Explore the effect of the examples in the few-shot setting:
    * The length of the example texts
    * The number of examples
* Explore more LLMs.
* Explore different prompts.
* Explore the Hugging Face [pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines) for [*sentiment analysis*](https://huggingface.co/blog/sentiment-analysis-python); there are many [models](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads&search=sentiment) to choose from.
* Use the embeddings generated by the model and apply ML algorithms.
