<a href="https://colab.research.google.com/github/gmauricio-toledo/NLP-MCD/blob/main/15-LLM-SentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis using LLMs

En esta notebook realizaremos la tarea de Análisis de Sentimientos usando un LLM de la librería `transformers` de Hugging Face. Probaremos varios modelos y técnicas para hacer la tarea.

## Dataset

In [None]:
!gdown 18kGdlhOiQNS61wUK7uPbdquKL3XJrgzf

Downloading...
From: https://drive.google.com/uc?id=18kGdlhOiQNS61wUK7uPbdquKL3XJrgzf
To: /content/IMDB.csv
100% 66.2M/66.2M [00:03<00:00, 21.2MB/s]


In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import pandas as pd

imdb_df = pd.read_csv('IMDB.csv')
display(imdb_df)

y = LabelEncoder().fit_transform(imdb_df['sentiment'].values)

X_train_raw, X_test_raw, y_train, y_test = train_test_split(imdb_df['review'].values, y, test_size=0.2, random_state=642, stratify=y)
X_train_raw, X_val_raw, y_train, y_val = train_test_split(X_train_raw, y_train, test_size=0.25, random_state=473, stratify=y_train)
print(f"Training set size: {len(X_train_raw)}")
print(f"Validation set size: {len(X_val_raw)}")
print(f"Test set size: {len(X_test_raw)}")

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


Training set size: 30000
Validation set size: 10000
Test set size: 10000


## Preprocesamiento

In [None]:
import nltk
from nltk import word_tokenize
import re
from string import punctuation

# nltk.download('punkt') # este ya va de salida
nltk.download('punkt_tab')
nltk.download('stopwords')

stopwords = nltk.corpus.stopwords.words('english')

stopwords.append('br')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
def clean_text(text):
    text = re.sub(r'\d+', ' ', text)
    text = re.sub(r"`['+`+]", '', text)
    tokenized_text = [x for x in word_tokenize(text) if (x.lower() not in stopwords) and (x.lower() not in punctuation)]
    return ' '.join(tokenized_text)

In [None]:
train_docs = [clean_text(x) for x in X_train_raw]
val_docs = [clean_text(x) for x in X_val_raw]
test_raw = [clean_text(x) for x in X_test_raw]

## Sampleo

In [None]:
num_training_docs = 300
num_validation_docs = 1000

sample_train_docs, _, sample_train_labels, _ = train_test_split(train_docs, y_train,
                                                                train_size=num_training_docs,
                                                                random_state=777,
                                                                stratify=y_train)

sample_val_docs, _, sample_val_labels, _ = train_test_split(val_docs, y_val,
                                                            train_size=num_validation_docs,
                                                            random_state=777,
                                                            stratify=y_val)

## Modelo

Probemos algunos modelos:

* [Phi-3.5-mini-instruct](https://huggingface.co/microsoft/Phi-3.5-mini-instruct): Modelo de Microsoft con 3.8B parámetros, 128K tokens context length, vocabulario de 32064 tokens, entrenado en agosto/2024.
* [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct): Modelo de Qwen con 1.54B parámetros, 32,768 tokens context length, Multilingual support for over 29 languages.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name = "Qwen/Qwen2.5-1.5B-Instruct"
# model_name = "Qwen/Qwen-7B-Chat"
# model_name = "databricks/dolly-v2-3b"  # no funciona el chat template
# model_name = "mosaicml/mpt-7b""
# model_name = "tiiuae/falcon-7b-instruct"
model_name = "microsoft/Phi-3.5-mini-instruct"
# model_name = "HuggingFaceH4/zephyr-7b-beta"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/3.98k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/306 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/3.45k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/16.3k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.67G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/195 [00:00<?, ?B/s]

Si usamos los documentos sin quitar stopwords

In [None]:
docs = X_train_raw.copy()[:num_training_docs]

Si usamos los documentos quitando stopwords

In [10]:
docs = sample_train_docs.copy()

## Sentiment Analysis

### Zero shot

Con GPU y 200 ejemplos, tarda alrededor de 1 minuto

In [None]:
responses = []

for k,sentence in enumerate(docs):
    prompt = "I want to perform a binary sentiment analysis task on the following text, determine if the sentiment is positive or negative. Respond only 'positive' or 'negative'. The text is: " + sentence
    messages = [
        {"role": "system", "content": "You are Qwen. You are a helpful assistant performing binary sentiment analysis."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=16
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(f"{k+1}/{num_training_docs} done")
    responses.append(response)

print(responses)

1/300 done
2/300 done
3/300 done
4/300 done
5/300 done
6/300 done
7/300 done
8/300 done
9/300 done
10/300 done
11/300 done
12/300 done
13/300 done
14/300 done
15/300 done
16/300 done
17/300 done
18/300 done
19/300 done
20/300 done
21/300 done
22/300 done
23/300 done
24/300 done
25/300 done
26/300 done
27/300 done
28/300 done
29/300 done
30/300 done
31/300 done
32/300 done
33/300 done
34/300 done
35/300 done
36/300 done
37/300 done
38/300 done
39/300 done
40/300 done
41/300 done
42/300 done
43/300 done
44/300 done
45/300 done
46/300 done
47/300 done
48/300 done
49/300 done
50/300 done
51/300 done
52/300 done
53/300 done
54/300 done
55/300 done
56/300 done
57/300 done
58/300 done
59/300 done
60/300 done
61/300 done
62/300 done
63/300 done
64/300 done
65/300 done
66/300 done
67/300 done
68/300 done
69/300 done
70/300 done
71/300 done
72/300 done
73/300 done
74/300 done
75/300 done
76/300 done
77/300 done
78/300 done
79/300 done
80/300 done
81/300 done
82/300 done
83/300 done
84/300 done
8

### Few shot

Obtenemos unos pocos documentos de ejemplo

In [11]:
sample_docs = sample_val_docs[:3].copy()
sample_labels = sample_val_labels[:3].copy()

In [12]:
print(len(sample_docs))
print(sample_labels)

3
[1 1 0]


In [13]:
responses = []

label_to_text = {0: "negative", 1: "positive"}

for k,sentence in enumerate(docs):
    prompt = ""
    for label,text in zip(sample_labels[1:], sample_docs[1:]):
        prompt += f"\n{text} // {label_to_text[label]}\n"
    prompt += f"\text {sentence} // "
    print(prompt)
    break
    messages = [
        {"role": "system", "content": "You are Qwen. You are a helpful assistant performing binary sentiment analysis."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=16
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(f"{k+1}/{num_training_docs} done")
    responses.append(response)

print(responses)


Wow .... 's long time since 've last seen hilarious movie like one 've never great fanatic French movies ever since fell love beauty acting skills Catherine Deneuve decided see movies ... however n't think one would fantastic turned Lucky bought even though doubts really `` feel-good-time '' film class quality great social topics moral drama 's involved close today 's modern way living shown beautiful realistic also liked dancing scene men 's room lot favourite rather timid attempt Catherine Deneuve sing ..... brings way lots grace modesty time ... tempting would also like express respect admiration Line Renaud played fantastic role n't even know acted .... 've known music n't wait longer go see movie ... 'll surprised many ways // positive

Filmed documentary style pretty well tell participants coached recently divorced wannabe film maker Myles Berkowitz sees chance liven love life step movie biz time intends make documentary piece finding love filming twenty dates including ramifica

## Evaluación

In [None]:
def encode(x):
    if x.lower() == "positive":
        return 1
    else:
        return 0

y_pred = [encode(x) for x in responses]

* Qwen2.5 1.5B + Quitar Stopwords: 80%
* Phi + Quitar Stopwords: 80%
* Qwen2.5 1.5B + Dejar Stopwords: 51.5%

Few shot:
* Qwen2.5 1.5B + Quitar Stopwords: 50.6 %
* Phi + Quitar Stopwords: 80%

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(sample_train_labels, y_pred)

0.8066666666666666

## Further Explorations

* Explorar el efecto de los ejemplos para el few shot:
    * La longitud de los textos de ejemplo
    * El número de ejemplos
* Explorar más LLM.
* Explorar diferentes prompts.
* Explorar el [pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines) de [*sentiment analysis*](https://huggingface.co/blog/sentiment-analysis-python) de HuggingFace, hay muchos [modelos](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads&search=sentiment) para escoger.
* Usar los embeddings generados por el modelo y aplicar algoritmos de ML.
