<a href="https://colab.research.google.com/github/johnnycleiton07/llm-studies/blob/main/Implementando_LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Implementing the LoRA technique

LoRA (Low-Rank Adaptation) is an efficient fine-tuning technique for large language models (LLMs). Instead of updating all of the model's parameters, LoRA keeps the pretrained weights fixed and trains only a small number of additional parameters in the form of low-rank matrices. This significantly reduces compute and storage costs while preserving the effectiveness of fine-tuning for specific tasks.
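
To make the savings concrete, the sketch below compares the number of trainable parameters in a full update of one square weight matrix against a rank-4 factorization of the same update. The hidden size of 768 matches `bert-base-uncased` and the rank of 4 matches the default used later in this notebook; the helper name `lora_param_counts` is hypothetical and only serves the illustration.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, low_rank) trainable-parameter counts for one weight matrix."""
    full = d_in * d_out                # updating the d_in x d_out matrix directly
    low_rank = rank * (d_in + d_out)   # updating A (d_in x rank) and B (rank x d_out)
    return full, low_rank


full, low_rank = lora_param_counts(768, 768, rank=4)
print(full, low_rank, f"{low_rank / full:.2%}")  # 589824 6144 1.04%
```

For a single 768 x 768 matrix, the low-rank pair needs roughly 1% of the parameters of a full update.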

The goal of this notebook is to implement fine-tuning with LoRA on a well-known dataset, `imdb`.

* The IMDB (Internet Movie Database) dataset is widely used for natural language processing tasks, especially sentiment analysis. It contains 50,000 movie reviews, split evenly between positive and negative, with 25,000 for training and 25,000 for testing. The reviews are labeled for binary classification, which makes it straightforward to train and evaluate machine learning models on sentiment tasks.

##Initial setup

In [1]:
!pip install datasets
!pip install accelerate

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)

In [2]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import torch.nn as nn
import pandas as pd

* Loading the dataset

In [3]:
dataset = load_dataset("imdb")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


* Converts the training split (`dataset["train"]`) into a Pandas DataFrame and prints the first five rows.

In [4]:
df_train = pd.DataFrame(dataset["train"])
print(df_train.head())

                                                text  label
0  I rented I AM CURIOUS-YELLOW from my video sto...      0
1  "I Am Curious: Yellow" is a risible and preten...      0
2  If only to avoid making this type of film in t...      0
3  This film was probably inspired by Godard's Ma...      0
4  Oh, brother...after hearing about this ridicul...      0


* Initializes a pretrained BERT tokenizer and model for sequence classification with two output classes.

In [5]:
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


* LoRA base class: defines a LoRA layer in PyTorch that adds a low-rank transformation of the input back onto the input to adjust the output.

In [6]:
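class LoRA(nn.Module):
    def __init__(self, input_dim, rank=4):
        super(LoRA, self).__init__()
        self.rank = rank
        # Low-rank factors: A projects down to `rank` dimensions, B projects back up.
        self.A = nn.Parameter(torch.randn(input_dim, rank))
        self.B = nn.Parameter(torch.randn(rank, input_dim))

    def forward(self, x):
        # Residual form: the input plus its low-rank transformation x @ A @ B.
        return x + torch.matmul(torch.matmul(x, self.A), self.B)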
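
For comparison, here is a minimal sketch of the more common LoRA arrangement, where the low-rank update is attached to a frozen linear layer instead of being applied to the encoder output. This is not the class used in this notebook: the name `LoRALinear`, the `alpha` scaling value, and the choice of initializer for `A` are assumptions made for illustration. The one property the sketch relies on is that `B` starts at zero, so the wrapped layer initially behaves exactly like the original one.

```python
import math

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Hypothetical wrapper: a frozen nn.Linear plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weights stay fixed
        self.scale = alpha / rank
        # A gets a small random init and B starts at zero,
        # so the low-rank update is exactly zero before training.
        self.A = nn.Parameter(torch.empty(base.in_features, rank))
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A @ self.B)


torch.manual_seed(0)
layer = LoRALinear(nn.Linear(768, 768), rank=4)
x = torch.randn(2, 768)
same_at_init = torch.allclose(layer(x), layer.base(x))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(same_at_init, trainable)  # True 6144
```

Only `A` and `B` receive gradients here, which is where the parameter savings come from.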

* Defines a modified BERT model with a LoRA layer that adapts BERT's final hidden states before classification and optionally computes the loss when labels are provided.

In [7]:
class BertWithLoRA(nn.Module):
    def __init__(self, model, lora_rank=4):
        super(BertWithLoRA, self).__init__()
        # Reuse the encoder and the classification head of the pretrained model.
        self.bert = model.bert
        self.lora = LoRA(self.bert.config.hidden_size, rank=lora_rank)
        self.classifier = model.classifier

    def forward(self, input_ids, attention_mask=None, token_type_ids=None, labels=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_states = outputs.last_hidden_state
        adapted_states = self.lora(hidden_states)
        # Classify from the adapted [CLS] token (position 0 of each sequence).
        logits = self.classifier(adapted_states[:, 0, :])

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.classifier.out_features), labels.view(-1))

        # The Trainer reads the "loss" entry from this dict during training.
        return {"loss": loss, "logits": logits}

* Preprocesses the examples by tokenizing the text with padding and truncation, renames the "label" column to "labels", and sets the dataset format to PyTorch tensors. (This usually takes a while.)

In [8]:
def preprocess_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(preprocess_function, batched=True)
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch", columns=['input_ids', 'attention_mask', 'labels'])
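
As a rough illustration of what `padding="max_length"` and `truncation=True` do to each example, the toy function below pads or cuts a list of token ids to a fixed length and builds the matching attention mask. It is a simplified stand-in and not the tokenizer's implementation: the name `pad_or_truncate` is hypothetical, and unlike the real tokenizer it does not preserve the closing `[SEP]` token when truncating. With no explicit `max_length`, the BERT tokenizer should pad to the model maximum, which is 512 tokens for `bert-base-uncased`.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy stand-in for padding="max_length" combined with truncation=True."""
    ids = token_ids[:max_length]               # truncation: drop tokens past the limit
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,               # padding on the right
        "attention_mask": [1] * len(ids) + [0] * n_pad,    # 1 = real token, 0 = padding
    }


short = pad_or_truncate([101, 2023, 3185, 102], max_length=6)
long = pad_or_truncate(list(range(10)), max_length=6)
print(short)
print(long["input_ids"])
```

Because every review is padded to the same fixed length, short reviews cost as much compute as long ones, which is part of why this step and the training run are slow.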

* Wraps the BERT model with LoRA, defines the training and evaluation arguments, and initializes a `Trainer` to manage training and evaluation.

In [9]:
model_with_lora = BertWithLoRA(model)

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model_with_lora,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"]
)
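
In the cell above nothing sets `requires_grad=False`, so the `Trainer` updates every BERT weight along with `A`, `B`, and the classifier. If the intent is to train only the adapter and the classification head, one option is to freeze the encoder before building the `Trainer`. The sketch below shows the idea on a tiny stand-in module that reuses the attribute names `bert`, `lora`, and `classifier`; the class `TinyAdapted` and the helper `freeze_encoder` are hypothetical and exist only so the example runs without downloading BERT.

```python
import torch
import torch.nn as nn


class TinyAdapted(nn.Module):
    """Hypothetical stand-in with the same attribute names as BertWithLoRA."""

    def __init__(self, hidden=16, rank=4):
        super().__init__()
        self.bert = nn.Linear(hidden, hidden)  # plays the role of the encoder
        self.lora = nn.ParameterDict({
            "A": nn.Parameter(torch.randn(hidden, rank)),
            "B": nn.Parameter(torch.randn(rank, hidden)),
        })
        self.classifier = nn.Linear(hidden, 2)


def freeze_encoder(model):
    """Disable gradients for the encoder and return (trainable, total) counts."""
    for p in model.bert.parameters():
        p.requires_grad = False
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total


trainable, total = freeze_encoder(TinyAdapted())
print(trainable, total)  # 162 434
```

Applied to the real model, the same call would be `freeze_encoder(model_with_lora)`, placed before the `Trainer` is created.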

In [10]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.3027        | 0.290157        |


TrainOutput(global_step=3125, training_loss=0.439372099609375, metrics={'train_runtime': 3081.6026, 'train_samples_per_second': 8.113, 'train_steps_per_second': 1.014, 'total_flos': 0.0, 'train_loss': 0.439372099609375, 'epoch': 1.0})

In [11]:
# Save the weights of the BERT model with LoRA to a file named "model_with_lora.pth".
torch.save(model_with_lora.state_dict(), "model_with_lora.pth")

In [12]:
# Evaluate the model with the `Trainer` and print the evaluation results.
results = trainer.evaluate()
print(results)

{'eval_loss': 0.2901572585105896, 'eval_runtime': 670.6102, 'eval_samples_per_second': 37.279, 'eval_steps_per_second': 4.66, 'epoch': 1.0}
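
The evaluation output above reports only the loss. To also get accuracy, a metric function can be passed to the `Trainer` through its `compute_metrics` argument. The sketch below is one possible version, assuming the predictions arrive as a `(logits, labels)` pair of NumPy arrays; the demo arrays are made up for illustration and are not results from this notebook.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Compute accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # pick the class with the highest logit
    return {"accuracy": float((preds == labels).mean())}


# Made-up logits and labels, used only to exercise the function.
demo_logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]])
demo_labels = np.array([0, 1, 1, 1])
print(compute_metrics((demo_logits, demo_labels)))  # {'accuracy': 0.75}
```

With `Trainer(..., compute_metrics=compute_metrics)`, the value should appear in the evaluation results under the key `eval_accuracy`.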


##Testing the model

In [13]:
# Initialize a pretrained BERT tokenizer and model for sequence classification with two classes.
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
# Create an instance of the BERT model with LoRA and load the saved weights from "model_with_lora.pth".
model_with_lora = BertWithLoRA(model)
model_with_lora.load_state_dict(torch.load("/content/model_with_lora.pth"))

<All keys matched successfully>

* Defines a function that uses the BERT model with LoRA to predict the sentiment of a review, returning "positive" or "negative" based on the predicted class.

In [15]:
def predict_sentiment(review_text):
    inputs = tokenizer(review_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    model_with_lora.eval()
    with torch.no_grad():
        outputs = model_with_lora(**inputs)
    logits = outputs["logits"]
    predicted_class = torch.argmax(logits, dim=1).item()
    sentiment = "positive" if predicted_class == 1 else "negative"
    return sentiment
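
`predict_sentiment` returns only the label. When a confidence score is also useful, the logits can be turned into probabilities with a softmax. The helper below is a hypothetical sketch that works on a logits tensor of shape `(1, 2)`, such as `outputs["logits"]` in the function above; the example logits are made up.

```python
import torch


def logits_to_prediction(logits):
    """Map a (1, 2) logits tensor to a label and its softmax probability."""
    probs = torch.softmax(logits, dim=1)
    predicted_class = int(torch.argmax(probs, dim=1).item())
    label = "positive" if predicted_class == 1 else "negative"
    return label, probs[0, predicted_class].item()


# Made-up logits for a single review, used only to exercise the helper.
label, confidence = logits_to_prediction(torch.tensor([[-1.2, 2.3]]))
print(label, round(confidence, 3))
```

The label mapping follows the function above, where class 1 is treated as positive.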

In [16]:
reviews = [
    "This movie was fantastic! The storyline was gripping and the characters were well-developed.",
    "I did not enjoy this film. The plot was predictable and the acting was mediocre at best."
]

* Iterates over the list of reviews, calls `predict_sentiment` on each one, and prints the review text along with its predicted sentiment.

In [17]:
for review in reviews:
    sentiment = predict_sentiment(review)
    print(f"Review: {review}\nSentiment: {sentiment}\n")

Review: This movie was fantastic! The storyline was gripping and the characters were well-developed.
Sentiment: positive

Review: I did not enjoy this film. The plot was predictable and the acting was mediocre at best.
Sentiment: negative

