<a href="https://colab.research.google.com/github/guilherme-argentino/fiap-ia4devs-techchallenge-fase3/blob/main/Fase3_TechChallenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning the BERT Model with AmazonTitles-1.3M

In this notebook we fine-tune the BERT model (`bert-base-uncased`) on the "AmazonTitles-1.3M" dataset. The goal is to train the model to generate product descriptions from product titles.

---

**NOTE TO THE INSTRUCTOR:** Since there is not enough time to fix the code and run the model end to end, we only provide a diagnosis in item **14. Prediction Diagnosis**

In [1]:
# Global settings and variables

# File and directory settings
ARQUIVO_TREINAMENTO = 'Datasets/LF-AmazonTitles-1.3M/trn.json.gz'
ARQUIVO_TESTE = 'Datasets/LF-AmazonTitles-1.3M/tst.json.gz'
DIRETORIO_CHECKPOINTS_LOCAL = './checkpoints'
DIRETORIO_CHECKPOINTS_COLAB = '/content/drive/MyDrive/FIAP/1IADT/Fase-3/tc/checkpoints'
MODELO_FINAL = './FIAP-1IADT-Grupo28'

# Model settings
MODELO_BASE = 'bert-base-uncased'
NUM_LABELS = 2

# Training settings
TAMANHO_BLOCO = 10000
MAX_LENGTH = 128
NUM_TRAIN_EPOCHS = 3
PER_DEVICE_TRAIN_BATCH_SIZE = 16
SAVE_STEPS = 1000
SAVE_TOTAL_LIMIT = 2

# Output settings
OUTPUT_DIR = './results'
LOGGING_DIR = './logs'

### 1. Install Dependencies


In [2]:
# Install the required libraries
%pip install datasets transformers torch pandas 'transformers[torch]' gdown huggingface_hub
%pip install --upgrade jupyter ipywidgets tqdm


In [5]:
import sys
import subprocess

def install_and_import(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", package])

# Install required packages
packages = ['jupyter', 'ipywidgets', 'tqdm']
for package in packages:
    install_and_import(package)

# Restart the kernel after running this cell
print("Please restart the Jupyter kernel to apply changes.")

Please restart the Jupyter kernel to apply changes.


## 2. Import the Libraries and Prepare the Environment

In [6]:
import os
import torch
import gzip
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import json
import sys
from huggingface_hub import login

if 'google.colab' in sys.modules:
    install_and_import('google-auth-oauthlib')

# Check whether a GPU is available (for example, on Colab)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


**IF YOU ONLY WANT TO TEST THE TRAINED MODEL, SKIP TO STEP 13**

---



## 3. Load the Tokenizer and the BERT Model

In [None]:
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained(MODELO_BASE)

# Load the BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained(MODELO_BASE, num_labels=NUM_LABELS)
model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

## 4. Dataset Class for Data Management

In [None]:
class AmazonTitlesDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


## 5. Chunked Reading and Tokenization (for GZIP-compressed JSONL)
The reader below processes JSONL files. Each line is a separate JSON object, so the function reads one line at a time and groups the parsed records into blocks (chunks).

In [None]:
def ler_arquivo_em_blocos_jsonl_gz(caminho_arquivo, tamanho_bloco=10000):
    with gzip.open(caminho_arquivo, 'rt') as f:  # 'rt' reads the file as text
        bloco = []
        for i, linha in enumerate(f):
            bloco.append(json.loads(linha.strip()))  # Parse one line as JSON
            if (i + 1) % tamanho_bloco == 0:
                yield bloco
                bloco = []
        if bloco:
            yield bloco

def processar_e_tokenizar_chunk(chunk, max_length=128):
    titles = [item['title'] for item in chunk]
    descriptions = [item['content'] for item in chunk]

    # Concatenate title and description
    inputs = [f"{title} [SEP] {description}" for title, description in zip(titles, descriptions)]

    # Tokenization
    encodings = tokenizer(inputs, truncation=True, padding=True, max_length=max_length)

    # Placeholder labels; replace as needed (SERIOUS ERROR: every example gets the same label)
    labels = [1] * len(chunk)

    return encodings, labels

# Detect the execution environment
def detectar_ambiente():
    if 'google.colab' in sys.modules:
        return 'colab_gpu' if torch.cuda.is_available() else 'colab_cpu'
    else:
        return 'local'

# Return the checkpoint directory for the current environment
def obter_diretorio_checkpoints():
    ambiente = detectar_ambiente()
    if ambiente.startswith('colab'):
        try:
            from google.colab import drive # type: ignore
            drive.mount('/content/drive')
            return DIRETORIO_CHECKPOINTS_COLAB
        except ImportError:
            print("Não foi possível importar o módulo 'drive'. Verifique se está no ambiente Colab.")
            return None
    else:
        return DIRETORIO_CHECKPOINTS_LOCAL

def carregar_ultimo_checkpoint():
    checkpoint_dir = obter_diretorio_checkpoints()

    # Nothing to resume if the directory is unavailable or does not exist yet
    if checkpoint_dir is None or not os.path.isdir(checkpoint_dir):
        return None, None, -1

    # Look for existing checkpoints
    checkpoints = sorted(
        [c for c in os.listdir(checkpoint_dir) if c.startswith('checkpoint_')],
        key=lambda x: int(x.split('_')[-1])  # Extract the checkpoint number and sort numerically
    )

    if checkpoints:
        # Take the most recent checkpoint
        ultimo_checkpoint = checkpoints[-1]
        ultimo_chunk = int(ultimo_checkpoint.split('_')[-1])

        # Load the model and tokenizer from the most recent checkpoint
        modelo = BertForSequenceClassification.from_pretrained(f'{checkpoint_dir}/{ultimo_checkpoint}')
        tokenizer = BertTokenizer.from_pretrained(f'{checkpoint_dir}/{ultimo_checkpoint}')

        return modelo, tokenizer, ultimo_chunk

    return None, None, -1  # -1 signals that no chunk has been processed yet

def salvar_progresso(model, tokenizer, chunk_idx):
    import shutil  # Local import so this function is self-contained

    checkpoint_dir = obter_diretorio_checkpoints()

    # Save the model and tokenizer to the current checkpoint
    model.save_pretrained(f'{checkpoint_dir}/checkpoint_{chunk_idx}')
    tokenizer.save_pretrained(f'{checkpoint_dir}/checkpoint_{chunk_idx}')

    # Record the index of the last processed chunk
    with open(f'{checkpoint_dir}/ultimo_chunk.txt', 'w') as f:
        f.write(str(chunk_idx))

    checkpoints = sorted(
        [c for c in os.listdir(checkpoint_dir) if c.startswith('checkpoint_')],
        key=lambda x: int(x.split('_')[-1])  # Extract the number from the name and sort numerically
    )

    # Keep only the two most recent checkpoints; shutil.rmtree avoids shelling out to rm
    for checkpoint in checkpoints[:-2]:
        shutil.rmtree(os.path.join(checkpoint_dir, checkpoint), ignore_errors=True)
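
To illustrate the chunked-reading idea from this section in isolation, here is a minimal sketch that does not depend on the Amazon files. It is not the notebook's `ler_arquivo_em_blocos_jsonl_gz`; the function name `read_jsonl_gz_in_chunks` and the file `demo.json.gz` are hypothetical, and `itertools.islice` is used as an alternative way to pull a fixed number of lines at a time.

```python
import gzip
import itertools
import json
import os
import tempfile


def read_jsonl_gz_in_chunks(path, chunk_size):
    """Yield lists of at most chunk_size parsed JSON records (hypothetical helper)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        while True:
            # islice consumes up to chunk_size lines and resumes where it stopped
            lines = list(itertools.islice(handle, chunk_size))
            if not lines:
                break
            yield [json.loads(line) for line in lines]


# Build a tiny throwaway file with 5 records to exercise the reader.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    for n in range(5):
        handle.write(json.dumps({"title": f"item {n}", "content": ""}) + "\n")

chunk_sizes = [len(chunk) for chunk in read_jsonl_gz_in_chunks(demo_path, chunk_size=2)]
print(chunk_sizes)  # [2, 2, 1]
```

Only one chunk is held in memory at a time, which is the property the notebook relies on to process `trn.json.gz` without loading the whole file.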

## 6. Training Configuration


In [None]:
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,                                   # Output directory for results
    num_train_epochs=NUM_TRAIN_EPOCHS,                       # Number of epochs
    per_device_train_batch_size=PER_DEVICE_TRAIN_BATCH_SIZE, # Batch size per device
    save_steps=SAVE_STEPS,                                   # Save a checkpoint every SAVE_STEPS (1000) steps
    save_total_limit=SAVE_TOTAL_LIMIT,                       # Keep at most SAVE_TOTAL_LIMIT (2) checkpoints
    logging_dir=LOGGING_DIR,                                 # Log directory
)
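
As a rough sanity check on these settings, the arithmetic below estimates how many optimizer steps a single full chunk produces. It is a sketch under stated assumptions: one device, no gradient accumulation, and the `Trainer` default of logging every 500 steps (`logging_steps` is not set in this notebook, so `LOGGING_STEPS` here is an assumed value).

```python
import math

# Values copied from the configuration cell of this notebook.
TAMANHO_BLOCO = 10000
NUM_TRAIN_EPOCHS = 3
PER_DEVICE_TRAIN_BATCH_SIZE = 16
SAVE_STEPS = 1000

# Assumption: Trainer default logging interval, since logging_steps is not set.
LOGGING_STEPS = 500

steps_per_epoch = math.ceil(TAMANHO_BLOCO / PER_DEVICE_TRAIN_BATCH_SIZE)
steps_per_chunk = steps_per_epoch * NUM_TRAIN_EPOCHS

# Steps at which a loss value is logged and a Trainer checkpoint is written
logged_steps = list(range(LOGGING_STEPS, steps_per_chunk + 1, LOGGING_STEPS))
saved_steps = list(range(SAVE_STEPS, steps_per_chunk + 1, SAVE_STEPS))

print(steps_per_chunk)  # 1875
print(logged_steps)     # [500, 1000, 1500]
print(saved_steps)      # [1000]
```

Under these assumptions each 10,000-record chunk runs 1,875 steps, which matches a per-chunk log that reports the loss at steps 500, 1000, and 1500.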


## 7. Per-Chunk Training Function


In [None]:
def treinar_com_chunk(encodings, labels):
    dataset = AmazonTitlesDataset(encodings, labels)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
    )
    # Train the model on the current chunk
    trainer.train()


## 8. Download the Data from Google Drive

We download the files containing the training and test data from Google Drive.

In [None]:
!mkdir -p Datasets
!cd Datasets; gdown 12zH4mL2RX8iSvH0VCNnd3QxO4DzuHWnK;
!cd Datasets; unzip LF-Amazon-1.3M.raw.zip; mkdir -p LF-AmazonTitles-1.3M/raw; mv LF-Amazon-1.3M/* LF-AmazonTitles-1.3M; rmdir LF-Amazon-1.3M

Downloading...
From (original): https://drive.google.com/uc?id=12zH4mL2RX8iSvH0VCNnd3QxO4DzuHWnK
From (redirected): https://drive.google.com/uc?id=12zH4mL2RX8iSvH0VCNnd3QxO4DzuHWnK&confirm=t&uuid=af90855d-e267-431e-8d2c-4666b46a1166
To: /media/ntfs/plots002/Projetos/multi/1IADT/tc/Datasets/LF-Amazon-1.3M.raw.zip
100%|████████████████████████████████████████| 890M/890M [00:33<00:00, 26.5MB/s]
Archive:  LF-Amazon-1.3M.raw.zip
   creating: LF-Amazon-1.3M/
  inflating: LF-Amazon-1.3M/lbl.json.gz  
  inflating: LF-Amazon-1.3M/trn.json.gz  
  inflating: LF-Amazon-1.3M/filter_labels_test.txt  
  inflating: LF-Amazon-1.3M/tst.json.gz  
  inflating: LF-Amazon-1.3M/filter_labels_train.txt  


## 9. Process and Train in Chunks
The main loop reads the JSONL file in chunks and fine-tunes the BERT model on each chunk.

In [None]:
# Process the compressed trn.json.gz file and fine-tune in chunks
caminho_arquivo = ARQUIVO_TREINAMENTO

ambiente = detectar_ambiente()
print(f"Ambiente detectado: {ambiente}")

# Load the most recent checkpoint, if one exists
modelo_carregado, tokenizer_carregado, ultimo_chunk_processado = carregar_ultimo_checkpoint()
if modelo_carregado is not None:
    model = modelo_carregado
    tokenizer = tokenizer_carregado
    print(f"Retomando o treinamento a partir do chunk {ultimo_chunk_processado + 1}")
else:
    # Initialize the model and tokenizer from the base model when there is no checkpoint
    model = BertForSequenceClassification.from_pretrained(MODELO_BASE, num_labels=NUM_LABELS)
    tokenizer = BertTokenizer.from_pretrained(MODELO_BASE)
    ultimo_chunk_processado = -1  # No chunk processed yet

# Process the chunks
for i, chunk in enumerate(ler_arquivo_em_blocos_jsonl_gz(caminho_arquivo, tamanho_bloco=TAMANHO_BLOCO), start=1):
    # Skip every chunk up to and including the last processed one
    if i <= ultimo_chunk_processado:
        print(f"Chunk {i} já processado. Pulando...")
        continue

    print(f"Processando chunk {i}")

    encodings, labels = processar_e_tokenizar_chunk(chunk)

    # Train on the current chunk
    treinar_com_chunk(encodings, labels)

    # Save progress after each chunk
    salvar_progresso(model, tokenizer, i)

    print(f"Treinamento com chunk {i} completo.")

Ambiente detectado: local
Retomando o treinamento a partir do chunk 58
Chunk 1 já processado. Pulando...
Chunk 2 já processado. Pulando...
Chunk 3 já processado. Pulando...
[the same message repeats for the following skipped chunks; the captured output is truncated partway through chunk 27]

Step,Training Loss
500,0.0
1000,0.0
1500,0.0


Treinamento com chunk 58 completo.
Processando chunk 59

[the same three-row table, with Training Loss 0.0 at steps 500, 1000, and 1500, followed by the matching "Treinamento com chunk N completo." line, repeats for every chunk from 59 through 225]

Treinamento com chunk 225 completo.
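
A likely explanation for the constant 0.0 training loss is the placeholder labeling in `processar_e_tokenizar_chunk`, which assigns label 1 to every example. With a single target class, the classifier can drive the cross-entropy to zero by always predicting class 1, regardless of the input text. The sketch below shows this with plain arithmetic; `cross_entropy` is a hypothetical helper written for this illustration and the logit values are made up, not taken from the trained model.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed stably (illustrative helper)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


# Made-up logits for a two-class head where the target is always class 1.
untrained = cross_entropy([0.0, 0.0], label=1)     # no preference between classes
collapsed = cross_entropy([-10.0, 10.0], label=1)  # always predicts class 1

print(round(untrained, 4))  # 0.6931
print(round(collapsed, 4))  # 0.0
```

A loss this low therefore says nothing about the quality of the model on the stated task; it only shows that the model learned the constant label.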


In [None]:
# Read the compressed trn.json.gz file once, only to count the chunks
caminho_arquivo = ARQUIVO_TREINAMENTO

ambiente = detectar_ambiente()
print(f"Ambiente detectado: {ambiente}")

i = 0  # Stays 0 if the file yields no chunks

# Iterate over the chunks without training
for i, chunk in enumerate(ler_arquivo_em_blocos_jsonl_gz(caminho_arquivo, tamanho_bloco=TAMANHO_BLOCO), start=1):
    pass


print(f"Total de chunks: {i}")

Ambiente detectado: local
Total de chunks: 225


## 10. Save the Fine-Tuned Model
After all chunks have been processed and the fine-tuning is done, we save the trained model.

In [None]:
# Save the fine-tuned model and the tokenizer
model.save_pretrained(MODELO_FINAL)
tokenizer.save_pretrained(MODELO_FINAL)

print("Modelo fine-tuned salvo com sucesso.")

Modelo fine-tuned salvo com sucesso.


## 11. Validating the Fine-Tuned Model

In [None]:
# Load the fine-tuned model
model = BertForSequenceClassification.from_pretrained(MODELO_FINAL)
tokenizer = BertTokenizer.from_pretrained(MODELO_FINAL)
model.to(device)


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [None]:
# Evaluation function: returns the fraction of correctly classified samples
def avaliar_modelo(model, dataloader):
    model.eval()
    total_correct = 0
    total_samples = 0

    with torch.no_grad():
        for batch in dataloader:
            inputs = {k: v.to(device) for k, v in batch.items() if k != 'labels'}
            labels = batch['labels'].to(device)

            outputs = model(**inputs)
            predictions = torch.argmax(outputs.logits, dim=-1)

            total_correct += (predictions == labels).sum().item()
            total_samples += labels.size(0)

    accuracy = total_correct / total_samples
    return accuracy


In [None]:
# Run the evaluation on the test file, one chunk at a time
total_correct = 0.0
total_samples = 0

for chunk in ler_arquivo_em_blocos_jsonl_gz(ARQUIVO_TESTE):
    encodings, labels = processar_e_tokenizar_chunk(chunk)
    dataset = AmazonTitlesDataset(encodings, labels)
    dataloader = DataLoader(dataset, batch_size=32)

    # avaliar_modelo returns the accuracy over the whole chunk
    accuracy = avaliar_modelo(model, dataloader)
    # Weight by chunk size so a smaller final chunk does not skew the mean
    total_correct += accuracy * len(dataset)
    total_samples += len(dataset)

    print(f"Acurácia do batch: {accuracy:.4f}")

average_accuracy = total_correct / total_samples
print(f"Acurácia média: {average_accuracy:.4f}")

Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do batch: 1.0000
Acurácia do 
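
An accuracy of exactly 1.0000 on every chunk is easier to interpret next to a trivial baseline. The sketch below computes the accuracy that a predictor would reach by always answering with the most frequent label. The function name `majority_baseline_accuracy` and the sample label lists are hypothetical and only illustrate the idea.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, most_common_count = counts.most_common(1)[0]
    return most_common_count / len(labels)


# Hypothetical label lists, not taken from the dataset.
constant_labels = [1] * 10
mixed_labels = [0, 1, 1, 1]

print(majority_baseline_accuracy(constant_labels))
print(majority_baseline_accuracy(mixed_labels))
```

When the baseline is already 1.0, a model accuracy of 1.0 says nothing about whether the model learned anything from the inputs.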

## 12. Hosting the Model on Hugging Face

In [None]:
nome_hf = 'rrantz/FIAP-1IADT-Grupo28'

# Access token (redacted here); prefer loading it from an environment variable
token_hf = '...'


login(token_hf)
model.push_to_hub(nome_hf)
tokenizer.push_to_hub(nome_hf)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/guilherme/.cache/huggingface/token
Login successful


model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/rrantz/FIAP-1IADT-Grupo28/commit/20b5ec31db0ca85c4a24c1887c46da539bae9b21', commit_message='Upload tokenizer', commit_description='', oid='20b5ec31db0ca85c4a24c1887c46da539bae9b21', pr_url=None, repo_url=RepoUrl('https://huggingface.co/rrantz/FIAP-1IADT-Grupo28', endpoint='https://huggingface.co', repo_type='model', repo_id='rrantz/FIAP-1IADT-Grupo28'), pr_revision=None, pr_num=None)
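
The comment in the cell above mentions environment variables, while the cell itself assigns the token as a string literal. A minimal sketch of reading it from the environment could look like the following. The helper name `read_hf_token` is hypothetical, and the variable name `HF_TOKEN` is an assumption about how the environment is configured.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in env_var, or raise if it is missing.

    The default variable name is an assumption made for this sketch.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return token
```

With this helper, the login call would become `login(read_hf_token())`, which keeps the secret out of the notebook file.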

## 13. Trying Out the Model

### Loading the model:

In [8]:
nome_hf = 'rrantz/FIAP-1IADT-Grupo28'

# Load the fine-tuned BERT classifier from the Hugging Face Hub
model_fine_tuned = BertForSequenceClassification.from_pretrained(nome_hf)
tokenizer_fine_tuned = BertTokenizer.from_pretrained(nome_hf)
model_fine_tuned.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

### Tokenization:

In [17]:
texto = "Será que errei o treinamento?"

# Tokenize the text
inputs = tokenizer_fine_tuned(texto, return_tensors="pt", padding=True, truncation=True, max_length=128)

# Move the tensors to the device (GPU or CPU)
inputs = {key: value.to(device) for key, value in inputs.items()}

### Inference with the Model:

In [18]:
# Put the model in evaluation mode (disables dropout)
model_fine_tuned.eval()

# Disable gradient computation (gradients are only needed during training)
with torch.no_grad():
    outputs = model_fine_tuned(**inputs)

# Get the logits of the predictions
logits = outputs.logits

# Convert the logits into probabilities
probabilidades = torch.nn.functional.softmax(logits, dim=-1)

# Pick the class with the highest probability
predicao = torch.argmax(probabilidades, dim=-1)

# Show the prediction
print(f"Classe prevista: {predicao.item()}, Probabilidades: {probabilidades}")


Classe prevista: 1, Probabilidades: tensor([[1.1245e-12, 1.0000e+00]], device='cuda:0')


### Inference analysis

In [19]:
inputs = tokenizer_fine_tuned("Frase aleatória", return_tensors="pt", padding=True, truncation=True, max_length=128)
inputs = {key: value.to(device) for key, value in inputs.items()}

with torch.no_grad():
    outputs = model_fine_tuned(**inputs)
    logits = outputs.logits
    probabilidades = torch.nn.functional.softmax(logits, dim=-1)
    predicao = torch.argmax(probabilidades, dim=-1)

print(f"Predição: {predicao.item()}, Probabilidades: {probabilidades}")

Predição: 1, Probabilidades: tensor([[1.1263e-12, 1.0000e+00]], device='cuda:0')


## 14. Prediction Diagnosis

Every title was trained with the same label (each chunk used `labels = [1] * len(chunk)`), so the model had no way to learn to distinguish between classes. It simply learned to always predict label "1", because that was the only target it ever saw during training. This is consistent with the training loss of 0.0 and the accuracy of 1.0000 reported for every test chunk: a constant prediction is enough to match labels that never vary.
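
A check like the one below, run before training, would have flagged the issue. It is only a sketch: `check_label_diversity` is a hypothetical helper, and the sample `chunk` is made up to mirror the `labels = [1] * len(chunk)` assignment.

```python
def check_label_diversity(labels, min_classes=2):
    """Raise ValueError when fewer than min_classes distinct labels exist."""
    distinct = set(labels)
    if len(distinct) < min_classes:
        raise ValueError(
            f"Only {len(distinct)} distinct label(s) found: {sorted(distinct)}"
        )
    return distinct


# Hypothetical chunk, labelled the same way as in the training loop.
chunk = ["title a", "title b", "title c"]
labels = [1] * len(chunk)

try:
    check_label_diversity(labels)
except ValueError as error:
    print(f"Label check failed: {error}")
```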



### Problem
Training with a single label value gives the model no class diversity to learn from, so it cannot learn any difference between inputs. The result is a model that always predicts the same class, regardless of the input.
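
The effect on the loss can be illustrated with plain arithmetic. In the sketch below, the target is always class 1, and the cross-entropy shrinks toward zero as the logit for class 1 grows relative to class 0, whatever the input is. The logit values are illustrative only; the largest gap was chosen to give a class 0 probability of the same order of magnitude as the one printed in section 13.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])


# Illustrative logits only: the gap between the two logits grows as the
# classifier keeps being rewarded for answering "1".
for gap in (0.0, 5.0, 27.5):
    logits = [0.0, gap]
    probs = softmax(logits)
    loss = cross_entropy(logits, target=1)
    print(f"gap={gap:5.1f}  p(class 0)={probs[0]:.4e}  loss={loss:.4e}")
```

A loss this small rounds to 0.0 in a training log, which matches the pattern seen in the chunk-by-chunk output.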

### What should have been done:
Define correct labels: in a classification task, the labels must reflect the distinct categories or classes the model is expected to predict. Each title or description therefore needs a label that represents its category.

Example:

- If the dataset classifies Amazon products, each product title should be associated with a specific category (e.g. "Electronics", "Books", "Clothing").
- The labels should vary across examples and represent those categories.
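
As a rough sketch of that idea, category names can be mapped to integer ids before tokenization. The records and the `category` field below are hypothetical; they are not fields confirmed to exist in the dataset used in this notebook.

```python
def build_label_mapping(categories):
    """Map each distinct category name to a stable integer id."""
    return {name: idx for idx, name in enumerate(sorted(set(categories)))}


# Hypothetical records: the "category" field is an assumption of this sketch.
records = [
    {"title": "Wireless Mouse", "category": "Electronics"},
    {"title": "Intro to Algorithms", "category": "Books"},
    {"title": "Cotton T-Shirt", "category": "Clothing"},
    {"title": "Short Stories", "category": "Books"},
]

label2id = build_label_mapping(r["category"] for r in records)
id2label = {idx: name for name, idx in label2id.items()}

# One integer label per record, replacing the constant [1] * len(chunk)
labels = [label2id[r["category"]] for r in records]
num_labels = len(label2id)

print(label2id)
print(labels)
print(num_labels)
```

Under this approach, `num_labels` would take the place of the fixed `NUM_LABELS = 2` when the classification head is created.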
