# Task B

Our goal is to fine-tune a model so that it can complete text conditioned on the author's political leaning. First of all, let's load a model and test it. I chose Qwen 2.5 with 1.5B parameters, which is what fit on my GPU.

Below is an example of using Qwen to complete text. Note that the model weights have to be quantized (to 4 bits here) so that the model fits on the GPU during training.

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name = "Qwen/Qwen2.5-1.5B"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map={"": 0},
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    )
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 1536)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear4bit(in_features=1536, out_features=1536, bias=True)
          (k_proj): Linear4bit(in_features=1536, out_features=256, bias=True)
          (v_proj): Linear4bit(in_features=1536, out_features=256, bias=True)
          (o_proj): Linear4bit(in_features=1536, out_features=1536, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear4bit(in_features=1536, out_features=8960, bias=False)
          (up_proj): Linear4bit(in_features=1536, out_features=8960, bias=False)
          (down_proj): Linear4bit(in_features=8960, out_features=1536, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((1
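As a rough sanity check on why 4-bit quantization helps, the layer shapes printed above can be turned into a parameter count and a weight-memory estimate. This is a back-of-the-envelope sketch, not a measurement. It assumes that the truncated final norm has size 1536, that the output head shares the embedding matrix, and that only the `Linear4bit` weights are stored at 4 bits while everything else stays in bfloat16. It ignores quantization constants, activations and optimizer state.

```python
# Shapes copied from the printed Qwen2ForCausalLM summary.
HIDDEN = 1536
VOCAB = 151936
N_LAYERS = 28

# (in_features, out_features, has_bias) for each linear layer in one decoder block
LINEARS = {
    "q_proj": (1536, 1536, True),
    "k_proj": (1536, 256, True),
    "v_proj": (1536, 256, True),
    "o_proj": (1536, 1536, False),
    "gate_proj": (1536, 8960, False),
    "up_proj": (1536, 8960, False),
    "down_proj": (8960, 1536, False),
}

linear_weights_per_layer = sum(i * o for i, o, _ in LINEARS.values())
biases_per_layer = sum(o for _, o, has_bias in LINEARS.values() if has_bias)
norms_per_layer = 2 * HIDDEN  # input_layernorm + post_attention_layernorm

# Weights assumed to be stored at 4 bits
quantized_params = N_LAYERS * linear_weights_per_layer

# Everything else, assumed to stay in bfloat16
other_params = (
    VOCAB * HIDDEN                                     # embed_tokens
    + N_LAYERS * (biases_per_layer + norms_per_layer)  # biases and RMSNorm weights
    + HIDDEN                                           # final norm (assumed size 1536)
)
total_params = quantized_params + other_params

GIB = 1024 ** 3
bf16_gib = total_params * 2 / GIB                              # 2 bytes per parameter
mixed_gib = (quantized_params * 0.5 + other_params * 2) / GIB  # 4 bits = 0.5 byte

print(f"total parameters: {total_params:,}")
print(f"bf16 weights:     {bf16_gib:.2f} GiB")
print(f"4-bit linears:    {mixed_gib:.2f} GiB")
```

Under these assumptions the weights alone shrink to a bit over a third of their bfloat16 size, which leaves room for activations and optimizer state during training.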

Some tests follow:

In [2]:
def complete(text, limit_tokens=256):
    pretext = "<system>You're an internet user commenting on a blog.</system> "
    text = pretext + text
    inputs = tokenizer(
        text,
        return_tensors="pt"
    ).to(device)
    
    with torch.no_grad():
        generated = model.generate(
            **inputs,              # inputs is now a valid dict
            max_new_tokens=limit_tokens,
            do_sample=True,
            temperature=0.8,
            top_p=0.9
        )
    
    return tokenizer.decode(generated[0], skip_special_tokens=True)[len(pretext)+3:]

In [13]:
message = "<e>I love my nation because"
print(complete(message))

I love my nation because itOPOPOPANOPNewsNewsNewsOPANAOPANNewOPANPresidentOPOPANItOPOPOPOPOPNewsOPOPOPOPANNewsOPOPOPANOPOPANOPOPOPOPOPOPOPOPOPOPOPOPOPOPOPANTheOPOPOPOPNewsNewsOPOPOPOPOPOPOPOPOPOPOPOPOPOPANAOPOPOPOPOPOPOPOPOPOPOPOPOPOPOPOPOPOPOPANPInOPOPOPOPANNewsOPOPOPANInOPNewsOPOPOPOPOPOPOPOPOPOPOPOPOPOPOPOPOPOPNewsOPOPOPANAANTheOPOPOPOPNewsOPOPOPANNewsOPOPOPOPOPOPOPANANANTheOPOPOPOPOPOPOPANAANNewsOpInANPresidentOPOPOPOPANANATheOPOPOPANDemocraticNewsOPOPANAOPOPANNewsOPOPOPOPOPOPOPOPOPANDemocratsOPOPOPOPANDemocratsOPANDemocraticOPOPOPOPOPOPOPOPOPOPOPOPOPOPNewsNewsNewsOPOPOPANPresidentOPOPOPOPOPOPOPOPOPOPOPOPOPOPOP


# 1) Importing the data

Okay, now we need to import the data. I will adapt the code from Task A so that each text is loaded already tagged with its political leaning.

In [3]:
def import_json(path):
    from json import load
    from os import listdir
    from os.path import join

    # List the entries in `path`, skipping hidden ones (names starting with ".")
    names = [name for name in listdir(path) if not name.startswith(".")]

    # Parse each JSON file and collect its tagged content and label
    data = []
    for name in names:
        with open(join(path, name), encoding="utf-8") as f:
            record = load(f)
        # Prefix the text with a tag built from the first letter of the
        # label name, e.g. "<l>"
        tag = f"<{record['label_text'][0]}>"
        data.append({
            "content": tag + record["content"],
            "label": record["label"],
        })
    return data
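
To make the expected input concrete, here is a small self-contained sketch of the tagging step on a made-up record. The field names `content`, `label` and `label_text` come from the code above. The value `"left"` for `label_text` is an assumption: only its first letter matters, and the evaluation sample printed further down shows the `<l>` tag paired with label 0. The helper `tag_record` is hypothetical and exists only for this illustration.

```python
import json
import os
import tempfile

def tag_record(record):
    """Prefix the text with a tag made from the first letter of `label_text`."""
    tag = f"<{record['label_text'][0]}>"
    return {"content": tag + record["content"], "label": record["label"]}

# Hypothetical record; the real files may carry extra fields.
sample = {"content": "Some article text.", "label": 0, "label_text": "left"}

with tempfile.TemporaryDirectory() as folder:
    # Write one visible file and one hidden file, which should be skipped.
    with open(os.path.join(folder, "0001.json"), "w", encoding="utf-8") as f:
        json.dump(sample, f)
    with open(os.path.join(folder, ".hidden"), "w", encoding="utf-8") as f:
        f.write("not json")

    loaded = []
    for name in sorted(os.listdir(folder)):
        if name.startswith("."):
            continue
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            loaded.append(tag_record(json.load(f)))

print(loaded)
```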

In [None]:
Let's also prepare the dataset.

In [4]:
from datasets import Dataset

train_data = import_json("../train_json/")
eval_data = import_json("../dev_json/")

train_set = Dataset.from_list(train_data)
eval_set = Dataset.from_list(eval_data)

In [7]:
print(eval_set[0])

{'content': '<l>For the last couple of days, the lonely corner of Overland Avenue and Santa Fe Street in El Paso, Texas, has become home for Luis Cubillan, 41, and his family after leaving Venezuela over a month ago.\n\n“Welcome to our home,” Cubillan comically told VICE News while folding an old rug they used as a mattress to sleep at night. “This is my wife, my two daughters, my two grandsons, and my stepson.”', 'label': 0}


# 2) Choosing the tokenizer

We will use a technique called Token-Based Bias Injection: we bias the model by adding, during training, a special token that represents the political alignment of each text.

To do this, we need to create these special tokens and tell the model that they exist, since this changes the embedding matrix.

In [5]:
tokenizer.add_tokens(["<l>", "<c>", "<r>"])
model.resize_token_embeddings(len(tokenizer))

Embedding(151668, 1536)
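
The effect of these two calls can be pictured with a toy vocabulary and embedding table in plain Python. This only illustrates the idea and is not how `transformers` implements it: the functions `add_tokens` and `resize_embeddings` below are hypothetical stand-ins, and the random initialization of new rows is an assumption made for the sketch. One detail worth noticing in the real output is that the table went from 151936 rows in the first printout to 151668 here. This suggests the original matrix had more rows than the tokenizer has tokens and was resized to match `len(tokenizer)` exactly.

```python
import random

def add_tokens(vocab, new_tokens):
    """Give each unseen token the next free id and return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
            added += 1
    return added

def resize_embeddings(table, new_size, dim, rng):
    """Return a table with exactly `new_size` rows, keeping the existing rows."""
    kept = [row[:] for row in table[:new_size]]
    while len(kept) < new_size:
        # New rows start from small random values (an assumption of this sketch).
        kept.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return kept

rng = random.Random(0)
dim = 4
vocab = {"I": 0, "love": 1, "my": 2, "nation": 3}
table = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in vocab]

n_added = add_tokens(vocab, ["<l>", "<c>", "<r>"])
new_table = resize_embeddings(table, len(vocab), dim, rng)

print(n_added, len(new_table), vocab["<r>"])
```

The new rows carry no meaning until training updates them, which is why the tags only start to steer generation after fine-tuning.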

# 3) Building the test set

We will build two datasets. It works like this: for each message in the test data, I take its first 5 words and let the model generate a continuation with a limit of 256 tokens. I generate texts with the model both before and after fine-tuning.

After that, we use the classifier from Task A to check whether the model followed the requested bias.
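
The evaluation protocol described above can be sketched without the model. In this sketch `make_prompt` mirrors the 5-word truncation, while `agreement_rate` and the `predicted` labels are hypothetical: they stand in for whatever the Task A classifier returns, which is not shown in this notebook.

```python
def make_prompt(content, n_words=5):
    """Keep the first `n_words` whitespace-separated words (the tag stays attached)."""
    return " ".join(content.split()[:n_words])

def agreement_rate(expected, predicted):
    """Fraction of generated texts whose predicted label matches the requested one."""
    if len(expected) != len(predicted):
        raise ValueError("label lists must have the same length")
    if not expected:
        return 0.0
    hits = sum(1 for e, p in zip(expected, predicted) if e == p)
    return hits / len(expected)

prompt = make_prompt("<l>For the last couple of days, the lonely corner")
expected = [0, 0, 1, 2]   # labels attached to the prompts
predicted = [0, 1, 1, 2]  # made-up classifier outputs, for illustration only

print(prompt)
print(agreement_rate(expected, predicted))
```

Comparing this rate between the pretrained and the fine-tuned generations gives a single number for how much the tag steers the output.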

In [6]:
import json

def make_test_base(dataset, filename):
    # Generate a completion from the first 5 words of each example
    d = []
    for phrase in dataset:
        message = " ".join(phrase["content"].split()[:5])
        d.append({
            "content": complete(message),
            "label": phrase["label"]
        })
    # Save the generated texts together with their original labels
    with open(filename, 'w') as f:
        json.dump(d, f)

In [8]:
make_test_base(eval_data[:100], "pretrained_data.json")

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.

# 4) Fine-tuning

We will use PEFT, a family of parameter-efficient fine-tuning techniques. More specifically, we use LoRA, which creates small modules and grafts them onto the model. Instead of training the model's weights, we train only the weights of these modules. It will also be necessary to disable some GPU optimizations because of the limited memory.

In [7]:
from peft import LoraConfig, get_peft_model, PeftModel

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules="all-linear",
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.enable_input_require_grads()
model.gradient_checkpointing_enable()
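
To get a feel for how small the trained part is, the number of LoRA parameters can be estimated from the layer shapes printed earlier. For a linear layer with `in` inputs and `out` outputs, LoRA with rank `r` adds two matrices, A of shape `r x in` and B of shape `out x r`, and the adapted layer computes `W x + (lora_alpha / r) * B A x`. The sketch assumes that `target_modules="all-linear"` adapts the seven linear layers in each of the 28 decoder blocks and nothing else. The tiny `lora_forward` function is a hypothetical pure-Python illustration, not PEFT's implementation.

```python
R = 8
LORA_ALPHA = 16
N_LAYERS = 28

# (in_features, out_features) of the linear layers in one decoder block
SHAPES = {
    "q_proj": (1536, 1536),
    "k_proj": (1536, 256),
    "v_proj": (1536, 256),
    "o_proj": (1536, 1536),
    "gate_proj": (1536, 8960),
    "up_proj": (1536, 8960),
    "down_proj": (8960, 1536),
}

def lora_params(in_features, out_features, r):
    """Parameters in A (r x in) plus B (out x r)."""
    return r * in_features + out_features * r

trainable = N_LAYERS * sum(lora_params(i, o, R) for i, o in SHAPES.values())
frozen_linear = N_LAYERS * sum(i * o for i, o in SHAPES.values())
scaling = LORA_ALPHA / R

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, scale):
    """Compute W x + scale * B (A x) for small list-of-lists matrices."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]                  # rank 1 for the toy example
x = [2.0, 1.0]
y_initial = lora_forward(W, A, [[0.0], [0.0]], x, scaling)  # B starts at zero
y_trained = lora_forward(W, A, [[1.0], [2.0]], x, scaling)  # after B has moved

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of the adapted linear weights: {trainable / frozen_linear:.2%}")
print(y_initial, y_trained)
```

With B initialized to zero the adapted layer reproduces the frozen layer exactly, so training starts from the pretrained behaviour and only gradually moves away from it.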

We will use SFTTrainer to run the training for us. We only need to tune the hyperparameters.

In [8]:
from transformers import EarlyStoppingCallback
from trl import SFTTrainer, SFTConfig

config = SFTConfig(
    output_dir="outputs/",
    num_train_epochs=5,               # superseded by max_steps below
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,    # 4 * 8 = 32 sequences per optimizer step, per device
    warmup_steps=0,                   # left at 0, so warmup_ratio applies
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    dataset_text_field="content",
    warmup_ratio=0.03,
    weight_decay=0.01,
    adam_beta2=0.999,
    max_grad_norm=1.0,

    max_steps=5000,                   # total number of optimizer steps
    logging_steps=1,
    save_steps=5000,
    save_total_limit=2,

    fp16=True,
    gradient_checkpointing=True,      # trades compute for memory
    packing=False,
)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=train_set,
    eval_dataset=eval_set,
    args=config,
)

Adding EOS to train dataset:   0%|          | 0/45066 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/45066 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/45066 [00:00<?, ? examples/s]

Adding EOS to eval dataset:   0%|          | 0/5008 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/5008 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/5008 [00:00<?, ? examples/s]
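
For reference, the learning-rate curve implied by `lr_scheduler_type="cosine"`, `learning_rate=2e-4`, `warmup_ratio=0.03` and `max_steps=5000` can be sketched as a linear warmup followed by half a cosine period. This is a plain-Python approximation of that kind of schedule, not the trainer's own code. Taking the warmup length as exactly 150 steps (3% of 5000) is an assumption.

```python
import math

BASE_LR = 2e-4
MAX_STEPS = 5000
WARMUP_STEPS = 150  # assumed: 3% of MAX_STEPS

def lr_at(step):
    """Learning rate at a given optimizer step under warmup + cosine decay."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to BASE_LR
        return BASE_LR * step / WARMUP_STEPS
    # Cosine decay from BASE_LR down to 0 over the remaining steps
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 75, 150, 2575, 5000):
    print(step, f"{lr_at(step):.2e}")
```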

In [9]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


Step,Training Loss
1,3.3007
2,3.2063
3,3.3265
4,3.1618
5,3.2823
6,3.4023
7,3.1413
8,3.1644
9,3.2237
10,3.1882




TrainOutput(global_step=5000, training_loss=2.2984503690481186, metrics={'train_runtime': 54909.1224, 'train_samples_per_second': 2.914, 'train_steps_per_second': 0.091, 'total_flos': 1.477849491222958e+17, 'train_loss': 2.2984503690481186, 'epoch': 3.5488595011981894})
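
The numbers in this `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single GPU, so the effective batch is 4 x 8 sequences per optimizer step. It also treats the reported loss as a mean per-token cross-entropy in nats, which makes `exp(loss)` a perplexity. Both are assumptions about the setup rather than facts stated in the log.

```python
import math

PER_DEVICE_BATCH = 4
GRAD_ACCUM = 8
MAX_STEPS = 5000
TRAIN_EXAMPLES = 45066         # from the "Tokenizing train dataset" progress bar
RUNTIME_SECONDS = 54909.1224   # train_runtime
MEAN_TRAIN_LOSS = 2.2984503690481186
FIRST_LOSSES = [3.3007, 3.2063, 3.3265, 3.1618, 3.2823,
                3.4023, 3.1413, 3.1644, 3.2237, 3.1882]

# How much data the run consumed
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
samples_seen = effective_batch * MAX_STEPS
approx_epochs = samples_seen / TRAIN_EXAMPLES
samples_per_second = samples_seen / RUNTIME_SECONDS
hours = RUNTIME_SECONDS / 3600

# Loss expressed as perplexity (assuming nats per token)
start_loss = sum(FIRST_LOSSES) / len(FIRST_LOSSES)
start_perplexity = math.exp(start_loss)
mean_perplexity = math.exp(MEAN_TRAIN_LOSS)

print(f"effective batch: {effective_batch}, samples seen: {samples_seen}")
print(f"approx. epochs: {approx_epochs:.2f}, runtime: {hours:.1f} h")
print(f"throughput: {samples_per_second:.3f} samples/s")
print(f"perplexity: {start_perplexity:.1f} (first 10 steps) vs {mean_perplexity:.1f} (run average)")
```

The estimate of about 3.55 epochs agrees with the reported `epoch` of roughly 3.549, and the throughput matches `train_samples_per_second`. Since `training_loss` is an average over all 5000 steps, the loss at the end of training is probably lower than 2.30.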

In [10]:
trainer.save_model("modelo_taskB")



In [19]:
model = trainer.model
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Qwen2ForCausalLM(
      (model): Qwen2Model(
        (embed_tokens): Embedding(151668, 1536)
        (layers): ModuleList(
          (0-27): 28 x Qwen2DecoderLayer(
            (self_attn): Qwen2Attention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=1536, out_features=1536, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=1536, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=1536, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.L

In [None]:
make_test_base(eval_data[:100], "finetuned_data.json")