### Mounting Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Installing dependencies

In [2]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install triton
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes
!pip install transformers datasets


Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-wohw1btu/unsloth_fd6c8e75a8dc42afa131cc68a170f52e
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-wohw1btu/unsloth_fd6c8e75a8dc42afa131cc68a170f52e
  Resolved https://github.com/unslothai/unsloth.git to commit a0acecb50f39d9b62a144684be9ed9e3c3755a1f
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tyro (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading tyro-0.8.11-py3-none-any.whl.metadata (8.4 kB)
Collecting transformers>=4.45.1 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[

### Importing libraries

In [3]:
import json
import pandas as pd
from datasets import load_dataset
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from transformers import pipeline

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


### Model settings for fine-tuning

In [4]:
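max_seq_length = 512  # Maximum number of tokens per training example
dtype = None          # None lets Unsloth auto-detect (float16 or bfloat16, depending on the GPU)
load_in_4bit = True   # Load 4-bit quantized weights to reduce GPU memory use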
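
To see why `load_in_4bit = True` matters on a single GPU, a rough back-of-the-envelope estimate of weight memory helps. The sketch below assumes a round 7 billion parameters (a hypothetical figure for illustration, not the exact size of any checkpoint) and ignores activations, optimizer state, and quantization overhead.

```python
# Rough weight-memory estimate; the 7e9 parameter count is an assumed round number.
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Return the approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

ASSUMED_PARAMS = 7e9

for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(ASSUMED_PARAMS, bits):.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the 16-bit footprint, which leaves more room for LoRA gradients and activations.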

### Compatible models

In [47]:
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B",
    "unsloth/Phi-3-mini-4k-instruct",
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
]

### Formatting the product dataset

In [5]:
def format_dataset_into_model_input(data):
    # Initialize the lists that will hold each column
    instructions = []
    inputs = []
    outputs = []

    # Map each product's title to the input and its content to the output
    for example in data:
        title = example['title']
        content = example['content']
        instruction = "Provide a detailed and engaging product description with a unique, creative style that highlights its key features and benefits, making it appealing to potential buyers."

        instructions.append(instruction)
        inputs.append(title)
        outputs.append(content)

    # Build the final column-oriented dictionary
    formatted_data = {
        "instruction": instructions,
        "input": inputs,
        "output": outputs
    }

    # Save the result to a JSON file on Drive
    formatted_json_path = "/content/drive/MyDrive/Notebooks/trn.json/formatted_amazon_product_data.json"
    with open(formatted_json_path, 'w') as output_file:
        json.dump(formatted_data, output_file, indent=4)

    print(f"Dataset formatado salvo em: {formatted_json_path}")

    return formatted_json_path
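
The function above turns row-oriented records into a single dictionary of parallel lists. A minimal sketch of that reshaping on an in-memory list may make the resulting layout easier to see; the two records and the shortened instruction string are made up for illustration and are not taken from the real dataset.

```python
# Minimal sketch: reshape row-oriented records into the column-oriented
# {"instruction": [...], "input": [...], "output": [...]} layout.
# The records and the shortened instruction below are hypothetical examples.
INSTRUCTION = "Provide a detailed and engaging product description."

def to_columns(records):
    """Convert a list of {'title', 'content'} dicts into three parallel lists."""
    return {
        "instruction": [INSTRUCTION for _ in records],
        "input": [r["title"] for r in records],
        "output": [r["content"] for r in records],
    }

sample = [
    {"title": "Example Mug", "content": "A ceramic mug."},
    {"title": "Example Lamp", "content": "A desk lamp."},
]
columns = to_columns(sample)
print(columns["input"])
```

All three lists have the same length, so position `i` in each list describes the same product.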

### Declaring constants and formatting the Amazon dataset

In [6]:
# Path to the original JSON file
DATA_PATH = '/content/drive/MyDrive/Notebooks/trn.json/trn.json'

# Load the dataset from the JSON file
dataset = load_dataset("json", data_files=DATA_PATH)

# Path of the formatted JSON file that will be generated
OUTPUT_PATH_DATASET = format_dataset_into_model_input(dataset['train'] if 'train' in dataset else dataset)



Dataset formatado salvo em: /content/drive/MyDrive/Notebooks/trn.json/formatted_amazon_product_data.json


### Loading the pretrained model

In [7]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-v0.3-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit
)

==((====))==  Unsloth 2024.9.post3: Fast Mistral patching. Transformers = 4.45.1.
   \\   /|    GPU: NVIDIA L4. Max memory: 22.168 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



### Preparing the base model for inference (before fine-tuning)

In [8]:
FastLanguageModel.for_inference(model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32768, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((40

### Function to test the model

In [9]:
def generate_description(title):
    input_text = f"Provide a detailed and engaging product description with a unique, creative style that highlights its key features and benefits, making it appealing to potential buyers.\nTítulo: {title}\nDescrição:"
    inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=128)
    description = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return description
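
`model.generate` returns the prompt tokens followed by the newly generated tokens, which is why the printed descriptions in the test cells begin with the instruction text. A common way to keep only the generated part is to slice off the prompt length before decoding. The sketch below uses plain Python lists as stand-ins for token IDs so that it runs without a model; the IDs are made up and the helper name `keep_new_tokens` is hypothetical.

```python
# Stand-in token IDs (made up); with real tensors the equivalent slice would be
# outputs[0][inputs["input_ids"].shape[1]:] before calling tokenizer.decode.
def keep_new_tokens(prompt_ids, output_ids):
    """Drop the echoed prompt from a decoder-only model's output sequence."""
    if output_ids[:len(prompt_ids)] != prompt_ids:
        raise ValueError("output does not start with the prompt")
    return output_ids[len(prompt_ids):]

prompt_ids = [1, 733, 16289]
output_ids = [1, 733, 16289, 415, 9246, 2]
print(keep_new_tokens(prompt_ids, output_ids))
```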

### Testing the model without fine-tuning

In [10]:
title = "Smartphone Samsung Android 64GB"
description = generate_description(title)
print(f"Título: {title}\nDescrição: {description}")

Título: Smartphone Samsung Android 64GB
Descrição: Provide a detailed and engaging product description with a unique, creative style that highlights its key features and benefits, making it appealing to potential buyers.
Título: Smartphone Samsung Android 64GB
Descrição:

The Samsung Galaxy S20 FE is a powerful smartphone that offers a range of features and capabilities to enhance your mobile experience. With its 6.5-inch Super AMOLED display, you can enjoy vibrant and clear visuals for your favorite movies, games, and more. The phone is powered by a Qualcomm Snapdragon 865 processor, ensuring smooth and efficient performance.

The Galaxy S20 FE features a triple-lens camera system, including a 12MP ultra-wide lens, a 12MP wide-angle lens, and an 8MP


In [11]:
title = "Smartphone iphone 64GB"
description = generate_description(title)
print(f"Título: {title}\nDescrição: {description}")

Título: Smartphone iphone 64GB
Descrição: Provide a detailed and engaging product description with a unique, creative style that highlights its key features and benefits, making it appealing to potential buyers.
Título: Smartphone iphone 64GB
Descrição:

The iPhone 64GB is a powerful and versatile smartphone that offers a range of features and capabilities to enhance your daily life. With its sleek design, advanced technology, and user-friendly interface, the iPhone 64GB is a popular choice among consumers.

One of the standout features of the iPhone 64GB is its large storage capacity, which allows you to store a vast amount of data, including photos, videos, music, and apps. This makes it an ideal choice for those who enjoy capturing and sharing their memories, as well as those who enjoy streaming music and videos on


### LoRA configuration

In [12]:
model = FastLanguageModel.get_peft_model(
    model,
    r=32,  # LoRA rank; a higher rank gives the adapter more capacity
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,  # Scaling factor; with r=32 the effective scale alpha/r is 1
    lora_dropout=0.1,  # Dropout on the LoRA inputs to reduce overfitting; nonzero values disable Unsloth's fast path
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.1.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2024.9.post3 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
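
The number of trainable parameters that LoRA adds can be checked by hand: each adapted linear layer with `in_features` inputs and `out_features` outputs gains two low-rank matrices holding `r * (in_features + out_features)` weights in total. The sketch below uses the layer shapes printed in the model summary above together with `r = 32`; it is an arithmetic check, not a call into PEFT or Unsloth.

```python
# (in_features, out_features) for each targeted projection, taken from the
# printed MistralForCausalLM summary; the model has 32 decoder layers.
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32
R = 32
LORA_ALPHA = 32

def lora_params(in_features, out_features, r):
    """LoRA A is (r x in_features) and LoRA B is (out_features x r)."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"trainable LoRA parameters: {total:,}")        # 83,886,080
print(f"LoRA scaling (alpha / r): {LORA_ALPHA / R}")  # 1.0
```

The total agrees with the "Number of trainable parameters" figure that the trainer prints when training starts.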


### Prompt template for product descriptions

In [13]:
alpaca_prompt = """Below is an instruction outlining the task, accompanied by a title that provides additional context. Write a response that appropriately concludes the description.

### Instrução:
{}

### Título:
{}

### Descrição:
{}"""

EOS_TOKEN = tokenizer.eos_token

### Function to format the prompt

In [14]:
def formatting_prompts_func(data):
    instructions = data["instruction"]
    inputs = data["input"]
    outputs = data["output"]

    texts = []

    for instruction, input_text, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)

    return {"text": texts}
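
With `batched=True`, `Dataset.map` passes the function a dictionary of column lists and expects a dictionary of lists back. The following sketch mimics that contract with a plain dict so that it runs without the `datasets` library; the shortened template and the literal `</s>` end-of-sequence string are stand-ins assumed for illustration.

```python
# Stand-ins (assumptions): a shortened template and a literal EOS string.
TEMPLATE = "### Instrução:\n{}\n\n### Título:\n{}\n\n### Descrição:\n{}"
EOS = "</s>"

def format_batch(batch):
    """Build one training string per row from parallel column lists."""
    texts = [
        TEMPLATE.format(instruction, title, description) + EOS
        for instruction, title, description in zip(
            batch["instruction"], batch["input"], batch["output"]
        )
    ]
    return {"text": texts}

batch = {
    "instruction": ["Describe the product."],
    "input": ["Girls Ballet Tutu Neon Pink"],
    "output": ["High quality 3 layer ballet tutu. 12 inches in length"],
}
result = format_batch(batch)
print(result["text"][0])
```

Appending the end-of-sequence token to every example is what gives the model a signal for where a description should stop.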

### Loading the dataset

In [15]:
dataset = load_dataset("json", data_files=OUTPUT_PATH_DATASET)

print(dataset)
print(dataset['train'].column_names)


DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 2248619
    })
})
['instruction', 'input', 'output']


In [16]:
print(dataset['train'][0])  # Inspect one sample from the dataset

{'instruction': 'Provide a detailed and engaging product description with a unique, creative style that highlights its key features and benefits, making it appealing to potential buyers.', 'input': 'Girls Ballet Tutu Neon Pink', 'output': 'High quality 3 layer ballet tutu. 12 inches in length'}


In [17]:
formatted_dataset = dataset['train'].map(formatting_prompts_func, batched=True)


In [18]:
formatted_dataset

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 2248619
})

In [19]:
dataset['train'][0]

{'instruction': 'Provide a detailed and engaging product description with a unique, creative style that highlights its key features and benefits, making it appealing to potential buyers.',
 'input': 'Girls Ballet Tutu Neon Pink',
 'output': 'High quality 3 layer ballet tutu. 12 inches in length'}

### Trainer configuration

In [20]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=100,  # Linear warmup; note this exceeds max_steps, so the run ends before warmup completes
        max_steps=60,  # Total number of optimizer steps (overrides num_train_epochs)
        learning_rate=1e-4,  # Peak learning rate
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,  # Log the training loss every 10 steps
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)


max_steps is given, it will override any value given in num_train_epochs
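
The arguments above determine how much data the run actually sees and how the learning rate evolves. The arithmetic below is a sketch that uses only the values from the cell; the schedule function assumes the usual form of linear warmup followed by linear decay and is not the library's own implementation.

```python
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 4
MAX_STEPS = 60
WARMUP_STEPS = 100
PEAK_LR = 1e-4
NUM_EXAMPLES = 2_248_619

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM   # examples per optimizer step on one GPU
examples_seen = effective_batch * MAX_STEPS       # examples consumed by the whole run
fraction_seen = examples_seen / NUM_EXAMPLES

def linear_schedule_lr(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay to zero (assumed form)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen} ({fraction_seen:.4%} of the dataset)")
for step in (10, 30, 60):
    print(step, linear_schedule_lr(step, PEAK_LR, WARMUP_STEPS, MAX_STEPS))
```

Under this assumed schedule, the 60-step run stays inside the 100-step warmup and never reaches the configured peak learning rate, and it touches only a small fraction of the 2.2 million examples. Lowering `warmup_steps` or raising `max_steps` would change both.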


### Training the model

In [21]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 2,248,619 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 83,886,080


| Step | Training Loss |
|------|---------------|
| 10 | 2.4985 |
| 20 | 1.9797 |
| 30 | 1.2401 |
| 40 | 1.1984 |
| 50 | 1.2764 |
| 60 | 1.1402 |
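
The training loss of a causal language model is a mean cross-entropy per token, so exponentiating it gives a perplexity that can be easier to interpret. The snippet below is a sketch that converts the logged values above; it assumes the loss is reported in nats (natural logarithm).

```python
import math

# (step, training loss) pairs copied from the log above.
LOSSES = [
    (10, 2.4985), (20, 1.9797), (30, 1.2401),
    (40, 1.1984), (50, 1.2764), (60, 1.1402),
]

def perplexity(loss_nats: float) -> float:
    """Perplexity is exp(mean cross-entropy) when the loss is in nats."""
    return math.exp(loss_nats)

for step, loss in LOSSES:
    print(f"step {step}: loss {loss:.4f} -> perplexity {perplexity(loss):.2f}")
```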


### Saving the model

In [22]:
# Save the fine-tuned LoRA adapter and tokenizer
model.save_pretrained("/content/drive/MyDrive/Notebooks/trn.json/lora_model")
tokenizer.save_pretrained("/content/drive/MyDrive/Notebooks/trn.json/lora_model")

('/content/drive/MyDrive/Notebooks/trn.json/lora_model/tokenizer_config.json',
 '/content/drive/MyDrive/Notebooks/trn.json/lora_model/special_tokens_map.json',
 '/content/drive/MyDrive/Notebooks/trn.json/lora_model/tokenizer.model',
 '/content/drive/MyDrive/Notebooks/trn.json/lora_model/added_tokens.json',
 '/content/drive/MyDrive/Notebooks/trn.json/lora_model/tokenizer.json')

### Reloading the trained model

In [23]:
# Path to the saved model
model_path = "/content/drive/MyDrive/Notebooks/trn.json/lora_model"

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(model_path)

# Prepare the model for inference
FastLanguageModel.for_inference(model)

==((====))==  Unsloth 2024.9.post3: Fast Mistral patching. Transformers = 4.45.1.
   \\   /|    GPU: NVIDIA L4. Max memory: 22.168 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: Will load /content/drive/MyDrive/Notebooks/trn.json/lora_model as a legacy tokenizer.


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32768, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj

### Testing the fine-tuned model

In [24]:
title = "Smartphone Samsung Android 64GB"
description = generate_description(title)
print(f"Título: {title}\nDescrição: {description}")

Título: Smartphone Samsung Android 64GB
Descrição: Provide a detailed and engaging product description with a unique, creative style that highlights its key features and benefits, making it appealing to potential buyers.
Título: Smartphone Samsung Android 64GB
Descrição:

### Product Description

Samsung Galaxy S6 Edge 64GB

### From the Manufacturer

##### Samsung Galaxy S6 Edge 64GB

The Samsung Galaxy S6 Edge is a smartphone that is designed to be a work of art. The Galaxy S6 Edge is the first smartphone to feature an edge-to-edge display, which is curved on both sides. The curved display is not only beautiful, but it also makes the phone easier to hold and use. The Galaxy S6 Edge is also the first smartphone to feature a metal and glass design.


In [25]:
title = "Smartphone iphone 64GB"
description = generate_description(title)
print(f"Título: {title}\nDescrição: {description}")

Título: Smartphone iphone 64GB
Descrição: Provide a detailed and engaging product description with a unique, creative style that highlights its key features and benefits, making it appealing to potential buyers.
Título: Smartphone iphone 64GB
Descrição:

### Product Description

iPhone 6 Plus is an iPhone that's big by every measure, yet remarkably thin and light. With a 5.5-inch Retina HD display and advanced technologies in every part of the phone, iPhone 6 Plus lets you see and do more than ever before.

### Box Contains

iPhone 6 Plus, EarPods with Remote and Mic, Lightning to USB Cable, USB Power Adapter, Documentation

### From the Manufacturer

iPhone 6 Plus is an iPhone that's big by every measure,
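
The test prompts above use a `Título:` / `Descrição:` layout, while training used the `### Instrução:` / `### Título:` / `### Descrição:` template. Prompting with the same layout that was used in training is a common practice. The sketch below shows one way to build such a prompt by leaving the response slot empty. The helper name `build_inference_prompt` is hypothetical, and the template and instruction are copied from the earlier cells.

```python
# Template copied from the notebook's prompt cell.
ALPACA_PROMPT = """Below is an instruction outlining the task, accompanied by a title that provides additional context. Write a response that appropriately concludes the description.

### Instrução:
{}

### Título:
{}

### Descrição:
{}"""

INSTRUCTION = (
    "Provide a detailed and engaging product description with a unique, "
    "creative style that highlights its key features and benefits, "
    "making it appealing to potential buyers."
)

def build_inference_prompt(title: str) -> str:
    """Fill in the instruction and title; leave the description empty for the model."""
    return ALPACA_PROMPT.format(INSTRUCTION, title, "")

prompt = build_inference_prompt("Smartphone Samsung Android 64GB")
print(prompt)
```

The resulting string could be passed to the tokenizer in place of `input_text` inside `generate_description`; whether it improves the generated descriptions would need to be checked by rerunning the test cells.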
