<a href="https://colab.research.google.com/github/maitreya-v/Synapse_ResearchPaper/blob/maitreya/%F0%9F%92%AE_GenerAds_%F0%9F%92%AEFine_tuning_BLOOM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune a BLOOM-based ad generation model using `peft`, `transformers` and `bitsandbytes`

We can use the [Product Descriptions and Ads Dataset](https://huggingface.co/datasets/c-s-ale/Product-Descriptions-and-Ads) to fine-tune BLOOM to generate simple ads from product names and descriptions. Perfect for Twitter or Instagram!

### Overview of PEFT and LoRA:

Based on some awesome new research [here](https://github.com/huggingface/peft), we can leverage techniques like PEFT and LoRA to train/fine-tune large models a lot more efficiently. 

It can't be explained much better than the overview given in the above link: 

```
Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of
pre-trained language models (PLMs) to various downstream applications without 
fine-tuning all the model's parameters. Fine-tuning large-scale PLMs is often 
prohibitively costly. In this regard, PEFT methods only fine-tune a small 
number of (extra) model parameters, thereby greatly decreasing the 
computational and storage costs. Recent State-of-the-Art PEFT techniques 
achieve performance comparable to that of full fine-tuning.
```

### Install requirements

First, run the cells below to install the requirements:

In [1]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git


### Model loading

Here let's load the `bloom-1b7` model!

In [46]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7", 
    load_in_8bit=True, 
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer")



### Post-processing on the model

Finally, we need to apply some post-processing to the 8-bit model to enable training. Let's freeze all layers and cast the layer norms to `float32` for stability. We also cast the output of the last layer to `float32` for the same reason.

In [None]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)
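
To get a feel for why small parameters and outputs are kept in `float32`, here is a minimal sketch using only the Python standard library. It assumes that `struct`'s half-precision format code `'e'` is a fair stand-in for `float16` storage; the helper name `to_float16` is hypothetical and not part of this notebook.

```python
import struct

def to_float16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision (illustrative helper)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

value = 1.0
small_difference = 1e-4  # smaller than the gap between neighbouring float16 values near 1.0

# Near 1.0, float16 values are spaced roughly 0.001 apart, so the difference is rounded away
half_result = to_float16(value + small_difference)
full_result = value + small_difference

print("float16 result equals 1.0:", half_result == 1.0)
print("higher-precision result is above 1.0:", full_result > 1.0)
```

Differences this small are lost in half precision, which is one reason numerically sensitive pieces such as layer norms are usually computed at higher precision.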

### Apply LoRA

Here comes the magic with `peft`! Let's wrap the model in a `PeftModel` and specify that we are going to use low-rank adapters (LoRA) via the `get_peft_model` utility function from `peft`.

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model 

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)
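
As a rough sanity check on the number printed above, the adapter size can be estimated by hand. The sketch below assumes that `bloom-1b7` has a hidden size of 2048 and 24 transformer layers, and that each `query_key_value` projection maps the hidden size to three times the hidden size; these shapes are assumptions, not values printed in this notebook. The helper `lora_param_count` is hypothetical.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in_features), B is (out_features x r)."""
    return r * in_features + out_features * r

# Assumed shapes for bloom-1b7 (not taken from this notebook's output)
hidden_size = 2048
num_layers = 24
qkv_out = 3 * hidden_size  # fused query/key/value projection

# r=16 matches the LoraConfig above; bias="none" means no extra bias parameters
per_layer = lora_param_count(hidden_size, qkv_out, r=16)
total = per_layer * num_layers

print(f"LoRA params per layer: {per_layer}")
print(f"LoRA params in total:  {total}")
print(f"size if stored in float32: {total * 4} bytes")
```

Under these assumptions only a few million parameters are trained, a small fraction of the base model.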

### Preprocessing

We can simply load our dataset from 🤗 Hugging Face with the `load_dataset` method!

In [47]:
import transformers
from datasets import load_dataset

dataset_name = "c-s-ale/Product-Descriptions-and-Ads"
product_name = "product"
product_desc = "description"
product_ad = "ad"

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [48]:
dataset = load_dataset(dataset_name)




In [49]:
dataset['train'][:3]

{'product': [' Harem pants', ' Fringe skirt', ' Gingham dress'],
 'description': [' A style of pants with a dropped crotch, loose-fitting legs, and a gathered waistband for a unique, bohemian look.',
  ' A skirt featuring fringe detailing on the bottom, creating movement and fun.',
  ' A dress featuring a two-toned checkered pattern, often associated with picnics and summery outfits.'],
 'ad': ['Discover Harem Pants! Unique, stylish bohemian vibes with a dropped crotch & loose legs. Comfy meets chic - elevate your wardrobe. Limited stock - shop now!',
  'Introducing our fabulous Fringe Skirt! Step out in style with eye-catching fringe detailing that adds flair and movement. Perfect for any occasion, create unforgettable memories with this chic piece.',
  "Introducing the Gingham Dress: Timeless & Chic! 💕 Step into summer with this must-have, two-toned checkered dress. From picnics to parties, it's your go-to look. Shop now for unbeatable style!"]}

In [71]:
filename='/content/drive/MyDrive/Research Paper Dataset/maitreya.csv'

In [72]:
import pandas as pd

def dataframe_to_dict(df):
    # Collect each column of interest into a list, keyed by column name
    columns = ['repository', 'pulls', 'commits', 'readme', 'release_notes']
    return df[columns].to_dict(orient='list')

In [73]:
import pandas as pd
dataset = pd.read_csv(filename)

# convert each row to a dictionary and store the dictionaries in a Pandas Series
dataset = pd.Series(dataset.to_dict(orient='records'))

# print only the keys of the first element; printing the full record
# exceeds the notebook's output rate limit
print(list(dataset[0].keys()))




We want to put our data in the form:

```
Below is a product and description, please write an ad for this product.

### Product and Description:
PRODUCT NAME AND DESCRIPTION HERE

### Ad:
OUR AD HERE
```

This way, we can prompt our model well and receive the responses we want!

This is what fine-tuning and prompt engineering are really all about!
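
The template above can be filled in with a small helper. This is a minimal sketch: `PROMPT_HEADER` and `build_ad_prompt` are hypothetical names, and joining the product and description with a colon is an assumption based on the inference examples later in this notebook.

```python
PROMPT_HEADER = (
    "Below is a product and description, please write an ad for this product.\n\n"
    "### Product and Description:\n{product}: {description}\n\n"
    "### Ad:"
)

def build_ad_prompt(product: str, description: str, ad: str = "") -> str:
    """Fill the template. Leave `ad` empty at inference time so the model completes it."""
    prompt = PROMPT_HEADER.format(product=product.strip(), description=description.strip())
    if ad:
        prompt += "\n" + ad.strip()
    return prompt

# Training example: the ad is included so the model learns to produce it
print(build_ad_prompt(" Fringe skirt",
                      " A skirt featuring fringe detailing on the bottom, creating movement and fun.",
                      "Introducing our fabulous Fringe Skirt!"))

# Inference example: the prompt stops at the "### Ad:" marker
print(build_ad_prompt("Sundress", "A flowery yellow sundress with blue polka dots."))
```

Keeping training and inference prompts identical up to the `### Ad:` marker helps the model recognise where its own output should begin.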

In [74]:
from datasets import Dataset

# Build the training prompt from one row of the CSV
def generate_prompt(repository: str, pulls: str, readme: str, commits: str, release_notes: str) -> str:
  prompt = f"Below is the repository name, pull request messages, commit messages and release notes.\n\n### Product and Description:\n{repository}: {pulls}: {commits}: {release_notes}\n\n### Readme:\n{readme}"
  return prompt

def encode_prompt(prompt, max_length=512):
    # Truncate long prompts; padding is left to the data collator, which pads each batch
    encoded = tokenizer(prompt, truncation=True, max_length=max_length)
    return {'input_ids': encoded['input_ids'], 'attention_mask': encoded['attention_mask']}

# `dataset` is a pandas Series of row dicts, so Series.map encodes one row at a time
dataset = dataset.map(lambda samples: encode_prompt(generate_prompt(samples['repository'], samples['pulls'], samples['readme'], samples['commits'], samples['release_notes']), max_length=512))

# Trainer builds its own DataLoader and optimizer, so it takes a dataset and a batch size
# rather than a DataLoader. The base weights are already frozen above; freezing
# model.base_model again would also freeze the LoRA adapters, so we do not do that here.
train_dataset = Dataset.from_list(dataset.tolist())

# Train the model using the Trainer class
trainer = transformers.Trainer(
    model=model, 
    train_dataset=train_dataset,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=100, 
        max_steps=100, 
        learning_rate=1e-3, 
        fp16=True,
        logging_steps=1, 
        output_dir='outputs',
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
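
The training arguments above interact in ways that are easy to miss. The sketch below is a rough illustration, assuming a single GPU, a batch size of 4, and the linear warmup-then-decay schedule that `TrainingArguments` uses by default. The helper names `effective_batch_size` and `linear_schedule_lr` are hypothetical.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of sequences contributing to each optimizer step."""
    return per_device * grad_accum * num_devices

def linear_schedule_lr(step: int, base_lr: float, warmup_steps: int, max_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at max_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))

# Values taken from the training cell; num_devices=1 is an assumption
batch = effective_batch_size(per_device=4, grad_accum=4)
print(f"sequences per optimizer step: {batch}")
print(f"sequences seen in 100 steps:  {batch * 100}")

# With warmup_steps equal to max_steps, the rate is still ramping up when training stops
for step in (0, 25, 50, 99):
    lr = linear_schedule_lr(step, base_lr=1e-3, warmup_steps=100, max_steps=100)
    print(f"step {step:3d}: lr = {lr:.6f}")
```

If the full learning rate is meant to be used for part of the run, `warmup_steps` would need to be smaller than `max_steps`.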


## Share adapters on the 🤗 Hub

Make sure you have a Hugging Face account, and you have set up a read/write token!

More info here: https://huggingface.co/docs/hub/security-tokens

In [17]:
HUGGING_FACE_USER_NAME = "MaitreyaV"

In [18]:
from huggingface_hub import notebook_login
notebook_login()


In [20]:
model.push_to_hub(f"{HUGGING_FACE_USER_NAME}/GenerAd-AI", use_auth_token=True)

CommitInfo(commit_url='https://huggingface.co/MaitreyaV/GenerAd-AI/commit/15f32ce5ae3ee59d7fac791227a4208eb35accaf', commit_message='Upload model', commit_description='', oid='15f32ce5ae3ee59d7fac791227a4208eb35accaf', pr_url=None, pr_revision=None, pr_num=None)

## Load adapters from the Hub

You can also directly load adapters from the Hub using the commands below:

In [21]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = f"{HUGGING_FACE_USER_NAME}/GenerAd-AI"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id)


## Inference

You can then directly use the trained model or the model that you have loaded from the 🤗 Hub for inference!

### Take it for a spin!

In [34]:
from IPython.display import display, Markdown

def make_inference(repository, pulls):
  # Prompt with the repository name and pull request messages;
  # the model continues the text after the "### Readme:" marker
  batch = tokenizer(f"Below is the repository name and pull request messages.\n\n### Repository name and pull request messages:\n{repository}: {pulls}\n\n### Readme:", return_tensors='pt')

  # Generate under mixed precision; raise max_new_tokens for longer outputs
  with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=20)

  # The decoded text contains the prompt followed by the generated continuation
  display(Markdown(tokenizer.decode(output_tokens[0], skip_special_tokens=True)))
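
The decoded output includes the prompt as well as the generated text. A small helper can keep only the part after the marker. This is a minimal sketch, assuming the marker `### Readme:` appears in the prompt exactly once and not inside the pull request text; `extract_completion` is a hypothetical name.

```python
def extract_completion(decoded: str, marker: str = "### Readme:") -> str:
    """Return the text after the first occurrence of `marker`, or the whole string if it is absent."""
    _, found, tail = decoded.partition(marker)
    return tail.strip() if found else decoded.strip()

# Illustrative string only, not real model output
example = "Below is the repository name and pull request messages.\n\n### Readme:\n A small demo project."
print(extract_completion(example))
```

The same helper works for the ad prompts by passing `marker="### Ad:"`.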

In [35]:
open_pulls=pd.read_csv('/content/drive/MyDrive/Research Paper Dataset/open_pulls.csv')

In [36]:
# Pick one repository and its open pull request messages from the CSV
your_repository_name_here = open_pulls['repository'][3]
your_repo_pulls_here = open_pulls['open_pulls'][3]

make_inference(your_repository_name_here, your_repo_pulls_here)



### Example in Training Set

In [None]:
batch = tokenizer("### Product and Description:\n Lace-up sandals: Shoes featuring laces or ties that wrap around the foot and, in some cases, the ankle.\n\n### Ad:", return_tensors='pt')

with torch.cuda.amp.autocast():
  output_tokens = model.generate(**batch, max_new_tokens=50)

print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))

### Example outside of Training Set

In [None]:
batch = tokenizer("### Product and Description:\nSundress: A flowery yellow sundress with blue polka dots. \n\n### Ad:", return_tensors='pt')

with torch.cuda.amp.autocast():
  output_tokens = model.generate(**batch, max_new_tokens=50)

print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))

### Example outside of immediate domain

In [None]:
batch = tokenizer("### Product and Description:\n A new Lexus: A luxury automobile with grey paint and tinted windows.\n\n### Ad:", return_tensors='pt')

with torch.cuda.amp.autocast():
  output_tokens = model.generate(**batch, max_new_tokens=50)

print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))

As you can see, after fine-tuning for only a few steps, the model can almost reproduce the exact text from the training data.