# Instruction Backtranslation

[YouTube Explanation](https://www.youtube.com/watch?v=_284G3zvmr4)
 
**Objective:** Enhance the capability of a language model in following instructions accurately.

**Methodology:**

1. **Initial Setup:** Begin with a language model finetuned on a small dataset (seed data).
2. **Self-Augmentation:** Use the seed model to generate an instruction for each document in a large web corpus, treating the document as the response. This creates new candidate training pairs.
3. **Self-Curation:** Select the highest-quality pairs from these candidates for further use.
4. **Model Finetuning:** Use these curated high-quality examples to further train the language model, enhancing its ability to follow complex instructions.
5. **Iterative Improvement:** Repeat the augmentation and curation steps to progressively improve the model's performance.

**Outcome:** This iterative training approach aims to develop a language model that excels at understanding and executing instructions. The paper reports that the resulting model outperforms other LLaMA-based models on the Alpaca leaderboard without relying on distillation data.
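The steps above can be sketched as a single round of augmentation and curation. This is only an illustration of the data flow, not the paper's implementation: `backtranslation_round`, `finetune`, `generate_instruction`, `score_pair`, and `min_score` are hypothetical names, and using one model object for both generation and scoring is a simplification.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (instruction, response)

def backtranslation_round(
    seed_pairs: List[Pair],
    web_documents: List[str],
    finetune: Callable[[List[Pair]], object],
    generate_instruction: Callable[[object, str], str],
    score_pair: Callable[[object, Pair], float],
    min_score: float,
) -> List[Pair]:
    """One augmentation + curation round; every callable is a hypothetical stand-in."""
    # Step 1: train a model on the seed data
    model = finetune(seed_pairs)
    # Step 2 (self-augmentation): each web document becomes a response,
    # and the model proposes an instruction for it
    candidates = [(generate_instruction(model, doc), doc) for doc in web_documents]
    # Step 3 (self-curation): keep only candidates that score well enough
    curated = [pair for pair in candidates if score_pair(model, pair) >= min_score]
    # Step 4 would finetune on this combined data; step 5 repeats the round
    return seed_pairs + curated

# Toy stand-ins so the sketch runs without any model
toy = backtranslation_round(
    seed_pairs=[("Say hi", "Hi!")],
    web_documents=["Paris is the capital of France.", "??"],
    finetune=lambda pairs: None,
    generate_instruction=lambda model, doc: f"Write a text like: {doc[:20]}",
    score_pair=lambda model, pair: float(len(pair[1])),
    min_score=5.0,
)
print(len(toy))  # 2: the seed pair plus the one document that passed curation
```

In a real run, the returned pairs would be passed back in as the training data for the next round.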

This notebook implements the paper [Self-Alignment with Instruction Backtranslation](https://arxiv.org/pdf/2308.06259).

## Downloading the dataset

[timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)

In [1]:
import pandas as pd

splits = {'train': 'openassistant_best_replies_train.jsonl', 'test': 'openassistant_best_replies_eval.jsonl'}
df = pd.read_json("hf://datasets/timdettmers/openassistant-guanaco/" + splits["train"], lines=True)

In [2]:
# All the data is in a single column; we need to split it into two columns: user prompt and AI response
df.head()

| | text |
|---|---|
| 0 | ### Human: Can you write a short introduction ... |
| 1 | ### Human: ¿CUales son las etapas del desarrol... |
| 2 | ### Human: Can you explain contrastive learnin... |
| 3 | ### Human: I want to start doing astrophotogra... |
| 4 | ### Human: Método del Perceptrón biclásico: de... |


In [3]:
df.loc[0]['text']

'### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, lead

Here we notice that the text above has three parts: it starts with a human prompt, followed by the assistant's response, and then a second question from the human.
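To make the turn structure explicit, a conversation string can be split on both markers at once. The sketch below is not part of the original notebook; `TURN_PATTERN`, `split_turns`, and the sample string are illustrative assumptions.

```python
import re

# Matches either marker and captures the role name
TURN_PATTERN = re.compile(r"### (Human|Assistant):")

def split_turns(text):
    """Return [(role, message), ...] in the order the turns appear."""
    # With one capturing group, re.split yields
    # [prefix, role1, message1, role2, message2, ...]
    pieces = TURN_PATTERN.split(text)
    roles, messages = pieces[1::2], pieces[2::2]
    return [(role, message.strip()) for role, message in zip(roles, messages)]

sample = "### Human: Hi### Assistant: Hello!### Human: Thanks"
turns = split_turns(sample)
print(turns)
# [('Human', 'Hi'), ('Assistant', 'Hello!'), ('Human', 'Thanks')]
```

A conversation with exactly one `Human` turn followed by one `Assistant` turn would produce a list of length two.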

In [4]:
# Creating a new list that has the content split at the `### Assistant:` token
content = df['text'].str.split('### Assistant:', expand=True)

In [5]:
content

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | ### Human: Can you write a short introduction ... | "Monopsony" refers to a market structure wher... | | |
| 1 | ### Human: ¿CUales son las etapas del desarrol... | Jean Piaget fue un psicólogo suizo que propus... | | |
| 2 | ### Human: Can you explain contrastive learnin... | Sure! Let's say you want to build a model whi... | It has to do with a process called differenti... | |
| 3 | ### Human: I want to start doing astrophotogra... | Getting started in astrophotography can seem ... | Astrophotography can be a fun and rewarding h... | |
| 4 | ### Human: Método del Perceptrón biclásico: de... | El método del Perceptrón biclásico es un algo... | | |
| ... | ... | ... | ... | ... |
| 9841 | ### Human: What are the risks when you do not ... | Not executing an ankle-brachial index (ABI) e... | | |
| 9842 | ### Human: 为什么不能关灯玩手机？ | 出於各種原因，不建議在黑暗中玩手機。 首先，開車時使用手機會分散注意力，讓人難以專注於手頭... | | |
| 9843 | ### Human: ¿De qué está hecho el plástico? | El plástico está hecho de una variedad de mat... | La dureza de los plásticos proviene de la com... | |
| 9844 | ### Human: Nortzuk ziren Harry Potteren gurasoak? | Harry Potterren gurasoak James eta Lily Potte... | | |


Here we notice that some rows split into more than two parts (columns 2 and 3 are filled), which means the conversation contains more than one assistant response. Let's look at the raw text of one such row.

In [6]:
df.loc[2]['text']

'### Human: Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?### Assistant: Sure! Let\'s say you want to build a model which can distinguish between images of cats and dogs. You gather your dataset, consisting of many cat and dog pictures. Then you put them through a neural net of your choice, which produces some representation for each image, a sequence of numbers like [0.123, 0.045, 0.334, ...]. The problem is, if your model is unfamiliar with cat and dog images, these representations will be quite random. At one time a cat and a dog picture could have very similar representations (their numbers would be close to each other), while at others two cat images may be represented far apart. In simple terms, the model wouldn\'t be able to tell cats and dogs apart. This is where contrastive learning comes in.\n\nThe point of contrastive learning is to take pairs of samples (in this case images of cats and dogs), then train the model

**This is a multi-turn conversation. We need to drop such rows so that every remaining row contains exactly one user prompt and one AI response.**

## Preparing the dataset

In [7]:
# Named `conversations` rather than `list` to avoid shadowing the Python built-in
conversations = df['text'].tolist()

In [8]:
conversations[0]

'### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, lead

In [9]:
def count_conversation_tokens(text):
    human_count = text.count("### Human:")
    assistant_count = text.count("### Assistant:")
    return human_count, assistant_count

In [10]:
# Creating new lists to store data of single human - assistant conversations
human_prompt_list = []
assistant_response_list = []

# Counting the occurrences of the tokens in each conversation
for conversation in conversations:
    human_count, assistant_count = count_conversation_tokens(conversation)
    
    # Proceed only if there is one human and one assistant token in this string
    if human_count == 1 and assistant_count == 1:
        
        # Split the string by the token `### Assistant:`
        parts = conversation.split("### Assistant:")
        
        # we can directly add the assistant part to the list
        assistant_response_list.append(parts[1])
        
        # Split the human prompt at the token `### Human:`
        parts = parts[0].split("### Human:")
        
        # Append the second part of the split to `human_prompt_list`
        human_prompt_list.append(parts[1])
        
# Checking if both the lists have the same size
print(len(human_prompt_list), len(assistant_response_list))

4746 4746


In [11]:
# Creating a new dataframe for the tasks ahead

preparedData = pd.DataFrame(human_prompt_list, columns=['prompt'])
preparedData['response'] = assistant_response_list

preparedData.head()

| | prompt | response |
|---|---|---|
| 0 | Método del Perceptrón biclásico: definición y... | El método del Perceptrón biclásico es un algo... |
| 1 | Listened to Dvorak's "The New World" symphony... | If you enjoyed Dvorak's "New World" Symphony,... |
| 2 | I am using docker compose and i need to mount... | You can mount the Docker socket in a Docker C... |
| 3 | eu quero que você atue como um terminal linux... | $ pwd\n/home/user |
| 4 | Write a 4chan style greentext about someone w... | >be me\n>sister wants to watch the new hit ro... |


## Setting up the model for training

- Model: [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- Reference Notebook: [Fine-tune Llama 2 in Google Colab](https://colab.research.google.com/drive/1PEQyJO1-f6j0S_XJ8DV50NkpzasXkrzd#scrollTo=OSHlAbqzDFDq)

In [12]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.33.1 trl==0.4.7

In [13]:
import os
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer



In [14]:
# The model that you want to train from the Hugging Face hub
model_name = "meta-llama/Llama-2-7b-hf"

# The instruction dataset to use
# dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "llama-2-7b-instructBacktranslation"

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 3

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 3

# Batch size per GPU for evaluation
per_device_eval_batch_size = 3

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}
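As a rough check of what these settings imply, the number of optimizer steps and warmup steps can be estimated from the dataset size. The sketch below assumes a single GPU (`num_gpus = 1` is an assumption) and uses the 4746 single-turn examples counted earlier; the trainer's own bookkeeping may differ slightly.

```python
import math

num_examples = 4746                 # single-turn conversations kept after filtering
per_device_train_batch_size = 3
gradient_accumulation_steps = 1
num_train_epochs = 3
warmup_ratio = 0.03
num_gpus = 1                        # assumption: one GPU, matching device_map = {"": 0}

# Examples consumed per optimizer update
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Optimizer updates needed to see every example once
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
# Steps spent ramping the learning rate up from 0
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
# 3 1582 4746 143
```

With `logging_steps = 25`, a full run of this length would log the training loss roughly 190 times.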

In [15]:
# Placeholder only: never commit a real access token to a notebook
os.environ['HUGGINGFACE_TOKEN'] = '<YOUR_HF_TOKEN>'

In [16]:
# Convert the pandas DataFrame to a Hugging Face dataset
preparedData = Dataset.from_pandas(preparedData)

In [17]:
preparedData.column_names

['prompt', 'response']
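For backtranslation, the model has to learn to produce the instruction when given the response, so each training example needs both columns in that order. A minimal sketch of one possible format follows. The `[INST] ... [/INST]` layout mirrors the template used in the generation cells of this notebook, but applying it to the training text is an assumption, and `to_backward_example` and the sample row are hypothetical.

```python
def to_backward_example(prompt: str, response: str) -> str:
    """Build one training string: the response is the input, the instruction is the target."""
    return f"[INST] {response.strip()} [/INST] {prompt.strip()}"

# Hypothetical row with the same two columns as the prepared dataset
rows = [{"prompt": " What is 2 + 2?", "response": " 2 + 2 equals 4."}]
texts = [to_backward_example(row["prompt"], row["response"]) for row in rows]
print(texts[0])
# [INST] 2 + 2 equals 4. [/INST] What is 2 + 2?
```

Strings like these could be stored in an extra column of the dataset (for example with `Dataset.map`) and used as the training text field.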

In [18]:
# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

# Quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load the base model from the Hugging Face Hub with the 4-bit quantization
# config, placing it on the device given by `device_map`
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
    use_auth_token=os.getenv('HUGGINGFACE_TOKEN')
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load the LLaMA tokenizer from the Hugging Face Hub
# The tokenizer splits the input text into tokens before it is passed to the model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_auth_token=os.getenv('HUGGINGFACE_TOKEN'))
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=preparedData,
    peft_config=peft_config,
    dataset_text_field="response",  # only the `response` column is used as training text; `prompt` is not seen
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)




You are using 8-bit optimizers with a version of `bitsandbytes` < 0.41.1. It is recommended to update your version as a major bug has been fixed in 8-bit optimizers.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|---|---|
| 25 | 1.3019 |
| 50 | 1.9484 |
| 75 | 1.2237 |
| 100 | 1.7491 |
| 125 | 1.1706 |
| 150 | 1.5169 |
| 175 | 1.0812 |
| 200 | 1.673 |
| 225 | 1.0319 |
| 250 | 1.6157 |


In [23]:
# Run the text-generation pipeline with the fine-tuned model
prompt = "I want to meet with a potential sponsor for a new charity I'm running to help underprivileged children succeed at school. Write an email introducing myself and my work."
# Note: max_length counts the prompt tokens as well as the generated ones
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)

def generate_instruction(ai_response: str):
    # Wrap the response in the [INST] ... [/INST] template and print the model's output
    result = pipe(f"<s>[INST] {ai_response} [/INST]")
    print(result[0]['generated_text'])

In [20]:
prompt = "Dear [Recipient's Name],\n\nI hope this email finds you well. My name is [Your Name] and I am writing to introduce myself and discuss an opportunity for collaboration between our organizations.\n\nI am the founder and CEO of [Your Company/Organization], a nonprofit organization dedicated to providing underprivileged children with the resources they need to succeed at school and beyond. We believe that all children deserve access to high-quality education regardless of their socioeconomic background, race, or ethnicity.\n\nFor the past five years, we have worked closely with local schools and community centers to provide underserved students with free tutoring services, college preparation workshops, and career development seminars. Our programs have helped hundreds of young people improve their academic performance, gain admission to top universities, and secure meaningful employment after graduation.\n\nHowever, we cannot achieve our mission without the support of generous donors like yourself. That is why I am writing to ask for your consideration of a financial contribution to our cause. Your gift will enable us to expand our programming and reach even more deserving youth in our community.\n\nThank you for your time and consideration. If you have any questions or would like to learn more about our work, please feel free to contact me at [Your Phone Number] or [Your Email Address].\n\nBest regards,\n[Your Name]"
generate_instruction(prompt)

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


<s>[INST] Dear [Recipient's Name],

I hope this email finds you well. My name is [Your Name] and I am writing to introduce myself and discuss an opportunity for collaboration between our organizations.

I am the founder and CEO of [Your Company/Organization], a nonprofit organization dedicated to providing underprivileged children with the resources they need to succeed at school and beyond. We believe that all children deserve access to high-quality education regardless of their socioeconomic background, race, or ethnicity.

For the past five years, we have worked closely with local schools and community centers to provide underserved students with free tutoring services, college preparation workshops, and career development seminars. Our programs have helped hundreds of young people improve their academic performance, gain admission to top universities, and secure meaningful employment after graduation.

However, we cannot achieve our mission without the support of generous donors li
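The printed text starts with the input itself, so the generated instruction (if any) only begins after the `[/INST]` marker. A small helper can separate the two. This is a sketch, not part of the original notebook: `strip_prompt` is a hypothetical name and the continuation in the example is made up for illustration.

```python
def strip_prompt(generated_text: str, model_input: str) -> str:
    """Return only the text the model added after the input it was given."""
    if generated_text.startswith(model_input):
        return generated_text[len(model_input):].strip()
    # Fall back to the full text if the input is not echoed verbatim
    return generated_text.strip()

model_input = "<s>[INST] Hi there, welcome to the team! [/INST]"
# Made-up continuation, standing in for real model output
generated_text = model_input + " Write a short welcome message for a new colleague."
print(strip_prompt(generated_text, model_input))
# Write a short welcome message for a new colleague.
```

The text-generation pipeline also accepts `return_full_text=False`, which should have a similar effect without post-processing.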

In [24]:
prompt="""
## Use a phone or tablet\n\n1. Open your phone or tablet\'s app store. This is the App Store on iOS and Play Store on Android.
Look for the app on your home screen or app menu and tap to open it.\n2. Search for Pure Flix. Tap the search bar and type \"pure flix\"
.\n3. Tap the Pure Flix App to open it. Check that the publisher is Pure Flix Digital. If you cannot find Pure Flix, you may be on an older 
software version that is unsupported. Check for a software update in your settings. Pure Flix requires iOS 10.0 or later for iPad, iPhone, 
iPod Touch, and Apple TV. Pure Flix requires Android 4.2 or higher on Android phones and tablets.\n4. Tap +Get or Install to download the app. 
You may need to confirm your credentials. The app will be added to your home screen or app menu when finished downloading.\n5. Open the app. 
Look for a blue icon that says "Pure Flix" on your home screen or app menu. Tap the app to open.\n6. Sign in with your Pure Flix username and password. 
If you do not have an account, sign up for a plan on the Pure Flix website: https://signup.pureflix.com/signup/plans\n\n\n## Use a computer\n\n1. 
Open a browser on your computer. This can be any standard modern browser, such as Chrome, Safari, Edge, Firefox, and IE11.\n2. Navigate to https://www.pureflix.com/. 
Type or copy the address in the address bar, or click the link.\n3. Click Sign In. This is in the top right corner of the screen.\n4. Enter your username and password 
and click Sign In. If you do not have an account, create one by clicking Create an Account.\n5. Use the navigation bar at the top to search or browse for movies and shows.
Click on a show or movie to stream it. Cast your browser tab to a TV if you prefer to use a bigger screen by using Chromecast.\n\n\n## Use a smart tv\n\n1. 
Launch your Smart TV. This may be Apple TV, Amazon Fire, Roku, Samsung Smart TV, or another Smart TV.\n2. Open the app store on your Smart TV. 
This may be called something different for every Smart TV, but look for an app store on your main home screen.\n3. Search for Pure Flix. 
Use the app store\'s search bar and type in \"pure flix\".\n4. Download the Pure Flix app. Look for a button that says Download, 
Install, or Get.\n5. Launch the app and login. Find the app on your home screen with the other apps for your Smart TV.\n\n\n## Use an xbox one\n\n1. 
Launch the XBOX dashboard. Start your XBOX one and sign in if you are not already. The dashboard is the home screen that appears when you are signed in.\n2. 
Open the XBOX Store. Use the left analog stick or directional pad to select the Store tab on the right.\n3. Search for Pure Flix. 
Navigate down to the search bar and enter in \"pure flix\".\n4. Select the Pure Flix app and select Get. This will install the Pure Flix app on your XBOX One.\n5. 
Login to your account once installed. You can find this app on your dashboard in the future.\n"""

result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] 
## Use a phone or tablet

1. Open your phone or tablet's app store. This is the App Store on iOS and Play Store on Android.
Look for the app on your home screen or app menu and tap to open it.
2. Search for Pure Flix. Tap the search bar and type "pure flix"
.
3. Tap the Pure Flix App to open it. Check that the publisher is Pure Flix Digital. If you cannot find Pure Flix, you may be on an older 
software version that is unsupported. Check for a software update in your settings. Pure Flix requires iOS 10.0 or later for iPad, iPhone, 
iPod Touch, and Apple TV. Pure Flix requires Android 4.2 or higher on Android phones and tablets.
4. Tap +Get or Install to download the app. 
You may need to confirm your credentials. The app will be added to your home screen or app menu when finished downloading.
5. Open the app. 
Look for a blue icon that says "Pure Flix" on your home screen or app menu. Tap the app to open.
6. Sign in with your Pure Flix username and password. 
If you do not h