# Instruction Backtranslation

[YouTube Explanation](https://www.youtube.com/watch?v=_284G3zvmr4)
 
**Objective:** Enhance the capability of a language model in following instructions accurately.

**Methodology:**

1. **Initial Setup:** Begin with a language model finetuned on a small dataset (seed data).
2. **Self-Augmentation:** Use this seed model to generate an instruction prompt for each document in a large web corpus, treating the document as the response. This yields new candidate training pairs.
3. **Self-Curation:** Have the model rate the candidate pairs and keep only the highest-quality examples for further use.
4. **Model Finetuning:** Use these curated high-quality examples to further train the language model, enhancing its ability to follow complex instructions.
5. **Iterative Improvement:** Repeat the augmentation and curation steps to progressively improve the model's performance.
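
The augmentation and curation steps above can be sketched as a single function. This is a minimal illustration, not the paper's implementation: `backtranslation_round`, `toy_generate`, and `toy_score` are hypothetical names, the toy functions stand in for real model calls, and the 1-5 score scale and threshold are assumptions made for the example. Finetuning on the curated pairs (step 4) is left out.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (instruction, response)


def backtranslation_round(
    documents: List[str],
    generate_instruction: Callable[[str], str],
    score_pair: Callable[[str, str], float],
    threshold: float,
) -> List[Pair]:
    """Run one augmentation + curation pass over unlabeled documents."""
    # Self-augmentation: treat each document as a response and predict
    # the instruction that could have produced it.
    candidates = [(generate_instruction(doc), doc) for doc in documents]
    # Self-curation: keep only pairs scored at or above the threshold.
    return [(ins, doc) for ins, doc in candidates if score_pair(ins, doc) >= threshold]


# Toy stand-ins so the sketch runs without a model (purely hypothetical).
def toy_generate(doc: str) -> str:
    return "Explain: " + doc.split(".")[0]


def toy_score(instruction: str, response: str) -> float:
    # Pretend longer responses are higher quality, capped at 5.
    return min(5.0, len(response.split()) / 2)


docs = [
    "Gradient descent updates weights along the negative gradient. It repeats until convergence.",
    "Click here.",
]
curated = backtranslation_round(docs, toy_generate, toy_score, threshold=4.0)
print(curated)
```

In a real run, the curated pairs would be added to the seed data, the model finetuned again, and the round repeated with the improved model as generator and scorer.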

**Outcome:** This iterative training approach aims to produce a language model that follows instructions well and, according to the paper, outperforms other models of its family on benchmark tests without relying on distillation data from stronger models.

This implementation follows the paper [Self-Alignment with Instruction Backtranslation](https://arxiv.org/pdf/2308.06259).

# Data Preparation

This notebook covers loading, cleaning, and preprocessing the data so that the web corpus and seed data are ready for training and further processing. Key steps include handling missing values, normalizing text, and ensuring data consistency.

## Steps
1. Load the web corpus and seed data.
2. Clean the data by removing noise and irrelevant information.
3. Preprocess the text to normalize formats and tokenize.
4. Save the cleaned and preprocessed data for use in subsequent notebooks.
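
As a rough sketch of what steps 2 and 3 might look like, the snippet below normalizes Unicode, strips tag-like markup, collapses whitespace, and drops missing or duplicate entries. The helper names `normalize_text` and `clean_corpus` are hypothetical, and the tag-stripping regex is a crude heuristic that can also remove legitimate text between `<` and `>`. Tokenization is left to the model's own tokenizer and is not shown.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Normalize one string: Unicode form, tag-like markup, whitespace."""
    # Unify Unicode representations (for example, ligatures and full-width forms).
    text = unicodedata.normalize("NFKC", text)
    # Replace anything that looks like an HTML tag with a space (crude heuristic).
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def clean_corpus(texts):
    """Normalize entries, skip missing or empty ones, and drop exact duplicates."""
    seen, cleaned = set(), []
    for raw in texts:
        if not isinstance(raw, str):
            continue  # missing value (None, NaN, ...)
        text = normalize_text(raw)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


print(clean_corpus(["<p>Hello   world</p>", None, "Hello world", "  "]))
```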


## Downloading the Dataset

[timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)

In [2]:
import pandas as pd

splits = {'train': 'openassistant_best_replies_train.jsonl', 'test': 'openassistant_best_replies_eval.jsonl'}
df = pd.read_json("hf://datasets/timdettmers/openassistant-guanaco/" + splits["train"], lines=True)

# Named `conversations` rather than `list` to avoid shadowing the built-in type
conversations = df['text'].tolist()

### Preparing the Dataset for model finetuning

In [3]:
def count_conversation_tokens(text):
    """Return how many times the human and assistant role markers occur in `text`."""
    human_count = text.count("### Human:")
    assistant_count = text.count("### Assistant:")
    return human_count, assistant_count

In [4]:
# Creating new lists to store data of single human - assistant conversations
human_prompt_list = []
assistant_response_list = []

# Counting the occurrences of the role markers in each conversation
for conversation in conversations:
    human_count, assistant_count = count_conversation_tokens(conversation)
    
    # Removing multi turn conversations
    # Proceed only if there is one human and one assistant token in this string
    if human_count == 1 and assistant_count == 1:
        
        # Split the string by the token `### Assistant:`
        parts = conversation.split("### Assistant:")
        
        # we can directly add the assistant part to the list
        assistant_response_list.append(parts[1])
        
        # Split the human prompt at the token `### Human:`
        parts = parts[0].split("### Human:")
        
        # Append the second part of the split to `human_prompt_list`
        human_prompt_list.append(parts[1])
        

preparedData = pd.DataFrame(human_prompt_list, columns=['prompt'])
preparedData['response'] = assistant_response_list

preparedData.head()

| | prompt | response |
|---|---|---|
| 0 | Método del Perceptrón biclásico: definición y... | El método del Perceptrón biclásico es un algo... |
| 1 | Listened to Dvorak's "The New World" symphony... | If you enjoyed Dvorak's "New World" Symphony,... |
| 2 | I am using docker compose and i need to mount... | You can mount the Docker socket in a Docker C... |
| 3 | eu quero que você atue como um terminal linux... | $ pwd\n/home/user |
| 4 | Write a 4chan style greentext about someone w... | >be me\n>sister wants to watch the new hit ro... |


In [5]:
len(preparedData)

4746
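
Step 4 of the plan calls for saving the prepared data for later notebooks. One possible way to do this is sketched below, assuming JSON Lines as the storage format. The file name `seed_data.jsonl` and the helpers `save_pairs` and `load_pairs` are hypothetical, and a small sample frame stands in for `preparedData` so the snippet runs on its own.

```python
import os
import tempfile

import pandas as pd


def save_pairs(df: pd.DataFrame, path: str) -> None:
    """Write one JSON object per row; keep non-ASCII text readable."""
    df.to_json(path, orient="records", lines=True, force_ascii=False)


def load_pairs(path: str) -> pd.DataFrame:
    """Read the JSON Lines file back without inferring dtypes."""
    # dtype=False keeps text that merely looks numeric as text.
    return pd.read_json(path, lines=True, dtype=False)


# Stand-in for `preparedData` (same column names as in the notebook).
sample = pd.DataFrame(
    {"prompt": ["What is 2 + 2?"], "response": ["4, since 2 + 2 = 4."]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "seed_data.jsonl")  # hypothetical file name
    save_pairs(sample, path)
    restored = load_pairs(path)

print(restored)
```

In the notebook itself, the call would be `save_pairs(preparedData, ...)` with a path of your choosing.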