# Instruction Backtranslation Method
 
**Objective:** Enhance the capability of a language model in following instructions accurately.

**Methodology:**

1. **Initial Setup:** Begin with a language model finetuned on a small dataset (seed data).
2. **Self-Augmentation:** Use the seed model to generate an instruction for each document in a large web corpus, treating the document as the response. Each (instruction, document) pair becomes a candidate training example.
3. **Self-Curation:** Have the model score the candidate pairs and keep only the highest-quality ones for further use.
4. **Model Finetuning:** Use these curated high-quality examples to further train the language model, enhancing its ability to follow complex instructions.
5. **Iterative Improvement:** Repeat the augmentation and curation steps to progressively improve the model's performance.

**Outcome:** This iterative training approach aims to produce a language model that follows instructions well and outperforms comparable models on benchmark tests without relying on data distilled from stronger models.
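
The steps above can be sketched as a single augmentation and curation round. This is only a minimal illustration, not the paper's implementation: `backtranslation_round`, `toy_generate_instruction`, `toy_score_pair`, and the `min_score` threshold are hypothetical names and values, and the two toy functions stand in for calls to a finetuned model.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (instruction, response)


def backtranslation_round(
    seed_pairs: List[Pair],
    web_documents: List[str],
    generate_instruction: Callable[[str], str],
    score_pair: Callable[[str, str], int],
    min_score: int = 5,  # assumed threshold, chosen for illustration
) -> List[Pair]:
    """Run one augmentation + curation round and return the next finetuning set."""
    # Self-augmentation: treat each web document as a response and
    # predict an instruction that could have produced it.
    candidates = [(generate_instruction(doc), doc) for doc in web_documents]

    # Self-curation: keep only the pairs that receive a high score.
    curated = [pair for pair in candidates if score_pair(*pair) >= min_score]

    # Seed data plus curated pairs form the data for the next finetune.
    return seed_pairs + curated


# Hypothetical stand-ins for model calls, used only to make the sketch runnable.
def toy_generate_instruction(document: str) -> str:
    return "Write a passage that begins with: " + document[:20]


def toy_score_pair(instruction: str, response: str) -> int:
    return 5 if len(response.split()) >= 5 else 2


seed = [("What is a monopsony?", "A market with a single buyer.")]
docs = [
    "Contrastive learning trains a model by comparing pairs of samples.",
    "Click here",
]
training_data = backtranslation_round(seed, docs, toy_generate_instruction, toy_score_pair)
print(len(training_data))
```

In a real run, the two callables would wrap the seed model, and the returned list would feed the finetuning step before the next round.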

This notebook implements the paper [Self-Alignment with Instruction Backtranslation](https://arxiv.org/pdf/2308.06259).

## Downloading the Dataset

[timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)

In [1]:
import pandas as pd

splits = {'train': 'openassistant_best_replies_train.jsonl', 'test': 'openassistant_best_replies_eval.jsonl'}
df = pd.read_json("hf://datasets/timdettmers/openassistant-guanaco/" + splits["train"], lines=True)

In [2]:
# All of the data sits in a single `text` column; it needs to be split into two columns: user prompt and AI response
df.head()

| | text |
|---|---|
| 0 | ### Human: Can you write a short introduction ... |
| 1 | ### Human: ¿CUales son las etapas del desarrol... |
| 2 | ### Human: Can you explain contrastive learnin... |
| 3 | ### Human: I want to start doing astrophotogra... |
| 4 | ### Human: Método del Perceptrón biclásico: de... |


In [15]:
df.loc[0]['text']

'### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, lead

The full text (truncated in the output above) has three parts: a human prompt, the assistant's response, and then a second question from the human.
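
One way to confirm the order of turns in a row is to list the speaker markers as they appear. The sketch below uses a shortened, hypothetical string with the same shape as the row above rather than the real dataset text.

```python
import re

# Shortened, hypothetical stand-in for a row shaped like the one above.
sample = (
    "### Human: Can you define monopsony?"
    "### Assistant: A market with only one buyer."
    "### Human: Now explain it more simply."
)

# List the speaker markers in the order they appear in the text.
roles = re.findall(r"### (Human|Assistant):", sample)
print(roles)
```

A row is a clean prompt and response pair only when this list is exactly `['Human', 'Assistant']`.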

In [5]:
# Split each conversation at the `### Assistant:` token; with expand=True the pieces become columns of a new DataFrame
content = df['text'].str.split('### Assistant:', expand=True)

In [12]:
content

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | ### Human: Can you write a short introduction ... | "Monopsony" refers to a market structure wher... | | |
| 1 | ### Human: ¿CUales son las etapas del desarrol... | Jean Piaget fue un psicólogo suizo que propus... | | |
| 2 | ### Human: Can you explain contrastive learnin... | Sure! Let's say you want to build a model whi... | It has to do with a process called differenti... | |
| 3 | ### Human: I want to start doing astrophotogra... | Getting started in astrophotography can seem ... | Astrophotography can be a fun and rewarding h... | |
| 4 | ### Human: Método del Perceptrón biclásico: de... | El método del Perceptrón biclásico es un algo... | | |
| ... | ... | ... | ... | ... |
| 9841 | ### Human: What are the risks when you do not ... | Not executing an ankle-brachial index (ABI) e... | | |
| 9842 | ### Human: 为什么不能关灯玩手机？ | 出於各種原因，不建議在黑暗中玩手機。 首先，開車時使用手機會分散注意力，讓人難以專注於手頭... | | |
| 9843 | ### Human: ¿De qué está hecho el plástico? | El plástico está hecho de una variedad de mat... | La dureza de los plásticos proviene de la com... | |
| 9844 | ### Human: Nortzuk ziren Harry Potteren gurasoak? | Harry Potterren gurasoak James eta Lily Potte... | | |


Some rows also have values in column 2 (and a few in column 3), which means the text contains more than one assistant response. Let's look at the raw text of one such row.
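
The extra columns come from how `str.split(..., expand=True)` pads shorter rows. The toy example below, with two hypothetical rows, shows the padding and how to count the assistant turns per row from it.

```python
import pandas as pd

# Toy rows (hypothetical): one single-turn and one multi-turn conversation.
toy = pd.Series([
    "### Human: Hi### Assistant: Hello!",
    "### Human: Hi### Assistant: Hello!### Human: Why?### Assistant: Because.",
])

# Every row gets as many columns as the longest split; shorter rows are padded with None.
pieces = toy.str.split("### Assistant:", expand=True)

# Column 0 holds the first human prompt, so each further non-empty
# column corresponds to one assistant turn.
assistant_turns = pieces.notna().sum(axis=1) - 1
print(pieces.shape)
print(assistant_turns.tolist())
```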

In [16]:
df.loc[2]['text']

'### Human: Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?### Assistant: Sure! Let\'s say you want to build a model which can distinguish between images of cats and dogs. You gather your dataset, consisting of many cat and dog pictures. Then you put them through a neural net of your choice, which produces some representation for each image, a sequence of numbers like [0.123, 0.045, 0.334, ...]. The problem is, if your model is unfamiliar with cat and dog images, these representations will be quite random. At one time a cat and a dog picture could have very similar representations (their numbers would be close to each other), while at others two cat images may be represented far apart. In simple terms, the model wouldn\'t be able to tell cats and dogs apart. This is where contrastive learning comes in.\n\nThe point of contrastive learning is to take pairs of samples (in this case images of cats and dogs), then train the model

This is a multi-turn conversation. We need to drop such rows so that every remaining row contains exactly one user prompt and one AI response.
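
The filtering rule can also be expressed as a boolean mask. The following is a sketch on a toy frame with hypothetical rows; it is a vectorized alternative to a Python loop, not a replacement for the preparation code in this notebook.

```python
import pandas as pd

# Toy frame (hypothetical rows) with the same `text` column as the dataset.
toy = pd.DataFrame({"text": [
    "### Human: Hi### Assistant: Hello!",
    "### Human: Hi### Assistant: Hello!### Human: Why?",
    "### Human: A### Assistant: B### Human: C### Assistant: D",
]})

# Count each speaker marker per row.
human_turns = toy["text"].str.count("### Human:")
assistant_turns = toy["text"].str.count("### Assistant:")

# Keep only rows with exactly one prompt and exactly one response.
single_turn = toy[(human_turns == 1) & (assistant_turns == 1)]
print(single_turn.index.tolist())
```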

## Preparing the dataset

In [23]:
conversations = df['text'].tolist()

In [24]:
conversations[0]

'### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, lead

In [25]:
def count_conversation_tokens(text):
    """Return how many `### Human:` and `### Assistant:` markers the text contains."""
    human_count = text.count("### Human:")
    assistant_count = text.count("### Assistant:")
    return human_count, assistant_count

In [26]:
# Lists that collect the prompt and the response of single-turn (one human, one assistant) conversations
human_prompt_list = []
assistant_response_list = []

# Count the occurrences of each marker in every conversation
for conversation in conversations:
    human_count, assistant_count = count_conversation_tokens(conversation)
    if human_count == 1 and assistant_count == 1:
        # Split the string by the token `### Assistant:`
        parts = conversation.split("### Assistant:")
        
        # we can directly add the assistant part to the list
        assistant_response_list.append(parts[1])
        
        # Split the human prompt at the token `### Human:`
        parts = parts[0].split("### Human:")
        
        # Append the second part of the split to `human_prompt_list`
        human_prompt_list.append(parts[1])
        
# Checking if both the lists have the same size
print(len(human_prompt_list), len(assistant_response_list))

4746 4746
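
Because the markers are followed by a space, the pieces produced by splitting on `### Human:` and `### Assistant:` keep a leading space. The sketch below is one possible variant that trims it; `split_single_turn` is a hypothetical helper, not part of the notebook.

```python
from typing import Optional, Tuple


def split_single_turn(conversation: str) -> Optional[Tuple[str, str]]:
    """Return (prompt, response) for a one-prompt, one-response string, else None."""
    if conversation.count("### Human:") != 1 or conversation.count("### Assistant:") != 1:
        return None

    # Everything after the assistant marker is the response.
    human_part, _, response = conversation.partition("### Assistant:")

    # Everything after the human marker (and before the assistant marker) is the prompt.
    _, marker, prompt = human_part.partition("### Human:")
    if not marker:
        # The human marker appeared after the assistant marker: not a clean pair.
        return None

    # Trim the whitespace left over next to the markers.
    return prompt.strip(), response.strip()


print(split_single_turn("### Human: Hi### Assistant: Hello!"))
print(split_single_turn("### Human: Hi### Assistant: Hello!### Human: Why?"))
```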


In [27]:
# Creating a new dataframe for the task

preparedData = pd.DataFrame(human_prompt_list, columns=['prompt'])
preparedData['response'] = assistant_response_list

preparedData.head()

| | prompt | response |
|---|---|---|
| 0 | Método del Perceptrón biclásico: definición y... | El método del Perceptrón biclásico es un algo... |
| 1 | Listened to Dvorak's "The New World" symphony... | If you enjoyed Dvorak's "New World" Symphony,... |
| 2 | I am using docker compose and i need to mount... | You can mount the Docker socket in a Docker C... |
| 3 | eu quero que você atue como um terminal linux... | $ pwd\n/home/user |
| 4 | Write a 4chan style greentext about someone w... | >be me\n>sister wants to watch the new hit ro... |
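
Before using the pairs for training, a couple of quick sanity checks can catch parsing mistakes. This is a sketch on a small hypothetical frame with the same `prompt` and `response` columns; the same checks could be pointed at `preparedData`.

```python
import pandas as pd

# Toy stand-in (hypothetical values) for the prepared frame built above.
prepared = pd.DataFrame({
    "prompt": [" What is Docker?", " Say hi"],
    "response": [" A container platform.", " hi"],
})

for column in ["prompt", "response"]:
    # No conversation marker should survive the split.
    assert not prepared[column].str.contains("### Human:|### Assistant:", regex=True).any()
    # Neither side of a pair should be empty once whitespace is trimmed.
    assert (prepared[column].str.strip() != "").all()

print(prepared.shape)
```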
