# Install dependencies


In [None]:
!pip install git+https://github.com/neuml/txtai git+https://github.com/neuml/txtinstruct
!pip install transformers

Collecting git+https://github.com/neuml/txtai
  Cloning https://github.com/neuml/txtai to /tmp/pip-req-build-mgk6g_du
  Running command git clone --filter=blob:none --quiet https://github.com/neuml/txtai /tmp/pip-req-build-mgk6g_du
  Resolved https://github.com/neuml/txtai to commit 29f49be53e3e5b79627c05e5f9133fa52b5afeb7
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting git+https://github.com/neuml/txtinstruct
  Cloning https://github.com/neuml/txtinstruct to /tmp/pip-req-build-kp5rjtof
  Running command git clone --filter=blob:none --quiet https://github.com/neuml/txtinstruct /tmp/pip-req-build-kp5rjtof
  Resolved https://github.com/neuml/txtinstruct to commit dec901e35137addc5b874fd4eb14993383b40e78
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... 

# Architecture Overview

txtinstruct consists of three components to help train instruction-following models.

The first component is statement generation. Statement generation models create a statement from a context. Depending on the model, this statement can be a question or a request to describe a concept.

The next component is a knowledge source for pulling context. An example knowledge source used in this notebook is a [txtai embeddings index of the full Wikipedia dataset](https://huggingface.co/NeuML/txtai-wikipedia).

The last component is a large language model (LLM) for translating source statements into target statements. If the statement is a question, the LLM answers it. If it's a descriptive statement, the LLM builds a description. In both cases, a prompt is used in combination with the knowledge source context to generate the target text.
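The way the three components fit together can be sketched in plain Python. This is only an illustration with stand-in functions, not txtinstruct's implementation: `retrieve_context`, `generate_statement`, `answer`, and `build_example` are hypothetical names, and the stubs return canned text where a real index and real models would run.

```python
# Illustrative sketch only: every function here is a hypothetical stand-in.

def retrieve_context(knowledge, topic):
    """Knowledge source: return the first passage that mentions the topic."""
    return next(text for text in knowledge if topic.lower() in text.lower())

def generate_statement(context):
    """Statement generation: a real model would write a question from the context."""
    subject = context.split(" is ")[0]
    return f"What is {subject}?"

def answer(statement, context):
    """LLM step: a real model would be prompted with the statement and the context."""
    return context  # stub: echo the context as the target text

def build_example(knowledge, topic):
    """Combine the three steps into one source/target training example."""
    context = retrieve_context(knowledge, topic)
    source = generate_statement(context)
    target = answer(source, context)
    return {"context": context, "source": source, "target": target}

knowledge = [
    "txtai is an open-source platform for semantic search and workflows powered by language models."
]
example = build_example(knowledge, "txtai")
print(example["source"])
```

Each resulting record pairs a source statement with a target text grounded in the retrieved context, which is the kind of pair an instruction-tuning dataset needs.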

# Statement Generation Model

The code below builds a question generation model using the [SQuAD dataset](https://huggingface.co/datasets/squad).

In [None]:
from datasets import load_dataset

from txtinstruct.models import StatementGenerator

# Load SQuAD dataset
dataset = load_dataset("squad", split="train")

# Train model
generator = StatementGenerator()
model, tokenizer = generator(
    "google/flan-t5-small",
    dataset,
    "sequence-sequence",
    learning_rate=1e-3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=128 // 16,
    num_train_epochs=0.1,
    logging_steps=100,
)


Note that we only trained the model for a fraction of an epoch for expediency. Under normal circumstances, `num_train_epochs` would be at least 3.
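To make the training settings concrete, the arithmetic below estimates the effective batch size and the number of optimizer updates in this short run. It is a rough sketch that assumes a single device and that all 87,599 SQuAD training examples are used. The rounding mirrors how the Hugging Face trainer is commonly described as computing steps, so treat the exact step count as an estimate.

```python
import math

# Values taken from the training call above
per_device_train_batch_size = 16
gradient_accumulation_steps = 128 // 16
num_train_epochs = 0.1

# Assumption: single device, all 87,599 SQuAD training examples are used
num_examples = 87599

# Gradients are accumulated over several small batches before each update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Estimated number of optimizer updates for the run
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = math.ceil(num_train_epochs * updates_per_epoch)

print(effective_batch_size, updates_per_epoch, total_updates)
```

Under these assumptions the run performs roughly 69 updates with an effective batch size of 128. That is fewer than `logging_steps=100`, so little or no training loss may be logged for a run this short.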

If you've trained models with either txtai or Hugging Face's trainer, you'll recognize many of the options. [See this page](https://neuml.github.io/txtai/pipeline/train/trainer/) to learn more about the available configuration options.

In [None]:
from txtai.pipeline import Sequences

# Load statement generation model
statements = Sequences((model, tokenizer))

# Run example prompt
statements("""Generate a question using the context below.
### Context:
txtai is an open-source platform for semantic search and workflows powered by language models.""")

'What is the open source platform for?'

The model generates the question above from the given context. Let's see how this helps build an instruction-tuning dataset.

# Build a dataset for Instruction-Tuning

Now that we have a statement generation model, let's build an instruction-tuning dataset.

We'll use the txtai Wikipedia embeddings index as the knowledge source and `google/flan-t5-base` as our teacher model.

In [None]:
from txtai.embeddings import Embeddings
from txtinstruct.data import DatasetBuilder

# Load embeddings
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

# Query templates
templates = [
    "Tell me about {text}",
    "Give an explanation on {text}",
    "Provide a quick summary on {text}",
    "Explain {text} in simple terms",
    "Describe {text}"
]

# Build dataset
builder = DatasetBuilder(Sequences("google/flan-t5-base"), statements, templates)
builder(
    embeddings.search("SELECT id, text FROM txtai WHERE similar('large language models') AND percentile >= 0.99 LIMIT 5"),
    5,
    "data.json"
)

100%|██████████| 5/5 [00:00<00:00, 4267.71it/s]
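The query templates are ordinary Python format strings with a `{text}` placeholder. The sketch below shows how such templates expand into descriptive requests. The topic value is a hypothetical example, and how `DatasetBuilder` picks the value it substitutes for `{text}` is not covered here.

```python
# Same templates as in the cell above
templates = [
    "Tell me about {text}",
    "Give an explanation on {text}",
    "Provide a quick summary on {text}",
    "Explain {text} in simple terms",
    "Describe {text}",
]

# Hypothetical topic used only for illustration
topic = "large language models"

# Fill the {text} placeholder in each template
requests = [template.format(text=topic) for template in templates]

for request in requests:
    print(request)
```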


In [None]:
import pandas as pd
df = pd.read_json('/content/data.json')
df

| | context | statements |
|---|---|---|
| 0 | A large language model (LLM) is a large-scale ... | [{'source': 'What is the name of the large lan... |
| 1 | PaLM (Pathways Language Model) is a 540 billio... | [{'source': 'What is the name of the model tha... |
| 2 | LLaMA (Large Language Model Meta AI) is a fami... | [{'source': 'What is the name of the large lan... |
| 3 | A language model is a probabilistic model of a... | [{'source': 'What was the first significant st... |
| 4 | LangChain is a framework designed to simplify ... | [{'source': 'What is the name of the framework... |
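Each row holds one context and a list of statements, so it can help to flatten the data into one record per statement before training. The sketch below uses made-up placeholder records shaped like the table above. The `target` key is an assumption based on the source/target wording in the architecture overview, and `flatten` is a hypothetical helper.

```python
# Placeholder records shaped like the DataFrame above (not real output).
# Assumption: each statement carries a "target" next to its "source".
records = [
    {
        "context": "Example context passage.",
        "statements": [
            {"source": "Example question?", "target": "Example answer."},
            {"source": "Describe the example", "target": "Example description."},
        ],
    }
]

def flatten(records):
    """Hypothetical helper: emit one (context, source, target) record per statement."""
    pairs = []
    for record in records:
        for statement in record["statements"]:
            pairs.append({
                "context": record["context"],
                "source": statement["source"],
                "target": statement.get("target"),  # None if the key is absent
            })
    return pairs

pairs = flatten(records)
print(len(pairs), pairs[0]["source"])
```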


In [None]:
from huggingface_hub import login

# Prompt for a Hugging Face access token (needed to push the dataset to the Hub)
login()

# Alternative for Google Colab: read the token from the secrets tab
# from google.colab import userdata
# hf_token = userdata.get('huggingface')


In [None]:
from datasets import load_dataset

# Load the generated JSON file as a Hugging Face dataset
dataset = load_dataset('json', data_files='/content/data.json')

# Upload the dataset to the Hugging Face Hub (requires the login above)
dataset.push_to_hub("sachintripathi04/gen_data")

CommitInfo(commit_url='https://huggingface.co/datasets/sachintripathi04/gen_data/commit/33d482c7790f859d1ea68c7047df46ff19d121c5', commit_message='Upload dataset', commit_description='', oid='33d482c7790f859d1ea68c7047df46ff19d121c5', pr_url=None, pr_revision=None, pr_num=None)