# Genstruct: Generating instructions from text with an open-source model

Genstruct 7B is an instruction-generation model, designed to create valid instructions given a raw text corpus.
This enables the creation of new, partially synthetic instruction finetuning datasets from any raw-text corpus.

You need:
- Some pieces of text content (of 1024 tokens max) and
- a title for each piece.

We will pass each title-content pair to `Genstruct-7B` to:
- generate an LLM completion, and
- isolate the generated instruction using the model's special tokens.


It is available on Hugging Face [here](https://huggingface.co/NousResearch/Genstruct-7B) and is inspired by the Ada-Instruct paper.

**GPU**
- You can run this notebook on a free **T4**

In [1]:
# install bitsandbytes for quantization
!pip install -U bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl.metadata (2.9 kB)
Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl (69.1 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.45.0


**Before you continue!** You need to restart the session so the freshly installed package is picked up: click Runtime > Restart session.

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

In [2]:
MODEL_NAME = 'NousResearch/Genstruct-7B'

# You could load it unquantized if you have enough VRAM, or quantize it further (4-bit) to use less memory.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# load the model, send it to GPU and specify the quantization
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map='cuda', quantization_config=quantization_config)

# the tokenizer is very important, and here we take the one corresponding to our model.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

In [3]:
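def make_message(title: str, content: str) -> list[dict]:
    # Genstruct's chat template takes a single message with 'title' and 'content' keys.
    message = [{'title': title, 'content': content}]
    return message


def generate(model, tokenizer, message):
    # Render the prompt with the model's chat template and move the token ids to the GPU.
    inputs = tokenizer.apply_chat_template(message, return_tensors='pt').cuda()
    # do_sample=True is needed for temperature to take effect; without it decoding is greedy.
    outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.9)
    # Decode prompt + completion and keep only the text before the first end-of-sequence token.
    decoded_output = tokenizer.decode(outputs[0]).split(tokenizer.eos_token)[0]
    return decoded_output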

This approach requires two inputs:
- `title`: a title for the raw text. If the content is an article, the article title is fine; if it is a chapter within a larger document, the chapter or section title works too.
- `content`: the raw text itself.
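To make the role of the two fields concrete, the sketch below builds a prompt by hand in the layout that the Genstruct model card describes. This is an assumption about what the chat template renders, not a replacement for it: `render_prompt` is a hypothetical helper, and the exact wording and whitespace may differ from the real template. Calling `tokenizer.apply_chat_template(message, tokenize=False)` shows the actual string.

```python
def render_prompt(title: str, content: str) -> str:
    """Hypothetical helper: approximate the prompt layout assumed for Genstruct."""
    return (
        f"[[[Title]]] {title}\n"
        f"[[[Content]]] {content}\n\n"
        # Assumed bridging sentence; check it against the real chat template.
        "The following is an interaction between a user and an AI assistant "
        "that is related to the above text.\n\n"
        # The prompt ends on the user tag, so the model writes the instruction next.
        "[[[User]]] "
    )


prompt = render_prompt("Apollo 1 Tragedy", "One of the worst tragedies ...")
print(prompt)
```

Because the prompt stops right after `[[[User]]]`, the model continues with an instruction and then, under the same assumption, an `[[[Assistant]]]` answer, which is why the title and content alone are enough as input.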

In [4]:
title = "Apollo 1 Tragedy"
content = """One of the worst tragedies in the history of spaceflight occurred on January 27, 1967 when the crew of Gus Grissom, Ed White, and Roger Chaffee were killed in a fire in the Apollo Command Module during a preflight test at Cape Canaveral. They were training for the first crewed Apollo flight, an Earth orbiting mission scheduled to be launched on 21 February. They were taking part in a "plugs-out" test, in which the Command Module was mounted on the Saturn 1B on the launch pad just as it would be for the actual launch, but the Saturn 1B was not fueled. The plan was to go through an entire countdown sequence.

At 1 p.m. on Friday, 27 January 1967 the astronauts entered the capsule on Pad 34 to begin the test. A number of minor problems cropped up which delayed the test considerably and finally a failure in communications forced a hold in the count at 5:40 p.m. At 6:30 p.m., Grissom said "How are we going to get to the Moon if we can't talk between three buildings?". At 6:31 p.m. a surge was recorded in the AC bus 2 voltage readings, possibly indicating a short-circuit. The cockpit recording is difficult to interpret in places but a few seconds later one of the astronauts (probably Chaffee) is heard to say what sounds like "Flames!". Two seconds after that White was heard to say, "We've got a fire in the cockpit." The fire spread throughout the cabin in a matter of seconds. Chaffee said, "We have a bad fire!", followed by shouting. The last crew communication ended 17 seconds after the first indication of the start of the fire, followed by loss of all telemetry. The Apollo hatch could only open inward and was held closed by a number of latches which had to be operated by ratchets. It was also held closed by the interior pressure, which was higher than outside atmospheric pressure and required venting of the command module before the hatch could be opened. It took at least 90 seconds to get the hatch open under ideal conditions. Because the cabin had been filled with a pure oxygen atmosphere at normal pressure for the test and there had been many hours for the oxygen to permeate all the material in the cabin, the fire spread rapidly and the astronauts had no chance to get the hatch open. Nearby technicians tried to get to the hatch but were repeatedly driven back by the heat and smoke. By the time they succeeded in getting the hatch open roughly 5 minutes after the fire started the astronauts had already perished, probably within the first 30 seconds, due to smoke inhalation and burns.

The Apollo program was put on hold while an exhaustive investigation was made of the accident. It was concluded that the most likely cause was a spark from a short circuit in a bundle of wires that ran to the left and just in front of Grissom's seat. The large amount of flammable material in the cabin in the oxygen environment allowed the fire to start and spread quickly. A number of changes were instigated in the program over the next year and a half, including designing a new hatch which opened outward and could be operated quickly, removing much of the flammable material and replacing it with self-extinguishing components, using a nitrogen-oxygen mixture at launch, and recording all changes and overseeing all modifications to the spacecraft design more rigorously.

The mission, originally designated Apollo 204 but commonly referred to as Apollo 1, was officially assigned the name "Apollo 1" in honor of Grissom, White, and Chaffee. The first Saturn V launch (uncrewed) in November 1967 was designated Apollo 4 (no missions were ever designated Apollo 2 or 3). The Apollo 1 Command Module capsule 012 was impounded and studied after the accident and was then locked away in a storage facility at NASA Langley Research Center. The changes made to the Apollo Command Module as a result of the tragedy resulted in a highly reliable craft which, with the exception of Apollo 13, helped make the complex and dangerous trip to the Moon almost commonplace. The eventual success of the Apollo program is a tribute to Gus Grissom, Ed White, and Roger Chaffee, three fine astronauts whose tragic loss was not in vain."""

In [5]:
message = make_message(title=title, content=content)

In [6]:
len(tokenizer.apply_chat_template(message, return_tensors='pt')[0])

Token indices sequence length is longer than the specified maximum sequence length for this model (1076 > 1024). Running this sequence through the model will result in indexing errors


1076

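Be careful: the maximum input length for `Genstruct-7B` is 1024 tokens, so this input (1076 tokens) exceeds the limit by 52 tokens, and the part beyond the limit may not be handled correctly (the tokenizer warning above mentions possible indexing errors). Keep that in mind when you pass text to the model.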
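One way to stay under the limit is to split long content on paragraph boundaries before building messages. The sketch below is a minimal illustration, not part of the original notebook: `split_to_budget` and `word_count` are hypothetical names, and the word-based counter is only a stand-in. In the notebook, the counter could instead be built from the same `apply_chat_template` length check used above, so that the title and template overhead are included in the count.

```python
from typing import Callable


def split_to_budget(
    content: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 1024,
) -> list[str]:
    """Greedily pack paragraphs into chunks that stay within max_tokens."""
    chunks: list[str] = []
    current = ""
    for paragraph in content.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if count_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single paragraph that is over budget on its own still becomes
            # a chunk here and would need further splitting by hand.
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def word_count(text: str) -> int:
    """Stand-in counter for illustration only: whitespace-separated words."""
    return len(text.split())


demo = "one two three\n\nfour five\n\nsix seven eight nine"
chunks = split_to_budget(demo, word_count, max_tokens=5)
print(chunks)
```

Each chunk can then be paired with the same title (or a section title) and passed to `make_message` separately.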

In [7]:
# call the LLM
completion = generate(model, tokenizer, message)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [8]:
# we split the completion on the special tokens and keep the text between the [[[User]]] tag and the [[[Assistant]]] tag: that is the generated instruction.
generated_instruction = completion.split('[[[User]]]')[1].split('[[[Assistant]]]')[0]
generated_instruction

'  NASA was planning to send humans to the Moon. Their first attempt was the Apollo 1 mission. Unfortunately, there was a fire in the capsule during a preflight test and the 3 astronauts inside died. The mission was a tragedy. The second mission was Apollo 7, which was a success.\nWhich mission had a Saturn V rocket?\n'
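The chained `split` calls above raise an `IndexError` if the completion lacks the `[[[User]]]` tag, and they leave surrounding whitespace in place. The following sketch shows a more defensive variant that also returns the answer part, which is useful when building instruction-response pairs. `split_completion` is a hypothetical helper, and the sample string is made up for illustration rather than real model output.

```python
from typing import Optional, Tuple

USER_TAG = "[[[User]]]"
ASSISTANT_TAG = "[[[Assistant]]]"


def split_completion(completion: str) -> Optional[Tuple[str, str]]:
    """Return (instruction, response), or None when either tag is missing."""
    _, user_tag, after_user = completion.partition(USER_TAG)
    if not user_tag:
        return None
    instruction, assistant_tag, response = after_user.partition(ASSISTANT_TAG)
    if not assistant_tag:
        return None
    # Strip the leading spaces and trailing newlines around both parts.
    return instruction.strip(), response.strip()


# Made-up completion, used only to exercise the helper.
sample = (
    "[[[Title]]] T\n[[[Content]]] C\n\n"
    "[[[User]]]  Which mission?\n[[[Assistant]]] Apollo 4."
)
pair = split_completion(sample)
print(pair)
print(split_completion("no markers here"))
```

Returning `None` instead of raising makes it easy to skip malformed completions when looping over many documents.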

In [9]:
import torch
# release cached GPU memory that PyTorch is no longer using
torch.cuda.empty_cache()