# Prompt2Model - Generate Deployable Models from Instructions

[Prompt2Model](https://github.com/neulab/prompt2model) is a system that takes a natural language task description (like the prompts used for large language models such as ChatGPT) and uses it to train a small special-purpose model that is well suited to deployment.

This demo shows how to use Prompt2Model to create a model that answers questions over documents, but you can adapt it to any task you like by changing the initial prompt and adjusting the design decisions that follow. Every place marked with a `CHANGE THIS` comment is a variable that you can change to adapt the demo to your task.

You can run the demo locally or in Colab. If you are running in Colab on GPUs, you will probably want to use an A100 GPU, which has sufficient memory to train most models that prompt2model will suggest.
<a href="https://colab.research.google.com/github/neulab/prompt2model/blob/main/colab_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you have any questions or feedback, please feel free to contact us!

- **GitHub:** open an [issue](https://github.com/neulab/prompt2model/issues) or submit a PR
- **Discord:** join us on [Discord](https://discord.gg/UCy9csEmFc)
- **Twitter:** reach out to [@vijaytarian](https://twitter.com/vijaytarian) and [@Chenan3_Zhao](https://twitter.com/Chenan3_Zhao)

## Setting Up

First, install prompt2model from PyPI.

In [1]:
%pip install prompt2model

Note: you may need to restart the kernel to use updated packages.


Set your OpenAI API key as an environment variable. A good way to do this is to create a `.env` file with a single line.

```text
OPENAI_API_KEY=<your key here>
```

If you are using Colab, you can create this `.env` file locally, then upload it to Colab by clicking on the file folder on the left side of the screen.

Then run the following cell to load the environment variables from your `.env` file into the running notebook.

In [2]:
%pip install python-dotenv
import dotenv
dotenv.load_dotenv()

Note: you may need to restart the kernel to use updated packages.


True

You can check to make sure that the key is actually imported by printing out the first few characters of it.

In [3]:
import os
os.environ['OPENAI_API_KEY'][:3]

'sk-'
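
The check above raises a `KeyError` if the variable was never loaded. A slightly more defensive variant is sketched below. `describe_key` is a hypothetical helper written for this demo, not part of prompt2model; it only reports whether the variable is set and shows the first few characters.

```python
import os


def describe_key(name: str = "OPENAI_API_KEY", visible: int = 3) -> str:
    """Return a short, masked description of an environment variable."""
    value = os.environ.get(name)
    if not value:
        return f"{name} is not set"
    # Show only the first few characters so the full secret is never printed.
    return f"{name} is set (starts with {value[:visible]!r}, {len(value)} characters)"


print(describe_key())
```

If this reports that the key is not set, check that the `.env` file is in the working directory and rerun the `load_dotenv()` cell.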

## Specify your Prompt

The most important design decision in using prompt2model is the prompt you use to specify your task. A good prompt should:

1. Explain your task
2. Provide a few examples

In this demo, we will use the following prompt to specify a **question answering system**. If you want to try prompt2model on a new task, you can write a similar prompt by swapping in a new description and new examples. Note that this format is a bit flexible, so you don't have to follow this *exact* format, but it is a good starting point. You can also see our suggestions on [writing good prompts](https://github.com/neulab/prompt2model/blob/main/prompt_examples.md).

In [4]:
# CHANGE THIS if you want to use a different prompt or tackle a different task
prompt = """
Your task is to generate an answer to a natural question. In this task, the input is a string that consists of both a question and a context passage. The context is a descriptive passage related to the question and contains the answer. And the question can range from Math, Cultural, Social, Geometry, Biology, History, Sports, Technology, Science, and so on.

Here are examples with input questions and context passages, along with their expected outputs:

input="Question: What city did Super Bowl 50 take place in? Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50."
output="Santa Clara"

input="Question: What river runs through Warsaw? Context: Warsaw (Polish: Warszawa [varˈʂava] ( listen); see also other names) is the capital and largest city of Poland. It stands on the Vistula River in east-central Poland, roughly 260 kilometres (160 mi) from the Baltic Sea and 300 kilometres (190 mi) from the Carpathian Mountains. Its population is estimated at 1.740 million residents within a greater metropolitan area of 2.666 million residents, which makes Warsaw the 9th most-populous capital city in the European Union. The city limits cover 516.9 square kilometres (199.6 sq mi), while the metropolitan area covers 6,100.43 square kilometres (2,355.39 sq mi)."
output="Vistula River"

input="Question: The Ottoman empire controlled territory on three continents, Africa, Asia and which other? Context: The Ottoman Empire was an imperial state that lasted from 1299 to 1923. During the 16th and 17th centuries, in particular at the height of its power under the reign of Suleiman the Magnificent, the Ottoman Empire was a powerful multinational, multilingual empire controlling much of Southeast Europe, Western Asia, the Caucasus, North Africa, and the Horn of Africa. At the beginning of the 17th century the empire contained 32 provinces and numerous vassal states. Some of these were later absorbed into the empire, while others were granted various types of autonomy during the course of centuries."
output="Europe"
"""

## Parse the Prompt

Next, Prompt2Model parses out the instruction and examples from the prompt.
We use the `OpenAIInstructionParser` to do so.

In [5]:
from prompt2model.prompt_parser import OpenAIInstructionParser, TaskType

prompt_spec = OpenAIInstructionParser(task_type=TaskType.TEXT_GENERATION)
prompt_spec.parse_from_prompt(prompt)
print(f"Instruction:\n{prompt_spec.instruction}\n")
print(f"Examples:\n{prompt_spec.examples}")

Instruction:
Your task is to generate an answer to a natural question. In this task, the input is a string that consists of both a question and a context passage. The context is a descriptive passage related to the question and contains the answer. And the question can range from Math, Cultural, Social, Geometry, Biology, History, Sports, Technology, Science, and so on.

Examples:
input="Question: What city did Super Bowl 50 take place in? Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, a
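
For intuition about what "parsing" means here, the sketch below splits a prompt into an instruction and its examples with a regular expression. This is an illustration only: `split_prompt` is a hypothetical helper, it assumes each example sits on an `input="..."` line followed by an `output="..."` line, and it is not how `OpenAIInstructionParser` works internally.

```python
import re


def split_prompt(prompt: str) -> tuple[str, list[tuple[str, str]]]:
    """Split a prompt into its instruction text and (input, output) example pairs."""
    # One example = an input="..." line immediately followed by an output="..." line.
    pattern = re.compile(r'^input="(.*)"\s*\noutput="(.*)"\s*$', re.MULTILINE)
    examples = pattern.findall(prompt)
    first = pattern.search(prompt)
    # Everything before the first example is treated as the instruction.
    instruction = prompt[: first.start()] if first else prompt
    return instruction.strip(), examples


sample = '''
Answer the question using the context.

input="Question: What river runs through Warsaw? Context: It stands on the Vistula River."
output="Vistula River"
'''
instruction, examples = split_prompt(sample)
print(instruction)
print(examples)
```

A rule-based split like this breaks as soon as the prompt deviates from the assumed layout, which is one reason the format of the prompt can stay flexible when a language model does the parsing.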

## Retrieve Model

Next, we retrieve a base model to train. We can use the `DescriptionModelRetriever` to do so.

`top_model_names` is a list of pretrained Hugging Face models. You can choose the first one by default.

In [6]:
from prompt2model.model_retriever import DescriptionModelRetriever

retriever = DescriptionModelRetriever(
    model_descriptions_index_path="huggingface_data/huggingface_models/model_info/",
    use_bm25=True,
    use_HyDE=True,
)
top_model_names = retriever.retrieve(prompt_spec)
pre_train_model_name = top_model_names[0]
print(pre_train_model_name)

100%|██████████| 11929/11929 [00:01<00:00, 6064.56it/s]


ValueError: API key must be provided or set the environment variable with `export OPENAI_API_KEY=<your key>`.
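
The retriever above is configured with `use_bm25=True`. As a rough illustration of what BM25 ranking over model descriptions looks like, here is a minimal, generic Okapi BM25 sketch. It is not the library's implementation, it does not cover the HyDE step, and the model names and descriptions are made-up toy data.

```python
import math
from collections import Counter


def bm25_rank(query: str, documents: dict[str, str], k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Rank document names by Okapi BM25 score against the query, best first."""
    if not documents:
        return []
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: the number of documents in which each term appears.
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    query_terms = set(query.lower().split())

    def score(tokens: list[str]) -> float:
        counts = Counter(tokens)
        total = 0.0
        for term in query_terms:
            tf = counts.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency is dampened and normalized by document length.
            total += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        return total

    return sorted(tokenized, key=lambda name: score(tokenized[name]), reverse=True)


# Toy descriptions (hypothetical, for illustration only).
toy_descriptions = {
    "toy/summarizer": "summarization model for news articles",
    "toy/qa-model": "question answering model trained on reading comprehension data",
    "toy/translator": "translation model for english to french",
}
ranking = bm25_rank("answer a question given a context passage", toy_descriptions)
print(ranking[0])
```

Only the description that shares a term with the query receives a non-zero score here, so it is ranked first; picking `top_model_names[0]` in the cell above follows the same "take the best match" logic.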

## Retrieve and Process Dataset

Next, `Prompt2Model` searches for datasets on Hugging Face to try to find training datasets that may be useful for your task. Specifically, we use `DescriptionDatasetRetriever`, which looks up datasets that match the description.

First we initialize the retriever. This creates the search index so it may take several minutes the first time you run it.

In [None]:
from prompt2model.dataset_retriever import DescriptionDatasetRetriever

retriever = DescriptionDatasetRetriever()

Next, we retrieve a list of top datasets for the current prompt (and display their basic data).

In [None]:

sorted_dataset_list = retriever.retrieve_top_datasets(prompt_spec)

print("#\tName\tDescription")
for i, d in enumerate(sorted_dataset_list):
    description_no_spaces = d.description.replace("\n", " ")
    print(f"{i+1}):\t{d.name}\t{description_no_spaces}")

retrieved_dataset_dict = None

If none of the datasets in the list look useful for your task, you can skip the rest of the section and we won't use any retrieved data.

However, if one of the datasets looks useful, set the `retrieved_dataset_name` variable, and continue through the rest of the section. For the question answering example, we will pick the `squad` dataset to train our model, but of course you will want to change this to the dataset that you selected if you're doing a different task.


In [None]:
# CHANGE THIS if you want to use a different retrieved dataset
retrieved_dataset_name = "squad"

Existing datasets on Hugging Face have many different formats, but Prompt2Model expects that a dataset should have one input and one output, both of which are strings. In order to solve this, we do a **canonicalization** step, where we convert the dataset into a format that is compatible with `Prompt2Model`.

In order to do so, we examine the dataset and find the different configurations that exist.

In [None]:
import datasets

configs = datasets.get_dataset_config_names(retrieved_dataset_name)
print(f"Available dataset configs {configs}")

Then we choose one.

In [None]:
# CHANGE THIS if you want to use a different dataset configuration
chosen_config = "plain_text"

Next, we read in the dataset and print out an example. You can use this to check which columns you'd like to use for the input and output.

In [None]:
import json

dataset = datasets.load_dataset(retrieved_dataset_name, chosen_config)
if "train" not in dataset:
    raise ValueError(
        f"Dataset {retrieved_dataset_name} does not have a train split."
    )
train_columns = dataset["train"].column_names
train_columns_formatted = ", ".join(train_columns)

if len(dataset["train"]) == 0:
    raise ValueError(
        f"Dataset {retrieved_dataset_name} has no rows in the train split."
    )
example_rows = json.dumps(dataset["train"][0], indent=4)

print(f"Loaded dataset. Example row:\n{example_rows}\n")

print(f"It has these columns: {train_columns_formatted}.")

Now we can set the following variables to the ones that we'd like to use.

In [None]:
# CHANGE THIS if you want to use different input or output columns
input_columns = ["question", "context"]
output_column = "answers"

Finally, we canonicalize the dataset, and we have properly prepared our dataset for training. We also save it to disk.

In [None]:
retrieved_dataset_dict = retriever.canonicalize_dataset_using_columns(
    dataset, input_columns, output_column
)
retrieved_dataset_dict.save_to_disk("retrieved_dataset_dict")
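
To make the canonicalization step more concrete, the sketch below flattens one SQuAD-style row into a single input string and a single output string. It is an assumption about what such a step might do, not the code behind `canonicalize_dataset_using_columns`; the `canonicalize_row` helper and the `input`/`output` key names are hypothetical.

```python
def canonicalize_row(row: dict, input_columns: list[str], output_column: str) -> dict:
    """Flatten one dataset row into one input string and one output string."""
    # Join the chosen input columns, labeling each value with its column name.
    model_input = " ".join(f"{column}: {row[column]}" for column in input_columns)
    target = row[output_column]
    # SQuAD-style answers are dicts with a list of texts; keep the first text.
    if isinstance(target, dict) and "text" in target:
        texts = target["text"]
        target = texts[0] if texts else ""
    return {"input": model_input, "output": str(target)}


squad_like_row = {
    "question": "What river runs through Warsaw?",
    "context": "It stands on the Vistula River in east-central Poland.",
    "answers": {"text": ["Vistula River"], "answer_start": [17]},
}
canonical = canonicalize_row(squad_like_row, ["question", "context"], "answers")
print(canonical)
```

Whatever dataset you pick, the goal is the same: every row ends up as one string in and one string out.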

## Generate Dataset

Next, we generate some examples for training the model. We can use `OpenAIDatasetGenerator` to generate these examples.

Note that there are a number of hyperparameters here. These are in general good defaults, but you might want to play with them. In particular, this generates 5,000 examples, which may be expensive (roughly $5), so you could choose to generate fewer.

In [None]:
from prompt2model.dataset_generator import OpenAIDatasetGenerator, DatasetSplit

dataset_generator = OpenAIDatasetGenerator(
    initial_temperature=0.3,
    max_temperature=1.4,
    responses_per_request=3,
    max_api_calls=10000,
    requests_per_minute=80,
)
generated_dataset = dataset_generator.generate_dataset_split(
    prompt_spec, 5000, split=DatasetSplit.TRAIN
)
generated_dataset.save_to_disk("generated_dataset")
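
Before launching a large generation run, it can help to do a quick back-of-the-envelope estimate from the settings above. The sketch below assumes that every response yields one usable example and that cost scales linearly with the number of examples; both are simplifying assumptions, so treat the results as rough lower bounds rather than guarantees.

```python
import math

num_examples = 5000        # examples requested in the cell above
responses_per_request = 3  # from the generator settings above
requests_per_minute = 80   # from the generator settings above
max_api_calls = 10000      # from the generator settings above
quoted_cost_usd = 5.0      # rough cost quoted above for 5,000 examples

# Fewest requests needed if every response became a usable example.
min_requests = math.ceil(num_examples / responses_per_request)
# Shortest possible run time under the rate limit.
min_minutes = min_requests / requests_per_minute
# Linear cost model derived from the quoted figure (an assumption).
cost_per_example = quoted_cost_usd / num_examples


def estimated_cost(n_examples: int) -> float:
    """Estimate the cost in USD of generating n_examples under the linear model."""
    return n_examples * cost_per_example


print(f"At least {min_requests} requests, about {min_minutes:.1f} minutes")
print(f"Within the API call budget: {min_requests <= max_api_calls}")
print(f"Estimated cost for 1,000 examples: ${estimated_cost(1000):.2f}")
```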

## Finetune the Model

Next, we fine-tune the model. To do so, we first combine the retrieved dataset with the generated dataset and create our train, validation, and test splits.

In [None]:
from prompt2model.dataset_processor import TextualizeProcessor

text_processor = TextualizeProcessor(has_encoder=True)
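# Use the retrieved data only if a dataset was selected in the retrieval section.
dataset_list = [generated_dataset]
if retrieved_dataset_dict is not None:
    dataset_list.append(retrieved_dataset_dict["train"])
text_modified_dataset_dicts = text_processor.process_dataset_lists(
    prompt_spec.instruction,
    dataset_list,
    train_proportion=0.6,
    val_proportion=0.2,
    maximum_example_num=3000,
)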
train_datasets = [each["train"] for each in text_modified_dataset_dicts]
val_datasets = [each["val"] for each in text_modified_dataset_dicts]
test_datasets = [each["test"] for each in text_modified_dataset_dicts]
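
With `train_proportion=0.6` and `val_proportion=0.2`, the remaining 20% of each dataset is left for testing. The sketch below shows that arithmetic for a hypothetical dataset of 1,000 examples. `split_sizes` is an illustrative helper, not a prompt2model function, and it ignores the `maximum_example_num` cap.

```python
def split_sizes(num_examples: int, train_proportion: float, val_proportion: float) -> tuple[int, int, int]:
    """Return (train, val, test) sizes; the test split takes whatever remains."""
    if train_proportion + val_proportion > 1:
        raise ValueError("train and validation proportions must sum to at most 1")
    train = round(num_examples * train_proportion)
    val = round(num_examples * val_proportion)
    test = num_examples - train - val
    return train, val, test


train_size, val_size, test_size = split_sizes(1000, 0.6, 0.2)
print(train_size, val_size, test_size)
```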


Now we use the `GenerationModelTrainer` to fine-tune the retrieved model on the combined training sets. After fine-tuning, we save the model and tokenizer.

In [None]:
from prompt2model.model_trainer import GenerationModelTrainer
from pathlib import Path

trainer = GenerationModelTrainer(
    pre_train_model_name,
    has_encoder=True,
    executor_batch_size=8,
    tokenizer_max_length=1024,
    sequence_max_length=1280,
)

args_output_root = Path("result/training_output")
args_output_root.mkdir(parents=True, exist_ok=True)

trained_model, trained_tokenizer = trainer.train_model(
    hyperparameter_choices={
        "output_dir": str(args_output_root),
        "save_strategy": "epoch",
        "num_train_epochs": 1,
        "per_device_train_batch_size": 8,
        "evaluation_strategy": "epoch",
    },
    training_datasets=train_datasets,
    validation_datasets=val_datasets,
)

trained_model.save_pretrained("trained_model")
trained_tokenizer.save_pretrained("trained_tokenizer")

## Try it out!

Now you can supply your own input and run inference with your fine-tuned model.

In [None]:
from prompt2model.model_executor import GenerationModelExecutor

model_executor = GenerationModelExecutor(trained_model, trained_tokenizer)
# CHANGE THIS to your own input
# (named input_text to avoid shadowing Python's built-in input function)
input_text = "Question: How many departments are within the Stinson-Remick Hall of Engineering? Context: The College of Engineering was established in 1920, however, early courses in civil and mechanical engineering were a part of the College of Science since the 1870s. Today the college, housed in the Fitzpatrick, Cushing, and Stinson-Remick Halls of Engineering, includes five departments of study – aerospace and mechanical engineering, chemical and biomolecular engineering, civil engineering and geological sciences, computer science and engineering, and electrical engineering – with eight B.S. degrees offered. Additionally, the college offers five-year dual degree programs with the Colleges of Arts and Letters and of Business awarding additional B.A. and Master of Business Administration (MBA) degrees, respectively."
response = model_executor.make_single_prediction(
    text_processor.wrap_single_input(prompt_spec.instruction, input_text)
)
print(response.prediction)

## Evaluate Model

After training, we can evaluate the trained model on the combined test set with `Seq2SeqEvaluator`.
This will output a number of metrics indicating how good the answers are.

In [None]:
from prompt2model.model_executor import GenerationModelExecutor
from prompt2model.model_evaluator import Seq2SeqEvaluator

import datasets  # needed below even if the dataset retrieval section was skipped
import transformers
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

trained_model = transformers.AutoModelForSeq2SeqLM.from_pretrained(
    "trained_model"
).to(device)
trained_tokenizer = transformers.AutoTokenizer.from_pretrained(
    "trained_tokenizer"
)

test_dataset = datasets.concatenate_datasets(test_datasets)
model_executor = GenerationModelExecutor(trained_model, trained_tokenizer, 1)
t5_outputs = model_executor.make_prediction(test_dataset, "model_input")
evaluator = Seq2SeqEvaluator()
metric_values = evaluator.evaluate_model(test_dataset, "model_output", t5_outputs)
print(metric_values)
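
To give a feel for what an answer-quality metric measures, here is a minimal exact-match sketch. It is a generic illustration and makes no claim about which metrics `Seq2SeqEvaluator` reports or how it computes them; `exact_match` is a hypothetical helper.

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    """Return the fraction of predictions equal to their reference, ignoring case and outer whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    matches = sum(
        prediction.strip().lower() == reference.strip().lower()
        for prediction, reference in zip(predictions, references)
    )
    return matches / len(predictions)


toy_predictions = ["Santa Clara", "vistula river", "Asia"]
toy_references = ["Santa Clara", "Vistula River", "Europe"]
print(f"Exact match: {exact_match(toy_predictions, toy_references):.2f}")
```

Exact match is strict: a prediction such as "the Vistula" would count as wrong, which is why softer, overlap-based metrics are often reported alongside it.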

## Final Words

We hope that you found this demo useful!
If you have any questions or feedback, please get in touch. We would also love to have community contributions!