# Prompt2Model Notebook Demo

In `cli_demo.py`, we hard-coded many parameters that are actually configurable. This Jupyter notebook demo uses a machine reading question-answering task, [SQuAD](https://huggingface.co/datasets/squad), to show how to configure these parameters for your own setting.

<a href="https://colab.research.google.com/github/neulab/prompt2model/blob/main/demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install prompt2model

In [None]:
# Set the key in the kernel's environment so that later cells can read it.
%env OPENAI_API_KEY=your-openai-api-key
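
Before running the later cells, it can help to confirm that the key is visible to the Python kernel. The following is a minimal sketch, assuming the key is read from the `OPENAI_API_KEY` environment variable; the helper name `require_env` is hypothetical and not part of prompt2model.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Fail early instead of hitting an authentication error mid-pipeline.
        raise RuntimeError(f"Environment variable {name} is not set in this kernel.")
    return value
```

Calling `require_env("OPENAI_API_KEY")` at the top of the notebook raises immediately if the variable is missing or empty.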

## Parse Prompt

Use the `OpenAIInstructionParser` to parse the input prompt.

In [None]:
from prompt2model.prompt_parser import OpenAIInstructionParser, TaskType

prompt = """
Your task is to generate an answer to a natural question. In this task, the input is a string that consists of both a question and a context passage. The context is a descriptive passage related to the question and contains the answer. And the question can range from Math, Cultural, Social, Geometry, Biology, History, Sports, Technology, Science, and so on.

Here are examples with input questions and context passages, along with their expected outputs:

input="Question: What city did Super Bowl 50 take place in? Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50."
output="Santa Clara"

input="Question: What river runs through Warsaw? Context: Warsaw (Polish: Warszawa [varˈʂava] ( listen); see also other names) is the capital and largest city of Poland. It stands on the Vistula River in east-central Poland, roughly 260 kilometres (160 mi) from the Baltic Sea and 300 kilometres (190 mi) from the Carpathian Mountains. Its population is estimated at 1.740 million residents within a greater metropolitan area of 2.666 million residents, which makes Warsaw the 9th most-populous capital city in the European Union. The city limits cover 516.9 square kilometres (199.6 sq mi), while the metropolitan area covers 6,100.43 square kilometres (2,355.39 sq mi)."
output="Vistula River"

input="Question: The Ottoman empire controlled territory on three continents, Africa, Asia and which other? Context: The Ottoman Empire was an imperial state that lasted from 1299 to 1923. During the 16th and 17th centuries, in particular at the height of its power under the reign of Suleiman the Magnificent, the Ottoman Empire was a powerful multinational, multilingual empire controlling much of Southeast Europe, Western Asia, the Caucasus, North Africa, and the Horn of Africa. At the beginning of the 17th century the empire contained 32 provinces and numerous vassal states. Some of these were later absorbed into the empire, while others were granted various types of autonomy during the course of centuries."
output="Europe"
"""

prompt_spec = OpenAIInstructionParser(task_type=TaskType.TEXT_GENERATION)
prompt_spec.parse_from_prompt(prompt)
print(f"Instruction: {prompt_spec.instruction}")
print(f"Examples: {prompt_spec.examples}")
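
To make the `input="..."` / `output="..."` layout of the prompt concrete, here is a small offline sketch that pulls such pairs out with a regular expression. It is only an illustration of the format: it is not how `OpenAIInstructionParser` works internally, and `PAIR_PATTERN` and `extract_pairs` are hypothetical names. The sample text is a shortened variant of the Warsaw example above.

```python
import re

# Hypothetical helper for illustration only; not part of prompt2model.
# Each pair is an input="..." line immediately followed by an output="..." line.
PAIR_PATTERN = re.compile(
    r'^input="(.*)"[ \t]*\noutput="(.*)"[ \t]*$', re.MULTILINE
)


def extract_pairs(prompt_text: str) -> list[tuple[str, str]]:
    """Return (input, output) pairs found in the prompt text."""
    return PAIR_PATTERN.findall(prompt_text)


sample = '''Answer the question using the context.

input="Question: What river runs through Warsaw? Context: It stands on the "Vistula" River."
output="Vistula River"
'''

for question, answer in extract_pairs(sample):
    print(answer, "<-", question[:40])
```

Because `.` does not match newlines, each capture stays on its own line, so quotation marks nested inside a passage do not end the match early.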

## Retrieve Dataset

Use the `DescriptionDatasetRetriever` to retrieve a dataset.

Note that retrieving a dataset is an interactive process. Watch the log output of the code block and type your response into the input box when prompted.

In [None]:
from prompt2model.dataset_retriever import DescriptionDatasetRetriever

retriever = DescriptionDatasetRetriever()
retrieved_dataset_dict = retriever.retrieve_dataset_dict(prompt_spec)
retrieved_dataset_dict.save_to_disk("retrieved_dataset")

## Retrieve Model

Use the `DescriptionModelRetriever` to retrieve a pretrained model.

`top_model_names` is a list of Hugging Face model names. In this demo, we use the first one by default.

In [None]:
from prompt2model.model_retriever import DescriptionModelRetriever

retriever = DescriptionModelRetriever(
    model_descriptions_index_path="huggingface_data/huggingface_models/model_info/",  # noqa E501
    use_bm25=True,
    use_HyDE=True,
)
top_model_names = retriever.retrieve(prompt_spec)
pre_train_model_name = top_model_names[0]
print(pre_train_model_name)

## Generate Dataset

Use `OpenAIDatasetGenerator` to generate new examples for the machine reading question-answering task.

In [None]:
from prompt2model.dataset_generator import OpenAIDatasetGenerator, DatasetSplit

unlimited_dataset_generator = OpenAIDatasetGenerator(
    initial_temperature=0.3,
    max_temperature=1.4,
    responses_per_request=3,
    max_api_calls=10000,
    requests_per_minute=80,
)
generated_dataset = unlimited_dataset_generator.generate_dataset_split(
    prompt_spec, 5000, split=DatasetSplit.TRAIN
)
generated_dataset.save_to_disk("generated_dataset")
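
As a rough sanity check on these settings, the arithmetic below estimates how many requests and how much time the generation step needs at a minimum. It is a back-of-the-envelope sketch that reads the parameter names at face value (responses returned per API call, a cap on total calls, and a rate limit) and optimistically assumes every response becomes a usable example, so real runs will likely need more.

```python
import math

# Values copied from the cell above.
num_examples = 5000
responses_per_request = 3
max_api_calls = 10000
requests_per_minute = 80

# Lower bound on requests if every response yields one usable example.
min_requests = math.ceil(num_examples / responses_per_request)
# Lower bound on wall-clock time imposed by the rate limit alone.
min_minutes = min_requests / requests_per_minute
# Upper bound on responses the call cap can ever return.
max_responses = max_api_calls * responses_per_request

print(f"At least {min_requests} requests, about {min_minutes:.1f} minutes")
print(f"The call cap allows at most {max_responses} responses")
```

With the values above this gives at least 1667 requests (about 20.8 minutes) and a ceiling of 30000 responses, so the call cap leaves plenty of headroom for the 5000-example target.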

## Preprocess Dataset

Combine the `generated_dataset` with `retrieved_dataset_dict` and use `TextualizeProcessor` to preprocess the training dataset.

In [None]:
import datasets
from prompt2model.dataset_processor import TextualizeProcessor

train_generated_dataset = datasets.Dataset.from_dict(generated_dataset[:3000])
val_generated_dataset = datasets.Dataset.from_dict(generated_dataset[3000:4000])
test_generated_dataset = datasets.Dataset.from_dict(generated_dataset[4000:])

generated_dataset = datasets.DatasetDict(
    {"train": train_generated_dataset, "val": val_generated_dataset, "test": test_generated_dataset}
)

retrieved_dataset = datasets.DatasetDict(
    {
        "train": datasets.Dataset.from_dict(retrieved_dataset_dict["train"][:3000]),
        "val": datasets.Dataset.from_dict(retrieved_dataset_dict["train"][3000:4000]),
        "test": datasets.Dataset.from_dict(retrieved_dataset_dict["train"][4000:5000]),
    }
)

DATASET_DICTS = [generated_dataset, retrieved_dataset]

t5_processor = TextualizeProcessor(has_encoder=True)
t5_modified_dataset_dicts = t5_processor.process_dataset_dict(
    prompt_spec.instruction, DATASET_DICTS
)
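
The fixed slice boundaries above assume each source has at least 5,000 rows. If a retrieved dataset is smaller, the later slices come back short or empty. One way to guard against that is to derive the boundaries from the row count. The helper below is a hypothetical sketch (`split_bounds` is not part of prompt2model), using integer percentages to avoid floating-point rounding.

```python
def split_bounds(num_rows: int, train_pct: int = 60, val_pct: int = 20) -> tuple[int, int]:
    """Return (train_end, val_end) slice boundaries; the test split takes the rest."""
    if train_pct <= 0 or val_pct < 0 or train_pct + val_pct >= 100:
        raise ValueError("Percentages must leave room for a test split.")
    train_end = num_rows * train_pct // 100
    val_end = train_end + num_rows * val_pct // 100
    return train_end, val_end


train_end, val_end = split_bounds(5000)
print(train_end, val_end)  # 3000 4000
```

The three splits are then `rows[:train_end]`, `rows[train_end:val_end]`, and `rows[val_end:]`, which reproduces the 3000/1000/1000 split used here when there are 5,000 rows.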

## Finetune the Model

Combine the retrieved dataset with the generated dataset and use `GenerationModelTrainer` to fine-tune the retrieved model. After fine-tuning, we save the model and tokenizer to disk.

In [None]:
from prompt2model.model_trainer import GenerationModelTrainer
from prompt2model.utils.logging_utils import get_formatted_logger
from pathlib import Path
import logging

trainer_logger = get_formatted_logger("ModelTrainer")
trainer_logger.setLevel(logging.INFO)
train_datasets = [each["train"] for each in t5_modified_dataset_dicts]
val_datasets = [each["val"] for each in t5_modified_dataset_dicts]
test_datasets = [each["test"] for each in t5_modified_dataset_dicts]

trainer = GenerationModelTrainer(
    pre_train_model_name,
    has_encoder=True,
    executor_batch_size=1,
    tokenizer_max_length=1024,
    sequence_max_length=1280,
)

args_output_root = Path("result/training_output")
args_output_root.mkdir(parents=True, exist_ok=True)

trained_model, trained_tokenizer = trainer.train_model(
    hyperparameter_choices={
        "output_dir": str(args_output_root),
        "save_strategy": "epoch",
        "num_train_epochs": 1,
        "per_device_train_batch_size": 1,
        "evaluation_strategy": "epoch",
    },
    training_datasets=train_datasets,
    validation_datasets=val_datasets,
)

trained_model.save_pretrained("trained_model")
trained_tokenizer.save_pretrained("trained_tokenizer")

## Evaluate Model

After training, we evaluate the trained model on the combined test set with `Seq2SeqEvaluator`.

In [None]:
from prompt2model.model_executor import GenerationModelExecutor
from prompt2model.model_evaluator import Seq2SeqEvaluator
import transformers
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
evaluator_logger = get_formatted_logger("ModelEvaluator")
evaluator_logger.setLevel(logging.INFO)

trained_model = transformers.AutoModelForSeq2SeqLM.from_pretrained(
    "trained_model"
).to(device)
trained_tokenizer = transformers.AutoTokenizer.from_pretrained(
    "trained_tokenizer"
)

test_dataset = datasets.concatenate_datasets(test_datasets)

model_executor = GenerationModelExecutor(
    trained_model,
    trained_tokenizer,
    1,
    tokenizer_max_length=1024,
    sequence_max_length=1280,
)
t5_outputs = model_executor.make_prediction(
    test_set=test_dataset, input_column="model_input"
)
evaluator = Seq2SeqEvaluator()
metric_values = evaluator.evaluate_model(
    test_dataset,
    "model_output",
    t5_outputs,
    encoder_model_name="xlm-roberta-base",
)
print(metric_values)
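
To make aggregate scores like these easier to interpret, here is a minimal sketch of one simple metric that is commonly reported for question answering, exact match, computed by hand. It is an illustration only, not the implementation inside `Seq2SeqEvaluator`; the function name and the normalization (lowercasing and trimming whitespace) are assumptions.

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after light normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)


score = exact_match(
    ["Santa Clara", "vistula river ", "Asia"],
    ["Santa Clara", "Vistula River", "Europe"],
)
print(f"{score:.3f}")
```

Two of the three predictions match their references after normalization, so the score is two thirds.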