Copyright (c) 2023 Graphcore Ltd. All rights reserved.

# Dolly 2.0 – The World’s First, Truly Open Instruction-Tuned LLM on IPUs – Inference

|  Domain | Tasks | Model | Datasets | Workflow |   Number of IPUs   | Execution time |
|---------|-------|-------|----------|----------|--------------|--------------|
|   NLP   |  Instruction Fine-tuned Text Generation  | Dolly 2.0 | N/A | Inference | recommended: 16 (minimum 4) |  ???   |

[Dolly 2.0](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm) is a 12B parameter language model trained and instruction fine-tuned by [Databricks](https://www.databricks.com). By instruction fine-tuning the large language model (LLM), we obtain an LLM better suited for human interactivity. Crucially, Databricks released all code, model weights, and their fine-tuning dataset with an open-source license that permits commercial use. This makes Dolly 2.0 the world's first, truly open-source instruction-tuned LLM. In this notebook, we will show you how to run Dolly 2.0 using Graphcore IPUs on Paperspace with your own prompts.

In this notebook you will:
- Create and configure a Dolly inference pipeline.
- Run Dolly inference on text prompts to generate answers to user-specified questions.

This notebook requires a minimum of 4 IPUs to run. It also supports running on a POD16, which gives faster pipeline inference.

[![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)

## Environment setup

The best way to run this demo is on Paperspace Gradient's cloud IPUs because everything is already set up for you.

[![Run on Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://ipu.dev/t8Jxz1)

To run the demo using other IPU hardware, you need to have the Poplar SDK enabled. Refer to the [Getting Started guide](https://docs.graphcore.ai/en/latest/getting-started.html#getting-started) for your system for details on how to enable the Poplar SDK. Also refer to the [Jupyter Quick Start guide](https://docs.graphcore.ai/projects/jupyter-notebook-quick-start/en/latest/index.html) for how to set up Jupyter to be able to run this notebook on a remote IPU machine.

Run the next cell to install extra requirements for this notebook.

In [None]:
%pip install -r requirements.txt

In [None]:
import os

number_of_ipus = int(os.getenv("NUM_AVAILABLE_IPU", 16))
number_of_ipus

## Dolly inference pipeline

Let's begin by loading the inference config for Dolly. A configuration suitable for your instance will automatically be selected.

In [None]:
from utils.setup import dolly_config_setup

config_name = "dolly_pod4" if number_of_ipus == 4 else "dolly_pod16"
config, *_ = dolly_config_setup("config/inference.yml", "release", config_name)
config

Next, we want to create our inference pipeline. Here we define the maximum
sequence length and maximum micro batch size. Before executing a model on IPUs
it needs to be turned into an executable format by compiling it. This will
happen when the pipeline is created. All input shapes must be known before
compiling, so if the maximum sequence length or micro batch size is changed, the
pipeline will need to be recompiled.

Selecting a longer sequence length or larger batch size will use more IPU
memory. This means that increasing one may require you to decrease the other.

*This cell will take approximately 18 minutes to complete, which includes downloading the model weights.*

In [None]:
import api

# changing these parameters will trigger a recompile.
sequence_length = 512  # max 2048
micro_batch_size = 4

dolly_pipeline = api.DollyPipeline(
    config, sequence_length=sequence_length, micro_batch_size=micro_batch_size
)
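Because both values are fixed at compile time, it can be worth checking them before starting the long compilation step. The following is a minimal sketch: `check_compile_params` is a hypothetical helper and not part of the `api` module, and the 2048 limit is taken from the comment in the cell above. The token count it returns is only a rough first-order proxy for how memory use scales, not a measurement.

```python
MAX_SEQUENCE_LENGTH = 2048  # taken from the "max 2048" comment in the cell above


def check_compile_params(sequence_length: int, micro_batch_size: int) -> int:
    """Validate the compile-time shape parameters (hypothetical helper).

    Returns the number of token positions in one micro batch, which is a
    rough proxy for how memory use grows, not a real memory estimate.
    """
    if not 1 <= sequence_length <= MAX_SEQUENCE_LENGTH:
        raise ValueError(
            f"sequence_length must be between 1 and {MAX_SEQUENCE_LENGTH}, "
            f"got {sequence_length}"
        )
    if micro_batch_size < 1:
        raise ValueError(
            f"micro_batch_size must be at least 1, got {micro_batch_size}"
        )
    return sequence_length * micro_batch_size


tokens_per_micro_batch = check_compile_params(sequence_length=512, micro_batch_size=4)
print(tokens_per_micro_batch)  # 2048
```

Doubling `sequence_length` while halving `micro_batch_size` keeps this product constant, which is one simple way to reason about the trade-off described above.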

And run the pipeline!

In [None]:
answer = dolly_pipeline("How many islands are there in Scotland?")

We've just run Dolly with the default parameters. Instruction fine-tuning turns a base language model into one better suited to interacting with humans, so we can give Dolly prompts as if it were a chatbot.

There are a few sampling parameters we can use to control the behaviour of Dolly:
- `temperature` – Indicates whether you want more creative or more factual outputs. A higher value generates more creative outputs and a lower value generates more factual answers. Typical values fall between `0.0` and `1.0`.
- `top_k` – Restricts sampling to the `top_k` most probable tokens. Set it to 0 to disable top-k sampling and sample across all possible tokens. The value must be between 0 and `config.model.embedding.vocab_size`, which is 50,280. It is passed to the pipeline as the `k` argument.
- `output_length` – Indicates the number of tokens to sample before stopping. Sampling can stop early if the model outputs `### END`.
- `print_final` – If `True`, prints the total time and throughput.
- `print_live` – If `True`, the tokens will be printed as they are being sampled. If `False`, only the answer will be returned as a string.
- `prompt` – A string containing the question you wish to generate an answer for.

Changing these parameters will not result in any recompilation as the input shape to the model has not changed. These can be freely changed and experimented with in the next cell to produce the behaviour you want from Dolly.

In [None]:
temperature = 0.6
top_k = 5
output_length = None
print_live = True
print_final = True

prompt = "Who was Dolly the sheep?"
answer = dolly_pipeline(
    prompt,
    temperature=temperature,
    k=top_k,
    output_length=output_length,
    print_live=print_live,
    print_final=print_final,
)
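To build intuition for `temperature` and `top_k`, the sketch below samples one token from a toy set of logits with NumPy. It is illustrative only and does not reproduce the pipeline's own sampling code: `sample_token` is a hypothetical function, and treating a temperature of 0 as greedy decoding is an assumption of this sketch.

```python
import numpy as np


def sample_token(logits, temperature=0.6, k=5, rng=None):
    """Sample one token id from `logits` using temperature and top-k.

    Illustrative only: the pipeline's own sampling code may differ.
    """
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)

    # Treat temperature 0 as greedy decoding (an assumption of this sketch).
    if temperature <= 0.0:
        return int(np.argmax(logits))

    # k == 0 disables top-k; otherwise mask everything below the k-th largest
    # logit (ties with the k-th largest value are kept).
    if 0 < k < logits.size:
        kth_largest = np.partition(logits, -k)[-k]
        logits = np.where(logits >= kth_largest, logits, -np.inf)

    # Lower temperatures sharpen the distribution, higher ones flatten it.
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(logits.size, p=probs))


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = np.random.default_rng(0)
token = sample_token(toy_logits, temperature=0.6, k=2, rng=rng)
print(token)  # always 0 or 1, since k=2 keeps only the two largest logits
```

With `k=1` the function always returns the most probable token, regardless of the temperature.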

If you set `micro_batch_size` to a value greater than one when creating the pipeline, you can also run the pipeline on a batch of prompts by passing it a list of prompts. The number of prompts must be less than or equal to `micro_batch_size`.

In [None]:
temperature = 0.6
top_k = 5
output_length = None
print_live = False
print_final = True

prompt = [
    "Write a Haiku about Dolly the Sheep.",
    "What makes something a Haiku?",
    "Can sheep write Haikus?",
    "Where is the best place to purchase my own Dolly the Sheep?",
]
answer = dolly_pipeline(
    prompt,
    temperature=temperature,
    k=top_k,
    output_length=output_length,
    print_live=print_live,
    print_final=print_final,
)

for p, a in zip(prompt, answer):
    print(f"Instruction: {p}")
    print(f"Response: {a}")
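If you have more prompts than `micro_batch_size`, one option is to split them into smaller lists first. The sketch below shows one way to do this in plain Python. `chunk_prompts` is a hypothetical helper, not part of the pipeline API.

```python
def chunk_prompts(prompts, micro_batch_size):
    """Split `prompts` into consecutive lists of at most `micro_batch_size`."""
    if micro_batch_size < 1:
        raise ValueError("micro_batch_size must be at least 1")
    return [
        prompts[i : i + micro_batch_size]
        for i in range(0, len(prompts), micro_batch_size)
    ]


many_prompts = [f"Question {n}" for n in range(1, 7)]
batches = chunk_prompts(many_prompts, micro_batch_size=4)
print([len(batch) for batch in batches])  # [4, 2]
```

Each list in `batches` could then be passed to `dolly_pipeline` in turn, collecting the answers as you go. The last batch may contain fewer prompts than `micro_batch_size`, which is allowed by the constraint above.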

As Dolly is an instruction fine-tuned model, it was trained with a specific prompt format. Internally, the pipeline will transform your prompts into the correct format. To see the full prompt, you can view the `last_instruction_prompt` attribute on the pipeline:

In [None]:
print(dolly_pipeline.last_instruction_prompt[0])
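For orientation, the sketch below shows roughly what such a transformation looks like. The template is an assumption, modelled on the prompt format published with Dolly by Databricks, and `format_instruction_prompt` is a hypothetical function. The pipeline's own template may differ, so treat the output of `last_instruction_prompt` as authoritative.

```python
# Assumed template, modelled on the prompt format published with Dolly.
# The pipeline's real template may differ; `last_instruction_prompt` is
# the authoritative source.
INTRO = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"


def format_instruction_prompt(instruction: str) -> str:
    """Wrap a plain instruction in the assumed instruction-tuning template."""
    return f"{INTRO}\n\n{INSTRUCTION_KEY}\n{instruction}\n\n{RESPONSE_KEY}\n"


print(format_instruction_prompt("Who was Dolly the sheep?"))
```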

Remember to detach your pipeline when you are finished to free up resources:

In [None]:
dolly_pipeline.detach()

## Conclusion

In this notebook, we have demonstrated how you can easily run Dolly 2.0 inference on text prompts using Graphcore IPUs. Instruction fine-tuning is a powerful method of turning a base LLM into one more suited for human interactivity, such as a question-answer model or a chatbot.

Although larger instruction fine-tuned LLMs with more world knowledge exist, such as ChatGPT, they are closed-source or subject to non-commercial licensing. This makes Dolly 2.0 a significant milestone as the first of its kind, and more truly open instruction fine-tuned LLMs will no doubt follow quickly.

## Next steps

Check out the full list of [IPU-powered Jupyter Notebooks](https://www.graphcore.ai/ipu-jupyter-notebooks) to see how IPUs perform on other tasks.