# Flan-T5 is (probably) all you need

We believe that GPT-3 scale models are not necessary to perform many NLP workloads in production today. Many of these workloads simply do not need the impressive generative capability that these models offer. Here are a few of the issues with using (sometimes) excessive models:
- They cost your business more, because these models must run on additional GPUs, which is reflected in the bill from your cloud compute provider
- The GPUs required to run these models (such as A100s and H100s) are in short supply and high demand, which increases the cost further
- They have a greater environmental impact, due to the high power required to both operate and cool these additional GPUs

In October 2022 Google published [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416), in which they perform extensive finetuning on a broad collection of tasks across a variety of models (PaLM, T5, U-PaLM). As part of this publication they released Flan-T5 checkpoints, "which achieve strong few-shot performance even compared to much larger models".

In this notebook we will demonstrate how you can use Flan-T5 on readily available IPU-POD4s (Flan-T5-L) and IPU-POD16s (Flan-T5-XL) for common NLP workloads. We shall do this by considering the following questions:
- How good is Flan-T5, really?
- How do I run Flan-T5 on IPUs?
- What can I use Flan-T5 for?
- Why would I pay extra for Flan-T5-XL?

## How good is Flan-T5, really?

Let's start by looking at some numbers from the paper:

<img src="images/t5_vs_flan_t5.png" style="width: 640px;"/>

> Part of table 5 from [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416)

These results are astounding. Notice that:
- Flan-T5 performs ~2x better than T5 in MMLU, BBH & MGSM
- In TyDiQA we even see the emergence of new abilities
- Flan-T5-Large is better than all previous variants of T5 (even XXL)

This establishes Flan-T5 as an entirely different beast from the T5 you may know. Now let's see how Flan-T5 compares to other models on the MMLU benchmark:

| Rank | Model | Average (%) | Parameters (Billions) |
|------|-------|-------------|-----------------------|
| 22 | GPT-3 (finetuned) | 53.9 | 175 |
| 23 | GAL 120B (zero-shot) | 52.6 | 120 |
| 24 | Flan-T5-XL | 52.4 | 3 |
| 25 | Flan-PaLM 8B | 49.3 | 8 |
| 30 | Flan-T5-XL (CoT) | 45.5 | 3 |
| 31 | Flan-T5-L | 45.1 | 0.8 |
| 33 | GPT-3 (few-shot, k=5) | 43.9 | 175 |
| 35 | Flan-PaLM 8B (CoT) | 41.3 | 8 |
| 36 | Flan-T5-L (CoT) | 40.5 | 0.8 |
| 38 | LLaMA 7B (few-shot, k=5) | 35.1 | 7 |

> Part of the MMLU leaderboard from [Papers With Code](https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu) (CoT = Chain of Thought)

This table shows that:
- Flan-T5-L/XL (0.8B and 3B parameters respectively) punch well over an order of magnitude above their weight class, appearing alongside titans like GPT-3 (175B) and Galactica (120B)
- GPT-3 must be finetuned on MMLU to rank at the top of this comparison (few-shot GPT-3 scores 43.9, below Flan-T5-L), whereas MMLU was held out from Flan-T5's finetuning and used only for evaluation
- Flan-T5-XL also outperforms smaller versions of more recent LLMs such as Flan-PaLM 8B and LLaMA 7B while being less than half their size, and even Flan-T5-L beats LLaMA 7B
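To put "punching above its weight" into numbers, here is a small sketch that takes a subset of the non-CoT rows from the table above. For each model it reports how many times more parameters it has than Flan-T5-XL, and how many MMLU points it scores above (positive) or below (negative) Flan-T5-XL. The `leaderboard` list and the `size_and_score_gap` helper are our own illustration and do not come from any library.

```python
# (model, MMLU average %, parameters in billions), copied from the leaderboard table above
leaderboard = [
    ("GPT-3 (finetuned)", 53.9, 175),
    ("GAL 120B (zero-shot)", 52.6, 120),
    ("Flan-T5-XL", 52.4, 3),
    ("Flan-PaLM 8B", 49.3, 8),
    ("Flan-T5-L", 45.1, 0.8),
    ("GPT-3 (few-shot, k=5)", 43.9, 175),
    ("LLaMA 7B (few-shot, k=5)", 35.1, 7),
]


def size_and_score_gap(table, baseline):
    """For every model other than `baseline`, return
    (name, times more parameters than baseline, MMLU points above baseline)."""
    base_score, base_params = next((s, p) for name, s, p in table if name == baseline)
    return [
        (name, round(params / base_params, 1), round(score - base_score, 1))
        for name, score, params in table
        if name != baseline
    ]


for name, size_ratio, score_gap in size_and_score_gap(leaderboard, "Flan-T5-XL"):
    print(f"{name:<26} {size_ratio:>6}x params  {score_gap:+.1f} MMLU points")
```

For example, finetuned GPT-3 has about 58x the parameters of Flan-T5-XL but is only 1.5 MMLU points ahead.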

## How do I run Flan-T5 on IPUs?

Since the Flan-T5 checkpoints are available on HuggingFace, you can use Graphcore's HuggingFace integration ([🤗 Optimum Graphcore](https://github.com/huggingface/optimum-graphcore)) to easily run Flan-T5 with the standard inference pipeline.

If you already have a HuggingFace-based application that you'd like to try on IPUs, then it's as simple as:

```diff
->>> from transformers import pipeline
+>>> from optimum.graphcore import pipeline

->>> text_generator = pipeline("text2text-generation", model="google/flan-t5-large")
+>>> text_generator = pipeline("text2text-generation", model="google/flan-t5-large", ipu_config="Graphcore/t5-large-ipu")
>>> text_generator("Please solve the following equation: x^2 - 9 = 0")
[{'generated_text': '3'}]
```

Now let's define a text generator of our own to use in the rest of this notebook. First, let's make sure your environment has the latest version of [🤗 Optimum Graphcore](https://github.com/huggingface/optimum-graphcore) available:

```python
%pip install "optimum-graphcore>=0.6.1"
```

The location of the executable cache directory and the number of available IPUs can be configured through environment variables or directly in the notebook:

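```python
import os

# Where compiled Poplar executables are cached, so later runs can reuse them instead of recompiling
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "./exe_cache/")
# Number of IPUs: NUM_AVAILABLE_IPU if set, else the digits after "pod" in
# GRAPHCORE_POD_TYPE (e.g. "pod16" -> 16), else default to 4
num_available_ipus = int(os.getenv("NUM_AVAILABLE_IPU") or os.getenv("GRAPHCORE_POD_TYPE", "pod")[3:] or 4)
```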
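The one-line IPU lookup above is compact but easy to misread. It also fails silently: a pod type such as `pod64` produces an IPU count for which the notebook has no configuration. Below is a sketch of the same lookup as a testable function, followed by an explicit check against the two sizes this notebook supports. The names `resolve_num_ipus`, `resolve_model_size` and `SUPPORTED_SIZES` are hypothetical helpers, not part of Optimum Graphcore.

```python
import os

# Same mapping as the pipeline cell: 4 IPUs -> Flan-T5-Large, 16 IPUs -> Flan-T5-XL
SUPPORTED_SIZES = {4: "large", 16: "xl"}


def resolve_num_ipus(env=os.environ):
    """Replicates the lookup above: NUM_AVAILABLE_IPU first, then the digits after
    "pod" in GRAPHCORE_POD_TYPE (e.g. "pod16" -> 16), falling back to 4."""
    return int(env.get("NUM_AVAILABLE_IPU") or env.get("GRAPHCORE_POD_TYPE", "pod")[3:] or 4)


def resolve_model_size(env=os.environ):
    """Map the IPU count to a Flan-T5 size, failing loudly for unsupported counts."""
    num_ipus = resolve_num_ipus(env)
    if num_ipus not in SUPPORTED_SIZES:
        raise ValueError(
            f"{num_ipus} IPUs available; this notebook only has configs for {sorted(SUPPORTED_SIZES)}"
        )
    return SUPPORTED_SIZES[num_ipus]
```

Calling `resolve_model_size()` with no arguments reads the real environment. Passing a plain dict, as in `resolve_model_size({"GRAPHCORE_POD_TYPE": "pod16"})`, makes the logic easy to check without an IPU.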

Next, let's import the `pipeline` from `optimum.graphcore` and create our Flan-T5 pipeline for the appropriate number of IPUs:

```python
from optimum.graphcore import IPUConfig, pipeline

# 4 IPUs (IPU-POD4) run Flan-T5-Large; 16 IPUs (IPU-POD16) run Flan-T5-XL
size = {4: "large", 16: "xl"}[num_available_ipus]

# Pipeline-parallel configuration for each model size:
# - layers_per_ipu: how many encoder/decoder blocks are placed on each IPU
# - matmul_proportion: share of each IPU's memory available as temporary memory for matmuls
cfg = {
    "large": {
        "layers_per_ipu": [12, 12, 12, 12],
        "matmul_proportion": [0.05, 0.05, 0.2, 0.2],
    },
    "xl": {
        "layers_per_ipu": [2, 4, 3, 3, 3, 3, 3, 3, 2, 4, 3, 3, 3, 3, 3, 3],
        "matmul_proportion": [0.1, 0.05, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.05, 0.05, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
    },
}
ipu_config = IPUConfig(
    layers_per_ipu=cfg[size]["layers_per_ipu"],
    matmul_proportion=cfg[size]["matmul_proportion"],
    executable_cache_dir=executable_cache_dir,
)

flan_t5 = pipeline(
    "text2text-generation",
    model=f"google/flan-t5-{size}",
    max_input_length=896,
    ipu_config=ipu_config,
)
```
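Before editing the split above, it can help to sanity-check it. The sketch below confirms three things: every IPU has one `layers_per_ipu` entry and one `matmul_proportion` entry, the layers add up to the whole model, and the encoder/decoder boundary falls between two IPUs. It rests on an assumption inferred from the numbers (both splits sum to 48): `layers_per_ipu` counts encoder and decoder blocks together, and Flan-T5-Large and Flan-T5-XL each have 24 encoder and 24 decoder blocks. `check_split` is a hypothetical helper, not an Optimum Graphcore API.

```python
from itertools import accumulate

# Copied from the `cfg` dictionary in the pipeline cell above
LAYERS_PER_IPU = {
    "large": [12, 12, 12, 12],
    "xl": [2, 4, 3, 3, 3, 3, 3, 3, 2, 4, 3, 3, 3, 3, 3, 3],
}
MATMUL_PROPORTION = {
    "large": [0.05, 0.05, 0.2, 0.2],
    "xl": [0.1, 0.05, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.05, 0.05, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
}

# Assumption: Flan-T5-Large and Flan-T5-XL both have 24 encoder and 24 decoder blocks
NUM_ENCODER_BLOCKS = 24
NUM_DECODER_BLOCKS = 24


def check_split(layers_per_ipu, matmul_proportion):
    """Sanity-check a pipeline split; return the cumulative block count after each IPU."""
    if len(layers_per_ipu) != len(matmul_proportion):
        raise ValueError("layers_per_ipu and matmul_proportion need one entry per IPU")
    total_blocks = NUM_ENCODER_BLOCKS + NUM_DECODER_BLOCKS
    if sum(layers_per_ipu) != total_blocks:
        raise ValueError(f"split places {sum(layers_per_ipu)} blocks, model has {total_blocks}")
    return list(accumulate(layers_per_ipu))


for model_size in ("large", "xl"):
    boundaries = check_split(LAYERS_PER_IPU[model_size], MATMUL_PROPORTION[model_size])
    # True when the last encoder block ends exactly on an IPU boundary,
    # i.e. no IPU holds both encoder and decoder blocks (given the ordering assumed above)
    print(model_size, boundaries, NUM_ENCODER_BLOCKS in boundaries)
```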



Now, just for fun, let's ask it some random questions:

```python
questions = [
    "Solve the following equation for x: x^2 - 9 = 0",
    "At what temperature does nitrogen freeze?",
    "In order to reduce symptoms of asthma such as tightness in the chest, wheezing, and difficulty breathing, what do you recommend?",
    "Which country is home to the tallest mountain in the world?",
]
for out in flan_t5(questions):
    print(out)
```


## What can I use Flan-T5 for?

As we mentioned earlier, Flan-T5 has been finetuned on a broad collection of tasks (more than 1,800, according to the paper). So whatever your task might be, it's worth checking whether Flan-T5 can meet your requirements. Here we demonstrate a few common ones:

### Sentiment Analysis

```python
sentiment_analysis = (
    "Review: It gets too hot, the battery only can last 4 hours. Sentiment: Negative\n"
    "Review: Nice looking phone. Sentiment: Positive\n"
    "Review: Sometimes it freezes and you have to close all the open pages and then reopen where you were. Sentiment: Negative\n"
    "Review: Wasn't that impressed, went back to my old phone. Sentiment:"
)
flan_t5(sentiment_analysis)[0]["generated_text"]
```

```
'Negative'
```
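The sentiment prompt above is a few-shot prompt. It lists labelled examples, one per line, and ends with the query whose label is left blank for the model to fill in. Building that string by hand gets tedious, so here is a minimal sketch of a helper that produces exactly the prompt used above. `few_shot_prompt` and its `input_tag`/`output_tag` parameters are our own naming, not part of Optimum Graphcore or Transformers.

```python
def few_shot_prompt(examples, query, input_tag="Review", output_tag="Sentiment"):
    """Build a few-shot prompt: one "<input_tag>: text <output_tag>: label" line per
    labelled example, then the query with its label left blank for the model."""
    lines = [f"{input_tag}: {text} {output_tag}: {label}" for text, label in examples]
    lines.append(f"{input_tag}: {query} {output_tag}:")
    return "\n".join(lines)


prompt = few_shot_prompt(
    examples=[
        ("It gets too hot, the battery only can last 4 hours.", "Negative"),
        ("Nice looking phone.", "Positive"),
        ("Sometimes it freezes and you have to close all the open pages and then reopen where you were.", "Negative"),
    ],
    query="Wasn't that impressed, went back to my old phone.",
)
print(prompt)
```

The resulting string can then be passed to the pipeline exactly as before, e.g. `flan_t5(prompt)[0]["generated_text"]`.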

### Advanced Named Entity Recognition

```python
advanced_ner = """Microsoft Corporation is a company that makes computer software and video games. Bill Gates and Paul Allen founded the company in 1975
[Company]: Microsoft, [Founded]: 1975, [Founders]: Bill Gates, Paul Allen

Amazon.com, Inc., known as Amazon , is an American online business and cloud computing company. It was founded on July 5, 1994 by Jeff Bezos
[Company]: Amazon, [Founded]: 1994, [Founders]: Jeff Bezos

Apple Inc. is a multinational company that makes personal computers, mobile devices, and software. Apple was started in 1976 by Steve Jobs and Steve Wozniak."""
print(flan_t5(advanced_ner)[0]["generated_text"])
```

```
[Company]: Apple, [Founded]: 1976, [Founders]: Steve Jobs, Steve Wozniak
```
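Since the few-shot examples teach the model a fixed `[Field]: value` format, the generated text can be turned into structured data with a small parser. The sketch below uses a regular expression in which each value runs until the next `, [Field]:` or the end of the text. This lets a value such as the founders list keep its own commas. `parse_fields` is a hypothetical helper, and it assumes the model sticks to the format shown in the prompt.

```python
import re

# A "[Field]: value" pair; the value runs until the next ", [Field]:" or the end of the text
FIELD_PATTERN = re.compile(r"\[(\w+)\]:\s*(.*?)(?=,\s*\[\w+\]:|$)")


def parse_fields(generated_text):
    """Turn output like "[Company]: Apple, [Founded]: 1976" into a dict of strings."""
    return {key: value.strip() for key, value in FIELD_PATTERN.findall(generated_text.strip())}


fields = parse_fields("[Company]: Apple, [Founded]: 1976, [Founders]: Steve Jobs, Steve Wozniak")
print(fields)  # {'Company': 'Apple', 'Founded': '1976', 'Founders': 'Steve Jobs, Steve Wozniak'}
```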


### Question Answering

```python
context = 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24-10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.'
question = "Which NFL team represented the AFC at Super Bowl 50?"
response = flan_t5(f"{context} {question}")
# The correct answer is Denver Broncos
print(response[0]["generated_text"])
```

```
Denver Broncos
```
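To check generated answers against known ones automatically rather than by eye, a common approach is exact match after light normalization. Below is a minimal sketch modelled on the SQuAD-style normalization (lowercase, strip punctuation, drop the articles "a", "an" and "the", collapse whitespace). The function names are our own, and this is not the official evaluation script.

```python
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """SQuAD-style normalization: lowercase, drop punctuation, drop articles, squash whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truth):
    """True when the normalized prediction and reference are identical."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)


print(exact_match("Denver Broncos", "the Denver Broncos."))  # True
print(exact_match("Carolina Panthers", "Denver Broncos"))     # False
```

In the notebook, this would be used as `exact_match(response[0]["generated_text"], "Denver Broncos")`.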


### Intent Classification

```python
intent_classification = """[Text]: I really need to get a gym membership, I'm exhausted.
[Intent]: get gym membership

[Text]: What do I need to make a carbonara?
[Intent]: cook carbonara

[Text]: I need all these documents sorted and filed by Monday.
[Intent]:"""
print(flan_t5([intent_classification])[0]["generated_text"])
```

```
file documents
```


### Summarization

```python
summarization = """
Document: Firstsource Solutions said new staff will be based at its Cardiff Bay site which already employs about 800 people.
The 300 new jobs include sales and customer service roles working in both inbound and outbound departments.
The company's sales vice president Kathryn Chivers said: "Firstsource Solutions is delighted to be able to continue to bring new employment to Cardiff."
Summary: Hundreds of new jobs have been announced for a Cardiff call centre.

Document: The visitors raced into a three-goal first-half lead at Hampden.
Weatherson opened the scoring with an unstoppable 15th-minute free-kick, and he made it 2-0 in the 27th minute.
Matt Flynn made it 3-0 six minutes later with a fine finish.
Queen's pulled a consolation goal back in stoppage time through John Carter.
Summary: Peter Weatherson netted a brace as Annan recorded only their second win in eight matches.

Document: Officers searched properties in the Waterfront Park and Colonsay View areas of the city on Wednesday.
Detectives said three firearms, ammunition and a five-figure sum of money were recovered.
A 26-year-old man who was arrested and charged appeared at Edinburgh Sheriff Court on Thursday.
Summary:
"""
print(flan_t5(summarization)[0]["generated_text"])
```

```
A man has been arrested after a firearm was found in a property in Edinburgh.
```


### Text Classification

```python
text_classification = """Message: When the spaceship landed on Mars, the whole humanity was excited
[Topic]: space

Message: I love playing tennis and golf. I'm practicing twice a week.
[Topic]: sport

Message: Managing a team of sales people is a tough but rewarding job.
[Topic]: business

Message: I am trying to cook chicken with tomatoes.
[Topic]:"""
print(flan_t5(text_classification)[0]["generated_text"])
```

```
cooking
```

### Keyword Extraction

```python
keyword_extraction = """Information Retrieval (IR) is the process of obtaining resources relevant to the information need. For instance, a search query on a web search engine can be an information need. The search engine can return web pages that represent relevant resources.
Keywords: information, search, resources

David Robinson has been in Arizona for the last three months searching for his 24-year-old son, Daniel Robinson, who went missing after leaving a work site in the desert in his Jeep Renegade on June 23.
Keywords: searching, missing, desert

I believe that using a document about a topic that the readers know quite a bit about helps you understand if the resulting keyphrases are of quality.
Keywords: document, understand, keyphrases

Since transformer models have a token limit, you might run into some errors when inputting large documents. In that case, you could consider splitting up your document into paragraphs and mean pooling (taking the average of) the resulting vectors.
Keywords:"""

print(flan_t5(keyword_extraction)[0]["generated_text"])
```

```
document, mean pooling, mean pooling
```
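Generated keyword lists can contain repeats, as in the `document, mean pooling, mean pooling` output of the keyword-extraction prompt. A small post-processing step can split the comma-separated text and drop duplicates while keeping the order in which keywords first appear. `parse_keywords` is a hypothetical helper, and it assumes the model separates keywords with commas as the prompt's examples do.

```python
def parse_keywords(generated_text):
    """Split comma-separated keywords and drop repeats (case-insensitive), keeping first-seen order."""
    seen = set()
    keywords = []
    for keyword in generated_text.split(","):
        keyword = keyword.strip()
        if keyword and keyword.lower() not in seen:
            seen.add(keyword.lower())
            keywords.append(keyword)
    return keywords


print(parse_keywords("document, mean pooling, mean pooling"))  # ['document', 'mean pooling']
```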

## Why would I pay extra for Flan-T5-XL?

As we saw earlier in the results from the paper, Flan-T5-XL is roughly 40% better than Flan-T5-Large on average across the paper's evaluation tasks. Therefore, when deciding whether Flan-T5-XL is worth it for you, ask yourself the following questions:
- Does my data need greater linguistic understanding for the task to be performed?
- Is my task too complicated for a model as small as Flan-T5-Large, but too easy for a model as large as GPT-3?
- Does my task require output sequences long enough that Flan-T5-XL is needed to generate them?

For fun, let's now look at an example of a task where the answer to all of the above questions is yes. Let's say you have a customer service AI that you use to answer basic questions in order to reduce the workload of your customer service personnel. This needs:
- Strong linguistic ability to both parse and generate medium-sized chunks of text
- An LLM that is able to learn well from context, but doesn't have all of human history embedded in its parameters
- The ability to produce multi-sentence responses, but not much longer than that

Looking at the code below, we see some context about Graphcore provided in the input, as well as a primer for a conversational response from the model. As you can see from the example conversation (read it before executing the code block), Flan-T5-XL was able to understand the information provided in the context and give useful, natural answers to the questions it was asked.

```python
from IPython.display import clear_output


class ChatBot:
    def __init__(self, model, context) -> None:
        self.model = model
        self.initial_context = context
        self.context = self.initial_context
        # Speaker tags come from the last two lines of the context,
        # e.g. "[customer]" and "[virtual assistant]"
        self.user, self.persona = [x.split(":")[0] for x in context.split("\n")[-2:]]

    def ask(self, question):
        # End the question with punctuation so it reads like a complete turn
        if not question.endswith((".", "?", "!")):
            question += "."
        # Append the question plus an open line for the assistant to complete
        x = f"{self.context}\n{self.user}: {question}\n{self.persona}: "
        response = self.model(x)[0]["generated_text"]
        # Keep the whole conversation so later answers can refer back to it
        self.context = f"{x}{response}"
        return response

    def session(self):
        """Interactive loop: an empty line ends the session, "reset" clears the history."""
        print("Starting session", flush=True)
        prompt = input()
        while prompt != "":
            if prompt == "reset":
                clear_output()
                print("Starting session", flush=True)
                self.context = self.initial_context
                prompt = input()
                continue  # re-check the new prompt (it may be empty or another "reset")
            print(f"{self.user.title()}: {prompt}", flush=True)
            answer = self.ask(prompt)
            print(f"{self.persona.title()}: {answer}", flush=True)
            prompt = input()
        print("Ending session", flush=True)


context = """This is a conversation between a [customer] and a [virtual assistant].
The [virtual assistant] works at Graphcore. Graphcore is:
- Located in Bristol.
- Invented the Intelligence Processing Unit (IPU). It is purpose built for AI.
- The currently available models are: Classic IPU, Bow IPU, C600
- IPUs are available on: Paperspace, Gcore and Graphcloud

[virtual assistant]: Hello, welcome to Graphcore, how can I help you today?
[customer]: I'd like to ask some questions about your company.
[virtual assistant]: Ok, I can help you with that."""
chatbot = ChatBot(flan_t5, context)
chatbot.session()
```

```
Starting session
[Customer]: What is an IPU?
[Virtual Assistant]: The Intelligence Processing Unit (IPU) is a computer chip that is used to process artificial intelligence.
[Customer]: Who makes it?
[Virtual Assistant]: Graphcore is the manufacturer of the IPU.
[Customer]: Can I use them?
[Virtual Assistant]: Yes, I'm sure you can.
[Customer]: Where?
[Virtual Assistant]: The IPU is available on Paperspace, Gcore and Graphcloud.
Ending session
```
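The chatbot has no memory beyond its prompt. Every turn appends the user's question and an open `<persona>: ` line to the running context, lets the model complete that line, and then folds the reply back into the context. Below is a sketch of a single turn as a standalone function, following the same steps as `ChatBot.ask`. It is run against `stub_model`, a hypothetical stand-in that returns the same output shape as the `flan_t5` pipeline, so the prompt handling can be tried without an IPU.

```python
def chat_turn(model, context, user, persona, question):
    """One turn of the conversation, following the same steps as ChatBot.ask."""
    if not question.endswith((".", "?", "!")):
        question += "."
    # The prompt ends with an open "<persona>: " line for the model to complete
    prompt = f"{context}\n{user}: {question}\n{persona}: "
    response = model(prompt)[0]["generated_text"]
    # The reply becomes part of the context sent on the next turn
    return response, f"{prompt}{response}"


def stub_model(prompt):
    """Hypothetical stand-in for `flan_t5`: same output structure, fixed reply."""
    return [{"generated_text": "Graphcore is the manufacturer of the IPU."}]


history = (
    "[customer]: I'd like to ask some questions about your company.\n"
    "[virtual assistant]: Ok, I can help you with that."
)
reply, history = chat_turn(stub_model, history, "[customer]", "[virtual assistant]", "Who makes the IPU")
print(history)
```

Because the full history is resent on every turn, the prompt grows with each question. In a long session it will eventually exceed the `max_input_length=896` set on the pipeline, so a production bot would likely need some strategy for trimming older turns.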


Finally, detach the model from the IPUs so they are released for other workloads:

```python
flan_t5.model.detachFromDevice()
```