All the LLM programs I've written so far return the complete response at once (up to the `max_new_tokens` limit). This notebook uses a `TextStreamer` to print tokens as they are generated.

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import time

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok(["An increasing sequence: one,"], return_tensors="pt")
streamer = TextStreamer(tok)
t1 = time.perf_counter()
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=200)
t2 = time.perf_counter()
print(f"Took {t2-t1} seconds to execute.")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


An increasing sequence: one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, or twenty-one.

The number of characters in a word is the number of letters in a word.

The number of characters in a word is the number of letters in a word.

The number of characters in a word is the number of letters in a word.

The number of characters in a word is the number of letters in a word.

The number of characters in a word is the number of letters in a word.

The number of characters in a word is the number of letters in a word.

The number of characters in a word is the number of letters in a word.

The number of characters in a word is the number of letters in a word.

The number of characters in a word is the number of letters in
Took 45.662352311002905 seconds to execute.
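
The `perf_counter` bookkeeping around `generate` recurs in every timed cell of this notebook. As a possible tidy-up, a small context manager could wrap it. This is only a sketch, not part of the original notebook: `timed` is a hypothetical helper name, and `time.sleep` stands in for the `model.generate` call so the snippet runs without a model.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label="Block"):
    """Time the enclosed block with time.perf_counter() and print the result."""
    result = {}  # filled in on exit so the caller can read the duration
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"{label} took {result['seconds']:.2f} seconds to execute.")


# Stand-in workload; in the notebook this would be the model.generate(...) call.
with timed("Sleep") as t:
    time.sleep(0.05)
```

In a generation cell the body of the `with` block would simply be `_ = model.generate(**inputs, streamer=streamer, max_new_tokens=200)`.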


In [7]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import time
import torch

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
inputs = tok(["An increasing sequence: one,"], return_tensors="pt").to("cuda")
streamer = TextStreamer(tok)
t1 = time.perf_counter()
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=200)
t2 = time.perf_counter()
print(f"Took {t2-t1} seconds to execute.")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


An increasing sequence: one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, or twenty-one.

The number of characters in a word is the number of letters in a word.

The number of characters in a word is the number of letters in a word.

The number of characters in a word is the number of letters in a word.

The number of characters in a word is the number of letters in a word.

The number of characters in a word is the number of letters in a word.

The number of characters in a word is the number of letters in a word.

The number of characters in a word is the number of letters in a word.

The number of characters in a word is the number of letters in a word.

The number of characters in a word is the number of letters in
Took 2.4244050989982497 seconds to execute.
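
To compare the two runs, the timings can be turned into throughput. The arithmetic below assumes both runs produced the full 200 new tokens (the output is cut off mid-sentence, which suggests the `max_new_tokens` limit was reached). The wall-clock figures also include the time the streamer spends printing, so treat the numbers as rough.

```python
NEW_TOKENS = 200  # assumption: generation ran up to max_new_tokens

cpu_seconds = 45.662352311002905  # CPU run, copied from the output above
gpu_seconds = 2.4244050989982497  # CUDA run, copied from the output above

cpu_tps = NEW_TOKENS / cpu_seconds
gpu_tps = NEW_TOKENS / gpu_seconds
speedup = cpu_seconds / gpu_seconds

print(f"CPU: {cpu_tps:.1f} tokens/s")   # CPU: 4.4 tokens/s
print(f"GPU: {gpu_tps:.1f} tokens/s")   # GPU: 82.5 tokens/s
print(f"Speedup: {speedup:.1f}x")       # Speedup: 18.8x
```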


In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import time
import torch
model_name = "mistralai/Mistral-7B-v0.1"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")


In [2]:
prompt = "Harry Potter was anything but an normal boy. "
inputs = tok([prompt], return_tensors="pt").to("cuda")
streamer = TextStreamer(tok)
t1 = time.perf_counter()
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=200)
t2 = time.perf_counter()
print(f"Took {t2-t1} seconds to execute.")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> Harry Potter was anything but an normal boy.  He was a wizard.  He was a wizard who lived in a cupboard under the stairs.  He was a wizard who lived in a cupboard under the stairs with his aunt and uncle.  He was a wizard who lived in a cupboard under the stairs with his aunt and uncle and his cousin.  He was a wizard who lived in a cupboard under the stairs with his aunt and uncle and his cousin and his aunt’s cat.  He was a wizard who lived in a cupboard under the stairs with his aunt and uncle and his cousin and his aunt’s cat and his aunt’s cat’s kittens.  He was a wizard who lived in a cupboard under the stairs with his aunt and uncle and his cousin and his aunt’s cat and his aunt’s cat’s kittens and his aunt’s cat’s kittens’ kittens.  He was a wizard
Took 6.701926237998123 seconds to execute.
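
The output above keeps extending the same sentence. One rough way to put a number on that is the fraction of n-grams that repeat an earlier n-gram. The helper below is a sketch in plain Python: `repeated_ngram_fraction` is a hypothetical name, and splitting on whitespace is a simplification compared with counting real tokenizer tokens.

```python
from collections import Counter


def repeated_ngram_fraction(text, n=4):
    """Fraction of word n-grams that duplicate an earlier n-gram (0.0 = none)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n words
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


# Short excerpt in the style of the generated text above.
sample = (
    "He was a wizard. He was a wizard who lived in a cupboard. "
    "He was a wizard who lived in a cupboard under the stairs."
)
print(f"{repeated_ngram_fraction(sample):.2f}")
```

A value near 0 means almost every n-gram is new; values closer to 1 mean the text mostly repeats itself.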


The output is highly repetitive: the default greedy decoding keeps extending the same sentence. Let's change the generation strategy to beam search.

In [3]:
prompt = "Harry Potter was anything but an normal boy. "
inputs = tok([prompt], return_tensors="pt").to("cuda")
streamer = TextStreamer(tok)
t1 = time.perf_counter()
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=200, num_beams=5)
t2 = time.perf_counter()
print(f"Took {t2-t1} seconds to execute.")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> Harry Potter was anything but an normal boy. 

ValueError: `streamer` cannot be used with beam search (yet!). Make sure that `num_beams` is set to 1.

Oops! We got an error:
```
ValueError: `streamer` cannot be used with beam search (yet!). Make sure that `num_beams` is set to 1.
```
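
A likely reason for this restriction is that beam search can change its mind: the leading hypothesis after one step need not be a prefix of the leading hypothesis after the next, so a token that was already printed might no longer belong to the final answer. The toy below illustrates the idea. It is not how `transformers` implements beam search; `TOY_PROBS` is a made-up probability table and `beam_search_trace` is a hypothetical helper.

```python
import math

# Made-up next-token probabilities, keyed by the prefix generated so far.
TOY_PROBS = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}


def beam_search_trace(num_beams, steps):
    """Run a tiny beam search and return the best hypothesis after each step."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    best_per_step = []
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, p in TOY_PROBS[prefix].items():
                candidates.append((prefix + (token,), score + math.log(p)))
        # Keep only the num_beams highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
        best_per_step.append(beams[0][0])
    return best_per_step


trace = beam_search_trace(num_beams=2, steps=2)
print(trace)  # [('A',), ('B', 'x')]
```

After the first step the best hypothesis starts with `A` (probability 0.6), but after the second step `B x` (0.4 × 0.9 = 0.36) overtakes both `A x` and `A y` (0.3 each). A streamer that had already printed `A` would have shown a token that is not in the final result.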

So let's drop the streamer and decode the full output once generation has finished.

In [7]:
prompt = "Harry Potter was anything but an normal boy. "
inputs = tok([prompt], return_tensors="pt").to("cuda")
# No streamer here: streaming cannot be combined with beam search.
t1 = time.perf_counter()
beam_output = model.generate(**inputs, max_new_tokens=200, num_beams=5)
t2 = time.perf_counter()
output = tok.decode(beam_output[0], skip_special_tokens=True)
print(f"Output:\n{output}\n")
print(f"Took {t2-t1} seconds to execute.")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Output:
Harry Potter was anything but an normal boy.  He lived in a cupboard under the stairs with his aunt, uncle, and cousin, Dudley Dursley.  Dudley was a spoiled brat who was always picking on Harry.  Harry’s parents died in a car crash when he was a baby.  His aunt and uncle didn’t want to take care of him, so they sent him to live with their sister, Petunia, and her husband, Vernon, and their son, Dudley.

When Harry was eleven years old, he received a letter from Hogwarts School of Witchcraft and Wizardry.  His aunt and uncle didn’t want him to go to the school, so they didn’t let him open the letter.  But Harry was determined to go to Hogwarts, so he ran away from his aunt and uncle’s house and went to Hogwarts on his own.

At Hogwarts

Took 8.732398942000145 seconds to execute.


In [9]:
prompt = "# Hello world!\n It all started "
inputs = tok([prompt], return_tensors="pt").to("cuda")
t1 = time.perf_counter()
beam_output = model.generate(**inputs, max_new_tokens=200, num_beams=5)
t2 = time.perf_counter()
output = tok.decode(beam_output[0], skip_special_tokens=True)
print(f"Output:\n{output}\n")
print(f"Took {t2-t1} seconds to execute.")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Output:
# Hello world!
 It all started 10 years ago, when I was 10 years old. I was in the 5th grade, and I wanted to learn how to code. I started with HTML and CSS, and then moved on to JavaScript. I was hooked! I loved the feeling of creating something from scratch, and seeing it come to life on the screen.

Over the years, I've continued to learn and grow as a developer. I've worked on a variety of projects, from small personal websites to large-scale enterprise applications. I've also had the opportunity to work with a variety of technologies, including React, Node.js, and Python.

Today, I'm a full-stack developer with a passion for building beautiful, user-friendly applications. I'm always looking for new challenges and opportunities to learn and grow. If you're looking for a developer who is passionate about their work and committed to delivering high-quality results, I'

Took 8.730228677999548 seconds to execute.
