# Lecture 01: Inference with Hugging Face

## Pipelines

Let's start with a classic language model, OpenAI's GPT-2. 

Hugging Face has a simple way to generate text with a model like this: you create a text generation `pipeline`, pass it some text, and you get a completion out the other end. 

In [2]:
from transformers import pipeline
from IPython.display import display, Markdown

In [3]:
generator = pipeline('text-generation', model='gpt2')

GPT2LMHeadModel LOAD REPORT from: gpt2
Key                  | Status     |  | 
---------------------+------------+--+-
h.{0...11}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


We can ignore the warning above. Let's generate some text!

In [7]:
generated_sequence = generator(
    "Welcome to INFO 4940 at Cornell! I'm your instructor,",
    max_new_tokens=50,
    num_return_sequences=1,
)

display(Markdown(generated_sequence[0]["generated_text"]))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=50) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Welcome to INFO 4940 at Cornell! I'm your instructor, and I'm always here to help you learn!

Check us out at our Facebook Page!!!

Facebook Page @ Cornell University

Facebook Page @ Cornell University

Facebook Page @ Cornell University

Two things to notice here:
- GPT-2 is not a chatbot: it just *completes* the text we pass it. As we saw in the first day's lecture, it keeps "filling in the blank," one token at a time, starting from the text we give it.
- The generated text can be pretty weird. Why do you think that is? 
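
To make "continually filling in the blank" concrete, here is a minimal sketch of that loop. The `NEXT_WORD` table and the `complete` helper are made up for illustration: the table stands in for the model and only looks at the last word, which is a big simplification of what GPT-2 actually does.

```python
# Toy stand-in for a language model: maps the last word to one "most likely"
# next word. This table is invented for illustration only.
NEXT_WORD = {
    "Welcome": "to",
    "to": "class",
    "class": "!",
    "!": "Welcome",
}

def complete(prompt_words, max_new_words):
    """Repeatedly append the predicted next word to the sequence."""
    words = list(prompt_words)
    for _ in range(max_new_words):
        last = words[-1]
        if last not in NEXT_WORD:       # the toy table has no prediction
            break
        words.append(NEXT_WORD[last])   # the prediction becomes part of the next input
    return words

completion = complete(["Welcome"], max_new_words=6)
print(" ".join(completion))
```

Each new word depends only on what came before it, and every output is fed back in as input for the next step. The real model conditions on the whole preceding sequence, not just the last word.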

## Loading models and tokenizers

For now, let's imagine that most of our model is a black box: we don't know how it works on the inside, but we can give it something and observe what it gives us in return. The `pipeline` approach simplifies inference tremendously, but it's ultimately not very useful if we want to see how our model generates predictions based on our inputs. 

Let's go a tiny bit deeper, while still hand-waving away most of the model.

In [8]:
import torch 
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

Language models don't actually ingest strings, but rather *tokens*. These tokens are indices in the model's *vocabulary*. In other words, we need a way to convert our input (a string) into tokens (a corresponding list of integers).
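
As a rough mental model (this is not GPT-2's actual tokenizer), you can think of the vocabulary as a lookup table from token strings to integers. The `toy_vocab` and `toy_encode` names below are hypothetical, and splitting on whitespace is an assumption made only to keep the sketch short.

```python
# A made-up four-entry vocabulary: token string -> index
toy_vocab = {"Hello": 0, "world": 1, "!": 2, "<unk>": 3}

def toy_encode(text):
    """Map each whitespace-separated piece to its index, or to <unk> if missing."""
    return [toy_vocab.get(piece, toy_vocab["<unk>"]) for piece in text.split()]

toy_ids = toy_encode("Hello world !")
print(toy_ids)
```

The model only ever sees the list of integers, never the original string.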

Every text generation model on Hugging Face has its own tokenizer which handles this process for you. We will talk about how tokenizers actually work later, but for now, you should familiarize yourself with a process that looks something like this:

In [9]:
model_name_or_path = "gpt2"

# Download and load our model from Hugging Face
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)

# Download and load its tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

Unlike the `pipeline` approach earlier, we can now inspect our model directly. Let's do that!

In [10]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

This is probably all pretty meaningless at this point, and that is totally fine. What's important here is that the model is loaded on our device and we can inspect, modify, add, or remove any part of it. 

Remember how big, ChatGPT-scale models can have trillions of parameters? Let's see how many parameters GPT-2 has.

In [11]:
num_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters for {model_name_or_path}: {num_params:,}")

Total parameters for gpt2: 124,439,808


Compared to something like ChatGPT, GPT-2 is **really** small. At roughly 124M parameters, it's about one fifth the size of the smallest class of LLMs nowadays (around 0.6B parameters).
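
If you're curious where that parameter count comes from, it can be rebuilt from the shapes in the printed model. This is a back-of-the-envelope sketch under a few assumptions: each `Conv1D` layer has a weight matrix plus a bias vector, each `LayerNorm` has a scale and a shift vector, and `lm_head` shares its weights with `wte` so it adds nothing new. The helper names are made up for this example.

```python
# Shapes read off the printed GPT2LMHeadModel above.
vocab_size, n_positions, d_model, n_layers = 50257, 1024, 768, 12

def linear_params(n_in, n_out):
    """Weight matrix plus bias vector (assumed for each Conv1D layer)."""
    return n_in * n_out + n_out

def layernorm_params(d):
    """Scale and shift vectors (elementwise_affine=True)."""
    return 2 * d

embeddings = vocab_size * d_model + n_positions * d_model   # wte + wpe

per_block = (
    layernorm_params(d_model)                # ln_1
    + linear_params(d_model, 3 * d_model)    # attn.c_attn (nf=2304)
    + linear_params(d_model, d_model)        # attn.c_proj
    + layernorm_params(d_model)              # ln_2
    + linear_params(d_model, 4 * d_model)    # mlp.c_fc (nf=3072)
    + linear_params(4 * d_model, d_model)    # mlp.c_proj
)

# lm_head is assumed to reuse the wte weights, so it is not counted again.
total = embeddings + n_layers * per_block + layernorm_params(d_model)  # last term is ln_f
print(f"{total:,}")
```

Under those assumptions the total matches the 124,439,808 reported by `model.parameters()`. Notice that the two embedding tables alone account for almost a third of it.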

Let's return to our tokenizer. What does it do? 

In [12]:
tokenizer("Hello world!")

{'input_ids': [15496, 995, 0], 'attention_mask': [1, 1, 1]}

Like we said above, the tokenizer maps a string to a sequence of integers. These integers correspond to tokens in the model's vocabulary. We can inspect the vocabulary like so: 

In [13]:
tokenizer.vocab

{'Ġhorizontal': 16021,
 'ĠSega': 29490,
 'ottesville': 23806,
 'wing': 5469,
 'orses': 11836,
 'ĠAssist': 43627,
 'ĠJagu': 21117,
 'ĠOpera': 26049,
 'pent': 16923,
 'ĠSing': 5573,
 'Ġbranches': 13737,
 'innacle': 37087,
 'Ġnause': 24480,
 'Flash': 30670,
 'ĠViol': 13085,
 'Ġperceptions': 23574,
 'iot': 5151,
 'Ġaccol': 45667,
 'ĠSamoa': 43663,
 'internal': 32538,
 'Ġbroadband': 18729,
 'andre': 49078,
 'ocre': 27945,
 'ĠNay': 38808,
 'Ġ8': 807,
 'OUT': 12425,
 'utra': 35076,
 'xxx': 31811,
 'ĠRichie': 49209,
 'erk': 9587,
 'Ġspeak': 2740,
 'xus': 40832,
 'Ġlogically': 34193,
 '.</': 25970,
 'Ġgreatly': 9257,
 '94': 5824,
 'Ġlaun': 2698,
 'ĠMagnus': 35451,
 'ĠFrames': 36291,
 'Ġnavigate': 16500,
 'Ġsupper': 43743,
 'Ġpromised': 8072,
 'ĠProfit': 42886,
 'ĠIPM': 50200,
 'ĠKiller': 19900,
 'mc': 23209,
 'rent': 1156,
 'Ġimprisonment': 16510,
 'ĠGreenland': 30155,
 'ĠPresent': 21662,
 'Ġpivotal': 28992,
 'Ġbuilders': 31606,
 'Ġbillions': 13188,
 'Ġlumber': 34840,
 'ĠKarachi': 46154,
 'geme

In [14]:
# How big is it? 
len(tokenizer.vocab)

50257

You might be wondering about the "Ġ" at the beginning of some of the tokens. It's not important right now, but it marks a token that starts with a space. Models like GPT-2 use a *subword tokenizer*, which can split a single word into multiple tokens. We'll talk more about this later in the course!

Here's another example of what these input IDs look like along with the string they correspond to:

In [15]:
example_text = "INFO 4940: How LLMs Work"
example_input_ids = tokenizer(example_text)["input_ids"]
example_token_strs = tokenizer.convert_ids_to_tokens(example_input_ids)

# Use `token_id` rather than `id` so we don't shadow Python's built-in `id`
[f"{token_id}: {tok}" for token_id, tok in zip(example_input_ids, example_token_strs)]

['10778: INFO',
 '5125: Ġ49',
 '1821: 40',
 '25: :',
 '1374: ĠHow',
 '27140: ĠLL',
 '10128: Ms',
 '5521: ĠWork']
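
To convince yourself that these pieces really are the original string, you can glue them back together. The sketch below copies the token strings from the output above and swaps each "Ġ" for a space. Treat the string replacement as a shortcut that happens to work for this plain-ASCII example; in practice you would let the tokenizer do the decoding (for example with `tokenizer.decode`).

```python
# Token strings copied from the output above.
token_strs = ["INFO", "Ġ49", "40", ":", "ĠHow", "ĠLL", "Ms", "ĠWork"]

# Concatenate the pieces, then turn each "Ġ" marker back into a space.
reconstructed = "".join(token_strs).replace("Ġ", " ")
print(reconstructed)
```

This prints `INFO 4940: How LLMs Work`. Note how "4940" and "LLMs" were each split into two tokens.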

## Inference

Enough about tokenizers for now--let's use our model!

In [None]:
example_text = "I visited the Space Needle, Pike Place Market, and Pioneer Square on my trip to"
# We need tensors, not lists: our model is actually a PyTorch model under the hood.
example_encoding = tokenizer(example_text, return_tensors="pt")


model.eval()                 # We are not training our model, 
                             # so we put it in "eval" mode.

with torch.no_grad():        # Don't need gradients (more on this later)
    example_logits = model(**example_encoding)["logits"].squeeze()

Ok, what's going on here? 

1. We encode our text using the `tokenizer`, specifying that we want the output as tensors. This gives us a dictionary of tensors.
2. We make sure the `model` is in `eval` mode, and that we aren't calculating gradients. 
3. We pass our encodings to the model and get logits back out.

What are logits in this context? It might help to look at the shape of the tensor: 

In [None]:
example_logits.size()

Does that second number look familiar? Here's a hint:

In [None]:
len(tokenizer.vocab)

What about that first number? 

In [None]:
example_encoding["input_ids"].squeeze().size()

For every token in our input sequence, we get a vector the size of our vocabulary. In other words, at each position we get a score for every token in the vocabulary as a candidate for the *next* token, conditioned on the token at that position and all the tokens before it.
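
Here's a small sketch of how those rows line up with the input. It reuses the token strings from the earlier tokenizer example (not the sentence above), and the `"<not yet known>"` label is just a placeholder for whatever would come after the end of the input.

```python
# Token strings from the earlier tokenizer example.
tokens = ["INFO", "Ġ49", "40", ":", "ĠHow", "ĠLL", "Ms", "ĠWork"]

pairs = []
for position in range(len(tokens)):
    context = tokens[: position + 1]          # everything the model can see at this row
    if position + 1 < len(tokens):
        target = tokens[position + 1]         # the token this row is trying to predict
    else:
        target = "<not yet known>"            # the last row predicts beyond the input
    pairs.append((context, target))
    print(f"row {position}: {len(context)} token(s) of context -> predicting {target!r}")
```

The last row is the interesting one for generation: it is the only row whose "answer" isn't already in the input.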

Remember what we said in the first lecture: a language model defines a probability distribution over sequences. That's exactly what the model is doing!
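
As a reminder, here is the standard chain-rule way of writing that down (a general identity, not something specific to this notebook). For a sequence of tokens $x_1, \dots, x_T$:

$$
P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})
$$

Each factor on the right is a distribution over the vocabulary for one position, which is what each row of the model's output is meant to represent.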

However, there's a small catch:

In [None]:
print(f"Logit sum: {example_logits.sum(dim=1)}")
print(f"Logit max: {example_logits.max(dim=1)[0]}")
print(f"Logit min: {example_logits.min(dim=1)[0]}")

We need a way to convert these raw outputs into a usable probability distribution (every element between 0 and 1, all elements sum to 1). Fortunately, there's a function that does just that: [Softmax](https://docs.pytorch.org/docs/stable/generated/torch.nn.Softmax.html).

In [None]:
example_probs = F.softmax(example_logits, dim=1)

example_probs.sum(dim=1)

Great!
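
For intuition, here is softmax written by hand on a tiny example. The three logits are made up, and the `softmax` function below is a plain-Python sketch of the usual definition (exponentiate, then divide by the sum), not PyTorch's implementation.

```python
import math

def softmax(logits):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    shift = max(logits)                            # subtracting the max avoids overflow
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, -1.0]      # made-up scores for a 3-token vocabulary
toy_probs = softmax(toy_logits)

print([round(p, 3) for p in toy_probs])
print(sum(toy_probs))
```

The ordering of the scores is preserved: the largest logit gets the largest probability, and negative logits still end up with a small positive probability.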

Now, let's recall what our input sequence was: 

In [None]:
example_text

Of course, we could generate more text here, but let's simplify things a little. Let's compare the probabilities computed for a few words given this input sequence. 

First, we need to get the index of each word's token in the model's vocabulary:

In [None]:
cities = ["Seattle", "Portland", "Chicago", "Houston"]
city_indices = [tokenizer.vocab["Ġ"+c] for c in cities]
city_indices

We want to get probabilities given the whole sequence, so we're interested in the last row: the prediction the model makes after reading the final input token. Let's use our vocab indices to get those predictions!

In [None]:
for city, prob in zip(cities, example_probs[-1, city_indices]): 
    print(f"{city}: {prob:.2f}")

As we'd expect, Seattle is the most likely city out of the limited list we presented. However, its absolute probability seems relatively low--let's look at the most likely tokens.

In [None]:
# Sort probabilities and take the top 10
top_predictions = example_probs[-1].argsort(descending=True)[:10]

for idx in top_predictions:
    # idx is a 0-dim tensor, so convert it to a plain int before the lookup
    token = tokenizer.convert_ids_to_tokens(idx.item())
    prob = example_probs[-1, idx]
    print(f"{token.replace('Ġ', '')}: {prob:.2f}")
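
If the `argsort` step feels opaque, here is the same idea in plain Python on an invented five-token distribution. The numbers in `tiny_probs` are made up and are not model output.

```python
# Invented probabilities over a 5-token vocabulary (they sum to 1).
tiny_probs = [0.05, 0.40, 0.10, 0.25, 0.20]

# Sort the *indices* by their probability, highest first, and keep the first k.
k = 3
top_k = sorted(range(len(tiny_probs)), key=lambda i: tiny_probs[i], reverse=True)[:k]

for i in top_k:
    print(f"token index {i}: {tiny_probs[i]:.2f}")
```

The key detail is that we sort indices rather than values, because the index is what we need to look the token up in the vocabulary.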

While this is a toy example, I hope you can see how powerful this is for applications beyond just text generation. I also hope that this demystifies what's going on under the hood for an LLM: even models several orders of magnitude larger than this one are ultimately just producing logits from sequences of integers.