# LLM Inference Example

| Model | Result | Status |
| ----- | ------ | --- |
| Mixtral Instruct | The GPU memory requirement exceeds Kaggle's limit. | 💥 Kaboom!! |
| Pegasus | All Okay! | ✅ Good |
| GPT-2 | At least there are no errors. 😐 | ✅ Good |
| [LLaMA 3.2 (latest at the time of writing)](https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf) | At least there are no errors. 😐 | ✅ Good |
| [FLAN-T5](https://huggingface.co/collections/google/flan-t5-release-65005c39e3201fff885e22fb) | At least there are no errors. 😐 | ✅ Good |

<!--
✅ Good
💥 Kaboom!!
⏳ In Progress...
🤔 Problematic
-->

In [12]:
# Quote the version specifier so the shell does not treat `>` as a redirect.
# !pip install "bitsandbytes>=0.39.0" sacremoses==0.0.53 accelerate -q

In [1]:
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret('HUGGINGFACE_TOKEN'))

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [2]:
import torch

if torch.cuda.is_available():
    device = 'cuda:0'
else:
    device = 'cpu'

print('Device:', device)
# get_device_name() only works for CUDA devices, so skip it on CPU.
if device != 'cpu':
    print('Device name:', torch.cuda.get_device_name(device))

Device: cuda:0
Device name: Tesla P100-PCIE-16GB


In [24]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PegasusForConditionalGeneration,
    PegasusTokenizer,
    T5ForConditionalGeneration,
    T5Tokenizer,
    pipeline,
)

## FLAN-T5

In [28]:
model_id = "google/flan-t5-large"

tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    device_map="auto",
)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [57]:
input_text = "Who are you?"

input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

<pad> adam</s>
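
The decoded output above still contains T5's special tokens `<pad>` and `</s>`. Passing `skip_special_tokens=True` to `tokenizer.decode`, as the Pegasus and Mixtral cells in this notebook do, drops them. The snippet below is a minimal plain-Python sketch of the visible effect only. `strip_special_tokens` is a hypothetical helper, not part of `transformers`, and it assumes the special tokens are exactly the two strings seen above; a real tokenizer filters by token ID rather than by string.

```python
# Hypothetical helper: mimics the visible effect of skip_special_tokens=True
# on an already-decoded string. Real tokenizers filter by token ID instead.
SPECIAL_TOKENS = ("<pad>", "</s>")  # assumed set, taken from the output above


def strip_special_tokens(text: str, special_tokens=SPECIAL_TOKENS) -> str:
    for token in special_tokens:
        text = text.replace(token, "")
    # Collapse the whitespace left behind by the removed tokens.
    return " ".join(text.split())


print(strip_special_tokens("<pad> adam</s>"))  # prints "adam"
```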


## LLaMA 3.2

In [10]:
model_id = "meta-llama/Llama-3.2-3B"

pipe = pipeline(
    "text-generation",
    model=model_id, 
    model_kwargs={
        "torch_dtype": torch.bfloat16,
    },
    device_map="auto",
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [23]:
%%time
pipe_params = {
    'max_new_tokens': 100,
    'pad_token_id': pipe.tokenizer.eos_token_id,
}

prompt = '''Saigon is'''

gen_text = pipe(prompt, **pipe_params)[0]['generated_text']
print(f'[TEXT STARTS]\n{gen_text}\n[TEXT ENDS]')

[TEXT STARTS]
Saigon is a city of contrasts and contradictions. The city is full of contrasts. The contrasts are apparent in its people. Some are well dressed and others are not. Some are rich and others are poor. Some are educated and others are not. Some are religious and others are not. Some are conservative and others are liberal. Some are friendly and others are not. Some are polite and others are not. Some are friendly and others are not. Some are friendly and others are not. Some are friendly and others
[TEXT ENDS]
CPU times: user 3.71 s, sys: 19.8 ms, total: 3.73 s
Wall time: 3.73 s
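
The sample above falls into a loop: the same sentence is repeated several times before the token budget runs out. `generate` accepts options such as `repetition_penalty` and `no_repeat_ngram_size` that are meant to discourage this; they were not tried in this notebook. To quantify the looping in a generated string, a rough plain-Python check might look like the sketch below. `repeated_sentences` is a hypothetical helper, and splitting on periods is a deliberate simplification.

```python
from collections import Counter


def repeated_sentences(text: str, min_count: int = 2) -> dict:
    """Return sentences that occur at least `min_count` times in `text`."""
    # Naive split on periods; good enough for a quick look at looping output.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    counts = Counter(sentences)
    return {s: n for s, n in counts.items() if n >= min_count}


# A short excerpt of the generated text above.
sample = (
    "Some are polite and others are not. Some are friendly and others are not. "
    "Some are friendly and others are not. Some are friendly and others are not."
)
print(repeated_sentences(sample))
```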


## GPT-2

In [None]:
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [44]:
%%time

prompt = "This is how to fry an egg:"

input_ids = tokenizer(prompt, return_tensors="pt").to(device).input_ids

gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.9,
    max_length=200,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print('[GENERATED TEXT BEGINS]')
print(gen_text)
print('[GENERATED TEXT ENDS]')

[GENERATED TEXT BEGINS]
This is how to fry an egg:

1 egg = 3% flour.

1 slice of butter = 1 tsp salt.

1 egg = 1/2 tsp cinnamon.

Serve warm. It's great to try this recipe on hot days.

*The flour will absorb the moisture you've absorbed through the cooking process, so be sure to cook to a lower temperature (it will not give off as much sugar or carbs as is recommended, and you'll likely burn your hand).

For more egg-flourless recipes, check out these popular low carb eggs:

Here's a post about how to do your own low carb egg filling.

For egg replacement, you can use a traditional non-dairy yogurt or milk substitute, like cottage cheese or almond milk or yogurt cheese.

For the "low" egg replacement, you can use a low-fat sour cream substitute, a sour cream derivative or a regular low
[GENERATED TEXT ENDS]
CPU times: user 1.67 s, sys: 0 ns, total: 1.67 s
Wall time: 1.67 s
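
With `do_sample=True`, the next token is drawn at random from the model's probability distribution instead of always taking the most likely one, and `temperature` rescales the logits before they are turned into probabilities. The sketch below illustrates that rescaling in plain Python. The logits are made-up numbers for three candidate tokens, not values from GPT-2, and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 0.9, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution and make unlikely continuations more frequent.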


## Pegasus

In [16]:
model_name = "google/pegasus-xsum"

tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)

pytorch_model.bin:  34%|###3      | 765M/2.28G [00:00<?, ?B/s]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


generation_config.json:   0%|          | 0.00/259 [00:00<?, ?B/s]

In [17]:
src_text = [
    """
    PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions.
    The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be
    affected by the shutoffs which were expected to last through at least midday tomorrow.
    """,
]

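# Collapse newlines and indentation from the triple-quoted string into single spaces.
src_text = [' '.join(s.split()) for s in src_text]
batch = tokenizer(src_text, truncation=True, padding="longest", return_tensors="pt").to(device)
# Pegasus-XSum produces a summary, so name the output accordingly.
summary_ids = model.generate(**batch)

In [21]:
tgt_text = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)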
# assert (
#     tgt_text[0]
#     == "California's largest electricity provider has turned off power to hundreds of thousands of customers."
# )

print(len(tgt_text))
print(tgt_text[0])

1
California's largest electricity provider has turned off power to hundreds of thousands of customers.


## Mixtral Instruct

In [5]:
%%time

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

model_params = {
    'torch_dtype': torch.float16,
    # 'use_flash_attention_2': True,
    'device_map': 'auto',
}

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, **model_params)

tokenizer_config.json:   0%|          | 0.00/2.10k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/720 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/92.7k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/19 [00:00<?, ?it/s]

model-00001-of-00019.safetensors:   0%|          | 0.00/4.89G [00:00<?, ?B/s]

model-00002-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00003-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00004-of-00019.safetensors:   0%|          | 0.00/4.90G [00:00<?, ?B/s]

model-00005-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00006-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00007-of-00019.safetensors:   0%|          | 0.00/4.90G [00:00<?, ?B/s]

model-00008-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00009-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00010-of-00019.safetensors:   0%|          | 0.00/4.90G [00:00<?, ?B/s]

model-00011-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00012-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00013-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00014-of-00019.safetensors:   0%|          | 0.00/4.90G [00:00<?, ?B/s]

model-00015-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00016-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00017-of-00019.safetensors:   0%|          | 0.00/4.90G [00:00<?, ?B/s]

model-00018-of-00019.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00019-of-00019.safetensors:   0%|          | 0.00/4.22G [00:00<?, ?B/s]



Loading checkpoint shards:   0%|          | 0/19 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

CPU times: user 2min 37s, sys: 3min 10s, total: 5min 48s
Wall time: 40min 23s


💥💥💥 The Kaggle kernel crashed after the previous cell, so Mixtral Instruct is ruled out.
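
A quick back-of-the-envelope check makes the mismatch concrete. The sketch below sums the shard sizes from the download log above and compares the total with the 16 GB of the Tesla P100 reported earlier. It assumes the weights take roughly as much memory once loaded as the files take on disk, which is only an approximation.

```python
# Shard sizes in GB, copied from the download log above (19 files).
shard_sizes_gb = [
    4.89, 4.98, 4.98, 4.90, 4.98, 4.98, 4.90, 4.98, 4.98, 4.90,
    4.98, 4.98, 4.98, 4.90, 4.98, 4.98, 4.90, 4.98, 4.22,
]
gpu_memory_gb = 16  # Tesla P100-PCIE-16GB, as reported earlier

total_gb = sum(shard_sizes_gb)
print(f"Checkpoint size: {total_gb:.1f} GB")
print(f"Ratio to GPU memory: {total_gb / gpu_memory_gb:.1f}x")
```

The checkpoint comes to about 93.4 GB, close to six times the GPU memory, so `device_map='auto'` has to push most of the model to CPU RAM or disk.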

In [7]:
model.to(device)

RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.
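
This error appears because `device_map='auto'` already placed some modules on CPU or disk, and a model in that state cannot be moved with `.to()`. Models loaded with a device map expose the placement as `model.hf_device_map`, a dict from module name to device. A minimal sketch of a guard that inspects such a dict before attempting the move is shown below; `has_offloaded_modules` is a hypothetical helper and the example map is invented for illustration.

```python
def has_offloaded_modules(device_map: dict) -> bool:
    """True if any module in the map is placed on CPU or disk."""
    return any(str(target) in ("cpu", "disk") for target in device_map.values())


# Hypothetical placement, shaped like `model.hf_device_map`.
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": "cpu",
    "lm_head": "disk",
}

if has_offloaded_modules(example_map):
    print("Some modules are offloaded; skip model.to(device).")
else:
    print("No CPU or disk offload detected.")
```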

In [None]:
%%time

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

print('Tokenizing input..')
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)

print('Generating output token IDs..')
outputs = model.generate(inputs, max_new_tokens=20)

print('✅ All Done')

In [None]:
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
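
For orientation, `apply_chat_template` turns the list of messages into a single prompt in which user turns are wrapped in `[INST] ... [/INST]` markers. The sketch below approximates that layout in plain Python. `render_instruct_prompt` is a hypothetical helper, the assistant reply is shortened, and the exact whitespace and special tokens are an assumption; the tokenizer's own template is the authority.

```python
def render_instruct_prompt(messages):
    """Approximate the [INST] ... [/INST] layout used by Mixtral Instruct."""
    parts = ["<s>"]
    for message in messages:
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        elif message["role"] == "assistant":
            # A finished assistant turn is closed with the end-of-sequence token.
            parts.append(f" {message['content']}</s>")
        else:
            raise ValueError(f"Unsupported role: {message['role']}")
    return "".join(parts)


messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Fresh lemon juice."},
    {"role": "user", "content": "Do you have mayonnaise recipes?"},
]
print(render_instruct_prompt(messages))
```

The prompt ends with an open `[/INST]`, which is where the model is expected to continue with the next assistant turn.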