# HuggingFace LLMs

```
Author: Han-Elliot Phan
Email: hanelliotphan@gmail.com

Last update: February 10, 2025
```

Notes from experimenting with Hugging Face pipelines, tokenizers, and LLMs that could be useful for the QuickMeet project.

--------------------------------------------------------------------------------

## 1. HuggingFace Pipelines

In [1]:
!nvidia-smi

Tue Feb 11 05:01:13 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   61C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
!pip install -U datasets transformers sentencepiece soundfile accelerate bitsandbytes

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting transformers
  Downloading transformers-4.48.3-py3-none-any.whl.metadata (44 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.45.2-py3-none-manylinux_2_24_x86_64.whl.metadata (5.8 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  

In [3]:
# Import libraries

import torch
import soundfile as sf

from transformers import pipeline
from datasets import load_dataset
from IPython.display import Audio

In [4]:
# Text Sentiment Analysis
classifier = pipeline("sentiment-analysis", device=0)
result = classifier("I will succeed in creating this meaningful QuickMeet project")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9998481273651123}]
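The warning above recommends naming the model and revision explicitly. A minimal sketch of that, assuming the same default checkpoint and revision that the warning reports; the constant names and the helper `build_sentiment_classifier` are made up for illustration:

```python
# Model id and revision copied from the warning printed by the cell above.
SENTIMENT_MODEL = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
SENTIMENT_REVISION = "714eb0f"


def build_sentiment_classifier(device=0):
    """Create the sentiment pipeline with a pinned checkpoint (hypothetical helper)."""
    # Imported lazily so the constants can be reused without loading transformers.
    from transformers import pipeline

    return pipeline(
        "sentiment-analysis",
        model=SENTIMENT_MODEL,
        revision=SENTIMENT_REVISION,
        device=device,
    )
```

Calling `build_sentiment_classifier()` should then load the same checkpoint on every run instead of whatever the library default happens to be.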


In [5]:
# Named Entity Recognition
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple", device=0)
result = ner("Caitlyn Kiramman is the finest sheriff of Piltover, and her girlfriend Vi is from Zaun")
print(result)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/998 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.33G [00:00<?, ?B/s]

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/60.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Device set to use cuda:0


[{'entity_group': 'PER', 'score': 0.9906481, 'word': 'Caitlyn Kiramman', 'start': 0, 'end': 16}, {'entity_group': 'LOC', 'score': 0.6883328, 'word': 'Piltover', 'start': 42, 'end': 50}, {'entity_group': 'PER', 'score': 0.9948971, 'word': 'Vi', 'start': 71, 'end': 73}, {'entity_group': 'LOC', 'score': 0.87718123, 'word': 'Zaun', 'start': 82, 'end': 86}]
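The NER result is a plain list of dicts, so it can be filtered with ordinary Python. A small sketch that keeps only confident entities; the entities are copied from the output above with rounded scores, and the 0.8 threshold is an arbitrary assumption, not a recommended value:

```python
# Entities copied from the NER output above (scores rounded).
entities = [
    {"entity_group": "PER", "score": 0.9906, "word": "Caitlyn Kiramman", "start": 0, "end": 16},
    {"entity_group": "LOC", "score": 0.6883, "word": "Piltover", "start": 42, "end": 50},
    {"entity_group": "PER", "score": 0.9949, "word": "Vi", "start": 71, "end": 73},
    {"entity_group": "LOC", "score": 0.8772, "word": "Zaun", "start": 82, "end": 86},
]


def confident_entities(entities, min_score=0.8):
    """Return (word, label) pairs whose score is at least min_score."""
    return [(e["word"], e["entity_group"]) for e in entities if e["score"] >= min_score]


kept = confident_entities(entities)
print(kept)
```

With this threshold "Piltover" is dropped, since the model was only about 69% sure it is a location.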




In [6]:
# Q&A with Context
qa = pipeline("question-answering", device=0)
result = qa(question="Who is the Piltover's finest?", context="Caitlyn Kiramman is the finest sheriff of Piltover, and her girlfriend Vi is from Zaun")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/473 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/261M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Device set to use cuda:0


{'score': 0.995520830154419, 'start': 0, 'end': 16, 'answer': 'Caitlyn Kiramman'}
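The `start` and `end` fields in the result are character offsets into the context string, so the answer can be recovered by slicing. A quick check using the values printed above:

```python
context = "Caitlyn Kiramman is the finest sheriff of Piltover, and her girlfriend Vi is from Zaun"

# Result copied from the question-answering output above.
result = {"score": 0.995520830154419, "start": 0, "end": 16, "answer": "Caitlyn Kiramman"}

# Slicing the context with the returned offsets gives back the answer span.
span = context[result["start"]:result["end"]]
print(span)
```

This is handy when the answer needs to be highlighted in the original transcript rather than just printed.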


In [7]:
# Text Summarization
# Text source: https://en.wikipedia.org/wiki/Arcane_(TV_series)
summarizer = pipeline("summarization", device=0)
text = "Arcane (titled onscreen as Arcane: League of Legends) is a steampunk action-adventure television series created by Christian Linke and Alex Yee. It was produced by the French animation studio Fortiche under the supervision of Riot Games, and distributed by Netflix. Set in Riot's League of Legends universe, it primarily focuses on sisters Violet / 'Vi' (Hailee Steinfeld) and Powder / Jinx (Ella Purnell) as they become embroiled in a conflict between their native underbelly of Zaun and the city of Piltover. First announced at the League of Legends tenth anniversary celebration in 2019, the series' first season was released between November 6 and 20, 2021, and a second and final season was released between November 9 and 23, 2024."
result = summarizer(text)
print(result)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Device set to use cuda:0


[{'summary_text': " Set in Riot's League of Legends universe, it primarily focuses on sisters Violet / 'Vi' and Powder / Jinx . First season was released between November 6 and 20, 2021, and a second and final season released November 9 and 23, 2024 . First announced in 2019 at the tenth anniversary celebration of the game's tenth anniversary ."}]


In [8]:
# Text Translation
translator = pipeline("translation_en_to_fr", device=0)
result = translator("Caitlyn Kiramman loves Vi the most in her life")
print(result[0]['translation_text'])

No model was supplied, defaulted to google-t5/t5-base and revision a9723ea (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

Device set to use cuda:0


Caitlyn Kiramman aime Vi le plus dans sa vie


In [9]:
# Zero-Shot Text Classification
classifier = pipeline("zero-shot-classification", device=0)
result = classifier(
    "This project is about creating a list of action items from meeting transcriptions using Python and LLM models",
    candidate_labels=["education", "technology", "business", "politics"],
)
print(result)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cuda:0


{'sequence': 'This project is about creating a list of action items from meeting transcriptions using Python and LLM models', 'labels': ['technology', 'business', 'education', 'politics'], 'scores': [0.8668720722198486, 0.10780735313892365, 0.017918990924954414, 0.007401592098176479]}
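The zero-shot result keeps `labels` and `scores` as two parallel lists, sorted from most to least likely. A short sketch that pairs them up and picks the top label, using the values printed above:

```python
# Labels and scores copied from the zero-shot output above.
result = {
    "labels": ["technology", "business", "education", "politics"],
    "scores": [0.8668720722198486, 0.10780735313892365, 0.017918990924954414, 0.007401592098176479],
}

# Pair each label with its score.
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = max(ranked, key=lambda pair: pair[1])

for label, score in ranked:
    print(f"{label:<12} {score:.3f}")
print(f"Top label: {top_label}")
```

The scores here add up to roughly 1 because the candidate labels were scored against each other in this run.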


In [10]:
# Text Generation
generator = pipeline("text-generation", device=0)
result = generator("We finish each other...")
print(result[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


We finish each other... I do not know what will happen with each other! I should have just left it alone."

She turned to see the young man's expression change into a frown as he said, "Not yet! Don't you


In [11]:
# Audio Generation
audio_gen = pipeline("text-to-speech", "microsoft/speecht5_tts", device=0)

embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

speech = audio_gen(
    "I love you more than anyone else in this world.",
    forward_params={"speaker_embeddings": speaker_embeddings}
)

sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
Audio("speech.wav")

config.json:   0%|          | 0.00/2.06k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/585M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/232 [00:00<?, ?B/s]

spm_char.model:   0%|          | 0.00/238k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/585M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/40.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/433 [00:00<?, ?B/s]

Device set to use cuda:0


config.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/50.7M [00:00<?, ?B/s]

README.md:   0%|          | 0.00/1.01k [00:00<?, ?B/s]

cmu-arctic-xvectors.py:   0%|          | 0.00/1.36k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/50.6M [00:00<?, ?B/s]

0000.parquet:   0%|          | 0.00/21.3M [00:00<?, ?B/s]

Generating validation split:   0%|          | 0/7931 [00:00<?, ? examples/s]

## 2. HuggingFace Tokenizers

### 2.1. Meta's Llama 3.1

In [12]:
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer

In [13]:
# Sign in to HuggingFace using the API key
# Generate your API key here: https://huggingface.co/settings/tokens

hf_token = userdata.get('HUGGINGFACE_TOKEN')
login(token=hf_token, add_to_git_credential=True)
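`userdata` only exists inside Google Colab. A hedged sketch of a fallback for running the same notebook elsewhere: the helper `get_hf_token` is made up for illustration, and it assumes the token is exported as an environment variable under either the secret name used above or `HF_TOKEN`:

```python
import os


def get_hf_token(secret_name="HUGGINGFACE_TOKEN"):
    """Read the token from Colab secrets, or from the environment outside Colab."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        # Outside Colab: fall back to environment variables (assumed to be set by the user).
        return os.environ.get(secret_name) or os.environ.get("HF_TOKEN")
    return userdata.get(secret_name)
```

The returned value can be passed to `login(token=...)` exactly as in the cell above.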

In [14]:
# Llama 3.1 from Meta
# Make sure you request access to the Meta Llama 3.1 repository here: https://huggingface.co/meta-llama/Llama-3.1-8B

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B", trust_remote_code=True)

tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

In [15]:
# Tokenize text

text = "Caitlyn Kiramman is the Piltover's finest"
tokens = tokenizer.encode(text)
tokens

[128000,
 34,
 1339,
 18499,
 26608,
 309,
 1543,
 374,
 279,
 37451,
 998,
 424,
 596,
 28807]

In [16]:
tokenizer.decode(tokens)

"<|begin_of_text|>Caitlyn Kiramman is the Piltover's finest"

In [17]:
tokenizer.batch_decode(tokens)

['<|begin_of_text|>',
 'C',
 'ait',
 'lyn',
 ' Kir',
 'am',
 'man',
 ' is',
 ' the',
 ' Pil',
 'to',
 'ver',
 "'s",
 ' finest']
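Each piece from `batch_decode` carries its own leading space, so gluing the pieces back together (minus the special start token) reproduces the original text. A quick check with the pieces printed above:

```python
text = "Caitlyn Kiramman is the Piltover's finest"

# Pieces copied from the batch_decode output above.
pieces = ['<|begin_of_text|>', 'C', 'ait', 'lyn', ' Kir', 'am', 'man',
          ' is', ' the', ' Pil', 'to', 'ver', "'s", ' finest']

# Skip the special begin-of-text token, then join with no separator.
rebuilt = "".join(pieces[1:])
print(rebuilt)
print(rebuilt == text)
```

Names like "Kiramman" and "Piltover" are split into several sub-word pieces, while common words such as " is" and " the" stay whole.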

In [18]:
tokenizer.get_added_vocab()

{'<|begin_of_text|>': 128000,
 '<|end_of_text|>': 128001,
 '<|reserved_special_token_0|>': 128002,
 '<|reserved_special_token_1|>': 128003,
 '<|finetune_right_pad_id|>': 128004,
 '<|reserved_special_token_2|>': 128005,
 '<|start_header_id|>': 128006,
 '<|end_header_id|>': 128007,
 '<|eom_id|>': 128008,
 '<|eot_id|>': 128009,
 '<|python_tag|>': 128010,
 '<|reserved_special_token_3|>': 128011,
 '<|reserved_special_token_4|>': 128012,
 '<|reserved_special_token_5|>': 128013,
 '<|reserved_special_token_6|>': 128014,
 '<|reserved_special_token_7|>': 128015,
 '<|reserved_special_token_8|>': 128016,
 '<|reserved_special_token_9|>': 128017,
 '<|reserved_special_token_10|>': 128018,
 '<|reserved_special_token_11|>': 128019,
 '<|reserved_special_token_12|>': 128020,
 '<|reserved_special_token_13|>': 128021,
 '<|reserved_special_token_14|>': 128022,
 '<|reserved_special_token_15|>': 128023,
 '<|reserved_special_token_16|>': 128024,
 '<|reserved_special_token_17|>': 128025,
 '<|reserved_special_to
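The added vocabulary maps token strings to IDs. Inverting it makes it easy to spot special tokens inside an encoded sequence. A small sketch using a hand-picked subset of the entries printed above and the first few token IDs from the earlier encoding:

```python
# Subset of the get_added_vocab() output above.
added_vocab = {
    "<|begin_of_text|>": 128000,
    "<|end_of_text|>": 128001,
    "<|start_header_id|>": 128006,
    "<|end_header_id|>": 128007,
    "<|eot_id|>": 128009,
}

# Invert the mapping: ID -> token string.
id_to_token = {token_id: token for token, token_id in added_vocab.items()}

# First three token IDs from the earlier tokenizer.encode(text) output.
tokens = [128000, 34, 1339]
special_positions = [(pos, id_to_token[t]) for pos, t in enumerate(tokens) if t in id_to_token]
print(special_positions)
```

This confirms that the leading `128000` in the encoded text is the begin-of-text marker rather than part of the sentence.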

In [19]:
# Llama 3.1 Instruct variant (ships with a chat template)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", trust_remote_code=True)

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

In [20]:
# Apply chat template

message = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the meaning of life?"},
]

prompt = tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True)
print(prompt)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the meaning of life?<|eot_id|><|start_header_id|>assistant<|end_header_id|>




### 2.2. Other Large Language Models

In [21]:
# Initialize models

PHI3_MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
QWEN2_MODEL_NAME = "Qwen/Qwen2-7B-Instruct"
STARCODER2_MODEL_NAME = "bigcode/starcoder2-3b"

In [22]:
# Set up model tokenizer

phi3_tokenizer = AutoTokenizer.from_pretrained(PHI3_MODEL_NAME)
qwen2_tokenizer = AutoTokenizer.from_pretrained(QWEN2_MODEL_NAME)
starcoder2_tokenizer = AutoTokenizer.from_pretrained(STARCODER2_MODEL_NAME)

tokenizer_config.json:   0%|          | 0.00/3.44k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.94M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/306 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/599 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/7.88k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/777k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/442k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.06M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/958 [00:00<?, ?B/s]

In [23]:
# Compare encoding between the models

# Tokenizers were already loaded in the previous cell; `tokenizer` is the Llama 3.1 Instruct one
text = "Caitlyn Kiramman is the Piltover's finest"

print("Meta's Llama-3 encoded text:")
print(tokenizer.encode(text))
print("-----------------------")
print("Phi-3 encoded text:")
print(phi3_tokenizer.encode(text))
print("-----------------------")
print("Qwen2 encoded text:")
print(qwen2_tokenizer.encode(text))
print("-----------------------")
print("Starcoder-2 encoded text:")
print(starcoder2_tokenizer.encode(text))

Meta's Llama-3 encoded text:
[128000, 34, 1339, 18499, 26608, 309, 1543, 374, 279, 37451, 998, 424, 596, 28807]
-----------------------
Phi-3 encoded text:
[315, 1249, 13493, 5201, 314, 1171, 338, 278, 14970, 517, 369, 29915, 29879, 1436, 342]
-----------------------
Qwen2 encoded text:
[34, 1315, 18013, 25527, 309, 1515, 374, 279, 36351, 983, 423, 594, 27707]
-----------------------
Starcoder-2 encoded text:
[72, 1149, 26430, 1242, 495, 424, 1607, 458, 341, 466, 354, 471, 443, 1200, 3770, 464]
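The same sentence costs a different number of tokens in each model. A small sketch that counts them from the ID lists printed above (note that only the Llama list includes a begin-of-text token):

```python
text = "Caitlyn Kiramman is the Piltover's finest"

# Token IDs copied from the encoding output above.
encodings = {
    "Llama 3.1": [128000, 34, 1339, 18499, 26608, 309, 1543, 374, 279, 37451, 998, 424, 596, 28807],
    "Phi-3": [315, 1249, 13493, 5201, 314, 1171, 338, 278, 14970, 517, 369, 29915, 29879, 1436, 342],
    "Qwen2": [34, 1315, 18013, 25527, 309, 1515, 374, 279, 36351, 983, 423, 594, 27707],
    "Starcoder2": [72, 1149, 26430, 1242, 495, 424, 1607, 458, 341, 466, 354, 471, 443, 1200, 3770, 464],
}

counts = {name: len(ids) for name, ids in encodings.items()}

# Print the tokenizers from fewest to most tokens.
for name, n in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{name:<11} {n:>2} tokens, {len(text) / n:.2f} chars/token")
```

Token count matters for QuickMeet because long meeting transcripts have to fit in the model's context window.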


In [24]:
# Compare batch decoding between the models

print("Meta's Llama-3 batch-decoding:")
print(tokenizer.batch_decode(tokenizer.encode(text)))
print("-----------------------")
print("Phi-3 batch-decoding:")
print(phi3_tokenizer.batch_decode(phi3_tokenizer.encode(text)))
print("-----------------------")
print("Qwen-2 batch-decoding:")
print(qwen2_tokenizer.batch_decode(qwen2_tokenizer.encode(text)))
print("-----------------------")
print("Starcoder-2 batch-decoding:")
# Each tokenizer must decode its own IDs; IDs from another tokenizer decode to unrelated tokens
print(starcoder2_tokenizer.batch_decode(starcoder2_tokenizer.encode(text)))

Meta's Llama-3 batch-decoding:
['<|begin_of_text|>', 'C', 'ait', 'lyn', ' Kir', 'am', 'man', ' is', ' the', ' Pil', 'to', 'ver', "'s", ' finest']
-----------------------
Phi-3 batch-decoding:
['C', 'ait', 'lyn', 'Kir', 'am', 'man', 'is', 'the', 'Pil', 'to', 'ver', "'", 's', 'fin', 'est']
-----------------------
Qwen-2 batch-decoding:
['C', 'ait', 'lyn', ' Kir', 'am', 'man', ' is', ' the', ' Pil', 'to', 'ver', "'s", ' finest']
-----------------------
Starcoder-2 batch-decoding:
(output not recorded)


In [25]:
# Compare chat template between the models

print("Meta's Llama-3 chat template:")
print(tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True))
print("-----------------------")
print("Phi-3 chat template:")
print(phi3_tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True))
print("-----------------------")
print("Qwen-2 chat template:")
print(qwen2_tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True))

Meta's Llama-3 chat template:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the meaning of life?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


-----------------------
Phi-3 chat template:
<|system|>
You are a helpful assistant.<|end|>
<|user|>
What is the meaning of life?<|end|>
<|assistant|>

-----------------------
Qwen-2 chat template:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the meaning of life?<|im_end|>
<|im_start|>assistant
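To see what the chat template is doing, here is a hand-written sketch that reproduces the Qwen-2 prompt shown above. It is a simplified stand-in for illustration only: the real template is defined by the tokenizer, and the helper `render_chatml` is made up.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in the <|im_start|>/<|im_end|> layout seen in the Qwen-2 output."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


message = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the meaning of life?"},
]

prompt = render_chatml(message)
print(prompt)
```

In practice `apply_chat_template` should still be used, since each model has its own markers and extras (for example the date lines that Llama 3.1 adds to the system turn).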



## 3. HuggingFace Transformers

In [26]:
!pip install -q -U requests torch bitsandbytes transformers accelerate sentencepiece

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.5.1+cu124 requires torch==2.5.1, but you have torch 2.6.0 which is incompatible.
torchvision 0.20.1+cu124 requires torch==2.5.1, but you have torch 2.6.0 which is incompatible.
fastai 2.7.18 requires torch<2.6,>=1.10, but you have torch 2.6.0 which is incompatible.

In [27]:
# Import libraries

from google.colab import userdata
from huggingface_hub import login
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, TextStreamer
import torch

In [28]:
# Instruct models

LLAMA = "meta-llama/Meta-Llama-3.1-8B-Instruct"
PHI3 = "microsoft/Phi-3-mini-4k-instruct"
GEMMA2 = "google/gemma-2-2b-it"
QWEN2 = "Qwen/Qwen2-7B-Instruct"
MIXTRAL = "mistralai/Mixtral-8x7B-Instruct-v0.1"

In [29]:
messages = [
    {"role": "system", "content": "You are more beautiful than you think."},
    {"role": "user", "content": "Don't let those who don't matter impact you"}
]

In [30]:
# Quantization configuration: load the weights in 4-bit (NF4) so the 8B model uses far less GPU memory

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
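To get a feel for what 4-bit storage means, here is a toy sketch of block-wise quantization. It is deliberately simplified: it uses evenly spaced signed codes with one absmax scale per block, not the NF4 codebook or double quantization that the config above selects, and the function names and sample weights are made up.

```python
def quantize_absmax_4bit(block):
    """Map floats to integer codes in [-7, 7] using one scale for the whole block."""
    absmax = max(abs(v) for v in block)
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [round(v / scale) for v in block]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Made-up weights standing in for one small block of a weight matrix.
weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]

codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale:.4f}, max error={max_error:.4f}")
```

Each weight is stored as a small integer plus a shared scale, which is where the memory saving comes from; the price is the rounding error printed at the end.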

In [31]:
# Tokenizer

llama_tokenizer = AutoTokenizer.from_pretrained(LLAMA)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
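# add_generation_prompt=True appends the assistant header so the model answers as the assistant
inputs = llama_tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")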

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

In [32]:
# Create model
model = AutoModelForCausalLM.from_pretrained(
    LLAMA,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quant_config
)

config.json:   0%|          | 0.00/855 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

In [33]:
# Check the memory footprint

memory = model.get_memory_footprint()
print(f"Memory footprint: {memory} bytes")

Memory footprint: 5591539968 bytes
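The raw byte count is easier to read in gigabytes. A quick conversion, compared against the 15360 MiB that `nvidia-smi` reported for the T4 at the top of this notebook:

```python
footprint_bytes = 5_591_539_968      # value printed by get_memory_footprint() above
t4_total_bytes = 15_360 * 2**20      # 15360 MiB from the nvidia-smi output

footprint_gb = footprint_bytes / 1e9       # decimal gigabytes
footprint_gib = footprint_bytes / 2**30    # binary gibibytes
share = footprint_bytes / t4_total_bytes   # fraction of the T4's memory

print(f"{footprint_gb:.2f} GB ({footprint_gib:.2f} GiB), {share:.0%} of the T4's memory")
```

This only covers the model weights; generation needs extra memory on top for activations and the KV cache.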


In [34]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((409
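The layer shapes in the printout are enough to count parameters by hand. A sketch that adds up only the modules visible above (all linear layers have `bias=False`); the output is cut off before any output head, so the `lm_head` size at the end is an assumption that it mirrors the embedding shape:

```python
hidden, kv_dim, mlp_dim = 4096, 1024, 14336
vocab, n_layers = 128256, 32

# Per decoder layer, from the printed shapes.
attention = 2 * hidden * hidden + 2 * hidden * kv_dim   # q_proj, o_proj + k_proj, v_proj
mlp = 3 * hidden * mlp_dim                              # gate_proj, up_proj, down_proj
norms = 2 * hidden                                      # two RMSNorm weight vectors
per_layer = attention + mlp + norms

# Embedding + all layers + final norm (everything visible in the printout).
visible_total = vocab * hidden + n_layers * per_layer + hidden

# Assumption: an untied output head with the same shape as the embedding.
assumed_lm_head = hidden * vocab

print(f"per layer:      {per_layer:,}")
print(f"visible total:  {visible_total:,}")
print(f"with lm_head:   {visible_total + assumed_lm_head:,}")
```

Under that assumption the total lands at about 8.03 billion, which matches the "8B" in the model name.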

In [35]:
outputs = model.generate(inputs, max_new_tokens=80)
print(llama_tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are more beautiful than you think.<|eot_id|><|start_header_id|>user<|end_header_id|>

Don't let those who don't matter impact you<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Those words of wisdom can be a powerful reminder to focus on what truly matters in life and to let go of negativity from people who don't have a significant impact on our well-being.<|eot_id|>
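The decoded output contains the whole prompt plus the reply. A string-based sketch for pulling out just the assistant's answer, assuming the Llama 3.1 header and end-of-turn markers shown above; the helper `extract_assistant_reply` is made up, and the sample string is a shortened copy of the output:

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def extract_assistant_reply(decoded):
    """Return the text after the last assistant header, up to the end-of-turn marker."""
    # If the header is missing, rpartition leaves the whole string in `tail`.
    _, _, tail = decoded.rpartition(ASSISTANT_HEADER)
    reply, _, _ = tail.partition(END_OF_TURN)
    return reply.strip()


# Shortened copy of the decoded output above.
decoded = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are more beautiful than you think.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Don't let those who don't matter impact you<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "Those words of wisdom can be a powerful reminder.<|eot_id|>"
)

reply = extract_assistant_reply(decoded)
print(reply)
```

For QuickMeet this would be the part worth keeping, for example the generated list of action items without the prompt around it.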


In [36]:
# Clean up model and cache

del inputs, outputs, model
torch.cuda.empty_cache()