In [1]:
!pip install "transformers>=4.40.1" "accelerate>=0.27.2"

# Phi-3

The first step is to load our model onto the GPU for faster inference. Note that we load the model and tokenizer separately, although that isn't always necessary: a `pipeline` can also load both from a model name.

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

In [3]:
# Print the tokenizer's base vocab size (added special tokens are not counted)
print(f"Vocab size: {tokenizer.vocab_size}")

Vocab size: 32000


In [4]:
# Test some example words
words = ["hello", "world", "AI", "learning", "A", "I"]

for word in words:
  print(f"{word}: {tokenizer.encode(word)}")

hello: [22172]
world: [3186]
AI: [319, 29902]
learning: [6509]
A: [319]
I: [306]




*   **I=306 :** The ID for "I" at the start of a word. SentencePiece-style tokenizers such as this one fold the word boundary into the token itself, so a standalone "I" (e.g., "I am learning") gets the word-initial token, just as "A" alone and the "A" in "AI" both map to 319.
*   **I=29902 :** The ID for "I" in the middle of a word. In "AI", the "A" carries the word boundary and the "I" is a continuation piece, so it gets a different ID.
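
The difference between the two IDs can be reproduced with a toy tokenizer. This is only a sketch: `TOY_VOCAB` and `toy_encode` are hypothetical names, the vocabulary holds just four entries with the IDs printed above, the `▁` word-start marker is an assumption about how the real vocabulary spells these tokens, and greedy longest-match stands in for the real tokenizer's merge algorithm.

```python
# Toy vocabulary reusing the IDs printed above. The "▁" word-start marker
# is an assumption about how the real vocabulary spells these tokens.
TOY_VOCAB = {"▁hello": 22172, "▁A": 319, "▁I": 306, "I": 29902}


def toy_encode(text: str) -> list[int]:
    """Greedy longest-match encoding over TOY_VOCAB (illustrative only)."""
    # Mark the start of every whitespace-separated word.
    marked = "".join("▁" + word for word in text.split())
    ids = []
    pos = 0
    while pos < len(marked):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(marked), pos, -1):
            piece = marked[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {marked[pos:]!r}")
    return ids


print(toy_encode("I"))   # [306]
print(toy_encode("AI"))  # [319, 29902]
```

A standalone "I" matches the word-initial entry, while the "I" in "AI" is only reached after the word-initial "A" has been consumed, so it falls through to the unmarked entry.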



In [5]:
# Test a sentence
sentence = "Artificial intelligence is the future."
encoded_sentence = tokenizer.encode(sentence)
decoded_sentence = tokenizer.decode(encoded_sentence)

print(f"Encoded sentence: {encoded_sentence}")
print(f"Decoded sentence: {decoded_sentence}")

Encoded sentence: [3012, 928, 616, 21082, 338, 278, 5434, 29889]
Decoded sentence: Artificial intelligence is the future.


In [6]:
# Decode the tokens
for token in encoded_sentence:
  print(f"{token}: {tokenizer.decode(token)}")

3012: Art
928: ific
616: ial
21082: intelligence
338: is
278: the
5434: future
29889: .


Although we can now use the model and tokenizer directly, it's much easier to wrap them in a `pipeline` object:

In [7]:
from transformers import pipeline

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False
)

Finally, we write our prompt as a user message and pass it to the pipeline:

In [8]:
# The prompt (user input / query)
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

# Generate output
output = generator(messages)
print(output[0]["generated_text"])

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48


 Why did the chicken join the band? Because it had the drumsticks!


In [13]:
# The prompt (user input / query) with delimiters
prompt = """
<|user|>
Create a funny joke about chickens.<|end|>
<|assistant|>
"""

# Generate output
output = generator(prompt)
print(output)


[{'generated_text': ' Why did the chicken join the band? Because it had the drumsticks!'}]
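
The message list and the hand-written delimited string produce the same joke, which is consistent with the pipeline converting chat messages into a tagged prompt before generation. The sketch below mirrors the delimiters from the cell above. `build_prompt` is a hypothetical helper, not part of `transformers`; the real conversion is done by the tokenizer's chat template (see `tokenizer.apply_chat_template`), which may differ in details such as a beginning-of-sequence token.

```python
def build_prompt(messages: list[dict]) -> str:
    """Format chat messages with the delimiters used in the cell above."""
    parts = []
    for message in messages:
        # Each turn: role tag, newline, content, end-of-turn tag.
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    # A trailing assistant tag cues the model to write the reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Create a funny joke about chickens."}]
prompt = build_prompt(messages)
print(prompt)
```

Unlike the triple-quoted string above, this version does not start with a newline.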
