<h1>Chapter 1 - Introduction to Language Models</h1>
<i>Exploring the exciting field of Language AI</i>


<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter01/Chapter%201%20-%20Introduction%20to%20Language%20Models.ipynb)

---

This notebook is for Chapter 1 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>


### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **run** the `pip install` cell further down to install the dependencies for this chapter.

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---

## Understanding What We're Building

As discussed in Chapter 1, large language models are built on the Transformer architecture from the 2017 paper "Attention Is All You Need". The model we're using today, **Phi-3**, is a **decoder-only model**, similar to GPT. These models generate text autoregressively, meaning they predict one token at a time based on all previous tokens.

Remember from Chapter 1: the "4k" in Phi-3-mini-4k-instruct refers to the context window, meaning the model can process up to 4,096 tokens (roughly 4,000) at once. This is important because, as we learned, the context length determines how much text the model can "remember" during generation.
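To make the idea of a context window concrete, here is a minimal sketch in plain Python. It uses 4,096 as the exact size behind the "4k" label and assumes a simple "keep the most recent tokens" policy, which is only one of several ways to handle over-long input. The helper name `fit_to_context` and the token IDs are illustrative, not part of any library.

```python
CONTEXT_WINDOW = 4096  # exact size behind the "4k" label


def fit_to_context(token_ids, max_new_tokens, context_window=CONTEXT_WINDOW):
    """Keep only the most recent tokens so prompt + generation fits the window."""
    budget = context_window - max_new_tokens  # room left for the prompt
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return token_ids[-budget:]


# Pretend we have a 5,000-token prompt (the IDs are just 0..4999 here).
long_prompt = list(range(5000))
trimmed = fit_to_context(long_prompt, max_new_tokens=500)

print(len(trimmed))  # number of prompt tokens that still fit
print(trimmed[0])    # the oldest token that survived the trim
```

The point of the sketch is the budget: the prompt and the newly generated tokens share the same window, so a larger `max_new_tokens` leaves less room for the prompt.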

### Generating Your First Text

The main source for finding and downloading LLMs is the [Hugging Face Hub](https://huggingface.co/docs/hub/en/index).

**Hugging Face** is the organization behind the well-known Transformers
package, discussed heavily in the slides.

In [1]:
%%capture
# Quote the version specifiers so the shell does not treat ">" as a redirect
!pip install "transformers>=4.40.1" "accelerate>=0.27.2"

# Phi-3

The first step is to load our model onto the GPU for faster inference. Note that we load the model and tokenizer separately (although that isn't always necessary).

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")


Although tokenization will be discussed more comprehensively in Chapter 2, let's have a quick look at how it works:

In [3]:
# Understanding Tokenization from Chapter 1
# We know from Chapter 1 that tokenization is converting text to numbers

# Let's see how Phi-3 tokenizes text
sample_text = "I love llamas"
tokens = tokenizer(sample_text)
print(f"Text: {sample_text}")
print(f"Token IDs: {tokens['input_ids']}")
print(f"Back to text: {tokenizer.decode(tokens['input_ids'])}")

# Notice how the tokenizer splits text into subword units
# This is more sophisticated than the simple word-based tokenization we saw in Chapter 1

Text: I love llamas
Token IDs: [306, 5360, 11829, 294]
Back to text: I love llamas
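To build some intuition for how a word can end up as several subword tokens, here is a toy tokenizer in plain Python. It is a simplification: real tokenizers such as the one shipped with Phi-3 learn their vocabulary from data, whereas this sketch uses a hand-written, hypothetical vocabulary (`TOY_VOCAB`) and a greedy longest-match rule. The pieces and IDs below are made up and are not Phi-3's.

```python
# Hypothetical mini-vocabulary: subword piece -> token ID (IDs are made up).
TOY_VOCAB = {"I": 0, " love": 1, " ll": 2, "ama": 3, "s": 4, " ": 5, "l": 6, "a": 7, "m": 8}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text greedily into the longest pieces found in the vocabulary."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches {text[i]!r}")
    return pieces


pieces = toy_tokenize("I love llamas")
ids = [TOY_VOCAB[p] for p in pieces]
print(pieces)
print(ids)
print("".join(pieces))  # joining the pieces restores the original text
```

Common words tend to survive as a single piece, while rarer words such as "llamas" are assembled from smaller pieces. That is the property that lets a fixed vocabulary cover text it has never seen as a whole word.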


Although we can now use the model and tokenizer directly, it's much easier to wrap them in a `pipeline` object:

In [4]:
from transformers import pipeline

# Create a pipeline with hyperparameters we learned about
# As discussed in Chapter 1, these control how the model generates text

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,  # Only return newly generated text
    max_new_tokens=500,      # Maximum tokens to generate (remember context windows!)
    do_sample=False          # Deterministic output (greedy decoding)
)

"""
Key generation parameters from Chapter 1:

- max_new_tokens: Limits generation length. Remember that GPT-style models
  are autoregressive - they generate one token at a time.

- do_sample: When False, always picks the most likely next token (greedy).
  When True, samples from the probability distribution.

- temperature (not set here): Controls randomness. Low values make the model
  more focused and deterministic, high values make it more creative.

- top_p and top_k: Restrict sampling to the most likely tokens (nucleus and
  top-k sampling). They can be combined with temperature.
"""

Device set to use cuda
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

Finally, we create our prompt as a user and give it to the model:

In [5]:
# The prompt (user input / query)
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

# Generate output
output = generator(messages)
print(output[0]["generated_text"])

 Why did the chicken join the band? Because it had the drumsticks!
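The pipeline accepted a list of role/content dictionaries, but the model itself only sees one sequence of tokens. Before generation, the messages are flattened into a single prompt string using the model's chat template (in Transformers this is handled by `tokenizer.apply_chat_template`). The sketch below imitates that step in plain Python. The function `render_chat` and its role markers are illustrative assumptions; the real template ships with each model and may differ.

```python
def render_chat(messages, add_generation_prompt=True):
    """Flatten chat messages into one prompt string using illustrative role markers."""
    parts = []
    for message in messages:
        # Each turn becomes: <|role|>, newline, content, <|end|>, newline.
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    if add_generation_prompt:
        # Cue the model that the assistant's reply comes next.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Create a funny joke about chickens."}]
prompt_text = render_chat(messages)
print(prompt_text)
```

This is why an instruction-tuned model answers the message rather than merely continuing it: the prompt ends with a marker signalling that an assistant turn should follow.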


### Your Turn: Applying Chapter 1 Concepts

Now let's practice the concepts from Chapter 1 with hands-on exercises.


#### Exercise 1: Token Vocabulary Exploration
Run this to see how different types of text tokenize differently. Try using different characters.

In [6]:
# Exercise 1: Now modify test_texts and add your own examples
print("Different text types tokenize differently:")
print("=" * 60)

test_texts = {
    "English": "The cat sat on the mat",
    "Code": "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
    "Numbers": "3.14159 2.71828 1.41421",
    "Mixed": "GPT-3 has 175B parameters.",
    "Special": "Hello 世界 🦙 #AI"
}

for text_type, text in test_texts.items():
    tokens = tokenizer(text)
    print(f"\n{text_type}:")
    print(f"  Text: '{text[:50]}{'...' if len(text) > 50 else ''}'")
    print(f"  Tokens: {len(tokens['input_ids'])}")
    print(f"  First 5 token IDs: {tokens['input_ids'][:5]}")


Different text types tokenize differently:

English:
  Text: 'The cat sat on the mat'
  Tokens: 6
  First 5 token IDs: [450, 6635, 3290, 373, 278]

Code:
  Text: 'def fibonacci(n): return n if n <= 1 else fibonacc...'
  Tokens: 32
  First 5 token IDs: [822, 18755, 265, 21566, 29898]

Numbers:
  Text: '3.14159 2.71828 1.41421'
  Tokens: 24
  First 5 token IDs: [29871, 29941, 29889, 29896, 29946]

Mixed:
  Text: 'GPT-3 has 175B parameters.'
  Tokens: 12
  First 5 token IDs: [402, 7982, 29899, 29941, 756]

Special:
  Text: 'Hello 世界 🦙 #AI'
  Tokens: 11
  First 5 token IDs: [15043, 29871, 30793, 30967, 29871]
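Raw token counts are hard to compare because the texts differ in length. As a rough aid, the sketch below divides each text's character count by the token count reported in the output above. It needs no model: the counts are copied from that output, and "characters" here means Python code points (so the emoji counts as one).

```python
# Texts and token counts copied from the output above.
observed = {
    "English": ("The cat sat on the mat", 6),
    "Code": ("def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)", 32),
    "Numbers": ("3.14159 2.71828 1.41421", 24),
    "Mixed": ("GPT-3 has 175B parameters.", 12),
    "Special": ("Hello 世界 🦙 #AI", 11),
}

# Higher values mean each token covers more text, i.e. cheaper tokenization.
chars_per_token = {
    name: len(text) / n_tokens for name, (text, n_tokens) in observed.items()
}

for name, ratio in chars_per_token.items():
    print(f"{name:8s} {ratio:.2f} characters per token")
```

A ratio below 1 for the numbers example means the tokenizer spent, on average, more than one token per character there, which is worth keeping in mind for the questions below.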


#### Questions:

**Q1)** Why does the code example need so many more tokens than the plain English sentence (32 tokens vs. 6 tokens), even allowing for its greater length? What does this tell you about how the tokenizer handles programming syntax versus natural language?

---

**Q2)** How would tokenization differ if you input text in a non-English language like Chinese or Arabic? What effect does this have on model performance across different languages?

---

**Q3)** Why do numbers like "3.14159" result in many more tokens (24 in total for the three numbers) than you might expect? What does this tell you about how transformers process numerical data?

---

**Q4)** What would happen if you tried to process text containing words or characters that weren't seen during the tokenizer's training? How might this affect the model's ability to understand and generate responses?

---

**Q5)** Why is it important that the tokenizer can handle special characters and mixed content (like "GPT-3 has 175B parameters")? How does subword tokenization help with out-of-vocabulary words?

#### Exercise 2: Autoregressive Generation Steps
Watch how the model generates text token by token:

In [7]:
# Exercise 2: Change the prompt to see different patterns
print("AUTOREGRESSIVE GENERATION")
print("Chapter 1: Decoder models generate one token at a time")
print("=" * 60)

prompt = [{"role": "user", "content": "The attention mechanism in transformers"}]

# Generate different lengths to see the progression
for num_tokens in [5, 15, 30]:
    output = generator(prompt, max_new_tokens=num_tokens, do_sample=False)
    print(f"\nAfter {num_tokens} tokens:")
    print(f"'{output[0]['generated_text']}'")

print("\nNotice how each output builds on the previous tokens!")



AUTOREGRESSIVE GENERATION
Chapter 1: Decoder models generate one token at a time

After 5 tokens:
' The attention mechanism in transform'

After 15 tokens:
' The attention mechanism in transformers is a critical component that allows the model to'

After 30 tokens:
' The attention mechanism in transformers is a critical component that allows the model to weigh the importance of different parts of the input data differently. It was'

Notice how each output builds on the previous tokens!
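The loop behind this behavior can be sketched without a neural network. The toy "model" below is a hypothetical lookup table of next-token scores (`TOY_MODEL`), and it conditions only on the last token, whereas a real LLM conditions on all previous tokens. What it shares with the real thing is the loop: pick a token, append it, and feed the longer sequence back in.

```python
# Hypothetical next-token scores: current token -> {candidate: score}.
TOY_MODEL = {
    "<start>": {"The": 0.9, "A": 0.1},
    "The": {"attention": 0.7, "model": 0.3},
    "attention": {"mechanism": 0.8, "span": 0.2},
    "mechanism": {"weighs": 0.6, "is": 0.4},
    "weighs": {"tokens": 0.9, "inputs": 0.1},
    "tokens": {"<end>": 1.0},
}


def generate_greedy(max_new_tokens):
    """Append the highest-scoring next token until the budget or <end> is reached."""
    sequence = ["<start>"]
    for _ in range(max_new_tokens):
        candidates = TOY_MODEL[sequence[-1]]
        next_token = max(candidates, key=candidates.get)  # greedy choice
        if next_token == "<end>":
            break
        sequence.append(next_token)  # the output becomes part of the next input
    return sequence[1:]


for budget in [2, 4, 10]:
    print(budget, generate_greedy(budget))
```

Because the choice is greedy, a larger budget only extends the shorter result, which mirrors how the 15-token output above begins with the 5-token output.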


#### Questions:

**Q1)** Why does the model generate text one token at a time rather than producing the entire output at once? What are the computational implications of this approach?

---

**Q2)** How does the context window limitation (4,096 tokens for this Phi-3 variant) affect what the model can "remember" during generation? What happens when you exceed this limit?

---

**Q3)** What patterns do you notice in how the model completes the sentence as more tokens are generated? Why does the output become more coherent with additional tokens?

---

**Q4)** Why might setting `do_sample=False` (greedy decoding) produce more predictable but potentially less creative outputs? When would you want deterministic vs. probabilistic generation?

---

**Q5)** How would changing the prompt from "The attention mechanism in transformers" to a more specific technical question affect the autoregressive generation process and the quality of intermediate outputs?

#### Exercise 3: Comparing Models - Decoder Architecture
Chapter 1 discussed how decoder-only models like GPT generate text.

Load Qwen/Qwen2.5-1.5B-Instruct and compare it with Phi-3 (this may take a while):

In [8]:
# Exercise 3: Comparing decoder-only models
print("LOADING QWEN MODEL FOR COMPARISON")
print("=" * 60)

# Load Qwen (smaller model for faster loading)
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

print("Loading Qwen/Qwen2.5-1.5B-Instruct...")
qwen_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    device_map="cuda",
    torch_dtype="auto"
)
qwen_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

qwen_generator = pipeline(
    "text-generation",
    model=qwen_model,
    tokenizer=qwen_tokenizer,
    return_full_text=False,
    max_new_tokens=50,
    do_sample=False
)

# Compare outputs for different prompts
test_prompts = [
    "Explain neural networks in simple terms:",
    "Write a Python function to sort a list:",
    "What is the meaning of life?"
]

for prompt_text in test_prompts:
    prompt = [{"role": "user", "content": prompt_text}]

    print(f"\nPrompt: '{prompt_text}'")
    print("-" * 40)

    phi3_output = generator(prompt, max_new_tokens=30, do_sample=False)
    print(f"Phi-3: {phi3_output[0]['generated_text'][:100]}...")

    qwen_output = qwen_generator(prompt, max_new_tokens=30, do_sample=False)
    print(f"Qwen:  {qwen_output[0]['generated_text'][:100]}...")

print("\nNotice the different styles and capabilities of each model!")

LOADING QWEN MODEL FOR COMPARISON
Loading Qwen/Qwen2.5-1.5B-Instruct...



Device set to use cuda
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



Prompt: 'Explain neural networks in simple terms:'
----------------------------------------
Phi-3:  Neural networks are a series of algorithms that attempt to recognize underlying relationships in a ...
Qwen:  Neural networks are a type of machine learning model inspired by the structure and function of the h...

Prompt: 'Write a Python function to sort a list:'
----------------------------------------
Phi-3:  Certainly! Below is a Python function that sorts a list using the built-in `sorted()` function, whi...
Qwen:  Certainly! Here's an example of a Python function that sorts a list using the built-in `sorted()` fu...

Prompt: 'What is the meaning of life?'
----------------------------------------
Phi-3:  The meaning of life is a philosophical question concerning the significance of life or existence in...
Qwen:  As an AI language model, I don't have personal beliefs or opinions about what constitutes the "meani...

Notice the different styles and capabilities of each model!


#### Questions:

**Q1)** Why do Phi-3 (3.8B parameters) and Qwen (1.5B parameters) produce different styles of responses to the same prompt? How does model size affect response quality and style?

---

**Q2)** What architectural differences between these decoder-only models might account for their varying approaches to explaining concepts or generating code?

---

**Q3)** How does the training data for each model influence their responses? Why might one model be better at technical explanations while another excels at creative tasks?

---

**Q4)** Why do both models use the same fundamental autoregressive generation approach despite their size differences? What does this tell you about the transformer architecture's scalability?

---

**Q5)** What trade-offs are involved in choosing a smaller model like Qwen versus a larger one like Phi-3? Consider factors like inference speed, memory usage, and task performance.
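For the memory side of that trade-off, a back-of-the-envelope estimate is often enough: multiply the number of parameters by the bytes used to store each one. The sketch below assumes 16-bit weights (2 bytes per parameter) and approximate parameter counts. The helper `weight_memory_gb` is illustrative, and the estimate covers the weights only, not activations or the cache used during generation.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# Approximate parameter counts; 2 bytes per parameter assumes 16-bit weights.
models = {
    "Phi-3-mini-4k-instruct": 3.8e9,
    "Qwen2.5-1.5B-Instruct": 1.5e9,
}

for name, n_params in models.items():
    print(f"{name}: ~{weight_memory_gb(n_params):.1f} GB of weights")
```

Halving the precision (for example, 1 byte per parameter with 8-bit quantization) would halve these figures, which is one reason quantized models are popular on small GPUs.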

#### Exercise 4: Sampling and Temperature
Chapter 1 explained that decoder models generate text one token at a time. Let's observe how sampling settings such as temperature change which tokens get picked:

In [9]:
# Exercise 4: Try different prompts, temperatures, top_p values, etc.
print("EXPERIMENT")
print("=" * 60)
print("Change these examples and play with the models:")

# Example 1: Story generation with different temperatures
story_prompt = [{"role": "user", "content": "Once upon a time in a world where"}]

print("\n1. Story Generation:")
for temp in [0.5, 1.0, 1.5, 5.0]:
    output = generator(story_prompt, max_new_tokens=40, do_sample=True, temperature=temp)
    print(f"\nTemperature {temp}: {output[0]['generated_text'][:80]}...")

# Example 2: Technical explanation
tech_prompt = [{"role": "user", "content": "The transformer architecture works by"}]

print("\n\n2. Technical Explanation:")
output = generator(tech_prompt, max_new_tokens=50, do_sample=False)
print(output[0]['generated_text'])


EXPERIMENT
Change these examples and play with the models:

1. Story Generation:

Temperature 0.5:  Once upon a time in a world where magic and technology coexisted harmoniously, ...

Temperature 1.0:  Once upon a time in a world where...


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset



Temperature 1.5:  Ah, where shall our story take place to set the scene? Any universe is availabl...

Temperature 5.0:  once Upon This Time Again begins deep in Tarniaville Village at exactly four.  ...


2. Technical Explanation:
 The transformer architecture works by using self-attention mechanisms to process input data in parallel, rather than sequentially. This allows for more efficient handling of long-range dependencies in data, such as in natural language processing tasks. The transformer
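What temperature does to the next-token distribution can be shown with a few lines of plain Python. The sketch below divides a set of made-up logits by the temperature before applying softmax. The function name `softmax_with_temperature` and the logit values are illustrative; real models work over a vocabulary of tens of thousands of tokens.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate next tokens.
logits = [2.0, 1.0, 0.0]

for temperature in [0.5, 1.0, 1.5, 5.0]:
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Low temperatures exaggerate the gaps between logits, so the top token dominates. High temperatures shrink the gaps, so the distribution approaches uniform and unlikely tokens get sampled far more often.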


In [10]:
# Experiment with the hyperparameters:
temp2 = 0.7  # example value; change to any positive float
output = generator(story_prompt, max_new_tokens=40, do_sample=True, temperature=temp2)
print("\nExperiment:")
print(f"\nTemperature {temp2}: {output[0]['generated_text'][:80]}...")

#### Questions:

**Q1)** Why does temperature 0.5 produce more coherent but predictable text while temperature 5.0 generates seemingly random or nonsensical output? What is happening to the probability distribution at different temperature values?

---

**Q2)** How does temperature mathematically modify the softmax distribution over the vocabulary? Why does dividing logits by temperature before softmax affect randomness?

---

**Q3)** What types of tasks would benefit from low temperature (0.5) versus high temperature (1.5) settings? Consider use cases like technical documentation vs. creative writing.

---

**Q4)** Why does the model sometimes produce grammatically incorrect or illogical text at very high temperatures even though it was trained on coherent text? What does this reveal about how language models store and retrieve information?

---

**Q5)** How would combining temperature with other sampling strategies like top-p (nucleus sampling) or top-k affect the generation quality? Why might you want to use multiple sampling techniques together?
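As a starting point for the last question, here is a minimal sketch of top-k and top-p (nucleus) filtering over a made-up next-token distribution. The function names and probabilities are illustrative assumptions. In Transformers, the corresponding generation parameters are `top_k` and `top_p`, and such filters are typically applied together with temperature before a token is sampled.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}


# Made-up next-token distribution.
probs = {"magic": 0.5, "dragons": 0.3, "robots": 0.15, "taxes": 0.05}

print(top_k_filter(probs, k=2))
print(top_p_filter(probs, p=0.9))
```

Top-k always keeps a fixed number of candidates, while top-p adapts: it keeps few tokens when the model is confident and many when the distribution is flat. Both cut off the long tail of unlikely tokens that high temperatures would otherwise promote.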