# Deconstructing Local LLMs

When we download an LLM for local use, it comes with several essential components that work together to make text generation possible. In this lesson, we'll explore these components, understand what they do, and learn how they fit together.

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import json
import torch
from safetensors import safe_open

# Set the directory where we'll save the model
save_directory = "D:/AIModel"  
os.makedirs(save_directory, exist_ok=True)

# Download a small model
model_name = "distilgpt2"
print(f"Downloading {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Save the model to our local directory
print(f"Saving model to {save_directory}...")
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
print("Model and tokenizer saved successfully!")

Downloading distilgpt2...
Saving model to D:/AIModel...
Model and tokenizer saved successfully!


## Exploring the Model Files

Let's see which files were written to disk when we saved the model and tokenizer:

In [3]:
# List all files in the model directory
files = os.listdir(save_directory)
print("Files in the model directory:")
for file in sorted(files):
    # Get file size in MB
    file_path = os.path.join(save_directory, file)
    file_size = os.path.getsize(file_path) / (1024 * 1024)  # Convert to MB
    print(f"- {file} ({file_size:.5f} MB)")

Files in the model directory:
- config.json (0.00097 MB)
- generation_config.json (0.00012 MB)
- merges.txt (0.43518 MB)
- model.safetensors (312.47895 MB)
- special_tokens_map.json (0.00010 MB)
- tokenizer.json (3.39287 MB)
- tokenizer_config.json (0.00047 MB)
- vocab.json (0.76118 MB)


## Understanding the Key Components

The files we see in the model directory can be grouped into these categories:

1. **Model Configuration**
   - `config.json` - Contains model architecture and hyperparameters
   - `generation_config.json` - Default settings for text generation

2. **Model Weights**
   - `pytorch_model.bin` or `model.safetensors` - The actual trained parameters

3. **Tokenizer Components**
   - `tokenizer_config.json` - Tokenizer settings
   - `vocab.json` - The vocabulary mapping tokens to IDs
   - `merges.txt` - Byte-Pair Encoding (BPE) merges for subword tokenization
   - `special_tokens_map.json` - Defines special tokens; GPT-2 style models use `<|endoftext|>` rather than BERT-style `[PAD]` or `[CLS]`
   - `tokenizer.json` - A single-file version of the tokenizer, used by the fast tokenizer implementation
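
As a rough illustration, the grouping used in this lesson can be written as a lookup table. This is only a sketch: `FILE_CATEGORIES` and `group_model_files` are hypothetical names introduced here, and the table covers only the file names that appear in this lesson, not every file a model repository might contain.

```python
# Hypothetical lookup table: file name -> category, following this lesson's grouping
FILE_CATEGORIES = {
    "config.json": "configuration",
    "generation_config.json": "configuration",
    "pytorch_model.bin": "weights",
    "model.safetensors": "weights",
    "tokenizer_config.json": "tokenizer",
    "tokenizer.json": "tokenizer",
    "vocab.json": "tokenizer",
    "merges.txt": "tokenizer",
    "special_tokens_map.json": "tokenizer",
}


def group_model_files(filenames):
    """Group file names by category; unknown names go under 'other'."""
    groups = {}
    for name in sorted(filenames):
        category = FILE_CATEGORIES.get(name, "other")
        groups.setdefault(category, []).append(name)
    return groups


# The file names printed by the directory listing earlier in this lesson
listing = [
    "config.json", "generation_config.json", "merges.txt", "model.safetensors",
    "special_tokens_map.json", "tokenizer.json", "tokenizer_config.json", "vocab.json",
]
groups = group_model_files(listing)
for category, names in groups.items():
    print(f"{category}: {names}")
```

In the notebook, `group_model_files(os.listdir(save_directory))` would apply the same grouping to the real directory.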

Let's examine each of these components in detail.

## Model Configuration (config.json)

The `config.json` file contains essential information about the model architecture and hyperparameters. This tells the framework how to construct the model's neural network layers.

Understanding these parameters:
- `model_type`: The architecture family (e.g., GPT-2)
- `vocab_size`: Number of tokens in the vocabulary
- `n_positions`: Maximum sequence length the model can handle
- `n_embd`: Dimension of embeddings and hidden layers
- `n_layer`: Number of transformer layers/blocks
- `n_head`: Number of attention heads in each layer
- `activation_function`: Non-linearity used (e.g., gelu, relu)
- `*_pdrop`: Dropout probabilities for different components

In [4]:
# Load and examine the config.json file
config_path = os.path.join(save_directory, "config.json")
with open(config_path, "r") as f:
    config = json.load(f)

# Let's see what's in the config
print("Key model configuration parameters:")
important_params = [
    "model_type", "vocab_size", "n_positions", "n_embd", "n_layer", "n_head", 
    "activation_function", "resid_pdrop", "embd_pdrop", "attn_pdrop"
]
for param in important_params:
    if param in config:
        print(f"- {param}: {config[param]}")

Key model configuration parameters:
- model_type: gpt2
- vocab_size: 50257
- n_positions: 1024
- n_embd: 768
- n_layer: 6
- n_head: 12
- activation_function: gelu_new
- resid_pdrop: 0.1
- embd_pdrop: 0.1
- attn_pdrop: 0.1
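
These few numbers are enough to estimate how many parameters the model has, and therefore roughly how large the weights file should be. The sketch below is a back-of-the-envelope calculation, not code from the library. It assumes the standard GPT-2 block layout (a fused query/key/value projection, an MLP whose inner width is `4 * n_embd`, two layer norms per block), an output layer that shares its weights with the token embedding, and 4 bytes per parameter (float32). `gpt2_param_count` is a hypothetical helper name.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Estimate the parameter count of a GPT-2 style model from its config values."""
    d = n_embd
    # Token embedding table plus learned position embedding table
    embeddings = vocab_size * d + n_positions * d
    # Attention: fused Q/K/V projection (d -> 3d) and output projection (d -> d), with biases
    attention = (d * 3 * d + 3 * d) + (d * d + d)
    # MLP: expand to 4d, then project back to d, with biases (assumes inner width 4 * d)
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two layer norms per block, each with a weight and a bias vector
    layer_norms = 2 * (2 * d)
    per_block = attention + mlp + layer_norms
    # Final layer norm; the output layer is assumed to reuse the token embedding
    final_layer_norm = 2 * d
    return embeddings + n_layer * per_block + final_layer_norm


# Values taken from the config.json output above
total = gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=6)
size_mib = total * 4 / (1024 * 1024)  # float32 = 4 bytes per parameter
print(f"{total:,} parameters, about {size_mib:.2f} MiB as float32")
# 81,912,576 parameters, about 312.47 MiB as float32
```

The estimate lands very close to the 312.48 MB reported for `model.safetensors` in the directory listing (which divides by 1024 * 1024, so its "MB" is the same unit); the small remainder is consistent with the file's metadata header.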


## Model Weights File

The model weights might be stored in one of these formats:
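- `pytorch_model.bin` - PyTorch's native format, a pickle-based checkpoint
- `model.safetensors` - A newer format that stores raw tensor data behind a JSON header; loading it involves no unpickling, so a file cannot run arbitrary code when opened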

These files contain the actual trained parameters of the model. Let's examine what's inside:

In [5]:
# Find the weights file
weights_file = None
for file in files:
    if file.endswith(".bin") or file.endswith(".safetensors"):
        weights_file = file
        break

if weights_file:
    print(f"Found weights file: {weights_file}")

    if weights_file.endswith(".bin"):
        weights_path = os.path.join(save_directory, weights_file)
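        # Load onto the CPU; weights_only=True restricts unpickling to tensors and plain containers
        state_dict = torch.load(weights_path, map_location="cpu", weights_only=True)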

        print("\nModel contains these weight matrices:")
        print(f"{'Layer Name':<50} {'Shape':<15} {'Preview'}")
        print("-" * 80)

        for i, (name, tensor) in enumerate(list(state_dict.items())[:10]):
            # Get first 3 values as preview
            preview = tensor.flatten()[:3].tolist()
            print(f"{name:<50} {str(tensor.shape):<15} {preview}...")

    elif weights_file.endswith(".safetensors"):
        # safe_open reads lazily: only the tensors requested below are loaded.
        # safe_open is imported at the top of the notebook, so a missing
        # safetensors install would already have failed at that import.
        weights_path = os.path.join(save_directory, weights_file)
        with safe_open(weights_path, framework="pt") as f:
            tensor_names = list(f.keys())[:10]

            print("\nModel contains these weight matrices:")
            print(f"{'Layer Name':<50} {'Shape':<15} {'Preview'}")
            print("-" * 80)

            for name in tensor_names:
                tensor = f.get_tensor(name)
                # Get first 3 values as preview
                preview = tensor.flatten()[:3].tolist()
                print(f"{name:<50} {str(tensor.shape):<15} {preview}...")

else:
    print("No weights file found")

Found weights file: model.safetensors

Model contains these weight matrices:
Layer Name                                         Shape           Preview
--------------------------------------------------------------------------------
transformer.h.0.attn.c_attn.bias                   torch.Size([2304]) [0.4693034589290619, -0.4959352910518646, -0.4157843589782715]...
transformer.h.0.attn.c_attn.weight                 torch.Size([768, 2304]) [-0.4988037049770355, -0.19897758960723877, -0.1046222522854805]...
transformer.h.0.attn.c_proj.bias                   torch.Size([768]) [0.16174378991127014, -0.16444097459316254, -0.15611258149147034]...
transformer.h.0.attn.c_proj.weight                 torch.Size([768, 768]) [0.25814932584762573, -0.16598303616046906, 0.062477629631757736]...
transformer.h.0.ln_1.bias                          torch.Size([768]) [0.00478767603635788, 0.01292799785733223, -0.018999796360731125]...
transformer.h.0.ln_1.weight                        torch.Size([768]) 
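
The tensor names and shapes can also be read without PyTorch, because a `.safetensors` file starts with an 8-byte little-endian integer giving the length of a JSON header, followed by that header, followed by the raw tensor bytes. The sketch below reads only the header using the standard library. To keep it self-contained it first writes a tiny toy file; `write_toy_safetensors`, `read_safetensors_header`, and the tensor name `toy.weight` are hypothetical names made up for this illustration.

```python
import json
import os
import struct
import tempfile


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit integer
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def write_toy_safetensors(path):
    """Write a minimal file holding one float32 tensor of shape [2, 2] (illustration only)."""
    header = {"toy.weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
    header_bytes = json.dumps(header).encode("utf-8")
    data = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)  # 4 floats * 4 bytes = 16 bytes
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))
        f.write(header_bytes)
        f.write(data)


toy_path = os.path.join(tempfile.mkdtemp(), "toy.safetensors")
write_toy_safetensors(toy_path)

toy_header = read_safetensors_header(toy_path)
for name, info in toy_header.items():
    if name == "__metadata__":  # optional free-form metadata entry, not a tensor
        continue
    print(name, info["dtype"], info["shape"], info["data_offsets"])
```

Pointing `read_safetensors_header` at `os.path.join(save_directory, "model.safetensors")` should list every tensor in the downloaded model with its dtype, shape, and byte offsets.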

## Tokenizer Components

The tokenizer is responsible for converting text into token IDs that the model can process. Let's examine its components.

Key fields in `tokenizer_config.json`:
- `model_max_length`: Maximum sequence length the tokenizer will handle
- `bos_token`, `eos_token`, etc.: Special tokens for different purposes

In [9]:
# Examine tokenizer_config.json
tokenizer_config_path = os.path.join(save_directory, "tokenizer_config.json")
if os.path.exists(tokenizer_config_path):
    with open(tokenizer_config_path, "r") as f:
        tokenizer_config = json.load(f)

    print("Tokenizer Configuration:")
    for key, value in tokenizer_config.items():
        print(f"- {key}: {value}")
else:
    print("No tokenizer_config.json found")

Tokenizer Configuration:
- add_prefix_space: False
- added_tokens_decoder: {'50256': {'content': '<|endoftext|>', 'lstrip': False, 'normalized': True, 'rstrip': False, 'single_word': False, 'special': True}}
- bos_token: <|endoftext|>
- clean_up_tokenization_spaces: False
- eos_token: <|endoftext|>
- extra_special_tokens: {}
- model_max_length: 1024
- tokenizer_class: GPT2Tokenizer
- unk_token: <|endoftext|>


In [11]:
# Examine vocab.json
vocab_path = os.path.join(save_directory, "vocab.json")
if os.path.exists(vocab_path):
    with open(vocab_path, "r" , encoding="utf-8") as f:
        vocab = json.load(f)

    print(f"Vocabulary size: {len(vocab)} tokens")

    # Show the first 20 tokens
    print("\nSample tokens (first 20):")
    for i, (token, token_id) in enumerate(list(vocab.items())[:20]):
        print(f"{token_id:5d}: {repr(token)}")

    # Show some interesting tokens
    print("\nSome interesting tokens:")
    interesting_tokens = ["hello", "world", "programming", "AI", "model"]
    for token in interesting_tokens:
        if token in vocab:
            print(f"{vocab[token]:5d}: {repr(token)}")

    # Show some special tokens
    print("\nSpecial tokens:")
    special_tokens = ["<|endoftext|>", "<|pad|>", "<|mask|>"]
    for token in special_tokens:
        if token in vocab:
            print(f"{vocab[token]:5d}: {repr(token)}")
else:
    print("No vocab.json found")

Vocabulary size: 50257 tokens

Sample tokens (first 20):
    0: '!'
    1: '"'
    2: '#'
    3: '$'
    4: '%'
    5: '&'
    6: "'"
    7: '('
    8: ')'
    9: '*'
   10: '+'
   11: ','
   12: '-'
   13: '.'
   14: '/'
   15: '0'
   16: '1'
   17: '2'
   18: '3'
   19: '4'

Some interesting tokens:
31373: 'hello'
 6894: 'world'
20185: 'AI'
19849: 'model'

Special tokens:
50256: '<|endoftext|>'
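
Two details of this output are easier to understand with the byte-level mapping in view. GPT-2 style tokenizers work on UTF-8 bytes, and each of the 256 possible byte values is represented in `vocab.json` by a printable stand-in character; a space becomes `Ġ`. That is probably why `'programming'` was not printed above: the vocabulary most likely stores such words only in their space-prefixed form. The sketch below is a re-implementation of that mapping written for this lesson (the helper names `to_vocab_form` and `from_vocab_form` are hypothetical), not a call into the library.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Remaining bytes (control characters, space, ...) are moved to code points 256 and up
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


byte_to_char = bytes_to_unicode()
char_to_byte = {c: b for b, c in byte_to_char.items()}


def to_vocab_form(text):
    """Rewrite text the way it appears as a key in vocab.json."""
    return "".join(byte_to_char[b] for b in text.encode("utf-8"))


def from_vocab_form(token):
    """Invert to_vocab_form: turn a vocab.json key back into ordinary text."""
    return bytes(char_to_byte[c] for c in token).decode("utf-8")


print(to_vocab_form(" quick"))           # Ġquick
print(repr(from_vocab_form("Ġquick")))   # ' quick'
print(list(byte_to_char.values())[:5])   # ['!', '"', '#', '$', '%']
```

The last line mirrors the start of the vocabulary listing above: the lowest token IDs belong to the single-byte symbols, in the order this mapping produces them.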


In [12]:
# Examine merges.txt (BPE merges)
merges_path = os.path.join(save_directory, "merges.txt")
if os.path.exists(merges_path):
    with open(merges_path, "r", encoding="utf-8") as f:
        merges = f.readlines()

    print(f"Number of BPE merges: {len(merges)}")

    # Show the first few merges
    print("\nFirst 10 BPE merges:")
    for i, merge in enumerate(merges[:10]):
        print(f"{i+1}: {merge.strip()}")

    print("\nUnderstanding BPE merges:")
    print("- Each line shows two tokens that get merged into one")
    print("- The merges are applied in order during tokenization")
    print("- This enables the model to handle unknown words by breaking them into subwords")
else:
    print("No merges.txt found")

Number of BPE merges: 50001

First 10 BPE merges:
1: #version: 0.2
2: Ġ t
3: Ġ a
4: h e
5: i n
6: r e
7: o n
8: Ġt he
9: e r
10: Ġ s

Understanding BPE merges:
- Each line shows two tokens that get merged into one
- The merges are applied in order during tokenization
- This enables the model to handle unknown words by breaking them into subwords
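
One caveat about the count above: the first line of `merges.txt` is a `#version` header, not a merge rule, so 50001 lines correspond to 50000 merges. Together with 256 single-byte tokens and the one special token `<|endoftext|>`, that gives the vocabulary size of 50257 seen earlier. The sketch below shows how merges are applied in rank order, using only the nine rules printed above. It is a simplified illustration: the real tokenizer first splits text into words with a regular expression and maps bytes to stand-in characters, and `parse_merges` and `apply_bpe` are hypothetical names.

```python
def parse_merges(lines):
    """Turn lines of merges.txt into a list of (left, right) pairs, skipping the header."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#version"):
            continue
        left, right = line.split(" ")
        pairs.append((left, right))
    return pairs


def apply_bpe(word, ranks):
    """Repeatedly merge the adjacent pair with the lowest rank until none applies."""
    symbols = list(word)
    while len(symbols) > 1:
        adjacent = set(zip(symbols, symbols[1:]))
        best = min(adjacent, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no adjacent pair matches a known merge rule
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# The first ten lines of merges.txt, copied from the output above
merge_lines = ["#version: 0.2", "Ġ t", "Ġ a", "h e", "i n", "r e", "o n", "Ġt he", "e r", "Ġ s"]
ranks = {pair: rank for rank, pair in enumerate(parse_merges(merge_lines))}

print(len(ranks))                  # 9
print(apply_bpe("Ġthe", ranks))    # ['Ġthe']
print(apply_bpe("Ġsing", ranks))   # ['Ġs', 'in', 'g']
```

With only nine rules, `Ġsing` stays split into three pieces; the full merge list would keep combining pieces until no rule matches.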


## Tokenization in Action

Let's see how the tokenizer works with a real example:

In [13]:
# Reload the tokenizer to ensure we're using the local files
local_tokenizer = AutoTokenizer.from_pretrained(save_directory)

# Define a sample text
sample_text = "The quick brown fox jumps over the lazy dog. This is an example of tokenization in NLP."

# Tokenize the text
tokens = local_tokenizer.tokenize(sample_text)
token_ids = local_tokenizer.encode(sample_text)

# Display the results
print(f"Original text: {sample_text}")
print(f"\nTokenized into {len(tokens)} tokens:")
print(tokens)

print(f"\nConverted to {len(token_ids)} token IDs:")
print(token_ids)

# Show token to ID mapping
print("\nToken to ID mapping:")
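# encode() adds no EOS token for GPT-2 here (21 tokens, 21 IDs), so the slice only guards tokenizers that do
for token, token_id in zip(tokens, token_ids[:-1] if token_ids[-1] == local_tokenizer.eos_token_id else token_ids):
    print(f"{token:15} → {token_id}")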

# Decode back to text
decoded_text = local_tokenizer.decode(token_ids)
print(f"\nDecoded text: {decoded_text}")

Original text: The quick brown fox jumps over the lazy dog. This is an example of tokenization in NLP.

Tokenized into 21 tokens:
['The', 'Ġquick', 'Ġbrown', 'Ġfox', 'Ġjumps', 'Ġover', 'Ġthe', 'Ġlazy', 'Ġdog', '.', 'ĠThis', 'Ġis', 'Ġan', 'Ġexample', 'Ġof', 'Ġtoken', 'ization', 'Ġin', 'ĠN', 'LP', '.']

Converted to 21 token IDs:
[464, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290, 13, 770, 318, 281, 1672, 286, 11241, 1634, 287, 399, 19930, 13]

Token to ID mapping:
The             → 464
Ġquick          → 2068
Ġbrown          → 7586
Ġfox            → 21831
Ġjumps          → 18045
Ġover           → 625
Ġthe            → 262
Ġlazy           → 16931
Ġdog            → 3290
.               → 13
ĠThis           → 770
Ġis             → 318
Ġan             → 281
Ġexample        → 1672
Ġof             → 286
Ġtoken          → 11241
ization         → 1634
Ġin             → 287
ĠN              → 399
LP              → 19930
.               → 13

Decoded text: The quick brown fox jumps over the lazy dog. This is an example of tokenization in NLP.
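
Decoding is the reverse walk: look up each ID's token string, concatenate the strings, and turn the stand-in characters back into bytes. For plain ASCII text like this example, the only stand-in that matters is `Ġ` for a space, so a simplified version fits in a few lines. This is a sketch with a hypothetical helper `detokenize`; the real decoder maps every character back to a byte and then decodes UTF-8.

```python
# Token strings copied from the tokenizer output above
tokens = ['The', 'Ġquick', 'Ġbrown', 'Ġfox', 'Ġjumps', 'Ġover', 'Ġthe', 'Ġlazy',
          'Ġdog', '.', 'ĠThis', 'Ġis', 'Ġan', 'Ġexample', 'Ġof', 'Ġtoken',
          'ization', 'Ġin', 'ĠN', 'LP', '.']


def detokenize(token_strings):
    """Join token strings and restore spaces (simplified: handles ASCII text only)."""
    return "".join(token_strings).replace("Ġ", " ")


restored = detokenize(tokens)
print(restored)
# The quick brown fox jumps over the lazy dog. This is an example of tokenization in NLP.
```

Note how `Ġtoken` and `ization` join into one word with no space between them: word boundaries live in the `Ġ` prefix, not in the gaps between tokens.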

## Conclusion

To recap, the files in a locally saved model directory fall into three groups:

### Model Architecture and Configuration

- **config.json**: Defines the neural network architecture and hyperparameters (layers, attention heads, dimensions).
- **generation_config.json**: Stores default text-generation settings, such as temperature, top_p, and maximum length, when the model's authors specify them.

### Model Weights

- **model.safetensors**: Contains all trained neural network weights - the actual learned parameters.

### Tokenizer Components

- **vocab.json**: Maps text tokens to their corresponding IDs in the model's vocabulary.
- **merges.txt**: Contains the BPE merge rules that determine how characters combine into subword tokens.
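- **tokenizer.json**: A single-file serialization of the whole tokenizer (vocabulary, merge rules, and pre- and post-processing steps), used by the fast tokenizer implementation.
- **tokenizer_config.json**: Settings for the tokenizer, such as the tokenizer class, the special tokens, and `model_max_length`.
- **special_tokens_map.json**: Names the tokens that play special roles (BOS, EOS, UNK, padding). For this model, the tokenizer configuration sets BOS, EOS, and UNK all to `<|endoftext|>`.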