## Example - IBM Granite models

Because resources are limited, we won't train or fine-tune full models here. Instead, we will demonstrate the main advantage of MoE models by comparing the inference speed of an IBM Granite MoE model with that of a dense model of similar size.

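IBM Granite is a suite of large language models (LLMs) developed by IBM; the Granite 3.1 models used here were released in December 2024. Dense models are available with 2b and 8b parameters, while MoE models are available as 3b-a800m and 1b-a400m. The suffix gives the number of active parameters: 3b-a800m has about 3 billion parameters in total, of which about 800 million are used for any single token. The MoE models use the same architecture as the dense models, but with the addition of Mixture-of-Experts layers, which route each token to a small subset of experts. This lets the model grow in total size without a proportional increase in computational cost per token.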
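To make the naming concrete, here is a small sketch that computes what share of the weights takes part in processing one token. The parameter counts are nominal values read off the model names (an assumption; the exact counts differ slightly), and `active_fraction` is a helper introduced only for this illustration.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's weights that take part in computing one token."""
    return active_params / total_params


# Nominal sizes taken from the model names (assumption: "3b-a800m" means
# roughly 3B total parameters, of which roughly 800M are active per token).
models = {
    "granite-3.1-3b-a800m (MoE)": (3e9, 800e6),
    "granite-3.1-1b-a400m (MoE)": (1e9, 400e6),
    "granite-3.1-2b (dense)": (2e9, 2e9),
}

for name, (total, active) in models.items():
    share = active_fraction(total, active)
    print(f"{name}: {share:.0%} of weights active per token")
```

A dense model always uses all of its weights, so the compute per token of the 3b-a800m model is closer to that of an 800M dense model than to that of the 2b dense model.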

To speed things up, we will use quantized GGUF versions of the models, which can run on a single GPU with limited memory. The 3b model in Q4_K_M format weighs only about 2 GB.
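The file size can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes an average of about 4.8 bits per weight for Q4_K_M (an approximation, since the format mixes several quantization types and the real average depends on the architecture) and uses the nominal parameter counts from the model names.

```python
def estimated_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


BITS_Q4_K_M = 4.8  # assumed average bits per weight, not an exact figure
BITS_FP16 = 16.0   # unquantized half-precision weights, for comparison

for name, n_params in [("dense 2b", 2e9), ("MoE 3b-a800m", 3e9)]:
    q4 = estimated_size_gb(n_params, BITS_Q4_K_M)
    fp16 = estimated_size_gb(n_params, BITS_FP16)
    print(f"{name}: ~{q4:.1f} GB at Q4_K_M vs ~{fp16:.1f} GB at FP16")
```

Note that the estimate depends on the total parameter count, not the active one: every expert has to be loaded, so the MoE model needs more memory than the dense 2b model even though it does less work per token.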

We will use the `llama-cpp-python` library, a Python binding for `llama.cpp`, one of the most popular and efficient C++ inference engines for LLMs.

In [3]:
import time
from llama_cpp import Llama

In [4]:
llm_dense = Llama.from_pretrained(
    "bartowski/granite-3.1-2b-instruct-GGUF",
    filename="granite-3.1-2b-instruct-Q4_K_M.gguf",
    n_gpu_layers=-1, # Comment this line to run on CPU
    n_ctx=1024,
    local_dir="models",
    cache_dir="cache",
    verbose=False
)

llm_sparse = Llama.from_pretrained(
    "bartowski/granite-3.1-3b-a800m-instruct-GGUF",
    filename="granite-3.1-3b-a800m-instruct-Q4_K_M.gguf",
    n_gpu_layers=-1,  # Comment this line to run on CPU
    n_ctx=1024,
    local_dir="models",
    cache_dir="cache",
    verbose=False
)

llama_context: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


## Comparing inference speed

Now let's compare the inference speed of both models on the same prompt.

First, we will test a complex prompt, to showcase the difference in performance.

In [5]:
complex_chat = [
    {
        "role": "system",
        "content": (
            "You are a helpful assistant that explains complex scientific concepts "
            "in simple terms."
        )
    },
    {
        "role": "user",
        "content": (
            "Explain the theory of relativity in simple terms, "
            "and provide an example of how it applies to everyday life."
        )
    }
]

In [6]:
start = time.time()

completion_dense = llm_dense.create_chat_completion(messages=complex_chat, max_tokens=512, stream=False)
print("Time taken (dense): %.2f" % (time.time() - start))

Time taken (dense): 28.21


In [7]:
start = time.time()

completion_sparse = llm_sparse.create_chat_completion(messages=complex_chat, max_tokens=512, stream=False)
print("Time taken (sparse): %.2f" % (time.time() - start))

Time taken (sparse): 14.05


As the results show, the MoE model generates its completion about twice as fast as the dense model (14.05 s vs 28.21 s), despite having more parameters in total. The reason is that only about 800M of its parameters are active for each token, so it does less computation per token than the 2b dense model.
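Wall-clock time alone can mislead when the two completions differ in length, so tokens per second is a fairer metric. The sketch below assumes the completion is the OpenAI-style dictionary returned by `create_chat_completion`, with a `usage` entry containing `completion_tokens`; the helper names and the example token count are made up for illustration.

```python
def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


def tokens_per_second(completion: dict, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    return completion["usage"]["completion_tokens"] / elapsed_s


# Ratio of the two timings printed above
print(f"Speedup: {speedup(28.21, 14.05):.2f}x")

# Stand-in for a real completion; the token count here is hypothetical
fake_completion = {"usage": {"completion_tokens": 512}}
print(f"{tokens_per_second(fake_completion, 14.05):.1f} tokens/s")
```

In the notebook you would call `tokens_per_second(completion_dense, elapsed)` with the elapsed time measured around the corresponding `create_chat_completion` call.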

Now let's compare the outputs of both models.

In [8]:
print(completion_dense['choices'][0]['message']['content'])

The theory of relativity, proposed by Albert Einstein, is actually made up of two parts: the Special Theory of Relativity and the General Theory of Relativity. Let's break it down into simpler terms:

1. Special Theory of Relativity (1905): This theory has two main ideas. First, the laws of physics are the same for all observers moving at a constant speed (not accelerating) relative to each other. Second, the speed of light in a vacuum is constant, no matter how fast you're moving or where you're moving from. This means that as you approach the speed of light, time slows down for you compared to a stationary observer. It's like if you were on a super-fast spaceship and tried to throw a ball; to someone watching from Earth, the ball would appear to move slower and slower as you approached the speed of light.

2. General Theory of Relativity (1915): This theory extends the Special Theory of Relativity to include gravity. It states that massive objects cause a distortion in space-time, wh

In [9]:
print(completion_sparse['choices'][0]['message']['content'])

Sure, I'd be happy to explain the theory of relativity in simple terms and provide an everyday example.

1. **Special Theory of Relativity (1905)**: This theory, proposed by Albert Einstein, states that the laws of physics are the same for all non-accelerating observers, and that the speed of light in a vacuum is constant, regardless of the motion of the light source or the observer.

In simpler terms, it means that:
- **Time and Space are Relative**: They don't move at the same speed for everyone. For example, if you're traveling at high speeds, time might appear to move slower for you compared to someone who's stationary. This is known as time dilation.
- **Length Contraction**: Objects in motion appear shorter to a stationary observer. So, if you were traveling at a high speed, you might appear shorter to someone standing still.

2. **General Theory of Relativity (1915)**: This theory, also by Einstein, is an extension of the special theory of relativity. It introduces the idea that

#### Commentary
At first glance both outputs seem similar. In this run, however, the dense model gives a more detailed explanation of the theory of relativity, while the MoE model gives a more concise summary. Depending on the application, one might prefer the more detailed response or the more concise one. MoE models may be more suitable for real-time applications where speed is crucial.

The next example will highlight the difference in output quality between the two models even more clearly.

## Simple prompt

We will also evaluate both models on a simpler prompt, which makes the difference in output quality easier to see.

In [10]:
simple_chat = [
    {
        "role": "system",
        "content": "You are a professional storyteller who creates engaging and imaginative stories."
    },
    {
        "role": "user",
        "content": "Write a four-sentence story about a giraffe who gets a job as a window washer for a skyscraper."
    }
]

In [11]:
start = time.time()

completion_dense = llm_dense.create_chat_completion(messages=simple_chat, max_tokens=512, stream=False)
print("Time taken (dense): %.2f" % (time.time() - start))

Time taken (dense): 11.57


In [12]:
start = time.time()

completion_sparse = llm_sparse.create_chat_completion(messages=simple_chat, max_tokens=512, stream=False)
print("Time taken (sparse): %.2f" % (time.time() - start))

Time taken (sparse): 7.06


In [13]:
print(completion_dense['choices'][0]['message']['content'])

In the heart of a bustling metropolis, a unique giraffe named Gazelle, renowned for his extraordinary neck, found himself yearning for a change of pace. Unbeknownst to his herd, Gazelle had always been captivated by the city's towering skyscrapers, their glass facades reflecting the sun's dazzling dance. One day, a peculiar job posting caught his eye: "Giraffe Window Washer Needed." Intrigued, Gazelle applied, and to his astonishment, was offered the position. With a sturdy safety harness and a newfound appreciation for the city's heights, Gazelle began his unconventional career, washing windows from the 50th floor, his long neck a marvel to the city below, forever changing the perspective of both the giraffe and the urban jungle.


In [14]:
print(completion_sparse['choices'][0]['message']['content'])

In the heart of a bustling metropolis, a majestic giraffe named Kofi lived. Kofi was not your ordinary giraffe; he had a peculiar dream - to work as a window washer on the city's towering skyscrapers. One day, an opportunity knocked when a local construction company needed a unique window washer for their ambitious project.

With a sturdy branch and an unyielding determination, Kofi embarked on his new journey. He scaled the skyscrapers, his long neck reaching heights unimaginable to most. His daily routine was a sight to behold, as he swung from building to building, leaving a trail of sparkling windows.

Despite the challenges, Kofi found joy in his work. He became a beloved figure in the city, a symbol of resilience and a reminder that dreams, no matter how unconventional, can turn into reality. His story echoed through the city, inspiring everyone who saw him, proving that even the tallest of creatures could find their place in the world.


#### Commentary

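See? For simple prompts, the speed gap narrows (7.06 s vs 11.57 s, about 1.6x). The quality of the responses, however, now differs noticeably. Neither model keeps to the requested four sentences, but the dense model comes close with five, while the MoE model writes nine across three paragraphs. The dense model also produces a more vivid, creative, and engaging story, while the MoE model's output is more generic and less imaginative. This suggests that while MoE models are efficient, they may not always match the quality of dense models, especially for creative tasks. Keep in mind that this is a single sample from each model.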

## Conclusion

In this example, we demonstrated the inference-speed advantage of an MoE model over a dense model of similar size. The speedup comes from lower compute per token, not from lower memory use: all experts must be loaded, so the MoE model's memory footprint follows its total parameter count. While MoE models excel in speed, dense models may still hold an edge in output quality for certain tasks. The choice between MoE and dense models should be made based on the specific requirements of the application, balancing speed, resource constraints, and output quality.

Feel free to experiment with different prompts and tasks to further explore the capabilities of MoE models!

---
# (Optional) Build Your Own MoE Model

Now that you've seen how MoE models work in practice, it's time to build and train your own! We have a complete implementation of a Mixture-of-Experts Transformer for text classification in this repository.

## Project Structure

The MoE implementation is organized in modular files:
- **`models.py`** - MoE architecture (SimpleMoE, SimpleMoEDecoderLayer, SimpleMoETransformer)
- **`data.py`** - Dataset loading and preprocessing
- **`training.py`** - Training and evaluation functions
- **`train.py`** - Main training script
- **`visualize_experts.py`** - Visualize expert activations per token


## Exercise 1: Visualize Expert Activations


In [2]:
from visualize_experts import visualize_text

# Visualize a positive review
visualize_text(
    text="This movie was absolutely fantastic! The acting was superb and the plot kept me engaged.",
    all_layers=True,
    output_path="positive_review.png"
)

# Visualize a negative review
visualize_text(
    text="Terrible movie. Boring plot and bad acting. Complete waste of time.",
    all_layers=True,
    output_path="negative_review.png"
)

# Visualize a mixed review
visualize_text(
    text="The cinematography was beautiful but the story was predictable.",
    all_layers=True,
    output_path="mixed_review.png"
)

Text: This movie was absolutely fantastic! The acting was superb and the plot kept me engaged.
Prediction: POS (probability=0.993)
Saved combined heatmap to: positive_review.png
Text: Terrible movie. Boring plot and bad acting. Complete waste of time.
Prediction: NEG (probability=0.002)
Saved combined heatmap to: negative_review.png
Text: The cinematography was beautiful but the story was predictable.
Prediction: POS (probability=0.841)
Saved combined heatmap to: mixed_review.png


**Questions to analyze:**
- Do different sentiments activate different experts?
- Are certain experts more "specialized" (used more often for specific words)?
- How does expert selection differ between Layer 0 and Layer 1?
- Which tokens activate the most diverse set of experts?

## Exercise 2: Experiment with Model Architecture (Intermediate)

**Questions to answer:**
- Which configuration gives the best accuracy?
- Which configuration trains the fastest?
- Is there a "sweet spot" for the number of experts?
- How much does adding more layers help?


## Exercise 3: Custom Routing Strategy (Challenge)

Implement a different expert routing strategy.

**Current implementation:** Top-k=2 soft routing (weighted average of top 2 experts)
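As a reference point for the exercise, here is a minimal pure-Python sketch of top-k soft routing for a single token. It is not the code in `models.py`: the function names are illustrative, the experts are toy scalar functions, and the weights are assumed to be a softmax over the selected logits only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are a softmax over the selected logits, so they sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts for one token."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Toy experts: plain scalar functions instead of feed-forward networks
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]  # router scores for one token

print(top_k_route(logits, k=2))            # experts 1 and 3, weight 0.5 each
print(moe_forward(3.0, logits, experts))   # 0.5 * 6 + 0.5 * 9 = 7.5
```

With `k=1` the same function selects a single expert with weight 1.0, which is the starting point for Option A below.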

**Your task:** Modify `models.py` to implement one of these alternatives:

**Option A: Top-1 Hard Routing**
- Select only the single best expert per token
- Faster but less smooth

**Option B: Load Balancing**
- Add an auxiliary loss to encourage equal expert usage
- Prevents expert collapse (all tokens routed to the same expert)

**Option C: Noisy Top-k**
- Add noise to the router logits during training, before selecting the top-k experts
- Helps exploration and generalization
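
If you want a starting point for Options B and C, the sketch below shows one possible formulation in plain Python. The auxiliary loss follows the style used in the Switch Transformer (number of experts times the sum over experts of token fraction times mean router probability), and the noise function uses a fixed standard deviation, which is a simplification of learned noise scales. Both are assumptions for illustration, not the repository's implementation, and a real version would operate on tensors inside `models.py`.

```python
import random


def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary loss: num_experts * sum_i f_i * P_i.

    router_probs: one list of expert probabilities per token.
    assignments:  index of the expert each token was sent to (top-1).
    f_i is the fraction of tokens sent to expert i,
    P_i is the mean router probability given to expert i.
    """
    n_tokens = len(assignments)
    f = [assignments.count(i) / n_tokens for i in range(num_experts)]
    p = [sum(tok[i] for tok in router_probs) / n_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


def add_gaussian_noise(router_logits, std=1.0, rng=random):
    """Perturb router logits before top-k selection (training only)."""
    return [x + rng.gauss(0.0, std) for x in router_logits]


# Balanced routing: both experts get half the tokens
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)

# Collapsed routing: every token goes to expert 0
collapsed = load_balancing_loss([[1.0, 0.0]] * 4, [0, 0, 0, 0], num_experts=2)

print(balanced, collapsed)  # the collapsed case is penalized more

rng = random.Random(0)  # seeded for reproducibility
print(add_gaussian_noise([0.1, 2.0, -1.0, 0.5], std=0.5, rng=rng))
```

The loss is smallest when tokens are spread evenly across experts, so adding it (scaled by a small coefficient) to the classification loss pushes the router away from collapse.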
