## Expert Usage for Mixtral 8x7B

After experimenting with the expert hooks in Mixtral 8x7B, I want to see how many tokens from different datasets are sent to which experts.

In [1]:
import torch as t
import pandas as pd
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig
from datasets import load_dataset
import altair as alt
from mlx_lm import load, generate

import os
from collections import defaultdict
from dotenv import load_dotenv
from huggingface_hub import login

In [2]:
load_dotenv()
login(os.getenv("HF_TOKEN"), add_to_git_credential=True)

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (osxkeychain,store).
Your token has been saved to /Users/kj3moraes/.cache/huggingface/token
Login successful


## Datasets

I will be using 3 different datasets and recording the percentage of tokens handled by each expert in the first and last layers of the model. Let's see how this differs across datasets. The datasets are:
- `stanfordnlp/imdb` - this is a classification task for movie reviews.
- `databricks/databricks-dolly-15k` - plain ol' question answering
- `bigcode/bigcodebench` - code generation from prompt

In [3]:
imdb_dataset = load_dataset("stanfordnlp/imdb", split="test")
qa_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
code_dataset = load_dataset("bigcode/bigcodebench", split="v0.1.0_hf")
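
Each dataset stores its prompt text under a different column, so it can help to keep the column names in one place. The sketch below assumes `text` for IMDB, `instruction` for Dolly (the column used later in this notebook) and `instruct_prompt` for BigCodeBench. `PROMPT_COLUMNS` and `iter_prompts` are hypothetical names, not part of any library, and the column names should be checked against the dataset cards.

```python
from itertools import islice

# Assumed column holding the prompt text in each dataset (verify on the dataset cards).
PROMPT_COLUMNS = {
    "imdb": "text",
    "qa": "instruction",
    "code": "instruct_prompt",
}

def iter_prompts(dataset, column, limit=None):
    """Return an iterator over up to `limit` prompts from `dataset[column]`.

    Works with anything where `dataset[column]` gives an iterable of strings,
    e.g. a Hugging Face `Dataset` or a plain dict of lists.
    """
    return islice(dataset[column], limit)
```

For example, `iter_prompts(qa_dataset, PROMPT_COLUMNS["qa"], limit=100)` would give the first 100 Dolly instructions, which keeps a first run short.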

## Model Setup

In [5]:
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=t.float16,
    bnb_4bit_use_double_quant=True
)

model = AutoModel.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

RuntimeError: No GPU found. A GPU is needed for quantization.

In [None]:
NUM_LAYERS = len(model.layers)
NUM_EXPERTS = len(model.layers[0].block_sparse_moe.experts)
print(model)

In [None]:
first_expert_usage = defaultdict(int) 
last_expert_usage = defaultdict(int)

In [None]:
def first_expert_update(module, inputs, output):
    # output holds the router logits, one row per token: [S_l, NUM_EXPERTS]
    _, topk_index = t.topk(output, 2, dim=1)
    # topk_list has shape [S_l, 2], where S_l is the length of the sequence
    topk_list = topk_index.tolist()

    # iterate over all the tokens in the sequence
    for expert_1, expert_2 in topk_list:
        first_expert_usage[expert_1] += 1
        first_expert_usage[expert_2] += 1

def last_expert_update(module, inputs, output):
    # output holds the router logits, one row per token: [S_l, NUM_EXPERTS]
    _, topk_index = t.topk(output, 2, dim=1)
    # topk_list has shape [S_l, 2], where S_l is the length of the sequence
    topk_list = topk_index.tolist()

    # iterate over all the tokens in the sequence
    for expert_1, expert_2 in topk_list:
        last_expert_usage[expert_1] += 1
        last_expert_usage[expert_2] += 1


I will only register hooks on the router (the `gate` module) of the first and last layers. We can then try to visualize the results.

In [None]:
model.layers[0].block_sparse_moe.gate.register_forward_hook(first_expert_update)
model.layers[-1].block_sparse_moe.gate.register_forward_hook(last_expert_update)
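
Because `first_expert_usage` and `last_expert_usage` accumulate over everything the model sees, comparing datasets needs either a reset between runs or one counter per dataset. Below is a minimal sketch of a hook factory that writes into whichever counter it is given. `count_topk` and `make_usage_hook` are hypothetical helpers, and the gate output is assumed to be router logits of shape `[num_tokens, num_experts]`.

```python
from collections import Counter

def count_topk(counter, topk_list):
    """Add one count per (token, selected expert) pair."""
    for experts in topk_list:
        counter.update(experts)

def make_usage_hook(counter, k=2):
    """Build a forward hook that tallies the top-k experts chosen by a gate."""
    import torch as t  # imported lazily so count_topk works without torch

    def hook(module, inputs, output):
        # Assumption: output is the router logits, [num_tokens, num_experts]
        topk_index = t.topk(output, k, dim=-1).indices
        count_topk(counter, topk_index.tolist())

    return hook

# Example wiring (commented out because it needs the loaded model):
# usage = {"first": Counter(), "last": Counter()}
# gate = model.layers[0].block_sparse_moe.gate
# handle = gate.register_forward_hook(make_usage_hook(usage["first"]))
# ... run one dataset, then call handle.remove() before the next one
```

`register_forward_hook` returns a handle, so calling `handle.remove()` between datasets avoids counting the same tokens into two counters.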

## Running the Experiment

In [7]:
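for instruction in qa_dataset["instruction"]:
    # Tokenize one prompt and move it to the model's device
    tok_instruction = tokenizer(instruction, return_tensors="pt").to(model.device)
    print(tok_instruction)
    # Inference only, so skip gradient tracking
    with t.no_grad():
        outputs = model(**tok_instruction)
    print(outputs)
    break  # smoke test on a single example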

NameError: name 'tokenizer' is not defined

## Plotting 

Now we can plot these usage charts per dataset. 
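
One way to get there, sketched under some assumptions: convert the raw counts into percentages per expert, then draw one bar chart per dataset and layer with Altair. `usage_to_rows` and `plot_usage` are hypothetical helper names. Since every token is counted twice (once for each of its top-2 experts), the percentages are shares of routing assignments rather than shares of tokens.

```python
def usage_to_rows(usage, num_experts, dataset, layer):
    """Turn {expert_index: count} into one row per expert with a percentage."""
    total = sum(usage.get(e, 0) for e in range(num_experts))
    rows = []
    for expert in range(num_experts):
        count = usage.get(expert, 0)
        percent = 100.0 * count / total if total else 0.0
        rows.append({"dataset": dataset, "layer": layer,
                     "expert": expert, "percent": percent})
    return rows

def plot_usage(rows):
    """Bar charts of expert usage: one column per dataset, one row per layer."""
    import altair as alt  # imported lazily; only needed for plotting
    import pandas as pd

    return (
        alt.Chart(pd.DataFrame(rows))
        .mark_bar()
        .encode(
            x=alt.X("expert:O", title="Expert"),
            y=alt.Y("percent:Q", title="Routing assignments (%)"),
            column="dataset:N",
            row="layer:N",
        )
    )
```

With the counters from the hooks, something like `plot_usage(usage_to_rows(first_expert_usage, NUM_EXPERTS, "qa", "first") + usage_to_rows(last_expert_usage, NUM_EXPERTS, "qa", "last"))` would show both layers for one dataset.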