# Tutorial 5: Neural Architecture Search (NAS) with Mase and Optuna

In this tutorial, we'll see how Mase can be integrated with Optuna, the popular hyperparameter optimization framework, to search for a BERT model optimized for sequence classification on the IMDb dataset. We'll take the Optuna-generated model and import it into Mase, then run the CompressionPipeline to prepare the model for edge deployment by quantizing and pruning its weights.

As we'll see, running Architecture Search with Mase/Optuna involves the following steps.

1. **Define the search space**: this is a dictionary containing the range of values for each parameter at each layer in the model.

2. **Write the model constructor**: this is a function that uses Optuna utilities to sample a model from the search space and constructs the model using the `from_config` class method from `transformers`.

3. **Write the objective function**: this function calls on the model constructor defined in Step 2 and defines the training/evaluation setup for each search iteration.

4. **Go!** Choose an Optuna sampler, create a study and launch the search.

In [1]:
checkpoint = "prajjwal1/bert-tiny"
tokenizer_checkpoint = "bert-base-uncased"
dataset_name = "imdb"

First, fetch the dataset using the `get_tokenized_dataset` utility.

In [2]:
from chop.tools import get_tokenized_dataset

dataset, tokenizer = get_tokenized_dataset(
    dataset=dataset_name,
    checkpoint=tokenizer_checkpoint,
    return_tokenizer=True,
)

INFO     Tokenizing dataset imdb with AutoTokenizer for bert-base-uncased.


## 1. Defining the Search Space

We'll start by defining a search space, i.e. enumerating the possible combinations of hyperparameters that Optuna can choose during search. We'll explore the following range of values for the model's hidden size, intermediate size, number of layers and number of heads, inspired by the [NAS-BERT paper](https://arxiv.org/abs/2105.14444).

In [3]:
import torch.nn as nn
from chop.nn.modules import Identity

search_space = {
    "num_layers": [2, 4, 8],
    "num_heads": [2, 4, 8, 16],
    # hidden size is the embedding dimension in transformers
    "hidden_size": [128, 192, 256, 384, 512],
    # intermediate size is the dimension of the feedforward layer in transformers
    "intermediate_size": [512, 768, 1024, 1536, 2048],
    "linear_layer_choices": [
        nn.Linear,
        Identity,
    ],
}
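As a quick sanity check on this space, the sketch below counts the architecture combinations and confirms that every `hidden_size` divides evenly by every `num_heads`, which multi-head attention requires. It copies only the integer-valued entries of `search_space` into a local `space` dictionary; the variable names are illustrative and not part of Mase.

```python
from itertools import product
from math import prod

# Copy of the integer-valued part of the tutorial's search space.
space = {
    "num_layers": [2, 4, 8],
    "num_heads": [2, 4, 8, 16],
    "hidden_size": [128, 192, 256, 384, 512],
    "intermediate_size": [512, 768, 1024, 1536, 2048],
}

# Number of distinct (layers, heads, hidden, intermediate) combinations.
num_configs = prod(len(values) for values in space.values())

# Multi-head attention splits hidden_size evenly across the heads, so every
# (hidden_size, num_heads) pair must divide without remainder.
invalid_pairs = [
    (hidden, heads)
    for hidden, heads in product(space["hidden_size"], space["num_heads"])
    if hidden % heads != 0
]

print(num_configs)    # 300
print(invalid_pairs)  # []
```

The per-layer `linear_layer_choices` multiply this count further, since each eligible linear layer adds its own two-way choice.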

## 2. Writing a Model Constructor

We define the following function, which is called in each iteration of the search. It receives the `trial` argument, an Optuna object with a rich API (see the [Trial documentation](https://optuna.readthedocs.io/en/stable/reference/trial.html) for details). Here, we use `trial.suggest_int` and `trial.suggest_categorical` to have the chosen sampler pick parameter values and layer types. For the integer parameters, the suggested integer is an index into the corresponding list in the search space defined in the previous cell.

In [4]:
from transformers import AutoConfig, AutoModelForSequenceClassification
from chop.tools.utils import deepsetattr


def construct_model(trial):
    config = AutoConfig.from_pretrained(checkpoint)

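    # Update the parameters in the config. BertConfig names the layer and head
    # counts `num_hidden_layers` and `num_attention_heads`, so map the search
    # space keys onto the attribute names the config actually reads.
    config_attrs = {
        "num_layers": "num_hidden_layers",
        "num_heads": "num_attention_heads",
        "hidden_size": "hidden_size",
        "intermediate_size": "intermediate_size",
    }
    for param, attr in config_attrs.items():
        chosen_idx = trial.suggest_int(param, 0, len(search_space[param]) - 1)
        setattr(config, attr, search_space[param][chosen_idx])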

    trial_model = AutoModelForSequenceClassification.from_config(config)

    for name, layer in trial_model.named_modules():
        if isinstance(layer, nn.Linear) and layer.in_features == layer.out_features:
            new_layer_cls = trial.suggest_categorical(
                f"{name}_type",
                search_space["linear_layer_choices"],
            )

            if new_layer_cls == nn.Linear:
                continue
            elif new_layer_cls == Identity:
                new_layer = Identity()
                deepsetattr(trial_model, name, new_layer)
            else:
                raise ValueError(f"Unknown layer type: {new_layer_cls}")

    return trial_model
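The index-based sampling used in `construct_model` can be seen in isolation with a small sketch. `FakeTrial` is a hypothetical stand-in written for this illustration; it only mimics the inclusive bounds of `trial.suggest_int` and is not an Optuna class.

```python
import random

search_space = {"hidden_size": [128, 192, 256, 384, 512]}


class FakeTrial:
    """Hypothetical stand-in for an Optuna trial; it only mimics suggest_int."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.params = {}

    def suggest_int(self, name, low, high):
        # Both bounds are inclusive, as in Optuna's trial.suggest_int.
        value = self.rng.randint(low, high)
        self.params[name] = value
        return value


def decode(trial, param):
    # Sample an index first, then look up the actual value in the search space.
    idx = trial.suggest_int(param, 0, len(search_space[param]) - 1)
    return search_space[param][idx]


trial = FakeTrial(seed=0)
hidden_size = decode(trial, "hidden_size")
print(trial.params["hidden_size"], hidden_size)
```

One consequence of this design is that the trial records the index, not the value, so the parameters reported for a trial need the same lookup before they can be read as hidden sizes or head counts.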

## 3. Defining the Objective Function

Next, we define the objective function for the search, which gets called on each trial. In each trial, we create a new model instance with hyperparameters chosen by the defined sampler. We then use the `get_trainer` utility in Mase to run a training loop on the IMDb dataset for a configurable number of epochs (one in this tutorial). Finally, we use `evaluate` to report back the classification accuracy on the test split.

In [5]:
from chop.tools import get_trainer
import torch

def objective(trial):
    
    # Build the model and move it to the GPU if one is available
    model = construct_model(trial)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    trainer = get_trainer(
        model=model,
        tokenized_dataset=dataset,
        tokenizer=tokenizer,
        evaluate_metric="accuracy",
        num_train_epochs=1,
    )

    trainer.train()
    eval_results = trainer.evaluate()

    # Move back to CPU for storage
    trial.set_user_attr("model", model.cpu())

    return eval_results["eval_accuracy"]

## 4. Launching the Search

Optuna provides a number of samplers, for example:

* **GridSampler**: iterates through every possible combination of hyperparameters in the search space
* **RandomSampler**: chooses a random combination of hyperparameters in each iteration
* **TPESampler**: uses the Tree-structured Parzen Estimator algorithm to choose hyperparameter values
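The difference between the first two strategies can be sketched in plain Python. This is a conceptual illustration on a small, hypothetical two-parameter space, not Optuna's implementation; TPE is left out because it depends on a model fitted to earlier trials.

```python
import itertools
import random

# Small, hypothetical search space used only for this illustration.
space = {"num_layers": [2, 4, 8], "num_heads": [2, 4, 8, 16]}
names = list(space)

# Grid search: enumerate every combination exactly once.
grid = [dict(zip(names, combo)) for combo in itertools.product(*space.values())]

# Random search: draw each parameter independently; repeats are possible.
rng = random.Random(0)
random_draws = [{name: rng.choice(space[name]) for name in names} for _ in range(5)]

print(len(grid))  # 12
print(random_draws)
```

Grid search cost grows with the product of the list lengths, whereas random search costs exactly as many trials as you ask for, at the price of possibly revisiting a combination.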

You can define the chosen sampler by simply importing from `optuna.samplers` as below.

In [6]:
from optuna.samplers import GridSampler, RandomSampler, TPESampler

sampler = RandomSampler()

With all the pieces in place, we can launch the search as follows. The number of trials is set to 1 so you can go get a coffee for 10 minutes, then proceed with the tutorial. However, this will essentially be a random model - for better results, set this to 100 and leave it running overnight!

In [None]:
import optuna

study = optuna.create_study(
    direction="maximize",
    study_name="bert-tiny-nas-study",
    sampler=sampler,
)

study.optimize(
    objective,
    n_trials=1,
    timeout=60 * 60 * 24,
)

Fetch the model associated with the best trial as follows, and export to be used in future tutorials. In Tutorial 6, we'll see how to run mixed-precision quantization search on top of the model we've just found through NAS to further find the optimal quantization mapping.

In [None]:
from pathlib import Path
import dill

model = study.best_trial.user_attrs["model"].cpu()

with open(f"{Path.home()}/tutorial_5_best_model.pkl", "wb") as f:
    dill.dump(model, f)

## Deploying the Optimized Model with CompressionPipeline

Now we can use the CompressionPipeline in Mase to apply uniform quantization and pruning to the searched model.

In [None]:
from chop.pipelines import CompressionPipeline
from chop import MaseGraph

mg = MaseGraph(model)
pipe = CompressionPipeline()

quantization_config = {
    "by": "type",
    "default": {
        "config": {
            "name": None,
        }
    },
    "linear": {
        "config": {
            "name": "integer",
            # data
            "data_in_width": 8,
            "data_in_frac_width": 4,
            # weight
            "weight_width": 8,
            "weight_frac_width": 4,
            # bias
            "bias_width": 8,
            "bias_frac_width": 4,
        }
    },
}

pruning_config = {
    "weight": {
        "sparsity": 0.5,
        "method": "l1-norm",
        "scope": "local",
    },
    "activation": {
        "sparsity": 0.5,
        "method": "l1-norm",
        "scope": "local",
    },
}

mg, _ = pipe(
    mg,
    pass_args={
        "quantize_transform_pass": quantization_config,
        "prune_transform_pass": pruning_config,
    },
)
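To give a feel for what these settings mean numerically, here is a minimal sketch of 8-bit fixed-point rounding with 4 fractional bits and of magnitude-based pruning at 50% sparsity. Both functions are illustrative assumptions written for this tutorial: they assume signed fixed-point with round-to-nearest and saturation, and per-tensor magnitude ranking. Mase's own passes may differ in rounding mode, granularity and mask handling.

```python
def quantize_fixed_point(x, width=8, frac_width=4):
    """Round x to a signed fixed-point grid and saturate at the range limits."""
    scale = 2 ** frac_width
    q_min = -(2 ** (width - 1))
    q_max = 2 ** (width - 1) - 1
    q = max(q_min, min(q_max, round(x * scale)))
    return q / scale


def prune_by_magnitude(weights, sparsity=0.5):
    """Zero the fraction `sparsity` of entries with the smallest magnitude."""
    num_pruned = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:num_pruned])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


print(quantize_fixed_point(0.3))     # 0.3125
print(quantize_fixed_point(100.0))   # 7.9375
print(quantize_fixed_point(-100.0))  # -8.0
print(prune_by_magnitude([0.1, -0.9, 0.4, -0.2]))  # [0.0, -0.9, 0.4, 0.0]
```

With a width of 8 and a fractional width of 4, representable values are multiples of 1/16 in the range [-8, 7.9375], so larger activations or weights saturate.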

Finally, export the MaseGraph for the compressed checkpoint to be used in future tutorials for hardware generation and distributed deployment.

In [None]:
mg.export(f"{Path.home()}/tutorial_5_nas_compressed", save_format="state_dict")

In [18]:
search_space = {
    "num_layers": [2, 4, 8],
    "num_heads": [2, 4, 8, 16],
    # hidden size is the embedding dimension in transformers
    "hidden_size": [128, 192, 256, 384, 512],
    # intermediate size is the dimension of the feedforward layer in transformers
    "intermediate_size": [512, 768, 1024, 1536, 2048],
    "linear_layer_choices": [
        nn.Linear,
        # Identity,
    ],
}
# search_space = {
#     "num_layers": [2, ],
#     "num_heads": [2, 4],
#     # hidden size is the embedding dimension in transformers
#     "hidden_size": [128, 512],
#     # intermediate size is the dimension of the feedforward layer in transformers
#     "intermediate_size": [512],
#     "linear_layer_choices": [
#         nn.Linear,
#         Identity,
#     ],
# }

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import optuna
from optuna.samplers import RandomSampler, TPESampler, GridSampler
import torch.nn as nn
from chop.nn.modules import Identity
import traceback
import warnings

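# Suppress the UserWarning Optuna emits when categorical choices are not
# primitive values (here they are the classes nn.Linear and Identity)
warnings.filterwarnings("ignore", category=UserWarning, module="optuna.distributions")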

# --- 1. FULL Search Space (for Random/TPE) ---
search_space = {
    "num_layers": [2, 4, 8],
    "num_heads": [2, 4, 8, 16],
    "hidden_size": [128, 192, 256, 384, 512],
    "intermediate_size": [512, 768, 1024, 1536, 2048],
    "linear_layer_choices": [nn.Linear, Identity],
}

# --- 2. FIXED Search Space (for GridSampler) ---
# GridSampler fails when a trial asks for a parameter that is not in its grid
# (e.g. the conditional layer types), so every parameter name that
# `construct_model` may request has to be listed up front.
# `construct_model` samples an index into `search_space` via `trial.suggest_int`,
# so the grid holds indices rather than the values themselves.
grid_search_space = {
    param: list(range(len(search_space[param])))
    for param in ["num_layers", "num_heads", "hidden_size", "intermediate_size"]
}

# Pre-populate the grid with the layer-type parameter of every linear layer in every
# encoder block (up to the maximum of 8 layers). Listing both nn.Linear and Identity
# for each of them would make the grid enormous, which is why grid search is a poor
# fit for NAS, so only nn.Linear is listed here.
max_layers = 8
for i in range(max_layers):
    for sublayer in ["attention.self.query", "attention.self.key", "attention.self.value",
                     "attention.output.dense", "intermediate.dense", "output.dense"]:
        # The key must match the name `construct_model` requests,
        # e.g. "bert.encoder.layer.0.output.dense_type"
        param_name = f"bert.encoder.layer.{i}.{sublayer}_type"

        # Grid values must be among the choices passed to `suggest_categorical`.
        # To search over both layer types, use [nn.Linear, Identity] instead.
        grid_search_space[param_name] = [nn.Linear]

# The pooler's dense layer is also square (hidden_size x hidden_size), so
# `construct_model` requests a layer type for it as well.
grid_search_space["bert.pooler.dense_type"] = [nn.Linear]


# --- 3. Sampler Definition ---
samplers = {
    # We pass the fully-populated static grid to GridSampler
    "Grid Search": GridSampler(grid_search_space),
    "Random Search": RandomSampler(),
    "TPE (Bayesian)": TPESampler(),
}

n_trials = 30
sampler_history = {}
best_overall_value = -float("inf")
best_overall_model = None

# --- 4. Optimization Loop ---
for name, sampler in samplers.items():
    print(f"--- Starting {name} ---")
    try:
        # Standard Optuna setup
        study = optuna.create_study(direction="maximize", sampler=sampler)
        study.optimize(objective, n_trials=n_trials)
        
        values = [t.value for t in study.trials if t.value is not None]
        if values:
            sampler_history[name] = np.maximum.accumulate(values)
            
            if study.best_value > best_overall_value:
                best_overall_value = study.best_value
                best_overall_model = study.best_trial.user_attrs["model"]
                
    except Exception as e:
        # traceback.print_exc()
        print(f"Sampler {name} encountered an error: {e}")

# Plot the best accuracy reached so far against the trial count for each sampler
plt.figure(figsize=(10, 6))
for name, best_values in sampler_history.items():
    plt.plot(range(1, len(best_values) + 1), best_values, marker="o", label=name)

plt.xlabel("Number of Trials")
plt.ylabel("Maximum Accuracy Achieved")
plt.title("NAS Sampler Comparison (GPU-Accelerated)")
plt.legend()
plt.grid(True)
plt.show()
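Each curve in this plot is a running maximum over the per-trial accuracies, which is what `np.maximum.accumulate` computes. The same operation in plain Python, on made-up accuracy values chosen only for illustration:

```python
from itertools import accumulate

# Hypothetical per-trial accuracies, in the order the trials finished.
trial_values = [0.50, 0.70, 0.60, 0.80, 0.75]

# Running maximum: best accuracy seen up to and including each trial.
best_so_far = list(accumulate(trial_values, max))

print(best_so_far)  # [0.5, 0.7, 0.7, 0.8, 0.8]
```

Because the curve can only stay flat or rise, a sampler that finds good models earlier shows up as a curve that climbs sooner.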

In [1]:
from transformers import AutoModel
import torch.nn as nn

# Load the base model to inspect its architecture
checkpoint = "prajjwal1/bert-tiny"
model = AutoModel.from_pretrained(checkpoint)

# Get the first encoder layer (BertLayer)
encoder_block = model.encoder.layer[0]

print("--- Linear Layers in one BERT Encoder Block ---")
count = 0
for name, module in encoder_block.named_modules():
    # Filter for Linear layers
    if isinstance(module, nn.Linear):
        # We print the name relative to the block
        print(f"{count+1}. {name}")
        count += 1

print(f"\nVerified Total: {count} Linear Layers per Block")

--- Linear Layers in one BERT Encoder Block ---
1. attention.self.query
2. attention.self.key
3. attention.self.value
4. attention.output.dense
5. intermediate.dense
6. output.dense

Verified Total: 6 Linear Layers per Block
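The inspection output lists six linear layers per encoder block, which makes it easy to estimate how large an exhaustive grid over layer types would be. The arithmetic below is a rough upper bound under the assumption that every one of the 6 x 8 per-layer parameters gets its own grid entry; the variable names are illustrative.

```python
from math import prod

linear_layers_per_block = 6  # from the inspection output
max_layers = 8               # largest value of num_layers in the search space
layer_type_choices = 2       # nn.Linear or Identity

# Sizes of num_layers, num_heads, hidden_size and intermediate_size.
architecture_sizes = [3, 4, 5, 5]

num_layer_type_params = linear_layers_per_block * max_layers
full_grid = prod(architecture_sizes) * layer_type_choices ** num_layer_type_params
linear_only_grid = prod(architecture_sizes) * 1 ** num_layer_type_params

print(num_layer_type_params)  # 48
print(linear_only_grid)       # 300
print(f"{full_grid:.2e}")
```

Restricting every layer-type entry to a single choice keeps the grid at the 300 architecture combinations, while allowing both choices multiplies it by 2 to the power of 48.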
