<a href="https://colab.research.google.com/github/odenizddd/babylm-eval-nb/blob/main/BabyLM_Evaluation_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

copy necessary files from huggingface


In [1]:
!cd /

!rm -rf /content/evaluation-pipeline-2023
!rm -rf /content/elc-bert-replica

In [2]:
!apt-get install git-lfs
!git lfs install
!git clone https://huggingface.co/odenizddd/elc-bert-replica


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.
Git LFS initialized.
Cloning into 'elc-bert-replica'...
remote: Enumerating objects: 18, done.[K
remote: Counting objects: 100% (15/15), done.[K
remote: Compressing objects: 100% (15/15), done.[K
remote: Total 18 (delta 3), reused 0 (delta 0), pack-reused 3 (from 1)[K
Unpacking objects: 100% (18/18), 50.87 KiB | 4.24 MiB/s, done.
Filtering content: 100% (2/2), 592.96 MiB | 122.27 MiB/s, done.


# Instructions
This notebook allows you to load and evaluate a huggingface model on a subset of BLiMP (a linguistic acceptability judgment dataset) and GLUE (a natural language understanding benchmark collection). It is HIGHLY recommended to clone the GitHub repository and evaluate your model in the command-line; this will give you more freedom in the kinds of models you can evaluate. However, Colab provides a GPU that will allow you to load and evaluate smaller models.

To use this notebook:

1. Start by making a copy of this notebook so that you can make edits and run the code: File > Save a copy in Drive.

2. Set Runtime > Change runtime type > Hardware accelerator to GPU if it isn't already.

3. Run the setup script to install the required packages for evaluating.

4. Upload your model to the colab in the `/content/model_folder/` directory. This folder should include the following files, and probably a couple more depending on the type of model and tokenizer you use:
* `config.json`
* `pytorch_model.bin`
* `tokenizer_config.json`
* `vocab.json`

  a. To obtain these files given your pre-trained model and your tokenizer, load them using huggingface `transformers` and save them using these commands:
```
tokenizer.save_pretrained("./model_dir")
model.save_pretrained("./model_dir")
```
  b. Then, upload all the contents of `model_dir` (including any other files not mentioned above) to the `model_folder` folder in this Colab.

5. Choose the proper model type in the dropdown in the "load model and evaluate" cell. Use "decoder" for autoregressive (sometimes called "causal") language models, like GPT/OPT; "encoder" for masked language models, like BERT/RoBERTa; or "encoder-decoder" for text-to-text models, like T5/BART.

6. Run the cells below to load and evaluate your model.

In [3]:
!sudo apt-get install python3.10-venv python3.10-distutils -y
!curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
!python3.10 get-pip.py



Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'python3-distutils' instead of 'python3.10-distutils'
python3-distutils is already the newest version (3.10.8-1~22.04).
The following additional packages will be installed:
  python3-pip-whl python3-setuptools-whl
The following NEW packages will be installed:
  python3-pip-whl python3-setuptools-whl python3.10-venv
0 upgraded, 3 newly installed, 0 to remove and 34 not upgraded.
Need to get 2,474 kB of archives.
After this operation, 2,885 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3-pip-whl all 22.0.2+dfsg-1ubuntu0.5 [1,680 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3-setuptools-whl all 59.6.0-1.2ubuntu0.22.04.2 [788 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3.10-venv amd64 3.10.12-1~22.04.9 [5,722 B]
Fetched 2,474 kB in 1s (1,707 kB

In [4]:
#@title Setup script { display-mode: "form" }
#@markdown Run this cell to install the necessary packages (may take a few minutes).
%%shell
# Remove previous installation if it exists
cd /content
mkdir -p model_folder
pip uninstall -y lm-eval
rm -rf evaluation-pipeline-2023/

# Install evaluation-pipeline
git clone https://github.com/odenizddd/evaluation-pipeline-2023 &> /dev/null
cd evaluation-pipeline-2023/
python3.10 -m pip install -e ".[colab]"
# Install other necessary packages
python3.10 -m pip install torch==1.13.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113

# Unpack dataset
unzip filter_data.zip

[0mObtaining file:///content/evaluation-pipeline-2023
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting datasets>=2.0.0 (from lm_eval==0.2.0)
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting nltk==3.6 (from lm_eval==0.2.0)
  Downloading nltk-3.6-py3-none-any.whl.metadata (2.9 kB)
Collecting openai==0.13.0 (from lm_eval==0.2.0)
  Downloading openai-0.13.0.tar.gz (37 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting pycountry==20.7.3 (from lm_eval==0.2.0)
  Downloading pycountry-20.7.3.tar.gz (10.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.1/10.1 MB[0m [31m25.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting pytablewriter==0.58.0 (from lm_eval==0.2.0)
  Downloading pytablewriter-0.58.0-py3-none-any.whl.metadata (31 kB)
Collecting rouge-score==0.0.4 (from lm_eval==0.2.0)
  Downloading rouge_score-0.0.4-py2.py3-none-any.whl.metadata (3.8 kB)
Collectin



# Evaluation

In [5]:
!python3.10 -m pip install numpy



In [6]:
#@title Load model and evaluate (BLiMP) { display-mode: "form" }
model = "/content/elc-bert-replica" #@param {"type": "string"}
model_type = "encoder" #@param ["decoder", "encoder", "encoder-decoder"]
# file_name = "examples3.csv" #@param {"type": "string"}
# model_names = ["opt-125m", "opt-350m", "opt-1.3b", "opt-2.7b"] #@param {"type": "raw"}

%cd /content/evaluation-pipeline-2023
!python3.10 /content/evaluation-pipeline-2023/babylm_eval.py \
  "$model" \
  "$model_type" \
  -t "blimp" \
  --trust_remote_code

/content/evaluation-pipeline-2023
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.6 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "/content/evaluation-pipeline-2023/babylm_eval.py", line 48, in <module>
    eval_model = lm_eval.get_model(MODEL_TYPE_REMAP[args.model_type],
  File "/content/evaluation-pipeline-2023/lm_eval/models/__init__.py", line 42, in get_model
    return model_api_class(**model_kwargs)
  File "/content/evaluation-pipeline-2023/lm_eval/models/huggingface.py", line 175, in __i

In [None]:
!python3.10 ./collect_results.py /content/evaluation-pipeline-2023/elc-bert-replica

Traceback (most recent call last):
  File "/content/evaluation-pipeline-2023/./collect_results.py", line 71, in <module>
    task_dicts[task] = make_task_dict(task, preds_path)
  File "/content/evaluation-pipeline-2023/./collect_results.py", line 38, in make_task_dict


In [7]:
import sys
!{sys.executable} -m pip install evaluate


Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m3.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.3


In [None]:
!rm -rf /content/elc-bert-replica/finetune

In [8]:
#@title Load model and evaluate ((Super)GLUE) { display-mode: "form" }
#@markdown Run this cell to fine-tune your model on (Super)GLUE tasks.
#@markdown We provide some default hyperparameters that you may adjust.
model = "/content/elc-bert-replica" #@param {"type": "string"}
learning_rate = 5e-5 #@param {"type": "number"}
batch_size = 64 #@param {"type": "integer"}
eval_every = 200 #@param {"type": "integer"}
patience = 10 #@param {"type": "integer"}
max_epochs = 10 #@param {"type": "integer"}
seed = 12 #@param {"type": "integer"}
# file_name = "examples3.csv" #@param {"type": "string"}
# model_names = ["opt-125m", "opt-350m", "opt-1.3b", "opt-2.7b"] #@param {"type": "raw"}

%cd /content/evaluation-pipeline-2023
!./finetune_all_tasks.sh \
    "$model" \
    "$learning_rate" \
    "$patience" \
    "$batch_size" \
    "$eval_every" \
    "$max_epochs" \
    "$seed"

/content/evaluation-pipeline-2023
2025-05-21 17:19:36.040284: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1747847976.297918    3648 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747847976.363282    3648 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-05-21 17:19:36.883855: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/content/evaluation-pipeline-2023/finetune_class

In [None]:
!cd /content
from transformers import BertTokenizer, BertModel

# Load the tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

save_directory = "./bert_model2"

tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory, safe_serialization=False)
!pwd


/content/evaluation-pipeline-2023


In [None]:
!pip install evaluate




In [None]:
import shutil

# Full paths to the folders you want to zip
shutil.make_archive('/content/finetune', 'zip', '/content/evaluation-pipeline-2023/elc-bert-replica/finetune')
shutil.make_archive('/content/zeroshot', 'zip', '/content/evaluation-pipeline-2023/elc-bert-replica/zeroshot')


'/content/zeroshot.zip'

In [None]:
import shutil

# Full paths to the folders you want to zip
shutil.make_archive('/content/evaluation-pipeline-2023', 'zip', '/content/evaluation-pipeline-2023')

'/content/evaluation-pipeline-2023.zip'

In [None]:
%%shell
cd /content/evaluation-pipeline-2023
ls -l

total 81160
drwxr-xr-x 2 root root     4096 May 11 18:28 aoa_data
drwxr-xr-x 2 root root     4096 May 11 18:28 assets
-rw-r--r-- 1 root root     4804 May 11 18:28 babylm_eval.py
-rw-r--r-- 1 root root       34 May 11 18:28 CODEOWNERS
-rw-r--r-- 1 root root     3713 May 11 18:28 collect_results.py
drwxr-xr-x 3 root root     4096 May 11 18:28 docs
drwxr-xr-x 7 root root     4096 May 11 18:28 filter-data
-rw-r--r-- 1 root root 61762326 May 11 18:28 filter_data.zip
-rwxrwxrwx 1 root root     1003 May 11 18:28 finetune_all_tasks.sh
-rw-r--r-- 1 root root    32453 May 11 18:46 finetune_classification.py
-rwxrwxrwx 1 root root     1255 May 11 18:28 finetune_model.sh
-rw-r--r-- 1 root root       40 May 11 18:28 ignore.txt
-rw-r--r-- 1 root root     1067 May 11 18:28 LICENSE.md
drwxr-xr-x 7 root root     4096 May 11 18:28 lm_eval
drwxr-xr-x 2 root root     4096 May 11 18:29 lm_eval.egg-info
-rw-r--r-- 1 root root     7347 May 11 18:28 main.py
-rw-r--r-- 1 root root    13536 May 11 18:28 README.



In [None]:
!chmod 777 ./*.sh

In [None]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("/content/elc-bert-replica", weights_only=False)
print(model)

The repository for /content/elc-bert-replica contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//content/elc-bert-replica.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Some weights of LtgBertForSequenceClassification were not initialized from the model checkpoint at /content/elc-bert-replica and are newly initialized: ['embedding.relative_embedding', 'embedding.relative_layer_norm.bias', 'embedding.relative_layer_norm.weight', 'embedding.word_embedding.weight', 'transformer.layers.0.attention.in_proj_qk.bias', 'transformer.layers.0.attention.in_proj_qk.weight', 'transformer.layers.0.attention.in_proj_v.bias', 'transformer.layers.0.attention.in_proj_v.weight', 'transformer.layers.0.attention.out_proj.bias', 'transformer.layers.0.attention.out_proj.weight', 'transformer.layers.0.attention.position_indices', 'transformer.layers.0.attention.post_layer_norm.bias', 'transformer.layers.0.attention.post_layer_norm.weight', 'transformer.layers.0.mlp.mlp.1.weight', 'transformer.layers.0.mlp.mlp.4.weight', 'transformer.layers.0.prev_layer_weights', 'transformer.layers.1.attention.in_proj_qk.bias', 'transformer.layers.1.attention.in_proj_qk.weight', 'transformer

LtgBertForSequenceClassification(
  (embedding): Embedding(
    (word_embedding): Embedding(6144, 384)
    (word_layer_norm): LayerNorm((384,), eps=1e-07, elementwise_affine=False)
    (dropout): Dropout(p=0.1, inplace=False)
    (relative_layer_norm): LayerNorm((384,), eps=1e-07, elementwise_affine=True)
  )
  (transformer): Encoder(
    (layers): ModuleList(
      (0-11): 12 x EncoderLayer(
        (attention): Attention(
          (in_proj_qk): Linear(in_features=384, out_features=768, bias=True)
          (in_proj_v): Linear(in_features=384, out_features=384, bias=True)
          (out_proj): Linear(in_features=384, out_features=384, bias=True)
          (pre_layer_norm): LayerNorm((384,), eps=1e-07, elementwise_affine=False)
          (post_layer_norm): LayerNorm((384,), eps=1e-07, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (mlp): FeedForward(
          (mlp): Sequential(
            (0): LayerNorm((384,), eps=1e-07, elementwise_af

In [None]:
ls -lh /content/elc-bert-replica

total 594M
-rw-r--r--  1 root root  759 May 11 18:19 config.json
-rw-r--r--  1 root root 5.5K May 11 18:19 configuration_ltgbert.py
drwxr-xr-x 24 root root 4.0K May 11 18:44 [0m[01;34mfinetune[0m/
-rw-r--r--  1 root root 297M May 11 18:19 model.bin
-rw-r--r--  1 root root  35K May 11 18:19 modeling_ltgbert.py
-rw-r--r--  1 root root 297M May 11 18:19 pytorch_model.bin
-rw-r--r--  1 root root  173 May 11 18:19 special_tokens_map.json
-rw-r--r--  1 root root  106 May 11 18:19 tokenizer_config.json
-rw-r--r--  1 root root 132K May 11 18:19 tokenizer.json


In [None]:
import torch
from torch.serialization import add_safe_globals
from argparse import Namespace

add_safe_globals([Namespace])

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "/content/elc-bert-replica",
    trust_remote_code=True,
    ignore_mismatched_sizes=True,
    weights_only=True,
)

print("\n=== Parameter Device Check ===")
for name, param in model.named_parameters():
    print(f"{name}: {param.device} | requires_grad={param.requires_grad}")

print("\n=== Buffer Device Check ===")
for name, buffer in model.named_buffers():
    print(f"{name}: {buffer.device}")

print("\nModel successfully loaded without meta tensors if no 'meta' devices are shown above.")

Some weights of LtgBertForSequenceClassification were not initialized from the model checkpoint at /content/elc-bert-replica and are newly initialized: ['embedding.relative_embedding', 'embedding.relative_layer_norm.bias', 'embedding.relative_layer_norm.weight', 'embedding.word_embedding.weight', 'transformer.layers.0.attention.in_proj_qk.bias', 'transformer.layers.0.attention.in_proj_qk.weight', 'transformer.layers.0.attention.in_proj_v.bias', 'transformer.layers.0.attention.in_proj_v.weight', 'transformer.layers.0.attention.out_proj.bias', 'transformer.layers.0.attention.out_proj.weight', 'transformer.layers.0.attention.position_indices', 'transformer.layers.0.attention.post_layer_norm.bias', 'transformer.layers.0.attention.post_layer_norm.weight', 'transformer.layers.0.mlp.mlp.1.weight', 'transformer.layers.0.mlp.mlp.4.weight', 'transformer.layers.0.prev_layer_weights', 'transformer.layers.1.attention.in_proj_qk.bias', 'transformer.layers.1.attention.in_proj_qk.weight', 'transformer


=== Parameter Device Check ===
embedding.relative_embedding: cpu | requires_grad=True
embedding.word_embedding.weight: cpu | requires_grad=True
embedding.relative_layer_norm.weight: cpu | requires_grad=True
embedding.relative_layer_norm.bias: cpu | requires_grad=True
transformer.layers.0.prev_layer_weights: cpu | requires_grad=True
transformer.layers.0.attention.in_proj_qk.weight: cpu | requires_grad=True
transformer.layers.0.attention.in_proj_qk.bias: cpu | requires_grad=True
transformer.layers.0.attention.in_proj_v.weight: cpu | requires_grad=True
transformer.layers.0.attention.in_proj_v.bias: cpu | requires_grad=True
transformer.layers.0.attention.out_proj.weight: cpu | requires_grad=True
transformer.layers.0.attention.out_proj.bias: cpu | requires_grad=True
transformer.layers.0.attention.post_layer_norm.weight: cpu | requires_grad=True
transformer.layers.0.attention.post_layer_norm.bias: cpu | requires_grad=True
transformer.layers.0.mlp.mlp.1.weight: cpu | requires_grad=True
trans