# Instructions
This notebook loads and evaluates a Hugging Face model on a subset of BLiMP (a linguistic acceptability judgment dataset) and GLUE (a collection of natural language understanding benchmarks). We strongly recommend cloning the GitHub repository and evaluating your model from the command line, since this gives you more freedom in the kinds of models you can evaluate. However, Colab provides a GPU that is enough to load and evaluate smaller models.

To use this notebook:

1. Start by making a copy of this notebook so that you can make edits and run the code: File > Save a copy in Drive.

2. Set Runtime > Change runtime type > Hardware accelerator to GPU if it isn't already.

3. Run the setup script to install the packages required for evaluation.

4. Upload your model to the `/content/model_folder/` directory in Colab. This folder should include the following files, and probably a few more depending on the type of model and tokenizer you use:
* `config.json`
* `pytorch_model.bin`
* `tokenizer_config.json`
* `vocab.json`

  a. To obtain these files from your pre-trained model and tokenizer, load both with Hugging Face `transformers` and save them using these commands:
```
tokenizer.save_pretrained("./model_dir")
model.save_pretrained("./model_dir")
```
  b. Then, upload all the contents of `model_dir` (including any other files not mentioned above) to the `model_folder` folder in this Colab.

5. Choose the proper model type in the dropdown in the "load model and evaluate" cell. Use "decoder" for autoregressive (sometimes called "causal") language models, like GPT/OPT; "encoder" for masked language models, like BERT/RoBERTa; or "encoder-decoder" for text-to-text models, like T5/BART.

6. Run the cells below to load and evaluate your model.
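
Before running the evaluation cells, it can help to confirm that the upload from step 4 is complete. The sketch below only checks for the four files named in step 4; `missing_model_files` is a hypothetical helper, and your model or tokenizer may legitimately use additional or different files.

```python
from pathlib import Path

# Files named in step 4. This list is an assumption for illustration:
# other model or tokenizer types may need more or different files.
EXPECTED_FILES = [
    "config.json",
    "pytorch_model.bin",
    "tokenizer_config.json",
    "vocab.json",
]


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    folder = Path(model_dir)
    return [name for name in expected if not (folder / name).is_file()]
```

Calling `missing_model_files("/content/model_folder")` returns an empty list when all four files are present, and the names of the missing files otherwise.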

# IMPORTANT STUFF 

Online version of this demo colab: https://colab.research.google.com/drive/1HX2D3wztO81tKcqCeV_ecRcEUseBVuTc?usp=sharing

### Important sites:
- https://babylm.github.io/ --> general
- https://github.com/babylm/evaluation-pipeline --> evaluation pipeline
- https://huggingface.co/babylm --> pretrained T5 baseline models
- https://github.com/babylm/baseline-pretraining --> Baseline T5 (the one we need w/ parameters I think)
- https://github.com/babylm/babylm.github.io/tree/main --> The readme actually explains what the assignment is about
- https://github.com/babylm/ --> their general github

## What we have to do according to professor:
- Dataset description (replicate what we did for the Brown corpus in A2): 10 points
    - 2 genres done; not sure if it's a good idea to run over all of 10M or 100M. 10M ran for an hour yesterday and still wasn't finished. Deleting the 16 MB file might make it faster, because the 2 genres only took about 2 minutes to finish.
- Output file from running T5 baseline model on both tracks: 15 points
    - This is what we are doing right now
- Description of BLiMP evaluation and 12 linguistic phenomena evaluated: 15 points
    - Step below our model
- Description of (Super)GLUE evaluation metric and 11 evaluated categories: 15 points
    - Still need to change this into Python format, (or just run it in Bash)
- Error analysis of 2 evaluation criteria: 15 points 
- Improvements to baseline model with code and short write-up on methods tried and how well they performed: 30 points

### Deadline: Friday 7 July


In [1]:
import os
import subprocess
import zipfile
import shutil

# Create the folder that will hold the uploaded model
os.makedirs('model_folder', exist_ok=True)

# Remove previous lm-eval installation if it exists
subprocess.run(['pip', 'uninstall', '-y', 'lm-eval'])

# Remove 'evaluation-pipeline/' directory if it exists 
# if os.path.exists('evaluation-pipeline/'): 
#     shutil.rmtree('evaluation-pipeline/') 

# Install evaluation-pipeline
subprocess.run(['git', 'clone', '-b', 'colab', 'https://github.com/babylm/evaluation-pipeline'], 
               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
os.chdir('evaluation-pipeline/')
# No shell is involved when passing a list, so the extra must not be wrapped in literal quotes
subprocess.run(['pip', 'install', '-e', '.[colab]'])

# Install other necessary packages
subprocess.run(['pip', 'install', 'torch==1.11.0+cu113', 'torchvision==0.12.0+cu113', 'torchaudio==0.11.0', 
                '--extra-index-url', 'https://download.pytorch.org/whl/cu113'])
subprocess.run(['pip', 'install', 'sentencepiece==0.1.94'])
subprocess.run(['pip', 'install', 'transformers'])

# Unpack dataset
with zipfile.ZipFile('filter_data.zip', 'r') as zip_ref:
    zip_ref.extractall()

os.chdir('..')
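
The setup cell above silences the `git clone` output and never checks exit codes, so a failed step can go unnoticed until a later cell breaks. A minimal sketch of a stricter wrapper follows; `run_checked` is a hypothetical helper and not part of the evaluation pipeline.

```python
import subprocess
import sys


def run_checked(command):
    """Run a command and return its stdout.

    Raises subprocess.CalledProcessError if the command exits with a
    non-zero status, instead of silently continuing.
    """
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


# Example: call pip through the current interpreter so that packages are
# installed into the same environment the notebook is running in.
# run_checked([sys.executable, "-m", "pip", "--version"])
```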


In [3]:
# This is where we define our T5 model. The instructions can be found here: https://github.com/babylm/baseline-pretraining
# Please follow the instructions written on the github page.
# Note: if you search for T5 online, keep in mind that we are training the BabyLM T5 specifically, not other T5 variants (these exist too).

## ------------------------------------------------ ##

# Don't forget to install the framework environment https://github.com/chengxuz/pt_framework

# DONE : First, define the environment variable BABYLM_ROOT_DIR to be where your models and data will live
# DONE : Downloaded data should be put at ${BABYLM_ROOT_DIR}/datasets/ so this folder contains the following 4 subfolders: babylm_100M, babylm_10M, babylm_dev, and babylm_test.

# FAILED: Note that the T5 training script expects .txt file inputs, so we create a single dev file by running this command in the ${BABYLM_ROOT_DIR}/datasets/babylm_dev/ folder: cat *.dev > babylm_dev.txt
    # - I ran it via Bash in the folder, but it didn't create a .txt file

# TO-DO: once above is done, modify "baseline-pretraining-main\scripts\train_t5_babylm.sh" parameters and run ./train_t5_babylm.sh in the scripts folder "baseline-pretraining-main\scripts\" via Bash. 

## ------------------------------------------------ ##
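
The `cat *.dev > babylm_dev.txt` step that failed above can also be done in Python, which avoids depending on a Bash shell. This is a sketch under the assumption that the dev files end in `.dev` and sit directly in the `babylm_dev` folder; `concatenate_dev_files` is a hypothetical helper, not part of the baseline-pretraining repository.

```python
from pathlib import Path


def concatenate_dev_files(dev_dir, output_name="babylm_dev.txt"):
    """Concatenate every *.dev file in dev_dir into a single file.

    Mirrors `cat *.dev > babylm_dev.txt`: files are joined in sorted
    name order and copied byte for byte, with no separators added.
    """
    folder = Path(dev_dir)
    dev_files = sorted(folder.glob("*.dev"))
    if not dev_files:
        raise FileNotFoundError(f"No .dev files found in {folder}")
    output_path = folder / output_name
    with output_path.open("wb") as out:
        for dev_file in dev_files:
            out.write(dev_file.read_bytes())
    return output_path
```

With `BABYLM_ROOT_DIR` set as described above, the call would look like `concatenate_dev_files(Path(os.environ["BABYLM_ROOT_DIR"]) / "datasets" / "babylm_dev")` (with `import os`).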

    

In [2]:
################### ignore this for now ###################

# ''' This is where we import the T5 model (and tokenizer) from HuggingFace (URL: https://huggingface.co/babylm/) '''

# from transformers import T5Tokenizer, T5ForConditionalGeneration

# # Load model and tokenizer
# model = T5ForConditionalGeneration.from_pretrained("babylm/t5-base-strict-small") # Track 1: For babylm_10M
# tokenizer = T5Tokenizer.from_pretrained("babylm/t5-base-strict-small")            # Track 1: For babylm_10M

# # model = T5ForConditionalGeneration.from_pretrained("babylm/t5-base-strict") # 100M     # Track 2: For babylm_100M
# # tokenizer = T5Tokenizer.from_pretrained("babylm/t5-base-strict") # 100M         # Track 2: For babylm_100M

# # Save model and tokenizer
# tokenizer.save_pretrained("./model_dir")
# model.save_pretrained("./model_dir")

# Evaluation

### BLiMP

In [3]:
#Load model and evaluate (BLiMP)

import os
import subprocess

model = "C:\\Users\\mzdog\\Desktop\\babylm_data\\model_folder"  # @param {type: "string"}
model_type = "encoder-decoder"                                  # @param ["decoder", "encoder", "encoder-decoder"]

os.chdir("C:\\Users\\mzdog\\Desktop\\babylm_data\\evaluation-pipeline")

try:
    output = subprocess.check_output(['python', 'babylm_eval.py', model, model_type, '-t', 'blimp'], stderr=subprocess.STDOUT)
    print(output.decode('utf-8'))
except subprocess.CalledProcessError as e:
    print("Command failed with exit code", e.returncode)
    print("Output:")
    print(e.output.decode('utf-8'))
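
For the BLiMP write-up, it may help to see how the accuracy is defined. BLiMP consists of minimal pairs, one acceptable and one unacceptable sentence, and a model is counted as correct on a pair when it assigns the acceptable sentence the higher probability. The sketch below computes that accuracy from precomputed scores; `minimal_pair_accuracy` and the toy scores are hypothetical, and this is not the pipeline's own implementation.

```python
def minimal_pair_accuracy(pairs):
    """Fraction of pairs where the acceptable sentence scores higher.

    pairs: iterable of (score_acceptable, score_unacceptable), for
    example sentence log-probabilities under the model.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if good > bad)
    return correct / len(pairs)


# Toy log-probabilities (made up for illustration only).
toy_scores = [(-12.3, -15.1), (-20.0, -18.5), (-9.7, -9.9), (-30.2, -31.0)]
print(minimal_pair_accuracy(toy_scores))  # 0.75: 3 of the 4 pairs are correct
```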


### (Super)GLUE

In [31]:
#@title Load model and evaluate ((Super)GLUE) { display-mode: "form" }
#@markdown Run this cell to fine-tune your model on (Super)GLUE tasks.
#@markdown We provide some default hyperparameters that you may adjust.
model = "C:\\Users\\mzdog\\Desktop\\babylm_data\\model_folder" #@param {"type": "string"}
learning_rate = 5e-5 #@param {"type": "number"}
batch_size = 64 #@param {"type": "integer"}
eval_every = 200 #@param {"type": "integer"}
patience = 10 #@param {"type": "integer"}
max_epochs = 10 #@param {"type": "integer"}
seed = 12 #@param {"type": "integer"}
# file_name = "examples3.csv" #@param {"type": "string"}
# model_names = ["opt-125m", "opt-350m", "opt-1.3b", "opt-2.7b"] #@param {"type": "raw"}

%cd "C:\\Users\\mzdog\\Desktop\\babylm_data\\evaluation-pipeline"
!./finetune_all_tasks.sh \
    "$model" \
    "$learning_rate" \
    "$patience" \
    "$batch_size" \
    "$eval_every" \
    "$max_epochs" \
    "$seed"


C:\Users\mzdog\Desktop\babylm_data\evaluation-pipeline


'.' is not recognized as an internal or external command,
operable program or batch file.
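
The error above is the Windows command prompt refusing to execute `./finetune_all_tasks.sh`: a `.sh` script needs a Bash interpreter such as Git Bash or WSL. A minimal sketch of launching it from Python instead, assuming `bash` is on `PATH` and that the script takes the seven positional arguments in the order the cell passes them; `build_finetune_command` and `run_finetune` are hypothetical helpers.

```python
import shutil
import subprocess


def build_finetune_command(model, learning_rate, patience, batch_size,
                           eval_every, max_epochs, seed, bash="bash"):
    """Build the argument list for running finetune_all_tasks.sh through Bash.

    The argument order follows the notebook cell above.
    """
    args = [model, learning_rate, patience, batch_size,
            eval_every, max_epochs, seed]
    return [bash, "finetune_all_tasks.sh"] + [str(a) for a in args]


def run_finetune(pipeline_dir, **hyperparameters):
    """Run the script from the evaluation-pipeline folder via Bash."""
    if shutil.which("bash") is None:
        raise RuntimeError("bash not found on PATH (Git Bash or WSL is needed)")
    command = build_finetune_command(**hyperparameters)
    return subprocess.run(command, cwd=pipeline_dir, check=True)
```

Windows paths containing backslashes may need to be written with forward slashes before Bash will accept them, so the `model` value might have to be adjusted.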
