# Quick tour (Group B)

This quick tour will help you get started with the ```mgit``` library and the concept of a lineage graph. It shows you how to load preprocessors (i.e., tokenizers) and language models with ```mgit```, and how to quickly train and evaluate a model.
* You only need to read what is in this notebook; the hyperlinks are for reference only.
* Run each code cell in order.
* Ask the instructor to clarify any concept in this tutorial that is unclear to you.
* **You can refer back to this tutorial during the later assignment, so please read this notebook carefully.**

#### You have 15 minutes for this tutorial. Ask the instructor to start timing when you read this sentence.

# 1. Library Import (run the code, no need to read through it)

In [1]:
# no need to read the import block
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0'
os.environ['HF_HOME'] = '/workspace/HF_cache/'
os.environ['HF_DATASETS_CACHE'] = '/workspace/HF_cache/datasets'
os.environ['TRANSFORMERS_CACHE'] = '/workspace/HF_cache/transformers_cache/'
os.environ['TF_ENABLE_ONEDNN_OPTS']='0'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 

import sys
MGIT_PATH=os.path.dirname(os.path.dirname(os.path.dirname(os.getcwd())))
sys.path.append(MGIT_PATH)
from utils.lineage.graph import *

# 2. Natural Language Processing via the MGit Library 

An NLP model takes tokenized text as input and outputs numerical values to solve common NLP tasks. Two examples:


| **Task**                 | **Description**                            | **Application**                |
|--------------------------|--------------------------------------------|--------------------------------|
| Masked language modeling | Predict a masked token in a sequence       | Pre-training                   |
| Sequence classification  | Assign a label to a given sequence of text | Downstream: sentiment analysis |

The ```mgit``` library provides the functionality and API to create and use such NLP models. The ```mgit``` library also provides a data structure ```LineageGraph``` to store and manage the models by recording their lineage relations, i.e. how a model is derived from/related to other models in the graph. **This data structure lets users efficiently retrieve, inspect and update models which show similar behavior because of their resemblance (e.g., similar layers).**

* The code below shows you an example lineage graph that stores three models locally under ```models```. 
* Among the models,  ```models/bert-base-uncased_v2``` and ```models/bert-base-uncased-sentiment``` are derived from ```models/bert-base-uncased```.
* Specifically, ```models/bert-base-uncased_v2``` is the next version of ```models/bert-base-uncased```, obtained via fine-tuning; both perform the same pre-training task, i.e., masked language modeling.
* ```models/bert-base-uncased-sentiment``` is adapted from ```models/bert-base-uncased``` and is trained to perform a downstream task, i.e. sequence classification.

In [2]:
#load the LineageGraph
g = LineageGraph.load_from_file('./')

In [3]:
# show the graph with interactive mode where you can drag the node/graph and zoom in/out 
from IPython.display import IFrame
g.show()
display(IFrame('LineageGraph.html', width=1000, height=450))

To see all the models in the LineageGraph ```g``` you can do:

In [4]:
for node in g.get_all_nodes():
    print(node.output_dir)

models/bert-base-uncased
models/bert-base-uncased_v2
models/bert-base-uncased-sentiment


### To see all the models in the subtree of a specific parent model, i.e., the models derived from it, pass in the name of the parent. This is useful for quickly retrieving other models that might show similar behavior, e.g., accuracy fluctuations due to subtle changes in inputs.

In [5]:
# Here is an example to retrieve the subtree of 'models/bert-base-uncased':
for node in g.get_all_nodes(parent='models/bert-base-uncased'):
    print(node.output_dir)

models/bert-base-uncased_v2
models/bert-base-uncased-sentiment
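
To make the subtree query concrete, here is a minimal sketch of a lineage graph in plain Python. It is not how ```mgit``` implements ```LineageGraph```: the class `ToyLineageGraph`, its `add` and `subtree` methods, and the `"versioned"` edge label are hypothetical names chosen for illustration. Only the three model names and the `'adapted'` edge type come from this tutorial.

```python
from collections import deque


class ToyLineageGraph:
    """Hypothetical stand-in for a lineage graph; not the mgit LineageGraph."""

    def __init__(self):
        # Maps a parent name to a list of (child name, edge type) pairs.
        self.children = {}

    def add(self, name, parent=None, etype=None):
        self.children.setdefault(name, [])
        if parent is not None:
            self.children.setdefault(parent, []).append((name, etype))

    def subtree(self, parent):
        """Return every model derived from `parent`, breadth-first."""
        found, queue = [], deque([parent])
        while queue:
            current = queue.popleft()
            for child, _etype in self.children.get(current, []):
                found.append(child)
                queue.append(child)
        return found


toy = ToyLineageGraph()
toy.add("models/bert-base-uncased")
# "versioned" is an assumed label; "adapted" is the one used later in this tutorial.
toy.add("models/bert-base-uncased_v2", parent="models/bert-base-uncased", etype="versioned")
toy.add("models/bert-base-uncased-sentiment", parent="models/bert-base-uncased", etype="adapted")

descendants = toy.subtree("models/bert-base-uncased")
for name in descendants:
    print(name)
```

Storing children per parent keeps the subtree query a simple traversal: the root itself is excluded, matching the output shown above, and a leaf model returns an empty list.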


# 3. Loading Tokenizer

A tokenizer is responsible for preprocessing text into an array of numbers as inputs to a model. The most important thing to remember is you need to instantiate a tokenizer with the same model name to ensure you're using the same tokenization rules a model was pretrained with.

In [6]:
# Each model is stored as a lineage node inside the lineage graph, from which the model instance and the tokenizer instance can be retrieved.
tokenizer = g.get_node("models/bert-base-uncased-sentiment").get_pt_tokenizer()

loading model: models/bert-base-uncased-sentiment
attempting to load model by config


Pass your text to the tokenizer:

In [7]:
encoding = tokenizer("We are very happy to show you the mgit library.")
print(encoding)

{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 18437, 10517, 13299, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


The tokenizer returns a dictionary containing:

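* [input_ids](https://huggingface.co/docs/transformers/main/en/glossary#input-ids): numerical representations of your tokens.
* [token_type_ids](https://huggingface.co/docs/transformers/main/en/glossary#token-type-ids): identify which sequence each token belongs to when the input contains more than one sequence (all 0 here because there is only one).
* [attention_mask](https://huggingface.co/docs/transformers/main/en/glossary#attention-mask): indicates which tokens should be attended to.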

A tokenizer can also accept a list of inputs, and pad and truncate the text to return a batch with uniform length:

In [8]:
pt_batch = tokenizer(
    ["We are very happy to show you the mgit library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
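
To illustrate what padding and truncation do to a batch, here is a minimal sketch in plain Python. It is not the tokenizer's actual implementation: `pad_and_truncate` is a hypothetical helper, the token ids are made up, and the padding id of 0 is an assumption. A real tokenizer also takes care of special tokens when truncating, which this sketch ignores.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = width - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for two sentences of different length.
toy_batch = pad_and_truncate([[101, 7, 8, 9, 102], [101, 7, 102]], max_length=4)
print(toy_batch["input_ids"])       # every row now has the same length
print(toy_batch["attention_mask"])  # zeros mark the padded positions
```

The uniform row length is what allows the batch to be returned as a single tensor, and the attention mask tells the model which positions are padding.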

# 4. Loading Model 

As stated above, ```mgit``` provides a simple and unified way to load different instances. This means you can also retrieve a model instance from a ```LineageNode```.

In [9]:
model = g.get_node("models/bert-base-uncased-sentiment").get_pt_model()


Now pass your preprocessed batch of inputs directly to the model. You just have to unpack the dictionary by adding `**`:

In [10]:
model(**pt_batch).logits

tensor([[-2.6026, -2.7280, -0.7415,  2.0235,  3.1213],
        [ 0.0064, -0.1258, -0.0503, -0.1655,  0.1329]],
       grad_fn=<AddmmBackward0>)
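
Logits are unnormalized scores, one per class. As a minimal sketch (plain Python rather than PyTorch, so it runs without the model), the snippet below applies a softmax to the logits printed above and picks the highest-scoring class index for each sentence. The helper `softmax` is written here for illustration; what each class index means depends on the model's label mapping, which this tutorial does not show.

```python
import math


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the output above (one row per input sentence).
logits = [
    [-2.6026, -2.7280, -0.7415, 2.0235, 3.1213],
    [0.0064, -0.1258, -0.0503, -0.1655, 0.1329],
]

probabilities = [softmax(row) for row in logits]
predicted_classes = [probs.index(max(probs)) for probs in probabilities]

for probs, cls in zip(probabilities, predicted_classes):
    print(cls, [round(p, 3) for p in probs])
```

The first row is strongly peaked on one class, while the second row is close to uniform, which suggests the model is far less certain about the second sentence.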

# 5. Model Training

All models are a standard [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) so you can use them in any typical training loop. While you can write your own training loop, ```mgit``` provides a ```LineageTrain``` class for PyTorch, which contains the basic training loop and adds additional functionality for features like distributed training, mixed precision, and more.

Depending on your task, you'll typically pass the following parameters to use LineageTrain:

1. A ```LineageNode``` added to the loaded graph:

   ```py
   >>> node = LineageNode(
   ...     init_checkpoint='models/bert-base-uncased-sentiment',
   ...     output_dir='models/bert-base-uncased-sentiment_versioned',
   ... )
   >>> g.add(node, etype='adapted', parent='models/bert-base-uncased-sentiment')
   ```

2. A preprocessing class like a tokenizer, image processor, feature extractor, or processor:

   ```py
   >>> tokenizer = node.get_pt_tokenizer()
   ```

3. A ```LineageDataset```:

   ```py
   >>> lineage_dataset = LineageDataset('datasets/rotten_tomatoes', feature_keys=['text'])
   ```

4. A function to tokenize and preprocess the dataset:

   ```py
   >>> def preprocess_function(lineage_dataset, tokenizer):
   ...     # map() expects a callable that tokenizes one batch of examples at a time
   ...     lineage_dataset.dataset = lineage_dataset.dataset.map(
   ...         lambda batch: tokenizer(batch["text"]), batched=True
   ...     )
   ...     return lineage_dataset
   ```

   Then apply it over the entire dataset:

   ```py
   >>> lineage_dataset = preprocess_function(lineage_dataset, tokenizer)
   ```

5. Now gather all these classes in ```LineageTrain``` and attach it to the node. ```LineageTrain``` contains the training hyperparameters you can change, like learning rate, batch size, and the number of epochs to train for. The [default](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) values are used if you don't specify any training arguments:

   ```py
   >>> dataset = lineage_dataset.dataset  # assumes this holds the 'train' and 'test' splits
   >>> lineage_train = LineageTrain(
   ...     train_dataset=dataset['train'],
   ...     eval_dataset=dataset['test'],
   ...     per_device_train_batch_size=256,
   ...     per_device_eval_batch_size=256,
   ... )
   >>> node.lineage_train = lineage_train
   ```

6. Call ```node.train()``` to start training.

## Useful Note:
```mgit``` also provides an advanced function ```update_cascade``` to enable **efficient training** if the user wants to replay the adaptation process of a downstream model, e.g. sequence classification, **on a different pre-training model**, e.g. masked language modeling. In the above example, ```models/bert-base-uncased-sentiment``` was adapted from ```models/bert-base-uncased```. To create a new version of ```models/bert-base-uncased-sentiment``` from ```models/bert-base-uncased_v2``` instead of ```models/bert-base-uncased```, the user can simply do: 
```py
>>> # old_base is the reference node whose training process for producing downstream models is recorded by mgit; new_base is the new pre-training node on which the recorded training process is replayed.
>>> # new_target is then fine-tuned from new_base and returned as the next version of old_target.
>>> new_target = g.update_cascade(old_base=g.get_node('models/bert-base-uncased'), new_base=g.get_node('models/bert-base-uncased_v2'), old_target=g.get_node('models/bert-base-uncased-sentiment'))
```
#### Note that new_target is automatically named after old_target with the suffix '_versioned'. To retrieve it from the LineageGraph ```g```, the user can simply do:
```py
>>> new_target = g.get_node(g.get_node('models/bert-base-uncased-sentiment').output_dir + '_versioned')
```
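
The idea behind replaying an adaptation on a new base can be sketched in a few lines of plain Python. This is a conceptual illustration only, not how ```mgit``` implements ```update_cascade```: `replay_cascade`, `adapt_to_sentiment`, the `models` and `recipes` dictionaries, and the use of single floats as "weights" are all hypothetical. Only the `_versioned` naming rule is taken from the note above.

```python
def replay_cascade(models, recipes, old_base, new_base, old_target):
    """Replay the recorded old_base -> old_target step on new_base.

    Hypothetical sketch: `models` maps names to toy "weights" (floats) and
    `recipes` maps (parent, child) name pairs to the callable that produced
    the child from the parent.
    """
    recipe = recipes[(old_base, old_target)]
    new_target = old_target + "_versioned"
    # Apply the same adaptation step, but starting from the new base.
    models[new_target] = recipe(models[new_base])
    recipes[(new_base, new_target)] = recipe
    return new_target


def adapt_to_sentiment(weights):
    """Stand-in for fine-tuning on a downstream task."""
    return weights + 0.25


models = {"bert-base-uncased": 1.0, "bert-base-uncased_v2": 1.5}
models["bert-base-uncased-sentiment"] = adapt_to_sentiment(models["bert-base-uncased"])
recipes = {("bert-base-uncased", "bert-base-uncased-sentiment"): adapt_to_sentiment}

new_name = replay_cascade(
    models,
    recipes,
    old_base="bert-base-uncased",
    new_base="bert-base-uncased_v2",
    old_target="bert-base-uncased-sentiment",
)
print(new_name, models[new_name])
```

Because the recipe is stored on the edge rather than inside the child model, the same step can be reapplied to any new base without the user restating how the downstream model was trained.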

# 6. Model Evaluation

1. ```mgit``` provides a customizable ```LineageTest``` class for model evaluation. You'll need to pass LineageTest a function to compute and report metrics. As an example, the [Evaluate](https://huggingface.co/docs/evaluate/index) library provides a simple [`accuracy`](https://huggingface.co/spaces/evaluate-metric/accuracy) function you can load with the [evaluate.load](https://huggingface.co/docs/evaluate/main/en/package_reference/loading_methods#evaluate.load) function:

```py
    >>> import numpy as np
    >>> import evaluate

    >>> metric = evaluate.load("accuracy")
```

2. Call `compute` on `metric` to calculate the accuracy of your predictions. Before passing your predictions to `compute`, you need to convert the logits to predictions:

```py
    >>> def compute_metrics(eval_pred):
    ...    logits, labels = eval_pred
    ...    predictions = np.argmax(logits, axis=-1)
    ...    return metric.compute(predictions=predictions, references=labels)
```

3. Now gather all these classes in LineageTest:

```py
    >>> test = LineageTest(
    ...    eval_dataset=dataset['test'],
    ...    compute_metrics=compute_metrics,
    ...    name='test_1',
    ... )
```
4. Call ```node.run_test(test, return_results=True)``` to start evaluation 
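
To see what the `compute_metrics` function in step 2 computes, here is a minimal sketch in plain Python that needs neither `numpy` nor `evaluate`. The helper names `argmax` and `accuracy_from_logits` and the toy logits and labels are made up for illustration; the returned dictionary mirrors the `{'accuracy': ...}` shape produced by the accuracy metric.

```python
def argmax(row):
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Turn logits into class predictions, then score them against the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four examples of a two-class task, with made-up labels.
toy_logits = [[0.1, 2.0], [1.5, -0.3], [0.2, 0.4], [3.0, 1.0]]
toy_labels = [1, 0, 0, 0]

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```

Three of the four toy predictions match their labels, so the sketch reports an accuracy of 0.75.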

# Don't close this tab when you are done reading!