<a href="https://colab.research.google.com/github/argonne-lcf/ai-science-training-series/blob/main/04_intro_to_llms/IntroLLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Language models (LMs)

Author: Archit Vasan , including materials on LLMs by Varuni Sastri and Carlo Graziani at Argonne, and discussion/editorial work by Taylor Childers, Bethany Lusch, and Venkat Vishwanath (Argonne)

Inspiration from the blog posts "The Illustrated Transformer" and "The Illustrated GPT2" by Jay Alammar, highly recommended reading.

Although the name "language models" is derived from Natural Language Processing, the models used in these approaches can be applied to diverse scientific applications as illustrated below.

## Outline
During this session I will cover:
1. Scientific applications for language models
2. General overview of Transformers
3. Tokenization
4. Model Architecture
5. Pipeline using HuggingFace  
6. Model loading

## Modeling Sequential Data

Sequences are variable-length lists with data in subsequent iterations that depends on previous iterations (or tokens).

Mathematically:
A sequence is a list of tokens: $$T = [t_1, t_2, t_3,...,t_N]$$ where each token within the list depends on the others with a particular probability:

$$P(t_N | t_{N-1}, ..., t_3, t_2, t_1)$$

The purpose of sequential modeling is to learn these probabilities for possible tokens in a distribution to perform various tasks including:
* Sequence generation based on a prompt
* Language translation (e.g. English --> French)
* Property prediction (predicting a property based on an entire sequence)
* Identifying mistakes or missing elements in sequential data

## Scientific sequential data modeling examples

 ### Nucleic acid sequences + genomic data

 <div style="text-align: center">
<img src="https://github.com/architvasan/ai_science_local/blob/main/images/RNA-codons.svg.png?raw=1"
 width="200">
</div>

Nucleic acid sequences can be used to predict translation of proteins, mutations, and gene expression levels.


Here is an image of GenSLM. This is a language model developed by Argonne researchers that can model genomic information in a single model. It was shown to model the evolution of SARS-COV2 without expensive experiments.

<div>

<img src="https://github.com/architvasan/ai_science_local/blob/main/images/genslm.png?raw=1" width="450"/>
</div>

[Zvyagin et. al 2022. BioRXiv](https://www.biorxiv.org/content/10.1101/2022.10.10.511571v1)

### Protein sequences
Protein sequences can be used to predict folding structure, protein-protein interactions, chemical/binding properties, protein function and many more properties.
<div>
<img src="https://github.com/architvasan/ai_science_local/blob/main/images/Protein-Structure-06.png?raw=1" width="400"/>
</div>

<div>
<img src="https://github.com/argonne-lcf/ai-science-training-series/blob/main/04_intro_to_llms/images/ESMFold.png?raw=1" width="700"/>
</div>

[Lin et. al. 2023. Science](https://www.science.org/doi/10.1126/science.ade2574)

### Other applications:

* Biomedical text
* SMILES strings
* Weather predictions
* Interfacing with simulations such as molecular dynamics simulation

## Overview of Language models

We will now briefly talk about the progression of language models.

### Transformers

The most common LMs base their design on the Transformer architecture that was introduced in 2017 in the "Attention is all you need" paper.

<div>
<img src="https://github.com/architvasan/ai_science_local/blob/main/images/attention_is_all_you_need.png?raw=1" width="500"/>
</div>

[Vaswani 2017. Advances in Neural Information Processing Systems](https://arxiv.org/pdf/1706.03762)

Since then a multitude of LLM architectures have been designed.

<div>
<img src="https://github.com/architvasan/ai_science_local/blob/main/images/en_chapter1_transformers_chrono.svg?raw=1" width="600"/>
</div>

[HuggingFace NLP Course](https://huggingface.co/learn/nlp-course/chapter1/4)

## Coding example of LLMs in action!

Let's look at an example of running inference with a LLM as a block box to generate text given a prompt and we will also initiate a training loop for an LLM:

Here, we will use the `transformers` library which is as part of HuggingFace, a repository of different models, tokenizers and information on how to apply these models

*Warning: Large Language Models are only as good as their training data. They have no ethics, no judgement, or editing ability. We will be using some pretrained models from Hugging Face which used wide samples of internet hosted text. The datasets have not been strictly filtered to restrict all malign content so the generated text may be surprisingly dark or questionable. They do not reflect our core values and are only used for demonstration purposes.*

In [17]:
'''
Uncomment below section if running on sophia jupyter notebook
'''
import os
os.environ["HTTP_PROXY"]="proxy.alcf.anl.gov:3128"
os.environ["HTTPS_PROXY"]="proxy.alcf.anl.gov:3128"
os.environ["http_proxy"]="proxy.alcf.anl.gov:3128"
os.environ["https_proxy"]="proxy.alcf.anl.gov:3128"
os.environ["ftp_proxy"]="proxy.alcf.anl.gov:3128"

In [2]:
!pip install transformers
!pip install pandas
!pip install torch

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [3]:
from transformers import AutoTokenizer,AutoModelForCausalLM, AutoConfig
input_text = "My dog really wanted to eat icecream because"
from transformers import pipeline
generator = pipeline("text-generation", model="gpt2")
pipe = pipeline("text-generation", model="gpt2")
generator(input_text, max_length=20, num_return_sequences=5)

2024-11-25 22:30:07.105519: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1732573807.312578 3531174 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1732573807.391461 3531174 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-25 22:30:07.959995: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. M

[{'generated_text': 'My dog really wanted to eat icecream because she was so small," says David. He says he'},
 {'generated_text': 'My dog really wanted to eat icecream because we were all so worried and she was just such a'},
 {'generated_text': "My dog really wanted to eat icecream because you know what's bad for me? I could eat"},
 {'generated_text': 'My dog really wanted to eat icecream because to do it, I had to use a blender.'},
 {'generated_text': 'My dog really wanted to eat icecream because the other day, his girlfriend and her friends came back'}]

## What's going on under the hood?
There are two components that are "black-boxes" here:

1. The method for tokenization
2. The model that generates novel text.


## Tokenization and embedding of sequential data

Humans can inherently understand language data because they previously learned phonetic sounds.

Machines don’t have phonetic knowledge so they need to be told how to break text into standard units to process it.

They use a system called “tokenization”, where sequences of text are broken into smaller parts, or “tokens”, and then fed as input.

<div>
<img src="https://github.com/architvasan/ai_science_local/blob/main/images/text-processing---machines-vs-humans.png?raw=1" width="400"/>
</div>

Tokenization is a data preprocessing step which transforms the raw text data into a format suitable for machine learning models. Tokenizers break down raw text into smaller units called tokens. These tokens are what is fed into the language models. Based on the type and configuration of the tokenizer, these tokens can be words, subwords, or characters.

Types of tokenizers:

1. Character Tokenizers: Split text into individual characters.
2. Word Tokenizers: Split text into words based on whitespace or punctuation.
3. Subword Tokenizers: Split text into subword units, such as morphemes or character n-grams. Common subword tokenization algorithms include:
  1. Byte-Pair Encoding (BPE),
  2. SentencePiece,
  3. WordPiece.

<div>
<img src="https://github.com/architvasan/ai_science_local/blob/main/images/tokenization_image.webp?raw=1" width="400"/>
</div>

[nlpiation](https://nlpiation.medium.com/how-to-use-huggingfaces-transformers-pre-trained-tokenizers-e029e8d6d1fa)

### Example of tokenization
Let's look at an example of tokenization using byte-pair encoding.

In [4]:
from transformers import AutoTokenizer

# A utility function to tokenize a sequence and print out some information about it.

def tokenization_summary(tokenizer, sequence):

    # get the vocabulary
    vocab = tokenizer.vocab
    # Number of entries to print
    n = 10

    # Print subset of the vocabulary
    print("Subset of tokenizer.vocab:")
    for i, (token, index) in enumerate(tokenizer.vocab.items()):
        print(f"{token}: {index}")
        if i >= n - 1:
            break

    print("Vocab size of the tokenizer = ", len(vocab))
    print("------------------------------------------")

    # .tokenize chunks the existing sequence into different tokens based on the rules and vocab of the tokenizer.
    tokens = tokenizer.tokenize(sequence)
    print("Tokens : ", tokens)
    print("------------------------------------------")

    # .convert_tokens_to_ids or .encode or .tokenize converts the tokens to their corresponding numerical representation.
    #  .convert_tokens_to_ids has a 1-1 mapping between tokens and numerical representation
    # ids = tokenizer.convert_tokens_to_ids(tokens)
    # print("encoded Ids: ", ids)

    # .encode also adds additional information like Start of sequence tokens and End of sequene
    print("tokenized sequence : ", tokenizer.encode(sequence))

    # .tokenizer has additional information about attention_mask.
    # encode = tokenizer(sequence)
    # print("Encode sequence : ", encode)
    # print("------------------------------------------")

    # .decode decodes the ids to raw text
    ids = tokenizer.convert_tokens_to_ids(tokens)
    decode = tokenizer.decode(ids)
    print("Decode sequence : ", decode)


tokenizer_1  =  AutoTokenizer.from_pretrained("gpt2") # GPT-2 uses "Byte-Pair Encoding (BPE)"

sequence = "Counselor, please adjust your Zoom filter to appear as a human, rather than as a cat"

tokenization_summary(tokenizer_1, sequence)

Subset of tokenizer.vocab:
Pocket: 45454
é£: 45617
Ġpair: 5166
Mit: 43339
1200: 27550
Ġenvelop: 16441
Sent: 31837
ĠScy: 32252
)",: 42501
ĠCD: 6458
Vocab size of the tokenizer =  50257
------------------------------------------
Tokens :  ['Coun', 'sel', 'or', ',', 'Ġplease', 'Ġadjust', 'Ġyour', 'ĠZoom', 'Ġfilter', 'Ġto', 'Ġappear', 'Ġas', 'Ġa', 'Ġhuman', ',', 'Ġrather', 'Ġthan', 'Ġas', 'Ġa', 'Ġcat']
------------------------------------------
tokenized sequence :  [31053, 741, 273, 11, 3387, 4532, 534, 40305, 8106, 284, 1656, 355, 257, 1692, 11, 2138, 621, 355, 257, 3797]
Decode sequence :  Counselor, please adjust your Zoom filter to appear as a human, rather than as a cat


### Token embedding:

Words are turned into vectors based on their location within a vocabulary.

The strategy of choice for learning language structure from tokenized text is to find a clever way to map each token into a moderate-dimension vector space, adjusting the mapping so that

Similar, or associated tokens take up residence nearby each other, and different regions of the space correspond to different position in the sequence.
Such a mapping from token ID to a point in a vector space is called a token embedding. The dimension of the vector space is often high (e.g. 1024-dimensional), but much smaller than the vocabulary size (30,000--500,000).

Various approaches have been attempted for generating such embeddings, including static algorithms that operate on a corpus of tokenized data as preprocessors for NLP tasks. Transformers, however, adjust their embeddings during training.

## Transformer Model Architecture

Now let's look at the base elements that
make up a Transformer by dissecting the popular GPT2 model

In [5]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained('gpt2')
param_count = sum([p.numel() for p in model.parameters() if p.requires_grad])

print(model)
print(f"Number of trainable parameters: {param_count}")

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
Number of trainable parameters: 124439808


GPT2 is an example of a Transformer Decoder which is used to generate novel text.

Decoder models use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models. The pretraining of decoder models usually revolves around predicting the next word in the sentence.

These models are best suited for tasks involving text generation.

The architecture of GPT-2 is inspired by the paper: "Generating Wikipedia by Summarizing Long Sequences" which is another arrangement of the transformer block that can do language modeling. This model threw away the encoder and thus is known as the “Transformer-Decoder”.

<div>
<img src="https://github.com/architvasan/ai_science_local/blob/main/images/transformer-decoder-intro.png?raw=1" width="500"/>
</div>

[Illustrated GPT2](https://jalammar.github.io/illustrated-gpt2/)

Key components of the transformer architecture include:

* Input Embeddings: Word embedding or word vectors help us represent words or text as a numeric vector where words with similar meanings have the similar representation.

* Positional Encoding: Injects information about the position of words in a sequence, helping the model understand word order.

* Self-Attention Mechanism: Allows the model to weigh the importance of different words in a sentence, enabling it to effectively capture contextual information.

* Feedforward Neural Networks: Process information from self-attention layers to generate output for each word/token.

* Layer Normalization and Residual Connections: Aid in stabilizing training and mitigating the vanishing gradient problem.

* Transformer Blocks: Comprised of multiple layers of self-attention and feedforward neural networks, stacked together to form the model.

### Attention mechanisms

Since attention mechanisms are arguably the most powerful component of the Transformer, let's discuss this in a little more detail.

Suppose the following sentence is an input sentence we want to translate using an LLM:

`”The animal didn't cross the street because it was too tired”`

To understand a full sentence, the model needs to understand what each word means in relation to other words.

For example, when we read the sentence:
`”The animal didn't cross the street because it was too tired”`
we know intuitively that the word `"it"` refers to `"animal"`, the state for `"it"` is `"tired"`, and the associated action is `"didn't cross"`.

However, the model needs a way to learn all of this information in a simple yet generalizable way.
What makes Transformers particularly powerful compared to earlier sequential architectures is how it encodes context with the **self-attention mechanism**.

As the model processes each word in the input sequence, attention looks at other positions in the input sequence for clues to a better understanding for this word.

<div>
<img src="https://github.com/architvasan/ai_science_local/blob/main/images/transformer_self-attention_visualization.png?raw=1" width="400"/>
</div>

[The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)

#### Multi-head attention
In practice, multiple attention heads are used simultaneously.

This:
* Expands the model’s ability to focus on different positions.
* Prevents the attention to be dominated by the word itself.

#### Let's see multi-head attention mechanisms in action!

We are going to use the powerful visualization tool bertviz, which allows an interactive experience of the attention mechanisms. Normally these mechanisms are abstracted away but this will allow us to inspect our model in more detail.

In [6]:
!pip install bertviz

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Defaulting to user installation because normal site-packages is not writeable


Let's load in the model, GPT2 and look at the attention mechanisms.

**Hint... click on the different blocks in the visualization to see the attention**

In [7]:
from transformers import AutoTokenizer, AutoModel, utils, AutoModelForCausalLM

from bertviz import model_view
utils.logging.set_verbosity_error()  # Suppress standard warnings

model_name = 'openai-community/gpt2'
input_text = "No, I am your father"
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(input_text, return_tensors='pt')  # Tokenize input text
outputs = model(inputs)  # Run model
attention = outputs[-1]  # Retrieve attention from model outputs
tokens = tokenizer.convert_ids_to_tokens(inputs[0])  # Convert input ids to token strings
model_view(attention, tokens)  # Display model view

<IPython.core.display.Javascript object>

## Pipeline using HuggingFace

Now, let's see a practical application of LLMs using a HuggingFace pipeline for classification.

This involves a few steps including:
1. Setting up a prompt
2. Loading in a pretrained model
3. Loading in the tokenizer and tokenizing input text
4. Performing model inference
5. Interpreting inference output

In [8]:
# STEP 0 : Installations and imports
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
import torch
import torch.nn.functional as F

### 1. Setting up a prompt

A "prompt" refers to a specific input or query provided to a language model. They guide the text processing and generation by providing the context for the model to generate coherent and relevant text based on the given input.

The choice and structure of the prompt depends on the specific task, the context and desired output. Prompts can be "discrete" or "instructive" where they are explicit instructions or questions directed to the language model. They can also be more nuanced by more providing suggestions, directions and contexts to the model.

We will use very simple prompts in this tutorial section, but we will learn more about prompt engineering and how it helps in optimizing the performance of the model for a given use case in the following tutorials.

In [9]:
# STEP 1 : Set up the prompt
input_text = "The panoramic view of the ocean was breathtaking."

### 2. Loading Pretrained Models

The AutoModelForSequenceClassification from_pretrained() method instantiates a sequence classification model.

Refer to https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodels for the list of model classes supported.

"from_pretrained" method downloads the pre-trained weights from the Hugging Face Model Hub or the specified URL if the model is not already cached locally. It then loads the weights into the instantiated model, initializing the model parameters with the pre-trained values.

The model cache contains:

* model configuration (config.json)
* pretrained model weights (model.safetensors)
* tokenizer information (tokenizer.json, vocab.json, merges.txt, tokenizer.model)

In [10]:
# STEP 2 : Load the pretrained model.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
print(config)

config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.44.0",
  "vocab_size": 30522
}



### 3. Loading in the tokenizer and tokenizing input text

Here, we load in a pretrained tokenizer associated with this model.

In [11]:
#STEP 3 : Load the tokenizer and tokenize the input text
tokenizer  =  AutoTokenizer.from_pretrained(model_name)
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"]
print(input_ids)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tensor([[  101,  1996,  6090,  6525,  7712,  3193,  1997,  1996,  4153,  2001,
          3052, 17904,  1012,   102]])


### 4. Performing inference and interpreting

Here, we:
* load data into the model,
* perform inference to obtain logits,
* Convert logits into probabilities
* According to probabilities assign label

The end result is that we can predict whether the input phrase is positive or negative.

In [12]:
# STEP 5 : Perform inference
outputs = model(input_ids)
result = outputs.logits
print(result)

# STEP 6 :  Interpret the output.
probabilities = F.softmax(result, dim=-1)
print(probabilities)
predicted_class = torch.argmax(probabilities, dim=-1).item()
labels = ["NEGATIVE", "POSITIVE"]
out_string = "[{'label': '" + str(labels[predicted_class]) + "', 'score': " + str(probabilities[0][predicted_class].tolist()) + "}]"
print(out_string)

tensor([[-4.2767,  4.5486]], grad_fn=<AddmmBackward0>)
tensor([[1.4695e-04, 9.9985e-01]], grad_fn=<SoftmaxBackward0>)
[{'label': 'POSITIVE', 'score': 0.9998530149459839}]


### Saving and loading models

Model can be saved and loaded to and from a local model directory.

In [14]:
from transformers import AutoModel, AutoModelForCausalLM

# Instantiate and train or fine-tune a model
model = AutoModelForCausalLM.from_pretrained("bert-base-uncased")

# Train or fine-tune the model...

# Save the model to a local directory
directory = "my_local_model"
model.save_pretrained(directory)

# Load a pre-trained model from a local directory
loaded_model = AutoModel.from_pretrained(directory)

## Model Hub
The Model Hub is where the members of the Hugging Face community can host all of their model checkpoints for simple storage, discovery, and sharing.

* Download pre-trained models with the huggingface_hub client library, with Transformers for fine-tuning.
* Make use of Inference API to use models in production settings.
* You can filter for different models for different tasks, frameworks used, datasets used, and many more.
* You can select any model, that will show the model card.
* Model card contains information of the model, including the description, usage, limitations etc. Some models also have inference API's that can be used directly.

Model Hub Link : https://huggingface.co/docs/hub/en/models-the-hub

Example of a model card : https://huggingface.co/bert-base-uncased/tree/main

## Recommended reading

* ["The Illustrated Transformer" by Jay Alammar](https://jalammar.github.io/illustrated-transformer/)
* ["Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)" by Jay Alammar](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)
* ["The Illustrated GPT-2 (Visualizing Transformer Language Models)"](https://jalammar.github.io/illustrated-gpt2/)
* ["A gentle introduction to positional encoding"](https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/)
* ["LLM Tutorial Workshop (Argonne National Laboratory)"](https://github.com/brettin/llm_tutorial)
* ["LLM Tutorial Workshop Part 2 (Argonne National Laboratory)"](https://github.com/argonne-lcf/llm-workshop)

## Homework

1. Load in a generative model using the HuggingFace pipeline and generate text using a batch of prompts.
  * Play with generative parameters such as temperature, max_new_tokens, and the model itself and explain the effect on the legibility of the model response. Try at least 4 different parameter/model combinations.
  * Models that can be used include:
    * `google/gemma-2-2b-it`
    * `microsoft/Phi-3-mini-4k-instruct`
    * `meta-llama/Llama-3.2-1B`
    * Any model from this list: [Text-generation models](https://huggingface.co/models?pipeline_tag=text-generation)
    * `gpt2` if having trouble loading these models in
  * This guide should help! [Text-generation strategies](https://huggingface.co/docs/transformers/en/generation_strategies)
2. Load in 2 models of different parameter size (e.g. GPT2, meta-llama/Llama-2-7b-chat-hf, or distilbert/distilgpt2) and analyze the BertViz for each. How does the attention mechanisms change depending on model size?

# Ozan Gokdemir's Homework Starts Here.

## Part 1

### Part 1.1. Variations on Microsoft Phi-3-mini-4k-instruct.

In [26]:
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
os.environ["TOKENIZERS_PARALLELISM"] = "false"

model_name = "microsoft/Phi-3-mini-4k-instruct"

#loading the model and pushing to the GPU.
model = AutoModelForCausalLM.from_pretrained( 
    model_name,  
    device_map="cuda",  
    torch_dtype="auto",  
    trust_remote_code=True,  
) 

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [28]:
# number of parameters in this model. 
param_count = sum([p.numel() for p in model.parameters() if p.requires_grad])
print(f"{model_name} has {param_count} parameters")

microsoft/Phi-3-mini-4k-instruct has 3821079552 parameters


In [30]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
print(config)

tokenizer_config.json:   0%|          | 0.00/3.44k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.94M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/306 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/599 [00:00<?, ?B/s]

Phi3Config {
  "_name_or_path": "microsoft/Phi-3-mini-4k-instruct",
  "architectures": [
    "Phi3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/Phi-3-mini-4k-instruct--configuration_phi3.Phi3Config",
    "AutoModelForCausalLM": "microsoft/Phi-3-mini-4k-instruct--modeling_phi3.Phi3ForCausalLM"
  },
  "bos_token_id": 1,
  "embd_pdrop": 0.0,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 4096,
  "model_type": "phi3",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "original_max_position_embeddings": 4096,
  "pad_token_id": 32000,
  "resid_pdrop": 0.0,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "sliding_window": 2047,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.44.0",
  "use_cache": true,
  "v

In [40]:
"""
These messages set the context for the model. They're a part of the prompt engineering. In this case, 
I am doing a few-shot learning example to direct the model to explain its reasoning. 
"""
messages = [ 
    {"role": "system", "content": "You are a helpful AI assistant that not only answers questions but also explains her reasoning."}, 
    # below is an example, so this is an example of in-context, a.k.a few-shot learning.
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"}, 
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."}, 
    # here comes a question related to my research.
    {"role": "user", "content": "What are the some of the protein-encoding genes that encode inherently disordered proteins (IDPs) which take part in cancer-related pathways?"}, 
] 

pipe = pipeline( 
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
) 

generation_args = { 
    "max_new_tokens": 1024, 
    "return_full_text": False, 
    "temperature": 0.0, 
    "do_sample": False, 
} 

output = pipe(messages, **generation_args) 
print(output[0]['generated_text']) 

 There are several protein-encoding genes that encode inherently disordered proteins (IDPs) which play a role in cancer-related pathways. Some of these genes include:

1. BCL2: This gene encodes the B-cell lymphoma 2 (BCL2) protein, which is an IDP involved in regulating apoptosis (programmed cell death). Dysregulation of BCL2 has been implicated in various types of cancer, including lymphoma, leukemia, and breast cancer.

2. MYC: The MYC gene encodes the MYC protein, which is an IDP that functions as a transcription factor. MYC is involved in regulating cell growth, proliferation, and differentiation. Overexpression of MYC has been associated with various types of cancer, including breast, colon, and lung cancer.

3. P53: The TP53 gene encodes the p53 protein, which is an IDP that acts as a tumor suppressor. p53 plays a crucial role in regulating cell cycle progression, DNA repair, and apoptosis. Mutations in the TP53 gene are found in a wide range of cancers, including lung, breast, 

### Part 1.2 Changing the Generation Hyperparameters to of microsoft/Phi-3-mini-4k-instruct

In [36]:
generation_args = { 
    "max_new_tokens": 1024, 
    "return_full_text": False, 
    "temperature": 0., 
    "do_sample": True, 
} 

output = pipe(messages, **generation_args) 
print(output[0]['generated_text']) 

 Protein-encoding genes that encode inherently disordered proteins (IDPs) which take part in cancer-related pathways are:

1. CSNK2A1: It encodes the casein kinase 2 alpha 1 protein, which has an intrinsically disordered region (IDR) that plays a role in glycogen metabolism and cancer progression.

2. MAP3K4: It codes for the mitogen-activated protein kinase kinase kinase 4, which is involved in oncogenic signaling pathways and affects cell proliferation, migration, and invasion.

3. FYN: This gene encodes FYN, a tyrosine kinase, and its IDR has been implicated in the development of cancer, particularly in the signaling pathways of cell proliferation, migration, and invasion.

4. CDC42: CDC42 encodes a serine/threonine-protein kinase, which plays a role in cellular signaling related to cancer progression and metastasis.

5. NFKBIA: This gene codes for the inhibitor of nuclear factor kappa-B alpha, a negative regulator of NF-kB signaling and associated with cancer development and progre

## Commentary on the impact: ## 

**I can see that dropping the max_new_tokens in half and increasing the temperature 9x leads the model to generate a lot more conjectures than it did before. Reasoning for each conjecture becomes more succinct, yet lacking. It tries to fit too much information into half the number of output tokens. I won't get into detail but at least 4 of these genes that it is outputting are definite halluciniations, or far-reaching speculations at best. I liked the answers much better when the temperature was 0.0. The model seemed to play it much safer then.** 

## Part 1.3 Analysis on the Qwen/Qwen2.5-1.5B-Instruct model. 

Note: This model has 1.5B parameters. I wanted to play with the QWEN models for a while now. This little model is quite performant in its class on the MMLU benchmark. Let's see how much it knows about cancer proteins!

In [68]:
model_name = "Qwen/Qwen2.5-1.5B-Instruct"

import transformers
from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer

#loading the model and pushing to the GPU.
bloom = AutoModelForCausalLM.from_pretrained( 
    model_name,  
    device_map="cuda",  
    torch_dtype="auto",  
    trust_remote_code=True,
    # I don't want this huge model on my local directory, so changin the cache dir to somewhere on Eagle.
    cache_dir = "/lus/eagle/projects/argonne_tpc/ogokdemir/hf_models"
) 
bloom_tokenizer = AutoTokenizer.from_pretrained(model_name)
bloom_config = AutoConfig.from_pretrained(model_name)
param_count = sum([p.numel() for p in bloom.parameters() if p.requires_grad])
print(f"{model_name} has {param_count} trainable parameters.")
print(bloom_config)

Qwen/Qwen2.5-1.5B-Instruct has 1543714304 trainable parameters.
Qwen2Config {
  "_name_or_path": "Qwen/Qwen2.5-1.5B-Instruct",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 1536,
  "initializer_range": 0.02,
  "intermediate_size": 8960,
  "max_position_embeddings": 32768,
  "max_window_layers": 21,
  "model_type": "qwen2",
  "num_attention_heads": 12,
  "num_hidden_layers": 28,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.44.0",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}



**I will ask the model to generate IDP-encoding genes. Then I will ask it to generate hypothesized interactions
between the given proteins, along with the reasoning for the hypothesis and an estimated likelihood of it being true.**

In [82]:
messages = [ 
    {"role": "system", "content": "You are a helpful AI assistant that not only answers questions but also explains her reasoning."}, 
    # below is an example, so this is an example of in-context, a.k.a few-shot learning.
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"}, 
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."}, 
    # here comes a question related to my research.
    {"role": "user", "content": """What are the some of the protein-encoding genes that encode 
    inherently disordered proteins (IDPs) which take part in cancer-related pathways? For each of those proteins, I would like you to
    generate 2 proteins that might be interacting with them. I would like to produce hypotheses so make sure that 1) 
    these interactions are currently not confirmed experimentally and 2) make sure the interactants you are proposing are real
    proteins. Also, produce a sound reasoning for your hypothesis and assing a likelihood to each hypothesis being true from 0
    to 100%. """}, 
]

In [83]:
generation_args = { 
    "max_new_tokens": 2048, 
    "return_full_text": False, 
    "temperature": 0.0, 
    "do_sample": False, 
} 

In [84]:
output = pipe(messages, **generation_args) 
print(output[0]['generated_text']) 

 This is a complex task that requires a deep understanding of molecular biology and protein interactions. Here are some examples of IDPs involved in cancer-related pathways and hypothetical interacting proteins:

1. Protein: p53 (IDP)
   Interacting proteins:
   a. MDM2 (Hypothesis likelihood: 80%) - MDM2 is a known regulator of p53, and it's plausible that it could interact with p53 in a non-canonical manner.
   b. BCL-2 (Hypothesis likelihood: 70%) - BCL-2 is involved in apoptosis, and it's possible that it could interact with p53 in a non-canonical manner.

2. Protein: BRCA1 (IDP)
   Interacting proteins:
   a. RAD51 (Hypothesis likelihood: 85%) - RAD51 is involved in DNA repair, and it's plausible that it could interact with BRCA1 in a non-canonical manner.
   b. PALB2 (Hypothesis likelihood: 75%) - PALB2 is involved in DNA repair, and it's possible that it could interact with BRCA1 in a non-canonical manner.

3. Protein: NF-kB (IDP)
   Interacting proteins:
   a. IKK (Hypothesis l

### **Commentary on the output:** 

**The model has done a good job on following the instruction. The IDP-encoding proteins that it generated are indeed known to encode IDPs. The hypotheses are speculative, yet might be true. I believe the model might be overestimating the likelihood correctness of its own hypotheses. I would be interested in writing a paper comparing the entropy at each step to the likelihood that the model is outputting. It would be cool to see a negative correlation between the two because it would mean that LLMs are actually "aware" of the probabilistic uncertainty in their own answers. 

In [85]:
###Part 1.4 Changing the Generation Hyperparameters to of Qwen/Qwen2.5-1.5B-Instruct Model

In [89]:
generation_args = { 
    "max_new_tokens": 1024, 
    "return_full_text": False, 
    "temperature": 0.9, 
    "do_sample": True, #setting this to true, otherwise temperature does not mean anything.
} 

output = pipe(messages, **generation_args)
print(output[0]["generated_text"])

 1. Gene: APEX1

    APEX1 encodes a peroxiredoxin-like protein that is involved in oxidation-reduction reactions and has been identified as an IDP. It has been implicated in the activation of tumor suppressor p53.

    Protein 1: ATRX 

    ATRX is a chromatin remodeling protein that has been shown to interact with reactive oxygen species (ROS) such as hydrogen peroxide. It has been suggested that APEX1 might interact with ATRX to facilitate the activation of p53. Although no experimental evidence supports this interaction, it is plausible given the similarities in their biology.
    
    Hypothesis: APEX1 interacts with ATRX to facilitate p53 activation 

    Likelihood: 65%

    Reasoning: Both APEX1 and ATRX are involved in chromatin modification and p53 regulation. Given their roles in the p53 pathway, it is plausible that they might interact to facilitate p53 activation.

    Protein 2: BRCA1

    BRCA1 is a tumor suppressor protein involved in DNA repair and cell cycle regulatio

## Commentary on the impact of increased temperature:

**Reducing the max new tokens lead to fewer proteins and hypotheses. This means that the model prioritized following the instruction even within restricted token count. That is, it choose quality over quantity.**

**Increasing to temperature lead to a clear decrease in self-assessed confidence in the hypotheses which is interesting. It might be that the entropy increased and the model picked up on this! Qualitatively speaking, these hypotheses are highly generic. Therefore, don't inspire much of an effort to investigate experimentally. I liked the previous config (part 1.3) much better.**

## Part 2: BertViz

GPT2 Visualization (from the code provided above)

In [97]:
utils.logging.set_verbosity_error()  # Suppress standard warnings

model_name = 'openai-community/gpt2'
input_text = "What is the meaning of life to a stoic man?"
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(input_text, return_tensors='pt')  # Tokenize input text
outputs = model(inputs)  # Run model
attention = outputs[-1]  # Retrieve attention from model outputs
tokens = tokenizer.convert_ids_to_tokens(inputs[0])  # Convert input ids to token strings
model_view(attention, tokens)  # Display model view

param_count = sum([p.numel() for p in model.parameters() if p.requires_grad])
print(f"GPT-2 has {param_count} parameters.")

<IPython.core.display.Javascript object>

GPT-2 has 124439808 parameters.


In [100]:
utils.logging.set_verbosity_error()  # Suppress standard warnings

model_name = 'distilbert/distilgpt2'
input_text = "What is the meaning of life to a stoic man?"
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(input_text, return_tensors='pt')  # Tokenize input text
outputs = model(inputs)  # Run model
attention = outputs[-1]  # Retrieve attention from model outputs
tokens = tokenizer.convert_ids_to_tokens(inputs[0])  # Convert input ids to token strings
model_view(attention, tokens)  # Display model view

param_count = sum([p.numel() for p in model.parameters() if p.requires_grad])
print(f"{model_name} has {param_count} parameters.")

<IPython.core.display.Javascript object>

distilbert/distilgpt2 has 81912576 parameters.


## Commentary: 

GPT-2 has 124M parameters. DistillBert model here has 81M. Looks like the GPT-2 has 12 layers with 12 self-attention heads in each. DistilBert has 6 layers with 12 heads each. The plots can be seen above.
