## Homework

1. Load in a generative model using the HuggingFace pipeline and generate text using a batch of prompts.
  * Play with generative parameters such as temperature, max_new_tokens, and the model itself and explain the effect on the legibility of the model response. Try at least 4 different parameter/model combinations.
  * Models that can be used include:
    * `google/gemma-2-2b-it`
    * `microsoft/Phi-3-mini-4k-instruct`
    * `meta-llama/Llama-3.2-1B`
    * Any model from this list: [Text-generation models](https://huggingface.co/models?pipeline_tag=text-generation)
    * `gpt2` if having trouble loading these models in
  * This guide should help! [Text-generation strategies](https://huggingface.co/docs/transformers/en/generation_strategies)
2. Load in 2 models of different parameter size (e.g. GPT2, meta-llama/Llama-2-7b-chat-hf, or distilbert/distilgpt2) and analyze the BertViz for each. How do the attention mechanisms change depending on model size?

## Homework: Irma Avdic

1. Loading in a generative model using the HuggingFace pipeline and generating text using a batch of prompts. Model used: gpt2

In [25]:
# Step 1: Import necessary libraries
from transformers import pipeline

# Step 2: Load the generative model
model_name = "gpt2"
text_generator = pipeline("text-generation", model=model_name)

# Step 3: Define a batch of prompts
prompts = [
    "Explain large language models to a fourth grader.",
    "Write a short story about a cow that can talk.",
    "What are the main downsides of drinking coffee?",
    "Describe the potential outcomes of the 2024 election."
]

# Step 4: Generate text with different parameter combinations.
# do_sample=True is passed explicitly so that temperature, top_k, and top_p
# take effect even for models whose default generation config is greedy.

# Configuration 1: Low temperature, moderate token limit
output1 = text_generator(prompts, do_sample=True, temperature=0.3, max_new_tokens=50, top_k=50)

# Configuration 2: High temperature, longer token limit
output2 = text_generator(prompts, do_sample=True, temperature=0.9, max_new_tokens=100, top_k=100)

# Configuration 3: Moderate temperature, short token limit
output3 = text_generator(prompts, do_sample=True, temperature=0.7, max_new_tokens=30, top_p=0.9)

# Configuration 4: High temperature and high diversity
output4 = text_generator(prompts, do_sample=True, temperature=1.2, max_new_tokens=75, top_p=0.8)

# Step 5: Print and compare outputs
print("Configuration 1:")
for i, text in enumerate(output1):
    print(f"Prompt {i + 1}: {text[0]['generated_text']}")

print("\nConfiguration 2:")
for i, text in enumerate(output2):
    print(f"Prompt {i + 1}: {text[0]['generated_text']}")

print("\nConfiguration 3:")
for i, text in enumerate(output3):
    print(f"Prompt {i + 1}: {text[0]['generated_text']}")

print("\nConfiguration 4:")
for i, text in enumerate(output4):
    print(f"Prompt {i + 1}: {text[0]['generated_text']}")


Configuration 1:
Prompt 1: Explain large language models to a fourth grader.

In this course, you will learn how to create a language model for a language. You will be able to use the language model to create a language model for your students.

You will learn how to use the language model to create
Prompt 2: Write a short story about a cow that can talk.

I'm not sure if this is a good idea, but I'm sure it's a good idea.

The cow will talk to you.

You will be able to tell her what you want.

You will be
Prompt 3: What are the main downsides of drinking coffee?

The main downsides of drinking coffee are:

1. It's hard to get the right amount of caffeine.

2. It's hard to get the right amount of caffeine.

3. It's hard to get
Prompt 4: Describe the potential outcomes of the 2024 election.

"The election of President Trump is a significant milestone in our nation's history," said President Barack Obama. "We must continue to build on the progress we have made in the last four years, and

**Configuration 1: low temperature, moderate token limit**

*   Repetitive but relevant
*   Relatively cohesive output and fairly concise

**Configuration 2: high temperature, longer token limit**

*   More random than Configuration 1 but still relevant
*   Longer responses than Configuration 1

**Configuration 3: moderate temperature, short token limit**

*   About as random as Configuration 2
*   Moderately long and coherent responses

**Configuration 4: high temperature, high diversity**

*   The most random configuration
*   Long and incoherent responses
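
The trends noted above follow from how `temperature` and `top_p` reshape the next-token probability distribution before sampling. The sketch below is a minimal pure-Python illustration of that idea, not the code HuggingFace runs internally. The names `softmax_with_temperature`, `top_p_filter`, and `toy_logits` are hypothetical and introduced here only for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p,
    then renormalize the kept probabilities so they sum to 1."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Hypothetical scores for four candidate next tokens (not taken from gpt2).
toy_logits = [2.0, 1.0, 0.5, 0.0]

# The same temperatures used in the four configurations above.
for t in (0.3, 0.7, 0.9, 1.2):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

A low temperature concentrates probability on the top-scoring token, which is consistent with the repetitive output of Configuration 1. A temperature above 1 flattens the distribution, so unlikely tokens are sampled more often. Nucleus (`top_p`) filtering then removes the low-probability tail before sampling.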


2. Loading in 2 models of different parameter size (e.g. GPT2, meta-llama/Llama-2-7b-chat-hf, or distilbert/distilgpt2) and analyzing the BertViz for each

In [29]:
from transformers import AutoTokenizer, AutoModel, utils, AutoModelForCausalLM

from bertviz import model_view
utils.logging.set_verbosity_error()  # Suppress standard warnings

model_name = 'openai-community/gpt2'
input_text = "Winter is coming."
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(input_text, return_tensors='pt')  # Tokenize input text
outputs = model(inputs)  # Run model
attention = outputs.attentions  # Retrieve attention weights (one tensor per layer) by name
tokens = tokenizer.convert_ids_to_tokens(inputs[0])  # Convert input ids to token strings
model_view(attention, tokens)  # Display model view

[BertViz model view: openai-community/gpt2, input "Winter is coming."]

In [28]:
utils.logging.set_verbosity_error()  # Suppress standard warnings

model_name = 'distilbert/distilgpt2'
input_text = "Winter is coming."
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(input_text, return_tensors='pt')  # Tokenize input text
outputs = model(inputs)  # Run model
attention = outputs.attentions  # Retrieve attention weights (one tensor per layer) by name
tokens = tokenizer.convert_ids_to_tokens(inputs[0])  # Convert input ids to token strings
model_view(attention, tokens)  # Display model view

[BertViz model view: distilbert/distilgpt2, input "Winter is coming."]

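In the example above, the smaller model (distilgpt2) runs faster, but its attention heads pick up the full stop at the end of the sentence less often than those of the larger model (gpt2). distilgpt2 also has 6 transformer layers where gpt2 has 12, so its model view shows half as many rows of attention heads. Together, these observations suggest that the smaller model is less well equipped to capture context and detail.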
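
A visual comparison like this can be backed up with a simple count. The sketch below is one possible metric, chosen here as an assumption rather than anything BertViz reports: for each layer, it counts the heads whose final query position (the full stop in "Winter is coming.") places more than a chosen share of its attention on the final token itself. The function name `heads_attending_to_last_token`, the `threshold` parameter, and the toy weights are all hypothetical.

```python
def heads_attending_to_last_token(attention, threshold=0.1):
    """Count, per layer, the heads whose last query row gives the last token
    more than `threshold` of its attention weight.

    `attention` is a list over layers; each layer is indexed as
    [head][query position][key position], and each row sums to 1.
    """
    counts = []
    for layer in attention:
        counts.append(sum(1 for head in layer if head[-1][-1] > threshold))
    return counts


# Toy weights for 2 layers, 2 heads, and 2 tokens (made up, not model output).
# The rows are causal: the first token can only attend to itself.
toy_attention = [
    [  # layer 0
        [[1.0, 0.0], [0.9, 0.1]],  # head 0: last token mostly attends to the first
        [[1.0, 0.0], [0.4, 0.6]],  # head 1: last token mostly attends to itself
    ],
    [  # layer 1
        [[1.0, 0.0], [0.2, 0.8]],
        [[1.0, 0.0], [0.3, 0.7]],
    ],
]

print(heads_attending_to_last_token(toy_attention, threshold=0.5))  # prints [1, 2]
```

To apply this to the real models, the `attention` tuple from the cells above could be converted with `[layer[0].tolist() for layer in attention]`, which drops the batch dimension and turns each tensor into nested lists. Comparing the per-layer counts for gpt2 and distilgpt2 would then show whether the difference seen in the model view holds up numerically.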

