<a href="https://colab.research.google.com/github/jkchandalia/nlpower/blob/main/notebooks/4.0%20Generative_AI_GPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Generative AI - GPT**

In [1]:
#@title **Setup**
!pip install transformers accelerate| grep -v -e 'already satisfied' -e 'Downloading'


Successfully installed accelerate-0.18.0 huggingface-hub-0.13.4 tokenizers-0.13.3 transformers-4.28.1


## [GPT (Generative Pretrained Transformer) Models](https://huggingface.co/docs/transformers/model_doc/gpt2)
##### Transformer Decoder


<figure>
<center>
<p align="center">
<img src='https://drive.google.com/uc?export=view&id=1YCa3ucZmkr6vUwlLTRVIRCMcU3ReO7Oy' alt="Transformer Decoder" width="600" height="300"/>
</p>
<figcaption>Transformer Decoder (credit: Jay Alammar, https://jalammar.github.io/illustrated-gpt2/)</figcaption></center>
</figure>


#### *Self-supervised Learning*

*Next Word Prediction*

Quick recap: BERT built knowledge about language by predicting masked tokens and sentence relationships.

GPT learns by predicting, i.e., **generating** the next word in a sentence.  As an example:

**The dog ran across the yard to get the < BLANK >**

<figure>
<img src='https://drive.google.com/uc?export=view&id=1v2UUjsT0M4mio5Jw0ZOppyWl5f1S7gBI' alt="History of LLMs" width="200" height="200"/>
</figure>

What is a next word that makes sense? What is a next word that is unlikely?

As with BERT, when trained over a huge amount of data, this can produce a powerful large language model.
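
To make the self-supervised setup concrete, here is a minimal sketch of how next-word training examples fall out of raw text without any manual labels. It splits on whitespace for readability, whereas GPT models operate on subword tokens. The helper name `next_word_pairs` is ours (not part of any library), and "ball" is just one plausible completion chosen for illustration.

```python
def next_word_pairs(sentence):
    """Return (context, next_word) training pairs from one sentence.

    Whitespace splitting is a simplification; GPT uses subword tokens.
    """
    words = sentence.split()
    # Each prefix of the sentence is an input; the word that follows is the label.
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]


pairs = next_word_pairs("The dog ran across the yard to get the ball")
for context, target in pairs[-3:]:
    print(f"{context!r} -> {target!r}")
```

Every sentence in the training corpus yields many such pairs, which is why no human annotation is needed.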


#### GPT Model Size Trends


<figure>
<img src='https://drive.google.com/uc?export=view&id=1bRCbYfqnaOKEoWiLTL-A2AlqlvOYUuaU' alt="History of LLMs" width="570" height="525"/>
</figure>


The following demo is adapted from this [blog post](https://huggingface.co/blog/how-to-generate).

### *Models*

In [2]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

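# Use the GPU when one is available; otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'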

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to(device)



### *Prompt*

In [3]:
prompt_dog = 'I enjoy walking with my cute dog'

prompt_unicorn = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

prompt = (
    "I am in a tutorial about BERT and Generative AI and I just wonder if these models "
    "are going to join forces and escape our computers and turn into AGI"
)

### *Text Generation*
Let's have some fun with prompts. Use the above as a starting point but unleash your creativity!

Parameters to tweak are:

1. **prompt text**: play around with your input text :)
2. **torch.manual_seed(0)**: fixes the random seed so that *sample_output* is the same whenever the same inputs are used. Try commenting this out.
3. **do_sample**=True: randomly samples from the probable next words. When set to False, the model always picks the most likely next word (greedy decoding), so the output is deterministic.
4. **max_length**: the maximum total length of the output in tokens, including the prompt.
5. **top_k**: the number of most likely next words to consider. A higher k means more possibilities to choose from and more varied ("creative") responses.
6. **top_p**: a cumulative probability cutoff for the candidate next words. A lower p restricts the k candidates even further, while a higher p allows sampling from all k candidates.
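
To build some intuition for how **top_k** and **top_p** interact, here is a simplified sketch that filters a toy next-word distribution. It is not the Hugging Face implementation, which works on logits over the full vocabulary. The function name `top_k_top_p_filter` and the probabilities are made up for illustration.

```python
def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Return the renormalized candidates that survive top_k and top_p.

    probs: dict mapping each candidate word to its probability.
    top_k=0 and top_p=1.0 disable the respective filter (sketch convention).
    """
    # Rank candidates from most to least likely.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # top_k: keep only the k most likely candidates, then renormalize.
    if top_k > 0:
        ranked = ranked[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(word, p / total) for word, p in ranked]

    # top_p: keep the smallest set of top candidates whose cumulative
    # probability reaches top_p.
    kept, cumulative = [], 0.0
    for word, p in ranked:
        kept.append((word, p))
        cumulative += p
        if cumulative >= top_p:
            break

    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}


# Made-up probabilities for the word after "...to get the" (illustration only).
toy_probs = {"ball": 0.5, "stick": 0.25, "frisbee": 0.125, "mail": 0.0625, "moon": 0.0625}

print(top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8))  # top_p trims the 3 candidates further
print(top_k_top_p_filter(toy_probs, top_k=3, top_p=1.0))  # all 3 candidates survive
print(top_k_top_p_filter(toy_probs, top_p=0.5))           # only the single most likely word
```

Sampling then draws the next word from whatever candidates remain, so tighter filters give safer but more repetitive text.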


Experiments:

1. Try do_sample=False. What do you observe?
2. Vary max_length from 100 to 1000.
3. Vary top_p between 0.2 and 0.9. How does this change the output?
4. Set top_p to a high number and vary top_k between 3 and 1000. What happens to the quality of the output?
5. When you find a set of parameters that you like, try commenting out *torch.manual_seed(0)* and rerunning the cell multiple times. How different are the outputs?
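
Before running the experiments, it may help to see why a seed makes sampling repeatable while greedy decoding needs no seed at all. The sketch below uses Python's built-in `random` module as a stand-in for the random number generator in `torch`; the function names and the toy probabilities are assumptions made for illustration.

```python
import random

# Made-up next-word probabilities (illustration only).
toy_probs = {"ball": 0.5, "stick": 0.25, "frisbee": 0.125, "mail": 0.125}


def greedy_pick(probs):
    # Always take the single most likely word, so the result never changes.
    return max(probs, key=probs.get)


def sample_pick(probs, rng):
    # Draw one word at random, weighted by its probability.
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


def sample_sequence(probs, seed, n=5):
    # A fresh generator with a fixed seed replays the same random draws.
    rng = random.Random(seed)
    return [sample_pick(probs, rng) for _ in range(n)]


print(greedy_pick(toy_probs))
print(sample_sequence(toy_probs, seed=0))
print(sample_sequence(toy_probs, seed=0))  # same seed, same draws
print(sample_sequence(toy_probs, seed=1))  # a different seed may give different draws
```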

In [9]:
# Set seed to reproduce results. Feel free to change the seed though to get different results
torch.manual_seed(0)

# Encode the context that generation is conditioned on.
# Experiment with your own prompts in the prompt variable below :)
input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

# Generate text until the output length (which includes the context length) reaches max_length

sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=1000,
    top_k=3,
    top_p=0.8,
)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


Output:
----------------------------------------------------------------------------------------------------
I am in a tutorial about BERT and Generative AI and I just wonder if these models are going to join forces and escape our computers and turn into AGI. I am also interested in how the AI is going to work and how we can make it work better for us.

I have been thinking about this for a while now and I have been thinking about it for a while now and I have been thinking about it for a while now and I have been thinking about it for a while now and I have been thinking about it for a while now and I have been thinking about it for a while now and I have been thinking about it for a while now and I have been thinking about it for a while now and I have been thinking about it for a while now and I have been thinking about it for a while now and I have been thinking about it for a while now and I have been thinking about it for a while now and I have been thinking about it for a while 

#### What do we think of this output? Any takeaways from playing around with the above parameters?
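
One way to put a number on the looping seen in the output above is to measure how many word n-grams repeat. This is a rough sketch rather than a standard metric; the function name `repeated_ngram_fraction` and the choice of n=4 are assumptions.

```python
from collections import Counter


def repeated_ngram_fraction(text, n=4):
    """Fraction of word n-grams in `text` that repeat an earlier n-gram."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


looping = "I have been thinking about it for a while now and " * 5
varied = "The dog ran across the yard to get the ball"

print(repeated_ngram_fraction(looping))
print(repeated_ngram_fraction(varied))
```

Passing the decoded model output to this function for different top_k and top_p settings gives a quick way to compare how repetitive each configuration is.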

#### Let's try the same prompts in [ChatGPT](https://chat.openai.com/chat).

### *Discussion*

What are some differences between the ChatGPT output and the GPT output in our notebook? What are some similarities? 

1. Size of model
2. Parameter tweaking
3. Context
4. Cost
5. Open source vs. private
6. Model training steps

<figure>
<img src='https://drive.google.com/uc?export=view&id=1MFrzdaZ4DiMAErWBvSXzZWeKPJYFEZMU' alt="Simplified RLHF" width="600" height="325"/>
<figcaption>Simplified Diagram of Reinforcement Learning with Human Feedback (RLHF) (credit: Adapted from Hugging Face)</figcaption>
</figure>