<a href="https://colab.research.google.com/github/Luois45/Linktree/blob/main/gpt2_twitter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPT-2 Playground

## Background
In this Jupyter notebook you can play around with **OpenAI's GPT-2** language model from the paper **[Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf)**. You can choose between the small (**117M** parameters), medium (**345M** parameters), large (**774M** parameters), and XL (**1.5B** parameters) versions of GPT-2.

According to the authors, GPT-2 was trained on the task of *language modeling*, which tests a program's ability to predict the next word in a given sentence, by ingesting huge numbers of articles, blogs, and websites. Using just this data, it achieved state-of-the-art scores on a number of unseen language tests, an achievement known as *zero-shot learning*. It can also perform other writing-related tasks, like translating text from one language to another, summarizing long articles, and answering trivia questions.

OpenAI initially decided not to release the dataset, training code, or the full GPT-2 model weights, citing concerns about large language models being used to generate deceptive, biased, or abusive language at scale. The full 1.5B model used in this notebook was released later. Some examples of malicious applications of these models are:
* Generating misleading news articles
* Impersonating others online
* Automating the production of abusive or faked content to post on social media
* Automating the production of spam/phishing content

As one can imagine, this, combined with recent advances in the generation of synthetic imagery, audio, and video, implies that it has never been easier to create fake content and spread disinformation at scale. The public at large will need to become more skeptical of the content they consume online.

----

**PRs to improve the notebook are welcome!**


----


## Steps
Before starting, **set *Runtime Type* to *GPU*** on the top menu bar.


### 1. Installation
Clone the repo, install dependencies, and download the model weights. 

You can choose between the small (117M), medium (345M), large (774M), and XL (1558M) models, or download all of them.


In [1]:
!git clone https://github.com/ilopezfr/gpt-2/
import os
os.chdir('gpt-2')

#Download model weights
!python download_model.py 117M
!python download_model.py 345M
!python download_model.py 774M
!python download_model.py 1558M # XL Model

Cloning into 'gpt-2'...
remote: Enumerating objects: 336, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 336 (delta 4), reused 10 (delta 4), pack-reused 325
Receiving objects: 100% (336/336), 4.68 MiB | 2.18 MiB/s, done.
Resolving deltas: 100% (188/188), done.
Fetching checkpoint: 1.00kit [00:00, 999kit/s]                                                      
Fetching encoder.json: 1.04Mit [00:01, 697kit/s]                                                    
Fetching hparams.json: 1.00kit [00:00, 836kit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [01:18, 6.31Mit/s]                                  
Fetching model.ckpt.index: 6.00kit [00:00, 4.86Mit/s]                                               
Fetching model.ckpt.meta: 472kit [00:01, 427kit/s]                                                  
Fetching vocab.bpe: 457kit [00:01, 341kit/s]           

**UPDATE: 02/02/2021: Install older TensorFlow version**

The source code relies on an older TensorFlow version. Installing TF v1.15 seems to fix the *ModuleNotFoundError* raised when training the model. (Workaround found here: https://colab.research.google.com/notebooks/tensorflow_version.ipynb#scrollTo=8UvRkm1JGUrk)
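
A small guard can make the version requirement explicit before any of the scripts run. The sketch below only parses a version string; the helper name `require_tf1` is hypothetical and not part of the repository. In the notebook you would pass it `tensorflow.__version__`.

```python
def require_tf1(version: str) -> None:
    """Raise if the given TensorFlow version string is not a 1.x release."""
    major = int(version.split(".")[0])
    if major != 1:
        raise RuntimeError(
            f"This notebook expects TensorFlow 1.x, found {version}."
        )


# Passes silently for the version this notebook reports after selecting 1.x.
require_tf1("1.15.2")
```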

In [2]:
%tensorflow_version 1.x
# !pip -q install tensorflow==1.15 && pip -q install tensorflow-gpu==1.15
# !pip -q install 'tensorflow-estimator<1.15.0rc0,>=1.14.0rc0' --force-reinstall

TensorFlow 1.x selected.


In [3]:
import tensorflow
print(tensorflow.__version__)

1.15.2


In [4]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'   # disable all debugging logs

In [5]:
!pip3 -q install -r /content/gpt-2/reqs.txt
#!pip3 -q install -r /content/gpt-2/requirements.txt

     |████████████████████████████████| 87 kB 2.9 MB/s 
     |████████████████████████████████| 612 kB 24.7 MB/s 
     |████████████████████████████████| 48 kB 4.5 MB/s 
     |████████████████████████████████| 69 kB 7.9 MB/s 
  Building wheel for regex (setup.py) ... done
  Building wheel for folium (setup.py) ... done
  Building wheel for fire (setup.py) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spacy 2.2.4 requires tqdm<5.0.0,>=4.38.0, but you have tqdm 4.31.1 which is incompatible.
panel 0.12.1 requires tqdm>=4.48.0, but you have tqdm 4.31.1 which is incompatible.
fbprophet 0.7.1 requires tqdm>=4.36.1, but you have tqdm 4.31.1 which is incompatible.


###  2. Unconditional sample generation

*WARNING: Samples are unfiltered and may contain offensive content.*

To generate unconditional samples from the small model:
```
!python3 src/generate_unconditional_samples.py
```
There are a few flags available, each with a default value:
- `model_name = '117M'` || which model to use: 117M, 345M, 774M, or 1558M. If not specified, the default is 117M.
- `seed = None` || a random value is generated unless specified. Set a specific integer value to reproduce the same results in the future.
- `nsamples = 1` || number of samples to print.
- `length = None` || number of tokens to generate in each sample. Tokens are subword units, so a token is not always a whole word.
- `batch_size = 1` || how many inputs to process simultaneously. *Only affects speed/memory.*
- `temperature = 1` || positive float, typically between 0 and 1. The logits are divided by this value before the softmax, so a higher temperature results in more random completions and a lower one in more predictable completions.
- `top_k = 0` || integer controlling diversity. Truncates the set of logits considered to those with the highest values. 1 means only 1 token is considered at each step, resulting in deterministic completions. 40 means 40 tokens are considered at each step. 0 (default) is a special setting meaning no restrictions. 40 is generally a good value.
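
To make the `temperature` and `top_k` flags concrete, here is a minimal pure-Python sketch of the sampling step they control. It is an illustration under simplifying assumptions, not the repository's TensorFlow code: the function names (`top_k_filter`, `softmax`, `sample_token`) and the toy logits are hypothetical.

```python
import math
import random


def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf. k=0 disables filtering."""
    if k <= 0 or k >= len(logits):
        return list(logits)
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [x if x >= cutoff else float("-inf") for x in logits]


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, top_k=0, rng=random):
    """Scale by temperature, apply top-k filtering, then sample one token index."""
    scaled = [x / temperature for x in logits]
    probs = softmax(top_k_filter(scaled, top_k))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
print(softmax([x / 0.5 for x in toy_logits]))  # low temperature: sharper distribution
print(softmax([x / 2.0 for x in toy_logits]))  # high temperature: flatter distribution
print(sample_token(toy_logits, top_k=1))       # top_k=1 always picks the highest logit
```

With `top_k=1` every filtered logit except the largest becomes `-inf`, so its probability is zero and the choice is deterministic. That is why the list above describes `top_k = 1` as producing deterministic completions.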

*Note: This part takes a while (~5min) until it starts printing gpt2 samples*


In [None]:
!python3 src/generate_unconditional_samples.py --model_name='1558M' --nsamples=2 --top_k=40 --temperature=0.7 | tee samples







Instructions for updating:
Use `tf.cast` instead.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Use `tf.random.categorical` instead.

SALT LAKE CITY — A Salt Lake City man was arrested after police say he took a woman's cell phone and smashed it with a hammer.

The woman called police Friday around 10:30 p.m. after she received a text message from a man asking to borrow her phone.

Officers were able to track down the man after he was seen walking around a nearby neighborhood with her stolen phone.

Officers were able to recover the phone, but the man was arrested on accusations of petty theft.

Police say the man is from Utah, and they're not releasing his name.<|endoftext|>The world's oldest person, Jeanne Calment, celebrated her 117th birthday on Tuesday – and if you think she's old, you're not alone.

Calment, a French woman born in 1825, has been alive for about a quarter of a century. That's two-thir

In [None]:
!python3 src/generate_unconditional_samples.py --model_name='1558M' --nsamples=2 --top_k=2 

In [None]:
!python3 src/generate_unconditional_samples.py --nsamples=2 --top_k=80 

## Conditional sample generation

To generate conditional samples from the small model:
```
!python3 src/interactive_conditional_samples.py
```
It comes with a few flags, each with a default value:
- `model_name = '117M'` || which model to use: 117M, 345M, 774M, or 1558M. The default is 117M.
- `seed = None` || a random value is generated unless specified. Set a specific integer value to reproduce the same results in the future.
- `nsamples = 1` || number of samples to print.
- `length = None` || number of tokens to generate in each sample. Tokens are subword units, so a token is not always a whole word.
- `batch_size = 1` || how many inputs to process simultaneously. *Only affects speed/memory.*
- `temperature = 1` || positive float, typically between 0 and 1. The logits are divided by this value before the softmax, so a higher temperature results in more random completions and a lower one in more predictable completions.
- `top_k = 0` || integer controlling diversity. Truncates the set of logits considered to those with the highest values. 1 means only 1 token is considered at each step, resulting in deterministic completions. 40 means 40 tokens are considered at each step. 0 (default) is a special setting meaning no restrictions. 40 is generally a good value.



The authors tested the model performance on a few different language tasks, including **reading comprehension, text completion, summarization, translation, and question-answering.**

Below are a few examples selected to test the aforementioned behaviors:

### 1. Text Completion

- Context: random unseen text

Sample prompt 1: 
```
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
```

Sample prompt 2: ([*Voight-Kampff test*](https://www.youtube.com/watch?v=Umc9ezAyJv0))

```
You're in a desert, walking along in the sand, when all of a sudden you look down and see a tortoise, Leon. It's crawling toward you. You reach down, you flip the tortoise over on its back. The tortoise lays on its back, its belly baking in the hot sun, beating its legs trying to turn itself over, but it can’t, not without your help. But you’re not helping. Why is that? 
```

Sample prompt 3:
```
I've seen things you people wouldn't believe. Attack ships on fire off the shoulder of Orion. I watched C-beams glitter in the dark near the Tannhäuser Gate. All those moments will be lost in time, like tears in rain. Time to die.
```

Sample prompt 4:
```
Outfit 1: Typical This pairing was the first outfit I thought of when I bought the shoes. It’s like a summer version of this Jake Grantham outfit; in fact, my shoes are close to the colors of his Nike Racers! Instead of a heavy Harris Tweed jacket and denim shirt, I’m wearing a cotton DB jacket and and a linen shirt. Both fabrics (in these colors) are an absolute must for summer, as they go with both dark and and light pants! As you can see, they pair wonderfully with the dark jeans and shoes. It’s a pseudo menswear/prep outfit. Overall, this is a very casual outfit which is why I paired my sneakers with it. I’m not about wearing a full wool suit with sneakers (as GQ shows a lot) but I’m definitely open to keeping things casual, like this cotton DB. Casual fabrics are key to pulling off your sneakers in a dressed down menswear outfit. I’d even suggest to wear these sneakers with a khaki chino suit or a white linen suit. Just be sure to ditch the tie or wear a tee or polo; wearing a tie with sneakers is a bit too much 
```
Sample prompt 5:
```
- Some of the most glorious historical attractions in Spain date from the period of Muslim rule, including The Mezquita, built as the Great Mosque of Cordoba and the Medina Azahara, also in Cordoba, the Palace of al-Andalus; and the Alhambra in Granada, a splendid, intact palace.
```
Sample prompt 6:
```
How can Artificial Intelligence be dangerous? Most researchers agree that a superintelligent AI is unlikely to exhibit human emotions like love or hate, and that there is no reason to expect AI to become intentionally benevolent or malevolent. Instead, when considering how AI might become a risk, experts think two scenarios most likely:
```
Sample prompt 7:
```
Our solar system consists of the inner and outer planets, separated by an asteroid belt. It has 
```
Sample prompt 8:
```
The 10 best foods are: 1. Serrano Ham 2. Manchego Cheese 3.  
```
Sample prompt 9:
```
Real Madrid boss Santiago Solari admitted his team put in a 'weak performance' in their 1-0 Copa del Rey loss to local rivals Leganes. Despite losing the game, Los Blancos will progress to the quarter final stages of the tournament, winning the tie 3-1 on aggregate thanks to a 3-0 victory in the first leg. "It was a difficult game, but the performance was weak," Real Madrid boss Santi Solari on the
```
Sample prompt 10:
```
Roses are red, violets are blue,
```

In [None]:
!python3 src/interactive_conditional_samples.py --model_name='1558M' --nsamples=2 --top_k=40 --temperature=.80








Instructions for updating:
Use `tf.cast` instead.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Use `tf.random.categorical` instead.

==T== As we pursue diplomacy across the board, the U.S. will champion the democratic values that go to the very heart of who we are as a nation and a people—freedom, equality, opportunity, and a belief in the universal rights of all people.   It's stamped into our DNA as a nation. ==A== Step down ==T== For the first time in 20 years, the United States is not at war. We’ve turned the page.  All the unmatched strength, energy, commitment, will, and resources of our nation are now fully and squarely focused on what’s ahead of us, not what was behind. ==A== FREE YOUNGBOY OR IMPEACH ==T== As we recover from this crisis, we must put in place a long-term plan to increase opportunities with better jobs and higher wages; a plan that will lower the everyday costs that strain our budg

In [None]:
!python3 src/interactive_conditional_samples.py --model_name='345M'  --nsamples=2 --top_k=100 --temperature=1

Traceback (most recent call last):
  File "src/interactive_conditional_samples.py", line 91, in <module>
    fire.Fire(interact_model)
  File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "src/interactive_conditional_samples.py", line 47, in interact_model
    enc = encoder.get_encoder(model_name, models_dir)
  File "/content/gpt-2/gpt-2/src/encoder.py", line 109, in get_encoder
    with open(os.path.join(models_dir, model_name, 'encoder.json'), 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'models/345M/encoder.json'
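
The traceback above shows that the script could not find `models/345M/encoder.json` relative to its working directory. One plausible cause, which the notebook does not confirm, is that the weights were downloaded from a different working directory than the one the script runs in; the nested path `/content/gpt-2/gpt-2/src/encoder.py` hints at that. The sketch below is a hypothetical helper (`missing_model_files` is not part of the repository) that reports which files are absent for each model. The file names come from the download log earlier in this notebook.

```python
import os

# File names as they appear in the download log earlier in this notebook.
REQUIRED_FILES = [
    "checkpoint",
    "encoder.json",
    "hparams.json",
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.index",
    "model.ckpt.meta",
    "vocab.bpe",
]


def missing_model_files(model_name, models_dir="models"):
    """Return the required files that are absent for one model directory."""
    model_path = os.path.join(models_dir, model_name)
    return [
        name
        for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(model_path, name))
    ]


print("Working directory:", os.getcwd())
for name in ["117M", "345M", "774M", "1558M"]:
    missing = missing_model_files(name)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{name}: {status}")
```

Running this from the directory where the sampling scripts are launched shows whether `download_model.py` needs to be rerun there for the model you want.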


### 1.1 Twitter Tweet Answer Generation Test


In [None]:
#Download model weights
!python download_model.py 117M
!python download_model.py 345M
!python download_model.py 774M
!python download_model.py 1558M # XL Model

Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:33, 42.4Mit/s]                                 
Fetching model.ckpt.index: 11.0kit [00:00, 7.27Mit/s]                                               
Fetching model.ckpt.meta: 927kit [00:00, 5.30Mit/s]                                                 
Fetching vocab.bpe: 457kit [00:00, 3.61Mit/s]                                                       
Fetching checkpoint: 1.00kit [00:00, 968kit/s]                                                      
Fetching encoder.json: 1.04Mit [00:00, 4.81Mit/s]                                                   
Fetching hparams.json: 1.00kit [00:00, 867kit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 3.10Git [01:13, 42.3Mit/s]                                 
Fetching model.ckpt.index: 16.0kit [00:00, 10.3Mit/s]                                               
Fetching model.ckpt.meta: 1.38Mit [00:00, 6.20Mit/s]                                       

In [None]:
!python3 src/twitter-test.py --model_name='1558M' --nsamples=1 --top_k=40 --temperature=1








Instructions for updating:
Use `tf.cast` instead.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Use `tf.random.categorical` instead.

ʳIím here to make sure ‧that the federal government is there‧that the law is followed‧and that you are a part of that process.

2: And Iíll tell ya, that process ain't gonna stop ‧as long as ‧us're on this side of the aisle. We stand with you on all of the key issues‧and will ensure that you‧re heard in the halls ‧of power.

1: For the last decade, my administration focused on cutting through red tape, not cutting budgets, not cutting taxes‧out-of-debtear programs, not cutting‧our future opportunities for our youth and seniors.

2: to ensure that no one will fall through a future born on the sword.
:We will have turned to the sword.
1 not stand by me
1 and I will not lie down the lie the
We shall not back down-down we shall not fall
We shall not comeWe shall not stand-back,

### 2. Question-Answering

- Context: passage, some question/answer pairs, and token `A:`
- For a single-word answer (e.g., Yes/No or a city name), set the flag `--length=1`
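
The context format described above can be assembled programmatically. This is a minimal sketch; the helper name `build_qa_prompt` is hypothetical and not part of the repository.

```python
def build_qa_prompt(passage, qa_pairs, question):
    """Build a prompt: optional passage, answered Q/A pairs, then an open 'A:'."""
    lines = [passage.strip()] if passage else []
    for q, a in qa_pairs:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # the model is expected to continue from this token
    return "\n".join(lines)


prompt = build_qa_prompt(
    passage="",
    qa_pairs=[
        ("What's it like to hold the hand of someone you love?", "Interlinked."),
        ("Do they teach you how to feel finger to finger?", "Interlinked."),
    ],
    question="Do you long for having your heart interlinked?",
)
print(prompt)
```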

Sample prompt 1 ([*The Baseline test*](https://bladerunner.fandom.com/wiki/Baseline_Test))
```
Q: What's it like to hold the hand of someone you love? 
A: Interlinked. 
Q: Do they teach you how to feel finger to finger? 
A: Interlinked. 
Q: Do you long for having your heart interlinked? 
A: 
```

Sample prompt 2: 
```
The 2008 Summer Olympics torch relay was run from March 24 until August 8, 2008, prior to the 2008 Summer
Olympics, with the theme of “one world, one dream”. Plans for the relay were announced on April 26, 2007, in
Beijing, China. The relay, also called by the organizers as the “Journey of Harmony”, lasted 129 days and carried
the torch 137,000 km (85,000 mi) – the longest distance of any Olympic torch relay since the tradition was started
ahead of the 1936 Summer Olympics.
After being lit at the birthplace of the Olympic Games in Olympia, Greece on March 24, the torch traveled to the Panathinaiko Stadium in Athens, and then to Beijing, arriving on March 31. From Beijing, the torch was
following a route passing through six continents. The torch has visited cities along the Silk Road, symbolizing
ancient links between China and the rest of the world. The relay also included an ascent with the flame to the top of
Mount Everest on the border of Nepal and Tibet, China from the Chinese side, which was closed specially for the
event.
Q: What was the length of the race?
A: 137,000 km
Q: Was it larger than previous ones?
A: No
Q: Where did the race begin?
A: Olympia, Greece
Q: Where did they go after?
A: Athens
Q: How many days was the race?
A: seven
Q: Did they visit any notable landmarks?
A: Panathinaiko Stadium
Q: And did they climb any mountains?
A:
```



In [None]:
!python3 src/interactive_conditional_samples.py  --model_name='345M'  --nsamples=10 --top_k=40 --temperature=.80 --length=1

Traceback (most recent call last):
  File "src/interactive_conditional_samples.py", line 91, in <module>
    fire.Fire(interact_model)
  File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "src/interactive_conditional_samples.py", line 47, in interact_model
    enc = encoder.get_encoder(model_name, models_dir)
  File "/content/gpt-2/gpt-2/src/encoder.py", line 109, in get_encoder
    with open(os.path.join(models_dir, model_name, 'encoder.json'), 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'models/345M/encoder.json'


### 3. Summarization



- Context: article and text *`TL;DR:`* or *`Summary:`* at the end.
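
As a minimal sketch of this format, the hypothetical helper below (`build_summary_prompt` is not part of the repository) appends one of the two cues to an article. The article text in the usage line is a placeholder.

```python
ALLOWED_CUES = ("TL;DR:", "Summary:")


def build_summary_prompt(article, cue="TL;DR:"):
    """Append a summarization cue on its own line after the article text."""
    if cue not in ALLOWED_CUES:
        raise ValueError(f"cue must be one of {ALLOWED_CUES}")
    return article.rstrip() + "\n" + cue


print(build_summary_prompt("First sentence of the article.\nSecond sentence.\n"))
```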

Sample prompt:

```
Theodore McCarrick is the most senior Catholic figure to be dismissed from the priesthood in modern times.
US Church officials said allegations he had sexually assaulted a teenager five decades ago were credible.
Mr McCarrick, 88, had previously resigned but said he had "no recollection" of the alleged abuse.
"No bishop, no matter how influential, is above the law of the Church," Cardinal Daniel DiNardo, president of the United States Conference of Catholic Bishops said in a statement.
"For all those McCarrick abused, I pray this judgment will be one small step, among many, toward healing."
The alleged abuses may have taken place too long ago for criminal charges to be filed because of the statute of limitations.
Mr McCarrick was the archbishop of Washington DC from 2001 to 2006. Since his resignation last year from the College of Cardinals, he has been living in seclusion in a monastery in Kansas.
He was the first person to resign as a cardinal since 1927.
He is among hundreds of members of the clergy accused of sexually abusing children over several decades and his dismissal comes days before the Vatican hosts a summit on preventing child abuse.
The Vatican said Pope Francis had ruled Mr McCarrick's expulsion from the clergy as definitive, and would not allow any further appeals against the decision. 
TL;DR: 
```



In [None]:
!python3 src/interactive_conditional_samples.py --nsamples=3 --length=100 --temperature=1 

2019-02-17 01:22:47.097283: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-02-17 01:22:47.097552: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x28e2680 executing computations on platform Host. Devices:
2019-02-17 01:22:47.097596: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-02-17 01:22:47.185157: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-17 01:22:47.185826: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x28e35a0 executing computations on platform CUDA. Devices:
2019-02-17 01:22:47.185870: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2019-02-17 01:22:47.186287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found d

### 4. Translation



- Context: a few example pairs of the format *`english_sentence = spanish_sentence`*, and then *`english_sentence =`*  at the end. 
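
The few-shot format above can be sketched as a small helper. The name `build_translation_prompt` is hypothetical; the example pairs are taken from the sample prompt in this section.

```python
def build_translation_prompt(pairs, sentence):
    """Join 'english = spanish' example pairs, then leave the last one open."""
    lines = [f"{english} = {spanish}" for english, spanish in pairs]
    lines.append(f"{sentence} =")  # the model is expected to fill in the translation
    return "\n".join(lines)


prompt = build_translation_prompt(
    pairs=[
        ("Good morning.", "Buenos días."),
        ("How much does it cost?", "¿Cuánto cuesta?"),
    ],
    sentence="How old are you?",
)
print(prompt)
```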

Sample prompt:
```
Good morning. = Buenos días.
I am lost. Where is the restroom? = Estoy perdido. ¿Dónde está el baño?
How much does it cost? = ¿Cuánto cuesta?
How do you say maybe in Spanish? = ¿Cómo se dice maybe en Español?
Would you speak slower, please. = Por favor, habla mas despacio.
Where is the book store? = ¿Dónde está la librería?
At last a feminist comedian who makes jokes about men. = Por fin un cómico feminista que hace chistes sobre hombres.

How old are you? = 


```


In [None]:
!python3 src/interactive_conditional_samples.py --model_name='345M'  --nsamples=3 --temperature=1

2019-05-17 01:15:18.550026: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-05-17 01:15:18.550239: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3476680 executing computations on platform Host. Devices:
2019-05-17 01:15:18.550270: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-17 01:15:18.708562: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-17 01:15:18.709079: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3475fa0 executing computations on platform CUDA. Devices:
2019-05-17 01:15:18.709112: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2019-05-17 01:15:18.709483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found de