<a href="https://colab.research.google.com/github/minhphat2000/mini_project_python/blob/main/Module05%20-%20NLP%20Applications/Project08%20-%20Text%20Generation%20with%20Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation with Transformers

It turns out we don’t need an entire Transformer to apply transfer learning and obtain a fine-tunable language model for NLP tasks; the decoder alone is enough. The decoder is a natural fit for language modeling (predicting the next word) because it is built to mask future tokens, a valuable property when it is generating a translation word by word.
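
To make the masking idea concrete, here is a minimal sketch in plain Python (no deep learning library) of a causal attention mask, where position `i` may attend only to positions `0..i`. The helper names `causal_mask` and `masked_softmax` and the example scores are illustrative assumptions, not part of GPT-2 or of any library.

```python
import math

def causal_mask(n):
    """Return an n x n lower-triangular mask: row i allows columns 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, ignoring masked (0) positions."""
    allowed = [s for s, m in zip(scores, mask_row) if m]
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# Hypothetical raw attention scores for the second token (index 1).
weights = masked_softmax([0.5, 1.5, 2.0, 3.0], mask[1])
print(mask)
print(weights)
```

The second token places zero weight on positions 2 and 3, however large their raw scores are, so nothing from the "future" can influence its prediction.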

Here we will use the GPT-2 model to generate text from an input sequence.

![](https://i.imgur.com/z4k1IzU.png)

# Install Dependencies

In [1]:
!pip install pytorch-transformers

Collecting pytorch-transformers
  Downloading pytorch_transformers-1.2.0-py3-none-any.whl (176 kB)
Collecting boto3 (from pytorch-transformers)
  Downloading boto3-1.34.35-py3-none-any.whl (139 kB)
Collecting sacremoses (from pytorch-transformers)
  Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Collecting botocore<1.35.0,>=1.34.35 (from boto3->pytorch-transformers)
  Downloading botocore-1.34.35-py3-none-any.whl (11.9 MB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3->pytorch-transformers)

# Load GPT2 Model

In [2]:
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

100%|██████████| 1042301/1042301 [00:00<00:00, 3292285.58B/s]
100%|██████████| 456318/456318 [00:00<00:00, 1819564.69B/s]


# Next Word Generation with GPT-2

GPT-2 is the successor to GPT, OpenAI's original generative pre-trained Transformer language model. The full GPT-2 model has 1.5 billion parameters, more than 10 times as many as GPT. GPT-2 gave state-of-the-art results at the time of its release, as you might have surmised already (and will soon see when we get into Python).

The pre-trained model was trained on text from 8 million web pages collected by following outbound links from Reddit.

![](https://i.imgur.com/TbnGbjX.png)

The architecture of GPT-2 is based on the Transformer, proposed by Google researchers in the paper “Attention Is All You Need”. The original Transformer uses an encoder-decoder structure with attention to capture dependencies between input and output; GPT-2 keeps only the decoder stack.

At each step, the model consumes the previously generated symbols as additional input when generating the next output.
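
The loop below sketches this autoregressive process with a hand-written lookup table standing in for the model. `TOY_NEXT_WORD` and `generate` are hypothetical and only illustrate the control flow: each newly chosen token is appended to the input before the next step. A real model conditions on the whole sequence so far, not just on the last word.

```python
# Hypothetical stand-in for a language model: maps the last word to the next word.
TOY_NEXT_WORD = {
    "we": "are",
    "are": "working",
    "working": "on",
    "on": "it",
    "it": "<eos>",
}

def generate(prompt, max_steps=10):
    """Append one token per step, feeding the growing sequence back in."""
    tokens = prompt.split()
    for _ in range(max_steps):
        next_token = TOY_NEXT_WORD.get(tokens[-1], "<eos>")
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return " ".join(tokens)

result = generate("hello everyone we")
print(result)
```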

![](https://i.imgur.com/0XSSXBd.png)

Modifications in GPT-2 include:

- The model uses a larger context window and a larger vocabulary
- After the final self-attention block, an additional normalization layer is added
- Layer normalization is moved to the input of each sub-block, similar to a pre-activation residual unit, where normalization is applied before the weight layers rather than after them
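
The difference between the two orderings can be sketched in plain Python. This is a simplified illustration under stated assumptions: `layer_norm` omits the learned scale and shift, and `toy_sublayer` is a hypothetical stand-in for the attention or MLP sub-block.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_block(x, sublayer):
    """Original Transformer ordering: normalize after the residual addition."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer):
    """GPT-2 ordering: normalize the sub-block input, leave the residual path untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or the MLP: doubles each element."""
    return [2.0 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
pre = pre_ln_block(x, toy_sublayer)
post = post_ln_block(x, toy_sublayer)
print(pre)
print(post)
```

In the pre-normalization form the input `x` passes through the residual connection unchanged, which is why the printed model below also shows a final `ln_f` layer after the last block.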

In [4]:
text = "Welcome to the open data science conference it is"
indexed_tokens = tokenizer.encode(text)
indexed_tokens

[19134, 284, 262, 1280, 1366, 3783, 4495, 340, 318]

In [5]:
tokens_tensor = torch.tensor([indexed_tokens])
tokens_tensor

tensor([[19134,   284,   262,  1280,  1366,  3783,  4495,   340,   318]])

In [6]:
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
model

100%|██████████| 665/665 [00:00<00:00, 478999.17B/s]
100%|██████████| 548118077/548118077 [00:16<00:00, 33857940.17B/s]


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
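
The printed module tree is enough to estimate the size of this checkpoint. The arithmetic below is a sketch under two assumptions that the repr does not show: the fused query/key/value projection `c_attn` is taken to be `3 * 768` wide and the MLP hidden layer `4 * 768` wide, and `lm_head` is taken to share its weights with `wte`.

```python
# Sizes read from the printed module tree.
vocab_size, n_positions, d_model, n_layers = 50257, 1024, 768, 12

# Assumed projection widths (not shown in the repr): fused QKV and MLP hidden size.
d_qkv, d_mlp = 3 * d_model, 4 * d_model

def linear(n_in, n_out):
    """Weights plus biases of one dense projection."""
    return n_in * n_out + n_out

layer_norm = 2 * d_model  # scale and shift vectors

per_block = (
    layer_norm                   # ln_1
    + linear(d_model, d_qkv)     # attn.c_attn
    + linear(d_model, d_model)   # attn.c_proj
    + layer_norm                 # ln_2
    + linear(d_model, d_mlp)     # mlp.c_fc
    + linear(d_mlp, d_model)     # mlp.c_proj
)

embeddings = vocab_size * d_model + n_positions * d_model  # wte + wpe
# lm_head is assumed to share its weights with wte, so it adds nothing here.
total = embeddings + n_layers * per_block + layer_norm  # final term is ln_f

print(f"{total:,}")
```

Under these assumptions the count comes to roughly 124 million parameters, far below the 1.5 billion of the full model, which suggests that `'gpt2'` refers to the smallest released variant.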

In [7]:
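# Use the GPU when one is available; otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokens_tensor = tokens_tensor.to(device)
model.to(device)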


In [8]:
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

In [9]:
predictions.shape

torch.Size([1, 9, 50257])

In [10]:
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
predicted_text

' Welcome to the open data science conference it is a'
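
`predictions[0, -1, :]` holds one raw score (logit) per vocabulary entry for the position after the last input token, and `argmax` keeps only the highest one. The sketch below uses a made-up five-word vocabulary and made-up logits to show how those scores relate to probabilities and to the runner-up candidates that greedy selection discards.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 5-word vocabulary and made-up logits for the final position.
vocab = [" a", " the", " not", " very", " going"]
last_position_logits = [3.2, 2.9, 1.0, 0.4, -1.5]

probs = softmax(last_position_logits)
greedy_index = max(range(len(probs)), key=probs.__getitem__)  # same role as torch.argmax

ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
for word, p in ranked[:3]:
    print(f"{word!r}: {p:.3f}")
print("greedy choice:", vocab[greedy_index])
```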

In [21]:
start = 'Hello everyone, we'
indexed_tokens = tokenizer.encode(start)
device = next(model.parameters()).device  # run on whichever device holds the model

# Greedy decoding: append the most likely next token 75 times
for i in range(75):
    tokens_tensor = torch.tensor([indexed_tokens]).to(device)
    with torch.no_grad():
        outputs = model(tokens_tensor)
    predictions = outputs[0]
    predicted_index = torch.argmax(predictions[0, -1, :]).item()
    indexed_tokens = indexed_tokens + [predicted_index]

In [22]:
predicted_text = tokenizer.decode(indexed_tokens)
print(predicted_text)

 Hello everyone, we are very sorry for the inconvenience. We are working on a new version of the game, and we are working on a new version of the game, and we are working on a new version of the game, and we are working on a new version of the game, and we are working on a new version of the game, and we are working on a new version of


# Paragraph Generation with GPT-2

Refer to this [source code](https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_generation.py#L106-L129) for a deeper dive.

- `length`: The number of tokens to generate. If `length` is None, the number of tokens is determined by the model hyperparameters
- `temperature`: Controls the randomness of sampling by rescaling the scores before they are turned into probabilities (a Boltzmann distribution). Lower temperatures give less random completions; as the temperature approaches zero, the model becomes deterministic and repetitive. Higher temperatures give more random completions
- `top_k`: Controls diversity. With `top_k` set to 1, only the single most likely token is considered at each step; with `top_k` set to 40, the 40 most likely tokens are considered. 0 (the default) is a special setting meaning no restriction. `top_k = 40` is generally a good value
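
How `temperature` and `top_k` interact can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the code of `run_generation.py`: the function name `sample_next`, the four made-up logits, and the fixed seed are all hypothetical.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=0, rng=random):
    """Sample one index from logits after temperature scaling and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in logits]
    indices = list(range(len(scaled)))
    if top_k > 0:
        # Keep only the k highest-scoring candidates; 0 means no restriction.
        indices = sorted(indices, key=lambda i: scaled[i], reverse=True)[:top_k]
    peak = max(scaled[i] for i in indices)
    weights = [math.exp(scaled[i] - peak) for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)          # fixed seed so the run is repeatable

greedy_like = [sample_next(logits, top_k=1, rng=rng) for _ in range(5)]
diverse = [sample_next(logits, temperature=1.5, top_k=3, rng=rng) for _ in range(200)]
print(greedy_like)
print(sorted(set(diverse)))
```

With `top_k=1` every draw returns the highest-scoring index, which reproduces the greedy `argmax` loop used earlier. With `top_k=3` the lowest-scoring token can never be chosen, while a temperature above 1 flattens the distribution over the three remaining candidates.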

In [13]:
!git clone https://github.com/huggingface/pytorch-transformers.git

Cloning into 'pytorch-transformers'...
remote: Enumerating objects: 181527, done.
remote: Counting objects: 100% (354/354), done.
remote: Compressing objects: 100% (185/185), done.
remote: Total 181527 (delta 182), reused 260 (delta 131), pack-reused 181173
Receiving objects: 100% (181527/181527), 201.52 MiB | 25.04 MiB/s, done.
Resolving deltas: 100% (127136/127136), done.


In [14]:
!python pytorch-transformers/examples/run_generation.py \
    --model_type=gpt2 \
    --length=500 \
    --model_name_or_path=gpt2

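python3: can't open file '/content/pytorch-transformers/examples/run_generation.py': [Errno 2] No such file or directory

The command fails because the cloned repository no longer has the script at this path: the project has since been renamed to `transformers` and its examples were reorganized, so the command needs a checkout that still contains `examples/run_generation.py`.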
