<a href="https://colab.research.google.com/github/jonkrohn/NLP-with-LLMs/blob/main/code/GPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPT

In this notebook (based on [Sinan Ozdemir's Introduction to GPT notebook](https://github.com/sinanuozdemir/oreilly-gpt-hands-on-nlg/blob/main/notebooks/Introduction_to_GPT.ipynb)), we:

1. Use a `transformers` pipeline to generate text with a GPT model in a couple of lines of code
2. Explore how the GPT-2 tokenizer splits text into tokens

### Load dependencies

In [None]:
%%capture
! pip install transformers==4.28.0

In [None]:
from transformers import pipeline, GPT2Tokenizer

### Hello, Pipeline! 

Let's use the `pipeline` function to build a text-generation pipeline.

Besides `"text-generation"`, other tasks we can carry out with pipelines include:
* `"sentiment-analysis"`
* `"ner"` (named entity recognition)
* `"summarization"`
* `"translation_en_to_fr"`
* `"feature-extraction"`

In [None]:
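# Download GPT-2 (on first use) and wrap it in a text-generation pipeline
generator = pipeline('text-generation', model='gpt2')

# Few-shot prompt: two worked examples, then ask for at most two new tokens
generator("The capital of Germany is Berlin. The capital of China is Beijing. The capital of France is",
          max_new_tokens=2)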

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The capital of Germany is Berlin. The capital of China is Beijing. The capital of France is Paris.'}]
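
The pipeline returns a list with one dictionary per generated sequence, and `generated_text` contains the prompt followed by the new tokens. As a minimal sketch, the snippet below slices the prompt off to isolate the continuation. It works on the output copied from the cell above rather than calling the model again, so the `outputs` variable is a stand-in for the real return value.

```python
prompt = ("The capital of Germany is Berlin. The capital of China is Beijing. "
          "The capital of France is")

# Stand-in for the pipeline's return value, copied from the output shown above
outputs = [{'generated_text': prompt + ' Paris.'}]

# The generated text starts with the prompt, so slicing by its length
# leaves only the newly generated tokens
continuation = outputs[0]['generated_text'][len(prompt):]
print(repr(continuation))
```

If we only ever want the continuation, passing `return_full_text=False` when calling the generator should have the same effect without the manual slicing.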

### Exploring tokens

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2') # load up a tokenizer

In [None]:
'love' in tokenizer.get_vocab()

True

In [None]:
'Sinan' in tokenizer.get_vocab()

False
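
A word that is missing from the vocabulary is not an error: GPT-2 uses byte-pair encoding (BPE), which builds such a word out of smaller pieces by applying learned merge rules in priority order. The toy sketch below illustrates the idea in plain Python. The function name `bpe` and the three merge rules in `toy_merges` are hypothetical and chosen only for illustration; they are not GPT-2's real merge table, which contains tens of thousands of rules and operates on bytes.

```python
def bpe(word, merge_ranks):
    """Split `word` into pieces by repeatedly applying the best-ranked merge."""
    symbols = list(word)  # start from individual characters
    while len(symbols) > 1:
        # Find the adjacent pair with the lowest (best) rank
        pairs = set(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no known merge applies any more
        # Merge every occurrence of the best pair, scanning left to right
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules, highest priority first (NOT GPT-2's real table)
toy_merges = [("S", "i"), ("Si", "n"), ("a", "n")]
toy_ranks = {pair: rank for rank, pair in enumerate(toy_merges)}

pieces = bpe("Sinan", toy_ranks)
print(pieces)
```

With these toy rules the word ends up as two pieces, each of which would then be looked up in the vocabulary to get its ID.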

Encode a string:

In [None]:
tokenizer.encode('Sinan loves a beautiful day')

[46200, 272, 10408, 257, 4950, 1110]

...then convert the IDs back into tokens:

In [None]:
tokenizer.convert_ids_to_tokens(tokenizer.encode('Sinan loves a beautiful day'))

['Sin', 'an', 'Ġloves', 'Ġa', 'Ġbeautiful', 'Ġday']

(The `Ġ` character denotes a space before the token.)
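
Why `Ġ` in particular? GPT-2's tokenizer works on bytes and displays every byte as a printable character. Bytes that are not already printable, including the space (byte 32), are shifted to code points starting at 256, and `chr(256 + 32)` is `Ġ`. The sketch below rebuilds that byte-to-character table in plain Python, modeled on the `bytes_to_unicode` helper in GPT-2's reference encoder. Treat it as an illustration of the mapping rather than the library's actual code.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character
    bs = (list(range(0x21, 0x7F))      # '!' .. '~'
          + list(range(0xA1, 0xAD))    # '¡' .. '¬'
          + list(range(0xAE, 0x100)))  # '®' .. 'ÿ'
    cs = bs[:]
    n = 0
    # Every remaining byte (control characters, space, ...) is shifted to 256+
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}

print(byte_encoder[ord(" ")])  # the character that stands in for a space

# Reverse the mapping to turn the tokens above back into the original string
tokens = ['Sin', 'an', 'Ġloves', 'Ġa', 'Ġbeautiful', 'Ġday']
text = bytes(byte_decoder[ch] for ch in "".join(tokens)).decode("utf-8")
print(text)
```

In practice there is no need to do this by hand: `tokenizer.convert_tokens_to_string(tokens)` or `tokenizer.decode(ids)` performs the same reversal.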