## Generative Pre-trained Transformer (GPT)

GPT is a language model that generates text autoregressively: it predicts the next token in a sequence from the preceding context. It is trained on the task of predicting the next token in a text sequence, which is what allows it to generate fluent, coherent text. It can be used for a variety of natural language tasks, including text generation, machine translation, and text summarization.
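
To make the autoregressive loop concrete, here is a minimal sketch that uses a toy bigram counter in place of a neural network. The corpus and the names `train_bigram`, `generate_greedy`, and `bigram_counts` are illustrative assumptions and not part of GPT or any library; only the decoding pattern (predict the next token, append it, repeat) is the point.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the training text."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate_greedy(counts, prompt, max_new_tokens):
    """Greedy autoregressive decoding: append the most frequent follower, then repeat."""
    output = list(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(output[-1])
        if not followers:
            break  # the last token was never followed by anything in training
        next_token = followers.most_common(1)[0][0]
        output.append(next_token)
    return output


# Hypothetical toy corpus, chosen so that no tie occurs along the decoded path.
corpus = "the cat sat on the mat . the cat sat on the sofa .".split()
bigram_counts = train_bigram(corpus)

print(" ".join(generate_greedy(bigram_counts, ["the"], 6)))
# prints: the cat sat on the cat sat
```

Each new token depends only on what has been generated so far. Because the choice is always the single most likely follower, the toy model falls into a loop, a behaviour that greedy decoding can also produce in much larger models.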

In [1]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained GPT-2 model and tokenizer
model_name = 'gpt2'  # You can also try other versions like 'gpt2-medium', 'gpt2-large', 'gpt2-xl'.
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Encode the prompt as a tensor of token IDs (GPT-2's tokenizer does not append an end-of-sequence token here)
input_text = "Once upon a time, in a land far, far away,"
input_tokens = tokenizer.encode(input_text, return_tensors="pt")

# Generate text: max_length counts the prompt tokens too; GPT-2 has no pad token, so EOS is used for padding
output_tokens = model.generate(input_tokens, max_length=100, pad_token_id=tokenizer.eos_token_id)

# Decode the generated tokens to a readable string
generated_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("Generated Text: ", generated_text)

Generated Text:  Once upon a time, in a land far, far away, the world was a land of the dead, and the dead were the living.

The dead were the living, and the living were the living.

The dead were the living, and the living were the living.

The dead were the living, and the living were the living.

The dead were the living, and the living were the living.

The dead were the living, and the living


## T5: Text-to-Text Transfer Transformer

T5 takes a different approach known as "text-to-text", in which every task is framed as mapping an input text to an output text. Rather than being trained only to predict the next word in a sequence, T5 is trained to convert an input text into an output text, regardless of the task. This makes T5 flexible and general: the same model can be applied to a wide range of natural language processing tasks, including translation, summarization, and question answering.
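
The text-to-text framing can be sketched without loading any model: each task instance becomes one input string, with a prefix that names the task. The helper `to_text_to_text` below is hypothetical and written only for illustration. The prefix strings follow the style commonly shown for T5 (`summarize: `, `translate English to German: `, `question: ... context: ...`), but which prefixes a given checkpoint actually responds to depends on how it was trained.

```python
def to_text_to_text(task, **fields):
    """Cast one task instance as a single input string with a task prefix.

    Hypothetical helper for illustration; it is not part of the transformers library.
    """
    if task == "summarize":
        return "summarize: " + fields["text"]
    if task == "translate":
        return f"translate {fields['source']} to {fields['target']}: {fields['text']}"
    if task == "question":
        return f"question: {fields['question']} context: {fields['context']}"
    raise ValueError(f"unknown task: {task}")


examples = [
    to_text_to_text("summarize", text="The pandemic disrupted food systems."),
    to_text_to_text("translate", source="English", target="German", text="That is good."),
    to_text_to_text("question", question="Who wrote it?", context="Ana wrote the report."),
]

for example in examples:
    print(example)
```

Every string produced this way can be fed to the same model with the same tokenizer, and the expected answer is again plain text. That is what allows a single set of weights to serve several tasks.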

In [1]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the pre-trained T5 model and tokenizer
model_name = 't5-small'  # 't5-small' is a smaller version of the model; other versions include 't5-base', 't5-large', 't5-3b', and 't5-11b'.
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Define the task and input text
task_prefix = "summarize: "  # T5 uses prefixes to indicate the task it should perform.
input_text = task_prefix + "The COVID-19 pandemic has led to a dramatic loss of human life worldwide and presents an unprecedented challenge to public health, food systems, and the world of work. The economic and social disruption caused by the pandemic is devastating: tens of millions of people are at risk of falling into extreme poverty, while the number of undernourished people, currently estimated at nearly 690 million, could increase by up to 132 million by the end of the year."
input_tokens = tokenizer.encode(input_text, return_tensors="pt")

# Generate summary
summary_ids = model.generate(input_tokens, max_length=100, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True)

# Decode the generated summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Summary: ", summary)

Summary:  the COVID-19 pandemic has led to a dramatic loss of human life worldwide. tens of millions of people are at risk of falling into extreme poverty. the number of undernourished people could increase by up to 132 million by the end of the year.
