<a href="https://colab.research.google.com/github/jonkrohn/NLP-with-LLMs/blob/main/code/T5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T5 

In this notebook, we use T5 "out of the box" (that is, the pre-trained `t5-base` checkpoint with no additional fine-tuning) for two text-generation tasks: translation and summarization.

**TO DO**: 
1. Develop a little further based on [Sinan's notebook](https://github.com/sinanuozdemir/oreilly-hands-on-transformers/blob/main/notebooks/t5.ipynb).
2. Understand all arguments passed to `model.generate()`.

### Load dependencies

In [1]:
%%capture
!pip install transformers==4.28.0 sentencepiece==0.1.98

In [2]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

### Load model

In [3]:
tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


### Perform inference

In [4]:
# translate
input_ids = tokenizer.encode('translate English to German: Where is the chocolate?', return_tensors='pt')

translate_ids = model.generate(
    input_ids,
    num_beams=4,
    no_repeat_ngram_size=3,
    max_length=20,
    early_stopping=True
)

output = tokenizer.decode(translate_ids[0], skip_special_tokens=True)

print(f"Translated text:\n{output}")

Translated text:
Wo ist die Schokolade?
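
Of the arguments passed to `model.generate()` above, `no_repeat_ngram_size=3` is perhaps the least self-explanatory: it prevents any sequence of three consecutive tokens from appearing twice in the generated output. The following is a simplified, pure-Python sketch of that idea, not the code Transformers actually runs. The helper name `banned_next_tokens` is hypothetical, and the example works on whole words rather than token IDs to keep it readable.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already present in `generated`.

    Hypothetical helper for illustration only; it is not part of Transformers.
    """
    if ngram_size <= 0 or len(generated) < ngram_size:
        return set()  # too short to contain any complete n-gram yet

    # The last (n - 1) tokens are the start of the next potential n-gram
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])

    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        # If an earlier n-gram starts with the same prefix, its final token
        # would recreate that n-gram, so it must not be generated next
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


words = ["the", "cat", "sat", "on", "the", "cat"]
print(banned_next_tokens(words, ngram_size=3))  # {'sat'}
```

During generation, a check along these lines would be applied at every step, with banned tokens given zero probability before the next token is chosen. Here "sat" is blocked because emitting it would repeat the trigram "the cat sat".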


In [7]:
# summarize abstract of T5 paper (arxiv.org/abs/1910.10683)
text_to_summarize = """Transfer learning, where a model is first pre-trained on a 
data-rich task before being fine-tuned on a downstream task, has emerged as a 
powerful technique in natural language processing (NLP). The effectiveness of 
transfer learning has given rise to a diversity of approaches, methodology, and 
practice. In this paper, we explore the landscape of transfer learning techniques 
for NLP by introducing a unified framework that converts all text-based language 
problems into a text-to-text format. Our systematic study compares pre-training 
objectives, architectures, unlabeled data sets, transfer approaches, and other 
factors on dozens of language understanding tasks. By combining the insights from 
our exploration with scale and our new Colossal Clean Crawled Corpus, we 
achieve state-of-the-art results on many benchmarks covering summarization, 
question answering, text classification, and more. To facilitate future work on 
transfer learning for NLP, we release our data set, pre-trained models, and code."""

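# collapse newlines and runs of whitespace into single spaces so that
# words on either side of a line break are never glued together
preprocess_text = " ".join(text_to_summarize.split())

t5_prepared_text = "summarize: " + preprocess_text  # prepend the task prefix

# truncate explicitly at 512 tokens instead of relying on the tokenizer's default
input_ids = tokenizer.encode(t5_prepared_text, return_tensors="pt", max_length=512, truncation=True)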

# generate the summary with beam search
summary_ids = model.generate(
    input_ids,
    num_beams=4,
    no_repeat_ngram_size=3,
    min_length=30,
    max_length=50,
    early_stopping=True
)

output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(f"Summarized text: \n{output}")

Summarized text: 
transfer learning has emerged as a powerful technique in natural language processing (NLP) a unified framework converts all text-based language problems into a text-to-text format. our study compares pre-training objectives
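
Both cells above follow the same pattern: T5 is told which task to perform by a plain-text prefix (`translate English to German: ` or `summarize: `) placed in front of the input. As a minimal sketch, that prompt-building step could be factored into a small helper. The names `TASK_PREFIXES` and `build_t5_prompt`, and the task keys, are hypothetical and chosen only for illustration; the two prefix strings are the ones used in this notebook.

```python
# Task prefixes copied from the two inference cells in this notebook.
# The dictionary keys are hypothetical labels, not part of any library.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
}


def build_t5_prompt(task, text):
    """Normalize whitespace in `text` and prepend the prefix for `task`."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"Unknown task: {task!r}")
    # Collapse newlines and repeated spaces into single spaces
    cleaned = " ".join(text.split())
    return TASK_PREFIXES[task] + cleaned


print(build_t5_prompt("translate_en_de", "Where is the chocolate?"))
# translate English to German: Where is the chocolate?

print(build_t5_prompt("summarize", "Transfer learning\nhas emerged as a powerful technique."))
# summarize: Transfer learning has emerged as a powerful technique.
```

The returned string can then be passed to `tokenizer.encode(...)` exactly as in the cells above.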
