### Transformers


This notebook uses pre-trained models from Hugging Face to build a translator.

#### Transformers provides the following tasks out of the box:

- Sentiment analysis: is a text positive or negative?
- Text generation (in English): provide a prompt and the model will generate what follows.
- Named entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place, etc.).
- Question answering: provide the model with some context and a question, and it extracts the answer from the context.
- Filling masked text: given a text with masked words (e.g., replaced by [MASK]), fill in the blanks.
- Summarization: generate a summary of a long text.
- Language translation: translate a text into another language.
- Feature extraction: return a tensor representation of the text.
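The tasks listed above correspond to task identifiers that can be passed to `pipeline(...)`. The mapping below is a small sketch for orientation: the dictionary name `PIPELINE_TASKS` is made up for this example, and the translation entry shows only one language pair. No model is loaded here.

```python
# Task identifiers accepted by transformers.pipeline(...), keyed by the task names listed above.
# PIPELINE_TASKS is a hypothetical name used only for this illustration.
PIPELINE_TASKS = {
    "Sentiment analysis": "sentiment-analysis",
    "Text generation": "text-generation",
    "Named entity recognition": "ner",
    "Question answering": "question-answering",
    "Filling masked text": "fill-mask",
    "Summarization": "summarization",
    "Language translation": "translation_en_to_de",  # one language pair as an example
    "Feature extraction": "feature-extraction",
}

for name, task in PIPELINE_TASKS.items():
    # With transformers installed, each line corresponds to a call like pipeline(task).
    print(f"{name:26s} pipeline({task!r})")
```

Calling `pipeline(task)` without a `model` argument falls back to a default model for that task, which is convenient for experiments but worth pinning explicitly elsewhere.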

### The Transformers package provides over 30 pre-trained models covering more than 100 languages, built on eight major architectures for natural language understanding (NLU) and natural language generation (NLG):

- BERT (from Google)
- GPT (from OpenAI)
- GPT-2 (from OpenAI)
- Transformer-XL (from Google/CMU)
- XLNet (from Google/CMU)
- XLM (from Facebook)
- RoBERTa (from Facebook)
- DistilBERT (from Hugging Face)


### References:
- https://www.kdnuggets.com/2021/02/hugging-face-transformer-basics.html
- https://www.kaggle.com/code/scratchpad/notebook24768ac198/edit
- https://huggingface.co/languages
- https://github.com/Helsinki-NLP/UkrainianLT
- https://huggingface.co/docs/transformers/model_doc/t5

In [19]:
from transformers import pipeline, set_seed
import warnings

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")


### Install Transformers

In [1]:
!pip install transformers



### Model: GPT-2

GPT-2 is a transformer model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on raw text only, with no human labelling (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. Specifically, it was trained to guess the next word in sentences.

More precisely, the inputs are sequences of continuous text of a certain length, and the targets are the same sequences shifted one token (a word or piece of a word) to the right. Internally, the model uses a masking mechanism to make sure the prediction for token i only uses the inputs from 1 to i and not the future tokens.

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks. However, the model is best at what it was pretrained for, which is generating text from a prompt.
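The shifted targets and the mask described above can be pictured with a toy example. This is only an illustration in plain Python: the token list is made up, real GPT-2 works on subword token IDs, and positions here are counted from 0 rather than 1.

```python
# Toy illustration of next-token targets and a causal mask (no model involved).
tokens = ["Hello", ",", "I", "like", "cricket"]

# Inputs are all tokens but the last; targets are the same sequence shifted by one token.
inputs = tokens[:-1]
targets = tokens[1:]

# causal_mask[i][j] is True when position i may use position j, i.e. only when j <= i.
n = len(inputs)
causal_mask = [[j <= i for j in range(n)] for i in range(n)]

for inp, tgt in zip(inputs, targets):
    print(f"{inp!r} -> {tgt!r}")

# Print the mask as rows of 1 (visible) and 0 (hidden future token).
for row in causal_mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Each row of the mask has one more visible position than the row before it, which is what keeps a prediction from looking at future tokens.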


### T5

T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, each of which is converted into a text-to-text format. T5 works well on a variety of tasks out of the box by prepending a task-specific prefix to the input, e.g., `translate English to German: …` for translation and `summarize: …` for summarization. T5 comes in different sizes: small, base, large, 3b and 11b.
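Because the task is selected purely through the input text, preparing a T5 input amounts to string concatenation. The sketch below uses the two prefixes quoted above; the names `TASK_PREFIXES` and `with_task_prefix` are hypothetical helpers for this example and are not part of the Transformers library.

```python
# Hypothetical helper: T5 selects the task through a plain-text prefix on the input.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
}


def with_task_prefix(task: str, text: str) -> str:
    """Return the text with the prefix for the given task prepended."""
    return TASK_PREFIXES[task] + text


# The resulting string is what would be passed to the T5 tokenizer.
print(with_task_prefix("translate_en_de", "We can make it together"))
print(with_task_prefix("summarize", "Computers can do lots of jobs."))
```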

### Different functions

In [3]:
# Text generation with GPT-2
generator = pipeline('text-generation', model='gpt2')
set_seed(42)  # fix the random seed so the sampled continuations are reproducible

generator("Hello, I like to play cricket,", max_length=60, num_return_sequences=7)
#generator("This year we are waiting", max_length=10, num_return_sequences=5)
#generator("Image processing has a great potential for ", max_length=10, num_return_sequences=5)

##############################################################################################

# Sentiment analysis
classifier = pipeline('sentiment-analysis')
classifier('The secret of getting ahead is getting started.')

##############################################################################################

# Question answering: the answer is extracted as a span of the context
question_answerer = pipeline('question-answering')
question_answerer({
    'question': "What is Newton's third law of motion?",
    'context': 'Newton’s third law of motion states that, "For every action there is an equal and opposite reaction"'})

##############################################################################################
# Question answering on a longer context
nlp = pipeline("question-answering")

context = r"""
Computers can do lots of jobs. They can do maths, store information, or play music. You can use a computer to write or to play games. What do you know about the history of computers?
The first computers were very big. They were the size of a room! They were so big that people didn't have them at home. Early computers could also only do simple maths, like a calculator. In the 1930s Alan Turing had the idea for a computer you could program to do different things.
In 1958 Jack Kilby invented the microchip. Microchips are tiny but can store lots of information. They helped make computers smaller. In the 1970s computers were smaller and cheaper so people started to use them at home. In the 1980s computer games were very popular. Lots of people bought computers just to play games.
In 1989 Tim Berners-Lee invented the World Wide Web, which is a way to organise information on the internet. Now people all over the world can look for and share information on websites.
Today people can use smartphones to play games, email and go on the internet. In the past a simple computer was the size of a room. Now it can go in your pocket!
Fun facts
More than 5 billion people use the internet!
More than 300 billion emails are sent every day!
The first computer mouse was made of wood!
"""

# Question 1 (phrased so that the answer is a span of the context)
result = nlp(question="How big were the first computers?", context=context)
print(f"Answer 1: '{result['answer']}'")

# Question 2
result = nlp(question="Who had the idea for a computer you could program?", context=context)
print(f"Answer 2: '{result['answer']}'")

##############################################################################################
# Text prediction
unmasker = pipeline('fill-mask', model='bert-base-cased')
unmasker("Hello, My name is [MASK].")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I like to play cricket, but what I do in that sport is different—I've been here for over 12 years, and I've heard what people were saying: 'Well, your father should have been playing cricket in this country by now'. The first thing to know is that people"},
 {'generated_text': 'Hello, I like to play cricket, sometimes, but they\'re the same rules," said his nephew, who has been playing cricket for many years and has not been seen on television.\n\n\'One bad cricket match might not be enough\'\n\nA year ago the retired ex-colleg'},
 {'generated_text': "Hello, I like to play cricket, but I've never been in a place that can get me started as a bat pro, so that has to be my goal.\n\nWith the Ashes there is just too much talk about the World Cup qualifiers and how they don't get any more important."},
 {'generated_text': 'Hello, I like to play cricket, not cricket."\n\nAnd you\'re not the smartest.\n\n"Yeah, yeah, yeah, yeah. Actually, I\'m pretty good at cricket

### English to German translation

In [17]:
# English to German
translator_ger = pipeline("translation_en_to_de")
print("German: ", translator_ger("We can make it together", max_length=40)[0]['translation_text'])

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


German:  Wir können es gemeinsam machen


### Ukrainian to English translator

In [18]:
# Ukrainian to English
translator_uk = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-zle-en")
print(translator_uk("Мене звати Вольфґанґ і я живу в Берліні."))

[{'translation_text': 'My name is Wolfgang and I live in Berlin.'}]
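The printed result is a list with one dictionary per input sentence, and the translated string sits under the `translation_text` key. A minimal sketch of pulling out just the text, using the output shown above as literal data so that no model needs to be loaded (the variable names are chosen for this example):

```python
# Output copied from the cell above; no model is loaded in this sketch.
outputs = [{'translation_text': 'My name is Wolfgang and I live in Berlin.'}]

# Each input sentence yields one dict; collect only the translated strings.
translations = [item['translation_text'] for item in outputs]
print(translations[0])
```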


### Translation using T5


In [22]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

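# T5 selects the task through a text prefix, so prepend the translation prefix to the input.
input_ids = tokenizer("translate English to German: We can make it together", return_tensors="pt").input_ids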
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Wir können es gemeinsam schaffen.
