# Chapter 1: Hello Transformers

## The encoder-decoder framework

Awaiting rewrite of introduction

In [1]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure
    from your online store in Germany. Unfortunately, when I opened the package,
    I discovered to my horror that I had been sent an action figure of Megatron
    instead! As a lifelong enemy of the Decepticons, I hope you can understand my
    dilemma. To resolve the issue, I demand an exchange of Megatron for the
    Optimus Prime figure I ordered. Enclosed are copies of my records concerning
    this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

In [2]:
import pandas as pd
from transformers import pipeline
classifier = pipeline("text-classification")
outputs = classifier(text)
pd.DataFrame(outputs)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.901546 |


## Named entity recognition

In [3]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.87901 | Amazon | 5 | 11 |
| 1 | MISC | 0.990859 | Optimus Prime | 36 | 49 |
| 2 | LOC | 0.999755 | Germany | 94 | 101 |
| 3 | MISC | 0.556567 | Mega | 216 | 220 |
| 4 | PER | 0.590256 | ##tron | 220 | 224 |
| 5 | ORG | 0.669691 | Decept | 265 | 271 |
| 6 | MISC | 0.498349 | ##icons | 271 | 276 |
| 7 | MISC | 0.775361 | Megatron | 366 | 374 |
| 8 | MISC | 0.987854 | Optimus Prime | 387 | 400 |
| 9 | PER | 0.812097 | Bumblebee | 526 | 535 |


## Question answering

In [4]:
reader = pipeline("question-answering")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])


No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.631292 | 351 | 374 | an exchange of Megatron |
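
The `start` and `end` values in the output are character offsets into the context string, so the answer can be recovered by slicing. The sketch below checks this without loading any model: it rebuilds the complaint text with the same 4-space continuation indent as the cell above and copies the offsets from the reported outputs. The `lines`, `qa_output`, and `ner_germany` names are illustrative only.

```python
# Rebuild the complaint text; joining with "\n    " reproduces the
# 4-space indent on the continuation lines of the original string.
lines = [
    "Dear Amazon, last week I ordered an Optimus Prime action figure",
    "from your online store in Germany. Unfortunately, when I opened the package,",
    "I discovered to my horror that I had been sent an action figure of Megatron",
    "instead! As a lifelong enemy of the Decepticons, I hope you can understand my",
    "dilemma. To resolve the issue, I demand an exchange of Megatron for the",
    "Optimus Prime figure I ordered. Enclosed are copies of my records concerning",
    "this purchase. I expect to hear from you soon. Sincerely, Bumblebee.",
]
text = "\n    ".join(lines)

# Values copied from the question-answering output.
qa_output = {"score": 0.631292, "start": 351, "end": 374,
             "answer": "an exchange of Megatron"}

# Slicing the context with the reported offsets yields the answer span.
qa_span = text[qa_output["start"]:qa_output["end"]]
print(qa_span)

# The NER offsets work the same way (values copied from the NER output).
ner_germany = {"word": "Germany", "start": 94, "end": 101}
ner_span = text[ner_germany["start"]:ner_germany["end"]]
print(ner_span)
```

Because the offsets count the newline and indentation characters too, they only line up when the context string is exactly the one passed to the pipeline.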


## Summarization

In [5]:
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=65, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


 Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that he had been sent an action figure of Megatron instead. As a lifelong enemy of the Decepticons, I hope you can understand my dilemma.


## Translation


In [6]:
translator = pipeline("translation_en_to_de",
                          model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])


Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt. Anbei sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, bald von Ihnen zu hören. Aufrichtig, Bumblebee.


## Text generation

In [7]:
generator = pipeline("text-generation")
response = "Dear Bumblebee, I am sorry to hear that your order was mixed up." 
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=200)
print(outputs[0]['generated_text'])

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Dear Amazon, last week I ordered an Optimus Prime action figure
    from your online store in Germany. Unfortunately, when I opened the package,
    I discovered to my horror that I had been sent an action figure of Megatron
    instead! As a lifelong enemy of the Decepticons, I hope you can understand my
    dilemma. To resolve the issue, I demand an exchange of Megatron for the
    Optimus Prime figure I ordered. Enclosed are copies of my records concerning
    this purchase. I expect to hear from you soon. Sincerely, Bumblebee.

Customer service response:
Dear Bumblebee, I am sorry to hear that your order was mixed up. However, we are currently working a deal with YouTuber HypeBusters about an "action figure",

and am happy to inform you that this figure will not be available on the US release shelves for


## 🤗 Ecosystem

![hf_ecosystem](../images/Chapter_1/hf_ecosystem.png)

The Hugging Face ecosystem consists of two parts:
1. A family of libraries\
Provides the code for working with the models
2. The Hub\
Provides the pretrained model weights, datasets, scripts for evaluation metrics, and more

### 🤗 Tokenizers

Tokenization splits raw text into smaller pieces called tokens.
🤗 Tokenizers provides many tokenization strategies and is extremely fast thanks to its Rust backend.
It also handles the pre- and post-processing steps, such as normalizing the input and transforming the
model outputs into the required format.
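
To make subword tokens concrete, here is a toy sketch of a greedy longest-match-first split in the WordPiece style, where `##` marks a piece that continues the previous one (as in the `Mega` / `##tron` rows of the NER output earlier). This is not the 🤗 Tokenizers implementation: the `toy_vocab` set and both function names are hypothetical, and a real vocabulary has tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix matched: the whole word maps to the unknown token.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


def tokenize(text, vocab):
    """Whitespace-split the text, then split each word into subwords."""
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


# Hypothetical miniature vocabulary, chosen to mirror the NER output.
toy_vocab = {"Optimus", "Prime", "Mega", "##tron", "Decept", "##icons"}

tokens = tokenize("Optimus Prime Megatron Decepticons", toy_vocab)
print(tokens)
# ['Optimus', 'Prime', 'Mega', '##tron', 'Decept', '##icons']
```

Words with no matching piece, such as `Bumblebee` under this toy vocabulary, collapse to `[UNK]`.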

### 🤗 Datasets

🤗 Datasets avoids RAM limitations by providing smart caching and leveraging a special mechanism called
*memory mapping*, which stores the contents of a file in virtual memory and enables multiple processes to
modify the file more efficiently. A dataset is therefore not limited by the amount of available RAM.
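
The idea behind memory mapping can be sketched with Python's standard `mmap` module. This is only an illustration of the mechanism, not how 🤗 Datasets is implemented; the temporary file and its contents are made up for the example.

```python
import mmap
import os
import tempfile

# Write a sample file (made-up rows standing in for a large dataset).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"label,text\n" + b"0,hello transformers\n" * 1000)
    path = f.name

with open(path, "rb") as f:
    # Map the whole file into virtual memory; pages are loaded lazily by
    # the operating system when they are accessed, not all at once.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        header = mm[:10]                # slicing reads only the touched bytes
        first_newline = mm.find(b"\n")  # search without reading into a str
        size = len(mm)

os.remove(path)
print(header, first_newline, size)
```

The mapped object behaves like a byte sequence, so code can index into a file much larger than RAM while the operating system decides which pages are actually resident.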

### 🤗 Accelerate

🤗 Accelerate adds a layer of abstraction to training loops and takes care of the custom logic that the
training infrastructure requires.


## Main challenges with transformers

1. Language\
NLP research is overwhelmingly dominated by English, so pretrained models for low-resource languages are rare.
2. Data availability\
Transfer learning cuts down on the task-specific labeled training data required, but the requirement
is still high compared with what a human needs to learn the same task.
3. Working with long documents\
Self-attention works well on paragraph-long texts, but it becomes very expensive on longer
texts such as whole documents.
4. Opacity\
It is hard or impossible to unravel "why" a model made a particular prediction.
5. Bias\
Transformer models are pretrained on large text corpora, which usually come from the internet, so the biases
present in that data are imprinted on the models. Guardrails are required to address this.
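
The cost noted under "Working with long documents" comes from standard self-attention comparing every token with every other token, so the attention matrix has one entry per token pair and grows quadratically with sequence length. The back-of-the-envelope sketch below assumes 4-byte (float32) values and counts a single attention matrix for one head in one layer; the function name and the chosen lengths are illustrative.

```python
def attention_matrix_bytes(seq_len, bytes_per_value=4):
    """Memory for one seq_len x seq_len attention matrix (float32 assumed)."""
    return seq_len * seq_len * bytes_per_value


for n in (512, 4096, 32768):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"{n:>6} tokens -> {mib:,.0f} MiB per attention matrix")
```

Going from 512 to 4096 tokens is an 8x increase in length but a 64x increase in the size of each attention matrix, before multiplying by the number of heads and layers.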