## Transformer Model
Transformer models are used to solve many kinds of NLP tasks, such as classification, language modeling, and named entity recognition (NER).

### Working with `pipeline`
-> The `transformers` library provides the `pipeline()` function, which connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible response.

In [1]:
from transformers import pipeline

In [4]:
# classifier pipeline
classifier = pipeline("sentiment-analysis")  # if no model is given, the pipeline falls back to a default one
classifier("I am very happy to learn.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998824596405029}]
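
The warning above recommends naming the model and revision explicitly. A minimal sketch of doing so, reusing the checkpoint name and revision hash printed in the log. The helper name `build_sentiment_classifier` and the two constants are illustrative, not part of the library:

```python
# Checkpoint and revision copied from the warning printed by the default pipeline.
SENTIMENT_MODEL = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
SENTIMENT_REVISION = "af0f99b"


def build_sentiment_classifier():
    """Hypothetical helper: build the pipeline with a pinned model and revision."""
    # Imported lazily so the constants can be inspected without loading the library.
    from transformers import pipeline

    return pipeline(
        "sentiment-analysis",
        model=SENTIMENT_MODEL,
        revision=SENTIMENT_REVISION,
    )
```

Calling `build_sentiment_classifier()` should return a classifier that is used exactly like the one above, without the "No model was supplied" warning.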

In [7]:
# passing multiple sentences
classifier(
    ["I am happy", "I hate to study"]
)

[{'label': 'POSITIVE', 'score': 0.9998801946640015},
 {'label': 'NEGATIVE', 'score': 0.9979158043861389}]

### How a pipeline works
- First, the text is preprocessed into a format the model can understand.
- Then, the preprocessed inputs are passed to the model.
- Finally, the model's predictions are post-processed so that we can make sense of them.
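
The three steps can be sketched with a toy example. This is not how `transformers` implements them: the word list, the weights, and every function name here are invented for illustration, and a real pipeline uses a tokenizer and a neural network instead.

```python
import math

# Toy vocabulary and weights, invented for this sketch (not a real model).
WORD_SCORES = {"happy": 2.0, "love": 1.5, "hate": -2.0, "sad": -1.5}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into tokens the toy model understands."""
    return [word.strip(".,!?").lower() for word in text.split()]


def model(tokens):
    """Step 2: produce one raw score (logit) per label."""
    total = sum(WORD_SCORES.get(token, 0.0) for token in tokens)
    return [-total, total]  # [NEGATIVE logit, POSITIVE logit]


def postprocess(logits):
    """Step 3: apply softmax to the logits and report the best label."""
    shifted = [math.exp(x - max(logits)) for x in logits]
    probs = [value / sum(shifted) for value in shifted]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, mirroring the shape of a pipeline result."""
    return postprocess(model(preprocess(text)))


print(toy_pipeline("I am happy"))
print(toy_pipeline("I hate to study"))
```

The output has the same shape as the sentiment-analysis results above (a `label` and a `score`), which is the point of the post-processing step.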

### Some common pipelines
- `feature-extraction`
- `fill-mask`
- `ner`
- `question-answering`
- `sentiment-analysis`
- `summarization`
- `translation`
- `zero-shot-classification`

In [9]:
### zero shot classification

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445994257926941, 0.11197388917207718, 0.0434267558157444]}
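
In this output the labels come back ordered by score rather than in the order they were passed, and the scores add up to roughly 1. A small sketch of reading the result, using the dictionary printed above (the variable names are illustrative):

```python
# Result dictionary copied from the zero-shot output above.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.8445994257926941, 0.11197388917207718, 0.0434267558157444],
}

# Pair each label with its score so it can be looked up by name.
label_scores = dict(zip(result["labels"], result["scores"]))

# Pick the label with the highest score.
best_label = max(label_scores, key=label_scores.get)

# The scores are spread across the candidate labels.
total = sum(result["scores"])

print(best_label, label_scores[best_label])
```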

In [12]:
### Text generation

generator = pipeline("text-generation")
generator("The capital city")

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The capital city has seen a steady stream of tourists in recent years, with the capital taking in some 8.8 million visitors between 2009 and 2011. This is up 18 percent year-on-year, from 7 million in 2008 and 7 million a'}]

In [17]:
generator("we can enjoy", num_return_sequences=2, max_length=15)  # max_length caps the total token count (prompt included); num_return_sequences sets how many samples are returned

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'we can enjoy a whole new sense of happiness without having to constantly worry that'},
 {'generated_text': 'we can enjoy the game without seeing any of the annoying side effect or bugs'}]

In [20]:
from transformers import pipeline

# using a model of our own choice from the Hub (this checkpoint is a Hungarian GPT-2, so the continuation is in Hungarian)
generator = pipeline("text-generation", model="NYTK/text-generation-news-gpt2-small-hungarian")
generator(
    "we can enjoy",
    max_length=30,
    num_return_sequences=1
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


[{'generated_text': 'we can enjoy, azaz a magyar nyelv és irodalom, valamint a magyar irodalom és irodalom, valamint a magyar irodalom és a magyar irodalom,'}]

### Mask filling

In [22]:
unmasker = pipeline("fill-mask")
unmasker("I am happy to learn <mask>.", top_k=2)

# The top_k argument controls how many candidate fills are returned.

No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.3803078234195709,
  'token': 55,
  'token_str': ' more',
  'sequence': 'I am happy to learn more.'},
 {'score': 0.07073522359132767,
  'token': 402,
  'token_str': ' something',
  'sequence': 'I am happy to learn something.'}]
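
The mask token depends on the checkpoint: the default `distilroberta-base` model above expects `<mask>`, while BERT checkpoints expect `[MASK]`. A minimal sketch of keeping the prompt independent of the model; `fill_template` and the `{mask}` placeholder are hypothetical conventions for this example, not library features:

```python
def fill_template(template, mask_token):
    """Hypothetical helper: swap the {mask} placeholder for a model's mask token."""
    if template.count("{mask}") != 1:
        raise ValueError("template must contain exactly one {mask} placeholder")
    return template.replace("{mask}", mask_token)


# distilroberta-base (the default above) uses "<mask>"; BERT checkpoints use "[MASK]".
roberta_prompt = fill_template("I am happy to learn {mask}.", "<mask>")
bert_prompt = fill_template("I am happy to learn {mask}.", "[MASK]")

print(roberta_prompt)
print(bert_prompt)
```

With a loaded pipeline, the token can be read from `unmasker.tokenizer.mask_token` instead of being hard-coded.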

In [26]:
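### NER
# aggregation_strategy="simple" merges sub-word tokens into whole entities;
# it replaces the deprecated grouped_entities=True argument.
ner = pipeline("ner", aggregation_strategy="simple")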
ner("Paris is the capital city of France and Mbappe is the world cup winner.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'LOC',
  'score': 0.9992811,
  'word': 'Paris',
  'start': 0,
  'end': 5},
 {'entity_group': 'LOC',
  'score': 0.9998055,
  'word': 'France',
  'start': 29,
  'end': 35},
 {'entity_group': 'PER',
  'score': 0.99758655,
  'word': 'Mbappe',
  'start': 40,
  'end': 46}]
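
The `start` and `end` values are character offsets into the input string, so each entity can be recovered by slicing. A short sketch using the entities printed above (scores are omitted and the variable names are illustrative):

```python
text = "Paris is the capital city of France and Mbappe is the world cup winner."

# Entities copied from the output above (scores omitted for brevity).
entities = [
    {"entity_group": "LOC", "word": "Paris", "start": 0, "end": 5},
    {"entity_group": "LOC", "word": "France", "start": 29, "end": 35},
    {"entity_group": "PER", "word": "Mbappe", "start": 40, "end": 46},
]

# Slice the original text with each offset pair and collect spans per entity type.
by_group = {}
for entity in entities:
    span = text[entity["start"]:entity["end"]]
    by_group.setdefault(entity["entity_group"], []).append(span)

print(by_group)
```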

In [27]:
### Q-A
qa = pipeline("question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [29]:
qa(
    question="what I like to do in holiday",
    context="I like playing in holiday"
)  # extractive QA: the answer is a span copied from the context, not generated text

{'score': 0.8474295735359192, 'start': 7, 'end': 14, 'answer': 'playing'}
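
Because the answer is extracted, it can be checked against the context and filtered by its score. A minimal sketch using the result printed above; `accept_answer` and its `min_score` threshold are hypothetical, chosen only for illustration:

```python
context = "I like playing in holiday"

# Result dictionary copied from the question-answering output above.
result = {"score": 0.8474295735359192, "start": 7, "end": 14, "answer": "playing"}


def accept_answer(result, context, min_score=0.5):
    """Hypothetical helper: keep the answer only if it matches the context span and is confident enough."""
    span = context[result["start"]:result["end"]]
    if span != result["answer"] or result["score"] < min_score:
        return None
    return span


print(accept_answer(result, context))
```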

In [31]:
### translation (English to German)

translate = pipeline("translation", model="Tanhim/translation-En2De")
translate("I am happy to learn")

[{'translation_text': 'Ich bin froh zu lernen'}]