# Transformers

Transformer models are used to solve a wide range of natural language processing (NLP) tasks.

The Hugging Face `transformers` library exposes these tasks through its `pipeline` function, including:

* feature-extraction
* fill-mask
* ner (named entity recognition)
* question-answering
* sentiment-analysis
* summarization
* text-generation
* translation
* zero-shot-classification

Many pre-trained models hosted on the [Hugging Face Hub](https://huggingface.co/models) can be used for each of these tasks.

In [1]:
import numpy as np
from transformers import pipeline



## Sentiment Analysis

In [14]:
SENTIMENT_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"

analyzer = pipeline("sentiment-analysis", 
                    model=SENTIMENT_MODEL)
analyzer("I am feeling pretty cool.")

[{'label': 'POSITIVE', 'score': 0.9998623132705688}]

In [3]:
analyzer("The weather is a bit dreary today.")

[{'label': 'NEGATIVE', 'score': 0.9995741248130798}]

In [15]:
analyzer([
    "The US is in North American.",
    "The teacher is a boy.",
    "Earth is heating up dramtically."
])

[{'label': 'POSITIVE', 'score': 0.9689047932624817},
 {'label': 'POSITIVE', 'score': 0.9889335632324219},
 {'label': 'NEGATIVE', 'score': 0.9722331166267395}]
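
Each `score` above is a probability for the winning label. As a rough sketch of where such a number could come from, assuming the classifier produces one raw output (logit) per class and a softmax turns them into probabilities. The helper name `to_prediction`, the label order, and the logit values are made up for illustration and are not taken from the model above:

```python
import math

# Assumed label order, for illustration only
LABELS_BY_INDEX = ["NEGATIVE", "POSITIVE"]

def to_prediction(logits):
    """Turn raw class logits into a {'label', 'score'} dict via softmax."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS_BY_INDEX[best], "score": probs[best]}

# Hypothetical logits, not produced by the model above
prediction = to_prediction([-4.2, 4.6])
print(prediction)
```

A large gap between the two logits gives a score close to 1, which is why short, clearly worded sentences tend to get very confident labels.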

## Zero-shot Classification

Zero-shot classification assigns labels to texts that were never labelled for training. The pipeline is called zero-shot because you don't need to fine-tune the model on your own data: it directly returns probability scores for any list of candidate labels you supply.

Source: [HF](https://huggingface.co/tasks/zero-shot-classification)
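
One common way to get these scores is to reuse a natural language inference (NLI) model: each candidate label is turned into a hypothesis sentence, the model scores how strongly the text entails it, and the entailment scores are normalized across labels. The sketch below only illustrates that idea. The template wording, the helper names, and the logit values are assumptions for illustration, and no model is actually called:

```python
import math

# Assumed hypothesis wording, for illustration
HYPOTHESIS_TEMPLATE = "This example is {}."

def build_hypotheses(labels):
    """Create one NLI hypothesis sentence per candidate label."""
    return [HYPOTHESIS_TEMPLATE.format(label) for label in labels]

def rank_labels(labels, entailment_logits):
    """Softmax the entailment logits across labels and sort best first."""
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    scored = sorted(
        zip(labels, (e / total for e in exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {
        "labels": [label for label, _ in scored],
        "scores": [score for _, score in scored],
    }

labels = ["science", "food", "travel"]
hypotheses = build_hypotheses(labels)
# Hypothetical entailment logits, one per hypothesis; a real NLI model would supply these
ranking = rank_labels(labels, [0.3, -1.2, 2.5])
print(hypotheses)
print(ranking)
```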

In [45]:
MODEL_NAME = "facebook/bart-large-mnli"
LABELS = ["science", "food", "travel", "education", "politics", "business"]
classifier = pipeline("zero-shot-classification",
                      model=MODEL_NAME)
statement = "Joe Biden's presidential campaign fund has not increased much since last year."
result = classifier(
    statement,
    candidate_labels=LABELS
)

In [46]:
result

{'sequence': "Joe Biden's presidential campaign fund has not increased much since last year.",
 'labels': ['politics', 'business', 'travel', 'science', 'food', 'education'],
 'scores': [0.9594570398330688,
  0.013086460530757904,
  0.012741496786475182,
  0.007973656058311462,
  0.003598777111619711,
  0.0031426202040165663]}

In [52]:
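def classify(query: str) -> None:
    """Print the query and its highest-scoring candidate label."""
    result = classifier(
        query,
        candidate_labels=LABELS
    )
    # Pick the index of the highest score explicitly
    best = int(np.argmax(result['scores']))
    print(f"Query: {query}")
    print(f"Classification: {result['labels'][best]}")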

In [53]:
classify(statement)

Query: Joe Biden's presidential campaign fund has not increased much since last year.
Classification: politics


In [54]:
classify("Costa Rica has some amazing beaches for a summer vacation.")

Query: Costa Rica has some amazing beaches for a summer vacation.
Classification: travel


In [55]:
classify("Chocolate fondue is a nice snack for the winter.")

Query: Chocolate fondue is a nice snack for the winter.
Classification: food
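
`classify` always prints a label, even when none of the candidates fits well. One possible variation is to return `None` when the best score is low. In the sketch below, the helper `top_label` and the `min_score` threshold are hypothetical (neither is part of the `transformers` API), and the second input uses invented scores:

```python
def top_label(result, min_score=0.5):
    """Return the best label, or None when the top score is below min_score."""
    # Find the highest-scoring (label, score) pair without assuming any ordering
    label, score = max(
        zip(result["labels"], result["scores"]),
        key=lambda pair: pair[1],
    )
    return label if score >= min_score else None

# Result recorded earlier in this notebook, with scores rounded for brevity
recorded = {
    "labels": ["politics", "business", "travel", "science", "food", "education"],
    "scores": [0.9595, 0.0131, 0.0127, 0.0080, 0.0036, 0.0031],
}
confident = top_label(recorded)

# Invented scores to show the fallback when no label clears the threshold
uncertain = top_label({"labels": ["science", "food"], "scores": [0.45, 0.55]}, min_score=0.6)
print(confident, uncertain)
```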


## Text Generation

The [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) ranks a selection of open-source large language models (LLMs) by their benchmark scores.

Source: [HF](https://huggingface.co/tasks/text-generation)

In [24]:
# MODEL_NAME = "distilgpt2"
# MODEL_NAME = "bigscience/bloom-560m"
MODEL_NAME = "gpt2-medium"
generator = pipeline("text-generation", model=MODEL_NAME)


In [26]:
prompt = "This weekend, I plan to "
responses = generator(prompt,
                      max_length=128,
                      num_return_sequences=2)
responses

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "This weekend, I plan to \xa0join some friends over to Boston for some delicious dinner.\xa0 We usually don't do as much shopping because we tend to try to avoid spending much money and trying to get everything done. This week, I'm more relaxed. I don't really have a lot of money either due to not having much left to spend.\xa0 I don't want to try and\xa0 make it all work to get things done if that means\xa0 starting an entirely new blog. However, as I've\xa0 gotten older it's become a lot easier to donen\xa0 with nothing to spend and\xa0 taking less money since it"},
 {'generated_text': 'This weekend, I plan to \xa0share my experiences, findings and solutions with you. The idea is to offer a bit of everything (including all of the previous ones).\nPosted by\xa0 J.J. at 7:58 AM'}]

In [39]:
import re
from pprint import pprint

def printGeneratedText(text: str) -> None:
    # Collapse whitespace runs (including non-breaking spaces) into single spaces
    output = re.sub(r'\s+', ' ', text)
    pprint(f"=> {output}")

In [28]:
for r in responses:
    printGeneratedText(r['generated_text'])

('=> This weekend, I plan to join some friends over to Boston for some '
 "delicious dinner. We usually don't do as much shopping because we tend to "
 'try to avoid spending much money and trying to get everything done. This '
 "week, I'm more relaxed. I don't really have a lot of money either due to not "
 "having much left to spend. I don't want to try and make it all work to get "
 "things done if that means starting an entirely new blog. However, as I've "
 "gotten older it's become a lot easier to donen with nothing to spend and "
 'taking less money since it')
('=> This weekend, I plan to share my experiences, findings and solutions with '
 'you. The idea is to offer a bit of everything (including all of the previous '
 'ones). Posted by J.J. at 7:58 AM')


## Mask Filling

This task fills in the blanks (marked with a mask token such as `<mask>`) in a given text. The `top_k` argument controls how many candidate fillings are returned.
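
As a rough illustration of what `top_k` does, the sketch below keeps the `k` highest-scoring candidates from a list of (token, score) pairs. The helper `keep_top_k` and the candidate scores are hypothetical. A real model scores every token in its vocabulary for the masked position:

```python
import heapq

def keep_top_k(candidates, k):
    """Return the k (token, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Hypothetical candidate scores for a single masked position
candidates = [("old", 0.031), ("huge", 0.069), ("new", 0.024), ("large", 0.046)]
top_two = keep_top_k(candidates, k=2)
print(top_two)
```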

In [16]:
MODEL_NAME = "distilroberta-base"
unmasker = pipeline("fill-mask",
                    model=MODEL_NAME)

In [18]:
unmasker("The building is <mask> and very expensive.", top_k=2)

[{'score': 0.06936365365982056,
  'token': 1307,
  'token_str': ' huge',
  'sequence': 'The building is huge and very expensive.'},
 {'score': 0.046456728130578995,
  'token': 739,
  'token_str': ' large',
  'sequence': 'The building is large and very expensive.'}]

In [20]:
unmasker("Because this doctor is very capable, <mask> can easily do this operation successfully.", top_k=2)

[{'score': 0.3738466203212738,
  'token': 37,
  'token_str': ' he',
  'sequence': 'Because this doctor is very capable, he can easily do this operation successfully.'},
 {'score': 0.1665724217891693,
  'token': 25705,
  'token_str': ' surgeons',
  'sequence': 'Because this doctor is very capable, surgeons can easily do this operation successfully.'}]

In [21]:
unmasker("Because this nurse is very capable, <mask> can easily take care of this patient.", top_k=2)

[{'score': 0.4550214409828186,
  'token': 79,
  'token_str': ' she',
  'sequence': 'Because this nurse is very capable, she can easily take care of this patient.'},
 {'score': 0.09426577389240265,
  'token': 52,
  'token_str': ' we',
  'sequence': 'Because this nurse is very capable, we can easily take care of this patient.'}]

## Named Entity Recognition (NER)

Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.

Passing `grouped_entities=True` when creating the pipeline tells it to merge the parts of the sentence that belong to the same entity into a single result (recent `transformers` releases express this as `aggregation_strategy="simple"`):

In [26]:
MODEL_NAME = "dbmdz/bert-large-cased-finetuned-conll03-english"
ner = pipeline("ner", 
               model=MODEL_NAME,
               grouped_entities=True)



In [25]:
statement = "James Cameron, who directed the hit 1997 film Titanic and has made 33 dives to the wreckage, said he saw some similarities between the Titan tragedy and the sinking of the famous ship it was bound for."
ner(statement)

[{'entity_group': 'PER',
  'score': 0.9992048,
  'word': 'James Cameron',
  'start': 0,
  'end': 13},
 {'entity_group': 'MISC',
  'score': 0.99105597,
  'word': 'Titanic',
  'start': 46,
  'end': 53},
 {'entity_group': 'MISC',
  'score': 0.74788696,
  'word': 'Titan',
  'start': 135,
  'end': 140}]
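
To see what the regrouping does conceptually, here is a minimal sketch that merges consecutive tokens carrying the same entity type. It assumes word-level tokens with `B-`/`I-`/`O` style tags. The helper `group_entities` and the tagged tokens are hypothetical, and the real pipeline also handles sub-word pieces, scores, and character offsets:

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens that carry the same entity type."""
    groups = []
    previous_type = None
    for word, tag in tagged_tokens:
        if tag == "O":  # token is outside any entity
            previous_type = None
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "I" and entity_type == previous_type:
            # Continuation of the current entity: extend the last group
            groups[-1]["word"] += " " + word
        else:
            # A B- tag or a change of type starts a new group
            groups.append({"entity_group": entity_type, "word": word})
        previous_type = entity_type
    return groups

# Hypothetical token-level tags for part of the sentence above
tagged = [
    ("James", "B-PER"),
    ("Cameron", "I-PER"),
    ("directed", "O"),
    ("Titanic", "B-MISC"),
]
grouped = group_entities(tagged)
print(grouped)
```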

## Question Answering (QA)

The question-answering pipeline answers questions using information from a given context.

In [31]:
MODEL_NAME = "distilbert-base-cased-distilled-squad"
qa = pipeline("question-answering",
              model=MODEL_NAME)

In [32]:
qa(
    question="What did James Cameron say?",
    context=statement
)

{'score': 0.09470159560441971,
 'start': 98,
 'end': 183,
 'answer': 'he saw some similarities between the Titan tragedy and the sinking of the famous ship'}
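
The answer is extracted from the context rather than generated: `start` and `end` are character offsets into the context string. A small check, using only the values recorded above, shows that slicing the context with those offsets recovers the answer text:

```python
# Context and result copied from the cells above
context = (
    "James Cameron, who directed the hit 1997 film Titanic and has made 33 dives "
    "to the wreckage, said he saw some similarities between the Titan tragedy and "
    "the sinking of the famous ship it was bound for."
)
recorded = {
    "start": 98,
    "end": 183,
    "answer": "he saw some similarities between the Titan tragedy and the sinking of the famous ship",
}

# Slice the context with the reported character offsets
span = context[recorded["start"]:recorded["end"]]
print(span == recorded["answer"])
```

The same offsets can be used to highlight the answer in the original text.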

## Summarization

Summarization condenses a text into a shorter version while keeping all (or most) of its important points.

In [33]:
news = """
Catastrophic implosion: 

The Titanic-bound submersible that went missing on 
Sunday with five people on board suffered a 
“catastrophic implosion,” killing everyone on board, 
US Coast Guard Rear Adm. John Mauger said Thursday. 
A remotely operated vehicle found the tail cone of the 
Titan about 1,600 feet away from the bow of the shipwreck, 
Mauger said.

Who was on board: Tour company OceanGate Expeditions 
said the five passengers were Hamish Harding, 
Shahzada Dawood and his son Suleman Dawood, 
Paul-Henri Nargeolet and OceanGate CEO Stockton Rush.

About the trip: The Titan began its descent Sunday to 
explore the wreckage of the Titanic, located about 
13,000 feet below sea level in the North Atlantic Ocean.
"""

In [35]:
MODEL_NAME = "sshleifer/distilbart-cnn-12-6"
summarizer = pipeline("summarization",
                      model=MODEL_NAME)

In [41]:
summary = summarizer(news)
summary

[{'summary_text': ' The tail cone of the Titanic-bound submersible was found about 1,600 feet away from the bow of the shipwreck . It suffered a “catastrophic implosion” killing everyone on board, US Coast Guard Rear Adm. John Mauger says .'}]

In [42]:
printGeneratedText(summary[0]['summary_text'])

('=>  The tail cone of the Titanic-bound submersible was found about 1,600 '
 'feet away from the bow of the shipwreck . It suffered a “catastrophic '
 'implosion” killing everyone on board, US Coast Guard Rear Adm. John Mauger '
 'says .')


## Translation

For translation, you can use a default model if you provide a language pair in the task name (such as `"translation_en_to_fr"`), but the easiest way is to pick the model you want to use on the [Model Hub](https://huggingface.co/models). Here we'll start by translating from French to English:

In [43]:
MODEL_NAME = "Helsinki-NLP/opus-mt-fr-en"

fr2en_translator = pipeline("translation",
                            model=MODEL_NAME)


In [45]:
fr2en_translator("comment allez-vous?")

[{'translation_text': 'How are you?'}]

In [46]:
fr2en_translator("Je trouve ce nouveau domaine de l'intelligence artificielle très intéressant.")

[{'translation_text': 'I find this new field of artificial intelligence very interesting.'}]

In [47]:
MODEL_NAME = "Helsinki-NLP/opus-mt-en-fr"

en2fr_translator = pipeline("translation",
                            model=MODEL_NAME)


In [48]:
en2fr_translator("What are you doing this weekend")

[{'translation_text': "Qu'est-ce que tu fais ce week-end ?"}]

In [49]:
en2fr_translator("The latest Transformer movie was a marvel to watch!")

[{'translation_text': 'Le dernier film Transformer a été une merveille à regarder!'}]
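
Both models used above follow the naming pattern `Helsinki-NLP/opus-mt-<source>-<target>`, and the task-name form mentioned earlier encodes the same language pair. A small sketch of going from one to the other, assuming hypothetical helpers `parse_translation_task` and `opus_mt_model_name`. Not every language pair necessarily has a published model, so check the Model Hub before relying on a generated name:

```python
def parse_translation_task(task):
    """Split a task name such as 'translation_en_to_fr' into ('en', 'fr')."""
    parts = task.split("_")
    if len(parts) != 4 or parts[0] != "translation" or parts[2] != "to":
        raise ValueError(f"Unexpected task name: {task!r}")
    return parts[1], parts[3]

def opus_mt_model_name(source_lang, target_lang):
    """Build a model id following the naming pattern of the two models used above."""
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"

source_lang, target_lang = parse_translation_task("translation_en_to_fr")
model_name = opus_mt_model_name(source_lang, target_lang)
print(model_name)
```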

## Code Generation

Hugging Face hosts a [StarChat Playground](https://huggingface.co/spaces/HuggingFaceH4/starchat-playground). The base model has 16B parameters and was pretrained on one trillion tokens sourced from 80+ programming languages, GitHub issues, Git commits, and Jupyter notebooks (all permissively licensed). The cells below load the `bigcode/starcoderplus` checkpoint locally instead of using the playground.

In [32]:
# Logging in to Hugging Face is required, with a token from
# https://huggingface.co/settings/tokens:
#
# > huggingface-cli login

MODEL_NAME = "bigcode/starcoderplus"
device = "cpu"  # use "cuda" to run on a GPU

In [None]:
# WARNING: this downloads roughly 60 GB of model files and will take a long time.
# Alternatively, run the hosted version at https://huggingface.co/bigcode/starcoderplus
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)

In [None]:
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))