# TRANSFORMER MODELS

# Natural Language Processing

Common NLP tasks:
* **Classifying whole sentences**: Getting the sentiment of a review, detecting if an email is spam, determining if a sentence is grammatically correct or whether two sentences are logically related or not.
* **Classifying each word in a sentence**: Identifying the grammatical components of a sentence (noun, verb, adjective), or the named entities (person, location, organization).
* **Generating text content**: Completing a prompt with auto-generated text, filling in the blanks in a text with masked words.
* **Extracting an answer from a text**: Given a question and a context, extracting the answer to the question based on the information provided in the context.
* **Generating a new sentence from an input text**: Translating a text into another language, summarizing a text.

# Transformers, what can they do?

Install the required packages for the Colab environment:

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

## Working with pipelines

The most basic object is the `pipeline()` function. It connects a model with its necessary preprocessing and postprocessing steps:

In [None]:
from transformers import pipeline

classifier = pipeline('sentiment-analysis')

classifier("I've been waiting for a HuggingFace course my whole life.")

To classify several sentences at once, pass them as a list:

In [3]:
classifier([
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!"
])

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when we create the `classifier` object.

Passing text to a pipeline involves three main steps:
1. The text is preprocessed into a format the model can understand.
2. The preprocessed inputs are passed to the model.
3. The predictions of the model are post-processed, so we can make sense of them.
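
The third step can be sketched without any library. The snippet below is a minimal illustration, assuming a two-class sentiment model that emits one raw score (logit) per label. The logit values, the label order, and the `postprocess` helper are hypothetical and are not part of `transformers`.

```python
import math

# Hypothetical label order for a two-class sentiment model.
LABELS = ["NEGATIVE", "POSITIVE"]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits):
    """Pick the most likely label and report its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


# Made-up logits standing in for the model output of step 2.
print(postprocess([-1.5, 1.6]))
```

The returned dictionary has the same `label` and `score` keys as the classifier output shown earlier, which is why the pipeline result is readable without further work.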

## Zero-shot classification

The `zero-shot-classification` pipeline allows us to specify which labels to use for the classification, so we do not have to rely on the labels of the pretrained model.

In [None]:
classifier = pipeline('zero-shot-classification')

classifier(
    'This is a course about the Transformers library',
    candidate_labels=['education', 'politics', 'business'],
)

This pipeline is `zero-shot` because we do not need to fine-tune the model on our data to use it. It can directly return probability scores for any list of labels we want.
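
The result is a dictionary holding the input text plus a list of labels and a list of scores in matching order. The sketch below shows one way to rank them. The score values are made up for illustration and do not come from a real run, and `rank_labels` is a hypothetical helper.

```python
# Illustrative result in the shape the zero-shot pipeline returns.
# The scores are invented for this example.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.84, 0.11, 0.05],
}


def rank_labels(result):
    """Pair each label with its score, highest score first."""
    pairs = zip(result["labels"], result["scores"])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranked = rank_labels(result)
best_label, best_score = ranked[0]
print(f"{best_label}: {best_score:.2f}")
```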

## Text generation

The main idea of the `text-generation` pipeline is that we provide a prompt and the model will auto-complete it by generating the remaining text.

In [None]:
generator = pipeline('text-generation')

generator('In this course, we will teach you how to')

In [6]:
generator('In this course, we will teach you how to',
          num_return_sequences=2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to understand an application code that is being read from a command line and then how to execute those commands. This is an easy and quick way to develop any command line program with easy code completion, but this'},
 {'generated_text': "In this course, we will teach you how to build a simple web app.\n\nIt won't really need much learning to get a better grasp of development rules - but it will allow you to focus on your code.\n\nYou will also"}]

In [7]:
generator('In this course, we will teach you how to',
          max_length=15)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to develop an effective approach to'}]
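
Each `generated_text` in the outputs above begins with the prompt itself. If only the continuation is needed, it can be separated afterwards. The following is a small sketch with a hypothetical `continuations` helper, applied to the output of the `max_length=15` run.

```python
prompt = "In this course, we will teach you how to"

# Output copied from the max_length=15 run above.
outputs = [
    {"generated_text": "In this course, we will teach you how to develop an effective approach to"}
]


def continuations(prompt, outputs):
    """Return only the newly generated text for each sequence."""
    texts = []
    for item in outputs:
        text = item["generated_text"]
        # Drop the prompt when the generated text starts with it.
        if text.startswith(prompt):
            text = text[len(prompt):]
        texts.append(text.strip())
    return texts


print(continuations(prompt, outputs))
```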

## Using any model from the Hub in a pipeline

You can choose a particular model from the Hub to use in a pipeline for a specific task.

For example, try `distilgpt2`:

In [None]:
generator = pipeline('text-generation',
                     model='distilgpt2')

In [9]:
generator(
    'In this course, we will teach you how to',
    max_length=30,
    num_return_sequences=3,
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to make your own custom custom "Nights" "Thirsty Thinks" and how to get'},
 {'generated_text': 'In this course, we will teach you how to create and develop a portfolio or practice a simple application. The course was created for the web developer.'},
 {'generated_text': 'In this course, we will teach you how to apply the rules of physics to the mathematical game, including how we use the physics of the game to'}]

## Mask filling

The idea of the `fill-mask` pipeline is to fill in the blanks in a given text:

In [None]:
unmasker = pipeline('fill-mask')

In [11]:
unmasker('This course will teach us all about <mask> models.', top_k=3)

[{'score': 0.189777210354805,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach us all about mathematical models.'},
 {'score': 0.0465598925948143,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach us all about computational models.'},
 {'score': 0.04100976511836052,
  'token': 27930,
  'token_str': ' predictive',
  'sequence': 'This course will teach us all about predictive models.'}]

The `top_k` argument controls how many candidates are returned. The special `<mask>` word is a *mask token*. Other mask-filling models may use a different mask token.
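
Because the mask token depends on the model, it can help to keep the prompt in a neutral form and insert the token at the last moment. The sketch below assumes a hypothetical `build_masked_prompt` helper and a `{mask}` placeholder chosen for this example. The two tokens shown are the ones that appear in this notebook.

```python
def build_masked_prompt(template, mask_token):
    """Replace the neutral {mask} placeholder with a model-specific mask token."""
    return template.format(mask=mask_token)


template = "This course will teach us all about {mask} models."

# Mask tokens used in this notebook.
print(build_masked_prompt(template, "<mask>"))  # default fill-mask model
print(build_masked_prompt(template, "[MASK]"))  # bert-base-uncased
```

With a loaded pipeline, the token should be available as `unmasker.tokenizer.mask_token`, so it does not have to be hard-coded.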

## Named entity recognition

Named entity recognition (`ner`) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.

In [None]:
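# `aggregation_strategy='simple'` groups sub-word tokens into whole entities;
# it replaces the deprecated `grouped_entities=True` argument.
ner = pipeline('ner',
               aggregation_strategy='simple')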

In [13]:
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

The model identified that Sylvain is a person (PER), Hugging Face an organization (ORG), and Brooklyn a location (LOC).
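
The grouped output is a plain list of dictionaries, so it is easy to reorganize. The sketch below collects the entities by type using the result shown above. `group_by_type` is a hypothetical helper, and the scores are omitted for brevity.

```python
from collections import defaultdict

text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Entities copied from the output above (scores omitted).
entities = [
    {"entity_group": "PER", "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "word": "Brooklyn", "start": 49, "end": 57},
]


def group_by_type(text, entities):
    """Map each entity type to the text spans found for it."""
    grouped = defaultdict(list)
    for ent in entities:
        # `start` and `end` are character offsets into the input text.
        grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


print(group_by_type(text, entities))
```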

## Question answering

The `question-answering` pipeline answers questions using information from a given context:

In [None]:
question_answerer = pipeline('question-answering')

In [15]:
question_answerer(
    question='Where do I work?',
    context='My name is Sylvain and I work at Hugging Face in Brooklyn',
)

{'score': 0.6949766278266907, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

This pipeline works by extracting information from the provided context; it does NOT generate the answer.
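
The extractive behavior can be checked directly: `start` and `end` are character offsets into the context, so slicing the context with them should reproduce the answer. A minimal check, using the result shown above:

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"

# Result copied from the output above.
result = {"score": 0.6949766278266907, "start": 33, "end": 45, "answer": "Hugging Face"}

# Slice the context with the reported character offsets.
span = context[result["start"]:result["end"]]
print(span == result["answer"])
```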

## Summarization

The `summarization` pipeline reduces a text into a shorter text while keeping all (or most) of the important aspects referenced in the text.

In [None]:
summarizer = pipeline('summarization')

In [17]:
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

## Translation

In [None]:
translator = pipeline('translation',
                      model='Helsinki-NLP/opus-mt-zh-en')

In [20]:
translator("这节课由Hugging Face主导。")

[{'translation_text': 'The lesson is run by Huging Face.'}]

# Bias and Limitations

In [None]:
unmasker = pipeline('fill-mask',
                    model='bert-base-uncased')

In [25]:
result = unmasker('This man works as a [MASK].')
print([r['token_str'] for r in result])

['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']


In [26]:
result = unmasker('This woman works as a [MASK].')
print([r['token_str'] for r in result])

['nurse', 'maid', 'teacher', 'waitress', 'prostitute']
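
One simple way to compare the two outputs is to look at which suggestions they share. The sketch below uses the two lists printed above. The variable names are chosen for this example.

```python
# Suggestions copied from the two outputs above.
man_jobs = ['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
woman_jobs = ['nurse', 'maid', 'teacher', 'waitress', 'prostitute']

# Occupations suggested for both prompts, and for only one of them.
shared = sorted(set(man_jobs) & set(woman_jobs))
only_man = sorted(set(man_jobs) - set(woman_jobs))
only_woman = sorted(set(woman_jobs) - set(man_jobs))

print("shared:", shared)
print("only 'man' prompt:", only_man)
print("only 'woman' prompt:", only_woman)
```

The two lists have no occupation in common, even though the prompts differ by a single word.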
