# NLP applications
---

There are two main approaches to NLP: the traditional rule-based approach and the machine learning/deep learning approach. The `transformers` library follows the deep learning approach. It is built around the Transformer, a type of neural network architecture first introduced in the paper [Attention is All You Need](https://paperswithcode.com/paper/attention-is-all-you-need) by Vaswani et al. in 2017.

`transformers` can be used to perform:
* sentiment analysis
* text classification
* named entity recognition (NER)
* text summarization
* question answering
* text generation
* mask filling
* translation

In [1]:
!pip install --quiet transformers

In [None]:
pip install tensorflow

In [None]:
pip install tf-keras

In [None]:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu


In [None]:
pip install --upgrade transformers torch


In [1]:
from transformers import pipeline
import pandas as pd

## Sentiment analysis

---

The most basic object in the `transformers` library is the `pipeline()` function. It connects a model with its necessary preprocessing and postprocessing steps, allowing you to directly input any text and get an intelligible answer.

You will need to instantiate a pipeline by calling the `pipeline()` function and providing the name of the task you are interested in. Here we use `sentiment-analysis`.
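To make the three stages concrete, here is a minimal sketch of the preprocess, model, postprocess chain in plain Python. Everything in it is a hypothetical stand-in: `toy_pipeline`, the `TOY_WEIGHTS` word list, and the scoring rule are invented for illustration and are not how the `transformers` library or any real model works internally.

```python
import math

# Hypothetical word weights standing in for a trained model (illustration only).
TOY_WEIGHTS = {"excited": 2.0, "love": 2.5, "hate": -2.5, "boring": -1.5}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Preprocessing: lowercase the text and split it into word tokens."""
    return [token.strip(".,!?") for token in text.lower().split()]


def toy_model(tokens):
    """Stand-in model: return raw scores (logits) for [NEGATIVE, POSITIVE]."""
    total = sum(TOY_WEIGHTS.get(token, 0.0) for token in tokens)
    return [-total, total]


def postprocess(logits):
    """Postprocessing: turn logits into probabilities and pick the best label."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    probs = [value / sum(exps) for value in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three stages, mirroring what a pipeline object does for you."""
    return postprocess(toy_model(preprocess(text)))


result = toy_pipeline("I am so excited to start learning NLP.")
print(result["label"])
```

The real `pipeline()` hides the same kind of chain behind a single call, which is why you can pass raw text in and get a labelled score back.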


In [None]:
from transformers import pipeline
import torch

# Detect available device
device = 0 if torch.cuda.is_available() else -1

# Sentiment analysis pipeline
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased", device=device)

# Test input
result = classifier("I am so excited to start learning NLP.")
print(result)


config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

[{'label': 'LABEL_0', 'score': 0.5379720330238342}]


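The first time you run this code, a few progress bars appear while the requested model is downloaded from the [Hugging Face Hub](https://huggingface.co/models). The next time you instantiate the pipeline, the library notices that the model is already downloaded and uses the cached version instead.

Note the warning in the output above: `distilbert-base-uncased` is a base checkpoint without a trained classification head, so the head is newly initialized and the prediction (`LABEL_0` with a score close to 0.5) is not meaningful. For real sentiment predictions, use a checkpoint fine-tuned for the task, such as `distilbert-base-uncased-finetuned-sst-2-english`, which is the default the pipeline selects when no model is supplied.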

You can even pass several sentences to the `pipeline()`!

In [None]:
# pass several texts as a list
text = ["I love to learn about Natural Language Processing.",
        "I hate boring lessons!"]

classifier = pipeline("sentiment-analysis")
output = classifier(text)
pd.DataFrame(output)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


| | label | score |
|---|---|---|
| 0 | POSITIVE | 0.999813 |
| 1 | NEGATIVE | 0.996406 |


Other than `sentiment-analysis`, some of the [available pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines) are:
* `zero-shot-classification`
* `ner` (named entity recognition)
* `summarization`
* `question-answering`
* `text-generation`
* `fill-mask`
* `translation`



## Text classification
---

Often, you will need to classify texts that have not been labelled. This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise.

For this use case, the `zero-shot-classification` pipeline is very powerful: it allows you to specify which labels to use for the classification, so you don't have to rely on the labels of the pretrained model.

In [None]:
text = "This is a course on Natural Language Processing"

classifier = pipeline("zero-shot-classification")
output = classifier(text, candidate_labels=["education", "politics", "sports"])
pd.DataFrame(output)


No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


| | sequence | labels | scores |
|---|---|---|---|
| 0 | This is a course on Natural Language Processing | education | 0.923464 |
| 1 | This is a course on Natural Language Processing | sports | 0.048273 |
| 2 | This is a course on Natural Language Processing | politics | 0.028263 |


This pipeline is called *zero-shot* because you do not need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want!
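The scores returned for the candidate labels behave like a probability distribution: in the output above they sum to 1. The sketch below shows one way raw per-label scores can be normalised like this, using a softmax. The `toy_logits` values are made up for illustration and do not come from the model, and the real pipeline's internals may differ.

```python
import math


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(values)
    exps = [math.exp(value - highest) for value in values]
    total = sum(exps)
    return [value / total for value in exps]


candidate_labels = ["education", "politics", "sports"]
# Hypothetical raw scores, one per candidate label (not real model output).
toy_logits = [3.0, -0.5, 0.1]

scores = softmax(toy_logits)
# Sort labels from most to least likely, as the pipeline output does.
ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")

# The scores reported in the output above also sum to approximately 1.
reported = [0.923464, 0.048273, 0.028263]
print(round(sum(reported), 6))
```

Because the scores are normalised across the labels you supply, adding or removing a candidate label changes every score, not just one.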

---


## Named entity recognition (NER)
---

Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons (PER), locations (LOC), or organizations (ORG).

In [None]:
text = """SpaceX is an aerospace manufacturer and space transport services
    company headquartered in California. It was founded in 2002 by entrepreneur
    and investor Elon Musk with the goal of reducing space transportation costs
    and enabling the colonization of Mars."""

ner_tagger = pipeline("ner", aggregation_strategy="simple")
output = ner_tagger(text)
pd.DataFrame(output)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.999039 | SpaceX | 0 | 6 |
| 1 | LOC | 0.999038 | California | 95 | 105 |
| 2 | PER | 0.998207 | Elon Musk | 164 | 173 |
| 3 | LOC | 0.995641 | Mars | 265 | 269 |


Here, the model correctly identified that SpaceX is an organization (ORG), California a location (LOC) and Elon Musk a person (PER).

Pass the option `aggregation_strategy="simple"` when creating the pipeline to tell it to group together the parts of the sentence that correspond to the same entity. Here the model correctly grouped “Elon” and “Musk” as a single person, even though the name consists of multiple words.
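The grouping idea can be illustrated with a simplified sketch that merges consecutive token-level tags into entity spans. The `group_entities` helper and the tagged tokens are hypothetical and written only for this illustration. The library's own aggregation also handles sub-word pieces and scores, which this sketch ignores.

```python
def group_entities(token_predictions):
    """Merge consecutive tokens that belong to the same entity into one span."""
    groups = []
    previous = None  # entity type of the previous token, or None if it was not an entity
    for word, tag in token_predictions:
        if tag == "O":  # "O" marks a token outside any entity
            previous = None
            continue
        prefix, entity = tag.split("-", 1)
        if prefix == "I" and previous == entity:
            # An I- tag continues the entity started by the previous token.
            groups[-1]["word"] += " " + word
        else:
            groups.append({"entity_group": entity, "word": word})
        previous = entity
    return groups


# Hypothetical token-level predictions (not real model output).
tokens = [
    ("SpaceX", "B-ORG"), ("was", "O"), ("founded", "O"), ("by", "O"),
    ("Elon", "B-PER"), ("Musk", "I-PER"), ("in", "O"), ("California", "B-LOC"),
]

grouped = group_entities(tokens)
for entity in grouped:
    print(entity["entity_group"], entity["word"])
```

Without this grouping step, "Elon" and "Musk" would come back as two separate PER entries.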

---


## Text summarization
---

Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text. Here is an example:

In [None]:
text = """America has changed dramatically during recent years. Not only has
    the number of graduates in traditional engineering disciplines such as
    mechanical, civil, electrical, chemical, and aeronautical engineering
    declined, but in most of the premier American universities engineering
    curricula now concentrate on and encourage largely the study of engineering
    science. As a result, there are declining offerings in engineering subjects
    dealing with infrastructure, the environment, and related issues, and
    greater concentration on high technology subjects, largely supporting
    increasingly complex scientific developments. While the latter is important,
    it should not be at the expense of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering
    graduates and a lack of well-educated engineers."""

summarizer = pipeline("summarization")
summarizer(text, clean_up_tokenization_spaces=True)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' America has changed dramatically during recent years. The number of graduates in traditional engineering disciplines such as mechanical, civil, electrical, chemical, and aeronautical engineering has declined. Rapidly developing economies such as China and India, as well as other industrial countries in Europe and Asia, continue to encourage and advance the teaching of engineering.'}]

You can also specify a `max_length` or a `min_length` for the result.

In [None]:
summarizer(text, clean_up_tokenization_spaces=True, min_length=100)

[{'summary_text': ' America has changed dramatically during recent years. The number of graduates in traditional engineering disciplines such as mechanical, civil, electrical, chemical, and aeronautical engineering has declined. Rapidly developing economies such as China and India, as well as other industrial countries in Europe and Asia, continue to encourage and advance the teaching of engineering. There are declining offerings in engineering subjects dealing with infrastructure, the environment, and related issues, and greater concentration on high technology subjects. While the latter is important, it should not be at the expense of more traditional engineering.'}]

---



## Question answering
---
The `question-answering` pipeline answers questions using information from a given context.

In [None]:
question_answerer = pipeline("question-answering")

question = "What does the customer want?"
text = """Dear Amazon, last week I ordered an Optimus Prime action figure
    from your online store in Germany. Unfortunately, when I opened the package,
    I discovered to my horror that I had been sent an action figure of Megatron
    instead! As a lifelong enemy of the Decepticons, I hope you can understand
    my dilemma. To resolve the issue, I demand an exchange of Megatron for the
    Optimus Prime figure I ordered. Enclosed are copies of my records concerning
    this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

outputs = question_answerer(question=question, context=text)
pd.DataFrame([outputs])

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.631292 | 352 | 375 | an exchange of Megatron |
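The `start` and `end` values in the result are character offsets into the context, so slicing the context with them recovers the answer text. Here is a minimal sketch using a short made-up context. The `result` dictionary is hypothetical: it has the same shape as the pipeline output, but its offsets are computed with `str.find` rather than predicted by a model.

```python
context = "I ordered an Optimus Prime figure but received Megatron instead."
answer = "Megatron"

# Hypothetical result with the same keys as the pipeline output (score omitted);
# the offsets are computed here instead of being predicted by a model.
start = context.find(answer)
result = {"start": start, "end": start + len(answer), "answer": answer}

# Slicing the context with the offsets recovers the answer span.
recovered = context[result["start"]:result["end"]]
print(recovered)
```

The offsets are handy when you want to highlight the answer inside the original text rather than just print it.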


---


## Text generation
---

You can also use a pipeline to generate text. You provide a prompt and the model auto-completes the remaining text, similar to the predictive text feature on phones. Text generation involves randomness, so you will get different results every time you execute the code.
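The randomness comes from sampling: at each step the next word is drawn from a probability distribution instead of always taking the most likely one. The sketch below imitates this with the standard library. The candidate words and their probabilities are made up for illustration and do not come from any model.

```python
import random

# Hypothetical next-word candidates and probabilities (not real model output).
candidates = ["will", "can", "should", "might"]
probabilities = [0.5, 0.25, 0.15, 0.10]


def sample_words(seed, n=5):
    """Draw n words at random, weighted by probability, from a seeded generator."""
    rng = random.Random(seed)
    return [rng.choices(candidates, weights=probabilities, k=1)[0] for _ in range(n)]


# The same seed reproduces the same draws; a different seed may not.
run_a = sample_words(seed=42)
run_b = sample_words(seed=42)
print(run_a == run_b)
```

Fixing the seed is the usual way to make sampled output repeatable. The `transformers` library provides a `set_seed` helper for the same purpose.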

In [None]:
generator = pipeline("text-generation")

generator("In this course, you will learn about Natural Language Processing. \
You")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "In this course, you will learn about Natural Language Processing. You will develop an effective way of processing text from words, words that don't need context, and text in words, words that need context. You will use Natural Language Processing to create visual"}]

---


In [None]:
# insert your code here


You can control how many different sequences are generated with the argument `num_return_sequences` and the total length of the output text with the argument `max_length`.

---
**Use the `num_return_sequences` and `max_length` arguments to generate three sequences of up to 100 tokens each.**

In [None]:
generator = pipeline("text-generation")

# Generate three alternative continuations, each capped at 100 tokens
generator(
    "In this course, you will learn about Natural Language Processing. You",
    num_return_sequences=3,
    max_length=100
)

<details>
<summary><font color="red">Click to show solution</font></summary>
    
```python
generator = pipeline("text-generation")

generator(
    "In this course, you will learn about Natural Language Processing. You",
    num_return_sequences=3,
    max_length=100
)
```
</details>

### Using any model from the Hub in a `pipeline`
---

The previous examples used the default model for each task, but you can also choose a particular model from the Hub to use in a `pipeline` for a specific task, such as text generation. Go to the [Model Hub](https://huggingface.co/models) and click on the corresponding tag on the left to display only the supported models for that task. You should get to a page like [this one](https://huggingface.co/models?pipeline_tag=text-generation).

The following code uses the [distilgpt2](https://huggingface.co/distilgpt2) model! Here is how to load it in the same pipeline as before:

In [None]:
generator = pipeline("text-generation", model="distilgpt2")

generator(
    "In this course, you will learn about Natural Language Processing. You",
    max_length=30,
    num_return_sequences=2
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, you will learn about Natural Language Processing. You will learn how to read and write code from top to bottom. You also will learn'},
 {'generated_text': 'In this course, you will learn about Natural Language Processing. You will learn at EASN.com and at the EASN Technical Center for'}]

## Mask filling
---

The next pipeline you will try is `fill-mask`. The idea of this task is to fill in the blanks in a given text:

In [None]:
unmasker = pipeline("fill-mask")

unmasker("Singapore is a sovereign <mask> in Southeast Asia.", top_k=3)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'score': 0.35618600249290466,
  'token': 247,
  'token_str': ' country',
  'sequence': 'Singapore is a sovereign country in Southeast Asia.'},
 {'score': 0.1979658454656601,
  'token': 1226,
  'token_str': ' nation',
  'sequence': 'Singapore is a sovereign nation in Southeast Asia.'},
 {'score': 0.0949191302061081,
  'token': 194,
  'token_str': ' state',
  'sequence': 'Singapore is a sovereign state in Southeast Asia.'}]

The `top_k` argument controls how many possibilities you want to be displayed. Note that here the model fills in the special `<mask>` word, which is often referred to as a *mask token*.
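Selecting the top `k` candidates amounts to ranking every candidate token by score and keeping the best few. A minimal sketch with the standard library follows. The first three scores are rounded from the output above, while the last two entries are hypothetical extras added only so that there is something to discard.

```python
import heapq

candidate_scores = {
    "country": 0.3562,  # rounded from the output above
    "nation": 0.1980,   # rounded from the output above
    "state": 0.0949,    # rounded from the output above
    "city": 0.0500,     # hypothetical extra candidate
    "island": 0.0300,   # hypothetical extra candidate
}


def top_k(scores, k):
    """Return the k highest-scoring (token, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


best_three = top_k(candidate_scores, 3)
for token, score in best_three:
    print(f"{token}: {score:.4f}")
```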

---


## Translation
---

You can use a default model by providing a language pair in the task name (such as `translation_en_to_fr`), but the easiest way is to pick the model you want to use on the [Model Hub](https://huggingface.co/models).
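The task name encodes the language pair, and many translation models on the Hub encode it in their names too. The sketch below splits a task name into source and target codes and builds a model name in the `Helsinki-NLP/opus-mt-<src>-<tgt>` style. Both helper functions are hypothetical, and the sketch assumes this naming pattern holds for the pair you want. A model is not guaranteed to exist for every pair, so check the Hub before relying on a generated name.

```python
def parse_translation_task(task):
    """Split a task name such as 'translation_en_to_fr' into ('en', 'fr')."""
    prefix = "translation_"
    if not task.startswith(prefix):
        raise ValueError(f"not a translation task name: {task!r}")
    source, separator, target = task[len(prefix):].partition("_to_")
    if not separator or not source or not target:
        raise ValueError(f"missing language pair in task name: {task!r}")
    return source, target


def opus_mt_model_name(source, target):
    """Build an OPUS-MT style model name; existence on the Hub is not guaranteed."""
    return f"Helsinki-NLP/opus-mt-{source}-{target}"


source, target = parse_translation_task("translation_en_to_fr")
print(source, target)
print(opus_mt_model_name("en", "zh"))
```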

Here, try translation from English (`en`) to French (`fr`):

In [None]:
translator = pipeline("translation_en_to_fr")

text = """Established in 1954, Singapore Polytechnic is an open institution of
    higher education. It is considered to be the first institute of technology
    in Singapore. The institute mainly focuses on research, training, and
    studies in the field of technology, commerce, arts, and science. The
    industry-oriented institution admits students only after their completion
    of two to three years of studies. The institute began its first classes
    with a few students. Over the years, it emerged rapidly, and the number of
    enrolments increased to about 15,900 students. And in 2018, the institute
    became the first technological college to have 200,000 graduates."""

translator(text, clean_up_tokenization_spaces=True, max_length=200)

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


[{'translation_text': "Créée en 1954, l'Institut polytechnique de Singapour est un établissement d'enseignement supérieur ouvert et est considéré comme le premier institut de technologie de Singapour. L'Institut se concentre principalement sur la recherche, la formation et les études dans les domaines de la technologie, du commerce, des arts et des sciences. L'établissement axé sur l'industrie n'accepte les étudiants qu'après l'achève"}]

---
**Search for translation models for other languages and translate the previous text into a few different languages.**

In [None]:
translator = pipeline("translation_en_to_zh", model="Helsinki-NLP/opus-mt-en-zh")

translator(text, clean_up_tokenization_spaces=True, max_length=200)

## <font color="blue">**Conclusion**</font>
---
This notebook used `transformers` for several NLP applications.

The following applications are covered in greater detail in other `.ipynb` notebooks:
* Sentiment analysis
* Text classification
* Text summarization
* Question answering
* Image-to-Text

Some of those notebooks also use `transformers`.

The models used so far were pre-trained on default datasets, which may not be suitable for your use case.
For more advanced applications, you may need to retrain (fine-tune) the model or adjust the arguments passed to the `pipeline()` function.