# BERT Pipeline Functions

This notebook demonstrates several of the pipeline functions that the HuggingFace Transformers library provides out of the box. For a complete list, see this [task summary document](https://huggingface.co/transformers/task_summary.html). Each pipeline relies on a different default language model that has been trained for that particular task.

## **Import necessary Python libraries and modules**

To use the HuggingFace [`transformers` Python library](https://huggingface.co/transformers/installation.html), we will install it (again) with `pip`.

In [None]:
!pip3 install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.0-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.0 tokenizers-0.13.2 transformers-4.24.0


Once `transformers` is installed, we import the `pipeline` function:

In [None]:
from transformers import pipeline

## **Set parameters**


In [None]:
# The only parameter we need for now: the device to run on
device_name = 'cuda'
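
A pipeline typically runs on the CPU unless it is given a `device` argument, so setting `device_name` alone does not move any model to the GPU. The following is a minimal sketch of one way to connect the two, assuming the integer convention of `pipeline(device=...)`, where a GPU index such as `0` selects a GPU and `-1` selects the CPU. The helper name `to_pipeline_device` is hypothetical and not part of the library.

```python
def to_pipeline_device(device_name: str, cuda_available: bool) -> int:
    """Translate a device name into the integer index accepted by pipeline(device=...).

    Returns 0 (the first GPU) when CUDA is requested and available, otherwise -1 (CPU).
    """
    if device_name == "cuda" and cuda_available:
        return 0
    return -1


# Usage sketch (assumes PyTorch is installed alongside transformers):
#   import torch
#   from transformers import pipeline
#   device = to_pipeline_device(device_name, torch.cuda.is_available())
#   classifier = pipeline("sentiment-analysis", device=device)
print(to_pipeline_device("cuda", cuda_available=False))  # prints -1 (falls back to CPU)
```

Passing the availability flag in as an argument keeps the helper easy to check on a machine without a GPU.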

## **First up: Sentiment Analysis**

By default, this pipeline downloads a model called `distilbert-base-uncased-finetuned-sst-2-english`. It uses the DistilBERT architecture and has been fine-tuned on the SST-2 dataset for sentiment analysis.

In [None]:
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
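
The warning above recommends naming the model and revision explicitly. A minimal sketch of doing so, using the model name and revision copied from that warning; the constant and function names are hypothetical, and calling the function still downloads the model on first use.

```python
# Values copied from the warning message printed by pipeline('sentiment-analysis').
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"


def build_pinned_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    # Imported here so the constants above can be reused without loading the library.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_NAME, revision=MODEL_REVISION)


# Usage sketch:
#   classifier = build_pinned_classifier()
#   classifier("I hope you have fun on vacation!")
```

Pinning both values means the notebook keeps loading the same weights even if the library's default for this task changes.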



Try it out!

In [None]:
classifier('This is the last day of class before Thanksgiving break.')

[{'label': 'POSITIVE', 'score': 0.594679057598114}]

The classifier also accepts a list of sentences, and the results can be printed in a more readable format:

In [None]:
results = classifier([
    "This is the last day of class before Thanksgiving break.",
    "I hope you have fun on vacation!",
    "Everyone seems to be getting sick! Rest up!",
    "People seem to be overwhelmed by exams and deadlines",
])

for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.5947
label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.9981
label: NEGATIVE, with score: 0.9969
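
Each `score` is the probability of the predicted label only, so a `POSITIVE` at 0.5947 is close to a coin flip while a `NEGATIVE` at 0.9981 is a confident call. Since this model has two labels, the scores can be put on a single scale by converting them to the probability of `POSITIVE`. The sketch below does this on the rounded values printed above; `positive_probability` is a hypothetical helper name.

```python
# Rounded results copied from the output above.
results = [
    {"label": "POSITIVE", "score": 0.5947},
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.9981},
    {"label": "NEGATIVE", "score": 0.9969},
]


def positive_probability(result):
    """Return P(POSITIVE) for a two-label result from the sentiment pipeline."""
    if result["label"] == "POSITIVE":
        return result["score"]
    # With only two labels, the other label holds the remaining probability.
    return 1.0 - result["score"]


for result in results:
    print(f"P(positive) = {positive_probability(result):.4f}")
```

On this scale the first sentence sits near the middle, which matches its fairly neutral wording.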


## **Next up: Masked Language Modeling**

In [None]:
unmasker = pipeline("fill-mask")

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



Try it out!

In [None]:
from pprint import pprint

pprint(unmasker(f"I am certain that you will create very {unmasker.tokenizer.mask_token} final projects."))

[{'score': 0.16116251051425934,
  'sequence': 'I am certain that you will create very interesting final '
              'projects.',
  'token': 2679,
  'token_str': ' interesting'},
 {'score': 0.086085245013237,
  'sequence': 'I am certain that you will create very good final projects.',
  'token': 205,
  'token_str': ' good'},
 {'score': 0.07449886947870255,
  'sequence': 'I am certain that you will create very successful final '
              'projects.',
  'token': 1800,
  'token_str': ' successful'},
 {'score': 0.04554571583867073,
  'sequence': 'I am certain that you will create very impressive final '
              'projects.',
  'token': 3444,
  'token_str': ' impressive'},
 {'score': 0.03319435566663742,
  'sequence': 'I am certain that you will create very nice final projects.',
  'token': 2579,
  'token_str': ' nice'}]
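
The pipeline returns its five best candidates, each with a probability taken over the model's whole vocabulary, so the scores do not add up to 1. The leading space in each `token_str` comes from the tokenizer, which encodes the space before a word as part of the token. A small sketch that tidies these fields and totals the probability covered, using values copied from the output above (the `sequence` and `token` fields are left out for brevity, and `summarize` is a hypothetical helper name):

```python
# Predictions copied from the fill-mask output above.
predictions = [
    {"score": 0.16116251051425934, "token_str": " interesting"},
    {"score": 0.086085245013237, "token_str": " good"},
    {"score": 0.07449886947870255, "token_str": " successful"},
    {"score": 0.04554571583867073, "token_str": " impressive"},
    {"score": 0.03319435566663742, "token_str": " nice"},
]


def summarize(predictions):
    """Return cleaned (word, probability) pairs and the total probability they cover."""
    words = [(p["token_str"].strip(), round(p["score"], 4)) for p in predictions]
    covered = sum(p["score"] for p in predictions)
    return words, covered


words, covered = summarize(predictions)
for word, prob in words:
    print(f"{word:<12} {prob:.4f}")
print(f"top-{len(words)} candidates cover {covered:.1%} of the probability mass")
```

The five candidates together account for roughly 40% of the probability, which shows that the model considers many other fillers plausible.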


## **Third up: Text generation!**

In [None]:
text_generator = pipeline("text-generation")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [None]:
print(text_generator("Over Thanksgiving break, I plan to see my family and ", max_length=50, do_sample=False))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Over Thanksgiving break, I plan to see my family and \xa0my friends. I'm going to be back in the house for Thanksgiving. I'm going to be back in the house for Thanksgiving. I'm going to be back in the house for"}]
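
The repeated sentence in this output is typical of greedy decoding: with `do_sample=False`, generation picks the single most likely next token at every step, so once the text returns to a state it has already visited, it follows the same path again. The toy sketch below illustrates the effect. The probability table is invented for illustration and has nothing to do with GPT-2's actual probabilities.

```python
# Hypothetical next-token probabilities, made up for this illustration only.
NEXT_TOKEN_PROBS = {
    "I'm": {"going": 0.6, "happy": 0.4},
    "going": {"to": 0.9, "home": 0.1},
    "to": {"be": 0.5, "see": 0.3, "eat": 0.2},
    "be": {"back": 0.7, "home": 0.3},
    "back": {".": 0.8, "soon": 0.2},
    ".": {"I'm": 0.6, "We": 0.4},
}


def greedy_decode(start, steps):
    """Extend `start` by always choosing the most probable next token."""
    tokens = [start]
    for _ in range(steps):
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:
            break  # no known continuation for this token
        tokens.append(max(candidates, key=candidates.get))
    return tokens


print(" ".join(greedy_decode("I'm", 12)))
```

Because the most likely token after `.` leads back to `I'm`, the same six tokens repeat indefinitely. Sampling from the distribution instead of taking the maximum is one common way to break such loops.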


## **Fourth up: Named Entity Recognition (NER)!**

The default NER model labels each token with one of 9 classes:

1. O, Outside of a named entity
2. B-MISC, Beginning of a miscellaneous entity right after another miscellaneous entity
3. I-MISC, Miscellaneous entity
4. B-PER, Beginning of a person’s name right after another person’s name
5. I-PER, Person’s name
6. B-ORG, Beginning of an organisation right after another organisation
7. I-ORG, Organisation
8. B-LOC, Beginning of a location right after another location
9. I-LOC, Location
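
In this scheme the `B-` prefix is used only when one entity directly follows another of the same type, so most entity tokens carry an `I-` tag. The sketch below shows one way such tags can be grouped back into whole entities. It assumes the input is a list of `(word, tag)` pairs over whole words, which is a simplification, and `group_entities` is a hypothetical helper name.

```python
def group_entities(tagged_words):
    """Group consecutive (word, tag) pairs into (entity_text, entity_type) tuples."""
    entities = []
    current_words, current_type = [], None

    for word, tag in tagged_words:
        prefix, _, entity_type = tag.partition("-")
        # A new entity starts on a B- tag or when the entity type changes.
        starts_new = tag != "O" and (prefix == "B" or entity_type != current_type)

        if tag == "O" or starts_new:
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None

        if tag != "O":
            current_words.append(word)
            current_type = entity_type

    # Flush an entity that runs to the end of the input.
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


example = [
    ("Emory", "I-ORG"), ("University", "I-ORG"), ("is", "O"), ("in", "O"),
    ("Atlanta", "I-LOC"), ("Greg", "I-PER"), ("Fenves", "I-PER"),
]
print(group_entities(example))
```

A `B-PER` tag directly after an `I-PER` tag would close the first name and start a second one, which is the case the `B-` classes exist for.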



In [None]:
ner_pipe = pipeline("ner")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [None]:
sequence = "Emory University is a premier research university that is located in Atlanta, GA. The president of Emory University is Greg Fenves."

In [None]:
for entity in ner_pipe(sequence):
  print(entity)

{'entity': 'I-ORG', 'score': 0.9994702, 'index': 1, 'word': 'Em', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9980411, 'index': 2, 'word': '##ory', 'start': 2, 'end': 5}
{'entity': 'I-ORG', 'score': 0.9971501, 'index': 3, 'word': 'University', 'start': 6, 'end': 16}
{'entity': 'I-LOC', 'score': 0.9964798, 'index': 13, 'word': 'Atlanta', 'start': 69, 'end': 76}
{'entity': 'I-LOC', 'score': 0.8767788, 'index': 15, 'word': 'GA', 'start': 78, 'end': 80}
{'entity': 'I-ORG', 'score': 0.99938345, 'index': 20, 'word': 'Em', 'start': 99, 'end': 101}
{'entity': 'I-ORG', 'score': 0.9979298, 'index': 21, 'word': '##ory', 'start': 101, 'end': 104}
{'entity': 'I-ORG', 'score': 0.99549425, 'index': 22, 'word': 'University', 'start': 105, 'end': 115}
{'entity': 'I-PER', 'score': 0.9994894, 'index': 24, 'word': 'Greg', 'start': 119, 'end': 123}
{'entity': 'I-PER', 'score': 0.9990897, 'index': 25, 'word': 'Fen', 'start': 124, 'end': 127}
{'entity': 'I-PER', 'score': 0.99418986, 'index': 26, 'wo