# Transformers

## NLP Overview

Natural Language Processing (NLP) sits at the intersection of Machine Learning and Linguistics.

The underlying task in NLP is to understand what words mean in the context of their sentences.

Some problems NLP aims to solve:
- Classifying sentences
  - Ham/spam emails, real/fake news, sentiment analysis
- Classifying words
  - Parts of speech
- Text Generation
  - Prompt completion
  - Answer extraction
  - Text summarization
  - Translation
 
To stress "what words mean in the context of their sentences", take a moment to think about how each NLP problem above relies on context.

NLP also overlaps with Computer Vision and Speech Recognition in tasks like generating a description of an image or a text transcript of an audio sample.

## Hugging Face Transformers Library

The `transformers` library lets us perform NLP tasks easily through its `pipeline` function.

### Pipelines

As a first introduction to `pipeline` in the `transformers` library, know that *each NLP task* it supports (like "text-generation", "summarization", "translation", etc.) *includes a default model* that has been fine-tuned for that task. Note below that no argument other than the task type is passed to `pipeline`.

In [1]:
from transformers import pipeline
import warnings
warnings.filterwarnings('ignore')

In [5]:
# No argument other than the task type is passed,
# so pipeline downloads a default model for the task.
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [None]:
classifier('you look really nice today!')

[{'label': 'POSITIVE', 'score': 0.9998635053634644}]

In [6]:
messages = ['you look really nice today!', 'you look really terrible today!']
classifier(messages)

[{'label': 'POSITIVE', 'score': 0.9998635053634644},
 {'label': 'NEGATIVE', 'score': 0.9995715022087097}]

**We can also specify a model from the Hugging Face website by passing its identifier (the part of its page URL after `huggingface.co/`) as the `model` argument.**

In [36]:
# Pass a model identifier to override the task's default model.
generator = pipeline('text-generation', model='distilgpt2')

We can search for models here: https://huggingface.co/models
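The log output earlier in this notebook warns against relying on the default model and revision in production. A hedged sketch of pinning both follows. The identifier and revision are the ones printed in that log message, `build_classifier` is a name chosen here for illustration, and the import sits inside the function so that defining it does not trigger a download.

```python
# Model identifier and revision copied from the pipeline's own log message.
MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "af0f99b"


def build_classifier():
    """Create a sentiment pipeline pinned to an explicit model and revision."""
    from transformers import pipeline  # imported lazily; requires `transformers`

    return pipeline("sentiment-analysis", model=MODEL_ID, revision=REVISION)


# Usage (downloads the model on the first call):
# pinned_classifier = build_classifier()
# pinned_classifier("you look really nice today!")
```

Pinning the revision means a later update to the model repository should not silently change the results.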

Hugging Face pipelines are end-to-end solutions that handle the back-end work of an NLP task.

Pipeline jobs:
1. Preprocess the input
2. Pass the input to the model (which has been downloaded and cached by `pipeline`)
3. Post-process the model's output for human readability
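
The three jobs above can be sketched with a toy stand-in. Everything here is hypothetical (the `ToyPipeline` class, the word lists, the scoring rule); it only mirrors the preprocess, model, post-process shape and is not how `transformers` implements it.

```python
import math

# Hypothetical word lists standing in for a trained model's knowledge.
POSITIVE_WORDS = {"nice", "great", "good"}
NEGATIVE_WORDS = {"terrible", "awful", "bad"}


class ToyPipeline:
    """A toy three-stage pipeline: preprocess -> model -> post-process."""

    def preprocess(self, text):
        # Step 1: turn raw text into model inputs (here: lowercase word tokens).
        return [word.strip("!.,?").lower() for word in text.split()]

    def forward(self, tokens):
        # Step 2: the "model" returns raw scores (logits) for [NEGATIVE, POSITIVE].
        negative = sum(token in NEGATIVE_WORDS for token in tokens)
        positive = sum(token in POSITIVE_WORDS for token in tokens)
        return [float(negative), float(positive)]

    def postprocess(self, logits):
        # Step 3: convert logits into a human-readable label and probability.
        exps = [math.exp(value) for value in logits]
        probs = [value / sum(exps) for value in exps]
        labels = ["NEGATIVE", "POSITIVE"]
        best = max(range(len(probs)), key=probs.__getitem__)
        return {"label": labels[best], "score": probs[best]}

    def __call__(self, text):
        return self.postprocess(self.forward(self.preprocess(text)))


toy_classifier = ToyPipeline()
result = toy_classifier("you look really nice today!")
print(result["label"])
```

A real pipeline replaces the word lists with a tokenizer and a neural network, but the caller still sees a single function call.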

Effectively, Hugging Face has done all the work for us; we just need to tell `pipeline` which task we are performing. It accepts other parameters as well, most of which we won't cover in this notebook.

Below, we will test out the `pipeline`s for tasks listed in the Hugging Face tutorial:
- Zero-shot classification
- Text generation
- Mask filling
- Named-entity recognition
- Question answering
- Summarization
- Translation

This is not a comprehensive list of Hugging Face pipeline options. See https://huggingface.co/docs/transformers/main_classes/pipelines for the full list of `pipelines`.

#### Zero-shot classification

The "zero-shot-classification" pipeline lets us classify input text against a set of candidate labels that we supply. It is called zero-shot because the model needs no fine-tuning on those labels.

It outputs a score for each candidate label, sorted from highest to lowest. By default the scores are probabilities that sum to 1 (100%).
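
To see why the scores behave like probabilities, here is a minimal sketch assuming the pipeline's default single-label mode, where one raw score per candidate label is normalized with a softmax. The logit values are made up for illustration, and `softmax` is a helper defined here, not a `transformers` function.

```python
import math


def softmax(logits):
    """Normalize raw scores into values that are positive and sum to 1."""
    shifted = [value - max(logits) for value in logits]  # for numerical stability
    exps = [math.exp(value) for value in shifted]
    total = sum(exps)
    return [value / total for value in exps]


candidate_labels = ["aesthetics", "sports", "academics"]
# Made-up scores for how strongly the text supports each label.
label_logits = [4.0, -0.5, -0.4]

scores = softmax(label_logits)
# Sort labels from most to least likely, as the pipeline output does.
ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.4f}")
print("total:", round(sum(scores), 6))
```

The pipeline also accepts a `multi_label` option that scores each label independently, in which case the scores need not sum to 1.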

In [8]:
classifier = pipeline('zero-shot-classification')

No model was supplied, defaulted to roberta-large-mnli and revision 130fb28 (https://huggingface.co/roberta-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFRobertaForSequenceClassification.

All the weights of TFRobertaForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


In [10]:
classifier('This product smells really nice',
         candidate_labels = ['aesthetics', 'sports', 'academics'])

{'sequence': 'This product smells really nice',
 'labels': ['aesthetics', 'academics', 'sports'],
 'scores': [0.9762329459190369, 0.01219076570123434, 0.011576291173696518]}

In [11]:
classifier('This product smells really nice',
         candidate_labels = ['aesthetics', 'food', 'academics'])

{'sequence': 'This product smells really nice',
 'labels': ['aesthetics', 'food', 'academics'],
 'scores': [0.9493024349212646, 0.038843072950839996, 0.011854469776153564]}

In [12]:
classifier('This product smells really nice',
         candidate_labels = ['aesthetics', 'food', 'home goods'])

{'sequence': 'This product smells really nice',
 'labels': ['aesthetics', 'home goods', 'food'],
 'scores': [0.9077967405319214, 0.05505849048495293, 0.03714476525783539]}

#### Text Generation

Like GPT, the "text-generation" pipeline will complete a prompt.

It's worth noting that the default model (GPT-2, per the log below) performs nowhere near as well as GPT-3.5.
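
A minimal sketch of what "completing a prompt" means: the model repeatedly picks a next token given the text so far and appends it. The tiny `NEXT_WORDS` table is a hypothetical stand-in for a language model, and `max_length` here counts words (prompt included), loosely mirroring the pipeline argument of the same name. Sampling at each step is one common decoding choice, and it is why repeated calls with the same prompt can return different text.

```python
import random

# Hypothetical next-word table standing in for a trained language model.
NEXT_WORDS = {
    "how": ["to"],
    "to": ["build", "use", "design"],
    "build": ["a"],
    "use": ["a"],
    "design": ["a"],
    "a": ["model", "pipeline"],
}


def generate(prompt, max_length, seed=None):
    """Append sampled words until max_length words or no known continuation."""
    rng = random.Random(seed)
    words = prompt.split()
    while len(words) < max_length:
        options = NEXT_WORDS.get(words[-1])
        if not options:
            break  # the toy "model" has nothing to say after this word
        words.append(rng.choice(options))  # sample one continuation
    return " ".join(words)


completion = generate("we will teach you how", max_length=9, seed=0)
print(completion)
```

Fixing `seed` makes the toy output repeatable; leaving it as `None` lets each call sample a different continuation.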

In [25]:
generator = pipeline('text-generation')

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [18]:
a = generator('System: You are an assistant that tells what day it is.\nUser: what day is today?')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "System: You are an assistant that tells what day it is.\nUser: what day is today?\nUser: the day before?\nUser: is tomorrow night?\nUser: no day tomorrow\nUser: it's Monday tomorrow\nUser"}]

In [21]:
print(a[0]['generated_text'])

System: You are an assistant that tells what day it is.
User: what day is today?
User: the day before?
User: is tomorrow night?
User: no day tomorrow
User: it's Monday tomorrow
User


In [22]:
a = generator('What is the numeric date of the day after Christmas day?')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [23]:
print(a[0]['generated_text'])

What is the numeric date of the day after Christmas day?

An accurate date, but a date in the format "F0-6"; you can see the exact date in hexadecimal format on the wiki.

Is this


In [27]:
generator("In this course, we will teach you how to")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to run a web application using Python and Ruby.\n\nPlease read in more detail how to apply it, in short, to your web application. Please note the following:\n\nMake sure you understand'}]

In [28]:
generator("In this course, we will teach you how to")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to build an authentic relationship with the person you have with the world, and how to set a new foundation for your work day. After that, we will talk practical techniques and examples of how to work with'}]

In [31]:
generator("In this course, we will teach you how to",
          num_return_sequences = 5,
          max_length = 25)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use a Raspberry Pi 3 as a portable desktop computer. Please use it to'},
 {'generated_text': 'In this course, we will teach you how to use an iPhone with a camera on a USB and it will work on any'},
 {'generated_text': 'In this course, we will teach you how to use a JavaScript class to manage two sets of users in the context of a'},
 {'generated_text': 'In this course, we will teach you how to use the new Ruby on Rails and how to build your own custom Ruby application'},
 {'generated_text': "In this course, we will teach you how to design a functional JavaScript app. We'll cover the fundamentals and see how to"}]

#### Mask Filling

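Filling in the blanks! The "fill-mask" pipeline predicts the most likely tokens for the `<mask>` placeholder (the mask token of the default model) in the input.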
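
As a rough sketch of what comes back: the model assigns a probability to candidate tokens for the masked position, and `top_k` keeps the highest-scoring ones. The probabilities below are invented for illustration, and `fill_mask` is a hypothetical helper, not the library's implementation.

```python
def fill_mask(text, token_probs, top_k=2, mask_token="<mask>"):
    """Return the top_k most probable fillers for the masked position."""
    ranked = sorted(token_probs.items(), key=lambda item: item[1], reverse=True)
    return [
        {"score": prob, "token_str": token, "sequence": text.replace(mask_token, token)}
        for token, prob in ranked[:top_k]
    ]


# Invented probabilities for the masked position.
toy_probs = {"statistical": 0.05, "mathematical": 0.20, "role": 0.01}

predictions = fill_mask("This course will teach you about <mask> models.", toy_probs)
for prediction in predictions:
    print(prediction["sequence"])
```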

In [37]:
unmasker = pipeline('fill-mask')

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFRobertaForMaskedLM.

All the weights of TFRobertaForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.

In [40]:
unmasker('This course will teach you about <mask> models.', 
         top_k = 2 # How many values we want to return
)

[{'score': 0.19890430569648743,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you about mathematical models.'},
 {'score': 0.05367986112833023,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you about computational models.'}]

#### Named-Entity Recognition

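Finds the entities (such as people, organizations, and locations) in the given input. Passing `grouped_entities = True` merges the tokens that belong to the same entity into a single result.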
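
The grouping step can be sketched as follows, assuming the model tags each token with a `B-` (begin) or `I-` (inside) prefix plus an entity type, and `O` for non-entities. The tagged tokens are hand-written for illustration and `group_entities` is a hypothetical helper; the real pipeline also handles sub-word pieces, scores, and character offsets.

```python
def group_entities(tagged_tokens):
    """Merge consecutive B-/I- tagged tokens of one type into entity groups."""
    groups = []
    current = None
    for word, tag in tagged_tokens:
        if tag == "O":
            current = None  # outside any entity: close the open group
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "I" and current and current["entity_group"] == entity_type:
            current["word"] += " " + word  # continue the open entity
        else:
            current = {"entity_group": entity_type, "word": word}
            groups.append(current)
    return groups


# Hand-tagged example; a real model would predict these tags.
tagged = [
    ("My", "O"), ("name", "O"), ("is", "O"),
    ("Will", "B-PER"), ("Curkan", "I-PER"),
    ("and", "O"), ("I", "O"), ("work", "O"), ("at", "O"),
    ("Meta", "B-ORG"),
]

entities = group_entities(tagged)
print(entities)
```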

In [42]:
ner = pipeline('ner', grouped_entities = True)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFBertForTokenClassification.

All the weights of TFBertForTokenClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.

In [43]:
ner('My name is Will Curkan and I work for a psychiatric and psychological services company.')

[{'entity_group': 'PER',
  'score': 0.9791667,
  'word': 'Will Curkan',
  'start': 11,
  'end': 22}]

In [46]:
ner('My name is Bob and I drive my lamborghini urus to work at Meta.') # Didn't catch the Lambo!

[{'entity_group': 'PER',
  'score': 0.99894196,
  'word': 'Bob',
  'start': 11,
  'end': 14},
 {'entity_group': 'ORG',
  'score': 0.97730756,
  'word': 'Meta',
  'start': 58,
  'end': 62}]

#### Question Answering

Given a `question` and some `context`, the pipeline returns an answer.

**This pipeline DOES NOT GENERATE an answer; it extracts a span of text from the context.**
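
A minimal sketch of that extraction, assuming the model produces one start score and one end score per context token and the answer is the span that maximizes their sum. The scores are made up and `best_span` is a hypothetical helper, not the library's code.

```python
def best_span(tokens, start_scores, end_scores, max_answer_len=5):
    """Pick the (start, end) token span with the highest combined score."""
    best = None
    for start in range(len(tokens)):
        last = min(start + max_answer_len, len(tokens))
        for end in range(start, last):
            score = start_scores[start] + end_scores[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    _, start, end = best
    return " ".join(tokens[start:end + 1])


context_tokens = ["I", "work", "at", "a", "psychological", "services", "company"]
# Made-up scores: higher means "more likely to start/end the answer here".
toy_start = [0.1, 0.0, 0.2, 2.5, 1.0, 0.3, 0.1]
toy_end = [0.0, 0.1, 0.0, 0.2, 0.4, 0.9, 3.0]

answer = best_span(context_tokens, toy_start, toy_end)
print(answer)
```

Because the answer is always a slice of the context, the pipeline cannot return text that the context does not contain.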

In [2]:
answer_question = pipeline('question-answering')
answer_question(question = 'Where do I work?',
                context = 'A psychiatric and psychological services company')

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFDistilBertForQuestionAnswering.

All the weights of TFDistilBertForQuestionAnswering were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForQuestionAnswering for predictions without further training.

{'score': 0.5120940208435059,
 'start': 0,
 'end': 48,
 'answer': 'A psychiatric and psychological services company'}

In [4]:
answer_question(question = 'Where do I work?',
                context = 'Not a real context... I do actually work somewhere, as stated above, but this is just an example')

{'score': 0.9465346932411194, 'start': 41, 'end': 50, 'answer': 'somewhere'}

**Note the response completely depends on the context** (also I love the answer "somewhere" haha)

#### Summarization

In summarization tasks, the model finds the important parts of the input text and condenses them into a shorter version.
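
A hedged sketch of how this could look with the "summarization" pipeline, following the same pattern as the other tasks in this notebook. `summarize` is a name chosen here for illustration, the length limits are arbitrary defaults, and the import sits inside the function so that defining it does not trigger a model download.

```python
def summarize(text, max_length=60, min_length=10):
    """Summarize text with the task's default model (downloaded on first use)."""
    from transformers import pipeline  # imported lazily; requires `transformers`

    summarizer = pipeline("summarization")
    # The pipeline returns a list with one dict per input text.
    result = summarizer(text, max_length=max_length, min_length=min_length)
    return result[0]["summary_text"]


# Usage (downloads the default summarization model on the first call):
# print(summarize("...a long passage of text..."))
```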

#### Translation