# Pipelines for NLP Tasks

In [1]:
import transformers
from transformers import pipeline

In [3]:
print(transformers.__version__)

4.41.2


In [4]:
help(pipeline)

Help on function pipeline in module transformers.pipelines:

pipeline(task: str = None, model: Union[str, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel'), NoneType] = None, config: Union[str, transformers.configuration_utils.PretrainedConfig, NoneType] = None, tokenizer: Union[str, transformers.tokenization_utils.PreTrainedTokenizer, ForwardRef('PreTrainedTokenizerFast'), NoneType] = None, feature_extractor: Union[str, ForwardRef('SequenceFeatureExtractor'), NoneType] = None, image_processor: Union[str, transformers.image_processing_utils.BaseImageProcessor, NoneType] = None, framework: Optional[str] = None, revision: Optional[str] = None, use_fast: bool = True, token: Union[str, bool, NoneType] = None, device: Union[int, str, ForwardRef('torch.device'), NoneType] = None, device_map=None, torch_dtype=None, trust_remote_code: Optional[bool] = None, model_kwargs: Dict[str, Any] = None, pipeline_class: Optional[Any] = None, **kwargs) -> transformers.pipelines.base.Pipeline


## Loading Tasks

The `task` argument defines which pipeline is returned. Currently accepted tasks are:

- `"audio-classification"`: will return a [`AudioClassificationPipeline`].
- `"automatic-speech-recognition"`: will return a [`AutomaticSpeechRecognitionPipeline`].
- `"conversational"`: will return a [`ConversationalPipeline`].
- `"depth-estimation"`: will return a [`DepthEstimationPipeline`].
- `"document-question-answering"`: will return a [`DocumentQuestionAnsweringPipeline`].
- `"feature-extraction"`: will return a [`FeatureExtractionPipeline`].
- `"fill-mask"`: will return a [`FillMaskPipeline`].
- `"image-classification"`: will return a [`ImageClassificationPipeline`].
- `"image-segmentation"`: will return a [`ImageSegmentationPipeline`].
- `"image-to-text"`: will return a [`ImageToTextPipeline`].
- `"object-detection"`: will return a [`ObjectDetectionPipeline`].
- `"question-answering"`: will return a [`QuestionAnsweringPipeline`].
- `"summarization"`: will return a [`SummarizationPipeline`].
- `"table-question-answering"`: will return a [`TableQuestionAnsweringPipeline`].
- `"text2text-generation"`: will return a [`Text2TextGenerationPipeline`].
- `"text-classification"` (alias `"sentiment-analysis"` available): will return a [`TextClassificationPipeline`].
- `"text-generation"`: will return a [`TextGenerationPipeline`].
- `"token-classification"` (alias `"ner"` available): will return a [`TokenClassificationPipeline`].
- `"translation"`: will return a [`TranslationPipeline`].
- `"translation_xx_to_yy"`: will return a [`TranslationPipeline`].
- `"video-classification"`: will return a [`VideoClassificationPipeline`].
- `"visual-question-answering"`: will return a [`VisualQuestionAnsweringPipeline`].
- `"zero-shot-classification"`: will return a [`ZeroShotClassificationPipeline`].
- `"zero-shot-image-classification"`: will return a [`ZeroShotImageClassificationPipeline`].
- `"zero-shot-object-detection"`: will return a [`ZeroShotObjectDetectionPipeline`].
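The two aliases in the list mean that different task strings select the same pipeline class. A minimal sketch of that lookup, written by hand from the list above; `ALIASES` and `canonical_task` are illustrative names and not part of the `transformers` API:

```python
# Aliases copied by hand from the task list above (illustrative only).
ALIASES = {
    "sentiment-analysis": "text-classification",
    "ner": "token-classification",
}


def canonical_task(task: str) -> str:
    """Return the canonical task name for an alias, or the name unchanged."""
    return ALIASES.get(task, task)


for name in ("sentiment-analysis", "ner", "summarization"):
    print(f"{name!r} -> {canonical_task(name)!r}")
```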

## Classification

### Default Models

In [5]:
pipe = pipeline(task="text-classification")
pipe("This restaurant is ok")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

[{'label': 'POSITIVE', 'score': 0.9998236298561096}]
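The warning in the output above recommends naming the model and revision explicitly. A minimal sketch of doing so, reusing the model name and revision hash printed in that warning. `PINNED` and `load_pinned_classifier` are hypothetical names, and the import is deferred so that nothing is loaded or downloaded until the function is called:

```python
# Model name and revision copied from the warning printed above.
PINNED = {
    "task": "text-classification",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def load_pinned_classifier():
    """Build the pipeline with an explicit model and revision."""
    from transformers import pipeline  # deferred import: loading is slow

    return pipeline(**PINNED)


# Uncomment to load the model (downloads it on first use):
# pipe = load_pinned_classifier()
# pipe("This restaurant is ok")
```

Pinning the revision keeps results reproducible even if the model repository is updated later.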

### Specific Models

You may want a model specialized for a particular domain or text type. For financial news, for example, there is FinBERT: https://huggingface.co/ProsusAI/finbert

The paper has more details: https://arxiv.org/pdf/1908.10063

In [8]:
# Instead of using the default model, we can request a specific one
# Example: a model for financial news classification
# https://huggingface.co/ProsusAI/finbert
# Each model also has a default task associated with it, so `task` can be omitted
# "sentiment-analysis" is an alias for "text-classification"
pipe = pipeline(model="ProsusAI/finbert")

In [9]:
pipe("Shares of food delivery companies surged despite the catastrophic impact of coronavirus on global markets.")

[{'label': 'positive', 'score': 0.9350943565368652}]

In [10]:
tweets = ['Gonna buy AAPL, its about to surge up!',
          'Gotta sell AAPL, its gonna plummet!']

In [11]:
pipe(tweets)

[{'label': 'positive', 'score': 0.5234110355377197},
 {'label': 'neutral', 'score': 0.5528594255447388}]
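Both scores above are only slightly over 0.5, which suggests the model is unsure about these informal tweets. A small sketch of pairing each input with its prediction and flagging low-confidence results; the inputs and predictions are copied from the cells above, and the 0.7 threshold is an arbitrary assumption, not a recommended value:

```python
tweets = [
    "Gonna buy AAPL, its about to surge up!",
    "Gotta sell AAPL, its gonna plummet!",
]

# Predictions copied from the output above.
predictions = [
    {"label": "positive", "score": 0.5234110355377197},
    {"label": "neutral", "score": 0.5528594255447388},
]


def flag_low_confidence(texts, preds, threshold=0.7):
    """Pair each text with its prediction and mark whether the score clears the threshold."""
    rows = []
    for text, pred in zip(texts, preds):
        rows.append(
            {
                "text": text,
                "label": pred["label"],
                "score": round(pred["score"], 3),
                "confident": pred["score"] >= threshold,
            }
        )
    return rows


rows = flag_low_confidence(tweets, predictions)
for row in rows:
    print(row)
```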

## Named Entity Recognition

Let's explore another NLP task: named entity recognition (NER).

**Note: this is a much larger model! Running it downloads about 1.5 GB to a cache folder on your computer.**

In [12]:
# Recreate the default text-classification pipeline (it is saved to disk later in this section)
pipe = pipeline(task="text-classification")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [13]:
# Named Entity Recognition (WARNING: large model)
# We can tag named entities such as persons, organizations and locations:
ner_tag_pipe = pipeline('ner')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [14]:
result = ner_tag_pipe("After working at Tesla I started to study Nikola Tesla a lot more, especially at university in the USA.")

In [16]:
# Tesla: Organization
# Nikola Tesla: Person
# USA: Location
result

[{'entity': 'I-ORG',
  'score': 0.9137764,
  'index': 4,
  'word': 'Te',
  'start': 17,
  'end': 19},
 {'entity': 'I-ORG',
  'score': 0.37898877,
  'index': 5,
  'word': '##sla',
  'start': 19,
  'end': 22},
 {'entity': 'I-PER',
  'score': 0.99693346,
  'index': 10,
  'word': 'Nikola',
  'start': 42,
  'end': 48},
 {'entity': 'I-PER',
  'score': 0.9901416,
  'index': 11,
  'word': 'Te',
  'start': 49,
  'end': 51},
 {'entity': 'I-PER',
  'score': 0.8931826,
  'index': 12,
  'word': '##sla',
  'start': 51,
  'end': 54},
 {'entity': 'I-LOC',
  'score': 0.9997478,
  'index': 22,
  'word': 'USA',
  'start': 99,
  'end': 102}]
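The output above is per token, so "Tesla" appears as the WordPiece pieces `Te` and `##sla`. A minimal sketch of merging adjacent pieces into whole entities using the `index`, `start` and `end` fields; the token list is copied from the output above, and `merge_subword_entities` is a hypothetical helper. The pipeline also has an `aggregation_strategy` argument (for example `"simple"`) that groups tokens for you, so treat this as an illustration of the idea rather than the library's implementation:

```python
sentence = (
    "After working at Tesla I started to study Nikola Tesla a lot more, "
    "especially at university in the USA."
)

# Token-level predictions copied from the output above.
tokens = [
    {"entity": "I-ORG", "score": 0.9137764, "index": 4, "word": "Te", "start": 17, "end": 19},
    {"entity": "I-ORG", "score": 0.37898877, "index": 5, "word": "##sla", "start": 19, "end": 22},
    {"entity": "I-PER", "score": 0.99693346, "index": 10, "word": "Nikola", "start": 42, "end": 48},
    {"entity": "I-PER", "score": 0.9901416, "index": 11, "word": "Te", "start": 49, "end": 51},
    {"entity": "I-PER", "score": 0.8931826, "index": 12, "word": "##sla", "start": 51, "end": 54},
    {"entity": "I-LOC", "score": 0.9997478, "index": 22, "word": "USA", "start": 99, "end": 102},
]


def merge_subword_entities(tokens, text):
    """Merge consecutive tokens that share an entity type into one span."""
    spans = []
    for tok in tokens:
        label = tok["entity"].split("-", 1)[-1]  # "I-ORG" -> "ORG"
        prev = spans[-1] if spans else None
        if prev and prev["label"] == label and tok["index"] == prev["last_index"] + 1:
            # Same type and directly adjacent token: extend the current span.
            prev["end"] = tok["end"]
            prev["last_index"] = tok["index"]
            prev["scores"].append(tok["score"])
        else:
            spans.append(
                {
                    "label": label,
                    "start": tok["start"],
                    "end": tok["end"],
                    "last_index": tok["index"],
                    "scores": [tok["score"]],
                }
            )
    return [
        {
            "label": s["label"],
            "text": text[s["start"]:s["end"]],  # slice the original sentence
            "score": sum(s["scores"]) / len(s["scores"]),  # mean token score
        }
        for s in spans
    ]


entities = merge_subword_entities(tokens, sentence)
for entity in entities:
    print(entity["label"], repr(entity["text"]))
```

Slicing the original sentence with the character offsets avoids having to strip the `##` prefixes by hand.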

In [15]:
# We can save the pipeline (model and tokenizer) to a folder:
# my_local_text_classification/
#   config.json
#   special_tokens_map.json
#   tokenizer.json
#   model.safetensors
#   tokenizer_config.json
#   vocab.txt
pipe.save_pretrained('my_local_text_classification/')

In [18]:
# Load the pipeline from the saved directory
pipe = pipeline(task="text-classification", model='my_local_text_classification/', tokenizer='my_local_text_classification/')

In [19]:
# Now we can use the pipeline for inference
result = pipe("I love using Hugging Face transformers!")
print(result)

[{'label': 'POSITIVE', 'score': 0.9971315860748291}]


In [17]:
!ls my_local_text_classification

config.json	   special_tokens_map.json  tokenizer.json
model.safetensors  tokenizer_config.json    vocab.txt
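`!ls` only works inside a notebook on a Unix-like system. As a portable alternative, here is a sketch that checks a saved directory for the files listed above. The expected file set is copied from that listing and is an assumption: other models and tokenizers may save a different set. `missing_files` is a hypothetical helper, demonstrated on a temporary directory so that it runs without a saved model:

```python
import tempfile
from pathlib import Path

# File names copied from the directory listing above (may differ for other models).
EXPECTED = {
    "config.json",
    "model.safetensors",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.txt",
}


def missing_files(directory, expected=EXPECTED):
    """Return the expected file names that are absent from the directory."""
    found = {p.name for p in Path(directory).iterdir() if p.is_file()}
    return sorted(set(expected) - found)


# Demo on a throwaway directory that contains only two of the files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.json", "vocab.txt"):
        (Path(tmp) / name).touch()
    demo_missing = missing_files(tmp)

print(demo_missing)
```

For the directory saved above, the call would be `missing_files("my_local_text_classification/")`.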


## Question Answering

In [20]:
# Another task is question answering (QA): this is similar to a chatbot,
# but we pass a context and a question and get back one answer extracted from the context
# Default QA model:
# https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad
qa_bot = pipeline('question-answering')

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [21]:
text = """
D-Day, marked on June 6, 1944, stands as one of the most significant military operations in history,
initiating the Allied invasion of Nazi-occupied Europe during World War II. Known as Operation Overlord,
this massive amphibious assault involved nearly 160,000 Allied troops landing on the beaches of Normandy,
France, across five sectors: Utah, Omaha, Gold, Juno, and Sword. Supported by over 5,000 ships and 13,000
aircraft, the operation was preceded by extensive aerial and naval bombardment and an airborne assault.
The invasion set the stage for the liberation of Western Europe from Nazi control, despite the heavy
casualties and formidable German defenses. This day not only demonstrated the logistical prowess
and courage of the Allied forces but also marked a turning point in the war, leading to the eventual
defeat of Nazi Germany.
"""

In [22]:
question = "What were the five beach sectors on D-Day?"

result = qa_bot(question=question, context=text)

In [23]:
result

{'score': 0.9430820345878601,
 'start': 345,
 'end': 379,
 'answer': 'Utah, Omaha, Gold, Juno, and Sword'}
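The `start` and `end` values in the result are character offsets into the context string. A small sketch of using them to show the answer inside its surrounding text; the short `context` and the `result` dict below are hand-built stand-ins for a real pipeline call, and `answer_with_snippet` is a hypothetical helper:

```python
def answer_with_snippet(context, result, margin=30):
    """Return the answer in brackets with up to `margin` characters of context on each side."""
    start, end = result["start"], result["end"]
    if context[start:end] != result["answer"]:
        raise ValueError("offsets do not match the answer text")
    left = max(0, start - margin)
    right = min(len(context), end + margin)
    return context[left:start] + "[" + context[start:end] + "]" + context[end:right]


# Hand-built stand-in for a pipeline result (offsets computed with str.find).
context = "The landings covered five sectors: Utah, Omaha, Gold, Juno, and Sword."
answer = "Utah, Omaha, Gold, Juno, and Sword"
start = context.find(answer)
result = {"start": start, "end": start + len(answer), "answer": answer}

snippet = answer_with_snippet(context, result, margin=10)
print(snippet)
```

The offset check fails loudly if the context passed in is not the exact string the pipeline saw, for example after stripping whitespace.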

## Translations

The translation pipeline translates text from one language to another.

It can be loaded from `pipeline()` with the task identifier `"translation_xx_to_yy"`, where `xx` and `yy` are the source and target language codes.

It works with models that have been fine-tuned on a translation task. See the up-to-date list of available models at https://huggingface.co/models.

Note: you would typically pass a specific model for translation: https://huggingface.co/models?pipeline_tag=translation

In [45]:
from transformers import pipeline
# Language translation: we can get a generic translation pipeline 'translation'
# or a specific one 'translation_xx_to_yy'
# Default model:
# https://huggingface.co/google-t5/t5-base
translate = pipeline('translation_en_to_fr')

No model was supplied, defaulted to google-t5/t5-base and revision 686f1db (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [41]:
result = translate("Hello, my name is Mikel. What is your name?")

In [42]:
# The default en-to-fr pipeline returns a list with one dict
# whose 'translation_text' value is the French translation
result

In [36]:
# For other language pairs, we need to specify a suitable model
translate = pipeline('translation_en_to_es', model='Helsinki-NLP/opus-mt-en-es')

In [46]:
result = translate("Hello, my name is Mikel. What is your name?")

In [40]:
result

[{'translation_text': 'Hola, me llamo Mikel. ¿Cómo te llamas?'}]
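In the cell above, both the task identifier and the model name encode the language pair. A sketch of parsing a `translation_xx_to_yy` identifier and deriving a matching model id from it; both helpers are hypothetical, and the `Helsinki-NLP/opus-mt-{src}-{tgt}` pattern is an assumption generalized from the single model used here, so a given pair may not exist on the Hub:

```python
from typing import Tuple


def parse_translation_task(task: str) -> Tuple[str, str]:
    """Split 'translation_xx_to_yy' into (source, target) language codes."""
    parts = task.split("_")
    valid = (
        len(parts) == 4
        and parts[0] == "translation"
        and parts[2] == "to"
        and parts[1] != ""
        and parts[3] != ""
    )
    if not valid:
        raise ValueError(f"expected 'translation_xx_to_yy', got {task!r}")
    return parts[1], parts[3]


def opus_mt_model_id(task: str) -> str:
    """Guess a model id for the task (assumed naming pattern, not guaranteed to exist)."""
    src, tgt = parse_translation_task(task)
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


print(parse_translation_task("translation_en_to_fr"))
print(opus_mt_model_id("translation_en_to_es"))
```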

In [47]:
# We can also load the generic translation pipeline
# and specify the source and target languages when translating
translator = pipeline('translation', model='Helsinki-NLP/opus-mt-en-es')

result = translator("Hello, how are you?", src_lang='en', tgt_lang='es')
print(result)

[{'translation_text': 'Hola, ¿cómo estás?'}]
