# Testing NLP Models from Hugging Face
This tutorial helps readers understand and practice using the NLP models available on the Hugging Face platform. Reference course: https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt

# Transformers, what can they do?

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)

## Classifier

With this model, we can classify a sentence as positive or negative and get a confidence score for that decision. The pipeline function makes this remarkably easy: it takes only a few lines of code.


In [14]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("Today was a very productive day")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9998051524162292}]
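The log above warns against relying on the default model in production. One possible way to address it is to pin the checkpoint explicitly. This is a minimal sketch, assuming the model ID and revision printed in that log are the ones you want to keep; `build_classifier` is a hypothetical helper name, and the import is deferred so that defining the function does not download anything.

```python
# Model ID and revision copied from the log output above.
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "714eb0f"


def build_classifier():
    """Return a sentiment pipeline pinned to a fixed checkpoint (hypothetical helper)."""
    # Imported lazily: creating the pipeline downloads the model on first use.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_ID, revision=REVISION)


# Usage (downloads the model the first time it runs):
# classifier = build_classifier()
# classifier("Today was a very productive day")
```

Pinning both values means a later change to the pipeline default should not silently change the results.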

In [3]:
classifier(
    ["I've been waiting for you dear", "I hate you "]
)

[{'label': 'POSITIVE', 'score': 0.9990070462226868},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079}]
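When a list is passed in, the predictions come back in the same order as the inputs. The sketch below pairs each sentence with its prediction, reusing the values printed above so it runs without loading a model. The `signed_score` helper is a hypothetical convenience, not part of the library: it maps NEGATIVE predictions to negative numbers so both classes fit on one scale.

```python
# Inputs and predictions copied from the cell above.
texts = ["I've been waiting for you dear", "I hate you "]
predictions = [
    {"label": "POSITIVE", "score": 0.9990070462226868},
    {"label": "NEGATIVE", "score": 0.9991129040718079},
]


def signed_score(prediction):
    """Map a prediction to [-1, 1]: positive stays positive, negative flips sign."""
    if prediction["label"] == "POSITIVE":
        return prediction["score"]
    return -prediction["score"]


# Pair every input sentence with its label and signed score.
rows = [
    (text.strip(), pred["label"], signed_score(pred))
    for text, pred in zip(texts, predictions)
]

for text, label, score in rows:
    print(f"{label:<8} {score:+.4f}  {text}")
```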

## Zero-Shot Classification Model
This model classifies a sentence into categories it was not explicitly trained on; the candidate topics are supplied at inference time through candidate_labels. It returns the candidate labels together with a score for each, indicating how likely the sentence is to belong to that topic.

In [13]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "work", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'work', 'business'],
 'scores': [0.5655176043510437, 0.3595082461833954, 0.0749741941690445]}
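To make the result easier to consume, the sketch below maps each label to its score and picks the best one. It reuses the dictionary printed above, so no model is loaded. In this output the scores add up to roughly 1, which suggests they are normalized across the candidate labels; treat that as an observation about this run rather than a guarantee.

```python
# Result copied from the cell above.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "work", "business"],
    "scores": [0.5655176043510437, 0.3595082461833954, 0.0749741941690445],
}

# Map each candidate label to its score.
label_scores = dict(zip(result["labels"], result["scores"]))

# Pick the label with the highest score without relying on the list order.
top_label = max(label_scores, key=label_scores.get)

# Sum of all scores, to check whether they behave like a distribution.
total = sum(result["scores"])

print(top_label, round(label_scores[top_label], 3))
print("sum of scores:", round(total, 6))
```

If the labels are not mutually exclusive, the pipeline documents a `multi_label` argument that scores each label independently; check the documentation for the version you have installed.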

## Text Generator

This pipeline generates text by continuing a given prompt. The continuation is usually fluent, but it is not guaranteed to be factual or relevant, and it can change from one run to the next.

In [12]:
generator = pipeline("text-generation")
generator("In this course, we will teach you how to make money")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to make money in finance within the last year of your academic career. We will also explain how to become a stock trader, and how to learn trading skills via the NUTS Professional Trading Manual. We'}]
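The generated_text field above repeats the prompt before the new text. A small sketch of separating the two, using the output printed above; `continuation` is a hypothetical helper name.

```python
# Prompt and output copied from the cell above.
prompt = "In this course, we will teach you how to make money"
outputs = [
    {
        "generated_text": "In this course, we will teach you how to make money in finance within the last year of your academic career. We will also explain how to become a stock trader, and how to learn trading skills via the NUTS Professional Trading Manual. We"
    }
]


def continuation(prompt, generated_text):
    """Return only the text the model added after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Fall back to the full text if the prompt is not a prefix.
    return generated_text


new_text = continuation(prompt, outputs[0]["generated_text"])
print(new_text)
```

The text-generation pipeline also documents a `return_full_text` argument that can drop the prompt for you; the manual version is shown here only to make the output structure explicit.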

## Text Generator with a Specific Model

This example selects a specific checkpoint, distilgpt2, a distilled version of GPT-2, instead of relying on the pipeline default. Given an initial input, the model predicts the subsequent tokens to produce a continuation. The max_length argument caps the length of the generated text in tokens, prompt included, and num_return_sequences sets how many sequences are returned.

In [10]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, i will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, i will teach you how to write simple and functional languages and how to read, manage, share, run and interact with other programming'},
 {'generated_text': 'In this course, i will teach you how to take on more responsibilities than just one job. i will teach you how to do what you like,'}]
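Because the sequences are sampled, rerunning the cell usually gives different text. The sketch below shows one way to make runs more repeatable by fixing the random seed. The seed value, the `max_new_tokens` setting, and the `generate_variants` helper name are illustrative assumptions, and results may still differ across library versions or hardware.

```python
# Generation settings; the values are illustrative assumptions.
GENERATION_KWARGS = {
    "max_new_tokens": 20,       # counts only newly generated tokens, not the prompt
    "num_return_sequences": 2,  # same number of sequences as the cell above
    "do_sample": True,          # sampling is needed to get distinct sequences
}


def generate_variants(prompt, seed=42):
    """Generate continuations with a fixed seed (hypothetical helper)."""
    # Imported lazily: the pipeline downloads distilgpt2 on first use.
    from transformers import pipeline, set_seed

    set_seed(seed)
    generator = pipeline("text-generation", model="distilgpt2")
    return generator(prompt, **GENERATION_KWARGS)


# Usage (downloads the model the first time it runs):
# for item in generate_variants("In this course, i will teach you how to"):
#     print(item["generated_text"])
```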

## Fill-Mask Prediction with Pretrained Language Model
This pipeline uses a pretrained masked language model to predict the most likely words for the <mask> token in a sentence. The model analyzes the context around the mask and returns the top predictions along with their scores. The top_k parameter sets how many of the most probable completions are returned.

In [17]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> a car.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.3914192318916321,
  'token': 15487,
  'token_str': ' owning',
  'sequence': 'This course will teach you all about owning a car.'},
 {'score': 0.2691527009010315,
  'token': 1428,
  'token_str': ' driving',
  'sequence': 'This course will teach you all about driving a car.'}]
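Each prediction carries the filled-in token (token_str) and the completed sentence (sequence). The sketch below rebuilds the sentence from the template and the token, using the values printed above, to show how the fields relate. Note that token_str keeps a leading space in this output, and that the mask token depends on the model: it is `<mask>` here, while BERT-style checkpoints use `[MASK]`.

```python
# Template and predictions copied from the cell above.
template = "This course will teach you all about <mask> a car."
predictions = [
    {
        "score": 0.3914192318916321,
        "token": 15487,
        "token_str": " owning",
        "sequence": "This course will teach you all about owning a car.",
    },
    {
        "score": 0.2691527009010315,
        "token": 1428,
        "token_str": " driving",
        "sequence": "This course will teach you all about driving a car.",
    },
]

# Rebuild each sentence by substituting the predicted token for the mask.
rebuilt = [
    template.replace("<mask>", p["token_str"].strip()) for p in predictions
]

for p, sentence in zip(predictions, rebuilt):
    print(f"{p['score']:.1%}  {sentence}")

# Probability mass left for all completions outside the top 2.
remaining = 1.0 - sum(p["score"] for p in predictions)
```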

## Named Entity Recognition with Grouped Entities
This pipeline applies Named Entity Recognition (NER) to identify and classify entities such as people, organizations, and locations in a text. With grouped_entities=True, it merges the tokens that belong to the same entity into a single group, which makes the output easier to read. The example extracts a person (PER), an organization (ORG), and a location (LOC) from the input sentence.

In [20]:
ner = pipeline("ner", grouped_entities=True)
ner("My name is Luis and I work at UNI in Bogota.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'entity_group': 'PER',
  'score': 0.9986791,
  'word': 'Luis',
  'start': 11,
  'end': 15},
 {'entity_group': 'ORG',
  'score': 0.99847615,
  'word': 'UNI',
  'start': 30,
  'end': 33},
 {'entity_group': 'LOC',
  'score': 0.99914473,
  'word': 'Bogota',
  'start': 37,
  'end': 43}]
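The start and end fields are character offsets into the input string, so each entity can be recovered by slicing. A short sketch using the output above, with no model required:

```python
# Input sentence and entities copied from the cell above.
text = "My name is Luis and I work at UNI in Bogota."
entities = [
    {"entity_group": "PER", "score": 0.9986791, "word": "Luis", "start": 11, "end": 15},
    {"entity_group": "ORG", "score": 0.99847615, "word": "UNI", "start": 30, "end": 33},
    {"entity_group": "LOC", "score": 0.99914473, "word": "Bogota", "start": 37, "end": 43},
]

# Slice the original text with the reported offsets.
spans = [(e["entity_group"], text[e["start"]:e["end"]]) for e in entities]

for group, surface in spans:
    print(f"{group}: {surface}")
```

Recent Transformers releases document an `aggregation_strategy` argument (for example `"simple"`) as the replacement for grouped_entities; check the version you have installed before switching.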

## Question Answering with Pretrained Language Model
This pipeline employs a pretrained question-answering model to extract answers from a given context based on a specific question. By analyzing the input text, the model identifies the most relevant snippet that answers the question. In the example, it determines the workplace mentioned in the context based on the query.


In [21]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Luis and I work at UNI in Bogota.",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'score': 0.533600926399231, 'start': 30, 'end': 33, 'answer': 'UNI'}
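The answer is a span of the context, located by the start and end character offsets, and the score here is only about 0.53. The sketch below reads the span from the context and returns nothing when the score falls under a cutoff. The `answer_or_none` helper and the 0.5 default threshold are illustrative assumptions, not library behavior.

```python
# Context and result copied from the cell above.
context = "My name is Luis and I work at UNI in Bogota."
result = {"score": 0.533600926399231, "start": 30, "end": 33, "answer": "UNI"}


def answer_or_none(result, context, min_score=0.5):
    """Return the answer span, or None when the model is not confident enough."""
    if result["score"] < min_score:
        return None
    # The offsets index directly into the context string.
    return context[result["start"]:result["end"]]


print(answer_or_none(result, context))
print(answer_or_none(result, context, min_score=0.9))
```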

## Text Summarization with Pretrained Language Model

This pipeline leverages a pretrained summarization model to condense lengthy text into concise summaries while retaining the key information. Given an input passage, the model analyzes the content and generates a shorter version that captures the main points.

In [24]:
summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]
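The summary above contains stray spacing: a leading space, spaces before periods, and a run of spaces left over from the indented input. A small post-processing sketch, assuming plain whitespace cleanup is all that is wanted; `tidy_summary` is a hypothetical helper, not part of the library.

```python
import re

# Summary copied from the cell above.
raw_summary = " America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering ."


def tidy_summary(text):
    """Collapse repeated whitespace and remove spaces before punctuation."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


clean_summary = tidy_summary(raw_summary)
print(clean_summary)
```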

## Language Translation with Pretrained Model

This pipeline uses a pretrained translation model to convert text from one language to another. In this case, the model translates French input into English, leveraging the Helsinki-NLP/opus-mt-fr-en model. The example demonstrates translating a French sentence about a course created by Hugging Face into English.


In [25]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

Device set to use cpu


[{'translation_text': 'This course is produced by Hugging Face.'}]
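The model name above encodes the language pair (fr to en). The sketch below builds a model ID for another pair by following the same pattern. The pattern is inferred from this one model name, and not every language pair is necessarily published on the Hub, so both helpers are hypothetical conveniences.

```python
def opus_mt_model_id(source_lang, target_lang):
    """Build a model ID following the pattern of the model used above."""
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"


def build_translator(source_lang, target_lang):
    """Create a translation pipeline for a language pair (hypothetical helper)."""
    # Imported lazily: creating the pipeline downloads the model on first use.
    from transformers import pipeline

    return pipeline("translation", model=opus_mt_model_id(source_lang, target_lang))


print(opus_mt_model_id("fr", "en"))

# Usage (downloads the model the first time it runs):
# translator = build_translator("fr", "en")
# translator("Ce cours est produit par Hugging Face.")
```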