<a href="https://colab.research.google.com/github/nitesis/CAS_AICP_M6_Exercises/blob/main/Session_1_NLP_tasks_with_Pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP tasks
In this tutorial, we try out several NLP tasks (sentiment analysis, text generation, named entity recognition, question answering, summarization, zero-shot classification, and translation) with [Hugging Face Pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines).

In [18]:
!pip install transformers




In [27]:
from transformers import pipeline

In [28]:
def display_task(task_name, input_text, result):
    """Print a task banner followed by the input text and the raw pipeline result."""
    print(f"\n{'='*10} {task_name} {'='*10}")
    print(f"Input: {input_text}\n")
    print(f"Result: {result}\n")

In [29]:

# 1. Sentiment Analysis
sentiment_analyzer = pipeline("sentiment-analysis")
sentiment_input = "I loved the movie"
sentiment_result = sentiment_analyzer(sentiment_input)
display_task("Sentiment Analysis", sentiment_input, sentiment_result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0



Input: I loved the movie

Result: [{'label': 'POSITIVE', 'score': 0.9998657703399658}]
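
The log above warns against relying on the default model. One way to address this is to pass the model name and revision explicitly. The sketch below is a minimal illustration: `PINNED_MODELS` and `build_pinned_pipeline` are hypothetical names introduced here, and the model name and revision are copied from the log line printed above.

```python
# Model name and revision copied from the "No model was supplied" log line above.
# PINNED_MODELS and build_pinned_pipeline are hypothetical helpers, not part of transformers.
PINNED_MODELS = {
    "sentiment-analysis": {
        "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
        "revision": "714eb0f",
    },
}


def build_pinned_pipeline(task):
    """Create a pipeline for `task` with an explicit model and revision."""
    # Imported lazily so the lookup table can be inspected without loading transformers.
    from transformers import pipeline

    spec = PINNED_MODELS[task]
    return pipeline(task, model=spec["model"], revision=spec["revision"])
```

Calling `build_pinned_pipeline("sentiment-analysis")` should then load the same checkpoint on every run, assuming that revision remains available on the Hub.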



In [32]:
sentiment_result[0].get('label')

'POSITIVE'
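
The result is a list with one dictionary per input, so the label and score can be read directly. As a small sketch, the hypothetical helper `confident_label` below returns the label only when the score clears a threshold; the threshold of 0.9 is an arbitrary assumption, not a recommended value.

```python
def confident_label(result, threshold=0.9):
    """Return the predicted label, or None if its score is below `threshold`."""
    top = result[0]  # one dict per input text; we passed a single string
    return top["label"] if top["score"] >= threshold else None


# Result copied from the sentiment analysis output above.
example_result = [{"label": "POSITIVE", "score": 0.9998657703399658}]
print(confident_label(example_result))
```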

In [30]:

# 2. Text Generation
generator = pipeline("text-generation", model="gpt2")
generation_input = "Once upon a time, in a faraway land"
generation_result = generator(generation_input, max_length=30, num_return_sequences=1)
display_task("Text Generation", generation_input, generation_result)


Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Input: Once upon a time, in a faraway land

Result: [{'generated_text': 'Once upon a time, in a faraway land, they gave birth to children that are the most precious of all. This was the gift of divine'}]
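
Note that `generated_text` contains the prompt followed by the continuation. If only the newly generated part is needed, it can be sliced off, as in this sketch (`continuation_only` is a hypothetical helper; the strings are copied from the output above).

```python
def continuation_only(prompt, generated_text):
    """Strip the prompt from the front of the generated text, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


# Prompt and result copied from the text generation output above.
example_prompt = "Once upon a time, in a faraway land"
example_generated = (
    "Once upon a time, in a faraway land, they gave birth to children that are "
    "the most precious of all. This was the gift of divine"
)
print(continuation_only(example_prompt, example_generated))
```

The text generation pipeline also accepts `return_full_text=False` at call time, which should have a similar effect without manual slicing.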



In [33]:

# 3. Named Entity Recognition (NER)
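# aggregation_strategy="simple" groups sub-word tokens into whole entities;
# it replaces the deprecated grouped_entities=True argument.
ner_pipeline = pipeline("ner", aggregation_strategy="simple")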
ner_input = "Barack Obama was the 44th President of the United States."
ner_result = ner_pipeline(ner_input)
display_task("Named Entity Recognition", ner_input, ner_result)


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0



Input: Barack Obama was the 44th President of the United States.

Result: [{'entity_group': 'PER', 'score': 0.99913895, 'word': 'Barack Obama', 'start': 0, 'end': 12}, {'entity_group': 'LOC', 'score': 0.9952142, 'word': 'United States', 'start': 43, 'end': 56}]
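
The `start` and `end` fields are character offsets into the input string, so each entity can be recovered by slicing. A minimal sketch, using the input and result shown above (`entity_spans` is a hypothetical helper name):

```python
def entity_spans(text, entities):
    """Return (entity_group, surface text) pairs by slicing `text` with the offsets."""
    return [(e["entity_group"], text[e["start"]:e["end"]]) for e in entities]


# Input and result copied from the NER output above.
example_text = "Barack Obama was the 44th President of the United States."
example_entities = [
    {"entity_group": "PER", "score": 0.99913895, "word": "Barack Obama", "start": 0, "end": 12},
    {"entity_group": "LOC", "score": 0.9952142, "word": "United States", "start": 43, "end": 56},
]
for group, span in entity_spans(example_text, example_entities):
    print(group, "->", span)
```

Slicing by offset is usually safer than relying on `word`, since `word` is rebuilt from tokens and may differ from the original text in casing or spacing for some models.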





In [34]:

# 4. Question Answering
qa_pipeline = pipeline("question-answering")
qa_context = "Transformers are state-of-the-art tools for NLP tasks developed by Hugging Face."
qa_question = "What are transformers?"
qa_result = qa_pipeline(question=qa_question, context=qa_context)
display_task("Question Answering", qa_question, qa_result)


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0



Input: What are transformers?

Result: {'score': 0.49432671070098877, 'start': 0, 'end': 39, 'answer': 'Transformers are state-of-the-art tools'}
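
The answer is an extracted span of the context (`context[start:end]`), and here the score is only about 0.49. The sketch below checks the span and falls back to a placeholder when the score is low; `answer_or_fallback` and the 0.5 cutoff are assumptions made for illustration.

```python
def answer_or_fallback(result, min_score=0.5, fallback="(no confident answer)"):
    """Return the extracted answer, or `fallback` if the score is below `min_score`."""
    return result["answer"] if result["score"] >= min_score else fallback


# Context and result copied from the question answering output above.
example_context = "Transformers are state-of-the-art tools for NLP tasks developed by Hugging Face."
example_qa = {
    "score": 0.49432671070098877,
    "start": 0,
    "end": 39,
    "answer": "Transformers are state-of-the-art tools",
}

# The answer is exactly the slice of the context given by the offsets.
print(example_context[example_qa["start"]:example_qa["end"]] == example_qa["answer"])
print(answer_or_fallback(example_qa))
```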



In [35]:

# 5. Summarization
summarizer = pipeline("summarization")
summarization_input = (
    '''Moby-Dick; or, The Whale is an 1851 epic novel by American writer Herman Melville. The book is centered on the sailor Ishmael's narrative of the maniacal quest of Ahab, captain of the whaling ship Pequod, for vengeance against Moby Dick, the giant white sperm whale that bit off his leg on the ship's previous voyage. A contribution to the literature of the American Renaissance, Moby-Dick was published to mixed reviews, was a commercial failure, and was out of print at the time of the author's death in 1891. Its reputation as a Great American Novel was established only in the 20th century, after the 1919 centennial of its author's birth. William Faulkner said he wished he had written the book himself,[1] and D. H. Lawrence called it "one of the strangest and most wonderful books in the world" and "the greatest book of the sea ever written".[2] Its opening sentence, "Call me Ishmael", is among world literature's most famous'''
)
summarization_result = summarizer(summarization_input, max_length=50, min_length=25, do_sample=False)
display_task("Summarization", summarization_input, summarization_result)


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0



Input: Moby-Dick; or, The Whale is an 1851 epic novel by American writer Herman Melville. The book is centered on the sailor Ishmael's narrative of the maniacal quest of Ahab, captain of the whaling ship Pequod, for vengeance against Moby Dick, the giant white sperm whale that bit off his leg on the ship's previous voyage. A contribution to the literature of the American Renaissance, Moby-Dick was published to mixed reviews, was a commercial failure, and was out of print at the time of the author's death in 1891. Its reputation as a Great American Novel was established only in the 20th century, after the 1919 centennial of its author's birth. William Faulkner said he wished he had written the book himself,[1] and D. H. Lawrence called it "one of the strangest and most wonderful books in the world" and "the greatest book of the sea ever written".[2] Its opening sentence, "Call me Ishmael", is among world literature's most famous

Result: [{'summary_text': " Herman Melville's 1851 epic 

In [36]:

# 6. Zero-shot Classification
zero_shot_classifier = pipeline("zero-shot-classification")
zero_shot_input = "I enjoy coding in Python and working on NLP projects."
zero_shot_labels = ["technology", "sports", "cooking"]
zero_shot_result = zero_shot_classifier(zero_shot_input, candidate_labels=zero_shot_labels)
display_task("Zero-shot Classification", zero_shot_input, zero_shot_result)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0



Input: I enjoy coding in Python and working on NLP projects.

Result: {'sequence': 'I enjoy coding in Python and working on NLP projects.', 'labels': ['technology', 'sports', 'cooking'], 'scores': [0.9798336625099182, 0.012401980347931385, 0.007764369249343872]}
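
The `labels` and `scores` lists are aligned by position, so they can be zipped into pairs. A short sketch with the result shown above (`rank_labels` is a hypothetical helper):

```python
def rank_labels(result):
    """Pair each candidate label with its score, highest score first."""
    pairs = zip(result["labels"], result["scores"])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Result copied from the zero-shot classification output above.
example_zero_shot = {
    "sequence": "I enjoy coding in Python and working on NLP projects.",
    "labels": ["technology", "sports", "cooking"],
    "scores": [0.9798336625099182, 0.012401980347931385, 0.007764369249343872],
}
for label, score in rank_labels(example_zero_shot):
    print(f"{label}: {score:.3f}")
```

In this single-label setting the scores are normalized across the candidate labels, which is why they sum to approximately 1.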



# Todo
Create the pipeline for a translation task from "English" to "German"

In [39]:

# 7. Translation
# Code snippet found by manually searching for a pipeline
translator = pipeline("translation_en_to_de")
# Alternative generated by Gemini, with an explicit model:
# translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
translation_input = "Hugging Face makes it easy to use NLP models."
translation_result = translator(translation_input)
display_task("Translation (English to German)", translation_input, translation_result)


No model was supplied, defaulted to google-t5/t5-base and revision a9723ea (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0



Input: Hugging Face makes it easy to use NLP models.

Result: [{'translation_text': 'Hugging Face macht es einfach, NLP-Modelle zu verwenden.'}]

