<a href="https://colab.research.google.com/github/m3wzz/very_fake/blob/main/Lia_Introduction_Transformers_what_can_they_do.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers, what can they do?

This notebook was provided by Hugging Face as part of its [learning material](https://huggingface.co/learn) and adapted by Sarah Oberbichler for the DMGK NLP-Course at the University of Mainz.

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
# @markdown ### Let's install the requirements to run this notebook
!pip install datasets evaluate transformers[sentencepiece]

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6


# Transformer Pipelines

The most basic object in the Hugging Face Transformers library is the `pipeline()` function. It connects a model with its necessary preprocessing and postprocessing steps. There are three main steps involved when you pass some text to a pipeline:

*   The text is preprocessed into a format the model can understand.
*   The preprocessed inputs are passed to the model.
*   The predictions of the model are post-processed, so you can make sense of them.
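
The three steps can be mimicked with a toy example. This is only a sketch of the data flow, not how the library implements `pipeline()`: the vocabulary, the `toy_model` scoring rule, and all function names are hypothetical stand-ins.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer ships with the model.
VOCAB = {"<unk>": 0, "i": 1, "love": 2, "hate": 3, "this": 4}
LABELS = ["negative", "positive"]

def preprocess(text):
    """Step 1: turn raw text into token ids the model can consume."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]

def toy_model(token_ids):
    """Step 2: map token ids to one raw score (logit) per label.

    Stand-in for a neural network: it only counts two hand-picked tokens.
    """
    negative = float(token_ids.count(VOCAB["hate"]))
    positive = float(token_ids.count(VOCAB["love"]))
    return [negative, positive]

def postprocess(logits):
    """Step 3: convert logits to probabilities and pick a readable label."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}

def toy_pipeline(text):
    """Chain the three steps, as a pipeline object does for a real model."""
    return postprocess(toy_model(preprocess(text)))

result = toy_pipeline("I love this")
print(result)
```

The returned dictionary has the same `label`/`score` shape as the classification outputs shown later in this notebook.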





# 1. Text Classification Pipelines

### 1.1 Sentiment Analysis Pipeline

Sentiment analysis is an NLP technique that determines the emotional tone in text, classifying it as positive, negative, or neutral. It's used in various fields like customer feedback analysis, social media monitoring, and market research. Methods range from rule-based approaches using sentiment lexicons to machine learning models trained on labeled data. Sentiment analysis helps businesses and researchers understand attitudes in large volumes of text, though it faces challenges with sarcasm, context-dependent expressions, and domain-specific language. Advanced applications include aspect-based analysis and emotion detection. The accuracy of sentiment analysis systems is typically evaluated against human-labeled datasets.
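
The rule-based end of that range can be illustrated with a few lines of Python. The sketch below assumes a tiny hand-made lexicon (`LEXICON` and `lexicon_sentiment` are hypothetical names for illustration); real sentiment lexicons are far larger and usually weight words more finely.

```python
# Tiny hand-made lexicon (an assumption for illustration only).
LEXICON = {"love": 1, "great": 1, "good": 1, "hate": -1, "bad": -1, "tired": -1}

def lexicon_sentiment(text):
    """Sum word polarities and map the total to a coarse label."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    total = sum(LEXICON.get(w, 0) for w in words)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("I love my bed!"))
print(lexicon_sentiment("I hate this so much!"))
# Sarcasm defeats word counting: "great" and "bad" cancel out.
print(lexicon_sentiment("Oh great, another bad Monday."))
```

The last call shows why sarcasm and context are hard for such methods, which is one motivation for the learned models used in the cells below.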

In [None]:
# @markdown #### **Exercise 1**: Search for a recent English-language news article and copy one sentence or paragraph. Use it in place of "Today is a great day." Do you agree with the automated classification? Is there a model that we can use with German text? Find one model and adapt the code.
from transformers import pipeline

pipe = pipeline("text-classification", model="tabularisai/multilingual-sentiment-analysis")
pipe("Ich bin müde.")

Device set to use cuda:0


[{'label': 'Positive', 'score': 0.44753554463386536}]

In [None]:
pipe(
    ["Ich liebe mein Bett!", "I hate this so much!"]
)

[{'label': 'positive', 'score': 0.989069938659668},
 {'label': 'negative', 'score': 0.9600092768669128}]

### 1.2 Zero-Shot Classification

Zero-shot classification is a machine learning technique where a model can categorize or classify items into classes it hasn't been explicitly trained on. It leverages the model's understanding of language or concepts to make predictions about new, unseen categories.
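
One common way to achieve this, and the idea behind using a natural language inference (NLI) model such as `facebook/bart-large-mnli`, is to turn each candidate label into a hypothesis sentence and ask how strongly the input text entails it. The sketch below shows only the bookkeeping around that idea. The entailment logits are made-up numbers, and the template string is an assumption about a typical default; a real NLI model would score each (text, hypothesis) pair.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into a hypothesis sentence."""
    return [template.format(label) for label in labels]

def scores_from_entailment(entailment_logits):
    """Softmax over per-label entailment logits (single-label setting)."""
    highest = max(entailment_logits)
    exps = [math.exp(x - highest) for x in entailment_logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["technical", "non-technical"]
hypotheses = build_hypotheses(labels)

# Hypothetical entailment logits, one per hypothesis (not real model output).
entailment_logits = [2.0, 0.5]
scores = scores_from_entailment(entailment_logits)

ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
print(hypotheses)
print(ranked)
```

Because the scores are normalised across the candidate labels, they always sum to 1, even when the labels are meaningless.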

In [None]:
# @markdown #### **Exercise 2**: Change the candidate_labels to "technical" and "non-technical", then see what happens when we delete the part "in the humanities".
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
classifier(
    "This is a course about Natural Language Processing",
    candidate_labels=['huhu', 'hihi'],
)

Device set to use cuda:0


{'sequence': 'This is a course about Natural Language Processing',
 'labels': ['huhu', 'hihi'],
 'scores': [0.5298119187355042, 0.47018805146217346]}

# 2. Text Generation Pipeline

Text generation with transformers leverages pre-trained models and deep learning frameworks to produce human-like text. The process typically involves:


*   Next-token prediction: The basic mechanism where the model predicts the most likely next word or token.
*   Advanced generation techniques: Including Greedy Search, Beam Search, and Top-K sampling, which offer different trade-offs between text quality and diversity.
*   Prompt engineering: Carefully designed inputs can elicit specific types of responses, revealing the model's learned patterns and biases.
*   Model insights: Generated text can provide glimpses into the norms, biases, and knowledge encoded in the model's training data.
*   Temporal context: The outputs reflect not just the training data, but also the historical and cultural context of when the data was collected and the model trained.
*   Accessibility: With user-friendly interfaces like chatbots, text generation often serves as many people's first interaction with large language models.
*   Observable inference: Text generation allows users to witness the model's decision-making process in real-time, offering a tangible way to understand how these complex systems operate.
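
The difference between greedy search and Top-K sampling can be sketched on a single next-token step. The probability table below is invented for illustration (a real model produces a distribution over its whole vocabulary), and `greedy` and `top_k_sample` are hypothetical helper names. Beam search is omitted to keep the sketch short.

```python
import random

# Hypothetical next-token probabilities after a prompt (not real model output).
next_token_probs = {
    "powerful": 0.40,
    "large": 0.25,
    "neural": 0.20,
    "cars": 0.10,
    "bananas": 0.05,
}

def greedy(probs):
    """Greedy search step: always take the single most likely token."""
    return max(probs, key=probs.get)

def top_k_sample(probs, k, rng):
    """Top-K sampling step: keep the k most likely tokens, then sample
    among them in proportion to their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [p for _, p in top]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
greedy_choice = greedy(next_token_probs)
samples = [top_k_sample(next_token_probs, k=3, rng=rng) for _ in range(5)]
print(greedy_choice)
print(samples)
```

Greedy search returns the same token on every run, while Top-K sampling can vary between runs but never picks a token outside the `k` most likely ones.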




In [None]:
# @markdown #### **Exercise 3**: Try different phrases, did you get an answer that was satisfying?
from transformers import pipeline

generator = pipeline("text-generation")
generator("Transformer Models are")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Transformer Models are a huge step forward in this area. The new models are all fully automated, allowing the user to manually configure their design in real time.\n\nAnd finally, as promised, the next version of the Model 3 will be available in early 2018. The company has already been busy developing its next generation of cars and there will be a lot of updates to be made to it in the future. The company has already announced that a new driver will be available in the near future, and it's not looking far from that.\n\nIn the meantime, we're here with a special thank you to all of you for watching, and to all our fellow Model 3 enthusiasts. What are you waiting for?\n\nThanks for reading, and stay tuned for more updates as they come."}]

In [None]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


[{'score': 0.19619743525981903,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052695631980896,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

# 3. Text Extraction Pipeline

### 3.1 Named Entity Recognition

Named Entity Recognition (NER) is a natural language processing task that identifies and classifies named entities in text. It locates and categorizes proper nouns into predefined classes such as person names, organizations, locations, dates, monetary values, and product names. NER is widely used in various applications, including information extraction from unstructured text, improving search engine results, customer service automation, content recommendation, social media analysis, medical record processing, legal document analysis, machine translation enhancement, question answering systems, and automatic text summarization. By helping machines recognize and categorize important elements within text, NER enables more human-like understanding and processing of language, making it a crucial component in many language understanding and information extraction tasks across different domains.
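
NER models usually predict one tag per token, and a grouping step then merges neighbouring tokens into whole entities. The sketch below assumes the common B-/I-/O tagging convention and uses hand-assigned tags rather than model predictions; `group_entities` is a hypothetical helper, not a library function.

```python
def group_entities(tokens, tags):
    """Merge consecutive tokens that continue the same entity type."""
    entities = []
    current = None
    for token, tag in zip(tokens, tags):
        if tag == "O":  # token is outside any entity
            current = None
            continue
        prefix, entity_type = tag.split("-", 1)
        continues = (
            prefix == "I"
            and current is not None
            and current["entity_group"] == entity_type
        )
        if continues:
            current["word"] += " " + token
        else:
            current = {"entity_group": entity_type, "word": token}
            entities.append(current)
    return entities

tokens = ["Hans", "visited", "the", "University", "of", "Mainz"]
# Hand-assigned tags for illustration (an assumption, not model output).
tags = ["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG"]

grouped = group_entities(tokens, tags)
print(grouped)
```

The `entity_group` and `word` keys mirror the fields returned by the pipeline when entity grouping is enabled.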


In [None]:
from transformers import pipeline

# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple")
ner("My name is Hans. Yesterday, I visited the office of my girlfriend, who works at Sparkasse. Her office is not far from mine at the University, which is next to Sparkasse.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


[{'entity_group': 'PER',
  'score': np.float32(0.9989588),
  'word': 'Hans',
  'start': 11,
  'end': 15},
 {'entity_group': 'ORG',
  'score': np.float32(0.9812904),
  'word': 'Sparkasse',
  'start': 80,
  'end': 89},
 {'entity_group': 'ORG',
  'score': np.float32(0.518164),
  'word': 'University',
  'start': 130,
  'end': 140},
 {'entity_group': 'ORG',
  'score': np.float32(0.5309791),
  'word': 'Sparkasse',
  'start': 159,
  'end': 168}]
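
The `start` and `end` values in this output are character offsets into the input string, so each entity can be recovered by slicing. The sketch below copies the output above with rounded scores and keeps only the confident predictions. The threshold of 0.9 is an arbitrary choice for illustration.

```python
text = (
    "My name is Hans. Yesterday, I visited the office of my girlfriend, "
    "who works at Sparkasse. Her office is not far from mine at the "
    "University, which is next to Sparkasse."
)

# Copied from the pipeline output above, scores rounded to four decimals.
entities = [
    {"entity_group": "PER", "score": 0.9990, "start": 11, "end": 15},
    {"entity_group": "ORG", "score": 0.9813, "start": 80, "end": 89},
    {"entity_group": "ORG", "score": 0.5182, "start": 130, "end": 140},
    {"entity_group": "ORG", "score": 0.5310, "start": 159, "end": 168},
]

# Slice the original text with the character offsets.
spans = [text[e["start"]:e["end"]] for e in entities]

# Keep only predictions above an (arbitrary) confidence threshold.
threshold = 0.9
confident = [text[e["start"]:e["end"]] for e in entities if e["score"] >= threshold]

print(spans)
print(confident)
```

Note that the model is far less sure about "University" and the second mention of "Sparkasse" than about the first mention, even though the surface form is identical.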

### 3.2 Question-Answering

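Question answering with transformers is a natural language processing task that uses transformer-based models to automatically answer questions based on a given context or knowledge base. The `question-answering` pipeline used here is extractive: it returns a span of the provided context, together with its character offsets, rather than generating new text. Encoder models such as BERT or RoBERTa are typically fine-tuned for this extractive setting, while sequence-to-sequence models such as T5 can generate free-form answers.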
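
When the answer is extracted from the context, as with SQuAD-style BERT models, the model scores every token as a possible answer start and as a possible answer end, and the best valid span is returned. The sketch below is a simplified illustration of that selection step: the per-token scores are invented, and `best_span` with its `max_len` limit is a hypothetical helper rather than the library's own routine.

```python
def best_span(start_scores, end_scores, max_len=5):
    """Return (start, end) token indices with the highest combined score,
    considering only spans of at most max_len tokens where end >= start."""
    best, best_score = None, float("-inf")
    for i, start_score in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = start_score + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

tokens = ["Her", "office", "is", "not", "far", "from", "mine", "at", "the", "University"]

# Hypothetical per-token scores (not real model output).
start_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.2, 2.5, 1.0]
end_scores = [0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.4, 0.0, 0.1, 3.0]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

The `start` and `end` fields in a pipeline result play the same role, except that they are reported as character offsets into the context rather than token indices.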

In [None]:
# @markdown #### **Exercise 4**: Try a different model, for example the "bert-large-uncased-whole-word-masking-finetuned-squad" model. Does it change the answer?
from transformers import pipeline

question_answerer = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")
question_answerer(
    question="Where do I work?",
    context="My name is Hans. Yesterday, I visited the office of my girlfriend, who works at Sparkasse. Her office is not far from mine at the University, which is next to Sparkasse",
)

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


{'score': 0.22385284304618835,
 'start': 126,
 'end': 140,
 'answer': 'the University'}

# 4. Summarization and Translation

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

In [None]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")