In [1]:
# !pip install transformers torch

# Basic imports for the whole lab
from transformers import pipeline
import pandas as pd

# Example text from the lab
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""


### Question 1 – Understanding Pipelines

1. A pipeline in Hugging Face is a high-level wrapper that handles the low-level steps for us. It loads the tokenizer and the model, converts raw text into tensors, runs the model, and post-processes the outputs into readable predictions. This lets us focus on the task instead of the engineering behind it.

2. Transformers offers many different pipeline tasks. For example: question answering, token classification for NER, summarization, translation, text generation, fill-mask, and others.

3. If we do not specify a model, the pipeline automatically loads a default one that is recommended for the task. If we want a specific model instead, we just pass its identifier, for example:
   classifier = pipeline("text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

### Question 2 – Text Classification Deep Dive

1. The default model used here is distilbert/distilbert-base-uncased-finetuned-sst-2-english.

2. This model is fine-tuned on the SST-2 dataset, which contains short movie review sentences labeled as positive or negative. Because of that, it works best on short English sentences where we want to detect sentiment in a simple way.

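3. The score is the probability the model assigns to the predicted label, obtained by applying softmax to the model's output logits. It is a number between 0 and 1, and values close to 1 mean the model strongly prefers that label. In our run, the complaint is labeled NEGATIVE with a score of about 0.90.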

4. A model that predicts emotions instead of only positive or negative is something like j-hartmann/emotion-english-distilroberta-base. These models usually output labels such as joy, anger, sadness, fear, and so on.
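
To make the softmax step from answer 3 concrete, here is a minimal sketch in plain Python. The logits are hypothetical values chosen for illustration (they are not taken from the real model), and the label order `[NEGATIVE, POSITIVE]` is an assumption.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one input, in the assumed order [NEGATIVE, POSITIVE].
labels = ["NEGATIVE", "POSITIVE"]
logits = [1.2, -1.0]

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])

# Same shape as one row of the pipeline output: a label and its score.
result = {"label": labels[best], "score": probs[best]}
print(result)
```

With these made-up logits the NEGATIVE score comes out at roughly 0.90. A softmax score only says how the model splits probability between its own labels, so it should not be read as a guarantee that the prediction is correct.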

In [2]:
# Simple sentiment classifier using the default model
classifier = pipeline("text-classification")

outputs = classifier(text)
pd.DataFrame(outputs)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


Unnamed: 0,label,score
0,NEGATIVE,0.901546


In [3]:
# Named Entity Recognition on the complaint text
ner_tagger = pipeline("ner", aggregation_strategy="simple")
ner_outputs = ner_tagger(text)
pd.DataFrame(ner_outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


Unnamed: 0,entity_group,score,word,start,end
0,ORG,0.879009,Amazon,5,11
1,MISC,0.990859,Optimus Prime,36,49
2,LOC,0.999755,Germany,90,97
3,MISC,0.556567,Mega,208,212
4,PER,0.590258,##tron,212,216
5,ORG,0.669693,Decept,253,259
6,MISC,0.498349,##icons,259,264
7,MISC,0.775361,Megatron,350,358
8,MISC,0.987854,Optimus Prime,367,380
9,PER,0.812096,Bumblebee,502,511


### Question 3 – Named Entity Recognition

1. NER tries to detect important pieces of information in text and assign them to categories. Typical categories are person, organization, location, or miscellaneous entities like product names.

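2. In our example, the model finds Amazon (ORG), Optimus Prime (MISC, twice), Germany (LOC), and Bumblebee (PER). Megatron is tagged as MISC at its second occurrence, but its first occurrence is split into Mega (MISC) and ##tron (PER). Decepticons is also split, into Decept (ORG) and ##icons (MISC). The split pieces have noticeably lower scores (about 0.50 to 0.67) than clean entities such as Germany (about 1.00).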

3. The entity_group column is the final entity label. The word column shows the actual text span that was detected. The score is the confidence of the model. The start and end positions are the character indexes of that entity in the original text. Sub-word pieces such as ##tron appear because the tokenizer splits rare words into smaller units. The aggregation strategy merges neighbouring pieces into one entity, but with "simple" it only groups pieces that received the same label, which is why Mega/##tron and Decept/##icons stay separate in the output above.
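
As an illustration of how the start and end offsets can be used, the sketch below merges the split pieces from the NER output above by hand. This is not what the pipeline does internally. The helper `merge_subwords` is hypothetical, and keeping the first piece's label and the lowest score are assumptions made for this example.

```python
# Rows copied from the NER output above (only the split entities).
pieces = [
    {"entity_group": "MISC", "score": 0.556567, "word": "Mega", "start": 208, "end": 212},
    {"entity_group": "PER", "score": 0.590258, "word": "##tron", "start": 212, "end": 216},
    {"entity_group": "ORG", "score": 0.669693, "word": "Decept", "start": 253, "end": 259},
    {"entity_group": "MISC", "score": 0.498349, "word": "##icons", "start": 259, "end": 264},
]

def merge_subwords(entities):
    """Glue '##' continuation pieces onto the entity directly before them."""
    merged = []
    for ent in entities:
        is_continuation = (
            bool(merged)
            and ent["word"].startswith("##")
            and ent["start"] == merged[-1]["end"]  # no gap between the pieces
        )
        if is_continuation:
            prev = merged[-1]
            prev["word"] += ent["word"][2:]  # drop the '##' marker
            prev["end"] = ent["end"]
            # Assumption: keep the first piece's label and the lowest score.
            prev["score"] = min(prev["score"], ent["score"])
        else:
            merged.append(dict(ent))  # copy so the input rows stay untouched
    return merged

for ent in merge_subwords(pieces):
    print(ent["entity_group"], ent["word"], ent["start"], ent["end"])
```

This yields one Megatron entity (characters 208 to 216) and one Decepticons entity (characters 253 to 264). The pipeline also accepts other `aggregation_strategy` values, such as "first", that resolve label conflicts inside a word.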

In [5]:
# Question Answering pipeline
qa_pipeline = pipeline("question-answering")

qa_example = qa_pipeline(
    question="What did the customer receive instead of Optimus Prime?",
    context=text
)
qa_example

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


{'score': 0.1923447698354721,
 'start': 335,
 'end': 358,
 'answer': 'an exchange of Megatron'}

### Question 4 – Question Answering Systems

1. A question answering system takes two inputs: a question and a context paragraph. The output contains the answer extracted from the context, along with a score and the start and end positions inside the text.

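2. In our example, the answer comes directly from the context. The model selects the most likely span that answers the question, instead of inventing a new sentence. Here it returns "an exchange of Megatron" (characters 335 to 358) with a low score of about 0.19, so the span is copied from the text but is not a precise answer to the question.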

3. This method works well when the answer is explicitly written somewhere in the passage. It becomes harder when the answer requires external knowledge, heavy reasoning, or when the context is very long.
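
The idea of "selecting the most likely span" can be sketched as follows. This is a simplified illustration, not the pipeline's actual implementation: the per-token scores are hypothetical, the function `best_span` is made up for this example, and a real model works on sub-word tokens and maps them back to character offsets.

```python
def best_span(start_scores, end_scores, max_answer_len=5):
    """Return (start, end, score) of the best span with start <= end."""
    best = (0, 0, float("-inf"))
    for i, start_score in enumerate(start_scores):
        # Only consider spans of at most max_answer_len tokens.
        last = min(i + max_answer_len, len(end_scores))
        for j in range(i, last):
            total = start_score + end_scores[j]
            if total > best[2]:
                best = (i, j, total)
    return best

tokens = ["I", "had", "been", "sent", "an", "action", "figure", "of", "Megatron", "instead"]

# Hypothetical scores: how likely each token is to start or end the answer.
start_scores = [0.1, 0.0, 0.1, 0.2, 2.5, 0.8, 0.3, 0.1, 1.0, 0.0]
end_scores = [0.0, 0.0, 0.1, 0.1, 0.2, 0.3, 1.1, 0.1, 2.8, 0.4]

i, j, score = best_span(start_scores, end_scores)
answer = " ".join(tokens[i:j + 1])
print(answer)
```

Because the answer is always a slice of the context, the system cannot return anything that is not written there, which matches the limitation described in answer 3.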

In [6]:
# Summarization pipeline
summarizer = pipeline("summarization")

summary = summarizer(text, max_length=60, min_length=20, do_sample=False)
summary

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'summary_text': ' Bumblebee ordered an Optimus Prime action figure from your online store in Germany . Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead .'}]

### Question 5 – Text Summarization

1. The goal of summarization is to compress the original message into a shorter version while keeping the main point. Here, we want a short summary of the complaint email.

2. The min_length and max_length parameters control how long the generated summary can be. They limit the number of tokens produced by the model.

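3. The generated summary keeps the main problem of the email (Megatron was sent instead of Optimus Prime) but skips other details. In this run it drops the customer's request for an exchange. It is also mostly copied from the first sentences of the email and mixes the third person ("Bumblebee ordered") with the first person ("I opened the package").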
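
To show how length limits like min_length and max_length can shape a generated summary, here is a toy greedy decoding loop in plain Python. It is a simplified sketch under stated assumptions: the "model" is a hypothetical function returning made-up scores, and the real library applies these limits inside its own generation code.

```python
EOS = "<eos>"  # hypothetical end-of-sequence token

def generate(next_token_scores, min_length, max_length):
    """Greedy decoding that respects a minimum and maximum token count."""
    output = []
    while len(output) < max_length:      # never exceed max_length tokens
        scores = dict(next_token_scores(output))
        if len(output) < min_length:
            scores.pop(EOS, None)        # too short: stopping is not allowed
        token = max(scores, key=scores.get)
        if token == EOS:
            break
        output.append(token)
    return output

def make_toy_model(eos_score):
    """Build a fake model that scores EOS against one candidate word."""
    words = ["wrong", "figure", "sent", "please", "exchange"]
    def model(prefix):
        return {EOS: eos_score, words[len(prefix) % len(words)]: 0.5}
    return model

eager = make_toy_model(eos_score=1.0)     # always wants to stop immediately
rambling = make_toy_model(eos_score=0.0)  # never wants to stop

short = generate(eager, min_length=3, max_length=10)
capped = generate(rambling, min_length=0, max_length=4)
print(short)
print(capped)
```

The eager model is forced to produce three tokens before it may stop, and the rambling model is cut off after four. Both limits count tokens, not words or characters, which is why a summary can end at a different place than the numbers might suggest.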

In [14]:
# Translation pipeline (English to French as an example)

# Use T5 as a generic text2text model
translator = pipeline(
    "text2text-generation",
    model="google-t5/t5-base"
)

# T5 expects a task prefix
prompt = "translate English to French: " + "Paul loves programming in Python since he knows Dallard King"

translation = translator(
    prompt,
    do_sample=False
)

translation




Device set to use mps:0


[{'generated_text': "Paul aime la programmation en Python puisqu'il connaît Dallard King"}]

### Question 6 – Machine Translation

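1. A translation pipeline rewrites the input text in another language while keeping the meaning as close as possible. The structure of the message generally remains the same. In the cell above we do not translate the whole email. We translate a single example sentence into French with T5, using the text2text-generation task and the "translate English to French: " prefix.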

2. The translation is not always literal. Modern models try to produce natural-sounding sentences in the target language, so some expressions may be slightly rephrased.

3. Proper names stay unchanged: in our example, Paul, Python, and Dallard King are copied as-is, and we would expect the same for Optimus Prime and Megatron. The tone can sometimes shift a little, especially in long or formal messages.

In [8]:
# Text generation pipeline
generator = pipeline("text-generation", model="gpt2")

prompt = "Dear Amazon, last week I ordered an Optimus Prime action figure but"
generated = generator(
    prompt,
    max_length=80,
    num_return_sequences=1,
    do_sample=True,
    top_k=50
)

generated

Device set to use mps:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=80) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'Dear Amazon, last week I ordered an Optimus Prime action figure but was told, by a friend, that it would not be available until late November. I was told the only way to get one was through Amazon, but luckily I had a friend who had already purchased one of the figures, and they had a lot of interesting information.\n\nThey said, "All you have to do is grab the figure of your choice from Amazon, and if you want it then come out and buy it. I\'ve been saying this to Amazon for a couple of years now, and they do want to release a new Optimus Prime figure every few months. I\'m sure you can find a few of them on Amazon, but most of you won\'t be able to find one from them. I got a few of them from a friend, but only recently did I get one from Amazon. This time I got it from Amazon with a $0.99 shipping charge. It\'s been quite a journey to get this one out there, but if you get it, you can thank Amazon for offering you a great deal on a great deal of Transformers toy

### Question 7 – Text Generation

1. Text generation is different from the other tasks because the model continues the prompt by creating new text. It is not classifying or extracting information; it is actually producing new sentences.

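2. max_length sets the maximum number of tokens for the full output, meaning the prompt plus the continuation. In the run above, however, the warning shows that a max_new_tokens value of 256 was also set and took precedence over max_length=80, which is why the generated text is much longer than 80 tokens. do_sample=True turns on random sampling instead of always picking the most likely token. top_k=50 means the model only samples from the 50 most likely tokens at each step. Together, these settings change how creative or controlled the output is.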

3. Simple text generation can sometimes drift off-topic, repeat itself, or produce strange or incorrect statements. There is no guarantee that the produced text is factual, so practical systems usually add constraints or filters.
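
The top_k setting from answer 2 can be illustrated with a small, self-contained sketch of top-k sampling. The logits are hypothetical values for a made-up four-word vocabulary, and the function `top_k_sample` is written for this example only. It is not the library's implementation.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample one token from the k highest-scoring candidates."""
    # Keep only the k most likely tokens.
    top = sorted(logits, key=logits.get, reverse=True)[:k]
    # Softmax over the kept tokens so their probabilities sum to 1.
    peak = max(logits[t] for t in top)
    weights = [math.exp(logits[t] - peak) for t in top]
    total = sum(weights)
    probs = [w / total for w in weights]
    choice = rng.choices(top, weights=probs, k=1)[0]
    return choice, dict(zip(top, probs))

# Hypothetical next-token logits after the prompt.
logits = {"but": 2.0, "and": 1.5, "which": 0.5, "banana": -3.0}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
token, probs = top_k_sample(logits, k=2, rng=rng)
print(token, probs)
```

With k=2 only "but" and "and" can ever be chosen, so unlikely tokens such as "banana" are excluded. Setting k=1 removes the randomness entirely and always picks the most likely token.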