## Check if PyTorch with CUDA is installed

In [2]:
import torch
torch.cuda.is_available()

True

In [3]:
torch.cuda.get_device_name(0)

'NVIDIA RTX A2000 Laptop GPU'
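
The result of this check can be passed on when creating a pipeline. A minimal sketch, assuming the first GPU should be used when CUDA is available and the CPU otherwise. `pipeline_device` is a hypothetical helper; the integer convention (`0` for the first GPU, `-1` for the CPU) is the one accepted by the `device` argument of `pipeline`.

```python
def pipeline_device(cuda_available: bool) -> int:
    """Map a CUDA availability flag to a `device` index for transformers.pipeline."""
    return 0 if cuda_available else -1


try:
    import torch

    cuda_ok = torch.cuda.is_available()
except ImportError:
    # Assumption: without PyTorch installed we simply fall back to the CPU index.
    cuda_ok = False

device = pipeline_device(cuda_ok)
print(device)
```

The value could then be used as `pipeline("sentiment-analysis", device=device)`.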

# Lesson 01

Tasks supported by pipelines:
- feature-extraction (get the vector representation of a text)
- fill-mask
- ner (named entity recognition)
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification

Doesn't seem to support
- asking questions of text / extract insights

In [3]:
from transformers import pipeline

# create a sentiment analysis pipeline
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598049521446228}]
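
The warning above recommends naming the model and revision explicitly. A minimal sketch of how that could look, reusing the model name and revision printed in the warning; `PINNED_SENTIMENT` and `build_pinned_classifier` are hypothetical names introduced here for illustration.

```python
# Model name and revision copied from the warning printed by the default pipeline.
PINNED_SENTIMENT = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_pinned_classifier():
    """Create the pipeline with an explicit model and revision (downloads on first use)."""
    # Imported lazily so the configuration can be inspected without transformers installed.
    from transformers import pipeline

    return pipeline(**PINNED_SENTIMENT)
```

Calling `build_pinned_classifier()` should behave like the default pipeline above, but without depending on whatever default the library picks in the future.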

In [11]:
# Curiously, it also gets the Portuguese sentences right
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!", "Isto é absolutamente adorável", "Isto é absolutamente horrível"]
)

[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455},
 {'label': 'POSITIVE', 'score': 0.7158698439598083},
 {'label': 'NEGATIVE', 'score': 0.967018723487854}]
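
Each `score` is the probability of the winning label. For a two-class classifier like this one, it comes from a softmax over the model's two output logits. The sketch below illustrates that step in plain Python; the logits are made up and not taken from the model.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NEGATIVE", "POSITIVE"]
logits = [-1.2, 2.0]  # hypothetical logits, for illustration only
probs = softmax(logits)

# Pick the label with the highest probability, as the pipeline output does.
best = max(range(len(labels)), key=probs.__getitem__)
result = {"label": labels[best], "score": probs[best]}
print(result)
```

A lower score, like the one for the first Portuguese sentence, simply means the two logits were closer together.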

In [12]:
classifier.model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [23]:
# different task - zero shot classification
# facebook/bart-large-mnli doesn't seem to do very well with this example
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about Covid-19 and life and death of humans under disease and pain",
    candidate_labels=["business", "health", "sports", "politics", "tech"]
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about Covid-19 and life and death of humans under disease and pain',
 'labels': ['tech', 'health', 'business', 'politics', 'sports'],
 'scores': [0.5046038031578064,
  0.28641557693481445,
  0.10287082940340042,
  0.060360126197338104,
  0.04574966803193092]}

In [24]:
# trying a different model
classifier = pipeline("zero-shot-classification", model="microsoft/deberta-xlarge-mnli")
classifier(
    "This is a course about Covid-19 and life and death of humans under disease and pain",
    candidate_labels=["business", "health", "sports", "politics", "tech"]
)

{'sequence': 'This is a course about Covid-19 and life and death of humans under disease and pain',
 'labels': ['health', 'tech', 'business', 'politics', 'sports'],
 'scores': [0.44734638929367065,
  0.22406303882598877,
  0.12684054672718048,
  0.1135772317647934,
  0.0881727784872055]}
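
Both models used here are MNLI models, which is what makes zero-shot classification work: each candidate label is turned into a hypothesis (the default `hypothesis_template` is `"This example is {}."`), the model scores how strongly the text entails it, and in the default single-label mode the entailment scores are normalised across labels with a softmax. A plain-Python sketch of that last step, with made-up entailment logits; `rank_labels` is a hypothetical helper.

```python
import math


def rank_labels(entailment_logits: dict) -> dict:
    """Softmax the per-label entailment logits and sort labels by score."""
    labels = list(entailment_logits)
    m = max(entailment_logits.values())
    exps = [math.exp(entailment_logits[label] - m) for label in labels]
    total = sum(exps)
    scored = sorted(
        zip(labels, (e / total for e in exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {
        "labels": [label for label, _ in scored],
        "scores": [score for _, score in scored],
    }


template = "This example is {}."
candidate_labels = ["business", "health", "sports", "politics", "tech"]
hypotheses = [template.format(label) for label in candidate_labels]

# Hypothetical entailment logits, one per hypothesis (not real model output).
logits = {"business": 0.1, "health": 1.4, "sports": -0.3, "politics": -0.4, "tech": 0.7}
ranking = rank_labels(logits)
print(hypotheses[1])
print(ranking["labels"])
```

This also explains why the scores always sum to 1, even when none of the labels fits the text well.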

In [37]:
classifier(
    "o meu chapéu tem três bicos",
    candidate_labels=["business", "health", "sports", "politics", "tech", "french", "portuguese"]
)

{'sequence': 'o meu chapéu tem três bicos',
 'labels': ['portuguese',
  'business',
  'tech',
  'french',
  'sports',
  'health',
  'politics'],
 'scores': [0.45083752274513245,
  0.14250801503658295,
  0.11888513714075089,
  0.08598063141107559,
  0.08276902139186859,
  0.0782376080751419,
  0.04078204929828644]}

In [46]:
# the output changes on each generation and is generally not good
generator = pipeline("text-generation") # , model="distilgpt2")
generator("The main reasons for Climate Change are")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The main reasons for Climate Change are complex and unpredictable. There is no simple solution.\n\nOur Climate Change problem may be a mystery to many, but we know what it is and we know what we are going to do about it. As long'}]

In [54]:
# the results suck both for the default gpt2 and distilgpt2
generator = pipeline("text-generation") #, model="distilgpt2")
generator(
    "The best book ever written was",
    max_length=80,
    num_return_sequences=2,
    temperature=0.7)

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "The best book ever written was written in the early 1950s by a man who, like all good novels, was a man who was a man.\n\nI think I could have written much better if I'd read a good book(s). I don't find them all that interesting. And I think all of the great books are those that I think are great books. My favorite is The"},
 {'generated_text': 'The best book ever written was about this guy. He\'s like a brother to me. I\'ve been reading this book for more than five years now, and I never thought that I would end up reading it. I thought, "Wow, somebody\'s reading this book right now, and I\'m going to be able to say, \'Wow, I\'m reading this book!\'"\n\nI\'ve'}]
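
The outputs differ between runs because the next token is sampled from a probability distribution, and `temperature` reshapes that distribution before sampling. A minimal plain-Python sketch of the idea, using made-up logits; `temperature_probs` and `sample_token` are hypothetical helpers, not the library's implementation.

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index according to the temperature-scaled probabilities."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical next-token logits
sharp = temperature_probs(logits, 0.7)  # lower temperature: more peaked
flat = temperature_probs(logits, 1.0)   # temperature 1.0: unchanged softmax

# The same seed reproduces the same sequence of sampled tokens.
rng_a = random.Random(42)
run_a = [sample_token(logits, 0.7, rng_a) for _ in range(5)]
rng_b = random.Random(42)
run_b = [sample_token(logits, 0.7, rng_b) for _ in range(5)]
print(sharp[0], flat[0], run_a == run_b)
```

For reproducible pipeline runs, `transformers` provides `set_seed`, which plays the same role as the fixed seed in this sketch.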

In [61]:
# only does one mask at a time
unmasker = pipeline("fill-mask") 
unmasker("This <mask> will teach you all about how women <mask>.", top_k=2)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[[{'score': 0.18847587704658508,
   'token': 1566,
   'token_str': ' article',
   'sequence': '<s>This article will teach you all about how women<mask>.</s>'},
  {'score': 0.1335030496120453,
   'token': 1040,
   'token_str': ' book',
   'sequence': '<s>This book will teach you all about how women<mask>.</s>'}],
 [{'score': 0.16272902488708496,
   'token': 173,
   'token_str': ' work',
   'sequence': '<s>This<mask> will teach you all about how women work.</s>'},
  {'score': 0.157148540019989,
   'token': 18871,
   'token_str': ' behave',
   'sequence': '<s>This<mask> will teach you all about how women behave.</s>'}]]
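
With two masks, the output above scores each mask independently and leaves the other one in place. One way to get a fully filled sentence is to fill the masks greedily from left to right, feeding each partial result back in. A sketch under that assumption; `fill_masks_greedily` is a hypothetical helper and `toy_predictor` is a stand-in for a real model call, hard-coded so the example runs offline.

```python
def fill_masks_greedily(text, predict_first_mask, mask_token="<mask>"):
    """Replace masks one at a time, left to right, using the predictor's top word."""
    while mask_token in text:
        word = predict_first_mask(text)
        text = text.replace(mask_token, word, 1)  # only the first remaining mask
    return text


def toy_predictor(text):
    # Hypothetical stand-in for a model: returns a fixed word per position.
    return "article" if text.startswith("This <mask>") else "work"


filled = fill_masks_greedily(
    "This <mask> will teach you all about how women <mask>.",
    toy_predictor,
)
print(filled)
```

With a real model, the predictor would wrap the `unmasker` call and return the top `token_str` for the first mask.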

In [67]:
# decent results
ner = pipeline("ner", grouped_entities=True)
ner("My friends call me Josefina Varnafé and I work in Setubalém at Blughab Grunhit.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'entity_group': 'PER',
  'score': 0.90422666,
  'word': 'Josefina Varnafé',
  'start': 19,
  'end': 35},
 {'entity_group': 'LOC',
  'score': 0.7157374,
  'word': 'Setubalém',
  'start': 50,
  'end': 59},
 {'entity_group': 'ORG',
  'score': 0.88490963,
  'word': 'Blughab Grunhit',
  'start': 63,
  'end': 78}]
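
The `start` and `end` values are character offsets into the input string, so slicing the text with them recovers each entity. A small sketch using the entities from the output above (scores omitted):

```python
text = "My friends call me Josefina Varnafé and I work in Setubalém at Blughab Grunhit."

# Entities copied from the pipeline output, without the scores.
entities = [
    {"entity_group": "PER", "word": "Josefina Varnafé", "start": 19, "end": 35},
    {"entity_group": "LOC", "word": "Setubalém", "start": 50, "end": 59},
    {"entity_group": "ORG", "word": "Blughab Grunhit", "start": 63, "end": 78},
]

# Slice the original text with the offsets to recover each entity span.
spans = [(e["entity_group"], text[e["start"]:e["end"]]) for e in entities]
print(spans)
```

Slicing the original text is usually safer than relying on `word`, since the offsets always refer back to the exact input.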

In [70]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="What is my first name?", #Where do I work?
    context="My friends call me Josefina Varnafé and I work in Setubalém at Blughab Grunhit.",
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.982245147228241,
 'start': 19,
 'end': 35,
 'answer': 'Josefina Varnafé'}

In [71]:
summarizer = pipeline("summarization") # most popular is facebook/bart-large-cnn
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]
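
Note the run of spaces in `civil,    electrical` in the summary: the indentation of the triple-quoted input leaked into the output. A minimal sketch of collapsing whitespace before calling the summarizer; `normalize_whitespace` is a hypothetical helper, and it also removes paragraph breaks, which is an assumption about what is acceptable here.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse newlines, indentation and repeated spaces into single spaces."""
    return " ".join(text.split())


sample = """
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering
"""
cleaned = normalize_whitespace(sample)
print(cleaned)
```

The cleaned string can then be passed to `summarizer` in place of the raw triple-quoted text.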

In [72]:
summarizer(
    """
    Blindness is a great novel by Portuguese writer José Saramago that deals with human’s individual and collective reactions when in the face of adversarial forces. With gorgeous prose, this thought-provoking book shows us how our world, ever so concerned and consumed by appearances, would deal with the loss of our most relied upon sense: vision. When it’s every man by himself, when every man is free to do whatever he wants without the impending fear of recognition and judgement, we start to feel – I was going to say see – what the man’s true nature is and the crumbling down of a civilization diseased with selfishness, intolerance and ambition, to name just few symptoms.
    In Blindness by Jose Saramago, authortells us the story of a mysterious mass plague of blindness that affects nearly everyone living in an unnamed place in a never specified time and the implications this epidemic has on people’s lives. It all starts inexplicably when a man in his car suddenly starts seeing – or rather stops seeing anything but – a clear white brightness. He’s blind. Depending upon a stranger’s kindness to be able to go home in safety, we witness what appears to be the first sign of corruption and the first crack in society’s impending breakdown when the infamous volunteer steals the blind man’s car. Unfortunately for him, the white pest follows him and turns him into one of its victims as well.
    Spreading fast, this collective blindness is now frightening the authorities and must be dealt with: a large group of blind people and possibly infected ones – those who had any contact with the first group – have now been put in quarantine until second order. Living conditions start to degrade as the isolated population grows bigger, there is no organization, basic medicine is a luxury not allowed in and hygiene is nowhere to be found. To complicate things further, an armed clique acquires control and power, forcing the subjugated to pay for food in any way they can. The scenes that follow are extremely unpleasant to read, but at the same time they’re so realistic that you can’t be mad at Saramago for writing such severe events packed with violence that include rapes and murders.
"""
)

[{'summary_text': " In Blindness by Jose Saramago, a Portuguese writer deals with a mysterious mass plague of blindness that affects nearly everyone living in an unnamed place in a never specified time . With gorgeous prose, this thought-provoking book shows us how our world, ever so concerned and consumed by appearances, would deal with the loss of our most relied upon sense: vision . The scenes that follow are extremely unpleasant to read, but at the same time they're so realistic that you can’t be mad ."}]

In [74]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]

In [7]:
# the model supports 4 languages (en/de/fr/ro) and can also be used for tasks such as summarization
# note how the input and output languages are specified in the task name
from codecarbon import EmissionsTracker
from transformers import pipeline

tracker = EmissionsTracker()
tracker.start()
translator = pipeline("translation_en_to_fr", model="t5-base")
print(translator("Sometimes I look like a yellow curved banana filled with honey.", max_length=40))

# tracker.stop() returns the estimated emissions in kg of CO2-equivalent
emissions: float = tracker.stop()
print(f"Emissions: {emissions*1000} g CO2")

# good read on t5 and transformers - https://github.com/christianversloot/machine-learning-articles/blob/main/easy-machine-translation-with-machine-learning-and-huggingface-transformers.md
# and the full list of articles - https://github.com/christianversloot/machine-learning-articles

[codecarbon INFO @ 21:14:42] [setup] RAM Tracking...
[codecarbon INFO @ 21:14:42] [setup] GPU Tracking...
[codecarbon INFO @ 21:14:42] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 21:14:42] [setup] CPU Tracking...
[codecarbon INFO @ 21:14:44] CPU Model on constant consumption mode: 11th Gen Intel(R) Core(TM) i7-11370H @ 3.30GHz
[codecarbon INFO @ 21:14:44] >>> Tracker's metadata:
[codecarbon INFO @ 21:14:44]   Platform system: Windows-10-10.0.22621-SP0
[codecarbon INFO @ 21:14:44]   Python version: 3.9.16
[codecarbon INFO @ 21:14:44]   Available RAM : 31.838 GB
[codecarbon INFO @ 21:14:44]   CPU count: 8
[codecarbon INFO @ 21:14:44]   CPU model: 11th Gen Intel(R) Core(TM) i7-11370H @ 3.30GHz
[codecarbon INFO @ 21:14:44]   GPU count: 1
[codecarbon INFO @ 21:14:44]   GPU model: 1 x NVIDIA RTX A2000 Laptop GPU
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automat

[{'translation_text': 'Parfois, je ressemble à une banane courbée jaune remplie de miel.'}]
Emissions: 0.022567390403866433 g
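
The value returned by `tracker.stop()` is in kilograms of CO2-equivalent, which is why the cell multiplies by 1000 to report grams. A small sketch of a tidier report; `format_emissions` is a hypothetical helper, and the `None` branch assumes that `stop()` may return nothing when tracking did not run.

```python
from typing import Optional


def format_emissions(emissions_kg: Optional[float]) -> str:
    """Format a kg CO2eq value as grams, rounded to four decimals."""
    if emissions_kg is None:
        # Assumption: no measurement was produced.
        return "Emissions: unavailable"
    return f"Emissions: {emissions_kg * 1000:.4f} g CO2eq"


# Value in kg implied by the output above (grams divided by 1000).
report = format_emissions(2.2567390403866433e-05)
print(report)
```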


# Bias and Limitations

These models are often trained on biased text, and it shows in their outputs.

«When asked to fill in the missing word in these two sentences, the model gives only one gender-free answer (waiter/waitress). The others are work occupations usually associated with one specific gender — and yes, **prostitute** ended up in the top 5 possibilities the model associates with “woman” and “work.” *This happens **even though** BERT is one of the rare Transformer models not built by scraping data from all over the internet, but rather using apparently neutral data (it’s trained on the English Wikipedia and BookCorpus datasets)*.

«You therefore need to keep in the back of your mind that the original model you are using could very easily generate sexist, racist, or homophobic content. **Fine-tuning the model on your data won’t make this intrinsic bias disappear**.»

In [9]:
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
['nurse', 'maid', 'teacher', 'waitress', 'prostitute']
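
A quick way to see how far apart the two lists are is to check which predictions they share. A small sketch using the two outputs above:

```python
# Top-5 predictions copied from the outputs above.
man = ['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
woman = ['nurse', 'maid', 'teacher', 'waitress', 'prostitute']

# Occupations predicted for both prompts.
shared = set(man) & set(woman)
print(shared)
```

In this run the two lists have no occupation in common.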
