<a href="https://colab.research.google.com/github/manuelrucci7/deep-learning-course/blob/main/colab/LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
#!pip install transformers  # light version (no PyTorch)
!pip install transformers[sentencepiece]

In [None]:
import transformers

## Transformers Pipeline

In [None]:
# Course reference: https://huggingface.co/learn/nlp-course/chapter1/1
# Common NLP tasks:
# - Classifying whole sentences
# - Classifying each word in a sentence
# - Generating text content
# - Extracting an answer from a text
# - Generating a new sentence from an input text
# NLP isn’t limited to written text: it also tackles complex challenges in speech recognition and computer vision.

# How do we represent text?
# Transformers library: https://github.com/huggingface/transformers
# Model Hub: https://huggingface.co/models

In [None]:
from transformers import pipeline

# A pipeline bundles the preprocessing and postprocessing steps, so we pass in raw text and get back a readable answer.
# By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English.

# Three steps run when text is passed to a pipeline:
# 1. The text is preprocessed into a format the model can understand.
# 2. The preprocessed inputs are passed to the model.
# 3. The predictions of the model are post-processed, so you can make sense of them.

# https://huggingface.co/docs/transformers/main_classes/pipelines

# Some of the available pipeline tasks:
# feature-extraction (get the vector representation of a text)
# fill-mask
# ner (named entity recognition)
# question-answering
# sentiment-analysis
# summarization
# text-generation
# translation
# zero-shot-classification

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598049521446228}]
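
The three pipeline steps described in the cell above (preprocess, model, post-process) can be sketched without any library. This is only a toy illustration: the vocabulary, the weights, and the helper names (`preprocess`, `model`, `postprocess`, `toy_pipeline`) are made up for the example, whereas the real pipeline uses a trained tokenizer and a neural network.

```python
import math

# Hypothetical toy vocabulary and weights (assumptions for illustration only).
VOCAB = {"i": 0, "love": 1, "hate": 2, "this": 3, "course": 4}
# One (negative, positive) weight pair per vocabulary entry.
WEIGHTS = [(0.0, 0.0), (-2.0, 2.0), (2.0, -2.0), (0.0, 0.0), (0.0, 0.5)]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into token ids the model can consume."""
    words = text.lower().replace(".", "").split()
    return [VOCAB[w] for w in words if w in VOCAB]


def model(token_ids):
    """Step 2: map token ids to one raw score (logit) per label."""
    neg = sum(WEIGHTS[i][0] for i in token_ids)
    pos = sum(WEIGHTS[i][1] for i in token_ids)
    return [neg, pos]


def postprocess(logits):
    """Step 3: softmax the logits and report the most likely label."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, mirroring what a pipeline call does."""
    return postprocess(model(preprocess(text)))


result = toy_pipeline("I love this course.")
print(result)
```

The returned dictionary has the same `label`/`score` shape as the sentiment-analysis output above, which is why the post-processing step is what makes the raw model scores readable.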

In [None]:
# Zero-shot classification
# Useful when we need to classify texts that haven’t been labelled.
# It lets you specify which labels to use for the classification, so you don’t have to rely on the labels of the pretrained model.

from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "sport", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'sport', 'politics'],
 'scores': [0.7840248942375183,
  0.10394306480884552,
  0.07171984016895294,
  0.04031217098236084]}
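
One way to picture how arbitrary labels can be scored: each candidate label is turned into a hypothesis sentence, a natural-language-inference model scores how well the input entails it, and the scores are normalized across labels. The sketch below assumes that setup; the entailment logits are invented numbers, and `rank_labels` is a hypothetical helper, not a library function.

```python
import math


def rank_labels(entailment_logits):
    """Softmax per-label entailment logits and sort labels by score."""
    labels = list(entailment_logits)
    top = max(entailment_logits.values())
    exps = [math.exp(entailment_logits[label] - top) for label in labels]
    total = sum(exps)
    scored = sorted(
        zip(labels, (e / total for e in exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {
        "labels": [label for label, _ in scored],
        "scores": [score for _, score in scored],
    }


candidate_labels = ["education", "politics", "sport", "business"]

# Each label becomes a hypothesis that an NLI model would compare with the text.
hypotheses = [f"This example is {label}." for label in candidate_labels]

# Made-up entailment logits standing in for the model's output (assumption).
logits = {"education": 2.0, "politics": -1.0, "sport": -0.4, "business": 0.0}

ranked = rank_labels(logits)
print(hypotheses[0])
print(ranked)
```

As in the real output above, the labels come back sorted by score and the scores sum to 1.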

In [None]:
# Text generation
# Use a pipeline to generate some text: we provide a prompt and the model auto-completes it by generating the remaining text.
# Text generation involves randomness, so your results may differ from the output shown below.

from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': 'In this course, we will teach you how to design and use the Arduino libraries that you will take in this tutorial.\n\nDownload the following source code (or any other software that you can use with the Arduino IDE) from here:\n\n'}]
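
The randomness mentioned above comes from sampling the next token from a probability distribution instead of always taking the most likely one. A minimal sketch, assuming a made-up four-word distribution that ignores context (a real model recomputes the distribution after every token):

```python
import random

# Hypothetical next-word probabilities (assumption for illustration only).
NEXT_WORD_PROBS = {"code": 0.4, "build": 0.3, "read": 0.2, "cook": 0.1}


def sample_continuation(seed, length=5):
    """Draw `length` words at random, weighted by their probability."""
    rng = random.Random(seed)
    words = list(NEXT_WORD_PROBS)
    weights = list(NEXT_WORD_PROBS.values())
    return rng.choices(words, weights=weights, k=length)


def greedy_word():
    """Always pick the single most likely word (no randomness)."""
    return max(NEXT_WORD_PROBS, key=NEXT_WORD_PROBS.get)


first = sample_continuation(seed=0)
again = sample_continuation(seed=0)  # same seed, same draws
other = sample_continuation(seed=1)  # different seed, draws may differ

print(first)
print(again)
print(other)
print(greedy_word())
```

Fixing the seed makes a sampled run repeatable. In the notebook, calling `transformers.set_seed(...)` before the generator should have a similar effect, assuming the same model and library versions.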

In [None]:
from transformers import pipeline

# https://huggingface.co/models
# https://huggingface.co/distilbert/distilgpt2
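# Use the num_return_sequences and max_length arguments to control how many sequences are generated and how long they can be.
# max_length is counted in tokens (prompt included), not words: here we ask for two sequences of at most 30 tokens each.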
#generator = pipeline("text-generation", model="NousResearch/Hermes-3-Llama-3.2-3B")
#generator = pipeline("text-generation", model="meta-llama/Llama-3.2-3B")
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': 'In this course, we will teach you how to get up on top of the day. There is many methods for creating beautiful things in Java.\n'},
 {'generated_text': 'In this course, we will teach you how to read, write and use language and grammar through a book and learn how to use language together. If'}]

In [None]:
# Mask filling
from transformers import pipeline

# The top_k argument controls how many candidates are displayed.
# The model fills in the special <mask> word, which is often referred to as a mask token.
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.19198468327522278,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.042092032730579376,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]
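
The effect of `top_k` can be sketched as "keep the k most probable tokens and substitute each into the template". In the example below the two highest probabilities are rounded from the output above; the other two entries and the helper name `fill_mask_top_k` are assumptions added for illustration.

```python
import heapq


def fill_mask_top_k(template, token_probs, top_k):
    """Return the top_k most probable fillings of the <mask> slot."""
    best = heapq.nlargest(top_k, token_probs.items(), key=lambda item: item[1])
    return [
        {
            "score": score,
            "token_str": token,
            "sequence": template.replace("<mask>", token.strip()),
        }
        for token, score in best
    ]


template = "This course will teach you all about <mask> models."
token_probs = {
    " statistical": 0.020,    # hypothetical value
    " mathematical": 0.192,   # rounded from the pipeline output above
    " predictive": 0.030,     # hypothetical value
    " computational": 0.042,  # rounded from the pipeline output above
}

predictions = fill_mask_top_k(template, token_probs, top_k=2)
for prediction in predictions:
    print(prediction)
```

Raising `top_k` only lengthens the returned list; it does not change the ranking of the candidates already shown.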

In [None]:
# Named entity recognition
# Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.

# In the example below, the model correctly identifies that Sylvain is a person (PER), Hugging Face an organization (ORG), and Brooklyn a location (LOC).
from transformers import pipeline

# grouped_entities=True merges the sub-word tokens that belong to the same entity; try False to see one entry per token.
# (Recent transformers releases prefer aggregation_strategy="simple" for the same grouping.)
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")



No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
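
The `start` and `end` fields in the output above are character offsets into the input string, so each entity can be recovered by slicing. A small sketch using the values printed above (scores omitted); `entity_spans` is a hypothetical helper name.

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Entity groups copied from the pipeline output above (scores omitted).
entities = [
    {"entity_group": "PER", "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "word": "Brooklyn", "start": 49, "end": 57},
]


def entity_spans(text, entities):
    """Return (label, substring) pairs recovered from the character offsets."""
    return [(e["entity_group"], text[e["start"]:e["end"]]) for e in entities]


spans = entity_spans(text, entities)
for label, span in spans:
    print(f"{label}: {span}")
```

Slicing by offsets is handy when the entities need to be highlighted or replaced in the original text rather than just listed.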

In [None]:
# Question answering
from transformers import pipeline

# The question-answering pipeline answers questions using information from a given context:

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6949766278266907, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
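
The answer above is extracted from the context, not generated. A simplified way to think about it: the model scores every token as a possible start and as a possible end of the answer, and the best valid (start, end) pair is returned. The sketch below assumes whitespace tokens and invented logits; the real pipeline works on sub-word tokens and probabilities, and `best_span` is a hypothetical helper.

```python
def best_span(start_logits, end_logits, max_answer_len=4):
    """Pick the (start, end) pair with the highest combined score."""
    best = (float("-inf"), 0, 0)
    for i, start_score in enumerate(start_logits):
        # The end may not precede the start, and the span length is capped.
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_score + end_logits[j]
            if score > best[0]:
                best = (score, i, j)
    return best[1], best[2]


context = "My name is Sylvain and I work at Hugging Face in Brooklyn"
tokens = context.split()

# Made-up logits: high start score on "Hugging", high end score on "Face".
start_logits = [0.0] * len(tokens)
end_logits = [0.0] * len(tokens)
start_logits[8], start_logits[11] = 5.0, 3.0
end_logits[9], end_logits[11] = 5.0, 3.5

start, end = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer)
```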

In [None]:
# Summarization
# Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text. Here’s an example:

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
""",
    max_length=10,
    min_length=5
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' America has changed dramatically during recent years'}]

In [None]:
# Translation
# For translation, you can use a default model if you provide a language pair in the task name (such as "translation_en_to_fr").
# Here we instead pick a French-to-English model from the Model Hub explicitly.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]

In [None]:
# The Transformer architecture was introduced in June 2017. The focus of the original research was on translation tasks.
# https://huggingface.co/learn/nlp-course/chapter1/4
# Broadly, Transformer models fall into three categories:
# - GPT-like (also called auto-regressive Transformer models)
# - BERT-like (also called auto-encoding Transformer models)
# - BART/T5-like (also called sequence-to-sequence Transformer models)
# All the Transformer models mentioned above (GPT, BERT, BART, T5, etc.) have been trained as language models, in a self-supervised fashion.
# Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model,
# so humans are not needed to label the data.
# The pretrained model then goes through transfer learning: it is fine-tuned in a supervised way (that is, using human-annotated labels) on a given task.

# An example of a self-supervised task is predicting the next word in a sentence having read the n previous words.
# This is called causal language modeling because the output depends on the past and present inputs, but not the future ones.

# Pretraining is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge.
# Fine-tuning, on the other hand, is the training done after a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task.

# Encoder and decoder models
# Encoder-only models: good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
# Decoder-only models: good for generative tasks such as text generation.
# Encoder-decoder models (or sequence-to-sequence models): good for generative tasks that require an input, such as translation or summarization.
# A key feature of Transformer models is that they are built with special layers called attention layers.
# Paper: https://arxiv.org/abs/1706.03762
# When building the representation (embedding) of each word, an attention layer tells the model to pay more attention to certain words in the sentence.

# A word by itself has a meaning, but that meaning is deeply affected by the context.
# The Transformer architecture was originally designed for translation. During training, the encoder receives inputs (sentences) in a certain language, while the decoder receives the same sentences in the desired target language.
# Terminology:
# - Architecture: the skeleton of the model.
# - Checkpoints: the weights that will be loaded in a given architecture.
# - Model: an umbrella term that isn’t as precise as “architecture” or “checkpoint”.
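
The notes above say that in self-supervised training the objective is computed from the inputs themselves, with next-word prediction as the example. A minimal sketch of that idea, assuming whitespace-separated words instead of real tokens (`next_word_examples` is a hypothetical helper):

```python
def next_word_examples(text):
    """Build (previous words, next word) training pairs from raw text."""
    words = text.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]


examples = next_word_examples("the cat sat on the mat")
for context_words, target in examples:
    print(context_words, "->", target)
```

Every target comes straight from the text, so no human labelling is needed, and each context contains only earlier words, which is what makes the objective causal.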

In [None]:
# Encoder
# Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having “bi-directional” attention, and are often called auto-encoding models.
# ALBERT
# BERT
# DistilBERT
# ELECTRA
# RoBERTa

# Decoder
# Decoder models use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models.
# CTRL
# GPT
# GPT-2
# Transformer XL

# Sequence to Sequence
# Encoder-decoder models (also called sequence-to-sequence models) use both parts of the Transformer architecture. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input.
# BART
# mBART
# Marian
# T5

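# https://huggingface.co/learn/nlp-course/chapter1/8
# Pay attention to the data the model was trained on: a pretrained model can reproduce the biases present in that data.
# Example: is the model gender-neutral? Compare its predictions for the two prompts below.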
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])
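
The encoder and decoder descriptions at the top of the cell above differ mainly in which positions each word is allowed to attend to. That difference can be drawn as an attention mask, where row i lists the positions word i can see (1 = visible, 0 = hidden). This is a sketch under the common convention that a word can also see itself; the function names are hypothetical.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every word can attend to every word."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: word i attends only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


words = ["This", "man", "works"]

print("encoder (bi-directional):")
for word, row in zip(words, bidirectional_mask(len(words))):
    print(f"{word:>6} {row}")

print("decoder (auto-regressive):")
for word, row in zip(words, causal_mask(len(words))):
    print(f"{word:>6} {row}")
```

The lower-triangular decoder mask is what keeps generation from looking at future words, while the all-ones encoder mask gives each word the full sentence as context.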