<a href="https://colab.research.google.com/github/manuelrucci7/deep-learning-course/blob/main/colab/LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
#!pip install transformers  # light version (no PyTorch)
!pip install transformers[sentencepiece]

In [None]:
import transformers

## Transformers Pipeline

In [None]:
# Course reference: https://huggingface.co/learn/nlp-course/chapter1/1
# Common NLP tasks:
# - Classifying whole sentences
# - Classifying each word in a sentence
# - Generating text content
# - Extracting an answer from a text
# - Generating a new sentence from an input text
# NLP isn't limited to written text: it also tackles complex challenges in speech recognition and computer vision.

# How do we represent text?
# Transformers library: https://github.com/huggingface/transformers
# Model Hub: https://huggingface.co/models

In [None]:
from transformers import pipeline

# A pipeline bundles a model with its preprocessing and postprocessing steps, so we can pass in raw text and get a readable answer back.
# By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English.

# Three steps happen when text is passed to a pipeline:
# 1. The text is preprocessed into a format the model can understand.
# 2. The preprocessed inputs are passed to the model.
# 3. The predictions of the model are post-processed, so you can make sense of them.

# https://huggingface.co/docs/transformers/main_classes/pipelines

# Some of the available pipeline tasks:
# feature-extraction (get the vector representation of a text)
# fill-mask
# ner (named entity recognition)
# question-answering
# sentiment-analysis
# summarization
# text-generation
# translation
# zero-shot-classification

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598049521446228}]

In [None]:
# Zero-shot classification
# Useful when we need to classify texts that haven't been labelled.
# The pipeline lets you specify which labels to use for the classification, so you don't have to rely on the labels of the pretrained model.

from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "sport","business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'sport', 'politics'],
 'scores': [0.7840248942375183,
  0.10394306480884552,
  0.07171984016895294,
  0.04031217098236084]}
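The dictionary returned above pairs `labels` and `scores` by position, with the most likely label first. A minimal sketch of reading it, using the output copied from the cell above (no model is run here, so the values are only as good as that copy):

```python
# Output copied from the zero-shot cell above; nothing is downloaded or run.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "sport", "politics"],
    "scores": [
        0.7840248942375183,
        0.10394306480884552,
        0.07171984016895294,
        0.04031217098236084,
    ],
}

# Labels and scores are aligned by position.
ranked = list(zip(result["labels"], result["scores"]))
best_label, best_score = ranked[0]
total = sum(result["scores"])

for label, score in ranked:
    print(f"{label:<10} {score:.3f}")
print("best:", best_label, "| sum of scores:", round(total, 4))
```

In this output the scores add up to roughly 1, so they can be read as a distribution over the candidate labels.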

In [None]:
# Text generation
# We provide a prompt and the model auto-completes it by generating the remaining text.
# Text generation involves randomness, so the results may differ from run to run.

from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': 'In this course, we will teach you how to design and use the Arduino libraries that you will take in this tutorial.\n\nDownload the following source code (or any other software that you can use with the Arduino IDE) from here:\n\n'}]

In [None]:
from transformers import pipeline

# https://huggingface.co/models
# https://huggingface.co/distilbert/distilgpt2
# Use num_return_sequences to choose how many sequences are generated and max_length to cap the total length of each output. max_length counts tokens (prompt included), not words.
#generator = pipeline("text-generation", model="NousResearch/Hermes-3-Llama-3.2-3B")
#generator = pipeline("text-generation", model="meta-llama/Llama-3.2-3B")
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': 'In this course, we will teach you how to get up on top of the day. There is many methods for creating beautiful things in Java.\n'},
 {'generated_text': 'In this course, we will teach you how to read, write and use language and grammar through a book and learn how to use language together. If'}]

In [None]:
# Mask filling
from transformers import pipeline

# The top_k argument controls how many possibilities are displayed.
# The model fills in the special <mask> word, often referred to as the mask token. The mask token depends on the model: this checkpoint uses <mask>, while BERT checkpoints use [MASK].
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.19198468327522278,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.042092032730579376,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

In [None]:
# Named entity recognition
# Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.

# In the example below, the model correctly identifies Sylvain as a person (PER), Hugging Face as an organization (ORG), and Brooklyn as a location (LOC).
from transformers import pipeline

# grouped_entities=True regroups the sub-tokens that belong to the same entity (recent releases expose this as aggregation_strategy="simple"). Try False to see the ungrouped tokens.
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")



No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
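The `start` and `end` fields in the output above are character offsets into the input string. A small sketch, using the entities copied from that output, shows that they behave like a Python slice (start inclusive, end exclusive):

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Entities copied from the NER output above (scores omitted for brevity).
entities = [
    {"entity_group": "PER", "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "word": "Brooklyn", "start": 49, "end": 57},
]

# Slice the original text with each pair of offsets.
spans = [text[e["start"]:e["end"]] for e in entities]
for e, span in zip(entities, spans):
    print(e["entity_group"], repr(span))
```

Slicing the original text this way is handy when you need the exact surface form, for example to highlight entities in a document.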

In [None]:
# Question answering
from transformers import pipeline

# The question-answering pipeline answers questions using information from a given context:

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6949766278266907, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

In [None]:
# Summarization
# Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text. The max_length and min_length arguments bound the length of the summary, counted in tokens.

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
""",
    max_length=10,
    min_length=5
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' America has changed dramatically during recent years'}]

In [None]:
# Translation
# For translation, you can use a default model if you provide a language pair in the task name (such as "translation_en_to_fr"). Here we instead pick a French-to-English model from the Model Hub.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]

In [None]:
# The Transformer architecture was introduced in June 2017. The focus of the original research was on translation tasks.
# https://huggingface.co/learn/nlp-course/chapter1/4
# Transformer models broadly fall into three categories:
# - GPT-like (also called auto-regressive Transformer models)
# - BERT-like (also called auto-encoding Transformer models)
# - BART/T5-like (also called sequence-to-sequence Transformer models)
# All the Transformer models mentioned above (GPT, BERT, BART, T5, etc.) have been trained as language models, in a self-supervised fashion. Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model, so humans are not needed to label the data.
# A pretrained model then goes through transfer learning: it is fine-tuned in a supervised way (that is, using human-annotated labels) on a given task.

# An example of a pretraining task is predicting the next word in a sentence having read the n previous words. This is called causal language modeling because the output depends on the past and present inputs, but not the future ones.

# Pretraining is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge.
# Fine-tuning, on the other hand, is the training done after a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task.

# Encoder and decoder
# - Encoder-only models: good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
# - Decoder-only models: good for generative tasks such as text generation.
# - Encoder-decoder models, also called sequence-to-sequence models.
# A key feature of Transformer models is that they are built with special layers called attention layers (https://arxiv.org/abs/1706.03762).
# When building the representation of a word, an attention layer tells the model to pay more attention to certain other words in the sentence.

# A word by itself has a meaning, but that meaning is deeply affected by the context.
# The Transformer architecture was originally designed for translation. During training, the encoder receives inputs (sentences) in a certain language, while the decoder receives the same sentences in the desired target language.

# Terminology
# - Architecture: the skeleton of the model.
# - Checkpoints: the weights that will be loaded in a given architecture.
# - Model: an umbrella term that isn't as precise as "architecture" or "checkpoint".
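To make the self-supervised idea in the notes above concrete, here is a minimal sketch of how next-word prediction targets can be derived from raw text alone. It splits on whitespace for readability, which is a simplification: real models work on tokens, and `next_word_examples` is a hypothetical helper, not a library function.

```python
def next_word_examples(sentence):
    """Build (context, target) pairs for next-word prediction."""
    words = sentence.split()
    # The target at each position is simply the next word of the same text,
    # so no human annotation is required.
    return [(words[:i], words[i]) for i in range(1, len(words))]

examples = next_word_examples("My name is Sylvain")
for context, target in examples:
    print(context, "->", target)
```

Each context only contains words that come before its target, which is what "causal" refers to.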

In [None]:
# Encoder
# Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having “bi-directional” attention, and are often called auto-encoding models.
# ALBERT
# BERT
# DistilBERT
# ELECTRA
# RoBERTa

# Decoder
# Decoder models use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models.
# CTRL
# GPT
# GPT-2
# Transformer XL

# Sequence to Sequence
#Encoder-decoder models (also called sequence-to-sequence models) use both parts of the Transformer architecture. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input.
# BART
# mBART
# Marian
# T5

# https://huggingface.co/learn/nlp-course/chapter1/8
# Pay attention to the data a model was trained on: pretrained models can reproduce biases present in that data. Is this model free of gender bias?
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])
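Returning to the encoder and decoder notes at the top of this cell, the difference in what each attention layer may look at can be sketched as a matrix of 0s and 1s. This is an illustration in plain Python, not how the library builds its masks; the function names are hypothetical, and the causal version assumes a position may also attend to itself.

```python
def bidirectional_mask(n):
    """Encoder-style: every position may attend to every position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style: a position may attend to itself and earlier positions only."""
    return [[1 if col <= row else 0 for col in range(n)] for row in range(n)]

words = ["I", "work", "at", "Brooklyn"]
encoder_view = bidirectional_mask(len(words))
decoder_view = causal_mask(len(words))

# Row i lists which positions word i is allowed to look at.
for word, enc_row, dec_row in zip(words, encoder_view, decoder_view):
    print(f"{word:<9} encoder={enc_row} decoder={dec_row}")
```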

## Using Transformers: preprocessing with tokenizers, passing the inputs through the model, and postprocessing

In [1]:
# Tokenizers take care of the first and last processing steps: converting text to numerical inputs for the neural network, and converting the outputs back to text when needed.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [3]:
# Preprocessing with a tokenizer
# Transformer models can’t process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers
# Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Transformer models only accept tensors as input, so we ask the tokenizer to return them.
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
# return_tensors selects the type of tensors to get back: "pt" (PyTorch), "tf" (TensorFlow), or "np" (NumPy).
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

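# input_ids contains two rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence.
# attention_mask has the same shape: 1 marks a real token and 0 marks padding added to the shorter sentence.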

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


In [5]:
from transformers import AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape) # outputs of 🤗 Transformers models behave like namedtuples or dictionaries.

# This architecture contains only the base Transformer module: given some inputs, it outputs what we’ll call hidden states, also known as features.
# For each model input, we’ll retrieve a high-dimensional vector representing the contextual understanding of that input by the Transformer model.

# While these hidden states can be useful on their own, they're usually inputs to another part of the model, known as the head.

# Batch size: The number of sequences processed at a time (2 in our example).
# Sequence length: The length of the numerical representation of the sequence (16 in our example).
# Hidden size: The vector dimension of each model input.  The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).

torch.Size([2, 16, 768])


In [10]:
# Model heads: Making sense out of numbers
# The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension.
# There are many different architectures available in 🤗 Transformers, with each one designed around tackling a specific task. Here is a non-exhaustive list:
# *Model (retrieve the hidden states)
# *ForCausalLM
# *ForMaskedLM
# *ForMultipleChoice
# *ForQuestionAnswering
# *ForSequenceClassification
# *ForTokenClassification
# and others 🤗

from transformers import AutoModelForSequenceClassification
import torch
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)
# Those are not probabilities but logits, the raw, unnormalized scores outputted by the last layer of the model.
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

torch.Size([2, 2])
tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)
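The postprocessing step above can be sketched in plain Python: softmax turns logits into probabilities, and the index of the largest probability is mapped to a label. The probabilities below are copied from the output above; the label order in `id2label` is an assumption about this checkpoint (the real mapping is stored in `model.config.id2label`), and the logits passed to `softmax` are made up.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Assumed label order for this checkpoint (index 0 = NEGATIVE, index 1 = POSITIVE).
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied from the output above, one row per input sentence.
probabilities = [
    [4.0195e-02, 9.5980e-01],
    [9.9946e-01, 5.4418e-04],
]

predicted = []
for row in probabilities:
    best = max(range(len(row)), key=row.__getitem__)  # index of the largest probability
    predicted.append((id2label[best], row[best]))
print(predicted)

# Made-up logits, only to show that the softmax output sums to 1.
example = softmax([2.0, -1.0])
print(example, sum(example))
```

Under that assumed label order, the two rows map to the same labels and scores that the sentiment-analysis pipeline printed for these sentences.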


## Model

In [13]:
# AutoModel is handy when you want to instantiate any model from a checkpoint. When you already know the architecture, you can use its class directly (BertModel here).
from transformers import BertConfig, BertModel
# Building the config
config = BertConfig()
# Building the model from the config
model = BertModel(config)
# The model is randomly initialized, so its outputs are meaningless until it is trained.
print(config)
# The hidden_size attribute defines the size of the hidden_states vector, and num_hidden_layers defines the number of layers the Transformer model has.

BertConfig {
  "_attn_implementation_autoset": true,
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.46.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
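To get a feel for what these configuration values imply, the parameter count can be estimated by hand. The sketch below assumes the standard BERT layout (token, position and segment embeddings; per layer, four attention projections and a two-layer feed-forward block, each followed by a layer norm; a final pooler) and uses only numbers from the printed config.

```python
# Values copied from the printed BertConfig above.
vocab_size = 30522
hidden_size = 768
num_hidden_layers = 12
intermediate_size = 3072
max_position_embeddings = 512
type_vocab_size = 2

def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden_size  # one scale and one shift vector

embeddings = (
    vocab_size * hidden_size
    + max_position_embeddings * hidden_size
    + type_vocab_size * hidden_size
    + layer_norm
)
# Query, key, value and output projections, then a layer norm.
attention = 4 * linear(hidden_size, hidden_size) + layer_norm
feed_forward = (
    linear(hidden_size, intermediate_size)
    + linear(intermediate_size, hidden_size)
    + layer_norm
)
pooler = linear(hidden_size, hidden_size)

total = embeddings + num_hidden_layers * (attention + feed_forward) + pooler
print(f"{total:,} parameters")
```

The estimate can be compared with `sum(p.numel() for p in model.parameters())` on the model built above.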



In [14]:
# Loading a Transformer model that is already trained is simple — we can do this using the from_pretrained() method:
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-cased") # This is a model checkpoint that was trained by the authors of BERT themselves; you can find more details about it in its model card.
# The entire list of available BERT checkpoints can be found: https://huggingface.co/models?other=bert
model.save_pretrained(".")
# If you take a look at the config.json file, you'll recognize the attributes necessary to build the model architecture. This file also contains some metadata, such as where the checkpoint originated and what 🤗 Transformers version you were using when you last saved the checkpoint.
# The weights file (model.safetensors in recent releases, pytorch_model.bin in older ones) is known as the state dictionary; it contains all your model's weights. The two files go hand in hand: the configuration is necessary to know your model's architecture, while the model weights are your model's parameters.


In [17]:
# Using a Transformer model for inference
# Transformer models can only process numbers — numbers that the tokenizer generates.
# Tokenizers can take care of casting the inputs to the appropriate framework’s tensors
# sequences = ["Hello!", "Cool.", "Nice!"]
# The tokenizer converts these to vocabulary indices which are typically called input IDs. Each sequence is now a list of numbers! The resulting output is:
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]
# This is a list of encoded sequences: a list of lists. Tensors only accept rectangular shapes (think matrices). This “array” is already of rectangular shape, so converting it to a tensor is easy:

import torch
model_inputs = torch.tensor(encoded_sequences)
print(model_inputs)

tensor([[ 101, 7592,  999,  102],
        [ 101, 4658, 1012,  102],
        [ 101, 3835,  999,  102]])


In [18]:
# Using the tensors as inputs to the model
output = model(model_inputs)
print(output)

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 4.4496e-01,  4.8276e-01,  2.7797e-01,  ..., -5.4032e-02,
           3.9393e-01, -9.4770e-02],
         [ 2.4943e-01, -4.4093e-01,  8.1772e-01,  ..., -3.1917e-01,
           2.2992e-01, -4.1172e-02],
         [ 1.3668e-01,  2.2518e-01,  1.4502e-01,  ..., -4.6915e-02,
           2.8224e-01,  7.5566e-02],
         [ 1.1789e+00,  1.6738e-01, -1.8187e-01,  ...,  2.4671e-01,
           1.0441e+00, -6.1972e-03]],

        [[ 3.6436e-01,  3.2464e-02,  2.0258e-01,  ...,  6.0110e-02,
           3.2451e-01, -2.0996e-02],
         [ 7.1866e-01, -4.8725e-01,  5.1740e-01,  ..., -4.4012e-01,
           1.4553e-01, -3.7545e-02],
         [ 3.3223e-01, -2.3271e-01,  9.4876e-02,  ..., -2.5268e-01,
           3.2172e-01,  8.1085e-04],
         [ 1.2523e+00,  3.5754e-01, -5.1320e-02,  ..., -3.7840e-01,
           1.0526e+00, -5.6255e-01]],

        [[ 2.4042e-01,  1.4718e-01,  1.2110e-01,  ...,  7.6062e-02,
           3.3564e-01,  2

## Tokenizers

In [None]:
# Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model.
# Models can only process numbers, so tokenizers need to convert our text inputs to numerical data.
# The goal is to find the most meaningful representation (the one that makes the most sense to the model) and, if possible, the smallest one.

In [19]:
# Word-based
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)
# There are also variations of word tokenizers that have extra rules for punctuation. With this kind of tokenizer, we can end up with some pretty large "vocabularies", where a vocabulary is defined by the total number of independent tokens that we have in our corpus.
# Each word gets assigned an ID, starting from 0 and going up to the size of the vocabulary. The model uses these IDs to identify each word.
# If we want to completely cover a language with a word-based tokenizer, we'll need an identifier for each word in the language, which generates a huge number of tokens. For example, there are over 500,000 words in the English language.
# Furthermore, words like "dog" are represented differently from words like "dogs", and the model will initially have no way of knowing that "dog" and "dogs" are similar: it will identify the two words as unrelated. The same applies to other similar words, like "run" and "running", which the model will not see as being similar initially.
# Finally, we need a custom token to represent words that are not in our vocabulary. This is known as the "unknown" token, often represented as "[UNK]" or "<unk>".

['Jim', 'Henson', 'was', 'a', 'puppeteer']
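The ID assignment and the unknown token described in the comments above can be sketched with a plain dictionary. The two-sentence corpus (the second sentence is made up) and the `encode` helper are hypothetical, purely for illustration.

```python
corpus = ["Jim Henson was a puppeteer", "Jim was a dog"]

# Reserve ID 0 for unknown words, then number each new word as it appears.
vocab = {"[UNK]": 0}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

def encode(text):
    """Map each whitespace-separated word to its ID, or to [UNK] if unseen."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

print(vocab)
known = encode("Jim was a puppeteer")
unknown = encode("Jim has dogs")  # "has" and "dogs" are not in the vocabulary
print(known)
print(unknown)
```

Note that "dogs" falls back to the unknown token even though "dog" is in the vocabulary, which is the weakness described in the comments above.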


In [None]:
# Character-based
# Character-based tokenizers split the text into characters, rather than words. This has two main benefits:
# - The vocabulary is much smaller.
# - There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.
# The drawbacks: each character doesn't mean a lot on its own, and we end up with a very large number of tokens to be processed by the model.
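The trade-off described above shows up even on a single sentence. This sketch compares a whitespace split with a character split of the same text; keeping spaces as tokens is a choice made here for simplicity.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()
char_tokens = list(text)  # spaces are kept as tokens here

# The character vocabulary is the set of distinct characters seen.
char_vocab = sorted(set(char_tokens))

print(len(word_tokens), "word tokens")
print(len(char_tokens), "character tokens")
print(len(char_vocab), "distinct characters in the vocabulary")
```

The character vocabulary stays small, but the sequence the model has to process is several times longer than the word-level one.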

In [None]:
# Subword tokenization
# Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.
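One common way to apply a subword vocabulary is a greedy longest-match-first split, the approach WordPiece-style tokenizers use when tokenizing. The sketch below is a simplified version with a tiny hypothetical vocabulary (not the real `bert-base-cased` one); `wordpiece` is an illustrative helper, not a library function.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary chosen for illustration.
toy_vocab = {"Using", "a", "Trans", "##former", "network", "is", "simple", "##s"}

sentence = "Using a Transformer network is simple"
tokens = [p for w in sentence.split() for p in wordpiece(w, toy_vocab)]
print(tokens)
print(wordpiece("networks", toy_vocab))
print(wordpiece("puppeteer", toy_vocab))
```

Frequent words stay whole, while a word like "networks" is decomposed into a known stem plus a continuation piece.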

In [21]:
# Load and Save Tokenizer
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
res = tokenizer("Using a Transformer network is simple")
print(res)
tokenizer.save_pretrained("directory_on_my_computer")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json')

In [23]:
# Encoding
# Translating text to numbers is known as encoding. Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.
# As we’ve seen, the first step is to split the text into words (or parts of words, punctuation symbols, etc.), usually called tokens.
# The second step is to convert those tokens into numbers, so we can build a tensor out of them and feed them to the model
# To do this, the tokenizer relies on its vocabulary, which is downloaded by from_pretrained().

# This tokenizer is a subword tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
print(tokens)

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
[7993, 170, 13809, 23763, 2443, 1110, 3014]


In [26]:
# Decoding
# Decoding is going the other way around: from vocabulary indices, we want to get a string.
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)
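# Note that the decode method not only converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence.
# The IDs passed here (11303, 1200) differ from the ones printed by the encoding cell (13809, 23763), which presumably explains why the decoded text reads "transformer" rather than "Transformer".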

Using a transformer network is simple
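The grouping step that `decode` performs can be approximated in a few lines: pieces starting with `##` are glued onto the previous token. This is a toy version with a hypothetical helper name; the real method also deals with special tokens and spacing around punctuation.

```python
def merge_wordpieces(tokens):
    """Join tokens into text, gluing '##' continuation pieces to the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the '##' marker and append
        else:
            words.append(token)
    return " ".join(words)

# Tokens copied from the encoding cell above.
tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
merged = merge_wordpieces(tokens)
print(merged)
```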


## Handling Multiple Sequences

In [None]:
# How do we handle multiple sequences?
# How do we handle multiple sequences of different lengths?
# Are vocabulary indices the only inputs that allow a model to work well?
# Is there such a thing as too long a sequence?

In [30]:
# Models expect a batch of inputs
# Batching is the act of sending multiple sentences through the model, all at once. If you only have one sentence, you can just build a batch with a single sequence:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)



Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)
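One detail worth noticing: the IDs above were built by hand with `tokenize` and `convert_tokens_to_ids`, so they lack the two special tokens that the tokenizer added when it was called directly earlier in this notebook. The sketch below compares the two lists, with values copied from those outputs; reading 101 and 102 as `[CLS]` and `[SEP]` is an assumption about this checkpoint's vocabulary.

```python
# IDs printed by the cell above (tokenize + convert_tokens_to_ids).
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
       2607, 2026, 2878, 2166, 1012]

# First row of input_ids printed earlier by tokenizer(raw_inputs, ...).
full_ids = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
            2607, 2026, 2878, 2166, 1012, 102]

CLS_ID, SEP_ID = 101, 102  # assumed to be [CLS] and [SEP] for this checkpoint
with_special = [CLS_ID] + ids + [SEP_ID]
print(with_special == full_ids)

# The model expects a batch, so a single sequence is wrapped in an outer list.
batch = [with_special]
print(len(batch), "sequence of length", len(batch[0]))
```

Because the model was trained with these special tokens, logits computed without them may differ from what the pipeline returns for the same sentence.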


In [None]:
# There’s a second issue, though. When you’re trying to batch together two (or more) sentences, they might be of different lengths.
# batched_ids = [
#    [200, 200, 200],
#    [200, 200]
# ]

# In order to work around this, we’ll use padding to make our tensors have a rectangular shape. Padding makes sure all our sentences have the same length by adding a special word called the padding token to the sentences with fewer values.
# padding_id = 100
# batched_ids = [
#    [200, 200, 200],
#    [200, 200, padding_id],
# ]
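The padding step described above can be sketched in plain Python. `pad_batch` is a hypothetical helper that pads on the right and also builds the matching attention mask (1 for real tokens, 0 for padding), like the one returned by the tokenizer earlier in this notebook. The value 100 is the placeholder used in the notes; a real tokenizer exposes its own ID as `tokenizer.pad_token_id`.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return padded, mask

padding_id = 100  # same placeholder value as in the notes above
batched_ids = [
    [200, 200, 200],
    [200, 200],
]

padded_ids, attention_mask = pad_batch(batched_ids, padding_id)
print(padded_ids)
print(attention_mask)
```

Both lists are now rectangular, so they can be converted with `torch.tensor(...)` and passed to the model together.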