# Hugging Face - Tokenizer and Model

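For both the _tokenizer_ and the _model_, there are generic classes: `AutoTokenizer` and `AutoModelForSequenceClassification`. These pick the matching architecture from the checkpoint's configuration, and they can be substituted by model-specific classes, such as `BertTokenizer` and `BertForSequenceClassification`.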

- See [AutoTokenizer official documentation](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer)
- See [AutoModelForSequenceClassification official documentation](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSequenceClassification)

In [1]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

classifier = pipeline("sentiment-analysis")

res = classifier("I've been waiting for a HuggingFace course my whole life.")

print(res)

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598049521446228}]


In [2]:
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

res = classifier("I've been waiting for a HuggingFace course my whole life.")

print(res)


[{'label': 'POSITIVE', 'score': 0.9598049521446228}]


### What does a tokenizer do?

A tokenizer, such as the ones in the Hugging Face Transformers library, converts text into the numeric format that a model can process. Here is a breakdown of what it does:

1. Text to Tokens: The tokenizer breaks down the input text into smaller units called tokens. These tokens can be words, subwords, or even characters, depending on the tokenizer's design. This process is important for handling complex languages where words can have multiple forms, or for managing large vocabularies efficiently.
1. Tokenization Methods: Different models use different tokenization methods. For instance, BERT uses WordPiece tokenization, which splits words into subword units, so a word that is missing from the vocabulary can still be represented by pieces that are in it.
1. Generating Token IDs: Each token is mapped to a unique integer called a token ID, which is its index in the tokenizer's vocabulary. The model receives these IDs rather than the text itself.
1. Attention Mask: Along with token IDs, the tokenizer generates an attention mask. This is a sequence of 1s and 0s with one entry per token, where 1 marks a token the model should attend to and 0 marks a token it should ignore, typically padding. This is what allows input sequences of different lengths to be processed together in one batch.
1. Additional Tokens: The tokenizer may add special tokens that serve specific purposes. For example, '[CLS]' and '[SEP]' in BERT mark the beginning of a sequence and separation of segments, respectively. '[PAD]' tokens are used for padding shorter sequences to a uniform length.
1. Pre-Trained Models: Tokenizers in Hugging Face are often linked to pre-trained models. When using a specific model, you typically use its associated tokenizer to ensure compatibility between the tokenization and the model’s training data.
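
The steps above can be sketched without the library. The following is a toy illustration, not the Hugging Face implementation: the ten-entry `VOCAB`, its IDs, and the helper names `wordpiece`, `tokenize`, and `encode` are all hypothetical, and the splitting rule is a simplified greedy longest-match lookup in the style of WordPiece.

```python
# Toy vocabulary (hypothetical). Real BERT vocabularies have tens of thousands
# of entries, and their IDs differ from the positions used here.
VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]",
         "using", "transform", "##er", "networks", "is", "simple"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def wordpiece(word):
    """Split one lowercase word by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in TOKEN_TO_ID:
                break
            end -= 1
        if end == start:  # nothing matched: treat the whole word as unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text):
    """Text to tokens: lowercase, split on whitespace, then split into pieces."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word))
    return tokens


def encode(text, max_length):
    """Add special tokens, map tokens to IDs, pad, and build the attention mask."""
    tokens = ["[CLS]"] + tokenize(text) + ["[SEP]"]
    input_ids = [TOKEN_TO_ID[t] for t in tokens]
    attention_mask = [1] * len(input_ids)      # 1 = real token
    n_pad = max_length - len(input_ids)        # no truncation in this sketch
    input_ids += [TOKEN_TO_ID["[PAD]"]] * n_pad
    attention_mask += [0] * n_pad              # 0 = padding, to be ignored
    return {"input_ids": input_ids, "attention_mask": attention_mask}


print(tokenize("Using Transformer networks is simple"))
print(encode("Using Transformer networks is simple", max_length=10))
```

With this toy vocabulary, "transformer" is not an entry, so it is split into `transform` and `##er`, and the two trailing `[PAD]` positions get a 0 in the attention mask.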

In [7]:
sequence = "Using Transformer networks is simple"
res = tokenizer(sequence)
print("Print 1:", res)  # attention_mask: 1 = token the model attends to, 0 = token it ignores (padding)

tokens = tokenizer.tokenize(sequence)
print("Print 2:", tokens)

ids = tokenizer.convert_tokens_to_ids(tokens)
print("Print 3:", ids)

decoded_string = tokenizer.decode(ids)
print("Print 4:", decoded_string)
# In Print 1, 101 is the [CLS] token (start of sequence)
# and 102 is the [SEP] token (end of sequence / segment separator)

Print 1: {'input_ids': [101, 2478, 10938, 2121, 6125, 2003, 3722, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
Print 2: ['using', 'transform', '##er', 'networks', 'is', 'simple']
Print 3: [2478, 10938, 2121, 6125, 2003, 3722]
Print 4: using transformer networks is simple
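
The only difference between Print 1 and Print 3 is the pair of special tokens that `tokenizer(sequence)` adds around the sequence. A minimal check, using the values printed above; `add_special_tokens` is a hypothetical helper written for this illustration, not a library function, and the IDs 101 and 102 are hard-coded from the output rather than read from the tokenizer (where they should be available as `tokenizer.cls_token_id` and `tokenizer.sep_token_id`).

```python
# Values copied from the printed output above.
input_ids = [101, 2478, 10938, 2121, 6125, 2003, 3722, 102]  # Print 1
ids = [2478, 10938, 2121, 6125, 2003, 3722]                   # Print 3

CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in this tokenizer's vocabulary


def add_special_tokens(token_ids):
    """Wrap a list of token IDs in [CLS] ... [SEP] (hypothetical helper)."""
    return [CLS_ID] + token_ids + [SEP_ID]


print(add_special_tokens(ids) == input_ids)  # True
```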
