## Creating a Transformer
The first thing we’ll need to do to initialize a BERT model is load a configuration object:

In [1]:
from transformers import BertConfig, TFBertModel

In [2]:
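# Build the configuration with its default values
config = BertConfig()
# Build the model from the configuration (its weights are randomly initialized)
model = TFBertModel(config)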

In [3]:
# The configuration contains many attributes that are used to build the model:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.4.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
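
To get a feel for how these attributes determine the size of the model, here is a rough sketch that counts the parameters of a BERT encoder from configuration values. It assumes the standard BERT layout (token, position, and segment embeddings; self-attention and feed-forward blocks with layer normalization in every layer; a final pooler). The helper `count_bert_parameters` is hypothetical and is not part of the Transformers library.

```python
def count_bert_parameters(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    type_vocab_size=2,
):
    """Estimate the parameter count of a BERT encoder (assumed standard layout)."""
    h, i = hidden_size, intermediate_size
    layer_norm = 2 * h  # one scale and one bias vector

    # Token, position, and segment embedding tables, followed by a LayerNorm
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * h + layer_norm

    # Query, key, value, and output projections (weights + biases)
    attention = 4 * (h * h + h)
    # Two dense layers: hidden -> intermediate -> hidden
    feed_forward = (h * i + i) + (i * h + h)
    per_layer = attention + layer_norm + feed_forward + layer_norm

    # Dense layer applied to the first token's hidden state
    pooler = h * h + h
    return embeddings + num_hidden_layers * per_layer + pooler


total = count_bert_parameters()
print(f"{total:,} parameters")  # 109,482,240 parameters

half_depth = count_bert_parameters(num_hidden_layers=6)
print(f"{half_depth:,} parameters with 6 layers")  # 66,955,008 parameters with 6 layers
```

Under these assumptions, changing a single attribute such as `num_hidden_layers` changes the size of the model that the configuration builds. Note that `num_attention_heads` does not appear in the count: the heads split the hidden dimension rather than adding to it.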



- Creating a model from the default configuration initializes it with random values.
The model can be used in this state, but it will output gibberish; it needs to be trained first. We could train the model from scratch on the task at hand, but this would require a long time and a lot of data, and it would have a non-negligible environmental impact. To avoid unnecessary and duplicated effort, it’s imperative to be able to share and reuse models that have already been trained.

## Loading a Transformer model that is already trained

In [7]:
from transformers import TFBertModel
model=TFBertModel.from_pretrained('bert-base-cased')

Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


- This model is now initialized with all the weights of the checkpoint. It can be used directly for inference on the tasks it was trained on, and it can also be fine-tuned on a new task. By training with pretrained weights rather than from scratch, we can quickly achieve good results.

## Saving methods
- Saving a model is as easy as loading one — we use the save_pretrained method, which is analogous to the from_pretrained method:

In [8]:
model.save_pretrained('directory_on_my_computer')

- This saves two files to your disk:

config.json, tf_model.h5
- The tf_model.h5 file is known as the state dictionary; it contains all your model’s weights. The two files go hand in hand; the configuration is necessary to know your model’s architecture, while the model weights are your model’s parameters.

## Using a Transformer model for inference

- Transformer models can only process numbers — numbers that the tokenizer generates. But before we discuss tokenizers, let’s explore what inputs the model accepts.

Tokenizers can take care of casting the inputs to the appropriate framework’s tensors, but to help you understand what’s going on, we’ll take a quick look at what must be done before sending the inputs to the model.

In [10]:
sequences = [
  "Hello!",
  "Cool.",
  "Nice!"
]

- The tokenizer converts these to vocabulary indices which are typically called input IDs. Each sequence is now a list of numbers! The resulting output is:

In [12]:
encoded_sequences = [
  [ 101, 7592,  999,  102],
  [ 101, 4658, 1012,  102],
  [ 101, 3835,  999,  102]
]

- Tensors only accept rectangular shapes (think matrices). This “array” is already of rectangular shape, so converting it to a tensor is easy:

In [13]:
import tensorflow as tf
inputs=tf.constant(encoded_sequences)
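
When the sequences in a batch have different lengths, the nested list is no longer rectangular and has to be padded before it can become a tensor. The following is a minimal pure-Python sketch of that idea. It assumes a padding ID of 0 (the `pad_token_id` shown in the configuration printed earlier); `is_rectangular` and `pad_batch` are hypothetical helpers, and the shorter second sequence is made up for illustration.

```python
def is_rectangular(batch):
    """Return True if every sequence in the batch has the same length."""
    return len({len(seq) for seq in batch}) <= 1


def pad_batch(batch, pad_id=0):
    """Right-pad sequences to a common length and build a matching attention mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask


ragged = [
    [101, 7592, 999, 102],
    [101, 4658, 102],  # hypothetical shorter sequence
]
print(is_rectangular(ragged))  # False

input_ids, attention_mask = pad_batch(ragged)
print(input_ids)       # [[101, 7592, 999, 102], [101, 4658, 102, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

In practice the tokenizer can do this for you when it is called with `padding=True`, so a helper like this is only useful for understanding what happens to the inputs.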

In [14]:
# Using the tensors as inputs to the model is simple: we just call the model with the inputs.
# While the model accepts a lot of different arguments, only the input IDs are necessary.
output = model(inputs)

In [16]:
output

TFBaseModelOutputWithPooling(last_hidden_state=<tf.Tensor: shape=(3, 4, 768), dtype=float32, numpy=
array([[[ 4.4495672e-01,  4.8276237e-01,  2.7797243e-01, ...,
         -5.4032564e-02,  3.9393413e-01, -9.4770178e-02],
        [ 2.4942866e-01, -4.4093004e-01,  8.1772351e-01, ...,
         -3.1916550e-01,  2.2992221e-01, -4.1171834e-02],
        [ 1.3667539e-01,  2.2517797e-01,  1.4502037e-01, ...,
         -4.6914738e-02,  2.8224230e-01,  7.5565636e-02],
        [ 1.1788867e+00,  1.6738546e-01, -1.8187107e-01, ...,
          2.4671380e-01,  1.0440768e+00, -6.1966730e-03]],

       [[ 3.6435878e-01,  3.2464243e-02,  2.0257650e-01, ...,
          6.0110077e-02,  3.2451323e-01, -2.0995270e-02],
        [ 7.1865964e-01, -4.8725191e-01,  5.1740408e-01, ...,
         -4.4012007e-01,  1.4553063e-01, -3.7544742e-02],
        [ 3.3223283e-01, -2.3270914e-01,  9.4876423e-02, ...,
         -2.5268185e-01,  3.2171962e-01,  8.1109535e-04],
        [ 1.2523220e+00,  3.5754380e-01, -5.1320337e-02, .

## Tokenizers

#### Word-based

- The first type of tokenizer that comes to mind is word-based. It’s generally very easy to set up and use with only a few rules, and it often yields decent results. The goal is to split the raw text into words and find a numerical representation for each of them.

There are also variations of word tokenizers that have extra rules for punctuation. With this kind of tokenizer, we can end up with some pretty large “vocabularies,” where a vocabulary is defined by the total number of independent tokens that we have in our corpus.

Each word gets assigned an ID, starting from 0 and going up to the size of the vocabulary. The model uses these IDs to identify each word.

If we want to completely cover a language with a word-based tokenizer, we’ll need to have an identifier for each word in the language, which will generate a huge number of tokens. For example, there are over 500,000 words in the English language, so to build a map from each word to an input ID we’d need to keep track of that many IDs. Furthermore, words like “dog” are represented differently from words like “dogs”, and the model will initially have no way of knowing that “dog” and “dogs” are similar: it will identify the two words as unrelated. The same applies to other similar words, like “run” and “running”, which the model will not see as being similar initially.

Finally, we need a custom token to represent words that are not in our vocabulary. This is known as the “unknown” token, often represented as “[UNK]” or “<unk>”. It’s generally a bad sign if you see that the tokenizer is producing a lot of these tokens, as it wasn’t able to retrieve a sensible representation of a word and you’re losing information along the way. The goal when crafting the vocabulary is to do it in such a way that the tokenizer tokenizes as few words as possible into the unknown token.
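
The word-to-ID mapping and the unknown token described above can be sketched in a few lines of plain Python. This is only an illustration: the two-sentence corpus is made up, and `build_vocab` and `encode` are hypothetical helpers rather than functions from the Transformers library.

```python
def build_vocab(corpus, unk_token="[UNK]"):
    """Assign an ID to every distinct whitespace-separated word in the corpus."""
    vocab = {unk_token: 0}
    for sentence in corpus:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(text, vocab, unk_token="[UNK]"):
    """Map each word to its ID, falling back to the unknown token's ID."""
    return [vocab.get(word, vocab[unk_token]) for word in text.split()]


corpus = ["the dog runs", "the dogs run"]
vocab = build_vocab(corpus)
print(vocab)
# {'[UNK]': 0, 'the': 1, 'dog': 2, 'runs': 3, 'dogs': 4, 'run': 5}

print(encode("the dog runs", vocab))        # [1, 2, 3]
print(encode("the dog is running", vocab))  # [1, 2, 0, 0]
```

Here “dog” and “dogs” receive the unrelated IDs 2 and 4, and both “is” and “running” collapse into the unknown ID 0, which is exactly the loss of information described above.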

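In its simplest form, a word-based tokenizer splits the text on whitespace, for example with Python’s split() method:

In [17]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']

One way to reduce the number of unknown tokens is to go one level deeper, using a character-based tokenizer.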


## Character-based tokenizer

- Character-based tokenizers split the text into characters, rather than words. This has two primary benefits:
  - The vocabulary is much smaller.
  - There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.

But here too some questions arise concerning spaces and punctuation.



This approach isn’t perfect either. Since the representation is now based on characters rather than words, one could argue that, intuitively, it’s less meaningful: each character doesn’t mean a lot on its own, whereas that is the case with words. However, this again differs according to the language; in Chinese, for example, each character carries more information than a character in a Latin language.

Another thing to consider is that we’ll end up with a very large number of tokens to be processed by our model: whereas a word would only be a single token with a word-based tokenizer, it can easily turn into 10 or more tokens when converted into characters.
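
The trade-off can be seen on a single sentence. The sketch below compares a whitespace split with a character split; treating the space as a token of its own is an assumption made for this illustration, and it is one of the open questions about spaces and punctuation.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()  # one token per word
char_tokens = list(text)    # one token per character, spaces included

print(len(word_tokens))  # 5
print(len(char_tokens))  # 26

# The character vocabulary only needs one entry per distinct character
char_vocab = {ch: idx for idx, ch in enumerate(sorted(set(char_tokens)))}
print(len(char_vocab))  # 15

char_ids = [char_vocab[ch] for ch in char_tokens]
```

The sequence grows from 5 tokens to 26, while the vocabulary needed to cover this sentence is only 15 entries and any new word made of the same characters can be represented without an unknown token.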

To get the best of both worlds, we can use a third technique that combines the two approaches: subword tokenization.

## Subword tokenization
- Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.

For instance, “annoyingly” might be considered a rare word and could be decomposed into “annoying” and “ly”. These are both likely to appear more frequently as standalone subwords, while at the same time the meaning of “annoyingly” is kept by the composite meaning of “annoying” and “ly”.
- These subwords end up providing a lot of semantic meaning: for instance, “tokenization” can be split into “token” and “ization”, two tokens that have a semantic meaning while being space-efficient (only two tokens are needed to represent a long word). This allows us to have relatively good coverage with small vocabularies, and close to no unknown tokens.

This approach is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.
- Unsurprisingly, there are many more techniques out there. To name a few:
  - Byte-level BPE, as used in GPT-2
  - WordPiece, as used in BERT
  - SentencePiece or Unigram, as used in several multilingual models
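
To make the decomposition of rare words concrete, here is a minimal sketch of a greedy longest-match-first split in the style of WordPiece, where pieces that continue a word carry a “##” prefix. The vocabulary is a hypothetical toy set, not taken from any real checkpoint, and `wordpiece_tokenize` is an illustrative helper that covers only the splitting of a single word, not how a vocabulary is learned.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subwords.

    Pieces that continue a word are looked up with a "##" prefix.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"token", "##ization", "annoying", "##ly", "the"}

print(wordpiece_tokenize("the", toy_vocab))           # ['the']
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("annoyingly", toy_vocab))    # ['annoying', '##ly']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Frequent words that are in the vocabulary stay whole, rare words are assembled from pieces, and the unknown token is only needed when no piece matches at all.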

## Loading and saving

- Loading and saving tokenizers is as simple as it is with models. Actually, it’s based on the same two methods: from_pretrained and save_pretrained. These methods will load or save the algorithm used by the tokenizer (a bit like the architecture of the model) as well as its vocabulary (a bit like the weights of the model).

Loading the BERT tokenizer trained with the same checkpoint as BERT is done the same way as loading the model, except we use the BertTokenizer class:

In [21]:
from transformers import BertTokenizer
tokenizer=BertTokenizer.from_pretrained('bert-base-cased')

- Similar to TFAutoModel, the AutoTokenizer class will grab the proper tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint:

In [23]:
from transformers import AutoTokenizer
token=AutoTokenizer.from_pretrained('bert-base-cased')

In [24]:
token('are you okay!')

{'input_ids': [101, 1132, 1128, 3008, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}

In [25]:
#Saving a tokenizer is identical to saving a model:

token.save_pretrained("directory_on_my_computer")

('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json')

## Encoding

Let’s see how the input_ids are generated. To do this, we’ll need to look at the intermediate methods of the tokenizer.

- Translating text to numbers is known as encoding. Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.

As we’ve seen, the first step is to split the text into words (or parts of words, punctuation symbols, etc.), usually called tokens. There are multiple rules that can govern that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the same rules that were used when the model was pretrained.

The second step is to convert those tokens into numbers, so we can build a tensor out of them and feed them to the model. To do this, the tokenizer has a vocabulary, which is the part we download when we instantiate it with the from_pretrained method. Again, we need to use the same vocabulary used when the model was pretrained.

In [26]:
from transformers import AutoTokenizer

In [28]:
token=AutoTokenizer.from_pretrained('bert-base-cased')

In [29]:
seq=token.tokenize('you are amazing')

In [30]:
seq

['you', 'are', 'amazing']

In [31]:
#From tokens to input IDs
#The conversion to input IDs is handled by the convert_tokens_to_ids tokenizer method:
ids=token.convert_tokens_to_ids(seq)

In [32]:
ids

[1128, 1132, 6929]

In [34]:
# Decoding: convert input IDs back into a string
decoded_string = token.decode([1128, 1132, 6929])
print(decoded_string)

you are amazing
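
The tokenizer call earlier returned IDs that start with 101 and end with 102, while `convert_tokens_to_ids` returns neither. Those extra IDs are BERT’s special `[CLS]` and `[SEP]` tokens, which the full tokenizer call adds around the sequence. The sketch below mimics the two encoding steps and the decoding step in plain Python. It uses a hypothetical five-entry vocabulary whose IDs are copied from the outputs in this section, a whitespace split stands in for the real tokenization rules, and the functions are local stand-ins that only borrow the names of the tokenizer methods.

```python
# Toy vocabulary: IDs copied from the outputs shown in this section
toy_vocab = {"[CLS]": 101, "[SEP]": 102, "you": 1128, "are": 1132, "amazing": 6929}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}


def tokenize(text):
    """Step 1: split the text into tokens (whitespace split as a stand-in)."""
    return text.split()


def convert_tokens_to_ids(tokens):
    """Step 2: look each token up in the vocabulary."""
    return [toy_vocab[tok] for tok in tokens]


def encode(text, add_special_tokens=True):
    """Both steps together, optionally wrapping the sequence in [CLS] ... [SEP]."""
    ids = convert_tokens_to_ids(tokenize(text))
    if add_special_tokens:
        ids = [toy_vocab["[CLS]"]] + ids + [toy_vocab["[SEP]"]]
    return ids


def decode(ids):
    """Map IDs back to tokens and join them into a string."""
    return " ".join(id_to_token[i] for i in ids)


print(encode("you are amazing", add_special_tokens=False))  # [1128, 1132, 6929]
print(encode("you are amazing"))                            # [101, 1128, 1132, 6929, 102]
print(decode([1128, 1132, 6929]))                           # you are amazing
```

A real tokenizer’s decode step also merges subword pieces back into words, which this toy version does not need to handle.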


In [35]:
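import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Load the tokenizer and the model from the same checkpoint so that they share a vocabulary
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
  "I've been waiting for a HuggingFace course my whole life.",
  "So have I!"
]

# Tokenize, pad to a common length, truncate overlong inputs, and return TensorFlow tensors
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="tf")
# Unpack the tokenizer output into keyword arguments for the model
output = model(**tokens)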

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_93']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [36]:
output

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-1.5606979,  1.6122825],
       [-3.6183178,  3.9137495]], dtype=float32)>, hidden_states=None, attentions=None)
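
The logits above are raw, unnormalized scores. A common way to read them is to pass each row through a softmax, which turns the scores into probabilities that sum to 1. The following is a minimal pure-Python sketch using the logits copied from the output above; the `softmax` helper is written here for illustration and is not the library’s own post-processing.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps the exponentials numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the output above, one row per input sentence
logits = [
    [-1.5606979, 1.6122825],
    [-3.6183178, 3.9137495],
]

probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row])
```

For both sentences the second column receives almost all of the probability mass. Which label each column stands for is defined by the model’s configuration (its `id2label` attribute), so it is safer to look it up there than to assume an order.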