In [1]:
import tqdm as notebook_tqdm
from transformers import pipeline, AutoTokenizer, TFAutoModel, TFAutoModelForSequenceClassification
import tensorflow as tf
from pprint import pprint



# Complete pipeline

In [2]:
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.






All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
Device set to use 0


In [3]:
data = [
    'I hate it!',
    'I love it!',
    'what a shame'
]

In [4]:
results = classifier(data)
print(type(results))
print(results)

<class 'list'>
[{'label': 'NEGATIVE', 'score': 0.9995840191841125}, {'label': 'POSITIVE', 'score': 0.9998781681060791}, {'label': 'NEGATIVE', 'score': 0.9997525811195374}]


# Tokenizer

In [5]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

In [6]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [7]:
inputs = tokenizer(data, padding=True, truncation=True, return_tensors="tf")

In [8]:
for k, v in inputs.items():
    print(k)
    print(v)
    print(100*'-')

input_ids
tf.Tensor(
[[ 101 1045 5223 2009  999  102]
 [ 101 1045 2293 2009  999  102]
 [ 101 2054 1037 9467  102    0]], shape=(3, 6), dtype=int32)
----------------------------------------------------------------------------------------------------
attention_mask
tf.Tensor(
[[1 1 1 1 1 1]
 [1 1 1 1 1 1]
 [1 1 1 1 1 0]], shape=(3, 6), dtype=int32)
----------------------------------------------------------------------------------------------------


# AutoModel

In [9]:
model = TFAutoModel.from_pretrained(checkpoint)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


In [10]:
outputs = model(inputs)
print(outputs.last_hidden_state.shape)

(3, 6, 768)


## Classification model

In [11]:
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(inputs)
print(outputs.logits)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


tf.Tensor(
[[ 4.322109  -3.462332 ]
 [-4.3268723  4.6855288]
 [ 4.5768104 -3.7275047]], shape=(3, 2), dtype=float32)


In [12]:
predictions = tf.math.softmax(outputs.logits, axis=-1)
print(predictions)

tf.Tensor(
[[9.99584019e-01 4.15986753e-04]
 [1.21873934e-04 9.99878168e-01]
 [9.99752581e-01 2.47385411e-04]], shape=(3, 2), dtype=float32)
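
The softmax step can also be checked by hand. The following is a minimal sketch in plain Python, not the TensorFlow implementation: the helper name `softmax` is introduced here for illustration, and the logits are copied from the output printed above.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    peak = max(row)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the classification model output above.
logits = [
    [4.322109, -3.462332],
    [-4.3268723, 4.6855288],
    [4.5768104, -3.7275047],
]
probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print(row)
```

Each row sums to 1, and the larger value in each row should agree (up to float32 rounding) with the `score` reported by the pipeline for the same sentence.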


In [13]:
predictions_label = tf.math.argmax(predictions, axis=-1)
print(model.config.id2label)

{0: 'NEGATIVE', 1: 'POSITIVE'}


In [14]:
for f, l in zip(data, predictions_label.numpy()):
    print(f, '\t--->', model.config.id2label[l.item()])

I hate it! 	---> NEGATIVE
I love it! 	---> POSITIVE
what a shame 	---> NEGATIVE


# Instantiate a transformer model

In [15]:
bert_model = TFAutoModel.from_pretrained("bert-base-cased")
print(type(bert_model))

gpt_model = TFAutoModel.from_pretrained("gpt2")
print(type(gpt_model))

bart_model = TFAutoModel.from_pretrained("facebook/bart-base")
print(type(bart_model))

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.

<class 'transformers.models.bert.modeling_tf_bert.TFBertModel'>


All PyTorch model weights were used when initializing TFGPT2Model.

All the weights of TFGPT2Model were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2Model for predictions without further training.


<class 'transformers.models.gpt2.modeling_tf_gpt2.TFGPT2Model'>


All PyTorch model weights were used when initializing TFBartModel.

All the weights of TFBartModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBartModel for predictions without further training.


<class 'transformers.models.bart.modeling_tf_bart.TFBartModel'>


# Tokenizer - Part 2

In [16]:
from transformers import AutoTokenizer

In [17]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [18]:
phrase = "Let's try to tokenize!"
tokens = tokenizer.tokenize(phrase)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
final_inputs = tokenizer.prepare_for_model(input_ids)
print(tokens)
print(input_ids)
print(final_inputs['input_ids'])
print(tokenizer.decode(final_inputs['input_ids']))

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


['let', "'", 's', 'try', 'to', 'token', '##ize', '!']
[2292, 1005, 1055, 3046, 2000, 19204, 4697, 999]
[101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102]
[CLS] let ' s try to tokenize! [SEP]
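
The `token` / `##ize` split above comes from WordPiece, which breaks a word greedily into the longest pieces found in the vocabulary. Below is a rough sketch of that matching step only, assuming a tiny hypothetical vocabulary (`toy_vocab`) rather than the real BERT one; the function name `wordpiece` is also made up for this example.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, far smaller than the real one.
toy_vocab = {"let", "try", "to", "token", "##ize", "##s"}
print(wordpiece("tokenize", toy_vocab))
print(wordpiece("tokens", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

The real tokenizer also lowercases, splits on punctuation and whitespace, and adds `[CLS]` / `[SEP]`; none of that is covered here.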


In [19]:
inputs = tokenizer(phrase)
print(tokenizer.decode(inputs['input_ids']))

[CLS] let ' s try to tokenize! [SEP]


In [20]:
for i, j in inputs.items():
    print(f'{i}:\t{j}')

input_ids:	[101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102]
token_type_ids:	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask:	[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


# Batch inputs together

In [21]:
sentences = data

tokens = [tokenizer.tokenize(sentence) for sentence in sentences]
ids = [tokenizer.convert_tokens_to_ids(sentence_tokens) for sentence_tokens in tokens]

for i in range(len(sentences)):
    print(sentences[i], ids[i])

I hate it! [1045, 5223, 2009, 999]
I love it! [1045, 2293, 2009, 999]
what a shame [2054, 1037, 9467]


In [22]:
# Would fail: the id lists have different lengths (4, 4, 3),
# so without padding they cannot form a rectangular tensor.
# tf.constant(ids)

In [23]:
no_padding = tokenizer(sentences, padding=False)
yes_padding = tokenizer(sentences, padding=True)
for k, v in no_padding.items():
    print(k, v)
print(100*'-')
for k, v in yes_padding.items():
    print(k, v)

input_ids [[101, 1045, 5223, 2009, 999, 102], [101, 1045, 2293, 2009, 999, 102], [101, 2054, 1037, 9467, 102]]
token_type_ids [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
attention_mask [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]
----------------------------------------------------------------------------------------------------
input_ids [[101, 1045, 5223, 2009, 999, 102], [101, 1045, 2293, 2009, 999, 102], [101, 2054, 1037, 9467, 102, 0]]
token_type_ids [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
attention_mask [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 0]]
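
What `padding=True` does to `input_ids` and `attention_mask` can be reproduced in a few lines. This is a simplified sketch, assuming right-side padding with pad id 0 (as seen in the output above); `pad_batch` is a hypothetical helper, not a tokenizer method.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build the mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded ids copied from the padding=False output above.
unpadded = [
    [101, 1045, 5223, 2009, 999, 102],
    [101, 1045, 2293, 2009, 999, 102],
    [101, 2054, 1037, 9467, 102],
]
padded = pad_batch(unpadded)
for key, value in padded.items():
    print(key, value)
```

Once every row has the same length, the batch can be turned into a tensor.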


In [24]:
# Different kind of padding: pad every sequence to a fixed max_length
max_len_padding = tokenizer(sentences, padding='max_length', max_length=15)

for k, v in max_len_padding.items():
    print(k, v)

input_ids [[101, 1045, 5223, 2009, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2293, 2009, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 2054, 1037, 9467, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
token_type_ids [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask [[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]


# Datasets

In [25]:
from datasets import load_dataset

In [26]:
raw_datasets = load_dataset('glue', 'mrpc')
print(raw_datasets)

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})


In [27]:
print(raw_datasets['train'])
pprint(raw_datasets['train'][6])

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})
{'idx': 6,
 'label': 0,
 'sentence1': 'The Nasdaq had a weekly gain of 17.27 , or 1.2 percent , '
              'closing at 1,520.15 on Friday .',
 'sentence2': 'The tech-laced Nasdaq Composite .IXIC rallied 30.46 points , or '
              '2.04 percent , to 1,520.15 .'}


In [28]:
pprint(raw_datasets['train'].features)

{'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None)}


In [29]:
pprint(raw_datasets['train'][:6])

{'idx': [0, 1, 2, 3, 4, 5],
 'label': [1, 0, 1, 0, 1, 1],
 'sentence1': ['Amrozi accused his brother , whom he called " the witness " , '
               'of deliberately distorting his evidence .',
               "Yucaipa owned Dominick 's before selling the chain to Safeway "
               'in 1998 for $ 2.5 billion .',
               'They had published an advertisement on the Internet on June 10 '
               ', offering the cargo for sale , he added .',
               'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at '
               'A $ 4.56 , having earlier set a record high of A $ 4.57 .',
               'The stock rose $ 2.11 , or about 11 percent , to close Friday '
               'at $ 21.51 on the New York Stock Exchange .',
               'Revenue in the first quarter of the year dropped 15 percent '
               'from the same period a year earlier .'],
 'sentence2': ['Referring to him as only " the witness " , Amrozi accused his '
               'broth

## Tokenize dataset

In [30]:
def tokenize_function(example):
    return tokenizer(
        example['sentence1'], example['sentence2'], padding='max_length', truncation=True, max_length=128
    )

In [31]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

Map: 100%|██████████| 408/408 [00:00<00:00, 3481.55 examples/s]
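
With `batched=True`, the mapped function receives a dict of lists (one list per column) instead of a single example, and returns new columns of the same length. The sketch below imitates that contract in plain Python; `map_batched`, `count_words`, and the toy sentences are all hypothetical stand-ins, and this is not how the `datasets` library is implemented internally.

```python
def map_batched(columns, fn, batch_size=1000):
    """Apply fn to consecutive slices of a column dict and merge the new columns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of lists, one list per column.
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}


def count_words(batch):
    """Toy stand-in for tokenize_function: one output value per input row."""
    return {
        "n_words": [
            len(s1.split()) + len(s2.split())
            for s1, s2 in zip(batch["sentence1"], batch["sentence2"])
        ]
    }


# Hypothetical toy data, not taken from MRPC.
toy_columns = {
    "sentence1": ["I hate it!", "I love it!", "what a shame"],
    "sentence2": ["me too", "so do I", "indeed"],
}
mapped = map_batched(toy_columns, count_words, batch_size=2)
print(mapped["n_words"])
```

Processing whole batches is what lets a fast tokenizer handle many sentence pairs per call.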


In [32]:
print(tokenized_datasets.column_names)

{'train': ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'], 'validation': ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'], 'test': ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask']}


In [33]:
# Remove unnecessary columns
tokenized_datasets = tokenized_datasets.remove_columns(['sentence1', 'sentence2', 'idx'])
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')
tokenized_datasets = tokenized_datasets.with_format('tensorflow')
print(tokenized_datasets['train'])

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3668
})


In [34]:
# Select the first 100 training examples as a small subset for quick experiments
small_train_dataset = tokenized_datasets['train'].select(range(100))
print(small_train_dataset)

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 100
})


# Preprocessing sentences

In [35]:
pprint(sentences)
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors='tf')

['I hate it!', 'I love it!', 'what a shame']


In [36]:
batch

{'input_ids': <tf.Tensor: shape=(3, 6), dtype=int32, numpy=
array([[ 101, 1045, 5223, 2009,  999,  102],
       [ 101, 1045, 2293, 2009,  999,  102],
       [ 101, 2054, 1037, 9467,  102,    0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(3, 6), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(3, 6), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 0]], dtype=int32)>}

## Sentence pairs

In [37]:
a = 'My name isa Luca'
b = 'I am a software developer'
c = 'going to the cinema'
d = 'this movie is great'

### Single pair

In [38]:
checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

pair = tokenizer(a, b)
pprint(pair)

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [101,
               2026,
               2171,
               18061,
               15604,
               102,
               1045,
               2572,
               1037,
               4007,
               9722,
               102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]}
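
The layout of a BERT sentence pair is `[CLS] A [SEP] B [SEP]`, with `token_type_ids` set to 0 for the first segment and 1 for the second. A small sketch that rebuilds the output above from the two id lists; `encode_pair` is a hypothetical helper, and the special-token ids 101 and 102 are read off the printed output.

```python
def encode_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Assemble [CLS] A [SEP] B [SEP] with matching segment ids and mask."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Token ids for sentences a and b, copied from the pair printed above.
ids_a = [2026, 2171, 18061, 15604]
ids_b = [1045, 2572, 1037, 4007, 9722]
manual_pair = encode_pair(ids_a, ids_b)
for key, value in manual_pair.items():
    print(key, value)
```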


### Multiple pairs

In [39]:
batch = tokenizer([a, c], [b, d], padding=True, return_tensors='tf')
pprint(batch)

{'attention_mask': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]], dtype=int32)>,
 'input_ids': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[  101,  2026,  2171, 18061, 15604,   102,  1045,  2572,  1037,
         4007,  9722,   102],
       [  101,  2183,  2000,  1996,  5988,   102,  2023,  3185,  2003,
         2307,   102,     0]], dtype=int32)>,
 'token_type_ids': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]], dtype=int32)>}


In [40]:
checkpoint

'bert-base-uncased'

In [41]:
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**batch)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [42]:
outputs

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-0.38012093, -1.1027563 ],
       [-0.35463804, -0.94542867]], dtype=float32)>, hidden_states=None, attentions=None)

# Fine-tuning and transfer learning

In [43]:
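checkpoint = 'bert-base-cased'
# num_labels=2 attaches a freshly initialized two-class head (see the warning below).
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
# The model returns raw logits, so the loss applies the softmax itself.
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# Note: inputs must be tokenized with this checkpoint's tokenizer; the datasets
# above were tokenized with bert-base-uncased.
model.compile(optimizer='adam', loss=loss)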

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
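
For intuition, the loss configured above can be written out by hand: with `from_logits=True`, it is the negative log of the softmax probability assigned to the true class. A minimal plain-Python sketch, assuming per-example losses are averaged over the batch; the function below is an illustration, not the Keras implementation, and the labels in the example are made up.

```python
import math


def sparse_categorical_crossentropy(logits, labels):
    """Mean of -log softmax(logits)[label] over the batch."""
    losses = []
    for row, label in zip(logits, labels):
        peak = max(row)
        # log-sum-exp, shifted by the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)


# Equal logits mean a 50/50 guess, so the loss is log(2).
print(sparse_categorical_crossentropy([[0.0, 0.0]], [0]))

# Logits from the untrained bert-base-uncased head shown earlier,
# paired with hypothetical labels chosen only for this example.
example_logits = [[-0.38012093, -1.1027563], [-0.35463804, -0.94542867]]
example_labels = [1, 0]
print(sparse_categorical_crossentropy(example_logits, example_labels))
```

Labels are integer class indices here, which is why the "sparse" variant is used rather than one-hot targets.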





In [44]:
model.summary()

Model: "tf_bert_for_sequence_classification_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108310272 
                                                                 
 dropout_252 (Dropout)       multiple                  0 (unused)
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 108311810 (413.18 MB)
Trainable params: 108311810 (413.18 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
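
The numbers in the summary can be sanity-checked with a little arithmetic. This sketch assumes a hidden size of 768 for BERT-base, float32 weights (4 bytes each), and that the summary reports MB as 2**20 bytes.

```python
hidden_size = 768        # BERT-base hidden width
num_labels = 2           # as passed to from_pretrained
encoder_params = 108310272  # "bert" row of the summary above

# Dense classifier: one weight per (hidden unit, label) plus one bias per label.
classifier_params = hidden_size * num_labels + num_labels
total_params = encoder_params + classifier_params

# Memory footprint, assuming 4-byte float32 weights.
size_mb = total_params * 4 / 2**20

print(classifier_params)
print(total_params)
print(round(size_mb, 2))
```

The three printed values line up with the `classifier` row, the total parameter count, and the reported size in the summary.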
