# Using transformers
<br>
The transformers library implements a wide range of attention-based NLP models, which makes many tasks easy to run.<br>
In this notebook, I will try various NLP tasks using the BERT and GPT models.<br>

### Pipeline
<br>
Pipelines are a great and easy way to use models for inference.<br>
Pipelines are objects that abstract most of the complex code from the library, offering a simple API.<br>
<br>
The model should be chosen carefully for each task.<br>
<br>
For example, BERT is primarily aimed at being fine-tuned on tasks that use the whole sentence to make decisions.<br>
This includes sequence classification, token classification, and question answering.<br>
Tasks such as text generation are therefore better suited to autoregressive (AR) models such as GPT.

In [1]:
from transformers import pipeline

In [2]:
# Feature Extraction pipeline extracts hidden states from the base transformers, 
# which can be used as features in downstream tasks.
feature_extraction_bert = pipeline('feature-extraction',model='bert-base-uncased')

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight',

In [13]:
# Sentiment analysis classifies a sequence as having positive or negative sentiment.
# In the general TextClassificationPipeline, if multiple classification labels are available,
# the pipeline runs a softmax over the results.

# The best-known dataset for fine-tuning on sentiment analysis is
# SST-2 (Stanford Sentiment Treebank) from the
# GLUE benchmark (General Language Understanding Evaluation).

# A label ('POSITIVE' or 'NEGATIVE' for a sentiment checkpoint) is returned alongside a score.
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-cased-finetuned-mrpc')
model = BertForSequenceClassification.from_pretrained('bert-base-cased-finetuned-mrpc')

# Note: the pipeline's default model is a DistilBERT checkpoint fine-tuned on SST-2 from GLUE.
# The checkpoint loaded here was fine-tuned on MRPC (paraphrase detection) instead,
# so it returns generic labels such as LABEL_0 rather than POSITIVE/NEGATIVE.
sentiment_analysis_bert = pipeline('sentiment-analysis', tokenizer=tokenizer, model=model)

result = sentiment_analysis_bert("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'],4)}")

label: LABEL_0, with score: 0.7663
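
The comments in this cell mention that the pipeline runs a softmax over the classifier outputs. Here is a minimal sketch of that step in plain Python. The two logits are made-up values for illustration, not numbers taken from the model, and `softmax` is my own helper rather than a function from the library.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for LABEL_0 and LABEL_1 (illustrative values only).
logits = [1.2, 0.0]
probs = softmax(logits)

# The reported label is the index with the highest probability.
best = max(range(len(probs)), key=probs.__getitem__)
print(f"label: LABEL_{best}, with score: {round(probs[best], 4)}")
```

The score printed by the pipeline should be read the same way: it is the probability of the winning label, not a measure of how positive the text is.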


In [None]:
named_entity_recognition_bert = pipeline('ner',model='bert-base-uncased')

In [20]:
# Question answering extracts an answer from a text, given a question.

# The best-known dataset is SQuAD (Stanford Question Answering Dataset).

# Note that only extractive question answering is available,
# so a context must be provided.
# The pipeline returns an answer extracted from the given text, together with a confidence score.
from transformers import BertTokenizer
from transformers import BertForQuestionAnswering

context = "Who is this speaking? Perhaps this is god speaking. Then it is likely he will exist"

tokenizer= BertTokenizer.from_pretrained('bert-large-cased-whole-word-masking-finetuned-squad')
model= BertForQuestionAnswering.from_pretrained('bert-large-cased-whole-word-masking-finetuned-squad')

question_answering_bert = pipeline('question-answering',model=model,tokenizer=tokenizer)

result = question_answering_bert(question='Do you think there exists god?', context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'],4)}, start: {result['start']}, end: {result['end']}")

Answer: 'Perhaps this is god', score: 0.2784, start: 22, end: 41
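
Because the task is extractive, `start` and `end` should be character offsets into the context. A small check in plain Python, with the result values copied by hand from the output above (the dictionary here is a stand-in, not a fresh pipeline call):

```python
context = "Who is this speaking? Perhaps this is god speaking. Then it is likely he will exist"

# Values copied from the printed result above.
result = {"answer": "Perhaps this is god", "score": 0.2784, "start": 22, "end": 41}

# Slicing the context with the returned offsets recovers the answer text.
span = context[result["start"]:result["end"]]
print(span == result["answer"])
```

This also shows why the model cannot answer "yes" or "no" here: it can only return a span that already exists in the context.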


In [21]:
# Language modeling is the task of fitting a model to a corpus.
# For example, BERT uses masked language modeling, and GPT uses causal language modeling.
# Language modeling is also useful beyond pretraining, for example to adapt a model to a specific domain.
# In particular, masked language modeling is the task of filling a masked position with an appropriate token.

fill_mask_bert = pipeline('fill-mask',model='bert-base-uncased')

result = fill_mask_bert("The best university in South Korea is [MASK] university.")

print(result)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'sequence': 'the best university in south korea is seoul university.', 'score': 0.4197063446044922, 'token': 10884, 'token_str': 'seoul'}, {'sequence': 'the best university in south korea is korea university.', 'score': 0.17043077945709229, 'token': 4420, 'token_str': 'korea'}, {'sequence': 'the best university in south korea is samsung university.', 'score': 0.060363661497831345, 'token': 19102, 'token_str': 'samsung'}, {'sequence': 'the best university in south korea is peking university.', 'score': 0.013846565037965775, 'token': 27057, 'token_str': 'peking'}, {'sequence': 'the best university in south korea is sbs university.', 'score': 0.01384267583489418, 'token': 21342, 'token_str': 'sbs'}]
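
The fill-mask result is a list of dictionaries, so it can be filtered like any other Python list. A short sketch, using the first three entries copied from the output above with rounded scores; the 0.1 threshold is an arbitrary choice of mine for illustration.

```python
# First three entries copied from the printed result above (scores rounded).
result = [
    {"token_str": "seoul", "score": 0.4197},
    {"token_str": "korea", "score": 0.1704},
    {"token_str": "samsung", "score": 0.0604},
]

# Highest-scoring candidate for the [MASK] position.
best = max(result, key=lambda r: r["score"])

# Keep only candidates above an (arbitrary) confidence threshold.
THRESHOLD = 0.1
confident = [r["token_str"] for r in result if r["score"] >= THRESHOLD]

print(best["token_str"], confident)
```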


In [None]:
summarization_bert = pipeline('summarization',model='bert-base-uncased')

In [None]:
translation_bert = pipeline('translation_en_to_kr',model='bert-base-uncased')

In [None]:
text2text_bert = pipeline('text2text-generation',model='bert-base-uncased')

In [None]:
zeroshot_classification_bert = pipeline('zero-shot-classification',model='bert-base-uncased')

In [None]:
conversation_bert = pipeline('conversational',model='bert-base-uncased')

In [13]:
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
# tokenizer -> dataloader(preprocessing data) -> model
text1 = "Is this a valid sentence?"
text2 = "I am student running to Yonsei University building."
text3 = " My favorite Programming Language is C++."
text = [text1, text2, text3]
encoded_inputs = tokenizer(text, return_tensors='tf', return_attention_mask=True, padding=True, truncation=True)
print(encoded_inputs)

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


{'input_ids': <tf.Tensor: shape=(3, 13), dtype=int32, numpy=
array([[  101,  2003,  2023,  1037,  9398,  6251,  1029,   102,     0,
            0,     0,     0,     0],
       [  101,  1045,  2572,  3076,  2770,  2000, 10930, 12325,  2072,
         2118,  2311,  1012,   102],
       [  101,  2026,  5440,  4730,  2653,  2003,  1039,  1009,  1009,
         1012,   102,     0,     0]])>, 'token_type_ids': <tf.Tensor: shape=(3, 13), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])>, 'attention_mask': <tf.Tensor: shape=(3, 13), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])>}
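
To see what `padding=True` and `return_attention_mask=True` appear to be doing, here is a plain-Python sketch that pads the three id sequences from the output above to a common length. `pad_batch` is a hypothetical helper I wrote for illustration; the real tokenizer does more than this (truncation, tensors, special tokens).

```python
PAD_ID = 0  # id of [PAD], as seen in the output above

# Unpadded token ids, copied from the output above.
sequences = [
    [101, 2003, 2023, 1037, 9398, 6251, 1029, 102],
    [101, 1045, 2572, 3076, 2770, 2000, 10930, 12325, 2072, 2118, 2311, 1012, 102],
    [101, 2026, 5440, 4730, 2653, 2003, 1039, 1009, 1009, 1012, 102],
]

def pad_batch(seqs, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(s) for s in seqs)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return input_ids, attention_mask

input_ids, attention_mask = pad_batch(sequences)
for row in attention_mask:
    print(row)
```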


In [14]:
encoded_inputs['input_ids']

<tf.Tensor: shape=(3, 13), dtype=int32, numpy=
array([[  101,  2003,  2023,  1037,  9398,  6251,  1029,   102,     0,
            0,     0,     0,     0],
       [  101,  1045,  2572,  3076,  2770,  2000, 10930, 12325,  2072,
         2118,  2311,  1012,   102],
       [  101,  2026,  5440,  4730,  2653,  2003,  1039,  1009,  1009,
         1012,   102,     0,     0]])>

In [15]:
print(tokenizer.decode(encoded_inputs['input_ids'][0]))
print(tokenizer.decode(encoded_inputs['input_ids'][1]))
print(tokenizer.decode(encoded_inputs['input_ids'][2]))

[CLS] is this a valid sentence? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD]
[CLS] i am student running to yonsei university building. [SEP]
[CLS] my favorite programming language is c + +. [SEP] [PAD] [PAD]


In [16]:
encoded_input2 = tokenizer("What is your name?","My name is Seungone.",return_tensors='tf')
print(encoded_input2)

{'input_ids': <tf.Tensor: shape=(1, 15), dtype=int32, numpy=
array([[ 101, 2054, 2003, 2115, 2171, 1029,  102, 2026, 2171, 2003, 7367,
        5575, 5643, 1012,  102]])>, 'token_type_ids': <tf.Tensor: shape=(1, 15), dtype=int32, numpy=array([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]])>, 'attention_mask': <tf.Tensor: shape=(1, 15), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])>}


In [18]:
print(tokenizer.decode(encoded_input2['input_ids'][0]))

[CLS] what is your name? [SEP] my name is seungone. [SEP]
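
For a sentence pair, `token_type_ids` marks which sentence each token belongs to. The sketch below rebuilds them from the ids printed above by switching segments after the first `[SEP]`. `segment_ids` is a hypothetical helper for illustration, assuming the BERT layout `[CLS] A [SEP] B [SEP]`.

```python
SEP_ID = 102  # id of [SEP], as seen in the output above

# Token ids copied from the output above.
input_ids = [101, 2054, 2003, 2115, 2171, 1029, 102,
             2026, 2171, 2003, 7367, 5575, 5643, 1012, 102]

def segment_ids(ids, sep_id=SEP_ID):
    """Assign 0 up to and including the first [SEP], then 1 for the rest."""
    segments, current = [], 0
    for tok in ids:
        segments.append(current)
        if tok == sep_id:
            current = 1  # tokens after the first [SEP] belong to the second sentence
    return segments

print(segment_ids(input_ids))
```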


In [19]:
output = model(encoded_inputs)
print(output)

TFBaseModelOutputWithPooling(last_hidden_state=<tf.Tensor: shape=(3, 13, 768), dtype=float32, numpy=
array([[[ 0.1584683 ,  0.05394202, -0.02173148, ..., -0.1263333 ,
          0.06896812,  0.46639672],
        [ 0.12491301, -0.1220137 , -0.03282016, ...,  0.3831069 ,
          1.0131848 ,  0.6159238 ],
        [-0.20566858, -0.6244838 ,  0.16080046, ..., -0.00229507,
         -0.05117407,  0.5581342 ],
        ...,
        [ 0.22871093,  0.44744945,  0.3578546 , ...,  0.46337926,
          0.10052502, -0.2444902 ],
        [ 0.18982568, -0.05153685, -0.08534762, ...,  0.3760352 ,
          0.197602  , -0.27288702],
        [ 0.1412916 , -0.01042851, -0.01311123, ...,  0.36208832,
          0.18744272, -0.21668495]],

       [[-0.22801131,  0.15609686,  0.22225544, ..., -0.577901  ,
          0.35622063,  0.18715926],
        [-0.16532849, -0.3931642 ,  0.09108079, ..., -0.5639909 ,
          0.07929863, -0.29892743],
        [-0.3986258 , -0.07625765,  0.6550277 , ..., -0.6394473 ,
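
`last_hidden_state` has one vector per token, including the padded positions. One common way to turn it into a single sentence vector is to average only the real tokens, using the attention mask. The sketch below uses toy numbers instead of the real 768-dimensional output, and `masked_mean` is a hypothetical helper, not something the library provides under that name.

```python
def masked_mean(hidden, mask):
    """Average token vectors, ignoring positions where mask is 0."""
    dim = len(hidden[0])
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy stand-in for one row of last_hidden_state: 4 tokens, hidden size 2.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]  # the last two positions are padding

sentence_vector = masked_mean(hidden, mask)
print(sentence_vector)
```

Without the mask, the padding vectors would be averaged in as well and shift the result.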
  

In [34]:
from transformers import BertTokenizer, TFBertForMaskedLM
import tensorflow as tf

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertForMaskedLM.from_pretrained('bert-base-cased')

inputs = tokenizer("The capital of France is [MASK].", return_tensors="tf")
inputs["labels"] = tokenizer("The capital of France is Paris.", return_tensors="tf")["input_ids"]

outputs = model(inputs)
loss = outputs.loss
logits = outputs.logits

print(outputs)
print()
print(loss)
print()
print(logits)

All model checkpoint layers were used when initializing TFBertForMaskedLM.

All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


TFMaskedLMOutput(loss=<tf.Tensor: shape=(9,), dtype=float32, numpy=
array([1.2178121e+01, 5.3454580e+00, 9.5798366e-04, 6.0269046e-03,
       3.3788753e-03, 5.6930067e-04, 8.1031382e-01, 4.1392818e-04,
       2.3580479e+01], dtype=float32)>, logits=<tf.Tensor: shape=(1, 9, 28996), dtype=float32, numpy=
array([[[ -7.1545234,  -6.9931316,  -7.182646 , ...,  -5.9123926,
          -5.6732936,  -5.9854193],
        [ -8.018968 ,  -8.13193  ,  -8.050915 , ...,  -6.5679274,
          -6.4058185,  -6.8997593],
        [ -4.977202 ,  -6.1780906,  -6.066909 , ...,  -5.636203 ,
          -4.6602654,  -5.124104 ],
        ...,
        [ -3.4420216,  -3.2557287,  -3.5732915, ...,  -2.460596 ,
          -2.6494863,  -3.195194 ],
        [-10.588984 , -10.462034 , -11.718096 , ...,  -7.4646177,
          -9.954243 ,  -8.39268  ],
        [-14.889968 , -14.887321 , -14.45686  , ..., -11.658825 ,
         -13.0151005, -11.607314 ]]], dtype=float32)>, hidden_states=None, attentions=None)

tf.Tensor(
[1.

In [44]:
from transformers import FillMaskPipeline

# Use a distinct name so the `pipeline` function imported earlier is not shadowed.
fill_mask = FillMaskPipeline(model, tokenizer)

result = fill_mask("The capital of France is [MASK].")
print(result)

[{'sequence': 'The capital of France is Paris.', 'score': 0.44471848011016846, 'token': 2123, 'token_str': 'P a r i s'}, {'sequence': 'The capital of France is Lyon.', 'score': 0.09396772086620331, 'token': 10067, 'token_str': 'L y o n'}, {'sequence': 'The capital of France is Toulouse.', 'score': 0.0823521688580513, 'token': 18367, 'token_str': 'T o u l o u s e'}, {'sequence': 'The capital of France is Lille.', 'score': 0.07515732944011688, 'token': 25411, 'token_str': 'L i l l e'}, {'sequence': 'The capital of France is Marseille.', 'score': 0.05692743510007858, 'token': 17851, 'token_str': 'M a r s e i l l e'}]
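
The score for "Paris" can be cross-checked against the loss printed in the previous cell. Assuming that loss is the per-token cross-entropy (the negative log-probability of the label token) and that index 6 of the 9 values is the `[MASK]` position, `exp(-loss)` should be close to the pipeline score. Both numbers below are copied by hand from the outputs above.

```python
import math

# Per-token loss at the [MASK] position (index 6 of the 9 values printed earlier).
mask_loss = 8.1031382e-01

# Pipeline score for 'Paris', from the output above.
pipeline_score = 0.44471848011016846

# Cross-entropy is -log(p), so p = exp(-loss).
prob_from_loss = math.exp(-mask_loss)
print(round(prob_from_loss, 4), round(pipeline_score, 4))
```

If the two values agree, the fill-mask score is simply the softmax probability of that token at the masked position.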


https://huggingface.co/transformers/main_classes/pipelines.html#transformers.QuestionAnsweringPipeline
https://huggingface.co/transformers/task_summary.html
https://huggingface.co/deepset/bert-base-cased-squad2?text=What+is+my+name%3F
https://huggingface.co/transformers/model_doc/bert.html
https://huggingface.co/tuner007/t5_abs_qa
https://huggingface.co/transformers/pretrained_models.html