# Bidirectional Encoder Representations from Transformers (BERT)
**BERT (Bidirectional Encoder Representations from Transformers)** is a pre-trained deep learning model developed by Google for natural language processing (NLP) tasks. It uses a transformer-based architecture to learn contextual relations between words in a sentence or text. BERT is trained on a large corpus of text data and can be fine-tuned for various NLP tasks such as question answering, sentiment analysis, and text classification. When it was released, BERT achieved state-of-the-art results on several benchmark datasets in the NLP field.

Suppose you are working on a text classification task where the input to the model is a single word and you want to classify each word as either a person or a country.

<img src = "img.jpg" width = "800px" height = "600px"></img>

<img src = "img2.jpg" width = "800px" height = "600px"></img>

* Now think about how the model processes these words. In other words, how can we capture the similarity between two words?

<img src = "img1.jpg" width = "800px" height = "600px"></img>

* The simple answer is that we look at the features of each input word and compare them. If the features are similar, the words are classified into the same category, as shown in the image below.
* The image shows that the first two homes are similar to each other, but the third one is not similar to the previous two.

<img src = "img3.jpg" width = "800px" height = "600px"></img>

* Similarly, we represent person and country names as features, and from those features we can decide whether a word is a person or a country.

<img src = "img4.jpg" width = "800px" height = "600px"></img>

<img src = "img5.jpg" width = "800px" height = "600px"></img>

<img src = "img6.jpg" width = "800px" height = "600px"></img>
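The feature comparison described above can be sketched in a few lines of Python. The names, the four feature columns, and all the numbers below are invented for illustration (they are not taken from the images), and cosine similarity is assumed here as one common way to compare two feature vectors.

```python
import math

# Hypothetical hand-made features: [is_person, is_location, has_population, has_job]
features = {
    "Alice":  [0.9, 0.1, 0.0, 0.8],
    "Bob":    [0.8, 0.0, 0.1, 0.9],
    "France": [0.0, 0.9, 1.0, 0.1],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

person_vs_person = cosine_similarity(features["Alice"], features["Bob"])
person_vs_country = cosine_similarity(features["Alice"], features["France"])

# The two person names score much closer to 1 than the person/country pair.
print(round(person_vs_person, 3), round(person_vs_country, 3))
```

With these made-up values the two person names end up far more similar to each other than either is to the country name, which is the behaviour we want from real embeddings as well.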

* The approach shown in the images above represents text as numbers using 'Word2Vec'. The main issue with 'Word2Vec' is that it generates a fixed embedding: each word always gets the same vector, no matter which sentence it appears in. But a word used in different contexts can have different meanings, and a single fixed vector cannot capture that.
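A minimal sketch of this limitation, assuming a toy lookup table in place of a trained Word2Vec model (the vectors and the helper name `embed` are made up for illustration):

```python
# Toy static embedding table. The values are invented, not real Word2Vec output.
static_embeddings = {
    "fair": [0.21, -0.47, 0.88],
    "deal": [0.05, 0.33, -0.12],
    "went": [-0.60, 0.10, 0.42],
}

def embed(sentence):
    """Look up each known word in the table. The surrounding words are never consulted."""
    return {
        word: static_embeddings[word]
        for word in sentence.lower().split()
        if word in static_embeddings
    }

fair_as_adjective = embed("He got a fair deal")["fair"]
fair_as_noun = embed("We went to the fair")["fair"]

# Both uses of "fair" map to exactly the same vector, although the meanings differ.
print(fair_as_adjective == fair_as_noun)
```

A static table like this has no way to tell the two senses of "fair" apart, because the lookup only ever sees the word itself.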

<img src = "img7.jpg" width = "800px" height = "600px"></img>

* Because of this problem, we need a model that generates a contextualized representation of a word. That is, the model reads the whole sentence and then generates a numeric representation for a specific word.
### BERT allows us to do exactly that.

* BERT can generate contextualized embeddings, so it captures the meaning of a word as it is used in its sentence. As we see in the following image, the words 'fair' and 'unbiased' are similar in meaning, so the model generates nearly the same vectors for them. Likewise, the word 'fair' in sentence number 3 of the image is closest in meaning to the word 'carnival', so the model generates nearly the same vectors for these two words.

<img src = "img8.png" width = "800px" height = "600px"></img>

* **BERT can also generate an embedding for an entire sentence: it produces a single vector for the whole sentence.**

<img src = "img9.jpg" width = "800px" height = "600px"></img>

* BERT was trained by Google on **2,500 million words of Wikipedia** and **800 million words from books.**

<img src = "img10.jpg" width = "800px" height = "600px"></img>

* For training, they used two approaches:
    1. Masked Language Model
    2. Next Sentence Prediction
* Today **Google Search** is powered by BERT. When you search for something and type the first one or two words, it gives you suggestions for your search.

<img src = "img11.jpg" width = "800px" height = "600px"></img>
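The first training approach listed above, the masked language model, can be sketched as follows. This is a simplified illustration, not the actual training code: it only replaces the chosen tokens with `[MASK]`, whereas the published BERT recipe masks about 15% of tokens and sometimes substitutes a random token or keeps the original instead. The helper name `mask_tokens` and the example sentence are assumptions of this sketch.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels holds the original token at each masked position and None elsewhere,
    so a model can be trained to predict only the hidden tokens.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

tokens = "the man went to the store to buy a gallon of milk".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

Next sentence prediction works on pairs instead: the model sees two sentences and has to predict whether the second one actually follows the first in the original text.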

* Links for more about BERT:
    http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

* Now let's look at some TensorFlow code and generate sentence and word embeddings.
* To locate a BERT model we go to **TensorFlow Hub**, which is a repository of many different models. Go to word embedding, then to BERT, where you will see a section for BERT. Several BERT models are listed there; we'll use the basic one, which has 12 encoder layers.

In [6]:
# Required libraries:
import tensorflow_hub as hub
import tensorflow_text as text

* BERT models tensorflow hub link: https://tfhub.dev/google/collections/bert/1

In [5]:
# The best thing about BERT on TensorFlow Hub is that we can take the encoder URL and paste it directly into Jupyter to use it.
# Each model also has a matching pre-processing URL, which pre-processes your text.
preprocess_url = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"

In [8]:
# Next we create a hub layer that takes the pre-processing URL as an argument. It returns something like a function pointer:
# we can pass it a batch of sentences and it will pre-process them.

bert_preprocess_model = hub.KerasLayer(preprocess_url)


In [None]:
# Say we are building a movie classification model, so we pass a couple of sentences to be pre-processed by the model.
# The output of this function pointer is a dictionary, so we print only the keys because the object may be big:
text_test = ['nice movie indeed','I love python programming']
text_preprocessed = bert_preprocess_model(text_test)
text_preprocessed.keys()

In [None]:
# The two sentences are now pre-processed.
# We can check the individual elements of this dictionary. The first one is 'input_mask'.
text_preprocessed['input_mask']

* The shape is (2, 128). The 2 is because we have two sentences, and for each sentence we get a mask. The first sentence has three words, but BERT generated five ones (1s). The reason is the way BERT works: it adds a special token called 'CLS' at the beginning and a 'SEP' token at the end of each sentence to separate sentences. Counting those, we get 5. The same idea applies to the second sentence. 128 is the maximum sequence length; the remaining positions are padding and are marked with 0.
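To make the count concrete, here is a minimal sketch that rebuilds such a mask by hand. It assumes every word maps to exactly one token (true for short common words, not in general) and uses plain whitespace splitting in place of the real tokenizer. `build_input_mask` is a hypothetical helper, not part of TensorFlow Hub, and it does not handle sentences longer than the sequence length.

```python
def build_input_mask(sentence, seq_length=128):
    """Return (tokens, mask) for one short sentence.

    Whitespace splitting stands in for the real WordPiece tokenizer (assumption).
    """
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    # 1 marks a real token, 0 marks a padding position.
    mask = [1] * len(tokens) + [0] * (seq_length - len(tokens))
    return tokens, mask

tokens, mask = build_input_mask("nice movie indeed")
print(tokens)
print(sum(mask), len(mask))
```

The three words plus the two special tokens give five real positions; every other position up to the sequence length is padding.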


In [None]:
# The next element is 'input_word_ids', which is quite useful. It assigns a vocabulary ID to each token. This is only the
# pre-processing step; the word embeddings are generated in the next step.

# text_preprocessed['input_type_ids']
text_preprocessed['input_word_ids']

### Special Tokens:
* **101 --> CLS token**
* **102 --> SEP token**

In [None]:
# Once pre-processing is done, we create another layer from the encoder URL.
# We'll call this layer the BERT model. As before, it returns something like a function pointer: we pass it our
# pre-processed text and it generates embeddings for the sentences.
# Again we just look at the keys.

bert_model = hub.KerasLayer(encoder_url)
bert_results = bert_model(text_preprocessed)
bert_results.keys()

In [None]:
# We will look at three keys in the result; the first one is 'pooled_output'.
# 'pooled_output' is the embedding for the entire sentence.
bert_results['pooled_output']

* We had two sentences, and we see one embedding for each. The embedding vector size is 768. This vector represents the sentence 'nice movie indeed' in the form of numbers, and there is a second embedding vector for the 2nd sentence.
* These embeddings are powerful and can be used for many NLP tasks, such as movie review classification or named entity recognition. BERT helps you generate meaningful vectors from your sentences.

In [None]:
# Now let's look at the 2nd key, which is 'sequence_output'.
# 'sequence_output' holds the individual token embedding vectors. Its shape is (2, 128, 768): 2 is again for the 2 sentences,
# 128 is the padded length of each sentence, and each position in the sentence has a vector of size 768.
# The padding positions also contain numbers because the embeddings are contextualized.
bert_results['sequence_output']
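One common way to use `sequence_output` together with `input_mask` is to average only the vectors at real token positions. The sketch below shows the idea with plain Python lists and a tiny hidden size of 3 instead of 768. The numbers are made up, and `masked_mean` is a hypothetical helper. It is an alternative to `pooled_output`, not a description of what the TF Hub encoder does internally.

```python
def masked_mean(token_vectors, mask):
    """Average the vectors whose mask entry is 1, ignoring padded positions."""
    kept = [vec for vec, m in zip(token_vectors, mask) if m == 1]
    if not kept:
        raise ValueError("mask selects no tokens")
    hidden_size = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(hidden_size)]

# Four positions: two real tokens followed by two padding positions (invented values).
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [9.0, 9.0, 9.0],
    [7.0, 7.0, 7.0],
]
mask = [1, 1, 0, 0]

sentence_vector = masked_mean(token_vectors, mask)
print(sentence_vector)
```

Because the mask zeroes out the last two positions, only the first two vectors contribute to the average.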

In [None]:
# If we display the length of 'encoder_outputs', it will be 12. The reason is that we are using
# BERT base, the smaller BERT model: it has 12 encoder layers, and each layer produces embedding vectors of size 768.
len(bert_results['encoder_outputs'])

In [None]:
# 'encoder_outputs' is simply the output of each individual encoder layer.
bert_results['encoder_outputs'][0]

* To learn more about these elements, see the model page: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4

<img src = "img12.jpg" width = "800px" height = "600px"></img>