# Introduction

# Practical

## Downloading Required Files
- `tensorflow_hub` - Library for loading trained, reusable models from the TensorFlow Hub repository
- `tensorflow_text` - TF.Text is a TensorFlow library of text-related ops, modules, and subgraphs. The library can perform the preprocessing regularly required by text-based models, and includes other features useful for sequence modeling not provided by core TensorFlow.

In [None]:
pip install tensorflow-text

Collecting tensorflow-text
  Downloading tensorflow_text-2.6.0-cp37-cp37m-manylinux1_x86_64.whl (4.4 MB)
     |████████████████████████████████| 4.4 MB 5.2 MB/s 
Installing collected packages: tensorflow-text
Successfully installed tensorflow-text-2.6.0


In [None]:
# importing libraries
import tensorflow_hub as hub
import tensorflow_text as text

## Setting the Download URLs
- For the preprocessor
- For the encoder

In [None]:
preprocess_url = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"

## Creating Preprocessing Objects
- Use a hub layer (`hub.KerasLayer`) to achieve this: it wraps the downloaded model in a callable object, so values can be passed to it like a function
- `bert_preprocess_model`: preprocesses the text and returns a dictionary object
- `test_text`: the text we will pass to `bert_preprocess_model()` to get the preprocessed text


In [None]:
# creating hub layers
bert_preprocess_model = hub.KerasLayer(preprocess_url)

In [None]:
# creating testing text
test_text = ['I love pizza', 'India is my country', 'Italy is fun place to visit']

In [None]:
# preprocessing text
text_preprocessed = bert_preprocess_model(test_text)
text_preprocessed.keys()

dict_keys(['input_mask', 'input_word_ids', 'input_type_ids'])

## Understanding Preprocessed Dictionary Keys

- input_mask
- input_type_ids
- input_word_ids


### Input Mask

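- A mask marking which positions hold real tokens: 1 for a real token, 0 for padding
- Every sentence is wrapped in a `CLS` token at the start and a `SEP` token at the end, so the number of 1s in each row is `sentence size + 2`
- The token ids are 101 for `CLS` and 102 for `SEP` (these ids appear in `input_word_ids`, not in the mask itself)
- Tensor shape - (no. of sentences, 128), where 128 is the maximum sequence length; positions beyond the end of the sentence are padded with zeros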
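The rule above can be sketched in plain Python. This is only a toy illustration, not the TF Hub preprocessing model: `build_input_mask` is a hypothetical helper, and the token list is assumed to be already split into one token per word.

```python
# Toy illustration of an input mask; NOT the TF Hub preprocessing model.
MAX_LEN = 128  # sequence length seen in the tensor shapes of this notebook


def build_input_mask(tokens, max_len=MAX_LEN):
    """Return 1 for every real token ([CLS] and [SEP] included), 0 for padding."""
    real_length = len(tokens) + 2  # +2 for the [CLS] and [SEP] tokens
    if real_length > max_len:
        raise ValueError("sentence is too long for this toy example")
    return [1] * real_length + [0] * (max_len - real_length)


mask = build_input_mask(["i", "love", "pizza"])
print(sum(mask), len(mask))  # 5 128
```

The five 1s match the first row of the `input_mask` tensor printed for `'I love pizza'`.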

In [None]:
# input_mask 
text_preprocessed['input_mask']

<tf.Tensor: shape=(3, 128), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0

### Input Type Ids
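- Gives a segment id to each token, so that multiple sentences packed into one input can be told apart
- Useful in some scenarios, such as inputs made of sentence pairs
- Not useful in our case: every input is a single sentence, so all ids are 0
- Helps in contextualizing the sentence
- Useful for an object like a pandas data frame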
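To make the segment ids concrete, here is a minimal sketch of how they are usually laid out for a single sentence and for a sentence pair. `build_type_ids` is a hypothetical helper written for this illustration, not a function from `tensorflow_text`, and the pair layout (`[CLS] A [SEP] B [SEP]`) is the usual BERT convention rather than something this notebook runs.

```python
# Toy illustration of input_type_ids (segment ids); hypothetical helper.
def build_type_ids(tokens_a, tokens_b=None, max_len=128):
    """Segment A ([CLS] + tokens_a + [SEP]) gets 0, segment B (tokens_b + [SEP]) gets 1."""
    ids = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        ids += [1] * (len(tokens_b) + 1)
    if len(ids) > max_len:
        raise ValueError("input is too long for this toy example")
    return ids + [0] * (max_len - len(ids))  # padding positions are 0


single = build_type_ids(["i", "love", "pizza"])
pair = build_type_ids(["i", "love", "pizza"], ["italy", "is", "fun"])
print(sum(single))  # 0
print(pair[:10])    # [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
```

With single sentences, as in this notebook, every id is 0, which is why the tensor below contains only zeros.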

In [None]:
# input_type_ids
text_preprocessed['input_type_ids']

<tf.Tensor: shape=(3, 128), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0

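### Input Word Ids

- Holds the token ids of the input sequences
- Gives an id to each individual word (token); the same word always gets the same id
- Each word is encoded (the ids come from the model's vocabulary), padded with 0, and separated by the special tokens 101 (`CLS`) and 102 (`SEP`)
- Shape: (no. of sentences, 128), where 128 is the maximum length of each sentence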
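The lookup can be sketched with a toy vocabulary. The ids below are copied from the tensor this notebook prints for `'I love pizza'`; the `TOY_VOCAB` dictionary, the `encode` helper and the whitespace split are assumptions made for illustration (the real preprocessor uses its own full vocabulary and tokenizer).

```python
# Toy vocabulary lookup; ids copied from the notebook output for 'I love pizza'.
TOY_VOCAB = {
    "[PAD]": 0,
    "[CLS]": 101,
    "[SEP]": 102,
    "i": 1045,
    "love": 2293,
    "pizza": 10733,
}


def encode(sentence, vocab=TOY_VOCAB, max_len=128):
    """Lowercase, split on whitespace, wrap in [CLS]/[SEP], map to ids, pad with 0."""
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    ids = [vocab[token] for token in tokens]  # raises KeyError for unknown words
    return ids + [vocab["[PAD]"]] * (max_len - len(ids))


ids = encode("I love pizza")
print(ids[:6])  # [101, 1045, 2293, 10733, 102, 0]
```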

In [None]:
# input_word_ids
text_preprocessed['input_word_ids']

<tf.Tensor: shape=(3, 128), dtype=int32, numpy=
array([[  101,  1045,  2293, 10733,   102,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0, 

## Creating Embeddings

- Embeddings are representations of words in the form of numbers (vectors)

- In word2vec, a word is given the same embedding in every sentence, i.e. even in a different scenario (context) it produces the same vector for that word.

- BERT (`Bidirectional Encoder Representations from Transformers`) solves this issue by creating contextualized embeddings.
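The difference between a fixed and a contextualized embedding can be illustrated with a deliberately tiny sketch. Everything here is made up for the illustration: the 2-dimensional `STATIC` table and the `toy_contextual_embed` function, which simply mixes each word vector with the sentence average. This is not how BERT computes embeddings; it only shows that a context-dependent vector changes with the surrounding words, while a lookup table cannot.

```python
# Made-up 2-dimensional vectors, for illustration only.
STATIC = {"river": [1.0, 0.0], "bank": [0.5, 0.5], "account": [0.0, 1.0]}


def static_embed(tokens):
    """Fixed lookup: a word gets the same vector in every sentence."""
    return [STATIC[token] for token in tokens]


def toy_contextual_embed(tokens):
    """Toy stand-in for context: average each word vector with the sentence mean."""
    vectors = static_embed(tokens)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return [[(v[d] + mean[d]) / 2 for d in range(dim)] for v in vectors]


s1 = ["river", "bank"]
s2 = ["bank", "account"]
print(static_embed(s1)[1] == static_embed(s2)[0])  # True
print(toy_contextual_embed(s1)[1])  # [0.625, 0.375]
print(toy_contextual_embed(s2)[0])  # [0.375, 0.625]
```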


USAGE: 
- WORD EMBEDDING GENERATION: at each time step, the representation of a word also carries attention to, or semantics of, the other words in the sentence (in the form of feature vectors). Looking at the other parts of the sentence lets the model predict words with less bias and create good embedding vectors.

- CLASSIFICATION: The whole sentence is processed in parallel, which speeds up the calculations. Because the model sees the entire sentence in one go, it captures the meaning of each word in context and can predict with high probability values.

- WORD / SENTENCE PREDICTION: Building on the features above, it can also suggest new words or even sentences, similar to autocomplete.


STEPS INVOLVED
- Create an encoder/model object using the hub API; it acts as a callable, like a function
- Pass `text_preprocessed` into the model (important: the encoder expects the preprocessed dictionary, not raw text)
- Check the keys
- The model returns a dictionary

In [None]:
# create our model- encoder which encodes to create word embeddings
bert_encoder_model = hub.KerasLayer(encoder_url)

In [None]:
# encoding - creating embeddings
bert_results = bert_encoder_model(text_preprocessed)

In [None]:
# check the keys
bert_results.keys()

dict_keys(['default', 'sequence_output', 'pooled_output', 'encoder_outputs'])

## Understanding Embedding Dictionary

- pooled_output
- sequence_output
- encoder_outputs

### Pooled Output
- Embedding for the entire sentence
- Shape: `(no. of sentences, no. of hidden units)`, where the number of hidden units is 768 in this case
- These 768 values are generally non-zero, because BERT carries some of the contextual meaning of the sentence in each feature. The values can be negative or positive, and together they let us measure how closely two sentences relate to each other. This is a large part of why BERT is so popular and powerful in NLP tasks
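One common way to measure how closely two sentence embeddings relate is cosine similarity. The notebook does not compute it, so the sketch below is an assumption about how the vectors might be used; `cosine_similarity` is a hypothetical helper and the short vectors are placeholders, not values from `pooled_output`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Placeholder vectors, NOT real BERT outputs.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

With the tensors from this notebook, two rows could be compared by passing `bert_results['pooled_output'][0].numpy()` and `bert_results['pooled_output'][1].numpy()` to the same function.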

In [None]:
# pooled_output
bert_results['pooled_output']

<tf.Tensor: shape=(3, 768), dtype=float32, numpy=
array([[-0.82588106, -0.23015173,  0.44445992, ...,  0.36432374,
        -0.6134956 ,  0.88324934],
       [-0.86772424, -0.39655608, -0.37747943, ..., -0.11730396,
        -0.647971  ,  0.8981817 ],
       [-0.73642266, -0.19927372,  0.39546803, ...,  0.17739156,
        -0.57022023,  0.7637291 ]], dtype=float32)>

### Sequence Output

- This is the representation (embedding) of each individual word of a sentence
- Due to preprocessing, every sentence has a length of 128
- The hidden state size is 768
- The first dimension is the number of sentences

So the shape of the sequence output tensor is `(no. of sentences, length of sentence - 128, no. of hidden units - 768)`
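Since every sentence is padded to 128 positions, the sequence output also contains vectors for the padding positions (the printed tensor below shows non-zero values even in its last rows). A minimal sketch of keeping only the vectors of real tokens, using the input mask, could look like this. The helper `real_token_vectors` and the tiny toy tensors (maximum length 4, hidden size 2) are assumptions made for illustration.

```python
def real_token_vectors(sequence_output, input_mask):
    """For each sentence, keep only the vectors whose mask value is 1."""
    return [
        [vector for vector, flag in zip(sentence_vectors, sentence_mask) if flag == 1]
        for sentence_vectors, sentence_mask in zip(sequence_output, input_mask)
    ]


# Toy data: 2 sentences, max length 4, hidden size 2 (instead of 128 and 768).
toy_sequence_output = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
    [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6], [1.7, 1.8]],
]
toy_input_mask = [[1, 1, 1, 0], [1, 1, 0, 0]]

kept = real_token_vectors(toy_sequence_output, toy_input_mask)
print([len(sentence) for sentence in kept])  # [3, 2]
```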


In [None]:
# sequence_output
bert_results['sequence_output']

<tf.Tensor: shape=(3, 128, 768), dtype=float32, numpy=
array([[[ 0.13989714,  0.30410826,  0.06352297, ..., -0.10268718,
          0.22148415,  0.149779  ],
        [ 0.3213734 ,  0.29354593, -0.1408113 , ..., -0.20841485,
          0.8356771 ,  0.20442364],
        [ 1.1886758 ,  0.64992356,  0.67424935, ...,  0.10617817,
          0.41522315, -0.23921788],
        ...,
        [ 0.20025288,  0.2712966 ,  0.4500041 , ...,  0.22973931,
          0.02177422,  0.11564764],
        [ 0.12034723,  0.1726704 ,  0.36960527, ...,  0.277618  ,
          0.02981649,  0.06090213],
        [-0.23403431, -0.10138147,  0.34712183, ...,  0.44735387,
          0.04013278, -0.04988691]],

       [[-0.02354234,  0.37282184, -0.09210153, ..., -0.44056383,
          0.01131824,  0.33282456],
        [-0.61942905, -0.2867513 , -0.7241891 , ..., -0.3831982 ,
          0.31782785, -0.2673176 ],
        [ 0.22915015, -0.2607807 , -0.25905165, ..., -0.3867034 ,
          0.5324209 ,  0.7462699 ],
        ...,

### Encoder Output

- It is a list of the intermediate activations, one tensor per transformer block (12 blocks for this `L-12` model); the last entry is the same as the sequence output.

- Each tensor has the same shape as the sequence output - `(no. of sentences, length of each sentence - 128, no. of hidden units - 768)`
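The relationship between the two keys can be sketched with a toy stack of blocks. The three lambda "blocks" and the `run_encoder` function are made up for this illustration and have nothing to do with real transformer layers; the point is only that the list collects the output of every block, so its last entry is the final output.

```python
def run_encoder(x, blocks):
    """Apply the blocks in order and record the output of each one."""
    outputs = []
    for block in blocks:
        x = block(x)
        outputs.append(x)
    return {"encoder_outputs": outputs, "sequence_output": x}


# Made-up stand-ins for transformer blocks.
toy_blocks = [
    lambda values: [v + 1 for v in values],
    lambda values: [v * 2 for v in values],
    lambda values: [v - 3 for v in values],
]

result = run_encoder([1, 2], toy_blocks)
print(len(result["encoder_outputs"]))  # 3
print(result["encoder_outputs"][-1] == result["sequence_output"])  # True
```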

In [None]:
# encoder_output
bert_results['encoder_outputs']

[<tf.Tensor: shape=(3, 128, 768), dtype=float32, numpy=
 array([[[ 0.17126319,  0.05714687, -0.06743662, ...,  0.01732922,
           0.1435501 ,  0.04887234],
         [ 0.63244945,  1.2133181 , -0.05207326, ...,  0.5324669 ,
           0.8438145 ,  0.26414838],
         [ 1.3925556 ,  0.6356846 ,  0.74228704, ...,  0.43790025,
           0.9343474 ,  0.07733156],
         ...,
         [-0.05122618, -0.17709933,  0.7356079 , ...,  0.4184736 ,
          -0.19071539,  0.02412983],
         [-0.15421629, -0.2190762 ,  0.59030294, ...,  0.48437572,
          -0.10654706, -0.06792145],
         [-0.02584009, -0.1486233 ,  0.59410834, ...,  0.7565186 ,
          -0.38720998, -0.13480783]],
 
        [[ 0.15777126,  0.04426724, -0.13644364, ..., -0.00874417,
           0.12490405,  0.09001195],
         [-0.18299708,  0.27142665, -0.63025725, ..., -0.63325214,
           0.11569588,  0.18398012],
         [-0.7387832 , -0.22359392, -0.29233348, ...,  0.04634893,
           0.6207117 ,  0.38

In [None]:
# check if encoder_outputs[-1] (the last block's output) is the same as sequence_output

bert_results['encoder_outputs'][-1] == bert_results['sequence_output']

<tf.Tensor: shape=(3, 128, 768), dtype=bool, numpy=
array([[[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  