###### resource: https://github.com/Applied-Language-Technology/notebooks/blob/main/part_iii/05_embeddings_continued.ipynb

#### **E. Contextual Embeddings**

So far, the embeddings that we have learned are described as `static`. This means that the vector representation for a given token remains the same **regardless** of the `context` in which the token occurs.

Words can have more than one meaning depending on the context in which they appear:

1. Manila is the `capital` of the Philippines.
2. Protect your `capital`, profits just follow.

Static word embeddings cannot capture this distinction because a word always has the same vector representation: the values of its `n-dimensional` vector remain the same in every sentence.
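
To make this concrete, here is a minimal sketch of a static embedding lookup. The table `static_embeddings`, its three-dimensional vectors and the helper `embed` are all made up for illustration; they are not part of spaCy or of any trained model.

```python
# Toy static embedding table. The words and 3-dimensional vectors are
# made up for illustration; real models learn hundreds of dimensions.
static_embeddings = {
    "manila": [0.9, 0.1, 0.0],
    "capital": [0.5, 0.5, 0.2],
    "profits": [0.1, 0.8, 0.3],
}


def embed(sentence):
    """Look up a vector for every known word; context is never consulted."""
    words = sentence.lower().replace(",", "").replace(".", "").split()
    return {w: static_embeddings[w] for w in words if w in static_embeddings}


city = embed("Manila is the capital of the Philippines.")
money = embed("Protect your capital, profits just follow.")

# The lookup returns the very same vector in both sentences.
print(city["capital"] == money["capital"])  # True
```

Because the lookup only depends on the word itself, nothing in the surrounding sentence can change the result.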

We would like to encode the context in which a word occurs to draw out these differences in meaning, using an architecture called the **`Transformer`**.

A **`Transformer`** is a neural network architecture that is capable of encoding information about the context in which a token occurs. It improved performance for traditional natural language processing tasks such as part-of-speech tagging and syntactic parsing, but it is `computationally heavy`.

The **`Transformer`** has an **encoder-decoder** architecture and works through **sequence-to-sequence** learning: it passes a sequence of tokens (e.g. a sentence) through the encoder, and the decoder then predicts the output sequence one word at a time.

It accomplishes this by iterating through **encoder layers**. Each encoder layer generates **encodings**, which capture which parts of the input sequence are relevant to each other, and passes these encodings to the next encoder layer. The **decoder layers** take in the encodings and use their derived context to generate the output sequence.

Transformers are trained in a **semi-supervised** fashion. They are first pre-trained in an unsupervised manner on a large unlabeled dataset, and are then trained in a supervised manner so that they perform better.

Transformers do not necessarily process a sequence in order. They use an **"attention"** mechanism, which provides context around the words in the input sequence. Rather than starting at the beginning of the sequence, the Transformer attempts to identify the *context* that brings meaning to each word in a sequence.
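
The idea behind attention can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used by any real model: it assumes scaled dot-product self-attention in which queries, keys and values are all the input vectors themselves, and it leaves out the learned projection matrices and multiple attention heads of a real Transformer. The two-dimensional vectors and the names `softmax` and `self_attention` are made up for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values.

    Real Transformers first apply learned projection matrices; they are
    left out here to keep the sketch short.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Score the query against every position, including itself.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # The new vector is a weighted average of all input vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(dim)])
    return outputs


# Made-up 2-dimensional input vectors; "capital" starts out identical.
capital = [1.0, 1.0]
city_context = [[2.0, 0.0], capital]    # e.g. "Manila", "capital"
money_context = [capital, [0.0, 2.0]]   # e.g. "capital", "profits"

city_capital = self_attention(city_context)[1]
money_capital = self_attention(money_context)[0]

# Same input vector, different contexts, different output vectors.
print(city_capital != money_capital)  # True
```

Every output vector is a mixture of all the vectors in the sequence, so the representation of a word shifts towards whatever surrounds it.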

Because they do not have to work through a sequence one token at a time, Transformers can run much of their computation in **parallel**, which speeds up training.

In [1]:
import spacy

In [2]:
# Load a Transformer-based language model
nlp_trf = spacy.load('en_core_web_trf')

On the surface, a ***Language*** object that contains a Transformer-based model looks and works just like any other language model in spaCy.

However, if we look under the hood of the *Language* object under `nlp_trf` using its `pipeline` attribute, we will see that the first component in the processing pipeline is a Transformer.

In [3]:
nlp_trf.pipeline

[('transformer',
  <spacy_transformers.pipeline_component.Transformer at 0x20e5da7db20>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x20e5da7df40>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x20e5da930b0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x20e5daca300>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x20e5dacfc80>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x20e5da93200>)]

In [4]:
# Feed an example sentence to the model
example_doc = nlp_trf('Protect your capital, profits just follow ~ Paul Aldrin Enriquez')

# Check the length of the document
# It counts the punctuation marks
len(example_doc)

11

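The output of the Transformer is stored in a *TransformerData* object, which spaCy makes available under the custom attribute `trf_data` of the *Doc* (accessed as `example_doc._.trf_data`).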

The `tensors` attribute of the *TransformerData* object contains a list with vector representations generated by the Transformer for individual *Tokens* and the entire *Doc* object.

The first item in the `tensors` list under index 0 contains the output for the individual tokens.

In [5]:
example_doc._.trf_data.tensors[0].shape

(1, 16, 768)

This means there is one batch of `16` vectors, each with `768` dimensions.

In [6]:
# Check the first ten vectors (rows) of the tensor
example_doc._.trf_data.tensors[0][0][:10]

array([[ 0.11291292,  0.0949162 , -1.14669   , ..., -0.01264074,
         0.41317725, -0.72440493],
       [ 0.67019814, -1.5462617 , -0.10938556, ..., -2.2133718 ,
        -0.00850883, -0.4185671 ],
       [ 0.30647755,  0.17339891, -0.8266205 , ...,  0.9133857 ,
         0.96996236,  0.9239493 ],
       ...,
       [-0.37335178, -1.1102524 ,  0.87105846, ..., -1.744865  ,
         0.19510983, -0.20536007],
       [ 2.0650144 , -1.0129883 , -0.3600636 , ...,  0.67700756,
         1.2665759 ,  0.8351748 ],
       [-0.11842656, -1.4274666 ,  0.46804148, ..., -0.01015834,
        -0.40615833, -0.1324595 ]], dtype=float32)

The second item under index 1 holds the output for the entire *Doc* object.

In [7]:
example_doc._.trf_data.tensors[1].shape
# A single vector with 768 dimensions

(1, 768)

In [8]:
example_doc._.trf_data.tensors[1]

array([[ 3.32911432e-01,  2.82008320e-01, -1.22935511e-01,
        -4.33271453e-02, -1.10591464e-01, -2.17549235e-01,
         3.60825837e-01, -1.08922005e-01, -5.62521696e-01,
        -1.24820150e-01,  1.62289187e-01, -3.98453236e-01,
         2.47829735e-01,  5.80639005e-01, -4.30171937e-01,
         4.15430814e-01,  3.04646939e-01, -3.43187630e-01,
         1.08068369e-01,  2.72357017e-01, -2.20026150e-01,
         3.99266511e-01,  2.72856414e-01,  4.65857089e-01,
         1.78466421e-02, -1.69012547e-01,  1.08608790e-02,
        -2.04969168e-01,  3.82948130e-01,  6.41437769e-01,
         1.16692081e-01, -3.29945862e-01,  2.60516524e-01,
         6.70336664e-01, -5.85568905e-01,  2.46269591e-02,
        -1.35597423e-01, -4.42497544e-02,  6.23648822e-01,
        -3.30455810e-01, -1.45988777e-01, -2.04099566e-01,
         4.19752412e-02, -3.65055263e-01,  2.85328627e-01,
         1.26097113e-01,  8.49947631e-02, -1.24532990e-01,
         3.82689714e-01,  2.89329916e-01, -3.87665331e-0

The Transformer outputs 16 vectors for a spaCy Doc object of 11 tokens, so the two do not match.

The vocabulary of the English language alone is massive, and learning an embedding for every unique word would blow up the size of the model.

Because Transformers are trained on massive volumes of text, the model's vocabulary must be kept limited.

To address the issue, Transformers use more complex tokenizers that identify frequently occurring character sequences in the data and learn embeddings for these sequences instead. These sequences, which are often referred to as `subwords`, make up the vocabulary of the Transformer.

The subwords for a Doc are stored under the `tokens` attribute of the *TransformerData* object. This attribute is a dictionary that holds the subwords under the key `input_texts`.

In [9]:
example_doc._.trf_data.tokens['input_texts']

[['<s>',
  'Protect',
  'Ġyour',
  'Ġcapital',
  ',',
  'Ġprofits',
  'Ġjust',
  'Ġfollow',
  'Ġ~',
  'ĠPaul',
  'ĠAld',
  'rin',
  'ĠEn',
  'ri',
  'quez',
  '</s>']]

Here we can see how the document has been tokenized. The sequence begins with the tag `<s>` and ends with the tag `</s>`, and `Ġ` marks a subword that is preceded by a space. The more interesting part is how my name has been tokenized: `'ĠPaul'`, `'ĠAld'`, `'rin'`, `'ĠEn'`, `'ri'`, `'quez'`.

This is so because "Aldrin" and "Enriquez" are not in the Transformer vocabulary as whole words, but their subwords are. The vector representations of these subwords are used to construct the vector for each word of my name.

This keeps the vocabulary compact: the subwords that make up the vocabulary are those that occur frequently in the data on which the Transformer was trained.
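
How a word falls apart into known subwords can be illustrated with a toy splitter. Note that this is a swapped-in technique for illustration only: it uses a simple greedy longest-match rule, whereas the model's own tokenizer learns its subwords and merge rules from data. The vocabulary `VOCAB` and the function `split_into_subwords` are hypothetical.

```python
# Hypothetical subword vocabulary; the real model learns tens of
# thousands of entries from its training data.
VOCAB = {"Paul", "Ald", "rin", "En", "ri", "quez", "capital"}


def split_into_subwords(word, vocab):
    """Greedy longest-match split; unknown characters become single pieces."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining candidate first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces


print(split_into_subwords("Enriquez", VOCAB))  # ['En', 'ri', 'quez']
print(split_into_subwords("capital", VOCAB))   # ['capital']
```

A frequent word such as "capital" survives as a single piece, while a rare name is assembled from smaller pieces that the vocabulary does contain.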

To map these vectors to `Tokens` in the spaCy Doc object, we must retrieve alignment information from the `align` attribute of the TransformerData object.

The `align` attribute can be indexed using the indices of Token objects in the Doc object.

To exemplify, we can retrieve the last Token, "Enriquez", in the Doc object using the expression `example_doc[-1]`.

We then use the index of this Token in the Doc object to retrieve alignment data, which is stored under the `align` attribute.

More specifically, we need the information stored under the attribute `data`.

In [10]:
# Get the last spaCy Token, "Enriquez", and its alignment data
example_doc[-1], example_doc._.trf_data.align[-1].data

(Enriquez,
 array([[12],
        [13],
        [14]]))

In this case, the vectors at indices 12, 13 and 14 in the batch of 16 vectors contain the representations of the subwords `'ĠEn'`, `'ri'` and `'quez'`, which together make up "Enriquez".
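
One simple way to turn several subword vectors into a single Token vector is to average them. The sketch below shows only the arithmetic, using made-up two-dimensional vectors in place of the real 768-dimensional ones; `subword_vectors` and `mean_vector` are illustrative names, and averaging is an assumption about how the pieces are combined rather than the only possible choice.

```python
# Stand-in for the 16 subword vectors; each toy vector has 2 dimensions
# instead of 768, and the values are made up.
subword_vectors = [[float(i), float(i) * 2.0] for i in range(16)]

# Alignment indices for the token "Enriquez", as in the output above.
alignment = [12, 13, 14]


def mean_vector(vectors):
    """Average a list of equal-length vectors dimension by dimension."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


token_vector = mean_vector([subword_vectors[i] for i in alignment])
print(token_vector)  # [13.0, 26.0]
```

The result has the same number of dimensions as each subword vector, no matter how many subwords a Token is split into.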

To use the contextual embeddings from the Transformer efficiently, we can define a component that retrieves contextual word embeddings for Docs, Spans and Tokens and add this component to the spaCy pipeline.

This can be achieved by creating a new Python Class. Because the new Class will become a component of the spaCy pipeline, we must first import the Language object and let spaCy know that we are now defining a new pipeline component.

In [11]:
# Import the Language object under the 'language' module in spaCy,
# and NumPy for calculating cosine similarity.
from spacy.language import Language
import numpy as np

# We use the @ character to register the following Class definition
# with spaCy under the name 'tensor2attr'.
@Language.factory('tensor2attr')

# We begin by declaring the class name: Tensor2Attr. The name is 
# declared using 'class', followed by the name and a colon.
class Tensor2Attr:
    
    # We continue by defining the first method of the class, 
    # __init__(), which is called when this class is used for 
    # creating a Python object. Custom components in spaCy 
    # require passing two variables to the __init__() method:
    # 'name' and 'nlp'. The variable 'self' refers to any
    # object created using this class!
    def __init__(self, name, nlp):
        
        # We do not really do anything with this class, so we
        # simply move on using 'pass' when the object is created.
        pass

    # The __call__() method is called whenever some other object
    # is passed to an object representing this class. Since we know
    # that the class is a part of the spaCy pipeline, we already know
    # that it will receive Doc objects from the preceding layers.
    # We use the variable 'doc' to refer to any object received.
    def __call__(self, doc):
        
        # When an object is received, the class will instantly pass
        # the object forward to the 'add_attributes' method. The
        # reference to self informs Python that the method belongs
        # to this class.
        self.add_attributes(doc)
        
        # After the 'add_attributes' method finishes, the __call__
        # method returns the object.
        return doc
    
    # Next, we define the 'add_attributes' method that will modify
    # the incoming Doc object by calling a series of methods.
    def add_attributes(self, doc):
        
        # spaCy Doc objects have an attribute named 'user_hooks',
        # which allows customising the default attributes of a 
        # Doc object, such as 'vector'. We use the 'user_hooks'
        # attribute to replace the attribute 'vector' with the 
        # Transformer output, which is retrieved using the 
        # 'doc_tensor' method defined below.
        doc.user_hooks['vector'] = self.doc_tensor
        
        # We then perform the same for both Spans and Tokens that
        # are contained within the Doc object.
        doc.user_span_hooks['vector'] = self.span_tensor
        doc.user_token_hooks['vector'] = self.token_tensor
        
        # We also replace the 'similarity' method, because the 
        # default 'similarity' method looks at the default 'vector'
        # attribute, which is empty! We must first replace the
        # vectors using the 'user_hooks' attribute.
        doc.user_hooks['similarity'] = self.get_similarity
        doc.user_span_hooks['similarity'] = self.get_similarity
        doc.user_token_hooks['similarity'] = self.get_similarity
    
    # Define a method that takes a Doc object as input and returns 
    # Transformer output for the entire Doc.
    def doc_tensor(self, doc):
        
        # Return Transformer output for the entire Doc. As noted
        # above, this is the last item under the attribute 'tensors'.
        # Average the output along axis 0 to handle batched outputs.
        return doc._.trf_data.tensors[-1].mean(axis=0)
    
    # Define a method that takes a Span as input and returns the Transformer 
    # output.
    def span_tensor(self, span):
        
        # Get alignment information for Span. This is achieved by using
        # the 'doc' attribute of Span that refers to the Doc that contains
        # this Span. We then use the 'start' and 'end' attributes of a Span
        # to retrieve the alignment information. Finally, we flatten the
        # resulting array to use it for indexing.
        tensor_ix = span.doc._.trf_data.align[span.start: span.end].data.flatten()
        
        # Fetch Transformer output shape from the final dimension of the output.
        # We do this here to maintain compatibility with different Transformers,
        # which may output tensors of different shape.
        out_dim = span.doc._.trf_data.tensors[0].shape[-1]
        
        # Get Token tensors under tensors[0]. Reshape batched outputs so that
        # each "row" in the matrix corresponds to a single token. This is needed
        # for matching alignment information under 'tensor_ix' to the Transformer
        # output.
        tensor = span.doc._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix]
        
        # Average vectors along axis 0 ("columns"). This yields a 768-dimensional
        # vector for each spaCy Span.
        return tensor.mean(axis=0)
    
    # Define a function that takes a Token as input and returns the Transformer
    # output.
    def token_tensor(self, token):
        
        # Get alignment information for Token; flatten array for indexing.
        # Again, we use the 'doc' attribute of a Token to get the parent Doc,
        # which contains the Transformer output.
        tensor_ix = token.doc._.trf_data.align[token.i].data.flatten()
        
        # Fetch Transformer output shape from the final dimension of the output.
        # We do this here to maintain compatibility with different Transformers,
        # which may output tensors of different shape.
        out_dim = token.doc._.trf_data.tensors[0].shape[-1]
        
        # Get Token tensors under tensors[0]. Reshape batched outputs so that
        # each "row" in the matrix corresponds to a single token. This is needed
        # for matching alignment information under 'tensor_ix' to the Transformer
        # output.
        tensor = token.doc._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix]

        # Average vectors along axis 0 (columns). This yields a 768-dimensional
        # vector for each spaCy Token.
        return tensor.mean(axis=0)
    
    # Define a function for calculating cosine similarity between vectors
    def get_similarity(self, doc1, doc2):
        
        # Calculate and return cosine similarity
        return np.dot(doc1.vector, doc2.vector) / (doc1.vector_norm * doc2.vector_norm)

With the Class Tensor2Attr defined, we can now add it to the pipeline by referring to the name we registered with spaCy using `@Language.factory()`, that is, `tensor2attr`.

In [12]:
# Add the component named 'tensor2attr', which we registered using the
# @Language decorator and its 'factory' method to the pipeline.
nlp_trf.add_pipe('tensor2attr')

# Call the 'pipeline' attribute to examine the pipeline
nlp_trf.pipeline

[('transformer',
  <spacy_transformers.pipeline_component.Transformer at 0x20e5da7db20>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x20e5da7df40>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x20e5da930b0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x20e5daca300>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x20e5dacfc80>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x20e5da93200>),
 ('tensor2attr', <__main__.Tensor2Attr at 0x20e5f7b0d60>)]

This component stores the Transformer-based contextual embeddings for Docs, Spans and Tokens under the vector attribute.

Let's explore contextual embeddings by defining two Doc objects and feeding them to the Transformer-based language model under `nlp_trf`.

In [13]:
# Define two example sentences and process them using the Transformer-based
# language model under 'nlp_trf'.
doc_city_trf = nlp_trf("Manila is the capital of the Philippines.")
doc_money_trf = nlp_trf("Protect your capital, profits just follow.")

The noun `"capital"` has two different meanings in these sentences: in doc_city_trf, `"capital"` refers to a `city`, whereas in doc_money_trf the word refers to `money`.

The Transformer should encode this difference into the resulting vector based on the context in which the word occurs.

Let's fetch the Token corresponding to `"capital"` in each example and retrieve their vector representations under the `vector` attribute.

In [14]:
# Retrieve vectors for the two Tokens corresponding to "capital";
# assign to variables 'city_trf' and 'money_trf'.
city_trf = doc_city_trf[3]
money_trf = doc_money_trf[2]

# Compare the similarity of the two meanings of 'capital'
city_trf.similarity(money_trf)

0.67417926

The cosine similarity is `less than 1`, which means that the two vector representations are different even though both belong to the same word, `capital`. This is because the Transformer also encodes information about the context of occurrence into the vectors, which has allowed it to learn that the same linguistic form may have different meanings in different contexts.

This stands in stark contrast to `static` word embeddings, as shown below using `en_core_web_lg`, a large language model for English in spaCy.

In [17]:
# en_core_web_lg is a large language model for English, which contains
# word vectors for 685 000 Token objects.

# Load the large language model and assign it to the variable 'nlp_lg',
# then process the two example sentences with it.
nlp_lg = spacy.load('en_core_web_lg')
doc_city_lg = nlp_lg("Manila is the capital of the Philippines.")
doc_money_lg = nlp_lg("Protect your capital, profits just follow.")

# Retrieve vectors for the two Tokens corresponding to "capital";
# assign to variables 'city_lg' and 'money_lg'.
city_lg = doc_city_lg[3]
money_lg = doc_money_lg[2]

# Compare the similarity of the two meanings of 'capital'
city_lg.similarity(money_lg)

1.0

As you can see, the vectors for the word `"capital"` are identical `(cosine similarity of 1)`, because the word embeddings do not encode information about the context in which the word occurs.
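
The cosine similarity behind both comparisons is the dot product of two vectors divided by the product of their lengths. Here is a small worked sketch in plain Python; the two-dimensional vectors are made up and only stand in for the real embeddings, and `cosine_similarity` is an illustrative helper rather than a spaCy function.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors: a static model yields the same vector twice,
# a contextual model yields two different ones.
static_city = static_money = [3.0, 4.0]
contextual_city, contextual_money = [3.0, 4.0], [4.0, 3.0]

print(cosine_similarity(static_city, static_money))          # 1.0
print(cosine_similarity(contextual_city, contextual_money))  # 0.96
```

Identical vectors always score exactly 1, so any value below 1 tells us that the two vectors differ.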

#### **End. Thank you!**