# LLM: Contextualized Word Embeddings Using Microsoft DeBERTa v3

Contextualized Word Embeddings: Traditional word embeddings (like Word2Vec or GloVe) assign each word in the vocabulary a single vector that is the same in every sentence. Contextualized word embeddings instead produce a vector for each occurrence of a word, computed from the sentence around it. The word "bank" therefore gets different embeddings in "river bank" and "bank account," which lets models handle ambiguity and nuance more effectively.
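The contrast between the two kinds of embedding can be sketched in a few lines of plain Python. This is a toy illustration only: the 2-dimensional vectors are invented, and `toy_contextual_embedding` is a hypothetical stand-in that simply mixes a word's vector with the sentence average. It is not how DeBERTa or any real model computes its embeddings.

```python
# Toy 2-dimensional static embeddings (invented values, for illustration only)
STATIC = {
    "river": [0.0, 1.0],
    "bank": [0.5, 0.5],
    "account": [1.0, 0.0],
}


def static_embedding(words, index):
    """Look up the word on its own; the rest of the sentence is ignored."""
    return STATIC[words[index]]


def toy_contextual_embedding(words, index):
    """Average the word's static vector with the mean vector of the sentence.

    A crude stand-in for context mixing: the result depends on the neighbours.
    """
    vectors = [STATIC[w] for w in words]
    dims = len(vectors[0])
    sentence_mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    own = vectors[index]
    return [(own[d] + sentence_mean[d]) / 2 for d in range(dims)]


river_bank = ["river", "bank"]
bank_account = ["bank", "account"]

# Static lookup: "bank" gets the same vector in both phrases
same_static = static_embedding(river_bank, 1) == static_embedding(bank_account, 0)

# Context-dependent version: "bank" gets a different vector in each phrase
ctx_river = toy_contextual_embedding(river_bank, 1)
ctx_account = toy_contextual_embedding(bank_account, 0)

print(same_static)
print(ctx_river, ctx_account)
```

In this toy setup the static lookup cannot tell the two uses of "bank" apart, while the context-dependent vector is pulled toward "river" in one phrase and toward "account" in the other.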

Models like BERT, GPT, and DeBERTa compute contextual embeddings of this kind, which makes them effective across a range of NLP tasks, from text classification and sentiment analysis to question answering and summarization.

In [1]:
from transformers import AutoTokenizer, AutoModel

In [2]:
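# Load a Tokenizer
# Note: this is the tokenizer of "microsoft/deberta-base", not of the v3 model loaded in the next cell.
# Token IDs are only meaningful to a model trained with the same vocabulary,
# so the tokenizer and model checkpoints should normally match.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")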


In [3]:
# Load a Language Model
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")


In [4]:
# Tokenize the Sentence
tokens = tokenizer('Hello World', return_tensors='pt')  # Encode the 'Hello World' string as PyTorch tensors
tokens

{'input_ids': tensor([[    1, 31414,   623,     2]]), 'token_type_ids': tensor([[0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1]])}

In [5]:
# Run the model on the tokens; index 0 selects the last hidden state
output = model(**tokens)[0]
output  # One embedding vector per input token

tensor([[[-3.2520,  0.1818, -0.1254,  ..., -0.0502, -0.2334,  0.8897],
         [-0.4673,  0.1730, -0.0206,  ..., -0.5289,  0.7303,  2.1177],
         [-0.4950,  0.0564,  0.2842,  ...,  1.0543, -0.1747,  1.3793],
         [-2.9601,  0.2129, -0.1138,  ...,  0.1518, -0.2094,  1.0494]]],
       grad_fn=<NativeLayerNormBackward0>)

In [6]:
# Check the output dimensions: (batch size, number of tokens, hidden size)
output.shape

torch.Size([1, 4, 384])

In [7]:
# Which tokens do the 4 positions along the second dimension correspond to?
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))

[CLS]
Hello
 World
[SEP]
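
Each of the four decoded tokens lines up with one row of the output tensor, so row `i` is the contextual embedding of token `i`. A common next step is to compare such vectors with cosine similarity. The sketch below shows the idea in plain Python. The 3-dimensional `toy_rows` are invented placeholders standing in for the real 384-dimensional rows, and the helper and variable names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Token strings as decoded above; the vectors are invented placeholders,
# not values taken from the model output
token_strings = ["[CLS]", "Hello", " World", "[SEP]"]
toy_rows = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.1],
]

# Pair every token with the row at the same position
embedding_by_position = list(zip(token_strings, toy_rows))

# Compare the placeholder vectors for "Hello" and " World"
hello_vs_world = cosine_similarity(toy_rows[1], toy_rows[2])
print(hello_vs_world)
```

With real PyTorch tensors, the same comparison could be made with `torch.nn.functional.cosine_similarity` on two rows of `output[0]`.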
