<a href="https://colab.research.google.com/github/khengkok/Android-Permission-Extraction-and-Dataset-Creation-with-Python/blob/master/%5Dcontextual_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Contextual Embedding

One of the main drawbacks of embeddings such as Word2Vec and GloVe is that they assign the same embedding to a word regardless of its meaning in a particular context. For example, the word `rock` in `The rock concert is being held at national stadium` has a very different meaning from the one it has in `The naughty boy throws a rock at the dog`.

Contextual embeddings, such as those produced by transformers (on which modern large language models are based), take the context of the word into account, so a different embedding is generated for the same word depending on its context.
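To make the limitation of static embeddings concrete, here is a minimal sketch of a static lookup table. The table `static_embeddings`, its made-up values, and the helper `embed_static` are hypothetical and only for illustration; they are not part of Word2Vec, GloVe, or any library.

```python
# Toy static embedding table: one fixed vector per word (values are made up).
static_embeddings = {
    "rock": [0.2, -0.7, 0.5],
    "concert": [0.9, 0.1, -0.3],
    "dog": [-0.4, 0.8, 0.1],
}

def embed_static(sentence, word):
    """Return the static embedding of `word`; the rest of the sentence is ignored."""
    assert word in sentence.lower().split()
    return static_embeddings[word]

music = embed_static("the rock concert is being held at national stadium", "rock")
stone = embed_static("the naughty boy throws a rock at the dog", "rock")

# Both lookups return exactly the same vector, whatever the context.
print(music == stone)
```

A contextual model, by contrast, computes the vector for `rock` from the whole sentence, which is what the rest of this notebook demonstrates.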

## Install Hugging Face Transformers library
If you are running this notebook in Google Colab, you will need to install the Hugging Face transformers library as it is not part of the standard environment.

In [None]:
%%capture
!pip install transformers
!pip install datasets

Let's try to generate some embeddings using one of the transformer models, `distilbert-base-uncased`.

In [None]:
from transformers import AutoModel, AutoTokenizer
# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")



In [None]:
# Load a language model
model = AutoModel.from_pretrained("distilbert-base-uncased")
model = model.to('cuda')
# Tokenize the sentence
tokens = tokenizer('The rock concert is being held at national stadium.', return_tensors='pt')
print(tokens)
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))

{'input_ids': tensor([[ 101, 1996, 2600, 4164, 2003, 2108, 2218, 2012, 2120, 3346, 1012,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
[CLS]
the
rock
concert
is
being
held
at
national
stadium
.
[SEP]


We will pass the tokens through the model to generate embeddings.  We will take the embedding produced by the last layer.

In [None]:
# Process the tokens (the inputs must be on the same device as the model)
tokens = tokens.to(model.device)
embeddings_1 = model(**tokens)[0]
print(embeddings_1)

**Questions**

1. What is the shape of the embeddings?
2. Why do they have this shape?

Let's try to find the embedding of the token 'rock' used here.

In [None]:
embedding_rock1 = embeddings_1[0][2]
print(embedding_rock1)

Now write code to find the embeddings of the word `rock` as used in the sentences `The naughty boy throws a rock at the dog.` and `A big rock falls from the slope after heavy rain.`


In [None]:
tokens = tokenizer('The naughty boy throws a rock at the dog.', return_tensors='pt')
print(tokens)
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))
tokens.to('cuda')
embeddings_2 = model(**tokens)[0]
embedding_rock2 = embeddings_2[0][6]
print(embedding_rock2)

In [None]:
tokens = tokenizer('A big rock falls from the slope after heavy rain.', return_tensors='pt')
print(tokens)
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))
# Move the inputs to the same device as the model before the forward pass
tokens = tokens.to(model.device)
embeddings_3 = model(**tokens)[0]
embedding_rock3 = embeddings_3[0][3]
print(embedding_rock3)

Let's compute how similar the embeddings are to each other.

In [None]:
import torch

cos = torch.nn.CosineSimilarity(dim=0)
similarity1 = cos(embedding_rock1, embedding_rock2)
print(similarity1)

similarity2 = cos(embedding_rock2, embedding_rock3)
print(similarity2)



We can see that `embedding_rock2` is more similar to `embedding_rock3` than to `embedding_rock1`.
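For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells this out in plain Python on made-up vectors; the helper `cosine_similarity` is a hypothetical illustration, and `torch.nn.CosineSimilarity` additionally guards against division by zero with a small epsilon.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score close to 1, orthogonal ones 0, opposite ones close to -1.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Because the measure depends only on direction and not on vector length, it is a common choice for comparing embeddings.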

In [None]:
from datasets import load_dataset

# URLs of the IMDB review datasets (CSV files)
test_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_test.csv'
train_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_train.csv'

dataset = load_dataset('csv', data_files=train_data_url, split="train").shuffle().select(range(2000))

In [None]:
dataset

Dataset({
    features: ['review', 'sentiment'],
    num_rows: 2000
})

In [None]:
def process_dataset(sample):
    sample['sentiment'] = 0 if sample['sentiment'] == 'negative' else 1
    return sample

dataset = dataset.map(process_dataset)


In [None]:
dataset = dataset.train_test_split(test_size=0.2)
train_dataset = dataset['train'].batch(batch_size=10)
test_dataset = dataset['test'].batch(batch_size=10)


## Tokenization

We will now load the DistilBERT tokenizer for the pretrained model "distilbert-base-uncased". This is the same as in the other lab exercise.

In [None]:
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased').to('cuda')



Here we tokenize the text strings, pad each one to the longest sequence in the batch, and truncate any sequence that exceeds the maximum length allowed by the model (512 tokens in BERT's case).
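The effect of padding and truncation can be sketched in plain Python on made-up token ids. The helper `pad_and_truncate` is hypothetical and deliberately simplified: a real tokenizer keeps the closing `[SEP]` token when it truncates, and the pad id of 0 is an assumption here.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest sequence in the batch."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask

batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102], [101, 102]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=5)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

After this step every sequence in the batch has the same length, which is what allows the batch to be stacked into a single tensor.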

In [None]:
import numpy as np
import torch

train_X = []
train_y = []

for data in train_dataset:
    train_encodings = tokenizer(data['review'], padding=True, max_length=512, truncation=True, return_tensors='pt')
    train_encodings = train_encodings.to('cuda')
    # Disable gradient tracking; otherwise every batch keeps its activations on the GPU
    with torch.no_grad():
        outputs = model(**train_encodings)
    # Average over the token axis so each batch has shape (batch, 768), then move it to the CPU
    train_X.append(outputs[0].mean(dim=1).cpu().numpy())
    train_y.append(data['sentiment'])

train_X = np.concatenate(train_X)
train_y = np.concatenate(train_y)

In [None]:
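# train_texts, val_texts and test_texts are expected to be lists of review strings, one list per split
train_encodings = tokenizer(train_texts, padding=True, truncation=True)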
val_encodings = tokenizer(val_texts, padding=True, truncation=True)
test_encodings = tokenizer(test_texts, padding=True, truncation=True)

In [None]:
train_encodings[1]

We will create a TensorFlow dataset and use its efficient batching later to obtain the embeddings.

In [None]:
BATCH_SIZE = 16

In [None]:
import tensorflow as tf

batch_size = 16

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).batch(batch_size)

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
)).batch(batch_size)

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
)).batch(batch_size)

Here we instantiate a pretrained model from 'distilbert-base-uncased' and specify `output_hidden_states=True` so that we get the output from each of the attention layers.

## Feature Extraction using (Distil)BERT.

Here we will load the pretrained `distilbert-base-uncased` model and use it to extract features (i.e. embeddings) from the text.

In [None]:
from transformers import TFAutoModel

model = TFAutoModel.from_pretrained("distilbert-base-uncased",output_hidden_states=True)

The model produces two outputs. The first, `output[0]`, has shape `(16, 512, 768)` and is the output of the last hidden layer. The second, `output[1]`, is a tuple of 7 tensors, each of shape `(16, 512, 768)`: the output of the embedding layer followed by the output of each of the 6 attention layers. Here 16 is the batch size, 512 is the sequence length and 768 is the hidden size.
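The indexing into `output[1]` and the averaging over tokens can be sketched on a toy stand-in. The shapes below are shrunk to `(2, 3, 2)` instead of `(16, 512, 768)`, the values are made up, and the helper `mean_over_tokens` is hypothetical; the real pipeline would use tensor operations rather than nested lists.

```python
# Toy stand-in for `output[1]`: 7 hidden states (embedding output + 6 layers),
# each of shape (batch=2, tokens=3, hidden=2).
num_states, batch, tokens, hidden = 7, 2, 3, 2
hidden_states = tuple(
    [[[float(layer + tok) for _ in range(hidden)] for tok in range(tokens)] for _ in range(batch)]
    for layer in range(num_states)
)

def mean_over_tokens(state):
    """Average each sentence's token vectors, giving one vector per sentence."""
    return [
        [sum(token[d] for token in sentence) / len(sentence) for d in range(len(sentence[0]))]
        for sentence in state
    ]

second_last = hidden_states[-2]  # index 5: output of the 5th of the 6 layers
sentence_embeddings = mean_over_tokens(second_last)
print(len(hidden_states), len(sentence_embeddings), len(sentence_embeddings[0]))
```

Averaging over the token axis collapses `(batch, tokens, hidden)` to `(batch, hidden)`, so each review ends up as a single fixed-size feature vector.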

In [None]:
import numpy as np

def extract_features(dataset):

    embeddings = []
    labels = []

    for encoding, label in dataset:
        output = model(encoding)
        hidden_states = output[1]
        # here we take the output of the second last attention layer as our embeddings.
        # We take the average of the embedding value of 512 tokens (at axis=1) to generate sentence embedding
        sentence_embedding = tf.reduce_mean(hidden_states[-2], axis=1).numpy()
        embeddings.append(sentence_embedding)
        labels.append(label)

    embeddings, labels = np.concatenate(embeddings), np.concatenate(labels)

    return embeddings, labels

In [None]:
X_train, y_train = extract_features(train_dataset)
X_val, y_val = extract_features(val_dataset)
X_test, y_test = extract_features(test_dataset)

## Train a classifier using the extracted features (embeddings)

In [None]:
X_train.shape

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [None]:
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

In [None]:
print(f'train score : {clf.score(X_train, y_train)}')
print(f'validation score : {clf.score(X_val, y_val)}')
print(f'test score : {clf.score(X_test, y_test)}')

We should get validation and test accuracy of around 86% to 87%, which is quite good considering we are training with only 2000 samples!

**Exercise**

1. Modify the code to use the hidden states from a different attention layer as features, or take the average of the hidden states from a few layers as features.
2. Modify the code to use the BERT model and see if it performs better than DistilBERT. For the BERT model, the outputs of the different layers are in `output[2]`.