### Cell 1: Importing Libraries

This cell installs `torch`, then imports `BertTokenizer` and `BertModel` from `transformers`, along with `torch` and `numpy`. It also suppresses `UserWarning` messages to keep the output readable.

In [1]:
!pip install torch

In [2]:
from transformers import BertTokenizer, BertModel
import torch
import numpy as np
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

### Cell 2: Loading Pre-trained BERT Tokenizer

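In this cell, the pre-trained BERT tokenizer (`bert-base-uncased`) is loaded with the `BertTokenizer.from_pretrained()` method. The vocabulary and configuration files are downloaded on first use. Because the model is uncased, the tokenizer lowercases all input text.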

In [3]:
# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

### Cell 3: Tokenizing Input Text

This cell splits the input text "Hello, how are you?" into tokens with the loaded tokenizer and maps each token to its vocabulary ID. It then adds the special tokens `[CLS]` (start of sequence) and `[SEP]` (end of sequence) and converts the resulting list of IDs to a PyTorch tensor.

In [4]:
# Tokenize input text
text = "Hello, how are you?"
tokens = tokenizer.tokenize(text)

# Convert tokens to token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)

# Add special tokens [CLS] and [SEP]
token_ids = [tokenizer.cls_token_id] + token_ids + [tokenizer.sep_token_id]

# Convert token IDs to tensor
input_ids = torch.tensor(token_ids)
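
The manual steps above (tokenize, look up IDs, wrap in `[CLS]` and `[SEP]`) can be sketched without the library. This is only an illustration: `TOY_VOCAB`, `toy_tokenize`, and `toy_encode` are hypothetical names, the IDs are invented and do not match the `bert-base-uncased` vocabulary, and the toy tokenizer skips the WordPiece subword splitting that the real tokenizer performs.

```python
# Toy vocabulary: these IDs are invented for illustration only and
# do NOT correspond to the real bert-base-uncased vocabulary.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "hello": 4, ",": 5, "how": 6, "are": 7, "you": 8, "?": 9,
}


def toy_tokenize(text):
    """Lowercase the text and split punctuation into separate tokens."""
    tokens, word = [], ""
    for ch in text.lower():
        if ch.isalnum():
            word += ch
        else:
            if word:
                tokens.append(word)
                word = ""
            if not ch.isspace():
                tokens.append(ch)
    if word:
        tokens.append(word)
    return tokens


def toy_encode(text):
    """Map tokens to IDs (unknown tokens become [UNK]) and add [CLS]/[SEP]."""
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in toy_tokenize(text)]
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]


toy_tokens = toy_tokenize("Hello, how are you?")
toy_ids = toy_encode("Hello, how are you?")
print(toy_tokens)  # ['hello', ',', 'how', 'are', 'you', '?']
print(toy_ids)     # [2, 4, 5, 6, 7, 8, 9, 3]
```

Six tokens plus the two special tokens give a sequence of eight IDs, which is the sequence length the model receives.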

### Cell 4: Loading Pre-trained BERT Model and Forward Pass

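In this cell, the pre-trained BERT model (`bert-base-uncased`) is loaded with the `BertModel.from_pretrained()` method; the weights are downloaded on first use. The cell then runs a forward pass inside `torch.no_grad()`, since no gradients are needed for inference. `unsqueeze(0)` adds the batch dimension the model expects, turning the 1-D tensor of token IDs into a batch containing one sequence.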

In [5]:
# Load pre-trained BERT model
model = BertModel.from_pretrained("bert-base-uncased")

# Forward pass through the model
with torch.no_grad():
    outputs = model(input_ids.unsqueeze(0))  # Add batch dimension

### Cell 5: Extracting Hidden States and Printing Shape

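This cell extracts the last layer's hidden states (the contextual embeddings) from the outputs and prints both the tensor and its shape. The shape `[1, 8, 768]` reads as batch size 1, sequence length 8 (six tokens plus `[CLS]` and `[SEP]`), and hidden size 768.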

In [6]:
# Get the hidden states (embeddings)
hidden_states = outputs.last_hidden_state

# Print the hidden states tensor and its shape
print(hidden_states)
print(hidden_states.shape)  # Shape of the output embeddings

tensor([[[-0.0824,  0.0667, -0.2880,  ..., -0.3566,  0.1960,  0.5381],
         [ 0.0310, -0.1448,  0.0952,  ..., -0.1560,  1.0151,  0.0947],
         [-0.8935,  0.3240,  0.4184,  ..., -0.5498,  0.2853,  0.1149],
         ...,
         [-0.2812, -0.8531,  0.6912,  ..., -0.5051,  0.4716, -0.6854],
         [-0.4429, -0.7820, -0.8055,  ...,  0.1949,  0.1081,  0.0130],
         [ 0.5570, -0.1080, -0.2412,  ...,  0.2817, -0.3996, -0.1882]]])
torch.Size([1, 8, 768])


### Cell 6: Converting Hidden States to NumPy Array

Here, the hidden states tensor is converted to a NumPy array and the batch dimension is removed, leaving one 768-dimensional row per token. The shape of the NumPy array is printed.

In [7]:
# Convert hidden states tensor to NumPy array
hidden_states_np = hidden_states.numpy().squeeze(0)  # Remove the batch dimension
print(hidden_states_np.shape)

(8, 768)
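
Each of the 8 rows corresponds to one token, in order. The sketch below illustrates that pairing under stated assumptions: plain Python lists filled with zeros stand in for the real embeddings so that it runs without the model, and `fake_hidden_states`, `example_tokens`, and `token_vectors` are hypothetical names. With the real array, `hidden_states_np[i]` plays the role of row `i`.

```python
# Placeholder data standing in for the model output: zeros, not real embeddings.
HIDDEN_SIZE = 768  # hidden size of bert-base-uncased
example_tokens = ["[CLS]", "hello", ",", "how", "are", "you", "?", "[SEP]"]

# Nested lists with shape (1, 8, 768): batch, sequence length, hidden size.
fake_hidden_states = [[[0.0] * HIDDEN_SIZE for _ in example_tokens]]

# For a batch holding a single sequence, removing the batch dimension
# amounts to taking element 0, which leaves shape (8, 768).
fake_hidden_states_2d = fake_hidden_states[0]

# Row i is the vector for token i.
token_vectors = list(zip(example_tokens, fake_hidden_states_2d))
for token, vector in token_vectors:
    print(f"{token:>6} -> vector of length {len(vector)}")
```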


### Cell 7: Converting Token IDs to Tokens and Reconstructing Original Text

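This cell converts token IDs back to tokens using the tokenizer's `convert_ids_to_tokens()` function and joins them into a string with `convert_tokens_to_string()`. The result is not identical to the original input: the text is lowercased, punctuation is separated by spaces, and the special tokens `[CLS]` and `[SEP]` remain.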

In [8]:
# Convert token IDs to tokens using the tokenizer's convert_ids_to_tokens function
tokens = tokenizer.convert_ids_to_tokens(token_ids)

# Join the tokens back into a string (lowercased, special tokens included)
original_text = tokenizer.convert_tokens_to_string(tokens)
print("Original Text:", original_text)

Original Text: [CLS] hello , how are you ? [SEP]
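
To get closer to the input text, the special tokens can be dropped and punctuation re-attached to the preceding word. The sketch below is a hand-written illustration, not the library's implementation; `toy_detokenize`, `SPECIAL_TOKENS`, and `NO_SPACE_BEFORE` are hypothetical names. In practice the tokenizer's own `decode()` method with `skip_special_tokens=True` is the usual route. The original casing cannot be recovered either way, because the uncased tokenizer discards it.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}
NO_SPACE_BEFORE = {",", ".", "?", "!", ";", ":"}


def toy_detokenize(tokens):
    """Join tokens into text: skip special tokens, merge '##' subword
    pieces, and attach punctuation to the preceding word."""
    text = ""
    for tok in tokens:
        if tok in SPECIAL_TOKENS:
            continue
        if tok.startswith("##"):
            text += tok[2:]  # continuation of the previous word
        elif tok in NO_SPACE_BEFORE or not text:
            text += tok      # no leading space
        else:
            text += " " + tok
    return text


example_tokens = ["[CLS]", "hello", ",", "how", "are", "you", "?", "[SEP]"]
cleaned = toy_detokenize(example_tokens)
print(cleaned)  # hello, how are you?
```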
