<a href="https://colab.research.google.com/github/jboverio/agent_experiments/blob/main/Attention_Implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Baseline: no tokenization yet. The attention module takes a numeric tensor as input, not a string.


In [4]:
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        Q, K, V = self.query(x), self.key(x), self.value(x)
        attention = torch.softmax(Q @ K.transpose(-1, -2) / (Q.size(-1)**0.5), dim=-1)
        return attention @ V

In [5]:
# Instantiate the class
embed_dim = 64
attention_module = SimpleAttention(embed_dim)

# Create a sample input tensor
batch_size = 1
sequence_length = 10
x = torch.randn(batch_size, sequence_length, embed_dim)

# Pass the input through the module
output = attention_module(x)

print("Input shape:", x.shape)
print("Output shape:", output.shape)

Input shape: torch.Size([1, 10, 64])
Output shape: torch.Size([1, 10, 64])
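
The forward pass above computes scaled dot-product attention, `softmax(Q K^T / sqrt(d)) V`. As a minimal sketch of what the intermediate attention weights look like, the snippet below uses random tensors as hypothetical stand-ins for the projected `Q`, `K` and `V` (the names `weights` and `context` are illustrative and not part of the original cell). It shows that the weights form a `sequence_length x sequence_length` matrix per batch item and that each row sums to 1.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical stand-ins for the projected queries, keys and values
# (batch=1, sequence_length=10, embed_dim=64, matching the cell above).
Q = torch.randn(1, 10, 64)
K = torch.randn(1, 10, 64)
V = torch.randn(1, 10, 64)

# Scaled dot-product scores: one row per query position, one column per key position.
scores = Q @ K.transpose(-1, -2) / (Q.size(-1) ** 0.5)

# Softmax over the last dimension turns each row of scores into a probability distribution.
weights = torch.softmax(scores, dim=-1)

# Each output position is a weighted average of the value vectors.
context = weights @ V

print(weights.shape)  # torch.Size([1, 10, 10])
print(context.shape)  # torch.Size([1, 10, 64])
```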


# Task
Modify the provided Python code for the `SimpleAttention` class to accept a string input instead of a numerical tensor. This involves adding tokenization and embedding steps before processing the input with the attention mechanism. Use a sample string as input and display the shape of the output tensor.

## Define vocabulary and tokenization

### Subtask:
Create a simple vocabulary and a function to tokenize the input string into a sequence of numerical indices.


**Reasoning**:
The instructions require creating a vocabulary, a mapping from vocabulary to indices, and a function to tokenize a string using this mapping. These steps can be combined into a single code block.



In [6]:
# 1. Define a list of unique characters or words that will constitute the vocabulary.
# Using a simple character-based vocabulary for demonstration.
vocabulary = list("abcdefghijklmnopqrstuvwxyz ")

# Add an 'unknown' token
unknown_token = '<unk>'
vocabulary.append(unknown_token)

# 2. Create a dictionary to map each item in the vocabulary to a unique integer index.
vocab_to_index = {char: i for i, char in enumerate(vocabulary)}

# Define the index for the unknown token
unknown_index = vocab_to_index[unknown_token]

# 3. Write a function that takes a string as input and uses the vocabulary mapping to convert the string into a list of numerical indices.
def tokenize_string(input_string, vocab_to_index, unknown_index):
    """Tokenizes an input string into a list of numerical indices."""
    tokenized_indices = []
    for char in input_string.lower(): # Convert to lowercase for case-insensitivity
        tokenized_indices.append(vocab_to_index.get(char, unknown_index))
    return tokenized_indices

# Example usage
sample_string = "hello world"
tokenized_output = tokenize_string(sample_string, vocab_to_index, unknown_index)
print("Sample string:", sample_string)
print("Tokenized indices:", tokenized_output)
print("Vocabulary size:", len(vocabulary))

Sample string: hello world
Tokenized indices: [7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3]
Vocabulary size: 28
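
A quick way to sanity-check the mapping is to decode the indices back into text. The sketch below rebuilds the same vocabulary so it runs on its own; `index_to_vocab` and `detokenize_indices` are hypothetical helpers that are not part of the original notebook. It also shows that characters outside the vocabulary (here the comma and the exclamation mark) all collapse to the `<unk>` index, so the round trip is lossy for them.

```python
# Rebuild the vocabulary from the cell above so this sketch is self-contained.
vocabulary = list("abcdefghijklmnopqrstuvwxyz ") + ["<unk>"]
vocab_to_index = {token: i for i, token in enumerate(vocabulary)}
unknown_index = vocab_to_index["<unk>"]

# Hypothetical inverse mapping: index -> token.
index_to_vocab = {i: token for token, i in vocab_to_index.items()}

def detokenize_indices(indices):
    """Map a list of indices back to a string (illustrative helper)."""
    return "".join(index_to_vocab[i] for i in indices)

indices = [vocab_to_index.get(ch, unknown_index) for ch in "Hello, world!".lower()]
decoded = detokenize_indices(indices)

print(indices)  # [7, 4, 11, 11, 14, 27, 26, 22, 14, 17, 11, 3, 27]
print(decoded)  # hello<unk> world<unk>
```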


## Create embedding layer

### Subtask:
Add an embedding layer to convert the numerical indices into dense vectors (embeddings) that have the `embed_dim` size expected by the `SimpleAttention` module.


**Reasoning**:
Add an embedding layer to convert the numerical indices into dense vectors using the vocabulary size and embed_dim, then tokenize a sample string and pass it through the embedding layer to demonstrate its functionality.



In [7]:
# 1. Get the size of the vocabulary (number of unique tokens).
vocab_size = len(vocabulary)

# 2. Instantiate an nn.Embedding layer with one embed_dim-sized vector per token.
embedding_layer = nn.Embedding(vocab_size, embed_dim)

# 3. Create a sample tensor of token indices using the tokenize_string function and the sample string "hello world".
sample_string = "hello world"
tokenized_indices = tokenize_string(sample_string, vocab_to_index, unknown_index)
tokenized_tensor = torch.tensor(tokenized_indices)

# 4. Pass the sample tensor of token indices through the embedding layer to get the embeddings.
embeddings = embedding_layer(tokenized_tensor)

# 5. Print the shape of the resulting embeddings: (number of tokens, embed_dim).
print("Shape of embeddings:", embeddings.shape)

Shape of embeddings: torch.Size([11, 64])
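
It may help to see that `nn.Embedding` is a learnable lookup table: passing indices through the layer selects rows of its `weight` matrix. The following sketch assumes the same sizes as above (28 tokens, 64 dimensions) and uses the indices for "hello" from the tokenization step. The variable `token_ids` is illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the sketch is reproducible

vocab_size, embed_dim = 28, 64
embedding_layer = nn.Embedding(vocab_size, embed_dim)

# Indices for "hello" under the character vocabulary defined earlier.
token_ids = torch.tensor([7, 4, 11, 11, 14])
embeddings = embedding_layer(token_ids)

# The layer output is the same as indexing rows of the weight matrix.
same_as_lookup = torch.equal(embeddings, embedding_layer.weight[token_ids])

# The two 'l' characters share an index, so they get identical vectors.
repeated_token_matches = torch.equal(embeddings[2], embeddings[3])

print(embeddings.shape)        # torch.Size([5, 64])
print(same_as_lookup)          # True
print(repeated_token_matches)  # True
```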


## Integrate tokenization and embedding

### Subtask:
Modify the code to include the tokenization and embedding steps before passing the data to the `SimpleAttention` module.


**Reasoning**:
The goal is to modify the `SimpleAttention` class to accept a string input. This involves integrating the tokenization and embedding steps within the class's `forward` method and adding an embedding layer in the `__init__` method.



In [8]:
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)

    def forward(self, input_string, vocab_to_index, unknown_index):
        # Tokenize the input string
        tokenized_indices = []
        for char in input_string.lower():
            tokenized_indices.append(vocab_to_index.get(char, unknown_index))

        # Convert tokenized indices to a tensor and add batch dimension.
        # dtype=torch.long keeps the indices integer-typed even for an empty string.
        tokenized_tensor = torch.tensor(tokenized_indices, dtype=torch.long).unsqueeze(0)

        # Get embeddings
        x = self.embedding(tokenized_tensor)

        # Pass through attention mechanism
        Q, K, V = self.query(x), self.key(x), self.value(x)
        attention = torch.softmax(Q @ K.transpose(-1, -2) / (Q.size(-1)**0.5), dim=-1)
        return attention @ V

# Instantiate the modified SimpleAttention class
vocab_size = len(vocabulary)
embed_dim = 64
attention_module_string_input = SimpleAttention(vocab_size, embed_dim)

# Create a sample string input
sample_string_input = "this is a test string"

# Pass the sample string through the modified module
output_string_input = attention_module_string_input(sample_string_input, vocab_to_index, unknown_index)

# Print the output shape
print("Sample string input:", sample_string_input)
print("Output shape:", output_string_input.shape)

Sample string input: this is a test string
Output shape: torch.Size([1, 21, 64])
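
One alternative worth considering is to keep tokenization outside the module, so that `forward` receives a tensor of token indices. This is a sketch under that assumption, not the notebook's own design: `encode` and `TokenAttention` are hypothetical names. The benefit is that the module no longer needs the vocabulary passed in on every call and can process a batch of several sequences, provided they have the same length (padding is not handled here).

```python
import torch
import torch.nn as nn

# Rebuild the vocabulary so this sketch is self-contained.
vocabulary = list("abcdefghijklmnopqrstuvwxyz ") + ["<unk>"]
vocab_to_index = {token: i for i, token in enumerate(vocabulary)}
unknown_index = vocab_to_index["<unk>"]

def encode(text):
    """Hypothetical helper: string -> LongTensor of shape (1, len(text))."""
    ids = [vocab_to_index.get(ch, unknown_index) for ch in text.lower()]
    return torch.tensor(ids, dtype=torch.long).unsqueeze(0)

class TokenAttention(nn.Module):
    """Hypothetical variant of SimpleAttention whose forward takes token indices."""

    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)

    def forward(self, token_ids):
        x = self.embedding(token_ids)  # (batch, seq, embed_dim)
        Q, K, V = self.query(x), self.key(x), self.value(x)
        weights = torch.softmax(Q @ K.transpose(-1, -2) / (Q.size(-1) ** 0.5), dim=-1)
        return weights @ V

model = TokenAttention(len(vocabulary), 64)

# A single string, as in the cell above.
single = model(encode("this is a test string"))

# Two strings of equal length (11 characters each) stacked into one batch.
batch = model(torch.cat([encode("hello world"), encode("hello there")], dim=0))

print(single.shape)  # torch.Size([1, 21, 64])
print(batch.shape)   # torch.Size([2, 11, 64])
```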


## Prepare string input

### Subtask:
Define a sample string input.


**Reasoning**:
Define a sample string input as requested by the subtask.



In [9]:
# Choose a meaningful sentence or phrase as a sample input string.
# This string should ideally contain characters present in your defined vocabulary.
sample_string_input = "This is a sample sentence for testing the attention mechanism."

# The string is stored in sample_string_input (assigned above), replacing the
# shorter test string used in the previous cell.

# Display the sample string input to confirm.
print("Sample string input:", sample_string_input)

Sample string input: This is a sample sentence for testing the attention mechanism.


## Process and pass input

### Subtask:
Tokenize and embed the sample string input and then pass the resulting tensor through the `SimpleAttention` module.


**Reasoning**:
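The modified `SimpleAttention.forward` method tokenizes and embeds its input internally, so the sample string is passed to the module directly. The manual tokenization steps at the top of the cell are shown for reference only; their result is not passed to the module.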



In [10]:
# 1. Use the tokenize_string function to convert the sample_string_input into a list of numerical indices.
tokenized_indices = tokenize_string(sample_string_input, vocab_to_index, unknown_index)

# 2. Convert the list of indices into a PyTorch tensor.
tokenized_tensor = torch.tensor(tokenized_indices)

# 3. Add a batch dimension to the tokenized tensor using unsqueeze(0).
tokenized_tensor = tokenized_tensor.unsqueeze(0)

# 4. Reuse the SimpleAttention instance created earlier (`attention_module_string_input`).

# 5. Call the module with the raw string. The modified forward method tokenizes and
# embeds the string itself, so `tokenized_tensor` from steps 1-3 is not used here.
output = attention_module_string_input(sample_string_input, vocab_to_index, unknown_index)

# Display the shape of the output tensor.
print("Output shape after passing through SimpleAttention module:", output.shape)

Output shape after passing through SimpleAttention module: torch.Size([1, 62, 64])


## Display output

### Subtask:
Print the shape of the output tensor.


**Reasoning**:
Print the shape of the output tensor obtained from the SimpleAttention module with a descriptive label.



In [11]:
# Print the shape of the output tensor obtained from the SimpleAttention module.
# Use a descriptive label in the print statement, such as "Shape of the output tensor:".
print("Shape of the output tensor:", output.shape)

Shape of the output tensor: torch.Size([1, 62, 64])


## Summary:

### Data Analysis Key Findings

*   A character-based vocabulary was successfully created, including lowercase English alphabet characters, a space, and an unknown token.
*   A function `tokenize_string` was implemented to convert an input string into a list of numerical indices based on the defined vocabulary.
*   An `nn.Embedding` layer was successfully created and used to convert the tokenized numerical indices into dense vectors (embeddings) of size 64. The embedding process transformed a tensor of token indices with shape `torch.Size([11])` into an embedding tensor with shape `torch.Size([11, 64])` for the sample string "hello world".
*   The `SimpleAttention` class was modified to accept a string input directly. The tokenization and embedding steps were integrated within the class's `forward` method.
*   Processing the sample string "this is a test string" through the modified `SimpleAttention` module resulted in an output tensor with the shape `torch.Size([1, 21, 64])`.
*   Processing the sample string "This is a sample sentence for testing the attention mechanism." through the modified `SimpleAttention` module resulted in an output tensor with the shape `torch.Size([1, 62, 64])`.

### Insights or Next Steps

*   The current implementation processes strings character by character. For more complex natural language processing tasks, consider using word-based tokenization or sub-word tokenization methods (e.g., WordPiece, BPE) and a larger vocabulary.
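*   The attention mechanism currently lets every position attend to every other position, so the score matrix grows quadratically with sequence length. For longer sequences, consider sparse or local (windowed) attention variants to reduce that cost. If the module is to be used for autoregressive generation, a causal mask is also needed so that positions cannot attend to later ones.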
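
As a minimal sketch of the masked attention mentioned above, the snippet below applies a causal mask so that each position attends only to itself and earlier positions. Random tensors stand in for the projected `Q`, `K` and `V`, and the names `future` and `context` are illustrative. Note that a mask restricts which positions are attended to; on its own it does not shrink the score matrix or reduce the computation.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is reproducible

seq_len, embed_dim = 5, 8

# Hypothetical stand-ins for the projected queries, keys and values.
Q = torch.randn(1, seq_len, embed_dim)
K = torch.randn(1, seq_len, embed_dim)
V = torch.randn(1, seq_len, embed_dim)

scores = Q @ K.transpose(-1, -2) / (embed_dim ** 0.5)

# True strictly above the diagonal marks "future" key positions for each query.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Setting masked scores to -inf gives them exactly zero weight after softmax.
scores = scores.masked_fill(future, float("-inf"))
weights = torch.softmax(scores, dim=-1)
context = weights @ V

print(weights[0])     # lower-triangular matrix: zeros above the diagonal
print(context.shape)  # torch.Size([1, 5, 8])
```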
