# Embedding Upsampling
## Overview
Hello! We are excited to share this take-home assignment with you. Please read the problem
statement below and send us any questions; we are happy to clarify over email or a quick call.
Please try to get back to us within a week. Once your solution is ready, share it with us and
schedule a 1-hour meeting to discuss it.

## Prompt
At Connectly, ML engineers focus on both modeling and infra work. This task is designed to test
your ability to design, build, train, and ship ML models.

## Task
Your task is to build a model that solves the ML problem below and to take it through the
full ML lifecycle, from ideation to deployment.

## ML Problem
We have taken 1536-dimensional OpenAI embeddings and applied a mystery affine
transformation that maps them to 32 dimensions. Your job is to build a model that, given the
32-dimensional embeddings, upsamples them back to 1536 dimensions in a way that preserves the
cosines of the angles between the original 1536-dimensional embeddings. The upsampled vectors
should be of unit length. The model approximates the inverse of the mystery transform
(the mystery transform is not perfectly invertible).
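
To make the setup concrete, here is a minimal sketch on synthetic data. It is not the actual mystery transform: the random vectors, the matrix `W`, the offset `b`, and the renormalization after projection are all assumptions chosen for illustration. It shows why an affine map to 32 dimensions cannot be inverted exactly and how the cosine similarities change after projection.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_rows(x):
    """Scale each row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Stand-in for the real embeddings: 200 random unit vectors in 1536 dimensions.
original = unit_rows(rng.standard_normal((200, 1536)))

# Hypothetical affine map x -> x @ W + b down to 32 dimensions (an assumption,
# not the transform used to produce the provided files).
W = rng.standard_normal((1536, 32)) / np.sqrt(1536)
b = 0.01 * rng.standard_normal(32)
projected = unit_rows(original @ W + b)

cos_original = original @ original.T     # similarities the model should preserve
cos_projected = projected @ projected.T  # similarities after projection

# W has rank at most 32, so most directions of the 1536-d space are discarded.
rank = np.linalg.matrix_rank(W)
distortion = np.abs(cos_original - cos_projected).mean()
print(rank, distortion)
```

Because the linear part has rank at most 32, many different 1536-dimensional vectors map to the same 32-dimensional point, so any upsampling model can only approximate an inverse.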

We will provide:

1. Training Embeddings: 12130 32-dimensional unit-length embeddings,
represented as a matrix of shape (12130, 32). File:
https://cdn.connectly.ai/interview_prompts/ml_embedding_upsample/projected_train_embs.npy



2. Cosine Angle Similarity Matrix: a (12130, 12130) matrix where the value at position (i, j)
is the cosine of the angle between embedding_i and embedding_j. Since these are unit
vectors, each value was calculated as the dot product <embedding_i, embedding_j>. The values were
calculated on the original 1536-dimensional embeddings, NOT the 32-dimensional
projected embeddings. File:
https://cdn.connectly.ai/interview_prompts/ml_embedding_upsample/og_train_cos_theta.npy
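
The properties stated above (unit-length rows, dot products of unit vectors) can be checked before any modeling. The helper below is a hypothetical sketch, not part of the assignment; `check_inputs` and its tolerance `atol` are names introduced here, and the example runs on synthetic arrays so that it does not depend on the downloaded files.

```python
import numpy as np

def check_inputs(projected, cos_theta, atol=1e-4):
    """Return basic consistency checks for the embeddings and similarity matrix."""
    n = projected.shape[0]
    return {
        "shapes_match": cos_theta.shape == (n, n),
        "unit_rows": bool(np.allclose(np.linalg.norm(projected, axis=1), 1.0, atol=atol)),
        "symmetric": bool(np.allclose(cos_theta, cos_theta.T, atol=atol)),
        "unit_diagonal": bool(np.allclose(np.diag(cos_theta), 1.0, atol=atol)),
        "in_range": bool(np.all(np.abs(cos_theta) <= 1.0 + atol)),
    }

# Synthetic stand-ins with the same structure as the provided arrays.
rng = np.random.default_rng(0)
high = rng.standard_normal((50, 1536))
high /= np.linalg.norm(high, axis=1, keepdims=True)
low = rng.standard_normal((50, 32))
low /= np.linalg.norm(low, axis=1, keepdims=True)

report = check_inputs(low, high @ high.T)
print(report)
```

With the real files, the same call would be `check_inputs(projected, og)` after loading both arrays with `np.load`.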

In [69]:
import numpy as np
og = np.load('og_train_cos_theta.npy')
print(og.shape)

projected = np.load('projected_train_embs.npy')
print(projected.shape)

(12130, 12130)
(12130, 32)


In [70]:
# Split the data into training and test sets
projected_train = projected[: 10000]
projected_test = projected[10000: ]

og_train = og[:10000, :10000]
og_test = og[10000:, 10000: ]

print(projected_train.shape)
print(projected_test.shape)

print(og_train.shape)
print(og_test.shape)


(10000, 32)
(2130, 32)
(10000, 10000)
(2130, 2130)


In [72]:
import torch.nn as nn
import torch.nn.functional as F
import torch

class FiveLayerNetwork(nn.Module):
    def __init__(self):
        super(FiveLayerNetwork, self).__init__()

        # Define the layers
        self.fc1 = nn.Linear(32, 64)  # First hidden layer
        self.fc2 = nn.Linear(64, 128) # Second hidden layer
        self.fc3 = nn.Linear(128, 256) # Third hidden layer
        self.fc4 = nn.Linear(256, 512) # Fourth hidden layer
        self.fc5 = nn.Linear(512, 1536) # Output layer

    def forward(self, x):
        # Pass the input tensor through each of the layers
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = F.relu(self.fc4(x))
        x = self.fc5(x)

        return x


In [73]:
def rowwise_mse(matrix1, matrix2):
    # matrix1: torch tensor of predicted similarities; matrix2: NumPy array of targets
    if matrix1.shape != matrix2.shape:
        raise ValueError("Both matrices should have the same shape.")

    # Match matrix1's device and dtype (the loaded targets may be float64)
    matrix2 = torch.from_numpy(matrix2).to(device=matrix1.device, dtype=matrix1.dtype)
    mse_per_row = torch.mean((matrix1 - matrix2) ** 2, dim=1)
    return mse_per_row

def get_unit_vector(tensor):
    return tensor / torch.norm(tensor, dim=1, keepdim=True)

def custom_loss(upsampled_vectors, original_cosine_similarity_subset):
    upsampled_cosine_similarity_matrix = F.cosine_similarity(upsampled_vectors.unsqueeze(0), upsampled_vectors.unsqueeze(1), dim=2)
    cosine_similarity_loss = rowwise_mse(upsampled_cosine_similarity_matrix, original_cosine_similarity_subset)
    return cosine_similarity_loss
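
The pairwise similarity in `custom_loss` can also be computed with a single matrix product. The sketch below is an alternative, not the notebook's method: `pairwise_cosine` is a name introduced here. Depending on the PyTorch version, the broadcasting form may build an intermediate of shape (n, n, 1536), which becomes large when n is the full 2130-row test set, whereas the matrix product only needs the (n, n) result.

```python
import torch
import torch.nn.functional as F

def pairwise_cosine(vectors: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between all pairs of rows, via one matrix product."""
    normalized = F.normalize(vectors, p=2, dim=1)  # unit-length rows
    return normalized @ normalized.T

torch.manual_seed(0)
x = torch.randn(8, 1536)

# Broadcasting form used in custom_loss, for comparison.
broadcast = F.cosine_similarity(x.unsqueeze(0), x.unsqueeze(1), dim=2)
gram = pairwise_cosine(x)

print(torch.allclose(broadcast, gram, atol=1e-5))
```

Both forms are differentiable, so either can be used inside the training loss.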


In [79]:
import torch
from torch.utils.data import DataLoader, TensorDataset, SequentialSampler
import torch.optim as optim


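# Create DataLoader for training data.
# shuffle=False keeps batches in dataset order, which the training loop relies on
# to look up the matching rows and columns of og_train.
train_data = TensorDataset(torch.from_numpy(projected_train).float())
train_loader = DataLoader(train_data, batch_size=10, shuffle=False)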
model = FiveLayerNetwork()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [None]:
# Training loop
for epoch in range(10):
    for i, batch in enumerate(train_loader):
        optimizer.zero_grad()
        input_data = batch[0]
        upsampled_vectors = model(input_data)
        upsampled_vectors = get_unit_vector(upsampled_vectors)
        
        # Batches arrive in dataset order, so batch i starts at row i * batch_size
        start = i * train_loader.batch_size
        batch_indices = np.arange(start, start + len(input_data))

        # Retrieve the matching block of `og_train` (a NumPy array) for this batch
        original_cosine_similarity_subset = og_train[np.ix_(batch_indices, batch_indices)]
        
        row_loss = custom_loss(upsampled_vectors, original_cosine_similarity_subset)
        loss = row_loss.mean()
        
        loss.backward()
        optimizer.step()
        
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")

Epoch 1, Loss: 0.0007969019325098784
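
The loop above derives each batch's row indices from its position, which only works when batches are not shuffled. One possible variation, sketched here on synthetic stand-in arrays (`inputs` and `targets` are hypothetical names, not the notebook's variables), stores each row's index in the dataset so that shuffled batches can still find their block of the target matrix.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-ins for projected_train and og_train.
rng = np.random.default_rng(0)
n = 100
inputs = rng.standard_normal((n, 32)).astype(np.float32)
targets = rng.uniform(-1.0, 1.0, size=(n, n)).astype(np.float32)

# Carry each row's index alongside its features.
dataset = TensorDataset(torch.from_numpy(inputs), torch.arange(n))
loader = DataLoader(dataset, batch_size=10, shuffle=True)

batch_inputs, batch_indices = next(iter(loader))
idx = batch_indices.numpy()

# Rows and columns of the target matrix that correspond to this batch.
target_block = targets[np.ix_(idx, idx)]
print(batch_inputs.shape, target_block.shape)
```

Shuffling changes which pairs of vectors appear together in a batch from epoch to epoch, so more entries of the similarity matrix contribute to training than the fixed diagonal blocks.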


In [None]:
from torch import no_grad

# Testing (evaluation)

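# Forward pass on test data
model.eval()
with no_grad():
    # The model expects a float tensor, not a NumPy array
    test_inputs = torch.from_numpy(projected_test).float()
    upsampled_test_vectors = get_unit_vector(model(test_inputs))
    # custom_loss returns one MSE per row; average to a single scalar
    test_loss = custom_loss(upsampled_test_vectors, og_test).mean()

print(f"Test Loss: {test_loss.item()}")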