# Hierarchical Embeddings
We take the embeddings already generated with Cohere's Embed V3 model and train on them in a hierarchical fashion. This is achieved by sparse-encoding the embeddings and integrating ideas from Matryoshka embedding models.

Sparse encoding generates progressively simpler embeddings: a loss term drives elements of the embedding to zero, which reduces the effective dimensionality of the data. The most important features are therefore captured in the lower dimensions (similar to the Matryoshka idea), which mimics hierarchical embeddings.

In most embedding models, a loss function is applied to the full-size embedding to ensure its quality. In the Matryoshka framework, the loss is applied not only to the full-size embedding but also to dimensionally reduced (truncated) portions of it.

"Shortlisting and reranking: Rather than performing your downstream task (e.g., nearest neighbor search) on the full embeddings, you can shrink the embeddings to a smaller size and very efficiently "shortlist" your embeddings. Afterwards, you can process the remaining embeddings using their full dimensionality."

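The shortlist-and-rerank idea quoted above can be sketched with NumPy alone. This is a minimal illustration, not a reference implementation: the function name `shortlist_and_rerank`, its parameters, and the random data are all assumptions introduced here, and it presumes embeddings whose leading dimensions are meaningful on their own (as in Matryoshka-style training).

```python
import numpy as np

def shortlist_and_rerank(query, corpus, short_dim=64, shortlist_size=50, top_n=5):
    """Two-stage cosine search: coarse pass on a prefix, exact pass on full vectors."""
    def normalize(x):
        # L2-normalize along the last axis so dot products equal cosine similarity
        return x / np.clip(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12, None)

    # Stage 1: score every corpus vector using only the first `short_dim` dimensions
    coarse_scores = normalize(corpus[:, :short_dim]) @ normalize(query[:short_dim])
    shortlist_size = min(shortlist_size, corpus.shape[0])
    shortlist = np.argpartition(-coarse_scores, shortlist_size - 1)[:shortlist_size]

    # Stage 2: rescore only the shortlisted vectors at full dimensionality
    fine_scores = normalize(corpus[shortlist]) @ normalize(query)
    order = np.argsort(-fine_scores)[:top_n]
    return shortlist[order], fine_scores[order]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 256))              # placeholder data (assumption)
query = corpus[42] + 0.01 * rng.normal(size=256)   # slightly perturbed copy of item 42
indices, scores = shortlist_and_rerank(query, corpus)
print(indices[0])  # index of the best match after reranking
```

The coarse pass touches only 64 of the 256 dimensions for the whole corpus, while the full-dimensional pass runs on 50 candidates instead of 1000.
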
In [2]:
#imports
import numpy as np
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras import regularizers, backend as K

### Preprocessing the Embeddings
This may not be necessary since we will be using Cohere's Embed V3 model

In [3]:
#use cohere embed v3 - a random placeholder is used here for testing purposes

#define the embeddings
embeddings = np.random.rand(1000, 256) #example

#instantiate the StandardScaler class
scaler = StandardScaler()

#fit the scaler to the embeddings
scaled_embeddings = scaler.fit_transform(embeddings)

### Defining a Sparse Autoencoder with Matryoshka Loss
Following the Matryoshka framework, we will train a sparse autoencoder for different embedding dimensions. The goal is for the smallest dimensions to preserve the most important data/features. A nested loss is applied to penalize the lower-dimensional embeddings for losing important information.

Sparse autoencoders create a compressed representation, enforcing sparsity via a loss term. An autoencoder consists of an encoder, which compresses the input into a lower-dimensional embedding, and a decoder, which reconstructs the input from the compressed version.

In the encoding layer, an L1 regularization term is added as the sparsity constraint on the embeddings (known as sparse codes). Because it is applied as an activity regularizer, it penalizes the absolute values of the layer's activations rather than its weights and pushes many of them toward zero, which creates sparsity in the latent space (the 128-dim encoded space). As a result, only a few dimensions are encouraged to carry significant information.
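
To see why an L1 term favors sparse codes, the following NumPy sketch compares two hypothetical activation vectors with the same L2 norm. The helper `l1_activity_penalty` is an illustration introduced here; it computes only the basic form `lam * sum(|a|)` and does not reproduce how Keras scales the term internally.

```python
import numpy as np

def l1_activity_penalty(activations, lam=1e-5):
    """Basic L1 penalty on activations: lam times the sum of absolute values."""
    return lam * np.sum(np.abs(activations))

# Two hypothetical codes with identical L2 norm (1.0) but different sparsity
dense_code = np.array([0.5, 0.5, 0.5, 0.5])
sparse_code = np.array([0.0, 0.0, 0.0, 1.0])

dense_penalty = l1_activity_penalty(dense_code, lam=1.0)
sparse_penalty = l1_activity_penalty(sparse_code, lam=1.0)
print(dense_penalty, sparse_penalty)  # 2.0 1.0
```

At equal energy, the code that concentrates its mass in one component pays half the penalty, which is the pressure that drives activations to zero during training.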



In [4]:
#defining the input dimension
input_dim = scaled_embeddings.shape[1]

### defining the layers of the autoencoder ###

#input_dim is 256; size of the original embedding
input_layer = Input(shape = (input_dim,)) 

#compresses the input into a lower-dimensional representation; reduces dim to 128
#regularizers.l1(1e-5) as activity_regularizer adds a sparsity constraint by penalizing the absolute values of this layer's activations
encoded = Dense(128, activation='relu', activity_regularizer=regularizers.l1(1e-5))(input_layer) 

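#reconstructs dim back to 256
#linear output: the standardized targets contain negative values, which a sigmoid (range 0 to 1) cannot reproduce
decoded = Dense(input_dim, activation='linear')(encoded)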

#building the autoencoder
autoencoder = Model(input_layer, decoded)

The loss function is modified so that it works across multiple embedding sizes. Here, we compute the mean squared error (MSE) over the first 64, the first 128, and all 256 dimensions of the reconstruction and combine the three terms. The loss is intended to make the smaller-scale embeddings capture the most essential information.
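
Written out, the combined loss used in this notebook takes the form below. The symbols are notation introduced here for illustration: $y$ is the standardized input, $\hat{y}$ its reconstruction, $B$ the batch size, and $m$ the number of leading dimensions included in a term. The weights 1, 0.5, and 0.25 are the ones in the notebook's loss function.

$$
\mathrm{MSE}_m = \frac{1}{B\,m}\sum_{b=1}^{B}\sum_{j=1}^{m}\left(y_{bj}-\hat{y}_{bj}\right)^2,
\qquad
\mathcal{L} = \mathrm{MSE}_{64} + 0.5\,\mathrm{MSE}_{128} + 0.25\,\mathrm{MSE}_{256}
$$

Because the first 64 output dimensions appear in all three terms, their squared errors receive the largest effective weight.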

In [5]:
#defining the Matryoshka loss
#assuming for this example that we are looking at the following dimensions: 64, 128, 256
# ensures that lower-dimensional embeddings retain the most important info

def matryoshka_loss(y_true, y_pred):
    # Loss for 64-dimensional embedding
    mse_64 = K.mean(K.square(y_true[:, :64] - y_pred[:, :64]))

    # Loss for 128-dimensional embedding
    mse_128 = K.mean(K.square(y_true[:, :128] - y_pred[:, :128]))

    # Loss for 256-dimensional embedding
    mse_256 = K.mean(K.square(y_true - y_pred))

    # Combine losses with decreasing weights for larger dimensions,
    # so that the smaller dimensions are prioritized
    return mse_64 + mse_128 * 0.5 + mse_256 * 0.25



In [6]:
#compiling the autoencoder with the customized loss function, and using Adam as the optimizer
autoencoder.compile(optimizer='adam', loss=matryoshka_loss)

#building the encoder to extract sparse codes
encoder = Model(input_layer, encoded)

### Training the Autoencoder
We train the autoencoder on the embeddings produced by Cohere's Embed v3 model (here, the random placeholder). The sparsity constraint pushes the model to focus on the most important features. The autoencoder is trained by minimizing the Matryoshka loss using the Adam optimizer. Training is unsupervised: the embeddings serve as both the input and the target output.

In [7]:
autoencoder.fit(scaled_embeddings, scaled_embeddings,
                epochs=100, #100 training cycles
                batch_size=256, #256 embeddings per training step
                shuffle=True, #shuffle the data each epoch for better generalization
                validation_split=0.2)

Epoch 1/100
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 50ms/step - loss: 2.3412 - val_loss: 2.2554
Epoch 2/100
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 12ms/step - loss: 2.2522 - val_loss: 2.1737
Epoch 3/100
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 12ms/step - loss: 2.1525 - val_loss: 2.1053
Epoch 4/100
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - loss: 2.0937 - val_loss: 2.0495
Epoch 5/100
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - loss: 2.0265 - val_loss: 2.0052
Epoch 6/100
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 12ms/step - loss: 1.9883 - val_loss: 1.9709
Epoch 7/100
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - loss: 1.9567 - val_loss: 1.9447
Epoch 8/100
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - loss: 1.9274 - val_loss: 1.9249
Epoch 9/100
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[3

<keras.src.callbacks.history.History at 0x264e7702690>

### Extracting Sparse Codes
Once the autoencoder is trained, we will use the trained encoder part of the model to extract sparse codes from the embeddings, which are the compressed 128-dimensional representation. 

In [8]:
#extracting sparse codes from the scaled embeddings 
sparse_codes = encoder.predict(scaled_embeddings)

[1m32/32[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step 


### Hierarchical Decomposition
We now have the sparse codes in 128 dimensions from the previous step. To create the hierarchical representations (i.e., embeddings of different sizes), we progressively truncate the sparse codes by setting their smallest components to zero. The assumption is that the most important features are carried by the largest components. For each embedding vector, the function below finds the k largest absolute values and keeps them, setting the rest to zero. The truncated vectors keep their length of 128; only the number of non-zero entries changes, and with k = 128 nothing is removed.

In [9]:
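def truncate_sparse_codes(sparse_codes, k):
    """Keep the k largest-magnitude entries of each row and set the rest to zero."""
    truncated_codes = np.copy(sparse_codes)
    for i in range(truncated_codes.shape[0]):
        # the k-th largest absolute value in row i serves as the cut-off
        threshold = np.sort(np.abs(truncated_codes[i]))[-k]
        # entries strictly below the cut-off are zeroed; ties at the cut-off are kept
        truncated_codes[i][np.abs(truncated_codes[i]) < threshold] = 0
    return truncated_codes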


#example outputs
#generating hierarchical embeddings at different scales
sparse_codes_64 = truncate_sparse_codes(sparse_codes, 64)
sparse_codes_128 = truncate_sparse_codes(sparse_codes, 128)
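
As an alternative to the row-by-row loop, the same top-k idea can be written in vectorized form with `np.argpartition`. This is a sketch under assumptions, not a drop-in replacement: the name `truncate_top_k` is introduced here, and unlike the threshold-based version it keeps exactly k entries per row even when values tie.

```python
import numpy as np

def truncate_top_k(codes, k):
    """Keep exactly the k largest-magnitude entries in each row; zero the rest."""
    if not 1 <= k <= codes.shape[1]:
        raise ValueError("k must be between 1 and the code dimension")
    # Column indices of the k largest absolute values per row (unordered within the top k)
    top_idx = np.argpartition(np.abs(codes), -k, axis=1)[:, -k:]
    truncated = np.zeros_like(codes)
    rows = np.arange(codes.shape[0])[:, None]
    truncated[rows, top_idx] = codes[rows, top_idx]
    return truncated

demo = np.array([[0.1, -3.0, 2.0, 0.5],
                 [1.0, 1.0, 1.0, 1.0]])
demo_top2 = truncate_top_k(demo, 2)
print(demo_top2)
```

In the second demo row all values tie, so which two entries survive is not specified; only the count is guaranteed.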

In [10]:
print("Sparse Codes dim = 64: ", sparse_codes_64)
print("Sparse Codes dim = 128: ", sparse_codes_128)

Sparse Codes dim = 64:  [[0.         0.         2.3489363  ... 0.         3.806964   1.6243358 ]
 [0.         0.         1.2997916  ... 0.         0.96140766 2.964773  ]
 [0.         0.         0.         ... 0.         1.8880342  0.01070255]
 ...
 [0.         0.         0.39064673 ... 0.         0.         2.8452244 ]
 [1.2399896  0.         1.6777012  ... 1.0786636  1.3557804  0.        ]
 [2.297599   3.1833317  0.         ... 0.         2.6793904  1.6575242 ]]
Sparse Codes dim = 128:  [[0.         0.         2.3489363  ... 0.         3.806964   1.6243358 ]
 [0.         0.         1.2997916  ... 0.         0.96140766 2.964773  ]
 [0.         0.         0.         ... 0.         1.8880342  0.01070255]
 ...
 [0.         0.         0.39064673 ... 0.         0.         2.8452244 ]
 [1.2399896  0.         1.6777012  ... 1.0786636  1.3557804  0.        ]
 [2.297599   3.1833317  0.         ... 0.         2.6793904  1.6575242 ]]


### Evaluation
Several tasks can be used to evaluate the quality of the hierarchical embeddings. Examples include: running classification tasks and comparing F1/accuracy scores across embedding dimensions; applying clustering algorithms to the embeddings and evaluating the quality with metrics such as the Silhouette Score or ARI; and evaluating how well embeddings of different dimensions capture semantic similarity by comparing cosine similarities for similar data points.
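
One further way to quantify how much a reduced embedding preserves is to measure how many of each point's nearest neighbours stay the same. The sketch below is an illustration added here, not part of the original evaluation: the function names, the random data, and the crude "zero the last 64 dimensions" reduction are all assumptions.

```python
import numpy as np

def cosine_top_n(vectors, n):
    """Indices of the n most cosine-similar rows for every row, excluding the row itself."""
    norms = np.clip(np.linalg.norm(vectors, axis=1, keepdims=True), 1e-12, None)
    unit = vectors / norms
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # never count a row as its own neighbour
    return np.argsort(-sims, axis=1)[:, :n]

def neighbor_overlap(reference, reduced, n=10):
    """Mean fraction of each row's top-n neighbours in `reference` that are also top-n in `reduced`."""
    ref_nn = cosine_top_n(reference, n)
    red_nn = cosine_top_n(reduced, n)
    overlaps = [len(set(a) & set(b)) / n for a, b in zip(ref_nn, red_nn)]
    return float(np.mean(overlaps))

rng = np.random.default_rng(0)
full = rng.normal(size=(200, 128))   # hypothetical stand-in for the 128-dimensional codes
reduced = full.copy()
reduced[:, 64:] = 0.0                # crude reduction, for illustration only

print(neighbor_overlap(full, full))      # identical inputs give 1.0
print(neighbor_overlap(full, reduced))   # lower when the reduction changes the neighbourhoods
```

A score near 1.0 indicates that the reduced vectors rank neighbours almost as the full vectors do; on random data such as this demo, a low score is expected.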

In [13]:
#Similarity Search using cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

# Cosine similarity between the first embedding and every other embedding
similarity_64 = cosine_similarity([sparse_codes_64[0]], sparse_codes_64[1:])
similarity_128 = cosine_similarity([sparse_codes_128[0]], sparse_codes_128[1:])

# Cosine similarity returns a 1x1 matrix, so we extract the single value to get the similarity score
embedding_1_64 = sparse_codes_64[0].reshape(1, -1)  # First embedding
embedding_2_64 = sparse_codes_64[1].reshape(1, -1)  # Second embedding
similarity_64_score = cosine_similarity(embedding_1_64, embedding_2_64)[0][0]

embedding_1_128 = sparse_codes_128[0].reshape(1, -1)  # First embedding
embedding_2_128 = sparse_codes_128[1].reshape(1, -1)  # Second embedding
similarity_128_score = cosine_similarity(embedding_1_128, embedding_2_128)[0][0]

print(f"Cosine Similarity between the two embeddings 64: {similarity_64_score:.8f}")
print(f"Cosine Similarity between the two embeddings 128: {similarity_128_score:.8f}")

Cosine Similarity between the two embeddings 64: 0.26770779
Cosine Similarity between the two embeddings 128: 0.26770779


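The cosine similarity between the first two embeddings is identical for the 64-component and the 128-component versions. Agreement to eight decimal places most likely means that the top-64 truncation left these two vectors unchanged (for example, because each already has at most 64 non-zero entries), so on its own it does not show that pertinent information survives a genuine reduction in dimensionality. Note also that the vectors here come from random placeholder data, so the two points are not known to be nearby.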

In [12]:
print(similarity_64)
print(similarity_128)

[[0.2677078  0.27551526 0.22697154 0.38289875 0.30744645 0.2544997
  0.42064178 0.14423496 0.22994164 0.25471574 0.19051248 0.2027433
  0.2717739  0.3316057  0.3397695  0.11698658 0.21989001 0.17641507
  0.13010623 0.35663667 0.30376926 0.30757278 0.3958108  0.25582063
  0.25202084 0.45863402 0.31988612 0.11422151 0.15509571 0.32597235
  0.3244828  0.24448371 0.23829296 0.2867934  0.24815892 0.16879669
  0.3062197  0.36652797 0.25208464 0.17279562 0.33739972 0.13860625
  0.27460253 0.2649688  0.4103203  0.18800947 0.10189275 0.34627643
  0.2883152  0.30326164 0.2454203  0.25379893 0.21497914 0.3021431
  0.2837531  0.33348745 0.27229118 0.2903967  0.3477605  0.28745848
  0.2013261  0.22856137 0.20963284 0.2641049  0.23649405 0.28403974
  0.23301241 0.38936776 0.16701178 0.24890901 0.25155854 0.308977
  0.30094    0.34131455 0.12677506 0.22907285 0.17504603 0.24738911
  0.1147745  0.08729383 0.26229486 0.19662927 0.31852424 0.16724816
  0.22927843 0.43393952 0.15565172 0.09541989 0.16737