# Embedding Similarity & Weight Projection - M2V

After extracting the learned node embeddings from the LastFM database with Metapath2Vec, we load the resulting CSV and text files and compute the cosine similarity between every pair of nodes that share an edge in the original graph.

We first import the required libraries.

In [1]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

## Loading Embeddings Data from CSV

Since the embeddings are saved in a CSV file, we use Pandas to load them into a DataFrame. Each row of the CSV file represents a node, and each column represents one dimension of its embedding (128 dimensions in total). The node IDs are read from a separate text file, one per line, and are assumed to be in the same order as the CSV rows.

In [2]:
embeddings_df = pd.read_csv('M2V_Embeddings/node_embeddings.csv', delimiter=',', header=None, float_precision='high')

with open('M2V_Embeddings/node_ids.txt', 'r') as file:
    node_indexes = [line.strip() for line in file]

# Add node_indexes back as the first column of the DataFrame
embeddings_df.insert(0, 'node_id', node_indexes)

# Set node indexes as embeddings_df index to allow for faster search later on
embeddings_df.set_index('node_id', inplace=True)

# Now 'embeddings_df' is ready for further analysis
print(embeddings_df.head())

embeddings_df.shape

              0         1         2         3         4         5         6    \
node_id                                                                         
v10189   0.039389 -0.009403  0.250416  0.060442 -0.794951 -0.487609  0.138515   
v10187   0.234121  0.260514  0.524371  0.347817 -0.425835 -0.031699  0.369254   
v10181   0.079545  0.020564 -0.221185 -0.005457 -0.316500  0.019192  0.041198   
v10188  -0.033314 -0.112948  0.137019  0.158844 -0.318135 -0.827857 -0.169478   
v10178  -0.083494 -0.445674  0.331088 -0.007949 -0.231055 -0.842527 -0.500928   

              7         8         9    ...       118       119       120  \
node_id                                ...                                 
v10189   0.442603 -0.431907  0.876152  ... -0.238084 -0.386559 -0.383074   
v10187   0.081705  0.329991  0.228142  ... -0.531880  0.341435  0.297704   
v10181   0.148705  0.016404  0.581341  ... -0.208760 -0.408285 -0.265582   
v10188  -0.021913 -0.374677  0.242143  ... -0.471632

(15649, 128)

With the embeddings loaded into a DataFrame, we check the data for inconsistencies: non-numeric columns, missing values, and an unexpected shape.

In [3]:
# Check for non-numeric data
print("Data types:\n", embeddings_df.dtypes)

# Check for missing values
if embeddings_df.isnull().values.any():
    print("Missing values found")

# Check shape of embeddings dataframe to see if there are varying row lengths
print("DataFrame shape:", embeddings_df.shape)


Data types:
 0      float64
1      float64
2      float64
3      float64
4      float64
        ...   
123    float64
124    float64
125    float64
126    float64
127    float64
Length: 128, dtype: object
DataFrame shape: (15649, 128)

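Two further checks are cheap to add: duplicate node IDs, which would make a `.loc` lookup return several rows, and non-finite values such as `inf`, which `isnull()` does not flag. The sketch below is a minimal illustration on a small stand-in DataFrame; `check_embeddings` and `toy_df` are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def check_embeddings(df):
    """Return basic consistency facts about an embeddings DataFrame."""
    # Only numeric columns can be tested for finiteness
    values = df.select_dtypes(include="number").to_numpy(dtype=float)
    return {
        # Index labels that repeat an earlier label
        "duplicate_ids": int(df.index.duplicated().sum()),
        # Counts NaN as well as +inf / -inf
        "non_finite": int((~np.isfinite(values)).sum()),
        "all_numeric": all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes),
    }

# Hypothetical toy data: one repeated ID and one infinite value
toy_df = pd.DataFrame(
    [[0.1, 0.2], [0.3, np.inf], [0.5, 0.6]],
    index=pd.Index(["a1", "a2", "a1"], name="node_id"),
)
report = check_embeddings(toy_df)
print(report)
```

On the real data, the same function could be called as `check_embeddings(embeddings_df)`; all three entries should come back clean before similarities are computed.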

## Loading Edge List Data from .edgelist File

To find out which nodes are connected by an edge, we load the edge list into a second DataFrame. The node IDs must be consistent between the embeddings DataFrame and the edge list DataFrame. The graph is undirected, so the `source` and `target` column names carry no directional meaning.

In [4]:
# File path
edgelist_file = 'EdgeList_DBLP/dblp.edgelist'

# Read edge list into DataFrame
edge_list_df = pd.read_csv(edgelist_file, sep=' ', header=None, names=['source', 'target'])

display(edge_list_df.head())

display(embeddings_df.head())

|   | source | target |
|---|--------|--------|
| 0 | p0 | a1 |
| 1 | p0 | a2 |
| 2 | p0 | a3 |
| 3 | p0 | a4 |
| 4 | p0 | a5 |


| node_id | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| v10189 | 0.039389 | -0.009403 | 0.250416 | 0.060442 | -0.794951 | -0.487609 | 0.138515 | 0.442603 | -0.431907 | 0.876152 | ... | -0.238084 | -0.386559 | -0.383074 | -0.477575 | 0.015037 | 0.000587 | 0.169964 | 0.139748 | -0.287373 | -0.223799 |
| v10187 | 0.234121 | 0.260514 | 0.524371 | 0.347817 | -0.425835 | -0.031699 | 0.369254 | 0.081705 | 0.329991 | 0.228142 | ... | -0.53188 | 0.341435 | 0.297704 | -0.462971 | -0.385332 | -0.508439 | -0.378611 | 0.611846 | -0.135939 | 0.200854 |
| v10181 | 0.079545 | 0.020564 | -0.221185 | -0.005457 | -0.3165 | 0.019192 | 0.041198 | 0.148705 | 0.016404 | 0.581341 | ... | -0.20876 | -0.408285 | -0.265582 | 0.182117 | -0.129788 | 0.665685 | -0.520602 | 0.559168 | 0.334289 | 0.788972 |
| v10188 | -0.033314 | -0.112948 | 0.137019 | 0.158844 | -0.318135 | -0.827857 | -0.169478 | -0.021913 | -0.374677 | 0.242143 | ... | -0.471632 | -0.63743 | -0.640109 | -0.142028 | -0.345079 | 0.186298 | -0.541176 | -0.142426 | -0.156906 | -0.378822 |
| v10178 | -0.083494 | -0.445674 | 0.331088 | -0.007949 | -0.231055 | -0.842527 | -0.500928 | 0.346768 | -0.339669 | 0.454952 | ... | 0.078538 | 0.127578 | -0.472686 | 0.184884 | -0.021486 | 0.232314 | 0.0822 | -0.13903 | -0.221281 | -0.267204 |

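Because every edge endpoint must have an embedding, it can help to confirm this before computing similarities. The following is a minimal sketch on toy data; `missing_endpoints`, `toy_edges`, and `toy_ids` are hypothetical names introduced for illustration only.

```python
import pandas as pd

def missing_endpoints(edges, known_ids):
    """Return the sorted node IDs used in `edges` that are absent from `known_ids`."""
    used = set(edges["source"]) | set(edges["target"])
    return sorted(used - set(known_ids))

# Hypothetical toy data: node "a9" appears in an edge but has no embedding
toy_edges = pd.DataFrame({"source": ["p0", "p0", "p1"], "target": ["a1", "a2", "a9"]})
toy_ids = ["p0", "p1", "a1", "a2"]

missing = missing_endpoints(toy_edges, toy_ids)
print(missing)
```

With the DataFrames from this notebook, the call would be `missing_endpoints(edge_list_df, embeddings_df.index)`; an empty list means every edge can be scored.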

## Calculating Cosine Similarity

- For each edge, we retrieve the embeddings of the connected nodes.
- Use cosine_similarity from sklearn.metrics.pairwise to calculate the similarity for each edge.
- Store the similarity values in a new column in the edge list DataFrame.

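For two embedding vectors `u` and `v`, cosine similarity is the dot product divided by the product of their Euclidean norms, so it depends only on the angle between the vectors and not on their lengths. The short NumPy sketch below illustrates the three reference cases; the helper name `cosine` is introduced here for illustration only.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors, close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors, 0
opposite = cosine([1.0, 1.0], [-1.0, -1.0])       # opposite vectors, close to -1
print(same_direction, orthogonal, opposite)
```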
### Method 1: Row-by-Row Iteration (Slower, Inefficient)

For graphs with a very large number of edges, iterating over each row with `DataFrame.iterrows()` and calculating cosine similarity one pair at a time is inefficient. The total work still grows linearly with the number of edges, but every iteration pays Python-level overhead: a Series is built for the row, two label lookups are performed, and `cosine_similarity` is called on a single pair. On large graphs this leads to long execution times.

In [None]:
# Assume embeddings_df is your DataFrame with embeddings indexed by node IDs
# Calculate cosine similarities
similarities = []
for _, row in edge_list_df.iterrows():
    emb1 = embeddings_df.loc[row['source']].values.reshape(1, -1)
    emb2 = embeddings_df.loc[row['target']].values.reshape(1, -1)
    similarity = cosine_similarity(emb1, emb2)[0, 0]
    similarities.append(similarity)

# Add similarities to the edge list DataFrame
edge_list_df['weight'] = similarities

### Method 2: Batch Processing using Vectorization (Faster, Efficient)

1. Efficiency and Vectorization
    - Vectorized Operations: Modern CPUs and computing frameworks like NumPy are optimized for vectorized operations, where the same operation is performed simultaneously on multiple data points. This is inherently more efficient than processing each data point (or in this case, each pair of embeddings) individually, as it minimizes the overhead associated with looping constructs in high-level languages like Python.

    - Batch Processing: By processing multiple pairs of embeddings at once, the batch approach reduces the number of iterations and takes full advantage of vectorized operations. This leads to a significant reduction in computation time, especially for large datasets.

2. Scalability
    - Memory Management: Calculating cosine similarities for millions of edges at once can be memory-intensive, leading to memory overflow or significantly slowed performance due to swapping. Processing the data in smaller batches helps manage memory usage more effectively, ensuring that the computation remains within the available system resources, thereby maintaining performance across varying scales of data.

    - Parallelization Potential: Although not implemented in the provided code, batch processing opens up possibilities for parallel computation. Batches can be processed in parallel across multiple CPU cores or even distributed systems, further speeding up the computation for very large graphs.

3. Practicality
    - Adaptability: The batch size can be adjusted based on the available computing resources and the specific requirements of the dataset. This flexibility allows the method to be optimized for different environments, from personal laptops to high-performance computing clusters.

    - Reduced Computational Overhead: The original method's reliance on DataFrame.iterrows() is known to be inefficient for large datasets due to the overhead of generating Series objects for each row. In contrast, the batch processing approach minimizes this overhead by working directly with NumPy arrays, which are more efficient both in terms of memory layout and computational performance.

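To make the vectorization point concrete, the sketch below scores every matched pair of rows in two arrays with a handful of NumPy calls and checks the result against an explicit Python loop. The arrays `a` and `b` are random stand-ins rather than the real embeddings, and `rowwise_cosine` is a hypothetical helper name.

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between a[i] and b[i] for every row i, without a Python loop."""
    # Sum of element-wise products per row, i.e. one dot product per pair
    dots = np.einsum("ij,ij->i", a, b)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 8))
b = rng.normal(size=(5, 8))

vectorized = rowwise_cosine(a, b)

# Reference result: the same quantity computed one pair at a time
looped = np.array([
    a[i] @ b[i] / (np.linalg.norm(a[i]) * np.linalg.norm(b[i]))
    for i in range(len(a))
])
print(np.allclose(vectorized, looped))
```

This sketch assumes no row has zero norm; a zero vector would need separate handling to avoid division by zero.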
In [5]:
# Assume embeddings_df is indexed by node IDs and contains embeddings
embeddings = embeddings_df.to_numpy()

# Map node IDs to their row position in the embeddings array for quick lookup
node_id_to_index = {node_id: index for index, node_id in enumerate(embeddings_df.index)}

# Convert edge endpoints to row positions without iterrows();
# an unknown node ID raises a KeyError
source_idx = np.array([node_id_to_index[n] for n in edge_list_df['source']])
target_idx = np.array([node_id_to_index[n] for n in edge_list_df['target']])

# Calculate similarities in batches to manage memory usage
batch_size = 1000  # Adjust based on your memory capacity
similarities = []

for i in range(0, len(edge_list_df), batch_size):
    # Fancy indexing gathers each batch's embeddings in one step
    emb1 = embeddings[source_idx[i:i+batch_size]]
    emb2 = embeddings[target_idx[i:i+batch_size]]

    # Row-wise cosine similarity. cosine_similarity(emb1, emb2).diagonal() would
    # compute a full batch_size x batch_size matrix only to keep its diagonal.
    dots = np.einsum('ij,ij->i', emb1, emb2)
    norms = np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
    # Pairs involving a zero-norm embedding get similarity 0
    batch_similarities = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
    similarities.extend(batch_similarities)

# Add similarities to the edge list DataFrame
edge_list_df['weight'] = similarities

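The parallelization potential mentioned above could look roughly like the following sketch, which scores batches on a thread pool from the standard library. This is an assumption-laden illustration, not part of the original workflow: `batch_cosine`, `parallel_cosine`, and the toy arrays are hypothetical, and whether threads actually speed things up depends on how much time is spent inside NumPy routines that release the interpreter lock. A process pool is an alternative worth measuring.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def batch_cosine(emb, src, tgt):
    """Row-wise cosine similarity for one batch of (source, target) row positions."""
    a, b = emb[src], emb[tgt]
    dots = np.einsum("ij,ij->i", a, b)
    return dots / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

def parallel_cosine(emb, src, tgt, batch_size=1000, workers=4):
    """Split the edges into batches and score the batches on a thread pool."""
    if len(src) == 0:
        return np.empty(0)
    starts = range(0, len(src), batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves batch order, so results line up with the edge list
        parts = pool.map(
            lambda s: batch_cosine(emb, src[s:s + batch_size], tgt[s:s + batch_size]),
            starts,
        )
        return np.concatenate(list(parts))

# Hypothetical toy data: 50 nodes with 16-dimensional embeddings, 230 edges
rng = np.random.default_rng(1)
toy_emb = rng.normal(size=(50, 16))
toy_src = rng.integers(0, 50, size=230)
toy_tgt = rng.integers(0, 50, size=230)

parallel = parallel_cosine(toy_emb, toy_src, toy_tgt, batch_size=100, workers=2)
serial = batch_cosine(toy_emb, toy_src, toy_tgt)
print(np.allclose(parallel, serial))
```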

In [6]:
display(edge_list_df.head(100)) 

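|     | source | target | weight |
|-----|--------|--------|--------|
| 0 | p0 | a1 | 0.752102 |
| 1 | p0 | a2 | 0.803198 |
| 2 | p0 | a3 | 0.900095 |
| 3 | p0 | a4 | 0.682413 |
| 4 | p0 | a5 | 0.833705 |
| ... | ... | ... | ... |
| 95 | p41 | a43 | 0.911719 |
| 96 | p41 | a44 | 0.898741 |
| 97 | p41 | v10173 | 0.461055 |
| 98 | p41 | t10241 | 0.526762 |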


We can now export the updated edge list, with the cosine similarities stored as edge weights.

In [7]:
# Optionally save the updated edge list
edge_list_df.to_csv('M2V_edge_list_with_similarity.csv', index=False)
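
After exporting, a quick round trip can confirm that the file reloads with the expected columns and that every weight lies in the valid cosine range of -1 to 1. The sketch below uses a tiny hypothetical edge list (`toy_edges`) and an in-memory buffer in place of the real file, with a small tolerance for floating-point rounding.

```python
import io
import pandas as pd

# Hypothetical toy edge list standing in for the real weighted edges
toy_edges = pd.DataFrame({
    "source": ["p0", "p0"],
    "target": ["a1", "a2"],
    "weight": [0.75, -0.25],
})

# Write to an in-memory buffer instead of disk for this illustration
buffer = io.StringIO()
toy_edges.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# Cosine similarity is bounded by -1 and 1, up to rounding error
tolerance = 1e-9
weights_in_range = bool(reloaded["weight"].between(-1.0 - tolerance, 1.0 + tolerance).all())
print(list(reloaded.columns), weights_in_range)
```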