# Embedding Similarity & Weight Projection

After extracting the learned node embeddings from the DBLP database using JUST, we load and process the resulting CSV and text files to calculate the `Cosine Similarity` between every pair of nodes that share an edge in the original graph.

We first import the required libraries.

In [1]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

## Loading Embeddings Data from .embeddings File

Since JUST's exported embeddings are saved in a .embeddings file, we parse this file line by line and load the result into a Pandas DataFrame. Each row in the file represents a node: the first token is the node ID, and the remaining tokens are the values of its embedding (here, 128 dimensions).

In [2]:
# File path
embeddings_file = 'JUST_Embeddings/dblp.embeddings'

# Initialize lists to store node indexes and embeddings
node_indexes = []
embeddings = []

# Process the embeddings file
with open(embeddings_file, 'r') as file:
    next(file)  # Skip the first line
    for line in file:
        parts = line.split(maxsplit=1)
        node_indexes.append(parts[0])
        embeddings.append([float(val) for val in parts[1].split()])

# Convert embeddings to DataFrame
embeddings_df = pd.DataFrame(embeddings)

# Assign a new index to the DataFrame
embeddings_df.index = range(1, len(embeddings_df) + 1)

# Check the first few rows of the DataFrame
print(embeddings_df.head())

# node_indexes Array with all extracted node indexes
print(node_indexes)


        0         1         2         3         4         5         6    \
1  2.062163 -3.822559  3.356310  1.495711  0.147654  4.982532  0.769526   
2  8.755095  3.324619 -0.268750  1.956853  0.033400  1.397121 -5.998192   
3 -0.016692  5.091135  3.150124 -8.831798 -2.781424 -1.588178 -3.017164   
4  0.512313  1.716134  3.903577  4.922825 -0.388932  1.212146 -4.973925   
5 -2.113167 -5.932612  4.734505 -0.371535 -3.220574 -5.085805 -0.017450   

         7         8         9    ...       118       119       120       121  \
1   1.339957  1.357171 -0.542609  ... -1.071640 -0.721320 -4.548015 -6.567608   
2  11.206099  1.390129 -1.581060  ...  2.489887 -3.961385  1.724385 -4.532434   
3  -4.199641  3.244243  4.761934  ...  4.306939  1.667063 -7.785442 -1.921008   
4  -2.483732 -1.679280 -8.647503  ...  1.322175  2.736613  7.367923 -1.163539   
5   1.531176  8.325444  5.257865  ... -2.974113 -5.438767  1.133533  3.574147   

        122       123       124       125       126       127 

### Saving node indexes to new file

The node indexes extracted from the JUST embeddings file are saved to a new file for later use. They are then set as the DataFrame index so that embeddings can be looked up by node ID.

In [3]:
# Save node_indexes to a file
with open('JUST_Embeddings/node_indexes.txt', 'w') as f:  # same directory (and case) as the embeddings file
    for index in node_indexes:
        f.write("%s\n" % index)

# Add node_indexes back as the first column of the DataFrame
embeddings_df.insert(0, 'node_id', node_indexes)

# Set node indexes as embeddings_df index to allow for faster search later on
embeddings_df.set_index('node_id', inplace=True)

# Now 'embeddings_df' is ready for further analysis
print(embeddings_df.head())

embeddings_df.shape

              0         1         2         3         4         5         6    \
node_id                                                                         
t10503   2.062163 -3.822559  3.356310  1.495711  0.147654  4.982532  0.769526   
v10189   8.755095  3.324619 -0.268750  1.956853  0.033400  1.397121 -5.998192   
v10187  -0.016692  5.091135  3.150124 -8.831798 -2.781424 -1.588178 -3.017164   
v10181   0.512313  1.716134  3.903577  4.922825 -0.388932  1.212146 -4.973925   
v10188  -2.113167 -5.932612  4.734505 -0.371535 -3.220574 -5.085805 -0.017450   

               7         8         9    ...       118       119       120  \
node_id                                 ...                                 
t10503    1.339957  1.357171 -0.542609  ... -1.071640 -0.721320 -4.548015   
v10189   11.206099  1.390129 -1.581060  ...  2.489887 -3.961385  1.724385   
v10187   -4.199641  3.244243  4.761934  ...  4.306939  1.667063 -7.785442   
v10181   -2.483732 -1.679280 -8.647503  ...  1.

(15649, 128)

Now that the embeddings are in a clean DataFrame, we check the data for inconsistencies: non-numeric columns, missing values, and an unexpected shape.

In [4]:
# Check for non-numeric data
print("Data types:\n", embeddings_df.dtypes)

# Check for missing values
if embeddings_df.isnull().values.any():
    print("Missing values found")

# Check the DataFrame shape (rows of varying length would appear as missing values above)
print("DataFrame shape:", embeddings_df.shape)

Data types:
 0      float64
1      float64
2      float64
3      float64
4      float64
        ...   
123    float64
124    float64
125    float64
126    float64
127    float64
Length: 128, dtype: object
DataFrame shape: (15649, 128)
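
The shape alone cannot reveal ragged rows, because `pd.DataFrame` pads shorter rows with NaN when built from a list of lists. A more direct check is to count row lengths on the parsed list before the DataFrame is built. The snippet below is a minimal sketch on toy data; `row_length_counts` is a hypothetical helper, and in the notebook it would be applied to the `embeddings` list from the loading cell.

```python
from collections import Counter

def row_length_counts(rows):
    """Count how many rows have each length (hypothetical helper)."""
    return Counter(len(r) for r in rows)

# Toy stand-in for the parsed `embeddings` list; the last row is one value short
toy_rows = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8]]

counts = row_length_counts(toy_rows)
ragged = len(counts) > 1  # more than one distinct length means inconsistent rows

print("Row length counts:", dict(counts))
print("Ragged rows found:", ragged)
```

If every row has the same length, `counts` holds a single key equal to the embedding dimension.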


### Cross-Check Node ID List with M2V Result

For comparison purposes, we also want to check whether M2V explored the same nodes as JUST through its random walks. Both embedding files contain 15649 128-dimensional vectors, which should represent all nodes in the graph. We want to confirm that these are the same nodes (i.e., that both set differences are empty) and, if not, what the differences are.

In [5]:
# Path to the M2V Node ID List file
file_path = 'M2V_Embeddings/node_ids.txt'

# Read the file and store the node IDs in a list
with open(file_path, 'r') as file:
    m2v_node_ids = [line.strip() for line in file.readlines()]

# Convert the lists to sets for easier comparison
m2v_node_set = set(m2v_node_ids)
just_node_set = set(node_indexes)

# Find differences
nodes_in_just_not_in_m2v = just_node_set - m2v_node_set
nodes_in_m2v_not_in_just = m2v_node_set - just_node_set

count_just_nodes = len(just_node_set)
count_m2v_nodes = len(m2v_node_set)

print("Count of nodes in JUST list:", count_just_nodes)
print("Count of nodes in M2V list:", count_m2v_nodes)
print("Nodes in JUST not in M2V:", nodes_in_just_not_in_m2v)
print("Nodes in M2V not in JUST:", nodes_in_m2v_not_in_just)


Count of nodes in JUST list: 15649
Count of nodes in M2V list: 15649
Nodes in JUST not in M2V: set()
Nodes in M2V not in JUST: set()


## Loading Edge List Data from .edgelist File

To know which nodes are connected by an edge, we import the edge list into another DataFrame. Note that the node IDs must be consistent across the embedding and edge list DataFrames. The graph is undirected, so the assignment of a node to the `source` or `target` column carries no meaning.

In [6]:
# File path
edgelist_file = 'EdgeList_DBLP/dblp.edgelist'

# Read edge list into DataFrame
edge_list_df = pd.read_csv(edgelist_file, sep=' ', header=None, names=['source', 'target'])

display(edge_list_df.head())

display(embeddings_df.head())

,source,target
0,p0,a1
1,p0,a2
2,p0,a3
3,p0,a4
4,p0,a5


node_id,0,1,2,3,4,5,6,7,8,9,...,118,119,120,121,122,123,124,125,126,127
t10503,2.062163,-3.822559,3.35631,1.495711,0.147654,4.982532,0.769526,1.339957,1.357171,-0.542609,...,-1.07164,-0.72132,-4.548015,-6.567608,0.460308,4.073237,0.99022,4.230855,-0.606458,-7.153749
v10189,8.755095,3.324619,-0.26875,1.956853,0.0334,1.397121,-5.998192,11.206099,1.390129,-1.58106,...,2.489887,-3.961385,1.724385,-4.532434,0.955911,2.797703,-4.208636,-6.292368,-1.775127,2.294312
v10187,-0.016692,5.091135,3.150124,-8.831798,-2.781424,-1.588178,-3.017164,-4.199641,3.244243,4.761934,...,4.306939,1.667063,-7.785442,-1.921008,-6.616045,-5.740364,3.086287,-0.996902,3.851432,-2.341438
v10181,0.512313,1.716134,3.903577,4.922825,-0.388932,1.212146,-4.973925,-2.483732,-1.67928,-8.647503,...,1.322175,2.736613,7.367923,-1.163539,3.495918,-3.30472,-2.831706,3.197457,-1.91444,-3.99541
v10188,-2.113167,-5.932612,4.734505,-0.371535,-3.220574,-5.085805,-0.01745,1.531176,8.325444,5.257865,...,-2.974113,-5.438767,1.133533,3.574147,6.642196,9.211696,2.070999,-2.381046,3.787209,0.941584
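
Since the similarity step looks up every edge endpoint in the embeddings index, it can help to confirm beforehand that no edge refers to a node without an embedding. The following is a minimal sketch on toy data; `toy_embeddings` and `toy_edges` are illustrative stand-ins for `embeddings_df` and `edge_list_df`, not the real data.

```python
import pandas as pd

# Toy stand-ins for embeddings_df and edge_list_df (illustrative data only)
toy_embeddings = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    index=['p0', 'a1', 'a2'],
)
toy_edges = pd.DataFrame({
    'source': ['p0', 'p0', 'p1'],
    'target': ['a1', 'a2', 'a1'],
})

# Flag edges whose source or target has no row in the embeddings index
known = toy_embeddings.index
missing_mask = ~toy_edges['source'].isin(known) | ~toy_edges['target'].isin(known)
missing_edges = toy_edges[missing_mask]

print("Edges with an unknown endpoint:", len(missing_edges))
print(missing_edges)
```

An empty `missing_edges` frame means every edge can be assigned a similarity value.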


## Calculating Cosine Similarity

- For each edge, we retrieve the embeddings of the connected nodes.
- Use cosine_similarity from sklearn.metrics.pairwise to calculate the similarity for each edge.
- Store the similarity values in a new column in the edge list DataFrame.
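
For reference, the cosine similarity of two vectors is their dot product divided by the product of their norms, so it ranges from -1 (opposite directions) to 1 (same direction). The sketch below illustrates this with plain NumPy on toy vectors; `cosine` is a hypothetical helper written for illustration, not the sklearn function used in the notebook.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two 1-D vectors: (u . v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
opposite = cosine([1.0, 0.0], [-1.0, 0.0])        # anti-parallel vectors

# Approximately 1, 0 and -1 respectively (up to floating-point rounding)
print(same_direction, orthogonal, opposite)
```

Because the measure depends only on direction, scaling an embedding does not change its similarity to other embeddings, and slightly negative edge weights are possible.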

### Method 1: Row-by-Row Iteration (Slower, Inefficient)

For graphs with a very large number of edges, iterating over each row with DataFrame.iterrows() and calculating cosine similarity one pair at a time is inefficient. The work is linear in the number of edges, as in any approach, but every iteration pays Python-level overhead: a Series is built for the row, two label lookups are performed, and `cosine_similarity` is called separately. This leads to long execution times for large graphs.

In [10]:
# Assume embeddings_df is your DataFrame with embeddings indexed by node IDs
# Calculate cosine similarities
similarities = []
for _, row in edge_list_df.iterrows():
    emb1 = embeddings_df.loc[row['source']].values.reshape(1, -1)
    emb2 = embeddings_df.loc[row['target']].values.reshape(1, -1)
    similarity = cosine_similarity(emb1, emb2)[0, 0]
    similarities.append(similarity)

# Add similarities to the edge list DataFrame
edge_list_df['weight'] = similarities


### Method 2: Batch Processing using Vectorization (Faster, Efficient)

1. Efficiency and Vectorization
    - Vectorized Operations: Modern CPUs and computing frameworks like NumPy are optimized for vectorized operations, where the same operation is performed simultaneously on multiple data points. This is inherently more efficient than processing each data point (or in this case, each pair of embeddings) individually, as it minimizes the overhead associated with looping constructs in high-level languages like Python.

    - Batch Processing: By processing multiple pairs of embeddings at once, the batch approach reduces the number of iterations and takes full advantage of vectorized operations. This leads to a significant reduction in computation time, especially for large datasets.

2. Scalability
    - Memory Management: Calculating cosine similarities for millions of edges at once can be memory-intensive, leading to memory overflow or significantly slowed performance due to swapping. Processing the data in smaller batches helps manage memory usage more effectively, ensuring that the computation remains within the available system resources, thereby maintaining performance across varying scales of data.

    - Parallelization Potential: Although not implemented in the provided code, batch processing opens up possibilities for parallel computation. Batches can be processed in parallel across multiple CPU cores or even distributed systems, further speeding up the computation for very large graphs.

3. Practicality
    - Adaptability: The batch size can be adjusted based on the available computing resources and the specific requirements of the dataset. This flexibility allows the method to be optimized for different environments, from personal laptops to high-performance computing clusters.

    - Reduced Computational Overhead: Method 1's reliance on DataFrame.iterrows() is known to be inefficient for large datasets due to the overhead of generating a Series object for each row. In contrast, the batch processing approach minimizes this overhead by working directly with NumPy arrays, which are more efficient in terms of both memory layout and computational performance.

In [7]:
# Assume embeddings_df is indexed by node IDs and contains embeddings
embeddings = embeddings_df.to_numpy()

# Map node IDs to their row position in the embeddings array for quick lookup
node_id_to_index = {node_id: index for index, node_id in enumerate(embeddings_df.index)}

# Convert edge list source and target to row positions (vectorized, no iterrows)
source_idx = edge_list_df['source'].map(node_id_to_index).to_numpy()
target_idx = edge_list_df['target'].map(node_id_to_index).to_numpy()

# Calculate similarities in batches to manage memory usage
batch_size = 1000  # Adjust based on your memory capacity
similarities = []

for i in range(0, len(edge_list_df), batch_size):
    # Gather the embeddings of both endpoints for this batch of edges
    emb1 = embeddings[source_idx[i:i+batch_size]]
    emb2 = embeddings[target_idx[i:i+batch_size]]

    # Row-wise cosine similarity: one value per edge, instead of building the
    # full batch_size x batch_size matrix and keeping only its diagonal.
    # Note: an all-zero embedding would produce NaN here.
    dots = np.einsum('ij,ij->i', emb1, emb2)
    norms = np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
    similarities.extend(dots / norms)

# Add similarities to the edge list DataFrame
edge_list_df['weight'] = similarities
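
As a sanity check on any batched or vectorized implementation, the per-edge values can be compared against a plain pair-by-pair loop on a small random sample. The sketch below uses synthetic data only; `row_wise_cosine` and `loop_cosine` are hypothetical helper names, and the array sizes are arbitrary.

```python
import numpy as np

def row_wise_cosine(a, b):
    """Cosine similarity between matching rows of a and b (one value per row)."""
    dots = np.einsum('ij,ij->i', a, b)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms

def loop_cosine(a, b):
    """Reference implementation: one pair at a time."""
    return np.array([
        np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
        for x, y in zip(a, b)
    ])

# Synthetic embeddings and random edges (illustrative only)
rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 8))
src = rng.integers(0, 50, size=200)
tgt = rng.integers(0, 50, size=200)

fast = row_wise_cosine(emb[src], emb[tgt])
slow = loop_cosine(emb[src], emb[tgt])

print("Implementations agree:", np.allclose(fast, slow))
```

The row-wise form computes exactly one dot product per edge, so its cost grows with the number of edges rather than with the square of the batch size.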


In [8]:
display(edge_list_df.head(100)) 

,source,target,weight
0,p0,a1,0.810501
1,p0,a2,0.825994
2,p0,a3,0.846090
3,p0,a4,0.766287
4,p0,a5,0.839626
...,...,...,...
95,p41,a43,0.891549
96,p41,a44,0.932675
97,p41,v10173,0.361034
98,p41,t10241,0.624027


We can now export the updated edge list, with cosine similarities as edge weights.

In [9]:
# Optionally save the updated edge list
edge_list_df.to_csv('JUST_edge_list_with_similarity.csv', index=False)

## Projecting Weights to New Homogeneous Graph

- We build a new graph with the same structure as its heterogeneous counterpart, but without node types and edge types. This information should already be captured, structurally and semantically, in the node embeddings.
- The cosine similarity values calculated above are projected onto this graph as edge weights.
- The graph is constructed using StellarGraph (it can later be converted into NetworkX or into adjacency matrices and edge lists, depending on the CD algorithm).

In [10]:
from stellargraph import StellarGraph

# Assuming edge_list_df columns are ['source', 'target', 'weight']

# Create StellarGraph from edge list with weights
G = StellarGraph(edges=edge_list_df)

print(G.info())

StellarGraph: Undirected multigraph
 Nodes: 15649, Edges: 51377

 Node types:
  default: [15649]
    Features: none
    Edge types: default-default->default

 Edge types:
    default-default->default: [51377]
        Weights: range=[-0.083392, 1], mean=0.539301, std=0.203282
        Features: none
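
For algorithms that expect an adjacency matrix rather than a graph object, the weighted edge list can be projected onto a symmetric matrix. The following is a minimal sketch on a toy edge list with made-up weights; it assumes a dense matrix is acceptable, whereas a graph of this size would more likely call for a sparse representation.

```python
import numpy as np
import pandas as pd

# Toy weighted edge list (illustrative values, not the real DBLP weights)
toy_edges = pd.DataFrame({
    'source': ['p0', 'p0', 'p1'],
    'target': ['a1', 'a2', 'a1'],
    'weight': [0.81, 0.83, 0.40],
})

# Assign each node ID a row/column position; node types are ignored
nodes = pd.Index(pd.unique(toy_edges[['source', 'target']].to_numpy().ravel()))
src = nodes.get_indexer(toy_edges['source'])
tgt = nodes.get_indexer(toy_edges['target'])

# Fill both directions because the graph is undirected.
# If the same pair appears more than once, the last weight wins.
adjacency = np.zeros((len(nodes), len(nodes)))
adjacency[src, tgt] = toy_edges['weight'].to_numpy()
adjacency[tgt, src] = toy_edges['weight'].to_numpy()

print(list(nodes))
print(adjacency)
```

Since cosine similarity can be negative, some downstream algorithms may require the weights to be shifted or clipped first; whether that is needed depends on the algorithm chosen.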