# Embedding Similarity & Weight Projection - JUST

After extracting the learned node embeddings from the LastFM database using JUST, we load and process the respective CSV and txt files to calculate the `Cosine Similarity` between every pair of nodes that share an edge in the original graph.

We first import the required libraries.

In [1]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

### Loading Embeddings Data from .embeddings File

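Since JUST's exported embeddings are saved in a .embeddings file, we parse the file line by line and load the result into a Pandas DataFrame. The first line is skipped; every remaining line holds a node ID followed by that node's embedding values. Each row of the DataFrame therefore represents a node, and each column represents one feature of the embedding (i.e., 128-dimensional embeddings).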

In [2]:
# File path
embeddings_file = 'JUST_Embeddings/musicmicro.embeddings'

# Initialize lists to store node indexes and embeddings
node_indexes = []
embeddings = []

# Process the embeddings file
with open(embeddings_file, 'r') as file:
    next(file)  # Skip the first line
    for line in file:
        parts = line.split(maxsplit=1)
        node_indexes.append(parts[0])
        embeddings.append([float(val) for val in parts[1].split()])

# Convert embeddings to DataFrame
embeddings_df = pd.DataFrame(embeddings)

# Assign a new index to the DataFrame
embeddings_df.index = range(1, len(embeddings_df) + 1)

# Check the first few rows of the DataFrame
print(embeddings_df.head())

# node_indexes Array with all extracted node indexes
print(node_indexes)


         0         1          2         3         4          5         6    \
1   0.001643  2.840096 -11.996254  6.314668  5.382931  -3.106497 -6.132063   
2  11.379065  7.180863 -11.875002 -2.788641 -2.626986   6.036295  2.307652   
3  -4.887768 -7.546073   0.399502  2.208102 -6.190106  11.064276 -5.637013   
4 -15.544735 -0.323526   1.276439 -4.720444  1.161533   8.573752  1.664230   
5   0.424813 -3.887877   0.511577 -1.623160 -1.314947   8.326499 -5.707478   

        7         8         9    ...       118        119       120       121  \
1 -2.159500 -6.081487 -6.390866  ... -2.897009   3.883124  4.101838  4.808561   
2 -5.082263  3.557818 -2.335959  ... -1.370886  -2.016068 -7.800634  5.817414   
3 -3.084814  2.396325 -3.001099  ...  8.717366  -2.926158 -4.935077 -7.889351   
4  0.657840 -1.131716 -0.516707  ...  2.916850 -11.357613  5.982731  0.711121   
5  6.612795 -3.246192  0.107478  ...  5.519053  -7.619304  4.328555 -2.601499   

        122       123       124        125  

### Saving node indexes to new file

Node indexes extracted from JUST embedding file are saved to a new file for later use.

In [3]:
# Save node_indexes to a file (same directory the embeddings were read from)
with open('JUST_Embeddings/node_indexes.txt', 'w') as f:
    for index in node_indexes:
        f.write(f"{index}\n")

# Add node_indexes back as the first column of the DataFrame
embeddings_df.insert(0, 'node_id', node_indexes)

# Set node indexes as embeddings_df index to allow for faster search later on
embeddings_df.set_index('node_id', inplace=True)

# Now 'embeddings_df' is ready for further analysis
print(embeddings_df.head())

embeddings_df.shape

                  0         1          2         3         4          5    \
node_id                                                                     
u174194590   0.001643  2.840096 -11.996254  6.314668  5.382931  -3.106497   
u26432623   11.379065  7.180863 -11.875002 -2.788641 -2.626986   6.036295   
t141574     -4.887768 -7.546073   0.399502  2.208102 -6.190106  11.064276   
t1479214   -15.544735 -0.323526   1.276439 -4.720444  1.161533   8.573752   
t141567      0.424813 -3.887877   0.511577 -1.623160 -1.314947   8.326499   

                 6         7         8         9    ...       118        119  \
node_id                                             ...                        
u174194590 -6.132063 -2.159500 -6.081487 -6.390866  ... -2.897009   3.883124   
u26432623   2.307652 -5.082263  3.557818 -2.335959  ... -1.370886  -2.016068   
t141574    -5.637013 -3.084814  2.396325 -3.001099  ...  8.717366  -2.926158   
t1479214    1.664230  0.657840 -1.131716 -0.516707  ...  2.9

(245760, 128)

Now that we have cleaned up the embeddings into a DataFrame, we check the data for inconsistencies: non-numeric values, missing values, and rows of varying length.

In [4]:
# Check for non-numeric data
print("Data types:\n", embeddings_df.dtypes)

# Check for missing values
if embeddings_df.isnull().values.any():
    print("Missing values found")

# Check shape of embeddings dataframe to see if there are varying row lengths
print("DataFrame shape:", embeddings_df.shape)

Data types:
 0      float64
1      float64
2      float64
3      float64
4      float64
        ...   
123    float64
124    float64
125    float64
126    float64
127    float64
Length: 128, dtype: object
DataFrame shape: (245760, 128)
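
One caveat on the check above: the shape alone cannot reveal rows of varying length, because pandas pads short rows with NaN when a DataFrame is built from a list of lists. The missing-values check is what actually catches them. The sketch below illustrates this on hypothetical toy rows (`rows`, `toy_df` and `short_rows` are illustrative names, not part of the notebook).

```python
import pandas as pd

# Toy rows standing in for parsed embedding lines; the second row is one value short.
rows = [[0.1, 0.2, 0.3], [0.4, 0.5], [0.7, 0.8, 0.9]]
toy_df = pd.DataFrame(rows)

# pandas pads the short row with NaN, so the shape still looks regular.
print("Shape:", toy_df.shape)

# Rows containing any NaN are the ones that were short in the source data.
short_rows = toy_df.index[toy_df.isnull().any(axis=1)].tolist()
print("Rows with missing values:", short_rows)
```

Applied to `embeddings_df`, the same `isnull().any(axis=1)` mask would point at the specific nodes whose lines were incomplete, should any exist.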


### Cross-Check Node ID List with M2V Result

For comparison purposes, we would also like to see whether M2V was able to explore the same nodes as JUST via its random walk method. The JUST embedding file contains 245760 128-dimensional vectors, as the DataFrame shape above shows. We would like to see whether M2V produced embeddings for the same nodes (i.e., that both set differences are empty), and if not, what the differences are.

In [5]:
# Path to the M2V Node ID List file
file_path = 'M2V_Embeddings/node_ids.txt'

# Read the file and store the node IDs in a list
with open(file_path, 'r') as file:
    m2v_node_ids = [line.strip() for line in file.readlines()]

#Remove formatting of _ in M2V List
m2v_node_ids = [node_id.replace('_', '') for node_id in m2v_node_ids]

#Add Lists into Sets for easier comparison
m2v_node_set = set(m2v_node_ids)
just_node_set = set(node_indexes)

# Find differences
nodes_in_just_not_in_m2v = just_node_set - m2v_node_set
nodes_in_m2v_not_in_just = m2v_node_set - just_node_set

count_just_nodes = len(just_node_set)
count_m2v_nodes = len(m2v_node_set)

print("Count of nodes in JUST list:", count_just_nodes)
print("Count of nodes in M2V list:", count_m2v_nodes)
print("Nodes in JUST not in M2V:", nodes_in_just_not_in_m2v)
print("Nodes in M2V not in JUST:", nodes_in_m2v_not_in_just)


Count of nodes in JUST list: 245760
Count of nodes in M2V list: 245621
Nodes in JUST not in M2V: {'a309334', 'a118396', 'a437664', 'a131754', 'a89477', 'a91049', 'a350288', 'a220148', 'a422954', 'a182722', 'a186940', 'a454741', 'a136181', 'a392308', 'a465867', 'a346554', 'a388850', 'a314307', 'a256620', 'a309649', 'a105467', 'a333596', 'a231694', 'a368635', 'a284244', 'a223910', 'a174674', 'a285718', 'a195529', 'a248969', 'a410783', 'a432801', 'a168812', 'a321126', 'a379678', 'a376768', 'a163973', 'a338751', 'a12264', 'a169315', 'a303857', 'a388428', 'a450398', 'a159772', 'a7758', 'a156821', 'a169581', 'a372333', 'a288538', 'a245227', 'a346054', 'a397318', 'a70701', 'a205533', 'a35976', 'a6504', 'a9014', 'a245254', 'a279287', 'a288731', 'a295495', 'a82667', 'a244509', 'a229462', 'a492920', 'a362367', 'a174690', 'a99989', 'a367430', 'a124637', 'a380017', 'a517844', 'a60439', 'a236341', 'a25245', 'a281408', 'a235758', 'a125093', 'a324914', 'a329282', 'a34307', 'a349314', 'a269752', 'a222

As the output shows, there is a noticeable discrepancy between the nodes with embeddings in M2V and JUST: JUST assigned embeddings to 139 more nodes than M2V (245760 vs. 245621).

A strictly fair comparison would use only the nodes common to both methods. However, to maintain the information embedded in JUST, we keep all nodes here, even the ones not considered by M2V.
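
If a like-for-like comparison on the shared nodes is needed later, the embeddings can be filtered by node ID. The following is a minimal sketch on hypothetical toy data (`toy_embeddings` and `toy_m2v_nodes` stand in for `embeddings_df` and `m2v_node_set`); it is not a step this notebook performs.

```python
import pandas as pd

# Hypothetical toy data standing in for embeddings_df and the M2V node ID set.
toy_embeddings = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    index=pd.Index(["u1", "t2", "a3"], name="node_id"),
)
toy_m2v_nodes = {"u1", "t2"}

# Boolean mask: True for rows whose node ID also appears in the M2V set.
common_mask = toy_embeddings.index.isin(toy_m2v_nodes)
common_embeddings = toy_embeddings[common_mask]

print(common_embeddings.index.tolist())
```

Filtering with a mask keeps the original row order, so the result stays aligned with any other structure derived from the same index.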

## Loading Edge List Data from .edgelist File

To find out which nodes are connected by an edge, we import the edge list into another DataFrame. Note that the node IDs must be consistent across the embedding and edge list DataFrames! The graph is undirected, so the `source` and `target` column names do not imply a direction.

In [6]:
# File path
edgelist_file = 'EdgeList_MusicMicro/musicmicro.edgelist'

# Read edge list into DataFrame
edge_list_df = pd.read_csv(edgelist_file, sep=' ', header=None, names=['source', 'target'])

display(edge_list_df.head())

display(embeddings_df.head())

|   | source | target |
|---|---|---|
| 0 | u74717431 | t7748381 |
| 1 | u127821914 | t3529910 |
| 2 | u174194590 | t5762915 |
| 3 | u141847381 | t6987845 |
| 4 | u87215499 | t4082536 |


| node_id | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| u174194590 | 0.001643 | 2.840096 | -11.996254 | 6.314668 | 5.382931 | -3.106497 | -6.132063 | -2.1595 | -6.081487 | -6.390866 | ... | -2.897009 | 3.883124 | 4.101838 | 4.808561 | 9.093934 | -2.811261 | 0.207146 | 11.271386 | 0.139268 | -0.593065 |
| u26432623 | 11.379065 | 7.180863 | -11.875002 | -2.788641 | -2.626986 | 6.036295 | 2.307652 | -5.082263 | 3.557818 | -2.335959 | ... | -1.370886 | -2.016068 | -7.800634 | 5.817414 | 7.222031 | -1.411471 | -0.37039 | 1.494895 | 6.21584 | -9.658212 |
| t141574 | -4.887768 | -7.546073 | 0.399502 | 2.208102 | -6.190106 | 11.064276 | -5.637013 | -3.084814 | 2.396325 | -3.001099 | ... | 8.717366 | -2.926158 | -4.935077 | -7.889351 | -6.906976 | -0.147228 | -8.234369 | 2.739229 | 4.458437 | 0.207935 |
| t1479214 | -15.544735 | -0.323526 | 1.276439 | -4.720444 | 1.161533 | 8.573752 | 1.66423 | 0.65784 | -1.131716 | -0.516707 | ... | 2.91685 | -11.357613 | 5.982731 | 0.711121 | -0.915399 | -1.668162 | -0.969677 | 5.099006 | 2.074085 | 4.727633 |
| t141567 | 0.424813 | -3.887877 | 0.511577 | -1.62316 | -1.314947 | 8.326499 | -5.707478 | 6.612795 | -3.246192 | 0.107478 | ... | 5.519053 | -7.619304 | 4.328555 | -2.601499 | -4.99676 | 0.046536 | -4.347691 | 7.108927 | 3.294328 | 2.158588 |
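
Because the node IDs must match across both DataFrames, it can help to confirm that every edge endpoint has an embedding before computing similarities. A minimal sketch on hypothetical toy data follows (`toy_embeddings` and `toy_edges` stand in for `embeddings_df` and `edge_list_df`; `u2` is deliberately left without an embedding).

```python
import pandas as pd

# Hypothetical toy data standing in for embeddings_df and edge_list_df.
toy_embeddings = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4]],
    index=pd.Index(["u1", "t1"], name="node_id"),
)
toy_edges = pd.DataFrame({"source": ["u1", "u2"], "target": ["t1", "t1"]})

# Collect every distinct node ID that appears as an edge endpoint.
endpoints = pd.unique(toy_edges[["source", "target"]].to_numpy().ravel())

# Endpoints without an embedding would raise a KeyError during lookup later on.
missing = sorted(set(endpoints) - set(toy_embeddings.index))
print("Endpoints without an embedding:", missing)
```

An empty `missing` list means every lookup in the similarity step can succeed.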


## Calculating Cosine Similarity

- For each edge, we retrieve the embeddings of the connected nodes.
- Use cosine_similarity from sklearn.metrics.pairwise to calculate the similarity for each edge.
- Store the similarity values in a new column in the edge list DataFrame.
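
For reference, the cosine similarity of two vectors is their dot product divided by the product of their Euclidean norms, so it ranges from -1 (opposite direction) to 1 (same direction). The sketch below checks a hand-written version against `cosine_similarity` on toy vectors; the helper `cosine` and the vectors `a`, `b`, `c` are illustrative only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same direction as a
c = np.array([-1.0, -2.0, -3.0])  # opposite direction to a


def cosine(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


manual_ab = cosine(a, b)
manual_ac = cosine(a, c)

# sklearn expects 2-D inputs, hence the reshape to a single-row matrix.
sklearn_ab = float(cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0])

print(manual_ab, manual_ac, sklearn_ab)
```

Note that the measure ignores vector length: `b` is twice as long as `a`, yet the pair still counts as maximally similar.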

### Method 1: Row-by-Row Iteration (Slower, Inefficient) --> SKIP

For graphs with a very large number of edges, iterating over each row with DataFrame.iterrows() and calculating cosine similarity one pair at a time is very inefficient. The number of operations grows only linearly with the number of edges, but each iteration pays Python-level overhead (building a Series for the row, two label lookups, and a separate cosine_similarity call), which leads to long execution times for large graphs.

In [10]:
# Assume embeddings_df is your DataFrame with embeddings indexed by node IDs
# Calculate cosine similarities
similarities = []
for _, row in edge_list_df.iterrows():
    emb1 = embeddings_df.loc[row['source']].values.reshape(1, -1)
    emb2 = embeddings_df.loc[row['target']].values.reshape(1, -1)
    similarity = cosine_similarity(emb1, emb2)[0, 0]
    similarities.append(similarity)

# Add similarities to the edge list DataFrame
edge_list_df['weight'] = similarities


### Method 2: Batch Processing using Vectorization (Faster, Efficient)

1. Efficiency and Vectorization
    - Vectorized Operations: Modern CPUs and computing frameworks like NumPy are optimized for vectorized operations, where the same operation is performed simultaneously on multiple data points. This is inherently more efficient than processing each data point (or in this case, each pair of embeddings) individually, as it minimizes the overhead associated with looping constructs in high-level languages like Python.

    - Batch Processing: By processing multiple pairs of embeddings at once, the batch approach reduces the number of iterations and takes full advantage of vectorized operations. This leads to a significant reduction in computation time, especially for large datasets.

2. Scalability
    - Memory Management: Calculating cosine similarities for millions of edges at once can be memory-intensive, leading to memory overflow or significantly slowed performance due to swapping. Processing the data in smaller batches helps manage memory usage more effectively, ensuring that the computation remains within the available system resources, thereby maintaining performance across varying scales of data.

    - Parallelization Potential: Although not implemented in the provided code, batch processing opens up possibilities for parallel computation. Batches can be processed in parallel across multiple CPU cores or even distributed systems, further speeding up the computation for very large graphs.

3. Practicality
    - Adaptability: The batch size can be adjusted based on the available computing resources and the specific requirements of the dataset. This flexibility allows the method to be optimized for different environments, from personal laptops to high-performance computing clusters.

    - Reduced Computational Overhead: The original method's reliance on DataFrame.iterrows() is known to be inefficient for large datasets due to the overhead of generating Series objects for each row. In contrast, the batch processing approach minimizes this overhead by working directly with NumPy arrays, which are more efficient both in terms of memory layout and computational performance.

In [7]:
# Assume embeddings_df is indexed by node IDs and contains embeddings
embeddings = embeddings_df.to_numpy()

# Map node IDs to their index in the embeddings array for quick lookup
node_id_to_index = {node_id: index for index, node_id in enumerate(embeddings_df.index)}

# Convert edge list source and target to indices
# (zip over the columns avoids the per-row Series overhead of iterrows())
edge_indices = [(node_id_to_index[source], node_id_to_index[target])
                for source, target in zip(edge_list_df['source'], edge_list_df['target'])]

# Calculate similarities in batches to manage memory usage
batch_size = 1000  # Adjust based on your memory capacity
similarities = []

for i in range(0, len(edge_indices), batch_size):
    batch_edges = edge_indices[i:i+batch_size]
    emb1 = np.array([embeddings[index_pair[0]] for index_pair in batch_edges])
    emb2 = np.array([embeddings[index_pair[1]] for index_pair in batch_edges])
    
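    # Calculate batch similarities: cosine_similarity builds the full
    # batch_size x batch_size matrix; only its diagonal holds the edge pairs
    batch_similarities = cosine_similarity(emb1, emb2).diagonal()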
    similarities.extend(batch_similarities)

# Add similarities to the edge list DataFrame
edge_list_df['weight'] = similarities
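
Since only matching pairs are needed, the similarities can also be computed row by row without forming a pairwise matrix. The following is a minimal sketch of that alternative, not the method used in this notebook. It assumes no embedding is an all-zero vector (the division would otherwise produce NaN), and `toy_embeddings`, `src_idx` and `dst_idx` are hypothetical stand-ins for the embeddings array and the edge index pairs.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: 6 nodes with 4-dimensional embeddings, 4 edges.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(6, 4))
src_idx = np.array([0, 1, 2, 3])
dst_idx = np.array([4, 5, 0, 1])

# Fancy indexing gathers the embeddings for all edges at once.
emb1 = toy_embeddings[src_idx]
emb2 = toy_embeddings[dst_idx]

# Row-wise dot products and norms: one similarity per edge.
dots = np.einsum("ij,ij->i", emb1, emb2)
norms = np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
rowwise = dots / norms

# Reference: diagonal of the full pairwise matrix, as in the cell above.
reference = cosine_similarity(emb1, emb2).diagonal()
print(np.allclose(rowwise, reference))
```

Memory use here grows linearly with the number of edges in a batch rather than quadratically, so larger batches become feasible.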


In [8]:
display(edge_list_df.head(100)) 

|   | source | target | weight |
|---|---|---|---|
| 0 | u74717431 | t7748381 | 0.957671 |
| 1 | u127821914 | t3529910 | 0.426984 |
| 2 | u174194590 | t5762915 | 0.369877 |
| 3 | u141847381 | t6987845 | 0.745354 |
| 4 | u87215499 | t4082536 | 0.946200 |
| ... | ... | ... | ... |
| 95 | u67598181 | t6721280 | 0.498113 |
| 96 | u42774199 | t655158 | 0.601942 |
| 97 | u136260948 | t5985732 | 0.666077 |
| 98 | u240229687 | t8315721 | 0.666457 |


We can now export the new updated edge list with cosine similarities as edge weights.

In [9]:
# Optionally save the updated edge list
edge_list_df.to_csv('JUST_edge_list_with_similarity.csv', index=False)

## Projecting Weights to New Homogeneous Graph

- We build a new graph that mirrors the structure of its original heterogeneous counterpart but ignores node types and edge types. This information should already be captured structurally and semantically in the node embeddings.
- The cosine similarity values calculated above are projected onto the graph as edge weights.
- The graph is constructed using StellarGraph (it can later be converted into NetworkX or adjacency matrices + edge lists, depending on the CD algorithm).

In [10]:
from stellargraph import StellarGraph



In [11]:

# Assuming edge_list_df columns are ['source', 'target', 'weight']

# Create StellarGraph from edge list with weights
G = StellarGraph(edges=edge_list_df)

print(
    "Number of nodes {} and number of edges {} in graph.".format(
        G.number_of_nodes(), G.number_of_edges()
    )
)

print("\n")

print("Below is an overview of the StellarGraph structure:")
print(G.info())

Number of nodes 245760 and number of edges 641284 in graph.


Below is an overview of the StellarGraph structure:
StellarGraph: Undirected multigraph
 Nodes: 245760, Edges: 641284

 Node types:
  default: [245760]
    Features: none
    Edge types: default-default->default

 Edge types:
    default-default->default: [641284]
        Weights: range=[-0.456101, 0.99995], mean=0.428242, std=0.343494
        Features: none
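
As noted at the start of this section, the weighted graph may later be needed as an adjacency matrix. One possible way to build a sparse, symmetric adjacency matrix directly from the weighted edge list is sketched below with SciPy. `toy_edges` is a hypothetical stand-in for `edge_list_df`, and the node ordering (`nodes`) is an arbitrary choice made for this illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Hypothetical toy edge list standing in for edge_list_df.
toy_edges = pd.DataFrame({
    "source": ["u1", "u1", "u2"],
    "target": ["t1", "t2", "t1"],
    "weight": [0.9, 0.4, 0.7],
})

# Assign each distinct node ID a row/column position (order of first appearance).
nodes = pd.Index(pd.unique(toy_edges[["source", "target"]].to_numpy().ravel()))
rows = nodes.get_indexer(toy_edges["source"])
cols = nodes.get_indexer(toy_edges["target"])
weights = toy_edges["weight"].to_numpy()

# The graph is undirected, so every edge is stored in both directions.
n = len(nodes)
adjacency = coo_matrix(
    (
        np.concatenate([weights, weights]),
        (np.concatenate([rows, cols]), np.concatenate([cols, rows])),
    ),
    shape=(n, n),
).tocsr()

print(nodes.tolist())
print(adjacency.toarray())
```

One design point to keep in mind: converting the COO matrix sums duplicate entries, so parallel edges in a multigraph would have their weights added together rather than kept separate.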
