<a href="https://colab.research.google.com/github/perlatomdpi/Graph-algorithms/blob/main/Weighted_Jaccard.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Weighted Jaccard Similarity**

RAPIDS cuGraph is a library of graph algorithms that integrates seamlessly into the RAPIDS data science ecosystem and lets you call graph algorithms easily on data stored in a **GPU DataFrame**. <br>

Algorithm optimized for single-GPU analytics: <br>
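**Jaccard Similarity**: a measure of neighbourhood similarity between connected vertices. The weighted variant sums per-vertex weights over the shared and combined neighbourhoods instead of counting vertices. Within recommendation systems, this is very useful for finding customers with similar behaviour. <br>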
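
To make the definition concrete, here is a minimal pure-Python sketch of the score for a single pair of vertices. The toy graph, the weights, and the function name `weighted_jaccard` are illustrative assumptions, and this is not how cuGraph computes the result on the GPU. The score is taken here as the summed weight of the shared neighbours divided by the summed weight of all neighbours of either vertex; with every weight set to 1 it reduces to the plain Jaccard coefficient.

```python
def weighted_jaccard(neighbors_u, neighbors_v, weight):
    """Summed weight of the shared neighbours over the summed weight of all neighbours."""
    shared = neighbors_u & neighbors_v
    combined = neighbors_u | neighbors_v
    combined_weight = sum(weight[x] for x in combined)
    if combined_weight == 0:
        return 0.0  # no neighbours (or all-zero weights): define the score as 0
    return sum(weight[x] for x in shared) / combined_weight


# Hypothetical undirected toy graph: vertex -> set of neighbours
adjacency = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2},
}

# Hypothetical vertex weights (the notebook uses PageRank scores instead)
weight = {0: 4.0, 1: 1.0, 2: 3.0, 3: 2.0}
uniform = {v: 1.0 for v in adjacency}

# Vertices 0 and 1 share only neighbour 2
score_weighted = weighted_jaccard(adjacency[0], adjacency[1], weight)
score_unweighted = weighted_jaccard(adjacency[0], adjacency[1], uniform)
print(score_weighted, score_unweighted)
```

In this toy case the shared neighbour carries a relatively large weight, so the weighted score for the pair is higher than the unweighted one.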

References: https://research.nvidia.com/publication/2017-11_Parallel-Jaccard-and



# Initialize project

In [None]:
#==============================================================================
# CHECK GPU
#==============================================================================
# Runtime -> Change runtime type -> GPU
# The assigned GPU has to be RAPIDS compatible.
# If it is not, terminate the session and restart it.
!nvidia-smi

In [None]:
#==============================================================================
# GPUs should be connected with NVlink
#==============================================================================
!nvidia-smi nvlink --status

In [None]:
#==============================================================================
# INSTALL RAPIDS
#==============================================================================
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!bash rapidsai-csp-utils/colab/rapids-colab.sh stable

import sys, os

dist_package_index = sys.path.index('/usr/local/lib/python3.6/dist-packages')
sys.path = sys.path[:dist_package_index] + ['/usr/local/lib/python3.6/site-packages'] + sys.path[dist_package_index:]
sys.path
exec(open('rapidsai-csp-utils/colab/update_modules.py').read(), globals())

In [None]:
# Import needed libraries
import cugraph
import cudf
from collections import OrderedDict

# Read the data as cuDF

In [None]:
# Test file  
datafile='../data/networks/karate-data.csv'

# Read the data file
cols = ["src", "dst"]

dtypes = OrderedDict([
        ("src", "int32"), 
        ("dst", "int32")
        ])

gdf = cudf.read_csv(datafile, names=cols, delimiter='\t', dtype=list(dtypes.values()) )

In [None]:
# Adjust the vertex ID
gdf["src_0"] = gdf["src"] - 1
gdf["dst_0"] = gdf["dst"] - 1

In [None]:
# Create a Graph 
G = cugraph.Graph()
G.add_edge_list(gdf["src_0"], gdf["dst_0"])

# Compute PageRank and use as vertex weights

In [None]:
# Call PageRank on the graph to get the vertex weights to use:
pr_df = cugraph.pagerank(G)
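
For intuition about what these weights represent, the following is a minimal power-iteration sketch of PageRank in plain Python. It is not cuGraph's implementation: the toy graph, the function name `pagerank_sketch`, the damping factor of 0.85, and the fixed iteration count are assumptions for illustration, and the sketch assumes an undirected graph in which every vertex has at least one neighbour.

```python
def pagerank_sketch(adjacency, damping=0.85, iterations=100):
    """Power iteration on an undirected graph given as vertex -> set of neighbours."""
    n = len(adjacency)
    rank = {v: 1.0 / n for v in adjacency}  # start from a uniform distribution
    for _ in range(iterations):
        # Every vertex receives the same "teleport" share first
        new_rank = {v: (1.0 - damping) / n for v in adjacency}
        # Then each vertex splits its damped rank evenly among its neighbours
        for v, neighbors in adjacency.items():
            share = damping * rank[v] / len(neighbors)
            for u in neighbors:
                new_rank[u] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: vertices 0 and 2 have three neighbours, 1 and 3 have two
toy_graph = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2},
}

ranks = pagerank_sketch(toy_graph)
print(ranks)
```

The better-connected vertices end up with larger scores, which is why using PageRank as the weight makes shared neighbours that are themselves well connected count for more in the similarity.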

# Compute the Weighted Jaccard

In [None]:
# Compute weighted Jaccard using the PageRank scores as vertex weights:
df = cugraph.nvJaccard_w(G, pr_df['pagerank'])

In [None]:
# Find the most similar pair of vertices - adjust the vertex ID by adding 1 to match illustration
bestEdge = 0
for i in range(len(df)):
    if df['jaccard_coeff'][i] > df['jaccard_coeff'][bestEdge]:
        bestEdge = i
        
print("Vertices " + str(df['source'][bestEdge] + 1) +
      " and " + str(df['destination'][bestEdge] + 1) + 
      " are most similar with score: " + str(df['jaccard_coeff'][bestEdge]))