# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 01 b: Load, Engineer & Connect</span>


# Graph feature engineering using RAPIDS' cuGraph
In this notebook we perform graph feature engineering with the cuGraph library and write the resulting features to the Hopsworks feature store as a feature group.

## **🗒️ This notebook is divided into 3 sections:** 
1. Load the data and do feature engineering,
2. Connect to the Hopsworks feature store,
3. Create feature groups and upload them to the feature store.

![tutorial-flow](images/01_featuregroups.png)

First of all we will load the data and do some feature engineering on it.

In [1]:
import cuxfilter
import cudf
import cugraph
import numpy as np, pandas as pd

2022-05-29 11:50:41,079 INFO: Note: detected 256 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2022-05-29 11:50:41,080 INFO: Note: NumExpr detected 256 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-05-29 11:50:41,081 INFO: NumExpr defaulting to 8 threads.


In [None]:
# get feature store handle
import hsfs
# Create a connection
connection = hsfs.connection()
# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

# Get nodes and edges feature groups
transactions_fg = fs.get_feature_group("transaction_labels_fg", 1)
party_fg = fs.get_feature_group("party_fg", 1)

# Get fg as pandas dataframe
node_pdf = party_fg.read()
edge_pdf = transactions_fg.read()

edge_pdf.columns = ["tran_timestamp","source", "target", "tran_id", "tx_type", "base_amt"]
node_pdf.columns = ["id", "type"]


## Create CuGraph Graph

In [3]:
# CuGraph works with only integer node IDs
unique_ids = set()
for src, dst in edge_pdf[["source", "target"]].values:
    unique_ids.add(src)
    unique_ids.add(dst)

id_dict = {}
for i, idn in enumerate(unique_ids):
    id_dict[idn] = i

# create 2 columns that contain the integer IDs for src and dst
edge_pdf['src_int'] = edge_pdf['source'].apply(lambda x : id_dict[x])
edge_pdf['dst_int'] = edge_pdf['target'].apply(lambda x : id_dict[x])

In [4]:
node_pdf['id_int'] = node_pdf['id'].apply(lambda x : id_dict[x])
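
The mapping above iterates over a Python `set`. For string IDs the iteration order can change between interpreter runs because of hash randomization, so the `id_int` values are not guaranteed to be reproducible. A minimal sketch of a deterministic alternative, assuming it is acceptable to assign integers in sorted order of the original IDs (`build_id_map` and the toy edge list are hypothetical names used only for illustration):

```python
def build_id_map(edges):
    """Map every distinct node ID in an edge list to a dense integer.

    IDs are numbered in sorted order, so the same input always
    produces the same mapping.
    """
    unique_ids = sorted({node for edge in edges for node in edge})
    return {node: i for i, node in enumerate(unique_ids)}


# Toy edge list of (source, target) pairs; not taken from the dataset.
edges = [("b", "a"), ("a", "c"), ("c", "b")]
id_map = build_id_map(edges)
```

Parties that never appear in an edge have no entry in such a map, so a lookup like `id_map[x]` raises `KeyError` for them, while `id_map.get(x)` returns `None` instead.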

In [5]:
# cugraph needs node IDs to be int32 and weights to be float
cuda_g = cudf.DataFrame.from_pandas(edge_pdf)
cuda_g['src_int'] = cuda_g['src_int'].astype(np.int32)
cuda_g['dst_int'] = cuda_g['dst_int'].astype(np.int32)
cuda_g['base_amt'] = cuda_g['base_amt'].astype(np.float64)

G = cugraph.Graph()
G.from_cudf_edgelist(cuda_g, source='src_int', destination='dst_int')


## PageRank

In [6]:
pgr = cugraph.pagerank(G)
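
To make the resulting `pagerank` feature easier to interpret, here is a minimal pure-Python sketch of the power-iteration idea behind PageRank. It is not cuGraph's implementation; the function name, the damping factor of 0.85, the iteration limit, and the toy graph are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Approximate PageRank scores for a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    n = len(nodes)
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start from a uniform distribution over all nodes.
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # Every node passes an equal share of its rank along each out-link.
        for v in nodes:
            if out_links[v]:
                share = damping * rank[v] / len(out_links[v])
                for dst in out_links[v]:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy directed graph: a -> b, a -> c, b -> c, c -> a
scores = pagerank([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])
```

In this toy graph `c` receives links from both other nodes and ends up with the highest score. For an undirected graph, each edge would be listed once in each direction.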

## Weakly connected components

In [7]:
wcc = cugraph.weakly_connected_components(G)
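
The component label assigned to each node can be understood through a small union-find sketch: two nodes get the same label exactly when a path connects them once edge direction is ignored. This is an illustrative pure-Python version, not what cuGraph runs on the GPU; `connected_components` and the toy edges are hypothetical.

```python
def connected_components(edges):
    """Label each node with a representative of its connected component."""
    parent = {}

    def find(x):
        # Follow parent pointers to the root, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_dst] = root_src  # merge the two components

    return {node: find(node) for node in list(parent)}


# Toy graph with two components: {0, 1, 2} and {3, 4}
labels = connected_components([(0, 1), (1, 2), (3, 4)])
```

Rows that share a `wcc_labels` value belong to the same component in the same sense.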

## Betweenness centrality

In [8]:
bc = cugraph.betweenness_centrality(G)

## Katz centrality

In [9]:
kzc = cugraph.katz_centrality(G)
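
Katz centrality scores a node by the number of walks that reach it, with longer walks discounted by an attenuation factor. The sketch below iterates the update `x = beta + alpha * (sum of scores of in-neighbours)` in pure Python. The function name, the values `alpha=0.1` and `beta=1.0`, and the toy graph are assumptions for illustration; cuGraph's own defaults and normalization may differ.

```python
def katz_centrality(edges, alpha=0.1, beta=1.0, iterations=100, tol=1e-9):
    """Iteratively approximate Katz centrality for a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    in_links = {v: [] for v in nodes}
    for src, dst in edges:
        in_links[dst].append(src)

    score = {v: 0.0 for v in nodes}
    for _ in range(iterations):
        # Each node gets a base score plus a discounted sum over in-neighbours.
        new_score = {
            v: beta + alpha * sum(score[u] for u in in_links[v]) for v in nodes
        }
        delta = sum(abs(new_score[v] - score[v]) for v in nodes)
        score = new_score
        if delta < tol:
            break
    return score


# Toy directed graph: a -> b, b -> c, a -> c
katz = katz_centrality([("a", "b"), ("b", "c"), ("a", "c")])
```

The iteration only converges when `alpha` is smaller than the reciprocal of the largest eigenvalue of the adjacency matrix, which is why a small value is used here.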

## Construct graph algorithm feature groups 

### Convert to pandas dataframes  

In [10]:
pgr = pgr.to_pandas()
wcc = wcc.to_pandas()
bc = bc.to_pandas() 
kzc = kzc.to_pandas()

2022-05-29 11:51:14,960 INFO: init


In [11]:
pgr.columns = ["pagerank", "id_int"]
wcc.columns = ["wcc_labels", "id_int"]
bc.columns = ["betweenness_centralit", "id_int"]
kzc.columns = ["katz_centrality", "id_int"]

### Merge all graph feature dataframes into one dataframe

In [12]:
cugraph_alg_fg = pgr.join(node_pdf.set_index('id_int'), on='id_int', how='inner')
cugraph_alg_fg = wcc.join(cugraph_alg_fg.set_index('id_int'), on='id_int', how='inner')
cugraph_alg_fg = bc.join(cugraph_alg_fg.set_index('id_int'), on='id_int', how='inner')
cugraph_alg_fg = kzc.join(cugraph_alg_fg.set_index('id_int'), on='id_int', how='inner')

In [13]:
cugraph_alg_fg.head()

| | katz_centrality | id_int | betweenness_centralit | wcc_labels | pagerank | id | type |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 3441 | 0.001183 | 5930 | 0.000178 | 66e86b1e | Individual |
| 1 | 0.0 | 3471 | 0.000398 | 5930 | 0.000117 | 0d64dd55 | Individual |
| 2 | 0.0 | 3489 | 0.001145 | 5930 | 0.000226 | fcbdc6bb | Individual |
| 3 | 0.0 | 3761 | 0.003356 | 5930 | 0.000284 | c628b45c | Individual |
| 4 | 0.0 | 3858 | 0.001662 | 5930 | 0.000185 | 7ab7879c | Individual |


In [14]:
cugraph_df = cugraph_alg_fg[["katz_centrality","betweenness_centralit", "wcc_labels", "pagerank", "id"]]

In [15]:
cugraph_df.head()

| | katz_centrality | betweenness_centralit | wcc_labels | pagerank | id | type |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.001183 | 5930 | 0.000178 | 66e86b1e | Individual |
| 1 | 0.0 | 0.000398 | 5930 | 0.000117 | 0d64dd55 | Individual |
| 2 | 0.0 | 0.001145 | 5930 | 0.000226 | fcbdc6bb | Individual |
| 3 | 0.0 | 0.003356 | 5930 | 0.000284 | c628b45c | Individual |
| 4 | 0.0 | 0.001662 | 5930 | 0.000185 | 7ab7879c | Individual |


## <span style="color:#ff5f27;"> 🪄 Register Feature Groups </span>

In [None]:
cugraph_fg = fs.create_feature_group(
    name = "cugraph_fg",
    version = 1,
    primary_key = ["id"],
    description = "cuGraph graph features",
    time_travel_format = "HUDI",
    online_enabled = True,
    statistics_config = {"enabled": True, "histograms": True, "correlations": True, "exact_uniqueness": False},
    # No `expectation_suite` is defined in this notebook; pass one here only if it has been created beforehand.
)
cugraph_fg.save(cugraph_df)