# Experiments on an example publicly available dataset

*Disclaimer*: This repository is intended as a demo to showcase the method and its use; it is neither a complete package nor a full replication of the experiments in the paper.

In [1]:
import polars as pl

from load_datasets import load_dataset
from csp_lib import csp_layer, csp_prepare_data

First, we load the dataset, which in this case is the co-citation version of the Cora dataset. See the `load_datasets.py` file for a list of implemented datasets with links to download them. The `Cora-CC` dataset from https://github.com/jianhao2016/AllSet is included in this repository as a starter.

In [2]:
nodes, edges = load_dataset(name = "Cora-CC", with_features = False)
nodes

nodeId,label,y_0,y_1,y_2,y_3,y_4,y_5,y_6
i64,u32,u32,u32,u32,u32,u32,u32,u32
0,0,1,0,0,0,0,0,0
1,1,0,1,0,0,0,0,0
2,2,0,0,1,0,0,0,0
3,2,0,0,1,0,0,0,0
4,3,0,0,0,1,0,0,0
…,…,…,…,…,…,…,…,…
2703,5,0,0,0,0,0,1,0
2704,5,0,0,0,0,0,1,0
2705,5,0,0,0,0,0,1,0
2706,6,0,0,0,0,0,0,1


In [3]:
edges

nodeId,edgeId
i64,i64
538,0
163,0
219,0
1114,1
163,1
…,…
1885,1577
1886,1577
1884,1578
1885,1578


In order to apply CSP, we need a binary signal. In this case, we take the first class as positive and all other classes as negative.

In [4]:
nodes_l0 = nodes.select("nodeId", pl.col("y_0").alias("nodeProperty"))
nodes_l0

nodeId,nodeProperty
i64,u32
0,1
1,0
2,0
3,0
4,0
…,…
2703,0
2704,0
2705,0
2706,0
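
The `y_k` columns in the `nodes` table are one-hot encodings of `label`, so choosing a different positive class amounts to selecting a different column. As a minimal sketch in plain Python (not using Polars or `csp_lib`; the variable names are illustrative), the one-vs-rest binarization looks like this:

```python
# Labels copied from the first five rows of the `nodes` table above.
labels = [0, 1, 2, 2, 3]
positive_class = 0  # the class treated as "positive"

# One-vs-rest binarization: 1 for the positive class, 0 for every other class.
binary_signal = [1 if label == positive_class else 0 for label in labels]
print(binary_signal)  # [1, 0, 0, 0, 0]
```

The result matches the first five `nodeProperty` values shown above.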


We randomly assign 10% of the nodes to the training set, using a fixed seed for reproducibility.

In [5]:
nodes_l0 = nodes_l0.with_columns(training_set = (pl.arange(0, pl.len()) < pl.len() / 10).shuffle(seed = 42))
nodes_l0

nodeId,nodeProperty,training_set
i64,u32,bool
0,1,false
1,0,true
2,0,false
3,0,true
4,0,false
…,…,…
2703,0,true
2704,0,false
2705,0,false
2706,0,false
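
The expression above first marks the leading tenth of row positions as `true` and then shuffles that boolean column. The following sketch mimics the idea with the standard library only; it assumes 2707 nodes (ids 0 to 2706, as in the table) and will not reproduce the exact permutation that Polars produces for `seed = 42`.

```python
import random

n_nodes = 2707  # node ids 0..2706 in the table above

# Mark the first tenth of positions as training, mirroring `arange < len / 10`.
mask = [i < n_nodes / 10 for i in range(n_nodes)]

# Shuffle with a fixed seed so the split is reproducible.
# Note: Python's generator differs from Polars', so the permutation differs too.
random.Random(42).shuffle(mask)

print(sum(mask))  # 271 training nodes
```

Because the comparison is against `len / 10 = 270.7`, positions 0 through 270 are selected, which gives 271 training nodes.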


Next, we prepare the data for CSP. We need a DataFrame in which each row corresponds to a (node, edge) pair such that the node belongs to the edge; in total, this gives us $\Sigma_E = \sum_{e_j} \delta \left( e_j \right)$ rows. We also need the signal in the `nodeProperty` column, which for now is constant across all rows with the same `nodeId`. The `csp_prepare_data` function also masks labels for nodes not in the training set.

In [6]:
train_df = csp_prepare_data(edges, nodes_l0)
train_df

nodeId,edgeId,nodeProperty,training_set
i64,i64,u32,bool
538,0,0,false
163,0,0,false
219,0,0,false
1114,1,0,false
163,1,0,false
…,…,…,…
1885,1577,0,false
1886,1577,0,false
1884,1578,0,false
1885,1578,0,false
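
To make the row count concrete, here is a small standalone sketch of the layout described above. The toy hypergraph, labels, and training set are hypothetical, and the sketch assumes that a masked label is represented as 0; it does not call `csp_prepare_data` and may differ from what that function does internally.

```python
# Hypothetical toy hypergraph: edge id -> member node ids.
hyperedges = {0: [0, 1, 2], 1: [1, 3]}
labels = {0: 1, 1: 0, 2: 1, 3: 0}  # hypothetical binary labels
training = {0, 1}                  # hypothetical training node ids

# One row per (node, edge) incidence; labels outside the training set are
# masked (assumed here to be replaced by 0).
rows = [
    {
        "nodeId": node,
        "edgeId": edge,
        "nodeProperty": labels[node] if node in training else 0,
        "training_set": node in training,
    }
    for edge, members in hyperedges.items()
    for node in members
]

# Sigma_E is the sum of the edge degrees (number of members per edge).
sigma_e = sum(len(members) for members in hyperedges.values())
print(len(rows), sigma_e)  # 5 5
```

Node 2 has label 1 but is outside the training set, so its row carries `nodeProperty` 0 in this sketch.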


Finally, we can apply one layer of CSP by calling the `csp_layer` function. The function takes an optional parameter `alpha_prime` that is described in Section 4.6.2 of the paper and will not be used here.

In [7]:
updated_df = csp_layer(train_df)
updated_df

nodeId,edgeId,edgeProperty,nodeProperty
i64,i64,f64,f64
538,0,0.0,0.0
163,0,0.0,0.007931
219,0,0.0,0.02
1114,1,0.0,0.0
163,1,0.0,0.007931
…,…,…,…
1885,1577,0.0,0.0
1886,1577,0.0,0.0
1884,1578,0.0,0.0
1885,1578,0.0,0.0


The output of the `csp_layer` function is a DataFrame of the same shape as the input DataFrame (making it possible to use multiple layers). In order to get final predictions for each node, we need to aggregate the dataset (`nodeProperty` should be identical for all rows sharing the same node):

In [8]:
output_df = updated_df.group_by("nodeId").agg(pl.first("nodeProperty").alias("prediction")) 
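
Taking the first value per node is only safe if `nodeProperty` really is identical across all rows of a node. A quick sanity check of that assumption could look like the sketch below, written in plain Python on three rows copied from the `updated_df` output above (the helper names are illustrative).

```python
from collections import defaultdict

# (nodeId, edgeId, nodeProperty) rows copied from the `updated_df` output above.
rows = [(163, 0, 0.007931), (219, 0, 0.02), (163, 1, 0.007931)]

# Collect the distinct nodeProperty values seen for each node.
values_per_node = defaultdict(set)
for node_id, _edge_id, node_property in rows:
    values_per_node[node_id].add(node_property)

# Nodes with more than one distinct value would make "first" ambiguous.
inconsistent = [node for node, values in values_per_node.items() if len(values) > 1]
predictions = {node: next(iter(values)) for node, values in values_per_node.items()}

print(inconsistent)  # []
print(predictions)   # {163: 0.007931, 219: 0.02}
```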

We can sort the nodes by prediction to check how CSP performs in the retrieval setup:

In [9]:
output_df.join(nodes_l0, on = "nodeId") \
    .rename({"nodeProperty": "ground_truth"}) \
    .filter(~pl.col("training_set")) \
    .sort(by = "prediction", descending=True)

nodeId,prediction,ground_truth,training_set
i64,f64,u32,bool
671,0.5,0,false
1265,0.5,1,false
1411,0.5,1,false
1615,0.5,1,false
1896,0.5,1,false
…,…,…,…
2696,0.0,0,false
2697,0.0,0,false
2698,0.0,0,false
2706,0.0,0,false
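
One simple way to summarize such a ranking is precision at $k$, the fraction of positives among the top $k$ nodes. The sketch below uses only the five displayed top rows as input; since all of them are tied at 0.5, their order is arbitrary and the number is an illustration, not a benchmark result. The function name `precision_at_k` is our own and is not part of `csp_lib`.

```python
# (prediction, ground_truth) pairs copied from the first five displayed rows.
ranked = [(0.5, 0), (0.5, 1), (0.5, 1), (0.5, 1), (0.5, 1)]


def precision_at_k(ranked_pairs, k):
    """Fraction of positive ground-truth labels among the top-k ranked pairs."""
    top = ranked_pairs[:k]
    if not top:
        raise ValueError("k must select at least one row")
    return sum(truth for _, truth in top) / len(top)


print(precision_at_k(ranked, 5))  # 0.8
```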


In addition, we can investigate the prediction for each individual node. For instance, let us focus on the first node in the ranking, which has id 671:

In [10]:
edges.filter(pl.col("nodeId") == 671)

nodeId,edgeId
i64,i64
671,1195


We can see that it is contained only in the hyperedge with id 1195. Let us investigate all nodes that are in this hyperedge:

In [11]:
edges.filter(pl.col("edgeId") == 1195)

nodeId,edgeId
i64,i64
1697,1195
671,1195


There are just two nodes: the one of interest and the node with id 1697. Checking the label of node 1697, we can easily understand where the prediction of 0.5 comes from:

In [12]:
nodes_l0.filter(pl.col("nodeId") == 1697)

nodeId,nodeProperty,training_set
i64,u32,bool
1697,1,true
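
The arithmetic can be spelled out as follows. This is a sketch under the assumption that a single layer averages the (masked) node labels within each hyperedge and then averages the resulting edge scores over the edges of each node; it is consistent with the value 0.5 shown above but is not taken from the `csp_lib` source.

```python
# Hyperedge 1195 contains nodes 1697 and 671 (see the tables above).
# Node 1697 is a training node with label 1; node 671 is not in the training
# set, so its label is assumed to enter the computation as 0.
edge_members = {1697: 1.0, 671: 0.0}
edge_score = sum(edge_members.values()) / len(edge_members)

# Node 671 belongs only to hyperedge 1195, so it averages over a single edge.
incident_edge_scores = [edge_score]
node_671_prediction = sum(incident_edge_scores) / len(incident_edge_scores)

print(node_671_prediction)  # 0.5
```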


Note that the DataFrame of predictions omits all nodes that were isolated in the original dataset, hence the lower number of rows (see Table 1 in the paper for an overview of dataset properties, including the number of isolated nodes).

## A scikit-learn-style interface for CSP and its very basic use

Apart from the basic interface operating on DataFrames, we also provide a scikit-learn-style interface: a model with the methods `fit` and `predict`. This model also allows us to use CSP in the inductive setting, which is described in Section 4.6.3 of the paper.

In [13]:
import numpy as np

from csp_lib import SkLearnCSP

Let us construct a simple hypergraph with 3 nodes and 3 hyperedges as an incidence matrix $\boldsymbol{H}$ and a node label vector $\boldsymbol{y}$:

In [14]:
H = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1]])
y = np.array([1, 0, 0])
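
To relate this matrix to the DataFrame interface used earlier, the sketch below lists the (node, edge) incidence pairs encoded by $\boldsymbol{H}$. It assumes that rows index nodes and columns index hyperedges, which matches the way a single node is later passed to `predict` as one row. Plain nested lists are used instead of NumPy to keep the snippet dependency-free.

```python
# Same incidence matrix as above, as nested lists (rows: nodes, columns: edges).
H = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]

# Every nonzero entry H[i][j] means "node i belongs to hyperedge j".
pairs = [
    (node_id, edge_id)
    for node_id, row in enumerate(H)
    for edge_id, member in enumerate(row)
    if member
]
print(pairs)  # [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)]
```

Node 0 lies only in hyperedge 0, node 1 in hyperedges 0 and 1, and node 2 in hyperedges 1 and 2.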

Next, we create a model with 3 CSP layers and $\alpha' = 1$ and train it on the previously constructed hypergraph:

In [15]:
csp = SkLearnCSP(layers = 3, alpha_prime = 1)
csp.fit(H, y)

We can print the aggregated edge scores $\boldsymbol{r}^{(2)}_j$:

In [16]:
csp.model

edgeId,edgeProperty
i64,f64
1,0.15625
2,0.0625
0,0.3125


And we can use the trained model to obtain a prediction for the nodes of the original hypergraph:

In [17]:
csp.predict(H)

array([0.3125  , 0.234375, 0.109375])
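
For this toy hypergraph, the numbers above can be reproduced by hand with repeated averaging. The sketch below assumes that each layer first averages node values within every hyperedge and then averages edge scores over the hyperedges of every node. It ignores training masks (all three nodes are labelled here) and does not model the role of `alpha_prime`, so it should be read as an illustration consistent with the printed output rather than as the implementation in `csp_lib`. The function names are our own.

```python
H = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]  # rows: nodes, columns: hyperedges
y = [1.0, 0.0, 0.0]


def mean(values):
    return sum(values) / len(values)


def propagate(H, node_values):
    """One averaging step: nodes -> edge scores -> updated node values."""
    n_nodes, n_edges = len(H), len(H[0])
    edge_scores = [
        mean([node_values[i] for i in range(n_nodes) if H[i][j]])
        for j in range(n_edges)
    ]
    new_values = [
        mean([edge_scores[j] for j in range(n_edges) if H[i][j]])
        for i in range(n_nodes)
    ]
    return edge_scores, new_values


values = y
for _ in range(3):  # three layers, as in the model above
    edge_scores, values = propagate(H, values)

print(edge_scores)  # [0.3125, 0.15625, 0.0625]
print(values)       # [0.3125, 0.234375, 0.109375]
```

The edge scores after the third step match `csp.model`, and the node values match the output of `csp.predict(H)`.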

With this interface, however, we can also apply the model inductively to a hypothetical new node, which is defined by its edge-incidence list (a binary-valued vector with ones in indices of edges the node belongs to):

In [18]:
csp.predict(np.array([[1, 1, 1]]))

array([0.17708333])
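
The inductive prediction above is consistent with simply averaging the stored edge scores over the hyperedges the new node belongs to. The following sketch checks this with the values printed by `csp.model`; it is an interpretation of the output, not code from `csp_lib`.

```python
# Edge scores copied from the `csp.model` output above.
edge_scores = {0: 0.3125, 1: 0.15625, 2: 0.0625}

# Edge-incidence list of the new node: it belongs to all three hyperedges.
new_node = [1, 1, 1]

# Average the scores of the incident hyperedges.
incident = [edge_scores[j] for j, member in enumerate(new_node) if member]
prediction = sum(incident) / len(incident)

print(round(prediction, 8))  # 0.17708333
```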