PRincipled Identification of Structural Motif is a framework for identifying structural motifs in relational data. This is an implementation of the algorithms described in the paper "Principled and Efficient Motif Finding for Structure Learning in Lifted Graphical Models" by Jonathan Feldstein, Dominic Phillips and Efthymia Tsamoura, AAAI 2023. It is implemented in C++ and highly parallelized to allow for faster computations.
You can read the original paper here. (https://arxiv.org/abs/2302.04599)
Structure Learning Pipeline: The above shows an example of a dataset about a class, where entities are people or books. Black edges represent
This README should provide you the necessary understanding to run examples using this library and build upon it further. If you do use this library, please cite accordingly. This README consists of the following sections:
- A brief technical overview
- The library structure
- How to install PRISM?
- How to run the examples?
- How to set up your own experiments?
In this section, we briefly describe what PRISM can be used for and the underlying technical steps that are used to find structural motifs. For a detailed description and a more fundamental mathematical understanding, please consult "Principled and Efficient Motif Finding for Structure Learning in Lifted Graphical Models" by Jonathan Feldstein, Dominic Phillips and Efthymia Tsamoura, AAAI 2023.
Structure learning is a core problem in AI central to the fields of neuro-symbolic AI and statistical relational learning. It consists in automatically learning a logical theory from data. The basis for structure learning is mining repeating patterns in the data, known as structural motifs. Finding these patterns reduces the exponential search space and therefore guides the learning of formulas. Despite the importance of motif learning, it is still not well understood. We present the first principled approach for mining structural motifs in lifted graphical models, languages that blend first-order logic with probabilistic models, which uses a stochastic process to measure the similarity of entities in the data.
PRISM takes the following steps, illustrated in the figure below to find structural motifs
The input to PRISM is data in form of a relational database. A relational database
We then pre-process the data through hierarchical clustering. Illustrated in the figure above (Step (i)). This identifies groups of densely-connected nodes. The motivation behind considering only densely-connected nodes is that spuriously-connected nodes appear rarely in the path signatures and therefore only add noise to the path signature counts that we use at a later stage for clustering nodes into abstract concepts. Furthermore, this pre-processing also leads to a computational saving as we show in our paper.
This hierarchical clustering algorithm is based on spectral clustering.
In overview, we begin by converting a hypergraph
In a next step, we run
An upper bound on the
where
Finally, we cluster nodes in abstract concepts; collections of entities that have similar neighbourhoods in the hypergraph. Illustrated in the figure above (Step (ii) and (iii)). Nodes are then partitioned into path-symmetric sets based on the similarity of their
In practice, we separate the path-symmetry clustering into two steps. Firstly, we cluster nodes based on their distance symmetry and then based on their path symmetry. The stage of clustering by distance only serves to speed up the subsequent path-symmetric clustering. This is a valid approach since any path-symmetry implies distance symmetry, and checking distance symmetry is a faster algorithm (
./Graphs
contains a light C++ implemented graph library supporting both graphs and hypergraphs.
Our UndirectedGraph
class is implemented as a child class of the BOOST graph library. We extended the members by
Graph graph;
property_map<Graph, edge_weight_t>::type weightmap = get(edge_weight, graph);
MatrixXd adjacency_matrix;
MatrixXd degree_matrix;
MatrixXd laplacian_matrix;
MatrixXd sqrt_degree;
bool diameter_computed = false;
size_t diameter{0};
Further, we implemented the following methods (further methods are implemented but meant to be used only as subfunctions of the methods described here):
UndirectedGraph::UndirectedGraph(HyperGraph &hypergraph) : graph(hypergraph.number_of_nodes())
a method that allows to implement a graph from a HyperGraph
object. This achieved by extending each hyperedge in the original HyperGraph
by a clique of edges connecting each node in the hyperedge.
UndirectedGraph::UndirectedGraph(UndirectedGraph &graph_template, set<NodeId> subgraph_nodes)
pair<UndirectedGraph, UndirectedGraph> UndirectedGraph::cheeger_cut(VectorXd &second_EV)
Otherwise, we have implemented a few getter and setter functions to access and modify the different private members.
Our HyperGraph
class is an implementation of a hypergraph implemented from scratch as no library in C++ was available. It contains the following members:
map<EdgeId, vector<NodeId>> edges;
map<EdgeId, double> edge_weights;
map<EdgeId, Predicate> predicates;
map<NodeId, NodeName> node_ids_names;
map<NodeName, NodeId> node_names_ids;
map<NodeId, NodeType> nodes;
map<NodeId, vector<EdgeId>> memberships;
map<Predicate, vector<NodeType>> predicate_argument_types;
set<NodeType> node_types;
size_t estimated_graph_diameter{0};
map<NodeId, bool> is_source_node;
Three constructors are available to instantiate a hypergraph
Firstly, we allow to instantiate a hypergraph from files.
HyperGraph(string const& db_file_path, string const& info_file_path, bool safe);
Secondly, we allow to instantiate a hypergraph from an UndirectedGraph object and a template. This method is used to transform back a graph to a hypergraph after performing hierarchical clustering on the graph that was obtained by extending each hyperedge in an original hypergraph by a clique.
HyperGraph(UndirectedGraph &graph, HyperGraph &hypergraph_template);
Thirdly, we allow to instantiate a hypergraph from a set of nodes with a template describing the edges.
HyperGraph(set<NodeId> nodes_subset, HyperGraph &hypergraph_template);
The HierarchicalClusterer
class is our implementation of a recursive spectral clustering.
Our constructor instantiates an object given a hypergraph to be clustered, and a config file.
HierarchicalClusterer(HyperGraph hypergraph, HierarchicalClustererConfig config);
The main method to perform the hierarchical clustering is
vector<HyperGraph> & run_hierarchical_clustering();
Is a header file for three auxiliary structs, namely
struct GroundRelation;
struct Relation;
struct NodePartition;
The RandomWalker folder contains all classes and functions to run random walks and collect the data on path signatures and their counts.
The Concepts folder contains all classes to store the identified abstract concepts and print them into files for further usage in other frameworks.
The Utils folder contains a plethora of auxiliary methods including to compute statistics, allow operations on sets, vectors and maps and much more.
The UnitTests folders contains all tests to check the correctness of the code.
PRISM needs three libraries pre-installed to run:
-
BOOST: A C++ library we use for their graph implementation.
-
EIGEN: A C++ library for matrix manipulation.
-
SPECTRA: A C++ library for fast eigen decompositions.
Once these three libraries are installed, it remains to download this library, change the paths in the CMakeLists file to the installation paths of the respective libraries and build the library.
To compile the library:
- Check that your system has cmake version 3.16 or later (Installation instructions: https://cmake.org/install/)
- Once CMake is installed, you can use it from the command line by opening a terminal or command prompt and navigating to the directory where your CMakeLists.txt file is located.
To generate build files for your project using CMake, you can use the following command:
cmake -S . -B cmake-build-debug -DBOOST_PATH="/path/to/boost_1_78_0" -DEIGEN_PATH="/path/to/eigen-3.4.0" -DSPECTRA_PATH="/path/to/spectra/include"
Then build the program by
cmake --build cmake-build-debug
You should then be able to run the code by using
cd cmake-build-debug
./PRISM ../Datasets/imdb1.db ../Datasets/imdb1.info ../Datasets/imdb_output 0.1 0.01 1 0
This runs PRISM on a small IMDB test dataset using the following hyperparameters:
-
../Datasets/imdb1.db
is the file path to the database -
../Datasets/imdb1.info
is the file path to the info file describing the different predicates -
../Datasets/imdb_output
is the file path to the different generated outputs. $\epsilon = 0.1$ $\alpha = 0.01$ multiprocessing = True
verbose = False
To run your own experiments, simply add .db
and .info
files akin to the imdb1.db
and imdb1.info
to the Datasets
folder.
Rerun the command above with the corrected paths and changed parameters as required.