Code for Entity Resolution in Relational Networks

Code for the paper "Collective Entity Resolution in Familial Networks" by Pigi Kouki, Jay Pujara, Christopher Marcum, Laura Koehly, and Lise Getoor. IEEE International Conference on Data Mining (ICDM), 2017.

To run this code, you first need to install the Probabilistic Soft Logic (PSL) software (version 1.2.1), available here: https://github.com/linqs/psl/tree/1.2.1.

Please cite this work as:

@InProceedings{kouki:icdm17,
  author = "Kouki, Pigi and Pujara, Jay and Marcum, Christopher and Koehly, Laura and Getoor, Lise",
  title = "Collective Entity Resolution in Familial Networks",
  booktitle = "IEEE International Conference on Data Mining (ICDM)",
  year = "2017"
}

Installation instructions:

The following assumes that everything is done in the same directory, e.g. icdm. The instructions are for macOS.

Download and install the Probabilistic Soft Logic (PSL) software, version 1.2.1, from here: https://github.com/linqs/psl. Useful info: https://github.com/linqs/psl/wiki

Make sure you can run the basic examples. For help, check here: https://github.com/linqs/psl/wiki/Running-a-program

Clone this git repository: git clone https://github.com/pkouki/icdm2017

Go into the h2 directory and run build.sh to compile h2. Use this version of h2, because the version that ships with PSL has a bug that causes crashes in certain cases.

Change the classpath.out file inside your psl-example to use the newly compiled h2. For example, change the path from the default /Users/user/.m2/repository/com/h2database/h2/1.2.126/h2-1.2.126.jar to something like /Users/pigikouki/Desktop/icdm/icdm2017/h2/bin/h2-1.2.126.jar

Copy the folders from icdm2017/data into psl-archetype/psl-archetype-example/src/main/resources/archetype-resources/data

Copy the source files from icdm2017/src/main/java/edu/ucsc/NIH to psl-archetype/psl-archetype-example/src/main/resources/archetype-resources/src/main/java

Compile: mvn compile
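The classpath change and the two copy steps above can be scripted. The following is a minimal sketch, not part of the repository: the function names swap_h2_jar and copy_inputs are hypothetical, and every path is passed as an argument, so nothing about your local layout is assumed. The file is rewritten through a temporary copy because the in-place flag of sed differs between macOS and Linux.

```shell
#!/bin/sh
# Hypothetical helpers sketching the manual setup steps; adjust to your layout.

# Replace the default h2 jar entry in a classpath file with the locally built jar.
# Arguments: classpath file, old jar path, new jar path.
# Assumes the paths contain no '|' character, which is used as the sed delimiter.
swap_h2_jar() {
  classpath_file="$1"
  old_jar="$2"
  new_jar="$3"
  sed "s|$old_jar|$new_jar|" "$classpath_file" > "$classpath_file.tmp" &&
    mv "$classpath_file.tmp" "$classpath_file"
}

# Copy the data folders and the model source files into the PSL example project.
# Arguments: root of the cloned repository, root of the PSL example project.
copy_inputs() {
  repo="$1"
  example="$2"
  mkdir -p "$example/data" "$example/src/main/java" &&
    cp -R "$repo/data/." "$example/data/" &&
    cp -R "$repo/src/main/java/edu/ucsc/NIH/." "$example/src/main/java/"
}

# Example calls, using the paths quoted in the instructions:
# swap_h2_jar classpath.out \
#   /Users/user/.m2/repository/com/h2database/h2/1.2.126/h2-1.2.126.jar \
#   /Users/pigikouki/Desktop/icdm/icdm2017/h2/bin/h2-1.2.126.jar
# copy_inputs icdm2017 \
#   psl-archetype/psl-archetype-example/src/main/resources/archetype-resources
```

Check classpath.out after the swap to confirm that it lists the rebuilt jar before compiling.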

You can now run the models as follows from within the psl_example directory:

java -cp ./target/classes:`cat classpath.out` edu.ucsc.NIH.ERWikiDataLearnThresAfterGreedy
java -cp ./target/classes:`cat classpath.out` edu.ucsc.NIH.AllRulesValidTrainOn4FoldsNewThres

If the program runs out of memory, you may want to increase the Java VM maximum heap size, for example with the -Xmx option of the java command.
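As an illustration of raising the heap limit, the sketch below wraps the run command in a small function that adds the JVM's standard -Xmx option. The function name run_model, the JAVA override variable, and the 8g value are assumptions made for this example; pick a heap size that fits your machine.

```shell
#!/bin/sh
# Hypothetical wrapper: run one model class with an explicit maximum heap size.
# Must be called from the directory that contains classpath.out and target/classes.
# Set JAVA=echo to preview the arguments without starting the JVM.
run_model() {
  heap="$1"    # maximum heap size for -Xmx, e.g. 8g (an assumed value)
  class="$2"   # fully qualified name of the model class
  "${JAVA:-java}" "-Xmx$heap" -cp "./target/classes:$(cat classpath.out)" "$class"
}

# Example call:
# run_model 8g edu.ucsc.NIH.AllRulesValidTrainOn4FoldsNewThres
```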

For both datasets, we provide the feature files that serve as input to the PSL models. This is essential for the NIH dataset, since it is impossible to release an anonymized version of the raw data. The data for the NIH dataset are in the folder data/NIH. The data for the Wikidata dataset are in the folder data/wikidata/newSplit.