No description, website, or topics provided.
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
.settings
ISWC18_experiments
resources
src/cinex
.classpath
.gitattributes
.gitignore
.project
CINEXClassifier.xml
CINEXEvaluation.xml
CINEXPipeline.xml
CINEXPreprocessing.xml
README.md
dependency-reduced-pom.xml
pom.xml

README.md

CINEX (Counting INformation EXtraction)

Information extraction traditionally focuses on extracting relations between identifiable entities, such as <Monterey, isLocatedIn, California>. Yet, texts often also contain counting information, stating that a subject is in a specific relation with a number of objects, without mentioning the objects themselves, for example, “The U.S. state of California is divided into 58 counties.”.

Such counting quantifiers can help in a variety of tasks such as query answering or knowledge base curation, but are neglected by prior work. We develop the first full-fledged system for extracting counting information from text, called CINEX, which predicts counting quantifiers given a pair of <subject, relation>, e.g., <California, hasCounties, ∃58>.

We employ distant supervision using fact counts from a knowledge base as training seeds, and leverage CRF-based sequence tagging models to identify counting information in the text. Experiments with five human-evaluated relations show that CINEX can achieve 60% average precision for extracting counting information. In a large-scale experiment, we demonstrate the potential for knowledge base enrichment by applying CINEX to 2,474 frequent relations in Wikidata. CINEX can assert the existence of 2.5M facts for 110 distinct relations, which is 28% more than the existing Wikidata facts for these relations.

The predicted counting quantifiers for (selected 37) Wikidata relations, by running the learned models on all entities in a class given a Wikidata property-class pair (e.g., all child of humans), can be queried at https://cinex.cs.ui.ac.id/. For instance, "How many spouses does Isaac Newton have?" or "Does Wikidata contain all children of George HW Bush?".

Installing and Running CINEX

Requirements

  • Java Runtime Environment (JRE) 1.7.x or higher
  • CRF++: Yet Another CRF toolkit

Maven

To build the fat (executable) JAR:

  • Install the WS4J library in your local Maven repo, e.g., mvn install:install-file -Dfile=./lib/ws4j-1.0.1.jar -DgroupId=edu.cmu.lti -DartifactId=ws4j -Dversion=1.0.1 -Dpackaging=jar
  • Run mvn package to build the executable JAR file (in target/CINEX-<version>.jar).

CINEX is also available on Maven Central. Please add the following dependency in your pom.xml.

<dependency>
  <groupId>com.github.paramitamirza</groupId>
  <artifactId>CINEX</artifactId>
  <version>1.0.1</version>
</dependency>

Usage

usage: CINEX
 -u,--url <arg>      Input Wikipedia URL   
 -i,--input <arg>    (Optional) Input text file (.txt) path
 -p,--prop <arg>     Wikidata property ID
 -c,--class <arg>    Wikidata class ID
 -m,--models <arg>   Directory containing CRF++ models for relations
 -r,--crf <arg>      CRF++ directory path

As the source text, the URL of a Wikipedia article must be provided, optionally, a cleaner source text can also be given as a text file. A pair of Wikidata <property, class> IDs (e.g., <P40, Q5> denoting a child-of-human relation) is required, as well as the path to a directory containing the corresponding model* (P40_Q5.model.gz). Finally, CRF++ must be installed, and its path must also be given. For example:

java -Xmx2G -jar ./target/CINEX-<version>.jar 
	-u https://en.wikipedia.org/wiki/Robin_Williams 
	-p P26 -c Q5 -m ./crf_models 
	--crf /home/paramita/Projects/counting_quantifier/tools/CRF++-0.58/

which gives as a result:

The predicted counting quantifier of spouse of Robin_Williams (class: human) is: 3
	confidence score: 0.116744
	evidence (type: ordinal): Williams married his [third] wife , graphic designer Susan Schneider , 
				  on October 22 , 2011 , in St. Helena , California .

*) Please find the list of available models in resources/CRF_models.tsv, all model files can be downloaded from here.

ISWC 2018 Experiments

Please check ISWC18_experiments/, as well as the following publication, for more information about data used, experimental details and results.

Publication

Paramita Mirza, Simon Razniewski, Fariz Darari and Gerhard Weikum. Enriching Knowledge Bases with Counting Quantifiers. In Proceedings of ISWC 2018. [pdf]

ACL 2017 Experiments

Please check Relation Cardinality Extraction, as well as the following publication, for more information about data used, experimental details and results.

Publication

Paramita Mirza, Simon Razniewski, Fariz Darari and Gerhard Weikum. Cardinal Virtues: Extracting Relation Cardinalities from Text. In Proceedings of ACL 2017 (short paper). [pdf]

Contact

For more information please contact Paramita Mirza (paramita135@gmail.com).