nju-websoft/PCSG

PCSG

Data, source code, and examples for the paper "PCSG: Pattern-Coverage Snippet Generation for RDF Datasets".

For reusing an RDF dataset, understanding its content is a prerequisite. To support the comprehension of its large and complex structure, existing methods mainly generate an abridged version of an RDF dataset by extracting representative data patterns as a summary. As a complement, recent attempts extract a representative subset of concrete data as a snippet. We extend this line of research by injecting the strength of summary into snippet. We propose to generate a pattern-coverage snippet that best exemplifies the patterns of entity descriptions and links in an RDF dataset. Our approach incorporates formulations of group Steiner tree and set cover problems to generate compact snippets. This extensible approach is also capable of modeling query relevance to be used with dataset search. Experiments on thousands of real RDF datasets demonstrate the effectiveness and practicability of our approach.

Data

Datasets, queries and some result snippets used in the experiments are provided in data.

PCSG

In 1-evaluation-of-pcsg, the file dataset.tsv lists all RDF datasets used in the experiment. The first column is the local ID of each dataset, and the second column is the title of the dataset (from its metadata). The third column through the second-to-last column contain links to the dataset's dump files; there can be several, because one dataset can have more than one dump file. The last column gives the portal from which the dataset was retrieved (DataHub.io or Data.gov).
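The column layout described above can be read with a few lines of Java. The following is a minimal sketch, not part of the repository: the class name `DatasetTsvExample`, the `DatasetRow` type, and its field names are hypothetical, and the code assumes that columns are separated by tab characters, as the .tsv extension suggests. The sample row in `main` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DatasetTsvExample {

    /** One parsed row of dataset.tsv (hypothetical type, not from the repository). */
    static final class DatasetRow {
        final String localId;
        final String title;
        final List<String> dumpLinks;
        final String portal;

        DatasetRow(String localId, String title, List<String> dumpLinks, String portal) {
            this.localId = localId;
            this.title = title;
            this.dumpLinks = dumpLinks;
            this.portal = portal;
        }
    }

    /** Splits one line on tabs: ID, title, one or more dump links, portal. */
    static DatasetRow parseRow(String line) {
        String[] cols = line.split("\t", -1); // -1 keeps trailing empty columns
        if (cols.length < 3) {
            throw new IllegalArgumentException("Expected at least 3 columns: " + line);
        }
        // Everything between the title and the last column is a dump link.
        // A row with exactly 3 columns yields an empty link list.
        List<String> links = new ArrayList<>(Arrays.asList(cols).subList(2, cols.length - 1));
        return new DatasetRow(cols[0], cols[1], links, cols[cols.length - 1]);
    }

    public static void main(String[] args) {
        // Made-up sample row; real rows come from 1-evaluation-of-pcsg/dataset.tsv.
        String sample = "42\tExample dataset\thttp://example.org/a.nt\thttp://example.org/b.nt\tDataHub.io";
        DatasetRow row = parseRow(sample);
        System.out.println(row.localId + ": " + row.dumpLinks.size() + " dump file(s), portal " + row.portal);
    }
}
```

The local ID is kept as a string so that the sketch does not depend on how IDs are formatted in the file.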

For a total of 15 RDF datasets, the result snippets generated by PCSG(-$\tau$) and IlluSnip(-$\tau$) are provided in snippet-result-example. Details of the 15 datasets are listed in snippet-result-example/dataset.tsv, which is a subset of the datasets mentioned above. The result snippets for each dataset are provided in N-Triples format in the folder named after the dataset's local ID.
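To get a quick look at a result snippet, its N-Triples lines can be split into subject, predicate, and object. The sketch below is a simplified reader, not a full N-Triples parser and not code from the repository: it assumes one triple per line, relies on subjects and predicates containing no whitespace, and does not unescape literals. The class and method names are hypothetical; for anything beyond inspection, a dedicated RDF library is the safer choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class SnippetInspector {

    /** Splits each non-empty, non-comment line into {subject, predicate, object}. */
    static List<String[]> parseTriples(List<String> lines) {
        List<String[]> triples = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comment lines
            }
            // Subject and predicate never contain whitespace, so the first two
            // whitespace runs separate them from the object.
            String[] parts = line.split("\\s+", 3);
            if (parts.length < 3) {
                throw new IllegalArgumentException("Not a triple: " + raw);
            }
            String object = parts[2].trim();
            if (object.endsWith(".")) {
                // Drop the terminating dot of the statement.
                object = object.substring(0, object.length() - 1).trim();
            }
            triples.add(new String[] {parts[0], parts[1], object});
        }
        return triples;
    }

    /** Collects the distinct predicates used in the snippet, in sorted order. */
    static Set<String> distinctPredicates(List<String[]> triples) {
        Set<String> predicates = new TreeSet<>();
        for (String[] t : triples) {
            predicates.add(t[1]);
        }
        return predicates;
    }

    public static void main(String[] args) {
        // Made-up sample lines; real snippets are the N-Triples files in snippet-result-example.
        List<String> lines = Arrays.asList(
            "<http://example.org/a> <http://example.org/name> \"Alice B\" .",
            "<http://example.org/a> <http://example.org/knows> <http://example.org/b> .");
        List<String[]> triples = parseTriples(lines);
        System.out.println(triples.size() + " triples, predicates: " + distinctPredicates(triples));
    }
}
```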

QPCSG

In 2-evaluation-of-qpcsg, the file query-dataset-pair.tsv lists all the query-dataset (Q-D for short) pairs used in the experiment. The first column is the local ID of each Q-D pair, the second column is the local ID of the dataset (which corresponds to the dataset's local ID in 1-evaluation-of-pcsg), and the third column is the keyword query, in which keywords are separated by " ".
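A row of query-dataset-pair.tsv can be split into its three fields as follows. This is a minimal sketch under the assumption that columns are tab-separated; the class name `QueryDatasetPairExample`, the `QdPair` type, and its field names are hypothetical and do not come from the repository.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class QueryDatasetPairExample {

    /** One parsed Q-D pair (hypothetical type, not from the repository). */
    static final class QdPair {
        final String pairId;
        final String datasetId;
        final List<String> keywords;

        QdPair(String pairId, String datasetId, List<String> keywords) {
            this.pairId = pairId;
            this.datasetId = datasetId;
            this.keywords = keywords;
        }
    }

    /** Splits one line into pair ID, dataset ID, and the space-separated keywords. */
    static QdPair parsePair(String line) {
        String[] cols = line.split("\t", -1);
        if (cols.length < 3) {
            throw new IllegalArgumentException("Expected 3 columns: " + line);
        }
        String query = cols[2].trim();
        // An empty query yields an empty keyword list rather than a single empty keyword.
        List<String> keywords = query.isEmpty()
            ? Collections.<String>emptyList()
            : Arrays.asList(query.split(" +"));
        return new QdPair(cols[0], cols[1], keywords);
    }

    public static void main(String[] args) {
        // Made-up sample row; real rows come from 2-evaluation-of-qpcsg/query-dataset-pair.tsv.
        QdPair pair = parsePair("3\t42\tcity population census");
        System.out.println("pair " + pair.pairId + " on dataset " + pair.datasetId + ": " + pair.keywords);
    }
}
```

The dataset ID in the second field can then be looked up in dataset.tsv from 1-evaluation-of-pcsg to find the dump files for the pair.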

For a total of 45 Q-D pairs, the result snippets generated by QPCSG(-$\tau$) and KSD(-$\tau$) are provided in snippet-result-example. Details of the Q-D pairs are listed in snippet-result-example/query-dataset-pair.tsv, which is likewise a subset of the Q-D pairs mentioned above. The result snippets for each Q-D pair are provided in N-Triples format in the folder named after the pair's local ID.

Comparison with Summary

In 3-comparison-with-summary, the file pcsg80.nt is the result snippet in N-Triples format, and the file abstat-summary.tsv is the result summary produced by ABSTAT. The SPARQL query patterns used in the user study are provided in sparql-queries-for-test.txt.

Source Codes and Dependencies

All source code of the implementation is provided in code/src.

Dependencies

  • JDK 8+
  • MySQL 5.6+
  • Apache Lucene 7.5.0
  • JGraphT 1.3.0

Useful packages (jar files) are provided in code/lib.

Run an Example

We provide an example dataset for running our algorithm PCSG. The steps are as follows:

  1. Move code/example to your local folder. In it, dataset.txt contains triples of element IDs. Each ID corresponds to a row number in label.txt, where the first column shows whether the element is a literal and the second column shows the textual form of the element.
  2. Open src as a Java project. Edit the variable "folder" in src/PCSG/example/DatasetIndexer.java to your local folder path.
  3. Run all the steps in DatasetIndexer.main() to generate the indexes for EDPs (entity description patterns) and LPs (link patterns), get the set cover components for the dataset, and get the hub labels for the Group Steiner Tree.
  4. Edit the folder path in getResultTree.java and run it; the edges of the resulting Group Steiner Trees are printed to the terminal.
  5. The IlluSnip result for the example can be generated similarly: edit the folder path in illusnipTest.java and run it.
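To check that the example files from step 1 are in place before indexing, the ID triples in dataset.txt can be resolved against label.txt. The sketch below is not part of the repository and rests on several assumptions that the description above does not confirm: IDs in dataset.txt are whitespace-separated integers, the two columns of label.txt are tab-separated, and row numbers start at 1 (passed as the hypothetical parameter `firstRowNumber`, so it can be changed to 0 if needed). The class and method names are hypothetical, and the literal flag in the first column of label.txt is ignored because its encoding is not stated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExampleDatasetReader {

    /** Returns the textual form stored in the second column of a label.txt row. */
    static String labelText(String labelLine) {
        String[] cols = labelLine.split("\t", 2); // assumed tab-separated: flag, text
        return cols.length > 1 ? cols[1] : "";
    }

    /**
     * Replaces every ID of every triple by its textual form.
     * firstRowNumber is the ID of the first row of label.txt (assumed to be 1).
     */
    static List<String> resolveTriples(List<String> tripleLines, List<String> labelLines,
                                       int firstRowNumber) {
        List<String> resolved = new ArrayList<>();
        for (String raw : tripleLines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] ids = line.split("\\s+"); // assumed whitespace-separated IDs
            if (ids.length != 3) {
                throw new IllegalArgumentException("Expected 3 IDs: " + raw);
            }
            List<String> texts = new ArrayList<>();
            for (String id : ids) {
                int row = Integer.parseInt(id) - firstRowNumber;
                texts.add(labelText(labelLines.get(row)));
            }
            resolved.add(String.join(" ", texts));
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Made-up sample content; real content comes from dataset.txt and label.txt.
        List<String> labels = Arrays.asList(
            "0\thttp://example.org/alice",
            "0\thttp://example.org/name",
            "1\tAlice");
        List<String> triples = Arrays.asList("1 2 3");
        for (String t : resolveTriples(triples, labels, 1)) {
            System.out.println(t);
        }
    }
}
```

To run it on the real files, the two lists can be filled with `java.nio.file.Files.readAllLines` on the paths of dataset.txt and label.txt in your local folder.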

If you have any difficulties or questions about using the code or running the example, please email xxwang@smail.nju.edu.cn.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Citation

If you use this code or these results, please cite the paper as follows:

@inproceedings{PCSG,
  author    = {Xiaxia Wang and Gong Cheng and Tengteng Lin and Jing Xu and Jeff Z. Pan and Evgeny Kharlamov and Yuzhong Qu},
  title     = {PCSG: Pattern-Coverage Snippet Generation for RDF Datasets},
  booktitle = {{ISWC} 2021},
  year      = {2021}
}

About

PCSG: Pattern-Coverage Snippet Generation for RDF Datasets (ISWC 2021)
