SetExpan: Corpus-based Set Expansion Framework
- Added the APR dataset to the following Google Drive link, and moved the wiki queries and ground-truth sets into the dataset.
- Added the original EgoSet dataset under the "./data/" folder for reference.
- A new (but slightly different) version of SetExpan (used in HiExpan) is available at: https://github.com/mickeystroller/HiExpan/tree/master/src/SetExpan-new, together with an easier-to-use data preprocessing pipeline.
This is the source code for the SetExpan framework, developed for corpus-based set expansion (i.e., finding the "complete" set of entities belonging to the same semantic class, based on a given corpus and a tiny set of seeds).
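To make the task concrete, here is a toy sketch of set expansion. It is not the SetExpan algorithm: it simply ranks candidate entities by the average Jaccard overlap between their context features and those of the seeds. The function name `expand_set`, the `toy_contexts` data, and the "_" placeholder notation for contexts are all hypothetical and chosen only for illustration.

```python
def expand_set(entity_contexts, seeds, top_k=3):
    """Rank non-seed entities by average Jaccard similarity to the seeds.

    entity_contexts maps each entity to the set of context patterns it
    occurs in. This is a toy scoring rule, not the SetExpan method.
    """
    if not seeds:
        return []
    seed_set = set(seeds)
    scores = {}
    for entity, contexts in entity_contexts.items():
        if entity in seed_set:
            continue  # seeds are already in the set
        total = 0.0
        for seed in seeds:
            seed_ctx = entity_contexts.get(seed, set())
            union = contexts | seed_ctx
            total += len(contexts & seed_ctx) / len(union) if union else 0.0
        scores[entity] = total / len(seeds)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda e: (-scores[e], e))
    return ranked[:top_k]


# Hypothetical toy corpus statistics: entity -> observed context patterns.
toy_contexts = {
    "Illinois": {"state of _", "governor of _", "_ senator"},
    "Texas": {"state of _", "governor of _", "_ senator"},
    "California": {"state of _", "governor of _"},
    "Chicago": {"city of _", "mayor of _"},
    "Ohio": {"state of _", "_ senator"},
}

expanded = expand_set(toy_contexts, ["Illinois", "Texas"], top_k=2)
print(expanded)
```

With the seeds "Illinois" and "Texas", the two other states share contexts with the seeds and rank ahead of "Chicago", which shares none.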
We provide the data preprocessing code and the Python implementation of SetExpan. To use our data preprocessing code, you need to download the following two packages and put them in the "/src/tools/" folder:
- AutoPhrase: used to extract quality phrases from raw input data.
- Stanford CoreNLP 3.8.0: used for POS tagging and for selecting quality noun phrases from the phrase list generated by AutoPhrase. Each quality noun phrase is treated as an "entity".
Otherwise, you can directly download our preprocessed data from Google Drive, unzip it, and put the dataset under the "./data/" folder.
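The noun-phrase selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the input format (a list of phrases, each a list of `(word, tag)` pairs), the function name `select_noun_phrases`, and the rule "keep a phrase if its last token is a noun" are hypothetical and may differ from what the repository's preprocessing code actually does. The tags themselves are the standard Penn Treebank noun tags.

```python
# Penn Treebank noun tags: singular, plural, proper singular, proper plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def select_noun_phrases(tagged_phrases):
    """Keep phrases whose final token carries a noun tag.

    tagged_phrases is a list of phrases; each phrase is a list of
    (word, pos_tag) pairs. The head-noun rule is an assumed heuristic.
    """
    entities = []
    for tokens in tagged_phrases:
        if tokens and tokens[-1][1] in NOUN_TAGS:
            entities.append(" ".join(word for word, _ in tokens))
    return entities


# Hypothetical tagged candidate phrases.
tagged = [
    [("machine", "NN"), ("learning", "NN")],
    [("quickly", "RB"), ("growing", "VBG")],
    [("United", "NNP"), ("States", "NNPS")],
]

entities = select_noun_phrases(tagged)
print(entities)
```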
Files in the folder
- /data/, the input folder of SetExpan;
- /result/, the output folder of SetExpan;
- /src/corpusProcessing/, the first step of data preprocessing, which converts raw text to sentences.json;
- /src/dataProcessing/, the second step of data preprocessing, which generates all SetExpan input files from sentences.json;
- /src/tools/, tools used in the data processing
- /src/SetExpan/, the Python implementation of the SetExpan algorithms
- /src/SetExpan/set_expan_main.py: the main entry point of SetExpan, which loads the data, forms the queries, and runs the algorithm.
- /src/SetExpan/set_expan.py: the main implementation of SetExpan. You can change model hyper-parameters in this file.
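As a convenience, the layout listed above can be checked before running anything. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` and the two constants are hypothetical, and the paths are taken verbatim from the list above.

```python
from pathlib import Path

# Folders and files named in the list above, relative to the repository root.
EXPECTED_DIRS = [
    "data",
    "result",
    "src/corpusProcessing",
    "src/dataProcessing",
    "src/tools",
    "src/SetExpan",
]
EXPECTED_FILES = [
    "src/SetExpan/set_expan_main.py",
    "src/SetExpan/set_expan.py",
]


def missing_paths(repo_root):
    """Return the expected folders and files that do not exist under repo_root."""
    root = Path(repo_root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    # Run from the repository root; an empty list means nothing is missing.
    print(missing_paths("."))
```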
To run SetExpan:

    cd src/SetExpan/
    python3 ./set_expan_main.py

Results are saved in the same folder, in a file named "setexpan_result.txt".
Please cite the following paper if you are using this code. Thanks!
- Jiaming Shen, Zeqiu Wu, Dongming Lei, Jingbo Shang, Xiang Ren, Jiawei Han, "SetExpan: Corpus-Based Set Expansion via Context Feature Selection and Rank Ensemble", accepted into The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2017)