SetExpan: Corpus-based Set Expansion Framework

Update (2018-09-19)

  1. Added the apr dataset to the Google Drive link below and moved the wiki queries and ground-truth sets into the dataset.

Update (2018-09-07)

  1. We added the original EgoSet dataset under the "./data/" folder for reference.
  2. A new (but slightly different) version of SetExpan (used in HiExpan) is available at:, together with an easier-to-use data preprocessing pipeline.


This is the source code for the SetExpan framework, developed for corpus-based set expansion (i.e., finding the "complete" set of entities belonging to the same semantic class, based on a given corpus and a tiny set of seeds).
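To make the task concrete, here is a toy sketch of what "set expansion" takes as input and returns. It is not the SetExpan algorithm: the function name `expand_seed_set`, the single-token lowercase entities, and the scoring by shared co-occurring words are all assumptions made only for illustration.

```python
from collections import Counter, defaultdict


def expand_seed_set(sentences, entities, seeds, top_k=3):
    """Toy corpus-based set expansion (illustration only, NOT SetExpan).

    Assumes every entity is a single lowercase token. Each entity is
    described by the words that co-occur with it in a sentence, and
    candidates are ranked by how much of that context they share with
    the seed entities.
    """
    # Collect the co-occurring words (the "context") of every entity.
    contexts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for entity in entities:
            if entity in tokens:
                contexts[entity].update(t for t in tokens if t != entity)

    # Pool the contexts of all seeds into one profile.
    seed_context = Counter()
    for seed in seeds:
        seed_context.update(contexts[seed])

    # Score each non-seed entity by its context overlap with the seeds.
    scores = {}
    for entity in entities:
        if entity in seeds:
            continue
        scores[entity] = sum(
            min(count, seed_context[word])
            for word, count in contexts[entity].items()
        )

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda e: (-scores[e], e))
    return ranked[:top_k]
```

With a few sentences such as "texas is a state in the us" and the seeds "texas" and "ohio", another state that appears in the same kind of sentence ranks above an unrelated entity such as a city.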


We provide the data preprocessing code and the Python implementation of SetExpan. To use our data preprocessing code, download the following two packages and put them in the "/src/tools/" folder:

  • AutoPhrase: used to extract quality phrases from raw input data.
  • Stanford CoreNLP 3.8.0: used to do POS tagging and to select quality noun phrases from the phrase list generated by AutoPhrase. Each quality noun phrase is treated as an "entity".

Otherwise, you can download our preprocessed data directly from Google Drive, unzip it, and put the dataset under the "./data/" folder.
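If you prefer to script the unzip step, a minimal sketch using only the Python standard library could look like the following. The helper name `unpack_dataset` and the archive path you pass to it are assumptions; the actual name of the downloaded archive is not specified here.

```python
import zipfile
from pathlib import Path


def unpack_dataset(archive_path, data_dir="./data"):
    """Extract a downloaded dataset archive into the data folder.

    `archive_path` is wherever you saved the zip file from Google Drive
    (hypothetical; use your own path). Returns the names of the entries
    now present directly under `data_dir`.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(data_dir)
    return sorted(p.name for p in data_dir.iterdir())
```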

Files in the folder

  • /data/, the input folder of SetExpan;
  • /result/, the output folder of SetExpan;
  • /src/corpusProcessing/, the first step of data preprocessing: converts raw text to sentences.json;
  • /src/dataProcessing/, the second step of data preprocessing: generates all SetExpan input files from sentences.json;
  • /src/tools/, tools used in the data processing;
  • /src/SetExpan/, the Python implementation of the SetExpan algorithms:
    • /src/SetExpan/ the main entry point of SetExpan, which loads data, forms queries, and runs the algorithm.
    • /src/SetExpan/ the main implementation of SetExpan. You can change model hyper-parameters in this file.
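Before running anything, it can help to confirm that the folders listed above are in place. The following is a small sketch, assuming it is given the repository root; the helper name `missing_dirs` is hypothetical and not part of the SetExpan code.

```python
from pathlib import Path

# Folder names taken from the list above, relative to the repository root.
EXPECTED_DIRS = [
    "data",
    "result",
    "src/corpusProcessing",
    "src/dataProcessing",
    "src/tools",
    "src/SetExpan",
]


def missing_dirs(repo_root="."):
    """Return the expected folders that do not exist under `repo_root`."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

An empty list means every folder is present; otherwise the returned names show what still needs to be created or downloaded (for example, "data" before the dataset has been unpacked).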

To Run

cd src/SetExpan/ 
python3 ./

Results are saved in the same folder, in a file named "setexpan_result.txt".


Please cite the following paper if you are using this code. Thanks!


The source code for SetExpan framework, published in ECML-PKDD 2017



