
Concept Mining via Embedding

This repo contains the source code and data for the ICDM'18 paper Concept Mining via Embedding.

Introduction

The explosive growth of textual data is making manual analysis prohibitive. According to the prominent STM report, about 3 million journal articles are published every year, at an annual growth rate of 5%. Coping with this growth requires automated understanding of text at scale. A first step towards this goal is the effective discovery of concepts, i.e., identifying integral units such as “generative adversarial network” and “support vector machine” in raw text. Concept discovery plays an essential role in transforming unstructured text into structured information, and supports downstream analytical tasks such as information extraction, organization, recommendation and search.

Concept Mining via Embedding (ECON) is a novel technique we propose that mines concepts based not only on their occurrence frequency statistics, but also on their occurrence contexts. It works by first learning embedding vector representations that summarize the context information of each candidate, and then using these embeddings to evaluate each concept's global quality and its fitness to each local context.
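
To make the scoring idea concrete, here is a minimal sketch, assuming embeddings have already been learned. The names (score_candidate, corpus_centroid) and the particular choices of mean-pooling and a centroid-based global quality are illustrative stand-ins, not the actual scoring functions from the paper.

import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def score_candidate(cand_vec, context_vecs, corpus_centroid):
    # Local fitness: how well the candidate's embedding matches the
    # (mean-pooled) embedding of its surrounding context words.
    local_fitness = cosine(cand_vec, context_vecs.mean(axis=0))
    # Global quality stand-in: similarity to a corpus-level centroid.
    global_quality = cosine(cand_vec, corpus_centroid)
    return global_quality * local_fitness

# Toy usage with random vectors standing in for learned embeddings.
rng = np.random.default_rng(0)
print(score_candidate(rng.normal(size=100),
                      rng.normal(size=(5, 100)),
                      rng.normal(size=100)))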

Requirements

We use Ubuntu as an example.

  • python 3.6
$ sudo apt-get install python3.6
  • other python packages
$ pip install -r ./requirements.txt
  • AutoPhrase, used for frequency-based concept candidate generation. Change the path variable AUTOPHRASE_PATH inside run_econ.sh if you use a custom installation directory.
$ mkdir -p $HOME/bin
$ cd $HOME/bin
$ git clone https://github.com/shangjingbo1226/AutoPhrase
$ cd AutoPhrase
$ make
$ # Please refer to AutoPhrase documentation for detailed installation and usage.
  • DBpedia Spotlight (via spotlight-docker), used for entity-linking-based concept candidate generation.
$ mkdir -p $HOME/bin
$ cd $HOME/bin
$ git clone https://github.com/dbpedia-spotlight/spotlight-docker.git
$ cd spotlight-docker
$ cd v1.0/english
$ docker build -f Dockerfile -t english_spotlight .
$ docker run -i -p 2222:80 english_spotlight spotlight.sh
$ # Please refer to spotlight-docker documentation for detailed installation and usage.
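
Once the container is running, you can sanity-check the endpoint from Python (assuming the port mapping above and Spotlight's standard /rest/annotate API; the example text and confidence threshold are arbitrary):

import requests

resp = requests.get(
    "http://localhost:2222/rest/annotate",
    params={"text": "Support vector machines are supervised learning models.",
            "confidence": 0.35},
    headers={"Accept": "application/json"},
)
resp.raise_for_status()
# Each entry in "Resources" is one linked DBpedia entity.
print(resp.json().get("Resources", []))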

How do I run it?

The concept mining pipeline is wrapped into a single script, run_econ.sh. The only input you need to supply is the raw text, by changing the variable

  • TEXT: the input text file

Then run the pipeline as

$ bash ./run_econ.sh

Input Format

  • The input file specified by TEXT follows the one-document-per-line format.
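
For example, an input file containing two documents would have two lines (contents illustrative):

We train a support vector machine on labeled data.
Generative adversarial networks can synthesize realistic images.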

Output Format

  • The concept scoring list file obtained from the Jupyter notebook notebooks/classifier.ipynb follows a one concept-score pair per line format.
  • The recognized concepts file produced by the last step of the pipeline via econ/recognition_fast.py follows the original input file format, with concept occurrences highlighted using the <c>...</c> notation (see the parsing sketch after this list).
  • Beyond these, there is also the original concept candidate file generated via candidate_generation/merge_span.py, which follows a JSON-per-line format, where each JSON object specifies the list of overlapping superspans (see the paper for details) produced by the candidate generation approach.
  • There are also concept recognition and annotation files generated by each specific candidate source, including NLTK, spaCy, TextRank, RAKE, AutoPhrase and DBpedia.
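
Given the <c>...</c> notation above, a short hypothetical helper (not part of the repo) can pull the recognized concepts out of an annotated line:

import re

# Example annotated line in the recognized-concepts output format.
line = "We train a <c>support vector machine</c> on labeled data."
print(re.findall(r"<c>(.*?)</c>", line))  # ['support vector machine']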

Reference

If you plan to use these scripts in your own work, please consider citing the following paper:

@inproceedings{li2018concept,
  title={Concept Mining via Embedding},
  author={Li, Keqian and Zha, Hanwen and Su, Yu and Yan, Xifeng},
  booktitle={2018 IEEE International Conference on Data Mining (ICDM)},
  pages={267--276},
  year={2018},
  organization={IEEE}
}
