
Concept Mining via Embedding

This repo contains the source code and data for the ICDM'18 paper Concept Mining via Embedding.

Introduction

The explosive growth of textual data is making manual analysis prohibitive. According to the prominent STM report, about 3 million journal articles are published every year, at an annual growth rate of 5%. Coping with this growth requires automated understanding of text at scale. A first step towards this goal is the effective discovery of concepts, i.e., identifying integral units such as “generative adversarial network” and “support vector machine” in raw text. Concept discovery plays an essential role in transforming unstructured text into structured information, and supports downstream analytical tasks such as information extraction, organization, recommendation and search.

Concept Mining via Embedding (ECON) is a novel technique we propose that mines concepts based not only on their occurrence frequency statistics, but also on their occurrence contexts. It works by first learning embedding vector representations that summarize the context information of each candidate, and then using these embeddings to evaluate each concept's global quality and its fitness to each local context.
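
To make the scoring idea concrete, here is a minimal sketch, assuming embeddings have already been learned. The names (score_candidate, corpus_centroid) and the particular choices of mean-pooling and a centroid-based global quality are illustrative stand-ins, not the actual scoring functions from the paper.

import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def score_candidate(cand_vec, context_vecs, corpus_centroid):
    # Local fitness: how well the candidate's embedding matches the
    # (mean-pooled) embedding of its surrounding context words.
    local_fitness = cosine(cand_vec, context_vecs.mean(axis=0))
    # Global quality stand-in: similarity to a corpus-level centroid.
    global_quality = cosine(cand_vec, corpus_centroid)
    return global_quality * local_fitness

# Toy usage with random vectors standing in for learned embeddings.
rng = np.random.default_rng(0)
print(score_candidate(rng.normal(size=100),
                      rng.normal(size=(5, 100)),
                      rng.normal(size=100)))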

Requirements

We use Ubuntu as an example.

  • python 3.6
$ sudo apt-get install python3.6
  • other python packages
$ pip install -r ./requirements.txt
  • AutoPhrase, used for frequency-based concept candidate generation. Change the path variable AUTOPHRASE_PATH inside run_econ.sh if you use a custom installation directory.
$ mkdir -p $HOME/bin
$ cd $HOME/bin
$ git clone https://github.com/shangjingbo1226/AutoPhrase
$ cd AutoPhrase
$ make
$ # Please refer to AutoPhrase documentation for detailed installation and usage.
  • DBpedia Spotlight (via spotlight-docker), used for entity-linking-based concept candidate generation.
$ mkdir -p $HOME/bin
$ cd $HOME/bin
$ git clone https://github.com/dbpedia-spotlight/spotlight-docker.git
$ cd spotlight-docker
$ cd v1.0/english
$ docker build -f Dockerfile -t english_spotlight .
$ docker run -i -p 2222:80 english_spotlight spotlight.sh
$ # Please refer to spotlight-docker documentation for detailed installation and usage.
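
Once the container is running, you can sanity-check the endpoint from Python (assuming the port mapping above and Spotlight's standard /rest/annotate API; the example text and confidence threshold are arbitrary):

import requests

resp = requests.get(
    "http://localhost:2222/rest/annotate",
    params={"text": "Support vector machines are supervised learning models.",
            "confidence": 0.35},
    headers={"Accept": "application/json"},
)
resp.raise_for_status()
# Each entry in "Resources" is one linked DBpedia entity.
print(resp.json().get("Resources", []))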

How do I run it?

The concept mining pipeline is wrapped into a single script, run_econ.sh. The only input you need to supply is the raw text, by changing the variable

  • TEXT: the input text file

Then run the pipeline as

$ bash ./run_econ.sh

Input Format

  • The input file specified by TEXT follows the one-document-per-line format.
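
For example, an input file containing two documents would have two lines (contents illustrative):

We train a support vector machine on labeled data.
Generative adversarial networks can synthesize realistic images.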

Output Format

  • The concept scoring list file obtained from the Jupyter notebook notebooks/classifier.ipynb follows a one concept-score pair per line format.
  • The recognized concepts file produced by the last step of the pipeline via econ/recognition_fast.py follows the original input file format, with concept occurrences highlighted using the <c>...</c> notation (see the parsing sketch after this list).
  • Beyond these, there is also the original concept candidate file generated via candidate_generation/merge_span.py, which follows a JSON-per-line format, where each JSON object specifies the list of overlapping superspans (see the paper for details) produced by the candidate generation approach.
  • There are also concept recognition and annotation files generated by each specific candidate source, including NLTK, spaCy, TextRank, RAKE, AutoPhrase and DBpedia.
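
Given the <c>...</c> notation above, a short hypothetical helper (not part of the repo) can pull the recognized concepts out of an annotated line:

import re

# Example annotated line in the recognized-concepts output format.
line = "We train a <c>support vector machine</c> on labeled data."
print(re.findall(r"<c>(.*?)</c>", line))  # ['support vector machine']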

Reference

If you plan to use these scripts in your own work, please consider citing the following paper:

@inproceedings{li2018concept,
  title={Concept Mining via Embedding},
  author={Li, Keqian and Zha, Hanwen and Su, Yu and Yan, Xifeng},
  booktitle={2018 IEEE International Conference on Data Mining (ICDM)},
  pages={267--276},
  year={2018},
  organization={IEEE}
}
