Concept Embedding

Notes

This software require SegPhrase to extract domain keyphrases. It has been included in this repository but users should check the SegPhrase related documentations for proper instruction.

Requirements

We will take Ubuntu for example.

g++ 4.8

$ sudo apt-get install g++-4.8

python 2.7

$ sudo apt-get install python

other python packages

$ sudo pip install -r requirements.txt

Run

$ ./segmentating_embedding.sh  #run the concept extraction, segmentation and embedding pipeline

Parameters

The running parameters are located in segmentating_embedding.sh

DATASET="signal_processing"

DATASET refers to the name you assign to the dataset

DATADIR_oneDocPerFile=`pwd`/data/signal_processin_oneDocPerFile

DATADIR_oneDocPerFile refers to the directory where the input is stored in one-document-per-file format. Set this value to be non-emtpy if your file are in this format.

DATADIR=`pwd`/data/signal_processing

DATADIR refers to the directory the input lies in

RAW_TEXT=$DATADIR/input.txt

RAW_TEXT refers to the input text file

OMP_NUM_THREADS=36

Number of threads.

RETAIN_PHRASES_RATIO=0.5

Ratio of domain keyphrases in generated keyphrases.csv used to parse the corpus by SegPhrase

MIN_PHRASE_SUPPORT=10

Number of occurrences for a valid domain keyphrase in the corpus.

Input Format

The input can be either one document per line or one document per file. Depending on whether you want to set the paramter $DATADIR_oneDocPerFile.

Since Segphrase parser uses square brackets to identify phrases in the segmented text, these brackets should be cleaned from input files to avoid misidentification.

Output Format

There output consists of

segmented_text.txt_phraseAsWord The segmented text, each phrase is represented as words joined by underscore.
segmented_text.txt_phraseAsWord.concept_embedding*.json The concept/phrase embedding, each term follows the structure of concept : list form of concept embedding array.
segmented_text.txt_phraseAsWord.model* The dump of gensim word2vec model.

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
SegPhrase		SegPhrase
output		output
src/tools		src/tools
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
convertSegphraseToPhraseAsWord.py		convertSegphraseToPhraseAsWord.py
domain_keyphrase_extraction.sh		domain_keyphrase_extraction.sh
learnEmbedding.py		learnEmbedding.py
requirements.txt		requirements.txt
segPhrase.sh		segPhrase.sh
segmentating_embedding.sh		segmentating_embedding.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Concept Embedding

Notes

Requirements

Run

Parameters

Input Format

Output Format

About

Releases

Packages

Contributors 3

Languages

keqianli/conceptEmbedding

Folders and files

Latest commit

History

Repository files navigation

Concept Embedding

Notes

Requirements

Run

Parameters

Input Format

Output Format

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages