GitHub - hywangntut/KBE: Integrating Semantic Knowledge into Lexical Embeddings Based on Information Content Measurement

# Knowledge Base Embedding

Referenced Paper
Integrating Semantic Knowledge into Lexical Embeddings Based on Information Content Measurement.
Authors: Hsin-Yang Wang, Wei-Yun Ma.
EACL 2017 short paper.

Implementation
CKIP Team, IIS. AS. 2017.

Overview

This project provides the implementation of the paper "Integrating Semantic Knowledge into Lexical Embeddings Based on Information Content Measurement."

Usage: word2vec_joint

word2vec_joint can produce embeddings trained with Word2Vec, RCM, Threshold, Frequency, or Entropy.

Parameter Description

Example:

./word2vec_joint -corpus ../Data/Corpus/NYT/NYT.train -dimension 300 -word2vec 1 -word2vec-thread 12 -rcm 0 -pretrain 0 -output-model Model/NYT_WORD2VEC_R

| Parameter | Function |
| --- | --- |
| `-corpus <file>` | Use text data from `<file>` to train the model |
| `-dimension <int>` | Set the dimension of the vectors; default is 300 |
| `-min-count <int>` | Discard words that appear fewer than `<int>` times; default is 0 |
| `-negative <int>` | Number of negative examples; default is 15 |
| `-word2vec <int>` | Use Word2Vec to train the vectors |
| `-word2vec-alpha <float>` | Set the starting learning rate; default is 0.025 (for Word2Vec) |
| `-word2vec-epoch <int>` | Set the number of training epochs; default is 1 (for Word2Vec) |
| `-word2vec-subsample <float>` | Set the threshold for occurrence of words |
| `-word2vec-thread <int>` | Use `<int>` threads (for Word2Vec) |
| `-word2vec-window <int>` | Set the max skip length between words; default is 5 |
| `-rcm <int>` | Use RCM with a resource to train the vectors |
| `-rcm-epoch <int>` | Set the number of training epochs; default is 100 (for RCM) |
| `-rcm-lambda <float>` | Set the starting learning rate; default is 0.01 (for RCM) |
| `-rcm-resource <file>` | Use relation pairs from `<file>` to train the model (for RCM) |
| `-rcm-ic <file>` | Use information content (IC) from `<file>` to define the entropy of words (Entropy method) |
| `-rcm-thread <int>` | Use `<int>` threads (for RCM) |
| `-rcm-method <int>` | Set 0 to train Baseline, 1 for Threshold, 2 for Frequency, 3 for Entropy |
| `-rcm-threshold <int>` | Only revise words whose frequency is below `<int>` (Threshold method) |
| `-pretrain <int>` | Use a pretrained model |
| `-pretrain-model <file>` | Use `<file>` as the pretrained model |
| `-output-model <file>` | Use `<file>` to save the resulting vectors |
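
As a rough illustration of how the RCM flags above might be combined, the following shell sketch assembles a command line for a chosen `-rcm-method` value and prints it instead of running it. It rests on several assumptions that this README does not confirm: that `1` enables and `0` disables `-word2vec`, `-rcm`, and `-pretrain` (as the example above suggests), and that Word2Vec and RCM can be enabled in the same run. The resource path, IC file path, threshold value of 100, and output model names are hypothetical placeholders, and `build_cmd` is an illustrative helper, not part of this project.

```shell
#!/bin/sh
# Dry run: build a word2vec_joint command line and print it (nothing is executed).

corpus="../Data/Corpus/NYT/NYT.train"   # corpus path taken from the README example
resource="path/to/relation_pairs.txt"   # hypothetical relation-pair file
ic_file="path/to/NYT.entropy"           # hypothetical IC file
threshold=100                           # hypothetical frequency threshold

# build_cmd METHOD
#   METHOD follows -rcm-method: 0 Baseline, 1 Threshold, 2 Frequency, 3 Entropy.
build_cmd() {
  method="$1"
  # Flags shared by every method (assumes 1 = enabled, 0 = disabled).
  set -- ./word2vec_joint -corpus "$corpus" -dimension 300 \
    -word2vec 1 -word2vec-thread 12 \
    -rcm 1 -rcm-method "$method" -rcm-resource "$resource"
  # Add the flag that the table ties to a specific method.
  case "$method" in
    1) set -- "$@" -rcm-threshold "$threshold" ;;
    3) set -- "$@" -rcm-ic "$ic_file" ;;
  esac
  set -- "$@" -pretrain 0 -output-model "Model/NYT_JOINT_M$method"
  printf '%s\n' "$*"
}

# Print the command for the Entropy method.
build_cmd 3
```

Printing the command first makes it easy to check which method-specific flag is attached before starting a long training run; once the paths are replaced with real files, the printed line can be copied into a terminal.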

To reproduce the work, follow the bash script "Bash_Train.sh" included in this repository.

For questions, comments, or bug reports, please contact wang@iis.sinica.edu.tw

Folders and files

Bash_Train.sh
CorpusFile.cpp
CorpusFile.h
DoTrain.cpp
FileReader.cpp
FileReader.h
ICFile.cpp
ICFile.h
MethodInfo.cpp
MethodInfo.h
NYT.entropy.zip
NeuralNetwork.cpp
NeuralNetwork.h
Parameter.cpp
Parameter.h
RCM.cpp
RCM.h
README.md
ResourceFile.cpp
ResourceFile.h
VocabTable.cpp
VocabTable.h
Word2Vec.cpp
Word2Vec.h
makefile
