This is a fork of gensim that modifies the default doc2vec model to support pretrained word2vec embeddings when training doc2vec. It is forked from gensim 3.8.
The default doc2vec model in gensim does not support pretrained word2vec models. However, according to Jey Han Lau and Timothy Baldwin's paper, "An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation" (2016), initializing doc2vec with a pretrained word2vec model usually yields better results on NLP tasks. The authors also released a fork of gensim that supports pretrained embeddings, but it is based on a very old gensim version whose changes cannot be applied to gensim 3.8 (the latest gensim version at the time this fork was released).
- 1.Supports pretrained word2vec embeddings when training doc2vec.
- 2.Supports Python 3.
- 3.Based on gensim 3.8.
- 4.The pretrained word2vec model must be in the C text format (see the export sketch after this list).
- 5.The pretrained word2vec vectors and the doc2vec model to be trained must have the same dimensionality.
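
Since the fork expects the pretrained embeddings in the C text format, here is a minimal sketch of how an existing model could be exported with gensim's own KeyedVectors API; the input file name is only an assumption for illustration:

```python
from gensim.models import KeyedVectors

# Assumed input: a word2vec model in the binary C format (hypothetical file name).
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Re-save it in the C *text* format that this fork expects for pretrained_emb.
kv.save_word2vec_format("word2vec_pretrained.txt", binary=False)
```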
- Clone this fork to your machine:

```bash
git clone https://github.com/maohbao/gensim.git
```

- Install it from the repository root:

```bash
cd gensim
python setup.py install
```
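
A quick sanity check that the fork was installed (a sketch; the exact version string may differ):

```python
import gensim
print(gensim.__version__)  # expected to report a 3.8.x version from this fork
```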
```python
import gensim

# Pretrained word2vec model in the C text format.
pretrained_emb = "word2vec_pretrained.txt"

model = gensim.models.doc2vec.Doc2Vec(
    corpus_train,           # training corpus in gensim's TaggedDocument format
    vector_size=300,
    min_count=1,
    epochs=20,
    dm=0,                   # dm=0 selects the PV-DBOW architecture
    pretrained_emb=pretrained_emb)
```
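
For context, a hedged sketch of how `corpus_train` might be prepared and how the trained model can be queried; `raw_docs` is a hypothetical toy corpus, and `infer_vector` is standard gensim doc2vec API:

```python
from gensim.models.doc2vec import TaggedDocument

# Hypothetical toy corpus; each document becomes a TaggedDocument
# with a unique tag, which is the format Doc2Vec expects.
raw_docs = ["the first toy document", "another short toy document"]
corpus_train = [TaggedDocument(words=doc.split(), tags=[i])
                for i, doc in enumerate(raw_docs)]

# After training, new documents can be embedded with infer_vector.
vec = model.infer_vector("an unseen document".split())
```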