## Doc2Vec

To run this notebook, install `gensim`:
```
$ pip install gensim
```

[Tutorial](http://rare-technologies.com/doc2vec-tutorial/) <br/>
[Documentation](https://radimrehurek.com/gensim/models/doc2vec.html)

In [1]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument, LabeledSentence
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import datetime
import pandas as pd
import string

Text-cleaning helpers

In [2]:
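stemmer = PorterStemmer()
# requires the NLTK stop-word corpus: nltk.download('stopwords')
stop = set(stopwords.words('english'))  # set for fast membership tests
exclude = set(string.punctuation)

def clean_text(text):
    """Return the stemmed tokens of `text` without punctuation, non-ASCII characters or stop words."""
    # remove non-ASCII characters and punctuation
    text = ''.join(char for char in text
                   if ord(char) < 128 and char not in exclude)
    # remove stop words, then stem the remaining tokens
    return [stemmer.stem(word) for word in text.split() if word not in stop]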
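
The two filtering steps in `clean_text` can be tried without NLTK. The sketch below is only an illustration: `SAMPLE_STOP` is a hypothetical, much shorter stand-in for the NLTK stop-word list, the helper names are made up, and stemming is left out.

```python
import string

EXCLUDE = set(string.punctuation)
# Hypothetical stand-in for NLTK's English stop-word list (assumption, far shorter).
SAMPLE_STOP = {'here', 'is', 'some', 'about'}

def strip_chars(text):
    """Drop non-ASCII characters and ASCII punctuation, as clean_text does."""
    return ''.join(ch for ch in text if ord(ch) < 128 and ch not in EXCLUDE)

def drop_stop_words(text, lowercase=False):
    """Split on whitespace and drop stop words; stemming is omitted in this sketch."""
    if lowercase:
        text = text.lower()
    return [word for word in text.split() if word not in SAMPLE_STOP]

cleaned = strip_chars('Here is some café-text, about cats!')
tokens_as_is = drop_stop_words(cleaned)                    # 'Here' survives
tokens_lowered = drop_stop_words(cleaned, lowercase=True)  # 'here' is removed
```

Two things are worth noticing. Characters are deleted rather than replaced by spaces, so `café-text` collapses into the single token `caftext`. The stop-word check is also case-sensitive, so capitalised stop words pass through unless the text is lowercased first; the sample data in this notebook is already lowercase, so it is not affected.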

Get data

In [3]:
df = pd.DataFrame(
    [
        [1, 'here is some sentence cats like'],
        [2, 'here is some other text dogs prefer'],
        [3, 'text about cat'],
        [4, 'sentence about dog'],
    ],
    columns=['id', 'desc_init'],
)
df.head()

| | id | desc_init |
|---|---|---|
| 0 | 1 | here is some sentence cats like |
| 1 | 2 | here is some other text dogs prefer |
| 2 | 3 | text about cat |
| 3 | 4 | sentence about dog |


Prepare data

In [4]:
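df['stem'] = df['desc_init'].map(clean_text)
# tag each document with its id (as a string) so it can be looked up after training;
# TaggedDocument is the current name of the deprecated LabeledSentence
df['sentence'] = df.apply(
    lambda row: TaggedDocument(words=row['stem'], tags=[str(row['id'])]), axis=1)
df.head()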

| | id | desc_init | stem | sentence |
|---|---|---|---|---|
| 0 | 1 | here is some sentence cats like | [sentenc, cat, like] | ([sentenc, cat, like], [1]) |
| 1 | 2 | here is some other text dogs prefer | [text, dog, prefer] | ([text, dog, prefer], [2]) |
| 2 | 3 | text about cat | [text, cat] | ([text, cat], [3]) |
| 3 | 4 | sentence about dog | [sentenc, dog] | ([sentenc, dog], [4]) |
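
Each cell in the `sentence` column prints as a pair of lists because gensim's `TaggedDocument` (of which `LabeledSentence` is the older name) is a named tuple with the fields `words` and `tags`. A rough stand-in built with the standard library shows the layout; `TaggedDoc` is a hypothetical local name, not a gensim class.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the (words, tags) layout of gensim's TaggedDocument.
TaggedDoc = namedtuple('TaggedDoc', ['words', 'tags'])

# (id, stemmed tokens) pairs copied from the table above
rows = [
    (1, ['sentenc', 'cat', 'like']),
    (2, ['text', 'dog', 'prefer']),
    (3, ['text', 'cat']),
    (4, ['sentenc', 'dog']),
]

# Tags are stored as strings so a document can later be looked up by the same key.
tagged = [TaggedDoc(words=tokens, tags=[str(doc_id)]) for doc_id, tokens in rows]

first = tagged[0]
```

Here `first.words` holds the token list and `first.tags` the list `['1']`, which matches the first row of the `sentence` column.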


Create model

In [5]:
# raise min_count for larger corpora; gensim 4 renamed `size` to `vector_size`
model = Doc2Vec(dm=1, dm_concat=1, size=100, window=5,
                negative=5, hs=0, min_count=1, workers=2)

Build vocabulary

In [6]:
model.build_vocab(df['sentence'])

Train model

In [7]:
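# Manual schedule: the learning rate is lowered by hand after every call to train().
# Newer gensim releases require explicit `total_examples` and `epochs` arguments here.
for epoch in range(10):
    print('{}: epoch {}'.format(datetime.datetime.now(), epoch))
    model.train(df['sentence'])
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay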

2016-07-30 19:41:10.525450: epoch 0
2016-07-30 19:41:10.529073: epoch 1
2016-07-30 19:41:10.531195: epoch 2
2016-07-30 19:41:10.533322: epoch 3
2016-07-30 19:41:10.535402: epoch 4
2016-07-30 19:41:10.537760: epoch 5
2016-07-30 19:41:10.539945: epoch 6
2016-07-30 19:41:10.541987: epoch 7
2016-07-30 19:41:10.544038: epoch 8
2016-07-30 19:41:10.546137: epoch 9
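
The learning rates produced by this loop can be listed without training anything. The sketch below assumes the model starts from gensim's default `alpha` of 0.025, which holds here only because the constructor call does not override it.

```python
START_ALPHA = 0.025  # gensim's default initial learning rate (assumed unchanged)
STEP = 0.002         # amount subtracted after each pass in the loop
EPOCHS = 10

# Learning rate in effect during each of the ten passes
schedule = [round(START_ALPHA - STEP * epoch, 3) for epoch in range(EPOCHS)]

# Value left in model.alpha once the loop has finished
final_alpha = round(START_ALPHA - STEP * EPOCHS, 3)
```

Under that assumption the rate falls linearly from 0.025 in the first pass to 0.007 in the last, and stays positive throughout. A larger step or more passes would eventually push it to zero or below, so the two numbers need to be chosen together.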


Save / load model

In [8]:
# model.save('doc2vec_model')
# model = Doc2Vec.load('doc2vec_model')

Find similar documents

In [9]:
# two most similar documents to document id 1
# (gensim 4 exposes the document vectors as `model.dv` instead of `model.docvecs`)
print(model.docvecs.most_similar('1', topn=2))

[('3', 0.1622544527053833), ('4', -0.15972083806991577)]


In [10]:
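# the most similar document to a new sentence; tokens should be preprocessed like the
# training data (e.g. with clean_text), since the vocabulary holds stems such as 'sentenc'
vec = model.infer_vector(['cat', 'sentence'])
print(model.docvecs.most_similar([vec], topn=1))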

[('1', 0.18620765209197998)]
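
The scores returned by `most_similar` are cosine similarities, which range from -1 to 1; that is why a negative value can appear in the results above. The ranking can be sketched in plain Python. The vectors below are made-up two-dimensional toys, not output of the trained model, and the helper functions are illustrative rather than gensim's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, doc_vectors, topn=1):
    """Rank tagged vectors by cosine similarity to `query`, highest first."""
    scored = [(tag, cosine_similarity(query, vec)) for tag, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy document vectors keyed by tag
toy_vectors = {'1': [1.0, 0.0], '2': [0.0, 1.0], '3': [1.0, 1.0]}

result = most_similar([2.0, 0.1], toy_vectors, topn=2)
ranked_tags = [tag for tag, _ in result]
```

With these toy values the query points almost along the first axis, so tag `'1'` ranks first and `'3'` second. On a four-document corpus like the one in this notebook the learned vectors are close to random, so the similarities are small and can change between runs.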
