This notebook trains document embeddings on our Reddit text dataset. I follow the approach presented in this [article](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html#sphx-glr-auto-examples-tutorials-run-doc2vec-lee-py) from gensim. The steps are as follows:

(1) Read in subreddit data  
(2) Create model objects  
(3) Search for optimal dimensions over range 2-100  
(4) Search for optimal dimensions over range 2-10  
(5) Save models  

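The subreddit data goes through all five steps. The post data reuses the dimension chosen for the subreddits, so only the read, train, and save steps are repeated there.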

In [1]:
import os
import glob
import json
import gensim
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

from sklearn.cluster import KMeans
from sklearn import metrics

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

from nltk.tokenize import word_tokenize

## Training on Subreddits

**Read in subs**

Start by reading in the subreddits we will train the document embeddings on. Note that I leave out the reading category, since its subreddits have very few comments.

In [2]:
cats = ['art', 'gaming', 'music', 'politics_news', 'science', 'sports', 'tech'] # categories
stem = 'lemma' # stemming type to use

In [3]:
documents = []

for cat in cats:
    # build the path instead of calling os.chdir, which would break the relative paths on later iterations
    cat_dir = os.path.join('..', 'Data', cat, 'Processed', stem)
    for path in glob.glob(os.path.join(cat_dir, '*.json')): # grab every .json file for this category
        with open(path, 'r') as f: # read the file and extract its comments
            comments = json.load(f)
        sub = os.path.basename(path).split('.json')[0] # subreddit name taken from the file name
        documents.append((' '.join(comment['comment'] for comment in comments), cat, cat, sub)) # extra cat for label encoding
        
documents = np.array(documents)
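
To make the expected input concrete, here is a small sketch that builds a throwaway directory with the same `Data/<category>/Processed/<stem>/` layout and reads it back into one document per file. The subreddit names and comment text are made up, and the sketch assumes each file holds a list of objects with a `comment` key, which is what the loop above reads.

```python
import glob
import json
import os
import tempfile

cats = ['art', 'music']  # two of the categories, to keep the example short
stem = 'lemma'

with tempfile.TemporaryDirectory() as root:
    # Build Data/<cat>/Processed/<stem>/<subreddit>.json with made-up comments.
    fake = {'art': ('painting', ['nice brush work', 'love the color']),
            'music': ('jazz', ['great solo'])}
    for cat, (sub, texts) in fake.items():
        folder = os.path.join(root, 'Data', cat, 'Processed', stem)
        os.makedirs(folder)
        with open(os.path.join(folder, f'{sub}.json'), 'w') as f:
            json.dump([{'comment': t} for t in texts], f)

    # Read the files back: one tuple per file, all comments joined into one string.
    documents = []
    for cat in cats:
        pattern = os.path.join(root, 'Data', cat, 'Processed', stem, '*.json')
        for path in sorted(glob.glob(pattern)):
            with open(path, 'r') as f:
                comments = json.load(f)
            text = ' '.join(c['comment'] for c in comments)
            sub = os.path.basename(path).split('.json')[0]
            documents.append((text, cat, cat, sub))

print(documents)
```

Each tuple is (document text, category, category again for later encoding, subreddit name), so every subreddit becomes a single long document.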

In [4]:
encoder = LabelEncoder() # encode categories using LabelEncoder
documents[:, 2] = encoder.fit_transform(documents[:, 2])
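
One detail worth seeing on toy data: `documents` is a NumPy array of strings, so the integer codes written into column 2 are stored as strings and need a cast back to `int` before they are used as labels. The sketch below uses `np.unique` as a stand-in for `LabelEncoder` (an assumption made to keep the example NumPy-only; both assign codes in sorted class order). The rows are made up.

```python
import numpy as np

# Made-up rows in the same (text, category, category, subreddit) layout.
documents = np.array([
    ('some text', 'music', 'music', 'jazz'),
    ('more text', 'art', 'art', 'painting'),
    ('other text', 'music', 'music', 'metal'),
])

# np.unique returns the sorted classes plus each row's index into them,
# standing in here for LabelEncoder.fit_transform.
classes, codes = np.unique(documents[:, 2], return_inverse=True)

documents[:, 2] = codes               # the array holds strings, so codes become '1', '0', '1'
labels = documents[:, 2].astype(int)  # cast back to integers before scoring clusters

print(classes, documents[:, 2], labels)
```

This is why the evaluation loops later rebuild the labels with `dtype=int` instead of using the column directly.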

**Create model objects**

The Doc2Vec model requires a corpus (a list of tokenized, tagged documents) to build a vocabulary and train embeddings. We'll lowercase each document in our list, tokenize it with NLTK's [word_tokenize](https://www.nltk.org/api/nltk.tokenize.html) function, and tag it with its index in the list.

In [6]:
docs = list(documents[:, 0])
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(docs)]

**Search for optimal dimension using k-means measures**

My knowledge of ideal embedding dimensions is very limited, so I'll treat the dimension as a hyperparameter to tune. My assumption is that the ideal dimension is the one that best balances silhouette score (how well separated the clusters are) against agreement with the true categories (as measured by homogeneity, completeness, and v-measure). All measures are computed on a k-means model with n_clusters set to the number of categories, fit on the standardized document vectors. This is probably a crude way to go about it, but I feel it's a reasonable enough start.
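
For reference, here is a rough sketch of how the three agreement measures are defined, written from the standard entropy-based formulas instead of calling scikit-learn. The function names are my own, and this is an illustration of the definitions, not the library code: homogeneity is 1 - H(C|K)/H(C), completeness is 1 - H(K|C)/H(K), and v-measure is their harmonic mean, where C are the true categories and K the cluster assignments.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given): uncertainty left in target once given is known."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (_, g), c in joint.items())


def homogeneity_completeness_v(true_labels, cluster_labels):
    h_c = entropy(true_labels)
    h_k = entropy(cluster_labels)
    # Convention: a score is 1.0 when the entropy in its denominator is zero.
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(true_labels, cluster_labels) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(cluster_labels, true_labels) / h_k
    total = homogeneity + completeness
    v_measure = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v_measure


# Every cluster is pure, but each category is split across two clusters.
print(homogeneity_completeness_v([0, 0, 1, 1], [0, 1, 2, 3]))
```

In the example call each cluster contains a single category (homogeneity 1.0), but every category is spread over two clusters (completeness 0.5), which is the kind of trade-off the v-measure summarizes.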

We'll start by looking at embedding dimensions from 2 to 97 in steps of 5:

In [8]:
for vec_size in range(2, 100, 5):
    model = gensim.models.doc2vec.Doc2Vec(dm=0, vector_size=vec_size, min_count=2, epochs=100)
    model.build_vocab(tagged_data)
    model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)
    
    # read document vectors from model.dv (gensim 4+; model.docvecs on 3.x): model[tag] may return a word vector when a token equals the tag
    doc_vectors = np.array([model.dv[str(i)] for i in range(len(tagged_data))])
    scaler = StandardScaler()
    doc_scaled = scaler.fit_transform(doc_vectors)
    
    km = KMeans(n_clusters=len(cats), init='k-means++')
    km.fit(doc_scaled)
    print(f'Vector size: {vec_size}')
    print(f'Silhouette score: {metrics.silhouette_score(doc_scaled, labels=km.labels_.reshape(-1))}')
    labels = np.array(documents[:, 2], dtype=int)
    print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
    print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
    print("V-measure: %0.3f\n" % metrics.v_measure_score(labels, km.labels_))

Vector size: 2
Silhouette score: 0.5192214250564575
Homogeneity: 0.503
Completeness: 0.508
V-measure: 0.505

Vector size: 7
Silhouette score: 0.22882992029190063
Homogeneity: 0.595
Completeness: 0.609
V-measure: 0.602

Vector size: 12
Silhouette score: 0.11338256299495697
Homogeneity: 0.679
Completeness: 0.684
V-measure: 0.681

Vector size: 17
Silhouette score: 0.038808673620224
Homogeneity: 0.493
Completeness: 0.502
V-measure: 0.497

Vector size: 22
Silhouette score: 0.02382996305823326
Homogeneity: 0.402
Completeness: 0.408
V-measure: 0.405

Vector size: 27
Silhouette score: 0.0020397771149873734
Homogeneity: 0.426
Completeness: 0.474
V-measure: 0.449

Vector size: 32
Silhouette score: 0.007579568773508072
Homogeneity: 0.607
Completeness: 0.650
V-measure: 0.628

Vector size: 37
Silhouette score: 0.011560635641217232
Homogeneity: 0.539
Completeness: 0.564
V-measure: 0.551

Vector size: 42
Silhouette score: 0.01644926704466343
Homogeneity: 0.736
Completeness: 0.765
V-measure: 0.750

(remaining output truncated)

It looks like dimensions in the range 2-10 give the best balance between silhouette score and accuracy. Notice how the silhouette score drops sharply as the number of dimensions increases, from 0.52 at 2 dimensions to below 0.04 at 17 dimensions and beyond.
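
As a reminder of what the silhouette score measures, here is a minimal from-scratch sketch (plain Python, Euclidean distance, my own function name, not scikit-learn's implementation). For each point, `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the nearest other cluster; the point's score is (b - a) / max(a, b), and the overall score is the average. The four points are made up, and the sketch assumes at least two clusters and no duplicate-only data.

```python
from math import dist  # Python 3.8+


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (labels must be a list)."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Distances to the other members of this point's own cluster.
        own = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        if not own:  # a cluster of one point scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        # Mean distance to each other cluster; keep the nearest one.
        b = min(
            sum(dist(p, q) for q, l in zip(points, labels) if l == other) / labels.count(other)
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette(points, [0, 0, 1, 1]))  # tight, well-separated clusters
print(silhouette(points, [0, 1, 0, 1]))  # clusters that cut across the two groups
```

The first labeling scores close to 1 and the second is negative, so values near 0 like the ones above mean the k-means clusters overlap heavily in the scaled embedding space.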

**Refine: search dimensions 2-10**

Now we'll search in the range 2-10 to get a better look:

In [10]:
for vec_size in range(2, 11):
    model = gensim.models.doc2vec.Doc2Vec(dm=0, vector_size=vec_size, min_count=2, epochs=100)
    model.build_vocab(tagged_data)
    model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)
    
    # read document vectors from model.dv (gensim 4+; model.docvecs on 3.x): model[tag] may return a word vector when a token equals the tag
    doc_vectors = np.array([model.dv[str(i)] for i in range(len(tagged_data))])
    scaler = StandardScaler()
    doc_scaled = scaler.fit_transform(doc_vectors)
    
    km = KMeans(n_clusters=len(cats), init='k-means++')
    km.fit(doc_scaled)
    print(f'Vector size: {vec_size}')
    print(f'Silhouette score: {metrics.silhouette_score(doc_scaled, labels=km.labels_.reshape(-1))}')
    labels = np.array(documents[:, 2], dtype=int)
    print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
    print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
    print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
    print(f'Inertia: {km.inertia_}\n')

Vector size: 2
Silhouette score: 0.5725953578948975
Homogeneity: 0.471
Completeness: 0.481
V-measure: 0.476
Inertia: 1.9910777133684405

Vector size: 3
Silhouette score: 0.48049601912498474
Homogeneity: 0.662
Completeness: 0.667
V-measure: 0.665
Inertia: 6.612958480170278

Vector size: 4
Silhouette score: 0.39680349826812744
Homogeneity: 0.777
Completeness: 0.785
V-measure: 0.781
Inertia: 23.123259042838527

Vector size: 5
Silhouette score: 0.30829083919525146
Homogeneity: 0.720
Completeness: 0.731
V-measure: 0.725
Inertia: 43.48196606527199

Vector size: 6
Silhouette score: 0.2835235297679901
Homogeneity: 0.688
Completeness: 0.690
V-measure: 0.689
Inertia: 61.54994779152912

Vector size: 7
Silhouette score: 0.20817075669765472
Homogeneity: 0.705
Completeness: 0.716
V-measure: 0.711
Inertia: 93.0012925107917

Vector size: 8
Silhouette score: 0.1611940711736679
Homogeneity: 0.700
Completeness: 0.709
V-measure: 0.705
Inertia: 121.35718024162634

Vector size: 9
Silhouette score: 0.1222222

(remaining output truncated)

Dimension 4 seems to give the best balance between high accuracy (V-measure 0.781, the highest among the results shown) and good clustering (silhouette score 0.397).
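
One way to make "balance" explicit, sketched here under my own assumptions, is to require a minimum silhouette score and then take the dimension with the highest v-measure among those that pass. The numbers are the rounded scores for dimensions 2-8 printed above; the 0.3 floor is a hypothetical threshold chosen only for illustration.

```python
# (vector_size, silhouette, v_measure), rounded from the printed results for 2-8
results = [
    (2, 0.573, 0.476),
    (3, 0.480, 0.665),
    (4, 0.397, 0.781),
    (5, 0.308, 0.725),
    (6, 0.284, 0.689),
    (7, 0.208, 0.711),
    (8, 0.161, 0.705),
]

MIN_SILHOUETTE = 0.3  # hypothetical floor, not a value from the original analysis

# Keep dimensions whose clusters are separated well enough, then rank by v-measure.
eligible = [r for r in results if r[1] >= MIN_SILHOUETTE]
best = max(eligible, key=lambda r: r[2])

print(f'Selected vector size: {best[0]}')
```

The choice is sensitive to the floor (a floor of 0.45 would select dimension 3), and the scores themselves change between runs because neither Doc2Vec nor k-means is seeded here: dimension 7 has a v-measure of 0.602 in the first search and 0.711 in the second.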

**Select model and save**

Now let's train a model using dimension 4 and save it.

In [11]:
model = gensim.models.doc2vec.Doc2Vec(dm=0, vector_size=4, min_count=2, epochs=100)
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)

In [12]:
os.chdir(r'..\Data')

In [13]:
model.save('subs.model')

## Training on Posts

We'll repeat the same process as above, but we'll assume that dimension 4 is ideal.

**Read in posts:**

In [11]:
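import pandas as pd # needed for the DataFrame below; not imported in the first cell

documents = []

for cat in cats:
    cat_dir = os.path.join('..', 'Data', cat, 'Processed', stem) # build the path instead of calling os.chdir
    for path in glob.glob(os.path.join(cat_dir, '*.json')):
        with open(path, 'r') as f:
            comments = json.load(f)
        df = pd.DataFrame(comments)
        sub = os.path.basename(path).split('.json')[0] # subreddit name taken from the file name
        for post in df['post_id'].unique(): # one document per post: join all of its comments
            documents.append((' '.join(df.loc[df['post_id'] == post, 'comment']), cat, cat, sub))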
        
documents = np.array(documents)
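
The per-post grouping is the only new step here, so a tiny sketch may help. It does the same grouping with a plain dictionary instead of pandas (my substitution, to keep the example dependency-free). The comments are made up; `post_id` and `comment` are the keys the loop above reads.

```python
from collections import defaultdict

# Made-up comments from one subreddit file.
comments = [
    {'post_id': 'a1', 'comment': 'first reply'},
    {'post_id': 'b2', 'comment': 'different thread'},
    {'post_id': 'a1', 'comment': 'second reply'},
]

# Collect comment texts under their post, keeping first-seen post order.
by_post = defaultdict(list)
for c in comments:
    by_post[c['post_id']].append(c['comment'])

# One document per post: all of its comments joined into a single string.
post_docs = [' '.join(texts) for texts in by_post.values()]

print(post_docs)
```

Each post becomes its own document, so the post-level corpus has many more, much shorter documents than the subreddit-level one.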

In [4]:
encoder = LabelEncoder()
documents[:, 2] = encoder.fit_transform(documents[:, 2])

**Train model**

In [5]:
docs = list(documents[:, 0])
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(docs)]

In [7]:
model = gensim.models.doc2vec.Doc2Vec(dm=0, vector_size=4, min_count=2, epochs=100)
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)

In [8]:
os.chdir(r'..\Data')

In [9]:
model.save('posts.model')