<a id="top"></a>
# Predicting Gender Based On Blog Text - Part 3 - Doc2Vec

This notebook compares a few approaches for identifying the gender of blog authors from their writing style. It is based on a dataset containing 681288 blog posts downloaded from <a href="http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm" target="_blank">here</a>. I am using a smaller subset containing 145044 blog posts written by 3074 authors (1716 female, 1358 male) in the 24-25 age group.

This is the 3rd in a series of notebooks.

In this notebook, I compare the two types of [Doc2Vec](#2.-Doc2Vec) models:
- [*distributed memory*](#2.1-Doc2Vec---distributed-memory) (PV-DM) and
- [*distributed bag of words*](#2.3-Doc2Vec---distributed-bag-of-words) (PV-DBOW)

For the impatient, [here](#3.-Final-Comparison) are the results.

*The hyperlinks should help navigate through this notebook*.

In [1]:
%load_ext autoreload
%autoreload 2

import ast
import gc
import gensim
import logging
import multiprocessing
import numpy as np
import os
import pandas as pd
import sklearn
import time
import yaml

from genderpredictutils import dataprep, textpreprocess, trainingutils
from gensim import parsing, utils
from gensim.models.doc2vec import Doc2Vec, LabeledSentence
from importlib import reload
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import RidgeClassifier, SGDClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC, SVC
from tabulate import tabulate

_random_state = 371250
num_cores = multiprocessing.cpu_count()

# Print versions
print("gensim : {}".format(gensim.__version__))
print("numpy : {}".format(np.__version__))
print("pandas : {}".format(pd.__version__))
print("sklearn : {}".format(sklearn.__version__))

reload(logging)
logging.basicConfig(format="%(asctime)s: %(message)s", level=logging.INFO, datefmt="%H:%M:%S")

gensim : 0.12.4
numpy : 1.10.4
pandas : 0.18.1
sklearn : 0.17.1


In [2]:
with open("which_gender.yml", "r") as f:
    cfg = yaml.safe_load(f)

# 1. Reading the dataset

Each author's posts appear as a separate file. The name indicates blogger id#, self-provided gender, age, industry and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)

The `dataprep` module already handles reading the XML files from the .zip file, so I just reuse the pre-created dataset and filter out authors who are not in the 24-25 age group.
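
As a rough illustration of the file-name convention described above, the sketch below splits one name into its fields. The dot-separated layout (`<id>.<gender>.<age>.<industry>.<sign>.xml`) and the example name are assumptions made for illustration, and `parse_blog_filename` is a hypothetical helper, not part of the `dataprep` module.

```python
from collections import namedtuple

BloggerInfo = namedtuple("BloggerInfo", ["blogger_id", "gender", "age", "industry", "sign"])

def parse_blog_filename(file_name):
    """Split an (assumed) '<id>.<gender>.<age>.<industry>.<sign>.xml' name into fields."""
    parts = file_name.split(".")
    if len(parts) != 6 or parts[-1] != "xml":
        raise ValueError("Unexpected file name: {}".format(file_name))
    blogger_id, gender, age, industry, sign = parts[:5]
    return BloggerInfo(int(blogger_id), gender, int(age), industry, sign)

# Made-up example name; only the field order matters here
info = parse_blog_filename("5114.male.25.indUnk.Scorpio.xml")
print(info)
```

Raising on an unexpected name keeps malformed files from silently producing wrong labels.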

In [3]:
data_dir = cfg["common"]["data_dir"]
models_dir = cfg["common"]["models_dir"]
d2v_models_dir = os.path.join(models_dir, "d2v")

### 1.1 Read the `gz` files

In [4]:
file_path_1 = os.path.join(data_dir, "blog_posts_metadata.txt.gz")
file_path_2 = os.path.join(data_dir, "blog_posts.txt.gz")

if not (os.path.exists(file_path_1) and os.path.exists(file_path_2)):
    print("One (or both) of {} or {} does not exist, so creating them".format(file_path_1, file_path_2))
    file_paths = dataprep.prepare_data(data_dir, num_processes=6) # This takes a while
    file_path_1, file_path_2 = file_paths[0], file_paths[1]

df0 = pd.read_csv(file_path_1, usecols=["blogger_id", "gender", "age"], sep="\t", index_col=False)
target_age_grp = df0[df0["age"].isin([24,25])]["blogger_id"].values.tolist()

df_iter = pd.read_csv(file_path_2, sep="\t", index_col=False, iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk["blogger_id"].isin(target_age_grp)] for chunk in df_iter])
df = pd.merge(df0, df)
df = df.dropna(subset=["blog_post"])
df = df.sort_values(by="blogger_id")
df.shape

(145044, 5)

In [5]:
# Get rid of columns we do not need
df = df[["blogger_id", "gender", "date", "blog_post"]]
num_unreachable_objects = gc.collect()
df = df.dropna(subset=["blog_post"])
df.shape

(145044, 4)

In [6]:
df.head(n=3)

Unnamed: 0,blogger_id,gender,date,blog_post
119689,5114,male,2002-11-06,Sign #249 urlLink CNN needs some sense slapp...
119722,5114,male,2004-04-19,The new issue of Mindjack is urlLink now onli...
119723,5114,male,2004-04-12,There's a urlLink new issue of Mindjack now ...


### 1.2 Encode the gender

In [7]:
gender_enc = LabelEncoder()
gender_enc.fit(df.gender.values.tolist())
logging.info(list(gender_enc.classes_))
df["gender"] = gender_enc.transform(df.gender.values.tolist())
df.head(n=3)

10:21:27: ['female', 'male']


Unnamed: 0,blogger_id,gender,date,blog_post
119689,5114,1,2002-11-06,Sign #249 urlLink CNN needs some sense slapp...
119722,5114,1,2004-04-19,The new issue of Mindjack is urlLink now onli...
119723,5114,1,2004-04-12,There's a urlLink new issue of Mindjack now ...
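
The mapping logged above (`female` to 0, `male` to 1) follows from the classes being sorted before integer codes are assigned. A minimal pure-Python sketch of that idea, where `fit_label_mapping` is a hypothetical stand-in and not the scikit-learn implementation:

```python
def fit_label_mapping(labels):
    """Assign integer codes to the distinct labels in sorted order."""
    classes = sorted(set(labels))
    return classes, {label: idx for idx, label in enumerate(classes)}

genders = ["male", "female", "male"]
classes, mapping = fit_label_mapping(genders)
encoded = [mapping[g] for g in genders]
print(classes, encoded)
```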


### 1.3 Preprocess the text
This takes a while. I noticed that **each instance** of the spaCy English parser takes up **~3GB of RAM** (also confirmed in https://github.com/spacy-io/spaCy/issues/100), so set the number of processes prudently.
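
One way to pick the number of processes is to cap it by both the RAM budget and the core count. The sketch below assumes the ~3GB per parser figure mentioned above. The 2GB reserve for everything else and the helper name `max_parser_processes` are my own assumptions, not values taken from `textpreprocess`.

```python
import multiprocessing

def max_parser_processes(available_ram_gb, ram_per_parser_gb=3.0, reserve_gb=2.0):
    """Return how many parser processes fit in RAM, capped by the CPU count."""
    usable_gb = max(available_ram_gb - reserve_gb, 0.0)
    by_ram = int(usable_gb // ram_per_parser_gb)
    # Always allow at least one process, even on a very small machine
    return max(1, min(by_ram, multiprocessing.cpu_count()))

# Example: a machine with 16GB of RAM (assumed figure)
print(max_parser_processes(16))
```

The result could then be written to the `num_processes` entry under `tokenize_text` in the config file.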

In [8]:
tokenized_dataset = "tokenized_text.txt"
if not os.path.exists(tokenized_dataset):
    %time df = textpreprocess.tokenize_text(df, col_name="blog_post", num_processes=cfg["tokenize_text"]["num_processes"])
    df.to_csv(tokenized_dataset, sep="\t", index=False)
else:
    logging.info("Reading {}".format(tokenized_dataset))
    %time df = pd.read_csv(tokenized_dataset, sep="\t", converters={"tokenized_text":ast.literal_eval})

if len(df.columns.values.tolist()) >= 6:
    df.drop(["Unnamed: 0"], axis=1, inplace=True)

logging.info(df.shape)
df.head(n=3)

10:21:27: Reading tokenized_text.txt
10:22:09: (145044, 5)


CPU times: user 40.7 s, sys: 1.16 s, total: 41.8 s
Wall time: 42.6 s


Unnamed: 0,blogger_id,gender,date,blog_post,tokenized_text
0,5114,1,2002-11-06,Sign #249 urlLink CNN needs some sense slapp...,"[sign, 249, cnn, need, sense, slap, day, elect..."
1,5114,1,2004-04-19,The new issue of Mindjack is urlLink now onli...,"[new, issue, mindjack, online, link, blogging,..."
2,5114,1,2004-04-12,There's a urlLink new issue of Mindjack now ...,"[new, issue, mindjack, online, article, mindja..."


In [9]:
# Concatenate the tokens into a single string - as required downstream
def func_concat_tokens(x):
    terms = x["tokenized_text"]
    terms = [str(t) for t in terms]
    return " ".join(terms)
%time df["tokenized_text_rejoined"] = df.apply(func_concat_tokens , axis=1)

# Next, replace the text within "blog_post" with text in "tokenized_text_rejoined"
df["blog_post"] = df["tokenized_text_rejoined"]
df.drop(["tokenized_text", "tokenized_text_rejoined"], axis=1, inplace=True)
del textpreprocess._spacy_parser_
num_unreachable_objects = gc.collect()
df.head(n=3)

CPU times: user 7.01 s, sys: 0 ns, total: 7.01 s
Wall time: 7.02 s


Unnamed: 0,blogger_id,gender,date,blog_post
0,5114,1,2002-11-06,sign 249 cnn need sense slap day election repu...
1,5114,1,2004-04-19,new issue mindjack online link blogging equali...
2,5114,1,2004-04-12,new issue mindjack online article mindjack 's ...


### 1.4 Feature Extraction
How many features do we have?

- I ran the TF-IDF vectorizer with a minimum document frequency of 0.1%, 0.5% and 1.0%, which gave 6851, 2069 and 1104 features respectively.
- I then repeated with a maximum document frequency in the range 10-50%, which gave ~240K features.
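
To see why the two thresholds give such different feature counts, here is a toy sketch of document-frequency filtering in plain Python. It only mimics the effect of the `min_df` and `max_df` proportions on the vocabulary. `filter_vocabulary` is a hypothetical helper, and the four documents are made up from tokens that appear in the sample rows above.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_df=0.0, max_df=1.0):
    """Keep terms whose document-frequency proportion lies in [min_df, max_df]."""
    n_docs = float(len(tokenized_docs))
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(t for t, c in doc_freq.items() if min_df <= c / n_docs <= max_df)

docs = [
    ["new", "issue", "mindjack", "online"],
    ["new", "issue", "article"],
    ["sign", "cnn", "new"],
    ["election", "day", "new", "issue"],
]
rare_terms = filter_vocabulary(docs, max_df=0.5)    # drops terms in most documents
common_terms = filter_vocabulary(docs, min_df=0.5)  # drops the long tail of rare terms
print(len(rare_terms), len(common_terms))
```

A minimum threshold removes the long tail of rare terms, which is where most of the vocabulary sits, while a maximum threshold only removes the handful of very common terms.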

In [10]:
# Choosing max_df=0.5, which gives ~240K features - feature selection will be done later
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
%time X = vectorizer.fit_transform(df["blog_post"]) # sparse matrix in CSR format
y = np.array(df.gender.values.tolist())
logging.info("max_df = 0.5, X.shape : {}, len(y) : {}".format(X.shape, len(y)))
num_unreachable_objects = gc.collect()

10:22:36: max_df = 0.5, X.shape : (145044, 239936), len(y) : 145044


CPU times: user 18.8 s, sys: 271 ms, total: 19 s
Wall time: 19.1 s


Back to [Top](#top)
# 2. Doc2Vec
Apply variants of the two types of Doc2Vec models:
- *distributed memory* (PV-DM)
 - [first](#2.1-Doc2Vec---distributed-memory), I try various window sizes for different feature vector lengths (100, 200, 300)
 - [next](#2.2-Doc2Vec---distributed-memory---2nd-try), I manipulate the `sample` and `negative` parameters, also for different feature vector lengths (100, 200, 300)
- [*distributed bag of words*](#2.3-Doc2Vec---distributed-bag-of-words) (PV-DBOW)

First, get the text of the blog posts into the required format.

In [11]:
# Get the text of the blog posts in the format required
documents = df.blog_post.values.tolist() # this is a list of strings
documents = [str(x).split() for x in documents] # this is a list of lists of tokens
logging.info("{} documents".format(len(documents)))

# Prepare the LabeledSentences required by Doc2Vec
d2v_sentences = []
for i, item in enumerate(documents):
    sentence = LabeledSentence(item, [u"SENT_{}".format(i)])
    d2v_sentences.append(sentence)

10:23:59: 145044 documents
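
To make the input format concrete: each training item pairs a list of tokens with a list of tags. The stand-in below uses a plain `namedtuple` called `TaggedPost` (a hypothetical name) instead of gensim's `LabeledSentence`, purely so that it runs without gensim installed.

```python
from collections import namedtuple

# Stand-in for the (words, tags) pair that Doc2Vec consumes
TaggedPost = namedtuple("TaggedPost", ["words", "tags"])

def tag_posts(posts, prefix="SENT_"):
    """Split each post into tokens and attach one unique tag per post."""
    return [TaggedPost(post.split(), ["{}{}".format(prefix, i)])
            for i, post in enumerate(posts)]

tagged = tag_posts(["new issue mindjack online", "sign 249 cnn need sense"])
print(tagged[1])
```

Because every post gets its own tag, the model learns one vector per post, and the integer suffix of the tag maps back to the row position in `df`.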


Define common helper methods

In [34]:
d2v_dim = 100

def train_doc2vec_models(d2v_models):
    trained_d2v_models = []
    for model_id, doc2vec_dim, model in d2v_models:
        model_file_path = os.path.join(d2v_models_dir, "{}.doc2vec".format(model_id))
        model, model_id = trainingutils.train_doc2vec_model(model, model_id, d2v_sentences, model_file_path)
        trained_d2v_models.append((model_id, doc2vec_dim, model_file_path))
    
    return trained_d2v_models

def train_classifiers_on_vectors_from_d2v_model(d2v_models, top_n=10):
    scores = []
    for model_id, doc2vec_dim, model_file_path in d2v_models:
        # First, load the model from file
        model = Doc2Vec.load(model_file_path)
        
        # Next, split the vectors into random train and test subsets -- vectors are specific to current model
        train_arrays, train_labels, test_arrays, test_labels = \
            trainingutils.get_doc2vec_train_test_data(model, doc2vec_dim, y, _random_state)
        
        # Next, rename the classifier id strings based on the model_id
        _clfs = [(x[0] + ", " + model_id, x[1]) for x in clfs]
        
        # Train the classifiers
        logging.info("Training classifiers using vectors from Doc2Vec model, {}".format(model_id))
        model_specific_scores = []
        for clf_id, clf in _clfs:
            cv_score = trainingutils.get_cv_score(clf_id, clf, train_arrays, train_labels, n_jobs=num_cores)
            logging.info("{}, {:.4f}".format(clf_id, cv_score))
            model_specific_scores.append(cv_score)
            scores.append((clf_id, cv_score))
    
        logging.info("Trained classifiers using vectors from Doc2Vec model, {}".format(model_id))
        logging.info("Average F1-score: {:.4f}".format(np.mean(model_specific_scores)))

        # Convert the scores into a DF
        scores_df = pd.DataFrame(scores, columns=["Model", "F1-score"])
        
        # Read the older scores - to compare
        scores_file_path = "scores.txt"
        if os.path.exists(scores_file_path):
            tmp_df = pd.read_csv(scores_file_path, sep="\t")
            # Concat with scores of previous runs
            scores_df = pd.concat([scores_df, tmp_df])
        
        scores_df = scores_df.drop_duplicates(subset=["Model"], keep="last")
        scores_df = scores_df.sort_values(by=["F1-score"], ascending=[0])
        scores_df.index = range(1, len(scores_df) + 1)
        scores_df.to_csv(scores_file_path, sep="\t", index=False, float_format="%.4f")
        
        # Free up some RAM
        del model
        num_unreachable_objects = gc.collect()

    scores = sorted(scores, key=lambda x: -x[1])
    # Tabulating only the top N scores, the rest are too low to be considered
    print(tabulate(scores[:top_n], floatfmt=".4f", headers=("Model", "F1-score")))
    return scores

In [35]:
# Classifiers being used to compare the various Doc2Vec models
clfs = [
    ("ExtraTrees_600, Doc2Vec", ExtraTreesClassifier(n_estimators=600, random_state=_random_state)),
    ("LinearSVC_07, Doc2Vec", LinearSVC(C=0.7, random_state=_random_state)),
    ("LogisticRegression, Doc2Vec", LogisticRegression()),
    ("LogisticRegressionCV_sag, Doc2Vec", LogisticRegressionCV(cv=5, Cs=list(np.power(10.0, np.arange(-10, 10))),
                                                               random_state=_random_state, solver="sag")),
    ("PassiveAggressive_01, Doc2Vec", PassiveAggressiveClassifier(C=0.1, n_iter=50, random_state=_random_state)),
    ("RandomForest_600, Doc2Vec", RandomForestClassifier(n_estimators=600, random_state=_random_state)),
    ("RidgeClassifier-auto-1e-3, Doc2Vec", RidgeClassifier(tol=1e-3, solver="auto", random_state=_random_state)),
    ("SGD_elasticnet_penalty, Doc2Vec", 
     SGDClassifier(alpha=.0001, n_iter=150, penalty="elasticnet", random_state=_random_state)),
    ("SGD_l1_penalty, Doc2Vec", SGDClassifier(alpha=.0001, n_iter=150, penalty="l1", random_state=_random_state)),
    ("SGD_l2_penalty, Doc2Vec", SGDClassifier(alpha=.0001, n_iter=150, penalty="l2", random_state=_random_state)),
]

Back to [Top](#top), [Doc2Vec](#2.-Doc2Vec)
### 2.1 Doc2Vec - *distributed memory*

Train the **Doc2Vec-DM** models (*varying window sizes and length of feature vectors*) and then train the classifiers using the vectors from these models.

Training the models takes time (30+ minutes), so in most cases I load previously trained models from file. I set `min_count`=1 because each post is treated as a sentence with a label (for example, `SENT_123`) which appears just once.

Best [Average F1-scores](#2.1.1-Doc2Vec-DM-Average-F1-scores) and [Maximum F1-scores](#2.1.2-Doc2Vec-DM-Maximum-F1-scores)

In [36]:
# Train the `distributed memory` (PV-DM) models
# Model 1 : distributed memory (dm=1), vary the dimensionality of feature vectors along with the window size

# NOTE: All models listed below (including the commented ones) have been run previously.
#       Using the uncommented models for demonstration because they have the highest Average F1-scores.

models = [
    # 100 dimension feature vectors
    #("model1_dms_d100_hs_w5", d2v_dim*1, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, size=d2v_dim*1, window=5,
    #                                             workers=num_cores)),
    #("model1_dms_d100_hs_w8", d2v_dim*1, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, size=d2v_dim*1, window=8,
    #                                              workers=num_cores)),
    #("model1_dms_d100_hs_w10", d2v_dim*1, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, size=d2v_dim*1, window=10,
    #                                              workers=num_cores)),
    ("model1_dms_d100_hs_w20", d2v_dim*1, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, size=d2v_dim*1, window=20,
                                                  workers=num_cores)),
    
    # 200 dimension feature vectors
    #("model1_dms_d200_hs_w5", d2v_dim*2, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, size=d2v_dim*2, window=5,
    #                                             workers=num_cores)),
    #("model1_dms_d200_hs_w8", d2v_dim*2, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, size=d2v_dim*2, window=8,
    #                                             workers=num_cores)),
    #("model1_dms_d200_hs_w10", d2v_dim*2, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, size=d2v_dim*2, window=10,
    #                                              workers=num_cores)),
    ("model1_dms_d200_hs_w20", d2v_dim*2, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, size=d2v_dim*2, window=20,
                                                  workers=num_cores)),
    
    # 300 dimension feature vectors
    #("model1_dms_d300_hs_w5", d2v_dim*3, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, size=d2v_dim*3, window=5,
    #                                             workers=num_cores)),
    #("model1_dms_d300_hs_w8", d2v_dim*3, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, size=d2v_dim*3, window=8,
    #                                             workers=num_cores)),
    #("model1_dms_d300_hs_w10", d2v_dim*3, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, size=d2v_dim*3, window=10,
    #                                              workers=num_cores)),
    #("model1_dms_d300_hs_w20", d2v_dim*3, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, size=d2v_dim*3, window=20,
    #                                              workers=num_cores)),
]

trained_dm_models = train_doc2vec_models(models)

# Train the classifiers using the vectors from the Doc2Vec-DM models (model 1 variants),
# show only top 5 highest F1-scores
scores_dm_1 = train_classifiers_on_vectors_from_d2v_model(trained_dm_models, top_n=5)

05:46:59: Doc2Vec model 'model1_dms_d100_hs_w20', Doc2Vec(dm/s,d100,hs,w20,t12)
05:46:59: Loading from models/d2v/model1_dms_d100_hs_w20.doc2vec ...
05:47:09: Doc2Vec model 'model1_dms_d200_hs_w20', Doc2Vec(dm/s,d200,hs,w20,t12)
05:47:09: Loading from models/d2v/model1_dms_d200_hs_w20.doc2vec ...
05:47:29: Training classifiers using vectors from Doc2Vec model, model1_dms_d100_hs_w20
05:51:46: ExtraTrees_600, Doc2Vec, model1_dms_d100_hs_w20, 0.7044
05:52:28: LinearSVC_07, Doc2Vec, model1_dms_d100_hs_w20, 0.6918
05:52:35: LogisticRegression, Doc2Vec, model1_dms_d100_hs_w20, 0.6911
05:53:53: LogisticRegressionCV_sag, Doc2Vec, model1_dms_d100_hs_w20, 0.6913
05:54:02: PassiveAggressive_01, Doc2Vec, model1_dms_d100_hs_w20, 0.6808
06:05:51: RandomForest_600, Doc2Vec, model1_dms_d100_hs_w20, 0.6946
06:05:57: RidgeClassifier-auto-1e-3, Doc2Vec, model1_dms_d100_hs_w20, 0.6921
06:06:18: SGD_elasticnet_penalty, Doc2Vec, model1_dms_d100_hs_w20, 0.6949
06:06:36: SGD_l1_penalty, Doc2Vec, model1_dms_d

Model                                                      F1-score
-------------------------------------------------------  ----------
ExtraTrees_600, Doc2Vec, model1_dms_d100_hs_w20              0.7044
ExtraTrees_600, Doc2Vec, model1_dms_d200_hs_w20              0.6994
SGD_elasticnet_penalty, Doc2Vec, model1_dms_d200_hs_w20      0.6972
SGD_l2_penalty, Doc2Vec, model1_dms_d200_hs_w20              0.6972
SGD_l1_penalty, Doc2Vec, model1_dms_d200_hs_w20              0.6972


Back to [Top](#top), [Doc2Vec](#2.-Doc2Vec)
#### 2.1.1 Doc2Vec-DM Average F1-scores
Scores from previous runs

| Window Size                   | 5      | 8      | 10     | 20     |
|-------------------------------|--------|--------|--------|--------|
| 100 dimension feature vectors | 0.6845 | 0.6886 | 0.6890 | 0.6945 |
| 200 dimension feature vectors | 0.6848 | 0.6886 | 0.6918 | 0.6954 |
| 300 dimension feature vectors | 0.6859 | 0.6907 | 0.6918 | 0.6954 |
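
Each cell in the table above is the mean of the cross-validated F1-scores of all classifiers trained on one model's vectors. A minimal sketch of that grouping is shown below, assuming the `(classifier id, score)` tuples use the `"<classifier>, Doc2Vec, <model id>"` naming seen in the logs. The sample input is only the top-5 list printed earlier, so its averages will not match the table, and `average_f1_by_model` is a hypothetical helper.

```python
from collections import defaultdict

def average_f1_by_model(scores):
    """Group (clf_id, f1) tuples by the model id at the end of clf_id and average them."""
    grouped = defaultdict(list)
    for clf_id, f1 in scores:
        model_id = clf_id.rsplit(", ", 1)[-1]
        grouped[model_id].append(f1)
    return {model_id: sum(vals) / len(vals) for model_id, vals in grouped.items()}

# Partial input: only the top-5 rows printed above, not all ten classifiers per model
top5 = [
    ("ExtraTrees_600, Doc2Vec, model1_dms_d100_hs_w20", 0.7044),
    ("ExtraTrees_600, Doc2Vec, model1_dms_d200_hs_w20", 0.6994),
    ("SGD_elasticnet_penalty, Doc2Vec, model1_dms_d200_hs_w20", 0.6972),
    ("SGD_l2_penalty, Doc2Vec, model1_dms_d200_hs_w20", 0.6972),
    ("SGD_l1_penalty, Doc2Vec, model1_dms_d200_hs_w20", 0.6972),
]
averages = average_f1_by_model(top5)
for model_id in sorted(averages):
    print("{}: {:.4f}".format(model_id, averages[model_id]))
```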

#### 2.1.2 Doc2Vec-DM Maximum F1-scores
From previous runs

| Model                                                            | F1-score | Dimensionality of feature vectors | Window Size |
|------------------------------------------------------------------|----------|-----------------------------------|-------------|
| ExtraTrees_600, Doc2Vec, model1_dms_d100_hs_w20                  | 0.7067   | 100                               | 20          |
| ExtraTrees_600, Doc2Vec, model1_dms_d100_hs_w10                  | 0.7032   | 100                               | 10          |
| SGD_l1_penalty, Doc2Vec, model1_dms_d300_hs_w20                  | 0.7027   | 300                               | 20          |

Back to [Top](#top), [Doc2Vec](#2.-Doc2Vec), [Doc2Vec-DM](#2.1-Doc2Vec---distributed-memory)

### 2.2 Doc2Vec - *distributed memory* - 2nd try
The [previous attempt](#2.1-Doc2Vec---distributed-memory) just experimented with the window size and the dimensionality of the feature vectors. Next, I manipulate
- `sample`, to randomly downsample high frequency words
- `negative`, to use negative sampling, i.e. *how many noise words should be drawn*

also for different feature vector lengths (100, 200, 300)
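
For intuition about `sample`: the sketch below follows the down-sampling rule as I understand it from the original word2vec C code, which gensim mirrors. Treat the exact formula as an assumption and check it against the gensim source for your version. `keep_probability` is a hypothetical helper name, and `word_fraction` is the word's share of all tokens in the corpus.

```python
import math

def keep_probability(word_fraction, sample=1e-4):
    """Probability of keeping one occurrence of a word during training.

    Assumed rule: p = (sqrt(f / sample) + 1) * (sample / f), capped at 1.
    """
    if sample <= 0 or word_fraction <= 0:
        return 1.0  # no down-sampling
    ratio = word_fraction / sample
    return min(1.0, (math.sqrt(ratio) + 1.0) / ratio)

# A very frequent word (1% of all tokens) under the two settings tried here
for s in (1e-4, 1e-3):
    print("sample={:g}: keep {:.3f}".format(s, keep_probability(0.01, s)))

# A rare word is always kept
print(keep_probability(1e-5, 1e-4))
```

Under this rule, a smaller `sample` discards more occurrences of frequent words, so `sample=1e-4` is the more aggressive of the two settings.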

Best [Average F1-scores](#2.2.1-Doc2Vec---DM-Average-F1-scores,-varying-sample-and-negative) and [Maximum F1-scores](#2.2.2-Doc2Vec---DM-Maximum-F1-scores,)

In [37]:
# Model 2 : distributed memory (dm=1)

# NOTE: All models listed below (including the commented ones) have been run previously.
#       Using the uncommented models for demonstration because they have the highest Average F1-scores.

models = [
    #("model2_dms_d100_hs_n5_s0001_w10", d2v_dim*1, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=5,
    #                                                       sample=1e-4, size=d2v_dim*1, window=10, workers=num_cores)),
    #("model2_dms_d200_hs_n5_s0001_w10", d2v_dim*2, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=5,
    #                                                       sample=1e-4, size=d2v_dim*2, window=10, workers=num_cores)),
    #("model2_dms_d300_hs_n5_s0001_w10", d2v_dim*3, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=5,
    #                                                       sample=1e-4, size=d2v_dim*3, window=10, workers=num_cores)),
    #("model2_dms_d100_hs_n8_s0001_w10", d2v_dim*1, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=8,
    #                                                       sample=1e-4, size=d2v_dim*1, window=10, workers=num_cores)),
    ("model2_dms_d200_hs_n8_s0001_w10", d2v_dim*2, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=8,
                                                           sample=1e-4, size=d2v_dim*2, window=10, workers=num_cores)),
    #("model2_dms_d300_hs_n8_s0001_w10", d2v_dim*3, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=8,
    #                                                       sample=1e-4, size=d2v_dim*3, window=10, workers=num_cores)),
    ("model2_dms_d100_hs_n10_s0001_w10", d2v_dim*1, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=10,
                                                           sample=1e-4, size=d2v_dim*1, window=10, workers=num_cores)),
    #("model2_dms_d200_hs_n10_s0001_w10", d2v_dim*2, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=10,
    #                                                        sample=1e-4, size=d2v_dim*2, window=10, workers=num_cores)),
    #("model2_dms_d300_hs_n10_s0001_w10", d2v_dim*3, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=10,
    #                                                        sample=1e-4, size=d2v_dim*3, window=10, workers=num_cores)),
    
    # Change the sample size
    #
    #("model2_dms_d100_hs_n5_s001_w10", d2v_dim*1, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=5,
    #                                                      sample=1e-3, size=d2v_dim*1, window=10, workers=num_cores)),
    #("model2_dms_d200_hs_n5_s001_w10", d2v_dim*2, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=5,
    #                                                      sample=1e-3, size=d2v_dim*2, window=10, workers=num_cores)),
    #("model2_dms_d300_hs_n5_s001_w10", d2v_dim*3, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=5,
    #                                                      sample=1e-3, size=d2v_dim*3, window=10, workers=num_cores)),
    ("model2_dms_d100_hs_n8_s001_w10", d2v_dim*1, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=8,
                                                          sample=1e-3, size=d2v_dim*1, window=10, workers=num_cores)),
    ("model2_dms_d200_hs_n8_s001_w10", d2v_dim*2, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=8,
                                                          sample=1e-3, size=d2v_dim*2, window=10, workers=num_cores)),
    #("model2_dms_d300_hs_n8_s001_w10", d2v_dim*3, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=8,
    #                                                      sample=1e-3, size=d2v_dim*3, window=10, workers=num_cores)),
    #("model2_dms_d100_hs_n10_s001_w10", d2v_dim*1, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=10,
    #                                                       sample=1e-3, size=d2v_dim*1, window=10, workers=num_cores)),
    #("model2_dms_d200_hs_n10_s001_w10", d2v_dim*2, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=10,
    #                                                       sample=1e-3, size=d2v_dim*2, window=10, workers=num_cores)),
    #("model2_dms_d300_hs_n10_s001_w10", d2v_dim*3, Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1, negative=10,
    #                                                       sample=1e-3, size=d2v_dim*3, window=10, workers=num_cores)),
]

trained_dm_models_2 = train_doc2vec_models(models) # if skip first x, models[x:]
scores_dm_2 = train_classifiers_on_vectors_from_d2v_model(trained_dm_models_2, top_n=5)

06:44:15: Doc2Vec model 'model2_dms_d200_hs_n8_s0001_w10', Doc2Vec(dm/s,d200,n8,hs,w10,s0.0001,t12)
06:44:15: Building vocabulary ...
06:44:53: Training ...
07:23:54: Saving Doc2Vec model to models/d2v/model2_dms_d200_hs_n8_s0001_w10.doc2vec ...
07:24:08: Doc2Vec model 'model2_dms_d100_hs_n10_s0001_w10', Doc2Vec(dm/s,d100,n10,hs,w10,s0.0001,t12)
07:24:08: Building vocabulary ...
07:24:44: Training ...
08:08:05: Saving Doc2Vec model to models/d2v/model2_dms_d100_hs_n10_s0001_w10.doc2vec ...
08:08:16: Doc2Vec model 'model2_dms_d100_hs_n8_s001_w10', Doc2Vec(dm/s,d100,n8,hs,w10,s0.001,t12)
08:08:16: Building vocabulary ...
08:08:50: Training ...
08:52:10: Saving Doc2Vec model to models/d2v/model2_dms_d100_hs_n8_s001_w10.doc2vec ...
08:52:22: Doc2Vec model 'model2_dms_d200_hs_n8_s001_w10', Doc2Vec(dm/s,d200,n8,hs,w10,s0.001,t12)
08:52:22: Building vocabulary ...
08:52:57: Training ...
09:34:40: Saving Doc2Vec model to models/d2v/model2_dms_d200_hs_n8_s001_w10.doc2vec ...
09:35:05: Training 

Model                                                               F1-score
----------------------------------------------------------------  ----------
ExtraTrees_600, Doc2Vec, model2_dms_d100_hs_n10_s0001_w10             0.7091
ExtraTrees_600, Doc2Vec, model2_dms_d100_hs_n8_s001_w10               0.7049
SGD_elasticnet_penalty, Doc2Vec, model2_dms_d200_hs_n8_s0001_w10      0.7037
SGD_l1_penalty, Doc2Vec, model2_dms_d200_hs_n8_s0001_w10              0.7037
SGD_l2_penalty, Doc2Vec, model2_dms_d200_hs_n8_s001_w10               0.7036


Back to [Top](#top), [Doc2Vec](#2.-Doc2Vec)
#### 2.2.1 Doc2Vec - DM Average F1-scores, varying sample and negative
Average F1-scores (sample=0.0001, window size fixed at 10) from previous runs

| Negative Samples              | 5      | 8      | 10     |
|-------------------------------|--------|--------|--------|
| (sample=0.0001)               |        |        |        |
| 100 dimension feature vectors | 0.6955 | 0.6966 | 0.6958 |
| 200 dimension feature vectors | 0.6972 | **0.6984** | 0.6965 |
| 300 dimension feature vectors | 0.6971 | 0.6981 | 0.6961 |


Average F1-scores (sample=0.001, window size fixed at 10) from previous runs

| Negative Samples              | 5      | 8      | 10     |
|-------------------------------|--------|--------|--------|
| (sample=0.001)                |        |        |        |
| 100 dimension feature vectors | 0.6975 | **0.6992** | 0.6987 |
| 200 dimension feature vectors | 0.6984 | **0.6992** | 0.6979 |
| 300 dimension feature vectors | 0.6975 | 0.6971 | 0.6969 |

#### 2.2.2 Doc2Vec - DM Maximum F1-scores,
varying sample and negative (*also comparing with [previous attempt](#2.1-Doc2Vec---distributed-memory) where only window size was varied*)

| Model                                                            | F1-score | Dimensionality of feature vectors | Window Size |
|------------------------------------------------------------------|----------|-----------------------------------|-------------|
| ExtraTrees_600, Doc2Vec, model2_dms_d100_hs_n10_s0001_w10        | 0.7075   | 100                               | 10          |
| ExtraTrees_600, Doc2Vec, model2_dms_d100_hs_n8_s0001_w10         | 0.7062   | 100                               | 10          |
| SGD_l1_penalty, Doc2Vec, model2_dms_d300_hs_n5_s001_w10          | 0.7060   | 300                               | 10          |
| SGD_l1_penalty, Doc2Vec, model2_dms_d300_hs_n8_s0001_w10         | 0.7059   | 300                               | 10          |
| SGD_elasticnet_penalty, Doc2Vec, model2_dms_d300_hs_n5_s001_w10  | 0.7057   | 300                               | 10          |
| SGD_elasticnet_penalty, Doc2Vec, model2_dms_d300_hs_n10_s001_w10 | 0.7038   | 300                               | 10          |
| ExtraTrees_600, Doc2Vec, model1_dms_d100_hs_w10                  | 0.7032   | 100                               | 10          |
| SGD_l1_penalty, Doc2Vec, model1_dms_d300_hs_w20                  | 0.7027   | 300                               | 20          |

Back to [Top](#top), [Doc2Vec](#2.-Doc2Vec)

### 2.3 Doc2Vec - *distributed bag of words*
Train the **Doc2Vec-DBOW** models (`dm`=0) and then train the classifiers using the vectors from these models.

Training the models takes time (30+ minutes), so in most cases I load previously trained models from file. I set `min_count`=1 because each post is treated as a sentence with a label (for example, `SENT_123`) which appears just once.

In [39]:
# Train the `distributed bag of words` (PV-DBOW) models

# NOTE: All models listed below (including the commented ones) have been run previously.
#       Using the uncommented models for demonstration because they have the highest Average F1-scores
#       or a high individual F1-score.

models = [
    ("PV-DBOW_d100_mc1_n5", d2v_dim*1, Doc2Vec(dm=0, hs=0, min_count=1, negative=5, size=d2v_dim*1, workers=num_cores)),
    #("PV-DBOW_d200_mc1_n5", d2v_dim*2, Doc2Vec(dm=0, hs=0, min_count=1, negative=5, size=d2v_dim*2, workers=num_cores)),
    ("PV-DBOW_d300_mc1_n5", d2v_dim*3, Doc2Vec(dm=0, hs=0, min_count=1, negative=5, size=d2v_dim*3, workers=num_cores)),
    #("PV-DBOW_d100_mc1_n8", d2v_dim*1, Doc2Vec(dm=0, hs=0, min_count=1, negative=8, size=d2v_dim*1, workers=num_cores)),
    #("PV-DBOW_d200_mc1_n8", d2v_dim*2, Doc2Vec(dm=0, hs=0, min_count=1, negative=8, size=d2v_dim*2, workers=num_cores)),
    ("PV-DBOW_d300_mc1_n8", d2v_dim*3, Doc2Vec(dm=0, hs=0, min_count=1, negative=8, size=d2v_dim*3, workers=num_cores)),
    ("PV-DBOW_d100_mc1_n10", d2v_dim*1, Doc2Vec(dm=0, hs=0, min_count=1, negative=10, size=d2v_dim*1, workers=num_cores)),
    #("PV-DBOW_d200_mc1_n10", d2v_dim*2, Doc2Vec(dm=0, hs=0, min_count=1, negative=10, size=d2v_dim*2, workers=num_cores)),
    #("PV-DBOW_d300_mc1_n10", d2v_dim*3, Doc2Vec(dm=0, hs=0, min_count=1, negative=10, size=d2v_dim*3, workers=num_cores))
]
trained_dbow_models = train_doc2vec_models(models)

# Train the classifiers using the vectors from the Doc2Vec-DBOW models
scores_dbow_1 = train_classifiers_on_vectors_from_d2v_model(trained_dbow_models)

04:36:17: Doc2Vec model 'PV-DBOW_d100_mc1_n5', Doc2Vec(dbow,d100,n5,t12)
04:36:17: Loading from models/d2v/PV-DBOW_d100_mc1_n5.doc2vec ...
04:36:21: Doc2Vec model 'PV-DBOW_d300_mc1_n5', Doc2Vec(dbow,d300,n5,t12)
04:36:21: Loading from models/d2v/PV-DBOW_d300_mc1_n5.doc2vec ...
04:36:30: Doc2Vec model 'PV-DBOW_d300_mc1_n8', Doc2Vec(dbow,d300,n8,t12)
04:36:30: Loading from models/d2v/PV-DBOW_d300_mc1_n8.doc2vec ...
04:36:39: Doc2Vec model 'PV-DBOW_d100_mc1_n10', Doc2Vec(dbow,d100,n10,t12)
04:36:39: Loading from models/d2v/PV-DBOW_d100_mc1_n10.doc2vec ...
04:36:51: Training classifiers using vectors from Doc2Vec model, PV-DBOW_d100_mc1_n5
04:40:35: ExtraTrees_600, Doc2Vec, PV-DBOW_d100_mc1_n5, 0.7137
04:42:08: LinearSVC_07, Doc2Vec, PV-DBOW_d100_mc1_n5, 0.7035
04:42:15: LogisticRegression, Doc2Vec, PV-DBOW_d100_mc1_n5, 0.7033
04:43:43: LogisticRegressionCV_sag, Doc2Vec, PV-DBOW_d100_mc1_n5, 0.7033
04:43:52: PassiveAggressive_01, Doc2Vec, PV-DBOW_d100_mc1_n5, 0.6665
04:55:02: RandomForest_

Model                                                      F1-score
-------------------------------------------------------  ----------
ExtraTrees_600, Doc2Vec, PV-DBOW_d100_mc1_n10                0.7151
ExtraTrees_600, Doc2Vec, PV-DBOW_d100_mc1_n5                 0.7137
RidgeClassifier-auto-1e-3, Doc2Vec, PV-DBOW_d300_mc1_n5      0.7112
LinearSVC_07, Doc2Vec, PV-DBOW_d300_mc1_n5                   0.7109
LogisticRegressionCV_sag, Doc2Vec, PV-DBOW_d300_mc1_n5       0.7107
LogisticRegression, Doc2Vec, PV-DBOW_d300_mc1_n5             0.7102
RidgeClassifier-auto-1e-3, Doc2Vec, PV-DBOW_d300_mc1_n8      0.7095
LogisticRegressionCV_sag, Doc2Vec, PV-DBOW_d300_mc1_n8       0.7094
LinearSVC_07, Doc2Vec, PV-DBOW_d300_mc1_n8                   0.7090
RandomForest_600, Doc2Vec, PV-DBOW_d100_mc1_n10              0.7087


Average and Maximum F1-scores from previous runs

- n=5,
  - 100 dimensions, Average F1-score: 0.6984, maximum : ExtraTrees_600, Doc2Vec, PV-DBOW_d100_mc1_n5, **0.7152**
  - 200 dimensions, Average F1-score: 0.6998, maximum : RidgeClassifier-auto-1e-3, Doc2Vec, PV-DBOW_d200_mc1_n5, 0.7062
  - 300 dimensions, Average F1-score: **0.7013**, maximum : RidgeClassifier-auto-1e-3, Doc2Vec, PV-DBOW_d300_mc1_n5, 0.7111
- n=8,
  - 100 dimensions, Average F1-score: 0.6969, maximum : ExtraTrees_600, Doc2Vec, PV-DBOW_d100_mc1_n8, 0.7150
  - 200 dimensions, Average F1-score: 0.6985, maximum : LinearSVC_07, Doc2Vec, PV-DBOW_d200_mc1_n8, 0.7057
  - 300 dimensions, Average F1-score: **0.7011**, maximum : LinearSVC_07, Doc2Vec, PV-DBOW_d300_mc1_n8, 0.7088
- n=10,
  - 100 dimensions, Average F1-score: 0.6970, maximum : ExtraTrees_600, Doc2Vec, PV-DBOW_d100_mc1_n10, **0.7160**
  - 200 dimensions, Average F1-score: 0.6976, maximum : RidgeClassifier-auto-1e-3, Doc2Vec, PV-DBOW_d200_mc1_n10, 0.7054
  - 300 dimensions, Average F1-score: 0.6996, maximum : RidgeClassifier-auto-1e-3, Doc2Vec, PV-DBOW_d300_mc1_n10, 0.7088

Back to [Top](#top), [Doc2Vec](#2.-Doc2Vec)

# 3. Final Comparison
For this dataset:
- *Distributed bag of words* (PV-DBOW) models score marginally better than *distributed memory* (PV-DM) models (0.7151 vs 0.7091).
- Most *Bag of Words* (BoW) approaches score higher than the Doc2Vec and Word2Vec approaches, which is the opposite of what I expected.
  - 0.7307, BoW 
  - 0.7151, Doc2Vec
  - 0.6875, Word2Vec

In [48]:
top_scores = scores_dm_1 + scores_dm_2 + scores_dbow_1
top_scores = sorted(top_scores, key=lambda x: -x[1])
print(tabulate(top_scores[:20], floatfmt=".4f", headers=("Model", "F1-score")))

Model                                                        F1-score
---------------------------------------------------------  ----------
ExtraTrees_600, Doc2Vec, PV-DBOW_d100_mc1_n10                  0.7151
ExtraTrees_600, Doc2Vec, PV-DBOW_d100_mc1_n5                   0.7137
RidgeClassifier-auto-1e-3, Doc2Vec, PV-DBOW_d300_mc1_n5        0.7112
LinearSVC_07, Doc2Vec, PV-DBOW_d300_mc1_n5                     0.7109
LogisticRegressionCV_sag, Doc2Vec, PV-DBOW_d300_mc1_n5         0.7107
LogisticRegression, Doc2Vec, PV-DBOW_d300_mc1_n5               0.7102
RidgeClassifier-auto-1e-3, Doc2Vec, PV-DBOW_d300_mc1_n8        0.7095
LogisticRegressionCV_sag, Doc2Vec, PV-DBOW_d300_mc1_n8         0.7094
ExtraTrees_600, Doc2Vec, model2_dms_d100_hs_n10_s0001_w10      0.7091
LinearSVC_07, Doc2Vec, PV-DBOW_d300_mc1_n8                     0.7090


Back to [Top](#top)