# Predicting Gender Based On Blog Text - Part 2

A comparison of a few approaches for identifying the gender of blog authors based on their writing style. It is based on a dataset containing 681288 blog posts downloaded from <a href="http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm" target="_blank">here</a>.

I am using a smaller dataset containing 145044 blog posts written by 3074 authors (1716 female, 1358 male) in the 24-25 age group.

This is the 2nd part in a series of notebooks. I address some of the TODOs from part 1 (data preparation, text preprocessing, etc.) and then:
1. Use the *Bag of Words* (BoW) approach to compare a few classifiers
2. Use the `GloVe small` **word vector** representation file (`small` has 400K words, `medium` has 1.9M words)
3. Train **`Word2Vec`** models on the blog text

In [1]:
%load_ext autoreload
%autoreload 2

from gensim import parsing, utils

import ast
import gc
import multiprocessing
import numpy as np
import os
import pandas as pd
import platform
import re
import sys
import time
import traceback
import yaml

from genderpredictutils import dataprep, textpreprocess, trainingutils
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, LabeledSentence # I have version 0.12.4 installed
from operator import itemgetter
from sklearn import metrics
from sklearn.cross_validation import cross_val_score, train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.externals import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.linear_model import RidgeClassifier, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC, SVC
from tabulate import tabulate

_random_state = 371250
num_cores = multiprocessing.cpu_count()
pd.set_option("display.max_colwidth", 300)

t0 = time.time()

with open("which_gender.yml", "r") as f:
    cfg = yaml.safe_load(f)  # safe_load avoids constructing arbitrary Python objects from the YAML file

print("Current PID : {}".format(os.getpid()))
if platform.system() == 'Linux':
    mem_bytes = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')  # e.g. 4015976448
    mem_gib = mem_bytes/(1024.**3)
    print("Memory : {:.2f} GB".format(mem_gib))
print("#cores : {}".format(num_cores))

Current PID : 29565
Memory : 15.67 GB
#cores : 8


## 1. Reading the dataset
Each author's posts appear as a separate file. The name indicates blogger id#, self-provided gender, age, industry and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)
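
As a minimal sketch of how those fields could be recovered from a file name, assuming a dot-separated pattern such as `<id>.<gender>.<age>.<industry>.<sign>.xml`. The example file name and the `parse_blog_filename` helper are assumptions for illustration; the actual parsing happens inside `dataprep` and may differ.

```python
import os
from collections import namedtuple

# Hypothetical container for the metadata encoded in a file name
BlogFileInfo = namedtuple("BlogFileInfo", ["blogger_id", "gender", "age", "industry", "sign"])


def parse_blog_filename(path):
    """Split '<id>.<gender>.<age>.<industry>.<sign>.xml' into typed fields."""
    name = os.path.basename(path)
    stem, ext = os.path.splitext(name)
    parts = stem.split(".")
    if ext.lower() != ".xml" or len(parts) != 5:
        raise ValueError("unexpected file name: {!r}".format(name))
    blogger_id, gender, age, industry, sign = parts
    return BlogFileInfo(int(blogger_id), gender, int(age), industry, sign)


# Made-up file name, used only to exercise the helper
info = parse_blog_filename("blogs/1234567.female.24.indUnk.Leo.xml")
print(info)
```

Converting the id and age to integers up front makes the later age-group filter a plain numeric comparison.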

Reading the XML files from the `.zip` archive is handled by the `dataprep` module, so here I just reuse the pre-created dataset and keep only the authors in the 24-25 age group.

In [2]:
data_dir = cfg["common"]["data_dir"]
models_dir = cfg["common"]["models_dir"]

### 1.1 Read the `gz` files

In [3]:
file_path_1 = os.path.join(data_dir, "blog_posts_metadata.txt.gz")
file_path_2 = os.path.join(data_dir, "blog_posts.txt.gz")

if not (os.path.exists(file_path_1) and os.path.exists(file_path_2)):
    print("One (or both) of {} or {} does not exist, so creating them".format(file_path_1, file_path_2))
    file_paths = dataprep.prepare_data(data_dir, num_processes=6) # This takes a while
    file_path_1, file_path_2 = file_paths[0], file_paths[1]

df0 = pd.read_csv(file_path_1, usecols=["blogger_id", "gender", "age"], sep="\t", index_col=False)
target_age_grp = df0[df0["age"].isin([24,25])]["blogger_id"].values.tolist()

df_iter = pd.read_csv(file_path_2, sep="\t", index_col=False, iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk["blogger_id"].isin(target_age_grp)] for chunk in df_iter])
df = pd.merge(df0, df)
df = df.dropna(subset=["blog_post"])
df = df.sort_values(by="blogger_id")
df.shape

(145044, 5)

In [4]:
# Get rid of columns we do not need
df = df[["blogger_id", "gender", "date", "blog_post"]]
num_unreachable_objects = gc.collect()
df = df.dropna(subset=["blog_post"])
df.shape

(145044, 4)

In [5]:
df.head(n=3)

Unnamed: 0,blogger_id,gender,date,blog_post
119689,5114,male,2002-11-06,"Sign #249 urlLink CNN needs some sense slapped into them: it's the day after an election, Republicans have taken control the Senate, Dick Gephardt is stepping down as House minority leader and right now, on Larry King, they're talking about the Winona Ryder verdict. At least they took down t..."
119722,5114,male,2004-04-19,"The new issue of Mindjack is urlLink now online . In this issue: urlLink Linked Out: blogging, equality and the future by Melanie McBride. Plus urlLink Kill Bill Vol. 2 reviewed by Jesse Walker."
119723,5114,male,2004-04-12,"There's a urlLink new issue of Mindjack now online. In it, the first article from Mindjack's newest contributor J.D. Lasica; he writes about copyright law in urlLink ""The Killing Fields"" . Also, I review DVDs of urlLink Breathless, Russian Ark, and Z . Spread the word."


### 1.2 Encode the gender

In [6]:
gender_enc = LabelEncoder()
gender_enc.fit(df.gender.values.tolist())
print(list(gender_enc.classes_))
df["gender"] = gender_enc.transform(df.gender.values.tolist())
df.head(n=3)

['female', 'male']


Unnamed: 0,blogger_id,gender,date,blog_post
119689,5114,1,2002-11-06,"Sign #249 urlLink CNN needs some sense slapped into them: it's the day after an election, Republicans have taken control the Senate, Dick Gephardt is stepping down as House minority leader and right now, on Larry King, they're talking about the Winona Ryder verdict. At least they took down t..."
119722,5114,1,2004-04-19,"The new issue of Mindjack is urlLink now online . In this issue: urlLink Linked Out: blogging, equality and the future by Melanie McBride. Plus urlLink Kill Bill Vol. 2 reviewed by Jesse Walker."
119723,5114,1,2004-04-12,"There's a urlLink new issue of Mindjack now online. In it, the first article from Mindjack's newest contributor J.D. Lasica; he writes about copyright law in urlLink ""The Killing Fields"" . Also, I review DVDs of urlLink Breathless, Russian Ark, and Z . Spread the word."
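
`LabelEncoder` assigns integer codes to the sorted class labels, which is why `female` becomes 0 and `male` becomes 1 above. A dependency-free sketch of that mapping follows; the helper names are hypothetical and not part of scikit-learn.

```python
def fit_label_mapping(labels):
    """Map each distinct label to its position in sorted order."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}


def transform_labels(labels, mapping):
    """Replace every label by its integer code."""
    return [mapping[label] for label in labels]


genders = ["male", "female", "male", "male"]  # toy input
mapping = fit_label_mapping(genders)
encoded = transform_labels(genders, mapping)
print(mapping)
print(encoded)
```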


### 1.3 Preprocess the text
This takes a while. **Each instance** of the spaCy English parser takes up **~3GB of RAM** (see also https://github.com/spacy-io/spaCy/issues/100), so set the number of processes prudently.
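
A rough way to pick that number, assuming ~3 GB per parser process and some headroom for the rest of the notebook. The `max_parser_processes` helper and the 4 GB reserve are assumptions for illustration, not something `textpreprocess` provides.

```python
def max_parser_processes(total_mem_gib, num_cores, mem_per_process_gib=3.0, reserve_gib=4.0):
    """Cap the worker count by both CPU cores and available memory."""
    usable = max(total_mem_gib - reserve_gib, 0.0)
    by_memory = int(usable // mem_per_process_gib)
    # Always allow at least one process, never more than the core count
    return max(1, min(num_cores, by_memory))


# Values reported by the first cell of this notebook: 15.67 GB, 8 cores
n = max_parser_processes(total_mem_gib=15.67, num_cores=8)
print(n)
```

On this machine memory, not the core count, is the binding constraint.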

In [8]:
tokenized_dataset = "tokenized_text.txt"
if not os.path.exists(tokenized_dataset):
    %time df = textpreprocess.tokenize_text(df, col_name="blog_post", num_processes=cfg["tokenize_text"]["num_processes"])
    df.to_csv(tokenized_dataset, sep="\t", index=False)
else:
    print("Reading {}".format(tokenized_dataset))
    %time df = pd.read_csv(tokenized_dataset, sep="\t", converters={"tokenized_text":ast.literal_eval})

if len(df.columns.values.tolist()) >= 6:
    df.drop(["Unnamed: 0"], axis=1, inplace=True)

print(df.shape)
df.head(n=3)

Reading tokenized_text.txt
CPU times: user 33.5 s, sys: 2.27 s, total: 35.8 s
Wall time: 35.8 s
(145044, 5)


Unnamed: 0,blogger_id,gender,date,blog_post,tokenized_text
0,5114,1,2002-11-06,"Sign #249 urlLink CNN needs some sense slapped into them: it's the day after an election, Republicans have taken control the Senate, Dick Gephardt is stepping down as House minority leader and right now, on Larry King, they're talking about the Winona Ryder verdict. At least they took down t...","[sign, 249, cnn, need, sense, slap, day, election, republicans, control, senate, dick, gephardt, step, house, minority, leader, right, larry, king, 're, talk, winona, ryder, verdict, breaking, news, graphic]"
1,5114,1,2004-04-19,"The new issue of Mindjack is urlLink now online . In this issue: urlLink Linked Out: blogging, equality and the future by Melanie McBride. Plus urlLink Kill Bill Vol. 2 reviewed by Jesse Walker.","[new, issue, mindjack, online, link, blogging, equality, future, melanie, mcbride, plus, kill, vol, 2, review, jesse, walker]"
2,5114,1,2004-04-12,"There's a urlLink new issue of Mindjack now online. In it, the first article from Mindjack's newest contributor J.D. Lasica; he writes about copyright law in urlLink ""The Killing Fields"" . Also, I review DVDs of urlLink Breathless, Russian Ark, and Z . Spread the word.","[new, issue, mindjack, online, article, mindjack, 's, new, contributor, j.d., lasica, write, copyright, law, killing, fields, review, dvd, breathless, russian, ark, z, spread, word]"


In [9]:
# Concatenate the tokens into a single string - as required downstream
def func_concat_tokens(x):
    terms = x["tokenized_text"]
    terms = [str(t) for t in terms]
    return " ".join(terms)
%time df["tokenized_text_rejoined"] = df.apply(func_concat_tokens , axis=1)

# Next, replace the text within "blog_post" with text in "tokenized_text_rejoined"
df["blog_post"] = df["tokenized_text_rejoined"]
df.drop(["tokenized_text", "tokenized_text_rejoined"], axis=1, inplace=True)
del textpreprocess._spacy_parser_
num_unreachable_objects = gc.collect()
df.head(n=3)

CPU times: user 7.24 s, sys: 296 ms, total: 7.53 s
Wall time: 7.54 s


Unnamed: 0,blogger_id,gender,date,blog_post
0,5114,1,2002-11-06,sign 249 cnn need sense slap day election republicans control senate dick gephardt step house minority leader right larry king 're talk winona ryder verdict breaking news graphic
1,5114,1,2004-04-19,new issue mindjack online link blogging equality future melanie mcbride plus kill vol 2 review jesse walker
2,5114,1,2004-04-12,new issue mindjack online article mindjack 's new contributor j.d. lasica write copyright law killing fields review dvd breathless russian ark z spread word


### 1.4 Feature Extraction
How many features do we have?

In [15]:
# Iterate through multiple values of min_df and max_df
min_dfs = [0.001, 0.005, 0.01]
for min_df in min_dfs:
    vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=min_df, stop_words="english")
    %time X = vectorizer.fit_transform(df["blog_post"]) # sparse matrix in CSR format
    y = np.array(df.gender.values.tolist())
    print("min_df = {:.3f}, X.shape : {}, len(y) : {}".format(min_df, X.shape, len(y)))
    del vectorizer, X, y
    num_unreachable_objects = gc.collect()

max_dfs = [0.1, 0.2, 0.3, 0.4, 0.5]
for max_df in max_dfs:
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=max_df, stop_words="english")
    %time X = vectorizer.fit_transform(df["blog_post"]) # sparse matrix in CSR format
    y = np.array(df.gender.values.tolist())
    print("max_df = {:.3f}, X.shape : {}, len(y) : {}".format(max_df, X.shape, len(y)))
    del vectorizer, X, y
    num_unreachable_objects = gc.collect()

CPU times: user 17.7 s, sys: 790 ms, total: 18.5 s
Wall time: 18.5 s
min_df = 0.001, X.shape : (145044, 6739), len(y) : 145044
CPU times: user 17.7 s, sys: 899 ms, total: 18.6 s
Wall time: 18.6 s
min_df = 0.005, X.shape : (145044, 2039), len(y) : 145044
CPU times: user 17.9 s, sys: 731 ms, total: 18.6 s
Wall time: 18.7 s
min_df = 0.010, X.shape : (145044, 1091), len(y) : 145044
CPU times: user 18.6 s, sys: 879 ms, total: 19.5 s
Wall time: 19.5 s
max_df = 0.100, X.shape : (145044, 239874), len(y) : 145044
CPU times: user 18.7 s, sys: 891 ms, total: 19.6 s
Wall time: 19.6 s
max_df = 0.200, X.shape : (145044, 239919), len(y) : 145044
CPU times: user 18.8 s, sys: 819 ms, total: 19.6 s
Wall time: 19.6 s
max_df = 0.300, X.shape : (145044, 239931), len(y) : 145044
CPU times: user 18.7 s, sys: 987 ms, total: 19.7 s
Wall time: 19.7 s
max_df = 0.400, X.shape : (145044, 239936), len(y) : 145044
CPU times: user 18.6 s, sys: 979 ms, total: 19.6 s
Wall time: 19.6 s
max_df = 0.500, X.shape : (145044,

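I ran the TF-IDF vectorizer with a minimum document frequency of 0.1%, 0.5% and 1.0%, which gave 6739, 2039 and 1091 features respectively. I then repeated it with a maximum document frequency between 10% and 50%, which gave ~240K features in every case: very few terms are frequent enough to be cut by `max_df`, whereas `min_df` removes the long tail of rare terms.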
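
The effect of the two thresholds can be sketched without scikit-learn. The `prune_vocabulary` helper and the toy documents below are illustrative assumptions; `TfidfVectorizer` applies the same kind of document-frequency filter internally when `min_df` and `max_df` are given as proportions.

```python
from collections import Counter


def document_frequencies(docs):
    """Count in how many documents each term occurs."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # a term counts once per document
    return df


def prune_vocabulary(docs, min_df=0.0, max_df=1.0):
    """Keep terms whose document-frequency share lies in [min_df, max_df]."""
    n_docs = float(len(docs))
    df = document_frequencies(docs)
    return sorted(t for t, c in df.items() if min_df <= c / n_docs <= max_df)


docs = [
    ["new", "issue", "online"],
    ["new", "article", "online"],
    ["new", "review", "dvd"],
    ["new", "election", "senate"],
]
kept_min = prune_vocabulary(docs, min_df=0.5)  # drops the rare terms
kept_max = prune_vocabulary(docs, max_df=0.5)  # drops only the ubiquitous term
print(kept_min)
print(kept_max)
```

Even in this toy corpus `min_df` removes most of the vocabulary while `max_df` removes a single term, which mirrors the feature counts above.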

In [16]:
# Choosing max_df=0.5, which gives ~240K features - feature selection will be done later
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
%time X = vectorizer.fit_transform(df["blog_post"]) # sparse matrix in CSR format
y = np.array(df.gender.values.tolist())
print("max_df = 0.5, X.shape : {}, len(y) : {}".format(X.shape, len(y)))
num_unreachable_objects = gc.collect()

CPU times: user 18.8 s, sys: 734 ms, total: 19.6 s
Wall time: 19.6 s
max_df = 0.5, X.shape : (145044, 239936), len(y) : 145044
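
For reference, a sketch of the weight a single term receives with `sublinear_tf=True`, before the per-document L2 normalisation. It assumes scikit-learn's default smoothed IDF (`smooth_idf=True`) as I understand it; the `sublinear_tfidf` helper is hypothetical.

```python
import math


def sublinear_tfidf(tf, df, n_docs):
    """Unnormalised weight: (1 + ln tf) * (ln((1 + n) / (1 + df)) + 1)."""
    if tf <= 0:
        return 0.0
    sublinear_tf = 1.0 + math.log(tf)
    smooth_idf = math.log((1.0 + n_docs) / (1.0 + df)) + 1.0
    return sublinear_tf * smooth_idf


# A term that appears 10 times in a post counts about 3.3x, not 10x, a single occurrence
ratio = sublinear_tfidf(10, df=100, n_docs=145044) / sublinear_tfidf(1, df=100, n_docs=145044)
print("{:.4f}".format(ratio))
```

The logarithm dampens the influence of words that an author repeats many times within one post.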


### 1.5 Feature Selection
I use recursive feature elimination (RFE) with a `LinearSVC` estimator to select the top 20K and the top 30K features. The 30K set might increase memory requirements for downstream approaches, such as `GloVe`.

In [17]:
print("X.shape (before feature selection) : {}".format(X.shape))
print("Feature Selection : Top 20K")
feature_selector = RFE(estimator=LinearSVC(C=0.1, random_state=_random_state), n_features_to_select=20000, step=0.05)
%time X20K = feature_selector.fit_transform(X, y)
print("X20K.shape : {}".format(X20K.shape))
print("Feature Selection : Top 30K")
feature_selector = RFE(estimator=LinearSVC(C=0.1, random_state=_random_state), n_features_to_select=30000, step=0.05)
%time X30K = feature_selector.fit_transform(X, y)
print("X30K.shape : {}".format(X30K.shape))

X.shape (before feature selection) : (145044, 239936)
Feature Selection : Top 20K
CPU times: user 46.9 s, sys: 1.78 s, total: 48.6 s
Wall time: 48.7 s
X20K.shape : (145044, 20000)
Feature Selection : Top 30K
CPU times: user 45 s, sys: 1.8 s, total: 46.9 s
Wall time: 46.9 s
X30K.shape : (145044, 30000)
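
To get a feel for how much work `step=0.05` implies, here is a sketch of the elimination schedule. It assumes, as I understand scikit-learn's `RFE`, that a fractional `step` is converted once into a fixed number of features (a share of the original count, rounded down) and that the last round removes only what is still needed. The `rfe_schedule` helper is hypothetical.

```python
def rfe_schedule(n_features, n_features_to_select, step):
    """Return the number of features left after each elimination round."""
    if 0.0 < step < 1.0:
        per_round = int(max(1, step * n_features))  # share of the ORIGINAL count
    else:
        per_round = int(step)
    remaining = n_features
    schedule = []
    while remaining > n_features_to_select:
        dropped = min(per_round, remaining - n_features_to_select)
        remaining -= dropped
        schedule.append(remaining)
    return schedule


rounds_20k = rfe_schedule(239936, 20000, 0.05)
rounds_30k = rfe_schedule(239936, 30000, 0.05)
print(len(rounds_20k), rounds_20k[0], rounds_20k[-1])
print(len(rounds_30k), rounds_30k[0], rounds_30k[-1])
```

Under these assumptions each round refits the estimator on the surviving features, so the 20K selection costs about 19 `LinearSVC` fits and the 30K selection about 18.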


## 2. Bag of Words (BoW) Approach
Train a few classifiers and compare their F1-scores, on both the 20K and the 30K feature sets.
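
As a reminder of what is being compared, a minimal sketch of the binary F1-score (the harmonic mean of precision and recall). The `binary_f1` helper and the toy labels are illustrative; how `trainingutils.compare_classifiers` computes or averages its scores is not shown in this notebook.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 1, 0, 0, 0]  # toy labels
y_pred = [1, 1, 0, 1, 0, 0]  # toy predictions
score = binary_f1(y_true, y_pred)
print("{:.4f}".format(score))
```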

In [18]:
# Compare classifiers on the BoW (TF-IDF) features; LogisticRegression is already imported in the first cell
clfs = [
    #("ExtraTrees_200, BOW", ExtraTreesClassifier(n_estimators=200, random_state=_random_state)), # Takes too long
    ("LinearSVC_01, BOW", LinearSVC(C=0.1, random_state=_random_state)),
    ("LinearSVC_04, BOW", LinearSVC(C=0.4, random_state=_random_state)),
    ("LinearSVC_07, BOW", LinearSVC(C=0.7, random_state=_random_state)),
    ("LinearSVC_10, BOW", LinearSVC(C=1.0, random_state=_random_state)),
    ("LogisticRegression, BOW", LogisticRegression()),
    ("MultinomialNB_01, BOW", MultinomialNB(alpha=.10)),
    ("MultinomialNB_02, BOW", MultinomialNB(alpha=.20)),
    ("MultinomialNB_04, BOW", MultinomialNB(alpha=.40)),
    ("MultinomialNB_06, BOW", MultinomialNB(alpha=.60)),
    ("MultinomialNB_08, BOW", MultinomialNB(alpha=.80)),
    ("MultinomialNB_10, BOW", MultinomialNB(alpha=1.0)),
    ("MultinomialNB_15, BOW", MultinomialNB(alpha=1.5)),
    ("PassiveAggressive_01, BOW", PassiveAggressiveClassifier(C=0.1, n_iter=50, random_state=_random_state)),
    ("RidgeClassifier-auto-1e-3, BOW", RidgeClassifier(tol=1e-3, solver="auto", random_state=_random_state)),
    ("RidgeClassifier-auto-1e-4, BOW", RidgeClassifier(tol=1e-4, solver="auto", random_state=_random_state)),
    ("SGD_elasticnet_penalty, BOW", 
     SGDClassifier(alpha=.0001, n_iter=150, penalty="elasticnet", random_state=_random_state)),
    ("SGD_l1_penalty, BOW", SGDClassifier(alpha=.0001, n_iter=150, penalty="l1", random_state=_random_state)),
    ("SGD_l2_penalty, BOW", SGDClassifier(alpha=.0001, n_iter=150, penalty="l2", random_state=_random_state)),
]

clfs20K = [(x[0] + ", 20K", x[1]) for x in clfs]
scores_bow = trainingutils.compare_classifiers(clfs20K, X20K, y, n_jobs=num_cores, print_scores=False)

clfs30K = [(x[0] + ", 30K", x[1]) for x in clfs]
scores_bow.extend(trainingutils.compare_classifiers(clfs30K, X30K, y, n_jobs=num_cores, print_scores=False))

scores_bow = sorted(scores_bow, key=lambda item: -item[1])  # highest F1-score first
print(tabulate(scores_bow, floatfmt=".4f", headers=("Model", "F1-score")))
del clfs, clfs20K, clfs30K
num_unreachable_objects = gc.collect()

Model                                  F1-score
-----------------------------------  ----------
MultinomialNB_01, BOW, 20K               0.7307
LinearSVC_10, BOW, 20K                   0.7296
MultinomialNB_02, BOW, 20K               0.7294
RidgeClassifier-auto-1e-3, BOW, 20K      0.7293
RidgeClassifier-auto-1e-4, BOW, 20K      0.7293
LinearSVC_07, BOW, 20K                   0.7289
MultinomialNB_04, BOW, 20K               0.7277
LinearSVC_04, BOW, 20K                   0.7276
MultinomialNB_06, BOW, 20K               0.7258
MultinomialNB_08, BOW, 20K               0.7249
PassiveAggressive_01, BOW, 20K           0.7246
MultinomialNB_10, BOW, 20K               0.7240
MultinomialNB_01, BOW, 30K               0.7235
LinearSVC_01, BOW, 20K                   0.7223
MultinomialNB_15, BOW, 20K               0.7220
RidgeClassifier-auto-1e-4, BOW, 30K      0.7218
RidgeClassifier-auto-1e-3, BOW, 30K      0.7218
LinearSVC_10, BOW, 30K                   0.7215
SGD_l2_penalty, BOW, 20K                

## 3. Using Word Vectors


### 3.1 Using `GloVe` word vector representation files
The **`GloVe`** word vector representation files can be downloaded from http://nlp.stanford.edu/data/ or https://github.com/stanfordnlp/GloVe. The three files referred to here are:
- `glove.6B.zip` (6 billion tokens, hence '*small*', 400K words), with 50/100/200/300-dimension vectors
- `glove.42B.300d.zip` (42 billion tokens, hence '*medium*', 1.9M words)
- `glove.840B.300d.zip` (840 billion tokens, hence '*large*', 2.2M words)

I used only the `GloVe-small` **100-dimension** word vector file, because `medium` needs far more RAM.

Each word in each blog post needs to be mapped to its vector representation, and the resulting vectors are then used as features. This is done using custom vectorizers as first introduced by 
<a href="http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/" target="_blank">nadbordrozd</a> (`MeanEmbeddingVectorizer` and `TfidfEmbeddingVectorizer`).

Here I use `GloVe-small` and take the mean of the word vectors with the `MeanEmbeddingVectorizer`.

In [20]:
%reload_ext autoreload
%time glove_w2v = trainingutils.read_GloVe_file(os.path.join(models_dir, "glove.6B.100d.txt.gz"))

ts = time.time()

vec = ("vectorizer", trainingutils.MeanEmbeddingVectorizer(glove_w2v))

clfs = [
    ("LinearSVC_07, GloVe small", Pipeline([vec, ("lsvc_07", LinearSVC(C=0.7, random_state=_random_state))])),
    ("MultinomialNB_01, GloVe small", Pipeline([vec, ("mnb_01", MultinomialNB(alpha=.10))])),
    ("RidgeClassifier-auto-1e-3, GloVe small",
     Pipeline([vec, ("ridge", RidgeClassifier(tol=1e-3, solver="auto", random_state=_random_state))])),
    ("SGD_elasticnet_penalty, GloVe small", 
     Pipeline([vec, ("sgd", SGDClassifier(alpha=.0001, n_iter=150, penalty="elasticnet", random_state=_random_state))])),
]

# Run for 20K features
clfs20K = [(x[0] + ", 20K", x[1]) for x in clfs]
%time scores_glove_w2v = trainingutils.compare_classifiers(clfs20K, X20K, y, n_jobs=num_cores, print_scores=False)

# Run for 30K features
clfs30K = [(x[0] + ", 30K", x[1]) for x in clfs]
%time scores_glove_w2v.extend(trainingutils.compare_classifiers(clfs30K, X30K, y, n_jobs=num_cores, print_scores=False))

scores_glove_w2v = sorted(scores_glove_w2v, key=lambda item: -item[1])  # highest F1-score first
print(tabulate(scores_glove_w2v, floatfmt=".4f", headers=("Model", "F1-score")))
del clfs, clfs20K, clfs30K
num_unreachable_objects = gc.collect()

Reading ../../../models/glove.6B.100d.txt.gz
400000 keys. First 8 : ['biennials', 'verplank', 'soestdijk', 'woode', 'mdbo', 'sowell', 'mdbu', 'woods']

CPU times: user 18.9 s, sys: 464 ms, total: 19.3 s
Wall time: 19.5 s
CPU times: user 5min 36s, sys: 5min 45s, total: 11min 22s
Wall time: 20min 1s
CPU times: user 6min 8s, sys: 4min 48s, total: 10min 57s
Wall time: 30min 23s
Model                                          F1-score
-------------------------------------------  ----------
LinearSVC_07, GloVe small, 20K                   0.6935
MultinomialNB_01, GloVe small, 20K               0.6935
RidgeClassifier-auto-1e-3, GloVe small, 20K      0.6935
SGD_elasticnet_penalty, GloVe small, 20K         0.6935
LinearSVC_07, GloVe small, 30K                   0.6935
MultinomialNB_01, GloVe small, 30K               0.6935
RidgeClassifier-auto-1e-3, GloVe small, 30K      0.6935
SGD_elasticnet_penalty, GloVe small, 30K         0.6935


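Something is not right using the `MeanEmbeddingVectorizer` with `GloVe small`: all eight pipelines report exactly the same F1-score (0.6935) regardless of classifier or feature count, which suggests that the classifiers see uninformative features and predict a single class. One likely cause is that the pipelines are fitted on `X20K`/`X30K` (TF-IDF matrices), whereas `MeanEmbeddingVectorizer` maps tokens to word vectors and so presumably needs the tokenized text. Need to revisit this later.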
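
The vectorizer described in the linked post averages the word vectors of the tokens in each document, so it expects an iterable of token lists. A dependency-free sketch of the idea follows; `MeanEmbeddingSketch` and the toy vectors are hypothetical, and the real class in `trainingutils` may differ. Note how input without any known tokens, such as a row of numeric TF-IDF weights, collapses to the zero vector.

```python
class MeanEmbeddingSketch(object):
    """Average the vectors of the known words in each tokenized document."""

    def __init__(self, word2vec):
        self.word2vec = word2vec
        # Dimensionality is taken from any one vector in the lookup table
        self.dim = len(next(iter(word2vec.values())))

    def fit(self, X, y=None):
        return self  # nothing to learn, the vectors are pre-trained

    def transform(self, X):
        rows = []
        for tokens in X:
            known = [self.word2vec[t] for t in tokens if t in self.word2vec]
            if known:
                # Column-wise mean over the vectors of the known tokens
                rows.append([sum(col) / float(len(known)) for col in zip(*known)])
            else:
                rows.append([0.0] * self.dim)
        return rows


toy_vectors = {"new": [1.0, 0.0], "issue": [0.0, 1.0]}  # made-up 2-d embeddings
sketch = MeanEmbeddingSketch(toy_vectors)
from_tokens = sketch.transform([["new", "issue", "mindjack"]])
from_numbers = sketch.transform([[0.12, 0.0, 0.31]])  # e.g. TF-IDF weights, not tokens
print(from_tokens)
print(from_numbers)
```

If every document ends up with the same all-zero vector, any classifier can only predict a constant class, which would be consistent with the identical scores above.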

### 3.2 `Word2Vec` model on blog text
Training `Word2Vec` models on the blog text, using both architectures:
- continuous bag-of-words (CBOW)
- skip-gram (SG), with hierarchical softmax and with negative sampling.
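
The difference between the two architectures lies in how training examples are formed from a context window. The sketch below only illustrates that difference; the helper names are hypothetical, and it ignores details that real implementations such as gensim add (random window shrinking, subsampling of frequent words, and the hierarchical-softmax or negative-sampling output layer).

```python
def context_windows(tokens, window=2):
    """Yield (target, context words) for every position in a document."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


def cbow_examples(tokens, window=2):
    """CBOW: predict the target word from its (averaged) context."""
    return [(context, target) for target, context in context_windows(tokens, window) if context]


def skipgram_examples(tokens, window=2):
    """Skip-gram: predict each context word from the target word."""
    return [(target, c) for target, context in context_windows(tokens, window) for c in context]


tokens = ["new", "issue", "mindjack", "online"]
cbow = cbow_examples(tokens, window=1)
sg = skipgram_examples(tokens, window=1)
print(cbow)
print(sg)
```

CBOW produces one example per position, while skip-gram produces one per (target, context word) pair, which is part of why skip-gram training tends to be slower.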

In [21]:
documents = df.blog_post.values.tolist() # this is a list of strings
documents = [unicode(x).split() for x in documents] # this is a list of lists of tokens

#### 3.2.1 Word2Vec - CBOW

In [22]:
w2v_dim = 100

ts = time.time()
print("Constructing {}-dimension Word2Vec CBOW model based on text of {} blog posts".format(w2v_dim, len(documents)))
%time w2v = Word2Vec(documents, size=w2v_dim, window=5, min_count=10, sg=0, workers=num_cores)
print("Word2Vec CBOW model : {}".format(w2v)) # hs=0, negative=5, cbow_mean=1
vec = ("vec1", trainingutils.MeanEmbeddingVectorizer({w: vec for w, vec in zip(w2v.index2word, w2v.syn0)}))

clfs = [
    ("LinearSVC_07, CBOW", Pipeline([vec, ("lsvc_07", LinearSVC(C=0.7, random_state=_random_state))])),
    ("MultinomialNB_01, CBOW", Pipeline([vec, ("mnb_01", MultinomialNB(alpha=.10))])),
    ("RidgeClassifier-auto-1e-3, CBOW",
     Pipeline([vec, ("ridge", RidgeClassifier(tol=1e-3, solver="auto", random_state=_random_state))])),
    ("SGD_elasticnet_penalty, CBOW", 
     Pipeline([vec, ("sgd", SGDClassifier(alpha=.0001, n_iter=150, penalty="elasticnet", random_state=_random_state))])),
]

# Run for 20K features
clfs20K = [(x[0] + ", 20K", x[1]) for x in clfs]
%time scores_w2v_cbow = trainingutils.compare_classifiers(clfs20K, X20K, y, n_jobs=num_cores, print_scores=False)

# Run for 30K features
clfs30K = [(x[0] + ", 30K", x[1]) for x in clfs]
%time scores_w2v_cbow.extend(trainingutils.compare_classifiers(clfs30K, X30K, y, n_jobs=num_cores, print_scores=False))

scores_w2v_cbow = sorted(scores_w2v_cbow, key=lambda item: -item[1])  # highest F1-score first
print(tabulate(scores_w2v_cbow, floatfmt=".4f", headers=("Model", "F1-score")))
del clfs, vec, w2v
num_unreachable_objects = gc.collect()

Constructing 100-dimension Word2Vec CBOW model based on text of 145044 blog posts
CPU times: user 2min 36s, sys: 7.02 s, total: 2min 43s
Wall time: 45.1 s
Word2Vec CBOW model : Word2Vec(vocab=41338, size=100, alpha=0.025)
CPU times: user 42.3 s, sys: 28.7 s, total: 1min 10s
Wall time: 3min 59s
CPU times: user 43.3 s, sys: 29.9 s, total: 1min 13s
Wall time: 3min 59s
Model                                   F1-score
------------------------------------  ----------
LinearSVC_07, CBOW, 20K                   0.6935
MultinomialNB_01, CBOW, 20K               0.6935
RidgeClassifier-auto-1e-3, CBOW, 20K      0.6935
SGD_elasticnet_penalty, CBOW, 20K         0.6935
LinearSVC_07, CBOW, 30K                   0.6935
MultinomialNB_01, CBOW, 30K               0.6935
RidgeClassifier-auto-1e-3, CBOW, 30K      0.6935
SGD_elasticnet_penalty, CBOW, 30K         0.6935


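Something is not right using the `MeanEmbeddingVectorizer` here either: the scores are again identical (0.6935) and match the `GloVe` results, so the problem likely lies in how the vectorizer is used rather than in the embeddings themselves. Need to explore other ways.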
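
One quick check for a degenerate model is to compare the observed score with what a classifier that always predicts the positive class would get. With a positive-class share of p, precision is p and recall is 1, so F1 = 2p / (1 + p). The helper below is a hypothetical sketch, and whether `compare_classifiers` reports binary F1 for class 1 is an assumption.

```python
def constant_prediction_f1(positive_share):
    """F1 of always predicting the positive class, given its share of the samples."""
    if positive_share <= 0.0:
        return 0.0
    precision = positive_share  # every sample is predicted positive
    recall = 1.0                # so no positive sample is missed
    return 2.0 * precision * recall / (precision + recall)


# Illustrative shares only, not measured on the blog dataset
for share in (0.4, 0.5, 0.6):
    print("{:.1f} -> {:.4f}".format(share, constant_prediction_f1(share)))
```

If the score computed from the actual share of positive labels in `y` matches the table, the pipelines are not learning anything from the embedding features.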