Identify an author's gender from an analysis of their text, using a dataset of 681288 blog posts downloaded from 
<a href="http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm" target="_blank">here</a>.

I am using a **much smaller** dataset containing blog posts written by 3515 authors (1799 female, 1716 male) in the 24-25 age group.

This is the 2nd of 3 parts. In this notebook,
1. First, I address some of the TODOs from the previous part,
 - Basic preprocessing of the text, which helped improve the scores
 - Fixed whatever caused the 577 errors during reading
 - Use **`GloVe`** word vector representation files - `small` and `medium` (400K words vs. 1.9M words)
2. Next, I build **`Word2Vec`** models on the blog text - both continuous bag-of-words (CBOW) and skip-gram (SG) models (hierarchical softmax and negative sampling)
3. Finally, I use a `Word2Vec` model pretrained on the Google News (100B) corpus

In [1]:
from gensim import parsing, utils

import gc
import itertools
import json
import numpy as np
import os
import pandas as pd
import re
import sys
import tarfile
import time
import traceback
import xml.etree.ElementTree as ET

t0 = time.time()

def print_elapsed_time(ts=None):
    if ts:
        print "\nTime Taken :", "%.1f" % ((time.time() - ts)/60), "minutes\n"
    else:
        print "\nElapsed Time :", "%.1f" % ((time.time() - t0)/60), "minutes to reach this point (from the start)"

## 1. Reading the dataset
Each author's posts appear as a separate file. The name indicates blogger id#, self-provided gender, age, industry and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)

The work for reading the XML files from the `.tar.gz` file has been done by the `which_gender_dataprep.ipynb` notebook.
So this notebook just reads the pre-created dataset and filters out authors who are not in the 24-25 age group.

In [2]:
# dataset_filepath = "../../datasets/blog_dataset_df"
dataset_filepath = "blog_dataset_df"

an_iter = pd.read_csv(dataset_filepath, sep="\t", index_col=False,
                      usecols=["age", "gender", "text_all_posts"],
                      iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk["age"].isin([24,25])] for chunk in an_iter])
print df.shape
df.head()

(3515, 3)


Unnamed: 0,age,gender,text_all_posts
3,25,female,and did i mention that i no longer have to dea...
4,25,male,B-Logs: The Business Blogs Paradox urlLink ...
8,25,male,"Planning the Marathon I checked Active.com, ..."
10,25,female,MSN conversation: 11.17am Iggbalbollywall (...
17,24,female,You love me... I have you here by my side... O...


## 2. Bag of Words (BoW) Approach
Based on the text in the `text_all_posts` column. The text is tokenized to create an intermediate dataset.

In [3]:
# Drop the 'age' column, not required for the BoW approach
df.drop(["age"], axis=1, inplace=True)
# Replace gender with a numeric value
df["gender"] = df["gender"].replace({"female":2, "male":1})
# Before tokenizing, drop rows where the text is NA
df = df.dropna(subset=["text_all_posts"])
df.reset_index(drop=True, inplace=True)
print df.shape
# Tokenize the text, prepare dataset
def my_func(x):
    text = x["text_all_posts"]
    return text.lower().split() if text else []
df["tokenized_text"] = df.apply(my_func , axis=1)
X, y = np.array(df.tokenized_text.values.tolist()), np.array(df.gender.values.tolist())
df.head()

(3515, 2)


Unnamed: 0,gender,text_all_posts,tokenized_text
0,2,and did i mention that i no longer have to dea...,"[and, did, i, mention, that, i, no, longer, ha..."
1,1,B-Logs: The Business Blogs Paradox urlLink ...,"[b-logs:, the, business, blogs, paradox, urlli..."
2,1,"Planning the Marathon I checked Active.com, ...","[planning, the, marathon, i, checked, active.c..."
3,2,MSN conversation: 11.17am Iggbalbollywall (...,"[msn, conversation:, 11.17am, iggbalbollywall,..."
4,2,You love me... I have you here by my side... O...,"[you, love, me..., i, have, you, here, by, my,..."


Get an idea of the text of a blog post

In [4]:
print df.loc[4].tokenized_text[:500]

['you', 'love', 'me...', 'i', 'have', 'you', 'here', 'by', 'my', 'side...', 'our', 'hearts', 'overflow', 'with', 'happiness', 'and', 'love', 'for', 'each', 'other...', "that's", 'all', 'that', 'matters', 'to', 'me', 'now...', 'sorry', 'i', 'ever', 'doubted', 'you', 'and', 'your', 'love...', 'i', 'just', "don't", 'feel', 'too', 'special', 'anymore...', 'maybe', "i'm", 'not', 'the', 'one', 'you', 'should', 'love...', 'maybe', "you're", 'beside', 'the', 'wrong', 'person...', 'maybe', 'you', 'could', 'have', 'been', 'happier', 'with', 'her...', "i'm", 'not', 'that', 'special,', 'you', 'know...', '(as', 'if', 'on', 'cue,', 'i', 'hear', '"our"', 'songs', 'play', 'on', 'my', 'winamp)', 'what', 'the', 'fuck', 'am', 'i', 'saying?!?!?', 'you', 'love', 'me...', 'i', 'have', 'you', 'here', 'by', 'my', 'side...', 'our', 'hearts', 'overflow', 'with', 'happiness', 'and', 'love', 'for', 'each', 'other...', "that's", 'all', 'that', 'matters', 'to', 'me', 'now...', 'sorry', 'i', 'ever', 'doubted', 'you'

### Preprocessing the text
- Collapse **repeating characters** in words: any run of the same character is reduced to two. For example,
 - *AAAAAAAAAAARGH* becomes *AARGH*
 - *AAAAAAAAARRRRRRRGGGGGGGGGHHHHHHHHHHHH* becomes *AARRGGHH*
 - *daaaaarling* becomes *daarling*
 - *yeaaaaar* becomes *yeaar*
- Expand contractions ("*i'm*" becomes "*i am*")
- Replace punctuation (any run of non-word characters) with a space, then trim the result
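
The three steps can be tried in isolation with a small sketch. The contents of `english_contractions.json` are not shown in this notebook, so the `CONTRACTIONS` dictionary below is a hypothetical stand-in that assumes a flat contraction-to-expansion mapping; `normalize_token` is likewise an illustrative name.

```python
import re

# Hypothetical stand-in for english_contractions.json (its real contents are not shown)
CONTRACTIONS = {"i'm": "i am", "that's": "that is", "don't": "do not"}

def normalize_token(token, contractions=CONTRACTIONS):
    # 1. Collapse any run of a repeated character to exactly two
    token = re.sub(r"(.)\1+", r"\1\1", token)
    # 2. Expand the contraction if the token is a known key
    token = contractions.get(token, token)
    # 3. Replace runs of non-word characters with a space and trim
    return re.sub(r"\W+", " ", token).strip()

for raw in ["daaaaarling", "that's", "saying?!?!?", "b-logs:"]:
    print("%s --> %s" % (raw, normalize_token(raw)))
```

Note that a single input token can turn into several space-separated words (for example `that's` or `b-logs:`), which is why some entries in the preprocessed token lists contain spaces.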

In [5]:
def expand_contraction(token):
    # if not isinstance(token, unicode):
    #    token = unicode(token)
    return english_contractions[token] if token in english_contractions else token

def load_english_contractions(file_path):
    # Returns a dict mapping a contraction to its expansion; empty if the file cannot be read
    english_contractions = {}
    if file_path is not None:
        try:
            print("Reading English contractions from %s" % file_path)
            with open(file_path) as f:
                english_contractions = json.load(f)
        except Exception as e:
            print(str(e))
    return english_contractions

def preprocess_token(token):
    corrected_token = remove_repeated_chars(token)
    corrected_token = expand_contraction(corrected_token)
    return unicode(remove_punct(corrected_token))

def remove_punct(token):
    #return token.rstrip(".?:!,*")
    return re.sub(r"\W+", " ", token).strip()

def remove_repeated_chars(token):
    # Credit : http://stackoverflow.com/a/10072826
    return re.sub(r'(.)\1+', r'\1\1', token)

english_contractions = load_english_contractions("english_contractions.json")

Reading English contractions from english_contractions.json


Run a simple test - combining the preprocessing steps above

In [6]:
for w in ["AAAAAAAAAAARGH", "AAAAAAAAARRRRRRRGGGGGGGGGHHHHHHHHHHHH", "(as", "b-logs:", "daaaaarling", 
          "person...", "saying?!?!?", "that's", "yeaaaaar"]:
    print w, "-->", remove_repeated_chars(w), "-->", expand_contraction(remove_repeated_chars(w)), "-->",\
            remove_punct(expand_contraction(remove_repeated_chars(w)))

AAAAAAAAAAARGH --> AARGH --> AARGH --> AARGH
AAAAAAAAARRRRRRRGGGGGGGGGHHHHHHHHHHHH --> AARRGGHH --> AARRGGHH --> AARRGGHH
(as --> (as --> (as --> as
b-logs: --> b-logs: --> b-logs: --> b logs
daaaaarling --> daarling --> daarling --> daarling
person... --> person.. --> person.. --> person
saying?!?!? --> saying?!?!? --> saying?!?!? --> saying
that's --> that's --> that is --> that is
yeaaaaar --> yeaar --> yeaar --> yeaar


Next, run the preprocessing steps on the `tokenized_text` column

In [7]:
# Preprocess the tokenized text, prepare dataset
ts = time.time()

# Prepare a look up dictionary between token and its corrected version
features = set(list(itertools.chain(*df['tokenized_text'].values.tolist())))
# Maintain a mapping between original token and correction
corrected = dict(zip(features, [None]*len(features)))
print "Number of dimensions (distinct tokens) :", len(features), "(before preprocessing)"

# Preprocess each token (collapse repeated characters, expand contractions, strip punctuation)
# Spell checking is skipped because the current implementation is very slow
for token in corrected.keys():
    corrected[token] = preprocess_token(token)

def preprocess_tokens(x):
    text = x["tokenized_text"]
    output = []
    if text:
        for token in text:
            if token not in corrected:
                # Cache the correction for tokens missing from the lookup
                corrected[token] = preprocess_token(token)
            corrected_token = corrected[token]
            # Drop tokens that become empty after preprocessing
            if len(corrected_token) > 0:
                output.append(corrected_token)
    return output

df["tokenized_text"] = df.apply(preprocess_tokens, axis=1)
features = set(list(itertools.chain(*df['tokenized_text'].values.tolist())))
print "Number of dimensions (distinct tokens) :", len(features), "(after preprocessing)"
X, y = np.array(df.tokenized_text.values.tolist()), np.array(df.gender.values.tolist())
print len(corrected.keys())
del corrected
print_elapsed_time(ts)
df.head()

Number of dimensions (distinct tokens) : 869687 (before preprocessing)
Number of dimensions (distinct tokens) : 428162 (after preprocessing)
869687

Time Taken : 0.5 minutes



Unnamed: 0,gender,text_all_posts,tokenized_text
0,2,and did i mention that i no longer have to dea...,"[and, did, i, mention, that, i, no, longer, ha..."
1,1,B-Logs: The Business Blogs Paradox urlLink ...,"[b logs, the, business, blogs, paradox, urllin..."
2,1,"Planning the Marathon I checked Active.com, ...","[planning, the, marathon, i, checked, active c..."
3,2,MSN conversation: 11.17am Iggbalbollywall (...,"[msn, conversation, 11 17am, iggbalbollywall, ..."
4,2,You love me... I have you here by my side... O...,"[you, love, me, i, have, you, here, by, my, si..."


### Train basic models
- Experiment with the vectorizers - each will give different number of features.
- Next, run a randomized search to pick the best hyperparameters for some of the commonly used classifiers (RF, SVM, etc.)
- Next, train the classifiers

In [8]:
# Experiment with the vectorizers
from sklearn.feature_extraction.text import TfidfVectorizer

ts = time.time()
min_dfs = [0.01, 0.005]
tfidf_vec = [None] * len(min_dfs)
features = [None] * len(min_dfs)
ctr = 0

def get_topn_tfidf_vec_terms(vec, features, n=5):
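    # Get the n terms with the highest tf-idf score.
    # Note: argsort works row by row, so after flatten()[::-1] the first n indices
    # belong to the last row, i.e. these are the top terms of the last document.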
    # Credit : http://stackoverflow.com/a/34236002
    feature_array = np.array(vec.get_feature_names())
    tfidf_sorting = np.argsort(features.toarray()).flatten()[::-1]
    return feature_array[tfidf_sorting][:n]

for min_df in min_dfs:
    tfidf_vec[ctr] = TfidfVectorizer(analyzer=lambda x: x, min_df=min_df)
    features[ctr] = tfidf_vec[ctr].fit_transform(X)
    print features[ctr].shape[1], \
          "features for minimum document frequency %.1f%%\n" % (min_df * 100), \
          "top 8 terms", get_topn_tfidf_vec_terms(tfidf_vec[ctr], features[ctr], n=8), "\n"
    ctr += 1

print_elapsed_time(ts)

18471 features for minimum document frequency 1.0%
top 8 terms [u'i' u'and' u'my' u'to' u'the' u'a' u'of' u'it'] 

28284 features for minimum document frequency 0.5%
top 8 terms [u'i' u'and' u'my' u'to' u'the' u'a' u'of' u'it'] 


Time Taken : 0.6 minutes
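
The helper above sorts each row separately before flattening, so the eight terms it prints are taken from a single document rather than from the whole corpus. If a corpus-wide ranking is wanted, one option is to average each term's weight over all documents. The sketch below is an alternative, not the method used in this notebook; `top_terms_by_mean_weight` and the toy values are made up for illustration.

```python
import numpy as np

def top_terms_by_mean_weight(matrix, feature_names, n=5):
    # Average each column (term) over all rows (documents).
    # np.asarray(...).ravel() copes with dense arrays and with scipy sparse
    # matrices, whose mean() returns a 1 x n_terms matrix.
    mean_weights = np.asarray(matrix.mean(axis=0)).ravel()
    top_idx = np.argsort(mean_weights)[::-1][:n]
    return [(feature_names[i], float(mean_weights[i])) for i in top_idx]

# Toy 3-document x 4-term weight matrix (hypothetical values)
toy_weights = np.array([[0.9, 0.1, 0.0, 0.2],
                        [0.8, 0.0, 0.3, 0.1],
                        [0.7, 0.2, 0.0, 0.6]])
toy_terms = ["i", "marathon", "music", "my"]
print(top_terms_by_mean_weight(toy_weights, toy_terms, n=2))
```

Because the column mean is computed on the matrix directly, this avoids converting the full document-term matrix to a dense array.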



In [9]:
# Next, run a randomized search to pick the best hyperparameters 
from operator import itemgetter
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.grid_search import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC, SVC

def find_best_hyperparameters(clf, vectorizer, param_dist, num_iters=20):
    # Run the randomized search (samples num_iters parameter combinations)
    print "Finding best hyperparameters for", clf.__class__.__name__
    random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                       n_iter=num_iters, n_jobs=7)
    random_search.fit(vectorizer.fit_transform(X), y)
    # Iterate through the scores and print the best 3
    top_scores = sorted(random_search.grid_scores_, key=itemgetter(1), reverse=True)[:3]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("\tMean validation score: {0:.3f} (std: {1:.3f})".format(
              score.mean_validation_score,
              np.std(score.cv_validation_scores)))
        print("\tParameters: {0}".format(score.parameters))
    # print top_scores[0]
    return random_search.best_estimator_

# Using the TfidfVectorizer for minimum document frequency 0.5%
ts = time.time()
best_rf = find_best_hyperparameters(RandomForestClassifier(random_state = 120), tfidf_vec[1],
                                    { "bootstrap": [True, False],
                                      "criterion": ["gini", "entropy"],
                                      "max_depth": np.arange(5, 11).tolist() + [None],
                                      "n_estimators": np.arange(50, 550, 50).tolist()
                                    },
                                    num_iters=20)
print best_rf,"\n"

best_et = find_best_hyperparameters(ExtraTreesClassifier(random_state = 9000), tfidf_vec[1],
                                    { "max_depth": np.arange(5, 11).tolist() + [None],
                                      "n_estimators": np.arange(50, 550, 50).tolist(),
                                    },
                                    num_iters=20)
print best_et,"\n"

# Note: gamma has no effect with kernel="linear"; it is kept in the search space as in the original run
best_svm = find_best_hyperparameters(SVC(kernel="linear", random_state = 840), tfidf_vec[1],
                                    { "C" : np.arange(0.1, 1, 0.1).tolist(),
                                      "gamma": [0.1, 0.2, 0.3, 0.4, 0.5, "auto"],
                                      "tol": [0.0001, 0.001, 0.01]
                                    },
                                    num_iters=20)
print best_svm,"\n"

best_linearsvc = find_best_hyperparameters(LinearSVC(random_state = 11640), tfidf_vec[1],
                                           { "C" : np.arange(0.1, 1, 0.1).tolist(),
                                             "loss": ["hinge", "squared_hinge"],
                                             "tol": [0.0001, 0.001, 0.01]
                                           },
                                           num_iters=20)
print best_linearsvc,"\n"

best_svm_rbf = find_best_hyperparameters(SVC(kernel="rbf", random_state = 600), tfidf_vec[1],
                                         { "C" : np.arange(0.1, 1, 0.1).tolist(),
                                           "gamma": [0.1, 0.2, 0.3, 0.4, 0.5, "auto"],
                                           "tol": [0.0001, 0.001, 0.01]
                                         },
                                         num_iters=20)
print best_svm_rbf
print_elapsed_time(ts)

Finding best hyperparameters for RandomForestClassifier
Model with rank: 1
	Mean validation score: 0.742 (std: 0.015)
	Parameters: {'n_estimators': 300, 'bootstrap': False, 'criterion': 'entropy', 'max_depth': 9}
Model with rank: 2
	Mean validation score: 0.737 (std: 0.009)
	Parameters: {'n_estimators': 450, 'bootstrap': False, 'criterion': 'gini', 'max_depth': 9}
Model with rank: 3
	Mean validation score: 0.737 (std: 0.015)
	Parameters: {'n_estimators': 250, 'bootstrap': False, 'criterion': 'entropy', 'max_depth': 9}
RandomForestClassifier(bootstrap=False, class_weight=None,
            criterion='entropy', max_depth=9, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=1,
            oob_score=False, random_state=120, verbose=0, warm_start=False) 

Finding best hyperparameters for ExtraTreesClassifier
Model with rank: 1
	Mean validation score: 0.742 (std: 0.021)
	Parameters:
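
The cell above relies on `sklearn.grid_search` and the `grid_scores_` attribute, which newer scikit-learn releases no longer provide: the search classes live in `sklearn.model_selection` and the per-candidate results are exposed as `cv_results_`. A minimal sketch of the same "print the best 3" report with the newer API follows. It runs on synthetic data from `make_classification` as a stand-in for the TF-IDF matrix, and the small parameter grid is an assumption chosen only to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; the notebook would pass the TF-IDF matrix and y instead
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=120),
    param_distributions={"n_estimators": [10, 20, 30], "max_depth": [3, 5, None]},
    n_iter=4, cv=3, random_state=0)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of parallel arrays, one entry per sampled candidate
results = search.cv_results_
top3 = np.argsort(results["rank_test_score"])[:3]
for i in top3:
    print("rank %d: mean %.3f (std %.3f) %s" % (
        results["rank_test_score"][i], results["mean_test_score"][i],
        results["std_test_score"][i], results["params"][i]))
```

Passing `random_state` to the search itself makes the sampled parameter combinations repeatable between runs.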

In [10]:
# Next, train the models
from sklearn.cross_validation import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from tabulate import tabulate

def get_cv_scores(models_with_desc):
    cv_scores = []
    for model_id, model in models_with_desc:
        print "Training:", model_id
        ts = time.time()
        cv_score = cross_val_score(model, X, y, cv=5).mean() # gives a pickling error
        # cv_score = cross_val_score(model, X, y, cv=5).mean()
        cv_scores.append((model_id, cv_score))
        print "\tScore:", "%.4f" % cv_score
        print "\tTime taken:", "%.1f" % ((time.time() - ts)/60), "minutes\n"
    return cv_scores

ts = time.time()
models_with_desc = [
    ("3 Nearest Neighbors, TF-IDF, min_df 0.5%", Pipeline([("tfidf_vec", tfidf_vec[1]), ("3nn", KNeighborsClassifier(3))])),
    ("5 Nearest Neighbors, TF-IDF, min_df 0.5%", Pipeline([("tfidf_vec", tfidf_vec[1]), ("5nn", KNeighborsClassifier(5))])),
    ("SVM (Linear), TF-IDF, min_df 0.5%", Pipeline([("tfidf_vec", tfidf_vec[1]), ("best_svm_linear", best_svm)])),
    ("Extra Trees, TF-IDF, min_df 0.5%", Pipeline([("tfidf_vec", tfidf_vec[1]), ("best_et", best_et)])),
    ("Random Forest, TF-IDF, min_df 0.5%", Pipeline([("tfidf_vec", tfidf_vec[1]), ("best_rf", best_rf)])),
    ("LinearSVC, TF-IDF, min_df 0.5%", Pipeline([("tfidf_vec", tfidf_vec[1]), ("best_linearsvc", best_linearsvc)])),
    ("MultinomialNB, TF-IDF, min_df 1.0%", Pipeline([("tfidf_vec", tfidf_vec[0]), ("mnb", MultinomialNB())])),
    ("MultinomialNB, TF-IDF, min_df 0.5%", Pipeline([("tfidf_vec", tfidf_vec[1]), ("mnb", MultinomialNB())])),
    ("SVM (RBF), TF-IDF, min_df 0.5%", Pipeline([("tfidf_vec", tfidf_vec[1]), ("best_svm_rbf", best_svm_rbf)]))
]

scores = get_cv_scores(models_with_desc)
scores = sorted(scores, key=lambda (_, x): -x)
print tabulate(scores, floatfmt=".4f", headers=("model", 'score'))
print_elapsed_time(ts)

del models_with_desc
del tfidf_vec
num_unreachable_objects = gc.collect()

Training: 3 Nearest Neighbors, TF-IDF, min_df 0.5%
	Score: 0.6270
	Time taken: 1.5 minutes

Training: 5 Nearest Neighbors, TF-IDF, min_df 0.5%
	Score: 0.6461
	Time taken: 1.6 minutes

Training: SVM (Linear), TF-IDF, min_df 0.5%
	Score: 0.7562
	Time taken: 7.7 minutes

Training: Extra Trees, TF-IDF, min_df 0.5%
	Score: 0.7439
	Time taken: 7.6 minutes

Training: Random Forest, TF-IDF, min_df 0.5%
	Score: 0.7388
	Time taken: 3.6 minutes

Training: LinearSVC, TF-IDF, min_df 0.5%
	Score: 0.7801
	Time taken: 2.3 minutes

Training: MultinomialNB, TF-IDF, min_df 1.0%
	Score: 0.6529
	Time taken: 1.5 minutes

Training: MultinomialNB, TF-IDF, min_df 0.5%
	Score: 0.6452
	Time taken: 1.2 minutes

Training: SVM (RBF), TF-IDF, min_df 0.5%
	Score: 0.7408
	Time taken: 7.4 minutes

model                                       score
----------------------------------------  -------
LinearSVC, TF-IDF, min_df 0.5%             0.7801
SVM (Linear), TF-IDF, min_df 0.5%          0.7562
Extra Trees, TF-IDF, min_

## 3. Using Word Vectors


### 3.1 Using `GloVe` word vector representation files
First, I use the **`GloVe`** word vector representation files, downloaded from http://nlp.stanford.edu/data/ or https://github.com/stanfordnlp/GloVe. There are 3 files:
- `glove.6B.zip` (6 Billion tokens, hence '*small*', 400K words) - has 50/100/200/300-dimension vectors,
- `glove.42B.300d.zip` (42 Billion tokens, hence '*medium*', 1.9M words),
- `glove.840B.300d.zip` (840 Billion tokens, hence '*large*', 2.2M words)

Reference : http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
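
Each line of a GloVe text file holds a word followed by its vector components, separated by spaces. Since the *medium* file has 1.9M words, one way to reduce memory use is to keep only the words that actually occur in the blog posts. The sketch below illustrates that idea on an in-memory stand-in file with made-up 3-dimensional vectors; `read_glove_subset` is a hypothetical helper, not part of this notebook.

```python
import io
import numpy as np

def read_glove_subset(lines, vocabulary):
    # Each line is "<word> <v1> <v2> ... <vN>", separated by single spaces
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word = parts[0]
        # Skip words that never occur in the corpus
        if word in vocabulary:
            vectors[word] = np.array([float(v) for v in parts[1:]], dtype=np.float32)
    return vectors

# Tiny in-memory stand-in for a GloVe text file (made-up 3-dimensional vectors)
fake_file = io.StringIO(u"friday 0.1 0.2 0.3\nnight 0.4 0.5 0.6\nmusic 0.7 0.8 0.9\n")
subset = read_glove_subset(fake_file, vocabulary={"friday", "music"})
print(sorted(subset.keys()))
```

With a real file under Python 3, the `lines` argument could be a handle opened in text mode, for example `gzip.open(path, "rt", encoding="utf-8")`.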

In [11]:
# Read the GloVe word vector representation file,
# Using just the 300-dimension because glove_medium and glove_large only have 300-dimension vectors
import gzip

def read_GloVe_file(filepath):
    print "Reading", filepath
    glove_w2v = {}
    with gzip.open(filepath, "rb") as lines:
        for line in lines:
            parts = line.split()
            glove_w2v[parts[0]] = np.array(map(float, parts[1:]))
    print len(glove_w2v.keys()), "keys. First 8 :", glove_w2v.keys()[:8], "\n"
    return glove_w2v

Each word in each blog post is mapped to its vector representation; the vectors of a post are then averaged into a single feature vector for that post.
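
As a small worked example of this averaging, the sketch below uses made-up 3-dimensional vectors (the real ones have 300 dimensions). `mean_embedding` and `toy_vectors` are illustrative names only.

```python
import numpy as np

# Made-up 3-dimensional vectors for illustration
toy_vectors = {
    "friday": np.array([1.0, 0.0, 2.0]),
    "night":  np.array([3.0, 2.0, 0.0]),
}

def mean_embedding(tokens, word_vectors, dim):
    # Keep only tokens that have a vector; out-of-vocabulary tokens are skipped
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    # A post with no known tokens falls back to the zero vector
    return np.mean(known, axis=0) if known else np.zeros(dim)

print(mean_embedding(["friday", "night", "iggbalbollywall"], toy_vectors, dim=3))
```

The unknown token `iggbalbollywall` does not affect the result, so the output is simply the component-wise mean of the two known vectors.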

In [12]:
# Word vector equivalent of CountVectorizer & TfidfVectorizer (respectively)
# Each word in each blog post is mapped to its vector; 
# then this helper class computes the mean of those vectors
# Credit : https://github.com/nadbordrozd/blog_stuff/blob/master/classification_w2v/benchmarking.ipynb
from collections import defaultdict

class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.dim = len(word2vec.itervalues().next())
    
    def fit(self, X, y):
        return self 

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec] 
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

class TfidfEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.word2weight = None
        self.dim = len(word2vec.itervalues().next())
        
    def fit(self, X, y):
        tfidf = TfidfVectorizer(analyzer=lambda x: x)
        tfidf.fit(X)
        # if a word was never seen - it must be at least as infrequent
        # as any of the known words - so the default idf is the max of 
        # known idf's
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(
            lambda: max_idf, 
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])
    
        return self
    
    def transform(self, X):
        return np.array([
                np.mean([self.word2vec[w] * self.word2weight[w]
                         for w in words if w in self.word2vec] or
                        [np.zeros(self.dim)], axis=0)
                for words in X
            ])

### 3.1.1  **`GloVe` *small***

Train the models based on the `GloVe` word vector representation files.
The `TfidfEmbeddingVectorizer` is slower than the `MeanEmbeddingVectorizer`, so I use the latter first.
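
For comparison, the idf-weighted variant scales each word vector by the word's inverse document frequency before averaging. The sketch below computes the idf by hand, assuming the smoothed formula `ln((1 + n) / (1 + df)) + 1` that scikit-learn uses by default; the helper names and toy data are made up for illustration.

```python
import math
import numpy as np

def smoothed_idf(tokenized_docs):
    # idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1
    n_docs = len(tokenized_docs)
    df = {}
    for doc in tokenized_docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {t: math.log((1.0 + n_docs) / (1.0 + c)) + 1.0 for t, c in df.items()}

def idf_weighted_mean(tokens, word_vectors, idf, dim):
    # Words unseen during fitting get the largest known idf
    default_idf = max(idf.values())
    weighted = [word_vectors[t] * idf.get(t, default_idf)
                for t in tokens if t in word_vectors]
    return np.mean(weighted, axis=0) if weighted else np.zeros(dim)

toy_docs = [["i", "love", "music"], ["i", "ran", "friday"], ["i", "love", "friday"]]
toy_vectors = {"i": np.array([1.0, 1.0]), "music": np.array([0.0, 2.0])}
toy_idf = smoothed_idf(toy_docs)
print(idf_weighted_mean(["i", "music"], toy_vectors, toy_idf, dim=2))
```

As in the `TfidfEmbeddingVectorizer`, this is a plain mean of the scaled vectors and is not normalised by the sum of the weights, so rare words pull the result further than common ones.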

In [13]:
ts = time.time()
glove_300_w2v = read_GloVe_file("glove.6B.300d.txt.gz")
print_elapsed_time(ts)

# Test whether some of the words are present in the GloVe word vector representation file
for word in ["friday", "night", "music"]:
    if word in glove_300_w2v:
        print word, "\t", "first 10 (of 300)\t", glove_300_w2v[word][:10], ".."

# Train
ts = time.time()
vec = MeanEmbeddingVectorizer(glove_300_w2v)

models_with_desc = [
    ("Random Forest, MeanEmbeddingVectorizer, GloVe small 300-Dim", Pipeline([("vec", vec), ("best_rf", best_rf)])),
    ("SVM (Linear), MeanEmbeddingVectorizer, GloVe small 300-Dim", Pipeline([("vec", vec), ("best_svm_linear", best_svm)])),
    ("LinearSVC, MeanEmbeddingVectorizer, GloVe small 300-Dim", Pipeline([("vec", vec), ("best_linearsvc", best_linearsvc)]))
    #("SVM (RBF) - 'Best', TF-IDF GloVe small 300-Dim", Pipeline([("vec", vec), ("best_svm_rbf", best_svm_rbf)]))
]

# scores = [] # Because we want to compare with the previous approaches
scores.extend(get_cv_scores(models_with_desc))
scores = sorted(scores, key=lambda (_, x): -x)
print tabulate(scores, floatfmt=".4f", headers=("model", 'score'))

del models_with_desc
num_unreachable_objects = gc.collect()

print_elapsed_time(ts)

Reading glove.6B.300d.txt.gz
400000 keys. First 8 : ['biennials', 'verplank', 'soestdijk', 'woode', 'mdbo', 'sowell', 'mdbu', 'woods'] 


Time Taken : 0.8 minutes

friday 	first 10 (of 300)	[ 0.20283  -0.22845  -0.33061   0.25058   0.047943 -0.16453  -0.084213
 -0.16797   0.1062   -1.3609  ] ..
night 	first 10 (of 300)	[ 0.16882   0.041931 -0.068774  0.516    -0.40985   0.34697  -0.006856
  0.080555 -0.091977 -0.56115 ] ..
music 	first 10 (of 300)	[-0.38081   -0.24764   -0.24949    0.10468   -0.56411   -0.80654   -0.057066
 -0.095754   0.0068887 -0.7162   ] ..
Training: Random Forest, MeanEmbeddingVectorizer, GloVe small 300-Dim
	Score: 0.7223
	Time taken: 6.8 minutes

Training: SVM (Linear), MeanEmbeddingVectorizer, GloVe small 300-Dim
	Score: 0.7476
	Time taken: 5.4 minutes

Training: LinearSVC, MeanEmbeddingVectorizer, GloVe small 300-Dim
	Score: 0.7570
	Time taken: 5.5 minutes

model                                                          score
------------------------------------

### 3.1.2  **`GloVe` *medium***

In [14]:
ts = time.time()
glove_300_w2v = read_GloVe_file("glove.42B.300d.txt.gz")
print_elapsed_time(ts)

ts = time.time()
vec = MeanEmbeddingVectorizer(glove_300_w2v)

models_with_desc = [
    ("Random Forest, MeanEmbeddingVectorizer, GloVe medium 300-Dim", Pipeline([("vec", vec), ("best_rf", best_rf)])),
    ("SVM (Linear), MeanEmbeddingVectorizer, GloVe medium 300-Dim", Pipeline([("vec", vec), ("best_svm_linear", best_svm)])),
    ("LinearSVC, MeanEmbeddingVectorizer, GloVe medium 300-Dim", Pipeline([("vec", vec), ("best_linearsvc", best_linearsvc)]))
    #("SVM (RBF) - 'Best', TF-IDF GloVe medium 300-Dim", Pipeline([("vec", vec), ("best_svm_rbf", best_svm_rbf)]))
]

# scores = [] # Because we want to compare with the previous approaches
scores.extend(get_cv_scores(models_with_desc))
scores = sorted(scores, key=lambda (_, x): -x)
print tabulate(scores, floatfmt=".4f", headers=("model", 'score'))

del models_with_desc
num_unreachable_objects = gc.collect()

print_elapsed_time(ts)

Reading glove.42B.300d.txt.gz
1917495 keys. First 8 : ['ftdna', 'tripolitan', 'soestdijk', '6-night', 'un-loveyou', '20:09:49', '20:09:48', 'homespice'] 


Time Taken : 4.4 minutes

Training: Random Forest, MeanEmbeddingVectorizer, GloVe medium 300-Dim
	Score: 0.7394
	Time taken: 10.9 minutes

Training: SVM (Linear), MeanEmbeddingVectorizer, GloVe medium 300-Dim
	Score: 0.7604
	Time taken: 9.4 minutes

Training: LinearSVC, MeanEmbeddingVectorizer, GloVe medium 300-Dim
	Score: 0.7744
	Time taken: 10.2 minutes

model                                                           score
------------------------------------------------------------  -------
LinearSVC, TF-IDF, min_df 0.5%                                 0.7801
LinearSVC, MeanEmbeddingVectorizer, GloVe medium 300-Dim       0.7744
SVM (Linear), MeanEmbeddingVectorizer, GloVe medium 300-Dim    0.7604
LinearSVC, MeanEmbeddingVectorizer, GloVe small 300-Dim        0.7570
SVM (Linear), TF-IDF, min_df 0.5%                              0.

### 3.2 Word2Vec model on blog text
Computing these models takes a while.
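
The two architectures used below differ in what they predict: skip-gram predicts each context word from the center word, while CBOW predicts the center word from its surrounding context. The sketch below only illustrates how the training examples differ for a fixed window. It is a conceptual aid with made-up helper names, not how gensim enumerates examples, and it ignores further details of real implementations.

```python
def skipgram_pairs(tokens, window):
    # One (center, context) pair per neighbour within the window
    pairs = []
    for i, center in enumerate(tokens):
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(start, stop) if j != i)
    return pairs

def cbow_examples(tokens, window):
    # One (context words, center) example per position
    examples = []
    for i, center in enumerate(tokens):
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        examples.append(([tokens[j] for j in range(start, stop) if j != i], center))
    return examples

sentence = ["i", "love", "friday", "night"]
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

For the same text, skip-gram yields more training examples than CBOW (one per neighbour instead of one per position).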

In [15]:
from gensim.models import Word2Vec

documents = df.tokenized_text.values.tolist()

### 3.2.1 Word2Vec - CBOW

In [16]:
ts = time.time()
print "Constructing Word2Vec CBOW model based on text of", len(documents), "blogs"
w2v = Word2Vec(documents, size=300, window=8, min_count=5, sg=0, workers=7)# hs=0, negative=5, cbow_mean=1
print_elapsed_time(ts)

ts = time.time()
vec = MeanEmbeddingVectorizer({w: vec for w, vec in zip(w2v.index2word, w2v.syn0)})

models_with_desc = [    
    ("Random Forest, MeanEmbeddingVectorizer, Word2Vec CBOW, 300-Dim", Pipeline([("vec", vec), ("best_rf", best_rf)])),
    ("LinearSVC, MeanEmbeddingVectorizer, Word2Vec CBOW, 300-Dim", Pipeline([("vec", vec), ("best_linearsvc", best_linearsvc)])),
    ("SVM (Linear), MeanEmbeddingVectorizer, Word2Vec CBOW, 300-Dim", Pipeline([("vec", vec), ("best_svm_linear", best_svm)]))
]

# scores = [] # Because we want to compare with the previous approaches
scores.extend(get_cv_scores(models_with_desc))
scores = sorted(scores, key=lambda (_, x): -x)
print tabulate(scores, floatfmt=".4f", headers=("model", 'score'))
print_elapsed_time(ts)

del models_with_desc
num_unreachable_objects = gc.collect()

Constructing Word2Vec CBOW model based on text of 3515 blogs

Time Taken : 1.1 minutes

Training: Random Forest, MeanEmbeddingVectorizer, Word2Vec CBOW, 300-Dim
	Score: 0.7209
	Time taken: 11.9 minutes

Training: LinearSVC, MeanEmbeddingVectorizer, Word2Vec CBOW, 300-Dim
	Score: 0.7658
	Time taken: 6.1 minutes

Training: SVM (Linear), MeanEmbeddingVectorizer, Word2Vec CBOW, 300-Dim
	Score: 0.7616
	Time taken: 3.2 minutes

model                                                             score
--------------------------------------------------------------  -------
LinearSVC, TF-IDF, min_df 0.5%                                   0.7801
LinearSVC, MeanEmbeddingVectorizer, GloVe medium 300-Dim         0.7744
LinearSVC, MeanEmbeddingVectorizer, Word2Vec CBOW, 300-Dim       0.7658
SVM (Linear), MeanEmbeddingVectorizer, Word2Vec CBOW, 300-Dim    0.7616
SVM (Linear), MeanEmbeddingVectorizer, GloVe medium 300-Dim      0.7604
LinearSVC, MeanEmbeddingVectorizer, GloVe small 300-Dim          0.757

### 3.2.2 Word2Vec - Skip-gram (SG) using negative sampling

In [17]:
ts = time.time()
print "Constructing Word2Vec model (Skip-gram using negative sampling) based on text of", len(documents), "blogs"
w2v = Word2Vec(documents, size=300, window=8, min_count=5, sg=1, hs=0, workers=7) # using negative sampling
print_elapsed_time(ts)

ts = time.time()
vec = MeanEmbeddingVectorizer({w: vec for w, vec in zip(w2v.index2word, w2v.syn0)})

models_with_desc = [    
    ("Random Forest, MeanEmbeddingVectorizer, Word2Vec SG + negative sampling, 300-Dim",
     Pipeline([("vec", vec), ("best_rf", best_rf)])),
    ("LinearSVC, MeanEmbeddingVectorizer, Word2Vec SG + negative sampling, 300-Dim",
     Pipeline([("vec", vec), ("best_linearsvc", best_linearsvc)])),
    ("SVM (Linear), MeanEmbeddingVectorizer, Word2Vec SG + negative sampling, 300-Dim",
     Pipeline([("vec", vec), ("best_svm_linear", best_svm)]))
]

# scores = [] # Because we want to compare with the previous approaches
scores.extend(get_cv_scores(models_with_desc))
scores = sorted(scores, key=lambda (_, x): -x)
print tabulate(scores, floatfmt=".4f", headers=("model", 'score'))
print_elapsed_time(ts)

del models_with_desc
num_unreachable_objects = gc.collect()

Constructing Word2Vec model (Skip-gram using negative sampling) based on text of 3515 blogs

Time Taken : 5.0 minutes

Training: Random Forest, MeanEmbeddingVectorizer, Word2Vec SG + negative sampling, 300-Dim
	Score: 0.7331
	Time taken: 12.1 minutes

Training: LinearSVC, MeanEmbeddingVectorizer, Word2Vec SG + negative sampling, 300-Dim
	Score: 0.7567
	Time taken: 7.2 minutes

Training: SVM (Linear), MeanEmbeddingVectorizer, Word2Vec SG + negative sampling, 300-Dim
	Score: 0.7252
	Time taken: 7.4 minutes

model                                                                               score
--------------------------------------------------------------------------------  -------
LinearSVC, TF-IDF, min_df 0.5%                                                     0.7801
LinearSVC, MeanEmbeddingVectorizer, GloVe medium 300-Dim                           0.7744
LinearSVC, MeanEmbeddingVectorizer, Word2Vec CBOW, 300-Dim                         0.7658
SVM (Linear), MeanEmbeddingVectorizer, 

### 3.2.3 Word2Vec - Skip-gram (SG) using hierarchical softmax

In [18]:
ts = time.time()
print "Constructing Word2Vec model (Skip-gram using hierarchical softmax) based on text of", len(documents), "blogs"
w2v = Word2Vec(documents, size=300, window=8, min_count=5, sg=1, hs=1, workers=7) # using hierarchical softmax
print_elapsed_time(ts)

ts = time.time()
vec = MeanEmbeddingVectorizer({w: vec for w, vec in zip(w2v.index2word, w2v.syn0)})

models_with_desc = [    
    ("Random Forest, MeanEmbeddingVectorizer, Word2Vec SG + hierarchical softmax, 300-Dim",
     Pipeline([("vec", vec), ("best_rf", best_rf)])),
    ("LinearSVC, MeanEmbeddingVectorizer, Word2Vec SG + hierarchical softmax, 300-Dim",
     Pipeline([("vec", vec), ("best_linearsvc", best_linearsvc)])),
    ("SVM (Linear), MeanEmbeddingVectorizer, Word2Vec SG + hierarchical softmax, 300-Dim",
     Pipeline([("vec", vec), ("best_svm_linear", best_svm)]))
]

# scores = [] # Because we want to compare with the previous approaches
scores.extend(get_cv_scores(models_with_desc))
scores = sorted(scores, key=lambda (_, x): -x)
print tabulate(scores, floatfmt=".4f", headers=("model", 'score'))
print_elapsed_time(ts)

del models_with_desc
num_unreachable_objects = gc.collect()

Constructing Word2Vec model (Skip-gram using hierarchical softmax) based on text of 3515 blogs

Time Taken : 11.1 minutes

Training: Random Forest, MeanEmbeddingVectorizer, Word2Vec SG + hierarchical softmax, 300-Dim
	Score: 0.7354
	Time taken: 12.0 minutes

Training: LinearSVC, MeanEmbeddingVectorizer, Word2Vec SG + hierarchical softmax, 300-Dim
	Score: 0.7479
	Time taken: 7.0 minutes

Training: SVM (Linear), MeanEmbeddingVectorizer, Word2Vec SG + hierarchical softmax, 300-Dim
	Score: 0.7186
	Time taken: 8.0 minutes

model                                                                                  score
-----------------------------------------------------------------------------------  -------
LinearSVC, TF-IDF, min_df 0.5%                                                        0.7801
LinearSVC, MeanEmbeddingVectorizer, GloVe medium 300-Dim                              0.7744
LinearSVC, MeanEmbeddingVectorizer, Word2Vec CBOW, 300-Dim                            0.7658
SVM (Linear

In [19]:
len(documents)

3515

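The `Word2Vec` models above were trained on the text of only 3515 blogs (one per author), which is a small corpus for learning word vectors. This probably explains why they score below both TF-IDF and the pretrained `GloVe` vectors.
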
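To get a feel for how little material a small corpus leaves after frequency filtering, the sketch below counts total tokens, distinct words, and the words that survive a `min_count` threshold (gensim's `Word2Vec` discards words seen fewer than `min_count` times). The `toy_documents` list and the `vocabulary_stats` helper are hypothetical stand-ins, not part of this notebook. The real run would pass `documents` with `min_count=5`.

```python
from collections import Counter

# Hypothetical stand-in for `documents`: a list of token lists, one per blog.
toy_documents = [
    ["i", "love", "my", "dog"],
    ["i", "love", "my", "cat"],
    ["my", "dog", "chased", "my", "cat"],
]

def vocabulary_stats(docs, min_count):
    """Count total tokens, distinct words, and words seen at least min_count times."""
    counts = Counter(token for doc in docs for token in doc)
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    return {
        "total_tokens": sum(counts.values()),
        "distinct_words": len(counts),
        "kept_words": kept,
    }

# The notebook uses min_count=5; 2 is used here only because the toy corpus is tiny.
stats = vocabulary_stats(toy_documents, min_count=2)
print(stats)
```

Comparing `total_tokens` with the 100B-word Google News corpus gives a rough sense of the gap in training data.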
### 3.3 Use pretrained Word2Vec model on the Google News (100B) corpus
The word2vec-api GitHub repository (https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models) lists several pretrained models. I use the 300-dimensional Google News model trained with negative sampling (file size 1.5GB, vocabulary of 3M words).
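A pretrained model only helps for tokens it actually contains, and news text may not cover informal blog vocabulary well. The sketch below measures the fraction of tokens found in a given vocabulary. `vocabulary_coverage`, `toy_documents` and `toy_vocab` are hypothetical names for illustration. With the real model, the vocabulary would presumably be `set(w2v.index2word)`.

```python
def vocabulary_coverage(docs, known_words):
    """Return the fraction of tokens in docs that appear in known_words."""
    total = 0
    covered = 0
    for doc in docs:
        for token in doc:
            total += 1
            if token in known_words:
                covered += 1
    return float(covered) / total if total else 0.0

# Toy data: "lol" and "omg" stand in for blog slang missing from the vocabulary.
toy_documents = [["lol", "i", "love", "my", "dog"], ["omg", "my", "cat", "rocks"]]
toy_vocab = {"i", "love", "my", "dog", "cat", "rocks"}

coverage = vocabulary_coverage(toy_documents, toy_vocab)
print("Token coverage: %.2f" % coverage)
```

A low coverage value would mean many tokens contribute nothing to the averaged document vector.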

In [20]:
# Load the pretrained Word2Vec model into gensim
ts = time.time()
print "Loading the pretrained Word2Vec model, GoogleNews-vectors-negative300.bin.gz, into gensim"
w2v = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
print_elapsed_time(ts)

ts = time.time()
# Map each vocabulary word to its pretrained 300-dim vector
vec = MeanEmbeddingVectorizer(dict(zip(w2v.index2word, w2v.syn0)))

models_with_desc = [    
    ("Random Forest, MeanEmbeddingVectorizer, Word2Vec SG + negative sampling, 300-Dim (pretrained-100B)",
     Pipeline([("vec", vec), ("best_rf", best_rf)])),
    ("LinearSVC, MeanEmbeddingVectorizer, Word2Vec SG + negative sampling, 300-Dim (pretrained-100B)",
     Pipeline([("vec", vec), ("best_linearsvc", best_linearsvc)])),
    ("SVM (Linear), MeanEmbeddingVectorizer, Word2Vec SG + negative sampling, 300-Dim (pretrained-100B)",
     Pipeline([("vec", vec), ("best_svm_linear", best_svm)]))
]

# Do not reset `scores` here: keep the earlier results so the table compares all approaches
scores.extend(get_cv_scores(models_with_desc))
scores = sorted(scores, key=lambda (_, x): -x)
print tabulate(scores, floatfmt=".4f", headers=("model", 'score'))
print_elapsed_time(ts)

del models_with_desc
del vec
del w2v
num_unreachable_objects = gc.collect()

Loading the pretrained Word2Vec model, GoogleNews-vectors-negative300.bin.gz, into gensim

Time Taken : 7.3 minutes

Training: Random Forest, MeanEmbeddingVectorizer, Word2Vec SG + negative sampling, 300-Dim (pretrained-100B)
	Score: 0.7360
	Time taken: 15.6 minutes

Training: LinearSVC, MeanEmbeddingVectorizer, Word2Vec SG + negative sampling, 300-Dim (pretrained-100B)
	Score: 0.7565
	Time taken: 11.5 minutes

Training: SVM (Linear), MeanEmbeddingVectorizer, Word2Vec SG + negative sampling, 300-Dim (pretrained-100B)
	Score: 0.7234
	Time taken: 11.7 minutes

model                                                                                                 score
--------------------------------------------------------------------------------------------------  -------
LinearSVC, TF-IDF, min_df 0.5%                                                                       0.7801
LinearSVC, MeanEmbeddingVectorizer, GloVe medium 300-Dim                                             0.7744
Lin

## `GloVe` & `Word2Vec` Slightly Lower Performance Than BoW - Expected Higher
- `63.55%` - best in part 1 (no preprocessing, smaller dataset)
- **`78.01%`** - BoW (LinearSVC, TF-IDF), basic preprocessing on text
- `77.44%` - word vectors (GloVe *medium*), basic preprocessing on text
- `76.58%` - Word2Vec CBOW model trained on the blog text
- `75.65%` - pretrained Word2Vec model on the Google News (100B) corpus

## Current TODO
- Investigate **why word vectors give no significant improvement** over BoW
- Further preprocessing of the text
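
One place to start on the first TODO item is the averaging step itself. The sketch below assumes that `MeanEmbeddingVectorizer` (defined earlier in this notebook, not shown here) represents a document as the plain mean of the vectors of its known words. The `mean_embedding` function and `toy_vectors` dictionary are hypothetical illustrations, not the notebook's implementation. Under that assumption, word order is lost, so two documents with the same words get identical vectors.

```python
def mean_embedding(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

# Toy 2-dimensional vectors (the notebook uses 300 dimensions).
toy_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [1.0, 1.0],
}

doc_a = mean_embedding(["dog", "bites", "man"], toy_vectors, dim=2)
doc_b = mean_embedding(["man", "bites", "dog"], toy_vectors, dim=2)
same_vector = doc_a == doc_b
print(same_vector)
```

If this matches how the vectorizer behaves, frequency-weighted averaging might be worth trying before concluding that word vectors cannot beat BoW on this task.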