Identify the gender of a blog author by analysing the text of their posts. The full dataset contains 681288 blog posts and can be downloaded from
<a href="http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm" target="_blank">here</a>.

I am using a **much smaller** dataset containing blog posts written by 3515 authors (1799 female, 1716 male) in the 24-25 age group.

In [18]:
from bs4 import BeautifulSoup
from functools32 import lru_cache
from gensim import parsing, utils
from nltk.stem import WordNetLemmatizer

import numpy as np
import os
import pandas as pd
import re
import sys
import tarfile
import time
import traceback
import xml.etree.cElementTree as ET

dataset_filepath = "../../datasets/blogs_dataset_ages_24_25.tar.gz"
#dataset_filepath = "../../datasets/blogs_dataset_tiny.tar.gz"
dataset_filepath = "blogs_dataset_ages_24_25.tar.gz"
t0 = time.time()

def print_elapsed_time():
    print "\nElapsed Time :", "%.1f" % ((time.time() - t0)/60), "minutes to reach this point (from the start)"

## 1. Reading the dataset
Each author's posts appear as a separate file. The file name encodes the blogger ID, self-provided gender, age, industry and astrological sign. (All files are labeled for gender and age, but for many the industry and/or sign is marked as unknown.)
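
The naming scheme can be unpacked with a few string operations. The snippet below is a minimal sketch using only the standard library; the helper name `parse_blog_filename`, the `FIELDS` tuple and the `blogs/` directory prefix are made up for illustration, and it assumes no field contains a dot.

```python
import os

FIELDS = ("blogger_id", "gender", "age", "industry", "astro_sign")

def parse_blog_filename(path):
    """Split '<id>.<gender>.<age>.<industry>.<sign>.xml' into a dict."""
    # Drop any directory prefix and the .xml extension
    stem = os.path.splitext(os.path.basename(path))[0]
    parts = stem.split(".")
    if len(parts) != len(FIELDS):
        raise ValueError("unexpected file name: {0}".format(path))
    meta = dict(zip(FIELDS, parts))
    meta["age"] = int(meta["age"])  # age is the only numeric field
    return meta

meta = parse_blog_filename("blogs/1005076.female.25.Arts.Cancer.xml")
```

The notebook code further down does the same split inline with `zip(columns, info.split("."))`.
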
### Using Pandas Dataframe

In [19]:
# Read file contents

wnl = WordNetLemmatizer()
lemmatize = lru_cache(maxsize=150000)(wnl.lemmatize)

stoplist = set("urllink".split())
stoplist.update(parsing.preprocessing.STOPWORDS)

def read_blog_file(blogfile):
    contents = " ".join(blogfile.readlines())
    blog = BeautifulSoup(contents, "lxml")
    # datelist = blog.findAll("date")
    # print(datelist)
    postlist = blog.findAll("post")
    # print(len(postlist))
    # TODO complete the implementation

def read_first_blog_post(blogfile):
    # TODO replace this with reading all posts
    first_post_text = None
    for event, elem in ET.iterparse(blogfile):
        # print("%5s, %4s, %10s" % (event, elem.tag, elem.text))
        if elem.tag == "post" and first_post_text is None:
            first_post_text = elem.text.strip()
            # Collapse tabs, newlines and repeated spaces into a single space
            first_post_text = re.sub(r"\s+", " ", first_post_text)
            break
    
    return first_post_text, preprocess_text(first_post_text.lower())

def preprocess_text(text):
    # Replace every non-letter character (digits, punctuation, non-ASCII) with a space
    text = re.sub("[^a-zA-Z]", " ", text)
    words = [lemmatize(token) for token in utils.simple_preprocess(text) if
             token not in stoplist and len(token) > 1]
    return " ".join(words)

In [20]:
columns = ["blogger_id", "gender", "age", "industry", "astro_sign", "filename", "first_blog_post"]
df = pd.DataFrame(columns=columns)
tar=tarfile.open(dataset_filepath)
ctr = [0] * 2
for tarinfo in tar.getmembers():
    if os.path.splitext(tarinfo.name)[1] == ".xml":
        info = os.path.splitext(tarinfo.name)[0].split("/")[-1]
        tmp_df = pd.DataFrame(dict(zip(columns, info.split("."))), index=[0])
        tmp_df["filename"] = info + ".xml"
        blogfile = tar.extractfile(tarinfo)
        ctr[0] += 1
        #read_blog_file(blogfile)
        try : 
            text, preprocessed_text = read_first_blog_post(blogfile)
            blogfile.close()
        except Exception, e:
            text, preprocessed_text = None, None
            # traceback.print_exc(file=sys.stdout)
            ctr[1] += 1
            print info, "has problem,", str(e)
        tmp_df["first_blog_post"] = text
        tmp_df["first_blog_post_preprocessed"] = preprocessed_text
        df = pd.concat([df, tmp_df])

tar.close()
print ctr[0], "read"
print ctr[1], "has errors reading"

df = df.reset_index(drop=True)
print df.shape
print sys.getsizeof(df)
df.head()
df.shape
print_elapsed_time()

1009572.male.25.indUnk.Cancer has problem, undefined entity: line 7, column 149
1021779.female.25.indUnk.Scorpio has problem, not well-formed (invalid token): line 7, column 2468
1062652.female.25.indUnk.Aries has problem, undefined entity: line 8, column 71
1070326.male.24.Government.Aquarius has problem, undefined entity: line 7, column 212
1079521.female.24.indUnk.Capricorn has problem, not well-formed (invalid token): line 7, column 2432
1151815.male.25.Education.Leo has problem, not well-formed (invalid token): line 7, column 193
1163257.female.24.indUnk.Gemini has problem, undefined entity: line 7, column 100
1209241.male.25.Government.Virgo has problem, not well-formed (invalid token): line 9, column 767
1273229.male.25.Internet.Cancer has problem, undefined entity: line 9, column 1259
1281160.male.24.Technology.Sagittarius has problem, not well-formed (invalid token): line 7, column 1938
1336804.female.24.Law.Taurus has problem, undefined entity: line 7, column 175
1350461.male

In [21]:
df.head()

Unnamed: 0,age,astro_sign,blogger_id,filename,first_blog_post,first_blog_post_preprocessed,gender,industry
0,25,Cancer,1005076,1005076.female.25.Arts.Cancer.xml,and did i mention that i no longer have to dea...,mention longer deal friday night music complai...,female,Arts
1,25,Sagittarius,1005545,1005545.male.25.Engineering.Sagittarius.xml,B-Logs: The Business Blogs Paradox urlLink ...,log business blog paradox hindustantimes com d...,male,Engineering
2,25,Cancer,1009572,1009572.male.25.indUnk.Cancer.xml,,,male,indUnk
3,25,Libra,1011289,1011289.female.25.indUnk.Libra.xml,MSN conversation: 11.17am Iggbalbollywall (...,msn conversation iggbalbollywall say yo yipee ...,female,indUnk
4,24,Leo,1016787,1016787.female.24.Communications-Media.Leo.xml,You love me... I have you here by my side... O...,love heart overflow happiness love matter sorr...,female,Communications-Media


Some files could not be parsed, so remove the rows that have no post text.

In [22]:
print "Before pruning", pd.value_counts(df.gender)
df = df.dropna(subset=["first_blog_post"])
print "After pruning", pd.value_counts(df.gender)
df.head()

Before pruning female    1799
male      1716
Name: gender, dtype: int64
After pruning female    1474
male      1464
Name: gender, dtype: int64


Unnamed: 0,age,astro_sign,blogger_id,filename,first_blog_post,first_blog_post_preprocessed,gender,industry
0,25,Cancer,1005076,1005076.female.25.Arts.Cancer.xml,and did i mention that i no longer have to dea...,mention longer deal friday night music complai...,female,Arts
1,25,Sagittarius,1005545,1005545.male.25.Engineering.Sagittarius.xml,B-Logs: The Business Blogs Paradox urlLink ...,log business blog paradox hindustantimes com d...,male,Engineering
3,25,Libra,1011289,1011289.female.25.indUnk.Libra.xml,MSN conversation: 11.17am Iggbalbollywall (...,msn conversation iggbalbollywall say yo yipee ...,female,indUnk
4,24,Leo,1016787,1016787.female.24.Communications-Media.Leo.xml,You love me... I have you here by my side... O...,love heart overflow happiness love matter sorr...,female,Communications-Media
5,24,Aquarius,1019622,1019622.female.24.indUnk.Aquarius.xml,yay! it changed! :) i think i get it now. you ...,yay changed think post publish new change effe...,female,indUnk


In [23]:
# Make a backup of the Pandas Dataframe
import datetime
fmt_str = "%Y%m%d_%H%M%S"
filename = "blog_dataset_df_"+datetime.datetime.now().strftime(fmt_str)
df.to_csv(filename,sep="\t", index=False, encoding="utf8")

## 2. Bag of Words (BoW) Model
**Prepare the features** from the text in the `first_blog_post_preprocessed` column. The text is split into tokens to create an intermediate dataset.

In [24]:
# Drop columns not required for the BoW model
df.drop(["age", "blogger_id", "industry", "astro_sign", "filename", "first_blog_post"], axis=1, inplace=True)
# Replace gender with a numeric value
df["gender"] = df["gender"].replace({"female":2, "male":1})
# Tokenize the text, prepare dataset
def my_func(x):
    text = x["first_blog_post_preprocessed"]
    return text.split() if text else []
df["tokenized_text"] = df.apply(my_func , axis=1)
X, y = np.array(df.tokenized_text.values.tolist()), np.array(df.gender.values.tolist())
df.head()

Unnamed: 0,first_blog_post_preprocessed,gender,tokenized_text
0,mention longer deal friday night music complai...,2,"[mention, longer, deal, friday, night, music, ..."
1,log business blog paradox hindustantimes com d...,1,"[log, business, blog, paradox, hindustantimes,..."
3,msn conversation iggbalbollywall say yo yipee ...,2,"[msn, conversation, iggbalbollywall, say, yo, ..."
4,love heart overflow happiness love matter sorr...,2,"[love, heart, overflow, happiness, love, matte..."
5,yay changed think post publish new change effe...,2,"[yay, changed, think, post, publish, new, chan..."


In [25]:
print df.shape
print len(X)
print len(y)

(2938, 3)
2938
2938


### Train basic models
- Experiment with the vectorizers: each `min_df` setting gives a different number of features.
- Next, run a randomized search to pick the best hyperparameters for some commonly used classifiers (RF, SVM, etc.).
- Next, train the classifiers.
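
To make the effect of `min_df` concrete: when it is given as a fraction, a term is kept only if it occurs in at least that fraction of the documents, so lowering the threshold grows the vocabulary. The sketch below reproduces that filtering rule in plain Python on a toy corpus. The function `vocabulary_for_min_df` and the documents are invented for illustration, and this is not how scikit-learn implements it internally.

```python
from collections import Counter

def vocabulary_for_min_df(tokenized_docs, min_df):
    """Return the sorted terms whose document frequency is >= min_df (a fraction)."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    threshold = min_df * len(tokenized_docs)
    return sorted(term for term, df in doc_freq.items() if df >= threshold)

docs = [
    ["love", "music", "night"],
    ["love", "business", "blog"],
    ["love", "music", "blog"],
    ["love", "friday"],
]
strict = vocabulary_for_min_df(docs, 0.5)   # term must be in at least 2 of 4 documents
loose = vocabulary_for_min_df(docs, 0.25)   # term must be in at least 1 of 4 documents
```

Here `strict` keeps only the three terms shared by two or more documents, while `loose` keeps all six.
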

In [26]:
# Experiment with the vectorizers
from sklearn.feature_extraction.text import TfidfVectorizer

min_dfs = [0.01, 0.005, 0.001]
tfidf_vec = [None] * len(min_dfs)
features = [None] * len(min_dfs)
ctr = 0
def get_topn_tfidf_vec_terms(vec, features, n=5):
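    # Get the n terms with the highest tf-idf score.
    # NOTE: argsort works row by row, so after flatten()[::-1] the first n
    # indices come from the last row, i.e. the last document only.
    # Credit : http://stackoverflow.com/a/34236002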
    feature_array = np.array(vec.get_feature_names())
    tfidf_sorting = np.argsort(features.toarray()).flatten()[::-1]
    return feature_array[tfidf_sorting][:n]

for min_df in min_dfs:
    tfidf_vec[ctr] = TfidfVectorizer(analyzer=lambda x: x, min_df=min_df)
    features[ctr] = tfidf_vec[ctr].fit_transform(X)
    print features[ctr].shape[1], \
          "features for minimum document frequency %.1f%%\n" % (min_df * 100), \
          "top 8 terms", get_topn_tfidf_vec_terms(tfidf_vec[ctr], features[ctr], n=8), "\n"
    ctr += 1

1519 features for minimum document frequency 1.0%
top 8 terms [u'im' u'afford' u'feel' u'desire' u'accept' u'aside' u'want' u'son'] 

2787 features for minimum document frequency 0.5%
top 8 terms [u'im' u'debt' u'hurting' u'afford' u'feel' u'desire' u'accept' u'aside'] 

10202 features for minimum document frequency 0.1%
top 8 terms [u'im' u'piling' u'duality' u'afterall' u'debt' u'hurting' u'afford'
 u'feel'] 



In [27]:
# Next, run a randomized search to pick the best hyperparameters
from operator import itemgetter
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.grid_search import RandomizedSearchCV
from sklearn.svm import LinearSVC, SVC

def find_best_hyperparameters(clf, vectorizer, param_dist, num_iters=20):
    # Run the randomized search
    print "Finding best hyperparameters for", clf.__class__.__name__
    random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                       n_iter=num_iters, n_jobs=7)
    random_search.fit(vectorizer.fit_transform(X), y)
    # Iterate through the scores and print the best 3
    top_scores = sorted(random_search.grid_scores_, key=itemgetter(1), reverse=True)[:3]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("\tMean validation score: {0:.3f} (std: {1:.3f})".format(
              score.mean_validation_score,
              np.std(score.cv_validation_scores)))
        print("\tParameters: {0}".format(score.parameters))
    print top_scores[0]
    return random_search.best_estimator_


best_rf = find_best_hyperparameters(RandomForestClassifier(random_state = 120), tfidf_vec[1],
                                    { "bootstrap": [True, False],
                                      "criterion": ["gini", "entropy"],
                                      "max_depth": np.arange(3, 11).tolist() + [None],
                                      "n_estimators": np.arange(50, 550, 50).tolist(),
                                      "random_state": np.arange(120, 12000, 240).tolist()
                                    },
                                    num_iters=20)
print best_rf,"\n"

best_et = find_best_hyperparameters(ExtraTreesClassifier(random_state = 9000), tfidf_vec[1],
                                    { "max_depth": np.arange(3, 11).tolist() + [None],
                                      "n_estimators": np.arange(50, 550, 50).tolist(),
                                      "random_state": np.arange(120, 12000, 240).tolist()
                                    },
                                    num_iters=20)
print best_et,"\n"

best_svm = find_best_hyperparameters(SVC(kernel="linear", random_state = 840), tfidf_vec[1],
                                    { "C" : np.arange(0.1, 1, 0.1).tolist(),
                                      "gamma": [0.1, 0.2, 0.3, 0.4, 0.5, "auto"],
                                      "random_state": np.arange(120, 12000, 240).tolist(),
                                      "tol": [0.0001, 0.001, 0.01]
                                    },
                                    num_iters=20)
print best_svm,"\n"

best_lianersvc = find_best_hyperparameters(LinearSVC(random_state = 11640), tfidf_vec[1],
                                           { "C" : np.arange(0.1, 1, 0.1).tolist(),
                                             "loss": ["hinge", "squared_hinge"],
                                             "random_state": np.arange(120, 12000, 240).tolist(),
                                             "tol": [0.0001, 0.001, 0.01]
                                           },
                                           num_iters=20)
print best_lianersvc,"\n"

best_svm_rbf = find_best_hyperparameters(SVC(kernel="rbf", random_state = 600), tfidf_vec[1],
                                         { "C" : np.arange(0.1, 1, 0.1).tolist(),
                                           "gamma": [0.1, 0.2, 0.3, 0.4, 0.5, "auto"],
                                           "random_state": np.arange(120, 12000, 240).tolist(),
                                           "tol": [0.0001, 0.001, 0.01]
                                         },
                                         num_iters=20)
print best_svm_rbf
print_elapsed_time()

Finding best hyperparameters for RandomForestClassifier
Model with rank: 1
	Mean validation score: 0.624 (std: 0.005)
	Parameters: {'n_estimators': 350, 'random_state': 3000, 'criterion': 'entropy', 'max_depth': 3, 'bootstrap': True}
Model with rank: 2
	Mean validation score: 0.620 (std: 0.011)
	Parameters: {'n_estimators': 450, 'random_state': 5160, 'criterion': 'gini', 'max_depth': 9, 'bootstrap': False}
Model with rank: 3
	Mean validation score: 0.618 (std: 0.011)
	Parameters: {'n_estimators': 500, 'random_state': 5640, 'criterion': 'entropy', 'max_depth': 7, 'bootstrap': False}
mean: 0.62423, std: 0.00515, params: {'n_estimators': 350, 'random_state': 3000, 'criterion': 'entropy', 'max_depth': 3, 'bootstrap': True}
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=3, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=350, n_jobs=1,
  

In [28]:
# Next, train the models
from sklearn.cross_validation import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from tabulate import tabulate

models = []
models.append(Pipeline([("tfidf_vec", tfidf_vec[0]), ("svm_klinear", SVC(kernel="linear"))]))
models.append(Pipeline([("tfidf_vec", tfidf_vec[1]), ("svm_klinear", SVC(kernel="linear"))]))
models.append(Pipeline([("tfidf_vec", tfidf_vec[2]), ("svm_klinear", SVC(kernel="linear"))]))
models.append(Pipeline([("tfidf_vec", tfidf_vec[1]), ("best_et", best_et)]))
models.append(Pipeline([("tfidf_vec", tfidf_vec[1]), ("best_rf", best_rf)]))
models.append(Pipeline([("tfidf_vec", tfidf_vec[1]), ("best_svm_klinear", best_svm)]))
models.append(Pipeline([("tfidf_vec", tfidf_vec[1]), ("best_lianersvc", best_lianersvc)]))
models.append(Pipeline([("tfidf_vec", tfidf_vec[1]), ("best_svm_rbf", best_svm_rbf)]))
models.append(Pipeline([("tfidf_vec", tfidf_vec[1]), ("mnb", MultinomialNB())]))

models_with_desc = [
    ("SVM (Linear), TF-IDF, min_df 1.0%", models[0]),
    ("SVM (Linear), TF-IDF, min_df 0.5%", models[1]),
    ("SVM (Linear), TF-IDF, min_df 0.1%", models[2]),
    ("Extra Trees - 'Best', TF-IDF, min_df 0.5%", models[3]),
    ("Random Forest - 'Best', TF-IDF, min_df 0.5%", models[4]),
    ("SVM (Linear) - 'Best', TF-IDF, min_df 0.5%", models[5]),
    ("LinearSVC - 'Best', TF-IDF, min_df 0.5%", models[6]),
    ("SVM (RBF) - 'Best', TF-IDF, min_df 0.5%", models[7]),
    ("MultinomialNB, TF-IDF, min_df 0.5%", models[8])
]

def get_cv_scores(models_with_desc):
    cv_scores = []
    for model_id, model in models_with_desc:
        print "Training:", model_id
        ts = time.time()
        # cv_score = cross_val_score(model, X, y, cv=5, n_jobs=7).mean() # gives a pickling error
        cv_score = cross_val_score(model, X, y, cv=5).mean()
        cv_scores.append((model_id, cv_score))
        print "\tTime taken:", "%.1f" % ((time.time() - ts)/60), "minutes\n"
        print_elapsed_time()
    return cv_scores

scores = sorted([(model_id, cross_val_score(model, X, y, cv=5).mean()) 
                 for model_id, model in models_with_desc], 
                key=lambda (_, x): -x)
print tabulate(scores, floatfmt=".4f", headers=("model", 'score'))
print_elapsed_time()

model                                          score
-------------------------------------------  -------
SVM (RBF) - 'Best', TF-IDF, min_df 0.5%       0.6293
MultinomialNB, TF-IDF, min_df 0.5%            0.6286
LinearSVC - 'Best', TF-IDF, min_df 0.5%       0.6273
SVM (Linear) - 'Best', TF-IDF, min_df 0.5%    0.6263
Random Forest - 'Best', TF-IDF, min_df 0.5%   0.6201
Extra Trees - 'Best', TF-IDF, min_df 0.5%     0.6178
SVM (Linear), TF-IDF, min_df 0.1%             0.6157
SVM (Linear), TF-IDF, min_df 0.5%             0.6089
SVM (Linear), TF-IDF, min_df 1.0%             0.6038

Elapsed Time : 3.7 minutes to reach this point (from the start)


## Using Word Vectors : GloVe
Reference : http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/

First, I use the **GloVe** word vector files downloaded from http://nlp.stanford.edu/data/ or https://github.com/stanfordnlp/GloVe. There are 3 files:
- `glove.6B.zip` (6 billion tokens, hence '*small*', 400K words), which has 50/100/200/300-dimension vectors,
- `glove.42B.300d.zip` (42 billion tokens, hence '*medium*', 1.9M words),
- `glove.840B.300d.zip` (840 billion tokens, hence '*large*', 2.2M words).
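
Each archive unpacks to plain-text files with one word per line, followed by its space-separated vector components. As a minimal sketch of that layout, the snippet below parses two made-up 3-dimension lines (the real files used here have 50 to 300 values per line). `parse_glove_line` is a hypothetical helper, not part of any library.

```python
def parse_glove_line(line):
    """Split one 'word v1 v2 ... vn' line into (word, [floats])."""
    parts = line.rstrip().split(" ")
    return parts[0], [float(value) for value in parts[1:]]

# Made-up sample lines in the same layout (values are not real GloVe weights)
sample = [
    "friday 0.25 -0.5 1.0",
    "night 0.5 0.0 -0.75",
]
w2v = dict(parse_glove_line(line) for line in sample)
dim = len(w2v["friday"])
```
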

In [29]:
# Read the GloVe word vector representation file
glove_small_50_filepath = "glove.6B.50d.txt"
glove_small_100_filepath = "glove.6B.100d.txt"
glove_small_200_filepath = "glove.6B.200d.txt"
glove_small_300_filepath = "glove.6B.300d.txt"

def read_GloVe_file(filepath):
    print "Reading", filepath
    glove_w2v = {}
    with open(filepath, "rb") as lines:
        for line in lines:
            parts = line.split()
            glove_w2v[parts[0]] = np.array(map(float, parts[1:]))
    print len(glove_w2v.keys()), "keys. First 5 :", glove_w2v.keys()[:5], "\n"
    return glove_w2v

glove_small_50_w2v = read_GloVe_file(glove_small_50_filepath)
glove_small_100_w2v = read_GloVe_file(glove_small_100_filepath)
glove_small_300_w2v = read_GloVe_file(glove_small_300_filepath)
print_elapsed_time()

Reading glove.6B.50d.txt
400000 keys. First 5 : ['biennials', 'verplank', 'soestdijk', 'woode', 'mdbo'] 

Reading glove.6B.100d.txt
400000 keys. First 5 : ['biennials', 'verplank', 'soestdijk', 'woode', 'mdbo'] 

Reading glove.6B.300d.txt
400000 keys. First 5 : ['biennials', 'verplank', 'soestdijk', 'woode', 'mdbo'] 


Elapsed Time : 4.6 minutes to reach this point (from the start)


In [30]:
X[:2]

array([ [u'mention', u'longer', u'deal', u'friday', u'night', u'music', u'complaining', u'complaining', u'smoke', u'park', u'close', u'car', u'mother', u'hard', u'carry', u'grocery', u'stair', u'shrek', u'looking', u'balding', u'rhinocerous', u'hoof', u'fat', u'as', u'bitch', u'lived', u'upstairs', u'mewith', u'waking', u'getting', u'bed', u'shaking', u'window', u'stomped', u'mention', u'christian', u'lived', u'door', u'grr'],
       [u'log', u'business', u'blog', u'paradox', u'hindustantimes', u'com', u'discus', u'effect', u'technology', u'blog', u'particular', u'according', u'article', u'blog', u'direct', u'vehicle', u'communicating', u'idea', u'make', u'disruptive', u'business', u'application', u'allow', u'business', u'human', u'communicate', u'human', u'real', u'voice', u'hand', u'webpronews', u'com', u'discus', u'idea', u'corporate', u'newsletter', u'publishing', u'blog', u'idea', u'pragmatic', u'futuristic', u'way']], dtype=object)

In [31]:
# Test whether some of the words are present in the GloVe word vector representation file
for word in ["friday", "night", "music"]:
    if word in glove_small_50_w2v:
        print word, "\t", glove_small_50_w2v[word]

friday 	[  1.86620000e-01   6.71270000e-02   3.82290000e-04   7.60140000e-01
   2.82300000e-01  -8.88870000e-01  -9.16460000e-01   7.21830000e-01
  -4.99410000e-01  -7.09100000e-01  -5.40870000e-01  -1.39310000e+00
  -5.80250000e-01   1.16550000e-01   1.04180000e+00   2.23800000e-01
  -1.04690000e+00  -9.99340000e-01  -1.04580000e+00  -2.15490000e-01
   7.65340000e-01   7.90050000e-01   1.42300000e-01  -6.02670000e-01
  -1.29470000e-01  -1.87230000e+00   8.35300000e-01   6.20410000e-01
  -5.67460000e-01   3.63180000e-01   3.32670000e+00   3.19490000e-01
  -5.26580000e-01   2.83150000e-01  -8.31970000e-02  -8.07470000e-01
   6.16920000e-01  -1.20720000e-01   9.36550000e-02   6.46350000e-02
  -4.31670000e-01   5.14760000e-01   1.50980000e-01  -6.78110000e-01
   7.04250000e-01   2.83110000e-01  -4.95620000e-01   7.54890000e-01
   2.85220000e-01  -3.56220000e-01]
night 	[  3.08140000e-01   4.71290000e-01  -2.79290000e-01   3.77600000e-01
   2.58810000e-01  -6.42760000e-01  -1.07960000e+00 

Each word in each blog post needs to be mapped to its vector representation; the vectors of a post are then averaged and the result is used as that post's features.
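
As a small worked example of that mapping, the sketch below averages toy 2-dimension vectors in plain Python. Words without a vector are skipped, and a post with no known words gets an all-zero vector. The name `mean_vector` and the numbers are invented for illustration.

```python
def mean_vector(tokens, w2v, dim):
    """Average the vectors of known tokens; all-zero vector if none are known."""
    known = [w2v[t] for t in tokens if t in w2v]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th component of every vector together
    return [sum(column) / len(known) for column in zip(*known)]

# Toy 2-dimension vectors, invented for illustration
toy_w2v = {"friday": [1.0, 3.0], "night": [3.0, 5.0]}
features = mean_vector(["friday", "night", "grr"], toy_w2v, 2)  # "grr" is skipped
empty = mean_vector(["grr"], toy_w2v, 2)
```
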

In [32]:
# Word vector equivalent of CountVectorizer & TfidfVectorizer (respectively)
# Each word in each blog post is mapped to its vector; 
# then this helper class computes the mean of those vectors
# Credit : https://github.com/nadbordrozd/blog_stuff/blob/master/classification_w2v/benchmarking.ipynb
from collections import defaultdict

class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.dim = len(word2vec.itervalues().next())
    
    def fit(self, X, y):
        return self 

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec] 
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

class TfidfEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.word2weight = None
        self.dim = len(word2vec.itervalues().next())
        
    def fit(self, X, y):
        tfidf = TfidfVectorizer(analyzer=lambda x: x)
        tfidf.fit(X)
        # if a word was never seen - it must be at least as infrequent
        # as any of the known words - so the default idf is the max of 
        # known idf's
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(
            lambda: max_idf, 
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])
    
        return self
    
    def transform(self, X):
        return np.array([
                np.mean([self.word2vec[w] * self.word2weight[w]
                         for w in words if w in self.word2vec] or
                        [np.zeros(self.dim)], axis=0)
                for words in X
            ])
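
To illustrate the weighting used by `TfidfEmbeddingVectorizer`, the sketch below computes smoothed inverse document frequencies by hand and applies the same fallback rule: a word never seen during fitting gets the largest known idf. It assumes the smoothed formula `ln((1 + n) / (1 + df)) + 1`, which is scikit-learn's default for `TfidfVectorizer`; the function name `smoothed_idf` and the toy documents are made up.

```python
import math
from collections import Counter

def smoothed_idf(tokenized_docs):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1 for every term t."""
    n = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {t: math.log((1.0 + n) / (1.0 + df)) + 1.0
            for t, df in doc_freq.items()}

docs = [["love", "music"], ["love", "blog"], ["love"]]
idf = smoothed_idf(docs)
default_weight = max(idf.values())               # fallback for unseen words
weight_unseen = idf.get("friday", default_weight)
```

A word that appears in every document, such as `love` here, gets the minimum weight of 1.0, while rarer words are weighted up.
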

### Next, retrain the models based on the word vectors
This takes a while.

In [33]:
tfidf_vec_glove_small_50 = TfidfEmbeddingVectorizer(glove_small_50_w2v)
tfidf_vec_glove_small_100 = TfidfEmbeddingVectorizer(glove_small_100_w2v)
tfidf_vec_glove_small_300 = TfidfEmbeddingVectorizer(glove_small_300_w2v)

models = []
models.append(Pipeline([("tfidf_vec_glove_small_50", tfidf_vec_glove_small_50), ("best_rf", best_rf)]))
models.append(Pipeline([("tfidf_vec_glove_small_50", tfidf_vec_glove_small_50), ("best_svm_klinear", best_svm)]))
models.append(Pipeline([("tfidf_vec_glove_small_50", tfidf_vec_glove_small_50), ("best_lianersvc", best_lianersvc)]))
models.append(Pipeline([("tfidf_vec_glove_small_50", tfidf_vec_glove_small_50), ("best_svm_rbf", best_svm_rbf)]))
models.append(Pipeline([("tfidf_vec_glove_small_100", tfidf_vec_glove_small_100), ("best_rf", best_rf)]))
models.append(Pipeline([("tfidf_vec_glove_small_100", tfidf_vec_glove_small_100), ("best_svm_klinear", best_svm)]))
models.append(Pipeline([("tfidf_vec_glove_small_100", tfidf_vec_glove_small_100), ("best_lianersvc", best_lianersvc)]))
models.append(Pipeline([("tfidf_vec_glove_small_100", tfidf_vec_glove_small_100), ("best_svm_rbf", best_svm_rbf)]))
models.append(Pipeline([("tfidf_vec_glove_small_300", tfidf_vec_glove_small_300), ("best_rf", best_rf)]))
models.append(Pipeline([("tfidf_vec_glove_small_300", tfidf_vec_glove_small_300), ("best_svm_klinear", best_svm)]))
models.append(Pipeline([("tfidf_vec_glove_small_300", tfidf_vec_glove_small_300), ("best_lianersvc", best_lianersvc)]))
models.append(Pipeline([("tfidf_vec_glove_small_300", tfidf_vec_glove_small_300), ("best_svm_rbf", best_svm_rbf)]))
#models.append(Pipeline([("tfidf_vec_glove_small_50", tfidf_vec_glove_small_50), ("mnb", MultinomialNB())]))
# NOTE : MultinomialNB will not work here because it requires non-negative feature values, and GloVe vectors contain negative components

models_with_desc = [
    ("Random Forest - 'Best', TF-IDF GloVe small 50-Dim", models[0]),
    ("SVM (Linear) - 'Best',  TF-IDF GloVe small 50-Dim", models[1]),
    ("LinearSVC - 'Best',     TF-IDF GloVe small 50-Dim", models[2]),
    ("SVM (RBF) - 'Best',     TF-IDF GloVe small 50-Dim", models[3]),
    ("Random Forest - 'Best', TF-IDF GloVe small 100-Dim", models[4]),
    ("SVM (Linear) - 'Best',  TF-IDF GloVe small 100-Dim", models[5]),
    ("LinearSVC - 'Best',     TF-IDF GloVe small 100-Dim", models[6]),
    ("SVM (RBF) - 'Best',     TF-IDF GloVe small 100-Dim", models[7]),
    ("Random Forest - 'Best', TF-IDF GloVe small 300-Dim", models[8]),
    ("SVM (Linear) - 'Best',  TF-IDF GloVe small 300-Dim", models[9]),
    ("LinearSVC - 'Best',     TF-IDF GloVe small 300-Dim", models[10]),
    ("SVM (RBF) - 'Best',     TF-IDF GloVe small 300-Dim", models[11])
]

# scores = [] # Because we want to compare with the previous BoW model
scores.extend(get_cv_scores(models_with_desc))

scores = sorted(scores, key=lambda (_, x): -x)
print tabulate(scores, floatfmt=".4f", headers=("model", 'score'))
print_elapsed_time()

Training: Random Forest - 'Best', TF-IDF GloVe small 50-Dim
	Time taken: 2.0 minutes


Elapsed Time : 6.6 minutes to reach this point (from the start)
Training: SVM (Linear) - 'Best',  TF-IDF GloVe small 50-Dim
	Time taken: 1.7 minutes


Elapsed Time : 8.4 minutes to reach this point (from the start)
Training: LinearSVC - 'Best',     TF-IDF GloVe small 50-Dim
	Time taken: 1.7 minutes


Elapsed Time : 10.1 minutes to reach this point (from the start)
Training: SVM (RBF) - 'Best',     TF-IDF GloVe small 50-Dim
	Time taken: 1.8 minutes


Elapsed Time : 11.9 minutes to reach this point (from the start)
Training: Random Forest - 'Best', TF-IDF GloVe small 100-Dim
	Time taken: 2.0 minutes


Elapsed Time : 13.9 minutes to reach this point (from the start)
Training: SVM (Linear) - 'Best',  TF-IDF GloVe small 100-Dim
	Time taken: 1.8 minutes


Elapsed Time : 15.6 minutes to reach this point (from the start)
Training: LinearSVC - 'Best',     TF-IDF GloVe small 100-Dim
	Time taken: 1.7 minutes




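Some **improvement**: the best cross-validation score is now `63.55%` (linear SVM on TF-IDF-weighted 100-dimension GloVe vectors), compared with `62.93%` for the best BoW model, a gain of about 0.6 percentage points. Word vectors appear to help a little, so it makes sense to explore further.
- First, try the 200-Dim word vectors from the GloVe small dataset
- Next, explore whether basic preprocessing of the text helps
- Next, explore the GloVe medium dataset (1.9M words vs. 400K in the small one)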

In [34]:
# Try 200-Dim word vectors from the GloVe small dataset
glove_small_200_filepath = "glove.6B.200d.txt"
glove_small_200_w2v = read_GloVe_file(glove_small_200_filepath)
tfidf_vec_glove_small_200 = TfidfEmbeddingVectorizer(glove_small_200_w2v)

Reading glove.6B.200d.txt
400000 keys. First 5 : ['biennials', 'verplank', 'soestdijk', 'woode', 'mdbo'] 



In [35]:
models = []
models.append(Pipeline([("tfidf_vec_glove_small_200", tfidf_vec_glove_small_200), ("best_rf", best_rf)]))
models.append(Pipeline([("tfidf_vec_glove_small_200", tfidf_vec_glove_small_200), ("best_svm_klinear", best_svm)]))
models.append(Pipeline([("tfidf_vec_glove_small_200", tfidf_vec_glove_small_200), ("best_lianersvc", best_lianersvc)]))
models.append(Pipeline([("tfidf_vec_glove_small_200", tfidf_vec_glove_small_200), ("best_svm_rbf", best_svm_rbf)]))
models_with_desc = [
    ("Random Forest - 'Best', TF-IDF GloVe small 200-Dim", models[0]),
    ("SVM (Linear) - 'Best',  TF-IDF GloVe small 200-Dim", models[1]),
    ("LinearSVC - 'Best',     TF-IDF GloVe small 200-Dim", models[2]),
    ("SVM (RBF) - 'Best',     TF-IDF GloVe small 200-Dim", models[3]),
]
# scores = [] # Because we want to compare with the previous BoW model
scores.extend(get_cv_scores(models_with_desc))
scores = sorted(scores, key=lambda (_, x): -x)
print tabulate(scores, floatfmt=".4f", headers=("model", 'score'))
print_elapsed_time()

Training: Random Forest - 'Best', TF-IDF GloVe small 200-Dim
	Time taken: 2.2 minutes


Elapsed Time : 29.9 minutes to reach this point (from the start)
Training: SVM (Linear) - 'Best',  TF-IDF GloVe small 200-Dim
	Time taken: 2.8 minutes


Elapsed Time : 32.8 minutes to reach this point (from the start)
Training: LinearSVC - 'Best',     TF-IDF GloVe small 200-Dim
	Time taken: 5.4 minutes


Elapsed Time : 38.2 minutes to reach this point (from the start)
Training: SVM (RBF) - 'Best',     TF-IDF GloVe small 200-Dim
	Time taken: 4.7 minutes


Elapsed Time : 42.8 minutes to reach this point (from the start)
model                                                 score
--------------------------------------------------  -------
SVM (Linear) - 'Best',  TF-IDF GloVe small 100-Dim   0.6355
LinearSVC - 'Best',     TF-IDF GloVe small 100-Dim   0.6351
Random Forest - 'Best', TF-IDF GloVe small 50-Dim    0.6310
SVM (RBF) - 'Best', TF-IDF, min_df 0.5%              0.6293
MultinomialNB, TF-IDF, min_d

No improvement from the 200-Dim word vectors of the GloVe small dataset: the 100-Dim models still lead the table.

## Current TODO
- Explore whether basic preprocessing of the text helps
- Investigate what caused the 577 read errors (the runs above used the remaining 2938 blog posts)
- Explore the GloVe medium dataset (1.9M words vs. 400K in the small one)
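
One possible starting point for the read errors: the messages logged earlier ("undefined entity", "not well-formed (invalid token)") are what a strict XML parser typically reports for HTML-style entities such as `&nbsp;` or for bare `&` characters. The sketch below escapes such ampersands before parsing. It is an untested assumption that this is the cause in this dataset; `parse_lenient`, `BARE_AMP` and the sample string are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# An ampersand that does not start a predefined XML entity or a numeric reference
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9a-fA-F]+);)")

def parse_lenient(xml_text):
    """Escape stray ampersands (e.g. '&nbsp;', 'R&B') and then parse."""
    return ET.fromstring(BARE_AMP.sub("&amp;", xml_text))

raw = "<Blog><post>rock &nbsp; roll, R&B &amp; soul</post></Blog>"
root = parse_lenient(raw)
post_text = root.find("post").text
```

With this approach an unknown entity survives as literal text (`&nbsp;` stays in the post), which the later non-letter filtering would strip anyway. Other causes of "invalid token", such as control characters, would need separate handling.
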