## Deep Inverse Regression with Yelp reviews
In this note we'll use [gensim](http://radimrehurek.com/gensim/) to turn the Word2Vec machinery into a document classifier, as in [Document Classification by Inversion of Distributed Language Representations](http://arxiv.org/pdf/1504.07295v3) from ACL 2015.
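The inversion step itself can be sketched without any Word2Vec code. Suppose each class-specific model has already assigned a log-likelihood to every sentence of a document (gensim's `Word2Vec.score` is one way to obtain such values for hierarchical-softmax models). Bayes' rule turns those likelihoods into per-sentence class probabilities, and averaging over sentences gives a document-level estimate. The function name `class_probabilities`, the uniform default prior, and the toy numbers are illustrative assumptions, not part of the original note.

```python
import math

def class_probabilities(sentence_loglik, log_prior=None):
    """Average per-sentence class posteriors over a document.

    sentence_loglik: one row per sentence, one log-likelihood per class.
    log_prior: optional log prior per class (uniform if omitted; an assumption).
    """
    n_classes = len(sentence_loglik[0])
    if log_prior is None:
        log_prior = [0.0] * n_classes
    doc_probs = [0.0] * n_classes
    for row in sentence_loglik:
        scores = [ll + lp for ll, lp in zip(row, log_prior)]
        top = max(scores)  # subtract the max for numerical stability
        expd = [math.exp(s - top) for s in scores]
        total = sum(expd)
        for k in range(n_classes):
            doc_probs[k] += expd[k] / total
    n = len(sentence_loglik)
    return [p / n for p in doc_probs]

# toy log-likelihoods: two sentences scored under three hypothetical class models
toy = [[-10.0, -12.0, -11.0],
       [-8.0, -8.0, -9.0]]
probs = class_probabilities(toy)
print(probs, "-> predicted class index:", probs.index(max(probs)))
```

The predicted class is simply the index with the largest averaged probability.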

### Data and prep

First, download the data from the [Yelp recruiting contest](https://www.kaggle.com/c/yelp-recruiting) on [Kaggle](https://www.kaggle.com/) into the same directory as this note:
* https://www.kaggle.com/c/yelp-recruiting/download/yelp_training_set.zip
* https://www.kaggle.com/c/yelp-recruiting/download/yelp_test_set.zip

You'll need to sign up for Kaggle. You can then unpack the data and grab the information we need. We'll use an incredibly simple parser.
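A parser of this kind needs only a few regular expressions. The sketch below lowercases the text, drops apostrophes and hyphens, replaces the remaining punctuation with spaces, and splits on sentence-ending marks. The function names `sentences` and `clean` and the exact patterns are assumptions for illustration; the helpers actually used in this note may differ.

```python
import re

# strip apostrophes, hyphens and double quotes so "didn't" becomes "didnt"
contractions = re.compile(r"'|-|\"")
# runs of non-word characters (punctuation and whitespace)
non_word = re.compile(r"\W+")
# sentence-ending marks
sentence_end = re.compile(r"[.!?]+")

def sentences(text):
    """Split raw text into sentence strings, dropping empty pieces."""
    return [s for s in sentence_end.split(text) if s.strip()]

def clean(sentence):
    """Lowercase one sentence and return its list of word tokens."""
    s = contractions.sub("", sentence.lower())
    return non_word.sub(" ", s).split()

tokens = [clean(s) for s in sentences("Great food! Didn't love the wait.")]
print(tokens)
```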



In [1]:
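import pickle
import re
import logging
import importlib  # the old `imp` module is deprecated and removed in Python 3.12
from common import *

# reload logging so that basicConfig applies even if the kernel already configured it
importlib.reload(logging)
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s : %(message)s",
                    datefmt='%m-%d %H:%M')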

We then put everything together in a review generator, `YelpReviews`, that yields the tokenized sentences and the number of stars for every review.

For example:

In [2]:
next(YelpReviews("test"))

NameError: name 'YelpReviews' is not defined
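A review generator of the kind described above could look roughly like the following. This is a sketch under stated assumptions: it reads from any iterable of JSON strings rather than from the Kaggle archive, it assumes each record carries `stars` and `text` fields, and the names `tokenize` and `reviews_from_lines` as well as the `'x'`/`'y'` keys are hypothetical.

```python
import json
import re

def tokenize(text):
    """Very rough tokenizer: split into sentences, then into lowercase words."""
    pieces = re.split(r"[.!?]+", text.lower())
    sents = [re.findall(r"\w+", p) for p in pieces]
    return [s for s in sents if s]  # drop empty sentences

def reviews_from_lines(lines):
    """Yield one dict per review: 'y' is the star rating, 'x' the tokenized sentences.

    `lines` is any iterable of JSON strings, for example an open review file.
    The field names `stars` and `text` are assumed, not taken from the note.
    """
    for line in lines:
        rev = json.loads(line)
        yield {"y": rev["stars"], "x": tokenize(rev["text"])}

sample = ['{"stars": 5, "text": "Loved it. Will be back!"}',
          '{"stars": 1, "text": "Cold fries."}']
first = next(reviews_from_lines(sample))
print(first)
```

Because the function is a generator, nothing is parsed until a review is requested, which keeps memory use low for large files.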

Now, since the files are small we'll just read everything into in-memory lists.  It takes a minute ...

In [3]:
revtrain = list(YelpReviews("training"))
print(len(revtrain), "training reviews")

## and shuffle just in case they are ordered
import numpy as np
np.random.shuffle(revtrain)

229907 training reviews


Finally, we need a function, `StarSentences`, that generates sentences (ordered lists of words) from reviews with the given star ratings.
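One way such a filter might be written is shown below. It assumes each review is a dict with the star rating under `'y'` and the tokenized sentences under `'x'`; that layout and the lowercase name `star_sentences` are assumptions made for this sketch, standing in for the `StarSentences` helper used in the cells that follow.

```python
def star_sentences(reviews, stars=(1, 2, 3, 4, 5)):
    """Yield every tokenized sentence from reviews whose rating is in `stars`."""
    for r in reviews:
        if r["y"] in stars:
            for s in r["x"]:
                yield s

toy_reviews = [
    {"y": 5, "x": [["loved", "it"], ["will", "be", "back"]]},
    {"y": 1, "x": [["cold", "fries"]]},
]
one_star = list(star_sentences(toy_reviews, [1]))
all_stars = list(star_sentences(toy_reviews))
print(one_star)
print(len(all_stars))
```

A generator can be consumed only once, so code that needs several passes over the sentences should materialise it with `list(...)` first.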

### Word2Vec specialisation

In [4]:
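def train(basemodel, pkl_name):
    """Train one copy of basemodel per star rating and pickle the five models."""
    from copy import deepcopy
    # independent copies, so each star rating ends up with its own weights
    starmodels = [deepcopy(basemodel) for i in range(5)]
    for i in range(5):
        # all sentences from reviews with exactly i+1 stars
        slist = list(StarSentences(revtrain, [i+1]))
        print(i+1, "stars (", len(slist), ")")
        # note: newer gensim releases also require an explicit epochs= argument here
        starmodels[i].train(slist, total_examples=len(slist))
    with open(pkl_name, 'wb') as outfile:
        pickle.dump(starmodels, outfile)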

### Bare-bones model trained on Yelp reviews only

In [5]:
from gensim.models import Word2Vec
import multiprocessing

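## create a w2v learner
yelpmodel = Word2Vec(
    workers=multiprocessing.cpu_count(),  # use your cores
    iter=3,  # sweeps of SGD through the data; more is better (renamed `epochs` in gensim 4)
    min_count=15)  # ignore words that occur fewer than 15 times
## build the vocabulary from the sentences of all training reviews
yelpmodel.build_vocab(StarSentences(revtrain))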

12-05 12:40 INFO : 'pattern' package not found; tag filters are not available for English
12-05 12:40 INFO : collecting all words and their counts
12-05 12:40 INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
12-05 12:40 INFO : PROGRESS: at sentence #10000, processed 104571 words, keeping 9087 word types
12-05 12:40 INFO : PROGRESS: at sentence #20000, processed 208288 words, keeping 13354 word types
12-05 12:40 INFO : PROGRESS: at sentence #30000, processed 314919 words, keeping 16403 word types
12-05 12:40 INFO : PROGRESS: at sentence #40000, processed 420988 words, keeping 19007 word types
12-05 12:40 INFO : PROGRESS: at sentence #50000, processed 527377 words, keeping 21220 word types
12-05 12:40 INFO : PROGRESS: at sentence #60000, processed 636732 words, keeping 23296 word types
12-05 12:40 INFO : PROGRESS: at sentence #70000, processed 741437 words, keeping 25037 word types
12-05 12:40 INFO : PROGRESS: at sentence #80000, processed 849243 words, keeping 26
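Conceptually, the vocabulary pass counts every word type it sees, and `min_count` then drops the rare ones. The following is a conceptual sketch only, not gensim's implementation; `prune_vocab` and the toy sentences are made up for illustration.

```python
from collections import Counter

def prune_vocab(sentences, min_count):
    """Count word occurrences and keep only words seen at least `min_count` times."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_sentences = [["good", "food"], ["good", "service"], ["bad", "food"], ["good"]]
vocab = prune_vocab(toy_sentences, min_count=2)
print(sorted(vocab.items()))
```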

### Pretrained base model

In [6]:
with open('wtv_model_cwiki_50perc.pkl', 'rb') as infile:
    basemodel = pickle.load(infile)
basemodel.iter = 3

Now we _deep_ copy the base model once per star rating and do star-specific training on each copy. This is where the big computations happen...
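The deep copy matters because each star-specific model must end up with its own weights. A toy stand-in shows the difference between sharing one object and deep-copying it; the `ToyModel` class below is hypothetical and says nothing about gensim's internals.

```python
from copy import deepcopy

class ToyModel:
    """Hypothetical stand-in for a model with trainable weights."""
    def __init__(self):
        self.weights = [0.0, 0.0]

    def train(self, step):
        # pretend training: shift every weight in place by `step`
        for i in range(len(self.weights)):
            self.weights[i] += step

base = ToyModel()
shared = [base for _ in range(2)]            # two references to one object
copies = [deepcopy(base) for _ in range(2)]  # two independent models

shared[0].train(1.0)
copies[0].train(1.0)

print(shared[1].weights)  # changed too: it is the same object as shared[0]
print(copies[1].weights)  # untouched by training copies[0]
```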

In [7]:
train(basemodel, 'wtv_models_cwiki_yelp.pkl')

1 stars ( 246207 )

12-05 12:40 INFO : training model with 10 workers on 62365 vocabulary and 100 features, using sg=1 hs=1 sample=0 and negative=0
12-05 12:40 INFO : PROGRESS: at 5.05% examples, 332902 words/s
12-05 12:40 INFO : PROGRESS: at 10.26% examples, 337787 words/s
12-05 12:40 INFO : PROGRESS: at 15.47% examples, 339588 words/s
12-05 12:40 INFO : PROGRESS: at 20.69% examples, 339861 words/s
12-05 12:40 INFO : PROGRESS: at 25.82% examples, 340449 words/s
12-05 12:40 INFO : PROGRESS: at 30.91% examples, 339851 words/s
12-05 12:40 INFO : PROGRESS: at 35.89% examples, 338461 words/s
12-05 12:40 INFO : PROGRESS: at 40.94% examples, 337564 words/s
12-05 12:40 INFO : PROGRESS: at 46.00% examples, 336877 words/s
12-05 12:40 INFO : PROGRESS: at 51.18% examples, 337007 words/s
12-05 12:40 INFO : PROGRESS: at 56.43% examples, 337952 words/s
12-05 12:41 INFO : PROGRESS: at 61.53% examples, 338275 words/s
12-05 12:41 INFO : PROGRESS: at 66.60% examples, 337999 words/s
12-05 12:41 INFO : PROGRESS: at 71.76% ex

2 stars ( 295371 )

12-05 12:41 INFO : training model with 10 workers on 62365 vocabulary and 100 features, using sg=1 hs=1 sample=0 and negative=0
12-05 12:41 INFO : PROGRESS: at 4.27% examples, 337704 words/s
12-05 12:41 INFO : PROGRESS: at 8.51% examples, 339707 words/s
12-05 12:41 INFO : PROGRESS: at 12.73% examples, 338481 words/s
12-05 12:41 INFO : PROGRESS: at 16.76% examples, 334067 words/s
12-05 12:41 INFO : PROGRESS: at 20.81% examples, 332358 words/s
12-05 12:41 INFO : PROGRESS: at 24.95% examples, 332691 words/s
12-05 12:41 INFO : PROGRESS: at 29.21% examples, 333591 words/s
12-05 12:41 INFO : PROGRESS: at 33.33% examples, 333053 words/s
12-05 12:41 INFO : PROGRESS: at 37.67% examples, 334191 words/s
12-05 12:41 INFO : PROGRESS: at 41.92% examples, 335051 words/s
12-05 12:41 INFO : PROGRESS: at 46.16% examples, 335385 words/s
12-05 12:41 INFO : PROGRESS: at 50.04% examples, 333213 words/s
12-05 12:41 INFO : PROGRESS: at 54.01% examples, 332040 words/s
12-05 12:41 INFO : PROGRESS: at 58.30% exa

3 stars ( 437718 )

12-05 12:41 INFO : training model with 10 workers on 62365 vocabulary and 100 features, using sg=1 hs=1 sample=0 and negative=0
12-05 12:41 INFO : PROGRESS: at 2.57% examples, 310326 words/s
12-05 12:41 INFO : PROGRESS: at 5.45% examples, 328202 words/s
12-05 12:41 INFO : PROGRESS: at 8.31% examples, 335773 words/s
12-05 12:41 INFO : PROGRESS: at 11.15% examples, 337512 words/s
12-05 12:41 INFO : PROGRESS: at 14.02% examples, 339325 words/s
12-05 12:41 INFO : PROGRESS: at 16.90% examples, 340459 words/s
12-05 12:41 INFO : PROGRESS: at 19.78% examples, 341743 words/s
12-05 12:41 INFO : PROGRESS: at 22.59% examples, 341638 words/s
12-05 12:41 INFO : PROGRESS: at 25.43% examples, 341867 words/s
12-05 12:41 INFO : PROGRESS: at 28.02% examples, 339104 words/s
12-05 12:41 INFO : PROGRESS: at 30.78% examples, 338263 words/s
12-05 12:41 INFO : PROGRESS: at 33.07% examples, 333368 words/s
12-05 12:41 INFO : PROGRESS: at 35.88% examples, 333812 words/s
12-05 12:41 INFO : PROGRESS: at 38.79% exam

4 stars ( 883235 )

12-05 12:42 INFO : training model with 10 workers on 62365 vocabulary and 100 features, using sg=1 hs=1 sample=0 and negative=0
12-05 12:42 INFO : PROGRESS: at 1.41% examples, 337820 words/s
12-05 12:42 INFO : PROGRESS: at 2.82% examples, 339245 words/s
12-05 12:42 INFO : PROGRESS: at 4.21% examples, 337125 words/s
12-05 12:42 INFO : PROGRESS: at 5.61% examples, 337132 words/s
12-05 12:42 INFO : PROGRESS: at 7.04% examples, 340507 words/s
12-05 12:42 INFO : PROGRESS: at 8.43% examples, 340020 words/s
12-05 12:42 INFO : PROGRESS: at 9.88% examples, 341591 words/s
12-05 12:42 INFO : PROGRESS: at 11.32% examples, 341232 words/s
12-05 12:42 INFO : PROGRESS: at 12.74% examples, 340876 words/s
12-05 12:42 INFO : PROGRESS: at 14.18% examples, 341469 words/s
12-05 12:42 INFO : PROGRESS: at 15.61% examples, 341856 words/s
12-05 12:42 INFO : PROGRESS: at 16.98% examples, 341337 words/s
12-05 12:42 INFO : PROGRESS: at 18.39% examples, 341824 words/s
12-05 12:42 INFO : PROGRESS: at 19.81% examples

5 stars ( 799704 )

12-05 12:43 INFO : training model with 10 workers on 62365 vocabulary and 100 features, using sg=1 hs=1 sample=0 and negative=0
12-05 12:43 INFO : PROGRESS: at 1.57% examples, 327312 words/s
12-05 12:43 INFO : PROGRESS: at 3.17% examples, 334067 words/s
12-05 12:43 INFO : PROGRESS: at 4.76% examples, 334867 words/s
12-05 12:43 INFO : PROGRESS: at 6.38% examples, 335429 words/s
12-05 12:43 INFO : PROGRESS: at 7.99% examples, 336165 words/s
12-05 12:43 INFO : PROGRESS: at 9.61% examples, 336851 words/s
12-05 12:43 INFO : PROGRESS: at 11.17% examples, 334831 words/s
12-05 12:43 INFO : PROGRESS: at 12.71% examples, 334538 words/s
12-05 12:43 INFO : PROGRESS: at 14.31% examples, 334732 words/s
12-05 12:43 INFO : PROGRESS: at 15.92% examples, 334778 words/s
12-05 12:43 INFO : PROGRESS: at 17.54% examples, 335339 words/s
12-05 12:43 INFO : PROGRESS: at 19.17% examples, 335760 words/s
12-05 12:43 INFO : PROGRESS: at 20.80% examples, 336410 words/s
12-05 12:43 INFO : PROGRESS: at 22.42% example


In [8]:
train(yelpmodel, 'wtv_models_yelp_only.pkl')

1 stars ( 246207 )

12-05 12:44 INFO : training model with 4 workers on 24620 vocabulary and 100 features, using sg=1 hs=1 sample=0 and negative=0
12-05 12:44 INFO : PROGRESS: at 5.98% examples, 465942 words/s
12-05 12:44 INFO : PROGRESS: at 12.02% examples, 466217 words/s
12-05 12:44 INFO : PROGRESS: at 18.07% examples, 466308 words/s
12-05 12:44 INFO : PROGRESS: at 24.14% examples, 467289 words/s
12-05 12:44 INFO : PROGRESS: at 30.23% examples, 469811 words/s
12-05 12:44 INFO : PROGRESS: at 36.32% examples, 470883 words/s
12-05 12:44 INFO : PROGRESS: at 42.34% examples, 469387 words/s
12-05 12:44 INFO : PROGRESS: at 48.16% examples, 467538 words/s
12-05 12:44 INFO : PROGRESS: at 53.98% examples, 465373 words/s
12-05 12:44 INFO : PROGRESS: at 59.79% examples, 464293 words/s
12-05 12:44 INFO : PROGRESS: at 65.49% examples, 462601 words/s
12-05 12:44 INFO : PROGRESS: at 71.17% examples, 460943 words/s
12-05 12:44 INFO : PROGRESS: at 77.10% examples, 460729 words/s
12-05 12:44 INFO : PROGRESS: at 82.71% exa

2 stars ( 295371 )

12-05 12:44 INFO : training model with 4 workers on 24620 vocabulary and 100 features, using sg=1 hs=1 sample=0 and negative=0
12-05 12:44 INFO : PROGRESS: at 4.91% examples, 455256 words/s
12-05 12:44 INFO : PROGRESS: at 9.58% examples, 449566 words/s
12-05 12:44 INFO : PROGRESS: at 14.37% examples, 450971 words/s
12-05 12:44 INFO : PROGRESS: at 19.23% examples, 452119 words/s
12-05 12:44 INFO : PROGRESS: at 24.11% examples, 453643 words/s
12-05 12:44 INFO : PROGRESS: at 28.82% examples, 452553 words/s
12-05 12:44 INFO : PROGRESS: at 33.69% examples, 452697 words/s
12-05 12:44 INFO : PROGRESS: at 38.47% examples, 452028 words/s
12-05 12:44 INFO : PROGRESS: at 43.11% examples, 450978 words/s
12-05 12:44 INFO : PROGRESS: at 48.11% examples, 453256 words/s
12-05 12:44 INFO : PROGRESS: at 53.13% examples, 454723 words/s
12-05 12:44 INFO : PROGRESS: at 57.70% examples, 453036 words/s
12-05 12:44 INFO : PROGRESS: at 62.14% examples, 450491 words/s
12-05 12:45 INFO : PROGRESS: at 66.98% exam

3 stars ( 437718 )

12-05 12:45 INFO : training model with 4 workers on 24620 vocabulary and 100 features, using sg=1 hs=1 sample=0 and negative=0
12-05 12:45 INFO : PROGRESS: at 3.21% examples, 454627 words/s
12-05 12:45 INFO : PROGRESS: at 6.31% examples, 447212 words/s
12-05 12:45 INFO : PROGRESS: at 9.49% examples, 450887 words/s
12-05 12:45 INFO : PROGRESS: at 12.73% examples, 452396 words/s
12-05 12:45 INFO : PROGRESS: at 15.60% examples, 443572 words/s
12-05 12:45 INFO : PROGRESS: at 17.61% examples, 416726 words/s
12-05 12:45 INFO : PROGRESS: at 19.86% examples, 403080 words/s
12-05 12:45 INFO : PROGRESS: at 22.96% examples, 407862 words/s
12-05 12:45 INFO : PROGRESS: at 25.43% examples, 401441 words/s
12-05 12:45 INFO : PROGRESS: at 28.27% examples, 402041 words/s
12-05 12:45 INFO : PROGRESS: at 30.93% examples, 399289 words/s
12-05 12:45 INFO : PROGRESS: at 33.74% examples, 399546 words/s
12-05 12:45 INFO : PROGRESS: at 36.95% examples, 403748 words/s
12-05 12:45 INFO : PROGRESS: at 40.30% examp

4 stars ( 883235 )

12-05 12:45 INFO : training model with 4 workers on 24620 vocabulary and 100 features, using sg=1 hs=1 sample=0 and negative=0
12-05 12:45 INFO : PROGRESS: at 1.43% examples, 401882 words/s
12-05 12:45 INFO : PROGRESS: at 3.07% examples, 432000 words/s
12-05 12:45 INFO : PROGRESS: at 4.27% examples, 399975 words/s
12-05 12:45 INFO : PROGRESS: at 5.50% examples, 386108 words/s
12-05 12:45 INFO : PROGRESS: at 7.10% examples, 400580 words/s
12-05 12:45 INFO : PROGRESS: at 8.63% examples, 405131 words/s
12-05 12:45 INFO : PROGRESS: at 10.33% examples, 415875 words/s
12-05 12:45 INFO : PROGRESS: at 11.56% examples, 405957 words/s
12-05 12:45 INFO : PROGRESS: at 12.87% examples, 401654 words/s
12-05 12:45 INFO : PROGRESS: at 14.39% examples, 404052 words/s
12-05 12:45 INFO : PROGRESS: at 15.55% examples, 397279 words/s
12-05 12:45 INFO : PROGRESS: at 17.06% examples, 400301 words/s
12-05 12:45 INFO : PROGRESS: at 18.72% examples, 405931 words/s
12-05 12:45 INFO : PROGRESS: at 19.85% examples

5 stars ( 799704 )

12-05 12:46 INFO : training model with 4 workers on 24620 vocabulary and 100 features, using sg=1 hs=1 sample=0 and negative=0
12-05 12:46 INFO : PROGRESS: at 1.91% examples, 464719 words/s
12-05 12:46 INFO : PROGRESS: at 3.76% examples, 460775 words/s
12-05 12:46 INFO : PROGRESS: at 5.65% examples, 462176 words/s
12-05 12:46 INFO : PROGRESS: at 7.57% examples, 463932 words/s
12-05 12:46 INFO : PROGRESS: at 9.46% examples, 463954 words/s
12-05 12:46 INFO : PROGRESS: at 11.39% examples, 465013 words/s
12-05 12:46 INFO : PROGRESS: at 13.29% examples, 465597 words/s
12-05 12:46 INFO : PROGRESS: at 15.21% examples, 465901 words/s
12-05 12:46 INFO : PROGRESS: at 17.12% examples, 466337 words/s
12-05 12:46 INFO : PROGRESS: at 19.05% examples, 466416 words/s
12-05 12:46 INFO : PROGRESS: at 20.91% examples, 465590 words/s
12-05 12:46 INFO : PROGRESS: at 22.84% examples, 465539 words/s
12-05 12:46 INFO : PROGRESS: at 24.75% examples, 465480 words/s
12-05 12:46 INFO : PROGRESS: at 26.66% example
