# Kaggle: Bag of Words Meets Bag of Popcorn Part 3: More Fun With Word Vectors

Following: https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-3-more-fun-with-word-vectors

## Numeric Representation of Words

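Our Word2Vec model consists of a feature vector for each word in the vocabulary, stored in a NumPy array called `syn0`. Each row holds one word's vector, so the number of rows equals the vocabulary size and the number of columns equals the number of features.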

In [3]:
from gensim.models import Word2Vec

In [4]:
# model trained in part 2
model = Word2Vec.load('300features_40minwords_10context')

In [5]:
type(model.syn0)

numpy.ndarray

In [6]:
model.syn0.shape

(16489, 300)

In [7]:
print(model['flower'][:10])
print(model['flower'][-10:])

[-0.02445767  0.07111458 -0.02681037 -0.01930033  0.05002626  0.07517181
  0.0421546  -0.00371989 -0.00089626  0.02496198]
[-0.08400575 -0.03687724 -0.13674392 -0.02168166  0.03528847  0.01696873
  0.00772722 -0.00861694  0.0086038   0.09287051]
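
To make the layout concrete, the sketch below builds a tiny stand-in for `syn0` with NumPy: one row per vocabulary word, one column per feature. The vocabulary, the 4-dimensional vectors, and the `word_to_row` lookup are made up for illustration; the real model has 16,489 rows and 300 columns, and gensim manages the word-to-row mapping internally.

```python
import numpy as np

# Hypothetical toy vocabulary; the real model has 16,489 words.
vocab = ["movie", "flower", "great"]

# Stand-in for syn0: one row per word, one column per feature.
rng = np.random.RandomState(0)
toy_syn0 = rng.uniform(-0.1, 0.1, size=(len(vocab), 4)).astype("float32")

# Map each word to its row so that a lookup is just row indexing.
word_to_row = {word: i for i, word in enumerate(vocab)}

flower_vec = toy_syn0[word_to_row["flower"]]
print(toy_syn0.shape)    # (3, 4)
print(flower_vec.shape)  # (4,)
```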


## From Words to Paragraphs, Attempt 1: Vector Averaging

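We need a way to turn the individual word vectors into a feature set that has the same length for every review. Since each word is a vector in 300-dimensional space, one simple approach is to average the word vectors in a review. For this purpose we'll remove stop words, which would just add noise to the average.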
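
Written out, the averaging step can be summarized as follows. The symbols are notation introduced here, not part of the tutorial: $v_w$ is the vector for word $w$, and $W_r$ is the list of tokens of review $r$ that survive stop-word removal and appear in the model's vocabulary.

$$
\bar{v}_r = \frac{1}{|W_r|} \sum_{w \in W_r} v_w
$$

Every $\bar{v}_r$ has the same length as a single word vector (300 here), regardless of how long the review is. A word that occurs several times in a review contributes once per occurrence.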

In [8]:
import numpy as np

In [9]:
def makeFeatureVec(words, model, num_features):
    """
    average all the words in a paragraph
    """
    featureVec = np.zeros((num_features,), dtype='float32')
    nwords = 0.0
    # words in the model's vocabulary, converted to a set for fast membership tests
    # (for speed, this should be passed in or cached rather than rebuilt on every call)
    index2word_set = set(model.index2word)
    # loop over words and add
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.0
            featureVec = np.add(featureVec, model[word])
    # divide by number of words to get the average
    # (skip the division when no word was in the vocabulary, to avoid NaNs)
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

In [10]:
def getAvgFeatureVecs(reviews, model, num_features):
    """
    calculate the average feature vector for a group of reviews and return 2D numpy array
    """
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype='float32')
    for counter, review in enumerate(reviews):
        # report progress every 1000 reviews
        if (counter + 1) % 1000 == 0:
            print("Review {} of {}".format(counter + 1, len(reviews)))
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)
        
    return reviewFeatureVecs
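
The averaging logic can be tried without the trained model. The sketch below uses a hypothetical `ToyModel` class that exposes only what the averaging code relies on (an `index2word` list and `model[word]` lookup) together with made-up 2-dimensional vectors. An averaging routine also has to decide what to do with a review that contains no in-vocabulary words, since dividing by a zero count yields NaN values; `average_vector` here returns a zero vector, which is one possible choice rather than the tutorial's.

```python
import numpy as np


class ToyModel(object):
    """Hypothetical stand-in exposing the two things the averaging code
    uses: an `index2word` list and word lookup via `model[word]`."""

    def __init__(self, vectors):
        self._vectors = vectors
        self.index2word = list(vectors)

    def __getitem__(self, word):
        return self._vectors[word]


def average_vector(words, model, num_features):
    """Average the vectors of the in-vocabulary words; zeros if none."""
    vocab = set(model.index2word)
    known = [model[w] for w in words if w in vocab]
    if not known:
        return np.zeros(num_features, dtype="float32")
    return np.mean(known, axis=0).astype("float32")


toy = ToyModel({
    "good": np.array([1.0, 0.0], dtype="float32"),
    "film": np.array([0.0, 1.0], dtype="float32"),
})

print(average_vector(["good", "film", "zzz"], toy, 2))  # [0.5 0.5]
print(average_vector(["zzz"], toy, 2))                  # [0. 0.]
```

The unknown word `zzz` is skipped in the first call, so the result is the mean of the two known vectors.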

We need some code from part 2...

In [11]:
import pandas as pd
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords

In [12]:
# no need to recompute this set for every function call
stops = set(stopwords.words('english'))

def review_to_wordlist(review_text, remove_stopwords=False):
    """
    Convert text to a list of lower-case words. HTML is assumed to have been
    removed already (review_to_sentences does this before tokenizing into sentences).
    """
    # remove non-letters
    review_text = re.sub(r'[^a-zA-Z]', ' ', review_text)
    # convert to lower-case and split
    words = review_text.lower().split()
    # remove stopwords?
    if remove_stopwords:
        words = [w for w in words if w not in stops]
    # return a list of words
    return words
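
To see what this cleaning does to a piece of text, here is a small self-contained sketch of the same steps. The function name `to_wordlist`, the sample sentence, and the tiny `toy_stops` set are illustrative assumptions; the notebook itself uses NLTK's full English stop-word list.

```python
import re

# Hypothetical, tiny stop-word set; the notebook uses NLTK's English list.
toy_stops = {"this", "was", "a", "t"}


def to_wordlist(text, stop_set=None):
    """Keep letters only, lower-case, split, optionally drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    if stop_set is not None:
        words = [w for w in words if w not in stop_set]
    return words


sample = "This wasn't a GREAT film, 2/10!"
print(to_wordlist(sample))
# ['this', 'wasn', 't', 'a', 'great', 'film']
print(to_wordlist(sample, toy_stops))
# ['wasn', 'great', 'film']
```

Note that digits and punctuation disappear entirely and that contractions are split at the apostrophe, so `wasn't` becomes the two tokens `wasn` and `t`.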

In [13]:
import nltk.data

# load the punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [14]:
def review_to_sentences(review, tokenizer, remove_stopwords=False, parser='lxml'):
    # remove html first
    review_text = BeautifulSoup(review, parser).get_text()
    # use tokenizer to split paragraph into sentences
    raw_sentences = tokenizer.tokenize(review_text.strip())
    # loop over sentences...
    sentences = []
    for raw_sentence in raw_sentences:
        # skip empty
        if len(raw_sentence) > 0:
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
            
    return sentences

Now we need data from part 2...

In [16]:
train = pd.read_csv('./labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
test = pd.read_csv('./testData.tsv', header=0, delimiter='\t', quoting=3)
unlabeled_train = pd.read_csv('./unlabeledTrainData.tsv', header=0, delimiter='\t', quoting=3)

print('read %d labeled, %d test, and %d unlabeled' % (train['review'].size,
                                                      test['review'].size,
                                                      unlabeled_train['review'].size))

read 25000 labeled, 25000 test, and 50000 unlabeled


Back to part 3 proper...

In [19]:
# create average vectors for each of the paragraphs
clean_train_reviews = []
for review in train['review']:
    clean_train_reviews.append(review_to_wordlist(review, remove_stopwords=True))
 
# can't use this yet...
#for review in unlabeled_train['review']:
#    clean_train_reviews.append(review_to_wordlist(review, remove_stopwords=True))
    
trainDataVecs = getAvgFeatureVecs(clean_train_reviews, model, model.syn0.shape[1])

Review 1000 of 25000
Review 2000 of 25000
Review 3000 of 25000
Review 4000 of 25000
Review 5000 of 25000
Review 6000 of 25000
Review 7000 of 25000
Review 8000 of 25000
Review 9000 of 25000
Review 10000 of 25000
Review 11000 of 25000
Review 12000 of 25000
Review 13000 of 25000
Review 14000 of 25000
Review 15000 of 25000
Review 16000 of 25000
Review 17000 of 25000
Review 18000 of 25000
Review 19000 of 25000
Review 20000 of 25000
Review 21000 of 25000
Review 22000 of 25000
Review 23000 of 25000
Review 24000 of 25000
Review 25000 of 25000


In [20]:
print("creating average feature vecs for test review")

clean_test_reviews = []
for review in test['review']:
    clean_test_reviews.append(review_to_wordlist(review, remove_stopwords=True))
    
testDataVecs = getAvgFeatureVecs(clean_test_reviews, model, model.syn0.shape[1])

creating average feature vecs for test review
Review 1000 of 25000
Review 2000 of 25000
Review 3000 of 25000
Review 4000 of 25000
Review 5000 of 25000
Review 6000 of 25000
Review 7000 of 25000
Review 8000 of 25000
Review 9000 of 25000
Review 10000 of 25000
Review 11000 of 25000
Review 12000 of 25000
Review 13000 of 25000
Review 14000 of 25000
Review 15000 of 25000
Review 16000 of 25000
Review 17000 of 25000
Review 18000 of 25000
Review 19000 of 25000
Review 20000 of 25000
Review 21000 of 25000
Review 22000 of 25000
Review 23000 of 25000
Review 24000 of 25000
Review 25000 of 25000


Next, use the average paragraph vectors to train a random forest.

In [21]:
from sklearn.ensemble import RandomForestClassifier

In [22]:
forest = RandomForestClassifier(n_estimators=100)

In [23]:
print("fitting a random forest to labeled training data")
forest = forest.fit(trainDataVecs, train['sentiment'])

fitting a random forest to labeled training data


In [24]:
result = forest.predict(testDataVecs)

In [25]:
output = pd.DataFrame(data={'id': test['id'], 'sentiment': result})
output.to_csv('Word2Vec_AverageVectors.csv', index=False, quoting=3)

In [26]:
result

array([1, 0, 1, ..., 1, 1, 0])

In [27]:
output

|   | id | sentiment |
|---|----|-----------|
| 0 | "12311_10" | 1 |
| 1 | "8348_2" | 0 |
| 2 | "5828_4" | 1 |
| 3 | "7186_2" | 0 |
| 4 | "12128_7" | 1 |
| 5 | "2913_8" | 1 |
| 6 | "4396_1" | 0 |
| 7 | "395_2" | 0 |
| 8 | "10616_1" | 0 |
| 9 | "9074_9" | 1 |


In [28]:
forest.score

<bound method RandomForestClassifier.score of RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)>

In [29]:
forest

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
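
Note that `forest.score` above only displays the bound method. Calling it requires features and true labels, and the Kaggle test set has no labels. One common way to estimate accuracy locally is to hold out part of the labeled training data. The sketch below does this on synthetic data standing in for `trainDataVecs` and `train['sentiment']`; the synthetic features, the 25% split, and the random seeds are all assumptions, and it presumes a scikit-learn version that provides `sklearn.model_selection`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for trainDataVecs and train['sentiment']:
# two well-separated Gaussian blobs in 20 dimensions.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(200, 20)),
               rng.normal(1.0, 1.0, size=(200, 20))]).astype("float32")
y = np.array([0] * 200 + [1] * 200)

# Hold out 25% of the labeled rows for validation.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_fit, y_fit)

# score() returns mean accuracy on the given features and labels.
val_accuracy = forest.score(X_val, y_val)
print("validation accuracy: {:.3f}".format(val_accuracy))
```

With the real data, the same pattern would split `trainDataVecs` and `train['sentiment']` before fitting.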

We could possibly improve this by weighting each word vector by its [tf-idf weight](https://en.wikipedia.org/wiki/Tf–idf), a measure of how important a word is within a collection of documents. `scikit-learn` provides [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) for computing these weights, but the weighting may not actually offer much improvement.
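
A minimal sketch of one way such a weighting could look: fit `TfidfVectorizer` to learn an idf weight per word, then weight each word occurrence by its idf when averaging. This is one plausible reading of the idea, not the tutorial's method. The toy corpus, the random 3-dimensional word vectors, and the helper `idf_weighted_average` are all hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of already-cleaned reviews (hypothetical data).
reviews = [["good", "film", "film"],
           ["bad", "film"],
           ["good", "acting"]]

# Fit on space-joined tokens to learn one idf weight per word.
vectorizer = TfidfVectorizer()
vectorizer.fit([" ".join(words) for words in reviews])
idf = {word: vectorizer.idf_[col]
       for word, col in vectorizer.vocabulary_.items()}

# Hypothetical 3-dimensional word vectors standing in for the model.
rng = np.random.RandomState(0)
word_vecs = {word: rng.normal(size=3) for word in idf}


def idf_weighted_average(words, word_vecs, idf, num_features):
    """Average word vectors, weighting each occurrence by the word's idf."""
    total = np.zeros(num_features)
    weight_sum = 0.0
    for word in words:
        if word in word_vecs and word in idf:
            total += idf[word] * word_vecs[word]
            weight_sum += idf[word]
    return total / weight_sum if weight_sum > 0 else total


weighted = idf_weighted_average(reviews[0], word_vecs, idf, 3)
print(weighted.shape)  # (3,)
```

Words that appear in fewer reviews (such as `acting` here) receive a larger idf than common ones (such as `film`), so they pull the average more strongly. By default `TfidfVectorizer` ignores single-character tokens, so those would be skipped by this helper.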

## From Words to Paragraphs, Attempt 2: Clustering