## Kaggle NLP Tutorial: [Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-3-more-fun-with-word-vectors)
-----
# Part 3: More Fun With Word Vectors
# 1. Numeric Representations of Words
Each word is represented as a vector with N numerical values (here N=300).

The Word2Vec model trained in Part 2 consists of a feature vector for each word in the vocabulary, stored in a numpy array called **`syn0`**:

In [1]:
# Load the model that we created in Part 2
from gensim.models import Word2Vec
model = Word2Vec.load("300features_40minwords_10context")

In [4]:
print type(model.syn0)
print model.syn0.shape

<type 'numpy.ndarray'>
(16490, 300)


Setting the minimum word count to 40 gave us a total vocabulary of 16,490 words with 300 features apiece, matching the shape printed above.
- The number of rows: the number of words in the model's vocabulary
- The number of columns: the size of the feature vector, which we set in Part 2
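
To make the row/column layout concrete, here is a minimal sketch with a made-up three-word vocabulary and 4 features instead of 300. The names `toy_index2word` and `toy_syn0` are illustrative stand-ins for the model's `index2word` and `syn0`; the numbers are invented.

```python
import numpy as np

# Hypothetical toy vocabulary: one row of toy_syn0 per word
toy_index2word = ["movie", "flower", "popcorn"]
toy_syn0 = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
], dtype="float32")

# Rows = number of words in the vocabulary, columns = features per word
n_words, n_features = toy_syn0.shape

# Looking up a word's vector amounts to selecting its row
row = toy_index2word.index("flower")
flower_vec = toy_syn0[row]

print(toy_syn0.shape)
print(flower_vec)
```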

Individual word vectors can be accessed in the following way:

In [5]:
model["flower"]

array([  2.63045412e-02,  -7.70469010e-02,   5.10520190e-02,
         1.04793541e-01,  -8.05942249e-03,   1.42707271e-04,
         4.70351242e-02,   3.08730341e-02,   1.56136565e-02,
         8.89217481e-02,  -7.53695518e-02,  -5.59975766e-03,
         5.96253527e-03,   6.15217909e-03,  -9.69854835e-03,
         8.10943637e-03,  -2.35759504e-02,  -3.36211994e-02,
        -2.19760146e-02,   3.30724761e-05,  -4.15189900e-02,
         1.42527252e-01,   3.37156020e-02,   8.60805437e-02,
        -2.27182470e-02,  -7.99016654e-03,  -7.55509958e-02,
        -1.01256482e-01,   3.30181271e-02,  -1.27396137e-01,
        -1.47526279e-01,  -7.87535012e-02,   5.81267923e-02,
        -8.56401306e-03,  -1.24435067e-01,   9.53752324e-02,
        -1.64249502e-02,  -6.64357329e-03,   7.04739988e-02,
         2.08331924e-02,   2.15384346e-02,   3.17606106e-02,
        -6.17580637e-02,  -8.08355287e-02,   3.97209302e-02,
        -3.23512033e-02,   4.61595096e-02,  -3.46926646e-03,
        -1.85832626e-03,

# 2. From Words To Paragraphs, Attempt 1: Vector Averaging
## The IMDB dataset has variable-length reviews.
- We need to find a way to **take individual word vectors and transform them into a feature set that is the same length for every review.**
- Since each word is a vector in 300-dimensional space, we can use vector operations to combine the words in each review. 
- One method we tried was to simply average the word vectors in a given review (for this purpose, we removed stop words, which would just add noise).
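
As a small illustration of the averaging idea, the sketch below uses a hypothetical dictionary of 3-dimensional vectors (`toy_vectors`) rather than the real model. Reviews of different lengths both end up as vectors of the same length, and unknown words are simply skipped.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (the real model uses 300 dimensions)
toy_vectors = {
    "great": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 2.0, 0.0]),
    "fun":   np.array([2.0, 4.0, 1.0]),
}

def average_vector(words, vectors, num_features):
    """Average the vectors of the words that appear in `vectors`."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        # No known words: fall back to a zero vector
        return np.zeros(num_features)
    return np.mean(known, axis=0)

short_review = ["great", "movie"]
long_review = ["great", "fun", "movie", "unknownword"]

short_vec = average_vector(short_review, toy_vectors, 3)
long_vec = average_vector(long_review, toy_vectors, 3)

# Both results have length 3, regardless of review length
print(short_vec)
print(long_vec)
```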

In [9]:
import numpy as np
num_features = 10
featureVec = np.zeros((num_features,),dtype="float32")

## Building the average paragraph vectors

In [26]:
import numpy as np  # Make sure that numpy is imported

def makeFeatureVec(words, model, num_features):
    """ 
        Average all of the word vectors in a given paragraph.
        - input (words): the list of words in a review; each word that is in the
                         model's vocabulary has its vector added to a running total
        - output: the review represented as a single average feature vector
    """
    
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,),dtype="float32")
    nwords = 0.
    
    # Index2word is a list that contains the names of the words in the model's vocabulary. 
    # Convert it to a set, for speed 
    index2word_set = set(model.index2word)

    # Loop over each word in the review and, if it is in the model's vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set: 
            nwords = nwords + 1.
            # vector sum of each word vector
            featureVec = np.add(featureVec,model[word]) # np.add: element-wise summation 
    
    # Divide the result by the number of words to get the average
    # (skip the division if no word was in the vocabulary, to avoid dividing by zero)
    if nwords > 0:
        featureVec = np.divide(featureVec,nwords) # np.divide: element-wise division
    return featureVec

def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), 
    # calculate the average feature vector for each one and return a 2D numpy array 

    # Initialize a counter (an integer, since it is used as an array index)
    counter = 0

    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")

    # Loop through the reviews
    for review in reviews:
        # Print a status message every 5000th review
        if counter % 5000 == 0:
            print "Review %d of %d" % (counter, len(reviews))

        # Call the function (defined above) that makes average feature vectors
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)

        # Increment the counter
        counter = counter + 1
    return reviewFeatureVecs

In [22]:
# %load review_to_wordlist.py
# Import various modules for string cleaning
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords

def review_to_wordlist( review, remove_stopwords=False ):

    # Function to convert a document to a sequence of words, optionally removing stop words.  Returns a list of words.

    # 1. Remove HTML
    review_text = BeautifulSoup(review, "html.parser").get_text()

    # 2. Remove non-letters
    review_text = re.sub("[^a-zA-Z]"," ", review_text)

    # 3. Convert words to lower case and split them
    words = review_text.lower().split()

    # 4. Optionally remove stop words (false by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]

    # 5. Return a list of words
    return words

In [19]:
import pandas as pd

# Read data from files 
train = pd.read_csv( "data/kaggle_nlp_labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
test = pd.read_csv( "data/kaggle_nlp_testData.tsv", header=0, delimiter="\t", quoting=3)
unlabeled_train = pd.read_csv( "data/kaggle_nlp_unlabeledTrainData.tsv", header=0, delimiter="\t", quoting=3 )

# Verify the number of reviews that were read (100,000 in total)
print "Read %d labeled train reviews, %d labeled test reviews, and %d unlabeled reviews\n" % (train["review"].size, test["review"].size, unlabeled_train["review"].size )

Read 25000 labeled train reviews, 25000 labeled test reviews, and 50000 unlabeled reviews



In [27]:
# ****************************************************************
# Calculate average feature vectors for training and testing sets,
# using the functions we defined above. 

# Notice that we now use stop word removal.
clean_train_reviews = []
num_features = 300

for review in train["review"]:
    clean_train_reviews.append( review_to_wordlist( review, remove_stopwords=True ))

trainDataVecs = getAvgFeatureVecs( clean_train_reviews, model, num_features )

print "Creating average feature vecs for test reviews"
clean_test_reviews = []
for review in test["review"]:
    clean_test_reviews.append( review_to_wordlist( review, remove_stopwords=True ))

testDataVecs = getAvgFeatureVecs( clean_test_reviews, model, num_features )

Review 0 of 25000
Review 5000 of 25000
Review 10000 of 25000
Review 15000 of 25000
Review 20000 of 25000
Creating average feature vecs for test reviews
Review 0 of 25000
Review 5000 of 25000
Review 10000 of 25000
Review 15000 of 25000
Review 20000 of 25000




## Training the model using the average vectors (random forest)
We can only use the labeled training reviews to train the model. 

In [28]:
# Fit a random forest to the training data, using 100 trees
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier( n_estimators = 100 )

print "Fitting a random forest to labeled training data..."
forest = forest.fit( trainDataVecs, train["sentiment"] )

# Test & extract results 
result = forest.predict( testDataVecs )

# Write the test results 
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )
# output.to_csv( "Word2Vec_AverageVectors.csv", index=False, quoting=3 )

Fitting a random forest to labeled training data...


We found that this produced results much better than chance, but underperformed Bag of Words by a few percentage points. Since the element-wise average of the vectors didn't produce spectacular results, perhaps we could do it in a more intelligent way? 

A standard way of weighting word vectors is to apply **tf-idf** weights, which measure **how important a given word is within a given set of documents**. 
- One way to extract tf-idf weights in Python is by using scikit-learn's **`TfidfVectorizer`**, which has an interface similar to the CountVectorizer that we used in Part 1. 
- However, when we tried weighting our word vectors in this way, we found no substantial improvement in performance.
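
For reference, one way such a weighting could be set up is sketched below. This is not the exact procedure used in the experiment above: the toy reviews, the 2-dimensional `toy_vectors`, and the helper `idf_weighted_average` are all made up for illustration. Only `TfidfVectorizer`, its `vocabulary_` mapping, and its `idf_` array come from scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical tokenized reviews and 2-dimensional word vectors
toy_reviews = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "fun", "movie"],
]
toy_vectors = {
    "good":  np.array([1.0, 0.0]),
    "bad":   np.array([-1.0, 0.0]),
    "fun":   np.array([0.0, 1.0]),
    "movie": np.array([0.5, 0.5]),
}

# Fit on space-joined text; idf_ holds one weight per vocabulary term
vectorizer = TfidfVectorizer()
vectorizer.fit([" ".join(r) for r in toy_reviews])
idf = {w: vectorizer.idf_[i] for w, i in vectorizer.vocabulary_.items()}

def idf_weighted_average(words, vectors, idf, num_features):
    """Average word vectors, weighting each occurrence by the word's idf."""
    total = np.zeros(num_features)
    weight_sum = 0.0
    for w in words:
        if w in vectors and w in idf:
            total += idf[w] * vectors[w]
            weight_sum += idf[w]
    if weight_sum == 0.0:
        return total
    return total / weight_sum

# "movie" occurs in every review, so it gets a lower weight than "bad"
weighted = idf_weighted_average(toy_reviews[1], toy_vectors, idf, 2)
print(idf)
print(weighted)
```

Because each occurrence of a word contributes once, repeated words count more, so the term-frequency part of tf-idf is applied implicitly.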

# 3. From Words to Paragraphs, Attempt 2: Clustering
Word2Vec creates clusters of semantically related words, so another possible approach is to **exploit the similarity of words within a cluster**. Grouping vectors in this way is known as **vector quantization**. To accomplish this, we first need to find the centers of the word clusters, which we can do by using a clustering algorithm such as K-Means.
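
Before running this on the full vocabulary, the idea can be tried on a toy example. The sketch below assumes six made-up 2-dimensional "word vectors" that form two obvious groups. The words and coordinates are invented; `KMeans` and `fit_predict` are the same scikit-learn calls used in the next cell.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical words with invented 2-dimensional vectors forming two groups
toy_words = ["knife", "sword", "dagger", "happy", "glad", "joyful"]
toy_vectors = np.array([
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
    [-5.0, -5.1], [-5.2, -4.9], [-4.9, -5.0],
])

# Find 2 cluster centers and assign each word vector to the nearest one
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(toy_vectors)

# Each word is now represented by a single cluster number
toy_word_cluster = dict(zip(toy_words, labels))
print(toy_word_cluster)
print(kmeans.cluster_centers_)
```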

## K-Means clustering (*long run time because of high K*) 

In [33]:
from sklearn.cluster import KMeans
import time

start = time.time() # Start time

# Set "k" (num_clusters) to be 1/5th of the vocabulary size, or an average of 5 words per cluster
# (integer division, because n_clusters must be an integer)
word_vectors = model.syn0
num_clusters = word_vectors.shape[0] // 5

# Initialize a k-means object and use it to extract centroids
kmeans_clustering = KMeans( n_clusters = num_clusters )
idx = kmeans_clustering.fit_predict( word_vectors )

# Get the end time and print how long the process took
end = time.time()
elapsed = end - start
print "Time taken for K Means clustering: ", elapsed, "seconds."

Time taken for K Means clustering:  1824.00886083 seconds.


In [34]:
# For convenience, zip the vocabulary words and their cluster assignments into one dictionary,
# mapping each vocabulary word to a cluster number
word_centroid_map = dict(zip( model.index2word, idx ))

In [35]:
# Check words in the first 10 clusters
for cluster in xrange(0,10):
    # Print the cluster number  
    print "\nCluster %d" % cluster
    
    # Find all of the words for that cluster number, and print them out
    words = [word for word, c in word_centroid_map.items() if c == cluster]
    print words


Cluster 0
[u'tired', u'bored', u'scared']

Cluster 1
[u'unexpectedly', u'engages', u'decidedly', u'unassuming', u'stable']

Cluster 2
[u'artifact', u'lowly', u'researching', u'detour', u'pact', u'conducting', u'dong', u'mp', u'interviewing', u'abbey', u'experimenting']

Cluster 3
[u'thespian', u'gutsy', u'meaty', u'asset']

Cluster 4
[u'tanner', u'valdez']

Cluster 5
[u'regional', u'specifically', u'descent', u'traditionally']

Cluster 6
[u'revisit', u'sneakers', u'skipping', u'tivo', u'heres', u'queue', u'nay', u'cough', u'til', u'counting']

Cluster 7
[u'illustrate', u'myriad']

Cluster 8
[u'gist', u'summarized', u'paragraph', u'condensed']

Cluster 9
[u'claws', u'maggots', u'swords', u'organs', u'chickens', u'tattoos', u'knives', u'bats']


## Bag of "centroids"
Now we have a cluster (or "centroid") assignment for each word. 
- We can define a function to convert reviews into bags-of-centroids. 
- This works just like Bag of Words but uses semantically related clusters instead of individual words.
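
As a quick illustration of what such a function produces, here is a minimal sketch on a made-up word-to-cluster map. `toy_word_cluster` and `toy_bag_of_centroids` are hypothetical names used only for this example. The review becomes a vector of per-cluster counts, and out-of-vocabulary words are ignored.

```python
import numpy as np

# Hypothetical word -> cluster assignments (3 clusters in total)
toy_word_cluster = {"knife": 0, "sword": 0, "happy": 1, "glad": 1, "plot": 2}
num_toy_clusters = 3

def toy_bag_of_centroids(wordlist, word_cluster, num_clusters):
    """Count how many words of the review fall into each cluster."""
    bag = np.zeros(num_clusters, dtype="float32")
    for word in wordlist:
        if word in word_cluster:
            bag[word_cluster[word]] += 1
    return bag

review = ["happy", "glad", "sword", "unknownword", "happy"]
bag = toy_bag_of_centroids(review, toy_word_cluster, num_toy_clusters)

# Cluster 0 is hit once ("sword"), cluster 1 three times, cluster 2 never
print(bag)
```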

In [36]:
def create_bag_of_centroids( wordlist, word_centroid_map ):
    
    # The number of clusters is equal to the highest cluster index in the
    # word / centroid map, plus one (cluster indices start at 0)
    num_centroids = max( word_centroid_map.values() ) + 1
    
    # Pre-allocate the bag of centroids vector (for speed)
    bag_of_centroids = np.zeros( num_centroids, dtype="float32" )
    
    # Loop over the words in the review. If the word is in the vocabulary,
    # find which cluster it belongs to, and increment that cluster count by one
    for word in wordlist:
        if word in word_centroid_map:
            index = word_centroid_map[word]
            bag_of_centroids[index] += 1

    return bag_of_centroids

In [37]:
# Pre-allocate an array for the training set bags of centroids (for speed)
train_centroids = np.zeros( (train["review"].size, num_clusters), dtype="float32" )

# Transform the training set reviews into bags of centroids
counter = 0
for review in clean_train_reviews:
    train_centroids[counter] = create_bag_of_centroids( review, word_centroid_map )
    counter += 1

# Repeat for test reviews 
test_centroids = np.zeros(( test["review"].size, num_clusters), dtype="float32" )

counter = 0
for review in clean_test_reviews:
    test_centroids[counter] = create_bag_of_centroids( review, word_centroid_map )
    counter += 1

## Model fitting (random forest)

In [38]:
# Fit a random forest and extract predictions 
forest = RandomForestClassifier(n_estimators = 100)

# Fitting the forest may take a few minutes
print "Fitting a random forest to labeled training data..."
forest = forest.fit(train_centroids,train["sentiment"])
result = forest.predict(test_centroids)

# Write the test results 
output = pd.DataFrame(data={"id":test["id"], "sentiment":result})
# output.to_csv( "BagOfCentroids.csv", index=False, quoting=3 )

Fitting a random forest to labeled training data...
