[Original article](https://www.kaggle.com/c/word2vec-nlp-tutorial/overview/part-3-more-fun-with-word-vectors)

In [25]:
import pandas as pd
import csv

train = pd.read_csv("../Part1-BOW/labeledTrainData.tsv", header=0, delimiter='\t', quoting=csv.QUOTE_NONE)
test = pd.read_csv("../Part1-BOW/testData.tsv", header=0, delimiter='\t', quoting=csv.QUOTE_NONE)

In [3]:
from gensim.models import Word2Vec;

In [5]:
model = Word2Vec.load("../Part2-Word-vectors/100features_40minwords_10context");

print(type(model.wv.vectors)) # or model.wv.syn0
print(model.wv.vectors.shape)

<class 'numpy.ndarray'>
(16490, 100)


The number of rows in **vectors** is the number of words in the model's vocabulary, and the number of columns corresponds to the size of the feature vector, which we set in Part 2.

In [13]:
model.wv["flower"].shape

(100,)

#### From Words To Paragraphs, Attempt 1: Vector Averaging
One challenge with the IMDB dataset is the variable-length reviews. We need to find a way to take individual word vectors and transform them into a feature set that is the same length for every review.

Since each word is a vector in 300-dimensional space, we can use vector operations to combine the words in each review. One method we tried was to simply average the word vectors in a given review (for this purpose, we removed stop words, which would just add noise).

In [21]:
import numpy as np  # Make sure that numpy is imported
from tqdm import tqdm
def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given
    # paragraph
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,),dtype="float32")
    #
    nwords = 0.
    # 
    # Index2word is a list that contains the names of the words in 
    # the model's vocabulary. Convert it to a set, for speed 
    index2word_set = set(model.index2word)
    #
    # Loop over each word in the review and, if it is in the model's
    # vocaublary, add its feature vector to the total
    for word in words:
        if word in index2word_set: 
            nwords = nwords + 1.
            featureVec = np.add(featureVec,model[word])
    # 
    # Divide the result by the number of words to get the average
    featureVec = np.divide(featureVec,nwords)
    return featureVec

In [22]:
def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate 
    # the average feature vector for each one and return a 2D numpy array 
    # 
    # Initialize a counter
    counter = 0.
    # 
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")
    # 
    # Loop through the reviews
    for review in tqdm(reviews):       
       # Call the function (defined above) that makes average feature vectors
       reviewFeatureVecs[counter] = makeFeatureVec(review, model, \
           num_features)
       #
       # Increment the counter
       counter = counter + 1.
    return reviewFeatureVecs

In [30]:
# Calculate average feature vectors for training and testing sets,
# using the functions we defined above. Notice that we now use stop word
# removal.
from KaggleWord2VecUtility import KaggleWord2VecUtility

num_features = 100
clean_train_reviews = []
for review in tqdm(train["review"]):
    clean_train_reviews.append(KaggleWord2VecUtility.review_to_wordlist( review, \
        remove_stopwords=True ))

trainDataVecs = getAvgFeatureVecs( clean_train_reviews, model, num_features )

print("Creating average feature vecs for test reviews")

clean_test_reviews = []
for review in tqdm(test["review"]):
    clean_test_reviews.append( KaggleWord2VecUtility.review_to_wordlist( review, \
        remove_stopwords=True ))

testDataVecs = getAvgFeatureVecs( clean_test_reviews, model, num_features )



 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))
100%|███████████████████████████████████████████████████████████████████████████████████████████| 25000/25000 [00:57<00:00, 431.11it/s]
  0%|                                                                                                        | 0/25000 [00:00<?, ?it/s]


AttributeError: 'Word2Vec' object has no attribute 'index2word'