# Building a word vector using the skip-gram and CBOW models

Import the modules.

In [1]:
import re
import nltk
import gensim
import pandas as pd
from nltk.corpus import stopwords
from gensim.models import Word2Vec

Read the airline tweets sentiment dataset, which contains comments (text) related to airlines and their corresponding sentiment. The dataset can be obtained from https://d1p17r2m4rzlbo.cloudfront.net/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv

In [2]:
data = pd.read_csv('https://www.dropbox.com/s/8yq0edd4q908xqw/airline_sentiment.csv?dl=1')

A sample of the dataset looks as follows:

In [3]:
data.head()

| Unnamed: 0 | airline_sentiment | text |
|---|---|---|
| 0 | 1 | @VirginAmerica plus you've added commercials t... |
| 1 | 0 | @VirginAmerica it's really aggressive to blast... |
| 2 | 0 | @VirginAmerica and it's a really big bad thing... |
| 3 | 0 | @VirginAmerica seriously would pay $30 a fligh... |
| 4 | 1 | @VirginAmerica yes, nearly every time I fly VX... |


Preprocess the text as follows:
* Normalize every word to lowercase.
* Remove punctuation, retaining only digits and letters.
* Remove stop words.

In [4]:
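# Requires the NLTK stop-word list; run nltk.download('stopwords') once if it is missing.
stop = set(stopwords.words('english'))

def preprocess(text):
    text = text.lower()                                 # normalize to lowercase
    text = re.sub(r'[^0-9a-z]+', ' ', text)             # keep only digits and letters
    words = [w for w in text.split() if w not in stop]  # drop stop words
    return ' '.join(words)

data['text'] = data['text'].apply(preprocess)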

After preprocessing, the dataset looks as follows:

In [5]:
data.head()

| Unnamed: 0 | airline_sentiment | text |
|---|---|---|
| 0 | 1 | virginamerica plus added commercials experienc... |
| 1 | 0 | virginamerica really aggressive blast obnoxiou... |
| 2 | 0 | virginamerica really big bad thing |
| 3 | 0 | virginamerica seriously would pay 30 flight se... |
| 4 | 1 | virginamerica yes nearly every time fly vx ear... |


Split each sentence into a list of tokens so that it can be passed to gensim. The tokens of the first sentence look as follows:

In [6]:
data['text'][0].split()

['virginamerica', 'plus', 'added', 'commercials', 'experience', 'tacky']

Tokenize every sentence in the same way and collect the results in a list of lists, as follows:

In [7]:
# One token list per tweet
list_words = [text.split() for text in data['text']]

Let's inspect the first two lists within the list of lists:

In [8]:
list_words[:2]

[['virginamerica', 'plus', 'added', 'commercials', 'experience', 'tacky'],
 ['virginamerica',
  'really',
  'aggressive',
  'blast',
  'obnoxious',
  'entertainment',
  'guests',
  'faces',
  'amp',
  'little',
  'recourse']]

Build the Word2Vec model. Define the vector size, the context window size, and the minimum number of occurrences a word needs in order to receive a word vector:
* size is the dimension of the word vectors (this parameter is named vector_size in gensim 4.0 and later).
* window is the maximum number of words on each side of a word that are treated as its context.
* min_count is the minimum frequency a word must have in the corpus to be kept in the vocabulary.
* sg selects the training algorithm: skip-gram when sg=1, CBOW when sg=0.
* alpha is the initial learning rate (0.025 is the default).
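To make the difference between the two values of sg concrete, the following is a minimal sketch of the training examples each algorithm derives from one tokenized sentence. The helper names (context_windows, cbow_examples, skipgram_examples) are hypothetical and are not part of gensim. The sketch also uses a fixed window, whereas the real implementation additionally varies the effective window size and may subsample frequent words.

```python
def context_windows(tokens, window):
    """Yield (target, context_words) for every position in a tokenized sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_examples(tokens, window):
    # CBOW (sg=0): the context words jointly predict the target word.
    return [(context, target) for target, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window):
    # Skip-gram (sg=1): the target word predicts each context word separately.
    return [(target, c)
            for target, context in context_windows(tokens, window)
            for c in context]

sentence = ['virginamerica', 'really', 'big', 'bad', 'thing']
print(cbow_examples(sentence, window=1)[1])
# (['virginamerica', 'big'], 'really')
print(skipgram_examples(sentence, window=1)[:3])
# [('virginamerica', 'really'), ('really', 'virginamerica'), ('really', 'big')]
```

CBOW produces one example per word position, while skip-gram produces one example per (word, context word) pair, so skip-gram sees more training examples from the same text.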

In [9]:
model = Word2Vec(size=100,window=5,min_count=30, sg=0, alpha = 0.025)

Once the model is defined, we pass our list of lists to it to build the vocabulary:

In [10]:
model.build_vocab(list_words)
model.corpus_count

11541

Once the vocabulary is built, the words that remain after filtering out those that occur fewer than 30 times in the whole corpus can be listed as follows (in gensim 4.0 and later, use model.wv.key_to_index instead of model.wv.vocab):

In [11]:
model.wv.vocab.keys()

dict_keys(['virginamerica', 'plus', 'experience', 'really', 'amp', 'little', 'big', 'bad', 'thing', 'seriously', 'would', 'pay', '30', 'flight', 'seats', 'flying', 'yes', 'every', 'time', 'fly', 'go', 'away', 'well', 'amazing', 'arrived', 'hour', 'early', 'good', 'lt', '3', 'pretty', 'much', 'better', 'great', 'deal', 'already', '2nd', 'trip', 'even', '1st', 'yet', 'u', 'take', 'travel', 'http', 'co', 'thanks', 'sfo', 'still', 'mia', 'first', 'lax', 'mco', 'heard', 'nothing', 'things', 'flew', 'nyc', 'last', 'week', 'sit', 'seat', 'due', 'two', 'either', 'help', 'know', 'awesome', 'bos', 'fll', 'please', 'want', 'may', 'three', 'times', 'available', 'love', 'feel', 'guys', 'gave', 'free', 'status', 'weeks', 'called', 'response', 'happened', '2', 'ur', 'food', 'options', 'least', 'say', 'site', 'able', 'anything', 'next', '6', 'hrs', 'fail', 'get', 'air', 'hi', 'cool', 'add', 'name', 'booking', 'problems', 'left', 'iad', 'today', 'one', 'answering', 'f', 'number', 'return', 'phone', 'ca
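The effect of min_count can be reproduced by hand with a frequency count. The sketch below is illustrative only: filter_vocab and the toy sentences are hypothetical and not part of gensim, which performs this filtering internally in build_vocab.

```python
from collections import Counter

def filter_vocab(sentences, min_count):
    """Count every token and keep only words seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_sentences = [
    ['flight', 'late', 'again'],
    ['flight', 'cancelled'],
    ['great', 'flight', 'late'],
]
vocab = filter_vocab(toy_sentences, min_count=2)
print(vocab)  # {'flight': 3, 'late': 2}
```

Words below the threshold ('again', 'cancelled', 'great') receive no vector, which is why looking up a rare word in the trained model raises a KeyError.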

Train the model by specifying the number of examples (token lists) to consider and the number of epochs to run:
* list_words (the list of token lists) is the input.
* total_examples is the number of token lists in the corpus.
* epochs is the number of passes over the corpus.

In [12]:
model.train(list_words, total_examples=model.corpus_count, epochs=100)

(6807709, 12585800)

Extract the word vector of a given word (month) as follows:

In [13]:
model.wv['month']

array([ 1.8686405 ,  0.6634897 ,  0.11039618,  0.9978436 ,  2.7559671 ,
       -0.0642437 , -0.8825282 ,  0.46670377,  0.644586  , -0.3504269 ,
        1.4642848 ,  0.0852625 , -0.9671676 ,  0.82810646, -2.6629035 ,
       -0.5137696 , -0.15381114, -0.5468838 ,  1.5480644 ,  1.1200687 ,
       -0.02062936, -0.81567293,  0.59633356, -2.3973215 ,  0.44290587,
        1.0673373 ,  0.33287972, -1.367445  , -2.3053613 , -2.91107   ,
        0.8846327 ,  0.2501367 ,  1.7979026 ,  0.44279793, -0.62871486,
        1.9821651 , -0.37631744,  0.60704523,  0.11040852, -0.27083874,
        2.1312585 , -1.2533792 , -4.6724534 ,  0.801622  ,  0.32228664,
        1.0946869 ,  0.15452932, -0.01446615, -0.17676151, -0.39306486,
        0.21955755,  0.08783851,  0.3999704 ,  0.1595587 ,  0.70441324,
        1.6002792 , -1.2129657 ,  1.5463489 ,  2.9143138 , -2.3764787 ,
       -1.400456  ,  1.3655556 , -1.2308736 , -0.30149913, -0.404759  ,
        0.68080676,  1.5798203 ,  0.13690098,  0.43629053,  2.78

The similarity between two words can be calculated as follows:

In [14]:
model.wv.similarity('month','year')

0.49794415
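The score reported here is the cosine similarity between the two word vectors. As a minimal sketch of what that number measures, the function below computes it in plain Python; cosine_similarity and the three-dimensional toy vectors are hypothetical stand-ins for the 100-dimensional vectors learned by the model.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, not taken from the trained model
month = [1.0, 2.0, 0.0]
year = [2.0, 1.0, 0.0]
print(round(cosine_similarity(month, year), 2))  # 0.8
```

The value ranges from -1 (opposite directions) to 1 (same direction), and it ignores vector length, so only the direction of the embeddings matters.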

The words that are most similar to a given word are found as follows:

In [15]:
model.wv.most_similar('month')

[('year', 0.4979441165924072),
 ('week', 0.48616230487823486),
 ('months', 0.3903491497039795),
 ('weeks', 0.35901960730552673),
 ('leg', 0.3421023488044739),
 ('row', 0.29487115144729614),
 ('days', 0.2835123836994171),
 ('account', 0.26402658224105835),
 ('lt', 0.26270467042922974),
 ('miles', 0.25241610407829285)]
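Conceptually, this ranking can be obtained by scaling every vector to unit length, taking the dot product of the query word with every other word, and sorting the results. The sketch below illustrates that idea under stated assumptions: unit, most_similar, and the two-dimensional toy_vectors are hypothetical, and the library's own implementation is vectorized rather than a Python loop.

```python
import math

def unit(v):
    """Scale a vector to length 1 so that dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar(word, vectors, topn=2):
    """Rank every other word by its dot product with the unit-length query vector."""
    query = unit(vectors[word])
    scores = []
    for other, vec in vectors.items():
        if other == word:
            continue  # a word is not its own neighbour
        score = sum(a * b for a, b in zip(query, unit(vec)))
        scores.append((other, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors, not taken from the trained model
toy_vectors = {
    'month': [1.0, 1.0],
    'year': [1.0, 0.8],
    'week': [0.5, 1.0],
    'seat': [-1.0, 0.5],
}
neighbours = most_similar('month', toy_vectors)
print([name for name, _ in neighbours])  # ['year', 'week']
```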

Note that these similarities look low and some of the most similar words are not intuitive. The results become more realistic when the model is trained on a much larger corpus than the roughly 11,000 tweets used here.

Let's look at the words most similar to "month" when the model is trained for only a few epochs:

In [16]:
model = Word2Vec(size=100,window=5,min_count=30, sg=0)
model.build_vocab(list_words)
model.train(list_words, total_examples=model.corpus_count, epochs=5)
model.wv.most_similar('month')

[('years', 0.9995688796043396),
 ('high', 0.9995490312576294),
 ('sw', 0.9995419383049011),
 ('means', 0.9995293617248535),
 ('around', 0.9995282292366028),
 ('fault', 0.9995271563529968),
 ('extremely', 0.9995233416557312),
 ('also', 0.9995226860046387),
 ('asking', 0.9995217323303223),
 ('case', 0.9995102882385254)]