The objective of this experiment is to understand word2vec by seeing it in action.

In this experiment we will use the **Mahabharata** as our text corpus.

#### Keywords

* Word2Vec
* Representation
* Stemming


Count-based representations have two problems:

1. They are costly in terms of memory, because each vector needs one dimension per vocabulary word.

2. They discard all context and meaning of words.


A better approach is a representation called "Word2Vec", which transforms each word into a dense, fixed-length vector. The models trained below use gensim's default of 100 dimensions; 300 is another common choice.
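To make the two drawbacks concrete, here is a minimal sketch in plain Python. The toy vocabulary and the three-dimensional dense vectors are invented for illustration and are not learned from the Mahabharata corpus. The point is only that count-style vectors grow with the vocabulary and treat every pair of distinct words as unrelated, whereas dense vectors can encode similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vocabulary (an assumption, for illustration only).
vocab = ["krishna", "kesava", "chariot", "river"]

def one_hot(word):
    """Count-style vector: one slot per vocabulary word."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hand-made dense vectors (NOT learned; purely illustrative).
dense = {
    "krishna": [0.9, 0.1, 0.0],
    "kesava":  [0.8, 0.2, 0.1],
    "chariot": [0.0, 0.9, 0.4],
}

# The one-hot length equals the vocabulary size, so it grows with the corpus.
print(len(one_hot("krishna")), len(dense["krishna"]))

# Distinct one-hot vectors are always orthogonal: no notion of similarity.
print(cosine(one_hot("krishna"), one_hot("kesava")))

# Dense vectors can place related words closer together than unrelated ones.
print(cosine(dense["krishna"], dense["kesava"]) >
      cosine(dense["krishna"], dense["chariot"]))
```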

#### Importing the required packages

In [1]:
#vector space modeling and topic modeling toolkit
import gensim

# Operating System
import os

# Regular Expression
import re

# nltk packages
from nltk.stem.snowball import SnowballStemmer

# Basic Packages
import numpy as np, pandas as pd
import warnings
warnings.filterwarnings("ignore")

**Snowball** is a small string processing language designed for creating stemming algorithms for use in Information Retrieval. 

#### Creating a new instance of a language specific subclass.

In [10]:
stemmer = SnowballStemmer("english")
# Very similar to the Porter stemmer, though a little faster and a little more aggressive when stemming.

### Preprocessing

1. Cleaning the dataset of text-encoding issues. This is useful when the text contains bytes that are not valid UTF-8, which most often happens when files prepared on Windows are read on a Linux/Unix machine.
2. Creating a vocabulary set that excludes the stopwords.
3. Stemming each word.
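The three steps can be sketched on a single line of text as follows. This is an illustrative sketch, not the notebook's loader: `preprocess_line`, the tiny `STOP_WORDS` set, and the sample sentence are all hypothetical, and the stemmer is passed in as an optional callable so the example runs without NLTK.

```python
import re

# Tiny stand-in stopword set (hypothetical; the notebook reads 'stopwords.txt').
STOP_WORDS = {"the", "of", "and", "to"}

def preprocess_line(raw, stop_words=STOP_WORDS, stem=None):
    """Decode, tokenize, drop stopwords, and optionally stem one line of bytes."""
    # Step 1: latin1 maps every byte to a character, so decoding never fails.
    text = raw.decode("latin1").lower()
    # Step 2: keep purely alphabetic tokens, then filter out stopwords.
    tokens = [t for t in re.findall(r"\b[a-z]+\b", text) if t not in stop_words]
    # Step 3: apply a stemmer if one is supplied,
    # e.g. stem=SnowballStemmer("english").stem
    return [stem(t) for t in tokens] if stem else tokens

# Made-up sample sentence, for illustration only.
print(preprocess_line(b"The sons of Pandu returned to the forest."))
```

Passing `stem=stemmer.stem` would apply the Snowball stemmer created earlier to each surviving token.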

In [20]:
print(stemmer.stem("running"))
print(stemmer.stem("ran"))
print(stemmer.stem("runs"))

print(stemmer.stem("authorize"))
print(stemmer.stem("authorized"))
print(stemmer.stem("authority"))
print(stemmer.stem("authorization"))
print(stemmer.stem("authorizing"))

print(stemmer.stem("matrices"))
print(stemmer.stem("matrix"))
print(stemmer.stem("police"))
print(stemmer.stem("policy"))
print(stemmer.stem("european"))
print(stemmer.stem("europe"))
print(stemmer.stem("stocking"))
print(stemmer.stem("stocks"))

run
ran
run
author
author
author
author
author
matric
matrix
polic
polici
european
europ
stock
stock


In [21]:
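# A set gives fast membership tests (note: read_csv treats the file's first line as a header).
stopWords = set(pd.read_csv('stopwords.txt').values.ravel())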

In [22]:
class Load_Data(object):
    """Memory-friendly iterable that yields one tokenized line at a time."""
    def __init__(self, fnamelist):
        self.fnamelist = fnamelist
        # Vocabulary of every non-stopword token seen so far
        self.vocabulary = set()

    def __iter__(self):
        for fname in self.fnamelist:
            # latin1 decodes any byte sequence, so badly encoded lines never raise
            with open(fname, encoding='latin1') as f:
                for line in f:
                    # Keep purely alphabetic tokens, lowercased
                    words = re.findall(r'\b[a-z]+\b', line.lower())
                    words = [word for word in words if word not in stopWords]
                    self.vocabulary.update(words)
                    yield words

In [24]:
%%time
MB_txt = Load_Data(['MB.txt'])
model = gensim.models.Word2Vec(MB_txt, min_count=100)
#min_count (int, optional) – Ignores all words with total frequency lower than this.

CPU times: user 7min 35s, sys: 3.83 s, total: 7min 39s
Wall time: 7min 39s
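The effect of `min_count` can be mimicked with a plain frequency count. The sketch below follows the documented behaviour quoted in the comment above (words with total frequency lower than the threshold are ignored); it is not gensim's implementation, and `surviving_vocabulary` and the toy sentences are made up for illustration.

```python
from collections import Counter

def surviving_vocabulary(sentences, min_count):
    """Return the words whose total frequency is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# Made-up tokenized sentences, for illustration only.
toy_sentences = [
    ["arjuna", "bow", "arjuna"],
    ["krishna", "arjuna", "chariot"],
    ["krishna", "bow"],
]

# "chariot" occurs only once, so it is dropped when min_count=2.
print(sorted(surviving_vocabulary(toy_sentences, min_count=2)))
```

With `min_count=100`, as used in this notebook, any name that occurs fewer than 100 times in the corpus gets no vector at all, so it cannot be queried later.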


In [2]:
?gensim.models.Word2Vec

In [25]:
model.save("MB2Vec_Without_stemmer.bin")

In [31]:
krishna5_without_stemmer = model.wv.most_similar('krishna', topn=5)

for name, similarity in krishna5_without_stemmer:
    print("Name: {} similarity: {}".format(name, round(similarity, 2)))

print("\n\nChecking if similarity is symmetric")
arjuna5_without_stemmer = model.wv.most_similar('arjuna', topn=5)

for name, similarity in arjuna5_without_stemmer:
    print("Name: {} similarity: {}".format(name, round(similarity, 2)))

Name: kesava similarity: 0.87
Name: vasudeva similarity: 0.78
Name: govinda similarity: 0.76
Name: madhava similarity: 0.75
Name: arjuna similarity: 0.71


Checking if similarity is symmetric
Name: partha similarity: 0.89
Name: dhananjaya similarity: 0.85
Name: kama similarity: 0.84
Name: bhima similarity: 0.81
Name: vrikodara similarity: 0.79
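The lists above show arjuna among krishna's five nearest neighbours, while krishna is absent from arjuna's five. `most_similar` ranks words by cosine similarity, and that score is symmetric; what is not symmetric is membership in a top-k list, because each word is ranked against its own set of competitors. The sketch below illustrates this with hand-made two-dimensional vectors (hypothetical, not taken from the trained model) and a hypothetical `top_k` helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def top_k(word, vectors, k):
    """Rank all other words by cosine similarity to `word` and keep the best k."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Hand-made vectors (an assumption, for illustration only).
vectors = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.6, 0.8],
    "d": [0.5, 0.85],
}

# The similarity score itself is the same in both directions.
print(math.isclose(cosine(vectors["a"], vectors["b"]),
                   cosine(vectors["b"], vectors["a"])))

# "b" is the nearest neighbour of "a", but the nearest neighbour of "b" is "c".
print(top_k("a", vectors, 1)[0][0], top_k("b", vectors, 1)[0][0])
```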


In [33]:
class Load_Data_stemmed(object):
    """Same as Load_Data, but stems every token that survives stopword removal."""
    def __init__(self, fnamelist):
        self.fnamelist = fnamelist
        # Vocabulary of every stemmed, non-stopword token seen so far
        self.vocabulary = set()

    def __iter__(self):
        for fname in self.fnamelist:
            # latin1 decodes any byte sequence, so badly encoded lines never raise
            with open(fname, encoding='latin1') as f:
                for line in f:
                    words = re.findall(r'\b[a-z]+\b', line.lower())
                    # Filter stopwords first, then stem the remaining words
                    words = [stemmer.stem(word) for word in words if word not in stopWords]
                    self.vocabulary.update(words)
                    yield words

Now let us read the data through the class defined above. It acts as a memory-friendly iterator because it yields one line at a time instead of loading the whole file. We then save the trained vectors using Gensim.
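A class with an `__iter__` method is used here rather than a bare generator because training needs to pass over the corpus more than once (to build the vocabulary and then to train), so the corpus must be restartable. The sketch below shows the difference using made-up lines; `one_shot` and `Restartable` are hypothetical names for illustration.

```python
def one_shot(lines):
    """Generator: can be consumed only once."""
    for line in lines:
        yield line.split()

class Restartable:
    """Iterable: each call to __iter__ starts a fresh pass over the data."""
    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()

# Made-up text, for illustration only.
lines = ["arjuna took his bow", "krishna drove the chariot"]

gen = one_shot(lines)
first_pass, second_pass = list(gen), list(gen)
print(len(first_pass), len(second_pass))   # the second pass finds nothing left

corpus = Restartable(lines)
first_pass, second_pass = list(corpus), list(corpus)
print(len(first_pass), len(second_pass))   # both passes see every line
```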

In [34]:
%%time
MB_txt_stemmed = Load_Data_stemmed(['MB.txt'])
model = gensim.models.Word2Vec(MB_txt_stemmed, min_count=100)

CPU times: user 11min 47s, sys: 1.5 s, total: 11min 49s
Wall time: 11min 56s


In [35]:
model.save("MB2Vec_With_stemmer.bin")

Now let us see which words are most similar to certain characters' names.

In [36]:
krishna5_with_stemmer = model.wv.most_similar('krishna', topn=5)
for name, similarity in krishna5_with_stemmer:
    print("Name: {} similarity: {}".format(name, round(similarity, 2)))

Name: kesava similarity: 0.83
Name: madhava similarity: 0.77
Name: vasudeva similarity: 0.76
Name: govinda similarity: 0.76
Name: arjuna similarity: 0.72
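One simple way to compare the two models is to check how much their neighbour lists for 'krishna' overlap. The lists below are copied by hand from the two printed outputs in this notebook; the `jaccard` helper is a hypothetical name, and this is only a rough sketch of a comparison, not an evaluation method used by the original experiment.

```python
# Neighbour lists copied from the printed outputs (without and with stemming).
without_stemmer = ["kesava", "vasudeva", "govinda", "madhava", "arjuna"]
with_stemmer = ["kesava", "madhava", "vasudeva", "govinda", "arjuna"]

def jaccard(a, b):
    """Overlap between two collections, ignoring order (1.0 means identical sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Words whose rank differs between the two lists.
moved = [w for i, w in enumerate(with_stemmer) if without_stemmer[i] != w]

print(jaccard(without_stemmer, with_stemmer))
print(moved)
```

For this query the two models return the same five names and differ only in the ordering of the middle three, which suggests that stemming changes little for proper nouns such as these.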
