## Vector Representations: word2vec in Python 3.6

DOCUMENTATION 
https://radimrehurek.com/gensim/models/word2vec.html

https://rare-technologies.com/word2vec-tutorial/

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

### Importing Packages

In [1]:
import numpy as np
import pandas as pd

# --- NLTK PACKAGE ---
import nltk
# Tokenizers
from nltk.tokenize import word_tokenize, sent_tokenize, PunktSentenceTokenizer, RegexpTokenizer
# Stemming and Lemmatizing
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Stopwords
from nltk.corpus import stopwords, state_union, brown, movie_reviews, treebank

# --- GENSIM PACKAGE ---
import gensim, logging
from gensim.models import Word2Vec, Doc2Vec

### Loading Datasets/Inputs

In [2]:
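# Sentences from three NLTK corpora, each sentence a list of tokens
# (fetch the data once with nltk.download('brown'), nltk.download('movie_reviews'), nltk.download('treebank'))
brown_sents = brown.sents()
movie_sents = movie_reviews.sents()
treebank_sents = treebank.sents()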

In [3]:
test_input = '''My name is Pranjal Pathak. 
                My gender is Male. I am 23 years old. 
                I live in Bangalore. I like driving. 
                I have lived in Varanasi before but I like Bangalore more. 
                Phani is a nice girl. Her gender is Female.'''

### Tokenizing

In [4]:
my_sents = sent_tokenize(test_input)

In [5]:
# Split every sentence into word tokens
my_sent_words = [word_tokenize(line) for line in my_sents]

In [13]:
my_sent_words

[['My', 'name', 'is', 'Pranjal', 'Pathak', '.'],
 ['My', 'gender', 'is', 'Male', '.'],
 ['I', 'am', '23', 'years', 'old', '.'],
 ['I', 'live', 'in', 'Bangalore', '.'],
 ['I', 'like', 'driving', '.'],
 ['I',
  'have',
  'lived',
  'in',
  'Varanasi',
  'before',
  'but',
  'I',
  'like',
  'Bangalore',
  'more',
  '.'],
 ['Phani', 'is', 'a', 'nice', 'girl', '.'],
 ['Her', 'gender', 'is', 'Female', '.']]
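
The token lists above keep punctuation and the original casing, so `'.'` becomes a vocabulary entry in its own right. Below is a minimal sketch of an optional cleaning step, assuming that lowercasing and dropping pure-punctuation tokens suits your task. `clean_tokens` is a hypothetical helper, not part of NLTK or gensim.

```python
def clean_tokens(sentences):
    """Lowercase tokens and drop those without any letter or digit (hypothetical helper)."""
    cleaned = []
    for sentence in sentences:
        kept = [tok.lower() for tok in sentence if any(ch.isalnum() for ch in tok)]
        if kept:  # skip sentences that end up empty
            cleaned.append(kept)
    return cleaned

# Two sentences copied from the tokenizer output above
sample = [['My', 'name', 'is', 'Pranjal', 'Pathak', '.'],
          ['I', 'am', '23', 'years', 'old', '.']]
cleaned_sample = clean_tokens(sample)
print(cleaned_sample)
```

The rest of this notebook keeps the original tokens, so lookups such as `'Varanasi'` stay capitalized.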

### Model

In [14]:
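''' MODEL ARCHITECTURE (skip-gram view)

    Vocab (V) = {word1, word2, word3, ..., wordV}; the set of all unique words in the input doc

               Input = word1 as a one-hot vector [1,0,0,...,0]; V dim
        Hidden layer = N neurons, no activation; weight matrix W is V x N, and row i of W is the vector of word i
        Output layer = weight matrix W' is N x V
    Output (softmax) = [0.58, 0.21, 0.11, ...]; V dim, sums to 1 (probability of each word appearing near word1)

    word2vec has a single hidden layer, and its width N equals `size` (100 below).
    gensim defaults to CBOW (sg=0), which averages the context word vectors to predict the current word;
    pass sg=1 to train skip-gram as drawn above.

    KEY--
    `size` is the dimensionality of the word vectors = 100; each word gets 100 features (w0, w1, ..., w99)
    `window` is the maximum distance between the current and predicted word within a sentence.
    `min_count` = ignore all words with total frequency lower than this.
    `hs=1, negative=0` = train with hierarchical softmax and no negative sampling.
    `workers` = number of worker threads used for training.
'''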

# Train the model on our tokenized sentences
# (gensim 4.x renamed `size` to `vector_size`)
model_word2vec = Word2Vec(my_sent_words, size=100, window=10, hs=1, negative=0, workers=4, min_count=1)
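
To make the architecture notes concrete, here is a minimal NumPy sketch of one skip-gram style forward pass with a full softmax. It is an illustration under assumptions, not gensim's implementation: the toy vocabulary, the 3-dimensional hidden layer, and the random untrained weights are all made up, and the cell above actually trains with hierarchical softmax (`hs=1`) and gensim's default CBOW mode.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ['I', 'like', 'Bangalore', 'driving', '.']   # toy vocabulary, V = 5
V, N = len(vocab), 3                                  # N plays the role of `size`

W_in = rng.normal(scale=0.1, size=(V, N))    # input -> hidden weights (one row per word vector)
W_out = rng.normal(scale=0.1, size=(N, V))   # hidden -> output weights

def forward(word):
    """Return the hidden vector and the softmax distribution over the vocabulary."""
    one_hot = np.zeros(V)
    one_hot[vocab.index(word)] = 1.0
    hidden = one_hot @ W_in                        # selects one row of W_in, no activation
    scores = hidden @ W_out                        # one score per vocabulary word
    exp_scores = np.exp(scores - scores.max())     # shift for numerical stability
    return hidden, exp_scores / exp_scores.sum()

hidden, probs = forward('like')
print(hidden.shape, probs.shape, probs.sum())
```

Because the input is one-hot, the hidden vector is simply the row of `W_in` for that word, which is why the rows of the input matrix serve as the word vectors after training.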

### Word2Vec Methods

In [15]:
# The n words most similar to the query, with their cosine similarity (not a probability)
model_word2vec.wv.most_similar('Varanasi', topn=5)

[('.', 0.17852124571800232),
 ('but', 0.16687831282615662),
 ('name', 0.11936233937740326),
 ('gender', 0.11596344411373138),
 ('driving', 0.102081798017025)]
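
The scores above are cosine similarities between word vectors. With only eight short training sentences they are close to noise, which is why `'.'` ranks first. The ranking itself can be sketched as follows, assuming made-up 3-dimensional vectors; `cosine` and `top_similar` are hypothetical helpers, not gensim functions.

```python
import numpy as np

# Toy vectors, invented for illustration only
toy_vectors = {
    'Bangalore': np.array([1.0, 0.2, 0.0]),
    'Varanasi':  np.array([0.9, 0.3, 0.1]),
    'driving':   np.array([0.0, 1.0, 0.2]),
    'gender':    np.array([-0.5, 0.1, 1.0]),
}

def cosine(u, v):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def top_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word` (hypothetical helper)."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = top_similar('Varanasi', toy_vectors)
print(ranked)
```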

In [None]:
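# The word that fits least with the others; expects a list of words, not a single string
model_word2vec.wv.doesnt_match(['Pranjal', 'Phani', 'Bangalore'])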
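
The idea behind an odd-one-out query can be sketched as: normalize each word vector, average them, and return the word least similar to that average. The snippet below is a rough illustration with invented vectors and a hypothetical `odd_one_out` helper; it is not meant to reproduce gensim's code exactly.

```python
import numpy as np

# Toy vectors, invented for illustration only
toy_vectors = {
    'Male':      np.array([1.0, 0.1, 0.0]),
    'Female':    np.array([0.9, 0.2, 0.1]),
    'Bangalore': np.array([0.0, 0.1, 1.0]),
}

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group mean (hypothetical helper)."""
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    centre = unit.mean(axis=0)
    centre /= np.linalg.norm(centre)
    sims = unit @ centre                 # cosine similarity of each word to the centre
    return words[int(np.argmin(sims))]

outlier = odd_one_out(['Male', 'Female', 'Bangalore'], toy_vectors)
print(outlier)
```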

In [None]:
# Cosine similarity between two words, scaled to a percentage
model_word2vec.wv.similarity('Bangalore', 'My')*100

In [None]:
# The 100-dimensional vector of a single word
model_word2vec.wv['Bangalore']

In [None]:
# Log probability of a sentence under the model (requires training with hs=1 and negative=0)
model_word2vec.score(["My name is Pranjal".split()])[0]

In [None]:
# Vector arithmetic: words closest to Male + Female - Pranjal
model_word2vec.wv.most_similar(positive=['Male', 'Female'], negative=['Pranjal'])

In [None]:
# Same query, scored with the multiplicative combination objective instead of vector addition
model_word2vec.wv.most_similar_cosmul(positive=['Male', 'Female'], negative=['Pranjal'])
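
The two queries above differ only in how they combine the positive and negative words. A minimal sketch of both scoring rules follows, assuming invented 3-dimensional vectors and an illustrative query (Female + Pranjal - Male) chosen so that the expected answer is obvious. `cos_add` and `cos_mul` are hypothetical helpers that mirror the additive and multiplicative objectives, not gensim's own code.

```python
import numpy as np

# Toy vectors, invented for illustration only
toy = {
    'Male':      np.array([1.0, 0.0, 0.0]),
    'Female':    np.array([0.0, 1.0, 0.0]),
    'Pranjal':   np.array([0.8, 0.0, 0.6]),
    'Phani':     np.array([0.0, 0.8, 0.6]),
    'Bangalore': np.array([0.1, 0.1, -0.5]),
}
unit = {w: v / np.linalg.norm(v) for w, v in toy.items()}

def cos_add(positive, negative):
    """Additive rule: cosine to the mean of the positive and negated negative vectors."""
    target = np.mean([unit[w] for w in positive] + [-unit[w] for w in negative], axis=0)
    target /= np.linalg.norm(target)
    candidates = [w for w in unit if w not in positive + negative]
    return max(candidates, key=lambda w: float(unit[w] @ target))

def cos_mul(positive, negative, eps=1e-6):
    """Multiplicative rule: product of shifted cosines to positives over negatives."""
    def shifted(w, term):
        return (1.0 + float(unit[w] @ unit[term])) / 2.0   # map cosine from [-1, 1] to [0, 1]
    def score(w):
        numerator = np.prod([shifted(w, t) for t in positive])
        denominator = np.prod([shifted(w, t) for t in negative]) + eps
        return numerator / denominator
    candidates = [w for w in unit if w not in positive + negative]
    return max(candidates, key=score)

add_answer = cos_add(['Female', 'Pranjal'], ['Male'])
mul_answer = cos_mul(['Female', 'Pranjal'], ['Male'])
print(add_answer, mul_answer)
```

Both rules exclude the query words from the candidates. The multiplicative form keeps one large similarity term from dominating the result.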

#### NLTK Corpora

In [None]:
# Train one model per corpus with gensim's default parameters
b = Word2Vec(brown_sents)
mr = Word2Vec(movie_sents)
t = Word2Vec(treebank_sents)

In [None]:
b.wv.most_similar('king', topn=5)