## word2vec

**Using large amounts of unannotated plain text, word2vec learns relationships between words automatically. The output is a set of vectors, one per word, with remarkable linear relationships that allow us to do things like:**

***1.vec(“king”) - vec(“man”) + vec(“woman”) =~ vec(“queen”)***

***2.vec(“Montreal Canadiens”) – vec(“Montreal”) + vec(“Toronto”) =~ vec(“Toronto Maple Leafs”).***
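***The arithmetic above can be sketched with toy vectors. The 2-dimensional values below are invented for illustration (real word2vec vectors are learned and have no labelled axes), and the `analogy` helper is a hypothetical name, not a gensim function.***

```python
# Toy 2-D "embeddings", invented for illustration: [royalty, maleness].
# Real word2vec vectors are learned from text and are not hand-assigned.
toy_vectors = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, 0.0],
    "prince": [0.8, 1.0],
    "man":    [0.0, 1.0],
    "woman":  [0.0, 0.0],
}

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z
              for x, y, z in zip(toy_vectors[a], toy_vectors[b], toy_vectors[c])]

    def squared_distance(word):
        return sum((t - v) ** 2 for t, v in zip(target, toy_vectors[word]))

    candidates = [w for w in toy_vectors if w not in (a, b, c)]
    return min(candidates, key=squared_distance)

print(analogy("king", "man", "woman"))  # queen
```

***With learned vectors the match is only approximate, which is why the relations above are written with =~ rather than =.***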

***Word2Vec as an improvement over traditional bag-of-words***

## Review: Bag-of-words

**This model transforms each document to a fixed-length vector of integers. For example, given the sentences:**

***"John likes to watch movies. Mary likes movies too".***

***"John also likes to watch football games. Mary hates football".***

**The model outputs the vectors:**

**[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]**

**[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]**

***Each vector has 11 elements, where each element counts the number of times a particular word occurred in the document. The order of elements is arbitrary. In the example above, the order of the elements corresponds to the words:***

***["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games", "hates"]***
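***As a minimal sketch, the two vectors can be reproduced with the standard library alone. The `tokenize` and `bag_of_words` helpers are illustrative names, and the vocabulary is ordered by first appearance, which happens to match the word list above.***

```python
from collections import Counter
import re

documents = [
    "John likes to watch movies. Mary likes movies too",
    "John also likes to watch football games. Mary hates football",
]

def tokenize(text):
    # Keep alphabetic runs only; case is preserved to match the word list above.
    return re.findall(r"[A-Za-z]+", text)

# Build the vocabulary in order of first appearance across the documents.
vocabulary = []
for doc in documents:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary.append(word)

def bag_of_words(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

bow_vectors = [bag_of_words(doc) for doc in documents]
for vector in bow_vectors:
    print(vector)
# [1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]
# [1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]
```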

### Bag-of-words models are surprisingly effective, but have several weaknesses.

**First, they lose all information about word order: “John likes Mary” and “Mary likes John” correspond to identical vectors. A partial solution exists: bag-of-n-grams models represent documents as fixed-length vectors over word phrases of length n, which captures local word order but suffers from data sparsity and high dimensionality.**
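***A short sketch of the word-order problem, using a hypothetical `ngrams` helper: the two sentences have identical word counts, but their bigram counts differ.***

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "John likes Mary".split()
b = "Mary likes John".split()

# Unigram counts are identical, so bag-of-words cannot tell the sentences apart.
same_unigrams = Counter(a) == Counter(b)
# Bigram counts differ because they record which word follows which.
same_bigrams = Counter(ngrams(a, 2)) == Counter(ngrams(b, 2))

print(same_unigrams, same_bigrams)  # True False
```

***The cost is that the number of possible n-grams grows quickly with n, which is where the sparsity and dimensionality problems come from.***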

**Second, the model does not attempt to learn the meaning of the underlying words, and as a consequence, the distance between vectors doesn’t always reflect the difference in meaning.**

***The Word2Vec model addresses this second problem.***


## Introducing: the Word2Vec Model:

**word2vec is a more recent model that embeds words in a lower-dimensional vector space using a shallow neural network.**

**The result is a set of word-vectors where vectors close together in vector space have similar meanings based on context, and word-vectors distant to each other have differing meanings.**

***For example, “strong” and “powerful” would be close together, while “strong” and “Paris” would be relatively far apart.***

### There are two versions of this model, and the Word2Vec class implements them both:

***Skip-grams (SG)***

***Continuous-bag-of-words (CBOW)***


## Skip-grams (SG):

***The word2vec skip-gram model takes in pairs (word1, word2) generated by moving a window across the text data.***

***[Here, window is a parameter of the word2vec algorithm. A typical window size might be 5, meaning 5 words behind and 5 words ahead (10 in total).]***

***It then trains a 1-hidden-layer neural network on a synthetic task: given an input word, predict a probability distribution over the words near it. A virtual one-hot encoding of words goes through a ‘projection layer’ to the hidden layer; these projection weights are later interpreted as the word embeddings. So if the hidden layer has 300 neurons, this network will give us 300-dimensional word embeddings.***
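***The pair generation step can be sketched as follows. This is a simplified illustration, not gensim's implementation: `skipgram_pairs` is a hypothetical helper, and refinements such as subsampling of frequent words are left out.***

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs for every word within `window` positions."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield center, tokens[j]

tokens = ["john", "likes", "to", "watch", "movies"]
pairs = list(skipgram_pairs(tokens, window=1))

print(pairs[:3])
# [('john', 'likes'), ('likes', 'john'), ('likes', 'to')]
```

***Each pair becomes one training example: the first word is the network input and the second is the nearby word it should predict.***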

***For a deeper intuition about skip-gram, see this [tutorial](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/).***

## Continuous-bag-of-words (CBOW):

***The continuous-bag-of-words variant of word2vec is very similar to the skip-gram model. It is also a 1-hidden-layer neural network. The synthetic training task now uses the average of multiple input context words, rather than a single word as in skip-gram, to predict the center word. Again, the projection weights that turn one-hot words into averageable vectors, of the same width as the hidden layer, are interpreted as the word embeddings.***
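***A minimal sketch of the CBOW input side, under stated assumptions: `cbow_examples` and `average` are hypothetical helpers, and the 2-dimensional weights are invented for illustration. It shows how the context words' vectors are averaged into a single hidden-layer input for predicting the center word.***

```python
def cbow_examples(tokens, window):
    """Return (context_words, center_word) training examples."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, center))
    return examples

def average(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical 2-dimensional projection weights, one row per word.
embeddings = {
    "john":   [1.0, 0.0],
    "likes":  [0.0, 1.0],
    "movies": [1.0, 1.0],
}

examples = cbow_examples(["john", "likes", "movies"], window=1)
context, center = examples[1]
hidden = average([embeddings[word] for word in context])

print(context, center, hidden)  # ['john', 'movies'] likes [1.0, 0.5]
```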


## Let's try word2vec with sample data

### Preparing the Input

**Starting from the beginning, gensim’s word2vec expects a sequence of sentences as its input. Each sentence is a list of words (utf8 strings):**
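***The expected shape of the input can be sketched without gensim. Training reads the corpus more than once (once to build the vocabulary, then once per epoch), so the input should be an iterable that can be restarted, not a one-shot generator. `SentenceStream` is a hypothetical class name, and the whitespace tokenization is only for illustration.***

```python
class SentenceStream:
    """Restartable iterable that yields one tokenized sentence at a time."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            # Naive whitespace tokenization, for illustration only.
            yield line.lower().split()

sentences = SentenceStream(["John likes movies", "Mary likes football"])

first_pass = list(sentences)
second_pass = list(sentences)  # iterating again starts from the beginning

print(first_pass)
# [['john', 'likes', 'movies'], ['mary', 'likes', 'football']]
```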

In [2]:
import pandas as pd
import gensim



In [4]:
# Load the fake-news dataset that word2vec will be trained on

df = pd.read_csv("Fake[1]-Copy1.csv")
df.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [5]:
# word2vec will be trained on the text column; inspect one article

df.text[1]

'House Intelligence Committee Chairman Devin Nunes is going to have a bad day. He s been under the assumption, like many of us, that the Christopher Steele-dossier was what prompted the Russia investigation so he s been lashing out at the Department of Justice and the FBI in order to protect Trump. As it happens, the dossier is not what started the investigation, according to documents obtained by the New York Times.Former Trump campaign adviser George Papadopoulos was drunk in a wine bar when he revealed knowledge of Russian opposition research on Hillary Clinton.On top of that, Papadopoulos wasn t just a covfefe boy for Trump, as his administration has alleged. He had a much larger role, but none so damning as being a drunken fool in a wine bar. Coffee boys  don t help to arrange a New York meeting between Trump and President Abdel Fattah el-Sisi of Egypt two months before the election. It was known before that the former aide set up meetings with world leaders for Trump, but team Tr

### Simple Preprocessing & Tokenization
***The first thing to do for any data science task is to clean the data. For NLP, we apply various processing steps such as converting all the words to lower case, trimming spaces, and removing punctuation. We will do the same here.***

***Additionally, we can also remove stop words like 'and', 'or', 'is', 'the', 'a', 'an' and convert words to their root forms like 'running' to 'run'.***
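***A rough sketch of these cleaning steps using only the standard library. The stop-word list is a tiny illustrative sample and `clean` is a hypothetical helper. Reducing words to their root forms needs a stemmer or lemmatizer, which this sketch does not attempt.***

```python
import re

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}

def clean(text):
    """Lowercase, strip punctuation, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

cleaned = clean("The dossier is NOT what started the investigation.")
print(cleaned)
# ['dossier', 'not', 'what', 'started', 'investigation']
```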

In [6]:
# Tokenize and lowercase the text column with gensim's simple_preprocess

clean_text = df.text.apply(gensim.utils.simple_preprocess)
clean_text

0        [donald, trump, just, couldn, wish, all, ameri...
1        [house, intelligence, committee, chairman, dev...
2        [on, friday, it, was, revealed, that, former, ...
3        [on, christmas, day, donald, trump, announced,...
4        [pope, francis, used, his, annual, christmas, ...
                               ...                        
23476    [st, century, wire, says, as, wire, reported, ...
23477    [st, century, wire, says, it, familiar, theme,...
23478    [patrick, henningsen, st, century, wireremembe...
23479    [st, century, wire, says, al, jazeera, america...
23480    [st, century, wire, says, as, wire, predicted,...
Name: text, Length: 23481, dtype: object

In [7]:
clean_text[1]

['house',
 'intelligence',
 'committee',
 'chairman',
 'devin',
 'nunes',
 'is',
 'going',
 'to',
 'have',
 'bad',
 'day',
 'he',
 'been',
 'under',
 'the',
 'assumption',
 'like',
 'many',
 'of',
 'us',
 'that',
 'the',
 'christopher',
 'steele',
 'dossier',
 'was',
 'what',
 'prompted',
 'the',
 'russia',
 'investigation',
 'so',
 'he',
 'been',
 'lashing',
 'out',
 'at',
 'the',
 'department',
 'of',
 'justice',
 'and',
 'the',
 'fbi',
 'in',
 'order',
 'to',
 'protect',
 'trump',
 'as',
 'it',
 'happens',
 'the',
 'dossier',
 'is',
 'not',
 'what',
 'started',
 'the',
 'investigation',
 'according',
 'to',
 'documents',
 'obtained',
 'by',
 'the',
 'new',
 'york',
 'times',
 'former',
 'trump',
 'campaign',
 'adviser',
 'george',
 'papadopoulos',
 'was',
 'drunk',
 'in',
 'wine',
 'bar',
 'when',
 'he',
 'revealed',
 'knowledge',
 'of',
 'russian',
 'opposition',
 'research',
 'on',
 'hillary',
 'clinton',
 'on',
 'top',
 'of',
 'that',
 'papadopoulos',
 'wasn',
 'just',
 'covfefe'

***It's not very thorough, but it's enough to train word2vec.***

### Training the Word2Vec Model
***Train the model on the article text. Use a window of size 10, i.e. 10 words before the current word and 10 words after it. Only words that occur at least 2 times in the corpus are kept in the vocabulary; configure this with the min_count parameter.***

***workers defines how many CPU threads are used for training.***
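***In gensim, `min_count` is a threshold on how often a word occurs across the whole corpus: words with a lower total frequency are ignored. The filtering idea can be sketched like this, where `filter_vocab` is a hypothetical helper and not gensim's own code:***

```python
from collections import Counter

def filter_vocab(sentences, min_count):
    """Keep words whose total corpus frequency is at least `min_count`."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

sample_sentences = [
    ["trump", "dossier", "investigation"],
    ["dossier", "fbi"],
    ["investigation", "dossier"],
]

vocab = filter_vocab(sample_sentences, min_count=2)
print(sorted(vocab))  # ['dossier', 'investigation']
```

***Rare words such as typos or one-off names are dropped this way, since there is too little context to learn a useful vector for them.***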

In [13]:
# Build the model; no corpus is passed yet, so nothing is trained at this point

model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4,
)


### Build Vocabulary (Unique words from data)

In [15]:
model.build_vocab(clean_text, progress_per=1000)

In [16]:
model.epochs

5

In [17]:
model.corpus_count

23481

### Train the Word2Vec Model

In [18]:
model.train(clean_text, total_examples=model.corpus_count, epochs=model.epochs)

(38464089, 48127995)

### Finding Similar Words and Similarity between words

In [19]:

model.wv.most_similar("committee")

[('committees', 0.7663607597351074),
 ('subcommittee', 0.734991729259491),
 ('oversight', 0.7155461311340332),
 ('judiciary', 0.7100284695625305),
 ('nunes', 0.6490797400474548),
 ('chairman', 0.6353448033332825),
 ('chairs', 0.6306191682815552),
 ('select', 0.6213317513465881),
 ('senate', 0.6109331250190735),
 ('bvewdcklwe', 0.6025410294532776)]

In [21]:
model.wv.similarity(w1="russian", w2="australian")

0.07881066
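***The score returned by `model.wv.similarity` is the cosine similarity between the two word vectors, so a value near 0 like the one above means the vectors are close to orthogonal. A pure-Python sketch of the same measure, with a hypothetical `cosine_similarity` helper and made-up 2-dimensional vectors:***

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)  # 1.0 0.0
```

***Because the measure depends only on direction, two vectors of different lengths can still have a similarity of 1.***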