# Word Embedding Approaches

Several word embedding approaches currently exist and all of them have their pros and cons. We will discuss three of them here:
1. Bag of Words
2. TF-IDF Scheme
3. Word2Vec

## Bag of Words

The bag of words approach is one of the simplest word embedding approaches. 

### Pros and Cons of Bag of Words

pros:
The main advantage of the bag of words approach is that you do not need a very huge corpus of words to get good results.

cons:
A major drawback of the bag of words approach is the fact that we need to create huge vectors with empty spaces in order to represent a number (sparse matrix) which consumes memory and space. 

Another major issue with the bag of words approach is the fact that it doesn't maintain any context information. It doesn't care about the order in which the words appear in a sentence. 

A type of bag of words approach, known as n-grams, can help maintain the relationship between words. N-gram refers to a contiguous sequence of n words.  Although then-grams approach is capable of capturing relationships between words, the size of the feature set grows exponentially with too many n-grams.

## TF-IDF Scheme

The TF-IDF scheme is a type of bag words approach where instead of adding zeros and ones in the embedding vector, you add floating numbers that contain more useful information compared to zeros and ones. 

The idea behind TF-IDF scheme is the fact that words having a high frequency of occurrence in one document, and less frequency of occurrence in all the other documents, are more crucial for classification.

TF-IDF is a product of two values: Term Frequency (TF) and Inverse Document
Frequency (IDF).

### Pros and Cons of TF-IDF

Though TF-IDF is an improvement over the simple bag of words approach and yields better results for common NLP tasks, the overall pros and cons remain the same. We still need to create a huge sparse matrix, which also takes a lot more computation than the simple bag of words approach.

## Word2Vec

Word2Vec approach uses deep learning and neural networks-based techniques to
convert words into corresponding vectors in such a way that the semantically similar vectors are close to each other in N-dimensional space, where N refers to the dimensions of the vector.

Word2Vec returns some astonishing results. Word2Vec's ability to maintain semantic relation is reflected by a classic example.

Word2Vec model comes in two flavors: Skip Gram Model and Continuous Bag of
Words Model (CBOW).

Skip Gram model
In the Skip Gram model, the context words are predicted using the base word. For instance, given a sentence "I love to dance in the rain", the skip gram model will predict "love" and "dance" given the word "to" as input.

CBOW model
the CBOW model will predict "to", if the context words "love" and
"dance" are fed as input to the model. The model learns these relationships using deep neural networks.

### Pros and Cons of Word2Vec

Word2Vec has several advantages over bag of words and IF-IDF scheme. Word2Vec
retains the semantic meaning of different words in a document. The context
information is not lost. 

Another great advantage of Word2Vec approach is that the size of the embedding vector is very small. Each dimension in the embedding vector contains information about one aspect of the word. We do not need huge sparse vectors, unlike the bag of words and TF-IDF approaches.

# Word2Vec in Python with Gensim Library

## scrape the Wikipedia article (Creating Corpus)

we will create a Word2Vec model using a Single Wikipedia article. Our model will not be as good as Google's Word2Vec model (https://code.google.com/archive/p/word2vec/). Although, it is good enough to explain how Word2Vec model can be implemented using the Gensim library

In [5]:
import bs4 as bs
import urllib.request
import re
import nltk
from nltk.corpus import stopwords
from gensim.models import Word2Vec

In [2]:
# download the Wikipedia article
scrapped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')
article = scrapped_data .read()

#  read the article content and parse it
parsed_article = bs.BeautifulSoup(article,'lxml')
paragraphs = parsed_article.find_all('p')

article_text = ""
for p in paragraphs:
    article_text += p.text

## Preprocessing

In [4]:
# Cleaning the text
processed_article = article_text.lower() #convert all the text to lowercase
processed_article = re.sub('[^a-zA-Z]', ' ', processed_article ) #remove all the digits, special characters
processed_article = re.sub(r'\s+', ' ', processed_article) #extra spaces from the text

# Preparing the dataset
all_sentences = nltk.sent_tokenize(processed_article)
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]

# Removing Stop Words
for i in range(len(all_words)):
    all_words[i] = [w for w in all_words[i] if w not in stopwords.words('english')]

## Creating Word2Vec Model

We need to specify the value for the parameter. A value of 2 for specifies to include only those words in the Word2Vec model that appear at least twice in the corpus.

In [6]:
word2vec = Word2Vec(all_words, min_count=2)

In [7]:
# To see the dictionary of unique words that exist at least twice in the corpus
vocabulary = word2vec.wv.vocab
print(vocabulary)

{'computer': <gensim.models.keyedvectors.Vocab object at 0x0000019DA2B9F348>, 'science': <gensim.models.keyedvectors.Vocab object at 0x0000019DA7DCA388>, 'artificial': <gensim.models.keyedvectors.Vocab object at 0x0000019DA7DCA588>, 'intelligence': <gensim.models.keyedvectors.Vocab object at 0x0000019DA7DCA608>, 'ai': <gensim.models.keyedvectors.Vocab object at 0x0000019DA7DCA788>, 'sometimes': <gensim.models.keyedvectors.Vocab object at 0x0000019DA7DCA8C8>, 'called': <gensim.models.keyedvectors.Vocab object at 0x0000019DA7DCAA88>, 'machine': <gensim.models.keyedvectors.Vocab object at 0x0000019DA7DCAA08>, 'demonstrated': <gensim.models.keyedvectors.Vocab object at 0x0000019DA7DCA448>, 'machines': <gensim.models.keyedvectors.Vocab object at 0x0000019DA7DCA888>, 'contrast': <gensim.models.keyedvectors.Vocab object at 0x0000019DA7DCAC08>, 'natural': <gensim.models.keyedvectors.Vocab object at 0x0000019DA7DCACC8>, 'displayed': <gensim.models.keyedvectors.Vocab object at 0x0000019DA7DCA0C8

# Model Analysis

### Finding Vectors for a Word

the Word2Vec model converts words to their corresponding vectors.
Let's see how we can view vector representation of any particular word

In [9]:
v1 = word2vec.wv['artificial']

By default, a hundred dimensional vector is created by Gensim Word2Vec. 
If the minimum frequency of occurrence is set to 1, the size of the bag of words vector will further increase. On the other hand, vectors generated through Word2Vec are not affected by the size of the vocabulary.

### Finding Similar Words

In [11]:
sim_words = word2vec.wv.most_similar('intelligence')
sim_words

[('human', 0.7360125184059143),
 ('ai', 0.730013370513916),
 ('artificial', 0.6493045091629028),
 ('many', 0.6187971830368042),
 ('machine', 0.6171556711196899),
 ('research', 0.6143630743026733),
 ('problem', 0.6074798107147217),
 ('example', 0.6073349118232727),
 ('one', 0.6064658761024475),
 ('used', 0.5999341607093811)]