## Word Embeddings in Python with Gensim

In this notebook, you will practice training and loading word embedding models for natural language processing applications in Python using Gensim. It covers:

1. How to train your own word2vec word embedding model on text data.
2. How to visualize a trained word embedding model using Principal Component Analysis.
3. How to load pre-trained word2vec word embedding models.

### Run the two `pip` commands below to install gensim and the `wikipedia` package

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', 
                    level=logging.INFO)

In [2]:
!pip install --upgrade gensim --user

Requirement already up-to-date: gensim in c:\users\bananth\appdata\local\continuum\anaconda3\lib\site-packages (3.6.0)


In [3]:
!pip install wikipedia --user



In [4]:
import pandas as pd

### Import gensim

In [5]:
import gensim

2018-11-04 15:17:23,677 : INFO : 'pattern' package not found; tag filters are not available for English


### Obtain Text

#### Import search and page functions from wikipedia module

`search(keyword)`: takes a keyword and returns the titles of the top 10 matching articles.

`page(title)`: takes an article title and returns a page object whose `content` attribute holds the article text.

In [6]:
# Usage:

from wikipedia import search, page
titles = search("Machine Learning")
wikipage = page(titles[0])
print(wikipage.content)

Machine learning (ML) is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to "learn" (e.g., progressively improve performance on a specific task) from data, without being explicitly programmed.The name machine learning was coined in 1959 by Arthur Samuel. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions, through building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; example applications include email filtering, detection of network intruders, and computer vision.
Machine learning is closely related to (and often overlaps with) computational statistics, which also focuses on prediction-making through the us

### Print the top 10 titles for the keyword `Machine Learning`

In [7]:
titles[0:10]

['Machine learning',
 'Active learning (machine learning)',
 'List of datasets for machine learning research',
 'Boosting (machine learning)',
 'Deep learning',
 'Outline of machine learning',
 'Support vector machine',
 'Supervised learning',
 'Extreme learning machine',
 'Learning']

### Get the content of the first of the 10 titles above

In [8]:
print(page(titles[0]).content)

Machine learning (ML) is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to "learn" (e.g., progressively improve performance on a specific task) from data, without being explicitly programmed.The name machine learning was coined in 1959 by Arthur Samuel. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions, through building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; example applications include email filtering, detection of network intruders, and computer vision.
Machine learning is closely related to (and often overlaps with) computational statistics, which also focuses on prediction-making through the us

### Create a list named `documents` that holds one list of words per page for the 10 titles above

In [9]:
documents = []

# Iterate over the 10 titles; each page becomes one list of space-separated words
for i in range(10):
    wiki_page = page(titles[i])
    documents.append(wiki_page.content.split(' '))

print(len(documents))

10
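The loop above stops with an exception as soon as one title cannot be resolved (the `wikipedia` package raises errors such as `DisambiguationError` and `PageError` from `wikipedia.exceptions`). A minimal sketch of a more forgiving loop follows; `fetch_contents` and the stub classes are hypothetical names introduced only for illustration, and the fetch function is passed in so the sketch runs without network access.

```python
def fetch_contents(titles, fetch_page):
    """Return {title: content} for every title that fetch_page can resolve."""
    contents = {}
    for title in titles:
        try:
            contents[title] = fetch_page(title).content
        except Exception as err:  # e.g. DisambiguationError or PageError
            print(f"skipping {title!r}: {type(err).__name__}")
    return contents


# Hypothetical stand-ins so the example runs offline.
class StubPage:
    def __init__(self, content):
        self.content = content


def stub_fetch(title):
    if title == "Learning":
        raise LookupError("ambiguous title")
    return StubPage(f"text of {title}")


result = fetch_contents(["Machine learning", "Learning"], stub_fetch)
print(list(result))
```

In the notebook itself, the call would presumably be `fetch_contents(titles[:10], page)`, after which each content string can be split into words as before.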


In [10]:
print(documents[0])

['Machine', 'learning', '(ML)', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'uses', 'statistical', 'techniques', 'to', 'give', 'computer', 'systems', 'the', 'ability', 'to', '"learn"', '(e.g.,', 'progressively', 'improve', 'performance', 'on', 'a', 'specific', 'task)', 'from', 'data,', 'without', 'being', 'explicitly', 'programmed.The', 'name', 'machine', 'learning', 'was', 'coined', 'in', '1959', 'by', 'Arthur', 'Samuel.', 'Machine', 'learning', 'explores', 'the', 'study', 'and', 'construction', 'of', 'algorithms', 'that', 'can', 'learn', 'from', 'and', 'make', 'predictions', 'on', 'data', '–', 'such', 'algorithms', 'overcome', 'following', 'strictly', 'static', 'program', 'instructions', 'by', 'making', 'data-driven', 'predictions', 'or', 'decisions,', 'through', 'building', 'a', 'model', 'from', 'sample', 'inputs.', 'Machine', 'learning', 'is', 'employed', 'in', 'a', 'range', 'of', 'computing', 'tasks', 'where', 'designing', 'and', 'programming', 'explicit', 'alg
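Splitting on single spaces leaves punctuation attached to words (`'(ML)'`, `'data,'`, `'programmed.The'`) and treats each whole page as one long "sentence". The sketch below shows one possible alternative using only the standard library: lowercase the text, split it into rough sentences, and keep alphanumeric tokens. The function name `to_sentences` and the regular expressions are assumptions made for illustration, not part of the original notebook.

```python
import re

# A token is a run of letters/digits, optionally joined by "-" or "'".
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")
# Rough sentence boundary: whitespace after . ! ? or any run of newlines.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+|\n+")


def to_sentences(text):
    """Split raw text into lowercase token lists, one list per sentence."""
    sentences = []
    for chunk in SENTENCE_END.split(text):
        tokens = [t.lower() for t in TOKEN.findall(chunk)]
        if tokens:
            sentences.append(tokens)
    return sentences


sample = ("Machine learning (ML) is a field of artificial intelligence.\n"
          "It makes data-driven predictions.")
print(to_sentences(sample))
```

The resulting list of token lists has the same shape that `Word2Vec` expects, so it could be passed in place of `documents`; the sentence splitting is deliberately crude and will mis-split abbreviations.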

### Build a gensim word2vec model that keeps every word (frequency >= 1) and uses an embedding size of 50

In [11]:
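# Parameter names match the gensim 3.x release used in this notebook;
# gensim 4.0+ renames `size` to `vector_size` and `iter` to `epochs`.
model = gensim.models.Word2Vec(documents,   # list of token lists
                               min_count=1, # ignore all words with total frequency lower than this
                               workers=4,   # number of worker threads
                               size=50,     # embedding size
                               window=5,    # maximum distance between current and predicted word
                               iter=10      # number of passes over the text corpus
                              )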

2018-11-04 15:18:11,222 : INFO : collecting all words and their counts
2018-11-04 15:18:11,224 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-11-04 15:18:11,247 : INFO : collected 8539 word types from a corpus of 106528 raw words and 10 sentences
2018-11-04 15:18:11,248 : INFO : Loading a fresh vocabulary
2018-11-04 15:18:11,273 : INFO : effective_min_count=1 retains 8539 unique words (100% of original 8539, drops 0)
2018-11-04 15:18:11,274 : INFO : effective_min_count=1 leaves 106528 word corpus (100% of original 106528, drops 0)
2018-11-04 15:18:11,302 : INFO : deleting the raw counts dictionary of 8539 items
2018-11-04 15:18:11,304 : INFO : sample=0.001 downsamples 12 most-common words
2018-11-04 15:18:11,305 : INFO : downsampling leaves estimated 33388 word corpus (31.3% of prior 106528)
2018-11-04 15:18:11,326 : INFO : estimated required memory for 8539 words and 50 dimensions: 7685100 bytes
2018-11-04 15:18:11,327 : INFO : resetting layer weights
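
To make the `window` parameter concrete, the sketch below lists the (current word, context word) pairs that a fixed window produces for a short token list. It is a simplified illustration, not gensim's training loop: it ignores the downsampling of frequent words reported in the log above, and `context_pairs` is a hypothetical helper name.

```python
def context_pairs(tokens, window):
    """Pair each token with every neighbour at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["machine", "learning", "is", "fun"]
for pair in context_pairs(tokens, window=1):
    print(pair)
```

With `window=5`, as in the model above, each word is paired with up to five neighbours on either side.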


### Exploring the model

#### Check how many words are in the model

In [12]:
model.wv.vectors.shape

(8539, 50)
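
The shape (8539 words, 50 dimensions) also accounts for the memory estimate of 7685100 bytes in the training log. The arithmetic below reproduces that figure under the assumption that the estimate consists of roughly 500 bytes of bookkeeping per vocabulary entry plus two `float32` matrices of this shape (the word vectors and the output-layer weights); the constant names are illustrative.

```python
vocab_size = 8539
dims = 50

BYTES_PER_VOCAB_ENTRY = 500  # assumed per-word overhead
FLOAT32_BYTES = 4            # size of one float32 value

vocab_bytes = vocab_size * BYTES_PER_VOCAB_ENTRY
matrix_bytes = vocab_size * dims * FLOAT32_BYTES  # one (8539, 50) matrix
total_bytes = vocab_bytes + 2 * matrix_bytes

print(total_bytes)  # 7685100
```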

### Get the embedding for the word `SVM`

In [13]:
model.wv['SVM']

array([ 1.0260665 , -1.7938051 ,  0.2126215 ,  0.6050254 , -0.30921805,
       -1.2751584 , -0.3204025 , -0.43595523,  0.09558893,  0.20286699,
       -1.0056553 ,  1.1495671 ,  0.11439896, -0.55809253, -0.03964158,
       -1.0480316 , -1.382683  ,  0.09614433, -0.76734585, -0.1614878 ,
        0.6822112 ,  0.07033832,  0.5604986 ,  0.20448871, -0.2612005 ,
       -1.0575078 ,  0.60800934,  0.810642  ,  0.05581933, -0.46731398,
        0.13011657, -0.46610388, -0.16299851,  0.22774479, -0.12299313,
       -0.0458829 ,  0.45188767,  0.06850111, -1.0703039 ,  0.8813559 ,
       -0.78896797, -0.19473368,  0.76143134,  0.596582  ,  0.42050046,
        0.28942633,  0.4351296 ,  0.10019887,  0.47400358,  0.16303706],
      dtype=float32)

### Find the words most similar to `learning`

In [14]:
model.wv.most_similar('learning')

2018-11-04 15:18:12,215 : INFO : precomputing L2-norms of word weight vectors


[('concept', 0.9993912577629089),
 ('deep', 0.9992951154708862),
 ('shown', 0.9992802143096924),
 ('machine', 0.9991883039474487),
 ('_{i=1}^{n}c_{i}-{\\frac', 0.9991559982299805),
 ('success', 0.9991374015808105),
 ('belief', 0.9987298250198364),
 ('part', 0.9987019300460815),
 ('applied', 0.9986998438835144),
 ('type', 0.9986853003501892)]
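
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal pure-Python sketch of that measure is shown below; the two-dimensional vectors are toy values chosen for illustration, not values from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

A score near 1 means the two vectors point in almost the same direction, regardless of their lengths.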

### Find the word that does not belong among `machine, svm, ball, learning`

In [15]:
model.wv.doesnt_match("machine svm ball learning".split())

'ball'
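
One way to pick the odd word out, and to my understanding the approach gensim takes, is to normalise each vector to unit length, average them, and return the word whose vector is least aligned with that average. The sketch below illustrates the idea on made-up two-dimensional vectors; `odd_one_out` and the toy values are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors):
    """vectors: dict mapping word -> list of floats. Returns the outlier word."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[d] for u in units.values()) / len(units)
                 for d in range(dims)])
    # The outlier has the smallest dot product with the mean direction.
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Toy vectors for illustration only, not taken from the trained model.
toy = {
    "machine": [1.0, 0.1],
    "svm": [0.9, 0.2],
    "learning": [1.0, 0.0],
    "ball": [-0.2, 1.0],
}
print(odd_one_out(toy))
```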

### Save the model with name `word2vec-wiki-10`

In [16]:
model.save('word2vec-wiki-10')

2018-11-04 15:18:12,255 : INFO : saving Word2Vec object under word2vec-wiki-10, separately None
2018-11-04 15:18:12,256 : INFO : not storing attribute vectors_norm
2018-11-04 15:18:12,258 : INFO : not storing attribute cum_table
2018-11-04 15:18:12,336 : INFO : saved word2vec-wiki-10


### Load the model `word2vec-wiki-10`

In [17]:
model = gensim.models.Word2Vec.load('word2vec-wiki-10')

2018-11-04 15:18:12,343 : INFO : loading Word2Vec object from word2vec-wiki-10
2018-11-04 15:18:12,401 : INFO : loading wv recursively from word2vec-wiki-10.wv.* with mmap=None
2018-11-04 15:18:12,403 : INFO : setting ignored attribute vectors_norm to None
2018-11-04 15:18:12,403 : INFO : loading vocabulary recursively from word2vec-wiki-10.vocabulary.* with mmap=None
2018-11-04 15:18:12,407 : INFO : loading trainables recursively from word2vec-wiki-10.trainables.* with mmap=None
2018-11-04 15:18:12,408 : INFO : setting ignored attribute cum_table to None
2018-11-04 15:18:12,409 : INFO : loaded word2vec-wiki-10
