# NLP Series - Part 2

#### Word Embeddings - In natural language processing (NLP), a word embedding is a representation of a word for text analysis, typically a real-valued vector that encodes the word's meaning so that words closer together in the vector space are expected to be similar in meaning. Word embeddings are obtained with language modeling and feature learning techniques that map words or phrases from the vocabulary to vectors of real numbers. Conceptually, this is a mathematical embedding from a space with many dimensions per word to a continuous vector space of much lower dimension.
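
##### A minimal sketch of the mapping from a high-dimensional space to a lower-dimensional one, assuming a hypothetical four-word vocabulary and made-up 2-dimensional vectors (none of these values come from a trained model). A one-hot vector has one dimension per vocabulary word, and multiplying it by an embedding matrix selects that word's dense row.

```python
# Toy vocabulary and embedding matrix; all values are invented for illustration.
vocab = ["king", "queen", "apple", "orange"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# One row per word, 2 dimensions per row (real models use far more).
embedding_matrix = [
    [0.9, 0.1],   # king
    [0.8, 0.2],   # queen
    [0.1, 0.9],   # apple
    [0.2, 0.8],   # orange
]


def one_hot(word):
    """Sparse representation: one dimension per vocabulary word."""
    vec = [0.0] * len(vocab)
    vec[word_to_index[word]] = 1.0
    return vec


def embed(word):
    """Dense representation: one-hot vector times the embedding matrix."""
    hot = one_hot(word)
    dims = len(embedding_matrix[0])
    return [
        sum(hot[i] * embedding_matrix[i][d] for i in range(len(vocab)))
        for d in range(dims)
    ]


print(one_hot("queen"))  # 4 dimensions, one per vocabulary word
print(embed("queen"))    # 2 dimensions, identical to embedding_matrix[1]
```

##### In practice the multiplication is replaced by a direct row lookup, and the matrix values are learned during training rather than written by hand.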

##### Software for training and using word embeddings includes Tomas Mikolov's Word2vec, Stanford University's GloVe, GN-GloVe, Flair embeddings, AllenNLP's ELMo, BERT, fastText, Gensim, Indra and Deeplearning4j. Principal Component Analysis (PCA) and T-Distributed Stochastic Neighbour Embedding (t-SNE) are both used to reduce the dimensionality of word vector spaces and visualize word embeddings and clusters.

#### Word2Vec

##### The Word2Vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. In bag-of-words (BoW) and TF-IDF, semantic information is not stored and uncommon words are also given importance, which increases the chance of overfitting. In Word2Vec, by contrast, each word is represented as a vector of dimension 32 or more instead of a single number, so the semantic information and the relations between different words are preserved.

#### Further reading: https://jalammar.github.io/illustrated-word2vec/
##### When dealing with vectors, a common way to calculate a similarity score is cosine similarity, the cosine of the angle between the two vectors.
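
##### As a rough illustration, cosine similarity can be computed by hand as the dot product divided by the product of the vector lengths. The three vectors below are hypothetical and chosen only to make the result easy to check.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional vectors, chosen by hand for illustration.
v_cat = [1.0, 2.0, 0.0]
v_dog = [2.0, 4.0, 0.0]   # same direction as v_cat, twice the length
v_car = [0.0, 0.0, 3.0]   # orthogonal to v_cat

print(cosine_similarity(v_cat, v_dog))  # close to 1: same direction
print(cosine_similarity(v_cat, v_car))  # 0: no shared direction
```

##### The score depends only on direction, not on length, which is why v_cat and v_dog count as maximally similar here.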
##### Words get their embeddings from the other words they tend to appear next to. The mechanics are as follows:

##### 1) We get a lot of text data (say, all Wikipedia articles).
##### 2) We have a window (say, of three words) that we slide across all of that text.
##### 3) The sliding window generates training samples for our model.

##### In each window, we take the first two words to be features and the third word to be a label.
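
##### The sliding-window steps above can be sketched in a few lines of plain Python. The function name and the example sentence are hypothetical and used only for illustration; this mirrors the next-word framing described here, whereas gensim's Word2Vec trains with CBOW or skip-gram, which look at words on both sides of a target.

```python
def sliding_window_samples(tokens, window_size=3):
    """Return (features, label) pairs: the first window_size - 1 words and the last word."""
    samples = []
    for start in range(len(tokens) - window_size + 1):
        window = tokens[start:start + window_size]
        samples.append((window[:-1], window[-1]))
    return samples


# Hypothetical example sentence, not taken from the notebook's data.
tokens = "thou shalt not make a machine".split()
for features, label in sliding_window_samples(tokens):
    print(features, "->", label)
```

##### A sentence of six tokens with a window of three yields four training samples, one per window position.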

In [1]:
%pip install gensim

Note: you may need to restart the kernel to use updated packages.


In [2]:
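import nltk

from gensim.models import Word2Vec
from nltk.corpus import stopwords

import re

# One-time downloads used by the tokenizers and the stopword list below
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt').
nltk.download('punkt')
nltk.download('stopwords')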

In [3]:
paragraph = "Doctor Strange is a 2016 American superhero film based on the Marvel Comics character of the same name. Produced by Marvel Studios and distributed by Walt Disney Studios Motion Pictures, it is the 14th film in the Marvel Cinematic Universe (MCU). The film was directed by Scott Derrickson from a screenplay he wrote with Jon Spaihts and C. Robert Cargill, and stars Benedict Cumberbatch as neurosurgeon Stephen Strange along with Chiwetel Ejiofor, Rachel McAdams, Benedict Wong, Michael Stuhlbarg, Benjamin Bratt, Scott Adkins, Mads Mikkelsen, and Tilda Swinton. In the film, Strange learns the mystic arts after a career-ending car crash."

In [4]:
text = paragraph.lower()
text

'doctor strange is a 2016 american superhero film based on the marvel comics character of the same name. produced by marvel studios and distributed by walt disney studios motion pictures, it is the 14th film in the marvel cinematic universe (mcu). the film was directed by scott derrickson from a screenplay he wrote with jon spaihts and c. robert cargill, and stars benedict cumberbatch as neurosurgeon stephen strange along with chiwetel ejiofor, rachel mcadams, benedict wong, michael stuhlbarg, benjamin bratt, scott adkins, mads mikkelsen, and tilda swinton. in the film, strange learns the mystic arts after a career-ending car crash.'

In [5]:
# Preparing the dataset
sentences = nltk.sent_tokenize(text)
sentences

['doctor strange is a 2016 american superhero film based on the marvel comics character of the same name.',
 'produced by marvel studios and distributed by walt disney studios motion pictures, it is the 14th film in the marvel cinematic universe (mcu).',
 'the film was directed by scott derrickson from a screenplay he wrote with jon spaihts and c. robert cargill, and stars benedict cumberbatch as neurosurgeon stephen strange along with chiwetel ejiofor, rachel mcadams, benedict wong, michael stuhlbarg, benjamin bratt, scott adkins, mads mikkelsen, and tilda swinton.',
 'in the film, strange learns the mystic arts after a career-ending car crash.']

In [6]:
sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
sentences

[['doctor',
  'strange',
  'is',
  'a',
  '2016',
  'american',
  'superhero',
  'film',
  'based',
  'on',
  'the',
  'marvel',
  'comics',
  'character',
  'of',
  'the',
  'same',
  'name',
  '.'],
 ['produced',
  'by',
  'marvel',
  'studios',
  'and',
  'distributed',
  'by',
  'walt',
  'disney',
  'studios',
  'motion',
  'pictures',
  ',',
  'it',
  'is',
  'the',
  '14th',
  'film',
  'in',
  'the',
  'marvel',
  'cinematic',
  'universe',
  '(',
  'mcu',
  ')',
  '.'],
 ['the',
  'film',
  'was',
  'directed',
  'by',
  'scott',
  'derrickson',
  'from',
  'a',
  'screenplay',
  'he',
  'wrote',
  'with',
  'jon',
  'spaihts',
  'and',
  'c.',
  'robert',
  'cargill',
  ',',
  'and',
  'stars',
  'benedict',
  'cumberbatch',
  'as',
  'neurosurgeon',
  'stephen',
  'strange',
  'along',
  'with',
  'chiwetel',
  'ejiofor',
  ',',
  'rachel',
  'mcadams',
  ',',
  'benedict',
  'wong',
  ',',
  'michael',
  'stuhlbarg',
  ',',
  'benjamin',
  'bratt',
  ',',
  'scott',
  'adki

In [7]:
# Remove English stopwords; build the set once so each lookup is fast
stop_words = set(stopwords.words('english'))
sentences = [[word for word in sentence if word not in stop_words] for sentence in sentences]
sentences

[['doctor',
  'strange',
  '2016',
  'american',
  'superhero',
  'film',
  'based',
  'marvel',
  'comics',
  'character',
  'name',
  '.'],
 ['produced',
  'marvel',
  'studios',
  'distributed',
  'walt',
  'disney',
  'studios',
  'motion',
  'pictures',
  ',',
  '14th',
  'film',
  'marvel',
  'cinematic',
  'universe',
  '(',
  'mcu',
  ')',
  '.'],
 ['film',
  'directed',
  'scott',
  'derrickson',
  'screenplay',
  'wrote',
  'jon',
  'spaihts',
  'c.',
  'robert',
  'cargill',
  ',',
  'stars',
  'benedict',
  'cumberbatch',
  'neurosurgeon',
  'stephen',
  'strange',
  'along',
  'chiwetel',
  'ejiofor',
  ',',
  'rachel',
  'mcadams',
  ',',
  'benedict',
  'wong',
  ',',
  'michael',
  'stuhlbarg',
  ',',
  'benjamin',
  'bratt',
  ',',
  'scott',
  'adkins',
  ',',
  'mads',
  'mikkelsen',
  ',',
  'tilda',
  'swinton',
  '.'],
 ['film',
  ',',
  'strange',
  'learns',
  'mystic',
  'arts',
  'career-ending',
  'car',
  'crash',
  '.']]

In [8]:
# Training the Word2Vec model
# min_count=1 keeps every word, including those that occur only once
model = Word2Vec(sentences, min_count=1)

In [9]:
vocab = model.wv.index_to_key
vocab

[',',
 '.',
 'film',
 'strange',
 'marvel',
 'scott',
 'studios',
 'benedict',
 'crash',
 '14th',
 'cinematic',
 '(',
 'universe',
 'motion',
 'mcu',
 ')',
 'directed',
 'pictures',
 'distributed',
 'disney',
 'walt',
 'screenplay',
 'produced',
 'name',
 'character',
 'comics',
 'based',
 'superhero',
 'american',
 '2016',
 'derrickson',
 'wrote',
 'car',
 'jon',
 'career-ending',
 'arts',
 'mystic',
 'learns',
 'swinton',
 'tilda',
 'mikkelsen',
 'mads',
 'adkins',
 'bratt',
 'benjamin',
 'stuhlbarg',
 'michael',
 'wong',
 'mcadams',
 'rachel',
 'ejiofor',
 'chiwetel',
 'along',
 'stephen',
 'neurosurgeon',
 'cumberbatch',
 'stars',
 'cargill',
 'robert',
 'c.',
 'spaihts',
 'doctor']

In [10]:
# Finding Word Vectors
vector = model.wv['strange']
vector

array([-8.2378900e-03,  9.3050422e-03, -1.8904406e-04, -1.9478335e-03,
        4.5926250e-03, -4.1094120e-03,  2.7474777e-03,  6.9532893e-03,
        6.0589323e-03, -7.5170416e-03,  9.3914494e-03,  4.6602450e-03,
        3.9810021e-03, -6.2400880e-03,  8.4662586e-03, -2.1586721e-03,
        8.8269133e-03, -5.3755287e-03, -8.1217652e-03,  6.8169250e-03,
        1.6841607e-03, -2.2070918e-03,  9.5095355e-03,  9.4906222e-03,
       -9.7848363e-03,  2.5084158e-03,  6.1491951e-03,  3.8709536e-03,
        2.0149658e-03,  4.2894550e-04,  6.8211055e-04, -3.8191921e-03,
       -7.1303686e-03, -2.0918027e-03,  3.9249924e-03,  8.8153053e-03,
        9.2613259e-03, -5.9715398e-03, -9.4125913e-03,  9.7590070e-03,
        3.4332643e-03,  5.1673115e-03,  6.2914477e-03, -2.7980648e-03,
        7.3128161e-03,  2.8209265e-03,  2.8691192e-03, -2.3808109e-03,
       -3.1178603e-03, -2.3660229e-03,  4.2701620e-03,  7.3232382e-05,
       -9.5860884e-03, -9.6698040e-03, -6.1387368e-03, -1.2979444e-04,
      

In [11]:
# Most similar words
similar = model.wv.most_similar("marvel")
similar

[('motion', 0.25324559211730957),
 ('based', 0.20136503875255585),
 ('swinton', 0.1956038922071457),
 ('derrickson', 0.17586500942707062),
 ('film', 0.17048057913780212),
 ('universe', 0.15092208981513977),
 ('doctor', 0.1497604250907898),
 ('learns', 0.14596502482891083),
 ('scott', 0.13915996253490448),
 ('directed', 0.10890490561723709)]

#### GloVe

##### GloVe, coined from Global Vectors, is a model for distributed word representation. It is an unsupervised learning algorithm for obtaining vector representations of words by mapping them into a meaningful space where the distance between words is related to their semantic similarity. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. It was developed as an open-source project at Stanford and launched in 2014. As a log-bilinear regression model for unsupervised learning of word representations, it combines the features of two model families: global matrix factorization and local context window methods.
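
##### To make "global word-word co-occurrence statistics" concrete, here is a minimal sketch that counts how often pairs of words appear near each other. The function name, window size, and toy corpus are assumptions for illustration; this covers only the counting step, not the regression that fits vectors to the counts.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Hypothetical toy corpus, already tokenized.
corpus = [
    ["ice", "is", "cold"],
    ["steam", "is", "hot"],
]
counts = cooccurrence_counts(corpus)
print(counts[("ice", "cold")])   # pair seen together in the first sentence
print(counts[("ice", "steam")])  # pair never seen in the same window
```

##### Because the counts are aggregated over the whole corpus before training, the model sees global statistics rather than one context window at a time.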