# Introduction to NLP: Semantic Analysis

-----

One of the most elusive goals in natural language processing is the identification of semantic meaning from text. In this case, we don't want to simply use text to identify topics or classify documents. Instead we want to use the relationships between words in documents to gain insight from context. One simple way to understand this concept is to see that when words occur in similar arrangements or patterns across documents, the pattern conveys meaning beyond the simple appearance of the words themselves. 

In this notebook, we explore semantic analysis. First, we use the wordnet corpus to identify similar words based on pre-defined relationships. Second, we use the _gensim_ library to create topic models that can be used to compute similarity measures based on the inherent patterns of words within a corpus. Finally, we explore the _word2vec_ algorithm, where a neural network is trained on a large corpus to identify relationships between words and phrases.

-----

## Table of Contents

[Wordnet](#Wordnet)

- [Word Similarities](#Word-Similarities)

[Gensim](#Gensim)

[Word2Vec](#Word2Vec)

-----

Before proceeding with the rest of this notebook, we first define our _home_ directory.

-----

In [1]:
# First we find our HOME directory
home_dir = !echo $HOME

# Build the home directory path with a trailing slash
home = home_dir[0] +'/'

-----

[[Back to TOC]](#Table-of-Contents)

## Wordnet

[Wordnet][wdn] is an English lexical database that groups words into synsets, which is shorthand for synonym sets. The database was created at Princeton University and has been distributed with an open license. Given its nature, it is a different kind of corpus than the Treebank or Brown corpora analyzed in the Topic Modeling notebook. A wordnet entry can have multiple definitions for a word, associated synonyms, lemmas, and other information that can be used to algorithmically identify relationships between words. In the next few Code cells we explore the wordnet corpus, before moving on to using it to compute word similarities.

-----
[wdn]: https://en.wikipedia.org/wiki/WordNet 

In [2]:
# Explore WordNet synonym rings
from nltk.corpus import wordnet as wn

# We limit the number of items in ring to display
max_entries = 5

# Choose a word, change this to see different results.
the_word = 'drive'
the_synsets = wn.synsets(the_word)

# Display summary stats
print(f'{len(the_synsets)} total entries ', end='')
print(f'in synonym ring for {the_word}. ', end='')
print(f'Only showing top {max_entries} entries.')
print(70*'-')

# Step through ring and display data
for synset in the_synsets[:max_entries]:
    vals = synset.name().split('.')
    print(f'{vals[0]} ({vals[1]}): ', end='')
    print(synset.definition())
    print(70*'-')

34 total entries in synonym ring for drive. Only showing top 5 entries.
----------------------------------------------------------------------
drive (n): the act of applying force to propel something
----------------------------------------------------------------------
drive (n): a mechanism by which force or power is transmitted in a machine
----------------------------------------------------------------------
campaign (n): a series of actions advancing a principle or tending toward a particular end
----------------------------------------------------------------------
driveway (n): a road leading up to a private house
----------------------------------------------------------------------
drive (n): the trait of being highly motivated
----------------------------------------------------------------------


In [3]:
# Now we display an example usage (when available) and the lemmas for each synset.

for synset in the_synsets[:max_entries]:
    
    vals = synset.name().split('.')
    print('{0} ({1}): '.format(vals[0], vals[1]), end='')
    if synset.examples():
        print('Example: {0}'.format(synset.examples()[0]))
        
    for lma in synset.lemmas():
        print('    {0}'.format(lma))

    print(60*'-')

drive (n): Example: after reaching the desired velocity the drive is cut off
    Lemma('drive.n.01.drive')
    Lemma('drive.n.01.thrust')
    Lemma('drive.n.01.driving_force')
------------------------------------------------------------
drive (n): Example: a variable speed drive permitted operation through a range of speeds
    Lemma('drive.n.02.drive')
------------------------------------------------------------
campaign (n): Example: he supported populist campaigns
    Lemma('campaign.n.02.campaign')
    Lemma('campaign.n.02.cause')
    Lemma('campaign.n.02.crusade')
    Lemma('campaign.n.02.drive')
    Lemma('campaign.n.02.movement')
    Lemma('campaign.n.02.effort')
------------------------------------------------------------
driveway (n): Example: they parked in the driveway
    Lemma('driveway.n.01.driveway')
    Lemma('driveway.n.01.drive')
    Lemma('driveway.n.01.private_road')
------------------------------------------------------------
drive (n): Example: his drive and energ

-----

[[Back to TOC]](#Table-of-Contents)

### Word Similarities

We can use wordnet to compute word similarities. Wordnet links its synsets through relations such as hypernymy (is-a), so any two synsets can be connected by a path through this hierarchy. Shorter paths generally correspond to more similar words. In the following examples, we select several wordnet synsets and compute three similarity measures between them: path, Leacock-Chodorow, and Wu-Palmer.
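
To make the idea of a path-based measure concrete, the following is a minimal sketch that does not use wordnet at all. It assumes a small, hand-made taxonomy (`toy_hypernyms`, invented for illustration) and scores two concepts as `1 / (1 + d)`, where `d` is the number of edges on the shortest path between them. The helper names `build_graph`, `shortest_path_length`, and `toy_path_similarity` are hypothetical.

```python
from collections import deque

# Hypothetical toy taxonomy: each concept maps to its parent (hypernym).
# These links are invented for illustration and are NOT taken from wordnet.
toy_hypernyms = {
    'man': 'adult',
    'woman': 'adult',
    'adult': 'person',
    'person': 'organism',
    'horse': 'animal',
    'bird': 'animal',
    'animal': 'organism',
}

def build_graph(hypernyms):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in hypernyms.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph

def shortest_path_length(graph, start, goal):
    """Breadth-first search for the number of edges between two nodes."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None  # No path connects the two nodes

def toy_path_similarity(graph, w1, w2):
    """Score two concepts as 1 / (1 + shortest path length)."""
    dist = shortest_path_length(graph, w1, w2)
    return None if dist is None else 1.0 / (1.0 + dist)

graph = build_graph(toy_hypernyms)
for w1, w2 in [('man', 'woman'), ('man', 'horse'), ('woman', 'woman')]:
    print(f'{w1} to {w2}: {toy_path_similarity(graph, w1, w2):4.3f}')
```

In this toy taxonomy, _man_ and _woman_ share the parent _adult_, so they are two edges apart, while a concept compared with itself is zero edges apart and scores 1.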

-----

In [4]:
# Define some words
man = wn.synset('man.n.01')
woman = wn.synset('woman.n.01')
horse = wn.synset('horse.n.01')
bird = wn.synset('bird.n.01')

In [5]:
# Now we print similarity measures.
fmt_str = '{1} to {2}: {0:4.3f}'

print('Path Similarity:')
print(40*'-')
print(fmt_str.format(wn.path_similarity(man, woman), 'man', 'woman'))
print(fmt_str.format(wn.path_similarity(man, horse), 'man', 'horse'))
print(fmt_str.format(wn.path_similarity(man, bird), 'man', 'bird'))
print(fmt_str.format(wn.path_similarity(woman, woman), 'woman', 'woman'))

Path Similarity:
----------------------------------------
man to woman: 0.333
man to horse: 0.077
man to bird: 0.125
woman to woman: 1.000


In [6]:
print('Leacock-Chodorow Similarity:')
print(40*'-')
print(fmt_str.format(wn.lch_similarity(man, woman), 'man', 'woman'))
print(fmt_str.format(wn.lch_similarity(man, horse), 'man', 'horse'))
print(fmt_str.format(wn.lch_similarity(man, bird), 'man', 'bird'))
print(fmt_str.format(wn.lch_similarity(woman, woman), 'woman', 'woman'))

Leacock-Chodorow Similarity:
----------------------------------------
man to woman: 2.539
man to horse: 1.073
man to bird: 1.558
woman to woman: 3.638


In [7]:
print('Wu-Palmer Similarity:')
print(40*'-')
print(fmt_str.format(wn.wup_similarity(man, woman), 'man', 'woman'))
print(fmt_str.format(wn.wup_similarity(man, horse), 'man', 'horse'))
print(fmt_str.format(wn.wup_similarity(man, bird), 'man', 'bird'))
print(fmt_str.format(wn.wup_similarity(woman, woman), 'woman', 'woman'))

Wu-Palmer Similarity:
----------------------------------------
man to woman: 0.667
man to horse: 0.500
man to bird: 0.632
woman to woman: 1.000
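
The Leacock-Chodorow and Wu-Palmer measures both depend on depth within the taxonomy, not just on path length. The sketch below illustrates the two formulas on a hypothetical single-parent toy tree (`toy_parent`, invented for illustration), assuming the root has depth 1. Real wordnet synsets can have several hypernyms, so this is a simplification rather than a reproduction of the NLTK implementation.

```python
import math

# Hypothetical toy tree: child -> parent, with the root mapped to None.
toy_parent = {
    'organism': None,
    'person': 'organism',
    'animal': 'organism',
    'adult': 'person',
    'man': 'adult',
    'woman': 'adult',
    'horse': 'animal',
    'bird': 'animal',
}

def ancestors(node):
    """Return the chain from node up to the root, inclusive."""
    chain = []
    while node is not None:
        chain.append(node)
        node = toy_parent[node]
    return chain

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(node))

def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of both a and b."""
    anc_a = set(ancestors(a))
    for node in ancestors(b):
        if node in anc_a:
            return node
    return None

def edge_distance(a, b):
    """Number of edges on the path from a to b through their common ancestor."""
    lcs = lowest_common_subsumer(a, b)
    return (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))

def toy_wup(a, b):
    """Wu-Palmer: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))

def toy_lch(a, b):
    """Leacock-Chodorow: -log(path length in nodes / (2 * maximum depth))."""
    max_depth = max(depth(node) for node in toy_parent)
    return -math.log((edge_distance(a, b) + 1) / (2.0 * max_depth))

for a, b in [('man', 'woman'), ('man', 'horse'), ('woman', 'woman')]:
    print(f'{a} to {b}: wup = {toy_wup(a, b):4.3f}, lch = {toy_lch(a, b):4.3f}')
```

Because the Leacock-Chodorow score is a negative logarithm scaled by the taxonomy depth, its maximum value is not 1, which is consistent with the _woman to woman_ score shown above.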


-----

<font color='red' size = '5'> Student Exercise </font>

In the preceding cells, we used wordnet to compute word similarities. Now that you have run the notebook, go back and make the following changes to see how the results change.

1. Change the value of `the_word` to a different word, like _ship_ or _throw_. How many synsets does wordnet contain for the new word?
2. Compute the path similarity for a different set of words, like _cat_, _dog_, _boy_, and _girl_.

-----

[[Back to TOC]](#Table-of-Contents)

## Gensim

We can also use topic models built with the gensim library to measure similarity. We first build a topic model from a corpus and then use it to create a similarity index. By transforming new text into the same topic vector space, we can compute similarity measures between the new text and the original documents. In the following Code cells, we build a topic model by applying LDA to our course description text, where each line of the description is treated as a separate document. We then use this model to compute similarity measures between the original text and new sample text.
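
Before a topic model can be built, each document is converted into a bag-of-words vector of `(token id, count)` pairs. The following is a minimal pure-Python sketch of that step. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins, and the ids they assign are an assumption made for illustration, so they need not match the ids that gensim would assign.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(vocab, tokens):
    """Count known tokens and return sorted (token id, count) pairs.

    Tokens that are missing from the vocabulary are silently ignored.
    """
    counts = Counter(vocab[token] for token in tokens if token in vocab)
    return sorted(counts.items())

# Tiny illustrative corpus (two already-tokenized documents)
docs = [['data', 'science', 'course'],
        ['data', 'course', 'data']]

vocab = build_vocab(docs)
print(vocab)                                        # {'data': 0, 'science': 1, 'course': 2}
print(to_bow(vocab, docs[1]))                       # [(0, 2), (2, 1)]
print(to_bow(vocab, 'master data science'.split())) # [(0, 1), (1, 1)]
```

Note that _master_ does not appear in the vocabulary, so it contributes nothing to the vector for the new text.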

-----

In [8]:
# As a text example, we use the following course description.
info_course = ['Advanced Data Science: This class is an asynchronous, ',
               'online course. This course will introduce advanced ',
               'data science concepts by building ',
               'on the foundational concepts presented in the ',
               'prerequisite course: Foundations of Data Science. ', 
               'Students will first learn how to perform more ',
               'statistical data exploration and constructing and ',
               'evaluating statistical models. Next, students will ',
               'learn machine learning techniques including supervised ',
               'and unsupervised learning, dimensional reduction, and ',
               'cluster finding. An emphasis will be placed on the ',
               'practical application of these techniques to ',
               'high-dimensional numerical data, time series data, ',
               'image data, and text data. Finally, students will ',
               'learn to use relational databases and cloud computing ',
               'software components such as Hadoop, Spark, and NoSQL ',
               'data stores. Students must have access to a fairly ',
               'modern computer, ideally that supports hardware ',
               'virtualization, on which they can install software.', 
               'This class is open to sophomores, juniors, seniors ',
               'and graduate students in any discipline who have ',
               'either taken a previous data science course or ',
               'have received instructor permission.']

# Simple stop words
stop_words = set('for a of the and to in on an'.split())

# Parse text into words, make lowercase and remove stop words
txts = [[word for word in sentence.lower().split() if word not in stop_words]
        for sentence in info_course]

# Keep only those words appearing more than once
# Easy with a Counter, but need a flat list
from collections import Counter
frequency = Counter([word for txt in txts for word in txt])

# Now grab tokens that appear more than once
tokens = [[token for token in txt if frequency[token] > 1]
          for txt in txts]

from gensim import corpora
dict_gensim = corpora.Dictionary(tokens)

crps = [dict_gensim.doc2bow(txt) for txt in txts]

from gensim import models

tfidf = models.TfidfModel(crps)

crps_tfidf = tfidf[crps]

lda_gs = models.LdaModel(corpus=crps_tfidf, id2word=dict_gensim, num_topics=5, passes=25)

In [9]:
# Create bag of words vector space representation for sample text.

txt = 'Master Data Science'
vec_bow = dict_gensim.doc2bow(txt.lower().split())
vec_lda = lda_gs[vec_bow]

# Display the LDA topic distribution for the sample text
print(vec_lda)

[(0, 0.73147033332519029), (1, 0.066899613032101776), (2, 0.066689431422646564), (3, 0.068263404515770718), (4, 0.066677217704290678)]


In [10]:
# Compute similarity matrix from the topic model

from gensim import similarities
index = similarities.MatrixSimilarity(lda_gs[crps_tfidf])

In [11]:
# Display sample text values from similarity matrix

index[vec_lda]

array([ 0.99992508,  0.20426318,  0.23270291,  0.28520346,  0.98996985,
        0.22014055,  0.25401837,  0.21876368,  0.24158329,  0.60134262,
        0.28373903,  0.28374803,  0.28392529,  0.21921808,  0.98929203,
        0.60134262,  0.51379514,  0.60134262,  0.60134262,  0.21971339,
        0.23921844,  0.99952495,  0.28356701], dtype=float32)
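
Each value in this array is a cosine similarity between the topic vector of the sample text and the topic vector of one line of the course description. The following is a minimal sketch of that calculation in pure Python. The topic vectors `query`, `doc_a`, and `doc_b` are made-up values chosen for illustration and are not taken from the model above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention used here for an all-zero vector
    return dot / (norm_u * norm_v)

# Hypothetical five-topic distributions (illustrative values only)
query = [0.73, 0.07, 0.07, 0.07, 0.06]
doc_a = [0.70, 0.08, 0.07, 0.08, 0.07]   # Dominated by the same topic as the query
doc_b = [0.07, 0.72, 0.07, 0.07, 0.07]   # Dominated by a different topic

print(f'query to doc_a: {cosine_similarity(query, doc_a):5.3f}')
print(f'query to doc_b: {cosine_similarity(query, doc_b):5.3f}')
```

Documents dominated by the same topic as the query score close to 1, while documents dominated by a different topic score much lower, which matches the two groups of values visible in the array above.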

In [12]:
# Find sentences similar to a given set of words.

import operator

def find_similar_sentances(mdl, dct, sml_idx, txt, prct = 0.9):
    vec_bow = dct.doc2bow(txt.lower().split())
    vec_lda = mdl[vec_bow]
    
    sml = sorted(enumerate(sml_idx[vec_lda]),
                 key=operator.itemgetter(1), reverse=True)
    print('Score| Text')
    print(4*'-', '|', 73*'-')
    
    for idx, val in sml:
        if val > prct:
            print('{0:4.3f}: {1}'.format(val, info_course[idx]))
            print(4*'-', '|', 73*'-')

In [13]:
find_similar_sentances(lda_gs, dict_gensim, index, txt)

Score| Text
---- | -------------------------------------------------------------------------
1.000: Advanced Data Science: This class is an asynchronous, 
---- | -------------------------------------------------------------------------
1.000: either taken a previous data science course or 
---- | -------------------------------------------------------------------------
0.990: prerequisite course: Foundations of Data Science. 
---- | -------------------------------------------------------------------------
0.989: learn to use relational databases and cloud computing 
---- | -------------------------------------------------------------------------


In [14]:
txt = 'evaluate statistical plots'
find_similar_sentances(lda_gs, dict_gensim, index, txt, 0.5)

Score| Text
---- | -------------------------------------------------------------------------
1.000: practical application of these techniques to 
---- | -------------------------------------------------------------------------
0.998: statistical data exploration and constructing and 
---- | -------------------------------------------------------------------------
0.997: learn machine learning techniques including supervised 
---- | -------------------------------------------------------------------------
0.709: and unsupervised learning, dimensional reduction, and 
---- | -------------------------------------------------------------------------
0.709: software components such as Hadoop, Spark, and NoSQL 
---- | -------------------------------------------------------------------------
0.709: modern computer, ideally that supports hardware 
---- | -------------------------------------------------------------------------
0.709: virtualization, on which they can install software.
---- | -------------------------------------------------------------------------

In [15]:
txt = 'learn computing'
find_similar_sentances(lda_gs, dict_gensim, index, txt, 0.75)

Score| Text
---- | -------------------------------------------------------------------------
1.000: learn to use relational databases and cloud computing 
---- | -------------------------------------------------------------------------
1.000: prerequisite course: Foundations of Data Science. 
---- | -------------------------------------------------------------------------
0.993: either taken a previous data science course or 
---- | -------------------------------------------------------------------------
0.987: Advanced Data Science: This class is an asynchronous, 
---- | -------------------------------------------------------------------------


-----

<font color='red' size = '5'> Student Exercise </font>

In the preceding cells, we built an LDA model from our course description text. Now that you have run the notebook, go back and make the following changes to see how the results change.

1. Try using a different set of words. Do the similar sentences make sense? Can you explain why?

2. Try using a different corpus (like the twenty newsgroups data set) to build the LDA model. Do word similarities make more or less sense with this new model?

-----

[[Back to TOC]](#Table-of-Contents)


## Word2Vec

Word2vec is a neural network model, developed at Google in 2013, that provides efficient continuous bag-of-words (CBOW) and skip-gram algorithms for learning word-vector representations. With these representations, words can be compared directly by computing similarity measures between their vectors. The continuous bag-of-words model predicts a word given its context, while the skip-gram model predicts the context given a word. In this notebook we use the word2vec implementation provided in the gensim library. We first create the model from the parsed course description. Given this model, we can compute word similarities.
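
The difference between the two training schemes is easiest to see in the training examples each one generates. The following is a minimal sketch in pure Python, where skip-gram pairs a center word with each of its context words, and continuous bag-of-words pairs the whole context with the center word. The helpers `context_windows`, `skipgram_pairs`, and `cbow_pairs` are hypothetical, and the sketch omits details of a real implementation such as subsampling and negative sampling.

```python
def context_windows(tokens, window):
    """Yield (center word, list of surrounding words) for every position."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def skipgram_pairs(tokens, window):
    """Skip-gram: the input is the center word, the target is one context word."""
    return [(center, ctx)
            for center, context in context_windows(tokens, window)
            for ctx in context]

def cbow_pairs(tokens, window):
    """CBOW: the input is the full context, the target is the center word."""
    return [(context, center)
            for center, context in context_windows(tokens, window)]

tokens = ['advanced', 'data', 'science', 'course']

print('Skip-gram (word -> context word):')
for pair in skipgram_pairs(tokens, window=1):
    print('   ', pair)

print('CBOW (context -> word):')
for pair in cbow_pairs(tokens, window=1):
    print('   ', pair)
```

A larger `window` value produces more context words per position, which is the same role the `window` parameter plays when a model is constructed.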

-----

In [16]:
from gensim.models import Word2Vec

model = Word2Vec(txts, window=2, min_count=1)

In [17]:
# Compute cosine similarity between two words.

def get_similarity(mdl, w1, w2):
    # Word vectors are accessed through the model's wv attribute
    sml = mdl.wv.similarity(w1, w2)
    print(f'{sml:6.3f}: {w1} to {w2}')

get_similarity(model, 'data', 'data')
get_similarity(model, 'data', 'science')
get_similarity(model, 'image', 'data')
get_similarity(model, 'students', 'computing')

 1.000: data to data
 0.127: data to science
 0.035: image to data
 0.097: students to computing


-----

The previous example demonstrated how to use _word2vec_ by using the gensim library. But given the small size of the text document, this didn't really demonstrate the full power of this approach. We now switch to the NLTK movie review corpus, and build a word2vec model from this text data. First, we read the data into the notebook, before tokenizing the data and building the vector space model. Given the large number of words in this corpus, we can compute similarities between a larger number of words, as well as explore relationships between words, all based on the relative occurrences of words in the original corpus.

-----

In [18]:
# Load movie review corpus from the local NLTK data directory
from sklearn.datasets import load_files

# Assumes the corpus has already been downloaded to ~/nltk_data
data_dir = home + 'nltk_data/corpora/movie_reviews'
mvr = load_files(data_dir, shuffle = False)

In [19]:
# Tokenize movie reviews by using a word-punctuation tokenizer

import string

from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

new_mvr = []

# We explicitly cull out punctuation tokens
# (use a distinct loop variable so we do not overwrite mvr)
for review in mvr.data:
    wtks = tokenizer.tokenize(review.decode('utf-8'))
    new_mvr.append([wtk for wtk in wtks if wtk not in string.punctuation])

print(f'Number of reviews in Corpus = {len(new_mvr)}')

Number of reviews in Corpus = 2000


In [20]:
# Build word2vec model from movie reviews
model = Word2Vec(new_mvr, window=5, min_count=5)

In [21]:
# Compute Cosine Similarities from Corpus
get_similarity(model, 'girl', 'girl')
get_similarity(model, 'boy', 'girl')
get_similarity(model, 'woman', 'man')
get_similarity(model, 'woman', 'girl')

 1.000: girl to girl
 0.948: boy to girl
 0.892: woman to man
 0.912: woman to girl


In [22]:
# simple function to display words that are similar to a given word
def show_words(vals, label='Cosine Similarity'):
    print(f'{"Word":14s}: {label}')
    print(40*'-')
    for val in vals:
        print(f'{val[0]:14s}: {val[1]:6.3f}')

In [23]:
# Find the five words most similar to a given word.

vals = model.wv.most_similar('boy', topn=5)
show_words(vals)

Word          : Cosine Similarity
----------------------------------------
girl          :  0.948
jack          :  0.873
woman         :  0.868
man           :  0.858
girlfriend    :  0.852


In [24]:
# Identify words that don't belong

wrd_lst = ['boy', 'horse', 'cow', 'pig']

model.wv.doesnt_match(wrd_lst)

'boy'

In [25]:
# Compute cosine similarity between two sets of words.

val = model.wv.n_similarity(['boy', 'girl'], ['man', 'woman'])

print(f'Cosine Similarity = {val:6.3f}')

Cosine Similarity =  0.908
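
One way to compare two sets of words is to average the vectors in each set and then take the cosine between the two averages. The following is a minimal sketch under that assumption, using made-up three-dimensional vectors (`toy_vectors`) instead of a trained model. The helper `toy_n_similarity` is hypothetical and is not the gensim function itself.

```python
import math

# Hypothetical word vectors, invented for illustration only
toy_vectors = {
    'boy':   [0.9, 0.1, 0.3],
    'girl':  [0.8, 0.2, 0.4],
    'man':   [0.7, 0.1, 0.6],
    'woman': [0.6, 0.2, 0.7],
    'horse': [0.1, 0.9, 0.2],
}

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_n_similarity(vectors, words1, words2):
    """Cosine similarity between the average vectors of two word sets."""
    mean1 = mean_vector([vectors[w] for w in words1])
    mean2 = mean_vector([vectors[w] for w in words2])
    return cosine(mean1, mean2)

val = toy_n_similarity(toy_vectors, ['boy', 'girl'], ['man', 'woman'])
print(f'Cosine Similarity = {val:6.3f}')

val = toy_n_similarity(toy_vectors, ['boy', 'girl'], ['horse'])
print(f'Cosine Similarity = {val:6.3f}')
```

With these toy vectors, the two sets of people score much higher against each other than the set of children does against _horse_.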


In [26]:
# Find similar words (Cosine Similarity)

vals = model.wv.most_similar(positive=['woman', 'girl'], 
                             negative=['man'], topn=5)
show_words(vals)

Word          : Cosine Similarity
----------------------------------------
husband       :  0.839
mother        :  0.834
son           :  0.831
boy           :  0.830
julia         :  0.820
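
Queries with `positive` and `negative` words amount to vector arithmetic: the positive vectors are added, the negative vectors are subtracted, and the remaining words are ranked by cosine similarity to the result. The following is a minimal sketch of that idea with made-up vectors (`toy_vectors`), whose three axes loosely stand for gender, age, and being human. The helper `toy_most_similar` is hypothetical and only approximates what a full implementation does.

```python
import math

# Hypothetical word vectors, invented so that the arithmetic is easy to follow
toy_vectors = {
    'man':   [1.0, 1.0, 1.0],
    'woman': [-1.0, 1.0, 1.0],
    'boy':   [1.0, -1.0, 1.0],
    'girl':  [-1.0, -1.0, 1.0],
    'horse': [0.2, 0.1, -1.0],
}

def unit(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_most_similar(vectors, positive, negative, topn=5):
    """Rank words by cosine similarity to (sum of positive - sum of negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)

    # Words used in the query are excluded from the results
    exclude = set(positive) | set(negative)
    scores = [(word, sum(q * x for q, x in zip(query, unit(vec))))
              for word, vec in vectors.items() if word not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

for word, score in toy_most_similar(toy_vectors, ['woman', 'boy'], ['man']):
    print(f'{word:14s}: {score:6.3f}')
```

In this constructed example, _woman_ + _boy_ - _man_ lands exactly on _girl_. With vectors learned from a real corpus, the results are noisier, as the output above shows.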


In [27]:
# Find similar words (Multiplicative Combination method)

vals = model.wv.most_similar_cosmul('actor', topn=5)
show_words(vals, 'CosMul Similarity')

Word          : CosMul Similarity
----------------------------------------
actress       :  0.929
oscar         :  0.921
performance   :  0.905
role          :  0.896
nomination    :  0.892
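
The multiplicative combination method scores a candidate word differently from the additive approach: each cosine similarity is shifted into the range 0 to 1, the shifted values for the positive words are multiplied together, and the product is divided by the corresponding product for the negative words. The following is a minimal sketch of that scoring rule, assuming a small constant `eps` is added to the denominator to avoid division by zero. The helper `toy_cosmul` and the two-dimensional vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_cosmul(candidate, positive, negative=(), eps=1e-6):
    """Multiplicative score of a candidate vector against query vectors."""
    pos_score = 1.0
    for vec in positive:
        pos_score *= (1.0 + cosine(candidate, vec)) / 2.0   # Shift cosine into [0, 1]
    neg_score = 1.0
    for vec in negative:
        neg_score *= (1.0 + cosine(candidate, vec)) / 2.0
    return pos_score / (neg_score + eps)

# Hypothetical vectors, invented for illustration
query = [1.0, 0.0]
aligned = [2.0, 0.0]      # Same direction as the query
orthogonal = [0.0, 3.0]   # Unrelated to the query

print(f'aligned   : {toy_cosmul(aligned, [query]):6.3f}')
print(f'orthogonal: {toy_cosmul(orthogonal, [query]):6.3f}')
```

With a single positive word and no negative words, as in the _actor_ query above, the score reduces to roughly `(1 + cosine) / 2`, so an unrelated word scores near 0.5 and not near 0.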


-----

<font color='red' size = '5'> Student Exercise </font>

In the preceding cells, we applied the word2vec model to the movie review data set. Before proceeding, do the results make sense? (Feel free to discuss in the class forums.) Now that you have run the notebook, go back and make the following changes to see how the results change.

1. Change the `min_count` and `window` values. How do the results change?
2. Try exploring the relationship between other words, like _cat_, _dog_, _bird_, and _horse_; or other word combinations that are likely to appear in movie reviews.
3. Can you apply word2vec to a different corpus, like the Brown corpus in NLTK? How do the similarity measures change with this new corpus for the same set of words?

-----

## Ancillary Information

The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.

1. Wikipedia article on [Vector Space Models][wvsm]
1. Wikipedia article on [Latent Semantic Analysis (LSA)][wlsa] 
1. Wikipedia article on [Semantic Similarity][wss]
1. Blog discussing the use of [semantic analysis for fashion][bwe] 
1. Implementation of [word2vec in python][wip], via gensim
1. Blog article discussing the use of [word2vec for text analysis][wta]
1. Original Google [word2vec][gw2v] implementation. 

-----

[bwe]: http://developers.lyst.com/2014/11/11/word-embeddings-for-fashion/

[wvsm]: https://en.wikipedia.org/wiki/Vector_space_model
[ww2v]: https://en.wikipedia.org/wiki/Word2vec
[wlsa]: https://en.wikipedia.org/wiki/Latent_semantic_analysis
[wss]: https://en.wikipedia.org/wiki/Semantic_similarity

[gw2v]: https://code.google.com/archive/p/word2vec/
[wip]: http://radimrehurek.com/gensim/models/word2vec.html
[wta]: http://blog.dato.com/practical-text-analysis-using-deep-learning

**&copy; 2017: Robert J. Brunner at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode