<DIV ALIGN=CENTER>

# Introduction to NLP: Semantic Analysis
## Professor Robert J. Brunner
  
</DIV>  
-----
-----


## Introduction

Semantic analysis aims to derive meaning from words, chunks, and tokens. This Notebook explores three tools for doing so:

- wordnet
- gensim
- word2vec


-----



In [1]:
from nltk.corpus import wordnet as wn

max_entries = 5
the_word = 'drive'
the_synsets = wn.synsets(the_word)

print('{0} total entries in synonym ring for {1}. '.format(len(the_synsets), the_word), end='')
print('Only showing top {0} entries.'.format(max_entries))
print(70*'-')

for synset in the_synsets[:max_entries]:
    vals = synset.name().split('.')
    print('{0} ({1}): '.format(vals[0], vals[1]), end='')
    print(synset.definition())
    print(70*'-')

34 total entries in synonym ring for drive. Only showing top 5 entries.
----------------------------------------------------------------------
drive (n): the act of applying force to propel something
----------------------------------------------------------------------
drive (n): a mechanism by which force or power is transmitted in a machine
----------------------------------------------------------------------
campaign (n): a series of actions advancing a principle or tending toward a particular end
----------------------------------------------------------------------
driveway (n): a road leading up to a private house
----------------------------------------------------------------------
drive (n): the trait of being highly motivated
----------------------------------------------------------------------


In [2]:
for synset in the_synsets[:max_entries]:
    
    vals = synset.name().split('.')
    print('{0} ({1}): '.format(vals[0], vals[1]), end='')
    if synset.examples():
        print('Example: {0}'.format(synset.examples()[0]))
        
    for lma in synset.lemmas():
        print('    {0}'.format(lma))

    print(60*'-')

drive (n): Example: after reaching the desired velocity the drive is cut off
    Lemma('drive.n.01.drive')
    Lemma('drive.n.01.thrust')
    Lemma('drive.n.01.driving_force')
------------------------------------------------------------
drive (n): Example: a variable speed drive permitted operation through a range of speeds
    Lemma('drive.n.02.drive')
------------------------------------------------------------
campaign (n): Example: he supported populist campaigns
    Lemma('campaign.n.02.campaign')
    Lemma('campaign.n.02.cause')
    Lemma('campaign.n.02.crusade')
    Lemma('campaign.n.02.drive')
    Lemma('campaign.n.02.movement')
    Lemma('campaign.n.02.effort')
------------------------------------------------------------
driveway (n): Example: they parked in the driveway
    Lemma('driveway.n.01.driveway')
    Lemma('driveway.n.01.drive')
    Lemma('driveway.n.01.private_road')
------------------------------------------------------------
drive (n): Example: his drive and energ

In [3]:
# Define some words
man = wn.synset('man.n.01')
woman = wn.synset('woman.n.01')
horse = wn.synset('horse.n.01')
bird = wn.synset('bird.n.01')

In [4]:
fmt_str = '{1} to {2}: {0:4.3f}'

print('Path Similarity:')
print(40*'-')
print(fmt_str.format(wn.path_similarity(man, woman), 'man', 'woman'))
print(fmt_str.format(wn.path_similarity(man, horse), 'man', 'horse'))
print(fmt_str.format(wn.path_similarity(man, bird), 'man', 'bird'))
print(fmt_str.format(wn.path_similarity(woman, woman), 'woman', 'woman'))

Path Similarity:
----------------------------------------
man to woman: 0.333
man to horse: 0.077
man to bird: 0.125
woman to woman: 1.000
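
The scores above are consistent with the usual definition of path similarity, `1 / (1 + d)`, where `d` is the number of edges on the shortest path between two synsets in the hypernym hierarchy. The following is a minimal sketch of that arithmetic only. The edge counts are not queried from WordNet; they are assumptions inferred from the printed scores.

```python
def path_similarity(distance):
    """Path similarity for a shortest path measured in edges."""
    return 1.0 / (1.0 + distance)

# Assumed edge counts, inferred from the scores printed above
distances = {('man', 'woman'): 2, ('man', 'horse'): 12,
             ('man', 'bird'): 7, ('woman', 'woman'): 0}

for (w1, w2), dist in distances.items():
    print('{0} to {1}: {2:4.3f}'.format(w1, w2, path_similarity(dist)))
```

Under this reading, a synset compared with itself has distance zero and therefore the maximum score of one.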


In [5]:
print('Leacock-Chodorow Similarity:')
print(40*'-')
print(fmt_str.format(wn.lch_similarity(man, woman), 'man', 'woman'))
print(fmt_str.format(wn.lch_similarity(man, horse), 'man', 'horse'))
print(fmt_str.format(wn.lch_similarity(man, bird), 'man', 'bird'))
print(fmt_str.format(wn.lch_similarity(woman, woman), 'woman', 'woman'))

Leacock-Chodorow Similarity:
----------------------------------------
man to woman: 2.539
man to horse: 1.073
man to bird: 1.558
woman to woman: 3.638
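
The Leacock-Chodorow score is commonly written as `-log(p / (2 * D))`, where `p` is the shortest path length counted in nodes (edges plus one) and `D` is the maximum depth of the taxonomy. The sketch below reproduces the arithmetic under two assumptions that are not stated in the Notebook: the edge counts are the same ones implied by the path similarity scores, and `D = 19`, which is inferred from the self-similarity score since `log(2 * 19)` is about 3.638.

```python
import math

# Assumption: taxonomy depth inferred from the 'woman to woman' score above
MAX_DEPTH = 19

def lch_similarity(distance, max_depth=MAX_DEPTH):
    """Leacock-Chodorow score for a shortest path of `distance` edges."""
    return -math.log((distance + 1) / (2.0 * max_depth))

# Assumed edge counts, inferred from the path similarity scores
for label, dist in [('man to woman', 2), ('man to horse', 12),
                    ('man to bird', 7), ('woman to woman', 0)]:
    print('{0}: {1:4.3f}'.format(label, lch_similarity(dist)))
```

Unlike path similarity, this score is not bounded by one; its maximum depends on the depth of the taxonomy.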


In [6]:
print('Wu-Palmer Similarity:')
print(40*'-')
print(fmt_str.format(wn.wup_similarity(man, woman), 'man', 'woman'))
print(fmt_str.format(wn.wup_similarity(man, horse), 'man', 'horse'))
print(fmt_str.format(wn.wup_similarity(man, bird), 'man', 'bird'))
print(fmt_str.format(wn.wup_similarity(woman, woman), 'woman', 'woman'))

Wu-Palmer Similarity:
----------------------------------------
man to woman: 0.667
man to horse: 0.500
man to bird: 0.632
woman to woman: 0.667
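
Wu-Palmer similarity is usually defined as `2 * depth(LCS) / (depth(a) + depth(b))`, where LCS is the deepest ancestor shared by both concepts. As a rough illustration, the sketch below applies that formula to a small, hypothetical taxonomy. The hierarchy and helper names are invented for this example and are not the real WordNet structure, so the numbers will not match the output above.

```python
# Hypothetical toy taxonomy (child -> parent); not the real WordNet hierarchy
PARENT = {'organism': 'entity', 'person': 'organism', 'animal': 'organism',
          'man': 'person', 'woman': 'person', 'horse': 'animal', 'bird': 'animal'}

def ancestors(node):
    """Return the chain from node up to the root, starting with node itself."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(node))

def wup_similarity(a, b):
    """Wu-Palmer score using the deepest ancestor shared by a and b."""
    shared = set(ancestors(b))
    lcs = next(n for n in ancestors(a) if n in shared)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))

print('man to woman: {0:4.3f}'.format(wup_similarity('man', 'woman')))
print('man to horse: {0:4.3f}'.format(wup_similarity('man', 'horse')))
```

In this toy hierarchy, `man` and `woman` share the deeper ancestor `person`, so they score higher than `man` and `horse`, which only meet at `organism`.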


-----

### Student Activity

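In the preceding cells, we used WordNet to look up the synsets for a word and to compute similarity scores between pairs of synsets. Now that you have run the Notebook,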
go back and make the following changes to see how the results change.

1. Change 
2. Try 
3. Can 

----------

## Gensim

We now switch to gensim, which we use to build topic models and then to measure the similarity between pieces of text.

-----

In [7]:
# As a text example, we use the course description for INFO490  SP16.
info_course = ['Advanced Data Science: This class is an asynchronous, online course.', 
               'This course will introduce advanced data science concepts by building on the foundational concepts presented in INFO 490: Foundations of Data Science.', 
               'Students will first learn how to perform more statistical data exploration and constructing and evaluating statistical models.', 
               'Next, students will learn machine learning techniques including supervised and unsupervised learning, dimensional reduction, and cluster finding.', 
               'An emphasis will be placed on the practical application of these techniques to high-dimensional numerical data, time series data, image data, and text data.', 
               'Finally, students will learn to use relational databases and cloud computing software components such as Hadoop, Spark, and NoSQL data stores.', 
               'Students must have access to a fairly modern computer, ideally that supports hardware virtualization, on which they can install software.', 
               'This class is open to sophomores, juniors, seniors and graduate students in any discipline who have either taken a previous INFO 490 data science course or have received instructor permission.']

# Simple stop words
stop_words = set('for a of the and to in on an'.split())

# Parse text into words, make lowercase and remove stop words
txts = [[word for word in sentence.lower().split() if word not in stop_words]
        for sentence in info_course]

# Keep only those words appearing more than once
# Easy with a Counter, but need a flat list
from collections import Counter
frequency = Counter([word for txt in txts for word in txt])

# Now grab tokens that appear more than once
tokens = [[token for token in txt if frequency[token] > 1]
          for txt in txts]

from gensim import corpora
dict_gensim = corpora.Dictionary(tokens)

crps = [dict_gensim.doc2bow(txt) for txt in txts]

from gensim import models

tfidf = models.TfidfModel(crps)

crps_tfidf = tfidf[crps]

lda_gs = models.LdaModel(corpus=crps_tfidf, id2word=dict_gensim, num_topics=5, passes=25)
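
To make the TF-IDF step above more concrete, here is a minimal pure-Python sketch of one common weighting: the raw term count times `log2(N / document frequency)`, followed by L2 normalization. The toy documents and the `tfidf` helper are hypothetical, and the sketch illustrates the idea rather than reproducing gensim's internals.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized
docs = [['data', 'science', 'course'],
        ['data', 'models'],
        ['cloud', 'computing', 'course']]

# Document frequency: the number of documents that contain each token
doc_freq = Counter(token for doc in docs for token in set(doc))

def tfidf(doc, n_docs=len(docs)):
    """Return {token: weight}, scaled so the weights form a unit vector."""
    counts = Counter(doc)
    raw = {tok: cnt * math.log2(n_docs / doc_freq[tok])
           for tok, cnt in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {tok: w / norm for tok, w in raw.items()} if norm else raw

weights = tfidf(docs[0])
print(weights)
```

Tokens that occur in fewer documents, such as `science` here, receive a larger weight than tokens shared across documents.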

In [8]:
txt = 'Master Data Science'
vec_bow = dict_gensim.doc2bow(txt.lower().split())
vec_lda = lda_gs[vec_bow]
print(vec_lda)

[(0, 0.066800291490970901), (1, 0.066799581386506865), (2, 0.068840948453476519), (3, 0.068986284952324783), (4, 0.72857289371672085)]


In [9]:
from gensim import similarities
index = similarities.MatrixSimilarity(lda_gs[crps_tfidf])



In [10]:
index[vec_lda]

array([ 0.9999097 ,  0.19364852,  0.28261018,  0.21369374,  0.25487536,
        0.21060194,  0.25424922,  0.99906713], dtype=float32)
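
`MatrixSimilarity` scores a query against each indexed document with cosine similarity. The following is a minimal sketch of that measure for sparse vectors stored as `(topic_id, weight)` pairs, like the one printed for `vec_lda`. The example topic mixtures are hypothetical and were not produced by the model above.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical topic mixtures over five topics
query = [(0, 0.07), (1, 0.07), (2, 0.07), (3, 0.07), (4, 0.72)]
doc_same_topic = [(0, 0.05), (1, 0.05), (2, 0.05), (3, 0.05), (4, 0.80)]
doc_other_topic = [(0, 0.80), (1, 0.05), (2, 0.05), (3, 0.05), (4, 0.05)]

print('{0:4.3f}'.format(cosine(query, doc_same_topic)))
print('{0:4.3f}'.format(cosine(query, doc_other_topic)))
```

Because every document is reduced to a mixture over only five topics, two texts dominated by the same topic score close to one even when they share few words.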

In [11]:
import operator

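def find_similar_sentances(mdl, dct, sml_idx, txt, prct=0.9):
    """Print the sentences in info_course whose similarity to txt exceeds prct."""
    # Project the query text into the model's topic space
    vec_bow = dct.doc2bow(txt.lower().split())
    vec_lda = mdl[vec_bow]

    # Rank the indexed sentences by similarity to the query, highest first
    sml = sorted(enumerate(sml_idx[vec_lda]),
                 key=operator.itemgetter(1), reverse=True)
    print('Score| Text')
    print(4*'-', '|', 73*'-')

    for idx, val in sml:
        if val > prct:
            print('{0:4.3f}: {1}'.format(val, info_course[idx]))
            print(4*'-', '|', 73*'-')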

In [12]:
find_similar_sentances(lda_gs, dict_gensim, index, txt)

Score| Text
---- | -------------------------------------------------------------------------
1.000: Advanced Data Science: This class is an asynchronous, online course.
---- | -------------------------------------------------------------------------
0.999: This class is open to sophomores, juniors, seniors and graduate students in any discipline who have either taken a previous INFO 490 data science course or have received instructor permission.
---- | -------------------------------------------------------------------------


In [13]:
txt = 'evaluate statistical plots'
find_similar_sentances(lda_gs, dict_gensim, index, txt, 0.5)

Score| Text
---- | -------------------------------------------------------------------------
0.987: This course will introduce advanced data science concepts by building on the foundational concepts presented in INFO 490: Foundations of Data Science.
---- | -------------------------------------------------------------------------
0.953: Students will first learn how to perform more statistical data exploration and constructing and evaluating statistical models.
---- | -------------------------------------------------------------------------


In [14]:
txt = 'learn computing'
find_similar_sentances(lda_gs, dict_gensim, index, txt, 0.75)

Score| Text
---- | -------------------------------------------------------------------------
0.992: Next, students will learn machine learning techniques including supervised and unsupervised learning, dimensional reduction, and cluster finding.
---- | -------------------------------------------------------------------------
0.991: Finally, students will learn to use relational databases and cloud computing software components such as Hadoop, Spark, and NoSQL data stores.
---- | -------------------------------------------------------------------------


-----

### Student Activity

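In the preceding cells, we used gensim to build an LDA topic model from a short text and to find the sentences most similar to a query. Now that you have run the Notebook,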
go back and make the following changes to see how the results change.

1. Change 
2. Try 
3. Can 


-----

## Word2Vec


-----

In [15]:
import gensim

model = gensim.models.Word2Vec(txts, window=2, min_count=1)

In [16]:
# Compute cosine similarity between two words.
# Note: gensim 1.0 and later expose word-vector queries on the model's `wv` attribute.

def get_similarity(mdl, w1, w2):
    sml = mdl.wv.similarity(w1, w2)
    print('{0:6.3f}: {1} to {2}'.format(sml, w1, w2))

get_similarity(model, 'data', 'data')
get_similarity(model, 'data', 'science')
get_similarity(model, 'image', 'data')
get_similarity(model, 'students', 'computing')

 1.000: data to data
-0.017: data to science
 0.041: image to data
-0.008: students to computing
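
The `window` argument controls how many neighbouring tokens on each side of a word count as its context during training. As a rough sketch, the hypothetical helper below enumerates the (target, context) pairs for a fixed window. It is a simplification: real implementations may, for example, vary the effective window size from word to word.

```python
def context_pairs(tokens, window=2):
    """Yield (target, context) pairs for tokens within `window` positions."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]

sentence = ['students', 'will', 'learn', 'machine', 'learning']
pairs = list(context_pairs(sentence, window=2))
print(len(pairs))
print(pairs[:3])
```

With only eight short sentences, each word contributes very few such pairs, which is consistent with the near-zero similarities printed above.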


-----

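The course description is too small a corpus to train meaningful word vectors, so we now switch to a larger one: the NLTK movie reviews.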

-----

In [17]:
from sklearn.datasets import load_files

# Load the NLTK movie review corpus from disk; each review is read as raw bytes
data_dir = '/home/data_scientist/data/nltk_data/corpora/movie_reviews'
mvr = load_files(data_dir, shuffle = False)

In [18]:
import string

from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

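new_mvr = []

# Tokenize each review and drop punctuation tokens.
# Use a separate loop variable so that the corpus object mvr is not overwritten.
for review in mvr.data:
    wtks = tokenizer.tokenize(review.decode('utf-8'))
    new_mvr.append([wtk for wtk in wtks if wtk not in string.punctuation])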

print('Number of reviews in Corpus = {0}'.format(len(new_mvr)))

Number of reviews in Corpus = 2000


In [19]:
# Build word2vec model from movie reviews
model = gensim.models.Word2Vec(new_mvr, window=5, min_count=5)

In [20]:
# Compute Cosine Similarities from Corpus
get_similarity(model, 'girl', 'girl')
get_similarity(model, 'boy', 'girl')
get_similarity(model, 'woman', 'man')
get_similarity(model, 'woman', 'girl')

 1.000: girl to girl
 0.771: boy to girl
 0.829: woman to man
 0.804: woman to girl


In [21]:
def show_words(vals, label='Cosine Similarity'):
    # `label` is the column header; the name avoids shadowing the built-in `type`
    print('{0:14s}: {1}'.format('Word', label))
    print(40*'-')
    for val in vals:
        print('{0:14s}: {1:6.3f}'.format(val[0], val[1]))

In [22]:
# Find the words most similar (by cosine similarity) to a given word.

vals = model.wv.most_similar('boy', topn=5)
show_words(vals)

Word          : Cosine Similarity
----------------------------------------
girl          :  0.771
woman         :  0.735
man           :  0.725
immigrant     :  0.724
hairdresser   :  0.697


In [23]:
# Identify words that don't belong

wrd_lst = ['boy', 'horse', 'cow', 'pig']

model.wv.doesnt_match(wrd_lst)

'boy'
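
One way to pick the odd word out is to average the unit-length word vectors and return the word whose vector has the lowest cosine similarity to that mean. The sketch below illustrates the idea with hypothetical three-dimensional vectors, not with vectors from the trained model. The helper names are invented for this example.

```python
import math

def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def odd_one_out(vectors):
    """Return the word least similar (by cosine) to the mean of the unit vectors."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[d] for u in units.values()) / len(units)
                 for d in range(dims)])
    scores = {word: sum(a * b for a, b in zip(u, mean))
              for word, u in units.items()}
    return min(scores, key=scores.get)

# Hypothetical vectors: three words point in a similar direction, one does not
toy_vectors = {'boy': [0.9, 0.1, 0.0],
               'horse': [0.1, 0.9, 0.2],
               'cow': [0.0, 0.8, 0.3],
               'pig': [0.1, 0.85, 0.25]}

print(odd_one_out(toy_vectors))
```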

In [24]:
# Compute cosine similarity between two sets of words.

val = model.wv.n_similarity(['boy', 'girl'], ['man', 'woman'])

print('Cosine Similarity = {0:6.3f}'.format(val))

Cosine Similarity =  0.810


In [25]:
# Find similar words (Cosine Similarity)

vals = model.wv.most_similar(positive=['woman', 'girl'], negative=['man'], topn=5)
show_words(vals)

Word          : Cosine Similarity
----------------------------------------
fat           :  0.694
drinks        :  0.664
approaches    :  0.660
prostitute    :  0.657
benefactor    :  0.656
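
A query with `positive` and `negative` word lists can be read as vector arithmetic: add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining vocabulary by cosine similarity to the result. The sketch below shows this on a hypothetical toy vocabulary. The vectors and the `analogy` helper are invented for illustration and do not come from the trained model.

```python
import math

def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    units = {w: unit(v) for w, v in vectors.items()}
    dims = len(next(iter(units.values())))
    target = unit([sum(units[w][d] for w in positive)
                   - sum(units[w][d] for w in negative)
                   for d in range(dims)])
    exclude = set(positive) | set(negative)  # query words are not candidates
    scores = [(w, sum(a * b for a, b in zip(u, target)))
              for w, u in units.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical vectors with dimensions (royal, male, female)
toy = {'king': [1, 1, 0], 'queen': [1, 0, 1], 'man': [0, 1, 0],
       'woman': [0, 0, 1], 'prince': [0.8, 0.9, 0]}

print(analogy(toy, positive=['king', 'woman'], negative=['man']))
```

On a small or noisy corpus the top-ranked words can be hard to interpret, as the movie review output above suggests.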


In [26]:
# Find similar words (Multiplicative Combination method)

vals = model.wv.most_similar_cosmul('actor', topn=5)
show_words(vals, 'CosMul Similarity')

Word          : CosMul Similarity
----------------------------------------
actress       :  0.825
oscar         :  0.808
nomination    :  0.798
schwimmer     :  0.797
performance   :  0.797


-----

### Student Activity

In the preceding cells, we used Word2Vec to build word vectors from text and to compute similarities between words. Now that you have run the Notebook,
go back and make the following changes to see how the results change.

1. Change 
2. Try 
3. Can 

-----