# Week 4 notes

# Semantic Text Similarity

- Useful when grouping words by their meaning  
- Useful as a building block in natural language understanding tasks  
    - Textual entailment  
    - Paraphrasing

# WordNet

- A semantic dictionary, interlinked by semantic relations  
- Includes rich linguistic information: part of speech, word senses, synonyms, etc.  
- Machine-readable, freely available  
- Organizes information in a hierarchy, which lets you calculate path similarity: find the shortest path between two concepts; similarity is inversely related to path distance  
    - Lowest common subsumer (LCS): the closest ancestor shared by two separate concepts  
    - Lin similarity: uses the information content of the LCS, relative to that of the two concepts, to measure similarity
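
To make the path idea concrete, here is a minimal sketch on a hand-built toy hierarchy. The `PARENT` table and the helper names are made up for illustration and are not real WordNet data; the score used is 1 / (1 + path length), which is how NLTK documents `path_similarity`.

```python
# Toy is-a hierarchy (child -> parent). Invented for illustration, NOT real WordNet data.
PARENT = {
    "elk": "deer",
    "deer": "ruminant",
    "ruminant": "ungulate",
    "horse": "equine",
    "equine": "ungulate",
    "ungulate": "mammal",
}


def ancestors(concept):
    """Return [concept, parent, grandparent, ...] up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def lowest_common_subsumer(a, b):
    """Closest shared ancestor (a concept counts as its own ancestor)."""
    b_chain = set(ancestors(b))
    for node in ancestors(a):
        if node in b_chain:
            return node
    return None


def path_similarity(a, b):
    """1 / (1 + number of edges on the path from a to b through their LCS)."""
    lcs = lowest_common_subsumer(a, b)
    if lcs is None:
        return None
    distance = ancestors(a).index(lcs) + ancestors(b).index(lcs)
    return 1 / (1 + distance)


print(path_similarity("deer", "elk"))           # one edge apart in this toy tree
print(lowest_common_subsumer("deer", "horse"))  # shared ancestor in this toy tree
```

In a tree the shortest path between two concepts always passes through their LCS, which is why the sketch can add the two upward distances. Real WordNet is deeper and allows multiple parents, so its scores will differ from this toy tree.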
    
How to do all this in Python?  
- WordNet is easily imported through NLTK

In [1]:
import nltk
from nltk.corpus import wordnet as wn

## Find path similarity

In [4]:
# Look up a specific sense of each word: 'n.01' = first noun sense
deer = wn.synset('deer.n.01')
elk  = wn.synset('elk.n.01')
horse = wn.synset('horse.n.01')

In [3]:
deer.path_similarity(elk)

0.5

In [5]:
deer.path_similarity(horse)

0.14285714285714285

## Find Lin similarity

In [9]:
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat') # Load information content computed from the Brown corpus

In [7]:
deer.lin_similarity(elk, brown_ic)

0.8623778273893673

In [8]:
deer.lin_similarity(horse, brown_ic)

0.7726998936065773
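
For intuition on what the number means: Lin similarity is commonly written as 2 * IC(LCS) / (IC(a) + IC(b)), where IC(c) = -log P(c) is the information content of concept c. A minimal sketch follows; the probabilities in `PROB` are invented for illustration and are not the Brown corpus values.

```python
import math

# Hypothetical concept probabilities, invented for illustration only.
PROB = {"deer": 0.001, "horse": 0.002, "ungulate": 0.01}


def information_content(concept):
    """IC(c) = -log P(c): rarer concepts carry more information."""
    return -math.log(PROB[concept])


def lin_similarity(a, b, lcs):
    """2 * IC(lcs) / (IC(a) + IC(b)); the LCS is passed in by hand here."""
    return 2 * information_content(lcs) / (
        information_content(a) + information_content(b)
    )


print(lin_similarity("deer", "horse", "ungulate"))
```

The more specific (rarer) the shared ancestor is, the closer the score gets to 1; a concept compared with itself scores exactly 1.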

## Collocations and distributional similarity

Words that frequently appear in similar contexts are more likely to be semantically related. Ways to define context:  
- Words within a fixed window before or after the target word  
- Parts of speech that occur with the target word  
- Syntactic relations to the target word  
- Words in the same sentence, same document, etc.

You can calculate the strength of the association between two words.

Pointwise Mutual Information (PMI) = log( chance of seeing the words together / (chance of seeing first word * chance of seeing second word) )
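
To see the calculation by hand before reaching for a library, here is a minimal pure-Python sketch. PMI is usually defined as the log of the ratio above; this sketch uses log base 2 and simple count-based probability estimates. The function name `bigram_pmi` and the sample sentence are made up for illustration, and a library may normalize counts slightly differently.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every adjacent word pair in tokens."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1

    scores = {}
    for (w1, w2), count in bigram_counts.items():
        p_pair = count / n_bigrams                 # chance of seeing the pair
        p_w1 = unigram_counts[w1] / n_unigrams     # chance of seeing first word
        p_w2 = unigram_counts[w2] / n_unigrams     # chance of seeing second word
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


tokens = "new york is big and new york is busy".split()
scores = bigram_pmi(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In this tiny sample the pair seen only once between two rare words ("big", "and") outranks ("new", "york"), which shows PMI's bias toward rare events and is one reason a minimum-frequency filter is often applied.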

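How to do it in Python:

    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    bigram_measures = BigramAssocMeasures()

    # `text` is a list of tokens (e.g. an already tokenized document)
    finder = BigramCollocationFinder.from_words(text)
    finder.nbest(bigram_measures.pmi, 10)  # 10 highest-scoring bigrams by PMI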

You can also use `finder` for frequency filtering, e.g. to drop bigrams that occur fewer than 10 times before ranking:  

    finder.apply_freq_filter(10)

# Latent Dirichlet Allocation (LDA)

A generative model used extensively for modeling large text corpora. Each document is treated as a mixture of topics, and each topic as a distribution over words.

    import gensim
    from gensim import corpora

    # doc_set is a list of tokenized documents (each a list of strings)
    dictionary = corpora.Dictionary(doc_set)  # map each unique token to an integer id
    corpus = [dictionary.doc2bow(doc) for doc in doc_set]  # bag-of-words vector per document
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=50)
    print(ldamodel.print_topics(num_topics=4, num_words=5))
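
The `Dictionary` and `doc2bow` steps can look opaque, so here is a rough pure-Python sketch of the same idea: give each unique token an integer id, then represent each document as (token id, count) pairs. The helper names and the tiny `doc_set` are made up for illustration; this is not gensim's implementation, and gensim may assign ids in a different order.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign an integer id to each unique token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bag_of_words(doc, vocab):
    """Represent a tokenized document as sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Tiny made-up corpus of already tokenized documents.
doc_set = [["cat", "sat", "mat"], ["cat", "cat", "dog"]]
vocab = build_vocabulary(doc_set)
corpus = [to_bag_of_words(doc, vocab) for doc in doc_set]
print(vocab)
print(corpus)
```

Word order is discarded at this step: LDA only sees how often each word occurs in each document.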