# Natural Language Processing with Python and NLTK


Language comes easily to humans but is really difficult for computers, largely because meaning depends on context.  

### Word sense ambiguity

##### "He served the dish"
**serve**: help with food or drink; hold an office; put ball into play  
**dish**: plate; course of a meal; communications device  

##### "... by ..."

- The lost children were found by the **searchers** (agentive)
- The lost children were found by the **mountain** (locative)
- The lost children were found by the **afternoon** (temporal)

### Pronoun resolution
- The thieves stole the paintings. **They** were subsequently *sold*.
- The thieves stole the paintings. **They** were subsequently *caught*.
- The thieves stole the paintings. **They** were subsequently *found*.





In [None]:
import nltk

In [None]:
# nltk.download()

<img src="img/nltk_download.png">

In [None]:
from nltk.book import *

In [None]:
monty_python = text6

## `concordance()`
Search text and view context

In [None]:
monty_python.concordance('shrubbery')

In [None]:
monty_python.concordance('Camelot')
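
To make the idea of "search text and view context" concrete, here is a minimal pure-Python sketch of a concordance lookup. It is not NLTK's implementation: `simple_concordance`, its `width` parameter, and the sample sentence are hypothetical, and the case-insensitive matching is an assumption made for illustration.

```python
def simple_concordance(tokens, target, width=3):
    """Return each occurrence of `target` with up to `width` tokens of context per side."""
    target = target.lower()
    lines = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively (an assumption for this sketch)
        if token.lower() == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(' '.join(left + [token] + right))
    return lines

# Hypothetical sample text, already split into tokens
tokens = "We want a shrubbery . Bring us a shrubbery that looks nice".split()
for line in simple_concordance(tokens, 'shrubbery'):
    print(line)
```

Each printed line is one hit with its surrounding tokens, which is the same kind of view `concordance()` gives over a full corpus.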

## `similar()`
Find other words used in similar context

In [None]:
monty_python.similar('castle')

## `dispersion_plot()`
Graph the locations where each word is used in the text

In [None]:
inaugural_address = text4
inaugural_address.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

## `bigrams()`
Generates 2-grams (pairs of adjacent tokens)

In [None]:
list(nltk.bigrams(['to', 'be', 'or', 'not', 'to', 'be']))
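
Pairing each token with its successor can also be sketched without NLTK. The helper name `simple_bigrams` is hypothetical and only illustrates the idea; it is not how `nltk.bigrams` is implemented.

```python
def simple_bigrams(tokens):
    """Pair every token with the token that follows it."""
    tokens = list(tokens)
    # zip stops at the shorter sequence, so the last token has no partner
    return list(zip(tokens, tokens[1:]))

pairs = simple_bigrams(['to', 'be', 'or', 'not', 'to', 'be'])
print(pairs)
```

A sequence of six tokens yields five bigrams, and `('to', 'be')` appears twice because the phrase repeats.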

## `collocations()`
Find frequent bigrams in text

In [None]:
inaugural_address.collocations()

In [None]:
monty_python.collocations()

## `FreqDist()` - Frequency Distribution
An object that records how often each vocabulary item (distinct token) occurs in the text

In [None]:
inaugural_distribution = nltk.FreqDist(inaugural_address)

In [None]:
inaugural_distribution.plot(50, cumulative=True)

In [None]:
inaugural_distribution['America']
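
The same counting idea can be tried on a tiny example with the standard library's `collections.Counter`, which, as far as I know, is the class `FreqDist` builds on in NLTK 3. The sample sentence below is made up for illustration.

```python
from collections import Counter

# Hypothetical sample text, split on whitespace
tokens = "the people of America and the people of the world".split()
freq = Counter(tokens)

print(freq['the'])          # how often one token occurs
print(freq['America'])
print(freq.most_common(1))  # the single most frequent token with its count
```

Looking up a token that never occurs returns 0 rather than raising an error, which is convenient when probing a distribution interactively.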

### Crude Markov chain generator. Always picks the most frequent next word for each bigram

In [None]:
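def generate_model(cfdist, word, num=15):
    # Print `num` words, each time moving to the most frequent follower of the current word
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()

# Count, for every word, how often each other word follows it
bigram = nltk.bigrams(inaugural_address)
cfd = nltk.ConditionalFreqDist(bigram)

print(cfd['America'])
generate_model(cfd, 'America')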
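
The same greedy generator can be sketched with the standard library alone, which makes the underlying data structure visible: a mapping from each word to a counter of the words that follow it. The names `build_bigram_counts` and `generate_greedy` and the sample sentence are hypothetical.

```python
from collections import Counter, defaultdict

def build_bigram_counts(tokens):
    """Map each word to a Counter of the words that immediately follow it."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts

def generate_greedy(counts, word, num=15):
    """Collect up to `num` words, always stepping to the most frequent follower."""
    output = []
    for _ in range(num):
        output.append(word)
        followers = counts.get(word)
        if not followers:
            # The word was never followed by anything, so the chain ends here
            break
        word = followers.most_common(1)[0][0]
    return output

# Hypothetical sample text
tokens = "the cat sat on the mat and the cat sat on the rug".split()
counts = build_bigram_counts(tokens)
generated = generate_greedy(counts, 'the', num=6)
print(' '.join(generated))
```

Because the most frequent follower is always chosen, the output quickly falls into a loop, which is why this generator is only a crude approximation of a Markov chain that samples followers by probability.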

## Part of Speech Tagger - `pos_tag()`

- CC: coordinating conjunction
- RB: adverbs
- IN: preposition
- NN: noun
- JJ: adjective
- VBD: verb, past tense
- DT: determiner

In [None]:
sentence = nltk.word_tokenize("The quick brown fox jumped over the lazy dog")
nltk.pos_tag(sentence)

## Jaccard Similarity
$$J(A,B) = \frac{| A \cap B |}{| A \cup B |}$$

In [None]:
def generate_char_ngram(string, n):
    # Slide a window of n characters across the string
    length = len(string)
    
    ngram_list = list()
    
    for index in range(length - n + 1):
        ngram_list.append(string[index:index + n])
        
    return ngram_list

In [None]:
def jaccard_similiarty(list1, list2):
    set1 = set(list1)
    set2 = set(list2)
    
    return len(set1.intersection(set2)) / float(len(set1.union(set2)))

In [None]:
string1 = 'kitten'
string2 = 'sitting'

ngram1 = generate_char_ngram(string1, 2)
ngram2 = generate_char_ngram(string2, 2)

jaccard_similiarty(ngram1, ngram2)
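
To see where the score for 'kitten' and 'sitting' comes from, the sketch below spells out the intersection and union of their character bigrams. The helper `char_bigrams` is a hypothetical stand-in that returns a set directly.

```python
def char_bigrams(s):
    """Return the set of distinct two-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

a = char_bigrams('kitten')    # ki, it, tt, te, en
b = char_bigrams('sitting')   # si, it, tt, ti, in, ng

shared = a & b
combined = a | b
score = len(shared) / float(len(combined))

print(sorted(shared))
print(len(shared), len(combined))
print(score)
```

The two words share 2 of 9 distinct bigrams, so the Jaccard similarity is 2/9, or roughly 0.22.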

In [None]:
string3 = 'google'
string4 = 'googleinc'

ngram3 = generate_char_ngram(string3, 2)
ngram4 = generate_char_ngram(string4, 2)

jaccard_similiarty(ngram3, ngram4)

## Cosine Similarity
$$ \text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} $$


In [None]:
def vectorize_strings(string1, string2):
    ngram1 = generate_char_ngram(string1, 2)
    ngram2 = generate_char_ngram(string2, 2)
    
    ## Sets are unordered, but that's OK for our purposes
    n = set(ngram1).union(set(ngram2))
    
    element1 = [0] * len(n)
    element2 = [0] * len(n)
    
    ## Loop through the union set of ngrams and check if they show up in string 1 and 2
    for index, ngram in enumerate(n):
        if ngram in ngram1:
            element1[index] = 1
        else:
            element1[index] = 0
            
        if ngram in ngram2:
            element2[index] = 1
        else:
            element2[index] = 0
            
    return (element1, element2)

In [None]:
import numpy as np

def cosine_similarity(vector1, vector2):
    a = np.asarray(vector1)
    b = np.asarray(vector2)
    result = np.dot(a, b) / float(np.linalg.norm(a)) / float(np.linalg.norm(b))
    
    return result
    

In [None]:
v1, v2 = vectorize_strings('google', 'googleinc')
cosine_similarity(v1, v2)
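
For 0/1 vectors like the ones built above, the dot product equals the number of shared n-grams and each norm equals the square root of the number of distinct n-grams in that string. The sketch below uses that shortcut to cross-check the result without NumPy; `char_bigrams` and `binary_cosine` are hypothetical helper names.

```python
import math

def char_bigrams(s):
    """Return the set of distinct two-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def binary_cosine(set_a, set_b):
    """Cosine similarity of two 0/1 vectors, each given as the set of positions holding a 1."""
    if not set_a or not set_b:
        # A zero vector has no direction; returning 0.0 is a choice made for this sketch
        return 0.0
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))

cosine = binary_cosine(char_bigrams('google'), char_bigrams('googleinc'))
print(cosine)
```

'google' has 5 distinct bigrams, 'googleinc' has 8, and all 5 are shared, so the similarity is 5 / sqrt(5 * 8), about 0.79.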

## Levenshtein / Edit Distance

Borrowed from Wikipedia...

In [None]:
# Christopher P. Matthews
# christophermatthews1985@gmail.com
# Sacramento, CA, USA

def levenshtein_distance(s, t):
    ''' From Wikipedia article; Iterative with two matrix rows. '''
    if s == t:
        return 0
    elif len(s) == 0:
        return len(t)
    elif len(t) == 0:
        return len(s)
    # v0 holds the previous row of edit distances, v1 the row being computed
    v0 = list(range(len(t) + 1))
    v1 = [None] * (len(t) + 1)
    for i in range(len(s)):
        # Distance from the first i + 1 characters of s to the empty string
        v1[0] = i + 1
        for j in range(len(t)):
            cost = 0 if s[i] == t[j] else 1
            # Cheapest of insertion, deletion, and substitution (or match)
            v1[j + 1] = min(v1[j] + 1, v0[j + 1] + 1, v0[j] + cost)
        # The current row becomes the previous row for the next character of s
        v0 = list(v1)

    return v1[len(t)]

In [None]:
levenshtein_distance('kitten', 'sitting')
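
The two-row version above is a memory-saving form of the classic dynamic-programming table. As a sketch for comparison, the variant below keeps the whole table, where entry `d[i][j]` is the edit distance between the first `i` characters of `s` and the first `j` characters of `t`. The name `levenshtein_full_table` is hypothetical.

```python
def levenshtein_full_table(s, t):
    """Edit distance computed with the full (len(s) + 1) x (len(t) + 1) table."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]

    # Turning a prefix into the empty string takes one deletion per character
    for i in range(rows):
        d[i][0] = i
    # Building a prefix from the empty string takes one insertion per character
    for j in range(cols):
        d[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1]

distance = levenshtein_full_table('kitten', 'sitting')
print(distance)
```

'kitten' becomes 'sitting' in three edits: substitute k with s, substitute e with i, and append g.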