# Assignment 2 - Introduction to NLTK

In Part 1 of this assignment you will use NLTK to explore the <a href='http://www.cs.cmu.edu/~ark/personas/'>CMU Movie Summary Corpus</a>. All data is released under a <a href='https://creativecommons.org/licenses/by-sa/3.0/us/legalcode'>Creative Commons Attribution-ShareAlike License</a>. In Part 2 you will create a spelling recommender function that uses NLTK to find correctly spelled words similar to a misspelled word.

## Part 1 - Analyzing Plot Summary Text

In [14]:
import nltk
import pandas as pd
import numpy as np

nltk.data.path.append("assets/")

# If you would like to work with the raw text you can use 'plots_raw'
with open('assets/plots.txt', 'r') as f:
    plots_raw = f.read()

# If you would like to work with the plot summaries in nltk.Text format you can use 'text1'
plots_tokens = nltk.word_tokenize(plots_raw)
text1 = nltk.Text(plots_tokens)

### Example 1

How many tokens (words and punctuation symbols) are in text1?

*This function should return an integer.*

In [15]:
def example_one():
    
    return len(nltk.word_tokenize(plots_raw)) # or alternatively len(text1)

example_one()

374441

### Example 2

How many unique tokens (unique words and punctuation) does text1 have?

*This function should return an integer.*

In [16]:
def example_two():
    
    return len(set(nltk.word_tokenize(plots_raw))) # or alternatively len(set(text1))

example_two()

25933

### Example 3

After lemmatizing the verbs, how many unique tokens does text1 have?

*This function should return an integer.*

In [17]:
from nltk.stem import WordNetLemmatizer

def example_three():

    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(w,'v') for w in text1]

    return len(set(lemmatized))

example_three()

21760

### Question 1

What is the lexical diversity of the given text input? (i.e. ratio of unique tokens to the total number of tokens)

*This function should return a float.*

In [18]:
print(text1)

<Text: Shlykov , a hard-working taxi driver and Lyosha...>


In [19]:
def answer_one():

    n_unique = len(set(text1))
    total_words = len(text1)
    
    lex_div = n_unique / total_words
    
    # raise NotImplementedError()
    return lex_div

answer_one()

0.06925790712021386
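
To make the ratio concrete, here is a minimal sketch on a hand-made token list. The toy sentence and the `lexical_diversity` helper are illustrative assumptions, not part of the corpus or of NLTK. Note that `'the'` and `'The'` count as two different tokens because the comparison is case-sensitive.

```python
def lexical_diversity(tokens):
    """Ratio of unique tokens to total tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical toy input: 8 tokens, 7 of them unique ('the' appears twice)
toy_tokens = ['the', 'cat', 'saw', 'the', 'dog', '.', 'The', 'end']

print(lexical_diversity(toy_tokens))  # 0.875
```

A value near 1 means almost every token is new; the much lower value for `text1` reflects how often common words and punctuation repeat in a large text.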

### Question 2

What percentage of tokens is 'love' or 'Love'?

*This function should return a float.*

In [44]:
def answer_two():

    # FreqDist maps each token to its count; add both capitalizations
    token_counts = nltk.FreqDist(plots_tokens)
    love_count = token_counts['love'] + token_counts['Love']

    # Multiply by 100 so the result is a percentage, not a fraction
    return 100 * love_count / len(plots_tokens)


answer_two()

0.12391805384559917
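
The same calculation can be sketched without NLTK by using `collections.Counter` on a small hand-made token list. The `token_percentage` helper and the toy tokens are assumptions made for illustration only.

```python
from collections import Counter


def token_percentage(tokens, targets):
    """Percentage (0-100) of tokens that match any string in targets."""
    counts = Counter(tokens)
    hits = sum(counts[target] for target in targets)
    return 100 * hits / len(tokens)


# Hypothetical toy input: 10 tokens, 2 of which are 'love' or 'Love'
toy_tokens = ['Love', 'is', 'blind', ',', 'but', 'they', 'love', 'it', '.', '.']

print(token_percentage(toy_tokens, ['love', 'Love']))  # 20.0
```

Because the result is scaled by 100, the value returned for `text1` above reads as roughly 0.12 percent, not 12 percent.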

### Question 3

What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?

*This function should return a list of 20 tuples where each tuple is of the form `(token, frequency)`. The list should be sorted in descending order of frequency.*

In [21]:
def answer_three():

    dist = nltk.FreqDist(text1)
    
    sorted_words = sorted(dist.items(), key=lambda x:x[1], reverse = True)
    
    top_twenty = sorted_words[:20]
    
    #raise NotImplementedError()
    return top_twenty

answer_three()

[(',', 19420),
 ('the', 18698),
 ('.', 16624),
 ('to', 12149),
 ('and', 11400),
 ('a', 8979),
 ('of', 6510),
 ('is', 5699),
 ('in', 5109),
 ('his', 4693),
 ("'s", 3682),
 ('her', 3674),
 ('he', 3556),
 ('that', 3517),
 ('with', 3293),
 ('him', 2570),
 ('for', 2433),
 ('by', 2321),
 ('The', 2234),
 ('on', 1925)]
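
As a side note, NLTK's `FreqDist` is built on `collections.Counter`, so the sort-and-slice step can also be written with `most_common`. The sketch below uses a plain `Counter` and a toy token list (both chosen for illustration) to show that the two approaches agree.

```python
from collections import Counter

# Hypothetical toy input
toy_tokens = ['a', 'b', 'a', 'c', 'b', 'a']
counts = Counter(toy_tokens)

# Explicit sort by frequency, descending, then keep the top 2
by_sort = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)[:2]

# The same result with the built-in helper
by_method = counts.most_common(2)

print(by_sort)    # [('a', 3), ('b', 2)]
print(by_method)  # [('a', 3), ('b', 2)]
```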

### Question 4

Which tokens have a length greater than 5 and a frequency of more than 200?

*This function should return an alphabetically sorted list of the tokens that match the above constraints. To sort your list, use `sorted()`*

In [22]:
def answer_four():

    # YOUR CODE HERE
    
    dist = nltk.FreqDist(text1)
    vocab = dist.keys()
    
    word_list = [w for w in vocab if len(w) > 5 and dist[w] > 200]
    sorted_word_list = sorted(word_list)
    # raise NotImplementedError()
    return sorted_word_list

answer_four()

['However',
 'Meanwhile',
 'another',
 'because',
 'becomes',
 'before',
 'begins',
 'daughter',
 'decides',
 'escape',
 'family',
 'father',
 'friend',
 'friends',
 'himself',
 'killed',
 'leaves',
 'mother',
 'people',
 'police',
 'returns',
 'school',
 'through']

### Question 5

Find the longest word in text1 and that word's length.

*This function should return a tuple `(longest_word, length)`.*

In [23]:
def answer_five():

    # max() with key=len returns the first token of maximal length;
    # this also avoids shadowing the built-in name `dict`
    longest_word = max(text1, key=len)

    return (longest_word, len(longest_word))

answer_five()

('live-for-today-for-tomorrow-we-die', 34)
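
The result above is a hyphenated compound that the tokenizer kept as a single token. A minimal sketch of the same lookup on a toy token list (the list is an assumption for illustration) shows how `max` with `key=len` picks it out:

```python
# Hypothetical toy input; hyphenated compounds are single tokens here
toy_tokens = ['a', 'well-known', 'story', 'of', 'a', 'hard-working', 'driver']

# key=len compares tokens by length instead of alphabetically
longest = max(toy_tokens, key=len)

print((longest, len(longest)))  # ('hard-working', 12)
```

If several tokens share the maximal length, `max` returns the first one it encounters.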

### Question 6

What unique words have a frequency of more than 2000? What is their frequency?

"Hint:  you may want to use `isalpha()` to check if the token is a word and not punctuation."

*This function should return a list of tuples of the form `(frequency, word)` sorted in descending order of frequency.*

In [50]:
def answer_six():
    
    # Lowercase first so that 'The' and 'the' are counted as the same word
    lowered = [token.lower() for token in plots_tokens]
    token_counts = nltk.FreqDist(lowered)

    # isalpha() drops punctuation and mixed tokens such as "'s"
    frequent = [(freq, word) for word, freq in token_counts.items()
                if word.isalpha() and freq > 2000]

    return sorted(frequent, key=lambda pair: pair[0], reverse=True)

answer_six()

[(20935, 'the'),
 (12229, 'to'),
 (11432, 'and'),
 (9325, 'a'),
 (6521, 'of'),
 (5711, 'is'),
 (5568, 'in'),
 (4852, 'his'),
 (4647, 'he'),
 (3764, 'her'),
 (3571, 'that'),
 (3396, 'with'),
 (2574, 'him'),
 (2465, 'for'),
 (2340, 'by'),
 (2297, 'she'),
 (2198, 'as'),
 (2076, 'on')]
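
The counts here are higher than in Question 3 because the tokens are lowercased before counting, which merges variants such as `'The'` and `'the'`. The following sketch shows both effects (case folding and the `isalpha()` filter) on a toy token list; the list and the threshold of 1 are assumptions for illustration.

```python
from collections import Counter

# Hypothetical toy input with mixed case, punctuation, and a clitic
toy_tokens = ['The', 'driver', "'s", 'taxi', ',', 'the', 'taxi', '.', 'THE']

# Keep alphabetic tokens only and fold case before counting
counts = Counter(token.lower() for token in toy_tokens if token.isalpha())

# (frequency, word) pairs above the threshold, most frequent first
frequent = sorted(
    ((freq, word) for word, freq in counts.items() if freq > 1),
    key=lambda pair: pair[0],
    reverse=True,
)

print(frequent)  # [(3, 'the'), (2, 'taxi')]
```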

### Question 7

Continuing from the previous question on `text1`, tokenize `text1` by splitting on whitespace, and find the length of each sentence. Report the average number of tokens per sentence.


*This function should return a float.*

In [46]:
def answer_seven():

#     # YOUR CODE HERE
#     sentences = nltk.sent_tokenize(plots_raw)
    
#     dict = { sentence : len(sentence.split(' ')) for sentence in sentences }
    
#     total_tokens = sum(dict.values())
#     total_sentences = len(sentences)
    
    
#     # raise NotImplementedError()
#     return total_tokens / total_sentences

    sen_tokens = nltk.sent_tokenize(plots_raw)
    return len(plots_tokens)/len(sen_tokens)

answer_seven()

22.31737990225295
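
The average depends on what counts as a token. The sketch below compares whitespace splitting with a punctuation-aware split on a toy text. The regular expressions are crude stand-ins chosen for illustration; they are not how `nltk.sent_tokenize` or `nltk.word_tokenize` work internally.

```python
import re

# Hypothetical toy input
toy_text = "He leaves. She stays, however. They meet again!"

# Crude sentence split: break after '.', '!' or '?' followed by whitespace
sentences = re.split(r'(?<=[.!?])\s+', toy_text.strip())

# Variant 1: tokens are whitespace-separated chunks
whitespace_avg = sum(len(s.split()) for s in sentences) / len(sentences)

# Variant 2: punctuation marks count as separate tokens
punct_avg = sum(len(re.findall(r"\w+|[^\w\s]", s)) for s in sentences) / len(sentences)

print(len(sentences))            # 3
print(round(whitespace_avg, 3))  # 2.667
print(punct_avg)                 # 4.0
```

Counting punctuation as tokens raises the average, so the two conventions should not be mixed when comparing results.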

### Question 8

What are the 5 most frequent parts of speech in `text1`? What is their frequency?

*This function should return a list of tuples of the form `(part_of_speech, frequency)` sorted in descending order of frequency.*

In [47]:
def answer_eight():
    from collections import Counter
    
#     word_pos = nltk.pos_tag(text1)
#     counts = Counter(tag for word,tag in word_pos)
#     d = dict(counts)
#     sorted_pos = sorted(d.items(), key=lambda x:x[1], reverse = True)
#     final = sorted_pos[:5]
    
#     # raise NotImplementedError()
#     return final
    pos_token = nltk.pos_tag(text1)
    pos_counts = Counter((subl[1] for subl in pos_token))
    return pos_counts.most_common(5)

answer_eight()

[('NN', 51452), ('IN', 39225), ('NNP', 38361), ('DT', 34471), ('VBZ', 23799)]
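
The tags in this output come from the Penn Treebank tag set: `NN` is a singular noun, `IN` a preposition or subordinating conjunction, `NNP` a singular proper noun, `DT` a determiner, and `VBZ` a third-person singular present verb. The counting step itself can be sketched on a hand-tagged toy sentence; the tags below were assigned by hand for illustration and are not tagger output.

```python
from collections import Counter

# Hypothetical hand-tagged input in the (word, tag) format that nltk.pos_tag returns
tagged = [
    ('The', 'DT'), ('taxi', 'NN'), ('driver', 'NN'), ('leaves', 'VBZ'),
    ('the', 'DT'), ('city', 'NN'), ('with', 'IN'), ('Lyosha', 'NNP'), ('.', '.'),
]

# Tally the tags only, ignoring the words
tag_counts = Counter(tag for _, tag in tagged)

print(tag_counts.most_common(2))  # [('NN', 3), ('DT', 2)]
```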

## Part 2 - Spelling Recommender

For this part of the assignment you will create three different spelling recommenders, each of which takes a list of misspelled words and recommends a correctly spelled word for every word in the list.

For every misspelled word, the recommender should find the word in `correct_spellings` that has the shortest distance* and starts with the same letter as the misspelled word, and return that word as a recommendation.

*Each of the three different recommenders will use a different distance measure (outlined below).

Each of the recommenders should provide recommendations for the three default words provided: `['cormulent', 'incendenece', 'validrate']`.

In [27]:
from nltk.corpus import words

correct_spellings = words.words()

### Question 9

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the trigrams of the two words.**

*This function should return a list of length three:
`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*

In [38]:
def answer_nine(entries=['cormulent', 'incendenece', 'validrate']):

    from nltk.util import ngrams
    recommendations = []

    for entry in entries:
        entry_trigrams = set(ngrams(entry, 3))
        # Only words that share the entry's first letter are candidates
        candidates = [w for w in correct_spellings if w[0] == entry[0]]
        # min() returns the first candidate with the smallest Jaccard distance
        best = min(candidates,
                   key=lambda w: nltk.jaccard_distance(entry_trigrams, set(ngrams(w, 3))))
        recommendations.append(best)

    return recommendations
    
answer_nine()

['corpulent', 'indecence', 'validate']
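
To see what the metric measures, the sketch below computes the Jaccard distance on character trigrams in plain Python for one pair of words. The helpers `char_ngrams` and `jaccard` are hypothetical names written for this illustration; they mirror the idea behind `nltk.util.ngrams` and `nltk.jaccard_distance` but are not NLTK code.

```python
def char_ngrams(word, n):
    """Set of all contiguous character n-grams in word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a, b):
    """Jaccard distance: 1 - |A intersect B| / |A union B|."""
    union = a | b
    if not union:
        return 0.0  # choice made for this sketch: two empty sets are identical
    return 1 - len(a & b) / len(union)


misspelled = char_ngrams('cormulent', 3)
candidate = char_ngrams('corpulent', 3)

print(sorted(misspelled & candidate))            # ['cor', 'ent', 'len', 'ule']
print(round(jaccard(misspelled, candidate), 3))  # 0.6
```

The two words share 4 of 10 distinct trigrams, giving a distance of 0.6. A single wrong letter in the middle of a word changes three trigrams at once, which is why the distance is fairly large even for a one-letter typo.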

### Question 10

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the 4-grams of the two words.**

Refer to:
- [NLTK Jaccard distance](https://www.nltk.org/api/nltk.metrics.distance.html?highlight=jaccard_distance#nltk.metrics.distance.jaccard_distance)
- [NLTK ngrams](https://www.nltk.org/api/nltk.util.html?highlight=ngrams#nltk.util.ngrams)

*This function should return a list of length three:
`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*

In [39]:
def answer_ten(entries=['cormulent', 'incendenece', 'validrate']):
    
    from nltk.util import ngrams
    l = []
    
    for entry in entries:
        temp = [(nltk.jaccard_distance(set(ngrams(entry, 4)), set(ngrams(w, 4))),w) for w in correct_spellings if w[0]==entry[0]]
        l.append(sorted(temp, key = lambda val:val[0])[0][1])
    
    
    # raise NotImplementedError()
    return l
    
answer_ten()

['cormus', 'incendiary', 'valid']
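
The 4-gram recommender favors shorter words than the trigram recommender did. One way to see why is to compare the distances from `'validrate'` to `'validate'` and to `'valid'` for both n-gram sizes. This is a plain-Python sketch with hypothetical helper names; it compares only these two candidates and does not search the whole word list.

```python
def char_ngrams(word, n):
    """Set of all contiguous character n-grams in word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a, b):
    """Jaccard distance between two non-empty sets."""
    return 1 - len(a & b) / len(a | b)


def distances(entry, candidates, n):
    """Map each candidate to its n-gram Jaccard distance from entry."""
    entry_grams = char_ngrams(entry, n)
    return {w: jaccard(entry_grams, char_ngrams(w, n)) for w in candidates}


d3 = distances('validrate', ['validate', 'valid'], 3)
d4 = distances('validrate', ['validate', 'valid'], 4)

print({w: round(d, 3) for w, d in d3.items()})  # {'validate': 0.556, 'valid': 0.571}
print({w: round(d, 3) for w, d in d4.items()})  # {'validate': 0.778, 'valid': 0.667}
```

With 4-grams, the extra `r` breaks more of the n-grams shared with `'validate'`, while the short word `'valid'` contributes only two 4-grams and both of them match. That keeps the union small and the distance low, which is consistent with the outputs of Questions 9 and 10.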

### Question 11

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Edit distance on the two words with transpositions.](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)**

Refer to:
- [NLTK edit distance](https://www.nltk.org/api/nltk.metrics.distance.html?highlight=edit_distance#nltk.metrics.distance.edit_distance)

*This function should return a list of length three:
`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*

In [43]:
def answer_eleven(entries=['cormulent', 'incendenece', 'validrate']):
    
    l = []
    
    for entry in entries:
        temp = [(nltk.edit_distance(entry, w, transpositions=True),w) for w in correct_spellings if w[0]==entry[0]]
        l.append(sorted(temp, key = lambda val:val[0])[0][1])

    #raise NotImplementedError()
    return l 
    
answer_eleven()

['corpulent', 'intendence', 'validate']
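
To illustrate what "with transpositions" changes, here is a minimal plain-Python sketch of the restricted Damerau-Levenshtein distance (also called optimal string alignment). The function `osa_distance` is a hypothetical helper written for this example; that it agrees with `nltk.edit_distance` on every input is an assumption, not something verified here.

```python
def osa_distance(s, t, transpositions=True):
    """Edit distance with unit costs; optionally counts an adjacent swap as one edit."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of s
    for j in range(cols):
        d[0][j] = j  # insert all j characters of t
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (transpositions and i > 1 and j > 1
                    and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


print(osa_distance('recieve', 'receive'))                        # 1
print(osa_distance('recieve', 'receive', transpositions=False))  # 2
print(osa_distance('cormulent', 'corpulent'))                    # 1
print(osa_distance('incendenece', 'intendence'))                 # 2
```

Swapped neighboring letters are a common typing error, so counting a swap as one edit instead of two tends to rank the intended word higher.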