<font color="green">

## Home task: NLTK (Natural Language Toolkit) 
</font>

In [1]:
import nltk
from nltk.corpus import gutenberg 

In [2]:
nltk.download('gutenberg', quiet=True)

True

In [3]:
moby_raw = gutenberg.raw('melville-moby_dick.txt') 

<font color = green >

### Example 1

</font>

How many tokens (words and punctuation symbols) are in `moby_raw`?
<br>*This function should return an integer.*

In [4]:
from nltk.tokenize import word_tokenize

def example_one():
    return len(word_tokenize(moby_raw)) 

<font color = blue >

### Check result

</font>


In [5]:
print('{:,}'.format(example_one()))

255,028


<font color = green >

### Example 2

</font>

How many unique tokens (unique words and punctuation) does `moby_raw` have?
<br>*This function should return an integer.*

In [6]:
def example_two():    
    return len(set(nltk.word_tokenize(moby_raw)))

<font color = blue >

### Check result

</font>


In [7]:
print('{:,}'.format(example_two()))

20,742


<font color = green >

### Example 3

</font>

After lemmatizing the verbs, how many unique tokens does `moby_raw` have?
<br>*This function should return an integer.*


In [8]:
from nltk.stem import WordNetLemmatizer

def example_three():
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(w, 'v') for w in nltk.word_tokenize(moby_raw)]
    return len(set(lemmatized))

<font color = blue >

### Check result

</font>


In [9]:
print('{:,}'.format(example_three()))

16,887


<font color = green >

### Question 1

</font>


What is the lexical diversity of the given text input (i.e., the ratio of unique tokens to the total number of tokens)?
<br>*This function should return a float.*
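
As a quick check of the definition, here is a minimal sketch on a hand-made token list. The `lexical_diversity` helper and `toy_tokens` list are illustrative names, not part of the assignment, and the empty-list guard is an added assumption.

```python
def lexical_diversity(tokens):
    """Ratio of unique tokens to total tokens; 0.0 for an empty list (assumed convention)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical toy input: 7 tokens, 5 of them unique
toy_tokens = ['the', 'whale', ',', 'the', 'white', 'whale', '.']
diversity = lexical_diversity(toy_tokens)
print(diversity)
```

A value close to 1 means almost every token is distinct; a long novel with many repeated function words gives a much smaller ratio.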


In [10]:
from nltk.tokenize import word_tokenize

def answer_one():
    tokens = word_tokenize(moby_raw)
    
    # Divide the number of unique tokens by the number
    # of all tokens to calculate the lexical diversity
    return len(set(tokens)) / len(tokens)

<font color = blue >

### Check result

</font>


In [11]:
print(answer_one())

0.08133224587104161


<font color = blue >

### Expected Output

</font>

`0.08139566804842562`


<font color = green >

### Question 2

</font>

What percentage of tokens is 'whale' or 'Whale'?
<br>*This function should return a float.*

In [12]:
from nltk.tokenize import word_tokenize
from nltk import FreqDist

def answer_two():    
    tokens = word_tokenize(moby_raw)

    # Dictionary where keys are tokens, and values are their frequencies
    vocab = FreqDist(tokens)

    # Multiply by 100 to get percentage
    return 100 * (vocab['whale'] + vocab['Whale']) / len(tokens)

<font color = blue >

### Check result

</font>


In [13]:
print(answer_two())

0.41250372508116756


<font color = blue >

### Expected Output

</font>

`0.4125668166077752`


<font color = green >

### Question 3

</font>

What are the 10 most frequently occurring (unique) tokens in the text? What is their frequency?
<br>*This function should return a list of 10 tuples where each tuple is of the form `(token, frequency)`. The list should be sorted in descending order of frequency.*
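
`nltk.FreqDist` is a subclass of `collections.Counter`, so the `most_common` method can be tried out with the standard library alone. The sketch below uses a made-up token list (`toy_tokens` is an illustrative name) to show the shape of the returned value.

```python
from collections import Counter

# Hypothetical toy input standing in for the tokenized text
toy_tokens = ['a', 'b', 'a', ',', 'b', 'a', 'c']

counts = Counter(toy_tokens)

# most_common(n) returns (token, frequency) tuples, highest frequency first
top_two = counts.most_common(2)
print(top_two)
```

The result is already sorted in descending order of frequency, so no extra call to `sorted()` is needed.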

In [14]:
from nltk.tokenize import word_tokenize
from nltk import FreqDist

def answer_three():
    words = word_tokenize(moby_raw)
    vocab = FreqDist(words)

    # Get 10 words with the largest frequency values
    return vocab.most_common(10)

<font color = blue >

### Check result

</font>


In [15]:
print(answer_three())

[(',', 19204), ('the', 13715), ('.', 7306), ('of', 6513), ('and', 6010), ('a', 4545), ('to', 4515), (';', 4173), ('in', 3908), ('that', 2978)]


<font color = blue >

### Expected Output

</font>

`[(',', 19204),
 ('the', 13715),
 ('.', 7308),
 ('of', 6513),
 ('and', 6010),
 ('a', 4545),
 ('to', 4515),
 (';', 4173),
 ('in', 3908),
 ('that', 2978)]`


<font color = green >

### Question 4

</font>

Which tokens have a length greater than 5 and a frequency greater than 150?
<br>*This function should return a sorted list of the tokens that match the above constraints. To sort your list, use `sorted()`*

In [16]:
from nltk.tokenize import word_tokenize
from nltk import FreqDist

def answer_four():
    words = word_tokenize(moby_raw)
    vocab = FreqDist(words)

    # Filter words according to the task, then sort them
    words = [word for word, freq in vocab.items() if len(word) > 5 and freq > 150]
    return sorted(words)

<font color = blue >

### Check result

</font>


In [17]:
print(answer_four())

['Captain', 'Pequod', 'Queequeg', 'Starbuck', 'almost', 'before', 'himself', 'little', 'seemed', 'should', 'though', 'through', 'whales', 'without']


<font color = blue >

### Expected Output

</font>

`['Captain', 'Pequod', 'Queequeg', 'Starbuck', 'almost', 'before', 'himself', 'little', 'seemed', 'should', 'though', 'through', 'whales', 'without']`


<font color = green >

### Question 5

</font>

Find the longest word in the text and that word's length.
<br>
*This function should return a tuple `(longest_word, length)`.*
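
When several tokens share the maximum length, which one is returned depends on how ties are broken. The sketch below, on a made-up set of tokens, uses a `(length, token)` key so the result is deterministic; breaking ties by string order is an assumption of this example, not a requirement of the task.

```python
# Hypothetical toy input: 'Tashtego' and 'Starbuck' both have 8 characters
toy_tokens = {'whale', 'Pequod', 'Tashtego', 'Starbuck'}

# Compare by length first, then by the token itself to break ties
longest = max(toy_tokens, key=lambda token: (len(token), token))
result = (longest, len(longest))
print(result)
```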


In [18]:
from nltk.tokenize import word_tokenize

def answer_five():
    words = set(word_tokenize(moby_raw))

    # Pick the longest unique token by character length
    longest = max(words, key=len)
    return (longest, len(longest)) 

<font color = blue >

### Check result

</font>


In [19]:
print(answer_five())

("twelve-o'clock-at-night", 23)


<font color = blue >

### Expected Output

</font>

`("twelve-o'clock-at-night", 23)`


<font color = green >

### Question 6

</font>

What unique words have a frequency of more than 2000? What is their frequency?
<br>*This function should return a list of tuples of the form `(frequency, word)` sorted in descending order of frequency.*
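
One way to separate words from punctuation tokens is `str.isalnum`. A minimal sketch on a made-up token list (`toy_tokens` is an illustrative name) shows what this test keeps and what it drops; note that tokens containing hyphens or apostrophes are dropped as well.

```python
# Hypothetical toy input mixing words, numbers and punctuation
toy_tokens = ['whale', ',', "twelve-o'clock-at-night", 'Ahab', '1851', '--']

# str.isalnum is True only if every character is a letter or a digit
kept = [token for token in toy_tokens if token.isalnum()]
dropped = [token for token in toy_tokens if not token.isalnum()]

print(kept)
print(dropped)
```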


In [20]:
from nltk.tokenize import word_tokenize
from nltk import FreqDist

def answer_six():

    # Keep only purely alphanumeric tokens; this drops punctuation and also
    # any token that contains a hyphen or an apostrophe
    alphanum_words = filter(str.isalnum, word_tokenize(moby_raw))

    # Filter words by their frequencies, and sort them by frequency values
    vocab = [(freq, word) for word, freq in FreqDist(alphanum_words).items() if freq > 2000]
    return sorted(vocab, key=lambda x: x[0], reverse=True)

<font color = blue >

### Check result

</font>


In [21]:
print(answer_six())

[(13715, 'the'), (6513, 'of'), (6010, 'and'), (4545, 'a'), (4515, 'to'), (3908, 'in'), (2978, 'that'), (2459, 'his'), (2196, 'it'), (2113, 'I')]


<font color = blue >

### Expected Output

</font>

`[(13715, 'the'), (6513, 'of'), (6010, 'and'), (4545, 'a'), (4515, 'to'), (3908, 'in'), (2978, 'that'), (2459, 'his'), (2196, 'it'), (2097, 'I')]`


<font color = green >

### Question 7

</font>

What is the average number of tokens per sentence?
<br>*This function should return a float.*
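
The average is the mean of the per-sentence token counts. The sketch below works through the arithmetic on three hand-tokenized sentences; `toy_sentences` is an illustrative stand-in for the output of sentence and word tokenization.

```python
# Hypothetical toy input: each inner list is one already-tokenized sentence
toy_sentences = [
    ['Call', 'me', 'Ishmael', '.'],
    ['It', 'is', 'a', 'whale', '!'],
    ['Aye', ',', 'aye', ',', 'sir', '.'],
]

# Count tokens per sentence, then average the counts
lengths = [len(sentence) for sentence in toy_sentences]
average = sum(lengths) / len(lengths)

print(lengths)
print(average)
```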

In [22]:
from nltk.tokenize import word_tokenize, sent_tokenize
import numpy as np 

def answer_seven():

    # Get the number of tokens for each sentence, and compute their average
    sentences = [len(word_tokenize(sentence)) for sentence in sent_tokenize(moby_raw)]
    return np.mean(sentences)

<font color = blue >

### Check result

</font>


In [23]:
print(answer_seven())

25.88591149005278


<font color = blue >

### Expected Output

</font>

`25.881952902963864`


<font color = green >

### Question 8

</font>

What are the 5 most frequent parts of speech in this text? What is their frequency?
<br>*This function should return a list of tuples of the form `(part_of_speech, frequency)` sorted in descending order of frequency.*
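
A tagger returns `(word, tag)` pairs, so counting parts of speech comes down to counting the second element of each pair. The sketch below uses a short list tagged by hand (the tags in `toy_tagged` are assigned for illustration, not produced by a tagger) and `collections.Counter`, which `nltk.FreqDist` subclasses.

```python
from collections import Counter

# Hypothetical hand-tagged input in the (word, tag) form a tagger returns
toy_tagged = [
    ('the', 'DT'), ('white', 'JJ'), ('whale', 'NN'), ('struck', 'VBD'),
    ('the', 'DT'), ('ship', 'NN'), ('bow', 'NN'), ('.', '.'),
]

# Keep only the tag of each pair and count occurrences
tag_counts = Counter(tag for _, tag in toy_tagged)
top_tags = tag_counts.most_common(2)
print(top_tags)
```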

In [24]:
nltk.download('averaged_perceptron_tagger', quiet=True)

True

In [25]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag, FreqDist

def answer_eight():
    words = word_tokenize(moby_raw)

    # Get the PoS tags for each word in the text
    tags = [tag for _, tag in pos_tag(words)]

    # Get the 5 most frequent PoS tags
    pos_vocab = FreqDist(tags)
    return pos_vocab.most_common(5)

<font color = blue >

### Check result

</font>


In [26]:
print(answer_eight())

[('NN', 32727), ('IN', 28662), ('DT', 25879), (',', 19204), ('JJ', 17613)]


<font color = blue >

### Expected Output

</font>

`[('NN', 32730), ('IN', 28657), ('DT', 25867), (',', 19204), ('JJ', 17620)]`


<font color = green >

### Question 9

</font>

Create a spelling recommender that takes a list of misspelled words and recommends a correctly spelled word for every word in the list.

For every misspelled word, the recommender should find the word in `correct_spellings` that starts with the same letter as the misspelled word and has the shortest edit distance to it (you may need to use `nltk.edit_distance(word_1, word_2, transpositions=True)`), and return that word as the recommendation.

The recommender should provide recommendations for the three words: `['cormulent', 'incendenece', 'validrate']`.
<br>*This function should return a list of length three:
`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*
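
To make the idea of an edit distance with transpositions concrete, here is a minimal pure-Python sketch of one common variant (optimal string alignment), where swapping two adjacent characters counts as a single edit. It is a simplified illustration and not NLTK's implementation; `osa_distance`, `recommend` and `toy_dictionary` are hypothetical names introduced for this example.

```python
def osa_distance(a, b):
    """Edit distance where an adjacent swap costs one edit (illustrative sketch)."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of a
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. 'ab' -> 'ba'
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


def recommend(misspelled, dictionary):
    """Closest dictionary word that starts with the same letter."""
    candidates = [w for w in dictionary if w.startswith(misspelled[0])]
    return min(candidates, key=lambda w: osa_distance(misspelled, w))


# Hypothetical toy dictionary standing in for a real word list
toy_dictionary = ['corpulent', 'corpus', 'validate', 'valid', 'incidence']
suggestions = [recommend(w, toy_dictionary) for w in ['cormulent', 'validrate']]
print(suggestions)
```

Restricting candidates to words with the same first letter keeps the search small, which matters when the dictionary holds hundreds of thousands of entries.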

In [27]:
nltk.download('words', quiet=True)

True

In [28]:
from nltk.corpus import words
from nltk import edit_distance

def answer_nine(default_words=('cormulent', 'incendenece', 'validrate')):
    correct_spellings = words.words()

    recommendations = []
    for word in default_words:

        # Keep only the correct words that start with the same letter as the misspelled word
        correct_words = filter(lambda x: x.startswith(word[0]), correct_spellings)

        # Map each edit distance to a correct word at that distance (on ties,
        # the last candidate wins), then recommend the word at the smallest distance
        distances = {
            edit_distance(word, correct_word, transpositions=True): correct_word for correct_word in correct_words
        }
        recommendations.append(distances[min(distances)])

    return recommendations

<font color = blue >

### Check result

</font>


In [29]:
print(answer_nine())

['corpulent', 'intendence', 'validate']


<font color = blue >

### Expected Output

</font>

`['corpulent', 'intendence', 'validate']`