# Text Analysis and Spelling Recommender
For this project, we used the `nltk` library to explore the <a href='http://www.cs.cmu.edu/~ark/personas/'>CMU Movie Summary Corpus</a>. All data is released under a Creative Commons Attribution-ShareAlike License. 

We also created a spelling recommender function that uses the `nltk` library to find words similar to the misspelling. 

In [69]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.probability import FreqDist
from nltk.corpus import words
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
from nltk.metrics.distance import edit_distance

import pandas as pd
import numpy as np

from collections import Counter

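# nltk downloads (uncomment on first run; resource names can differ across nltk versions)
# nltk.download('punkt')                       # tokenizer models used by word_tokenize
# nltk.download('wordnet')                     # data for WordNetLemmatizer
# nltk.download('averaged_perceptron_tagger')  # model used by pos_tag
# nltk.download('words')                       # word list used by the spelling recommender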

## 1. Exploratory Analysis of Plot Summaries

In [2]:
# Load / prep raw data
df = pd.read_csv('./assets/plot_summaries.txt', sep='\t', header=None)
df = df[[1]]
df = df.rename(columns={1 : 'Plot'})
df.head()

Unnamed: 0,Plot
0,"Shlykov, a hard-working taxi driver and Lyosha..."
1,The nation of Panem consists of a wealthy Capi...
2,Poovalli Induchoodan is sentenced for six yea...
3,"The Lemon Drop Kid , a New York City swindler,..."
4,Seventh-day Adventist Church pastor Michael Ch...


In [9]:
# Combine plots into single string
plot_string = df.sum()
# Tokenize that string
tokens = nltk.word_tokenize(plot_string['Plot'])

### How many tokens, including both words and punctuation marks, are in the text?

In [30]:
# Find number of tokens
num_tokens = len(tokens)

# Format number of tokens
str_num_tokens = f'{num_tokens : ,}'
print('There are ', str_num_tokens, ' words and punctuation marks in the text.')

There are   14,837,137  words and punctuation marks in the text.


### How many unique tokens are there in text?

In [32]:
# Find number of unique tokens
num_unique = len(set(tokens))
                 
# Format number of unique tokens
str_num_unique = f'{num_unique : ,}'
print('There are ', str_num_unique, ' unique words and punctuation marks in the text.')

There are   226,452  unique words and punctuation marks in the text.


### After lemmatizing the verbs, how many unique tokens does the text have?

In [33]:
# Create lemmatizer 
lemmatizer = WordNetLemmatizer()
# Lemmatize verbs 
lemmatized = [lemmatizer.lemmatize(w,'v') for w in tokens]

# Find number of unique tokens after lemmatizing verbs
num_lemmatized = len(set(lemmatized))
# Format number of unique lemmatized tokens
str_num_lemmatized = f'{num_lemmatized : ,}'

print('There are ', str_num_lemmatized, ' unique tokens in the text after lemmatizing verbs.')

There are   213,430  unique tokens in the text after lemmatizing verbs.


### What is the lexical diversity of the given text input? 
The lexical diversity of a text is the ratio of unique tokens to the total number of tokens.

In [34]:
# Calculate lexical diversity
lex_diversity = round(num_unique/num_tokens, 3)

print('The lexical diversity is: ', str(lex_diversity))

The lexical diversity is:  0.015


Taken at face value, a ratio this low would suggest repetitive or simplistic content, like children's rhymes or chants. However, lexical diversity is sensitive to text length: as a text grows, common words keep repeating while new words appear more and more slowly. So large texts, like this one with nearly 15 million tokens, naturally have lower diversity.
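To make the length effect concrete, here is a minimal sketch on a toy sentence rather than the corpus. The helper name `lexical_diversity` and the sample text are illustrative assumptions, not part of the analysis above.

```python
def lexical_diversity(tokens):
    """Ratio of unique tokens to total tokens."""
    return len(set(tokens)) / len(tokens)

# Toy sample (hypothetical text, whitespace-tokenized for simplicity)
sample = "the cat sat on the mat and the dog sat on the rug".split()

short_ratio = lexical_diversity(sample[:5])        # first 5 tokens only
full_ratio = lexical_diversity(sample)             # all 13 tokens
repeated_ratio = lexical_diversity(sample * 100)   # same vocabulary, 100x longer

print(round(short_ratio, 3), round(full_ratio, 3), round(repeated_ratio, 3))
```

The vocabulary stays at eight words while the token count grows, so the ratio drops as the text gets longer even though the writing itself has not changed.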

### What percentage of tokens is 'love' or 'Love'?

In [38]:
# Frequency distribution of tokens
fdist = FreqDist(tokens)
# Add frequencies for 'love' and 'Love' (indexing returns 0 for a missing token)
freq_love = fdist['love'] + fdist['Love']

# Find the percentage
perc_love = round((freq_love / num_tokens) * 100, 2)
# Convert to string 
str_perc_love = f'{perc_love : ,}'

print("The percentage of tokens that are 'love' or 'Love' is ", str_perc_love, '%.') 

The percentage of tokens that are 'love' or 'Love' is   0.12 %.


### Which 10 unique tokens appear most frequently in the text? What are their frequencies?

In [40]:
# Get 10 most frequent tokens
most_freq = fdist.most_common(10)

# Create data frame with results
top_10 = pd.DataFrame(most_freq, columns=['Token', 'Frequency'])

top_10

Unnamed: 0,Token,Frequency
0,",",787499
1,the,737267
2,.,619554
3,to,478162
4,and,455448
5,a,362327
6,of,261133
7,is,225160
8,in,201022
9,his,190723


### Which tokens have a length greater than 5 and a frequency of more than 10,000?
The tokens are sorted alphabetically. 

In [44]:
# Get tokens 
unique_tokens = fdist.keys()

# Find tokens with len > 5 and frequency > 10,000 
long_freq_tokens = [t for t in unique_tokens if len(t) > 5 and fdist[t] > 10000]

# Sort list alphabetically (uppercase letters sort before lowercase)
sorted_long_freq_tokens = sorted(long_freq_tokens)
# Create data frame
sorted_long_freq_df = pd.DataFrame(sorted_long_freq_tokens, columns=['Frequent Long Tokens'])

sorted_long_freq_df

Unnamed: 0,Frequent Long Tokens
0,However
1,becomes
2,before
3,begins
4,family
5,father
6,friend
7,himself
8,killed
9,mother


### Find the longest token. What's its length?

In [49]:
# Sort in reverse order by length 
sorted_tokens = sorted(unique_tokens, key=len, reverse=True)

# Find length of the longest token
len_longest = len(sorted_tokens[0])

print('The longest token is "', sorted_tokens[0], '" and it is ', len_longest, 'characters long.')

The longest token is " //www.rottentomatoes.com/m/new_brooklyn/articles/1801319/exquisitely_constructed_it_gives_viewers_a_consistently_satisfying_experience_which_works_on_every_level_but_lewin_steals_the_movie_be_ready_to_have_your_heart_break " and it is  222 characters long.


### Which unique *words* have a frequency of more than 100,000? What are their frequencies?

In [55]:
# Find words with frequency > 100,000 
freq_words = [(fdist[t], t) for t in unique_tokens if t.isalpha() and fdist[t] > 100000]
# Sort this list of tuples 
sorted_freq_words = sorted(freq_words, key=lambda t: t[0], reverse=True)

# Create data frame with most frequent words in descending order of frequency 
sorted_freq_words_df = pd.DataFrame(sorted_freq_words, columns=['Frequency', 'Word'])
sorted_freq_words_df

Unnamed: 0,Frequency,Word
0,737267,the
1,478162,to
2,455448,and
3,362327,a
4,261133,of
5,225160,is
6,201022,in
7,190723,his
8,147668,her
9,139296,he


### What are the 5 most frequent parts of speech in the text? What are their frequencies?

In [59]:
# Identify part of speech
pos = nltk.pos_tag(tokens)
# Count parts of speech 
pos_count = Counter(tup[1] for tup in pos)

# Five most frequent 
pos_top_5 = pos_count.most_common(5)
# Create data frame with five most frequent pos
pos_top_5_df = pd.DataFrame(pos_top_5, columns=['Part of Speech', 'Frequency'])

pos_top_5_df

Unnamed: 0,Part of Speech,Frequency
0,NN,2046873
1,IN,1551928
2,NNP,1514305
3,DT,1358231
4,VBZ,926175


So the most common parts of speech are singular or mass nouns (NN), prepositions or subordinating conjunctions (IN), singular proper nouns (NNP), determiners, including articles (DT), and third-person singular present-tense verbs (VBZ).

## 2. Spelling Recommender
For this part of the project, we created three different spelling recommenders. Each takes a list of misspelled words and recommends a correctly spelled word for every word in the list.

In [61]:
# A comprehensive corpus of words 
correct_spellings = words.words()

### Recommender Using Jaccard Similarity Index on Trigrams
The first recommender uses the Jaccard Similarity index on character trigrams to recommend a correct spelling. For each misspelled word, we build its set of character trigrams and compare it with the trigram set of every corpus word that starts with the same letter (a shortcut that saves time but assumes the first letter was typed correctly). The code computes the Jaccard distance, which is one minus the Jaccard Similarity index, so the corpus word with the smallest distance (equivalently, the highest similarity) is recommended as the correct spelling.
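As a worked example of the comparison described above, the following sketch computes the trigram sets and the Jaccard distance for one misspelling and one candidate in plain Python. The helpers `char_ngrams` and `jaccard_dist` are hypothetical stand-ins written for illustration; the recommender itself relies on `nltk`'s `ngrams` and `jaccard_distance`.

```python
def char_ngrams(word, n):
    """Set of character n-grams of `word`, each as a substring."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard_dist(a, b):
    """Jaccard distance: 1 - |a & b| / |a | b| (assumes the union is non-empty)."""
    union = a | b
    return (len(union) - len(a & b)) / len(union)

entry_trigrams = char_ngrams('cormulent', 3)
word_trigrams = char_ngrams('corpulent', 3)

shared = entry_trigrams & word_trigrams            # trigrams untouched by the typo
distance = jaccard_dist(entry_trigrams, word_trigrams)

print(sorted(shared))
print(round(distance, 2))
```

The single wrong letter changes three of the seven trigrams, so four trigrams are shared out of ten distinct ones, giving a distance of 0.6.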

In [63]:
def recommender_one(entries):

    # List of recommendations
    recommendations = []
    # Size of n-grams 
    n = 3
    
    # Loop through arguments 
    for entry in entries:
        # List of tuples with correct_spelling words and Jaccard distance
        jaccard_dists = []
        # n-gram of the argument
        entry_n_gram = set(ngrams(entry, n))
        # First letter of argument 
        entry_start_letter = entry[0]
        
        # Loop through words in correct_spellings 
        for word in correct_spellings:
            
            # To save time, check if starts with same letter 
            if word.startswith(entry_start_letter):
                # Create n-gram of word in correct_spelling
                word_n_gram = set(ngrams(word, n))
                # Calculate Jaccard distance with argument 
                jaccard_dist = jaccard_distance(entry_n_gram, word_n_gram)
                # Add tuple with word and its distance score
                jaccard_dists.append((word, jaccard_dist))
        
        # Sort by Jaccard distances 
        sorted_jaccard_dists = sorted(jaccard_dists, key=lambda t: t[1])
        # Top recommendation is the first word 
        recommendations.append(sorted_jaccard_dists[0][0]) 
    
    # Return recommendations 
    return recommendations

In [65]:
# List of misspelled words 
misspelled_words = ['cormulent', 'incendenece', 'validrate'] 
# Find recommendations
recommendations_1 = recommender_one(misspelled_words)

print('For the list: ', misspelled_words) 
print('these are recommended spellings: ', recommendations_1) 

For the list:  ['cormulent', 'incendenece', 'validrate']
these are recommended spellings:  ['corpulent', 'indecence', 'validate']


### Recommender Using Jaccard Distance on Quadrigrams
The second recommender follows the same process as the first, with one exception: instead of character trigrams, it uses character quadrigrams (4-grams). It again recommends the corpus word whose quadrigram set is most similar to that of the misspelled word.

In [66]:
def recommender_two(entries):
    
    # List of recommendations
    recommendations = []
    # Size of n-grams 
    n = 4
    
    # Loop through arguments 
    for entry in entries:
        # List of tuples with correct_spelling words and Jaccard distance
        jaccard_dists = []
        # n-gram of the argument
        entry_n_gram = set(ngrams(entry, n))
        # First letter of argument 
        entry_start_letter = entry[0]
        
        # Loop through words in correct_spellings 
        for word in correct_spellings:
            # To save time, check if starts with same letter 
            if word.startswith(entry_start_letter):
                # Create n-gram of word in correct_spelling
                word_n_gram = set(ngrams(word, n))
                # Calculate Jaccard distance with argument 
                jaccard_dist = jaccard_distance(entry_n_gram, word_n_gram)
                # Add tuple with word and its distance score
                jaccard_dists.append((word, jaccard_dist))
                
        # Sort by Jaccard distances 
        sorted_jaccard_dists = sorted(jaccard_dists, key=lambda t: t[1])
        # Top recommendation is the first word 
        recommendations.append(sorted_jaccard_dists[0][0]) 
        
    # Return recommendations 
    return recommendations

In [68]:
# Find recommendations for same misspelled words
recommendations_2 = recommender_two(misspelled_words)

print('For the list: ', misspelled_words) 
print('these are recommended spellings: ', recommendations_2) 

For the list:  ['cormulent', 'incendenece', 'validrate']
these are recommended spellings:  ['cormus', 'incendiary', 'valid']


### Recommender Using Edit Distance 
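For each misspelled word, we calculate the edit distance with transpositions between that word and each corpus word that starts with the same letter. Insertions, deletions, substitutions, and swaps of two adjacent characters each count as one edit. The corpus word with the smallest edit distance is recommended as the correct spelling.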
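To show what this distance measures, here is a minimal sketch of an edit distance with adjacent transpositions (the restricted variant often called optimal string alignment). It is an illustration under that assumption, not `nltk`'s implementation, and the function name `osa_distance` is hypothetical.

```python
def osa_distance(s, t):
    """Edit distance where insert, delete, substitute, and adjacent swap each cost 1."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all i characters of s
    for j in range(cols):
        d[0][j] = j                      # insert all j characters of t
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            # Swap of two adjacent characters
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance('validrate', 'validate'))     # one deletion
print(osa_distance('cormulent', 'corpulent'))    # one substitution
print(osa_distance('incendenece', 'intendence')) # one substitution plus one deletion
print(osa_distance('ab', 'ba'))                  # one transposition
```

Without the transposition step, swapping two adjacent letters would cost two edits instead of one, which matters for a common class of typing errors.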

In [70]:
def recommender_three(entries):

    # List of recommendations
    recommendations = []
    
    # Loop through arguments 
    for entry in entries:
        # List of words in correct_spellings and edit distances 
        edit_dists = []
        # First letter of argument 
        entry_start_letter = entry[0]

        # Loop through words in correct_spellings 
        for word in correct_spellings:
            # To save time, check if starts with same letter 
            if word.startswith(entry_start_letter):
                # Calculate the edit distance
                edit_dist = edit_distance(entry, word, transpositions=True)
                # Add word and its distance as a tuple 
                edit_dists.append((word, edit_dist))
                
        # Sort by edit distances 
        sorted_edit_dists = sorted(edit_dists, key=lambda t: t[1])
        # Recommendation is the first word 
        recommendations.append(sorted_edit_dists[0][0])
        
    # Return recommendations 
    return recommendations

In [71]:
# Find recommendations for same misspelled words
recommendations_3 = recommender_three(misspelled_words)

print('For the list: ', misspelled_words) 
print('these are recommended spellings: ', recommendations_3) 

For the list:  ['cormulent', 'incendenece', 'validrate']
these are recommended spellings:  ['corpulent', 'intendence', 'validate']


For the misspelled words `cormulent`, `incendenece`, and `validrate`, the recommenders using the Jaccard Similarity index on character trigrams and the edit distance agreed on two of the three suggestions, `corpulent` and `validate`, which seem to be the most logical recommendations. For `incendenece` they differed: the trigram recommender suggested `indecence` and the edit-distance recommender suggested `intendence`. 

The recommender using the Jaccard Similarity index on character quadrigrams suggested these words: `cormus`, `incendiary`, `valid`. While similar to the misspelled words, they don't seem to be the correct words. We think this is a problem akin to overfitting: a single wrong character changes up to four quadrigrams, so a misspelled word shares few quadrigrams with its intended spelling, and shorter words that match only its beginning can score better. In general, as the length of the character n-grams increases, recommender performance improves up to a point, after which the suggestions become less useful. For this corpus, that tipping point seems to be n-grams of length 4.
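The effect of n-gram length can be checked by hand for one of the test words. The sketch below compares `validrate` against just two candidates, `validate` and `valid`, for n = 3 and n = 4. The helpers `char_ngrams` and `jaccard_dist` are hypothetical pure-Python stand-ins for `nltk`'s `ngrams` and `jaccard_distance`, and the two-word candidate list is an assumption made to keep the example small.

```python
def char_ngrams(word, n):
    """Set of character n-grams of `word`, each as a substring."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard_dist(a, b):
    """Jaccard distance: 1 - |a & b| / |a | b| (assumes the union is non-empty)."""
    union = a | b
    return (len(union) - len(a & b)) / len(union)

misspelled = 'validrate'
candidates = ['validate', 'valid']   # illustrative subset, not the full corpus

best_by_n = {}
for n in (3, 4):
    entry_set = char_ngrams(misspelled, n)
    scores = {c: jaccard_dist(entry_set, char_ngrams(c, n)) for c in candidates}
    best_by_n[n] = min(scores, key=scores.get)   # smallest distance wins
    print(n, {c: round(s, 3) for c, s in scores.items()}, '->', best_by_n[n])
```

With trigrams, `validate` is slightly closer than `valid`; with quadrigrams the stray `r` removes more of the overlap with `validate`, and the shorter `valid` comes out ahead, which is consistent with the outputs recorded above.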