## Spelling Recommender

For this project, we will create three different spelling recommenders. Each takes a list of misspelled words and recommends a correctly spelled word for every word in the list.

For every misspelled word, the recommender should find the word in `correct_spellings` that has the shortest distance* and starts with the same letter as the misspelled word, and return that word as a recommendation.

*Each of the three different recommenders will use a different distance measure (outlined below).

Each of the recommenders should provide recommendations for the three default words provided: `['cormulent', 'incendenece', 'validrate']`.
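
The selection rule described above can be sketched independently of any particular distance measure. The snippet below is a minimal illustration, not the notebook's implementation: `toy_vocabulary`, `stand_in_distance` and `recommend` are hypothetical names, the vocabulary is a tiny stand-in for `correct_spellings`, and the distance (one minus `difflib`'s similarity ratio) is a placeholder rather than one of the three required measures.

```python
from difflib import SequenceMatcher

# Hypothetical toy vocabulary standing in for `correct_spellings`
toy_vocabulary = ['corpulent', 'cormus', 'indecence', 'incendiary',
                  'validate', 'valid', 'opulent']

def stand_in_distance(a, b):
    # Placeholder metric (1 - difflib similarity ratio); an assumption for this sketch only
    return 1 - SequenceMatcher(None, a, b).ratio()

def recommend(entries, vocabulary, distance):
    recommendations = []
    for entry in entries:
        # Candidates must start with the same letter as the misspelled word
        candidates = [w for w in vocabulary if w[0].lower() == entry[0].lower()]
        # Keep the candidate with the smallest distance to the entry
        recommendations.append(min(candidates, key=lambda w: distance(entry, w)))
    return recommendations

result = recommend(['cormulent', 'incendenece', 'validrate'],
                   toy_vocabulary, stand_in_distance)
print(result)
```

Note that `min` raises a `ValueError` when no vocabulary word shares the entry's first letter, so a fuller version would need to decide what to return in that case.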

In [14]:
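# Importing libraries
import nltk
import pandas as pd
from nltk.corpus import words

# Loading a vocabulary of correctly spelled words
# (run nltk.download('words') once beforehand if the corpus is not installed)
correct_spellings = words.words()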

### Recommender using Jaccard distance

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the trigrams of the two words.**

*This function should return a list of length three:
`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*
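
To make the metric concrete, here is a small hand-checkable sketch in plain Python. The helpers `char_ngrams` and `jaccard_distance` are illustrative names written for this example (they are not the NLTK functions), and the word pair is chosen only because it is easy to verify by hand.

```python
def char_ngrams(word, n):
    # Set of all contiguous character n-grams in the word
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard_distance(set_a, set_b):
    # 1 - |A ∩ B| / |A ∪ B|, written as (|A ∪ B| - |A ∩ B|) / |A ∪ B|
    union = set_a | set_b
    return (len(union) - len(set_a & set_b)) / len(union)

misspelled = char_ngrams('cormulent', 3)
candidate = char_ngrams('corpulent', 3)

print(sorted(misspelled & candidate))           # shared trigrams: ['cor', 'ent', 'len', 'ule']
print(jaccard_distance(misspelled, candidate))  # 6 / 10 = 0.6
```

Each word has 7 trigrams, 4 of which are shared, so the union holds 10 trigrams and the distance is 6/10.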

In [15]:
def get_ngrams_3(text, n):
    
    '''
        Description: This function splits a text into its character n-grams.
        
        Args: 
            - text (string): the target text
            - n (int): the size of each n-gram (3 for trigrams)
        
        Returns: 
            - joined_n_grams (list): the character n-grams of the text, each joined into a single string
    
    '''
    
    # Extracting the character n-grams (as tuples) with nltk
    n_grams = nltk.ngrams(text, n)
    
    # Joining the characters of each n-gram into a single string
    joined_n_grams = [' '.join(grams) for grams in n_grams]
    
    # Returning the list of joined n-grams
    return joined_n_grams

In [16]:
def recommender_jaccard_3_grams(entries=['cormulent', 'incendenece', 'validrate']):
    
    '''
        Description: This function receives a list of misspelled words and, for each one, recommends the correctly spelled word with the smallest Jaccard distance on character 3-grams.
        
        Args:
            - entries (list): the misspelled words (three by default)
        
        Returns:
            - result_list (list): one recommendation per entry: ['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']
    
    '''
    
    # List to collect the recommendation for each misspelled word
    result_list = []
    
    for entry in entries:
        # Keeping only the vocabulary words that start with the same letter as this entry
        candidates = [w for w in correct_spellings if w[0].lower() == entry[0].lower()]
        
        # Applying the 3-gram to the misspelled word (once per entry)
        entry_grams = set(get_ngrams_3(entry, 3))
        
        # Calculating the Jaccard distance between the entry and each candidate
        distances = pd.Series({w: nltk.distance.jaccard_distance(entry_grams, set(get_ngrams_3(w, 3)))
                               for w in candidates})
        
        # Recommending the candidate with the smallest distance
        result_list.append(distances.idxmin())
    
    # Returning the result list
    return result_list
    
recommender_jaccard_3_grams()

['corpulent', 'indecence', 'validate']

For this second recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the 4-grams of the two words.**

*This function should return a list of length three:
`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*

In [17]:
def get_ngrams_4(text, n):
    
    '''
        Description: This function splits a text into its character n-grams.
        
        Args: 
            - text (string): the target text
            - n (int): the size of each n-gram (4 for 4-grams)
        
        Returns: 
            - joined_n_grams (list): the character n-grams of the text, each joined into a single string
    
    '''
    # Extracting the character n-grams (as tuples) with nltk
    n_grams = nltk.ngrams(text, n)
    
    # Joining the characters of each n-gram into a single string
    joined_n_grams = [' '.join(grams) for grams in n_grams]
    
    # Returning the list of joined n-grams
    return joined_n_grams

In [18]:
def recommender_jaccard_4_grams(entries=['cormulent', 'incendenece', 'validrate']):
    
    '''
        Description: This function receives a list of misspelled words and, for each one, recommends the correctly spelled word with the smallest Jaccard distance on character 4-grams.
        
        Args:
            - entries (list): the misspelled words (three by default)
        
        Returns:
            - result_list (list): one recommendation per entry: ['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']
    
    '''
    
    # List to collect the recommendation for each misspelled word
    result_list = []
    
    for entry in entries:
        # Keeping only the vocabulary words that start with the same letter as this entry
        candidates = [w for w in correct_spellings if w[0].lower() == entry[0].lower()]
        
        # Applying the 4-gram to the misspelled word (once per entry)
        entry_grams = set(get_ngrams_4(entry, 4))
        
        # Calculating the Jaccard distance between the entry and each candidate
        distances = pd.Series({w: nltk.distance.jaccard_distance(entry_grams, set(get_ngrams_4(w, 4)))
                               for w in candidates})
        
        # Recommending the candidate with the smallest distance
        result_list.append(distances.idxmin())
    
    # Returning the result list
    return result_list

recommender_jaccard_4_grams()

['cormus', 'incendiary', 'valid']
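
The 4-gram recommendations differ from the trigram ones, and a hand check on the first word suggests why. The sketch below uses plain-Python helpers written for this illustration (`char_ngrams`, `jaccard_distance` and `distance_on_ngrams` are not NLTK functions) and compares only the two candidates that appear in the outputs above.

```python
def char_ngrams(word, n):
    # Set of all contiguous character n-grams in the word
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard_distance(set_a, set_b):
    # (|A ∪ B| - |A ∩ B|) / |A ∪ B|
    union = set_a | set_b
    return (len(union) - len(set_a & set_b)) / len(union)

def distance_on_ngrams(a, b, n):
    # Jaccard distance between the character n-gram sets of two words
    return jaccard_distance(char_ngrams(a, n), char_ngrams(b, n))

for n in (3, 4):
    for candidate in ('corpulent', 'cormus'):
        print(n, candidate, round(distance_on_ngrams('cormulent', candidate, n), 3))
# 3 corpulent 0.6
# 3 cormus 0.625
# 4 corpulent 0.8
# 4 cormus 0.714
```

With trigrams, `corpulent` is slightly closer (0.6 against 0.625). With 4-grams, the single wrong letter in `cormulent` breaks four of its six 4-grams, while the short word `cormus` adds only one unshared 4-gram to the union, so `cormus` comes out ahead. Longer n-grams therefore seem to favor short words that share a prefix with the misspelling.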

### Recommender using Levenshtein distance

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Edit distance on the two words with transpositions.](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)**

*This function should return a list of length three:
`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*
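
As an illustration of what "with transpositions" changes, the following is a minimal pure-Python sketch of the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance. It is written for this example under the name `osa_distance` and is not NLTK's implementation, which may treat some edge cases differently.

```python
def osa_distance(s, t, transpositions=True):
    # d[i][j] holds the distance between the first i chars of s and the first j chars of t
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Swapping two adjacent characters counts as a single edit
            if (transpositions and i > 1 and j > 1
                    and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance('ab', 'ba'))                        # 1 (one transposition)
print(osa_distance('ab', 'ba', transpositions=False))  # 2 (two substitutions)
print(osa_distance('validrate', 'validate'))           # 1 (delete the extra 'r')
```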

In [19]:
def recommender_levenshtein(entries=['cormulent', 'incendenece', 'validrate']):
    
    '''
        Description: This function receives a list of misspelled words and, for each one, recommends the correctly spelled word with the smallest edit distance, counting a transposition as a single edit.
        
        Args:
            - entries (list): the misspelled words (three by default)
        
        Returns:
            - result_list (list): one recommendation per entry: ['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']
    
    '''
    
    # List to collect the recommendation for each misspelled word
    result_list = []
    
    for entry in entries:
        # Keeping only the vocabulary words that start with the same letter as this entry
        candidates = [w for w in correct_spellings if w[0].lower() == entry[0].lower()]
        
        # Calculating the edit distance (with transpositions) between the entry and each candidate
        distances = pd.Series({w: nltk.edit_distance(entry, w, transpositions=True)
                               for w in candidates})
        
        # Recommending the candidate with the smallest distance
        result_list.append(distances.idxmin())
    
    # Returning the result list
    return result_list
    
recommender_levenshtein()

['corpulent', 'intendence', 'validate']