## Alice's Adventures in <b style='color:green'>Word</b>land 
![my_image](./aliceimage.png)

### <b style='color:purple'> – Part1 – </b>

Alice likes playing a word game. When she is given a word, she suggests the word from her vocabulary that she thinks is closest to it. Her vocabulary is limited to the words in the story _"Alice's Adventures in Wonderland"_. 

When she knows the given word, she gives the same word back as her suggestion. If she doesn't know it, she finds the "closest word" using the following steps:
    
1) Ignore punctuation such as the comma (,), period (.), colon (:), exclamation point (!), etc. <br>
2) Treat 'A' and 'a' as the same character; matching is not case-sensitive <br>
3) Consider only words whose length is within +/- 1 letter of the given word <br>
4) Among those, pick the one with the largest number of matching letters <br>
5) If two or more words have the same number of matching letters, pick the one that occurs more often in the story <br>
             
          [ for example ]
            Let's say she is given the word "cha" for a word suggestion. 
            In her vocabulary, the words that are 2 to 4 letters long are, for example, act, been, chai, chip, hhhh, and map.
              
            If we compare "cha" to "act", 'c' and 'a' from "cha" exist in "act".
            Therefore:
                        cha(2) --> act
                        cha(0) --> been
                        cha(3) --> chai
                        cha(2) --> chip
                        cha(1) --> hhhh
                        cha(1) --> map
                            ↳ The number in parentheses is the number of matching letters.
                               In other words, one letter of "cha" is in "map", for example.

* Note that __Alice checks whether each letter of the given word can be found in a vocabulary word__. This reduces the chance of getting a not-so-ideal suggestion from her. If we counted in the reverse direction, checking the letters of the vocabulary word against the given word, "hhhh" would score 4 instead of 1, since all four letters of "hhhh" exist in the given word "cha".
   
                        act(2)  --> cha
                        been(0) --> cha
                        chai(3) --> cha
                        chip(2) --> cha
                        hhhh(4) --> cha
                        map(1)  --> cha          
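The two counting directions can be compared with a minimal sketch. The helper name `count_matches` is hypothetical and not part of this notebook; it simply counts how many letters of its first argument appear anywhere in its second.

```python
def count_matches(source, target):
    """Count the letters of `source` that appear anywhere in `target`."""
    return sum(1 for letter in source if letter in target)


vocab = ["act", "been", "chai", "chip", "hhhh", "map"]

# Alice's direction: letters of the given word looked up in each vocabulary word
forward = {w: count_matches("cha", w) for w in vocab}

# Reverse direction: letters of each vocabulary word looked up in the given word
reverse = {w: count_matches(w, "cha") for w in vocab}

print(forward)
print(reverse)
```

Only the score of "hhhh" differs between the two directions (1 versus 4), which is the case the note above warns about.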

![my_image](./cat.png)

What do you think about the last two steps of her approach to finding the closest word? Should one take precedence over the other?

Which would you like as a suggestion when you give her "tha": "that" or "the"? Following the current steps, Alice will give you "that", since it has 3 matching letters compared to 2 in "the". However, one might argue that "the" would be the better suggestion since we use it more often in sentences. On the other hand, someone else might say that "tha" could be an unfinished "that", as if we had prematurely pressed Enter before we finished typing. 

Let's say the word "bel" is given to Alice. Suppose, for this example, that the words of length 2 to 4 in her vocabulary are the ones below. The number of matching letters and the word's frequency in the vocabulary are listed as {# of matching letters, frequency}. Please note that, as discussed above, we are counting the letters of the given word that are found in each of Alice's vocabulary words.

                        bel --> be   {2, 1042}
                        bel --> been {2, 355}
                        bel --> eel  {2, 8}
                        bel --> bell {3, 109}
                        bel --> belt {3, 76}

If Alice gave precedence to the most frequent word, "be" would get picked even though two other words have more matching letters and seem to be better suggestions. In fact, "be" is the second most frequently appearing word in written English according to an analysis of the Oxford English Corpus (https://en.wikipedia.org/wiki/Most_common_words_in_English). Alice would therefore give us "the", "be", "to", etc. as the suggestion for any given word that looks similar to them. This would limit her suggestion pool to the most frequent words in written English, as long as they are in her vocabulary. 

For this reason, __Alice will consider word frequency only when multiple words in her vocabulary have the same number of matching letters with the given word__. 
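To make the two orderings concrete, here is a small sketch using the "bel" figures listed above. The `candidates` mapping and the two `max` calls are illustrative assumptions, not code from this notebook.

```python
# {word: (matching letters with "bel", frequency)} taken from the example above
candidates = {
    "be": (2, 1042),
    "been": (2, 355),
    "eel": (2, 8),
    "bell": (3, 109),
    "belt": (3, 76),
}

# Frequency first: the most common word wins regardless of how well it matches
by_frequency = max(candidates, key=lambda w: candidates[w][1])

# Matches first, frequency only as a tie-breaker (Alice's rule)
by_matches = max(candidates, key=lambda w: candidates[w])

print(by_frequency)  # be
print(by_matches)    # bell
```

Comparing the `(matches, frequency)` tuples directly works because Python orders tuples element by element, so frequency is consulted only when the match counts are equal.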

![my_image](./rabbitclock.png)

_Another thing about her current approach is that it is not sensitive to the order of letters. As a result, she will think "each" is a good suggestion for the given word "cha" rather than "chai" or "chat" if "each" appears more frequently than the other two. For now, we will implement something that gives a decent suggestion._
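The order-insensitivity is easy to see in a short sketch. `count_matches` is a hypothetical helper that mirrors Alice's counting rule. `difflib.SequenceMatcher` from the Python standard library is shown only as one possible order-aware comparison; it is an assumption for illustration and is not used anywhere in this notebook.

```python
from difflib import SequenceMatcher


def count_matches(given, candidate):
    """Alice's rule: letters of `given` that appear anywhere in `candidate`."""
    return sum(1 for letter in given if letter in candidate)


def order_aware(given, candidate):
    """Similarity ratio that rewards letters appearing in the same order."""
    return SequenceMatcher(None, given, candidate).ratio()


for word in ("each", "chai", "chat"):
    print(word, count_matches("cha", word), round(order_aware("cha", word), 2))
```

All three words get the same letter count, so only frequency separates them under Alice's rule, while the order-aware ratio ranks "chai" and "chat" above "each".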

<br>
_This project is similar to word auto-correction, but it assumes that the English words in this world are limited to those from the story._
<br>
<br>

text file source: https://archive.org/stream/alicesadventures19033gut/19033.txt <br>
image source: https://blog.whsmith.co.uk/alices-adventures-in-wonderland-free-colouring-downloads/

### Let's first construct Alice's vocabulary from Alice's Adventures in Wonderland story.

In [5]:
def story_to_sentences(text_file):
    """
    Takes a text file (the whole story) and puts each line into a list, in lower case 
    and without punctuation or special characters. Each line is treated as a 'sentence'.
    
    INPUT : text file
    OUTPUT: a list, ['sentence', 'sentence', 'sentence', ...]
    """
    import re

    ## Opening the text file
    with open(text_file) as alice_story:  
        
        ## [List] create one big list in which each item is a single line ("sentence")
        ## items are split on \n and don't contain \n as part of the string
        story_lines = alice_story.read().splitlines() ## length: 3736
        
        ## [List] put each sentence into another one new big list 
        ## each sentence is converted to lowercase and all special characters are removed
        cleaned_lines = []                            ## length: 3736
        for sentence in story_lines:
            ## keeping whitespaces between words in each sentence
            cleaned_lines.append(re.sub('[^A-Za-z0-9 ]+', '', sentence).lower()) 
            
    return cleaned_lines

After running the story_to_sentences function, the text looks partially like this at the story_lines level.

![my_image](./story_lines.png)

and at cleaned_lines level, it looks like below.

![my_image](./cleaned_lines.png)

In [6]:
def list_to_dict(sentences_lst):
    """
    Takes a list of sentences and goes through every word in each sentence.
    Creates a dictionary with each word as the key and its length and frequency in the story as the value.
    
    INPUT : a list of sentences, ['sentence', 'sentence', ...]
    OUTPUT: a dictionary, {'word' : [length of word, occurrence], ... }
    """
    ## {Dictionary} split into single word and put it in a dictionary
    words_dict = {}                               # length: 3254
    for sentence in sentences_lst:
        ## [List] "words" is a list with the words from one sentence, there will be 3736 lists
        ## note: an empty line yields [''], so the empty string '' also becomes a key
        words = sentence.split(" ")  
        for w in words:
            if w in words_dict: 
                ## when already exists in words_dict, just increment the freq
                words_dict[w][1] += 1
            else:               
                ## add a new entry in words_dict with word length and freq info
                words_dict[w] = [len(w), 1]

    return words_dict
    

In [7]:
sentences = story_to_sentences("aliceText.txt")

In [8]:
alice_vocab = list_to_dict(sentences)

Below is a glimpse of what Alice's vocabulary dictionary looks like after processing the text file.

![my_image](./alicevoca.png)

If we look up a word in Alice's dictionary, a list is returned. The first element is used to determine whether a word's length is within +/- 1 of the search word's, which is step 3 of Alice's approach. The second element is used for step 5.

In [7]:
alice_vocab['remarkable']

[10, 2]
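
The same `{'word': [length, occurrence]}` structure can also be built with `collections.Counter`. This is only a sketch on a three-line sample rather than the full story file. `build_vocab` is a hypothetical name, and it uses `str.split()` with no argument, which (unlike `split(" ")`) does not produce empty strings for blank lines.

```python
import re
from collections import Counter


def build_vocab(lines):
    """Return {'word': [length, occurrence]} for an iterable of text lines."""
    counts = Counter()
    for line in lines:
        cleaned = re.sub(r"[^A-Za-z0-9 ]+", "", line).lower()
        counts.update(cleaned.split())  # split() drops empty strings
    return {word: [len(word), freq] for word, freq in counts.items()}


sample = ["Alice was beginning to get very tired,", "", "and Alice was bored."]
vocab = build_vocab(sample)
print(vocab["alice"])  # [5, 2]
```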

Above, I mentioned that "be" is the second most frequently appearing word in written English according to an analysis of the Oxford English Corpus. Shall we see what the top ten most frequently appearing words are in the story of Alice's Adventures in Wonderland?

In [24]:
def nested_list(dictionary):
    """
    Creates one big list that contains all the sublists from a dictionary that has list as value.
    
    INPUT : a dictionary
    OUTPUT: a list, [[], [], [], [], [], ... ]
    """
    result = []
    for key in dictionary:
        result.append(dictionary[key])
        
    return result

In [26]:
# concatenating all the [length, frequency] sublists into one flat array (lengths and frequencies end up mixed)

import numpy as np
concat_list = np.concatenate(nested_list(alice_vocab))

In [16]:
# getting the ten largest values; these are frequencies, since they are far larger than any word length

import heapq
top_ten = heapq.nlargest(10, concat_list)

In [27]:
# printing out each item

order = 1
for freq in top_ten:
    for key in alice_vocab:
        if alice_vocab[key][1] == freq:
            print ("%d) %s – %s times" % (order, key, freq))
            order += 1
        

1)  – 2319 times
2) the – 1804 times
3) and – 912 times
4) to – 801 times
5) a – 684 times
6) of – 625 times
7) it – 541 times
8) she – 538 times
9) said – 462 times
10) you – 429 times


The most frequently occurring word, apart from the blank entry (the empty string produced when blank lines are split), is "the", which is also number 1 among the most common words according to the analysis of the Oxford English Corpus. Most of the top ten words are articles, conjunctions, prepositions, or pronouns.
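
The ranking can also be read directly from the dictionary, without flattening lengths and frequencies into one array. The sketch below assumes a small hand-made vocabulary in the same `{'word': [length, occurrence]}` shape, with frequencies copied from the output above; `top_words` is a hypothetical helper.

```python
def top_words(vocab, n):
    """Return the n most frequent (word, frequency) pairs, skipping the empty string."""
    pairs = [(word, info[1]) for word, info in vocab.items() if word]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


# Frequencies copied from the top-ten output above
vocab = {"": [0, 2319], "to": [2, 801], "the": [3, 1804], "and": [3, 912], "a": [1, 684]}

for rank, (word, freq) in enumerate(top_words(vocab, 3), start=1):
    print("%d) %s - %s times" % (rank, word, freq))
```

Sorting on the frequency element alone avoids any chance of a word length being mistaken for a frequency, and each word is printed once even when two words share a count.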

### Now, let's set up a procedure for Alice to look up a "closest word" from her vocab list.

In [9]:
def find_max_match(words_dict, user_word):
    """
    Counts matching letters for every word in words_dict whose length is within +/- 1 of user_word's
    and returns the best match, using frequency to break ties.
    
    INPUT : a dictionary, {'word' : [length of word, occurrence], ...}, and a string, 'word'
    OUTPUT: a string, 'word'    
    """
    
    closest_word = ""
    closest_match = 0
    
    SHORTEST = len(user_word) -1
    LONGEST = len(user_word) + 1

    for w in words_dict:
        matching_char = 0 # reset the matching_char value for every word candidate
#         proper_length = "nope"     # debugging purpose 
        
        ## count matching letter only when the candidate word's length is proper 
        if SHORTEST <= words_dict[w][0] <= LONGEST :
#             proper_length = "good"  # debugging purpose 
            
            ## check whether each letter in user_word exists in a word in words_dict
            for i in range(len(user_word)):
                if user_word[i] in w:
                    matching_char += 1
                    
            ## if we were doing like this instead of above, 
            ## this counts the number of matching letter in candidate word to user_word
            ## as discussed at the beginning, this draws not-so-ideal results
#             for i in range(len(w)):
#                 if w[i] in user_word:
#                     matching_char += 1
        
        ## update closest_match and closest_word if applicable(step 4)
        if matching_char > closest_match:
            closest_match = matching_char
            closest_word = w
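        ## if the new candidate word's matching_char is the same as the current closest_match, 
        ## select the higher freq one (step 5)
        ## note: closest_word starts as "", so this lookup relies on "" being a key of words_dict
        ## (it is one in alice_vocab, which has an entry for the empty string)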
        elif matching_char == closest_match: 
            if words_dict[w][1] > words_dict[closest_word][1]:
                closest_word = w
                
        ## debugging purpose
#         print ("Tested:", w, words_dict[w], proper_length, 
#                "–––– matches: " , closest_match, "word suggestion: ", closest_word)

    return closest_word
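
The steps can also be restated compactly. The sketch below is not the function above: `closest_word` is a hypothetical name, the frequencies in `vocab` are made up for illustration, and the early return for known words is an assumption that follows the rule stated at the start of this notebook (a known word is given back unchanged).

```python
def closest_word(vocab, given):
    """Suggest a word from vocab ({'word': [length, occurrence]}) for `given`."""
    if given in vocab:                      # a known word is returned unchanged
        return given

    best_word, best_key = "", (0, 0)
    for word, (length, freq) in vocab.items():
        if abs(length - len(given)) > 1:    # step 3: length within +/- 1
            continue
        matches = sum(1 for letter in given if letter in word)  # step 4
        # step 5: frequency breaks ties; words with no matching letter are ignored
        if matches and (matches, freq) > best_key:
            best_word, best_key = word, (matches, freq)
    return best_word


# Hypothetical frequencies, for illustration only
vocab = {"act": [3, 5], "been": [4, 20], "chai": [4, 1],
         "chip": [4, 2], "each": [4, 9], "map": [3, 3]}

print(closest_word(vocab, "cha"))  # each
print(closest_word(vocab, "map"))  # map
```

With these made-up frequencies, "each" beats "chai" for "cha" because both match 3 letters and "each" is more frequent.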

### Time to play word suggestion game with Alice!

In [10]:
def user_word_input():
    """
    When invoked, asks the user to enter a word. 
    Removes any special characters from the input and converts it to lowercase.
    
    INPUT : none 
    OUTPUT: a string, 'word'
    """
    import re
    
    # takes user-input
    raw_word = input("Which word would you like to ask Alice for a suggestion: ")      
    
    # checking whether the user gave valid input 
    not_valid = True
    
    while not_valid:
        if " " in raw_word:
            raw_word = input("Seems like it's not a single word. Please type a word: ") 
        else:
            # remove special characters (whitespace is kept) and convert to lower case
            cleaned = re.sub('[^A-Za-z0-9 ]+', '', raw_word).lower()            
            not_valid = False

    return cleaned  
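
The validation and cleaning can be separated from `input()` so that they are easy to try out on their own. This is a sketch under assumptions: `clean_word` is a hypothetical helper, and unlike the function above it strips surrounding whitespace before deciding whether the input is a single word.

```python
import re


def clean_word(raw):
    """Return the cleaned, lower-cased word, or None if `raw` is not a single word."""
    parts = raw.split()
    if len(parts) != 1:
        return None
    return re.sub(r"[^A-Za-z0-9]+", "", parts[0]).lower()


print(clean_word("Cha!"))        # cha
print(clean_word("two words"))   # None
```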

In [8]:
search_word = user_word_input()
print ("Alice here: How about '%s' instead of '%s'?" 
       % (find_max_match(alice_vocab, search_word), search_word))

Which word would you like to ask Alice for a suggestion: eerepcr rvler/b qlpnf
Seems like it's not a single word. Please type a word: eerepcr
Alice here: How about 'project' instead of 'eerepcr'?


In [9]:
search_word = user_word_input()
print ("Alice here: How about '%s' instead of '%s'?" 
       % (find_max_match(alice_vocab, search_word), search_word))

Which word would you like to ask Alice for a suggestion: cha
Alice here: How about 'each' instead of 'cha'?


In [10]:
search_word = user_word_input()
print ("Alice here: How about '%s' instead of '%s'?" 
       % (find_max_match(alice_vocab, search_word), search_word))

Which word would you like to ask Alice for a suggestion: tha
Alice here: How about 'that' instead of 'tha'?


Wondering how Alice got to the point of suggesting "that" for "tha"?  <br>
Her journey to that conclusion is pretty long, but here is a sneak peek of it.

![my_image](./suggestionsteps.png)

### <b style='color:purple'> – Part2 – </b>

### How about we widen the scope a little and ask Alice to suggest a better sentence instead of only a word?

For that, we need a user input function that can take a sentence.

In [17]:
def user_sentence_input():
    """
    When invoked, asks the user to enter a sentence. 
    Removes any special characters from the input and converts it to lowercase.
    
    INPUT : none 
    OUTPUT: a string, 'some sentence'
    """
    import re
    
    # takes user-input
    raw_word = input(">>>> Give Alice a sentence that you'd like for suggestion: \n")
    print ("–"*95)
    
    # remove special characters (whitespace is kept) and convert to lower case
    cleaned = re.sub('[^A-Za-z0-9 ]+', '', raw_word).lower()            

    return cleaned  

In [12]:
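def alice_suggestion(dictionary):
    """
    Asks the user for a sentence and replaces every word with its closest word from dictionary.
    Prints the total, unchanged, and changed word counts.
    
    INPUT : a dictionary of vocabulary words
    OUTPUT: a string, 'suggested sentence '
    """
    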
    # takes user_input here
    usinput = user_sentence_input()
    
    # process the string to a list
    usinput_listed = usinput.split(" ")

    # find each word's suggestion and reprocess it to a string
    result = ""
    changed = 0
    unchanged = 0
    total_words = len(usinput_listed)
    
    for search_word in usinput_listed:
        suggestion = find_max_match(dictionary, search_word)
        if search_word == suggestion:
            unchanged += 1
        else:
            changed += 1
        
        result += suggestion + " "
    
    # The count of changed and unchanged words is insensitive to case and special characters
    print ("Total Words: %d, Unchanged Words: %d, Changed Words: %d" % (total_words, unchanged, changed))        
    return result
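
The counting logic can be tried without typing anything in by passing the sentence and the suggestion function as arguments. In this sketch `suggest_sentence` and `fixes` are hypothetical, and the lambda is only a stand-in for `find_max_match`. It uses `split()` with no argument, so a trailing space does not add an extra empty "word" to the total the way `split(" ")` does.

```python
def suggest_sentence(sentence, suggest):
    """Apply `suggest` to every word; return (new_sentence, unchanged, changed)."""
    words = sentence.lower().split()        # split() ignores repeated/trailing spaces
    suggestions = [suggest(word) for word in words]
    unchanged = sum(1 for old, new in zip(words, suggestions) if old == new)
    return " ".join(suggestions), unchanged, len(words) - unchanged


# Hypothetical stand-in for find_max_match, for illustration only
fixes = {"adventure": "adventures"}
result = suggest_sentence("alices adventure in wonderland ", lambda w: fixes.get(w, w))
print(result)  # ('alices adventures in wonderland', 3, 1)
```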

In [18]:
print (">>>> Alice's Suggestion: \n", alice_suggestion(alice_vocab))

>>>> Give Alice a sentence that you'd like for suggestion: 
eerepcr rvler/b qlpnf
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Total Words: 3, Unchanged Words: 0, Changed Words: 3
>>>> Alice's Suggestion: 
 project lobster fallen 


In [19]:
print (">>>> Alice's Suggestion: \n", alice_suggestion(alice_vocab))

>>>> Give Alice a sentence that you'd like for suggestion: 
oUNec uPNOW a TVMWE
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Total Words: 4, Unchanged Words: 1, Changed Words: 3
>>>> Alice's Suggestion: 
 once upon a twelve 


In [20]:
# The count of changed and unchanged words is insensitive to case and special characters
print (">>>> Alice's Suggestion: \n", alice_suggestion(alice_vocab))

>>>> Give Alice a sentence that you'd like for suggestion: 
Alice's Adventure in Wonderland 
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Total Words: 5, Unchanged Words: 4, Changed Words: 1
>>>> Alice's Suggestion: 
 alices adventures in wonderland  


### Since a single sentence works, Alice should be able to handle multiple sentences too.

Let's give her the first paragraph (5 sentences) of the Wikipedia article about Alice's Adventures in Wonderland. 

source: https://en.wikipedia.org/wiki/Alice%27s_Adventures_in_Wonderland

In [21]:
print (">>>> Alice's Suggestion: \n", alice_suggestion(alice_vocab))

>>>> Give Alice a sentence that you'd like for suggestion: 
Alice's Adventures in Wonderland (commonly shortened to Alice in Wonderland) is an 1865 fantasy novel written by English mathematician Charles Lutwidge Dudgson under the pseudonym Lewis Carroll. It tells of a girl named Alice falling through a rabbit hole into a fantasy world populated by peculiar, anthropomorphic creatures. The tale plays with logic, giving the story lasting popularity with adults as well as with children.[1] It is considered to be one of the best examples of the literary nonsense genre.[1][2] Its narrative course and structure, characters and imagery have been enormously influential[2] in both popular culture and literature, especially in the fantasy genre.
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Total Words: 103, Unchanged Words: 57, Changed Words: 46
>>>> Alice's Suggestion: 
 alices adventures in wonderland complying directions to alice in wonderland

Multiple paragraphs are possible as well.

From the same Wikipedia page, I gave her the Background section, which consists of 5 paragraphs. 

In [22]:
print (">>>> Alice's Suggestion: \n", alice_suggestion(alice_vocab))

>>>> Give Alice a sentence that you'd like for suggestion: 
Alice was published in 1865, three years after Charles Lutwidge Dodgson and the Reverend Robinson Duckworth rowed a boat up the Isis on 4 July 1862[3] (this popular date of the "golden afternoon"[4] might be a confusion or even another Alice-tale, for that particular day was cool, cloudy, and rainy[5]) with the three young daughters of Henry Liddell (the Vice-Chancellor of Oxford University and Dean of Christ Church): Lorina Charlotte Liddell (aged 13, born 1849, "Prima" in the book's prefatory verse); Alice Pleasance Liddell (aged 10, born 1852, "Secunda" in the prefatory verse); Edith Mary Liddell (aged 8, born 1853, "Tertia" in the prefatory verse).[6]  The journey began at Folly Bridge in Oxford and ended 3 miles (5 km) north-west in the village of Godstow. During the trip, Dodgson told the girls a story that featured a bored little girl named Alice who goes looking for an adventure. The girls loved it, and Alice Liddell a

### <b style='color:purple'> – Part3 – </b>

### Let's extend this and play the word suggestion game with an English word master.

In Parts 1 and 2, Alice was giving the suggestions, and her vocabulary was limited to the words in the story of Alice's Adventures in Wonderland. Now, I'd like to play the game with a word master who knows 466k English words and get suggestions from him. 

466k English words source: https://github.com/dwyl/english-words

In [19]:
import json

with open("words_dictionary.json") as word_dict:
    word_master = json.load(word_dict)

To give you an idea, the word master's dictionary looks like below. <br>
Unlike Alice's dictionary, it does not have length and frequency info as the value.

![my_image](./wordmaster.png)

In [20]:
word_master['aaron']

1

In [21]:
'aaron' in word_master

True

We need to modify the find_max_match function a little because Word Master's dictionary has neither each word's length nor its frequency of appearance.

In [22]:
def find_max_match(words_dict, user_word):
    """
    Counts matching letters for every word in words_dict whose length is within +/- 1 of user_word's.
    
    INPUT : a dictionary, {'word' : 1, ...}, and a string, 'word'
    OUTPUT: a string, 'word'    
    """
    
    closest_word = ""
    closest_match = 0
    
    SHORTEST = len(user_word) -1
    LONGEST = len(user_word) + 1

    for w in words_dict:
        matching_char = 0 
        
        if SHORTEST <= len(w) <= LONGEST : # modified here
            
            for i in range(len(user_word)):
                if user_word[i] in w:
                    matching_char += 1

        if matching_char > closest_match:
            closest_match = matching_char
            closest_word = w

#         elif matching_char == closest_match: 
#             if words_dict[w][1] > words_dict[closest_word][1]:
#                 closest_word = w

    return closest_word

In [23]:
print (">>>> Word Master's Suggestion: \n", alice_suggestion(word_master))

>>>> Give Alice a sentence that you'd like for suggestion: 
eerepcr rvler/b qlpnf
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Total Words: 3, Unchanged Words: 0, Changed Words: 3
>>>> Word Master's Suggestion: 
 accepter barvel panfil 


In [24]:
print (">>>> Word Master's Suggestion: \n", alice_suggestion(word_master))

>>>> Give Alice a sentence that you'd like for suggestion: 
oUNec uPNOW a TVMWE
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Total Words: 4, Unchanged Words: 1, Changed Words: 3
>>>> Word Master's Suggestion: 
 bounce unplow a evomit 


Word Master's suggestions do not seem as good as Alice's. <br>
For example, when the input was "oUNec uPNOW a TVMWE", Alice's suggestion was closer to "once upon a time", whereas Word Master's suggestion seems a little far off from the original form.

The reasons might be the following:
1. When multiple words have an equal number of matching characters, the modified find_max_match function doesn't give priority to words that appear more frequently.
2. A word that became the closest_word at an early stage of the search is kept unless another word has a larger number of matching characters.
3. As discussed at the beginning of this document, the algorithm is not sensitive to the order of letters within a word.
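
The first two points can be illustrated by collecting every word that reaches the best score instead of keeping only the first one. This is a sketch under assumptions: `tied_candidates` is a hypothetical helper, and `words` is a tiny made-up list rather than the real 466k-word dictionary.

```python
def tied_candidates(words, given):
    """Return (best_count, [all words within +/- 1 length that reach it])."""
    best, tied = 0, []
    for word in words:
        if abs(len(word) - len(given)) > 1:
            continue
        matches = sum(1 for letter in given if letter in word)
        if matches > best:
            best, tied = matches, [word]
        elif matches == best and matches > 0:
            tied.append(word)
    return best, tied


# Tiny hypothetical word list; the real one has about 466k entries
words = ["bounce", "ounce", "once", "cone", "dunce"]
print(tied_candidates(words, "ounec"))  # (5, ['bounce', 'ounce'])
```

Without a frequency tie-breaker, the modified function keeps whichever tied word it meets first in iteration order, which is "bounce" in this list.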

I see these as points for improvement and hope to build an algorithm that addresses them later.

![my_image](./clock.png)