## Alice's Adventures in <b style='color:green'>Word</b>land 
<br>
![my_image](./aliceimage.png)

Alice likes playing a word game. When she is given a word, she suggests the word from her vocabulary that she thinks is closest to it. Her vocabulary is limited to the words in the story _"Alice's Adventures in Wonderland"_. 

When she knows the given word, she gives the same word back as her suggestion. If she doesn't know it, she finds the "closest word" using the following rules:
    
1) ignore punctuation such as the comma (,), period (.), colon (:), exclamation point (!), etc. <br>
2) treat 'A' and 'a' as the same character; matching is not case-sensitive <br>
3) consider only words whose length is within +/- 1 letter of the given word <br>
4) pick the one that has the largest number of matching letters <br>
5) if two or more words have the same number of matching letters, pick the one that occurs more often in the story <br>
             
          [ for example ]
            Let's say she is given the word "cha" for a word suggestion. 
            In her vocabulary, the words that are 2 to 4 letters long are, for example, act, been, chai, chip, hhhh, and map.
              
            If we compare "cha" to "act", 'c' and 'a' from "cha" exist in "act".
            Therefore:
                        cha(2) --> act
                        cha(0) --> been
                        cha(3) --> chai
                        cha(2) --> chip
                        cha(1) --> hhhh
                        cha(1) --> map
                            ↳ The number in parentheses is the number of matching letters.
                               In other words, one letter in "cha" is in "map", for example.

* Note that __Alice is checking whether each letter in the given word can be found in the words of her vocabulary__. This reduces the chance of getting a not-so-ideal suggestion from Alice. If we instead counted, in reverse, how many letters of Alice's word appear in the given word, the number of matching letters for "hhhh" would be 4 instead of 1, since all four letters of "hhhh" exist in the given word "cha".
   
                        act(2)  --> cha
                        been(0) --> cha
                        chai(3) --> cha
                        chip(2) --> cha
                        hhhh(4) --> cha
                        map(1)  --> cha          
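
The two counting directions above can be checked with a small sketch. The helper names `count_forward` and `count_reverse` are made up for this illustration and are not part of the project code.

```python
def count_forward(given, vocab_word):
    """Count the letters of the given word that appear in the vocabulary word."""
    return sum(1 for letter in given if letter in vocab_word)


def count_reverse(given, vocab_word):
    """Count the letters of the vocabulary word that appear in the given word."""
    return sum(1 for letter in vocab_word if letter in given)


# the example words from the text above
candidates = ["act", "been", "chai", "chip", "hhhh", "map"]
forward = {w: count_forward("cha", w) for w in candidates}
reverse = {w: count_reverse("cha", w) for w in candidates}

print(forward)  # {'act': 2, 'been': 0, 'chai': 3, 'chip': 2, 'hhhh': 1, 'map': 1}
print(reverse)  # {'act': 2, 'been': 0, 'chai': 3, 'chip': 2, 'hhhh': 4, 'map': 1}
```

Only the score of "hhhh" changes between the two directions, which is why Alice counts from the given word's side.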

![my_image](./cat.png)

What do you think about the last two steps of her approach to finding the closest word? Should one take precedence over the other?

Which word would you like to get as a suggestion, "that" or "the", when you give her "tha"? Following the current steps, Alice will give you "that" since it has 3 matching letters compared to 2 matching letters in "the". However, one might argue that "the" would be the better suggestion since we use it more often in sentences. On the other hand, another might say that "tha" could be on the way to the full word "that", as if we prematurely pressed Enter on the keyboard before we finished typing. 

Let's say the word "bel" is given to Alice. In her vocabulary, limited to this example, the words with a length of 2 to 4 are listed below. The number of matching letters and the word's frequency in the vocabulary are listed respectively as {# of matching letters, frequency}. Please note that, as discussed above, we are counting how many letters of the given word appear in each of Alice's vocabulary words.

                        bel --> be   {2, 1042}
                        bel --> been {2, 355}
                        bel --> eel  {2, 8}
                        bel --> bell {3, 109}
                        bel --> belt {3, 76}

If Alice gave precedence to the most frequent word, "be" would get picked even though two other words have more matching letters and seem like better suggestions. In fact, "be" is the second most frequently appearing word in written English according to an analysis of the Oxford English Corpus (https://en.wikipedia.org/wiki/Most_common_words_in_English). Alice would therefore give us "the", "be", "to", etc. as the suggestion for any given word that looks similar to them. This would limit her suggestion pool to the most frequent words in written English, as long as they are in her vocabulary. 

For this reason, __Alice will consider word frequency only when multiple words in her vocabulary have the same number of matching letters with the given word__. 
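
As a rough illustration of this choice, the "bel" candidates above can be ranked both ways. The dictionary below just copies the {# of matching letters, frequency} pairs from the example; the variable names are made up for this sketch.

```python
# {word: (number of matching letters, frequency)} copied from the "bel" example
candidates = {
    "be":   (2, 1042),
    "been": (2, 355),
    "eel":  (2, 8),
    "bell": (3, 109),
    "belt": (3, 76),
}

# Alice's rule: most matching letters first, frequency only as a tie-breaker.
# Tuples compare element by element, so (3, 109) beats (3, 76) and (2, 1042).
matches_first = max(candidates, key=lambda w: candidates[w])

# the alternative discussed above: frequency takes precedence
frequency_first = max(candidates, key=lambda w: candidates[w][1])

print(matches_first)    # bell
print(frequency_first)  # be
```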

![my_image](./rabbitclock.png)

_Another thing about her current approach is that it's not sensitive to the order of letters. As a result, she will think "each" is a good suggestion for the given word "cha" rather than "chai" or "chat" if "each" appears more frequently than the other two. We will improve this in the second part of this project. For now, we will implement something that gives a decent suggestion._

<br>
_This project is similar to word auto-correction, but it assumes that the English words in this world are limited to the words from the story._
<br>
<br>

text file source: https://archive.org/stream/alicesadventures19033gut/19033.txt <br>
image source: https://blog.whsmith.co.uk/alices-adventures-in-wonderland-free-colouring-downloads/

#### Let's first construct Alice's vocabulary from Alice's Adventures in Wonderland story 

In [29]:
import re

def story_to_sentences(text_file):
    """
    Takes a text file (the whole story) and puts each line into a list, in lower case
    and without punctuation or special characters.
    
    INPUT : path to a text file
    OUTPUT: a list, ['sentence', 'sentence', 'sentence', ...]
    """
    ## Open the text file
    with open(text_file) as alice_story:  
        ## [List] one big list in which each item is a single line of the file;
        ## splitlines() splits on \n and does not keep \n as part of each string
        story_lines = alice_story.read().splitlines() # length: 3736
        
    ## [List] put each line into a new list,
    ## converted to lowercase and with all special characters removed
    cleaned_lines = []                            # length: 3736
    for sentence in story_lines:
        # keep the whitespace between words in each sentence
        cleaned_lines.append(re.sub('[^A-Za-z0-9 ]+', '', sentence).lower()) 
            
    return cleaned_lines
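
To see what the cleaning step does to a single line, here is a small sketch that uses the same regular expression. The sample line is made up for the illustration and is not taken from the text file.

```python
import re

# a made-up line with an apostrophe, a comma and an exclamation point
sample_line = "Alice's Adventures, in Wonderland!"

# same pattern as in story_to_sentences: keep only letters, digits and spaces
cleaned = re.sub('[^A-Za-z0-9 ]+', '', sample_line).lower()

print(cleaned)  # alices adventures in wonderland
```

Note that the apostrophe is dropped rather than replaced by a space, so "Alice's" becomes the single word "alices".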

After running the story_to_sentences function, a text file partially looks like this at story_lines level.

![my_image](./story_lines.png)

and at cleaned_lines level, it looks like below.

![my_image](./cleaned_lines.png)

In [34]:
def list_to_dict(sentences_lst):
    """
    Takes a list of sentences and goes through every word in each sentence.
    Creates a dictionary with the word as the key and the word's length and frequency in the story as the value.
    
    INPUT : a list of sentences, ['sentence', 'sentence', ...]
    OUTPUT: a dictionary, {'word' : [length of word, occurrence], ... }
    """
    ## {Dictionary} split each sentence into single words and put them in a dictionary
    words_dict = {}                               # length: 3254
    for sentence in sentences_lst:
        ## [List] "words" is a list of the words in one sentence; there will be 3736 lists.
        ## split(" ") keeps empty strings for blank lines and repeated spaces,
        ## so '' also ends up as a key in words_dict
        words = sentence.split(" ")  
        for w in words:
            if w in words_dict: 
                # when it already exists in words_dict, just increment the freq
                words_dict[w][1] += 1
            else:               
                # add a new entry in words_dict with word length and freq info
                words_dict[w] = [len(w), 1]

    return words_dict
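
For comparison, here is a minimal sketch of the same counting idea using `collections.Counter` and `split()` without an argument. The function name `build_vocab` and the sample sentences are made up for this illustration; the rest of this notebook keeps using `list_to_dict`.

```python
from collections import Counter


def build_vocab(sentences):
    """Build {'word': [length of word, occurrence]} from a list of cleaned sentences."""
    counts = Counter()
    for sentence in sentences:
        # split() with no argument skips empty strings, unlike split(" ")
        counts.update(sentence.split())
    return {word: [len(word), freq] for word, freq in counts.items()}


# the difference between the two ways of splitting
print("down  the rabbithole".split(" "))  # ['down', '', 'the', 'rabbithole']
print("down  the rabbithole".split())     # ['down', 'the', 'rabbithole']

sample = ["alice was  beginning", "", "alice was tired"]
vocab = build_vocab(sample)
print(vocab)
```

With this variant the empty string never becomes a key, so the vocabulary only contains real words.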
    

In [35]:
sentences = story_to_sentences("aliceText.txt")

In [38]:
alice_vocab = list_to_dict(sentences)

Below is a glimpse of what Alice's vocabulary dictionary looks like after processing the text file.

![my_image](./alicevoca.png)

In [9]:
alice_vocab = list_to_dict(story_to_sentences("aliceText.txt"))
print (alice_vocab['remarkable'])

[10, 2]


#### Now she needs to set up a procedure to look up the "closest word" in her vocabulary.

In [25]:
def max_match(words_dict, user_word):
    """
    Counts, for every word in words_dict whose length is within +/- 1 of user_word,
    how many letters of user_word appear in that word, and returns the best match.
    
    INPUT : a dictionary, {'word' : [length of word, occurrence]}, and a string, 'word'
    OUTPUT: a string, 'word'    
    """
    
    closest_word = ""
    closest_match = 0

    for w in words_dict:
        ## reset the value of matching_char for every word in words_dict
        matching_char = 0
        proper_length = False     # for debugging 
        ## pick a word in words_dict 1 letter shorter or longer than the user_word
        if (len(user_word) - 1) <= words_dict[w][0] <= (len(user_word) + 1):
            proper_length = True  # for debugging 
            ## check whether each letter in user_word exists in a word in words_dict
            for letter in user_word:
                if letter in w:
                    matching_char += 1
                    
            ## if we did the following instead of the above, 
            ## it would count the letters of a word in words_dict that appear in user_word;
            ## as discussed at the beginning, this gives not-so-ideal results
#             for letter in w:
#                 if letter in user_word:
#                     matching_char += 1
        
            ## update closest_match and closest_word if applicable;
            ## only words of a proper length are candidates
            if matching_char > closest_match:
                closest_match = matching_char
                closest_word = w
            ## if the new candidate's matching_char equals the current closest_match, select the higher freq one
            ## (closest_word stays "" until some word has at least one matching letter)
            elif matching_char == closest_match and closest_word: 
                if words_dict[w][1] > words_dict[closest_word][1]:
                    closest_word = w
                
        ## debugging purpose         
        print ("tested:", w, words_dict[w], proper_length, " -----  status:", closest_match, closest_word)

    return closest_word    
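
The five rules from the beginning can also be put together in a compact form. The sketch below is only an illustration under a few assumptions: the function name `suggest` and the toy vocabulary are made up (the numbers come from the "bel" example), and returning `None` when nothing matches is a choice made here, not something the notebook defines.

```python
import re


def suggest(vocab, given):
    """Return the given word if it is known, else the closest vocabulary word (or None)."""
    # rules 1 and 2: drop punctuation and ignore case
    word = re.sub('[^a-z0-9]+', '', given.lower())
    if word in vocab:
        return word  # she knows the word, so she gives it back

    # rule 3: only words within +/- 1 letter of the given word
    candidates = [w for w in vocab if abs(len(w) - len(word)) <= 1]

    # rules 4 and 5: most matching letters first, frequency as the tie-breaker
    def score(w):
        matches = sum(1 for letter in word if letter in w)
        return (matches, vocab[w][1])

    best = max(candidates, key=score, default=None)
    if best is None or score(best)[0] == 0:
        return None  # nothing in the vocabulary shares a letter
    return best


# toy vocabulary in the same {'word': [length, frequency]} shape as alice_vocab
toy_vocab = {"be": [2, 1042], "been": [4, 355], "eel": [3, 8],
             "bell": [4, 109], "belt": [4, 76]}

print(suggest(toy_vocab, "bel"))   # bell
print(suggest(toy_vocab, "Eel!"))  # eel
```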

#### Let's play the word suggestion game with Alice.

In [72]:
search_word = input("Which word would you like to ask her? ")
print ("Alice here, how about '%s' instead of '%s'?" % (max_match(alice_vocab, search_word), search_word))

Which word would you like to ask her? tha
Alice here, how about 'that' instead of 'tha'?


In [5]:
search_word = input("Which word would you like to ask her? ")
print ("Alice here, how about '%s' instead of '%s'?" % (max_match(alice_vocab, search_word), search_word))

Which word would you like to ask her? alic
Alice here, how about 'alice' instead of 'alic'?


In [28]:
search_word = input("Which word would you like to ask her? ")
print ("Alice here, how about '%s' instead of '%s'?" % (max_match(alice_vocab, search_word), search_word))

Which word would you like to ask her? bel
tested: project [7, 87] False  -----  status: 0 
tested: gutenbergs [10, 2] False  -----  status: 0 
tested: alices [6, 17] False  -----  status: 0 
tested: adventures [10, 11] False  -----  status: 0 
tested: in [2, 428] True  -----  status: 0 
tested: wonderland [10, 8] False  -----  status: 0 
tested: by [2, 76] True  -----  status: 1 by
tested: lewis [5, 4] False  -----  status: 1 by
tested: carroll [7, 4] False  -----  status: 1 by
tested:  [0, 2319] False  -----  status: 1 by
tested: this [4, 181] True  -----  status: 1 by
tested: ebook [5, 9] False  -----  status: 1 by
tested: is [2, 128] True  -----  status: 1 by
tested: for [3, 179] True  -----  status: 1 by
tested: the [3, 1804] True  -----  status: 1 the
tested: use [3, 29] True  -----  status: 1 the
tested: of [2, 625] True  -----  status: 1 the
tested: anyone [6, 5] False  -----  status: 1 the
tested: anywhere [8, 3] False  -----  status: 1 the
tested: at [2, 224] True  -----  stat

In [22]:
### TODO: add these messages:
# "Alice here, I know this word, _______!"        --> when she knows the word
# "Alice here, how about '%s' instead of '%s'?"   --> when she does not know the word

#### The output below shows her journey to that conclusion. 

In [75]:
search_word = "tha"
max_match(alice_vocab, search_word)

tested: project [7, 87] False  -----  status: 0 
tested: gutenbergs [10, 2] False  -----  status: 0 
tested: alices [6, 17] False  -----  status: 0 
tested: adventures [10, 11] False  -----  status: 0 
tested: in [2, 428] True  -----  status: 0 
tested: wonderland [10, 8] False  -----  status: 0 
tested: by [2, 76] True  -----  status: 0 
tested: lewis [5, 4] False  -----  status: 0 
tested: carroll [7, 4] False  -----  status: 0 
tested:  [0, 2319] False  -----  status: 0 
tested: this [4, 181] True  -----  status: 2 this
tested: ebook [5, 9] False  -----  status: 2 this
tested: is [2, 128] True  -----  status: 2 this
tested: for [3, 179] True  -----  status: 2 this
tested: the [3, 1804] True  -----  status: 2 the
tested: use [3, 29] True  -----  status: 2 the
tested: of [2, 625] True  -----  status: 2 the
tested: anyone [6, 5] False  -----  status: 2 the
tested: anywhere [8, 3] False  -----  status: 2 the
tested: at [2, 224] True  -----  status: 2 the
tested: no [2, 97] True  -----  

'that'

In [10]:
search_word = "proect"
max_match(alice_vocab, search_word)

tested: project [7, 87] True  -----  status: 6 project
tested: gutenbergs [10, 2] False  -----  status: 6 project
tested: alices [6, 17] True  -----  status: 6 project
tested: adventures [10, 11] False  -----  status: 6 project
tested: in [2, 428] False  -----  status: 6 project
tested: wonderland [10, 8] False  -----  status: 6 project
tested: by [2, 76] False  -----  status: 6 project
tested: lewis [5, 4] True  -----  status: 6 project
tested: carroll [7, 4] True  -----  status: 6 project
tested:  [0, 2319] False  -----  status: 6 project
tested: this [4, 181] False  -----  status: 6 project
tested: ebook [5, 9] True  -----  status: 6 project
tested: is [2, 128] False  -----  status: 6 project
tested: for [3, 179] False  -----  status: 6 project
tested: the [3, 1804] False  -----  status: 6 project
tested: use [3, 29] False  -----  status: 6 project
tested: of [2, 625] False  -----  status: 6 project
tested: anyone [6, 5] True  -----  status: 6 project
tested: anywhere [8, 3] False  

'project'

This works decently for words that are one letter away.
But "propject" --> "pepper" because there are more "p"s and therefore a higher matching_char score. 
To fix this error, I will use a different method.



If the user types "tha",
the options are the, than, that... 
Since the most frequently occurring one is "the", it will return "the".

flaw:
projec --> project
proeec --> pepper



   - Since she counts only the number of matching letters, regardless of each letter's position within a word or how many times each letter is present (i.e. how many 'h' are in the given word), this has the flaw of getting "hhhh" as the closest word to "cha". 
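
One possible way to take the number of occurrences of each letter into account is to compare letter counts instead of single letters. The sketch below uses `collections.Counter` for that. It is only an illustration of the idea, the helper name `shared_letters` is made up, and it is not necessarily the method used in the second part of this project.

```python
from collections import Counter


def shared_letters(word_a, word_b):
    """Count the letters two words share, where each letter counts only as often as it occurs in both."""
    common = Counter(word_a) & Counter(word_b)  # per letter, the smaller of the two counts
    return sum(common.values())


print(shared_letters("cha", "hhhh"))        # 1, only one 'h' can be paired up
print(shared_letters("cha", "chai"))        # 3
print(shared_letters("proeec", "pepper"))   # 4
print(shared_letters("proeec", "project"))  # 5
```

Because the count is the same in both directions, "hhhh" no longer gets a score of 4 against "cha", and "project" scores higher than "pepper" for "proeec". It still ignores the order of the letters.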

## Final project

- HW6 - Exercise 6
    Recall the similarity function to compute the edit distance in homework 7 and classwork 7(https://github.com/cis024c/fall2017classwork/blob/master/week7/word_similarity.ipynb).

    Create a Python module using the similarity function. Write a Python program to invoke the similarity function on any word entered by the user.
    
    solution: https://github.com/cis024c/fall2017hwsolutions/blob/master/hw8/hw8.ipynb
    
    
- Final Project Guidance
https://github.com/cis024c/projects/blob/master/spellcheck/spell-checker.ipynb

Week 9 lecture notes (pg 33)
 "Accept a paragraph of input" and "Spell correct the input and output the corrected paragraph."
 
 
dorairajsanjay [9:40 AM] 
project idea that involves building a spell checker of the entire set of possible words using the edit distance measure. this is very similar to what we did in class. we will however need to use more than just the alice in wonderland text and include all possible words from here - https://github.com/dwyl/english-words.  The program will accept a paragraph from the user and suggest corrections.
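
For reference, here is a minimal sketch of an edit distance function (the classic Levenshtein distance computed row by row). It is written from scratch for this note and is not the `similarity` function from the homework or classwork linked above; the name `edit_distance` is an assumption.

```python
def edit_distance(a, b):
    """Smallest number of single-letter insertions, deletions and substitutions that turn a into b."""
    # prev[j] is the distance between the first i-1 letters of a and the first j letters of b
    prev = list(range(len(b) + 1))
    for i, letter_a in enumerate(a, start=1):
        curr = [i]  # turning the first i letters of a into "" takes i deletions
        for j, letter_b in enumerate(b, start=1):
            cost = 0 if letter_a == letter_b else 1
            curr.append(min(prev[j] + 1,          # delete letter_a
                            curr[j - 1] + 1,      # insert letter_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(edit_distance("tha", "that"))        # 1
print(edit_distance("tha", "the"))         # 1
print(edit_distance("proect", "project"))  # 1
print(edit_distance("kitten", "sitting"))  # 3
```

Unlike counting matching letters, this measure is sensitive to the order of letters. Note that "that" and "the" are both one edit away from "tha", so word frequency could still serve as a tie-breaker.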

In [98]:
# TODO: how to access only the story part?
# the story begins at line 40


import re

# read the whole file into a list of lines; the with block closes the file afterwards
with open("aliceText.txt", "r") as alice_story:
    story_lines = alice_story.read().splitlines() 
# print ("story_lines len", len(story_lines))



cleaned_lines len 3736


![my_image](./clock.png)