## Side notes 
_(code snippets, summaries, resources, etc.)_

__Further Reading:__
- [Kaggle project](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words)
    - If you're interested in the details of language processing, you might start with this project, which introduces a more detailed and standard approach to text processing, very different from what we cover here.

__Python code snippets:__
- Strip punctuation from String (from good [StackOverflow](http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python) answer)
```python
import string
s = "string. With. Punctuation?" # Sample string 
out = s.translate(string.maketrans("",""), string.punctuation)
```
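
The snippet above relies on `string.maketrans`, which exists only in Python 2. A minimal Python 3 equivalent, assuming the same sample string, passes the characters to delete as the third argument of `str.maketrans`:

```python
import string

s = "string. With. Punctuation?"  # same sample string as above

# str.maketrans(x, y, z): every character in z is mapped to None (deleted)
table = str.maketrans("", "", string.punctuation)
out = s.translate(table)
print(out)  # string With Punctuation
```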

# Mini-Project: Bayes with Natural Language Processing 

## Bad Handwriting Exposition

![Bad Handwriting Exposition](mini-project_bayes_natural_language_processing_images/exposition.png)

## Calculations

Assuming the above text is representative, calculate the following probabilities:
1. Finding the word "you" after "if":
    - $P(you|if) = \frac{P(if|you)P(you)}{P(if)}$
    - $P(you) = P(if) = \frac{1}{22}$
    - $\therefore$ $P(you|if) = P(if|you)$
    - The only "if" in the text is followed by "you" $\implies$ $P(you|if) = 1$
2. That a randomly selected word is "you":
    - $P(you) = \frac{1}{22} \approx 0.0454545$
3. That a randomly selected word is "if":
    - $P(if) = \frac{1}{22} \approx 0.0454545$
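
The probabilities above can be checked by counting single words and adjacent word pairs. The following is a minimal Python 3 sketch (the notebook cells themselves are Python 2). The helper names `p_word` and `p_next` are made up for illustration, and the sentence is assumed to be the one shown in the image.

```python
from collections import Counter
from fractions import Fraction

sentence = ("So if you could just go ahead and pack up your stuff "
            "and move it down there, that would be terrific, OK?")
words = sentence.split()

unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))  # (previous word, next word) pairs

def p_word(word):
    """P(word): share of all tokens equal to `word`."""
    return Fraction(unigrams[word], len(words))

def p_next(next_word, prev_word):
    """P(next_word | prev_word): share of `prev_word` tokens followed by `next_word`.

    Assumes `prev_word` occurs in the text and is not only its final token.
    """
    return Fraction(bigrams[(prev_word, next_word)], unigrams[prev_word])

print(len(words))            # number of tokens
print(p_word("you"))         # P(you)
print(p_word("if"))          # P(if)
print(p_next("you", "if"))   # P(you | if)
```

Using `Fraction` keeps the results exact, so they can be compared directly with the hand calculation.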

In [1]:
from collections import Counter
import string
from random import randint

In [2]:
sentence = '''
So if you could just go ahead and pack 
up your stuff and move it down there, 
that would be terrific, OK?
'''

s = "string. With. Punctuation?" # Sample string 
out = s.translate(string.maketrans("",""), string.punctuation)
words = sentence.split(' ')

cnt = Counter()
for word in words:
    cnt[word] += 1
print 'no. of all words: ', len(words)
print 'no. of unique words: ', len(cnt)
print 'no. most common words: ', cnt.most_common(2)
print
print 'no. of you\'s', cnt['you']
print 'no. of if\'s', cnt['if']

no. of all words:  22
no. of unique words:  21
no. most common words:  [('and', 2), ('there,', 1)]

no. of you's 1
no. of if's 1


## Maximum Likelihood

In [3]:
# not split into more readable lines
original_sample_memo = '''
Milt, we're gonna need to go ahead and move you downstairs into storage B. We have some new people coming in, and we need all the space we can get. So if you could just go ahead and pack up your stuff and move it down there, that would be terrific, OK?
Oh, and remember: next Friday... is Hawaiian shirt day. So, you know, if you want to, go ahead and wear a Hawaiian shirt and jeans.
Oh, oh, and I almost forgot. Ahh, I'm also gonna need you to go ahead and come in on Sunday, too...
Hello Peter, whats happening? Ummm, I'm gonna need you to go ahead and come in tomorrow. So if you could be here around 9 that would be great, mmmk... oh oh! and I almost forgot ahh, I'm also gonna need you to go ahead and come in on Sunday too, kay. We ahh lost some people this week and ah, we sorta need to play catch up.
'''

sample_memo = '''
Milt, we're gonna need to go ahead and move 
you downstairs into storage B. We have some 
new people coming in, and we need all the space 
we can get. So if you could just go ahead and 
pack up your stuff and move it down there, that 
would be terrific, OK?

Oh, and remember: next Friday... is Hawaiian 
shirt day. So, you know, if you want to, go ahead 
and wear a Hawaiian shirt and jeans.
Oh, oh, and I almost forgot. Ahh, I'm also gonna 
need you to go ahead and come in on Sunday, too...

Hello Peter, whats happening? Ummm, I'm gonna need 
you to go ahead and come in tomorrow. So if you 
could be here around 9 that would be great, mmmk... 
oh oh! and I almost forgot ahh, I'm also gonna need 
you to go ahead and come in on Sunday too, kay. We 
ahh lost some people this week and ah, we sorta need 
to play catch up.
'''

#
#   Maximum Likelihood Hypothesis
#
#
#   In this quiz we will find the maximum likelihood 
#   word based on the preceding word.
#
#   Fill in the NextWordProbability procedure so that 
#   it takes in sample text and a word,
#   and returns a dictionary with keys the set of 
#   words that come after, whose values are
#   the number of times the key comes after that word.
#   
#   Just use .split() to split the sample_memo text 
#   into words separated by spaces.

from collections import Counter
import string
from random import randint

def SampleTextPreprocessing(sampletext):
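    #!! Issue that punctuation within word like 'we're' is taken out too
    # this can cause problems because "we're" will be confused with "were"
    #!! Also: dropping '\n' fuses words across line breaks that have no
    # trailing space, e.g. "OK?" + "Oh," becomes "OKOh"
    s = sampletext
    s = s.replace('\n','')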
    s = s.translate(string.maketrans("",""), string.punctuation)
    words = s.split(' ')

    return words

def NextWordProbability(sampletext, word):
    words = SampleTextPreprocessing(sampletext)
    # collect every word that directly follows an occurrence of `word`
    next_words = [words[i+1] for i, w in enumerate(words[:-1]) if w == word]

    cnt = Counter()
    for w in next_words:
        cnt[w] += 1
    
    return dict(cnt)

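# Test code
# NOTE: `words` is the global left over from cell In [2] (the single sentence),
# so the random word is drawn from that sentence, not from sample_memo
num_words = len(words)
i = randint(0, num_words-1)
random_word = words[i]
print random_word
# print separates comma-separated items with a space, hence the spaces in ' word '
print "for randomly chosen word: \'", random_word, "\':"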
print
print len(NextWordProbability(sample_memo, random_word))
print NextWordProbability(sample_memo, random_word)


be
for randomly chosen word: ' be ':

3
{'terrific': 1, 'great': 1, 'here': 1}
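
`NextWordProbability` returns raw counts; the maximum-likelihood next word is then the key with the largest count. A minimal Python 3 sketch of that last step follows. The helper names `next_word_counts` and `most_likely_next` are hypothetical, the sample is one memo sentence with its punctuation dropped, and splitting is on whitespace as the quiz suggests.

```python
from collections import Counter

def next_word_counts(text, word):
    """Count the words that directly follow `word` in `text`."""
    words = text.split()  # splits on any whitespace, including newlines
    return Counter(nxt for prev, nxt in zip(words, words[1:]) if prev == word)

def most_likely_next(text, word):
    """Return the most frequent follower of `word`, or None if it has none."""
    counts = next_word_counts(text, word)
    return counts.most_common(1)[0][0] if counts else None

sample = "So if you could be here around 9 that would be great"
print(next_word_counts(sample, "be"))
print(most_likely_next(sample, "would"))
```

When several followers are tied, `most_common` returns the one encountered first, so the choice among ties is arbitrary rather than meaningful.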


## NLP Disclaimer

In the previous exercise, you may have thought of some ways we might want to clean up the text available to us.

For example, we would certainly want to remove punctuation, and generally want to make all strings lowercase for consistency. In most language processing tasks we will have a much larger corpus of data, and will want to remove certain features.
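
As a rough illustration of that clean-up, the Python 3 sketch below lowercases the text and extracts words with a regular expression, keeping apostrophes inside words so that "we're" is not confused with "were". The function name `normalize` and the choice of regular expression are assumptions made for this example, not part of the course material.

```python
import re

def normalize(text):
    """Lowercase `text` and return its words, keeping in-word apostrophes."""
    # [a-z0-9]+ matches a run of letters/digits; (?:'[a-z0-9]+)* allows
    # apostrophe-joined parts, so "we're" stays a single token
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())

print(normalize("Milt, we're gonna need to go ahead and move\nyou downstairs into storage B."))
```

Because the pattern matches words instead of splitting on spaces, line breaks and runs of punctuation need no special handling.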

Overall, just keep in mind that this mini-project is about Bayesian probability. If you're interested in the details of language processing, you might start with this [Kaggle project](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words), which introduces a more detailed and standard approach to text processing very different from what we cover here.

## Optimal Classifier Example

Method: Made a tree diagram and added up the probabilities for each word.

1. What word should you predict in the second blank?
    - Job
2. With what probability?
    - 0.4
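
The tree-diagram method amounts to summing over every word that could fill the first blank. In notation assumed here (not taken from the quiz), let $w_0$ be the known word, $w_1$ the first blank and $w_2$ the second blank, and assume each word depends only on the word immediately before it:

$$P(w_2 \mid w_0) = \sum_{w_1} P(w_1 \mid w_0)\, P(w_2 \mid w_1)$$

The predicted word is the $w_2$ for which this sum is largest.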

## Optimal Classifier Exercise

In [4]:
from collections import defaultdict


#------------------------------------------------------------------

#
#   Bayes Optimal Classifier
#
#   In this quiz we will compute the optimal label 
#   for a second missing word in a row
#   based on the possible words that could be in the 
#   first blank
#
#   Finish the procedure, LaterWords(), below
#
#   You may want to import your code from the previous 
#   programming exercise!
#

original_sample_memo = '''
Milt, we're gonna need to go ahead and move you downstairs into storage B. We have some new people coming in, and we need all the space we can get. So if you could just go ahead and pack up your stuff and move it down there, that would be terrific, OK?
Oh, and remember: next Friday... is Hawaiian shirt day. So, you know, if you want to, go ahead and wear a Hawaiian shirt and jeans.
Oh, oh, and I almost forgot. Ahh, I'm also gonna need you to go ahead and come in on Sunday, too...
Hello Peter, whats happening? Ummm, I'm gonna need you to go ahead and come in tomorrow. So if you could be here around 9 that would be great, mmmk... oh oh! and I almost forgot ahh, I'm also gonna need you to go ahead and come in on Sunday too, kay. We ahh lost some people this week and ah, we sorta need to play catch up.
'''

sample_memo = '''
Milt, we're gonna need to go ahead and move 
you downstairs into storage B. We have some 
new people coming in, and we need all the space 
we can get. So if you could just go ahead and 
pack up your stuff and move it down there, that 
would be terrific, OK?

Oh, and remember: next Friday... is Hawaiian 
shirt day. So, you know, if you want to, go ahead 
and wear a Hawaiian shirt and jeans.
Oh, oh, and I almost forgot. Ahh, I'm also gonna 
need you to go ahead and come in on Sunday, too...

Hello Peter, whats happening? Ummm, I'm gonna need 
you to go ahead and come in tomorrow. So if you 
could be here around 9 that would be great, mmmk... 
oh oh! and I almost forgot ahh, I'm also gonna need 
you to go ahead and come in on Sunday too, kay. We 
ahh lost some people this week and ah, we sorta need 
to play catch up.
'''

corrupted_memo = '''
Yeah, I'm gonna --- you to go ahead --- --- complain 
about this. Oh, and if you could --- --- and sit at 
the kids' table, that'd be --- 
'''

data_list = sample_memo.strip().split()

words_to_guess = ["ahead","could"]


# conditional probability tree used for later questions / debugging
cond_prob_tree = defaultdict(dict)

def NextWordProbabilityCalculated(sample, word):
    next_words = NextWordProbability(sample, word)
    n = sum(next_words.values())
    return {word: count/(n*1.) 
            for word, count in next_words.iteritems()}

def LaterWords(sample,word,distance):
    '''
    @param sample: a sample of text to draw from
    @param word: a word occurring before a corrupted sequence
    @param distance: how many words later to estimate 
        (i.e. 1 for the next word, 2 for the word after that)
    @returns: a single word which is the most likely possibility
    '''
    
    # TODO: Given a word, collect the relative probabilities 
    # of possible following words
    # from @sample. You may want to import your code from the 
    # maximum likelihood exercise.
    
    # TODO: Repeat the above process--for each distance beyond 1, 
    # evaluate the words that
    # might come after each word, and combine them weighting by 
    # relative probability
    # into an estimate of what might appear next.
    
    #if distance == 0:
        
    next_probs = NextWordProbabilityCalculated(sample, word)
    
    if distance == 1:
        return max(next_probs.iterkeys(), 
                   key=(lambda key: next_probs[key]))
    
    prev_probs = next_probs
    
    cond_probs = defaultdict(lambda : 0)
    
    global cond_prob_tree
    
    # run through each possibility for first blank
    for prev_word, prev_prob in prev_probs.iteritems():
        next_probs = NextWordProbabilityCalculated(sample, prev_word)
        
        for next_word, next_prob in next_probs.iteritems():
            cond_prob_tree[prev_word][next_word] = prev_prob * next_prob
            cond_probs[next_word] += prev_prob * next_prob
    return max(cond_probs.iterkeys(),
               key=(lambda key: cond_probs[key]))


for word in words_to_guess:
    print LaterWords(sample_memo, word, 2)
    
### NOTE: the result for 'could' differs from the quiz, which gives 'in'.
### Not sure what's going on, but the quiz says the output of my code
### ('come' and 'go') is correct.

come
go
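
`LaterWords` above handles distances 1 and 2. The same weighting idea extends to any distance by pushing a probability distribution forward one word at a time. The Python 3 sketch below shows this on a made-up toy text; the function names `transition_probs` and `later_word_distribution` are hypothetical and do not come from the quiz.

```python
from collections import Counter, defaultdict

def transition_probs(words):
    """Map each word to {next word: P(next word | word)} from bigram counts."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return {prev: {w: c / sum(followers.values()) for w, c in followers.items()}
            for prev, followers in counts.items()}

def later_word_distribution(words, word, distance):
    """Distribution over the word `distance` steps after `word`."""
    probs = transition_probs(words)
    current = {word: 1.0}  # distance 0: the word itself, with certainty
    for _ in range(distance):
        following = defaultdict(float)
        for prev, p_prev in current.items():
            # weight each follower by the probability of reaching `prev`
            for nxt, p_next in probs.get(prev, {}).items():
                following[nxt] += p_prev * p_next
        current = dict(following)
    return current

toy = "a b c a b d a e c".split()  # hypothetical toy text
dist = later_word_distribution(toy, "a", 2)
print(dist)
print(max(dist, key=dist.get))  # most likely word two steps after "a"
```

Words with no observed follower simply drop out, so the returned probabilities can sum to less than 1 on short texts.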


## Which Words Meditation

1. What set of words in a memo do you think could help predict what a missing word might be?
    - The words in the sample text that come after the word preceding the missing word in the corrupted text.

2. What are some advantages and disadvantages of using fewer possible influences in prediction?
    - I will interpret "more possible influences" as looking at words farther back than just the one preceding word for each instance in the sample text.
    - Advantages: Given enough training data, looking at the n words preceding the missing word would give higher accuracy, because whole phrases would be taken into account, resulting in richer data and a more reliable prediction.
    - Disadvantages: If there is not enough data, the drastic reduction in candidate words for a missing word would cause overfitting. The data would be richer, but there would not be enough of it to generalize to unseen text.
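
The trade-off can be made concrete by counting how many observations back each context as the context grows. The Python 3 sketch below uses two memo sentences with punctuation dropped; the helper name `context_table` is an assumption made for this example.

```python
from collections import Counter, defaultdict

def context_table(words, n):
    """Map each n-word context (a tuple) to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(words) - n):
        context = tuple(words[i:i + n])
        table[context][words[i + n]] += 1
    return table

text = ("I'm gonna need you to go ahead and come in tomorrow "
        "I'm also gonna need you to go ahead and come in on Sunday").split()

for n in (1, 2, 3):
    table = context_table(text, n)
    observations = sum(sum(c.values()) for c in table.values())
    # columns: context length, distinct contexts, average observations per context
    print(n, len(table), observations / len(table))
```

Longer contexts are more specific, but each one is backed by fewer examples, which is where the overfitting risk comes from.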

## Joint Distribution Analysis

If you wanted to measure the joint probability distribution of a missing word given its position relative to every other word in the document, how many probabilities would you need to measure? Say the document is $N$ words long.
- $N^2$ number of probabilities
- "When we only measured the likelihood of the word before, we had only N computations. Including regular expressions, the maximum amount of information we could use is super-exponential in the vocabulary size!"
- My attempt at an explanation:
    - Assuming each word occurs independently of the others (Naive Bayes), the Conditional Independence Rule would allow us to ignore all words in the prior probability
    - i.e. $P(x|y) = P(x|y_{1}, y_{2}, \ldots, y_{n})$
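
One possible reading of the $N^2$ answer (an interpretation, not something stated in the quiz): the missing word can sit at any of the $N$ positions, and for each position there are $N - 1$ other words whose relative position matters, so the number of pairwise probabilities is

$$N \times (N - 1) = N^2 - N \approx N^2 \quad \text{for large } N$$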

## Domain Knowledge Quiz

Given the corpus of text we have from our boss, we might like to identify some things he often says, and use that knowledge to make better predictions.

What are some statements you see arising multiple times? (see sample_memo above)
- "gonna need to go ahead and",
- "gonna need you to go ahead and",
- "so if you could … that would be (great, ok)",
- "oh, and I almost forgot"

In [7]:
### Code attempts to extract most common phrases 
### in groups of 3 from sample text.
### Needs work somewhere (maybe in code from previous questions)

best_phrases = []
words_hit = []
words_skipped = []

for word_0 in data_list[2:]:
    try:
        words_hit.append(LaterWords(sample_memo, word_0 , 2))
    except ValueError as e:
        words_skipped.append(word_0)
  
    # contains most likely possible 2nd missing words
    # for each possible 1st missing words
    best_subphrases = {} 
    
    for missing_1st, missing_2nds in cond_prob_tree.iteritems():
        best_missing_2nd = max(missing_2nds.iterkeys(),
                               key=(lambda key: missing_2nds[key]))
        subphrase = missing_1st + ' ' + best_missing_2nd
        best_subphrases[subphrase] = missing_2nds[best_missing_2nd]
    
    part_best_phrase = max(best_subphrases.iterkeys(),
                           key=(lambda key: best_subphrases[key]))
    best_phrase = word_0 + ' ' + part_best_phrase
    best_phrase_prob = best_subphrases[part_best_phrase]
    best_phrases.append(tuple([best_phrase, best_phrase_prob]))

best_phrases.sort(key=(lambda item: item[1]))
    
print len(data_list)
print len(words_hit)
print len(words_skipped)
len(best_phrases)
for phrase, prob in best_phrases:
    print phrase, ' '*(20-len(phrase)) , prob

165
127
36
gonna just go         0.5
need just go          0.5
to go ahead           0.833333333333
go ahead and          1.0
ahead ahead and       1.0
and ahead and         1.0
move ahead and        1.0
you ahead and         1.0
downstairs into storage  1.0
into into storage     1.0
storage into storage  1.0
B. into storage       1.0
We into storage       1.0
have B We             1.0
some B We             1.0
new B We              1.0
people B We           1.0
coming into storage   1.0
in, into storage      1.0
and into storage      1.0
we into storage       1.0
need into storage     1.0
all the space         1.0
the into storage      1.0
space into storage    1.0
we into storage       1.0
can into storage      1.0
get. into storage     1.0
So into storage       1.0
if into storage       1.0
you into storage      1.0
could into storage    1.0
just into storage     1.0
go into storage       1.0
ahead into storage    1.0
and into storage      1.0
pack into storage     1.0
up into stora
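
The cell above is marked as needing work. A more direct way to surface recurring phrases, offered here as a Python 3 sketch and not as a fix for that cell, is to count n-grams with a sliding window. The function name `most_common_ngrams` is hypothetical, and the sample is two memo sentences with punctuation dropped.

```python
from collections import Counter

def most_common_ngrams(words, n, top=3):
    """Return the `top` most frequent n-word phrases with their counts."""
    # zipping the list against shifted copies of itself yields a sliding window
    ngrams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(g) for g in ngrams).most_common(top)

text = ("I'm gonna need you to go ahead and come in tomorrow "
        "I'm also gonna need you to go ahead and come in on Sunday").split()

for phrase, count in most_common_ngrams(text, 3):
    print(phrase, count)
```

Raising `n` picks out longer stock phrases such as "gonna need you to go ahead and", at the cost of fewer repeats per phrase.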

## Domain Knowledge Fill In

![domain knowledge fill in](mini-project_bayes_natural_language_processing_images/domain_knowledge_fill_in.png)


"If we could encode this all in our classifier, we'd be able to achieve very high levels of accuracy!"