
# Practice Problems
## Lesson 5: Naive Bayes
---
Created by Terron Ishihara

### Problem 1 - Probability

Suppose we have two bags containing four types of gems: diamond, garnet, amethyst, and pearl (which may ring a bell for any [Steven Universe](https://en.wikipedia.org/wiki/Steven_Universe) fans). We'll call these Bag A and Bag B. The event of randomly selecting a diamond from either bag is denoted with a capital D. Likewise, G denotes garnet, A denotes amethyst, and P denotes pearl.

> Each bag contains the following number of each gem:

| Bag A | Bag B |
|------------|------------------|
| 2 diamonds | 0 diamonds |
| 4 garnets | 6 garnets |
| 1 amethyst | 3 amethysts |
| 0 pearls | 2 pearls |

> **For each of the following questions, be sure to answer using the proper notation for probabilities.**

> a. What is the sample space of these bags?

The sample space is the set of possible outcomes of a draw, not the number of gems in the bag (7 gems in Bag A, 11 in Bag B).


> A = {diamond, garnet, amethyst}

> B = {garnet, amethyst, pearl}

> b. What is the probability of drawing a diamond from Bag A? What is the probability of drawing an amethyst from Bag B? What is the probability of drawing an amethyst or a garnet from Bag A?

> P(D | Bag A) = 2/7

> P(A | Bag B) = 3/11

> P(A or G | Bag A) = 5/7

> c. What is the probability of drawing a garnet from Bag A and drawing a pearl from Bag B? Are these events independent?

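These two events are independent (not mutually exclusive): drawing from Bag A has no effect on what is drawn from Bag B. P(G | Bag A) = 4/7 and P(P | Bag B) = 2/11, so P(G from A and P from B) = 4/7 * 2/11 = 8/77.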
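
The arithmetic in parts b and c can be checked with exact fractions. This is only a sketch: the dictionaries `bag_a` and `bag_b` and the helper `p` are names made up for this illustration, with counts copied from the table above.

```python
from fractions import Fraction

# Gem counts copied from the table in the problem statement
bag_a = {"diamond": 2, "garnet": 4, "amethyst": 1, "pearl": 0}
bag_b = {"diamond": 0, "garnet": 6, "amethyst": 3, "pearl": 2}

def p(bag, *gems):
    """Probability of drawing any one of `gems` in a single draw from `bag`."""
    total = sum(bag.values())
    return Fraction(sum(bag[g] for g in gems), total)

p_d_given_a = p(bag_a, "diamond")                   # P(D | Bag A)
p_a_given_b = p(bag_b, "amethyst")                  # P(A | Bag B)
p_a_or_g_given_a = p(bag_a, "amethyst", "garnet")   # P(A or G | Bag A)

# Draws from different bags are independent, so the probabilities multiply
p_g_and_p = p(bag_a, "garnet") * p(bag_b, "pearl")

print(p_d_given_a, p_a_given_b, p_a_or_g_given_a, p_g_and_p)
```

Using `Fraction` instead of floating-point division keeps the results in the same form as the hand-computed answers.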

> d. What is the probability of drawing two diamonds from Bag B? That is, drawing a diamond, then drawing a second diamond without replacing the first? Are these events independent?

There are no diamonds in Bag B, so P(D, then D | Bag B) = 0. In general, draws without replacement are not independent, because removing the first gem changes the probabilities for the second draw.

> e. What are the union and intersection of Bag A and Bag B? Provide an example of an event which produces an empty set of outcomes.

> A union B = {diamond, garnet, amethyst, pearl}

> A intersection B = {garnet, amethyst}

> An event with an empty set of outcomes: drawing a pearl from Bag A, since Bag A contains no pearls.
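
Python's built-in `set` type mirrors these operations directly. The sketch below is illustrative only; the variable names (`space_a`, `space_b`, `pearl_from_a`) are assumptions, and each sample space is taken to be the set of gems with a nonzero count in that bag.

```python
# Gem counts copied from the table in the problem statement
bag_a = {"diamond": 2, "garnet": 4, "amethyst": 1, "pearl": 0}
bag_b = {"diamond": 0, "garnet": 6, "amethyst": 3, "pearl": 2}

# A gem is a possible outcome only if the bag holds at least one of it
space_a = {gem for gem, n in bag_a.items() if n > 0}
space_b = {gem for gem, n in bag_b.items() if n > 0}

union = space_a | space_b         # outcomes possible in at least one bag
intersection = space_a & space_b  # outcomes possible in both bags

# An event with no outcomes: drawing a pearl from Bag A
pearl_from_a = {"pearl"} & space_a

print(sorted(union))
print(sorted(intersection))
print(pearl_from_a)
```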

### Problem 2 - Naive Bayes

Let's take a look at how a Naive Bayes classifier works by stepping through the classifier's computation process. To do this, we're going to make use of a library called NLTK (Natural Language Toolkit). This is a very useful tool for doing natural language processing, the subfield of machine learning that deals with understanding language.

The example task we're going to work with is known as sentiment analysis. We will use a dataset of 2000 movie reviews (1000 positive, 1000 negative) provided by NLTK. In this case, the classification is binary: a review is either positive or negative (i.e. the reviewer liked or didn't like the movie). Let's start by importing the dataset.

In [None]:
import nltk
nltk.download('movie_reviews')

from nltk.corpus import movie_reviews as mr

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


> Now we can go through each positive and negative movie review and count how many times each word appears in the dataset. We need these counts to estimate the probabilities of a given word appearing in either a positive or negative review.

In [None]:
# defaultdict is a handy type of dictionary where each key has a default value.
# In this case, the keys in the dictionaries will be strings (words), and
# by default their value will be a default int value of 0 (word count)
from collections import defaultdict

pos_word_counts = defaultdict(int)
neg_word_counts = defaultdict(int)

# Iterate over the file names in the dataset
for file_name in mr.fileids():
  # The part before the '/' in the file name is the label, either "pos" or "neg"
  label = file_name.split('/')[0]
  # Iterate over the words in the file
  review_words = mr.words(file_name)
  for word in review_words:
    # Increment the appropriate word count dictionary
    if label == "pos":
      pos_word_counts[word] += 1
    if label == "neg":
      neg_word_counts[word] += 1
      
# Print the first 10 word counts in positive reviews
for word, count in list(pos_word_counts.items())[:10]:
  print(word, count)

films 884
adapted 28
from 2731
comic 221
books 49
have 2240
had 721
plenty 76
of 18636
success 126


> In order to calculate probabilities, we need to know the total number of words in each set of reviews, positive and negative.

In [None]:
pos_word_total = 0
neg_word_total = 0

for count in pos_word_counts.values():
  pos_word_total += count
for count in neg_word_counts.values():
  neg_word_total += count

print("Total number positive words: ", pos_word_total)
print("Total number negative words: ", neg_word_total)

Total number positive words:  832564
Total number negative words:  751256


> Consider a new movie review that we want to classify as positive or negative: *“This is the best movie I have ever seen.”* Naturally, we expect our classifier to say that this is a positive review. Recall that we make an independence assumption, that every word is independent from all other words. Given this, let's work out the math.

> We ultimately want to calculate P(positive | review). That is, given this review, what is the probability that it is a positive review. To do this, we use Bayes' rule, which states:

> P(positive | review) = P(review | positive) * P(positive) / P(review)

> We want to do the same for P(negative | review), which follows the same formula. Since both quantities have P(review) as the denominator, we can ignore it when comparing them. P(positive) and P(negative) are estimated from the number of words in positive and negative reviews, respectively, which we already calculated; strictly, each count should be divided by the combined word total, but that shared divisor also cancels in the comparison. P(review | positive) and P(review | negative) follow from the independence assumption, which lets us multiply together the probabilities that each word in the new review would appear in a positive or negative review.
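
Written out, the quantity being compared looks roughly like this. The symbols $w_1, \dots, w_n$ (the words of the review) and the count notation are introduced here for illustration and do not appear in the original text.

```latex
\begin{align*}
P(\text{positive} \mid \text{review})
  &= \frac{P(\text{review} \mid \text{positive}) \, P(\text{positive})}{P(\text{review})} \\
  &\propto P(\text{positive}) \prod_{i=1}^{n} P(w_i \mid \text{positive}), \\
P(w_i \mid \text{positive})
  &\approx \frac{\text{count of } w_i \text{ in positive reviews}}{\text{total words in positive reviews}}
\end{align*}
```

The same expressions hold with "negative" in place of "positive", and the class with the larger value is the prediction.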

> Let's see what this looks like in code.

In [None]:
def classify(review):
  # Split the string on ' ' to get a list of words
  review_words = review.split(' ')
  
  p_review_given_pos = 1 # P(review | pos)
  for word in review_words:
    # Calculate P(word_n | pos)
    p_word_given_pos = pos_word_counts[word.lower()] / pos_word_total
    # Multiply with overall product
    p_review_given_pos *= p_word_given_pos
  p_pos = p_review_given_pos * pos_word_total # proportional to P(pos | review)
    
  # Do the same for negative reviews
  p_review_given_neg = 1
  for word in review_words:
    # Divide by the negative total (not the positive one) for P(word_n | neg)
    p_word_given_neg = neg_word_counts[word.lower()] / neg_word_total
    p_review_given_neg *= p_word_given_neg
  p_neg = p_review_given_neg * neg_word_total # proportional to P(neg | review)
  
  print(p_pos)
  print(p_neg)
    
  # Compare the two probabilities and return the appropriate classification
  if p_pos > p_neg:
    return 'positive'
  elif p_neg > p_pos:
    return 'negative'
  else:
    return "Not sure"

  
print(classify('This movie was the greatest!!'))


0.0
0.0
Not sure


> a. Replace the review in the code above with something that you would expect to return negative as a classification. Does our model agree?

>"This is the worst movie I've ever seen" - negative

>"This is terrible." - negative

>"This movie was the greatest!" - negative

> b. Now try using a review that contains a word that does not appear in the dataset. For example, "Thanos is a great villain" ("Thanos" does not appear in any reviews in this dataset). By default, the classification will be `negative`, but can you explain why? *(Hint: print out the values of `p_pos` and `p_neg`.)* 

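> Both values were 0.0. "Thanos" never appears in the dataset, so its count is 0 in both dictionaries, which makes P(word | pos) and P(word | neg) both 0, and multiplying by 0 wipes out the whole product. Since p_pos wasn't greater than p_neg, the comparison fell through to the default branch, returning negative.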
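
A small experiment shows how a single unseen word zeroes out the product. The counts below are hypothetical toy numbers, not values from the movie review dataset, and `product_of_word_probs` is a helper name invented for this sketch.

```python
from collections import defaultdict

# Hypothetical toy counts, NOT the real movie_reviews numbers
toy_counts = defaultdict(int, {"is": 50, "a": 40, "great": 10, "villain": 2})
toy_total = sum(toy_counts.values())

def product_of_word_probs(words, counts, total):
    """Multiply count/total for every word, the way classify() does."""
    prob = 1.0
    for word in words:
        # A word missing from the defaultdict has count 0, so this factor is 0
        prob *= counts[word.lower()] / total
    return prob

seen = product_of_word_probs("is a great villain".split(), toy_counts, toy_total)
unseen = product_of_word_probs("Thanos is a great villain".split(), toy_counts, toy_total)

print(seen > 0)  # True: every word has a nonzero count
print(unseen)    # 0.0: the single unseen word wipes out the product
```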

> c. (Challenge question) How can we go about resolving the predicament in part B? That is, if the review we want to classify contains a word that does not appear in our dataset, what can we do to make our model still calculate a reasonable prediction? Make these augmentations to the `classify()` method above. 

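> Maybe just have an if, an elif, and a "not sure" category for the else. That reports the tie honestly, but the model still cannot score a review containing an unseen word; giving unseen words a small nonzero probability (for example, add-one smoothing) would let it keep making predictions.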
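
One common way to handle unseen words is add-one (Laplace) smoothing, sketched below. This is a swapped-in technique and not necessarily the answer the exercise intends. The toy counts and the names `log_score` and `classify_smoothed` are hypothetical, and the sketch sums log-probabilities rather than multiplying raw probabilities so that long reviews do not underflow to 0.0.

```python
import math

# Hypothetical toy counts, NOT the real movie_reviews numbers
toy_pos_counts = {"great": 8, "best": 6, "movie": 10, "terrible": 1}
toy_neg_counts = {"great": 2, "best": 1, "movie": 10, "terrible": 9}

def log_score(words, counts, vocab_size):
    """Log of (class word total) * product of add-one smoothed P(word | class)."""
    total = sum(counts.values())
    # Same unnormalized prior as the notebook: the class's word total
    score = math.log(total)
    for word in words:
        # Adding 1 to every count gives unseen words a small nonzero probability
        smoothed = (counts.get(word.lower(), 0) + 1) / (total + vocab_size)
        score += math.log(smoothed)
    return score

def classify_smoothed(review):
    words = review.split()
    vocab = set(toy_pos_counts) | set(toy_neg_counts)
    pos = log_score(words, toy_pos_counts, len(vocab))
    neg = log_score(words, toy_neg_counts, len(vocab))
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "not sure"

print(classify_smoothed("Thanos is a great villain"))
```

With these toy counts, the unseen words no longer force a tie, and "great" tips the comparison toward the positive class. In the notebook, the same idea would use `pos_word_counts`, `neg_word_counts`, and the size of their combined vocabulary.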

> d. (Challenge question) One step that we have skipped when processing the words in each file is removing tokens (which basically means "words", but includes non-words too) like punctuation marks and what are known as "stop words". Stop words are words like "the", "at", "is", etc. These are words that are extremely common, so the probability that we calculate for these words is unnecessarily high. For example, "The movie is great" should place the most emphasis on the word "great", not the words "The" and "is".

> The code below shows how you can import a collection of stop words for the English language and a collection of punctuation marks. Use these to update the code above so that stop words and punctuation are ignored when counting words. 

> *Note: for the resulting code to work, your answer for part C must be implemented in the code as well, since stop words will no longer appear in our dataset.*

In [None]:
# Import the stopwords corpus
nltk.download('stopwords')
from nltk.corpus import stopwords
# Get the stopwords for English
stops = stopwords.words('english')

# With string imported, you can access string.punctuation
# which is a collection of punctuation marks.
import string

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
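
As a rough sketch of what filtered counting for part d could look like, the function below skips stop words and punctuation-only tokens. So that it runs without any downloads, it uses hypothetical stand-ins: `toy_stops` in place of `stopwords.words('english')` and `toy_tokens` in place of `mr.words(file_name)`. The name `count_content_words` is also made up for this example.

```python
import string
from collections import defaultdict

# Hypothetical stand-ins for the NLTK stop word list and a review's tokens
toy_stops = {"the", "is", "a", "was"}
toy_tokens = ["The", "movie", "is", "great", ".", "A", "great", "cast", "!"]

def count_content_words(tokens, stop_words):
    """Count tokens, skipping stop words and punctuation-only tokens."""
    counts = defaultdict(int)
    for token in tokens:
        word = token.lower()
        if word in stop_words:
            continue
        # Skip tokens made up entirely of punctuation, e.g. "." or "!!"
        if all(ch in string.punctuation for ch in word):
            continue
        counts[word] += 1
    return counts

counts = count_content_words(toy_tokens, toy_stops)
print(dict(counts))
```

Converting the stop word list to a `set` first makes each membership check faster than scanning a list, which matters when counting over a whole corpus.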


In [None]:
def classify(review):
  # Split the string on ' ' to get a list of words
  review_words = review.split(' ')
  print(review_words)
  print(stops)
  
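  # Build a new list rather than removing items while iterating,
  # which would skip the word right after each removed one
  review_words = [w for w in review_words if w.lower() not in stops]

  print(review_words)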
  
  
  p_review_given_pos = 1 # P(review | pos)
  for word in review_words:
    # Calculate P(word_n | pos)
    p_word_given_pos = pos_word_counts[word.lower()] / pos_word_total
    # Multiply with overall product
    p_review_given_pos *= p_word_given_pos
  p_pos = p_review_given_pos * pos_word_total # P(pos | review)
    
  # Do the same for negative reviews
  p_review_given_neg = 1
  for word in review_words:
    p_word_given_neg = neg_word_counts[word.lower()] / neg_word_total
    p_review_given_neg *= p_word_given_neg
  p_neg = p_review_given_neg * neg_word_total # P(neg | review)
  
  print(p_pos)
  print(p_neg)
    
  # Compare the two probabilities and return the appropriate classification
  if p_pos > p_neg:
    return 'positive'
  elif p_neg > p_pos:
    return 'negative'
  else:
    return "Not sure"

  
print(classify('this movie was adapted so well from the book. it was the best'))


['this', 'movie', 'was', 'adapted', 'so', 'well', 'from', 'the', 'book.', 'it', 'was', 'the', 'best']
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few