<h2> ======================================================</h2>
 <h1>MA477 - Theory and Applications of Data Science</h1> 
  <h1>Lesson 13: Naive Bayes Classifier (NBC) </h1> 
 
 <h4>Dr. Valmir Bucaj</h4>
 United States Military Academy, West Point 
AY20-2
<h2>======================================================</h2>

<h2>Lecture Outline</h2>

<ul>
    <li>What is NBC?</li>
    <li> What is NBC used for?</li>
    <li>Bayes Theorem</li>
    <li> Applications of NBC</li>
    <li> Natural Language Tool Kit (NLTK) Overview</li>
    <li> Text Counting</li>
    
</ul>

<h3>What is NBC?</h3>

The Naive Bayes classifier (NBC), as the name suggests, is a supervised machine-learning technique used for classification problems. At the heart of NBC is Bayes Theorem (discussed shortly), which is used to calculate the probability that a new data point belongs to a given class.

<h3>What is NBC used for?</h3>

NBC can be used to classify data. In this course, we will use NBC to classify <b>text</b> and perform <b> sentiment analysis</b>.

Examples include:

<ul>
    
 <li> Natural Language Processing (NLP)</li>
 <li> Email Spam Detection</li>
 <li>Classify whether a name is male, female, gender-neutral etc.</li>
 <li> Classify whether product reviews are good or bad </li>
 <li> etc.</li>
 </ul>
 
 Before we describe a concrete application of NBC, which we will carry out later on, let us briefly turn to Bayes Theorem.
 
 <h3>Bayes Theorem</h3>
 
 Let $A,B$ be any two events with $P(B)\neq 0.$ Then,
 
 $$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$$ where $P(B)=P(B|A)P(A)+P(B|A^c)P(A^c)$
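
To make the formula concrete, here is a minimal numeric sketch. The function name `posterior` and all of the probabilities are made up purely for illustration (they are not estimated from any data); the code simply evaluates the two formulas above.

```python
def posterior(p_b_given_a, p_a, p_b_given_not_a):
    """Return P(A|B) using Bayes Theorem and the law of total probability."""
    p_not_a = 1.0 - p_a
    # P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    if p_b == 0:
        raise ValueError("P(B) must be nonzero")
    return p_b_given_a * p_a / p_b


# Hypothetical numbers: A = "email is spam", B = "email contains the word 'free'"
p_spam_given_free = posterior(p_b_given_a=0.6, p_a=0.2, p_b_given_not_a=0.1)
print(round(p_spam_given_free, 3))
```

With these made-up inputs, $P(B)=0.6\cdot 0.2+0.1\cdot 0.8=0.2$, so the posterior is $0.12/0.2=0.6$: observing $B$ raises the probability of $A$ from the prior of $0.2$ to $0.6$.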
 
 <h3>Application of NBC</h3>
 
 Suppose we want to classify a name as a female or male name. For example,
 
 Marie, Ana, Sophia, Zoey are typical female names, while John, Karl, Brad etc. are typical male names. 
 
 First off, we need a strategy and a criterion to help us make this distinction. One such simple criterion could be the letter that the name ends with. Typically, female names end in one of the vowels A, E, I, O, U, Y, while male names typically end in consonants.
 
 So, in our case we have two classes $C_1=Female$ and $C_2=Male$, while the features of the name that we want to classify (here, the vowels it may end in) will be written as a vector $\vec{x}=(a,e,i,o,u,y)$.
 
 Our goal is to find $$P(C_i|\vec{x})$$ for $i=1,2.$ In other words, given that a name ends in a vowel, we want to compute the probability that it is a female name and the probability that it is a male name (which in this case is simply the complement of the probability of being a female name, since we are only dealing with two classes). We make our decision based on which of these two probabilities is larger.
 
This is where Bayes Theorem comes in handy. Instead of computing the above probabilities directly, we do it using Bayes Theorem, namely


$$P(C_i|\vec{x})=\frac{P(\vec{x}|C_i)P(C_i)}{P(\vec{x})}$$

One thing to point out immediately is that we will completely ignore the denominator: it is common to all the classes, and we are not interested in the exact probabilities but rather in how they compare with one another. Here, $P(C_i)$ is known as the prior probability; that is, we need some a priori sense of how likely a name is to be male or female before knowing anything else about the particular name in question. The probability that a name ends in a vowel given that it is male or female, $P(\vec{x}|C_i)$, is instead obtained empirically from the data that we have; that is, we need a collection of names whose labels (male or female) are known, which we use to train our model.
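
The computation described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the training list is a tiny, hand-made toy set, the feature is reduced to a single yes/no question ("does the name end in a vowel?"), and the helper names (`ends_in_vowel`, `score`, `classify`) are hypothetical.

```python
from collections import Counter

# Hypothetical toy training set of (name, label) pairs; real data would be far larger.
training = [
    ("Marie", "female"), ("Ana", "female"), ("Sophia", "female"),
    ("Zoey", "female"), ("Karen", "female"),
    ("John", "male"), ("Karl", "male"), ("Brad", "male"), ("Luca", "male"),
]
VOWELS = set("aeiouy")


def ends_in_vowel(name):
    return name[-1].lower() in VOWELS


# Number of names per class, and number of vowel-ending names per class
class_counts = Counter(label for _, label in training)
vowel_counts = Counter(label for name, label in training if ends_in_vowel(name))


def score(label, vowel_ending):
    """Unnormalized posterior P(x|C) * P(C); the denominator P(x) is skipped."""
    prior = class_counts[label] / len(training)
    p_vowel = vowel_counts[label] / class_counts[label]
    likelihood = p_vowel if vowel_ending else 1 - p_vowel
    return likelihood * prior


def classify(name):
    """Pick the class with the largest unnormalized posterior."""
    x = ends_in_vowel(name)
    return max(class_counts, key=lambda label: score(label, x))


print(classify("Emma"), classify("Mark"))
```

Because both scores share the same denominator $P(\vec{x})$, comparing the unnormalized products gives the same decision as comparing the true posterior probabilities.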

So, why the term <b>Naive</b>? It is called the <b>Naive</b> Bayes Classifier because of the assumption that the features are independent of each other, given the class.
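
Written out, and assuming the feature vector has $n$ components $\vec{x}=(x_1,\dots,x_n)$ (the index $j$ and the count $n$ are notation introduced here for illustration), the independence assumption lets the class-conditional probability factor into a product of one-feature terms:

$$P(\vec{x}|C_i)=\prod_{j=1}^{n}P(x_j|C_i)$$

Combined with Bayes Theorem and with the common denominator dropped, the quantity we compare across classes becomes

$$P(C_i|\vec{x})\propto P(C_i)\prod_{j=1}^{n}P(x_j|C_i)$$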


 <h3>NLTK</h3>
 
 NLTK (the Natural Language Toolkit) is a collection of Python libraries used to conduct symbolic and statistical natural language processing. For more information, visit www.nltk.org.
 
We begin by importing this library:

In [2]:
import nltk

Inside `nltk` is a large collection of digitized books and other texts, which we will often use as examples to illustrate certain points. Let's go ahead and import some text.


In [18]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\valmir.bucaj\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\gutenberg.zip.


True

In [42]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\valmir.bucaj\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [19]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

So, these are all the texts from Project Gutenberg that are included in the NLTK corpus collection. We'll use 'carroll-alice.txt' to illustrate text counting.

In [21]:
alice=nltk.corpus.gutenberg.words('carroll-alice.txt')

As we can check, `alice` contains a list of all the words (and punctuation marks) in the text.

In [24]:
alice[:30]

['[',
 'Alice',
 "'",
 's',
 'Adventures',
 'in',
 'Wonderland',
 'by',
 'Lewis',
 'Carroll',
 '1865',
 ']',
 'CHAPTER',
 'I',
 '.',
 'Down',
 'the',
 'Rabbit',
 '-',
 'Hole',
 'Alice',
 'was',
 'beginning',
 'to',
 'get',
 'very',
 'tired',
 'of',
 'sitting',
 'by']

If we want to check how many times the words `Alice`, `Rabbit`, etc. appear in the text, we can do so as follows:

In [26]:
alice.count('Alice')

396

In [30]:
alice.count('Rabbit')

45

In [31]:
alice.count('rabbit')

5

As we can observe, it is case sensitive.
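
If we wanted a count that ignores case, one option is to lowercase every token before comparing. The sketch below uses a short, made-up token list in place of `alice`, and the helper `count_ignoring_case` is a hypothetical name, not part of NLTK.

```python
# Toy token list standing in for `alice`
tokens = ["Alice", "saw", "the", "Rabbit", ";", "the", "rabbit", "ran", "."]


def count_ignoring_case(words, target):
    """Count the tokens equal to `target` once both are lowercased."""
    target = target.lower()
    return sum(1 for w in words if w.lower() == target)


exact = tokens.count("rabbit")                       # case-sensitive count
folded = count_ignoring_case(tokens, "rabbit")       # case-insensitive count
print(exact, folded)
```

Here the exact count only sees `rabbit`, while the case-insensitive count also picks up `Rabbit`.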

We can also check the total number of words/items in the text to get a rough idea of its length.

In [32]:
len(alice)

34110

If we want to check how many unique words/items are in the text, we can first convert the list to a set and then compute its length.

In [35]:
alice_set=set(alice)

In [36]:
len(alice_set)

3016

So, there are a little over 3000 unique words/items in the text. Now, we can check the average number of times each word/item appears in the text.

In [37]:
len(alice)/len(alice_set)

11.309681697612731

So, on average, each word/item appears roughly 11 times in the text.
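
The same ratio can be wrapped in a small helper so that it can be reused on other texts. This is a sketch only: the function name `average_occurrences` and its `ignore_case` option are assumptions made here for illustration, and the token lists are made up.

```python
def average_occurrences(words, ignore_case=False):
    """Total number of tokens divided by the number of distinct tokens."""
    if ignore_case:
        words = [w.lower() for w in words]
    distinct = set(words)
    if not distinct:
        raise ValueError("words must be non-empty")
    return len(words) / len(distinct)


tokens = ["the", "cat", "and", "the", "hat", "and", "the", "bat"]
print(average_occurrences(tokens))  # 8 tokens / 5 distinct tokens

mixed = ["The", "the", "cat"]
print(average_occurrences(mixed), average_occurrences(mixed, ignore_case=True))
```

Since counting is case sensitive, `The` and `the` count as two distinct items unless the tokens are lowercased first, which changes the ratio.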

Rather than extracting the individual words from the text, we may extract the sentences, which gives us a better idea of the overall structure of the text.

In [45]:
alice_sent=nltk.corpus.gutenberg.sents('carroll-alice.txt')

As we will soon check, `alice_sent` is a list of lists, where each inner list is a sentence broken down into individual words. For example, below is the sentence at index 5 (that is, the sixth sentence, since indexing starts at 0).

In [55]:
alice_sent[5]

['There',
 'was',
 'nothing',
 'so',
 'VERY',
 'remarkable',
 'in',
 'that',
 ';',
 'nor',
 'did',
 'Alice',
 'think',
 'it',
 'so',
 'VERY',
 'much',
 'out',
 'of',
 'the',
 'way',
 'to',
 'hear',
 'the',
 'Rabbit',
 'say',
 'to',
 'itself',
 ',',
 "'",
 'Oh',
 'dear',
 '!']

Let's check the average number of words per sentence.

In [56]:
len(alice)/len(alice_sent)

20.029359953024077

So, a sentence contains on average about 20 words.

<font color='red' size='4'>Exercise</font>

Load the inaugural speeches and compute the average number of words per sentence in each speech. (Later we'll do more interesting things with the inaugural speeches!)

In [71]:
nltk.download('inaugural')
from nltk.corpus import inaugural

[nltk_data] Downloading package inaugural to
[nltk_data]     C:\Users\valmir.bucaj\AppData\Roaming\nltk_data...
[nltk_data]   Package inaugural is already up-to-date!


In [72]:
speeches=inaugural.fileids()

In [70]:
speeches

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt',
 '1805-Jefferson.txt',
 '1809-Madison.txt',
 '1813-Madison.txt',
 '1817-Monroe.txt',
 '1821-Monroe.txt',
 '1825-Adams.txt',
 '1829-Jackson.txt',
 '1833-Jackson.txt',
 '1837-VanBuren.txt',
 '1841-Harrison.txt',
 '1845-Polk.txt',
 '1849-Taylor.txt',
 '1853-Pierce.txt',
 '1857-Buchanan.txt',
 '1861-Lincoln.txt',
 '1865-Lincoln.txt',
 '1869-Grant.txt',
 '1873-Grant.txt',
 '1877-Hayes.txt',
 '1881-Garfield.txt',
 '1885-Cleveland.txt',
 '1889-Harrison.txt',
 '1893-Cleveland.txt',
 '1897-McKinley.txt',
 '1901-McKinley.txt',
 '1905-Roosevelt.txt',
 '1909-Taft.txt',
 '1913-Wilson.txt',
 '1917-Wilson.txt',
 '1921-Harding.txt',
 '1925-Coolidge.txt',
 '1929-Hoover.txt',
 '1933-Roosevelt.txt',
 '1937-Roosevelt.txt',
 '1941-Roosevelt.txt',
 '1945-Roosevelt.txt',
 '1949-Truman.txt',
 '1953-Eisenhower.txt',
 '1957-Eisenhower.txt',
 '1961-Kennedy.txt',
 '1965-Johnson.txt',
 '1969-Nixon.txt',
 '1973-Nixon.txt',
 '1

In [73]:
#Start Your Answer Here


<h3>Frequency Distributions</h3>

When analyzing text, one of the most basic things we may want to know is how often each word appears in the text.

In [74]:
alice_fd=nltk.FreqDist(alice)

In [75]:
alice_fd

FreqDist({',': 1993, "'": 1731, 'the': 1527, 'and': 802, '.': 764, 'to': 725, 'a': 615, 'I': 543, 'it': 527, 'she': 509, ...})

So, `alice_fd` is a dictionary-like object that contains the number of times each item/word appears in the text. For example, if we want to know how many times the word `dear` appears, we can do so as follows:

In [76]:
alice_fd['dear']

28
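
NLTK's `FreqDist` is built on Python's `collections.Counter`, so the lookup behavior can be sketched with the standard library alone. The token list below is a made-up stand-in for `alice`.

```python
from collections import Counter

# Toy token list standing in for `alice`
tokens = ["Oh", "dear", "!", "Oh", "dear", "!", "I", "shall", "be", "late", "!"]
counts = Counter(tokens)

dear_count = counts["dear"]      # number of times 'dear' occurs
missing_count = counts["Rabbit"] # a token that never occurs counts as 0, no KeyError
total = sum(counts.values())     # the counts add up to the number of tokens
print(dear_count, missing_count, total == len(tokens))
```

Unlike a plain `dict`, looking up a word that never appears returns 0 rather than raising an error, which is convenient when querying arbitrary words.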

A great feature of a frequency distribution is that we can use it to return the most common words/items.

In [78]:
alice_fd.most_common(15)

[(',', 1993),
 ("'", 1731),
 ('the', 1527),
 ('and', 802),
 ('.', 764),
 ('to', 725),
 ('a', 615),
 ('I', 543),
 ('it', 527),
 ('she', 509),
 ('of', 500),
 ('said', 456),
 (",'", 397),
 ('Alice', 396),
 ('in', 357)]

For example, the comma `,` is the item that appears the most, followed by the apostrophe `'` and the word `the`.

Is this list informative at all? Not so much! Other than `Alice`, the words/items in it tell us essentially nothing about the actual content of the text. In the future we will learn how to first get rid of these non-descriptive words, so that we can gain a better idea of the actual contents of the text.
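
As a rough preview of the idea (not necessarily the approach we will use later), one could drop punctuation and a list of very common words before counting. In the sketch below the `STOPWORDS` set is a tiny, hand-picked, hypothetical list, the helper `content_word_counts` is a made-up name, and the tokens are toy data.

```python
from collections import Counter

# Hypothetical, hand-picked stopword set; a realistic one would be much longer.
STOPWORDS = {"the", "and", "to", "a", "i", "it", "she", "of", "in", "said"}


def content_word_counts(words):
    """Count alphabetic tokens that are not stopwords, ignoring case."""
    return Counter(
        w.lower() for w in words
        if w.isalpha() and w.lower() not in STOPWORDS
    )


tokens = ["Alice", "said", ",", "'", "the", "Rabbit", "and", "the",
          "Queen", "saw", "Alice", "."]
filtered = content_word_counts(tokens)
print(filtered.most_common(1))
```

After filtering, punctuation and words such as `the` no longer appear, so the most common remaining items say more about who and what the text is about.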

We can also check the words that were used only once:

In [81]:
alice_fd.hapaxes()[:10]

['Lewis',
 'Carroll',
 '1865',
 ']',
 'Hole',
 'conversations',
 'daisy',
 'chain',
 'daisies',
 'pink']
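
For intuition, the same idea can be sketched with a plain `collections.Counter`: keep exactly those items whose count is 1. The function below is an illustrative stand-in written for these notes, not NLTK's own implementation, and the token list is made up.

```python
from collections import Counter


def hapaxes(words):
    """Return the tokens that occur exactly once, in order of first appearance."""
    counts = Counter(words)
    return [w for w, c in counts.items() if c == 1]


tokens = ["daisy", "chain", "the", "daisies", "the", "pink", "chain"]
once_only = hapaxes(tokens)
print(once_only)
```

Note that `daisy` and `daisies` are treated as different items, just as `Rabbit` and `rabbit` were earlier.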

<h3> Conditional Frequency Distributions</h3>