# <p style="background-color:#e36288; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Naive Bayes and Natural Language Processing (NLP)</p>


<div class="alert alert-block alert-success">
    
## <span style=" color:#e36288">Naive Bayes Classifier

* The Naïve Bayes classifier is a **supervised machine learning algorithm** used for classification tasks such as text classification. It applies the principles of probability to assign a class to each input.
* Bayes' Theorem is a probability formula that uses previously known probabilities to compute the probability of a related event occurring: **P(Y|X) = P(X|Y)P(Y) / P(X)**
* **Example:** If a smoke alarm goes off, what is the probability that there actually is a dangerous fire? Here A plays the role of Y and B the role of X in the formula above.

  * **A** and **B** are events (Dangerous Fire and Smoke Alarm Triggered)
  * **P(A|B)** is the probability of event A given that B is true (probability of Fire given Smoke Alarm)
  * **P(B|A)** is the probability of event B given that A is true (probability of Smoke Alarm given a Dangerous Fire)
  * **P(A)** is the probability of A occurring (probability of Fire)
  * **P(B)** is the probability of B occurring (probability of Smoke Alarm)

* Naïve Bayes is part of a family of generative learning algorithms, meaning that it seeks to model the distribution of inputs of a given class or category. Unlike discriminative classifiers, like logistic regression, **it does not learn which features are most important to differentiate between classes.**
* There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: **all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.** Since the features x are assumed to be **conditionally independent of each other given the class**, which is a strong and often unrealistic assumption, the method is called "Naive" Bayes.
* For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers **each of these features to contribute independently to the probability** that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.
* There are many **variations of the Naive Bayes** model, including:
  * Multinomial Naive Bayes
  * Gaussian Naive Bayes
  * Complement Naive Bayes
  * Bernoulli Naive Bayes
  * Categorical Naive Bayes
</span>
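
To make the smoke alarm example concrete, here is a minimal sketch of Bayes' Theorem in plain Python. The three input probabilities are hypothetical values chosen only for illustration; they do not come from real data.

```python
# Hypothetical probabilities, chosen only to illustrate the formula.
p_fire = 0.01                 # P(A): prior probability of a dangerous fire
p_alarm_given_fire = 0.95     # P(B|A): alarm triggers when there is a fire
p_alarm_given_no_fire = 0.05  # false-alarm rate when there is no fire

# P(B): total probability that the alarm triggers (fire or no fire)
p_alarm = (p_alarm_given_fire * p_fire
           + p_alarm_given_no_fire * (1 - p_fire))

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_fire_given_alarm = p_alarm_given_fire * p_fire / p_alarm

print(round(p_fire_given_alarm, 3))  # 0.161
```

With these assumed numbers, a triggered alarm signals a real fire only about 16% of the time, because fires are rare compared with false alarms.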

# Extracting Features from Text Data

<div class="alert alert-block alert-info alert">

## <span style=" color:#bf2e98">Part One: Core Concepts on Feature Extraction

In this section we'll use basic Python to build a rudimentary NLP system. We'll build a **corpus of documents** (two small text files), create a **vocabulary** from all the words in both documents, and then demonstrate a **Bag of Words** technique to extract features from each document.
</div>

In [1]:
import numpy as np
import pandas as pd

### Start with some documents:
Let's quickly open the text files and read them. Keep in mind that you should avoid reading and displaying entire files when they are very large, since the notebook may try to print everything, depending on how you open the file.

#### Read the entire text as a string: read()

In [2]:
# Read the whole file into a single string
with open("One.txt") as mytext:
    a = mytext.read()

In [3]:
a
# here "\n" represents the end of the line

'This is a story about dogs\nour canine pets\nDogs are furry animals\n'

In [4]:
# to see the lines separately use print
print(a)

This is a story about dogs
our canine pets
Dogs are furry animals



In [5]:
# Read another text document
with open('Two.txt') as mytext:
    print(mytext.read())

This story is about surfing
Catching waves is fun
Surfing is a popular water sport



#### Read each line as a list: readlines()

In [6]:
# readlines() returns the file as a list with one string per line
with open("One.txt") as mytext: 
    a = mytext.readlines()

a

['This is a story about dogs\n',
 'our canine pets\n',
 'Dogs are furry animals\n']

#### Read the words separately: split()

In [7]:
with open("One.txt") as mytext: 
    a = mytext.read()

a.lower().split() # split words as lower case

['this',
 'is',
 'a',
 'story',
 'about',
 'dogs',
 'our',
 'canine',
 'pets',
 'dogs',
 'are',
 'furry',
 'animals']

### Building a vocabulary (Creating a "Bag of Words")

Let's build a dictionary that maps each unique word in the documents to its own index. We can think of this as listing every word available across all (here, both) documents.

In [8]:
with open("One.txt") as mytext:
    words_one = mytext.read().lower().split()

In [9]:
words_one

['this',
 'is',
 'a',
 'story',
 'about',
 'dogs',
 'our',
 'canine',
 'pets',
 'dogs',
 'are',
 'furry',
 'animals']

In [10]:
len(words_one)

13

#### Unique words: set()

In [11]:
uni_words_one = set(words_one)

In [12]:
uni_words_one

{'a',
 'about',
 'animals',
 'are',
 'canine',
 'dogs',
 'furry',
 'is',
 'our',
 'pets',
 'story',
 'this'}

In [13]:
# Repeat it for the other document

with open("Two.txt") as mytext:
    words_two = mytext.read().lower().split()
    uni_words_two = set(words_two)

In [14]:
# Unique words in the Two.txt file
uni_words_two

{'a',
 'about',
 'catching',
 'fun',
 'is',
 'popular',
 'sport',
 'story',
 'surfing',
 'this',
 'water',
 'waves'}

#### Get all unique words across all documents

In [15]:
all_uni_words = set()
all_uni_words.update(uni_words_one)
all_uni_words.update(uni_words_two)

In [16]:
all_uni_words

{'a',
 'about',
 'animals',
 'are',
 'canine',
 'catching',
 'dogs',
 'fun',
 'furry',
 'is',
 'our',
 'pets',
 'popular',
 'sport',
 'story',
 'surfing',
 'this',
 'water',
 'waves'}

In [17]:
len(all_uni_words)

19

#### Assign a number for each word 

In [18]:
# Assign an integer index to each unique word

full_vocab = dict()

for i, word in enumerate(all_uni_words):
    full_vocab[word] = i

In [19]:
full_vocab 
# sets are unordered, so the indices follow no alphabetical order

{'waves': 0,
 'about': 1,
 'dogs': 2,
 'a': 3,
 'pets': 4,
 'is': 5,
 'fun': 6,
 'water': 7,
 'our': 8,
 'popular': 9,
 'sport': 10,
 'this': 11,
 'canine': 12,
 'surfing': 13,
 'furry': 14,
 'are': 15,
 'story': 16,
 'animals': 17,
 'catching': 18}
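
Because iteration order over a set of strings can differ between Python sessions, the indices above may change from run to run. One possible way to get a reproducible mapping is to sort the words before numbering them. The sketch below is a variation on the notebook's approach, not a replacement for it; the two sets simply copy the unique words shown earlier, and `sorted_vocab` is a name introduced here for illustration.

```python
# Unique words copied from the outputs above
uni_words_one = {'a', 'about', 'animals', 'are', 'canine', 'dogs',
                 'furry', 'is', 'our', 'pets', 'story', 'this'}
uni_words_two = {'a', 'about', 'catching', 'fun', 'is', 'popular',
                 'sport', 'story', 'surfing', 'this', 'water', 'waves'}

# Union of both sets, then number the words in alphabetical order
all_uni_words = uni_words_one | uni_words_two
sorted_vocab = {word: i for i, word in enumerate(sorted(all_uni_words))}

print(len(sorted_vocab))                          # 19
print(sorted_vocab["a"], sorted_vocab["waves"])   # 0 18
```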

### Bag of Words to Frequency Counts

Now that we've encapsulated our "entire language" in a dictionary, let's perform **feature extraction** on each of our original documents.

#### Empty counts per doc

In [20]:
# Create an empty vector with space for each word in the vocabulary:
one_freq = [0]*len(full_vocab)
two_freq = [0]*len(full_vocab)
all_words = ['']*len(full_vocab)

In [21]:
one_freq

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [22]:
two_freq

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [23]:
all_words

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

#### Add in counts per word per doc:

In [24]:
# map the frequencies of each word in One.txt to our vector
with open('One.txt') as f:
    one_text = f.read().lower().split()
    
for word in one_text:
    word_ind = full_vocab[word]
    one_freq[word_ind]+=1

In [25]:
one_freq
# "dogs" appears twice in One.txt, so its entry (index 2) is 2

[0, 1, 2, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0]

In [26]:
# Do the same for the second document:
with open('Two.txt') as f:
    two_text = f.read().lower().split()
    
for word in two_text:
    word_ind = full_vocab[word]
    two_freq[word_ind]+=1

In [27]:
two_freq

[1, 1, 0, 1, 0, 3, 1, 1, 0, 1, 1, 1, 0, 2, 0, 0, 1, 0, 1]

In [28]:
# Place each word at its own index so the list lines up with the frequency vectors
for word in full_vocab:
    word_ind = full_vocab[word]
    all_words[word_ind] = word

In [29]:
all_words
# each word now sits at the position given by its index in full_vocab

['waves',
 'about',
 'dogs',
 'a',
 'pets',
 'is',
 'fun',
 'water',
 'our',
 'popular',
 'sport',
 'this',
 'canine',
 'surfing',
 'furry',
 'are',
 'story',
 'animals',
 'catching']

#### Bag of Words

In [30]:
# One row per document, one column per word; each cell holds that word's count
bow = pd.DataFrame(data=[one_freq, two_freq], columns=all_words)
bow

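|   | waves | about | dogs | a | pets | is | fun | water | our | popular | sport | this | canine | surfing | furry | are | story | animals | catching |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 |
| 1 | 1 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 2 | 0 | 0 | 1 | 0 | 1 |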


By comparing the vectors we see that some words are common to both, some appear only in `One.txt`, others only in `Two.txt`. Extending this logic to tens of thousands of documents, we would see the vocabulary dictionary grow to hundreds of thousands of words. Each document's vector would then contain mostly zeros, so the stacked vectors form a **sparse matrix**.
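
The idea behind a sparse representation is to store only the nonzero entries. As a rough illustration, the sketch below uses `collections.Counter` from the standard library as a stand-in for one sparse row. This is an assumption made for teaching purposes; it is not the Compressed Sparse Row format that SciPy and scikit-learn actually use. The string `doc_one` repeats the content of `One.txt` shown earlier.

```python
from collections import Counter

# Content of One.txt as shown earlier, joined into one string
doc_one = "This is a story about dogs our canine pets Dogs are furry animals"

# Only words that actually occur are stored; everything else is an implicit 0
sparse_one = Counter(doc_one.lower().split())

print(sparse_one["dogs"])     # 2
print(sparse_one["surfing"])  # 0 (not stored, Counter returns 0 for missing keys)
print(len(sparse_one))        # 12 stored entries, regardless of vocabulary size
```

The number of stored entries depends on the document, not on the size of the vocabulary, which is what keeps memory use manageable for large corpora.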

<div class="alert alert-block alert-success">
    
## <span style=" color:#e36288">Important Concepts in NLP

### Bag of Words and Tf-idf
In the above examples, each vector can be considered a **bag of words**. By themselves these vectors may not be helpful until we consider **term frequencies**, or how often individual words appear in documents. However, it may be hard to differentiate documents based on term frequency if a word shows up in a majority of documents. To handle this we also consider **inverse document frequency**, which is the total number of documents divided by the number of documents that contain the word. In practice we convert this value to a logarithmic scale, as described [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency).

**Term Frequency (TF):** The raw count of a term in a document, that is, the number of times term **t** occurs in document **d**. A simple way to normalize term frequencies is to divide the number of occurrences of a word by the total number of words in the document. In this way, counts from large documents can be compared with counts from smaller ones.

**Inverse Document Frequency (IDF):** An IDF factor is incorporated to diminish the weight of terms that occur very frequently across the document set and to increase the weight of terms that occur rarely. The closer it is to 0, the more common a word is.
* TF-IDF = term frequency * (1 / document frequency), or
* TF-IDF = term frequency * inverse document frequency

Together these terms become [**tf-idf**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
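
The definitions above can be sketched in a few lines of plain Python. This version assumes the length-normalized term frequency and a natural-log IDF, `log(N / df)`, with no smoothing; library implementations may use different variants. The word lists repeat the contents of `One.txt` and `Two.txt`, and the helper names `tf`, `idf` and `tf_idf` are introduced here for illustration.

```python
import math

# Lower-cased words of the two example documents
docs = {
    "One.txt": "this is a story about dogs our canine pets dogs are furry animals".split(),
    "Two.txt": "this story is about surfing catching waves is fun surfing is a popular water sport".split(),
}

def tf(term, words):
    """Occurrences of the term divided by the document length."""
    return words.count(term) / len(words)

def idf(term, docs):
    """log(total documents / documents containing the term).
    Assumes the term occurs in at least one document."""
    df = sum(1 for words in docs.values() if term in words)
    return math.log(len(docs) / df)

def tf_idf(term, doc_name):
    return tf(term, docs[doc_name]) * idf(term, docs)

print(tf_idf("story", "One.txt"))           # 0.0 (appears in both documents)
print(round(tf_idf("dogs", "One.txt"), 4))  # 0.1066 (appears only in One.txt)
```

A word shared by every document gets an IDF of 0 and therefore carries no weight, while a word specific to one document keeps a positive score.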

### Stop Words and Word Stems
**Stop Words** are words common enough throughout a language that it is usually safe to remove them and not consider them as important (e.g. the, a, is, etc.). Most NLP libraries ship with a built-in list of common stop words. Also, it may make sense to only record the **root of a word**, say `cat` in place of both `cat` and `cats`. This shrinks the vocabulary and can improve performance.
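
As a rough illustration of both ideas, the sketch below filters a hand-picked stop-word list and strips a trailing plural "s". Both `STOP_WORDS` and `crude_stem` are hypothetical helpers written for this example; real libraries use much larger stop-word lists and proper stemming or lemmatization algorithms.

```python
# Hypothetical, hand-picked stop words (real lists are far longer)
STOP_WORDS = {"a", "about", "are", "is", "our", "this"}

def crude_stem(word):
    """Strip a trailing plural 's'. A toy rule, not a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

words = "this is a story about dogs our canine pets dogs are furry animals".split()

# Drop stop words first, then reduce the remaining words to a crude root
kept = [crude_stem(w) for w in words if w not in STOP_WORDS]

print(kept)
# ['story', 'dog', 'canine', 'pet', 'dog', 'furry', 'animal']
```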

### Tokenization and Tagging
When we created our vectors the first thing we did was split the incoming text on whitespace with `.split()`. This was a crude form of **tokenization** - that is, dividing a document into individual words. In this simple example we didn't worry about punctuation or different parts of speech. In the real world we rely on some fairly sophisticated *morphology* to parse text appropriately.
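
To see why whitespace splitting is crude, compare it with a simple regular-expression tokenizer on a sentence that contains punctuation. The sentence and the pattern `[a-z']+` are assumptions made for this sketch; production tokenizers handle many more cases.

```python
import re

sentence = "Dogs are furry animals, aren't they? Surfing is fun!"

# Whitespace splitting leaves punctuation glued to the words
print(sentence.lower().split())
# ['dogs', 'are', 'furry', 'animals,', "aren't", 'they?', 'surfing', 'is', 'fun!']

# Keep runs of letters and apostrophes, dropping other punctuation
tokens = re.findall(r"[a-z']+", sentence.lower())
print(tokens)
# ['dogs', 'are', 'furry', 'animals', "aren't", 'they', 'surfing', 'is', 'fun']
```

With `.split()`, `animals,` and `animals` would count as two different vocabulary entries, which is exactly the kind of noise a tokenizer is meant to remove.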

Once the text is divided, we can go back and **tag** our tokens with information about parts of speech, grammatical dependencies, etc. This adds more dimensions to our data and enables a deeper understanding of the context of specific documents. For this reason, vectors become ***high dimensional sparse matrices***.
</span>


<div class="alert alert-block alert-info alert">

## <span style=" color:#bf2e98">Part Two:  Feature Extraction with Scikit-Learn

Let's explore the more realistic process of using sklearn to complete the tasks mentioned above!
</div>

In [31]:
# Let's create a simple text consisting of both unique and repeated words
text = ["This is a line",
        "This is another line",
        "Completely different line"]

## CountVectorizer

In [32]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

In [33]:
# help(CountVectorizer)

In [34]:
cv = CountVectorizer()

In [35]:
# Fit and transform
cv.fit_transform(text) # text data above

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [36]:
# We can assign it to a variable and see the sparse matrix (3x6)
sparse_matrix = cv.fit_transform(text)
sparse_matrix.todense()

# each row corresponds to one line of the text data

matrix([[0, 0, 0, 1, 1, 1],
        [1, 0, 0, 1, 1, 1],
        [0, 1, 1, 0, 1, 0]], dtype=int64)

In [37]:
# Let's see the unique words (with their index) in the dictionary
cv.vocabulary_

# notice that "a" is missing: the default token_pattern keeps only tokens of two or more characters

{'this': 5, 'is': 3, 'line': 4, 'another': 0, 'completely': 1, 'different': 2}

#### stop_words

In [38]:
# Let's try it again using the "stop_words" argument and see the matrix

cv = CountVectorizer(stop_words="english")
sparse_matrix = cv.fit_transform(text) 
sparse_matrix.todense()

matrix([[0, 0, 1],
        [0, 0, 1],
        [1, 1, 1]], dtype=int64)

In [39]:
cv.vocabulary_

{'line': 2, 'completely': 0, 'different': 1}

**Result:** Only three words remain. The stop-word filter removed "this", "is" and "another".

## TfidfTransformer

TfidfVectorizer works directly on raw text documents, while TfidfTransformer works on an existing count matrix, such as one returned by CountVectorizer.

In [40]:
cv = CountVectorizer()

In [41]:
# First fit and transform CountVectorizer()
counts = cv.fit_transform(text)

In [42]:
counts

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [43]:
tfidf = TfidfTransformer()

In [44]:
# Then, fit and transform TfidfTransformer()
# BOW --> TF-IDF
# Store the result under a new name so the fitted transformer is not overwritten
tfidf_matrix = tfidf.fit_transform(counts)

In [45]:
tfidf_matrix.todense()

matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,
         0.61980538],
        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,
         0.4804584 ],
        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,
         0.        ]])
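
To see where these numbers come from, the sketch below recomputes them in plain Python. It assumes scikit-learn's documented defaults: raw counts as term frequency, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The token lists are written out by hand, with "a" omitted because the default tokenizer drops single-character tokens; `smooth_idf` and `rows` are names introduced here.

```python
import math

# Tokens per line after default tokenization ("a" is dropped)
docs = [["this", "is", "line"],
        ["this", "is", "another", "line"],
        ["completely", "different", "line"]]

vocab = sorted({w for d in docs for w in d})  # alphabetical column order
n_docs = len(docs)

def smooth_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in d for d in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

rows = []
for d in docs:
    # Raw count times IDF for every vocabulary word
    weights = [d.count(t) * smooth_idf(t) for t in vocab]
    # Scale the row to unit Euclidean length
    norm = math.sqrt(sum(w * w for w in weights))
    rows.append([w / norm for w in weights])

for row in rows:
    print([round(w, 8) for w in row])
```

The word "line" appears in every document, so it receives the smallest weight in each row, while "another", "completely" and "different" each appear once and receive the largest.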

#### pipeline
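We can chain both steps in a scikit-learn `Pipeline`, so that a single `fit_transform` call runs the count step and then the tf-idf step.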

In [46]:
from sklearn.pipeline import Pipeline

In [47]:
pipe = Pipeline([('cv',CountVectorizer()),('tfidf',TfidfTransformer())])

In [48]:
results = pipe.fit_transform(text)

In [49]:
results

<3x6 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [50]:
results.todense()

matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,
         0.61980538],
        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,
         0.4804584 ],
        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,
         0.        ]])

## TfidfVectorizer
It performs both of the steps above (counting and tf-idf weighting) in a single step!

In [51]:
tfidf = TfidfVectorizer()

# counting (CountVectorizer) and TF-IDF weighting (TfidfTransformer) in one object

In [52]:
tv_results = tfidf.fit_transform(text)

In [53]:
tv_results.todense()

matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,
         0.61980538],
        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,
         0.4804584 ],
        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,
         0.        ]])