Text Classification
===

![](images/supervised_learning.png)

By The End of This Session You Will:
---
- Know the basics of text classification
- Be able to name the advantages and disadvantages of the bag-of-words model
- Be able to write a Naive Bayes classifier

***
<br>
<br>

---
Text Classification Basics
---

Text classification predicts the category of a given document, chosen from a set of pre-defined categories.

---
Summary:
---

1. **Some examples of text classification tasks:**
  - An email is "Spam" or "Not Spam"
  - The author of a particular document
  - The gender of a document's author
  - If a movie review is positive or negative
  - And many more
  
  <br>
  
2. **The most basic method for text classification is to devise rules that differentiate the classes of documents, for example:**
  - "!" in title would indicate spam
  - Certain usage of words would indicate one author as opposed to another
  - A more factual style of writing would indicate a male author
  - Words such as "suck" and "terrible" would indicate a bad movie review
  
  <br>
  
3. **However, devising rules tailored to each classification task is infeasible due to the human labor required**

  <br>
  
4. **In general, a text classification problem has 3 components:**
   - A document
   - A pre-defined set of classes
   - A training set of documents and their corresponding classes
  
  <br>
  
5. **The output of the text classification is the predicted class of the document in question**
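
To make items 2 and 3 concrete, here is a minimal sketch of a hand-written rule for the movie-review task. The function name `rule_based_sentiment` and the word list are hypothetical, chosen only for illustration. Every new task would need its own hand-built list, and even this one misfires on simple inputs.

```python
# Hypothetical hand-written rule: a review is "negative" if it contains
# any word from a manually chosen list, otherwise "positive".
NEGATIVE_WORDS = {"suck", "sucks", "terrible"}  # illustrative, not exhaustive

def rule_based_sentiment(review):
    # Lowercase and strip basic punctuation so "Terrible!" matches "terrible"
    tokens = [tok.strip(".,!?").lower() for tok in review.split()]
    return "negative" if any(tok in NEGATIVE_WORDS for tok in tokens) else "positive"

print(rule_based_sentiment("The plot was terrible!"))  # negative
print(rule_based_sentiment("A fun and clever film"))   # positive
print(rule_based_sentiment("Not terrible at all"))     # negative (the rule misfires)
```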

***
<br>
<br>

Knowledge Check Questions
---

1) Why is it inadvisable to devise specific rules for each text classification problem?

<details><summary>
Click here for solution to 1.
</summary>
`
Specific rules have to be devised for each classification problem, and that is time-consuming
`
</details>

2) What are the 3 components of a text classification problem? Is it a supervised or unsupervised problem?

<details><summary>
Click here for solution to 2.
</summary>
`
1. A document
2. A pre-defined set of classes
3. A training set of documents and their corresponding classes

Supervised problem since labels are provided to train the model
`
</details>

***
<br>
<br>

---
Bag of Words Representation
---

The Bag of Words Representation converts a document into a vector of numbers, where each number is the count of one vocabulary word in the document

---
Summary
----

1. **The Bag of Words Representation does not take into account the order of the words in the document**

   - Each vocabulary word is counted per document, regardless of where it appears

   <br>

2. **Say we have 2 documents below:**

   1. 6004 is a fun class and I love natural language processing
   2. Katie also like natural language processing

   <br>

3. **The Bag of Words Representation would be as follows:**

|            | 6004 | is | a | fun | class | and | I | love | natural | language | processing | Katie | also | like |
|------------|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| Document 1 | 1    | 1  | 1 | 1   | 1     | 1   | 1 | 1    | 1       | 1        | 1          | 0     | 0    | 0    |
| Document 2 | 0    | 0  | 0 | 0   | 0     | 0   | 0 | 0    | 1       | 1        | 1          | 1     | 1    | 1    |
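
As a rough sketch, the table above can be reproduced with `collections.Counter`. Splitting on whitespace is an assumption made here for simplicity (a real tokenizer would also handle punctuation), and the names `vocab` and `rows` are illustrative.

```python
from collections import Counter

docs = [
    "6004 is a fun class and I love natural language processing",
    "Katie also like natural language processing",
]

# Assumption: tokens are whitespace-separated words
tokenized = [doc.split() for doc in docs]

# Vocabulary in order of first appearance across the corpus
vocab = list(dict.fromkeys(tok for toks in tokenized for tok in toks))

# One row of counts per document, one column per vocabulary word
rows = []
for toks in tokenized:
    counts = Counter(toks)
    rows.append([counts[word] for word in vocab])

print(vocab)
for row in rows:
    print(row)

# Reordering the words of a document leaves its row unchanged
shuffled = Counter("processing language natural like also Katie".split())
print([shuffled[word] for word in vocab] == rows[1])  # True
```

The final check illustrates item 1: a document and any reordering of its words map to the same vector.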

Exercises:
---

1. Write a function that would give the Bag of Words Representation for the following documents

In [1]:
docs = ['System that replaces human intuition with algorithms outperforms human teams',
           'MIT researchers aim to take the human element out of big-data analysis', 
           'We view the Data Science Machine as a natural complement to human intelligence']

In [19]:
from collections import Counter
from textblob import TextBlob
import re

In [20]:
blob = TextBlob(str(docs))
tok = blob.tokenize()

In [31]:
def bag_of_words(lst):
    # Strip the list punctuation ([, ], ' and ,) left over from str(lst);
    # a character class is needed so each character is removed individually
    text = re.sub(r"[\[\]',]", "", str(lst))
    # Tokenize the cleaned text and count each token across all documents
    tokens = TextBlob(text).tokenize()

    return Counter(tokens)

In [32]:
bag_of_words(docs)

Counter({'Data': 1,
         'MIT': 1,
         'Machine': 1,
         'Science': 1,
         'System': 1,
         'We': 1,
         'a': 1,
         'aim': 1,
         'algorithms': 1,
         'analysis': 1,
         'as': 1,
         'big-data': 1,
         'complement': 1,
         'element': 1,
         'human': 4,
         'intelligence': 1,
         'intuition': 1,
         'natural': 1,
         'of': 1,
         'out': 1,
         'outperforms': 1,
         'replaces': 1,
         'researchers': 1,
         'take': 1,
         'teams': 1,
         'that': 1,
         'the': 2,
         'to': 2,
         'view': 1,
         'with': 1})

<details><summary>
Click here for solution to 1.
</summary>
`
docs = ['System that replaces human intuition with algorithms outperforms human teams',
        'MIT researchers aim to take the human element out of big-data analysis', 
        'We view the Data Science Machine as a natural complement to human intelligence']

import numpy as np

def bag_of_words(lst):
    # Tokenize each document on single spaces
    docs_tokens = [doc.split(' ') for doc in lst]
    all_tokens = [token for tokens in docs_tokens for token in tokens]
    # Vocabulary with zero counts; dicts keep first-appearance order (Python 3.7+)
    vocab_dict = dict.fromkeys(all_tokens, 0)
    results = []
    for tokens in docs_tokens:
        counts = vocab_dict.copy()  # fresh zeroed counts for this document
        for token in tokens:
            counts[token] += 1
        # list(...) is needed in Python 3, where values() returns a view
        results.append(list(counts.values()))
    return list(vocab_dict.keys()), np.array(results)

bag_of_words(docs)

(['System',
  'that',
  'replaces',
  'human',
  'intuition',
  'with',
  'algorithms',
  'outperforms',
  'teams',
  'MIT',
  'researchers',
  'aim',
  'to',
  'take',
  'the',
  'element',
  'out',
  'of',
  'big-data',
  'analysis',
  'We',
  'view',
  'Data',
  'Science',
  'Machine',
  'as',
  'a',
  'natural',
  'complement',
  'intelligence'],
 array([[1, 1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1]]))
`
</details>


***
<br>
<br>

Naive Bayes Basics
---

Naive Bayes is a common algorithm that uses the bag of words representation for text classification.

### Summary

1. **Naive Bayes predicts the class of a document based on the probability of observing the class given the document**
   - The probability of observing each possible class given the document in question is computed
   - The prediction is the class with the highest probability given the document
   
   <br>

2. **Naive Bayes is based on the Bayes Theorem:**

  $$p(\text{class }| \text{ doc}) = \frac{p(\text{doc } | \text{ class}) \times p(\text{class})}{p(\text{doc})}$$
  
  <br>

  **where**

  - $p(\text{class }| \text{ doc})$ is the probability of observing a particular class given a document (**Posterior**)
  - $p(\text{doc } | \text{ class})$ is the probability of observing a document given a particular class (**Likelihood**)
  - $p(\text{class})$ is the probability of observing each of the classes (**Prior**)
  - $p(\text{doc})$ is the probability of observing a document (**Evidence**)
  
  <br>
  
3. **For a given document, $p(\text{doc})$ is the same for every class, so we can treat it as a constant**

   Hence we can simplify the above expression: 

   $$p(\text{class }| \text{ doc}) \propto p(\text{doc } | \text{ class}) \times p(\text{class})$$

   Note that after this simplification the result is no longer a true probability, since the values for the different classes need not sum to 1.
   
   However, since $p(\text{doc})$ is constant across classes, the most probable class can still be found by selecting the class with the highest value
   
   <br>

4. **Using the bag of words model, we can express the likelihood $p(\text{doc }| \text{ class})$ as follows:**

  $$p(\text{doc }| \text{ class}) = p(x_1,x_2,... ,x_n \text{ | class})$$
  
  **where** $x_1, x_2, ... , x_n$ are the features in the bag of words model
  
  <br>
  
5. **We are also assuming conditional independence for the likelihood term, where**
  
   - The probability of $x_1$ occurring is independent of $x_2$ occurring given a particular class
   - Hence we are able to express the likelihood as follows:
     
     $$p(x_1,x_2,... ,x_n \text{ | class}) = p(x_1 \text{ | class}) \cdot p(x_2 \text{ | class}) \cdot ... \cdot p(x_n \text{ | class})$$
     
     <br>
     
   - **This assumption rarely holds exactly**, since the occurrence of some words clearly affects that of others within a class (for example, "natural" makes "language" more likely)
   - But this simplification allows us to compute the posterior easily, usually without sacrificing too much predictive power
   
   <br>
     
6. **Hence we are able to express the posterior as follows:**

   $$p(\text{class }| \text{ doc}) \propto p(\text{class}) \times (p(x_1 \text{ | class}) \cdot p(x_2 \text{ | class}) \cdot ... \cdot p(x_n \text{ | class}))$$
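
The chain of simplifications above can be checked on a toy example. The sketch below uses made-up priors and per-word likelihoods (all numbers, and names such as `priors` and `word_probs`, are hypothetical) to compute the unnormalized posterior for each class. It then shows that dividing by $p(\text{doc})$ rescales the scores without changing which class wins.

```python
import math

# Hypothetical model: two classes with made-up probabilities
priors = {"spam": 0.4, "ham": 0.6}
word_probs = {
    "spam": {"free": 0.30, "money": 0.20, "meeting": 0.01},
    "ham":  {"free": 0.05, "money": 0.05, "meeting": 0.20},
}

doc = ["free", "money"]  # the features x_1, ..., x_n

def unnormalized_posterior(cls, tokens):
    # p(class) * p(x_1 | class) * ... * p(x_n | class)
    likelihood = math.prod(word_probs[cls][tok] for tok in tokens)
    return priors[cls] * likelihood

scores = {cls: unnormalized_posterior(cls, doc) for cls in priors}

# Under this model, p(doc) is the sum of the scores over all classes;
# dividing by it turns the scores into probabilities that sum to 1
p_doc = sum(scores.values())
posteriors = {cls: score / p_doc for cls, score in scores.items()}

prediction = max(scores, key=scores.get)
# The winning class is the same with or without the division by p(doc)
assert prediction == max(posteriors, key=posteriors.get)

print(scores)
print(posteriors)
print(prediction)  # spam
```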


----
Questions
----

1) What theorem is Naive Bayes based on?

  $$p(\text{class }| \text{ doc}) = \frac{p(\text{doc } | \text{ class}) \times p(\text{class})}{p(\text{doc})}$$

2) State the term that represents the likelihood and explain in plain English what the likelihood is.

In [None]:
# p(doc | class) is the likelihood: the probability of observing the document given a particular class.

3) State the term that represents the posterior probability and explain in plain English what the posterior is.

In [None]:
# p(class | doc) is the probability of observing a particular class given a document (Posterior).
# The posterior is the probability that the document belongs to a class after we have seen the document.

4) State the term that represents the prior probability and explain in plain English what the prior is.

In [None]:
# p(class) is the probability of observing each of the classes (Prior).
# This is the probability of a class before we look at the document.

5) State the assumptions that Naive Bayes makes when used with the Bag of Words Representation.

In [None]:
# Word order does not matter (bag of words), and the features are independent of each other given a class

<details><summary>
Click here for solution to 1.
</summary>
`
Bayes Theorem
`
</details>

<details><summary>
Click here for solution to 2.
</summary>
`
Likelihood is the probability of observing the data given a class. In the context of text classification, the probability of observing the document (the bag of words) given the class.
`
</details>

<details><summary>
Click here for solution to 3.
</summary>
`
Posterior is the probability of observing a certain class given the data. In the context of text classification, the probability of observing the class of a document given the document.
`
</details>

<details><summary>
Click here for solution to 4.
</summary>
`
Prior is the probability of observing each of the classes regardless of the data. In the context of text classification, the probability of observing the class of a document. Imagine drawing a document randomly and recording the class that it belongs to
`
</details>

<details><summary>
Click here for solution to 5.
</summary>
`
Naive Bayes, used with bag of words, assumes the order of the words does not matter to the prediction. It also assumes conditional independence between the words, such that the occurrences of words do not affect each other in a given class.
`
</details>
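
Putting the pieces together, the following is a minimal sketch of a Naive Bayes text classifier trained on a tiny labeled corpus. It is one possible implementation, not a reference one. The training data and the names `train_naive_bayes` and `predict` are made up for illustration, and tokens are split on whitespace. Add-one (Laplace) smoothing is an assumption not covered in this lesson. It is included so that a word unseen in one class does not force the whole product to zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    # Prior p(class): fraction of training documents in each class
    class_counts = Counter(labels)
    priors = {c: n / len(labels) for c, n in class_counts.items()}

    # Word counts per class, accumulated from each document's bag of words
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())

    vocab = {w for counts in word_counts.values() for w in counts}
    return priors, word_counts, vocab

def predict(doc, priors, word_counts, vocab):
    scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = prior
        for tok in doc.lower().split():
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Assumption: add-one smoothing for p(word | class)
            score *= (word_counts[c][tok] + 1) / (total + len(vocab))
        scores[c] = score  # proportional to p(class | doc)
    return max(scores, key=scores.get), scores

# Hypothetical training set: documents and their classes
train_docs = [
    "a terrible boring movie",
    "terrible acting and a boring plot",
    "a fun and clever movie",
    "fun plot and great acting",
]
train_labels = ["negative", "negative", "positive", "positive"]

model = train_naive_bayes(train_docs, train_labels)
label, scores = predict("boring and terrible", *model)
print(label)  # negative
```

The three inputs match the components of a text classification problem: the document passed to `predict`, the set of classes found in `train_labels`, and the labeled training set.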

----
Summary
----

__Text classification__:
- Text classification is the task of predicting the category of a given document
- It requires a document, a pre-defined set of classes and a set of documents that are labeled with the pre-defined classes
- Text classification is a supervised machine learning problem 

  <br>
__Bag of Words__:
- Bag of Words is a basic way of featurizing / vectorizing a document into a vector of numbers (and a corpus into a matrix)
- The count of each vocabulary word in the corpus (collection of documents) is tallied for each of the documents
- The information about the order of the words is lost in the Bag of Words Representation

  <br>
__Naive Bayes__:
- Naive Bayes is a common machine learning algorithm used in text classification
- Naive Bayes is based on the __Bayes Theorem__
- __Posterior probability__ is calculated for each class given a document
- __Posterior probability__ is the probability of observing a class given the document
- The class with the highest __Posterior probability__ is chosen as the predicted class of the document
- __Likelihood__ and __Prior__ are used to calculate the **Posterior**
- __Likelihood__ is the probability of observing the data given a class 
- __Prior__ is the probability of observing a class 

<br>
<br>
<br>

---