Naive Bayes (NB) for text classification
---

![](http://img.scoop.it/_VP0qLV-jYbp2cFF4H4k4jl72eJkfbmt4t8yenImKBVvK0kTmF0xjctABnaLJIm9)

---
By the end of this session, you should be able to:
---

- Apply Naive Bayes formula for text classification
- Calculate probabilities for words that do not appear during training
- Evaluate a Naive Bayes classifier (or any supervised algorithm)
- Describe machine learning in plain English

Check for understanding
----

What is classification?

Predicting which of a discrete set of groups an observation belongs to.

__Puppy or Lion__
![](images/prediction.png)

3 things about ML
-----

1. Feature = numeric representation of raw data
2. Model = mathematical “summary” of features
3. Making something that works = choose the right model and features, given data and task

Source: Understanding Feature Space in Machine Learning from Alice Zheng

Feature = numeric representation of raw data
----

![](images/features.png)

![](images/sparse.png)
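
As a rough illustration of turning raw text into numeric features, here is a minimal bag-of-words sketch in plain Python. The toy documents and the `bag_of_words` helper are hypothetical and invented for this example; they are not taken from the figures above.

```python
from collections import Counter

# Hypothetical toy documents, invented for illustration.
docs = ["the puppy is cute", "the lion is not cute", "the lion ate the puppy"]

# Vocabulary: every distinct word, sorted so the column order is stable.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocab):
    """Return a list of word counts, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

# One numeric vector per document; most entries are zero (sparse).
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
for vec in vectors:
    print(vec)
```

Each document becomes one row of counts, and word order is discarded. With a realistic vocabulary most entries are zero, which is why such matrices are usually stored in a sparse format.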

Visualizing Feature Space
------

> Crudely speaking, mathematicians fall into two categories: the __algebraists__, who find it easiest to reduce all problems to sets of numbers and variables, and the __geometers__, who understand the world through shapes. 
 
>\- Masha Gessen, “Perfect Rigor”

![](images/bag.png)

![](images/dim.png)

Model = mathematical “summary” of features
-----



What is a summary?

- Data: point cloud in feature space
- Model: a geometric shape that best “fits” the point cloud

![](images/class.png)

When does bag-of-words fail?
------

![](images/fail.png)

Improving on bag-of-words
-----

![](images/tf-idf.png)

![](images/flat.png)
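
The figures are not reproduced in text, so the following is only a sketch of one common tf-idf variant: term frequency times `log(N / df)`. The exact weighting on the slide may differ, and the toy documents and `tf_idf` helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy documents, invented for illustration.
docs = ["the puppy is cute", "the lion is not cute", "the lion ate the puppy"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each word.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Map each word in one document to tf * log(N / df) (assumed variant)."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

weights = [tf_idf(tokens) for tokens in tokenized]

for w in weights:
    print(w)
```

Under this variant, a word that occurs in every document (here "the") gets weight zero, while rarer words are weighted up.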

MACHINE LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION
====

[Source](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

Representation
-----

> A classifier must be represented in some formal language that the computer can handle. 

>Conversely, choosing a representation for a learner is tantamount to choosing the set of classifiers that it can
possibly learn. This set is called the hypothesis space of the learner. If a classifier is not in the hypothesis
space, it cannot be learned. 

> A related question ... is how to represent the input, i.e., what features to use.

Evaluation
----

> An evaluation function (also called objective function or scoring function) is needed to distinguish
good classifiers from bad ones. 

> The evaluation function used internally by the algorithm may differ from the external one that we want the classifier to optimize.

Optimization
----

> We need a method to search among the classifiers in the language for the highest-scoring one. 

> The choice of optimization technique is key to the efficiency of the learner, and also helps determine the
classifier produced if the evaluation function has more than one optimum.

Naive Bayes for text classification
----

- Representation:
    - Data is a bag-of-words (counts or tf-idf)
    - Classifier is a hyperplane 


- Evaluation is most often a confusion matrix of predicted versus observed classes.


- Optimization: estimate the class probabilities (priors) and the conditional probabilities of each word given each class from the training data, then use those estimates to classify new observations. For each test observation, predict the most probable class.

![](http://briank.im/content/images/2015/05/Bucky.jpg)

In Naive Bayes (NB), features are assumed to be conditionally independent given the class. In graphical-model terms, each feature has the class as its only parent. 

Despite this assumption, NB's efficiency has led to its widespread use in real-world applications, including medical diagnosis, fraud detection, email filtering, and A/B testing for webpages.

----
Steps in Naive Bayes Classification
-------
1. Get labeled data
1. Preprocess
1. Apply Multinomial Naive Bayes
    1. Calculate document class priors
    1. Calculate conditional probabilities of each word for each class
    1. Calculate the proportional probabilities for each class of a new document
    1. Pick the winning class
1. Evaluation
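
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the training data is invented, `preprocess`, `fit`, and `predict` are hypothetical helper names, add-1 smoothing is applied by default, and the evaluation step is left out.

```python
import math
from collections import Counter, defaultdict

# Step 1: hypothetical labeled data, invented for illustration.
train = [
    ("love kiss wedding laugh", "RomCom"),
    ("kiss laugh date love", "RomCom"),
    ("war death betrayal", "Drama"),
    ("death court betrayal tears", "Drama"),
]

def preprocess(text):
    """Step 2: lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def fit(examples, alpha=1.0):
    """Steps 3.1 and 3.2: log priors and add-alpha smoothed log likelihoods."""
    class_docs = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(preprocess(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(examples)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood, vocab

def predict(text, model):
    """Steps 3.3 and 3.4: score each class in log space and pick the winner."""
    log_prior, log_likelihood, vocab = model
    scores = {}
    for c in log_prior:
        # Assumption: words outside the training vocabulary are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in preprocess(text) if w in vocab
        )
    return max(scores, key=scores.get), scores

model = fit(train)
label, scores = predict("love and betrayal at the wedding", model)
print(label, scores)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied. The scores are only proportional to the posteriors, which is enough to pick the winning class.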

How does Bayes' Rule apply to document classification?
---

![](images/bayes_rule_for_docs.png)
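
The figure is not reproduced in text. As a sketch, Bayes' rule for a document $d$ and a class $c$ is usually written as follows, where $c$, $d$, $w_i$ (the $i$-th word of the document), and $n$ (the document length) are notation assumed here rather than taken from the slide:

$$
P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)} \;\propto\; P(c)\prod_{i=1}^{n} P(w_i \mid c)
$$

The product form follows from the naive assumption that words are conditionally independent given the class. $P(d)$ is the same for every class, so it can be dropped when comparing classes. In practice the comparison is usually done in log space:

$$
\hat{c} = \arg\max_{c}\Big[\log P(c) + \sum_{i=1}^{n}\log P(w_i \mid c)\Big]
$$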

---
What is the MLE for a word that's never been seen before (without smoothing)?
---

![](images/mle_wo_smoothing.png)
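
A small numeric sketch of the problem, assuming the usual maximum likelihood estimate `count(word, class) / total words in class`. The counts and the `mle` helper are hypothetical.

```python
from collections import Counter

# Hypothetical word counts for one class, invented for illustration.
romcom_counts = Counter({"love": 3, "kiss": 2, "laugh": 1})
total = sum(romcom_counts.values())

def mle(word):
    """Unsmoothed maximum likelihood estimate of P(word | class)."""
    return romcom_counts[word] / total

# "explosion" never appeared with this class during training.
doc = ["love", "kiss", "explosion"]
likelihood = 1.0
for word in doc:
    likelihood *= mle(word)

print(mle("explosion"), likelihood)
```

A single unseen word has an estimated probability of zero, and that zero wipes out the whole product, no matter how strongly the other words point to the class.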

---
Smoothing for Naive Bayes
---

![](images/mle_w_smoothing.png)

Start with the simplest smoothing: add-1, also called Laplace smoothing.
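
A minimal sketch of add-1 smoothing, assuming the common form `(count + 1) / (total + |V|)`, where `|V|` is the vocabulary size. The counts, the vocabulary, and the `smoothed` helper are hypothetical.

```python
from collections import Counter

# Hypothetical word counts for one class, invented for illustration.
romcom_counts = Counter({"love": 3, "kiss": 2, "laugh": 1})
# The vocabulary covers all classes, so it includes words unseen in this class.
vocab = {"love", "kiss", "laugh", "explosion"}

def smoothed(word, counts, vocab, alpha=1):
    """Add-alpha estimate of P(word | class); alpha=1 is add-1 (Laplace)."""
    return (counts[word] + alpha) / (sum(counts.values()) + alpha * len(vocab))

p_unseen = smoothed("explosion", romcom_counts, vocab)
p_love = smoothed("love", romcom_counts, vocab)

print(p_unseen, p_love)
```

The unseen word now gets a small non-zero probability, and the estimates still sum to one over the vocabulary because the denominator grows by one per vocabulary word.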
 

Confusion Matrix
---

![](http://i.stack.imgur.com/ysM0Z.png)

Let's classify movies as "RomCom" or not...


| Actual ↓ / Predicted → | RomCom | Not |  
| :--: |:-------:|:------:|
| RomCom | 70 | 50 |
| Not RomCom| 30 | 160 |  
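
The table above can be reduced to a few summary numbers. The sketch below computes accuracy, precision, and recall with "RomCom" as the positive class. These metrics are standard, but they are not named on the slide, so treat the choice as an assumption.

```python
# Rows are actual classes, columns are predicted classes (from the table above).
confusion = {
    "RomCom": {"RomCom": 70, "Not": 50},
    "Not": {"RomCom": 30, "Not": 160},
}

tp = confusion["RomCom"]["RomCom"]  # actual RomCom, predicted RomCom
fn = confusion["RomCom"]["Not"]     # actual RomCom, predicted Not
fp = confusion["Not"]["RomCom"]     # actual Not, predicted RomCom
tn = confusion["Not"]["Not"]        # actual Not, predicted Not

accuracy = (tp + tn) / (tp + fn + fp + tn)  # share of all movies classified correctly
precision = tp / (tp + fp)                  # of predicted RomComs, share that are RomComs
recall = tp / (tp + fn)                     # of actual RomComs, share that were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Accuracy alone hides the fact that 50 of the 120 actual RomComs were missed, which recall makes visible.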



---
Extension beyond 2 groups
---

| Actual↓ / Predicted→ | RomCom | Drama | Comedy |
| :--: |:-------:|:------:|:------:|
| RomCom | 70 | 0 | 50 |
| Drama  | 5 | 100 |  10 |
| Comedy | 25 | 0 |  50 |


The classifier misclassifies RomCom movies as Comedy (50 times) far more often than as Drama (0 times).

Always inspect the confusion matrix when analyzing your results: it gives strong clues about where your classifier is going wrong.
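
One way to read the three-class matrix programmatically is to compute per-class recall and rank the off-diagonal cells. This sketch uses the counts from the table above; the variable names are hypothetical.

```python
labels = ["RomCom", "Drama", "Comedy"]
# Rows are actual classes, columns are predicted classes (from the table above).
matrix = [
    [70, 0, 50],
    [5, 100, 10],
    [25, 0, 50],
]
k = len(labels)

# Per-class recall: diagonal cell divided by its row total.
recall = {labels[i]: matrix[i][i] / sum(matrix[i]) for i in range(k)}

# Off-diagonal cells are errors; sort them from most to least frequent.
errors = sorted(
    ((matrix[i][j], labels[i], labels[j]) for i in range(k) for j in range(k) if i != j),
    reverse=True,
)

print(recall)
for count, actual, predicted in errors[:3]:
    print(f"{count} movies were actually {actual} but predicted {predicted}")
```

The largest error cell is RomCom predicted as Comedy, followed by Comedy predicted as RomCom, so the classifier mostly struggles to tell those two classes apart.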

Check for understanding
----

How does a confusion matrix scale with the number of groups? For example, if there are 10 categories, what is the size of the matrix?

| # of Groups | # of Cells  |  
|:-------:|:------:|
| 2 | 4 |
| 3 | 9 |  
| 4 | 16 |  
| ... | ... |  
| 10 | 100 |

For k groups, the matrix has k × k = k^2 cells.

Where does NB NOT work?
---

![](images/anti_bayes.jpg)

These groups are separable, but an NB classifier will perform at chance.

[Source](https://www.youtube.com/watch?v=feBKiAdhYkc)
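
The figure is not reproduced in text. As an assumption, the sketch below uses XOR-style data, a classic case where the classes are perfectly separable from the two features taken together, yet each feature on its own carries no information about the class. The data and the `posterior_scores` helper are hypothetical.

```python
from collections import Counter, defaultdict

# XOR-style data (assumed example): the class is 1 exactly when the two
# binary features differ.
data = [((0, 0), 0), ((1, 1), 0), ((0, 1), 1), ((1, 0), 1)]

class_counts = Counter(label for _, label in data)
# feature_counts[label][feature_index][value] = number of training examples
feature_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in data:
    for i, value in enumerate(features):
        feature_counts[label][i][value] += 1

def posterior_scores(features):
    """Unnormalized NB posterior P(c) * prod_i P(x_i | c) for each class."""
    scores = {}
    for label, n in class_counts.items():
        score = n / len(data)
        for i, value in enumerate(features):
            score *= feature_counts[label][i][value] / n
        scores[label] = score
    return scores

all_scores = {features: posterior_scores(features) for features, _ in data}
for features, scores in all_scores.items():
    print(features, scores)
```

Every point gets the same score for both classes, so the classifier can only guess. The signal lives entirely in the interaction between the features, which is exactly what the conditional independence assumption discards.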

---
Summary
---

- Machine learning helps computers learn from data by choosing representations of the data and of the model.
- The goal is to learn these representations effectively and efficiently.
- Naive Bayes is among the simplest probabilistic classifiers. 
- It performs surprisingly well in many real-world applications, despite the strong assumption that all features are conditionally independent given the class. 
- "Smoothing" word counts allows an NB classifier to handle words it has not seen during training.
- A confusion matrix is a way to visualize where our classifier's errors occur.

<br>
<br>
---