# Naive Bayes
Naive Bayes is a classical ML algorithm used for text analytics and general classification.
The implementation below uses the Gaussian variant of Naive Bayes.

## The Algorithm
The algorithm is statistical: it relies on the prior and posterior probabilities of the classes in the dataset.

Its main idea is Bayes' Theorem:

$$P(A | B)P(B) = P(A \cap B) = P(B \cap A) = P(B | A)P(A)$$

$$P(A | B) = \frac{P(B | A)P(A)}{P(B)}$$
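As a small worked example of the formula above, the sketch below computes a posterior from a prior and two likelihoods. The event names and all the numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# A = "the document belongs to class sci.space"
# B = "the document contains the word 'orbit'"
# All probabilities below are made-up illustration values.
p_a = 0.3              # P(A): prior probability of the class
p_b_given_a = 0.8      # P(B | A): likelihood of the word inside the class
p_b_given_not_a = 0.1  # P(B | not A): likelihood of the word elsewhere

# Law of total probability gives the evidence P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' Theorem: P(A | B) = P(B | A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print('P(A | B) = %.3f' % p_a_given_b)  # prints P(A | B) = 0.774
```

Seeing the word raises the probability of the class from the prior of 0.3 to roughly 0.77.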


Using the theorem, we can now ask: what is the probability of a class given that a specific document was observed?

$$P(Class \mid Document\{w_1, w_2, w_3, \ldots, w_n\}) = \frac{P(Document \mid Class)P(Class)}{P(Document)}$$

The equation above describes the full Bayes formulation, but some of its probabilities are impractical to estimate. One example is $P(Document)$: if a document has never been seen in the dataset, its estimated $P(Document)$ is going to be $0$.

**The term Naive comes from assuming that the variables are independent of each other when they may not be**

Using this independence assumption, Bayes' Theorem can then be computed as shown below:
<div>
<img src="images/BayesSimple.png" width="600"/>
</div>
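
In case the figure does not render, the textbook form of the naive factorization is written out below. It is assumed, not confirmed, to match the image; $w_i$ is the $i$-th feature of the document, and $\mu_{i,Class}$ and $\sigma_{i,Class}^2$ are the per-class mean and variance of that feature (symbols introduced here for illustration).

$$P(Document \mid Class) = \prod_{i=1}^{n} P(w_i \mid Class)$$

Since $P(Document)$ is the same for every class, it can be dropped when only comparing classes:

$$P(Class \mid Document) \propto P(Class)\prod_{i=1}^{n} P(w_i \mid Class)$$

In the Gaussian variant, each factor is modeled by a normal density:

$$P(w_i \mid Class) = \frac{1}{\sqrt{2\pi\sigma_{i,Class}^2}}\exp\left(-\frac{(w_i - \mu_{i,Class})^2}{2\sigma_{i,Class}^2}\right)$$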

## Steps
In this notebook I work with the 20 Newsgroups dataset. The steps to classify it are: pre-processing the text data, training the Gaussian Naive Bayes classifier, and testing it on unseen data.
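
To make the train and test steps concrete, here is a minimal pure-Python sketch of a Gaussian Naive Bayes classifier on toy numeric data. It is a stand-in under stated assumptions, not the notebook's own classifier: the function names, the `eps` variance floor, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate the prior, feature means and feature variances of each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps (an assumed small constant) keeps variances strictly positive
        variances = [
            sum((v - m) ** 2 for v in col) / n + eps
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior for one sample."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        # log P(Class) + sum of log Gaussian densities (independence assumption)
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy two-feature data, invented for illustration only
X_train = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [4.0, 5.1], [4.2, 4.9], [3.9, 5.0]]
y_train = ['a', 'a', 'a', 'b', 'b', 'b']

model = fit_gaussian_nb(X_train, y_train)
print(predict_gaussian_nb(model, [1.1, 2.0]))  # prints a
print(predict_gaussian_nb(model, [4.1, 5.0]))  # prints b
```

Working in log space avoids numerical underflow when many per-feature probabilities are multiplied, which matters for text data with thousands of features.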

## Pre-processing
* converting all letters to lower or upper case
* converting numbers into words or removing numbers
* removing punctuation, accent marks and other diacritics
* removing white spaces
* expanding abbreviations
* removing stop words, sparse terms, and particular words
* text canonicalization
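
Two of the steps listed above, removing accent marks and removing stop words, can be sketched with the standard library alone. The helper names and the tiny stop-word set are hypothetical; a real stop-word list would be much longer.

```python
import re
import unicodedata

# Tiny hypothetical stop-word list, for illustration only
STOP_WORDS = {'the', 'a', 'an', 'is', 'of', 'to', 'and', 'in'}


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_stop_words(text):
    """Lower-case, strip accents, tokenize on letters, and drop stop words."""
    tokens = re.findall(r'[a-z]+', strip_accents(text).lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(remove_stop_words('The doors of the Bricklin were really small.'))
# prints ['doors', 'bricklin', 'were', 'really', 'small']
```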

In [2]:
# Loading the Data using sklearn
from sklearn.datasets import fetch_20newsgroups
train = fetch_20newsgroups(subset='train', shuffle=True)
test = fetch_20newsgroups(subset='test', shuffle=True)

# train.data: holds the text data
# train.target: holds the class ids
# train.target_names: holds the class names as strings
# train.filenames: holds the file names

In [3]:
for t in train.target[:10]:
    print('Class sample [%s]' % train.target_names[t])

print('\n# of docs in the train set %d\n# of docs in the test set %d\n' %
      (len(train.target), len(test.target)))
print('Example of data(unprocessed) of the class [%s]:\n\n[%s]' %
      (train.target_names[train.target[0]], train.data[0]))

Class sample [rec.autos]
Class sample [comp.sys.mac.hardware]
Class sample [comp.sys.mac.hardware]
Class sample [comp.graphics]
Class sample [sci.space]
Class sample [talk.politics.guns]
Class sample [sci.med]
Class sample [comp.sys.ibm.pc.hardware]
Class sample [comp.os.ms-windows.misc]
Class sample [comp.sys.mac.hardware]

# of docs in the train set 11314
# of docs in the test set 7532

Example of data(unprocessed) of the class [rec.autos]:

[From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, whe

* converting all letters to lower or upper case
* converting numbers into words or removing numbers
* removing emails
* removing punctuation, accent marks and other diacritics
* removing white spaces
* expanding abbreviations
* removing stop words, sparse terms, and particular words
* text canonicalization

In [40]:
import re

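# Raw string so the backslashes reach the regex engine unchanged.
# '@' is not in this set, so email_re can still match addresses afterwards.
punctuation = r"""\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?[\]\^\_\`\{\|\}~"""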

# Defining cleaning regexes

number_re = re.compile(r'(\d+)', re.I | re.M | re.U)
punkt_re = re.compile(r'([%s])' % punctuation, re.I | re.M | re.U)
email_re = re.compile(r'(\w+\@\w+)', re.I | re.M | re.U)

whitespaces_re = re.compile(r'(\s)', re.I | re.M | re.U)


def preprocess(doc):
    if not isinstance(doc, str):
        raise TypeError('Doc is not text data')

    # Making a copy of the original document
    _doc = doc

    # Stripping
    _doc = _doc.strip()

    # lower case
    _doc = _doc.lower()

    # removing numbers
    _doc = number_re.sub('', _doc)

    # removing punkt before email so the email regex is simpler
    _doc = punkt_re.sub('', _doc)

    # removing emails
    _doc = email_re.sub('', _doc)

    # replacing each whitespace character (space, tab, newline) with a single space;
    # runs of whitespace are not collapsed by this pattern
    _doc = whitespaces_re.sub(' ', _doc)

    return (doc, _doc)

In [41]:
print(preprocess(train.data[123])[1])

from  subject john  paraphrased lines   at the end of a recent mon  apr  post alastair thomson offers the following paraphrase of john      god loved the world so much that he gave us his son    to die in our place so that we may have eternal life  the to die in our place bothers me since it inserts into the verse a doctrine not found in the original moreover i suspect that the poster intends to affirm not merely substitution but forensic or penal substitution  i maintain that the scriptures in speaking of the atonement teach a doctrine of substitution but not one of forensic substitution  those interested in pursuing the matter are invited to send for my essays on genesis either  thru  on this question or  through  with leadin  the nth essay can be obtained by sending to  or to  the message    get genn ruff   yours  james kiefer   any theologian worth his salt can put anything he wants to say in the form of a commentary on the book of genesis  walter kaufman
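
The output above still contains runs of spaces, because the `(\s)` pattern replaces whitespace characters one at a time. If single spaces are wanted, one possible variant is to match whole runs with `\s+`. The names `collapse_ws_re` and `collapse_whitespace` below are hypothetical and not part of the notebook.

```python
import re

# Matches one or more consecutive whitespace characters
collapse_ws_re = re.compile(r'\s+')


def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim the ends."""
    return collapse_ws_re.sub(' ', text).strip()


sample = 'from  subject john  paraphrased lines   at the end'
print(collapse_whitespace(sample))
# prints from subject john paraphrased lines at the end
```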
