In [1]:
from Naive import NaiveBayes

# Naive Bayes -- Class to Calculate Single Phrase Posterior Probability 

This class calculates the posterior probability that a single phrase has a given property (in this notebook, being a time phrase), so that the phrase can be classified. We can either write our own likelihood definitions for each feature or load my pre-defined `json` file.

In this notebook, we use the pre-defined `json` file for practice.

Resources: 
- [Naive Bayes Probabilistic Model (wiki)](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Probabilistic_model)
- [Naive Bayes for Text Classification (Andrew Ng's ML course).mp4](http://openclassroom.stanford.edu/MainFolder/courses/MachineLearning/videos/06.2-NaiveBayes-TextClassification.mp4)
- [Naive Bayes for Spell Checker Python Code (Peter Norvig)](https://norvig.com/spell-correct.html)

## Loading Class

I load the likelihood table from the `definition_time.json` file and set the prior probability of being a time phrase to 0.3.

In [2]:
from Naive import NaiveBayes

naive = NaiveBayes(filename='definition_time.json', prior=0.3)

In [3]:
# definition table attribute
naive.definition_dict.keys()

dict_keys(['data', 'columns'])

In [4]:
# you can check the columns inside it
naive.definition_dict['columns']

{'name': '(str) use few chars to represent the feature name',
 'iterable': '(str, or any iterable) any iterable can be iterated',
 'likelihood': '(float) the independent prob of the feature'}
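
To write our own definition instead of loading the pre-defined file, a dict with the same two top-level keys can be built and serialized. This is a minimal sketch that only assumes the layout shown above (`columns` and `data`); the single entry is the 數字 feature printed later in this notebook, and whether `NaiveBayes` requires anything beyond these keys is not confirmed here.

```python
import json

# Layout assumed from the keys and column descriptions printed above.
definition = {
    'columns': {
        'name': '(str) use few chars to represent the feature name',
        'iterable': '(str, or any iterable) any iterable can be iterated',
        'likelihood': '(float) the independent prob of the feature',
    },
    'data': [
        {'name': '數字', 'iterable': '是元正𨳝一二三四五六七八九十廿卅', 'likelihood': 0.6},
    ],
}

# ensure_ascii=False keeps the Chinese characters readable in the output text.
text = json.dumps(definition, ensure_ascii=False, indent=2)

# Round-trip to confirm the structure survives serialization.
restored = json.loads(text)
print(restored['data'][0]['name'])
```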

## Iterables and Likelihoods

An iterable is the bag of characters (or bag of words) that you believe belongs to a certain type of feature.

Take the feature 數字 (numerals) as an example:

In [5]:
naive.definition_dict['data'][-4]

{'name': '數字', 'iterable': '是元正𨳝一二三四五六七八九十廿卅', 'likelihood': 0.6}

This means that we treat the bag of characters `是元正𨳝一二三四五六七八九十廿卅` as numerals, and that we assume a time phrase contains these characters with a likelihood of 0.6.
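
Because an iterable can be either a string or a list, iterating over it yields single characters in the first case and whole words in the second. The sketch below illustrates that difference; `count_matches` is a hypothetical helper, and the way `NaiveBayes` actually matches phrases against iterables is an assumption, not something this notebook shows.

```python
def count_matches(phrase, iterable):
    """Count how often the items of `iterable` occur in `phrase`.

    Hypothetical helper: iterating a str yields characters,
    iterating a list yields whole words.
    """
    return sum(phrase.count(item) for item in iterable)

numbers = '是元正𨳝一二三四五六七八九十廿卅'   # bag of characters
modern_eras = ['平成', '昭和']                 # bag of words

print(count_matches('平成三十年', numbers))      # 2 (三 and 十)
print(count_matches('平成三十年', modern_eras))  # 1 (平成)
```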

## Calculate Posterior Probability

Once we have the prior and the likelihoods from the `json` file, we can calculate the posterior probability of a given phrase.

In [6]:
# simply use calc_posterior method
naive.calc_posterior('興寧三年')

0.8571428571428572
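
The value above is consistent with the usual Naive Bayes update written in odds form: the prior odds are multiplied by `p / (1 - p)` for every matched feature. The following is a minimal sketch, not the code of the `Naive` module. The function name `posterior_from_likelihoods` is hypothetical, and the likelihoods 0.8 (for 興寧 as an era name) and 0.7 (for 年) are assumed values chosen because they reproduce the output; only the 0.6 for 數字 is shown in this notebook.

```python
def posterior_from_likelihoods(prior, likelihoods):
    """Combine a prior with independent feature likelihoods in odds form."""
    odds = prior / (1.0 - prior)
    for p in likelihoods:
        # Each matched feature scales the odds by its likelihood ratio.
        odds *= p / (1.0 - p)
    return odds / (1.0 + odds)

# 興寧: assumed 0.8, 三: 0.6 (the 數字 feature), 年: assumed 0.7
print(posterior_from_likelihoods(0.3, [0.8, 0.6, 0.7]))  # approximately 0.857
```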

## Regularize the Irrelevant Characters

The `naive.calc_posterior` method only considers the characters that match your given iterables. If you want to penalize the irrelevant characters, pass the `regularize` argument to set the penalty probability for them.

In [7]:
# If we don't set the regularization,
# these two phrases have the same posterior.
naive.calc_posterior('興寧三年'), naive.calc_posterior('興寧三年你好嗎')

(0.8571428571428572, 0.8571428571428572)

In [8]:
# setting regularize=0.4 drags down the posterior of the second phrase
naive.calc_posterior('興寧三年', regularize=0.4), naive.calc_posterior('興寧三年你好嗎', regularize=0.4)

(0.8571428571428572, 0.64)
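
The drop from 0.857 to 0.64 is what we would get if each of the three unmatched characters (你, 好, 嗎) were treated as one more feature whose likelihood equals `regularize`. That reading is inferred from the outputs rather than taken from the class itself, so the sketch below is only an illustration, and `regularized_posterior` is a hypothetical name.

```python
def regularized_posterior(posterior, n_irrelevant, regularize):
    """Penalize a posterior once per irrelevant character, in odds form."""
    odds = posterior / (1.0 - posterior)
    # Assumption: every unmatched character scales the odds by r / (1 - r).
    odds *= (regularize / (1.0 - regularize)) ** n_irrelevant
    return odds / (1.0 + odds)

# Start from the unregularized posterior of '興寧三年' and add 3 irrelevant chars.
print(regularized_posterior(0.8571428571428572, 3, 0.4))  # approximately 0.64
```

With `regularize=0.5` the factor is 1, so irrelevant characters would leave the posterior unchanged under this reading.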

## Add Your Own Definition

We can modify the definition dict to add new likelihoods and iterables.

In [9]:
# before adding 平成
naive.calc_posterior("平成三十年", regularize=0.3)

0.2924187725631769

In [10]:
# add a new definition (a likelihood and a bag of words) as a dict
new_definition = {
    'name': '現代紀年', 
    'iterable': ['中華民國', '民國', '平成', '昭和', '西元', '西曆'], 
    'likelihood': 0.8
}

naive.definition_dict['data'].append(
    new_definition
)

In [11]:
# save to json for later use
naive.to_json('definition_time_modern.json')

In [12]:
# loading again
naive = NaiveBayes(filename="definition_time_modern.json", prior=0.3)

naive.calc_posterior("平成三十年", regularize=0.3)

0.9
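
The two results for 平成三十年 can be checked by hand in odds form. This is a worked sketch under two assumptions that are not shown in this notebook: 年 matches a feature with likelihood 0.7, and each unmatched character multiplies the odds by $r/(1-r)$ with $r = 0.3$. The prior odds are $0.3/0.7 = 3/7$, and 三 and 十 each contribute $0.6/0.4 = 3/2$.

Before adding 現代紀年, 平 and 成 are unmatched and are each penalized by $3/7$:

$$
\text{odds} = \frac{3}{7} \cdot \left(\frac{3}{7}\right)^{2} \cdot \left(\frac{3}{2}\right)^{2} \cdot \frac{7}{3} = \frac{81}{196},
\qquad
P = \frac{81}{81 + 196} = \frac{81}{277} \approx 0.2924
$$

After adding it, 平成 matches the new feature with likelihood 0.8, which contributes $0.8/0.2 = 4$:

$$
\text{odds} = \frac{3}{7} \cdot 4 \cdot \left(\frac{3}{2}\right)^{2} \cdot \frac{7}{3} = 9,
\qquad
P = \frac{9}{1 + 9} = 0.9
$$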

## Interesting Philosophical Article about Thomas Bayes

- [Thomas Bayes and the crisis in science](https://www.the-tls.co.uk/articles/public/thomas-bayes-science-crisis/) by David Papineau: this article discusses the life of Thomas Bayes and why _inverse probability_ is interesting. It also discusses the overuse of hypothesis testing and the neglect of _prior probability_.