# Text classification with the Naive Bayes algorithm #

**Machine Learning** (ML) is a sub-field of Artificial Intelligence (AI) that automates analytical model building. As the amount of data grows, ML is becoming the de facto standard for data analysis across many research fields. So what does it mean that a machine *learns*? Given enough data, we can offload programming tasks to the algorithm; that is, the algorithm learns structure from the data:

Tom Mitchell's **A well-posed learning problem**: "A computer program is said to learn from experience $E$ with respect to some task $T$ and some performance measure $P$, if its performance on $T$, as measured by $P$, improves with experience $E$."

**Classification** is one of the primary tasks of supervised learning. Given labeled data, a classification algorithm outputs a solution that categorizes new examples (i.e., it associates labels with subsets of the data). While unsupervised learning searches for groups within the data, classification learns to map a data set onto categorical class values or labels (i.e., function approximation).

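The **Naive Bayes** (NB) algorithm is a generative algorithm that is very popular for text classification.
The probability of a document $d$ being in class $c$, $P(c \mid d)$, is computed as:


$$ P(c \mid d) \propto P(c) \prod_{i = 1}^{m}P(t_i \mid c) $$

where $P(c)$ is the prior probability of class $c$, $t_i$ is the $i$-th term of $d$, and $m$ is the number of terms in $d$. The algorithm is "naive" because it assumes that the terms are conditionally independent given the class. The class of a document $d$ is then computed as:

$$c_{\mathrm{MAP}} = \underset{c \in \{c_1, c_2\}}{\operatorname{arg\,max}}\; P(c \mid d)$$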
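The two formulas above can be sketched from scratch in a few lines of Python. This is a minimal illustration, not the implementation used later in the tutorial: the toy documents, the helper names `log_posterior` and `classify`, and the add-one (Laplace) smoothing are assumptions introduced here. Because a product of many small probabilities can underflow, the sketch compares sums of log-probabilities, which leaves the arg max unchanged.

```python
import math
from collections import Counter

# Hypothetical toy training data: tokenized documents with class labels.
train = [
    (["king", "dragon", "giant"], "early"),
    (["dragon", "sword", "king"], "early"),
    (["bishop", "king", "fleet"], "late"),
    (["bishop", "church", "fleet"], "late"),
]

classes = sorted({label for _, label in train})
vocab = {tok for doc, _ in train for tok in doc}

# Document counts per class (for the prior) and term counts per class.
doc_counts = Counter(label for _, label in train)
term_counts = {c: Counter() for c in classes}
for doc, label in train:
    term_counts[label].update(doc)


def log_posterior(doc, c, alpha=1.0):
    """Unnormalized log P(c | d) = log P(c) + sum_i log P(t_i | c)."""
    total = sum(term_counts[c].values())
    score = math.log(doc_counts[c] / len(train))  # log prior
    for tok in doc:
        if tok not in vocab:
            continue  # skip terms never seen in training
        # Add-alpha smoothing (an assumption, not stated in the text)
        # keeps unseen class/term pairs from yielding log(0).
        p = (term_counts[c][tok] + alpha) / (total + alpha * len(vocab))
        score += math.log(p)
    return score


def classify(doc):
    """Return the class with the highest (log) posterior, i.e. c_MAP."""
    return max(classes, key=lambda c: log_posterior(doc, c))


print(classify(["dragon", "giant"]))
print(classify(["bishop", "church"]))
```

With equal priors, the decision here is driven entirely by the smoothed term likelihoods $P(t_i \mid c)$.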


## Research problem ##

The medieval writer Saxo Grammaticus (c. 1160 - post 1208) represents the beginning of the modern-day historian in Scandinavia. Saxo's history of the Danes, *Gesta Danorum* ("Deeds of the Danes"), is the single most important written source for Danish history in the 12<sup>th</sup> century. *Gesta Danorum* is tendentious and contains elements of fiction, and its composition has been a subject of academic debate for more than a century. The more recent debate treats the bipartite composition of *Gesta Danorum* and centers on two related issues: 1) is the transition between the old mythical part and the new historical part located in book eight, nine, or ten; and 2) is this transition gradual (continuous) or sudden (point-like)? In this tutorial we will ask: "Is book nine more similar to the early books (1-8) or the late books (10-16)?" The example uses simple vector space techniques from author chronometry to represent the most salient stylistic and semantic lexical features of *Gesta Danorum*.

import os
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from collections import Counter


data_path = os.path.join("data","saxo_books","saxo_class.csv")

data = pd.read_csv(data_path)

print(data.tail())

print(set(data["book_class"]))

      book book_class  slice_id  \
1263    16       late        33   
1264    16       late        34   
1265    16       late        35   
1266    16       late        36   
1267    16       late        37   

                                                content  
1263  sig dog nøje med æren og trak sig lidt efter l...  
1264  de dødeliges boliger han opgav så sin vrede fo...  
1265  på at betale en kolossal sum bøde og ikke kunn...  
1266  vågnede om morgenen og fik øje på vagterne pri...  
1267  et varsel om det vendiske riges undergang bugi...  
{'late', 'early', 'uncertain'}
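The three values of `book_class` suggest one possible way to set up the classification: train on the slices labeled `early` and `late`, then predict a label for the `uncertain` slices. The sketch below shows that workflow on a tiny hypothetical data frame with the same four columns. The toy sentences, the variable names, and the assumption that book nine carries the `uncertain` label are illustrative only, and the steps the tutorial actually takes may differ.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for saxo_class.csv with the same four columns.
toy = pd.DataFrame({
    "book": [1, 2, 15, 16, 9],
    "book_class": ["early", "early", "late", "late", "uncertain"],
    "slice_id": [1, 1, 1, 1, 1],
    "content": [
        "jætten og dragen kæmpede mod kongen",
        "kongen dræbte dragen med sit sværd",
        "bispen sejlede med flåden mod venderne",
        "venderne flygtede for bispen og flåden",
        "kongen sejlede med flåden mod venderne",
    ],
})

# Labeled slices train the model; "uncertain" slices are held out.
labeled = toy[toy["book_class"].isin(["early", "late"])]
unlabeled = toy[toy["book_class"] == "uncertain"]

# Fit the vocabulary on the training slices only, then reuse it.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(labeled["content"])
X_new = vectorizer.transform(unlabeled["content"])

model = MultinomialNB()  # default alpha=1.0, i.e. add-one smoothing
model.fit(X_train, labeled["book_class"])

predicted = model.predict(X_new)
print(predicted)
```

Fitting the vectorizer on the labeled slices alone keeps the held-out text from influencing the vocabulary the model is trained on.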
