The [scikit-learn](https://scikit-learn.org/stable/modules/naive_bayes.html) library supports a number of Naive Bayes classifiers:

- Gaussian: It assumes that continuous features follow a normal distribution.
- Multinomial: It is useful if your features are discrete counts, such as word counts.
- Bernoulli: It is useful if your features are binary.
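
For reference, the three variants share the same decision rule and differ only in how the per-feature likelihood is modeled. The notation below is standard textbook notation introduced here for illustration (the symbols are not scikit-learn attribute names): $c$ is a class, $x_i$ is the value of feature $i$, and $\mu_{c,i}$, $\sigma_{c,i}$, $p_{c,i}$, $\theta_{c,i}$ are per-class parameters estimated from the training data.

$$\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)$$

$$\text{Gaussian:} \quad P(x_i \mid c) = \frac{1}{\sqrt{2\pi\sigma_{c,i}^{2}}} \exp\left(-\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}}\right)$$

$$\text{Multinomial:} \quad P(x_1, \dots, x_n \mid c) \propto \prod_{i=1}^{n} \theta_{c,i}^{\,x_i}$$

$$\text{Bernoulli:} \quad P(x_i \mid c) = p_{c,i}^{\,x_i} \, (1 - p_{c,i})^{1 - x_i}$$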

### Titanic Survival Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

# Importing dataset
data = pd.read_csv("data/titanic_2.csv")

In [2]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
# Convert categorical variable to numeric
data["Sex_cleaned"]=np.where(data["Sex"]=="male",0,1)
data["Embarked_cleaned"]=np.where(data["Embarked"]=="S",0,
                                  np.where(data["Embarked"]=="C",1,
                                           np.where(data["Embarked"]=="Q",2,3)
                                          )
                                 )
# Cleaning dataset of NaN
data=data[[
    "Survived",
    "Pclass",
    "Sex_cleaned",
    "Age",
    "SibSp",
    "Parch",
    "Fare",
    "Embarked_cleaned"
]].dropna(axis=0, how='any')

In [4]:
data.head()

Unnamed: 0,Survived,Pclass,Sex_cleaned,Age,SibSp,Parch,Fare,Embarked_cleaned
0,0,3,0,22.0,1,0,7.25,0
1,1,1,1,38.0,1,0,71.2833,1
2,1,3,1,26.0,0,0,7.925,0
3,1,1,1,35.0,1,0,53.1,0
4,0,3,0,35.0,0,0,8.05,0


In [5]:
# Split dataset in training and test datasets
X_train, X_test = train_test_split(data, test_size=0.5, random_state=int(time.time()))
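
Because the split above is seeded with the current time, every run produces a different train/test partition, so the scores printed below will not reproduce exactly. The following is a minimal pure-Python sketch of the idea behind a seeded random split. The helper `split_rows` is hypothetical and is not how scikit-learn implements `train_test_split`; it only illustrates why a fixed seed gives a repeatable partition.

```python
import random

def split_rows(rows, test_size=0.5, seed=42):
    """Shuffle a copy of `rows` with a seeded generator and cut it in two."""
    rng = random.Random(seed)          # fixed seed -> same shuffle every run
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

rows = list(range(10))
train_a, test_a = split_rows(rows, seed=42)
train_b, test_b = split_rows(rows, seed=42)
print(train_a == train_b and test_a == test_b)    # same seed, same split
```

With `train_test_split`, the same effect is obtained by passing a constant integer as `random_state`, as the Spam example later in this notebook does.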

In [6]:
# Instantiate the classifier
gnb = GaussianNB()
used_features =[
    "Pclass",
    "Sex_cleaned",
    "Age",
    "SibSp",
    "Parch",
    "Fare",
    "Embarked_cleaned"
]

# Train classifier
gnb.fit(
    X_train[used_features].values,
    X_train["Survived"]
)
# Predict on the same array type used for fitting
y_pred = gnb.predict(X_test[used_features].values)

# Print results
print("Number of mislabeled points out of a total {} points : {}, performance {:05.2f}%"
      .format(
          X_test.shape[0],
          (X_test["Survived"] != y_pred).sum(),
          100*(1-(X_test["Survived"] != y_pred).sum()/X_test.shape[0])
))

Number of mislabeled points out of a total 357 points : 81, performance 77.31%


In [7]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
used_features =["Fare"]
y_pred = gnb.fit(X_train[used_features].values, X_train["Survived"]).predict(X_test[used_features].values)
print("Number of mislabeled points out of a total {} points : {}, performance {:05.2f}%"
      .format(
          X_test.shape[0],
          (X_test["Survived"] != y_pred).sum(),
          100*(1-(X_test["Survived"] != y_pred).sum()/X_test.shape[0])
))
# `var_` replaces the older `sigma_` attribute, which was removed in scikit-learn 1.2
print("Std Fare not_survived {:05.2f}".format(np.sqrt(gnb.var_)[0][0]))
print("Std Fare survived: {:05.2f}".format(np.sqrt(gnb.var_)[1][0]))
print("Mean Fare not_survived {:05.2f}".format(gnb.theta_[0][0]))
print("Mean Fare survived: {:05.2f}".format(gnb.theta_[1][0]))

Number of mislabeled points out of a total 357 points : 121, performance 66.11%
Std Fare not_survived 34.03
Std Fare survived: 66.85
Mean Fare not_survived 24.24
Mean Fare survived: 55.43
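
To see how these four numbers drive the prediction, here is a small pure-Python sketch of the Gaussian Naive Bayes decision for a single feature. The means and standard deviations are copied from the output above. The class priors are an assumption: they are set to 0.5 each purely for illustration, whereas scikit-learn estimates them from the training labels. The function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Per-class Fare statistics, copied from the printed output above.
stats = {
    "not_survived": {"mean": 24.24, "std": 34.03},
    "survived": {"mean": 55.43, "std": 66.85},
}
# ASSUMPTION: equal priors, for illustration only (not taken from the data).
priors = {"not_survived": 0.5, "survived": 0.5}

def posterior(fare):
    """Normalized P(class | fare) under the Gaussian model."""
    scores = {c: priors[c] * gaussian_pdf(fare, s["mean"], s["std"])
              for c, s in stats.items()}
    total = sum(scores.values())
    return {c: v / total for c, v in scores.items()}

def predict(fare):
    """Class with the highest posterior probability."""
    post = posterior(fare)
    return max(post, key=post.get)

for fare in (10.0, 200.0):
    print(fare, predict(fare))
```

Under these assumptions, a low fare falls where the narrower "not survived" distribution has more density, while a very high fare is only plausible under the wider "survived" distribution.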


### Spam SMS

In this example, we will apply Naive Bayes to textual data. The target data set is [a collection of SMS spam messages](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

The data table contains two columns separated by a tab. The two columns are **label** and **message**.

In [8]:
import pandas as pd

df = pd.read_table('data/SMSSpamCollection',  
                   sep='\t', 
                   header=None,
                   names=['label', 'message'])
df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Unlike numerical data, textual data requires additional processing. First, we need to convert the categorical labels into numerical values. 

In [9]:
df['label'] = df.label.map({'ham': 0, 'spam': 1})
df.head()

Unnamed: 0,label,message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


Second, we need to convert all characters in the message to lower case:

In [10]:
df['message'] = df.message.map(lambda x: x.lower())  
df.head()

Unnamed: 0,label,message
0,0,"go until jurong point, crazy.. available only ..."
1,0,ok lar... joking wif u oni...
2,1,free entry in 2 a wkly comp to win fa cup fina...
3,0,u dun say so early hor... u c already then say...
4,0,"nah i don't think he goes to usf, he lives aro..."


Third, remove any punctuation:

In [11]:
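# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df['message'] = df.message.str.replace(r'[^\w\s]', '', regex=True)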
df.head()

Unnamed: 0,label,message
0,0,go until jurong point crazy available only in ...
1,0,ok lar joking wif u oni
2,1,free entry in 2 a wkly comp to win fa cup fina...
3,0,u dun say so early hor u c already then say
4,0,nah i dont think he goes to usf he lives aroun...
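
The pattern `[^\w\s]` matches any character that is neither a word character nor whitespace, so replacing matches with an empty string strips punctuation. The lower-casing and punctuation steps can be checked on single strings with the standard `re` module; `clean_message` is a hypothetical helper name used only for this sketch.

```python
import re

def clean_message(text):
    """Lower-case the text, then drop every character that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower())

print(clean_message("Ok lar... Joking wif u oni..."))
print(clean_message("Nah I don't think he goes to usf"))
```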


Fourth, tokenize the messages into single words using the **nltk** library. NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP (natural language processing) libraries.

First, we have to import the library and download the tokenizer:

In [12]:
import nltk  
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Linh B
[nltk_data]     Ngo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [13]:
df['message'] = df['message'].apply(nltk.word_tokenize)
df.head()

Unnamed: 0,label,message
0,0,"[go, until, jurong, point, crazy, available, o..."
1,0,"[ok, lar, joking, wif, u, oni]"
2,1,"[free, entry, in, 2, a, wkly, comp, to, win, f..."
3,0,"[u, dun, say, so, early, hor, u, c, already, t..."
4,0,"[nah, i, dont, think, he, goes, to, usf, he, l..."


Fifth, we will perform some word stemming. The idea of stemming is to normalize our text so that all variations of a word are reduced to a common stem and treated as the same term, regardless of tense. One of the most popular stemming algorithms is the Porter Stemmer:

In [14]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

df['message'] = df['message'].apply(lambda x: [stemmer.stem(y) for y in x])  
df.head()

Unnamed: 0,label,message
0,0,"[go, until, jurong, point, crazi, avail, onli,..."
1,0,"[ok, lar, joke, wif, u, oni]"
2,1,"[free, entri, in, 2, a, wkli, comp, to, win, f..."
3,0,"[u, dun, say, so, earli, hor, u, c, alreadi, t..."
4,0,"[nah, i, dont, think, he, goe, to, usf, he, li..."


Finally, we will transform the data into word occurrence counts, which will be the features that we feed into our model:

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

# This converts the list of words into space-separated strings
df['message'] = df['message'].apply(lambda x: ' '.join(x))
df.head()

Unnamed: 0,label,message
0,0,go until jurong point crazi avail onli in bugi...
1,0,ok lar joke wif u oni
2,1,free entri in 2 a wkli comp to win fa cup fina...
3,0,u dun say so earli hor u c alreadi then say
4,0,nah i dont think he goe to usf he live around ...


In [16]:
count_vect = CountVectorizer()  
counts = count_vect.fit_transform(df['message'])  
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
print(len(count_vect.get_feature_names_out()))

8169
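
To make the idea of a count matrix concrete, the sketch below builds one by hand for two of the stemmed messages shown earlier. It is an illustration, not `CountVectorizer` itself: `count_matrix` is a hypothetical helper that splits on whitespace and, to loosely mimic the default token pattern of `CountVectorizer`, ignores single-character tokens such as "u" and "c".

```python
from collections import Counter

def count_matrix(docs, min_token_len=2):
    """Return (vocabulary, rows); rows[d][j] counts vocabulary[j] in document d."""
    tokenized = [[w for w in d.split() if len(w) >= min_token_len] for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: j for j, w in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for word, n in Counter(toks).items():
            row[index[word]] = n
        rows.append(row)
    return vocab, rows

docs = [
    "ok lar joke wif u oni",
    "u dun say so earli hor u c alreadi then say",
]
vocab, rows = count_matrix(docs)
print(vocab)
for row in rows:
    print(row)
```

Each column corresponds to one vocabulary term, which is why the full SMS collection yields thousands of features.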


We could leave the features as simple word counts per message, but it is better to use Term Frequency-Inverse Document Frequency, better known as tf-idf, which down-weights words that appear in many messages:

In [17]:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer().fit(counts)
counts = transformer.transform(counts)  
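
As a rough illustration of what the transformation computes, the sketch below implements tf-idf by hand using the smoothed idf formula, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row, which corresponds to the default settings of `TfidfTransformer`. The function `tfidf` and the three toy documents are hypothetical and exist only for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) where each row is an L2-normalized tf-idf vector."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed inverse document frequency.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1.0 for w in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = ["free entri win", "ok lar joke", "free call now"]
vocab, rows = tfidf(docs)
first = dict(zip(vocab, rows[0]))
print(first["free"] < first["entri"])  # "free" occurs in two documents, so it weighs less
```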

#### Training the Model

Now that we have performed feature extraction from our data, it is time to build our model. We will start by splitting our data into training and test sets:

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, df['label'], 
                                                    test_size=0.1, 
                                                    random_state=69)  

Then, all that we have to do is initialize the Naive Bayes Classifier and fit the data. For text classification problems, the Multinomial Naive Bayes Classifier is well-suited:

In [19]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB().fit(X_train, y_train)  
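
For intuition about what the fitted model stores, here is a small from-scratch sketch of multinomial Naive Bayes with Laplace (add-one) smoothing. It works on raw word counts rather than the tf-idf weights used above, the four training messages are made up for illustration, and the function names are hypothetical, so treat it as a sketch of the technique rather than a reimplementation of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log word likelihoods."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_sizes = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_sizes.items()}
    log_like = {}
    for c in class_sizes:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict_label(model, doc):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc.split() if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Made-up training messages: 1 = spam, 0 = ham.
docs = ["free entri win cash", "win free prize now", "ok see you later", "are you home now"]
labels = [1, 1, 0, 0]
nb_model = train_multinomial_nb(docs, labels)
print(predict_label(nb_model, "free cash prize"))
print(predict_label(nb_model, "see you at home"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.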

#### Evaluating the Model

Once we have put together our classifier, we can evaluate its performance in the testing set:

In [21]:
import numpy as np

predicted = model.predict(X_test)

print(np.mean(predicted == y_test))  

0.9480286738351255


Congratulations! Our simple Naive Bayes classifier reaches 94.8% accuracy on this specific test set. However, accuracy alone is not enough, because our dataset is imbalanced with respect to the labels (86.6% legitimate versus 13.4% spam): a classifier that labels every message as legitimate would already score roughly 86.6%. Our classifier could be favoring the legitimate class while missing much of the spam class.
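
One way to check for this is to look at precision and recall for the spam class instead of accuracy alone. The sketch below computes them by hand on a made-up, imbalanced label set (8 legitimate, 2 spam) and shows that a classifier that never predicts spam still reaches high accuracy. The labels and helper names are hypothetical; they are not results from the model above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class (0.0 when undefined)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: 0 = legitimate, 1 = spam.
y_true = [0] * 8 + [1] * 2
always_ham = [0] * 10            # a useless classifier that never flags spam
accuracy = sum(t == p for t, p in zip(y_true, always_ham)) / len(y_true)
precision, recall = precision_recall(y_true, always_ham)
print(accuracy, precision, recall)
```

For the real model, the same quantities can be obtained by passing `y_test` and `predicted` to `confusion_matrix` or `classification_report` from `sklearn.metrics`.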

### Question 2:

Complete the following code, which applies a Naive Bayes classifier to a sentiment text data set. There are three datasets inside the sentiment_label_sentences directory, containing IMDB, Yelp, and Amazon reviews. You can select one of them for the classification process.

Hint: while this question follows the same procedure as the Spam example, the structure of the data file is different. Review the data from the Spam example and identify the differences so that you can fill in the right contents.

In [None]:
import pandas as pd

df = pd.read_table('data/sentiment_label_sentences/________',  
                   sep='\t', 
                   header=None,
                   names=['_____', '_____'])
df.head()

In [None]:
# convert to lower case
df[_____] = df._____.map(lambda x: x.lower())  

# remove punctuation
df[_____] = df._____.str.replace(r'[^\w\s]', '', regex=True)

# tokenize
import nltk  
nltk.download('punkt')
df[_____] = df[_____].apply(nltk._____)

# stem the words
from nltk.stem import _____
stemmer = _____()
df[_____] = df[_____].apply(lambda x: [stemmer.stem(y) for y in x])  
df.head()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# This converts the list of words into space-separated strings, ready to be
# converted to term-frequency format
df[_____] = df[_____].apply(lambda x: ' '.join(x))
df.head()

In [None]:
count_vect = CountVectorizer()  
counts = count_vect.fit_transform(df[_____])  
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
print(len(count_vect.get_feature_names_out()))

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer().fit(counts)
counts = transformer.transform(counts)  
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, df[_____], 
                                                    test_size=0.1, 
                                                    random_state=69) 
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB().fit(X_train, y_train)  
import numpy as np

predicted = model.predict(X_test)

print(np.mean(predicted == y_test)) 