# Module 2A (Part 1): Introduction to Machine Learning: the Naive Bayes Classifier

Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For instance, when deciding whether a fruit is an apple, features such as color and shape are each assumed to contribute independently to that probability, regardless of any correlation between them, and each feature has its own probability distribution. In this workshop, we will explore two major algorithms for training a Naive Bayes classifier: Gaussian Naive Bayes and Multinomial Naive Bayes (there are others, of course).

Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector $x = (x_{1},\dots ,x_{N})$ of *N* features or pieces of evidence (independent variables), it assigns to this instance the probabilities

$Pr( C_k | x_1, x_2,..., x_N)$

for each of the *K* possible outcomes or classes $C_k$.

The problem with the above formulation is that if the number of features *N* is large, or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, the conditional probability can be decomposed as

$ Pr( C_k | x_1, x_2, ..., x_N ) = \frac{Pr(C_k) Pr(x_1, x_2, ..., x_N | C_k)}{Pr( x_1, x_2, ..., x_N)} $

In plain English, using Bayesian probability terminology, the above equation can be written as

$ \text{posterior} = \frac{ \text{prior} \times \text{likelihood} }{\text{evidence}} $
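
The "naive" independence assumption is what makes the likelihood tractable: it lets the joint likelihood factor into a product of per-feature terms, $Pr(x_1, ..., x_N | C_k) = \prod_{i=1}^{N} Pr(x_i | C_k)$. As a minimal sketch of the posterior computation under that assumption, the snippet below uses made-up priors and per-feature likelihoods for two classes. The numbers and the helper name `naive_bayes_posterior` are illustrative assumptions, not values taken from any dataset.

```python
# Hypothetical toy numbers, chosen only for illustration
priors = {"malignant": 0.4, "benign": 0.6}

# Pr(x_i | class) for two observed feature values x_1 and x_2 (made up)
likelihoods = {
    "malignant": [0.7, 0.2],
    "benign": [0.1, 0.5],
}

def naive_bayes_posterior(priors, likelihoods):
    """Return Pr(class | x) assuming features are independent given the class."""
    unnormalised = {}
    for c in priors:
        score = priors[c]            # prior
        for p in likelihoods[c]:     # likelihood = product of per-feature terms
            score *= p
        unnormalised[c] = score
    evidence = sum(unnormalised.values())   # normalising constant
    return {c: s / evidence for c, s in unnormalised.items()}

posterior = naive_bayes_posterior(priors, likelihoods)
print(posterior)
```

The evidence term is the same for every class, so it only rescales the scores so that they sum to one; the predicted class is the one with the largest prior times likelihood.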

## A General Machine Learning Architecture

A general supervised machine learning architecture consists of two major steps:
- **A Training Phase**
- **A Test Phase**

The **training phase** consists of taking a dataset with a set of features and passing it to a machine learning system (this week, the machine learning system that you will learn is the Naive Bayes classifier). This system outputs a mathematical function that approximates the patterns and trends of the input training data. In machine learning, this function is usually referred to as a model. The model is typically computed by solving an optimization problem that minimizes the error between the model's prediction for each training data point and that point's correct label. That is why it is called *supervised* learning: one always needs to provide the true labels of the training data.

The **test phase** consists of applying the trained model to a set of data points that *were not used during the training phase* and evaluating how well the model predicts their labels.


<img src="images/ML.png" width="700px" />

## Task A: A Breast Cancer Classifier Using Naive Bayes

In this lecture, we will apply Naive Bayes to predict whether a tumor is malignant (cancer) or benign.

### Import the Required Libraries

In [None]:
# Numerical Data Manipulation libraries
import pandas as pd
import numpy as np
import statistics as stat

# Figure Plotting libraries
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
sns.set()

# Naive Bayes libraries
import sklearn
from sklearn.naive_bayes import BernoulliNB      # Naive Bayes Classifier based on a Bernoulli Distribution
from sklearn.naive_bayes import GaussianNB       # Naive Bayes Classifier based on a Gaussian Distribution
from sklearn.naive_bayes import MultinomialNB    # Naive Bayes Classifier based on a Multinomial Distribution

# Machine Learning libraries
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# Text Analysis libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

### Some Functions



In [None]:
# function to plot gaussian distribution in data
# you do not need to understand this in detail. Use it as a function that receives some data
# and plots a gaussian distribution over it
def func_plot_gaussian( data, ylim = (5, 42), xlim = (5, 30) ):

    # separate the benign tumours (diagnosis = 0) from the malignant ones (diagnosis = 1)
    mal = data[ data[ 'diagnosis'] == 1]
    ben = data[ data[ 'diagnosis'] == 0]

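    # need to convert dataframe into a matrix in order to make the plot work
    X = data[ ['radius_mean', 'texture_mean'] ]
    x = X.to_numpy()
    # class labels as an array, so the function does not depend on a global variable y
    y = data['diagnosis'].to_numpy()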

    # plot figure
    fig=plt.figure(dpi=150)
    ax= fig.add_subplot(111)
    
    # plot the datapoints of our data and color-encode them according to their diagnosis (malignant / benign)
    plt.scatter(mal['radius_mean'], mal['texture_mean'], c='r', marker='s', s=3, label='malignant')
    plt.scatter(ben['radius_mean'], ben['texture_mean'], c='b', marker='o', s=3, label='benign')
    plt.ylabel('texture_mean', fontsize=12)
    plt.xlabel('radius_mean', fontsize=12)
    plt.title('Breast Tumors', fontsize=14)
    plt.legend()

    # plot the gaussian curves over the data
    # a gaussian distribution can be computed from the mean of the data, variable mu,
    # and the standard deviation of the data, variable std
    xg = np.linspace(xlim[0], xlim[1], 60)
    yg = np.linspace(ylim[0], ylim[1], 40)
    xx, yy = np.meshgrid(xg, yg)
    Xgrid = np.vstack([xx.ravel(), yy.ravel()]).T

    for label, color in enumerate(['blue', 'red']):
        mask = (y == label)
        mu, std = x[mask].mean(0), x[mask].std(0)
        P = np.exp(-0.5 * (Xgrid - mu) ** 2 / std ** 2).prod(1) # Gaussian distr. mathematical formula
        Pm = np.ma.masked_array(P, P < 0.05)
        ax.pcolorfast(xg, yg, Pm.reshape(xx.shape), alpha=0.5, cmap=color.title() + 's')
        ax.contour(xx, yy, P.reshape(xx.shape), levels=[0.01, 0.1, 0.5, 0.9], colors=color, alpha=0.2) 
    
    ax.set(xlim=xlim, ylim=ylim)
    plt.show()

### The Dataset

In [None]:
# Load breast cancer dataset
# Data describes if a tumour is MALIGNANT (value 1) or BENIGN (value 0) according to:
# - mean radius of the tumour
# - mean texture of the tumour
file_path = 'data/breast_data_simple.csv'
data = pd.read_csv( file_path )

In [None]:
data

In [None]:
# 1st step towards a machine learning approach: separate your dataset!
# put the variable that you wish to classify (or predict) in one variable
# put your sources of evidence (or your features) in another variable
y = data['diagnosis']                        # variable to classify and predict
X = data[['radius_mean', 'texture_mean']]    # variable containing your features
X = data[['radius_mean', 'texture_mean']]    # variable containing your features

Let's look at our variables:

In [None]:
y

In [None]:
X

Before we proceed with any data analysis, we need to understand what kind of data we are dealing with. 
Naive Bayes models (like most machine learning models) are based on statistical learning. This means that the distribution of your data plays an important role in how successful the machine learning algorithm is.

In [None]:
# for plotting purposes:
# separate the benign tumors (diagnosis = 0) from the malignant ones (diagnosis = 1)
malignant = data[ data[ 'diagnosis'] == 1]
benign = data[ data[ 'diagnosis'] == 0]

# need to convert dataframe into a matrix in order to make the plot work
x = X.to_numpy()

# plot figure
fig=plt.figure(dpi=150)

# cmap is omitted: it has no effect when a single fixed color is passed through c
plt.scatter(malignant['radius_mean'], malignant['texture_mean'], c='r', marker='x', s=10, label='malignant')
plt.scatter(benign['radius_mean'], benign['texture_mean'], c='b', marker='o', s=10, label='benign')
plt.ylabel('texture_mean', fontsize=12)
plt.xlabel('radius_mean', fontsize=12)
plt.title('Breast Tumors', fontsize=14)
plt.legend()
plt.show()

Our dataset is not sparse, which is good (it is hard to model sparse data). The data seems to be concentrated around a mean value for each class. In statistics and machine learning, we often model this kind of data with a Gaussian distribution (which is nothing more than a bell-shaped curve). Let's see this in our data:

In [None]:
# plotting function defined at the beginning of the notebook
func_plot_gaussian( data )
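
The shaded regions drawn by `func_plot_gaussian` are built from each class's mean and standard deviation. As a rough sketch of the underlying bell curve for a single feature, the snippet below evaluates the normalised Gaussian density with NumPy. The helper `gaussian_pdf` and the five radius values are hypothetical and are not taken from the dataset.

```python
import numpy as np

def gaussian_pdf(x, mu, std):
    """Density of a univariate Gaussian with mean mu and standard deviation std."""
    coef = 1.0 / (std * np.sqrt(2.0 * np.pi))
    return coef * np.exp(-0.5 * ((x - mu) / std) ** 2)

# Hypothetical radius_mean values for one class (made up for illustration)
radius = np.array([11.2, 12.5, 13.1, 12.0, 11.8])
mu, std = radius.mean(), radius.std()

density_at_mean = gaussian_pdf(mu, mu, std)            # peak of the bell curve
density_far = gaussian_pdf(mu + 3 * std, mu, std)      # three standard deviations away

print("density at the mean: %.4f, density at mean + 3 std: %.4f" % (density_at_mean, density_far))
```

A Gaussian Naive Bayes classifier fits one such curve per feature and per class, then multiplies the per-feature densities together, in line with the independence assumption.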

### Running the Naive Bayes Classifier with a Gaussian Kernel

Now that we took a look at our data and that we separated the data into a variable with the prediction, y, and another variable with the features, X, we need to split our data into two sets: a training set (used to estimate our model), and a test set (used to evaluate how good our model is).

**Remember!** Never use the same data in your training set and your test set! Why? If you evaluate your model using the data that was used to build it, the algorithm will already "know" the correct prediction for that data, and the measured accuracy will be overly optimistic. That is why we always test a machine learning model with a set of data points that the model has never seen during the training phase!

#### The Importance of Defining Test and Training Sets

In [None]:
# create the training set and the test set
# a common rule of thumb is to use about 70% of your data for training
# and 30% of the data for testing
# test_size specifies the fraction of the data that you want to reserve for the test set
# the argument random_state simply ensures that we get the same results
# when we run this cell many times. Since the split between the train set and the test set is random,
# setting the random_state makes the results reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 515)
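
To make the role of `random_state` concrete, here is a small sketch on a made-up array (the names `X_toy` and `y_toy` are hypothetical stand-ins for the real features and labels). Splitting twice with the same seed should give the same test rows, while omitting the seed generally gives a different split on each call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 instances with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# two splits with the same seed
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=515)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=515)

same_split = np.array_equal(a_test, b_test)
print("Same test rows with the same random_state:", same_split)
print("Train size: %d, test size: %d" % (a_train.shape[0], a_test.shape[0]))
```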

In [None]:
# let's take a look at the size of our training sets and test sets
print( "Training set contains %d instances and the test set contains %d instances" %(X_train.shape[0], X_test.shape[0]))
print("Size of training set: %.2f%% of the data" %((X_train.shape[0]/X.shape[0])*100))
print("Size of test set: %.2f%% of the data" %((X_test.shape[0]/X.shape[0])*100))

#### Definition of the Type of Classifier

In [None]:
# learn the model
# (STEP 1, splitting the data into training and test sets, was done in the cells above)

# STEP 2: specify the learning algorithm
# In this lecture, we will use a simple Gaussian Naive Bayes Model
model_base = GaussianNB()

# STEP 3: fit the model to the training data
model_base.fit( X_train, y_train )

# STEP 4: make predictions on the test set
# given a set of features that the model did not see before,
# try to predict the correct label of the data (label = malignant or benign tumor)
y_prediction = model_base.predict( X_test )

# STEP 5: measure the accuracy of the model
# compare the predicted labels with the true labels associated with the X_test data
print( 'The overall accuracy of the model is %.2f%%' %(accuracy_score( y_test, y_prediction )*100))

In the above code, we applied the 5 steps for a machine learning problem:
1. Split data into train and test sets
2. Specify the learning algorithm
3. Fit the algorithm to the training data
4. Make predictions on test set
5. Measure the performance of the learnt model
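
Step 4 returns hard labels, but a fitted `GaussianNB` can also report the posterior probability of each class through `predict_proba`, which connects the code back to the posterior formula from the introduction. The sketch below uses synthetic two-class data from `make_blobs` as a stand-in for the tumor features; the data and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data (an assumption; it stands in for the tumor features)
X_toy, y_toy = make_blobs(n_samples=200, centers=2, n_features=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

toy_model = GaussianNB().fit(X_tr, y_tr)

hard_labels = toy_model.predict(X_te)         # one class label per test point
posteriors = toy_model.predict_proba(X_te)    # one row of class probabilities per test point

# the hard label is the class with the largest posterior probability
labels_from_posteriors = toy_model.classes_[posteriors.argmax(axis=1)]

print(posteriors[:3])
```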

#### Validating Results

In [None]:


trials = 500

accuracy = []
for trial in range( 0, trials ):
    
    # randomly select a test set and a training set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    y_expected = y_test
    
    GaussNB = GaussianNB()         # create the Gaussian Naive Bayes Classifier
    GaussNB.fit(X_train, y_train)  # fit the model to the training data

    y_predicted = GaussNB.predict(X_test)                       # get predictions of model on the test set
    accuracy.append(accuracy_score( y_expected, y_predicted ))  # save accuracy obtained in each trial
    print("Applying Naive Bayes............................ Trial #" + str(trial + 1) + " ....... acc = " + str( accuracy[trial] ))


In [None]:
# Compute overall average accuracy over the 500 trials
min_accuracy = np.min(accuracy)
max_accuracy = np.max(accuracy)
avg_accuracy = np.mean( accuracy )

print("Results range from [%.2f, %.2f]" %(min_accuracy, max_accuracy))
print( "Average model accuracy is %.2f" %avg_accuracy  )

In [None]:
# plot results
plt.figure()
plt.scatter( range( 0, trials ), accuracy, s = 2 )
lst = np.ones(trials, float)
plt.plot( range( 0, trials ), avg_accuracy*lst, c='r' )
plt.ylabel('Accuracy', fontsize=12)
plt.xlabel('Number of Trials', fontsize=12)
plt.title('Average Accuracy of a Naive Bayes Classifier using a Gaussian Kernel', fontsize=14)
plt.show()

print( "Average model accuracy is %.2f" %avg_accuracy  )
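
The loop above is a form of repeated random sub-sampling validation. As an alternative sketch (not the method used in this notebook), scikit-learn's `ShuffleSplit` and `cross_val_score` can run the same kind of experiment in a few lines. The synthetic data from `make_blobs` and the choice of 50 splits are assumptions made only to keep the example self-contained.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data (an assumption; replace with your own X and y)
X_toy, y_toy = make_blobs(n_samples=300, centers=2, n_features=2,
                          cluster_std=2.0, random_state=0)

# 50 random train/test splits, each reserving 30% of the data for testing
cv = ShuffleSplit(n_splits=50, test_size=0.3, random_state=0)

# one accuracy value per split (accuracy is the default score for classifiers)
scores = cross_val_score(GaussianNB(), X_toy, y_toy, cv=cv)

print("Accuracy over %d splits: mean = %.2f, std = %.2f"
      % (len(scores), scores.mean(), scores.std()))
```

Reporting the standard deviation alongside the mean gives a sense of how much the accuracy depends on which points happen to land in the test set.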

#### What if we use another distribution? How about a Bernoulli distribution?

A Bernoulli distribution is the kind of distribution that you get when you flip a coin: the coin lands heads with probability *p* and tails with probability *(1-p)*. More formally, a Bernoulli distribution is the discrete probability distribution of a random variable which takes the value 1 with probability *p* and the value 0 with probability *(1-p)*.

In [None]:
trials = 500

accuracy = []
for trial in range( 0, trials ):
    
    # randomly select a test set and a training set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    y_expected = y_test
    
    # Specify the learning algorithm
    model = BernoulliNB()   
    
    # Fit the training data to algorithm
    model.fit(X_train, y_train)  # fit the model to the training data
    
    # Make predictions on test set using the learnt model
    y_predicted = model.predict(X_test)
    
    # Measure the performance of the learnt model
    accuracy.append(accuracy_score( y_expected, y_predicted ))  # save accuracy obtained in each trial
    
# report the mean accuracy over all trials of the Bernoulli model
print( 'The average overall accuracy of the model is %.2f%%' %(np.mean( accuracy )*100))

What happened to the performance of our classifier? How can you justify this?
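
One thing that may help when thinking about this question: scikit-learn's `BernoulliNB` expects binary features and, by default, binarizes its inputs at the threshold `binarize=0.0`. The sketch below uses made-up, strictly positive features (loosely on the scale of a radius and a texture measurement) to show what that does. The data, the threshold of 15.0, and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)

# Hypothetical strictly positive features for two classes (60 vs 40 instances)
X_small = rng.normal(loc=[12.0, 18.0], scale=1.5, size=(60, 2))
X_large = rng.normal(loc=[18.0, 22.0], scale=1.5, size=(40, 2))
X_toy = np.vstack([X_small, X_large])
y_toy = np.array([0] * 60 + [1] * 40)

# default threshold 0.0: every positive value becomes 1, so all instances look identical
default_pred = BernoulliNB().fit(X_toy, y_toy).predict(X_toy)

# a threshold inside the range of the data keeps some information about the features
thresholded_pred = BernoulliNB(binarize=15.0).fit(X_toy, y_toy).predict(X_toy)

print("Distinct predictions with binarize=0.0:", np.unique(default_pred))
print("Accuracy with binarize=0.0:  %.2f" % (default_pred == y_toy).mean())
print("Accuracy with binarize=15.0: %.2f" % (thresholded_pred == y_toy).mean())
```

With the default threshold the model can only fall back on the class frequencies, which suggests the choice of distribution should match the kind of features in the data.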

## Try it Yourself! Breast Cancer classification with more features

In [None]:
# Load your dataset
file_path = 'data/breast_data_full.csv'
data_full = pd.read_csv( file_path )

In [None]:
# what are the features in this dataset? How many are there?

# YOUR CODE HERE:
features = 


In [None]:
# separate your dataset: 
# put the variable that you wish to classify (or predict) in one variable
# put your sources of evidence (or your features) in another variable

# note that your dataset contains a column id, which is not necessary. 

# YOUR CODE HERE:
y = 
X = 


In [None]:
# separate the dataset into test set and train set

# YOUR CODE HERE:



In [None]:
# Define the NaiveBayes Gaussian kernel
# YOUR CODE HERE:
model = 

# Fit a model to the data -> learning the model
# YOUR CODE HERE:


# Use the learned model to try to predict the tumors on the testset
# YOUR CODE HERE:
y_predicted =

# Measure the overall accuracy of the model
# | y_predicted - y_test | -> 0
# YOUR CODE HERE
accuracy = 

print( accuracy )

### Comments
What were your findings? Did the incorporation of more features have any impact on the predictions?

## Other Applications: Using Naive Bayes in Text Classification

We will now provide an example of how to apply a Naive Bayes classifier to news. This is a dataset of textual data where the goal is to determine the topic of each news item.

In [None]:
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()
data.target_names

For simplicity here, we will select just a few of these categories, and download the training and testing set:

In [None]:
categories = ['talk.religion.misc', 'soc.religion.christian', 'sci.space', 'comp.graphics']
#categories = data.target_names

train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

With any of the preceding examples, it can quickly become tedious to do the transformations by hand, especially if you wish to string together multiple steps. For example, we might want a processing pipeline that looks something like this:

1. Impute missing values using the mean
2. Transform features to quadratic
3. Fit a linear regression

To streamline this type of processing pipeline, Scikit-Learn provides a `Pipeline` object, which chains such steps into a single estimator. Later in this section, `make_pipeline` is used in the same way to chain a TF-IDF vectorizer with a Multinomial Naive Bayes classifier.
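
As a rough illustration of the three-step pipeline described above (impute, expand to quadratic features, fit a linear regression), the sketch below chains the corresponding scikit-learn estimators with `make_pipeline`. The toy arrays are made up, and the choice of `SimpleImputer` and `PolynomialFeatures(degree=2)` is one reasonable reading of those steps rather than the only one.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data with one missing value; the targets follow y = x**2
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y_toy = np.array([1.0, 4.0, 9.0, 16.0, 25.0])

pipe = make_pipeline(
    SimpleImputer(strategy='mean'),   # step 1: replace missing values with the column mean
    PolynomialFeatures(degree=2),     # step 2: add quadratic features
    LinearRegression(),               # step 3: fit a linear regression on the expanded features
)

# fit and predict run every step in order
pipe.fit(X_toy, y_toy)
pred = pipe.predict(np.array([[3.0]]))
print(pred)
```

The pipeline behaves like a single estimator, so the same `fit` and `predict` calls work no matter how many preprocessing steps are chained in front of the final model.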

In [None]:
# define test and training sets
X_train_raw = train.data
y_train = train.target

X_test_raw = test.data
y_test = test.target
y_expected = test.target

In [None]:
# example of training instance
print(X_train_raw[1])

In [None]:
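# Extract textual features from text, such as TF.IDF (more info: https://monkeylearn.com/blog/what-is-tf-idf/)
vec = TfidfVectorizer()
X_train = vec.fit_transform( X_train_raw )   # learn the vocabulary and IDF weights from the training text
X_test = vec.transform( X_test_raw )         # reuse the training vocabulary; do not refit on the test text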
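
A vectorizer is normally fitted on the training text only, and the test text is then mapped onto that same vocabulary so that both matrices share the same columns. The sketch below shows this on a made-up corpus of three short sentences; the sentences and the names `toy_vec`, `train_docs` and `test_docs` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
train_docs = ["the rocket reached orbit", "the image was rendered in 3d"]
test_docs = ["a new rocket image"]

toy_vec = TfidfVectorizer()
X_tr = toy_vec.fit_transform(train_docs)   # learns the vocabulary and IDF weights
X_te = toy_vec.transform(test_docs)        # reuses them; unseen words are ignored

vocabulary = sorted(toy_vec.vocabulary_)

print("Vocabulary:", vocabulary)
print("Train matrix shape:", X_tr.shape)
print("Test matrix shape: ", X_te.shape)
```

Words that appear only in the test text (here, "new") have no column and are dropped, which is why a classifier trained on `X_tr` can be applied directly to `X_te`.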

In [None]:
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# note: MultinomialNB (suited to word counts / TF-IDF weights) is a different model from GaussianNB (continuous features)


model.fit(train.data, train.target)
y_predicted  = model.predict(test.data)

print( 'The overall accuracy of the model is %.2f%%' %(accuracy_score( y_expected, y_predicted )*100))

colormap = "YlOrBr" # more colors can be found here: https://matplotlib.org/tutorials/colors/colormaps.html
mat = confusion_matrix(test.target, y_predicted)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=True, cmap = colormap,
            xticklabels=train.target_names, yticklabels=train.target_names)

plt.xlabel('true label')
plt.ylabel('predicted label');
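
To help read the heatmap, here is a small sketch of what `confusion_matrix` returns, using six made-up labels for three classes. In scikit-learn's convention, rows are true labels and columns are predicted labels; the cell above plots the transpose, which is why the true labels end up on the x-axis. All values below are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for three classes
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

toy_mat = confusion_matrix(y_true_toy, y_pred_toy)   # rows: true, columns: predicted

# correct predictions sit on the diagonal
toy_accuracy = np.trace(toy_mat) / toy_mat.sum()

# fraction of each true class that was predicted correctly (recall per class)
recall_per_class = toy_mat.diagonal() / toy_mat.sum(axis=1)

print(toy_mat)
print("accuracy:", toy_accuracy)
print("recall per class:", recall_per_class)
```

Off-diagonal cells show which classes get confused with each other, which the overall accuracy alone does not reveal.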

#### Testing Different Sentences

In [None]:
def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]

In [None]:
predict_category('Teaching data analytics with really nice graphics')

In [None]:
predict_category('Flat earth people say Australia does not exist and we are all being paid by Nasa')

In [None]:
predict_category('Chef Kiko is opening a restaurant in Mars')

In [None]:
predict_category('discussing islam vs atheism')