<font color="green">*To start working on this notebook, or any other notebook that we will use in the Moringa Data Science Course, we will need to save our own copy of it. We can do this by clicking File > Save a Copy in Drive. We will then be able to make edits to our own copy of this notebook.*</font>

# Python Programming: Bayes Theorem

Bayes' Theorem is applied in machine learning through the Bayes classifier, which we use to make predictions. In this session, we will learn how to apply this classifier to a few machine learning problems, even though we will spend more time working through such problems exhaustively later, during Core. While working, we should note that the naive Bayes classifier assumes that, within a class, the presence of a particular feature is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, the classifier treats each of these properties as contributing independently to the probability that this fruit is an apple, which is why it is known as ‘Naive’.

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. Naive Bayes is not a single algorithm but a family of algorithms that share a common principle: every pair of features is assumed to be independent of each other, given the class.
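
As a rough illustration of this principle, the score for each class is its prior probability multiplied by the likelihood of each observed feature, and the scores are then normalized so that they sum to 1. The sketch below applies this to the fruit example. All the probabilities in it are made-up numbers chosen for illustration, not values estimated from data.

```python
from math import prod

# Hypothetical priors and per-feature likelihoods (illustrative numbers only).
priors = {"apple": 0.3, "other": 0.7}
likelihoods = {
    "apple": {"red": 0.8, "round": 0.9, "about_3_inches": 0.7},
    "other": {"red": 0.2, "round": 0.4, "about_3_inches": 0.3},
}
observed = ["red", "round", "about_3_inches"]

# Naive assumption: multiply the per-feature likelihoods as if they were independent.
scores = {
    label: priors[label] * prod(likelihoods[label][feature] for feature in observed)
    for label in priors
}

# Normalize so the scores sum to 1 (the denominator in Bayes' Theorem).
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}

print(posteriors)
```

With these made-up numbers, the classifier would pick "apple", since it has the larger posterior.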


## Example 

In [0]:
# Example 1
# ---
# Let's see an overview on how this classifier works, which suitable applications it has, 
# and how to use it in just a few lines of Python and the Scikit-Learn library.
# ---
# Question: Build a very simple SPAM detector for SMS messages given the following dataset; 
# ---
# Dataset source = https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
#

In [0]:
# Importing our libraries
# ---
#
import pandas as pd

import numpy as np

In [0]:
# Loading our uploaded Data
# ---
# We define a separator (in this case, a tab) and rename the columns accordingly
# 
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['label', 'message'], encoding='latin-1')
df.head()

In [0]:
# Pre-processing
# ---
# 1. Converting the labels from strings to binary values for our classifier
# 
df['label'] = df.label.map({'ham': 0, 'spam': 1})
df.head()

In [0]:
# Pre-processing
# ---
# 2. Converting all characters in the message to lower case:
# 
df['message'] = df.message.map(lambda x: x.lower())
df.head()

In [0]:
# Pre-processing
# ---
# 3. Remove any punctuation:
# 
df['message'] = df.message.str.replace(r'[^\w\s]', '', regex=True)
df.head()
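
To see what steps 2 and 3 do to a single message, here is a minimal sketch that uses only Python's built-in `re` module. The sample message is made up and is not taken from the dataset.

```python
import re

# A made-up message, not taken from the dataset.
message = "FREE entry!!! Call now, win $1000."

# Step 2: convert to lower case.
lowered = message.lower()

# Step 3: drop every character that is neither a word character nor whitespace.
cleaned = re.sub(r"[^\w\s]", "", lowered)

print(cleaned)
```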

In [0]:
# Pre-processing
# ---
# 4. Tokenize the messages into single words using nltk. 
# First, we have to import nltk and download the tokenizer:
# 
import nltk
nltk.download("popular")

In [0]:
# Pre-processing
# ---
# 5. Applying the tokenization. 
# What is tokenization (http://bit.ly/WhatisTokenization)
# 
df['message'] = df['message'].apply(nltk.word_tokenize)
df.head()

In [0]:
# Pre-processing
# ---
# 6. We then perform some word stemming. 
# The idea of stemming is to normalize our text so that all variations of a word map to the same stem, 
# regardless of tense. One of the most popular stemming algorithms is the Porter Stemmer:
# 
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
 
df['message'] = df['message'].apply(lambda x: [stemmer.stem(y) for y in x])
df.head()
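
As a small illustration, the sketch below stems a few hand-picked words (not taken from the dataset) so that we can see several forms of a word collapse to one stem. It only assumes that nltk is installed.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hand-picked example words, all forms of "run".
words = ["run", "runs", "running"]
stems = [stemmer.stem(word) for word in words]

print(stems)
```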

In [0]:
# Pre-processing
# ---
# 7. We will transform the data into word occurrence counts, 
# which will be the features that we will feed into our model:
#
from sklearn.feature_extraction.text import CountVectorizer

# This converts the list of words into space-separated strings
df['message'] = df['message'].apply(lambda x: ' '.join(x))

count_vect = CountVectorizer()
counts = count_vect.fit_transform(df['message'])
df.head()
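
To make the idea of occurrences concrete, here is a minimal sketch on a toy corpus of three made-up messages. Each row of the resulting matrix is a message, each column is a word from the learned vocabulary, and each cell holds how often that word occurs in that message.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up messages, not taken from the dataset.
toy_messages = ["free prize now", "call me now", "free free call"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_messages)

# Recover the column order from the learned vocabulary (word -> column index).
columns = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

print(columns)
print(toy_counts.toarray())
```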

In [0]:
# Pre-processing
# ---
# 8. We could leave it as the simple word-count per message, 
# but it is better to use Term Frequency Inverse Document Frequency, better known as tf-idf:
#
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer().fit(counts)

counts = transformer.transform(counts)
df.head()
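
The effect of tf-idf can be seen on a toy corpus: a word that appears in every message carries little information, so it receives a lower weight than a word that appears in only one message. This is a sketch on made-up messages, using the default settings of `TfidfTransformer`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up messages: "now" occurs in all of them, "prize" in only one.
toy_messages = ["free prize now", "call me now", "meet me now"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_messages)
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)

# Look up the tf-idf weights of two words in the first message.
vocab = toy_vect.vocabulary_
weight_now = toy_tfidf[0, vocab["now"]]
weight_prize = toy_tfidf[0, vocab["prize"]]

# Both words occur once in the first message, but "prize" is rarer overall.
print(weight_now, weight_prize)
```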

In [0]:
# Training the Model
# ---
# Now that we have performed feature extraction from our data, 
# it is time to build our model. We will start by splitting our data into training and test sets:
#
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, df['label'], test_size=0.1, random_state=69)

In [0]:
# Training the Model
# ---
# Then, all that we have to do is initialize the Naive Bayes Classifier and fit the data. 
# For text classification problems, the Multinomial Naive Bayes Classifier is well-suited:
# 
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB().fit(X_train, y_train)
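
One optional refinement, which is not part of the walkthrough itself: the vocabulary and the tf-idf weights were learned from all messages before the split, so the test messages influenced the features. A scikit-learn `Pipeline` can bundle the vectorizer, the tf-idf transformer and the classifier so that every step is fitted on the training data only. The sketch below assumes a tiny made-up training set, and its variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training messages (1 = spam, 0 = ham).
train_messages = [
    "win free prize now",
    "free cash prize claim now",
    "claim your free prize",
    "are we meeting for lunch today",
    "see you at home tonight",
    "call me when you get home",
]
train_labels = [1, 1, 1, 0, 0, 0]

spam_pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])

# Fitting the pipeline fits every step on the training messages only.
spam_pipeline.fit(train_messages, train_labels)

# New raw messages go through the same steps before being classified.
new_messages = ["free prize claim", "see you at lunch"]
predictions = spam_pipeline.predict(new_messages)

print(predictions)
```

With the real dataset, we would split the raw messages first and then fit such a pipeline on the training portion.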

In [0]:
# Evaluating the Model
# ---
# Once we have put together our classifier, we can evaluate its performance on the test set:
#
predicted = model.predict(X_test)

print(np.mean(predicted == y_test))

# Our simple Naive Bayes Classifier has 94.8% accuracy with this specific test set!
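
Accuracy alone can be misleading when one class is much more common than the other, as ham is in this dataset. A confusion matrix, together with precision and recall for the spam class, gives a fuller picture. The sketch below uses made-up labels and predictions rather than the output of the model above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up labels and predictions (1 = spam, 0 = ham), for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)

# Rows are the true classes, columns are the predicted classes.
matrix = confusion_matrix(y_true, y_pred)

# Precision: of the messages flagged as spam, how many really were spam.
precision = precision_score(y_true, y_pred)

# Recall: of the actual spam messages, how many were caught.
recall = recall_score(y_true, y_pred)

print(accuracy, precision, recall)
print(matrix)
```

In this made-up case the accuracy is high even though half of the spam messages are missed, which only the recall reveals.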

## <font color="green">Challenges</font>

In [0]:
# Challenge 1
# ---
# In this challenge, we have been tasked with creating a classifier, the training set,
# then training the classifier using the training set and making a prediction.
# ---
# The training set (X) consists of length, weight and shoe size. 
# Y contains the associated labels (male or female).
# 

X = [[121, 80, 44], [180, 70, 43], [166, 60, 38], [153, 54, 37], [166, 65, 40], [190, 90, 47], [175, 64, 39],
     [174, 71, 40], [159, 52, 37], [171, 76, 42], [183, 85, 43]]

Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female', 'female', 'male', 'male']

# Training the classifier:
#
OUR CODE GOES HERE

# Making the prediction:
# Using the GaussianNB classifier (i.e. from sklearn.naive_bayes import GaussianNB) 
# 



In [0]:
# Challenge 2
# ---
# Question: Use the titanic disaster dataset to create a Gaussian Naive Bayes classifier model 
# (i.e. from sklearn.naive_bayes import GaussianNB) that will make a prediction of survival 
# using passenger ticket fare information. 
# ---
# Dataset url: http://bit.ly/TitanicDataset 
# 
OUR CODE GOES HERE

In [0]:
# Challenge 3
# ---
# Question: Create a GaussianNB classifier (i.e. from sklearn.naive_bayes import GaussianNB) 
# to identify the different species of iris flowers.
# ---
# Dataset url = http://bit.ly/MSIrisDatasetNB
# 
OUR CODE GOES HERE