
# Our Mission

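Spam detection is one of the major applications of machine learning on the internet today. Nearly all major email service providers have built-in spam detection systems that automatically classify such mail as 'Junk Mail'.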

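In this mission we will use the Naive Bayes algorithm to build a model that classifies SMS messages from a dataset as spam or not spam, based on the training we give it. It is important to have some intuition for what a spam text message looks like. Such messages usually contain words like 'free', 'win', 'winner', 'cash' and 'prize', because they are designed to catch your eye and, in some sense, tempt you to open them. Spam messages also tend to have words written in all capitals and to use a lot of exclamation marks. To the recipient it is usually straightforward to identify a spam text, and our objective here is to train a model to do that for us!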
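As a rough illustration of that intuition, the sketch below counts the three signals mentioned above for a single message. The function name `spam_signals` and the `TRIGGER_WORDS` set are hypothetical, chosen only for this example; the model built later in the mission learns its own word statistics instead of relying on a hand-written list.

```python
import re

# Hypothetical trigger words, taken from the examples in the paragraph above.
TRIGGER_WORDS = {"free", "win", "winner", "cash", "prize"}

def spam_signals(message):
    """Return simple hand-crafted signals for one message (illustrative only)."""
    words = re.findall(r"[A-Za-z]+", message)
    trigger_hits = sum(1 for w in words if w.lower() in TRIGGER_WORDS)
    # Words written entirely in capitals (one-letter words such as "I" are ignored).
    shouted = sum(1 for w in words if len(w) > 1 and w.isupper())
    return {
        "trigger_words": trigger_hits,
        "uppercase_words": shouted,
        "exclamation_marks": message.count("!"),
    }

spam_like = spam_signals("WINNER!! Claim your FREE cash prize now!")
ham_like = spam_signals("Ok lar... Joking wif u oni...")
print(spam_like)
print(ham_like)
```

Signals like these are easy for a person to spot, which is exactly why a word-count based classifier has a good chance of picking them up automatically.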

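Identifying spam messages is a binary classification problem, because each message is classified as either 'spam' or 'not spam' and nothing else. It is also a supervised learning problem, because we feed a labelled dataset into the model, which it learns from in order to make future predictions.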

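# Overview

This project has been broken down into the following steps:

- Step 0: Introduction to the Naive Bayes Theorem
- Step 1.1: Understanding our dataset
- Step 1.2: Data Preprocessing
- Step 2.1: Bag of Words (BoW)
- Step 2.2: Implementing BoW from scratch
- Step 2.3: Implementing Bag of Words in scikit-learn
- Step 3.1: Training and testing sets
- Step 3.2: Applying Bag of Words processing to our dataset
- Step 4.1: Bayes Theorem implementation from scratch
- Step 4.2: Naive Bayes implementation from scratch
- Step 5: Naive Bayes implementation using scikit-learn
- Step 6: Evaluating our model
- Step 7: Conclusion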


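### Step 0: Introduction to the Naive Bayes Theorem

Bayes' theorem is one of the earliest probabilistic inference algorithms, developed by Reverend Bayes (who reportedly used it to try to infer the existence of God), and it still performs extremely well for certain use cases.

It is best to understand this theorem using an example. Let's say you are a member of the Secret Service and you have been deployed to protect the Democratic presidential nominee during one of his/her campaign speeches. Being a public event that is open to all, your job is not easy and you have to be on constant lookout for threats. So one place to start is to assign a threat factor to each person. Based on the features of an individual, such as age and sex, and on smaller factors, such as whether the person is carrying a bag or looks nervous, you can make a judgement call as to whether that person is a viable threat.

If an individual ticks all the boxes up to a level where it crosses a threshold of doubt in your mind, you can take action and remove that person from the vicinity. Bayes' theorem works in the same way: we compute the probability of an event (a person being a threat) based on the probabilities of certain related events (age, sex, presence of a bag, nervousness and so on).

One thing to consider is the independence of these features from each other. For example, if a child looks nervous at the event, the likelihood of that person being a threat is lower than if a grown man looks nervous. To break this down further, there are two features under consideration here, age and nervousness. If we looked at these features individually, we could design a model that flags all nervous people as potential threats. However, we would likely get a lot of false positives, as there is a strong chance that minors at the event will be nervous. Hence, by considering a person's age together with the 'nervousness' feature, we would get a more accurate result as to who is a potential threat and who is not.

This is the 'naive' bit of the theorem: it considers each feature to be independent of the others, which may not always be the case and can therefore affect the final judgement.

In short, Bayes' theorem calculates the probability of a certain event happening (in our case, a message being spam) based on the joint probabilistic distributions of certain other events (in our case, the appearance of certain words in a message). We will dive into the workings of Bayes' theorem later in the mission, but first, let us understand the data we are going to work with.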

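### Step 1.1: Understanding our dataset


We will be using a [dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) from the UCI Machine Learning Repository, which has a very good collection of datasets for experimental research purposes. The direct data link is [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/).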

Here's a preview of the data:

<img src="images/dqnb.png" height="1242" width="1242">

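The columns in the dataset are currently unnamed and, as you can see, there are 2 columns.

The first column takes two values: 'ham', which signifies that the message is not spam, and 'spam', which signifies that the message is spam.

The second column is the text content of the SMS message that is being classified.

>  Instructions:
* Import the dataset into a pandas DataFrame using the `read_table` method. Because this is a tab-separated dataset, we will use '\t' as the value for the `sep` argument, which specifies this format.
* Also, rename the columns by passing the list `['label', 'sms_message']` to the `names` argument of `read_table()`.
* Print the first five rows of the DataFrame with the new column names.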

In [10]:
'''
Solution
'''
import pandas as pd
# Dataset from - https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
df = pd.read_table('smsspamcollection/SMSSpamCollection',
                   sep='\t',
                   header=None, 
                   names=['label', 'sms_message'])

# Output printing out first 5 rows
df.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


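### Step 1.2: Data Preprocessing ###

Now that we have a basic understanding of what our dataset looks like, let's convert our labels to binary variables, with 0 representing `ham` (i.e. not spam) and 1 representing `spam`, for ease of computation.

You might be wondering why we need this step. The answer lies in how scikit-learn handles its inputs. scikit-learn classifiers do accept string labels and encode them internally, but several metric functions assume by default that the positive class is labelled `1`.

If we left our labels as strings, our model would still be able to make predictions, but we could run into issues later when calculating performance metrics, for example precision and recall, which would then need the positive label to be specified explicitly. Hence, to avoid unexpected 'gotchas' later, it is good practice to feed categorical values into our model as integers.

> Instructions:
* Using the `map` method, convert the values in the 'label' column to numerical values with the mapping `{'ham':0, 'spam':1}`. This maps the `ham` value to 0 and the `spam` value to 1.
* Also, to get an idea of the size of the dataset we are dealing with, print out the number of rows and columns using `shape`.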

In [11]:
'''
Solution
'''
df['label'] = df.label.map({'ham': 0, 'spam': 1})
print(df.shape) # prints (rows, columns)
df.head()

(5572, 2)


| | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


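### Step 2.1: Bag of Words ###

What we have in our dataset is a large collection of text data (5,572 rows). Most ML algorithms rely on numerical data as input, whereas email and SMS messages are usually text heavy.

Here we introduce the Bag of Words (BoW) concept, a term used to describe problems that have a 'bag of words', that is, a collection of text data that needs to be worked with. The basic idea of BoW is to take a piece of text and count the frequency of the words in that text. It is important to note that the BoW concept treats each word individually, and the order in which the words occur does not matter.

Using the process we will go through now, we can convert a collection of documents into a matrix, with each document being a row and each word (token) being a column; the corresponding (row, column) values are the frequencies with which each word or token occurs in that document.

For example:

Let's say we have 4 documents as follows: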

`['Hello, how are you!',
'Win money, win from home.',
'Call me now',
'Hello, Call you tomorrow?']`

Our objective here is to convert this set of texts into a frequency distribution matrix, as follows:

<img src="images/countvectorizer.png" height="542" width="542">

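Here, as we can see, the documents are numbered in the rows, each word is a column name, and the corresponding value is the frequency of that word in the document.

Let's break this down and see how we can do this conversion using a small set of documents.

To handle this, we will be using scikit-learn's
[count vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) class, which does the following:

* It tokenizes the string (separates the string into individual words) and gives an integer ID to each token.
* It counts the occurrences of each of those tokens.

Please note:

* `CountVectorizer` automatically converts all tokenized words to lowercase so that words such as 'He' and 'he' are not treated differently. It does this using the `lowercase` parameter, which is set to `True` by default.
* It also ignores all punctuation so that a word followed by a punctuation mark (for example 'hello!') is not treated differently from the same word without punctuation (for example 'hello'). It does this using the `token_pattern` parameter, whose default regular expression selects tokens of 2 or more alphanumeric characters.
* The third parameter to take note of is `stop_words`. Stop words are the most commonly used words in a language, such as 'am', 'an', 'and' and 'the'. When this parameter is set to `english`, `CountVectorizer` automatically ignores every word in our input text that appears in scikit-learn's built-in list of English stop words. This is extremely helpful, as stop words can skew our calculations when we are trying to find certain key words that are indicative of spam.

We will dive into applying each of these to our model in a later step, but for now it is important to be aware of the preprocessing techniques available to us when dealing with textual data.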
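To make the effect of `lowercase` and `token_pattern` concrete, here is a minimal sketch that mimics those two steps with Python's standard `re` module. It uses the default pattern `(?u)\b\w\w+\b`; the helper name `simple_tokenize` is made up for this example and is not part of scikit-learn.

```python
import re

# The default CountVectorizer token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def simple_tokenize(text):
    """Lowercase the text, then extract tokens with the default pattern."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = simple_tokenize("Hello, how are you! I won a prize")
print(tokens)
```

Punctuation never matches `\w`, so it acts as a separator, and the one-character words 'I' and 'a' are dropped because the pattern requires at least two characters.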

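### Step 2.2: Implementing Bag of Words from scratch ###

Before we dive into scikit-learn's Bag of Words (BoW) library to do the dirty work for us, let's implement it ourselves first so that we can understand what is happening behind the scenes.

**Step 1: Convert all strings to their lower case form.**

Let's say we have a document set: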

```
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']
```
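>Instructions:
* Convert all the strings in the document set to lower case and save them in a list called 'lower_case_documents'. You can convert strings to lower case in Python using the `lower()` method.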


In [12]:
'''
Solution:
'''
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

lower_case_documents = []
for i in documents:
    lower_case_documents.append(i.lower())
print(lower_case_documents)

['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello, call hello you tomorrow?']


**Step 2: Removing all punctuation**

> **Instructions:**
Remove all punctuation from the strings in the document set. Save them in a list called
'sans_punctuation_documents'.

In [15]:
'''
Solution:
'''
import string

sans_punctuation_documents = []

for i in lower_case_documents:
    # string.punctuation contains all ASCII punctuation characters.
    # str.maketrans('', '', string.punctuation) builds a translation table that maps
    # each of those characters to None, and translate() then deletes them from the string.
    sans_punctuation_documents.append(i.translate(str.maketrans('', '', string.punctuation)))
    
print(sans_punctuation_documents)

['hello how are you', 'win money win from home', 'call me now', 'hello call hello you tomorrow']


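**Step 3: Tokenization**

Tokenizing a sentence in a document set means splitting the sentence into individual words using a delimiter. The delimiter specifies which character marks the beginning and end of a word (for example, we could use a single space as the delimiter for identifying words in our document set).

>**Instructions:**
Tokenize the strings stored in 'sans_punctuation_documents' using the `split()` method, and store the final document set
in a list called 'preprocessed_documents'.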


In [16]:
'''
Solution:
'''
preprocessed_documents = []
for i in sans_punctuation_documents:
    preprocessed_documents.append(i.split(' '))
print(preprocessed_documents)

[['hello', 'how', 'are', 'you'], ['win', 'money', 'win', 'from', 'home'], ['call', 'me', 'now'], ['hello', 'call', 'hello', 'you', 'tomorrow']]
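One detail of the delimiter choice is worth a quick look. Splitting on a single space works for the clean sample documents above, but real messages often contain repeated or trailing spaces. The sketch below, using a made-up string, compares `split(' ')` with `split()` called without an argument.

```python
text = "win  money now "  # note the double space and the trailing space

with_delimiter = text.split(" ")   # an explicit single-space delimiter keeps empty strings
without_argument = text.split()    # no argument: splits on any run of whitespace

print(with_delimiter)
print(without_argument)
```

With messier input, `split()` without an argument avoids empty tokens ending up in the word counts.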


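**Step 4: Count frequencies**

Now that we have our document set in the required format, we can proceed to count the occurrences of each word in each document of the set. We will use the `Counter` class from the Python `collections` library for this purpose.

`Counter` counts the occurrences of each item in a list and returns a dictionary-like object in which each key is an item being counted and the corresponding value is the count of that item in the list.

> **Instructions:**
Using `Counter()` with 'preprocessed_documents' as the input, create a dictionary for each document in which the keys are the words in that document and the values are the frequencies of those words. Save each `Counter` dictionary as an item in a list called 'frequency_list'.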


In [17]:
'''
Solution
'''
frequency_list = []
import pprint
from collections import Counter

for i in preprocessed_documents:
    frequency_counts = Counter(i)
    frequency_list.append(frequency_counts)
    
pprint.pprint(frequency_list)

[Counter({'hello': 1, 'how': 1, 'are': 1, 'you': 1}),
 Counter({'win': 2, 'money': 1, 'from': 1, 'home': 1}),
 Counter({'call': 1, 'me': 1, 'now': 1}),
 Counter({'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1})]


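Congratulations! You have implemented the Bag of Words process from scratch! As we can see in our previous output, we have a frequency distribution dictionary for each document, which gives a clear view of the text we are dealing with.

We should now have a solid understanding of what is happening behind the scenes in scikit-learn's `sklearn.feature_extraction.text.CountVectorizer` class.

We will now implement `sklearn.feature_extraction.text.CountVectorizer` in the next step.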
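The per-document counters still have to be lined up against a shared vocabulary before they form the matrix described in Step 2.1. A minimal sketch of that last step, repeating the preprocessing in one helper (the name `preprocess` is chosen just for this example):

```python
import string
from collections import Counter

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

def preprocess(document):
    """Lowercase, strip punctuation and split on whitespace."""
    cleaned = document.lower().translate(str.maketrans('', '', string.punctuation))
    return cleaned.split()

counts = [Counter(preprocess(doc)) for doc in documents]

# One column per distinct word, in alphabetical order.
vocabulary = sorted(set(word for counter in counts for word in counter))

# One row per document; a Counter returns 0 for words it has not seen.
matrix = [[counter[word] for word in vocabulary] for counter in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has the same length as the vocabulary, and most entries are zero, which is why library implementations typically store this kind of matrix in a sparse format.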

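### Step 2.3: Implementing Bag of Words in scikit-learn ###

Now that we have implemented the BoW concept from scratch, let's go ahead and use scikit-learn to do this process in a clean and succinct way. We will use the same document set as in the previous step.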

In [18]:
'''
Here we will look to create a frequency matrix on a smaller document set to make sure we understand how the 
document-term matrix generation happens. We have created a sample document set 'documents'.
'''
documents = ['Hello, how are you!',
                'Win money, win from home.',
                'Call me now.',
                'Hello, Call hello you tomorrow?']

> **Instructions:**
Import `sklearn.feature_extraction.text.CountVectorizer` and create an instance of it called 'count_vector'.

In [19]:
'''
Solution
'''
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()

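**Data preprocessing with CountVectorizer()**

In Step 2.2, we implemented a version of `CountVectorizer()` from scratch that required us to clean our data first. This cleaning involved converting all of our data to lower case and removing all punctuation marks. `CountVectorizer()` has certain parameters that take care of these steps for us. They are:

* `lowercase = True`
    
    The `lowercase` parameter has a default value of `True`, which converts all of our text to its lower case form.


* `token_pattern = (?u)\\b\\w\\w+\\b`
    
    The `token_pattern` parameter has a default regular expression value of `(?u)\\b\\w\\w+\\b`, which ignores all punctuation marks, treating them as delimiters, while accepting alphanumeric strings of length greater than or equal to 2 as individual tokens or words.
    
* `stop_words`

    The `stop_words` parameter, if set to `english`, removes all words from our document set that match the list of English stop words defined in scikit-learn. Considering the size of our dataset and the fact that we are dealing with SMS messages and not larger text sources such as e-mail, we will not set this parameter.

You can take a look at all the parameter values of your `count_vector` object by simply printing it, as follows: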


In [20]:
'''
Practice node:
Print the 'count_vector' object which is an instance of 'CountVectorizer()'
'''
print(count_vector)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)


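> **Instructions:**
Fit your document dataset to the `CountVectorizer` object you have created using `fit()`, and get the list of words
that have been categorized as features using the `get_feature_names()` method.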

In [21]:
'''
Solution:
'''
count_vector.fit(documents)
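# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
count_vector.get_feature_names()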

['are',
 'call',
 'from',
 'hello',
 'home',
 'how',
 'me',
 'money',
 'now',
 'tomorrow',
 'win',
 'you']

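The `get_feature_names()` method returns the feature names for this dataset, which is the set of words that make up the vocabulary of 'documents'.

> **Instructions:**
Create a matrix in which the rows are the 4 documents and the columns are the words.
The corresponding (row, column) value is the frequency with which that word (in the column) occurs in that particular
document (in the row). You can do this using the `transform()` method, passing in the document dataset as the
argument. The `transform()` method returns a SciPy sparse matrix of integer counts, which you can convert to a NumPy array using
`toarray()`. Call the array 'doc_array'.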


In [22]:
'''
Solution
'''
doc_array = count_vector.transform(documents).toarray()
doc_array

array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1]])

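Now we have a clean representation of the documents in terms of the frequency distribution of the words in them. To make it easier to understand, our next step is to convert this array into a DataFrame and name the columns appropriately.

> **Instructions:**
Convert the array we obtained, 'doc_array', into a DataFrame and set the column names to
the word names (which you computed earlier using `get_feature_names()`). Call the DataFrame 'frequency_matrix'.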


In [23]:
'''
Solution
'''
frequency_matrix = pd.DataFrame(doc_array, 
                                columns = count_vector.get_feature_names())
frequency_matrix

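| | are | call | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |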


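Congratulations! You have successfully implemented a Bag of Words problem for a document dataset that we created.

One potential issue that can arise from using this method out of the box is that, if our text dataset is extremely large (say, a large collection of news articles or email data), certain words will be far more common than others simply because of the structure of the language itself. For example, words like 'is', 'the' and 'an', pronouns, grammatical constructs and so on could skew our matrix and affect our analysis.

There are a couple of ways to mitigate this. One way is to use the `stop_words` parameter and set its value to `english`. This automatically ignores all words in our input text that are found in scikit-learn's built-in list of English stop words.

Another way of mitigating this is to use the [tfidf](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) method. This method is out of scope for this lesson.

### Step 3.1: Training and testing sets ###

Now that we have understood how to deal with the Bag of Words problem, we can get back to our dataset and proceed with our analysis. Our first step in this regard is to split our dataset into a training set and a testing set so that we can test our model later.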


> **Instructions:**
Split the dataset into a training and testing set by using the `train_test_split` method in sklearn. Split the data
using the following variables:
* `X_train` is our training data for the 'sms_message' column.
* `y_train` is our training data for the 'label' column.
* `X_test` is our testing data for the 'sms_message' column.
* `y_test` is our testing data for the 'label' column.

Print out the number of rows we have in each of our training and testing data.


In [25]:
'''
Solution

NOTE: train_test_split lives in sklearn.model_selection; the older sklearn.cross_validation
module was deprecated in scikit-learn 0.18 and removed in 0.20.
'''
# split into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393
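The 4179/1393 split in the output corresponds to holding out 25% of the rows, which is what `train_test_split` does when no `test_size` is given. The sketch below reproduces just the arithmetic and the idea of a seeded shuffle in plain Python; `simple_train_test_split` is a hypothetical helper and does not reproduce scikit-learn's exact shuffling, so the individual rows that land in each set will differ.

```python
import math
import random

def simple_train_test_split(items, test_fraction=0.25, seed=1):
    """Shuffle indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * len(items))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

rows = list(range(5572))  # stand-in for the 5,572 messages
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state=1` in the cell above: the split is random but reproducible from run to run.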


### Step 3.2: Applying Bag of Words processing to our dataset. ###

Now that we have split the data, our next objective is to follow the steps from Step 2: Bag of Words and convert our data into the desired matrix format. To do this we will be using `CountVectorizer()` as we did before. There are two steps to consider here:

* Firstly, we have to fit `CountVectorizer()` on our training data (`X_train`) and return the matrix.
* Secondly, we have to transform our testing data (`X_test`) to return the matrix.

Note that `X_train` is our training data for the 'sms_message' column in our dataset and we will be using this to train our model. 

`X_test` is our testing data for the 'sms_message' column, and this is the data we will be using (after transformation to a matrix) to make predictions on. We will then compare those predictions with `y_test` in a later step.

For now, we have provided the code that does the matrix transformations for you!

In [None]:
'''
[Practice Node]

The code for this segment is in 2 parts. Firstly, we are learning a vocabulary dictionary for the training data 
and then transforming the data into a document-term matrix; secondly, for the testing data we are only 
transforming the data into a document-term matrix.

This is similar to the process we followed in Step 2.3

We will provide the transformed data to students in the variables 'training_data' and 'testing_data'.
'''

In [26]:
'''
Solution
'''
# Instantiate the CountVectorizer method
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)
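Why the vectorizer is fitted on the training data only can be seen on a toy example. In the sketch below (the two tiny text lists are made up for illustration), the vocabulary is learned from the training texts, and words that appear only in the test text simply have no column to be counted in.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["win money now", "call me later"]
test_texts = ["win a prize tomorrow"]

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)  # learns the vocabulary
test_matrix = vectorizer.transform(test_texts)        # reuses it unchanged

vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(test_matrix.toarray())
```

Both matrices end up with the same columns, which is what the classifier needs: it can only use word statistics it has seen during training.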

### Step 4.1: Bayes Theorem implementation from scratch ###

Now that we have our dataset in the format that we need, we can move on to the next portion of our mission: the algorithm we will use to classify a message as spam or not spam. Remember that at the start of the mission we briefly discussed Bayes' theorem; now we shall go into a little more detail. In layman's terms, Bayes' theorem calculates the probability of an event occurring, based on certain other probabilities that are related to the event in question. It is composed of a prior (the probabilities that we are aware of or that are given to us) and a posterior (the probabilities we are looking to compute using the priors).

Let us implement Bayes' theorem from scratch using a simple example. Let's say we are trying to find the odds of an individual having diabetes, given that he or she was tested for it and got a positive result.
In the medical field, such probabilities play a very important role, as they usually concern life-and-death situations.

We assume the following:

`P(D)` is the probability of a person having diabetes. Its value is `0.01`; in other words, 1% of the general population has diabetes (disclaimer: these values are assumptions and are not reflective of any medical study).

`P(Pos)` is the probability of getting a positive test result.

`P(Neg)` is the probability of getting a negative test result.

`P(Pos|D)` is the probability of getting a positive result on a test done for detecting diabetes, given that you have diabetes. This has a value `0.9`. In other words the test is correct 90% of the time. This is also called the Sensitivity or True Positive Rate.

`P(Neg|~D)` is the probability of getting a negative result on a test done for detecting diabetes, given that you do not have diabetes. This also has a value of `0.9` and is therefore correct, 90% of the time. This is also called the Specificity or True Negative Rate.

The Bayes formula is as follows:

<img src="images/bayes_formula.png" height="242" width="242">

* `P(A)` is the prior probability of A occurring independently. In our example this is `P(D)`. This value is given to us.

* `P(B)` is the prior probability of B occurring independently. In our example this is `P(Pos)`.

* `P(A|B)` is the posterior probability that A occurs given B. In our example this is `P(D|Pos)`. That is, **the probability of an individual having diabetes, given that, that individual got a positive test result. This is the value that we are looking to calculate.**

* `P(B|A)` is the likelihood probability of B occurring, given A. In our example this is `P(Pos|D)`. This value is given to us.

Putting our values into the formula for Bayes theorem we get:

`P(D|Pos) = P(D) * P(Pos|D) / P(Pos)`

The probability of getting a positive test result `P(Pos)` can be calculated using the Sensitivity and Specificity as follows:

`P(Pos) = [P(D) * Sensitivity] + [P(~D) * (1 - Specificity)]`

In [27]:
'''
Instructions:
Calculate probability of getting a positive test result, P(Pos)
'''


In [28]:
'''
Solution
'''
# P(D)
p_diabetes = 0.01

# P(~D)
p_no_diabetes = 0.99

# Sensitivity or P(Pos|D)
p_pos_diabetes = 0.9

# Specificity or P(Neg|~D)
p_neg_no_diabetes = 0.9

# P(Pos) = P(D) * Sensitivity + P(~D) * (1 - Specificity)
p_pos = (p_diabetes * p_pos_diabetes) + (p_no_diabetes * (1 - p_neg_no_diabetes))
print('The probability of getting a positive test result P(Pos) is: {}'.format(p_pos))

**Using all of this information we can calculate our posteriors as follows:**

The probability of an individual having diabetes, given that the individual got a positive test result:

`P(D|Pos) = (P(D) * Sensitivity) / P(Pos)`

The probability of an individual not having diabetes, given that the individual got a positive test result:

`P(~D|Pos) = (P(~D) * (1-Specificity)) / P(Pos)`

The sum of our posteriors will always equal `1`. 

In [None]:
'''
Instructions:
Compute the probability of an individual having diabetes, given that, that individual got a positive test result.
In other words, compute P(D|Pos).

The formula is: P(D|Pos) = (P(D) * P(Pos|D)) / P(Pos)
'''

In [None]:
'''
Solution
'''
# P(D|Pos)
p_diabetes_pos = (p_diabetes * p_pos_diabetes) / p_pos
print('Probability of an individual having diabetes, given that that individual got a positive test result is:',
      format(p_diabetes_pos))

In [None]:
'''
Instructions:
Compute the probability of an individual not having diabetes, given that, that individual got a positive test result.
In other words, compute P(~D|Pos).

The formula is: P(~D|Pos) = P(~D) * P(Pos|~D) / P(Pos)

Note that P(Pos|~D) can be computed as 1 - P(Neg|~D). 

Therefore:
P(Pos|~D) = p_pos_no_diabetes = 1 - 0.9 = 0.1
'''

In [None]:
'''
Solution
'''
# P(Pos|~D)
p_pos_no_diabetes = 0.1

# P(~D|Pos)
p_no_diabetes_pos = (p_no_diabetes * p_pos_no_diabetes) / p_pos
print('Probability of an individual not having diabetes, given that that individual got a positive test result is:',
      p_no_diabetes_pos)

Congratulations! You have implemented Bayes' theorem from scratch. Your analysis shows that even if you get a positive test result, there is only an 8.33% chance that you actually have diabetes and a 91.67% chance that you do not. This of course assumes that only 1% of the entire population has diabetes, which is only an assumption.
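For reference, the arithmetic behind those two percentages can be written out step by step, using only the assumed values given above (`P(D) = 0.01`, sensitivity `0.9`, specificity `0.9`):

```latex
\begin{aligned}
P(\mathrm{Pos}) &= P(D)\,P(\mathrm{Pos}\mid D) + P(\neg D)\,P(\mathrm{Pos}\mid \neg D) \\
                &= 0.01 \times 0.9 + 0.99 \times (1 - 0.9) = 0.009 + 0.099 = 0.108 \\[4pt]
P(D \mid \mathrm{Pos}) &= \frac{P(D)\,P(\mathrm{Pos}\mid D)}{P(\mathrm{Pos})} = \frac{0.009}{0.108} \approx 0.0833 \\[4pt]
P(\neg D \mid \mathrm{Pos}) &= \frac{P(\neg D)\,P(\mathrm{Pos}\mid \neg D)}{P(\mathrm{Pos})} = \frac{0.099}{0.108} \approx 0.9167 \\[4pt]
P(D \mid \mathrm{Pos}) + P(\neg D \mid \mathrm{Pos}) &= 1
\end{aligned}
```

The posterior is small because the 1% prior is small: false positives from the large healthy group (0.099) outweigh true positives from the small diabetic group (0.009).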

**What does the term 'Naive' in 'Naive Bayes' mean?**

The term 'Naive' in Naive Bayes comes from the fact that the algorithm considers the features that it uses to make predictions to be independent of each other, which may not always be the case. In our diabetes example we consider only one feature, the test result. Say we added another feature, 'exercise', with a binary value of `0` or `1`, where the former signifies that the individual exercises 2 days a week or fewer and the latter signifies that the individual exercises 3 days a week or more. If we wanted to use both of these features, the test result and the value of the 'exercise' feature, to compute our final probabilities, Bayes' theorem on its own would not be enough, because it would need the joint likelihood of both features, which we have not been given. Naive Bayes is an extension of Bayes' theorem that assumes that all the features are independent of each other given the class, so the joint likelihood becomes a product of per-feature likelihoods.

### Step 4.2: Naive Bayes implementation from scratch ###



Now that you have understood the ins and outs of Bayes' theorem, we will extend it to consider cases where we have more than one feature.

Let's say that we have two political parties' candidates, 'Jill Stein' of the Green Party and 'Gary Johnson' of the Libertarian Party and we have the probabilities of each of these candidates saying the words 'freedom', 'immigration' and 'environment' when they give a speech:

* Probability that Jill Stein says 'freedom': 0.1 ---------> `P(F|J)`
* Probability that Jill Stein says 'immigration': 0.1 -----> `P(I|J)`
* Probability that Jill Stein says 'environment': 0.8 -----> `P(E|J)`


* Probability that Gary Johnson says 'freedom': 0.7 -------> `P(F|G)`
* Probability that Gary Johnson says 'immigration': 0.2 ---> `P(I|G)`
* Probability that Gary Johnson says 'environment': 0.1 ---> `P(E|G)`


And let us also assume that the probability of Jill Stein giving a speech, `P(J)` is `0.5` and the same for Gary Johnson, `P(G) = 0.5`. 


Given this, what if we had to find the probability that a speech was given by Jill Stein, given that it contains the words 'freedom' and 'immigration'? This is where the Naive Bayes theorem comes into play, as we are considering two features, 'freedom' and 'immigration'.

Now we are at a place where we can define the formula for the Naive Bayes' theorem:

<img src="images/naivebayes.png" height="342" width="342">

Here, `y` is the class variable, or in our case the name of the candidate, and `x1` through `xn` are the features, or in our case the individual words. The theorem makes the assumption that the features or words (`xi`) are independent of each other given the class.

To break this down, we have to compute the following posterior probabilities:

* `P(J|F,I)`: Probability that the speaker is Jill Stein, given that the words 'freedom' and 'immigration' were said.

    Using the formula and our knowledge of Bayes' theorem, we can compute this as follows: `P(J|F,I)` = `(P(J) * P(F|J) * P(I|J)) / P(F,I)`. Here `P(F,I)` is the probability of the words 'freedom' and 'immigration' being said in a speech.
    

* `P(G|F,I)`: Probability that the speaker is Gary Johnson, given that the words 'freedom' and 'immigration' were said.
    
    Using the formula, we can compute this as follows: `P(G|F,I)` = `(P(G) * P(F|G) * P(I|G)) / P(F,I)`

In [None]:
'''
Instructions: Compute the probability of the words 'freedom' and 'immigration' being said in a speech, or
P(F,I).

The first step is multiplying the probabilities of Jill Stein giving a speech with her individual 
probabilities of saying the words 'freedom' and 'immigration'. Store this in a variable called p_j_text

The second step is multiplying the probabilities of Gary Johnson giving a speech with his individual 
probabilities of saying the words 'freedom' and 'immigration'. Store this in a variable called p_g_text

The third step is to add both of these probabilities and you will get P(F,I).
'''

In [None]:
'''
Solution: Step 1
'''
# P(J)
p_j = 0.5

# P(F|J)
p_j_f = 0.1

# P(I|J)
p_j_i = 0.1

# P(J) * P(F|J) * P(I|J)
p_j_text = p_j * p_j_f * p_j_i
print(p_j_text)

In [None]:
'''
Solution: Step 2
'''
# P(G)
p_g = 0.5

# P(F|G)
p_g_f = 0.7

# P(I|G)
p_g_i = 0.2

# P(G) * P(F|G) * P(I|G)
p_g_text = p_g * p_g_f * p_g_i
print(p_g_text)

In [None]:
'''
Solution: Step 3: Compute P(F,I) and store in p_f_i
'''
p_f_i = p_j_text + p_g_text
print('Probability of the words freedom and immigration being said is: ', format(p_f_i))

Now we can compute `P(J|F,I)`, the probability that the speaker is Jill Stein given the words 'freedom' and 'immigration', and `P(G|F,I)`, the probability that the speaker is Gary Johnson given the same words.

In [None]:
'''
Instructions:
Compute P(J|F,I) using the formula P(J|F,I) = (P(J) * P(F|J) * P(I|J)) / P(F,I) and store it in a variable p_j_fi
'''

In [None]:
'''
Solution
'''
p_j_fi = p_j_text / p_f_i
print('The probability that the speaker is Jill Stein, given the words Freedom and Immigration: ', format(p_j_fi))

In [None]:
'''
Instructions:
Compute P(G|F,I) using the formula P(G|F,I) = (P(G) * P(F|G) * P(I|G)) / P(F,I) and store it in a variable p_g_fi
'''

In [None]:
'''
Solution
'''
p_g_fi = p_g_text / p_f_i
print('The probability that the speaker is Gary Johnson, given the words Freedom and Immigration: ', format(p_g_fi))

And as we can see, just like in the Bayes' theorem case, the sum of our posteriors is equal to 1. Congratulations! You have implemented the Naive Bayes theorem from scratch. Our analysis shows that, for a speech containing the words 'freedom' and 'immigration', there is only a 6.7% chance that the speaker is Jill Stein of the Green Party, compared with a 93.3% chance for Gary Johnson of the Libertarian Party.
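Written out with the numbers assumed in this example, the computation is:

```latex
\begin{aligned}
P(F, I) &= P(J)\,P(F\mid J)\,P(I\mid J) + P(G)\,P(F\mid G)\,P(I\mid G) \\
        &= 0.5 \times 0.1 \times 0.1 + 0.5 \times 0.7 \times 0.2 = 0.005 + 0.07 = 0.075 \\[4pt]
P(J \mid F, I) &= \frac{0.005}{0.075} \approx 0.0667 \\[4pt]
P(G \mid F, I) &= \frac{0.07}{0.075} \approx 0.9333
\end{aligned}
```

Because both candidates have the same prior of 0.5, the result is driven entirely by the per-word likelihoods, and the naive assumption is what allows those likelihoods to be multiplied together.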

Another more generic example of Naive Bayes in action is when we search for the term 'Sacramento Kings' in a search engine. For us to get results pertaining to the Sacramento Kings NBA basketball team, the search engine needs to be able to associate the two words and not treat them individually. Otherwise we would get images tagged with 'Sacramento', such as pictures of city landscapes, and images of 'Kings', which could be pictures of crowns or kings from history, when what we are looking for are images of the basketball team. This is a classic case of the search engine treating the words as independent entities and hence being 'naive' in its approach.


Applying this to our problem of classifying messages as spam, the Naive Bayes algorithm *looks at each word individually and not as associated entities* with any kind of link between them. In the case of spam detectors, this usually works, as there are certain red flag words which can almost guarantee a message's classification as spam; for example, emails with words like 'viagra' are usually classified as spam.
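To see what "looking at each word individually" means in practice, here is a minimal from-scratch sketch of a word-count Naive Bayes classifier on a tiny made-up training set. Two things in it are assumptions that go beyond the text above: it adds Laplace (add-one) smoothing so that a word never seen with a class does not force the probability to zero, and it sums log probabilities instead of multiplying raw ones to avoid numerical underflow. It is an illustration of the idea, not a description of scikit-learn's internals.

```python
import math
from collections import Counter

# Tiny made-up training set, for illustration only.
train = [
    ("win free cash prize now", "spam"),
    ("free entry win cash", "spam"),
    ("are you coming home for lunch", "ham"),
    ("call me when you get home", "ham"),
]

# Count words per class and messages per class.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = set(word for counts in word_counts.values() for word in counts)

def log_posterior(text, label, alpha=1.0):
    """Unnormalised log P(label | words), with add-alpha smoothing (an assumption)."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))  # log prior
    for word in text.split():
        if word not in vocabulary:
            continue  # words never seen in training are ignored
        likelihood = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocabulary))
        score += math.log(likelihood)  # each word contributes independently
    return score

def predict(text):
    return max(("spam", "ham"), key=lambda label: log_posterior(text, label))

print(predict("win cash now"))
print(predict("are you home"))
```

Each word adds its own term to the score regardless of its neighbours or position, which is the 'naive' independence assumption at work.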

### Step 5: Naive Bayes implementation using scikit-learn ###

Thankfully, sklearn has several Naive Bayes implementations that we can use, so we do not have to do the math from scratch. We will be using the `sklearn.naive_bayes` module to make predictions on our dataset.

Specifically, we will be using the multinomial Naive Bayes implementation. This particular classifier is suitable for classification with discrete features (such as, in our case, word counts for text classification). It takes in integer word counts as its input. Gaussian Naive Bayes, on the other hand, is better suited to continuous data, as it assumes that the input data has a Gaussian (normal) distribution.

In [None]:
'''
Instructions:

We have loaded the training data into the variable 'training_data' and the testing data into the 
variable 'testing_data'.

Import the MultinomialNB classifier and fit the training data into the classifier using fit(). Name your classifier
'naive_bayes'. You will be training the classifier using 'training_data' and 'y_train' from our split earlier. 
'''

In [None]:
'''
Solution
'''
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

In [None]:
'''
Instructions:
Now that our algorithm has been trained using the training data set we can now make some predictions on the test data
stored in 'testing_data' using predict(). Save your predictions into the 'predictions' variable.
'''

In [None]:
'''
Solution
'''
predictions = naive_bayes.predict(testing_data)

Now that predictions have been made on our test set, we need to check the accuracy of our predictions.

### Step 6: Evaluating our model ###

Now that we have made predictions on our test set, our next goal is to evaluate how well our model is doing. There are various mechanisms for doing so, but first let's do a quick recap of them.

**Accuracy** measures how often the classifier makes the correct prediction. It’s the ratio of the number of correct predictions to the total number of predictions (the number of test data points).

**Precision** tells us what proportion of the messages we classified as spam actually were spam.
It is the ratio of true positives (messages classified as spam that actually are spam) to all positives (all messages classified as spam, irrespective of whether that was the correct classification). In other words, it is the ratio

`[True Positives/(True Positives + False Positives)]`

**Recall (sensitivity)** tells us what proportion of the messages that actually were spam we classified as spam.
It is the ratio of true positives (messages classified as spam that actually are spam) to all the messages that actually were spam. In other words, it is the ratio

`[True Positives/(True Positives + False Negatives)]`

For classification problems with skewed class distributions, as in our case, accuracy by itself is not a very good metric. For example, if we had 100 text messages of which only 2 were spam, we could classify 90 messages as not spam (including the 2 that were spam, which would therefore be false negatives) and 10 as spam (all 10 false positives) and still get a reasonably good accuracy score. For such cases, precision and recall come in very handy. These two metrics can be combined into the F1 score, which is the harmonic mean of the precision and recall scores. This score can range from 0 to 1, with 1 being the best possible F1 score.

We will be using all 4 metrics to make sure our model does well. For all 4 metrics whose values can range from 0 to 1, having a score as close to 1 as possible is a good indicator of how well our model is doing.
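The skewed example above (100 messages, 2 of them spam, both missed, and 10 legitimate messages wrongly flagged) can be checked with a few lines of plain Python. The helper `classification_metrics` is written for this illustration; returning 0.0 when a denominator is zero is a convention chosen here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1

# 0 spam caught, 10 false alarms, 2 spam missed, 88 legitimate messages passed.
accuracy, precision, recall, f1 = classification_metrics(tp=0, fp=10, fn=2, tn=88)
print(accuracy, precision, recall, f1)
```

An accuracy of 0.88 looks respectable, yet precision, recall and F1 are all 0 because not a single spam message was caught, which is why accuracy alone is misleading on skewed data.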

In [None]:
'''
Instructions:
Compute the accuracy, precision, recall and F1 scores of your model using your test data 'y_test' and the predictions
you made earlier stored in the 'predictions' variable.
'''

In [None]:
'''
Solution
'''
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

### Step 7: Conclusion ###

One of the major advantages that Naive Bayes has over other classification algorithms is its ability to handle an extremely large number of features. In our case, each word is treated as a feature and there are thousands of different words. It also performs well even in the presence of irrelevant features and is relatively unaffected by them. The other major advantage is its relative simplicity. Naive Bayes works well right out of the box and tuning its parameters is rarely necessary, except usually in cases where the distribution of the data is known.
It rarely overfits the data. Another important advantage is that its model training and prediction times are very fast for the amount of data it can handle. All in all, Naive Bayes really is a gem of an algorithm!

Congratulations! You have successfully designed a model that can efficiently predict if an SMS message is spam or not!

Thank you for learning with us!