<a href="https://colab.research.google.com/github/marreapato/Deep_Learning_Course/blob/main/mlcrs_de_NaiveBayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Naive Bayes (the easy way)

We'll cheat by using sklearn.naive_bayes to train a spam classifier! Most of the code below just loads our training emails into a pandas DataFrame that we can play with:

In [1]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [2]:

%cd /content/gdrive/MyDrive/mlcrs

!ls

/content/gdrive/MyDrive/mlcrs
access_log.txt			      KFoldCrossValidation.ipynb     PolynomialRegression.ipynb
breakfast.jpg			      KMeans.ipynb		     Python101.ipynb
bridge.jpg			      KNN.ipynb			     Q-Learning.ipynb
bunny.jpg			      LinearRegression.ipynb	     regression.txt
castle.jpg			      mammographic_masses.data.txt   Seaborn.ipynb
ConditionalProbabilityExercise.ipynb  mammographic_masses.names.txt  SimilarMovies.ipynb
ConditionalProbabilitySolution.ipynb  mammo_masses_project.ipynb     SparkDecisionTree.py
CovarianceCorrelation.ipynb	      MatPlotLib.ipynb		     SparkKMeans.py
DecisionTree.ipynb		      MeanMedianExercise.ipynb	     SparkLinearRegression.py
DeepLearningProject.ipynb	      MeanMedianMode.ipynb	     SparkPCA.py
DeepLearningProject-Solution.ipynb    ml-100k			     StdDevVariance.ipynb
Distributions.ipynb		      MLCourse.zip		     subset-small.tsv
emails				      MNIST_data		     SVC.ipynb
fighterjet.jpg			      Moments.ipynb		     Tensorflow.ipynb
FinalProject

In [3]:
data_dir = "/content/gdrive/MyDrive/mlcrs/emails/spam"
data_dir2 = "/content/gdrive/MyDrive/mlcrs/emails/ham"



In [4]:
import os
import io
import numpy
import pandas as pd
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

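def readFiles(path):
    # Walk the directory tree and yield (file path, message body) for every file.
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            file_path = os.path.join(root, filename)

            inBody = False
            lines = []
            # latin1 maps every byte to a character, so decoding never fails.
            with io.open(file_path, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        # The first blank line separates the headers from the body.
                        inBody = True
            # Each line keeps its own trailing newline, so this join doubles them
            # (visible as \n\n in the output below); harmless for word counting.
            message = '\n'.join(lines)
            yield file_path, message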


def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

# Build one DataFrame holding both classes, indexed by file path.
data = pd.concat([
    dataFrameFromDirectory(data_dir, "spam"),
    dataFrameFromDirectory(data_dir2, "ham"),
])

# Older pandas only (DataFrame.append was deprecated in 1.4 and removed in 2.0):
#data = DataFrame({'message': [], 'class': []})
#data = data.append(dataFrameFromDirectory('emails/spam', 'spam'))
#data = data.append(dataFrameFromDirectory('emails/ham', 'ham'))


Let's have a look at that DataFrame:

In [5]:
data.head()

| | message | class |
|---|---|---|
| /content/gdrive/MyDrive/mlcrs/emails/spam/00001.7848dde101aa985090474a91ec93fcf0 | `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...` | spam |
| /content/gdrive/MyDrive/mlcrs/emails/spam/00020.29725cf331fc21e18a1809e7d8b27332 | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` | spam |
| /content/gdrive/MyDrive/mlcrs/emails/spam/00010.445affef4c70feec58f9198cfbc22997 | `Dear ricardo1 ,\n\n\n\n<html>\n\n<body>\n\n<ce...` | spam |
| /content/gdrive/MyDrive/mlcrs/emails/spam/00004.eac8de8d759b7e74154f142194282724 | `##############################################...` | spam |
| /content/gdrive/MyDrive/mlcrs/emails/spam/00005.57696a39d7d84318ce497886896bf90d | `I thought you might like these:\n\n1) Slim Dow...` | spam |


Now we will use a CountVectorizer to turn each message into a vector of word counts, and feed those counts into a MultinomialNB classifier. Call fit() and we've got a trained spam filter ready to go! It's just that easy.

In [16]:
import numpy as np

In [6]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

counts

<3000x62964 sparse matrix of type '<class 'numpy.int64'>'
	with 429785 stored elements in Compressed Sparse Row format>

In [23]:
print(counts.shape)
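# Rows are messages, columns are vocabulary words, values are word counts.
# Note: toarray() materializes the full dense matrix, which is large here.
print(counts.toarray())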

(3000, 62964)
[[1 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [24]:
classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

Let's try it out:

In [25]:
examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham'], dtype='<U4')
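
`predict()` only returns the winning label. If you also want to see how confident the classifier is, `predict_proba()` returns one probability per class, in the order given by `classes_`. The sketch below trains on a tiny made-up corpus (the messages, labels, and `toy_` names are assumptions for illustration), so the numbers it prints say nothing about the real email data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set, not the email data used above.
toy_messages = [
    "free money now",
    "win free prize",
    "lunch at noon",
    "meeting notes attached",
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

# Words the vectorizer never saw during fit ("inside") are simply ignored.
new_counts = toy_vectorizer.transform(["free prize inside"])
probabilities = toy_classifier.predict_proba(new_counts)

# classes_ gives the label order of the probability columns.
for label, p in zip(toy_classifier.classes_, probabilities[0]):
    print(label, round(p, 3))
```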

## Activity

Our data set is small, so our spam classifier isn't actually very good. Try running some different test emails through it and see if you get the results you expect.

If you really want to challenge yourself, try applying a train/test split to this spam classifier: hold out a subset of the ham and spam emails, train on the rest, and see how well the classifier predicts the held-out messages.
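
One possible starting point for the train/test challenge, sketched on a made-up toy corpus so it runs on its own. The messages, labels, and `split_` names are assumptions for illustration; in the notebook you would pass `data['message'].values` and `data['class'].values` instead. Note that the vectorizer is fitted on the training messages only, so the test set stays unseen.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for data['message'].values and data['class'].values.
messages = [
    "free money now", "win a free prize today",
    "cheap pills online", "claim your free reward",
    "lunch at noon tomorrow", "meeting notes attached",
    "see you at the golf course", "project update for monday",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Hold out 25% of the messages; stratify keeps the spam/ham ratio in both parts.
train_msgs, test_msgs, train_labels, test_labels = train_test_split(
    messages, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit the vocabulary on the training messages only.
split_vectorizer = CountVectorizer()
train_counts = split_vectorizer.fit_transform(train_msgs)

split_classifier = MultinomialNB()
split_classifier.fit(train_counts, train_labels)

# Reuse the fitted vocabulary for the held-out messages, then score them.
test_counts = split_vectorizer.transform(test_msgs)
split_predictions = split_classifier.predict(test_counts)
accuracy = accuracy_score(test_labels, split_predictions)
print("held-out accuracy:", accuracy)
```

With only two held-out messages the accuracy here is not meaningful; the point is the order of the steps, which carries over unchanged to the 3000-email data set.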