###Spam Classifying Naive Bayes - Brady Young
---

For this project I implemented a Naive Bayes classifier that differentiates between spam e-mails and legitimate e-mails. The distribution of each feature is calculated for both the spam and non-spam classes. These distributions are used in combination with the gaussian function to calculate a given feature's "distance" from the class' distribution. For a given instance, these distances are totaled separately for each class and the "closest" class, or the one with the highest probablity is the assigned class. 

## Modules

In [0]:
import numpy as np                #Used for array operations
import math                       #Used for mathematical constants(e, pi) and log
import requests                   #Used to download dataset
import os                         #Used to manage filesystem
import random as rand             #Used to shuffle dataset

##Data Acquisition, Preprocessing, and Plotting

In [0]:
#Creates directory for writing data files
if(os.path.isdir('data') == False):
  os.makedirs('data')
  
#Writes dataset and remote location to local files
link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data'
r = requests.get(link)
with open('data/spambase.data', 'wb') as file:
  file.write(r.content)

### Dataset

In [0]:
#A wrapper to process the dataset after reading
class Dataset:
  def __init__(self):
    self.trainset = []
    self.testset = []
    self.goalset = []
    self.testgoalset = []
    
    self.load()
    self.process()

  #Copies the data into memory
  def load(self):
    with open('data/spambase.data', mode='r') as file:
      self.fileIn = file.read()
    
  def process(self):
    featureSets = self.fileIn.split("\n")
    rand.shuffle(featureSets)
    
    for i in range(int(len(featureSets)/2)):
      self.trainset.append(featureSets[i].split(","))
      self.testset.append(featureSets[i+1].split(","))
      
      #Grab the target from each instance
      self.goalset.append(self.trainset[-1].pop())
      self.testgoalset.append(self.testset[-1].pop())
      i += 1
      
      #For some reason there is an occasional empty set added
      if(self.trainset[-1] == []):
        self.trainset.pop()
        self.goalset.pop()
        
      if(self.testset[-1] == []):
        self.testset.pop()
        self.testgoalset.pop()

    self.trainset = np.array(self.trainset, 'float64')
    self.testset = np.array(self.testset, 'float64')
    self.goalset = np.array(self.goalset, 'float64')
    self.testgoalset = np.array(self.testgoalset, 'float64')

###Plotting

In [0]:
def confMat(confMat):
  print("\t\t Target")
  print("\t\tFalse\tTrue")
  print("Result\tFalse\t", confMat[0][0], "\t", confMat[0][1])
  print("\tTrue\t", confMat[1][0], "\t", confMat[1][1])

## Naive Bayes Classifier

In [0]:
class NaiveBayes:
  def __init__(self, dataset):
    #Instantiate the dataset
    self.data = dataset
    
    #Establish spam prior (not spam prior = 1-spam prior)
    self.prior = np.sum(self.data.goalset) / len(self.data.goalset)
    
    #Small value to modify STD with
    self.epsilon = 0.0001
    self.confMat = np.zeros((2, 2))
    
    #Holds statistical data for each class
    self.spamAvg = []
    self.spamSTD = []
    self.notAvg = []
    self.notSTD = []

    #Execution
    self.computeStats()
    self.run()
    self.calcResults()
    self.printResults()
    
  #Computes the averages and standard deviations
  #for each feature for each class
  def computeStats(self):
    numberOfFeatures = len(self.data.trainset[0])
    numberOfInstances = len(self.data.trainset)
    spamSet = []
    notSet = []
    
    for i in range(len(self.data.trainset)):
      if(self.data.goalset[i]):
        spamSet.append(self.data.trainset[i])
      else:
        notSet.append(self.data.trainset[i])
    
    spamSet = np.array(spamSet)
    notSet = np.array(notSet)
    
    #Separates each feature into its own list
    spamFeatureLists = np.hsplit(spamSet, numberOfFeatures)
    notFeatureLists = np.hsplit(notSet, numberOfFeatures)

    for f in spamFeatureLists:
      self.spamAvg.append(np.sum(f) / numberOfInstances)
      self.spamSTD.append(np.std(f) + self.epsilon)
      
    for f in notFeatureLists:
      self.notAvg.append(np.sum(f) / numberOfInstances)
      self.notSTD.append(np.std(f) + self.epsilon)
    
    
  #Classifies each instance in the test set
  def run(self):
    for i in range(len(self.data.testset)):
      result = self.classify(self.data.testset[i])
      target = int(self.data.testgoalset[i])
      
      self.confMat[result][target] += 1
  
  
  #Iterates through each feature, calculating conditional probablity given the class
  #Returns the class with the highest probablity
  def classify(self, features):
    #Index 0 = not spam
    #      1 = spam
    prob = [math.log(1-self.prior), math.log(self.prior)]
    for i in range(len(features)):
      prob[0] += math.log(self.gaussian(features[i], self.notAvg[i], self.notSTD[i]))
      prob[1] += math.log(self.gaussian(features[i], self.spamAvg[i], self.spamSTD[i]))
    
    return prob.index(max(prob))
  
  
  #Computes the gaussian 
  def gaussian(self, feature, mean, std):
    expNumer = (feature - mean) ** 2
    expDenom = (std ** 2) * 2
    exponent = -1 * (expNumer / expDenom)
    
    denom = math.sqrt(2 * math.pi) * std
    numer = math.e ** exponent
    
    gaussian = numer/denom
    
    #log undefined at 0, set to small value instead
    if(gaussian == 0):
      return self.epsilon
    
    return gaussian
  
  
  #Calculates the accuracy, precision, and recall of the results
  def calcResults(self):
    #Confusion Matrix = [result][target]
    trueP = self.confMat[1][1]
    trueN = self.confMat[0][0]
    falseP = self.confMat[1][0]
    falseN = self.confMat[0][1]
    
    self.accuracy = (trueP + trueN) / len(self.data.testset)
    self.precision = trueP / (trueP + falseP)
    self.recall = trueP / (trueP + falseN)
    
    
  #Prints the results  
  def printResults(self):
    print("Accuracy: ", self.accuracy)
    print("Precision: ", self.precision)
    print("Recall: ", self.recall)

##Execution

In [0]:
dataset = Dataset()
naive = NaiveBayes(dataset)

Accuracy:  0.6936114732724902
Precision:  0.5601328903654486
Recall:  0.9514672686230248


### Confusion Matrix

In [0]:
confMat(naive.confMat)

		 Target
		False	True
Result	False	 753.0 	 43.0
	True	 662.0 	 843.0


##Results

### Usability

It seems that my model classifies more false-positives than false-negatives. This results in real e-mail being classified as spam at a high rate. Before using this to classify e-mail for my own inbox, I would definitely want to minimize the amount of false-positives. It would be better to occasionally see spam in my inbox than miss important e-mails due to inaccurate classification.

### SVM Comparison

The accuracy of this model tends to about 70%, with the variance due to selection of training and test sets. This falls below the general accuracy of the naive SVM classifier which incorporated the whole feature set and varied around 80% by about 10 percentage points. Notably, the precision and recall between the two systems differed as well. While the precision of the naive bayes is fairly low, around 60%, the precision of the SVM was generally around 90%. However, the recall of the naive bayes model is reliably higher, around 94%, while the SVM recall was generally quite a bit less, often under 60%. Overall though, my tuned SVM that considered only the highest weighted features had better accuracy and precision, while just a bit less recall than the Naive Bayes model.

### Feature Dependency

Reading through the names of the features as described (https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names) leads me believe that there are features in the set that are interdependent, namely the last three which describe the number of capital letters. Despite these dependencies, the model is still able to classify e-mails between spam and not spam. And it does so better than random chance on the assumption that all the features are independent from one another.

### Performance Speculation

I suspect that accuracy could be increased if more features were included for instances in the dataset. If the dataset included features similar to: number of misspelled words, number of non-english words, domain of e-mail sender's e-mail address (mapped to an index). In addition, perhaps the occurence of a given word has no bearing on the class of the e-mail when considered in isolation. However, in conjunction with another word or punctuation, always exists in any given spam e-mail.

It seems that features that are equally shared between classes decrease the overall accuracy of the model. The conditional probablity of a feature that is equally likely to be present in instances of both classes does not give us any information about the class the feature belongs to. I suspect some of the features from the dataset my model training were shared between classes. Specifically the features that quantified common words such as "over", "our", "make", "you". If the dataset contained only features that belonged to one class or the other, I suspect the model would be able to classify e-mails more accurately.