In [39]:
import os
import numpy as np
from collections import Counter
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.svm import SVC, NuSVC, LinearSVC
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Preprocessing step:
 Build a dictionary of word frequencies from the training emails, removing non-alphabetic tokens (for example ",") and single-character words.

In [40]:
def make_Dictionary(train_dir):
    emails = [os.path.join(train_dir,f) for f in os.listdir(train_dir)]    
    all_words = []       
    for mail in emails:    
        with open(mail) as m:
            for i,line in enumerate(m):
                if i == 2:  #Body of email is only 3rd line of text file
                    words = line.split()
                    all_words += words
    
    dictionary = Counter(all_words)
    # Remove non-words: tokens containing non-alphabetic characters (e.g. ",")
    # and single-character tokens. Iterate over a copy of the keys because
    # entries are deleted inside the loop.
    for item in list(dictionary.keys()):
        if not item.isalpha() or len(item) == 1:
            del dictionary[item]
    dictionary = dictionary.most_common(3000)
    print(len(dictionary))
    
    return dictionary
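
To see the filtering step in isolation, here is a minimal sketch that applies the same idea to in-memory strings instead of files. The function name `build_vocabulary` and the toy message bodies are assumptions made for illustration; they are not part of the notebook.

```python
from collections import Counter

def build_vocabulary(bodies, max_words=3000):
    """Count words across message bodies and keep the most common real words."""
    counts = Counter()
    for body in bodies:
        counts.update(body.split())
    # Keep only purely alphabetic tokens longer than one character.
    filtered = Counter({w: c for w, c in counts.items()
                        if w.isalpha() and len(w) > 1})
    return filtered.most_common(max_words)

# Hypothetical message bodies, used only for this sketch.
toy_bodies = [
    "free money , click now !",
    "meeting at 10 about the money report",
    "a free report",
]
vocab = build_vocabulary(toy_bodies, max_words=3)
print(vocab)
```

Tokens such as ",", "!", "10" and "a" never reach the vocabulary, which mirrors what `make_Dictionary` does before calling `most_common(3000)`.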

Feature Extractions:

Once the dictionary is ready, we can extract a word-count vector (our feature here) of 3000 dimensions for each email in the training set. Each vector holds the number of times each of the 3000 dictionary words occurs in that email. The next cell generates a feature matrix whose rows correspond to the 702 files of the training set and whose columns correspond to the 3000 dictionary words. The entry at index (i, j) is the number of occurrences of the j-th dictionary word in the i-th file.

In [41]:
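def extract_features(mail_dir):
    # Sort the listing so the row order is deterministic; the label slices
    # used later assume a fixed file order.
    files = sorted(os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir))
    features_matrix = np.zeros((len(files), 3000))
    # Map each dictionary word to its column. Uses the global `dictionary`,
    # the list of (word, count) pairs returned by make_Dictionary.
    word_index = {word: j for j, (word, _) in enumerate(dictionary)}
    for docID, fil in enumerate(files):
        with open(fil) as fi:
            for i, line in enumerate(fi):
                if i == 2:  # Body of email is only 3rd line of text file
                    words = line.split()
                    for word in set(words):
                        if word in word_index:
                            features_matrix[docID, word_index[word]] = words.count(word)
    return features_matrix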
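
To make the layout of the feature matrix concrete, the following sketch builds one from a three-word vocabulary and three in-memory message bodies. The helper `count_matrix` and the toy data are assumptions for illustration only.

```python
import numpy as np

def count_matrix(bodies, vocabulary):
    """Return a (documents x words) matrix of word counts."""
    word_index = {word: j for j, word in enumerate(vocabulary)}
    matrix = np.zeros((len(bodies), len(vocabulary)))
    for i, body in enumerate(bodies):
        for word in body.split():
            j = word_index.get(word)
            if j is not None:  # words outside the vocabulary are ignored
                matrix[i, j] += 1
    return matrix

# Hypothetical vocabulary and message bodies.
toy_vocabulary = ["free", "money", "report"]
toy_bodies = ["free free money", "the report", "hello"]
X = count_matrix(toy_bodies, toy_vocabulary)
print(X)
```

Row 0 counts "free" twice and "money" once, row 1 counts only "report", and row 2 is all zeros because none of its words are in the vocabulary.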

We train two models here: a Naive Bayes classifier and a Support Vector Machine (SVM). Naive Bayes is a conventional and very popular method for document classification. It is a supervised probabilistic classifier based on Bayes' theorem that assumes the features are conditionally independent given the class. SVMs are supervised binary classifiers that are very effective when the number of features is large. An SVM looks for the hyperplane that separates the two classes with the largest margin; the training points that lie closest to this boundary are called the support vectors. The decision function that predicts the class of test data depends only on the support vectors and can make use of the kernel trick to learn non-linear boundaries. The LinearSVC model used below is a linear SVM, so no kernel is involved.
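
As a small illustration of the two classifiers, the sketch below fits both on made-up word-count vectors. It uses `SVC(kernel="linear")` rather than the notebook's `LinearSVC` only because `SVC` exposes the fitted `support_vectors_`; the data and the three-word vocabulary are assumptions.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Toy word-count vectors over a hypothetical 3-word vocabulary.
# Label 1 = spam, label 0 = non-spam.
X = np.array([[3, 0, 1], [2, 0, 0], [4, 1, 0],
              [0, 3, 1], [0, 2, 2], [1, 4, 0]])
y = np.array([1, 1, 1, 0, 0, 0])

nb = MultinomialNB().fit(X, y)
svm = SVC(kernel="linear").fit(X, y)

new_mail = np.array([[2, 0, 1]])
print(nb.predict(new_mail), nb.predict_proba(new_mail))
print(svm.predict(new_mail))
# The training points that define the separating hyperplane.
print(svm.support_vectors_)
```

Naive Bayes returns class probabilities, while the SVM decision depends only on the few training points stored in `support_vectors_`.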

Create a dictionary of words with their frequencies:

In [42]:
# Create a dictionary 
train_dir='/content/gdrive/MyDrive/ling-spam/train-mails/'
test_dir='/content/gdrive/MyDrive/ling-spam/test-mails/'
dictionary = make_Dictionary(train_dir)
#print(dictionary)

3000


Prepare the feature vector and label for each training mail:

In [43]:
# feature vectors

train_labels = np.zeros(702)
train_labels[351:702] = 1  # mails 351..701 are labeled 1 (spam); 351:701 would skip the last one
train_matrix = extract_features(train_dir)
#print(dictionary)
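
The hard-coded label slice is only correct if the files are read in exactly the order the labels assume. A less fragile sketch derives each label from its file name instead. The `spmsg` prefix and the sample file names below are assumptions for illustration; check them against the actual directory listing before relying on this.

```python
import numpy as np

def labels_from_filenames(filenames, spam_prefix="spmsg"):
    """Return 1.0 for names starting with spam_prefix, else 0.0.

    The default prefix is an assumption, not something this notebook confirms.
    """
    return np.array([1.0 if name.startswith(spam_prefix) else 0.0
                     for name in filenames])

# Hypothetical file names, in arbitrary order.
toy_names = ["3-1msg1.txt", "spmsga1.txt", "6-2msg3.txt", "spmsgb7.txt"]
toy_labels = labels_from_filenames(toy_names)
print(toy_labels)
```

With this approach the labels stay aligned with the rows of the feature matrix as long as both are built from the same list of file names.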

Training and testing: SVM and Naive Bayes classifiers

In [None]:
from sklearn.metrics import confusion_matrix
# Train the Naive Bayes and SVM classifiers

model1 = MultinomialNB()
model2 = LinearSVC()
model1.fit(train_matrix,train_labels)
model2.fit(train_matrix,train_labels)

# Test the unseen mails for Spam
test_matrix = extract_features(test_dir)
test_labels = np.zeros(260)
test_labels[130:260] = 1
result1 = model1.predict(test_matrix)
result2 = model2.predict(test_matrix)
#print(confusion_matrix(test_labels,result1))
#print(confusion_matrix(test_labels,result2))
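
The accuracy of each model can be read off its confusion matrix or computed directly. The sketch below shows both on made-up labels and predictions; the arrays are illustrative and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (1 = spam, 0 = non-spam).
true_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
predicted = np.array([0, 0, 0, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(true_labels, predicted)
acc = accuracy_score(true_labels, predicted)
print(cm)
print(acc)
# Accuracy equals the diagonal of the confusion matrix divided by its sum.
print(cm.trace() / cm.sum())
```

Applied to this notebook, the same calls would take `test_labels` together with `result1` or `result2`.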



Conclusion:
Both approaches perform similarly, with accuracy above 95%, while the SVM is slightly better.