In [1]:
import warnings
warnings.filterwarnings('ignore')

from google.colab import drive
drive.mount('/drive')

Drive already mounted at /drive; to attempt to forcibly remount, call drive.mount("/drive", force_remount=True).


# **Report**
Our overall goal in this report and assignment is to become used to Naive Bayes Algorithm. Naive Bayes is a simple classification algorithm that makes an assumption about the conditional independence of features. To familiarize with Naive Bayes Algoritm, we are tasked with implementing an algorithm that predicts whether an email is spam or ham and finally we will measure its performance to verify whether it works quite well in practice.

### Importing necessary libraries and methods
Our first job, of course, is to import the libraries that will be used in the implementation of given assignment

In [2]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import confusion_matrix

## Part 1

### Read the dataset into Pandas DataFrame



In [3]:
df = pd.read_csv('/drive/My Drive/Colab Notebooks/BBM409-Assignment-3/emails.csv')

In [4]:
df.head(5)

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


### Spam and Ham Mail Information And Ratio Within Our Data

We can see that vast majority of our data is made up of ham e-mails with around 75% majority

In [5]:
print('Sample Dataset Information:')
print()
print('Sample Dataset Shape:', df.shape)
print()
print('Sample Dataset Value Distribution:')
print(df['spam'].value_counts(normalize=False))
print()
print('Sample Dataset Value Distribution Rates:')
print(df['spam'].value_counts(normalize=True))

Sample Dataset Information:

Sample Dataset Shape: (5728, 2)

Sample Dataset Value Distribution:
0    4360
1    1368
Name: spam, dtype: int64

Sample Dataset Value Distribution Rates:
0    0.761173
1    0.238827
Name: spam, dtype: float64


### Train-Test Split
To estimate the performance of our model, we shuffled the given dataset and then splitted into two parts, 20% as the test set and 80% as the train set. We also checked whether the resulting datasets' ham/spam distributions are similar to our initial dataset, which we found out to be true.

In [6]:
train_df, test_df = train_test_split(df, test_size=0.20, shuffle=True)

In [7]:
print('Training Dataset Information:')
print()
print('Training Dataset Shape:', train_df.shape)
print()
print('Training Dataset Value Distribution:')
print(train_df['spam'].value_counts(normalize=False))
print()
print('Training Dataset Value Distribution Rates:')
print(train_df['spam'].value_counts(normalize=True))
print()
print("*"*50)

print()
print('Test Dataset Information:')
print()
print('Test Dataset Shape:', test_df.shape)
print()
print('Test Dataset Value Distribution:')
print(test_df['spam'].value_counts(normalize=False))
print()
print('Test Dataset Value Distribution Rates:')
print(test_df['spam'].value_counts(normalize=True))

Training Dataset Information:

Training Dataset Shape: (4582, 2)

Training Dataset Value Distribution:
0    3490
1    1092
Name: spam, dtype: int64

Training Dataset Value Distribution Rates:
0    0.761676
1    0.238324
Name: spam, dtype: float64

**************************************************

Test Dataset Information:

Test Dataset Shape: (1146, 2)

Test Dataset Value Distribution:
0    870
1    276
Name: spam, dtype: int64

Test Dataset Value Distribution Rates:
0    0.759162
1    0.240838
Name: spam, dtype: float64


### Analyzing Promising Words

From the experience we got from real world life we live, we chose to analyze and evaluate the statistics of three highly suspicious words within e-mails.

Those words are:
- click
- money
- online

In [8]:
data_statistics_vectorizer = CountVectorizer()
data_statistics = data_statistics_vectorizer.fit_transform(df['text'])

index_of_click = data_statistics_vectorizer.vocabulary_['click']
index_of_money = data_statistics_vectorizer.vocabulary_['money']
index_of_online = data_statistics_vectorizer.vocabulary_['online']

spam_data = data_statistics[np.array(df['spam'] == 1)]
ham_data = data_statistics[np.array(df['spam'] == 0)]

spam_count_click = spam_data[:, index_of_click].sum()
ham_count_click = ham_data[:, index_of_click].sum()
print("Number of times 'click' has been used in a spam mail:", spam_count_click)
print("Number of times 'click' has been used in a ham mail:", ham_count_click)

print()

spam_count_money = spam_data[:, index_of_money].sum()
ham_count_money = ham_data[:, index_of_money].sum()
print("Number of times 'money' has been used in a spam mail:", spam_count_money)
print("Number of times 'money' has been used in a ham mail:", ham_count_money)

print()

spam_count_online = spam_data[:, index_of_online].sum()
ham_count_online = ham_data[:, index_of_online].sum()
print("Number of times 'online' has been used in a spam mail:", spam_count_online)
print("Number of times 'online' has been used in a ham mail:", ham_count_online)

Number of times 'click' has been used in a spam mail: 531
Number of times 'click' has been used in a ham mail: 200

Number of times 'money' has been used in a spam mail: 662
Number of times 'money' has been used in a ham mail: 113

Number of times 'online' has been used in a spam mail: 345
Number of times 'online' has been used in a ham mail: 173


From above, we can easily see the overwhelming connection between spam e-mails and these three words as the difference between the amount of use of these words between spam and ham e-mails is drastic. In fact, the connection is even more profound than the result above, since we didn't take the difference in numbers between spam and ham e-mail amounts into account. If we normalize the count of those three words with the distribution of e-mail types in mind:

In [9]:
print("Normalized value of the times 'click' has been used in a spam mail:", spam_count_click / spam_data.shape[0])
print("Normalized value of the times 'click' has been used in a ham mail:", ham_count_click / ham_data.shape[0])

print()

spam_count_money = spam_data[:, index_of_money].sum()
ham_count_money = ham_data[:, index_of_money].sum()
print("Normalized value of the times 'money' has been used in a spam mail:", spam_count_money / spam_data.shape[0])
print("Normalized value of the times 'money' has been used in a ham mail:", ham_count_money / ham_data.shape[0])

print()

spam_count_online = spam_data[:, index_of_online].sum()
ham_count_online = ham_data[:, index_of_online].sum()
print("Normalized value of the times 'online' has been used in a spam mail:", spam_count_online / spam_data.shape[0])
print("Normalized value of the times 'online' has been used in a ham mail:", ham_count_online / ham_data.shape[0])

Normalized value of the times 'click' has been used in a spam mail: 0.3881578947368421
Normalized value of the times 'click' has been used in a ham mail: 0.045871559633027525

Normalized value of the times 'money' has been used in a spam mail: 0.48391812865497075
Normalized value of the times 'money' has been used in a ham mail: 0.025917431192660552

Normalized value of the times 'online' has been used in a spam mail: 0.25219298245614036
Normalized value of the times 'online' has been used in a ham mail: 0.03967889908256881


As we can see from above, each of these words has around ten times more importance within spam mails than ham ones. And within those spam mails, each one has between 25% to 50% repetition within spam mails (though we do not take the repetition within individual e-mails into account), an unbelievable difference in importance pointing us towards a clear connection between these words and spam mails as spammers seem to love to use these words within their mails.

## Part 2

### Naive Bayes Classifier



In [10]:
class NaiveBayesClassifier:
  """ The Naive Bayes Class where we do the majority of our calculations using Naive Bayes Algorithm 
  and its supportive utility functions """

  def __init__(self, ngram_range=(1,1), stop_words=None, vocabulary=None):

    """Constructor of our Naive Bates Algorithm where we define and initialize our class fields"""

    # Parameters : 
    #   ngram_range (tuple): The tuple that decides which Bag of Words option will be used.
    #   stop_words (list): The default is None. It takes the list of words which would be ignored. 

    #object of CountVectorizer
    self.vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words=stop_words, vocabulary=vocabulary)  
    # Sparse Matrix which stores the count of occurrences of every word in the all mails.
    self.count_matrix = None 
    # Sparse Matrix which stores the count of occurrences of every word in the spam mails.
    self.spam_matrix = None
    # Sparse Matrix which stores the count of occurrences of every word in the ham mails.
    self.ham_matrix = None 
  
  def fit(self, train_df):

    """Fit and train the given input dataframe by constructing a Naive Bayes Algorithm"""

    # Parameter:
    #   train_df (Pandas DataFrame): The dataframe that contains mails.

    #Consturcts the sparse matrix which stores the count of each word occured in the train_df
    self.count_matrix = self.vectorizer.fit_transform(train_df['text'])

    #Constructs the sparse matrix which stores the count of occurrences of every word in the spam mails.
    self.spam_matrix = self.count_matrix[np.array(train_df['spam'] == 1)]

    #Constructs the sparse matrix which stores the count of occurrences of every word in the ham mails.
    self.ham_matrix = self.count_matrix[np.array(train_df['spam'] == 0)]

    
    self.vocab_amount = len(self.vectorizer.vocabulary_) # vocabularies of train_df

    
    self.spam_probability = self.spam_matrix.shape[0] + 1 / self.count_matrix.shape[0] + 2 # stores the value of P(spam)
    self.ham_probability = self.ham_matrix.shape[0] + 1 / self.count_matrix.shape[0] + 2 # tores the value of P(ham)

    self.spam_total_word_count = self.spam_matrix.sum() # the total word count in the spam mails. 
    self.ham_total_word_count = self.ham_matrix.sum() # the total word count in the ham mails.

  def _calculate_probability(self, sample_mail, calculating_spam):

    """Calculates the final probability of sample_mail"""

    # Parameter:
    #   sample_mail (Matrix): Matrix which stores the word counts of a specific mail.
    #   calculating_spam (Bool): Indicates which conditional probability will be calculated.

    # If P(sample_mail|spam) will be calculated
    if calculating_spam == True:
      matrix = self.spam_matrix  # the matrix of spam mails.
      total_word_count = self.spam_total_word_count # total word count of spam mails.
      class_probability = self.spam_probability  # P(spam) 

    # If P(sample_mail|ham) will be calculated  
    else:
      matrix = self.ham_matrix  # the matrix of ham mails.
      total_word_count = self.ham_total_word_count # total word count of ham mails.
      class_probability = self.ham_probability  # P(ham) 

    # Here we take the columns i.e. words that are inside our testing sample e-mail and sum each of their use amounts
    word_sum_matrix = matrix[:, sample_mail.tocoo().col].sum(axis=0)

    # As the resulting variable is a matrix, we transform it into a 1D array as it will be more useful
    word_sum_array = np.squeeze(np.asarray(word_sum_matrix))
    
    # We add one to each of the words' amounts as we provide Laplace Smoothing
    word_sum_array = np.add(word_sum_array, 1)

    # We divide the resulting array to our total word count after adding the vocabulary amount as part of our smoothing
    word_sum_array = np.divide(word_sum_array, total_word_count + self.vocab_amount)

    # Once we get the frequencies of our words, we log each of them as a way to get their log-probabilities
    word_sum_array = np.log(word_sum_array)

    # Then, we sum them up to get our overall probability with logarthmic results
    log_probability = word_sum_array.sum()

    # Finally, we also add our overall class probability P(ham) or P(spam) to our final probability to get our desired result
    log_probability = log_probability + np.log(class_probability)

    return log_probability

  def predict(self, test_df):

    """Predict and return the resulting targets of a dataset given to us as input"""

    #Parameter:
    #   test_df (Pandas DataFrame): DataFrame that stores test mails.

  
    test_count_matrix = self.vectorizer.transform(test_df['text']) #number of test mails.

    predictions = np.array([]) 
    for mail in test_count_matrix:
      # the probability of mail is spam
      spam_or_ham = {}
      spam_or_ham['spam'] = self._calculate_probability(mail, True)

      # the probability of mail is ham
      spam_or_ham['ham'] = self._calculate_probability(mail, False)

      # stores the key of P(mail|spam) or P(mail|ham) which has the highest value
      max_key = max(spam_or_ham,  key=spam_or_ham.get)

      #appends 1 to predictions, if the mail is predicted as spam
      if max_key == 'spam':
        predictions = np.append(predictions, 1)

      #appends 1 to predictions, if the mail is predicted as ham
      else:
        predictions = np.append(predictions, 0)
      
    #return the predictions of test_df
    return predictions


In [11]:
def calculate_performance(true_results, predictions):
  conf_matrix = confusion_matrix(true_results, predictions, labels=[1, 0])

  tp = conf_matrix[0,0] #Number of True Positive
  fp = conf_matrix[1,0] #Number of False Positive
  tn = conf_matrix[1,1] #Number of True Negative
  fn = conf_matrix[0,1] #Number of False Negative

  #Calculations of classification metrics 

  accuracy = (tp + tn) / (tp + tn + fp + fn)
  precision = tp / (tp + fp)
  recall = tp / (tp + fn)
  f1 = (2 * recall * precision) / (recall + precision) 

  #Display classification metrics 

  print("Accuracy:", accuracy)
  print("Precision:", precision)
  print("Recall:", recall)
  print("F1:", f1)
  print("Confusion Matrix\n",conf_matrix)
  print("*"*50)

## Part 3

### Analyzing Effect of the Words on Prediction

Below, we will calculate the importance of individual words on the classes of spams and hams

In [12]:
# We vectorize and count each individual word's use to calculate their importance later on
vectorizer = CountVectorizer()
all_counts_matrix = vectorizer.fit_transform(df['text'])

# Store different classes of documents in different variables
spam_matrix = all_counts_matrix[np.array(df['spam'] == 1)]
ham_matrix = all_counts_matrix[np.array(df['spam'] == 0)]

# We prepare a tf-idf transformer to calculate the relative importance of words within each document
transformer = TfidfTransformer()

# Ham E-Mail word importance calculations

# Fit and transform the ham e-mails' count matrix
ham_tf_idf_matrix = transformer.fit_transform(ham_matrix)

# We sum up the values within the matrix's columns i.e. unique words for each column
# and then transform the resulting matrix into a easier to use array
ham_vocabulary_tf_idf_values = np.squeeze(np.asarray(ham_tf_idf_matrix.sum(axis=0)))

# Here, we normalize our resulting sum values to the amount of ham mails within our dataset
# We do this to ensure the difference in the number of sample data between spam and ham mails
# doesn't affect our statistics and analysis as we anchor them both to an identical point
normalized_ham_tf_idf_values = ham_vocabulary_tf_idf_values / ham_matrix.shape[0]

# Spam E-Mail word importance calculations

# Fit and transform the spam e-mails' count matrix
spam_tf_idf_matrix = transformer.fit_transform(spam_matrix)

# We sum up the values within the matrix's columns i.e. unique words for each column
# and then transform the resulting matrix into a easier to use array
spam_vocabulary_tf_idf_values = np.squeeze(np.asarray(spam_tf_idf_matrix.sum(axis=0)))

# Here, we normalize our resulting sum values to the amount of spam mails within our dataset
# We do this to ensure the difference in the number of sample data between spam and ham mails
# doesn't affect our statistics and analysis as we anchor them both to an identical point
normalized_spam_tf_idf_values = spam_vocabulary_tf_idf_values / spam_matrix.shape[0]

In [13]:
# Ham

# Firstly, we subtract the normalized values of words between ham and spam e-mails
# We do this to ensure the words within every or most of the mails that doesn't constitute
# meaningful data (Such as "subject" word that is stored within the start of every mail as a way to point at the starting point)
# are given less importance within our current analysis as they do not affect classification of our data.
ham_minus_spam = normalized_ham_tf_idf_values - normalized_spam_tf_idf_values

# Then, we store both the 10 most important ham words, those that are having their presence affect the ham classification the most
# and also the '10 least important ham words', those that are having their absence affect the ham classification the most
most_important_indices = np.argpartition(ham_minus_spam, -10)[-10:]
least_important_indices = np.argpartition(ham_minus_spam, 10)[:10]

# Then, we use the indices we got to sort these words depending on their importance
# We intentionally use indices as we can use them in multiple different arrays without question or problem
most_important_sorting_indices = np.argsort(ham_minus_spam[most_important_indices])[::-1]
least_important_sorting_indices = np.argsort(ham_minus_spam[least_important_indices])

# Lastly for display purposes, we extract both the the most important present and absent words and their normalized tf-idf values
strongest_present_ham_words = vectorizer.get_feature_names_out()[most_important_indices][most_important_sorting_indices]
strongest_present_ham_word_values = ham_minus_spam[most_important_indices][most_important_sorting_indices]
strongest_absent_ham_words = vectorizer.get_feature_names_out()[least_important_indices][least_important_sorting_indices]
strongest_abesent_ham_word_values = ham_minus_spam[least_important_indices][least_important_sorting_indices]

# Spam

# In spam calculations too, we firstly subtract the normalized values of words between ham and spam e-mails
# We do this to ensure the words within every or most of the mails that doesn't constitute
# meaningful data (Such as "subject" word that is stored within the start of every mail as a way to point at the starting point)
# are given less importance within our current analysis as they do not affect classification of our data.
spam_minus_ham = normalized_spam_tf_idf_values - normalized_ham_tf_idf_values

# Then, we use the indices we got to sort these words depending on their importance
# We intentionally use indices as we can use them in multiple different arrays without question or problem
most_important_indices = np.argpartition(spam_minus_ham, -10)[-10:]
least_important_indices = np.argpartition(spam_minus_ham, 10)[:10]

# Then, we use the indices we got to sort these words depending on their importance
# We intentionally use indices as we can use them in multiple different arrays without question or problem
most_important_sorting_indices = np.argsort(spam_minus_ham[most_important_indices])[::-1]
least_important_sorting_indices = np.argsort(spam_minus_ham[least_important_indices])

# Lastly for display purposes, we extract both the the most important present and absent words and their normalized tf-idf values
strongest_present_spam_words = vectorizer.get_feature_names_out()[most_important_indices][most_important_sorting_indices]
strongest_present_spam_word_values = spam_minus_ham[most_important_indices][most_important_sorting_indices]
strongest_absent_spam_words = vectorizer.get_feature_names_out()[least_important_indices][least_important_sorting_indices]
strongest_absent_spam_word_values = spam_minus_ham[least_important_indices][least_important_sorting_indices]

In [14]:
print("10 words whose presence most strongly predicts that the mail is ham:")
print()
print("Words \t\t Values")
print("-"*50)
for word, value in zip(strongest_present_ham_words, strongest_present_ham_word_values):
  if len(word) < 5:
    print(word, " :\t\t", value)
  else:
    print(word, " :\t", value)

10 words whose presence most strongly predicts that the mail is ham:

Words 		 Values
--------------------------------------------------
ect  :		 0.050719791225108744
enron  :	 0.04403856711707512
vince  :	 0.03694228067443453
hou  :		 0.025584879612187857
the  :		 0.025172686489140783
kaminski  :	 0.022724985334668095
2000  :		 0.022104536195975435
am  :		 0.017218117189881112
pm  :		 0.01610623471613469
cc  :		 0.015962551136407533


In [15]:
print("10 words whose absence most strongly predicts that the mail is ham:")
print()
print("Words \t\t Values")
print("-"*50)
for word, value in zip(strongest_absent_ham_words, strongest_abesent_ham_word_values):
  if len(word) < 5:
    print(word, " :\t\t", value)
  else:
    print(word, " :\t", value)

10 words whose absence most strongly predicts that the mail is ham:

Words 		 Values
--------------------------------------------------
your  :		 -0.033196770143117144
software  :	 -0.017338619619152403
website  :	 -0.016880055343645218
adobe  :	 -0.016322406072843177
you  :		 -0.0146451996239495
click  :	 -0.014491941054072994
money  :	 -0.013724226976983023
save  :		 -0.013083805789379361
here  :		 -0.012704522300950942
business  :	 -0.012291112639000268


In [16]:
print("10 words whose presence most strongly predicts that the mail is spam:")
print()
print("Words \t\t Values")
print("-"*50)
for word, value in zip(strongest_present_spam_words, strongest_present_spam_word_values):
  if len(word) < 5:
    print(word, " :\t\t", value)
  else:
    print(word, " :\t", value)

10 words whose presence most strongly predicts that the mail is spam:

Words 		 Values
--------------------------------------------------
your  :		 0.033196770143117144
software  :	 0.017338619619152403
website  :	 0.016880055343645218
adobe  :	 0.016322406072843177
you  :		 0.0146451996239495
click  :	 0.014491941054072994
money  :	 0.013724226976983023
save  :		 0.013083805789379361
here  :		 0.012704522300950942
business  :	 0.012291112639000268


In [17]:
print("10 words whose absence most strongly predicts that the mail is spam:")
print()
print("Words \t\t Values")
print("-"*50)
for word, value in zip(strongest_absent_spam_words, strongest_absent_spam_word_values):
  if len(word) < 5:
    print(word, " :\t\t", value)
  else:
    print(word, " :\t", value)

10 words whose absence most strongly predicts that the mail is spam:

Words 		 Values
--------------------------------------------------
ect  :		 -0.050719791225108744
enron  :	 -0.04403856711707512
vince  :	 -0.03694228067443453
hou  :		 -0.025584879612187857
the  :		 -0.025172686489140783
kaminski  :	 -0.022724985334668095
2000  :		 -0.022104536195975435
am  :		 -0.017218117189881112
pm  :		 -0.01610623471613469
cc  :		 -0.015962551136407533


From above lists of important words within our dataset, we can deduce many useful data points. Firstly, the words whose presence most strongly predicts spam and the words whose absence most strongly predicts ham are the same exact words; similarly, the words whose presence most strongly predicts ham and the words whose absence most strongly predicts ham are the same words as well. This is both understandable and predictable, as the fact of us working within binary classification (spam and ham), constitutes a word having opposite effects towards each class as they have no other option. We can even see it within the words: words like 'click', 'money' etc. and even informal speech like 'you', 'your' are logically indicative of a spam mail, and if we see an e-mail with an overwhelming amount of words like these, we would definetly see their absence in other mails as a logical proof of the new e-mail's hammicity. We can also see that the words that show an e-mail being ham are mostly formal, business speak, with abbreviations such as 'cc', 'am' or 'pm' showing a certain degree of professionalism behind them.

### Narrowing Down the Dictionary and Reimplementing Naive Bayes with The New Dictionary

As we learned from the previous section, some words are vastly more important for some classes than other words, from which we can deduce that somewords are relatively unimportant for both classes, being situated in the middle section of our value rankings. As they are relatively unimportant, it is logical for us to test getting rid of them as it is possible that they may skew our results from time to time while not contributing and wasting precious resources in runtime. In accordance with this, we will try narrowing down our dictionary, reimplementing our Naive Bayes Algorithm and testing our dataset with this new arrangement. We obtained this new vocabulary and dictionary by comparing and extracting the best TF-IDF values, in a different way of saying, the best probabilities (both conditional and not conditional) among the numerous possible words and wordings.

In [18]:
ham_minus_spam.shape

(37303,)

In [19]:
# As we have about 35000 different, unique words within our dictionary, we can narrow them down.
# For this purpose we will chose 7500 of the most important spam words and 7500 of the most important ham words
most_important_spam_words = np.argpartition(ham_minus_spam, -7500)[-7500:]
most_important_ham_words = np.argpartition(spam_minus_ham, -7500)[-7500:]

# Add these two word lists together to get our new dictionary and vocabulary
most_important_words = np.append(most_important_spam_words, most_important_ham_words)
vectorizer.get_feature_names_out()[most_important_words]

array(['4704', 'pervasive', 'ctc', ..., 'marhtadowns', '0000', 'zzzz'],
      dtype=object)

Testing with normal, unnarrowed down vocabulary

In [20]:
classifier = NaiveBayesClassifier()
classifier.fit(train_df)
predictions = classifier.predict(test_df)

calculate_performance(test_df['spam'], predictions)

Accuracy: 0.9886561954624782
Precision: 0.9747292418772563
Recall: 0.9782608695652174
F1: 0.9764918625678118
Confusion Matrix
 [[270   6]
 [  7 863]]
**************************************************


Testing with narrowed down vocabulary

In [21]:
classifier = NaiveBayesClassifier(vocabulary=vectorizer.get_feature_names_out()[most_important_words])
classifier.fit(train_df)
predictions = classifier.predict(test_df)

calculate_performance(test_df['spam'], predictions)

Accuracy: 0.9930191972076788
Precision: 0.975177304964539
Recall: 0.9963768115942029
F1: 0.985663082437276
Confusion Matrix
 [[275   1]
 [  7 863]]
**************************************************


After testing our code numerous times, we concluded that narrowing down the dictionary may both increase or decrease our performance metrics at different times based on luck. But even when decreased, the decrease amount was always small enough to be negrected. Regardless, the fact that our vocabulary size and because of that our matrix size being drastically reduced, makes us come to the conclusion that the main and best drawing point and advantage of this narrowing down technique would be speeding up our code and increasing the efficency of our algorithm drastically.

In [22]:
# New vocabulary size, and matrix column amount
classifier.count_matrix.shape[1]

15000

### Analyzing the effect of the Stop words
As a result of Naive Bayes algorithm, we obtained quite satisfactory performance measurements without removing stop words because we calculate the probability of each word independently from each other instead of eliminating the instance completely. But still, the performance of our model may be improved by removing stop words because stop words are commonly used words in a language. For this reason, stop words cannot be significant factor in deciding whether an email is spam or ham. They acts like noisy data in our dataset. Thanks to Naive Bayes Algorithm, keeping stop words in the dataset doesn't affect the performance much but, removing stop words may improve the performance. According to our test results, when we compare performance metrics, we see that performance metrics rarely decrease, but mostly improved by 0.1% to 1% when we remove stop words.  

Below, is the same code we used to calculate the importance of different words within our dataset, modified to work with stop word deletion.

In [23]:
# We vectorize and count each individual word's use to calculate their importance later on
vectorizer = CountVectorizer(stop_words=ENGLISH_STOP_WORDS)
all_counts_matrix = vectorizer.fit_transform(df['text'])

# Store different classes of documents in different variables
spam_matrix = all_counts_matrix[np.array(df['spam'] == 1)]
ham_matrix = all_counts_matrix[np.array(df['spam'] == 0)]

# We prepare a tf-idf transformer to calculate the relative importance of words within each document
transformer = TfidfTransformer()

# Ham E-Mail word importance calculations

# Fit and transform the ham e-mails' count matrix
ham_tf_idf_matrix = transformer.fit_transform(ham_matrix)

# We sum up the values within the matrix's columns i.e. unique words for each column
# and then transform the resulting matrix into a easier to use array
ham_vocabulary_tf_idf_values = np.squeeze(np.asarray(ham_tf_idf_matrix.sum(axis=0)))

# Here, we normalize our resulting sum values to the amount of ham mails within our dataset
# We do this to ensure the difference in the number of sample data between spam and ham mails
# doesn't affect our statistics and analysis as we anchor them both to an identical point
normalized_ham_tf_idf_values = ham_vocabulary_tf_idf_values / classifier.ham_matrix.shape[0]

# Spam E-Mail word importance calculations

# Fit and transform the spam e-mails' count matrix
spam_tf_idf_matrix = transformer.fit_transform(spam_matrix)

# We sum up the values within the matrix's columns i.e. unique words for each column
# and then transform the resulting matrix into a easier to use array
spam_vocabulary_tf_idf_values = np.squeeze(np.asarray(spam_tf_idf_matrix.sum(axis=0)))

# Here, we normalize our resulting sum values to the amount of spam mails within our dataset
# We do this to ensure the difference in the number of sample data between spam and ham mails
# doesn't affect our statistics and analysis as we anchor them both to an identical point
normalized_spam_tf_idf_values = spam_vocabulary_tf_idf_values / classifier.spam_matrix.shape[0]

In [24]:
# Ham

# Firstly, we subtract the normalized values of words between ham and spam e-mails
# We do this to ensure the words within every or most of the mails that doesn't constitute
# meaningful data (Such as "subject" word that is stored within the start of every mail as a way to point at the starting point)
# are given less importance within our current analysis as they do not affect classification of our data.
ham_minus_spam = normalized_ham_tf_idf_values - normalized_spam_tf_idf_values

# Then, we store both the 10 most important ham words, those that are having their presence affect the ham classification the most
# and also the '10 least important ham words', those that are having their absence affect the ham classification the most
most_important_indices = np.argpartition(ham_minus_spam, -10)[-10:]

# Then, we use the indices we got to sort these words depending on their importance
# We intentionally use indices as we can use them in multiple different arrays without question or problem
most_important_sorting_indices = np.argsort(ham_minus_spam[most_important_indices])[::-1]

# Lastly for display purposes, we extract both the the most important present and absent words and their normalized tf-idf values
strongest_present_ham_words = vectorizer.get_feature_names_out()[most_important_indices][most_important_sorting_indices]
strongest_present_ham_word_values = ham_minus_spam[most_important_indices][most_important_sorting_indices]

# Spam

# In spam calculations too, we firstly subtract the normalized values of words between ham and spam e-mails
# We do this to ensure the words within every or most of the mails that doesn't constitute
# meaningful data (Such as "subject" word that is stored within the start of every mail as a way to point at the starting point)
# are given less importance within our current analysis as they do not affect classification of our data.
spam_minus_ham = normalized_spam_tf_idf_values - normalized_ham_tf_idf_values

# Then, we use the indices we got to sort these words depending on their importance
# We intentionally use indices as we can use them in multiple different arrays without question or problem
most_important_indices = np.argpartition(spam_minus_ham, -10)[-10:]

# Then, we use the indices we got to sort these words depending on their importance
# We intentionally use indices as we can use them in multiple different arrays without question or problem
most_important_sorting_indices = np.argsort(spam_minus_ham[most_important_indices])[::-1]

# Lastly for display purposes, we extract both the the most important present and absent words and their normalized tf-idf values
strongest_present_spam_words = vectorizer.get_feature_names_out()[most_important_indices][most_important_sorting_indices]
strongest_present_spam_word_values = spam_minus_ham[most_important_indices][most_important_sorting_indices]


In [25]:
print("10 non-stop-words whose presence most strongly predicts that the mail is ham:")
print()
print("Words \t\t Values")
print("-"*50)
for word, value in zip(strongest_present_ham_words, strongest_present_ham_word_values):
  if len(word) < 5:
    print(word, " :\t\t", value)
  else:
    print(word, " :\t", value)

10 non-stop-words whose presence most strongly predicts that the mail is ham:

Words 		 Values
--------------------------------------------------
ect  :		 0.06710361588326001
enron  :	 0.05911483796813067
vince  :	 0.04980996314991822
hou  :		 0.033858681548476995
kaminski  :	 0.030448190247366433
2000  :		 0.029381740475496517
research  :	 0.021639825617346405
pm  :		 0.02146231910782674
cc  :		 0.02139030769584857
2001  :		 0.02125938257272819


In [26]:
print("10 non-stop-words whose presence most strongly predicts that the mail is spam:")
print()
print("Words \t\t Values")
print("-"*50)
for word, value in zip(strongest_present_spam_words, strongest_present_spam_word_values):
  if len(word) < 5:
    print(word, " :\t\t", value)
  else:
    print(word, " :\t", value)

10 non-stop-words whose presence most strongly predicts that the mail is spam:

Words 		 Values
--------------------------------------------------
website  :	 0.02419934399844579
software  :	 0.023269692847953893
adobe  :	 0.02066911745831094
click  :	 0.020586604815313464
money  :	 0.02023999238718702
business  :	 0.01827134422377757
save  :		 0.017850796467998642
logo  :		 0.017060108623563042
online  :	 0.015977863612334595
95  :		 0.015583034757577042


From the lists above, we can see that some unimportant/generally-used words like "you", "your", "am" has been cut off from our vocabularies and lost their importance as a consequence. We can also see that some rare yet important words like 'research' has risen in importance after getting rid of our stopwords. While getting rid of stop words would definetly be generally a good thing, her we can also see some drawbacks that may arise from using general/non-specified stop-word lists as our stop words, as we can see that we lost the words like 'you' and 'your', words that were drastically important both data wise and logically as they showed us informal speech patterns of spammers originally, patterns that we now lost. But in general, the result seems to be adventageous and good for us.

## Part 4

### Performance Metric Calculator
Displays the results of accuracy metrics such as Accuracy, Precision, Recall and F1 values.

### Calculation of Metrics

#### Performance Measurement of Unigram Model by Keeping Stop Words

In [27]:
classifier = NaiveBayesClassifier()
classifier.fit(train_df)
predictions = classifier.predict(test_df)

calculate_performance(test_df['spam'], predictions)

Accuracy: 0.9886561954624782
Precision: 0.9747292418772563
Recall: 0.9782608695652174
F1: 0.9764918625678118
Confusion Matrix
 [[270   6]
 [  7 863]]
**************************************************


#### Performance Measurement of Bigram Model by Keeping Stop Words

In [28]:
classifier = NaiveBayesClassifier(ngram_range=(2,2))
classifier.fit(train_df)
predictions = classifier.predict(test_df)

calculate_performance(test_df['spam'], predictions)

Accuracy: 0.987783595113438
Precision: 0.9924812030075187
Recall: 0.9565217391304348
F1: 0.974169741697417
Confusion Matrix
 [[264  12]
 [  2 868]]
**************************************************


#### Performance Measurement of Unigram-Bigram Models Both by Keeping Stop Words

In [29]:
classifier = NaiveBayesClassifier(ngram_range=(1,2))
classifier.fit(train_df)
predictions = classifier.predict(test_df)

calculate_performance(test_df['spam'], predictions)

Accuracy: 0.9921465968586387
Precision: 0.9962825278810409
Recall: 0.9710144927536232
F1: 0.98348623853211
Confusion Matrix
 [[268   8]
 [  1 869]]
**************************************************


In the above three examples, we tested our Naive Bayes implementation with Unigram, Bigram and Unigram-Bigram-Hybrid implementations. Within our numerous tests, we have seen each different implementation return the best result at different times, showing us both the luck factor behind their efficency differences and the extreme closeness in results they return. In a plurality of the time, Hybrid implementation return a slightly better results than its parents. This is to be expected as this Hybrid, inevitably includes within itself the whole vocabulary of both Unigram and Bigram at the same time, resulting in a greater amount of data to infer about our predictions. But it also results in a drastically higher time to complete thanks to its data size. Speed wise, we can say that Unigram >>> Bigram > Hybrid, as their vocabulary and matrix sizes are in a close relation to their vocabulary and matrix sizes.

#### Performance Measurement of Unigram Model by Removing Stop Words

In [30]:
classifier = NaiveBayesClassifier(stop_words=ENGLISH_STOP_WORDS)
classifier.fit(train_df)
predictions = classifier.predict(test_df)

calculate_performance(test_df['spam'], predictions)

Accuracy: 0.9912739965095986
Precision: 0.9854014598540146
Recall: 0.9782608695652174
F1: 0.9818181818181817
Confusion Matrix
 [[270   6]
 [  4 866]]
**************************************************


#### Performance Measurement of Bigram Model by Removing Stop Words

In [31]:
classifier = NaiveBayesClassifier(ngram_range=(2,2), stop_words=ENGLISH_STOP_WORDS)
classifier.fit(train_df)
predictions = classifier.predict(test_df)

calculate_performance(test_df['spam'], predictions)

Accuracy: 0.987783595113438
Precision: 0.9888059701492538
Recall: 0.9601449275362319
F1: 0.974264705882353
Confusion Matrix
 [[265  11]
 [  3 867]]
**************************************************


#### Performance Measurement of Unigram-Bigram Models Both by Removing Stop Words

In [32]:
classifier = NaiveBayesClassifier(ngram_range=(1,2), stop_words=ENGLISH_STOP_WORDS)
classifier.fit(train_df)
predictions = classifier.predict(test_df)

calculate_performance(test_df['spam'], predictions)

Accuracy: 0.9912739965095986
Precision: 0.996268656716418
Recall: 0.967391304347826
F1: 0.9816176470588235
Confusion Matrix
 [[267   9]
 [  1 869]]
**************************************************


Just from the above examples and from the numerous tests we have done until now. We can say that the results given by the above three stop-wordless testings' relations to one another are identical to the relations between tests that we did above with stop-words. We can understand that the relations between Unigram, Bigram and Hybrid implementations are identical regardless of stop-word deletion. If we compare the above stop-wordless examples with their implementation counterparts higher above, we can clearly see that there is an overall increase in efficiency and predictive capacity when deleting stop-words. While, rarely, time to time, the deletion of stop-words result in a decrease of accuracy and F1 score, on average, we clearly see a trend of increase in efficency when stop-words are deleted.