# Spam Classifier
## Training Data
The training data is described below and has 1000 rows. There is also a 500-row set of test data. The test data is functionally identical to the training data; it is kept in a separate CSV file to encourage you to keep your training and test data apart. You should consider how best to use all of the available data without overfitting, and how to produce an unbiased estimate of your classifier's accuracy.

The cell below loads the training data into a variable called `training_spam`.

In [1]:
import numpy as np

# np.int was removed in NumPy 1.24; the built-in int is the equivalent dtype
training_spam = np.loadtxt("data/training_spam.csv", delimiter=",").astype(int)
print("Shape of the spam training data set:", training_spam.shape)
print(training_spam)

Shape of the spam training data set: (1000, 55)
[[1 0 0 ... 0 0 0]
 [0 0 1 ... 1 0 0]
 [0 0 0 ... 1 0 0]
 ...
 [0 0 0 ... 0 0 1]
 [1 1 1 ... 1 1 0]
 [1 0 0 ... 1 1 1]]


Your training set consists of 1000 rows and 55 columns. Each row corresponds to one email message. The first column is the _response_ variable and describes whether a message is spam `1` or ham `0`. The remaining 54 columns are _features_ that you will use to build a classifier. These features correspond to 54 different keywords (such as "money", "free", and "receive") and special characters (such as ":", "!", and "$"). A feature has the value `1` if the keyword appears in the message and `0` otherwise.
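
The column layout described above can be sketched on a small made-up array. The names `toy_spam`, `toy_labels`, and `toy_features` are hypothetical and the values are random, not taken from the real data:

```python
import numpy as np

# Hypothetical stand-in for a few rows of the data:
# 1 response column followed by 54 feature columns, all 0 or 1.
rng = np.random.default_rng(0)
toy_spam = rng.integers(0, 2, size=(5, 55))

toy_labels = toy_spam[:, 0]     # response variable: 1 = spam, 0 = ham
toy_features = toy_spam[:, 1:]  # 54 binary keyword/character indicators

print(toy_labels.shape, toy_features.shape)
```

The same two slices applied to `training_spam` would give the labels and the $1000 \times 54$ feature matrix.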

As mentioned, there is also a 500-row set of *test data*. It contains the same 55 columns.

In [2]:
# np.int was removed in NumPy 1.24; the built-in int is the equivalent dtype
testing_spam = np.loadtxt("data/testing_spam.csv", delimiter=",").astype(int)
print("Shape of the spam testing data set:", testing_spam.shape)
print(testing_spam)

Shape of the spam testing data set: (500, 55)
[[1 0 0 ... 1 1 1]
 [1 1 0 ... 1 1 1]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 1 0 0]]
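
One possible way to use all 1500 rows while still measuring accuracy on unseen rows is k-fold cross-validation. The sketch below is only an illustration: `k_fold_accuracy`, `fit_predict`, and `majority_class_baseline` are hypothetical names, the data is synthetic, and the baseline model is a placeholder for a real classifier. Note that any data used to tune a setting (such as a smoothing parameter) no longer gives an unbiased estimate for the tuned model.

```python
import numpy as np

def k_fold_accuracy(data, fit_predict, n_folds=5, seed=0):
    """Mean held-out accuracy over n_folds splits.

    data: array whose first column is the label, remaining columns are features.
    fit_predict: hypothetical callable (train_rows, test_features) -> predicted labels.
    """
    rng = np.random.default_rng(seed)
    shuffled = data[rng.permutation(data.shape[0])]
    folds = np.array_split(shuffled, n_folds)
    scores = []
    for i in range(n_folds):
        held_out = folds[i]
        # Train on every fold except the held-out one.
        train_rows = np.concatenate([f for j, f in enumerate(folds) if j != i])
        predictions = fit_predict(train_rows, held_out[:, 1:])
        scores.append(np.mean(predictions == held_out[:, 0]))
    return float(np.mean(scores))

def majority_class_baseline(train_rows, test_features):
    # Placeholder model: always predicts the most common training label.
    majority = int(np.mean(train_rows[:, 0]) >= 0.5)
    return np.full(test_features.shape[0], majority)

# Synthetic stand-in for the stacked training and test rows (1500 x 55).
rng = np.random.default_rng(1)
all_rows = rng.integers(0, 2, size=(1500, 55))
print(k_fold_accuracy(all_rows, majority_class_baseline))
```

With the real files, `all_rows` could be built with `np.concatenate((training_spam, testing_spam))`, and `fit_predict` would train the actual classifier on `train_rows` before predicting.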


## Part One
Write all of the code for your classifier below this cell. There is some very rough skeleton code in the cell directly below. You may insert more cells below this if you wish, but you must not duplicate any cells as this can break the grading script.

### Submission Requirements
Your code must provide a variable with the name `classifier`. This object must have a method called `predict` which takes input data and returns class predictions. The input will be a single $n \times 54$ numpy array; your classifier should return a numpy array of length $n$ with classifications. There is a demo in the cell below, and a test you can run before submitting to check your code is working correctly.

Your code must run on our test machine in under 30 seconds. If you wish to train a more complicated model (e.g. neural network) which will take longer, you are welcome to save the model's weights as a file and then load these in the cell below so we can test it. You must include the code which computes the original weights, but this must not run when we run the notebook – comment out the code which actually executes the routine and make sure it is clear what we need to change to get it to run. Remember that we will be testing your final classifier on additional hidden data.
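
The save-then-load workflow for a slow-to-train model might look roughly like the following. This is a sketch under assumptions: the filename `weights.npz`, the array names, and the placeholder values are all hypothetical, and a temporary directory is used only so that the example cleans up after itself.

```python
import os
import tempfile
import numpy as np

# Hypothetical trained parameters: 2 log class priors and a 2 x 54 table of log-likelihoods.
log_priors = np.log(np.array([0.6, 0.4]))
log_likelihoods = np.log(np.full((2, 54), 1 / 54))

# In a real submission the file would sit next to the notebook instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    weights_path = os.path.join(tmp_dir, "weights.npz")  # hypothetical filename

    # Run once, offline, after the slow training routine has finished.
    np.savez(weights_path, log_priors=log_priors, log_likelihoods=log_likelihoods)

    # Run in the submitted notebook instead of retraining.
    with np.load(weights_path) as saved:
        loaded_priors = saved["log_priors"]
        loaded_likelihoods = saved["log_likelihoods"]

print(loaded_likelihoods.shape)
```

Storing plain arrays with `np.savez` keeps the loading step fast and avoids depending on anything beyond NumPy.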

In [3]:
class SpamClassifier:
    def __init__(self, k):
        self.k = k
        self.log_class_priors = 0
        self.log_class_conditional_likelihoods = 0
        
    # main function for training
    def train(self, alpha):
        # inner function to create class priors
        def estimate_log_class_priors(data):
            # extract the class labels "y", which are stored in the first column
            y = data[:, 0]

            # count the messages in each class (num_C0: ham, num_C1: spam)
            num_C0 = np.count_nonzero(y == 0)
            num_C1 = np.count_nonzero(y == 1)

            # estimate the class priors P(C=0) and P(C=1) and take their logarithms
            class_priors_C0 = np.log(num_C0 / (num_C0 + num_C1))
            class_priors_C1 = np.log(1 - num_C0 / (num_C0 + num_C1))

            # collect both log priors in an array of length two
            log_class_priors = np.array([class_priors_C0, class_priors_C1])

            return log_class_priors
        
        # inner function to create class conditional likelihoods
        def estimate_log_class_conditional_likelihoods(data, alpha):
            # extract the class labels "y" from the first column;
            # the remaining columns are the input features
            y = data[:, 0]
            input_data = data[:, 1:]

            # self.k (the number of keywords) is set in the constructor;
            # it could instead be derived from the data as data.shape[1] - 1

            # get the row indices of each class (0: ham, 1: spam) within the data
            indices_C0 = np.nonzero(y == 0)
            indices_C1 = np.nonzero(y == 1)

            # select the feature rows of each class (data_C0: ham, data_C1: spam)
            data_C0 = input_data[indices_C0]
            data_C1 = input_data[indices_C1]

            # for each keyword, count the training messages of each class that contain it
            num_of_w_class0 = np.count_nonzero(data_C0 == 1, axis=0)
            num_of_w_class1 = np.count_nonzero(data_C1 == 1, axis=0)

            # for each class, compute the smoothed log-probability of each keyword
            # (additive smoothing with parameter alpha) and stack the two rows
            # into one 2 x k array (theta)
            prob_array_0 = np.log((num_of_w_class0 + alpha) / (np.sum(num_of_w_class0) + self.k * alpha))
            prob_array_1 = np.log((num_of_w_class1 + alpha) / (np.sum(num_of_w_class1) + self.k * alpha))
            theta = np.stack((prob_array_0, prob_array_1), axis=0)
            
            return theta
        
        self.log_class_priors = estimate_log_class_priors(training_spam)
        self.log_class_conditional_likelihoods = estimate_log_class_conditional_likelihoods(training_spam, alpha)
        
    def predict(self, new_data):
        # Matrix product of new_data (n x k) with the transposed log-likelihood table (k x 2)
        # gives ∑ 𝑤i * log(𝜃𝑐,𝑤𝑖) for every message and class in a single call,
        # avoiding an explicit Python loop over messages
        pre_prediction_matrix = new_data @ self.log_class_conditional_likelihoods.T

        # Add log P(C=c) to each column to obtain the per-class log scores, shape (n, 2)
        prediction_matrix = pre_prediction_matrix + self.log_class_priors

        # Predict spam (1) only where its score is strictly higher; ties go to ham (0)
        class_predictions = np.where(prediction_matrix[:, 1] > prediction_matrix[:, 0], 1, 0)
        return class_predictions

def create_classifier(alpha):
    classifier = SpamClassifier(k=54)
    classifier.train(alpha)
    return classifier

classifier = create_classifier(alpha=236)
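
Read off from the code above, the classifier appears to follow a multinomial naive Bayes model with additive smoothing. As a sketch, write $n_{c,i}$ for the number of class-$c$ training messages that contain keyword $i$ and $w_i \in \{0, 1\}$ for feature $i$ of a new message (both symbols are introduced here for illustration), with $k = 54$ and $\alpha = 236$ as set in the cell:

$$
\theta_{c,i} = \frac{n_{c,i} + \alpha}{\sum_{j=1}^{k} n_{c,j} + k\alpha},
\qquad
\hat{c}(\mathbf{w}) = \arg\max_{c \in \{0, 1\}} \left[ \log P(C = c) + \sum_{i=1}^{k} w_i \log \theta_{c,i} \right]
$$

In `predict`, a message is labelled spam only when the score for $c = 1$ is strictly greater than the score for $c = 0$, so ties resolve to ham.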

### Accuracy Estimate
In the cell below there is a function called `my_accuracy_estimate()` which returns `0.5`. Before you submit the assignment, write your best guess for the accuracy of your classifier into this function, as a fraction between `0` and `1`. So if you think you will get 80% of inputs correct, return the value `0.8`. This will form a small part of the marking criteria for the assignment, to encourage you to test your own code.

In [4]:
def my_accuracy_estimate():
    return 0.9
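
When choosing the value to return, it may help to know how much a measured accuracy can vary by chance. The sketch below computes the binomial standard error of an accuracy measured on held-out rows. It assumes the messages are independent and that the hidden data comes from the same source; the function name and the example numbers are made up.

```python
import numpy as np

def accuracy_with_standard_error(predictions, labels):
    """Return (accuracy, binomial standard error) for 0/1 predictions."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    n = labels.shape[0]
    accuracy = np.count_nonzero(predictions == labels) / n
    standard_error = np.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy, standard_error

# Made-up example: 500 held-out messages, of which 450 are classified correctly.
labels = np.zeros(500, dtype=int)
predictions = labels.copy()
predictions[:50] = 1  # 50 deliberate mistakes
accuracy, standard_error = accuracy_with_standard_error(predictions, labels)
print(f"accuracy = {accuracy:.3f} +/- {standard_error:.3f}")
```

A measurement taken on rows that were also used for training or tuning would tend to be optimistic, so the rows passed in here should be ones the model has not seen.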

Write all of the code for your classifier above this cell.

### Testing Details
Your classifier will be tested against some hidden data from the same source as the original. The accuracy (percentage of classifications correct) will be calculated, then benchmarked against common methods. At the very high end of the grading scale, your accuracy will also be compared to the best submissions from other students (in your own cohort and others!). Your estimate from the cell above will also factor in, and you will be rewarded for being close to your actual accuracy (overestimates and underestimates will be treated the same).

#### Test Cell
The following code will run your classifier against the provided test data. To enable it, set the constant `SKIP_TESTS` to `False`.

The original skeleton code above classifies every row as ham, but once you have written your own classifier you can run this cell again to test it. So long as your code sets up a variable called `classifier` with a method called `predict`, the test code will be able to run. 

Of course you may wish to test your classifier in additional ways, but you *must* ensure this version still runs before submitting.

**IMPORTANT**: you must set `SKIP_TESTS` back to `True` before submitting this file!

In [5]:
SKIP_TESTS = True

if not SKIP_TESTS:
    # np.int was removed in NumPy 1.24; the built-in int is the equivalent dtype
    testing_spam = np.loadtxt("data/testing_spam.csv", delimiter=",").astype(int)
    test_data = testing_spam[:, 1:]
    test_labels = testing_spam[:, 0]

    predictions = classifier.predict(test_data)
    accuracy = np.count_nonzero(predictions == test_labels)/test_labels.shape[0]
    print(f"Accuracy on test data is: {accuracy}")
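
As one additional way to test a classifier beyond overall accuracy, the errors can be broken down into spam wrongly passed as ham and ham wrongly flagged as spam. The helper `confusion_counts` below is a hypothetical sketch, and the example labels and predictions are made up:

```python
import numpy as np

def confusion_counts(predictions, labels):
    """Return a 2 x 2 array where entry [t, p] counts rows with true label t and prediction p."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    counts = np.zeros((2, 2), dtype=int)
    for true_label in (0, 1):
        for predicted in (0, 1):
            counts[true_label, predicted] = np.count_nonzero(
                (labels == true_label) & (predictions == predicted)
            )
    return counts

# Made-up labels and predictions, for illustration only.
example_labels = np.array([0, 0, 0, 1, 1, 1])
example_predictions = np.array([0, 1, 0, 1, 1, 0])
print(confusion_counts(example_predictions, example_labels))
```

With the test cell enabled, the same function could be called as `confusion_counts(predictions, test_labels)`.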

In [6]:
import sys
import pathlib

fail = False

if not SKIP_TESTS:
    fail = True
    print("You must set the SKIP_TESTS constant to True in the cell above.")
    
p3 = pathlib.Path('./spamclassifier.ipynb')
if not p3.is_file():
    fail = True
    print("This notebook file must be named spamclassifier.ipynb")
    
if "create_classifier" not in dir():
    fail = True
    print("You must include a function called create_classifier.")

if "my_accuracy_estimate" not in dir():
    fail = True
    print("You must include a function called my_accuracy_estimate.")
else:
    if my_accuracy_estimate() == 0.5:
        print("Warning:")
        print("You do not seem to have provided an accuracy estimate, it is set to 0.5.")
        print("This is actually the worst possible accuracy – if your classifier")
        print("got 0.1 then it could invert its results to get 0.9!")
    
print("INFO: Make sure you follow the instructions on the assignment page to submit your video.")
print("Failing to include this could result in an overall grade of zero for both parts.")
print()

if fail:
    sys.stderr.write("Your submission is not ready! Please read and follow the instructions above.")
else:
    print("All checks passed. When you are ready to submit, upload the notebook and readme file to the")
    print("assignment page, without changing any filenames.")
    print()
    print("If you need to submit multiple files, you can archive them in a .zip file. (No other format.)")

INFO: Make sure you follow the instructions on the assignment page to submit your video.
Failing to include this could result in an overall grade of zero for both parts.

All checks passed. When you are ready to submit, upload the notebook and readme file to the
assignment page, without changing any filenames.

If you need to submit multiple files, you can archive them in a .zip file. (No other format.)


In [7]:
# This is a test cell. Please do not modify or delete.