# Spam Classifier

## Overview
Spam refers to unwanted email, often in the form of advertisements. In the literature, an email that is **not** spam is called *ham*. Most email providers offer automatic spam filtering, where spam emails will be moved to a separate inbox based on their contents. Of course this requires being able to scan an email and determine whether it is spam or ham, a classification problem. This is the subject of this assignment.

### Choice of Algorithm
The classification method is a free choice, so to reach a high accuracy I should test several types of model and compare their estimated accuracies to identify the best. A k-nearest neighbour algorithm is one approach I could consider, but it may be less accurate. Logistic regression is another option. Since I want to look beyond my current knowledge and skills, I am interested in building something more advanced, such as an artificial neural network. This is possible using only `numpy`, but will require significant self-directed learning.
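As a point of comparison for the more advanced model, a logistic regression baseline can be sketched in a few lines of `numpy`. This is a minimal illustration rather than the classifier used in this notebook: the function names, the learning rate, the epoch count, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, learning_rate=0.1, num_epochs=500):
    """Fit weights and a bias by batch gradient descent on the mean cross-entropy loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(num_epochs):
        p = sigmoid(X @ w + b)                 # predicted spam probabilities
        grad_w = X.T @ (p - y) / n_samples     # gradient of the mean loss w.r.t. the weights
        grad_b = np.mean(p - y)                # gradient of the mean loss w.r.t. the bias
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def predict_logistic(X, w, b):
    return (sigmoid(X @ w + b) >= 0.5).astype(int)

# Synthetic stand-in data (hypothetical): 200 "emails" with 5 binary features,
# where the label simply copies the first feature.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 5))
y_demo = X_demo[:, 0]

w, b = fit_logistic_regression(X_demo, y_demo)
baseline_accuracy = np.mean(predict_logistic(X_demo, w, b) == y_demo)
print(f"Baseline accuracy on the synthetic data: {baseline_accuracy:.3f}")
```

A baseline like this gives a reference accuracy that a neural network should beat before its extra complexity is justified.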

**Note:** 
I will use helper functions from libraries like `numpy` or `scipy`, but I **will not** import code which builds entire models for me. This includes, but is not limited to, libraries like `scikit-learn`, `tensorflow`, or `pytorch`.

## Training Data
The training data is described below and has 1000 rows. There is also a 500-row set of test data. It is functionally identical to the training data and is kept in a separate csv file only to encourage splitting the data before training and testing my model. I will consider how best to use all of the available data without overfitting, and how to produce an unbiased estimate of my classifier's accuracy.
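One common way to use all of the available rows while still getting a less biased accuracy estimate is k-fold cross-validation. The sketch below is an illustration under stated assumptions, not the procedure used later in this notebook: `k_fold_indices`, `cross_validated_accuracy`, and the `fit_predict` callable are hypothetical names, and a majority-class predictor on random data stands in for a real model.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every row is used for validation exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def cross_validated_accuracy(data, fit_predict, k=5, seed=0):
    """Average validation accuracy over k folds.

    `fit_predict(train_rows, val_features)` is a hypothetical callable that trains
    on `train_rows` (label in column 0) and returns predicted labels for `val_features`.
    """
    scores = []
    for train_idx, val_idx in k_fold_indices(len(data), k, seed):
        val = data[val_idx]
        preds = fit_predict(data[train_idx], val[:, 1:])
        scores.append(np.mean(preds == val[:, 0]))
    return float(np.mean(scores)), float(np.std(scores))

# Placeholder model: always predict the majority class of the training rows.
def majority_class(train_rows, val_features):
    majority = int(np.mean(train_rows[:, 0]) >= 0.5)
    return np.full(len(val_features), majority)

# Random stand-in data with the same layout as the spam files (label + 54 features).
rng = np.random.default_rng(1)
demo = rng.integers(0, 2, size=(100, 55))
mean_acc, std_acc = cross_validated_accuracy(demo, majority_class)
print(f"Cross-validated accuracy: {mean_acc:.3f} (std {std_acc:.3f})")
```

Because each row is held out exactly once, the averaged score does not depend on a single lucky or unlucky split.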

The cell below loads the training data into a variable called `training_spam`.

In [None]:
import numpy as np
from IPython.display import HTML,Javascript, display

training_spam = np.loadtxt(open("data/training_spam.csv"), delimiter=",").astype(int)
print("Shape of the spam training data set:", training_spam.shape)
print(training_spam)

My model's training set consists of 1000 rows and 55 columns. Each row corresponds to one email message. The first column is the _response_ variable and describes whether a message is spam `1` or ham `0`. The remaining 54 columns are _features_ that I will use to build a classifier. These features correspond to 54 different keywords (such as "money", "free", and "receive") and special characters (such as ":", "!", and "$"). A feature has the value `1` if the keyword appears in the message and `0` otherwise.
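The layout described above can be illustrated with a tiny hand-made array. The values below are invented for the example and have only three feature columns instead of 54; the point is how the label column is separated from the features and how per-class feature frequencies can be inspected.

```python
import numpy as np

# Hand-made stand-in with the same layout as the spam data:
# column 0 is the label (1 = spam, 0 = ham), the rest are binary features.
toy = np.array([
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
])

labels = toy[:, 0]
features = toy[:, 1:]

spam_fraction = labels.mean()                     # share of rows labelled spam -> 0.5
spam_rate = features[labels == 1].mean(axis=0)    # how often each feature appears in spam -> [1.0, 0.5, 0.5]
ham_rate = features[labels == 0].mean(axis=0)     # how often each feature appears in ham  -> [0.0, 0.5, 1.0]

print("Spam fraction:", spam_fraction)
print("Feature frequency in spam:", spam_rate)
print("Feature frequency in ham:", ham_rate)
```

Applied to the real data, features whose frequency differs strongly between the two classes are the ones a classifier can exploit most easily.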

As mentioned, there is also a 500-row set of *test data*. It contains the same 55 columns.

In [None]:
testing_spam = np.loadtxt(open("data/testing_spam.csv"), delimiter=",").astype(int)
print("Shape of the spam testing data set:", testing_spam.shape)
print(testing_spam)

## Part One
I will write all of the code for my classifier below this cell, which showcases my approach in full.

In [None]:
import numpy as np

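#Merge the testing and training data to provide a larger dataset
#for training the NN, with the aim of better accuracy
#the rows are not shuffled, so the held-out 25% below comes entirely from the end of the test file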
merged_data = np.concatenate((training_spam, testing_spam), axis=0)
train_size = 0.75

train_idx = int(len(merged_data) * train_size)
training_data = merged_data[:train_idx]
testing_data = merged_data[train_idx:]


#Feed Forward Neural Network
class SpamClassifier:
    def __init__(self, input_size, hidden_sizes, output_size, learning_rate, num_epochs, lambd):
        self.input_size = input_size #number of input features - 54 keywords/symbols
        self.hidden_sizes = hidden_sizes #number of neurons in each hidden layer
        self.output_size = output_size #number of output neurons - 1 for a binary spam/ham prediction
        self.learning_rate = learning_rate #step size for each gradient descent update
        self.num_epochs = num_epochs #number of full passes over the training data
        self.lambd = lambd #strength of the L2 regularisation - intended to limit overfitting
        self.weights = [] #one weight matrix per layer
        self.biases = [] #one bias row vector per layer
        self._initialise_weights()
    #initialise weights from a standard normal distribution and biases at zero
    def _initialise_weights(self):
        sizes = [self.input_size] + self.hidden_sizes + [self.output_size]
        for i in range(len(sizes) - 1):
            self.weights.append(np.random.randn(sizes[i], sizes[i+1]))
            self.biases.append(np.zeros((1, sizes[i+1])))

    #sigmoid function - output activation (squashes values into the range 0 to 1)
    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    #sigmoid derivative - expects the sigmoid output a = sigmoid(z), not z itself
    def _sigmoid_derivative(self, a):
        return a * (1 - a)

    #hyperbolic tan function - hidden layer activation (squashes values into the range -1 to 1)
    def _tanh(self, x):
        return np.tanh(x)
    #hyperbolic tan derivative - expects the tanh output a = tanh(z), not z itself
    def _tanh_derivative(self, a):
        return 1 - np.square(a)

    #forward propagation function
    def _forward(self, X):
        activations = [X] #start with the input data
        for i in range(len(self.weights) - 1): #process every layer except the last
            z = np.dot(activations[-1], self.weights[i]) + self.biases[i]
            a = self._tanh(z)
            activations.append(a) #stored for back propagation
        #process the final layer with a sigmoid to produce the output
        z = np.dot(activations[-1], self.weights[-1]) + self.biases[-1]
        a = self._sigmoid(z)
        activations.append(a)
        return activations

    #back propagation function
    def _backward(self, X, y, activations):
        output = activations[-1] #output of the last forward pass
        error = output - y #output error
        #delta for the output layer; the sigmoid-derivative factor matches a squared-error loss,
        #whereas for the cross-entropy loss reported by _compute_loss the delta would be error alone
        deltas = [error * self._sigmoid_derivative(output)]
        #iterate backwards through the hidden layers to propagate the error
        for i in reversed(range(len(activations) - 2)):
            delta = np.dot(deltas[0], self.weights[i+1].T) * self._tanh_derivative(activations[i+1])
            deltas.insert(0, delta)
        #update weights and biases for all layers using the computed deltas
        #(weight gradients are summed over the batch, bias gradients are averaged,
        #and the L2 term from _compute_loss is not part of these updates)
        for i in range(len(self.weights)):
            self.weights[i] -= self.learning_rate * np.dot(activations[i].T, deltas[i])
            self.biases[i] -= self.learning_rate * np.mean(deltas[i], axis=0, keepdims=True)

    #Loss function
    def _compute_loss(self, y_pred, y_true):
        y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12) #avoid log(0) when the sigmoid saturates
        cross_entropy = np.mean(-y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred)) #cross-entropy loss
        l2_regularisation = 0
        #add the L2 penalty for each weight matrix
        for weight in self.weights:
            l2_regularisation += 0.5 * self.lambd * np.sum(np.square(weight)) / len(y_true)
        return cross_entropy + l2_regularisation #total loss

    #training function for the NN
    def train(self, data, batch_size=32):
        X = data[:, 1:]
        y = data[:, 0].reshape(-1, 1)
        #mini-batch gradient descent; rows beyond the last full batch are skipped
        num_batches = len(data) // batch_size #number of full batches
        #iterate over each epoch
        for epoch in range(self.num_epochs):
            epoch_loss = 0
            epoch_accuracy = 0
            #process each batch in turn
            for batch in range(num_batches):
                start = batch * batch_size #start index
                end = start + batch_size #end index
                #features and labels for the current batch
                batch_X = X[start:end]
                batch_y = y[start:end]
                activations = self._forward(batch_X) #forward propagation for the current batch
                self._backward(batch_X, batch_y, activations) #backward propagation to update weights and biases
                y_pred = activations[-1] #predictions from the last layer's activations (computed before the update)
                batch_loss = self._compute_loss(y_pred, batch_y) #loss for the current batch
                batch_accuracy = np.mean((y_pred.round() == batch_y).astype(int)) #accuracy for the current batch
                epoch_loss += batch_loss
                epoch_accuracy += batch_accuracy
            #average loss and accuracy over the batches
            epoch_loss /= num_batches
            epoch_accuracy /= num_batches
            #print the statistics for each epoch
            print(f"Epoch {epoch+1}/{self.num_epochs} - Loss: {epoch_loss:.4f} - Accuracy: {epoch_accuracy:.4f}")

    #prediction function
    def predict(self, X):
        activations = self._forward(X)
        return activations[-1].round().flatten()
    

def create_classifier():
    #initialise the classifier with the chosen hyperparameters
    classifier = SpamClassifier(input_size=54, hidden_sizes=[32, 16], output_size=1, learning_rate=0.015, num_epochs=1200, lambd=0.003)
    #training is skipped here because pre-trained weights and biases are loaded below
    #classifier.train(training_data)
    return classifier

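#create the classifier, then load the saved weights and biases into it
classifier = create_classifier()

#sizes of the hidden layers
hidden_sizes = [32, 16]
#lists for the best weights and biases found during training
best_weights = []
best_biases = []

#load the best weights and biases for each layer (two hidden layers plus the output layer)
#note: these are absolute paths, whereas the training helper below saves its CSV files to the working directory
for i in range(len(hidden_sizes) + 1):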
    weights = np.loadtxt(f"/bestBiasesWeights/best_weights_layer_{i+1}.csv", delimiter=",")
    biases = np.loadtxt(f"/bestBiasesWeights/best_biases_layer_{i+1}.csv", delimiter=",")
    best_weights.append(weights)
    best_biases.append(biases)

classifier.weights = best_weights
classifier.biases = best_biases



#Helper for finding the best weights and biases over repeated training runs.
#Steps for use: 1) remove the triple quotes around the block below, 2) run bestvariablesfortraining(),
#3) comment out the loop above that loads the saved weights and biases, along with the two lines that assign them.

"""
def evalaccuracy(classifier, testing_data):
    test_data = testing_data[:, 1:]
    test_labels = testing_data[:, 0]
    predictions = classifier.predict(test_data)
    accuracy = np.count_nonzero(predictions == test_labels) / test_labels.shape[0]
    return accuracy

def bestvariablesfortraining(num_runs=100):
    best_accuracy = 0
    best_weights = None
    best_biases = None

    for run in range(num_runs):
        print(f"Run {run+1}/{num_runs}")
        classifier = SpamClassifier(input_size=54, hidden_sizes=[32, 16], output_size=1, learning_rate=0.015, num_epochs=1200, lambd=0.003)
        classifier.train(training_data)
        accuracy = evalaccuracy(classifier, testing_data)
        print(f"Accuracy on test data: {accuracy}\n")

        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_weights = classifier.weights
            best_biases = classifier.biases

    print(f"Best accuracy: {best_accuracy}")
    print("Best weights:")
    for i, weight in enumerate(best_weights):
        print(f"Layer {i+1}: {weight}")
    print("Best biases:")
    for i, bias in enumerate(best_biases):
        print(f"Layer {i+1}: {bias}")

    # Write best weights and biases to CSV files
    for i, weight in enumerate(best_weights):
        np.savetxt(f"best_weights_layer_{i+1}.csv", weight, delimiter=",")

    for i, bias in enumerate(best_biases):
        np.savetxt(f"best_biases_layer_{i+1}.csv", bias, delimiter=",")

    return best_weights, best_biases

bestvariablesfortraining()
"""


### Accuracy Estimate
The `my_accuracy_estimate` function below returns my estimate of the model's accuracy on unseen, real-world data beyond the current dataset.
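An accuracy measured on a finite held-out set carries sampling uncertainty, and a rough interval can be attached to it with the normal approximation to the binomial distribution. The sketch below uses illustrative numbers (90% accuracy on 400 emails), not results from this notebook, and `accuracy_interval` is a hypothetical helper name.

```python
import math

def accuracy_interval(accuracy, n_samples, z=1.96):
    """Approximate 95% confidence interval for an accuracy measured on n_samples rows.

    Uses the normal approximation to the binomial, which assumes the rows are
    independent and were not used for training or model selection.
    """
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_samples)
    return accuracy - z * standard_error, accuracy + z * standard_error

# Illustrative values only: 90% accuracy measured on 400 held-out emails.
low, high = accuracy_interval(0.90, 400)
print(f"Approximate 95% interval: {low:.4f} to {high:.4f}")  # 0.8706 to 0.9294
```

If the same held-out rows were also used to pick the best of many training runs, the measured accuracy is likely to be optimistic, so an interval like this should be read as a best case.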

In [None]:
def my_accuracy_estimate():
    #estimated as the expected accuracy over multiple training runs,
    #extrapolated to an assumed hidden test set of 2000 - 2500 emails
    return 0.938

### Testing Details
My classifier will be tested against some hidden data from the same source as the original. The accuracy (percentage of classifications correct) will be calculated, then benchmarked against common methods. 
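Accuracy alone does not show which kind of mistake a classifier makes, so it can help to break the predictions down into a confusion matrix as well. The following is a small sketch with made-up predictions and labels; `confusion_counts` is a hypothetical helper, not part of the marking code.

```python
import numpy as np

def confusion_counts(predictions, labels):
    """Count true/false positives and negatives, treating spam (1) as the positive class."""
    predictions = np.asarray(predictions).astype(int)
    labels = np.asarray(labels).astype(int)
    return {
        "tp": int(np.sum((predictions == 1) & (labels == 1))),  # spam correctly flagged
        "tn": int(np.sum((predictions == 0) & (labels == 0))),  # ham correctly passed
        "fp": int(np.sum((predictions == 1) & (labels == 0))),  # ham wrongly flagged as spam
        "fn": int(np.sum((predictions == 0) & (labels == 1))),  # spam that slipped through
    }

# Made-up example values.
example_predictions = [1, 0, 1, 1, 0, 0]
example_labels = [1, 0, 0, 1, 1, 0]

counts = confusion_counts(example_predictions, example_labels)
example_accuracy = (counts["tp"] + counts["tn"]) / len(example_labels)
print(counts)            # {'tp': 2, 'tn': 2, 'fp': 1, 'fn': 1}
print(example_accuracy)  # 4 correct out of 6
```

For a spam filter, false positives (ham moved to the spam folder) are usually the more costly error, which is why the split is worth checking alongside the headline accuracy.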

#### Test Cell
The following code will run my classifier against the provided test data. To enable it, set the constant `SKIP_TESTS` to `False`.

**IMPORTANT**: `SKIP_TESTS` must be set back to `True` before this file is submitted!

In [None]:
SKIP_TESTS = True

if not SKIP_TESTS:
    testing_spam = np.loadtxt(open("data/testing_spam.csv"), delimiter=",").astype(int)
    test_data = testing_spam[:, 1:]
    test_labels = testing_spam[:, 0]

    predictions = classifier.predict(test_data)
    accuracy = np.count_nonzero(predictions == test_labels)/test_labels.shape[0]
    print(f"Accuracy on test data is: {accuracy}")

In [None]:
import sys
import pathlib

fail = False;

success = '\033[1;32m[✓]\033[0m'
issue = '\033[1;33m[!]'
error = '\033[1;31m\t✗'

#######
##
## Skip Tests check.
##
## Test to ensure the SKIP_TESTS variable is set to True to prevent it slowing down the automarker.
##
#######

if not SKIP_TESTS:
    fail = True;
    print("{} \'SKIP_TESTS\' is incorrectly set to False.\033[0m".format(issue))
    print("{} You must set the SKIP_TESTS constant to True in the cell above.\033[0m".format(error))
else:
    print('{} \'SKIP_TESTS\' is set to true.\033[0m'.format(success))

#######
##
## File Name check.
##
## Test to ensure file has the correct name. This is important for the marking system to correctly process the submission.
##
#######
    
p3 = pathlib.Path('./spamclassifier.ipynb')
if not p3.is_file():
    fail = True
    print("{} The notebook name is incorrect.\033[0m".format(issue))
    print("{} This notebook file must be named spamclassifier.ipynb\033[0m".format(error))
else:
    print('{} The notebook name is correct.\033[0m'.format(success))

#######
##
## Create classifier function check.
##
## Test that checks the create_classifier function exists. The function should train the classifier and return it so that it can be evaluated by the marking system.
##
#######

if "create_classifier" not in dir():
    fail = True;
    print("{} The create_classifier function has not been defined.\033[0m".format(issue))
    print("{} Your code must include a create_classifier function as described in the coursework specification.\033[0m".format(error))
    print("{} If you believe you have, \'restart & run-all\' to clear this error.\033[0m".format(error))
else:
    print('{} The create_classifier function has been defined.\033[0m'.format(success))

#######
##
## Classifier variable check.
##
## Test that checks the classifier variable exists. The marking system will use this variable to make predictions based on a set of random features you have not seen. Your score will be based on how well your classifier predicts the hidden labels.
##
#######

if 'classifier' not in vars():
    fail = True;
    print("{} The classifier variable has not been defined.\033[0m".format(issue))
    print("{} Your code must create a variable called \'classifier\' as described in the coursework specification.\033[0m".format(error))
    print("{} This variable should contain the trained classifier you have created.\033[0m".format(error))
else:
    print('{} The classifier variable has been correctly defined.\033[0m'.format(success))

#######
##
## Accuracy Estimation check.
##
## Test that checks the accuracy estimation function exists and is a reasonable value. This is a requirement of the coursework specification and is used by the marking system when generating your final grade.
##
#######

if "my_accuracy_estimate" not in dir():
    fail = True;
    print("{} The my_accuracy_estimate function has not been defined.\033[0m".format(issue))
    print("{} Your code must include a my_accuracy_estimate function as described in the coursework specification.\033[0m".format(error))
    print("{} If you believe you have, \'restart & run-all\' to clear this error.\033[0m".format(error))
else:
    if my_accuracy_estimate() == 0.5:
        print("{} my_accuracy_estimate function warning.\033[0m".format(issue))
        print("{} my_accuracy_estimate function returns a value of 0.5 - Your classifier is no better than random chance.\033[0m".format(error))
        print("{} Are you sure this is correct.\033[0m".format(error))
    else:
        print('{} The my_accuracy_estimate function has been defined correctly.\033[0m'.format(success))

#######
##
## Test set check.
##
## Test that checks your classifier actually works. The calls made here are the same made by the automarker - albeit with different data. If your work fails this test it will score 0 in the automarker.
##
#######

try:
    testing_spam = np.loadtxt(open("data/testing_spam.csv"), delimiter=",").astype(int)
    test_data = testing_spam[:, 1:]
    test_labels = testing_spam[:, 0]
    
    try:
        predictions = classifier.predict(test_data)
        accuracy = np.count_nonzero(predictions == test_labels)/test_labels.shape[0]
        print('{0} Success running test set - Accuracy was {1:.2f}%.\033[0m'.format(success, (accuracy*100)))
    except Exception as e:
        fail = True
        print("{} Error running test set.\033[0m".format(issue))
        print("{} Your code produced the following error. This error will result in a zero from the automarker, please fix.\033[0m".format(error))
#         print("{} {}\033[0m".format(error, e))
        print(e)
except:
    sys.stderr.write("Unable to run one test as the file \'data/testing_spam.csv\' could not be found.")

#######
##
## Final Summary
##
## Prints the final results of the submission tests.
##
#######

if fail:
    sys.stderr.write("Your submission is not ready! Please read and follow the instructions above.")
else:
    print("\033[1m\n\n")
    print("╔═══════════════════════════════════════════════════════════════╗")
    print("║                        Congratulations!                       ║")
    print("║                                                               ║")
    print("║            Your work meets all the required criteria          ║")
    print("║                   and is ready for submission.                ║")
    print("╚═══════════════════════════════════════════════════════════════╝")
    print("\033[0m")
    

In [None]:
# This is a test cell. Please do not modify or delete.