# Logistic Regression (Spam Email Classifier)

In this lab, we will develop a spam email classifier using Logistic Regression.

We will use the [SPAM E-mail Database](https://www.kaggle.com/somesh24/spambase) from Kaggle, which has been split into two parts of almost equal size: a training dataset (train.csv) and a test dataset (test.csv).
Each record in the datasets contains 58 features, one of which is the class label. The class label is the last feature and takes one of two values: +1 (spam email) or -1 (non-spam email). The other features represent various characteristics of an email, such as the frequencies of certain words or characters in its text and the lengths of sequences of consecutive capital letters (see the [SPAM E-mail Database](https://www.kaggle.com/somesh24/spambase) page for a detailed description of the features).

In [1]:
import numpy as np
import random

We start by implementing some auxiliary functions.

In [2]:
# Implement sigmoid function
def sigmoid(x):
    # Bound the argument to be in the interval [-500, 500] to prevent overflow
    x = np.clip( x, -500, 500 )

    return 1/(1 + np.exp(-x))

In [3]:
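def load_data(fname):
    """Read a CSV file and return (labels, features) as NumPy arrays."""
    labels = []
    features = []

    with open(fname) as F:
        next(F)  # skip the first line with feature names
        for line in F:
            p = line.strip().split(',')
            labels.append(int(p[-1]))  # the class label is the last column
            features.append(np.array(p[:-1], float))  # all other columns are features
    return (np.array(labels), np.array(features))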

Next we read the training and the test datasets.

In [4]:
(trainingLabels, trainingData) = load_data("train.csv")
(testLabels, testData) = load_data("test.csv")

In the files, the positive objects appear before the negative objects, so we reshuffle both datasets to avoid presenting the training algorithm with all the positive objects followed by all the negative objects.

In [5]:
# Reshuffle the training data (one shared permutation keeps labels and rows aligned)
permutation = np.random.permutation(len(trainingData))
trainingLabels = trainingLabels[permutation]
trainingData = trainingData[permutation]

# Reshuffle the test data in the same way
permutation = np.random.permutation(len(testData))
testLabels = testLabels[permutation]
testData = testData[permutation]
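
If reproducible runs are needed, the same reshuffling can be done with a seeded random generator. The sketch below is an assumption rather than part of the original lab: the helper name `shuffle_together`, the seed value, and the toy arrays are all hypothetical. It also illustrates why a single shared permutation is used: each feature row stays paired with its label.

```python
import numpy as np

def shuffle_together(labels, data, seed=0):
    """Shuffle labels and feature rows with one shared permutation (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(len(data))
    return labels[permutation], data[permutation]

# Toy arrays in which the first feature of each row equals its label,
# so the pairing is easy to check after shuffling
toyLabels = np.array([1, 1, 1, -1, -1, -1])
toyData = toyLabels.reshape(-1, 1) * np.array([[1.0, 2.0]])
shuffledLabels, shuffledData = shuffle_together(toyLabels, toyData, seed=42)
```

Calling the helper twice with the same seed gives the same ordering, which makes results easier to compare between runs.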

## Exercise 1

1. Implement the Logistic Regression training algorithm.

2. Use the training dataset to train a Logistic Regression classifier. Use learningRate=0.1 and maxIter=10. Output the bias term and the weight vector of the trained model.
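
The lab does not prescribe a particular optimiser, so the following is only one possible starting point. It is a minimal sketch that assumes stochastic gradient descent on the logistic loss log(1 + exp(-y(w·x + b))) with labels in {+1, -1}, zero-initialised parameters, and `maxIter` interpreted as the number of passes over the data. The function name `train_logistic_regression` and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(x):
    # Same clipped sigmoid as the auxiliary function defined earlier
    x = np.clip(x, -500, 500)
    return 1 / (1 + np.exp(-x))

def train_logistic_regression(data, labels, learningRate=0.1, maxIter=10):
    """Stochastic gradient descent on the logistic loss; labels must be +1 / -1.

    Returns (bias, weights).
    """
    weights = np.zeros(data.shape[1])
    bias = 0.0
    for _ in range(maxIter):              # assumption: one iteration = one pass over the data
        for x, y in zip(data, labels):
            # Estimated probability of the wrong class for this example
            g = sigmoid(-y * (np.dot(weights, x) + bias))
            # Step against the gradient of log(1 + exp(-y * (w.x + b)))
            weights += learningRate * g * y * x
            bias += learningRate * g * y
    return bias, weights

# Tiny synthetic example (hypothetical data, not the Spambase set)
toyData = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
toyLabels = np.array([1, 1, -1, -1])
toyBias, toyWeights = train_logistic_regression(toyData, toyLabels)
print("bias:", toyBias)
print("weights:", toyWeights)
```

On the real data, the call would take `trainingData` and `trainingLabels` in place of the toy arrays.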

## Exercise 2

1. Implement a Logistic Regression classifier that uses a given bias term and weight vector.

2. Use the trained model to classify objects in the test dataset. Output an evaluation report (accuracy, precision, recall, F-score).
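
As a rough guide to the two steps above, here is a minimal sketch. It assumes a decision threshold of 0.5 on the sigmoid output, which is equivalent to checking the sign of w·x + b, and it treats +1 (spam) as the positive class and reports F1 as the F-score. The function names `classify` and `evaluation_report` and the toy values are hypothetical.

```python
import numpy as np

def classify(data, bias, weights):
    """Return +1 where the linear score w.x + b is positive, otherwise -1."""
    scores = data @ weights + bias
    return np.where(scores > 0, 1, -1)

def evaluation_report(trueLabels, predictedLabels):
    """Accuracy, precision, recall and F1, with +1 (spam) as the positive class."""
    tp = np.sum((predictedLabels == 1) & (trueLabels == 1))
    fp = np.sum((predictedLabels == 1) & (trueLabels == -1))
    fn = np.sum((predictedLabels == -1) & (trueLabels == 1))
    accuracy = float(np.mean(predictedLabels == trueLabels))
    # Guard against empty denominators (e.g. no positive predictions)
    precision = float(tp / (tp + fp)) if tp + fp > 0 else 0.0
    recall = float(tp / (tp + fn)) if tp + fn > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy model and data (hypothetical values, chosen so the counts are easy to verify)
toyWeights = np.array([1.0, -1.0])
toyBias = 0.0
toyData = np.array([[2.0, 1.0], [0.0, 1.0], [3.0, 0.0], [1.0, 2.0]])
toyTrue = np.array([1, 1, -1, -1])
toyPredicted = classify(toyData, toyBias, toyWeights)
report = evaluation_report(toyTrue, toyPredicted)
print(report)
```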

## Exercise 3

1. Apply Gaussian normalisation to the training dataset.

2. Train Logistic Regression on the normalised training dataset. Use learningRate=0.1 and maxIter=10. Output the bias term and the weight vector of the trained model.

3. Normalise the test dataset using the means and standard deviations of the features *computed on the training dataset*.

4. Use the model trained on the normalised training dataset to classify objects in the normalised test dataset. Output an evaluation report (accuracy, precision, recall, F-score).

5. Compare the quality of the classifier with and without normalisation.
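
The normalisation steps can be sketched as follows, assuming that "Gaussian normalisation" means z-score standardisation: subtract each feature's mean and divide by its standard deviation. The function names `fit_normaliser` and `normalise`, the toy arrays, and the handling of constant features (a zero standard deviation is replaced by 1) are assumptions, not part of the lab text.

```python
import numpy as np

def fit_normaliser(data):
    """Compute per-feature means and standard deviations on the training data."""
    means = data.mean(axis=0)
    stds = data.std(axis=0)
    stds[stds == 0] = 1.0  # assumption: leave constant features unscaled
    return means, stds

def normalise(data, means, stds):
    """Apply z-score normalisation using previously computed statistics."""
    return (data - means) / stds

# Toy data (hypothetical); the third feature is constant in the training set
toyTrain = np.array([[1.0, 10.0, 5.0], [3.0, 30.0, 5.0], [5.0, 50.0, 5.0]])
toyTest = np.array([[3.0, 10.0, 7.0]])

means, stds = fit_normaliser(toyTrain)      # statistics come from the training set only
normTrain = normalise(toyTrain, means, stds)
normTest = normalise(toyTest, means, stds)  # the test set reuses the training statistics
```

Reusing the training statistics for the test set keeps both datasets on the same scale and avoids using any information from the test data when preparing the model's inputs.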