# Logistic Regression (Spam Email Classifier)

In this lab we will develop a spam email classifier using Logistic Regression.

We will use the [SPAM E-mail Database](https://www.kaggle.com/somesh24/spambase) from Kaggle, which was split into two almost equal parts: a training dataset (train.csv) and a test dataset (test.csv).
Each record in the datasets contains 58 values: 57 features followed by the class label. The class label is the last value and takes one of two values: +1 (spam email) or -1 (non-spam email). The features represent various characteristics of emails, such as the frequencies of certain words or characters in the text of an email and the lengths of sequences of consecutive capital letters (see [SPAM E-mail Database](https://www.kaggle.com/somesh24/spambase) for a detailed description of the features).

In [2]:
import numpy as np
import random

We start with implementing some auxiliary functions.

In [3]:
# Implement sigmoid function
def sigmoid(x):
    # Bound the argument to be in the interval [-500, 500] to prevent overflow
    x = np.clip( x, -500, 500 )

    return 1/(1 + np.exp(-x))
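
The clipping above keeps `np.exp` from overflowing when the argument is a large negative number. As an alternative sketch (not the lab's own code), the sigmoid can be written so that `np.exp` is only ever called on non-positive values. The name `stable_sigmoid` is hypothetical.

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that only calls np.exp on non-positive values."""
    x = np.asarray(x, dtype=float)
    # exp(-|x|) lies in (0, 1], so it cannot overflow
    z = np.exp(-np.abs(x))
    # For x >= 0 use 1/(1+exp(-x)); for x < 0 use exp(x)/(1+exp(x))
    return np.where(x >= 0, 1.0 / (1.0 + z), z / (1.0 + z))

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```

Both branches are algebraically the same function; the split only decides which form is evaluated, so no clipping bounds need to be chosen.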

In [4]:
def load_data(fname):
    labels = []
    features = []
    
    with open(fname) as F:
        next(F) # skip the first line with feature names
        for line in F:
            p = line.strip().split(',')
            labels.append(int(p[-1]))
            features.append(np.array(p[:-1], float))
    return (np.array(labels), np.array(features))

Next we read the training and the test datasets.

In [5]:
(trainingLabels, trainingData) = load_data("train.csv")
(testLabels, testData) = load_data("test.csv")

In the files, the positive objects appear before the negative objects. We therefore reshuffle both datasets, so that the training algorithm is not presented with all positive objects followed by all negative objects.

In [6]:
#Reshuffle the training data and labels with the same permutation
permutation =  np.random.permutation(len(trainingData))
trainingLabels = trainingLabels[permutation] # Y_train
trainingData = trainingData[permutation] # X_train

#Reshuffle the test data and labels in the same way
permutation =  np.random.permutation(len(testData))
testLabels = testLabels[permutation]
testData = testData[permutation]
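
The key point of the reshuffling step is that one permutation is applied to both the features and the labels, so every row keeps its own label. A minimal sketch on toy arrays follows; the fixed seed and the toy data are assumptions made here for reproducibility and are not part of the lab.

```python
import numpy as np

# Assumption: a seeded generator, so the shuffle is repeatable between runs
rng = np.random.default_rng(0)

# Toy stand-ins for the lab's arrays: row k is [2k, 2k+1] and has label y[k]
X = np.arange(10).reshape(5, 2)
y = np.array([1, 1, 1, -1, -1])   # positives first, as in the files

# One permutation, used to index both arrays
perm = rng.permutation(len(X))
X_shuffled, y_shuffled = X[perm], y[perm]

# Recover each shuffled row's original index and check its label is unchanged
original_index = X_shuffled[:, 0] // 2
print(np.array_equal(y[original_index], y_shuffled))
```

Drawing a second, independent permutation for the labels would break the pairing between rows and labels.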

## Exercise 1

1. Implement the Logistic Regression training algorithm.

In [7]:
trainingData.shape, trainingLabels.shape

((2301, 57), (2301,))

In [16]:
def logisticRegression(trainingData, trainingLabels, learningRate, maxIter):
    #Compute the number of training objects
    numTrainingObj = len(trainingData)
    #Compute the number of features (dimension of our data)
    numFeatures = len(trainingData[0])
    
    #Initialize the bias term and the weights
    b = 0
    W = np.zeros(numFeatures)
    
    for t in range(maxIter):
        #For every training object
        for i in range(numTrainingObj):
            X = trainingData[i]
            y = trainingLabels[i]
            #Compute the activation score
            a = np.dot(X, W) + b
        
            #Update the bias term and the weights. Both updates share the
            #same factor, so it is computed once per training object
            step = learningRate*y*sigmoid(-y*a)
            b = b + step
            #Vector form of the per-feature update W[s] = W[s] + step*X[s]
            W += step*X
            
    return (b, W)
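
The update in the training loop can be read as a stochastic gradient step on the logistic loss log(1 + exp(-y*a)) of a single training object. The sketch below checks this numerically on random toy values: it compares the update direction `y*sigmoid(-y*a)*X` with a finite-difference estimate of the negative gradient. The helper `log_loss` and the toy data are assumptions introduced only for this check.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_loss(W, b, x, y):
    """Logistic loss of one object with label y in {-1, +1}."""
    return np.log1p(np.exp(-y * (np.dot(x, W) + b)))

rng = np.random.default_rng(0)
x = rng.normal(size=4)   # toy feature vector
W = rng.normal(size=4)   # toy weights
b, y, eps = 0.3, -1, 1e-6

# Direction used by the update rule: y * sigmoid(-y*a) * x
a = np.dot(x, W) + b
step_direction = y * sigmoid(-y * a) * x

# Central finite differences of the loss with respect to each weight
numeric_grad = np.array([
    (log_loss(W + eps * e, b, x, y) - log_loss(W - eps * e, b, x, y)) / (2 * eps)
    for e in np.eye(4)
])

# The update direction should equal the negative gradient
print(np.allclose(step_direction, -numeric_grad, atol=1e-6))
```

Multiplying this direction by `learningRate` gives the weight update; the bias update is the same expression with the feature vector replaced by 1.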

2. Use the training dataset to train a Logistic Regression classifier. Use learningRate=0.1 and maxIter=10. Output the bias term and the weight vector of the trained model.

In [17]:
(b,W) = logisticRegression(trainingData, trainingLabels, 0.1, 10)
print("Bias term: ", b, "\nWeight vector: ", W) 

Bias term:  -359.241556756663 
Weight vector:  [  -3.06886154 -152.73004915  -12.64407314   49.487       -10.60851097
   27.87071877   56.86345449   42.60101464   14.74856774  -14.0462899
    9.94805629 -202.24803715   12.42251917  -14.60491616   21.995
   96.49507949   30.90826492   18.78978616  -42.40899094   24.81674505
   25.36285248   93.09200247   65.12942993   43.9920461  -570.28970002
 -264.27336196 -252.23389513 -139.35554652 -109.55133003  -96.222
  -58.802       -36.34604031 -115.44951341  -37.64904031 -119.26876728
  -72.62911283 -140.99608899   -1.76103242  -89.36814206  -33.59799997
  -55.92059139 -148.25537505  -54.69810707  -66.51215011 -144.87530257
  -72.66167569   -8.101       -42.22314051   -7.13607185  -72.58142022
  -10.88241216  106.38525455   38.89858458   26.87524034 -180.46758984
   95.72074018    6.52413413]


## Exercise 2

1. Implement a Logistic Regression classifier with a given bias term and weight vector.

In [18]:
def logisticRegressionTest(b, W, X):
    #Compute the activation score
    a = np.dot(X, W) + b

    #Predict by the sign of the score; the confidence is the estimated
    #probability of the predicted class
    if a > 0:
        predictedClass = +1
        confidence = sigmoid(a)
    else:
        predictedClass = -1
        confidence = 1-sigmoid(a)
    return (predictedClass, confidence)
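
The classifier above handles one object at a time. As a sketch of the same decision rule applied to a whole dataset at once, the scores for all rows can be computed with a single matrix-vector product. The name `predict_batch` and the toy values are hypothetical.

```python
import numpy as np

def predict_batch(b, W, X):
    """Return an array of +1/-1 labels, one per row of X."""
    scores = X @ W + b
    # Same rule as the single-object classifier: +1 only if the score is > 0
    return np.where(scores > 0, 1, -1)

W = np.array([1.0, -2.0])
b = 0.5
X = np.array([[3.0, 1.0],     # score  1.5
              [0.0, 1.0],     # score -1.5
              [1.0, 0.75]])   # score  0.0, which falls into the -1 branch
print(predict_batch(b, W, X))
```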

2. Use the trained model to classify objects in the test dataset. Output an evaluation report (accuracy, precision, recall, F-score).

In [20]:
def evaluationReport(classTrue, classPred):
    positive_mask = classTrue == 1

    # Count the number of elements in the positive class 
    positive = np.count_nonzero(positive_mask)
    # Count True Positive
    tp = np.count_nonzero(classPred[positive_mask]==1)
    # Count False Negative
    fn = np.count_nonzero(classPred[positive_mask]==-1)
    
    negative_mask = classTrue == -1

    # Count the number of elements in the negative class 
    negative = np.count_nonzero(negative_mask)
    # Count False Positive
    fp = np.count_nonzero(classPred[negative_mask]==1)
    # Count True Negative
    tn = np.count_nonzero(classPred[negative_mask]==-1)

    # Compute Accuracy, Precision, Recall, and F-score
    accuracy = (tp + tn)/(tp + tn + fp + fn)
    precision = tp/(tp + fp)
    recall = tp/(tp + fn)
    fscore = 2*precision*recall/(precision + recall)
    print("Evaluation report")
    print("Accuracy: %.2f" % accuracy)
    print("Precision: %.2f" % precision)
    print("Recall: %.2f" % recall)
    print("F-score: %.2f" % fscore)
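
To make the four metrics concrete, here is a small worked example on ten toy labels. The function `metrics` is a hypothetical variant of the report above that returns the values instead of printing them; the zero-denominator guards are an assumption added here and are not part of the lab's function.

```python
import numpy as np

def metrics(classTrue, classPred):
    """Return accuracy, precision, recall and F-score for +1/-1 labels."""
    tp = np.count_nonzero((classTrue == 1) & (classPred == 1))
    fn = np.count_nonzero((classTrue == 1) & (classPred == -1))
    fp = np.count_nonzero((classTrue == -1) & (classPred == 1))
    tn = np.count_nonzero((classTrue == -1) & (classPred == -1))

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumption: report 0.0 when a denominator would be zero
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total > 0 else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "fscore": fscore}

# 4 positives and 6 negatives; the predictions give tp=3, fn=1, fp=2, tn=4
toyTrue = np.array([1, 1, 1, 1, -1, -1, -1, -1, -1, -1])
toyPred = np.array([1, 1, 1, -1, 1, 1, -1, -1, -1, -1])
result = metrics(toyTrue, toyPred)
print(result)
```

Here accuracy is 7/10, precision is 3/5, recall is 3/4, and the F-score is their harmonic mean, 2/3.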

In [21]:
classTrue = np.array([int(x) for x in testLabels], dtype=int)
classPred = np.array([int(logisticRegressionTest(b,W,X)[0]) for X in testData], dtype=int)
evaluationReport(classTrue, classPred)

Evaluation report
Accuracy: 0.72
Precision: 0.58
Recall: 0.96
F-score: 0.72


## Exercise 3

1. Apply Gaussian Normalisation to the training dataset

In [22]:
def GaussianNormalisation(dataset):
    #Compute the number of features
    numFeatures = len(dataset[0])
    
    featureMean = np.empty(numFeatures, float)
    featureStd = np.empty(numFeatures, float)
    
    #For every feature
    for i in range(numFeatures):
        #find its Mean and Std
        featureMean[i] = dataset[:,i].mean(axis=0)
        featureStd[i] = dataset[:,i].std(axis=0)
        #Apply Gaussian Normalisation
        dataset[:,i] = (dataset[:,i] - featureMean[i])/featureStd[i]

    return (featureMean, featureStd)

In [23]:
#Normalise the training dataset (modified in place)
(featureMean, featureStd) = GaussianNormalisation(trainingData)

2. Train Logistic Regression on the normalised training dataset. Use learningRate=0.1 and maxIter=10. Output the bias term and the weight vector of the trained model.

In [24]:
#Train Logistic Regression classifier on the normalised training data
(b,W) = logisticRegression(trainingData, trainingLabels, 0.1, 10)
print("Bias term: ", b, "\nWeight vector: ", W) 

Bias term:  -3.014954248709226 
Weight vector:  [-3.33368536e-01 -4.01986832e-01 -3.09385814e-02  1.07389500e+00
  7.12869732e-01  3.09112536e-01  3.92537393e-01  5.63239982e-01
  1.97221668e-01  5.75237149e-01 -1.04425635e-01 -5.27376751e-01
 -4.23379043e-01  1.45851777e-01  3.35020087e-02  1.41728519e+00
  8.36202497e-01  1.74503081e-01  5.18954936e-01 -3.25973707e-02
 -3.60703547e-03  2.18416437e+00  8.07646042e-01  1.18910995e+00
 -4.22275485e+00 -1.25417651e+00 -5.78280911e+00  2.34425724e-01
 -1.09135154e+00 -5.20967628e-01 -1.22232600e+00 -1.37664584e-02
 -4.48857893e-01 -6.31292325e-01 -1.40147458e+00  4.46475941e-01
 -7.01842697e-01  2.43574011e-01 -8.34645247e-02 -1.67400843e-01
 -3.19399377e+00 -1.85706739e+00 -2.94014522e-01 -9.50337895e-01
 -9.40782712e-01 -4.51683904e-01 -4.54988036e-01 -1.27990303e+00
  2.22783245e-01  1.58320828e-01 -2.23687215e-01  1.42230315e+00
  1.75162905e+00  1.27062282e+00  4.56505206e-01  1.08100633e+00
  1.10981980e+00]


3. Normalise the test dataset using Means and Standard Deviations of the features *computed on the training dataset*.

In [25]:
def normalise(dataset, featureMean, featureStd):
    #Compute the number of features
    numFeatures = len(dataset[0])
    
    #For every feature
    for i in range(numFeatures):
        #Apply Gaussian Normalisation with given Mean and Std values
        dataset[:,i] = (dataset[:,i] - featureMean[i])/featureStd[i]

In [26]:
#Normalise the test dataset using Means and Std computed on the training dataset
normalise(testData, featureMean, featureStd)
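
The two normalisation functions above loop over features and overwrite the dataset in place. The following sketch shows the same idea in vectorised form: statistics are fitted on the training data only and then applied to both datasets. The function names are hypothetical, and two choices are assumptions made here rather than the lab's behaviour: the functions return new arrays, and a feature with zero standard deviation is divided by 1 instead of 0.

```python
import numpy as np

def fit_standardiser(train):
    """Compute per-feature mean and std on the training data only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Assumption: avoid division by zero for constant features
    std = np.where(std == 0, 1.0, std)
    return mean, std

def apply_standardiser(data, mean, std):
    """Return a normalised copy; the input array is left unchanged."""
    return (data - mean) / std

train = np.array([[1.0, 10.0, 5.0],
                  [3.0, 30.0, 5.0]])   # third feature is constant
test = np.array([[2.0, 40.0, 5.0]])

mean, std = fit_standardiser(train)
train_n = apply_standardiser(train, mean, std)
test_n = apply_standardiser(test, mean, std)   # uses training statistics
print(train_n)
print(test_n)
```

Reusing the training statistics for the test set keeps both datasets on the same scale as the one the model was trained on.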

4. Use the model trained on the normalised training dataset to classify objects in the normalised test dataset. Output an evaluation report (accuracy, precision, recall, F-score).

In [27]:
#Predict class labels of test objects for the normalised test dataset
classPred = np.array([int(logisticRegressionTest(b,W,X)[0]) for X in testData], dtype=int)
evaluationReport(classTrue, classPred)

Evaluation report
Accuracy: 0.89
Precision: 0.81
Recall: 0.94
F-score: 0.87


5. Compare the quality of the classifier with normalisation and without normalisation.
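
For this comparison, the two evaluation reports can be placed side by side. The snippet below is only a convenience sketch: the numbers are copied from the reports printed earlier in this lab, and the dictionary names are made up for illustration.

```python
# Metrics copied from the two evaluation reports in this lab
without_norm = {"Accuracy": 0.72, "Precision": 0.58, "Recall": 0.96, "F-score": 0.72}
with_norm = {"Accuracy": 0.89, "Precision": 0.81, "Recall": 0.94, "F-score": 0.87}

# Change in each metric after normalisation
diffs = {name: with_norm[name] - without_norm[name] for name in without_norm}

for name in without_norm:
    print(f"{name:<10} {without_norm[name]:.2f} -> {with_norm[name]:.2f} ({diffs[name]:+.2f})")
```

On these reports, normalisation raises accuracy from 0.72 to 0.89 and precision from 0.58 to 0.81, while recall drops slightly from 0.96 to 0.94; the F-score rises from 0.72 to 0.87. The learned parameters are also much smaller in magnitude after normalisation (a bias of about -3 instead of about -359).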