
**First things first** - please go to 'File' and select 'Save a copy in Drive' so that you have your own version of this activity set up and ready to use.
Remember to update the portfolio index link to your own work once completed!

# Activity 3.1.5 Building a basic neural network

## Scenario
Hopkins et al. (1999) created the Spambase data set and donated it to the UCI Machine Learning Repository. The data set contains 4,601 emails, each marked as spam or non-spam by a postmaster or by individuals. Fifty-seven features (e.g. word frequencies and other email characteristics) support the classification. Spambase is used for developing and benchmarking spam detection models, and provides a basis for analysing how effectively different machine learning techniques distinguish spam from legitimate email.

As a data professional, you have been tasked by your company with developing a neural network in TensorFlow, based on the Spambase data set, that classifies emails as spam or non-spam.


## Objective
In this portfolio activity, you’ll create a simple neural network using TensorFlow to classify emails as spam or non-spam.

You will complete the activity in your Notebook, where you’ll:
- create a model with the Sequential API
- add layers as needed
- employ the model pipeline (compile, fit, and evaluate)
- present your insights based on the performance of the model.


In [58]:
# URL to import data set from GitHub.
url = 'https://raw.githubusercontent.com/fourthrevlxd/cam_dsb/main/spamdata.csv'

In [59]:
#import relevant libraries
import pandas as pd
import numpy as np
import keras

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense, Input


In [60]:
#create a dataframe of the data
df = pd.read_csv(url)
df.head(5)

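|   | 0 | 0.64 | 0.64.1 | 0.1 | 0.32 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | ... | 0.41 | 0.42 | 0.43 | 0.778 | 0.44 | 0.45 | 3.756 | 61 | 278 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 1 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.85 | 0.0 | 0.0 | 1.85 | 0.0 | 0.0 | ... | 0.0 | 0.223 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 15 | 54 | 1 |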
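
The column labels in the preview (`0`, `0.64`, `0.64.1`, ...) look like data values rather than feature names, which suggests the CSV may have no header row and that `read_csv` used the first email as column names. A minimal sketch of the difference, using a tiny made-up CSV (not the Spambase file) and assuming the real file is headerless:

```python
import io
import pandas as pd

# Tiny stand-in for a headerless CSV (made-up values, not the real data).
raw = "0.0,0.64,1\n0.21,0.28,1\n0.06,0.0,0\n"

# Default behaviour: the first line is consumed as column names.
with_header = pd.read_csv(io.StringIO(raw))

# header=None keeps every line as data and labels the columns 0, 1, 2, ...
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(with_header.shape)  # one row fewer than the file contains
print(no_header.shape)    # every line kept as a data row
```

If the real file is indeed headerless, passing `header=None` to the `read_csv` call in this notebook would keep all 4,601 emails as data rows.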


In [61]:
#features: all columns except the last
X = df.iloc[:, :-1]

#target variable: the last column
y = df.iloc[:, -1]

In [62]:
#split the data into train and test sets, using a test percentage of 20%
X_train_full, X_test, y_train_full, y_test = train_test_split(X,
                                                              y,
                                                              test_size = 0.2)

#create a validation data set from half of the remaining training data (test_size = 0.5)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full,
                                                      y_train_full,
                                                      test_size = 0.5
                                                      )
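
Two chained splits multiply their fractions, because the second `test_size` applies only to what is left after the first split. The sketch below works through that arithmetic for a 0.1 validation split and for the 0.5 value passed in the code above. The helper name `split_fractions` is hypothetical.

```python
def split_fractions(test_size, valid_size):
    """Return overall (train, valid, test) fractions for two chained splits."""
    remaining = 1.0 - test_size      # share left after the test split
    valid = remaining * valid_size   # validation share of the whole data set
    train = remaining - valid        # share left for training
    return train, valid, test_size

train, valid, test = split_fractions(test_size=0.2, valid_size=0.1)
print(f"train={train:.2f}, valid={valid:.2f}, test={test:.2f}")

# With valid_size=0.5, training and validation each get 40% of the data.
train_half, valid_half, _ = split_fractions(test_size=0.2, valid_size=0.5)
print(f"train={train_half:.2f}, valid={valid_half:.2f}")
```

Passing `random_state` to `train_test_split` would make either split reproducible between runs.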

In [63]:
#standardise the features
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)

#only transform the validation and test sets (no refitting), so the scaler learns nothing from them
X_test = scaler.transform(X_test)
X_valid = scaler.transform(X_valid)
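
For reference, `StandardScaler` standardises each column as z = (x - mean) / std, with the mean and standard deviation learned from the data passed to `fit`. A minimal NumPy sketch of the same idea on made-up numbers, reusing the training statistics for a held-out row:

```python
import numpy as np

# Made-up feature matrices (rows = emails, columns = features).
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
held_out = np.array([[2.0, 40.0]])

# Statistics come from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
held_out_scaled = (held_out - mean) / std  # no refitting on held-out data

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(held_out_scaled)
```

Held-out values need not end up with zero mean or unit variance, since they are measured against the training distribution.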

In [67]:
#define the sequential model
model = Sequential()

#input layer: 57 features
model.add(Input(shape=(57,)))
#first hidden layer: 64 neurons with ReLU activation
model.add(Dense(units=64, activation='relu'))

#second hidden layer: 32 neurons with ReLU activation
model.add(Dense(units=32,
                activation='relu'))

#create output layer, with 1 neuron and sigmoid activation function
model.add(Dense(units=1,
          activation='sigmoid'))

#show the model summary (layers, output shapes and parameter counts)
model.summary()
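
`model.summary()` lists the trainable parameters per layer. For a fully connected layer that count is inputs × units plus one bias per unit, so the totals for the architecture above can be checked by hand. The helper `dense_params` is a hypothetical name used only for this sketch:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [57, 64, 32, 1]  # input features, two hidden layers, output
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [3712, 2080, 33]
print(sum(counts))  # 5825
```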


In [65]:
#compile the model with binary cross-entropy loss, as this is a binary classification task
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

#train the model
model.fit(X_train, y_train,
          epochs = 10,
          batch_size = 64,
          validation_data = (X_valid,y_valid))

Epoch 1/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - accuracy: 0.6197 - loss: 0.6662 - val_accuracy: 0.8609 - val_loss: 0.4207
Epoch 2/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.8811 - loss: 0.3788 - val_accuracy: 0.9049 - val_loss: 0.2896
Epoch 3/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.9163 - loss: 0.2475 - val_accuracy: 0.9141 - val_loss: 0.2412
Epoch 4/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.9212 - loss: 0.2166 - val_accuracy: 0.9245 - val_loss: 0.2221
Epoch 5/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - accuracy: 0.9344 - loss: 0.1948 - val_accuracy: 0.9337 - val_loss: 0.2121
Epoch 6/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 1s 25ms/step - accuracy: 0.9362 - loss: 0.1707 - val_accuracy: 0.9321 - val_loss: 0.2061
Epoch 7/10
29/29 ━━━━━

<keras.src.callbacks.history.History at 0x7d5b5e825550>
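
The `loss` values in the training log are binary cross-entropy: the mean of -[y log(p) + (1 - y) log(1 - p)] over the samples. A minimal pure-Python sketch on made-up labels and probabilities follows. The function name and the `eps` clipping value are assumptions for illustration, not the Keras implementation:

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy for 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Made-up example: confident, correct predictions give a lower loss.
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
hesitant = [0.6, 0.4, 0.5, 0.5]

loss_confident = binary_cross_entropy(labels, confident)
loss_hesitant = binary_cross_entropy(labels, hesitant)
print(loss_confident, loss_hesitant)
```

A model that always predicts 0.5 scores ln(2) ≈ 0.693, which is a useful baseline when reading the first epoch's loss.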

In [66]:
#evaluate the model using test dataset
loss, accuracy = model.evaluate(X_test, y_test)

print("Test loss:",loss)

29/29 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.9274 - loss: 0.1951
Test loss: 0.19832609593868256
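
Accuracy alone can hide the balance between missed spam and wrongly flagged legitimate email. One way to look further, sketched here on made-up labels and probabilities, is to threshold the sigmoid outputs and derive precision and recall from the confusion counts. In the notebook the probabilities would come from `model.predict(X_test)`; the 0.5 threshold and the helper name `precision_recall` are assumptions.

```python
def precision_recall(y_true, y_prob, threshold=0.5):
    """Precision and recall for the positive (spam) class."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels (1 = spam) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.7, 0.3, 0.6, 0.2, 0.1]

precision, recall = precision_recall(y_true, y_prob)
print(precision, recall)
```

For a spam filter, low precision means legitimate email is being flagged, while low recall means spam is getting through. Raising the threshold usually trades recall for precision.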


# References

Hopkins, M., Reeber, E., Forman, G., Suermondt, J., 1999. Spambase. [online]. Available at: https://archive.ics.uci.edu/dataset/94. [Accessed 5 March 2024].