# Spam Detection Project
In the internet era we can easily search for information, communicate with one another, or even shop anywhere on earth with just a few keystrokes. Unfortunately, the same technology that makes our lives easier also introduces new ways of gathering information about people. One of the most common ways to do that these days is through 'spam': messages, emails, social media accounts, etc. Spam can look really convincing, as if it were a real product, message or email, so we click on it to find out more without realising how much we reveal about ourselves.

'Spam' is a huge problem on today's internet. That's why we decided to try to create a program that can scan through many messages, emails, etc. and detect which ones most likely fit the 'spam' profile.


## Spam data (MLP model)
In this cell we load all the necessary files and convert the messages into the numeric form the model expects.

In [1]:
import numpy as np
from toolbox.load_assets import load_spam
from toolbox.prepare_data import prepare
from toolbox.conversion import conversion
from toolbox.reshape import reshape
#------------------------------
file = "./data/spam.csv"
X = "./data/train.csv"
#------------------------------

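# Load the spam dictionary used to encode the messages.
spam_dic = load_spam(file)

# Read the training set: message text and the matching labels.
data = prepare(X,'spam_text','label')
text = data[0]
labels = data[1]

text_np = np.asarray(text)
labels_np = np.asarray(labels)

# Encode the messages with the dictionary; reshape() is called for its
# effect on converted_data, since its return value is not used.
converted_data = conversion(spam_dic, text_np)
reshape(converted_data)

converted_data_np = np.asarray(converted_data)

# Number of features per message, reused when preparing new data later.
NR_FEATURES = len(converted_data[0])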

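The `toolbox` helpers used above are not shown in this notebook. As a rough illustration of what a dictionary-based encoding could look like, here is a minimal sketch. The functions `encode_tokens` and `pad_or_truncate`, the toy `vocab`, and the use of 0 for unknown words are all assumptions made for illustration, not the project's actual implementation.

```python
import numpy as np


def encode_tokens(vocab, tokens):
    """Map each token to its integer id; unknown tokens become 0 (assumption)."""
    return [vocab.get(tok.lower(), 0) for tok in tokens]


def pad_or_truncate(rows, n_features):
    """Return a 2-D array in which every row has exactly n_features entries."""
    out = np.zeros((len(rows), n_features), dtype=np.int64)
    for r, row in enumerate(rows):
        clipped = row[:n_features]          # cut rows that are too long
        out[r, :len(clipped)] = clipped     # shorter rows stay zero-padded
    return out


# Hypothetical toy vocabulary and messages.
vocab = {"free": 1, "win": 2, "prize": 3, "meeting": 4}
messages = ["Win a FREE prize now", "meeting at noon"]

encoded = [encode_tokens(vocab, m.split(" ")) for m in messages]
features = pad_or_truncate(encoded, n_features=6)
print(features)
```

A fixed row length matters because the MLP expects every sample to have the same number of features, which is presumably why `NR_FEATURES` is stored for later use.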


## MLP spam detection
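Our first model is an 'MLP' classifier. To find the best value for its 'hidden_layer_sizes' parameter, we first run some tests: for each fold of a 5-fold split we train models with 10 to 25 hidden units, repeat each fit five times, and record the mean classification error.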

In [2]:
import numpy as np
from sklearn.neural_network import MLPClassifier as mlp
from sklearn.model_selection import KFold

training = KFold(n_splits=5)
result = []  # one list of (hidden_layer_sizes, mean error) pairs per fold

# enumerate() numbers the folds for the progress messages
for counter, (train_index, test_index) in enumerate(training.split(converted_data_np, labels_np), start=1):
    text_train, text_test = converted_data_np[train_index], converted_data_np[test_index]
    labels_train, labels_test = labels_np[train_index], labels_np[test_index]
    avg = []
    for i in range(10,26):
        new_learn = mlp(solver='adam',hidden_layer_sizes=i, max_iter=3000)
        temp = []
        # fit() restarts from fresh weights, so the repeats average out the random initialisation
        for _ in range(5):
            new_learn.fit(text_train, labels_train)
            temp.append(1 - new_learn.score(text_test, labels_test))
        avg.append((i, np.mean(temp)))
        print("Model trained with layer={}, KFold split {}".format(i,counter))
    result.append(avg)

print(result)

Model trained with layer=10, KFold split 1
Model trained with layer=11, KFold split 1
Model trained with layer=12, KFold split 1
Model trained with layer=13, KFold split 1
Model trained with layer=14, KFold split 1
Model trained with layer=15, KFold split 1
Model trained with layer=16, KFold split 1
Model trained with layer=17, KFold split 1
Model trained with layer=18, KFold split 1
Model trained with layer=19, KFold split 1
Model trained with layer=20, KFold split 1
Model trained with layer=21, KFold split 1
Model trained with layer=22, KFold split 1
Model trained with layer=23, KFold split 1
Model trained with layer=24, KFold split 1
Model trained with layer=25, KFold split 1
Model trained with layer=10, KFold split 2
Model trained with layer=11, KFold split 2
Model trained with layer=12, KFold split 2
Model trained with layer=13, KFold split 2
Model trained with layer=14, KFold split 2
Model trained with layer=15, KFold split 2
Model trained with layer=16, KFold split 2
Model train
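
The manual search above can also be expressed with scikit-learn's `GridSearchCV`, which runs the cross-validation loop internally. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for `converted_data_np` and `labels_np`, a shortened grid of three sizes, and a low `max_iter` so that it runs quickly. Unlike the loop above, it fits each candidate once per fold instead of five times.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for converted_data_np / labels_np (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Shortened grid for the demo; the notebook searches sizes 10 to 25.
param_grid = {"hidden_layer_sizes": [(10,), (15,), (20,)]}

search = GridSearchCV(
    MLPClassifier(solver="adam", max_iter=300, random_state=0),
    param_grid,
    cv=KFold(n_splits=5),
    scoring="accuracy",
)

# The low max_iter may stop training early; silence that warning for the demo.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    search.fit(X_demo, y_demo)

best_size = search.best_params_["hidden_layer_sizes"]
cv_error = 1 - search.best_score_  # mean error across the 5 folds
print("best hidden_layer_sizes:", best_size)
print("mean cross-validated error: {:.3f}".format(cv_error))
```

`GridSearchCV` picks the candidate with the best mean score across folds and, by default, refits it on all the data, so `search.best_estimator_` is ready to use afterwards.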

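Next we have to find the optimal value for the 'hidden_layer_sizes' parameter. For each fold we take the size with the smallest classification error, and then pick the size that wins most often across the folds.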

In [3]:
from statistics import mode
final_results = []
answer = []

# here we return all 'layers' with the smallest classification errors
for b in result:
    temp = []
    for n in range(len(b)):
        temp.append(b[n][1])
    index_val = temp.index(np.min(temp))
    final_results.append(b[index_val])

# here we take the 'layer' parameter from the pair = i, (i,np.min(temp))
for z in final_results:
    answer.append(z[0])

# here we calculate the most frequent 'layer', that has the lowest classification error
int_answer = mode(answer)
print("The best number for 'hidden_layers_sizes' parameter is: {} (int_answer)".format(int_answer))

good_model = mlp(solver='adam',hidden_layer_sizes=int_answer, max_iter=3000)
good_model.fit(converted_data_np,labels_np)

print("-------------------------------------------------------------------", end="\n")
print("The accuracy of the model with 'hidden_layer_sizes'={} is {}".format(int_answer,good_model.score(converted_data_np,labels)))



The best number for 'hidden_layers_sizes' parameter is: 13 (int_answer)
-------------------------------------------------------------------
The accuracy of the model with 'hidden_layer_sizes'=13 is 0.9226767133118048
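
Taking the mode of the per-fold winners can be unstable with only five folds: if every fold prefers a different size, `statistics.mode` returns the first one it sees (Python 3.8+) or raises `StatisticsError` (older versions). One alternative, sketched here as an assumption rather than as the method used in this project, is to average each size's error across the folds and pick the smallest mean. The `result` list below is made-up data with the same shape as the one built earlier.

```python
import numpy as np

# Hypothetical stand-in for `result`: one list of (size, error) pairs per fold.
result = [
    [(10, 0.10), (11, 0.08), (12, 0.09)],
    [(10, 0.07), (11, 0.08), (12, 0.12)],
    [(10, 0.11), (11, 0.07), (12, 0.08)],
]

# Every fold lists the same sizes in the same order.
sizes = [size for size, _ in result[0]]

# Rows are folds, columns are candidate sizes.
errors = np.array([[err for _, err in fold] for fold in result])

mean_errors = errors.mean(axis=0)
best_size = sizes[int(np.argmin(mean_errors))]

for size, err in zip(sizes, mean_errors):
    print("size={} mean error={:.4f}".format(size, err))
print("best size by mean error:", best_size)
```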


## Checking the model
To check the model, we compute its 'score' on a new data set composed only of 'spam' messages.

In [8]:
from pandas import read_csv
from toolbox.test import reshape_for_model
load = read_csv("./data/test_spam_model.csv")
load_arr = []
converted = []
#--------------------------------------------------
test_label = []
#--------------------------------------------------data loading

for index, row in load.iterrows():
    load_arr.append(row['Message'])
    if row['Category'] == "spam":
        test_label.append(1)
    else:
        test_label.append(0)

for i in range(len(load_arr)):
    converted.append(load_arr[i].split(" "))

#--------------------------------------------------data preparation

# the result should be really close to the 100% mark; it might not be exactly 100% because 'spam_dic' might not contain all the words in the 'test_spam_model.csv' file
test_spam = conversion(spam_dic, converted)
reshape_for_model(test_spam, NR_FEATURES)
#--------------------------------------------------scoring of the model accuracy [new data set]

print("The accuracy of the MLP model on completely new data set = {}".format(good_model.score(np.asarray(test_spam), test_label)))



The accuracy of the MLP model on completely new data set = 0.9215721464465183
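
A single accuracy figure does not show which kind of mistake the model makes. On a test set that is mostly spam, accuracy mainly reflects how many spam messages were caught, and says little about legitimate messages wrongly flagged. A minimal sketch of a more detailed check with `sklearn.metrics` follows. The `y_true` and `y_pred` lists are made-up values standing in for `test_label` and the output of `good_model.predict(...)`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# With labels=[0, 1] the flattened matrix is ordered tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = precision_score(y_true, y_pred)  # share of flagged messages that are spam
recall = recall_score(y_true, y_pred)        # share of spam messages that were flagged

print("tn={} fp={} fn={} tp={}".format(tn, fp, fn, tp))
print("precision={:.2f} recall={:.2f}".format(precision, recall))
```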

