Developing an efficient Python based spam classifier using Multi-Layer Perceptron


harshsingh-24/spam-classifier


Spam Classifier

Animated GIF of spam classification

I have developed a spam classifier in Python that labels given emails as spam or ham using a Multilayer Perceptron (MLP).

🌟 Overview

I used the Apache SpamAssassin public corpus to train and test a classification model based on a Multilayer Perceptron, chosen for its high efficacy in terms of precision and recall. To run this project, you only need the dependencies listed below. No extra files are needed, as the Jupyter notebook downloads all the required files.

-----------------------------------------------------

💾 Project Files Description

This project includes one executable file and three result files. The description is as follows:

Executable Files:

  • spam-classifier-optimized.ipynb - A Jupyter Notebook consisting of all the functions required for training, testing, and classification of the emails.

Result Files:

  • evaluation.txt - Contains evaluation results table as well as Confusion Matrix of Spam and Ham classes.
  • spam_classifier_best.sav - Contains the weights of the most optimized model.
  • confusion_matrix.png - Confusion Matrix of the final result.

-----------------------------------------------------

📚 Multilayer Perceptron

The Perceptron is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron called a threshold logic unit (TLU), or sometimes a linear threshold unit (LTU). The inputs and output are numbers (instead of binary on/off values), and each input connection is associated with a weight. The TLU computes a weighted sum of its inputs (z = w₁x₁ + ⋯ + wₙxₙ = xᵀw), then applies a step function to that sum and outputs the result: h_w(x) = step(z), where z = xᵀw.

A single Perceptron
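As an illustrative sketch (not code from the notebook), a TLU can be written in a few lines of NumPy; the weight and input values below are made up for demonstration:

```python
import numpy as np

def tlu(x, w):
    """Threshold logic unit: weighted sum of inputs followed by a step function."""
    z = x @ w                      # z = x^T w
    return np.where(z >= 0, 1, 0)  # step(z): outputs 1 if z >= 0, else 0

# Made-up weights: the unit fires only when the weighted sum is non-negative
w = np.array([0.5, -1.0, 0.25])
print(tlu(np.array([1.0, 0.0, 2.0]), w))  # weighted sum = 1.0  -> 1
print(tlu(np.array([0.0, 1.0, 0.0]), w))  # weighted sum = -1.0 -> 0
```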

An MLP is composed of one (passthrough) input layer, one or more layers of TLUs, called hidden layers, and one final layer of TLUs called the output layer. The layers close to the input layer are usually called the lower layers, and the ones close to the outputs are usually called the upper layers. Every layer except the output layer includes a bias neuron and is fully connected to the next layer.

Multilayer Perceptron
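A minimal forward pass through such a network might look like the following sketch; the layer sizes and random weights are arbitrary, chosen only to illustrate the fully connected structure described above:

```python
import numpy as np

def step(z):
    """Step activation applied elementwise."""
    return (z >= 0).astype(int)

def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """One hidden layer of TLUs followed by an output layer of TLUs.
    Each layer computes step(x^T W + b); the bias vectors play the role
    of the bias neuron in every non-output layer."""
    h = step(x @ hidden_w + hidden_b)  # hidden layer activations
    return step(h @ out_w + out_b)     # output layer

rng = np.random.default_rng(0)
x = rng.normal(size=3)              # 3 input features
hidden_w = rng.normal(size=(3, 4))  # fully connected: 3 inputs -> 4 hidden TLUs
out_w = rng.normal(size=(4, 2))     # fully connected: 4 hidden -> 2 outputs
y = mlp_forward(x, hidden_w, np.zeros(4), out_w, np.zeros(2))
print(y.shape)  # (2,)
```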

-----------------------------------------------------

📋 Stages in development

Every stage described here is implemented in the attached Jupyter Notebook.
  1. Download the dataset.
  2. Prepare the data
    • Remove all the email headers (sender details, receiver details, subject, and date)
    • Convert the whole email to lowercase
    • Replace every URL in the email with the word 'URL'
    • Replace every number in the email with the word 'NUM'
    • Remove all punctuation from the email
  3. Split the data into two sets: train and test.
  4. Convert the resulting text into a bag-of-words representation (a vector of counts of all the words that appear in the training instance)
  5. Train and evaluate the MLP model on recall, precision, and ROC
  6. Fine-tune the MLP classifier
  7. Evaluate it on the test set
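
The stages above can be sketched with scikit-learn roughly as follows. The toy emails, regexes, and hyperparameters here are placeholders for illustration, not the notebook's actual configuration:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import precision_score, recall_score

def prepare(text):
    """Preparation steps: lowercase, mask URLs and numbers, strip punctuation
    (headers are assumed to have been removed already)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " URL ", text)
    text = re.sub(r"\d+", " NUM ", text)
    return re.sub(r"[^\w\s]", " ", text)

# Toy corpus standing in for the SpamAssassin emails
emails = ["win money now at http://spam.example", "meeting at 10 tomorrow",
          "free prize claim 1000 dollars", "lunch plans this week?"] * 10
labels = [1, 0, 1, 0] * 10  # 1 = spam, 0 = ham

X_train, X_test, y_train, y_test = train_test_split(
    [prepare(e) for e in emails], labels, test_size=0.25, random_state=42)

vec = CountVectorizer()  # bag-of-words: vector of word counts
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=42)
clf.fit(vec.fit_transform(X_train), y_train)

pred = clf.predict(vec.transform(X_test))
print("precision:", precision_score(y_test, pred))
print("recall:", recall_score(y_test, pred))
```

Fine-tuning (stage 6) would typically search over hyperparameters such as the hidden layer sizes before the final evaluation on the held-out test set.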

-----------------------------------------------------

⚡ Technologies Used

Python · Pandas · NumPy · Jupyter · Scikit-Learn · Matplotlib

📋 Dependencies

  • NumPy v1.16.2
  • Scikit-Learn v0.20.3
  • Matplotlib v3.0.2
  • Joblib v0.13.2

-----------------------------------------------------

📜 Credits

Harsh Singh Jadon

