jayesh15111988/PythonSpamFiltering

PythonSpamFiltering

Welcome to the README for a data mining project that filters spam messages out of a collection of regular messages. The models were built from a training dataset of approximately 5600 messages collected from anonymous message exchange data. This data was provided courtesy of the UCI Machine Learning Repository.

This dataset was used to train our Naive Bayes and K-Means clustering models, which were later used to classify unknown messages received as part of what we call 'Production Data'.

To verify the correctness of the models, we ran them against the same dataset they were built from, which gives a concrete measure of how useful each model is. The accuracy of each run was:

  1. Naive Bayes Algorithm - 99%
  2. K-Means clustering algorithm - 82%

The lower accuracy of K-Means may be explained by the very large number of attributes in each vector generated from the training data, combined with the availability of only two class labels, viz. Spam and Regular Messages. This created some confusion when classifying an unknown message into the correct category.

Below are brief details about the training process, verification, and classification of an unknown corpus of SMS messages.

Training Process: We used the file named 'SMSSpamCollection', downloaded from the UCI Repository mentioned above. The following function was used to populate and create a collection of vectors, one for each input message:

getAndFilterMessagesInDataStructureWithFileName(Constants.TRAINING_DATA_FILE,frequencyOfWordsInRegularMessages,frequencyOfWordsInSpamMessages,collectionOfVectorsOfAllMessages,dynamicAttributrMappingDictionary);

Where,

  1. Constants.TRAINING_DATA_FILE - Name of the training file. We assume that each training data file has a fixed format: class_label (spam or ham) followed by full_message_value

  2. frequencyOfWordsInRegularMessages - A dictionary mapping each word from the regular message collection to its overall frequency of occurrence in the input corpus

  3. frequencyOfWordsInSpamMessages - Same as 'frequencyOfWordsInRegularMessages', but for the spam message collection only

  4. collectionOfVectorsOfAllMessages - A vector computed for each message. It contains attributes such as the length of the longest word, the number of capital letters, and the category (spam or regular) the message was labeled with

  5. dynamicAttributrMappingDictionary - A dictionary of dynamic attributes, mapping each to the position of its value in the 'collectionOfVectorsOfAllMessages' vector
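
To make the data structures above more concrete, here is a minimal sketch of how such a training file could be parsed. It is not the repository's actual implementation: the function name `parse_training_lines`, the tab separator between label and message, and the dictionary layout of each vector are assumptions made for illustration.

```python
from collections import Counter


def parse_training_lines(lines):
    """Build word-frequency tables and one small feature vector per message.

    Assumes each line looks like 'label<TAB>message' with label 'spam' or 'ham'.
    """
    ham_counts, spam_counts = Counter(), Counter()
    vectors = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, _, message = line.partition("\t")  # assumed separator
        words = message.lower().split()
        # Update the frequency table that matches the message's class.
        target = spam_counts if label == "spam" else ham_counts
        target.update(words)
        vectors.append({
            "longest_word": max((len(w) for w in words), default=0),
            "capitals": sum(ch.isupper() for ch in message),
            "label": label,
        })
    return ham_counts, spam_counts, vectors
```

The two counters play the role of the regular and spam frequency dictionaries, and the list of dictionaries stands in for the per-message vectors.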

We pass the output of this function as input to the next two functions, as follows:

Note: The first parameter in both functions is the production data file, which is a collection of messages of unknown category. The algorithm classifies them and prints the resulting category to standard output.

Naive Bayes:

runNaiveBayesOnDataFromFileWithName(Constants.PRODUCTION_DATA_FILE,collectionOfVectorsOfAllMessages);

This function first creates the model from the training data and then classifies each input message into its category.
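
As an illustration of the train-then-classify flow, the sketch below implements a small multinomial Naive Bayes classifier over word counts with add-one (Laplace) smoothing. The names `train_naive_bayes` and `classify` are hypothetical, and the project's own function works on its message vectors, so treat this as a sketch of the general technique rather than the code in this repository.

```python
import math
from collections import Counter


def train_naive_bayes(labeled_messages):
    """labeled_messages: iterable of (label, text), label in {'spam', 'ham'}.

    Assumes at least one message of each class is present.
    """
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for label, text in labeled_messages:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log posterior for the given text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # Log prior of the class.
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the product.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working in log space avoids numeric underflow when a message contains many words.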

K-Means:

runKMeansClusteringOnDataFromFileWithName(Constants.PRODUCTION_DATA_FILE,collectionOfVectorsOfAllMessages);

This function creates a collection of centroids from the training data and then classifies each input message into its category based on its distance to the two centroids corresponding to the spam and regular message categories.
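
The distance-based step can be sketched as follows. This simplified version computes one centroid per labeled class and assigns a message to the nearest one by Euclidean distance; it skips the iterative reassignment of full K-Means, and the helper names and the two-number feature vectors are assumptions for illustration only.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_by_centroid(vector, centroids):
    """centroids: dict mapping label -> centroid. Returns the nearest label."""
    return min(centroids, key=lambda label: euclidean(vector, centroids[label]))
```

With vectors that have a very large number of attributes, these distance computations dominate the run time, which is consistent with the slowness noted for the K-Means step.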

Verification of Models:
Having created models for classifying unknown data, we need some way to test their correctness. This is done by feeding the same training file to the algorithm and computing accuracy from the number of correctly classified messages. This is straightforward because we already know the class of each input message.

For the two functions mentioned above, for Naive Bayes and K-Means clustering, whether verification or classification is performed depends on the type of input production file:

Classifications:
When performing classification on unknown data, the input file is just a collection of messages, one message per line. The algorithm automatically detects that the file is meant for classification and outputs the class of each message accordingly.
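
The README does not say how the input format is detected. One plausible approach, shown here purely as an assumption, is to check whether every non-empty line starts with a known class label followed by a tab. The names `split_label` and `is_labeled` are hypothetical.

```python
LABELS = {"spam", "ham"}


def split_label(line):
    """Return (label, message); label is None when the line has no class prefix."""
    text = line.rstrip("\n")
    head, sep, rest = text.partition("\t")  # assumed tab separator
    if sep and head in LABELS:
        return head, rest
    return None, text


def is_labeled(lines):
    """True when the file has content and every non-empty line carries a label."""
    non_empty = [line for line in lines if line.strip()]
    return bool(non_empty) and all(
        split_label(line)[0] is not None for line in non_empty
    )
```

A labeled file would then go through verification, while an unlabeled one would be classified directly.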

Verifications:
To perform verification, give the same file that was used for training as input to our algorithm. It has the same format as the training data: the class followed by the actual message, both on the same line. The algorithm computes the number of messages correctly classified by each method and outputs the accuracy based on that count.
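
The accuracy figure described here is simply the fraction of messages whose predicted class matches the known class. A minimal sketch, with the hypothetical helper name `accuracy`:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)
```

For example, three correct predictions out of four messages give an accuracy of 0.75. Note that accuracy measured on the same file used for training tends to be optimistic compared with accuracy on unseen messages.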

Note:

  • To verify the algorithm on a smaller dataset (running on the single monolithic file takes a very long time), we created a smaller version of the training data named 'sample', which can be used as a stand-in in some cases.
  • We create an intermediate file for training purposes. It is generated only on the first run, and the same version is reused on subsequent runs.
  • The expiration timestamp for such files is set to a maximum of 2 months, after which the files are overwritten.
  • Please note that it takes ~35 minutes to run the complete program, because creating the vector models for K-Means is very slow and involves vectors of considerable length. This is one candidate for future improvement of the algorithm.
  • For reference, a sample file (productionMessageData) containing production messages is included. You can replace it with another file containing a collection of similar messages.
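
The intermediate-file behaviour in the notes above can be sketched as a small cache with an age limit. This is an illustration under stated assumptions, not the project's code: the function name `load_or_build`, the use of `pickle`, and reading "2 months" as 60 days are all hypothetical choices.

```python
import os
import pickle
import time

# Assumption: "2 months" is interpreted as 60 days.
MAX_AGE_SECONDS = 60 * 24 * 60 * 60


def load_or_build(cache_path, build, max_age=MAX_AGE_SECONDS, now=None):
    """Reuse a cached training result unless it is missing or too old.

    build is a zero-argument callable that recomputes the training data.
    now can be supplied to make the age check testable.
    """
    now = time.time() if now is None else now
    if os.path.exists(cache_path):
        age = now - os.path.getmtime(cache_path)
        if age <= max_age:
            with open(cache_path, "rb") as handle:
                return pickle.load(handle)
    # Cache is missing or expired: rebuild and overwrite it.
    data = build()
    with open(cache_path, "wb") as handle:
        pickle.dump(data, handle)
    return data
```

Caching the expensive vector construction this way means only the first run, or a run after expiry, pays the full training cost.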

About

This is a project for spam filtering based on Bayesian networks and K-means clustering, separating regular messages from spam.
