# Spam Detector

In [1]:
import os                                                    # for filesystem access
import fnmatch                                               # for Unix filename pattern matching
from sklearn.feature_extraction.text import CountVectorizer  # for bag-of-words token counts

In [2]:
LINGSPAM_BARE_DATASET_PATH = "datasets/lingspam_public/bare"

In [3]:
documents = []
labels = []
count_vectorizer = CountVectorizer()

In [4]:
def is_spam_file_name(file_name):
    return fnmatch.fnmatchcase(file_name, 'spmsg*')

## Reading and Preprocessing Data

### Read all the emails in the ten folders & save the label of each email (1 for spam, 0 for not spam) to a list

In [5]:
for root, dirs, file_names in os.walk(LINGSPAM_BARE_DATASET_PATH):
    dirs.sort()  # visit the folders in a fixed order so the split below is reproducible
    for file_name in sorted(fnmatch.filter(file_names, '*.txt')):
        with open(os.path.join(root, file_name), 'r') as file:
            documents.append(file.read())
            labels.append(1 if is_spam_file_name(file_name) else 0)

In [6]:
documents_length = len(documents)

if documents_length > 0:
    print("✅ Read %i documents" % documents_length)
else:
    print("❌ Could not read any documents")

✅ Read 2893 documents


### Split the emails & labels into 80% training & 20% testing

In [7]:
training_documents_count = round(documents_length * 0.8)

training_documents = documents[:training_documents_count]
training_labels = labels[:training_documents_count]

testing_documents = documents[training_documents_count:]
testing_labels = labels[training_documents_count:]
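
The slices above follow the order in which the files were read, so the testing set is simply whatever was read last. As a hedged alternative, the sketch below uses scikit-learn's `train_test_split` to shuffle the emails and keep the spam ratio the same in both parts. The `toy_*` data and the `shuffled_*` names are hypothetical stand-ins for the notebook's `documents` and `labels`.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the notebook's `documents` and `labels` (hypothetical data).
toy_documents = ["email %d" % i for i in range(10)]
toy_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Shuffle before splitting; `stratify` keeps the spam/not-spam ratio equal in both parts.
(shuffled_training_documents, shuffled_testing_documents,
 shuffled_training_labels, shuffled_testing_labels) = train_test_split(
    toy_documents, toy_labels, test_size=0.2, random_state=0, stratify=toy_labels
)

print(len(shuffled_training_documents), len(shuffled_testing_documents))
```

Whether shuffling is wanted depends on the assignment; the plain slicing above is also a valid 80/20 split.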

### Fit and transform the training emails & transform the testing emails using a CountVectorizer

In [8]:
# Learn the vocabulary from the training emails only, then reuse it for the testing emails
training_document_term_matrix = count_vectorizer.fit_transform(training_documents)
testing_document_term_matrix = count_vectorizer.transform(testing_documents)

## Scikit-Learn Classifiers
##### For each classifier, print the precision, recall and f-score on the testing data

### Multinomial Naive Bayes
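
A minimal sketch of this step, assuming a tiny hand-made corpus in place of the Ling-Spam data (all `toy_*` names are hypothetical). In the notebook, `fit` and `predict` would take `training_document_term_matrix`, `training_labels` and `testing_document_term_matrix` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in data (hypothetical); 1 = spam, 0 = not spam.
toy_training_documents = [
    "win free money now", "free prize claim now", "cheap money offer",
    "linguistics conference call for papers", "syntax workshop schedule",
    "paper on phonology",
]
toy_training_labels = [1, 1, 1, 0, 0, 0]
toy_testing_documents = ["claim your free money", "call for papers on syntax"]
toy_testing_labels = [1, 0]

# Build the document-term matrices the same way as in the cells above.
toy_vectorizer = CountVectorizer()
toy_training_matrix = toy_vectorizer.fit_transform(toy_training_documents)
toy_testing_matrix = toy_vectorizer.transform(toy_testing_documents)

# Train on word counts, then predict the labels of the testing emails.
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(toy_training_matrix, toy_training_labels)
naive_bayes_predictions = naive_bayes_classifier.predict(toy_testing_matrix)

# average="binary" reports the scores for the positive class (spam = 1).
nb_precision, nb_recall, nb_f_score, _ = precision_recall_fscore_support(
    toy_testing_labels, naive_bayes_predictions, average="binary", zero_division=0
)
print("Precision: %.3f, Recall: %.3f, F-score: %.3f"
      % (nb_precision, nb_recall, nb_f_score))
```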

### K Neighbors Classifier
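
One possible sketch for this classifier, again on a hypothetical toy corpus rather than the Ling-Spam matrices. The value `n_neighbors=3` is an assumption chosen to suit the tiny example; the assignment does not prescribe it.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in data (hypothetical); 1 = spam, 0 = not spam.
toy_training_documents = [
    "win free money now", "free prize claim now", "cheap money offer",
    "linguistics conference call for papers", "syntax workshop schedule",
    "paper on phonology",
]
toy_training_labels = [1, 1, 1, 0, 0, 0]
toy_testing_documents = ["claim your free money", "call for papers on syntax"]
toy_testing_labels = [1, 0]

toy_vectorizer = CountVectorizer()
toy_training_matrix = toy_vectorizer.fit_transform(toy_training_documents)
toy_testing_matrix = toy_vectorizer.transform(toy_testing_documents)

# Each testing email takes the majority label of its 3 closest training emails.
k_neighbors_classifier = KNeighborsClassifier(n_neighbors=3)
k_neighbors_classifier.fit(toy_training_matrix, toy_training_labels)
k_neighbors_predictions = k_neighbors_classifier.predict(toy_testing_matrix)

knn_precision, knn_recall, knn_f_score, _ = precision_recall_fscore_support(
    toy_testing_labels, k_neighbors_predictions, average="binary", zero_division=0
)
print("Precision: %.3f, Recall: %.3f, F-score: %.3f"
      % (knn_precision, knn_recall, knn_f_score))
```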

### Random Forest Classifier
##### You can set random_state=0 to make the results reproducible
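
A sketch of this step under the same assumptions as a toy run: the `toy_*` corpus is hypothetical, and in the notebook the matrices and labels built earlier would be used instead. Only `random_state=0` comes from the hint above; every other parameter is left at its default.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_recall_fscore_support

# Toy stand-in data (hypothetical); 1 = spam, 0 = not spam.
toy_training_documents = [
    "win free money now", "free prize claim now", "cheap money offer",
    "linguistics conference call for papers", "syntax workshop schedule",
    "paper on phonology",
]
toy_training_labels = [1, 1, 1, 0, 0, 0]
toy_testing_documents = ["claim your free money", "call for papers on syntax"]
toy_testing_labels = [1, 0]

toy_vectorizer = CountVectorizer()
toy_training_matrix = toy_vectorizer.fit_transform(toy_training_documents)
toy_testing_matrix = toy_vectorizer.transform(toy_testing_documents)

# Fixing random_state makes the randomly built trees identical on every run.
random_forest_classifier = RandomForestClassifier(random_state=0)
random_forest_classifier.fit(toy_training_matrix, toy_training_labels)
random_forest_predictions = random_forest_classifier.predict(toy_testing_matrix)

rf_precision, rf_recall, rf_f_score, _ = precision_recall_fscore_support(
    toy_testing_labels, random_forest_predictions, average="binary", zero_division=0
)
print("Precision: %.3f, Recall: %.3f, F-score: %.3f"
      % (rf_precision, rf_recall, rf_f_score))
```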

## Classifying using Readability Features

Rather than feeding the whole text of an email to the classifier, extract a few characteristic features per email and feed those instead. The features are:

    a) F1: The number of sentences in an email.
    b) F2: The number of verbs in an email.
    c) F3: The number of words containing both numeric and alphabetical characters.
    d) F4: The number of words in an email that are found in the spam list.
    e) F5: The number of words in an email that have more than 3 syllables.
    f) F6: The average number of syllables of words in an email.
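
F1 and F3 need no external library, so they can be sketched with the standard library alone. The function names and the sample email below are hypothetical, and the sentence counter is a rough heuristic (splitting on `.`, `!` and `?`) rather than a proper sentence tokenizer.

```python
import re


def count_sentences(text):
    # F1 (rough heuristic): count non-empty pieces between runs of ., ! or ?
    pieces = re.split(r"[.!?]+", text)
    return sum(1 for piece in pieces if piece.strip())


def count_alphanumeric_words(text):
    # F3: whitespace-separated words with at least one letter and one digit
    return sum(
        1 for word in text.split()
        if any(c.isalpha() for c in word) and any(c.isdigit() for c in word)
    )


sample_email = "Win $1000 now! Call 555 today. Use code SAVE20 or b4u."
sentence_count = count_sentences(sample_email)
alphanumeric_word_count = count_alphanumeric_words(sample_email)
print(sentence_count, alphanumeric_word_count)
```

Abbreviations such as "e.g." would be over-counted by this heuristic; a dedicated sentence tokenizer is the safer choice for real emails.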
    
For F2, you can find useful code in the Lab Assignment 5 solution on the MET website. For F4, count how many words in a given email appear in a spam word-list. The word-list you will be using can be found here. For F5 and F6, you can use the library Pyphen (with lang='en_GB').
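
The sketch below shows one way F4, F5 and F6 could be computed. It approximates the syllable count of a word as its number of hyphenation points plus one, which is an assumption and not an exact linguistic count. To keep the example runnable without extra installs, the hyphenator is passed in as a function and replaced by a hand-written stub; all names and the toy word-list are hypothetical.

```python
import re


def tokenize(text):
    # Lower-case alphabetic words only, so tokens never contain hyphens or digits.
    return re.findall(r"[a-z]+", text.lower())


def count_spam_words(text, spam_words):
    # F4: how many words of the email appear in the spam word-list
    spam_word_set = {word.lower() for word in spam_words}
    return sum(1 for word in tokenize(text) if word in spam_word_set)


def syllable_features(text, hyphenate):
    # `hyphenate` maps a word to its hyphenated form, e.g. "mon-ey".
    # Assumption: with Pyphen installed, pyphen.Pyphen(lang='en_GB').inserted fits here.
    syllable_counts = [hyphenate(word).count("-") + 1 for word in tokenize(text)]
    long_word_count = sum(1 for count in syllable_counts if count > 3)       # F5
    average_syllables = (sum(syllable_counts) / len(syllable_counts)
                         if syllable_counts else 0.0)                        # F6
    return long_word_count, average_syllables


# Hand-written stub standing in for a real hyphenator (hypothetical data).
toy_hyphenations = {"congratulations": "con-grat-u-la-tions", "money": "mon-ey"}


def toy_hyphenate(word):
    return toy_hyphenations.get(word, word)


sample_email = "Congratulations! Free money."
toy_spam_words = ["free", "money"]

spam_word_count = count_spam_words(sample_email, toy_spam_words)
long_word_count, average_syllables = syllable_features(sample_email, toy_hyphenate)
print(spam_word_count, long_word_count, average_syllables)
```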

The steps are:

    a) Create a list for every feature, where every element is the feature value of a given email (or use a
    dictionary, key is feature name, value is feature list).
    b) Build a feature matrix (list of lists), where every row corresponds to an email, and every column
    corresponds to a feature value of this email.
    c) Feed the feature matrix and the labels to any of the sklearn classifiers.
    
On the MET website, you will find a file titled “feature-construction”. This is an example of building a
feature matrix (steps “a” and “b”). Note that this is just a sample; the documents and the features to be
extracted will be different in the project.
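
Steps “a” and “b” can be sketched as follows, using the dictionary variant. The feature values are made up for illustration, and only three of the six features are shown to keep the example short.

```python
# Step a: one list per feature, one element per email (hypothetical values).
feature_lists = {
    "F1": [3, 1, 2],
    "F3": [2, 0, 1],
    "F4": [4, 0, 0],
}

# Step b: transpose the per-feature lists into one row per email.
feature_names = list(feature_lists)
feature_matrix = [
    list(row) for row in zip(*(feature_lists[name] for name in feature_names))
]

for row in feature_matrix:
    print(row)
```

Row `i` of `feature_matrix` then holds the values of email `i` in the column order given by `feature_names`.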

##### For the classifier, print the precision, recall and f-score on the testing data.
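
A minimal sketch of step “c”, assuming a small made-up feature matrix with the six features as columns. scikit-learn classifiers accept a list of lists directly, so no conversion is needed. A random forest is used here because the features live on different scales and trees are insensitive to that; this is a design choice for the sketch, not a requirement of the assignment.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical rows: [F1, F2, F3, F4, F5, F6] per email; 1 = spam, 0 = not spam.
toy_feature_matrix = [
    [3, 4, 5, 9, 1, 1.4], [2, 3, 4, 7, 0, 1.3], [4, 5, 6, 8, 1, 1.5],
    [2, 2, 3, 6, 0, 1.2], [9, 20, 0, 1, 6, 2.1], [12, 25, 1, 0, 8, 2.3],
    [8, 18, 0, 0, 5, 2.0], [10, 22, 0, 1, 7, 2.2], [3, 3, 5, 8, 1, 1.4],
    [11, 24, 0, 0, 7, 2.2],
]
toy_feature_labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]

# Same 80/20 slicing as used for the emails earlier in the notebook.
toy_training_count = round(len(toy_feature_matrix) * 0.8)
toy_training_features = toy_feature_matrix[:toy_training_count]
toy_training_feature_labels = toy_feature_labels[:toy_training_count]
toy_testing_features = toy_feature_matrix[toy_training_count:]
toy_testing_feature_labels = toy_feature_labels[toy_training_count:]

feature_classifier = RandomForestClassifier(random_state=0)
feature_classifier.fit(toy_training_features, toy_training_feature_labels)
feature_predictions = feature_classifier.predict(toy_testing_features)

feat_precision, feat_recall, feat_f_score, _ = precision_recall_fscore_support(
    toy_testing_feature_labels, feature_predictions,
    average="binary", zero_division=0
)
print("Precision: %.3f, Recall: %.3f, F-score: %.3f"
      % (feat_precision, feat_recall, feat_f_score))
```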