Code for a CAB420 final project which predicts email authorship within the Enron corpus.
classifier_output
preprocessors
Makefile
README.md
bayers.py
load_set.py
svm.py
svm_test.py


This is most of a final project for CAB420 at QUT; a third classifier, which I didn't work on, is not included. The project contains programs that do the following with the Enron email corpus (supplied as a single CSV file):

  • Split up the dataset by author. (C)
  • Limit the authors to the original set of executives. (Ruby)
  • Select training and testing subsets of emails from each author. (C)
  • Preprocess the subsets into vectors of TF-IDF scores for the top words in the corpus. (Ruby)
  • Build and test naive Bayes and SVM classifiers that predict who wrote an email from its content (that is, its TF-IDF scores). (Python)
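
The last two steps can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not the code in this repository: the real preprocessing is done in Ruby, and the README does not say how `bayers.py` implements its classifier. The function names, the smoothed IDF formula, the multinomial-style naive Bayes with Laplace smoothing, and the toy emails and author labels are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase, split on whitespace, keep purely alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]


def build_vocab(docs, top_k):
    # The top_k most frequent words across all training emails.
    counts = Counter(w for d in docs for w in tokenize(d))
    return [w for w, _ in counts.most_common(top_k)]


def fit_idf(docs, vocab):
    # Smoothed inverse document frequency (an assumed variant).
    n = len(docs)
    df = Counter(w for d in docs for w in set(tokenize(d)))
    return {w: math.log((1 + n) / (1 + df[w])) + 1.0 for w in vocab}


def vectorize(doc, vocab, idf):
    # One TF-IDF score per vocabulary word, in vocabulary order.
    tokens = tokenize(doc)
    tf = Counter(tokens)
    total = len(tokens) or 1
    return [tf[w] / total * idf[w] for w in vocab]


def train_nb(vectors, labels, alpha=1.0):
    # Multinomial-style naive Bayes over TF-IDF weights, Laplace-smoothed.
    by_author = defaultdict(list)
    for vec, author in zip(vectors, labels):
        by_author[author].append(vec)
    model = {}
    for author, rows in by_author.items():
        sums = [sum(col) + alpha for col in zip(*rows)]
        total = sum(sums)
        log_prior = math.log(len(rows) / len(labels))
        log_likelihood = [math.log(s / total) for s in sums]
        model[author] = (log_prior, log_likelihood)
    return model


def predict_nb(model, vector):
    # Pick the author with the highest log-posterior score.
    def score(author):
        log_prior, log_likelihood = model[author]
        return log_prior + sum(x * l for x, l in zip(vector, log_likelihood))
    return max(model, key=score)


# Toy training data (hypothetical, not taken from the Enron corpus).
train_docs = [
    "please review the gas contract today",
    "the gas pipeline contract needs review",
    "lunch meeting moved to friday",
    "friday lunch with the trading team",
]
train_authors = ["alice", "alice", "bob", "bob"]

vocab = build_vocab(train_docs, top_k=20)
idf = fit_idf(train_docs, vocab)
model = train_nb([vectorize(d, vocab, idf) for d in train_docs], train_authors)

guess_a = predict_nb(model, vectorize("gas contract review", vocab, idf))
guess_b = predict_nb(model, vectorize("friday lunch", vocab, idf))
print(guess_a, guess_b)
```

Note that the IDF weights are fitted on the training emails only and then reused for the test emails, so that no information from the test subset leaks into the features.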

The datasets are not included due to their sheer size.