# Introduction

In this project, we explore the use of topic modelling on a cybersecurity dataset. The project builds on work done by a previous year's group ([linked here](https://github.com/xiaozhang-github)), though we take a different approach to a similar problem: using the Enron email dataset, can topic modelling classify emails as spam or ham, and how effectively?

### Initial Plan

Our initial idea was to use topic models to enhance the performance of a spam/ham classifier on the Enron email dataset, and in doing so compare the performance increase (if any) that different topic models gave. The general idea was to use a topic model to assign each document a topic, create a `topic` feature vector from those assignments, add it to the original dataset, and run the classifier on the result. This would be akin to a semi-supervised approach to learning how to classify spam/ham. We expected it to give some intuition as to what each topic corresponds to, and to let us measure how informative each topic model was by comparing performance increase and feature importance.

However, we partly abandoned this approach because it quickly turned into a feature-engineering problem: we needed to create features for each set of emails that were not generated by our topic models. We initially considered:
- Word count
- Count of capital letters
    - Count of consecutive capital letters
- Appearance/count of 're:' or 'fw:' in the emails
- Misspelling count

as possible features to feed to a classifier such as an SVM. We would assess its performance, then apply a topic model to generate more features and assess the outcome.
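
To make the candidate features above concrete, here is a minimal sketch of how a few of them could be computed for a single email. The function name `surface_features` and the exact definitions (for example, counting runs of two or more capitals, and matching `re:`/`fw:` case-insensitively) are our assumptions for illustration. Misspelling count is left out because it needs a dictionary or spell-checker.

```python
import re


def surface_features(email_text):
    """Compute simple surface features for one email (illustrative only)."""
    words = email_text.split()
    # Runs of two or more consecutive capital letters, e.g. "FREE" or "NOW"
    capital_runs = re.findall(r"[A-Z]{2,}", email_text)
    # Reply/forward markers, matched case-insensitively
    reply_forward = re.findall(r"\b(?:re|fw):", email_text, flags=re.IGNORECASE)
    return {
        "word_count": len(words),
        "capital_count": sum(1 for ch in email_text if ch.isupper()),
        "capital_run_count": len(capital_runs),
        "reply_forward_count": len(reply_forward),
    }


# Made-up example email subject, not taken from the Enron dataset
example = "Re: FW: Claim your FREE prize NOW"
features = surface_features(example)
print(features)
```

Each email would contribute one such dictionary, and the resulting table could then be passed to a classifier.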

However, it was unclear whether this would be a sufficient number of features, and applying them correctly would have taken too much time and gone beyond the scope of this assignment. Instead, we went with a different approach:

### Our Decided Approach

We decided to use different topic models both as classifiers and as feature generators for a separate classifier. We look at:
  - Term Frequency
    - Term frequency is often used as part of tf-idf for topic modelling.
    - We use term frequency as a 'null model' of sorts by using term frequencies as features for an SVM classifier.
  - Latent Dirichlet Allocation (LDA) modelling: 
    - Here we explore how LDA works, take a brief look at the topics that it generates and the coherence of those topics. 
    - We look at spam prevalence in a topic as an indication of it being a spam/ham topic and use LDA probabilities for those topics to assign spam/ham predictions.
    - We use LDA probabilities as features for a custom simple classifier and an SVM and assess their performance individually and in comparison to the way we used LDA as a classifier.
  - Hierarchical Dirichlet Process (HDP) modelling:
    - We use HDP similarly to how we used LDA in creating custom classifiers.
    - We use HDP features to feed to multiple classifiers such as SVM, Random Forests, and Voting models.
    - We assess these performances individually and against one another.
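
The idea of using spam prevalence per topic to turn topic probabilities into predictions can be sketched as follows. This is a minimal illustration, not necessarily the exact procedure used later: the function names, the choice to assign each training document to its dominant topic, and the 0.5 thresholds are all assumptions, and the toy probabilities stand in for real LDA or HDP output.

```python
from collections import defaultdict


def topic_spam_prevalence(doc_topic_probs, labels):
    """Fraction of spam among training documents, grouped by dominant topic."""
    spam_counts = defaultdict(int)
    doc_counts = defaultdict(int)
    for probs, label in zip(doc_topic_probs, labels):
        # Dominant topic = index of the largest topic probability
        dominant = max(range(len(probs)), key=probs.__getitem__)
        doc_counts[dominant] += 1
        spam_counts[dominant] += label  # label: 1 = spam, 0 = ham
    return {t: spam_counts[t] / doc_counts[t] for t in doc_counts}


def spam_score(probs, prevalence, threshold=0.5):
    """Total probability a document places on spam-leaning topics."""
    return sum(p for t, p in enumerate(probs) if prevalence.get(t, 0.0) > threshold)


# Toy document-topic probabilities (3 topics) with made-up labels
train_probs = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]
train_labels = [1, 1, 0, 0, 1]

prevalence = topic_spam_prevalence(train_probs, train_labels)
new_doc = [0.5, 0.25, 0.25]
score = spam_score(new_doc, prevalence)
predicted_spam = score >= 0.5  # assumed decision threshold
print(prevalence, score, predicted_spam)
```

The continuous `score` is what makes this usable with threshold-free metrics, since it can be ranked across documents rather than only compared to a single cut-off.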
    
By doing so, we create classifiers that can predict ham/spam and compare their performance using ROC curves and AUC scores, which we chose for ease of comparison across multiple models. More importantly, we should be able to explore how topic models play a role in this and how to better understand them by adjusting their parameters for improved performance. However, our goal is not necessarily to create a winning classifier, but rather to assess topic models as classifiers and as feature generators for other classifiers.
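
As a reminder of what an AUC score measures, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen spam email receives a higher score than a randomly chosen ham email. The function name `roc_auc` and the toy scores are assumptions for illustration. In practice a library routine such as scikit-learn's `roc_auc_score` computes the same quantity far more efficiently than this quadratic loop.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random spam email outscores a random ham email."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one spam and one ham example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = spam, 0 = ham) and scores from two hypothetical models
labels = [1, 1, 0, 0, 1, 0]
model_a = [0.9, 0.8, 0.3, 0.4, 0.7, 0.2]
model_b = [0.9, 0.3, 0.4, 0.6, 0.7, 0.2]

auc_a = roc_auc(labels, model_a)
auc_b = roc_auc(labels, model_b)
print(auc_a, auc_b)
```

Because AUC depends only on how the scores rank the emails, it lets us compare models whose raw scores are on different scales, such as topic probabilities and SVM decision values.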