A preprocessed version of the TREC 2007 Public Corpus Dataset, suitable for building spam detection models. The original dataset is from https://plg.uwaterloo.ca/~gvcormac/treccorpus07/about.html. Here I have only preprocessed the data so that it is easier to use.
The TREC 2007 Public Corpus Dataset is an email spam detection dataset. It contains 50199 spam emails and 25220 ham (not spam) emails, 75419 emails in total.
The dataset consists of a single CSV file with 5 columns, described below.
- label: Label for the email; 1 means spam, any other value means ham
- subject: Subject of the email
- email_to: Receiver of the email
- email_from: Sender of the email
- message: Email body
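As a quick sanity check after downloading, the label column can be tallied and compared against the counts above. The following is only a minimal sketch using the Python standard library: the function name `count_labels` and the file path in the usage comment are hypothetical, and it assumes the CSV has a header row with the five column names listed above and is UTF-8 encoded.

```python
import csv
from collections import Counter

# Column names as described in this README (assumed to appear in the header row).
EXPECTED_COLUMNS = ["label", "subject", "email_to", "email_from", "message"]


def count_labels(csv_path):
    """Return a Counter with the number of 'spam' and 'ham' rows in the CSV."""
    # Email bodies can be very long, so raise the default per-field size limit.
    csv.field_size_limit(10**9)
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        for row in reader:
            # label is 1 for spam; anything else is treated as ham.
            key = "spam" if row["label"].strip() == "1" else "ham"
            counts[key] += 1
    return counts


# Example usage (the path is a placeholder, not the real file name):
# counts = count_labels("path/to/downloaded.csv")
# print(counts["spam"], counts["ham"])
```

If the tallies differ from the figures given above, the file was probably read with different assumptions (for example, a different encoding or header) than this sketch makes.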
If you want to download the processed data, then please check this Kaggle dataset: https://www.kaggle.com/imdeepmind/preprocessed-trec-2007-public-corpus-dataset
Here is the link to the original dataset: https://plg.uwaterloo.ca/~gvcormac/treccorpus07/about.html
I am really thankful to these people and sources for providing this amazing dataset.