This repository contains the source code and the executable jar of a Java application that builds a CSV dataset file from the Enron data folders.
The application converts the unstructured Enron dataset into a structured dataset that can serve as input for data-cleaning operations during the preprocessing stage.
The unstructured dataset is available to download from here
- downloadEnronDataset.sh: Shell script that downloads the Enron dataset file and extracts it.
All executable code is present in the /executable directory.
- execute.sh: Driver program. Shell script used to run the Java application.
- createJar.sh: Shell script that compiles the Maven project, builds the .jar file, and copies it into this directory.
- enron_to_csv-1.0.jar: .jar file encapsulating the Java application.
The enron_to_csv/ directory is the Maven project containing all the Java source code.
The structuredData/ directory contains the output CSV file generated by this application.
This application takes the path to the maildir directory as input and produces one output CSV file.
- The output CSV file contains the raw email text.
CSV format: "id","message"
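As an illustration of the "id","message" layout, here is a minimal Java sketch of how one row could be written. It is not taken from the enron_to_csv sources: the class name `CsvRowSketch` and its methods are hypothetical, and doubling embedded double quotes (the RFC 4180 convention) is an assumption about how raw email text might be escaped, not a description of what the application does.

```java
/** Hypothetical helper for illustration; not part of the enron_to_csv project. */
final class CsvRowSketch {

    // Wrap a field in double quotes and double any embedded quotes
    // (RFC 4180 convention, assumed here).
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Build one "id","message" row. Line breaks in the raw email text
    // stay inside the quoted message field.
    static String toRow(int id, String message) {
        return quote(Integer.toString(id)) + "," + quote(message);
    }

    public static void main(String[] args) {
        System.out.println("\"id\",\"message\"");
        System.out.println(toRow(1, "Subject: Meeting\n\nSee you at \"noon\"."));
    }
}
```

Because raw emails contain commas, quotes, and line breaks, a reader of the output file would need a CSV parser that supports quoted multi-line fields rather than a plain line-by-line split.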
This application requires 2 input parameters:
- overAllLimitier: the upper limit on the total number of emails to be read and written to the output CSV dataset file. -1 indicates no limit.
- emailLimiterPerUser: the upper limit on the number of emails per user to be read and written to the output CSV dataset file. -1 indicates no limit.
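The way the two limits could interact can be sketched as follows. This is a hedged illustration, not the application's actual code: the class `LimiterSketch`, the method names, and the in-memory map standing in for the per-user maildir folders are all assumptions. The only behavior taken from the description above is that -1 means no limit.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the two limiters; not part of the enron_to_csv project. */
final class LimiterSketch {
    static final int NO_LIMIT = -1;

    // True while another email may still be taken under the given limit.
    static boolean underLimit(int count, int limit) {
        return limit == NO_LIMIT || count < limit;
    }

    // Walk users in order, taking at most perUserLimit emails from each
    // and at most overallLimit emails in total.
    static List<String> select(Map<String, List<String>> emailsByUser,
                               int overallLimit, int perUserLimit) {
        List<String> selected = new ArrayList<>();
        for (List<String> emails : emailsByUser.values()) {
            int takenForUser = 0;
            for (String email : emails) {
                if (!underLimit(selected.size(), overallLimit)) {
                    return selected; // overall cap reached
                }
                if (!underLimit(takenForUser, perUserLimit)) {
                    break; // per-user cap reached, move on to the next user
                }
                selected.add(email);
                takenForUser++;
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, List<String>> mail = new LinkedHashMap<>();
        mail.put("user-a", List.of("a1", "a2", "a3"));
        mail.put("user-b", List.of("b1", "b2"));
        System.out.println(select(mail, NO_LIMIT, NO_LIMIT)); // [a1, a2, a3, b1, b2]
        System.out.println(select(mail, 3, 2));               // [a1, a2, b1]
    }
}
```

In this sketch the overall limit is checked before the per-user limit, so the total never exceeds the overall cap even when the per-user cap would allow more.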
- Navigate to the /executable directory.
- Download and extract the Enron dataset by running the downloadEnronDataset.sh script:
  ./downloadEnronDataset.sh
- Run the jar application with the following command:
  ./execute.sh -1 -1
The application was last executed in the following environment:
- Maven version: 3.8.6
- openjdk version: "11.0.16.1" 2022-08-12
- OpenJDK Runtime Environment Homebrew (build 11.0.16.1+0)
- OpenJDK 64-Bit Server VM Homebrew (build 11.0.16.1+0, mixed mode)