An application which converts Enron dataset into a single CSV file

mitrjain/convertEnronToCsv

convertEnronToCsv

This repository contains the source code and the executable JAR of the Java application that builds a CSV dataset file from the Enron data folders.

This application converts the unstructured Enron dataset into a structured dataset that can serve as input for data-cleaning operations during the preprocessing stage.

The unstructured dataset is available to download from here

Code organization/ Directory structure

downloadEnronDataset.sh : Shell script that downloads the Enron dataset archive and extracts it.

All executable code is present in the /executable directory.

  • execute.sh : Driver program. Shell script used to run the Java application.
  • createJar.sh : Shell script that compiles the Maven project, builds the .jar file, and copies it into this directory.
  • enron_to_csv-1.0.jar : .jar file encapsulating the Java application.

The enron_to_csv/ directory is the Maven project containing all the Java source code.

The structuredData/ directory contains the output CSV file generated by this application.

Running the application

Understanding flow of operations

This application takes the path to the maildir directory as input and produces one output CSV file.

  • The output CSV file contains the raw email text.

CSV format: "id","message"
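
A minimal sketch of how one row in the "id","message" format could be produced. It is not taken from the repository's source: the class and method names are hypothetical, and the quoting rule (wrap each field in double quotes, double any embedded quote) is an assumption based on common CSV practice, since raw email text usually contains commas, quotes, and line breaks.

```java
// Hypothetical helper for illustration only; not the application's actual code.
final class CsvRowSketch {

    // Wraps a field in double quotes and doubles any embedded double quote,
    // so commas and line breaks inside the email text stay within one field.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Builds one row in the "id","message" layout described above.
    static String toCsvRow(int id, String message) {
        return quote(Integer.toString(id)) + "," + quote(message);
    }

    public static void main(String[] args) {
        System.out.println(toCsvRow(1, "Subject: Hello\nHe said \"hi\", then left."));
    }
}
```

Because a quoted field may span several lines, a reader of the output file should use a CSV parser rather than splitting on line breaks.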

This application requires 2 input parameters:

  • overAllLimitier: the upper limit on the total number of emails read and written to the output CSV file. -1 indicates no limit.
  • emailLimiterPerUser: the upper limit on the number of emails per user read and written to the output CSV file. -1 indicates no limit.
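
The interaction of the two limits can be sketched as follows. This is an illustration under stated assumptions, not the application's actual logic: the README does not say how the limits are combined, so the sketch assumes both are enforced at once, that users are processed in order, and that -1 disables a limit. The class, the method names, and the in-memory map standing in for the maildir folders are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the two limits; not the repository's code.
final class LimitSketch {

    // A limit of -1 means "no limit"; otherwise it is reached once count >= limit.
    static boolean reached(int count, int limit) {
        return limit != -1 && count >= limit;
    }

    // emailsByUser stands in for the per-user folders under maildir.
    static List<String> select(Map<String, List<String>> emailsByUser,
                               int overallLimit, int perUserLimit) {
        List<String> selected = new ArrayList<>();
        for (List<String> emails : emailsByUser.values()) {
            int takenForUser = 0;
            for (String email : emails) {
                if (reached(selected.size(), overallLimit)) {
                    return selected; // overall cap hit: stop reading entirely
                }
                if (reached(takenForUser, perUserLimit)) {
                    break; // per-user cap hit: move on to the next user
                }
                selected.add(email);
                takenForUser++;
            }
        }
        return selected;
    }
}
```

Under these assumptions, passing -1 for both arguments selects every email, while a small per-user limit gives a sample spread across users rather than one dominated by the largest mailboxes.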

Steps to run the application

  • Navigate to the /executable directory
  • Download and extract the Enron dataset by executing the script downloadEnronDataset.sh. To execute the script, run the following command:
    • ./downloadEnronDataset.sh
  • Run the JAR application with the following command:
    • ./execute.sh -1 -1

Environment specifications

The following are the specifications of the environment in which this application was last executed:

  • Maven version: 3.8.6
  • openjdk version: "11.0.16.1" 2022-08-12
  • OpenJDK Runtime Environment Homebrew (build 11.0.16.1+0)
  • OpenJDK 64-Bit Server VM Homebrew (build 11.0.16.1+0, mixed mode)
