Analysing Large Email Datasets using Spark (Big Data) on Amazon AWS

This file contains the installation and execution steps for this project. I have used Python's pyspark module to create the Spark applications.

Installation Instructions

Required Python version:

  1. Python 2.7.12

Required Python modules:

  1. pyspark
  2. py4j

Third party modules:

  1. nltk
  2. bs4
  3. sh
  4. sklearn (scikit-learn)
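
pyspark and py4j normally ship with the Spark installation on the cluster. The third-party modules, if missing, can usually be installed with pip; the package names below are the standard pip names, which is an assumption on my part since this repository does not pin them:

pip install nltk beautifulsoup4 sh scikit-learn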

The description of each script is as follows:

  1. preprocessingDataDiscovery.py (for data preprocessing, data discovery, and graph representation)
  2. bag_of_words.py (for creating a bag of words over all emails, which is the input for LDA and K-means)
  3. Kmeans.py (for running the K-means algorithm)
  4. LDA.py (for running the LDA algorithm)

Configure the Spark cluster on Amazon AWS: please follow the steps shared by the TA, and also execute the three commands that copy the maildir dataset onto the AWS Hadoop file system (HDFS).
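
The exact copy commands come from the TA's handout; purely as an illustration (the HDFS paths here are hypothetical), copying a local copy of the dataset into HDFS looks like:

hadoop fs -mkdir -p /maildir
hadoop fs -put maildir/* /maildir/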

After this, please create a folder on AWS called 'source_code'; it should be /home/hadoop/source_code. Copy all the files from this repository's Source_code/ folder to /home/hadoop/source_code, then go into that folder with the command below.

cd /home/hadoop/source_code

Please follow the instructions below to run each Spark application. The execution instructions are as follows:

  1. Preprocessing and Data Discovery (Task 1 and Task 2): implemented in the script preprocessingDataDiscovery.py, which also needs the class file userObject.py. The command is as follows:

    spark-submit --py-files userObject.py preprocessingDataDiscovery.py

    It will save the following four folders in HDFS; the output directory in HDFS is output_assignment2:

      s01_DataAfterStoppingAndStemming
      s02_DataDiscovery_stastistical_summary
      s03_Directional_Graph_sent_Edges
      s04_Directional_Graph_received_Edges
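
    For orientation, here is a minimal sketch of the stop-word removal and stemming step, assuming the emails are read as (path, body) pairs; the paths and names below are illustrative, not taken from preprocessingDataDiscovery.py:

    from pyspark import SparkContext
    from nltk.corpus import stopwords          # needs nltk.download('stopwords')
    from nltk.stem.porter import PorterStemmer

    sc = SparkContext(appName="preprocessing_sketch")
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    def clean(body):
        # Lower-case, keep alphabetic tokens, drop stop words, stem the rest.
        tokens = [w for w in body.lower().split()
                  if w.isalpha() and w not in stop_words]
        return [stemmer.stem(w) for w in tokens]

    emails = sc.wholeTextFiles("hdfs:///maildir/*")   # (path, body) pairs
    emails.mapValues(clean).saveAsTextFile("output_assignment2/s01_sketch")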

  2. Creating the Bag of Words: implemented in the script bag_of_words.py. The command is as follows:

    spark-submit bag_of_words.py

    It will create the bag-of-words file Bag_of_words.txt and the vocabulary file vocabList.txt in the output directory. These files then need to be copied to the HDFS folder /output_assignment2.
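
    As a rough sketch of what this step computes (the input path and output name below are assumptions, not taken from bag_of_words.py):

    from pyspark import SparkContext

    sc = SparkContext(appName="bag_of_words_sketch")
    docs = sc.textFile("output_assignment2/s01_DataAfterStoppingAndStemming") \
             .map(lambda line: line.split())

    # Vocabulary: every distinct stemmed token, in a fixed order.
    vocab = sorted(docs.flatMap(lambda words: words).distinct().collect())
    index = dict((w, i) for i, w in enumerate(vocab))

    def to_counts(words):
        # One document as sparse (term index, count) pairs.
        counts = {}
        for w in words:
            counts[index[w]] = counts.get(index[w], 0) + 1
        return sorted(counts.items())

    docs.map(to_counts).saveAsTextFile("Bag_of_words_sketch")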

  3. Running the K-means Algorithm: implemented in the script Kmeans.py. The command is as follows:

    spark-submit Kmeans.py <INPUT_PATH> <K_VALUE>

    The output file is Kmeans_clustering_output.txt.
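
    The following is an illustrative use of MLlib's K-means on such vectors; the argument handling and the dense input format are simplifications, not Kmeans.py itself:

    import sys
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="kmeans_sketch")
    input_path, k = sys.argv[1], int(sys.argv[2])

    # Each input line is assumed to be a whitespace-separated numeric vector.
    data = sc.textFile(input_path).map(
        lambda line: [float(x) for x in line.split()])

    model = KMeans.train(data, k, maxIterations=10)
    for i, center in enumerate(model.clusterCenters):
        print("cluster %d center: %s" % (i, center))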

  4. Running the LDA Algorithm: implemented in the script LDA.py. The command is as follows:

    spark-submit LDA.py <INPUT_PATH> <NO_OF_TOPICS>

    The output file is LDA_clustering_output.txt. The LDAModel is saved in the HDFS output directory (/output_assignment2/).
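
    A comparable sketch for MLlib's LDA; the corpus format (dense term-count vectors, one document per line) and the model path are assumptions:

    import sys
    from pyspark import SparkContext
    from pyspark.mllib.clustering import LDA
    from pyspark.mllib.linalg import Vectors

    sc = SparkContext(appName="lda_sketch")
    input_path, num_topics = sys.argv[1], int(sys.argv[2])

    # MLlib's LDA expects an RDD of [document id, term-count vector] pairs.
    corpus = sc.textFile(input_path) \
               .map(lambda line: Vectors.dense([float(x) for x in line.split()])) \
               .zipWithIndex() \
               .map(lambda pair: [pair[1], pair[0]])

    model = LDA.train(corpus, k=num_topics)
    model.save(sc, "output_assignment2/LDAModel_sketch")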
