This repository is the submission for the CS6240 project group of Komal Pardeshi and Sean Yu. The project structure is adapted from the template repository used for the course assignments. The report for this project is here.
- Joe Sackett (2018)
- Updated by Nikos Tziavelis (2023)
- Updated by Mirek Riedewald (2024)
- Updated by Komal Pardeshi and Sean Yu (2024)
These components need to be installed first:
- OpenJDK 11
- Hadoop 3.3.5
- Maven (tested with version 3.6.3)
- AWS CLI (tested with version 1.22.34)
- Scala 2.12.17 (you can install this specific version with the Coursier CLI tool, which also needs to be installed)
- Spark 3.3.2 (without bundled Hadoop)
After downloading the Hadoop and Spark installations, move them to an appropriate directory:
mv hadoop-3.3.5 /usr/local/hadoop-3.3.5
mv spark-3.3.2-bin-without-hadoop /usr/local/spark-3.3.2-bin-without-hadoop
- Example ~/.bash_aliases:
  export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
  export HADOOP_HOME=/usr/local/hadoop-3.3.5
  export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
  export SCALA_HOME=/usr/share/scala
  export SPARK_HOME=/usr/local/spark-3.3.2-bin-without-hadoop
  export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$SPARK_HOME/bin
  export SPARK_DIST_CLASSPATH=$(hadoop classpath)
- Explicitly set JAVA_HOME in $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
  export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
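Once the environment is set up, a quick check can confirm that the installed components are reachable from the shell. The following is a minimal sketch and not part of this repository: `missing_tools` is a hypothetical helper, and the command names passed to it (`java`, `hadoop`, `mvn`, `aws`, `scala`, `spark-submit`) are assumed to be the ones these installations provide.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): print every command
# from the argument list that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    # `command -v` is the POSIX way to test whether a command resolves.
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed command names for the components listed above.
missing=$(missing_tools java hadoop mvn aws scala spark-submit)
if [ -n "$missing" ]; then
  echo "Not found on PATH:"
  echo "$missing"
else
  echo "All required tools were found on PATH."
fi
```

If a tool is reported as missing, re-check the corresponding `export` line and open a new shell so the changes take effect.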
The data to run the experiments can be found in the following OneDrive folder: https://northeastern-my.sharepoint.com/:f:/g/personal/yu_sea_northeastern_edu/EuaqPtrv3_tIgLCFd6aNUDoBMId3wMIE_k5u3T5q10Auiw?e=zd84c8
To run the code locally, add the input file to the inputs folder and update the local file path in the Makefile.
Matrix Multiplication - H-V
To run this program, change the input program in the Makefile to: matmulGeneric.matmul
The parameters aws.bucket.name, aws.log.dir, aws.instance.type, and aws.num.nodes may be changed as required.
Change the following AWS parameters for every execution: aws.experiment_id, aws.input1, aws.input2, aws.input, aws.output, aws.size1, aws.size2, aws.cluster.
Matrix Multiplication - V-H adapted for $A^T A$
To run this program, change the input program in the Makefile to: ATAMatMulMain
Matrix Inversion
To run this program, change the input program in the Makefile to: matrixInversion.matrixInversionMain
Linear Regression
This program requires four inputs: the training feature matrix, the test feature matrix, the training values vector, and the test values vector. When running locally, the program takes all four inputs as a single comma-separated string.
For example, the local input line in the Makefile should look like the following:
local.input=input/trainA.csv,input/testA.csv,input/trainb.csv,input/testb.csv
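To make the expected format concrete, the sketch below splits such a comma-separated value into its four parts in plain POSIX shell. It is illustrative only: `split_inputs` is a hypothetical helper, not the parsing code used by the Scala program, and the order of the fields follows the description above.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): split a comma-separated
# local.input value into the four paths the linear regression job expects.
split_inputs() {
  input=$1
  old_ifs=$IFS
  IFS=','
  # Unquoted on purpose so the shell splits the value on commas.
  set -- $input
  IFS=$old_ifs
  if [ "$#" -ne 4 ]; then
    echo "expected 4 comma-separated paths, got $#" >&2
    return 1
  fi
  printf 'train features: %s\ntest features: %s\ntrain values: %s\ntest values: %s\n' \
    "$1" "$2" "$3" "$4"
}

split_inputs "input/trainA.csv,input/testA.csv,input/trainb.csv,input/testb.csv"
```

A value with fewer or more than four paths makes the helper return a non-zero status, which is a cheap way to catch a mistyped `local.input` before starting a run.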
To run on AWS, the inputs are given as separate arguments. Make sure the bucket name is correct, and set the AWS inputs as follows:
aws.input1=normTrainA.csv
aws.input2=normTestA.csv
aws.input3=normTrainb.csv
aws.input4=normTestb.csv
Lastly, several lines in the file src\main\scala\linearRegression\linearRegressionFinal.scala need to be commented in or out depending on whether the code is run locally or on AWS.
To run locally, keep lines 161-165 and comment out lines 168-171. Conversely, to run on AWS, keep lines 168-171 and comment out lines 161-165.
All of the build & execution commands are organized in the Makefile.
- Unzip project file.
- Open command prompt.
- Navigate to the directory where the project files were unzipped.
- Edit the Makefile to customize the environment at the top. Sufficient for standalone: hadoop.root, jar.name, local.input. Other defaults are acceptable for running standalone.
- Standalone Hadoop:
  make switch-standalone -- set standalone Hadoop environment (execute once)
  make local
- Pseudo-Distributed Hadoop: (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation)
  make switch-pseudo -- set pseudo-clustered Hadoop environment (execute once)
  make pseudo -- first execution
  make pseudoq -- later executions since namenode and datanode already running
- AWS EMR Hadoop: (you must configure the emr.* config parameters at top of Makefile)
  make make-bucket -- only before first execution
  make upload-input-aws -- only before first execution
  make aws -- check for successful execution with web interface (aws.amazon.com)
  make download-output-aws -- after successful execution & termination
The Makefile was edited to allow the program to be run locally and on AWS. Similarly, the Spark master was changed on lines 16-17 of WordCount.scala to allow the program to run locally and on AWS.