Homework 0: Word Count of a File using Map-Reduce

The purpose of this homework is to get familiar with the Vocareum Lab infrastructure and the environment in which project submissions will be run.
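The contents of word_count.py are not shown in this README, so the following is only a minimal, standard-library sketch of the map and reduce steps behind a word count, not the homework's actual code. The function names (`map_words`, `reduce_by_key`, `word_count`), the whitespace tokenization, and the lowercasing are all assumptions made for illustration. In PySpark the same idea is usually expressed with the RDD methods `flatMap`, `map`, and `reduceByKey`.

```python
from collections import defaultdict
from functools import reduce


def map_words(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated word.

    Lowercasing is an assumption of this sketch, not something the README states.
    """
    return [(word.lower(), 1) for word in line.split()]


def reduce_by_key(pairs):
    """Reduce step: sum the counts of all pairs that share the same word."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return {
        word: reduce(lambda a, b: a + b, counts)
        for word, counts in grouped.items()
    }


def word_count(lines):
    """Apply the map step to every line, then reduce the emitted pairs by key."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return reduce_by_key(pairs)


if __name__ == "__main__":
    # Hypothetical input; the real program reads text.txt.
    print(word_count(["the quick brown fox", "the lazy dog"]))
```

Keeping the map and reduce steps as separate functions mirrors how the work would be split across a cluster: the map step needs only one line at a time, and the reduce step needs only the pairs for one key at a time.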

Python (with pyspark)

To run the Python programs in the Vocareum terminal, follow these steps:

Set the Java version to 1.8 using: export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
Then add it to the PATH environment variable: export PATH=$JAVA_HOME/bin:$PATH
Set the PySpark Python version to 3.6 by entering: export PYSPARK_PYTHON=python3.6
Upload the code files to the work section of the workspace area.
Finally, run your script (here, word_count.py) using the following command: python word_count.py. If that doesn't run, try:

/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit --executor-memory 4G --driver-memory 4G word_count.py

Alternatively, execute ./run.sh.
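The Vocareum steps above can also be sketched as a small Python launcher. This is a hedged illustration, not part of the repository: the helper names `vocareum_env` and `build_spark_submit_command` are hypothetical, and the `spark_submit` argument is a placeholder for the full spark-submit path used on Vocareum. The environment values and the two memory flags are taken from the steps above.

```python
import os
import shlex


def vocareum_env(base_env):
    """Return a copy of base_env with the variables from the steps above applied."""
    env = dict(base_env)
    env["JAVA_HOME"] = "/usr/lib/jvm/java-1.8.0-openjdk-amd64"
    # Prepend the JDK's bin directory so its java binary is found first.
    env["PATH"] = env["JAVA_HOME"] + "/bin" + os.pathsep + env.get("PATH", "")
    env["PYSPARK_PYTHON"] = "python3.6"
    return env


def build_spark_submit_command(spark_submit, script,
                               executor_memory="4G", driver_memory="4G"):
    """Build the spark-submit argument list (hypothetical helper)."""
    return [
        spark_submit,
        "--executor-memory", executor_memory,
        "--driver-memory", driver_memory,
        script,
    ]


if __name__ == "__main__":
    env = vocareum_env(os.environ)
    # "spark-submit" is a placeholder; substitute the full path from the steps above.
    cmd = build_spark_submit_command("spark-submit", "word_count.py")
    print(" ".join(shlex.quote(part) for part in cmd))
    # To actually launch the job, one could then call:
    #   subprocess.run(cmd, env=env, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting issues if a path contains spaces.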

[Figure: Vocareum demo run example]

To run the Python files locally, perform the following steps:

Install a JDK.
Install Spark and Hadoop for Mac and set the SPARK_HOME and HADOOP_HOME environment variables to the Spark root directory and the Hadoop root directory, respectively.
Install Python 3.6 (ref. here for Macs with M-series chips); newer versions may also work.
Alternatively, refer to this video.
Run the program using python word_count.py.
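Before running word_count.py locally, it can help to confirm that the environment variables from the steps above are actually set. The following is a minimal sketch under stated assumptions: the helper names `missing_env_vars` and `non_directory_vars` are hypothetical, and including JAVA_HOME in the required list is an assumption based on the JDK step.

```python
import os

# Variables the local setup relies on; JAVA_HOME is an assumption of this sketch.
REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME")


def missing_env_vars(environ, required=REQUIRED_VARS):
    """Return the required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]


def non_directory_vars(environ, required=REQUIRED_VARS):
    """Return the variables that are set but do not point at an existing directory."""
    return [
        name for name in required
        if environ.get(name) and not os.path.isdir(environ[name])
    ]


if __name__ == "__main__":
    for name in missing_env_vars(os.environ):
        print("not set:", name)
    for name in non_directory_vars(os.environ):
        print("not a directory:", name, "=", os.environ[name])
```

Running the script prints one line per problem and nothing at all when every variable is set and points at a directory.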

[Figure: Local demo run example]

Versions

JDK - 1.8 (jdk1.8.0_361.jdk)
SPARK - 3.3.1 (spark-3.3.1-bin-hadoop3)
python - 3.6 (newer 3.x versions may work locally, but some newer syntax might not work on Vocareum)
pyspark - 3.3.1
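Since Vocareum runs Python 3.6 while a local machine may have something newer, a quick check of the interpreter version can flag the mismatch early. This is a small sketch; the function name `python_version_status` and its return strings are hypothetical.

```python
import sys

# Python version used on Vocareum, per the Versions list above.
VOCAREUM_PYTHON = (3, 6)


def python_version_status(version_info, target=VOCAREUM_PYTHON):
    """Compare a (major, minor, ...) version against the target (major, minor)."""
    current = tuple(version_info[:2])
    if current < target:
        return "too old"
    if current > target:
        return "newer than target"
    return "matches target"


if __name__ == "__main__":
    print(python_version_status(sys.version_info))
```

A result of "newer than target" does not mean the code is broken, only that features added after 3.6 (for example the `:=` operator from 3.8 or `match` statements from 3.10) would fail on Vocareum.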

Scala

To run the Scala jar on Vocareum, use the following command:

./run.sh --class org.rpatel.dsci553_assignments.WordCount ./scala-hw-0.jar ../text.txt

Replace the spark-submit path and the .jar file path accordingly.

[Figure: Vocareum demo run example]

To run the Scala program locally, build the artifacts (jars) and use the following command:

./run-local.sh --class org.rpatel.dsci553_assignments.WordCount ./out/artifacts/scala_hw_0_jar/scala-hw-0.jar ../text.txt

Replace the spark-submit path and the .jar file path accordingly.

[Figure: Local demo run example]

Versions

JDK - 1.8 (jdk1.8.0_361.jdk)
SPARK - 3.1.2 (spark-3.1.2-bin-hadoop3.2)
scala - 2.12.17 (sbt: org.scala-lang:scala-library:2.12.17:jar)

Troubleshoot

Path not configured properly for Spark or Hadoop - add the following to ~/.zshrc or ~/.zprofile, whichever your shell uses:

# Set env vars for spark and hadoop
export SPARK_HOME=/Library/spark-3.3.1-bin-hadoop3
export PYTHONPATH=/Library/spark-3.3.1-bin-hadoop3/python

JDK is not available - add the following to ~/.zshrc or ~/.zprofile, whichever your shell uses:

# Set env var for Java
export JAVA_HOME=/Library/Java/JavaVirtualMachines/liberica-jdk-8.jdk

or use the path of any other JDK distribution if multiple of them are installed.

Port cannot be assigned (Can't assign requested address: Service 'sparkDriver' failed after 16 retries) - manually set SPARK_LOCAL_IP in /Library/spark-3.3.1-bin-hadoop3/bin/load-spark-env.sh using:

export SPARK_LOCAL_IP="127.0.0.1"
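As a possible alternative to editing load-spark-env.sh, the same variable can be set from Python before Spark starts. This sketch rests on an assumption: that the JVM launched by PySpark inherits the environment of the Python process, so the variable must be set before any SparkContext or SparkSession is created. The helper name `ensure_spark_local_ip` is hypothetical.

```python
import os


def ensure_spark_local_ip(environ, address="127.0.0.1"):
    """Set SPARK_LOCAL_IP only if it is not already set; return the effective value."""
    return environ.setdefault("SPARK_LOCAL_IP", address)


if __name__ == "__main__":
    # Call this at the top of the script, before creating a SparkContext,
    # so that an existing user-provided value is left untouched.
    print("SPARK_LOCAL_IP =", ensure_spark_local_ip(os.environ))
```

Using `setdefault` keeps any value already exported in the shell, so the script does not override a deliberate configuration.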

To run with Scala, a different Spark version (3.1.2, per the Scala Versions list, instead of 3.3.1) is used to build the .jar, so that the versions are compatible for Scala development. Hence, every time a job is submitted to Spark, the ./run.sh script first sets the correct environment variables. The project is set up in IntelliJ IDEA with the versions described above. For additional references on how to set up the environment in IntelliJ, refer to the following links:
- Setup
- Config
- Make sure to set up the following:
  - Libraries with spark jars and scala compiler.
  - Project with correct JDK.
  - Artifacts from module with dependencies with correct main class.
