The purpose of this homework is to get familiar with the Vocareum Lab infrastructure and the environment in which the project submissions will run.
- Set Java version to 1.8 using:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
- Then prepend it to the PATH environment variable:
export PATH=$JAVA_HOME/bin:$PATH
- Set the PySpark Python version to 3.6 by entering:
export PYSPARK_PYTHON=python3.6
- Upload your code files to the work section of the workspace area.
- Finally, run your script (word_count.py in this example) using the following command:
python word_count.py
- If that doesn't run, try:
/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit --executor-memory 4G --driver-memory 4G word_count.py
- Alternatively, execute:
./run.sh
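The steps above can be collected into one script. Below is a minimal sketch of what such a run.sh might contain, assuming the Vocareum paths shown above (the actual run.sh provided with the assignment may differ):

```shell
#!/bin/bash
# Hypothetical run.sh sketch: set up the environment, then submit the
# script through spark-submit. Paths match the Vocareum setup above.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
export PYSPARK_PYTHON=python3.6

SPARK_SUBMIT=/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit
if [ -x "$SPARK_SUBMIT" ]; then
    # Same memory settings as the spark-submit command above
    "$SPARK_SUBMIT" --executor-memory 4G --driver-memory 4G word_count.py
else
    echo "spark-submit not found at $SPARK_SUBMIT" >&2
fi
```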
Local demo run example:
- Install JDK.
- Install Spark with Hadoop for macOS, and set the SPARK_HOME and HADOOP_HOME environment variables to the Spark root directory and the Hadoop root directory respectively.
- Install Python 3.6 (ref. here for Macs with M-series chips), though newer versions may work.
- Alternatively, refer to this video.
- Run the program using:
python word_count.py
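The environment variables from the install steps above might be set in ~/.zshrc roughly as follows. The install location /Library/spark-3.3.1-bin-hadoop3 is an assumption; when Hadoop is not installed separately, HADOOP_HOME is often pointed at the same Spark-with-Hadoop bundle:

```shell
# Hypothetical ~/.zshrc additions for the local macOS setup described above.
# The install location is an assumption; adjust to where Spark was unpacked.
export SPARK_HOME=/Library/spark-3.3.1-bin-hadoop3
export HADOOP_HOME=$SPARK_HOME    # reuse the Spark-with-Hadoop bundle
export PATH=$SPARK_HOME/bin:$PATH
```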
Versions used for the local run:
JDK - 1.8 (jdk1.8.0_361.jdk)
SPARK - 3.3.1 (spark-3.3.1-bin-hadoop3)
python - 3.6 (newer 3.x versions may also work, but some syntax might not run on Vocareum)
pyspark - 3.3.1
./run.sh --class org.rpatel.dsci553_assignments.WordCount ./scala-hw-0.jar ../text.txt
Replace the spark-submit arguments and the .jar file path accordingly.
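A sketch of what a wrapper like run.sh might do for the Scala submission: set the environment, then forward its arguments to spark-submit. The Spark location below is an assumption; the class name and .jar path come from the command above:

```shell
#!/bin/bash
# Hypothetical run.sh sketch for the Scala job. The Spark path is assumed
# to match the Vocareum install used elsewhere in this document.
export SPARK_HOME=/opt/spark/spark-3.1.2-bin-hadoop3.2
export PATH=$SPARK_HOME/bin:$PATH

if [ -x "$SPARK_HOME/bin/spark-submit" ]; then
    # Forward all arguments, e.g.
    # ./run.sh --class org.rpatel.dsci553_assignments.WordCount ./scala-hw-0.jar ../text.txt
    "$SPARK_HOME/bin/spark-submit" --executor-memory 4G --driver-memory 4G "$@"
else
    echo "spark-submit not found under $SPARK_HOME" >&2
fi
```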
Local demo run example:
./run-local.sh --class org.rpatel.dsci553_assignments.WordCount ./out/artifacts/scala_hw_0_jar/scala-hw-0.jar ../text.txt
Replace the spark-submit arguments and the .jar file path accordingly.
Versions used for the local Scala run:
JDK - 1.8 (jdk1.8.0_361.jdk)
SPARK - 3.1.2 (spark-3.1.2-bin-hadoop3.2)
scala - 2.12.17 (sbt: org.scala-lang:scala-library:2.12.17:jar)
Troubleshooting:
- Path not configured properly for Spark or Hadoop - add the following to ~/.zshrc or ~/.zprofile, depending on whichever is used:
# Set env vars for spark and hadoop
export SPARK_HOME=/Library/spark-3.3.1-bin-hadoop3
export PYTHONPATH=/Library/spark-3.3.1-bin-hadoop3/python
- JDK is not available - add the following to ~/.zshrc or ~/.zprofile, depending on whichever is used:
# Set env var for Java
export JAVA_HOME=/Library/Java/JavaVirtualMachines/liberica-jdk-8.jdk
or use the path of any other JDK distribution if multiple of them are installed. (Note that on macOS, JAVA_HOME usually needs to point at the Contents/Home directory inside the .jdk bundle.)
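On macOS, the built-in /usr/libexec/java_home helper can resolve the correct JAVA_HOME (including the Contents/Home directory inside the .jdk bundle) instead of hard-coding a path; a guarded sketch:

```shell
# Resolve JAVA_HOME via macOS's java_home helper; the -v 1.8 filter
# selects a Java 1.8 JDK. Prints a notice on non-macOS systems.
if [ -x /usr/libexec/java_home ]; then
    export JAVA_HOME="$(/usr/libexec/java_home -v 1.8)"
    echo "JAVA_HOME=$JAVA_HOME"
else
    echo "java_home helper not available (not macOS?)"
fi
```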
- PORT is unassigned (Can't assign requested address: Service 'sparkDriver' failed after 16 retries) - manually set SPARK_LOCAL_IP in /Library/spark-3.3.1-bin-hadoop3/bin/load-spark-env.sh using:
export SPARK_LOCAL_IP="127.0.0.1"
- For running with scala, a completely different version is used for building the .jar. This is done to make the versions compatible for Scala development. Hence, every time a job is submitted to Spark, the correct environment variables are first set in the ./run.sh script. The project is set up in IntelliJ IDEA with the versions described above. For additional references on how to set up the environment in IntelliJ, refer to the following links: