This is a Spark application for generating TeraSort data and running TeraSort benchmarks. It is a variant of the TeraSort program listed on http://sortbenchmark.org/.
This is a fork and rework of ehiggs/spark-terasort.
To build:
build/mvn clean package -DskipTests -Pscala-2.12 -Dspark.version=3.3.2
To generate data with TeraGen:
$SPARK_HOME/bin/spark-submit \
--class io.github.pan3793.spark.terasort.TeraGen \
spark-terasort_2.12-2.0.0.jar \
<size> <generate_data_path>
or
$SPARK_HOME/bin/spark-submit \
--class io.github.pan3793.spark.terasort.TeraGen \
--packages io.github.pan3793:spark-terasort_2.12:2.0.0 \
spark-internal \
<size> <generate_data_path>
To sort the generated data with TeraSort:
$SPARK_HOME/bin/spark-submit \
--class io.github.pan3793.spark.terasort.TeraSort \
spark-terasort_2.12-2.0.0.jar \
<input_data_path> <output_data_path>
or
$SPARK_HOME/bin/spark-submit \
--class io.github.pan3793.spark.terasort.TeraSort \
--packages io.github.pan3793:spark-terasort_2.12:2.0.0 \
spark-internal \
<input_data_path> <output_data_path>
To validate the sorted output with TeraValidate:
$SPARK_HOME/bin/spark-submit \
--class io.github.pan3793.spark.terasort.TeraValidate \
spark-terasort_2.12-2.0.0.jar \
<output_data_path>
or
$SPARK_HOME/bin/spark-submit \
--class io.github.pan3793.spark.terasort.TeraValidate \
--packages io.github.pan3793:spark-terasort_2.12:2.0.0 \
spark-internal \
<output_data_path>
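The three commands above are normally run in sequence: generate, sort, then validate. A minimal sketch of chaining them with the locally built jar might look like the following. The helper names `run_step` and `run_pipeline`, the `DRY_RUN` switch, and the `JAR`/`SPARK_SUBMIT` variables are illustrative assumptions, not part of this project; the class names and jar name are taken from the commands above.

```shell
# Sketch only: chain TeraGen, TeraSort and TeraValidate with the local jar.
# JAR, SPARK_SUBMIT, DRY_RUN, run_step and run_pipeline are hypothetical names.
JAR="${JAR:-spark-terasort_2.12-2.0.0.jar}"
PKG="io.github.pan3793.spark.terasort"
SPARK_SUBMIT="${SPARK_SUBMIT:-$SPARK_HOME/bin/spark-submit}"

# Print the spark-submit command for one step, then run it unless DRY_RUN=1.
run_step() {
  class="$1"; shift
  echo "$SPARK_SUBMIT --class $PKG.$class $JAR $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$SPARK_SUBMIT" --class "$PKG.$class" "$JAR" "$@"
  fi
}

# Arguments: <size> <generate_data_path> <output_data_path>.
# The && chain stops at the first step that fails.
run_pipeline() {
  run_step TeraGen "$1" "$2" &&
    run_step TeraSort "$2" "$3" &&
    run_step TeraValidate "$3"
}

# When called as a script with exactly three arguments, run the pipeline.
if [ "$#" -eq 3 ]; then
  run_pipeline "$@"
fi
```

Setting `DRY_RUN=1` only prints the three commands, which is a cheap way to check the paths before submitting anything to a cluster. Note that the TeraGen output path doubles as the TeraSort input path, and the TeraSort output path is what TeraValidate reads.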
PRs are very welcome!