## Alluxio with Spark and Minio Demo
This is a demonstration of a data analytics stack that uses Alluxio to integrate Minio for storage and Spark for processing, driven from a Jupyter notebook with BeakerX.

## Requirements
- Alluxio 2.0.0-preview
- Spark 2.3.3 with Hadoop 2.7.3
- Minio 2019-04-23
- JupyterLab 0.35.5
- BeakerX 1.4.1

## Install BeakerX using the Scala kernel

In [None]:
import sys.process._
"conda install -y maven beakerx"!

## Setup Classpath
- The Alluxio Maven POM has hadoop-common as a hard dependency. This conflicts with the Hadoop jars on the Spark worker side. The workaround (below) is to manually construct a classpath that excludes the Hadoop classes.
- In addition, Alluxio uses later versions of Protobuf and Guava. The Netty jars are only needed for this Jupyter notebook.

In [1]:
%%classpath add mvn 
com.google.protobuf protobuf-java 3.7.1
com.google.guava guava 20.0
io.netty netty-all 4.1.17.Final
org.apache.hadoop hadoop-common 2.7.3
org.apache.hadoop hadoop-client 2.7.3
org.apache.spark spark-sql_2.11 2.3.3
org.alluxio alluxio-core-client-runtime 2.0.0-preview

## Be sure not to include the Hadoop jars in the classpath

In [2]:
import java.io._
val jars = ClasspathManager.getJars().toArray
    .filter(x => 
            x.toString.contains("/guava") || 
            x.toString.contains("/protobuf")|| 
            x.toString.contains("alluxio")||
            x.toString.contains("/grpc") ||
            x.toString.contains("/opencensus")
           )
    .mkString(",")
new PrintWriter(new File("jars.txt" )){write(jars); close()}

$line25.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anon$1@41fa4465
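
The filter in the cell above can be factored into a small predicate, which makes it easier to check outside the notebook which jars end up in `jars.txt`. This is a minimal sketch, not part of the original notebook: `wantedMarkers`, `isWanted`, and the sample paths are hypothetical names and values; only the five substrings come from the filter cell.

```scala
// Substrings copied from the filter cell; everything else here is illustrative.
val wantedMarkers = Seq("/guava", "/protobuf", "alluxio", "/grpc", "/opencensus")

// True when a jar path contains at least one of the markers.
def isWanted(path: String): Boolean =
  wantedMarkers.exists(marker => path.contains(marker))

// Hypothetical paths standing in for the output of ClasspathManager.getJars().
val sampleJars = Seq(
  "/tmp/mvn/guava-20.0.jar",
  "/tmp/mvn/hadoop-common-2.7.3.jar",
  "/tmp/mvn/alluxio-core-client-runtime-2.0.0-preview.jar",
  "/tmp/mvn/netty-all-4.1.17.Final.jar"
)

// Same shape as the notebook cell: keep matching jars, join with commas.
val selected = sampleJars.filter(isWanted).mkString(",")
println(selected)
```

With these sample paths, the Hadoop and Netty jars are dropped and only the Guava and Alluxio jars remain in the comma-separated list.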

## DNS is not always available. Using an IP address is safer.

In [3]:
import sys.process._
val ip = "hostname -i"!!

250.2.146.2
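
The `!!` operator returns the command's standard output as a `String`, including the trailing newline, and `hostname -i` can print more than one address on hosts with several interfaces. A minimal sketch of cleaning the value before passing it to Spark follows; `firstAddress` and `sample` are hypothetical names, and the sample string simply mirrors the output shown above.

```scala
// Hypothetical helper: strip surrounding whitespace and keep the first
// whitespace-separated token of the command output.
def firstAddress(raw: String): String =
  raw.trim.split("\\s+").headOption.getOrElse("")

// Stand-in for the result of `"hostname -i".!!` in this notebook.
val sample = "250.2.146.2\n"

val driverHost = firstAddress(sample)
println(driverHost)
```

The cleaned value could then be used in place of the hardcoded address for `spark.driver.host`, assuming the first address is the one reachable from the Spark workers.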


## Create the Spark Session
- Note that the classpath (from above) is loaded and placed at the front of the classpath on the Spark workers.
- Be sure to set the master URL and driver IP for your Spark cluster.

In [4]:
%%spark --noUI
import org.apache.spark.sql.SparkSession
import scala.io.Source
val cp = Source.fromFile("jars.txt").getLines.mkString
val spark = SparkSession.builder()
    .appName("Simple Application")
    .master("spark://spark-master.spark.svc.cluster.local:7077")
    .config("spark.driver.host", "250.2.146.2")
    .config("spark.driver.userClassPathFirst", "true")
    .config("spark.executor.userClassPathFirst", "true")
    .config("spark.jars", cp)

SparkSession is available by 'spark'


## The Spark cluster is not hardcoded to work with Alluxio, so we dynamically configure the classpath (above) and register the Alluxio Hadoop file system here.

In [5]:
spark.sparkContext.hadoopConfiguration.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem")

## This is just a sample file I have uploaded to Minio.  You can use any file.

In [6]:
val textFile = spark.sparkContext.textFile("alluxio://alluxio-master.alluxio.svc.cluster.local:19998/TitanicPassengersTrainData.csv")

alluxio://alluxio-master.alluxio.svc.cluster.local:19998/TitanicPassengersTrainData.csv MapPartitionsRDD[1] at textFile at <console>:108
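
The path passed to `textFile` is an ordinary URI: the `alluxio` scheme, the Alluxio master host and port, and the file's path inside the Alluxio namespace. As a rough sketch, it can be assembled from its parts with `java.net.URI` so the host or file can be swapped without editing one long string; `masterHost`, `masterPort`, and `filePath` are hypothetical names, with values copied from the cell above.

```scala
import java.net.URI

// Values copied from the notebook cell; the variable names are illustrative.
val masterHost = "alluxio-master.alluxio.svc.cluster.local"
val masterPort = 19998
val filePath   = "/TitanicPassengersTrainData.csv"

// URI(scheme, userInfo, host, port, path, query, fragment)
val alluxioUri = new URI("alluxio", null, masterHost, masterPort, filePath, null, null)
println(alluxioUri.toString)
```

The resulting string can be handed to `spark.sparkContext.textFile` in place of the literal.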

## If everything works, you should see the correct count.

In [7]:
textFile.count()

891

## Clean up

In [8]:
spark.close()