## CS 297.2 Big Data Processing: Spark SQL and SparkML

Datasets used for this notebook can be found here: https://drive.google.com/drive/folders/1qYg9SXcc9minIErchqWXR9Rq4CYj7dqf

wyu@ateneo.edu

---

### Installing Spark on the machine

Once you have installed Java, the next steps should be similar. You will likely want to put the Spark application folder wherever you put your user-installed applications.

In [None]:
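# Remove any previous download (-f avoids an error when nothing exists yet)
!rm -rf spark*
# Download and unpack Spark 3.5.1 prebuilt for Hadoop 3
!wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
!ls
!tar xvf ./spark-3.5.1-bin-hadoop3.tgz > /dev/null 2>/dev/null
!ls
# findspark lets Python locate the Spark installation
!pip install -q findspark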

In [None]:
# Set environment variables
import os
os.environ["SPARK_HOME"] = "/content/spark-3.5.1-bin-hadoop3/"
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
!update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/jre/bin/java
!java -version
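
If later cells fail to import Spark, a quick sanity check of the two environment variables can help. The following is a minimal sketch in plain Python; the helper name `missing_spark_settings` is hypothetical and not part of Spark or findspark.

```python
import os

def missing_spark_settings(env=None):
    """Return the names of required variables that are unset or point to a missing directory."""
    env = os.environ if env is None else env
    problems = []
    for name in ("SPARK_HOME", "JAVA_HOME"):
        path = env.get(name)
        # A variable is a problem if it is absent, empty, or not an existing directory
        if not path or not os.path.isdir(path):
            problems.append(name)
    return problems

print("Needs attention:", missing_spark_settings())
```

An empty list means both paths exist; it does not guarantee that the directories contain a working Spark or Java installation.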

____
## Sample 1: Spark SQL examples

Using SQL in Spark

In [None]:
# import new dataset for plotting (for this activity it should be irdata-v3.csv.gz)
from google.colab import files
uploaded = files.upload()

In [None]:
# If Spark is installed and SPARK_HOME is set, this will find the spark installation so spark libraries can be imported.
# findspark is necessary if you want to use Spark in the IDE of your choice.
import findspark
findspark.init()

# Imports the basic spark functions needed
from pyspark import SparkConf, SparkContext
from operator import add
import io

# Sets the Spark configuration. The AppName is arbitrary, but setting the master to local
# specifies that the application is not running on a distributed system
conf = SparkConf().setMaster("local").setAppName("SparkQLExample")
sc = SparkContext.getOrCreate(conf = conf)

In [None]:
# check if the context is available. This might require some waiting
sc

In [None]:
# import necessary libraries
from pyspark import SQLContext

In [None]:
# create spark sql context
sqlContext = SQLContext(sc)

In [None]:
# read data from filesystem
contraceptionData = sqlContext.read.csv('irdata-v3.csv.gz', header = True, inferSchema = True).cache()

In [None]:
# did we get the data?
contraceptionData.printSchema()

In [None]:
contraceptionData.toPandas().info()

In [None]:
# is my data clean?
contraceptionData.toPandas()["Used method in last 12 months"].value_counts()

In [None]:
# let us start cleaning it.
# replace empty strings with "-1" (this only affects string columns)
from pyspark.sql.functions import col
contraceptionDataStringReplaced = contraceptionData.na.replace("", "-1")

# cast a string column to float
contraceptionDataCasted = contraceptionDataStringReplaced.withColumn("Religion", col("Religion").cast("float"))

# use 2 to mean no answer since the classification model only accepts labels from 0,1,9
notNull = contraceptionDataCasted.fillna({ 'Used method in last 12 months':2 })

# Other features will be relabeled to -1
df = notNull.fillna(-1)

In [None]:
# look at cleaned version
df.toPandas().info()
df.toPandas()["Used method in last 12 months"].value_counts()

In [None]:
# show first few cleaned up rows
df.toPandas()

In [None]:
# select contents of one column
df.select("Land Owner").show()

In [None]:
# filter data
df.filter(df.Age > 30).show()

In [None]:
# groupby religion
df.select("Age","Residence Type","Religion").groupby("Religion").count().show()

In [None]:
# groupby method
df.select("Age","Residence Type","Used method in last 12 months").groupby("Used method in last 12 months").count().show()

In [None]:
# groupby religion, counting only respondents older than 30
df.select("Age","Residence Type","Religion").filter(df.Age > 30).groupby("Religion").count().show()

In [None]:
# query as SQL
# registerTempTable is deprecated; createOrReplaceTempView is its replacement
df.createOrReplaceTempView("loans")
sqlContext.sql("SELECT Religion, count(*) FROM loans WHERE Age > 30 GROUP BY Religion").show()
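
The DataFrame query (filter, then group and count) and the SQL query express the same computation. The sketch below illustrates that equivalence on a handful of made-up rows, using Python's built-in `sqlite3` in place of Spark SQL so that it runs without a Spark context. The rows are hypothetical; only the column names `Age` and `Religion` and the table name `loans` come from the cells above.

```python
import sqlite3
from collections import Counter

# Hypothetical (Age, Religion) rows, for illustration only
rows = [(25, 1.0), (35, 1.0), (40, 2.0), (31, 1.0), (29, 2.0), (50, 3.0)]

# DataFrame-style: keep rows with Age > 30, then count per Religion
dataframe_style = Counter(religion for age, religion in rows if age > 30)

# SQL-style: the same query text as above, run against an in-memory SQLite table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (Age INTEGER, Religion REAL)")
conn.executemany("INSERT INTO loans VALUES (?, ?)", rows)
sql_style = dict(
    conn.execute(
        "SELECT Religion, count(*) FROM loans WHERE Age > 30 GROUP BY Religion"
    ).fetchall()
)
conn.close()

print(sorted(dataframe_style.items()))  # [(1.0, 2), (2.0, 1), (3.0, 1)]
print(sorted(sql_style.items()))        # [(1.0, 2), (2.0, 1), (3.0, 1)]
```

SQLite and Spark SQL differ in many details, so treat this only as a check on what the query means, not on how Spark executes it.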


---
## Exercise: Studying input file details in SQL
In this activity, you will process the cc.csv file using Spark SQL.

1.   Load the file cc.csv in the SampleData directory
2.   Create a query using both the DataFrame API and the SQL mechanism to show the number of entries by gender, but only count the TRUE entries with amounts greater than $1000
3.   What do you think the cc.csv file represents?



In [None]:
# show query using data frame mechanism

In [None]:
# show query using SQL mechanism

____
## Sample 2: Let's jump to Machine Learning

Using ML in Spark

In [None]:
# ML stuff
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

In [None]:
# recap on what is in the schema
df.toPandas().info()

In [None]:
# split data
train, test = df.randomSplit([0.7, 0.3], seed=30)
train.count()

In [None]:
test.count()
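
The two counts will usually not be exactly 70% and 30% of the data, because `randomSplit` samples rows at random rather than cutting at a fixed position. The sketch below mimics that idea in plain Python by assigning each row independently with probability 0.7. It is an illustration of the concept only, not Spark's actual sampling code, and `bernoulli_split` is a hypothetical helper.

```python
import random

def bernoulli_split(rows, train_fraction, seed):
    """Assign each row to train with probability train_fraction, otherwise to test."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train_part, test_part = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train_part.append(row)
        else:
            test_part.append(row)
    return train_part, test_part

rows = list(range(10_000))
train_rows, test_rows = bernoulli_split(rows, 0.7, seed=30)
# Sizes are close to 7000 and 3000 but generally not exact
print(len(train_rows), len(test_rows))
```

Reusing the same seed gives the same split each time, which is why the notebook passes `seed=30`.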

In [None]:
# assemble ML pipeline using LR
assembler = VectorAssembler(
    inputCols=['Age', 'Residence Type', 'Religion', 'Educ in single years', 'Encouraged FP', 'HH Head', 'Land Owner', 'Earns more', 'DM Contraception', 'Depression/ anxiety', 'Total CEB', 'Number of SC'],
    outputCol="features")
lr = LogisticRegression(featuresCol = 'features', labelCol = 'Used method in last 12 months', maxIter=10)
classificationPipeline = Pipeline(stages=[assembler, lr])
classificationPipelineModel = classificationPipeline.fit(train)
classified = classificationPipelineModel.transform(train)

# create ParamGrid for Cross Validation
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.25, 0.5, 1.0, 2.0])
             .addGrid(lr.elasticNetParam, [0.0, 0.25, 0.5, 0.75, 1.0])
             .addGrid(lr.maxIter, [1, 5, 10, 15, 20])
             .build())
prediction = classificationPipelineModel.transform(test)
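
The parameter grid grows multiplicatively, which explains most of the running time of cross validation. A quick count in plain Python, using the same values as the grid above and assuming the `numFolds=5` setting passed to `CrossValidator`:

```python
from itertools import product

# Same candidate values as in the ParamGridBuilder above
reg_params = [0.01, 0.25, 0.5, 1.0, 2.0]
elastic_net_params = [0.0, 0.25, 0.5, 0.75, 1.0]
max_iters = [1, 5, 10, 15, 20]
num_folds = 5  # assumed to match numFolds in the CrossValidator call

# Every combination of the three parameters is one candidate model
combinations = list(product(reg_params, elastic_net_params, max_iters))
# Each candidate is trained once per fold
fits_during_cv = len(combinations) * num_folds

print(len(combinations), fits_during_cv)  # 125 625
```

After the search, `CrossValidator` also refits the best candidate on the full training set, so the total is one more fit than the figure printed here. Trimming any one list shrinks the search proportionally.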

In [None]:
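# evaluate the model trained with default parameters, then start cross-validated training!
# note: BinaryClassificationEvaluator reports area under the ROC curve (areaUnderROC) by default, not accuracy
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol = "Used method in last 12 months")
print('Area under ROC before Cross Validation ', evaluator.evaluate(prediction))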

cv = CrossValidator(estimator=classificationPipeline, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

cvModel = cv.fit(train)

In [None]:
# evaluate model after cross validation
# (the evaluator's default metric is area under the ROC curve, not accuracy)
predictions_cvModel = cvModel.transform(test)

print('Area under ROC after Cross Validation: ', evaluator.evaluate(predictions_cvModel))
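
`BinaryClassificationEvaluator` defaults to area under the ROC curve, which can differ noticeably from accuracy. The sketch below shows the gap in plain Python on made-up labels, with no Spark involved and hypothetical helper names. It assumes binary 0/1 labels and hard 0/1 predictions, in which case the trapezoidal area under the ROC curve reduces to the average of the true positive rate and the true negative rate.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def auc_from_hard_predictions(labels, predictions):
    """Area under the ROC curve when the score is itself a hard 0/1 prediction."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    true_neg = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    # The ROC curve has a single interior point (FPR, TPR); its area is (TPR + TNR) / 2
    return 0.5 * (true_pos / positives + true_neg / negatives)

# Hypothetical imbalanced data: 8 negatives, 2 positives
labels = [0] * 8 + [1] * 2
always_zero = [0] * 10  # a model that never predicts the positive class

print(accuracy(labels, always_zero))                   # 0.8
print(auc_from_hard_predictions(labels, always_zero))  # 0.5
```

If accuracy is the number you want to report, `MulticlassClassificationEvaluator` with `metricName="accuracy"` is one option to consider.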

---
## Exercise: This is so slow, why not run it in the cloud?
Convert the SparkML script above into an EMR/Dataproc/HDInsight Spark job
