## Data Processing Using Apache Spark on OpenShift

This notebook server runs on the OpenShift platform, which provides a dedicated notebook server for each user. The platform provisions the cluster resources, including storage.

### Install and import required libraries, watermark

In [None]:
# %pip install watermark
# %pip install Minio
# %pip install pyspark
# %pip install matplotlib

# import json
# import watermark
# from minio import Minio

# %matplotlib inline
# %load_ext watermark
# %watermark -n -v -m -g -iv

#os.environ['S3_ENDPOINT'] = "http://minio-ml-workshop:9000"

### Connect to the Spark cluster provided by the OpenShift platform
Using the provided `spark_util` library, create a Spark session that connects to a Spark cluster dedicated to this notebook. You can pass additional Spark submit arguments in the second argument of `spark_util.getOrCreateSparkSession()`, for example to add packages or override configuration items.

In [None]:
import os
import spark_util

s3_config = f"--conf spark.hadoop.fs.s3a.endpoint={os.environ['S3_ENDPOINT']} \
--conf spark.hadoop.fs.s3a.access.key=minio \
--conf spark.hadoop.fs.s3a.secret.key=minio123 \
--conf spark.hadoop.fs.s3a.path.style.access=true \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem"

spark = spark_util.getOrCreateSparkSession("CustomerChurn", s3_config, "INFO")
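
The configuration string above hard-codes the MinIO credentials. As a minimal sketch, the same string could be assembled from environment variables instead. The helper name `build_s3_config` and the variables `S3_ACCESS_KEY` and `S3_SECRET_KEY` are hypothetical and not part of `spark_util`. The defaults mirror the workshop values used in the cell above.

```python
import os


def build_s3_config(env=os.environ):
    """Assemble the Spark --conf arguments for S3A access from a mapping of settings."""
    settings = {
        "spark.hadoop.fs.s3a.endpoint": env["S3_ENDPOINT"],
        # S3_ACCESS_KEY and S3_SECRET_KEY are assumed names; fall back to the workshop defaults.
        "spark.hadoop.fs.s3a.access.key": env.get("S3_ACCESS_KEY", "minio"),
        "spark.hadoop.fs.s3a.secret.key": env.get("S3_SECRET_KEY", "minio123"),
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }
    # Produces "--conf key=value --conf key=value ...", the same layout as the f-string above.
    return " ".join(f"--conf {key}={value}" for key, value in settings.items())


# A plain dict stands in for os.environ so the sketch runs anywhere.
example_config = build_s3_config({"S3_ENDPOINT": "http://minio-ml-workshop:9000"})
print(example_config)
```

The resulting string can be passed as the second argument of `spark_util.getOrCreateSparkSession()` in place of `s3_config`.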

### Create dataframes from CSV files

Using Spark, read the CSV files from S3 storage and load them as Spark dataframes.

In [None]:
dataFrame_Customer = spark.read \
                .options(delimiter=',', inferSchema='True', header='True') \
                .csv("s3a://rawdata/Customer-Churn_P1.csv")
dataFrame_Customer.printSchema()

dataFrame_Products = spark.read \
                .options(delimiter=',', inferSchema='True', header='True') \
                .csv("s3a://rawdata/Customer-Churn_P2.csv")
dataFrame_Products.printSchema()

### Load from Kafka
You can also read data from a Kafka topic and create a Spark dataframe from it.

In [None]:
# from pyspark.sql.types import *
# from  pyspark.sql.functions import *

# srcKafkaBrokers = "odh-message-bus-kafka-bootstrap:9092"
# srcKakaTopic = "datatelco"



# schema = StructType()\
#     .add("customerID", IntegerType())\
#     .add("PhoneService", StringType())\
#     .add("MultipleLines", StringType())\
#     .add("InternetService", StringType())\
#     .add("OnlineSecurity", StringType())\
#     .add("OnlineBackup", StringType())\
#     .add("DeviceProtection", StringType())\
#     .add("TechSupport", StringType())\
#     .add("StreamingTV", StringType())\
#     .add("StreamingMovies", StringType())\
#     .add("Contract", StringType())\
#     .add("PaperlessBilling", StringType())\
#     .add("PaymentMethod", StringType())\
#     .add("MonthlyCharges", StringType())\
#     .add("TotalCharges", DoubleType())\
#     .add("Churn", StringType())



# #Read from JSON Kafka messages into a dataframe
# dfKafka = spark.read.format("kafka")\
#     .option("kafka.bootstrap.servers", srcKafkaBrokers)\
#     .option("subscribe", srcKakaTopic)\
#     .option("startingOffsets", "earliest")\
#     .load()\
#     .withColumn("value", regexp_replace(col("value").cast("string"), "\\\\", "")) \
#     .withColumn("value", regexp_replace(col("value"), "^\"|\"$", "")) \
#     .selectExpr("CAST(value AS STRING) as jsonValue")\
#     .rdd.map(lambda row: row["jsonValue"])

# dataFrame_Products = spark.read.schema(schema).json(dfKafka)
# dataFrame_Products.printSchema()
# dataFrame_Products.show(n=2)
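
The two `regexp_replace` steps in the commented cell above remove every backslash from the message value and then strip one leading and one trailing double quote. The plain-Python sketch below mirrors those two steps on a single message. It assumes the producer JSON-encodes each record twice, which the notebook does not state. The function name `clean_kafka_value` and the sample record are hypothetical.

```python
import json
import re


def clean_kafka_value(raw):
    """Mirror the two regexp_replace steps: drop backslashes, then the outer quotes."""
    without_escapes = re.sub(r"\\", "", raw)
    return re.sub(r'^"|"$', "", without_escapes)


# Hypothetical message: a JSON document that was JSON-encoded a second time,
# so it arrives wrapped in quotes with every inner quote escaped.
raw_value = json.dumps(json.dumps({"customerID": 42, "Churn": "No"}))

record = json.loads(clean_kafka_value(raw_value))
print(record)
```

Removing all backslashes is a blunt approach. It would also alter field values that legitimately contain escape sequences, so it only suits messages whose values are plain text.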


### Join dataframes
Perform a full outer join on the two dataframes, using `customerID` as the key.

In [None]:

dataFrom_All = dataFrame_Customer.join(dataFrame_Products, "customerID", how="full")
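
A full outer join keeps every `customerID` that appears in either input and fills the columns of the missing side with nulls. The plain-Python sketch below shows that behavior on a few hypothetical rows. It is an illustration of the join semantics only, not how Spark implements the join. It assumes each key appears at most once per side, and it sorts the keys purely to make the output deterministic.

```python
def full_outer_join(left, right, key):
    """Join two lists of dict rows on `key`, keeping unmatched rows from both sides."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}

    # Collect every column name so unmatched rows can be padded with None.
    columns = []
    for row in left + right:
        for name in row:
            if name not in columns:
                columns.append(name)

    joined = []
    for value in sorted(set(left_by_key) | set(right_by_key)):
        merged = dict.fromkeys(columns)  # every column starts as None
        merged.update(left_by_key.get(value, {}))
        merged.update(right_by_key.get(value, {}))
        joined.append(merged)
    return joined


# Hypothetical sample rows; the real files contain many more columns.
customers = [{"customerID": 1, "gender": "Female"}, {"customerID": 2, "gender": "Male"}]
products = [{"customerID": 2, "Contract": "One year"}, {"customerID": 3, "Contract": "Two year"}]

joined_rows = full_outer_join(customers, products, "customerID")
for row in joined_rows:
    print(row)
```

Customer 1 has no product row and customer 3 has no customer row, so each appears in the result with `None` in the columns from the side it is missing.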


### Push the prepared data to the object storage and stop the Spark application
Write the joined dataframe to an S3 bucket. Because this is the last step of the data preparation, the Spark cluster is no longer needed afterwards. Stopping the Spark context removes the Spark application from the cluster.

<span style="color:red">Note: Change the value of user_id to your assigned username (a value in the range user1 ... user30)</span>.

In [None]:
user_id = "user29"
file_location = "s3a://data/full_data_csv" + user_id
dataFrom_All.repartition(1).write.mode("overwrite")\
    .option("header", "true")\
    .format("csv").save(file_location)
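
The cell above builds the output path by appending `user_id` directly to the prefix, so `user29` yields `s3a://data/full_data_csvuser29`. As a minimal sketch, the username could be checked against the range given in the note before writing. The helper name `build_file_location` and the validation itself are assumptions and not part of the workshop code.

```python
import re


def build_file_location(user_id, base="s3a://data/full_data_csv"):
    """Return the output path for a workshop user, rejecting names outside user1..user30."""
    match = re.fullmatch(r"user([1-9]\d?)", user_id)
    if match is None or not 1 <= int(match.group(1)) <= 30:
        raise ValueError(f"expected a username from user1 to user30, got {user_id!r}")
    # Same concatenation as the cell above: no separator between prefix and username.
    return base + user_id


example_location = build_file_location("user29")
print(example_location)
```

Failing early on a mistyped username avoids overwriting another participant's output or writing to an unexpected location.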

In [None]:
spark.stop()