## Data Processing Using Apache Spark on OpenShift

This notebook server is hosted on the OpenShift platform, which provides a dedicated notebook server for each user. The platform provisions the cluster resources, including storage.

### Manually set variables in notebook mode

In [4]:
# Uncomment to set the endpoint manually when it is not already defined in the environment.
# import os
# os.environ['S3_ENDPOINT_URL'] = "http://minio-ml-workshop:9000"

###  Prepare S3 URL
Define a function that converts the S3 endpoint URL into one that works with MinIO by replacing the host name with its IP address.

In [5]:
import os, socket
from urllib.parse import urlparse

# Get the S3 URL information and use it in Spark Context
# NOTE: S3 Hadoop API for spark does not work with domain name, use IP address instead
def domain_to_ip(url):
    domain = urlparse(url).netloc.split(":")[0]
    ip_address = socket.gethostbyname(domain)
    ip_url = url.replace(domain, ip_address)
    return ip_url
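
One caveat with `domain_to_ip` is that `str.replace` substitutes every occurrence of the host name, so a path or bucket name that happens to contain it would be rewritten as well. A minimal sketch of a variant that rewrites only the network location follows. The name `domain_to_ip_netloc` and the injectable `resolve` parameter are assumptions introduced for illustration and are not part of the original notebook.

```python
import socket
from urllib.parse import urlparse, urlunparse


def domain_to_ip_netloc(url, resolve=socket.gethostbyname):
    """Return `url` with only its host name replaced by the resolved IP address.

    `resolve` is a hypothetical hook (defaulting to a real DNS lookup) that
    makes the function easy to try without network access.
    Note: any user:password@ part of the URL is dropped in this sketch.
    """
    parts = urlparse(url)
    ip_address = resolve(parts.hostname)
    # Rebuild the network location, keeping the port when one was given.
    netloc = ip_address if parts.port is None else f"{ip_address}:{parts.port}"
    return urlunparse(parts._replace(netloc=netloc))


# Example with a fake resolver; the IP address here is made up.
print(domain_to_ip_netloc("http://minio:9000/minio-data", resolve=lambda host: "10.0.0.5"))
# → http://10.0.0.5:9000/minio-data
```

In the notebook it could be called as `domain_to_ip_netloc(os.environ['S3_ENDPOINT_URL'])`, which performs a real lookup.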

###  Connect to Spark Cluster provided by OpenShift Platform
Using the provided spark_util library, create a Spark session that connects to the Spark cluster dedicated to this notebook. The second argument of spark_util.getOrCreateSparkSession() accepts additional Spark submit arguments, for example to add packages or override configuration items.

In [6]:
import spark_util

submit_args = f"--conf spark.hadoop.fs.s3a.endpoint={domain_to_ip(os.environ['S3_ENDPOINT_URL'])} \
--conf spark.hadoop.fs.s3a.access.key=minio \
--conf spark.hadoop.fs.s3a.secret.key=minio123 \
--conf spark.hadoop.fs.s3a.path.style.access=true \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.multipart.size=104857600 \
--conf spark.jars.ivy=/tmp \
--packages org.apache.hadoop:hadoop-aws:3.2.0"
# ,com.amazonaws:aws-java-sdk:1.11.968"

spark = spark_util.getOrCreateSparkSession("ANZ Join Tables", submit_args)

Initializing environment variables for Spark
Cluter name: spark-cluster-rbrigoli
PYSPARK_SUBMIT_ARGS: --conf spark.hadoop.fs.s3a.endpoint=http://172.30.29.255:9000 --conf spark.hadoop.fs.s3a.access.key=minio --conf spark.hadoop.fs.s3a.secret.key=minio123 --conf spark.hadoop.fs.s3a.path.style.access=true --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem --conf spark.hadoop.fs.s3a.multipart.size=104857600 --conf spark.jars.ivy=/tmp --packages org.apache.hadoop:hadoop-aws:3.2.0 --master spark://spark-cluster-rbrigoli:7077 pyspark-shell 
Driver IP address: 10.130.3.188
Creating a spark session...
Spark session created
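
A long, hand-written argument string is easy to mistype. As a sketch only, the same arguments can be assembled from a dictionary, with the credentials read from the environment instead of being hard-coded. The helper `build_submit_args` and the variable names `S3_ACCESS_KEY` and `S3_SECRET_KEY` are hypothetical; the fallback values are the ones used in the cell above.

```python
import os


def build_submit_args(conf, packages):
    """Render Spark properties and Maven coordinates as a submit-argument string."""
    conf_args = " ".join(f"--conf {key}={value}" for key, value in conf.items())
    return f"{conf_args} --packages {','.join(packages)}"


s3a_conf = {
    # The notebook passes this value through domain_to_ip() first; omitted here
    # so the sketch runs without a DNS lookup.
    "spark.hadoop.fs.s3a.endpoint": os.environ.get("S3_ENDPOINT_URL", "http://minio-ml-workshop:9000"),
    # S3_ACCESS_KEY / S3_SECRET_KEY are assumed variable names, not defined by the platform.
    "spark.hadoop.fs.s3a.access.key": os.environ.get("S3_ACCESS_KEY", "minio"),
    "spark.hadoop.fs.s3a.secret.key": os.environ.get("S3_SECRET_KEY", "minio123"),
    "spark.hadoop.fs.s3a.path.style.access": "true",
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3a.multipart.size": "104857600",
    "spark.jars.ivy": "/tmp",
}

submit_args = build_submit_args(s3a_conf, ["org.apache.hadoop:hadoop-aws:3.2.0"])
print(submit_args)
```

The resulting `submit_args` string can then be passed to `spark_util.getOrCreateSparkSession()` exactly as before.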


###  Create dataframes from CSV files

Using Spark, read the CSV files from S3 storage and load them as Spark dataframes.

In [22]:
# Note: the option key is spelled "delimiter".
df_cfms_cncrn = spark.read \
                .options(delimiter=',', inferSchema='True', header='True') \
                .csv("s3a://raw-data-anz/CFMS_CNCRN.csv")
df_cfms_cncrn.printSchema()

df_cfms_issue = spark.read \
                .options(delimiter=',', inferSchema='True', header='True') \
                .csv("s3a://raw-data-anz/CFMS_ISSUE.csv")
df_cfms_issue.printSchema()

df_salesforce = spark.read \
                .options(delimiter=',', inferSchema='True', header='True') \
                .csv("s3a://raw-data-anz/SALESFORCECMOSAU_CASE.csv")
df_salesforce.printSchema()

root
 |-- CNCRN_ID: string (nullable = true)
 |-- SOURCE_SYSTEM: string (nullable = true)
 |-- RECVD_D: string (nullable = true)
 |-- CNCRN_DS: string (nullable = true)
 |-- STAT: string (nullable = true)

root
 |-- CNCRN_ID: string (nullable = true)
 |-- ISSUE_DS: string (nullable = true)
 |-- END_DATE: string (nullable = true)

root
 |-- CASENUMBER_C: string (nullable = true)
 |-- ISSUE_DS: string (nullable = true)
 |-- END_DATE: string (nullable = true)



### Join dataframes
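Perform a left outer join of the two CFMS dataframes using ```CNCRN_ID``` as the key, then left outer join the result with the Salesforce dataframe by matching ```CNCRN_ID``` against ```CASENUMBER_C```. The Salesforce ```ISSUE_DS``` and ```END_DATE``` columns are aliased first so that their names do not clash with the CFMS columns.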

In [35]:
from pyspark.sql.functions import col

df_cfms_joined = df_cfms_cncrn.join(df_cfms_issue, "CNCRN_ID", how="left_outer")
#df_cfms_joined.printSchema()

df_salesforce_alias = df_salesforce.select(col("CASENUMBER_C"), col("ISSUE_DS").alias("SF_ISSUE_DS"), col("END_DATE").alias("SF_END_DATE"))

join_pred = df_cfms_joined.CNCRN_ID == df_salesforce_alias.CASENUMBER_C

df_cfms_salesforce_joined = df_cfms_joined.join(df_salesforce_alias, join_pred, how="left_outer")
#df_cfms_salesforce_joined.printSchema()


df_cfms_salesforce_joined.show()

+--------+-------------+----------+--------------------+-------+--------------------+----------+------------+--------------------+-----------+
|CNCRN_ID|SOURCE_SYSTEM|   RECVD_D|            CNCRN_DS|   STAT|            ISSUE_DS|  END_DATE|CASENUMBER_C|         SF_ISSUE_DS|SF_END_DATE|
+--------+-------------+----------+--------------------+-------+--------------------+----------+------------+--------------------+-----------+
|    A111|         CFMS|2019-06-14|Something's wrong...|   Open|Something's wrong...|2025-01-01|        null|                null|       null|
|    A112|         CFMS|2020-06-12|Something's wrong...|   Open|Something's wrong...|2025-01-01|        null|                null|       null|
|    A113|         CFMS|2021-06-01|Something's wrong...|   Open|Something's wrong...|2025-01-01|        null|                null|       null|
|    A114|         CFMS|2019-07-23|Something's wrong...|   Open|Something's wrong...|2025-01-01|        null|                null|       null|
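
The `null` values in the Salesforce columns are what a left outer join produces when a left-hand row has no match. The plain-Python model below sketches that behaviour on lists of dictionaries. It is not Spark code, `left_outer_join` is a hypothetical helper, and the sample rows (including the matched case for `A112`) are made up for illustration.

```python
def left_outer_join(left_rows, right_rows, left_key, right_key):
    """Keep every left row; fill right-hand columns with None when nothing matches."""
    # Index the right side by its join key for quick lookup.
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)
    right_columns = list(right_rows[0].keys()) if right_rows else []

    joined = []
    for left in left_rows:
        key = left[left_key]
        # As in SQL, a missing (None) key never matches anything.
        matches = index.get(key, []) if key is not None else []
        if matches:
            joined.extend({**left, **right} for right in matches)
        else:
            joined.append({**left, **{column: None for column in right_columns}})
    return joined


# Made-up sample rows, not taken from the workshop data.
concerns = [{"CNCRN_ID": "A111", "STAT": "Open"}, {"CNCRN_ID": "A112", "STAT": "Open"}]
cases = [{"CASENUMBER_C": "A112", "SF_ISSUE_DS": "example issue"}]

result = left_outer_join(concerns, cases, "CNCRN_ID", "CASENUMBER_C")
for row in result:
    print(row)
```

Here `A111` keeps its CFMS columns and receives `None` for both Salesforce columns, while `A112` is paired with its matching case.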

### Push the prepared data to the object storage
Write the joined dataframe to an S3 bucket, first as a CSV file and then as a Parquet file.

In [32]:
file_location = "s3a://data-anz/cfms_concern"
df_cfms_salesforce_joined.coalesce(1).write.mode("overwrite")\
    .option("header", "true")\
    .format("csv").save(file_location)

In [34]:
file_location = "s3a://data-anz/cfms_concern_parquet"
# Overwrite any earlier output so that the cell can be re-run without failing.
df_cfms_salesforce_joined.coalesce(1).write.mode("overwrite").parquet(file_location)

### Query Parquet file

In [50]:
from pyspark.sql.functions import col, concat_ws

file_location = "s3a://data-anz/cfms_concern_parquet"
df_queried = spark.read.parquet(file_location)

#df_queried.show()

# concat_ws skips nulls, so each merged column keeps whichever source value is present.
df_formatted = df_queried.select(
    col("CNCRN_ID"),
    col("SOURCE_SYSTEM").alias("Source"),
    col("RECVD_D").alias("Date Recvd"),
    col("CNCRN_DS").alias("Concern Description"),
    col("STAT").alias("Status"),
    concat_ws("", df_queried.ISSUE_DS, df_queried.SF_ISSUE_DS).alias("Issue Description"),
    concat_ws("", df_queried.END_DATE, df_queried.SF_END_DATE).alias("End Date"),
)

df_formatted.show()

+--------+-------------+----------+--------------------+-------+--------------------+----------+------------+--------------------+-----------+
|CNCRN_ID|SOURCE_SYSTEM|   RECVD_D|            CNCRN_DS|   STAT|            ISSUE_DS|  END_DATE|CASENUMBER_C|         SF_ISSUE_DS|SF_END_DATE|
+--------+-------------+----------+--------------------+-------+--------------------+----------+------------+--------------------+-----------+
|    A111|         CFMS|2019-06-14|Something's wrong...|   Open|Something's wrong...|2025-01-01|        null|                null|       null|
|    A112|         CFMS|2020-06-12|Something's wrong...|   Open|Something's wrong...|2025-01-01|        null|                null|       null|
|    A113|         CFMS|2021-06-01|Something's wrong...|   Open|Something's wrong...|2025-01-01|        null|                null|       null|
|    A114|         CFMS|2019-07-23|Something's wrong...|   Open|Something's wrong...|2025-01-01|        null|                null|       null|
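
The query above merges the CFMS and Salesforce columns with `concat_ws`. That function skips null inputs, but when both inputs are present it joins them into one string, whereas `coalesce` would return only the first non-null value. The plain-Python model below sketches the difference; `concat_ws_like` and `coalesce_like` are hypothetical stand-ins and not part of the PySpark API.

```python
def concat_ws_like(separator, *values):
    """Model of concat_ws: drop None inputs, join the rest with the separator."""
    return separator.join(value for value in values if value is not None)


def coalesce_like(*values):
    """Model of coalesce: return the first input that is not None, else None."""
    return next((value for value in values if value is not None), None)


# Only one source has a value: both approaches agree.
print(concat_ws_like("", "2025-01-01", None))   # → 2025-01-01
print(coalesce_like("2025-01-01", None))        # → 2025-01-01

# Both sources have a value: concat_ws glues them together.
print(concat_ws_like("", "2025-01-01", "2026-06-30"))  # → 2025-01-012026-06-30
print(coalesce_like("2025-01-01", "2026-06-30"))       # → 2025-01-01
```

If a concern could ever carry both a CFMS and a Salesforce value, `coalesce` from `pyspark.sql.functions` may be the safer choice for columns such as the end date.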

### Stop Spark Session
Because this is the last step of our data preparation, we no longer need the Spark cluster. Stopping the Spark session removes the Spark application from the cluster.

In [9]:
spark.stop()
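
If an earlier cell fails, `spark.stop()` may never run and the application stays on the cluster. One possible pattern, sketched here with a hypothetical `stopping` context manager, is to guarantee the call with `try`/`finally`. The `FakeSession` class exists only so that the sketch runs without a cluster.

```python
from contextlib import contextmanager


@contextmanager
def stopping(session):
    """Yield the session and always call its stop() method afterwards."""
    try:
        yield session
    finally:
        session.stop()


class FakeSession:
    """Stand-in for a Spark session, used only to demonstrate the pattern."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


fake = FakeSession()
try:
    with stopping(fake):
        raise RuntimeError("simulated failure in a processing step")
except RuntimeError:
    pass

print(fake.stopped)  # → True
```

With a real session, the processing steps would go inside `with stopping(spark):`.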