# Spark Environment Options and Dependencies

* This notebook shows how to connect to and configure Spark in a local environment, and how to manage the dependencies and options required to read and write data on S3-compatible object storage.

In [1]:
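# Note: the delta.tables module imported below is provided by the delta-spark package (installed in the next cell)
pip install delta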

Collecting delta
  Downloading delta-0.4.2.tar.gz (4.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: delta
  Building wheel for delta (setup.py) ... done
  Created wheel for delta: filename=delta-0.4.2-py3-none-any.whl size=2928 sha256=01782673ecb5f9201ee04c7f44764f4552a59bd327cbdf4fadd308ca1d0dadb3
  Stored in directory: /home/jovyan/.cache/pip/wheels/06/c9/f4/15ff81c648b9fc73aae5886b41204ada25bd73cbb41b9fad78
Successfully built delta
Installing collected packages: delta
Successfully installed delta-0.4.2
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install delta-spark

Collecting delta-spark
  Downloading delta_spark-2.0.0-py3-none-any.whl (20 kB)
Collecting py4j==0.10.9.2
  Downloading py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 198.8/198.8 KB 13.9 MB/s eta 0:00:00
Installing collected packages: py4j, delta-spark
Successfully installed delta-spark-2.0.0 py4j-0.10.9.2
Note: you may need to restart the kernel to use updated packages.


## Spark Session

* SparkSession is the entry point to Spark SQL and one of the first objects you create when developing a Spark SQL application.
* You create a SparkSession through the SparkSession.builder attribute, which exposes the Builder API used to configure the session.

In [3]:
# Import necessary libraries
# Duplicate and wildcard imports removed: `from pyspark.sql.functions import *`
# shadows Python built-ins such as sum, min and max.

# Standard library
import datetime
import hashlib
import json
import os
import sys
import urllib.request
from datetime import date, timedelta
from itertools import islice

# PySpark
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.functions import col, udf
from pyspark.sql.types import (
    ArrayType, DateType, FloatType, IntegerType, StringType,
    StructField, StructType, TimestampType,
)

# Delta Lake Python API (from the delta-spark package)
from delta.tables import DeltaTable

In [4]:
# Builder API
# Spark session & context
spark = (
    SparkSession.builder
    .master("local")
    .appName("Hive-Test")
    .enableHiveSupport()
    .getOrCreate()
)


# Get configurations
# configurations = spark.sparkContext.getConf().getAll()
# for item in configurations: print(item)
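
The session above sets no S3 or Delta options itself, so it presumably relies on defaults configured elsewhere in the environment. As a minimal sketch of what such options can look like when set explicitly, the helper below collects the commonly used S3A and Delta Lake settings in a dictionary and applies them to the builder. The function names, the `MINIO_*` environment variable names, and the default endpoint are illustrative assumptions and not part of this notebook; the hadoop-aws and Delta Lake JARs must also already be on the classpath (for example through `spark.jars.packages`, with versions that match your Spark build).

```python
import os


def s3a_delta_conf(endpoint, access_key, secret_key):
    """Return Spark options for an S3-compatible store plus Delta Lake."""
    return {
        # S3A connector settings (passed to Hadoop via the spark.hadoop. prefix)
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style access is typically needed for MinIO-like endpoints
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled":
            str(endpoint.startswith("https")).lower(),
        # Delta Lake SQL extension and catalog
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog":
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def build_session(conf, app_name="Hive-Test"):
    """Create (or reuse) a local SparkSession with the given options."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = (
        SparkSession.builder.master("local")
        .appName(app_name)
        .enableHiveSupport()
    )
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical environment variable names; adjust to your deployment.
conf = s3a_delta_conf(
    endpoint=os.environ.get("MINIO_ENDPOINT", "http://localhost:9000"),
    access_key=os.environ.get("MINIO_ACCESS_KEY", ""),
    secret_key=os.environ.get("MINIO_SECRET_KEY", ""),
)
# spark = build_session(conf)
```

Note that getOrCreate() returns an already running session if one exists, in which case some of these options may not take effect until the kernel is restarted.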

## MinIO Storage Initialization

In [5]:
# Read customers data from MinIO
customers = spark.read.option("header", True).csv("s3a://bronze/sales/customers/2022/07/02/09/customers.csv")

# Show the first 5 rows
customers.show(5)

+-----------+--------------+--------------------+-----------+----------+--------------+----------------+--------------------+----------------+--------------------+
|customer_id| customer_name|             address|       city|postalcode|       country|           phone|               email|     credit_card|          updated_at|
+-----------+--------------+--------------------+-----------+----------+--------------+----------------+--------------------+----------------+--------------------+
|          1|    Ariel Hale|Ap #660-3260 Pell...|    College|     98362| United States|  1-973-833-9836|amet.metus@Nullat...|5124442517412973|2022-07-21 09:14:...|
|          2| Aubrey Norris|Ap #943-1347 Impe...| Coldstream|   D10 5JV|United Kingdom|    07672 321093|sollicitudin@enim...|5103696625359419|2022-07-21 09:14:...|
|          3|  Bruno Hebert|    8566 Nisi Avenue| Llangollen|   CE2 4WW|United Kingdom|    02794 010514|Donec.non@dapibus...|5132188470727440|2022-07-21 09:14:...|
|          4|   
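
Reading the CSV with only the header option leaves every column typed as a string. The sketch below shows one way to pass an explicit schema and to build the object path from a timestamp. The column types are guesses based on the output shown above, and reading the path segments as year/month/day/hour is an assumption; `customers_path` and `read_customers` are hypothetical helper names.

```python
from datetime import datetime

# DDL-style schema string; types are assumptions inferred from the sample rows.
CUSTOMERS_SCHEMA = (
    "customer_id INT, customer_name STRING, address STRING, city STRING, "
    "postalcode STRING, country STRING, phone STRING, email STRING, "
    "credit_card STRING, updated_at TIMESTAMP"
)


def customers_path(ts, bucket="bronze"):
    """Build the s3a path for the customers file of a given hour.

    Assumes the layout <bucket>/sales/customers/YYYY/MM/DD/HH/customers.csv.
    """
    return f"s3a://{bucket}/sales/customers/{ts:%Y/%m/%d/%H}/customers.csv"


def read_customers(spark, ts):
    """Read one hourly customers file with an explicit schema."""
    return (
        spark.read
        .option("header", True)
        .schema(CUSTOMERS_SCHEMA)
        .csv(customers_path(ts))
    )


# Example: the file read in the cell above
path = customers_path(datetime(2022, 7, 2, 9))
# customers = read_customers(spark, datetime(2022, 7, 2, 9))
```

Values that do not parse under the declared type are read as null in the default permissive mode, so it is worth checking a few rows after switching to an explicit schema.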

## Hive Metastore Initialization

In [6]:
spark.sql("show databases;").show()

+---------+
|namespace|
+---------+
|  default|
+---------+



## Read Data from S3 Object Storage (MinIO)

In [10]:
# Register the DataFrame as a temporary view so it can be queried with SQL
customers.createOrReplaceTempView("customers")

## Perform Transformations on Data

In [3]:
##########################################
# Performing Transformations 
##########################################

uk_customers = spark.sql("""
    SELECT *
    FROM customers
    WHERE country = 'United Kingdom'
    ORDER BY customer_id ASC
    LIMIT 100
""")

## Writing Results to S3 Object Storage (MinIO)

### Writing in CSV Format

In [None]:
##########################################
# Writing Results to S3
##########################################
uk_customers.write.option("header","true").csv("s3a://silver/CSV/jupyter/United-Kingdom-Customers")
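
The CSV write above does not set a save mode, so Spark uses its default (errorifexists) and the cell fails when it is rerun against an existing path. A minimal sketch of a wrapper that makes the mode explicit follows; `write_csv` and `VALID_SAVE_MODES` are hypothetical names introduced for illustration.

```python
# Save modes accepted by DataFrameWriter.mode()
VALID_SAVE_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def write_csv(df, path, mode="overwrite"):
    """Write a DataFrame as CSV with a header row and an explicit save mode."""
    if mode not in VALID_SAVE_MODES:
        raise ValueError(
            f"Unknown save mode {mode!r}; expected one of {sorted(VALID_SAVE_MODES)}"
        )
    df.write.mode(mode).option("header", "true").csv(path)


# write_csv(uk_customers, "s3a://silver/CSV/jupyter/United-Kingdom-Customers")
```

Choosing "overwrite" makes the cell safe to rerun, at the cost of replacing whatever was previously stored at that path.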

### Writing in Delta Format

In [None]:
##########################################
# Writing Results to S3
##########################################
uk_customers.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("s3a://silver/Delta/customers")
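
After the Delta write, it can be useful to read the table back and inspect its commit history. The sketch below assumes the same table path as the cell above and a session with the Delta Lake extension enabled; `read_delta` and `delta_history` are hypothetical helper names.

```python
DELTA_PATH = "s3a://silver/Delta/customers"


def read_delta(spark, path=DELTA_PATH):
    """Load the Delta table at `path` as a DataFrame."""
    return spark.read.format("delta").load(path)


def delta_history(spark, path=DELTA_PATH, limit=5):
    """Return the most recent commits of the Delta table as a DataFrame."""
    from delta.tables import DeltaTable  # imported lazily; requires delta-spark

    return DeltaTable.forPath(spark, path).history(limit)


# read_delta(spark).show(5)
# delta_history(spark).select("version", "timestamp", "operation").show()
```

Each overwrite adds a new version to the history, so rerunning the write cell should show an additional row here.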