<h1 style="text-align:center"> INFO 323: Cloud Computing and Big Data</h1>
<h2 style="text-align:center"> College of Computing and Informatics</h2>
<h2 style="text-align:center">Drexel University</h2>

<h3 style="text-align:center"> Introduction to Spark Programming</h3>
<h3 style="text-align:center"> Yuan An, PhD</h3>
<h3 style="text-align:center">Associate Professor</h3>

# Cloud Computing Recap
- Typically, when we think of a “computer,” we think about one machine sitting on our desk at home or at work.
- There are some things that our computer is not powerful enough to perform. One particularly challenging area is data processing.
- Single machines do not have enough power and resources to perform computations on huge amounts of information (or the user probably does not have the time to wait for the computation to finish). 
- A cluster, or group, of computers pools the resources of many machines together, giving us the ability to use all the cumulative resources as if they were a single computer.


# Cloud Computing and Spark
- A group of machines alone is not powerful, however; we need a framework to coordinate work across them.
- Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers.
- The cluster of machines that Spark will use to execute tasks is managed by a cluster manager such as
  - Spark’s standalone cluster manager,
  - YARN,
  - or Mesos.
- We then submit Spark Applications to these cluster managers, which will grant resources to our application so that we can complete our work.


# Spark Cluster
A Spark cluster consists of a driver node and worker nodes.
![](https://i.imgur.com/6I4l6ZH.png)

# Spark Applications
Spark Applications consist of a driver process and a set of executor processes. 
![](https://i.imgur.com/zF7Ngfc.png)




# Spark Executor
Each core of an executor gets a partition of data to work on.
![](https://i.imgur.com/FZN5dCb.png)
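
The idea of one partition per core can be sketched in plain Python, without Spark. This is only an illustration: `split_into_partitions` and `process_partition` are hypothetical helper names, and worker threads stand in for executor cores.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Deal the rows round-robin into num_partitions lists."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def process_partition(partition):
    """The 'task' run on one partition: here, simply sum its values."""
    return sum(partition)


rows = list(range(1, 11))  # the numbers 1..10
partitions = split_into_partitions(rows, num_partitions=4)

# Four worker threads stand in for four executor cores,
# each working on its own partition.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(process_partition, partitions))

# The per-partition results are combined into the final answer.
total = sum(partial_sums)
print(partitions)
print(partial_sums, total)
```

No worker ever needs to see another worker's partition, which is what lets the work run in parallel.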

# Spark Driver
- Each Spark driver creates one or more Spark jobs; 
- Each Spark job creates one or more stages; 
- Each Spark stage creates one or more tasks to be distributed to executors
![](https://i.imgur.com/lMTRd1c.png)

# Spark Programs
A program consists of a sequence of transformations followed by an action.
![](https://i.imgur.com/NLTWLo0.png)


# Spark RDD
- The Resilient Distributed Dataset (RDD) is the primary data abstraction for Spark applications and one of the main differentiators between Spark and other cluster computing frameworks.
- RDDs are in-memory collections of data distributed across a cluster.
- Spark programs using the Spark core API consist of
  - loading input data into an RDD,
  - transforming the RDD into subsequent RDDs,
  - storing or presenting the final output for an application from the resulting final RDD.
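
The load, transform, output pattern can be mimicked with immutable tuples in plain Python. This is a rough stand-in rather than the Spark core API: `map_partitions` is a hypothetical helper and the two input lines are made up.

```python
# Step 1: "load" the input into an immutable, partitioned collection
# (a tuple of partitions, with one line per partition here).
lines = ("spark makes clusters simple", "clusters pool many machines")
rdd0 = tuple((line,) for line in lines)


def map_partitions(rdd, fn):
    """Return a NEW partitioned collection; the input is left untouched."""
    return tuple(tuple(fn(part)) for part in rdd)


# Step 2: transform the collection into subsequent collections.
rdd1 = map_partitions(rdd0, lambda part: (w for line in part for w in line.split()))
rdd2 = map_partitions(rdd1, lambda part: (w for w in part if len(w) > 5))

# Step 3: present the final output by flattening the partitions.
result = [w for part in rdd2 for w in part]
print(result)
```

Each step yields a new collection, so `rdd0` and `rdd1` remain available unchanged after `rdd2` is built.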


# DataFrames
The DataFrame is the most common Structured API; it simply represents a table of data with rows and columns.
![](https://i.imgur.com/qvdB1rT.png)

# Transformations
- In Spark, the core data structures are immutable, meaning they cannot be changed after they’re created. 
- This might seem like a strange concept at first: if we cannot change it, how are we supposed to use it? 
- To “change” a DataFrame, we need to instruct Spark how we would like to modify it to do what we want.
- These instructions are called transformations. 


# Narrow Transformation
Transformations consisting of narrow dependencies (we’ll call them narrow transformations) are
those for which each input partition will contribute to only one output partition.
# Wide Transformation
A transformation with wide dependencies (a wide transformation) has input partitions
contributing to many output partitions. You will often hear this referred to as a shuffle, whereby Spark
exchanges partitions across the cluster. With narrow transformations, Spark automatically
performs an operation called pipelining: if we specify multiple filters on DataFrames,
they are all performed in memory. The same cannot be said for shuffles. When we perform a shuffle,
Spark writes the results to disk.
![narrow vs wide](https://i.imgur.com/jJ4fypS.png)
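
A small plain-Python sketch can make the difference concrete. The rows are made up, `narrow_filter` and `wide_group_by_key` are hypothetical helpers, and the byte-sum "hash" is only a deterministic stand-in for whatever hash function Spark really uses.

```python
# Two input partitions of (state, count) rows; the values are made up.
partitions = [
    [("CA", 3), ("TX", 5)],
    [("CA", 7), ("WA", 2)],
]


def narrow_filter(parts, predicate):
    """Narrow: each input partition feeds exactly one output partition."""
    return [[row for row in part if predicate(row)] for part in parts]


def wide_group_by_key(parts, num_output_partitions):
    """Wide: a row may land in any output partition, chosen by its key."""
    out = [dict() for _ in range(num_output_partitions)]
    for part in parts:
        for key, value in part:
            # Deterministic stand-in for a hash partitioner.
            target = sum(key.encode()) % num_output_partitions
            out[target].setdefault(key, []).append(value)
    return out


filtered = narrow_filter(partitions, lambda row: row[1] > 2)
shuffled = wide_group_by_key(partitions, num_output_partitions=3)
print(filtered)
print(shuffled)
```

The filter never moves a row out of its partition. The grouping has to bring both `"CA"` rows together, even though they start in different partitions, and that data movement is the shuffle.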

# Lazy Evaluation
Lazy evaluation means that Spark will wait until the very last moment to execute the graph of
computation instructions. 

In Spark, instead of modifying the data immediately when you express some
operation, you build up a plan of transformations that you would like to apply to your source data. 

By waiting until the last minute to execute the code, Spark compiles this plan from your raw DataFrame
transformations to a streamlined physical plan that will run as efficiently as possible across the
cluster. 

This provides immense benefits because Spark can optimize the entire data flow from end to
end. 

An example of this is something called predicate pushdown on DataFrames. If we build a large
Spark job but specify a filter at the end that only requires us to fetch one row from our source data,
the most efficient way to execute this is to access the single record that we need. Spark will actually
optimize this for us by pushing the filter down automatically.
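
To see what "building up a plan" means, here is a toy lazy table in plain Python. `LazyTable` is a hypothetical class written for this illustration and performs no optimization; it only shows that transformations are recorded and nothing runs until an action is called.

```python
class LazyTable:
    """Records transformations; runs them only when an action is called."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan

    def filter(self, predicate):
        # Transformation: return a new table with one more planned step.
        return LazyTable(self._rows, self._plan + (("filter", predicate),))

    def select(self, fn):
        # Transformation: again, only the plan grows.
        return LazyTable(self._rows, self._plan + (("map", fn),))

    def explain(self):
        return [name for name, _ in self._plan]

    def collect(self):
        # Action: now the recorded steps are actually executed.
        out = iter(self._rows)
        for name, fn in self._plan:
            out = filter(fn, out) if name == "filter" else map(fn, out)
        return list(out)


calls = []


def double(x):
    calls.append(x)  # record every time real work happens
    return x * 2


table = LazyTable([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).select(double)
calls_before = list(calls)  # still empty: no work has been done yet
plan = table.explain()
result = table.collect()
print(plan, calls_before, result, calls)
```

Because the whole plan is known before anything runs, an engine has the chance to reorder or merge steps, which is the opening that optimizations such as predicate pushdown rely on.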


# Actions
- To trigger the computation, we run an action, which instructs Spark to compute a result from a series of transformations.
- There are three kinds of actions:
  - Actions to view data in the console
  - Actions to collect data to native objects in the respective language
  - Actions to write to output data sources


# Spark Example: Counting M&Ms for the Cookie Monster

Let’s solve a problem with a larger data set, using more of Spark’s distribution
functionality and DataFrame APIs. We will cover the APIs used in this program later.

Let’s write a Spark program that reads a file with over 100,000 entries (where each
row or line has a <state, mnm_color, count>) and computes and aggregates the
counts for each color and state. These aggregated counts tell us the colors of M&Ms
favored by students in each state.

In [1]:
spark.version

'2.4.8'

### 1. Set the path of the M&M data set file

In [4]:
## Change the following path to your bucket
mnm_file = "gs://info323-ya45-spring2023/notebooks/jupyter/mnm_dataset.csv"

### 2. Read the file into a Spark DataFrame using the CSV format by inferring the schema and specifying that the file contains a header, which provides column names for comma-separated fields.

In [5]:
mnm_df = (spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load(mnm_file))

In [4]:
print(mnm_df.schema)

StructType(List(StructField(State,StringType,true),StructField(Color,StringType,true),StructField(Count,IntegerType,true)))


In [6]:
mnm_df.printSchema()

root
 |-- State: string (nullable = true)
 |-- Color: string (nullable = true)
 |-- Count: integer (nullable = true)



### 3. We use the DataFrame high-level APIs. 
1. Select from the DataFrame the fields "State", "Color", and "Count"
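2. Since we want to group each state and its M&M color count, we use groupBy() on State and Color
3. Aggregate each group with count(), which returns the number of rows (non-null `Count` entries) in the group rather than the sum of the `Count` values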
4. orderBy() in descending order
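
The same group, aggregate, order pipeline can be tried on a handful of rows in plain Python. The six rows below are made up for illustration. The sketch also contrasts counting the rows in a group (what `count()` does) with summing the `Count` values.

```python
from collections import Counter

# Hypothetical (State, Color, Count) rows, made up for illustration.
rows = [
    ("CA", "Yellow", 20),
    ("CA", "Yellow", 35),
    ("TX", "Green", 50),
    ("CA", "Red", 10),
    ("TX", "Green", 5),
    ("TX", "Green", 8),
]

# groupBy("State", "Color") followed by count(): rows per group.
row_counts = Counter((state, color) for state, color, _ in rows)

# orderBy("Total", ascending=False)
ordered = sorted(row_counts.items(), key=lambda kv: kv[1], reverse=True)

# For comparison: totalling the Count values per group instead.
sums = Counter()
for state, color, n in rows:
    sums[(state, color)] += n

print(ordered)
print(dict(sums))
```

The two aggregates differ: `("TX", "Green")` has 3 rows but a summed value of 63, so it matters which of the two a query is meant to report.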


In [8]:
from pyspark.sql.functions import count

count_mnm_df = (mnm_df
.select("State", "Color", "Count")
.groupBy("State", "Color")
.agg(count("Count").alias("Total"))
.orderBy("Total", ascending=False))

### 4. Show the resulting aggregations for all the states and colors; a total count of each color per state. Note show() is an action, which will trigger the above query to be executed.

In [9]:
count_mnm_df.show(n=60, truncate=False)
print("Total Rows = %d" % (count_mnm_df.count()))

+-----+------+-----+
|State|Color |Total|
+-----+------+-----+
|CA   |Yellow|1807 |
|WA   |Green |1779 |
|OR   |Orange|1743 |
|TX   |Green |1737 |
|TX   |Red   |1725 |
|CA   |Green |1723 |
|CO   |Yellow|1721 |
|CA   |Brown |1718 |
|CO   |Green |1713 |
|NV   |Orange|1712 |
|TX   |Yellow|1703 |
|NV   |Green |1698 |
|AZ   |Brown |1698 |
|WY   |Green |1695 |
|CO   |Blue  |1695 |
|NM   |Red   |1690 |
|AZ   |Orange|1689 |
|NM   |Yellow|1688 |
|NM   |Brown |1687 |
|UT   |Orange|1684 |
|NM   |Green |1682 |
|UT   |Red   |1680 |
|AZ   |Green |1676 |
|NV   |Yellow|1675 |
|NV   |Blue  |1673 |
|WA   |Red   |1671 |
|WY   |Red   |1670 |
|WA   |Brown |1669 |
|NM   |Orange|1665 |
|WY   |Blue  |1664 |
|WA   |Yellow|1663 |
|WA   |Orange|1658 |
|CA   |Orange|1657 |
|NV   |Brown |1657 |
|CO   |Brown |1656 |
|CA   |Red   |1656 |
|UT   |Blue  |1655 |
|AZ   |Yellow|1654 |
|TX   |Orange|1652 |
|AZ   |Red   |1648 |
|OR   |Blue  |1646 |
|OR   |Red   |1645 |
|UT   |Yellow|1645 |
|CO   |Orange|1642 |
|TX   |Brown 

### 5. While the above code aggregated and counted for all the states, what if we just want to see the data for a single state, e.g., CA?
1. Select all rows from the DataFrame
2. Filter to keep only the rows where State is CA
3. groupBy() State and Color as we did above
4. Aggregate the counts for each color
5. orderBy() in descending order
6. Find the aggregate count for California by filtering

In [10]:
ca_count_mnm_df = (mnm_df
.select("State", "Color", "Count")
.where(mnm_df.State == "CA")
.groupBy("State", "Color")
.agg(count("Count").alias("Total"))
.orderBy("Total", ascending=False))

### Show the resulting aggregation for California. As above, show() is an action that will trigger the execution of the entire computation.

In [11]:
ca_count_mnm_df.show(n=10, truncate=False)

+-----+------+-----+
|State|Color |Total|
+-----+------+-----+
|CA   |Yellow|1807 |
|CA   |Green |1723 |
|CA   |Brown |1718 |
|CA   |Orange|1657 |
|CA   |Red   |1656 |
|CA   |Blue  |1603 |
+-----+------+-----+



## An End-to-End Example

Spark reads in a DataFrame from a file. The DataFrame has a set of columns with an unspecified number of rows. The number of rows is unspecified because reading data is a transformation, and
is therefore a lazy operation. Spark peeked at only a couple of rows of data to try to guess what types
each column should be.
![](https://i.imgur.com/o1y1JzQ.png)

In [1]:
file="gs://info323-ya45-spring2023/notebooks/jupyter/2015-summary.csv"
flightdata = spark.read.option("inferSchema", "true").option("header", "true").csv(file)

In [2]:
flightdata.take(3)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344)]

Let us specify a wide transformation, sort(). Nothing happens to the data when we call sort because it’s just a transformation. However, we can
see that Spark is building up a plan for how it will execute this across the cluster by looking at the
explain plan. We can call explain on any DataFrame object to see the DataFrame’s lineage.

In [3]:
flightdata.sort("count").explain()

== Physical Plan ==
*(2) Sort [count#12 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#12 ASC NULLS FIRST, 200)
   +- *(1) FileScan csv [DEST_COUNTRY_NAME#10,ORIGIN_COUNTRY_NAME#11,count#12] Batched: false, Format: CSV, Location: InMemoryFileIndex[gs://info323-ya45-spring2023/notebooks/jupyter/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>


Now we have a sequence of transformations: narrow (read) -> wide (sort)

You can read explain plans from top to bottom, with the top being the
end result and the bottom being the source(s) of data. In this case, take a look at the first keyword on each line.
You will see Sort, Exchange, and FileScan. The Exchange appears because sorting our data is a wide
transformation: rows need to be compared with one another. Don’t worry too much about
understanding everything about explain plans at this point; they are helpful tools for debugging
and for improving your knowledge as you progress with Spark.
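
The `Exchange rangepartitioning` step in the plan above can be pictured with a plain-Python sketch of a range-partitioned sort. The sample values and the boundaries `[10, 100]` are assumptions chosen for illustration, and `range_partition` is a hypothetical helper; how Spark picks its boundaries is not shown here.

```python
import bisect


def range_partition(values, boundaries):
    """Route each value to the partition whose range contains it."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for v in values:
        parts[bisect.bisect_right(boundaries, v)].append(v)
    return parts


# Illustrative count values and assumed boundaries.
counts = [15, 1, 344, 15, 62, 1, 62, 588, 40, 1]
boundaries = [10, 100]

# The "exchange": rows move so that each partition holds one value range.
parts = range_partition(counts, boundaries)

# Each partition is then sorted independently.
sorted_parts = [sorted(p) for p in parts]

# Reading the partitions in order gives a globally sorted result.
result = [v for p in sorted_parts for v in p]
print(parts)
print(result)
```

Every value in one partition is no larger than any value in the next, so no further merging is needed after the per-partition sorts.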

Next, we can specify an action to kick off this plan. However, before doing
that, we’re going to set a configuration. By default, when we perform a shuffle, Spark outputs 200
shuffle partitions. Let’s set this value to 5 to reduce the number of the output partitions from the
shuffle.

In [4]:
spark.conf.set("spark.sql.shuffle.partitions", "5")

In [5]:
flightdata.sort("count").take(2)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Singapore', count=1),
 Row(DEST_COUNTRY_NAME='Moldova', ORIGIN_COUNTRY_NAME='United States', count=1)]

## DataFrame and SQL

We worked through a simple transformation in the previous example. Let’s now work through a more
complex one and follow along in both DataFrames and SQL. Spark can run the same transformations,
regardless of the language, in the exact same way. You can express your business logic in SQL or
DataFrames (in R, Python, Scala, or Java) and Spark will compile that logic down to an
underlying plan (that you can see in the explain plan) before actually executing your code. With Spark
SQL, you can register any DataFrame as a table or view (a temporary table) and query it using pure
SQL. There is no performance difference between writing SQL queries and writing DataFrame code;
both “compile” to the same underlying plan that we specify in DataFrame code.
You can make any DataFrame into a table or view with one simple method call:
![](https://i.imgur.com/uFU3s01.png)

In [6]:
flightdata.createOrReplaceTempView("flight_data_table")

We can query our data in SQL. To do so, we’ll use the spark.sql function (remember, spark is our SparkSession variable) that conveniently returns a new DataFrame. Although this might seem a bit circular in logic—that a SQL query against a DataFrame returns another DataFrame—it’s actually
quite powerful. This makes it possible for you to specify transformations in the manner most convenient to you at any given point in time and not sacrifice any efficiency to do so! To understand
that this is happening, let’s take a look at two explain plans:

In [7]:
flightdata.groupBy("DEST_COUNTRY_NAME").count().explain()

== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#10, 5)
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[partial_count(1)])
      +- *(1) FileScan csv [DEST_COUNTRY_NAME#10] Batched: false, Format: CSV, Location: InMemoryFileIndex[gs://info323-ya45-spring2023/notebooks/jupyter/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>


In [8]:
sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1)
FROM flight_data_table
GROUP BY DEST_COUNTRY_NAME
""")

In [9]:
sqlWay.explain()

== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#10, 5)
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[partial_count(1)])
      +- *(1) FileScan csv [DEST_COUNTRY_NAME#10] Batched: false, Format: CSV, Location: InMemoryFileIndex[gs://info323-ya45-spring2023/notebooks/jupyter/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>
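
Both plans above share the same shape: a `partial_count` within each partition, an `Exchange hashpartitioning` on the key, then a final `count`. A plain-Python sketch of that two-phase aggregation follows. The rows are made up, and the byte-sum "hash" is only a deterministic stand-in for a real hash partitioner.

```python
from collections import Counter

# Destination names spread over two input partitions (made-up rows).
partitions = [
    ["United States", "Canada", "United States"],
    ["Mexico", "United States", "Canada"],
]

# Phase 1 (partial_count): count within each partition, no data movement.
partials = [Counter(part) for part in partitions]

# Exchange: route each key's partial count to one shuffle partition.
num_shuffle_partitions = 5
exchanged = [Counter() for _ in range(num_shuffle_partitions)]
for partial in partials:
    for key, n in partial.items():
        target = sum(key.encode()) % num_shuffle_partitions  # stand-in hash
        exchanged[target][key] += n

# Phase 2 (final count): every key now lives in exactly one partition.
final = {key: n for part in exchanged for key, n in part.items()}
print(final)
```

Only one small partial count per key and partition crosses the network instead of every row, which is why the partial step appears before the exchange.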


Let’s pull out some interesting statistics from our data. One thing to understand is that DataFrames
(and SQL) in Spark already have a huge number of manipulations available. There are hundreds of
functions that you can use and import to help you resolve your big data problems faster. We will use
the max function to establish the maximum number of flights to and from any given location. This just
scans each value in the relevant column in the DataFrame and checks whether it’s greater than the
previous values that have been seen. This is a transformation, because we are effectively filtering
down to one row. Let’s see what that looks like:

In [10]:
spark.sql("SELECT max(count) FROM flight_data_table").take(1)

[Row(max(count)=370002)]

Let’s perform something a bit more complicated and find the top five destination countries in the data. This is our first multi-transformation
query, so we’ll take it step by step. Let’s begin with a fairly straightforward SQL
aggregation:

In [11]:
maxsql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_table
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")

In [12]:
maxsql.show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



Now, let’s move to the DataFrame syntax that is semantically similar but slightly different in
implementation and ordering. But, as we mentioned, the underlying plans for both of them are the
same. Let’s run the queries and see their results as a sanity check:

In [29]:
from pyspark.sql.functions import desc

In [30]:
flightdata.groupBy("DEST_COUNTRY_NAME")\
.sum("count").withColumnRenamed("sum(count)", "destination_total")\
.sort(desc("destination_total")).limit(5).show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



The above query has 7 steps: read -> groupBy -> sum -> withColumnRenamed -> sort -> limit -> show. The first six are transformations; the final step is the action that triggers execution.

In [15]:
from pyspark.sql.functions import desc

In [16]:
flightdata.groupBy("DEST_COUNTRY_NAME")\
.sum("count").withColumnRenamed("sum(count)", "destination_total")\
.sort(desc("destination_total")).limit(5).explain()

== Physical Plan ==
TakeOrderedAndProject(limit=5, orderBy=[destination_total#77L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#10,destination_total#77L])
+- *(2) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[sum(cast(count#12 as bigint))])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#10, 5)
      +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[partial_sum(cast(count#12 as bigint))])
         +- *(1) FileScan csv [DEST_COUNTRY_NAME#10,count#12] Batched: false, Format: CSV, Location: InMemoryFileIndex[gs://info323-ya45-spring2023/notebooks/jupyter/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>
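
Notice that the plan above uses a single `TakeOrderedAndProject(limit=5, ...)` step instead of a full sort followed by a limit. One way to get the top rows without sorting everything is to keep only the best `k` rows of each partition and then merge those candidates. The sketch below shows that idea in plain Python with made-up data; it is not a description of Spark's internal implementation.

```python
import heapq

# Made-up (name, total) rows spread over three partitions.
partitions = [
    [("A", 50), ("B", 7), ("C", 31)],
    [("D", 44), ("E", 2), ("F", 90)],
    [("G", 12), ("H", 65)],
]
k = 3


def by_total(row):
    return row[1]


# Step 1: each partition keeps only its own top-k rows.
local_top = [heapq.nlargest(k, part, key=by_total) for part in partitions]

# Step 2: merge the small candidate lists and take the overall top-k.
candidates = [row for part in local_top for row in part]
top_k = heapq.nlargest(k, candidates, key=by_total)
print(top_k)
```

At most `k` rows per partition reach the merge step, however large the partitions are.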


# Reader & Writer
1. Read from CSV files
1. Read from JSON files
1. Write DataFrame to files
1. Write DataFrame to tables

##### Methods
- DataFrameReader (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframereader#pyspark.sql.DataFrameReader" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html" target="_blank">Scala</a>): `csv`, `json`, `option`, `schema`
- DataFrameWriter (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframereader#pyspark.sql.DataFrameWriter" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameWriter.html" target="_blank">Scala</a>): `mode`, `option`, `parquet`, `format`, `saveAsTable`
- StructType (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=structtype#pyspark.sql.types.StructType" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/types/StructType.html" target="_blank">Scala</a>): `toDDL`

##### Spark Types
- Types (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=types#module-pyspark.sql.types" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/types/index.html" target="_blank">Scala</a>): `ArrayType`, `DoubleType`, `IntegerType`, `LongType`, `StringType`, `StructType`, `StructField`

### Read from CSV files
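Read from CSV with DataFrameReader's `csv` method and the following options:

Tab separator, use first line as header, infer schema. Note that `2015-summary.csv` is comma-separated, so reading it with a tab separator produces a single string column, as the schema output below shows.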

In [32]:
csvPath = "gs://info323-ya45-spring2023/notebooks/jupyter/2015-summary.csv"

In [33]:
df = (spark.read
  .option("sep", "\t")
  .option("header", True)
  .option("inferSchema", True)
  .csv(csvPath))

In [34]:
df.printSchema()

root
 |-- DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count: string (nullable = true)



In [35]:
df.show(2)

+-------------------------------------------+
|DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count|
+-------------------------------------------+
|                       United States,Rom...|
|                       United States,Cro...|
+-------------------------------------------+
only showing top 2 rows



In [36]:
df.count()

256
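
The single-column result above is what a separator mismatch looks like. The same effect can be reproduced with Python's standard `csv` module on a three-line excerpt of the file (the rows come from the `take(3)` output earlier); `read_rows` is a hypothetical helper.

```python
import csv
import io

# A small excerpt of the comma-separated flight data.
data = (
    "DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count\n"
    "United States,Romania,15\n"
    "United States,Croatia,1\n"
)


def read_rows(text, sep):
    """Parse text with the given separator; return (header, rows)."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    header = next(reader)
    return header, list(reader)


# Wrong separator: each line stays one undivided string.
tab_header, tab_rows = read_rows(data, "\t")

# Matching separator: three proper columns.
comma_header, comma_rows = read_rows(data, ",")

print(len(tab_header), tab_header)
print(len(comma_header), comma_rows[0])
```

With the tab separator every line becomes one field, so the header turns into one long column name, just as in the `printSchema()` output above.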

In [40]:
usersPath = "gs://info323-ya45-spring2023/notebooks/jupyter/users-500k/"

In [41]:
usersDF = (spark.read
  .option("sep", "\t")
  .option("header", True)
  .option("inferSchema", True)
  .csv(usersPath))

                                                                                

In [42]:
usersDF.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- user_first_touch_timestamp: long (nullable = true)
 |-- email: string (nullable = true)



In [43]:
usersDF.count()

                                                                                

500000

Manually define the schema by creating a `StructType` with column names and data types

In [44]:
from pyspark.sql.types import LongType, StringType, StructType, StructField

userDefinedSchema = StructType([
  StructField("user_id", StringType(), True),  
  StructField("user_first_touch_timestamp", LongType(), True),
  StructField("email", StringType(), True)
])

Read from CSV using this user-defined schema instead of inferring schema

In [45]:
usersDF = (spark.read
  .option("sep", "\t")
  .option("header", True)
  .schema(userDefinedSchema)
  .csv(usersPath))

Alternatively, define the schema using a DDL formatted string.

In [46]:
DDLSchema = "user_id string, user_first_touch_timestamp long, email string"

usersDF = (spark.read
  .option("sep", "\t")
  .option("header", True)
  .schema(DDLSchema)
  .csv(usersPath))
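
To get a feel for what a DDL schema string carries, here is a toy parser in plain Python that turns the string into (name, type) pairs and uses them to cast one tab-separated line. `parse_ddl`, `apply_schema` and the `DDL_TYPES` table are hypothetical and handle only flat schemas; the sample line is made up.

```python
# Toy mapping from DDL type names to Python types (an assumption for
# this sketch; it covers only a few simple types).
DDL_TYPES = {"string": str, "long": int, "int": int, "double": float}


def parse_ddl(ddl):
    """Turn 'name type, name type, ...' into a list of (name, caster)."""
    fields = []
    for column in ddl.split(","):
        name, type_name = column.split()
        fields.append((name, DDL_TYPES[type_name.lower()]))
    return fields


def apply_schema(line, fields, sep="\t"):
    """Split one line and cast each value; empty strings become None."""
    values = line.split(sep)
    return {
        name: (cast(value) if value != "" else None)
        for (name, cast), value in zip(fields, values)
    }


schema = parse_ddl("user_id string, user_first_touch_timestamp long, email string")

# A made-up tab-separated line in the shape of the users data.
row = apply_schema("UA000000106062296\t1593445100860131\tuser@example.com", schema)
print(schema)
print(row)
```

With the types stated up front, no pass over the data is needed to guess them, which is the practical difference from `inferSchema`.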

In [48]:
outputPath = "gs://info323-ya45-spring2023/notebooks/jupyter/output/users-500k"

In [49]:
usersDF.write.option("header", True).option("delimiter", "\t").csv(outputPath)

                                                                                

### Read from JSON files

Read from JSON with DataFrameReader's `json` method and the infer schema option

In [50]:
jsonPath = "gs://info323-ya45-spring2023/notebooks/jupyter/2015-summary.json"

In [53]:
jdf = (spark.read
  .option("inferSchema", True)
  .json(jsonPath))

In [54]:
jdf.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



In [55]:
jdf.show(10)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
|    United States|          Singapore|    1|
|    United States|            Grenada|   62|
|       Costa Rica|      United States|  588|
|          Senegal|      United States|   40|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+
only showing top 10 rows



In [4]:
jFolderPath = "gs://info323-ya45-spring2023/notebooks/jupyter/events-500k/"

In [57]:
eventsDF = (spark.read
  .option("inferSchema", True)
  .json(jFolderPath))

                                                                                

In [58]:
eventsDF.printSchema()

root
 |-- device: string (nullable = true)
 |-- ecommerce: struct (nullable = true)
 |    |-- purchase_revenue_in_usd: double (nullable = true)
 |    |-- total_item_quantity: long (nullable = true)
 |    |-- unique_items: long (nullable = true)
 |-- event_name: string (nullable = true)
 |-- event_previous_timestamp: long (nullable = true)
 |-- event_timestamp: long (nullable = true)
 |-- geo: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- coupon: string (nullable = true)
 |    |    |-- item_id: string (nullable = true)
 |    |    |-- item_name: string (nullable = true)
 |    |    |-- item_revenue_in_usd: double (nullable = true)
 |    |    |-- price_in_usd: double (nullable = true)
 |    |    |-- quantity: long (nullable = true)
 |-- traffic_source: string (nullable = true)
 |-- user_first_touch_timestamp: long (nullable = true)

In [59]:
eventsDF.count()

                                                                                

500000

In [2]:
jOutputPath = "gs://info323-ya45-spring2023/notebooks/jupyter/output/events-500k/"

In [61]:
eventsDF.write.json(jOutputPath)

                                                                                

Read data faster by supplying a `StructType` with the field names and data types, so that Spark does not have to scan the files to infer the schema

In [5]:
from pyspark.sql.types import ArrayType, DoubleType, IntegerType, LongType, StringType, StructType, StructField

userDefinedSchema = StructType([
  StructField("device", StringType(), True),  
  StructField("ecommerce", StructType([
    StructField("purchase_revenue_in_usd", DoubleType(), True),
    StructField("total_item_quantity", LongType(), True),
    StructField("unique_items", LongType(), True)
  ]), True),
  StructField("event_name", StringType(), True),
  StructField("event_previous_timestamp", LongType(), True),
  StructField("event_timestamp", LongType(), True),
  StructField("geo", StructType([
    StructField("city", StringType(), True),
    StructField("state", StringType(), True)
  ]), True),
  StructField("items", ArrayType(
    StructType([
      StructField("coupon", StringType(), True),
      StructField("item_id", StringType(), True),
      StructField("item_name", StringType(), True),
      StructField("item_revenue_in_usd", DoubleType(), True),
      StructField("price_in_usd", DoubleType(), True),
      StructField("quantity", LongType(), True)
    ])
  ), True),
  StructField("traffic_source", StringType(), True),
  StructField("user_first_touch_timestamp", LongType(), True),
  StructField("user_id", StringType(), True)
])

eventsDF = (spark.read
  .schema(userDefinedSchema)
  .json(jFolderPath))

In [6]:
DDLSchema = "`device` STRING,`ecommerce` STRUCT<`purchase_revenue_in_usd`: DOUBLE, `total_item_quantity`: BIGINT, `unique_items`: BIGINT>,`event_name` STRING,`event_previous_timestamp` BIGINT,`event_timestamp` BIGINT,`geo` STRUCT<`city`: STRING, `state`: STRING>,`items` ARRAY<STRUCT<`coupon`: STRING, `item_id`: STRING, `item_name`: STRING, `item_revenue_in_usd`: DOUBLE, `price_in_usd`: DOUBLE, `quantity`: BIGINT>>,`traffic_source` STRING,`user_first_touch_timestamp` BIGINT,`user_id` STRING"

eventsDF = (spark.read
  .schema(DDLSchema)
  .json(jFolderPath))
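
When supplying a schema by hand, the field names must match the JSON keys exactly. As far as we understand the default behaviour, a schema field that is absent from a record is read as null rather than raising an error, so a misspelled name fails silently. The plain-Python sketch below imitates that with `dict.get`; `project`, the sample record and the misspelled name `purchaseRevenue` are hypothetical.

```python
import json

# One made-up event record with a nested "ecommerce" object.
record = json.loads(
    '{"device": "Android", '
    '"ecommerce": {"purchase_revenue_in_usd": 1195.0, "total_item_quantity": 1}}'
)


def project(rec, schema):
    """Pick the schema's fields out of rec; missing fields become None.

    schema maps a field name to None (leaf) or to a nested schema dict.
    """
    out = {}
    for name, sub in schema.items():
        value = rec.get(name) if isinstance(rec, dict) else None
        if sub is not None and value is not None:
            value = project(value, sub)
        out[name] = value
    return out


good_schema = {
    "device": None,
    "ecommerce": {"purchase_revenue_in_usd": None, "total_item_quantity": None},
}
# Same schema, but with one nested field name misspelled.
bad_schema = {
    "device": None,
    "ecommerce": {"purchaseRevenue": None, "total_item_quantity": None},
}

good = project(record, good_schema)
bad = project(record, bad_schema)
print(good)
print(bad)
```

The misspelled field comes back as `None` with no error, so it is worth comparing a hand-written schema against the inferred one field by field.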

In [7]:
eventsDF.show(3)

+-------+------------------+----------+------------------------+----------------+--------------------+-----+--------------+--------------------------+-----------------+
| device|         ecommerce|event_name|event_previous_timestamp| event_timestamp|                 geo|items|traffic_source|user_first_touch_timestamp|          user_id|
+-------+------------------+----------+------------------------+----------------+--------------------+-----+--------------+--------------------------+-----------------+
|Android|{null, null, null}|mattresses|        1593445137069608|1593445139973236|      {Lakeland, FL}|   []|        google|          1593445100860131|UA000000106062296|
|Android|{null, null, null}|  warranty|        1593448606629109|1593448809022141|     {Rock Hill, SC}|   []|     instagram|          1593448606629109|UA000000106082500|
|Android|{null, null, null}|mattresses|        1593461473732425|1593461479624958|{Stone Mountain, GA}|   []|        google|          1593460845947854|UA000

                                                                                

### Write DataFrames to files

Write `usersDF` to parquet with DataFrameWriter's `parquet` method and the following configurations: Snappy compression, overwrite mode.

In [10]:
workingDir = "gs://info323-ya45-spring2023/notebooks/jupyter/output"

In [66]:
usersOutputPath = workingDir + "/users.parquet"

(usersDF.write
  .option("compression", "snappy")
  .mode("overwrite")
  .parquet(usersOutputPath)
)

                                                                                

In [11]:
eventsOutputPath = workingDir + "/events.parquet"

(eventsDF.write
  .option("compression", "snappy")
  .mode("overwrite")
  .parquet(eventsOutputPath)
)

                                                                                