In [0]:
## Apache Spark's architecture is designed for speed and scalability, utilizing a master-worker model that enables efficient distributed computing for big data processing.
# Core Components of Spark Architecture
# Driver Program: The driver is the master process that runs the application's main function and creates the SparkContext (or SparkSession). It converts user-defined transformations into a Directed Acyclic Graph (DAG), splits each job into stages and tasks, and schedules those tasks on executors.
# Executors: These are processes launched on the worker nodes to run the tasks assigned by the driver. Each executor runs in its own JVM, returns task results to the driver, and manages memory and cached data for faster access.
# Cluster Manager: The cluster manager allocates resources (CPU, memory) across the worker nodes. Spark can work with several cluster managers, including its own standalone manager, Hadoop YARN, Kubernetes, and Apache Mesos (deprecated in recent Spark releases).
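# The division of labour between driver and executors can be sketched with a toy model in plain Python. This is not Spark's API: `split_into_partitions`, `word_count_task` and `run_job` are hypothetical names chosen for illustration, and a thread pool stands in for executors, which in real Spark are separate JVM processes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: slice the input into roughly equal partitions."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def word_count_task(partition):
    """Executor side: one task processes exactly one partition."""
    return sum(len(line.split()) for line in partition)


def run_job(data, num_partitions=3, num_executors=2):
    """Driver side: create one task per partition, hand the tasks to the
    'executors' (threads here), then combine the partial results."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_counts = list(executors.map(word_count_task, partitions))
    return sum(partial_counts)


lines = ["spark runs tasks", "on executors", "driver plans", "the work"]
total_words = run_job(lines)
print(total_words)
```

# The point of the sketch is the shape of the flow: the driver plans and merges, while each task only ever sees its own partition.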

In [0]:
# RDD (Resilient Distributed Dataset)

# An RDD is the fundamental data structure in Spark: an immutable, distributed collection of objects that can be processed in parallel. RDDs can be created from various data sources, including the Hadoop Distributed File System (HDFS), the local file system, and relational databases. RDDs offer two types of operations: transformations and actions. Transformations create a new RDD from an existing one and are lazy, meaning they are only recorded until an action runs. Actions trigger execution and either return a result to the driver program or write data to external storage.
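# The difference between transformations and actions can be illustrated with a toy class in plain Python. `LazyDataset` is a hypothetical stand-in, not Spark's RDD API: it only mimics the idea that transformations record a step and return a new object, while actions replay the recorded steps.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable and lazily evaluated."""

    def __init__(self, source, lineage=()):
        self._source = list(source)
        self._lineage = tuple(lineage)  # recorded transformations, in order

    # Transformations: return a NEW dataset and only record the step.
    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def _compute(self):
        # Inside a method, map/filter refer to the Python built-ins.
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: replay the lineage and return a result to the caller.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []


def square(x):
    calls.append(x)  # side effect, so we can observe when work happens
    return x * x


numbers = LazyDataset(range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still 0: nothing has executed yet
result = evens_squared.collect()  # the action triggers the computation
print(calls_before_action, result)
```

# Note that `numbers` itself is never modified; each transformation yields a new object that carries a longer lineage.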

# DataFrame

# A DataFrame is a higher-level abstraction introduced in Spark 1.3 to overcome limitations of RDDs, such as the lack of a schema. It represents a distributed collection of data organized into named columns, similar to a table in a relational database or a spreadsheet. Because the column names and types are known, Spark's Catalyst optimizer can plan queries before running them. DataFrames support a wide range of operations, such as filtering, aggregating, joining, and grouping data.
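# A small sketch of the "named columns" idea, assuming a notebook where a SparkSession is available as `spark` (as it is on Databricks). The three rows are copied from the e-commerce sample shown further down in this notebook; `build_sample_df` and `expensive_by_brand` are illustrative names, not part of Spark.

```python
# Three rows taken from the e-commerce sample in this notebook.
SAMPLE_ROWS = [
    ("view", "xiaomi", 489.07),
    ("view", "janome", 293.65),
    ("view", "creed", 28.31),
]
# Column names and types as a DDL string, which createDataFrame accepts.
SAMPLE_SCHEMA = "event_type STRING, brand STRING, price DOUBLE"


def build_sample_df(spark):
    """Create a DataFrame with named, typed columns from in-memory rows."""
    return spark.createDataFrame(SAMPLE_ROWS, schema=SAMPLE_SCHEMA)


def expensive_by_brand(df, min_price=100):
    """Filter on a column, then group and count, all by column name."""
    return df.filter(df["price"] > min_price).groupBy("brand").count()


# In a notebook cell (requires a live SparkSession):
#   expensive_by_brand(build_sample_df(spark)).show()
```

# Unlike an RDD of plain tuples, every operation here refers to a column by name, which is what lets Spark check and optimize the query.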

In [0]:
# `%fs`: Access Databricks File System (DBFS)
# `%sql`: Run SQL Queries
# `%sh`: Execute Shell Commands
# `%md`: Add Markdown
# `%run`: Run External Notebooks
# `%python`, `%r`, `%scala`: Switch Between Languages
# `%pip`: Manage Python Packages

In [0]:
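# Read the e-commerce CSV data from a volume path.
# header=True uses the first row as column names; inferSchema=True lets Spark guess column types.
df = spark.read.csv(
    "/Volumes/workspace/ecommerce/ecommerce_data",
    header=True,
    inferSchema=True
)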

df.show(5)


+-------------------+----------+----------+-------------------+--------------------+------+------+---------+--------------------+
|         event_time|event_type|product_id|        category_id|       category_code| brand| price|  user_id|        user_session|
+-------------------+----------+----------+-------------------+--------------------+------+------+---------+--------------------+
|2019-11-01 00:00:00|      view|   1003461|2053013555631882655|electronics.smart...|xiaomi|489.07|520088904|4d3b30da-a5e4-49d...|
|2019-11-01 00:00:00|      view|   5000088|2053013566100866035|appliances.sewing...|janome|293.65|530496790|8e5f4f83-366c-4f7...|
|2019-11-01 00:00:01|      view|  17302664|2053013553853497655|                NULL| creed| 28.31|561587266|755422e7-9040-477...|
|2019-11-01 00:00:01|      view|   3601530|2053013563810775923|appliances.kitche...|    lg|712.87|518085591|3bfb58cd-7892-48c...|
|2019-11-01 00:00:01|      view|   1004775|2053013555631882655|electronics.smart...|xiaomi
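
# `inferSchema=True` requires an extra pass over the data to guess column types, which can be costly on a large file. One alternative is to pass the schema explicitly. The sketch below rests on assumptions: the column types are guessed from the sample rows above, and `EVENT_COLUMNS`, `EVENT_SCHEMA_DDL` and `read_events` are hypothetical names.

```python
# (column name, Spark SQL type) pairs, guessed from the rows printed above.
EVENT_COLUMNS = [
    ("event_time", "TIMESTAMP"),
    ("event_type", "STRING"),
    ("product_id", "INT"),
    ("category_id", "BIGINT"),  # values such as 2053013555631882655 exceed INT
    ("category_code", "STRING"),
    ("brand", "STRING"),
    ("price", "DOUBLE"),
    ("user_id", "INT"),
    ("user_session", "STRING"),
]

# spark.read.csv accepts the schema as a DDL-formatted string.
EVENT_SCHEMA_DDL = ", ".join(f"{name} {dtype}" for name, dtype in EVENT_COLUMNS)


def read_events(spark, path):
    """Read the CSV with a fixed schema instead of inferring it."""
    return spark.read.csv(path, header=True, schema=EVENT_SCHEMA_DDL)


# In a notebook cell (requires a live SparkSession):
#   df = read_events(spark, "/Volumes/workspace/ecommerce/ecommerce_data")
```

# If the raw timestamp text does not parse with the default settings, the reader's `timestampFormat` option may be needed; that depends on the file and is not verified here.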

In [0]:
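# Project three columns and preview the first 10 rows.
df.select("event_type", "brand", "price").show(10)

# Count events priced above 100. count() returns a number; it is not shown
# in the output unless it is printed or is the last expression of the cell.
df.filter(df.price > 100).count()

# Number of events per event type.
df.groupBy("event_type").count().show()

# Five most frequent brands. Rows with a missing brand form their own NULL group.
df.groupBy("brand") \
  .count() \
  .orderBy("count", ascending=False) \
  .limit(5) \
  .show()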


+----------+--------+------+
|event_type|   brand| price|
+----------+--------+------+
|      view|  xiaomi|489.07|
|      view|  janome|293.65|
|      view|   creed| 28.31|
|      view|      lg|712.87|
|      view|  xiaomi|183.27|
|      view|      hp|360.09|
|      view|      hp|514.56|
|      view| rondell| 30.86|
|      view|michelin| 72.72|
|      view|   apple|732.07|
+----------+--------+------+
only showing top 10 rows

+----------+---------+
|event_type|    count|
+----------+---------+
|  purchase|  1659788|
|      cart|  3955446|
|      view|104335509|
+----------+---------+

+-------+--------+
|  brand|   count|
+-------+--------+
|   NULL|15331243|
|samsung|13172020|
|  apple|10381933|
| xiaomi| 7721825|
| huawei| 2521331|
+-------+--------+
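
# To make the semantics of the queries above concrete, the same steps can be mirrored in plain Python on the ten rows printed by `show(10)`. This is only an illustration of what filter, groupBy, count, orderBy and limit compute; it is not how Spark executes them, and the numbers differ from the full-dataset output because only ten rows are used.

```python
from collections import Counter

# (event_type, brand, price) for the ten rows printed by show(10) above.
rows = [
    ("view", "xiaomi", 489.07),
    ("view", "janome", 293.65),
    ("view", "creed", 28.31),
    ("view", "lg", 712.87),
    ("view", "xiaomi", 183.27),
    ("view", "hp", 360.09),
    ("view", "hp", 514.56),
    ("view", "rondell", 30.86),
    ("view", "michelin", 72.72),
    ("view", "apple", 732.07),
]

# Mirrors: df.filter(df.price > 100).count()
expensive_count = sum(1 for _, _, price in rows if price > 100)

# Mirrors: df.groupBy("event_type").count()
events_by_type = Counter(event_type for event_type, _, _ in rows)

# Mirrors: df.groupBy("brand").count().orderBy("count", ascending=False).limit(5)
brand_counts = Counter(brand for _, brand, _ in rows)
top_brands = brand_counts.most_common(5)

print(expensive_count)
print(dict(events_by_type))
print(top_brands)
```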

