### Example 1 - Wordcount (Plain Map/Reduce)

**Wordcount**: find the number of occurrences of each word in a body of text.

#### Data format
The file `text.txt` contains the following short text:

```
this is a text file with random words like text , words , like this is an example of a text file
```

The data is stored in S3: `s3://initial-notebook-data-bucket-dblab-905418150721/example_data/text.txt`.

We use the **RDD API**.

In [1]:
from pyspark.sql import SparkSession

sc = SparkSession \
    .builder \
    .appName("wordcount example") \
    .getOrCreate() \
    .sparkContext

wordcount = sc.textFile("s3://initial-notebook-data-bucket-dblab-905418150721/example_data/text.txt") \
    .flatMap(lambda x: x.split(" ")) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda x,y: x+y) \
    .sortBy(lambda x: x[1], ascending=False)

print(wordcount.collect())

[('text', 3), ('this', 2), ('is', 2), ('like', 2), ('a', 2), ('file', 2), ('words', 2), (',', 2), ('an', 1), ('of', 1), ('with', 1), ('random', 1), ('example', 1)]

#### Code explanation
- First, we create a `SparkSession` and obtain its `SparkContext`:
    - The `SparkSession` is the entry point to Spark's programming libraries; we need to create one in order to execute code.
    - The `SparkContext` is the entry point specific to RDDs.

- Then, the program reads the `text.txt` file from the `s3://` path given above. Using a lambda function, we split each line into words wherever there is a space.
    - A lambda function is essentially an anonymous function: a quick, throwaway function that we can write without defining or naming it.
    - The lambda function the program passes to `flatMap` is `lambda x: x.split(" ")`.
    - `flatMap` vs `map`: `map` would produce one list of words per line, whereas `flatMap` flattens the results into a single collection containing all words.

- Next, with the use of a `map` function we create a `(key,value)` pair for every word.
    - We set `key = $word` and `value = 1`

- We use the `reduceByKey` function: all tuples with the same `key` are sent to the same `reducer`, which **aggregates** them to produce the result.
    - In our case, if more than one tuple with the same `key` exists, we **add** their `values`.

- Finally, we `sortBy` `value` (the number of occurrences) in descending order and print the result.
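
The steps above can be traced without a cluster. The following is a minimal plain-Python sketch that mimics what each RDD operation does to the data. It is not Spark code and runs on a single machine; the variable names `mapped`, `words`, `pairs`, and `counts` are only illustrative.

```python
from collections import defaultdict
from itertools import chain

# The single line of text.txt shown above
lines = ["this is a text file with random words like text , words , "
         "like this is an example of a text file"]

# map: one output element per input element -> a list of lists
mapped = [line.split(" ") for line in lines]

# flatMap: same function, but the per-line lists are flattened into one
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: turn every word into a (key, value) pair
pairs = [(word, 1) for word in words]

# reduceByKey: combine the values of identical keys with x + y
counts = defaultdict(int)
for key, value in pairs:
    counts[key] += value

# sortBy value, descending
wordcount = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(wordcount[0])  # ('text', 3)
```

Note that `mapped` holds one list (one per input line), while `words` holds the individual tokens, which is why the word count needs `flatMap`.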

### Example 2 - Simple Database
#### Data format
`employees.csv` contains, for each employee, the ID, the name, the salary, and the ID of the department the employee works in.

| ID          | Name        | Salary      | DepartmentID |
| ----------- | ----------- | ----------- | ------------ |
| 1           | George R    | 2000        | 1            |

`departments.csv` contains the ID and the name of each department.

| ID          | Name        |
| ----------- | ----------- |
| 1           | Dep A       |

The data is stored in S3: `s3://initial-notebook-data-bucket-dblab-905418150721/example_data/`.


#### QUERY 1: Find the 5 worst paid employees

In [2]:
# Implementation 1: RDD API
from pyspark.sql import SparkSession

sc = SparkSession \
    .builder \
    .appName("RDD query 1 execution") \
    .getOrCreate() \
    .sparkContext
    
# Load and process data
# CSV Columns: "id", "name", "salary", "dep_id"
employees = sc.textFile("s3://initial-notebook-data-bucket-dblab-905418150721/example_data/employees.csv") \
                .map(lambda x: (x.split(","))) # Split lines into a list of elements -> delimiter: ","
# print(employees.collect())

# Create tuples: (salary, [id, name, dep_id]); then sortByKey
# Column mapping:
#   x[0] = id
#   x[1] = name
#   x[2] = salary
#   x[3] = dep_id
sorted_employees = employees.map(lambda x: [int(x[2]), [x[0], x[1], x[3]]]) \
                    .sortByKey()

# Different output options
print(sorted_employees.take(5))
print("===========================================================================================================================")
for item in sorted_employees.coalesce(1).collect():
    print(item)


[(550, ['6', 'Jerry L', '3']), (1000, ['2', 'John K', '2']), (1000, ['7', 'Marios K', '1']), (1050, ['5', 'Helen K', '2']), (1500, ['10', 'Yiannis T', '1'])]
(550, ['6', 'Jerry L', '3'])
(1000, ['2', 'John K', '2'])
(1000, ['7', 'Marios K', '1'])
(1050, ['5', 'Helen K', '2'])
(1500, ['10', 'Yiannis T', '1'])
(2000, ['1', 'George R', '1'])
(2100, ['3', 'Mary T', '1'])
(2100, ['4', 'George T', '1'])
(2500, ['8', 'George K', '2'])
(2500, ['11', 'Antonis T', '2'])
(3500, ['9', 'Vasilios D', '3'])

Things to consider:
- `flatMap` vs `map`: why do we use `map` here?
- What does the second `lambda` function do?
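
One way to explore these two questions is to replay the transformations on a few rows in plain Python. This is a minimal sketch, not Spark code: the three CSV lines are reconstructed from the output above, assuming the column order `id,name,salary,dep_id` given in the code comments.

```python
from itertools import chain

# Three lines as they would appear in employees.csv (reconstructed, see above)
csv_lines = ["1,George R,2000,1", "2,John K,1000,2", "6,Jerry L,550,3"]

# map: each line becomes one list of fields, so row boundaries are kept
rows = [line.split(",") for line in csv_lines]

# flatMap: all fields end up in one flat list and the rows are lost
flat = list(chain.from_iterable(line.split(",") for line in csv_lines))

# Second lambda: reshape each row into (key, value) = (salary, [id, name, dep_id])
keyed = [(int(r[2]), [r[0], r[1], r[3]]) for r in rows]

# sortByKey: order the pairs by their key (ascending by default)
sorted_employees = sorted(keyed, key=lambda kv: kv[0])
print(sorted_employees[0])  # (550, ['6', 'Jerry L', '3'])

# Without int(), salaries would be compared as strings, character by character
print(sorted(["2000", "1000", "550"]))  # ['1000', '2000', '550']
```

With `flatMap` there would be no way to tell which salary belongs to which employee, and without the `int()` conversion the lowest salary would sort last.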

In [3]:
# Implementation 2: DataFrame API
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, IntegerType, FloatType, StringType
from pyspark.sql.functions import col

spark = SparkSession \
    .builder \
    .appName("DF query 1 execution") \
    .getOrCreate()

# Define the schema for the employees DataFrame
employees_schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("salary", FloatType()),
    StructField("dep_id", IntegerType()),
])


employees_df = spark.read.csv(
    "s3://initial-notebook-data-bucket-dblab-905418150721/example_data/employees.csv",
    header=False,
    schema=employees_schema)
# Use "printSchema()" to display the datatypes of dataframes:
employees_df.printSchema()

# Alternative way to read csv:
employees_df = spark.read.format('csv') \
    .options(header='false') \
    .schema(employees_schema) \
    .load("s3://initial-notebook-data-bucket-dblab-905418150721/example_data/employees.csv")

sorted_employees_df = employees_df.sort(col("salary"))
# Use "explain()" to display the physical plan:
sorted_employees_df.explain(mode="formatted")
sorted_employees_df.show(5)

print("===========================================================================================================================")
# To experiment more with the optimizer:
print(sorted_employees_df._jdf.queryExecution().logical().toString())  # Get logical plan
print(sorted_employees_df._jdf.queryExecution().optimizedPlan().toString())  # Get optimized plan
print(sorted_employees_df._jdf.queryExecution().executedPlan().toString())  # Get physical plan

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: float (nullable = true)
 |-- dep_id: integer (nullable = true)

== Physical Plan ==
AdaptiveSparkPlan (4)
+- Sort (3)
   +- Exchange (2)
      +- Scan csv  (1)


(1) Scan csv 
Output [4]: [id#19, name#20, salary#21, dep_id#22]
Batched: false
Location: InMemoryFileIndex [s3://initial-notebook-data-bucket-dblab-905418150721/example_data/employees.csv]
ReadSchema: struct<id:int,name:string,salary:float,dep_id:int>

(2) Exchange
Input [4]: [id#19, name#20, salary#21, dep_id#22]
Arguments: rangepartitioning(salary#21 ASC NULLS FIRST, 1000), ENSURE_REQUIREMENTS, [plan_id=7]

(3) Sort
Input [4]: [id#19, name#20, salary#21, dep_id#22]
Arguments: [salary#21 ASC NULLS FIRST], true, 0

(4) AdaptiveSparkPlan
Output [4]: [id#19, name#20, salary#21, dep_id#22]
Arguments: isFinalPlan=false


+---+---------+------+------+
| id|     name|salary|dep_id|
+---+---------+------+------+
|  6|  Jerry L| 550.0|     3|
|  

Remember to use `explain()` to check if the physical plan is what you expect.

#### QUERY 2: Find the 3 best paid employees from "Dep A"


In [4]:
# Implementation 1: RDD API
from pyspark.sql import SparkSession

sc = SparkSession \
    .builder \
    .appName("RDD query 2 execution") \
    .getOrCreate() \
    .sparkContext

    
employees = sc.textFile("s3://initial-notebook-data-bucket-dblab-905418150721/example_data/employees.csv") \
                .map(lambda x: (x.split(","))) # → [emp_id, emp_name, salary, dep_id]
departments = sc.textFile("s3://initial-notebook-data-bucket-dblab-905418150721/example_data/departments.csv") \
                .map(lambda x: (x.split(","))) # → [id, dpt_name]

# =======================
# SCHEMA DETAILS:
# employees:   "emp_id", "emp_name", "salary", "dep_id"
# departments: "id", "dpt_name"
#
# Contents of employees RDD per row:
#   x[0] = emp_id
#   x[1] = emp_name
#   x[2] = salary
#   x[3] = dep_id
#
# Contents of departments RDD per row:
#   x[0] = id
#   x[1] = dpt_name
# =======================

# Filter & only keep departments named "Dep A"
depA = departments.filter(lambda x: x[1] == "Dep A")

# (k, v) -> (dep_id, [emp_id, emp_name, salary])
employees_formatted = employees.map(lambda x: [x[3] , [x[0],x[1],x[2]] ] )
# (k, v) -> (id, [dpt_name])
depA_formatted = depA.map(lambda x: [x[0], [x[1]]])
# print(employees_formatted.collect())
# print(depA_formatted.collect())


# Can you guess the format of the joined data?
joined_data = employees_formatted.join(depA_formatted)
# print(joined_data.collect())

get_employees = joined_data.map(lambda x: (x[1][0]))
# print(get_employees.collect())

sorted_employees= get_employees.map(lambda x: [int(x[2]),[x[0], x[1]] ] ) \
                                .sortByKey(ascending=False)
print(sorted_employees.take(3))

[(2100, ['3', 'Mary T']), (2100, ['4', 'George T']), (2000, ['1', 'George R'])]
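
To see what format the `join` produces, the following minimal plain-Python sketch emulates an inner join on `(key, value)` pairs. It is not Spark code: `inner_join` is a hypothetical helper written for illustration, and the input rows are a subset reconstructed from the outputs above. An RDD `join` yields elements of the form `(key, (left_value, right_value))`, which is why the query then selects `x[1][0]` to get back the employee fields.

```python
# (dep_id, [emp_id, emp_name, salary]) pairs, a subset of the employees
employees_formatted = [
    ("1", ["1", "George R", "2000"]),
    ("2", ["2", "John K", "1000"]),
    ("1", ["3", "Mary T", "2100"]),
    ("1", ["4", "George T", "2100"]),
    ("3", ["6", "Jerry L", "550"]),
]
# (id, [dpt_name]) pairs that survived the "Dep A" filter
depA_formatted = [("1", ["Dep A"])]


def inner_join(left, right):
    """Hypothetical helper: emit (key, (left_value, right_value)) for matching keys."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]


joined_data = inner_join(employees_formatted, depA_formatted)
print(joined_data[0])  # ('1', (['1', 'George R', '2000'], ['Dep A']))

# x[1][0] picks the employee part of each joined element
get_employees = [kv[1][0] for kv in joined_data]

# Re-key by salary and keep the three highest
keyed = [(int(e[2]), [e[0], e[1]]) for e in get_employees]
top3 = sorted(keyed, key=lambda kv: kv[0], reverse=True)[:3]
print(top3)
```

Only employees whose `dep_id` matches a key in `depA_formatted` appear in the result. In Spark the relative order of rows with equal salaries is not guaranteed, whereas Python's `sorted` keeps their input order.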

In [5]:
# Implementation 2: SQL API
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, IntegerType, FloatType, StringType
spark = SparkSession \
    .builder \
    .appName("DF query 2 execution") \
    .getOrCreate()

employees_schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("salary", FloatType()),
    StructField("dep_id", IntegerType()),
])

employees_df = spark.read.format('csv') \
    .options(header='false') \
    .schema(employees_schema) \
    .load("s3://initial-notebook-data-bucket-dblab-905418150721/example_data/employees.csv")

departments_schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
])

departments_df = spark.read.format('csv') \
    .options(header='false') \
    .schema(departments_schema) \
    .load("s3://initial-notebook-data-bucket-dblab-905418150721/example_data/departments.csv")

# To utilize as SQL tables
employees_df.createOrReplaceTempView("employees")
departments_df.createOrReplaceTempView("departments")

### USE TEMPORARY TABLE depA ###
id_query = "SELECT departments.id, departments.name \
    FROM departments \
    WHERE departments.name == 'Dep A'"

depA_id = spark.sql(id_query)
depA_id.createOrReplaceTempView("depA")
inner_join_query = "SELECT employees.name, employees.salary \
    FROM employees \
    INNER JOIN depA ON employees.dep_id == depA.id \
    ORDER BY employees.salary DESC"
joined_data = spark.sql(inner_join_query)
################################
### OR NOT ###
# inner_join_query = """
#     SELECT employees.name, employees.salary
#     FROM employees
#     INNER JOIN departments ON employees.dep_id == departments.id
#     WHERE departments.name == 'Dep A'
#     ORDER BY employees.salary DESC
# """
# joined_data = spark.sql(inner_join_query)
################################
joined_data.show(3)
joined_data.explain(mode="formatted")

+--------+------+
|    name|salary|
+--------+------+
|  Mary T|2100.0|
|George T|2100.0|
|George R|2000.0|
+--------+------+
only showing top 3 rows

== Physical Plan ==
AdaptiveSparkPlan (11)
+- Sort (10)
   +- Exchange (9)
      +- Project (8)
         +- BroadcastHashJoin Inner BuildRight (7)
            :- Filter (2)
            :  +- Scan csv  (1)
            +- BroadcastExchange (6)
               +- Project (5)
                  +- Filter (4)
                     +- Scan csv  (3)


(1) Scan csv 
Output [3]: [name#50, salary#51, dep_id#52]
Batched: false
Location: InMemoryFileIndex [s3://initial-notebook-data-bucket-dblab-905418150721/example_data/employees.csv]
PushedFilters: [IsNotNull(dep_id)]
ReadSchema: struct<name:string,salary:float,dep_id:int>

(2) Filter
Input [3]: [name#50, salary#51, dep_id#52]
Condition : isnotnull(dep_id#52)

(3) Scan csv 
Output [2]: [id#57, name#58]
Batched: false
Location: InMemoryFileIndex [s3://initial-notebook-data-bucket-dblab-905418150721/ex

### Example 3 - Simple Database with a twist (DataFrame UDFs)

Sometimes we need to define functions that process the values of specific columns of a single row.

Motivating example: a database with salaries and bonuses for our employees:

| ID          | Name        | Salary      | DepartmentID | Bonus        |
| ----------- | ----------- | ----------- | ------------ | ------------ |
| 1           | George R    | 2000        | 1            | 500          |

We want to calculate the total yearly income for each of them: `14 x Salary + Bonus`
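
As a quick check of the formula, here is a minimal plain-Python sketch applied to the sample row in the table (salary 2000, bonus 500):

```python
def calculate_yearly_income(salary, bonus):
    # 14 salary payments per year plus the bonus
    return 14 * salary + bonus


print(calculate_yearly_income(2000, 500))  # 28500
```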

In [6]:
# Implementation 1: RDD API
from pyspark.sql import SparkSession

sc = SparkSession \
    .builder \
    .appName("RDD query 3 execution") \
    .getOrCreate() \
    .sparkContext
    
employees = sc.textFile("s3://initial-notebook-data-bucket-dblab-905418150721/example_data/employees2.csv") \
                .map(lambda x: (x.split(",")))
# print(employees.collect())

employees_yearly = employees.map(lambda x: [x[1], 14*(int(x[2]))+int(x[4])])

                    
print(employees_yearly.collect())

[['George R', 28500], ['John K', 14150], ['Mary T', 29850], ['George T', 29720], ['Helen K', 14900], ['Jerry L', 7900], ['Marios K', 14550], ['George K', 36500], ['Vasilios D', 50000], ['Yiannis T', 21450], ['Antonis T', 35620]]

In [7]:
# Implementation 2: DataFrame API
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, IntegerType, FloatType, StringType
from pyspark.sql.functions import col, udf


spark = SparkSession.builder \
    .appName("DataFrame query 3 execution (UDF example)") \
    .getOrCreate()
employees2_schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("salary", FloatType()),
    StructField("dep_id", IntegerType()),
    StructField("bonus", FloatType())
])


employees_df = spark.read.csv("s3://initial-notebook-data-bucket-dblab-905418150721/example_data/employees2.csv", header=False, schema=employees2_schema)



###  WITH UDF  ###
def calculate_yearly_income(salary, bonus):
    return 14*salary+bonus
# Register the UDF
calculate_yearly_income_udf = udf(calculate_yearly_income, FloatType())
employees_yearly_df = employees_df \
    .withColumn("yearly", calculate_yearly_income_udf(col("salary"), col("bonus"))).select("name", "yearly")
##################

### WITHOUT UDF ###
# employees_yearly_df = employees_df \
#     .withColumn("yearly", (14*col("salary")+col("bonus"))).select("name", "yearly")
####################

employees_yearly_df.show()

+----------+-------+
|      name| yearly|
+----------+-------+
|  George R|28500.0|
|    John K|14150.0|
|    Mary T|29850.0|
|  George T|29720.0|
|   Helen K|14900.0|
|   Jerry L| 7900.0|
|  Marios K|14550.0|
|  George K|36500.0|
|Vasilios D|50000.0|
| Yiannis T|21450.0|
| Antonis T|35620.0|
+----------+-------+