# <center> <img src="../../labs/img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
### <center> **Procesamiento de Datos Masivos** </center>
---
### <center> **Primavera 2025** </center>
---
**Primer Examen**

**Fecha**: 14 de Marzo del 2025

**Nombre del estudiante**: Marco Albanese

**Professor**: Pablo Camarillo Ramirez

In [1]:
import findspark
findspark.init()

In [3]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkSQL-Exam-1-Marco-Albanese") \
    .master("spark://cd68d43f7ac6:7077") \
    .config("spark.ui.port","4040") \
    .getOrCreate()
sc = spark.sparkContext

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/14 13:39:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


1. **Read the Data (10 points):**

Load the `employees.csv` and `departments.csv` files into PySpark DataFrames.

In [64]:
from equipo_mcqueen.spark_utils import SparkUtils

employees_schema = SparkUtils.generate_schema([("employee_id", "IntegerType"), ("employee_info", "StringType")])
departments_schema = SparkUtils.generate_schema([("department_id", "IntegerType"), ("department_name", "StringType"), ("location", "StringType")])

employees = spark.read.schema(employees_schema).option("header", "true").csv("/home/jovyan/notebooks/data/employees.csv")
departments = spark.read.schema(departments_schema).option("header", "true").csv("/home/jovyan/notebooks/data/departments.csv")

employees.printSchema()
departments.printSchema()

root
 |-- employee_id: integer (nullable = true)
 |-- employee_info: string (nullable = true)

root
 |-- department_id: integer (nullable = true)
 |-- department_name: string (nullable = true)
 |-- location: string (nullable = true)



2. **Extract the employee info from the JSON column (20 points):**

Extract the following the columns: name (string), department_id (integer), salary (double), and hire_date (date) from the employee_info column

In [65]:
from pyspark.sql.functions import get_json_object

employees = employees.withColumn("name", get_json_object(employees.employee_info, "$.name"))
employees = employees.withColumn("department_id", get_json_object(employees.employee_info, "$.department_id"))
employees = employees.withColumn("salary", get_json_object(employees.employee_info, "$.salary"))
employees = employees.withColumn("hire_date", get_json_object(employees.employee_info, "$.hire_date"))

employees = employees.drop("employee_info")

employees.show(3)

+-----------+-------+-------------+---------+----------+
|employee_id|   name|department_id|   salary| hire_date|
+-----------+-------+-------------+---------+----------+
|          1|Caitlyn|          103|115959.78|2002-06-10|
|          2| Rachel|          104|100820.16|2009-07-01|
|          3| Carrie|          105|114421.44|1998-12-10|
+-----------+-------+-------------+---------+----------+
only showing top 3 rows



3. **Join Operations (10 points):**

Join the `employees` DataFrame with the `departments` DataFrame on `department_id` to enrich the employee data with department details.

In [66]:
employees = employees.join(departments, employees.department_id == departments.department_id, "inner")
employees.show(3)

+-----------+-------+-------------+---------+----------+-------------+-------------------+--------+
|employee_id|   name|department_id|   salary| hire_date|department_id|    department_name|location|
+-----------+-------+-------------+---------+----------+-------------+-------------------+--------+
|          1|Caitlyn|          103|115959.78|2002-06-10|          103|Sales and Marketing| Chicago|
|          2| Rachel|          104|100820.16|2009-07-01|          104|   Data Engineering| Zapopan|
|          3| Carrie|          105|114421.44|1998-12-10|          105|       Data Science| Seattle|
+-----------+-------+-------------+---------+----------+-------------+-------------------+--------+
only showing top 3 rows



4. **Transformations (Using `when()`) (10 points):**

- Add a new column `salary_category` to the enriched `employee` DataFrame:
  - If `salary` is greater than or equal to **55000**, set `salary_category` to "High".
  - Otherwise, set `salary_category` to "Low".

In [67]:
from pyspark.sql.functions import when

employees = employees.withColumn("salary_category", when(employees.salary >= 55000, "High").otherwise("Low"))
employees.show(5)

+-----------+---------+-------------+---------+----------+-------------+-------------------+-------------+---------------+
|employee_id|     name|department_id|   salary| hire_date|department_id|    department_name|     location|salary_category|
+-----------+---------+-------------+---------+----------+-------------+-------------------+-------------+---------------+
|          1|  Caitlyn|          103|115959.78|2002-06-10|          103|Sales and Marketing|      Chicago|           High|
|          2|   Rachel|          104|100820.16|2009-07-01|          104|   Data Engineering|      Zapopan|           High|
|          3|   Carrie|          105|114421.44|1998-12-10|          105|       Data Science|      Seattle|           High|
|          4|    Renee|          104| 54688.13|1995-03-17|          104|   Data Engineering|      Zapopan|            Low|
|          5|Gabriella|          109|106267.03|1995-02-09|          109|   Customer Service|San Francisco|           High|
+-----------+---

5. **Filter and Group (20 points):**

- Create two new data frames: one that filters employees with a “High” salary and another that filters employees with a “Low” salary.

In [68]:
employees_salaries = employees.select("employee_id", "name", "salary", "department_name", "salary_category")

employees_high_salaries = employees_salaries.filter(employees_salaries.salary_category == "High")
employees_high_salaries.show(3)

employees_low_salaries = employees_salaries.filter(employees_salaries.salary_category == "Low")
employees_low_salaries.show(3)

+-----------+-------+---------+-------------------+---------------+
|employee_id|   name|   salary|    department_name|salary_category|
+-----------+-------+---------+-------------------+---------------+
|          1|Caitlyn|115959.78|Sales and Marketing|           High|
|          2| Rachel|100820.16|   Data Engineering|           High|
|          3| Carrie|114421.44|       Data Science|           High|
+-----------+-------+---------+-------------------+---------------+
only showing top 3 rows

+-----------+--------+--------+--------------------+---------------+
|employee_id|    name|  salary|     department_name|salary_category|
+-----------+--------+--------+--------------------+---------------+
|          4|   Renee|54688.13|    Data Engineering|            Low|
|          7|Jonathan|39323.42|Finance and Accou...|            Low|
|         13|    Lisa|36032.49|    Data Engineering|            Low|
+-----------+--------+--------+--------------------+---------------+
only showing top

- Calculate the average salary **per department** for the two newly created data frames, which contain the salaries of employees categorized as “High” and “Low.”  Resulting data frame for this transformation should contain only **department_name** and **avg_salary** columns

In [69]:
from pyspark.sql.functions import avg

employees_high_salaries_avg = employees_high_salaries.groupBy("department_name").agg(avg("salary").alias("avg_salary"))
employees_high_salaries_avg.show(3)

employees_low_salaries_avg = employees_low_salaries.groupBy("department_name").agg(avg("salary").alias("avg_salary"))
employees_low_salaries_avg.show(3)

+--------------------+------------------+
|     department_name|        avg_salary|
+--------------------+------------------+
|Corporate Strateg...|102741.38324414717|
| Sales and Marketing|100839.65275449108|
|    Data Engineering|101626.29492163012|
+--------------------+------------------+
only showing top 3 rows

+--------------------+------------------+
|     department_name|        avg_salary|
+--------------------+------------------+
|Corporate Strateg...|41590.741833333326|
| Sales and Marketing| 41150.40277777778|
|    Data Engineering| 41358.50794117647|
+--------------------+------------------+
only showing top 3 rows



6. **Sort (10 points):**
- Find the Top 5 employees with highest salaries from employees categorized as “**High**”

In [70]:
top_high_salaries = employees_high_salaries.select("name", "salary")
top_high_salaries = top_high_salaries.orderBy(top_high_salaries.salary.desc()).limit(5)
top_high_salaries.show()

+--------+--------+
|    name|  salary|
+--------+--------+
|Jennifer| 99968.9|
|     Ana|99937.19|
|  Thomas|99807.14|
| Raymond|99793.82|
|  Monica|99777.47|
+--------+--------+



- Find the Top 5 employees with highest salaries from employees categorized as “**Low**”

In [71]:
top_low_salaries = employees_low_salaries.select("name", "salary")
top_low_salaries = top_low_salaries.orderBy(top_low_salaries.salary.desc()).limit(5)
top_low_salaries.show()

+-----+--------+
| name|  salary|
+-----+--------+
|Linda|54993.53|
|Tammy|54991.71|
|Aaron|54989.45|
|Craig| 54945.2|
|Aaron| 54937.3|
+-----+--------+



7. **Find the number of employees with more years in the company (15 points).**
- Compute a new column with the years in company for each employee

In [72]:
from pyspark.sql.functions import datediff, current_date, floor

employment_years = employees.select("employee_id", "name", "hire_date")

employment_years = employment_years.withColumn("employment_years", floor(datediff(current_date(), "hire_date") / 365))
employment_years.show(3)

+-----------+-------+----------+----------------+
|employee_id|   name| hire_date|employment_years|
+-----------+-------+----------+----------------+
|          1|Caitlyn|2002-06-10|              22|
|          2| Rachel|2009-07-01|              15|
|          3| Carrie|1998-12-10|              26|
+-----------+-------+----------+----------------+
only showing top 3 rows



- Find the list of employees with more years in company and **count** them.

In [73]:
max_employment_years = employment_years.groupBy("employment_years").count()
max_employment_years = max_employment_years.orderBy(max_employment_years.employment_years.desc()).limit(1)
max_employment_years.show()

+----------------+-----+
|employment_years|count|
+----------------+-----+
|              37|    2|
+----------------+-----+



8. **Display the  Spark Plan (5 points):**

- Add to your Notebook an Screenshot of the DAG associated with one of the actions of your spark application. Your name should be visible and it should be the name of the application.

# <center> <img src="./DAG 68 Examen Marco Albanese.png"> </center>