# <center> <img src="../../labs/img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
### <center> **Procesamiento de Datos Masivos** </center>
---
### <center> **Spring 2025** </center>
---
**First Exam**

**Date**: March 14, 2025

**Student name**: Alberto Renteria Camacho

**Professor**: Pablo Camarillo Ramirez

In [25]:
import findspark
findspark.init()

In [26]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkSQL-Exam-1-ALBERTO-RENTERIA") \
    .master("spark://70f60bba1584:7077") \
    .config("spark.ui.port","4040") \
    .getOrCreate()
sc = spark.sparkContext

Read the Data (10 points):
    Load the employees.csv and departments.csv files into PySpark DataFrames.

In [27]:
from team_name.spark_utils import SparkUtils

employees_schema = SparkUtils.generate_schema([("employee_id", "integer"), ("employee_info", "string")])

employees_df = spark.read \
                .schema(employees_schema) \
                .option("header", "true") \
                .csv("/home/jovyan/notebooks/data/employees.csv")

employees_df.printSchema()

root
 |-- employee_id: integer (nullable = true)
 |-- employee_info: string (nullable = true)



In [28]:
from team_name.spark_utils import SparkUtils

departments_schema = SparkUtils.generate_schema([("department_id", "integer"), ("department_name", "string"), ("location", "string")])

departments_df = spark.read \
                .schema(departments_schema) \
                .option("header", "true") \
                .csv("/home/jovyan/notebooks/data/departments.csv")

departments_df.printSchema()

root
 |-- department_id: integer (nullable = true)
 |-- department_name: string (nullable = true)
 |-- location: string (nullable = true)



Extract the employee info from the JSON column (20 points):
    Extract the following columns from the employee_info column: name (string), department_id (integer), salary (double), and hire_date (date).

In [29]:
from pyspark.sql.functions import get_json_object, col
from pyspark.sql.types import IntegerType, DoubleType, DateType

employees_df = employees_df.withColumn("name", get_json_object(col("employee_info"), "$.name")) \
    .withColumn("department_id", get_json_object(col("employee_info"), "$.department_id").cast(IntegerType())) \
    .withColumn("salary", get_json_object(col("employee_info"), "$.salary").cast(DoubleType())) \
    .withColumn("hire_date", get_json_object(col("employee_info"), "$.hire_date").cast(DateType()))


Join Operations (10 points):
    Join the employees DataFrame with the departments DataFrame on department_id to enrich the employee data with department details.

In [30]:
employees_info_df = employees_df.join(departments_df, employees_df["department_id"] == departments_df["department_id"], "left")
employees_info_df.show(5)

+-----------+--------------------+---------+-------------+---------+----------+-------------+-------------------+-------------+
|employee_id|       employee_info|     name|department_id|   salary| hire_date|department_id|    department_name|     location|
+-----------+--------------------+---------+-------------+---------+----------+-------------+-------------------+-------------+
|          1|{'name': 'Caitlyn...|  Caitlyn|          103|115959.78|2002-06-10|          103|Sales and Marketing|      Chicago|
|          2|{'name': 'Rachel'...|   Rachel|          104|100820.16|2009-07-01|          104|   Data Engineering|      Zapopan|
|          3|{'name': 'Carrie'...|   Carrie|          105|114421.44|1998-12-10|          105|       Data Science|      Seattle|
|          4|{'name': 'Renee',...|    Renee|          104| 54688.13|1995-03-17|          104|   Data Engineering|      Zapopan|
|          5|{'name': 'Gabriel...|Gabriella|          109|106267.03|1995-02-09|          109|   Customer

Transformations (Using when()) (10 points):
Add a new column salary_category to the enriched employee DataFrame:
- If salary is greater than or equal to 55000, set salary_category to "High".
- Otherwise, set salary_category to "Low".

In [31]:
from pyspark.sql.functions import when

employees_info_df = employees_info_df.withColumn(
    "salary_category",
    when(employees_info_df["salary"] >= 55000, "High").otherwise("Low")
)

employees_info_df.show(5)

+-----------+--------------------+---------+-------------+---------+----------+-------------+-------------------+-------------+---------------+
|employee_id|       employee_info|     name|department_id|   salary| hire_date|department_id|    department_name|     location|salary_category|
+-----------+--------------------+---------+-------------+---------+----------+-------------+-------------------+-------------+---------------+
|          1|{'name': 'Caitlyn...|  Caitlyn|          103|115959.78|2002-06-10|          103|Sales and Marketing|      Chicago|           High|
|          2|{'name': 'Rachel'...|   Rachel|          104|100820.16|2009-07-01|          104|   Data Engineering|      Zapopan|           High|
|          3|{'name': 'Carrie'...|   Carrie|          105|114421.44|1998-12-10|          105|       Data Science|      Seattle|           High|
|          4|{'name': 'Renee',...|    Renee|          104| 54688.13|1995-03-17|          104|   Data Engineering|      Zapopan|         



Filter and Group (20 points):
- Create two new DataFrames: one with the employees whose salary is “High” and another with the employees whose salary is “Low”.
- Calculate the average salary per department for each of the two new DataFrames. The resulting DataFrame for each transformation should contain only the department_name and avg_salary columns.



In [None]:
# Import only avg: importing sum, max, and min by name would shadow the Python built-ins.
from pyspark.sql.functions import avg

employees_high_df = employees_info_df.filter(employees_info_df["salary_category"] == "High")
employees_low_df = employees_info_df.filter(employees_info_df["salary_category"] == "Low")

high_avg = employees_high_df.select(["department_name", "salary"]).groupBy("department_name").agg(avg("salary").alias("avg_salary"))
low_avg = employees_low_df.select(["department_name", "salary"]).groupBy("department_name").agg(avg("salary").alias("avg_salary"))

Sort (10 points): 
- Find the Top 5 employees with highest salaries from employees categorized as “High”
- Find the Top 5 employees with highest salaries from employees categorized as “Low”

In [47]:
high_salary = employees_high_df.orderBy(employees_high_df["salary"].desc()).limit(5)
low_salary = employees_low_df.orderBy(employees_low_df["salary"].desc()).limit(5)

high_salary.show()
low_salary.show()

+-----------+--------------------+---------+-------------+---------+----------+-------------+--------------------+-----------+---------------+
|employee_id|       employee_info|     name|department_id|   salary| hire_date|department_id|     department_name|   location|salary_category|
+-----------+--------------------+---------+-------------+---------+----------+-------------+--------------------+-----------+---------------+
|       1778|{'name': 'Gabriel...|Gabriella|          101|149989.73|2018-09-14|          101|     Human Resources|  San Diego|           High|
|       3621|{'name': 'Katheri...|Katherine|          101| 149979.3|2017-07-26|          101|     Human Resources|  San Diego|           High|
|        346|{'name': 'Ryan', ...|     Ryan|          110| 149963.1|1990-07-03|          110|Corporate Strateg...|Los Angeles|           High|
|       3807|{'name': 'Caitlyn...|  Caitlyn|          107|149956.54|2000-07-27|          107|               Legal|    Chicago|           High|

Find the number of employees with the most years in the company (15 points).
- Compute a new column with the years in the company for each employee.
- Find the list of employees with the most years in the company and count them.

In [None]:
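from pyspark.sql.functions import col, current_date, datediff
from pyspark.sql.functions import max as spark_max  # alias avoids shadowing the built-in max

# Approximate whole years in the company from the days elapsed since hire_date.
employees_info_df = employees_info_df.withColumn(
    "years_in_company",
    (datediff(current_date(), employees_info_df["hire_date"]) / 365).cast(IntegerType())
)

# Highest tenure across all employees, collected as a plain Python value.
max_years = employees_info_df.agg(spark_max("years_in_company")).first()[0]

# Employees whose tenure equals that maximum, and how many there are.
max_years_df = employees_info_df.filter(col("years_in_company") == max_years)

max_years_df.show()
print(max_years_df.count())
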



Display the Spark Plan (5 points):
- Add to your notebook a screenshot of the DAG associated with one of the actions of your Spark application. Your name should be visible, and it should be the name of the application.

