# PySpark SQL and UDFs

## Using SQL Queries in PySpark
- You can run SQL queries directly against PySpark DataFrames once they are registered as temporary views.
- This is helpful for complex queries or when moving over from SQL-based workflows.

In [2]:
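from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TestSQL").getOrCreate()

# load the CSV, taking column names from the header row and inferring column types
df = spark.read.csv('./resources/7_employee.csv', header=True, inferSchema=True)

# register the DataFrame as a temporary view so SQL can refer to it by name
df.createOrReplaceTempView('employees')

# spark.sql() returns a new DataFrame holding the query result
sql = spark.sql('select * from employees')
sql.show()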

+-----------+---------+----------+------+
|employee_id|     name|department|salary|
+-----------+---------+----------+------+
|          1|     John|        HR| 55000|
|          2|     Jane|   Finance| 80000|
|          3|    James|        HR| 60000|
|          4|     Anna|   Finance| 90000|
|          5|      Bob| Marketing| 75000|
|          6|    Emily| Marketing| 82000|
|          7|    David|        HR| 65000|
|          8|   George|   Finance| 95000|
|          9|   Olivia| Marketing| 68000|
|         10|     Liam|        HR| 54000|
|         11|   Sophia|   Finance| 85000|
|         12|    Lucas| Marketing| 78000|
|         13| Isabella|   Finance| 92000|
|         14|    Mason|        HR| 63000|
|         15|   Amelia| Marketing| 79000|
|         16|    Ethan|        HR| 67000|
|         17|  Abigail|   Finance| 87000|
|         18|    Aiden|        HR| 56000|
|         19|Charlotte| Marketing| 81000|
|         20|     Jack|        HR| 69000|
+-----------+---------+----------+------+

## Using UDFs (User-Defined Functions)
- When PySpark's built-in functions aren't enough, you can write your own UDFs.
- A UDF wraps an ordinary Python function so that it can be applied to the columns of a PySpark DataFrame.

In [9]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# define a simple UDF that derives a 10% bonus from the salary
def calculate_bonus(salary):
    # null salaries arrive as None; pass them through instead of raising a TypeError
    if salary is None:
        return None
    return salary * 0.1

# wrap the function as a UDF; the second argument declares the return type,
# e.g. StringType(), FloatType(), DoubleType()
bonus_udf = udf(calculate_bonus, FloatType())

spark = SparkSession.builder.appName("TestSQL").getOrCreate()

df = spark.read.csv('./resources/7_employee.csv', header=True, inferSchema=True)

df_with_bonus = df.withColumn('bonus', bonus_udf(df['salary']))
df_with_bonus.show()
df_with_bonus.printSchema()

+-----------+---------+----------+------+------+
|employee_id|     name|department|salary| bonus|
+-----------+---------+----------+------+------+
|          1|     John|        HR| 55000|5500.0|
|          2|     Jane|   Finance| 80000|8000.0|
|          3|    James|        HR| 60000|6000.0|
|          4|     Anna|   Finance| 90000|9000.0|
|          5|      Bob| Marketing| 75000|7500.0|
|          6|    Emily| Marketing| 82000|8200.0|
|          7|    David|        HR| 65000|6500.0|
|          8|   George|   Finance| 95000|9500.0|
|          9|   Olivia| Marketing| 68000|6800.0|
|         10|     Liam|        HR| 54000|5400.0|
|         11|   Sophia|   Finance| 85000|8500.0|
|         12|    Lucas| Marketing| 78000|7800.0|
|         13| Isabella|   Finance| 92000|9200.0|
|         14|    Mason|        HR| 63000|6300.0|
|         15|   Amelia| Marketing| 79000|7900.0|
|         16|    Ethan|        HR| 67000|6700.0|
|         17|  Abigail|   Finance| 87000|8700.0|
|         18|    Aid