# PySpark Introductory Lab Exercise

Prepared by: **Mr. Hieng MAO**

Date: **02 April 2025**

Source Code: [Here](https://github.com/maohieng/learn_ai/blob/main/larg_data/Hieng_MAO_Spark_lab.ipynb)

## Install PySpark

In [11]:
!pip install pyspark



## Create a Spark Session

In [12]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySpark Exercise").getOrCreate()

print(spark)

<pyspark.sql.session.SparkSession object at 0x7f02a78a5910>


## Create a DataFrame

In [13]:
data = [
    ("Alice", "HR", 50000),
    ("Bob", "Engineering", 75000),
    ("Cathy", "Marketing", 60000),
]

columns = ["Name", "Department", "Salary"]

df = spark.createDataFrame(data, columns)

df.show()

+-----+-----------+------+
| Name| Department|Salary|
+-----+-----------+------+
|Alice|         HR| 50000|
|  Bob|Engineering| 75000|
|Cathy|  Marketing| 60000|
+-----+-----------+------+



## Perform Data Transformation

In [14]:
# Keep only employees earning at least 55000
high_earner = df.filter(df.Salary >= 55000)
print("Employees with salary >= 55000:")
high_earner.show()

# Add a column for a 10% bonus
df_with_bonus = df.withColumn("Bonus", df.Salary * 0.1)
print("DataFrame with bonus:")
df_with_bonus.show()

Employees with salary >= 55000:
+-----+-----------+------+
| Name| Department|Salary|
+-----+-----------+------+
|  Bob|Engineering| 75000|
|Cathy|  Marketing| 60000|
+-----+-----------+------+

DataFrame with bonus:
+-----+-----------+------+------+
| Name| Department|Salary| Bonus|
+-----+-----------+------+------+
|Alice|         HR| 50000|5000.0|
|  Bob|Engineering| 75000|7500.0|
|Cathy|  Marketing| 60000|6000.0|
+-----+-----------+------+------+



Show only the employees in the "Engineering" department:

In [15]:
engineering_department = df.filter(df.Department == "Engineering")
print("Employees in the Engineering department:")
engineering_department.show()

Employees in the Engineering department:
+----+-----------+------+
|Name| Department|Salary|
+----+-----------+------+
| Bob|Engineering| 75000|
+----+-----------+------+



## Explore the DataFrame

In [16]:
# Count the number of rows
row_count = df.count()
print("Total number of employees:", row_count)

# Print the schema of the DataFrame
print("DataFrame schema:")
df.printSchema()

Total number of employees: 3
DataFrame schema:
root
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- Salary: long (nullable = true)



## Challenge Questions

In [19]:
# Add a Total_Compensation column (Salary + Bonus)
df_with_total_compensation = df_with_bonus.withColumn("Total_Compensation", df_with_bonus.Salary + df_with_bonus.Bonus)

# Show the DataFrame with the new column
print("DataFrame with new column:")
df_with_total_compensation.show()

DataFrame with new column:
+-----+-----------+------+------+------------------+
| Name| Department|Salary| Bonus|Total_Compensation|
+-----+-----------+------+------+------------------+
|Alice|         HR| 50000|5000.0|           55000.0|
|  Bob|Engineering| 75000|7500.0|           82500.0|
|Cathy|  Marketing| 60000|6000.0|           66000.0|
+-----+-----------+------+------+------------------+
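On a dataset this small, the Spark result can be cross-checked with plain Python. The sketch below needs no Spark session; the names `employees`, `BONUS_RATE`, and `total_compensation` are hypothetical helpers introduced only for this check.

```python
# Plain-Python cross-check of the Total_Compensation values (no Spark needed)
employees = [("Alice", 50000), ("Bob", 75000), ("Cathy", 60000)]
BONUS_RATE = 0.1  # the 10% bonus used in the lab


def total_compensation(salary, rate=BONUS_RATE):
    """Return salary plus a bonus of `rate` times the salary."""
    bonus = salary * rate
    return salary + bonus


expected = {name: total_compensation(salary) for name, salary in employees}
print(expected)
# {'Alice': 55000.0, 'Bob': 82500.0, 'Cathy': 66000.0}
```

The values agree with the `Total_Compensation` column shown in the table above.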



In [20]:
# Sort the DataFrame by Salary in descending order
sorted_df = df_with_total_compensation.orderBy(df_with_total_compensation.Salary.desc())

# Show the sorted DataFrame
print("Sorted DataFrame by Salary:")
sorted_df.show()

Sorted DataFrame by Salary:
+-----+-----------+------+------+------------------+
| Name| Department|Salary| Bonus|Total_Compensation|
+-----+-----------+------+------+------------------+
|  Bob|Engineering| 75000|7500.0|           82500.0|
|Cathy|  Marketing| 60000|6000.0|           66000.0|
|Alice|         HR| 50000|5000.0|           55000.0|
+-----+-----------+------+------+------------------+



In [21]:
# Show only employees whose names start with "A"
employees_start_with_a = df.filter(df.Name.startswith("A"))
print("Employees whose names start with 'A':")
employees_start_with_a.show()

Employees whose names start with 'A':
+-----+----------+------+
| Name|Department|Salary|
+-----+----------+------+
|Alice|        HR| 50000|
+-----+----------+------+



## Clean Up

In [22]:
spark.stop()