# Final Project: Data Analysis using Spark

Estimated time needed: **60** minutes

This final project is similar to the Practice Project you completed, but no hints or solutions are provided. You will create a DataFrame by loading data from a CSV file and then apply transformations and actions using Spark SQL. Complete the following tasks:

- Task 1: Generate DataFrame from CSV data.
- Task 2: Define a schema for the data.
- Task 3: Display schema of DataFrame.
- Task 4: Create a temporary view.
- Task 5: Execute an SQL query.
- Task 6: Calculate Average Salary by Department.
- Task 7: Filter and Display IT Department Employees.
- Task 8: Add 10% Bonus to Salaries.
- Task 9: Find Maximum Salary by Age.
- Task 10: Self-Join on Employee Data.
- Task 11: Calculate Average Employee Age.
- Task 12: Calculate Total Salary by Department.
- Task 13: Sort Data by Age and Salary.
- Task 14: Count Employees in Each Department.
- Task 15: Filter Employees with the letter 'o' in the Name.


In [1]:
import findspark
findspark.init()

In [2]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

In [3]:
sc = SparkContext.getOrCreate()

spark = SparkSession.builder.appName("Python Spark Dataframes").config("spark.sql.legacy.timeParserPolicy", "LEGACY").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/13 11:50:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
employees_df = spark.read.csv('employees.csv', header=True, inferSchema=True)
employees_df.show(5)

+------+--------+------+---+----------+
|Emp_No|Emp_Name|Salary|Age|Department|
+------+--------+------+---+----------+
|   198|  Donald|  2600| 29|        IT|
|   199| Douglas|  2600| 34|     Sales|
|   200|Jennifer|  4400| 36| Marketing|
|   201| Michael| 13000| 32|        IT|
|   202|     Pat|  6000| 39|        HR|
+------+--------+------+---+----------+
only showing top 5 rows



In [5]:
employees_df.printSchema()

root
 |-- Emp_No: integer (nullable = true)
 |-- Emp_Name: string (nullable = true)
 |-- Salary: integer (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Department: string (nullable = true)



In [6]:
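# createOrReplaceTempView lets this cell be re-run; createTempView raises an error if the view already exists
employees_df.createOrReplaceTempView("employees")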

In [7]:
spark.sql("SELECT * FROM employees WHERE Age > 30").show()

+------+-----------+------+---+----------+
|Emp_No|   Emp_Name|Salary|Age|Department|
+------+-----------+------+---+----------+
|   199|    Douglas|  2600| 34|     Sales|
|   200|   Jennifer|  4400| 36| Marketing|
|   201|    Michael| 13000| 32|        IT|
|   202|        Pat|  6000| 39|        HR|
|   203|      Susan|  6500| 36| Marketing|
|   205|    Shelley| 12008| 33|   Finance|
|   206|    William|  8300| 37|        IT|
|   100|     Steven| 24000| 39|        IT|
|   102|        Lex| 17000| 37| Marketing|
|   103|  Alexander|  9000| 39| Marketing|
|   104|      Bruce|  6000| 38|        IT|
|   105|      David|  4800| 39|        IT|
|   106|      Valli|  4800| 38|     Sales|
|   107|      Diana|  4200| 35|     Sales|
|   109|     Daniel|  9000| 35|        HR|
|   110|       John|  8200| 31| Marketing|
|   111|     Ismael|  7700| 32|        IT|
|   112|Jose Manuel|  7800| 34|        HR|
|   113|       Luis|  6900| 34|     Sales|
|   116|     Shelli|  2900| 37|   Finance|
+------+-----------+------+---+----------+
only showing top 20 rows

In [8]:
spark.sql("SELECT Department, ROUND(AVG(Salary),2) AS Average_Salary FROM employees GROUP BY Department").show()

+----------+--------------+
|Department|Average_Salary|
+----------+--------------+
|     Sales|       5492.92|
|        HR|        5837.5|
|   Finance|        5730.8|
| Marketing|       6633.33|
|        IT|        7400.0|
+----------+--------------+



In [9]:
IT_df = spark.sql("SELECT * FROM employees WHERE Department = 'IT'")
IT_df.show()


+------+--------+------+---+----------+
|Emp_No|Emp_Name|Salary|Age|Department|
+------+--------+------+---+----------+
|   198|  Donald|  2600| 29|        IT|
|   201| Michael| 13000| 32|        IT|
|   206| William|  8300| 37|        IT|
|   100|  Steven| 24000| 39|        IT|
|   104|   Bruce|  6000| 38|        IT|
|   105|   David|  4800| 39|        IT|
|   111|  Ismael|  7700| 32|        IT|
|   129|   Laura|  3300| 38|        IT|
|   132|      TJ|  2100| 34|        IT|
|   136|   Hazel|  2200| 29|        IT|
+------+--------+------+---+----------+



In [10]:
from pyspark.sql import functions as F 
import pandas as pd

In [11]:
@F.pandas_udf("float")
def salary_bonus(s: pd.Series) -> pd.Series:
    return s * 1.1

spark.udf.register("salary_after_bonus", salary_bonus)

<pyspark.sql.udf.UserDefinedFunction at 0x11a9b7160>

In [12]:
employees_df_with_bonus = spark.sql("SELECT *, salary_after_bonus(Salary) AS Salary_After_Bonus FROM employees")
employees_df_with_bonus.show()

+------+---------+------+---+----------+------------------+
|Emp_No| Emp_Name|Salary|Age|Department|Salary_After_Bonus|
+------+---------+------+---+----------+------------------+
|   198|   Donald|  2600| 29|        IT|            2860.0|
|   199|  Douglas|  2600| 34|     Sales|            2860.0|
|   200| Jennifer|  4400| 36| Marketing|            4840.0|
|   201|  Michael| 13000| 32|        IT|           14300.0|
|   202|      Pat|  6000| 39|        HR|            6600.0|
|   203|    Susan|  6500| 36| Marketing|            7150.0|
|   204|  Hermann| 10000| 29|   Finance|           11000.0|
|   205|  Shelley| 12008| 33|   Finance|           13208.8|
|   206|  William|  8300| 37|        IT|            9130.0|
|   100|   Steven| 24000| 39|        IT|           26400.0|
|   101|    Neena| 17000| 27|     Sales|           18700.0|
|   102|      Lex| 17000| 37| Marketing|           18700.0|
|   103|Alexander|  9000| 39| Marketing|            9900.0|
|   104|    Bruce|  6000| 38|        IT|

In [13]:
employees_df.groupBy('Age').agg(F.max('Salary').alias('Max_Salary')).show()

+---+----------+
|Age|Max_Salary|
+---+----------+
| 31|      8200|
| 34|      7800|
| 28|     12008|
| 27|     17000|
| 26|      3600|
| 37|     17000|
| 35|      9000|
| 39|     24000|
| 38|      6000|
| 29|     10000|
| 32|     13000|
| 33|     12008|
| 30|      8000|
| 36|      7900|
+---+----------+



In [14]:
spark.sql("SELECT Age, MAX(Salary) FROM employees GROUP BY Age").show()

+---+-----------+
|Age|max(Salary)|
+---+-----------+
| 31|       8200|
| 34|       7800|
| 28|      12008|
| 27|      17000|
| 26|       3600|
| 37|      17000|
| 35|       9000|
| 39|      24000|
| 38|       6000|
| 29|      10000|
| 32|      13000|
| 33|      12008|
| 30|       8000|
| 36|       7900|
+---+-----------+



In [15]:
employees_df.agg(F.avg('Age')).show()

+--------+
|avg(Age)|
+--------+
|   33.56|
+--------+



In [17]:
spark.sql("SELECT AVG(Age) FROM employees").show()

+--------+
|avg(Age)|
+--------+
|   33.56|
+--------+



In [18]:
employees_df.groupBy('Department').agg(F.sum('Salary').alias('Total Salary by Department')).show()

+----------+--------------------------+
|Department|Total Salary by Department|
+----------+--------------------------+
|     Sales|                     71408|
|        HR|                     46700|
|   Finance|                     57308|
| Marketing|                     59700|
|        IT|                     74000|
+----------+--------------------------+



In [20]:
spark.sql("SELECT Department, SUM(Salary) AS Total_Salary_by_Department FROM employees GROUP BY Department").show()

+----------+--------------------------+
|Department|Total_Salary_by_Department|
+----------+--------------------------+
|     Sales|                     71408|
|        HR|                     46700|
|   Finance|                     57308|
| Marketing|                     59700|
|        IT|                     74000|
+----------+--------------------------+



In [21]:
employees_df_sorted = employees_df.orderBy(F.asc('Age'), F.desc('Salary'))
employees_df_sorted.show()

+------+---------+------+---+----------+
|Emp_No| Emp_Name|Salary|Age|Department|
+------+---------+------+---+----------+
|   137|   Renske|  3600| 26| Marketing|
|   101|    Neena| 17000| 27|     Sales|
|   114|      Den| 11000| 27|   Finance|
|   108|    Nancy| 12008| 28|     Sales|
|   130|    Mozhe|  2800| 28| Marketing|
|   126|    Irene|  2700| 28|        HR|
|   204|  Hermann| 10000| 29|   Finance|
|   115|Alexander|  3100| 29|   Finance|
|   134|  Michael|  2900| 29|     Sales|
|   198|   Donald|  2600| 29|        IT|
|   140|   Joshua|  2500| 29|   Finance|
|   136|    Hazel|  2200| 29|        IT|
|   120|  Matthew|  8000| 30|        HR|
|   110|     John|  8200| 31| Marketing|
|   127|    James|  2400| 31|        HR|
|   201|  Michael| 13000| 32|        IT|
|   111|   Ismael|  7700| 32|        IT|
|   119|    Karen|  2500| 32|   Finance|
|   205|  Shelley| 12008| 33|   Finance|
|   124|    Kevin|  5800| 33| Marketing|
+------+---------+------+---+----------+
only showing top 20 rows

In [22]:
spark.sql("SELECT * FROM employees ORDER BY Age ASC, Salary DESC").show()

+------+---------+------+---+----------+
|Emp_No| Emp_Name|Salary|Age|Department|
+------+---------+------+---+----------+
|   137|   Renske|  3600| 26| Marketing|
|   101|    Neena| 17000| 27|     Sales|
|   114|      Den| 11000| 27|   Finance|
|   108|    Nancy| 12008| 28|     Sales|
|   130|    Mozhe|  2800| 28| Marketing|
|   126|    Irene|  2700| 28|        HR|
|   204|  Hermann| 10000| 29|   Finance|
|   115|Alexander|  3100| 29|   Finance|
|   134|  Michael|  2900| 29|     Sales|
|   198|   Donald|  2600| 29|        IT|
|   140|   Joshua|  2500| 29|   Finance|
|   136|    Hazel|  2200| 29|        IT|
|   120|  Matthew|  8000| 30|        HR|
|   110|     John|  8200| 31| Marketing|
|   127|    James|  2400| 31|        HR|
|   201|  Michael| 13000| 32|        IT|
|   111|   Ismael|  7700| 32|        IT|
|   119|    Karen|  2500| 32|   Finance|
|   205|  Shelley| 12008| 33|   Finance|
|   124|    Kevin|  5800| 33| Marketing|
+------+---------+------+---+----------+
only showing top 20 rows
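
The two-key ordering used above (Age ascending, then Salary descending within each age) can be illustrated in plain Python. This is only a sketch of the ordering rule, not of how Spark sorts internally; the tuples are a handful of rows copied from the output above, and `rows` and `ordered` are illustrative names.

```python
# (Emp_Name, Salary, Age) tuples copied from the sorted output above, shuffled.
rows = [
    ("Den", 11000, 27),
    ("Renske", 3600, 26),
    ("Irene", 2700, 28),
    ("Neena", 17000, 27),
    ("Nancy", 12008, 28),
]

# Sort by Age ascending; negate Salary so higher salaries come first within an age.
ordered = sorted(rows, key=lambda r: (r[2], -r[1]))

for name, salary, age in ordered:
    print(age, salary, name)
```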

In [23]:
employees_df.groupBy('Department').agg(F.count('Department').alias('Employee per Department')).show()

+----------+-----------------------+
|Department|Employee per Department|
+----------+-----------------------+
|     Sales|                     13|
|        HR|                      8|
|   Finance|                     10|
| Marketing|                      9|
|        IT|                     10|
+----------+-----------------------+



In [25]:
spark.sql("SELECT Department, COUNT(Department) AS Employee_per_Department FROM employees GROUP BY Department").show()

+----------+-----------------------+
|Department|Employee_per_Department|
+----------+-----------------------+
|     Sales|                     13|
|        HR|                      8|
|   Finance|                     10|
| Marketing|                      9|
|        IT|                     10|
+----------+-----------------------+
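
The outputs of Tasks 6, 12, and 14 can be cross-checked against each other, since the average salary per department should equal the total salary divided by the employee count. The following plain-Python sketch uses only the figures printed in the tables above; the dictionary names are illustrative.

```python
import math

# Figures copied from the Task 12 (total), Task 14 (count), and Task 6 (average) outputs.
totals = {"Sales": 71408, "HR": 46700, "Finance": 57308, "Marketing": 59700, "IT": 74000}
counts = {"Sales": 13, "HR": 8, "Finance": 10, "Marketing": 9, "IT": 10}
reported_avg = {"Sales": 5492.92, "HR": 5837.5, "Finance": 5730.8, "Marketing": 6633.33, "IT": 7400.0}

# Recompute each average from total / count, rounded to 2 decimals like ROUND(AVG(Salary), 2).
derived_avg = {dept: round(totals[dept] / counts[dept], 2) for dept in totals}

for dept in totals:
    match = math.isclose(derived_avg[dept], reported_avg[dept], abs_tol=1e-9)
    print(f"{dept}: derived={derived_avg[dept]} reported={reported_avg[dept]} match={match}")

print("Total employees:", sum(counts.values()))
```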



In [26]:
employees_df.filter(F.col('Emp_Name').like("%o%")).show()

+------+-----------+------+---+----------+
|Emp_No|   Emp_Name|Salary|Age|Department|
+------+-----------+------+---+----------+
|   198|     Donald|  2600| 29|        IT|
|   199|    Douglas|  2600| 34|     Sales|
|   110|       John|  8200| 31| Marketing|
|   112|Jose Manuel|  7800| 34|        HR|
|   130|      Mozhe|  2800| 28| Marketing|
|   133|      Jason|  3300| 38|     Sales|
|   139|       John|  2700| 36|     Sales|
|   140|     Joshua|  2500| 29|   Finance|
+------+-----------+------+---+----------+



In [27]:
spark.sql("SELECT * FROM employees WHERE Emp_Name LIKE '%o%' ").show()

+------+-----------+------+---+----------+
|Emp_No|   Emp_Name|Salary|Age|Department|
+------+-----------+------+---+----------+
|   198|     Donald|  2600| 29|        IT|
|   199|    Douglas|  2600| 34|     Sales|
|   110|       John|  8200| 31| Marketing|
|   112|Jose Manuel|  7800| 34|        HR|
|   130|      Mozhe|  2800| 28| Marketing|
|   133|      Jason|  3300| 38|     Sales|
|   139|       John|  2700| 36|     Sales|
|   140|     Joshua|  2500| 29|   Finance|
+------+-----------+------+---+----------+

