## Job Profile Analysis

#### 1 - Please load the dataset into a Spark dataframe

In [1]:
%%time

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, asc, desc
from modules.common import get_flattened_job_profile_data

spark = SparkSession.builder.appName("job-profile-analysis").getOrCreate()

# JSON schemas are always inferred on read; "inferSchema" is a CSV reader option
df = spark.read.json("test_data/*.json")

# flatten the df to make analysis easier
df = get_flattened_job_profile_data(df)

df.show(10)

+--------------------+---------+--------+--------------------+
|                  id|firstName|lastName|           jobDetail|
+--------------------+---------+--------+--------------------+
|e23c7ab2-6479-401...|Elizabeth| Robledo|{2013-03-13, Pert...|
|a4c6238d-0aed-4eb...|    Karen|   Bozek|{2013-10-12, Hoba...|
|a4c6238d-0aed-4eb...|    Karen|   Bozek|{2011-11-25, Hoba...|
|a4c6238d-0aed-4eb...|    Karen|   Bozek|{2008-11-18, Hoba...|
|a4c6238d-0aed-4eb...|    Karen|   Bozek|{2006-09-02, Hoba...|
|a4c6238d-0aed-4eb...|    Karen|   Bozek|{2003-07-19, Hoba...|
|a4c6238d-0aed-4eb...|    Karen|   Bozek|{2001-01-26, Hoba...|
|a4c6238d-0aed-4eb...|    Karen|   Bozek|{2000-04-14, Hoba...|
|a4c6238d-0aed-4eb...|    Karen|   Bozek|{1996-08-04, Hoba...|
|dcbae85f-4971-4fd...|     Lisa|   Grell|{2015-07-14, Bris...|
+--------------------+---------+--------+--------------------+
only showing top 10 rows

CPU times: user 528 ms, sys: 52.3 ms, total: 581 ms
Wall time: 1min 23s


#### 2 - Print the schema

In [2]:
df.printSchema()

root
 |-- id: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- jobDetail: struct (nullable = true)
 |    |-- fromDate: string (nullable = true)
 |    |-- location: string (nullable = true)
 |    |-- salary: long (nullable = true)
 |    |-- title: string (nullable = true)
 |    |-- toDate: string (nullable = true)



#### 3 - How many records are there in the dataset?

In [3]:
df.count()

77135383

#### 4 - What is the average salary for each profile?
##### Display the first 10 results, ordered by lastName in descending order

In [4]:
%%time

from modules.dataframes_by_profile import get_average_salaries_by_profile

get_average_salaries_by_profile(df) \
    .orderBy(desc('avgSalary')) \
    .limit(10) \
    .show()

+--------------------+---------+----------+---------+
|                  id|firstName|  lastName|avgSalary|
+--------------------+---------+----------+---------+
|01603b1b-34cd-49c...|   Hector|     Myers| 159000.0|
|5a0a4a63-cdc4-4db...|    Daren| Bjorklund| 159000.0|
|0159f5b4-87e7-40f...|       Ha|   Pearson| 159000.0|
|e9a4feb9-490b-4c3...|  Christy|    Packer| 159000.0|
|00ec95b7-8ae5-457...|     Tana|Lethbridge| 159000.0|
|d28f679e-26c8-4a4...|      Eva|   Barrese| 159000.0|
|01c195ad-2ae3-4eb...|    Karen|   Kilgore| 159000.0|
|9a7072e2-7023-491...|  Pauline|   Wallace| 159000.0|
|0186b17d-97eb-43c...|   Freida|   Collier| 159000.0|
|84ce538f-0803-432...|  William|   Cernoch| 159000.0|
+--------------------+---------+----------+---------+

CPU times: user 484 ms, sys: 64.3 ms, total: 549 ms
Wall time: 3min 5s


#### 5 - What is the average salary across the whole dataset?

In [5]:
%%time

from modules.dataframes_by_profile import get_average_salary_for_all_profiles

get_average_salary_for_all_profiles(df).show()

+---------+
|avgSalary|
+---------+
| 97473.62|
+---------+

CPU times: user 482 ms, sys: 57.6 ms, total: 540 ms
Wall time: 2min 45s


#### 6 - On average, what are the top 5 paying jobs? Bottom 5 paying jobs?
##### If there is a tie, please order by title, ~~location~~.

In [6]:
%%time

from modules.dataframes_by_title import get_average_salaries_by_job_title

result = get_average_salaries_by_job_title(df)

print('Top 5 paying jobs')
result.orderBy(desc('avgSalary'), 'jobTitle').limit(5).show()

print('Bottom 5 paying jobs')
result.orderBy(asc('avgSalary'), 'jobTitle').limit(5).show()

Top 5 paying jobs
+--------------------+---------+
|            jobTitle|avgSalary|
+--------------------+---------+
|      internal sales| 97555.94|
|  service technician| 97539.87|
|     support analyst| 97515.95|
|clinical psycholo...| 97515.49|
|             dentist| 97515.09|
+--------------------+---------+

Bottom 5 paying jobs
+--------------------+---------+
|            jobTitle|avgSalary|
+--------------------+---------+
|business developm...| 97410.55|
|    research analyst| 97412.93|
|retail sales cons...| 97419.07|
|administration of...| 97423.83|
|           paralegal| 97432.44|
+--------------------+---------+

CPU times: user 1.08 s, sys: 160 ms, total: 1.24 s
Wall time: 5min 56s


#### 7 - Who is currently making the most money?
##### If there is a tie, order by lastName descending, then fromDate descending.

In [7]:
%%time

from modules.common import get_max_rows_for_column
from modules.dataframes_by_profile import get_current_salaries_by_profile

result = get_current_salaries_by_profile(df)
result = get_max_rows_for_column(result, 'currentSalary')
result.show()

+--------------------+---------+--------+-------------+
|                  id|firstName|lastName|currentSalary|
+--------------------+---------+--------+-------------+
|986bd01f-ce46-419...| Clifford|  Gaines|       159000|
|c931437e-fff3-44c...|     Vera|  Foster|       159000|
|f24dd42a-a7d1-40b...|      Bob|  Barron|       159000|
|6e239288-f398-4c5...|    Marie|   Davis|       159000|
|1bcfde3d-0ded-4e7...| Danielle|   Lopez|       159000|
|7c45c326-90cf-45b...|   Rachel|  Morton|       159000|
|a875bc9e-77bc-424...|  Rebecca|  Carter|       159000|
|712d1c3e-f86b-4cc...|   Amelia|Pressley|       159000|
|699660ff-3f27-4aa...|   Stacey|   Lundy|       159000|
|ac4ba9d2-2ca1-466...| Virginia|  Cawyer|       159000|
|a06acd0e-1a32-40d...|   Dennis|  Jacobo|       159000|
|a77956d0-6c32-4e7...|    Cathy|  Hodges|       159000|
|842c26cc-53ad-44e...|      Eva|   Wolff|       159000|
|ee5bd600-31d7-45e...|     Alma| Santana|       159000|
|4cd6e9e1-e494-40e...|     Anna|   Roque|       

#### 8 - What was the most popular job title that started in 2019?

In [8]:
%%time

from modules.dataframes_by_title import get_most_popular_job_titles

get_most_popular_job_titles(df, 2019).show(1)

+-----+-------------+----------+
|title|firstSeenDate|occurrence|
+-----+-------------+----------+
+-----+-------------+----------+

CPU times: user 1.63 s, sys: 274 ms, total: 1.9 s
Wall time: 7min 24s


#### 9 - How many people are currently working?

In [9]:
%%time

from pyspark.sql.functions import countDistinct
from modules.common import get_all_current_jobs

get_all_current_jobs(df) \
    .select(countDistinct('id').alias('count_of_current_people_working')) \
    .show()

+-------------------------------+
|count_of_current_people_working|
+-------------------------------+
|                        7710613|
+-------------------------------+

CPU times: user 332 ms, sys: 44.1 ms, total: 376 ms
Wall time: 2min 38s


#### 10 - For each person, list only their latest job
##### Display the first 10 results, ordered by lastName descending, then firstName ascending.

In [10]:
%%time

from modules.dataframes_by_profile import get_most_recent_jobs_by_profile

get_most_recent_jobs_by_profile(df) \
    .orderBy(desc('lastName'), asc('firstName')) \
    .limit(10) \
    .show()

+--------------------+---------+--------+--------------------+
|                  id|firstName|lastName|           jobDetail|
+--------------------+---------+--------+--------------------+
|ba24222d-6e39-40d...|  Matthew|  Zywiec|{2017-04-23, Pert...|
|5894afab-574f-429...|  Richard|  Zywiec|{2018-07-23, Sydn...|
|82dab74c-3946-45b...|   Robert|  Zywiec|{2016-08-08, Adel...|
|4e26c80a-8e84-46f...|    Bobby| Zywicki|{2017-12-11, Pert...|
|f643f39c-e18a-430...|   Calvin| Zywicki|{2015-04-24, Adel...|
|03aeca24-7be1-42a...|  Charles| Zywicki|{2016-06-10, Sydn...|
|cc529ff4-2dbf-4ce...|  Cherryl| Zywicki|{2017-06-01, Pert...|
|296999c2-8951-405...|Christine| Zywicki|{2018-09-16, Melb...|
|f16672c0-424c-48c...|  Darlene| Zywicki|{2014-02-23, Adel...|
|eeb15ed5-fb0d-4d6...|    Donna| Zywicki|{2019-01-23, Bris...|
+--------------------+---------+--------+--------------------+

CPU times: user 1.78 s, sys: 193 ms, total: 1.97 s
Wall time: 10min 33s


#### 11 - For each person, list their highest paying job along with their first name, last name, salary and the year they made this salary
##### Store the results in a dataframe, and then print out 10 results

In [11]:
%%time

from modules.dataframes_by_profile import get_highest_paying_job_by_profile

df_result = get_highest_paying_job_by_profile(df)
df_result.show(10, truncate=False)  # show() prints 20 rows by default; the question asks for 10

+------------------------------------+---------+------------+------------------------------+----------------------+--------------------+
|id                                  |firstName|lastName    |highestPayingJobTitle         |highestPayingJobSalary|highestPayingJobYear|
+------------------------------------+---------+------------+------------------------------+----------------------+--------------------+
|00008a82-3345-419f-92ec-517bca432ba4|Alan     |Hawkins     |evaluator                     |87000                 |2016                |
|00008d2e-3760-4527-94ee-a8e79e1f9209|Alfred   |Siu         |devops engineer               |61000                 |2015                |
|0000aa9b-7c17-4894-8b7f-232c30b318b1|Juan     |Moss        |service technician            |85000                 |2018                |
|0000b622-2709-4ef3-b651-acd05b101634|Mark     |Roy         |Support Analyst               |83000                 |2018                |
|0000c8c2-af21-48dc-b41d-018e3ba8cd18|Way

#### 12 - Write out the last result (question 11) in Parquet format, compressed, partitioned by the year of the highest paying job

In [12]:
%%time

df_result.write.partitionBy('highestPayingJobYear') \
    .parquet('output_data/', compression='gzip', mode='overwrite')

CPU times: user 1.83 s, sys: 260 ms, total: 2.09 s
Wall time: 8min 19s
