In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder. \
    appName("pyspark-1"). \
    getOrCreate()

### Read data

In [3]:
df = spark.read.csv("/dataset/nyc-jobs.csv", header=True)
df.printSchema()

root
 |-- Job ID: string (nullable = true)
 |-- Agency: string (nullable = true)
 |-- Posting Type: string (nullable = true)
 |-- # Of Positions: string (nullable = true)
 |-- Business Title: string (nullable = true)
 |-- Civil Service Title: string (nullable = true)
 |-- Title Code No: string (nullable = true)
 |-- Level: string (nullable = true)
 |-- Job Category: string (nullable = true)
 |-- Full-Time/Part-Time indicator: string (nullable = true)
 |-- Salary Range From: string (nullable = true)
 |-- Salary Range To: string (nullable = true)
 |-- Salary Frequency: string (nullable = true)
 |-- Work Location: string (nullable = true)
 |-- Division/Work Unit: string (nullable = true)
 |-- Job Description: string (nullable = true)
 |-- Minimum Qual Requirements: string (nullable = true)
 |-- Preferred Skills: string (nullable = true)
 |-- Additional Information: string (nullable = true)
 |-- To Apply: string (nullable = true)
 |-- Hours/Shift: string (nullable = true)
 |-- Work Locatio

In [4]:
df_count = df.count()
df_count

2946

### Sample function

In [5]:
import sys
sys.path.insert(0, '..')

In [6]:
from utils.distinct_values import get_distinct_values
get_distinct_values(df=df, column='Salary Frequency')

['Annual', 'Daily', 'Hourly']

### Test get_distinct_values

In [7]:
sys.path.insert(0, '../tests')

In [8]:
import test_distinct_values as T

In [9]:
T.loadTest()

True

In [10]:
T.test_get_distinct_values()

From the printSchema output we can see that all columns were read as strings.  It's possible to pass `inferSchema=True` to `spark.read.csv`, but for large datasets this can take a long time because Spark needs an extra pass over the data to work out each column's type.

Instead, we can define the schema ourselves and use it to re-create df.
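As a rough sketch of what an explicit schema could look like: `spark.read.csv` accepts a DDL-formatted string for its `schema` argument, so the schema can be built as plain text. The column names are the 28 that this notebook reports later in `counts_map`; the choice of `INT` and `DOUBLE` for four of them is my assumption, and `build_ddl_schema` and `read_jobs` are hypothetical helper names. The date columns are left as `STRING` here because their values need checking first.

```python
# Column names in file order (with an explicit schema, CSV columns are matched by position).
COLUMNS = [
    "Job ID", "Agency", "Posting Type", "# Of Positions", "Business Title",
    "Civil Service Title", "Title Code No", "Level", "Job Category",
    "Full-Time/Part-Time indicator", "Salary Range From", "Salary Range To",
    "Salary Frequency", "Work Location", "Division/Work Unit",
    "Job Description", "Minimum Qual Requirements", "Preferred Skills",
    "Additional Information", "To Apply", "Hours/Shift", "Work Location 1",
    "Recruitment Contact", "Residency Requirement", "Posting Date",
    "Post Until", "Posting Updated", "Process Date",
]

# Assumed numeric columns; everything else stays STRING for now.
NUMERIC_TYPES = {
    "Job ID": "INT",
    "# Of Positions": "INT",
    "Salary Range From": "DOUBLE",
    "Salary Range To": "DOUBLE",
}


def build_ddl_schema(columns=COLUMNS, numeric_types=NUMERIC_TYPES):
    """Return a DDL schema string; backticks protect names with spaces or symbols."""
    return ", ".join(
        f"`{name}` {numeric_types.get(name, 'STRING')}" for name in columns
    )


def read_jobs(spark, path="/dataset/nyc-jobs.csv"):
    """Re-read the CSV with the explicit schema instead of inferSchema."""
    return spark.read.csv(path, header=True, schema=build_ddl_schema())
```

Calling `read_jobs(spark).printSchema()` should then show the four numeric columns with their new types, assuming every value in those columns really is numeric.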

First, let's take a look at the data.

In [11]:
sys.path.insert(0, '../utils')

In [12]:
import pre_processing_functions as PPF

In [13]:
help(PPF)

Help on module pre_processing_functions:

NAME
    pre_processing_functions

FUNCTIONS
    get_counts_map(df: pyspark.sql.dataframe.DataFrame) -> dict
        Return dict of DataFrame df's columns and their respective non-null counts. 
        
        Can be used to determine whether there are nulls in a dataframe, i.e.:
        If the count for any column != df.count() there are missing values
        (there could be a neat function to do this already, and I will use that
        when/if I find it, but for now this function is useful)
        
        Usage: 
        df = .....        
        counts_map = get_counts_map(df)
        print(counts_map)
        {  'Job ID': 2946,
            'Agency': 2946,
            'Posting Type': 2946,
            '# Of Positions': 2946,
            .
            .
        }
        
        :param df: input dataframe
        :return: dict/map of column: <count>

FILE
    /utils/pre_processing_functions.py




In [14]:
counts_map = PPF.get_counts_map(df)
counts_map

{'Job ID': 2946,
 'Agency': 2946,
 'Posting Type': 2946,
 '# Of Positions': 2946,
 'Business Title': 2946,
 'Civil Service Title': 2946,
 'Title Code No': 2946,
 'Level': 2946,
 'Job Category': 2944,
 'Full-Time/Part-Time indicator': 2751,
 'Salary Range From': 2946,
 'Salary Range To': 2946,
 'Salary Frequency': 2946,
 'Work Location': 2946,
 'Division/Work Unit': 2946,
 'Job Description': 2946,
 'Minimum Qual Requirements': 2928,
 'Preferred Skills': 2687,
 'Additional Information': 2383,
 'To Apply': 2766,
 'Hours/Shift': 1884,
 'Work Location 1': 1808,
 'Recruitment Contact': 1183,
 'Residency Requirement': 2268,
 'Posting Date': 2429,
 'Post Until': 1447,
 'Posting Updated': 2438,
 'Process Date': 2521}

In [15]:
len(df.columns)


28

In [16]:
desc = df.describe().toPandas().transpose()
df_count=df.count()
print(f"count() {df_count}")
desc[0].sort_values()

count() 2946


Recruitment Contact               1183
Post Until                        1447
Work Location 1                   1808
Hours/Shift                       1884
Residency Requirement             2268
Additional Information            2383
Posting Date                      2429
Posting Updated                   2438
Process Date                      2521
Preferred Skills                  2687
Full-Time/Part-Time indicator     2751
To Apply                          2766
Minimum Qual Requirements         2928
Job Category                      2944
Job Description                   2946
Work Location                     2946
Salary Frequency                  2946
Salary Range To                   2946
Salary Range From                 2946
Level                             2946
Title Code No                     2946
Civil Service Title               2946
Business Title                    2946
# Of Positions                    2946
Posting Type                      2946
Agency                   

In [17]:
desc.head()

Unnamed: 0,0,1,2,3,4
summary,count,mean,stddev,min,max
Job ID,2946,384821.5631364562,53075.33897715407,132292,97899
Agency,2946,,,ADMIN FOR CHILDREN'S SVCS,TEACHERS RETIREMENT SYSTEM
Posting Type,2946,,,External,Internal
# Of Positions,2946,2.4959266802443993,9.281312826466838,1,91


In [18]:
desc.loc["summary"]

0     count
1      mean
2    stddev
3       min
4       max
Name: summary, dtype: object

In [19]:
desc, desc.columns = desc[1:], desc.loc["summary"]

In [20]:
desc

summary,count,mean,stddev,min,max
Job ID,2946,384821.5631364562,53075.33897715407,132292,97899
Agency,2946,,,ADMIN FOR CHILDREN'S SVCS,TEACHERS RETIREMENT SYSTEM
Posting Type,2946,,,External,Internal
# Of Positions,2946,2.4959266802444,9.281312826466838,1,91
Business Title,2946,,,.NET DEVELOPER,executive Vice President for Operations
Civil Service Title,2946,,,ACCOUNTANT,YOUTH COORDINATOR (YOUTH SERVI
Title Code No,2946,35558.51334552102,28141.297679769723,0527A,95841
Level,2946,1.0531400966183575,1.1403671232078134,0,M7
Job Category,2944,,,Administration & Human Resources,"Technology, Data & Innovation Social Services"
Full-Time/Part-Time indicator,2751,,,F,P


#### Job ID


Are there any nulls?

In [21]:
df_count == counts_map["Job ID"]

True

There are no nulls, but are there any duplicates?

In [22]:
df.select('Job ID').distinct().count()

1661

Why do most Job IDs appear on 2 rows?  I'd guess that there is an original record, then an update.  Let's take a look and see...

In [23]:
df.withColumnRenamed('Job ID','JobID').createOrReplaceTempView('temp')

In [24]:
spark.sql("""
    select JobID,count(*) from temp
    group by JobID
    having count(*) > 1
    """).show()

+------+--------+
| JobID|count(1)|
+------+--------+
|239052|       2|
|406297|       2|
|406575|       2|
|409339|       2|
|400075|       2|
|233549|       2|
|277372|       2|
|365567|       2|
|412747|       2|
|385306|       2|
|393911|       2|
|396449|       2|
|403310|       2|
|423960|       2|
|426223|       2|
|393997|       2|
|403522|       2|
|404892|       2|
|424083|       2|
|407266|       2|
+------+--------+
only showing top 20 rows



In [25]:
df.sort("Job ID").limit(2).toPandas().transpose()

Unnamed: 0,0,1
Job ID,132292,132292
Agency,NYC HOUSING AUTHORITY,NYC HOUSING AUTHORITY
Posting Type,External,Internal
# Of Positions,52,52
Business Title,Maintenance Worker - Technical Services-Heatin...,Maintenance Worker - Technical Services-Heatin...
Civil Service Title,MAINTENANCE WORKER,MAINTENANCE WORKER
Title Code No,90698,90698
Level,0,0
Job Category,Maintenance & Operations,Maintenance & Operations
Full-Time/Part-Time indicator,F,F


The two rows for the first duplicated Job ID are near-identical copies (in the columns shown, only Posting Type differs) and not what I expected, which was an original record plus an updated one.  I'll remove true duplicates, i.e. rows where all columns are duplicated, then check if Job ID is now unique.

In [26]:
dedupd_df = df.distinct()

In [27]:
dedupd_df.count()

2915

Only 31 duplicated rows were removed, so now we need to investigate the duplicated Job IDs that remain.  From above we know that there are only 1661 unique IDs.

I cheated a bit here and used Excel to identify where the duplications are and which columns differ, causing `.distinct()` to keep most of these rows.  I found that these 4 columns have small changes:

Job Description, Minimum Qual Requirements, Preferred Skills and Additional Information

From the records I viewed, the changes were not functional, so by dropping the extra rows I won't lose any information required for the analysis.  Note that `dropDuplicates(["Job ID"])` keeps one arbitrary row per Job ID.



In [28]:
df.dropDuplicates(["Job ID"]).count()

1661

In [29]:
dedupd_df = df.dropDuplicates(["Job ID"])

I'm going to rebuild my small helper dict counts_map

In [30]:
counts_map = PPF.get_counts_map(dedupd_df)
print(counts_map)
desc = dedupd_df.describe().toPandas().transpose()
desc, desc.columns = desc[1:], desc.loc["summary"]

{'Job ID': 1661, 'Agency': 1661, 'Posting Type': 1661, '# Of Positions': 1661, 'Business Title': 1661, 'Civil Service Title': 1661, 'Title Code No': 1661, 'Level': 1661, 'Job Category': 1659, 'Full-Time/Part-Time indicator': 1548, 'Salary Range From': 1661, 'Salary Range To': 1661, 'Salary Frequency': 1661, 'Work Location': 1661, 'Division/Work Unit': 1661, 'Job Description': 1661, 'Minimum Qual Requirements': 1652, 'Preferred Skills': 1513, 'Additional Information': 1359, 'To Apply': 1566, 'Hours/Shift': 1080, 'Work Location 1': 1040, 'Recruitment Contact': 689, 'Residency Requirement': 1296, 'Posting Date': 1371, 'Post Until': 825, 'Posting Updated': 1378, 'Process Date': 1428}


In [31]:
df_count=dedupd_df.count()
df_count

1661

In [32]:
dedupd_df.withColumnRenamed('Job ID','JobID').createOrReplaceTempView('temp')

In [33]:
spark.sql("""
    with v1 as (
        select JobID,count(*) from temp
        group by JobID
        having count(*) > 1
    )
    select count(*) from v1
    """).show()

+--------+
|count(1)|
+--------+
|       0|
+--------+



Job ID Summary:
    
* Required for data analysis:  Yes: during pre-processing of the data, Job ID can be useful to home in on rows where the processing has created unexpected results.


#### Agency 

String and there are no missing values

* Required for data analysis:  Yes

>   What's the job posting having the highest salary per agency? 


In [34]:
df_count == counts_map["Agency"]

True

#### Posting Type 

Would be a categorical type as there are only 2 distinct values (none missing)

* Required for data analysis:  No

In [35]:
dedupd_df.groupBy("Posting Type").count().orderBy('count', ascending=False).limit(10).show()

+------------+-----+
|Posting Type|count|
+------------+-----+
|    Internal|  969|
|    External|  692|
+------------+-----+



#### "# Of Positions"

Should be int type (no missing values)

* Required for data analysis:  No

In [36]:
df_count == counts_map["# Of Positions"]

True

#### Business Title 

string type and no missing

* Required for data analysis:  No

In [37]:
df_count == counts_map["Business Title"]

True

#### Civil Service Title 

string type and no missing

* Required for data analysis:  No

In [38]:
df_count == counts_map["Civil Service Title"]

True

#### Title Code No 

String, as there are letters mixed in with numbers.
No missing values

* Required for data analysis:  No

In [39]:
df_count == counts_map["Title Code No"]

True

In [40]:
dedupd_df.select("Title Code No").show()

+-------------+
|Title Code No|
+-------------+
|        56058|
|        91717|
|        52366|
|        13632|
|        10009|
|         6798|
|        0608A|
|        1002D|
|        13652|
|        22425|
|        20210|
|         6776|
|        10009|
|        20415|
|        90910|
|        10079|
|        22508|
|        1007C|
|        56058|
|        8300A|
+-------------+
only showing top 20 rows



#### Level
String type
No missing values

* Required for data analysis:  No

In [41]:
df_count == counts_map["Level"]

True

In [42]:
dedupd_df.select("Level").orderBy("Level",ascending=False).limit(10).show()

+-----+
|Level|
+-----+
|   M7|
|   M7|
|   M7|
|   M7|
|   M7|
|   M7|
|   M7|
|   M7|
|   M6|
|   M5|
+-----+



In [43]:
dedupd_df.select("Level").orderBy("Level",ascending=True).limit(10).show()

+-----+
|Level|
+-----+
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
+-----+



#### Job Category

Summary:

- String type
- 2 missing values
- 131 distinct categories

* Required for data analysis:  YES

> What's the salary distribution per job category 
    - 3 line graph, min, max, avg - grouped by category

> What's the number of jobs posting per category (Top 10)
    - report
    - histogram

In [44]:
df_count == counts_map["Job Category"]

False

In [45]:
dedupd_df.groupBy("Job Category").count().show()

+--------------------+-----+
|        Job Category|count|
+--------------------+-----+
|Administration & ...|    1|
|Health Policy, Re...|    2|
|Administration & ...|    1|
|Finance, Accounti...|    1|
|Information Techn...|    1|
|Engineering, Arch...|    4|
|Legal Affairs Pol...|    3|
|Administration & ...|    1|
|Constituent Servi...|   68|
|Building Operatio...|   99|
|Engineering, Arch...|    2|
|Constituent Servi...|    4|
|Administration & ...|    1|
|       Legal Affairs|  120|
|Engineering, Arch...|    1|
|Finance, Accounti...|    3|
|Constituent Servi...|    1|
|Administration & ...|    1|
|Health Legal Affairs|    2|
|Administration & ...|    4|
+--------------------+-----+
only showing top 20 rows



In [46]:
dedupd_df.groupBy('Job Category').count().orderBy('count', ascending=False).limit(10).show()

+--------------------+-----+
|        Job Category|count|
+--------------------+-----+
|Engineering, Arch...|  260|
|Technology, Data ...|  182|
|       Legal Affairs|  120|
|Building Operatio...|   99|
|Finance, Accounti...|   98|
|Public Safety, In...|   98|
|Administration & ...|   88|
|              Health|   71|
|Constituent Servi...|   68|
|Policy, Research ...|   64|
+--------------------+-----+



In [47]:
dedupd_df.select('Job Category').distinct().count()

131

- There are 2 missing Job Category values

In [48]:
df_count - counts_map["Job Category"]

2

In [49]:
import pyspark.sql.functions as F

In [50]:
dedupd_df.where(F.col('Job Category').isNull()).toPandas().transpose()

Unnamed: 0,0,1
Job ID,97899,87990
Agency,DEPARTMENT OF BUSINESS SERV.,DEPARTMENT OF BUSINESS SERV.
Posting Type,Internal,Internal
# Of Positions,1,1
Business Title,"EXECUTIVE DIRECTOR, BUSINESS DEVELOPMENT",Account Manager
Civil Service Title,ADMINISTRATIVE BUSINESS PROMOT,CONTRACT REVIEWER (OFFICE OF L
Title Code No,10009,40563
Level,M3,1
Job Category,,
Full-Time/Part-Time indicator,F,


- There are only 2 records with null Job Category.  I'll update these to "not specified" [here](pre_processing_and_wrangling.ipynb#job_category)

#### Full-Time/Part-Time indicator

String with missing values

* Required for data analysis:  No


In [51]:
dedupd_df.groupBy("Full-Time/Part-Time indicator").count().show()

+-----------------------------+-----+
|Full-Time/Part-Time indicator|count|
+-----------------------------+-----+
|                            F| 1484|
|                         null|  113|
|                            P|   64|
+-----------------------------+-----+



- There are 113 records with null.  I'll update these to "not specified" if required for reporting

#### Salary Range From

* Required for data analysis:  YES

> multiple salary related questions to answer

Should be numeric, let's look at the data:

In [52]:
df_count - counts_map["Salary Range From"]

0

- Check that all values are numeric by selecting those that cannot be cast to float.

In [53]:
dedupd_df.select("Salary Range From").where(F.col("Salary Range From").cast('float').isNull()).show()

+-----------------+
|Salary Range From|
+-----------------+
+-----------------+



In [54]:
desc.loc["Salary Range From"]

summary
count                   1661
mean      58836.501277965064
stddev    26392.114682422845
min                        0
max                    99353
Name: Salary Range From, dtype: object
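One caveat when reading the `min` and `max` rows: every column was read as a string, so these values appear to follow string ordering rather than numeric ordering. The earlier `Job ID` row is the clearest sign, with a min of 132292 and a max of 97899. The small sketch below uses three Job IDs that appear in this notebook to show the difference; it is plain Python, not the Spark code path itself.

```python
# Three Job IDs that appear in this notebook, held as strings as in the DataFrame.
job_ids = ["132292", "97899", "400075"]

# String ordering compares character by character, so "9..." sorts after "4...".
string_min, string_max = min(job_ids), max(job_ids)

# Numeric ordering after casting gives the figures we actually want.
numeric_min, numeric_max = min(map(int, job_ids)), max(map(int, job_ids))

print(string_min, string_max)
print(numeric_min, numeric_max)
```

So the salary min and max should probably be recomputed after casting the columns to a numeric type before they are used in any report.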

#### Salary Range To

Should be numeric, let's look at the data:


* Required for data analysis:  YES

> multiple salary related questions to answer

In [55]:
df_count - counts_map["Salary Range To"]

0

In [56]:
dedupd_df.select("Salary Range To").where(F.col("Salary Range To").cast('float').isNull()).show()

+---------------+
|Salary Range To|
+---------------+
+---------------+



In [57]:
desc.loc["Salary Range To"]

summary
count                  1661
mean       85387.9378948224
stddev    42041.27995414808
min                   10.36
max                   99406
Name: Salary Range To, dtype: object

#### Salary Frequency

In [58]:
dedupd_df.groupBy("Salary Frequency").count().limit(10).show()

+----------------+-----+
|Salary Frequency|count|
+----------------+-----+
|          Annual| 1540|
|          Hourly|   99|
|           Daily|   22|
+----------------+-----+



- A function is required to put the salary columns on the same frequency scale.

Create a function to add new columns: "Freq Adjusted Salary Range From" and "Freq Adjusted Salary Range To"

#### Work Location

String

* Required for data analysis:  Yes

#### Division/Work Unit

* Required for data analysis:  No

#### Job Description

There are no missing rows

* Required for data analysis:  _Maybe_,...could be required to answer "what are the highest paid skills...."

**However looking through many rows of the data I am not sure how required skills can be extracted from this column.**  

Examples when using "skill" as part of a regex:

These matches do not indicate the skill requirement of the candidate:
- `Assist skill trades staff.`  
- `and linking employers with a skilled and qualified workforce.` 

This match does, but how could one categorise it in a form ready for a report:

- `kills:  The ideal candidate will have demonstrated success developing and implementing business driven programs and will have exhibited:     Strong management and leadership skills   Experience planning, implementing and managing projects involving diverse stakeholders ......`

At this point I'm not sure which column / derived column I'll need to answer "What are the highest paid skills in the US market?"


In [59]:
df_count - counts_map["Job Description"]

0

In [60]:
dedupd_df.select("Job Description").show(20,truncate=False)


#### Minimum Qual Requirements

String

* Required for data analysis:  YES

> multiple salary related questions to answer

Data will need to be processed, as this text field looks to be free text as opposed to a selection from a drop-down list.  That means there could be typos, abbreviations, etc.

Due to the free-text nature of this column a query like:
`df.groupBy("Minimum Qual Requirements")....` would not give meaningful groups, since near-identical requirements would land in separate groups.

In [61]:
dedupd_df.select("Minimum Qual Requirements").limit(20).show()

+-------------------------+
|Minimum Qual Requirements|
+-------------------------+
|     "1. A baccalaurea...|
|     (1) Five years of...|
|     1. A baccalaureat...|
|     "(1) A baccalaure...|
|     "1. A baccalaurea...|
|     "1. A baccalaurea...|
|     1. A baccalaureat...|
|     "1. A master's de...|
|     "Professional/ven...|
|     1. A baccalaureat...|
|     1.  A baccalaurea...|
|     Qualification Req...|
|     "1. A baccalaurea...|
|     "(1) Four (4) yea...|
|     1. Two years of f...|
|     "1. A four year h...|
|     "1.A baccalaureat...|
|     1. A baccalaureat...|
|     "1. A baccalaurea...|
|     "1. A baccalaurea...|
+-------------------------+



#### Preferred Skills

* Required for data analysis:  Yes to answer `What are the highest paid skills in the US market?`


In [62]:
dedupd_df.select("Preferred Skills").show(truncate=False)


This is a free text field.  I'm not sure how to get the best value from it for my analysis.  To be honest, I noticed the requirement `What are the highest paid skills in the US market?` at the last hour, so to speak, so I'm not sure what I will do; see [Reports](Reports.ipynb#highest_paid_skills)

#### Additional Information

* Required for data analysis:  No


#### To Apply

* Required for data analysis:  No


#### Hours/Shift

* Required for data analysis:  No.
    
Will need to calculate salary in the case of hourly or daily paid employees and ignore this column

In [63]:
dedupd_df.select(["Hours/Shift","Salary Frequency"]).show()

+--------------------+----------------+
|         Hours/Shift|Salary Frequency|
+--------------------+----------------+
|"Click on the ""A...|          Annual|
|                null|           Daily|
|                null|          Annual|
| you must explain...|          Annual|
|Day - Due to the ...|          Annual|
|Day - Due to the ...|          Annual|
|                null|          Annual|
| as described in ...|          Annual|
|                null|          Annual|
|                null|          Annual|
|35 Hours/To be de...|          Annual|
|                null|          Annual|
|                null|          Annual|
|Appointments are ...|          Annual|
|  40 hours / Various|          Annual|
|                null|          Annual|
|        architecture|          Annual|
|                null|          Annual|
|       Apply Online.|          Annual|
|        architecture|          Annual|
+--------------------+----------------+
only showing top 20 rows



In [64]:
dedupd_df.select(["Hours/Shift","Salary Frequency"]).where(F.col('Salary Frequency') == 'Hourly').toPandas()

Unnamed: 0,Hours/Shift,Salary Frequency
0,,Hourly
1,,Hourly
2,,Hourly
3,Up to 17 hours/week while school is in session...,Hourly
4,,Hourly
...,...,...
94,,Hourly
95,,Hourly
96,This is a part time position. 20 hours per week.,Hourly
97,Monday-Friday 35 hours per week,Hourly


In [65]:
dedupd_df.select(["Hours/Shift","Salary Frequency"]).where(F.col('Salary Frequency') == 'Daily').toPandas()

Unnamed: 0,Hours/Shift,Salary Frequency
0,,Daily
1,,Daily
2,the U.S. Department of Labor or any apprentic...,Daily
3,with a major in Water Quality Monitoring,Daily
4,,Daily
5,This position is open to qualified persons wit...,Daily
6,,Daily
7,with a major in Water Quality Monitoring,Daily
8,"TO APPLY, PLEASE SUBMIT RESUME WITH COVER LETT...",Daily
9,NOTE: This position is open to qualified perso...,Daily


* _tricky_ function required here.  Need to decide how to rationalise the salary based upon payment frequency.   It's not going to be correct to assume that hourly paid roles are going to do 40 hrs / week.  

Instead, I will use the statistics I've found in the following link and work out the salary as follows:

[Average Working Hours \(Statistical Data 2021\)](https://clockify.me/working-hours)

- for Annually paid roles, I'll calculate an hourly rate based upon the USA avg hrs / year: 1757, e.g. $100k -> about $57 per hour
- for Daily paid roles, I'll calculate an hourly rate based upon 8 hrs / day
- for Hourly paid roles, I'll use the raw data.

This may give me skewed results, for example there _may_ be a role that demands only 5 hours a week but is very well paid.  The employee's real annual wage would be extremely low, but in my calculation this role would be relatively well paid.  I will have to look at the data after applying the proposed formula above.
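The three conversion rules above can be sketched as a small plain-Python function. `to_hourly` is a hypothetical name, and the constants are simply the 1757 hours per year and 8 hours per day stated in the text; this is an illustration of the arithmetic, not the final pre-processing code.

```python
USA_AVG_HOURS_PER_YEAR = 1757  # figure quoted in the text above
HOURS_PER_DAY = 8              # assumption stated in the text above


def to_hourly(amount, frequency):
    """Convert a salary amount to an hourly rate based on its Salary Frequency."""
    if frequency == "Annual":
        return amount / USA_AVG_HOURS_PER_YEAR
    if frequency == "Daily":
        return amount / HOURS_PER_DAY
    if frequency == "Hourly":
        return amount
    raise ValueError(f"Unexpected salary frequency: {frequency!r}")


print(round(to_hourly(100_000, "Annual"), 2))  # the $100k example from the text
print(to_hourly(400, "Daily"))
print(to_hourly(15.75, "Hourly"))
```

On the DataFrame itself the same branching could be expressed with `F.when(...).otherwise(...)` on the `Salary Frequency` column, applied once to `Salary Range From` and once to `Salary Range To`.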




* Also, there are many hourly and daily paid jobs where the number of hours are not specified in "Hours/Shift"

Maybe these columns have that info:

- Job Description

- Additional Information

But we can see from the samples below that the hours are not reliably stated there either.


In [66]:
dedupd_df.select(["Hours/Shift","Salary Frequency","Job Description","Additional Information"]).\
    where(F.col('Salary Frequency') == 'Hourly').\
    where(F.col('Hours/Shift').isNull()).toPandas()

Unnamed: 0,Hours/Shift,Salary Frequency,Job Description,Additional Information
0,,Hourly,"NYC Parks is the steward of over 30,000 acres ...","Approximate start date: May 15, 2020. Positio..."
1,,Hourly,** 30- 35 Hours Part-time The Office of S...,"Must follow all safety, security, Blood-borne ..."
2,,Hourly,The TLC is looking for four responsible Colleg...,
3,,Hourly,The Bureau of Sexually Transmitted Infections ...,**IMPORTANT NOTES TO ALL CANDIDATES: Please n...
4,,Hourly,DIVISION:\t\tTechnical Services – Asset Mana...,REQUIREMENTS: College Aide ($15.75): Students...
...,...,...,...,...
58,,Hourly,Responsibilities of selected candidates will i...,SPECIAL NOTE: 1. This is a temporary assig...
59,,Hourly,The New York City Taxi and Limousine Commissio...,
60,,Hourly,New York City is home to approximately 1.64 mi...,
61,,Hourly,"Under supervision, prepare and apply plasterin...",Candidates will be required to take and pass a...



In [68]:
dedupd_df.select(["Hours/Shift","Salary Frequency","Job Description","Additional Information"]).\
    where(F.col('Salary Frequency') == 'Hourly').\
    where(F.col('Hours/Shift').isNull()).count()

63

- look for Job Descriptions that contain specification of the number of hours to work:

Below I have limited the rows to just one to see where I am getting the regex match:

> The mission of the New York City Police Department is to enhance the quality of life in New York City by working in partnership with the community to enforce the law, preserve peace, protect the people, reduce fear, and maintain order. The NYPD strives to foster a safe and fair city by incorporating Neighborhood Policing into all facets of Department operations, and solve the problems that create crime and disorder through an interdependent relationship between the people and its police, and by pioneering strategic innovation.  The Facilities Management Division, Building Maintenance Section manages the physical operation maintenance and repair of department facilities. The Building Maintenance Section is seeking a Sheet Metal Worker who will responsible for the following:  - Fabricate, erect and repair sheet metal structures such as ducts, metal ceilings, dampers, louvers and roofs;  - Spot welds solder and sweat all forms of sheet metal;  - Develop patterns and templates in fabricating complex shapes and forms.

I found that my regex is poor at finding what I'm attempting to find, e.g.

- 25 hours
- 40 hrs

etc.

Here I've matched the `hr` in `through`: the pattern `(?i)^.*?hour|hr.*?$` is read as `^.*?hour` OR `hr.*?$`, so any text containing `hr` matches.  Further on I found that even with this false match, there are many records where we can't use these columns.
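A tighter pattern could require a number directly in front of the unit. The sketch below uses Python's `re` module to compare the notebook's pattern with one possible stricter alternative; `STRICT` and `extract_hours` are my own hypothetical names, and although Spark's `rlike` uses Java regular expressions, the constructs used here (`\b`, `\d`, non-capturing groups) should behave the same way there.

```python
import re

# The pattern used in this notebook: matches any text containing "hr".
LOOSE = re.compile(r"(?i)^.*?hour|hr.*?$")

# Assumed stricter alternative: a number, optional whitespace, then hour(s)/hr(s).
STRICT = re.compile(r"(?i)\b(\d+(?:\.\d+)?)\s*(?:hours?|hrs?)\b")


def extract_hours(text):
    """Return the first stated number of hours as a float, or None if absent."""
    match = STRICT.search(text or "")
    return float(match.group(1)) if match else None


false_positive = "order through an interdependent relationship"
print(bool(LOOSE.search(false_positive)))   # the loose pattern matches "through"
print(bool(STRICT.search(false_positive)))  # the strict pattern does not

print(extract_hours("Monday-Friday 35 hours per week"))
print(extract_hours("40 hrs"))
```

Even with a stricter pattern, the first number found is not guaranteed to be the weekly hours (a description may mention hours in some other context), so any extracted values would still need a manual check.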


In [69]:
dedupd_df.select(["Job Description"]).\
    where(F.col('Salary Frequency') == 'Daily').\
    where(F.col('Hours/Shift').isNull()).\
    where(F.col("Job Description").rlike("(?i)^.*?hour|hr.*?$")).limit(1).show(truncate=False)


- look for records where neither Job Descriptions nor Additional Information contain specification of the number of hours to work.  

In [70]:
dedupd_df.select(["Job Description","Additional Information"]).\
    where(F.col('Salary Frequency') == 'Daily').\
    where(F.col('Hours/Shift').isNull()).\
    where(~F.col("Job Description").rlike("(?i)^.*?hour|hr.*?$")).\
    where(~F.col("Additional Information").rlike("(?i)^.*?hour|hr.*?$")).count()

3

- So only 3 rows:  It may be possible to get the total number of hours required for all of the others.   Let's take a look at the records more closely and see if it's going to be feasible to "grep" out the hours

In [71]:
dedupd_df.select(["Job Description","Additional Information"]).\
    where(F.col('Salary Frequency') == 'Daily').\
    where(F.col('Hours/Shift').isNull()).\
    where(F.col("Job Description").rlike("(?i)^.*?hour|hr.*?$") | F.col("Additional Information").rlike("(?i)^.*?hour|hr.*?$")).\
    count()

4

##### Conclusion:  It does not seem possible to find the number of hours required for many of the hourly and daily paid jobs.   Therefore I am going to go with my original suggested solution of converting all jobs' salaries to an hourly rate.

#### Work Location 1

* Required for data analysis:  No
    

#### Recruitment Contact

* Required for data analysis:  No


#### Residency Requirement

* Required for data analysis:  No


#### Posting Date

* Required for data analysis:  Yes
    
>  What's the job postings average salary per agency for the last 2 years? 

* Are all dates in the correct format to be cast to date?

These Posting Date values are not null but they also cannot be cast to date:

In [72]:
dedupd_df.select(["Posting Date"]).\
    where(F.col("Posting Date").cast('date').isNull() & ~F.col("Posting Date").isNull()).count()

605

In [73]:
dedupd_df.select(["Posting Date"]).\
    where(F.col("Posting Date").cast('date').isNull() & ~F.col("Posting Date").isNull()).show(30)

+--------------------+
|        Posting Date|
+--------------------+
|New York City res...|
|           help desk|
|New York City res...|
| managerial or su...|
| all candidates m...|
|       Apply Online.|
| ""2"" or ""3"" a...|
|New York City Res...|
|NYCHA has no resi...|
|30-30 Thomson Ave...|
| by examining for...|
|      59 Maiden Lane|
|New York City res...|
|â€¢	7+ years of P...|
|"Click on the ""A...|
|30-30 Thomson Ave...|
| ""2"" or ""3"" a...|
|Please submit you...|
|New York City res...|
|New York City res...|
|33 Beaver St, New...|
| all candidates m...|
|           help desk|
|NYCHA has no resi...|
|New York City res...|
|1. 8 years relate...|
| ""2"" or ""3"" a...|
|9:00am to 5:00pm ...|
| at least 18  mon...|
|"Please click on ...|
+--------------------+
only showing top 30 rows



Including nulls, the following can't simply be cast to date:

In [74]:
dedupd_df.select(["Posting Date"]).where(F.col("Posting Date").cast('date').isNull()).count()

895

In [75]:
dedupd_df.where(F.col("Posting Date").cast('date').isNull()).limit(5).toPandas().transpose()

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Job ID | 239052 | 400075 | 425665 | 233549 | 396449 |
| Agency | ADMIN FOR CHILDREN'S SVCS | ADMIN FOR CHILDREN'S SVCS | HRA/DEPT OF SOCIAL SERVICES | NYC EMPLOYEES RETIREMENT SYS | DEPT OF ENVIRONMENT PROTECTION |
| Posting Type | External | External | Internal | Internal | External |
| # Of Positions | 1 | 1 | 1 | 1 | 1 |
| Business Title | Child Welfare Trainer | Dynamics Developer | Director, PERT | CERTIFIED IT ADMINISTRATOR (LAN/WAN), LEVEL 4 | Mechanical Engineer |
| Civil Service Title | COMMUNITY COORDINATOR | COMPUTER SPECIALIST (SOFTWARE) | ADMINISTRATIVE STAFF ANALYST ( | CERTIFIED IT ADMINISTRATOR (LA | MECHANICAL ENGINEER |
| Title Code No | 56058 | 13632 | 1002D | 13652 | 20415 |
| Level | 0 | 2 | 0 | 4 | 2 |
| Job Category | Community & Business Services Social Services | Technology, Data & Innovation | Policy, Research & Analysis Social Services | Information Technology & Telecommunications | Engineering, Architecture, & Planning |
| Full-Time/Part-Time indicator | F | | F | F | F |


In [76]:
dedupd_df.select(["Posting Date"]).show()

+--------------------+
|        Posting Date|
+--------------------+
|New York City res...|
|2017-11-01T00:00:...|
|2018-03-27T00:00:...|
|           help desk|
|2019-09-30T00:00:...|
|2019-09-30T00:00:...|
|2019-08-28T00:00:...|
|                null|
|                null|
|2017-01-09T00:00:...|
|2018-09-19T00:00:...|
|2019-03-02T00:00:...|
|2019-05-06T00:00:...|
|                null|
|2019-07-31T00:00:...|
|2019-09-13T00:00:...|
|                null|
|2019-12-06T00:00:...|
|New York City res...|
| managerial or su...|
+--------------------+
only showing top 20 rows



<a id='posting_date_analysis_summary'></a>

<a id='posting_date_analysis'></a>

#### "Posting Date" analysis

There are 895 records (the count above) whose Posting Date is either null or can't be cast to a date. If I cast the column to date, every invalid value becomes null, and I don't want that. For the one query that involves Posting Date I will include only the rows that have valid dates, just for that query.

With so many "invalid" records it does not make sense to impute a date. I will update these nulls to 1900 and filter them out of any reports that are required.
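
The sentinel approach described above can be sketched in plain Python as follows. This is an illustration of the logic only, not the Spark code used in this notebook. The name `parse_posting_date`, the choice of 1 January as the day within 1900, and the sample values (including the `.000` suffix, which is truncated in the output above) are all assumptions.

```python
from datetime import date, datetime

# Sentinel for missing or unparseable dates; the exact day in 1900 is an assumption.
SENTINEL = date(1900, 1, 1)


def parse_posting_date(value):
    """Return the date part of an ISO-style timestamp, or SENTINEL if it cannot be parsed."""
    if value is None:
        return SENTINEL
    try:
        return datetime.fromisoformat(value.strip()).date()
    except ValueError:
        return SENTINEL


# Illustrative values only; the full timestamp format is assumed.
raw = ["2017-11-01T00:00:00.000", "help desk", None, "2019-09-30T00:00:00.000"]
parsed = [parse_posting_date(v) for v in raw]

# Reports that depend on Posting Date drop the sentinel rows.
valid = [d for d in parsed if d != SENTINEL]
print(valid)
```

Using a sentinel rather than null keeps the column fully populated while still making the invalid rows easy to exclude with a single filter.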

#### Post Until

* Required for data analysis:  No


#### Posting Updated

* Required for data analysis:  No


#### Process Date

* Required for data analysis:  No


# Save data so far in csv format

We have not changed any column types so I'm happy that csv format is okay for now.

In [80]:
dedupd_df.toPandas().to_csv('/dataset/dedupd_df.csv')

In [2]:
ls -l  /dataset/dedupd_df.csv

-rw-r--r-- 1 root root 8714066 Sep 18 21:43 /dataset/dedupd_df.csv


In [3]:
!head -1 /dataset/dedupd_df.csv

,Job ID,Agency,Posting Type,# Of Positions,Business Title,Civil Service Title,Title Code No,Level,Job Category,Full-Time/Part-Time indicator,Salary Range From,Salary Range To,Salary Frequency,Work Location,Division/Work Unit,Job Description,Minimum Qual Requirements,Preferred Skills,Additional Information,To Apply,Hours/Shift,Work Location 1,Recruitment Contact,Residency Requirement,Posting Date,Post Until,Posting Updated,Process Date
