### <font color="#1F618D"> New York City Job Postings - Data Engineering Challenge </font>
### <font color="#F5B041"> Part-1: Data Profiling </font>

In [15]:
# Import the findspark module to help locate and initialize Spark
import findspark

# Initialize Spark
findspark.init()

# Import the jupyter_black module, which is an extension to format code cells in Jupyter Notebook
import jupyter_black

# Load and enable the jupyter_black extension to format code cells automatically
jupyter_black.load()

In [16]:
# importing required libraries

import pandas as pd

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col, count, isnan, when

from common import *

In [17]:
def set_spark_conf() -> SparkConf:
    """Build the SparkConf used by this notebook."""
    spark_conf = SparkConf()

    # Set the number of executor cores
    # spark_conf.set("spark.executor.cores", "2")

    # Set the executor memory
    # spark_conf.set("spark.executor.memory", "1g")

    # Set the driver memory
    # spark_conf.set("spark.driver.memory", "2g")

    # Additional settings
    spark_conf.set("spark.master", "spark://master:7077")

    return spark_conf


def get_spark_session(spark_conf: SparkConf, job_name: str) -> SparkSession:
    """
    Get or create a SparkSession.

    Parameters:
        spark_conf (SparkConf): Configuration options for SparkSession.
        job_name (str): Name of the Spark job.

    Returns:
        SparkSession: The initialized or retrieved SparkSession.
    """
    spark = SparkSession.builder.appName(job_name).config(conf=spark_conf).getOrCreate()
    return spark

In [18]:
# set the spark configuration options
spark_conf = set_spark_conf()

# Create a SparkSession with the configured settings
spark = get_spark_session(spark_conf, "NYCJobsDataProfiling")
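
The master URL in `set_spark_conf` is hard-coded. One possible variation, sketched below, reads it from an environment variable so the same notebook can target a local or a cluster master. The variable names prefixed with `NYC_JOBS_` and the helper `build_conf_pairs` are assumptions made for illustration and are not part of the original notebook. The resulting pairs could be applied with `SparkConf().setAll(pairs)`.

```python
import os
from typing import List, Mapping, Optional, Tuple

# Default taken from the notebook; the environment variable names are hypothetical.
DEFAULT_MASTER = "spark://master:7077"


def build_conf_pairs(
    env: Optional[Mapping[str, str]] = None,
) -> List[Tuple[str, str]]:
    """Return (key, value) pairs suitable for SparkConf().setAll(...)."""
    env = os.environ if env is None else env
    pairs = [("spark.master", env.get("NYC_JOBS_SPARK_MASTER", DEFAULT_MASTER))]

    # Optional resource settings: emitted only when the variable is set,
    # mirroring the commented-out lines in set_spark_conf().
    optional = {
        "NYC_JOBS_EXECUTOR_CORES": "spark.executor.cores",
        "NYC_JOBS_EXECUTOR_MEMORY": "spark.executor.memory",
        "NYC_JOBS_DRIVER_MEMORY": "spark.driver.memory",
    }
    for env_name, conf_key in optional.items():
        if env_name in env:
            pairs.append((conf_key, env[env_name]))
    return pairs


print(build_conf_pairs({}))
print(build_conf_pairs({"NYC_JOBS_SPARK_MASTER": "local[*]", "NYC_JOBS_DRIVER_MEMORY": "2g"}))
```

Passing an explicit mapping instead of relying on `os.environ` keeps the helper easy to exercise without touching the real environment.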

In [None]:
spark.catalog.listTables()

In [5]:
# load the input dataset

df = spark.read.csv(
    "../../dataset/raw_data/nyc-jobs.csv", header=True, inferSchema=True, escape='"'
)
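
Spark's CSV reader escapes quotes with a backslash by default. Setting `escape='"'` makes it treat a doubled quote inside a quoted field as a literal quote, which is presumably why it is set here, since free-text columns such as To Apply contain quoted phrases. The sketch below illustrates the same quoting convention with Python's standard `csv` module on a made-up two-line sample. It demonstrates the convention only and does not exercise Spark itself.

```python
import csv
import io

# Made-up sample: the second field contains a literal quote written as "".
raw = 'Job ID,To Apply\n1,"Click the ""Apply Now"" button."\n'

# csv.reader uses doublequote=True by default, i.e. "" inside a quoted
# field is read back as a single ".
rows = list(csv.reader(io.StringIO(raw)))
header, first = rows[0], rows[1]
print(header)
print(first)
```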

In [6]:
# obtain the column information

df.printSchema()

root
 |-- Job ID: integer (nullable = true)
 |-- Agency: string (nullable = true)
 |-- Posting Type: string (nullable = true)
 |-- # Of Positions: integer (nullable = true)
 |-- Business Title: string (nullable = true)
 |-- Civil Service Title: string (nullable = true)
 |-- Title Code No: string (nullable = true)
 |-- Level: string (nullable = true)
 |-- Job Category: string (nullable = true)
 |-- Full-Time/Part-Time indicator: string (nullable = true)
 |-- Salary Range From: double (nullable = true)
 |-- Salary Range To: double (nullable = true)
 |-- Salary Frequency: string (nullable = true)
 |-- Work Location: string (nullable = true)
 |-- Division/Work Unit: string (nullable = true)
 |-- Job Description: string (nullable = true)
 |-- Minimum Qual Requirements: string (nullable = true)
 |-- Preferred Skills: string (nullable = true)
 |-- Additional Information: string (nullable = true)
 |-- To Apply: string (nullable = true)
 |-- Hours/Shift: string (nullable = true)
 |-- Work Locat

### <font color="#1F618D"> Detailed analysis of source data: </font>

The provided schema represents the structure of a PySpark DataFrame that holds information about job postings in New York City. 
Each row in the DataFrame corresponds to a single job posting, and the columns represent various attributes associated with 
each posting. Let us analyze the source data based on the provided schema, categorizing columns by their data types.

#### <span style="text-decoration: underline;"> Numerical Columns:</span>

**# Of Positions (numerical):** This column represents the number of positions available for the job. It contains integer values.

**Salary Range From (numerical):** This column represents the starting point of the salary range for the job. It contains numerical values.

**Salary Range To (numerical):** This column represents the endpoint of the salary range for the job. It contains numerical values.

#### <span style="text-decoration: underline;"> Categorical Columns:</span> 

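**Job ID (categorical):** This column identifies a job posting. It is stored as an integer but is treated as a categorical label. It is not unique per row: the profiling report shows 1,661 distinct values across 2,915 deduplicated rows.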

**Agency (categorical):** This column represents the agency or organization offering the job. It contains categorical values.

**Posting Type (categorical):** This column indicates the type of job posting. It contains categorical values.

**Business Title (categorical):** This column contains the business titles of the jobs. It contains categorical values.

**Civil Service Title (categorical):** This column contains the civil service titles of the jobs. It contains categorical values.

**Title Code No (categorical):** This column contains unique code numbers for job titles. It contains categorical values.

**Level (categorical):** This column represents the level or rank associated with the job. It contains categorical values.

**Job Category (categorical):** This column represents the category or field to which the job belongs. It contains categorical values.

**Full-Time/Part-Time indicator (categorical):** This column indicates whether the job is full-time or part-time. It contains categorical values.

**Salary Frequency (categorical):** This column represents the frequency at which the salary is paid. It contains categorical values.

**Work Location (categorical):** This column represents the primary location where the job will be performed. It contains categorical values.

**Division/Work Unit (categorical):** This column represents the division or specific unit within the agency where the job belongs. It contains categorical values.

**Residency Requirement (categorical):** This column specifies any residency requirements for the job. It contains categorical values.

**Posting Date (categorical):** This column represents the date when the job posting was initially published. It contains categorical date values.

**Post Until (categorical):** This column represents the date until which the job posting will be available. It contains categorical date values.

**Posting Updated (categorical):** This column represents the date when the job posting information was last updated. It contains categorical date values.

**Process Date (categorical):** This column represents the date when the information was processed. It contains categorical date values.

#### <span style="text-decoration: underline;"> Textual Columns:</span> 

**Job Description (textual):** This column contains descriptions of the responsibilities and tasks associated with the job.

**Minimum Qual Requirements (textual):** This column contains the minimum qualifications or requirements that candidates must meet to be eligible for the job.

**Preferred Skills (textual):** This column contains skills or qualifications that are preferred but not mandatory for the job.

**Additional Information (textual):** This column contains additional information about the job posting.

**To Apply (textual):** This column contains instructions on how to apply for the job.

**Hours/Shift (textual):** This column contains information about the working hours or shifts associated with the job.

**Work Location 1 (textual):** This column contains additional information about the work location.

**Recruitment Contact (textual):** This column is intended to hold contact details for recruitment inquiries. Data profiling shows that it contains only null values, so it can be dropped.

This detailed analysis categorizes the columns in the provided schema into numerical, categorical, and textual types, providing insights into the nature of the data present in each column.
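
The grouping above was done by reading the schema. A rough way to automate a first pass, assuming each column's Spark type name and distinct-value count are already known (for example from `df.dtypes` and the profiling report), is sketched below. The function name `classify_column`, the identifier override, and the threshold of 500 distinct values are illustrative assumptions, not values taken from the notebook.

```python
from typing import Iterable

NUMERIC_TYPES = {"int", "bigint", "float", "double"}
DATE_TYPES = {"date", "timestamp"}


def classify_column(
    name: str,
    spark_type: str,
    num_distinct: int,
    identifier_columns: Iterable[str] = ("Job ID",),
    distinct_threshold: int = 500,
) -> str:
    """Return 'numerical', 'categorical', or 'textual' for one column."""
    if name in identifier_columns:
        return "categorical"  # identifiers are labels, not quantities
    if spark_type in NUMERIC_TYPES:
        return "numerical"
    if spark_type in DATE_TYPES:
        return "categorical"  # the notebook groups dates with categoricals
    # Strings: few distinct values suggest a category, many suggest free text.
    return "categorical" if num_distinct <= distinct_threshold else "textual"


# Type names and distinct counts below are copied from the profiling report.
examples = [
    ("Job ID", "int", 1661),
    ("# Of Positions", "int", 34),
    ("Agency", "string", 52),
    ("Posting Type", "string", 2),
]
for name, spark_type, num_distinct in examples:
    print(f"{name}: {classify_column(name, spark_type, num_distinct)}")
```

With this threshold, high-cardinality string columns such as Business Title (1,244 distinct values) land in the textual group, so the result is a starting point to review rather than a replacement for the manual grouping.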

### <font color="#1F618D"> Data Profiling </font>

In [7]:
def get_data_types(data_df: DataFrame, columns: list) -> pd.DataFrame:
    """
    Get data types of specified columns in the DataFrame.

    Parameters:
        data_df (DataFrame): The input DataFrame.
        columns (list): List of column names.

    Returns:
        pd.DataFrame: DataFrame containing column names and their data types.
    """
    data_types_df = pd.DataFrame(
        {
            "column_names": columns,
            "data_types": [x[1] for x in data_df.select(columns).dtypes],
        }
    )
    return data_types_df[["column_names", "data_types"]]


def get_null_counts(data_df: DataFrame, columns: list) -> pd.DataFrame:
    """
    Get counts of null and NaN values for specified columns in the DataFrame.

    Parameters:
        data_df (DataFrame): The input DataFrame.
        columns (list): List of column names.

    Returns:
        pd.DataFrame: DataFrame containing column names and their null value counts.
    """
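    # isnan() is meaningful only for floating-point columns and fails on
    # timestamps, so every other type is checked with isNull() alone.
    dtypes = dict(data_df.dtypes)
    null_counts_df = (
        data_df.select(
            [
                count(
                    when(
                        (isnan(col(c)) | col(c).isNull())
                        if dtypes[c] in ("float", "double")
                        else col(c).isNull(),
                        c,
                    )
                ).alias(c)
                for c in columns
            ]
        )
        .toPandas()
        .transpose()
    )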
    null_counts_df = null_counts_df.reset_index()
    null_counts_df.columns = ["column_names", "num_null"]
    return null_counts_df


def get_space_and_blank_counts(data_df: DataFrame, columns: list) -> pd.DataFrame:
    """
    Get counts of white spaces and blanks for specified columns in the DataFrame.

    Parameters:
        data_df (DataFrame): The input DataFrame.
        columns (list): List of column names.

    Returns:
        pd.DataFrame: DataFrame containing column names and their space/blank counts.
    """
    num_spaces = [data_df.where(col(c).rlike("^\\s+$")).count() for c in columns]
    num_blank = [data_df.where(col(c) == "").count() for c in columns]

    space_blank_df = pd.DataFrame(
        {"column_names": columns, "num_spaces": num_spaces, "num_blank": num_blank}
    )
    return space_blank_df


def get_descriptive_stats(data_df: DataFrame) -> pd.DataFrame:
    """
    Get descriptive statistics (count, mean, stddev) for all columns in the DataFrame.

    Parameters:
        data_df (DataFrame): The input DataFrame.

    Returns:
        pd.DataFrame: DataFrame containing descriptive statistics.
    """
    desc_df = data_df.describe().toPandas().transpose()
    desc_df.columns = ["count", "mean", "stddev", "min", "max"]
    desc_df = desc_df.iloc[1:, :]
    desc_df = desc_df.reset_index()
    desc_df.columns.values[0] = "column_names"
    desc_df = desc_df[["column_names", "count", "mean", "stddev"]]
    return desc_df


def get_distinct_value_counts(data_df: DataFrame, columns: list) -> pd.DataFrame:
    """
    Get the number of distinct values for specified columns in the DataFrame.

    Parameters:
        data_df (DataFrame): The input DataFrame.
        columns (list): List of column names.

    Returns:
        pd.DataFrame: DataFrame containing column names and their distinct value counts.
    """
    distinct_counts_df = pd.DataFrame(
        {
            "column_names": columns,
            "num_distinct": [data_df.select(x).distinct().count() for x in columns],
        }
    )
    return distinct_counts_df


def get_most_frequent_values(data_df: DataFrame, columns: list) -> pd.DataFrame:
    """
    Get the most frequently occurring value and its count for specified columns in the DataFrame.

    Parameters:
        data_df (DataFrame): The input DataFrame.
        columns (list): List of column names.

    Returns:
        pd.DataFrame: DataFrame containing column names, most frequent value, and its count.
    """
    most_freq_values = [
        data_df.groupBy(x)
        .count()
        .sort("count", ascending=False)
        .limit(1)
        .toPandas()
        .iloc[0]
        .tolist()
        for x in columns
    ]
    most_freq_values_df = pd.DataFrame(
        most_freq_values, columns=["most_freq_value", "most_freq_value_count"]
    )
    most_freq_values_df["column_names"] = columns
    most_freq_values_df = most_freq_values_df[
        ["column_names", "most_freq_value", "most_freq_value_count"]
    ]
    return most_freq_values_df


def get_least_frequent_values(data_df: DataFrame, columns: list) -> pd.DataFrame:
    """
    Get the least frequently occurring value and its count for specified columns in the DataFrame.

    Parameters:
        data_df (DataFrame): The input DataFrame.
        columns (list): List of column names.

    Returns:
        pd.DataFrame: DataFrame containing column names, least frequent value, and its count.
    """
    least_freq_values = [
        data_df.groupBy(x)
        .count()
        .sort("count", ascending=True)
        .limit(1)
        .toPandas()
        .iloc[0]
        .tolist()
        for x in columns
    ]
    least_freq_values_df = pd.DataFrame(
        least_freq_values, columns=["least_freq_value", "least_freq_value_count"]
    )
    least_freq_values_df["column_names"] = columns
    least_freq_values_df = least_freq_values_df[
        ["column_names", "least_freq_value", "least_freq_value_count"]
    ]
    return least_freq_values_df


def is_categorical_column(column: pd.Series, distinct_threshold: int) -> bool:
    """
    Determine if a column is categorical based on the distinct value threshold.

    Parameters:
        column (pd.Series): The input pandas column to be evaluated.
        distinct_threshold (int): The threshold to consider a column as categorical.

    Returns:
        bool: True if the column is categorical, False otherwise.
    """
    num_distinct = column.nunique()
    return num_distinct <= distinct_threshold


def profile_column_values(data_df: DataFrame, column_name: str) -> pd.DataFrame:
    """
    Profile the values of a column by showing value counts.

    Parameters:
        data_df (DataFrame): The input DataFrame.
        column_name (str): The name of the column to be profiled.

    Returns:
        pd.DataFrame: Value counts sorted from most to least frequent.
    """
    value_counts = data_df.groupBy(column_name).count().orderBy(col("count").desc())

    return value_counts.toPandas()
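
`get_space_and_blank_counts` distinguishes two cases: values made up only of whitespace (`^\s+$`) and empty strings (`== ""`). Nulls fall into neither group. The plain-Python sketch below mimics that split on a handful of made-up values. It uses Python's `re` module with `re.ASCII` as an approximation of the Java regex engine behind `rlike`, which is an assumption of equivalent behaviour for these simple inputs. The helper name `classify_value` is hypothetical.

```python
import re
from typing import Optional

# Same pattern as the rlike() call: one or more whitespace characters only.
WHITESPACE_ONLY = re.compile(r"^\s+$", re.ASCII)


def classify_value(value: Optional[str]) -> str:
    """Label a value as 'null', 'blank', 'spaces', or 'text'."""
    if value is None:
        return "null"
    if value == "":
        return "blank"
    if WHITESPACE_ONLY.search(value):
        return "spaces"
    return "text"


samples = [None, "", " ", "\t  ", "35 Hours", " 35 Hours "]  # made-up values
for sample in samples:
    print(repr(sample), "->", classify_value(sample))
```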

In [11]:
def get_data_profile(data_all_df: DataFrame, columns2Bprofiled: list) -> None:
    """
    Generate and display a comprehensive data profile for specified columns in the DataFrame.

    Parameters:
        data_all_df (DataFrame): The input DataFrame.
        columns2Bprofiled (list): List of column names to be profiled.

    Returns:
        None: The profiling results are printed and displayed, not returned.
    """
    # Print the section heading
    print_section_heading("DATA PROFILING REPORT")

    # --- Data Shape ---
    print_subheading("Data Shape")

    # Get the number of rows and columns
    num_rows = data_all_df.count()
    num_columns = len(data_all_df.columns)

    # Display the results
    print(f"Number of rows: {num_rows}")
    print(f"Number of columns: {num_columns}")

    # Dropping duplicates
    data_all_df = data_all_df.distinct()
    dist_tot_cnt = data_all_df.count()
    print(f"Total Records after dropping duplicates: {dist_tot_cnt}")

    print_subheading_closure()

    # --- Column Stats ---
    print_subheading("Column Stats")

    # Select columns to be profiled
    data_df = data_all_df.select(columns2Bprofiled)

    # Get data type and other basic statistics for columns
    prof_df = get_data_types(data_df, columns2Bprofiled)

    # Count the number of rows
    num_rows = data_df.count()
    prof_df["num_rows"] = num_rows

    # Get null counts for columns
    null_counts_df = get_null_counts(data_df, columns2Bprofiled)
    prof_df = pd.merge(prof_df, null_counts_df, on=["column_names"], how="left")

    # Get counts of spaces and blank values for columns
    space_blank_df = get_space_and_blank_counts(data_df, columns2Bprofiled)
    prof_df = pd.merge(prof_df, space_blank_df, on=["column_names"], how="left")

    # Get descriptive statistics for columns
    desc_df = get_descriptive_stats(data_df)
    prof_df = pd.merge(prof_df, desc_df, on=["column_names"], how="left")

    # Get distinct value counts for columns
    distinct_counts_df = get_distinct_value_counts(data_df, columns2Bprofiled)
    prof_df = pd.merge(prof_df, distinct_counts_df, on=["column_names"], how="left")

    # Get most frequent values for columns
    most_freq_values_df = get_most_frequent_values(data_df, columns2Bprofiled)
    prof_df = pd.merge(prof_df, most_freq_values_df, on=["column_names"], how="left")

    # Get least frequent values for columns
    least_freq_values_df = get_least_frequent_values(data_df, columns2Bprofiled)
    prof_df = pd.merge(prof_df, least_freq_values_df, on=["column_names"], how="left")

    # Display the comprehensive data profile DataFrame
    display(prof_df)

    print_subheading_closure()

    # --- Column Profiles ---
    print_subheading("Column Frequency Distribution")

    print("Getting value counts - top 10")
    for col_name in columns2Bprofiled:
        display(profile_column_values(data_df, col_name).head(10))

    print_subheading_closure()
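
`get_most_frequent_values` and `get_least_frequent_values` each keep a single row after sorting by count. When several values share the same count, which one is reported depends on Spark's row ordering and may differ between runs. Ties are likely wherever the report shows a least-frequent count of 1. The plain-Python sketch below shows the idea of a deterministic tie-break by also ordering on the value itself. In PySpark the analogous change would be to add the grouped column as a secondary sort key. The helper name `frequency_extremes` and the sample titles are illustrative.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def frequency_extremes(
    values: Iterable[Hashable],
) -> Tuple[Tuple[Hashable, int], Tuple[Hashable, int]]:
    """Return ((most_value, count), (least_value, count)) with stable tie-breaks."""
    counts = Counter(values)
    if not counts:
        raise ValueError("values must not be empty")
    # Order by count first, then by the value's string form, so that ties
    # resolve the same way on every run.
    most = min(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    least = min(counts.items(), key=lambda kv: (kv[1], str(kv[0])))
    return most, least


titles = ["Project Manager", "Investigator", "Project Manager", "College Aide"]
print(frequency_extremes(titles))
```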

In [12]:
get_data_profile(df, df.columns)


╔════════════════════════════════════════════════════════╗
║                 DATA PROFILING REPORT                  ║
╚════════════════════════════════════════════════════════╝

╔════════════════════════════════════════════════════════╗
║                       Data Shape                       ║
╠════════════════════════════════════════════════════════╣
Number of rows: 2946
Number of columns: 28
Total Records after dropping duplicates: 2915
╚════════════════════════════════════════════════════════╝

╔════════════════════════════════════════════════════════╗
║                      Column Stats                      ║
╠════════════════════════════════════════════════════════╣


Unnamed: 0,column_names,data_types,num_rows,num_null,num_spaces,num_blank,count,mean,stddev,num_distinct,most_freq_value,most_freq_value_count,least_freq_value,least_freq_value_count
0,Job ID,int,2915,0.0,0,0,2915.0,384863.0401372213,53017.48027901095,1661,384143,3.0,363130,1.0
1,Agency,string,2915,0.0,0,0,2915.0,,,52,DEPT OF ENVIRONMENT PROTECTION,650.0,PUBLIC ADMINISTRATOR-NEW YORK,1.0
2,Posting Type,string,2915,0.0,0,0,2915.0,,,2,Internal,1653.0,External,1262.0
3,# Of Positions,int,2915,0.0,0,0,2915.0,2.4363636363636365,8.58105496276755,34,1,2263.0,120,1.0
4,Business Title,string,2915,0.0,0,0,2915.0,,,1244,Assistant Civil Engineer,32.0,Administrative Program Coordinator,1.0
5,Civil Service Title,string,2915,0.0,0,0,2915.0,,,312,COMMUNITY COORDINATOR,180.0,ADM HOUSING DEV SPEC(NON MGRL),1.0
6,Title Code No,string,2915,0.0,0,0,2915.0,35536.76237989653,28113.970668785427,323,56058,180.0,31118,1.0
7,Level,string,2915,0.0,0,0,2915.0,1.0541530944625408,1.1402285900872402,14,0,1097.0,4B,1.0
8,Job Category,string,2915,2.0,0,0,2913.0,,,131,"Engineering, Architecture, & Planning",497.0,"Administration & Human Resources Technology, D...",1.0
9,Full-Time/Part-Time indicator,string,2915,193.0,0,0,2722.0,,,3,F,2597.0,P,125.0


╚════════════════════════════════════════════════════════╝

╔════════════════════════════════════════════════════════╗
║             Column Frequency Distribution              ║
╠════════════════════════════════════════════════════════╣
Getting value counts - top 10


Unnamed: 0,Job ID,count
0,384143,3
1,425375,2
2,412113,2
3,396195,2
4,369120,2
5,383183,2
6,243640,2
7,422886,2
8,274845,2
9,416166,2


Unnamed: 0,Agency,count
0,DEPT OF ENVIRONMENT PROTECTION,650
1,NYC HOUSING AUTHORITY,231
2,DEPT OF HEALTH/MENTAL HYGIENE,187
3,DEPARTMENT OF TRANSPORTATION,180
4,DEPT OF DESIGN & CONSTRUCTION,142
5,TAXI & LIMOUSINE COMMISSION,134
6,ADMIN FOR CHILDREN'S SVCS,108
7,DEPT OF INFO TECH & TELECOMM,107
8,LAW DEPARTMENT,90
9,HOUSING PRESERVATION & DVLPMNT,86


Unnamed: 0,Posting Type,count
0,Internal,1653
1,External,1262


Unnamed: 0,# Of Positions,count
0,1,2263
1,2,307
2,3,82
3,4,74
4,5,59
5,8,19
6,6,15
7,7,15
8,10,11
9,15,10


Unnamed: 0,Business Title,count
0,Assistant Civil Engineer,32
1,Project Manager,29
2,College Aide,24
3,Construction Project Manager,22
4,ACCOUNTABLE MANAGER,20
5,Confidential Investigator,18
6,Watershed Maintainer,17
7,Prosecuting Attorney,16
8,Investigator,16
9,Senior Project Manager,15


Unnamed: 0,Civil Service Title,count
0,COMMUNITY COORDINATOR,180
1,AGENCY ATTORNEY,112
2,CIVIL ENGINEER,87
3,CITY RESEARCH SCIENTIST,83
4,CONSTRUCTION PROJECT MANAGER,72
5,CLERICAL ASSOCIATE,69
6,COMMUNITY ASSOCIATE,69
7,ADMINISTRATIVE PROJECT MANAGER,58
8,COMPUTER SYSTEMS MANAGER,57
9,PRINCIPAL ADMINISTRATIVE ASSOC,53


Unnamed: 0,Title Code No,count
0,56058,180
1,30087,112
2,20215,87
3,21744,83
4,34202,72
5,56057,69
6,10251,69
7,10050,57
8,10124,53
9,6088,52


Unnamed: 0,Level,count
0,0,1097
1,1,518
2,2,499
3,3,295
4,M1,160
5,M2,109
6,M3,102
7,4,47
8,M4,40
9,M5,27


Unnamed: 0,Job Category,count
0,"Engineering, Architecture, & Planning",497
1,"Technology, Data & Innovation",312
2,Legal Affairs,224
3,"Public Safety, Inspections, & Enforcement",179
4,Building Operations & Maintenance,177
5,"Finance, Accounting, & Procurement",168
6,Administration & Human Resources,131
7,Constituent Services & Community Programs,129
8,Health,125
9,"Policy, Research & Analysis",124


Unnamed: 0,Full-Time/Part-Time indicator,count
0,F,2597
1,,193
2,P,125


Unnamed: 0,Salary Range From,count
0,52524.0,82
1,75000.0,71
2,56990.0,68
3,55416.0,61
4,63031.0,49
5,15.5,46
6,54100.0,44
7,69940.0,43
8,65783.0,43
9,73305.0,37


Unnamed: 0,Salary Range To,count
0,85000.0,49
1,83151.0,41
2,100000.0,38
3,75000.0,35
4,65000.0,34
5,90000.0,33
6,80000.0,33
7,95000.0,32
8,130000.0,31
9,168433.0,31


Unnamed: 0,Salary Frequency,count
0,Annual,2683
1,Hourly,194
2,Daily,38


Unnamed: 0,Work Location,count
0,96-05 Horace Harding Expway,262
1,59-17 Junction Blvd Corona Ny,201
2,30-30 Thomson Ave L I City Qns,142
3,"1 Centre St., N.Y.",123
4,55 Water St Ny Ny,117
5,"33 Beaver St, New York Ny",110
6,255 Greenwich Street,96
7,100 Gold Street,86
8,"100 Church St., N.Y.",85
9,"150 William Street, New York N",76


Unnamed: 0,Division/Work Unit,count
0,Executive Management,56
1,Central Brookly City Operation,36
2,Law Department,32
3,Administration,31
4,Citywide Cybersecurity,29
5,Default,28
6,W S/Connections Permitting,25
7,Green Infrastructure,24
8,Information Technology,24
9,Dept of Environment Protection,24


Unnamed: 0,Job Description,count
0,The New York City Taxi and Limousine Commissio...,14
1,The New York City Taxi and Limousine Commissio...,6
2,The NYC Department of Environmental Protection...,4
3,The City of New York’s Office of Administrat...,4
4,Providing technical support to the Affirmative...,4
5,New York City is home to approximately 1.64 mi...,4
6,The Division Tenant Resources HPD's Division o...,4
7,Major Responsibilities •\tWork with the Boro...,4
8,The EEO Investigator will work under general t...,4
9,The TLC is looking for four responsible Colleg...,4


Unnamed: 0,Minimum Qual Requirements,count
0,1. A baccalaureate degree from an accredited c...,180
1,1. Admission to the New York State Bar; and ei...,112
2,"(1) Four (4) years of full-time, satisfactory ...",87
3,"1. For Assignment Level I (only physical, bio...",83
4,1. A four-year high school diploma or its educ...,72
5,Qualification Requirements A four-year high s...,69
6,Qualification Requirements 1. High school gra...,69
7,1. A master's degree in computer science from ...,57
8,1. A baccalaureate degree from an accredited c...,56
9,1. A baccalaureate degree from an accredited c...,53


Unnamed: 0,Preferred Skills,count
0,,387
1,ERROR: #NAME?,41
2,Interested candidates should have excellent wr...,18
3,Ability to communicate effectively in verbal a...,13
4,"1.\tKnowledge of Microsoft Word, Excel, Outloo...",12
5,1.\tExperience with public housing. 2.\tExper...,8
6,•\tAt least five years of litigation experie...,8
7,•\tExcellent oral and written communication ...,8
8,A valid NYS Driver's License is required for t...,8
9,Experience working with distributed computing ...,7


Unnamed: 0,Additional Information,count
0,,1087
1,Appointments are subject to OMB approval. For ...,87
2,"NYCHA employees applying for promotional, titl...",56
3,**IMPORTANT NOTES TO ALL CANDIDATES: Please n...,49
4,**IMPORTANT NOTES TO ALL CANDIDATES: Please n...,36
5,Note: This position is open to qualified perso...,35
6,Appointments are subject to OMB approval. For...,27
7,DEP is an equal opportunity employer with a st...,22
8,Appointments are subject to OMB approval. For ...,20
9,Mayor’s Office of Contract Services is an eq...,20


Unnamed: 0,To Apply,count
0,"Click the ""Apply Now"" button.",296
1,"Click, ""APPLY NOW"" Current city employees must...",116
2,"Click the ""Apply Now"" button",112
3,"To apply click ""Apply Now""",89
4,"Click ""Apply Now"" button",54
5,"For City employees, please go to Employee Self...",46
6,"Click on the ""Apply Now"" button.",43
7,Click the “Apply Now” button,41
8,"To apply, please click, ""Apply Now"".",38
9,Apply Online.,38


Unnamed: 0,Hours/Shift,count
0,,2039
1,35 Hours,134
2,35 hours per week,47
3,Day - Due to the necessary technical support d...,38
4,35 hours per week / day,36
5,35hrs,33
6,"Unless otherwise indicated, all positions requ...",24
7,"DAY, 9-5; ON OCCASION, CANDIDATES WILL BE REQU...",22
8,"Monday – Friday, 9am to 5pm.",19
9,35 hours,18


Unnamed: 0,Work Location 1,count
0,,1578
1,"30-30 Thomson Avenue, LIC, NY",130
2,100 Gold Street,75
3,55 Water St Ny Ny,70
4,"33 Beaver St, New York Ny",68
5,255 Greenwich Street,56
6,"New York, NY",45
7,"Brooklyn, NY",43
8,"1 Police Plaza, N.Y.",32
9,"31-00 47 Ave, 3 FL, LIC NY",28


Unnamed: 0,Recruitment Contact,count
0,,2915


Unnamed: 0,Residency Requirement,count
0,New York City residency is generally required ...,1687
1,New York City Residency is not required for th...,636
2,NYCHA has no residency requirements.,222
3,New York City Residency is not required for th...,216
4,New York City residency is not required for th...,16
5,This position is exempt from NYC residency req...,13
6,"Residency in New York City, Nassau, Orange, Ro...",11
7,"Residency in New York City, Nassau, Orange, Ro...",10
8,City Residency is not required for this position,8
9,New York City Residency is not required for th...,7


Unnamed: 0,Posting Date,count
0,2019-11-01,64
1,2019-09-30,59
2,2019-11-26,58
3,2019-11-27,54
4,2019-12-10,53
5,2019-12-13,51
6,2019-12-06,50
7,2019-12-11,48
8,2019-10-28,45
9,2019-10-31,38


Unnamed: 0,Post Until,count
0,NaT,2052
1,2019-12-20,51
2,2019-12-22,38
3,2019-12-18,34
4,2020-02-24,32
5,2020-01-26,32
6,2020-02-25,31
7,2019-12-26,31
8,2019-12-21,31
9,2019-12-19,28


Unnamed: 0,Posting Updated,count
0,2019-12-13,82
1,2019-12-16,71
2,2019-12-11,68
3,2019-11-27,67
4,2019-12-10,66
5,2019-11-26,66
6,2019-12-12,61
7,2019-12-05,52
8,2019-12-06,47
9,2019-11-04,45


Unnamed: 0,Process Date,count
0,2019-12-17,2911
1,NaT,4


╚════════════════════════════════════════════════════════╝
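
The frequency tables above can be condensed into missing-value ratios. The sketch below does that arithmetic for a few columns, using the 2,915 deduplicated rows and the counts shown for the empty or `NaT` entries. Treating those entries as nulls is an assumption (it matches `num_null` for the Full-Time/Part-Time indicator). The 50% cut-off and the helper name `mostly_missing` are illustrative.

```python
from typing import Dict, List

TOTAL_ROWS = 2915  # rows after dropping duplicates, from the Data Shape section

# Counts of empty / NaT entries copied from the frequency tables above.
missing_counts: Dict[str, int] = {
    "Full-Time/Part-Time indicator": 193,
    "Preferred Skills": 387,
    "Additional Information": 1087,
    "Hours/Shift": 2039,
    "Work Location 1": 1578,
    "Recruitment Contact": 2915,
    "Post Until": 2052,
    "Process Date": 4,
}


def mostly_missing(
    counts: Dict[str, int], total: int, cutoff: float = 0.5
) -> List[str]:
    """Return the columns whose missing ratio exceeds the cutoff."""
    return [name for name, n in counts.items() if n / total > cutoff]


for name, n in missing_counts.items():
    print(f"{name}: {n / TOTAL_ROWS:.1%} missing")

print(mostly_missing(missing_counts, TOTAL_ROWS))
```

Columns above the cut-off are candidates for review before later pipeline stages; whether to drop or keep them is a separate decision.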


In [13]:
spark.stop()