This notebook queries the USAJOBS API (the federal jobs database) to retrieve current or archived job postings by keyword and/or date range. **Before running it, make sure a folder called "collected_data" exists in the same directory as this notebook.** The collected datasets are saved to that folder.

To use it, set your parameters in the INPUT PARAMETERS section below, then choose "Run All" from the "Cell" menu.

API Documentation: https://developer.usajobs.gov/API-Reference/GET-api-Search

### FILENAME CONVENTIONS:

See examples below:

**datascience_N_346_20171223111701.csv** for current jobs:

* Keyword = datascience
* Archive flag = N (Y = Archived, N = Current)
* Number of results = 346
* Current Datetime stamp = 20171223111701 (2017-12-23 at 11:17:01 am)

**geospatial_Y_4_20171223112709_12012016_12302016.csv** for archived jobs:

* Keyword = geospatial
* Archive flag = Y (Y = Archived, N = Current)
* Number of results = 4
* Current Datetime stamp = 20171223112709 (2017-12-23 at 11:27:09 am)
* From Archived Date = 12012016 (12/01/2016)
* To Archived Date = 12302016 (12/30/2016)
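
The convention above can also be read back when working with previously collected files. The following is a minimal sketch, not part of the original notebook: `parse_filename` is a hypothetical helper, and it assumes the keyword itself contains no underscore.

```python
from datetime import datetime


def parse_filename(file_name):
    """Split a collected-data filename into its documented parts."""
    stem = file_name[:-4] if file_name.endswith(".csv") else file_name
    parts = stem.split("_")
    # Current-job files have 4 parts; archived-job files add two dates.
    if len(parts) not in (4, 6):
        raise ValueError(f"Unexpected filename layout: {file_name}")
    info = {
        "keyword": parts[0],
        "archived": parts[1] == "Y",
        "result_count": int(parts[2]),
        "collected_at": datetime.strptime(parts[3], "%Y%m%d%H%M%S"),
    }
    if len(parts) == 6:
        info["from_date"] = datetime.strptime(parts[4], "%m%d%Y").date()
        info["to_date"] = datetime.strptime(parts[5], "%m%d%Y").date()
    return info


print(parse_filename("geospatial_Y_4_20171223112709_12012016_12302016.csv"))
```

Because the parts are split on underscores, a keyword such as `data_science` would be rejected by the length check rather than parsed incorrectly.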




## INPUT PARAMETERS:

In [1527]:
# API Key 
# Request one at:
# https://developer.usajobs.gov/APIRequest/Index

apiKey = "REQUEST YOUR API KEY AND INSERT IT HERE"
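
A key typed directly into the notebook is easy to share or commit by accident. As one possible alternative, the sketch below reads it from an environment variable. The variable name `USAJOBS_API_KEY` and the helper `load_api_key` are assumptions made for illustration; neither comes from the USAJOBS API or from this notebook.

```python
import os


def load_api_key(env_var="USAJOBS_API_KEY", default=None):
    """Return the API key from the environment, or `default` if unset or blank."""
    value = os.environ.get(env_var, "").strip()
    return value or default


# Falls back to the placeholder string when the variable is not set.
apiKey = load_api_key(default="REQUEST YOUR API KEY AND INSERT IT HERE")
```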

In [1528]:
# specifies whether results are current job postings or archived job postings
# Y = Archived posts
# N = Current posts

archive = "N"

In [1529]:
# specifies the number of results to retrieve
# only for current searches

resultsPerPage = 500

In [1530]:
# date range
# ONLY FOR ARCHIVED JOB POSTINGS

# format: MM/DD/YYYY

startDate = "01/01/2016"
endDate = "12/31/2016"
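
Since the dates are passed to the API as plain strings, a typo only shows up after the request is made. A minimal sketch of checking the MM/DD/YYYY format up front; `validate_date_range` is a hypothetical helper, not part of the original notebook.

```python
from datetime import datetime


def validate_date_range(start, end, fmt="%m/%d/%Y"):
    """Parse both dates and check that start does not fall after end."""
    start_dt = datetime.strptime(start, fmt)  # raises ValueError on a bad format
    end_dt = datetime.strptime(end, fmt)
    if start_dt > end_dt:
        raise ValueError(f"startDate {start} is after endDate {end}")
    return start_dt.date(), end_dt.date()


validate_date_range("01/01/2016", "12/31/2016")
```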

In [1531]:
# Search term: sent as Keyword for current searches
# and as PositionTitle for archived searches

title = "logistics"

## CODE BELOW:

In [1532]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import time
import datetime
import os

In [1533]:
urlSearch = "https://data.usajobs.gov/api/search?Keyword=" + title + "&KeywordFilter=All"

In [1534]:
urlArchive = "https://data.usajobs.gov/api/Archive?PositionTitle=" + title

In [1535]:
if (archive == "Y"):
    url = urlArchive
    url = url + "&PostingStartDate=" + startDate
    url = url + "&PostingEndDate=" + endDate
else:
    url = urlSearch
    url = url +  "&ResultsPerPage=" + str(resultsPerPage) 
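
Plain string concatenation works for a single-word term such as "logistics", but a term containing spaces or other reserved characters would need percent-encoding. The sketch below builds the same two URLs with `urllib.parse.urlencode`. The endpoints and parameter names are the ones used in this notebook; `build_url` and `BASE` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

BASE = "https://data.usajobs.gov/api/"


def build_url(title, archive, results_per_page=500, start_date=None, end_date=None):
    """Build the search or archive URL with percent-encoded query values."""
    if archive == "Y":
        endpoint = "Archive"
        params = {
            "PositionTitle": title,
            "PostingStartDate": start_date,
            "PostingEndDate": end_date,
        }
    else:
        endpoint = "search"
        params = {
            "Keyword": title,
            "KeywordFilter": "All",
            "ResultsPerPage": results_per_page,
        }
    return BASE + endpoint + "?" + urlencode(params)


print(build_url("logistics", "N"))
```

Note that `urlencode` also encodes the slashes in the archive dates (`01/01/2016` becomes `01%2F01%2F2016`), which a server decodes back to the original value.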

In [1536]:
print("API GET URL CALL:")
print(url)

API GET URL CALL:
https://data.usajobs.gov/api/search?Keyword=logistics&KeywordFilter=All&ResultsPerPage=500


In [1537]:
headers = {"Authorization-Key": apiKey}

In [1538]:
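# TLS certificate verification stays on (the requests default) because the
# request carries the API key; the timeout avoids waiting indefinitely.
res = requests.get(url, headers = headers, timeout = 30)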



In [1539]:
print("API Response code: ", res.status_code)

API Response code:  200


In [1540]:
response = res.json()

In [1541]:
def parseResults(searchResults):
    
    ApplicationCloseDate = []
    ApplyURI = []
    JobGrade = []
    PositionStartDate = []
    PositionEndDate = []
    PositionID = []
    PositionLocation = []
    PositionLocationDisplay = []
    QualificationSummary = []
    OrganizationName = []
    DepartmentName = []
    OfferingType = []
    MinPay = []
    MaxPay = []
    PayType = []
    PositionTitle = []
    JobSummary = []
    
    for r in searchResults:
        if 'MatchedObjectDescriptor' in r:            
            rr =  r.get("MatchedObjectDescriptor", {})   
            
            r_ApplicationCloseDate = rr.get("ApplicationCloseDate", None)
            r_ApplyURI = rr.get("PositionURI", None)
            # Guard against a missing or empty JobGrade list
            r_JobGrade = (rr.get("JobGrade") or [{}])[0].get("Code")
            r_PositionStartDate = rr.get("PositionStartDate", None)
            r_PositionEndDate = rr.get("PositionEndDate", None)
            r_PositionID = rr.get("PositionID", None)
            r_PositionLocationDisplay = rr.get("PositionLocationDisplay", None)
            r_QualificationSummary = rr.get("QualificationSummary", None)            
            r_OrganizationName = rr.get("OrganizationName", None)
            r_DepartmentName = rr.get("DepartmentName", None)    
            # Fall back to an empty list so the location loop below never iterates over None
            r_PositionLocation = rr.get("PositionLocation") or []
            r_PositionTitle = rr.get("PositionTitle", None)
            
            locations = []
            for l in r_PositionLocation:
                locations.append(l.get("CityName", None)) 
            locations = '|'.join(str(locs) for locs in locations)
            
            # Take the first entry of each list, tolerating missing or empty lists
            r_OfferingType = (rr.get("PositionOfferingType") or [{}])[0].get("Name")
            remuneration = (rr.get("PositionRemuneration") or [{}])[0]
            r_MinPay = remuneration.get("MinimumRange")
            r_MaxPay = remuneration.get("MaximumRange")
            r_PayType = remuneration.get("RateIntervalCode")
            r_JobSummary = rr.get("UserArea", {}).get("Details", {}).get("JobSummary", None)
            
            ApplicationCloseDate.append(r_ApplicationCloseDate)
            ApplyURI.append(r_ApplyURI)
            JobGrade.append(r_JobGrade)
            PositionEndDate.append(r_PositionEndDate)
            PositionID.append(r_PositionID)
            PositionLocationDisplay.append(r_PositionLocationDisplay)
            QualificationSummary.append(r_QualificationSummary)
            OrganizationName.append(r_OrganizationName)
            DepartmentName.append(r_DepartmentName)
            PositionLocation.append(locations)
            OfferingType.append(r_OfferingType)
            MinPay.append(r_MinPay)
            MaxPay.append(r_MaxPay)
            PayType.append(r_PayType)
            PositionStartDate.append(r_PositionStartDate)
            PositionTitle.append(r_PositionTitle)
            JobSummary.append(r_JobSummary)

    return pd.DataFrame({
            "PositionID": PositionID,
            "ApplicationCloseDate": ApplicationCloseDate,
            "JobGrade": JobGrade,
            "PositionEndDate": PositionEndDate,   
            "OrganizationName": OrganizationName, 
            "DepartmentName": DepartmentName,
            "QualificationSummary": QualificationSummary,
            "URI": ApplyURI,
            "PositionLocation": PositionLocation,
            "OfferingType": OfferingType,
            "MinPay": MinPay,
            "MaxPay": MaxPay,
            "PayType": PayType,
            "PositionStartDate": PositionStartDate,
            "PositionTitle": PositionTitle,
            "JobSummary": JobSummary
    })
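
Several fields in `parseResults` are read from the first element of a list (`JobGrade`, `PositionOfferingType`, `PositionRemuneration`). One way to keep that pattern in a single place is a small helper. This is only a sketch: `first_value` is a hypothetical name, and the sample record is made up to show the three cases rather than taken from a real API response.

```python
def first_value(record, list_key, field, default=None):
    """Return record[list_key][0][field], or `default` if any step is missing."""
    items = record.get(list_key) or []
    if not items:
        return default
    return items[0].get(field, default)


# Hypothetical record: one populated list, one empty list, one missing key.
sample = {
    "JobGrade": [{"Code": "GS"}],
    "PositionRemuneration": [],
}
print(first_value(sample, "JobGrade", "Code"))                      # GS
print(first_value(sample, "PositionRemuneration", "MinimumRange"))  # None
print(first_value(sample, "PositionOfferingType", "Name"))          # None
```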

In [1542]:
def convertToDataframe(res):
    
    response = res.json()
    print("==============================================================")
    if ('SearchResult' in response):
        
        if (archive != "Y"):
            searchResultNumber = response.get("SearchResult", {}).get("SearchResultCountAll", 0)
        else:
            searchResultNumber = response.get("SearchResult", {}).get("SearchResultCount", 0)
            
        print(searchResultNumber, "results found in API response...")        
        
        if (searchResultNumber > 0):
            # Default to an empty list so len() below cannot fail on None
            searchResults = response.get("SearchResult", {}).get("SearchResultItems") or []
            
            if (len(searchResults) > 0):
                t0 = time.time()
                print("Parsing in progress...")
                df = parseResults(searchResults).reset_index(drop = True)
                t1 = time.time()
                print("Parse complete. \nDuration: ", round(t1-t0, 5), " seconds.")
                print("Number of records: ", len(df))
                return df
            else:
                print("No Search Results.")        
        else:
            print("Search Result Number = 0.")                
    else:
        print("No SearchResult found in json response.")   
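
`SearchResultCountAll` reports the total number of matches, which can be larger than the number of items returned in a single response. Assuming the search endpoint accepts a `Page` query parameter alongside `ResultsPerPage` (this notebook does not use one, so check the API documentation linked above before relying on it), the pages needed to cover every match could be worked out as in the sketch below. `page_numbers` is a hypothetical helper.

```python
import math


def page_numbers(total_results, results_per_page):
    """Return the 1-based page numbers needed to cover total_results."""
    if results_per_page <= 0:
        raise ValueError("results_per_page must be positive")
    return list(range(1, math.ceil(total_results / results_per_page) + 1))


print(page_numbers(322, 500))   # [1]
print(page_numbers(1234, 500))  # [1, 2, 3]
```

Each page could then be requested in turn and the resulting dataframes concatenated.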

In [1543]:
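df = convertToDataframe(res)
if df is None:
    # Nothing was parsed; use an empty frame so len(df) in later cells still works
    df = pd.DataFrame()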

322 results found in API response...
Parsing in progress...
Parse complete. 
Duration:  0.00902  seconds.
Number of records:  322


In [1544]:
now = datetime.datetime.now()

# Timestamp in YYYYMMDDHHMMSS form, e.g. 20171223121226
nowString = now.strftime("%Y%m%d%H%M%S")

searchDates = ""

if archive == "Y":
    searchDates = "_" + str(startDate) + "_" + str(endDate)
    searchDates = searchDates.replace("/", "")

fileName = title.replace(";", "") + "_" + archive + "_" + str(len(df)) + "_" + nowString + searchDates + ".csv"

print(fileName)

logistics_N_322_20171223121226.csv


In [1545]:
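# Create the output folder if it is missing, then write the CSV
os.makedirs("collected_data", exist_ok = True)
df.to_csv(os.path.join("collected_data", fileName), index = False)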

In [1546]:
df.head().transpose()

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| ApplicationCloseDate | 2018-01-03 | 2018-01-04 | 2018-01-02 | 2018-01-20 | 2018-01-08 |
| DepartmentName | Department of the Army | Department of the Air Force | Department of the Air Force | Department of the Air Force | Department of the Air Force |
| JobGrade | GS | GS | GS | GS | GS |
| JobSummary | THIS IS A NATIONAL GUARD TITLE 32 EXCEPTED SER... | PUERTO RICO NATIONAL GUARD\nAIR TECHNICIAN VAC... | THIS IS A NATIONAL GUARD TITLE 32 EXCEPTED SER... | ***THIS IS A TITLE 32 NATIONAL GUARD TECHNICIA... | THIS IS A NATIONAL GUARD TITLE 32 EXCEPTED SER... |
| MaxPay | 82106.0000 | 64697.0000 | 93821.0000 | 64697.0000 | 78270.0000 |
| MinPay | 63161.0000 | 49765.0000 | 72168.0000 | 49765.0000 | 60210.0000 |
| OfferingType | Permanent | Permanent | Permanent | Permanent | Permanent |
| OrganizationName | Army National Guard Units (Title 32/Title 5) | Air National Guard Units (Title 32/Title 5) | Air National Guard Units (Title 32/Title 5) | Air National Guard Units (Title 32/Title 5) | Air National Guard Units (Title 32/Title 5) |
| PayType | Per Year | Per Year | Per Year | Per Year | Per Year |
| PositionEndDate | 2018-01-03 | 2018-01-04 | 2018-01-02 | 2018-01-20 | 2018-01-08 |
