## Background
#### Labor Market
How can we understand labor supply and demand for particular jobs in real time? We can use the Bureau of Labor Statistics' (https://www.bls.gov/eag/eag.va.htm) employment and unemployment estimates to roughly gauge labor supply. For example, in Virginia, as of December 2020, the number of unemployed individuals was 209,400, or 4.9% of the civilian labor force. However, the BLS and other statistical organizations usually release data a month or two after the point-in-time estimates are calculated, and it is difficult to see what the labor supply is for particular job types or industries.

We can gain insight into labor demand by viewing job postings in real time. Some organizations, such as Burning Glass Technologies and Claro Analytics, have started to collect talent and job posting data and sell it to companies interested in topics such as talent location, compensation benchmarking, skill demand, and other labor market trends.

We can utilize the same techniques at a smaller scale by web-scraping job postings from major job boards such as Indeed to answer questions such as:
1. Which companies are hiring?
2. Where are companies hiring?
3. How much are they offering in compensation?
4. What kind of specific jobs are they hiring for?
5. Has there been a recent spike in labor demand? (If the analysis is run periodically over time)

#### Nursing
The American Association of Colleges of Nursing reports that the United States will "experience a shortage of Registered Nurses (RNs) that is expected to intensify as Baby Boomers age and the need for health care grows. Compounding the problem is the fact that nursing schools across the country are struggling to expand capacity to meet the rising demand for care given the national move toward healthcare reform."

COVID-19 exacerbated this shortage, with huge demand for healthcare workers (doctors, nurses, PAs, etc.) as cases spiked across the United States. A year after COVID-19 cases first started to strain medical systems in the United States, the challenge remains real: as recently as a month ago, metro-Atlanta hospitals reported nursing shortages.

If demand remains at all-time highs, hospital systems and nursing schools should look to increase compensation and incentives to retain top talent and attract graduating nursing students.

#### Purpose of Project
I was an economics major and I currently work in Ernst & Young's (EY) People Advisory Services as a data scientist consultant. I learned about labor demand/supply briefly, and we've completed some high-level analyses on this topic for clients from time to time. I try my best to keep up with the state of the US workforce, but have never completed any data analysis of my own.

This project will collect nurse job postings in Virginia and visualize the demand on a map of Virginia. I don't believe this will reveal anything significant, but to me this is an important topic for the healthcare industry. It also gives me the opportunity to work on labor economics while showcasing some of the technical skills I've learned in Python over the last month.

This report/code will create a point-in-time view of the collected data, but I will collect data every day to track the demand for nurses over the next few months.

## Table of Contents
1. [Import Packages](#section_id1)
2. [Build Web Scraper](#section_id2)
3. [Run Web Scraper](#section_id3)
4. [Geo-Code Data](#section_id4)
5. [Data Wrangling](#section_id5)
6. [Visualize Data](#section_id6)
7. [Conclusion](#section_id7)
8. [Appendix](#section_id8)

## Importing Packages <a id='section_id1'></a>

In [2]:
# Data Wrangling Libraries 
import numpy as np
import pandas as pd
from datetime import date
import re

# Web scraping and Geocoding Libraries
import requests
import bs4
from bs4 import BeautifulSoup
import time
import pytz
import datetime

useProxy=False

# Visualizing Data Libraries
import folium
from folium import plugins
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

## Building the Web Scraper <a id='section_id2'></a>

This is the web-scraper I used to find and collect job postings on Indeed's job board. 

In [12]:
#scraping code:
def extract_job_postings(job, city, number_of_searches):
    search_term = job.replace(' ', '+')
    location = city.replace(' ', '+').replace(',','%2C')
    url = f"https://www.indeed.com/jobs?q={search_term}&l={location}&radius=50"
    print(url)
    columns = ["job_title", "company_name","location","salary","days_since_posting","city"]
    df = pd.DataFrame(columns = columns)
    for start in range(0, number_of_searches, 10):
        page = requests.get(str(url)+"&start=" + str(start))
        time.sleep(2)  #pausing 2 seconds between page grabs
        #page.text is already decoded, so no from_encoding argument is needed
        soup = BeautifulSoup(page.text, "lxml")
        for div in soup.find_all(name="div", attrs={"class":"jobsearch-SerpJobCard unifiedRow row result"}): 
            #specifying row num for index of job posting in dataframe
            num = (len(df) + 1) 
            #creating an empty list to hold the data for each posting
            job_post = [] 
            #grabbing job title      
            for a in div.find_all(name="a", attrs={"data-tn-element":"jobTitle"}):
                job_post.append(a["title"]) 
            #grabbing company name
            company = div.find_all(name="span", attrs={"class":"company"}) 
            if len(company) > 0: 
                for b in company:
                    job_post.append(b.text.strip()) 
            else: 
                sec_try = div.find_all(name="span", attrs={"class":"result-link-source"})
                for span in sec_try:
                    job_post.append(span.text) 
            #grabbing location name
            location = div.find_all(name="div", attrs={"class":"location accessible-contrast-color-location"})
            if len(location) == 0:
                job_post.append(city)
            else:
                for loc in location:
                    job_post.append(loc.text) 
            #grabbing salary
            salary = div.find(name="span", attrs={"class":"salaryText"})
            if salary is None:
                job_post.append("NO SALARY")
            else:
                job_post.append(salary.text.strip())
            #grabbing days since posting
            date = div.find(name="span", attrs={"class":"date"})
            job_post.append(date.text.strip())
            #adding city search term
            job_post.append(city)
            #appending list of job post info to dataframe at index num
            df.loc[num] = job_post
    return df
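
The search URL in `extract_job_postings` is assembled with manual string replacements. As a minimal sketch, assuming the same `q`, `l`, `radius`, and `start` query parameters seen in the scraper above, the standard library's `urllib.parse.urlencode` can produce the same encoding. The helper name `build_search_url` is hypothetical and not part of the original scraper.

```python
from urllib.parse import urlencode

def build_search_url(job, city, start=0, radius=50):
    """Hypothetical helper: build an Indeed search URL with encoded parameters."""
    params = {"q": job, "l": city, "radius": radius}
    # Only add the paging offset when it is non-zero
    if start:
        params["start"] = start
    # urlencode uses quote_plus, so spaces become '+' and commas become '%2C'
    return "https://www.indeed.com/jobs?" + urlencode(params)

print(build_search_url("nurse", "Petersburg city, Virginia"))
print(build_search_url("nurse", "Petersburg city, Virginia", start=10))
```

Unlike the scraper, which always appends `&start=`, this sketch leaves the offset off the first page; either form requests the same parameters.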

## Running Web Scraper <a id='section_id3'></a>

I used Eviction Lab data to get every city and county in the United States. In this project I'll only be looking at Virginia. I could have entered the data manually or found something from the Census, but this is what I already had, so I quickly made a CSV of Virginia's counties and independent cities to search job postings in.

It does require an AWS account to connect to the API and extract the data. I am not sure if they still give out the key publicly, but you can check out their data here (super interesting data set): https://data-downloads.evictionlab.org/

#### R Code to get csv file of all the cities and counties in Virginia 

#Load Packages<br>
library(tidyverse)<br>
library(aws.s3)<br>

#Set keys<br>
Sys.setenv("AWS_ACCESS_KEY_ID" = "**KEY HERE**","AWS_SECRET_ACCESS_KEY" = "**KEY HERE**")<br>

#Code<br>
usercsvobj <-get_object("s3://eviction-lab-data-downloads/VA/tracts.csv")<br>
csvcharobj <- rawToChar(usercsvobj)<br>
con <- textConnection(csvcharobj) <br>
evictions_state <- read.csv(con,header= TRUE,colClasses= c(GEOID = "character", name = "character")) <br>
close(con) <br>
rm(csvcharobj,usercsvobj,con)<br>
df <- evictions_state %>% select(parent.location) %>% mutate(count=1) %>% group_by(parent.location) %>% summarize_each(funs(sum)) %>% select(parent.location)<br>
df1 <- data.frame("Washington, DC") <br>
names(df1)<-c("parent.location")<br>
new_df <- rbind(df,df1)<br>
write.csv(new_df,"**SAVE PATH HERE**")

In [4]:
#read in csv created by R code above
counties = pd.read_csv("C:\\Users\\RB232BZ\\OneDrive - EY\\Desktop\\virginia.csv")
counties.head(10)

Unnamed: 0.1,Unnamed: 0,parent.location
0,1,"Accomack County, Virginia"
1,2,"Albemarle County, Virginia"
2,3,"Alexandria city, Virginia"
3,4,"Alleghany County, Virginia"
4,5,"Amelia County, Virginia"
5,6,"Amherst County, Virginia"
6,7,"Appomattox County, Virginia"
7,8,"Arlington County, Virginia"
8,9,"Augusta County, Virginia"
9,10,"Bath County, Virginia"


In [5]:
#convert the parent.location column to a list of location strings
cities = counties["parent.location"].tolist()

In [7]:
#we are looking for nurse job postings so our keyword in the indeed job search is nurse
job = "nurse"

#we will be looking for the top 100 job postings for each city/county search in Virginia within a 50 mile radius
number_of_searches =100

Unfortunately, Indeed uses hCaptcha to block the web scraper from gathering data from more than 10 searches. I have yet to devise a way around the CAPTCHA.

For the purpose of this analysis I will focus on 13 different counties/independent cities in Virginia that will cover the state. Each search looks at job postings within a 50-mile radius which should be enough to find all recent jobs in the area. 
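
Whether a handful of search centers with a 50-mile radius really covers the state depends on how far apart the centers are. The sketch below shows one way to check this with the haversine great-circle formula. The function name `haversine_miles` and the Earth radius of 3,958.8 miles are assumptions of this example, and the two coordinate pairs are rough, illustrative values rather than geocoded results from this project.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (assumed constant for this sketch)

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# Rough, illustrative coordinates (not taken from the project's data)
richmond = (37.54, -77.44)
petersburg = (37.23, -77.40)

distance = haversine_miles(*richmond, *petersburg)
# Two 50-mile search circles overlap when their centers are under 100 miles apart
print(round(distance, 1), distance < 100)
```

Overlapping circles also mean the same posting can be returned by more than one search, which is why duplicate rows need to be dropped after scraping.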

In [9]:
#create a subset of cities list to 13 independent cities and counties that cover the span of Virginia geographically

group = ["Petersburg city, Virginia",
          "Richmond city, Virginia",
          "Fairfax County, Virginia",
          "Williamsburg city, Virginia",
          "Fredericksburg city, Virginia",
          "Albemarle County, Virginia",
          "Virginia Beach city, Virginia",
          "Harrisonburg city, Virginia",
          "Lynchburg city, Virginia",
          "Winchester city, Virginia",
          "Alexandria city, Virginia",
          "Roanoke city, Virginia",
          "Smyth County, Virginia"
         ]

In [13]:
#create empty dataframe to append data to
column_names = ["job_title", "company_name","location","salary","days_since_posting","city"]
df = pd.DataFrame(columns = column_names)

#scrape Indeed job postings 
for city in group:
    data = extract_job_postings(job,str(city), number_of_searches)
    #DataFrame.append was removed in pandas 2.0, so concatenate instead
    df = pd.concat([df, data])

df.head(10)

https://www.indeed.com/jobs?q=nurse&l=Petersburg+city%2C+Virginia&radius=50
https://www.indeed.com/jobs?q=nurse&l=Richmond+city%2C+Virginia&radius=50
https://www.indeed.com/jobs?q=nurse&l=Fairfax+County%2C+Virginia&radius=50
https://www.indeed.com/jobs?q=nurse&l=Williamsburg+city%2C+Virginia&radius=50
https://www.indeed.com/jobs?q=nurse&l=Fredericksburg+city%2C+Virginia&radius=50
https://www.indeed.com/jobs?q=nurse&l=Albemarle+County%2C+Virginia&radius=50
https://www.indeed.com/jobs?q=nurse&l=Virginia+Beach+city%2C+Virginia&radius=50
https://www.indeed.com/jobs?q=nurse&l=Harrisonburg+city%2C+Virginia&radius=50
https://www.indeed.com/jobs?q=nurse&l=Lynchburg+city%2C+Virginia&radius=50
https://www.indeed.com/jobs?q=nurse&l=Winchester+city%2C+Virginia&radius=50
https://www.indeed.com/jobs?q=nurse&l=Alexandria+city%2C+Virginia&radius=50
https://www.indeed.com/jobs?q=nurse&l=Roanoke+city%2C+Virginia&radius=50
https://www.indeed.com/jobs?q=nurse&l=Smyth+County%2C+Virginia&radius=50


Unnamed: 0,job_title,company_name,location,salary,days_since_posting,city
1,Registered Nurse | Vaccine Staff - Tysons-McLe...,Sameday Health,"Fairfax County, Virginia",$40 an hour,13 days ago,"Fairfax County, Virginia"
2,Nurse (Educator),US Department of Defense,"Fairfax County, Virginia","$95,409 - $119,709 a year",2 days ago,"Fairfax County, Virginia"
3,Advice Nurse,Kaiser Permanente,"Fairfax County, Virginia",NO SALARY,13 days ago,"Fairfax County, Virginia"
4,Evening Nurse- 32 hr Full Time,UMFS,"Fairfax County, Virginia",NO SALARY,Today,"Fairfax County, Virginia"
5,Supervisory Nurse (Clinical/Critical Care),US Department of Defense,"Fairfax County, Virginia","$119,678 - $151,988 a year",2 days ago,"Fairfax County, Virginia"
6,Quality Assurance Nurse,Confidential,"Springfield, VA 22030",$32 - $38 an hour,30+ days ago,"Fairfax County, Virginia"
7,Nurse (Clinical/Critical Care),US Department of Defense,"Fairfax County, Virginia","$95,409 - $119,709 a year",1 day ago,"Fairfax County, Virginia"
8,Registered Nurse Periop Operating Room,Novant Health,"Manassas, VA 20110",NO SALARY,11 days ago,"Fairfax County, Virginia"
9,School Nurse for the 2021 - 2022 Schol Year,Diocese of Arlington Catholic Schools,"Fairfax County, Virginia",NO SALARY,15 days ago,"Fairfax County, Virginia"
10,Aesthetic Nurse Injector,Séchoir,"Fairfax County, Virginia",NO SALARY,4 days ago,"Fairfax County, Virginia"
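
The `salary` column comes back as free text such as `$40 an hour` or `$95,409 - $119,709 a year`, so answering the compensation question would need a numeric form. A minimal sketch of one possible parser follows. The function name `parse_salary` and the returned `(low, high, period)` tuple are assumptions of this example, and it only handles the formats visible in this notebook's output.

```python
import re

def parse_salary(text):
    """Hypothetical parser: return (low, high, period) or None if no salary."""
    # Pull out every dollar amount, e.g. "$95,409" -> "95,409"
    amounts = re.findall(r"\$([\d,]+(?:\.\d+)?)", text)
    if not amounts:
        return None  # covers the "NO SALARY" placeholder
    values = [float(a.replace(",", "")) for a in amounts]
    # The pay period is the last word: "hour", "year", ...
    period_match = re.search(r"an? (\w+)\s*$", text)
    period = period_match.group(1) if period_match else None
    return (min(values), max(values), period)

for sample in ["$40 an hour", "$95,409 - $119,709 a year", "NO SALARY"]:
    print(sample, "->", parse_salary(sample))
```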


In [15]:
#copy df dataframe to new dataframe final_df so the scraped data stays unchanged
final_df = df.copy()

#drop duplicate rows
final_df.drop_duplicates(inplace=True)

## Geocode location data using Bing Maps API <a id='section_id4'></a>
https://docs.microsoft.com/en-us/bingmaps/rest-services/locations/find-a-location-by-query

In [16]:
#define function that will take location query and provide coordinates of the best search using Bing Maps API 
def extract_lat_long(query):
    BingMapsAPIKey = "KEY HERE"
    base_url = "http://dev.virtualearth.net/REST/v1/Locations/"
    endpoint = f"{base_url}{query}?&key={BingMapsAPIKey}"
    r = requests.get(endpoint)
    try: 
        j = r.json()
        lat = j['resourceSets'][0]['resources'][0]['geocodePoints'][0]['coordinates'][0]
        lng = j['resourceSets'][0]['resources'][0]['geocodePoints'][0]['coordinates'][1]
        coords = str(lat) + "," + str(lng)
        return coords
    except (ValueError, KeyError, IndexError):
        #no JSON body or no matching location in the response
        coords = "Error,Error"
        return coords
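
The lookup above indexes several levels deep into the JSON response, and a query with no match may leave the `resources` list empty. The sketch below separates that parsing step into a hypothetical helper, `parse_coords`, so it can be exercised without a network call or API key. The sample payloads are hand-built to mirror only the keys used above; they are not real Bing Maps responses.

```python
def parse_coords(payload):
    """Hypothetical helper: return "lat,lng" from a response dict, or "Error,Error"."""
    try:
        point = payload["resourceSets"][0]["resources"][0]["geocodePoints"][0]
        lat, lng = point["coordinates"][0], point["coordinates"][1]
        return f"{lat},{lng}"
    except (KeyError, IndexError, TypeError):
        # Missing keys or an empty result list both fall back to the error marker
        return "Error,Error"

# Hand-built payloads that mirror only the keys used above
found = {"resourceSets": [{"resources": [
    {"geocodePoints": [{"coordinates": [38.83, -77.28]}]}]}]}
empty = {"resourceSets": [{"resources": []}]}

print(parse_coords(found))
print(parse_coords(empty))
```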

In [18]:
#create query column
final_df['query'] = final_df['company_name'].astype(str) + " " + final_df['location']

#remove commas
final_df['query'] = final_df['query'].str.replace(',', '')

#replace spaces with %20 to URL-encode the string
final_df['query'] = final_df['query'].str.replace(' ', '%20')
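
The two `str.replace` calls above handle commas and spaces, but company names can contain other characters that are not URL-safe (for example `&` or `/`). As a sketch of an alternative, the standard library's `urllib.parse.quote` percent-encodes the whole string in one step. The helper name `build_query` is hypothetical.

```python
from urllib.parse import quote

def build_query(company_name, location):
    """Hypothetical helper: join company and location, then percent-encode."""
    raw = f"{company_name} {location}".replace(",", "")
    # safe="" also encodes '/' so it cannot be read as a path separator
    return quote(raw, safe="")

print(build_query("Sameday Health", "Fairfax County, Virginia"))
```

Applied row by row, for instance with a list comprehension over `zip(final_df['company_name'], final_df['location'])`, this would fill the same `query` column.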

In [19]:
#find best coordinates using Bing Maps API for searched query
coords = final_df['query'].map(extract_lat_long)

In [20]:
#assign coords Series to new dataframe column
final_df['coords'] = coords

In [21]:
#split the coords column by the comma and assign it to a list
coords_df = final_df["coords"].str.split(",", n = 1, expand = True)

In [23]:
#assign latitude coordinates to new lat column
final_df["lat"]= coords_df[0] 

#assign longitude coordinates to new long column
final_df["long"]= coords_df[1]

#remove coords and query column
final_df.drop(columns =["coords","query"], inplace = True)
final_df.head()

Unnamed: 0,job_title,company_name,location,salary,days_since_posting,city,lat,long
1,Registered Nurse | Vaccine Staff - Tysons-McLe...,Sameday Health,"Fairfax County, Virginia",$40 an hour,13 days ago,"Fairfax County, Virginia",38.83000183105469,-77.27999877929688
2,Nurse (Educator),US Department of Defense,"Fairfax County, Virginia","$95,409 - $119,709 a year",2 days ago,"Fairfax County, Virginia",38.83000183105469,-77.27999877929688
3,Advice Nurse,Kaiser Permanente,"Fairfax County, Virginia",NO SALARY,13 days ago,"Fairfax County, Virginia",38.83000183105469,-77.27999877929688
4,Evening Nurse- 32 hr Full Time,UMFS,"Fairfax County, Virginia",NO SALARY,Today,"Fairfax County, Virginia",38.83000183105469,-77.27999877929688
5,Supervisory Nurse (Clinical/Critical Care),US Department of Defense,"Fairfax County, Virginia","$119,678 - $151,988 a year",2 days ago,"Fairfax County, Virginia",38.83000183105469,-77.27999877929688


## Data Cleaning <a id='section_id5'></a>
#### Clean Data

In [109]:
#replace Error with NA (np.NaN was removed in NumPy 2.0, so use np.nan)
final_df['lat'] = final_df['lat'].replace("Error", np.nan)
final_df['long'] = final_df['long'].replace("Error", np.nan)

#convert lat/long from string to float
final_df['lat'] = final_df['lat'].astype(float)
final_df['long'] = final_df['long'].astype(float)

In [110]:
#remove NAs
final_df.dropna(inplace=True)

In [126]:
#remove job postings 30 days + old
final_df['days_since_posting'] = final_df['days_since_posting'].replace("30+ days ago", np.nan)
final_df.dropna(inplace=True)

#replace "Just posted" and "today" with 0 to represent 0 days since posting
final_df["days_since_posting"] = np.where(final_df["days_since_posting"] == "Just posted", "0 days ago", 
                                   np.where(final_df["days_since_posting"] == "Today","0 days ago", 
                                            final_df["days_since_posting"]))

In [134]:
#define function that will find the day number in the days_since_posting column
def find_number(string):
    days = re.findall(r'[0-9]+',string)
    return " ".join(days)

#apply this function to extract the days since posting number into new date_posted column
final_df["date_posted"] = final_df["days_since_posting"].apply(find_number)

In [136]:
#assign a variable with today's date
today = date.today()

#subtract days since posting from today's date to get the actual date posted
final_df["date_posted"] = final_df["date_posted"].apply(lambda x: today - datetime.timedelta(days=int(x)))
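
The cells above turn relative strings such as `4 days ago` into calendar dates in several steps. The same logic can be sketched as one standalone function. The name `posting_date` and the choice to return `None` for `30+ days ago` are assumptions of this example, not the notebook's own code.

```python
import re
from datetime import date, timedelta

def posting_date(text, today):
    """Hypothetical helper: convert Indeed's relative age text to a date."""
    if text in ("Just posted", "Today"):
        return today
    if text.startswith("30+"):
        return None  # exact age unknown, so no date can be assigned
    match = re.search(r"\d+", text)
    if match is None:
        return None
    return today - timedelta(days=int(match.group()))

run_date = date(2021, 2, 18)  # the run date shown in this notebook's output
for label in ["Today", "4 days ago", "1 day ago", "30+ days ago"]:
    print(label, "->", posting_date(label, run_date))
```

Passing `today` in as an argument, rather than calling `date.today()` inside the function, keeps the conversion reproducible when the scraper is rerun on later days.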

In [137]:
final_df

Unnamed: 0,job_title,company_name,location,salary,date_posted,lat,long,date,days_since_posting
1,COVID VACCINATION NURSE,Community of Hope,"Washington, DC",NO SALARY,4 days ago,38.892063,-77.019913,2021-02-14,2021-02-14
2,Interventional Radiology (IR) Training Course ...,Medstar Georgetown University Hospital,"Washington, DC 20007 (Georgetown area)",NO SALARY,0 days,38.909176,-77.064278,2021-02-18,2021-02-18
4,Occupational Health Nurse,US Department of Homeland Security,"Washington, DC","$103,690 - $134,798 a year",7 days ago,38.892063,-77.019913,2021-02-11,2021-02-11
5,RN for Mobile IV Company,Reset IV,"Washington, DC",From $65 an hour,4 days ago,38.892063,-77.019913,2021-02-14,2021-02-14
6,Clinical Nurse,Medstar Georgetown University Hospital,"Washington, DC",NO SALARY,0 days,38.892063,-77.019913,2021-02-18,2021-02-18
...,...,...,...,...,...,...,...,...,...
299,MDS Coordinator (RN),Greenbrier Regional Medical Center,"Norfolk, VA",NO SALARY,8 days ago,36.846165,-76.285912,2021-02-10,2021-02-10
304,Assisted Living Director LPN,Tarantino Properties Inc,"Norfolk, VA",NO SALARY,29 days ago,36.846165,-76.285912,2021-01-20,2021-01-20
306,"Registered Nurse, Cancer Outpatient - Flexi/PRN",Sentara Healthcare,"Norfolk, VA",NO SALARY,6 days ago,36.846165,-76.285912,2021-02-12,2021-02-12
307,Hospice Licensed Practical Nurse | PRN,Amedisys,"Norfolk, VA",NO SALARY,1 day ago,36.846165,-76.285912,2021-02-17,2021-02-17


In [164]:
#name the CSV based on today's date
csv = "indeed_nurse_search_" + str(today) + ".csv"

#write the dataframe to the CSV file
final_df.to_csv(csv,index=False)

'indeed_nurse_search_2021-02-18.csv'

## Visualize Data <a id='section_id6'></a>

In [169]:
# create a map of Virginia (coordinates are the general center of Virginia)
m1 = folium.Map([37.5, -79], zoom_start=7)

# convert to (n, 2) nd-array format for heatmap
stationArr = final_df[['lat', 'long']].values

# plot heatmap (HeatMap lives in the folium.plugins module imported above)
plugins.HeatMap(stationArr, max_opacity=0.2, radius = 15,blur = 15).add_to(m1)
m1

## Conclusion <a id='section_id7'></a>

## Appendix <a id='section_id8'></a>

This section will be updated as additional data is collected and other data sets are combined with the original dataset.

One data set that has already been joined to the original data is COVID-19 case data by county. This will allow us to see whether the number of job postings is correlated with spikes in COVID-19 case counts. The COVID-19 positive case data is from the New York Times.

In [22]:
# Add COVID-19 Data
r = requests.get("https://api.covidtracking.com/v1/states/va/daily.json")
j = r.json()

#download county-level data from the New York Times COVID-19 GitHub repository
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'
county_covid = pd.read_csv(url,parse_dates=[0])
county_covid.head()

#filter to Virginia
county_covid[county_covid['state']=="Virginia"]

Unnamed: 0,date,county,state,fips,cases,deaths
739,2020-03-07,Fairfax,Virginia,51059.0,1,0.0
851,2020-03-08,Fairfax,Virginia,51059.0,2,0.0
981,2020-03-09,Fairfax,Virginia,51059.0,4,0.0
982,2020-03-09,Virginia Beach city,Virginia,51810.0,1,0.0
1143,2020-03-10,Fairfax,Virginia,51059.0,4,0.0
...,...,...,...,...,...,...
1082521,2021-03-02,Williamsburg city,Virginia,51830.0,532,10.0
1082522,2021-03-02,Winchester city,Virginia,51840.0,2469,38.0
1082523,2021-03-02,Wise,Virginia,51195.0,2878,92.0
1082524,2021-03-02,Wythe,Virginia,51197.0,2035,61.0
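
The `cases` column in the New York Times file is cumulative, so spotting a spike means looking at day-over-day changes first. A minimal sketch of that differencing step follows, using the first four Fairfax rows shown above. The function name `daily_new_cases` is hypothetical.

```python
def daily_new_cases(cumulative):
    """Hypothetical helper: turn cumulative counts into per-day new cases."""
    new_cases = []
    previous = 0
    for total in cumulative:
        # Clamp at zero in case a later correction lowers the running total
        new_cases.append(max(total - previous, 0))
        previous = total
    return new_cases

fairfax_cumulative = [1, 2, 4, 4]  # 2020-03-07 to 2020-03-10, from the table above
print(daily_new_cases(fairfax_cumulative))
```

On the full dataframe, a comparable result could likely be obtained with `county_covid.groupby('fips')['cases'].diff()`, assuming the rows are sorted by date within each county.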
