# Web Scrape Indeed.com & Predict Salaries with Machine Learning

The goal of this project is to practice my data science and machine learning skills, in particular three of them: web scraping, data cleaning, and building binary classifiers.

Since I am interested in actuarial science, I will scrape Indeed.com to learn about the actuarial job market, collecting the job title, company, location, and salary of each listing. Using these data, I will then build several binary classifiers that predict whether the salary of an actuarial job is at or above the median salary. The project has three parts: the first covers web scraping and data collection, the second covers data cleaning, and the third builds binary classifiers such as Logistic Regression and Random Forest.

## Part I: Scraping job listings from Indeed.com

I will be using BeautifulSoup to scrape job listings from Indeed.com.

Let's take a look at the page that I will be scraping: https://www.indeed.com/jobs?q=actuarial+analyst&l=Chicago&start=10

Notice that each job listing is tied to a `div` tag with a class name of `result`.
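
To make that structure concrete, here is a minimal sketch that runs BeautifulSoup over a small hand-written HTML fragment. The fragment is an assumption modeled on the class names used in this notebook (`result`, `jobtitle`, `company`, `location`, `no-wrap`); the real Indeed markup is larger and changes over time. `text_of` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two listings, only the first advertises a salary.
SAMPLE_HTML = """
<div class="result">
  <a class="jobtitle" data-tn-element="jobTitle">Actuarial Analyst</a>
  <span class="company">Example Insurance Co.</span>
  <span class="location">Chicago, IL</span>
  <span class="no-wrap">$60,000 - $70,000 a year</span>
</div>
<div class="result">
  <a class="jobtitle" data-tn-element="jobTitle">Senior Actuarial Analyst</a>
  <span class="company">Another Example LLC</span>
  <span class="location">Naperville, IL</span>
</div>
"""

def text_of(parent, css_class):
    """Return the stripped text of the first child with css_class, or None."""
    tag = parent.find(class_=css_class)
    return tag.get_text(strip=True) if tag is not None else None

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
listings = [
    {
        "title": text_of(div, "jobtitle"),
        "company": text_of(div, "company"),
        "location": text_of(div, "location"),
        "salary": text_of(div, "no-wrap"),
    }
    for div in sample_soup.find_all("div", class_="result")
]
print(listings)
```

Looking up each field inside its own `div` keeps the fields of one listing together, even when a field such as the salary is missing.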

The URL takes several query parameters:
- `q`: the job title to search for
- `l`: the location
- `start`: the offset of the first result to show. Results come ten per page, so `start=10` begins on page 2.
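
As a small illustration of how these parameters combine, the sketch below builds the same kind of URL with the standard library. `build_search_url` and `BASE_URL` are hypothetical names introduced here; the notebook itself uses a plain format string.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"

def build_search_url(job_title, location, start=0):
    """Build a search URL from the q, l and start query parameters."""
    params = {"q": job_title, "l": location, "start": start}
    # urlencode escapes spaces as '+', matching the URL format shown above
    return BASE_URL + "?" + urlencode(params)

print(build_search_url("actuarial analyst", "Chicago", start=10))
# https://www.indeed.com/jobs?q=actuarial+analyst&l=Chicago&start=10
```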

In [325]:
#import relevant libraries
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [309]:
URL = "https://www.indeed.com/jobs?q=actuarial+analyst&l=Chicago"

In [310]:
#conducting a request of the stated URL above:
page = requests.get(URL)

In [311]:
soup = BeautifulSoup(page.content, 'html.parser', from_encoding="utf-8")
#printing soup in a more structured tree format that makes for easier reading
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <script src="/s/1c5b9b9/en_US.js" type="text/javascript">
  </script>
  <link href="/s/73ceddb/jobsearch_all.css" rel="stylesheet" type="text/css"/>
  <link href="http://rss.indeed.com/rss?q=actuarial+analyst&amp;l=New+York" rel="alternate" title="Actuarial Analyst Jobs, Employment in New York State" type="application/rss+xml"/>
  <link href="/m/jobs?q=actuarial+analyst&amp;l=New+York" media="only screen and (max-width: 640px)" rel="alternate"/>
  <link href="/m/jobs?q=actuarial+analyst&amp;l=New+York" media="handheld" rel="alternate"/>
  <script type="text/javascript">
   if (typeof window['closureReadyCallbacks'] == 'undefined') {
        window['closureReadyCallbacks'] = [];
    }

    function call_when_jsall_loaded(cb) {
        if (window['closureReady']) {
            cb();
        } else {
            window['closureReadyCallbacks'].push(cb);
        }
    

In [312]:
# extract data helper functions
def extract_location(result):
    dflocation = pd.DataFrame(columns=["location"])
    for b in result.find_all('span', {'class': 'location'}):
        location = b.text
        dflocation.loc[len(dflocation)] = [location]
    return dflocation
        
def extract_company(result):      
    dfcompany = pd.DataFrame(columns=["company"])
    for i in result.find_all('span', {'class':'company'}):
        company = i.text
        dfcompany.loc[len(dfcompany)] = [company]
    return dfcompany

def extract_job_title(result):
    dfjob_title = pd.DataFrame(columns=["job_title"])
    for a in result.find_all('a', {'data-tn-element':'jobTitle'}):
        job_title = a.text
        dfjob_title.loc[len(dfjob_title)] = [job_title]
    return dfjob_title

def extract_salary(result):
    dfsalary = pd.DataFrame(columns=["salary"])
    for entry in result.find_all('span', {'class':'no-wrap'}):
        try:
            salary = entry.text
            dfsalary.loc[len(dfsalary)] = [salary]  
        except:
            salary = 'NA'
            dfsalary.loc[len(dfsalary)] = [salary]
    return dfsalary

In [313]:
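# a, b, c, d are assumed to be the helper outputs, in the column order of the table below
a, b, c, d = extract_location(soup), extract_company(soup), extract_job_title(soup), extract_salary(soup)
a.join(b).join(c).join(d).head()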

Unnamed: 0,location,company,job_title,salary
0,"Chicago, IL",\n\n Liberty Mutual,"Senior Actuarial Analyst, U. S. Planning - Glo...",relevance -\n date
1,"Chicago, IL 60601",\n\n PwC,Financial Services Insurance Business Analyst,
2,"Naperville, IL",\n\n Avenica,Junior Analyst - Entry Level,
3,"Northbrook, IL",\n\n CVS Health,Analytic Services Advisor,
4,"Chicago, IL 60290 (Loop area)",\n\n Ameriprise Financial,Actuarial Analyst,
5,"Chicago, IL","\n The Segal Group, Inc.",Actuarial Analyst,
6,"Chicago, IL",\n\n Kemper Corporation,Actuarial Analyst - Reserving,
7,"Chicago, IL 60601 (Loop area)",\n\n Allstate,Experienced Actuarial Analyst - Modeling & Maj...,
8,"Chicago, IL",\n\n Evolent Health,"Analyst, Client Analytics",
9,"Chicago, IL 60607 (Near West Side area)",\n\n Aspen Dental,Financial Analyst – Insurance,
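
The company column above still carries the newlines and padding from the page markup. Here is a minimal sketch of one way to normalize such strings using only built-in string methods. `normalize_whitespace` is a hypothetical helper, not part of the notebook.

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    if text is None:
        return None
    return " ".join(text.split())

raw_companies = ["\n\n Liberty Mutual", "\n The Segal Group, Inc.", None]
print([normalize_whitespace(name) for name in raw_companies])
# ['Liberty Mutual', 'The Segal Group, Inc.', None]
```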


Now, scale up the scraping. Remember that we can modify the URL to scrape different job titles, locations, and result pages.

In [314]:
# a list of cities to run through later
cities = ['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
    'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
    'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', 'washington+dc', 
    'Charlottesville', 'Richmond', 'Baltimore', 'San+Antonio', 'San+Diego', 'San+Jose',
    'Austin', 'Jacksonville', 'Indianapolis', 'Columbus', 'Charlotte', 'Detroit', 'El+Paso', 
    'Memphis', 'Boston', 'Nashville', 'Louisville', 'Milwaukee', 'Las+Vegas', 'Albuquerque', 'Tucson', 
    'Fresno', 'Sacramento', 'Long+Beach', 'Mesa', 'Virginia+Beach', 'Norfolk', 'Atlanta', 'Colorado+Springs',
    'Raleigh', 'Omaha', 'Oakland', 'Tulsa', 'Minneapolis', 'Cleveland', 'Wichita', 'Arlington', 'New+Orleans', 
    'Bakersfield', 'Tampa', 'Honolulu', 'Anaheim', 'Aurora', 'Santa+Ana', 'Riverside', 'Corpus+Christi', 'Pittsburgh', 
    'Lexington', 'Anchorage', 'Cincinnati', 'Baton+Rouge', 'Chesapeake', 'Alexandria', 'Fairfax']
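
Several cities appear more than once in the list (for example `'Atlanta'` and `'Pittsburgh'`), so the same result pages would be requested twice. A small sketch of removing the repeats while keeping the original order follows; `example_cities` is a shortened stand-in for the full list.

```python
example_cities = ['New+York', 'Chicago', 'Atlanta', 'Pittsburgh', 'Atlanta', 'Pittsburgh']

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats without reordering the list
unique_cities = list(dict.fromkeys(example_cities))
print(unique_cities)
# ['New+York', 'Chicago', 'Atlanta', 'Pittsburgh']
```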

In [315]:
url_template = "https://www.indeed.com/jobs?q=actuarial+analyst&l={}&start={}"
max_results_per_city = 100 # number of results to go through for each location (10 per page)

def text_or_none(tag):
    # return the tag's text without newlines, or None when the element is missing
    return tag.text.replace('\n', '') if tag is not None else None

rows = []
for city in cities:
    for start in range(0, max_results_per_city, 10):
        # Grab the results from the request (as above)
        url = url_template.format(city, start)
        html = requests.get(url)
        soup = BeautifulSoup(html.content, 'html.parser', from_encoding="utf-8")
        for each in soup.find_all(class_="result"):
            rows.append({
                'Title': text_or_none(each.find(class_='jobtitle')),
                'Location': text_or_none(each.find('span', {'class': 'location'})),
                'Company': text_or_none(each.find(class_='company')),
                'Salary': text_or_none(each.find('span', {'class': 'no-wrap'})),
            })
# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
df_more = pd.DataFrame(rows, columns=["Title","Location","Company","Salary"])
df_more = df_more.drop_duplicates().reset_index(drop=True)
df_more

Unnamed: 0,Title,Location,Company,Salary
0,INSURANCE STATE REGULATORY COMPLIANCE ANALYST,"New York, NY",Transatlantic Reinsurance Co.,
1,Quantitative Corporate Finance Analyst,"New York, NY",Global Atlantic Financial Group,
2,Actuarial Analyst,"New York, NY",DW Simpson Global Actuarial Recruitment,
3,Senior Data Analyst (Health Care Analytics Exp...,"New York, NY",One Call Care Management,
4,Actuarial Analyst - Entry Level,"New York, NY 10119 (Chelsea area)",Milliman,
5,Actuarial Analyst,"New York, NY",AIG,
6,Actuarial Analyst,"New York, NY",Arch,
7,Actuarial Analyst,"Garden City, NY",Milliman,
8,Investment Risk Actuarial Analyst,"New York, NY 10036",AXIS Insurance,
9,Catastrophe Risk Analyst,"New York, NY",Network ESC,"$100,000 - $115,000 a year"


## Part II: Data Cleaning

With the data scraped and collected, we can move on to cleaning and preprocessing it.

I want to keep only the annual salaries and discard the monthly and hourly ones. I could convert monthly and hourly pay to annual figures, but the conversion would rest on assumptions (for example, hours worked per year), so the results might not be accurate.
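
To illustrate why such a conversion is only approximate, here is a sketch of what it could look like. The multipliers are assumptions (a 40-hour week over 52 weeks, i.e. 2,080 hours, and 12 months), and `annualize` is a hypothetical helper that the rest of this notebook does not use.

```python
HOURS_PER_YEAR = 40 * 52   # assumption: full-time, no unpaid leave
MONTHS_PER_YEAR = 12

def annualize(amount, period):
    """Convert a pay amount to an approximate annual figure."""
    if period == "year":
        return float(amount)
    if period == "month":
        return float(amount) * MONTHS_PER_YEAR
    if period == "hour":
        return float(amount) * HOURS_PER_YEAR
    raise ValueError("unknown pay period: " + period)

print(annualize(25, "hour"))     # 52000.0
print(annualize(5000, "month"))  # 60000.0
```

A part-time or short-term role paid by the hour would be overstated by this rule, which is one reason to drop those rows instead.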

In [316]:
# keep annual salaries
df_more = df_more[df_more.Salary.notnull()]
df_more = df_more[df_more.Salary.str.contains('year')]

In [317]:
df_more

Unnamed: 0,Title,Location,Company,Salary
9,Catastrophe Risk Analyst,"New York, NY",Network ESC,"$100,000 - $115,000 a year"
34,"Underwriting Specialist, Property, National In...","New York, NY",Liberty Mutual,"$74,000 - $140,000 a year"
38,Junior Catastrophe Risk Analyst,"New York, NY",S.C. International,"$75,000 a year"
51,Actuarial Analyst II,"New York, NY",S.C. International,"$75,000 a year"
57,Upstate Actuarial Analyst #81164,"Syracuse, NY",Ezra Penland Actuarial Recruitment,"$60,007 - $75,008 a year"
58,Senior Actuarial Analyst - Pricing,"New York, NY",Oliver James Associates,"$85,000 - $115,000 a year"
62,OPEB Actuarial analyst/consultant,"New York, NY",S.C. International,"$100,000 a year"
63,Sr. Actuarial Analyst,"New York, NY",S.C. International,"$85,000 a year"
65,Health Actuarial Sr. Analyst,"New York, NY",S.C. International,"$100,000 a year"
68,Product Analyst with Health Insurance Experience,"New York, NY",Workbridge Associates,"$100,000 - $140,000 a year"


In [318]:
# strip out any unnecessary symbols or whitespace
df_more = df_more.copy()  # work on a copy so the filtered slice can be modified safely
df_more['Salary'] = df_more['Salary'].str.replace(r"a year|\(Indeed est\.\)|[$,\s]", "", regex=True)
df_more['Salary_Split'] = df_more['Salary'].str.split('-')
df_more

Unnamed: 0,Title,Location,Company,Salary,Salary_Split
9,Catastrophe Risk Analyst,"New York, NY",Network ESC,100000-115000,"[100000, 115000]"
34,"Underwriting Specialist, Property, National In...","New York, NY",Liberty Mutual,74000-140000,"[74000, 140000]"
38,Junior Catastrophe Risk Analyst,"New York, NY",S.C. International,75000,[75000]
51,Actuarial Analyst II,"New York, NY",S.C. International,75000,[75000]
57,Upstate Actuarial Analyst #81164,"Syracuse, NY",Ezra Penland Actuarial Recruitment,60007-75008,"[60007, 75008]"
58,Senior Actuarial Analyst - Pricing,"New York, NY",Oliver James Associates,85000-115000,"[85000, 115000]"
62,OPEB Actuarial analyst/consultant,"New York, NY",S.C. International,100000,[100000]
63,Sr. Actuarial Analyst,"New York, NY",S.C. International,85000,[85000]
65,Health Actuarial Sr. Analyst,"New York, NY",S.C. International,100000,[100000]
68,Product Analyst with Health Insurance Experience,"New York, NY",Workbridge Associates,100000-140000,"[100000, 140000]"


In [319]:
# helper function to calculate the average salaries
def avg(salary):
    salary['Lower'] = salary['Salary_Split'].str[0].astype('float')
    salary['Upper'] = salary['Salary_Split'].str[1].astype('float')
    salary['Average Salary'] = salary[['Lower','Upper']].mean(axis=1)
avg(df_more)
df_more.head()

Unnamed: 0,Title,Location,Company,Salary,Salary_Split,Lower,Upper,Average Salary
9,Catastrophe Risk Analyst,"New York, NY",Network ESC,100000-115000,"[100000, 115000]",100000.0,115000.0,107500.0
34,"Underwriting Specialist, Property, National In...","New York, NY",Liberty Mutual,74000-140000,"[74000, 140000]",74000.0,140000.0,107000.0
38,Junior Catastrophe Risk Analyst,"New York, NY",S.C. International,75000,[75000],75000.0,,75000.0
51,Actuarial Analyst II,"New York, NY",S.C. International,75000,[75000],75000.0,,75000.0
57,Upstate Actuarial Analyst #81164,"Syracuse, NY",Ezra Penland Actuarial Recruitment,60007-75008,"[60007, 75008]",60007.0,75008.0,67507.5
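
The same range-to-average step can also be written as a single pure-Python function, which is handy for spot-checking individual rows. This is a sketch rather than the notebook's method, and `salary_midpoint` is a hypothetical name.

```python
import re

def salary_midpoint(text):
    """Return the midpoint of a salary string such as '$100,000 - $115,000 a year'.

    A single figure ('$75,000 a year') is returned unchanged as a float.
    """
    # pull out every run of digits and commas, then drop the commas
    numbers = [float(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", text)]
    if not numbers:
        return None
    return sum(numbers) / len(numbers)

print(salary_midpoint("$100,000 - $115,000 a year"))  # 107500.0
print(salary_midpoint("$60,007 - $75,008 a year"))    # 67507.5
print(salary_midpoint("$75,000 a year"))              # 75000.0
```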


In [320]:
clean_sal = df_more.drop(['Salary','Salary_Split', 'Lower', 'Upper'], axis=1)
clean_sal.head()

Unnamed: 0,Title,Location,Company,Average Salary
9,Catastrophe Risk Analyst,"New York, NY",Network ESC,107500.0
34,"Underwriting Specialist, Property, National In...","New York, NY",Liberty Mutual,107000.0
38,Junior Catastrophe Risk Analyst,"New York, NY",S.C. International,75000.0
51,Actuarial Analyst II,"New York, NY",S.C. International,75000.0
57,Upstate Actuarial Analyst #81164,"Syracuse, NY",Ezra Penland Actuarial Recruitment,67507.5


In [322]:
clean_sal['citystate'] = clean_sal['Location'].str.split(',') #splitting the location to separate city and state
clean_sal['city'] = clean_sal['citystate'].str[0] #getting cities
clean_sal['state'] = clean_sal['citystate'].str[1] #getting states
clean_sal['state'] = clean_sal['state'].str.strip().str[:2] #dropping the leading space and any ZIP code, keeping only the 2-letter state code
clean_sal.drop(['Location','citystate'], axis=1, inplace=True) #dropping columns so I'm only left with cities/states
clean_sal.head()

Unnamed: 0,Title,Company,Average Salary,city,state
9,Catastrophe Risk Analyst,Network ESC,107500.0,New York,NY
34,"Underwriting Specialist, Property, National In...",Liberty Mutual,107000.0,New York,NY
38,Junior Catastrophe Risk Analyst,S.C. International,75000.0,New York,NY
51,Actuarial Analyst II,S.C. International,75000.0,New York,NY
57,Upstate Actuarial Analyst #81164,Ezra Penland Actuarial Recruitment,67507.5,Syracuse,NY
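
Locations such as "Chicago, IL 60290 (Loop area)" carry a ZIP code and a neighborhood after the state. As an alternative sketch, a regular expression can pick out the city and the two-letter state code in one step. `split_location` is a hypothetical helper, and the "United States" input is an invented example of a value that does not fit the pattern.

```python
import re

# city, then a comma, then a two-letter state code; anything after it is ignored
LOCATION_PATTERN = re.compile(r"^\s*(?P<city>[^,]+?)\s*,\s*(?P<state>[A-Z]{2})\b")

def split_location(location):
    """Return (city, state) or (None, None) when the pattern does not match."""
    match = LOCATION_PATTERN.match(location)
    if match is None:
        return (None, None)
    return (match.group("city"), match.group("state"))

print(split_location("Chicago, IL 60290 (Loop area)"))  # ('Chicago', 'IL')
print(split_location("New York, NY"))                   # ('New York', 'NY')
print(split_location("United States"))                  # (None, None)
```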


In [326]:
median = np.median(clean_sal['Average Salary'])
median

88500.0

In [327]:
clean_sal.to_csv("~/Desktop/clean_salary.csv" , sep=',', encoding='utf-8')

In [383]:
c = pd.read_csv('~/Desktop/clean_salary.csv')

In [384]:
c['dumsal'] = (c["Average Salary"] >= c["Average Salary"].median()).astype(int)
c.drop(['Unnamed: 0'], axis=1, inplace=True)
c.head()

Unnamed: 0,Title,Company,Average Salary,city,state,dumsal
0,Catastrophe Risk Analyst,Network ESC,107500.0,New York,NY,1
1,"Underwriting Specialist, Property, National In...",Liberty Mutual,107000.0,New York,NY,1
2,Junior Catastrophe Risk Analyst,S.C. International,75000.0,New York,NY,0
3,Actuarial Analyst II,S.C. International,75000.0,New York,NY,0
4,Upstate Actuarial Analyst #81164,Ezra Penland Actuarial Recruitment,67507.5,Syracuse,NY,0


In [385]:
c['dumsal'].value_counts()

1    83
0    82
Name: dumsal, dtype: int64

## Part III: Building binary classifiers

In [412]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict

In [400]:
X = c.iloc[:,:-1]
y = c.iloc[:,5]

X.shape

(165, 5)

In [401]:
X_city_dummy = pd.get_dummies(X['city'])
type(X_city_dummy)

pandas.core.frame.DataFrame
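
For readers new to dummy variables, here is a tiny sketch of what `pd.get_dummies` produces. The three-row `toy` frame is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"city": ["New York", "Chicago", "New York"]})

# one indicator column per distinct city, in sorted order
toy_dummies = pd.get_dummies(toy["city"])
print(list(toy_dummies.columns))                 # ['Chicago', 'New York']
print(toy_dummies.astype(int).values.tolist())   # [[0, 1], [1, 0], [0, 1]]
```

Each row has exactly one indicator switched on, so a model trained on these columns alone can only learn a per-city rate.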

In [413]:
X_train, X_test, y_train, y_test = train_test_split(X_city_dummy, y, test_size=0.33)

In [414]:
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
logistic_model.score(X_test,y_test)

0.5818181818181818
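
For context, it helps to compare this accuracy with a trivial baseline. Because the target is split at the median, always predicting the larger class is right about half the time on the full data set. The sketch below computes that baseline from the class counts reported above (83 ones and 82 zeros).

```python
from collections import Counter

# class counts taken from the value_counts() output above
labels = [1] * 83 + [0] * 82

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(majority_class, round(baseline_accuracy, 3))
# 1 0.503
```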

In [416]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
# Perform 6-fold cross validation (on the 55-row test split only, not the full data)
scores = cross_val_score(logistic_model, X_test, y_test, cv=6)
print("Cross-Validated scores:", scores)
# Make cross validated predictions
predictions = cross_val_predict(logistic_model, X_test, y_test, cv=6)
accuracy = metrics.accuracy_score(y_test, predictions)
print("Cross-Predicted Accuracy:", accuracy)

Cross-Validated scores: [0.6        0.55555556 0.77777778 0.77777778 0.66666667 0.66666667]
Cross-Predicted Accuracy: 0.6727272727272727
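
The cross-validation above runs on the test split. A common alternative is to cross-validate on the training portion and keep the test split untouched for one final check. The sketch below shows that pattern on synthetic data from `make_classification`. The data, the sizes, and the `random_state` values are all assumptions for illustration, so the printed scores say nothing about the salary data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# synthetic stand-in for the real feature matrix and target
X_demo, y_demo = make_classification(n_samples=165, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

clf = LogisticRegression(max_iter=1000)

# 6-fold cross-validation on the training split only
cv_scores = cross_val_score(clf, X_tr, y_tr, cv=6)
print("mean CV accuracy:", cv_scores.mean())

# the test split is used once, after the model has been chosen
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```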


In [410]:
model = RandomForestClassifier(n_estimators=100)
model.fit(X_city_dummy, y)
cv_model = cross_val_score(model, X_city_dummy, y, cv=6)
print('Cross-validated scores:', cv_model)
print('Average score:', cv_model.mean())
print('Standard deviation of score:', cv_model.std())

Cross-validated scores: [0.46428571 0.32142857 0.53571429 0.53571429 0.51851852 0.38461538]
Average score: 0.4600461267127934
Standard deviation of score: 0.0814380733706713


In [418]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(class_weight='balanced')
RF = rf.fit(X_train,y_train)
s = cross_val_score(rf, X_train, y_train, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.5 ± 0.048


In [429]:
def senior(x):
    if 'Senior' in x or 'Sr.' in x:
        return 1
    return 0

c['Senior'] = c['Title'].apply(senior)
c[c.Senior != 0].head()

Unnamed: 0,Title,Company,Average Salary,city,state,dumsal,Senior,Analyst,Consultant,Intern,Junior
5,Senior Actuarial Analyst - Pricing,Oliver James Associates,100000.0,New York,NY,1,1,1,0,0,0
7,Sr. Actuarial Analyst,S.C. International,85000.0,New York,NY,0,1,1,0,0,0
8,Health Actuarial Sr. Analyst,S.C. International,100000.0,New York,NY,1,1,1,0,0,0
11,Senior Commercial Lines Actuarial Analyst #80151,Ezra Penland Actuarial Recruitment,85000.5,New York,NY,0,1,1,0,0,0
12,Senior Actuarial Analyst,DW Simpson Global Actuarial Recruitment,150000.0,New York,NY,1,1,1,0,0,0


In [423]:
def analyst(x):
    if 'Analyst' in x:
        return 1
    return 0

c['Analyst'] = c['Title'].apply(analyst)
c[c.Analyst != 0].head()

Unnamed: 0,Title,Company,Average Salary,city,state,dumsal,Senior,Analyst
0,Catastrophe Risk Analyst,Network ESC,107500.0,New York,NY,1,0,1
2,Junior Catastrophe Risk Analyst,S.C. International,75000.0,New York,NY,0,0,1
3,Actuarial Analyst II,S.C. International,75000.0,New York,NY,0,0,1
4,Upstate Actuarial Analyst #81164,Ezra Penland Actuarial Recruitment,67507.5,Syracuse,NY,0,0,1
5,Senior Actuarial Analyst - Pricing,Oliver James Associates,100000.0,New York,NY,1,1,1


In [424]:
def consultant(x):
    if 'Consultant' in x:
        return 1
    return 0

c['Consultant'] = c['Title'].apply(consultant)
c[c.Consultant != 0].head()

Unnamed: 0,Title,Company,Average Salary,city,state,dumsal,Senior,Analyst,Consultant
50,Consultant,S.C. International,100000.0,Los Angeles,CA,1,0,0,1
58,"Underwriting Consultant, National Insurance Pr...",Liberty Mutual,120000.0,Philadelphia,PA,1,0,0,1
74,Consultant,S.C. International,100000.0,Atlanta,GA,1,0,0,1
108,Consultant,S.C. International,100000.0,Washington,DC,1,0,0,1
135,"Senior Analyst or Consultant, Advanced Analyti...",Liberty Mutual,111800.0,Boston,MA,1,1,1,1


In [427]:
def junior(x):
    if 'Junior' in x:
        return 1
    return 0

c['Junior'] = c['Title'].apply(junior)

In [430]:
def associate(x):
    if 'Associate' in x:
        return 1
    return 0

c['Associate'] = c['Title'].apply(associate)
c[c.Associate != 0].head()

Unnamed: 0,Title,Company,Average Salary,city,state,dumsal,Senior,Analyst,Consultant,Intern,Junior,Associate
17,Associate Actuary,S.C. International,80000.0,Chicago,IL,0,0,0,0,0,0,1
28,"Analyst, Actuarial I Associate",Blue Shield of California,104500.0,San Francisco,CA,1,0,1,0,0,0,1
90,Actuarial Associate,York Risk Services Group,106000.0,Portland,OR,1,0,0,0,0,0,1
105,Health Associate Actuary #80159,Ezra Penland Actuarial Recruitment,95000.0,Washington,DC,1,0,0,0,0,0,1
144,Associate Actuary - Medicare (Chicago,Milliman,80500.0,Brookfield,WI,0,0,0,0,0,0,1
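
The title flags above all follow the same pattern, so they can be produced in one loop. The sketch below assumes that pattern and uses a small made-up DataFrame. The `Intern` column appears in the feature table although the cell that creates it is not shown, so defining it as a plain substring match is my assumption. Such a match would also fire on titles containing "International" or "Internal".

```python
import pandas as pd

# flag name -> substrings that switch the flag on (the 'Intern' rule is an assumption)
KEYWORDS = {
    "Senior": ["Senior", "Sr."],
    "Analyst": ["Analyst"],
    "Consultant": ["Consultant"],
    "Intern": ["Intern"],
    "Junior": ["Junior"],
    "Associate": ["Associate"],
}

def add_title_flags(frame, title_column="Title"):
    """Return a copy of frame with one 0/1 column per keyword."""
    out = frame.copy()
    for flag, needles in KEYWORDS.items():
        out[flag] = out[title_column].apply(
            lambda title: int(any(needle in title for needle in needles))
        )
    return out

titles = pd.DataFrame({"Title": ["Sr. Actuarial Analyst", "Associate Actuary", "Consultant"]})
flags = add_title_flags(titles)
print(flags[["Senior", "Analyst", "Consultant", "Associate"]].values.tolist())
# [[1, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
```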


In [447]:
feature = c.drop(['Title','Company','city','state','Average Salary','dumsal'], axis=1)
feature.head()

Unnamed: 0,Senior,Analyst,Consultant,Intern,Junior,Associate
0,0,1,0,0,0,0
1,0,0,0,0,0,0
2,0,1,0,0,1,0
3,0,1,0,0,0,0
4,0,1,0,0,0,0


In [448]:
X_train_2 = feature.join(X_city_dummy)
X_train_2.head(10)

Unnamed: 0,Senior,Analyst,Consultant,Intern,Junior,Associate,Arlington,Atlanta,Bala-Cynwyd,Baltimore,...,San Diego,San Francisco,San Mateo,Scottsdale,Seattle,Syracuse,Tampa,Tampa Bay,Washington,Wilsonville
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
5,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [449]:
X_train, X_test, y_train, y_test = train_test_split(X_train_2, y, test_size=0.33)

In [452]:
rf = RandomForestClassifier(class_weight='balanced')
RF = rf.fit(X_train,y_train)
s = cross_val_score(rf, X_train, y_train, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.727 ± 0.022


In [456]:
X_train, X_test, y_train, y_test = train_test_split(feature, y, test_size=0.33)
rf = RandomForestClassifier(class_weight='balanced')
RF = rf.fit(X_train,y_train)
s = cross_val_score(rf, X_train, y_train, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.718 ± 0.028
