# Web Scraping for Indeed.com & Predicting Salaries

For this project, we used a set of characteristics to predict the salaries for data science jobs in the Boston area. We scraped Indeed.com, an online job board, picked out key phrases, and determined whether each job's salary was above a threshold value. We then used logistic regression to model this system. **WE FOUND ... ** We expanded our analysis to other cities and compared the models, **AND FOUND ... **

## Code

### Import libraries

In [1]:
# Webscraping
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Data and analysis
import pandas as pd
import statsmodels.formula.api as sm
import sklearn

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### Define webscraping functions

In [2]:
def count_results(query = "", location = "Boston", binary_level = 1):
    # Accumulates one [job ID, job title, binary_level] row per listing
    result_list = []

    # Find the number of results
    URL_for_count = "http://www.indeed.com/jobs?q=data+scientist+{}&l={}".format(query, location)
    soup_for_count = BeautifulSoup(urlopen(URL_for_count).read(), 'html.parser')

    # The total is the last space-separated token of the count text; strip thousands separators
    results_number = soup_for_count.find("div", attrs = {"id": "searchCount"}).text
    number_of_results = int(results_number.split(sep = ' ')[-1].replace(',', ''))

    # Now loop through the pages. Viewing 100 results at a time means fewer page requests.
    last_page = number_of_results // 100
    for page_number in range(last_page + 1):
        URL_for_results = "http://www.indeed.com/jobs?q=data+scientist+{}&l={}&limit=100&start={}".format(query, location, 100 * page_number)
        soup_for_results = BeautifulSoup(urlopen(URL_for_results).read(), 'html.parser')
        results = soup_for_results.find_all('div', attrs={'data-tn-component': 'organicJob'})

        # Extract the ID and title for each job listing
        for x in results:
            job_id = x.find('h2', attrs={"class": "jobtitle"})['id']
            job_title = x.find('a', attrs={'data-tn-element': "jobTitle"}).text.strip().capitalize()
            result_list.append([job_id, job_title, binary_level])

    # Build the dataframe once, after every page has been read,
    # so that rows from earlier pages are not added again on each pass
    job_ids = pd.DataFrame(result_list)

    # Remove re-posted jobs
    job_ids.drop_duplicates(inplace = True)
    return job_ids
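
The paging arithmetic in `count_results` can be checked without touching the network. The following is a minimal sketch, assuming the same 100-results-per-page `limit` and `start` parameters used above; `page_offsets` is a hypothetical helper that does not appear in the original notebook.

```python
def page_offsets(number_of_results, page_size=100):
    """Return the `start` value for each results page, mirroring count_results."""
    last_page = number_of_results // page_size
    return [page_size * page_number for page_number in range(last_page + 1)]


# 250 results are spread over three pages
print(page_offsets(250))
```

When the result count is an exact multiple of the page size, this scheme requests one extra page. That page simply returns no listings, so the loop body does nothing for it.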

In [3]:
# String format: "keyword keyword"
def count_results_by_keywords(query_string = None):
    
    # Ends the function if given invalid inputs
    if query_string is None:
        print("No keyword entered.")
        return None
    
    # Format the keyword string into a URL query
    keywords = query_string.split(" ")
    query = "%20OR%20".join(keywords)

    # Perform the search
    job_ids = count_results("%28{}%29".format(query))
    
    # Rename job_ids's columns
    job_ids.columns = ['id', 'job title', " OR ".join(keywords)]
    
    return job_ids
    
    
def count_results_by_salary(salary_range_divider = 90000):
    
    # OPTIONAL: removes low values
    if salary_range_divider <= 20000:
        print("Enter a number larger than $20,000.")
        return None
        
    # Set dividing salaries
    divider_strings = ["+$20000-{}".format(salary_range_divider), "+${}".format(salary_range_divider)]
    
    # Perform two searches: low-salary jobs are labeled 0, high-salary jobs are labeled 1
    # (pd.concat replaces DataFrame.append, which newer pandas versions no longer provide)
    job_ids = pd.concat([count_results(salary_criterion, binary_level = level)
                         for level, salary_criterion in enumerate(divider_strings)])
    
    # Rename job_ids's columns
    job_ids.columns = ['id', 'job title', "salary over {}".format(salary_range_divider)]
    
    return job_ids


def count_results_by_years_experience(years_experience = None):
    
    # Ends the function if given invalid inputs
    if not isinstance(years_experience, int):
        print("Enter an integer value.")
        return None
    
    # Format the query to match both "N years" and "N+ years" (%2B is a URL-encoded "+")
    query = "{}+years+or+{}%2B+years".format(years_experience, years_experience)

    # Perform the search, storing the number of years as the column value
    job_ids = count_results("%28{}%29".format(query), binary_level = years_experience)
    
    # Rename job_ids's columns
    job_ids.columns = ['id', 'job title', "years experience"]
    
    return job_ids
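
The keyword functions above build their search strings by hand from URL escape codes. As a sketch of what those strings mean, the snippet below reproduces the formatting step from `count_results_by_keywords` and decodes it with the standard library. `format_keyword_query` is a hypothetical name used only for this illustration.

```python
from urllib.parse import quote, unquote


def format_keyword_query(query_string):
    """Join space-separated keywords with OR and wrap them in URL-encoded parentheses."""
    query = "%20OR%20".join(query_string.split(" "))
    return "%28{}%29".format(query)


encoded = format_keyword_query("PhD ph.d")
print(encoded)           # %28PhD%20OR%20ph.d%29
print(unquote(encoded))  # (PhD OR ph.d)

# The same string can be produced by percent-encoding the readable form
print(quote("(PhD OR ph.d)") == encoded)
```

Here `%28` and `%29` are the parentheses and `%20` is a space, so the search engine receives a single grouped OR expression.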

### Collecting and assembling data

In [4]:
phd_dataframe = count_results_by_keywords("PhD ph.d")

bachelors_dataframe = count_results_by_keywords("Bachelors BS BA") # Indeed.com's search includes "Bachelor's", "B.S.", etc.

python_dataframe = count_results_by_keywords("Python")

startup_dataframe = count_results_by_keywords("Startup start-up")

scientist_dataframe = count_results_by_keywords("Scientist")

# Search for 0 through 7 years of experience and stack the results
experience_dataframe = pd.concat([count_results_by_years_experience(years) for years in range(8)])

salary_dataframe = count_results_by_salary(90000)

In [5]:
# Merge dataframes: outer joins keep every job that appeared in at least one search
master_dataframe = phd_dataframe
for feature_dataframe in [bachelors_dataframe, python_dataframe, startup_dataframe, scientist_dataframe, experience_dataframe, salary_dataframe]:
    master_dataframe = master_dataframe.merge(feature_dataframe, on = ['id', 'job title'], how = 'outer')

# Convert non-id columns to integers; a missing value means the job was absent from that search
data_conversion_mask = (master_dataframe.columns != 'id') & (master_dataframe.columns != 'job title')
feature_columns = master_dataframe.columns[data_conversion_mask]
master_dataframe[feature_columns] = master_dataframe[feature_columns].fillna(value = 0).astype(int)

In [6]:
master_dataframe.head()

| | id | job title | PhD OR ph.d | Bachelors OR BS OR BA | Python | Startup OR start-up | Scientist | years experience | salary over 90000 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | jl_115b40def725eb2c | Biomedical data scientist | 1 | 0 | 1 | 0 | 1 | 2 | 1 |
| 1 | jl_2370348e80d8c420 | Data scientist | 1 | 0 | 1 | 0 | 1 | 3 | 1 |
| 2 | jl_1f1198f76898781f | Data scientist | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 3 | jl_04a882314da504d4 | Mass spectrometry data analyst, usa-boston | 1 | 0 | 0 | 0 | 1 | 3 | 1 |
| 4 | jl_65d62034685dd0ed | Data scientist | 1 | 0 | 0 | 0 | 0 | 0 | 0 |


### Modeling with statsmodels logistic regression

In [7]:
model_dataframe = master_dataframe.drop(labels = ['id', 'job title'], axis = 1)
# Replace spaces, periods, and hyphens so the column names are valid in a model formula
model_dataframe.columns = model_dataframe.columns.str.replace(r'[ .\-]', '_', regex = True)
model_dataframe.columns

Index(['PhD_OR_ph_d', 'Bachelors_OR_BS_OR_BA', 'Python', 'Startup_OR_start_up',
       'Scientist', 'years_experience', 'salary_over_90000'],
      dtype='object')

In [8]:
model = sm.logit("{} ~ PhD_OR_ph_d + Bachelors_OR_BS_OR_BA + Python + Startup_OR_start_up + Scientist + years_experience".format(model_dataframe.columns[-1]), data=model_dataframe).fit()
model.summary()

Optimization terminated successfully.
         Current function value: 0.639109
         Iterations 5


| | | | |
|---|---|---|---|
| Dep. Variable: | salary_over_90000 | No. Observations: | 1907 |
| Model: | Logit | Df Residuals: | 1900 |
| Method: | MLE | Df Model: | 6 |
| Date: | Fri, 30 Dec 2016 | Pseudo R-squ.: | 0.07723 |
| Time: | 23:47:34 | Log-Likelihood: | -1218.8 |
| converged: | True | LL-Null: | -1320.8 |
| | | LLR p-value: | 2.646e-41 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -0.6797 | 0.116 | -5.868 | 0.000 | -0.907 -0.453 |
| PhD_OR_ph_d | 0.2936 | 0.107 | 2.746 | 0.006 | 0.084 0.503 |
| Bachelors_OR_BS_OR_BA | -0.5913 | 0.111 | -5.317 | 0.000 | -0.809 -0.373 |
| Python | 1.2878 | 0.132 | 9.731 | 0.000 | 1.028 1.547 |
| Startup_OR_start_up | 0.3981 | 0.177 | 2.244 | 0.025 | 0.050 0.746 |
| Scientist | 0.3324 | 0.134 | 2.484 | 0.013 | 0.070 0.595 |
| years_experience | 0.0670 | 0.023 | 2.939 | 0.003 | 0.022 0.112 |
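
Logit coefficients are on the log-odds scale, which makes them hard to read directly. The sketch below copies the coefficients from the summary above, exponentiates them into odds ratios, and applies the logistic function to one posting. The posting itself is hypothetical and chosen only for illustration.

```python
import math

# Coefficients copied from the logit summary above
coefficients = {
    "Intercept": -0.6797,
    "PhD_OR_ph_d": 0.2936,
    "Bachelors_OR_BS_OR_BA": -0.5913,
    "Python": 1.2878,
    "Startup_OR_start_up": 0.3981,
    "Scientist": 0.3324,
    "years_experience": 0.0670,
}

# exp(coefficient) is the multiplicative change in the odds of "salary over 90000"
odds_ratios = {name: math.exp(value)
               for name, value in coefficients.items() if name != "Intercept"}


def predicted_probability(features):
    """Apply the logistic function to the linear predictor for one posting."""
    linear_predictor = coefficients["Intercept"] + sum(
        coefficients[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-linear_predictor))


# Hypothetical posting: mentions Python and "scientist", asks for 3 years of experience
example_posting = {"Python": 1, "Scientist": 1, "years_experience": 3}

for name, ratio in odds_ratios.items():
    print("{:<24}{:.2f}".format(name, ratio))
print("Predicted probability: {:.3f}".format(predicted_probability(example_posting)))
```

An odds ratio above 1 means the keyword is associated with higher odds of exceeding the salary threshold, and a ratio below 1 means lower odds, holding the other features fixed.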


### Modeling with sklearn's logistic regression

In [9]:
# Import the submodules explicitly; a bare `import sklearn` does not load them
import sklearn.linear_model
import sklearn.preprocessing

modeling_mask = (master_dataframe.columns != 'id') & (master_dataframe.columns != 'job title') & (master_dataframe.columns != 'salary over 90000')
X = sklearn.preprocessing.minmax_scale(X = master_dataframe.loc[:, modeling_mask], copy = True)
y = master_dataframe['salary over 90000']
logreg = sklearn.linear_model.LogisticRegressionCV(cv = 6)
model = logreg.fit(X, y)
predictions = model.predict(X)
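
Min-max scaling matters here because `years experience` ranges from 0 to 7 while the keyword columns are 0 or 1. The following is a minimal pure-Python sketch of the per-column transformation that `minmax_scale` performs with its default range of 0 to 1. `minmax_scale_column` is a hypothetical helper, and the input values are made up for illustration.

```python
def minmax_scale_column(values):
    """Rescale one column linearly so its minimum maps to 0 and its maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


# Hypothetical years-of-experience values
years_experience = [0, 2, 3, 7]
scaled = minmax_scale_column(years_experience)
print(scaled)
```

After scaling, every feature lies on the same 0 to 1 range, so the regularization applied by `LogisticRegressionCV` penalizes the coefficients on a comparable footing.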

In [None]:
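# cross_val_score lives in sklearn.model_selection; the older sklearn.cross_validation module was removed
import sklearn.metrics
import sklearn.model_selection
print("Mean cross-validation ROC AUC: {}".format(sklearn.model_selection.cross_val_score(logreg, X, y, scoring='roc_auc').mean()))
print("Confusion matrix:\n", sklearn.metrics.confusion_matrix(y, predictions))
print("Model accuracy ((TP + TN) / total): {}".format(sklearn.metrics.accuracy_score(y, predictions)))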
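
Accuracy and precision are easy to confuse when reading a confusion matrix. The sketch below computes both from the four cell counts using plain Python. The labels and predictions are made-up values, and `confusion_counts` is a hypothetical helper rather than a scikit-learn function.

```python
def confusion_counts(y_true, y_pred):
    """Count true negatives, false positives, false negatives, and true positives."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for truth, guess in pairs if truth == 0 and guess == 0)
    fp = sum(1 for truth, guess in pairs if truth == 0 and guess == 1)
    fn = sum(1 for truth, guess in pairs if truth == 1 and guess == 0)
    tp = sum(1 for truth, guess in pairs if truth == 1 and guess == 1)
    return tn, fp, fn, tp


# Made-up labels (1 = salary over threshold) and predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found

print("Accuracy: {:.2f}, precision: {:.2f}, recall: {:.2f}".format(accuracy, precision, recall))
```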

In [None]:
# predict_proba returns one column per class in the order of logreg.classes_ (0 = under, 1 = over)
logreg_probabilities_dataframe = pd.DataFrame(logreg.predict_proba(X), columns = ["Under threshold", "Over threshold"])
model_vs_data_dataframe = pd.concat(objs = [logreg_probabilities_dataframe, master_dataframe[['salary over 90000']]], axis=1)
print(model_vs_data_dataframe.iloc[-10:])
print(master_dataframe[-10:])

It looks like the last 10 rows do not match any of our test features and might be throwing off the model. Let's see which rows have none of our test features.

In [None]:
# Columns 2 through -3 are the keyword indicators; .iloc replaces the removed .ix indexer
master_dataframe[master_dataframe.iloc[:, 2:-2].sum(axis = 1) == 0]

# Assignment
> In this project, we will practice two major skills: collecting data by scraping a website and then building a binary predictor with Logistic Regression.
>
> We are going to collect salary information on data science jobs in a variety of markets. Then, using the location, title, and summary of the job, we will attempt to predict the salary of the job. For job posting sites, this would be extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), being able to extrapolate or predict the expected salaries from other listings can help guide negotiations.
>
> Normally, we could use regression for this task; however, we will convert this problem into classification and use Logistic Regression.
>
> - Question: Why would we want this to be a classification problem?
> - Answer: While more precision may be better, there is a fair amount of natural variance in job salaries, so predicting a range may be useful.
>
> Therefore, the first part of the assignment will be focused on scraping Indeed.com. In the second, we'll focus on using listings with salary information to build a model and predict additional salaries.

## Scraping job listings from Indeed.com

Since each company has different data science needs, we expected the requirements to vary widely from one job posting to the next. We therefore decided to focus our analysis on a few key pieces of information in each job posting, specifically:

* If the job required a PhD or a bachelor's degree
* If the job asked for experience with Python
* If the company was likely to be a start-up
* If the job specifically used "scientist", as a proxy for expectations of technical knowledge

We also wanted to categorize each job according to whether its salary exceeded $90,000. This value represented the mean salary across all search results.
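
Turning a salary into a binary label is a single comparison against the threshold. The following is a minimal sketch, assuming a short list of made-up salaries whose mean happens to be $90,000. None of these numbers come from the scraped data.

```python
from statistics import mean

# Made-up annual salaries in dollars, for illustration only
salaries = [65000, 80000, 95000, 120000]

# Use the mean salary as the dividing line between the two classes
threshold = mean(salaries)

# 1 if the salary is over the threshold, 0 otherwise
labels = [int(salary > threshold) for salary in salaries]

print("Threshold:", threshold)
print("Labels:", labels)
```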

By a rough estimate, only a fraction of job posts explicitly included more than one of these features. Rather than following the hyperlink for each post and scanning each page, which would likely have formatting particular to each company, we reasoned that we could leverage Indeed.com's search to obtain information we could not readily access.

Each job post on Indeed.com has a unique ID. We could therefore employ a strategy where we changed the search query and kept track of which jobs were among the results. For example, the following are the first few jobs that required a PhD and mentioned 3 years of experience:

In [None]:
master_dataframe[(master_dataframe['PhD OR ph.d'] == 1) & (master_dataframe['years experience'] == 3)][['job title', 'PhD OR ph.d', 'years experience']].head()
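
The search-membership strategy can be sketched with plain sets: each search returns a set of job IDs, and a job's feature value is 1 if its ID appeared in that search and 0 otherwise. The IDs and search results below are hypothetical, and the dictionary layout is only an illustration of what the outer merges and zero-filling in the notebook accomplish.

```python
# Hypothetical search results: each query maps to the set of job IDs it returned
search_results = {
    "PhD OR ph.d": {"jl_a", "jl_b"},
    "Python": {"jl_b", "jl_c"},
}

# Every job that appeared in at least one search (the equivalent of an outer join)
all_ids = sorted(set().union(*search_results.values()))

# One row per job, one 0/1 indicator per search
feature_table = {
    job_id: {keyword: int(job_id in ids) for keyword, ids in search_results.items()}
    for job_id in all_ids
}

for job_id, features in feature_table.items():
    print(job_id, features)
```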

When capturing the data, we noticed that approximately 20% of the rows had duplicated IDs. We determined that these were reposted jobs and therefore removed the duplicates.
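
Removing re-posted jobs amounts to keeping the first occurrence of each identical row. The sketch below shows the idea with the standard library and also measures the share of rows that were duplicates. The rows are made up, and the approach only approximates what `drop_duplicates` does on the real dataframe.

```python
# Made-up (id, title) rows; the second "jl_a" row stands in for a re-posted job
rows = [
    ("jl_a", "Data scientist"),
    ("jl_b", "Data analyst"),
    ("jl_a", "Data scientist"),
]

# dict.fromkeys keeps the first occurrence of each row and preserves order
unique_rows = list(dict.fromkeys(rows))

duplicate_fraction = 1 - len(unique_rows) / len(rows)

print(unique_rows)
print("Share of duplicated rows: {:.0%}".format(duplicate_fraction))
```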

Initial modeling had poor accuracy (approximately 61%). This led us to suspect that the analysis was missing significant features. However, after inspecting the data, we found that Indeed.com's search returned a significant number of results that contained keywords like "data" and "science" but were not related to data science (see the sample below). These data points were removed to avoid potentially skewing the results.

In [None]:
# SHOW EXAMPLES OF NON-DATA SCIENCE JOBS

# Results and Analysis

Since we examined live webpages, our results may not be current. Our analysis found data for 1891 jobs. From the data, we can see that 

In [None]:
print("{} jobs analyzed.\n".format(master_dataframe.shape[0]))
print("Keywords\tFrequency\n", master_dataframe.describe().loc['mean'])