<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">


# Web Scraping for Indeed.com and Predicting Salaries

### Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal wants you to

   - determine the industry factors that are most important in predicting the salary amounts for these data.

To limit the scope, your principal has suggested that you *focus on data-related job postings*, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by *limiting your search to a single region.*

Hint: Aggregators like [Indeed.com](https://www.indeed.com) regularly pool job postings from a variety of markets and industries.

**Goal:** Scrape your own data from a job aggregation tool like Indeed.com in order to collect the data to best answer this question.

---

### Directions

In this project you will be leveraging a variety of skills. The first will be to use the web-scraping and/or API techniques you've learned to collect data on data jobs from Indeed.com or another aggregator. Once you have collected and cleaned the data, you will use it to address the question above.

### Factors that impact salary

To predict salary, the most natural approach would be a regression model.
Here, however, we only want to estimate which factors (like location, job title, job level, industry sector) lead to high or low salary, so we work with a classification model instead. To do so, split the salary into two groups of high and low salary, for example by choosing the median salary as a threshold (in principle you could choose any single or multiple splitting points).
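
As a rough illustration of the median split described above, the sketch below labels a handful of made-up salary figures as high (1) or low (0). The numbers and variable names are placeholders, not scraped data.

```python
from statistics import median

# Made-up annual salaries, for illustration only
salaries = [52000, 64000, 85000, 117500, 140000]

# Use the median as the single splitting point
threshold = median(salaries)

# 1 = high salary (strictly above the median), 0 = low salary
labels = [int(s > threshold) for s in salaries]
print(threshold, labels)
```

With an odd number of observations the median itself falls into the low group here; whether to use `>` or `>=` is a choice worth stating when you describe your model.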

Use all the skills you have learned so far to build a predictive model.
Whatever you decide to use, the most important thing is to justify your choices and interpret your results. *Communication of your process is key.* Note that most listings **DO NOT** come with salary information. You'll need to be able to extrapolate or predict the expected salaries for these listings.

### Scraping job listings from Indeed.com

We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.

First, look at the source of an Indeed.com page: http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10

Notice, each job listing is underneath a `div` tag with a class name of `result`. We can use BeautifulSoup to extract those. 

#### Set up a request (using `requests`) to the URL below. Use BeautifulSoup to parse the page and extract all results (HINT: Look for div tags with class name result)

The URL here has many query parameters:

- `q` for the job search
- This is followed by "+$20,000" (URL-encoded as `+%2420%2C000`) to return results with salaries (or expected salaries) above $20,000
- `l` for a location 
- `start` for what result number to start on
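
To see what these parameters decode to, the standard library can split the query string apart. This is only a quick inspection sketch and makes no request to the site.

```python
from urllib.parse import urlsplit, parse_qs

URL = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

# parse_qs decodes '+' to a space and percent-escapes such as %24 ('$') and %2C (',')
params = parse_qs(urlsplit(URL).query)
print(params)
```

Each value comes back as a list because a query string may repeat a key.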

In [1]:
# pip install requests-ip-rotator

In [1]:
URL = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

In [2]:
engine = "https://www.indeed.com"

In [3]:
#add ApiGateway to avoid CAPTCHA
from requests_ip_rotator import ApiGateway

In [4]:
import os  # read AWS credentials from the environment; never hard-code secrets in a notebook
gateway = ApiGateway(engine,
                     access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
                     access_key_secret=os.environ['AWS_SECRET_ACCESS_KEY'])

In [1]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

In [3]:
import requests
import bs4
from bs4 import BeautifulSoup
from tqdm import tqdm


In [7]:
##testing below code to avoid captcha
gateway.start()
# Assign gateway to session
session = requests.Session()
session.mount(engine, gateway)
header={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) \
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}

response = session.get(URL, headers=header)
soup = BeautifulSoup(response.text,'html.parser')
print(soup.title.text)
# Delete gateways

gateway.shutdown()

Starting API gateways in 10 regions.
Using 10 endpoints with name 'https://www.indeed.com - IP Rotate API' (0 new).
Data Scientist $20,000 Jobs, Employment in New York State | Indeed.com
Deleting gateways for site 'https://www.indeed.com'.
Deleted 10 endpoints with for site 'https://www.indeed.com'.


In [4]:
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'html.parser')

In [5]:
soup.title.text

'Data Scientist $20,000 Jobs, Employment in New York State | Indeed.com'

Let's look at one result more closely. A single `result` looks like

```
<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=2480d203f7e97210&amp;jcid=b374f2a780e04789" target="_blank">
    AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div>
<span class="summary" itemprop="description">
Conduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</div>
</div>
</td>
</tr>
</table>
</div>
```

While this has some more verbose elements removed, we can see that there is some structure to the above:

- The salary is in a `span` with `class='salaryText'` (in the trimmed sample above it appears in a `nobr` tag).
- The title of a job is in a link with `data-tn-element='jobTitle'`, inside an `h2` with class `jobtitle`.
- The location is set in a `span` with `class='location'` (not shown in the trimmed sample).
- The company is set in a `span` with `class='company'`.
- Decide which other components could be relevant, for example the region or the summary of the job advert.
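
As a minimal sketch of how such a structure can be navigated, the snippet below parses a cut-down copy of the sample result and pulls out the title, company and salary. The HTML string is a simplified stand-in for the sample; the live page markup differs and changes over time.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the sample result shown above
sample_html = """
<div class=" row result">
  <h2 class="jobtitle"><a data-tn-element="jobTitle">AVP/Quantitative Analyst</a></h2>
  <span class="company"><span itemprop="name"><a> AllianceBernstein</a></span></span>
  <nobr>$117,500 - $127,500 a year</nobr>
</div>
"""

sample = BeautifulSoup(sample_html, "html.parser")

# class_="result" matches any div whose class list contains "result"
card = sample.find("div", class_="result")

title = card.find("a", attrs={"data-tn-element": "jobTitle"}).get_text(strip=True)
company = card.find("span", class_="company").get_text(strip=True)
salary = card.find("nobr").get_text(strip=True)
print(title, company, salary, sep=" | ")
```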

In [6]:
soup.find_all('td', class_='resultContent')[0]

<td class="resultContent"><div class="heading4 color-text-primary singleLineTitle tapItem-gutter"><h2 class="jobTitle jobTitle-color-purple"><span title="Senior Data Scientist - Nationwide Opportunities">Senior Data Scientist - Nationwide Opportunities</span></h2></div><div class="heading6 company_location tapItem-gutter"><pre><span class="companyName">Amazon Web Services, Inc.</span><span class="ratingsDisplay"><span aria-label="3.5 of stars rating" class="ratingNumber" role="img"><span aria-hidden="true">3.5</span><svg aria-hidden="true" class="starIcon" fill="none" height="12" role="presentation" viewbox="0 0 16 16" width="12" xmlns="http://www.w3.org/2000/svg"><path d="M8 12.8709L12.4542 15.5593C12.7807 15.7563 13.1835 15.4636 13.0968 15.0922L11.9148 10.0254L15.8505 6.61581C16.1388 6.36608 15.9847 5.89257 15.6047 5.86033L10.423 5.42072L8.39696 0.640342C8.24839 0.289808 7.7516 0.289808 7.60303 0.640341L5.57696 5.42072L0.395297 5.86033C0.015274 5.89257 -0.13882 6.36608 0.149443 6.615

In [7]:
job = [x.text for x in soup.find_all('td', class_='resultContent')]

In [8]:
job

['Senior Data Scientist - Nationwide OpportunitiesAmazon Web Services, Inc.3.5New York, NY$116,200 a year',
 'Data Scientist (Remote) - USAlphaSights4.0New York State+1 location•Remote',
 'newResearch Scientist, Economics (PhD - Core Data Science)Meta4.1New York, NY+22 locations',
 'newMarketing Data Scientist (Remote)CelsiusNew York, NY 10001 (Chelsea area)•Remote',
 '2022 Associate Data ScientistT. Rowe Price3.6New York, NY',
 'Data Scientist - Dalio Center for Health JusticeNewYork-Presbyterian Hospital4.2Manhattan, NY 10065 (Upper East Side area)',
 'Sr Data ScientistViacomCBS3.9New York, NY 10036 (Midtown area)',
 'Data Scientist - ProServeAmazon Web Services, Inc.3.5New York, NY$114,700 a year',
 'Data Scientist - Delta One StratsMorgan Stanley3.8New York, NY+2 locations',
 'Associate Scientist - Clinical Data MonitorPfizer4.2Pearl River, NY 10965',
 'Data ScientistDojoMojoNew York, NY 10012 (SoHo area)•Remote$83,690 - $160,000 a year',
 'newSenior Risk Data Analyst, Trust and Sa

### Write 4 functions to extract each item: location, company, job, and salary.

Example: 
```python
def extract_location_from_result(result):
    return result.find ...
```


- **Make sure these functions are robust and can handle cases where the data/field may not be available.**
    - Remember to check if a field is empty or `None` before attempting to call methods on it.
    - Remember to use `try/except` if you anticipate errors.
- **Test** the functions on the results above and simple examples.
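
One possible way to handle missing fields is a small helper that checks for `None` before reading the text. The helper name `text_or_none` and the HTML fragment are invented for this sketch; the class names in the fragment are placeholders rather than Indeed's actual markup.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, *args, **kwargs):
    """Return the stripped text of the first match under `parent`, or None if absent."""
    if parent is None:
        return None
    node = parent.find(*args, **kwargs)
    return node.get_text(strip=True) if node is not None else None

# Made-up fragment: the second card has no salary element
html = """
<td class="resultContent"><span class="companyName">Acme</span><div class="salary">$50,000 a year</div></td>
<td class="resultContent"><span class="companyName">Globex</span></td>
"""

cards = BeautifulSoup(html, "html.parser").find_all("td", class_="resultContent")
companies = [text_or_none(card, class_="companyName") for card in cards]
salaries = [text_or_none(card, "div", class_="salary") for card in cards]
print(companies, salaries)
```

Returning `None` (or `np.nan`) for a missing field keeps every list the same length, which matters when the lists later become DataFrame columns.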

In [10]:
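def extract_location(result):
    location = []
    for loc in result.find_all('td', class_='resultContent'):
        try:
            location.append(loc.find(class_='companyLocation').text)
        except AttributeError:  # find() returned None: no location element in this card
            location.append(np.nan)
    return location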
        

In [11]:
extract_location(soup)

[]

In [12]:
def extract_company(result):
    company = []
    for com in result.find_all('td', class_='resultContent'):
        try:
            company.append(com.find(class_='companyName').text)
        except AttributeError:  # find() returned None: no company element in this card
            company.append(np.nan)
    return company

In [13]:
extract_company(soup)

[]

In [28]:
 
def extract_salary(result):
    salary = []
    for sal in result.find_all('td', class_='resultContent'):
        try:
            salary.append(sal.find('div', class_='metadata salary-snippet-container').text)
        except AttributeError:  # most listings carry no salary element
            salary.append(np.nan)
    return salary

In [29]:
extract_salary(soup)

[nan,
 nan,
 nan,
 '$60,000 - $70,000 a year',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan]

In [14]:
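def extract_job(result):
    job = []
    for j in result.find_all('td', class_='resultContent'):
        try:
            job.append(j.find('h2', class_='jobTitle jobTitle-color-purple').text)
        except AttributeError:
            # Fall back to the heading wrapper (its text may carry a "new" badge prefix)
            try:
                job.append(j.find(class_='heading4 color-text-primary singleLineTitle tapItem-gutter').text)
            except AttributeError:
                job.append(np.nan)
    return job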

In [15]:
extract_job(soup)

['Behavioral Data Scientist',
 'Sr. Data Scientist',
 'Product Analytics - Data Scientist',
 'Data Scientist',
 'Healthcare Data Scientist',
 'Associate Data Scientist',
 'newData Analyst thru Data Analyst Senior',
 'Data Scientist',
 'Data Scientist',
 'Sr. Data Analyst',
 'Sr. Business Data Analyst',
 'Senior-Data Scientist',
 'newStaff (Lead) Data Scientist (Remote, United States)',
 'Data Scientist',
 'newPOC Associate Data Scientist']

Now, to scale up our scraping, we need to accumulate more results. We can do this by examining the URL above.

- "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

There are two query parameters here we can alter to collect more results, the `l=New+York` and the `start=10`. The first controls the location of the results (so we can try a different city). The second controls where in the results to start and gives 10 results (thus, we can keep incrementing by 10 to go further in the list).
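
The idea of varying `l` and `start` can be sketched as a small URL generator. The function name `search_urls` and the example cities are placeholders; nothing is requested from the site here.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.indeed.com/jobs"

def search_urls(query, cities, max_results, step=10):
    """Yield one search URL per city and per page offset (0, 10, 20, ...)."""
    for city in cities:
        for start in range(0, max_results, step):
            params = {"q": query, "l": city, "start": start}
            yield BASE_URL + "?" + urlencode(params)

urls = list(search_urls("data scientist $20,000", ["New York", "Chicago"], max_results=30))
for url in urls:
    print(url)
```

Letting `urlencode` handle the escaping means city names and queries can be written with plain spaces instead of hand-typed `+` signs.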

### Complete the following code to collect results from multiple cities and starting points. 
- Enter your city below to add it to the search.
- Remember to convert your salary to U.S. Dollars to match the other cities if the currency is different.

In [9]:
#to help slow down scrape and avoid captcha
import time 
import random

In [22]:
max_results_per_city = 2000 
roles = ['data+scientist', 'data+analyst']

results_dict = {'location':[], 'title':[], 'company':[], 'salary':[]}
gateway.start()

session = requests.Session()
session.mount(engine, gateway)
header={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) \
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}


for city in set(['New+York', 'Chicago','Dallas', 'Philadelphia', 'Portland', 'Houston', 'Miami',
                'Los+Angeles', 'Las+Vegas', 'San+Francisco', 'Washington+DC', 'Seattle',
                'Austin', 'Phoenix', 'San+Jose', 'Denver' ]):
    
    for start in tqdm(range(1500, max_results_per_city, 10)):
        
        for role in roles:
        
            url = f"http://www.indeed.com/jobs?q={role}+%2420%2C000&l={city}&start={start}"
            response = session.get(url, headers=header)
            soup = BeautifulSoup(response.text,'html.parser')
            for job in soup.find_all('td', class_='resultContent'):
                try:
                    results_dict['location'].append(job.find(class_='companyLocation').text)
                except AttributeError:
                    results_dict['location'].append('None')
                try:
                    results_dict['title'].append(job.find('h2', class_='jobTitle jobTitle-color-purple').text)
                except AttributeError:
                    # Fallback heading; its text may start with a "new" badge. Remove only that
                    # exact prefix (str.lstrip('new') strips any leading 'n'/'e'/'w' characters).
                    heading = job.find(class_='heading4 color-text-primary singleLineTitle tapItem-gutter')
                    title = heading.text if heading is not None else 'None'
                    results_dict['title'].append(title[3:] if title.startswith('new') else title)
                try:
                    results_dict['company'].append(job.find(class_='companyName').text)
                except AttributeError:
                    results_dict['company'].append('None')
                try:
                    results_dict['salary'].append(job.find('div', class_='metadata salary-snippet-container').text)
                except AttributeError:
                    results_dict['salary'].append('None')
            # slow down the scrape
            time.sleep(random.randint(7,20))
results = pd.DataFrame(results_dict)
gateway.shutdown()       


Starting API gateways in 10 regions.
Using 10 endpoints with name 'https://www.indeed.com - IP Rotate API' (10 new).


100%|███████████████████████████████████████████| 50/50 [23:19<00:00, 28.00s/it]


Deleting gateways for site 'https://www.indeed.com'.
Deleted 10 endpoints with for site 'https://www.indeed.com'.


- Completed webscrape in multiple small batches  

In [23]:
#see example of one of the scrapes
results.shape


(1140, 4)

In [24]:
results

Unnamed: 0,location,title,company,salary
0,"Rochester, NY 14624",Analytic and Data Science Engineer,Kodak Alaris,
1,"New York, NY",Data Platform Cloud Solutions Architect,Avanade,
2,"New York, NY 10001 (Chelsea area)",Sr. Big Data Engineer Consultant,IBM,
3,"New York, NY 10001 (Chelsea area)",Senior Machine Learning Software Engineer,AppFolio,
4,"New York, NY",Managing Business Systems Analyst - MarTech an...,Capgemini,
...,...,...,...,...
1135,"Brooklyn, NY",Interface Analyst,HCTec,
1136,"Brooklyn, NY 11237 (Bushwick area)",Revenue Cycle Reimbursement Analyst,Wyckoff Heights Medical Center,
1137,"Endicott, NY",Business Analyst,Entegee,$25 - $30 an hour
1138,"New York, NY 10017 (Midtown area)",Application Project Manager / Business Analyst,Robert Half,"$100,000 - $120,000 a year"


In [25]:
results.location.value_counts()

New York, NY                                    236
United States                                   130
New York State•Remote                           106
United States•Remote                             73
New York, NY 10001 (Chelsea area)                27
                                               ... 
Queens, NY+3 locations•Remote work available      1
Brooklyn, NY 11201+3 locations•Remote             1
New York, NY 10277 (Battery Park area)            1
New York, NY 10010 (Flatiron area)•Remote         1
Philadelphia, PA                                  1
Name: location, Length: 130, dtype: int64
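
The location strings above mix the place name with ZIP codes, neighbourhood notes, "+N locations" counters and remote flags. A rough cleaning sketch, assuming the formats seen in this output (the function name `clean_location` is a placeholder, and other formats may need extra rules):

```python
import re

def clean_location(raw):
    """Split a scraped location string into (place, is_remote)."""
    is_remote = "remote" in raw.lower()
    place = raw.split("•")[0]                        # drop "•Remote ..." suffixes
    place = re.sub(r"\+\d+ locations?", "", place)   # drop "+3 locations"
    place = re.sub(r"\s*\(.*?\)", "", place)         # drop "(Chelsea area)"
    place = re.sub(r"\s+\d{5}\b", "", place)         # drop 5-digit ZIP codes
    return place.strip(), is_remote

examples = [
    "New York, NY 10001 (Chelsea area)",
    "Brooklyn, NY 11201+3 locations•Remote",
    "Queens, NY+3 locations•Remote work available",
    "New York State•Remote",
]
cleaned = [clean_location(e) for e in examples]
print(cleaned)
```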

In [26]:
results.salary.value_counts()

None                          919
$25.27 - $33.68 an hour        22
$85,000 - $185,000 a year      20
$40 - $65 an hour              20
$27 an hour                    14
$75,504 - $93,776 a year       10
$45 - $53 an hour               8
Up to $110,000 a year           8
$25 - $30 an hour               8
$50,000 a year                  5
$80,000 - $110,000 a year       5
$150,000 - $180,000 a year      5
$66.50 - $77.00 an hour         5
$100,000 - $175,000 a year      4
From $30 an hour                4
$50,000 - $60,000 a year        4
$62.06 an hour                  4
From $40,000 a year             4
$50,000 - $52,000 a year        3
$130,000 - $150,000 a year      3
$120,000 - $125,000 a year      3
$120,000 - $130,000 a year      3
$50 - $60 an hour               3
$80 - $86 an hour               3
From $60,000 a year             3
$45,282 - $85,000 a year        3
$52,000 a year                  3
$100,000 a year                 2
$20 - $25 an hour               2
From $67,000 a
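
The salary strings come as ranges, single figures, "Up to"/"From" bounds, and hourly or yearly rates, so they need to be put on a common scale before modelling. The sketch below is one possible approach, under these assumptions: ranges are reduced to their midpoint, "Up to" and "From" values are taken at face value, and hourly rates are annualised at 2,080 hours per year (40 hours × 52 weeks). The function name `parse_salary` is a placeholder, and other pay periods (day, week, month) are left unhandled.

```python
import re

HOURS_PER_YEAR = 2080  # assumption: 40 hours/week * 52 weeks

def parse_salary(text):
    """Return an approximate annual salary in USD, or None if it cannot be parsed."""
    if not text or text == "None":
        return None
    # Collect every dollar amount, e.g. "$100,000" or "$25.27"
    amounts = [float(a.replace(",", "")) for a in re.findall(r"\$([\d,]+(?:\.\d+)?)", text)]
    if not amounts:
        return None
    midpoint = sum(amounts) / len(amounts)
    if "hour" in text:
        return midpoint * HOURS_PER_YEAR
    if "year" in text:
        return midpoint
    return None  # unhandled pay period

print(parse_salary("$100,000 - $120,000 a year"))
print(parse_salary("$25 - $30 an hour"))
print(parse_salary("Up to $110,000 a year"))
```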

In [27]:
#save scrape as csv 
results.to_csv('web scrapes/results_v20.csv')