# Indeed Scraper

For this project, I wanted to explore data science-related jobs posted in a variety of cities on indeed.com, a job aggregator that updates multiple times daily. I scraped the site using the “requests” and “BeautifulSoup” libraries in Python to gather and parse information from Indeed’s pages, then used the “pandas” library to assemble the data into a dataframe for further cleaning and analysis.

### Examining the URL and Page structure
First, let’s look at a sample page from [indeed](https://www.indeed.com/jobs?q=data+scientist+$20,000&l=New+York&start=10).

Notice a few things about the way the URL is structured:

* “q=” begins the string for the “what” field on the page, with search terms separated by “+” (e.g., searching for “data+scientist” jobs)

* when a salary is specified, the dollar sign and the commas in the figure are percent-encoded: “$” becomes %24 and each “,” becomes %2C (e.g., %2420%2C000 = $20,000)

* “&l=” begins the string for the city of interest, with words separated by “+” if the city name is more than one word (e.g., “New+York”)

* “&start=” sets the search result where you want to begin (e.g., “start=10” begins after the first 10 results)

The URL structure will come in handy as we build a scraper to look at and gather data from a series of pages. Keep this in mind for later.
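As a minimal sketch of the encoding rules above, the standard library's `urlencode` can assemble the same query string. The helper name `build_search_url` and the constant `BASE_URL` are illustrative assumptions, not part of the original scraper.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"  # taken from the sample URL above


def build_search_url(what, where, start=0):
    """Assemble a search URL from the three parameters described above."""
    # urlencode quotes each value: spaces become "+",
    # "$" becomes "%24" and "," becomes "%2C".
    query = urlencode({"q": what, "l": where, "start": start})
    return f"{BASE_URL}?{query}"


print(build_search_url("data scientist $20,000", "New York", start=10))
```

The printed URL should match the sample URL's encoded form, with `%2420%2C000` standing in for `$20,000`.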

![title](./images/pic1.png)


Each page of job results will have 15 job posts. Five of these are “sponsored” jobs, which are specially displayed by indeed outside of the normal order of results. The remaining 10 results are specific to the page being viewed.

All of the information on this page is coded with HTML tags. HTML (HyperText Markup Language) is the markup that tells your internet browser how to display a given page’s contents, including its basic structure and order. HTML tags also have attributes, which are a helpful way of keeping track of what information can be found where within the structure of the page.

Chrome users can examine the HTML structure of a page by right-clicking on a page and choosing “Inspect” from the menu that appears. A menu will appear on the right-hand side of your page, with a long list of nested HTML tags housing the information currently displayed in your browser window. In the upper-left of this menu, there’s a small box with an arrow icon in it. Once clicked, the box will illuminate in blue (notice in the screenshot below). This will allow you to cursor over the elements in the page to display both the tag associated with that item, and to bring your inspection window directly to that item’s place in the HTML for the page.

![title](./images/pic2.png)

In the screenshot above, I’ve hovered over one of the job postings to show how the entire job’s contents are held within a &lt;div&gt; tag, with attributes including “class = ‘row result’”, “id=’pj_7a21e2c11afb0428'”, etc. Luckily, we will not need to know every attribute of every tag to extract our information, but it is helpful to know how to read the basic structure of a page’s HTML.

Now, let’s turn to Python to extract the HTML from the page and start building our scraper.

### Building the Scraper Components
Now that we’ve looked at the basic structure of the page and know a little about its HTML, we can start writing code to pull out the information we’re interested in. We’ll import our libraries first. Note that I’m also importing “time”, which is a helpful way of staggering page requests so as not to overwhelm a site’s servers when scraping information.
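To illustrate the idea of staggering requests, here is a small sketch that pauses between consecutive fetches. The helper `fetch_politely` and its parameters are hypothetical names introduced for this example. The fetch function is passed in, so the pattern can be tried without touching the network.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        pages.append(fetch(url))
    return pages


# Offline demonstration: str.upper stands in for a real fetch function.
print(fetch_politely(["page-a", "page-b"], fetch=str.upper, delay_seconds=0.1))
```

With the libraries imported below, one would pass `requests.get` as the `fetch` argument.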

In [54]:
# Import packages 
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time

import warnings
warnings.filterwarnings("ignore")

In [55]:
URL ='https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10'

#conducting a request of the stated URL above:
page = requests.get(URL)

#specifying a desired format of “page” using the html parser - this allows python 
#to read the various components of the page, rather than treating it as one long string.
soup = BeautifulSoup(page.text, 'html.parser')

#printing soup in a more structured tree format that makes for easier reading
print(soup.prettify())


<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <script src="/s/0247a00/en_US.js" type="text/javascript">
  </script>
  <link href="/s/970d98c/jobsearch_all.css" rel="stylesheet" type="text/css"/>
  <link href="http://rss.indeed.com/rss?q=data+scientist+%2420%2C000&amp;l=New+York" rel="alternate" title="Data Scientist $20,000 Jobs, Employment in New York State" type="application/rss+xml"/>
  <link href="/m/jobs?q=data+scientist+%2420%2C000&amp;l=New+York" media="only screen and (max-width: 640px)" rel="alternate"/>
  <link href="/m/jobs?q=data+scientist+%2420%2C000&amp;l=New+York" media="handheld" rel="alternate"/>
  <script type="text/javascript">
   if (typeof window['closureReadyCallbacks'] == 'undefined') {
        window['closureReadyCallbacks'] = [];
    }

    function call_when_jsall_loaded(cb) {
        if (window['closureReady']) {
            cb();
        } else {
            window['closureReadyCall

Now, we know that our variable “soup” has all of the information housed in our page of interest. It is now a matter of writing code to iterate through the various tags (and nested tags therein) to capture the information we want.

While this is not the appropriate place to go over all of the ways in which information can be found or withdrawn from a page’s HTML, the BeautifulSoup documentation has a lot of helpful information that can guide one’s searching.

### Withdrawing Basic Elements of Data
Approaching this task, I wanted to find and extract five key pieces of information from each job posting: Job Title, Company Name, Location, Salary, and Job Summary.

I know from looking at the page, that there must be 15 job postings therein. As such, I know that each function I write to withdraw a piece of information should yield 15 different items. If my output provides fewer than this, I can refer back to the page itself to see what information I’m not capturing.
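One way to make this check routine is a small helper that reports which extracted lists fall short of the expected count. This is only a sketch. The name `check_lengths` and the sample lists are made up for illustration.

```python
def check_lengths(expected, **extracted):
    """Return {name: length} for every list whose length differs from expected."""
    return {
        name: len(items)
        for name, items in extracted.items()
        if len(items) != expected
    }


# Made-up lists standing in for the output of two extraction functions.
mismatches = check_lengths(
    3,
    titles=["Data Scientist", "Data Engineer", "Analyst"],
    salaries=["$17 - $20 an hour"],
)
print(mismatches)
```

An empty dictionary means every list has the expected number of items. Anything else points to the field that needs another look.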

### Job Title
As noted above, I could tell that the entirety of each job posting is nested under &lt;div&gt; tags, with an attribute “class” = “row result”.

From there, I could see that job titles are listed under &lt;a&gt; tags, with attribute “title = (title)”. One can find the value of a tag’s attribute with tag['attribute'], so I could use this to find the job title for each posting.
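The attribute lookup can be tried on a small hand-written snippet first. The HTML below is an assumed stand-in for one posting, not markup copied from a live page, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a single posting, not copied from a live page.
snippet = (
    '<div class="row result" id="pj_7a21e2c11afb0428">'
    '<a data-tn-element="jobTitle" title="Data Scientist">Data Scientist</a>'
    '</div>'
)
demo = BeautifulSoup(snippet, "html.parser")

# Searching for class "row" also matches class="row result",
# because BeautifulSoup compares against each class name separately.
div = demo.find(name="div", attrs={"class": "row"})
print(div["id"])      # single-valued attribute: a string
print(div["class"])   # multi-valued attribute: a list of class names

link = div.find(name="a", attrs={"data-tn-element": "jobTitle"})
print(link["title"])
```

This is why a search on “row” alone is enough to find &lt;div&gt; tags whose class is “row result”.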

My function for withdrawing job title information involved three steps:

* pulling out all &lt;div&gt; tags with class including “row”
    
* identifying &lt;a&gt; tags with attribute “data-tn-element”:”jobTitle”
    
* for each of these &lt;a&gt; tags, find the value of attributes “title”

In [56]:
def extract_job_title_from_result(soup): 
    jobs = []
    # look inside each job posting's <div> (class includes "row")
    for div in soup.find_all(name='div', attrs={'class':'row'}):
        # the job title is stored in the "title" attribute of the jobTitle link
        for a in div.find_all(name='a', attrs={'data-tn-element':'jobTitle'}):
            jobs.append(a['title'])
    return(jobs)
extract_job_title_from_result(soup)

['Data Scientist',
 'Data Scientist',
 'Research Data Associate',
 'Research Scientist I - Chemical Development',
 'Machine Learning Developer',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist, Intern (2019)',
 'Data Scientist',
 'Data Warehouse Intern',
 'Data Engineer, Baseball Operations',
 'Quantitative Analysis, Full Time Analyst (North America - 2019)',
 'Data Scientist',
 'Adjunct Associate Faculty role, Machine Learning, (Online/REMOTE)',
 'DATA SCIENCE -INTERN',
 'ANALYST | DATA ANALYSIS',
 'Data Scientist',
 'Associate Data Scientist',
 'Part-Time Data Collection and Research Analyst']

### Company Name
Company names were a bit tricky, as most appear in &lt;span&gt; tags with “class”:”company”. Rarely, however, they are housed in &lt;span&gt; tags with “class”:”result-link-source”.

I developed if/else statements to extract the company info from either of these places. Company names are output with a lot of white space around them, so adding .strip() at the end helps to remove this when extracting the information.


In [59]:
def extract_company_from_result(soup): 
    companies = []
    for div in soup.find_all(name='div', attrs={'class':'row'}):
        company = div.find_all(name='span', attrs={'class':'company'})
        if len(company) > 0:
            for b in company:
                companies.append(b.text.strip())
        else:
            sec_try = div.find_all(name='span', attrs={'class':'result-link-source'})
            for span in sec_try:
                companies.append(span.text.strip())
    return(companies)
 
extract_company_from_result(soup)

['Fora Financial LLC',
 'Biz2Credit Inc.',
 'NYU Langone Health',
 'Albany Molecular Research',
 'Civicom, Inc.',
 '05 Ascensia Diabetes Care US Inc.',
 'National Debt Relief',
 '1010data',
 'Rent the Runway',
 'Newsela',
 'New York Yankees',
 'Citi',
 'Intent Media',
 'Columbia University',
 'Foot Locker, Inc.',
 'New York City OFFICE OF MANAGEMENT & BUDGET',
 'Fora Financial LLC',
 'Church Pension Group',
 'Tarifica']

### Location
Locations are found under the &lt;span&gt; tags. Span tags are sometimes nested within each other, such that the location text may sometimes be within “class” : “location” attributes, or nested in “itemprop” : “addressLocality”. However, a simple for loop can examine all span tags for text wherever it may be and retrieve the necessary information.

In [60]:
def extract_location_from_result(soup): 
    locations = []
    spans = soup.findAll('span', attrs={'class': 'location'})
    for span in spans:
        locations.append(span.text)
    return(locations)
extract_location_from_result(soup)


['New York, NY',
 'New York, NY',
 'New York, NY',
 'Bronx, NY',
 'New York, NY 10261 (Murray Hill area)',
 'New York, NY',
 'New York, NY 10027 (Hamilton Heights area)',
 'New York, NY',
 'Manhattan, NY',
 'New York, NY']

### Salary
Salary was the most difficult data to extract from job postings. Most postings don’t contain any salary information at all. Among those that do, it can be in one of two different places. So, we need to write a function that can look in multiple places for information, and we need to create a placeholder “Nothing Found” value for any jobs that don’t contain salary. We want this placeholder to ensure that all of the information from any given job post can line up with all of the other pieces of relevant data when we later assemble our data into a single data frame.


In [61]:
def extract_salary_from_result(soup): 
    salaries = []
    spans2 = soup.findAll('span', attrs={'class': 'no-wrap'})
    if len(spans2) > 0:
        for s in spans2:
            salaries.append(s.text.strip())
    #else:
        #salaries.append('Nothing_found')
    return(salaries)       
extract_salary_from_result(soup)

['relevance -\n            date',
 '$50,000 - $65,000 a year',
 '$58,162 - $65,433 a year',
 '$17 - $20 an hour']
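The output above has fewer entries than there are postings, and its first entry is not a salary at all, because the search runs over the whole page at once. A per-posting variant with the placeholder described earlier might look like the sketch below. It assumes salaries sit in the same “no-wrap” span used above and does not cover any second location. The function name and the sample HTML are illustrative.

```python
from bs4 import BeautifulSoup


def extract_salary_per_posting(soup, placeholder="Nothing_found"):
    """Return one salary string (or the placeholder) for every posting."""
    salaries = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        # find() returns None when the posting has no matching span
        span = div.find(name="span", attrs={"class": "no-wrap"})
        salaries.append(span.text.strip() if span is not None else placeholder)
    return salaries


# Hand-written stand-in for two postings, only one of which lists a salary.
sample = BeautifulSoup(
    '<div class="row result"><span class="no-wrap"> $17 - $20 an hour </span></div>'
    '<div class="row result"><span class="company">Citi</span></div>',
    "html.parser",
)
print(extract_salary_per_posting(sample))
```

Because each posting contributes exactly one entry, the resulting list lines up with the other fields when everything is assembled into a dataframe.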

### Job Summary
Finally, the job summaries. Unfortunately, the full job summaries are not included in the HTML of a given Indeed results page. However, we can get some information about each job from what’s provided. Selenium is a suite of tools that could allow a web scraper to click through the links on a page and pull this information from the full job postings, but I did not use Selenium for this particular effort.

Summaries are located under &lt;span&gt; tags with “class” : “summary”. A simple for loop can examine each of these span tags and retrieve its text.

In [62]:
def extract_summary_from_result(soup): 
    summaries = []
    spans = soup.findAll('span', attrs={'class': 'summary'})
    for span in spans:
        summaries.append(span.text.strip())
    return(summaries)
extract_summary_from_result(soup)

['Typical functions are loss forecasting, mix management, risk appetite management, collections strategy, and reporting related to KPIs that affect portfolio...',
 'Work on data projects and proposals involving Biz2Credit’s financial services partners worldwide (banks, non-banks, debt investors, equity investors and...',
 'Initiates and continues regular contact with patients; encourages visit reminders and compliance to research; ensures contact with patients and their...',
 'Optimize the reaction processes for scale-up by making appropriate modifications of known methods or modification of reaction conditions under the...',
 'We will be looking for a results/goal orientation, initiative, energy, integrity, common sense, the ability to adapt to changing roles and multitask, a...',
 'Possess ability to present, interpret, and recommend to senior management the results of work including the development including the development of new...',
 '- Fully share in the responsibility to optimi

### Putting all the Pieces Together
We’ve got all of the various pieces of our scraper. Now, we need to assemble them into the final scraper that will withdraw the appropriate information for each job post, keep it separate from all other job posts, and add the job posts to a single dataframe one at a time.

We can set up the initial conditions for each scrape by specifying a few pieces of information:

* We can detail how many results we want to scrape from each city of interest
* We can assemble a list of all of the cities for which we want to scrape job postings
* We can create an empty dataframe to house the scraped data for each posting. We can specify in advance the names of our columns for where we expect each piece of information to be located.

It goes without saying that the more results you want, and the more cities you look at, the longer the scraping process will take. This isn’t a huge issue if you start your scraper before you go out or go to sleep, but it’s still something to consider.
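A rough lower bound on the run time can be worked out before starting. The sketch below assumes one request per 10 results and a one-second pause between requests, and it ignores network and parsing time. The function name and the example figures are illustrative.

```python
def minimum_scrape_seconds(n_cities, max_results_per_city,
                           results_per_page=10, delay_seconds=1.0):
    """Lower bound on run time: one pause per requested page."""
    # ceiling division: a partly filled last page still costs one request
    pages_per_city = -(-max_results_per_city // results_per_page)
    return n_cities * pages_per_city * delay_seconds


seconds = minimum_scrape_seconds(n_cities=10, max_results_per_city=100)
print(f"{seconds / 60:.1f} minutes at minimum")
```

Ten cities at 100 results each already means 100 requests, so the pauses alone add up to well over a minute.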

Assembling the actual scraper relates back to the patterns we noticed in the URL structure above. Because we know how the URLs will be patterned for each page, we can exploit this when building a loop to visit each page in a specific order to extract data.
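The page-by-page pattern can be sketched as a generator that yields one URL per query, city, and result offset. The name `page_urls` and the constant `URL_TEMPLATE` are assumptions made for this example. The template follows the URL structure examined earlier.

```python
from itertools import product

URL_TEMPLATE = "https://www.indeed.com/jobs?q={}&l={}&start={}"


def page_urls(queries, cities, max_results, step=10):
    """Yield one URL per (query, city, start offset) combination."""
    offsets = range(0, max_results, step)
    for query, city, start in product(queries, cities, offsets):
        yield URL_TEMPLATE.format(query, city, start)


for url in page_urls(["data+scientist"], ["New+York", "Boston"], max_results=20):
    print(url)
```

Each yielded URL can then be requested and parsed in turn.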

In [72]:
def parse(url):
    # time.clock() was removed in Python 3.8; process_time() reports CPU time,
    # so the one-second sleep below is not counted
    start_time = time.process_time()
    html = requests.get(url)
    time.sleep(1)
    soup = BeautifulSoup(html.content, 'html.parser', from_encoding="utf-8")
    rows = []
    for each in soup.find_all(class_= "result" ):
        # find() returns None when a field is missing, so .text raises AttributeError
        try: 
            title = each.find(class_='jobtitle').text.replace('\n', '')
        except AttributeError:
            title = ''
        try:
            location = each.find('span', {'class':"location" }).text.replace('\n', '')
        except AttributeError:
            location = ''
        try: 
            company = each.find(class_='company').text.replace('\n', '')
        except AttributeError:
            company = ''
        try:
            salary = each.find('span', {'class':'no-wrap'}).text.strip()
        except AttributeError:
            salary = ''
        try:
            summary = each.find('span', {'class':'summary'}).text.replace('\n', '')
        except AttributeError:
            summary = ''
        rows.append({'Title':title, 'Location':location, 'Company':company, 'Salary':salary,'Summary':summary})
    # DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
    df = pd.DataFrame(rows, columns=["Title","Location","Company","Salary", "Summary"])
    print('Execution time is: %s seconds'%(time.process_time() - start_time))
    return df
    
    

In [73]:
URL = "http://www.indeed.com/jobs?q=data+scientist&l=New+York&start=10"
parse(URL)


Execution time is: 0.28010600000015984 seconds


| | Title | Location | Company | Salary | Summary |
|---|---|---|---|---|---|
| 0 | Data Scientist | | Biz2Credit Inc. | $50,000 - $65,000 a year | As a Biz2Credit Da... |
| 1 | Part-Time Data Collection and Research Analyst | | Tarifica | $17 - $20 an hour | We are a rapidly g... |
| 2 | Machine Learning Developer | | Civicom, Inc. | | We're looking for ... |
| 3 | Data Scientist | | 05 Ascensia Diabetes Care US Inc. | | Experience in and ... |
| 4 | Data Scientist | | National Debt Relief | | Retrieve data and ... |
| 5 | Associate Data Scientist | | Guardian Life Insurance Company | | Provide data model... |
| 6 | Data Scientist-Manager/ Sr Manager, Credit and... | | American Express | | As a Data Scientis... |
| 7 | Data Scientist | New York, NY 10001 (Chelsea area) | MediQuire | | Developing innovative methods to c... |
| 8 | Data Scientist, User Fraud | New York, NY 10011 (Chelsea area) | Spotify | | Coding skills for analytics and da... |
| 9 | Junior Data Scientist | New York, NY 10176 (Murray Hill area) | Dow Jones | | You have worked with visualization... |


### The Complete Code
I modified the code to get the results from a longer list of cities with different job titles. The complete code is below. 

In [76]:
# Import packages 
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time

#import warnings
#warnings.filterwarnings("ignore")

# time.clock() was removed in Python 3.8; process_time() reports CPU time
# (use time.perf_counter() instead for wall-clock time)
start_time = time.process_time()

url_template = "http://www.indeed.com/jobs?q={}&l={}&start={}"
max_results_per_city = 1000 # Set this to a high-value (5000) to generate more results. 

job_set = ['data+scientist']#,'quantitative+analyst']
cities=['New+York','Brooklyn','Long+Island','Mechanicsburg', 'Harrisburg','Carlisle',
      'Anaheim','Brea','Buena+Park','Costa+Mesa','Cypress','Dana+Point',
      'Fountain+Valley','Fullerton','Garden+Grove','Westminster','Huntington+Beach','Irvine',
      'La+Habra','La+Palma','Laguna+Beach','Laguna+Hills','Laguna+Niguel','Laguna+Woods',
      'Lake+Forest','Los+Alamitos','Orange','Santa+Ana','Seal+Beach','Stanton',
      'Los+Angeles', 'San+Diego','Fresno', 'Long+Beach','Sacramento', 'Oakland',
       'New+York', 'Chicago', 'San+Francisco', 'San+Jose', 'San+Diego',  
        'Washington%2C+DC', 'Boston', 'Pittsburgh', 'Philadelphia', 'Atlanta', 'Cincinnati',
        'St.+Louis', 'Tampa', 'Oakland', 'Austin', 'Houston', 'Dallas', 'Seattle', 'Portland',
        'Denver', 'Phoenix', 'Minneapolis', 'Miami', 'Charlotte', 'Jacksonville', 'Indianapolis',
        'Nashville', 'Kansas+City', 'Columbus']
columns = ['city_key', 'Title', 'Company', 'Location', 'Summary', 'Salary']



# Crawling more results, will also take much longer. First test your code on a small number of results and then expand.
i = 0
rows = []
for job in job_set:
    # set() drops the cities that are listed more than once above
    for city in set(cities):
        for start in range(0, max_results_per_city, 10):
            #print(city+" pg"+str(start/10.)) #to keep track of progress
            # Build the URL for this job title, city and result offset
            # (job_set is a list, so each title is formatted in separately)
            url = url_template.format(job, city, start)
            html = requests.get(url)
            time.sleep(1)
            soup = BeautifulSoup(html.content, 'html.parser', from_encoding="utf-8")
            for each in soup.find_all(class_= "result" ):
                # find() returns None when a field is missing, so .text raises AttributeError
                try: 
                    title = each.find(class_='jobtitle').text.replace('\n', '')
                except AttributeError:
                    title = ''
                try:
                    location = each.find('span', {'class':"location" }).text.replace('\n', '')
                except AttributeError:
                    location = ''
                try: 
                    company = each.find(class_='company').text.replace('\n', '')
                except AttributeError:
                    company = ''
                try:
                    salary = each.find('span', {'class':'no-wrap'}).text.strip()
                except AttributeError:
                    salary = ''
                try:
                    summary = each.find('span', {'class':'summary'}).text.replace('\n', '')
                except AttributeError:
                    summary = ''
                rows.append({'city_key':city,'Title':title,'Company':company, 'Location':location,'Summary':summary,'Salary':salary})
                i += 1
                if i % 1000 == 0: 
                    print('You have ' + str(i) + ' results.')
# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
df_more = pd.DataFrame(rows, columns=columns)
print('You have total '+str(df_more.shape[0])+ ' jobs. ' +str(df_more.dropna().drop_duplicates().shape[0]) + " of these aren't rubbish.")
print('Execution time is: %s seconds'%(time.process_time() - start_time))
df_more.to_csv('Indeed_not_cleaned_long.csv', encoding='utf-8')


You have 1000 results.
You have 2000 results.
You have 3000 results.
You have 4000 results.
You have 5000 results.
You have 6000 results.
You have 7000 results.
You have 8000 results.
You have 9000 results.
You have 10000 results.
You have 11000 results.
You have 12000 results.
You have 13000 results.
You have 14000 results.
You have 15000 results.
You have 16000 results.
You have 17000 results.
You have 18000 results.
You have 19000 results.
You have 20000 results.
You have 21000 results.
You have 22000 results.
You have 23000 results.
You have 24000 results.
You have 25000 results.
You have 26000 results.
You have 27000 results.
You have 28000 results.
You have 29000 results.
You have 30000 results.
You have 31000 results.
You have 32000 results.
You have 33000 results.
You have 34000 results.
You have 35000 results.
You have 36000 results.
You have 37000 results.
You have 38000 results.
You have 39000 results.
You have 40000 results.
You have 41000 results.
You have 42000 results.
...

You have 334000 results.
You have 335000 results.
You have 336000 results.
You have 337000 results.
You have 338000 results.
You have 339000 results.
You have 340000 results.
You have 341000 results.
You have 342000 results.
You have 343000 results.
You have 344000 results.
You have 345000 results.
You have 346000 results.
You have 347000 results.
You have 348000 results.
You have 349000 results.
You have 350000 results.
You have 351000 results.
You have 352000 results.
You have 353000 results.
You have 354000 results.
You have 355000 results.
You have 356000 results.
You have 357000 results.
You have 358000 results.
You have 359000 results.
You have 360000 results.
You have 361000 results.
You have 362000 results.
You have 363000 results.
You have 364000 results.
You have 365000 results.
You have 366000 results.
You have 367000 results.
You have 368000 results.
You have 369000 results.
You have 370000 results.
You have 371000 results.
You have 372000 results.
You have 373000 results.
