In [14]:
from urllib.parse import urlparse, parse_qs

from queries import QueryArgument, Query, SimpleIndeedQuery, AdvancedIndeedQuery

# Indeed Scraping Overview

I wrote an Indeed scraper back in 2018, but it looks like they've added some fancy new JS to their site since then. I'm gonna document the exploratory scraping process here.


## Queries

It's probably best to start at the entry point to the site. In our case with Indeed, the entry points are the search routes they have built in. From looking at the site, there appears to be a standard simple search that you can reach from the landing page, and an advanced search hidden behind a link.

### Simple Search

Let's start by figuring out everything that Indeed passes on when we perform a routine search directly from
https://indeed.com. We can do that by filling every field with a known value and seeing which parameters get passed through to the URL. Here's what we'll be using as the reference search:

![Simple Search](simple-search.png)

which gives us a url of https://www.indeed.com/jobs?q=job+$60,000&l=Cheyenne,+WY&radius=15&rbc=State+of+Wyoming&jcid=45d7691df85ac203&jt=fulltime&explvl=entry_level.

Let's make use of urllib to see what data was associated with each of those fields.

In [15]:
url = 'https://www.indeed.com/jobs?q=job+$60,000&l=Cheyenne,+WY&radius=15&rbc=State+of+Wyoming&jcid=45d7691df85ac203&jt=fulltime&explvl=entry_level'
params = urlparse(url)
parse_qs(params.query)

{'q': ['job $60,000'],
 'l': ['Cheyenne, WY'],
 'radius': ['15'],
 'rbc': ['State of Wyoming'],
 'jcid': ['45d7691df85ac203'],
 'jt': ['fulltime'],
 'explvl': ['entry_level']}

#### Observations

* It appears that the salary gets folded into the 'q' query, just separated from the keywords by a space.
* It looks like the upper bound of the salary was just ignored.
* It looks as if 'rbc' and 'jcid' work in tandem to identify the specific company.
* The options for the job type look like they drop the space (e.g. 'fulltime').
* The options for job level look like they use an underscore in place of the space (e.g. 'entry_level').
* Going to the next page adds a parameter 'start', with its value being the number of the entry to start on.
* Searching without 'l' throws a validation error.
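
To make those observations concrete, here's a rough sketch that rebuilds a simple-search URL by hand with `urllib.parse.urlencode`. The helper name `simple_search_url` and its arguments are made up for illustration; it only covers 'q', 'l', and 'start', and it assumes the salary really is just appended to the keywords.

```python
from urllib.parse import urlencode, urlparse, parse_qs

BASE_URL = "https://www.indeed.com/jobs"


def simple_search_url(what, where, min_salary=None, start=None):
    """Hypothetical helper: build a simple-search URL from the observed parameters."""
    if not where:
        # Mirrors the observation that searching without 'l' fails validation.
        raise ValueError("'where' (the 'l' parameter) is required")
    # The salary is folded into 'q', separated from the keywords by a space.
    q = what if min_salary is None else f"{what} ${min_salary:,}"
    params = {"q": q, "l": where}
    if start is not None:
        # 'start' is the number of the entry the page begins on.
        params["start"] = start
    return f"{BASE_URL}?{urlencode(params)}"


url = simple_search_url("job", "Cheyenne, WY", min_salary=60000, start=10)
print(url)
print(parse_qs(urlparse(url).query))
# {'q': ['job $60,000'], 'l': ['Cheyenne, WY'], 'start': ['10']}
```

Note that `urlencode` percent-encodes the `$` and `,` characters, whereas the URL copied from the browser left them bare. Both forms decode to the same values.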

#### Defining the Simple Query Argument Objects

We'll make use of the QueryArgument dataclass object that we defined to code up what we just discovered.

In [16]:
what = QueryArgument('q', str, disp_name='What')
where = QueryArgument('l', str, required=True, disp_name='Where')
radius = QueryArgument('radius', int, choices=(0, 5, 10, 15, 25, 50, 100), disp_name="Miles away")
min_salary = QueryArgument('q', int, fmt="${}", disp_name="Minimum Salary")
company = QueryArgument('rbc', str, requires='jcid', disp_name="Company Name")
company_id = QueryArgument('jcid', str, requires='rbc', disp_name="Company id")
job_type = QueryArgument('jt', str, choices=('fulltime', 'parttime', 'contract', 'internship', 'temporary', 'commission'), disp_name="Job Type")
experience = QueryArgument('explvl', str, choices=('entry_level', 'mid_level', 'senior_level'), disp_name="Experience Level")
start = QueryArgument('start', int, disp_name="start")
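
The `QueryArgument` class itself lives in the `queries` module and isn't shown in this notebook. For readers who don't have it handy, here's a stripped-down, hypothetical stand-in (`ArgSketch`) that shows the general idea behind the `choices` and `fmt` options used above. The field names and methods are guesses for illustration and are not the real implementation.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class ArgSketch:
    """Hypothetical stand-in for queries.QueryArgument; not the real class."""
    key: str
    type_: type = str
    choices: Optional[Tuple[Any, ...]] = None
    fmt: str = "{}"
    required: bool = False
    value: Any = None

    def set(self, value):
        # Coerce to the declared type, then check against the allowed choices.
        value = self.type_(value)
        if self.choices is not None and value not in self.choices:
            raise ValueError(f"{self.key}: {value!r} not in {self.choices}")
        self.value = value

    def render(self):
        # Return the (key, formatted value) pair, or None when the argument is unset.
        if self.value is None:
            if self.required:
                raise ValueError(f"{self.key} is required")
            return None
        return self.key, self.fmt.format(self.value)


radius_sketch = ArgSketch("radius", int, choices=(0, 5, 10, 15, 25, 50, 100))
radius_sketch.set("15")
print(radius_sketch.render())  # ('radius', '15')

salary_sketch = ArgSketch("q", int, fmt="${}")
salary_sketch.set(60000)
print(salary_sketch.render())  # ('q', '$60000')
```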

### Advanced Search

Now let's nab everything that gets passed to the advanced search too. Some of the arguments are repeated from the simple search, and we'll go ahead and ignore those. We'll use the following advanced search and URL to help us figure out the arguments:

![Advanced Search](advanced-search.png)

which gives a url of https://www.indeed.com/jobs?as_and=with_all_of_these_words&as_phr=with_the_exact_phrase&as_any=with_at_least_one_of_these_words&as_not=with_none_of_these_words&as_ttl=with_these_words_in_the_title&as_cmp=from_this_company&jt=all&st=jobsite&sr=directhire&as_src=from_this_job_site&salary=&radius=15&l=San+Francisco%2C+CA&fromage=15&limit=20&sort=date&psf=advsrch&from=advancedsearch.

Again, let's use urllib to make better sense of this.

In [18]:
url = 'https://www.indeed.com/jobs?as_and=with_all_of_these_words&as_phr=with_the_exact_phrase&as_any=with_at_least_one_of_these_words&as_not=with_none_of_these_words&as_ttl=with_these_words_in_the_title&as_cmp=from_this_company&jt=all&st=jobsite&sr=directhire&as_src=from_this_job_site&salary=&radius=15&l=San+Francisco%2C+CA&fromage=15&limit=20&sort=date&psf=advsrch&from=advancedsearch'
params = urlparse(url)
parse_qs(params.query)

{'as_and': ['with_all_of_these_words'],
 'as_phr': ['with_the_exact_phrase'],
 'as_any': ['with_at_least_one_of_these_words'],
 'as_not': ['with_none_of_these_words'],
 'as_ttl': ['with_these_words_in_the_title'],
 'as_cmp': ['from_this_company'],
 'jt': ['all'],
 'st': ['jobsite'],
 'sr': ['directhire'],
 'as_src': ['from_this_job_site'],
 'radius': ['15'],
 'l': ['San Francisco, CA'],
 'fromage': ['15'],
 'limit': ['20'],
 'sort': ['date'],
 'psf': ['advsrch'],
 'from': ['advancedsearch']}

#### Observations

* We have an overlap of 'l', 'jt', and 'radius'
* 'q' looks like it's been expanded to handle input from a combination of any 'as_\*' arguments
* 'st' corresponds to the "Show jobs from" dropdown, with choices being 'jobsite', and 'employer'
* 'sr' only has a value when "Exclude staffing agencies" is checked, and that value is 'directhire'
* 'fromage' can be either 'last', 1, 3, 7, 15, or 'any'
* 'limit' is the amount of entries to be displayed per page
* 'sort' only has a value for 'date'. The relevance choice gives no 'sort' parameter.
* 'psf' and 'from' are always 'advsrch' and 'advancedsearch' respectively
* 'start' is still the parameter for controlling paging
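
One thing worth noticing: the reference URL contains `salary=` with no value, but 'salary' doesn't show up in the parsed output. That's because `parse_qs` drops blank values by default. Here's a quick check on a shortened version of the reference URL, using the `keep_blank_values` flag to see the empty fields:

```python
from urllib.parse import urlparse, parse_qs

# Shortened version of the advanced-search reference URL.
short_url = ('https://www.indeed.com/jobs?as_and=with_all_of_these_words'
             '&salary=&radius=15&l=San+Francisco%2C+CA')
query = urlparse(short_url).query

default = parse_qs(query)
with_blanks = parse_qs(query, keep_blank_values=True)

print('salary' in default)    # False
print(with_blanks['salary'])  # ['']
```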

#### Defining the Advanced Query Arguments

Let's formalize the rest of the arguments in Python.

In [19]:
#Search Strings
all_of_these_words = QueryArgument('as_and', str, disp_name='All of these words')
exact_phrase = QueryArgument('as_phr', str, disp_name='Exact phrase')
any_of_these_words = QueryArgument('as_any', str, disp_name="Any of these words")
none_of_these_words = QueryArgument('as_not', str, disp_name="None of these words")
title_words = QueryArgument('as_ttl', str, disp_name="These words in title")
from_company = QueryArgument('as_cmp', str, disp_name="From this company")
from_this_job_site = QueryArgument('as_src', str, disp_name="From this job site")

#Ad Origin
posted_to = QueryArgument('st', str, choices=('jobsite', 'employer'), disp_name="Posted to")
hired_by = QueryArgument('sr', str, choices=('directhire',), disp_name="Who handles hiring")

#Sort and paging
sort_by = QueryArgument('sort', str, choices=('date',), disp_name="Sort by")
limit = QueryArgument('limit', int, choices=(10, 20, 30, 50), disp_name="Per Page")

#Age
from_age = QueryArgument('fromage', choices=('last', 1, 3, 7, 15, 'any'), disp_name="Max days old")

#Magic Args
psf = QueryArgument('psf', str, required=True, value='advsrch', mutable=False, requires='from')
searched_from = QueryArgument('from', str, required=True, value='advancedsearch', mutable=False, requires='psf')

### The Query Object

Now that we have all of the query arguments for both query types, let's go ahead and define our query objects to work with.

#### The Simple Query

This is the base query that we encounter on Indeed's landing page.

In [20]:
simple_kwargs = {
    'what' : what,
    'where' : where,
    'radius' : radius,
    'min_salary' : min_salary,
    'company' : company,
    'company_id' : company_id,
    'job_type' : job_type,
    'experience' : experience,
    'start' : start
}

IndeedSimple = Query('https://indeed.com/jobs', **simple_kwargs)
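
Since `what` and `min_salary` both target the 'q' key, whatever builds the URL has to merge them somehow. The real `Query` class isn't shown here, so the following is only a hypothetical sketch of one way to do it (`build_url` is a made-up helper): values that share a key get joined with a space, matching what we saw in the reference search.

```python
from urllib.parse import urlencode


def build_url(base, pairs):
    """Hypothetical helper: join (key, value) pairs into a URL, merging repeated keys."""
    merged = {}
    for key, value in pairs:
        # Repeated keys (e.g. keywords and salary both on 'q') are space-joined.
        merged[key] = f"{merged[key]} {value}" if key in merged else str(value)
    return f"{base}?{urlencode(merged)}"


sketch_url = build_url(
    "https://indeed.com/jobs",
    [("q", "python"), ("l", "Cheyenne, WY"), ("q", "$60000")],
)
print(sketch_url)
# https://indeed.com/jobs?q=python+%2460000&l=Cheyenne%2C+WY
```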

#### The Advanced Query

This is the query that we encounter from the "Advanced Search" link.

In [21]:
advanced_kwargs = {
    'where' : where,
    'radius' : radius,
    'min_salary' : min_salary,
    'job_type' : job_type,
    'experience' : experience,
    'start' : start,
    'all_words' : all_of_these_words,
    'exact_phrase' : exact_phrase,
    'any_words' : any_of_these_words,
    'none_words' : none_of_these_words,
    'title_words' : title_words,
    'from_company' : from_company,
    'from_job_site' : from_this_job_site,
    'posted_to' : posted_to,
    'hired_by' : hired_by,
    'sort_by' : sort_by,
    'limit' : limit,
    'age' : from_age,
    'psf' : psf,
    'searched_from': searched_from
}

IndeedAdvanced = Query('https://indeed.com/jobs', **advanced_kwargs)

### Using the Query Objects

We can now make use of these query objects by setting the arguments as attributes. For example, if we wanted a simple query searching for "python" as the what and "New York City, New York" as the where, for full-time, mid-level positions, we could use the object to get our search URL as follows.

In [22]:
IndeedSimple.what = "python"
IndeedSimple.where = "New York City, New York"
IndeedSimple.job_type = "fulltime"
IndeedSimple.experience = "mid_level"

IndeedSimple.url

'https://indeed.com/jobs?q=python&l=New+York+City%2C+New+York&jt=fulltime&explvl=mid_level'

And we can use the advanced object the same way. Let's now do the same thing, but this time we want Python to be in the title, and we don't want to be writing any C#, VBA, or .NET.

In [23]:
IndeedAdvanced.title_words = "python"
IndeedAdvanced.none_words = "C# .NET VBA"
IndeedAdvanced.where = "New York City, New York"
IndeedAdvanced.job_type = "fulltime"
IndeedAdvanced.experience = "mid_level"

IndeedAdvanced.url

'https://indeed.com/jobs?l=New+York+City%2C+New+York&jt=fulltime&explvl=mid_level&as_not=C%23+.NET+VBA&as_ttl=python&psf=advsrch&from=advancedsearch'
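
A quick sanity check on that output: the `#` in "C#" has to be percent-encoded as `%23`, otherwise everything after it would be treated as a URL fragment and never reach the server. Parsing the generated URL back shows the value survives the round trip, and parsing a raw, unencoded version shows what would go wrong.

```python
from urllib.parse import urlparse, parse_qs

generated = ('https://indeed.com/jobs?l=New+York+City%2C+New+York&jt=fulltime'
             '&explvl=mid_level&as_not=C%23+.NET+VBA&as_ttl=python'
             '&psf=advsrch&from=advancedsearch')

decoded = parse_qs(urlparse(generated).query)
print(decoded['as_not'])  # ['C# .NET VBA']

# With a raw '#', the query is cut off at the fragment marker.
raw = urlparse('https://indeed.com/jobs?as_not=C# .NET VBA')
print(raw.query)  # as_not=C
```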

### Indeed Specific Objects

While we have full access to and control over the Query object, these parameters can basically stay the same until Indeed changes their site again. The only thing we're going to be changing is the values. We'll take what we discovered above and use it to create the Indeed-specific query objects.

In [24]:
simple = SimpleIndeedQuery(where="New York City, New York", what="Python")
simple.url

'https://indeed.com/jobs?q=Python&l=New+York+City%2C+New+York&explvl=mid_level'

In [25]:
advanced = AdvancedIndeedQuery(where="San Francisco, CA", hired_by="directhire", experience="mid_level",
                              sort_by="date", age=15)
advanced.url

'https://indeed.com/jobs?l=San+Francisco%2C+CA&explvl=mid_level&sr=directhire&sort=date&fromage=15&psf=advsrch&from=advancedsearch'

## Scraping Page Results

Now that we can generate any search URL, let's set up our methods for scraping the results from each query.