## Web Scraping Indeed

This tutorial scrapes data scientist job postings on Indeed, using the URL below as an example. It is based on a post on Medium.com, which you can find [here](https://medium.com/@msalmon00/web-scraping-job-postings-from-indeed-96bd588dcb4b).

https://www.indeed.com/jobs?q=data+scientist+$20,000&l=New+York&start=10 

### URL Notes:

The original tutorial uses a longer URL with more parameters, but the URL above should work just fine.

* **"q="** begins the search field criteria.
* Words within a search term are separated by **"+"**.
* Separate search criteria are joined by **"&"**.
* Location starts with **"l="**. That's an "L", not a 1.
* **"start="** sets the offset into the search results where the page begins. In this case the offset is 10.
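
The URL parts listed above can also be assembled in code rather than typed by hand. The following is a minimal sketch using only the standard library; the helper name `build_search_url` and the constant `BASE_URL` are made up for illustration and are not part of the original tutorial.

```python
from urllib.parse import urlencode

# Assumed base path, taken from the example URL above.
BASE_URL = "https://www.indeed.com/jobs"

def build_search_url(query, location, start=0):
    """Assemble a search URL from the q, l and start parameters."""
    # urlencode joins the key=value pairs with "&" and turns spaces into "+".
    # It also percent-encodes "$" (%24) and "," (%2C).
    params = {"q": query, "l": location, "start": start}
    return f"{BASE_URL}?{urlencode(params)}"

url = build_search_url("data scientist $20,000", "New York", start=10)
print(url)
# https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10
```

The percent-encoded form of the query matches the one that appears in the RSS link of the page source printed further down.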

### Page Notes:

The number of results on each page seems to vary.

All information on the page is coded with HTML tags.

* Results are stored under the tag id **"resultsCol"**
* Each job listing is under the tag class **"jobsearch-SerpJobCard unifiedRow row result clickcard"**
* Each job listing has a unique id. The first id on my page is **id="pj_d3129653f60d24bb"**
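
To make the structure above concrete, here is a small sketch that runs BeautifulSoup against a hand-written stand-in for the page. The HTML snippet, the ids `pj_aaa` and `pj_bbb`, and the variable names are invented for illustration; only the id "resultsCol" and the class names come from the notes above. It also shows that searching for the single class "jobsearch-SerpJobCard" matches tags whose class attribute lists several classes.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page structure (not real Indeed markup).
SAMPLE_HTML = """
<div id="resultsCol">
  <div class="jobsearch-SerpJobCard unifiedRow row result clickcard" id="pj_aaa">First</div>
  <div class="jobsearch-SerpJobCard unifiedRow row result clickcard" id="pj_bbb">Second</div>
</div>
<div class="footer">Not a job card</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Narrow the search to the results container first.
results = sample_soup.find(id="resultsCol")

# class_ matches any one of the space-separated classes on a tag,
# so the short name is enough to find every card.
cards = results.find_all("div", class_="jobsearch-SerpJobCard")
card_ids = [card["id"] for card in cards]
print(card_ids)
```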


### Building Scraper Components

Now, with a basic understanding of the page components, we will import the libraries.

The "time" library is imported to stagger the page requests and not overwhelm the site servers when scraping.

In [9]:
import requests
import bs4
from bs4 import BeautifulSoup

import pandas as pd
import time
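
As a rough sketch of how "time" can stagger requests, the helper below sleeps between consecutive calls. The name `fetch_pages`, its parameters, and the one-second default are assumptions for illustration, not something the original tutorial specifies. The fetch function is passed in so the pacing logic can be tried without touching the network.

```python
import time

def fetch_pages(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    responses = []
    for index, url in enumerate(urls):
        # No need to wait before the very first request.
        if index > 0:
            time.sleep(delay_seconds)
        responses.append(fetch(url))
    return responses

# Hypothetical usage with the libraries imported above:
# urls = [f"https://www.indeed.com/jobs?q=data+scientist&l=New+York&start={n}"
#         for n in range(0, 30, 10)]
# pages = fetch_pages(urls, requests.get, delay_seconds=1.0)
```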

To start, we will pull a single page to work out the code.

In [2]:
URL = "https://www.indeed.com/jobs?q=data+scientist+$20,000&l=New+York&start=10"

# conducting a request of the stated URL above:
page = requests.get(URL)

# specifying a desired format of "page" using the html parser - this allows python to read
# the various components of the page, rather than treating it as one long string.
soup = BeautifulSoup(page.text, "html.parser")

# printing soup in a more structured tree format that makes for easier reading
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <script src="/s/498f66a/en_US.js" type="text/javascript">
  </script>
  <link href="/s/e1d8b3d/jobsearch_all.css" rel="stylesheet" type="text/css"/>
  <link href="http://rss.indeed.com/rss?q=data+scientist+%2420%2C000&amp;l=New+York" rel="alternate" title="Data Scientist $20,000 Jobs, Employment in New York State" type="application/rss+xml"/>
  <link href="/m/jobs?q=data+scientist+%2420%2C000&amp;l=New+York" media="only screen and (max-width: 640px)" rel="alternate"/>
  <link href="/m/jobs?q=data+scientist+%2420%2C000&amp;l=New+York" media="handheld" rel="alternate"/>
  <script type="text/javascript">
   if (typeof window['closureReadyCallbacks'] == 'undefined') {
window['closureReadyCallbacks'] = [];
}

function call_when_jsall_loaded(cb) {
if (window['closureReady']) {
cb();
} else {
window['closureReadyCallbacks'].push(cb);
}
}
  </script>
  <meta content="1" na

### Withdrawing Basic Elements of Data

There are 5 key pieces of information to extract from each job posting.

* Job Title
* Company Name
* Location
* Salary
* Job Summary

When the original tutorial was written, its author reported about 15 job postings per page. I found that the count can range from 15 to 19, with no indication of which postings are sponsored. Either way, the number of results we extract should fall between these two values.
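
Since each extraction should return between 15 and 19 values, a quick length check can flag an extractor that is missing postings. This is only a sketch; the function name `check_result_count` is made up here, and the bounds are the ones observed above.

```python
def check_result_count(values, low=15, high=19):
    """Return True when the number of extracted values is in the expected range."""
    return low <= len(values) <= high

print(check_result_count(["Data Scientist"] * 18))  # 18 values: within range, prints True
print(check_result_count(["New York, NY"] * 8))     # 8 values: too few, prints False
```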

### Job Title

Job titles are nested under **div class="jobsearch-SerpJobCard"**, and the title is in an **"a"** tag with the class name **"title"**. The code below matches that tag on its data-tn-element="jobTitle" attribute and reads the text from its title attribute. With BeautifulSoup you could search for any div class that serves as a container for the job title, but it's better to be specific.

In [3]:
def extract_job_title_from_result(soup):
    """ Takes a BeautifulSoup-parsed page from Indeed and returns
        a list of job titles on that page
        
        ex. extract_job_title_from_result(soup)
        
        >>> ['Senior Data Scientist (Remote-friendly)',
            'Chief Data Scientist',
            'Data Scientist',
            'Data Science Specialist USA',
            'Sr. Data Scientist']
    """
    # List containing job titles
    jobs = []
    
    # search through all <div class="jobsearch-SerpJobCard"> tags, then find each
    # <a> tag with data-tn-element="jobTitle" and read its title attribute
    for div in soup.find_all(name="div", attrs={"class":"jobsearch-SerpJobCard"}):
        for a in div.find_all(name="a", attrs={"data-tn-element":"jobTitle"}):
            jobs.append(a["title"])
    return jobs

extract_job_title_from_result(soup)

['Senior Data Scientist (Remote-friendly)',
 'Data Science Specialist USA',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist - Responsible AI',
 'Data Scientist',
 'Data Scientist',
 'Data Science Intern',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist - Decision Science',
 'Data Scientist, Music Analytics',
 'Senior Data Analyst',
 'Sr. Data Scientist',
 'Chief Data Scientist',
 'Data Scientist / Machine Learning / DL / AI',
 'Data Scientist / Data Analytics']

### Company Name

* div class="jobsearch-SerpJobCard"
* span class="company"

The tutorial mentions a "span" tag with the class "result-link-source", but I did not see it anywhere on the page, so I omitted the conditional code.

At first my code returned nothing, so I checked whether the provided example worked, and it did. I reshaped it to look like mine and it still worked. I am not sure why my original version failed.

In [4]:
def extract_company_from_result(soup): 
    
    companies = []
    
    for div in soup.find_all(name="div", attrs={"class":"jobsearch-SerpJobCard"}):
        company = div.find_all(name="span", attrs={"class":"company"})
        for b in company:
            companies.append(b.text.strip())
            
    return(companies)
 
extract_company_from_result(soup)

['Noom Inc.',
 'Mphasis',
 'Plug Power Inc',
 'Turner',
 'CompuForce',
 'Facebook',
 'Vettery',
 'CBS',
 'Mount Sinai',
 '1-800-Flowers',
 'Northwell Health',
 'Blackboard Insurance',
 'Spotify',
 'EmblemHealth/AdvantageCare Physicians',
 'Ascensia Diabetes Care',
 'Jobot',
 'Lockheed Martin',
 'Indotronix International Corporation']

### Location

* div class = "jobsearch-SerpJobCard"
* div class = "location accessible-contrast-color-location"

So the location is not in a "span" tag; it is in a "div" tag. Knowing that, we can reuse the code from before to get our values.

In [31]:
# incorrect method
def extract_location_from_result(soup):
    
    #location list
    locations = []
    div = soup.find_all("div", attrs={"class": "location"})
    for d in div:
        locations.append(d.text)
    return(locations)

extract_location_from_result(soup)

['New York, NY 10001 (Flatiron District area)',
 'New York, NY',
 'Latham, NY 12110',
 'New York, NY 10004 (Financial District area)',
 'Valhalla, NY 10595',
 'New York, NY',
 'Liverpool, NY 13088',
 'Orangeburg, NY']

This code is very similar to what was provided, but it only gave us 8 results. The previous extractions returned more than twice that. Let me explore what the provided code gives us.

In [6]:
# incorrect method
def extract_location_from_result(soup):
    
    locations = []
    
    spans = soup.find_all("span", attrs={"class": "location"})
    for span in spans:
        locations.append(span.text)
    return(locations)

extract_location_from_result(soup)

['New York, NY',
 'New York, NY 10170 (Midtown area)',
 'New York, NY 10017 (Midtown area)',
 'New York, NY',
 'New York, NY 10176 (Murray Hill area)',
 'New York, NY',
 'Carle Place, NY 11514',
 'New Hyde Park, NY 11042',
 'New York, NY 10271 (Financial District area)',
 'New York, NY 10011 (Flatiron District area)']

I had to change the class name from "locations" to "location", and with that the remaining locations show up: the 10 here plus the 8 from before match the 18 job titles. So the location is stored in either a "span" tag or a "div" tag.

Now I'll update the code to accommodate this.

In [13]:
def extract_location_from_result(soup):
    
    locations = []
    for div in soup.find_all(name="div", attrs={"class":"jobsearch-SerpJobCard"}):
        locationd = soup.find_all("div", attrs={"class": "location"})
        if len(locationd) > 0:
            for div in locationd:
                locations.append(div.text)
        else:
            locationsp = soup.find_all("span", attrs={"class": "location"})
            for span in locationsp:
                locations.append(span.text)
        return(locations)

extract_location_from_result(soup)

['New York, NY 10001 (Flatiron District area)',
 'New York, NY',
 'Latham, NY 12110',
 'New York, NY 10004 (Financial District area)',
 'Valhalla, NY 10595',
 'New York, NY',
 'Liverpool, NY 13088',
 'Orangeburg, NY']
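
The output above still has only 8 entries. The likely cause is that the function calls find_all on the whole "soup" rather than on the current card, and its return statement sits inside the for loop, so it returns after the first pass with only the "div" matches. Below is one possible fix, sketched under those assumptions and tested only against invented sample HTML. The name `extract_location_per_card` and the sample markup are hypothetical. It looks inside each card for either tag type and records None when a card has no location, which keeps the list aligned with the titles and companies.

```python
from bs4 import BeautifulSoup

def extract_location_per_card(soup):
    """Return one location per job card, or None when a card has none."""
    locations = []
    for card in soup.find_all("div", class_="jobsearch-SerpJobCard"):
        # Search only inside this card; a list of names accepts <div> or <span>.
        tag = card.find(["div", "span"], class_="location")
        locations.append(tag.text.strip() if tag else None)
    # Return only after every card has been visited.
    return locations

# Invented sample markup covering the div, span and missing cases.
SAMPLE_CARDS = """
<div class="jobsearch-SerpJobCard row result">
  <div class="location accessible-contrast-color-location">New York, NY 10001</div>
</div>
<div class="jobsearch-SerpJobCard row result">
  <span class="location accessible-contrast-color-location">New York, NY</span>
</div>
<div class="jobsearch-SerpJobCard row result">
  <span class="company">No location here</span>
</div>
"""

sample_locations = extract_location_per_card(BeautifulSoup(SAMPLE_CARDS, "html.parser"))
print(sample_locations)
# ['New York, NY 10001', 'New York, NY', None]
```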

### Salary

Salary will be different from the others. Most posts will not have a salary displayed. 