# Web scraping

Note: I have not tested this on the lab computers, so it's possible that some (or all) of it won't work there.

*(Adapted from https://realpython.com/beautiful-soup-web-scraper-python)*

Web scraping is the act of extracting information from a web page. This can include copy-pasting something by hand, but we are usually referring to an automated process. There are two steps: acquiring the structured, unprocessed data from the target web page (usually HTML or XML), and extracting the data into a useful data type (like a DataFrame).

Web scraping is [hard](https://realpython.com/beautiful-soup-web-scraper-python/) because every web site is different and web sites constantly change. Any scraper you write may require constant maintenance.

In this activity, we'll scrape data from this [Fake Python Jobs](https://realpython.github.io/fake-jobs/) page that was created for web scraping practice. (The terms of service of individual web sites may make it illegal to use web scraping code on them. FYI. Also some sites use settings to block web scrapers and will cause your code to fail.)

We'll use the `requests` library to get the unprocessed data from the web page and the `BeautifulSoup` library to parse that data into the information we want.

In [None]:
!pip install requests beautifulsoup4

## Step 0: Inspect your data source

Before you start coding, you should look at the data source. (This isn't limited to web scraping...) Open up [Fake Python Jobs](https://realpython.github.io/fake-jobs/) in your browser.

Scroll through the site and see what happens when you click on things. Think about what data is present that someone might want to scrape.

Next, explore the HTML behind the scenes. This will give you an idea of how the raw data is structured. Right-click on something of interest and choose 'Inspect'. If that's not an option, Ctrl+Shift+I should open the Developer Tools on your browser. (If that still doesn't work, you can Ctrl+S to save the web page and open it in a text editor.)

Now we're ready to download the HTML code. We can do this manually, but we want to automate it.

## Step 1: Download HTML from the web page

For a static web page like the Fake Jobs we're looking at, this is super easy.

In [None]:
import requests

URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)  # issues an HTTP GET request to URL & stores response as page

print(page.text)

Like I said, super easy. The value of `page.text` should look like the HTML you looked at in Step 0.

This is a bit harder if we want the results of some sort of search query or user input actions, and quite a bit harder if the page requires authentication to get to. (Still possible, but outside the scope of today's lesson.) 
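As a rough illustration of the search-query case: many sites pass search terms in the URL's query string. The sketch below uses only the standard library to show what such a URL looks like. The base URL and parameter names are made up for illustration and have nothing to do with the Fake Python Jobs site. (`requests.get()` also accepts a `params=` dictionary that builds this kind of URL for you.)

```python
from urllib.parse import urlencode

# Hypothetical search endpoint and parameter names, for illustration only
BASE_URL = "https://example.com/jobs"
query = {"q": "python developer", "location": "Stewartbury, AA"}

# urlencode escapes spaces, commas, etc. so the values are safe inside a URL
search_url = f"{BASE_URL}?{urlencode(query)}"
print(search_url)
```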

It's also messy if you're getting data from a dynamic web site with lots of JavaScript that has to execute before you can get the data. You'll have to look for a different library for that (like Selenium). Just know that it's out there if you need it.

So now you have the HTML. Just read through it carefully and write down all the data you need. Web scraping complete! 😊

Not really. But we do need to study the HTML to understand the structure surrounding the data, which will allow us to exploit that structure in the next step.

Let's assume we want to extract the data that's on each of the job tiles (job title, company, location, date, and URL to the 'Apply' page). Try to locate one of the job tiles in the HTML above or by using Inspect on the page in your browser.

Seriously. Go look. This is probably the hardest part of web scraping, and if the page is well-designed, it's not that hard. When you think you have some idea about the structure, scroll down.

<div style="height: 1000px;">&nbsp;</div>

Did you notice `<div class="card">`? To me, that looks like the container for each of the job tiles. I've extracted one for a closer look:

```html
<div class="card">
  <div class="card-content">
    <div class="media">
      <div class="media-left">
        <figure class="image is-48x48">
          <img src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1" alt="Real Python Logo">
        </figure>
      </div>
      <div class="media-content">
        <h2 class="title is-5">Senior Python Developer</h2>
        <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
      </div>
    </div>

    <div class="content">
      <p class="location">
        Stewartbury, AA
      </p>
      <p class="is-small has-text-grey">
        <time datetime="2021-04-08">2021-04-08</time>
      </p>
    </div>
    <footer class="card-footer">
        <a href="https://www.realpython.com" target="_blank" class="card-footer-item">Learn</a>
        <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank" class="card-footer-item">Apply</a>
    </footer>
  </div>
</div>
```

To make things a little easier, next I've deleted the elements we don't care about (the image and the Learn link):

```html
<div class="card">
  <div class="card-content">
    <div class="media">
      
      <div class="media-content">
        <h2 class="title is-5">Senior Python Developer</h2>
        <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
      </div>
    </div>

    <div class="content">
      <p class="location">
        Stewartbury, AA
      </p>
      <p class="is-small has-text-grey">
        <time datetime="2021-04-08">2021-04-08</time>
      </p>
    </div>
    <footer class="card-footer">
        
        <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank" class="card-footer-item">Apply</a>
    </footer>
  </div>
</div>
```

Now, focus on the data we care about and look for text that may help you identify it. Certain HTML tags and classes are useful here.

### Task

Double-click the cell below, then copy-paste just the lines/HTML tags that contain the data we want.

your answers here
```html



```

Do you see some structure? Scroll down to continue.

<div style="height: 500px;">&nbsp;</div>

Here are the five data items we want:

```html
<h2 class="title is-5">Senior Python Developer</h2>

<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>

<p class="location">
        Stewartbury, AA
</p>

<time datetime="2021-04-08">2021-04-08</time>

<a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank" class="card-footer-item">Apply</a>

```

The `class` value can help us find job title, company, and location. In HTML, an element can have multiple classes, separated by spaces. Some of those may be uniquely identifying; others may be generic and used for formatting. Based on some experimentation, here are the tags/classes/attributes I think we need to focus on:

* *job title* - tag 'h2', class 'title'
* *company* - tag 'h3', class 'company'
* *location* - tag 'p', class 'location'
* *date* - tag 'time'
* *apply url* - tag 'a', class ???
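If the multiple-classes idea is unclear, here is a small sketch on a made-up snippet (not from the jobs page; the class names and the `demo_soup` variable are invented for this example). It assumes `beautifulsoup4` is installed and shows that searching for any one of an element's classes is enough to find it.

```python
from bs4 import BeautifulSoup

# Made-up snippet: one element carrying three classes
html = '<p class="note is-small warning">Check the date</p>'
demo_soup = BeautifulSoup(html, "html.parser")

# Matching a single class is enough, even though the element has three
tag = demo_soup.find("p", class_="warning")

print(tag["class"])  # BeautifulSoup stores the classes as a list of strings
print(tag.text)
```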

The apply url is tricky because, while it has a class, that class is not unique. Look back at the full HTML for the job tile: both the Apply and Learn links use the same class:

```html
<footer class="card-footer">
    <a href="https://www.realpython.com" target="_blank" class="card-footer-item">Learn</a>
    <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank" class="card-footer-item">Apply</a>
</footer>
```

So, will we be able to scrape the apply url? How might we differentiate between the two `<a href>`s in the footer? Think about this.

Another fact about our data is that it is all wrapped in a `<div class="card">`. That might come in handy later.

Okay, let's start setting up to extract data from the page.

In [None]:
# bs4 is the module name they went with for BeautifulSoup *shrug*
from bs4 import BeautifulSoup

# we reuse the page object from the earlier use of requests
soup = BeautifulSoup(page.content, "html.parser")

This time, we say `page.content` instead of `page.text` because that can avoid some problems with character encodings (think the UTF-8 mess from the Olympics data).
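To see why the bytes are the safer input, here is a small standard-library sketch (the sample text is invented). `page.content` is the raw bytes, while `page.text` is those bytes already decoded with whatever encoding `requests` guessed. When the guess is wrong, the text is garbled before the parser ever sees it.

```python
# Invented sample: the bytes a server might send for some accented text
raw_bytes = "Café Zürich".encode("utf-8")

wrong = raw_bytes.decode("latin-1")  # decoding with a wrong guess
right = raw_bytes.decode("utf-8")    # decoding with the real encoding

print(wrong)  # garbled accents
print(right)  # the original text
```

Handing BeautifulSoup the bytes lets it work out the encoding itself (for example, from a `<meta charset>` declaration in the page).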

Now, how to extract all the job tiles?

In [None]:
job_elements = soup.find_all("div", class_="card")  # find_all returns a list-like ResultSet

print(job_elements[0])

Do you see where this is going?
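In case it helps, here is the same pattern on a tiny made-up page (the HTML and the `demo_soup` name are invented for this sketch): `.find_all()` hands back every match, and you can count, index, or loop over the result like a list.

```python
from bs4 import BeautifulSoup

# Made-up page with repeated elements, one of which should not match
html = """
<ul>
  <li class="item">apple</li>
  <li class="item">banana</li>
  <li class="other">not a match</li>
  <li class="item">cherry</li>
</ul>
"""
demo_soup = BeautifulSoup(html, "html.parser")
items = demo_soup.find_all("li", class_="item")

print(len(items))   # number of matches
for item in items:  # loop over the matches like a list
    print(item.text)
```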

Let's start writing a function to process each job tile.

### Task

In the function draft below, edit the code for company_element and location_element to use the correct values.

Then, add code to extract the date element. Be careful with the `class_`.

In [None]:
def process_job_tile(job_element):
    """Returns dict with desired data elements.
    
    Args:
      job_element (bs4.element.Tag): a single job tile (<div class="card">)
      
    Returns:
      dict: with keys title, company, location, date, and url
      
    """
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h2", class_="title")
    location_element = job_element.find("h2", class_="title")
    
    print(title_element)
    print(company_element)
    print(location_element)
    print()
    
    return dict()  # we'll get there eventually

# to test our function
process_job_tile(job_elements[0])

Progress! We still need the url, and we want to remove all the HTML tag mess from what we have so far.

The latter is easy. The HTML tag objects that BeautifulSoup gives us have a `.text` attribute, which is a Python string. When we're working with strings from a foreign data source, there's a chance of extra hidden whitespace (again, think about the Olympics CSV exercise), so the `.strip()` method is a good idea.
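Here is a quick sketch of the whitespace problem on a made-up element (the HTML, class name, and variable names are invented for this example). Printing with `repr()` makes the hidden newlines and spaces visible.

```python
from bs4 import BeautifulSoup

# Made-up element whose text is padded with newlines and indentation
html = """<p class="city">
        Springfield, ZZ
      </p>"""
demo_soup = BeautifulSoup(html, "html.parser")
city_element = demo_soup.find("p", class_="city")

print(repr(city_element.text))          # includes the newlines and spaces
print(repr(city_element.text.strip()))  # just the text we want
```

BeautifulSoup's `.get_text(strip=True)` is another way to get the trimmed text, if you prefer to do it in one call.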

### Task

Modify the function we're working on to instead print `title_element.text.strip()`. Likewise for the other elements.



To finish up, we need to solve that Apply url issue: both the Learn and Apply links have the same tag and class. What ideas did you come up with?

I came up with two options:

1. the Apply link is the second `<a href>` in the `<footer class="card-footer">`, so we could `.find_all()` on the footer and pick the second one
    1. similarly, it is the second `<a href>` with `class="card-footer-item"`
2. the Apply link is the only `<a href>` that has "Apply" as its `.text`

Pick which option you want to try <font color="green">(or, try both!)</font>. 

Hints: 

1. if you go with option 1, you can leverage the techniques we've already used: starting with the current job element, `.find()` to get the footer and/or `.find_all()` to get both links
2. if you go with option 2, you can [include a](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument) `string=` argument in `.find_all()` to filter by the text (or you can `.find_all()` without filtering, then filter the results manually)

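Once you have the `<a>` tag you need, you can extract the URL from its `href` attribute. Attributes are read with dictionary-style access, `apply_element['href']`. Dot access (`apply_element.href`) won't work here: BeautifulSoup treats it as a search for a child tag named `href` and returns `None`.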
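If you want to see both ideas in action before touching the job tiles, here is a sketch on a made-up navigation menu (the HTML, class names, and URLs are invented, so you still have to adapt it to the job tile).

```python
from bs4 import BeautifulSoup

# Made-up menu: two links that share the same tag and class
html = """
<nav class="menu">
  <a href="https://example.com/" class="menu-item">Home</a>
  <a href="https://example.com/docs" class="menu-item">Docs</a>
</nav>
"""
demo_soup = BeautifulSoup(html, "html.parser")

links = demo_soup.find_all("a", class_="menu-item")
by_position = links[1]                        # idea 1: pick by position
by_text = demo_soup.find("a", string="Docs")  # idea 2: pick by exact text

print(by_position["href"])
print(by_text["href"])
print(by_text.href)  # None: dot access searches for a child tag, not an attribute
```

Picking by position is shorter but breaks if the page ever reorders the links; picking by text breaks if the label changes. Neither is obviously better.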

### Task

Modify the `process_job_tile()` function to extract the Apply url.

## Finishing up

Nearly there!

### Task

One last task: fix the function to return a dictionary instead of printing everything, then write a loop to extract the data from all of the job listings.

For bonus "points", feed the data into a DataFrame/database.
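For the database route, here is a minimal sketch using the standard library's `sqlite3`. The records are invented stand-ins shaped like the dictionaries your function should return, and the table layout is just one reasonable choice, not the only one.

```python
import sqlite3

# Invented records with the same keys as the function's return value
jobs = [
    {"title": "Example Engineer", "company": "Acme", "location": "Nowhere, ZZ",
     "date": "2021-04-08", "url": "https://example.com/jobs/1"},
    {"title": "Sample Analyst", "company": "Initech", "location": "Elsewhere, ZZ",
     "date": "2021-04-09", "url": "https://example.com/jobs/2"},
]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE jobs (title TEXT, company TEXT, location TEXT, date TEXT, url TEXT)"
)
# Named placeholders (:title, ...) are filled in from each dictionary's keys
conn.executemany(
    "INSERT INTO jobs VALUES (:title, :company, :location, :date, :url)", jobs
)
conn.commit()

rows = conn.execute("SELECT title, company FROM jobs ORDER BY date").fetchall()
print(rows)
```

For a DataFrame, `pandas.DataFrame(jobs)` accepts the same list of dictionaries.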

## Next steps

Suppose you only wanted a subset of the data. You're applying for Python programming jobs, not just any job. Two options:

1. Scrape all the data, then filter locally (e.g., in a DataFrame/database)
2. Filter before scraping the data.

For 1, you can reference Pandas or SQL documentation.

For 2, you can make use of more `.find_all()` filtering. Check [this](https://realpython.com/beautiful-soup-web-scraper-python/#find-elements-by-class-name-and-text-content) out.
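As a taste of option 2, the sketch below passes a function to `string=` so that only headings mentioning "python" (in any letter case) are returned. The HTML is a made-up miniature modeled on the job tiles, and `mentions_python` is a helper name I invented for this example.

```python
from bs4 import BeautifulSoup

# Made-up miniature of the jobs page: three tiles, two about Python
html = """
<div class="card"><h2 class="title">Python Tester</h2></div>
<div class="card"><h2 class="title">Office Manager</h2></div>
<div class="card"><h2 class="title">Senior python Developer</h2></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")


def mentions_python(text):
    """Return True when the heading text contains 'python' in any letter case."""
    return text is not None and "python" in text.lower()


python_titles = demo_soup.find_all("h2", string=mentions_python)
print([h2.text for h2 in python_titles])
```

Note that this returns the `<h2>` elements, not the whole tiles, so you would still need to work back up to the enclosing card to get the rest of the data.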

For a more complete study of BeautifulSoup and other techniques for navigating the web document, read through [this](https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/). That can give you some strategies for when your raw HTML is not well structured (e.g., no `class=` attributes that you can use).

Happy web scraping!