## Jokes

For the first example, let's fetch the HTML code for the web page found at http://pun.me/pages/funny-jokes.php. There is a list of 35 jokes on this page. If you don't already have Google Chrome installed, take a moment to do that.

Open the URL in Chrome and then open Chrome Developer Tools. To do this, click on the More Options icon in the top right-hand corner of your Chrome window, then click *More Tools*, then *Developer Tools*.

![Open Dev Tools](assets/scraping-1.jpg)

In the developer tools window, select the *Elements* tab. (You can also open the developer tools directly by pressing `Ctrl + Shift + I` on Windows or `Command + Option + I` on Mac.) In this tab, the HTML code for the web page is displayed. As you move your mouse pointer over the code, the relevant parts of the webpage will be highlighted. See if you can find the HTML code associated with the jokes themselves.

![HTML Code](assets/scraping-2.png)

You can also access the HTML related to the jokes by hovering over one with your mouse. Right-click and select "Inspect" and you will arrive at the relevant HTML code. 

![right-click-element](assets/inspect-element-right-click.gif)


There is a lot going on in the HTML. As you drill down into the code, you will find a section of code that starts like this:

```html
<div style="float:left;width:100%;">
    <div class="content">
        <p>
            Our most-liked jokes which are genuinely funny - this list of jokes has been hand selected and contain a variety of clever, clean and silly jokes so be prepared to laugh.
        </p>
        <ol>
         <li>Today at the bank, an old lady asked me to help check her balance. So I pushed her over.</li>
        <li>I bought some shoes from a drug dealer. I don't know what he laced them with, but I've been tripping all day.</li>
```


As you move your mouse pointer over that section of code the list of jokes itself is highlighted. In order to be able to parse the HTML code to extract the data, you will need to understand a little about the structure of HTML. If you are already familiar with HTML then you are all set, but if you are new to the language then there are a few things to be aware of.

HTML is made up of elements. An element is constructed from tags. A tag looks like this: `<tagname>`. In the code above, you can see that there are `<div>` tags, `<p>` tags, an `<ol>` tag and some `<li>` tags. Most tags have a corresponding closing tag, which looks just like the opening tag but with a slash. So the `<li>` tag has a corresponding closing `</li>` tag. A few elements, such as `<img>` and `<br>`, have no content and therefore no closing tag.

An `li` element is made up of the opening and closing `li` tags with some content between them like this:

```html
<li>Today at the bank, an old lady asked me to help check her balance. So I pushed her over.</li>
```

The element simply tells the browser what to display and maybe some information about the content. For instance, `<p>` is for a paragraph of text, `<ol>` is for ordered lists (that is, numbered lists), and `<li>` is a list item. Notice how the `<ol>` element contains a number of `<li>` elements. Notice also that each `<li>` element corresponds to a single numbered item in the list.
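
To make the nesting of opening and closing tags concrete, here is a minimal sketch that uses the `html.parser` module from Python's standard library to log each tag in a small, made-up snippet. The `TagLogger` class and the snippet are illustrative assumptions, not part of the page we are scraping.

```python
from html.parser import HTMLParser

# A tiny, made-up snippet with the same shape as the joke list.
SNIPPET = """
<ol>
  <li>First joke</li>
  <li>Second joke</li>
</ol>
"""


class TagLogger(HTMLParser):
    """Record each opening and closing tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.events = []

    def handle_starttag(self, tag, attrs):
        # An opening tag starts a new element one level deeper.
        self.events.append(("open", tag, self.depth))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag ends the element and returns to the parent level.
        self.depth -= 1
        self.events.append(("close", tag, self.depth))


logger = TagLogger()
logger.feed(SNIPPET)
for kind, tag, depth in logger.events:
    print("  " * depth + f"{kind}: <{tag}>")
```

The indented output shows the two `<li>` elements sitting one level inside the `<ol>` element, which is the structure we will navigate when extracting the jokes.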

There are a great many other elements on the page, but they are for headings, menus, and navigation that we are not interested in right now. You can typically use the Chrome Developer Tools to drill down to the most relevant elements on the page.

This example is simple enough that the URL does not need any special attention.


## Fetch the Code

The next step is to fetch the HTML code from the website using the Python `requests` library. This is the same library that you used to request data from an API earlier, except in this case we are not fetching JSON data. Instead, we can use the `text` property to get the HTML code.

In [1]:
import requests

url = 'http://pun.me/pages/funny-jokes.php'
response = requests.get(url)

# make sure we got a valid response
if response.ok:
  # get the full HTML of the page as a string
  data = response.text
  print(data)

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
	<meta name="description" content="Our most-liked jokes which are genuinely funny - this list of jokes has been hand selected and contain a variety of clever, clean and silly jokes so be prepared to laugh.">
    <title>35 Genuinely Funny Jokes which will actually make you laugh! | Pun.me</title>
    <link href="css.css" rel="stylesheet">
	<link rel="shortcut icon" href="http://pun.me/favicon.ico" />
  </head>
  <body>
	<div id="fb-root"></div>
<script>(function(d, s, id) {
  var js, fjs = d.getElementsByTagName(s)[0];
  if (d.getElementById(id)) return;
  js = d.createElement(s); js.id = id;
  js.src = "//connect.facebook.net/en_GB/sdk.js#xfbml=1&version=v2.5&appId=204272616306296";
  fjs.parentNode.insertBefore(js, fjs);
}(document, 'script', 'facebook-jssdk'));</script>

<a class="home" 

## Beautiful Soup

Parsing the HTML code that makes up a web page can be quite difficult, especially as there is no guarantee that the code is formatted correctly or consistently with any standard. Web pages are notoriously broken. Your web browser does a heroic job of rendering web pages even when they are broken, so even if you visit a website and it looks fine, that does not mean that the code is actually fine. In addition, building a polite scraper as defined above adds even more complexity to your code. There are several good libraries that can help you navigate through HTML. We will use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) to simplify the parsing.


To use BeautifulSoup we will:

* import the library
* create an object with the response text 
 
 
BeautifulSoup supports several different HTML parsers. That is, there are different ways that a program may read and understand an HTML page. When creating the object out of the HTML we need to tell BeautifulSoup which parser to use. We will use `html.parser`, which ships with Python, to avoid having to install additional dependencies.
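
As a small illustration of why a forgiving parser helps, the sketch below feeds BeautifulSoup a deliberately malformed, made-up snippet in which the `<p>` element is never closed. The variable names `broken_html` and `soup_demo` are assumptions chosen so they do not clash with the variables used in the rest of this walkthrough.

```python
from bs4 import BeautifulSoup

# Deliberately malformed: the <p> element is never closed.
broken_html = "<div><p>Unclosed paragraph</div>"

# Parse with Python's built-in parser, as in the rest of this tutorial.
soup_demo = BeautifulSoup(broken_html, "html.parser")

# The paragraph can still be found and its text extracted.
paragraph = soup_demo.find("p")
print(paragraph.get_text())
```

Different parsers may repair badly broken markup in different ways, so results on real pages can vary depending on which parser is chosen.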

In [2]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

The variable `soup` in this code now contains an object of type *BeautifulSoup*. This object represents the entire HTML document. We can see what it contains with the `prettify()` method.

In [3]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="Our most-liked jokes which are genuinely funny - this list of jokes has been hand selected and contain a variety of clever, clean and silly jokes so be prepared to laugh." name="description"/>
  <title>
   35 Genuinely Funny Jokes which will actually make you laugh! | Pun.me
  </title>
  <link href="css.css" rel="stylesheet"/>
  <link href="http://pun.me/favicon.ico" rel="shortcut icon">
  </link>
 </head>
 <body>
  <div id="fb-root">
  </div>
  <script>
   (function(d, s, id) {
  var js, fjs = d.getElementsByTagName(s)[0];
  if (d.getElementById(id)) return;
  js = d.createElement(s); js.id = id;
  js.src = "//connect.facebook.net/en_GB/sdk.js#xfbml=1&version=v2.5&appId=204272616306296";
  fjs.parentNode.insertBefore(js, fjs);
}(document, 'script', 'facebook-jssdk'));
  </scr

## Navigating the Document

Now that we have the document parsed, we need to be able to target just the elements of the document that we need. The page contains many different elements, most of which we can ignore. Let's try to find the ones we want. Go back to Chrome Developer Tools and find the list of jokes in the HTML. You will notice that each joke is in an `<li>` and together all the `<li>`s are in an `<ol>` element. Let's try to access that element.

To get a list of a certain type of element we can use the `find_all()` method. This is probably the most common method for navigating through a document looking for specific tags.

In [4]:
soup.find_all('ol') # make a list of all ol elements

[<ol>
 <img alt="Funny quote about an old lady" src="funny-old-lady-joke.jpg" style="max-width:100%;"/><br/>
 <li>Today at the bank, an old lady asked me to help check her balance. So I pushed her over.</li>
 <li>I bought some shoes from a drug dealer. I don't know what he laced them with, but I've been tripping all day.</li>
 <li>I told my girlfriend she drew her eyebrows too high. She seemed surprised.</li>
 <li>My dog used to chase people on a bike a lot. It got so bad, finally I had to take his bike away.</li>
 <li>I'm so good at sleeping. I can do it with my eyes closed.</li>
 <img alt="Joke about going home from work" src="funny-joke-about-work.jpg" style="max-width:100%;"/><br/>
 <li>My boss told me to have a good day.. so I went home.</li>
 <li>Why is Peter Pan always flying? He neverlands.</li>
 <li>A woman walks into a library and asked if they had any books about paranoia. The librarian says "They're right behind you!"</li>
 <li>The other day, my wife asked me to pass her li

That did the trick. When we are only expecting a single element we can use the `find()` method instead. Now that we have the list, let's try to get only the `<li>` elements from the list.

In [5]:
# get the list (avoid the name `list`, which would shadow Python's built-in type)
joke_list = soup.find('ol')
items = joke_list.find_all('li')
print(items)

[<li>Today at the bank, an old lady asked me to help check her balance. So I pushed her over.</li>, <li>I bought some shoes from a drug dealer. I don't know what he laced them with, but I've been tripping all day.</li>, <li>I told my girlfriend she drew her eyebrows too high. She seemed surprised.</li>, <li>My dog used to chase people on a bike a lot. It got so bad, finally I had to take his bike away.</li>, <li>I'm so good at sleeping. I can do it with my eyes closed.</li>, <li>My boss told me to have a good day.. so I went home.</li>, <li>Why is Peter Pan always flying? He neverlands.</li>, <li>A woman walks into a library and asked if they had any books about paranoia. The librarian says "They're right behind you!"</li>, <li>The other day, my wife asked me to pass her lipstick but I accidentally passed her a glue stick. She still isn't talking to me.</li>, <li>Why do blind people hate skydiving? It scares the hell out of their dogs.</li>, <li>When you look really closely, all mirror

That produces a list of the `<li>` tags. Finally, we want to extract just the text from within those tags.

In [6]:
jokes = [joke.get_text() for joke in items]
print(jokes)

['Today at the bank, an old lady asked me to help check her balance. So I pushed her over.', "I bought some shoes from a drug dealer. I don't know what he laced them with, but I've been tripping all day.", 'I told my girlfriend she drew her eyebrows too high. She seemed surprised.', 'My dog used to chase people on a bike a lot. It got so bad, finally I had to take his bike away.', "I'm so good at sleeping. I can do it with my eyes closed.", 'My boss told me to have a good day.. so I went home.', 'Why is Peter Pan always flying? He neverlands.', 'A woman walks into a library and asked if they had any books about paranoia. The librarian says "They\'re right behind you!"', "The other day, my wife asked me to pass her lipstick but I accidentally passed her a glue stick. She still isn't talking to me.", 'Why do blind people hate skydiving? It scares the hell out of their dogs.', 'When you look really closely, all mirrors look like eyeballs.', 'My friend says to me: "What rhymes with orange"

So that did the trick. One problem that we may face is if there were multiple `<ol>` elements on the page. In that case, we would want to be as specific as possible. In HTML each element can be given a *class* attribute and an *id* attribute. More than one element may have the same class, but ids are supposed to be unique.

A quick examination of the HTML shows that the `<ol>` element has neither a class nor an id. But, the `<div>` element that encloses it has the class *content*. We could select the `<div>` with class *content* then select the `<ol>` from within. Remember that even though this step isn't really necessary for this particular page, we want to make a robust scraper that will work even if the web site owner adds more lists to the same page.

As a quick illustration of this, let's do exactly what was done above but with a different route to find the jokes.

In [7]:
# this gets a list of all divs on the page
# soup.find_all('div')

# get just the first div with class *content*
div = soup.find('div', class_='content')
# avoid the name `list`, which would shadow Python's built-in type
joke_list = div.find('ol')
items = joke_list.find_all('li')
jokes = [joke.get_text() for joke in items]
print(jokes)

['Today at the bank, an old lady asked me to help check her balance. So I pushed her over.', "I bought some shoes from a drug dealer. I don't know what he laced them with, but I've been tripping all day.", 'I told my girlfriend she drew her eyebrows too high. She seemed surprised.', 'My dog used to chase people on a bike a lot. It got so bad, finally I had to take his bike away.', "I'm so good at sleeping. I can do it with my eyes closed.", 'My boss told me to have a good day.. so I went home.', 'Why is Peter Pan always flying? He neverlands.', 'A woman walks into a library and asked if they had any books about paranoia. The librarian says "They\'re right behind you!"', "The other day, my wife asked me to pass her lipstick but I accidentally passed her a glue stick. She still isn't talking to me.", 'Why do blind people hate skydiving? It scares the hell out of their dogs.', 'When you look really closely, all mirrors look like eyeballs.', 'My friend says to me: "What rhymes with orange"
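
To see why scoping the search matters, the following sketch uses a small, made-up HTML string that contains two `<ol>` lists. The `sidebar` class and all of the list items are assumptions for illustration only. Only the `content` class mirrors the real page.

```python
from bs4 import BeautifulSoup

# Made-up page with two ordered lists; only the second one holds jokes.
sample_html = """
<div class="sidebar">
  <ol>
    <li>Unrelated link one</li>
    <li>Unrelated link two</li>
  </ol>
</div>
<div class="content">
  <ol>
    <li>First joke</li>
    <li>Second joke</li>
  </ol>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first match in document order: here, the sidebar list.
first_ol_items = [li.get_text() for li in sample_soup.find("ol").find_all("li")]

# Scoping the search to the div with class "content" picks the intended list.
content_div = sample_soup.find("div", class_="content")
joke_items = [li.get_text() for li in content_div.find("ol").find_all("li")]

print(first_ol_items)
print(joke_items)
```

The unscoped search silently returns the wrong list, while the scoped search keeps working even though another `<ol>` appears earlier on the page.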

## Developer Jobs

Imagine that you've been tasked with creating a presentation on current trends in developer jobs. Your audience wants to understand things like which technologies are most in demand and what perks employers are using to attract talent. Your preliminary research reveals that Stack Overflow Jobs would be a good data source.

This website lists jobs that are targeted at developers and people in the software development community. On the page, there are a number of jobs listed, along with ads and other information. But this site uses pagination to display a long list of jobs one page at a time. To see the second page you can click the Next button at the bottom of the list. To scrape useful information from this website we will need to follow that link.

Let’s use the Chrome Developer Tools to examine the page a bit.

![Jobs](assets/scraping-3.jpg)

Drilling into the page structure you will find that there is a `<div>` element with class `listResults` that contains all the job listings. Each job listing has an HTML structure similar to this:

```html
<div data-jobid="240993" class="-item -job p24 pl48 bb ps-relative bc-black-2 js-dismiss-overlay-container   can-apply ">
  <div class="dismiss-overlay ps-absolute ta-center p16 t0 r0 b0 l0 grid ai-center jc-center o90 bg-black-050 fs-body3">
    <p class="mb0">
      Okay, you won’t see this job anymore. <a href="#" class="js-undismiss-job" data-id="240993">Undo</a>
    </p>
  </div>
  <div class="dismiss-trigger js-dismiss-job ps-absolute r12 fc-black-500 c-pointer" data-id="240993" data-referrer="JobSearch" title="Dismiss job"><svg aria-hidden="true" class="svg-icon iconClearSm" width="14" height="14" viewBox="0 0 14 14"><path d="M12 3.41L10.59 2 7 5.59 3.41 2 2 3.41 5.59 7 2 10.59 3.41 12 7 8.41 10.59 12 12 10.59 8.41 7z"></path></svg></div>
  <div class="-job-summary">
    <div class="-title">
      <span data-href="https://stackoverflow.com/users/signup?returnUrl=%2fjobs%2fset-favorite%2f240993%3freturnUrl%3d%252Fjobs%253Fmed%253Dsite-ui%2526ref%253Djobs-tab%26referrer%3dJobSearch%26sec%3dFalse&amp;ssrc=jobs" data-jobid="240993" class="fav-toggle ps-absolute l16 c-pointer js-fav-toggle " title="Click to add this job to your favorites." data-ga-label="Toptal | Front-End Developer @ PUB Team | 240993" )="">
        <svg aria-hidden="true" class="svg-icon iconStar" width="18" height="18" viewBox="0 0 18 18"><path d="M9 12.65l-5.29 3.63 1.82-6.15L.44 6.22l6.42-.17L9 0l2.14 6.05 6.42.17-5.1 3.9 1.83 6.16z"></path></svg>
      </span>
      <h2 class="fs-subheading job-details__spaced mb4">
        <a href="/jobs/240993/front-end-developer-pub-team-toptal?a=1iOW90N7AGru&amp;so=i&amp;pg=1&amp;offset=24&amp;total=965" title="Front-End Developer @ PUB Team" class="s-link s-link__visited">Front-End Developer @ PUB Team</a>        
      </h2>
      <span class="ps-absolute pt2 r0 fc-black-500 fs-body1 pr12 t24">5d ago</span>
    </div>
    <div class="fc-black-700 fs-body2 -company">
        <span>Toptal</span>
        <span class="fc-black-500">
             - No office location        
        </span>
    </div>
    <div class="mt2 -perks">
      <span class="-remote pr16">Remote</span>
    </div>
    <div class="mt12 -tags">
      <a href="/jobs/developer-jobs-using-javascript" class="post-tag job-link no-tag-menu">javascript</a>
      <a href="/jobs/developer-jobs-using-node.js" class="post-tag job-link no-tag-menu">node.js</a>
      <a href="/jobs/developer-jobs-using-css" class="post-tag job-link no-tag-menu">css</a>
      <a href="/jobs/developer-jobs-using-html" class="post-tag job-link no-tag-menu">html</a>
      <a href="/jobs/developer-jobs-using-reactjs" class="post-tag job-link no-tag-menu">reactjs</a>
    </div>
  </div>
</div>
```

So there is still a lot of extraneous information in there. We're only really interested in the job title, the company, the list of technologies, and the perks. Let's use BeautifulSoup to try to select the parts that we need.

In [8]:
url = 'https://stackoverflow.com/jobs'
response = requests.get(url)

# make sure we got a valid response
if response.ok:
  # get the full HTML of the page as a string
  data = response.text
  soup = BeautifulSoup(data, 'html.parser')

  # find all elements with class *-job-summary*
  summary = soup.find_all(class_='-job-summary')
  print(summary)

[<div class="-job-summary">
<div class="-title">
<span )="" class="fav-toggle ps-absolute l16 c-pointer js-fav-toggle" data-ga-label="SumUp | Lead Mobile Engineer - SumUp Bank | 320379" data-href="https://stackoverflow.com/users/signup?returnUrl=%2fjobs%2fset-favorite%2f320379%3freturnUrl%3d%252Fjobs%26referrer%3dJobSearch%26sec%3dFalse&amp;ssrc=jobs" data-jobid="320379" title="Click to add this job to your saved jobs.">
<svg aria-hidden="true" class="svg-icon iconStar" height="18" viewbox="0 0 18 18" width="18"><path d="M9 12.65l-5.29 3.63 1.82-6.15L.44 6.22l6.42-.17L9 0l2.14 6.05 6.42.17-5.1 3.9 1.83 6.16L9 12.65z"></path></svg>
</span>
<h2 class="fs-body2 job-details__spaced mb4">
<a class="s-link s-link__visited job-link" href="/jobs/320379/lead-mobile-engineer-sumup-bank-sumup" title="Lead Mobile Engineer - SumUp Bank">Lead Mobile Engineer - SumUp Bank</a> </h2>
<span class="ps-absolute pt2 r0 fs-body1 pr12 t32">
<span class="fc-orange-400 fw-bold mr2">New</span>
<span class="fc-b

Not bad! That gave us a list of the divs containing job info. But, that is still a lot of extraneous HTML. We could narrow it down further by just getting the titles.

In [9]:
titles = soup.find_all(class_='-title')
print(titles)

[<div class="-title">
<span )="" class="fav-toggle ps-absolute l16 c-pointer js-fav-toggle" data-ga-label="SumUp | Lead Mobile Engineer - SumUp Bank | 320379" data-href="https://stackoverflow.com/users/signup?returnUrl=%2fjobs%2fset-favorite%2f320379%3freturnUrl%3d%252Fjobs%26referrer%3dJobSearch%26sec%3dFalse&amp;ssrc=jobs" data-jobid="320379" title="Click to add this job to your saved jobs.">
<svg aria-hidden="true" class="svg-icon iconStar" height="18" viewbox="0 0 18 18" width="18"><path d="M9 12.65l-5.29 3.63 1.82-6.15L.44 6.22l6.42-.17L9 0l2.14 6.05 6.42.17-5.1 3.9 1.83 6.16L9 12.65z"></path></svg>
</span>
<h2 class="fs-body2 job-details__spaced mb4">
<a class="s-link s-link__visited job-link" href="/jobs/320379/lead-mobile-engineer-sumup-bank-sumup" title="Lead Mobile Engineer - SumUp Bank">Lead Mobile Engineer - SumUp Bank</a> </h2>
<span class="ps-absolute pt2 r0 fs-body1 pr12 t32">
<span class="fc-orange-400 fw-bold mr2">New</span>
<span class="fc-black-500">&lt; 1h ago</span

Better, but the actual title text that we want is in an `<a>` tag inside an `<h2>` tag. We can continue navigating down the document or we can use another technique: [**CSS Selectors**](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors). CSS is the language used to apply styles to web pages. CSS selectors are used to select specific elements on the page so that styles may be applied to just those elements. But the syntax is so powerful that we can use the same selector syntax in BeautifulSoup. 

To select all elements with a particular class use the `.classname` selector.

In [10]:
titles = soup.select('.-title')
print(titles)

[<div class="-title">
<span )="" class="fav-toggle ps-absolute l16 c-pointer js-fav-toggle" data-ga-label="SumUp | Lead Mobile Engineer - SumUp Bank | 320379" data-href="https://stackoverflow.com/users/signup?returnUrl=%2fjobs%2fset-favorite%2f320379%3freturnUrl%3d%252Fjobs%26referrer%3dJobSearch%26sec%3dFalse&amp;ssrc=jobs" data-jobid="320379" title="Click to add this job to your saved jobs.">
<svg aria-hidden="true" class="svg-icon iconStar" height="18" viewbox="0 0 18 18" width="18"><path d="M9 12.65l-5.29 3.63 1.82-6.15L.44 6.22l6.42-.17L9 0l2.14 6.05 6.42.17-5.1 3.9 1.83 6.16L9 12.65z"></path></svg>
</span>
<h2 class="fs-body2 job-details__spaced mb4">
<a class="s-link s-link__visited job-link" href="/jobs/320379/lead-mobile-engineer-sumup-bank-sumup" title="Lead Mobile Engineer - SumUp Bank">Lead Mobile Engineer - SumUp Bank</a> </h2>
<span class="ps-absolute pt2 r0 fs-body1 pr12 t32">
<span class="fc-orange-400 fw-bold mr2">New</span>
<span class="fc-black-500">&lt; 1h ago</span

This gives us the same output as earlier. So let's extend this a bit. To select only the `<h2>` elements in the divs with class *-title*, we can use the **child** selector. The `>` symbol is used to specify a child element. A child element is an element contained within another element.

In [11]:
title_headings = soup.select('.-title > h2')
print(title_headings)

[<h2 class="fs-body2 job-details__spaced mb4">
<a class="s-link s-link__visited job-link" href="/jobs/320379/lead-mobile-engineer-sumup-bank-sumup" title="Lead Mobile Engineer - SumUp Bank">Lead Mobile Engineer - SumUp Bank</a> </h2>, <h2 class="fs-body2 job-details__spaced mb4">
<a class="s-link s-link__visited job-link" href="/jobs/320172/fundraising-database-administrator-raisers-american-physical-therapy" title="Fundraising &amp; Database Administrator (RAISER'S EDGE CERTIFICATION A PLUS).">Fundraising &amp; Database Administrator (RAISER'S EDGE CERTIFICATION A PLUS).</a> </h2>, <h2 class="fs-body2 job-details__spaced mb4">
<a class="s-link s-link__visited job-link" href="/jobs/320374/consultores-senior-sap-fi-co-excelium-consulting" title="CONSULTORES SENIOR SAP FI CO">CONSULTORES SENIOR SAP FI CO</a> </h2>, <h2 class="fs-body2 job-details__spaced mb4">
<a class="s-link s-link__visited job-link" href="/jobs/314316/student-information-system-team-lead-isu-university-human-resources

Okay. That's better, but it'd be even better if we selected only the `<a>` tags inside the `<h2>`s.

In [12]:
title_links = soup.select('.-title > h2 > a')
print(title_links)

[<a class="s-link s-link__visited job-link" href="/jobs/320379/lead-mobile-engineer-sumup-bank-sumup" title="Lead Mobile Engineer - SumUp Bank">Lead Mobile Engineer - SumUp Bank</a>, <a class="s-link s-link__visited job-link" href="/jobs/320172/fundraising-database-administrator-raisers-american-physical-therapy" title="Fundraising &amp; Database Administrator (RAISER'S EDGE CERTIFICATION A PLUS).">Fundraising &amp; Database Administrator (RAISER'S EDGE CERTIFICATION A PLUS).</a>, <a class="s-link s-link__visited job-link" href="/jobs/320374/consultores-senior-sap-fi-co-excelium-consulting" title="CONSULTORES SENIOR SAP FI CO">CONSULTORES SENIOR SAP FI CO</a>, <a class="s-link s-link__visited job-link" href="/jobs/314316/student-information-system-team-lead-isu-university-human-resources" title="Student Information System Team Lead">Student Information System Team Lead</a>, <a class="s-link s-link__visited job-link" href="/jobs/315368/software-engineer-intern-java-blizzard-entertainmen

Now, we can use `get_text()` as we did above to extract the text content of the links.

In [13]:
jobs = [job.get_text() for job in title_links]
print(jobs)

['Lead Mobile Engineer - SumUp Bank', "Fundraising & Database Administrator (RAISER'S EDGE CERTIFICATION A PLUS).", 'CONSULTORES SENIOR SAP FI CO', 'Student Information System Team Lead', 'Software Engineer Intern, Java', 'Data Manager', 'Data Engineer', 'Associate Software Engineer', 'Software Engineer Intern - Gameplay', 'Data Science Intern, Service Technologies', 'Software Engineer Intern, Gameplay', 'Software Engineer Intern, Tools', 'Software Engineer Intern, IT Corp Apps', 'Software Engineer Intern, Gameplay (Diablo 4)', 'Software Engineer Intern - WoW Classic', 'Software Engineer Intern, Server', 'Associate Data Specialist (Temp)', 'Software Engineer Intern, Server', 'Sr. Systems Admin', 'DESARROLLADOR/A WEB FRONTEND', 'Lead Software Engineer (Remote)', 'Software Development Engineer', 'Full Stack Developer', 'Java Full-Stack Developer', 'Test & Integration Lead']
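
The selectors above all use the child combinator `>`. CSS also has a descendant combinator, written as a space, which matches elements nested at any depth. The sketch below contrasts the two on a small, made-up snippet. The `<p>` wrapper and the link text are assumptions for illustration and do not come from the real page.

```python
from bs4 import BeautifulSoup

# Made-up snippet: one link inside an <h2>, another inside a <p>.
sample_html = """
<div class="-title">
  <h2><a href="/jobs/1">Direct link</a></h2>
  <p><a href="/help">Nested help link</a></p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Child combinator: each step must be a direct child of the previous one.
child_only = [a.get_text() for a in sample_soup.select(".-title > h2 > a")]

# Descendant combinator (a space): matches links at any depth inside .-title.
descendants = [a.get_text() for a in sample_soup.select(".-title a")]

# No <a> is a direct child of .-title, so this selector matches nothing.
direct_children = sample_soup.select(".-title > a")

print(child_only)
print(descendants)
print(direct_children)
```

The stricter child combinator is useful when a container holds other links that should be ignored, at the cost of breaking if the site adds a wrapper element in between.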


How would we get a list of the companies that posted jobs on this page? Take a few minutes and try to construct the CSS selector expression that will return this information.

It may be obvious from the HTML that the company information is inside the job summary in a `<div>` with class `-company`. If you select the content of the `<span>` elements inside that div, you will get the company information.

In [14]:
company_spans = soup.select('.-job-summary > .-company > span')
companies = [company.get_text() for company in company_spans]
print(companies)

['SumUp\r\n        ', '\r\n             - \r\nSão Paulo, Brazil        ', 'American Physical Therapy Association (Foundation for Physical Therapy Research)\r\n                via\r\nPandoLogic        ', '\r\n             - \r\nAlexandria, VA        ', 'EXCELIUM CONSULTING\r\n        ', '\r\n             - \r\nBarcelona, Spain        ', 'ISU/University Human Resources\r\n                via\r\nPandoLogic        ', '\r\n             - \r\nAmes, IA        ', 'Blizzard Entertainment\r\n                via\r\nPandoLogic        ', '\r\n             - \r\nAustin, TX        ', 'VIOLA\r\n                via\r\nPandoLogic        ', '\r\n             - \r\nNew York, NY        ', 'VIOLA\r\n                via\r\nPandoLogic        ', '\r\n             - \r\nNew York, NY        ', 'Blizzard Entertainment\r\n                via\r\nPandoLogic        ', '\r\n             - \r\nIrvine, CA        ', 'Blizzard Entertainment\r\n                via\r\nPandoLogic        ', '\r\n             - \r\nIrvine, CA 

We are getting the information that we want, but there's extra text here that we do not want. Look carefully at the div with class *-company* and notice that it actually contains multiple spans. The company name is usually the first one. We can use the `:nth-of-type()` pseudo-class to grab only the first `<span>` in each div.

In [15]:
company_spans = soup.select('.-company > span:nth-of-type(1)')
companies = [company.get_text() for company in company_spans]
print(companies)

['SumUp\r\n        ', 'American Physical Therapy Association (Foundation for Physical Therapy Research)\r\n                via\r\nPandoLogic        ', 'EXCELIUM CONSULTING\r\n        ', 'ISU/University Human Resources\r\n                via\r\nPandoLogic        ', 'Blizzard Entertainment\r\n                via\r\nPandoLogic        ', 'VIOLA\r\n                via\r\nPandoLogic        ', 'VIOLA\r\n                via\r\nPandoLogic        ', 'Blizzard Entertainment\r\n                via\r\nPandoLogic        ', 'Blizzard Entertainment\r\n                via\r\nPandoLogic        ', 'Blizzard Entertainment\r\n                via\r\nPandoLogic        ', 'Blizzard Entertainment\r\n                via\r\nPandoLogic        ', 'Blizzard Entertainment\r\n                via\r\nPandoLogic        ', 'Blizzard Entertainment\r\n                via\r\nPandoLogic        ', 'Blizzard Entertainment\r\n                via\r\nPandoLogic        ', 'Blizzard Entertainment\r\n                via\r\nPandoLogi
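
Because the live page may change or be unavailable, it can help to check a selector against a small offline snippet first. The sketch below does that for `:nth-of-type(1)`. The company names and locations are made up for illustration, and only the class names mirror the HTML shown earlier.

```python
from bs4 import BeautifulSoup

# Made-up snippet with two company blocks, each holding two spans.
sample_html = """
<div class="-company">
  <span>Example Corp</span>
  <span class="fc-black-500"> - Springfield</span>
</div>
<div class="-company">
  <span>Another Co</span>
  <span class="fc-black-500"> - Remote</span>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Without the pseudo-class we get every span: names and locations alike.
all_spans = sample_soup.select(".-company > span")

# With :nth-of-type(1) we keep only the first span inside each div.
first_spans = sample_soup.select(".-company > span:nth-of-type(1)")
company_names = [span.get_text().strip() for span in first_spans]

print(len(all_spans))
print(company_names)
```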

Getting the tags for each job is much the same. Let's try to put together some code that will get all the information about the job that we need and put it into a dictionary object.


In [16]:
# iterate over all jobs
raw_jobs = soup.select('.-job-summary')
jobs = []
for raw_job in raw_jobs:
  # extract the title
  title = raw_job.select_one('.-title > h2 > a').get_text()
  # extract the company and remove leading and trailing white space
  company = raw_job.select_one('.-company > span:nth-of-type(1)').get_text().strip()
  # extract a list of tags
  tags = [tag.get_text() for tag in raw_job.select('.-tags a')]
  # construct a dictionary and add it to the list
  jobs.append({'title': title, 'company': company, 'tags': tags})
print(jobs)

[{'title': 'Lead Mobile Engineer - SumUp Bank', 'company': 'SumUp', 'tags': ['android', 'kotlin', 'flutter', 'react-native']}, {'title': "Fundraising & Database Administrator (RAISER'S EDGE CERTIFICATION A PLUS).", 'company': 'American Physical Therapy Association (Foundation for Physical Therapy Research)\r\n                via\r\nPandoLogic', 'tags': ['c', 'database']}, {'title': 'CONSULTORES SENIOR SAP FI CO', 'company': 'EXCELIUM CONSULTING', 'tags': []}, {'title': 'Student Information System Team Lead', 'company': 'ISU/University Human Resources\r\n                via\r\nPandoLogic', 'tags': ['project-management', 'web-services']}, {'title': 'Software Engineer Intern, Java', 'company': 'Blizzard Entertainment\r\n                via\r\nPandoLogic', 'tags': ['java', 'web-services', 'java-ee']}, {'title': 'Data Manager', 'company': 'VIOLA\r\n                via\r\nPandoLogic', 'tags': ['web-services', 'python']}, {'title': 'Data Engineer', 'company': 'VIOLA\r\n                via\r\n
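
Some company names in the output above still contain runs of whitespace in the middle, because `strip()` only trims the ends of a string. One possible cleanup, which is not part of the original notebook, is to split on whitespace and re-join with single spaces. The helper name `normalize_whitespace` is an assumption for illustration.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace into a single space and trim the ends."""
    # str.split() with no arguments splits on any whitespace, including \r and \n.
    return " ".join(text.split())


# A value shaped like the ones in the scraped output.
raw_company = 'Blizzard Entertainment\r\n                via\r\nPandoLogic        '
clean_company = normalize_whitespace(raw_company)
print(clean_company)
```

Replacing `.strip()` with a call to a helper like this in the loop above would give single-line company names.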

## Following a link

Now that we can scrape the list of jobs from the first page, what about the other pages? Let's look at the HTML code for the next button.

```html
<a href="/jobs?sort=i&pg=2" title="page 2 of 39" class="prev-next job-link test-pagination-next">
  <span class="text">next</span>
  <i class="material-icons">chevron_right</i>
</a>
```

The next button is made up of a clickable link styled to look like a button. The `href` attribute of the link contains the URL for the next page, in this case `/jobs?sort=i&pg=2`. Notice that this is only part of the full URL for that page. A full URL is made up of a scheme (such as `https`), a domain name, a path, and an optional query string. Since we are already on `https://stackoverflow.com`, the browser knows that if we click this link we want to remain on the same website and just visit a different path. So the actual request will be sent to `https://stackoverflow.com/jobs?sort=i&pg=2`. The bit after the `?` is called the query string, and it is one way to send data to the server describing what you are requesting. It appears that this page accepts a parameter named `pg` representing the page number. If you click on the next button several times you will notice that the bit at the end of the URL that says `pg=2` changes to `pg=3` then `pg=4` and so on.
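
The way a browser resolves a relative `href` against the current page can be reproduced with Python's standard `urllib.parse` module. This is a minimal sketch using the URL and `href` from the text. The variable names are assumptions for illustration.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

current_page = 'https://stackoverflow.com/jobs'
next_href = '/jobs?sort=i&pg=2'  # the href value of the next button

# Resolve the relative link against the page it appears on.
full_url = urljoin(current_page, next_href)
print(full_url)

# Break the full URL into its parts.
parts = urlsplit(full_url)
print(parts.scheme, parts.netloc, parts.path)

# Parse the query string into a dictionary of parameter lists.
query_params = parse_qs(parts.query)
print(query_params)
```

Each value returned by `parse_qs` is a list, because a query string may repeat the same parameter more than once.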

That means that we can modify the URL by incrementing the page number to request more jobs. We can continue to do so until there are no more pages. 

To set the query string on the request URL, we can use the `params` argument of `requests.get()`. For example,

```python
page_number = 1
query = {'sort':'i', 'pg': page_number}
url = 'https://stackoverflow.com/jobs'
response = requests.get(url, params=query)
```

We could then update the `page_number` variable and repeat this call.
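
To preview what the URLs look like as `page_number` changes, without sending any requests, the sketch below builds the query strings with `urlencode` from the standard library. For simple values like these, it should closely match what `requests` produces from `params`, though that equivalence is an assumption of this sketch rather than something the example verifies.

```python
from urllib.parse import urlencode

base_url = 'https://stackoverflow.com/jobs'

# Build the URL for the first three pages; no network traffic is involved.
page_urls = []
for page_number in range(1, 4):
    query = {'sort': 'i', 'pg': page_number}
    page_urls.append(f"{base_url}?{urlencode(query)}")

for page_url in page_urls:
    print(page_url)
```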

This raises a few questions:

 - How do we know when there are no more pages?
 - If we rapidly send many requests for many pages to the server, will we overwhelm the server and possibly get ourselves blocked?
 
There may be several answers to the first question. If you go to the website and visit the last page of results you will see that the next button is not displayed on that page. So the absence of the next button could be the condition on which we stop the processing. There may be others that you can readily spot but we will go with that option for now.

The answer to the second question does not depend on the HTML structure of the page, but on how we code the solution. The code above processes one page of jobs and creates a list of dictionaries from that page. To repeat that, we need a loop that iterates over all the pages and, for each page, does something similar to what we already wrote. When we get to the last page, we stop the loop, and at that point we have a list of all the jobs on the website. The problem is that our program will process each page much more rapidly than a human browsing the website would. If we're not careful, this could look like a denial-of-service attack against the website. Since most web servers can identify such attacks and take steps to protect themselves, we want to avoid this behavior.

One solution is to deliberately slow down the rate of requests that we make. After making one request, we can wait a few seconds before making another one. If we wait 1 second between requests and we need to request 500 pages, then our program takes at least 500 seconds to run. That is not so bad, since scraping the data is hopefully a one-time affair. There are libraries that help to moderate your request rate if you are building a more sophisticated scraper. For our simple scraper, we can use Python's `time.sleep()` function to pause execution for a short time.
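The idea can be sketched as follows. The `fetch_page` function below is a stand-in that makes no network request, and the function names and delay values are assumptions for illustration; in the real scraper a call to `requests.get()` would take its place.

```python
from random import uniform
from time import sleep, time

def fetch_page(page_number):
  # placeholder for the real request; returns a fake result
  return 'page {}'.format(page_number)

def fetch_politely(page_numbers, min_delay=1.0, max_delay=3.0):
  # fetch each page in turn, pausing for a random delay between requests
  results = []
  for index, page_number in enumerate(page_numbers):
    if index > 0:
      sleep(uniform(min_delay, max_delay))
    results.append(fetch_page(page_number))
  return results

# short delays here so the demonstration finishes quickly
start = time()
pages = fetch_politely([1, 2, 3], min_delay=0.1, max_delay=0.2)
elapsed = time() - start
print(pages, 'fetched in {:.1f} seconds'.format(elapsed))
```

A random delay makes the requests less regular than a fixed one. There is no pause before the first request, so three pages cost two delays.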

## Some other considerations

Before we jump into code, there are a few other considerations that we should make since we are requesting multiple pages. First, we should clearly identify ourselves to the web server. When your browser makes a request to a web server, it sends along a header named **User-Agent** with a value that identifies the browser itself. For instance, in Chrome the value looks like this:

```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36
```

And in Firefox the value looks like this:

```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0) Gecko/20100101 Firefox/63.0
```

This way the web server can track how many visitors use a particular browser. We can set a value that identifies our program and provides a way to contact us. If the website owners wish, they can then ask why we are scraping data from their website and perhaps negotiate an arrangement that works better for both sides. We could provide a value like:

```
'jobscraper - school project (yourname@gmail.com)'
```

For example, when making the request we could set the header like this:

```python
import requests

url = '...'
headers = {'user-agent': 'jobscraper - school project (yourname@gmail.com)'}
response = requests.get(url, headers=headers)
```

Another consideration is monitoring the program as it runs. We are creating a program that may take 10 to 20 minutes or even longer to run, so it is important to "see" what is happening along the way. This helps us follow the progress, confirm that the program hasn't hung, and detect problems early. To this end, we can simply print the number of requests made so far, the frequency of the requests, the number of items found so far, and any errors encountered.

Putting all of this together can be a bit daunting, but careful step-by-step thought will make it possible. A complete working program is given below. Before you run it, ensure you set your email address in the header. Also, notice the variable `MAX_REQUESTS` which is used to set an upper limit on the number of requests made. You may adjust this value as you wish.

```python
# import all dependencies at the top
from time import time
from time import sleep
from random import randint
from IPython.display import clear_output
from warnings import warn
from bs4 import BeautifulSoup
import requests


# define a function to process the page
def process_page(soup, jobs):

  # find all elements with class *-job-summary*
  raw_jobs = soup.select('.-job-summary')

  # same as above, extract the info we need
  for job in raw_jobs:
    title = job.select_one('.-title > h2 > a').get_text() # extract the title
    company = job.select_one('.-company > span:nth-of-type(1)').get_text().strip() # extract the company
    tags = [tag.get_text() for tag in job.select('.-tags a')] # extract a list of tags
    job = {'title': title, 'company': company, 'tags': tags} # construct a dictionary
    jobs.append(job) # add dictionary to list


# prepare for the monitoring logic
start_time = time() # note the system time when the program starts
request_count = 0 # track the number of requests made

# create a list to store the data in
jobs = []

# variables to handle the request loop
has_next_page = True
MAX_REQUESTS = 100 # upper limit on the number of pages to request
page_number = 1
url = 'https://stackoverflow.com/jobs'
headers = {'user-agent': 'jobscraper - school project (yourname@gmail.com)'}

# note: because of the `<=` test, the loop can make MAX_REQUESTS + 1 requests;
# use `<` instead if you want a strict limit
while has_next_page and request_count <= MAX_REQUESTS:
  # keep the output clear
  clear_output(wait=True)

  # request the current page
  query = {'sort': 'i', 'pg': page_number}
  response = requests.get(url, params=query, headers=headers)

  # make sure we got a valid response
  if response.ok:
    # get the full data from the response
    data = response.text
    soup = BeautifulSoup(data, 'html.parser')
    process_page(soup, jobs)

    # check for the next page
    # look for the presence of element with class *test-pagination-next*
    next_button = soup.select('.test-pagination-next')
    has_next_page = len(next_button) > 0

  else:
    # display a warning if there are any problems
    warn('Request #: {}, Failed with status code: {}'.format(request_count, response.status_code))

  request_count += 1

  # go to sleep for a bit
  # we use a random number between 1 and 3, so
  # we wait at most 3 seconds before making the next request
  sleep(randint(1, 3))

  # output some logs for monitoring
  elapsed_time = time() - start_time
  print('Requests: {}, Frequency: {} requests/s, {} jobs processed.'.format(request_count, request_count/elapsed_time, len(jobs)))

  # prepare for next iteration
  page_number += 1

print('Scraping complete')
print('Requests: {}, Frequency: {} requests/s, {} jobs processed.'.format(request_count, request_count/elapsed_time, len(jobs)))
```

```
Requests: 101, Frequency: 0.4251301557767797 requests/s, 2525 jobs processed.
Scraping complete
Requests: 101, Frequency: 0.4251301557767797 requests/s, 2525 jobs processed.
```


```python
# print the first five jobs
jobs[0:5]
```

```
[{'title': 'Software Developer/Engineer - Full Stack',
  'company': 'Ridgeline International, Inc.',
  'tags': ['angular', 'typescript', 'java', 'node.js', 'postgresql']},
 {'title': 'Software Developer',
  'company': 'Global Reach Consulting LLC',
  'tags': ['python', 'c', 'ruby', 'linux', 'c++']},
 {'title': 'Mid-Level Full Stack Software Developer',
  'company': 'Global Reach Consulting LLC',
  'tags': ['python', 'javascript', 'reactjs', 'angular', 'encryption']},
 {'title': 'Full Stack Engineer',
  'company': 'Next Century Corporation',
  'tags': ['user-experience', 'java', 'python', 'typescript', 'javascript']},
 {'title': 'Software Developer/Engineer - Angular',
  'company': 'Ridgeline International, Inc.',
  'tags': ['reactjs', 'typescript', 'angular', 'node.js', 'java']}]
```

## Save to a file
Getting the data from the server is only the first part of the problem. The next step is to persist the data in a format that makes it available for analysis later. CSV is a fairly common format and is well supported in many environments. Python's built-in `csv` module provides the tools to read and write CSV files.
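When the code runs on your own machine, writing and reading a CSV file takes only a few lines. The sketch below uses two made-up job records and a temporary file, and it joins the tags with a semicolon so that each job fits in one row. Both the sample data and the separator are assumptions for illustration.

```python
import csv
import os
import tempfile

# sample records in the same shape as the scraped jobs (made-up data)
sample_jobs = [
  {'title': 'Data Engineer', 'company': 'Example Co', 'tags': ['python', 'sql']},
  {'title': 'Web Developer', 'company': 'Sample Ltd', 'tags': []},
]

# write to a file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'jobs.csv')

# newline='' lets the csv module control line endings itself
with open(path, 'w', newline='', encoding='utf-8') as f:
  writer = csv.DictWriter(f, fieldnames=['title', 'company', 'tags'])
  writer.writeheader()
  for job in sample_jobs:
    # replace the list of tags with a single semicolon-separated string
    writer.writerow({**job, 'tags': ';'.join(job['tags'])})

# read the file back into a list of dictionaries
with open(path, newline='', encoding='utf-8') as f:
  rows = list(csv.DictReader(f))

print(rows[0]['title'], rows[0]['tags'].split(';'))
```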

## Note about Colab

Since Colab is a hosted service, writing a file to the filesystem is not quite as straightforward as it would be if this code were running on your local machine. We can get around this by writing to your Google Drive. In order to use Google Drive, you will need to be authenticated. Luckily, Google provides a library named *PyDrive* that makes this simple.

First, ensure that PyDrive is installed in the environment. This command only needs to be executed once in a notebook.


```python
# install this library
!pip install -U -q PyDrive
!pip install google-colab
```


To authenticate with Google, do the following. Note that you will be prompted to log in with your Google credentials, and then you will be given a code to enter. When you enter that code, this notebook will have permission to write files to your Google Drive.

```python
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once in a notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
```

```
  warn("IPython.utils.traitlets has moved to a top-level traitlets package.")


FileNotFoundError: [Errno 2] No such file or directory: 'gcloud': 'gcloud'
```

Next, we can create a CSV string of our jobs and write it to Google Drive with the file name "jobs.csv".

```python
import csv
import io

# Create an output stream
output = io.StringIO()

# these are the names of the properties in the dictionary
fieldnames = ['title', 'company', 'tags']

# create a writer object, it can write dictionaries to the output stream
writer = csv.DictWriter(output, fieldnames=fieldnames)

# write all the headings
writer.writeheader()

# iterate the jobs and write each one
for job in jobs:
  writer.writerow(job)


# Create & upload a text file.
uploaded = drive.CreateFile({'title': 'jobs.csv'})
uploaded.SetContentString(output.getvalue())
uploaded.Upload()
print('Uploaded file with ID {}'.format(uploaded.get('id')))
```

Visit your Google Drive, find the file named "jobs.csv", and verify that it contains the list of jobs that were scraped from the Stack Overflow jobs site.
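If you read the file back later, keep in mind that the `tags` column holds the text form of a Python list, because `DictWriter` converts each list to a string. The sketch below recovers the lists with `ast.literal_eval`, using a small in-memory CSV built the same way as above. The sample record is made up.

```python
import ast
import csv
import io

# build a small CSV the same way the notebook does (made-up record)
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=['title', 'company', 'tags'])
writer.writeheader()
writer.writerow({'title': 'Data Manager', 'company': 'Example Co', 'tags': ['web-services', 'python']})

# read it back; each tags value is a string such as "['web-services', 'python']"
reader = csv.DictReader(io.StringIO(output.getvalue()))
restored = []
for row in reader:
  row['tags'] = ast.literal_eval(row['tags'])  # turn the string back into a list
  restored.append(row)

print(restored[0]['tags'])  # ['web-services', 'python']
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text that came from a file.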