## The Scraping Process
To parse websites for data, we need to perform a few steps:

1. Identify the website where the data is located
2. Verify that scraping the data does not violate the site's policies
3. Manually examine the website with Chrome Developer Tools
   - learn the HTML structure of the site
   - learn how the site requests the data, that is, how the URL is structured to get the data
4. Use the `requests` library to fetch the HTML code of the website
5. Use `BeautifulSoup` to parse the HTML and extract the content
6. Follow links to get more data
7. Store the data in a file for future analysis
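
Step 2 can be partly automated. The sketch below uses Python's standard `urllib.robotparser` module to test whether a path is allowed by a set of robots.txt rules. The rules, the `example.com` URLs, and the `lesson-scraper` agent name are made up for illustration. A real check would load the site's own robots.txt (for example with `set_url()` and `read()`), and robots.txt does not replace reading the site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the example runs offline.
sample_rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# 'lesson-scraper' is a made-up user agent name.
allowed_public = parser.can_fetch('lesson-scraper', 'https://example.com/pages/jokes.php')
allowed_private = parser.can_fetch('lesson-scraper', 'https://example.com/private/data.html')
print(allowed_public, allowed_private)
```

If `can_fetch()` returns `False` for the page you want, that is a strong signal to stop and look for an official API or ask the site owner.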

In this lesson we will look at two examples. First, we will scrape a list of jokes from this page: http://pun.me/pages/funny-jokes.php. Then we will scrape a list of questions from [StackOverflow](https://stackoverflow.com/questions).

## Jokes

For the first example, let us fetch the HTML code for the web page found at http://pun.me/pages/funny-jokes.php. There is a list of 50 jokes on this page. Open the URL in your Chrome web browser then open the Chrome Developer Tools. To open the developer tools, click on the more options icon in the top right hand corner of your Chrome window, then click More Tools then Developer Tools.

![Open Dev Tools](https://farm8.staticflickr.com/7874/32380905107_f0c79fa663_z.jpg)

In the developer tools window, select the Elements tab. In this tab the HTML code for the web page is displayed. As you move your mouse pointer over the code, the relevant parts of the webpage are highlighted. See if you can find the HTML code associated with the jokes themselves.

![HTML Code](https://farm8.staticflickr.com/7881/32380910067_f7eb8b9927_o.png)


There is a lot going on in the HTML code, but as you drill down into the code you will find the section of code that starts like this:

```
    <div style="float:left;width:100%;">
        <div class="content">
            <p>
                Our most-liked jokes which are genuinely funny - this list of jokes has been hand selected and contain a variety of clever, clean and silly jokes so be prepared to laugh.
            </p>
            <ol>
             <li>Today at the bank, an old lady asked me to help check her balance. So I pushed her over.</li>
            <li>I bought some shoes from a drug dealer. I don't know what he laced them with, but I've been tripping all day.</li>
    ...
```


As you move your mouse pointer over that section of code, the list of jokes itself is highlighted. In order to be able to parse the HTML code to extract the data, you will need to understand a little about the structure of HTML. If you are already familiar with HTML then you are all set, but if you are new to the language then there are a few things to be aware of.

HTML is made up of elements. An element is constructed from tags. A tag looks like this: `<tagname>`. In the code above, you can see that there are `<div>` tags, `<p>` tags, an `<ol>` tag and some `<li>` tags. Most tags have a corresponding close tag, which looks just like the open tag but with a slash before the name. So the `<li>` tag has a corresponding close `</li>` tag. A few elements, such as `<img>` and `<br>`, have no content and therefore no close tag.

An `li` element is made up of the open and close `li` tags with some content between them like this:

```
    <li>Today at the bank, an old lady asked me to help check her balance. So I pushed her over.</li>
```

An element tells the browser what to display and what kind of content it is. For instance, `<p>` is for a paragraph of text, `<ol>` is for an ordered (numbered) list, and `<li>` is a list item. Notice how the `<ol>` element contains a number of `<li>` elements. Notice also that each `<li>` element corresponds to a single numbered item in the list.
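
To make the tag and element vocabulary concrete, here is a small sketch that uses the `HTMLParser` class from Python's standard library to report each open tag, close tag, and piece of text in a shortened list. The `TagLogger` class and the sample markup are illustrative only; the rest of this lesson uses BeautifulSoup instead.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record every open tag, close tag, and non-blank text run."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('open', tag))

    def handle_endtag(self, tag):
        self.events.append(('close', tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(('text', data.strip()))

# Made-up markup with the same shape as the list of jokes.
sample_html = "<ol><li>First joke</li><li>Second joke</li></ol>"
logger = TagLogger()
logger.feed(sample_html)
logger.close()
for event in logger.events:
    print(event)
```

Each `<li>` element shows up as an open tag, its text, and a close tag, all nested between the open and close tags of the `<ol>`.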

There are a great many other elements on the page but they are for the headings and menus and navigations that we are not interested in right now. You can typically use the Chrome Developer tools to drill down to the most relevant elements on the page.

This example is simple enough that the URL does not need any special attention.

### Fetch the Code
The next step is to fetch the HTML code from the website using the Python `requests` library. This is the same library that you used to request data from an API earlier, except in this case we are not fetching JSON data. Instead we can use the `text` property to get the HTML code.

In [3]:
import requests

url = 'http://pun.me/pages/funny-jokes.php'
response = requests.get(url)

# make sure we got a valid response
if response.ok:
  # get the full HTML of the page as a string
  data = response.text
  print(data)

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
	  <meta name="description" content="Laugh out loud with our list of our genuinely funny jokes, our hand-picked list contains a variety of hilarious jokes to make you chuckle.">
    <title>List of the 50 Funniest Jokes to make you laugh out loud | Pun.me</title>
    <meta property="og:image" content="https://pun.me/pages/funny-jokes.jpg" />
    <link href="css-2020.css" rel="stylesheet">
	<link rel="shortcut icon" href="https://pun.me/favicon.ico" />
  </head>
  <body>
	<div id="fb-root"></div>
<script>(function(d, s, id) {
  var js, fjs = d.getElementsByTagName(s)[0];
  if (d.getElementById(id)) return;
  js = d.createElement(s); js.id = id;
  js.src = "//connect.facebook.net/en_GB/sdk.js#xfbml=1&version=v2.5&appId=204272616306296";
  fjs.parentNode.insertBefore(js, fjs);
}(document, 'scr

## Beautiful Soup
Parsing the HTML code that makes up a web page can be quite difficult, especially as there is no guarantee that the code is well formed or consistent with any standard. Web pages are notoriously broken. Your web browser does a heroic job of rendering web pages even when they are broken, so a page that looks fine in the browser may still have faulty code underneath. In addition, building a polite scraper as defined above adds even more complexity to your code. There are several good libraries that can help you navigate through the HTML code. We will use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) to simplify the parsing.


To use BeautifulSoup we will:
  - import the library
  - create an object with the response text

BeautifulSoup supports several different HTML parsers. That is, there are different ways that a program may read and understand an HTML page. When creating the object out of the HTML, we need to tell BeautifulSoup which parser to use. We will use `html.parser`, the parser that ships with Python, to avoid having to install additional dependencies.

In [4]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

The variable `soup` in this code now contains an object of type *BeautifulSoup*. This object represents the entire HTML document. We can see what it contains with the `prettify()` method.

In [5]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="Laugh out loud with our list of our genuinely funny jokes, our hand-picked list contains a variety of hilarious jokes to make you chuckle." name="description"/>
  <title>
   List of the 50 Funniest Jokes to make you laugh out loud | Pun.me
  </title>
  <meta content="https://pun.me/pages/funny-jokes.jpg" property="og:image">
   <link href="css-2020.css" rel="stylesheet"/>
   <link href="https://pun.me/favicon.ico" rel="shortcut icon">
   </link>
  </meta>
 </head>
 <body>
  <div id="fb-root">
  </div>
  <script>
   (function(d, s, id) {
  var js, fjs = d.getElementsByTagName(s)[0];
  if (d.getElementById(id)) return;
  js = d.createElement(s); js.id = id;
  js.src = "//connect.facebook.net/en_GB/sdk.js#xfbml=1&version=v2.5&appId=204272616306296";
  fjs.parentNode.insertBefore(
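
Because real pages are often malformed, it helps to see how the parser copes. The sketch below feeds a made-up fragment with no close tags at all to BeautifulSoup using `html.parser`. The fragment and variable names are illustrative, and other parsers may repair the same input differently.

```python
from bs4 import BeautifulSoup

# Made-up fragment: none of the tags are closed.
broken_html = '<div class="content"><p>An unclosed paragraph with <b>bold text'
sample_soup = BeautifulSoup(broken_html, 'html.parser')

# The parse tree is still navigable even though no tag was closed.
bold_text = sample_soup.find('b').get_text()
paragraph_text = sample_soup.find('p').get_text()
print(bold_text)
print(paragraph_text)

# prettify() shows the close tags that were added during parsing.
print(sample_soup.prettify())
```

The missing close tags are filled in when the tree is built, which is why the `find()` and `get_text()` calls still work.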

## Navigating the Document
Now that we have the document parsed, we need to be able to target just the elements of the document that we need. The page contains many different elements, most of which we can ignore. Let us try to find the ones we want. Go back to Chrome Developer Tools and find the list of jokes in the HTML. You will notice that each joke is in an `<li>` element and that all the `<li>` elements are together inside an `<ol>` element. Let's try to access that element.

To get a list of a certain type of element we can use the `find_all()` method. This is probably the most common method for navigating through the document looking for specific tags.

In [6]:
soup.find_all('ol') # make a list of all ol elements

[<ol>
 <img alt="Funny joke about an old lady" src="funny-old-lady-joke.jpg" style="max-width:100%;"/><br/>
 <li>Today at the bank, an old lady asked me to help check her balance. So I pushed her over.</li>
 <li>I bought some shoes from a drug dealer. I don't know what he laced them with, but I've been tripping all day.</li>
 <li>I told my girlfriend she drew her eyebrows too high. She seemed surprised.</li>
 <li>My dog used to chase people on a bike a lot. It got so bad, finally I had to take his bike away.</li>
 <li>I'm so good at sleeping. I can do it with my eyes closed.</li>
 <img alt="Joke about going home from work" src="funny-joke-about-work.jpg" style="max-width:100%;"/><br/>
 <li>My boss told me to have a good day.. so I went home.</li>
 <li>Why is Peter Pan always flying? He neverlands.</li>
 <li>A woman walks into a library and asked if they had any books about paranoia. The librarian says "They're right behind you!"</li>
 <li>The other day, my wife asked me to pass her lip

That did the trick. When we are only expecting a single element, we can use the `find()` method instead. Now that we have the list, let's try to get only the `<li>` elements from the list.

In [7]:
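# get the list, then the items inside it
# (named joke_list so that Python's built-in list type is not shadowed)
joke_list = soup.find('ol')
items = joke_list.find_all('li')
print(items)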

[<li>Today at the bank, an old lady asked me to help check her balance. So I pushed her over.</li>, <li>I bought some shoes from a drug dealer. I don't know what he laced them with, but I've been tripping all day.</li>, <li>I told my girlfriend she drew her eyebrows too high. She seemed surprised.</li>, <li>My dog used to chase people on a bike a lot. It got so bad, finally I had to take his bike away.</li>, <li>I'm so good at sleeping. I can do it with my eyes closed.</li>, <li>My boss told me to have a good day.. so I went home.</li>, <li>Why is Peter Pan always flying? He neverlands.</li>, <li>A woman walks into a library and asked if they had any books about paranoia. The librarian says "They're right behind you!"</li>, <li>The other day, my wife asked me to pass her lipstick but I accidentally passed her a glue stick. She still isn't talking to me.</li>, <li>Why do blind people hate skydiving? It scares the hell out of their dogs.</li>, <li>When you look really closely, all mirror

That produces a list of the `<li>` tags. Finally, we want to extract just the text from within those tags.

In [0]:
jokes = [joke.get_text() for joke in items]
print(jokes)

['Today at the bank, an old lady asked me to help check her balance. So I pushed her over.', "I bought some shoes from a drug dealer. I don't know what he laced them with, but I've been tripping all day.", 'I told my girlfriend she drew her eyebrows too high. She seemed surprised.', 'My dog used to chase people on a bike a lot. It got so bad, finally I had to take his bike away.', "I'm so good at sleeping. I can do it with my eyes closed.", 'My boss told me to have a good day.. so I went home.', 'Why is Peter Pan always flying? He neverlands.', 'A woman walks into a library and asked if they had any books about paranoia. The librarian says "They\'re right behind you!"', "The other day, my wife asked me to pass her lipstick but I accidentally passed her a glue stick. She still isn't talking to me.", 'Why do blind people hate skydiving? It scares the hell out of their dogs.', 'When you look really closely, all mirrors look like eyeballs.', 'My friend says to me: "What rhymes with orange"

That did the trick. One problem that we may face is multiple `<ol>` elements on the page. In that case, we would want to be as specific as possible. In HTML, each element can be given a *class* attribute and an *id* attribute. More than one element may have the same class, but ids are supposed to be unique.

A quick examination of the HTML shows that the `<ol>` element has neither a class nor an id. However, the `<div>` element that encloses it has the class *content*. We could select the `<div>` with class *content* then select the `<ol>` from within. Remember that even though this step isn't really necessary for this particular page, we want to make a robust scraper that will work even if the website owner adds more lists to the same page.

As a quick illustration of this, let's do exactly what was done above but with a different route to find the jokes.

In [0]:
# this gets a list of all divs on the page
# soup.find_all('div')

# get the first div with class *content*
div = soup.find('div', class_='content')
joke_list = div.find('ol')
items = joke_list.find_all('li')
jokes = [joke.get_text() for joke in items]
print(jokes)

['Today at the bank, an old lady asked me to help check her balance. So I pushed her over.', "I bought some shoes from a drug dealer. I don't know what he laced them with, but I've been tripping all day.", 'I told my girlfriend she drew her eyebrows too high. She seemed surprised.', 'My dog used to chase people on a bike a lot. It got so bad, finally I had to take his bike away.', "I'm so good at sleeping. I can do it with my eyes closed.", 'My boss told me to have a good day.. so I went home.', 'Why is Peter Pan always flying? He neverlands.', 'A woman walks into a library and asked if they had any books about paranoia. The librarian says "They\'re right behind you!"', "The other day, my wife asked me to pass her lipstick but I accidentally passed her a glue stick. She still isn't talking to me.", 'Why do blind people hate skydiving? It scares the hell out of their dogs.', 'When you look really closely, all mirrors look like eyeballs.', 'My friend says to me: "What rhymes with orange"
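
The page used above has only one list, so the benefit of scoping the search is easy to miss. The sketch below uses made-up HTML with two `<ol>` elements to compare `find('ol')` on the whole document with the same call on the div with class `content`. The markup and variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up page with two ordered lists; only the second holds the data we want.
sample_html = """
<div class="menu"><ol><li>Home</li><li>About</li></ol></div>
<div class="content"><ol><li>Joke one</li><li>Joke two</li></ol></div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first match in document order, here the menu list.
first_list_items = [li.get_text() for li in sample_soup.find('ol').find_all('li')]

# Scoping the search to the div with class "content" picks the right list.
content_div = sample_soup.find('div', class_='content')
scoped_items = [li.get_text() for li in content_div.find('ol').find_all('li')]

# find() returns None when nothing matches, so check before chaining calls.
missing = sample_soup.find('table')

print(first_list_items)
print(scoped_items)
print(missing)
```

Because `find()` returns `None` for a missing element, chaining `.find_all()` directly onto it would raise an `AttributeError` if the page layout changes.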

## Programming Questions

Suppose that you were tasked with creating a presentation of current trends in programming questions being asked on Stack Overflow. Which technologies are developers asking the most questions about? Which technologies get the most attention? You can get some of that information from the Stack Overflow website.

This website consists of questions and answers by software developers and people in the software development community. On the page there are a number of questions listed, along with the ads and other information. But this site uses pagination to display a long list of questions one page at a time. To see the second page you can click the Next button at the bottom of the list. To scrape useful information from this website we will need to follow that link.

Let’s use the Chrome Developer Tools to examine the page a bit.

![Questions](https://live.staticflickr.com/65535/49505599618_2c40361f2f_c_d.jpg)

Drilling into the page structure, you will find that there is a `<div>` element with id `questions` that contains all the questions displayed on the page. Each question listing has an HTML structure similar to:

```html

<div class="question-summary" id="question-summary-60128236">
    <div class="statscontainer">
        <div class="stats">
            <div class="vote">
                <div class="votes">
                    <span class="vote-count-post "><strong>0</strong></span>
                    <div class="viewcount">votes</div>
                </div>
            </div>
            <div class="status unanswered">
                <strong>0</strong>answers
            </div>
        </div>
        <div class="views " title="2 views">
    2 views
</div>
    </div>
    <div class="summary">
        <h3><a href="/questions/60128236/semantic-segmentation-accuracy-starting-close-to-zero" class="question-hyperlink">Semantic Segmentation accuracy starting close to zero</a></h3>
        <div class="excerpt">
            I'm trying to build a machine learning semantic classification algorithm for the CamVID dataset using a U-net approach in Keras. 

I've used the following tutorial and tried to apply it to the ...
        </div>
        <div class="tags t-machine-learning t-keras">
            <a href="/questions/tagged/machine-learning" class="post-tag" title="show questions tagged 'machine-learning'" rel="tag">machine-learning</a> <a href="/questions/tagged/keras" class="post-tag" title="show questions tagged 'keras'" rel="tag">keras</a> 
        </div>
        <div class="started fr">
            <div class="user-info ">
    <div class="user-action-time">
        asked <span title="2020-02-08 15:27:35Z" class="relativetime">13 mins ago</span>
    </div>
    <div class="user-gravatar32">
        <a href="/users/3429605/rutger-hofste"><div class="gravatar-wrapper-32"><img src="https://www.gravatar.com/avatar/4cab1e2f110ecff055651ca7a94940c1?s=32&amp;d=identicon&amp;r=PG&amp;f=1" alt="" width="32" height="32" class="bar-sm"></div></a>
    </div>
    <div class="user-details">
        <a href="/users/3429605/rutger-hofste">Rutger Hofste</a>
        <div class="-flair">
            <span class="reputation-score" title="reputation score " dir="ltr">2,071</span><span title="1 gold badge" aria-hidden="true"><span class="badge1"></span><span class="badgecount">1</span></span><span class="v-visible-sr">1 gold badge</span><span title="15 silver badges" aria-hidden="true"><span class="badge2"></span><span class="badgecount">15</span></span><span class="v-visible-sr">15 silver badges</span><span title="22 bronze badges" aria-hidden="true"><span class="badge3"></span><span class="badgecount">22</span></span><span class="v-visible-sr">22 bronze badges</span>
        </div>
    </div>
</div>
        </div>
    </div>
</div>
```

So there is still a lot of extraneous information in there. We may be interested in the question title, the excerpt, and the list of tags. Let's use BeautifulSoup to try to select the parts that we need.

In [9]:
url = 'https://stackoverflow.com/questions'
response = requests.get(url)

# make sure we got a valid response
if response.ok:
  # get the full data from the response
  data = response.text
  soup = BeautifulSoup(data, 'html.parser')

  # find all elements with class *question-summary*
  summary = soup.find_all(class_='question-summary')
  print(summary)

[<div class="question-summary" id="question-summary-62239235">
<div class="statscontainer">
<div class="stats">
<div class="vote">
<div class="votes">
<span class="vote-count-post"><strong>0</strong></span>
<div class="viewcount">votes</div>
</div>
</div>
<div class="status unanswered">
<strong>0</strong>answers
            </div>
</div>
<div class="views" title="2 views">
    2 views
</div>
</div>
<div class="summary">
<h3><a class="question-hyperlink" href="/questions/62239235/binary-file-conversion-in-distributed-manner-spark-flume-or-any-other-option">Binary file conversion in distributed manner - Spark Flume ? or any other option</a></h3>
<div class="excerpt">
            We have a scenario where there will be a continuous incoming set of binary files (ASN.1 type to be exact). We want to convert these binary files to different format say XML or JSON and write to ...
        </div>
<div class="tags t-apache-spark t-distributed t-flume t-asnû1">
<a class="post-tag" href="/quest

Not bad. That did give us a list of the divs that contain the question information. However, there is still a lot of extraneous HTML. We may not be interested in the number of votes, for instance. We could narrow it down further by just getting the summaries. Notice that within each `question-summary` there is another div with the class `summary`. That inner div contains all the information that we are interested in collecting.

In [0]:
summaries = soup.find_all(class_='summary')
print(summaries)

Much better, but the actual title text that we want is in an `<a>` tag inside an `<h3>` tag. We can continue navigating down the document or we can use another technique: [**CSS Selectors**](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors). CSS is the language used to apply styles to web pages. CSS selectors pick out specific elements on the page so that styles may be applied to just those elements. The selector syntax is expressive enough that BeautifulSoup supports it as a way of finding elements too, through the `select()` method.

To select all elements with a particular class, use the `.classname` selector.

In [10]:
summaries = soup.select('.summary')
print(summaries)

[<div class="summary">
<h3><a class="question-hyperlink" href="/questions/62239235/binary-file-conversion-in-distributed-manner-spark-flume-or-any-other-option">Binary file conversion in distributed manner - Spark Flume ? or any other option</a></h3>
<div class="excerpt">
            We have a scenario where there will be a continuous incoming set of binary files (ASN.1 type to be exact). We want to convert these binary files to different format say XML or JSON and write to ...
        </div>
<div class="tags t-apache-spark t-distributed t-flume t-asnû1">
<a class="post-tag" href="/questions/tagged/apache-spark" rel="tag" title="show questions tagged 'apache-spark'">apache-spark</a> <a class="post-tag" href="/questions/tagged/distributed" rel="tag" title="show questions tagged 'distributed'">distributed</a> <a class="post-tag" href="/questions/tagged/flume" rel="tag" title="show questions tagged 'flume'">flume</a> <a class="post-tag" href="/questions/tagged/asn.1" rel="tag" title="sh

This gives the same output as above. So let's extend this a bit. To select only the `<h3>` elements in the divs with class *summary*, we can use the **child** selector. The `>` symbol is used to specify a child element. A child element is an element contained directly within another element, one level down.

In [11]:
title_headings = soup.select('.summary > h3')
print(title_headings)

[<h3><a class="question-hyperlink" href="/questions/62239235/binary-file-conversion-in-distributed-manner-spark-flume-or-any-other-option">Binary file conversion in distributed manner - Spark Flume ? or any other option</a></h3>, <h3><a class="question-hyperlink" href="/questions/62239233/qualtrics-count-the-number-of-times-a-survey-already-completed-message-appears">Qualtrics count the number of times a Survey already completed message appears</a></h3>, <h3><a class="question-hyperlink" href="/questions/62239232/alter-how-arguments-are-processed-before-theyre-passed-to-sub-main">Alter how arguments are processed before they're passed to sub MAIN</a></h3>, <h3><a class="question-hyperlink" href="/questions/62239230/call-a-static-method-of-a-static-class-using-reflection">Call a static method of a static class using reflection</a></h3>, <h3><a class="question-hyperlink" href="/questions/62239228/writing-unit-test-for-my-class-based-views-which-also-require-loginrequiredmixin">Writing Un

Some progress. How about selecting only the `<a>` tags found inside the `<h3>` tags?

In [12]:
title_links = soup.select('.summary > h3 > a')
print(title_links)

[<a class="question-hyperlink" href="/questions/62239235/binary-file-conversion-in-distributed-manner-spark-flume-or-any-other-option">Binary file conversion in distributed manner - Spark Flume ? or any other option</a>, <a class="question-hyperlink" href="/questions/62239233/qualtrics-count-the-number-of-times-a-survey-already-completed-message-appears">Qualtrics count the number of times a Survey already completed message appears</a>, <a class="question-hyperlink" href="/questions/62239232/alter-how-arguments-are-processed-before-theyre-passed-to-sub-main">Alter how arguments are processed before they're passed to sub MAIN</a>, <a class="question-hyperlink" href="/questions/62239230/call-a-static-method-of-a-static-class-using-reflection">Call a static method of a static class using reflection</a>, <a class="question-hyperlink" href="/questions/62239228/writing-unit-test-for-my-class-based-views-which-also-require-loginrequiredmixin">Writing Unit Test for my Class based views which a

We can use `get_text()` as we did above to extract the text content of the links.

In [13]:
questions = [question.get_text() for question in title_links]
print(questions)

['Binary file conversion in distributed manner - Spark Flume ? or any other option', 'Qualtrics count the number of times a Survey already completed message appears', "Alter how arguments are processed before they're passed to sub MAIN", 'Call a static method of a static class using reflection', 'Writing Unit Test for my Class based views which also require LoginRequiredMixin', 'Named semaphores instead of mutex - readers writers problem without multithreading', 'Load GeoJSON file into redshift using copy command', 'Java - Allow java method arguments to accept constructor paramaters', 'Sending mkv file over udpsink gstreamer', 'Redirect Flask [python]', 'Scrapy - how to save the file generated via POST submission', 'How to calculate subtotals, line by line of a Data Frame in Python?', 'Get AWS redshift global session id within Airflow DAG run', 'Applying K nearest neighbors algorithm python 3.8 causing issue with method train', 'how do i fix problem for adding an Image on xml file for 

Now, we have a list of the question titles found on the page.
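
The child selector used above only matches elements one level down. As a side-by-side comparison, the sketch below runs a child selector and a descendant selector (a space instead of `>`) against a made-up snippet shaped like a question summary. The markup and variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up snippet with the same shape as a question summary.
sample_html = """
<div class="summary">
  <h3><a href="/questions/1">Title link</a></h3>
  <div class="tags"><a href="/questions/tagged/python">python</a></div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# '>' matches direct children only: no <a> sits directly inside .summary.
direct_links = sample_soup.select('.summary > a')

# A space matches descendants at any depth: both links are found.
all_links = [a.get_text() for a in sample_soup.select('.summary a')]

# Chained child selectors walk down one level at a time.
heading_links = [a.get_text() for a in sample_soup.select('.summary > h3 > a')]

print(direct_links)
print(all_links)
print(heading_links)
```

The descendant form is more forgiving when a site adds wrapper elements, while the child form is more precise when several links live inside the same container.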

How would we get a list of the excerpts for each question on this page? Take a few minutes and try to construct the CSS selector expression that will return this information.

The HTML shows that the excerpt sits inside the summary in a `<div>` with class `excerpt`. Selecting that `<div>` and reading its text gives the excerpt.

In [14]:
excerpt_divs = soup.select('.summary > .excerpt')
excerpts = [excerpt.get_text() for excerpt in excerpt_divs]
print(excerpts)

['\r\n            We have a scenario where there will be a continuous incoming set of binary files (ASN.1 type to be exact). We want to convert these binary files to different format say XML or JSON and write to ...\r\n        ', '\r\n            In Qualtrics, a survey link is usually set to be used once.  If the recipient clicks on the same link they will get a message saying the link has expired.  In the survey termination options, I can ...\r\n        ', "\r\n            Given the documentation and the comments on an earlier question, by request I've made a minimal reproducible example that demonstrates a difference between these two statements:\n\nmy %*SUB-MAIN-OPTS = :...\r\n        ", '\r\n            i trying to call a static method like this :\nCall MyMethod :\n\nAssembly myAssembly = Assembly.LoadFrom(filePath);\nType Mytype = myAssembly.GetType("MyClass");\nstring returnedValue = Mytype.GetMethod("...\r\n        ', '\r\n            I have one ListView and one DetailView and b

Getting the tags of each question is similar. The tags are in a `div` with class `tags`. Each tag is in an `a` tag.
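
As a minimal sketch of that idea, the snippet below pulls the tag names out of a trimmed copy of the question markup shown earlier. The trimmed HTML and the variable names are for illustration only; on the live page the same selector would be run against the `soup` object.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the tags markup from a question summary.
sample_html = """
<div class="summary">
  <div class="tags t-machine-learning t-keras">
    <a href="/questions/tagged/machine-learning" class="post-tag" rel="tag">machine-learning</a>
    <a href="/questions/tagged/keras" class="post-tag" rel="tag">keras</a>
  </div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Select every <a> nested inside the div with class "tags".
tag_links = sample_soup.select('.summary > .tags a')
tags = [link.get_text() for link in tag_links]
print(tags)
```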


Let's try to put together some code that will get all the information about the question that we need and put it into a dictionary object.

In [15]:
# iterate over all questions
raw_questions = soup.select('.summary')
questions = []
for question in raw_questions:
  title = question.select_one('h3 > a').get_text() # extract the title
  excerpt = question.select_one('.excerpt').get_text().strip() # extract the excerpt
  tags = [tag.get_text() for tag in question.select('.tags a')] # extract a list of tags
  new_question = {'title': title, 'excerpt': excerpt, 'tags': tags} # construct a dictionary
  questions.append(new_question) # add dictionary to list
print(questions)  

[{'title': 'Binary file conversion in distributed manner - Spark Flume ? or any other option', 'excerpt': 'We have a scenario where there will be a continuous incoming set of binary files (ASN.1 type to be exact). We want to convert these binary files to different format say XML or JSON and write to ...', 'tags': ['apache-spark', 'distributed', 'flume', 'asn.1']}, {'title': 'Qualtrics count the number of times a Survey already completed message appears', 'excerpt': 'In Qualtrics, a survey link is usually set to be used once.  If the recipient clicks on the same link they will get a message saying the link has expired.  In the survey termination options, I can ...', 'tags': ['javascript', 'html', 'qualtrics']}, {'title': "Alter how arguments are processed before they're passed to sub MAIN", 'excerpt': "Given the documentation and the comments on an earlier question, by request I've made a minimal reproducible example that demonstrates a difference between these two statements:\n\nmy %*S

## Following a link
Now that we can scrape the list of questions from the first page, what about the other pages? Let's look at the HTML code for the next button.

```html
    <a 
       href="/questions?tab=newest&amp;page=2" 
       rel="next" 
       title="go to page 2">
          <span class="page-numbers next"> next</span> 
    </a>
```


The next button is made up of a clickable link styled to look like a button. The `href` attribute of the link contains the URL of the next page. In the HTML source the value is written as `/questions?tab=newest&amp;page=2`, where `&amp;` is the HTML encoding of `&`, so the link actually points to `/questions?tab=newest&page=2`. That is only part of a URL. A full URL is made up of a scheme (such as `https`), a domain name, a path, and an optional query string. Since we are already on `https://stackoverflow.com/questions`, the browser knows that clicking this link means staying on the same website and visiting a different path. So the actual request will be sent to `https://stackoverflow.com/questions?tab=newest&page=2`. The bit after the `?` is called the query string, and it is one way to send data to the server describing what you are requesting. It appears that this page accepts a parameter named `page` representing the page number. If you click on the next button several times you will notice that the bit at the end of the URL that says `page=2` changes to `page=3` then `page=4` and so on.

That means that we can modify the URL by incrementing the page number and request more questions. We can continue to do so until there are no more pages. 
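
The way the relative `href` turns into a full URL can be reproduced with Python's standard library. In this sketch, `html.unescape()` decodes the `&amp;` entity the way an HTML parser would, `urljoin()` resolves the relative path against the address of the current page, and `parse_qs()` splits the query string into its parameters. The variable names are illustrative.

```python
from html import unescape
from urllib.parse import urljoin, urlsplit, parse_qs

page_url = 'https://stackoverflow.com/questions'
raw_href = '/questions?tab=newest&amp;page=2'  # as written in the HTML source

href = unescape(raw_href)            # '&amp;' becomes '&'
next_url = urljoin(page_url, href)   # resolve against the current page
query = parse_qs(urlsplit(next_url).query)

print(next_url)
print(query)
```

BeautifulSoup already decodes entities in attribute values, so in practice only the `urljoin()` step is needed when following an `href` taken from a parsed page.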

To set the query string on the request URL we can use the `params` argument of `requests.get()`. For example,

```python
page_number = 1
query = {'tab':'newest', 'page': page_number}
url = 'https://stackoverflow.com/questions'
response = requests.get(url, params=query)

```

We can then update the `page_number` variable and repeat this call.
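
Repeating the call for several pages can be previewed without sending anything to the server. The sketch below builds a `requests.Request`, prepares it, and reads the resulting `url` attribute to show what address each page number produces. Preparing a request this way is only for inspection; the actual scraper would call `requests.get()` as shown above.

```python
import requests

url = 'https://stackoverflow.com/questions'
prepared_urls = []
for page_number in range(1, 4):
    query = {'tab': 'newest', 'page': page_number}
    # Build the request but do not send it; .url shows the final address.
    prepared = requests.Request('GET', url, params=query).prepare()
    prepared_urls.append(prepared.url)

for prepared_url in prepared_urls:
    print(prepared_url)
```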

This raises a few questions:

 - how do we know when there are no more pages?
 - if we rapidly sent many requests for many pages to the server would we overwhelm the server and possibly get ourselves blocked?
 
There may be several answers to the first question. If you go to the website and visit the last page of results you will see that the next button is not displayed on that page. So the absence of the next button could be the condition on which we stop the processing. There may be others that you can readily spot but we will go with that option for now. The link itself has no class or id that we could use to select it. However, it does have an attribute `rel` with value `next`. It is possible to select an element by the value of an attribute like this:

```python
print(soup.select('a[rel="next"]'))
```
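
To see how this selector behaves as a stopping condition, the sketch below runs it against two made-up fragments, one with a next link and one without. `select_one()` returns `None` when nothing matches. The fragments and the `next_page_href` helper are hypothetical and only serve to illustrate the check.

```python
from bs4 import BeautifulSoup

# Made-up pagination fragments for a middle page and for the last page.
middle_page = '<a href="/questions?page=3" rel="next">next</a>'
last_page = '<a href="/questions?page=1" rel="prev">prev</a>'

def next_page_href(html):
    """Return the href of the rel="next" link, or None if there is none."""
    link = BeautifulSoup(html, 'html.parser').select_one('a[rel="next"]')
    return link['href'] if link is not None else None

print(next_page_href(middle_page))
print(next_page_href(last_page))
```

A `None` result is the signal to stop requesting further pages.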

The second question does not depend on the HTML structure of the page, but on how we write the scraper. The code above processes one page of questions and creates a list of dictionaries from that page. To repeat that we need a loop that iterates over the pages and, for each page, performs something similar to the code that we already wrote. When we get to the last page we stop the loop, and at that point we will have a list of all the questions on the website. The problem is that the computer processes each page so quickly that requests for the next page arrive far faster than they would from a human browsing the website. To the server, a burst of rapid requests can look like a denial of service attack, and since most web servers can identify such traffic and take steps to protect themselves, we want to avoid such behaviour.

One solution is to deliberately slow down the rate of requests that we make. We can, for instance, wait a few seconds before making another request. How long to wait depends on the site. If we wait 1 second between requests and need to request 500 pages, then our program takes at least 500 seconds to run. That is not so bad, since scraping the data is hopefully a one-time affair. There are some libraries that help to moderate your request rate if you are building a more sophisticated scraper. For our simple scraper we can use the `sleep()` function from Python's `time` module to pause execution for a short time.
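A minimal sketch of that idea follows, using only the standard library. The generator name `throttled` is hypothetical, and the delay is kept tiny so the example finishes quickly; a real scraper would wait a second or more between requests.

```python
from time import monotonic, sleep


def throttled(items, delay_seconds):
    # yield each item in turn, pausing between consecutive items
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)
        yield item


start = monotonic()
pages = list(throttled([1, 2, 3], delay_seconds=0.05))
elapsed = monotonic() - start

print(pages)
print('elapsed: {:.2f} s'.format(elapsed))
```

Wrapping the pause in a generator keeps the waiting logic in one place, so the code that handles each page does not need to know about it.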

But let's add one more artificial restriction. There are over 1.2 million pages of questions on the website (at the time of writing, so there will be many more by the time you read this). To avoid making over a million requests and having our program take days to run, we can deliberately only request the first ten pages. Once we have a working program we can always modify the number of pages that we request.

### Some other considerations
Before we jump into the code, there are a few other considerations to keep in mind since we are requesting multiple pages. First, we should clearly identify ourselves to the web server. When your browser makes a request to a web server, it sends along a header named **User-Agent** with a value that identifies the browser itself. For instance, in Chrome the value looks like this:

```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36
```

And in Firefox the value looks like this:

```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0) Gecko/20100101 Firefox/63.0
```

This way, the web server can track how many visitors use a particular browser. We can set a value that identifies us and at least provide a contact. If the website owners wish, they may contact us to get some clarity on why we are scraping data from their website and maybe negotiate a better experience for themselves as well as us. We could provide a value like:

```
'questionscraper - school project (yourname@gmail.com)'
```

For example, when making the request, we could set the header like this:

```python
url = '...'
headers = {'user-agent': 'questionscraper - school project (yourname@gmail.com)'}
response = requests.get(url, headers=headers)
```

Another consideration is monitoring the program as it runs. We are potentially creating a program that may take 10, 15 or 20 minutes, or even longer, to run. It is important to "see" what is happening as the program runs. This helps us to follow the progress, ensure the program isn't hung up, and detect problems early. We could simply print the number of requests that have been made so far, the frequency of the requests, the number of items found so far, and any errors encountered.
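One way to keep such a status line readable is to format it in a small helper. This is only a sketch: the function name `progress_line`, the rounding to two decimals, and the sample numbers are choices made for this example.

```python
def progress_line(request_count, elapsed_seconds, item_count):
    # guard against dividing by zero right at the start of a run
    rate = request_count / elapsed_seconds if elapsed_seconds > 0 else 0.0
    return 'Requests: {}, Frequency: {:.2f} requests/s, {} questions processed.'.format(
        request_count, rate, item_count)


# made-up numbers: 10 requests in 20 seconds, 150 questions collected
print(progress_line(10, 20.0, 150))
```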

Putting all of this together can be a bit daunting, but careful step by step thought will make it possible. A complete working program is given below. Before you run it, ensure you set your email address in the header. Also, a variable named `MAX_REQUESTS` was created to set an upper limit on the number of requests made. You may adjust this value as you wish.

In [0]:
# import all dependencies at the top
from time import time
from time import sleep
from random import randint
from IPython.display import clear_output
from warnings import warn
from bs4 import BeautifulSoup
import requests


# define a function to process the page
def process_page(soup, questions):  
  
  # find all elements with class *summary*
  raw_questions = soup.select('.summary')

  # same as above, extract the info we need
  for question in raw_questions:
    title = question.select_one('h3 > a').get_text() # extract the title
    excerpt = question.select_one('.excerpt').get_text().strip() # extract the excerpt
    tags = [tag.get_text() for tag in question.select('.tags a')] # extract a list of tags
    new_question = {'title': title, 'excerpt': excerpt, 'tags': tags} # construct a dictionary
    questions.append(new_question) # add dictionary to list

    
# prepare for the monitoring logic
start_time = time() # note the system time when the program starts
request_count = 0 # track the number of requests made

# create variables to store the data
questions = []

# variables to handle the request loop
has_next_page = True
MAX_REQUESTS = 10 # do not request more than 10 pages
page_number = 1
query = {'tab':'newest', 'page': page_number}
url = 'https://stackoverflow.com/questions'
headers = {'user-agent': 'questionscraper - school project (yourname@gmail.com)'}

while has_next_page and request_count < MAX_REQUESTS:
  # keep the output clear
  clear_output(wait = True)
  
  # make an initial request
  response = requests.get(url, params=query, headers=headers)

  # make sure we got a valid response
  if(response.ok):
    # get the full data from the response
    data = response.text
    soup = BeautifulSoup(data, 'html.parser')
    process_page(soup, questions)

    # check for the next page
    # look for the presence of a link with the attribute rel="next"
    next_button = soup.select('a[rel="next"]')
    has_next_page = len(next_button) > 0
    
  else:
    # display a warning if there are any problems
    warn('Request #: {}, Failed with status code: {}'.format(request_count, response.status_code))
  
  request_count += 1
  
  # go to sleep for a bit
  # we use a random number between 1 and 3, so
  # we wait at most 3 seconds before making the next request
  
  sleep(randint(1,3))
  
  # output some logs for monitoring
  elapsed_time = time() - start_time
  print('Requests: {}, Frequency: {} requests/s, {} questions processed.'.format(request_count, request_count/elapsed_time, len(questions)))
  
  # prepare for next iteration: advance the page and update the query string
  page_number += 1
  query['page'] = page_number
      
print('Scraping complete')
print('Requests: {}, Frequency: {} requests/s, {} questions processed.'.format(request_count, request_count/elapsed_time, len(questions)))

Requests: 11, Frequency: 0.5421090145287822 requests/s, 165 questions processed.
Scraping complete
Requests: 11, Frequency: 0.5421090145287822 requests/s, 165 questions processed.


In [0]:
# print the first five questions
questions[0:5]

[{'excerpt': "I'm using SpringXD where I have a Rabbit Source with outputType application/json. Next module receive it and convert it to Java Object. RabbitMq is my transport bus.\n\nMy configuration is pretty ...",
  'tags': ['java',
   'spring',
   'spring-integration',
   'spring-cloud-dataflow',
   'spring-xd'],
  'title': 'Not able to convert Message from Json String to Object. Cast exception'},
 {'excerpt': "I'm trying to create a program where I insert a value at a requested node in a doubly linked list. My code works, but it inserts the value one position after the requested node. How do I change my ...",
  'tags': ['c++', 'doubly-linked-list'],
  'title': 'Inserting before a specified position in doubly linked list'},
 {'excerpt': 'I am trying to learn about Firebase, so "I made" a small app where I save data from a form and then retrieve them. But I am getting an object when retrieving the snapshot value:\n\nThis is what my code ...',
  'tags': ['node.js',
   'firebase',
   '

## Save to a file
Getting the data from the server is only the first part of the problem. The next step is to persist the data in a format that makes it available for analysis later. A CSV file is a fairly common format and is well supported in many environments. Python's `csv` library provides the tools to read and write CSV files.
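As a small sketch of how that works, the example below writes two made-up questions to an in-memory CSV and reads them back. One detail worth noting: a CSV cell holds text, so a list such as `tags` is joined into a single string here. The `;` separator and the sample data are assumptions for illustration.

```python
import csv
import io

# made-up rows with the same keys as the scraped questions
sample = [
    {'title': 'First question', 'excerpt': 'Some text, with a comma', 'tags': ['python', 'csv']},
    {'title': 'Second question', 'excerpt': 'More text', 'tags': ['html']},
]

output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=['title', 'excerpt', 'tags'])
writer.writeheader()
for question in sample:
    # copy the row so the original dictionary is left unchanged
    row = dict(question, tags=';'.join(question['tags']))
    writer.writerow(row)

# read the CSV text back and split the tags again
output.seek(0)
rows = list(csv.DictReader(output))
print(rows[0]['excerpt'])
print(rows[0]['tags'].split(';'))
```

The comma inside the first excerpt survives the round trip because the `csv` module quotes any field that contains the delimiter.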

### Note about Colab
Since Colab is a hosted service, writing a file to the filesystem is not quite as straightforward as it would be if this code were running on your local machine. We can get around this by writing to your Google Drive. The file will then be accessible from your Google Drive. In order to use Google Drive, you will need to be authenticated. Luckily, Google provides a library named *PyDrive* that makes all this simple. We do not have to go into the details of what it does. First, ensure that PyDrive is installed in the environment. This command only needs to be executed once in a notebook.


In [0]:
# install this library
!pip install -U -q PyDrive


Then, to authenticate with Google, do the following. Note that you will be prompted to login with your Google credentials and then you will be given a code to enter. When you enter that code, this notebook will have permission to write files to your Google Drive.

In [0]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once in a notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

Next, we can create a CSV string of our questions and write it to Google Drive with the file name "questions.csv".

In [0]:
import csv
import io

# Create an output stream
output = io.StringIO()

# these are the names of the properties in the dictionary
fieldnames = ['title', 'excerpt', 'tags']

# create a writer object, it can write dictionaries to the output stream
writer = csv.DictWriter(output, fieldnames=fieldnames)

# write all the headings 
writer.writeheader()

# iterate the questions and write each one
for question in questions:
  writer.writerow(question)


# Create & upload a text file.
uploaded = drive.CreateFile({'title': 'questions.csv'})
uploaded.SetContentString(output.getvalue())
uploaded.Upload()
print('Uploaded file with ID {}'.format(uploaded.get('id')))  
      

Uploaded file with ID 1WZ5DRTIPEHhp3ZVvPa5b8oDHlyOiJ6CA


Visit your Google Drive, find the file named "questions.csv", and verify that it contains the list of questions that were scraped from the StackOverflow questions site.
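If you are running the code on your own machine rather than in Colab, a plain local file works too. The following is a sketch under that assumption: the single stand-in question and the temporary folder are there only to keep the example self-contained, and in practice you would write the scraped `questions` list to a path of your choosing.

```python
import csv
import os
import tempfile

# stand-in data; in the lesson this would be the scraped questions list
questions = [{'title': 'Example', 'excerpt': 'Example text', 'tags': ['python']}]

# a temporary folder is used here so the example leaves nothing behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'questions.csv')

    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'excerpt', 'tags'])
        writer.writeheader()
        writer.writerows(questions)

    # read the file back to confirm what was saved
    with open(path, newline='', encoding='utf-8') as f:
        saved = list(csv.DictReader(f))

print(saved[0]['title'])
```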