Blogs and other regularly updating websites usually have a front page with the most recent post as well as a Previous button on the page that takes you to the previous post. That post will also have a Previous button, and so on, creating a trail from the most recent page to the first post on the site. If you wanted a copy of the site’s content to read when you’re not online, you could manually navigate to every page and save each one. But this is pretty boring work, so let’s write a program to do it instead.

XKCD is a popular geek webcomic with a website that fits this structure (see Figure 11-6). The front page at http://xkcd.com/ has a Prev button that guides the user back through prior comics. Downloading each comic by hand would take forever, but you can write a script to do this in a couple of minutes.

Here’s what your program does:

* Loads the XKCD home page.
* Saves the comic image on that page.
* Follows the Previous Comic link.
* Repeats until it reaches the first comic.
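
The four steps above amount to a simple loop. As a rough sketch only, here is that loop run against a toy in-memory "site" instead of the real one. The `FAKE_SITE` dictionary and the `walk_back` function are hypothetical names made up for this illustration; the only detail borrowed from the real program is that the walk stops when the Previous link ends with `#`.

```python
# A toy, in-memory stand-in for a site with "Previous" links.
# Each hypothetical page maps to (image name, previous-page URL);
# '#' marks the first post, which has nothing before it.
FAKE_SITE = {
    '/3/': ('third.png', '/2/'),
    '/2/': ('second.png', '/1/'),
    '/1/': ('first.png', '#'),
}

def walk_back(site, start):
    """Collect image names from `start` back to the first page."""
    saved = []
    url = start
    while not url.endswith('#'):
        image, prev_url = site[url]   # "load" the page
        saved.append(image)           # "save" its image
        url = prev_url                # follow the Previous link
    return saved

images = walk_back(FAKE_SITE, '/3/')
print(images)  # ['third.png', 'second.png', 'first.png']
```
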
<table>
    <tr>
        <td><img src='https://automatetheboringstuff.com/images/000016.jpg' align='left'/></td>
    </tr>
</table>

Figure 11-6. XKCD, “a webcomic of romance, sarcasm, math, and language”  

This means your code will need to do the following:

* Download pages with the requests module.
* Find the URL of the comic image for a page using Beautiful Soup.
* Download and save the comic image to the hard drive with iter_content().
* Find the URL of the Previous Comic link, and repeat.

Open a new file editor window and save it as downloadXkcd.py.
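
The two Beautiful Soup steps can be tried without touching the network. The sketch below parses a small, made-up HTML fragment; `SAMPLE_HTML` is an assumption that only imitates the parts of the page the program cares about (an `img` inside `div#comic` and a link with `rel="prev"`), and the real markup may differ.

```python
import bs4

# Hypothetical, heavily trimmed page markup; the real page has far more.
SAMPLE_HTML = '''
<html><body>
  <ul class="comicNav">
    <li><a rel="prev" href="/1234/">&lt; Prev</a></li>
  </ul>
  <div id="comic">
    <img src="//imgs.xkcd.com/comics/example.png" alt="Example">
  </div>
</body></html>
'''

soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')

# select() returns a list of matching tags; it is empty if nothing matches.
img_tags = soup.select('div#comic img')
img_src = img_tags[0].get('src') if img_tags else None

prev_tags = soup.select('a[rel="prev"]')
prev_href = prev_tags[0].get('href') if prev_tags else None

print(img_src)
print(prev_href)
```

Checking for an empty result before indexing with `[0]` keeps a page without a comic image (or without a Previous link) from raising an `IndexError`.
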

In [2]:
import requests, bs4, os

In [6]:
url = 'http://xkcd.com'
os.makedirs('xkcd', exist_ok=True)  # Must match the folder the images are saved to below
count = 0
while not url.endswith('#'):
	print('Searching {} ...'.format(url))
	comicElem = requests.get(url)
	comicElem.raise_for_status()
	img_soup = bs4.BeautifulSoup(comicElem.text, 'html.parser')
	find = img_soup.select('div#comic img')
	if find == []:
		print('Could not find img in {}.'.format(url))
	else:
		try:
			comicUrl = 'http:' + find[0].get('src')
			# Download image
			print("Downloading image {}...".format(comicUrl))
			img = requests.get(comicUrl)
			img.raise_for_status()
			# Save the image to './xkcd' if the request raised no error
			imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
			for chunk in img.iter_content(100000):
				imageFile.write(chunk)
			imageFile.close()
			count += 1

		# Runs when the image could not be fetched
		except requests.exceptions.MissingSchema:
			print('URL {} is missing a scheme.'.format(comicUrl))

		except requests.exceptions.InvalidURL:
			print('URL {} is invalid.'.format(comicUrl))

		except requests.exceptions.HTTPError as err:
			# raise_for_status() raises this for any 4xx or 5xx response, not only 503
			print('HTTP error for URL {}: {}'.format(comicUrl, err))

	# Get the prev button's url.
	prevLink = img_soup.select('a[rel="prev"]')[0]
	url = 'http://xkcd.com' + prevLink.get('href')
print('Total downloaded: {} comics.'.format(count))
print('Done')
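
Two spots in the loop above could be made a little more robust: the URLs are built by string concatenation, and the image file is closed by hand. The following is a minimal sketch of one alternative, not a replacement for the script. `absolute_url` and `save_chunks` are hypothetical helper names, and the example URLs are assumptions chosen to look like the `src` and `href` values the script expects.

```python
import os
import tempfile
from urllib.parse import urljoin

def absolute_url(page_url, link):
    """Resolve a src/href found on page_url into a full URL."""
    return urljoin(page_url, link)

def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path; return bytes written."""
    written = 0
    with open(path, 'wb') as image_file:   # closed even if a write fails
        for chunk in chunks:
            written += image_file.write(chunk)
    return written

base = 'http://xkcd.com/1234/'
img_url = absolute_url(base, '//imgs.xkcd.com/comics/example.png')
prev_url = absolute_url(base, '/1233/')

# Stand-in for img.iter_content(100000): any iterable of bytes works.
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, os.path.basename(img_url))
    size = save_chunks([b'abc', b'de'], target)
    exists = os.path.exists(target)

print(img_url, prev_url, size, exists)
```

In the real script, `save_chunks(img.iter_content(100000), path)` would take the place of the manual `open()`, `write()`, and `close()` calls. One caveat if you switch to `urljoin()`: a bare `#` href may not survive resolution, so compare the raw `href` to `'#'` before resolving it rather than relying on `url.endswith('#')`.
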
