In [2]:
from urllib.request import urlopen
html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


# Connecting Reliably and Handling Exceptions
### Two main things can go wrong in the urlopen line:
• The page is not found on the server (or there was an error in retrieving it).
• The server is not found.
In the first situation, an HTTP error will be returned. This HTTP error may be “404
Page Not Found,” “500 Internal Server Error,” and so forth. In all of these cases, the
urlopen function will throw the generic exception HTTPError. You can handle this
exception in the following way:

In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError
try:
  html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
  print(e)
  # return None, break, or do some other "Plan B"
else:
  # program continues. Note: If you return or break in the
  # exception catch, you do not need to use the "else" statement
  pass  # placeholder so the else block is valid Python

>If an HTTP error code is returned, the program now prints the error, and does not
execute the rest of the program under the else statement.
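
As a minimal sketch of the "Plan B" idea, the try/except can be wrapped in a small function that returns None when an HTTP error occurs. The names fetch, opener, and always_404 are hypothetical and not part of urllib; the opener parameter exists only so the error path can be exercised without a network connection.

```python
from urllib.request import urlopen
from urllib.error import HTTPError

def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None on an HTTP error."""
    try:
        response = opener(url)
    except HTTPError as e:
        # e.code is the numeric status, e.reason the short description
        print('HTTP error {}: {}'.format(e.code, e.reason))
        return None
    return response.read()

def always_404(url):
    # Hypothetical stand-in for urlopen that simulates a missing page
    raise HTTPError(url, 404, 'Not Found', None, None)

# Prints "HTTP error 404: Not Found", then None
print(fetch('http://www.pythonscraping.com/pages/page1.html', opener=always_404))
```

Calling fetch(url) without the opener argument uses the real urlopen. Returning None keeps the caller's check to a single "is None" test.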

>If the server is not found at all (if, say, http://www.pythonscraping.com is down, or the
URL is mistyped), urlopen will throw a URLError. Because the remote server is responsible
for returning HTTP status codes, an HTTPError cannot be thrown when no server can be
reached, so the more serious URLError must be caught instead. You can add a check to see
whether this is the case:

In [5]:
from urllib.error import HTTPError
from urllib.error import URLError
try:
  html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
  print(e)
except URLError as e:
  print('The server could not be found!')
else:
  print('It Worked!')

The server could not be found!
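
One detail about the order of the except clauses: in the standard library, HTTPError is a subclass of URLError, so the HTTPError clause has to come first or it would never be reached. The sketch below illustrates this; classify is a hypothetical helper, and the exceptions are constructed by hand so nothing is fetched.

```python
from urllib.error import HTTPError, URLError

# HTTPError derives from URLError, so "except URLError" also matches HTTPError
print(issubclass(HTTPError, URLError))  # prints True

def classify(exc):
    """Return a short label for an exception raised by urlopen."""
    # Test the more specific class first, as in a chain of except clauses
    if isinstance(exc, HTTPError):
        return 'HTTP error {}'.format(exc.code)
    if isinstance(exc, URLError):
        return 'server not reached: {}'.format(exc.reason)
    return 'unexpected'

print(classify(HTTPError('http://www.pythonscraping.com', 404, 'Not Found', None, None)))
print(classify(URLError('no host')))
```

If the two isinstance checks were swapped, every HTTPError would be reported as an unreachable server.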


>Of course, if the page is retrieved successfully from the server, there is still the issue of the content on the page not quite being what you expected. Every time you access a tag in a BeautifulSoup object, it’s smart to add a check to make sure the tag actually
exists. If you attempt to access a tag that does not exist, BeautifulSoup will return a None object. The problem is that attempting to access a tag on that None object will itself result in an AttributeError being thrown.

>The following cells parse the page and then access two tags: h1, which exists, and nonExistentTag, which is a made-up tag, not the name of a real BeautifulSoup function.

In [7]:
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)

<h1>An Interesting Title</h1>


In [8]:
print(bs.nonExistentTag)

None


Accessing nonExistentTag returns a None object. This object is perfectly reasonable to handle and check for. The trouble comes if you don’t check for it, but instead go on and try to access another attribute on the None object, as illustrated in the following:

In [9]:
print(bs.nonExistentTag.someTag)

AttributeError: 'NoneType' object has no attribute 'someTag'

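The easiest way to guard against these two situations (a tag that comes back as None, and an AttributeError from accessing a tag on None) is to explicitly check for both: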

In [12]:
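try:
  badContent = bs.nonExistentTag.anotherTag
except AttributeError as e:
  # nonExistentTag was None, so accessing .anotherTag failed
  print('Tag was not found')
else:
  # the first tag exists, but the nested one may still be missing
  if badContent is None:
    print('Tag was not found')
  else:
    print(badContent)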

Tag was not found
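
The two checks above can also be folded into one small helper that walks a chain of names and stops at the first None. This is only a sketch: get_nested is a hypothetical function, not part of BeautifulSoup, and a plain SimpleNamespace stands in for the parsed page so the example runs without any download.

```python
from types import SimpleNamespace

def get_nested(root, *names):
    """Follow a chain of attribute names, stopping at the first None."""
    current = root
    for name in names:
        if current is None:
            return None
        # A missing attribute is treated the same as a None value
        current = getattr(current, name, None)
    return current

# Stand-in for a parsed page (an assumption made for this illustration)
page = SimpleNamespace(body=SimpleNamespace(h1='An Interesting Title', div=None))

print(get_nested(page, 'body', 'h1'))                 # An Interesting Title
print(get_nested(page, 'body', 'div', 'p'))           # None
print(get_nested(page, 'nonExistentTag', 'someTag'))  # None
```

The caller then needs only one "is None" test, whichever link in the chain was missing.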




This checking and handling of every error does seem laborious at first, but it’s easy to add a little reorganization to this code to make it less difficult to write (and, more important, much less difficult to read). This code, for example, is the same scraper written in a slightly different way:

In [15]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
def getTitle(url):
  try:
    html = urlopen(url)
  except HTTPError as e:
    return None
  try:
    bs = BeautifulSoup(html.read(), 'html.parser')
    title = bs.body.h1
  except AttributeError as e:
    return None
  return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
  print('Title could not be found')
else:
  print(title)

<h1>An Interesting Title</h1>


In this example, you’re creating a function getTitle, which returns either the title of the page, or a None object if there was a problem retrieving it. Inside getTitle, you check for an HTTPError, as in the previous example, and encapsulate two of the BeautifulSoup lines inside one try statement. An AttributeError can be thrown from the second of these lines: if the page has no body tag, bs.body is a None object, and bs.body.h1 throws an AttributeError. Note that getTitle as written catches only HTTPError, so a URLError from a server that cannot be reached is not handled inside the function and propagates to the caller. You could, in fact, encompass as many lines as you want inside one try statement, or call another function entirely, which can throw an AttributeError at any point.