# Chapter 1 Your First Web Scraper

## Connecting

https://docs.python.org/3.7/howto/urllib2.html

In [1]:
from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


## An introduction to BeautifulSoup

BeautifulSoup helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures.

### Installing BeautifulSoup

In [2]:
!pip install beautifulsoup4
from bs4 import BeautifulSoup

### Running BeautifulSoup

In [12]:
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)

<h1>An Interesting Title</h1>


By convention, only one h1 tag should be used on a single page, but conventions are often broken on the web, so you should be aware that this will retrieve the first instance of the tag only, and not necessarily the one that you’re looking for.
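
The "first instance only" behaviour is easy to check with a small sketch. The HTML string below is made up for illustration and is not taken from pythonscraping.com; `find_all` is the BeautifulSoup method that returns every matching tag instead of just the first.

```python
from bs4 import BeautifulSoup

# Hypothetical page that breaks the one-h1 convention
sample_html = """
<html><body>
<h1>Site Banner</h1>
<h1>Article Title</h1>
</body></html>
"""

bs = BeautifulSoup(sample_html, 'html.parser')

# Attribute access returns only the first matching tag
first_heading = bs.h1.get_text()

# find_all returns every match, in document order
all_headings = [tag.get_text() for tag in bs.find_all('h1')]

print(first_heading)
print(all_headings)
```

If the heading you want is not the first one, iterating over the `find_all` result and filtering it is usually safer than relying on `bs.h1`.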

BeautifulSoup also accepts the file-like object returned by `urlopen` directly, without needing to call `.read()` first:

In [18]:
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html, 'html.parser')

In [21]:
print(bs.h1)
print(bs.body.h1)
print(bs.html.body.h1)
print(bs.html.h1)

<h1>An Interesting Title</h1>
<h1>An Interesting Title</h1>
<h1>An Interesting Title</h1>
<h1>An Interesting Title</h1>


When you create a BeautifulSoup object, two arguments are passed in:
- The first is the HTML text the object is based on.
- The second specifies the parser that you want BeautifulSoup to use in order to create that object.

`html.parser` is a parser that is included with Python 3 and requires no extra installations in order to use.

Another popular parser is `lxml`.

In [22]:
!pip install lxml



lxml can be used with BeautifulSoup by changing the parser string provided:

In [23]:
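html = urlopen('http://www.pythonscraping.com/pages/page1.html')  # reopen: the earlier response was already consumed
bs = BeautifulSoup(html.read(), 'lxml')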

lxml has some advantages over html.parser in that it is generally better at parsing “messy” or malformed HTML code. It is forgiving and fixes problems like unclosed tags, tags that are improperly nested, and missing head or body tags. It is also somewhat faster than html.parser, although speed is not necessarily an advantage in web scraping, given that **the speed of the network itself will almost always be your largest bottleneck**.

One of the disadvantages of lxml is that it has to be installed separately and depends on third-party C libraries to function. This can cause problems for portability and ease of use, compared to html.parser.

Another popular HTML parser is html5lib. Like lxml, html5lib is an extremely forgiving parser, and it takes even more initiative in correcting broken HTML. It also has to be installed separately, and it is slower than both lxml and html.parser. Despite this, it may be a good choice if you are working with messy or handwritten HTML sites.

In [25]:
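!pip install html5lib
html = urlopen('http://www.pythonscraping.com/pages/page1.html')  # reopen: a response can only be read once
bs = BeautifulSoup(html.read(), 'html5lib')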

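
Because lxml and html5lib are optional installs, one way to keep a scraper portable is to fall back to `html.parser` when they are missing. The sketch below assumes that approach; `make_soup` and its `preferred` parameter are hypothetical names, and the malformed snippet is made up for illustration. It relies on BeautifulSoup raising `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=('lxml', 'html5lib', 'html.parser')):
    """Parse markup with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise RuntimeError('No usable parser found')


# Hypothetical malformed snippet: the <p> tag is never closed
soup, parser_used = make_soup('<html><body><p>Hello, scraper</body></html>')
print(parser_used)
print(soup.p.get_text())
```

Keep in mind that different parsers can repair the same broken document in different ways, so a fallback like this trades consistency for portability.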


### Connecting Reliably and Handling Exceptions

One of the most frustrating experiences in web scraping is to go to sleep with a scraper running, dreaming of all the data you’ll have in your database the next day—only to find that the scraper hit an error on some unexpected data format and stopped execution shortly after you stopped looking at the screen.

```python
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
```
Two main things can go wrong in this line:
- The page is not found on the server (or there was an error in retrieving it).
- The server is not found.

In the first situation, an HTTP error will be returned. This HTTP error may be “404 Page Not Found,” “500 Internal Server Error,” and so forth. In all of these cases, the urlopen function will throw the generic exception `HTTPError`.


In [31]:
from urllib.request import urlopen 
from urllib.error import HTTPError

try:
    html = urlopen('http://www.pythonscraping.com/pages/404notfound.html') 
except HTTPError as e:
    print(e) 
    # return None, break, or do some other "Plan B"
else:
    print('NO ERROR')
    # program continues. Note: If you return or break in the 
    # exception catch, you do not need to use the "else" statement


HTTP Error 404: Not Found
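
The caught `HTTPError` carries the status code and reason phrase in its `code` and `reason` attributes, which is useful when a 404 and a 500 should lead to different "Plan B" actions. The following is a minimal sketch; `describe_http_error` is a hypothetical helper, and the exception is constructed by hand only so that the example runs without network access.

```python
from urllib.error import HTTPError


def describe_http_error(error):
    """Return a short description based on the status code of an HTTPError."""
    if 400 <= error.code < 500:
        kind = 'client error'
    elif 500 <= error.code < 600:
        kind = 'server error'
    else:
        kind = 'unexpected status'
    return f'{error.code} {error.reason} ({kind})'


# Built by hand so that the example needs no network access;
# urlopen raises the same kind of object on a failed request.
example = HTTPError('http://www.pythonscraping.com/pages/404notfound.html',
                    404, 'Not Found', None, None)
print(describe_http_error(example))
```

In a scraper, a client error usually means the URL should be skipped, while a server error may be worth retrying later.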


If the server is not found at all (if, say, http://www.pythonscraping.com is down, or the URL is mistyped), urlopen will throw a `URLError`.

In [29]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen("https://pythonscrapingthisurldoesnotexist.com")
except HTTPError as e:
    print("The server returned an HTTP error")
except URLError as e:
    print("The server could not be found!")
else:
    print(html.read())

The server could not be found!
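
The order of the `except` clauses matters here: `HTTPError` is a subclass of `URLError`, so a `URLError` clause placed first would also swallow HTTP errors. The sketch below illustrates this without touching the network; `classify_failure` is a hypothetical helper, and the exceptions are constructed by hand.

```python
from urllib.error import HTTPError, URLError


def classify_failure(error):
    """Mirror the except-clause order used when calling urlopen."""
    try:
        raise error
    except HTTPError:
        # Must come first: HTTPError is a subclass of URLError
        return 'server returned an HTTP error'
    except URLError:
        return 'server could not be found'


print(issubclass(HTTPError, URLError))
print(classify_failure(URLError('unknown host')))
```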


Every time you access a tag in a `BeautifulSoup` object, it’s smart to add a check to make sure the tag actually exists. If you attempt to access a tag that does not exist, `BeautifulSoup` will return a `None` object. The problem is that attempting to access a tag on that `None` object will itself raise an `AttributeError`.

In [32]:
try:
    badContent = bs.nonExistingTag.anotherTag 
except AttributeError as e:
    print('Tag was not found') 
else:
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)

Tag was not found
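
An alternative to wrapping every chained access in `try`/`except` is to walk the chain one tag at a time and stop at the first missing tag. This is only a sketch of that idea: `find_nested` is a hypothetical helper, and the HTML is a shortened version of the example page.

```python
from bs4 import BeautifulSoup


def find_nested(root, *tag_names):
    """Follow a chain of tag names, returning None as soon as one is missing."""
    current = root
    for name in tag_names:
        if current is None:
            return None
        # find() returns the first matching descendant, or None
        current = current.find(name)
    return current


page = BeautifulSoup(
    '<html><body><h1>An Interesting Title</h1></body></html>', 'html.parser')

print(find_nested(page, 'body', 'h1'))
print(find_nested(page, 'nonExistingTag', 'anotherTag'))
```

The caller then needs a single `is None` check, regardless of how deep the chain is.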




In [33]:
from urllib.request import urlopen
from urllib.error import URLError
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except URLError:
        # Also covers HTTPError, which is a subclass of URLError
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError:
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

<h1>An Interesting Title</h1>


When writing scrapers, it’s important to think about the overall pattern of your code in order to handle exceptions and make it readable at the same time. You’ll also likely want to heavily reuse code. Having generic functions such as `getSiteHTML` and `getTitle` (complete with thorough exception handling) makes it easy to scrape the web quickly and reliably.
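
The notes above do not define `getSiteHTML`, so the following is only one possible sketch of such a helper. The `timeout` parameter and the decision to return `None` on failure are assumptions, chosen to match the style of `getTitle`.

```python
from urllib.error import URLError
from urllib.request import urlopen


def getSiteHTML(url, timeout=10):
    """Return the raw bytes at url, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except URLError:
        # Covers HTTPError too, because HTTPError subclasses URLError
        return None
    except ValueError:
        # urlopen raises ValueError for malformed URLs such as 'not a url'
        return None
```

A call such as `getSiteHTML('http://www.pythonscraping.com/pages/page1.html')` would then return either the page bytes, ready to pass to `BeautifulSoup`, or `None`. This sketch does not cover every failure mode; a timeout while reading the response body, for example, may surface as a different exception.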