# 1. Your First Web Scraper

A browser does a lot of work for you: it formats HTML, applies CSS, runs JavaScript, and so on.

Let's start with the GET request.

## Connecting

What happens during a GET request:
- The client sends a stream of 1s and 0s. The information is split into a header and a body. The header contains the immediate destination (the local router's MAC address) and the final destination (the server's IP address). The body contains the request itself.
- The client's local router receives the bits, interprets them as a packet, stamps its own IP address on the packet as the "from" address, and sends it off.
- The packet traverses several intermediary servers on its way to the target server.
- The server receives the packet at its IP address.
- The server reads the destination port in the packet header and passes the packet to the appropriate web server application.
- The web server application receives a stream of data from the server's processor. The data says:
    - This is a GET request.
    - The following file is requested: index.html
- The web server locates the correct file, bundles it up into a new packet, and sends it back to the client.
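
The request that the web server application reads in the steps above is plain text. Here is a minimal sketch of what those bytes might look like for page1.html, assuming HTTP/1.1. The helper name `build_get_request` is made up for illustration and is not part of any library.

```python
def build_get_request(host, path="/"):
    """Return the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, requested file, protocol version
        f"Host: {host}",         # required in HTTP/1.1 so one IP address can serve many sites
        "Connection: close",     # ask the server to close the connection after replying
        "",                      # the empty line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


request = build_get_request("pythonscraping.com", "/pages/page1.html")
print(request.decode("ascii"))
```

Sending bytes like these over a TCP connection to the server is roughly what urlopen does on your behalf.
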
<br>
<br>

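An IP address is like a street address. The destination port is like an apartment number: it tells the server which application should receive the packet data.

** The destination port is almost always port 80 for web applications served over HTTP (HTTPS uses port 443).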
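
To make the address and port analogy concrete, the sketch below splits a URL into the host, port, and file path that a request would target. `destination` and `DEFAULT_PORTS` are hypothetical names, and the port table only covers the two schemes listed.

```python
from urllib.parse import urlsplit

# Well-known default ports; an assumption that covers only these two schemes.
DEFAULT_PORTS = {"http": 80, "https": 443}


def destination(url):
    """Return the (host, port, path) that a request for `url` would target."""
    parts = urlsplit(url)
    # parts.port is None unless the URL names a port explicitly (e.g. host:8443)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return parts.hostname, port, parts.path or "/"


print(destination("http://pythonscraping.com/pages/page1.html"))
print(destination("https://pythonscraping.com:8443/"))
```
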

<br>

None of the steps above requires a browser. Web browsers are a relatively recent invention.

In [8]:
from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html') # returns a file-like response; works for any file stream, not only HTML
html.read()

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'

In [11]:
type(html)

http.client.HTTPResponse

In [10]:
type(html.read())

bytes
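
Two details of `read()` are easy to miss: it returns bytes rather than str, and it consumes the stream, so a second `read()` on the same response returns an empty bytes object. The sketch below uses `io.BytesIO` as an offline stand-in for the response object (an assumption made so the example runs without a network connection), and it assumes the page is UTF-8 encoded.

```python
import io

# Stand-in for the response object: BytesIO is also a read-once byte stream.
fake_response = io.BytesIO(b"<html><head><title>A Useful Page</title></head></html>")

first = fake_response.read()   # consumes the whole stream
second = fake_response.read()  # nothing left to read

text = first.decode("utf-8")   # bytes -> str, assuming UTF-8 content

print(type(first), type(text))
print(second)
```
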

More precisely, this outputs the HTML file page1.html, found in the directory `<web root>/pages`, on the server located at the domain name pythonscraping.com.

Think of it as a file, not a page. (It is important to think this way.)

urllib is part of the Python standard library (batteries included).


## An Introduction to BeautifulSoup 

BeautifulSoup formats and organizes the messy web by
- fixing bad HTML
- presenting us with easily traversable Python objects that represent XML structures.

### Installing BeautifulSoup

pip install beautifulsoup4 (or pipenv install beautifulsoup4). The package name is beautifulsoup4, NOT beautifulsoup.

** It is always a good idea to install into a virtual environment (virtualenv).

In [6]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser') # HTML content transformed into a BeautifulSoup object. 
bs

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

In [7]:
bs.h1

<h1>An Interesting Title</h1>

bs.h1 returns only the first instance of the h1 tag.

By convention, a page should use only one h1 tag (but this convention is often broken in the wild).

The following forms are also possible.

In [13]:
bs.html.body.h1

<h1>An Interesting Title</h1>

In [14]:
bs.body.h1

<h1>An Interesting Title</h1>
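
The sketch below illustrates the "first instance only" behavior on a page that breaks the one-h1 convention. The HTML string is made up for illustration. `find_all` is not used elsewhere in these notes; it is the BeautifulSoup method that returns every match instead of just the first.

```python
from bs4 import BeautifulSoup

# Inline HTML (made up for illustration) with two h1 tags.
html = """
<html>
<body>
<h1>First Title</h1>
<div><h1>Second Title</h1></div>
</body>
</html>
"""

bs = BeautifulSoup(html, "html.parser")

first_h1 = bs.h1            # attribute access: first match only
all_h1 = bs.find_all("h1")  # every match, in document order
missing = bs.h2             # no such tag: None, not an exception

print(first_h1.get_text())
print([tag.get_text() for tag in all_h1])
print(missing)
```
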

Another parser option is lxml.

It can be installed with

pip install lxml

In [19]:
# bs = BeautifulSoup(html.read(), 'lxml')
# bs

# The lines above raised an error (bs4.FeatureNotFound: lxml was not installed for this interpreter).
# https://stackoverflow.com/questions/24398302/bs4-featurenotfound-couldnt-find-a-tree-builder-with-the-features-you-requeste
# Running `python -m pip install lxml` seems to fix the problem, but I won't bother.

lxml has some advantages over html.parser:
- It is generally better at parsing 'messy' or malformed HTML, such as unclosed or improperly nested tags.
- It is somewhat faster than html.parser.

Some disadvantages are:
- It depends on third-party C libraries to function.
- This can hurt portability and ease of use.

Another option is html5lib, which also handles messy HTML well but is slower than the previous two.
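
Since lxml and html5lib are optional installs, one way to avoid a missing-parser error is to check what is importable before choosing. This is a minimal sketch using only the standard library; `pick_parser` is a hypothetical helper, and the preference order is an assumption.

```python
from importlib.util import find_spec


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser name, falling back to 'html.parser'."""
    for name in preferred:
        # find_spec returns None when the top-level package is not importable
        if find_spec(name) is not None:
            return name
    return "html.parser"  # ships with Python, so it is always available


parser = pick_parser()
print(parser)
```

The returned string can be passed as the second argument to BeautifulSoup, for example `BeautifulSoup(html.read(), pick_parser())`.
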

### Connecting Reliably and Handling Exceptions 

The web is messy: data is poorly formatted, websites go down, closing tags are missing, and so on.

You have to handle exceptions if you don't want to find an error waiting for you the next morning.

In [20]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

# html = urlopen('http://www.pythonscraping.com/pages/page1.html') # A bare call like this crashes on a 404 or 500 HTTP error.

try:
    html = urlopen("https://pythonscrapingthisurldoesnotexist.com")
except HTTPError as e:
    # The server was reached but returned an HTTP error (404, 500, ...).
    print("The server returned an HTTP error")
except URLError as e:
    # More serious: the site is down or the URL is mistyped, so no server could even send an HTTP error.
    print("The server could not be found!")
else:
    print(html.read())

The server could not be found!
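
The order of the except clauses matters because HTTPError is a subclass of URLError: if URLError were tested first, it would also catch every HTTP error. The sketch below shows this offline with hand-built exceptions. The URL, the reason string, and the `describe_failure` helper are illustrative assumptions.

```python
import io
from urllib.error import HTTPError, URLError


def describe_failure(exc):
    """Map an exception raised by urlopen to a short message."""
    # HTTPError must be tested first: it is a subclass of URLError.
    if isinstance(exc, HTTPError):
        return f"HTTP error {exc.code}"
    if isinstance(exc, URLError):
        return f"server not reached: {exc.reason}"
    return "unexpected error"


# Exceptions built by hand (illustrative values) so the sketch runs offline.
not_found = HTTPError("http://example.com/missing", 404, "Not Found", None, io.BytesIO())
unreachable = URLError("name resolution failed")

print(issubclass(HTTPError, URLError))
print(describe_failure(not_found))
print(describe_failure(unreachable))
```
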


In [12]:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getTitle(url):
    """Return the first h1 tag in the page body, or None on any failure."""
    try:
        html = urlopen(url)
    except (HTTPError, URLError):  # HTTP error response, or server not reachable
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError:  # raised when bsObj.body is None, since None has no .h1
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

<h1>An Interesting Title</h1>


When writing scrapers, it's important to think about the overall pattern of your code:
- Handle exceptions while keeping the code readable.
- Reuse code.
    - e.g. general-purpose functions such as getSiteHTML() and getTitle()
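
One possible way to split the work into reusable pieces is sketched below. The function names come from the list above, but their signatures and bodies are assumptions: here getSiteHTML only fetches and getTitle only parses, so each handles one kind of failure. The sample HTML is inline so the sketch runs offline, and html.parser is used to avoid an extra install.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getSiteHTML(url):
    """Fetch a URL and return its raw bytes, or None if the request fails."""
    try:
        with urlopen(url) as response:
            return response.read()
    except (HTTPError, URLError):
        return None


def getTitle(html):
    """Return the first h1 inside the body tag, or None if it cannot be found."""
    if html is None:
        return None
    bs = BeautifulSoup(html, "html.parser")
    try:
        return bs.body.h1
    except AttributeError:  # bs.body is None when the page has no body tag
        return None


page = b"<html><body><h1>An Interesting Title</h1></body></html>"
print(getTitle(page))
print(getTitle(b"<p>no body tag here</p>"))
```

Against a live site, the two calls chain together as `getTitle(getSiteHTML(url))`, and a None result means either the fetch or the lookup failed.
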