
To request the complete HTML of *page1*, located at *http://pythonscraping.com/pages/page1.html*, just as a web browser would, run the following three lines of code:

In [0]:
from urllib.request import urlopen

In [0]:
html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


This outputs the HTML file *page1.html*, found in the directory *\<web root\>/pages*, on the server located at the domain name *http://pythonscraping.com*.

*urllib* is a standard Python library that contains functions for requesting data across the web, handling cookies, and changing metadata such as headers and the user agent.

`urlopen` is used to open a remote object across a network and read it.
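As a small illustration of changing request metadata, the sketch below builds a `urllib.request.Request` with a custom user agent. The helper name `build_request` and the agent string `my-scraper/0.1` are made up for this example, and nothing is sent over the network until the request is passed to `urlopen`.

```python
from urllib.request import Request


def build_request(url, user_agent):
    """Return a Request carrying a custom User-Agent header (nothing is sent yet)."""
    # build_request and the agent string below are illustrative, not from the book
    return Request(url, headers={'User-Agent': user_agent})


req = build_request('http://pythonscraping.com/pages/page1.html', 'my-scraper/0.1')
print(req.full_url)
# urllib stores header names capitalized, so the lookup key is 'User-agent'
print(req.get_header('User-agent'))
print(req.get_method())
```

Passing `req` to `urlopen` in place of the URL string would send the request with that header.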

#An Introduction to BeautifulSoup
The *BeautifulSoup* library helps format and organize the messy web by fixing bad HTML and presenting it as easily traversable Python objects that represent XML structures.


##Installing BeautifulSoup

In [0]:
!pip3 install beautifulsoup4



In [0]:
# test installation of BeautifulSoup4
from bs4 import BeautifulSoup

##Running BeautifulSoup


In [0]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [0]:
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')

The `BeautifulSoup` object requires two arguments: The first is the HTML text the object is based on, and the second specifies the parser
that you want BeautifulSoup to use in order to create that object.
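The two arguments can also be tried without a network connection. The following is a minimal sketch that parses an inline HTML string; `sample_markup` and `sample_soup` are names chosen only for this example.

```python
from bs4 import BeautifulSoup

# A small inline document standing in for a downloaded page (illustrative only)
sample_markup = (
    '<html><head><title>A Useful Page</title></head>'
    '<body><h1>An Interesting Title</h1></body></html>'
)

# First argument: the HTML text; second argument: the parser to use
sample_soup = BeautifulSoup(sample_markup, 'html.parser')

print(sample_soup.h1)                # the whole tag
print(sample_soup.h1.get_text())     # only the text inside the tag
print(sample_soup.title.get_text())
```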

In [0]:
print(bs.h1)

<h1>An Interesting Title</h1>


This returns only the first `h1` tag found on the page; any later `h1` tags are not included.

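Note that `html.read()` consumes the response: the object returned by `urlopen` can only be read once. The book passes `html` directly to `BeautifulSoup` without calling `.read()`, which works on a fresh response. Here, however, `html` has already been read in the cell above, so parsing it a second time yields an empty document (no error is raised):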

In [0]:
bs2 = BeautifulSoup(html, 'html.parser')

The response `html` was already consumed by the earlier `html.read()` call, so `bs2` is empty and contains no `h1` tag:

In [0]:
print(bs2.h1) # nothing shows up

None
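The read-once behavior can be reproduced without a network connection. The sketch below uses `io.BytesIO` as a stand-in for the response object; this is an assumption made for illustration, since a real HTTP response is a different class, but both are file-like streams whose second `read()` returns nothing.

```python
import io

# io.BytesIO stands in for the object returned by urlopen (illustrative only)
stream = io.BytesIO(b'<html><body><h1>An Interesting Title</h1></body></html>')

first = stream.read()    # consumes the whole stream
second = stream.read()   # nothing is left to read

print(first)
print(second)
```

Storing the result of the first `read()` in a variable and reusing it avoids the problem.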


The HTML content is transformed into a `BeautifulSoup` object, with the following structure:

In [0]:
print(bs)

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>



All of the following function calls produce the same output, even though the `h1` tag is nested two layers deep in the `BeautifulSoup` object structure (`html` $\rightarrow$ `body` $\rightarrow$ `h1`):

In [0]:
print(bs.h1)

<h1>An Interesting Title</h1>


In [0]:
print(bs.html.body.h1)

<h1>An Interesting Title</h1>


In [0]:
print(bs.body.h1)

<h1>An Interesting Title</h1>


In [0]:
print(bs.html.h1)

<h1>An Interesting Title</h1>


`html.parser` is a parser that is included with Python 3. Another popular parser is `lxml`. 

In [0]:
!pip3 install lxml



`lxml` can be used with BeautifulSoup by changing the parser string provided:

In [0]:
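html = urlopen('http://www.pythonscraping.com/pages/page1.html')  # reopen: a response can only be read once
bs = BeautifulSoup(html.read(), 'lxml')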

lxml has some advantages over html.parser in that it is generally better at parsing
“messy” or malformed HTML code. It is forgiving and fixes problems like unclosed
tags, tags that are improperly nested, and missing head or body tags.

Another popular HTML parser is `html5lib`, which is an extremely forgiving parser that takes even more initiative correcting broken HTML.

In [0]:
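html = urlopen('http://www.pythonscraping.com/pages/page1.html')  # reopen: a response can only be read once
bs = BeautifulSoup(html.read(), 'html5lib')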
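To compare how the parsers treat the same malformed fragment, something like the following sketch can be used. The fragment `broken` and the helper `parse_with` are made up for this example; parsers that are not installed are skipped rather than assumed to be present.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A fragment with unclosed tags and no html/head/body wrapper (illustrative only)
broken = '<h1>An Interesting Title<div>Unclosed content'


def parse_with(parser_name, markup):
    """Return the repaired markup as a string, or None if the parser is missing."""
    try:
        return str(BeautifulSoup(markup, parser_name))
    except FeatureNotFound:
        return None


results = {name: parse_with(name, broken)
           for name in ('html.parser', 'lxml', 'html5lib')}

for name, rendered in results.items():
    print(name, '->', rendered if rendered is not None else 'parser not installed')
```

Running this locally shows how much structure each parser adds on its own.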

##Connecting Reliably and Handling Exceptions
Two things can go wrong when scraping HTML data:
1. The page is not found on the server (or there was an error in retrieving it)
2. The server is not found

The first situation can be handled with the following exception handler:

In [0]:
from urllib.request import urlopen
from urllib.error import HTTPError

In [0]:
try:
  html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
  print(e)
  # return None, break, or do some other "Plan B"
else:
  # program continues. Note: if you return or break in the
  # except block, you do not need the "else" statement
  pass

If an HTTP error code is returned, the program now prints the error, and does not
execute the rest of the program under the `else` statement.

In the second situation (where the server is down or the URL is mistyped), `urlopen` throws a `URLError`. Because `HTTPError` is a subclass of `URLError`, the `HTTPError` handler must come first:

In [0]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

In [0]:
try:
  html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
  print(e)
except URLError as e:
  print('The server could not be found!')
else:
  print('It Worked!')
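Both failure modes can be exercised without touching the network. The sketch below wraps the same `try`/`except` pattern in a hypothetical helper, `open_page`, whose `opener` parameter is an assumption added here so that fake openers can simulate each error.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def open_page(url, opener=urlopen):
    """Return (response, None) on success or (None, reason) on failure."""
    try:
        response = opener(url)
    except HTTPError as e:
        # Checked first because HTTPError is a subclass of URLError
        return None, 'HTTP error {}'.format(e.code)
    except URLError as e:
        return None, 'server not reachable: {}'.format(e.reason)
    return response, None


# Hypothetical stand-ins for urlopen that simulate the two failure modes
def fake_missing_page(url):
    raise HTTPError(url, 404, 'Not Found', None, None)


def fake_missing_server(url):
    raise URLError('name resolution failed')


page_result = open_page('http://example.invalid/page', fake_missing_page)
server_result = open_page('http://example.invalid/page', fake_missing_server)
print(page_result)
print(server_result)
```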

When a nonexistent tag is accessed, BeautifulSoup returns a `None` object. Attempting to access a tag on that `None` object raises an `AttributeError`:

In [0]:
print(bs.fakeTag)

None




The problem arises when another tag is accessed on the `None` object:

In [0]:
print(bs.fakeTag.someTag)



AttributeError: ignored

To guard against both situations, check for each explicitly:

In [0]:
try:
  badContent = bs.nonExistingTag.anotherTag
except AttributeError as e:
  print('Tag was not found')
else:
  if badContent is None:
    print('Tag was not found')
  else:
    print(badContent)
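An alternative to catching `AttributeError` is to walk the chain one tag at a time and stop at the first missing one. The helper `find_nested` below is a hypothetical sketch of that idea and is not part of BeautifulSoup; it relies only on the `find` method, which returns `None` when no matching tag exists.

```python
from bs4 import BeautifulSoup


def find_nested(soup, *tag_names):
    """Follow tag_names in order; return the last tag, or None if any is missing."""
    node = soup
    for name in tag_names:
        node = node.find(name)
        if node is None:
            return None
    return node


# Inline sample document (illustrative only)
sample = BeautifulSoup(
    '<html><body><h1>An Interesting Title</h1></body></html>', 'html.parser')

found = find_nested(sample, 'body', 'h1')
missing = find_nested(sample, 'nonExistingTag', 'anotherTag')
print(found)
print(missing)
```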

The code can be reorganized into a function for easier reading and reuse:

In [0]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

In [0]:
def getTitle(url):
  try:
    html = urlopen(url)
  except HTTPError as e:
    return None
  
  try:
    bs = BeautifulSoup(html.read(), 'html.parser')
    title = bs.body.h1
  except AttributeError as e:
    return None
  return title

In [0]:
title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
  print('Title could not be found')
else:
  print(title)

<h1>An Interesting Title</h1>


The function `getTitle` returns either the title of the page or a `None` object if there was a problem retrieving it.

When writing scrapers, it is important to think about the overall pattern of the code so that it handles exceptions and stays readable at the same time. Generic functions such as `getSiteHTML` and `getTitle` also let you reuse code and make web scraping quick and easy.
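One possible way to split the work into reusable pieces is sketched below. The source only names `getSiteHTML`; its body, the `opener` parameter, and the helpers `getTitleFromHTML` and `fakeOpener` are all assumptions made for this example.

```python
import io
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getSiteHTML(url, opener=urlopen):
    """Return the raw page content, or None if the page or server is unavailable."""
    try:
        return opener(url).read()
    except (HTTPError, URLError):
        return None


def getTitleFromHTML(html):
    """Return the first h1 inside body, or None if it cannot be found."""
    if html is None:
        return None
    bs = BeautifulSoup(html, 'html.parser')
    try:
        return bs.body.h1
    except AttributeError:
        # bs.body is None when the document has no body tag
        return None


# Hypothetical opener that returns a canned page instead of using the network
def fakeOpener(url):
    return io.BytesIO(b'<html><body><h1>An Interesting Title</h1></body></html>')


title = getTitleFromHTML(getSiteHTML('http://example.invalid/', fakeOpener))
print(title)
```

Keeping retrieval and parsing separate means each function can be tested and reused on its own.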