### **WEB SCRAPING WITH PYTHON**

- Web scraping is the practice of gathering data from the web by any means other than a program interacting with an API or a human using a web browser.

### **Introduction to Web Scraping**

In [None]:
from urllib.request import urlopen

- urllib --> Python standard-library package containing functions for requesting data across the web
- urlopen --> Opens a remote object across a network so it can be read; a fairly generic function that is not specific to HTML pages.
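As a minimal offline sketch of what `urlopen` hands back, the snippet below opens a `data:` URL instead of a real web page. The `data:` URL and the `sample_url` name are assumptions made only so the example runs without a network connection; with an `http://` address the call has the same shape.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical stand-in for a real page: a data: URL carrying a tiny HTML body.
sample_url = "data:text/html," + quote("<h1>An Interesting Title</h1>")

# urlopen returns a file-like response object; read() yields the raw bytes.
with urlopen(sample_url) as response:
    raw = response.read()

print(type(raw).__name__)   # bytes
print(raw.decode("utf-8"))  # <h1>An Interesting Title</h1>
```

Because `read()` returns bytes, the result usually needs decoding (or handing to a parser) before it can be treated as text.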

#### Introduction to BeautifulSoup

- *BeautifulSoup* is a library that formats and organizes the messy web by fixing bad HTML and presenting users with traversable Python objects representing XML structures.

#### Running BeautifulSoup

In [1]:
# The most commonly used object is the BeautifulSoup object

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)

<h1>An Interesting Title</h1>


- When you create a BeautifulSoup object, you pass in two arguments. The first is the HTML text the object is based on; the second specifies the parser that BeautifulSoup should use to build that object.
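A small sketch of those two arguments, assuming an inline HTML string (`sample_html`, a made-up stand-in for a downloaded page) rather than a live response. The first argument can be a string or an open file-like object; `'html.parser'` is the parser bundled with Python, so nothing beyond `bs4` needs installing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for html.read(); not fetched from any site.
sample_html = (
    "<html><head><title>A Page</title></head>"
    "<body><h1>An Interesting Title</h1></body></html>"
)

# Argument 1: the HTML text. Argument 2: the name of the parser to use.
bs = BeautifulSoup(sample_html, "html.parser")

heading_text = bs.h1.get_text()   # text inside the first <h1>
page_title = bs.title.get_text()  # text inside <title>
print(heading_text)
print(page_title)
```

Other parsers such as `lxml` or `html5lib` can be named in the second argument, but they are separate installs.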


#### Connecting Reliably and Handling Exceptions

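- When using a scraper, two things can go wrong while retrieving a page: the server may respond with an error (for example, the page is not found), which raises an HTTPError, or the server may not be reachable at all, which raises a URLError.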

In [6]:
# Handling a page not found error
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found')
else:
    print('It worked!')

It worked!
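One detail behind the handler order above: `HTTPError` is a subclass of `URLError`, so the more specific `except HTTPError` clause has to come first. The sketch below wraps the same pattern in a helper. The name `fetch` and both test URLs are hypothetical, chosen so the example runs offline: a `data:` URL that succeeds and an unsupported scheme that makes `urlopen` raise `URLError`.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print(f"HTTP error: {e.code}")
        return None
    except URLError as e:
        # No usable response at all: unknown host, refused connection, bad scheme.
        print(f"Could not reach the server: {e.reason}")
        return None


# HTTPError inherits from URLError, which is why it is caught first.
is_subclass = issubclass(HTTPError, URLError)

ok = fetch("data:text/plain,hello")       # succeeds without a network
failed = fetch("nosuchscheme://example")  # urlopen raises URLError here
```

Returning `None` on failure keeps the calling code simple, at the cost of hiding which error occurred; logging or re-raising may suit larger scrapers better.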


- Every time you access a tag in a BeautifulSoup object, it’s smart to check that the tag actually exists. If you access a tag that does not exist, BeautifulSoup returns None. The problem is that accessing a further tag on that None value raises an AttributeError.

In [8]:
# Checking for tags

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

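def get_title(url):
    # Return the first <h1> inside <body>, or None if it cannot be retrieved
    try:
        html = urlopen(url)
    except HTTPError:
        # The server responded with an error status (e.g. 404 or 500)
        return None
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError:
        # bs.body was None, so there is no <body> tag to look inside
        return None
    return title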

title = get_title('http://www.pythonscraping.com/pages/page1.html')

if title is None:
    print('Title could not be found')
else:
    print(title)

<h1>An Interesting Title</h1>
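The two behaviours that motivate the check above (a missing tag returns None, and chaining on that None raises AttributeError) can also be seen in isolation. This is a sketch under assumptions: `sample_html` is made-up inline markup, and `first_heading` is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with an <h1> but no <nav> element.
sample_html = "<html><body><h1>An Interesting Title</h1></body></html>"
bs = BeautifulSoup(sample_html, "html.parser")

# Accessing a tag that does not exist returns None rather than raising.
missing = bs.nav

# Going one level deeper on that None is what raises AttributeError.
try:
    bs.nav.a
    chained_error = False
except AttributeError:
    chained_error = True


def first_heading(soup):
    """Return the text of <body><h1>, or None if either tag is absent."""
    body = soup.body
    if body is None or body.h1 is None:
        return None
    return body.h1.get_text()


found = first_heading(bs)
not_found = first_heading(BeautifulSoup("<p>no heading</p>", "html.parser"))
```

Checking for None explicitly, as `first_heading` does, is an alternative to catching AttributeError; which style reads better is a matter of preference.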
