## Get HTML from a single page

In [1]:
from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())  # read() returns bytes

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


The code above prints the contents of the HTML file page1.html, found in the directory <web root>/pages on the server at the domain name http://pythonscraping.com.

```.read()``` only works once: the response is a stream, so a second call returns empty bytes.

In [2]:
html.read()

b''
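
Because the body can only be read once, one common pattern is to store the bytes in a variable and work from that copy. The sketch below is a minimal illustration, assuming a `data:` URL in place of the real page so that it runs without network access; `page_url`, `raw`, and `leftover` are illustrative names.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: a data: URL stands in for a real page so the sketch runs offline.
page = '<html><body><h1>An Interesting Title</h1></body></html>'
page_url = 'data:text/html,' + quote(page)

response = urlopen(page_url)
raw = response.read()        # the first read consumes the whole body
leftover = response.read()   # nothing is left for a second read
text = raw.decode('utf-8')   # keep working with the stored copy
```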

### error handling

In [3]:
from urllib.error import HTTPError, URLError

If the page is not found on the server, an HTTPError is raised.

In [4]:
try:
    html = urlopen('http://pythonscraping.com/pages/page_balabala.html')
except HTTPError as e:
    print(e)

HTTP Error 404: Not Found


If the server is not found at all, a URLError is raised.

In [5]:
try:
    html = urlopen('http://pythonscraping_balabala.com/pages/page_balabala.html')
except URLError as e:
    print(e)

<urlopen error [Errno 8] nodename nor servname provided, or not known>
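
The two cases above can be handled in one place. The sketch below is one possible arrangement, not the only one: `fetch` is a hypothetical helper name. Since HTTPError is a subclass of URLError, it has to be caught first. An unknown URL scheme is used to trigger URLError offline.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with urlopen(url) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print('HTTP error:', e.code)
    except URLError as e:
        # The server could not be reached at all (bad domain, no network, ...).
        print('URL error:', e.reason)
    return None


# Assumption for an offline demonstration: an unknown scheme also raises URLError.
result = fetch('nosuch://example')
```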


## Beautiful Soup

In [6]:
from bs4 import BeautifulSoup

In [7]:
html = urlopen('http://pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
bs

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

In [8]:
bs.h1

<h1>An Interesting Title</h1>

In [9]:
bs.div

<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>

The expressions below return the same result, since the bs object searches down the element tree.

In [10]:
bs.html.body.h1

<h1>An Interesting Title</h1>

In [11]:
bs.body.h1

<h1>An Interesting Title</h1>

In [12]:
bs.html.h1

<h1>An Interesting Title</h1>

If a nonexistent tag is requested, None is returned.

In [13]:
type(bs.non_existent_tag)

NoneType
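
Since a missing tag comes back as None, chaining a further attribute onto it raises AttributeError. A minimal sketch of checking each step explicitly follows; `get_h1_text` is a hypothetical helper, and the inline markup is an assumption standing in for a downloaded page.

```python
from bs4 import BeautifulSoup


def get_h1_text(markup):
    """Return the text of body > h1, or None if either tag is missing."""
    bs = BeautifulSoup(markup, 'html.parser')
    body = bs.body
    if body is None or body.h1 is None:
        return None
    return body.h1.get_text()


found = get_h1_text('<html><body><h1>An Interesting Title</h1></body></html>')
missing = get_h1_text('<html><head></head></html>')
```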

### Different Parser

|parser|speed|tolerance of malformed HTML|built-in|
|---|---|---|---|
|'html.parser'|medium|least forgiving|yes|
|'lxml'|fastest|more forgiving|no|
|'html5lib'|slowest|most forgiving|no|
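
Because only 'html.parser' is built in, a script that prefers another parser may want a fallback. The sketch below assumes that Beautiful Soup raises `bs4.FeatureNotFound` when the requested parser is not installed; `make_soup` and its `preferred` parameter are hypothetical names.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=('lxml', 'html.parser')):
    """Parse markup with the first available parser; return (soup, parser_name)."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed, so try the next one.
            continue
    raise RuntimeError('none of the requested parsers is available')


soup, used_parser = make_soup('<p>Hello')
```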

### Error Handling

In [14]:
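try:
    bs.non_existent_tag  # returns None, so no exception is raised here
except AttributeError as e:
    # Not reached: AttributeError only occurs when another attribute is
    # chained onto the None, e.g. bs.non_existent_tag.some_tag
    print(e)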

In [15]:
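class Tag:
    """Wrap a Beautiful Soup tag so that a missing child raises instead of returning None."""

    def __init__(self, bs_tag):
        self.bs_tag = bs_tag

    def get(self, tag):
        # find() returns None when no matching descendant exists
        bs_tag = self.bs_tag.find(tag)

        if bs_tag is None:
            raise AttributeError('{} not found'.format(tag))
        return Tag(bs_tag)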

In [16]:
bs.body.h1

<h1>An Interesting Title</h1>

### Get text within tag

In [17]:
bs.body.h1.get_text()

'An Interesting Title'

In [18]:
bs.find('body').find('h1')

<h1>An Interesting Title</h1>

In [19]:
Tag(bs).get('body').get('h1').bs_tag

<h1>An Interesting Title</h1>

In [20]:
Tag(bs).get('bodyy').get('h1')

AttributeError: bodyy not found

## Robustness

Avoid hard-coded paths like the one below, since the slightest change to the page will break the scraper.

In [None]:
bs.find_all('table')[4].find_all('tr')[2].find('td').find_all('div')[1].find('a')

Instead, check which tables actually exist on the page.

In [None]:
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.read(), 'html.parser')

### get tag attributes

In [None]:
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.read(), 'html.parser')

In [None]:
img_tag = bs.find('img')
img_tag.attrs
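
A short offline sketch of reading attributes follows. The inline markup is an assumption standing in for page3.html, and `attributes`, `src`, and `missing` are illustrative names.

```python
from bs4 import BeautifulSoup

# Assumption: a small inline document instead of the downloaded page.
markup = '<div><img src="../img/gifts/img1.jpg" alt="gift"><img alt="no source"></div>'
bs = BeautifulSoup(markup, 'html.parser')

img_tag = bs.find('img')
attributes = img_tag.attrs                   # all attributes as a dictionary
src = img_tag['src']                         # raises KeyError if the attribute is absent
missing = bs.find_all('img')[1].get('src')   # .get() returns None instead
```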

### Search for multiple tags

In [None]:
bs.findAll(['img', 'h1'])

In [None]:
bs.findAll('h1')

In [None]:
bs.find('h1')

### Filter by class and id

In [None]:
bs.findAll('tr', {'class':'gift', 'id':'gift1'}) # use '' instead of 'tr' to match any tag

In [None]:
bs.findAll('tr', class_='gift', id='gift1') # keyword form of the same filter
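
The two filter forms can be compared on a small table. The markup below is an assumption loosely modeled on a gift list, not a copy of page3.html.

```python
from bs4 import BeautifulSoup

# Assumption: a reduced gift table for illustration.
markup = '''
<table id="giftList">
<tr><th>Item Title</th><th>Cost</th></tr>
<tr id="gift1" class="gift"><td>Vegetable Basket</td><td>$15.00</td></tr>
<tr id="gift2" class="gift"><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>
</table>
'''
bs = BeautifulSoup(markup, 'html.parser')

by_dict = bs.find_all('tr', {'class': 'gift', 'id': 'gift1'})
by_kwargs = bs.find_all('tr', class_='gift', id='gift1')   # class_ avoids the reserved word
all_gifts = bs.find_all(class_='gift')                     # no tag name: any tag matches
```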

### Filter by text

In [None]:
bs.findAll(text='\nVegetable Basket\n')

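The dictionary form below won't work: a dictionary passed as the first argument is treated as the tag-name filter, not as a text filter.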

In [None]:
bs.findAll({'text':'\nVegetable Basket\n'})
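
Matching on text is exact, including surrounding whitespace, which is why the newlines appear in the search string. The sketch below assumes a one-row table and uses the `string` argument, the newer name for `text`; a compiled regular expression is one way to avoid depending on the whitespace.

```python
import re

from bs4 import BeautifulSoup

# Assumption: a one-row table with the same whitespace around the cell text.
markup = '<table><tr><td>\nVegetable Basket\n</td><td>$15.00</td></tr></table>'
bs = BeautifulSoup(markup, 'html.parser')

exact = bs.find_all(string='\nVegetable Basket\n')             # whitespace must match
stripped = bs.find_all(string='Vegetable Basket')              # no match without it
pattern = bs.find_all(string=re.compile('Vegetable Basket'))   # substring match
```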

### Tag Tree Navigation

In [None]:
tag = bs.findAll(text='\nVegetable Basket\n')[0]
tag

In [None]:
tag.parent

In [None]:
list(tag.parent.children)

In [None]:
list(tag.parent.previous_siblings)

In [None]:
list(tag.parent.next_siblings)
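
The navigation attributes can be tried on a compact table. The markup is an assumption and is written without whitespace between tags; on a real page, sibling lists usually also contain whitespace-only text nodes.

```python
from bs4 import BeautifulSoup

# Assumption: a two-row table with no whitespace between tags.
markup = ('<table>'
          '<tr id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>'
          '<tr id="gift2"><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>'
          '</table>')
bs = BeautifulSoup(markup, 'html.parser')

text_node = bs.find(string='Vegetable Basket')
cell = text_node.parent                       # the <td> holding the text
row = cell.parent                             # the <tr> holding the cell
price_cells = list(cell.next_siblings)        # cells after it in the same row
later_rows = [r['id'] for r in row.next_siblings]
earlier_rows = list(row.previous_siblings)
```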

### search by regular expression

In [None]:
import re

# Raw string avoids invalid-escape warnings; '/' needs no escaping in Python regexes
bs.findAll('img', {'src': re.compile(r'.*/gifts/.*\.jpg')})
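
An offline sketch of the same idea follows, assuming three illustrative image paths. Only the `.jpg` file under `/gifts/` should match.

```python
import re

from bs4 import BeautifulSoup

# Assumption: three illustrative image paths.
markup = ('<div>'
          '<img src="../img/gifts/img1.jpg">'
          '<img src="../img/gifts/logo.png">'
          '<img src="../img/banner.jpg">'
          '</div>')
bs = BeautifulSoup(markup, 'html.parser')

gift_photos = bs.find_all('img', {'src': re.compile(r'/gifts/.*\.jpg$')})
sources = [img['src'] for img in gift_photos]
```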

### search by lambda 

In [None]:
bs.findAll(lambda tag:('src' in tag.attrs))
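
Any function that takes a tag and returns True or False can serve as a filter. The sketch below assumes a small inline document and shows two such filters.

```python
from bs4 import BeautifulSoup

# Assumption: a small inline document for illustration.
markup = ('<div>'
          '<img src="a.jpg" alt="first">'
          '<img src="b.jpg">'
          '<span class="note">text</span>'
          '</div>')
bs = BeautifulSoup(markup, 'html.parser')

with_src = bs.find_all(lambda tag: 'src' in tag.attrs)     # tags that have a src attribute
two_attrs = bs.find_all(lambda tag: len(tag.attrs) == 2)   # tags with exactly two attributes
```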

## Alternatives
* try another website
* check url
* check javascript

## Tips

To make your scraper more robust, make tag selection as specific as possible.

## Web crawler

The deep web is any part of the web that’s not part of the surface web. The surface web is the part of the internet that is indexed by search engines. Estimates vary widely, but the deep web almost certainly makes up about 90% of the internet.

The dark web, also known as the darknet, is another beast entirely. It is run over the existing network hardware infrastructure but uses Tor, or another client, with an application protocol that runs on top of HTTP, providing a secure channel to exchange information.

### Traverse the whole site

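* generate a sitemap
* gather data

Warning: a crawler that calls itself recursively for every link can hit Python's recursion limit (1000 by default) and crash.

Keys:

* Avoid duplicated URLs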
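
Both points can be sketched together: a set of visited URLs prevents duplicates, and a queue replaces recursion so the recursion limit never comes into play. Everything here is hypothetical: `SITE` is an in-memory stand-in for a website, and `get_links` stands in for downloading a page and extracting its links.

```python
from collections import deque

# Hypothetical in-memory site: each page maps to the internal links it contains.
SITE = {
    '/': ['/about', '/products'],
    '/about': ['/', '/contact'],
    '/products': ['/', '/products/1', '/about'],
    '/products/1': ['/products'],
    '/contact': [],
}


def get_links(page):
    """Stand-in for fetching a page and extracting its internal links."""
    return SITE.get(page, [])


def crawl(start):
    """Visit every reachable page once, using a queue instead of recursion."""
    visited = set()
    order = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page in visited:
            continue              # already crawled: skip the duplicate
        visited.add(page)
        order.append(page)
        for link in get_links(page):
            if link not in visited:
                queue.append(link)
    return order


sitemap = crawl('/')
```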

### Redirect

* Server-side redirects, where the URL is changed before the page is loaded

 If you’re using the urllib library with Python 3.x, it handles redirects automatically. The requests library also follows redirects by default; the allow_redirects flag controls this and can be set explicitly:
 
 ```r = requests.get('http://github.com', allow_redirects=True)```
* Client-side redirects, sometimes seen with a “You will be redirected in 10 seconds” type of message, where the page loads before redirecting to the new one
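
One common form of client-side redirect is a `<meta http-equiv="refresh">` tag; JavaScript redirects are another and are not covered here. The sketch below is a rough illustration, assuming the usual `content="seconds; url=target"` layout; `find_meta_refresh` is a hypothetical helper.

```python
import re

from bs4 import BeautifulSoup


def find_meta_refresh(markup):
    """Return the target URL of a meta refresh tag, or None if there is none."""
    bs = BeautifulSoup(markup, 'html.parser')
    meta = bs.find('meta', attrs={'http-equiv': re.compile(r'^refresh$', re.I)})
    if meta is None or not meta.get('content'):
        return None
    # Assumed layout of the content attribute: "<seconds>; url=<target>"
    match = re.search(r'url\s*=\s*["\']?([^"\';]+)', meta['content'], re.I)
    return match.group(1).strip() if match else None


page = ('<html><head>'
        '<meta http-equiv="refresh" content="10; url=http://example.com/new">'
        '</head><body>You will be redirected in 10 seconds</body></html>')
target = find_meta_refresh(page)
```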

