# Overview of Web Scraping with Python

## Python's urllib Library

* One useful package for web scraping that you can find in Python’s standard library is __urllib__, which contains tools for working with __URLs__. 
* In particular, the __urllib.request__ module contains a function called __urlopen()__ that can be used to open a __URL__ within a program.

In [1]:
from urllib.request import urlopen
# urlopen is a plain function, so dir() lists only its function attributes:
print(dir(urlopen))

['__annotations__', '__builtins__', '__call__', '__class__', '__closure__', '__code__', '__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__get__', '__getattribute__', '__globals__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__kwdefaults__', '__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__', '__qualname__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']


* To open a web page, pass a URL to the __urlopen()__ function.

In [2]:
url = "http://olympus.realpython.org/profiles/aphrodite"

In [3]:
# To open the web page, pass url to urlopen():
page = urlopen(url)

# urlopen() returns an HTTPResponse object:
page

<http.client.HTTPResponse at 0x15485acdf00>

* To extract the __HTML__ from the page, first use the __HTTPResponse__ object’s __.read()__ method, which returns a sequence of bytes. 
* Then use __.decode()__ to decode the bytes to a string using UTF-8:

In [7]:
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode('utf-8')
print(html)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>
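
The fetch-and-decode steps above can be wrapped in a small helper. The following is only a sketch: the function names `decode_html` and `fetch_html` are hypothetical, and it assumes you want to use the character set announced in the response headers when there is one, falling back to UTF-8 otherwise. It also opens the URL in a `with` block so the connection is closed once the bytes have been read.

```python
from urllib.request import urlopen


def decode_html(raw_bytes, charset=None):
    """Decode raw response bytes to a string.

    Uses the given charset if there is one, otherwise assumes UTF-8.
    Bytes that cannot be decoded are replaced rather than raising an error.
    """
    return raw_bytes.decode(charset or "utf-8", errors="replace")


def fetch_html(url):
    """Open the URL, read the response body, and return it as text."""
    with urlopen(url) as page:
        # The headers may or may not name a charset; this returns None if not.
        charset = page.headers.get_content_charset()
        return decode_html(page.read(), charset)
```

With this in place, `html = fetch_html(url)` would stand in for the three separate steps of opening, reading, and decoding.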



* Once you have the HTML as text, you can extract information from it in a couple of different ways.

### Extract Text From HTML Using String Methods

* One way to extract information from a web page’s HTML is to use string methods. For instance, you can use __.find()__ to search the text of the HTML for the `<title>` tags and extract the title of the web page.

* Since __.find()__ returns the index of the first occurrence of a substring, you can get the index of the opening `<title>` tag by passing the string `"<title>"` to __.find()__:

In [8]:
title_index = html.find('<title>')
title_index

14
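
One detail to keep in mind is that __.find()__ matches the substring exactly and returns -1 when there is no match. The short sketch below illustrates this with a hypothetical `sample` string that stands in for the first lines of the downloaded HTML, so it runs without a network connection.

```python
# A stand-in for the start of the downloaded page (assumed for illustration).
sample = "<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>\n</html>"

found = sample.find("<title>")        # exact match: index of the opening tag
with_spaces = sample.find("< title >")  # extra spaces: no match
upper_case = sample.find("<TITLE>")   # different case: no match

print(found, with_spaces, upper_case)
```

Because of this, a tag written with different spacing or capitalization would not be found by this approach.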

* You don’t want the index of the `<title>` tag, though. You want the index of the title itself. To get the index of the first letter in the title, add the length of the string `"<title>"` to __title_index__:

In [9]:
start_index = title_index + len("<title>")
start_index

21

* Now get the index of the closing `</title>` tag by passing the string `"</title>"` to __.find()__:

In [10]:
end_index = html.find("</title>")
end_index

39

* Finally, you can extract the title by slicing the html string:

In [11]:
title = html[start_index:end_index]
title

'Profile: Aphrodite'
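
The four steps above can be combined into one function. This is a minimal sketch rather than a robust parser: the name `extract_title` is hypothetical, it assumes the tags appear exactly as `<title>` and `</title>`, and it returns `None` when either tag is missing instead of slicing with a -1 index. It also starts the search for the closing tag at `start_index`, so a stray `</title>` earlier in the text is ignored.

```python
def extract_title(html):
    """Return the text between <title> and </title>, or None if not found."""
    open_tag, close_tag = "<title>", "</title>"

    title_index = html.find(open_tag)
    if title_index == -1:
        return None  # no opening tag in this exact form

    # Skip past the opening tag to the first character of the title.
    start_index = title_index + len(open_tag)

    # Look for the closing tag only after the opening tag.
    end_index = html.find(close_tag, start_index)
    if end_index == -1:
        return None  # opening tag without a matching closing tag

    return html[start_index:end_index]
```

Calling `extract_title(html)` on the page downloaded earlier should give the same string as the manual slicing.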