### Web Scraper
Let’s start by grabbing all the HTML code from a single web page. You’ll use a page on Real Python that’s been set up for use with this tutorial.

One useful package for web scraping that you can find in Python’s standard library is __urllib__, which contains tools for working with __URLs__. In particular, the __urllib.request__ module contains a function called __urlopen()__ that can be used to open a __URL__ within a program.

In [2]:
from urllib.request import urlopen
print(dir(urlopen))

['__annotations__', '__call__', '__class__', '__closure__', '__code__', '__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__get__', '__getattribute__', '__globals__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__kwdefaults__', '__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__', '__qualname__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']


In [3]:
url = "http://olympus.realpython.org/profiles/aphrodite"
# To open the web page, pass url to urlopen():
page = urlopen(url)

# urlopen() returns an HTTPResponse object:
page

<http.client.HTTPResponse at 0x1de63f35c08>

To extract the HTML from the page, first use the __HTTPResponse__ object’s __.read()__ method, which returns a sequence of bytes. Then use .decode() to decode the bytes to a string using UTF-8:

In [6]:
html_bytes = page.read()
html = html_bytes.decode('utf-8')
print(html)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>



Once you have the HTML as text, you can extract information from it in a couple of different ways.
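
To make the bytes-versus-text distinction concrete, here is a minimal offline sketch. The `decode_html()` helper and the sample bytes are assumptions made for illustration; they stand in for what `page.read()` would return and are not part of `urllib`.

```python
# Stand-in for the bytes that page.read() returns (sample data, not a real response).
html_bytes = "<title>Café Olympus</title>".encode("utf-8")


def decode_html(raw, charset="utf-8"):
    """Hypothetical helper: turn raw response bytes into a str."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    return raw.decode(charset, errors="replace")


html = decode_html(html_bytes)
print(type(html_bytes).__name__, "->", type(html).__name__)
print(html)
```

Using `errors="replace"` is a design choice for tolerant scraping; leaving it out makes `.decode()` raise `UnicodeDecodeError` on malformed input, which may be preferable when you want to notice encoding problems early.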

### Extract Text From HTML With String Methods
One way to extract information from a web page’s HTML is to use string methods. For instance, you can use __.find()__ to search through the text of the HTML for the <title> tags and extract the title of the web page.

Let’s extract the __title__ of the web page you requested in the previous example. If you know the index of the first character of the title and the first character of the closing </title> tag, then you can use a string slice to extract the title.

Since __.find()__ returns the index of the first occurrence of a substring, you can get the index of the opening __<title>__ tag by passing the string "<title>" to __.find()__:

In [8]:
title_index = html.find('<title>')
title_index

14

You don’t want the index of the <title> tag, though. You want the index of the title itself. To get the index of the first letter in the title, you can add the length of the string "<title>" to title_index:

In [9]:
start_index = title_index + len("<title>")
start_index

21

Now get the index of the closing </title> tag by passing the string "</title>" to .find():

In [10]:
end_index = html.find("</title>")
end_index

39

Finally, you can extract the title by slicing the html string:

In [11]:
title = html[start_index:end_index]
title

'Profile: Aphrodite'
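
The four steps above can be folded into one small function. This is only a sketch: `extract_between()` is a hypothetical helper name, and `sample_html` is a shortened stand-in for the page. It also handles the case the walkthrough skips, where `.find()` returns -1 because a marker is missing.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:  # .find() returns -1 when the substring is absent
        return None
    start += len(start_marker)  # skip past the opening marker itself
    end = text.find(end_marker, start)  # search only after the opening marker
    if end == -1:
        return None
    return text[start:end]


sample_html = "<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>\n</html>"
title = extract_between(sample_html, "<title>", "</title>")
print(title)
```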

### Regular Expressions
__Regular expressions__, or __regexes__ for short, are patterns that can be used to search for text within a string. Python supports regular expressions through the standard library’s __re__ module.

In [12]:
import re 
print(dir(re))

['A', 'ASCII', 'DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'Match', 'Pattern', 'RegexFlag', 'S', 'Scanner', 'T', 'TEMPLATE', 'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '__version__', '_cache', '_compile', '_compile_repl', '_expand', '_locale', '_pickle', '_special_chars_map', '_subx', 'compile', 'copyreg', 'enum', 'error', 'escape', 'findall', 'finditer', 'fullmatch', 'functools', 'match', 'purge', 'search', 'split', 'sre_compile', 'sre_parse', 'sub', 'subn', 'template']


__Regular expressions__ use special characters called __metacharacters__ to denote different patterns. For instance, the asterisk character (*) stands for zero or more of whatever comes just before the asterisk.

In the following example, you use __findall()__ to find any text within a string that matches a given regular expression:

In [14]:
re.findall("ab*c", "ac")

['ac']

The first argument of __re.findall()__ is the regular expression that you want to match, and the second argument is the string to test. In the above example, you search for the pattern "ab*c" in the string "ac".

The regular expression "ab*c" matches any part of the string that begins with an "a", ends with a "c", and has zero or more instances of "b" between the two. __re.findall()__ returns a list of all matches. The string "ac" matches this pattern, so it’s returned in the list.

Here’s the same pattern applied to different strings:

In [15]:
re.findall("ab*c", "abcd")

['abc']

In [16]:
re.findall("ab*c", "acc")

['ac']

In [17]:
re.findall("ab*c", "abcac")

['abc', 'ac']

In [18]:
re.findall("ab*c", "abdc")

[]

Notice that if no match is found, then findall() returns an empty list.

Pattern matching is case sensitive. If you want to match this pattern regardless of the case, then you can pass a third argument with the value __re.IGNORECASE__:

In [19]:
re.findall("ab*c", "ABC")

[]

In [20]:
re.findall("ab*c", "ABC", re.IGNORECASE)

['ABC']

You can use a period (.) to stand for any single character in a regular expression. For instance, you could find all the strings that contain the letters "a" and "c" separated by a single character as follows:

In [21]:
re.findall("a.c", "abc")

['abc']

In [22]:
re.findall("a.c", "abbc")

[]

In [25]:
re.findall("a.c", "ac")

[]

In [26]:
re.findall("a.c", "acc")

['acc']
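
One detail worth knowing when you apply this to HTML: by default the period does not match a newline character. The sketch below uses a made-up two-line string and the `re.DOTALL` flag to show the difference.

```python
import re

text = "a\nc"  # "a" and "c" separated by a newline

# By default, "." matches any character except a newline.
default_matches = re.findall("a.c", text)

# With re.DOTALL, "." matches newlines as well.
dotall_matches = re.findall("a.c", text, re.DOTALL)

print(default_matches)
print(dotall_matches)
```

This matters for scraping because the text between two tags often spans several lines.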

The pattern __.*__ inside a regular expression stands for any character repeated any number of times. For instance, "a.*c" can be used to find every substring that starts with "a" and ends with "c", regardless of which letter—or letters—are in between:

In [27]:
re.findall("a.*c", "abc")

['abc']

In [28]:
re.findall("a.*c", "abbc")

['abbc']

In [29]:
re.findall("a.*c", "ac")

['ac']

In [30]:
re.findall("a.*c", "acc")

['acc']

Often, you use __re.search()__ to search for a particular pattern inside a string. This function is somewhat more complicated than __re.findall()__ because, instead of a list of strings, it returns a __Match__ object describing the first place in the string where the pattern matches, or None if the pattern doesn’t match anywhere. The __Match__ object stores the matched text along with any groups captured inside it.

Calling __.group()__ on a __Match__ object with no arguments returns the entire matched text, which in most cases is just what you want:

In [32]:
match_results = re.search("ab*c", "ABC", re.IGNORECASE)
match_results.group()

'ABC'
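
Because `re.search()` returns None when the pattern is not found, calling `.group()` directly on the result can raise an `AttributeError`. A minimal sketch of a guard follows; `find_first()` is a hypothetical helper name, not part of the `re` module.

```python
import re


def find_first(pattern, text, flags=0):
    """Return the first matched text, or None when nothing matches."""
    match = re.search(pattern, text, flags)
    # A Match object is truthy; None means the pattern was not found.
    return match.group() if match else None


found = find_first("ab*c", "ABC", re.IGNORECASE)
missing = find_first("ab*c", "abd")
print(found, missing)
```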

There’s one more function in the re module that’s useful for parsing out text. __re.sub()__, which is short for __substitute__, allows you to replace text in a string that matches a regular expression with new text. It behaves sort of like the __.replace()__ string method.

The arguments passed to __re.sub()__ are the regular expression, followed by the replacement text, followed by the string. Here’s an example:

In [33]:
string = "Everything is <replaced> if it's in <tags>."
string = re.sub("<.*>", "ELEPHANTS", string)
string

'Everything is ELEPHANTS.'

Perhaps that wasn’t quite what you expected to happen.

__re.sub()__ uses the regular expression "<.*>" to find and replace everything between the first < and last >, which spans from the beginning of <replaced> to the end of <tags>. This is because Python’s regular expressions are __greedy__, meaning they try to find the longest possible match when characters like * are used.

Alternatively, you can use the non-greedy matching pattern __*?__, which works the same way as * except that it matches the shortest possible string of text:

In [34]:
string = "Everything is <replaced> if it's in <tags>."
string = re.sub("<.*?>", "ELEPHANTS", string)
string

"Everything is ELEPHANTS if it's in ELEPHANTS."

This time, __re.sub()__ finds two matches, <replaced> and <tags>, and substitutes the string "ELEPHANTS" for both matches.
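
To see exactly what each pattern matched, rather than what it was replaced with, you can run both patterns through `re.findall()` on the same string. This is just a side-by-side sketch of the greedy and non-greedy behavior described above.

```python
import re

string = "Everything is <replaced> if it's in <tags>."

# Greedy: one long match from the first "<" to the last ">".
greedy = re.findall("<.*>", string)

# Non-greedy: the shortest match each time, so two separate tags.
non_greedy = re.findall("<.*?>", string)

print(greedy)
print(non_greedy)
```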

### Extract Text From HTML With Regular Expressions
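Let’s now try to parse out the title from a new profile page (http://olympus.realpython.org/profiles/dionysus), whose title line is rather carelessly written: the opening <title> tag contains an extra space, the closing tag contains an extraneous forward slash (/), and the two tags don’t use consistent capitalization.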

The __.find()__ method would have a difficult time dealing with the inconsistencies here, but with the clever use of regular expressions, you can handle this code quickly and efficiently:

In [35]:
import re 
from urllib.request import urlopen

In [37]:
url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title) # Remove HTML tags

print(title)

Profile: Dionysus
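
The same idea can be tried offline with a capturing group, which avoids the second `re.sub()` pass. Treat this as a sketch under stated assumptions: `sample_html` is a made-up stand-in for a carelessly written title line, not text copied from the live page.

```python
import re

# Made-up sample of a messy title line (an assumption for illustration).
sample_html = "<html><head><TITLE >Profile: Dionysus</title  / ></head></html>"

# The parentheses capture only the text between the two tags.
pattern = "<title.*?>(.*?)</title.*?>"
match_results = re.search(pattern, sample_html, re.IGNORECASE)

# re.search() returns None when nothing matches, so check before using it.
title = match_results.group(1) if match_results else None
print(title)
```

Here `.group(1)` returns the first captured group, while `.group()` with no arguments would return the whole match including the tags.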


## Use an HTML Parser for Web Scraping in Python
Although regular expressions are great for pattern matching in general, sometimes it’s easier to use an HTML parser that’s explicitly designed for parsing out HTML pages. There are many Python tools written for this purpose, but the __Beautiful Soup__ library is a good one to start with.

#### Create a BeautifulSoup Object

In [39]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [41]:
url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

This program does three things:

1. Opens the URL http://olympus.realpython.org/profiles/dionysus using urlopen() from the urllib.request module
2. Reads the HTML from the page as a string and assigns it to the html variable
3. Creates a BeautifulSoup object and assigns it to the soup variable

The __BeautifulSoup__ object assigned to soup is created with two arguments. The first argument is the HTML to be parsed, and the second argument, the string "html.parser", tells the object which parser to use behind the scenes. "html.parser" represents Python’s built-in HTML parser.
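
For a rough sense of what the built-in parser offers on its own, the sketch below uses the standard library’s `html.parser.HTMLParser` class directly, without Beautiful Soup. The `ImageSourceCollector` class name and `sample_html` are assumptions made for this illustration.

```python
from html.parser import HTMLParser


class ImageSourceCollector(HTMLParser):
    """Hypothetical example class: collects the src of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.sources.append(value)


# Made-up fragment resembling a profile page.
sample_html = (
    '<center><img src="/static/dionysus.jpg" />'
    "<h2>Name: Dionysus</h2>"
    '<img src="/static/grapes.png" /></center>'
)

collector = ImageSourceCollector()
collector.feed(sample_html)
print(collector.sources)
```

The event-driven style shown here takes noticeably more code than a tree you can query, which is much of the reason to reach for Beautiful Soup.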

#### Use a BeautifulSoup Object
Run the cell above. When it’s finished running, you can use the soup variable in later cells to parse the content of html in various ways.

For example, __BeautifulSoup__ objects have a __.get_text()__ method that can be used to extract all the text from the document and automatically remove any HTML tags.

In [42]:
print(soup.get_text())



Profile: Dionysus





Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard 

Favorite Color: Wine






There are a lot of blank lines in this output. These are the result of newline characters in the HTML document’s text. You can remove them with the string __.replace()__ method if you need to.
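
A single `.replace("\n\n", "\n")` call only halves each run of newlines, so one alternative is to split the text into lines and keep the non-empty ones. This is a sketch on a hand-typed `raw_text` string that imitates the output above; it is not produced by Beautiful Soup here.

```python
# Hand-typed imitation of the get_text() output shown above.
raw_text = (
    "\n\n\nProfile: Dionysus\n\n\n\n\nName: Dionysus\n\n"
    "Hometown: Mount Olympus\n\nFavorite animal: Leopard \n\n"
    "Favorite Color: Wine\n\n\n"
)

# Strip surrounding whitespace from each line, then drop the empty ones.
lines = [line.strip() for line in raw_text.splitlines()]
clean_text = "\n".join(line for line in lines if line)

print(clean_text)
```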

Often, you need to get only specific text from an HTML document. Using Beautiful Soup first to extract the text and then using the __.find()__ string method is sometimes easier than working with regular expressions.

However, sometimes the HTML tags themselves are the elements that point out the data you want to retrieve. For instance, perhaps you want to retrieve the URLs for all the images on the page. These links are contained in the src attribute of <img> HTML tags.

In this case, you can use find_all() to return a list of all instances of that particular tag:

In [43]:
soup.find_all("img")

[<img src="/static/dionysus.jpg"/>, <img src="/static/grapes.png"/>]

This returns a list of all <img> tags in the HTML document. The objects in the list look like they might be strings representing the tags, but they’re actually instances of the Tag object provided by Beautiful Soup. Tag objects provide a simple interface for working with the information they contain.

Let’s explore this a little by first unpacking the Tag objects from the list:

In [44]:
image1, image2 = soup.find_all("img")

Each Tag object has a .name property that returns a string containing the HTML tag type:

In [45]:
image1.name

'img'

You can access the HTML attributes of the Tag object by putting their name between square brackets, just as if the attributes were keys in a dictionary.

For example, the <img src="/static/dionysus.jpg"/> tag has a single attribute, src, with the value "/static/dionysus.jpg". Likewise, an HTML tag such as the link <a href="https://realpython.com" target="_blank"> has two attributes, href and target.

To get the source of the images in the Dionysus profile page, you access the src attribute using the dictionary notation mentioned above:

In [46]:
image1["src"]

'/static/dionysus.jpg'

In [47]:
image2["src"]

'/static/grapes.png'
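
The src values above are relative paths, so they can’t be requested on their own. One way to turn them into full URLs is `urljoin()` from the standard library’s `urllib.parse` module. In this sketch, the list of paths is typed in by hand rather than read from the Tag objects.

```python
from urllib.parse import urljoin

page_url = "http://olympus.realpython.org/profiles/dionysus"

# Relative paths as they appear in the src attributes above.
relative_sources = ["/static/dionysus.jpg", "/static/grapes.png"]

# urljoin() resolves each path against the URL of the page it came from.
absolute_sources = [urljoin(page_url, src) for src in relative_sources]

for source in absolute_sources:
    print(source)
```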

Certain tags in HTML documents can be accessed by properties of the Tag object. For example, to get the <title> tag in a document, you can use the .title property:

In [48]:
soup.title

<title>Profile: Dionysus</title>

Beautiful Soup automatically cleans up the tags for you by removing the extra space in the opening tag and the extraneous forward slash (/) in the closing tag.

You can also retrieve just the string between the title tags with the .string property of the Tag object:

In [49]:
soup.title.string

'Profile: Dionysus'

One of the more useful features of __Beautiful Soup__ is the ability to search for specific kinds of tags whose attributes match certain values. For example, if you want to find all the <img> tags that have a src attribute equal to the value /static/dionysus.jpg, then you can provide the following additional argument to .find_all():

In [50]:
soup.find_all("img", src="/static/dionysus.jpg")

[<img src="/static/dionysus.jpg"/>]

When scraping data from websites with Python, you’re often interested in particular parts of the page. By spending some time looking through the HTML document, you can identify tags with unique attributes that you can use to extract the data you need.

Then, instead of relying on complicated regular expressions or using .find() to search through the document, you can directly access the particular tag you’re interested in and extract the data you need.

Beautiful Soup is great for scraping data from a website’s HTML, but it doesn’t provide any way to work with HTML forms. For example, if you need to search a website for some query and then scrape the results, then Beautiful Soup alone won’t get you very far.

## Interact With HTML Forms
The __urllib__ module you’ve been working with so far in this tutorial is well suited for requesting the contents of a web page. Sometimes, though, you need to interact with a web page to obtain the content you need. For example, you might need to submit a form or click a button to display hidden content.

The Python standard library doesn’t provide a built-in means for working with web pages interactively, but many third-party packages are available from PyPI. Among these, __MechanicalSoup__ is a popular and relatively straightforward package to use.

In essence, MechanicalSoup installs what’s known as a headless browser, which is a web browser with no graphical user interface. This browser is controlled programmatically via a Python program.

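```py
# Install the package first from a terminal (this is a shell command, not Python):
#     pip install mechanicalsoup
import mechanicalsoup

# Create the headless browser object that sends requests on your behalf
browser = mechanicalsoup.Browser()
```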

##### References
1. https://realpython.com/python-web-scraping-practical-introduction/