## A Practical Introduction to Web Scraping in Python

https://realpython.com/python-web-scraping-practical-introduction/

### Table of Contents

*   Scrape and Parse Text From Websites
    *   Build Your First Web Scraper
    *   Extract Text From HTML With String Methods
    *   Get to Know Regular Expressions
    *   Extract Text From HTML With Regular Expressions
    *   Check Your Understanding
*   Use an HTML Parser for Web Scraping in Python
    *   Install Beautiful Soup
    *   Create a BeautifulSoup Object
    *   Use a BeautifulSoup Object
    *   Check Your Understanding
*   Interact With HTML Forms
    *   Install MechanicalSoup
    *   Create a Browser Object
    *   Submit a Form With MechanicalSoup
    *   Check Your Understanding
*   Interact With Websites in Real Time
*   Conclusion
*   Additional Resources



In [71]:
import re
import time
from bs4 import BeautifulSoup
import mechanicalsoup
from urllib.request import urlopen

In [None]:
# url = "http://olympus.realpython.org/profiles/aphrodite"
# url = "http://olympus.realpython.org/profiles/poseidon"
# url = "http://olympus.realpython.org/profiles/dionysus"
url = "http://olympus.realpython.org/profiles"
page = urlopen(url)
html = page.read().decode("utf-8")

In [None]:
# grab the HTML as text
print(html)

<html>
<head>
<title>All Profiles</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<h1>All Profiles:</h1>
<br><br>
<h2>
<a href="/profiles/aphrodite">Aphrodite</a>
<br><br>
<a href="/profiles/poseidon">Poseidon</a>
<br><br>
<a href="/profiles/dionysus">Dionysus</a>
</h2>
</center>
</body>
</html>



In [None]:
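# String Methods
# NOTE: the outputs in this section appear to come from a run with url set to
# the Dionysus profile, whose title tag is written as <TITLE >.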
title_index = html.find("<title>")
start_index = title_index + len("<title>")
end_index = html.find("</title>")
title = html[start_index:end_index]
title

'\n<head>\n<TITLE >Profile: Dionysus</title  / >\n</head>\n<body bgcolor="yellow">\n<center>\n<br><br>\n<img src="/static/dionysus.jpg" />\n<h2>Name: Dionysus</h2>\n<img src="/static/grapes.png"><br><br>\nHometown: Mount Olympus\n<br><br>\nFavorite animal: Leopard <br>\n<br>\nFavorite Color: Wine\n</center>\n</body>\n</html>'
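
The output above is most of the page rather than a title. `str.find` is case-sensitive and returns `-1` when the substring is missing, so with a `<TITLE >` tag the slice silently becomes `html[6:-1]`. A minimal sketch of that failure mode, using a shortened copy of the HTML shown above (`sample_html` and `find_title` are illustrative names, not part of the original notebook):

```python
sample_html = (
    "<html>\n<head>\n<TITLE >Profile: Dionysus</title  / >\n</head>\n"
    "<body>\n<h2>Name: Dionysus</h2>\n</body>\n</html>\n"
)

# str.find is case-sensitive and returns -1 when nothing matches.
title_index = sample_html.find("<title>")
end_index = sample_html.find("</title>")
print(title_index, end_index)


def find_title(html):
    """Return the text between <title> and </title>, or None if either tag is missing."""
    start = html.find("<title>")
    end = html.find("</title>")
    if start == -1 or end == -1:
        return None
    return html[start + len("<title>"):end].strip()


print(find_title(sample_html))
print(find_title("<title>All Profiles</title>"))
```

Returning `None` makes the miss explicit instead of producing a plausible-looking but wrong slice.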

In [None]:
for string in ["Name: ", "Favorite Color: "]:
    string_start_idx = html.find(string)
    text_start_idx = string_start_idx + len(string)

    nxt_html_tag_offset = html[text_start_idx:].find("<")
    text_end_idx = text_start_idx + nxt_html_tag_offset

    raw_text = html[text_start_idx:text_end_idx]
    clean_text = raw_text.strip(" \r\n\t")
    print(clean_text)

Dionysus
Wine
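
The loop above can be wrapped in a small function so that a label missing from the page is skipped rather than sliced from index `-1`. This is only a sketch: `extract_fields` and `sample_html` are hypothetical names, and the sample string is a shortened copy of the Dionysus page.

```python
def extract_fields(html, labels):
    """Map each label to the text between it and the next "<", skipping labels not found."""
    fields = {}
    for label in labels:
        label_idx = html.find(label)
        if label_idx == -1:
            continue  # label not present on this page
        text_start = label_idx + len(label)
        text_end = html.find("<", text_start)
        if text_end == -1:
            text_end = len(html)  # no later tag: take the rest of the string
        fields[label.rstrip(": ")] = html[text_start:text_end].strip()
    return fields


sample_html = (
    "<h2>Name: Dionysus</h2>\n"
    "Favorite animal: Leopard <br>\n"
    "<br>\n"
    "Favorite Color: Wine\n"
    "</center>"
)
profile = extract_fields(sample_html, ["Name: ", "Favorite Color: ", "Hometown: "])
print(profile)
```

Passing the start position as the second argument to `find` avoids building a sliced copy of the page for every label.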


In [None]:
# Regular Expressions
pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title) # Remove HTML tags
title

'Profile: Dionysus'
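
The pattern works on the malformed tags because `re.IGNORECASE` lets `<title` match `<TITLE`, and the non-greedy `.*?` absorbs any stray characters before each closing `>`. As a variation on the cell above (not the notebook's own approach), a capture group can pull out the title text directly instead of stripping tags with `re.sub`. The `re.DOTALL` flag and the `None` fallback are additions made here for illustration:

```python
import re

sample_html = "<html>\n<head>\n<TITLE >Profile: Dionysus</title  / >\n</head>\n</html>"

# Capture the text between the opening and closing title tags.
# Each .*? is non-greedy, so it stops at the earliest position that lets the
# rest of the pattern match.
title_pattern = re.compile(r"<title.*?>(.*?)</title.*?>", re.IGNORECASE | re.DOTALL)

match = title_pattern.search(sample_html)
title = match.group(1).strip() if match else None
print(title)
```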

### Beautiful Soup

In [None]:
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())



All Profiles




All Profiles:


Aphrodite

Poseidon

Dionysus







In [None]:
soup.title.string

'All Profiles'

In [None]:
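# The hrefs are root-relative ("/profiles/..."), so join them to the site root,
# not to the page URL, to avoid a doubled "/profiles" segment.
base_url = "http://olympus.realpython.org"
for link in soup.find_all("a"):
    link_url = base_url + link["href"]
    print(link_url)

http://olympus.realpython.org/profiles/aphrodite
http://olympus.realpython.org/profiles/poseidon
http://olympus.realpython.org/profiles/dionysus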
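
Each `href` on this page is root-relative (it starts with `/`), so it has to be resolved against the site root rather than appended to the page URL. One way to do that without hard-coding the root is the standard library's `urllib.parse.urljoin`. A minimal sketch, using the `href` values from the HTML shown earlier (`page_url` and `hrefs` are illustrative names):

```python
from urllib.parse import urljoin

page_url = "http://olympus.realpython.org/profiles"
hrefs = ["/profiles/aphrodite", "/profiles/poseidon", "/profiles/dionysus"]

# urljoin resolves each href against the page URL the way a browser would.
absolute_urls = [urljoin(page_url, href) for href in hrefs]
for link_url in absolute_urls:
    print(link_url)
```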


### MechanicalSoup

In [54]:
browser = mechanicalsoup.Browser()
url = "http://olympus.realpython.org/login"
login_page = browser.get(url)
login_html = login_page.soup

In [55]:
# form = login_html.select("form")[0]
form = login_html.form
# Fill in the first two <input> fields (assumed to be username and password)
form.select("input")[0]["value"] = "zeus"
form.select("input")[1]["value"] = "ThunderDude"

In [57]:
profiles_page = browser.submit(form, login_page.url)

In [66]:
links = profiles_page.soup.select("a")
base_url = "http://olympus.realpython.org"
for link in links:
    address = base_url + link["href"]
    text = link.text
    print(f"{text}: {address}")

Aphrodite: http://olympus.realpython.org/profiles/aphrodite
Poseidon: http://olympus.realpython.org/profiles/poseidon
Dionysus: http://olympus.realpython.org/profiles/dionysus


In [68]:
profiles_page.soup.title

<title>All Profiles</title>

### Real Time Interaction

In [75]:
# Request the dice page four times, pausing between requests
browser = mechanicalsoup.Browser()

for i in range(4):
    page = browser.get("http://olympus.realpython.org/dice")
    tag = page.soup.select("#result")[0]
    result = tag.text
    print(f"The result of your dice roll is: {result}")

    # Wait 10 seconds if this isn't the last request
    if i < 3:
        time.sleep(10)

The result of your dice roll is: 2
The result of your dice roll is: 6
The result of your dice roll is: 5
The result of your dice roll is: 2
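
The request-then-wait loop above can be factored into a reusable helper. The sketch below is an assumption-laden illustration, not part of the original notebook: `poll` is a hypothetical name, and the network request is replaced by a stand-in so the example runs offline.

```python
import time


def poll(fetch, attempts, delay_seconds, sleep=time.sleep):
    """Call fetch() `attempts` times, pausing between calls but not after the last one."""
    results = []
    for i in range(attempts):
        results.append(fetch())
        if i < attempts - 1:
            sleep(delay_seconds)
    return results


# Stand-in for the real request so the sketch runs without network access.
fake_rolls = iter(["2", "6", "5", "2"])
pauses = []  # records requested delays instead of actually sleeping
rolls = poll(lambda: next(fake_rolls), attempts=4, delay_seconds=10, sleep=pauses.append)
print(rolls)
print(pauses)
```

Against the real site, `fetch` would be a function that requests the dice page and returns the text of the `#result` element, and the default `sleep=time.sleep` would perform the actual wait.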


In [72]:
print("I'm about to wait for five seconds...")
time.sleep(5)
print("Done waiting!")

I'm about to wait for five seconds...
Done waiting!


### Conclusion

This notebook covered how to:

*   Request a web page using Python's built-in **urllib** module
*   Parse HTML using **Beautiful Soup**
*   Interact with web forms using **MechanicalSoup**
*   Repeatedly request data from a website to **check for updates**

