# Web Scraping Tutorial

[A Practical Introduction to Web Scraping in Python](https://realpython.com/python-web-scraping-practical-introduction/)

## Scrape and Parse Text From Websites

In [1]:
from urllib.request import urlopen

In [3]:
url = "http://olympus.realpython.org/profiles/aphrodite"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")

print(html)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>



In [4]:
title_index = html.find("<title>")
start_index = title_index + len("<title>")
end_index = html.find("</title>")
title = html[start_index:end_index]
title

'Profile: Aphrodite'

**HTML in the wild is not always clean and tidy, so locating tags by exact string matching with `find()` is not reliable.**

Regular expressions tolerate more variation in the markup, so use them instead.
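To see the difference concretely, the sketch below runs both approaches on a hand-written HTML string whose opening tag contains a stray space (`<title >`). The string is made up for illustration and is not one of the pages fetched in this notebook.

```python
import re

# Hypothetical HTML with a stray space inside the opening <title> tag.
html = "<html><head><title >Profile: Poseidon</title></head></html>"

# str.find() looks for the exact substring "<title>", which is absent here.
title_index = html.find("<title>")
print(title_index)  # -1

# The regex allows extra characters inside the opening and closing tags.
match = re.search("<title.*?>.*?</title.*?>", html, re.IGNORECASE)
title = re.sub("<.*?>", "", match.group())  # remove HTML tags
print(title)  # Profile: Poseidon
```

Because `find()` returns `-1` when the substring is missing, slicing with that index would silently produce the wrong text rather than raise an error.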

In [2]:
import re
from urllib.request import urlopen

In [3]:
url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title) # remove HTML tags

print(title)

Profile: Dionysus
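One caveat with the cell above: `re.search()` returns `None` when nothing matches, so calling `.group()` on the result would raise an `AttributeError` for a page without a title. A minimal sketch of a guard follows; the helper name `extract_title` and the added `re.DOTALL` flag (so a title may span several lines) are assumptions, not part of the original tutorial.

```python
import re
from typing import Optional


def extract_title(html: str) -> Optional[str]:
    """Return the text of the first <title> element, or None if there is none."""
    pattern = "<title.*?>.*?</title.*?>"
    match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    # Remove the surrounding tags and any leading or trailing whitespace.
    return re.sub("<.*?>", "", match.group()).strip()


print(extract_title("<TITLE>Profile: Dionysus</title>"))  # Profile: Dionysus
print(extract_title("<p>no title here</p>"))              # None
```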


## Use an HTML Parser for Web Scraping in Python

In [12]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [13]:
url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

soup = BeautifulSoup(html, "html.parser")

In [16]:
print(soup.get_text()) # extract all the text from the document and automatically remove any HTML tags



Profile: Dionysus





Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard 

Favorite Color: Wine
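The text returned by `get_text()` keeps the line breaks of the original markup, which is why the output above is mostly blank lines. One possible cleanup, sketched with plain string methods on a copy of that output (the variable names are illustrative):

```python
# Text copied from the get_text() output above, blank lines included.
raw_text = (
    "\n\n\nProfile: Dionysus\n\n\n\n\n\nName: Dionysus\n\n"
    "Hometown: Mount Olympus\n\nFavorite animal: Leopard \n\n"
    "Favorite Color: Wine\n\n\n\n\n\n"
)

# Strip each line, then keep only the lines that still contain text.
lines = [line.strip() for line in raw_text.splitlines()]
clean_text = "\n".join(line for line in lines if line)

print(clean_text)
```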






In [17]:
soup.find_all("img")

[<img src="/static/dionysus.jpg"/>, <img src="/static/grapes.png"/>]

In [20]:
# image1 and image2 are tag objects
image1, image2 = soup.find_all("img") 

In [21]:
image1.name # property

'img'

In [22]:
image1["src"] # attribute

'/static/dionysus.jpg'

In [23]:
soup.title

<title>Profile: Dionysus</title>

In [25]:
soup.title.string

'Profile: Dionysus'

In [26]:
soup.find_all("img", src="/static/dionysus.jpg")

[<img src="/static/dionysus.jpg"/>]

## Interact With HTML Forms

In [1]:
import mechanicalsoup

In [2]:
browser = mechanicalsoup.Browser()

In [3]:
url = "http://olympus.realpython.org/login"
page = browser.get(url)

In [31]:
page

<Response [200]>

In [33]:
type(page.soup)

bs4.BeautifulSoup

In [34]:
page.soup

<html>
<head>
<title>Log In</title>
</head>
<body bgcolor="yellow">
<center>
<br/><br/>
<h2>Please log in to access Mount Olympus:</h2>
<br/><br/>
<form action="/login" method="post" name="login">
Username: <input name="user" type="text"/><br/>
Password: <input name="pwd" type="password"/><br/><br/>
<input type="submit" value="Submit"/>
</form>
</center>
</body>
</html>

In [4]:
login_html = page.soup

form = login_html.select("form")[0]
form.select("input")[0]["value"] = "zeus"
form.select("input")[1]["value"] = "ThunderDude"

profiles_page = browser.submit(form, page.url)

In [5]:
profiles_page.url

'http://olympus.realpython.org/profiles'

In [43]:
base_url = "http://olympus.realpython.org"

links = profiles_page.soup.select("a")

for link in links: 
    address = base_url + link["href"]
    text = link.text
    print(f"{text}: {address}")

Aphrodite: http://olympus.realpython.org/profiles/aphrodite
Poseidon: http://olympus.realpython.org/profiles/poseidon
Dionysus: http://olympus.realpython.org/profiles/dionysus
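Concatenating `base_url` and `link["href"]` works here because every link is root-relative. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with relative and already-absolute links. The first three `href` values below are inferred from the output above; the last two are made-up cases added for illustration.

```python
from urllib.parse import urljoin

page_url = "http://olympus.realpython.org/profiles"

hrefs = [
    "/profiles/aphrodite",       # root-relative, as on the profiles page
    "/profiles/poseidon",
    "/profiles/dionysus",
    "../static/grapes.png",      # hypothetical relative link
    "https://example.com/help",  # hypothetical absolute link
]

# Resolve every href against the URL of the page it was found on.
addresses = [urljoin(page_url + "/", href) for href in hrefs]

for address in addresses:
    print(address)
```

The trailing slash added to `page_url` makes `urljoin` treat `profiles` as a directory, so the hypothetical relative link resolves against it rather than against the site root.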


## Interact With Websites in Real Time

In [12]:
import mechanicalsoup

In [13]:
browser = mechanicalsoup.Browser()
page = browser.get("http://olympus.realpython.org/dice")

In [16]:
# use the CSS ID selector "#" to select the element whose id is "result"
tag = page.soup.select("#result")[0]
result = tag.text
print(f"The result of your dice roll is: {result}")

The result of your dice roll is: 4


In [24]:
import time

In [30]:
# refresh the page 4 times at 5-second intervals
for i in range(4):
    page = browser.get("http://olympus.realpython.org/dice")  # refresh the page
    tag = page.soup.select("#result")[0]
    result = tag.text
    print(f"The result of your dice roll is: {result}")
    if i < 3:  # skip the wait after the last roll
        time.sleep(5)

The result of your dice roll is: 5
The result of your dice roll is: 3
The result of your dice roll is: 3
The result of your dice roll is: 6
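The refresh loop above can be factored into a small reusable helper. The sketch below is one way to do it; the function name `poll` and its parameters are assumptions. It takes the fetch step as a function argument, so a stand-in iterator replaces the real page request and the example runs offline.

```python
import time
from typing import Callable, List


def poll(
    fetch: Callable[[], str],
    times: int,
    interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call fetch() `times` times, waiting `interval` seconds between calls."""
    results = []
    for i in range(times):
        results.append(fetch())
        if i < times - 1:  # no need to wait after the last call
            sleep(interval)
    return results


# Stand-ins for the page request and for time.sleep, so nothing blocks.
fake_rolls = iter(["5", "3", "3", "6"])
waits = []

rolls = poll(lambda: next(fake_rolls), times=4, interval=5, sleep=waits.append)
print(rolls)  # ['5', '3', '3', '6']
print(waits)  # [5, 5, 5]
```

With the real site, `fetch` would wrap the `browser.get(...)` and `select("#result")` steps from the cell above, and the default `time.sleep` would provide the actual delay.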
