# Web Scraping with Python

**Credit Note**
The work below closely follows Real Python's [guide](https://realpython.com/python-web-scraping-practical-introduction/) to web scraping. Full credit where credit is due.

My intent is to apply the underlying concepts to my everyday work.

## Introduction
1. Import the `urlopen` function from the `urllib.request` module, which is part of Python's standard library.
2. Define the URL to fetch and open it with `urlopen()`.

In [4]:
from urllib.request import urlopen

In [5]:
url = "http://olympus.realpython.org/profiles/aphrodite"

page = urlopen(url)

In [6]:
page

<http.client.HTTPResponse at 0x108d2aef1f0>

This `HTTPResponse` object has a `.read()` method that returns the response body as a sequence of bytes. These bytes can be decoded into a string with the `.decode()` method.

In [7]:
page_bytes = page.read()
decoded_page = page_bytes.decode('utf-8')

In [8]:
print(decoded_page)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>
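
The fetch-and-decode steps above can be wrapped into a small helper. This is only a sketch: `decode_body` and `fetch_html` are hypothetical names that do not appear in the Real Python guide. It also assumes that falling back to UTF-8 is acceptable when the server does not declare a charset.

```python
from urllib.request import urlopen


def decode_body(raw_bytes, charset=None):
    """Decode raw response bytes, falling back to UTF-8 when no charset is known."""
    # errors="replace" substitutes undecodable bytes instead of raising.
    return raw_bytes.decode(charset or "utf-8", errors="replace")


def fetch_html(url):
    """Fetch a page and return its body as a string (hypothetical helper)."""
    # The context manager closes the connection once the body has been read.
    with urlopen(url) as response:
        # get_content_charset() reads the charset from the Content-Type
        # header and returns None when the server does not declare one.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

Calling `fetch_html("http://olympus.realpython.org/profiles/aphrodite")` should give the same string as `decoded_page`, provided the site is reachable.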



## Extracting Text from HTML

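Because the decoded page is an ordinary string, we can use string methods such as `.find()`, together with slicing, to locate text in the HTML.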

In [11]:
title_index = decoded_page.find('<title>')
title_index

14

In [12]:
# Increase the index to the end of the title tag
title_start_idx = title_index + len('<title>')
title_start_idx

21

In [13]:
# Find index of the closing title tag
title_end_idx = decoded_page.find('</title>')
title_end_idx

39

So, we've discovered that the page's title is located at `decoded_page[21:39]`.

In [16]:
page_title = decoded_page[title_start_idx:title_end_idx]
page_title

'Profile: Aphrodite'
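
The find-and-slice steps can be collected into one function. The sketch below uses a hypothetical helper, `extract_between`, and a shortened copy of the Aphrodite page as sample input. It returns `None` when a marker is missing, because `.find()` signals a miss with `-1` and slicing with that value would silently give the wrong text.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    # Move past the opening marker so the slice starts at the content.
    start += len(start_marker)
    # Search for the closing marker only after the opening one.
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


sample = "<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>\n</html>"
print(extract_between(sample, "<title>", "</title>"))  # Profile: Aphrodite
print(extract_between(sample, "<h1>", "</h1>"))        # None
```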

## Extracting from a More Complicated Page

In [23]:
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/poseidon"
page = urlopen(url)
page_bytes = page.read()
decoded = page_bytes.decode('utf-8')
print(decoded)

<html>
<head>
<title >Profile: Poseidon</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/poseidon.jpg" />
<h2>Name: Poseidon</h2>
<br><br>
Favorite animal: Dolphin
<br><br>
Favorite color: Blue
<br><br>
Hometown: Sea
</center>
</body>
</html>



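This page is trickier: its opening title tag is written `<title >`, with a space before the `>`, so `decoded.find('<title>')` returns `-1` and the approach above no longer works as-is. Instead of repeating the code, I'm going to write a function that finds a given tag.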

In [25]:
def find_tag(html_page, tag):
    pass
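
One possible way to fill in the stub is sketched below. It swaps the plain string methods for a regular expression, so it is not necessarily the approach the notebook would have taken. It assumes the goal is to return the text inside the first matching tag pair while tolerating extra whitespace or attributes in the opening tag. For anything beyond simple pages, a real HTML parser would be more robust.

```python
import re


def find_tag(html_page, tag):
    """Return the text inside the first <tag>...</tag> pair, or None if absent."""
    name = re.escape(tag)
    # Opening tag: the tag name, then optional whitespace/attributes up to ">".
    # Content: captured non-greedily so the match stops at the first closing tag.
    # Closing tag: may contain trailing whitespace before ">".
    pattern = rf"<{name}(?:\s[^>]*)?>(.*?)</{name}\s*>"
    match = re.search(pattern, html_page, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


# Shortened copy of the Poseidon page, including the stray space in "<title >".
sample = "<head>\n<title >Profile: Poseidon</title>\n</head>\n<h2>Name: Poseidon</h2>"
print(find_tag(sample, "title"))  # Profile: Poseidon
print(find_tag(sample, "h2"))     # Name: Poseidon
```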