# Web Scraping with Python

**Credit Note**
The work below closely follows Real Python's [guide](https://realpython.com/python-web-scraping-practical-introduction/) to web scraping. Full credit where credit is due.

My intent is to apply the underlying concepts to assist in my everyday work.

## Introduction
1. Import the `urlopen` function from the `urllib.request` module, which is part of Python's standard library.
2. Define the URL to fetch and open it with `urlopen()`.

In [4]:
from urllib.request import urlopen

In [5]:
url = "http://olympus.realpython.org/profiles/aphrodite"

page = urlopen(url)

In [6]:
page

<http.client.HTTPResponse at 0x108d2aef1f0>

This `HTTPResponse` object has a `.read()` method that returns the page contents as a sequence of bytes. These bytes can be turned into a string with the `.decode()` method.

In [7]:
page_bytes = page.read()
decoded_page = page_bytes.decode('utf-8')
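The read-and-decode steps can also be wrapped in a small helper. The following is a minimal sketch, not part of the original guide: `fetch_html` and `default_charset` are hypothetical names, and it assumes the server may announce its encoding in the `Content-Type` header, falling back to UTF-8 when it does not.

```python
from urllib.request import urlopen


def fetch_html(url, default_charset="utf-8"):
    """Open a URL, read the raw bytes, and decode them into a string."""
    # The with-statement closes the connection once the body has been read.
    with urlopen(url) as response:
        # Use the charset declared by the server if there is one;
        # otherwise fall back to the assumed default.
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset)
```

Calling `fetch_html("http://olympus.realpython.org/profiles/aphrodite")` should return the same string as `decoded_page`, assuming the page is served as UTF-8.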

In [8]:
print(decoded_page)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>



## Extracting Text from HTML

We can use string methods such as `.find()`, together with slicing, to locate and extract text from the HTML.

In [11]:
title_index = decoded_page.find('<title>')
title_index

14

In [12]:
# Increase the index to the end of the title tag
title_start_idx = title_index + len('<title>')
title_start_idx

21

In [13]:
# Find index of the closing title tag
title_end_idx = decoded_page.find('</title>')
title_end_idx

39

So, we've discovered that the page's title is located at `decoded_page[21:39]`.

In [16]:
page_title = decoded_page[title_start_idx:title_end_idx]
page_title

'Profile: Aphrodite'
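The find-and-slice steps above can be collected into one reusable function. This is a rough sketch under a few assumptions: `extract_between` is a hypothetical helper name, the sample string is a shortened stand-in for the real page, and the function returns `None` when either marker is missing (`.find()` returns `-1` in that case, which would otherwise produce a misleading slice).

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    # Move past the opening marker so the slice starts at the content.
    start += len(start_marker)
    # Search for the closing marker only after the opening one.
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


# Shortened stand-in for the decoded page (assumption for illustration).
sample_page = "<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>\n</html>"
sample_title = extract_between(sample_page, "<title>", "</title>")
```

This approach is fragile: it only works when the tag is written exactly as `<title>`, with no extra spaces, attributes, or different capitalization.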

## Intro to RegEx
Python's standard library includes the `re` module for working with regular expressions.
* `*` matches zero or more repetitions of whatever comes just before the asterisk.

In [50]:
import re

re.findall("ab*c", "abcd")  # Matches "a", then zero or more "b", then "c"

['abc']

In [51]:
re.findall("ab*c", "acc")

['ac']
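The same `*` quantifier can make title extraction more tolerant of messy HTML. The sketch below is an illustration rather than a robust parser: `find_title` is a hypothetical name, the sample string is invented, and the pattern assumes the title tags may contain stray spaces or attributes and may use any capitalization.

```python
import re


def find_title(html):
    """Return the text inside the first <title> element, or None."""
    # [^>]* allows zero or more extra characters inside each tag,
    # and .*? captures as little text as possible between the tags.
    pattern = r"<title[^>]*>(.*?)</title[^>]*>"
    match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


# Invented sample with inconsistent capitalization and spacing.
messy_html = "<head><TITLE >Profile: Dionysus</title  / ></head>"
title = find_title(messy_html)
```

The `re.IGNORECASE` flag handles `<TITLE>` versus `<title>`, and `re.DOTALL` lets `.` match newlines in case the title spans several lines.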

## Beautiful Soup

In [53]:
conda install -c anaconda beautifulsoup4

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\Users\James\miniconda3\envs\web-scraping

  added / updated specs:
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    beautifulsoup4-4.9.3       |     pyhb0f4dca_0          87 KB  anaconda
    ca-certificates-2020.10.14 |                0         159 KB  anaconda
    soupsieve-2.0.1            |             py_0          33 KB  anaconda
    ------------------------------------------------------------
                                           Total:         279 KB

The following NEW packages will be INSTALLED:

  beautifulsoup4     anaconda/noarch::beautifulsoup4-4.9.3-pyhb0f4dca_0
  soupsieve          anaconda/noarch::soupsieve-2.0.1-py_0

### Create a BeautifulSoup Object

In [59]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode('utf-8')
soup = BeautifulSoup(html, "html.parser")

The `BeautifulSoup` object assigned to `soup` is created with two arguments. The first argument is the HTML to be parsed. The second argument, the string `"html.parser"`, tells the object which parser to use behind the scenes. `"html.parser"` represents Python's built-in HTML parser.
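To try the two-argument constructor without a network request, a `BeautifulSoup` object can be built from an inline string. This is a small sketch: `sample_html` is an invented snippet loosely modeled on the profile pages, and it assumes `beautifulsoup4` is installed as shown above.

```python
from bs4 import BeautifulSoup

# Invented snippet standing in for a downloaded page.
sample_html = """
<html>
<head><title>Profile: Dionysus</title></head>
<body><h2>Name: Dionysus</h2></body>
</html>
"""

# First argument: the HTML string. Second argument: the parser name.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tags can be reached as attributes or searched for by name.
page_title = sample_soup.title.string
heading = sample_soup.find("h2").get_text()
```

Here `page_title` holds the text of the `<title>` tag and `heading` holds the text of the first `<h2>` tag.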

In [60]:
print(soup.get_text())



Profile: Dionysus





Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard 

Favorite Color: Wine
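The output of `get_text()` keeps the blank lines left behind by the removed tags. One possible way to tidy it up with plain string methods is sketched below. `tidy_text` is a hypothetical helper, and `raw_text` is a hand-written stand-in for the output shown above.

```python
def tidy_text(raw_text):
    """Strip each line and drop the ones that are empty."""
    stripped = (line.strip() for line in raw_text.splitlines())
    return "\n".join(line for line in stripped if line)


# Hand-written stand-in for the get_text() output (assumption).
raw_text = (
    "\n\nProfile: Dionysus\n\n\n\nName: Dionysus\n\n"
    "Hometown: Mount Olympus\n\nFavorite animal: Leopard \n\n"
    "Favorite Color: Wine\n\n"
)
clean_text = tidy_text(raw_text)
```

After this step, `clean_text` contains one profile field per line with no surrounding whitespace.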




