In [1]:
%load_ext autoreload
%autoreload 2

# Scraping Task
Imagine we need to scrape all of the navigation links from this portfolio website HTML.  We want to print out a list of them, represented as strings.

In [2]:
from scrape_portfolio import get_portfolio_html_text

In [3]:
html_string = get_portfolio_html_text()
html_string



## The annoying way (don't do this)
You could just treat the HTML as one big string, then split and slice it until you find the content you want.

In [4]:
# we notice that all of the navigation labels are immediately followed by "</a>"
split_string = html_string.split("</a>")

In [5]:
# experimenting to find the beginning of the label.  Turns out it's
# consistently ">", then the label text
greater_than_index = split_string[0].rfind(">")

In [6]:
greater_than_index

1187

In [7]:
# oh no, this selected something that we didn't want
split_string[0][greater_than_index:]

'>\n    '

In [8]:
# it worked for index 1 though (note the +1, which skips past the ">" itself)
greater_than_index = split_string[1].rfind(">")
split_string[1][greater_than_index+1:]

'About'

In [9]:
# okay, here's a loop over everything so far
# (note: `repr` is a built-in function that shows whitespace more clearly)
for segment in split_string:
    greater_than_index = segment.rfind(">")
    print(repr(segment[greater_than_index+1:]))

'\n    '
'About'
'Experience'
'Education'
'Skills'
'Interests'
'Awards'
'name@email.com'
'\n          '
'\n          '
'\n          '
'\n          '
''


In [10]:
# great, that gets the links we wanted, but also some extra stuff
# let's do some cleanup, remove the links that only contain whitespace
for segment in split_string:
    greater_than_index = segment.rfind(">")
    content = segment[greater_than_index+1:]
    content = content.strip()
    if len(content) > 0:
        print(repr(content))

'About'
'Experience'
'Education'
'Skills'
'Interests'
'Awards'
'name@email.com'


In [11]:
# we still have this pesky email link
# maybe we get rid of it with a hack like this
for segment in split_string:
    greater_than_index = segment.rfind(">")
    content = segment[greater_than_index+1:]
    content = content.strip()
    if len(content) > 0 and "@" not in content:
        print(repr(content))

'About'
'Experience'
'Education'
'Skills'
'Interests'
'Awards'


That approach was annoying, and fairly brittle.  What happens if they add another `a` tag that's not in the nav bar, but also doesn't contain an `@`?
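To make that question concrete, here is a minimal sketch that runs the same split/`rfind`/strip/`@` logic over a made-up page. The `sample_html` markup and the `labels_by_string_slicing` helper are hypothetical, invented for illustration; they are not part of the real portfolio site.

```python
# Hypothetical page: a two-item nav bar plus a footer with an email link
# and one extra link that contains no "@".
sample_html = """
<nav>
  <a class="nav-link" href="#about">About</a>
  <a class="nav-link" href="#skills">Skills</a>
</nav>
<footer>
  <a href="mailto:name@email.com">name@email.com</a>
  <a href="/privacy">Privacy Policy</a>
</footer>
"""


def labels_by_string_slicing(html):
    """Apply the split / rfind / strip / "@" filter from the cells above."""
    labels = []
    for segment in html.split("</a>"):
        content = segment[segment.rfind(">") + 1:].strip()
        if len(content) > 0 and "@" not in content:
            labels.append(content)
    return labels


sliced_labels = labels_by_string_slicing(sample_html)
print(sliced_labels)
# ['About', 'Skills', 'Privacy Policy']
```

The footer link slips through because nothing in the string logic knows what a nav bar is; it only knows about `</a>` and `@`.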

## The much easier way, with BeautifulSoup
BeautifulSoup lets you use a CSS selector to target exactly the elements you want.

### CSS Selector Rules
 - Start with the HTML element type (e.g. `div`, `li`, `p`)
 - If you only want elements with a particular class, add a `.` then the class name (e.g. `div.header-content`)
 - If you only want elements with a particular id, add a `#` then the id (e.g. `div#contact-list`)
 - You can combine selectors by separating them with a space; the later selector then matches only elements nested inside the earlier one.  For example, to select only `p` tags that are inside of `li` tags with class `addresses`, use `li.addresses p`
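As a rough illustration of those four rules, the sketch below runs each kind of selector against a tiny made-up document. The `demo_html` markup is hypothetical and exists only to exercise the selectors; it assumes `beautifulsoup4` is already installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented only to exercise each rule above.
demo_html = """
<div class="header-content">Welcome</div>
<div id="contact-list">
  <ul>
    <li class="addresses"><p>123 Main St</p></li>
    <li class="phones"><p>555-0100</p></li>
  </ul>
</div>
<p>Footer text</p>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

all_paragraphs = demo_soup.select("p")                    # element type only
header_divs = demo_soup.select("div.header-content")      # element type + class
contact_divs = demo_soup.select("div#contact-list")       # element type + id
address_paragraphs = demo_soup.select("li.addresses p")   # p nested inside li.addresses

print(len(all_paragraphs))                           # 3
print(header_divs[0].get_text())                     # Welcome
print(contact_divs[0]["id"])                         # contact-list
print([p.get_text() for p in address_paragraphs])    # ['123 Main St']
```

Note that `select` always returns a list, even when only one element matches.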

In [12]:
# you might need to run `conda install beautifulsoup4` (or `pip install beautifulsoup4`) for this to work
from bs4 import BeautifulSoup

In [13]:
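# Make a "soup" object out of the html string
# (name the parser explicitly; otherwise BeautifulSoup guesses one and emits a warning)
css_soup = BeautifulSoup(html_string, "html.parser")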

In [14]:
# Select the list of elements that match our query
# In this case, it's "`a` tags with class `nav-link`"
nav_links = css_soup.select("a.nav-link")

In [15]:
# Exploring what we got back
first_link = nav_links[0]
first_link?

In [16]:
first_link.contents

['About']

In [17]:
# Loop over the links and print their contents
for link in nav_links:
    # HTML tags can contain more than one thing, but these contain only a single piece of text
    print(link.contents[0])

About
Experience
Education
Skills
Interests
Awards


...and that's it!  Much faster and cleaner!  It avoided selecting any of the other `a` tags from the get-go, and it will continue to work if the site owner decides to put an `@` in the nav links for some reason.
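Here is a small sketch of that robustness claim, which also collects the labels into an actual list of strings. The `changed_html` markup is hypothetical: it imagines a nav label containing an `@` and an unrelated footer link. `get_text(strip=True)` is used in place of `contents[0]` so the label still comes out as a plain string if a link ever wraps its text in another tag.

```python
from bs4 import BeautifulSoup

# Hypothetical variation of the page: one nav label contains "@",
# and the footer has an extra link that is not part of the nav bar.
changed_html = """
<nav>
  <a class="nav-link" href="#about">About</a>
  <a class="nav-link" href="#contact">Find me @ work</a>
</nav>
<footer>
  <a href="/privacy">Privacy Policy</a>
</footer>
"""

changed_soup = BeautifulSoup(changed_html, "html.parser")

# Only `a` tags with class `nav-link` are selected, so the footer link is ignored.
nav_labels = [link.get_text(strip=True) for link in changed_soup.select("a.nav-link")]
print(nav_labels)
# ['About', 'Find me @ work']
```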