**Prerequisites:** This lesson requires the BeautifulSoup library. You should be able to install it with `conda install beautifulsoup4` (or `pip install beautifulsoup4` if you are not using conda).

# Scraping Task
Imagine we need to scrape the text of all of the navigation links from this portfolio website HTML.  We want to print out a list of them, represented as strings.

First, open up the HTML file with Python.  (In this case we're opening a local file, but you would typically be using `requests` to get the file from a web server.)

In [None]:
def get_portfolio_html_text():
    with open("portfolio.html", "r") as file_obj:
        return file_obj.read()

In [None]:
html_string = get_portfolio_html_text()
html_string

## If we didn't know about BeautifulSoup...

You could just treat the HTML as one big string, and split and slice it until you find the content you want.

We notice that all of the navigation labels are immediately followed by `</a>`.

In [None]:
split_string = html_string.split("</a>")

We experiment to find the beginning of the label. It turns out each label is consistently preceded by `>`, so we look for the last `>` in each segment.

In [None]:
greater_than_index = split_string[0].rfind(">")

In [None]:
greater_than_index

In [None]:
split_string[0][greater_than_index+1:]

Hmm, this selected something that we didn't want.

In [None]:
greater_than_index = split_string[1].rfind(">")
split_string[1][greater_than_index+1:]

It worked for index 1, though...

Okay, here's a loop over every segment, using what we have so far. (Note: `repr` is a built-in function that shows whitespace more clearly than `print` alone.)

In [None]:
for segment in split_string:
    greater_than_index = segment.rfind(">")
    print(repr(segment[greater_than_index+1:]))
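To see why `repr` helps here, the following minimal sketch compares the two on a made-up label (the string is invented for illustration and is not taken from `portfolio.html`):

```python
# A hypothetical label with leading spaces and a trailing newline
label = "  About\n"

print(label)        # the surrounding whitespace is hard to see
print(repr(label))  # prints '  About\n', with quotes and the escape visible

shown = repr(label)
```

Because `repr` wraps the string in quotes and escapes the newline, a segment containing only whitespace is easy to tell apart from a real label.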

Great, that gets the links we wanted, but also some extra stuff. Let's do some cleanup and drop the entries that contain only whitespace.

In [None]:
for segment in split_string:
    greater_than_index = segment.rfind(">")
    content = segment[greater_than_index+1:]
    content = content.strip()
    if len(content) > 0:
        print(repr(content))

We still have this pesky email link. Maybe we can get rid of it with a hack like this:

In [None]:
for segment in split_string:
    greater_than_index = segment.rfind(">")
    content = segment[greater_than_index+1:]
    content = content.strip()
    if len(content) > 0 and "@" not in content:
        print(repr(content))

So, the whole flow from `html_string` to printing the labels was:

In [None]:
def get_links_without_bs4(html_string):
    split_string = html_string.split("</a>")
    for segment in split_string:
        greater_than_index = segment.rfind(">")
        content = segment[greater_than_index+1:]
        content = content.strip()
        if len(content) > 0 and "@" not in content:
            print(repr(content))
            
get_links_without_bs4(html_string)

That approach was annoying, and fairly brittle.  What happens if they add another `a` tag that's not in the nav bar, but also doesn't contain an `@`?
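As a rough illustration of that failure mode, here is the same logic run on a small made-up page. The HTML below and the name `collect_links_without_bs4` are hypothetical, invented for this sketch; the real `portfolio.html` may look different. The function returns the labels instead of printing them so the result is easy to inspect.

```python
# Hypothetical HTML, invented for illustration only
sample_html = """
<nav>
  <a class="nav-link" href="#about">About</a>
  <a class="nav-link" href="#projects">Projects</a>
</nav>
<footer>
  <a href="https://example.com/resume.pdf">Download resume</a>
</footer>
"""

def collect_links_without_bs4(html_string):
    """Same split-and-slice logic as above, but returning a list."""
    labels = []
    for segment in html_string.split("</a>"):
        # Keep everything after the last ">" in the segment
        content = segment[segment.rfind(">") + 1:].strip()
        if len(content) > 0 and "@" not in content:
            labels.append(content)
    return labels

labels = collect_links_without_bs4(sample_html)
print(labels)  # the footer link sneaks in alongside the nav links
```

The footer link has no `@` in its text, so the filter lets it through even though it is not part of the nav bar.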

## The much easier way, with BeautifulSoup
BeautifulSoup allows you to use a CSS selector to choose exactly the elements you're trying to target.

### CSS Selector Rules
 - Start with the HTML element type (e.g. `div`, `li`, `p`)
 - If you only want elements with a particular class, add a `.` then the class name (e.g. `div.header-content`)
 - If you only want elements with a particular id, add a `#` then the id (e.g. `div#contact-list`)
 - You can combine more than one selector by separating them with a space, which matches elements nested inside other elements. For example, if you want to select only `p` tags that are inside of `li` tags with class `addresses`, that would look like `li.addresses p`
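The rules above can be tried out on a tiny document. This is only a sketch: the HTML is made up to match the example selectors in the list, and it assumes BeautifulSoup's `select` method with Python's built-in `"html.parser"`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written to match the example selectors above
sample_html = """
<div class="header-content"><p>Welcome</p></div>
<div id="contact-list">
  <ul>
    <li class="addresses"><p>123 Main St</p></li>
    <li class="phones"><p>555-0100</p></li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Element type only: every p tag in the document
by_type = [p.text for p in soup.select("p")]

# Element type plus class
by_class = [div.text for div in soup.select("div.header-content")]

# Element type plus id
by_id = soup.select("div#contact-list")

# Nested: p tags inside li tags with class "addresses"
nested = [p.text for p in soup.select("li.addresses p")]

print(by_type, by_class, len(by_id), nested)
```

Note that `select` always returns a list, even when only one element (or none) matches.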

In [None]:
from bs4 import BeautifulSoup

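Make a "soup" object out of the HTML string. Passing `"html.parser"` explicitly selects Python's built-in parser, which avoids the warning BeautifulSoup emits when it has to guess a parser and keeps the results consistent across machines.

In [None]:
css_soup = BeautifulSoup(html_string, "html.parser")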

Select the list of elements that match our query. In this case, it's "`a` tags with class `nav-link`".

In [None]:
nav_links = css_soup.select("a.nav-link")

Let's explore what we got back:

In [None]:
first_link = nav_links[0]
first_link?

In [None]:
first_link.contents

In [None]:
first_link.text
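The difference between `.contents` and `.text` is easier to see on a link with nested markup. The link below is made up for illustration and assumes nothing about the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical link containing a nested span, invented for this sketch
html = '<a class="nav-link" href="#about"><span>About</span> me</a>'
link = BeautifulSoup(html, "html.parser").select("a.nav-link")[0]

# .contents is a list of the tag's direct children (tags and strings)
print(link.contents)

# .text joins all of the text inside the tag into a single string
print(link.text)
```

Here `.contents` has two items (the `span` tag and the string `" me"`), while `.text` is the single string `"About me"`.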

So, let's loop over everything:

In [None]:
for link in nav_links:
    print(link.text)

So, the whole flow from `html_string` to printing the labels was:

In [None]:
def get_links_with_bs4(html_string):
    css_soup = BeautifulSoup(html_string, "html.parser")
    nav_links = css_soup.select("a.nav-link")
    for link in nav_links:
        print(link.text)
        
get_links_with_bs4(html_string)

...and that's it! Much shorter and cleaner! It avoided selecting any of the other `a` tags from the get-go, and it will continue to work if the site owner decides to put an `@` in the nav links for some reason.
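Since the original task asked for a list of strings, one possible variant returns the labels instead of printing them. This is a sketch, not the only way to do it: the function name `collect_nav_labels` and the sample HTML are hypothetical, and `get_text(strip=True)` is used to trim surrounding whitespace.

```python
from bs4 import BeautifulSoup

def collect_nav_labels(html_string):
    """Return the text of every a.nav-link element as a list of strings."""
    soup = BeautifulSoup(html_string, "html.parser")
    return [link.get_text(strip=True) for link in soup.select("a.nav-link")]

# Hypothetical HTML, invented for illustration only
sample_html = """
<nav>
  <a class="nav-link" href="#about"> About </a>
  <a class="nav-link" href="#contact">Contact @ me</a>
</nav>
<a href="mailto:someone@example.com">someone@example.com</a>
<a href="/resume.pdf">Resume</a>
"""

labels = collect_nav_labels(sample_html)
print(labels)
```

On this sample, the nav link containing an `@` is kept, and both the email link and the resume link are skipped because neither has the `nav-link` class.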