**Prerequisites:** This lesson requires the BeautifulSoup library.  You should be able to install it with `conda install beautifulsoup4`.

# Scraping Task
Imagine we need to scrape the text of all of the navigation links from this portfolio website HTML.  We want to print out a list of them, represented as strings.

First, open up the HTML file with Python.  (In this case we're opening a local file, but you would typically be using `requests` to get the file from a web server.)

In [1]:
def get_portfolio_html_text():
    with open("portfolio.html", "r") as file_obj:
        return file_obj.read()

In [2]:
html_string = get_portfolio_html_text()
html_string



## If we didn't know about BeautifulSoup...

You could just treat the HTML as a big string, and split and slice it until you find the content you want.

We notice that all of the navigation labels are immediately followed by "`</a>`"

In [3]:
split_string = html_string.split("</a>")

In [4]:
split_string[1]

'\n    <button class="navbar-toggler" type="button" data-toggle="collapse" data-target="#navbarSupportedContent"\n      aria-controls="navbarSupportedContent" aria-expanded="false" aria-label="Toggle navigation">\n      <span class="navbar-toggler-icon"></span>\n    </button>\n    <div class="collapse navbar-collapse" id="navbarSupportedContent">\n      <ul class="navbar-nav">\n        <li class="nav-item">\n          <a class="nav-link js-scroll-trigger" href="#about">About'

We experiment to find the beginning of the label.  It turns out the label text consistently comes right after the last ">" in each segment.

In [5]:
greater_than_index = split_string[0].rfind(">")

In [6]:
greater_than_index

1187

In [7]:
split_string[0][greater_than_index:]

'>\n    '

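Hmm, this selected something that we didn't want: the slice starts at the ">" itself, and everything after it in the first segment is just whitespace.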

In [8]:
greater_than_index = split_string[1].rfind(">")
split_string[1][greater_than_index+1:]

'About'

It worked for index 1, though (this time with `+1` so the slice starts just past the ">")...

Okay, here's a loop over everything so far (note: `repr` is a built-in function that shows whitespace more clearly than just `print`)

In [9]:
for segment in split_string:
    greater_than_index = segment.rfind(">")
    print(repr(segment[greater_than_index+1:]))

'\n    '
'About'
'Experience'
'Education'
'Skills'
'Interests'
'Awards'
'name@email.com'
'\n          '
'\n          '
'\n          '
'\n          '
''


Great, that gets the links we wanted, but also some extra stuff.  Let's clean up by removing the entries that contain only whitespace.

In [10]:
for segment in split_string:
    greater_than_index = segment.rfind(">")
    content = segment[greater_than_index+1:]
    content = content.strip()
    if len(content) > 0:
        print(repr(content))

'About'
'Experience'
'Education'
'Skills'
'Interests'
'Awards'
'name@email.com'


We still have this pesky email link.  Maybe we can get rid of it with a hack like this:

In [11]:
for segment in split_string:
    greater_than_index = segment.rfind(">")
    content = segment[greater_than_index+1:]
    content = content.strip()
    if len(content) > 0 and "@" not in content:
        print(repr(content))

'About'
'Experience'
'Education'
'Skills'
'Interests'
'Awards'


So, the whole flow from `html_string` to printing the labels was:

In [12]:
def get_links_without_bs4(html_string):
    split_string = html_string.split("</a>")
    for segment in split_string:
        greater_than_index = segment.rfind(">")
        content = segment[greater_than_index+1:]
        content = content.strip()
        if len(content) > 0 and "@" not in content:
            print(repr(content))
            
get_links_without_bs4(html_string)

'About'
'Experience'
'Education'
'Skills'
'Interests'
'Awards'


That approach was annoying, and fairly brittle.  What happens if they add another `a` tag that's not in the nav bar, but also doesn't contain an `@`?
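
To make the brittleness concrete, here is a minimal sketch that runs the same split-and-slice logic on a small, made-up page.  The `sample_html` string and the function name `get_link_labels_by_splitting` are hypothetical and are not part of `portfolio.html`; the function also returns a list instead of printing so the result is easier to inspect.

```python
# Hypothetical page: one nav link plus a footer link that is NOT in the nav bar.
sample_html = """
<ul class="navbar-nav">
  <li class="nav-item"><a class="nav-link" href="#about">About</a></li>
</ul>
<footer><a href="/privacy">Privacy Policy</a></footer>
"""


def get_link_labels_by_splitting(html_string):
    """Same split-and-slice approach as above, but returns a list of labels."""
    labels = []
    for segment in html_string.split("</a>"):
        # Keep whatever follows the last ">" in the segment, minus whitespace.
        content = segment[segment.rfind(">") + 1:].strip()
        if content and "@" not in content:
            labels.append(content)
    return labels


labels = get_link_labels_by_splitting(sample_html)
print(labels)  # ['About', 'Privacy Policy']
```

The footer link slips through because the string approach has no notion of where a link sits in the page; it only sees `</a>`.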

## The much easier way, with BeautifulSoup
BeautifulSoup allows you to use a CSS selector to choose exactly the elements you're trying to target.

### CSS Selector Rules
 - Start with HTML element type (e.g. `div`, `li`, `p`)
 - If you only want elements with a particular class, add a `.` then the class name  (e.g. `div.header-content`)
 - If you only want elements with a particular id, add a `#` then the id name (e.g. `div#contact-list`)
 - You can chain selectors, separated by a space, to match nested elements.  For example, if you want to select only `p` tags that are inside of `li` tags with class `addresses`, that would look like `li.addresses p`
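
The rules above can be tried out on a tiny document.  This is only a sketch: `sample_html` is a made-up snippet (not taken from `portfolio.html`), and it passes `"html.parser"` explicitly, which is an assumption about which parser you want to use.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, written only to exercise each selector rule.
sample_html = """
<div class="header-content">Welcome</div>
<div id="contact-list">
  <ul>
    <li class="addresses"><p>3542 Berry Street</p></li>
    <li class="phones"><p>(317) 585-8468</p></li>
  </ul>
</div>
<p>Footer text</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Element type only: every <p> in the document.
all_paragraphs = [p.text for p in soup.select("p")]

# Element type plus class.
by_class = [div.text for div in soup.select("div.header-content")]

# Element type plus id.
by_id = [div["id"] for div in soup.select("div#contact-list")]

# Nested: only <p> tags somewhere inside <li class="addresses">.
nested = [p.text for p in soup.select("li.addresses p")]

print(all_paragraphs)  # ['3542 Berry Street', '(317) 585-8468', 'Footer text']
print(by_class)        # ['Welcome']
print(by_id)           # ['contact-list']
print(nested)          # ['3542 Berry Street']
```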

In [13]:
from bs4 import BeautifulSoup

Make a "soup" object out of the HTML string.

In [14]:
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
css_soup = BeautifulSoup(html_string, "html.parser")

Inspect the soup object, then select the list of elements that match our query.  In this case, it's "`a` tags with class `nav-link`".

In [15]:
css_soup?

Signature:      css_soup(*args, **kwargs)
Type:           BeautifulSoup
String form:   
<!DOCTYPE html>
           
           <html lang="en">
           <head>
           <meta charset="utf-8"/>
           <meta content="width=device-widt <...> !-- Custom scripts for this template -->
           <script src="js/resume.min.js"></script>
           </body>
           </html>
Length:         3
File:           ~/.conda/envs/prework-labs/lib/python3.7/site-packages/bs4/__init__.py
Docstring:     
This class defines the basic interface called by the tree builders.

These methods will be called by the parser:
  reset()
  feed(markup)

The tree builder may call these methods from its feed() implementation:
  handle_starttag(name, attrs) # See note about return value
  handle_endtag(name)
  handle_data(data) # Appends to the

In [16]:
nav_links = css_soup.select("a.nav-link")

In [17]:
nav_list_items = css_soup.select("li.nav-item a")
nav_list_items[0]

<a class="nav-link js-scroll-trigger" href="#about">About</a>

In [18]:
navbar = css_soup.select("div.navbar-collapse")
navbar

[<div class="collapse navbar-collapse" id="navbarSupportedContent">
 <ul class="navbar-nav">
 <li class="nav-item">
 <a class="nav-link js-scroll-trigger" href="#about">About</a>
 </li>
 <li class="nav-item">
 <a class="nav-link js-scroll-trigger" href="#experience">Experience</a>
 </li>
 <li class="nav-item">
 <a class="nav-link js-scroll-trigger" href="#education">Education</a>
 </li>
 <li class="nav-item">
 <a class="nav-link js-scroll-trigger" href="#skills">Skills</a>
 </li>
 <li class="nav-item">
 <a class="nav-link js-scroll-trigger" href="#interests">Interests</a>
 </li>
 <li class="nav-item">
 <a class="nav-link js-scroll-trigger" href="#awards">Awards</a>
 </li>
 </ul>
 </div>]

Exploring what we got back

In [19]:
first_link = nav_links[0]
first_link?

Signature:      first_link(*args, **kwargs)
Type:           Tag
String form:    <a class="nav-link js-scroll-trigger" href="#about">About</a>
Length:         1
File:           ~/.conda/envs/prework-labs/lib/python3.7/site-packages/bs4/element.py
Docstring:      Represents a found HTML tag with its attributes and contents.
Init docstring: Basic constructor.
Call docstring:
Calling a tag like a function is the same as calling its
find_all() method. Eg. tag('a') returns a list of all the A tags
found within this tag.


In [20]:
first_link.contents

['About']

In [21]:
first_link.text

'About'
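
For a link that contains only text, `.contents` and `.text` look almost the same.  The difference shows up when a tag has children of its own.  The sketch below uses a made-up link (`link_html` is hypothetical, with a `<b>` tag added for illustration) and also reads the tag's attributes with dictionary-style access, which goes slightly beyond what the cells above show.

```python
from bs4 import BeautifulSoup

# Hypothetical link with a nested <b> tag, to show how .contents and .text differ.
link_html = '<a class="nav-link js-scroll-trigger" href="#about">About <b>me</b></a>'

soup = BeautifulSoup(link_html, "html.parser")
link = soup.select("a.nav-link")[0]

# .contents is a list of the tag's direct children (strings and tags).
children = link.contents

# .text joins the text of every descendant into a single string.
label = link.text

# Attributes are read like dictionary keys; "class" comes back as a list.
href = link["href"]
classes = link["class"]

print(len(children))  # 2
print(label)          # About me
print(href)           # #about
print(classes)        # ['nav-link', 'js-scroll-trigger']
```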

So, let's loop over everything:

In [22]:
for link in nav_links:
    print(link.text)

About
Experience
Education
Skills
Interests
Awards


So, the whole flow from `html_string` to printing the labels was:

In [23]:
def get_links_with_bs4(html_string):
    css_soup = BeautifulSoup(html_string, "html.parser")  # explicit parser avoids bs4's warning
    nav_links = css_soup.select("a.nav-link")
    for link in nav_links:
        print(link.text)
        
get_links_with_bs4(html_string)

About
Experience
Education
Skills
Interests
Awards


...and that's it!  Much shorter and cleaner!  It avoids selecting any of the other `a` tags from the get-go, and it will continue to work if the site owner decides to put an `@` in the nav links for some reason.
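
Since the original task asked for a list of strings, a variant that returns the labels instead of printing them may be handy.  This is a sketch under assumptions: the function name `get_nav_link_labels` and the `sample_html` page (including its `navbar-brand` class and the odd `Email @ me` label) are hypothetical, chosen to show that a nav label containing `@` is kept while links outside the nav are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical page: a brand link, two nav links (one with an "@"), and an email link.
sample_html = """
<nav>
  <a class="navbar-brand" href="#page-top">Clarence Taylor</a>
  <ul class="navbar-nav">
    <li class="nav-item"><a class="nav-link" href="#about">About</a></li>
    <li class="nav-item"><a class="nav-link" href="#contact">Email @ me</a></li>
  </ul>
</nav>
<a href="mailto:name@email.com">name@email.com</a>
"""


def get_nav_link_labels(html_string):
    """Return the text of every <a class="nav-link"> as a list of strings."""
    soup = BeautifulSoup(html_string, "html.parser")
    return [link.text.strip() for link in soup.select("a.nav-link")]


nav_labels = get_nav_link_labels(sample_html)
print(nav_labels)  # ['About', 'Email @ me']
```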

## Appendix (in-class experiments)

**Question:** what if we wanted to get only the experience titles, but not the education titles?  Both are `<h3 class="mb-0">`

In [24]:
h3s = css_soup.select("h3.mb-0")
[element.text for element in h3s]

['Senior Web Developer',
 'Web Developer',
 'Junior Web Designer',
 'Web Design Intern',
 'University of Colorado Boulder',
 'James Buchanan High School']

**Answer:** keep going "up the tree" to find what is different about the experience compared to the education titles.  It turns out that the experience titles are nested (4 levels down) inside of a `section` tag with the ID `experience`, whereas the education titles are nested inside of a `section` tag with the ID `education`.  So if we just wanted to select the experience ones, that would look like:

In [25]:
experience_section_h3s = css_soup.select("section#experience h3")
[element.text for element in experience_section_h3s]

['Senior Web Developer',
 'Web Developer',
 'Junior Web Designer',
 'Web Design Intern']
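
The same scoping idea can be sketched on a cut-down, made-up version of the page (`sample_html` is hypothetical).  It also shows a two-step alternative: grab the section with `select_one`, then search only inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical, cut-down page with one title per section.
sample_html = """
<section id="experience">
  <div class="resume-content"><h3 class="mb-0">Web Developer</h3></div>
</section>
<section id="education">
  <div class="resume-content"><h3 class="mb-0">University of Colorado Boulder</h3></div>
</section>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# One selector: h3 tags anywhere inside the section with id "experience".
one_selector = [h3.text for h3 in soup.select("section#experience h3.mb-0")]

# Two steps: pick the section first, then select within that element only.
experience_section = soup.select_one("section#experience")
two_steps = [h3.text for h3 in experience_section.select("h3.mb-0")]

print(one_selector)  # ['Web Developer']
print(two_steps)     # ['Web Developer']
```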

**Question:** what if you wanted to get all of the text inside of a `div`, even if the `div` has several tags inside of it that contain all the text?

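**Answer:** you would want to use `.getText()`, an alias of `.get_text()`, which gathers the text of every nested tag into one string.  Note that `.text` is a property that returns the same string, so either spelling works here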

In [26]:
experience_contents = css_soup.select("section#experience div.resume-content")
experiences = [element.getText() for element in experience_contents]
experiences

['\nSenior Web Developer\nIntelitec Solutions\nBring to the table win-win survival strategies to ensure proactive domination. At the end of the day,\n              going forward, a new normal that has evolved from generation X is on the runway heading towards a\n              streamlined cloud solution. User generated content in real-time will have multiple touchpoints for\n              offshoring.\n',
 '\nWeb Developer\nIntelitec Solutions\nCapitalize on low hanging fruit to identify a ballpark value added activity to beta test. Override the\n              digital divide with additional clickthroughs from DevOps. Nanotechnology immersion along the information\n              highway will close the loop on focusing solely on the bottom line.\n',
 '\nJunior Web Designer\nShout! Media Productions\nPodcasting operational change management inside of workflows to establish a framework. Taking seamless\n              key performance indicators offline to maximise the long tail. Keeping your 

Can we clean that up a little bit, remove the newlines?

In [27]:
[" ".join([exp.strip() for exp in experience.split("\n") if len(exp.strip()) > 0]) for experience in experiences]

['Senior Web Developer Intelitec Solutions Bring to the table win-win survival strategies to ensure proactive domination. At the end of the day, going forward, a new normal that has evolved from generation X is on the runway heading towards a streamlined cloud solution. User generated content in real-time will have multiple touchpoints for offshoring.',
 'Web Developer Intelitec Solutions Capitalize on low hanging fruit to identify a ballpark value added activity to beta test. Override the digital divide with additional clickthroughs from DevOps. Nanotechnology immersion along the information highway will close the loop on focusing solely on the bottom line.',
 'Junior Web Designer Shout! Media Productions Podcasting operational change management inside of workflows to establish a framework. Taking seamless key performance indicators offline to maximise the long tail. Keeping your eye on the ball while performing a deep dive on the start-up mentality to derive convergence on cross-plat
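
As a small check of how these text accessors relate, here is a sketch on a made-up `div` (`sample_html` is hypothetical).  It compares `.text` with `.get_text()`, shows that `.string` is `None` when a tag has several children, and collapses whitespace with plain `split` and `join`.

```python
from bs4 import BeautifulSoup

# Hypothetical div with several nested tags and a line break inside the paragraph.
sample_html = """
<div class="resume-content">
  <h3 class="mb-0">Web Developer</h3>
  <div class="subheading">Intelitec Solutions</div>
  <p>Capitalize on low hanging fruit
     to identify a ballpark value.</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
content_div = soup.select("div.resume-content")[0]

# .text is a property that returns the same string as .get_text().
same_text = content_div.text == content_div.get_text()

# .string only works for a tag with exactly one child, so here it is None.
no_single_string = content_div.string is None

# str.split() with no arguments splits on any run of whitespace,
# so joining the pieces with single spaces collapses newlines and indentation.
collapsed = " ".join(content_div.get_text().split())

print(same_text)         # True
print(no_single_string)  # True
print(collapsed)
```

The last line prints `Web Developer Intelitec Solutions Capitalize on low hanging fruit to identify a ballpark value.`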

Can we get all of the text from the whole page at once?

In [28]:
css_soup.getText()

'\n\n\n\n\n\n\nResume - Start Bootstrap Theme\n\n\n\n\n\n\n\n\n\n\n\n\nClarence Taylor\n\n\n\n\n\n\n\n\n\n\nAbout\n\n\nExperience\n\n\nEducation\n\n\nSkills\n\n\nInterests\n\n\nAwards\n\n\n\n\n\n\n\nClarence\n          Taylor\n\n3542 Berry Street · Cheyenne Wells, CO 80810 · (317) 585-8468 ·\n          name@email.com\n\nI am experienced in leveraging agile frameworks to provide a robust synopsis for high level\n          overviews. Iterative approaches to corporate strategy foster collaborative thinking to further the overall\n          value proposition.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nExperience\n\n\nSenior Web Developer\nIntelitec Solutions\nBring to the table win-win survival strategies to ensure proactive domination. At the end of the day,\n              going forward, a new normal that has evolved from generation X is on the runway heading towards a\n              streamlined cloud solution. User generated content in real-time will have multiple touchpoints for\n          

In [29]:
# split() with no arguments already drops all whitespace, so no strip or length check is needed
" ".join(css_soup.getText().split())

'Resume - Start Bootstrap Theme Clarence Taylor About Experience Education Skills Interests Awards Clarence Taylor 3542 Berry Street · Cheyenne Wells, CO 80810 · (317) 585-8468 · name@email.com I am experienced in leveraging agile frameworks to provide a robust synopsis for high level overviews. Iterative approaches to corporate strategy foster collaborative thinking to further the overall value proposition. Experience Senior Web Developer Intelitec Solutions Bring to the table win-win survival strategies to ensure proactive domination. At the end of the day, going forward, a new normal that has evolved from generation X is on the runway heading towards a streamlined cloud solution. User generated content in real-time will have multiple touchpoints for offshoring. March 2013 - Present Web Developer Intelitec Solutions Capitalize on low hanging fruit to identify a ballpark value added activity to beta test. Override the digital divide with additional clickthroughs from DevOps. Nanotechn