<center>
 <img src="https://drive.google.com/uc?export=view&id=1JpQLPJGs44Tu1Z-xujVGKDyb6ikSwlCl"  alt="PyCon header" height="150"/>

<br>
<h1>
<TT>It's Officially Legal so Let's Scrape the Web</TT>
</h1>
Kimberly Fessel  &nbsp; &#8226; &nbsp;
<img src="https://drive.google.com/uc?export=view&id=1aMmmYHcHkwRFcUw5v_HD3i0Kt5sGH7Gq"  alt="Twitter logo" height="20"/> @kimberlyfessel &nbsp; &#8226; &nbsp;
<img src="https://drive.google.com/uc?export=view&id=1s8InMOalYkxWyMeu1vhG5LyF1Jc9wczH"  alt="LinkedIn logo" height="20"/> &nbsp; kimberlyfessel
<br> <br>
<h2> <TT> Scraping Basics </TT> </h2>
</center>
<br>

---



#Introduction to Google Colab and `BeautifulSoup`


<img src="https://drive.google.com/uc?export=view&id=1o6M5j-41qjCA85qgETgG0fJooZw6HXFZ"  alt="Google Colab logo" height="100"/>

- Executes Python code on the fly
- Interactivity allows for instant feedback
- Memory persists across cells
- Press `shift+enter` to run the current cell
- Use [markdown](https://blog.ghost.org/markdown/) (TEXT) mode for adding text like this

<img src="https://drive.google.com/uc?export=view&id=1cY9SICwwAbrNmuhTha3ilBl330jDie51"  alt="BeautifulSoup logo" height="100"/>

- open-source Python library
- extract data from HTML files
- understands HTML structure by working with a [parser](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) (`lxml`, `html5lib`, etc.) 
- [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for reference
<br> <br>

<img src="https://drive.google.com/uc?export=view&id=1otbY2IvQ_kzks6OMJvfHi_Zf1Zfvp0Nf"  alt="info" height="20"/> `BeautifulSoup` does not actually gather information from the web.  We will use the `requests` library for that.

#Learn to Scrape with Simple Inline HTML

Let's start with this simple HTML page given below as a string: 

In [0]:
simple_html = """
<html>

<head>
  <style>
    li {font-size: 18px;}
  </style>
</head>

<body>
  <div style="border-style: dotted; padding: 10px">
    <h1>Today's Learning Objectives</h1>
    <ul>
      <li>Decipher basic HTML</li>
      <li>Retrieve information from Internet</li>
      <li>Parse web data</li>
      <li>Gather and prepare data systematically</li>
    </ul>
    <br>
  </div>
</body>

</html>
"""

<img src="https://drive.google.com/uc?export=view&id=18s6CjUtjr0M24K57uMdtmXvK5TrV34Tv"  alt="exercise" height="20"/>  **Quick HTML Review**
> What tags do we see on this page?  
> What attributes?  
> What's the inner HTML text of the header?

Now we will tell Python to render this string as HTML.

In [0]:
from IPython.display import display, HTML
display(HTML(simple_html)) 

This simple "page" contains a list of learning objectives for today's workshop. Now we will see how `BeautifulSoup` can extract information from this HTML.

First we need to import `BeautifulSoup` and parse the HTML string.

In [0]:
from bs4 import BeautifulSoup as bs

In [0]:
soup = bs(simple_html)

In [0]:
soup

When we print out `soup`, it looks like `BeautifulSoup` hasn't done anything!  But no worries -- it has indeed parsed our code and `BeautifulSoup` now knows how to navigate through the HTML DOM.
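As a minimal sketch of this parsing step: the cell above lets `BeautifulSoup` choose a parser on its own, while the version below names one explicitly so that every machine builds the same tree. The `tiny_html` string and the `tiny_soup` name are made up for illustration, and `html.parser` (Python's built-in parser) is an assumption rather than the parser this notebook necessarily uses.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line page, used only for this illustration
tiny_html = "<html><body><h1>Hello</h1><p>First</p><p>Second</p></body></html>"

# Name the parser explicitly so the same tree is built everywhere
tiny_soup = BeautifulSoup(tiny_html, "html.parser")

# prettify() returns the parsed tree as an indented string
print(tiny_soup.prettify())

# The parsed object can be navigated by tag name
print(tiny_soup.h1.text)  # Hello
```

Printing the prettified tree is a quick way to confirm that the HTML really was parsed into nested elements.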

In [0]:
type(soup)

### Find by tag

We begin by using the `find()` method to extract the header of our HTML.

In [0]:
soup.find('h1')

In [0]:
type(soup.find('h1'))

`find()` returns a tagged element, but we can grab just the inner HTML text instead.

In [0]:
soup.find('h1').text

In [0]:
type(soup.find('h1').text)

We now have a way to extract text from a webpage -- powerful stuff!  
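One detail worth knowing before going further: `find()` returns `None` when nothing matches, so calling `.text` on the result of a failed search raises an error. The sketch below shows one way to guard against that. The helper name `text_or_default` and the sample page are hypothetical, not part of the workshop material.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration
page = "<html><body><h1>  Agenda  </h1></body></html>"
demo_soup = BeautifulSoup(page, "html.parser")


def text_or_default(soup, tag_name, default=""):
    """Return the stripped text of the first matching tag, or a default."""
    element = soup.find(tag_name)  # None when nothing matches
    if element is None:
        return default
    return element.get_text(strip=True)


print(text_or_default(demo_soup, "h1"))                           # Agenda
print(text_or_default(demo_soup, "h3", default="(no h3 found)"))  # (no h3 found)
```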

What do you think will be returned if we look for list tags (`li`)?

In [0]:
soup.find('li')

<img src="https://drive.google.com/uc?export=view&id=1DzGWG2ZiMDuh4f4ZSQJiA85q-rs6FRYl"  alt="warning" height="20"/> `BeautifulSoup` returns ONLY the FIRST matching element when we use `find()`.

###Find all

If we would like `BeautifulSoup` to return ALL matching elements, we can use `find_all()` instead.

In [0]:
soup.find_all('li')

In [0]:
type(soup.find_all('li'))

Using `find_all()` yields a result set containing all of the list elements on the "page."  You can think of a result set as acting like a list.
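To make the list comparison concrete, here is a small sketch showing that a result set supports `len()`, indexing, and slicing. The `snippet` string is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up list for illustration
snippet = "<ul><li>HTML</li><li>Scraping</li><li>Pipeline</li></ul>"
items = BeautifulSoup(snippet, "html.parser").find_all("li")

print(len(items))                      # 3
print(items[0].text)                   # HTML
print(items[-1].text)                  # Pipeline
print([li.text for li in items[:2]])   # ['HTML', 'Scraping']
```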

<img src="https://drive.google.com/uc?export=view&id=1DzGWG2ZiMDuh4f4ZSQJiA85q-rs6FRYl"  alt="warning" height="20"/> `BeautifulSoup` does not allow you to apply `.text` to a result set.  The following code **will fail**.

In [0]:
soup.find_all('li').text

Instead, you must apply `.text` to each item in the result set individually.

In [0]:
for item in soup.find_all('li'):
  print(item.text)

In [0]:
learning_objectives = [item.text for item in soup.find_all('li')]

learning_objectives

<img src="https://drive.google.com/uc?export=view&id=1b88t_6cp1ozWJydV1CyN2k0ZAyPTZh1V"  alt="tip" height="22"/>  The two **most common mistakes** I see in web scraping with `BeautifulSoup` are:
- Using `find()` when you really want `find_all()`
- Attempting to apply `.text` to a result set like the output of `find_all()`
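
Both mistakes can be reproduced on a two-item list. This is only a sketch with a made-up `snippet`. It catches the `AttributeError` so the cell keeps running and then shows the fix.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration
snippet = "<ul><li>Parse</li><li>Extract</li></ul>"
doc = BeautifulSoup(snippet, "html.parser")

# Mistake 1: find() quietly gives back only the first match
print(doc.find("li").text)  # Parse

# Mistake 2: .text on the whole result set raises AttributeError
try:
    doc.find_all("li").text
except AttributeError as err:
    print("AttributeError:", err)

# Fix: take .text from each element individually
print([li.text for li in doc.find_all("li")])  # ['Parse', 'Extract']
```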

###Exercises

For the exercises that follow, please use this HTML code describing today's agenda and tools:

In [0]:
workshop_html = """
<html>

<body>
  <h1>Today's Workshop</h1>
  <div id='agenda' style="background-color: aliceblue">
    <h2>Agenda</h2>
    <p>Today's workshop is comprised of three main sections:</p>
    <ol>
      <li>HTML Basics</li>
      <li>Scraping Basics</li>
      <li>Scraping Pipeline</li>
    </ol>
  </div>
  
  <div id='tools' style='background-color: honeydew'>
    <h2>Tools</h2>
    <p>You will be learning about two primary Python libraries:</p>  
    <ol>
      <li>BeautifulSoup</li>
      <li>requests</li>
    </ol>
  </div>
</body>

</html>
"""

In [0]:
from IPython.display import display, HTML
display(HTML(workshop_html)) 

<img src="https://drive.google.com/uc?export=view&id=18s6CjUtjr0M24K57uMdtmXvK5TrV34Tv"  alt="exercise" height="20"/> **Exercise 1 - Finding the header**  _(Solutions to all exercises are provided at the bottom of the notebook.)_
> Parse `workshop_html` with `BeautifulSoup`.  Find the main header text (`h1`) and save it in a variable.  Verify that you have the text by checking the `type` of your variable.

<img src="https://drive.google.com/uc?export=view&id=18s6CjUtjr0M24K57uMdtmXvK5TrV34Tv"  alt="exercise" height="20"/> **Exercise 2 - Finding the paragraphs**

> Now find all the paragraphs in `workshop_html` and print out the text that you find.

<img src="https://drive.google.com/uc?export=view&id=18s6CjUtjr0M24K57uMdtmXvK5TrV34Tv"  alt="exercise" height="20"/> **BONUS: Exercise 3 - Finding the agenda items**

> Create a list of all of the agenda items for today's workshop.  Be sure to store only the TEXT for the AGENDA items!

#Scrape Test Webpage

In the last exercise, we found out that HTML tags alone often aren't granular enough to isolate the elements we want.  

Let's work with a more complicated HTML file to see what other options are available.

First download this file to your computer.

In [0]:
!wget https://raw.github.com/kimfetti/Conferences/master/PyCon_2020/pycon_info.html

In [0]:
from google.colab import files
files.download('pycon_info.html')

Double click on this file to view it in your browser.  Once you have a feel for its structure, read the file in and save its contents as a string. 

In [0]:
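# The context manager closes the file once its contents have been read
with open('pycon_info.html') as html_file:
    pycon_html = html_file.read()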

In [0]:
print(pycon_html)

Since our HTML is a string, we can parse it with `BeautifulSoup` and begin collecting data.  

Let's say we are interested in gathering titles and links of events happening today.  Links can be found by looking for anchor, `a`, tags.  

In [0]:
soup = bs(pycon_html)

In [0]:
soup.find_all('a')

Whoa -- there are many more links on this page than just today's events!

###Find by attribute

In order to drill down to just the links we are interested in, notice that today's events are contained within a `div` that has `id=today`.  We can first isolate this `div` by searching for it by its `id`.

In [0]:
today_div = soup.find(id='today')

today_div

In [0]:
type(today_div)

Now we will look for all of the anchor tags that are contained within this division.

In [0]:
today_div.find_all('a')
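
The narrowing step above can be sketched on a tiny stand-in page. The `id` values mirror the ones described for the PyCon file, but the links and titles here are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up page with two sections; hrefs and titles are hypothetical
page = """
<div id="today"><a href="/keynote">Keynote</a><a href="/tutorial">Tutorial</a></div>
<div id="tomorrow"><a href="/sprints">Sprints</a></div>
"""
doc = BeautifulSoup(page, "html.parser")

all_links = doc.find_all("a")       # searches the whole document
today = doc.find(id="today")        # isolate one container first
today_only = today.find_all("a")    # searches only inside that div

print(len(all_links), len(today_only))  # 3 2
```

Calling `find_all()` on a tag rather than on the whole soup limits the search to that tag's descendants.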

<img src="https://drive.google.com/uc?export=view&id=1b88t_6cp1ozWJydV1CyN2k0ZAyPTZh1V"  alt="tip" height="22"/>   You can find elements by pretty much any attribute.  Let's find elements that are members of the `events` class.

In [0]:
soup.find_all(class_ = 'events')

Passing a dictionary of attributes works as well.

In [0]:
soup.find_all(attrs={'class':'events', 'id': 'tomorrow'}) 
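
As a rough comparison of the two styles, the sketch below runs the keyword form and the dictionary form against a small made-up page. The `sponsors` div is hypothetical and exists only to show that non-matching elements are skipped.

```python
from bs4 import BeautifulSoup

# Made-up page for illustration
page = """
<div class="events" id="today"><h2>Today</h2></div>
<div class="events" id="tomorrow"><h2>Tomorrow</h2></div>
<div class="sponsors"><h2>Sponsors</h2></div>
"""
doc = BeautifulSoup(page, "html.parser")

# `class` is a reserved word in Python, hence the trailing underscore
by_keyword = doc.find_all(class_="events")
print([div["id"] for div in by_keyword])  # ['today', 'tomorrow']

# The dictionary form can combine several attributes at once
by_dict = doc.find_all("div", attrs={"class": "events", "id": "tomorrow"})
print([div.h2.text for div in by_dict])   # ['Tomorrow']
```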

###Retrieve attributes

If we want to just get the names of today's events, we can simply cycle through today's links and collect the `.text`.

In [0]:
today_text = [link.text for link in today_div.find_all('a')]

today_text

But what would we do if we wanted the **hyperlinks** to each of those events?

`BeautifulSoup` allows you to retrieve element attributes.  You reference them with the same syntax you would use to look up a dictionary key.

In [0]:
today_div.find('a')

In [0]:
today_div.find('a')['href']

In [0]:
type(today_div.find('a')['href'])

In [0]:
today_links = [link['href'] for link in today_div.find_all('a')]

today_links
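
The dictionary comparison extends a little further, as this sketch on two made-up anchor tags suggests: square brackets raise a `KeyError` when the attribute is missing, while `.get()` returns `None` instead.

```python
from bs4 import BeautifulSoup

# Made-up anchors: the second one deliberately has no href
link_html = '<a href="/keynote" title="Opening talk">Keynote</a><a>No link yet</a>'
doc = BeautifulSoup(link_html, "html.parser")
first, second = doc.find_all("a")

print(first["href"])         # /keynote
print(sorted(first.attrs))   # ['href', 'title']
print(second.get("href"))    # None

# Square brackets on a missing attribute raise KeyError
try:
    second["href"]
except KeyError:
    print("second link has no href")
```

Using `.get()` is handy when cycling through many links, since real pages often contain anchors without an `href`.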

###Exercises

<img src="https://drive.google.com/uc?export=view&id=18s6CjUtjr0M24K57uMdtmXvK5TrV34Tv"  alt="exercise" height="20"/> **Exercise 4 - Tomorrow's event tuples** 
> Create a list of tuples for each of tomorrow's events.  The first element in your tuples will be the event title and the second will be the event link.

<img src="https://drive.google.com/uc?export=view&id=18s6CjUtjr0M24K57uMdtmXvK5TrV34Tv"  alt="exercise" height="20"/> **Exercise 5 - Finding the event headers** 
> Using `pycon_html` find the header text for today's and tomorrow's events by referencing the `events` class.

---

#Solutions

<img src="https://drive.google.com/uc?export=view&id=18s6CjUtjr0M24K57uMdtmXvK5TrV34Tv"  alt="exercise" height="20"/>  **Quick HTML Review**
> What tags do we see on this page? <br>
`html`, `head`, `style`, `body`, `div`, `h1`, `ul` (unordered list), `li` (list item), `br` (line break)

> What attributes? <br>
`style` for the `div` container

> What's the inner HTML text of the header? <br>
"Today's Learning Objectives"

<img src="https://drive.google.com/uc?export=view&id=18s6CjUtjr0M24K57uMdtmXvK5TrV34Tv"  alt="exercise" height="20"/> **Exercise 1 - Finding the header**

> Parse `workshop_html` with `BeautifulSoup`.  Find the main header text (`h1`) and save it in a variable.  Verify that you have the text by checking the `type` of your variable.

In [0]:
soup = bs(workshop_html)

In [0]:
header = soup.find('h1').text

print(header)

In [0]:
type(header)

<img src="https://drive.google.com/uc?export=view&id=18s6CjUtjr0M24K57uMdtmXvK5TrV34Tv"  alt="exercise" height="20"/> **Exercise 2 - Finding the paragraphs**

> Now find all the paragraphs in `workshop_html` and print out the text that you find.

In [0]:
soup.find_all('p')

In [0]:
for paragraph in soup.find_all('p'):
  print(paragraph.text)

<img src="https://drive.google.com/uc?export=view&id=18s6CjUtjr0M24K57uMdtmXvK5TrV34Tv"  alt="exercise" height="20"/> **BONUS: Exercise 3 - Finding the agenda items**

> Create a list of all of the agenda items for today's workshop.  Be sure to store only the TEXT for the AGENDA items!

In [0]:
agenda_items = [li.text for li in soup.find_all('li')[:3]]

print(agenda_items)

In [0]:
#Later we will learn a better way: 
#  First look for the div that contains the agenda items

agenda_div = soup.find('div', id='agenda')

agenda_items = [li.text for li in agenda_div.find_all('li')]

print(agenda_items)

<img src="https://drive.google.com/uc?export=view&id=18s6CjUtjr0M24K57uMdtmXvK5TrV34Tv"  alt="exercise" height="20"/> **Exercise 4 - Tomorrow's event tuples** 
> Create a list of tuples for each of tomorrow's events.  The first element in your tuples will be the event title and the second will be the event link.

In [0]:
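# Re-parse the PyCon page: `soup` was last assigned to the workshop HTML above
soup = bs(pycon_html)
tomorrow_tuples = [(a.text, a['href']) for a in soup.find(id='tomorrow').find_all('a')]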

tomorrow_tuples

<img src="https://drive.google.com/uc?export=view&id=18s6CjUtjr0M24K57uMdtmXvK5TrV34Tv"  alt="exercise" height="20"/> **Exercise 5 - Finding the event headers** 
> Using `pycon_html` find the header text for today's and tomorrow's events by referencing the `events` class.

In [0]:
event_headers = [div.find('h2') for div in soup.find_all(class_='events')]

In [0]:
event_header_text = [header.text for header in event_headers]

event_header_text