In [2]:
import requests
from bs4 import BeautifulSoup

In [3]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")

A status_code of *200* means that the page downloaded successfully. We won't fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.

In [7]:
page.status_code 

200
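The rule of thumb above can be sketched as a small helper. This is only an illustration: `describe_status` is a hypothetical name, not part of `requests`, and the labels it returns are our own.

```python
def describe_status(code: int) -> str:
    """Classify an HTTP status code by its leading digit."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"  # 1xx and 3xx codes are not covered in this notebook

print(describe_status(200))  # success
print(describe_status(404))  # client error
```

In practice, `page.raise_for_status()` can be called on the response to raise an exception for 4xx and 5xx codes instead of checking the number by hand.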

In [8]:
page.text

'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

In [9]:
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
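The two outputs above differ only in type: `page.text` is a `str`, while `page.content` is the raw `bytes`. A minimal sketch of that relationship, using a made-up byte string and assuming UTF-8 (requests works out the encoding itself when building `page.text`):

```python
# Hypothetical raw bytes, standing in for what page.content returns.
raw = b'<p>Caf\xc3\xa9</p>'

# Decoding the bytes yields a str, comparable to page.text.
text = raw.decode('utf-8')

print(type(raw).__name__, type(text).__name__)  # bytes str
print(text)
```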

In [11]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [14]:
list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

The output shows two top-level elements on the page: the initial <!DOCTYPE html> declaration and the <html> tag. The list also contains a newline character (\n) between them.

In [17]:
for item in list(soup.children):
    print(type(item))

<class 'bs4.element.Doctype'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>


All of the items are BeautifulSoup objects. The first is a Doctype object, which contains information about the type of the document.

The second is a NavigableString, which represents text found in the HTML document.

The final item is a Tag object, which contains other nested tags. It allows us to navigate through an HTML document and extract other tags and text.
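Since only Tag objects can be navigated further, one way to skip the Doctype and the newline strings is to filter the children by type. The sketch below assumes an inline copy of the simple example page (`html_doc`) so that it runs without a network request.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline copy of the simple example page, used instead of a live request.
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Keep only Tag children, dropping the Doctype and the newline string.
top_level_tags = [child for child in soup.children if isinstance(child, Tag)]
print([tag.name for tag in top_level_tags])  # ['html']

# The same filter works one level down, on the children of <html>.
html_tag = top_level_tags[0]
nested_names = [child.name for child in html_tag.children if isinstance(child, Tag)]
print(nested_names)  # ['head', 'body']
```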

Select the html tag and its children by taking the third item in the list.

In [21]:
html = list(soup.children)[2]
list(html.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

In [23]:
body = list(html.children)[3]
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

Isolate the p tag.

In [25]:
p = list(body.children)[1]
p.get_text()

'Here is some simple content for this page.'

### Finding all instances of a tag at once

In [26]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

**find_all** returns a list, so use list indexing to pick an element before extracting its text.

In [35]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'
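When a page has several matching tags, the list returned by `find_all` can be looped over instead of indexed. A small sketch, assuming a made-up two-paragraph document; it also shows `find`, which returns only the first match, or `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical document with two paragraphs.
html_doc = "<body><p>First.</p><p>Second.</p></body>"
soup = BeautifulSoup(html_doc, "html.parser")

# Extract the text of every <p> tag in one pass.
texts = [p.get_text() for p in soup.find_all("p")]
print(texts)  # ['First.', 'Second.']

# find returns a single Tag rather than a list.
first = soup.find("p")
print(first.get_text())  # First.

# No <table> exists here, so find returns None.
missing = soup.find("table")
print(missing)  # None
```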

### Searching for tags by class and id

In [37]:
page = requests.get('http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html')
soup = BeautifulSoup(page.content, 'html.parser')

In [38]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [39]:
soup.find_all(class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [40]:
soup.find_all(id='first')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
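The searches above can be reproduced offline. The sketch below assumes a condensed stand-in for the page (`html_doc`), reconstructed from the outputs shown, and illustrates that `class_` matches a tag when any one of its classes matches.

```python
from bs4 import BeautifulSoup

# Condensed stand-in for ids_and_classes.html, rebuilt from the outputs above.
html_doc = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Both outer paragraphs match, including the one that has two classes.
outer = soup.find_all("p", class_="outer-text")
print(len(outer))  # 2

# Searching on the shared class finds one inner and one outer paragraph.
first_item_ids = [p["id"] for p in soup.find_all(class_="first-item")]
print(first_item_ids)  # ['first', 'second']

# An id is expected to be unique, so find is enough.
first_text = soup.find(id="first").get_text()
print(first_text)  # First paragraph.
```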

### Using CSS Selectors

In [41]:
soup.select('div p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]
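A few more selectors, sketched against a condensed stand-in for the same page (`html_doc` is an assumption, rebuilt from the outputs shown earlier). `select` always returns a list, while `select_one` returns the first match.

```python
from bs4 import BeautifulSoup

# Condensed stand-in for ids_and_classes.html.
html_doc = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# 'div p': every <p> nested anywhere inside a <div>.
inner = soup.select("div p")
print(len(inner))  # 2

# 'p.outer-text': <p> tags carrying the outer-text class.
outer = soup.select("p.outer-text")
print(len(outer))  # 2

# '#second': the element whose id is "second".
second_text = soup.select_one("#second").get_text(strip=True)
print(second_text)  # First outer paragraph.

# 'div > p.first-item': direct <p> children of a <div> with class first-item.
direct_ids = [p["id"] for p in soup.select("div > p.first-item")]
print(direct_ids)  # ['first']
```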

In [48]:
page = requests.get('http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168')
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id='seven-day-forecast')
forecast_items = seven_day.find_all(class_='tombstone-container')
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  NOW until
  <br/>
  4:00pm Sat
 </p>
 <p>
  <img alt="" class="forecast-icon" src="newimages/medium/shra.png" title=""/>
 </p>
 <p class="short-desc">
  Wind Advisory
 </p>
</div>
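From a forecast item like the one printed above, the individual fields can be pulled out by class name. The sketch below parses an inline copy of that output (`tonight_html`) rather than the live page, whose content changes over time; the variable names are our own.

```python
from bs4 import BeautifulSoup

# Inline copy of the forecast item printed above.
tonight_html = """
<div class="tombstone-container">
 <p class="period-name">NOW until<br/>4:00pm Sat</p>
 <p><img alt="" class="forecast-icon" src="newimages/medium/shra.png" title=""/></p>
 <p class="short-desc">Wind Advisory</p>
</div>
"""
tonight = BeautifulSoup(tonight_html, "html.parser").find(class_="tombstone-container")

# Join the text on either side of <br/> with a space.
period = tonight.find(class_="period-name").get_text(" ", strip=True)
short_desc = tonight.find(class_="short-desc").get_text(strip=True)

# Tag attributes are read with dictionary-style indexing.
icon_src = tonight.find("img")["src"]

print(period)      # NOW until 4:00pm Sat
print(short_desc)  # Wind Advisory
print(icon_src)    # newimages/medium/shra.png
```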
