# Scraping Data from the Web

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) website

[Example target site: Horse Land](https://treehouse-projects.github.io/horse-land/index.html)

In [3]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [7]:
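# Fetch the page and parse it with Python's built-in HTML parser.
# The with-block closes the HTTP response once it has been read.
with urlopen('https://treehouse-projects.github.io/horse-land/index.html') as response:
    soup = BeautifulSoup(response.read(), 'html.parser')

# print(soup) also works, but prettify() indents the markup so it is easier to read.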

print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Horse Land
  </title>
  <link href="css/style.css" rel="stylesheet"/>
  <link href="https://fonts.googleapis.com/css?family=Lato:400,700|Rye" rel="stylesheet"/>
 </head>
 <body>
  <div class="sign">
   <h1>
    Horse Land
   </h1>
   <p>
    A list of horses from A to Z
   </p>
  </div>
  <div class="featured">
   <h2>
    Horse of the Month
   </h2>
   <div class="featured-img">
    <img src="img/1280px-SpanishMustangsOfCorolla.jpg"/>
   </div>
   <div class="featured-text">
    <p class="bold">
     Mustang
    </p>
    <p>
     This month, we're featuring the mustang, a free-roaming horse of the American west that first descended from
          horses brought to the Americas by the Spanish. Here are five beach loving wild Spanish Mustangs in Corolla,
          North Carolina.
    </p>
    <a class="button button--primary" href="mustang.html">
     Learn More
    </a>
   </div>
  </div>
  <ul class="card-

### Things to note

In the unordered list `<ul class="card-wrap" id="imageGallery">`, the `imageGallery` ID appears even though the images inside it do not.

**Reason:** *BeautifulSoup doesn't wait for JavaScript to run* before it scrapes the page. It only parses the HTML that `urlopen` downloaded, so anything a script would add in the browser is missing. (Later on, we see how to handle it.)
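
The effect described above can be reproduced offline. The following is a minimal sketch, not the real Horse Land markup: `static_html` is a hypothetical stand-in that assumes the gallery list is empty in the downloaded source and is filled in by a script only when the page runs in a browser.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (an assumption, not the real
# source): the list is empty in the HTML, and a script would fill it in only
# inside a browser.
static_html = """
<ul class="card-wrap" id="imageGallery"></ul>
<script>
  // In a browser, code here would add cards with images to the gallery.
</script>
"""

soup_static = BeautifulSoup(static_html, 'html.parser')
gallery = soup_static.find('ul', {'id': 'imageGallery'})

print(gallery['id'])                 # the container is present
print(len(gallery.find_all('li')))   # 0, no list items were parsed
print(len(gallery.find_all('img')))  # 0, no images either
```

The parser sees the `<ul>` because it is part of the HTML text, but the script is never executed, so the list stays empty.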

### Other BeautifulSoup features

In [9]:
print(soup.title)

<title>Horse Land</title>


In [12]:
print(soup.div)  # only the first div

<div class="sign">
<h1>Horse Land</h1>
<p>A list of horses from A to Z</p>
</div>
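
Accessing a tag name as an attribute, as in `soup.title` and `soup.div`, is a shortcut for `find()`. The sketch below shows this, along with reading a tag's text and attributes. It uses a small inline document (`snippet` and `doc` are hypothetical names, with markup modelled on the output above) so that it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the output above.
snippet = """
<html><head><title>Horse Land</title></head>
<body>
<div class="sign"><h1>Horse Land</h1><p>A list of horses from A to Z</p></div>
<div class="featured"><h2>Horse of the Month</h2></div>
</body></html>
"""
doc = BeautifulSoup(snippet, 'html.parser')

# doc.div is a shortcut for doc.find('div'): both return the first match.
first_div = doc.div
print(first_div is doc.find('div'))   # True

# .string returns the text of a tag that has a single text child.
print(doc.title.string)               # Horse Land

# Attribute access can be chained to step into nested tags.
print(doc.div.p.get_text())           # A list of horses from A to Z

# Attributes are read like dictionary keys; class is multi-valued, so it is a list.
print(first_div['class'])             # ['sign']
```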


In [15]:
# all divs
divs = soup.find_all('div')
for div in divs:
    print(div)

<div class="sign">
<h1>Horse Land</h1>
<p>A list of horses from A to Z</p>
</div>
<div class="featured">
<h2>Horse of the Month</h2>
<div class="featured-img">
<img src="img/1280px-SpanishMustangsOfCorolla.jpg"/>
</div>
<div class="featured-text">
<p class="bold">Mustang</p>
<p>This month, we're featuring the mustang, a free-roaming horse of the American west that first descended from
          horses brought to the Americas by the Spanish. Here are five beach loving wild Spanish Mustangs in Corolla,
          North Carolina.</p>
<a class="button button--primary" href="mustang.html">Learn More</a>
</div>
</div>
<div class="featured-img">
<img src="img/1280px-SpanishMustangsOfCorolla.jpg"/>
</div>
<div class="featured-text">
<p class="bold">Mustang</p>
<p>This month, we're featuring the mustang, a free-roaming horse of the American west that first descended from
          horses brought to the Americas by the Spanish. Here are five beach loving wild Spanish Mustangs in Corolla,
        

What if we want only particular divs, such as those with the *featured* class?

`<div class="featured">...</div>`

In [17]:
divs = soup.find_all('div', {'class': 'featured'})
for div in divs:
    print(div)

<div class="featured">
<h2>Horse of the Month</h2>
<div class="featured-img">
<img src="img/1280px-SpanishMustangsOfCorolla.jpg"/>
</div>
<div class="featured-text">
<p class="bold">Mustang</p>
<p>This month, we're featuring the mustang, a free-roaming horse of the American west that first descended from
          horses brought to the Americas by the Spanish. Here are five beach loving wild Spanish Mustangs in Corolla,
          North Carolina.</p>
<a class="button button--primary" href="mustang.html">Learn More</a>
</div>
</div>
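
The same filter can be written in a few equivalent ways. The sketch below compares the dictionary form used above with the `class_` keyword argument and a CSS selector, again on a hypothetical inline snippet (`snippet` and `doc` are illustrative names) rather than the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the featured section above.
snippet = """
<div class="sign"><h1>Horse Land</h1></div>
<div class="featured">
  <div class="featured-img"><img src="img/1280px-SpanishMustangsOfCorolla.jpg"/></div>
  <div class="featured-text"><p class="bold">Mustang</p></div>
</div>
"""
doc = BeautifulSoup(snippet, 'html.parser')

by_dict = doc.find_all('div', {'class': 'featured'})   # form used above
by_keyword = doc.find_all('div', class_='featured')    # class_ avoids the reserved word
by_selector = doc.select('div.featured')               # CSS selector

# A class filter compares whole class names, so 'featured-img' and
# 'featured-text' are not matched by 'featured'.
print(len(by_dict), len(by_keyword), len(by_selector))  # 1 1 1

# Searching from the matched tag limits results to its descendants.
featured = by_dict[0]
print(featured.find('p', class_='bold').get_text())     # Mustang
print(featured.img['src'])
```

Because the class filter matches complete class names, only the outer div is returned. The nested `featured-img` and `featured-text` divs still appear in the printed output because they are children of that div.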
