# Scraping Data from the Web

Library: [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)

Target site: [Horse Land](https://treehouse-projects.github.io/horse-land/index.html)

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [2]:
html = urlopen('https://treehouse-projects.github.io/horse-land/index.html')
soup = BeautifulSoup(html.read(), 'html.parser')

# print(soup) also works, but prettify() indents the markup so it is easier to read

print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Horse Land
  </title>
  <link href="css/style.css" rel="stylesheet"/>
  <link href="https://fonts.googleapis.com/css?family=Lato:400,700|Rye" rel="stylesheet"/>
 </head>
 <body>
  <div class="sign">
   <h1>
    Horse Land
   </h1>
   <p>
    A list of horses from A to Z
   </p>
  </div>
  <div class="featured">
   <h2>
    Horse of the Month
   </h2>
   <div class="featured-img">
    <img src="img/1280px-SpanishMustangsOfCorolla.jpg"/>
   </div>
   <div class="featured-text">
    <p class="bold">
     Mustang
    </p>
    <p>
     This month, we're featuring the mustang, a free-roaming horse of the American west that first descended from
          horses brought to the Americas by the Spanish. Here are five beach loving wild Spanish Mustangs in Corolla,
          North Carolina.
    </p>
    <a class="button button--primary" href="mustang.html">
     Learn More
    </a>
   </div>
  </div>
  <ul class="card-

### Things to note

In the unordered list, `<ul class="card-wrap" id="imageGallery">`, the `imageGallery` ID appears even though the images do not.

**Reason:** *Beautiful Soup does not run JavaScript.* It parses only the HTML that the server returned, so anything a script adds later in the browser is missing. (Later on, we see how to handle it.)
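
The effect can be reproduced offline with a small made-up page. This is only a sketch: the markup and the script below are assumptions for illustration, not Horse Land's actual source.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <ul> is empty in the HTML, and a script would fill it in a browser.
html = """
<ul class="card-wrap" id="imageGallery"></ul>
<script>
  document.getElementById('imageGallery').innerHTML = '<li><img src="horse.jpg"></li>';
</script>
"""

soup = BeautifulSoup(html, 'html.parser')
gallery = soup.find('ul', {'id': 'imageGallery'})

# The parser treats the script body as plain text and never executes it,
# so the gallery has no <img> children.
images_in_gallery = gallery.find_all('img')
print(len(images_in_gallery))  # 0
```

The list element and its ID are found, but the images the script would have inserted are not.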

### Other Beautiful Soup features

In [3]:
print(soup.title)

<title>Horse Land</title>


In [4]:
print(soup.div)  # only the first div

<div class="sign">
<h1>Horse Land</h1>
<p>A list of horses from A to Z</p>
</div>


# `find()` and `find_all()`

Let's look at two Beautiful Soup methods, `find()` and `find_all()`, in greater detail.

In [5]:
# all divs
divs = soup.find_all('div')
for div in divs:
    print(div)

<div class="sign">
<h1>Horse Land</h1>
<p>A list of horses from A to Z</p>
</div>
<div class="featured">
<h2>Horse of the Month</h2>
<div class="featured-img">
<img src="img/1280px-SpanishMustangsOfCorolla.jpg"/>
</div>
<div class="featured-text">
<p class="bold">Mustang</p>
<p>This month, we're featuring the mustang, a free-roaming horse of the American west that first descended from
          horses brought to the Americas by the Spanish. Here are five beach loving wild Spanish Mustangs in Corolla,
          North Carolina.</p>
<a class="button button--primary" href="mustang.html">Learn More</a>
</div>
</div>
<div class="featured-img">
<img src="img/1280px-SpanishMustangsOfCorolla.jpg"/>
</div>
<div class="featured-text">
<p class="bold">Mustang</p>
<p>This month, we're featuring the mustang, a free-roaming horse of the American west that first descended from
          horses brought to the Americas by the Spanish. Here are five beach loving wild Spanish Mustangs in Corolla,
        

What if we want only particular divs, such as those with the *featured* class name?

`<div class="featured">...</div>`

In [6]:
divs = soup.find_all('div', {'class': 'featured'})
for div in divs:
    print(div)

<div class="featured">
<h2>Horse of the Month</h2>
<div class="featured-img">
<img src="img/1280px-SpanishMustangsOfCorolla.jpg"/>
</div>
<div class="featured-text">
<p class="bold">Mustang</p>
<p>This month, we're featuring the mustang, a free-roaming horse of the American west that first descended from
          horses brought to the Americas by the Spanish. Here are five beach loving wild Spanish Mustangs in Corolla,
          North Carolina.</p>
<a class="button button--primary" href="mustang.html">Learn More</a>
</div>
</div>


In [10]:
featured_header_h2 = soup.find('div', {"class": "featured"}).h2

print(featured_header_h2)

<h2>Horse of the Month</h2>


In [9]:
featured_header = soup.find('div', {"class": "featured"})

print(featured_header)

<div class="featured">
<h2>Horse of the Month</h2>
<div class="featured-img">
<img src="img/1280px-SpanishMustangsOfCorolla.jpg"/>
</div>
<div class="featured-text">
<p class="bold">Mustang</p>
<p>This month, we're featuring the mustang, a free-roaming horse of the American west that first descended from
          horses brought to the Americas by the Spanish. Here are five beach loving wild Spanish Mustangs in Corolla,
          North Carolina.</p>
<a class="button button--primary" href="mustang.html">Learn More</a>
</div>
</div>
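
The difference between the two methods shows up most clearly in what they return. The sketch below uses a trimmed-down copy of the Horse Land markup (an assumption made so the example runs without a network request); `no-such-class` is a made-up class name.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page above
html = """
<div class="sign"><h1>Horse Land</h1></div>
<div class="featured"><h2>Horse of the Month</h2></div>
"""
soup = BeautifulSoup(html, 'html.parser')

all_divs = soup.find_all('div')   # list-like ResultSet with every match
first_div = soup.find('div')      # a single Tag: the first match only

# When nothing matches, find() returns None and find_all() returns an empty result.
missing = soup.find('div', {'class': 'no-such-class'})
missing_all = soup.find_all('div', {'class': 'no-such-class'})

print(len(all_divs))       # 2
print(first_div['class'])  # ['sign']
print(missing)             # None
print(len(missing_all))    # 0
```

Because `find()` can return `None`, chaining an attribute such as `.h2` onto it raises an `AttributeError` when the element is absent, so it can be worth checking the result first.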


## `get_text()`

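Use `get_text()` as the **last step** in the scraping process. It strips the tags and returns a plain string, so the result can no longer be navigated or searched with `find()` and `find_all()`.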

In [13]:
print(featured_header_h2.get_text())

Horse of the Month


In [14]:
print(featured_header.get_text())


Horse of the Month




Mustang
This month, we're featuring the mustang, a free-roaming horse of the American west that first descended from
          horses brought to the Americas by the Spanish. Here are five beach loving wild Spanish Mustangs in Corolla,
          North Carolina.
Learn More
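
The blank lines in the output above come from the whitespace between tags. As a sketch of one way to tidy this up, `get_text()` accepts `separator` and `strip` arguments; the HTML below is a shortened copy of the featured block, used here as an assumption so the example runs offline.

```python
from bs4 import BeautifulSoup

# Shortened copy of the featured block
html = """
<div class="featured">
<h2>Horse of the Month</h2>
<div class="featured-text">
<p class="bold">Mustang</p>
<a class="button button--primary" href="mustang.html">Learn More</a>
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
featured = soup.find('div', {'class': 'featured'})

# strip=True trims each piece of text and drops the empty ones;
# separator is placed between the remaining pieces.
clean_text = featured.get_text(separator=' | ', strip=True)
print(clean_text)  # Horse of the Month | Mustang | Learn More
```

The result is an ordinary `str`, which is why any searching has to happen before this call.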




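## Parameters

`find()` and `find_all()` accept these parameters:

- `name`: the tag name to match, such as `'div'`
- `attrs`: a dictionary of attribute values to match, such as `{'class': 'featured'}`
- `recursive`: whether to search all descendants (the default) or only direct children
- `string`: match on the text inside an element rather than on its tag or attributes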
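
A minimal sketch of each parameter in use follows. The HTML is a simplified, partly made-up variant of the featured block (the paragraph text is an assumption), arranged so that one `<p>` is a direct child and the other is nested.

```python
from bs4 import BeautifulSoup

# Simplified, partly hypothetical markup for illustration
html = """
<div class="featured">
<p class="bold">Mustang</p>
<div class="featured-text"><p>Learn about mustangs.</p></div>
<a class="button button--primary" href="mustang.html">Learn More</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
featured = soup.find('div', {'class': 'featured'})

by_name = featured.find_all('p')                         # name: every <p> descendant
by_attrs = featured.find_all(attrs={'class': 'bold'})    # attrs: match on attributes
direct_only = featured.find_all('p', recursive=False)    # recursive: direct children only
by_string = featured.find_all('a', string='Learn More')  # string: match on the text

print(len(by_name), len(by_attrs), len(direct_only), len(by_string))  # 2 1 1 1
```

With `recursive=False` the nested paragraph inside `featured-text` is skipped, which is why that search finds one `<p>` instead of two.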

## Exercise: find buttons

Use the `attrs` argument to search for the CSS class and print every element that has the *primary button* class (the *Learn More* button).

In [15]:
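# find() returns a single Tag, and looping over a Tag walks its children;
# find_all() returns every matching element, which is what we want here
for button in soup.find_all(attrs={"class": "button button--primary"}):
    print(button.get_text())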

Learn More


Do the same thing with the `class_` keyword argument. Because `class` is a reserved word in Python, Beautiful Soup uses `class_` (with a trailing underscore) instead.

In [17]:
# find_all() returns every matching element; find() would return only the first
for button in soup.find_all(class_="button button--primary"):
    print(button.get_text())

Learn More


## Get all links (internal and external)

In [21]:
for link in soup.find_all('a'):
    print(link.get('href'))

mustang.html
https://en.wikipedia.org/wiki/Horse
https://commons.wikimedia.org/wiki/Horse_breeds
https://commons.wikimedia.org/wiki/Horse_breeds
https://creativecommons.org/licenses/by-sa/3.0/
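
The `href` values come back exactly as written in the page, so internal links such as `mustang.html` are relative. One possible way to separate internal from external links, sketched here with the standard library only, is to resolve each `href` against the page URL and compare host names. The `hrefs` list is copied from the output above rather than scraped, so the example runs offline.

```python
from urllib.parse import urljoin, urlparse

base_url = 'https://treehouse-projects.github.io/horse-land/index.html'

# A few of the href values printed above
hrefs = [
    'mustang.html',
    'https://en.wikipedia.org/wiki/Horse',
    'https://creativecommons.org/licenses/by-sa/3.0/',
]

base_host = urlparse(base_url).netloc
internal, external = [], []

for href in hrefs:
    # urljoin leaves absolute URLs alone and resolves relative ones against the page URL
    absolute = urljoin(base_url, href)
    if urlparse(absolute).netloc == base_host:
        internal.append(absolute)
    else:
        external.append(absolute)

print(internal)  # ['https://treehouse-projects.github.io/horse-land/mustang.html']
print(external)
```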


# Web Scraping: legal concerns

Web scraping legal claims (USA):

- Copyright infringement
- Computer Fraud and Abuse Act (CFAA): prohibits accessing a computer without, or in excess of, authorization.

### robots.txt

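Many sites publish a [robots.txt](https://en.wikipedia.org/wiki/Robots_exclusion_standard#Examples) file that tells bots which parts of the site they may and may not crawl. The file is advisory: a well-behaved scraper checks it before requesting a page.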
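
Python's standard library can interpret these rules through `urllib.robotparser`. The sketch below parses a made-up robots.txt from a list of lines so that it runs without a network request; the rules, the `my-scraper` user agent, and the `example.com` URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every bot is barred from /private/
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch(useragent, url) reports whether the rules allow that agent to request the URL
allowed_public = parser.can_fetch('my-scraper', 'https://example.com/index.html')
allowed_private = parser.can_fetch('my-scraper', 'https://example.com/private/data.html')

print(allowed_public, allowed_private)  # True False
```

Against a live site, the usual pattern is to call `set_url()` with the address of the site's robots.txt and then `read()` to download it, instead of passing lines to `parse()`.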
