# Web Scraping

You want to go for a hike in San Luis Obispo. But where? In this workbook, you will scrape [a website](http://www.hikeslo.com/) with information about local hiking routes, and build a data frame containing information about each hike (e.g., length, elevation change).

## Scraping a Single Page

We will use the `requests` library to fetch the contents of a URL and BeautifulSoup to parse the HTML. Although we used BeautifulSoup previously to parse XML, HTML is where it really shines. Much of the HTML on the web is malformed (for example, tags that are never closed), and BeautifulSoup is designed to handle malformed HTML gracefully.
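
To make "gracefully" concrete, here is a minimal sketch that parses a deliberately broken snippet. The HTML string is made up for illustration, and the names `broken_html` and `soup_demo` are not part of the workbook.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed snippet: neither <b> nor <p> is ever closed.
broken_html = "<p>Unclosed <b>bold"

soup_demo = BeautifulSoup(broken_html, "html.parser")

# The parser still builds a tree, so the tags can be searched as usual.
bold_text = soup_demo.find("b").get_text()
print(bold_text)
```

Even though the closing tags are missing, both the `<p>` and the `<b>` tags end up in the parsed tree.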

In [None]:
import requests
from bs4 import BeautifulSoup

Let's try to extract the rating, location, elevation gain, and distance from the following page automatically: http://www.hikeslo.com/vaca-flats/

In [None]:
# Fetch the page; `response.text` holds the HTML source as a string.
response = requests.get("http://www.hikeslo.com/vaca-flats/")
soup = BeautifulSoup(response.text, "html.parser")

The variable `soup` is a `BeautifulSoup` object that represents the document as a nested data structure. As before, to get all instances of a particular tag as a list, we can use the `.find_all()` method.

In [None]:
soup.find_all('table')
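
The output for a real page can be long, so it may help to see `.find_all()` on a tiny document first. This is only a sketch: the HTML below is made up and is not taken from the hiking site, and `sample_soup`, `tables`, and `table_texts` are illustrative names.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; it is not taken from the hiking site.
sample_html = """
<html><body>
  <table><tr><td>first table</td></tr></table>
  <table><tr><td>second table</td></tr></table>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# .find_all() returns a list-like object with one Tag per match.
tables = sample_soup.find_all("table")
print(len(tables))

# Each element is a Tag; .get_text() returns the text inside it.
table_texts = [t.get_text(strip=True) for t in tables]
print(table_texts)
```

The related method `.find()` returns only the first match, or `None` if nothing matches.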

As mentioned above, BeautifulSoup represents the document as a nested data structure. So we can also call `.find()` and `.find_all()` on a tag to search for tags _within_ that tag.

In [None]:
table = soup.find('table')
table.find_all('div')
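
The difference between searching the whole document and searching within one tag is easier to see on a small example. The sketch below uses made-up HTML (the "Distance" and "4 miles" values are placeholders, not data from the site), and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Made-up HTML: one <div> outside the table and two inside it.
sample_html = """
<div>outside the table</div>
<table>
  <tr><td><div>Distance</div></td><td><div>4 miles</div></td></tr>
</table>
"""
nested_soup = BeautifulSoup(sample_html, "html.parser")

# Searching from the top of the document finds every <div>.
all_divs = nested_soup.find_all("div")

# Searching from the <table> tag finds only the <div> tags nested inside it.
first_table = nested_soup.find("table")
inner_divs = first_table.find_all("div")
inner_texts = [d.get_text(strip=True) for d in inner_divs]

print(len(all_divs), inner_texts)
```

Narrowing the search to a parent tag first is often simpler than writing one filter that excludes unrelated matches elsewhere on the page.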

Now let's use `BeautifulSoup` to extract the information that we want from the page: rating, location, elevation gain, and distance. How do we know where to find this information in the HTML source code?

When web scraping, you will need to constantly go back and forth between the rendered page and the HTML source code. Google Chrome makes this easy for you. If you right-click on any element on the page, one of the options is "Inspect". This will show you the HTML source code, with the element you selected highlighted.

Let's try to extract the Location from the page automatically.

In [None]:
# YOUR CODE HERE

Now implement the function below, which, given the URL of a page on this site, returns the rating, location, elevation gain, and distance as a pandas `Series`.

_Hint:_ The rating is probably the most challenging. You will need to filter based on the `class` attribute. If you have forgotten how to filter on an attribute, take a look at the documentation for `.find_all()`.
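
As a reminder of the syntax, here is a minimal sketch of attribute filtering on made-up HTML. The class names `stat` and `rating` are hypothetical; use "Inspect" to find the class names the site actually uses.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the class names here are hypothetical.
class_html = """
<div class="stat rating">4 stars</div>
<div class="stat">Los Osos</div>
<span class="rating">ignored span</span>
"""
class_soup = BeautifulSoup(class_html, "html.parser")

# `class` is a reserved word in Python, so the keyword argument is `class_`.
by_keyword = class_soup.find_all("div", class_="rating")

# An equivalent form passes the attribute in a dictionary.
by_attrs = class_soup.find_all("div", attrs={"class": "rating"})

keyword_texts = [t.get_text() for t in by_keyword]
attrs_texts = [t.get_text() for t in by_attrs]
print(keyword_texts, attrs_texts)
```

Both calls match the first `<div>` even though it has two classes, because a tag matches if any one of its classes equals the requested value. The `<span>` is skipped because the search is restricted to `<div>` tags.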

In [None]:
import pandas as pd

In [None]:
def get_data_for_hike(url):
    """Scrape one hike page and return its details as a pandas Series."""

    # YOUR CODE HERE

    return pd.Series({
        "rating": None,
        "location": None,
        "elevation_gain": None,
        "distance": None,
    })

In [None]:
get_data_for_hike("http://www.hikeslo.com/vaca-flats/")

## Crawling a Site

We want to scrape all of the hikes on the website automatically. For example, we might want to scrape information about the 10 hikes listed on [this page](http://www.hikeslo.com/).

We can do this in two steps. First, we scrape the main page, getting all links to hikes. (Note: Hyperlinks are represented by the `<a>` tag.) Then, we scrape each of those pages by calling the function `get_data_for_hike()` that we wrote above.
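
The two steps can be sketched on made-up data without touching the network. Everything below is illustrative: the index HTML and the `example-hike-...` paths are invented, and `fake_get_data_for_hike()` is a hypothetical stand-in for your own `get_data_for_hike()`. On the real site, the main page will likely contain links that are not hikes, so some filtering would probably be needed.

```python
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = "http://www.hikeslo.com/"

# Made-up index page; the hike paths are invented for illustration.
index_html = """
<a href="/example-hike-one/">Example Hike One</a>
<a href="http://www.hikeslo.com/example-hike-two/">Example Hike Two</a>
<a name="top">anchor without href</a>
"""
index_soup = BeautifulSoup(index_html, "html.parser")

# Step 1: collect link targets. href=True keeps only <a> tags that have an
# href attribute, and urljoin() turns relative paths into absolute URLs.
links = [urljoin(BASE_URL, a["href"]) for a in index_soup.find_all("a", href=True)]


def fake_get_data_for_hike(url):
    """Hypothetical stand-in that returns placeholder values (no network)."""
    return pd.Series({
        "rating": None,
        "location": None,
        "elevation_gain": None,
        "distance": None,
    })


# Step 2: call the function once per link and stack the Series into rows.
hikes = pd.DataFrame([fake_get_data_for_hike(u) for u in links], index=links)
print(hikes.shape)
```

Using the URL as the row index is one possible design choice; it keeps track of which page each row came from.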

In [None]:
# YOUR CODE HERE