### Web Scraping with Python

Previous module: DOM elements and introduction to BeautifulSoup  

[BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)  

Methods covered today:
<li>find, find_all</li>
<li>find_next_sibling(s), find_previous_sibling(s)</li>
<li>stripped_strings</li>
<li>filtering by tags and text</li>

In [1]:
from bs4 import BeautifulSoup
import re

Scraping data from [Clear Lake Campground](http://www.fs.usda.gov/recarea/mthood/null/recarea/?recid=53058&actid=82)

#### Clear Lake Campground Website
<img src ='images/clear_lake_website.png' style="display: inline-block" width=90%>

#### Web scraping methodology
1. Identify the information we want to scrape
2. Look for unique elements near that information, such as text or a tag attribute like an id or class
3. Look for structural relationships between elements: table data, list items, sibling relationships
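
The steps above can be sketched on a tiny page. The markup below is made up for illustration (it is not the real campground page), and the names `html`, `label`, and `phone` are assumptions of this sketch. It anchors on a unique label, then follows a sibling relationship to reach the target text.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration (not the real page source)
html = """
<div id="contact">
  <strong>Hours:</strong> open daily<br/>
  <strong>Telephone:</strong> (555) 010-0199<br/>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Step 2: anchor on a unique element near the target, here the label text
label = soup.find('strong', string=re.compile('Telephone'))

# Step 3: use the structural relationship; the number is the text node
# that immediately follows the label
phone = label.next_sibling.strip()
print(phone)
```

Anchoring on the label text rather than on position in the page tends to survive small layout changes, which is why the unique element is located first.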

In [6]:
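# Create a 'soup' of DOM elements from the saved page source
site_url = 'clear_lake.html'
# use a context manager so the file handle is closed after reading
with open("webfiles/" + site_url, 'r') as page_file:
    cg_data = page_file.read()
cg_soup = BeautifulSoup(cg_data, 'html.parser')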

In [7]:
cg_soup


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<script type="text/javascript">
        var bidiSupport = new Object();
        bidiSupport.bidiAlignRight = "right";
        bidiSupport.bidiAlignLeft = "left"; 
        bidiSupport.bidiDirAttr = ""; 
        bidiSupport.bidiImageRTL = null;
        bidiSupport.isRTL = false;
</script>
<title> 
	    	   
			
			Mt. Hood National Forest - Clear Lake Campground
			
	</title>
<link href="about:blank" rel="shortcut icon"/>
<link href="/FSE_WIDTheme/themes/./html/FSE_WIDTheme/reset.min.css" rel="styleSheet" type="text/css"/>
<link href="/FSE_WIDTheme/themes/./html/FSE_WIDTheme/WIDConsumption_Styles_Mozilla.min.css" rel="styleSheet" type="text/css"/>
<link href="/FSE_WIDTheme/themes/./html/FSE_WIDTheme/print.min.css" media="print" rel="stylesheet" title="Printer-Friendly Style" type="text/css"/>
<link href="

#### BeautifulSoup Cheat Sheet

In [10]:
# find all elements with a specific HTML tag
# matches all <strong></strong> elements
strong_tags = cg_soup.find_all('strong')

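# find the first element with a specific tag and specific text:
# matches the first <strong> element whose text contains 'Telephone'
# (string= is the current name for the older text= argument)
phone_strong = cg_soup.find('strong', string=re.compile('Telephone'))

# find the next sibling string after the Telephone label that contains an area code
# note: the parentheses are a regex group, so this matches any three consecutive digits
phone_number = phone_strong.find_next_sibling(string=re.compile('([0-9]{3}).*')).strip()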

# find the <strong> label that precedes the Telephone label, then
# collect the text siblings that follow it to scrape the hours
hours_strong = phone_strong.find_previous_sibling('strong')
hours_str = ""
for line in hours_strong.find_next_siblings(string=re.compile('[a-z]')):
    hours_str = hours_str + '\n' + line.strip()

# find elements by an attribute
# matches all elements <div class='navleft'></div>
navleft_divs = cg_soup.find_all('div', {'class': 'navleft'})

# find elements by relation to each other
# next_sibling, previous_sibling
navleft_p = navleft_divs[0].p
navleft_p.find_next_sibling()  # form 
navleft_p.find_next_sibling().button.text  # 'Go'

# iterate through strings in the soup
for str_nav in navleft_p.find_next_sibling().stripped_strings:
    print(str_nav)


Search form
Search website
Go
Site Map
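
As a rough illustration of why `stripped_strings` is used in the loop above, the sketch below compares it with `.text` on a small fragment. The markup is hypothetical and the names `raw_text` and `clean_strings` are assumptions of this example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, invented for illustration
html = """
<form>
  <label> Search form </label>
  <button> Go </button>
</form>
"""

soup = BeautifulSoup(html, 'html.parser')
form = soup.form

# .text concatenates every string, including layout whitespace
raw_text = form.text

# stripped_strings yields each non-blank string with surrounding
# whitespace removed, skipping whitespace-only strings
clean_strings = list(form.stripped_strings)

print(repr(raw_text))
print(clean_strings)  # ['Search form', 'Go']
```

Iterating over `stripped_strings` avoids cleaning up indentation and newlines by hand after the fact.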


#### Let's take a look at the [Clear Lake Campground Website](http://www.fs.usda.gov/recarea/mthood/null/recarea/?recid=53058&actid=82) - Location

#### Current Conditions

#### Web Scraping Summary

Start by locating the information you want, then investigate nearby elements.
  
Look for:  
<li>Unique identifiers - text, tag attributes </li>
<li>Relative positioning with respect to other elements - siblings, children </li>

When scraping multiple pages based on a single template, keep generality in mind: an element present on one page may be missing on another.
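
One way to keep a scraper general, sketched under assumptions: the helper `scrape_phone` and both page snippets below are hypothetical and only illustrate the idea. Since `find` returns `None` when nothing matches, the helper checks each lookup before using it, so a page without the label yields `None` instead of raising an error.

```python
import re
from bs4 import BeautifulSoup

def scrape_phone(page_source):
    """Return the text following a 'Telephone' label, or None if absent."""
    soup = BeautifulSoup(page_source, 'html.parser')
    label = soup.find('strong', string=re.compile('Telephone'))
    if label is None:
        # this page has no Telephone label
        return None
    # first sibling string after the label that contains three digits
    number = label.find_next_sibling(string=re.compile(r'\d{3}'))
    return number.strip() if number else None

# Hypothetical pages sharing one template, invented for illustration
pages = {
    'with_phone': '<p><strong>Telephone:</strong> (555) 010-0123<br/></p>',
    'without_phone': '<p><strong>Hours:</strong> dawn to dusk</p>',
}

results = {name: scrape_phone(src) for name, src in pages.items()}
print(results)
```

Returning `None` for missing fields lets the calling loop continue through every page and decide later how to treat the gaps.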

BeautifulSoup methods covered:  
<li>find, find_all</li>
<li>find_next_sibling(s), find_previous_sibling(s)</li>
<li>stripped_strings</li>
<li>filtering by tags and text</li>
