## Beautiful Soup

Adapted from http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html

Beautiful Soup is a handy Python library for parsing messy HTML and other semi-structured text data. Its documentation is here:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

We can get a feel for it with a few examples. On the page linked below, the foundation names are set up as section headings. Let's see how we could scrape them out of the HTML with Beautiful Soup.

http://www.climateworks.org/about-us/partners/foundation-partners/

In [None]:
from IPython.display import HTML
# Embed the live page in the notebook so we can look at it
HTML('<iframe src="http://www.climateworks.org/about-us/partners/foundation-partners/" width="700" height="500"></iframe>')

In [None]:
import requests as rq

# Download the raw HTML of the page
r = rq.get("http://www.climateworks.org/about-us/partners/foundation-partners/")
r.raise_for_status()  # stop early if the request failed
print(r.text)

HTML has a tree-like structure, which can be represented as a set of nested objects:

In [None]:
from IPython.display import Image
Image('http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png')
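
To make the nesting concrete, here is a minimal sketch that prints an indented outline of a small document using only Python's standard-library `html.parser`. The sample markup and the `OutlineParser` class are invented for illustration, and the sketch assumes every opening tag has a matching closing tag (a void element such as `<br>` would throw the depth count off).

```python
from html.parser import HTMLParser

# Hypothetical sample document, not taken from the ClimateWorks page.
SAMPLE_HTML = (
    "<html><head><title>Demo</title></head>"
    "<body><h3>Heading</h3><p>Text <a href='/x'>link</a></p></body></html>"
)


class OutlineParser(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Store the tag at the current depth, then step one level down.
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag moves us back up one level.
        self.depth -= 1


parser = OutlineParser()
parser.feed(SAMPLE_HTML)
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

Each extra level of indentation in the output corresponds to one level of nesting in the tree.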

Browsers wrap this structure up as an object, the DOM (Document Object Model):

In [None]:
Image('http://www.cs.toronto.edu/~shiva/cscb07/img/dom/treeStructure.png')

Beautiful Soup does something similar. Let's try loading our sample HTML page as an object. Once it's loaded, we can access the parts of the page as attributes of the object it generates:

In [None]:
from bs4 import BeautifulSoup
import requests as rq

r = rq.get("http://www.climateworks.org/about-us/partners/foundation-partners/")
# Parse the downloaded HTML with Python's built-in parser
soup = BeautifulSoup(r.text, "html.parser")
print(soup.title)
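
If the live page is unavailable, the same idea can be tried offline. The sketch below parses a small HTML string and walks down the tree with dotted attribute access. The markup, the `example.org` URL, and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used so the example runs without network access.
sample_html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h3>First Foundation</h3>
    <p>Visit <a href="https://example.org/first">First Foundation</a>.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

print(sample_soup.title)           # the whole <title> element
print(sample_soup.title.string)    # only the text inside it
print(sample_soup.body.h3.string)  # dotted access walks down the tree
print(sample_soup.a["href"])       # attributes are read like dictionary keys
```

Dotted access such as `sample_soup.a` always returns the first matching element in document order, so it is most useful for elements that occur once.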

Now we can use Beautiful Soup's search methods, such as `find` and `find_all`, to locate things in the document. Try looking around in the document and finding elements to pull out, for example:

In [None]:
print(soup.find_all("li", id="menu-item-1118"))
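
`find_all` accepts several kinds of filters. The sketch below shows a few common ones on a made-up menu; the ids and class names are invented and may not resemble the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical menu markup; the ids and classes are invented for illustration.
menu_html = """
<ul>
  <li id="menu-item-1" class="menu-item current">Home</li>
  <li id="menu-item-2" class="menu-item">About</li>
  <li id="menu-item-3" class="menu-item">Contact</li>
</ul>
"""
menu_soup = BeautifulSoup(menu_html, "html.parser")

all_items = menu_soup.find_all("li")                     # every <li>
by_id = menu_soup.find_all("li", id="menu-item-2")       # match on an attribute
by_class = menu_soup.find_all("li", class_="current")    # class_ avoids the Python keyword
first_two = menu_soup.find_all("li", limit=2)            # stop after two matches
missing = menu_soup.find_all("li", id="does-not-exist")  # no match gives an empty result

print(len(all_items), len(by_id), len(by_class), len(first_two), len(missing))
```

Because `find_all` returns an empty result rather than raising an error, it is safe to loop over its output even when nothing matches.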

Each item can be accessed as an object, and we can use the object's properties to reach its attributes and child elements, for example:

In [None]:
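tag = soup.find("li", id="menu-item-1118")  # first match, or None if nothing matches
print(tag.name)      # element name, here "li"
print(tag["class"])  # list of CSS classes
print(tag["id"])     # value of the id attribute
print(tag.text)      # all text inside the element
print(tag.a)         # first <a> element nested inside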
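
Two details are easy to trip over here: `find` returns `None` when nothing matches, and indexing a tag with a missing attribute raises `KeyError`. The sketch below, on invented markup, shows the more forgiving alternatives.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the id and classes are invented for illustration.
item_html = '<li id="menu-item-1" class="menu-item current"> <a href="/about">About us</a> </li>'
item_soup = BeautifulSoup(item_html, "html.parser")

item = item_soup.find("li", id="menu-item-1")
print(item["class"])              # multi-valued attributes come back as a list
print(item.get("title"))          # .get() returns None instead of raising KeyError
print(item.get_text(strip=True))  # text with surrounding whitespace removed
print(item.a["href"])             # first <a> nested inside the tag

absent = item_soup.find("li", id="no-such-id")
print(absent)                     # find() returns None when nothing matches
```

Checking for `None` before touching a tag's properties avoids an `AttributeError` when the page layout changes.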

We can also use this interface to search the entire document. For example, to print the target of every link:

In [None]:
# href=True skips <a> elements without an href, which would raise KeyError
for a in soup.find_all("a", href=True):
    print(a["href"])
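
Links scraped this way are often relative, and some `<a>` elements have no `href` at all. As a minimal sketch, assuming a made-up base URL and markup, the standard library's `urljoin` can turn each `href` into an absolute URL:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and markup, invented for illustration.
base_url = "http://example.org/about/partners/"
links_html = """
<a href="/contact">Contact</a>
<a href="team.html">Team</a>
<a href="https://other.example.com/">External</a>
<a name="top">Anchor without href</a>
"""
links_soup = BeautifulSoup(links_html, "html.parser")

# href=True keeps only anchors that actually carry an href attribute.
absolute_links = [
    urljoin(base_url, a["href"])
    for a in links_soup.find_all("a", href=True)
]
print(absolute_links)
```

When scraping a real page, the base URL would normally be the address the HTML was downloaded from.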

#### Question: can you pull out the foundation names and website URLs for the ClimateWorks core funders, and put them in a list of two-element tuples?

#### A possible answer:

In [None]:
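# Collect the heading text, which holds the foundation names
names = {h.get_text(strip=True) for h in soup.find_all("h3")}

# Keep every link whose text matches one of the foundation names
results = []
for a in soup.find_all("a", href=True):
    link_text = a.get_text(strip=True)
    if link_text in names:
        results.append((link_text, a["href"]))
print(results)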
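
Matching on link text only works when each link's text is exactly the foundation name. Another possible approach, sketched here on invented markup, pairs each heading with the first link that follows it in the document using `find_next`. It assumes every heading is followed by its own link; if a heading had no link, this would wrongly pick up the next foundation's link.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may be structured differently.
funders_html = """
<h3>Alpha Foundation</h3>
<p>Funds clean energy. <a href="https://alpha.example.org">Website</a></p>
<h3>Beta Fund</h3>
<p><a href="https://beta.example.org">Visit Beta</a></p>
"""
funders_soup = BeautifulSoup(funders_html, "html.parser")

funders = []
for heading in funders_soup.find_all("h3"):
    # First <a> with an href that appears after this heading in the document.
    link = heading.find_next("a", href=True)
    if link is not None:
        funders.append((heading.get_text(strip=True), link["href"]))
print(funders)
```

Which approach fits better depends on how the target page is laid out, so it is worth inspecting the HTML around a couple of headings first.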