## Notebook 8.2 HTML-soup

Not all data online is conveniently accessible through a REST API, and in those cases we'll need to fetch and parse the HTML text directly. This is more difficult and tedious than using a REST API, since we'll need to look at the HTML representation of the webpage and understand it well enough to find the part that we're interested in. There are several Python packages for parsing HTML and making sense of it, and for some terrible reason the best of them has the worst name: `beautifulsoup4`. Install this package with conda and import it to continue.

Then read through the "Quick Start" guide in the Beautiful Soup [documentation](https://beautiful-soup-4.readthedocs.io/en/latest/).


In [4]:
# conda install beautifulsoup4

In [12]:
import requests
from bs4 import BeautifulSoup

### What is HTML?

HTML is the language of the web. It stands for HyperText Markup Language. In fact, the Markdown language that we have been using to write text in our notebooks is a lightweight shorthand for HTML: it is easier to write, and it gets rendered into HTML when we execute the cell. The two are not identical, but most Markdown renderers (including the one in Jupyter) also accept raw HTML tags, so the two can be mixed in many places.

To get a firm grasp on how HTML works, you can complete the tutorial at https://www.w3schools.com/html/html_intro.asp. This is not entirely necessary for us to proceed, but it's useful reading if you're interested. The main thing to know, however, is that HTML is a hierarchical structure (things nested within things) and that each of those things is labelled by a type of tag. For example, in the link above you can see that a webpage has an `<html>` tag, a `<head>` tag, and a `<body>` tag. These are simply nested containers for writing text into. The entire web is simply a bunch of text, with design instructions laid on top of it.
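
To make the "things nested within things" idea concrete, here is a minimal sketch that prints the tag hierarchy of a tiny page. The page text and the `OutlinePrinter` class are made up for illustration, and the sketch uses Python's built-in `html.parser` module rather than Beautiful Soup, only so that it runs without installing anything.

```python
from html.parser import HTMLParser

# A tiny, made-up page used only for illustration.
page = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>A heading</h1>
    <p>Some text with a <a href="https://example.com">link</a>.</p>
  </body>
</html>
"""


class OutlinePrinter(HTMLParser):
    """Record each opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # indent by two spaces per level of nesting
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # a closing tag moves us back up one level
        self.depth -= 1


printer = OutlinePrinter()
printer.feed(page)
print("\n".join(printer.lines))
```

The indentation of the printed outline mirrors the nesting: `title` sits inside `head`, and the `a` (link) tag sits inside `p`, which sits inside `body`.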

### Let's take a look at some HTML
Open the URL below in either Chrome or Firefox. In either of these browsers, right-click on any item in the page and select "Inspect" from the dropdown menu. This should open a panel that shows the HTML element corresponding to the item you clicked. Look at the name of the tag and its attributes. For example, if you clicked on a student's email address you would see `<div class="student-email">` above that element. This is a specific element for storing email addresses. Now that we know that, if we wanted to get the email addresses of all students in the department, we could parse the HTML of the webpage and pull out just the 'student-email' elements. So let's give that a try using beautifulsoup.

In [13]:
baseurl = "http://e3b.columbia.edu/students/current/phd/"

#### get the HTML using requests

In [16]:
# get the full page HTML using requests, print the first 500 characters
response = requests.get(baseurl)
response.text[:500]


'<!DOCTYPE html>\n<html lang="en-US" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#">\n<head>\n    <meta charset="UTF-8" />\n    <title>Current - Ph.D. - Columbia University Ecology, Evolution and Environmental Biology Department</title>\n\t<link rel="shortcut icon" type="image/x-icon" href="http://e3b.columbia.edu/wp-content/themes/columbia-e3b/assets/ico/favicon.ico?v=2">\n\t<link rel="icon" type="image/ico" href="http://e3b.columbia.edu/wp-content/themes/columbia-e3b/assets/ico/favicon.ico>v=2'

#### parse the HTML using bs4
[Documentation](https://beautiful-soup-4.readthedocs.io/en/latest/)  

Here we can use the `.find()` or `.find_all()` methods to search for particular tags (anything within a "< >") as well as for the class or id attributes of those tags. `.find()` returns only the first match, while `.find_all()` returns a list of every match. Here is another pretty good guide with further tips on using bs4: https://www.dataquest.io/blog/web-scraping-tutorial-python/.
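
As a minimal sketch of the difference between the two methods, the example below searches a small made-up HTML snippet (not the real department page; the names and addresses are placeholders). It assumes `beautifulsoup4` is installed and uses the `"html.parser"` backend that ships with Python.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML snippet used only for illustration.
snippet = """
<div class="student-name">Doe, Jane</div>
<div class="student-email"><a href="mailto:jd0000@example.edu">jd0000@example.edu</a></div>
<div class="student-name">Roe, Richard</div>
<div class="student-email"><a href="mailto:rr0000@example.edu">rr0000@example.edu</a></div>
"""

# "html.parser" is built into Python, so no extra parser is required
demo = BeautifulSoup(snippet, "html.parser")

# .find() returns the first matching element
first_name = demo.find("div", class_="student-name")
print(first_name.get_text())

# .find_all() returns a list of every matching element;
# the attribute filter can also be given as a dict
all_emails = demo.find_all("div", {"class": "student-email"})
print(len(all_emails))

# with no matches, .find() returns None and .find_all() returns an empty list
missing_one = demo.find("div", class_="student-phone")
missing_all = demo.find_all("div", class_="student-phone")
print(missing_one, missing_all)
```

Because `.find()` can return `None`, it is worth checking the result before calling methods on it, which is what the `if atag:` test in the loop over emails does.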

In [28]:
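# create a BeautifulSoup object from the HTML text of the response
# (the "html5lib" parser is a separate package that must also be installed)
soup = BeautifulSoup(response.text, "html5lib")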

In [98]:
# find all div elements with class='student-email'
# (the attribute filter is a dict, {"class": ...}, not a set)
emails = soup.find_all("div", {"class": "student-email"})


In [99]:
for email in emails:
    atag = email.find('a')
    if atag:
        print(atag.contents[0])

tb2583@columbia.edu
yc2975@columbia.edu
pc2796@columbia.edu 
bdc2120@columbia.edu
lac2208@columbia.edu
hf2306@columbia.edu
jsh2211@columbia.edu
sah2216@columbia.edu
amh2284@columbia.edu
jej2141@columbia.edu
pak2136@columbia.edu
sk4335@columbia.edu
sk4220@columbia,edu
pfm2119@columbia.edu
an2601@columbia.edu
arp2195@columbia.edu
awq2101@columbia.edu
vr2352@columbia.edu
prp2123@columbia.edu
lr2767@columbia.edu
scs2204@columbia.edu
sss2254@columbia.edu
ss4812@columbia,edu
mqt2101@columbia.edu
bnt2111 @columbia.edu
jc4055@columbia.edu
mv2640@columbia.edu
nkw2113@columbia.edu
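
Scraped text is often messy: some of the addresses printed above carry stray spaces. One possible cleanup, sketched below on made-up placeholder addresses, is to remove all whitespace from the link text. It deliberately does not try to repair other typos (such as a comma in place of a dot), since guessing at those could silently produce wrong addresses.

```python
from bs4 import BeautifulSoup

# Made-up snippet imitating the whitespace problems seen in the real output.
snippet = """
<div class="student-email"><a href="#">ab1234@example.edu </a></div>
<div class="student-email"><a href="#">cd5678 @example.edu</a></div>
<div class="student-email"></div>
"""
demo = BeautifulSoup(snippet, "html.parser")

cleaned = []
for div in demo.find_all("div", class_="student-email"):
    atag = div.find("a")
    if atag is None:
        # skip entries that have no link at all
        continue
    # split() breaks the text on any whitespace; joining the pieces removes it
    cleaned.append("".join(atag.get_text().split()))

print(cleaned)
```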


### A little more advanced
You can see when inspecting the HTML that right above the `student-email` elements there are also elements with the classes `student-name` and `student-program`. Let's collect the names and emails into a dictionary. These other elements are even simpler to parse than emails, since they contain plain text instead of a link (`<a>` tag).

In [96]:
# find all div elements with class='student-email' and class='student-name'
emails = soup.find_all("div", {"class": "student-email"})
names = soup.find_all("div", {"class": "student-name"})


In [102]:
students = {}

for name, email in zip(names, emails):
    
    name = name.contents[0].strip()
    atag = email.find("a")
    if name and atag:
        students[name] = atag.contents[0]
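
One caveat with `zip(names, emails)`: it pairs items purely by position, so if any student is missing an email element, every later name gets paired with the wrong address. A more defensive sketch is to loop over a container that holds both elements for one student. The `<div class="student">` wrapper below is an assumption made for illustration (check the real page with "Inspect" to see what the enclosing element is actually called), and the snippet and addresses are made up.

```python
from bs4 import BeautifulSoup

# Made-up snippet; the wrapping <div class="student"> is hypothetical.
snippet = """
<div class="student">
  <div class="student-name">Doe, Jane</div>
  <div class="student-email"><a href="#">jd0000@example.edu</a></div>
</div>
<div class="student">
  <div class="student-name">Roe, Richard</div>
</div>
<div class="student">
  <div class="student-name">Poe, Sam</div>
  <div class="student-email"><a href="#">sp0000@example.edu</a></div>
</div>
"""
demo = BeautifulSoup(snippet, "html.parser")

# positional pairing: the missing email shifts the later pairs
naive = {
    n.get_text(strip=True): e.get_text(strip=True)
    for n, e in zip(
        demo.find_all("div", class_="student-name"),
        demo.find_all("div", class_="student-email"),
    )
}

# per-container pairing: name and email always come from the same student
records = {}
for block in demo.find_all("div", class_="student"):
    name_div = block.find("div", class_="student-name")
    email_div = block.find("div", class_="student-email")
    atag = email_div.find("a") if email_div is not None else None
    if name_div is not None and atag is not None:
        records[name_div.get_text(strip=True)] = atag.get_text(strip=True)

print(naive)
print(records)
```

In this toy input the positional version wrongly assigns the third student's address to the second student, while the per-container version simply skips the student with no email.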


In [103]:
students

{'Bytnerowicz, Thomas': 'tb2583@columbia.edu',
 'Cheng, Yi-Ru': 'yc2975@columbia.edu',
 'Choksi, Pooja': 'pc2796@columbia.edu ',
 'Clark, Benjamin': 'bdc2120@columbia.edu',
 'Coelho, Lais': 'lac2208@columbia.edu',
 'Fuong, Holly': 'hf2306@columbia.edu',
 'Hall, Jazlynn': 'jsh2211@columbia.edu',
 'Heilpern, Sebastian': 'sah2216@columbia.edu',
 'Huddell, Alex': 'amh2284@columbia.edu',
 'Jensen, Johanna (Jo)': 'jej2141@columbia.edu',
 'Kache, Pallavi': 'pak2136@columbia.edu',
 'Khanwilkar, Sarika': 'sk4335@columbia.edu',
 'Kou-Giesbrecht, Sian': 'sk4220@columbia,edu',
 'McKenzie, Patrick': 'pfm2119@columbia.edu',
 'Neelakantan, Amrita': 'an2601@columbia.edu',
 'Petach, Anika': 'arp2195@columbia.edu',
 'Quebbeman, Andrew': 'awq2101@columbia.edu',
 'Ramesh, Vijay': 'vr2352@columbia.edu',
 'Ribeiro Piffer, Pedro': 'prp2123@columbia.edu',
 'Rocha Moreira, Lucas': 'lr2767@columbia.edu',
 'Schmiege, Stephanie': 'scs2204@columbia.edu',
 'Shah, Shailee': 'sss2254@columbia.edu',
 'Siller, Stefanie