# Introduction to Web Scraping
Often we are interested in getting data from a website. Many modern websites expose an Application Programming Interface ([API](https://en.wikipedia.org/wiki/Application_programming_interface)), often designed in the [REST](https://en.wikipedia.org/wiki/Representational_state_transfer) style, that lets us make [HTTP](https://www.tutorialspoint.com/http/http_requests.htm) requests and retrieve structured data in the form of [JSON](https://en.wikipedia.org/wiki/JSON) or XML.
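
To make "structured data" concrete, here is a minimal sketch of what handling a JSON response looks like. The response body below is made up for illustration and does not come from any real API; with `requests`, calling `.json()` on a response performs the same parsing step.

```python
import json

# Made-up response body of the kind a JSON API might return (not real data).
response_text = '{"people": [{"name": "Ada Example", "email": "ada.example@example.edu"}]}'

# json.loads turns the JSON text into ordinary Python dicts and lists.
data = json.loads(response_text)

for person in data["people"]:
    print(person["name"], person["email"])
```

Because the data arrives already organized into named fields, no HTML parsing is needed.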

However, when there is no clear API, we may need to perform web scraping: downloading the page ourselves and extracting the data directly from its HTML.

In [None]:
import requests
from bs4 import BeautifulSoup

## Getting data using requests
In this simple example we will scrape data from the PBS faculty webpage.

In [None]:
page = requests.get("http://pbs.dartmouth.edu/people")
print(page)

Printing the response shows `<Response [200]>`. The status code `200` indicates that the GET request was successful. Now let's look at the actual content that was downloaded from the webpage.
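
Status codes other than 200 are common in practice, so it can help to know how to read them. The sketch below uses only the standard library; `describe_status` is a hypothetical helper written for this example, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    # HTTPStatus raises ValueError for codes the standard library does not know.
    status = HTTPStatus(code)
    if 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirect"
    elif 400 <= code < 500:
        outcome = "client error"
    elif code >= 500:
        outcome = "server error"
    else:
        outcome = "informational"
    return f"{code} {status.phrase} ({outcome})"


for code in (200, 404, 500):
    print(describe_status(code))
```

With the response object from above, `describe_status(page.status_code)` would describe the actual result, and `page.raise_for_status()` raises an exception for 4xx and 5xx responses.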

In [None]:
print(page.content)

Here you can see that we have downloaded all of the data from the PBS faculty page and that it is in the form of HTML.
HTML is a markup language that tells a browser how to lay out content. An HTML document is made up of elements, which are marked by tags. Most elements have an opening tag and a matching closing tag, with the content in between. Here are a few examples:

 - `<a></a>` - a hyperlink
 - `<p></p>` - a paragraph
 - `<div></div>` - a division, or area, of the page
 - `<b></b>` - bolds any text inside
 - `<i></i>` - italicizes any text inside
 - `<h1></h1>` - a top-level heading
 - `<table></table>` - a table
 - `<ol></ol>` - an ordered list
 - `<ul></ul>` - an unordered list
 - `<li></li>` - a list item
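
To see how opening and closing tags nest, here is a small sketch using Python's built-in `html.parser` module. The HTML snippet and the `TagLogger` class are invented for this illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every opening tag, closing tag, and piece of text the parser sees."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.events.append(("text", data.strip()))


# A made-up unordered list with two items, one containing bold text.
snippet = "<ul><li><b>First</b> item</li><li>Second item</li></ul>"

logger = TagLogger()
logger.feed(snippet)
logger.close()

for kind, value in logger.events:
    print(kind, value)
```

Each `start` event is eventually matched by an `end` event for the same tag, which is the nesting structure that parsing libraries rely on.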


## Parsing HTML using Beautiful Soup
There are many libraries that can be helpful for quickly parsing structured text such as HTML.  We will be using Beautiful Soup as an example.

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

Here we are going to find the unordered list (`ul`) with the id 'faculty-container'. We are then going to look for any nested tags that use the 'h4' heading tag. This should give us all of the elements containing the faculty names as a list.

In [None]:
names_html = soup.find_all('ul',id='faculty-container')[0].find_all('h4')
names = [x.text for x in names_html]
print(names)
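
Because a live page can change at any time, it can help to try the same lookup on a small inline snippet. The HTML below is made up to imitate the structure described above (the names are placeholders, not real faculty), and `sample_soup` and `sample_names` are names chosen for this sketch. It uses `find`, which returns the first match or `None`, so a missing list does not raise an `IndexError`.

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating the structure described above; the people are placeholders.
sample_html = """
<ul id="faculty-container">
  <li><h4>Ada Example</h4><span class="contact">ada.example@example.edu</span></li>
  <li><h4>Grace Sample</h4><span class="contact">grace.sample@example.edu</span></li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
container = sample_soup.find("ul", id="faculty-container")
if container is None:
    sample_names = []
else:
    # get_text(strip=True) returns the tag's text without surrounding whitespace.
    sample_names = [h4.get_text(strip=True) for h4 in container.find_all("h4")]

print(sample_names)
```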

What if we wanted to get all of the faculty email addresses? On this page they are stored in `span` tags with the class 'contact'.

In [None]:
email_html = soup.find_all('ul',id='faculty-container')[0].find_all('span',{'class' : 'contact'})
email = [x.text for x in email_html]
print(email)
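
The same kind of query can also be written as a CSS selector with Beautiful Soup's `select` method. This is an alternative sketch rather than the approach used above, and the HTML is again made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the second span sits outside the faculty list on purpose.
sample_html = """
<ul id="faculty-container">
  <li><h4>Ada Example</h4><span class="contact">ada.example@example.edu</span></li>
  <li><h4>Grace Sample</h4><span class="contact">grace.sample@example.edu</span></li>
</ul>
<span class="contact">office@example.edu</span>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# "ul#faculty-container span.contact" means: span tags with class "contact"
# that are nested anywhere inside the ul whose id is "faculty-container".
contact_spans = sample_soup.select("ul#faculty-container span.contact")
sample_emails = [span.get_text(strip=True) for span in contact_spans]

print(sample_emails)
```

The span outside the list is not returned, because the selector only matches elements nested inside the `ul`.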

## Parsing string data
What if we wanted to grab the username (the part before the '@') from each email address in the list?

In [None]:
print([x.split('@')[0] for x in email])
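
Scraped text is often messy: it may include extra whitespace, or entries that are not email addresses at all. In that case `x.split('@')[0]` returns the whole string unchanged. A slightly more defensive sketch follows; `username_from_email` is a hypothetical helper and the sample strings are made up.

```python
def username_from_email(address):
    """Return the part of an email address before the '@', or None if it is missing."""
    # partition() splits on the first '@' and returns (before, separator, after).
    local_part, separator, _domain = address.strip().partition("@")
    if not separator or not local_part:
        return None
    return local_part


# Made-up inputs: one with stray whitespace, one clean, one that is not an email.
sample_emails = ["  ada.example@example.edu ", "grace.sample@example.edu", "no email listed"]
usernames = [username_from_email(e) for e in sample_emails]

print(usernames)  # ['ada.example', 'grace.sample', None]
```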

## Interacting with web page using Selenium