Contents
---
- [requests](#requests)
- [BeautifulSoup](#scraping)
- [JavaScript Console](#javascript)
- [Using pandas to scrape the web](#http)

This module is edited from Charles Severance's Python for Informatics book.


In this chapter, we'll learn how to read information from the internet instead of from files. 

Retrieving web pages using the requests package
---
<a class="anchor" id="requests"></a>

The requests library allows you to send HTTP requests using Python. Using requests, you can treat a web page much like a file. You simply indicate which web page you would like to retrieve and requests handles all of the HTTP protocol and header details.
The code below reads the romeo.txt file from the website http://data.pr4e.org/romeo.txt :

In [3]:
import requests

url = 'http://data.pr4e.org/romeo.txt'
response = requests.get(url)
print(response.text)

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief



So response.text contains the text from the web page. What about the response itself? It also carries the status code that the website sent back. A status code of 200 means that the server received your request and answered it successfully:

In [51]:
print(response.status_code)

200


There are tons of different codes you can receive. You've probably received a 403 Forbidden error or a 404 Not Found error when you've gone to a website. All the codes are here:

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
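
Python's standard library also knows the standard codes. The sketch below is not part of the original notebook: it uses http.HTTPStatus to look up the short phrase that goes with a few codes, and describe_status is a hypothetical helper name chosen for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code followed by its standard phrase, e.g. '404 Not Found'."""
    status = HTTPStatus(code)  # raises ValueError for a code it does not know
    return f"{status.value} {status.phrase}"


for code in (200, 403, 404):
    print(describe_status(code))
```

You could pass response.status_code to the same helper after a real request.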

What if you wanted to print all of the header details that come along with sending an HTTP request? Type:

In [25]:
print(response.headers)

{'Date': 'Fri, 09 Mar 2018 02:38:30 GMT', 'Server': 'Apache/2.4.7 (Ubuntu)', 'Last-Modified': 'Sat, 13 May 2017 11:22:22 GMT', 'ETag': '"a7-54f6609245537"', 'Accept-Ranges': 'bytes', 'Content-Length': '167', 'Cache-Control': 'max-age=0, no-cache, no-store, must-revalidate', 'Pragma': 'no-cache', 'Expires': 'Wed, 11 Jan 1984 05:00:00 GMT', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/plain'}
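
The headers print like a dictionary, and you can look up a single header by name. A minimal sketch, assuming only that response.headers is the case-insensitive dictionary type that requests provides (requests.structures.CaseInsensitiveDict). The two values below are copied from the output above rather than fetched from the web.

```python
from requests.structures import CaseInsensitiveDict

# Stand-in for response.headers, using two values copied from the output above
headers = CaseInsensitiveDict({
    'Content-Type': 'text/plain',
    'Content-Length': '167',
})

# Header names can be written in any mix of upper and lower case
content_type = headers['content-type']
length = int(headers.get('CONTENT-LENGTH', 0))

print(content_type)
print(length)
```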


The Romeo text is short but many web pages will contain a lot of info. What if instead of printing all of the text at once you want to print it line by line? We'll need to break up each line by the newline character: 

In [4]:
import requests

url = 'http://data.pr4e.org/romeo.txt'
sentences = requests.get(url).text.split('\n')

for sentence in sentences:
    print(sentence)

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
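
The blank lines at the end of the output above are probably there because the file ends with a newline, so split('\n') returns one extra empty string. Here is a small sketch with a made-up string (no web request) that compares split('\n') with splitlines():

```python
# A made-up two-line text that ends with a newline, like many text files do
text = "But soft\nIt is the east\n"

with_split = text.split('\n')       # keeps an empty string after the last newline
with_splitlines = text.splitlines() # drops it

print(with_split)
print(with_splitlines)
```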



Is there a more Pythonic way of printing each line? Yes. Use the following:

In [1]:
import requests
url = 'http://data.pr4e.org/romeo.txt'
r = requests.get(url)

for line in r.iter_lines():
    print(line)

b'But soft what light through yonder window breaks'
b'It is the east and Juliet is the sun'
b'Arise fair sun and kill the envious moon'
b'Who is already sick and pale with grief'


What do those b's stand for? Bytes. iter_lines() gives you each line as a bytes object rather than a string. To convert bytes into the strings that you are used to, use decode:

In [6]:
import requests

url = 'http://data.pr4e.org/romeo.txt'
r = requests.get(url)

for line in r.iter_lines():
    print(line.decode())

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
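
To see what decode is doing without going to the web, here is a short sketch that converts between bytes and strings by hand. The word 'café' is just an illustration of a character that takes more than one byte in UTF-8.

```python
raw = b'But soft what light through yonder window breaks'

# decode() turns bytes into a str; UTF-8 is the default encoding
text = raw.decode('utf-8')
print(type(raw).__name__, type(text).__name__)

# encode() goes the other way, from str back to bytes
accented = 'café'.encode('utf-8')
print(accented)
print(accented.decode('utf-8'))
```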


What if we want to print a list of the most frequent words in the text? We've done this before in previous units using a dictionary and then a sorted list of tuples:

In [7]:
import requests

url = 'http://data.pr4e.org/romeo.txt'
r = requests.get(url)

counts = {}

for line in r.iter_lines():
    words = line.decode().split() 
    for word in words:
        counts[word] = counts.get(word,0) + 1 

counts_list=[]
for key, val in counts.items():
    counts_list.append((val,key))

counts_list.sort(reverse = True)
print(counts_list)

[(3, 'the'), (3, 'is'), (3, 'and'), (2, 'sun'), (1, 'yonder'), (1, 'with'), (1, 'window'), (1, 'what'), (1, 'through'), (1, 'soft'), (1, 'sick'), (1, 'pale'), (1, 'moon'), (1, 'light'), (1, 'kill'), (1, 'grief'), (1, 'fair'), (1, 'envious'), (1, 'east'), (1, 'breaks'), (1, 'already'), (1, 'Who'), (1, 'Juliet'), (1, 'It'), (1, 'But'), (1, 'Arise')]
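
As an alternative sketch (not the approach used above), the standard library's collections.Counter does the counting and sorting for us. The text is pasted in as a string here so the example runs without a web request. Words with the same count may come out in a different order than in the sorted list of tuples above.

```python
from collections import Counter

text = """But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
"""

# split() with no arguments splits on any whitespace, including newlines
counts = Counter(text.split())

# most_common(n) returns the n most frequent (word, count) pairs
top = counts.most_common(4)
print(top)
```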


Here's another example. Suppose we want to get information from the OES Faculty/Staff web page. Read the output carefully below to see where the faculty names are located. Where is Dennis Chang located? (Hint: you may want to search for "Dennis Chang" in your web browser since the output is large and messy.)

In [8]:
url = 'https://www.oes.edu/academics/upper-school/faculty-staff'
print(requests.get(url).text)


<!DOCTYPE html>
<!--[if lte IE 8]>         <html lang="en-US" class="lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-US"> <!--<![endif]-->
<head>
	<meta charset="utf-8">
	
<script type="text/javascript">window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"87d38be11c","applicationID":"11580779","transactionName":"JVgLEhBaXg4BSxgTWQFSFkkKVFwGCFxoEFQTUA==","queueTime":0,"applicationTime":767,"agent":""}</script>
<script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={xpid:"UwMPVVVUGwIBUVlSAAYO"};window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o||n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<e.length;o++)r(e[o]);return r}({1:[function(t,n,e){function r(t){try{s.console&&console.log(t)}catch(n){}}var o,i=t("ee"),a=t(15),s={};try{o=loc

The faculty names seem to be located below a line that contains the term "FullName." We can make a loop to print out just the names by creating a boolean variable called "fullname" that keeps track of whether FullName is in the line or not. If it is, we'll print out the next line:

In [9]:
url = 'https://www.oes.edu/academics/upper-school/faculty-staff'
r = requests.get(url)

fullname = False
for line in r.iter_lines():
    line = line.decode()
    if fullname == True:
        print(line)
        fullname = False
    if 'FullName' in line:
        fullname = True

				Asha Appel 
				Autumn Apperson 
				Susan Bankowski 
				Brad Baugher 
				Carmen Boyle 
				Peter Buonincontro 
				Eduard Cecere 
				Dennis Chang 
				Chiman Chen 
				Corbet Clark 
				Jenny Cleveland 
				Bevin Daglen 
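
Here is the same "remember what the previous line was" idea as a sketch you can run without the web. The list of lines is a hand-made stand-in for r.iter_lines() after decoding, and names_after_marker is a hypothetical function name.

```python
# Hand-made stand-in for the decoded lines of the faculty page
lines = [
    '<h3 class="fsFullName">',
    '\t\t\t\tCarmen Boyle ',
    '</h3>',
    '<div class="fsTitles">',
    '<h3 class="fsFullName">',
    '\t\t\t\tDennis Chang ',
    '</h3>',
]


def names_after_marker(lines, marker='FullName'):
    """Collect the line that follows each line containing the marker."""
    names = []
    grab_next = False  # True when the previous line contained the marker
    for line in lines:
        if grab_next:
            names.append(line.strip())
            grab_next = False
        if marker in line:
            grab_next = True
    return names


print(names_after_marker(lines))
```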


### Exercise - requests 1
Use the requests package to print the text and the response code from the website

http://www.dr-chuck.com/page1.htm

In [10]:
#insert requests 1

### Exercise - requests 2
Print out just the "http://www.dr-chuck.com/page2.htm" link from the above web page.

In [11]:
#insert 2

### Exercise - requests 3
Create a list of tuples containing each letter in the web page response and how often it appears, sorted in descending order of frequency.

In [12]:
#insert requests 3

### Exercise - requests 4
Read the text from the webpage www.wunderground.com  . What line is "Sailing Weather" on? Hint: use enumerate to keep track of line numbers.

In [13]:
#insert requests 4

Web Scraping using BeautifulSoup
---
<a class="anchor" id="scraping"></a>

One of the common uses of the requests capability in Python is to scrape the web. Web scraping is when we write a program that pretends to be a web browser and retrieves pages, then examines the data in those pages looking for patterns.

As an example, a search engine such as Google will look at the source of one web page and extract the links to other pages and retrieve those pages, extracting links, and so on. Using this technique, Google spiders its way through nearly all of the pages on the web.


Google also uses the frequency of links from pages it finds to a particular page as one measure of how “important” a page is and how high the page should appear in its search results. 

BeautifulSoup is one Python package that helps us to scrape the web. BeautifulSoup tolerates highly flawed HTML and still lets you easily extract the data you need.

To download it, type "pip install beautifulsoup4" into your terminal. (You can find your terminal under Applications - Utilities on a Mac). If that doesn't work, download it directly from this website: https://www.crummy.com/software/BeautifulSoup/

Okay, let's first view what information is contained on the following page:

In [67]:
import requests
url = 'http://www.dr-chuck.com/page1.htm'
r = requests.get(url)
print(r.text)

<h1>The First Page</h1>
<p>
If you like, you can switch to the 
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>



Notice that in HTML, headings are contained between "h1" tags, web links are contained within "a" (anchor) tags, and paragraphs within "p" tags. Thus, if we wanted to use BeautifulSoup to search for just the web links, we could type the following:

In [14]:
from bs4 import BeautifulSoup
import requests

url = 'http://www.dr-chuck.com/page1.htm'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")

# Retrieve all of the anchor (web) tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

http://www.dr-chuck.com/page2.htm
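
BeautifulSoup can also parse a plain string, which is handy for experimenting without making a request each time. The sketch below pastes in the page text printed earlier. It also spells out soup.find_all('a'), which does the same search as the shorthand soup('a').

```python
from bs4 import BeautifulSoup

# The page text printed earlier, pasted in as a string
html = """<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Calling soup('a') is shorthand for soup.find_all('a')
links = [tag.get('href') for tag in soup.find_all('a')]

# get_text(strip=True) returns the link text without surrounding whitespace
labels = [tag.get_text(strip=True) for tag in soup.find_all('a')]

print(links)
print(labels)
```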


If we want to get more specific with all of the different types of info stored in the tags, we can type:

In [73]:
from bs4 import BeautifulSoup
import requests

url = 'http://www.dr-chuck.com/page1.htm'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('CONTENTS:', tag.contents[0])
    print('ATTRIBUTES:', tag.attrs)

TAG: <a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>
URL: http://www.dr-chuck.com/page2.htm
CONTENTS: 
Second Page
ATTRIBUTES: {'href': 'http://www.dr-chuck.com/page2.htm'}


We notice that the entire HTML tag is stored in "tag". The URL alone can be accessed by tag.get('href', None) and the contents can be accessed by tag.contents[0].

If we wanted instead to search for the headers, we could use "h1":

In [74]:
from bs4 import BeautifulSoup
import requests

url = 'http://www.dr-chuck.com/page1.htm'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")
headers = soup('h1')
print(headers)

[<h1>The First Page</h1>]


Let's return to our OES faculty example. Look carefully at the teachers' names. Where are they stored? Inside "h3" tags:

In [79]:
import requests

url = 'https://www.oes.edu/academics/upper-school/faculty-staff'
r = requests.get(url)

for i,line in enumerate(r.iter_lines()):
    if line:
        if 620 < i < 700:  # only print part of the document, since it is
                           # so large, for easier viewing
            line = line.decode()
            print(i,line)

621 			<div class="fsPhoto"><img class="fsThumbnail" alt="Carmen Boyle " src="/uploaded/New_Site/Employee_Directory/Carmen_Boyle.jpg" /></div>
623 		<h3 class="fsFullName">
624 				Carmen Boyle 
625 		</h3>
630 			<div class="fsTitles">
631 	    <strong>Titles:</strong>
632     Upper School Spanish Teacher
633 </div>
641 			<div class="fsDepartments">
642 	    <strong>Departments:</strong>
643     World Languages
644 </div>
648 			<div class="fsEmail">
649 				<strong>Email: </strong>
650 				<div id="fsEmail-3623-2488-directory">
651 					<script type="text/javascript">setTimeout(function(){ FS.util.insertEmail("fsEmail-3623-2488-directory", "ude.seo", "celyob", false); }, 20);</script>
652 				</div>
653 			</div>
655 			<div class="fsPhones">
656 				<strong>Phone Numbers:</strong><br>
657 					School:
658 					<a href="tel:503-416-9238">
659 						503-416-9238 
660 					</a><br>
661 			</div>
672 	</div>
674 			
676 	<div class="fsConstituentItem fsHasPhoto">
677 			<div class="fsP

Let's try printing out just the h3 info using BeautifulSoup:

In [4]:
import requests
from bs4 import BeautifulSoup


url = 'https://www.oes.edu/academics/upper-school/faculty-staff'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")

headers = soup('h3')
print(headers)

[<h3 class="fsFullName">
				Asha Appel 
		</h3>, <h3 class="fsFullName">
				Autumn Apperson 
		</h3>, <h3 class="fsFullName">
				Susan Bankowski 
		</h3>, <h3 class="fsFullName">
				Brad Baugher 
		</h3>, <h3 class="fsFullName">
				Carmen Boyle 
		</h3>, <h3 class="fsFullName">
				Peter Buonincontro 
		</h3>, <h3 class="fsFullName">
				Eduard Cecere 
		</h3>, <h3 class="fsFullName">
				Dennis Chang 
		</h3>, <h3 class="fsFullName">
				Chiman Chen 
		</h3>, <h3 class="fsFullName">
				Corbet Clark 
		</h3>, <h3 class="fsFullName">
				Jenny Cleveland 
		</h3>, <h3 class="fsFullName">
				Bevin Daglen 
		</h3>]


That's still not quite as pretty as we'd like, but remember that there are multiple attributes stored in the headers. In this case, we want to access its contents:

In [3]:
import requests
from bs4 import BeautifulSoup


url = 'https://www.oes.edu/academics/upper-school/faculty-staff'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")

headers = soup('h3')

for header in headers:
    print(header.contents[0].strip())

Asha Appel
Autumn Apperson
Susan Bankowski
Brad Baugher
Carmen Boyle
Peter Buonincontro
Eduard Cecere
Dennis Chang
Chiman Chen
Corbet Clark
Jenny Cleveland
Bevin Daglen


Notice that by using BeautifulSoup, we didn't need to create a loop to search for "FullName" as we did at the beginning of this unit.

Suppose we wanted to find out which department each faculty member works in. First we notice that the departments are inside "div" tags whose class is "fsDepartments":

In [2]:
import requests

url = 'https://www.oes.edu/academics/upper-school/faculty-staff'
r = requests.get(url)

for i,line in enumerate(r.iter_lines()):
    if line:
        if 620 < i < 700: #once again, this line is only for ease of viewing
            line = line.decode()
            print(i,line)

621 			<div class="fsPhoto"><img class="fsThumbnail" alt="Carmen Boyle " src="/uploaded/New_Site/Employee_Directory/Carmen_Boyle.jpg" /></div>
623 		<h3 class="fsFullName">
624 				Carmen Boyle 
625 		</h3>
630 			<div class="fsTitles">
631 	    <strong>Titles:</strong>
632     Upper School Spanish Teacher
633 </div>
641 			<div class="fsDepartments">
642 	    <strong>Departments:</strong>
643     World Languages
644 </div>
648 			<div class="fsEmail">
649 				<strong>Email: </strong>
650 				<div id="fsEmail-3623-2488-directory">
651 					<script type="text/javascript">setTimeout(function(){ FS.util.insertEmail("fsEmail-3623-2488-directory", "ude.seo", "celyob", false); }, 20);</script>
652 				</div>
653 			</div>
655 			<div class="fsPhones">
656 				<strong>Phone Numbers:</strong><br>
657 					School:
658 					<a href="tel:503-416-9238">
659 						503-416-9238 
660 					</a><br>
661 			</div>
672 	</div>
674 			
676 	<div class="fsConstituentItem fsHasPhoto">
677 			<div class="fsP

Let's first try printing out all of the tags labeled "div":

In [5]:
import requests
from bs4 import BeautifulSoup


url = 'https://www.oes.edu/academics/upper-school/faculty-staff'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")

divs = soup('div')
print(divs)

[<div id="fsPageWrapper">
<div id="fsMenu">
<div class=" fsMenu fsStyleAutoclear" id="fsEl_493">
<div class="fsElement fsContent close-button-container" data-use-new="true" id="fsEl_920">
<div class="fsElementContent">
<button class="drawer-trigger" href="#"></button>
</div>
</div>
<div class="fsElement fsNavigation fsList nav-main" data-use-new="true" id="fsEl_494">
<div class="fsElementContent">
<nav><ul class="fsNavLevel1"><li class="fsNavParentPage"><a href="/aboutoes">ABOUT OES</a><div class="fsNavPageInfo"><ul class="fsNavLevel2"><li><a href="/aboutoes/ataglance">OES At a Glance</a></li><li><a href="/aboutoes/welcome-from-head-of-school">Welcome From Head of School</a></li><li><a href="/aboutoes/mission-vision-identity">Mission, Vision, and Identity</a></li><li><a href="/aboutoes/history">Brief History</a></li><li class="fsNavParentPage"><a href="/aboutoes/leadership">Leadership</a><div class="fsNavPageInfo"><ul class="fsNavLevel3"><li><a href="/aboutoes/leadership/board-of-trust

That's still too much info. To narrow down our search, we can use BeautifulSoup's findAll method (newer code usually spells it find_all) to find the "div" tags whose class is "fsDepartments":

In [6]:
import requests
from bs4 import BeautifulSoup


url = 'https://www.oes.edu/academics/upper-school/faculty-staff'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")

divs = soup.findAll("div", { "class" : "fsDepartments" })

for div in divs:
    print(div)


<div class="fsDepartments">
<strong>Departments:</strong>
    Upper School Administration, Administration
</div>
<div class="fsDepartments">
<strong>Departments:</strong>
    History
</div>
<div class="fsDepartments">
<strong>Departments:</strong>
    Athletics, Lacrosse
</div>
<div class="fsDepartments">
<strong>Departments:</strong>
    Educational Technology
</div>
<div class="fsDepartments">
<strong>Departments:</strong>
    World Languages
</div>
<div class="fsDepartments">
<strong>Departments:</strong>
    Visual &amp; Performing Arts (VaPA), Boarding
</div>
<div class="fsDepartments">
<strong>Departments:</strong>
    Boarding, Educational Technology
</div>
<div class="fsDepartments">
<strong>Departments:</strong>
    Athletics, Mathematics, Soccer, Counseling &amp; Academic Support (CAST), MS-CAST
</div>
<div class="fsDepartments">
<strong>Departments:</strong>
    World Languages
</div>
<div class="fsDepartments">
<strong>Departments:</strong>
    Upper School Administration, 

That's still a bit too much being printed. If we play around, we can see that the third entry in the contents of each div tag is what we want:

In [7]:
import requests
from bs4 import BeautifulSoup


url = 'https://www.oes.edu/academics/upper-school/faculty-staff'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")

divs = soup.findAll("div", { "class" : "fsDepartments" })

for div in divs:
    print(div.contents[2].strip())

Upper School Administration, Administration
History
Athletics, Lacrosse
Educational Technology
World Languages
Visual & Performing Arts (VaPA), Boarding
Boarding, Educational Technology
Athletics, Mathematics, Soccer, Counseling & Academic Support (CAST), MS-CAST
World Languages
Upper School Administration, Religion & Philosophy
Chaplaincy, College Counseling
Science


Let's look at the paragraph tags:

In [8]:
import requests
from bs4 import BeautifulSoup


url = 'https://www.oes.edu/academics/upper-school/faculty-staff'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")

paragraphs = soup('p')
for paragraph in paragraphs:
    print(paragraph)

<p></p>
<p>Oregon Episcopal School is a college preparatory, independent school in Portland, Oregon, serving 860 students from Pre-Kindergarten through Grade 12, including 60 boarding students from around the world in Grades 9-12.</p>
<p></p>


### Exercise - BeautifulSoup 1
Write a program using BeautifulSoup to print the information contained in the paragraphs (between the "p" terms) of the https://www.oes.edu/aboutoes/mission-vision-identity website. It should look familiar!

In [None]:
#insert 1

### Exercise - BeautifulSoup 2
Count the number of hyperlink tags on the www.cnn.com website. You don't need to print them, just count them.

In [None]:
#insert 2

### Exercise - BeautifulSoup 3
Write a program that prints out the OES faculty's titles from the https://www.oes.edu/academics/upper-school/faculty-staff website.

In [None]:
#insert 3

### Exercise - BeautifulSoup 4
Write a program that prints out the OES faculty's phone numbers.

In [None]:
#insert 4

### Exercise - BeautifulSoup 5
Create a pandas dataframe containing faculty name, titles, departments, and phone numbers.

In [None]:
#insert 5

Using the JavaScript console to help you find stuff
---
<a class="anchor" id="javascript"></a>

Let's use BeautifulSoup and requests to find the temperature for Portland using www.wunderground.com. First let's read in the web page using requests.

In [96]:
import requests
from bs4 import BeautifulSoup

url = 'http://www.wunderground.com/weather-forecast/97217'
response = requests.get(url)

print(response.text)



<!DOCTYPE html><html><head>
  <title>Portland, OR Forecast | Weather Underground</title>
  <meta charset="utf-8" />
  <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
  <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" />
  <meta name="robots" content="follow, index" />
  <meta name="rating" content="general" />
  <meta name="referrer" content="no-referrer-when-downgrade" />
  <meta name="apple-itunes-app" content="app-id=486154808, affiliate-data=at=1010lrYB&ct=website_wu" />
  <meta name="fb_app_id" content="325331260891611" />
  <meta name="fb_channel_url" content="width=device-width, initial-scale=1, maximum-scale=1" />
  <meta property="og:site_name" content="Weather Underground" />
  <meta property="og:type" content="article" />
  <meta name="description" content="Weather Underground provides local & long range weather forecasts, weather reports, maps & tropical weather conditions for locations worldwide." />
  <meta name="wui-me

Okay, this is really ugly. We'll have a hard time finding the information we need from here. Here's a trick to find it more easily:

First, go to the website http://www.wunderground.com/weather-forecast/97223 

Then, right click on the current temperature.

You should see something like this:
<img src="weatherimage.jpg" style="width: 200px;"/>

Then, choose Inspect. This will cause the browser's developer tools (which include the JavaScript console) to pop up:
<img src="weatherimage2.jpg" style="width: 500px;"/>
What is it telling us? Well, the blue highlighted section is telling us the part of the HTML code that contains the temperature info we clicked on. If you look closely, it is an element whose class attribute is 'wu-value wu-value-to'. In order to extract the temperature, we can use BeautifulSoup. Note: we need to put an underscore after "class" (class_) because class is a reserved word in Python:


In [103]:
import requests
from bs4 import BeautifulSoup

url = 'http://www.wunderground.com/weather-forecast/97217'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
temp = soup.find(class_ = 'wu-value wu-value-to').get_text()
print('Temp:', temp)

Temp: 51
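
Websites change their layout often, so a class name that works today may not be there tomorrow. When find doesn't locate anything it returns None, and calling get_text() on None raises an error. The sketch below uses a small made-up HTML snippet (not the real Weather Underground page) to show a None check. It also shows that a class_ string with two class names has to match the attribute exactly, while the CSS selector used with select_one does not care about the order of the class names.

```python
from bs4 import BeautifulSoup

# Made-up snippet that imitates the element we inspected
html = ('<div><span class="wu-value wu-value-to">51</span>'
        '<span class="wu-label">F</span></div>')
soup = BeautifulSoup(html, "html.parser")

# Exact match on the whole class attribute string
node = soup.find(class_='wu-value wu-value-to')
temp = node.get_text(strip=True) if node is not None else None

# The same two class names in a different order do not match as a string
reordered = soup.find(class_='wu-value-to wu-value')

# A CSS selector matches the classes in any order
css_node = soup.select_one('.wu-value-to.wu-value')

# A class that is not on the page gives None instead of a tag
missing = soup.find(class_='temperature')

print('Temp:', temp)
print(reordered, css_node.get_text(), missing)
```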


What if we wanted to obtain the description of the weather? We could click on the sun/cloud/snow icon:

<img src="weatherimage3.jpg" style="width: 200px;"/>

If we click on Inspect, we see that it is located under the "condition-icon small-6 medium-12 columns" class tag. 

In [107]:
import requests
from bs4 import BeautifulSoup

url = 'http://www.wunderground.com/weather-forecast/97217'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
conditions = soup.find(class_ = 'condition-icon small-6 medium-12 columns').get_text()
print('Conditions:', conditions)

Conditions: 

Cloudy



Many times, we will want to scrape the web and put the data that we get into a Pandas dataframe. For example, view the seven day forecast on the website here:

https://forecast.weather.gov/MapClick.php?lat=45.44763999999998&lon=-122.76902000000001#.WqL4D5MbOu4

We can read the forecast text in using the following code (we'll store it in pandas a few steps later):

In [30]:
import requests
from bs4 import BeautifulSoup

url = 'https://forecast.weather.gov/MapClick.php?lat=45.44763999999998&lon=-122.76902000000001#.WqL4D5MbOu4'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
data = soup.findAll(class_='col-sm-10 forecast-text')
print(data)

[<div class="col-sm-10 forecast-text">Mostly sunny, with a high near 53. Light west southwest wind. </div>, <div class="col-sm-10 forecast-text">Patchy fog after 10pm.  Otherwise, mostly clear, with a low around 35. Light north northeast wind. </div>, <div class="col-sm-10 forecast-text">Patchy fog before 10am.  Otherwise, increasing clouds, with a high near 59. East northeast wind 3 to 6 mph. </div>, <div class="col-sm-10 forecast-text">Partly cloudy, with a low around 40. East northeast wind around 6 mph becoming calm  in the evening. </div>, <div class="col-sm-10 forecast-text">Sunny, with a high near 66. Calm wind becoming east northeast around 6 mph in the morning. </div>, <div class="col-sm-10 forecast-text">Partly cloudy, with a low around 41.</div>, <div class="col-sm-10 forecast-text">A 10 percent chance of showers after 4pm.  Mostly sunny, with a high near 68.</div>, <div class="col-sm-10 forecast-text">A slight chance of showers before 10pm, then a slight chance of rain afte

There is still a lot of HTML stuff that is making the forecasts hard to read. Let's strip it using .get_text(). We could make a for loop for this, or we could do it more succinctly using a list comprehension:

In [31]:
forecasts = [pt.get_text() for pt in data]
forecasts

['Mostly sunny, with a high near 53. Light west southwest wind. ',
 'Patchy fog after 10pm.  Otherwise, mostly clear, with a low around 35. Light north northeast wind. ',
 'Patchy fog before 10am.  Otherwise, increasing clouds, with a high near 59. East northeast wind 3 to 6 mph. ',
 'Partly cloudy, with a low around 40. East northeast wind around 6 mph becoming calm  in the evening. ',
 'Sunny, with a high near 66. Calm wind becoming east northeast around 6 mph in the morning. ',
 'Partly cloudy, with a low around 41.',
 'A 10 percent chance of showers after 4pm.  Mostly sunny, with a high near 68.',
 'A slight chance of showers before 10pm, then a slight chance of rain after 10pm.  Mostly cloudy, with a low around 43.',
 'Rain likely.  Mostly cloudy, with a high near 55.',
 'A chance of rain.  Mostly cloudy, with a low around 40.',
 'A chance of showers.  Cloudy, with a high near 55.',
 'A chance of showers.  Mostly cloudy, with a low around 40.',
 'A chance of showers.  Mostly cloud
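
To make the comparison concrete, here is a sketch that builds the same list both ways. It uses two forecast divs copied from the output above as a string, so it runs without a web request. The last line also adds .strip(), an optional extra that removes the trailing spaces.

```python
from bs4 import BeautifulSoup

# Two forecast divs copied from the output above
html = ('<div class="col-sm-10 forecast-text">Mostly sunny, with a high near 53. '
        'Light west southwest wind. </div>'
        '<div class="col-sm-10 forecast-text">Partly cloudy, with a low around 41.</div>')
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all(class_='col-sm-10 forecast-text')

# Version 1: an ordinary for loop
forecasts_loop = []
for pt in data:
    forecasts_loop.append(pt.get_text())

# Version 2: the same thing as a list comprehension
forecasts = [pt.get_text() for pt in data]

# Optional: strip leading and trailing whitespace as well
forecasts_clean = [pt.get_text().strip() for pt in data]

print(forecasts == forecasts_loop)
print(forecasts_clean)
```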

Let's do the same thing to grab the days of the week:

In [32]:
import requests
from bs4 import BeautifulSoup

url = 'https://forecast.weather.gov/MapClick.php?lat=45.44763999999998&lon=-122.76902000000001#.WqL4D5MbOu4'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
data = soup.findAll(class_='col-sm-2 forecast-label')
days = [pt.get_text() for pt in data]
days

['This Afternoon',
 'Tonight',
 'Saturday',
 'Saturday Night',
 'Sunday',
 'Sunday Night',
 'Monday',
 'Monday Night',
 'Tuesday',
 'Tuesday Night',
 'Wednesday',
 'Wednesday Night',
 'Thursday']

Finally, we can convert these two lists into a pandas dataframe:

In [33]:
import pandas as pd
weather = pd.DataFrame({
        "days": days, 
        "forecasts": forecasts, 
    })
weather

| | days | forecasts |
|---|---|---|
| 0 | This Afternoon | Mostly sunny, with a high near 53. Light west ... |
| 1 | Tonight | Patchy fog after 10pm. Otherwise, mostly clea... |
| 2 | Saturday | Patchy fog before 10am. Otherwise, increasing... |
| 3 | Saturday Night | Partly cloudy, with a low around 40. East nort... |
| 4 | Sunday | Sunny, with a high near 66. Calm wind becoming... |
| 5 | Sunday Night | Partly cloudy, with a low around 41. |
| 6 | Monday | A 10 percent chance of showers after 4pm. Mos... |
| 7 | Monday Night | A slight chance of showers before 10pm, then a... |
| 8 | Tuesday | Rain likely. Mostly cloudy, with a high near 55. |
| 9 | Tuesday Night | A chance of rain. Mostly cloudy, with a low a... |
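
One thing to watch for: building a dataframe from a dictionary of lists only works when the lists have the same length. If a website change makes one scrape return fewer items than the other, pandas raises a ValueError. A minimal sketch with short hand-typed lists (taken from the output above, not scraped):

```python
import pandas as pd

# Short hand-typed lists standing in for the scraped ones
days = ['This Afternoon', 'Tonight', 'Saturday']
forecasts = ['Mostly sunny, with a high near 53.',
             'Patchy fog after 10pm.',
             'Patchy fog before 10am.']

weather = pd.DataFrame({"days": days, "forecasts": forecasts})
print(weather.shape)

# Lists of different lengths cannot be lined up into rows
try:
    pd.DataFrame({"days": days, "forecasts": forecasts[:2]})
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)

print(mismatch_error)
```

Checking len(days) == len(forecasts) before building the dataframe is an easy way to catch this early.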


### Exercise - JavaScript Console 1
Find the current condition wind speed of your city using www.wunderground.com . Use the JavaScript console to help you find the location of wind speed.

In [19]:
#insert 1

### Exercise - JavaScript Console 2
Use the JavaScript console to print the Star Wars Total Gross from http://www.boxofficemojo.com/yearly/chart/?yr=2017&p=.htm

In [None]:
#insert 2

### Exercise - JavaScript Console 3
Use the JavaScript console to print the opening date of Pulp Fiction from http://www.imdb.com/title/tt0110912/?ref_=fn_al_tt_1

In [None]:
#insert 3

### Exercise - JavaScript Console 4
Use the JavaScript console to find all of the people referenced under the cast of http://www.imdb.com/title/tt0110912/?ref_=fn_al_tt_1 . Delete duplicates and strip whitespace. Save these actors to a pandas dataframe.

In [None]:
#insert 4

Using Pandas to Webscrape
---
Pandas has super cool built-in functionality to read tables from websites! It doesn't work on all tables, but when it does, it saves you a heck of a lot of time!
<a class="anchor" id="http"></a>

First, go to this website listing NFL Super Bowl champions. There are several tables on the website. We can print all of them in just a few lines of code!

In [204]:
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_Super_Bowl_champions'
tables = pd.read_html(url)
print(tables)

[                                                  0  \
0         National Football League (NFL, 1967–1970)   
1                            NFL champion‡ (4, 2–2)   
2  National Football Conference (NFC, 1971–present)   
3                         NFC champion* (48, 25–23)   

                                                  1  
0         American Football League (AFL, 1967–1970)  
1                            AFL champion^ (4, 2–2)  
2  American Football Conference (AFC, 1971–present)  
3                         AFC champion† (48, 23–25)  ,                   0                                                 1  \
0              Game                                              Date   
1             01 !I           000000001967-01-15-0000January 15, 1967   
2            02 !II           000000001968-01-14-0000January 14, 1968   
3           03 !III           000000001969-01-12-0000January 12, 1969   
4            04 !IV           000000001970-01-11-0000January 11, 1970   
5             

It looks like the table I'm really interested in is the second one. Let's print that one:

In [205]:
df = tables[1]
df

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,Game,Date,Winning team,Score,Losing team,Venue,City,Attendance,Ref
1,01 !I,"000000001967-01-15-0000January 15, 1967","Green Bay Packers 01 !Green Bay Packers‡ (1, 1–0)",3510 !35–10,"Kansas City Chiefs 01 !Kansas City Chiefs^ (1,...",Los Angeles Memorial Coliseum 01 !Los Angeles ...,"Los Angeles, California 01 !Los Angeles, Calif...","061946 !61,946",[11]
2,02 !II,"000000001968-01-14-0000January 14, 1968","Green Bay Packers 02 !Green Bay Packers‡ (2, 2–0)",3314 !33–14,"Oakland Raiders 01 !Oakland Raiders^ (1, 0–1)",Miami Orange Bowl 01 !Miami Orange Bowl,"Miami, Florida 01 !Miami, Florida[note 2]","075546 !75,546",[12]
3,03 !III,"000000001969-01-12-0000January 12, 1969","New York Jets 01 !New York Jets^ (1, 1–0)",1607 !16–7,"Indianapolis Colts 01 !Baltimore Colts‡ (1, 0–1)",Miami Orange Bowl 02 !Miami Orange Bowl (2),"Miami, Florida 02 !Miami, Florida (2)[note 2]","075389 !75,389",[13]
4,04 !IV,"000000001970-01-11-0000January 11, 1970","Kansas City Chiefs 02 !Kansas City Chiefs^ (2,...",2307 !23–7,"Minnesota Vikings 01 !Minnesota Vikings‡ (1, 0–1)",Tulane Stadium 01 !Tulane Stadium,"New Orleans, Louisiana 01 !New Orleans, Louisiana","080562 !80,562",[14]
5,05 !V,"000000001971-01-17-0000January 17, 1971","Indianapolis Colts 02 !Baltimore Colts† (2, 1–1)",1613 !16–13,"Dallas Cowboys 01 !Dallas Cowboys* (1, 0–1)",Miami Orange Bowl 03 !Miami Orange Bowl (3),"Miami, Florida 03 !Miami, Florida (3)[note 2]","079204 !79,204",[15]
6,06 !VI,"000000001972-01-16-0000January 16, 1972","Dallas Cowboys 02 !Dallas Cowboys* (2, 1–1)",2403 !24–3,"Miami Dolphins 01 !Miami Dolphins† (1, 0–1)",Tulane Stadium 02 !Tulane Stadium (2),"New Orleans, Louisiana 02 !New Orleans, Louisi...","081023 !81,023",[16]
7,07 !VII,"000000001973-01-14-0000January 14, 1973","Miami Dolphins 02 !Miami Dolphins† (2, 1–1)",1407 !14–7,Washington Redskins 01 !Washington Redskins* (...,Los Angeles Memorial Coliseum 02 !Los Angeles ...,"Los Angeles, California 02 !Los Angeles, Calif...","090182 !90,182",[17]
8,08 !VIII,"000000001974-01-13-0000January 13, 1974","Miami Dolphins 03 !Miami Dolphins† (3, 2–1)",2407 !24–7,"Minnesota Vikings 02 !Minnesota Vikings* (2, 0–2)",Rice Stadium 01 !Rice Stadium,"Houston, Texas 01 !Houston, Texas","071882 !71,882",[18]
9,09 !IX,"000000001975-01-12-0000January 12, 1975",Pittsburgh Steelers 01 !Pittsburgh Steelers† (...,1606 !16–6,"Minnesota Vikings 03 !Minnesota Vikings* (3, 0–3)",Tulane Stadium 03 !Tulane Stadium (3),"New Orleans, Louisiana 03 !New Orleans, Louisi...","080997 !80,997",[19]


Let's set the column names to the values in the first row of the dataframe and then drop that row:

In [207]:
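# Use the first row (which holds the real headers) as the column names
df.columns = df.iloc[0]
# Drop that row (index label 0) now that it has become the header
df.drop(0, inplace=True)
df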

Unnamed: 0,Game,Date,Winning team,Score,Losing team,Venue,City,Attendance,Ref
1,01 !I,"000000001967-01-15-0000January 15, 1967","Green Bay Packers 01 !Green Bay Packers‡ (1, 1–0)",3510 !35–10,"Kansas City Chiefs 01 !Kansas City Chiefs^ (1,...",Los Angeles Memorial Coliseum 01 !Los Angeles ...,"Los Angeles, California 01 !Los Angeles, Calif...","061946 !61,946",[11]
2,02 !II,"000000001968-01-14-0000January 14, 1968","Green Bay Packers 02 !Green Bay Packers‡ (2, 2–0)",3314 !33–14,"Oakland Raiders 01 !Oakland Raiders^ (1, 0–1)",Miami Orange Bowl 01 !Miami Orange Bowl,"Miami, Florida 01 !Miami, Florida[note 2]","075546 !75,546",[12]
3,03 !III,"000000001969-01-12-0000January 12, 1969","New York Jets 01 !New York Jets^ (1, 1–0)",1607 !16–7,"Indianapolis Colts 01 !Baltimore Colts‡ (1, 0–1)",Miami Orange Bowl 02 !Miami Orange Bowl (2),"Miami, Florida 02 !Miami, Florida (2)[note 2]","075389 !75,389",[13]
4,04 !IV,"000000001970-01-11-0000January 11, 1970","Kansas City Chiefs 02 !Kansas City Chiefs^ (2,...",2307 !23–7,"Minnesota Vikings 01 !Minnesota Vikings‡ (1, 0–1)",Tulane Stadium 01 !Tulane Stadium,"New Orleans, Louisiana 01 !New Orleans, Louisiana","080562 !80,562",[14]
5,05 !V,"000000001971-01-17-0000January 17, 1971","Indianapolis Colts 02 !Baltimore Colts† (2, 1–1)",1613 !16–13,"Dallas Cowboys 01 !Dallas Cowboys* (1, 0–1)",Miami Orange Bowl 03 !Miami Orange Bowl (3),"Miami, Florida 03 !Miami, Florida (3)[note 2]","079204 !79,204",[15]
6,06 !VI,"000000001972-01-16-0000January 16, 1972","Dallas Cowboys 02 !Dallas Cowboys* (2, 1–1)",2403 !24–3,"Miami Dolphins 01 !Miami Dolphins† (1, 0–1)",Tulane Stadium 02 !Tulane Stadium (2),"New Orleans, Louisiana 02 !New Orleans, Louisi...","081023 !81,023",[16]
7,07 !VII,"000000001973-01-14-0000January 14, 1973","Miami Dolphins 02 !Miami Dolphins† (2, 1–1)",1407 !14–7,Washington Redskins 01 !Washington Redskins* (...,Los Angeles Memorial Coliseum 02 !Los Angeles ...,"Los Angeles, California 02 !Los Angeles, Calif...","090182 !90,182",[17]
8,08 !VIII,"000000001974-01-13-0000January 13, 1974","Miami Dolphins 03 !Miami Dolphins† (3, 2–1)",2407 !24–7,"Minnesota Vikings 02 !Minnesota Vikings* (2, 0–2)",Rice Stadium 01 !Rice Stadium,"Houston, Texas 01 !Houston, Texas","071882 !71,882",[18]
9,09 !IX,"000000001975-01-12-0000January 12, 1975",Pittsburgh Steelers 01 !Pittsburgh Steelers† (...,1606 !16–6,"Minnesota Vikings 03 !Minnesota Vikings* (3, 0–3)",Tulane Stadium 03 !Tulane Stadium (3),"New Orleans, Louisiana 03 !New Orleans, Louisi...","080997 !80,997",[19]
10,10 !X,"000000001976-01-18-0000January 18, 1976",Pittsburgh Steelers 02 !Pittsburgh Steelers† (...,2117 !21–17,"Dallas Cowboys 03 !Dallas Cowboys* (3, 1–2)",Miami Orange Bowl 04 !Miami Orange Bowl (4),"Miami, Florida 04 !Miami, Florida (4)[note 2]","080187 !80,187",[20]


The data still needs some cleanup (many cells carry a sort-key prefix such as "01 !" in front of the real value), but this is a very quick way to get what you need!
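As a rough idea of what that cleanup could look like, here is a minimal sketch. It assumes the only problem is the sort-key prefix that ends in " !" (as in "01 !I" or "061946 !61,946"). The small dataframe `df_small` and the helper `strip_sort_key` are made up for this illustration, with values copied by hand from three columns of the table above. The Date column uses a different prefix and would need its own handling.

```python
import pandas as pd

# Hypothetical toy dataframe: three rows and three columns copied from the table above
df_small = pd.DataFrame({
    "Game": ["01 !I", "02 !II", "03 !III"],
    "Score": ["3510 !35–10", "3314 !33–14", "1607 !16–7"],
    "Attendance": ["061946 !61,946", "075546 !75,546", "075389 !75,389"],
})

def strip_sort_key(value):
    """Keep only the text after the first ' !' separator (unchanged if there is none)."""
    return value.split(" !", 1)[-1]

# Apply the helper to every cell, one column at a time
cleaned = df_small.apply(lambda col: col.map(strip_sort_key))

# Attendance is still text like "61,946": remove the comma and convert to integers
cleaned["Attendance"] = (
    cleaned["Attendance"].str.replace(",", "", regex=False).astype(int)
)

print(cleaned)
```

The same helper could be applied to the full dataframe, but columns that contain missing values or footnote markers would likely need extra care first.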

### Exercise - Pandas 1
Create a pandas dataframe of the NBA MVP winners from https://en.wikipedia.org/wiki/NBA_Most_Valuable_Player_Award

Update the column names to be the first row and then drop the first row.

In [20]:
#insert pandas 1

### Exercise - Pandas 2
Use groupby to find the position that occurs most frequently.

In [221]:
#insert pandas 2

### Exercise - Pandas 3
Use groupby to find the team that occurs most frequently.

In [None]:
#insert pandas 3

### Exercise - Pandas 4
Create a dataframe of college football champions from https://en.wikipedia.org/wiki/College_football_national_championships_in_NCAA_Division_I_FBS

Make the first row the column names and drop it.

In [21]:
#insert pandas 4

### Exercise - Pandas 5
Create a histogram for the number of championships. Choose your bins to be bins=np.arange(0.5,13.5,1).

In [22]:
#insert pandas 5
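A note on the bins suggested in Pandas 5: the edges from np.arange(0.5,13.5,1) sit on half-integers, so each bin is centered on a whole number (1, 2, ..., 12) and every integer count falls cleanly inside one bin. The sketch below checks this with np.histogram on a made-up list of counts. The name `toy_counts` and its values are hypothetical and are not the championship data.

```python
import numpy as np

# Bin edges from the exercise: 0.5, 1.5, ..., 12.5 (13 edges, so 12 bins)
edges = np.arange(0.5, 13.5, 1)

# Hypothetical counts, only to show where integer values land
toy_counts = [1, 1, 2, 3, 3, 3, 12]

# hist[i] is the number of values between edges[i] and edges[i + 1]
hist, _ = np.histogram(toy_counts, bins=edges)

for center, n in zip(range(1, 13), hist):
    print(f"bin centered on {center}: {n} value(s)")
```

Passing the same `edges` as the `bins` argument of a plotting call should give bars centered on whole numbers in the same way.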

### Exercise - Pandas 6
Use pd.read_html to read in the data from:

https://forecast.weather.gov/MapClick.php?lat=45.44763999999998&lon=-122.76902000000001#.WqL4D5MbOu4



In [23]:
#insert pandas 6

### Exercise - Pandas 7
Using the table above, print out just the dewpoint in Fahrenheit in the format:

"Dewpoint: 36F"

In [24]:
#insert pandas 7