# Jupyter Notebook

This is a Jupyter Notebook, which is basically just a super fancy Python shell.

A notebook is made up of "cells" that can be either text (like this one) or executable Python code. Notebooks are really nice because they let you develop Python code rapidly by writing small bits of code, testing their output, and moving on to the next bit; this interactive workflow is a huge plus for professional Python developers.

It's also nice because it's really easy to share your code with others and surround it with text to tell a story!

# Colaboratory
Colaboratory is a service provided by Google that takes a Jupyter Notebook (stored in the standard `.ipynb` file format) and lets users edit and run the code in the notebook for free!

This notebook is write-protected, so you cannot edit the copy that the whole class looks at. You can, however, open the notebook in "playground mode", which lets you make edits to a temporary copy. If you want to keep the changes you made, follow the instructions that appear when you try to save; they will copy the notebook to your Google Drive.

# Setup
Make sure you run the following cell(s) before trying to run any of the cells after them. You do not need to understand what they are doing; they just make sure that a file we want to use is stored on the computer running this notebook.

---


# APIs
The first web technology we talked about is the Application Programming Interface (API). A web API is generally a URL you can go to that returns data (as opposed to a web page like facebook.com). We looked at an API that returns the position of the International Space Station. This API is located at http://api.open-notify.org/iss-now.json and returns data in the JSON format. JSON looks a lot like Python lists and dictionaries with keys and values.
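To make the comparison with lists and dictionaries concrete, here is a small sketch that uses Python's built-in `json` module on a hard-coded sample string. The string is shaped like the ISS response shown later in this notebook, and the variable names are only illustrative.

```python
import json

# A sample JSON document, written as an ordinary Python string.
sample = '{"message": "success", "iss_position": {"latitude": "-12.5937", "longitude": "-45.8263"}, "timestamp": 1559873338}'

# json.loads parses the text into Python objects: JSON objects become
# dictionaries and JSON arrays become lists.
data = json.loads(sample)

print(type(data))                        # the top level is a dict
print(data['iss_position']['latitude'])  # nested lookup; the value is a string

# json.dumps goes the other way, turning Python objects back into JSON text.
text = json.dumps({'crew': ['a', 'b'], 'count': 2})
print(text)
```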

`requests` is a very popular Python library that lets you fetch data from a URL.

In [0]:
import requests

In [0]:
response = requests.get('http://api.open-notify.org/iss-now.json')

This returns a "response" object that has information about the response, like its status code and data.

In [0]:
response.status_code

200

You can view the raw data with the `content` attribute, but this returns a `bytes` string (note the `b` prefix in the output below), which is not very helpful. Instead, we can use the `json` method to parse the data into a Python dictionary.

In [0]:
response.content

b'{"message": "success", "iss_position": {"latitude": "-12.5937", "longitude": "-45.8263"}, "timestamp": 1559873338}'

In [0]:
d = response.json()
print(d)
print(d['timestamp'])

{'message': 'success', 'iss_position': {'latitude': '-12.5937', 'longitude': '-45.8263'}, 'timestamp': 1559873338}
1559873338


How did we know there was a key called `'timestamp'`? This is part of the documentation of the API that can be found [here](http://open-notify.org/Open-Notify-API/ISS-Location-Now/).  
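Once the keys are known, the values usually still need converting before they are useful. The sketch below works on a hard-coded copy of the dictionary printed above, and it assumes the `timestamp` value is Unix time (seconds since 1970-01-01 UTC); check the API documentation before relying on that.

```python
from datetime import datetime, timezone

# A hard-coded copy of the dictionary printed above, so this runs offline.
d = {'message': 'success',
     'iss_position': {'latitude': '-12.5937', 'longitude': '-45.8263'},
     'timestamp': 1559873338}

# The coordinates arrive as strings, so convert them before doing any math.
latitude = float(d['iss_position']['latitude'])
longitude = float(d['iss_position']['longitude'])

# Assumption: the timestamp counts seconds since 1970-01-01 UTC (Unix time).
when = datetime.fromtimestamp(d['timestamp'], tz=timezone.utc)

print(latitude, longitude)
print(when.isoformat())
```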

To get more up-to-date data, we would have to make the request again. The code below makes 20 calls to the API and prints the latitude and longitude of the ISS.

In [0]:
for i in range(20):
  response = requests.get('http://api.open-notify.org/iss-now.json')
  if response.status_code == 200:
    print(response.json()['iss_position'])
  else:
    print('Error')

{'latitude': '-12.5937', 'longitude': '-45.8263'}
{'latitude': '-12.5937', 'longitude': '-45.8263'}
{'latitude': '-12.5937', 'longitude': '-45.8263'}
{'latitude': '-12.5937', 'longitude': '-45.8263'}
{'latitude': '-12.5937', 'longitude': '-45.8263'}
{'latitude': '-12.5937', 'longitude': '-45.8263'}
{'latitude': '-12.6187', 'longitude': '-45.8073'}
{'latitude': '-12.6187', 'longitude': '-45.8073'}
{'latitude': '-12.6187', 'longitude': '-45.8073'}
{'latitude': '-12.6187', 'longitude': '-45.8073'}
{'latitude': '-12.6187', 'longitude': '-45.8073'}
{'latitude': '-12.6187', 'longitude': '-45.8073'}
{'latitude': '-12.6187', 'longitude': '-45.8073'}
{'latitude': '-12.6187', 'longitude': '-45.8073'}
{'latitude': '-12.6187', 'longitude': '-45.8073'}
{'latitude': '-12.6187', 'longitude': '-45.8073'}
{'latitude': '-12.6187', 'longitude': '-45.8073'}
{'latitude': '-12.6187', 'longitude': '-45.8073'}
{'latitude': '-12.6436', 'longitude': '-45.7884'}
{'latitude': '-12.6436', 'longitude': '-45.7884'}
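The repeated positions in the output above suggest that the loop asks for data faster than the reported position changes. A gentler variant might pause between calls and set a timeout on each request. The sketch below is one possible way to do that; `fetch_iss_position` and `poll` are hypothetical helper names, not part of `requests`, and the 10-second timeout is an arbitrary choice.

```python
import time

ISS_URL = 'http://api.open-notify.org/iss-now.json'

def fetch_iss_position():
    """Return the current ISS position dict, or None if the request fails."""
    import requests  # imported here so poll() below can be tried offline
    try:
        response = requests.get(ISS_URL, timeout=10)
    except requests.RequestException:
        return None
    if response.status_code != 200:
        return None
    return response.json()['iss_position']

def poll(fetch, count, delay_seconds):
    """Call fetch() `count` times, pausing between calls and skipping failures."""
    results = []
    for i in range(count):
        position = fetch()
        if position is not None:
            results.append(position)
        if i < count - 1:
            time.sleep(delay_seconds)  # no need to wait after the last call
    return results
```

Calling `poll(fetch_iss_position, count=5, delay_seconds=5)` would then make five requests spread over roughly twenty seconds. Passing the fetch function in as an argument also makes it easy to try `poll` with a fake function that needs no network.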


# Web Scraping
APIs are fantastic because they return useful data that is generally well formatted! One problem is that an API needs to be written by someone, and they don't always exist for the things you want. If there is no API to access the data nicely, another approach is to "scrape" the data off a webpage.

In the rest of this example, we will try to scrape [this webpage](https://forecast.weather.gov/MapClick.php?lat=47.6036&lon=-122.3294) with the weather forecast so we can gather data about the weather for the rest of the week. This example is a bit lacking since we don't show **what** you would do with this data, just **how** to get it. To know what you can do with the data, refer to the rest of the course where we learned how to process data we had; web-scraping is just a tool to gather more data.

To understand how to scrape a page, you have to understand what a webpage looks like. Please refer to the slides to see what a webpage looks like. 

In [0]:
page = requests.get('https://forecast.weather.gov/MapClick.php?lat=47.6036&lon=-122.3294')
print(page.content)



This would be a pain to parse by hand, so instead we use a library that lets us look at the contents of the page. This library is called Beautiful Soup, and it can be used as shown below:

In [0]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

To find the first paragraph in the page, we can write

In [0]:
soup.find('p')

<p>
<input checked="checked" id="nws" name="affiliate" type="radio" value="nws.noaa.gov"/>
<label class="search-scope" for="nws">NWS</label>
<input id="noaa" name="affiliate" type="radio" value="noaa.gov"/>
<label class="search-scope" for="noaa">All NOAA</label>
</p>

To find all the paragraphs, we would use `find_all`

In [0]:
soup.find_all('p')

[<p>
 <input checked="checked" id="nws" name="affiliate" type="radio" value="nws.noaa.gov"/>
 <label class="search-scope" for="nws">NWS</label>
 <input id="noaa" name="affiliate" type="radio" value="noaa.gov"/>
 <label class="search-scope" for="noaa">All NOAA</label>
 </p>, <p>Your local forecast office is</p>, <p>
             A weather system will inch eastward, but will still impact portions of the Mississippi Valley soils are saturated and rivers remain in flood. Heavy rains spreads into the Southeast and portions of the Mid-Atlantic Friday and Saturday. Flash flooding is possible across the area. Severe storms with primarily a damaging wind threat develops across the High Plains. 
             <a href="http://www.wpc.ncep.noaa.gov/discussions/hpcdiscussions.php?disc=pmdspd" target="_blank">Read More &gt;</a>
 </p>, <p class="myforecast-current">NA</p>, <p class="myforecast-current-lrg">64°F</p>, <p class="myforecast-current-sm">18°C</p>, <p class="moreInfo"><b>More Information:</b
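The difference between `find` and `find_all` is easier to see on a tiny page. This sketch uses a hand-written HTML string in place of the live forecast page, and it also shows what each method gives back when nothing matches.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, used here instead of the live forecast.
html = '<div><p>first</p><p>second</p></div>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('p')               # a single tag: the first match
everything = soup.find_all('p')      # a list-like collection of every match
missing = soup.find('table')         # None, because there is no <table>
none_found = soup.find_all('table')  # an empty collection, not None

print(first.get_text())
print(len(everything))
print(missing, len(none_found))
```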

We learned you can also specify an ID or a class to identify a tag in HTML. First, we select the element with the id "seven-day-forecast" and then try to find all of the items with the class "tombstone-container" **inside** that element.

In [0]:
seven_day = soup.find(id='seven-day-forecast')
seven_day

<div class="panel panel-default" id="seven-day-forecast">
<div class="panel-heading">
<b>Extended Forecast for</b>
<h2 class="panel-title">
	    	    Downtown Seattle WA	</h2>
</div>
<div class="panel-body" id="seven-day-forecast-body">
<div id="seven-day-forecast-container"><ul class="list-unstyled" id="seven-day-forecast-list"><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Showers likely, mainly before 11pm.  Mostly cloudy, with a low around 49. Northwest wind 8 to 10 mph becoming south southwest after midnight.  Chance of precipitation is 60%. New precipitation amounts of less than a tenth of an inch possible. " class="forecast-icon" src="newimages/medium/nshra60.png" title="Tonight: Showers likely, mainly before 11pm.  Mostly cloudy, with a low around 49. Northwest wind 8 to 10 mph becoming south southwest after midnight.  Chance of precipitation is 60%. New precipitation amounts of less than a te

In [0]:
forecast_items = seven_day.find_all(class_='tombstone-container')
tonight = forecast_items[0]
tonight

<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Showers likely, mainly before 11pm.  Mostly cloudy, with a low around 49. Northwest wind 8 to 10 mph becoming south southwest after midnight.  Chance of precipitation is 60%. New precipitation amounts of less than a tenth of an inch possible. " class="forecast-icon" src="newimages/medium/nshra60.png" title="Tonight: Showers likely, mainly before 11pm.  Mostly cloudy, with a low around 49. Northwest wind 8 to 10 mph becoming south southwest after midnight.  Chance of precipitation is 60%. New precipitation amounts of less than a tenth of an inch possible. "/></p><p class="short-desc">Showers<br/>Likely</p><p class="temp temp-low">Low: 49 °F</p></div>
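Searching inside an element narrows the results to that element's contents. The sketch below shows this on a made-up page (the id and class names are invented for the example), and it guards against `find` returning `None`, which is what happens when a page has no element with the requested id.

```python
from bs4 import BeautifulSoup

# A made-up page: two "item" elements inside the container, one outside it.
html = '''
<div id="forecast">
  <div class="item">inside 1</div>
  <div class="item">inside 2</div>
</div>
<div class="item">outside</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# class is a reserved word in Python, so Beautiful Soup spells it class_.
whole_page = soup.find_all(class_='item')

container = soup.find(id='forecast')
if container is None:
    inside = []  # the id was not on the page, so there is nothing to search
else:
    inside = container.find_all(class_='item')

print(len(whole_page))
print([tag.get_text() for tag in inside])
```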

To print it out a little nicer, we can use the `prettify` method

In [0]:
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Tonight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Tonight: Showers likely, mainly before 11pm.  Mostly cloudy, with a low around 49. Northwest wind 8 to 10 mph becoming south southwest after midnight.  Chance of precipitation is 60%. New precipitation amounts of less than a tenth of an inch possible. " class="forecast-icon" src="newimages/medium/nshra60.png" title="Tonight: Showers likely, mainly before 11pm.  Mostly cloudy, with a low around 49. Northwest wind 8 to 10 mph becoming south southwest after midnight.  Chance of precipitation is 60%. New precipitation amounts of less than a tenth of an inch possible. "/>
 </p>
 <p class="short-desc">
  Showers
  <br/>
  Likely
 </p>
 <p class="temp temp-low">
  Low: 49 °F
 </p>
</div>


Elements returned by `find` sometimes have attributes that you can inspect. For example, the `img` tags in the forecast have a "title" attribute with information about the forecast.

In [0]:
tonight.find('img')['title']

'Tonight: Showers likely, mainly before 11pm.  Mostly cloudy, with a low around 49. Northwest wind 8 to 10 mph becoming south southwest after midnight.  Chance of precipitation is 60%. New precipitation amounts of less than a tenth of an inch possible. '
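Not every tag carries every attribute, so it can help to know what happens when one is missing. This sketch uses a made-up snippet with two images, only one of which has a title.

```python
from bs4 import BeautifulSoup

# A made-up snippet: the second image has no title attribute.
html = '<p><img src="a.png" title="Tonight: Showers likely"/><img src="b.png"/></p>'
soup = BeautifulSoup(html, 'html.parser')
with_title, without_title = soup.find_all('img')

print(with_title['title'])         # square brackets look up one attribute
print(with_title.attrs)            # attrs is a dictionary of all attributes
print(without_title.get('title'))  # get returns None when the attribute is missing

# Square brackets raise KeyError for a missing attribute, like a dictionary.
try:
    without_title['title']
    raised = False
except KeyError:
    raised = True
print(raised)
```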

Finding elements within an element can sometimes be tedious to do manually with `find`. BeautifulSoup also provides the `select` method that lets you find elements using a special syntax called "CSS Selectors".

The following line finds all the elements with the class "period-name" that are inside elements with the class "tombstone-container".

In [0]:
seven_day.select('.tombstone-container .period-name')

[<p class="period-name">Tonight<br/><br/></p>,
 <p class="period-name">Friday<br/><br/></p>,
 <p class="period-name">Friday<br/>Night</p>,
 <p class="period-name">Saturday<br/><br/></p>,
 <p class="period-name">Saturday<br/>Night</p>,
 <p class="period-name">Sunday<br/><br/></p>,
 <p class="period-name">Sunday<br/>Night</p>,
 <p class="period-name">Monday<br/><br/></p>,
 <p class="period-name">Monday<br/>Night</p>]
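A few common selector patterns are sketched below on a made-up page (the class and id names are invented for the example). A leading `.` matches a class, a leading `#` matches an id, and a space means "somewhere inside".

```python
from bs4 import BeautifulSoup

# A made-up page with two "day" blocks and one stray paragraph.
html = '''
<div id="week">
  <div class="day"><p class="name">Friday</p><p class="temp">High: 59</p></div>
  <div class="day"><p class="name">Saturday</p><p class="temp">High: 65</p></div>
</div>
<p class="name">Not a day</p>
'''
soup = BeautifulSoup(html, 'html.parser')

names_in_days = soup.select('.day .name')  # class "name" inside class "day"
all_names = soup.select('p.name')          # <p> tags with class "name"
week = soup.select_one('#week')            # first element with id "week", or None

print([tag.get_text() for tag in names_in_days])
print(len(all_names))
print(week['id'])
```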

To get the text inside each tag returned by `select`, we can use a loop to call the `get_text` method on each tag.

Below we get the following information for each time in the forecast:
*  The name of the forecast time (called the period)
*  The description of the forecast, from the title attribute of the image inside the forecast
*  The forecast temperature



In [0]:
periods = [pt.get_text() for pt in seven_day.select('.tombstone-container .period-name')]
print(periods)

['Tonight', 'Friday', 'FridayNight', 'Saturday', 'SaturdayNight', 'Sunday', 'SundayNight', 'Monday', 'MondayNight']
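Names such as 'FridayNight' run together because the two words are separated only by a `<br/>` tag, which contributes no text. One way to keep them apart is to pass a separator to `get_text`, as in this sketch on a hand-written snippet:

```python
from bs4 import BeautifulSoup

# Two period names copied from the forecast markup shown earlier.
html = '<p class="period-name">Friday<br/>Night</p><p class="period-name">Tonight<br/><br/></p>'
soup = BeautifulSoup(html, 'html.parser')
tags = soup.select('.period-name')

# Default behaviour: the pieces of text are joined with nothing between them.
plain = [tag.get_text() for tag in tags]

# With a separator, the pieces are joined by a space; strip=True trims
# whitespace and drops empty pieces.
spaced = [tag.get_text(separator=' ', strip=True) for tag in tags]

print(plain)
print(spaced)
```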


In [0]:
titles = [img['title'] for img in seven_day.select('.tombstone-container img')]
print(titles)

['Tonight: Showers likely, mainly before 11pm.  Mostly cloudy, with a low around 49. Northwest wind 8 to 10 mph becoming south southwest after midnight.  Chance of precipitation is 60%. New precipitation amounts of less than a tenth of an inch possible. ', 'Friday: Showers, with thunderstorms also possible after 2pm.  High near 59. South southwest wind 9 to 13 mph.  Chance of precipitation is 90%. New rainfall amounts between a tenth and quarter of an inch, except higher amounts possible in thunderstorms. ', 'Friday Night: Showers before 11pm.  Low around 50. West wind 6 to 11 mph becoming south southeast in the evening.  Chance of precipitation is 80%. New precipitation amounts of less than a tenth of an inch possible. ', 'Saturday: A 20 percent chance of showers before 11am.  Partly sunny, with a high near 65. South wind 5 to 11 mph becoming west in the afternoon. ', 'Saturday Night: Partly cloudy, with a low around 50. North northwest wind 7 to 11 mph becoming northeast in the eveni

In [0]:
temps = [tt.get_text() for tt in seven_day.select('.tombstone-container .temp')]
print(temps)

['Low: 49 °F', 'High: 59 °F', 'Low: 50 °F', 'High: 65 °F', 'Low: 50 °F', 'High: 71 °F', 'Low: 54 °F', 'High: 75 °F', 'Low: 58 °F']
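The temperatures are still strings, so they cannot be averaged or compared as numbers yet. A minimal sketch of one way to split them into a label and an integer is shown below. The pattern and the `parse_temp` helper are hypothetical, and they assume every string looks like 'Low: 49 °F' or 'High: 59 °F'.

```python
import re

# A few values copied from the output above.
temps = ['Low: 49 °F', 'High: 59 °F', 'Low: 50 °F']

# Hypothetical pattern: a label, a colon, then a whole number of degrees.
TEMP_PATTERN = re.compile(r'(High|Low):\s*(-?\d+)')

def parse_temp(text):
    """Return (label, degrees) for strings like 'Low: 49 °F', else None."""
    match = TEMP_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))

parsed = [parse_temp(t) for t in temps]
print(parsed)
```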


All of the data together is shown below

In [0]:
print(titles)
print(temps)
print(periods)

['Tonight: Showers likely, mainly before 11pm.  Mostly cloudy, with a low around 49. Northwest wind 8 to 10 mph becoming south southwest after midnight.  Chance of precipitation is 60%. New precipitation amounts of less than a tenth of an inch possible. ', 'Friday: Showers, with thunderstorms also possible after 2pm.  High near 59. South southwest wind 9 to 13 mph.  Chance of precipitation is 90%. New rainfall amounts between a tenth and quarter of an inch, except higher amounts possible in thunderstorms. ', 'Friday Night: Showers before 11pm.  Low around 50. West wind 6 to 11 mph becoming south southeast in the evening.  Chance of precipitation is 80%. New precipitation amounts of less than a tenth of an inch possible. ', 'Saturday: A 20 percent chance of showers before 11am.  Partly sunny, with a high near 65. South wind 5 to 11 mph becoming west in the afternoon. ', 'Saturday Night: Partly cloudy, with a low around 50. North northwest wind 7 to 11 mph becoming northeast in the eveni

The data is now stored as "parallel arrays" (here, Python lists), where the values at the same index in each list together describe one forecast. This would be a bit annoying to work with, so we put it in a `pandas` `DataFrame`.

In [0]:
import pandas as pd
weather = pd.DataFrame({
    'period': periods,
    'temp': temps,
    'desc': titles
})
weather

| | period | temp | desc |
|---|---|---|---|
| 0 | Tonight | Low: 49 °F | Tonight: Showers likely, mainly before 11pm. ... |
| 1 | Friday | High: 59 °F | Friday: Showers, with thunderstorms also possi... |
| 2 | FridayNight | Low: 50 °F | Friday Night: Showers before 11pm. Low around... |
| 3 | Saturday | High: 65 °F | Saturday: A 20 percent chance of showers befor... |
| 4 | SaturdayNight | Low: 50 °F | Saturday Night: Partly cloudy, with a low arou... |
| 5 | Sunday | High: 71 °F | Sunday: Mostly sunny, with a high near 71. |
| 6 | SundayNight | Low: 54 °F | Sunday Night: Partly cloudy, with a low around... |
| 7 | Monday | High: 75 °F | Monday: Mostly sunny, with a high near 75. |
| 8 | MondayNight | Low: 58 °F | Monday Night: Partly cloudy, with a low around... |
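Parallel lists only describe the same forecasts if they have the same length, and a scrape can silently break that (for example, if one forecast had no temperature). The sketch below checks the lengths first and then pairs the values up with `zip` to build one record per forecast. It uses shortened, hard-coded lists rather than the scraped ones.

```python
import pandas as pd

# Shortened stand-ins for the scraped lists.
periods = ['Tonight', 'Friday']
temps = ['Low: 49 °F', 'High: 59 °F']
titles = ['Tonight: Showers likely', 'Friday: Showers']

# The lists only line up if each forecast contributed one value to each list.
if not (len(periods) == len(temps) == len(titles)):
    raise ValueError('scraped lists have different lengths')

# zip pairs up the values at each index, giving one dictionary per forecast.
records = [{'period': p, 'temp': t, 'desc': d}
           for p, t, d in zip(periods, temps, titles)]

weather = pd.DataFrame(records)
print(weather)
```

Building the `DataFrame` from a list of records gives the same table as building it from a dictionary of lists; the record form just makes it explicit that each row is one forecast.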
