## Web Scraping with Beautiful Soup

adapted from https://www.dataquest.io/blog/web-scraping-tutorial-python/

In [None]:
#!pip install BeautifulSoup4 # ! means run this line as a shell command (if you don't know what that means, don't worry about it)
import requests # package for making the GET request and receiving the HTML
from bs4 import BeautifulSoup # class for parsing the HTML
import pandas as pd

First we need to download the webpage with Python's requests library. We'll send a GET request with `requests.get`, which downloads the HTML contents of the specified webpage.

In [None]:
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=38.0335&lon=-78.5079")

In [None]:
page

<Response [200]>

`<Response [200]>` indicates that the request was successful (200 is the HTTP status code for "OK"). Next we create a BeautifulSoup object, which parses the HTML so that we can search it for the information we want.
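A status other than 200 (404, 500, and so on) means the download failed and there is no forecast HTML to parse. The following is a minimal sketch of one way to guard against that. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration, not part of the original tutorial. The second half builds a `Response` by hand so the failure path can be seen without a network connection.

```python
import requests

def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and return its raw HTML bytes."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
    return response.content

# Offline illustration: a hand-built Response with a failing status code
failed = requests.Response()
failed.status_code = 404
try:
    failed.raise_for_status()
    outcome = "ok"
except requests.HTTPError:
    outcome = "http error"

print(outcome)
```

In the notebook, `page.content` could then be replaced by `fetch_html(...)` with the same URL, if stricter error handling is wanted.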

In [None]:
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(page.content, "html.parser") # name the parser explicitly; otherwise bs4 picks one and shows a warning

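If you want to view the HTML, use the command `print(soup.prettify())` to see it laid out nicely. Usually, we can just type an expression on the last line of a Jupyter Notebook cell to view it, but that doesn't work well with `soup.prettify()`: it returns a string, so the notebook shows the newlines as `\n` instead of laying the HTML out on separate lines.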

In [None]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js">
 <head>
  <!-- Meta -->
  <meta content="width=device-width" name="viewport"/>
  <link href="http://purl.org/dc/elements/1.1/" rel="schema.DC"/>
  <title>
   National Weather Service
  </title>
  <meta content="National Weather Service" name="DC.title"/>
  <meta content="NOAA National Weather Service National Weather Service" name="DC.description"/>
  <meta content="US Department of Commerce, NOAA, National Weather Service" name="DC.creator"/>
  <meta content="" name="DC.date.created" scheme="ISO8601"/>
  <meta content="EN-US" name="DC.language" scheme="DCTERMS.RFC1766"/>
  <meta content="weather, National Weather Service" name="DC.keywords"/>
  <meta content="NOAA's National Weather Service" name="DC.publisher"/>
  <meta content="National Weather Service" name="DC.contributor"/>
  <meta content="http://www.weather.gov/disclaimer.php" name="DC.rights"/>
  <meta content="General" name="rating"/>
  <meta content="index,follow" name="robots"/>
  <!-- I

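`find()` searches the HTML and returns the first element that matches what you specify, such as a tag name, a class, or an id. This is the first step in narrowing the webpage down to the information you want. To find the id of an element on the webpage, right-click it in your browser and choose "Inspect".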

In [None]:
seven_day = soup.find(id="seven-day-forecast") # finds the section with the id "seven-day-forecast"

In [None]:
seven_day

<div class="panel panel-default" id="seven-day-forecast">
<div class="panel-heading">
<b>Extended Forecast for</b>
<h2 class="panel-title">
	    	    Charlottesville VA	</h2>
</div>
<div class="panel-body" id="seven-day-forecast-body">
<div id="seven-day-forecast-container"><ul class="list-unstyled" id="seven-day-forecast-list"><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Clear, with a low around 45. Calm wind. " class="forecast-icon" src="newimages/medium/nskc.png" title="Tonight: Clear, with a low around 45. Calm wind. "/></p><p class="short-desc">Clear</p><p class="temp temp-low">Low: 45 °F</p></div></li><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Sunday<br/><br/></p>
<p><img alt="Sunday: Patchy fog before 10am.  Otherwise, sunny, with a high near 79. Calm wind becoming south around 5 mph in the afternoon. " class="forecast-icon" src="DualImage.php?i=f

## Brief HTML Overview

Common tags:

- `<div>` - a divider/section
- `<h1>` - header 1 (big)
- `<h2>` - header 2 (less big)
- `<p>` - paragraph

Each of the tags can contain a `class` and/or an `id`. The class can be used multiple times in elements that are similar, for example separate `<div>`s that contain an image + text that need to be formatted in the same way. An id is used for unique elements, and can only be found once on a webpage. For example, you could use it for a logo `<div>` at the top of a website.
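To make the difference between `class` and `id` concrete, here is a small sketch on a made-up HTML snippet (the snippet and the names `demo`, `logo`, and `cards` are illustrative assumptions, not taken from the weather page). Searching by id yields one element, while searching by class can yield several.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one unique logo, two similarly formatted cards
html = """
<div id="logo"><h1>My Site</h1></div>
<div class="card"><p>First card</p></div>
<div class="card"><p>Second card</p></div>
"""
demo = BeautifulSoup(html, "html.parser")

logo = demo.find(id="logo")            # an id identifies a single element
cards = demo.find_all(class_="card")   # a class can be shared by many elements

logo_text = logo.h1.get_text()
card_texts = [card.p.get_text() for card in cards]
print(logo_text, card_texts)
```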

In [None]:
seven_day.find(class_="tombstone-container") # finds first tag with the class "tombstone-container"

<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Clear, with a low around 45. Calm wind. " class="forecast-icon" src="newimages/medium/nskc.png" title="Tonight: Clear, with a low around 45. Calm wind. "/></p><p class="short-desc">Clear</p><p class="temp temp-low">Low: 45 °F</p></div>

`find()` returned only the first "tombstone-container", but we want all of them, one for each forecast period. We will use the `find_all` method to select **all** elements with the `class` tombstone-container. This returns a list, from which we can select the first element.

In [None]:
forecast_items = seven_day.find_all(class_="tombstone-container") # finds every section with the specific class
today = forecast_items[0]
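One difference between the two methods matters when an element is missing: `find` returns `None`, while `find_all` returns an empty list. The sketch below shows this on a trimmed-down container that has no temperature paragraph (the snippet and the fallback string "n/a" are assumptions for illustration).

```python
from bs4 import BeautifulSoup

# Trimmed-down container with no "temp" paragraph
html = '<div class="tombstone-container"><p class="short-desc">Clear</p></div>'
item = BeautifulSoup(html, "html.parser")

missing_one = item.find(class_="temp")      # no match: None
missing_all = item.find_all(class_="temp")  # no match: empty list

# Guard before calling .get_text(); calling it on None raises AttributeError
temp_tag = item.find(class_="temp")
temp_text = temp_tag.get_text() if temp_tag is not None else "n/a"
print(missing_one, len(missing_all), temp_text)
```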

In [None]:
today

<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Clear, with a low around 45. Calm wind. " class="forecast-icon" src="newimages/medium/nskc.png" title="Tonight: Clear, with a low around 45. Calm wind. "/></p><p class="short-desc">Clear</p><p class="temp temp-low">Low: 45 °F</p></div>

In [None]:
print(today.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Tonight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Tonight: Clear, with a low around 45. Calm wind. " class="forecast-icon" src="newimages/medium/nskc.png" title="Tonight: Clear, with a low around 45. Calm wind. "/>
 </p>
 <p class="short-desc">
  Clear
 </p>
 <p class="temp temp-low">
  Low: 45 °F
 </p>
</div>


We've narrowed the scope so that we have access to tonight's weather data. There are four points of interest:

- The name of the forecast item – in this case, Tonight.
- The description of the conditions – stored in an `<img>`, inside the attribute `title`.
- A short description of the conditions – stored in a paragraph with the class "short-desc".
- The temperature – stored in a paragraph with the class "temp", plus "temp-low" for a low (as here) or "temp-high" for a high.

In [None]:
period = today.find(class_="period-name").get_text()
desc = today.find(class_="forecast-icon")['title'] # this is an image with an attribute called "title", so we get it using the bracket syntax
short_desc = today.find(class_="short-desc").get_text()
temp = today.find(class_="temp").get_text()

print(period)
print(desc)
print(short_desc)
print(temp)

Tonight
Tonight: Clear, with a low around 45. Calm wind. 
Clear
Low: 45 °F
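Two details in the cell above are easy to miss: `class_="temp"` matches a paragraph whose class attribute is "temp temp-low", because Beautiful Soup treats `class` as a list of values, and the bracket syntax raises a `KeyError` when an attribute is absent, whereas `.get()` returns `None`. A short sketch on the same container HTML shown earlier (the `width` attribute is just an example of an attribute that is not present):

```python
from bs4 import BeautifulSoup

html = """
<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Clear, with a low around 45. Calm wind. " class="forecast-icon" src="newimages/medium/nskc.png" title="Tonight: Clear, with a low around 45. Calm wind. "/></p><p class="short-desc">Clear</p><p class="temp temp-low">Low: 45 °F</p></div>
"""
item = BeautifulSoup(html, "html.parser")

temp_tag = item.find(class_="temp")
temp_classes = temp_tag["class"]   # the class attribute is a list of values

img = item.find("img")
title = img["title"]               # bracket syntax: KeyError if the attribute is missing
width = img.get("width")           # .get(): None if the attribute is missing

print(temp_classes, width)
```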


Now that we can parse the information for a single forecast period, we can generalize this process to all of the periods using CSS selectors.

The selector below selects all items with the `class` "period-name" inside an item with the `class` "tombstone-container" in seven_day. The period (.) represents a class. If you wanted to find an id, you would use a pound symbol (#).

Note that `select` returns a list, so to get the elements you have to index the list, or use a for loop to process each element of the list.
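As a small sketch of the selector syntax, the snippet below runs `.class` and `#id` selectors against a cut-down page. The final paragraph outside the forecast is a made-up addition to show how the id narrows the match. `select_one` is the counterpart of `select` that returns only the first match rather than a list.

```python
from bs4 import BeautifulSoup

html = """
<div id="seven-day-forecast">
<div class="tombstone-container"><p class="period-name">Tonight<br/><br/></p></div>
<div class="tombstone-container"><p class="period-name">Sunday<br/><br/></p></div>
</div>
<p class="period-name">Outside the forecast</p>
"""
page_demo = BeautifulSoup(html, "html.parser")

by_class = page_demo.select(".period-name")                          # every element with the class
inside_id = page_demo.select("#seven-day-forecast .period-name")     # only those inside the id
first_only = page_demo.select_one("#seven-day-forecast .period-name")  # a single Tag, not a list

print(len(by_class), len(inside_id), first_only.get_text())
```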

In [None]:
seven_day.select(".tombstone-container .period-name")

[<p class="period-name">Tonight<br/><br/></p>,
 <p class="period-name">Sunday<br/><br/></p>,
 <p class="period-name">Sunday<br/>Night</p>,
 <p class="period-name">Monday<br/><br/></p>,
 <p class="period-name">Monday<br/>Night</p>,
 <p class="period-name">Tuesday<br/><br/></p>,
 <p class="period-name">Tuesday<br/>Night</p>,
 <p class="period-name">Veterans<br/>Day</p>,
 <p class="period-name">Wednesday<br/>Night</p>]

We will scrape all of the periods, short_descs, temps, and descs using the `select` method, looping over the list it returns each time.

In [None]:
periods = []
for tag in seven_day.select(".tombstone-container .period-name"):
  text = tag.get_text()
  periods.append(text)
print(periods)

['Tonight', 'Sunday', 'SundayNight', 'Monday', 'MondayNight', 'Tuesday', 'TuesdayNight', 'VeteransDay', 'WednesdayNight']


In [None]:
short_descs = []
for sd in seven_day.select(".tombstone-container .short-desc"):
  text = sd.get_text()
  short_descs.append(text)
print(short_descs)

['Clear', 'Patchy Fogthen Sunny', 'Partly Cloudythen PatchyFog', 'Patchy Fogthen MostlySunny', 'Mostly Cloudy', 'Mostly Cloudythen SlightChanceShowers', 'ShowersLikely', 'Showers', 'Showers']
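Names such as 'SundayNight' are run together because the two words are separated only by a `<br/>` tag in the HTML, and `get_text()` joins the text pieces with nothing in between. One possible fix, sketched here on three of the period names from the output above, is to pass a separator and strip the whitespace:

```python
from bs4 import BeautifulSoup

html = """
<p class="period-name">Tonight<br/><br/></p>
<p class="period-name">Sunday<br/>Night</p>
<p class="period-name">Veterans<br/>Day</p>
"""
names = BeautifulSoup(html, "html.parser").select(".period-name")

joined = [tag.get_text() for tag in names]                           # pieces glued together
spaced = [tag.get_text(separator=" ", strip=True) for tag in names]  # pieces joined by a space

print(joined)
print(spaced)
```

The same arguments could be used in the loops in this notebook, at the cost of the outputs no longer matching the ones printed here.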


In [None]:
temps = []
for t in seven_day.select(".tombstone-container .temp"):
  text = t.get_text()
  temps.append(text)
print(temps)

['Low: 45 °F', 'High: 79 °F', 'Low: 50 °F', 'High: 71 °F', 'Low: 54 °F', 'High: 72 °F', 'Low: 64 °F', 'High: 74 °F', 'Low: 63 °F']


In [None]:
descs = []
for d in seven_day.select(".tombstone-container img"):
  text = d['title']
  descs.append(text)
print(descs)

['Tonight: Clear, with a low around 45. Calm wind. ', 'Sunday: Patchy fog before 10am.  Otherwise, sunny, with a high near 79. Calm wind becoming south around 5 mph in the afternoon. ', 'Sunday Night: Patchy fog after 1am.  Otherwise, partly cloudy, with a low around 50. Light and variable wind. ', 'Monday: Patchy fog before 10am.  Otherwise, mostly sunny, with a high near 71. Calm wind becoming south around 5 mph in the afternoon. ', 'Monday Night: Mostly cloudy, with a low around 54. Light south wind. ', 'Tuesday: A slight chance of showers after 1pm.  Partly sunny, with a high near 72. Chance of precipitation is 20%.', 'Tuesday Night: Showers likely, mainly after 1am.  Mostly cloudy, with a low around 64. Chance of precipitation is 70%.', 'Veterans Day: Showers.  High near 74. Chance of precipitation is 90%.', 'Wednesday Night: Showers.  Low around 63. Chance of precipitation is 80%.']


Now that we've scraped all the data we wish to analyze, we can combine the four lists into a DataFrame, with one column per list, and then clean and analyze it using pandas.

In [None]:
weather = pd.DataFrame({
        "period": periods, 
        "short_desc": short_descs, 
        "temp": temps, 
        "desc": descs
    })
weather

| | period | short_desc | temp | desc |
|---|---|---|---|---|
| 0 | Tonight | Clear | Low: 45 °F | Tonight: Clear, with a low around 45. Calm wind. |
| 1 | Sunday | Patchy Fogthen Sunny | High: 79 °F | Sunday: Patchy fog before 10am. Otherwise, su... |
| 2 | SundayNight | Partly Cloudythen PatchyFog | Low: 50 °F | Sunday Night: Patchy fog after 1am. Otherwise... |
| 3 | Monday | Patchy Fogthen MostlySunny | High: 71 °F | Monday: Patchy fog before 10am. Otherwise, mo... |
| 4 | MondayNight | Mostly Cloudy | Low: 54 °F | Monday Night: Mostly cloudy, with a low around... |
| 5 | Tuesday | Mostly Cloudythen SlightChanceShowers | High: 72 °F | Tuesday: A slight chance of showers after 1pm.... |
| 6 | TuesdayNight | ShowersLikely | Low: 64 °F | Tuesday Night: Showers likely, mainly after 1a... |
| 7 | VeteransDay | Showers | High: 74 °F | Veterans Day: Showers. High near 74. Chance o... |
| 8 | WednesdayNight | Showers | Low: 63 °F | Wednesday Night: Showers. Low around 63. Chan... |
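Building the DataFrame from four separate lists works only if every list has the same length and order. An alternative sketch, under the assumption that every container holds all four pieces, is to loop over the containers once and build one row per forecast period. The HTML here is a cut-down copy of two containers, and the name `weather_rows` is hypothetical.

```python
from bs4 import BeautifulSoup
import pandas as pd

html = """
<ul id="seven-day-forecast-list">
<li class="forecast-tombstone"><div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img class="forecast-icon" title="Tonight: Clear, with a low around 45. Calm wind. "/></p>
<p class="short-desc">Clear</p><p class="temp temp-low">Low: 45 °F</p></div></li>
<li class="forecast-tombstone"><div class="tombstone-container">
<p class="period-name">Monday<br/>Night</p>
<p><img class="forecast-icon" title="Monday Night: Mostly cloudy, with a low around 54. Light south wind. "/></p>
<p class="short-desc">Mostly Cloudy</p><p class="temp temp-low">Low: 54 °F</p></div></li>
</ul>
"""
forecast = BeautifulSoup(html, "html.parser")

rows = []
for item in forecast.select(".tombstone-container"):
    # One dict per container keeps the four fields of a period together
    rows.append({
        "period": item.find(class_="period-name").get_text(separator=" ", strip=True),
        "short_desc": item.find(class_="short-desc").get_text(),
        "temp": item.find(class_="temp").get_text(),
        "desc": item.find("img")["title"],
    })

weather_rows = pd.DataFrame(rows)
print(weather_rows)
```

If a container lacked one of the pieces, `find` would return `None` and the loop would raise an error, which at least makes the misalignment visible instead of silently shifting a column.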


In [None]:

# Alternative construction: start with an empty DataFrame and add one column at a time
weather2 = pd.DataFrame()

weather2['period'] = periods
weather2['short_desc'] = short_descs
weather2['temp'] = temps
weather2['desc'] = descs


weather2

| | period | short_desc | temp | desc |
|---|---|---|---|---|
| 0 | Tonight | Clear | Low: 45 °F | Tonight: Clear, with a low around 45. Calm wind. |
| 1 | Sunday | Patchy Fogthen Sunny | High: 79 °F | Sunday: Patchy fog before 10am. Otherwise, su... |
| 2 | SundayNight | Partly Cloudythen PatchyFog | Low: 50 °F | Sunday Night: Patchy fog after 1am. Otherwise... |
| 3 | Monday | Patchy Fogthen MostlySunny | High: 71 °F | Monday: Patchy fog before 10am. Otherwise, mo... |
| 4 | MondayNight | Mostly Cloudy | Low: 54 °F | Monday Night: Mostly cloudy, with a low around... |
| 5 | Tuesday | Mostly Cloudythen SlightChanceShowers | High: 72 °F | Tuesday: A slight chance of showers after 1pm.... |
| 6 | TuesdayNight | ShowersLikely | Low: 64 °F | Tuesday Night: Showers likely, mainly after 1a... |
| 7 | VeteransDay | Showers | High: 74 °F | Veterans Day: Showers. High near 74. Chance o... |
| 8 | WednesdayNight | Showers | Low: 63 °F | Wednesday Night: Showers. Low around 63. Chan... |
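As one example of the cleaning that pandas makes possible, the sketch below pulls the number out of each `temp` string and flags the overnight lows. The column names `temp_num` and `is_night` and the frame name `weather_demo` are assumptions for illustration; the temperatures are the ones scraped above.

```python
import pandas as pd

temps = ['Low: 45 °F', 'High: 79 °F', 'Low: 50 °F', 'High: 71 °F', 'Low: 54 °F',
         'High: 72 °F', 'Low: 64 °F', 'High: 74 °F', 'Low: 63 °F']
weather_demo = pd.DataFrame({"temp": temps})

# Extract the digits from each string and convert them to integers
weather_demo["temp_num"] = weather_demo["temp"].str.extract(r"(\d+)", expand=False).astype(int)

# A period is an overnight one when its temperature is labelled "Low"
weather_demo["is_night"] = weather_demo["temp"].str.contains("Low")

# Average of the daytime highs only
mean_high = weather_demo.loc[~weather_demo["is_night"], "temp_num"].mean()
print(mean_high)
```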


Because all of the National Weather Service's forecast pages are formatted the same way, you can do this for any other city. All you have to do is change the `lat` and `lon` values in the URL that you started with to the coordinates of the city of your choosing.
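A minimal sketch of building the URL for another location follows. The helper name `forecast_url` is hypothetical, and the second pair of coordinates is only an approximate example (roughly Richmond, VA), so substitute the latitude and longitude you actually need.

```python
from urllib.parse import urlencode

BASE_URL = "http://forecast.weather.gov/MapClick.php"

def forecast_url(lat, lon):
    """Hypothetical helper: build the forecast URL for a latitude/longitude pair."""
    return f"{BASE_URL}?{urlencode({'lat': lat, 'lon': lon})}"

charlottesville = forecast_url(38.0335, -78.5079)  # same URL as at the top of the notebook
other_city = forecast_url(37.5407, -77.4360)       # approximate coordinates, for illustration

print(charlottesville)
print(other_city)
```

Passing the result to `requests.get` and rerunning the cells above should then scrape the forecast for that location, provided the page layout is the same.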