# Web scraping with Python, Pandas, and Beautiful Soup

Determining the 7-day forecast for Charlottesville based on the National Weather Service website.

Adapted from https://www.dataquest.io/blog/web-scraping-tutorial-python/

Outline:
1. Download the web page with our desired content.
2. Create a BeautifulSoup object to parse the page.
3. Find the div with id seven-day-forecast, and assign it to seven_day.
4. Inside seven_day, find each individual forecast item.
5. Extract and print the first forecast item.


## Download the web page 
We can download pages using the Python requests library. The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us.

### Run these commands in a terminal: "pip3 install requests" and "pip3 install beautifulsoup4"

Before we download the page, it helps to get an idea of its structure. We can do this with the developer tools in Chrome (or the equivalent in another browser): https://developer.chrome.com/devtools


### Explore: inspect the elements of the web page, noting the general HTML structure and any elements that may be of use.


In [2]:
import requests

In [3]:
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=38.0335&lon=-78.5079")

In [4]:
page

<Response [200]>

The 200 status code on the response means that the request was successful.
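
The status check can also be done in code rather than by eye. The following is a minimal sketch, not part of the original tutorial: the helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration. It relies on `raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx responses.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its body as bytes.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    # A timeout keeps the script from hanging if the server never answers.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError when the status code is 4xx or 5xx.
    response.raise_for_status()
    return response.content


# Example call (needs network access):
# html = fetch_html("http://forecast.weather.gov/MapClick.php?lat=38.0335&lon=-78.5079")
```

The returned bytes can be passed to BeautifulSoup in the same way as `page.content`.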

Now on to creating a BeautifulSoup object.

In [5]:
from bs4 import BeautifulSoup

In [6]:
soup = BeautifulSoup(page.content, 'html.parser')

Now soup contains the parsed structure of the page. You are welcome to print it with print(soup.prettify()).

We can use the find method to pull out a specific div tag, identified by its id.

In [7]:
seven_day = soup.find(id="seven-day-forecast")
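
For comparison, the same lookup can be written as a CSS selector with `select_one`. The sketch below runs on a small hand-written HTML snippet (an assumption standing in for the real page) so that it works offline.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the forecast page (not the real markup).
html = """
<div id="seven-day-forecast">
  <div class="tombstone-container"><p class="period-name">Tonight</p></div>
</div>
<div id="current-conditions"></div>
"""
soup = BeautifulSoup(html, "html.parser")

by_find = soup.find(id="seven-day-forecast")     # keyword-argument lookup
by_css = soup.select_one("#seven-day-forecast")  # CSS id selector

# Both calls locate the same <div>; find() returns None when nothing matches.
print(by_find["id"], by_css["id"])
print(soup.find(id="no-such-id"))
```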

Here we use the find_all method to select all elements with the class tombstone-container. This returns a list, from which we can select the first element.


In [9]:
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
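
Because `find_all` returns a list-like result, the usual list operations apply, and indexing an empty result with `[0]` would raise `IndexError`. A small offline sketch on made-up HTML (the markup is an assumption, not the real page):

```python
from bs4 import BeautifulSoup

html = """
<div id="seven-day-forecast">
  <div class="tombstone-container"><p class="period-name">Tonight</p></div>
  <div class="tombstone-container"><p class="period-name">Friday</p></div>
</div>
"""
seven_day = BeautifulSoup(html, "html.parser").find(id="seven-day-forecast")

forecast_items = seven_day.find_all(class_="tombstone-container")
print(len(forecast_items))  # number of matching elements

# Guard against an empty result before taking the first element.
first_item = forecast_items[0] if forecast_items else None

# A class that is absent yields an empty list, not an error.
no_items = seven_day.find_all(class_="no-such-class")
print(len(no_items))
```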


In [10]:
print(tonight.prettify())


<div class="tombstone-container">
 <p class="period-name">
  This
  <br/>
  Afternoon
 </p>
 <p>
  <img alt="This Afternoon: Showers likely and possibly a thunderstorm.  Mostly cloudy, with a high near 85. Southeast wind around 7 mph.  Chance of precipitation is 60%. New rainfall amounts of less than a tenth of an inch, except higher amounts possible in thunderstorms. " class="forecast-icon" src="newimages/medium/shra60.png" title="This Afternoon: Showers likely and possibly a thunderstorm.  Mostly cloudy, with a high near 85. Southeast wind around 7 mph.  Chance of precipitation is 60%. New rainfall amounts of less than a tenth of an inch, except higher amounts possible in thunderstorms. "/>
 </p>
 <p class="short-desc">
  Showers
  <br/>
  Likely
 </p>
 <p class="temp temp-high">
  High: 85 °F
 </p>
</div>



We've narrowed the scope a bit so that we have access to the first forecast item's data (stored in the variable tonight, although in this run the first period happens to be This Afternoon).
Four points of interest:
1. The name of the forecast item – in this case, This Afternoon.
2. The description of the conditions – this is stored in the title property of img.
3. A short description of the conditions – in this case, Showers Likely.
4. The temperature – in this case, a high of 85 degrees.

In [12]:

period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

print(period)
print(short_desc)
print(temp)

ThisAfternoon
ShowersLikely
High: 85 °F
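
Note that the words run together in the output (`ThisAfternoon`) because `get_text()` drops the `<br/>` tags without putting anything in their place. One way to keep the words apart, sketched here on a fragment modeled on the printed item, is to pass `separator` and `strip` to `get_text`:

```python
from bs4 import BeautifulSoup

# Fragment modeled on the tombstone-container printed earlier.
html = '<p class="period-name">This<br/>Afternoon</p>'
tag = BeautifulSoup(html, "html.parser").find(class_="period-name")

joined = tag.get_text()                           # <br/> contributes no text
spaced = tag.get_text(separator=" ", strip=True)  # pieces joined by a space

print(repr(joined))
print(repr(spaced))
```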


Now that we can parse an individual period's information, we can generalize this process to all of the periods using CSS selectors.
1. Select all items with the class period-name inside an item with the class tombstone-container in seven_day.
2. Use a list comprehension to call the get_text method on each BeautifulSoup object.


In [13]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['ThisAfternoon',
 'Tonight',
 'Friday',
 'FridayNight',
 'Saturday',
 'SaturdayNight',
 'Sunday',
 'SundayNight',
 'Monday']
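
The space in `.tombstone-container .period-name` is the CSS descendant combinator: it matches `.period-name` elements only when they sit somewhere inside a `.tombstone-container`. The sketch below shows the difference on invented HTML, where the stray label is an assumption added purely for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div id="seven-day-forecast">
  <p class="period-name">Stray label</p>
  <div class="tombstone-container"><p class="period-name">Tonight</p></div>
  <div class="tombstone-container"><p class="period-name">Friday</p></div>
</div>
"""
seven_day = BeautifulSoup(html, "html.parser").find(id="seven-day-forecast")

scoped = [t.get_text() for t in seven_day.select(".tombstone-container .period-name")]
unscoped = [t.get_text() for t in seven_day.select(".period-name")]

print(scoped)    # only labels inside a tombstone-container
print(unscoped)  # also includes the stray label
```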

Let's use the same list-comprehension approach to get the other fields.

In [14]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)

['ShowersLikely', 'ShowersLikely thenChanceShowers andPatchy Fog', 'ChanceShowers andPatchy Fogthen ShowersLikely', 'ChanceShowers', 'ShowersLikely', 'ShowersLikely thenChanceT-storms', 'Partly Sunnythen ChanceT-storms', 'ChanceT-storms', 'ChanceT-storms thenT-storms']
['High: 85 °F', 'Low: 69 °F', 'High: 84 °F', 'Low: 68 °F', 'High: 86 °F', 'Low: 70 °F', 'High: 89 °F', 'Low: 71 °F', 'High: 87 °F']
['This Afternoon: Showers likely and possibly a thunderstorm.  Mostly cloudy, with a high near 85. Southeast wind around 7 mph.  Chance of precipitation is 60%. New rainfall amounts of less than a tenth of an inch, except higher amounts possible in thunderstorms. ', 'Tonight: Showers likely and possibly a thunderstorm before midnight, then a chance of showers and thunderstorms between midnight and 4am, then patchy drizzle after 4am.  Patchy fog after 4am.  Otherwise, mostly cloudy, with a low around 69. Light southeast wind.  Chance of precipitation is 70%. New rainfall amounts between a ten
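
Indexing a tag like a dictionary, as in `d["title"]`, raises `KeyError` if the attribute is missing. If some icons might lack a `title`, `Tag.get` returns `None` instead. A sketch on invented HTML, where the second image omits the attribute on purpose:

```python
from bs4 import BeautifulSoup

# Invented markup: the second <img> has no title attribute.
html = """
<div class="tombstone-container"><img title="Tonight: Mostly clear." src="a.png"/></div>
<div class="tombstone-container"><img src="b.png"/></div>
"""
soup = BeautifulSoup(html, "html.parser")
images = soup.select(".tombstone-container img")

# .get() returns None for a missing attribute instead of raising KeyError.
titles = [img.get("title") for img in images]
print(titles)
```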


Now that we have the data, we can use our pandas DataFrame knowledge to create tables and analyze the data.

In [15]:
import pandas as pd

In [16]:
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs
})
weather

|   | desc | period | short_desc | temp |
|---|------|--------|------------|------|
| 0 | This Afternoon: Showers likely and possibly a ... | ThisAfternoon | ShowersLikely | High: 85 °F |
| 1 | Tonight: Showers likely and possibly a thunder... | Tonight | ShowersLikely thenChanceShowers andPatchy Fog | Low: 69 °F |
| 2 | Friday: Patchy drizzle before 9am, then a chan... | Friday | ChanceShowers andPatchy Fogthen ShowersLikely | High: 84 °F |
| 3 | Friday Night: A chance of showers and thunders... | FridayNight | ChanceShowers | Low: 68 °F |
| 4 | Saturday: A chance of showers before 8am, then... | Saturday | ShowersLikely | High: 86 °F |
| 5 | Saturday Night: Showers likely and possibly a ... | SaturdayNight | ShowersLikely thenChanceT-storms | Low: 70 °F |
| 6 | Sunday: A chance of showers and thunderstorms ... | Sunday | Partly Sunnythen ChanceT-storms | High: 89 °F |
| 7 | Sunday Night: A chance of showers and thunders... | SundayNight | ChanceT-storms | Low: 71 °F |
| 8 | Monday: A chance of showers, then showers and ... | Monday | ChanceT-storms thenT-storms | High: 87 °F |
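
The column lists used for this DataFrame are built independently of one another, so they line up only if every forecast item has every field. An alternative sketch, under the assumption that some items may be incomplete, builds one row per `tombstone-container` instead. The helper name `text_or_none` and the sample HTML are hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented HTML; the second item has no temperature on purpose.
html = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Showers</p>
    <p class="temp temp-low">Low: 69 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Friday</p>
    <p class="short-desc">Sunny</p>
  </div>
</div>
"""
seven_day = BeautifulSoup(html, "html.parser").find(id="seven-day-forecast")


def text_or_none(item, css_class):
    """Return the text of the first matching descendant, or None if absent."""
    tag = item.find(class_=css_class)
    return tag.get_text(separator=" ", strip=True) if tag is not None else None


# One dictionary per forecast item keeps each item's fields together.
rows = [
    {
        "period": text_or_none(item, "period-name"),
        "short_desc": text_or_none(item, "short-desc"),
        "temp": text_or_none(item, "temp"),
    }
    for item in seven_day.find_all(class_="tombstone-container")
]
weather_by_item = pd.DataFrame(rows)
print(weather_by_item)
```

A missing field then shows up as an empty cell in that row rather than shifting every later value up by one.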


From here we can apply what we learned about DataFrame manipulation (week_1 pandas tutorial).
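
As one example of such manipulation, the sketch below pulls the numeric temperature out of the `temp` strings and flags the night-time rows. It uses a small DataFrame typed in by hand from the first four forecast periods; the column names `temp_num` and `is_night` are illustrative choices, not part of the scraped data.

```python
import pandas as pd

# Hand-typed copy of the first four forecast periods.
weather = pd.DataFrame({
    "period": ["ThisAfternoon", "Tonight", "Friday", "FridayNight"],
    "temp": ["High: 85 °F", "Low: 69 °F", "High: 84 °F", "Low: 68 °F"],
})

# Pull the digits out of strings such as "High: 85 °F".
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype(int)
# Night-time periods report a low rather than a high.
weather["is_night"] = weather["temp"].str.contains("Low")

print(weather["temp_num"].mean())
print(weather[weather["is_night"]])
```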


### Explore: try to scrape the data from http://money.cnn.com/data/markets/

General outline:
download, parse, search divs, extract using CSS selectors, make into a DataFrame.
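
The parse, search, extract, and DataFrame steps can be sketched as one reusable function. Everything named here is an assumption: the function `extract_records`, the sample markup, and the selectors are invented for illustration, and the real selectors for the CNN page have to be found by inspecting it with the developer tools. The download step is left out so the sketch runs offline; in practice the HTML would come from `requests.get(url).content`.

```python
import pandas as pd
from bs4 import BeautifulSoup


def extract_records(html, item_selector, field_selectors):
    """Parse html and return a DataFrame with one row per matched item.

    field_selectors maps a column name to a CSS selector that is
    evaluated relative to each matched item.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(item_selector):
        row = {}
        for column, selector in field_selectors.items():
            tag = item.select_one(selector)
            row[column] = tag.get_text(strip=True) if tag is not None else None
        rows.append(row)
    return pd.DataFrame(rows)


# Invented markup and selectors; the real page must be inspected first.
sample_html = """
<ul>
  <li class="market"><span class="name">Index A</span><span class="change">+0.5%</span></li>
  <li class="market"><span class="name">Index B</span><span class="change">-0.2%</span></li>
</ul>
"""
markets = extract_records(
    sample_html,
    "li.market",
    {"name": ".name", "change": ".change"},
)
print(markets)
```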
