# Web Scraping with Python Using Beautiful Soup

## Extracting San Francisco Weather Data from the National Weather Service Website

Note: This notebook follows the tutorial by Vik Paruchuri, CEO and founder of Dataquest. The original tutorial is at https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/

In [1]:
# Importing the necessary libraries
import requests
from bs4 import BeautifulSoup

In [2]:
# Download the webpage containing the forecast
page = requests.get('https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168')
soup = BeautifulSoup(page.content, 'html.parser')
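
A slightly more defensive version of the download step might look like the sketch below. The helper name `fetch_forecast_soup`, the 10-second timeout, and the `User-Agent` string are illustrative assumptions and not part of the original tutorial. `raise_for_status()` turns an HTTP error response into an exception, so an error page is never parsed as if it were the forecast.

```python
import requests
from bs4 import BeautifulSoup

FORECAST_URL = "https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168"


def fetch_forecast_soup(url=FORECAST_URL, timeout=10):
    """Download a page and parse it, failing loudly on HTTP errors."""
    # Hypothetical User-Agent value; replace it with your own contact details.
    headers = {"User-Agent": "weather-scraping-notebook (example@example.com)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return BeautifulSoup(response.content, "html.parser")
```

Calling `soup = fetch_forecast_soup()` would then stand in for the two lines of the cell above.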

In [8]:
# Prints the entire HTML of the page
# To keep this notebook short on GitHub, the call is commented out:
# print(soup.prettify())

In [4]:
# Select only the element that holds the seven-day forecast
seven_day = soup.find(id='seven-day-forecast')
seven_day

<div class="panel panel-default" id="seven-day-forecast">
<div class="panel-heading">
<b>Extended Forecast for</b>
<h2 class="panel-title">
	    	    San Francisco CA	</h2>
</div>
<div class="panel-body" id="seven-day-forecast-body">
<div id="seven-day-forecast-container"><ul class="list-unstyled" id="seven-day-forecast-list"><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">This<br/>Afternoon</p>
<p><img alt="This Afternoon: Sunny, with a high near 74. West wind 9 to 14 mph, with gusts as high as 18 mph. " class="forecast-icon" src="newimages/medium/skc.png" title="This Afternoon: Sunny, with a high near 74. West wind 9 to 14 mph, with gusts as high as 18 mph. "/></p><p class="short-desc">Sunny</p><p class="temp temp-high">High: 74 °F</p></div></li><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Mostly clear, with a low around 55. West wind 5 to 11 mph. " class="
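
`soup.find` returns `None` when nothing matches, so a changed page layout would only surface later as an `AttributeError`. The sketch below shows one way to fail early with a clearer message. The helper name `find_forecast_section` and the two tiny stand-in documents are assumptions made for illustration; they are not the real NWS page.

```python
from bs4 import BeautifulSoup


def find_forecast_section(soup, section_id="seven-day-forecast"):
    """Return the forecast element, or raise a clear error if it is absent."""
    section = soup.find(id=section_id)
    if section is None:
        raise LookupError(f"no element with id={section_id!r} in the page")
    return section


# Tiny stand-in documents, not the real NWS page.
good = BeautifulSoup('<div id="seven-day-forecast"><p>ok</p></div>', "html.parser")
bad = BeautifulSoup('<div id="something-else"></div>', "html.parser")

print(find_forecast_section(good).p.get_text())
try:
    find_forecast_section(bad)
except LookupError as err:
    print("lookup failed:", err)
```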

### Part 1: Extracting information from the page 

This section tests the lookups on a single forecast item to confirm where each piece of information lives in the HTML.

In [6]:
# Each forecast period sits in its own div with class "tombstone-container"
forecast_items = seven_day.find_all(class_='tombstone-container')

In [7]:
# The first item is the nearest forecast period ("Tonight" when this cell ran)
tonight = forecast_items[0]

In [8]:
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Tonight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Tonight: Patchy drizzle after 11pm.  Mostly cloudy, with a low around 53. West wind 11 to 18 mph, with gusts as high as 24 mph. " class="forecast-icon" src="newimages/medium/nra.png" title="Tonight: Patchy drizzle after 11pm.  Mostly cloudy, with a low around 53. West wind 11 to 18 mph, with gusts as high as 24 mph. "/>
 </p>
 <p class="short-desc">
  Patchy
  <br/>
  Drizzle
 </p>
 <p class="temp temp-low">
  Low: 53 °F
 </p>
</div>


In [9]:
period = tonight.find(class_='period-name').get_text()

In [10]:
print(period)

Tonight


In [11]:
short_desc = tonight.find(class_='short-desc').get_text()
print(short_desc)

PatchyDrizzle
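
The words run together because the two lines of text are separated only by a `<br/>` tag, and `get_text()` joins text fragments with an empty string by default. A minimal sketch of one fix, using a hand-written copy of the `short-desc` paragraph shown above:

```python
from bs4 import BeautifulSoup

# Hand-written copy of the short-desc paragraph from the output above.
html = '<p class="short-desc">Patchy<br/>Drizzle</p>'
tag = BeautifulSoup(html, "html.parser").find(class_="short-desc")

joined = tag.get_text()                           # fragments joined with ""
spaced = tag.get_text(separator=" ", strip=True)  # fragments joined with " "

print(joined)  # PatchyDrizzle
print(spaced)  # Patchy Drizzle
```

Passing `strip=True` also drops fragments that are only whitespace, which keeps stray spaces out of the result.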


In [12]:
temp = tonight.find(class_="temp").get_text()
print(temp)

Low: 53 °F


In [13]:
img = tonight.find("img")
desc = img['title']
print(desc)

Tonight: Patchy drizzle after 11pm.  Mostly cloudy, with a low around 53. West wind 11 to 18 mph, with gusts as high as 24 mph. 
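
The four lookups above can be gathered into one helper that returns a dictionary per forecast item. This is a sketch under assumptions: the names `text_of` and `parse_forecast_item` are made up for illustration, the sample markup is a compact copy of the "Tonight" item printed earlier, and returning `None` for a missing element is one possible design choice.

```python
from bs4 import BeautifulSoup

# Compact copy of the "Tonight" item printed earlier in this section.
SAMPLE = """
<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img class="forecast-icon" src="newimages/medium/nra.png" title="Tonight: Patchy drizzle after 11pm.  Mostly cloudy, with a low around 53. West wind 11 to 18 mph, with gusts as high as 24 mph. "/></p>
<p class="short-desc">Patchy<br/>Drizzle</p>
<p class="temp temp-low">Low: 53 °F</p>
</div>
"""


def text_of(item, css_class):
    """Text of the first descendant with the given class, or None if absent."""
    tag = item.find(class_=css_class)
    return tag.get_text(separator=" ", strip=True) if tag else None


def parse_forecast_item(item):
    """Collect the four fields of one forecast item into a dictionary."""
    img = item.find("img")
    return {
        "period": text_of(item, "period-name"),
        "short_desc": text_of(item, "short-desc"),
        "temp": text_of(item, "temp"),
        "desc": img.get("title", "").strip() if img else None,
    }


item = BeautifulSoup(SAMPLE, "html.parser").find(class_="tombstone-container")
record = parse_forecast_item(item)
print(record)
```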


### Part 2: Extracting ALL the information on the page

Now that we know where each piece of information lives, we can extract every forecast period at once.

In [14]:
# Extracting all the periods as a list
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Tonight',
 'Saturday',
 'SaturdayNight',
 'Sunday',
 'SundayNight',
 'Monday',
 'MondayNight',
 'Tuesday',
 'TuesdayNight']
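
The selector `.tombstone-container .period-name` is a CSS descendant selector: it matches only `period-name` elements nested inside a `tombstone-container`. The sketch below illustrates that scoping on a small stand-in document (the markup is an assumption, not the real page), and passes a separator to `get_text()` so that names split by `<br/>` keep their space.

```python
from bs4 import BeautifulSoup

# Stand-in markup: one period-name outside any container, two inside.
html = """
<p class="period-name">Outside</p>
<div class="tombstone-container"><p class="period-name">Saturday<br/>Night</p></div>
<div class="tombstone-container"><p class="period-name">Sunday<br/><br/></p></div>
"""
doc = BeautifulSoup(html, "html.parser")

# Only the two nested paragraphs match; results come back in document order.
tags = doc.select(".tombstone-container .period-name")
names = [t.get_text(separator=" ", strip=True) for t in tags]
print(names)  # ['Saturday Night', 'Sunday']
```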

In [16]:
# Extracting all the short descriptions as a list
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
short_descs

['PatchyDrizzle',
 'Sunny',
 'Mostly Clear',
 'Sunny',
 'Clear',
 'Sunny',
 'Clear',
 'Sunny',
 'Mostly Clear']

In [17]:
# Extracting all the temperature descriptions as a list
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
temps

['Low: 53 °F',
 'High: 64 °F',
 'Low: 52 °F',
 'High: 69 °F',
 'Low: 53 °F',
 'High: 72 °F',
 'Low: 54 °F',
 'High: 71 °F',
 'Low: 52 °F']

In [21]:
# Extracting all the long descriptions from the images as a list
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
descs

['Tonight: Patchy drizzle after 11pm.  Mostly cloudy, with a low around 53. West wind 11 to 18 mph, with gusts as high as 24 mph. ',
 'Saturday: Sunny, with a high near 64. West wind 8 to 18 mph, with gusts as high as 23 mph. ',
 'Saturday Night: Mostly clear, with a low around 52. West wind 8 to 13 mph. ',
 'Sunday: Sunny, with a high near 69. West wind 6 to 9 mph. ',
 'Sunday Night: Clear, with a low around 53. West northwest wind 8 to 14 mph, with gusts as high as 18 mph. ',
 'Monday: Sunny, with a high near 72.',
 'Monday Night: Clear, with a low around 54.',
 'Tuesday: Sunny, with a high near 71.',
 'Tuesday Night: Mostly clear, with a low around 52.']
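
The four lists above come from four independent selectors, so they line up row for row only if every forecast item contains every element. Assuming a page could include an item that lacks one of them (a temperature, for example), a quick length check catches the mismatch before the lists are combined. The helper name `check_aligned` and the sample values are hypothetical.

```python
def check_aligned(**columns):
    """Return each column's length, raising if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return lengths


# Hypothetical values: two periods but only one temperature.
try:
    check_aligned(periods=["Tonight", "Saturday"], temps=["Low: 53 °F"])
except ValueError as err:
    print(err)

print(check_aligned(periods=["Tonight"], temps=["Low: 53 °F"]))
```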

### Part 3: Combining data into a pandas DataFrame

Now that all the data has been extracted, we can load it into a DataFrame for analysis.

In [22]:
import pandas as pd
# Creating a dataframe from the lists pulled above
weather = pd.DataFrame({
    "period": periods,
    "short_des": short_descs,
    "temp": temps,
    "desc": descs
})
weather

| | period | short_des | temp | desc |
|---|---|---|---|---|
| 0 | Tonight | PatchyDrizzle | Low: 53 °F | Tonight: Patchy drizzle after 11pm. Mostly cl... |
| 1 | Saturday | Sunny | High: 64 °F | Saturday: Sunny, with a high near 64. West win... |
| 2 | SaturdayNight | Mostly Clear | Low: 52 °F | Saturday Night: Mostly clear, with a low aroun... |
| 3 | Sunday | Sunny | High: 69 °F | Sunday: Sunny, with a high near 69. West wind ... |
| 4 | SundayNight | Clear | Low: 53 °F | Sunday Night: Clear, with a low around 53. Wes... |
| 5 | Monday | Sunny | High: 72 °F | Monday: Sunny, with a high near 72. |
| 6 | MondayNight | Clear | Low: 54 °F | Monday Night: Clear, with a low around 54. |
| 7 | Tuesday | Sunny | High: 71 °F | Tuesday: Sunny, with a high near 71. |
| 8 | TuesdayNight | Mostly Clear | Low: 52 °F | Tuesday Night: Mostly clear, with a low around... |


In [9]:
# Extract only the numeric part of each entry in the 'temp' column
# (raw string, so the regex escape \d reaches the regex engine unchanged)
temp_nums = weather["temp"].str.extract(r'(\d+)', expand=False)

In [35]:
# Store the extracted digits as integers in a new column
weather["temp_num"] = temp_nums.astype('int')
temp_nums

0    53
1    64
2    52
3    69
4    53
5    72
6    54
7    71
8    52
Name: temp, dtype: object
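
The output above reports `dtype: object` because `str.extract` returns the matched digits as strings; `astype('int')` is what makes arithmetic on them possible. The sketch below repeats both steps on a small stand-in Series. The `"N/A"` entry is a made-up value, added to show that `pd.to_numeric(..., errors="coerce")` is one way to tolerate entries without digits, where `astype(int)` would raise.

```python
import pandas as pd

# Clean data: every entry contains digits, so astype(int) works.
clean = pd.Series(["Low: 53 °F", "High: 64 °F", "Low: 52 °F"], name="temp")
clean_nums = clean.str.extract(r"(\d+)", expand=False).astype(int)
print(clean_nums.tolist())  # [53, 64, 52]

# Stand-in data with a made-up "N/A" entry that contains no digits.
messy = pd.Series(["Low: 53 °F", "High: 64 °F", "N/A"], name="temp")
digits = messy.str.extract(r"(\d+)", expand=False)   # missing value where nothing matched
messy_nums = pd.to_numeric(digits, errors="coerce")  # missing value becomes NaN
print(messy_nums.isna().tolist())  # [False, False, True]
```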

In [36]:
# Mean of all the high and low temperatures
weather['temp_num'].mean()

60.0

In [37]:
# Selecting only the Night rows
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night

0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
Name: temp, dtype: bool

In [38]:
# Displays the DataFrame filtered to only the "night" temperatures and descriptions
weather[is_night]

| | period | short_des | temp | desc | temp_num | is_night |
|---|---|---|---|---|---|---|
| 0 | Tonight | PatchyDrizzle | Low: 53 °F | Tonight: Patchy drizzle after 11pm. Mostly cl... | 53 | True |
| 2 | SaturdayNight | Mostly Clear | Low: 52 °F | Saturday Night: Mostly clear, with a low aroun... | 52 | True |
| 4 | SundayNight | Clear | Low: 53 °F | Sunday Night: Clear, with a low around 53. Wes... | 53 | True |
| 6 | MondayNight | Clear | Low: 54 °F | Monday Night: Clear, with a low around 54. | 54 | True |
| 8 | TuesdayNight | Mostly Clear | Low: 52 °F | Tuesday Night: Mostly clear, with a low around... | 52 | True |
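
The same boolean column can also separate highs from lows in one step. The sketch below rebuilds the `period` and `temp` columns from the values shown earlier, then uses `~` to invert the mask for the daytime rows and `groupby` to average each group. The variable names `day_rows` and `means` are illustrative choices.

```python
import pandas as pd

# Rebuilt from the values shown earlier in this notebook.
weather = pd.DataFrame({
    "period": ["Tonight", "Saturday", "SaturdayNight", "Sunday", "SundayNight",
               "Monday", "MondayNight", "Tuesday", "TuesdayNight"],
    "temp": ["Low: 53 °F", "High: 64 °F", "Low: 52 °F", "High: 69 °F", "Low: 53 °F",
             "High: 72 °F", "Low: 54 °F", "High: 71 °F", "Low: 52 °F"],
})
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype(int)
weather["is_night"] = weather["temp"].str.contains("Low")

# Invert the mask to keep only the daytime (High) rows.
day_rows = weather[~weather["is_night"]]

# Average temperature per group, keyed by the is_night flag.
means = weather.groupby("is_night")["temp_num"].mean().to_dict()
night_mean = means[True]   # 52.8 for the data above
day_mean = means[False]    # 69.0 for the data above
print(night_mean, day_mean)
```

Averaging highs and lows separately is often more informative than the single combined mean of 60.0 computed above.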
