# Web Scraping with Python Using Beautiful Soup

## Extracting San Francisco Weather Data from the National Weather Service Website

Note: This notebook follows the tutorial by Vik Paruchuri, founder and CEO of Dataquest. The original tutorial is at https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/

In [1]:
# Importing the necessary libraries
import requests
from bs4 import BeautifulSoup

In [2]:
# Download the webpage containing the forecast
page = requests.get('https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.YIC6nehKhyw')
soup = BeautifulSoup(page.content, 'html.parser')
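
The request above has no timeout and does not check whether the download succeeded. A minimal sketch of a more defensive download follows. The helper name `download_html`, its `get` parameter (present only so the function can be exercised without network access), and the 10-second timeout are assumptions for illustration, not part of the original tutorial.

```python
import requests


def download_html(url, get=requests.get, timeout=10):
    """Download a page and return its raw bytes, failing loudly on HTTP errors."""
    # The timeout stops the call from waiting forever on an unresponsive server
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.content
```

The returned bytes can then be passed to `BeautifulSoup(..., 'html.parser')` exactly as `page.content` is above.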

In [3]:
# Prints the entire HTML of the page
# To keep this notebook short on GitHub, the call is commented out:
# print(soup.prettify())

In [4]:
# Select only the element that holds the seven-day forecast
seven_day = soup.find(id='seven-day-forecast')
seven_day

<div class="panel panel-default" id="seven-day-forecast">
<div class="panel-heading">
<b>Extended Forecast for</b>
<h2 class="panel-title">
	    	    San Francisco CA	</h2>
</div>
<div class="panel-body" id="seven-day-forecast-body">
<div id="seven-day-forecast-container"><ul class="list-unstyled" id="seven-day-forecast-list"><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Mostly clear, with a low around 56. West wind 5 to 15 mph, with gusts as high as 18 mph. " class="forecast-icon" src="newimages/medium/nfew.png" title="Tonight: Mostly clear, with a low around 56. West wind 5 to 15 mph, with gusts as high as 18 mph. "/></p><p class="short-desc">Mostly Clear</p><p class="temp temp-low">Low: 56 °F</p></div></li><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Tuesday<br/><br/></p>
<p><img alt="Tuesday: Sunny, with a high near 70. Light west wind increasing to 9 t

### Part 1: Extracting information from the page 

This section works with a single forecast item to find out where each piece of information we want lives in the HTML.

In [5]:
forecast_items = seven_day.find_all(class_='tombstone-container')

In [6]:
tonight = forecast_items[0]
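
Both lookups so far assume the page has the expected structure. `find` returns `None` when nothing matches and `find_all` returns an empty list, so a changed page layout would surface as a confusing `AttributeError` or `IndexError`. The sketch below wraps the two lookups with explicit checks; the function name `first_forecast_item` and the tiny `sample_html` string are hypothetical stand-ins, not the real page.

```python
from bs4 import BeautifulSoup


def first_forecast_item(soup_obj):
    """Return the first forecast item, with clear errors if the layout changed."""
    seven_day = soup_obj.find(id="seven-day-forecast")
    if seven_day is None:
        raise ValueError("No element with id='seven-day-forecast' on the page")
    items = seven_day.find_all(class_="tombstone-container")
    if not items:
        raise ValueError("The forecast contains no 'tombstone-container' items")
    return items[0]


# Hypothetical, heavily trimmed stand-in for the downloaded page
sample_html = (
    '<div id="seven-day-forecast">'
    '<div class="tombstone-container"><p class="period-name">Tonight</p></div>'
    '</div>'
)
first_item = first_forecast_item(BeautifulSoup(sample_html, "html.parser"))
print(first_item.find(class_="period-name").get_text())
```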

In [7]:
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Tonight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Tonight: Mostly clear, with a low around 56. West wind 5 to 15 mph, with gusts as high as 18 mph. " class="forecast-icon" src="newimages/medium/nfew.png" title="Tonight: Mostly clear, with a low around 56. West wind 5 to 15 mph, with gusts as high as 18 mph. "/>
 </p>
 <p class="short-desc">
  Mostly Clear
 </p>
 <p class="temp temp-low">
  Low: 56 °F
 </p>
</div>


In [8]:
period = tonight.find(class_='period-name').get_text()

In [9]:
print(period)

Tonight


In [10]:
short_desc = tonight.find(class_='short-desc').get_text()
print(short_desc)

Mostly Clear


In [11]:
temp = tonight.find(class_="temp").get_text()
print(temp)

Low: 56 °F


In [12]:
img = tonight.find("img")
desc = img['title']
print(desc)

Tonight: Mostly clear, with a low around 56. West wind 5 to 15 mph, with gusts as high as 18 mph. 
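
The four lookups above can be gathered into one helper that returns a dictionary per forecast item. This is only a sketch: `parse_item` and `text_or_none` are hypothetical names, the dictionary keys are chosen for illustration, and `item_html` is a shortened copy of the "Tonight" item printed earlier (the long description is cut after the first sentence).

```python
from bs4 import BeautifulSoup

# Shortened copy of the "Tonight" item; the title attribute is truncated here
item_html = """
<div class="tombstone-container">
 <p class="period-name">Tonight<br/><br/></p>
 <p><img class="forecast-icon" src="newimages/medium/nfew.png"
         title="Tonight: Mostly clear, with a low around 56."/></p>
 <p class="short-desc">Mostly Clear</p>
 <p class="temp temp-low">Low: 56 °F</p>
</div>
"""


def text_or_none(parent, class_name):
    """Return the text of the first tag with this class, or None if absent."""
    tag = parent.find(class_=class_name)
    return tag.get_text() if tag is not None else None


def parse_item(item):
    """Collect period, short description, temperature and long description."""
    img = item.find("img")
    return {
        "period": text_or_none(item, "period-name"),
        "short_desc": text_or_none(item, "short-desc"),
        "temp": text_or_none(item, "temp"),
        "desc": img.get("title") if img is not None else None,
    }


item = BeautifulSoup(item_html, "html.parser").find(class_="tombstone-container")
parsed = parse_item(item)
print(parsed)
```

Returning `None` for a missing field, rather than raising, is a design choice made here so that one incomplete item does not stop the whole scrape.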


### Part 2: Extracting ALL the information on the page

Now that we know how to pull each piece of information from a single forecast item, we can use CSS selectors to extract the same pieces from every item at once.

In [13]:
# Extracting all the periods as a list
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Tonight',
 'Tuesday',
 'TuesdayNight',
 'Wednesday',
 'WednesdayNight',
 'Thursday',
 'ThursdayNight',
 'Friday',
 'FridayNight']
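
Entries such as `'TuesdayNight'` run together because `get_text()` concatenates the text on either side of a `<br/>` tag with nothing in between. One possible fix, sketched below, is to pass a separator and strip whitespace. The `snippet` markup is an assumption about how the page splits the period name; the same approach should apply to any tag that contains `<br/>`.

```python
from bs4 import BeautifulSoup

# Assumed markup: the two words of the period name are split by a <br/>
snippet = '<p class="period-name">Tuesday<br/>Night</p>'
tag = BeautifulSoup(snippet, "html.parser").p

joined = tag.get_text()                           # text pieces glued together
spaced = tag.get_text(separator=" ", strip=True)  # pieces joined by a space

print(repr(joined))  # 'TuesdayNight'
print(repr(spaced))  # 'Tuesday Night'
```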

In [14]:
# Extracting all the short descriptions as a list
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
short_descs

['Mostly Clear',
 'Sunny',
 'Mostly Clear',
 'Sunny',
 'Mostly Clear',
 'Mostly Sunny',
 'Mostly Clear',
 'Sunny thenSunny andBreezy',
 'Clear andBreezy thenClear']

In [15]:
# Extracting all the temperature descriptions as a list
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
temps

['Low: 56 °F',
 'High: 70 °F',
 'Low: 51 °F',
 'High: 67 °F',
 'Low: 51 °F',
 'High: 64 °F',
 'Low: 50 °F',
 'High: 65 °F',
 'Low: 51 °F']

In [16]:
# Extracting all the long descriptions from the images as a list
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
descs

['Tonight: Mostly clear, with a low around 56. West wind 5 to 15 mph, with gusts as high as 18 mph. ',
 'Tuesday: Sunny, with a high near 70. Light west wind increasing to 9 to 14 mph in the afternoon. Winds could gust as high as 18 mph. ',
 'Tuesday Night: Mostly clear, with a low around 51. West southwest wind 6 to 14 mph, with gusts as high as 18 mph. ',
 'Wednesday: Sunny, with a high near 67. Light west southwest wind increasing to 8 to 13 mph in the afternoon. ',
 'Wednesday Night: Mostly clear, with a low around 51. West wind 9 to 11 mph. ',
 'Thursday: Mostly sunny, with a high near 64.',
 'Thursday Night: Mostly clear, with a low around 50.',
 'Friday: Sunny, with a high near 65. Breezy. ',
 'Friday Night: Clear, with a low around 51. Breezy. ']

### Part 3: Combining data into a pandas DataFrame

Now that we have extracted all the data, let's combine it into a pandas DataFrame, which makes it easier to analyze.

In [17]:
import pandas as pd
# Creating a dataframe from the lists pulled above
weather = pd.DataFrame({
    "period": periods,
    "short_des": short_descs,
    "temp": temps,
    "desc": descs
})
weather

| | period | short_des | temp | desc |
|---|---|---|---|---|
| 0 | Tonight | Mostly Clear | Low: 56 °F | Tonight: Mostly clear, with a low around 56. W... |
| 1 | Tuesday | Sunny | High: 70 °F | Tuesday: Sunny, with a high near 70. Light wes... |
| 2 | TuesdayNight | Mostly Clear | Low: 51 °F | Tuesday Night: Mostly clear, with a low around... |
| 3 | Wednesday | Sunny | High: 67 °F | Wednesday: Sunny, with a high near 67. Light w... |
| 4 | WednesdayNight | Mostly Clear | Low: 51 °F | Wednesday Night: Mostly clear, with a low arou... |
| 5 | Thursday | Mostly Sunny | High: 64 °F | Thursday: Mostly sunny, with a high near 64. |
| 6 | ThursdayNight | Mostly Clear | Low: 50 °F | Thursday Night: Mostly clear, with a low aroun... |
| 7 | Friday | Sunny thenSunny andBreezy | High: 65 °F | Friday: Sunny, with a high near 65. Breezy. |
| 8 | FridayNight | Clear andBreezy thenClear | Low: 51 °F | Friday Night: Clear, with a low around 51. Bre... |
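
Building the DataFrame from four separately scraped lists only works when all lists have the same length; pandas raises a `ValueError` otherwise, which can happen if one forecast item lacks a field. The sketch below checks the lengths first so the error names the offending column. The function `build_weather_frame` is a hypothetical helper, and the two-row input is just the first two rows of the data above.

```python
import pandas as pd


def build_weather_frame(periods, short_descs, temps, descs):
    """Build the weather DataFrame after checking that all lists line up."""
    columns = {
        "period": periods,
        "short_des": short_descs,
        "temp": temps,
        "desc": descs,
    }
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return pd.DataFrame(columns)


frame = build_weather_frame(
    ["Tonight", "Tuesday"],
    ["Mostly Clear", "Sunny"],
    ["Low: 56 °F", "High: 70 °F"],
    ["Tonight: Mostly clear, with a low around 56.",
     "Tuesday: Sunny, with a high near 70."],
)
print(frame.shape)
```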


In [18]:
# Get only the numerical temperature from the column 'temp'
temp_nums = weather["temp"].str.extract(r'(\d+)', expand=False)
temp_nums

0    56
1    70
2    51
3    67
4    51
5    64
6    50
7    65
8    51
Name: temp, dtype: object
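
The `dtype: object` in the output shows that the extracted values are still strings. The pattern `(\d+)` captures the first run of digits in each entry, and the conversion to numbers is a separate step. The same logic can be sketched with the standard `re` module on three of the temperature strings above; the variable names here are illustrative only.

```python
import re

temps = ["Low: 56 °F", "High: 70 °F", "Low: 51 °F"]

# (\d+) captures the first run of one or more digits in each string
pattern = re.compile(r"(\d+)")
temp_strings = [pattern.search(t).group(1) for t in temps]

# The captured groups are strings; int() converts them for arithmetic
temp_ints = [int(s) for s in temp_strings]

print(temp_strings)  # ['56', '70', '51']
print(temp_ints)     # [56, 70, 51]
```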

In [19]:
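# Convert the extracted strings to integers and store them in a new column
weather["temp_num"] = temp_nums.astype('int')
# temp_nums itself is unchanged, so it still displays as dtype object
temp_nums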

0    56
1    70
2    51
3    67
4    51
5    64
6    50
7    65
8    51
Name: temp, dtype: object

In [20]:
# Mean of all the high and low temperatures
weather['temp_num'].mean()

58.333333333333336

In [21]:
# Build a boolean mask that is True for night rows (their temp starts with "Low")
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night

0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
Name: temp, dtype: bool

In [22]:
# Display only the rows with night temperatures and descriptions
weather[is_night]

| | period | short_des | temp | desc | temp_num | is_night |
|---|---|---|---|---|---|---|
| 0 | Tonight | Mostly Clear | Low: 56 °F | Tonight: Mostly clear, with a low around 56. W... | 56 | True |
| 2 | TuesdayNight | Mostly Clear | Low: 51 °F | Tuesday Night: Mostly clear, with a low around... | 51 | True |
| 4 | WednesdayNight | Mostly Clear | Low: 51 °F | Wednesday Night: Mostly clear, with a low arou... | 51 | True |
| 6 | ThursdayNight | Mostly Clear | Low: 50 °F | Thursday Night: Mostly clear, with a low aroun... | 50 | True |
| 8 | FridayNight | Clear andBreezy thenClear | Low: 51 °F | Friday Night: Clear, with a low around 51. Bre... | 51 | True |
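
The same boolean mask can be inverted with `~` to select the daytime rows, and either selection can feed `mean()` to compare highs and lows separately. The sketch below rebuilds only the `temp` and `temp_num` columns from the values above in a small frame named `weather_sample` (a hypothetical name) so that it runs on its own.

```python
import pandas as pd

# Only the columns needed here, copied from the scraped values above
weather_sample = pd.DataFrame({
    "temp": ["Low: 56 °F", "High: 70 °F", "Low: 51 °F", "High: 67 °F",
             "Low: 51 °F", "High: 64 °F", "Low: 50 °F", "High: 65 °F",
             "Low: 51 °F"],
    "temp_num": [56, 70, 51, 67, 51, 64, 50, 65, 51],
})

is_night = weather_sample["temp"].str.contains("Low")

# ~ flips the mask, so day_rows holds every row that is not a night row
day_rows = weather_sample[~is_night]
night_rows = weather_sample[is_night]

day_mean = day_rows["temp_num"].mean()
night_mean = night_rows["temp_num"].mean()
print(day_mean, night_mean)
```

For this forecast the daytime highs average about 66.5 °F and the overnight lows about 51.8 °F, which is more informative than the combined mean of 58.3 °F.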
