# Python Web Scraping Tutorial using BeautifulSoup

**Note**: This tutorial is a reproduction of DataQuest's _Python Web Scraping Tutorial using BeautifulSoup_. The original link can be found here: https://www.dataquest.io/blog/web-scraping-tutorial-python/

## Objective

The goal of this exercise is to scrape the National Weather Service website for last week's data in Washington, DC, store the data in a dataframe and then perform a simple analysis of the data with tools learned in Pandas. This tutorial is derived from the DataQuest tutorial found: https://www.dataquest.io/blog/web-scraping-tutorial-python/

This is the webpage that we will be scraping from to gather data on the extended forecast:

https://forecast.weather.gov/MapClick.php?lat=38.8904&lon=-77.032#.W2IV8NhKiuU

### Prep work

1. Find the Extended Forcast on the webpage
2. Use Chrome tools to "Inspect" the page
3. Look for the Extended Forecase text
4. Find the "outermost" element that contains all the forecast text
5. What kind of tag is this? --> Div tag with id "seven-day-forecast"
6. What is "Tonight" and the following days contained in? --> div, "tombstone-container"
7. Drill down into the tombstone-container and see how the elements are stored

In [1]:
# Import libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup
from pprint import pprint

We will be using requests.get method whcih can be found here:

http://docs.python-requests.org/en/master/user/quickstart/#make-a-request

In [28]:
# Request the page to download the entire page
# It should read "200" for success
lat = '38.8904'
long = '-77.032'

page = requests.get("https://forecast.weather.gov/MapClick.php?lat={}&lon={}#.W2IYD9hKiuV".format(lat, long))

#page = requests.get("https://forecast.weather.gov/MapClick.php?lat=38.8904&lon=-77.032#.W2IYD9hKiuV")
#page.status_code

200

In [3]:
# Let's take a look at the page we downloaded
page.content



In [4]:
# Create a BS object to help use parse the html
soup = BeautifulSoup(page.content, 'html.parser')

# Use prettify to format the html content
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js">
 <head>
  <!-- Meta -->
  <meta content="width=device-width" name="viewport"/>
  <link href="http://purl.org/dc/elements/1.1/" rel="schema.DC"/>
  <title>
   National Weather Service
  </title>
  <meta content="National Weather Service" name="DC.title">
   <meta content="NOAA National Weather Service National Weather Service" name="DC.description"/>
   <meta content="US Department of Commerce, NOAA, National Weather Service" name="DC.creator"/>
   <meta content="" name="DC.date.created" scheme="ISO8601"/>
   <meta content="EN-US" name="DC.language" scheme="DCTERMS.RFC1766"/>
   <meta content="weather, National Weather Service" name="DC.keywords"/>
   <meta content="NOAA's National Weather Service" name="DC.publisher"/>
   <meta content="National Weather Service" name="DC.contributor"/>
   <meta content="http://www.weather.gov/disclaimer.php" name="DC.rights"/>
   <meta content="General" name="rating"/>
   <meta content="index,follow" name="robots"/>

In [5]:
# Find the div with id "seven-day-forcast" and assign to "seven_day"
seven_day = soup.find(id="seven-day-forecast")
print(seven_day.prettify())

<div class="panel panel-default" id="seven-day-forecast">
 <div class="panel-heading">
  <b>
   Extended Forecast for
  </b>
  <h2 class="panel-title">
   Washington DC
  </h2>
 </div>
 <div class="panel-body" id="seven-day-forecast-body">
  <div id="seven-day-forecast-container">
   <ul class="list-unstyled" id="seven-day-forecast-list">
    <li class="forecast-tombstone">
     <div class="tombstone-container">
      <p class="period-name">
       This
       <br/>
       Afternoon
      </p>
      <p>
       <img alt="This Afternoon: Scattered showers and thunderstorms.  Partly sunny, with a high near 83. South wind around 9 mph.  Chance of precipitation is 50%." class="forecast-icon" src="newimages/medium/shra50.png" title="This Afternoon: Scattered showers and thunderstorms.  Partly sunny, with a high near 83. South wind around 9 mph.  Chance of precipitation is 50%."/>
      </p>
      <p class="short-desc">
       Scattered
       <br/>
       Showers
      </p>
      <p class="t

In [6]:
# Inside seven_day, find one individual forecast item and assign to "forecast_item"
forecast_item = seven_day.find(class_="tombstone-container")
print(forecast_item.prettify())

<div class="tombstone-container">
 <p class="period-name">
  This
  <br/>
  Afternoon
 </p>
 <p>
  <img alt="This Afternoon: Scattered showers and thunderstorms.  Partly sunny, with a high near 83. South wind around 9 mph.  Chance of precipitation is 50%." class="forecast-icon" src="newimages/medium/shra50.png" title="This Afternoon: Scattered showers and thunderstorms.  Partly sunny, with a high near 83. South wind around 9 mph.  Chance of precipitation is 50%."/>
 </p>
 <p class="short-desc">
  Scattered
  <br/>
  Showers
 </p>
 <p class="temp temp-high">
  High: 83 °F
 </p>
</div>


In [7]:
# Now extract ALL the tombstones and call it "forecaset_items"
forecast_items = seven_day.find_all(class_="tombstone-container")

### Extract one-day forecast

In [8]:
#Extract and print the first forecast item
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  This
  <br/>
  Afternoon
 </p>
 <p>
  <img alt="This Afternoon: Scattered showers and thunderstorms.  Partly sunny, with a high near 83. South wind around 9 mph.  Chance of precipitation is 50%." class="forecast-icon" src="newimages/medium/shra50.png" title="This Afternoon: Scattered showers and thunderstorms.  Partly sunny, with a high near 83. South wind around 9 mph.  Chance of precipitation is 50%."/>
 </p>
 <p class="short-desc">
  Scattered
  <br/>
  Showers
 </p>
 <p class="temp temp-high">
  High: 83 °F
 </p>
</div>


In [9]:
# Extract "period" text, call the new variable "period"
period = tonight.find(class_="period-name").get_text()
period

'ThisAfternoon'

In [10]:
# Extract "short-desc" text, call the new variable "short_desc"
short_desc = tonight.find(class_="short-desc").get_text()
short_desc

'ScatteredShowers'

In [11]:
# Extract "temp" text, call the new variable "temp"
temp = tonight.find(class_="temp").get_text()
temp

'High: 83 °F'

In [12]:
# Print perdiod, short_desc and temp
print(period)
print(short_desc)
print(temp)

ThisAfternoon
ScatteredShowers
High: 83 °F


In [13]:
# Extract the title attribute from the img tag
img = tonight.find("img")
desc = img['title']
print(desc)

This Afternoon: Scattered showers and thunderstorms.  Partly sunny, with a high near 83. South wind around 9 mph.  Chance of precipitation is 50%.


### Extract all forecasts

In [14]:
# Store all the elements from tombstone in one variable 
period_tags = seven_day.select(".tombstone-container .period-name")
period_tags

[<p class="period-name">This<br/>Afternoon</p>,
 <p class="period-name">Tonight<br/><br/></p>,
 <p class="period-name">Saturday<br/><br/></p>,
 <p class="period-name">Saturday<br/>Night</p>,
 <p class="period-name">Sunday<br/><br/></p>,
 <p class="period-name">Sunday<br/>Night</p>,
 <p class="period-name">Monday<br/><br/></p>,
 <p class="period-name">Monday<br/>Night</p>,
 <p class="period-name">Tuesday<br/><br/></p>]

In [15]:
# Use a loop to call the get_text method on each BS object
periods = []

for pt in period_tags:
    # print(pt.get_text())
    periods.append(pt.get_text())
    
periods

['ThisAfternoon',
 'Tonight',
 'Saturday',
 'SaturdayNight',
 'Sunday',
 'SundayNight',
 'Monday',
 'MondayNight',
 'Tuesday']

In [16]:
# Use a list comprehension to to call the get_text method on each BS object
periods = [pt.get_text() for pt in period_tags]
periods

['ThisAfternoon',
 'Tonight',
 'Saturday',
 'SaturdayNight',
 'Sunday',
 'SundayNight',
 'Monday',
 'MondayNight',
 'Tuesday']

In [17]:
# Apply the same technique to the other (3) fields 
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
print(short_descs)
print()

temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
print(temps)
print()


descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(descs)

['ScatteredShowers', 'ShowersLikely', 'ChanceShowers thenShowersLikely', 'ShowersLikely thenChanceShowers', 'ChanceShowers thenChanceT-storms', 'ChanceT-storms thenSlight ChanceShowers', 'ChanceT-storms', 'Partly Cloudy', 'ChanceT-storms']

['High: 83 °F', 'Low: 68 °F', 'High: 83 °F', 'Low: 68 °F', 'High: 85 °F', 'Low: 70 °F', 'High: 91 °F', 'Low: 73 °F', 'High: 94 °F']

['This Afternoon: Scattered showers and thunderstorms.  Partly sunny, with a high near 83. South wind around 9 mph.  Chance of precipitation is 50%.', 'Tonight: Showers likely and possibly a thunderstorm before 11pm, then scattered showers and thunderstorms between 11pm and 2am, then isolated showers after 2am.  Mostly cloudy, with a low around 68. South wind 3 to 6 mph.  Chance of precipitation is 60%. New rainfall amounts of less than a tenth of an inch, except higher amounts possible in thunderstorms. ', 'Saturday: A chance of showers before noon, then a chance of showers and thunderstorms between noon and 2pm, then

# Combining our data into a Pandas Dataframe

We will need to pass each list of items that we have created and stored in variables. This will include:
 - period
 - short_desc
 - temp
 - desc
 
 We will pass these as a dictionary into the DataFrame class in Pandas. Each dictionary will become a column in the DataFrame., with each list element as part of the value in the column.

In [18]:
weather = pd.DataFrame({
        "period": periods, 
        "short_desc": short_descs, 
        "temp": temps, 
        "desc":descs
    })

weather

Unnamed: 0,period,short_desc,temp,desc
0,ThisAfternoon,ScatteredShowers,High: 83 °F,This Afternoon: Scattered showers and thunders...
1,Tonight,ShowersLikely,Low: 68 °F,Tonight: Showers likely and possibly a thunder...
2,Saturday,ChanceShowers thenShowersLikely,High: 83 °F,"Saturday: A chance of showers before noon, the..."
3,SaturdayNight,ShowersLikely thenChanceShowers,Low: 68 °F,Saturday Night: Showers and thunderstorms like...
4,Sunday,ChanceShowers thenChanceT-storms,High: 85 °F,"Sunday: A chance of showers and thunderstorms,..."
5,SundayNight,ChanceT-storms thenSlight ChanceShowers,Low: 70 °F,Sunday Night: A chance of showers and thunders...
6,Monday,ChanceT-storms,High: 91 °F,"Monday: A chance of thunderstorms before 8am, ..."
7,MondayNight,Partly Cloudy,Low: 73 °F,"Monday Night: Partly cloudy, with a low around..."
8,Tuesday,ChanceT-storms,High: 94 °F,Tuesday: A chance of showers and thunderstorms...


# Bonus: Regular Expressions


 - Source: https://www.geeksforgeeks.org/python-pandas-series-str-extract/
 - Cheet Sheet: https://www.rexegg.com/regex-quickstart.html

##### Example 1

In [19]:
# importing pandas as pd 
import pandas as pd 
  
# importing re for regular expressions 
import re 
  
# Creating the Series 
sr = pd.Series(['New_York', 'Lisbon', 'Tokyo', 'Paris', 'Munich']) 
  
# Creating the index 
idx = ['City 1', 'City 2', 'City 3', 'City 4', 'City 5'] 
  
# set the index 
sr.index = idx 
  
# Print the series 
print(sr) 

City 1    New_York
City 2      Lisbon
City 3       Tokyo
City 4       Paris
City 5      Munich
dtype: object


In [20]:
# Now we will use Series.str.extract() function to extract groups from the strings in the given series object.

# extract groups having a vowel followed by 
# any character 
result = sr.str.extract(pat = '([aeiou].)') 
  
# print the result 
print(result) 

         0
City 1  ew
City 2  is
City 3  ok
City 4  ar
City 5  un


##### Example 2

In [21]:
# importing pandas as pd 
import pandas as pd 
  
# importing re for regular expressions 
import re 
  
# Creating the Series 
sr = pd.Series(['Mike', 'Alessa', 'Nick', 'Kim', 'Britney']) 
  
# Creating the index 
idx = ['Name 1', 'Name 2', 'Name 3', 'Name 4', 'Name 5'] 
  
# set the index 
sr.index = idx 
  
# Print the series 
print(sr) 

Name 1       Mike
Name 2     Alessa
Name 3       Nick
Name 4        Kim
Name 5    Britney
dtype: object


In [22]:
# extract groups having any capital letter 
# followed by 'i' and any other character 
result = sr.str.extract(pat = '([A-Z]i.)') 
  
# print the result 
print(result) 

          0
Name 1  Mik
Name 2  NaN
Name 3  Nic
Name 4  Kim
Name 5  NaN


## Analysis

In [23]:
# Use the Series.str.extract method to insert a regular expression to pull out numeric temperature values
temp_nums = weather["temp"].str.extract("(?P<temp_num>\d+)", expand=True)
weather["temp_num"] = temp_nums.astype('int')
temp_nums

Unnamed: 0,temp_num
0,83
1,68
2,83
3,68
4,85
5,70
6,91
7,73
8,94


In [24]:
# Find the mean of this week's temperature
weather["temp_num"].mean()

79.44444444444444

In [25]:
# Select rows that occur only at night
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night

0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
Name: temp, dtype: bool

In [26]:
weather[is_night]

Unnamed: 0,period,short_desc,temp,desc,temp_num,is_night
1,Tonight,ShowersLikely,Low: 68 °F,Tonight: Showers likely and possibly a thunder...,68,True
3,SaturdayNight,ShowersLikely thenChanceShowers,Low: 68 °F,Saturday Night: Showers and thunderstorms like...,68,True
5,SundayNight,ChanceT-storms thenSlight ChanceShowers,Low: 70 °F,Sunday Night: A chance of showers and thunders...,70,True
7,MondayNight,Partly Cloudy,Low: 73 °F,"Monday Night: Partly cloudy, with a low around...",73,True
