
## What is Web Scraping?
* Extracting data from websites
* The extracted data is collected and converted into a format that is useful to the user (e.g. CSV)
* A scraper can extract all the data on a page, or only the specific data the user selects before it is run
* Extracting specific data requires techniques to identify which CSS or JavaScript elements correspond to the required data
* The user checks through the data to confirm that the scraper works properly
* The web scraper outputs the collected data

What are web scrapers used for?
* Extracting information from the web
* What is extracted depends on the problem statement and the type of analysis that will be run on the data

Types of websites
* Static: the content of the page does not change, e.g. history sites
* Dynamic: the content of the page changes, so it may not be the same from one point in time to the next, e.g. e-commerce sites

Beautiful Soup
* One of the most commonly used Python libraries for parsing HTML
* Very useful for pulling information out of an HTML page
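
As a minimal sketch of what "pulling out information" looks like, the snippet below parses a small hand-written HTML string and reads two pieces of text from it. The HTML and the names `intro_html` and `intro_soup` are assumptions made for illustration, not part of any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written inline so the example runs without a network connection
intro_html = (
    "<html><head><title>Demo page</title></head>"
    "<body><h1>Hello</h1><p>First paragraph</p></body></html>"
)

# html.parser is the parser that ships with Python
intro_soup = BeautifulSoup(intro_html, "html.parser")

title_text = intro_soup.title.get_text()         # text of the <title> tag
heading_text = intro_soup.find("h1").get_text()  # text of the first <h1> tag

print(title_text)
print(heading_text)
```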

## 1. Download/Import libraries

In [1]:
# urllib and pprint are part of Python's standard library, so they are not installed with pip
# (pip install urllib fails with "No matching distribution found for urllib")
!pip install beautifulsoup4
!pip install requests
!pip install pandas

In [3]:
# Import libraries
import pandas as pd
import urllib
import urllib.request
from bs4 import BeautifulSoup
import requests
from pprint import pprint # pretty-prints nested data structures



## 2. Scraping Basic Knowledge

In [4]:
# 1. Scraping HTML
url = "https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168"

page = requests.get(url) #using requests to pull the page

In [5]:
page
#Informational responses (100–199)
#Successful responses (200–299)
#Redirection messages (300–399)
#Client error responses (400–499)
#Server error responses (500–599)

<Response [200]>
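
The status-code ranges listed in the comments above can be turned into a small lookup. This is only a sketch: `status_category` is a hypothetical helper, not part of `requests`.

```python
def status_category(code: int) -> str:
    """Return the response class for an HTTP status code (hypothetical helper)."""
    categories = {
        1: "informational",
        2: "successful",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 keeps only the leading digit of a three-digit code
    return categories.get(code // 100, "unknown")


# With requests, the integer is available as page.status_code, and
# page.raise_for_status() raises an exception for 4xx and 5xx responses.
print(status_category(200))
print(status_category(404))
```

Checking the status before parsing avoids running Beautiful Soup on an error page by mistake.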

In [12]:
# 2. Parse the HTML; prettify() makes it easier to read by adding indentation
soup = BeautifulSoup(page.content, 'html.parser')
print(soup) # Printed without prettify() here, to compare with the indented version
#print(soup.prettify())

<!DOCTYPE html>
<html class="no-js">
<head>
<!-- Meta -->
<meta content="width=device-width" name="viewport"/>
<link href="http://purl.org/dc/elements/1.1/" rel="schema.DC"/>
<title>National Weather Service</title>
<meta content="National Weather Service" name="DC.title"/>
<meta content="NOAA National Weather Service" name="DC.description"/>
<meta content="US Department of Commerce, NOAA, National Weather Service" name="DC.creator"/>
<meta content="2024-11-28T12:01:42+00:00" name="DC.date.created" scheme="ISO8601"/>
<meta content="EN-US" name="DC.language" scheme="DCTERMS.RFC1766"/>
<meta content="weather" name="DC.keywords"/>
<meta content="NOAA's National Weather Service" name="DC.publisher"/>
<meta content="National Weather Service" name="DC.contributor"/>
<meta content="/disclaimer.php" name="DC.rights"/>
<meta content="General" name="rating"/>
<meta content="index,follow" name="robots"/>
<!-- Icons -->
<link href="/build/images/favicon.eab6deff.ico" rel="shortcut icon" type="image
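
To see the difference between the two print calls on something small, the sketch below uses a hypothetical inline snippet (`pretty_html` is an assumption, not the weather page) and compares the plain string form with `prettify()`.

```python
from bs4 import BeautifulSoup

pretty_html = "<div><p>Hello</p><p>World</p></div>"  # hypothetical snippet
pretty_soup = BeautifulSoup(pretty_html, "html.parser")

compact = str(pretty_soup)       # what print(soup) shows: the markup on one line
pretty = pretty_soup.prettify()  # each tag and text node on its own line, indented by depth

print(compact)
print(pretty)
```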

In [14]:
# 3. How to find tags in the HTML
#  a) Finding all instances of a tag using find_all
soup.find_all('p')

[<p>
 <input checked="checked" id="nws" name="affiliate" type="radio" value="nws.noaa.gov"/>
 <label class="search-scope" for="nws">NWS</label>
 <input id="noaa" name="affiliate" type="radio" value="noaa.gov"/>
 <label class="search-scope" for="noaa">All NOAA</label>
 </p>,
 <p>Your local forecast office is</p>,
 <p>
                     Scattered showers and thunderstorms will spread into the East Coast, and moderate to heavy snowfall will develop over the Northeast. A significant Arctic outbreak will spread over the northern Plains, where dangerously cold wind chills are expected. The Arctic air will advance farther south and east on Friday into the weekend. Heavy lake effect snow is likely downwind of the Great Lakes. 
                                                                 <a href="http://www.wpc.ncep.noaa.gov/discussions/hpcdiscussions.php?disc=pmdspd" target="_blank">Read More &gt;</a>
 </p>,
 <p class="myforecast-current">NA</p>,
 <p class="myforecast-current-lrg">46°F<

In [15]:
# get_text() extracts the text inside a tag
soup.find_all('p')[1].get_text()

# find_all() returns a list, so index 1 is the second <p> in the output above

'Your local forecast office is'

In [16]:
# Get the text of the <p> at index 2
soup.find_all('p')[2].get_text()


'\n                    Scattered showers and thunderstorms will spread into the East Coast, and moderate to heavy snowfall will develop over the Northeast. A significant Arctic outbreak will spread over the northern Plains, where dangerously cold wind chills are expected. The Arctic air will advance farther south and east on Friday into the weekend. Heavy lake effect snow is likely downwind of the Great Lakes. \n                                                                Read More >\n'

In [17]:
# 3b) Finding the first instance of tag using find()
soup.find('p')

<p>
<input checked="checked" id="nws" name="affiliate" type="radio" value="nws.noaa.gov"/>
<label class="search-scope" for="nws">NWS</label>
<input id="noaa" name="affiliate" type="radio" value="noaa.gov"/>
<label class="search-scope" for="noaa">All NOAA</label>
</p>

In [18]:
# get_text() returns only the text between tags; attribute values such as id="nws" are not text
soup.find_all('p')[0].get_text()

'\n\nNWS\n\nAll NOAA\n'

In [19]:
# Split the string into a list of words on whitespace (rsplit() with no arguments gives the same result)
soup.find_all('p')[0].get_text().split()

['NWS', 'All', 'NOAA']
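
Splitting on whitespace breaks "All NOAA" into two words. If each label should stay whole, one option is Beautiful Soup's `stripped_strings`, or `get_text()` with `strip=True`. The sketch below uses hypothetical markup modelled on the first `<p>` of the page; the names `labels_html` and `labels_p` are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the first <p> of the page above
labels_html = """<p>
<input id="nws" type="radio"/>
<label for="nws">NWS</label>
<input id="noaa" type="radio"/>
<label for="noaa">All NOAA</label>
</p>"""
labels_p = BeautifulSoup(labels_html, "html.parser").find("p")

# One entry per non-empty text node, with surrounding whitespace removed
labels = list(labels_p.stripped_strings)

# The same pieces joined into a single string with a space between them
joined = labels_p.get_text(separator=" ", strip=True)

print(labels)
print(joined)
```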

In [20]:
#  c) Search for tags by class or id

# Find tags with class period-name
soup.find_all(class_= 'period-name')
# class_ is a keyword argument of find_all, so the argument name is not quoted; only its value is a string
# The underscore is needed because class is a reserved keyword in Python,
# so Beautiful Soup uses class_ to filter on the HTML class attribute

[<p class="period-name">Thanksgiving Day</p>,
 <p class="period-name">Tonight</p>,
 <p class="period-name">Friday</p>,
 <p class="period-name">Friday Night</p>,
 <p class="period-name">Saturday</p>,
 <p class="period-name">Saturday Night</p>,
 <p class="period-name">Sunday</p>,
 <p class="period-name">Sunday Night</p>,
 <p class="period-name">Monday</p>]
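
The same class lookup can be written in a few equivalent ways. The sketch below runs them on hypothetical inline markup (`class_html` is an assumption) and also shows that a tag with several classes, such as `temp temp-low`, matches a search for any one of them.

```python
from bs4 import BeautifulSoup

class_html = """
<div>
  <p class="period-name">Tonight</p>
  <p class="temp temp-low">Low: 45 °F</p>
  <p class="period-name">Friday</p>
</div>
"""
class_soup = BeautifulSoup(class_html, "html.parser")

by_keyword = class_soup.find_all(class_="period-name")             # keyword argument
by_attrs = class_soup.find_all(attrs={"class": "period-name"})     # attribute dictionary
by_selector = class_soup.select(".period-name")                    # CSS selector

names = [tag.get_text() for tag in by_keyword]
temp_tags = class_soup.find_all(class_="temp")  # matches class="temp temp-low"

print(names)
print(len(temp_tags))
```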

In [22]:
# Find tags with id news-items
soup.find_all(id = 'news-items')
# Returns the matching tag together with everything nested inside it (refer to the soup output above)
# id is also a keyword argument of find_all, so it is not quoted; it is not a reserved keyword, so no underscore is needed

[<div id="news-items">
 <div id="topnews">
 <div class="icon"><img src="/bundles/templating/images/top_news/important.png"/></div>
 <div class="body">
 <h1 style="font-size: 11pt;">Rain and Snow in the Eastern US; Arctic Airmass Moves into the Northern Plains</h1>
 <p>
                     Scattered showers and thunderstorms will spread into the East Coast, and moderate to heavy snowfall will develop over the Northeast. A significant Arctic outbreak will spread over the northern Plains, where dangerously cold wind chills are expected. The Arctic air will advance farther south and east on Friday into the weekend. Heavy lake effect snow is likely downwind of the Great Lakes. 
                                                                 <a href="http://www.wpc.ncep.noaa.gov/discussions/hpcdiscussions.php?disc=pmdspd" target="_blank">Read More &gt;</a>
 </p>
 </div>
 </div>
 </div>]
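
Because an id lookup returns the tag with everything nested inside it, later searches can be scoped to that block. The sketch below assumes a simplified, hypothetical version of the news block (`news_html`).

```python
from bs4 import BeautifulSoup

news_html = """
<div id="news-items">
  <div id="topnews">
    <h1>Rain and Snow in the Eastern US</h1>
    <p>Scattered showers will spread into the East Coast.</p>
  </div>
</div>
<p>Unrelated paragraph outside the news block</p>
"""
news_soup = BeautifulSoup(news_html, "html.parser")

news = news_soup.find(id="news-items")   # a single Tag, or None if the id is absent
headline = news.find("h1").get_text()    # searches only inside the news block
paragraphs_in_news = news.find_all("p")  # excludes the unrelated paragraph

# The same lookup written as a CSS selector: '#' selects by id
same_headline = news_soup.select_one("#news-items h1").get_text()

print(headline)
print(len(paragraphs_in_news))
```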

## 3. Scraping for real now
1. Download the webpage containing the forecast
2. Create a BeautifulSoup object to parse the page
3. Find the div with id seven-day-forecast and assign it to seven_day
4. Inside seven_day, find each individual forecast item
5. Extract and print the first forecast item
6. Using the tag information found in Step 5, extract the following information: Period, Short Description, Temperature and Description of the conditions
7. Format the extracted data into a pandas DataFrame

In [23]:
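# Show our current soup variable
# Note: prettify has no parentheses here, so Python prints the bound method (whose repr embeds the raw document)
# instead of the indented HTML; use print(soup.prettify()) to get the indented version
print(soup.prettify)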

<bound method Tag.prettify of <!DOCTYPE html>
<html class="no-js">
<head>
<!-- Meta -->
<meta content="width=device-width" name="viewport"/>
<link href="http://purl.org/dc/elements/1.1/" rel="schema.DC"/>
<title>National Weather Service</title>
<meta content="National Weather Service" name="DC.title"/>
<meta content="NOAA National Weather Service" name="DC.description"/>
<meta content="US Department of Commerce, NOAA, National Weather Service" name="DC.creator"/>
<meta content="2024-11-28T12:01:42+00:00" name="DC.date.created" scheme="ISO8601"/>
<meta content="EN-US" name="DC.language" scheme="DCTERMS.RFC1766"/>
<meta content="weather" name="DC.keywords"/>
<meta content="NOAA's National Weather Service" name="DC.publisher"/>
<meta content="National Weather Service" name="DC.contributor"/>
<meta content="/disclaimer.php" name="DC.rights"/>
<meta content="General" name="rating"/>
<meta content="index,follow" name="robots"/>
<!-- Icons -->
<link href="/build/images/favicon.eab6deff.ico" r

In [26]:
# Step 3: Find the div with id seven-day-forecast and assign it to seven_day
seven_day = soup.find(id = "seven-day-forecast")
print(seven_day.prettify())

<div class="panel panel-default" id="seven-day-forecast">
 <div class="panel-heading">
  <b>
   Extended Forecast for
  </b>
  <h2 class="panel-title">
   San Francisco CA
  </h2>
 </div>
 <div class="panel-body" id="seven-day-forecast-body">
  <div id="seven-day-forecast-container">
   <ul class="list-unstyled" id="seven-day-forecast-list">
    <li class="forecast-tombstone">
     <div class="tombstone-container">
      <p class="period-name">
       Thanksgiving Day
      </p>
      <p>
       <img alt="Thanksgiving Day: Sunny, with a high near 60. North northeast wind 5 to 8 mph. " class="forecast-icon" src="newimages/medium/few.png" title="Thanksgiving Day: Sunny, with a high near 60. North northeast wind 5 to 8 mph. "/>
      </p>
      <p class="temp temp-high">
       High: 60 °F
      </p>
      <p class="short-desc">
       Sunny
      </p>
     </div>
    </li>
    <li class="forecast-tombstone">
     <div class="tombstone-container">
      <p class="period-name">
       Toni

In [27]:
# Step 4: Inside seven_day, find each individual forecast item
forecast_items = seven_day.find_all(class_ = "tombstone-container")
print(forecast_items)


[<div class="tombstone-container"><p class="period-name">Thanksgiving Day</p><p><img alt="Thanksgiving Day: Sunny, with a high near 60. North northeast wind 5 to 8 mph. " class="forecast-icon" src="newimages/medium/few.png" title="Thanksgiving Day: Sunny, with a high near 60. North northeast wind 5 to 8 mph. "/></p><p class="temp temp-high">High: 60 °F</p><p class="short-desc">Sunny</p></div>, <div class="tombstone-container"><p class="period-name">Tonight</p><p><img alt="Tonight: Partly cloudy, with a low around 45. North wind around 5 mph. " class="forecast-icon" src="newimages/medium/nsct.png" title="Tonight: Partly cloudy, with a low around 45. North wind around 5 mph. "/></p><p class="temp temp-low">Low: 45 °F</p><p class="short-desc">Partly Cloudy</p></div>, <div class="tombstone-container"><p class="period-name">Friday</p><p><img alt="Friday: Mostly sunny, with a high near 62. North northeast wind around 7 mph. " class="forecast-icon" src="newimages/medium/sct.png" title="Friday

In [29]:
# Step 5: Extract and print the first forecast item
# (the variable is named tonight, but in this run the first item is "Thanksgiving Day")
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Thanksgiving Day
 </p>
 <p>
  <img alt="Thanksgiving Day: Sunny, with a high near 60. North northeast wind 5 to 8 mph. " class="forecast-icon" src="newimages/medium/few.png" title="Thanksgiving Day: Sunny, with a high near 60. North northeast wind 5 to 8 mph. "/>
 </p>
 <p class="temp temp-high">
  High: 60 °F
 </p>
 <p class="short-desc">
  Sunny
 </p>
</div>



In [32]:
# Step 6: Using the tag information found from Step 5, extract the following information:
# Period, Short Description, Temperature and Description of the conditions
# The class names are hyphenated in the HTML (period-name, short-desc); a name that does not
# match makes find() return None, and calling get_text() on None raises AttributeError
period = tonight.find(class_ = 'period-name').get_text()
short_desc = tonight.find(class_ = 'short-desc').get_text()
temp = tonight.find(class_ = 'temp').get_text()
print(period)
print(short_desc)
print(temp)

Thanksgiving Day
Sunny
High: 60 °F
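
`find()` returns `None` when nothing matches, which is why a misspelled class name ends in `AttributeError: 'NoneType' object has no attribute 'get_text'`. One way to guard against this is a small wrapper. The sketch below is an assumption for illustration: `text_or_none` is a hypothetical helper and the markup is a cut-down forecast item.

```python
from bs4 import BeautifulSoup

safe_html = (
    '<div class="tombstone-container">'
    '<p class="period-name">Tonight</p>'
    '<p class="short-desc">Partly Cloudy</p>'
    "</div>"
)
safe_item = BeautifulSoup(safe_html, "html.parser").find(class_="tombstone-container")


def text_or_none(tag, class_name):
    """Return the stripped text of the first descendant with this class, or None."""
    found = tag.find(class_=class_name)
    return found.get_text(strip=True) if found is not None else None


hit = text_or_none(safe_item, "period-name")   # hyphenated name, matches the HTML
miss = text_or_none(safe_item, "period_name")  # underscore typo, nothing matches

print(hit)
print(miss)
```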

In [35]:
# Description of the conditions
img = tonight.find('img')
desc = img['title'] # a tag's attributes are read like dictionary keys, so 'title' is a quoted string
print(desc)


Thanksgiving Day: Sunny, with a high near 60. North northeast wind 5 to 8 mph. 


In [37]:
# Extract all period names
period_tags = seven_day.select('.tombstone-container .period-name')
print(period_tags)
periods = [pt.get_text() for pt in period_tags]




[<p class="period-name">Thanksgiving Day</p>, <p class="period-name">Tonight</p>, <p class="period-name">Friday</p>, <p class="period-name">Friday Night</p>, <p class="period-name">Saturday</p>, <p class="period-name">Saturday Night</p>, <p class="period-name">Sunday</p>, <p class="period-name">Sunday Night</p>, <p class="period-name">Monday</p>]


In [43]:
# Create our variables short_descs, temps, descs
short_descs = [sd.get_text() for sd in seven_day.select('.tombstone-container .short-desc')]
# In a CSS selector, a leading . selects by class (a leading # would select by id)
temps = [t.get_text() for t in seven_day.select('.tombstone-container .temp')]

descs = [d['title'] for d in seven_day.select('.tombstone-container img')]
# img needs no prefix because it is a tag name, not a class or id; the space in the selector means "an img nested anywhere inside .tombstone-container"


print(short_descs)
print(temps)
print(descs)


['Sunny', 'Partly Cloudy', 'Mostly Sunny', 'Partly Cloudy', 'Mostly Sunny', 'Mostly Clear', 'Mostly Sunny', 'Partly Cloudy', 'Sunny']
['High: 60 °F', 'Low: 45 °F', 'High: 62 °F', 'Low: 44 °F', 'High: 62 °F', 'Low: 45 °F', 'High: 62 °F', 'Low: 47 °F', 'High: 63 °F']
['Thanksgiving Day: Sunny, with a high near 60. North northeast wind 5 to 8 mph. ', 'Tonight: Partly cloudy, with a low around 45. North wind around 5 mph. ', 'Friday: Mostly sunny, with a high near 62. North northeast wind around 7 mph. ', 'Friday Night: Partly cloudy, with a low around 44. Northeast wind around 5 mph. ', 'Saturday: Mostly sunny, with a high near 62. Northeast wind around 7 mph. ', 'Saturday Night: Mostly clear, with a low around 45.', 'Sunday: Mostly sunny, with a high near 62.', 'Sunday Night: Partly cloudy, with a low around 47.', 'Monday: Sunny, with a high near 63.']
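
The parallel lists above line up only if every forecast item contains exactly one of each field. As a hedged alternative, the sketch below loops over the items and builds one record per item, so a missing field becomes `None` instead of shifting the lists. The markup is hypothetical (the second item deliberately has no temperature), and `first_text` is an illustrative helper.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second item deliberately has no temperature
items_html = """
<ul>
  <li><div class="tombstone-container"><p class="period-name">Tonight</p><p><img title="Tonight: Partly cloudy."/></p><p class="temp temp-low">Low: 45 °F</p><p class="short-desc">Partly Cloudy</p></div></li>
  <li><div class="tombstone-container"><p class="period-name">Friday</p><p><img title="Friday: Mostly sunny."/></p><p class="short-desc">Mostly Sunny</p></div></li>
</ul>
"""
items_soup = BeautifulSoup(items_html, "html.parser")


def first_text(item, selector):
    """Text of the first match for a CSS selector inside item, or None if absent."""
    match = item.select_one(selector)
    return match.get_text(strip=True) if match is not None else None


records = []
for item in items_soup.select(".tombstone-container"):
    img = item.select_one("img")
    records.append({
        "period": first_text(item, ".period-name"),
        "short_desc": first_text(item, ".short-desc"),
        "temp": first_text(item, ".temp"),
        "desc": img.get("title") if img is not None else None,
    })

for record in records:
    print(record)
```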


In [44]:
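# Step 7: Format the extracted data into a pandas DataFrame
# pd.DataFrame takes a dict mapping column names to lists; the list built earlier is periods (lowercase)
weather = pd.DataFrame({'Period': periods, 'Short description': short_descs, 'Temperature': temps, 'Description': descs})
weather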
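
`pd.DataFrame` expects a dictionary that maps each column name to a list of equal length. The sketch below builds a small table from the first three values printed earlier and converts it to CSV text. The names prefixed with `demo_` are assumptions, chosen so they do not overwrite the notebook's own variables.

```python
import pandas as pd

# Sample values copied from the first three entries of the lists printed earlier
demo_periods = ["Thanksgiving Day", "Tonight", "Friday"]
demo_short_descs = ["Sunny", "Partly Cloudy", "Mostly Sunny"]
demo_temps = ["High: 60 °F", "Low: 45 °F", "High: 62 °F"]

# Each dictionary key becomes a column; each list supplies that column's rows
demo_weather = pd.DataFrame({
    "period": demo_periods,
    "short_desc": demo_short_descs,
    "temp": demo_temps,
})

# to_csv returns the CSV as a string when no file path is given
csv_text = demo_weather.to_csv(index=False)

print(demo_weather.shape)
print(csv_text)
```

Passing a file path to `to_csv` instead would write the same table to disk.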