# Tutorial on how to scrape the web
In this example, I scrape the BBC weather website for a specific city and collect the weather forecast for the next 14 days, saving it as a CSV file.

Web scraping is not always legal, so I check the terms of use of a website before I scrape it. Additionally, if my code makes repeated requests to the same server, it is good practice to either cache the responses or insert a timed delay between consecutive requests.
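
The caching and delay advice above could be sketched roughly as follows. This is only an illustration and not part of the original notebook: `PoliteFetcher`, `min_delay`, and the `fetch` argument are hypothetical names, and `fetch` stands in for any callable that downloads a URL, for example `lambda u: requests.get(u).text`.

```python
import time


class PoliteFetcher:
    """Cache responses in memory and space out real requests (illustrative sketch)."""

    def __init__(self, fetch, min_delay=1.0):
        self.fetch = fetch            # callable: url -> body (assumed, e.g. wraps requests.get)
        self.min_delay = min_delay    # minimum number of seconds between two real requests
        self.cache = {}               # url -> body
        self.last_request = None      # time of the most recent real request

    def get(self, url):
        if url in self.cache:         # already downloaded: no request is made
            return self.cache[url]
        if self.last_request is not None:
            wait = self.min_delay - (time.monotonic() - self.last_request)
            if wait > 0:              # too soon after the previous request
                time.sleep(wait)
        body = self.fetch(url)
        self.last_request = time.monotonic()
        self.cache[url] = body
        return body
```

With this sketch, calling `get` twice with the same URL triggers only one real request, and two different URLs are fetched at least `min_delay` seconds apart. The cache lives in memory only, so it is lost when the kernel restarts.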

In [1]:
import json                   # to work with JSON data returned by the API

from urllib.parse import urlencode  # to build the query string of a URL

import requests               # to get the webpage
from bs4 import BeautifulSoup # to parse the webpage

import pandas as pd           # to tabulate the forecast data
import re                     # regular expression operators

from datetime import datetime # to work with dates

I first look up the location ID of the city through the BBC locator API, and then GET the corresponding weather page from the server.

In [2]:
required_city = "Mumbai"
location_url = 'https://locator-service.api.bbci.co.uk/locations?' + urlencode({
   'api_key': 'AGbFAKx58hyjQScCXIYrxuEwJh2W2cmv',
   's': required_city,
   'stack': 'aws',
   'locale': 'en',
   'filter': 'international',
   'place-types': 'settlement,airport,district',
   'order': 'importance',
   'a': 'true',
   'format': 'json'
})
location_url

'https://locator-service.api.bbci.co.uk/locations?api_key=AGbFAKx58hyjQScCXIYrxuEwJh2W2cmv&s=Mumbai&stack=aws&locale=en&filter=international&place-types=settlement%2Cairport%2Cdistrict&order=importance&a=true&format=json'
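
The URL printed above shows that `urlencode` percent-encodes the commas in the `place-types` value as `%2C`. As a small self-contained check, using only a subset of the parameters from the cell above, the query string can be built and then decoded again with the standard library:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# A subset of the query parameters used above
params = {
    's': 'Mumbai',
    'place-types': 'settlement,airport,district',
    'format': 'json',
}

query = urlencode(params)
print(query)  # s=Mumbai&place-types=settlement%2Cairport%2Cdistrict&format=json

# Decoding the query string recovers the original values
sample_url = 'https://locator-service.api.bbci.co.uk/locations?' + query
decoded = parse_qs(urlsplit(sample_url).query)
print(decoded['place-types'])  # ['settlement,airport,district']
```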

In [3]:
result = requests.get(location_url).json()
result

{'response': {'results': {'results': [{'id': '1275339',
     'name': 'Mumbai',
     'container': 'India',
     'containerId': 1269750,
     'language': 'en',
     'timezone': 'Asia/Kolkata',
     'country': 'IN',
     'latitude': 19.07283,
     'longitude': 72.88261,
     'placeType': 'settlement'}],
   'totalResults': 1}}}

In [4]:
# url = 'https://www.bbc.com/weather/1275339'  # URL to BBC weather for a specific city (Mumbai, in this example)
url      = 'https://www.bbc.com/weather/' + result['response']['results']['results'][0]['id']
response = requests.get(url)
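
The lookup `result['response']['results']['results'][0]['id']` raises an `IndexError` if the list of matches is empty. A more defensive version might look like the sketch below. The helper name `first_location_id` is hypothetical, and the shape of an empty response is an assumption, since only a successful response is shown above.

```python
def first_location_id(result):
    """Return the 'id' of the first match in a locator response, or None if there is none."""
    matches = result.get('response', {}).get('results', {}).get('results', [])
    if not matches:
        return None
    return matches[0].get('id')


# Trimmed copy of the response shown above
sample = {'response': {'results': {'results': [
    {'id': '1275339', 'name': 'Mumbai', 'country': 'IN'}],
    'totalResults': 1}}}

location_id = first_location_id(sample)
if location_id is None:
    raise ValueError('no matching location found')

sample_weather_url = 'https://www.bbc.com/weather/' + location_id
print(sample_weather_url)  # https://www.bbc.com/weather/1275339
```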

In [5]:
# Create a BeautifulSoup instance to parse the downloaded HTML
soup = BeautifulSoup(response.content, 'html.parser')

The information I want (daily high and low temperatures, and the daily weather summary) is located in specific blocks on the webpage. For each block, I need to identify three things: its tag type (for example, span), the type of identifier (for example, class), and the identifier name. All of these can be found by right-clicking on the webpage and selecting 'Inspect' in the Chrome browser; a similar method works in other browsers.
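
To make the three pieces of information concrete, here is a small sketch that runs on a hand-written HTML fragment rather than on the live page. The fragment imitates the structure of the BBC markup and is not the real page. It shows three ways of selecting by tag and class that should be equivalent for this fragment.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the structure of the BBC weather page
html = """
<div>
  <span class="wr-day-temperature__high-value"><span class="wr-value--temperature"><span class="wr-value--temperature--c">32°</span><span class="wr-hide"> </span><span class="wr-value--temperature--f">90°</span></span></span>
  <span class="wr-day-temperature__high-value"><span class="wr-value--temperature"><span class="wr-value--temperature--c">33°</span><span class="wr-hide"> </span><span class="wr-value--temperature--f">91°</span></span></span>
</div>
"""
sample_soup = BeautifulSoup(html, 'html.parser')

# tag type: span; identifier type: class; class name: wr-day-temperature__high-value
by_attrs = sample_soup.find_all('span', attrs={'class': 'wr-day-temperature__high-value'})
by_class = sample_soup.find_all('span', class_='wr-day-temperature__high-value')
by_css   = sample_soup.select('span.wr-day-temperature__high-value')

# The Celsius value sits in its own nested span, so it can also be selected directly
celsius = [tag.find('span', class_='wr-value--temperature--c').text for tag in by_attrs]
print(celsius)  # ['32°', '33°']
```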

In [6]:
daily_high_values = soup.find_all('span', attrs={'class': 'wr-day-temperature__high-value'}) # block-type: span; identifier type: class; and class name: wr-day-temperature__high-value
daily_high_values

[<span class="wr-day-temperature__high-value"><span class="wr-value--temperature"><span class="wr-value--temperature--c">32°</span><span class="wr-hide"> </span><span class="wr-value--temperature--f">90°</span></span></span>,
 <span class="wr-day-temperature__high-value"><span class="wr-value--temperature"><span class="wr-value--temperature--c">33°</span><span class="wr-hide"> </span><span class="wr-value--temperature--f">91°</span></span></span>,
 <span class="wr-day-temperature__high-value"><span class="wr-value--temperature"><span class="wr-value--temperature--c">32°</span><span class="wr-hide"> </span><span class="wr-value--temperature--f">90°</span></span></span>,
 <span class="wr-day-temperature__high-value"><span class="wr-value--temperature"><span class="wr-value--temperature--c">33°</span><span class="wr-hide"> </span><span class="wr-value--temperature--f">91°</span></span></span>,
 <span class="wr-day-temperature__high-value"><span class="wr-value--temperature"><span class="w

In [7]:
daily_summary = soup.find('div', attrs={'class': 'wr-day-summary'})
daily_summary

<div class="wr-day-summary"><div class="gel-wrap"><span class="">A clear sky and a gentle breeze</span><span class="wr-hide">Sunny and a moderate breeze</span><span class="wr-hide">Sunny and a gentle breeze</span><span class="wr-hide">Sunny and a gentle breeze</span><span class="wr-hide">Sunny and a gentle breeze</span><span class="wr-hide">Sunny and a gentle breeze</span><span class="wr-hide">Sunny and a gentle breeze</span><span class="wr-hide">Sunny and a gentle breeze</span><span class="wr-hide">Sunny and a moderate breeze</span><span class="wr-hide">Sunny and a moderate breeze</span><span class="wr-hide">Sunny and a moderate breeze</span><span class="wr-hide">Sunny and a moderate breeze</span><span class="wr-hide">Sunny and a moderate breeze</span><span class="wr-hide">Sunny and a moderate breeze</span></div></div>

In [8]:
daily_summary.text

'A clear sky and a gentle breezeSunny and a moderate breezeSunny and a gentle breezeSunny and a gentle breezeSunny and a gentle breezeSunny and a gentle breezeSunny and a gentle breezeSunny and a gentle breezeSunny and a moderate breezeSunny and a moderate breezeSunny and a moderate breezeSunny and a moderate breezeSunny and a moderate breezeSunny and a moderate breeze'
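
Note that `.text` joins the text of all nested spans without any separator, so the individual days cannot be recovered reliably from this string. One way around this, sketched here on a shortened copy of the markup above, is to read each inner span separately:

```python
from bs4 import BeautifulSoup

# Shortened copy of the wr-day-summary block shown above (first three days only)
html = (
    '<div class="wr-day-summary"><div class="gel-wrap">'
    '<span class="">A clear sky and a gentle breeze</span>'
    '<span class="wr-hide">Sunny and a moderate breeze</span>'
    '<span class="wr-hide">Sunny and a gentle breeze</span>'
    '</div></div>'
)
sample_soup = BeautifulSoup(html, 'html.parser')
sample_summary = sample_soup.find('div', attrs={'class': 'wr-day-summary'})

# One list entry per day, instead of one concatenated string
daily_summary_list = [span.text.strip() for span in sample_summary.find_all('span')]
print(daily_summary_list)
# ['A clear sky and a gentle breeze', 'Sunny and a moderate breeze', 'Sunny and a gentle breeze']
```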

In [9]:
daily_high_values[0].text.strip()

'32° 90°'

In [10]:
daily_high_values[5].text.strip()

'35° 94°'

In [11]:
daily_high_values[0].text.strip().split()[0]

'32°'

In [12]:
daily_high_values_list = [tag.text.strip().split()[0] for tag in daily_high_values]
daily_high_values_list

['32°',
 '33°',
 '32°',
 '33°',
 '34°',
 '35°',
 '35°',
 '33°',
 '31°',
 '32°',
 '33°',
 '34°',
 '34°']
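
The values are still strings with a degree sign attached. One possible next step, sketched here with a hypothetical helper `to_int` and only the first few values from the list above, is to extract the number with a regular expression so that the temperatures can be treated as integers:

```python
import re

sample_high_values = ['32°', '33°', '32°', '33°', '34°']  # first few values from the list above


def to_int(temperature):
    """Extract the (possibly negative) integer from a string such as '32°'."""
    match = re.search(r'-?\d+', temperature)
    if match is None:
        raise ValueError(f'no number found in {temperature!r}')
    return int(match.group())


sample_high_celsius = [to_int(t) for t in sample_high_values]
print(sample_high_celsius)  # [32, 33, 32, 33, 34]
```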