# Web Scraping


Lesson Goals

    Learn the basics of scraping content from web pages.
    Perform scraping of text from a web page.
    Perform extraction of an HTML table from a web page into a Pandas data frame.

Introduction

Web scraping refers to the automatic extraction of information from a web page. This information is often a page's content, but it can also include information in the page's headers, links present on the page, or any other information embedded in the page's HTML. Because of this, scraping has become one of the most popular ways to extract data from the web. With basic knowledge of HTML and the help of a few Python libraries, you can obtain information from just about any page on the internet.

In this lesson, we will cover the basics of web scraping with Python and show examples of how to scrape text content from a simple web page as well the more complex task of extracting data from an HTML table embedded on a web page.
Scraping a Simple Web Page

Scraping a simple website is relatively straightforward. The first thing we need to do is determine the web page we want to scrape and the information we would like to obtain from it. For our purposes, let's suppose we wanted to scrape a Reuters news article and we wanted to extract the main text content (article title, story, etc.).

We first need to specify the URL of the page we want to scrape and then use the requests library's get method to request the page and the content method to retrieve the HTML content. 

In [1]:
import requests

url = 'https://www.reuters.com/article/us-shazam-m-a-apple-eu/eu-clears-apples-purchase-of-shazam-idUSKCN1LM1TZ'
html = requests.get(url).content
html[0:600]

b'<!--[if !IE]> This has NOT been served from cache <![endif]-->\n<!--[if !IE]> Request served from apache server: produs--i-0276be88a71239631 <![endif]-->\n<!--[if !IE]> token: 85a4229a-4fd3-4212-ae3f-a124c66b9c33 <![endif]-->\n<!--[if !IE]> App Server /produs--i-0276be88a71239631/ <![endif]-->\n\n<!doctype html><html lang="en" data-edition="BETAUS">\n    <head>\n\n    <title>\n                EU clears Apple\'s purchase of Shazam - Reuters</title>\n        <meta http-equiv="X-UA-Compatible" content="IE=edge"><meta charset="utf-8"><meta http-equiv="x-dns-prefetch-control" content="on"><link rel="dns-prefe'

As you can see, there is a lot of extra information here that we don't really need if all we are interested in is the text content from the page. We will need to perform a few steps to clean this up, the first of which is to use the BeautifulSoup library to read the raw HTML and structure it in a way where we will be able to more easily parse the information we want out of it. In BeautifulSoup terms, this is called "making the soup."

In [2]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
soup

<!--[if !IE]> This has NOT been served from cache <![endif]--><!--[if !IE]> Request served from apache server: produs--i-0276be88a71239631 <![endif]--><!--[if !IE]> token: 85a4229a-4fd3-4212-ae3f-a124c66b9c33 <![endif]--><!--[if !IE]> App Server /produs--i-0276be88a71239631/ <![endif]--><!DOCTYPE html>
<html data-edition="BETAUS" lang="en">
<head>
<title>
                EU clears Apple's purchase of Shazam - Reuters</title>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta charset="utf-8"/><meta content="on" http-equiv="x-dns-prefetch-control"/><link href="//s1.reutersmedia.net" rel="dns-prefetch"/><link href="//s2.reutersmedia.net" rel="dns-prefetch"/><link href="//s3.reutersmedia.net" rel="dns-prefetch"/><link href="//s4.reutersmedia.net" rel="dns-prefetch"/><link href="//static.reuters.com" rel="dns-prefetch"/><link href="//www.googletagservices.com" rel="dns-prefetch"/><link href="//www.googletagmanager.com" rel="dns-prefetch"/><link href="//www.google-analytics.com" rel

You can see that our soup is slightly more structured than our raw HTML, but the best part about BeautifulSoup comes next. It allows us to extract specific HTML elements from the soup we have created using the find_all method. In our case, we are going to use it to find and extract all the text contained within header tags and paragraph tags. 

In [3]:
tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'p']
text = [element.text for element in soup.find_all(tags)]
text

["EU clears Apple's purchase of Shazam",
 '2 Min Read',
 'BRUSSELS (Reuters) - The European Union approved Apple’s planned acquisition of British music discovery app Shazam on Thursday, saying an EU antitrust investigation showed it would not harm competition in the bloc. ',
 'The deal, announced in December last year, would help the iPhone maker better compete with Spotify, the industry leader in music streaming services. Shazam identifies songs when a smartphone is pointed at an audio source. ',
 '“After thoroughly analyzing Shazam’s user and music data, we found that their acquisition by Apple would not reduce competition in the digital music streaming market,” EU competition commissioner Margrethe Vestager said in a statement. ',
 '“Data is key in the digital economy. We must therefore carefully review transactions which lead to the acquisition of important sets of data, including potentially commercially sensitive ones,” she added. ',
 'The European Commission opened a full-scale 

This gives us a neat list where the text of each HTML element BeautifulSoup found is an element in the list. If we want to view it in paragraph form, we can simply call the join method, use a new line (\n) to join the elements together, and we get the text neatly in paragraph form.



In [4]:
print('\n'.join(text))

EU clears Apple's purchase of Shazam
2 Min Read
BRUSSELS (Reuters) - The European Union approved Apple’s planned acquisition of British music discovery app Shazam on Thursday, saying an EU antitrust investigation showed it would not harm competition in the bloc. 
The deal, announced in December last year, would help the iPhone maker better compete with Spotify, the industry leader in music streaming services. Shazam identifies songs when a smartphone is pointed at an audio source. 
“After thoroughly analyzing Shazam’s user and music data, we found that their acquisition by Apple would not reduce competition in the digital music streaming market,” EU competition commissioner Margrethe Vestager said in a statement. 
“Data is key in the digital economy. We must therefore carefully review transactions which lead to the acquisition of important sets of data, including potentially commercially sensitive ones,” she added. 
The European Commission opened a full-scale investigation into the dea

# More Complex Single-Page Scraping

The previous example was relatively straightforward because we were just extracting the text content from the page. Suppose we wanted to extract data that was contained within an HTML table and store it in a Pandas data frame. This objective makes our scraping task a bit more complex as we would need to identify the table within the HTML, identify the rows within the table, and then read and format the information within those rows so that they fit within a data frame. Let's look at an example of how we would extract a table containing life expectancies for each European country from Wikipedia.

This task would start out just like the previous one. We would specify the URL, use the requests library to request the page and retrieve the raw HTML content, and turn the HTML into soup using BeautifulSoup.



In [5]:
url = 'https://en.wikipedia.org/wiki/List_of_European_countries_by_life_expectancy'
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")

Once we have our soup, we need to extract the table containing each country's life expectancy. You can look at the page source in a browser to determine whether you can specify a class for it. In the case of our table, it did have a class of "sortable wikitable" so we will use that as well as the index [0] to get just the single table we want.



In [6]:
table = soup.find_all('table',{'class':'sortable wikitable'})[0]
table

<table class="sortable wikitable">
<tbody><tr bgcolor="#efefef">
<th>Rank
</th>
<th>Country</th>
<th><a href="/wiki/List_of_countries_by_life_expectancy" title="List of countries by life expectancy">Life expectancy</a><sup class="reference" id="cite_ref-:0_1-1"><a href="#cite_note-:0-1">[1]</a></sup>
</th></tr>
<tr>
<td>1
</td>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="750" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Flag_of_Monaco.svg/19px-Flag_of_Monaco.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Flag_of_Monaco.svg/29px-Flag_of_Monaco.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Flag_of_Monaco.svg/38px-Flag_of_Monaco.svg.png 2x" width="19"/> </span><a href="/wiki/Monaco" title="Monaco">Monaco</a><sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup>
</td>
<td>89.4
</td></tr>
<tr>
<td>2
</td>
<td><span class="flagicon"><i

We now have the table we want, but to be able to load the data into Pandas, we need to extract each of the rows (
tags) and their cell values into a a nested list. We can do that with just a couple lines of Python. 

In [7]:
rows = table.find_all('tr')
rows = [row.text.strip().split("\n") for row in rows]
rows

[['Rank', '', 'Country', 'Life expectancy[1]'],
 ['1', '', '\xa0Monaco[2]', '', '89.4'],
 ['2', '', '\xa0San Marino[3]', '', '83.4'],
 ['3', '', '\xa0\xa0Switzerland', '83.0'],
 ['4', '', '\xa0Spain', '82.8'],
 ['5', '', '\xa0Liechtenstein', '82.7'],
 ['6', '', '\xa0Italy', '82.5'],
 ['7', '', '\xa0Norway', '82.5'],
 ['8', '', '\xa0Iceland', '82.5'],
 ['9', '', '\xa0Luxembourg', '82.3'],
 ['10', '', '\xa0France', '82.3'],
 ['11', '', '\xa0Sweden', '82.2'],
 ['12', '', '\xa0Malta', '81.8'],
 ['13', '', '\xa0Finland', '81.8'],
 ['14', '', '\xa0Ireland', '81.6'],
 ['15', '', '\xa0Netherlands', '81.5'],
 ['16', '', '\xa0Portugal', '81.1'],
 ['17', '', '\xa0Greece', '81.0'],
 ['18', '', '\xa0United Kingdom', '81.0'],
 ['19', '', '\xa0Austria', '80.9'],
 ['20', '', '\xa0Slovenia', '80.8'],
 ['21', '', '\xa0Denmark', '80.7'],
 ['22', '', '\xa0Germany', '80.6'],
 ['23', '', '\xa0Cyprus', '80.5'],
 ['24', '', '\xa0Albania', '78.3'],
 ['25', '', '\xa0Czech Republic', '78.3'],
 ['26', '', '\xa0Cr

From this nested list, we can specify what the column names are and then use the rest of the data to populate a data frame.



In [8]:
import pandas as pd

colnames = rows[0]
data = rows[1:]

df = pd.DataFrame(data[4:], columns=colnames)
df.head()

Unnamed: 0,Rank,Unnamed: 2,Country,Life expectancy[1]
0,5,,Liechtenstein,82.7
1,6,,Italy,82.5
2,7,,Norway,82.5
3,8,,Iceland,82.5
4,9,,Luxembourg,82.3


# Web Scraping Challenges

The two scraping tasks we performed in this lesson were possible because the web pages were created with HTML. It is important to note that this is not always the case and that it will make your scraping efforts more difficult (if not impossible) when it is not.

Aside from this, there are several other factors that may present challenges when performing web scraping. Below is a list of challenges and considerations that should be helpful to keep in mind while performing web scraping.

    Need to determine what information you want to extract from each page.
    Consider creating a customized scraper for each site to account for different formatting from one site to the next.
    Consider that different pages within the same site may have different structure.
    Consider that a page's content and structure can change over time.
    Terms of service for a website may not allow for scraping of their pages.

