# Web Scraping

## Introduction 

Web scraping refers to the automatic extraction of information from a web page. This information is often a page's content, but it can also include information in the page's headers, links present on the page, or any other information embedded in the page's HTML. Because of this, scraping has become one of the most popular ways to extract data from the web. With basic knowledge of HTML and the help of a few Python libraries, you can obtain information from just about any page on the internet.

In this lesson, we will cover the basics of web scraping with Python and show examples of how to scrape text content from a simple web page, as well as the more complex task of extracting data from an HTML table embedded in a web page.

## Scraping a Simple Web Page

Scraping a simple website is relatively straightforward. The first thing we need to do is decide which web page we want to scrape and what information we would like to obtain from it. For our purposes, suppose we want to scrape a Reuters news article and extract its main text content (article title, story, etc.).

We first need to specify the URL of the page we want to scrape, then use the requests library's get method to request the page and the response's content attribute to retrieve the raw HTML.

In [None]:
import requests

url = 'https://www.reuters.com/article/us-shazam-m-a-apple-eu/eu-clears-apples-purchase-of-shazam-idUSKCN1LM1TZ'
html = requests.get(url)
html

In [None]:
# Get the content
html = requests.get(url).content
#html
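A request can fail or hang, so it can help to check the response before using its content. The sketch below shows one way to do that. The fetch_html name, the 10-second timeout, and the get parameter (included only so the function can be tried without a network connection) are assumptions made for illustration and are not part of the lesson's own code.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the raw HTML bytes for url, raising if the request fails.

    fetch_html, the timeout value, and the get parameter are hypothetical
    choices made for this sketch.
    """
    response = get(url, timeout=timeout)  # stop waiting if the server stays silent this long
    response.raise_for_status()           # raise requests.HTTPError on 4xx/5xx responses
    return response.content               # bytes, like requests.get(url).content
```

With this helper, `html = fetch_html(url)` would stand in for `requests.get(url).content`.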

If you print the first 600 characters of the HTML content, you will see something like:


In [None]:
# Show the first 600 chars
html[0:600]


As you can see, there is a lot of extra information here that we don't need if all we are interested in is the text content of the page. We will need to perform a few steps to clean this up. The first is to use the BeautifulSoup library to read the raw HTML and structure it so that we can more easily pull out the information we want. In BeautifulSoup terms, this is called "making the soup."

To run this code, you will need the lxml parser. If you haven't installed it already, you can do so with pip (pip install lxml). The project page is:

https://pypi.org/project/lxml/

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
# soup
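To get a feel for what the soup object offers, here is a minimal sketch on a tiny, made-up page. The sample_html content and variable names are assumptions for illustration, and the sketch uses Python's built-in "html.parser" instead of "lxml" so that it runs without the extra install.

```python
from bs4 import BeautifulSoup

# A made-up page used only for this sketch.
sample_html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>Main headline</h1>
    <p>First paragraph.</p>
  </body>
</html>
"""

# "html.parser" ships with Python; the lesson itself uses "lxml".
sample_soup = BeautifulSoup(sample_html, "html.parser")

print(sample_soup.title.string)  # text inside <title>
print(sample_soup.h1.name)       # tag name of the first <h1>
print(sample_soup.prettify())    # indented view of the parsed tree
```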

https://www.reuters.com/article/us-shazam-m-a-apple-eu/eu-clears-apples-purchase-of-shazam-idUSKCN1LM1TZ

In [None]:
from IPython.display import Image
Image(filename='webscrap_exmp.png')

You can see that our soup is slightly more structured than our raw HTML, but the best part about BeautifulSoup comes next. It allows us to extract specific HTML elements from the soup we have created using the find_all method. In our case, we are going to use it to find and extract all the text contained within header tags and paragraph tags.

In [None]:
# HTML defines heading levels h1 through h6 only
tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p']
text = [element.text for element in soup.find_all(tags)]
text

In [None]:
soup.find_all('h1')

This gives us a neat list where the text of each HTML element BeautifulSoup found is an element in the list. If we want to view it in paragraph form, we can simply call the join method, use a new line (\n) to join the elements together, and we get the text neatly in paragraph form.

In [None]:
print('\n'.join(text))
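Real pages often contain empty elements or text padded with whitespace, which show up as blank lines after joining. The sketch below, run on a made-up snippet, trims each piece and drops the empty ones before joining. The sample_html content and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# A made-up snippet with padded and empty paragraphs.
sample_html = """
<h1>Headline</h1>
<p>  First paragraph.  </p>
<p></p>
<h2>Subheading</h2>
<p>Second paragraph.</p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

tags = ["h1", "h2", "p"]
# get_text(strip=True) trims surrounding whitespace from each element's text.
pieces = [element.get_text(strip=True) for element in sample_soup.find_all(tags)]
# Drop elements that had no text at all, such as the empty <p>.
pieces = [piece for piece in pieces if piece]

article_text = "\n".join(pieces)
print(article_text)
```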

## More Complex Page Scraping

The previous example was relatively straightforward because we were just extracting the text content from the page. Suppose we wanted to extract data that was contained within an HTML table and store it in a Pandas data frame. This objective makes our scraping task a bit more complex as we would need to identify the table within the HTML, identify the rows within the table, and then read and format the information within those rows so that they fit within a data frame. Let's look at an example of how we would extract a table containing life expectancies for each European country from Wikipedia.

This task would start out just like the previous one. We would specify the URL, use the requests library to request the page and retrieve the raw HTML content, and turn the HTML into soup using BeautifulSoup.

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_European_countries_by_life_expectancy'

In [None]:
html = requests.get(url).content

In [None]:
soup = BeautifulSoup(html, "lxml")

In [None]:
#soup

Once we have our soup, we need to extract the table containing each country's life expectancy. You can look at the page source in a browser to find an attribute, such as a class, that identifies the table. In our case, the table has the class attribute "sortable wikitable", so we search for that and use the index [0] to get just the single table we want.

In [None]:
table = soup.find_all('table', attrs={'class':'sortable wikitable'})[0]
#table
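One detail worth knowing: when the class value contains several names, BeautifulSoup compares it against the attribute string exactly as written, so the order of the names matters. The sketch below illustrates this on made-up tables and shows two more tolerant alternatives. The sample_html content and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Three made-up tables; the first two carry the same classes in a different order.
sample_html = """
<table class="sortable wikitable"><tr><td>A</td></tr></table>
<table class="wikitable sortable"><tr><td>B</td></tr></table>
<table class="infobox"><tr><td>C</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# A multi-word string is compared against the class attribute exactly as written.
exact = sample_soup.find_all("table", attrs={"class": "sortable wikitable"})

# A single class name matches any table that carries that class.
any_wikitable = sample_soup.find_all("table", class_="wikitable")

# A CSS selector matches tables with both classes, in either order.
both_classes = sample_soup.select("table.wikitable.sortable")

print(len(exact), len(any_wikitable), len(both_classes))
```

If a page reorders its class names, the exact-string search stops matching, while the other two searches keep working.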

Even though only one table matches our search, we still need to select it with the index [0]. To see why, run the following code.

In [None]:
print(type(soup.find_all('table',attrs={'class':'sortable wikitable'})))
print(type(soup.find_all('table',attrs={'class':'sortable wikitable'})[0]))

As you can see, the find_all method returns a so-called result set (a ResultSet), a list-like object that stores the HTML elements it has found. The elements inside it are the objects we can actually use and call methods on.
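The same distinction can be seen on a small, made-up snippet. This sketch also shows find, which returns the first match directly (or None when nothing matches) and so avoids the [0] index. The sample_html content and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

sample_html = '<table class="wikitable"><tr><td>1</td></tr></table>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

results = sample_soup.find_all("table")
print(type(results).__name__)     # class name of the container find_all returns
print(isinstance(results, list))  # the container behaves like a list
print(type(results[0]).__name__)  # class name of a single element

# find() returns the first match directly, or None when nothing matches.
first_table = sample_soup.find("table")
missing = sample_soup.find("video")
print(first_table.name, missing)
```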

We now have the table we want, but to be able to load the data into Pandas, we need to extract each of the rows (`<tr>` tags) and their cell values into a nested list. We can do that with just a couple of lines of Python.

In [None]:
# Find all rows with a <tr>...</tr> (table row) element 
rows = table.find_all('tr')
rows[0:4]

In [None]:
# Show the text of rows[7] using the .text attribute
rows[7].text

In [None]:
# Strip this row
rows[7].text.strip()

In [None]:
# Split according to newline
rows[7].text.strip().split("\n")

In [None]:
# Strip and split all rows
rows = [row.text.strip().split("\n") for row in rows]
rows
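Splitting on newlines depends on how the page happens to lay out its cells. An alternative, sketched here on a made-up table, is to read each cell (th or td) of each row explicitly. The sample_html content, the country names, and the values are invented for illustration.

```python
from bs4 import BeautifulSoup

# A made-up table; the countries and values are invented for this sketch.
sample_html = """
<table>
  <tr><th>Country</th><th>Life expectancy</th></tr>
  <tr><td>Examplia</td><td>80.1</td></tr>
  <tr><td>Sampleland</td><td>79.5</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

# One inner list per row, one stripped string per header or data cell.
sample_rows = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in sample_table.find_all("tr")
]
print(sample_rows)
```

The result has the same nested-list shape as the rows list built above.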

From this nested list, we can specify what the column names are and then use the rest of the data to populate a data frame.

In [None]:
import pandas as pd

In [None]:
data = rows[1:]
df = pd.DataFrame(data)
df.head(10)
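When the header row has exactly one entry per data column, the column names can be passed straight to the DataFrame constructor. The sketch below assumes that tidy case and uses made-up rows. It also converts the numeric column, since scraped values arrive as strings.

```python
import pandas as pd

# Made-up rows in the same shape as the nested list built above.
sample_rows = [
    ["Country", "Life expectancy"],
    ["Examplia", "80.1"],
    ["Sampleland", "79.5"],
]

header, data = sample_rows[0], sample_rows[1:]
sample_df = pd.DataFrame(data, columns=header)

# Scraped values arrive as strings; convert the numeric column explicitly.
sample_df["Life expectancy"] = pd.to_numeric(sample_df["Life expectancy"])
print(sample_df.dtypes)
```

The Wikipedia table in this lesson is not that tidy, which is why the following cells adjust the data before assigning column names.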

In [None]:
# For the first two rows, copy the value in column 4 into column 3
df.iloc[0,3] = df.iloc[0,4]
df.iloc[1,3] = df.iloc[1,4]
df.head(10)

In [None]:
# Drop the columns labeled 1 and 4, which we no longer need
df = df.drop([1,4], axis = 1)
df.head(10)

In [None]:
# Use the header row (minus its second entry) as the column names
del rows[0][1]
df.columns = rows[0]
df.head(10)

In [None]:
# Rename column 
df = df.rename(columns = {'Life expectancy[1]': 'Life expectancy'})
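If several column names carry footnote markers, renaming them one by one gets tedious. A possible alternative, sketched on a made-up data frame, strips any bracketed marker from the end of every column name at once. The column names and values here are invented for illustration.

```python
import pandas as pd

# A made-up data frame whose column names end in footnote markers.
sample_df = pd.DataFrame(
    [["Examplia", "80.1"]],
    columns=["Country[a]", "Life expectancy[1]"],
)

# Remove any bracketed marker such as "[1]" or "[a]" from the end of each name.
sample_df.columns = sample_df.columns.str.replace(r"\[[^\]]*\]$", "", regex=True)
print(list(sample_df.columns))
```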

In [None]:
df

In [None]:
df.loc[9,'Country']

## Web Scraping Challenges

The two scraping tasks we performed in this lesson were possible because the information we wanted was embedded directly in each page's HTML. It is important to note that this is not always the case, and when it is not, your scraping efforts will be more difficult (if not impossible).

Aside from this, there are several other factors that may present challenges when performing web scraping. Below is a list of challenges and considerations that should be helpful to keep in mind while performing web scraping.

- Determine what information you want to extract from each page.
- Consider creating a customized scraper for each site to account for different formatting from one site to the next.
- Consider that different pages within the same site may have different structures.
- Consider that a page's content and structure can change over time.
- Check the website's terms of service, which may not allow scraping of its pages.
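Because a page's structure can change over time, it can help to have a scraper fail with a clear message rather than an obscure error further down. The sketch below shows one way to do that. The find_table helper and both sample pages are hypothetical and written only for illustration.

```python
from bs4 import BeautifulSoup


def find_table(html, css_class):
    """Return the first table carrying css_class, or raise a clear error.

    find_table is a hypothetical helper written for this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=css_class)
    if table is None:
        raise ValueError(f"No table with class {css_class!r}; the page layout may have changed.")
    return table


old_page = '<table class="wikitable"><tr><td>80.1</td></tr></table>'
new_page = '<div class="data-grid">80.1</div>'  # same data, restructured markup

print(find_table(old_page, "wikitable").td.get_text())
try:
    find_table(new_page, "wikitable")
except ValueError as error:
    print(error)
```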

## Summary

In this lesson, we covered the basics of web scraping. We began by looking at an example of how we can scrape text from a web page using Python's requests and BeautifulSoup libraries. We then studied a more complex example where we had to extract a specific table from the HTML of a web page and then extract the rows of that table so that we could load them into a Pandas data frame. We finished the lesson by noting some important challenges and considerations you should keep in mind while scraping. Now that you have completed this lesson, you should have the skills you need to obtain data from web pages and structure it so that it can be analyzed.