<center>
  <a href="11.1-working-with-databases.ipynb">Previous Page</a> | <a href="./">Content Page</a> | <a href="12.1.A(I)-Getting%20Real%20time%20Data.ipynb">Activity: Get Real-Time Currency</a>
</center>

# 1.12 Working with Web Data (Web Scraping): Getting Real-Time Data

Now that we know how to access data from text files and databases, let's explore how to work with web data such as web pages.

<video controls src="fig/Intel_IoT.mp4" />

Sometimes the data we want is not readily available as a CSV file or in a SQL database; often it can only be found on the Web. Python makes it easy to extract content from web pages (a technique known as *web scraping*) and turn it into data that we can work with.

One popular Python package for web scraping is `BeautifulSoup` (https://www.crummy.com/software/BeautifulSoup/). It is normally installed as part of the Anaconda distribution; if it is missing from your environment, it can be installed with `pip`.

Let us explore how we can scrape data from web pages. First, let's make sure the package is installed and then import the BeautifulSoup module:

In [1]:
!pip install bs4

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1272 sha256=a7d2232aa649a05903c30c9893b86ff22c4540d843c4b59d3d61780c66ea652c
  Stored in directory: /home/mpheng/.cache/pip/wheels/73/2b/cb/099980278a0c9a3e57ff1a89875ec07bfa0b6fcbebb9a8cad3
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1


In [2]:
from bs4 import BeautifulSoup

The web page that we want to scrape data from is O'Reilly's Free Programming Ebooks page at http://www.oreilly.com/programming/free/. Let's look at the structure of the page by opening it in a web browser and viewing the page source. For today's exercise, we simply want to list the names and URLs of all the books that O'Reilly has made available for free.

![](http://localhost:8888/tree/03-working-with-data-sources/web-data.png)



We can see that the data we are interested in is buried deep in the HTML content, in particular in the `<a></a>` anchor (link) tags.

To start scraping the data, we need to import another third-party module, `requests`, which enables us to download content from the web.

In [3]:
import requests

webpage_url = "http://www.oreilly.com/programming/free/"

# download the content into a variable
r = requests.get(webpage_url)

Now that we have a `Response` object (returned by `requests.get()`), we can check its `status_code` as well as its content.

In [4]:
r.status_code

200
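
A status code of 200 means the request succeeded, but a download can also fail or hang. The following is a minimal sketch of a more defensive download, assuming we want `None` back when anything goes wrong. The helper name `fetch_html` and the 10-second default timeout are illustrative choices, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails.

    `fetch_html` and its `timeout` default are hypothetical names
    introduced for this sketch.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
        response.raise_for_status()
    except requests.RequestException as exc:
        # RequestException is the base class for errors raised by requests
        print("Could not download", url, "->", exc)
        return None
    return response.content


# An address without a scheme (no "http://") is rejected by requests
# before any network traffic happens, so this returns None
result = fetch_html("not-a-valid-url")
print(result)
```

With a working connection, `fetch_html(webpage_url)` would return the same bytes as `r.content`.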

In [5]:
# check the size of the content
len(r.content)


72456

In [6]:

# let's just show the first 200 bytes (r.content is a bytes object)
r.content[0:200]

b'<!DOCTYPE html>\r\n<html lang="en">\r\n<head>\r\n\r\n  <meta charset="utf-8">\r\n\r\n  \t<title>O\'Reilly Media - Technology and Business Training</title>\n\t<meta name="description" content="Gain technology and busi'
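
The `b'...'` prefix in the output shows that `r.content` is a `bytes` object, so `len(r.content)` counts bytes rather than characters. The decoded string is available as `r.text`. The sketch below illustrates the difference with a short made-up byte string standing in for `r.content`, so it runs without a network connection.

```python
# A short byte string standing in for r.content (made-up sample, not the real page)
raw = "<p>Jonas Bonér</p>".encode("utf-8")

# bytes: the length counts bytes, and "é" takes two bytes in UTF-8
print(type(raw), len(raw))

# str: after decoding, the length counts characters
text = raw.decode("utf-8")
print(type(text), len(text))
```

BeautifulSoup accepts either form, which is why the notebook can pass `r.content` to it directly.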

Now that we have the content, we can use BeautifulSoup to parse the page to get the data we want.

In [7]:
soup = BeautifulSoup(r.content, "html.parser")

However, to avoid depending on the network, the web page content has also been downloaded into the current directory as `free-oreilly-books.html`. We can pass the content of this previously downloaded local copy to BeautifulSoup instead.

Here is how we can do that.

In [8]:
# read the local copy; the with block closes the file automatically
with open("free-oreilly-books.html", "r", encoding="utf8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

We can work with the web page data via the `soup` object. If we print the `soup` object, it shows the same web page content as before. Since we are only interested in a subset of the content (the list of free programming books), we only need the *anchor links* that have the attribute `data-container="body"`. We can extract exactly the data we need with the `soup.find_all()` method.

In [9]:
# this returns the anchor tags for the free programming books
free_programming_books_data = soup.find_all('a', attrs={'data-container': "body"})



In [10]:
type(free_programming_books_data)

bs4.element.ResultSet
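
To see how the `attrs` filter narrows the result, it can help to try `find_all()` on a small fragment first. The sketch below uses a made-up HTML snippet that only imitates the structure of the book list; the names `sample_html` and `sample_soup` are illustrative.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment that imitates the structure of the book list
sample_html = """
<div>
  <a href="/about">About</a>
  <a data-container="body" data-original-title="Book A" href="/a.csp">A</a>
  <a data-container="body" data-original-title="Book B" href="/b.csp">B</a>
  <a data-container="footer" href="/contact">Contact</a>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# All anchors, regardless of their attributes
all_links = sample_soup.find_all("a")

# Only anchors whose data-container attribute equals "body"
book_links = sample_soup.find_all("a", attrs={"data-container": "body"})

# Anchors that have a data-container attribute with any value
any_container = sample_soup.find_all("a", attrs={"data-container": True})

print(len(all_links), len(book_links), len(any_container))
```

Matching on the exact value `"body"` is what keeps unrelated links out of the result.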

In [11]:
# let's see the first element
free_programming_books_data[0]

<a data-container="body" data-content="Adopting microservices requires much more than changes to your technology. Author Christian Posta—a Principal Middleware Specialist/Architect at Red Hat—also examines the organizational agility necessary to deliver these services. This concise book shows you how rapid feedback cycles, autonomous teams, and shared purpose are key to making microservices work." data-original-title="Microservices for Java Developers: A Hands-On Introduction to Frameworks and Containers" data-placement="auto left" data-toggle="popover" data-trigger="hover" href="http://www.oreilly.com/programming/free/microservices-for-java-developers.csp" title="">
<img alt="Microservices for Java Developers" src="./free-oreilly-books_files/cat.gif"/>
</a>

In [12]:
# let's get the number of free books available
print("Number of free programming books available: " + str(len(free_programming_books_data)))

# let's see the last element
free_programming_books_data[-1]

Number of free programming books available: 37


<a data-container="body" data-content="In this concise book, Lightbend CTO Jonas Bonér explains why microservice-based architecture that consists of small, independent services is far more flexible than the traditional all-in-one systems that continue to dominate today’s enterprise landscape." data-original-title="Reactive Microservices Architecture: Design Principles for Distributed Systems" data-placement="auto left" data-toggle="popover" data-trigger="hover" href="http://www.oreilly.com/programming/free/reactive-microservices-architecture-orm.csp" title="">
<img alt="Reactive Microservices Architecture" src="./free-oreilly-books_files/cat(34).gif"/>
</a>

It looks like the attributes that will be useful to us are:

* `data-content`
* `data-original-title`
* `href`

`soup.find_all()` returns a `ResultSet`, a list-like collection of `Tag` objects. Tags can be manipulated in a number of ways, including accessing their attributes as if the tag were a dictionary. Let's print the first tag's information using just the three attributes above.

In [13]:
type(free_programming_books_data[0])

bs4.element.Tag

In [14]:
tag = free_programming_books_data[0]

print('Title: ', tag['data-original-title'])
print('URL: ', tag['href'])
print('Description: ', tag['data-content'])

Title:  Microservices for Java Developers: A Hands-On Introduction to Frameworks and Containers
URL:  http://www.oreilly.com/programming/free/microservices-for-java-developers.csp
Description:  Adopting microservices requires much more than changes to your technology. Author Christian Posta—a Principal Middleware Specialist/Architect at Red Hat—also examines the organizational agility necessary to deliver these services. This concise book shows you how rapid feedback cycles, autonomous teams, and shared purpose are key to making microservices work.
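
Dictionary-style access raises a `KeyError` when a tag does not have the requested attribute. As a hedged sketch, `Tag.get()` can be used to supply a default instead. The anchor below is made up for illustration and deliberately has no `data-content` attribute.

```python
from bs4 import BeautifulSoup

# Made-up anchor that has a title and URL but no data-content attribute
html = '<a data-container="body" data-original-title="Book A" href="/a.csp"></a>'
sample_tag = BeautifulSoup(html, "html.parser").find("a")

# Dictionary-style access works for attributes that exist
print(sample_tag["href"])

# .get() returns the default instead of raising when the attribute is missing
description = sample_tag.get("data-content", "")
print(repr(description))

# .attrs is the underlying dictionary of all attributes on the tag
print(sorted(sample_tag.attrs))

# Square brackets raise KeyError for a missing attribute
try:
    sample_tag["data-content"]
except KeyError:
    print("no data-content attribute")
```

Using `.get()` is a reasonable choice when looping over many tags, since one incomplete tag would otherwise stop the loop.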


Now that we can easily get the content we want, how would you print out a list of all the books? How would you save this information into a file or database?

In [15]:
# loop over the tags directly instead of indexing by position
for tag in free_programming_books_data:
    print('Title: ', tag['data-original-title'])
    print('URL: ', tag['href'])
    print('Description: ', tag['data-content'])
    print("\n")

Title:  Microservices for Java Developers: A Hands-On Introduction to Frameworks and Containers
URL:  http://www.oreilly.com/programming/free/microservices-for-java-developers.csp
Description:  Adopting microservices requires much more than changes to your technology. Author Christian Posta—a Principal Middleware Specialist/Architect at Red Hat—also examines the organizational agility necessary to deliver these services. This concise book shows you how rapid feedback cycles, autonomous teams, and shared purpose are key to making microservices work.


Title:  Modern Java EE Design Patterns
URL:  http://www.oreilly.com/programming/free/modern-java-ee-design-patterns.csp
Description:  With the ascent of DevOps, microservices, containers, and cloud-based development platforms, the gap between state-of-the-art solutions and the technology that enterprises typically support has greatly increased. But as Markus Eisele explains in this O’Reilly ebook, some enterprises are now looking to bridge

In [16]:
import pandas as pd

# collect one row per book, then build the DataFrame in a single step
rows = []
for tag in free_programming_books_data:
    rows.append({
        "Title": tag['data-original-title'],
        "URL": tag['href'],
        "Description": tag['data-content'],
    })

dfsave = pd.DataFrame(rows, columns=['Title', 'URL', 'Description'])

# writing "books.xls" needs the xlwt package, which was missing here and
# raised ModuleNotFoundError; save as .xlsx instead
# (assumes the openpyxl package is available)
dfsave.to_excel("books.xlsx")
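
As an alternative that needs only the Python standard library, the same information could be saved to a CSV file or a SQLite database. This is a minimal sketch under stated assumptions: the `books` list holds made-up rows in the same shape as the scraped data, and the helper names, file names, and table layout are illustrative rather than part of the original notebook.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Made-up rows in the same shape as the scraped data; real rows could be
# built from the Tag objects in free_programming_books_data
books = [
    {"Title": "Book A", "URL": "http://example.com/a", "Description": "First sample"},
    {"Title": "Book B", "URL": "http://example.com/b", "Description": "Second sample"},
]


def save_books_csv(rows, path):
    """Write the rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf8") as f:
        writer = csv.DictWriter(f, fieldnames=["Title", "URL", "Description"])
        writer.writeheader()
        writer.writerows(rows)


def save_books_sqlite(rows, path):
    """Insert the rows into a books table, keyed by URL."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS books "
            "(url TEXT PRIMARY KEY, title TEXT, description TEXT)"
        )
        # INSERT OR REPLACE avoids duplicate rows if the cell is run twice
        conn.executemany(
            "INSERT OR REPLACE INTO books (url, title, description) "
            "VALUES (:URL, :Title, :Description)",
            rows,
        )
        conn.commit()
    finally:
        conn.close()


# Write into a scratch directory so the sketch does not overwrite existing
# files; use Path(".") instead to save next to the notebook
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "books.csv"
db_path = out_dir / "books.db"

save_books_csv(books, csv_path)
save_books_sqlite(books, db_path)

# Read the row count back to confirm the data was stored
conn = sqlite3.connect(db_path)
saved_count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
conn.close()
print(saved_count, "books saved to", db_path)
```

Using the URL as the primary key is a design choice for this sketch: it makes re-running the save step update existing rows instead of adding duplicates.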

### Open the Excel file and check the data that was saved
