# Tutorial 1

In this tutorial we are going to practice retrieving data from the web using Python.

Let's start by using urllib to read data from web pages.

In [1]:
# Import the urllib.request module and read the entire content of toscrape.com
import urllib.request 

request_url = urllib.request.urlopen('https://toscrape.com/') 
print(request_url.read())

b'<!DOCTYPE html>\n<html lang="en">\n    <head>\n        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n        <title>Scraping Sandbox</title>\n        <link href="./css/bootstrap.min.css" rel="stylesheet">\n        <link href="./css/main.css" rel="stylesheet">\n    </head>\n    <body>\n        <div class="container">\n            <div class="row">\n                <div class="col-md-1"></div>\n                <div class="col-md-10 well">\n                    <img class="logo" src="img/zyte.png" width="200px">\n                    <h1 class="text-right">Web Scraping Sandbox</h1>\n                </div>\n            </div>\n\n            <div class="row">\n                <div class="col-md-1"></div>\n                <div class="col-md-10">\n                    <h2>Books</h2>\n                    <p>A <a href="http://books.toscrape.com">fictional bookstore</a> that desperately wants to be scraped. It\'s a safe place for beginners learning web scraping and for deve

In three lines of code, the content of toscrape.com was retrieved and printed, but the raw bytes output is hard to read. The next lines of code read the same page again and print it line by line in a more readable format. .decode() converts each line from bytes into a string, and .strip() removes any leading and trailing whitespace.
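
To see what these two methods do without touching the network, here is a minimal sketch that applies them to a hand-written bytes literal. The sample bytes are made up for illustration and are not actual output from the site.

```python
# A hand-written bytes object standing in for one line of a downloaded page
# (illustrative only; not real output from toscrape.com).
raw_line = b'    <title>Scraping Sandbox</title>\n'

# decode() turns bytes into a str; UTF-8 is the default encoding.
text_line = raw_line.decode()

# strip() removes leading and trailing whitespace, including the newline.
clean_line = text_line.strip()

print(type(raw_line).__name__, type(text_line).__name__)
print(repr(text_line))
print(repr(clean_line))
```

Printing with repr() makes the whitespace visible, so the effect of strip() is easy to spot.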

In [2]:
request_url = urllib.request.urlopen('https://toscrape.com/') 
for line in request_url:
    print(line.decode().strip())

<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Scraping Sandbox</title>
<link href="./css/bootstrap.min.css" rel="stylesheet">
<link href="./css/main.css" rel="stylesheet">
</head>
<body>
<div class="container">
<div class="row">
<div class="col-md-1"></div>
<div class="col-md-10 well">
<img class="logo" src="img/zyte.png" width="200px">
<h1 class="text-right">Web Scraping Sandbox</h1>
</div>
</div>

<div class="row">
<div class="col-md-1"></div>
<div class="col-md-10">
<h2>Books</h2>
<p>A <a href="http://books.toscrape.com">fictional bookstore</a> that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well. Available at: <a href="http://books.toscrape.com">books.toscrape.com</a></p>
<div class="col-md-6">
<a href="http://books.toscrape.com"><img src="./img/books.png" class="img-thumbnail"></a>
</div>
<div class="col-md-6"

<font color='red'>**TO DO:** Try using the same lines of code to read the following web page: https://data.pr4e.org/romeo.txt

Open the webpage first in your browser to see how it looks and then use Python code to retrieve all its content into your Jupyter notebook. It should look something like this:</font>

In [7]:
# Your code goes here.

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


<font color='red'>**TO DO:** Retrieve content from other HTML pages, such as Wikipedia.</font>

Next we are going to explore how we can read data from the web that is saved in the XML format. To do this, we are going to use a Python module called *xml* and explore some of its basic functionality for parsing XML data.

Start by opening the following page using your browser: 'https://www.w3schools.com/xml/note.xml'.

Now let's retrieve its XML content using Python.

In [20]:
# xml.etree.ElementTree is a standard-library module for parsing XML data
import xml.etree.ElementTree as ET

# requests is a third-party library for making HTTP requests
import requests

# This is the URL we want to retrieve data from
url = 'https://www.w3schools.com/xml/note.xml'

# Use requests to fetch the data from the URL over HTTP(S)
resp = requests.get(url)

# We can save the data in a local xml file
file_name = 'xmldata.xml'  
with open(file_name, 'wb') as f:
    f.write(resp.content) 

Open the saved xml file to see its content.
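
The download-and-save step can also be sketched with the standard library's urllib.request, which was used at the start of this tutorial. The helper name download_to_file and its timeout parameter are hypothetical, introduced here only for illustration; the point is to show a with-statement that closes the connection and a basic check for download errors.

```python
import urllib.request
from urllib.error import URLError


def download_to_file(url, file_name, timeout=10):
    """Download url and save the raw bytes to file_name.

    Hypothetical helper for illustration. Returns True on success and
    False if the URL could not be opened.
    """
    try:
        # The with-statement closes the connection once the data is read.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            data = response.read()
    except URLError as err:
        # URLError also covers HTTP error statuses (HTTPError is a subclass).
        print('Download failed:', err)
        return False

    # Write the bytes unchanged, so the saved file matches the server's copy.
    with open(file_name, 'wb') as f:
        f.write(data)
    return True


# Example call (commented out so the cell runs without network access):
# download_to_file('https://www.w3schools.com/xml/note.xml', 'xmldata.xml')
```

Returning a boolean keeps the sketch short; in a larger script you might prefer to let the exception propagate instead.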

Now that we have an XML file saved in our working directory, we can parse it to extract useful information.

In [17]:
# Parse the file with ElementTree (ET) to build a tree object holding the whole XML structure
tree = ET.parse(file_name)

# Get the root element (the top-level element of the XML structure)
root = tree.getroot()

# Print the tag of the root element
print(root.tag)

note
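
If the XML is already in memory, it can also be parsed directly without saving a file first. The sketch below uses a hand-typed string whose structure is assumed from the outputs shown in this tutorial; it is a stand-in, not the downloaded file itself.

```python
import xml.etree.ElementTree as ET

# Hand-typed stand-in for the note document (assumed structure).
xml_text = """<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""

# fromstring() parses a string (or bytes) and returns the root element directly,
# so there is no separate tree object and no getroot() call.
root_from_string = ET.fromstring(xml_text)

print(root_from_string.tag)
print(len(root_from_string), 'child elements')
```

A different variable name is used here so that the root object created from the saved file is left untouched.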


To print the tag and attributes of each child of the root element, you can do something like this:

In [23]:
for child in root:
    print(child.tag, child.attrib)

to {}
from {}
heading {}
body {}


To extract the text content of an element, use its text attribute, for example by indexing into the root:

In [38]:
print(root[0].text)
print(root[1].text)
print(root[2].text)
print(root[3].text)

Tove
Jani
Reminder
Don't forget me this weekend!


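You can also look up a child element by its tag name with find(), which does not depend on the order of the elements:

In [37]: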
print(root.find('to').text)
print(root.find('from').text)
print(root.find('heading').text)
print(root.find('body').text)

Tove
Jani
Reminder
Don't forget me this weekend!
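
One thing to watch for with find() is what happens when the tag does not exist. The sketch below uses a small hand-typed element and a made-up tag name, signature, to show the behaviour; findtext() is a related ElementTree method that accepts a default value.

```python
import xml.etree.ElementTree as ET

# Small hand-typed example (illustrative only).
sample_root = ET.fromstring('<note><to>Tove</to><from>Jani</from></note>')

# find() returns None when no direct child has the requested tag,
# so calling .text on the result would raise an AttributeError.
missing = sample_root.find('signature')
print(missing)

# findtext() returns the text of the first match, or the default if none exists.
to_text = sample_root.findtext('to')
signature_text = sample_root.findtext('signature', default='(no signature)')
print(to_text)
print(signature_text)
```

Checking for None, or using findtext() with a default, keeps the code from failing on documents where some elements are optional.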


<font color='red'>**TO DO:** To finish, download the xml file called countries_data.xml (from Canvas) and explore parsing this xml file with Python. For more information, see here: https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree</font>

In [39]:
# Use xml ET functionality to create a tree object with all the xml structure
# Your code here
  
# Get root element (the first element of the xml structure)
# Your code here
  
# Print the tag of root (first element of xml structure)
# Your code here

In [40]:
# Print all tags and attributes (see above code for example)
# Your code here

In [None]:
# What do the following lines of code print?
print(root[0][1].text)
print(root[1][0].text)
print(root[1][1].text)

In [None]:
# Run the following code to print the attributes of every neighbor element
for neighbor in root.iter('neighbor'):
    print(neighbor.attrib)

In [None]:
# Run the following code to print the name and rank of each country
for country in root.findall('country'):
    rank = country.find('rank').text
    name = country.get('name')
    print(name, rank)
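
The same calls can be combined to collect the data into Python structures rather than just printing it. The sketch below runs on a small hand-typed sample whose element and attribute names are assumed from the code above; the real countries_data.xml from Canvas may contain different or additional fields.

```python
import xml.etree.ElementTree as ET

# Hand-typed sample (assumed structure; not the actual Canvas file).
sample = """<data>
  <country name="Liechtenstein">
    <rank>1</rank>
    <neighbor name="Austria" direction="E"/>
    <neighbor name="Switzerland" direction="W"/>
  </country>
  <country name="Singapore">
    <rank>4</rank>
    <neighbor name="Malaysia" direction="N"/>
  </country>
</data>"""

sample_root = ET.fromstring(sample)

# Build one dictionary per country: get() reads an attribute,
# find(...).text reads the text of a child element.
countries = []
for country in sample_root.findall('country'):
    countries.append({
        'name': country.get('name'),
        'rank': int(country.find('rank').text),
        'neighbors': [n.get('name') for n in country.findall('neighbor')],
    })

for row in countries:
    print(row)
```

Storing the result as a list of dictionaries makes it straightforward to sort or filter the countries later.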

In [None]:
# Keep exploring the data using your own code
# Your code here