# Notebook №9. Information Technologies

Performed by Movenko Konstantin, IS/b-21-2-o

## Working with the Web Resources API using XML

In [1]:
from bs4 import BeautifulSoup

### API and XML

When we analyze web pages and extract information from them, we are trying to write a program that behaves like a person, and that can be difficult. Fortunately, more and more sites offer information in a form that can be processed easily not only by a person but also by another program. This is called an API (application programming interface). An ordinary user interface is a way for a person to interact with a program, while an API is a way for one program to interact with another, for example your Python script with a remote web server.

HTML is used to store web pages that people read. Other languages are used to store arbitrary structured data exchanged between programs, in particular XML, which resembles HTML. More precisely, XML is a *metalanguage*, that is, a way of describing languages. Unlike HTML, the set of tags in an XML document can be arbitrary (it is determined by the developer of a specific XML dialect). For example, a description of a student group in XML might look like this:

<pre>
&lt;group&gt;
    &lt;number&gt;IS/b-21-2-о&lt;/number&gt;
    &lt;student&gt;
        &lt;firstname&gt;Konstantin&lt;/firstname&gt;
        &lt;lastname&gt;Movenko&lt;/lastname&gt;
    &lt;/student&gt;
    &lt;student&gt;
        &lt;firstname&gt;Anastasia&lt;/firstname&gt;
        &lt;lastname&gt;Olkhovskaya&lt;/lastname&gt;
    &lt;/student&gt;
&lt;/group&gt;
</pre>

To process XML files, you can use the same *Beautiful Soup* package that we have already used to work with HTML. The only difference is that you need to pass the additional parameter `features="xml"` when calling `BeautifulSoup`, so that it parses the document as XML instead of treating it as HTML.

If the `features="xml"` parameter leads to an error, then you need to install the `lxml` package. To do this, open the Anaconda Prompt window and run the `pip install lxml` command.
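Whether `lxml` is importable can also be checked in advance. The following is a minimal sketch that uses only the standard library; the variable name `has_lxml` is introduced here for illustration.

```python
import importlib.util

# find_spec returns None when the package cannot be located
has_lxml = importlib.util.find_spec("lxml") is not None

if has_lxml:
    print("lxml is available, so features='xml' should work")
else:
    print("lxml is missing; install it with: pip install lxml")
```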

In [2]:
# assign a string with xml data to the variable 
group = """
<group>
    <number>IS/b-21-2-о</number>
    <student>
        <firstname>Konstantin</firstname>
        <lastname>Movenko</lastname>
    </student>
    <student>
        <firstname>Anastasia</firstname>
        <lastname>Olkhovskaya</lastname>
    </student>
</group>
"""

In [3]:
obj = BeautifulSoup(group, features="xml") # parse the string as xml
print(obj.prettify())                      # print parsed string with formatting

<?xml version="1.0" encoding="utf-8"?>
<group>
 <number>
  IS/b-21-2-о
 </number>
 <student>
  <firstname>
   Konstantin
  </firstname>
  <lastname>
   Movenko
  </lastname>
 </student>
 <student>
  <firstname>
   Anastasia
  </firstname>
  <lastname>
   Olkhovskaya
  </lastname>
 </student>
</group>



This is how we can find the group number in our XML document:

In [4]:
# print string content in <number> tag
obj.group.number.string

'IS/b-21-2-о'

This means "find the `group` tag in the `obj` object, find the `number` tag inside it, and output what it contains as a string".
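This dotted navigation assumes that every tag along the path exists. A small sketch of what happens otherwise: Beautiful Soup returns `None` for a tag it cannot find, so it is worth checking before asking for `.string`. The document and the `teacher` tag below are made up for illustration.

```python
from bs4 import BeautifulSoup

# a tiny hypothetical document with no <teacher> tag
doc = BeautifulSoup("<group><number>42</number></group>", features="xml")

print(doc.group.number.string)   # the tag exists, so we get its text
print(doc.group.teacher)         # the tag is absent, so we get None

# guard against the missing tag before reading .string
teacher = doc.group.teacher
name = teacher.string if teacher is not None else "unknown"
print(name)
```

Without the check, `doc.group.teacher.string` would raise an `AttributeError`, because `None` has no `string` attribute.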

And this is how you can list all the students:

In [5]:
# find all <student> tags and list their content
for student in obj.group.find_all('student'):
    print(student.lastname.string, student.firstname.string)

Movenko Konstantin
Olkhovskaya Anastasia


### Getting a list of articles from the category in Wikipedia

Suppose we need a list of all articles in some Wikipedia category. We could open the category in a browser and use the methods discussed above. However, Wikipedia has a convenient API. To learn how to work with it, you will have to read the [documentation](https://www.mediawiki.org/wiki/API:Main_page) (this is the case with any API), but it only seems complicated the first time.

So, let's get started. Interaction with a server through an API consists of sending specially formed requests and receiving responses in one of several machine-readable formats. We are interested in XML, although there are others (later we will get acquainted with JSON). For example, we can send the following request:

https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Physics&cmsort=timestamp&cmdir=desc&format=xmlfm

The string `https://en.wikipedia.org/w/api.php` (before the question mark) is the API *entry point*. Everything after the question mark is the request itself, known as the query string. It is something like a dictionary: it consists of `key=value` pairs separated by ampersands (`&`). Some characters have to be encoded in a special way.
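The structure of such an address can be explored with the standard library's `urllib.parse` module. The sketch below splits the link above into the entry point and a dictionary of parameters, then goes in the opposite direction and shows how special characters are encoded. The variable names are chosen here for illustration.

```python
from urllib.parse import urlsplit, parse_qs, urlencode, quote

link = ("https://en.wikipedia.org/w/api.php?action=query&list=categorymembers"
        "&cmtitle=Category:Physics&cmsort=timestamp&cmdir=desc&format=xmlfm")

parts = urlsplit(link)

# everything before the question mark
entry_point = f"{parts.scheme}://{parts.netloc}{parts.path}"

# parse_qs returns a list of values per key; each key occurs once here
query = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(entry_point)
print(query)

# the reverse direction: a dictionary becomes an encoded query string
rebuilt = urlencode(query)
print(rebuilt)

# how a colon and a space are encoded
print(quote("Category:Physical systems"))   # Category%3APhysical%20systems
```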

For example, the address above says that we want to make a query (`action=query`) that lists the members of a category (`list=categorymembers`), that the category we are interested in is `Category:Physics` (`cmtitle=Category:Physics`), and it sets some other parameters, among them `format=xmlfm`, which asks for XML formatted for viewing in a browser. If you follow this link, something like this will open:

<pre>
&lt;?xml version="1.0"?&gt;
&lt;api batchcomplete=""&gt;
  &lt;continue cmcontinue="2015-05-30 19:37:50|1653925" continue="-||" /&gt;
  &lt;query&gt;
    &lt;categorymembers&gt;
      &lt;cm pageid="24293838" ns="0" title="Wigner rotation" /&gt;
      &lt;cm pageid="48583145" ns="0" title="Northwest Nuclear Consortium" /&gt;
      &lt;cm pageid="48407923" ns="0" title="Hume Feldman" /&gt;
      &lt;cm pageid="48249441" ns="0" title="Phase Stretch Transform" /&gt;
      &lt;cm pageid="47723069" ns="0" title="Epicatalysis" /&gt;
      &lt;cm pageid="2237966" ns="14" title="Category:Surface science" /&gt;
      &lt;cm pageid="2143601" ns="14" title="Category:Interaction" /&gt;
      &lt;cm pageid="10844347" ns="14" title="Category:Physical systems" /&gt;
      &lt;cm pageid="18726608" ns="14" title="Category:Physical quantities" /&gt;
      &lt;cm pageid="22688097" ns="0" title="Branches of physics" /&gt;
    &lt;/categorymembers&gt;
  &lt;/query&gt;
&lt;/api&gt;
</pre>

Among the various tags here, the ones we are interested in are the `<cm>` tags inside the `<categorymembers>` tag.

Let's make the appropriate request using Python. To do this, we will need the already familiar `requests` module.

In [6]:
import requests

In [7]:
# URL string and parameters dictionary
url = "https://en.wikipedia.org/w/api.php"
params = {
    'action':'query',
    'list':'categorymembers',
    'cmtitle': 'Category:Physics',
    'format': 'xml'
}

# perform a GET query and assign the result to the variable
g = requests.get(url, params=params)

As you can see, we pass the list of parameters in the form of a regular dictionary. Let's see what happened.

In [8]:
# check request result
g.ok

True

It's all good. Now we use *Beautiful Soup* to process this XML.

In [9]:
# parse an XML document
data = BeautifulSoup(g.text, features='xml')

In [10]:
# print a parsed document with formatting
print(data.prettify())

<?xml version="1.0" encoding="utf-8"?>
<api batchcomplete="">
 <continue cmcontinue="subcat|0a8048385a4e3a2e3a4e504e030648385a4e3a2e3a4e504e011a01c5dcbcdc0d|694942" continue="-||"/>
 <query>
  <categorymembers>
   <cm ns="0" pageid="22939" title="Physics"/>
   <cm ns="100" pageid="1653925" title="Portal:Physics"/>
   <cm ns="0" pageid="74985603" title="Edge states"/>
   <cm ns="0" pageid="74535315" title="Emily Fairfax"/>
   <cm ns="0" pageid="74609356" title="Force control"/>
   <cm ns="0" pageid="72041443" title="Overlap fermion"/>
   <cm ns="0" pageid="74170779" title="Toroidal solenoid"/>
   <cm ns="0" pageid="74786976" title="Trajectoid"/>
   <cm ns="14" pageid="70983414" title="Category:Physics by country"/>
   <cm ns="14" pageid="49740128" title="Category:Subfields of physics"/>
  </categorymembers>
 </query>
</api>
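Note that `g.ok` only reflects the HTTP status. As far as we can tell from the MediaWiki documentation, the API reports its own problems (for example, an invalid category name) inside the response body, in an `<error>` element with `code` and `info` attributes. The sketch below checks for that element. To stay runnable without extra packages it uses the standard library's `xml.etree.ElementTree` instead of Beautiful Soup; the function name `api_error` and both sample responses are made up for illustration.

```python
import xml.etree.ElementTree as ET

def api_error(xml_text):
    """Return (code, info) if the response carries an <error> element, else None."""
    root = ET.fromstring(xml_text)
    err = root.find("error")          # direct child of <api>, if present
    if err is None:
        return None
    return err.get("code"), err.get("info")

# hypothetical responses: one normal, one reporting an error
ok_response = ('<api batchcomplete=""><query><categorymembers>'
               '<cm ns="0" pageid="22939" title="Physics"/>'
               '</categorymembers></query></api>')
error_response = ('<api><error code="invalidcategory" '
                  'info="The category name you entered is not valid."/></api>')

print(api_error(ok_response))
print(api_error(error_response))
```

With a real response, the same check could be applied to `g.text` before extracting any data.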



Find all occurrences of the `<cm>` tag and output their `title` attribute:

In [11]:
# print title of each article found
for cm in data.api.query.categorymembers("cm"):
    print(cm['title'])

Physics
Portal:Physics
Edge states
Emily Fairfax
Force control
Overlap fermion
Toroidal solenoid
Trajectoid
Category:Physics by country
Category:Subfields of physics


We could have simplified the search for the `<cm>` tags by not specifying the "full path" to them:

In [12]:
# short form of same commands
for cm in data("cm"):
    print(cm['title'])

Physics
Portal:Physics
Edge states
Emily Fairfax
Force control
Overlap fermion
Toroidal solenoid
Trajectoid
Category:Physics by country
Category:Subfields of physics
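The list mixes ordinary articles with a portal and subcategories. They can be told apart by the `ns` (namespace) attribute: in the output above, articles have `ns="0"` and categories have `ns="14"`. Here is a minimal sketch of such filtering on a shortened copy of that response. It uses the standard library's `xml.etree.ElementTree` so that it runs without extra packages; with Beautiful Soup, the same idea amounts to checking `cm['ns']` inside the loop.

```python
import xml.etree.ElementTree as ET

# shortened copy of the response printed above
sample = """<api batchcomplete="">
 <query>
  <categorymembers>
   <cm ns="0" pageid="22939" title="Physics"/>
   <cm ns="100" pageid="1653925" title="Portal:Physics"/>
   <cm ns="0" pageid="74985603" title="Edge states"/>
   <cm ns="14" pageid="70983414" title="Category:Physics by country"/>
  </categorymembers>
 </query>
</api>"""

root = ET.fromstring(sample)

# attribute values are strings, so compare with "0" and "14", not numbers
articles = [cm.get("title") for cm in root.iter("cm") if cm.get("ns") == "0"]
subcategories = [cm.get("title") for cm in root.iter("cm") if cm.get("ns") == "14"]

print(articles)
print(subcategories)
```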


By default, the server returns a list of 10 items. To get more, we use the `continue` element, which is a kind of hyperlink to the next batch of items.

In [13]:
# hyperlink to next 10 items
data.find("continue")['cmcontinue']

'subcat|0a8048385a4e3a2e3a4e504e030648385a4e3a2e3a4e504e011a01c5dcbcdc0d|694942'

We had to use the `find()` method instead of just writing `data.continue`, because `continue` is a reserved keyword in Python (it is used in loops), so `data.continue` would be a syntax error.
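This can be verified without Beautiful Soup. The short sketch below asks the standard library's `keyword` module about the word and then tries to compile the attribute-style expression; the name `attribute_syntax_ok` is introduced only for this example.

```python
import keyword

# is the word reserved by the language?
print(keyword.iskeyword("continue"))   # True

# try to compile the expression without running it
try:
    compile("data.continue", "<example>", "eval")
    attribute_syntax_ok = True
except SyntaxError:
    attribute_syntax_ok = False

print(attribute_syntax_ok)             # False
```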

Now let's add `cmcontinue` to our request and execute it again:

In [14]:
# add new continue parameter
params['cmcontinue'] = data.api("continue")[0]['cmcontinue']

In [15]:
# request the next batch of items and print them
g = requests.get(url, params=params)
data = BeautifulSoup(g.text, features='xml')
for cm in data.api.query.categorymembers("cm"):
    print(cm['title'])

Category:Physicists
Category:Concepts in physics
Category:Eponyms in physics
Category:Physics-related lists
Category:Physical modeling
Category:Physics in society
Category:Works about physics
Category:Physics stubs


We got the next batch of items from the category (eight this time rather than ten, which suggests that nothing is left). Continuing in this way, you can download an entire category, although for a large one it will take a lot of time.
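The repeated "request, read `cmcontinue`, request again" steps can be wrapped in a loop. What follows is only a sketch under several assumptions: the loop stops when the response has no `<continue>` element, only `cmcontinue` is passed back (as in the cells above), and parsing is done with the standard library's `xml.etree.ElementTree` so that the example runs offline. The names `iter_titles`, `fake_fetch`, and `PAGES` are hypothetical, and the two "pages" are invented stand-ins for server responses.

```python
import xml.etree.ElementTree as ET

def iter_titles(fetch, params):
    """Yield titles batch by batch, following cmcontinue until it disappears.

    fetch(params) must return the XML text of one response.
    """
    params = dict(params)              # work on a copy of the caller's dict
    while True:
        root = ET.fromstring(fetch(params))
        for cm in root.iter("cm"):
            yield cm.get("title")
        cont = root.find("continue")
        if cont is None:               # no <continue> element: last batch
            break
        params["cmcontinue"] = cont.get("cmcontinue")

# offline stand-in for the server: two invented pages keyed by cmcontinue
PAGES = {
    None: ('<api><continue cmcontinue="page2" continue="-||"/>'
           '<query><categorymembers>'
           '<cm pageid="1" ns="0" title="A"/><cm pageid="2" ns="0" title="B"/>'
           '</categorymembers></query></api>'),
    "page2": ('<api batchcomplete=""><query><categorymembers>'
              '<cm pageid="3" ns="14" title="Category:C"/>'
              '</categorymembers></query></api>'),
}

def fake_fetch(params):
    return PAGES[params.get("cmcontinue")]

titles = list(iter_titles(fake_fetch, {"action": "query"}))
print(titles)
```

Against the real API, `fetch` could be something like `lambda p: requests.get(url, params=p).text`. The number of requests can also be reduced by asking for larger batches with the `cmlimit` parameter, which the MediaWiki documentation describes for this list.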

Working with the many other APIs available on different sites follows the same pattern. Some APIs are completely open (as in Wikipedia), some require you to register and obtain an application id and a key, and some even charge a fee (for example, an automatic Google search costs something like $5 per 100 requests). Some APIs only allow you to read information, while others also allow you to edit it. For example, you can write a script that automatically saves some information in Google Sheets. Whenever you use an API, you will have to study its documentation, but this is generally easier than processing HTML code. Sometimes API access can be simplified further by using special libraries.