# Flickr 'API' access tutorial: notebookerised

## This notebook:
- Follows [the tutorial](https://wiki.bl.uk:8443/download/attachments/139627026/9%20Using%20an%20API%20-%20hands-on%20exercises%202014.pdf?api=v2) developed by Owen Stephens on behalf of the British Library to access a Flickr "API" (actually just the RSS feed) at the BL
- Applies it in a Jupyter notebook
- Flickr has a powerful API (requires registration)!
    - You will not be using it
        - Just the RSS feeds for simplicity
- The tutorial relies on Google spreadsheets to make API requests and format the responses
    - You will not be using that!
    - This notebook uses Python libraries to achieve the same effect
        - feedparser, lxml, BeautifulSoup and pandas

### Example URL:  
https://api.flickr.com/services/feeds/photos_public.gne?tags=food&format=rss

In [101]:
import requests as rq
import xml.etree.ElementTree as ET
from PIL import Image
from io import BytesIO
import json

In [307]:
base_url='https://api.flickr.com/services/feeds/photos_public.gne'
tags='food'
response_format='rss2'
api_request=base_url + '?'+ 'tags=' + tags + '&format=' + response_format
api_request

'https://api.flickr.com/services/feeds/photos_public.gne?tags=food&format=rss2'

|Parameter|Value|Description |
| ------------- |-------------|------|
| base URL | https://api.flickr.com/services/feeds/photos_public.gne| The address the API request is sent to |
| tags | food | A comma-delimited list of tags |
| response format | rss2 | The format the API should use for the response |
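
String concatenation works for simple values like these. As a hedged alternative, `urllib.parse.urlencode` from the standard library builds the same query string and also escapes characters that are not URL-safe. This is only a sketch; the `params` name is illustrative and not part of the tutorial.

```python
from urllib.parse import urlencode

base_url = 'https://api.flickr.com/services/feeds/photos_public.gne'
params = {'tags': 'food', 'format': 'rss2'}  # illustrative name, same values as the table

# urlencode joins key=value pairs with '&' and percent-encodes unsafe characters
api_request = base_url + '?' + urlencode(params)
print(api_request)
```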


There are several Python libraries you can use to parse the XML returned by the RSS request. I've looked at a couple here, but you may find a favourite of your own after a quick search!

Trying with feedparser:

In [103]:
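import feedparser
import ssl
# Workaround for certificate errors: this turns off HTTPS certificate verification
# for the whole session, so only use it where that is acceptable
if hasattr(ssl, '_create_unverified_context'):
    ssl._create_default_https_context = ssl._create_unverified_context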
rss = api_request
feed = feedparser.parse(rss)

print(feed)

{'feed': {'title': 'Recent Uploads tagged food', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'https://api.flickr.com/services/feeds/photos_public.gne?tags=food&format=rss2', 'value': 'Recent Uploads tagged food'}, 'links': [{'rel': 'alternate', 'type': 'text/html', 'href': 'https://www.flickr.com/photos/tags/food/'}], 'link': 'https://www.flickr.com/photos/tags/food/', 'subtitle': '', 'subtitle_detail': {'type': 'text/html', 'language': None, 'base': 'https://api.flickr.com/services/feeds/photos_public.gne?tags=food&format=rss2', 'value': ''}, 'published': 'Fri, 07 Feb 2020 03:32:40 -0800', 'published_parsed': time.struct_time(tm_year=2020, tm_mon=2, tm_mday=7, tm_hour=11, tm_min=32, tm_sec=40, tm_wday=4, tm_yday=38, tm_isdst=0), 'updated': 'Fri, 07 Feb 2020 03:32:40 -0800', 'updated_parsed': time.struct_time(tm_year=2020, tm_mon=2, tm_mday=7, tm_hour=11, tm_min=32, tm_sec=40, tm_wday=4, tm_yday=38, tm_isdst=0), 'generator_detail': {'name': 'https://www.flickr.com/

The output isn't very readable, so we can "pretty print" the XML to get something closer to what a browser displays

In [141]:
import lxml.etree as etree
import urllib.request
parser = etree.XMLParser(remove_blank_text=True)


opener = urllib.request.build_opener()
tree = etree.parse(opener.open(api_request),parser).getroot()

#root = etree.parse('file.xml', parser).getroot()


print(etree.tostring(tree, pretty_print=True).decode())

<rss xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:creativeCommons="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html" xmlns:flickr="urn:flickr:user" version="2.0">
  <channel>
    <title>Recent Uploads tagged food</title>
    <link>https://www.flickr.com/photos/tags/food/</link>
    <description/>
    <pubDate>Fri, 07 Feb 2020 04:41:31 -0800</pubDate>
    <lastBuildDate>Fri, 07 Feb 2020 04:41:31 -0800</lastBuildDate>
    <generator>https://www.flickr.com/</generator>
    <image>
      <url>https://combo.staticflickr.com/pw/images/buddyicon.gif</url>
      <title>Recent Uploads tagged food</title>
      <link>https://www.flickr.com/photos/tags/food/</link>
    </image>
    <item>
      <title>hybrid vegetables Seeds</title>
      <link>https://www.flickr.com/photos/186859685@N06/49500599213/</link>
      <description>			&lt;p&gt;&lt;a href="https://www.flickr.com/people/186859685@N06/"&gt;sakuraseed363&lt;/a&gt; posted a phot

But it's still not as useful as it could be. Now I use feedparser to display the results in a slightly more friendly format

In [261]:
d = feedparser.parse(api_request)

Print d to take a look at the data:

In [264]:
print(type(d),d)

<class 'feedparser.FeedParserDict'> {'feed': {'title': 'Recent Uploads tagged food', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'https://api.flickr.com/services/feeds/photos_public.gne?tags=food&format=rss2', 'value': 'Recent Uploads tagged food'}, 'links': [{'rel': 'alternate', 'type': 'text/html', 'href': 'https://www.flickr.com/photos/tags/food/'}], 'link': 'https://www.flickr.com/photos/tags/food/', 'subtitle': '', 'subtitle_detail': {'type': 'text/html', 'language': None, 'base': 'https://api.flickr.com/services/feeds/photos_public.gne?tags=food&format=rss2', 'value': ''}, 'published': 'Fri, 07 Feb 2020 04:41:31 -0800', 'published_parsed': time.struct_time(tm_year=2020, tm_mon=2, tm_mday=7, tm_hour=12, tm_min=41, tm_sec=31, tm_wday=4, tm_yday=38, tm_isdst=0), 'updated': 'Fri, 07 Feb 2020 04:41:31 -0800', 'updated_parsed': time.struct_time(tm_year=2020, tm_mon=2, tm_mday=7, tm_hour=12, tm_min=41, tm_sec=31, tm_wday=4, tm_yday=38, tm_isdst=0), 'generator_detail

So 'd' is a dictionary-type object, that may contain one or more nested dictionaries. We can see what we're working with by looking at the dictionary keys:

In [265]:
d.keys()

dict_keys(['feed', 'entries', 'bozo', 'headers', 'updated', 'updated_parsed', 'href', 'status', 'encoding', 'version', 'namespaces'])
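
Two of these keys are handy for a quick sanity check before going further: `bozo` is set by feedparser when the feed was not well-formed, and `status` holds the HTTP status code. A minimal sketch follows; the `feed_problems` helper is hypothetical, and plain dictionaries with made-up values stand in for a real feedparser result so that it runs offline.

```python
def feed_problems(parsed):
    """Return a list of human-readable problems found in a parsed feed."""
    problems = []
    # feedparser sets 'bozo' to a truthy value when the feed was not well-formed
    if parsed.get('bozo'):
        problems.append('feed is not well-formed: {}'.format(parsed.get('bozo_exception')))
    # 'status' holds the HTTP status code when the feed was fetched over HTTP
    status = parsed.get('status')
    if status is not None and status >= 400:
        problems.append('HTTP error {}'.format(status))
    if not parsed.get('entries'):
        problems.append('no entries returned')
    return problems

# Plain dictionaries stand in for feedparser results here (made-up values)
ok_result = feed_problems({'bozo': 0, 'status': 200, 'entries': [{'title': 'x'}]})
bad_result = feed_problems({'bozo': 0, 'status': 404, 'entries': []})
print(ok_result)
print(bad_result)
```

With the object from this notebook the call would be `feed_problems(d)`.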

The things we're interested in here are within 'entries', though you can obviously take a look at the other keys! 

I've gone one step further and looked inside `d['entries']`, printing out the `'title'` and `'link'` key-values

In [266]:
for entry in d['entries']:
    print(entry['title'],entry['link'])

hybrid vegetables Seeds https://www.flickr.com/photos/186859685@N06/49500599213/
Salsiccia pizza https://www.flickr.com/photos/skumroffe/49501091321/
Alles Gute zum Geburtstag, liebe AnnA!! https://www.flickr.com/photos/magister111/49500517018/
瓦城乾拌麵 https://www.flickr.com/photos/benagexyz/49500559473/
Menu Book https://www.flickr.com/photos/186819013@N05/49500494468/
Honey Bee in Garden https://www.flickr.com/photos/99144705@N06/49501235752/
Honey Bee in Garden https://www.flickr.com/photos/99144705@N06/49501235667/
ちらし近江町 ¥1530 https://www.flickr.com/photos/62942199@N08/49500472893/
ちらし近江町 ¥1530 https://www.flickr.com/photos/62942199@N08/49501197932/
DSCF6454 https://www.flickr.com/photos/aaroncaley/49501157512/
Croque-monsieur / クロックムッシュセット / danken COFFEE 天文館店 (鹿児島県鹿児島市) https://www.flickr.com/photos/y_shindoh/49501154562/
cic-20-328 https://www.flickr.com/photos/zimmcomm/49501114897/
cic-20-326 https://www.flickr.com/photos/zimmcomm/49501117272/
cic-20-337 https://www.flickr.com/p
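
For comparison, the same title and link extraction can be sketched with the standard library's `xml.etree.ElementTree`, which was imported at the top but not used. The sample below is made up to resemble the feed's shape (it is not real Flickr data), and the `media:credit` element is assumed from the `media_credit` column feedparser produces. The point of interest is the prefix-to-URI map needed for namespaced elements.

```python
import xml.etree.ElementTree as ET

# Made-up sample shaped like the feed above (not real Flickr data)
sample_rss = """<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <title>Recent Uploads tagged food</title>
    <item>
      <title>Example photo</title>
      <link>https://example.org/photos/1/</link>
      <media:credit role="photographer">example_user</media:credit>
    </item>
  </channel>
</rss>"""

# Prefix-to-URI map so that 'media:credit' can be used in paths
ns = {'media': 'http://search.yahoo.com/mrss/'}

root = ET.fromstring(sample_rss)
rows = []
for item in root.iter('item'):
    rows.append({
        'title': item.findtext('title'),
        'link': item.findtext('link'),
        # default='' avoids None when an item has no credit element
        'credit': item.findtext('media:credit', default='', namespaces=ns),
    })
print(rows)
```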

Looking good! This retrieves some useful information and is better than just looking at raw XML. We can go a little further by getting the results into a pandas dataframe, after looking through the data a little to see how it is formatted and what information we would ultimately like to see

In [281]:
import pandas as pd
import numpy as np
df=pd.DataFrame(d['entries'])

In [282]:
df.head(3)

Unnamed: 0,author,author_detail,authors,content,credit,dc_date.taken,guidislink,href,id,link,...,media_content,media_credit,media_thumbnail,published,published_parsed,summary,summary_detail,tags,title,title_detail
0,nobody@flickr.com (sakuraseed363),"{'name': 'sakuraseed363', 'email': 'nobody@fli...","[{'name': 'sakuraseed363', 'email': 'nobody@fl...","[{'type': 'text/html', 'language': None, 'base...",sakuraseed363,2020-02-07T04:40:54-08:00,False,,"tag:flickr.com,2004:/photo/49500599213",https://www.flickr.com/photos/186859685@N06/49...,...,[{'url': 'https://live.staticflickr.com/65535/...,"[{'role': 'photographer', 'content': 'sakurase...",[{'url': 'https://live.staticflickr.com/65535/...,"Fri, 07 Feb 2020 04:41:31 -0800","(2020, 2, 7, 12, 41, 31, 4, 38, 0)","<p><a href=""https://www.flickr.com/people/1868...","{'type': 'text/html', 'language': None, 'base'...",[{'term': 'beetroot sakuraseed seeds seed hybr...,hybrid vegetables Seeds,"{'type': 'text/plain', 'language': None, 'base..."
1,nobody@flickr.com (skumroffe),"{'name': 'skumroffe', 'email': 'nobody@flickr....","[{'name': 'skumroffe', 'email': 'nobody@flickr...","[{'type': 'text/html', 'language': None, 'base...",skumroffe,2020-02-04T18:18:09-08:00,False,,"tag:flickr.com,2004:/photo/49501091321",https://www.flickr.com/photos/skumroffe/495010...,...,[{'url': 'https://live.staticflickr.com/65535/...,"[{'role': 'photographer', 'content': 'skumroff...",[{'url': 'https://live.staticflickr.com/65535/...,"Fri, 07 Feb 2020 04:36:52 -0800","(2020, 2, 7, 12, 36, 52, 4, 38, 0)","<p><a href=""https://www.flickr.com/people/skum...","{'type': 'text/html', 'language': None, 'base'...",[{'term': 'salsicciapizza pizza furelliosristo...,Salsiccia pizza,"{'type': 'text/plain', 'language': None, 'base..."
2,nobody@flickr.com (magister111),"{'name': 'magister111', 'email': 'nobody@flick...","[{'name': 'magister111', 'email': 'nobody@flic...","[{'type': 'text/html', 'language': None, 'base...",magister111,2019-12-29T18:31:20-08:00,False,,"tag:flickr.com,2004:/photo/49500517018",https://www.flickr.com/photos/magister111/4950...,...,[{'url': 'https://live.staticflickr.com/65535/...,"[{'role': 'photographer', 'content': 'magister...",[{'url': 'https://live.staticflickr.com/65535/...,"Fri, 07 Feb 2020 04:23:54 -0800","(2020, 2, 7, 12, 23, 54, 4, 38, 0)","<p><a href=""https://www.flickr.com/people/magi...","{'type': 'text/html', 'language': None, 'base'...","[{'term': 'milano cakes food', 'scheme': 'urn:...","Alles Gute zum Geburtstag, liebe AnnA!!","{'type': 'text/plain', 'language': None, 'base..."


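Printing the dataframe shows the columns pandas has created from `d['entries']`. Some of the values are themselves lists or dictionaries, which we can index into and use to overwrite a column. For example, the `'author'` column currently includes a placeholder email address: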

In [283]:
df['author']

0                nobody@flickr.com (sakuraseed363)
1                    nobody@flickr.com (skumroffe)
2                  nobody@flickr.com (magister111)
3         nobody@flickr.com (Ben Chen Photography)
4                 nobody@flickr.com (nadirashakil)
5     nobody@flickr.com (Daniel Heiss Photography)
6     nobody@flickr.com (Daniel Heiss Photography)
7                    nobody@flickr.com (Takashi H)
8                    nobody@flickr.com (Takashi H)
9                   nobody@flickr.com (aaroncaley)
10                   nobody@flickr.com (y-shindoh)
11                     nobody@flickr.com (AgWired)
12                     nobody@flickr.com (AgWired)
13                     nobody@flickr.com (AgWired)
14                     nobody@flickr.com (AgWired)
15                     nobody@flickr.com (AgWired)
16                     nobody@flickr.com (AgWired)
17                     nobody@flickr.com (AgWired)
18                     nobody@flickr.com (AgWired)
19                     nobody@f

In [284]:
# Take the 'name' from the first dictionary in each row's 'authors' list
df['author']=[authors[0]['name'] for authors in df['authors']]

In [285]:
df['author']

0                sakuraseed363
1                    skumroffe
2                  magister111
3         Ben Chen Photography
4                 nadirashakil
5     Daniel Heiss Photography
6     Daniel Heiss Photography
7                    Takashi H
8                    Takashi H
9                   aaroncaley
10                   y-shindoh
11                     AgWired
12                     AgWired
13                     AgWired
14                     AgWired
15                     AgWired
16                     AgWired
17                     AgWired
18                     AgWired
19                     AgWired
Name: author, dtype: object

Then we can restrict what we see in the final dataframe, while also changing the order of some of the columns around a little

In [287]:
df[['author','title','link','summary']]

Unnamed: 0,author,title,link,summary
0,sakuraseed363,hybrid vegetables Seeds,https://www.flickr.com/photos/186859685@N06/49...,"<p><a href=""https://www.flickr.com/people/1868..."
1,skumroffe,Salsiccia pizza,https://www.flickr.com/photos/skumroffe/495010...,"<p><a href=""https://www.flickr.com/people/skum..."
2,magister111,"Alles Gute zum Geburtstag, liebe AnnA!!",https://www.flickr.com/photos/magister111/4950...,"<p><a href=""https://www.flickr.com/people/magi..."
3,Ben Chen Photography,瓦城乾拌麵,https://www.flickr.com/photos/benagexyz/495005...,"<p><a href=""https://www.flickr.com/people/bena..."
4,nadirashakil,Menu Book,https://www.flickr.com/photos/186819013@N05/49...,"<p><a href=""https://www.flickr.com/people/1868..."
5,Daniel Heiss Photography,Honey Bee in Garden,https://www.flickr.com/photos/99144705@N06/495...,"<p><a href=""https://www.flickr.com/people/9914..."
6,Daniel Heiss Photography,Honey Bee in Garden,https://www.flickr.com/photos/99144705@N06/495...,"<p><a href=""https://www.flickr.com/people/9914..."
7,Takashi H,ちらし近江町 ¥1530,https://www.flickr.com/photos/62942199@N08/495...,"<p><a href=""https://www.flickr.com/people/6294..."
8,Takashi H,ちらし近江町 ¥1530,https://www.flickr.com/photos/62942199@N08/495...,"<p><a href=""https://www.flickr.com/people/6294..."
9,aaroncaley,DSCF6454,https://www.flickr.com/photos/aaroncaley/49501...,"<p><a href=""https://www.flickr.com/people/aaro..."


The `'summary_detail'` and `'summary'` columns give a short description of the photo, but include HTML tags that aren't useful if we're trying to analyse plain-text output.

BeautifulSoup is a very useful library in many respects for parsing html and xml text, so we can use it here to strip the html tags from this field.

In [295]:
from bs4 import BeautifulSoup
summaries=[]
for entry in d['entries']:
    # Name the parser explicitly; lxml is already used elsewhere in this notebook
    soup = BeautifulSoup(entry['summary'], 'lxml')
    # Split on the first ':' to drop the boilerplate first line of each summary
    summaries.append(soup.get_text().split(':',1)[1])

In [296]:
summaries

['Sakura range of beetroots, OP/ selected like Ruby Queen, Bikores & royal red ( produced in Europe) & two excellent hybrid F1 Kingdom & F1 Red star with early maturity 55-58 days, uniform smooth roots/ bulb. Deep red flesh, crispy, sweet and suitable for long transportation. Tolerant of all major viruses and diseases.Buy Beetroot seeds online-http://sakuraseed.net/Contact Us- 91-8884261708',
 'From Furellios Ristorante in Stockholm, Sweden',
 'Una torta glassata, made by Pasticceria Marchesi, la più famosa di Milano, per il compleanno di Anna, la mia (quasi) gemella su flickr!!Eine glasiertee Geburtstagstorte, hergestellt von Pasticceria Marchesi, der berühmtesten Konditorei in Mailand, für Anna, meine (fast) Zwillinge auf flickr!',
 '',
 'Menu Book design For Doraemon cafe',
 '',
 '',
 '井ノ弥石川県金沢市上近江町33-1',
 '井ノ弥石川県金沢市上近江町33-1',
 '',
 'danken.jp/tabelog.com/kagoshima/A4601/A460101/46013132/',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '']
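
Note that `split(':',1)[1]` raises an `IndexError` if a summary contains no colon at all. A hedged alternative is `str.partition`, which returns an empty remainder instead. The sketch below also strips tags with the standard library's `html.parser` rather than BeautifulSoup, purely so that it runs without extra installs. The helper names and the sample string are made up for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text between tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html_str):
    extractor = TextExtractor()
    extractor.feed(html_str)
    extractor.close()  # flush any buffered trailing text
    return ''.join(extractor.parts)

def summary_body(html_str):
    """Return the text after the first ':' or '' when there is no colon."""
    _, _, rest = strip_tags(html_str).partition(':')
    return rest.strip()

# Made-up sample in the same shape as a feed summary
sample = ('<p><a href="https://example.org/people/someone/">someone</a> '
          'posted a photo:</p> <p>From a restaurant in Stockholm</p>')
print(summary_body(sample))
print(repr(summary_body('<p>No colon here</p>')))
```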

Now we can overwrite the `'summary'` column in the dataframe. If you would rather keep the original HTML, assign the plain-text list to a new column instead.

In [297]:
df['summary']=summaries

In [298]:
df[['author','title','link','summary']].head(5)

Unnamed: 0,author,title,link,summary
0,sakuraseed363,hybrid vegetables Seeds,https://www.flickr.com/photos/186859685@N06/49...,"Sakura range of beetroots, OP/ selected like R..."
1,skumroffe,Salsiccia pizza,https://www.flickr.com/photos/skumroffe/495010...,"From Furellios Ristorante in Stockholm, Sweden"
2,magister111,"Alles Gute zum Geburtstag, liebe AnnA!!",https://www.flickr.com/photos/magister111/4950...,"Una torta glassata, made by Pasticceria Marche..."
3,Ben Chen Photography,瓦城乾拌麵,https://www.flickr.com/photos/benagexyz/495005...,
4,nadirashakil,Menu Book,https://www.flickr.com/photos/186819013@N05/49...,Menu Book design For Doraemon cafe


We now have an API request response similar to the one you receive if you have followed up to page 4 of the tutorial using google spreadsheets

## Going *further*

You can add additional parameters to your request, separated by an '&'

|Parameter|Description |
| ------------- |------|
| id | A single user ID |
| ids | A comma delimited list of user IDs |
| tagmode | Control whether items must have ALL the tags (tagmode=all), or ANY (tagmode=any) of the tags. Default is ALL.|
| format | The format of the feed (default Atom 1.0)|
| lang |The display language for the feed (default en-us) |

Parameters are documented [here](https://www.flickr.com/services/feeds/docs/photos_public/)

e.g:

In [312]:
api_request=base_url + '?'+ 'tags=' + tags + '&format=' + response_format
api_request

'https://api.flickr.com/services/feeds/photos_public.gne?tags=food&format=rss2'

Take a look at only posts tagged 'food' from the British Library flickr account 

In [313]:
BritishLibraryFlickrID='12403504@N02'

In [314]:
api_request+='&id='+BritishLibraryFlickrID
print(api_request)

https://api.flickr.com/services/feeds/photos_public.gne?tags=food&format=rss2&id=12403504@N02
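
Appending with `+=` is fine for a single extra parameter. When parameters may need to be added or changed repeatedly, one option is to split the URL, update the query as a dictionary and rebuild it. This is a sketch using `urllib.parse`; the `add_params` helper is hypothetical. Passing `safe='@,'` keeps the `@` in the user ID and the commas in tag lists unescaped, matching the URLs above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def add_params(url, **extra):
    """Return url with the given query parameters added or overwritten."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))  # existing parameters, in order
    query.update(extra)                   # new values replace old ones
    return urlunsplit(parts._replace(query=urlencode(query, safe='@,')))

start = 'https://api.flickr.com/services/feeds/photos_public.gne?tags=food&format=rss2'
with_id = add_params(start, id='12403504@N02')
any_tag = add_params(start, tags='food,cake', tagmode='any')
print(with_id)
print(any_tag)
```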


Handle the request and response as before:

In [321]:
d = feedparser.parse(api_request)
df=pd.DataFrame(d['entries'])

df[['author','title','link','summary']].head(5)

Unnamed: 0,author,title,link,summary
0,nobody@flickr.com (The British Library),Image taken from page 51 of 'Unbeaten Tracks i...,https://www.flickr.com/photos/britishlibrary/1...,"<p><a href=""https://www.flickr.com/people/brit..."
1,nobody@flickr.com (The British Library),Image taken from page 506 of 'S. W. By the aut...,https://www.flickr.com/photos/britishlibrary/1...,"<p><a href=""https://www.flickr.com/people/brit..."
2,nobody@flickr.com (The British Library),Image taken from page 72 of 'America revisited...,https://www.flickr.com/photos/britishlibrary/1...,"<p><a href=""https://www.flickr.com/people/brit..."
3,nobody@flickr.com (The British Library),Image taken from page 31 of 'Cruel Fred and ot...,https://www.flickr.com/photos/britishlibrary/1...,"<p><a href=""https://www.flickr.com/people/brit..."
4,nobody@flickr.com (The British Library),Image taken from page 244 of 'Peter Simple ......,https://www.flickr.com/photos/britishlibrary/1...,"<p><a href=""https://www.flickr.com/people/brit..."


Clean things up as we did before

In [344]:
summaryList=[]
for entry in d['entries']:
    # Name the parser explicitly; lxml is already used elsewhere in this notebook
    soup = BeautifulSoup(entry['summary'], 'lxml')
    # Split on the first ':' to drop the boilerplate first line of each summary
    summaryList.append(soup.get_text().split(':',1)[1])
df['summary']=summaryList


In [345]:
df[['credit','title','link','summary']].head(5)

Unnamed: 0,credit,title,link,summary
0,The British Library,Image taken from page 51 of 'Unbeaten Tracks i...,https://www.flickr.com/photos/britishlibrary/1...,"\n\nImage taken from:\n\nTitle: ""Unbeaten Trac..."
1,The British Library,Image taken from page 506 of 'S. W. By the aut...,https://www.flickr.com/photos/britishlibrary/1...,"\n\nImage taken from:\n\nTitle: ""S. W. By the ..."
2,The British Library,Image taken from page 72 of 'America revisited...,https://www.flickr.com/photos/britishlibrary/1...,"\n\nImage taken from:\n\nTitle: ""America revis..."
3,The British Library,Image taken from page 31 of 'Cruel Fred and ot...,https://www.flickr.com/photos/britishlibrary/1...,"\n\nImage taken from:\n\nTitle: ""Cruel Fred an..."
4,The British Library,Image taken from page 244 of 'Peter Simple ......,https://www.flickr.com/photos/britishlibrary/1...,"\n\nImage taken from:\n\nTitle: ""Peter Simple ..."


We can also strip the `\n` newline characters from the summary column

In [349]:
summaryList=[s.replace("\n",'') for s in summaryList]

In [350]:
summaryList

['Image taken from:Title: "Unbeaten Tracks in Japan. An account of travels in the interior, including visits to the aborigines of Yezo and the shrines of Nikkô and Isé ... With map and illustrations, etc"Author: BIRD, afterwards BISHOP, Isabella Lucy.Shelfmark: "British Library HMNTS 010058.ee.53."Volume: 01Page: 51Place of Publishing: LondonDate of Publishing: 1880Publisher: John MurrayIssuance: monographicIdentifier: 000356195Explore:  Find this item in the British Library catalogue, \'Explore\'.  Open the page in the British Library\'s itemViewer (page image 51)Download the PDF for this book Image found on book scan 51 (NB not a pagenumber)Download the OCR-derived text for this volume: (plain text) or (json)Click here to see all the illustrations in this book and click here to browse other illustrations published in books in the same year.Order a higher quality version from here.',
 'Image taken from:Title: "S. W. By the author of “A Modern Minister.”"Author: WEIR, Saul.Shelfmark:
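
Deleting the newlines outright fuses neighbouring fields together (for example `monographicIdentifier` in the output above). A gentler option, sketched here with a hypothetical `join_lines` helper, is to join the non-empty lines with a visible separator. The sample string is an abbreviated, made-up version of one summary.

```python
def join_lines(text, sep=' | '):
    """Join the non-empty lines of text with sep, trimming each line."""
    lines = [line.strip() for line in text.splitlines()]
    return sep.join(line for line in lines if line)

# Abbreviated, made-up sample in the shape of a British Library summary
sample = '\n\nImage taken from:\n\nTitle: "Unbeaten Tracks in Japan"\nVolume: 01\nPage: 51\n'
cleaned = join_lines(sample)
print(cleaned)
```

Applied to the notebook's list this would be `[join_lines(s) for s in summaryList]`.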

In [351]:
df['summary']=summaryList

And finally we have the response in a tidy format (one that we can export to CSV, for example)

In [353]:
df[['credit','title','link','summary']].head(5)

Unnamed: 0,credit,title,link,summary
0,The British Library,Image taken from page 51 of 'Unbeaten Tracks i...,https://www.flickr.com/photos/britishlibrary/1...,"Image taken from:Title: ""Unbeaten Tracks in Ja..."
1,The British Library,Image taken from page 506 of 'S. W. By the aut...,https://www.flickr.com/photos/britishlibrary/1...,"Image taken from:Title: ""S. W. By the author o..."
2,The British Library,Image taken from page 72 of 'America revisited...,https://www.flickr.com/photos/britishlibrary/1...,"Image taken from:Title: ""America revisited ......"
3,The British Library,Image taken from page 31 of 'Cruel Fred and ot...,https://www.flickr.com/photos/britishlibrary/1...,"Image taken from:Title: ""Cruel Fred and other ..."
4,The British Library,Image taken from page 244 of 'Peter Simple ......,https://www.flickr.com/photos/britishlibrary/1...,"Image taken from:Title: ""Peter Simple ... Illu..."
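
With pandas, `df[['credit','title','link','summary']].to_csv('flickr.csv', index=False)` writes the file in one call (the file name is just an example). For illustration without pandas, the sketch below does the equivalent with the standard library's `csv` module on made-up rows, writing to an in-memory buffer so that nothing is saved to disk.

```python
import csv
import io

# Made-up rows standing in for the dataframe's records
rows = [
    {'credit': 'The British Library', 'title': 'Example, with a comma',
     'link': 'https://example.org/1', 'summary': 'First'},
    {'credit': 'The British Library', 'title': 'Another example',
     'link': 'https://example.org/2', 'summary': 'Second'},
]

# Swap the buffer for open('flickr.csv', 'w', newline='', encoding='utf-8') to write a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['credit', 'title', 'link', 'summary'])
writer.writeheader()
writer.writerows(rows)  # values containing commas are quoted automatically
csv_text = buffer.getvalue()
print(csv_text)
```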


# Working with the BNB

Exercise 2 of the tutorial deals with more complex data structures by retrieving items from the BNB and extracting information from the XML data

For this exercise you are going to work with a ‘full record display’ for books from the BNB.  
The example URL from the tutorial is http://bnb.data.bl.uk/id/resource/010712074

Click on it and note the difference in the URL displayed in the browser!      
`id` becomes `doc`

HTML looks nice on a web page, but isn't so useful for extracting data (with a script or similar). Formats available through the API are:

- rdf
- ttl
- json
- xml
- html

JSON or XML would both work well for this, so let's look at XML again:  
https://bnb.data.bl.uk/doc/resource/010712074.xml
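
The `id` to `doc` change and the format suffix can be applied in code. This is a rough sketch with a hypothetical `doc_url` helper. It assumes every listed format follows the same suffix pattern as the `.xml` URL above, which is the only one shown here.

```python
def doc_url(id_url, fmt='xml'):
    """Turn an '/id/' resource URL into the matching '/doc/' URL with a format suffix."""
    formats = {'rdf', 'ttl', 'json', 'xml', 'html'}
    if fmt not in formats:
        raise ValueError('unsupported format: {}'.format(fmt))
    # Replace only the first '/id/' path segment, then add the suffix
    return id_url.replace('/id/', '/doc/', 1) + '.' + fmt

xml_url = doc_url('http://bnb.data.bl.uk/id/resource/010712074')
print(xml_url)
```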

In [381]:
url_list=[
    'http://bnb.data.bl.uk/id/resource/009406660',
    'http://bnb.data.bl.uk/id/resource/010055357',
    'http://bnb.data.bl.uk/id/resource/009406743',
    'http://bnb.data.bl.uk/id/resource/010053535',
    'http://bnb.data.bl.uk/id/resource/008418912',
    'http://bnb.data.bl.uk/id/resource/012702152',
    'http://bnb.data.bl.uk/id/resource/009406658',
    'http://bnb.data.bl.uk/id/resource/009097698',
    'http://bnb.data.bl.uk/id/resource/010975194'
    ]

url_list=[url+'.xml' for url in url_list] # add '.xml' to the end of every url given here

In [382]:
url_list

['http://bnb.data.bl.uk/id/resource/009406660.xml',
 'http://bnb.data.bl.uk/id/resource/010055357.xml',
 'http://bnb.data.bl.uk/id/resource/009406743.xml',
 'http://bnb.data.bl.uk/id/resource/010053535.xml',
 'http://bnb.data.bl.uk/id/resource/008418912.xml',
 'http://bnb.data.bl.uk/id/resource/012702152.xml',
 'http://bnb.data.bl.uk/id/resource/009406658.xml',
 'http://bnb.data.bl.uk/id/resource/009097698.xml',
 'http://bnb.data.bl.uk/id/resource/010975194.xml']

In [425]:
from lxml import etree

tree = etree.parse(opener.open(url_list[0]),parser).getroot()

for sub in tree.xpath('//result/primaryTopic'):
    description = ''.join(sub.xpath('.//description/text()'))
    title= ''.join(sub.xpath('.//title/text()'))
    print("[*] Title: {}".format(title))
    print("[*] Description: {}".format(description))

[*] Title: Espedair Street
[*] Description: Originally published: London : Macmillan, 1987.


Alternatively, use a double forward slash `//` at the start of the path to match an element at any depth in the document. We can find the title like:

In [426]:
tree.xpath('//title/text()')

['Espedair Street']

And the ISBN:

In [431]:
tree.xpath('//ISBN10/text()')

['0316858552']
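
To make the difference between a step-by-step path and a search at any depth concrete, here is a small offline sketch using the standard library's `xml.etree.ElementTree`, which supports a subset of XPath. The record is made up from the element names used above; the `identifiers` wrapper is an assumption for illustration, not the real BNB structure.

```python
import xml.etree.ElementTree as ET

# Made-up record using the element names seen above (not a real BNB response)
sample = """<result>
  <primaryTopic>
    <title>Espedair Street</title>
    <description>Originally published: London : Macmillan, 1987.</description>
    <identifiers>
      <ISBN10>0316858552</ISBN10>
    </identifiers>
  </primaryTopic>
</result>"""

root = ET.fromstring(sample)  # root is the <result> element

# 'primaryTopic/title' walks child by child from the root element
title = root.findtext('primaryTopic/title')
# './/ISBN10' matches at any depth below the root
isbn = root.findtext('.//ISBN10')
# A single-step path only looks at direct children, so this finds nothing
direct = root.find('ISBN10')
print(title, isbn, direct)
```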

That's it for this tutorial! The next steps given are:

- How would you amend the formula to display the publication information?
- Now you have an ISBN for a BNB item, can you think of other online resources you could link to or use to further enhance the display?
- How would you go about bringing in an additional source of data?

It is left up to the reader to decide how to proceed!
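
As one possible starting point for the ISBN question, many online resources expect a 13-digit ISBN. The sketch below checks an ISBN-10 check digit and converts the number to ISBN-13 using the standard check-digit rules. The function names are hypothetical, and which service to query with the result is left open.

```python
def is_valid_isbn10(isbn):
    """Check the ISBN-10 check digit (the last character may be 'X' for 10)."""
    if len(isbn) != 10 or not isbn[:9].isdigit():
        return False
    last = isbn[9]
    if not (last.isdigit() or last in 'Xx'):
        return False
    digits = [int(c) for c in isbn[:9]] + [10 if last in 'Xx' else int(last)]
    # Weights run from 10 down to 1; a valid number sums to a multiple of 11
    return sum(w * d for w, d in zip(range(10, 0, -1), digits)) % 11 == 0

def isbn10_to_isbn13(isbn):
    """Prefix '978', drop the old check digit and compute the ISBN-13 one."""
    core = '978' + isbn[:9]
    # Weights alternate 1, 3, 1, 3, ... across the twelve digits
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(core))
    return core + str((10 - total % 10) % 10)

isbn10 = '0316858552'  # the ISBN retrieved above
if is_valid_isbn10(isbn10):
    isbn13 = isbn10_to_isbn13(isbn10)
    print(isbn13)
```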