# xmlcoll tutorial
The goal of this notebook is to demonstrate the use of the [xmlcoll](https://xmlcoll.readthedocs.io) package.  xmlcoll stores collections of items.  Data about each of the items are stored in a dictionary from which the data can be retrieved or updated.  The xmlcoll API permits the user to update collections from XML, validate XML against an XML [schema](https://www.w3schools.com/xml/xml_schema.asp), and write data to a new XML file.

Begin by installing and importing the necessary Python libraries.

In [1]:
import sys, io, requests, html

!{sys.executable} -m pip install --quiet xmlcoll

import xmlcoll.coll as xc

Create a collection.  In this tutorial, it will be a catalog of famous paintings, but initially it is empty.

In [2]:
my_collection = xc.Collection()

First we will import data.  We do this by updating our empty collection with data from an XML file.  For this tutorial, we will use a data file from [OSF](https://osf.io/k5c9m).

In [3]:
my_collection.update_from_xml(io.BytesIO(requests.get('https://osf.io/k5c9m/download').content))

The collection can have properties.  Let's begin by retrieving and printing them.  The *get_properties()* method returns the properties dictionary, which holds the data as *key*, *value* pairs.  A value can be any Python object that can be cast to a string (since it must be rendered into XML).  A key can be a single string or a [tuple](https://www.w3schools.com/python/python_tuples.asp) of up to six strings.  The first element of the tuple is the property name, and the remaining elements are tags.

In [4]:
collection_properties = my_collection.get_properties()
for key in collection_properties:
    print(key, ':', collection_properties[key])

Title : Famous Paintings
Original Collator : Brad Meyer
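
The key convention described above can be sketched with a plain dictionary.  This is only an illustration: `split_key` is a hypothetical helper, not part of the xmlcoll API, and it simply separates a key into the property name and its tags under the rules stated in the text (a string, or a tuple of up to six strings).

```python
def split_key(key):
    """Split a property key into (name, tags).

    Hypothetical helper for illustration; not part of xmlcoll.
    """
    if isinstance(key, str):
        # A bare string is a property name with no tags.
        return key, ()
    if (
        isinstance(key, tuple)
        and 1 <= len(key) <= 6
        and all(isinstance(part, str) for part in key)
    ):
        # The first element is the name; the rest are tags.
        return key[0], key[1:]
    raise TypeError("key must be a string or a tuple of one to six strings")


example_properties = {
    "Title": "Famous Paintings",
    ("artist", "name", "first"): "Henri",
}

for key, value in example_properties.items():
    name, tags = split_key(key)
    print(name, tags, str(value))
```

A key that is neither a string nor a tuple of strings raises a `TypeError` in this sketch; how xmlcoll itself reacts to such a key is not covered here.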


We can list the paintings in the collection by retrieving a dictionary of the items with the *get()* method.  The painting names are the *keys* of the dictionary, so printing out the keys gives the names of the items (in our case paintings) in our collection.  When printing the names, we *unescape* any HTML entities they contain so that accented characters display correctly.

In [5]:
for painting in my_collection.get():
    print(html.unescape(painting))

Mona Lisa
The Starry Night
Girl with a Pearl Earring
The Kiss
The Night Watch
The Birth of Venus
American Gothic
Guernica
A Sunday Afternoon on the Island of La Grande Jatte
Whistler's Mother
Bal du moulin de la Galette
La Meninas
Liberty Leading the People
The Last Supper
Arnolfini Portrait
Café Terrace at Night
The Persistence of Memory
Impression, Sunrise
Le Déjeuner sur l'herbe
The Creation
Les Demoiselles d'Avignon
Nighthawks
The Great Wave off Kanagawa
The Storm on the Sea of Galilee
The Sleeping Gypsy
A Bar at the Folies-Bergère
Portrait of Adele Bloch-Bauer I
The Swing
Lady with an Ermine
The Garden of Early Delights
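
To see what unescaping does on its own, here is a small sketch that uses only the standard library.  The escaped strings below are illustrative assumptions about how such names might be stored, not values copied from the data file.

```python
import html

# Illustrative names; the entity forms are assumed for this sketch.
escaped_names = [
    "Caf&eacute; Terrace at Night",
    "Le D&eacute;jeuner sur l'herbe",
    "Mona Lisa",
]

# html.unescape converts entities to characters and leaves plain text unchanged.
unescaped_names = [html.unescape(name) for name in escaped_names]

for name in unescaped_names:
    print(name)
```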


Every item in the collection has a dictionary of properties that can be retrieved by the *get_properties()* method for that item.  Here we select a painting and print its properties.

In [6]:
painting = my_collection.get()['The Sleeping Gypsy']
properties = painting.get_properties()

for prop in properties:
    print(prop, ':', properties[prop])

date : 1897
('artist', 'name', 'first') : Henri
('artist', 'name', 'last') : Rousseau
('artist', 'nationality') : French


We can add properties to the painting by adding *key*, *value* pairs to the properties dictionary.  The item properties dictionary has the same restrictions as the collection properties dictionary: the keys are strings or tuples of strings, and the values are any variable that can be cast to a string.

In [7]:
properties['medium'] = 'watercolor'
properties['some number'] = 5
properties[('my_name', 'my_tag1', 'my_tag2', 'my_tag3', 'my_tag4', 'my_tag5')] = 'my new property'

Check that the properties have been added by printing the properties again.

In [8]:
for prop in properties:
    print(prop, ':', properties[prop])

date : 1897
('artist', 'name', 'first') : Henri
('artist', 'name', 'last') : Rousseau
('artist', 'nationality') : French
medium : watercolor
some number : 5
('my_name', 'my_tag1', 'my_tag2', 'my_tag3', 'my_tag4', 'my_tag5') : my new property
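
Because values must be rendered into XML, a non-string value such as the integer 5 is stored as text.  The sketch below uses the standard library's `xml.etree.ElementTree` rather than xmlcoll, with assumed element names (`properties`, `property`), to show why a number would be expected to come back as a string after a write and re-read.  Treat this as an assumption about xmlcoll's behavior, not a statement of it.

```python
import xml.etree.ElementTree as ET

values = {"medium": "watercolor", "some number": 5}

# Build a small XML fragment; element names are assumed for illustration.
root = ET.Element("properties")
for name, value in values.items():
    prop = ET.SubElement(root, "property", name=name)
    prop.text = str(value)  # XML text content is always a string

xml_text = ET.tostring(root, encoding="unicode")

# Parse the fragment back: every value is now a string.
read_back = {prop.get("name"): prop.text for prop in ET.fromstring(xml_text)}
print(read_back)
```

If a numeric type matters after reloading, convert explicitly, for example with `int(read_back["some number"])`.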


We can change a property by assigning a new value to its key in the dictionary.  Here, we realize that the *medium* we entered was wrong, so we correct it:

In [9]:
properties['medium'] = 'oil'

for prop in properties:
    print(prop, ':', properties[prop])

date : 1897
('artist', 'name', 'first') : Henri
('artist', 'name', 'last') : Rousseau
('artist', 'nationality') : French
medium : oil
some number : 5
('my_name', 'my_tag1', 'my_tag2', 'my_tag3', 'my_tag4', 'my_tag5') : my new property


Let's now add a painting.  To do so, we create an item (a painting in our case) and then add properties.  Notice the use of [HTML entities](https://www.w3schools.com/charsets/ref_html_entities_e.asp) to render the e with an acute accent (é) in the artist's name.

In [10]:
painting = xc.Item('The Raft of the Medusa')

props = {}
props['date'] = '1819'
props[('artist', 'name', 'first')] = 'Th&eacute;odore'
props[('artist', 'name', 'last')] = 'G&eacute;ricault'
props[('artist', 'nationality')] = 'French'
painting.update_properties(props)
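
If you would rather not type entities by hand, the standard library can produce them.  The sketch below defines a hypothetical helper, `to_named_entities` (not part of xmlcoll), that replaces each non-ASCII character with a named HTML entity when one exists and with a numeric character reference otherwise.  It does not escape `&`, `<`, or `>`.

```python
import html
from html.entities import codepoint2name


def to_named_entities(text):
    """Replace non-ASCII characters with HTML entities.

    Hypothetical helper for illustration; not part of xmlcoll.
    """
    pieces = []
    for char in text:
        code = ord(char)
        if code < 128:
            pieces.append(char)  # plain ASCII is kept as-is
        elif code in codepoint2name:
            pieces.append("&" + codepoint2name[code] + ";")  # e.g. &eacute;
        else:
            pieces.append("&#" + str(code) + ";")  # numeric fallback
    return "".join(pieces)


encoded = to_named_entities("Théodore Géricault")
print(encoded)
print(html.unescape(encoded))  # round trip back to the accented form
```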

We now add the painting.

In [11]:
my_collection.add_item(painting)

Check that the painting has been added by printing out the new collection.  We will also print out the name of each artist.

In [12]:
for painting in my_collection.get():
    props = my_collection.get()[painting].get_properties()
    first_name = html.unescape(props[('artist', 'name', 'first')])
    last_name = html.unescape(props[('artist', 'name', 'last')])
    print(html.unescape(painting), 'by', first_name, last_name)

Mona Lisa by Leonardo da Vinci
The Starry Night by Vincent van Gogh
Girl with a Pearl Earring by Johannes Vermeer
The Kiss by Gustav Klimt
The Night Watch by Rembrandt van Rijn
The Birth of Venus by Sandro Botticelli
American Gothic by Grant Wood
Guernica by Pablo Picasso
A Sunday Afternoon on the Island of La Grande Jatte by Georges Seurat
Whistler's Mother by James Whistler
Bal du moulin de la Galette by Pierre-Auguste Renoir
La Meninas by Diego Velázquez
Liberty Leading the People by Eugène Delacroix
The Last Supper by Leonardo da Vinci
Arnolfini Portrait by Jan van Eyck
Café Terrace at Night by Vincent van Gogh
The Persistence of Memory by Salvador Dali
Impression, Sunrise by Claude Monet
Le Déjeuner sur l'herbe by Édouard Manet
The Creation by Michelangelo Buonarotti
Les Demoiselles d'Avignon by Pablo Picasso
Nighthawks by Edward Hopper
The Great Wave off Kanagawa by Katsushika Hokusai
The Storm on the Sea of Galilee by Rembrandt van Rijn
The Sleeping Gypsy by Henri Rousseau
A Bar
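
The loop above indexes the properties dictionary directly, which raises a `KeyError` if an item lacks one of the name keys.  A more forgiving variant, assuming the properties object behaves like an ordinary dictionary as the text describes, could look like this.  The function `artist_full_name` is a hypothetical helper.

```python
import html


def artist_full_name(props):
    """Return the unescaped artist name, skipping any missing part.

    Hypothetical helper; props is assumed to be a dict-like mapping.
    """
    first = props.get(("artist", "name", "first"), "")
    last = props.get(("artist", "name", "last"), "")
    return html.unescape(" ".join(part for part in (first, last) if part))


complete = {
    ("artist", "name", "first"): "Th&eacute;odore",
    ("artist", "name", "last"): "G&eacute;ricault",
}
last_only = {("artist", "name", "last"): "Rousseau"}

print(artist_full_name(complete))
print(artist_full_name(last_only))
```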

We see that the collection is updated with the new painting.  We can now output the collection to a new XML file with the *write_to_xml()* method.

In [13]:
file = 'my_famous_paintings.xml'
my_collection.write_to_xml(file)

We will now create a new collection and import the data from the new XML file.  Before updating the collection, however, we will validate the XML against the xmlcoll [schema](https://osf.io/t26k4/).  Doing so ensures that the XML is appropriate input for *xmlcoll*.  If there is no error, the *validate()* method will simply return.

In [14]:
new_collection = xc.Collection()
new_collection.validate(file)
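
Schema validation presupposes that the file is well-formed XML in the first place.  As a cheap preliminary check, and not a substitute for *validate()*, the standard library can confirm well-formedness.  The helper `is_well_formed` below is hypothetical and knows nothing about the xmlcoll schema.

```python
import io
import xml.etree.ElementTree as ET


def is_well_formed(xml_source):
    """Return True if xml_source (a path or file object) parses as XML.

    Hypothetical helper; checks well-formedness only, not the schema.
    """
    try:
        ET.parse(xml_source)
    except ET.ParseError:
        return False
    return True


# In-memory examples; a file name such as 'my_famous_paintings.xml' also works.
print(is_well_formed(io.StringIO("<collection><items/></collection>")))
print(is_well_formed(io.StringIO("<collection><items></collection>")))
```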

Let us now update the data in the new collection but use [XPath](https://www.w3schools.com/xml/xpath_intro.asp) to select only paintings by French artists.

In [15]:
new_collection.update_from_xml(file, xpath = "[.//property[@name = 'artist' and @tag1 = 'nationality'] = 'French']")

Print out the paintings to check the new collection.

In [16]:
for painting in new_collection.get():
    props = new_collection.get()[painting].get_properties()
    first_name = html.unescape(props[('artist', 'name', 'first')])
    last_name = html.unescape(props[('artist', 'name', 'last')])
    nationality = html.unescape(props[('artist', 'nationality')])

    print(html.unescape(painting), 'by', first_name, last_name, '(', nationality, ')')

A Sunday Afternoon on the Island of La Grande Jatte by Georges Seurat ( French )
Bal du moulin de la Galette by Pierre-Auguste Renoir ( French )
Liberty Leading the People by Eugène Delacroix ( French )
Impression, Sunrise by Claude Monet ( French )
Le Déjeuner sur l'herbe by Édouard Manet ( French )
The Sleeping Gypsy by Henri Rousseau ( French )
A Bar at the Folies-Bergère by Édouard Manet ( French )
The Swing by Jean-Honoré Fragonard ( French )
The Raft of the Medusa by Théodore Géricault ( French )
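
To make the selection logic more concrete, here is a sketch using the standard library's `xml.etree.ElementTree` on a small hand-written document.  The XML layout (`collection`, `items`, `item`, `name`, `properties`, `property`) is an assumption made for illustration and may differ from what xmlcoll writes.  ElementTree supports only a subset of XPath, so the `and` condition is expressed as two chained attribute predicates and the text comparison is done in Python.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the layout is assumed, not taken from xmlcoll output.
SAMPLE_XML = """
<collection>
  <items>
    <item>
      <name>The Sleeping Gypsy</name>
      <properties>
        <property name="date">1897</property>
        <property name="artist" tag1="nationality">French</property>
      </properties>
    </item>
    <item>
      <name>The Night Watch</name>
      <properties>
        <property name="date">1642</property>
        <property name="artist" tag1="nationality">Dutch</property>
      </properties>
    </item>
  </items>
</collection>
"""


def select_items(xml_text, nationality):
    """Return names of items whose artist nationality matches."""
    root = ET.fromstring(xml_text)
    selected = []
    for item in root.iter("item"):
        # Chained predicates stand in for "@name = ... and @tag1 = ...".
        matches = item.findall(".//property[@name='artist'][@tag1='nationality']")
        if any(match.text == nationality for match in matches):
            selected.append(item.findtext("name"))
    return selected


print(select_items(SAMPLE_XML, "French"))
```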


In this case, it is enough to match on *tag1* alone being nationality.  Thus, we can select paintings by Dutch artists into a new collection:

In [17]:
new_collection2 = xc.Collection()
new_collection2.update_from_xml(file, xpath = "[.//property[@tag1 = 'nationality'] = 'Dutch']")

Once again, print out the paintings to check on the new collection.

In [18]:
for painting in new_collection2.get():
    props = new_collection2.get()[painting].get_properties()
    first_name = html.unescape(props[('artist', 'name', 'first')])
    last_name = html.unescape(props[('artist', 'name', 'last')])
    nationality = html.unescape(props[('artist', 'nationality')])

    print(html.unescape(painting), 'by', first_name, last_name, '(', nationality, ')')

The Starry Night by Vincent van Gogh ( Dutch )
Girl with a Pearl Earring by Johannes Vermeer ( Dutch )
The Night Watch by Rembrandt van Rijn ( Dutch )
Arnolfini Portrait by Jan van Eyck ( Dutch )
Café Terrace at Night by Vincent van Gogh ( Dutch )
The Storm on the Sea of Galilee by Rembrandt van Rijn ( Dutch )
The Garden of Early Delights by Hieronymus Bosch ( Dutch )
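
XPath filters the data as it is read.  An alternative, once a collection is already in memory, is to group items in Python.  The sketch below works on a plain dictionary that maps item names to property dictionaries.  The function `group_by_nationality` is a hypothetical helper, and the sample data is a small hand-typed subset.

```python
from collections import defaultdict


def group_by_nationality(items):
    """Group item names by the ('artist', 'nationality') property.

    Hypothetical helper; items maps each name to a dict of properties.
    """
    groups = defaultdict(list)
    for name, props in items.items():
        nationality = props.get(("artist", "nationality"), "unknown")
        groups[nationality].append(name)
    return dict(groups)


sample_items = {
    "The Night Watch": {("artist", "nationality"): "Dutch"},
    "The Sleeping Gypsy": {("artist", "nationality"): "French"},
    "The Starry Night": {("artist", "nationality"): "Dutch"},
}

groups = group_by_nationality(sample_items)
print(groups)
```

With the tutorial's collection, a suitable input could be built as `{name: item.get_properties() for name, item in my_collection.get().items()}`, using only the *get()* and *get_properties()* methods shown earlier.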


The criteria for selecting items by XPath will depend on the property keys and values in your particular XML file.  Once you have a new collection, you can retrieve items and their properties as illustrated above.  You can also write the collection to XML again.  We'll first update the Title property in the collection and then write out the XML:

In [19]:
new_collection.get_properties()['Title'] = 'Famous French Paintings'
new_collection.write_to_xml('my_famous_french_paintings.xml')

The new files should be available in the local directory.  If you are running this tutorial from Google Colab, the new XML files should be available under the folder icon at the left of the page.