<a href="https://colab.research.google.com/github/olexandr7/erm_workshop/blob/main/ERM_workshop_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Workshop 1** focuses on trying out text manipulation.

Let's now pick a museum collection from MuIS and try several operations:
* loading all items from the collection
* making a word cloud out of item titles
* exporting the collection to Excel

**Block 1**: installing relevant libraries for textual manipulation

In [None]:
#RDF scripts taken from rdflib tutorial:
#https://rdflib.readthedocs.io/en/stable/gettingstarted.html#a-more-extensive-example
#-------------------------------
#installing rdflib library
%pip install rdflib
#-------------------------------
import matplotlib.pyplot as plt
from rdflib import Graph
from wordcloud import WordCloud
#downloading files from Colab
from google.colab import files

**Block 2**: Displaying details of MuIS item in RDF
<br>  <font color='orange'>Action point:</font> Try changing the URL to any other item from MuIS. Each item has a *püsiviide/permalink* in the UI; replace the `www.muis.ee/museaalview/` part with `opendata.muis.ee/object/`:
https://www.muis.ee/museaalview/1887998 -> https://opendata.muis.ee/object/1887998
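
The URL change shown above can also be scripted. The sketch below is a minimal, hypothetical helper (`to_opendata_url` is not part of MuIS or rdflib) and assumes that every UI link contains `museaalview/` followed by the item number, as in the example above.

```python
import re


def to_opendata_url(ui_url):
    """Turn a MuIS UI link into the matching open data link (hypothetical helper)."""
    # Assumption: the item number is the run of digits right after "museaalview/"
    match = re.search(r"museaalview/(\d+)", ui_url)
    if match is None:
        raise ValueError("Not a MuIS museaalview link: " + ui_url)
    return "https://opendata.muis.ee/object/" + match.group(1)


print(to_opendata_url("https://www.muis.ee/museaalview/1887998"))
```

The returned string can be passed to `g.parse(...)` in the cell below.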

In [None]:
# Create a Graph
g = Graph()
#this item could be viewed from MuIS UI via: https://www.muis.ee/museaalview/1887998
# Parse in an RDF file hosted on the Internet
g.parse("https://opendata.muis.ee/object/1887998")         #<---  URL could be changed to any item from MuIS
#displaying RDF contents - details about specific item
for s, p, o in g:
    print(s, p, o)
#Print out the entire Graph in the RDF Turtle format (added as just another example)
#print(g.serialize(format='turtle'))

👕👕👕👕👕👕👕
<br>
**Dataset**: textile collection (Muuseumikogu: tekstiil) from Tallinna Linnamuuseum
http://www.muis.ee/rdf/collection/837
<br>
👕👕👕👕👕👕👕

In [None]:
####
# Textile collection analysis
####

#http://www.muis.ee/rdf/collection/837
#https://www.muis.ee/museaalview/1887998
#textile collection (Muuseumikogu tekstiil) from Tallinna Linnamuuseum

#idea
#getting list of all objects from collection into list
#getting descriptions from each of objects into dataframe

**Block 3**: Loading all items from textile collection into a list
<br>  <font color='orange'>Action point:</font> Try changing the URL to a different MuIS collection

In [3]:
g = Graph()
collectionitemslist = []

# Parse in an RDF file
g.parse("http://www.muis.ee/rdf/collection/837")    #<---  URL could be changed to any collection from MuIS

#loop through triples
for s, p, o in g:
    if "P46_is_composed_of" in p: collectionitemslist.append(o)

#number of items found in the collection
count = len(collectionitemslist)
print("Total count of items in collection:-", count)

Total count of items in collection:- 4076


In [None]:
#Generating dataset for Wordcloud

#filtering out values from valid URLs, adding them into two lists
collectionitemslist_url = []
collectionitemslist_title = []

it = 0

for i in collectionitemslist:
    g = Graph()
    try:
        it += 1
        g.parse(i)
        for s, p, o in g:
            #keeping only labels that belong to the item itself
            if "http://opendata.muis.ee/object/" in s and "www.w3.org/2000/01/rdf-schema#label" in p:
                print(o)
                print('-------')
                print(it)
                collectionitemslist_url.append(i)
                collectionitemslist_title.append(o)
    except Exception:
        #skipping items that cannot be loaded or parsed
        pass
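
The loop above silently skips any item that fails to load, so it is hard to tell afterwards how many items are missing. As a sketch only, the pattern below records the failures instead. Both `collect_titles` and `fake_fetch_title` are hypothetical names; the fake function stands in for loading an item with rdflib so that the example runs without network access.

```python
def collect_titles(item_urls, fetch_title):
    """Return (rows, failed): rows are (url, title) pairs, failed lists skipped URLs."""
    rows = []
    failed = []
    for url in item_urls:
        try:
            rows.append((url, fetch_title(url)))
        except Exception as error:
            # Keep the URL and the reason so the item can be retried later
            failed.append((url, str(error)))
    return rows, failed


def fake_fetch_title(url):
    """Stand-in for loading an item with rdflib; fails for one made-up URL."""
    if url.endswith("/2"):
        raise IOError("could not load " + url)
    return "Title of " + url


rows, failed = collect_titles(["item/1", "item/2", "item/3"], fake_fetch_title)
print(len(rows), "loaded,", len(failed), "failed")
```

Keeping the failed URLs in a list makes it possible to rerun only those items later.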

In [None]:
#making dataframe out of lists
#pandas is needed here, so it is imported before first use
import pandas as pd

df = pd.DataFrame(list(zip(collectionitemslist_url, collectionitemslist_title)))
df = df.rename(columns={0: 'URL', 1: 'Title'})
df

In [None]:
#Creating wordcloud based on example from:
#https://github.com/amueller/word_cloud/blob/main/examples/simple.py

#additional filtering could be applied if needed
#df = df[df['Title'].str.contains("Kleit")]
#creating a single string with all values from title column
df_joined = ' '.join(df['Title'].to_list())

In [None]:
df_joined

In [None]:
#generating word cloud

# lower max_font_size
wordcloud = WordCloud(max_font_size=80).generate(df_joined)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
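
A word cloud shows which words dominate, but not by how much. As a rough companion, the sketch below counts words with the standard library. The sample string is invented and stands in for `df_joined`; the tokenisation (lowercasing and splitting on non-word characters) is an assumption and not necessarily how WordCloud splits text.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase words in a string; the pattern also keeps letters such as õ, ä, ö, ü."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words)


# Invented sample standing in for df_joined
sample = "Kleit Kleit, pluus Seelik kleit"
counts = word_counts(sample)
print(counts.most_common(2))
```

In the notebook, `word_counts(df_joined).most_common(20)` would list the twenty most frequent title words.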

Let's now look at exporting MuIS content to an Excel file.

In [None]:
#filtering out values from valid URLs, adding them into multiple lists
#every list gets exactly one entry per item, so rows stay aligned when zipped later
collectionitemslist_url = []
collectionitemslist_title = []
collectionitemslist_label = []
collectionitemslist_availabiletime = []
collectionitemslist_identifier = []
collectionitemslist_publisher = []
collectionitemslist_collection = []
it = 0

for i in collectionitemslist:
    g = Graph()
    it += 1
    try:
        g.parse(i)
    except Exception:
        #skipping items that cannot be loaded or parsed
        continue
    print(it)
    #values for the current item; None is kept when a property is missing,
    #and the last value seen is kept when a property occurs more than once
    title = available = identifier = publisher = collection = None
    labels = []
    for s, p, o in g:
        if "http://opendata.muis.ee/object/" in s and "www.w3.org/2000/01/rdf-schema#label" in p:
            title = o
        if "purl.org/dc/terms/available" in p:
            available = o
        if "purl.org/dc/terms/identifier" in p:
            identifier = o
        if "purl.org/dc/elements/1.1/publisher" in p:
            publisher = o
        if "http://opendata.muis.ee/object/" in s and "cidoc-crm/P46i_forms_part_of" in p and "/collection/" in o:
            collection = o
        if "rdf-schema#label" in p and "tervik" not in o:
            labels.append(o)
    collectionitemslist_url.append(i)
    collectionitemslist_title.append(title)
    #all labels without "tervik" are joined into one cell
    collectionitemslist_label.append("; ".join(labels))
    collectionitemslist_availabiletime.append(available)
    collectionitemslist_identifier.append(identifier)
    collectionitemslist_publisher.append(publisher)
    collectionitemslist_collection.append(collection)

In [None]:
#making dataframe out of lists
import pandas as pd

df = pd.DataFrame(list(zip(collectionitemslist_url, collectionitemslist_title, collectionitemslist_label,
                           collectionitemslist_availabiletime, collectionitemslist_identifier, collectionitemslist_publisher, collectionitemslist_collection)))
df = df.rename(columns={0: 'URL',1: 'Title', 2: 'Label', 3: 'Made available', 4: 'Identifier', 5: 'Publisher', 6: 'Collection'})
#df = df[df['Label'].str.contains("tervik") == False]  #filtering out values with "tervik"
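
Because `zip` stops at the shortest list, the table only makes sense if every list holds exactly one entry per item. The sketch below shows the effect and a simple length check; `check_same_length` is a hypothetical helper and the sample lists are invented.

```python
def check_same_length(**columns):
    """Raise an error if the named lists differ in length (hypothetical helper)."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Columns differ in length: " + str(lengths))
    return lengths


# zip silently drops the second URL here because the title list is shorter
print(list(zip(["url1", "url2"], ["title1"])))

# Invented sample lists of equal length pass the check
print(check_same_length(URL=["url1", "url2"], Title=["title1", "title2"]))
```

Calling such a check on the collected lists before building the dataframe would flag misaligned columns early.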

In [None]:
#exporting dataframe to an Excel file
df.to_excel("labels.xlsx")

#downloading the file via the browser (works in Colab only)
files.download('labels.xlsx')
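
`files.download` is only available inside Colab. As a sketch for other environments, a table can also be written as a CSV file with the standard library. The rows and the file name `labels_sample.csv` below are invented for illustration, and the `utf-8-sig` encoding is chosen on the assumption that the file will be opened in Excel, which tends to need the byte order mark to show Estonian letters correctly.

```python
import csv
import os
import tempfile

# Invented sample rows; in the notebook these would come from the collected lists
rows = [
    {"URL": "http://opendata.muis.ee/object/1", "Title": "Kleit"},
    {"URL": "http://opendata.muis.ee/object/2", "Title": "Põll"},
]

path = os.path.join(tempfile.gettempdir(), "labels_sample.csv")

# utf-8-sig adds a byte order mark, which helps Excel detect UTF-8
with open(path, "w", newline="", encoding="utf-8-sig") as handle:
    writer = csv.DictWriter(handle, fieldnames=["URL", "Title"])
    writer.writeheader()
    writer.writerows(rows)

print("Wrote", len(rows), "rows to", path)
```

Within the notebook itself, pandas offers `df.to_csv` for the same purpose.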