This notebook lets the user select XML collections and zip them up to send to a service that runs a transform on them and returns a simple CSV made up of seven data points: the Collection name, Dialect name, Record name, Concept name, Content, XPath location, and the Dialect Definition for the concept. 

The notebook utilizes Bash and Python with the default packages contained in the Mac build of Anaconda with Python 3.6. Saxon, Java, and XSLT form the evaluation service in a virtual machine on an NCEAS server. 

This CSV contains a row for each concept that is found, so some locations may fulfill multiple concepts. A good example of this is the pair of concepts Keyword and Place Keyword. Every Place Keyword is also a Keyword, so the row repeats with a different Concept name. The CSV also contains a row for each undefined node that contains text, and these rows are marked with Unknown in the Concept column. 
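
The row structure described above can be illustrated with a small sketch using only the standard library. The sample rows, the record name `example.xml`, the collection name `Demo`, and the XPath values are invented for illustration; only the column names and the Keyword / Place Keyword relationship come from this notebook.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the shape of the evaluator CSV (all values are made up).
sample_csv = """Collection,Dialect,Record,Concept,Content,XPath,DialectDefinition
Demo,ISO,example.xml,Keyword,Pacific Ocean,/root/place/keyword,/*/keyword
Demo,ISO,example.xml,Place Keyword,Pacific Ocean,/root/place/keyword,/*/place/keyword
Demo,ISO,example.xml,Unknown,utf8,/root/characterSet,Undefined
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Group concept names by the location (record + XPath) they were found at.
concepts_by_location = defaultdict(list)
for row in rows:
    concepts_by_location[(row["Record"], row["XPath"])].append(row["Concept"])

# A location that fulfills more than one concept appears once per concept.
shared_locations = {
    location: names
    for location, names in concepts_by_location.items()
    if len(names) > 1
}

# Rows for undefined nodes are marked Unknown in the Concept column.
unknown_rows = [row for row in rows if row["Concept"] == "Unknown"]

print(shared_locations)
print(len(unknown_rows))
```

Grouping on the record and XPath together keeps identical paths in different records apart, which matters once a collection holds more than one record.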

This data can be used in a variety of analyses including RAD and QuickE as well as Concept Verticals. It can also be used to teach the system dialect definitions for concepts that are currently unknown by exposing all of the content at undefined nodes. 

In [85]:
%%HTML
<img src=https://image.slidesharecdn.com/scgordonesipwinter2017-170125170939/95/recommendations-analysis-dashboard-1-1024.jpg height="420" width="420">

## First, import the libraries we need for our metadata wrangle

In [111]:
import pandas as pd
import os
from os import walk
import shutil
from ipywidgets import *
import ipywidgets as widgets
import requests
from contextlib import closing
import csv
import io

In [142]:
! pip install svn

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
The folder you are executing pip from can no longer be found.


Create a list of directories in the collection directory of MILE2 

In [112]:
Organizations = []
for (dirpath, dirnames, filenames) in walk('../collection/'):
    Organizations.extend(dirnames)
    break  
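
The same "list the immediate subdirectories" step is repeated below for collections and dialects, so it could be wrapped in a helper. This is a sketch rather than the notebook's own code: the function name `list_subdirectories` and the directory names in the demonstration are hypothetical, and unlike the `walk` loop it returns the names sorted.

```python
import os
import tempfile

def list_subdirectories(path):
    """Return the sorted names of the directories directly inside path."""
    with os.scandir(path) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())

# Demonstrate on a throwaway directory tree (names are made up).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "OrgA", "CollectionX"))
    os.makedirs(os.path.join(root, "OrgB"))
    # A plain file at the top level should not show up in the result.
    open(os.path.join(root, "notes.txt"), "w").close()
    organizations = list_subdirectories(root)

print(organizations)
```

With a helper like this, the later cells would reduce to calls such as `list_subdirectories(os.path.join('../collection', Organization))`.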

Create a function to select the organization the metadata comes from

In [113]:
def OrganizationChoices(organization):
    # Store the dropdown selection in a global so later cells can use it
    global Organization
    Organization=organization
    print("Organization of the collection is", Organization)


Create a dropdown using the Organizations list and the organization selector function. This sets the Organization variable.

In [114]:
interactive(OrganizationChoices, organization=Organizations)

Organization of the collection is BCO-DMO


Create a list of collections in the organization directory selected in the dropdown above

In [115]:
Collections = []
for (dirpath, dirnames, filenames) in walk(os.path.join('../collection',Organization)):
    Collections.extend(dirnames)
    break 
Collections

['GeoTraces']

Create a function to select the collection the metadata comes from

In [116]:
def CollectionChoices(collection):
    # Store the dropdown selection in a global so later cells can use it
    global Collection
    Collection=collection

Create a dropdown using the Collections list and the collection selector function. This sets the Collection variable.

In [117]:
interactive(CollectionChoices, collection=Collections)

Many organizations support multiple metadata dialects, and share their collections in more than one dialect. This list is created the same way the others are. It adds the different dialects the collection is shared in to a list.

In [118]:
Dialects = []
for (dirpath, dirnames, filenames) in walk(os.path.join('../collection',Organization,Collection)):
    Dialects.extend(dirnames)
    break 
dialectList=Dialects


Create a function to select the dialect you want to send to the evaluator service.

In [119]:
def dialectChoice(dialect):
    global Dialect
    Dialect=dialect
    print("Dialect of the collection is", Dialect)


Create a dropdown using the Dialects list and the dialect selector function. This sets the Dialect variable.

In [120]:
interactive(dialectChoice,dialect=dialectList)

Dialect of the collection is ISO


Change to the zip directory

In [121]:
%cd ../zip

/Users/scgordon/MILE2/zip


Combine the Organization, Collection, and Dialect variables with the string 'xml' as a relative path and save the string to a variable

In [122]:
MetadataDestination=os.path.join(Organization,Collection,Dialect,'xml')
MetadataDestination

'BCO-DMO/GeoTraces/ISO/xml'

Use the path to create a directory structure in the zip directory

In [123]:
os.makedirs(MetadataDestination, exist_ok=True)

Create a path to the metadata you selected earlier and save the string to a variable, 'MetadataLocation'.

In [124]:
MetadataLocation=os.path.join('../collection/',Organization,Collection,Dialect,'xml')

MetadataLocation

'../collection/BCO-DMO/GeoTraces/ISO/xml'

Copy the metadata to the new directory structure.

In [125]:
src_files = os.listdir(MetadataLocation)
for file_name in src_files:
    full_file_name = os.path.join(MetadataLocation, file_name)
    if (os.path.isfile(full_file_name)):
        shutil.copy(full_file_name, MetadataDestination)
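
The copy step can be checked in isolation with a small sketch. The function name `copy_regular_files` and the file and directory names in the demonstration are hypothetical; the logic mirrors the loop above, copying regular files and skipping subdirectories.

```python
import os
import shutil
import tempfile

def copy_regular_files(source_dir, destination_dir):
    """Copy files (not subdirectories) from source_dir into destination_dir."""
    os.makedirs(destination_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(source_dir)):
        source_path = os.path.join(source_dir, name)
        if os.path.isfile(source_path):
            shutil.copy(source_path, destination_dir)
            copied.append(name)
    return copied

with tempfile.TemporaryDirectory() as work:
    # Build a made-up source directory with two records and one subdirectory.
    source = os.path.join(work, "source")
    os.makedirs(os.path.join(source, "nested"))
    for name in ("a.xml", "b.xml"):
        with open(os.path.join(source, name), "w") as handle:
            handle.write("<record/>")
    destination = os.path.join(work, "zip", "Org", "Collection", "ISO", "xml")
    copied_names = copy_regular_files(source, destination)
    destination_listing = sorted(os.listdir(destination))

print(copied_names)
```

Returning the copied names makes it easy to confirm that the number of records in the zip directory matches the number in the collection.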

Make a zip file to upload to the evaluator service

In [126]:
shutil.make_archive('../upload/metadata', 'zip', os.getcwd())

'/Users/scgordon/MILE2/upload/metadata.zip'
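
Because `make_archive` is given the current directory as its root, everything under the zip directory ends up in the archive. A quick way to confirm what was packed is to list the archive members, sketched here on a throwaway directory; the `Org/Collection/ISO/xml` layout and `record.xml` are stand-ins, not real holdings.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as work:
    # Hypothetical layout standing in for the zip directory.
    zip_root = os.path.join(work, "zip")
    record_dir = os.path.join(zip_root, "Org", "Collection", "ISO", "xml")
    os.makedirs(record_dir)
    with open(os.path.join(record_dir, "record.xml"), "w") as handle:
        handle.write("<record/>")

    upload_dir = os.path.join(work, "upload")
    os.makedirs(upload_dir)

    # base_name has no extension; make_archive appends ".zip" itself.
    archive_path = shutil.make_archive(
        os.path.join(upload_dir, "metadata"), "zip", zip_root
    )

    # List the files in the archive, ignoring directory entries.
    with zipfile.ZipFile(archive_path) as archive:
        file_members = [
            name for name in archive.namelist() if not name.endswith("/")
        ]

print(file_members)
```

The member paths are relative to the root directory, so the Organization/Collection/Dialect structure is preserved inside the zip file.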

In [127]:
%cd ../upload

/Users/scgordon/MILE2/upload


Send the metadata to the Evaluator and read the CSV-encoded response into a dataframe. This step can take up to a minute and shows no progress while it runs; the dataframe is displayed when it finishes.

In [130]:
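url = 'http://metadig.nceas.ucsb.edu/metadata/evaluator'
# Open the archive in a context manager so the file handle is closed after the upload
with open('metadata.zip', 'rb') as zip_file:
    r = requests.post(url, files={'zipxml': zip_file})
# Stop here if the service returned an HTTP error
r.raise_for_status()
CollectionConceptsDF = pd.read_csv(io.StringIO(r.text))
CollectionConceptsDF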

Unnamed: 0,Collection,Dialect,Record,Concept,Content,XPath,DialectDefinition,DocumentLocation
0,GeoTraces,ISO,dataset_3840.xml,Unknown,http://www.isotc211.org/2005/gmi http://www.ng...,/gmi:MI_Metadata/@xsi:schemaLocation,Undefined,/gmi:MI_Metadata/@xsi:schemaLocation
1,GeoTraces,ISO,dataset_3840.xml,Metadata Identifier,http://lod.bco-dmo.org/id/dataset/3840,/gmi:MI_Metadata/gmd:fileIdentifier,/*/gmd:fileIdentifier,/gmi:MI_Metadata/gmd:fileIdentifier[1]
2,GeoTraces,ISO,dataset_3840.xml,Metadata Language,eng; USA,/gmi:MI_Metadata/gmd:language,/*/gmd:language,/gmi:MI_Metadata/gmd:language[1]
3,GeoTraces,ISO,dataset_3840.xml,Unknown,utf8,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_Chara...,Undefined,/gmi:MI_Metadata/gmd:characterSet[1]/gmd:MD_Ch...
4,GeoTraces,ISO,dataset_3840.xml,Unknown,http://www.isotc211.org/2005/resources/Codelis...,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_Chara...,Undefined,/gmi:MI_Metadata/gmd:characterSet[1]/gmd:MD_Ch...
5,GeoTraces,ISO,dataset_3840.xml,Unknown,utf8,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_Chara...,Undefined,/gmi:MI_Metadata/gmd:characterSet[1]/gmd:MD_Ch...
6,GeoTraces,ISO,dataset_3840.xml,Resource Type,dataset,/gmi:MI_Metadata/gmd:hierarchyLevel/gmd:MD_Sco...,/*/gmd:hierarchyLevel/gmd:MD_ScopeCode,/gmi:MI_Metadata/gmd:hierarchyLevel[1]/gmd:MD_...
7,GeoTraces,ISO,dataset_3840.xml,Unknown,"Highest level of data collection, from a commo...",/gmi:MI_Metadata/gmd:hierarchyLevelName/gco:Ch...,Undefined,/gmi:MI_Metadata/gmd:hierarchyLevelName[1]/gco...
8,GeoTraces,ISO,dataset_3840.xml,Metadata Contact,Biological and Chemical Oceanography Data Mana...,/gmi:MI_Metadata/gmd:contact,/*/gmd:contact,/gmi:MI_Metadata/gmd:contact[1]
9,GeoTraces,ISO,dataset_3840.xml,Metadata Modified Date,2013-01-04,/gmi:MI_Metadata/gmd:dateStamp/gco:Date,/*/gmd:dateStamp/gco:Date,/gmi:MI_Metadata/gmd:dateStamp[1]/gco:Date[1]
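
Before saving the response, it may be worth checking that the text has the expected columns. The sketch below uses the standard-library `csv` module on three untruncated rows copied from the output above; the function name `parse_evaluator_csv` is hypothetical, and the assumption that the raw response carries exactly these headers is inferred from the displayed dataframe, not confirmed by the service.

```python
import csv
import io
from collections import Counter

# Column names taken from the dataframe displayed above.
EXPECTED_COLUMNS = [
    "Collection", "Dialect", "Record", "Concept",
    "Content", "XPath", "DialectDefinition", "DocumentLocation",
]

def parse_evaluator_csv(text):
    """Parse CSV text into a list of dicts, failing if a column is missing."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("response is missing columns: " + ", ".join(missing))
    return list(reader)

# Stand-in for r.text, built from rows 1, 2, and 9 of the output above.
sample_text = (
    "Collection,Dialect,Record,Concept,Content,XPath,DialectDefinition,DocumentLocation\n"
    "GeoTraces,ISO,dataset_3840.xml,Metadata Identifier,http://lod.bco-dmo.org/id/dataset/3840,"
    "/gmi:MI_Metadata/gmd:fileIdentifier,/*/gmd:fileIdentifier,/gmi:MI_Metadata/gmd:fileIdentifier[1]\n"
    "GeoTraces,ISO,dataset_3840.xml,Metadata Language,eng; USA,"
    "/gmi:MI_Metadata/gmd:language,/*/gmd:language,/gmi:MI_Metadata/gmd:language[1]\n"
    "GeoTraces,ISO,dataset_3840.xml,Metadata Modified Date,2013-01-04,"
    "/gmi:MI_Metadata/gmd:dateStamp/gco:Date,/*/gmd:dateStamp/gco:Date,"
    "/gmi:MI_Metadata/gmd:dateStamp[1]/gco:Date[1]\n"
)

rows = parse_evaluator_csv(sample_text)

# Count how many rows were found for each concept.
concept_counts = Counter(row["Concept"] for row in rows)
print(concept_counts)
```

A check like this fails early if the service returns an error page or an empty body with a 200 status, which `raise_for_status()` alone would not catch.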


Save the dataframe as a csv for further analysis

In [133]:
CollectionConceptsDF.to_csv('../data/data.csv', mode = 'w', index=False)

Clear up temporary files and directories, switch to the data directory

In [108]:
%cd ../
shutil.rmtree('upload')
%cd zip
shutil.rmtree(Organization)
%cd ../data

/Users/scgordon/MILE2
/Users/scgordon/MILE2/zip
/Users/scgordon/MILE2/data


Copy the csv to a directory named for the organization that has the metadata in its holdings. Give it a filename matching the metadata collection and dialect

In [109]:
shutil.copy("data.csv", os.path.join(Organization,Collection+'_'+Dialect+'_'+'data.csv'))

'BCO-DMO/GeoTraces_ISO_data.csv'
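
The copy above relies on the organization directory already existing inside the data directory. A minimal sketch that creates it when needed is shown below; the function name `store_collection_csv` and its `data_dir` parameter are hypothetical, and the demonstration runs in a temporary directory instead of the real data directory.

```python
import os
import shutil
import tempfile

def store_collection_csv(csv_path, data_dir, organization, collection, dialect):
    """Copy csv_path to data_dir/<organization>/<collection>_<dialect>_data.csv."""
    organization_dir = os.path.join(data_dir, organization)
    # Create the organization directory on first use.
    os.makedirs(organization_dir, exist_ok=True)
    target = os.path.join(organization_dir, collection + "_" + dialect + "_data.csv")
    return shutil.copy(csv_path, target)

with tempfile.TemporaryDirectory() as data_dir:
    # A placeholder data.csv standing in for the saved dataframe.
    source_csv = os.path.join(data_dir, "data.csv")
    with open(source_csv, "w") as handle:
        handle.write("Collection,Dialect\n")
    stored_path = store_collection_csv(
        source_csv, data_dir, "BCO-DMO", "GeoTraces", "ISO"
    )
    relative_path = os.path.relpath(stored_path, data_dir)
    stored_exists = os.path.isfile(stored_path)

print(relative_path)
```

Passing the data directory explicitly also removes the dependence on the current working directory set by the earlier `%cd` commands.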

Now that we have our metadata data prepared and stored, we can look at collection analytics, cross collection analytics, concept verticals, and help define unknown concepts.

### Select the notebook that prepares the data for different types of analysis

* [Concept Verticals](ConceptVerticals.ipynb)
* [Quick Evaluation Cross Collection Comparisons](QuickEvaluation-CrossCollectionComparisons.ipynb)
* [Create RAD Data](CreateRADdata.ipynb)
* [Exploring Unknown Document Locations](ExploringUnknownDocumentLocations.ipynb)