# Verifying data in OCRE

This notebook shows you how to load OCRE data from a CEX file over the internet, and verify its contents.  It uses version `1.5.0` of the `nomisma` library.


## Configure Jupyter notebook

First configure the Jupyter notebook to find the `nomisma` library.  (You could do the same thing in other environments with `sbt` or `maven`.)

In [None]:
// 1. Add maven repository where we can find our libraries
val myBT = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(myBT)

In [None]:
// 2. Make libraries available with `$ivy` imports:
import $ivy.`edu.holycross.shot::nomisma:1.5.0`

## Load the full OCRE data set

In [None]:
import edu.holycross.shot.nomisma._
val ocreCex = "https://raw.githubusercontent.com/neelsmith/nomisma/master/cex/ocre-cite-ids.cex"
val ocre = OcreSource.fromUrl(ocreCex)

// Sanity check:
require(ocre.size > 50000) 

## Contents of an `Ocre` object

The object `ocre` created in the preceding cell belongs to the `Ocre` class.  `Ocre` objects have a Vector of `OcreIssue`s, each of which in turn has the following properties:


    id: String,
    labelText:  String,
    denomination: String,
    material: String,
    authority: String,
    mint: String,
    region: String,
    obvType: String,
    obvLegend: String,
    obvPortraitId: String,
    revType: String,
    revLegend: String,
    revPortraitId: String,
    dateRange: Option[YearRange]
    
In this notebook, we'll check for each property that all the values in the 50,000+ records of imperial coin issues look reasonable.


## Check for presence of required properties

    
The first seven properties should have values for each issue.  As a first step in validating the contents of `ocre`, we'll verify that each of those String properties is non-empty.


In [None]:
println("Number of issues in OCRE: " + ocre.size)
require (ocre.issues.filter(_.id.nonEmpty).size == ocre.issues.size)
require (ocre.issues.filter(_.labelText.nonEmpty).size == ocre.issues.size)
require (ocre.issues.filter(_.denomination.nonEmpty).size == ocre.issues.size)
require (ocre.issues.filter(_.material.nonEmpty).size == ocre.issues.size)
require (ocre.issues.filter(_.authority.nonEmpty).size == ocre.issues.size)
require (ocre.issues.filter(_.mint.nonEmpty).size == ocre.issues.size)
require (ocre.issues.filter(_.region.nonEmpty).size == ocre.issues.size)

val requiredProperties = List("id", "labelText", "denomination", "material", "authority", "mint", "region")
println("All issues have a non-empty data value for:\n" + requiredProperties.mkString("\n"))
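
The seven `require` lines above repeat a single pattern. As a minimal sketch, the same check can be driven by a list that pairs each property name with an accessor function, so a failure reports which property has empty values. The `SimpleIssue` case class and the sample records here are hypothetical stand-ins, not the library's `OcreIssue` or real OCRE data.

```scala
// Hypothetical stand-in for OcreIssue, with only three of its String properties.
final case class SimpleIssue(id: String, labelText: String, mint: String)

// Pair each property name with a function that extracts its value.
val accessors: List[(String, SimpleIssue => String)] = List(
  "id"        -> ((i: SimpleIssue) => i.id),
  "labelText" -> ((i: SimpleIssue) => i.labelText),
  "mint"      -> ((i: SimpleIssue) => i.mint)
)

// For each property, count the issues where the value is empty.
def emptyCounts(issues: Vector[SimpleIssue]): List[(String, Int)] =
  accessors.map { case (name, get) => name -> issues.count(i => get(i).isEmpty) }

// Made-up sample data: the second record lacks a mint.
val sample = Vector(
  SimpleIssue("x.1", "Issue X 1", "rome"),
  SimpleIssue("x.2", "Issue X 2", "")
)

val counts = emptyCounts(sample)
counts.foreach { case (name, n) => println(s"$name: $n empty value(s)") }
```

With a real `Ocre`, the same idea would apply to `ocre.issues`, asserting that every count is zero.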

## Check values of required properties

Now we want to see if those non-empty values look reasonable.  The first constraint to check is that all values for the `id` and `labelText` properties must be unique.

In [None]:
require(ocre.issues.map(_.id).distinct.size == ocre.size)
require(ocre.issues.map(_.labelText).distinct.size == ocre.size)
println("All id and labelText values are unique.")
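
Comparing the size of the `distinct` list with the total tells you whether duplicates exist, but not which values are duplicated. The following sketch, using only the Scala standard library and made-up identifiers, shows one way to list the offending values if the `require` above ever fails. The `duplicates` helper is a hypothetical name, not part of the `nomisma` library.

```scala
// Return each value that occurs more than once, with its number of occurrences.
def duplicates(values: Vector[String]): Map[String, Int] =
  values.groupBy(identity).collect {
    case (value, occurrences) if occurrences.size > 1 => value -> occurrences.size
  }

// Made-up identifiers: "x.1" appears twice.
val sampleIds = Vector("x.1", "x.2", "x.1", "y.1")
val dups = duplicates(sampleIds)
println("Duplicated values: " + dups)
```

Applied to the notebook's data, the input would be `ocre.issues.map(_.id)` or `ocre.issues.map(_.labelText)`.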

The `Ocre` class includes functions to list all values for a given property.  The names of these functions have the form `[PROPERTYNAME]List`.  Let's look at the `material` property as an example.

In [None]:
println(ocre.materialList.mkString("\n"))

You can see that in addition to abbreviations for bronze (`ae`), silver (`ar`) and gold (`av`), there is a fourth category, `none`.  The `Ocre` class includes a function for each property that creates a new `Ocre` containing only issues with a meaningful value for that property.  The names of these functions have the form `has[PROPERTYNAME]`. The `hasMint` function, for example, creates a new `Ocre` containing only issues that have a value other than `none` for the mint property.

In [None]:
println("All issues in Ocre: " + ocre.size)
println("Issues with mint not equal to 'none': " + ocre.hasMint.size)
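
To judge how much of the data carries the `none` placeholder, it can help to tally how often each value occurs. This is a minimal sketch on a made-up list of material values, using only the Scala standard library. It does not use the `nomisma` API.

```scala
// Made-up material values, including the "none" placeholder.
val materials = Vector("ar", "ae", "none", "ar", "av", "ar")

// Count the occurrences of each distinct value.
val tally: Map[String, Int] =
  materials.groupBy(identity).map { case (value, occurrences) => value -> occurrences.size }

// Keep only meaningful values, as a has[PROPERTYNAME]-style filter is described as doing.
val meaningful = materials.filterNot(_ == "none")

tally.toSeq.sortBy(_._1).foreach { case (value, n) => println(s"$value: $n") }
println("Values other than 'none': " + meaningful.size)
```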

## Optional properties

OCRE's RDF records optionally include, for each side (obverse and reverse), a type description, a legend and identifiers for portraits.  Unlike the required properties, missing values for these properties appear in the delimited-text records simply as empty strings that the `Ocre` object ignores, so you won't find entries like `none` or `uncertain_value` in the list of values for these properties.  


### Portrait identifiers

Let's start with the values for identifiers for obverse portraits.  You'll see that while the identifiers include a mix of plain strings, URLs in the `britishmuseum.org` domain, and URLs in the `nomisma.org` domain, there are no values like `none`.

In [None]:
// Not all issues have an obverse portrait ID:
println("Number of issues in OCRE: " + ocre.size)
println("Issues with obv. portrait ID: " + ocre.hasObvPortraitId.size  + "\n")

// and there are no "no data" values in obvPortraitId:
println("Distinct values for obvPortraitId:")
println(ocre.obvPortraitIdList.mkString("\n"))

// reverse portrait identifiers work the same way.

### Type descriptions 

The optional description of obverse and reverse types is free text, so unlike the properties we've looked at above, there is no `[obv|rev]TypeList` function to get a list of controlled vocabulary.  `Ocre` does provide `hasObvType` and `hasRevType` functions, which create a new `Ocre` including only those issues with an obverse or reverse type description, respectively.

As the following cell shows, we can of course string those functions together to create an `Ocre` containing only issues including *both* an obverse and reverse type description.

In [None]:
val oTypes = ocre.hasObvType
val rTypes = ocre.hasRevType
println("Total number of issues in OCRE: " + ocre.size)
println("Issues with obv. type description: " + oTypes.size)
println("Issues with rev. type description: " + rTypes.size)
val bothTypes = oTypes.hasRevType // == rTypes.hasObvType
println("Issues with both obv. and rev. type description: " + bothTypes.size)
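
Chaining works because each filter returns a new collection of the same kind, so the order of the filters does not affect the result. The sketch below illustrates this with hypothetical `TypedIssue` and `TypedIssues` classes and made-up records. It assumes, as the notebook describes for optional properties, that a missing description is an empty string. It is not the library's implementation.

```scala
// Hypothetical stand-ins for OcreIssue and Ocre, limited to type descriptions.
final case class TypedIssue(id: String, obvType: String, revType: String)

final case class TypedIssues(issues: Vector[TypedIssue]) {
  def size: Int = issues.size
  // Each filter returns a new TypedIssues, so calls can be chained.
  def hasObvType: TypedIssues = TypedIssues(issues.filter(_.obvType.nonEmpty))
  def hasRevType: TypedIssues = TypedIssues(issues.filter(_.revType.nonEmpty))
}

// Made-up records: only "x.1" describes both sides.
val allTyped = TypedIssues(Vector(
  TypedIssue("x.1", "Head, right", "Victory standing"),
  TypedIssue("x.2", "Head, right", ""),
  TypedIssue("x.3", "", "Altar")
))

val bothSides = allTyped.hasObvType.hasRevType
println("Issues describing both sides: " + bothSides.size)
```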

### Legends

Like type descriptions, obverse and reverse legends are free text, and therefore `Ocre` does not have functions `[obv|rev]LegendList` to get a list of controlled vocabulary.

As you would expect by now, the `hasObvLegend` and `hasRevLegend` functions create a new `Ocre` including only those issues with an obverse or reverse legend, respectively.

In [None]:
val oLegends = ocre.hasObvLegend
val rLegends = ocre.hasRevLegend
println("Total number of issues in OCRE: " + ocre.size)
println("Issues with obv. legend: " + oLegends.size)
println("Issues with rev. legend: " + rLegends.size)
val bothLegends = oLegends.hasRevLegend // == rLegends.hasObvLegend
println("Issues with both obv. and rev. legends: " + bothLegends.size)


The `Ocre` class also includes a function to create a citable corpus of texts from the legends data.  Validating the contents of a text corpus is a more complex undertaking that will be described in a separate notebook.

In [None]:
// bring in text libraries
import $ivy.`edu.holycross.shot::ohco2:10.16.0`
import $ivy.`edu.holycross.shot.cite::xcite:4.1.1`

In [None]:
val corpus = ocre.corpus
println("Citable nodes of text in corpus: " + corpus.size)

## Dating information

The `Ocre` class includes a final optional property with date information about each issue.  Instead of a simple string value, it's an object modeling a range of years.  

The `datable` function creates a new `Ocre` containing only issues that have dating information.  The functions `dateRange`, `minDate` and `maxDate` identify the chronological limits of all the issues in a given `Ocre` instance.  Negative values represent years BCE; positive values represent years CE.


In [None]:
// The date range object:
println("Total number of issues in OCRE: " + ocre.size)
println("Number of datable issues: " + ocre.datable.size)

println("Chronological range of issues in OCRE: " + ocre.dateRange)
println("Earliest issue: " + ocre.minDate)
println("Latest issue: " + ocre.maxDate)
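
Because `dateRange` is an `Option`, undated issues have to be dropped before computing chronological limits. The following sketch shows that pattern with hypothetical `Years` and `DatedIssue` classes and made-up dates. The library's `YearRange` may be structured differently, and this is not how `minDate` and `maxDate` are necessarily implemented.

```scala
// Hypothetical stand-in for YearRange: negative years are BCE, positive are CE.
final case class Years(from: Int, to: Int)
final case class DatedIssue(id: String, dateRange: Option[Years])

// Made-up records: "x.2" has no dating information.
val dated = Vector(
  DatedIssue("x.1", Some(Years(-31, -27))),
  DatedIssue("x.2", None),
  DatedIssue("x.3", Some(Years(98, 117)))
)

// flatMap over an Option drops the None entries, keeping only datable issues.
val ranges: Vector[Years] = dated.flatMap(_.dateRange)

// reduceOption returns None instead of throwing if no issue is datable.
val earliest: Option[Int] = ranges.map(_.from).reduceOption(_ min _)
val latest: Option[Int] = ranges.map(_.to).reduceOption(_ max _)

println("Datable issues: " + ranges.size)
println("Earliest year: " + earliest)
println("Latest year: " + latest)
```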
