# Summarizing data in OCRE


[A previous notebook](https://mybinder.org/v2/gh/neelsmith/nomisma-jupyter/master?filepath=building%2FVerifying_ocre.ipynb) showed how to get an overview of the contents of data in OCRE.  This notebook shows you how to summarize and graph distributions of different values for OCRE properties.  It uses version `1.5.0` of the `nomisma` library.



## Configure Jupyter notebook

First configure the Jupyter notebook to find the `nomisma` library.  (You could do the same thing in other environments with `sbt` or `maven`.)  In addition to the `nomisma` library, we use `plotly` to plot graphs, and a `histoutils` package to simplify working with histograms.

In [None]:
// 1. Add maven repository where we can find our libraries
val myBT = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(myBT)

In [None]:
// 2. Make libraries available with `$ivy` imports:
import $ivy.`edu.holycross.shot::nomisma:1.5.0`
import $ivy.`edu.holycross.shot::histoutils:2.2.0`
import $ivy.`org.plotly-scala::plotly-almond:0.7.1`

## Load the full OCRE data set

In [None]:
import edu.holycross.shot.nomisma._
val ocreCex = "https://raw.githubusercontent.com/neelsmith/nomisma/master/cex/ocre-cite-ids.cex"
val ocre = OcreSource.fromUrl(ocreCex)

// Sanity check:
require(ocre.size > 50000) 

## How are issues distributed?

Let's start with distribution by year.

In [None]:
import edu.holycross.shot.histoutils._

// Represent each datable issue by the point average of its date range.
val years = ocre.datable.issues.map(_.dateRange.get.pointAverage)
val frequencies = years.groupBy(yr => yr).toVector.map{ case (k,v) => Frequency(k,v.size)}
val histogram = edu.holycross.shot.histoutils.Histogram(frequencies)

How many issues are recorded for each issuing authority?  That's a straightforward question to answer with a common Scala idiom:  

1. select all the authority values (`ocre.issues.map(_.authority)`)
2. cluster occurrences of the same authority together (`.groupBy(auth => auth)`). The result maps each authority name to a list of occurrences of that name.
3. map each key/value pair to a pair of the key and the size of its list. For convenience we'll store the key/count pair in a `Frequency` object from the `histoutils` library.

With a list of `Frequency`s, we can construct a `Histogram`.


In [None]:
import edu.holycross.shot.histoutils._

val authorityFreqs = ocre.issues.map(_.authority).groupBy(auth => auth).map { case (k,v) => Frequency(k, v.size)}
val authorityHistogram = edu.holycross.shot.histoutils.Histogram(authorityFreqs.toVector)
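
To see what each step of the idiom produces, here is a minimal sketch on a hand-made list, using only the Scala standard library. The authority names are made up for illustration, and plain `(name, count)` pairs stand in for `Frequency` objects.

```scala
// Toy data standing in for `ocre.issues.map(_.authority)` (step 1).
// These names are illustrative only, not drawn from OCRE.
val authorities = Vector("augustus", "nero", "augustus", "trajan", "nero", "augustus")

// Step 2: cluster identical values together.
val grouped: Map[String, Vector[String]] = authorities.groupBy(auth => auth)

// Step 3: replace each cluster with its size.
val toyCounts: Map[String, Int] =
  grouped.map { case (name, occurrences) => (name, occurrences.size) }
```

Here `toyCounts("augustus")` is 3. With real data, each pair would be wrapped in a `Frequency` as in the cell above.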

Let's visualize these distributions as bar graphs using the `plotly` library, beginning with the distribution of issues by year:

In [None]:
// 1. Import plotly libraries, and set display defaults suggested for use in Jupyter NBs:
import plotly._, plotly.element._, plotly.layout._, plotly.Almond._
repl.pprinter() = repl.pprinter().copy(defaultHeight = 3)

In [None]:
val years = ocre.datable.issues.map(_.dateRange.get.pointAverage)
val frequencies = years.groupBy(yr => yr).toVector.map{ case (k,v) => Frequency(k,v.size)}


// Sort once by year so the bars appear in chronological order.
val sortedFrequencies = frequencies.sortBy(_.item)
val yrVals = sortedFrequencies.map(_.item)
val counts = sortedFrequencies.map(_.count)

val yearPlot = Seq(
  Bar(
   yrVals, counts
  )
)
plot(yearPlot)
     

Sort the histogram from largest to smallest number of issues, and plot the number of issues for each authority:

In [None]:
val authNames = authorityHistogram.sorted.frequencies.map(_.item)
val authCounts = authorityHistogram.sorted.frequencies.map(_.count)
val authPlot = Seq(
  Bar(
   authNames, authCounts
  )
)
plot(authPlot)
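
The cell above sorts the histogram twice, once for the names and once for the counts. As a sketch of an alternative, using plain tuples rather than the `histoutils` classes, you could sort once and split the result into two parallel vectors with `unzip`. The pairs below are hypothetical sample data.

```scala
// Hypothetical (name, count) pairs standing in for the histogram's frequencies.
val samplePairs = Vector(("nero", 2), ("augustus", 3), ("trajan", 1))

// Sort once by descending count, then split into parallel x and y vectors.
val (sampleNames, sampleTotals) = samplePairs.sortBy { case (_, n) => -n }.unzip
```

The two resulting vectors stay aligned position by position, which is what `Bar` needs for its x and y values.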


Now let's view the number of issues struck by each issuing authority in chronological sequence.  The x-axis represents years:  each issuing authority is plotted at the midpoint of that issuer's production.

In [None]:
val authGroups = ocre.datable.issues.groupBy(_.authority)
val ocreMaps = authGroups.map{case (k,v) => (k, Ocre(v))}
val chronMaps = ocreMaps.map{ case (k,ocre) => (ocre.dateRange.pointAverage, ocre)}
val chronSorted = chronMaps.toVector.sortBy(_._1)

val datePoints = chronSorted.map(_._1)
val issueCounts = chronSorted.map(_._2.size)

val nameLabels = chronSorted.map(_._2.issues.map(_.authority).distinct).flatten

val issueChronPlot = Seq(
  Bar(
    datePoints, issueCounts,
    name = "Issues per authority",
    showlegend = true,
    text = nameLabels
  )
)
plot(issueChronPlot)
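
As a rough illustration of how the x-axis positions might be derived, the following sketch groups some made-up issues by authority and places each group at the midpoint between its earliest and latest year. The `DatedIssue` class and its fields are hypothetical, and the assumption that the midpoint is the simple mean of the two extreme years is ours; the `nomisma` library's `pointAverage` may be computed differently.

```scala
// Hypothetical stand-in for a dated issue: an authority plus the first
// and last year of the issue's date range. Not part of the nomisma library.
final case class DatedIssue(authority: String, startYear: Int, endYear: Int)

// Made-up sample data, deliberately listed out of chronological order.
val sampleIssues = Vector(
  DatedIssue("trajan", 98, 100),
  DatedIssue("nero", 54, 60),
  DatedIssue("trajan", 112, 116),
  DatedIssue("nero", 64, 68)
)

// For each authority: (midpoint year, name, number of issues),
// sorted so that earlier authorities come first.
val sampleMidpoints: Vector[(Int, String, Int)] = sampleIssues
  .groupBy(_.authority)
  .toVector
  .map { case (name, group) =>
    val earliest = group.map(_.startYear).min
    val latest = group.map(_.endYear).max
    ((earliest + latest) / 2, name, group.size)
  }
  .sortBy(_._1)
```

The first element of each triple plays the role of `datePoints`, the second of `nameLabels`, and the third of `issueCounts` in the plot above.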

