# Building a citable text corpus from OCRE

This notebook shows how to load OCRE data from a CEX file over the internet and build a corpus of texts citable by CTS URN. It uses version `1.7.0` of the `nomisma` library.


## Configure Jupyter notebook

First, configure the Jupyter notebook. In addition to the `nomisma` library, we'll need the `xcite` and `ohco2` libraries from the CITE architecture.

In [1]:
// 1. Add maven repository where we can find our libraries
val myBT = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(myBT)

myBT: coursierapi.MavenRepository = MavenRepository(https://dl.bintray.com/neelsmith/maven)

In [2]:
// 2. Make libraries available with `$ivy` imports:
import $ivy.`edu.holycross.shot::nomisma:1.7.0`
import $ivy.`edu.holycross.shot::ohco2:10.16.0`
import $ivy.`edu.holycross.shot.cite::xcite:4.1.1`

import $ivy.$
import $ivy.$
import $ivy.$

## Load the full OCRE data set

In [3]:
import edu.holycross.shot.nomisma._
val ocreCex = "https://raw.githubusercontent.com/neelsmith/nomisma/master/cex/ocre-cite-ids.cex"
val ocre = OcreSource.fromUrl(ocreCex)

// Sanity check:
require(ocre.size > 50000) 

Dec 10, 2019 1:39:36 PM wvlet.log.Logger log
INFO: Reading 50644 lines of CEX data.
Dec 10, 2019 1:39:38 PM wvlet.log.Logger log
INFO: Created Ocre with 50644 issues.


import edu.holycross.shot.nomisma._
ocreCex: String = "https://raw.githubusercontent.com/neelsmith/nomisma/master/cex/ocre-cite-ids.cex"
ocre: Ocre = Ocre(
  Vector(
    OcreIssue(
      "3.com.43",
      "RIC III Commodus 43",
      "denarius",
      "ar",
      "commodus",
      "rome",
      "italy",
      "Head of Commodus, laureate, right",
      "M COMMODVS ANTONINVS AVG",
      "http://nomisma.org/id/commodus",
      "Roma, helmeted, draped, standing left, holding Victory in extended right hand and vertical spear in left hand",
      "TR P VII IMP V COS III P P",
      "http://collection.britishmuseum.org/id/person-institution/60208",
      Some(YearRange(182, Some(182)))
    ),
    OcreIssue(
In [7]:
// Collect the legend text from the issues that have an obverse or a reverse legend.
val obvLegends = ocre.hasObvLegend.issues.map(_.obvLegend)
val revLegends = ocre.hasRevLegend.issues.map(_.revLegend)

obvLegends: Vector[String] = Vector(
  "M COMMODVS ANTONINVS AVG",
  "D N VALEN-S P F AVG",
  "D N VALEN-S P F AVG",
  "IMP C M AVR SEV ALEXAND AVG",
  "D N VALENTINI-ANVS P F AVG",
  "IMP MAXIMINVS P F AVG",
  "IMP C CONSTANTINVS P F AVG",
  "IMP LICINIVS P F AVG",
  "IMP C CONSTANTINVS P F AVG",
  "IMP C CONSTANTINVS P F AVG",
  "IMP C CONSTANTINVS P F AVG",
  "IMP C CONSTANTINVS P F AVG",
  "IMP VALERIANVS AVG",
  "IMP C CONSTANTINVS P F AVG",
  "IMP C CONSTANTINVS P F AVG",
  "IMP MAXIMINVS P F AVG",
  "IMP LICINIVS P F AVG",
  "Q HER ETR MES DECIVS NOB C",
  "IMP MAXIMINVS P F AVG",
  "IMP C CONSTANTINVS P F AVG",
  "MAXIMINVS P F AVG",
  "CONSTANTINVS P F AVG",
  "LICINIVS P F AVG",
  "MAXENTIVS P F AVG",
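
The cell above relies on the library's `hasObvLegend` and `hasRevLegend` filters. As a rough, self-contained illustration of the idea, the sketch below assumes that "having a legend" simply means the legend string is non-empty. `IssueSketch` is a hypothetical stand-in with three fields, not the library's `OcreIssue`, and the two `example.*` records are made up for the illustration.

```scala
// Hypothetical stand-in for an issue record; NOT the nomisma library's OcreIssue.
case class IssueSketch(id: String, obvLegend: String, revLegend: String)

val sampleIssues = Vector(
  // Legends taken from the first issue shown in the notebook output.
  IssueSketch("3.com.43", "M COMMODVS ANTONINVS AVG", "TR P VII IMP V COS III P P"),
  // Made-up records: one without a reverse legend, one without any legend.
  IssueSketch("example.1", "IMP VALERIANVS AVG", ""),
  IssueSketch("example.2", "", "")
)

// Assumption: an issue "has" a legend when the legend string is non-empty.
val sampleObv = sampleIssues.filter(_.obvLegend.nonEmpty).map(_.obvLegend)
val sampleRev = sampleIssues.filter(_.revLegend.nonEmpty).map(_.revLegend)

println("Obverse legends: " + sampleObv.size)
println("Reverse legends: " + sampleRev.size)
```

Filtering before mapping keeps empty strings out of the corpus, which matters once legends are broken down into characters.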

In [14]:
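// Tally obverse and reverse legends, then pool them into a single collection.
println("Os: " + obvLegends.size)
println("Rs: " + revLegends.size)
val allLegends = obvLegends ++ revLegends
println("All: " + allLegends.size)
// Break every legend into its individual characters.
val allChars = allLegends.flatMap(_.toVector)
println("All chars: " + allChars.size)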

Os: 50148
Rs: 48418
All: 98566
All chars: 2144309


allLegends: Vector[String] = Vector(
  "M COMMODVS ANTONINVS AVG",
  "D N VALEN-S P F AVG",
  "D N VALEN-S P F AVG",
  "IMP C M AVR SEV ALEXAND AVG",
  "D N VALENTINI-ANVS P F AVG",
  "IMP MAXIMINVS P F AVG",
  "IMP C CONSTANTINVS P F AVG",
  "IMP LICINIVS P F AVG",
  "IMP C CONSTANTINVS P F AVG",
  "IMP C CONSTANTINVS P F AVG",
  "IMP C CONSTANTINVS P F AVG",
  "IMP C CONSTANTINVS P F AVG",
  "IMP VALERIANVS AVG",
  "IMP C CONSTANTINVS P F AVG",
  "IMP C CONSTANTINVS P F AVG",
  "IMP MAXIMINVS P F AVG",
  "IMP LICINIVS P F AVG",
  "Q HER ETR MES DECIVS NOB C",
  "IMP MAXIMINVS P F AVG",
  "IMP C CONSTANTINVS P F AVG",
  "MAXIMINVS P F AVG",
  "CONSTANTINVS P F AVG",
  "LICINIVS P F AVG",
  "MAXENTIVS P F AVG",
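
The printed counts can be cross-checked with a little arithmetic. The sketch below only re-uses the numbers reported in the output above. Reading the differences as "issues without a legend" is an interpretation, assuming that the legend filters drop issues with no legend and nothing else.

```scala
// Figures copied from the notebook output above.
val totalIssues = 50644   // "Created Ocre with 50644 issues."
val obvCount    = 50148   // "Os"
val revCount    = 48418   // "Rs"
val charCount   = 2144309 // "All chars"

val combined   = obvCount + revCount
// Assumption: the gap is the number of issues lacking that legend.
val missingObv = totalIssues - obvCount
val missingRev = totalIssues - revCount
val meanLength = charCount.toDouble / combined

println(s"Combined legends: $combined")
println(s"Issues without obverse legend: $missingObv")
println(s"Issues without reverse legend: $missingRev")
println(f"Mean legend length: $meanLength%.2f characters")
```

The combined figure agrees with the `All: 98566` line printed by the cell.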

In [13]:
import $ivy.`edu.holycross.shot::histoutils:2.2.0`
import $ivy.`org.plotly-scala::plotly-almond:0.7.1`

allChars: Vector[Char] = Vector(
  'M',
  ' ',
  'C',
  'O',
  'M',
  'M',
  'O',
  'D',
  'V',
  'S',
  ' ',
  'A',
  'N',
  'T',
  'O',
  'N',
  'I',
  'N',
  'V',
  'S',
  ' ',
  'A',
  'V',
  'G',
  'D',
  ' ',
  'N',
  ' ',
  'V',
  'A',
  'L',
  'E',
  'N',
  '-',
  'S',
  ' ',
  'P',
  ' ',
...
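
To see how the legends turn into one flat vector of characters, here is a small sketch using only the first two distinct legends from the output above. It uses the standard library alone, and the names `twoLegends`, `viaMapFlatten` and `viaFlatMap` are illustrative.

```scala
// The first two distinct legends shown in the notebook output.
val twoLegends = Vector("M COMMODVS ANTONINVS AVG", "D N VALEN-S P F AVG")

// Two equivalent ways to flatten strings into a single Vector[Char].
val viaMapFlatten = twoLegends.map(_.toVector).flatten
val viaFlatMap    = twoLegends.flatMap(_.toVector)

println("Characters: " + viaFlatMap.size)
println("Same result: " + (viaMapFlatten == viaFlatMap))
```

Spaces and hyphens are kept as characters in their own right, which is why they show up in the frequency counts later on.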

In [17]:
import edu.holycross.shot.histoutils._
// Group identical characters together and record how often each one occurs.
val charFreqsSeq = allChars.groupBy(c => c).map { case (c, vect) => Frequency(c, vect.size) }
val charFreqs = charFreqsSeq.toVector

import edu.holycross.shot.histoutils._
charFreqsSeq: collection.immutable.Iterable[Frequency[Char]] = List(
  Frequency('E', 92968),
...
charFreqs: Vector[Frequency[Char]] = Vector(
  Frequency('E', 92968),
...
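
The group-and-count pattern can be tried without any external library. In the sketch below, `CharCount` is a hypothetical stand-in for the `Frequency` class used above, assumed here to pair an item with a count. The sample is a single legend taken from the output.

```scala
// Hypothetical stand-in for histoutils' Frequency; for illustration only.
case class CharCount(item: Char, count: Int)

val sampleChars = "IMP LICINIVS P F AVG".toVector

val sampleFreqs = sampleChars
  .groupBy(identity)                                    // Map[Char, Vector[Char]]
  .map { case (c, occurrences) => CharCount(c, occurrences.size) }
  .toVector
  .sortBy(f => (-f.count, f.item))                      // most frequent first, ties by character

sampleFreqs.foreach(println)
```

Sorting on the pair `(-count, item)` makes the order deterministic, since the `Map` produced by `groupBy` has no guaranteed ordering.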

In [16]:
// Import plotly libraries, and set the display defaults suggested for use in Jupyter notebooks:
import plotly._, plotly.element._, plotly.layout._, plotly.Almond._
repl.pprinter() = repl.pprinter().copy(defaultHeight = 3)

import plotly._, plotly.element._, plotly.layout._, plotly.Almond._

In [21]:
// Parallel vectors of labels and counts, both ordered by ascending count.
val charValues = charFreqs.sortBy(_.count).map(_.item.toString)
val charCounts = charFreqs.sortBy(_.count).map(_.count)


charValues: Vector[String] = Vector(
  "x",
...
charCounts: Vector[Int] = Vector(
  1,
...
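
One possible variation is to sort once and split the result into labels and counts with `unzip`, which guarantees the two vectors stay aligned. This is a sketch on toy data. `LabelledCount` is a hypothetical stand-in for `Frequency`, and the counts are invented.

```scala
// Hypothetical stand-in for Frequency; the counts below are toy values.
case class LabelledCount(item: Char, count: Int)

val toyFreqs = Vector(LabelledCount('A', 3), LabelledCount('-', 1), LabelledCount(' ', 5))

// Sort a single time, then split into two aligned vectors.
val ascending = toyFreqs.sortBy(_.count)
val (labels, counts) = ascending.map(f => (f.item.toString, f.count)).unzip

println(labels.zip(counts).mkString(", "))
```

Because both vectors come from the same sorted sequence, a label at index `i` always belongs to the count at index `i`.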

In [23]:
// Plot the counts as a bar chart, most frequent character first.
val charHistPlot = Seq(
  Bar(x = charValues.reverse, y = charCounts.reverse)
)
plot(charHistPlot)

charHistPlot: Seq[Bar] = List(
  Bar(
...
res22_1: String = "plot-c9df23b2-93ad-457d-9efd-cd72cf53b307"

In [26]:
// Split at a threshold of 15 occurrences; list the rare characters from most to least frequent.
val rareChars = charFreqs.sortBy(_.count).reverse.filter(_.count < 15)
val lessRareChars = charFreqs.filter(_.count >= 15)
println("Rare chars: \n" + rareChars.mkString("\n"))

Rare chars: 
Frequency(←,13)
Frequency(m,11)
Frequency(Λ,10)
Frequency(Γ,9)
Frequency(θ,9)
Frequency(p,7)
Frequency(w,6)
Frequency(J,5)
Frequency(●,4)
Frequency(;,3)
Frequency(W,2)
Frequency(+,2)
Frequency(Κ,2)
Frequency(̄,2)
Frequency(Δ,2)
Frequency(\,2)
Frequency(ϵ,2)
Frequency(,2)
Frequency(Α,2)
Frequency(Π,1)
Frequency(3,1)
Frequency(Σ,1)
Frequency(ϟ,1)
Frequency(—,1)
Frequency(̅,1)
Frequency(ϕ,1)
Frequency(Ä,1)
Frequency(x,1)


rareChars: Vector[Frequency[Char]] = Vector(
  Frequency('\u2190', 13),
...
lessRareChars: Vector[Frequency[Char]] = Vector(
  Frequency('E', 92968),
...
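
The same split can be made in one pass with `partition`. The sketch below uses four counts copied from the notebook's output. `Tally` is a hypothetical stand-in for `Frequency`, and `threshold` is just a named version of the value 15 used above.

```scala
// Hypothetical stand-in for Frequency.
case class Tally(item: Char, count: Int)

val threshold = 15

// Counts copied from the notebook output; '\u2190' is the leftwards arrow.
val toyTallies = Vector(Tally('E', 92968), Tally('\u2190', 13), Tally('x', 1), Tally(',', 15))

// partition returns (elements matching the predicate, all the others).
val (rare, common) = toyTallies.partition(_.count < threshold)
val rareDescending = rare.sortBy(-_.count)

println("Rare: " + rareDescending.mkString(", "))
println("Common: " + common.mkString(", "))
```

A character with exactly 15 occurrences, such as the comma, lands on the "common" side because the test is a strict `<`.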

In [34]:
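val charValues = charFreqs.sortBy(_.count).map(_.item.toString)
// Expected characters: uppercase Latin letters plus the space.
val alphaChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "
val sortedFreqs = charFreqs.sortBy(_.count).reverse
// Everything else, including lowercase letters, is treated as non-alphabetic here.
val nonAlpha = sortedFreqs.filterNot(c => alphaChars.contains(c.item))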


charValues: Vector[String] = Vector(
  "x",
...
alphaChars: String = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "
sortedFreqs: Vector[Frequency[Char]] = Vector(
  Frequency(' ', 363616),
...
nonAlpha: Vector[Frequency[Char]] = Vector(
  Frequency('-', 29988),
...
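
Because `alphaChars` lists only uppercase Latin letters and the space, lowercase and Greek letters end up in `nonAlpha` alongside punctuation. If a finer breakdown is wanted, one possibility is a small classifier like the sketch below. The category names and the `classify` function are illustrative assumptions, not part of any library used in this notebook.

```scala
// Illustrative four-way classification; the category labels are made up for this sketch.
def classify(c: Char): String =
  if ((c >= 'A' && c <= 'Z') || c == ' ') "latin-upper-or-space"
  else if (c >= 'a' && c <= 'z') "latin-lower"
  else if (Character.isLetter(c)) "other-letter"   // e.g. Greek capitals
  else "non-letter"                                // punctuation, symbols, digits

// '\u039b' is Greek capital lambda, '\u2022' is the bullet; both occur in the output.
val probe = Vector('A', ' ', 'r', 'o', '-', '[', '\u039b', '\u2022')
val grouped = probe.groupBy(classify)

grouped.foreach { case (category, chars) => println(category + ": " + chars.mkString(" ")) }
```

Separating lowercase and non-Latin letters from true punctuation makes it easier to decide which characters are editorial conventions and which are data-entry anomalies.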

In [35]:
println("Non-alphabetic: \n" + nonAlpha.mkString("\n"))

Non-alphabetic: 
Frequency(-,29988)
Frequency(r,4601)
Frequency(o,4552)
Frequency(•,717)
Frequency([,318)
Frequency(],315)
Frequency(.,251)
Frequency(e,222)
Frequency(_,205)
Frequency((,199)
Frequency(),199)
Frequency(…,175)
Frequency(n,136)
Frequency(˙,110)
Frequency(i,103)
Frequency(l,78)
Frequency(v,78)
Frequency(a,74)
Frequency(c,66)
Frequency(d,56)
Frequency(?,53)
Frequency(|,53)
Frequency(f,49)
Frequency(b,41)
Frequency(t,41)
Frequency(:,39)
Frequency(Ꜹ,38)
Frequency(g,37)
Frequency(s,37)
Frequency(Ϥ,33)
Frequency(∈,28)
Frequency(☧,24)
Frequency(*,21)
Frequency(u,18)
Frequency(/,17)
Frequency(h,17)
Frequency(,,15)
Frequency(Ʈ,15)
Frequency(←,13)
Frequency(m,11)
Frequency(Λ,10)
Frequency(Γ,9)
Frequency(θ,9)
Frequency(p,7)
Frequency(w,6)
Frequency(●,4)
Frequency(;,3)
Frequency(+,2)
Frequency(Κ,2)
Frequency(̄,2)
Frequency(Δ,2)
Frequency(\,2)
Frequency(ϵ,2)
Frequency(,2)
Frequency(Α,2)
Frequency(Π,1)
Frequency(3,1)
Frequency(Σ,1)
Frequency(ϟ,1)
Frequency(—,1)
Frequency(̅,1)
Frequenc