# Analyzing character usage in OCRE coin legends

This notebook analyzes the orthography of coin legends in OCRE so that we can specify a set of accepted characters and reject legends containing characters outside that set. The result provides a starting point for mapping the abbreviation-filled text of OCRE's coin legends to a parallel version with fully expanded forms (to be described in a subsequent Jupyter notebook).

Valid characters can be either alphabetic characters or punctuation characters; we will reject legends that have embedded editorial notes in English, indications of lacunae or editorial restoration, or legends with content in non-Latin alphabets.

The notebook uses version `2.0.1` of the `nomisma` library. 


## Configure Jupyter notebook

Configure the notebook's repository list, and import libraries we'll use.

In [None]:
// 1. Add maven repository where we can find our libraries
val myBT = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(myBT)

In [None]:
// 2. Make libraries available with `$ivy` imports:
import $ivy.`edu.holycross.shot::nomisma:2.0.1`
//import $ivy.`edu.holycross.shot::ohco2:10.16.0`
//import $ivy.`edu.holycross.shot.cite::xcite:4.1.1`
import $ivy.`edu.holycross.shot::histoutils:2.2.0`
import $ivy.`org.plotly-scala::plotly-almond:0.7.1`

In [None]:
// 3. All scala imports, and configure plotly
import edu.holycross.shot.nomisma._
import edu.holycross.shot.histoutils._

import plotly._, plotly.element._, plotly.layout._, plotly.Almond._
// Set display defaults suggested for use in Jupyter NBs:
repl.pprinter() = repl.pprinter().copy(defaultHeight = 3)

## Load the full OCRE data set


In [None]:
val ocreCex = "https://raw.githubusercontent.com/neelsmith/nomisma/master/cex/ocre-cite-ids.cex"
val ocre = OcreSource.fromUrl(ocreCex)

// Sanity check:
val expectedIssues = 50644
require(ocre.size == expectedIssues) 

## Collect text of legends, examine character distribution

To define a set of permitted characters, we begin by surveying the distribution of characters throughout all legends of the OCRE data set.

We can filter `Ocre` objects to create new `Ocre` objects containing only those issues with obverse or reverse legends, and map the results to the text content of the legends.  Since the results are Vectors of Strings, we can concatenate them with `++`.

In Scala, Strings are really just sequences of characters, so we can use the `toVector` method to map the text of each legend to a Vector of characters. Scala's `flatten` method then turns the resulting Vector of Vectors into a single Vector of characters (`Vector[Char]`).

In [None]:
val obvLegends : Vector[String] = ocre.hasObvLegend.issues.map(_.obvLegend)
val revLegends : Vector[String]  = ocre.hasRevLegend.issues.map(_.revLegend)
val allLegends : Vector[String] = obvLegends ++ revLegends

val allChars = allLegends.map(_.toVector).flatten
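
As a toy illustration of the `toVector` and `flatten` steps, here is a minimal sketch that uses two made-up legends rather than OCRE data. The names `toyLegends`, `nested` and `toyChars` are hypothetical and exist only for this example.

```scala
// Toy legends (hypothetical, not drawn from OCRE)
val toyLegends: Vector[String] = Vector("IMP CAESAR", "S C")

// Each String becomes its own Vector[Char] ...
val nested: Vector[Vector[Char]] = toyLegends.map(_.toVector)

// ... and flatten merges them into a single Vector[Char]
val toyChars: Vector[Char] = nested.flatten

println(toyChars.size)  // 10 characters + 3 characters = 13
```

The same result can be had in one step with `toyLegends.flatMap(_.toVector)`; the two-step form is kept here because it mirrors the explanation above.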


Two million characters is a lot of text: a little under 100,000 coin legends averaging about 20 characters in length.

In [None]:
println("Obv legends: " + obvLegends.size)
println("Rev legends: " + revLegends.size)
println("All: " + allLegends.size)
println("Total characters: " + allChars.size)

// Convert to Double so the average is not truncated by integer division
println("Average number of characters per legend: " + allChars.size.toDouble / allLegends.size)

To find how frequently individual characters occur, we use the same idiom with Scala `groupBy` that we used in making histograms of data values when we analyzed OCRE's numismatic data.

In [None]:
val charFreqsSeq = allChars.groupBy( c => c).map{ case (c,vect) => Frequency(c, vect.size)}
val charHistogram = edu.holycross.shot.histoutils.Histogram(charFreqsSeq.toVector).sorted
println("Total distinct characters: " + charHistogram.size)
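
For readers who have not seen the `groupBy` idiom before, the following sketch counts characters with plain Scala collections only. It does not use the `histoutils` classes, and the sample text and the names `sampleChars`, `counts` and `ranked` are assumptions made for this illustration.

```scala
// A short made-up text standing in for the concatenated legends
val sampleChars: Vector[Char] = "AVG AVG S C".toVector

// groupBy collects identical characters under one key;
// the size of each group is that character's frequency.
val counts: Map[Char, Int] =
  sampleChars.groupBy(c => c).map { case (c, group) => (c, group.size) }

// Sort by descending count, breaking ties by character value
val ranked: Vector[(Char, Int)] =
  counts.toVector.sortBy { case (c, n) => (-n, c) }

ranked.foreach { case (c, n) => println(s"'$c': $n") }
```

The notebook's own cell does the same grouping but wraps each pair in a `Frequency` so that the `Histogram` class can handle sorting and totals.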

In [None]:
val charValues = charHistogram.frequencies.map(_.item.toString)
val charCounts = charHistogram.frequencies.map(_.count)

In [None]:
val charHistPlot = Seq(
  Bar(x = charValues, y = charCounts)
)
plot(charHistPlot)
     

## Survey usage of unusual characters

The frequency plot of the 91 distinct characters has a *very* long tail. To decide which characters we should accept and which we should reject, we'll cut the histogram into two parts. We would expect that the more common a character is, the more likely it is to be a valid character in an OCRE edition. (Conversely, the less common a character, the more likely it is to be some kind of error.)

We'll divide the histogram at a threshold of 600 occurrences: characters appearing fewer than 600 times each account for less than three-hundredths of one percent (0.03%) of the two million characters in OCRE.

In [None]:
val threshold = 600

// Percentage that i1 represents of i2
def pct(i1: Int, i2: Int): Float = {
    i1 * 100.0f / i2
}

// Characters occurring fewer than `threshold` times
val rareChars = charHistogram.frequencies.filter(_.count < threshold)
val rareTotal = edu.holycross.shot.histoutils.Histogram(rareChars).total
val belowThresholdPct = pct(rareTotal, allChars.size)

// Characters occurring at least `threshold` times
val lessRareChars = charHistogram.frequencies.filter(_.count >= threshold)
val lessRareTotal = edu.holycross.shot.histoutils.Histogram(lessRareChars).total
val aboveThresholdPct = pct(lessRareTotal, allChars.size)

println("USING THRESHOLD VALUE OF " + threshold + ":")
println("Percent of character occurrences of " + lessRareChars.size + " characters at or above threshold: " + aboveThresholdPct)
println("Percent of character occurrences of " + rareChars.size + " characters below threshold: " + belowThresholdPct)
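
The split above can also be pictured on a tiny, invented data set. This is only a sketch: the counts below are hypothetical, not OCRE's real figures, and `toyCounts`, `cutoff`, `frequent`, `rare` and `rareShare` are names introduced for the example. It uses the standard `partition` method, which makes the split in a single pass.

```scala
// Hypothetical character counts, not OCRE's real figures
val toyCounts: Vector[(Char, Int)] =
  Vector(('A', 900), ('V', 650), ('o', 40), ('[', 10))
val cutoff = 600

// partition returns (elements satisfying the predicate, the rest)
val (frequent, rare) = toyCounts.partition { case (_, n) => n >= cutoff }

// Share of all character occurrences contributed by the rare characters
val toyTotal = toyCounts.map(_._2).sum
val rareShare = rare.map(_._2).sum * 100.0 / toyTotal

println(f"${rare.size} rare characters make up $rareShare%.2f%% of occurrences")
```

Because every character lands in exactly one of the two groups, the two percentages always sum to 100.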



So 25 characters account for more than 99.8% of the text content of legends in OCRE, and the remaining 66 characters together represent less than two-tenths of one percent of that content. Let's examine the 25 frequent characters:

In [None]:
for (ch <- lessRareChars) {
    println(ch)
}


The space character, the upper-case alphabetic characters and the two punctuation marks are all easily explained. The appearance of lower-case alphabetic "o" and "r" is unexpected. Let's inspect a sample of legends where they occur (the first ten matches for each character):

In [None]:
val lowerOs = allLegends.filter(_.contains("o"))
val lowerRs = allLegends.filter(_.contains("r"))

println("Sample of ten legends with 'o's out of " + lowerOs.size + " (" + pct(lowerOs.size,  allLegends.size) + "% of legends)")
println(lowerOs.take(10).mkString("\n"))
println("\nSample of ten legends with 'r's out of " + lowerRs.size + " (" + pct(lowerRs.size,  allLegends.size) + "% of legends)")
println(lowerRs.take(10).mkString("\n"))
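
Note that `take(10)` returns the first ten matches in data order, which may cluster around a single emperor or mint. If a genuinely random selection is wanted, one possible approach is to shuffle before taking. The sketch below assumes a small made-up Vector in place of the filtered legends; `candidates`, `rng` and `randomSample` are hypothetical names.

```scala
import scala.util.Random

// Hypothetical stand-in for a filtered Vector of legends
val candidates: Vector[String] = Vector(
  "IMP CAESAR (worn)", "AVG or AVGG", "VICTORIA or VIRTVS",
  "S C (in exergue)", "ROMA (retrograde)"
)

// A fixed seed makes the selection repeatable from run to run
val rng = new Random(42)

// Shuffle the whole Vector, then keep the first three entries
val randomSample: Vector[String] = rng.shuffle(candidates).take(3)

println(randomSample.mkString("\n"))
```

Shuffling the full collection is simple but touches every element; for a data set of under 100,000 legends that cost is negligible.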




It looks like those characters occur only when the editor has inserted an English comment into the text of the legend. We don't want to accept those two characters, but the space, the upper-case alphabetic characters and the punctuation marks should be part of our specified set.

Let's look next at the rare characters.  Are there any that we should accept?



In [None]:
val distinctChars = rareChars.size + lessRareChars.size
println( distinctChars + " distinct chars")
println(rareChars.size + " rare ones:")
for (ch <- rareChars) {
    println(ch)
}

The rarer characters are primarily Greek characters, lower-case (and so English-language) Latin alphabetic characters, and punctuation marking lacunae or editorial completions (such as parentheses, brackets and ellipses). We want to omit legends containing any of these.

Two rare characters, `←` and `|`, are used as punctuation marks that we can process like the more frequent punctuation marks we have already observed.

We can now compile a list of all acceptable characters, and define a function to determine if a String value is composed exclusively of valid characters or not.


In [None]:
val allowedChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ -•←|"

// True if String s is non-empty and composed only of allowable characters
def validOrtho(s: String, allowedCharacters: String = allowedChars) : Boolean = {
    s.nonEmpty && s.forall(c => allowedCharacters.contains(c))
}
println("Total valid characters: " + allowedChars.size)
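
To see the check in action on a few strings, here is a self-contained sketch. The helper `onlyPermitted` is a hypothetical stand-in for `validOrtho`, written so the example runs on its own, and the test strings are invented rather than taken from OCRE. Like `validOrtho`, it treats an empty string as invalid.

```scala
// Same 31 characters as the notebook's allowedChars
val permitted = "ABCDEFGHIJKLMNOPQRSTUVWXYZ -•←|"

// Hypothetical helper: true only for a non-empty String
// made up entirely of permitted characters
def onlyPermitted(s: String): Boolean =
  s.nonEmpty && s.forall(c => permitted.contains(c))

// Invented examples: two clean legends, an editorial restoration,
// an English note, and an empty legend
val examples = Vector("IMP CAESAR AVG", "S•C", "IMP [CAES]AR", "no legend", "")

for (ex <- examples) {
  println("'" + ex + "' -> " + onlyPermitted(ex))
}
```

Only the first two examples pass: the brackets, the lower-case letters and the empty string all cause rejection.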

## Summary of orthographic validity of OCRE coin legends

We can now apply our `validOrtho` function to all coin legends in OCRE.

Of the 98,566 legends in OCRE, 96.6% are composed solely using the 31 characters we specified as valid.  Almost 3.4% include one or more of the 60 characters we defined as invalid.

In [None]:
val total = allLegends.size
val sheep = allLegends.filter(leg => validOrtho(leg))
val goats = allLegends.filterNot(leg => validOrtho(leg))

val sheepPct = sheep.size * 100.0f / allLegends.size
val goatsPct = goats.size * 100.0f / allLegends.size
println("Sheep: " + sheep.size + " (" + sheepPct + "% of " + allLegends.size + ")")
println("Goats: " + goats.size + " (" + goatsPct + "% of " + allLegends.size + ")")
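
A natural follow-up is to ask which characters are responsible for the rejections. The sketch below is one way to do that, shown on invented legends rather than OCRE data; `okChars`, `toyLegendSet`, `isValid`, `valid`, `invalid` and `offenders` are hypothetical names. It also shows `partition` as a one-pass alternative to calling `filter` and `filterNot` separately.

```scala
val okChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ -•←|"

// Invented legends standing in for OCRE's
val toyLegendSet: Vector[String] =
  Vector("CONCORDIA AVG", "VICT[ORIA] AVG", "S C", "[...]")

// Hypothetical validity check, equivalent in spirit to validOrtho
def isValid(s: String): Boolean =
  s.nonEmpty && s.forall(c => okChars.contains(c))

// One pass: legends passing the check, and the rest
val (valid, invalid) = toyLegendSet.partition(isValid)

// Tally every disallowed character found in the rejected legends
val offenders: Map[Char, Int] =
  invalid
    .flatMap(leg => leg.filterNot(c => okChars.contains(c)).toVector)
    .groupBy(c => c)
    .map { case (c, group) => (c, group.size) }

println("Valid: " + valid.size + ", invalid: " + invalid.size)
offenders.toVector.sortBy(_._1).foreach { case (c, n) => println("'" + c + "': " + n) }
```

Applied to the real `goats` Vector, a tally like this would indicate whether most rejections come from brackets and ellipses or from Greek and lower-case text.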