## How many words fit in a dactylic hexameter?

Homeric poetry is composed in lines of six dactylic feet. How many words make up a single hexameter?


This notebook loads a text of the *Iliad* and counts the words in each line. It introduces a library named `histoutils` that defines some convenient methods for working with histograms (that is, counts of how often each item occurs).




## None of the standard stuff works on Greek

If Unicode wasn't broken, I would do this:

### Load an *Iliad* text

The first cell below just loads two long texts: don't worry yet about how it works, but notice that at the end, we've created a Vector of Strings named `iliad` and a Vector of Strings named `scholia`, each holding one String per line of the source file.


```scala
import scala.io.Source
val iliadUrl = "https://raw.githubusercontent.com/neelsmith/summer2020nbs/master/data/iliad-dipl.txt"
val scholiaUrl = "https://raw.githubusercontent.com/neelsmith/summer2020nbs/master/data/scholia-dipl.txt"

val iliad = Source.fromURL(iliadUrl).getLines.toVector
val scholia = Source.fromURL(scholiaUrl).getLines.toVector

// sanity check: visually inspect a sample of 10 lines
println(iliad.take(10))
```


## And then I would do this

So let's count "words".  We could spend a long time deciding what a word is, but let's keep it simple today by:

1. throwing away some punctuation characters
2. then splitting the long text on whitespace



```scala
// We'll learn more later about what's going on here: the list of
// characters inside square brackets is actually a *regular expression*
val iliadNoPunct = iliad.map(ln => ln.replaceAll("[,·\\.:~]", ""))
// The expression [ \n]+ means "one or more occurrences of a space or newline character"
val iliadWordsByLineBORKEN = iliadNoPunct.map(ln => ln.split("[ \n]+").toVector)

// Print the words of the first 10 lines to see if they look right...
println(iliadWordsByLineBORKEN.take(10))
```
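To make the two steps concrete, here is a minimal sketch applied to a single line, *Iliad* 1.2 as it is quoted later in this notebook. It assumes the raised dot in that line is the code point U+00B7 (written as an explicit escape below); a text that encodes it as U+0387 (Greek ano teleia) would slip past this pattern, which is one way this naive approach can go wrong on Greek.

```scala
// Illustrative only: Iliad 1.2, with the raised dot written as an explicit
// escape (assumed to be U+00B7) so that the code point is unambiguous.
val line = "οὐλομένην\u00b7 ἡ μυρί' Ἀχαιοῖς ἄλγε' ἔθηκεν\u00b7"

// Step 1: throw away some punctuation characters.
val noPunct = line.replaceAll("[,\u00b7\\.:~]", "")

// Step 2: split on runs of spaces or newlines.
val words = noPunct.split("[ \n]+").toVector

println(words.size + " words: " + words.mkString(" | "))
```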

### And then I'd do my counting

```scala
val wordCountsByIliadLine = iliadWordsByLineBORKEN.map(words => words.size)
```

Group word counts into lists, then find the size of each list.

```scala
val wcGrouped = wordCountsByIliadLine.groupBy(count => count).toVector
val wcFrequencies = wcGrouped.map { case (count, lines) => (count, lines.size) }
```
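To see what grouping does on a small input, here is a sketch using a handful of made-up word counts (the values and the name `sampleCounts` are hypothetical, not taken from the *Iliad*).

```scala
// Hypothetical word counts for six lines of verse.
val sampleCounts = Vector(7, 8, 7, 6, 8, 8)

// groupBy collects identical counts together, e.g. 7 -> Vector(7, 7).
val grouped = sampleCounts.groupBy(count => count)

// The size of each group is the number of lines with that word count.
val frequencies = grouped.toVector
  .map { case (count, occurrences) => (count, occurrences.size) }
  .sortBy { case (count, _) => count }

println(frequencies) // Vector((6,1), (7,2), (8,3))
```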

## Instead...


CITE architecture citable texts and the `greek` library to the rescue.

```scala
val personalBintray = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(personalBintray)
```

```scala
import $ivy.`edu.holycross.shot.cite::xcite:4.3.0`
import $ivy.`edu.holycross.shot::ohco2:10.19.0`
import $ivy.`edu.holycross.shot.mid::orthography:1.0.0`
import $ivy.`edu.holycross.shot::greek:5.0.0`
import $ivy.`edu.holycross.shot::scm:7.3.0`
```

```scala
import edu.holycross.shot.ohco2._
import edu.holycross.shot.cite._
import edu.holycross.shot.mid.orthography._
import edu.holycross.shot.scm._
import edu.holycross.shot.greek._
```

Build a CITE library

```scala
import scala.io.Source
val url = "https://raw.githubusercontent.com/homermultitext/hmt-archive/master/releases-cex/hmt-2020i-texts.cex"
val libSource = Source.fromURL(url).mkString
val lib = CiteLibrary(libSource)
```

```
Jun 15, 2020 8:43:59 AM wvlet.log.Logger log
INFO: Building text repo from cex ...
Jun 15, 2020 8:44:03 AM wvlet.log.Logger log
INFO: Building collection repo from cex ...
Jun 15, 2020 8:44:05 AM wvlet.log.Logger log
INFO: Building relations from cex ...
Jun 15, 2020 8:44:10 AM wvlet.log.Logger log
INFO: All library components built.

url: String = "https://raw.githubusercontent.com/homermultitext/hmt-archive/master/releases-cex/hmt-2020i-texts.cex"
libSource: String = """// Metadata for the current release

#!cexversion
3.0.1

#!citelibrary
// These values are inserted programmtacally when
// the CITE library is built:
name#Homer Multitext project, release 2020i
urn#urn:cite2:hmt:publications.cex.2020i:texts
license#Creative Commons Attribution, Non-Commercial 4.0 License <https://creativecommons.org/licenses/by-nc/4.0/>.

// CITE namespace definitions
namespace#hmt#http://www.homermultitext.org/citens/hmt
namespace#greekLit#http://chs.harvard.edu/ctsns/greekLit

// Collection of data models used in this release:

#!citecollections
URN#Description#Labelling property#Ordering property#License
urn:cite2:cite:datamodels.v1:#CITE data models#urn:cite2:cite:datamodels.v1.label:##Public domain

#!citeproperties
Property#Lab
```

Extract its text corpus and select the Venetus A diplomatic text of the *Iliad*.

```scala
val corpus = lib.textRepository.get.corpus
val iliad = corpus ~~ CtsUrn("urn:cts:greekLit:tlg0012.tlg001.msA:")

println(iliad.nodes.take(7).mkString("\n\n"))
```

```
CitableNode(urn:cts:greekLit:tlg0012.tlg001.msA:1.1,Μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος)

CitableNode(urn:cts:greekLit:tlg0012.tlg001.msA:1.2,οὐλομένην· ἡ μυρί' Ἀχαιοῖς ἄλγε' ἔθηκεν·)

CitableNode(urn:cts:greekLit:tlg0012.tlg001.msA:1.3,πολλὰς δ' ἰφθίμους ψυχὰς Ἄϊδι προΐαψεν)

CitableNode(urn:cts:greekLit:tlg0012.tlg001.msA:1.4,ἡρώων· αὐτοὺς δὲ ἑλώρια τεῦχε κύνεσσιν)

CitableNode(urn:cts:greekLit:tlg0012.tlg001.msA:1.5,οἰωνοῖσί τε πᾶσι· Διὸς δ' ἐτελείετο βουλή·)

CitableNode(urn:cts:greekLit:tlg0012.tlg001.msA:1.6,ἐξ οὗ δὴ τὰ πρῶτα διαστήτην ἐρίσαντε)

CitableNode(urn:cts:greekLit:tlg0012.tlg001.msA:1.7,Ἀτρείδης τε ἄναξ ἀνδρῶν καὶ δῖος Ἀχιλλεύς ·)

corpus: Corpus = Corpus(
  Vector(
    CitableNode(
      CtsUrn("urn:cts:greekLit:tlg0012.tlg001.due_ebbott:10.1"),
      "Alongside the ships the other best men of the Panachaeans"
    ),
    CitableNode(
      CtsUrn("urn:cts:greekLit:tlg0012.tlg001.due_ebbott:10.2"),
      "slept all night long, subdued by gentle sleep,"
    ),
    CitableNode(
      CtsUrn("urn:cts:greekLit:tlg0012.tlg001.due_ebbott:10.3"),
      "but not the son of Atreus, Agamemnon, the shepherd of the warriors\u2014"
    ),
    CitableNode(
      CtsUrn("urn:cts:greekLit:tlg0012.tlg001.due_ebbott:10.4"),
      "sweet sleep did not hold him, as he pondered many things in his mind."
    ),
    CitableNode(
      CtsUrn("urn:cts:greekLit:tlg0012.tlg001.due_ebbott:10.5"),
      "As wh
```

Tokenize each node in the corpus, filter for lexical tokens only, and count tokens per line.

```scala
val iliadTokens = iliad.nodes.map(n => LiteraryGreekString.tokenizeNode(n))

val lexTokens = iliadTokens.map(line => line.filter(t => t.tokenCategory == Some(LexicalToken)))
val tokenCounts = lexTokens.map(tkns => tkns.size).groupBy(count => count)
```

```
iliadTokens: Vector[Vector[MidToken]] = Vector(
  Vector(
    MidToken(
      CtsUrn("urn:cts:greekLit:tlg0012.tlg001.msA_tkns:1.1.0"),
      "\u039c\u1fc6\u03bd\u03b9\u03bd",
      Some(LexicalToken)
    ),
    MidToken(
      CtsUrn("urn:cts:greekLit:tlg0012.tlg001.msA_tkns:1.1.1"),
      "\u1f04\u03b5\u03b9\u03b4\u03b5",
      Some(LexicalToken)
    ),
    MidToken(
      CtsUrn("urn:cts:greekLit:tlg0012.tlg001.msA_tkns:1.1.2"),
      "\u03b8\u03b5\u1f70",
      Some(LexicalToken)
    ),
    MidToken(
      CtsUrn("urn:cts:greekLit:tlg0012.tlg001.msA_tkns:1.1.3"),
      "\u03a0\u03b7\u03bb\u03b7\u03ca\u1f71\u03b4\u03b5\u03c9",
      Some(LexicalToken)
    ),
    MidToken(
      CtsUrn("urn:cts:greekLit:
```

```scala
val countFrequencies = tokenCounts.toVector.map {
  case (num, occurrences) => (num, occurrences.size)
}
```

```
countFrequencies: Vector[(Int, Int)] = Vector(
  (0, 7),
  (5, 640),
  (10, 1607),
  (14, 21),
  (1, 19),
  (6, 2009),
  (9, 2853),
  (13, 69),
  (2, 7),
  (12, 256),
  (7, 3536),
  (3, 27),
  (11, 677),
  (8, 3787),
  (4, 121),
  (15, 4)
)
```
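The pairs above come out in no particular order, because they were produced from a `Map`. The sketch below copies the same numbers into a plain Vector (the names `counts`, `sorted`, `totalLines` and `commonest` are introduced here for illustration only) and uses standard Scala collections to sort them, total the lines, and find the most common line length.

```scala
// Pairs of (words per line, number of lines), copied from the output above.
val counts = Vector(
  (0, 7), (5, 640), (10, 1607), (14, 21), (1, 19), (6, 2009),
  (9, 2853), (13, 69), (2, 7), (12, 256), (7, 3536), (3, 27),
  (11, 677), (8, 3787), (4, 121), (15, 4)
)

// Sort by word count so the table reads from shortest to longest line.
val sorted = counts.sortBy { case (words, _) => words }
sorted.foreach { case (words, lines) => println(words + "\t" + lines) }

// Total number of lines counted, and the most common line length.
val totalLines = counts.map { case (_, lines) => lines }.sum
val (commonest, _) = counts.maxBy { case (_, lines) => lines }
println(totalLines + " lines; most common length: " + commonest + " words")
```

With these figures, the total is 15640 lines, and lines of 8 words are the most frequent.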

## Now back to your scheduled programming

## Introducing the `histoutils` library


The next cell configures this Jupyter notebook to load the `histoutils` library. It relies on the personal repository on the widely used site `bintray.com` that was added to the notebook's repositories earlier.

```scala
import $ivy.`edu.holycross.shot::histoutils:2.3.0`
```


### Standard Scala

The remaining cells import the `histoutils` library, and create a `Histogram` object from our data about frequencies of word counts.

```scala
import edu.holycross.shot.histoutils._

val tokenFrequencies =
  countFrequencies.map { case (words, occurs) => Frequency(words, occurs) }
```

```
tokenFrequencies: Vector[Frequency[Int]] = Vector(
  Frequency(0, 7),
  Frequency(5, 640),
  Frequency(10, 1607),
  Frequency(14, 21),
  Frequency(1, 19),
  Frequency(6, 2009),
  Frequency(9, 2853),
  Frequency(13, 69),
  Frequency(2, 7),
  Frequency(12, 256),
  Frequency(7, 3536),
  Frequency(3, 27),
  Frequency(11, 677),
  Frequency(8, 3787),
  Frequency(4, 121),
  Frequency(15, 4)
)
```

```scala
val iliadWordsHisto = Histogram(tokenFrequencies)
```

```
iliadWordsHisto: Histogram[Int] = Histogram(
  Vector(
    Frequency(0, 7),
    Frequency(5, 640),
    Frequency(10, 1607),
    Frequency(14, 21),
    Frequency(1, 19),
    Frequency(6, 2009),
    Frequency(9, 2853),
    Frequency(13, 69),
    Frequency(2, 7),
    Frequency(12, 256),
    Frequency(7, 3536),
    Frequency(3, 27),
    Frequency(11, 677),
    Frequency(8, 3787),
    Frequency(4, 121),
    Frequency(15, 4)
  )
)
```
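A histogram is easier to read as a picture. The sketch below does not use any `histoutils` methods (none are shown in this notebook beyond the constructors above); it is a plain-Scala stand-in that draws one `#` per 100 lines from the same figures. The names `freqs`, `unit` and `rows` are hypothetical.

```scala
// (words per line, number of lines), as in the Frequency values above.
val freqs = Vector(
  (0, 7), (5, 640), (10, 1607), (14, 21), (1, 19), (6, 2009),
  (9, 2853), (13, 69), (2, 7), (12, 256), (7, 3536), (3, 27),
  (11, 677), (8, 3787), (4, 121), (15, 4)
)

// One '#' per 100 lines; integer division means rare lengths get an empty bar.
val unit = 100
val rows = freqs.sortBy(_._1).map { case (words, lines) =>
  f"$words%2d | " + ("#" * (lines / unit)) + s" $lines"
}

println(rows.mkString("\n"))
```

The bars peak at 7 and 8 words per line and fall away quickly on either side, which makes the very short lines stand out as candidates for a closer look.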

## Look at the problems

```scala
val tooShort = lexTokens.filter(ln => ln.size < 4).sortBy(_.size)
println(tooShort.size + " lines have bad sizes.")
println(tooShort.map(ln => ln.size + " " + ln).mkString("\n\n"))
```

```
60 lines have bad sizes.
0 Vector()

0 Vector()

0 Vector()

0 Vector()

0 Vector()

0 Vector()

0 Vector()

1 Vector(MidToken(urn:cts:greekLit:tlg0012.tlg001.msA_tkns:22.235.0,νῦν δ' έτι καὶ μᾶλλον νοέω φρεσὶ τιμήσασθαι~,Some(LexicalToken)))

1 Vector(MidToken(urn:cts:greekLit:tlg0012.tlg001.msA_tkns:22.237.0,τείχεος ἐξελθεῖν~ ἄλλοι δ' ἔντοσθε μένουσι~,Some(LexicalToken)))

1 Vector(MidToken(urn:cts:greekLit:tlg0012.tlg001.msA_tkns:22.239.0,ἠθεῖ'~ ῆ μὲν πολλὰ πατὴρ καὶ πότνια μήτηρ,Some(LexicalToken)))

1 Vector(MidToken(urn:cts:greekLit:tlg0012.tlg001.msA_tkns:22.242.0,ἂλλ' ἐμὸς ἔνδοθι θυμὸς ἐτείρετο πένθεϊ λυγρῷ~,Some(LexicalToken)))

1 Vector(MidToken(urn:cts:greekLit:tlg0012.tlg001.msA_tkns:22.243.0,νῦν δ' ἰ̈θὺς μεμαῶτε μαχώμεθα~ μη δέ τι δούρων,Some(LexicalToken)))

1 Vector(MidToken(urn:cts:greekLit:tlg0012.tlg001.msA_tkns:22.248.0,οἱ δ' ὅτε δὴ σχεδὸν ἦσαν ἐπ ἀλλήλοισιν ἰ̈όντες~,Some(LexicalToken)))

1 Vector(MidToken(urn:cts:greekLit:tlg0012.tlg001.msA_tkns:22.254.0,ἂλλ' άγε δε

tooShort: Vector[Vector[MidToken]] = Vector(
  Vector(),
  Vector(),
  Vector(),
  Vector(),
  Vector(),
  Vector(),
  Vector(),
  Vector(
    MidToken(
      CtsUrn("urn:cts:greekLit:tlg0012.tlg001.msA_tkns:22.235.0"),
      "\u03bd\u1fe6\u03bd\u00a0\u03b4'\u00a0\u1f73\u03c4\u03b9\u00a0\u03ba\u03b1\u1f76\u00a0\u03bc\u1fb6\u03bb\u03bb\u03bf\u03bd\u00a0\u03bd\u03bf\u1f73\u03c9\u00a0\u03c6\u03c1\u03b5\u03c3\u1f76\u00a0\u03c4\u03b9\u03bc\u03ae\u03c3\u03b1\u03c3\u03b8\u03b1\u03b9~"
    ,
      Some(LexicalToken)
    )
  ),
  Vector(
    MidToken(
      CtsUrn("urn:cts:greekLit:tlg0012.tlg001.msA_tkns:22.237.0"),
      "\u03c4\u03b5\u03af\u03c7\u03b5\u03bf\u03c2\u00a0\u1f10\u03be\u03b5\u03bb\u03b8\u03b5\u1fd6\u03bd~\u00a0\u1f04\u03bb\u03bb\u03bf\u03b9\u00a0\u03b4'\u00a0\u1f1
```

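The escaped form of the one-token lines from book 22 shows `\u00a0` (the no-break space) wherever a word break would be expected. That is a plausible reason a whole line comes back as a single token, although how the `greek` library's tokenizer actually treats U+00A0 is an assumption here. The sketch below uses only standard Scala string methods on the first three words of 22.235, copied from that escaped output, to show how a no-break space defeats a split on ordinary whitespace.

```scala
// First three words of Iliad 22.235, copied from the escaped output above;
// \u00a0 is the no-break space that separates them.
val nbsp = '\u00a0'
val fragment = "\u03bd\u1fe6\u03bd\u00a0\u03b4'\u00a0\u1f73\u03c4\u03b9"

// Java's \s does not match U+00A0 by default, so nothing is split.
val naive = fragment.split("\\s+").toVector

// Replacing no-break spaces with plain spaces first yields three words.
val normalized = fragment.replace(nbsp, ' ').split("\\s+").toVector

println(naive.size + " token before normalizing, " + normalized.size + " after")
```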
```scala
val badLines = iliad.nodes.filter(n => LiteraryGreekString.tokenizeNode(n).size < 3)
```

```
badLines: Vector[CitableNode] = Vector(
  CitableNode(CtsUrn("urn:cts:greekLit:tlg0012.tlg001.msA:5.57"), " "),
  CitableNode(CtsUrn("urn:cts:greekLit:tlg0012.tlg001.msA:7.234"), " "),
  CitableNode(CtsUrn("urn:cts:greekLit:tlg0012.tlg001.msA:7.385"), " "),
  CitableNode(CtsUrn("urn:cts:greekLit:tlg0012.tlg001.msA:8.315"), " "),
  CitableNode(CtsUrn("urn:cts:greekLit:tlg0012.tlg001.msA:8.410"), " "),
  CitableNode(CtsUrn("urn:cts:greekLit:tlg0012.tlg001.msA:12.197"), " "),
  CitableNode(CtsUrn("urn:cts:greekLit:tlg0012.tlg001.msA:12.218b"), " "),
  CitableNode(
    CtsUrn("urn:cts:greekLit:tlg0012.tlg001.msA:22.235"),
    "\u03bd\u1fe6\u
```