# What is common Latin?

Observations on Hyginus:


- Top 400 lexemes produce 18K occurrences
- Next 400 lexemes produce 1565 occurrences
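
As a rough illustration of how tier totals like these can be derived, the sketch below sums counts over successive slices of a frequency list sorted in descending order. The `Freq` case class, the `tierTotals` helper, and the toy data are hypothetical stand-ins invented for this example; they are not part of the libraries used in this notebook.

```scala
// Hypothetical stand-in for a histogram entry: a lexeme label and its count.
case class Freq(item: String, count: Int)

// Sum the counts in consecutive tiers of `tierSize` entries each,
// after sorting the entries by descending count.
def tierTotals(freqs: Vector[Freq], tierSize: Int): Vector[Int] =
  freqs.sortBy(f => -f.count).grouped(tierSize).map(_.map(_.count).sum).toVector

// Toy data (invented for illustration only).
val toy = Vector(
  Freq("qui", 10), Freq("et", 8), Freq("sum", 6),
  Freq("in", 3), Freq("filius", 2), Freq("ex", 1)
)

val totals = tierTotals(toy, 3)   // Vector(24, 6)
val grandTotal = totals.sum       // 30

// Share of all occurrences covered by each tier.
val shares = totals.map(t => t.toDouble / grandTotal)
```

With the real histogram, the same slicing with a tier size of 400 would give the two totals listed above.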


Goals of this notebook:

Analyze the breakdown by part of speech (PoS): what is the distribution of lexemes within each PoS?





## Configuring Jupyter notebook

In [1]:
val myBT = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(myBT)

myBT: coursierapi.MavenRepository = MavenRepository(https://dl.bintray.com/neelsmith/maven)

In [2]:
import $ivy.`edu.holycross.shot::ohco2:10.18.2`
import $ivy.`edu.holycross.shot.cite::xcite:4.2.0`
import $ivy.`edu.holycross.shot::midvalidator:10.0.0`
import $ivy.`edu.holycross.shot::latincorpus:2.2.1`
import $ivy.`edu.holycross.shot::latphone:2.7.2`

import $ivy.$
import $ivy.$
import $ivy.$
import $ivy.$
import $ivy.$

## Load a citable corpus from a URL

In [3]:
import edu.holycross.shot.cite._
import edu.holycross.shot.ohco2._

val hyginusUrl = "https://raw.githubusercontent.com/neelsmith/hctexts/master/cex/hyginus.cex"
val corpus = CorpusSource.fromUrl(hyginusUrl, cexHeader = true)

import edu.holycross.shot.cite._
import edu.holycross.shot.ohco2._

hyginusUrl: String = "https://raw.githubusercontent.com/neelsmith/hctexts/master/cex/hyginus.cex"
corpus: Corpus = Corpus(
  Vector(
    CitableNode(
      CtsUrn("urn:cts:latinLit:stoa1263.stoa001.hc:t.1"),
      "EXCERPTA EX HYGINI GENEALOGIIS, VOLGO FABVLAE."
    ),
    CitableNode(
      CtsUrn("urn:cts:latinLit:stoa1263.stoa001.hc:pr.1"),
      "Ex Caligine Chaos: ex Chao et Caligine Nox Dies Erebus Aether. ex Nocte et Erebo Fatum Senectus Mors Letum Continentia Somnus Somnia Amor id est Lysimeles, Epiphron dumiles Porphyrion Epaphus Discordia Miseria Petulantia Nemesis Euphrosyne Amicitia Misericordia Styx; Parcae tres, id est Clotho Lachesis Atropos; Hesperides, Aegle Hesperie aerica."
    ),
    CitableNode(
    

## Create a tokenizable corpus

Combine a citable corpus with an orthographic system to create a tokenizable corpus.

The corpus is then tokenized according to its orthographic system (here, `Latin23Alphabet`).

In [5]:
import edu.holycross.shot.latin._
import edu.holycross.shot.mid.validator._

val tcorpus = TokenizableCorpus(corpus, Latin23Alphabet )
val wordList =  tcorpus.wordList

import edu.holycross.shot.latin._
import edu.holycross.shot.mid.validator._

tcorpus: TokenizableCorpus = TokenizableCorpus(
  Corpus(
    Vector(
      CitableNode(
        CtsUrn("urn:cts:latinLit:stoa1263.stoa001.hc:t.1"),
        "EXCERPTA EX HYGINI GENEALOGIIS, VOLGO FABVLAE."
      ),
      CitableNode(
        CtsUrn("urn:cts:latinLit:stoa1263.stoa001.hc:pr.1"),
        "Ex Caligine Chaos: ex Chao et Caligine Nox Dies Erebus Aether. ex Nocte et Erebo Fatum Senectus Mors Letum Continentia Somnus Somnia Amor id est Lysimeles, Epiphron dumiles Porphyrion Epaphus Discordia Miseria Petulantia Nemesis Euphrosyne Amicitia Misericordia Styx; Parcae tres, id est Clotho Lachesis Atropos; Hesperides, Aegle Hesperie aerica."
      ),
      CitableNode(
        CtsUrn("urn:cts:latinLit:stoa1263.sto

## Create a `LatinCorpus`

Add morphological output to the tokenizable corpus to create an instance of a `LatinCorpus`.

In [7]:
// Read morphological output from a URL:
val hyginusFstUrl = "https://raw.githubusercontent.com/neelsmith/hctexts/master/parser-output/hyginus/hyginus-parses.txt"
import scala.io.Source
val fstOutput = Source.fromURL(hyginusFstUrl).getLines.toVector

hyginusFstUrl: String = "https://raw.githubusercontent.com/neelsmith/hctexts/master/parser-output/hyginus/hyginus-parses.txt"
import scala.io.Source
fstOutput: Vector[String] = Vector(
  "> et",
  "<u>latcommon.indecln16278</u><u>ls.n16278</u>et<indecl><indeclconj><div><indeclconj><indecl><u>indeclinfl.2</u>",
  "> in",
  "<u>latcommon.indecln22111</u><u>ls.n22111</u>in<indecl><indeclprep><div><indeclprep><indecl><u>indeclinfl.1</u>",
  "> cum",
  "<u>latcommon.indecln11872</u><u>ls.n11872</u>cum<indecl><indeclprep><div><indeclprep><indecl><u>indeclinfl.1</u>",
  "> filius",
  "<u>latcommon.nounn18185</u><u>ls.n18185</u>fili<noun><masc><us_i><div><us_i><noun>us<masc><nom><sg><u>livymorph.us_i1</u>",
  "> ex",
  "<u>latcommon.indecln16519</u><u>ls.n16519</u>ex<indecl><indeclconj><div><indeclconj><indecl>

Combine parser output with tokenized corpus to get a `LatinCorpus` instance.

In [None]:
import edu.holycross.shot.latincorpus._

val lc = LatinCorpus.fromFstLines(
  corpus,
  Latin23Alphabet,
  fstOutput,
  strict = false
)


In [None]:
// This should be the number of distinct analyzed tokens
lc.lexemeTokenIndex.size

In [9]:
// This is the histogram of recognized lexemes:
lc.labelledLexemeHistogram

res8: edu.holycross.shot.histoutils.Histogram[String] = Histogram(
  Vector(
    Frequency("ls.n40103:qui1", 1136),
    Frequency("ls.n16278:et", 935),
    Frequency("ls.n40242:quis1", 834),
    Frequency("ls.n25029:is", 798),
    Frequency("ls.n46529:sum1", 756),
    Frequency("ls.n22111:in1", 707),
    Frequency("ls.n18185:filius", 683),
    Frequency("ls.n11872:cum1", 636),
    Frequency("ls.n16519:ex", 402),
    Frequency("ls.n18173:filia", 354),
    Frequency("ls.n46498:sui", 350),
    Frequency("ls.n665:ad", 298),
    Frequency(

In [None]:
// It would be nice to visualize this, so let's use the
// plotly-scala library in this Ammonite-based kernel.
// Make the plotly libraries available to this notebook:
import $ivy.`org.plotly-scala::plotly-almond:0.7.1`

In [None]:
// Import plotly libraries, and set display defaults suggested for use in Jupyter NBs:
import plotly._, plotly.element._, plotly.layout._, plotly.Almond._
repl.pprinter() = repl.pprinter().copy(defaultHeight = 3)

# To work out

- relation of counts: 
    - lexical tokens in corpus
    - analyzed lexical tokens
    - recognized lexemes
- PoS distribution:  map each lexeme in lexeme histogram to its PoS 

(Assigning a single PoS to each lexeme is acceptable here, since lexical ambiguity in this corpus is effectively 0.)
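
To make the three counts and the ambiguity assumption concrete, here is a minimal sketch on toy data. The `Analysis` and `Token` case classes and the PoS labels are hypothetical stand-ins, not the `latincorpus` API; only the two lemma IDs are taken from the parser output shown earlier.

```scala
// Hypothetical stand-ins: a token with zero or more analyses,
// each analysis naming a lexeme (lemma ID) and a part of speech.
case class Analysis(lemmaId: String, pos: String)
case class Token(text: String, analyses: Vector[Analysis])

// Toy data (invented for illustration only).
val tokens = Vector(
  Token("et", Vector(Analysis("ls.n16278", "conjunction"))),
  Token("filius", Vector(Analysis("ls.n18185", "noun"))),
  Token("et", Vector(Analysis("ls.n16278", "conjunction"))),
  Token("Lysimeles", Vector.empty)  // no analysis: not recognized
)

// 1. Lexical tokens in the corpus.
val lexicalTokens = tokens.size
// 2. Lexical tokens that received at least one analysis.
val analyzedTokens = tokens.filter(_.analyses.nonEmpty)
// 3. Distinct lexemes recognized across all analyses.
val recognizedLexemes = analyzedTokens.flatMap(_.analyses.map(_.lemmaId)).distinct

// A token is lexically ambiguous if its analyses point to more than one lexeme.
val ambiguous = analyzedTokens.filter(_.analyses.map(_.lemmaId).distinct.size > 1)
val ambiguityRate = ambiguous.size.toDouble / analyzedTokens.size
```

If the ambiguity rate computed this way on the real corpus is close to 0, taking the first analysis of each token loses very little information.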


## Create a map of lexeme to  PoS


In [None]:
val sampleForm = lc.analyzed.map (a => a.analyses(0))
val lexemePoSpairing = sampleForm.map (f => f.lemmaId -> f.posLabel)
val lexemeToPosMap = lexemePoSpairing.toMap

## Map lexeme histogram to PoS histogram

In [None]:
// Re-key each lexeme frequency by its PoS label,
// dropping lexemes that have no entry in the map.
val freqs = lc.lexemeHistogram.frequencies.flatMap { fr =>
  lexemeToPosMap.get(fr.item).map { pos =>
    edu.holycross.shot.histoutils.Frequency(pos, fr.count)
  }
}

## Look at PoS distribution for top 400 lexemes

In [None]:
val top400Items = freqs.map(f => f.item).take(400)
val top400Counts = freqs.map(f => f.count).take(400)

In [None]:
val top400Freqs = freqs.take(400)

In [None]:
val posGroups = top400Freqs.groupBy(fr => fr.item)
val posCounts = posGroups.toVector.map{ case (pos, freqsV) => pos -> freqsV.map(f => f.count).sum }

In [None]:
val topPosCounts = posCounts.toVector.sortBy( _._2).map{ case(p,c) => edu.holycross.shot.histoutils.Frequency(p,c)}

In [None]:
val topPosHisto = edu.holycross.shot.histoutils.Histogram(topPosCounts).sorted

In [None]:
val items = topPosHisto.sorted.frequencies.map(fr => fr.item)
val counts = topPosHisto.sorted.frequencies.map(fr => fr.count)

val topPosPlot = Vector(
  Bar(x = items, y = counts)
)
plot(topPosPlot)

## Repeat for the second 400 lexemes

In [None]:
val second400Freqs = freqs.slice(400, 800)

In [None]:
val tier2Groups = second400Freqs.groupBy(fr => fr.item)
val tier2Counts = tier2Groups.toVector.map{ case (pos, freqsV) => pos -> freqsV.map(f => f.count).sum }

In [None]:
val tier2PosCounts = tier2Counts.toVector.sortBy( _._2).map{ case(p,c) => edu.holycross.shot.histoutils.Frequency(p,c)}

In [None]:
val tier2PosHisto = edu.holycross.shot.histoutils.Histogram(tier2PosCounts).sorted

In [None]:
val items = tier2PosHisto.sorted.frequencies.map(fr => fr.item)
val counts = tier2PosHisto.sorted.frequencies.map(fr => fr.count)

val tierPosPlot = Vector(
  Bar(x = items, y = counts)
)
plot(tierPosPlot)
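
The two tiers above repeat the same slice, group, and sum steps. One possible way to factor that out is sketched below with plain Scala collections. `PosFreq`, `posTotals`, and the toy data are hypothetical names invented for this example and do not use the `histoutils` types.

```scala
// Hypothetical stand-in for a PoS-labelled frequency entry.
case class PosFreq(pos: String, count: Int)

// Total the counts per PoS for entries in positions [from, until),
// returning (PoS, total) pairs sorted by descending total.
def posTotals(freqs: Vector[PosFreq], from: Int, until: Int): Vector[(String, Int)] =
  freqs.slice(from, until)
    .groupBy(_.pos)
    .toVector
    .map { case (pos, fs) => pos -> fs.map(_.count).sum }
    .sortBy { case (_, total) => -total }

// Toy data (invented for illustration only), already in descending order.
val toyFreqs = Vector(
  PosFreq("pronoun", 10), PosFreq("conjunction", 8), PosFreq("pronoun", 6),
  PosFreq("noun", 3), PosFreq("noun", 2), PosFreq("verb", 1)
)

val tier1 = posTotals(toyFreqs, 0, 3)  // Vector((pronoun,16), (conjunction,8))
val tier2 = posTotals(toyFreqs, 3, 6)  // Vector((noun,5), (verb,1))
```

With a helper like this, further tiers (for example positions 800 to 1200) only need different slice bounds.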