# Search the OCRE text corpus

## Contents of this notebook

This notebook shows how to build a parsed Latin corpus (a `latincorpus` object) for OCRE texts, and how to go from a single surface form to a morphologically sensitive full-corpus search.


## Organization of this notebook

This notebook uses Scala with the almond kernel (<https://almond.sh/>). The cells labelled "Notebook configuration" configure the almond kernel to find and import a series of custom libraries, using syntax specific to the Ammonite shell that almond uses. This is analogous to declaring dependencies in a `build.sbt` file if you were using `sbt` to run Scala.

The following section (labelled "Analysis with generic Scala") consists of completely generic Scala that could be used in any environment with access to the repositories and libraries configured in the section labelled "Notebook configuration".
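
For comparison, here is a sketch of what the equivalent `build.sbt` might look like. It reuses the repository URL and library coordinates from the "Notebook configuration" cells of this notebook; the resolver name `neelsmith-maven` is made up for illustration, and the fragment assumes you set a `scalaVersion` for which these artifacts were published.

```scala
// build.sbt (sketch): mirrors the two notebook configuration cells.
// The resolver name is arbitrary; only the URL comes from the notebook.

// 1. Add the Maven repository where the libraries are published.
resolvers += "neelsmith-maven" at "https://dl.bintray.com/neelsmith/maven"

// 2. Declare the libraries. `%%` plays the role of `::` in the `$ivy` imports:
//    both append the Scala binary version to the artifact name.
libraryDependencies ++= Seq(
  "edu.holycross.shot" %% "ohco2" % "10.16.0",
  "edu.holycross.shot.cite" %% "xcite" % "4.1.1",
  "edu.holycross.shot" %% "midvalidator" % "9.1.0",
  "edu.holycross.shot" %% "latphone" % "2.7.2",
  "edu.holycross.shot" %% "latincorpus" % "2.2.1"
)
```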


## Notebook configuration

Set up the notebook for access to libraries. For reasons I don't understand (perhaps having to do with asynchronous loading), adding a Maven repository and using `$ivy` imports from that repository have to be done in two separate notebook cells.

In [1]:
// 1. Add maven repository where we can find our libraries
val myBT = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(myBT)

myBT: coursierapi.MavenRepository = MavenRepository(https://dl.bintray.com/neelsmith/maven)

In [2]:
// 2. Make libraries available with `$ivy` imports:
import $ivy.`edu.holycross.shot::ohco2:10.16.0`
import $ivy.`edu.holycross.shot.cite::xcite:4.1.1`
import $ivy.`edu.holycross.shot::midvalidator:9.1.0`
import $ivy.`edu.holycross.shot::latphone:2.7.2`
import $ivy.`edu.holycross.shot::latincorpus:2.2.1`


import $ivy.$
import $ivy.$
import $ivy.$
import $ivy.$
import $ivy.$

## Analysis with generic Scala

Import libraries, as always:

In [3]:
import edu.holycross.shot.cite._
import edu.holycross.shot.ohco2._

import edu.holycross.shot.mid.validator._
import edu.holycross.shot.latin._
import edu.holycross.shot.latincorpus._

import scala.io.Source

import edu.holycross.shot.cite._
import edu.holycross.shot.ohco2._

import edu.holycross.shot.mid.validator._
import edu.holycross.shot.latin._
import edu.holycross.shot.latincorpus._

import scala.io.Source

### Build a `latincorpus`

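The `latincorpus` object unifies a citable text corpus (an ohco2 `Corpus`) with morphological analyses of its tokens. In the cell below, the analyses are the FST parser output read from `fstUrl`, and the citable text is read from a CEX file.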

In [4]:
val fstUrl = "https://raw.githubusercontent.com/neelsmith/hctexts/master/workfiles/ocre/ocre-fst.txt"
val fstLines = Source.fromURL(fstUrl).getLines.toVector

// Read CEX data from URL, create a corpus of citable nodes.
// The `CorpusSource` object should really have a function that does this for you,
// analogous to its `fromFile` function.
val url = "https://raw.githubusercontent.com/neelsmith/hctexts/master/cex/ocre43k.cex"
val ctsLines = Source.fromURL(url).getLines.toVector.tail.filter(_.nonEmpty)

val stringPairs = ctsLines.map(_.split("#"))
val citableNodes = stringPairs.map( arr => CitableNode(CtsUrn(arr(0)), arr(1)))
val corpus = Corpus(citableNodes)

fstUrl: String = "https://raw.githubusercontent.com/neelsmith/hctexts/master/workfiles/ocre/ocre-fst.txt"
fstLines: Vector[String] = Vector(
  "> avgvstvs",
  "<u>ocremorph.n4509</u><u>ls.n4509</u>avgvst<adj><us_a_um><div><us_a_um><adj>vs<masc><nom><sg><pos><u>ocremorph.us_a_um1</u>",
  "> pivs",
  "<u>ocremorph.n36487</u><u>ls.n36487</u>pi<adj><us_a_um><div><us_a_um><adj>vs<masc><nom><sg><pos><u>ocremorph.us_a_um1</u>",
  "> imperator",
  "<u>ocremorph.n21857</u><u>ls.n21857</u>imperator<noun><masc><0_is><div><0_is><noun><masc><nom><sg><u>ocremorph.0_is1</u>",
  "<u>ocremorph.n21857</u><u>ls.n21857</u>imperator<noun><masc><0_is><div><0_is><noun><masc><voc><sg><u>ocremorph.0_is11</u>",
  "> felix",
  "<u>ocremorph.n17887</u><u>ls.n17887</u>feli<adj><x_cis><div><x_cis><adj>x<masc><nom><sg><pos><u>livymorph.x_cis1</u>",
  "<u
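
The CEX-reading step in the cell above splits each line on `#` and takes the first two fields. As a library-free sketch of the same idea, the hypothetical helper `parseDelimited` below (not part of ohco2) splits each line into a `(urn, text)` pair and collects lines without a delimiter separately instead of failing on them. Unlike the cell above, it splits only on the first `#`, which is a design choice of this sketch; the sample URN and text are made up.

```scala
// Hypothetical helper: split "urn#text" lines into (urn, text) pairs.
// Returns the parsed pairs together with any lines that had no delimiter.
def parseDelimited(
  lines: Vector[String],
  delimiter: String = "#"
): (Vector[(String, String)], Vector[String]) = {
  val parsed: Vector[Either[String, (String, String)]] =
    lines.filter(_.nonEmpty).map { line =>
      // A limit of 2 keeps any later delimiter characters inside the text field.
      line.split(java.util.regex.Pattern.quote(delimiter), 2) match {
        case Array(urn, text) => Right((urn, text))
        case _                => Left(line)
      }
    }
  (parsed.collect { case Right(pair) => pair },
   parsed.collect { case Left(bad) => bad })
}

// Made-up sample data, not taken from the OCRE corpus.
val sample = Vector(
  "urn:cts:example:group.work.ed:1#IMP CAESAR",
  "",
  "no delimiter here"
)
val (pairs, rejected) = parseDelimited(sample)
```

Each pair could then be turned into a citable node the same way the cell above does, with `CitableNode(CtsUrn(urn), text)`.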

In [None]:
// A corpus of parsed tokens:
val ocrelatin = LatinCorpus.fromFstLines(corpus, Latin24Alphabet, fstLines, strict = false)
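
To make the shape of the FST input more concrete: in the output shown earlier, a line beginning with `> ` names a token, and the lines that follow it are that token's analyses (two for `imperator`, for example). The sketch below groups such lines into a map from token to analyses using plain Scala collections. It illustrates the data layout only and is not how `LatinCorpus.fromFstLines` is implemented; `groupAnalyses` is a hypothetical name, the analysis strings are abbreviated stand-ins, and any other kinds of line the parser may emit would need their own handling.

```scala
// Group FST output lines into token -> analyses,
// assuming each token is introduced by a "> token" header line.
def groupAnalyses(lines: Vector[String]): Map[String, Vector[String]] = {
  val start = (Map.empty[String, Vector[String]], Option.empty[String])
  val (grouped, _) = lines.foldLeft(start) {
    // Header line: register the token and make it the current one.
    case ((acc, _), line) if line.startsWith("> ") =>
      val token = line.drop(2).trim
      (acc.updated(token, acc.getOrElse(token, Vector.empty[String])), Some(token))
    // Analysis line: append it to the current token's analyses.
    case ((acc, Some(token)), line) if line.nonEmpty =>
      (acc.updated(token, acc(token) :+ line), Some(token))
    // Blank lines, or analysis lines before any header, are ignored.
    case (state, _) => state
  }
  grouped
}

// Made-up stand-ins for real analysis strings.
val sampleFst = Vector(
  "> imperator",
  "analysis-nominative",
  "analysis-vocative",
  "> felix",
  "analysis-adjective"
)
val byToken = groupAnalyses(sampleFst)
```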

## `Libertas` in OCRE

Given a surface form (a *token*) appearing somewhere in your corpus, you can find all *lexemes* it can derive from, and find all occurrences of that lexeme in any form.


In [None]:
val token = "libertas"
val lexemeUrns = ocrelatin.tokenLexemeIndex(token)
// here, we assume there's only one matching lexeme:
val lexemeUrn = lexemeUrns(0)
val occurrences =  ocrelatin.lexemeConcordance(lexemeUrn)
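
The two-step lookup can be pictured as a pair of indexes: one from surface form to lexemes, and one from lexeme to the passages where any of its forms occurs. The sketch below builds both from a small made-up token list using plain Scala collections, and searches every candidate lexeme rather than only the first. The `ParsedToken` class, the passage labels, and the lexeme identifiers are all hypothetical; the real `tokenLexemeIndex` and `lexemeConcordance` may use different types and data structures.

```scala
// Made-up data model: one record per parsed token occurrence.
final case class ParsedToken(passage: String, surface: String, lexeme: String)

val tokens = Vector(
  ParsedToken("coin.1", "libertas", "lex.libertas"),
  ParsedToken("coin.2", "libertatis", "lex.libertas"),
  ParsedToken("coin.3", "felix", "lex.felix"),
  ParsedToken("coin.4", "libertati", "lex.libertas")
)

// Index 1: surface form -> distinct lexemes it can derive from.
val surfaceToLexemes: Map[String, Vector[String]] =
  tokens.groupBy(_.surface).map { case (s, ts) => s -> ts.map(_.lexeme).distinct }

// Index 2: lexeme -> passages where any form of that lexeme occurs.
val lexemeToPassages: Map[String, Vector[String]] =
  tokens.groupBy(_.lexeme).map { case (l, ts) => l -> ts.map(_.passage) }

// Look up every lexeme a surface form can belong to, then its occurrences.
def search(surface: String): Map[String, Vector[String]] =
  surfaceToLexemes
    .getOrElse(surface, Vector.empty[String])
    .map(lex => lex -> lexemeToPassages.getOrElse(lex, Vector.empty[String]))
    .toMap

val hits = search("libertas")
```

Here the search for `libertas` also finds the passages containing `libertatis` and `libertati`, because all three forms share one lexeme. A form that is not in the index yields an empty result rather than an error.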
