## Chapter 2 — From Transcription to Digital Text 

We are faced with a problem. There is now a machine-readable text, but it is littered with material we do not want. To create their edition, Bouwman et al. added line numbers, an English translation, and a plethora of footnotes and annotations. There are also remnants of the print book: page numbers, page headers, etc. If a full computational edition of the Reynaert edition by Bouwman et al. were the goal of this project, we might actually be interested in these elements and want to capture them. Here, however, I am interested in the meticulous reproducibility of the scholarly process of editing through code, and in reading the text of the Reynaert through code. Capturing the book metaphor that is 'inbuilt' in the existing edition is less relevant to this direct aim, though it would be an interesting later project to pursue the mise-en-abyme of computationally creating the digital scholarly edition of the scholarly print edition. But right now the task is to separate the Middle Dutch verses from the rest.

### Of a Parser, Model, and Visitor

The approach I take is rather straightforward. I will read in the text line by line and see whether each line matches a certain model. So we will need models, and some machinery to match the text against those models. The latter piece of machinery we usually call a parser. The models we will just call models. Having the parser call on the models to see if they match any line is equivalent to applying a so-called '[visitor pattern](https://en.wikipedia.org/wiki/Software_design_pattern)'. Patterns in software development are standard strategies for attacking common tasks or problems.

#### The Super Model

So, we know that we will need several models: pieces of code that can recognize footnotes, empty lines, page numbers, etc. for us. This is a case where we have several objects of the category model, and each model must be able to match itself against all lines of a text. We express this commonality via a super class Model. All concrete models for matching will be variants of it.

In [2]:
class Model

  # Determines if the model matches a line of text.
  def matches( line )
    false
  end

  # 'Visits' a multiline text, that is: applies the
  # matches function above to each line of the text.
  def visit( text )
    matches = []
    text.each_line do |line|
      if matches( line )
        matches.push( self.class )
      else
        matches.push( nil )
      end
    end
    matches
  end

end

:visit
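
To see what `visit` returns, here is a minimal sketch with a made-up subclass. The `Question` model and the sample text are hypothetical and serve only as illustration. Because `visit` only asks its argument for `each_line`, a plain String can stand in for the parser that is introduced later.

```ruby
# Compact restatement of the Model super class from the cell above,
# repeated so that this snippet runs on its own.
class Model
  def matches( line )
    false
  end

  def visit( text )
    text.each_line.map { |line| matches( line ) ? self.class : nil }
  end
end

# Hypothetical concrete model, for illustration only:
# it matches lines that contain a question mark.
class Question < Model
  def matches( line )
    line.include?( "?" )
  end
end

sample = "Who made Madocke?\nWillem did.\nAnd who wrote this?"
result = Question.new.visit( sample )
p result   # prints [Question, nil, Question]
```

The result has one entry per line: the class of the model where the line matched, and `nil` where it did not. The super class itself never matches, so `Model.new.visit( sample )` yields only `nil` entries.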

#### Concrete models

Now we need several concrete models that will enable us to categorize lines in the text. Looking at the text we see that there are a number of 'types' of lines that we don't need: lines that contain only numbers (page numbers or verse numbers), lines that are in all capitals and that coincide with page headers and chapter headings, lines belonging to footnotes, and lastly empty lines. We can express this by creating concrete model classes that implement the matches method of the super class in specific ways. Thus we end up with four models (AllCaps, FootNote, Numbers, and Empty) that each use a different [regular expression](https://en.wikipedia.org/wiki/Regular_expression) to match the text surfaces that are typical for each type of line. You'll find these regular expressions between forward slashes in each class below (e.g. /[[:upper:]]/, which matches upper case letters). If you have not encountered such expressions before they may seem hermetic, but with a bit of study [effort](http://www.regular-expressions.info/) they will be sufficiently understandable.

In [3]:
# Matches a line that only contains capitals.
class AllCaps < Model

  def matches( line )
    # At least one upper case letter, and no lower case letter.
    !!line.match( /[[:upper:]]/ ) && !line.match( /[[:lower:]]/ )
  end

end

# Matches a line starting with at least one digit, followed by a dash or a space.
class FootNote < Model

  def matches( line )
    line.match( /^\d+(-| )(.+)$/ ) != nil
  end

end

# Matches a line containing only numbers. 'o' (lower case letter o) is also
# accepted as the OCR frequently misreads 0 as o.
class Numbers < Model

  def matches( line )
    line.match( /^[\do]+$/ ) != nil
  end

end

# Matches an empty line.
class Empty < Model

  def matches( line )
    line.match( /^\s*$/ ) != nil
  end

end

:matches
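
A quick way to get a feel for the four regular expressions is to run them against a few sample lines. The sketch below restates the tests as lambdas so that it runs on its own. The sample lines are invented for illustration and are not taken from the OCR text.

```ruby
# The same tests the four models apply, restated as lambdas so that
# this snippet runs on its own.
CHECKS = {
  "AllCaps"  => ->( line ) { !!line.match( /[[:upper:]]/ ) && !line.match( /[[:lower:]]/ ) },
  "FootNote" => ->( line ) { !line.match( /^\d+(-| )(.+)$/ ).nil? },
  "Numbers"  => ->( line ) { !line.match( /^[\do]+$/ ).nil? },
  "Empty"    => ->( line ) { !line.match( /^\s*$/ ).nil? }
}

# Hypothetical sample lines, made up for illustration.
samples = [
  "PROLOGUE",
  "12 Madocke: a lost work",
  "1o5",
  "",
  "dat die avonture van Reynaerde"
]

# For every sample, collect the names of all checks that match.
labels = samples.map do |line|
  CHECKS.select { |_, check| check.call( line ) }.keys
end

samples.zip( labels ).each do |line, names|
  puts "#{line.inspect} => #{names.inspect}"
end
```

Each of the first four samples is claimed by exactly one check, while the verse line is claimed by none. That last case is the one we are after: a line that no model recognizes.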

#### Differentiating Middle Dutch from English

All the line types we have seen until now have some recognizable features (they're empty, contain numbers, and so forth). When it comes to telling apart "dat die avonture van Reynaerde" from "that the tales of Reynaert", we have no visual clues at the surface of the text to go on. We will need some more knowledge to identify the former as Middle Dutch and the latter as English. The 'English model' is therefore somewhat more complicated than the other classes. It does not need to get as complicated as using sophisticated natural language processing (NLP) software packages, though. An admittedly naive but straightforward approach is to use a list of English stop words. If more than 20% of the words on a line are such stop words (put differently: if the line passes a threshold of 0.2), we can safely assume that the line is in English. There are some subtleties worth noting; to point these out, commentary is provided within the code of the class below.

In [10]:
class English < Model

  # Whenever this model is created it sets up a number of variables,
  # e.g. the list of English stop words (@stopwords), which is read from a file.
  # Some English stop words are homonyms of Middle Dutch words. We do not
  # want to count these as English, therefore we subtract the set of these words
  # (@may_be_middle_dutch) from the set of English stop words.
  # Hints may be provided to recognize particular cases, that is: if we know certain
  # words definitely indicate an English line we can add these to the set of hints.
  def initialize
    @may_be_middle_dutch = [ "an", "as", "been", "by", "have", "he", "her", "here", "i", "in", "is", "me", "mine", "no", "so", "over", "was", "we" ]
    @hints = [ "prologue" ]
    @stopwords = File.read( './resources/stopwords_en.txt' ).split( "\n" ) - @may_be_middle_dutch + @hints
    @threshold = 0.2
  end

  # Sets threshold, 0.2 (20%) by default.
  def threshold=( new_threshold )
    @threshold = new_threshold
  end

  # Some words look like "been." or "her?", we strip the punctuation to make sure we 
  # don't miss any English words while matching them ("been." for a computer is 
  # obviously not the same as "been").
  def strip_embracing_punctuation( token )
    return token.gsub(/[\.:;'“‘’”?!\(\),]+$|^[\.:;'“‘’”?!\(\),]+/, '')
  end

  # This computes the 'English score' for a line.
  # The line is first split into its individual tokens.
  # Then we count all English stopwords with a weight of 1.
  # We also count those words that *might* be English but *could*
  # also be Middle Dutch, with a weight of 0.4.
  # Finally we compute the relative score, that is: the count of English
  # words divided by the total number of tokens on the line.
  def score( string )
    score = 0.0
    tokens = string.split( /\s+/ )
    tokens.each do |token|
      stripped = strip_embracing_punctuation( token )
      score += 1.0 if @stopwords.include?( stripped.downcase )
      score += 0.4 if @may_be_middle_dutch.include?( stripped.downcase )
    end
    score/tokens.size()
  end

  # The standard match function that all models must provide.
  # We say a line is English if the score computed above is larger
  # than the threshold of 0.2. (Thus if 20% of the tokens could be English.)
  def matches( line )
    score( line ) > @threshold
  end

  # The visit method for the English model is a little more elaborate than
  # the one for the other models. It happens that some lines are not
  # recognized as English whereas they are, so we need additional power
  # of recognition. That's why this visit method asks for the context of
  # the line it is currently trying to match. If both the nearest non-empty
  # line before and the nearest non-empty line after are English, then it
  # is assumed that the line in between is also English.
  def visit( text )
    matches=[]
    text.each_line_with_context do | line, prev, succ |
      if matches( line )
        matches.push( self.class )
      else
        empty_model = Empty.new
        if !empty_model.matches( line )
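          # Post correction: if in between two English matches, it probably should be matched too.
          # Empty lines are dropped from the context first, so that the nearest
          # non-empty neighbours are compared.
          prev.reject! { |candidate| empty_model.matches( candidate ) }
          succ.reject! { |candidate| empty_model.matches( candidate ) }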
          previous_matches = matches( prev[0] ) if prev.size > 0
          next_matches = matches( succ[0] ) if succ.size > 0
          if previous_matches && next_matches
            matches.push( self.class )
          else
            matches.push( nil )
          end
        else
          matches.push( nil )
        end
      end
    end
    matches
  end

end

:visit
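
The scoring can be tried out on a few lines without the stop word file. The following is a minimal sketch under stated assumptions: the stop word list is a tiny hypothetical stand-in for `stopwords_en.txt`, the ambiguous word list is a two-word subset, the punctuation stripping is simplified, and the helper names `strip_punctuation` and `english_score` are made up for this example.

```ruby
# Hypothetical, tiny stop word list; the notebook reads the real one from a file.
STOPWORDS = [ "that", "the", "of", "and", "so", "in" ]
# Small subset of the ambiguous words, for illustration.
MAY_BE_MIDDLE_DUTCH = [ "so", "in" ]
ENGLISH_ONLY = STOPWORDS - MAY_BE_MIDDLE_DUTCH
THRESHOLD = 0.2

# Simplified stand-in for strip_embracing_punctuation.
def strip_punctuation( token )
  token.gsub( /\A[.,:;?!'()]+|[.,:;?!'()]+\z/, '' )
end

# Weighted count of English-looking tokens divided by the number of tokens.
def english_score( line )
  tokens = line.split( /\s+/ )
  return 0.0 if tokens.empty?
  total = tokens.sum( 0.0 ) do |token|
    word = strip_punctuation( token ).downcase
    if ENGLISH_ONLY.include?( word )
      1.0   # unambiguous English stop word
    elsif MAY_BE_MIDDLE_DUTCH.include?( word )
      0.4   # could be English, could be Middle Dutch
    else
      0.0
    end
  end
  total / tokens.size
end

[ "that the tales of Reynaert,",
  "dat die avonture van Reynaerde",
  "in Dietsche onghemaket bleven" ].each do |line|
  score = english_score( line )
  puts "#{line.inspect}: #{score.round( 2 )} (English? #{score > THRESHOLD})"
end
```

With this toy list the English line scores 3 out of 5 tokens (0.6), the first Middle Dutch line scores 0.0, and the line containing the ambiguous "in" scores 0.4 out of 4 tokens (0.1), which stays below the threshold. The real stop word list is much longer, so actual scores will differ.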

#### Adding an engine

Now that we have all these models, we need something that will take an actual text, set the models loose on it, and return just the Middle Dutch verses that we are looking for. This piece of machinery we will call the OCRParser. The OCRParser class takes a text and splits it on line breaks. Then it delegates the matching of lines to the models described above. Finally it returns only those lines that did not answer to any model, which should be only the Middle Dutch verses.
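
The idea of keeping only the lines that no model claims can be sketched with plain arrays, independent of any parser class. In this sketch the symbols stand in for the model classes, and the per-model results are written out by hand rather than computed, so it only illustrates the bookkeeping.

```ruby
# Three hypothetical lines of input.
lines = [ "PROLOGUE", "", "dat die avonture van Reynaerde" ]

# One array per model, one entry per line: the model's name where it
# matched, nil where it did not. Written out by hand for illustration.
per_model = [
  [ nil,      :Empty, nil ],   # what an Empty model would report
  [ :AllCaps, nil,    nil ]    # what an AllCaps model would report
]

# Transposing turns "one array per model" into "one array per line";
# compact then drops the nils, leaving only the models that matched.
per_line = per_model.transpose.map( &:compact )
p per_line   # prints [[:AllCaps], [:Empty], []]

# A line is kept when no model claimed it.
kept = lines.zip( per_line ).select { |_, matched| matched.empty? }.map( &:first )
p kept       # prints ["dat die avonture van Reynaerde"]
```

Note that `transpose` requires every model to return exactly one entry per line, which is why the models push `nil` for lines they do not match.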


In [5]:
class OCRParser

  attr_accessor :models

  def text=( text )
    @text = text
    @lines = text.split( "\n" )
  end

  def load_text( file_path )
    self.text = File.read( file_path )
  end

  # Used by models to iterate over all lines for matching.
  def each_line
    @lines.each{ |line| yield line }
  end

  # Used by models to iterate over all lines for matching, when context is 
  # needed, e.g. in the case of the English model.
  # Context is provided as 10 lines before and after the current line.
  def each_line_with_context
    @lines.each_with_index do |line,index|
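      # Clamp the start of the window at 0: a negative start index would
      # count from the end of the array instead of stopping at the first line.
      first = [ index - 10, 0 ].max
      yield line, @lines[first...index].reverse, @lines[(index+1)..(index+10)]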
    end
  end

  # Delegates matching to the models.
  # yields each line [String] and its matches [Array]
  def match_lines
    matches = []
    @models.each { |model| matches.push( model.visit( self ) ) }
    matches = matches.transpose
    @lines.each_with_index do |line, index|
      yield line, matches[index].compact
    end
  end

  # TODO!! This needs to be adjusted (and that adjustment should go into the notebook).
  # The problem is that the knowledge that the footnote model may stretch
  # multiple lines is now embedded in this method, but it should be knowledge
  # of the model and that can be 'handed' to the parser.
  def parse_to_annotated_array
    result = []
    context = []
    match_lines do |line, matches|
      if matches.include?( FootNote )
        context.pop
        context.push( FootNote )
      end
      if matches.include?( AllCaps )
        context.pop
      end
      if context.last != FootNote
        if matches.size() == 0
          result.push( [ "A", line, matches ] )
        else
          result.push( [ "I", line, matches ] )
        end
      else
        result.push( [ "I", line, matches ] )
      end
    end
    result
  end

  def parse
    raw = parse_to_annotated_array()
    raw.reject! { |annotated_line| annotated_line[0] == "I" }
    raw.map! { |annotated_line| annotated_line[1] }
  end

end

:parse
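
Taking a context window from an array needs some care near the start of the text, because Ruby interprets negative array indices as counting from the end. The sketch below shows one way to build such a window with a clamped start. The helper name `context_window` and the window size of 2 are assumptions made for this example only.

```ruby
# Returns the lines before (nearest first) and after the given index,
# at most `size` lines on each side. Hypothetical helper for illustration.
def context_window( lines, index, size = 10 )
  first = [ index - size, 0 ].max                # never start before the first line
  prev  = lines[ first...index ].reverse         # nearest preceding line comes first
  succ  = lines[ (index + 1)..(index + size) ] || []
  [ prev, succ ]
end

lines = %w[ a b c d e ]

p context_window( lines, 0, 2 )   # prints [[], ["b", "c"]]
p context_window( lines, 1, 2 )   # prints [["a"], ["c", "d"]]
p context_window( lines, 4, 2 )   # prints [["d", "c"], []]

# Without clamping, a negative start index wraps around to the end:
p lines[ -2..-1 ]                 # prints ["d", "e"]
```

Reversing the preceding lines means that the first element of each half of the window is always the nearest neighbour, which is what the English model inspects.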

#### Kicking it into life
We now have all the different parts needed to yield only the Middle Dutch text. All that is left to do is to instantiate a new OCRParser, feed it a text and the models, and then request the result. That's what the next little snippet finally does.

In [12]:
text = OCRParser.new
text.load_text( './resources/Bouwman_ Of Reynaert the Fox.txt' )
text.models = [ Empty.new, Numbers.new, FootNote.new, AllCaps.new, English.new ]
parsed = text.parse()
puts parsed.join( "\n" )

Willem die Madocke maecte, [192va,22]
daer hi dicken omme waecte,
hem vernoyde so haerde
dat die avonture van Reynaerde
in Dietsche onghemaket bleven
— die Arnout niet hevet vulscreven —
dat hi die vijte dede soucken
ende hise na den Walschen boucken
in Dietsche dus hevet begonnen.
God moete ons ziere hulpen jonnen!
Nu keert hem daertoe mijn zin
dat ic bidde in dit beghin
beede den dorpren enten doren,
ofte si commen daer si horen
dese rijme ende dese woort
(die hem onnutte sijn ghehoort),
dat sise laten onbescaven.
Te vele slachten si den raven,
die emmer es al even malsch.
Si maken sulke rijme valsch,
daer si niet meer of ne weten [192vb]
dan ic doe hoe dat si heeten
die nu in Babilonien leven.
Daden si wel, si soudens begheven.
Dat en segghic niet dor minen wille.
Mijns dichtens ware een ghestille,
ne hads mi eene niet ghebeden
die in groeter hovesscheden
gherne keert hare saken.
Soe bat mi dat ic soude maken
dese avontuere van Reynaerde.
Al begripic die grongaerde
ende die dorpren 