## Chapter 3 — On Iterations 

Writing code has a lot in common with authoring text. For instance, you only rarely get it right the first time. What authors would call rewriting, scrapping, polishing, starting anew, editing, and so forth is in IT terms called "refactoring" or "doing another iteration", although you will indeed hear plain and simple "rewriting" as well.

When writing the code for this experiment, I often needed to rewrite parts of it. During these iterations it was not always clear which were purely technical (e.g. fixing a typo so that "parser.pase()" would not simply error out due to a missing 'r') and which were also tied to scholarly action (e.g. improving the model for detecting English lines in the text). This was important to discern, as the rules of this experiment stated that all scholarly actions should be reproducible. Including in this notebook every state of the code for each iteration would be rather tedious, pretty boring, and not very informative. It seems reasonable to silently accept iterations that are code oriented, that is: those rewritings that make the code technically more adequate or more efficient but that do not change the result of the heuristic itself. A good example is a method extraction. Suppose you have the following code.


In [None]:
puts "The bicyclist rides on the bike".gsub( "the", "<part>the</part>" )
puts "The train conductor asks for the tickets".gsub( "the", "<part>the</part>" )
puts "The train conductor asks for the tickets".gsub( "the", "<part>the</part>" )

This code might be part of some application that marks particles for teaching purposes. But the marking is spelled out explicitly in every case, which makes the code harder to maintain: if we wanted to change the label of the particles we would have to change it in six places, giving us as many if not more occasions to err. This is why you would refactor code like that to use a method (function) that is called each time we need to write a label.

In [None]:
def mark_part( string )
  string.gsub( "the", "<part>the</part>" ) 
end 

puts mark_part( "The bicyclist rides on the bike" )
puts mark_part( "The train conductor asks for the tickets" )
puts mark_part( "The train conductor asks for the tickets" )

Now if we want to change the label, we can change it in one place and be sure that all instances will act the same. Note, however, that the performance of the code, its output, is still the same (as you can confirm by running the code).
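One way to confirm that a refactoring of this kind leaves the output untouched is to write down the output of the original statements and compare the refactored code against it. The sketch below does that for the example; `recorded_output`, `sample_sentences`, and `refactoring_preserves_output?` are hypothetical helper names introduced only for this illustration.

```ruby
# The extracted method, as in the refactored cell above.
def mark_part( string )
  string.gsub( "the", "<part>the</part>" )
end

# Hypothetical helper: the sentences used in the example.
def sample_sentences
  [
    "The bicyclist rides on the bike",
    "The train conductor asks for the tickets"
  ]
end

# Hypothetical helper: the output of the original, unrefactored statements,
# written out by hand. Only the lowercase "the" is marked.
def recorded_output
  [
    "The bicyclist rides on <part>the</part> bike",
    "The train conductor asks for <part>the</part> tickets"
  ]
end

# True when the refactored method reproduces the recorded output exactly.
def refactoring_preserves_output?
  sample_sentences.map { |sentence| mark_part( sentence ) } == recorded_output
end

puts refactoring_preserves_output?   # prints true
```

As long as this check holds, the iteration can be treated as purely technical in the sense used here.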

Similarly, I have made many iterations and improvements to the code described in this notebook: they improved the structure, maintainability, and readability of the code, or simply solved bugs that prevented successful execution. But those refactorings did not alter the actual resulting performance.<a href="#note_001" name="backref_note_001" id="backref_note_001">1</a> I have chosen not to represent all these (often tiny) iterations. However, when a refactoring essentially changes the heuristics of the code, that iteration should be reported as a matter of scholarly completeness. Thus if I were to change the above code to…

In [None]:
def mark_part( string )
  string.gsub( /(t|T)he/, '<part>\1he</part>' ) 
end 

puts mark_part( "The bicyclist rides on the bike" )
puts mark_part( "The train conductor asks for the tickets" )
puts mark_part( "The train conductor asks for the tickets" )

…I am changing the heuristics of its task. Instead of just looking for "the" and marking it, I am now looking for either "the" or "The" and marking those. In the case of this notebook, such a change involves scholarly decisions on how textual material should be interpreted. These scholarly decisions should be represented as part of the reproducibility of scholarly effort and workflow.
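The difference from a purely technical iteration becomes visible when the old and the new heuristic are run on the same line and the results are compared. In the sketch below, `mark_part_v1` and `mark_part_v2` are names introduced only to keep the two versions apart.

```ruby
# First version: marks only the lowercase "the".
def mark_part_v1( string )
  string.gsub( "the", "<part>the</part>" )
end

# Second version: marks "the" as well as "The". The \1 in the replacement
# puts back whichever first letter was matched.
def mark_part_v2( string )
  string.gsub( /(t|T)he/, '<part>\1he</part>' )
end

line = "The bicyclist rides on the bike"
puts mark_part_v1( line )   # prints The bicyclist rides on <part>the</part> bike
puts mark_part_v2( line )   # prints <part>The</part> bicyclist rides on <part>the</part> bike
puts mark_part_v1( line ) == mark_part_v2( line )   # prints false
```

Because the two versions disagree on the same input, this iteration changes the result of the heuristic and therefore belongs in the record.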

### A scholarly iteration: improving the English model
When we ran the code we found that it marks a number of lines as English that really are not, and that it qualifies certain lines as Middle Dutch that are not. Clearly some refactoring is needed to remedy this. Before we can do so, however, we need to recreate the models and code that we already had. That is the reason for the next, somewhat hermetic, lines of code. They figure out where we are on the file system, then load the models and parser of the prior chapter.

In [1]:
require File.join(File.dirname(__FILE__), '../lib/ocr_parse_models')
require File.join(File.dirname(__FILE__), '../lib/ocr_parser')

true

There are a number of words in the [English stop words list](./resources/stopwords_en.txt) that are homonymous with Middle Dutch vocabulary. If we count these fully towards English, we count too many terms as English. At the same time we do not want to count too many truly English terms as Middle Dutch. Let's give these terms a weight of 0.4 to see if things improve.

In [2]:
class EnglishSecondIteration < Model

  # Whenever this model is created/used it loads a number of variables,
  # e.g. the list of English stop words (@stopwords), which is read from a file.
  # There are English stopwords that are homonyms of Middle Dutch words. We do not
  # want to count these fully as English, therefore we subtract the set of these
  # words (@may_be_middle_dutch) from the set of English stopwords; the score
  # method below counts them at a reduced weight instead.
  def initialize
    @may_be_middle_dutch = [ "an", "as", "been", "by", "have", "he", "her", "here", "i", "in", "is", "me", "mine", "no", "so", "over", "was", "we" ]
    @stopwords = File.read( './resources/stopwords_en.txt' ).split( "\n" ) - @may_be_middle_dutch
    @threshold = 0.2
  end

  # Sets threshold, 0.2 (20%) by default.
  def threshold=( new_threshold )
    @threshold = new_threshold
  end

  # Some words look like "been." or "her?", we strip the punctuation to make sure we 
  # don't miss any English words while matching them ("been." for a computer is 
  # obviously not the same as "been").
  def strip_embracing_punctuation( token )
    return token.gsub(/[\.:;'“‘’”?!\(\),]+$|^[\.:;'“‘’”?!\(\),]+/, '')
  end

  # This computes the 'English score' for a line.
  # The line is first split into its individual tokens.
  # Then we count all English stopwords with a weight of 1.
  # Words that *might* be English but *could* also be Middle Dutch
  # are counted as well, but with a weight of only 0.4.
  # Finally we compute the relative score, that is: the weighted count of
  # English words divided by the total number of tokens on the line.
  def score( string )
    score = 0.0
    tokens = string.split( /\s+/ )
    tokens.each do |token|
      stripped = strip_embracing_punctuation( token )
      score += 1.0 if @stopwords.include?( stripped.downcase )
      score += 0.4 if @may_be_middle_dutch.include?( stripped.downcase )
    end
    score/tokens.size()
  end

  # The standard match function that all models must provide.
  # We say a line is English if the score computed above is larger
  # than the threshold of 0.2. (Thus if 20% of the tokens could be English.)
  def matches( line )
    score( line ) > @threshold
  end

end

:matches
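To see what the 0.4 weight does in practice, the scoring can be traced on a single line without loading the parser. The sketch below is a stand-alone approximation: `weighted_score`, `demo_stopwords`, and `demo_ambiguous_words` are hypothetical names, and the two-word stop word list is an assumed stand-in for `stopwords_en.txt`, whose actual contents are not reproduced here.

```ruby
# Assumed stand-in for the unambiguous English stop words; the real list is
# read from stopwords_en.txt.
def demo_stopwords
  [ "the", "of" ]
end

# A few of the words that may be English or Middle Dutch (taken from the
# @may_be_middle_dutch list above).
def demo_ambiguous_words
  [ "so", "was", "me", "by" ]
end

# Hypothetical helper mirroring the score method: unambiguous stop words
# count 1.0, ambiguous words count `ambiguous_weight`, and the sum is
# divided by the number of tokens on the line.
def weighted_score( line, ambiguous_weight )
  tokens = line.split( /\s+/ )
  total = 0.0
  tokens.each do |token|
    word = token.gsub( /[\.:;'“‘’”?!\(\),]+$|^[\.:;'“‘’”?!\(\),]+/, '' ).downcase
    total += 1.0 if demo_stopwords.include?( word )
    total += ambiguous_weight if demo_ambiguous_words.include?( word )
  end
  total / tokens.size
end

line = "was so extremely annoyed"
puts weighted_score( line, 1.0 )   # prints 0.5: "was" and "so" counted fully
puts weighted_score( line, 0.4 )   # prints 0.2: "was" and "so" counted at 0.4
```

With the default threshold of 0.2 and the strict comparison `score( line ) > @threshold`, a score of exactly 0.2 does not count as English. Assuming that neither "extremely" nor "annoyed" is on the real stop word list, a line like this one would be classed as English under full weighting but not under the reduced weight.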

In [3]:
text = OCRParser.new
text.load_text( './resources/Bouwman_ Of Reynaert the Fox.txt' )
text.models = [ Empty.new, Numbers.new, FootNote.new, AllCaps.new, EnglishSecondIteration.new ]
parsed = text.parse()
puts parsed.join( "\n" )

Willem die Madocke maecte, [192va,22]
daer hi dicken omme waecte,
hem vernoyde so haerde
dat die avonture van Reynaerde
in Dietsche onghemaket bleven
— die Arnout niet hevet vulscreven —
dat hi die vijte dede soucken
ende hise na den Walschen boucken
in Dietsche dus hevet begonnen.
God moete ons ziere hulpen jonnen!
Nu keert hem daertoe mijn zin
dat ic bidde in dit beghin
beede den dorpren enten doren,
ofte si commen daer si horen
dese rijme ende dese woort
(die hem onnutte sijn ghehoort),
dat sise laten onbescaven.
Te vele slachten si den raven,
die emmer es al even malsch.
Si maken sulke rijme valsch,
daer si niet meer of ne weten [192vb]
dan ic doe hoe dat si heeten
die nu in Babilonien leven.
Daden si wel, si soudens begheven.
Dat en segghic niet dor minen wille.
Mijns dichtens ware een ghestille,
ne hads mi eene niet ghebeden
die in groeter hovesscheden
Prologue
was so extremely annoyed
remained unwritten in Dutch
gherne keert hare saken.
Soe bat mi dat ic soude maken
dese avontue

Alright, that seems to yield more Middle Dutch lines in any case. 

  > Willem die Madocke maecte, [192va,22]<br/>
    daer hi dicken omme waecte,<br/>
    hem vernoyde so haerde<br/>
    dat die avonture van Reynaerde<br/>
    in Dietsche onghemaket bleven<br/>
    — die Arnout niet hevet vulscreven —<br/>
    dat hi die vijte dede soucken<br/>
    ende hise na den Walschen boucken<br/>
    in Dietsche dus hevet begonnen.<br/>

But unfortunately also more English lines, as in:

  > die in groeter hovesscheden<br/>
    Prologue<br/>
    was so extremely annoyed<br/>
    remained unwritten in Dutch<br/>
    gherne keert hare saken.<br/>

or:

  > Mi hevet Reynaert, dat felle dier,<br/>
    inﬂicted upon me by Reynaert,<br/>
    so vele te leede ghedaen,<br/>

Intuitively this seems logical: if we allow fewer words to be identified decisively as English, then fewer lines will end up being identified as such. But how can we further improve our selection? The last example gives a hint: the English line is part of a longer section of translated Middle Dutch. In the midst of all those English lines, the matching algorithm suddenly decides that one line is not English. We can make use of the fact that this is unlikely. We will rewrite the English model such that if a line is not identified as English but both the previous and next lines *are*, then we identify the line itself as English too.

In [4]:
class EnglishThirdIteration < Model

  # Whenever this model is created/used it loads a number of variables,
  # e.g. the list of English stop words (@stopwords), which is read from a file.
  # There are English stopwords that are homonyms of Middle Dutch words. We do not
  # want to count these fully as English, therefore we subtract the set of these
  # words (@may_be_middle_dutch) from the set of English stopwords; the score
  # method below counts them at a reduced weight instead.
  def initialize
    @may_be_middle_dutch = [ "an", "as", "been", "by", "have", "he", "her", "here", "i", "in", "is", "me", "mine", "no", "so", "over", "was", "we" ]
    @stopwords = File.read( './resources/stopwords_en.txt' ).split( "\n" ) - @may_be_middle_dutch
    @threshold = 0.2
  end

  # Sets threshold, 0.2 (20%) by default.
  def threshold=( new_threshold )
    @threshold = new_threshold
  end

  # Some words look like "been." or "her?", we strip the punctuation to make sure we 
  # don't miss any English words while matching them ("been." for a computer is 
  # obviously not the same as "been").
  def strip_embracing_punctuation( token )
    return token.gsub(/[\.:;'“‘’”?!\(\),]+$|^[\.:;'“‘’”?!\(\),]+/, '')
  end

  # This computes the 'English score' for a line.
  # The line is first split into its individual tokens.
  # Then we count all English stopwords with a weight of 1.
  # Words that *might* be English but *could* also be Middle Dutch
  # are counted as well, but with a weight of only 0.4.
  # Finally we compute the relative score, that is: the weighted count of
  # English words divided by the total number of tokens on the line.
  def score( string )
    score = 0.0
    tokens = string.split( /\s+/ )
    tokens.each do |token|
      stripped = strip_embracing_punctuation( token )
      score += 1.0 if @stopwords.include?( stripped.downcase )
      score += 0.4 if @may_be_middle_dutch.include?( stripped.downcase )
    end
    score/tokens.size()
  end

  # The standard match function that all models must provide.
  # We say a line is English if the score computed above is larger
  # than the threshold of 0.2. (Thus if 20% of the tokens could be English.)
  def matches( line )
    score( line ) > @threshold
  end

  # The visit method for the English model is a little more elaborate than
  # the one for the other models. It happens that some lines are not
  # recognized as English whereas they are, so we need additional power
  # of recognition. That's why this visit method asks for the context of
  # the line it is currently trying to match. If both the line before
  # and the line after are English, then it is assumed that the line in
  # between is also English.
  def visit( text )
    matches=[]
    text.each_line_with_context do | line, prev, succ |
      if matches( line )
        matches.push( self.class )
      else
        empty_model = Empty.new
        if !empty_model.matches( line )
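          # Post correction: if in between two English matches, it probably should be matched too.
          # (Block variables are named context_line so they do not shadow the outer `line`.)
          prev.reject! { |context_line| empty_model.matches( context_line ) }
          succ.reject! { |context_line| empty_model.matches( context_line ) }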
          previous_matches = matches( prev[0] ) if prev.size > 0
          next_matches = matches( succ[0] ) if succ.size > 0
          if previous_matches && next_matches
            matches.push( self.class )
          else
            matches.push( nil )
          end
        else
          matches.push( nil )
        end
      end
    end
    matches
  end

end

:visit
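The context rule in `visit` can be isolated from the parser and tried on a handful of lines. The sketch below is a simplified illustration, not the notebook's implementation: it works on a plain array instead of `each_line_with_context`, takes the English test as a block, and assumes that the relevant context is the nearest non-empty line on either side. The name `fill_english_gaps` and the sample lines are invented for this illustration.

```ruby
# Hypothetical helper: returns one boolean per line, true when the line is
# taken to be English. A non-empty line that fails the test itself is still
# flagged when its nearest non-empty neighbours on both sides pass the test.
def fill_english_gaps( lines )
  direct = lines.map { |line| yield( line ) }
  blank = lines.map { |line| line.strip.empty? }

  lines.each_index.map do |i|
    next true if direct[i]
    next false if blank[i]
    # Nearest non-empty neighbour before and after position i, if any.
    before = ( i - 1 ).downto( 0 ).find { |j| !blank[j] }
    after = ( i + 1 ).upto( lines.size - 1 ).find { |j| !blank[j] }
    !before.nil? && !after.nil? && direct[before] && direct[after]
  end
end

# Invented sample lines; the block is a toy stand-in for the real score test.
sample = [
  "the fox of the wood",
  "inflicted upon me by Reynaert,",
  "",
  "of the king",
  "so vele te leede ghedaen,"
]
flags = fill_english_gaps( sample ) { |line| line.split.include?( "the" ) }
puts flags.inspect   # prints [true, true, false, true, false]
```

The second line fails the toy test but sits between two lines that pass it, so it is flagged; the last line has no following neighbour and is left alone. As in `visit`, neighbours are judged by their own direct match, not by a corrected result, so one correction cannot trigger another.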

In [5]:
text = OCRParser.new
text.load_text( './resources/Bouwman_ Of Reynaert the Fox.txt' )
text.models = [ Empty.new, Numbers.new, FootNote.new, AllCaps.new, EnglishThirdIteration.new ]
parsed = text.parse()
puts parsed.join( "\n" )

Willem die Madocke maecte, [192va,22]
daer hi dicken omme waecte,
hem vernoyde so haerde
dat die avonture van Reynaerde
in Dietsche onghemaket bleven
— die Arnout niet hevet vulscreven —
dat hi die vijte dede soucken
ende hise na den Walschen boucken
in Dietsche dus hevet begonnen.
God moete ons ziere hulpen jonnen!
Nu keert hem daertoe mijn zin
dat ic bidde in dit beghin
beede den dorpren enten doren,
ofte si commen daer si horen
dese rijme ende dese woort
(die hem onnutte sijn ghehoort),
dat sise laten onbescaven.
Te vele slachten si den raven,
die emmer es al even malsch.
Si maken sulke rijme valsch,
daer si niet meer of ne weten [192vb]
dan ic doe hoe dat si heeten
die nu in Babilonien leven.
Daden si wel, si soudens begheven.
Dat en segghic niet dor minen wille.
Mijns dichtens ware een ghestille,
ne hads mi eene niet ghebeden
die in groeter hovesscheden
Prologue
gherne keert hare saken.
Soe bat mi dat ic soude maken
dese avontuere van Reynaerde.
Al begripic die grongaerde
ende die

Excellent: apart from two rogue English relics, that looks like a clean text. We can add some hints to our filtering mechanism to make sure these relics are filtered out as well. These hints are equivalent to a scholarly editor deciding 'no, that is not supposed to be part of the witness text'.

In [6]:
class EnglishFourthIteration < Model

  # Whenever this model is created/used it loads a number of variables,
  # e.g. the list of English stop words (@stopwords), which is read from a file.
  # There are English stopwords that are homonyms of Middle Dutch words. We do not
  # want to count these fully as English, therefore we subtract the set of these
  # words (@may_be_middle_dutch) from the set of English stopwords.
  # Hints may be provided to recognize particular cases, that is: if we know certain
  # words definitely indicate an English line we can add these to the set of hints.
  def initialize
    @may_be_middle_dutch = [ "an", "as", "been", "by", "have", "he", "her", "here", "i", "in", "is", "me", "mine", "no", "so", "over", "was", "we" ]
    @hints = [ "prologue", "ofone" ]
    @stopwords = File.read( './resources/stopwords_en.txt' ).split( "\n" ) - @may_be_middle_dutch + @hints
    @threshold = 0.2
  end

  # Sets threshold, 0.2 (20%) by default.
  def threshold=( new_threshold )
    @threshold = new_threshold
  end

  # Some words look like "been." or "her?", we strip the punctuation to make sure we 
  # don't miss any English words while matching them ("been." for a computer is 
  # obviously not the same as "been").
  def strip_embracing_punctuation( token )
    return token.gsub(/[\.:;'“‘’”?!\(\),]+$|^[\.:;'“‘’”?!\(\),]+/, '')
  end

  # This computes the 'English score' for a line.
  # The line is first split into its individual tokens.
  # Then we count all English stopwords with a weight of 1.
  # Words that *might* be English but *could* also be Middle Dutch
  # are counted as well, but with a weight of only 0.4.
  # Finally we compute the relative score, that is: the weighted count of
  # English words divided by the total number of tokens on the line.
  def score( string )
    score = 0.0
    tokens = string.split( /\s+/ )
    tokens.each do |token|
      stripped = strip_embracing_punctuation( token )
      score += 1.0 if @stopwords.include?( stripped.downcase )
      score += 0.4 if @may_be_middle_dutch.include?( stripped.downcase )
    end
    score/tokens.size()
  end

  # The standard match function that all models must provide.
  # We say a line is English if the score computed above is larger
  # than the threshold of 0.2. (Thus if 20% of the tokens could be English.)
  def matches( line )
    score( line ) > @threshold
  end

  # The visit method for the English model is a little more elaborate than
  # the one for the other models. It happens that some lines are not
  # recognized as English whereas they are, so we need additional power
  # of recognition. That's why this visit method asks for the context of
  # the line it is currently trying to match. If both the line before
  # and the line after are English, then it is assumed that the line in
  # between is also English.
  def visit( text )
    matches=[]
    text.each_line_with_context do | line, prev, succ |
      if matches( line )
        matches.push( self.class )
      else
        empty_model = Empty.new
        if !empty_model.matches( line )
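          # Post correction: if in between two English matches, it probably should be matched too.
          # (Block variables are named context_line so they do not shadow the outer `line`.)
          prev.reject! { |context_line| empty_model.matches( context_line ) }
          succ.reject! { |context_line| empty_model.matches( context_line ) }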
          previous_matches = matches( prev[0] ) if prev.size > 0
          next_matches = matches( succ[0] ) if succ.size > 0
          if previous_matches && next_matches
            matches.push( self.class )
          else
            matches.push( nil )
          end
        else
          matches.push( nil )
        end
      end
    end
    matches
  end

end

:visit
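How a hint takes effect can be traced on the stray heading `Prologue`. The sketch below uses an assumed three-word stand-in for the stop word file and two hypothetical helpers, `build_word_list` and `stopword_share`; only the way the word list is put together is taken from `initialize` above.

```ruby
# Builds the word list the way initialize does. Ruby evaluates `-` and `+`
# on arrays from left to right, so the ambiguous words are removed first
# and the hints are appended afterwards.
def build_word_list( stopwords, may_be_middle_dutch, hints )
  stopwords - may_be_middle_dutch + hints
end

# Hypothetical helper: share of the tokens on a line that occur in the list.
def stopword_share( line, word_list )
  tokens = line.split( /\s+/ )
  hits = tokens.count { |token| word_list.include?( token.downcase ) }
  hits.to_f / tokens.size
end

# Assumed stand-in for the contents of stopwords_en.txt.
demo_stopwords = [ "the", "of", "was" ]

without_hints = build_word_list( demo_stopwords, [ "was" ], [] )
with_hints = build_word_list( demo_stopwords, [ "was" ], [ "prologue", "ofone" ] )

puts stopword_share( "Prologue", without_hints )   # prints 0.0
puts stopword_share( "Prologue", with_hints )      # prints 1.0
```

A one-token line that matches a hint scores 1.0, well above the 0.2 threshold, so it is filtered out regardless of what its neighbours look like.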

In [23]:
text = OCRParser.new
text.load_text( './resources/Bouwman_ Of Reynaert the Fox.txt' )
text.models = [ Empty.new, Numbers.new, FootNote.new, AllCaps.new, EnglishFourthIteration.new ]
parsed = text.parse()
puts parsed.join( "\n" )

Willem die Madocke maecte, [192va,22]
daer hi dicken omme waecte,
hem vernoyde so haerde
dat die avonture van Reynaerde
in Dietsche onghemaket bleven
— die Arnout niet hevet vulscreven —
dat hi die vijte dede soucken
ende hise na den Walschen boucken
in Dietsche dus hevet begonnen.
God moete ons ziere hulpen jonnen!
Nu keert hem daertoe mijn zin
dat ic bidde in dit beghin
beede den dorpren enten doren,
ofte si commen daer si horen
dese rijme ende dese woort
(die hem onnutte sijn ghehoort),
dat sise laten onbescaven.
Te vele slachten si den raven,
die emmer es al even malsch.
Si maken sulke rijme valsch,
daer si niet meer of ne weten [192vb]
dan ic doe hoe dat si heeten
die nu in Babilonien leven.
Daden si wel, si soudens begheven.
Dat en segghic niet dor minen wille.
Mijns dichtens ware een ghestille,
ne hads mi eene niet ghebeden
die in groeter hovesscheden
gherne keert hare saken.
Soe bat mi dat ic soude maken
dese avontuere van Reynaerde.
Al begripic die grongaerde
ende die dorpren 

### Notes
<small>

<a href="#backref_note_001" name="note_001" id="note_001">1</a>) 'Performance' is an ambiguous term in this context, as programmers also use it to indicate the *speed* at which a program executes, and code is often rewritten to improve that speed. However, unless otherwise indicated, I use the term 'performance' to refer to what the code does, what it shows, its output, and the tasks it conducts.

</small>