## Chapter 6 — Combining and Mining

By now I have modeled several object classes (Text, Verse, Word, Person, Annotation) whose relations to each other are also expressed in their makeup. For example, a Text points to its first Verse, and a Word can have a Person as its denotation. What is left to do is to integrate these pieces into a working whole. Having little other inspiration as of yet, I opt to first combine the code so that it produces the data needed for a visualization. Let's visualize that we can annotate the first word of the text. First of all, we need to require the models and other components again.

In [1]:
require File.join(File.dirname(__FILE__), '../lib/ocr_parser')
require File.join(File.dirname(__FILE__), '../lib/oo_models')

true

The AnnotationGenerator of the last chapter needs to be integrated with the Word class to be able to annotate an individual word. For now, as you can gauge from the code comments, I will do so in a bluntly naive way. I just tell any word that if its surface is the word 'willem', it should go and find the sentences in the introduction of our first source ([see chapter 5](/notebooks/05%20%20A%20Cherry%20Picked.ipynb)) and use them as annotations. This is naive in at least three ways. First, it will potentially yield a lot of false positives if we broaden the search to more sources, which will yield sentences by or about different Willems. Second, the fact that one finds a 'Willem' in a sentence does not directly imply that the sentence is *about* that Willem, although it is likely that the sentence is at least related to some 'Willem'. Third, as far as object modeling goes, this is probably not the right place for the AnnotationGenerator to do its work. It is probably not the task of a word to know or mine annotations on its denotation (it would be more appropriate to task it with finding lexical or syntactic information, I suppose). Knowing about 'Willem die Madocke maecte' seems to me a task that should be bestowed upon a Person object that represents this actual 'Willem'. However, for the purpose of this chapter and notebook, this will do for the moment.

In [2]:
class Word

  attr_accessor :surface
  attr_accessor :next_word
  attr_accessor :denotation

  def initialize( str )
    m = str.match( / / )
    if m != nil
      # Everything before the first space is this word's surface,
      # the rest of the string is parsed recursively as the next word.
      index = m.end(0)
      @surface = str[ 0..index-2 ]
      @next_word = Word.new( str[ index..-1 ] )
    else
      @surface = str
    end
    # This is utterly simplistic, yet I have no better
    # idea at the moment…
    # Probably this should be done by models such as Person.
    # E.g. Person would determine if this Word is that Person.
    # That would also allow for competing interpretations btw.
    if surface.downcase == "willem"
      @denotation = Person.new( "Willem" )
      @denotation.annotations = AnnotationGenerator.get_annotations_for( "willem" )
    end
  end

  def as_text
    print @surface
    if @next_word != nil
      print " "
      @next_word.as_text
    end
  end

  def as_json( pos=0 )
    json = "{ \"src\": \"Bouwman\", \"wrd\": \"#{@surface}\", \"x\": 0, \"y\": #{pos}"
    if @denotation != nil
      json << ", \"denotation\": " << @denotation.as_json
    end
    json << " }"
    pos += 1
    if @next_word != nil
      json << ", "
      sub_json = @next_word.as_json( pos )
      json << sub_json[0]
      pos = sub_json[1]
    end
    [json,pos]
  end

end

:as_json
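
As an aside, the third naive aspect mentioned above (and the code comment about letting Person decide) could be approached roughly as follows. This is only a minimal sketch under assumptions: `CandidatePerson`, `PlainWord`, `denoted_by?` and `resolve` are hypothetical names that do not exist in `lib/oo_models`, and the list of surface forms is made up for illustration. The point it tries to show is that each person decides for itself whether a surface denotes it, so a word can end up with several competing interpretations.

```ruby
# Hypothetical sketch: none of these classes are part of the notebook's lib/ code.
class CandidatePerson
  attr_reader :name, :surface_forms

  def initialize(name, surface_forms)
    @name = name
    # Store the known spellings in lower case so matching is case-insensitive.
    @surface_forms = surface_forms.map(&:downcase)
  end

  # The person, not the word, decides whether a surface form denotes it.
  def denoted_by?(surface)
    @surface_forms.include?(surface.downcase)
  end
end

class PlainWord
  attr_reader :surface, :denotations

  def initialize(surface)
    @surface = surface
    @denotations = [] # more than one entry means competing interpretations
  end

  # Ask every candidate whether it recognizes itself in this word.
  def resolve(candidates)
    @denotations = candidates.select { |person| person.denoted_by?(@surface) }
    self
  end
end

# Made-up candidates, purely for illustration.
candidates = [
  CandidatePerson.new("Willem (author of Reynaert)", ["Willem"]),
  CandidatePerson.new("Willem (some other Willem)", ["Willem", "Willelmus"]),
  CandidatePerson.new("Arnout", ["Arnout"])
]

word = PlainWord.new("Willem").resolve(candidates)
puts word.denotations.map(&:name).inspect
```

Here the word 'Willem' ends up with two candidate denotations, which makes the false-positive problem visible instead of silently picking one Willem. Deciding between the candidates would still need context that this sketch does not model.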

Having adapted the Word class so that it can become 'knowledgeable' about subject matter, we now need to kick the complete mechanism into life again. As you will notice, the next few lines of code simply repeat those we used at the end of [chapter 3](/notebooks/03%20%20On%20Iterations.ipynb), which yielded a fair representation of the Middle Dutch text. So in effect we are just making sure that this scholarly activity is performed again.

In [3]:
file_path = File.join( File.dirname(__FILE__), "./resources/Bouwman_ Of Reynaert the Fox.txt" )
text_parser = OCRParser.new
text_parser.load_text( file_path )
text_parser.models = [ Empty.new, Numbers.new, FootNote.new, AllCaps.new, English.new ]
parsed_text = text_parser.parse()

["Willem die Madocke maecte, [192va,22]", "daer hi dicken omme waecte,", "hem vernoyde so haerde", "dat die avonture van Reynaerde", "in Dietsche onghemaket bleven", "— die Arnout niet hevet vulscreven —", "dat hi die vijte dede soucken", "ende hise na den Walschen boucken", "in Dietsche dus hevet begonnen.", "God moete ons ziere hulpen jonnen!", "Nu keert hem daertoe mijn zin", "dat ic bidde in dit beghin", "beede den dorpren enten doren,", "ofte si commen daer si horen", "dese rijme ende dese woort", "(die hem onnutte sijn ghehoort),", "dat sise laten onbescaven.", "Te vele slachten si den raven,", "die emmer es al even malsch.", "Si maken sulke rijme valsch,", "daer si niet meer of ne weten [192vb]", "dan ic doe hoe dat si heeten", "die nu in Babilonien leven.", "Daden si wel, si soudens begheven.", "Dat en segghic niet dor minen wille.", "Mijns dichtens ware een ghestille,", "ne hads mi eene niet ghebeden", "die in groeter hovesscheden", "gherne keert hare saken.", "Soe bat mi dat 

Now the only task left is to let our object models do their jobs by handing the text to the Text model, which is in a way at the top of the 'food chain'. Here we let the performance result in a JSON representation of text and annotations. In the [next chapter](/notebooks/07%20%20The%20Distracting%20Interface.ipynb) we will use this result for a visualization.

In [4]:
text = Text.new( parsed_text.join( "\n" ) )
puts text.as_json

{ "nodes": [ { "src": "Bouwman", "wrd": "Willem", "x": 0, "y": 0, "denotation": { "class": "Person", "name": "Willem", "annotations": [ { "annotation": "The author of this Middle Dutch beast epic, ‘Willem’, was familiar with at least part of the Old French corpus of texts and used it in the course of his composition.", "source": "Bouwman, A. & Besamusca, B., 2009. Of Reynaert the Fox: Text and Facing Translation of the Middle Dutch Beast Epic Van den vos Reynaerde, Amsterdam: Amsterdam University Press. Available at: http://www.oapen.org/search?identifier=340003 [Accessed November 20, 2015]." }, { "annotation": "It may well be that Willem did not know this fable in its original form.", "source": "Bouwman, A. & Besamusca, B., 2009. Of Reynaert the Fox: Text and Facing Translation of the Middle Dutch Beast Epic Van den vos Reynaerde, Amsterdam: Amsterdam University Press. Available at: http://www.oapen.org/search?identifier=340003 [Accessed November 20, 2015]." }, { "annotation": "It is 
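
One caveat about the output above: `Word#as_json` assembles its JSON by string interpolation, so a surface that itself contains a double quote or a backslash would yield invalid JSON. A possible alternative, sketched here under assumptions, is to build plain hashes and let Ruby's standard `json` library do the escaping. The helper `nodes_for` is a hypothetical name that is not part of the notebook's code, and it only reproduces the `src`, `wrd`, `x` and `y` fields seen in the output, not the denotations.

```ruby
require 'json'

# Hypothetical helper: builds the "nodes" structure from a list of surfaces.
# The field names mirror the output shown above; everything else is assumed.
def nodes_for(surfaces, source = "Bouwman")
  nodes = surfaces.each_with_index.map do |surface, pos|
    { "src" => source, "wrd" => surface, "x" => 0, "y" => pos }
  end
  { "nodes" => nodes }
end

# The third surface contains double quotes, which the library escapes for us.
json = JSON.generate(nodes_for(["Willem", "die", "\"Madocke\""]))
puts json
```

Building the structure first and serializing it in one place also removes the need to pass the position counter back and forth through the recursive calls.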

[Proceed to the next chapter](/notebooks/07%20%20Adding%20a%20First%20Visualization.ipynb)