# Colibri Core Python Tutorial: Efficiently working with n-grams, skipgrams and flexgrams

*by Maarten van Gompel, Radboud University Nijmegen*

This tutorial shows you how to work with the Python API of Colibri Core, a tool for Natural Language Processing. It is assumed that you have already read the Colibri Core documentation, followed the installation instructions, and are familiar with its purpose and concepts. The documentation also provides an API reference for all the Python classes and methods. This tutorial is in the form of a Python Notebook, allowing you to participate interactively. Press ``shift+enter`` in a code field to evaluate it.

Colibri Core is written in C++ and the Python binding is written in Cython. This offers the advantage of native speed and memory efficiency, combined with the ease of a high-level pythonic interface. We will be using Python 3 here, but Colibri Core can also work with Python 2.7.

We obviously start our adventure with an import of colibricore, so make sure you installed it properly:

In [3]:
import colibricore

TMPDIR = "/tmp/" #this is where we'll store intermediate files


## Class encoding/decoding

To give us something to work with, we will take an excerpt of Shakespeare's Hamlet as our corpus text:

In [4]:
corpustext = """To be, or not to be, that is the question
Whether 'tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune,
Or to take Arms against a Sea of troubles,
And by opposing end them? To die, to sleep
No more; and by a sleep, to say we end
The Heart-ache, and the thousand Natural shocks
That Flesh is heir to? 'Tis a consummation
Devoutly to be wished. To die, to sleep,
To sleep, perchance to Dream; Aye, there's the rub,
For in that sleep of death, what dreams may come,
When we have shuffled off this mortal coil,
Must give us pause. There's the respect
That makes Calamity of so long life:
For who would bear the Whips and Scorns of time,
Th' Oppressor's wrong, the proud man's Contumely,
The pangs of despised Love, the Law’s delay,
The insolence of Office, and the Spurns
That patient merit of the unworthy takes,
When he himself might his Quietus make
With a bare Bodkin? Who would these Fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscovered Country, from whose bourn
No Traveler returns, Puzzles the will,
And makes us rather bear those ills we have,
Than fly to others that we know not of.
Thus Conscience does make Cowards of us all,
And thus the Native hue of Resolution
Is sicklied o'er, with the pale cast of Thought,
And enterprises of great pitch and moment,
With this regard their Currents turn awry,
And lose the name of Action. Soft you now,
The fair Ophelia. Nymph, in all thy Orisons
Be all my sins remembered"""

#first we do some very rudimentary tokenisation
# Yes, I realise this is a very stupid way ;)
corpustext = corpustext.replace(',',' ,')
corpustext = corpustext.replace('.',' .')
corpustext = corpustext.replace(':',' :')


corpusfile_plaintext = TMPDIR + "hamlet.txt"

with open(corpusfile_plaintext,'w',encoding='utf-8') as f:
    f.write(corpustext)


To work with this data in Colibri Core, we need to *class encode* it, assigning an integer value to each word type. Using Python, a class encoder is built as follows:


In [5]:
classfile = TMPDIR + "hamlet.colibri.cls"

#Instantiate class encoder
classencoder = colibricore.ClassEncoder()

#Build classes
classencoder.build(corpusfile_plaintext)

#Save class file
classencoder.save(classfile)

print("Encoded ", len(classencoder), " classes, well done!")

Encoded  184  classes, well done!
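
To make the idea of class encoding concrete, here is a minimal pure-Python sketch. It is an illustration only, not how Colibri Core implements it: the helper names `build_classes`, `encode` and `decode` are hypothetical, and the choice to give more frequent word types lower integers is an assumption made for this example.

```python
from collections import Counter

def build_classes(text):
    """Map each word type to an integer; frequent types get lower numbers (assumption)."""
    counts = Counter(text.split())
    # Sort by descending frequency, then alphabetically so the result is deterministic
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: cls for cls, word in enumerate(ordered, start=1)}

def encode(text, classes):
    """Replace every token by the integer of its word type."""
    return [classes[word] for word in text.split()]

def decode(ids, classes):
    """Turn a list of integers back into a space-separated string."""
    inverse = {cls: word for word, cls in classes.items()}
    return " ".join(inverse[i] for i in ids)

sample = "To be , or not to be , that is the question"
classes = build_classes(sample)
encoded = encode(sample, classes)
print(encoded)
print(decode(encoded, classes))
```

Note that ``To`` and ``to`` are distinct word types here, just as they are in the tutorial's corpus.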


Now that we have a class encoder, we can encode our corpus, producing a new encoded file (which tends to be about 50% smaller than the original):

In [6]:
corpusfile = TMPDIR + "hamlet.colibri.dat" #this will be the encoded corpus file
classencoder.encodefile(corpusfile_plaintext, corpusfile)

To check whether that worked as planned, we will construct a Class Decoder, load our class file, and decode the corpus:

In [7]:
#Load class decoder from the classfile we just made
classdecoder = colibricore.ClassDecoder(classfile)

#Decode corpus data
decoded = classdecoder.decodefile(corpusfile)

#Show
print(decoded)

To be , or not to be , that is the question
Whether 'tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune ,
Or to take Arms against a Sea of troubles ,
And by opposing end them? To die , to sleep
No more; and by a sleep , to say we end
The Heart-ache , and the thousand Natural shocks
That Flesh is heir to? 'Tis a consummation
Devoutly to be wished . To die , to sleep ,
To sleep , perchance to Dream; Aye , there's the rub ,
For in that sleep of death , what dreams may come ,
When we have shuffled off this mortal coil ,
Must give us pause . There's the respect
That makes Calamity of so long life :
For who would bear the Whips and Scorns of time ,
Th' Oppressor's wrong , the proud man's Contumely ,
The pangs of despised Love , the Law’s delay ,
The insolence of Office , and the Spurns
That patient merit of the unworthy takes ,
When he himself might his Quietus make
With a bare Bodkin? Who would these Fardels bear ,
To grunt and sweat under a weary life ,
But that t

## Playing with patterns

Now that we have a class encoder and decoder, we can toy around with the most basic units in Colibri Core: **patterns**. These are used for n-grams, skipgrams, flexgrams and any kind of text. You would basically use an instance of ``Pattern`` where you'd normally use a string, as Patterns are much smaller in memory. Let's build a pattern from a string using the class encoder. Note that we will only be able to use words that are known to the class encoder:

In [8]:
#Build a pattern from a string, using the class encoder
p = classencoder.buildpattern("To be or not to be")

#To print it we need the decoder
print(p.tostring(classdecoder))
print(len(p))

To be or not to be
6


Iterating over a pattern produces all the tokens it is made up of. Note that the concept of characters is gone from patterns! As a consequence, the ability to lowercase or uppercase text is also lost.

In [9]:
#Iterate over the tokens in a pattern, each token will be a Pattern instance

for token in p:
    print(token.tostring(classdecoder))
    

To
be
or
not
to
be


In [10]:
#Extracting subpatterns by offset

#Get first token
print(p[0].tostring(classdecoder))

#Get last token
print(p[-1].tostring(classdecoder))

#Get slice
print(p[2:4].tostring(classdecoder))
    

To
be
or not


Given a pattern, we can now very easily extract all n-grams in it, one of the most common NLP tasks:

In [11]:
#let's get all bigrams
for ngram in p.ngrams(2):
    print(ngram.tostring(classdecoder))

To be
be or
or not
not to
to be


In [12]:
#or all n-grams:
for ngram in p.ngrams():
    print(ngram.tostring(classdecoder))

To
be
or
not
to
be
To be
be or
or not
not to
to be
To be or
be or not
or not to
not to be
To be or not
be or not to
or not to be
To be or not to
be or not to be


In [13]:
#or particular ngrams, such as unigrams up to trigrams:
for ngram in p.ngrams(1,3):
    print(ngram.tostring(classdecoder))

To
be
or
not
to
be
To be
be or
or not
not to
to be
To be or
be or not
or not to
not to be
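
The sliding-window logic behind n-gram extraction can be sketched in plain Python as follows. This is only an illustration on a list of strings; the function name `ngrams` and its arguments are hypothetical here and unrelated to the ``Pattern.ngrams()`` method, which works on the compact encoded representation instead.

```python
def ngrams(tokens, minn, maxn):
    """Yield all n-grams (as tuples) of size minn up to and including maxn, shortest first."""
    for n in range(minn, maxn + 1):
        # A window of size n fits at len(tokens) - n + 1 start positions
        for start in range(len(tokens) - n + 1):
            yield tuple(tokens[start:start + n])

tokens = "To be or not to be".split()

bigrams = list(ngrams(tokens, 2, 2))
print(bigrams)

# Unigrams up to trigrams: 6 + 5 + 4 windows
print(len(list(ngrams(tokens, 1, 3))))
```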


The ``in`` operator can be used to check if a token **OR** ngram is part of a pattern

In [14]:
#token
p2 = classencoder.buildpattern("be")
print(p2 in p)

#ngram
p3 = classencoder.buildpattern("or not")
print(p3 in p)

True
True


The following snippet is here just to prove that our Pattern representation is usually smaller than a string representation, and offers a sneak peek under the hood:

In [15]:
print(bytes(p), len(bytes(p)))
print(b"To be or not to be", len(b"To be or not to be"))
len(bytes(p)) < len(b"To be or not to be")

b'\x0e\x16\x81\x01\x1d\t\x16' 7
b'To be or not to be' 18


True

## Reading a corpus

If we want to read an entire corpus, we can use the ``IndexedCorpus`` class. This we can use, for example, if we are merely interested in moving a sliding window over our data and extracting n-grams without counting or storing them:


In [16]:
corpusdata = colibricore.IndexedCorpus(corpusfile) #encoded data, will be loaded into memory entirely

for sentence in corpusdata.sentences(): #will return a Pattern per sentence (generator)
    for trigram in sentence.ngrams(3):
        print(trigram.tostring(classdecoder))

To be ,
be , or
, or not
or not to
not to be
to be ,
be , that
, that is
that is the
is the question
Whether 'tis Nobler
'tis Nobler in
Nobler in the
in the mind
the mind to
mind to suffer
The Slings and
Slings and Arrows
and Arrows of
Arrows of outrageous
of outrageous Fortune
outrageous Fortune ,
Or to take
to take Arms
take Arms against
Arms against a
against a Sea
a Sea of
Sea of troubles
of troubles ,
And by opposing
by opposing end
opposing end them?
end them? To
them? To die
To die ,
die , to
, to sleep
No more; and
more; and by
and by a
by a sleep
a sleep ,
sleep , to
, to say
to say we
say we end
The Heart-ache ,
Heart-ache , and
, and the
and the thousand
the thousand Natural
thousand Natural shocks
That Flesh is
Flesh is heir
is heir to?
heir to? 'Tis
to? 'Tis a
'Tis a consummation
Devoutly to be
to be wished
be wished .
wished . To
. To die
To die ,
die , to
, to sleep
to sleep ,
To sleep ,
sleep , perchance
, perchance to
perchance to Dream;
to Dream; Aye
Dream; Aye ,
Aye 

Now you may be very tempted to start storing and counting n-grams this way, but **don't**. This method is only suitable for iterating over n-grams and quickly discarding them. Colibri Core has facilities to deal with storing and counting far more efficiently: these are *pattern models*, which we will discuss in the next section.

First some more about ``IndexedCorpus``. We can also obtain any pattern using its index, a ``(sentence,token)`` tuple:

In [17]:
unigram = corpusdata[(2,3)]
print(unigram.tostring(classdecoder))


in


A slice syntax is also supported, but may never cross line/sentence boundaries. As is customary in Python, the last index is non-inclusive.

In [18]:
ngram = corpusdata[(2,3):(2,8)]
print(ngram.tostring(classdecoder))

in the mind to suffer
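
The indexing convention (1-indexed sentences, 0-indexed tokens, half-open slices within one sentence) can be mimicked with a small pure-Python sketch. The `lookup` helper below is hypothetical and only serves to illustrate the convention on the first two lines of the corpus; it says nothing about how ``IndexedCorpus`` stores its data.

```python
lines = [
    "To be , or not to be , that is the question",
    "Whether 'tis Nobler in the mind to suffer",
]
# Sentences are 1-indexed, tokens within a sentence are 0-indexed
sentences = {number: line.split() for number, line in enumerate(lines, start=1)}

def lookup(sentences, begin, end=None):
    """Return the token at begin, or the tokens in the half-open range [begin, end)."""
    sentence, token = begin
    if end is None:
        return sentences[sentence][token]
    if end[0] != sentence:
        raise ValueError("a slice may not cross sentence boundaries")
    return sentences[sentence][token:end[1]]

print(lookup(sentences, (2, 3)))
print(" ".join(lookup(sentences, (2, 3), (2, 8))))
```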


The number of sentences and the length of each sentence can be extracted as follows:

In [19]:
sentencecount = corpusdata.sentencecount()
for i in range(1, sentencecount+1): #note the 1..+1 range, sentences are 1-indexed (whereas tokens are 0-indexed)
    print("Length of sentence " + str(i) + ":", corpusdata.sentencelength(i))

Length of sentence 1: 12
Length of sentence 2: 8
Length of sentence 3: 8
Length of sentence 4: 10
Length of sentence 5: 10
Length of sentence 6: 11
Length of sentence 7: 8
Length of sentence 8: 8
Length of sentence 9: 11
Length of sentence 10: 12
Length of sentence 11: 12
Length of sentence 12: 9
Length of sentence 13: 8
Length of sentence 14: 8
Length of sentence 15: 11
Length of sentence 16: 9
Length of sentence 17: 10
Length of sentence 18: 8
Length of sentence 19: 8
Length of sentence 20: 7
Length of sentence 21: 10
Length of sentence 22: 9
Length of sentence 23: 9
Length of sentence 24: 7
Length of sentence 25: 8
Length of sentence 26: 10
Length of sentence 27: 10
Length of sentence 28: 9
Length of sentence 29: 7
Length of sentence 30: 11
Length of sentence 31: 8
Length of sentence 32: 8
Length of sentence 33: 11
Length of sentence 34: 10


## Pattern Models

Now it's time to build our first pattern model on the Hamlet excerpt. We will extract all patterns occurring at least twice and with maximum length 8.

In [20]:
#Set the options
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8)

#Instantiate an empty unindexed model 
model = colibricore.UnindexedPatternModel()

#Train it on our corpus file (class-encoded data, not plain text)
model.train(corpusfile, options)

print("Found " , len(model), " patterns:")

#Let's see what patterns are in our model (the order will be 'random')
for pattern in model:
    print(pattern.tostring(classdecoder))


Found  54  patterns:
To die , to sleep
die , to sleep
To die , to
, and the
, to sleep
die , to
To die ,
, the
we have
and the
, and
sleep ,
to sleep
die ,
To die
to be
be ,
all
With
make
we
That
not
No
die
And
a
,
be
is
that
would
in
to
and
end
For
To
The
have
death
us
the
When
this
makes
death ,
, to
by
life
sleep
bear
of
.


Rather than just outputting the patterns, we of course now have the counts as well, so let's output them:

In [21]:
#Models behave much like Python dictionaries:
for pattern, count in model.items():
    print(pattern.tostring(classdecoder), count)


To die , to sleep 2
die , to sleep 2
To die , to 2
, and the 2
, to sleep 2
die , to 2
To die , 2
, the 2
we have 2
and the 2
, and 2
sleep , 3
to sleep 2
die , 2
To die 2
to be 2
be , 2
all 2
With 2
make 2
we 4
That 3
not 2
No 2
die 2
And 5
a 5
, 36
be 3
is 2
that 4
would 2
in 3
to 9
and 7
end 2
For 2
To 5
The 6
have 2
death 2
us 3
the 15
When 2
this 2
makes 2
death , 2
, to 3
by 2
life 2
sleep 5
bear 3
of 15
. 5
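
To clarify what the two options mean, the following pure-Python sketch counts all n-grams up to ``maxlength`` and keeps those occurring at least ``mintokens`` times. It is a naive illustration of the semantics only: the function `count_patterns` is hypothetical, and it holds every candidate in memory, which is exactly the approach that does not scale and that pattern models are designed to replace.

```python
from collections import Counter

def count_patterns(sentences, mintokens=2, maxlength=8):
    """Count n-grams per sentence and keep those meeting the occurrence threshold."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, maxlength + 1):
            for start in range(len(tokens) - n + 1):
                counts[tuple(tokens[start:start + n])] += 1
    # Prune everything below the threshold
    return {pattern: c for pattern, c in counts.items() if c >= mintokens}

# Two lines of the Hamlet excerpt, already tokenised
sentences = [
    "And by opposing end them? To die , to sleep".split(),
    "Devoutly to be wished . To die , to sleep ,".split(),
]
toymodel = count_patterns(sentences)
for pattern, count in sorted(toymodel.items(), key=lambda item: -len(item[0]))[:3]:
    print(" ".join(pattern), count)
```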


We can also query specific patterns:

In [22]:

querypattern = classencoder.buildpattern("sleep")

print("How much sleep?")
print(model[querypattern])



How much sleep?
5


In [23]:
#Like dictionaries, unknown patterns will trigger a KeyError
querypattern = classencoder.buildpattern("insolence")

print("How much insolence?")
try:
    print(model[querypattern])
except KeyError:
    print("Nope, KeyError, no such pattern in model..")


How much insolence?
Nope, KeyError, no such pattern in model..


We can check whether a pattern is in a model in the usual pythonic fashion:

In [24]:
if querypattern in model:
    print("Insolence in model!")
else:
    print("No insolence in model!")

No insolence in model!


Rather than the absolute counts, we can get the frequency of a pattern *within its type and class*. For example the frequency of a bigram amongst all bigrams:

In [25]:
querypattern = classencoder.buildpattern("and the")

print(model.frequency(querypattern))

0.07692307692307693
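
This value can be checked by hand against the counts printed earlier. The sketch below assumes that the frequency is the occurrence count of the pattern divided by the summed occurrence counts of all bigrams in the model; that reading is inferred from the numbers rather than taken from the API reference, but it reproduces the value above.

```python
# Occurrence counts of all bigrams in the model, copied from the output above
bigram_counts = {
    ", the": 2, "we have": 2, "and the": 2, ", and": 2, "sleep ,": 3,
    "to sleep": 2, "die ,": 2, "To die": 2, "to be": 2, "be ,": 2,
    "death ,": 2, ", to": 3,
}

total = sum(bigram_counts.values())
frequency = bigram_counts["and the"] / total
print(total, frequency)
```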


To analyse the distribution of occurrences, we can extract a histogram from our model as follows:

In [26]:
for occurrencecount, frequency in model.histogram():
    print(occurrencecount , " occurrences by ", frequency , "patterns")
    

2  occurrences by  34 patterns
3  occurrences by  7 patterns
4  occurrences by  2 patterns
5  occurrences by  5 patterns
6  occurrences by  1 patterns
7  occurrences by  1 patterns
9  occurrences by  1 patterns
15  occurrences by  2 patterns
36  occurrences by  1 patterns
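
Conceptually, such a histogram is just a count of counts. A minimal sketch on a handful of the counts shown earlier (the dictionary below is a hand-picked subset, used purely for illustration):

```python
from collections import Counter

# A small subset of pattern counts from the model output above
counts = {"sleep": 5, "to": 9, "die": 2, "To die": 2, "us": 3, "bear": 3, "a": 5}

# For each occurrence count, how many patterns have exactly that count?
histogram = sorted(Counter(counts.values()).items())
for occurrencecount, npatterns in histogram:
    print(occurrencecount, "occurrences by", npatterns, "patterns")
```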


Once we have a model, we can save it to a file and reload it later. Loading is much faster than training:
    

In [27]:
patternmodelfile = TMPDIR + "hamlet.colibri.patternmodel"

model.write(patternmodelfile)

#and reload just to show we can:
model = colibricore.UnindexedPatternModel(patternmodelfile, options)



Unindexed models are much smaller in memory than indexed models, but their functionality is also limited. Let's take a look at *indexed models*. Indexed models keep a *forward index* to all locations in the original corpus where patterns occur. The references are 2-tuples in the form ``(sentence,token)``, where ``sentence`` is 1-indexed and ``token`` is 0-indexed.

In [28]:
#Set the options
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8)

#Instantiate an empty indexed model 

corpus = colibricore.IndexedCorpus(corpusfile)
model = colibricore.IndexedPatternModel(reverseindex=corpus)

#Train it on our corpus file (class-encoded data, not plain text)
model.train(corpusfile, options)

print("Found " , len(model), " patterns:")

#Let's see what patterns are in our model (the order will be 'random')
for pattern, indices in model.items():
    print(pattern.tostring(classdecoder),end=" ")
    for index in indices:
        print(index,end=" ") #(sentence,token) tuple, sentences start with 1, tokens with 0
    print()
        

Found  54  patterns:
To die , to sleep (5, 5) (9, 5) 
die , to sleep (5, 6) (9, 6) 
To die , to (5, 5) (9, 5) 
, and the (7, 2) (18, 4) 
, to sleep (5, 7) (9, 7) 
die , to (5, 6) (9, 6) 
To die , (5, 5) (9, 5) 
, the (16, 3) (17, 5) 
we have (12, 1) (26, 7) 
and the (7, 3) (18, 5) 
, and (7, 2) (18, 4) 
sleep , (6, 5) (9, 9) (10, 1) 
to sleep (5, 8) (9, 8) 
die , (5, 6) (9, 6) 
To die (5, 5) (9, 5) 
to be (1, 5) (9, 1) 
be , (1, 1) (1, 6) 
all (28, 7) (34, 7) 
With (21, 0) (32, 0) 
make (20, 6) (28, 3) 
we (6, 9) (12, 1) (26, 7) (27, 5) 
That (8, 0) (14, 0) (19, 0) 
not (1, 4) (27, 7) 
No (6, 0) (25, 0) 
die (5, 6) (9, 6) 
And (5, 0) (26, 0) (29, 0) (31, 0) (33, 0) 
a (4, 5) (6, 4) (8, 6) (21, 1) (22, 5) 
, (1, 2) (1, 7) (3, 7) (4, 9) (5, 7) (6, 6) (7, 2) (9, 7) (9, 10) (10, 2) (10, 7) (10, 11) (11, 6) (11, 11) (12, 8) (15, 10) (16, 3) (16, 8) (17, 5) (17, 9) (18, 4) (19, 7) (21, 9) (22, 8) (23, 8) (24, 3) (25, 3) (25, 7) (26, 9) (28, 8) (30, 3) (30, 10) (31, 7) (32, 7) (33, 10) (34, 5

One interesting feature we can get from indexed models, is coverage information. This shows how many of the tokens in the original corpus data are covered by a particular pattern.

In [29]:
querypattern = classencoder.buildpattern("and the")

print(model.coverage(querypattern))

0.012698412698412698


Some numbers on the original corpus data can be obtained from the model:

In [30]:
print("Total amount of tokens in the corpus data:" , model.tokens() )
print("Total amount of word types in the corpus data:" , model.types() )


Total amount of tokens in the corpus data: 315
Total amount of word types in the corpus data: 180
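
With the token total known, the coverage value reported earlier for "and the" can be verified by hand. The sketch assumes that coverage is the number of distinct token positions covered by the pattern divided by the total number of tokens; this is an interpretation that matches the output, not a statement about the internals.

```python
# Forward index of "and the" as printed by the indexed model: (sentence, token)
occurrences = [(7, 3), (18, 5)]
pattern_length = 2   # "and the" spans two tokens
total_tokens = 315   # as reported by model.tokens()

# Every occurrence covers pattern_length consecutive token positions
covered = {(sentence, token + offset)
           for sentence, token in occurrences
           for offset in range(pattern_length)}

coverage = len(covered) / total_tokens
print(len(covered), coverage)
```

Using a set of positions means that overlapping occurrences of a pattern would not be counted twice.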


We instantiated the indexed model with ``reverseindex=corpus``, which gives it a *reverse index* alongside the forward index. This reverse index allows you to look up which patterns *begin* at a particular location in the corpus, expressed as a ``(sentence, token)`` tuple.

In [31]:
print("Patterns at (1,5): ")
for pattern in model.getreverseindex( (1,5) ):
    print(pattern.tostring(classdecoder))

Patterns at (1,5): 
to
to be
to be ,
to be , that
to be , that is
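
The relation between a forward and a reverse index can be illustrated with plain dictionaries. This is a conceptual sketch only: the few index entries are copied from the model output above, and the inversion shown here is a generic technique, not a description of how Colibri Core stores its reverse index.

```python
from collections import defaultdict

# Forward index: pattern -> positions where it begins (subset of the output above)
forward_index = {
    "to be": [(1, 5), (9, 1)],
    "be ,": [(1, 1), (1, 6)],
    "not": [(1, 4), (27, 7)],
    "To die": [(5, 5), (9, 5)],
    "To die ,": [(5, 5), (9, 5)],
}

# Reverse index: position -> patterns that begin there
reverse_index = defaultdict(list)
for pattern, positions in forward_index.items():
    for position in positions:
        reverse_index[position].append(pattern)

print(reverse_index[(5, 5)])

# All positions in sentence 1 at which some pattern begins
print(sorted(ref for ref in reverse_index if ref[0] == 1))
```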


You can also use this to easily get all patterns in a sentence:
    


In [32]:
print("Patterns in first sentence")
for (sentence, token), pattern in model.getreverseindex_bysentence(1):
    print(sentence,token, " -- ", pattern.tostring(classdecoder))

Patterns in first sentence
1 0  --  To
1 0  --  To be
1 0  --  To be ,
1 0  --  To be , or
1 0  --  To be , or not
1 1  --  be
1 1  --  be ,
1 1  --  be , or
1 1  --  be , or not
1 1  --  be , or not to
1 2  --  ,
1 2  --  , or
1 2  --  , or not
1 2  --  , or not to
1 2  --  , or not to be
1 3  --  or
1 3  --  or not
1 3  --  or not to
1 3  --  or not to be
1 3  --  or not to be ,
1 4  --  not
1 4  --  not to
1 4  --  not to be
1 4  --  not to be ,
1 4  --  not to be , that
1 5  --  to
1 5  --  to be
1 5  --  to be ,
1 5  --  to be , that
1 5  --  to be , that is
1 6  --  be
1 6  --  be ,
1 6  --  be , that
1 6  --  be , that is
1 6  --  be , that is the
1 7  --  ,
1 7  --  , that
1 7  --  , that is
1 7  --  , that is the
1 7  --  , that is the question
1 8  --  that
1 8  --  that is
1 8  --  that is the
1 8  --  that is the question
1 9  --  is
1 9  --  is the
1 9  --  is the question
1 10  --  the
1 10  --  the question
1 11  --  question


It is easy to iterate over all indices in the reverse index:

In [None]:
for ref in model.reverseindex():
    print(ref, end=" ")

Actually, the reverse index, as returned by the ``reverseindex()`` method, is just an instance of ``IndexedCorpus``, which we already saw earlier.

## Skipgrams and flexgrams and relations between patterns

Skipgrams are n-grams with one or more *gaps* of a particular size. Flexgrams have a gap of dynamic size. Colibri Core can deal with both. Let's start with a new, and somewhat bigger, corpus, as the data in our previous example was too sparse to find any skipgrams. To that end, we will download Plato's *Republic*. This version is already tokenised and has one sentence per line, just as Colibri Core likes it:

In [None]:
import urllib.request
corpusfile_plato_plaintext = TMPDIR + "republic.txt"
f = urllib.request.urlopen('http://lst.science.ru.nl/~proycon/republic.txt')
with open(corpusfile_plato_plaintext,'wb') as of:
    of.write(f.read())
print("Downloaded to " + corpusfile_plato_plaintext)


Now we create a class file and class encode the corpus, but because we may later on want to compare Shakespeare's Hamlet with Plato's Republic, we ensure that we use the same vocabulary. Note that it would have been better (more optimal classes, better compression) if we had built the original class encoder on both files right away, but you don't always have the luxury of foresight.


In [None]:
classfile_plato = TMPDIR + "republic.colibri.cls"
corpusfile_plato  = TMPDIR + "republic.colibri.dat"

#Build classes, re-using our classencoder from Hamlet! Let's reload it just for completeness' sake
classencoder = colibricore.ClassEncoder(TMPDIR + "hamlet.colibri.cls")

#Now we will extend it by building classes on Plato's data. If we had done this earlier, 
# we could have passed a list of filenames, ensuring more optimal encoding.
classencoder.build(corpusfile_plato_plaintext)

#Save new class file, this will be a superset of the original one.
classencoder.save(classfile_plato)

#Encode the corpus
classencoder.encodefile(corpusfile_plato_plaintext, corpusfile_plato)

#Load decoder because the old one will only handle Hamlet
classdecoder = colibricore.ClassDecoder(classfile_plato)

print("Done")


Now that we have a proper class file and encoded corpus, we can build an indexed pattern model with skipgrams. Skipgrams are built most efficiently using indexed models.

In [None]:
#Set the options, doskipgrams=True is the key to enabling skipgrams
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8,doskipgrams=True)

#Instantiate an empty indexed model 
corpus_plato = colibricore.IndexedCorpus(corpusfile_plato)
model = colibricore.IndexedPatternModel(reverseindex=corpus_plato)

#Train it on our corpus file (class-encoded data, not plain text)
model.train(corpusfile_plato, options)

print("Found " , len(model), " patterns:")

Now how many of those patterns are skipgrams? We can find out ourselves by iterating over the patterns and checking their *category*.

In [1]:
skipgrams = 0
for pattern in model:
    if pattern.category() == colibricore.Category.SKIPGRAM:
        skipgrams += 1
print("Found",skipgrams," skipgrams")
        
    


However, it is much faster to do this using the built-in ``filter()`` method. It selects patterns above a certain occurrence threshold, and can additionally be constrained to a specific category, such as skipgrams, and to a specific length (third argument, not used here):

In [None]:
skipgrams = 0
for pattern, occurrencecount in model.filter(0,colibricore.Category.SKIPGRAM): #the first parameter is the occurrence threshold
    skipgrams += 1
print("Found",skipgrams," skipgrams")

Similar to ``filter()`` is the ``top()`` method, which we can use to extract the top patterns, let's get the top 20 of skipgrams. We will still need to relay it through a sorting function to get it in descending order:

In [None]:
for pattern, occurrencecount in sorted( model.top(20,colibricore.Category.SKIPGRAM), key=lambda x:x[1]*-1 ):
    print(pattern.tostring(classdecoder), " -- ", occurrencecount)
    
    

The size of a skipgram gap is indicated by the number in the gap marker, for example ``{*1*}`` for a gap of one token. We can create skipgrams from scratch using the same syntax with the class encoder:


In [None]:
skipgram = classencoder.buildpattern("To {*1*} or not to {*1*} is the question")

The consecutive non-gap parts of a skipgram can be obtained using the ``parts()`` method. The skipgram above consists of three parts:

In [None]:
for part in skipgram.parts():
    print(part.tostring(classdecoder))
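
What ``parts()`` returns can be mimicked on the string form of a skipgram. The sketch below is a plain-Python illustration operating on text rather than on encoded patterns; the helpers `parts` and `gap_sizes` are hypothetical names for this example.

```python
import re

GAP = re.compile(r"\{\*(\d+)\*\}")  # matches gap markers such as {*1*}

def parts(skipgram):
    """Return the consecutive non-gap parts of a skipgram string."""
    pieces = re.split(r"\{\*\d+\*\}", skipgram)
    return [piece.strip() for piece in pieces if piece.strip()]

def gap_sizes(skipgram):
    """Return the size of each gap, in order of appearance."""
    return [int(size) for size in GAP.findall(skipgram)]

skipgram_text = "To {*1*} or not to {*1*} is the question"
print(parts(skipgram_text))
print(gap_sizes(skipgram_text))

# Total length in tokens: the tokens in all parts plus the gap sizes
length = sum(len(part.split()) for part in parts(skipgram_text)) + sum(gap_sizes(skipgram_text))
print(length)
```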

Because an indexed model stores all the locations at which a pattern occurs, and a reverse index allows us to fill missing gaps, we can easily obtain all n-grams of which the skipgram is an abstraction:

In [None]:
#let's pick a common skipgram from the data:
skipgram = classencoder.buildpattern("to the {*1*} of")

for ngram, occurrences in model.getinstances(skipgram):
    print(ngram.tostring(classdecoder), " -- occurring ", occurrences, " times" )
        

The reverse is also possible, given an ngram we can find what skipgrams are abstractions, or *templates* of it:

In [None]:
#let's pick an n-gram that is an instance of the skipgram above:
ngram = classencoder.buildpattern("to the idea of")

for skipgram, occurrences in model.gettemplates(ngram):
    print(skipgram.tostring(classdecoder), " -- occurring ", occurrences, " times" )
    
#(TODO: a bug in colibri core has to be solved for this to work as it should)
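
The instance/template relation itself is easy to state in plain Python. In this sketch a skipgram is represented as a tuple in which ``None`` marks a one-token gap; that representation, the function names, and the rule that the first and last token are always kept are assumptions made for the illustration. Unlike the model methods, it enumerates all possible templates rather than only those observed in a corpus.

```python
from itertools import combinations

def templates(ngram):
    """Yield skipgram templates of an n-gram; None marks a one-token gap.

    The first and last token are always kept, so gaps only occur inside.
    """
    inner = range(1, len(ngram) - 1)
    for size in range(1, len(inner) + 1):
        for gaps in combinations(inner, size):
            yield tuple(None if i in gaps else token for i, token in enumerate(ngram))

def is_instance(ngram, template):
    """True if the n-gram fills the gaps of the template and matches everywhere else."""
    return len(ngram) == len(template) and all(
        slot is None or slot == token for slot, token in zip(template, ngram))

ngram_tokens = ("to", "the", "idea", "of")
found = list(templates(ngram_tokens))
for template in found:
    print(" ".join("{*1*}" if slot is None else slot for slot in template))

print(is_instance(("to", "the", "law", "of"), ("to", "the", None, "of")))
```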

Another trait of indexed pattern models is the ability to extract co-occurrence information using the ``getcooc()`` method. Let's see with what patterns the ngram "the law" co-occurs more than five times (the second argument specifies this threshold; using it is always more efficient than doing a check on the variable ``occurrences`` that is returned):


In [None]:
ngram = classencoder.buildpattern("the law")

for coocngram, occurrences in sorted( model.getcooc(ngram,5), key=lambda x: x[1] *-1): #let's sort the output too
    print(coocngram.tostring(classdecoder), " -- occurring ", occurrences, " times")
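
As a rough illustration of the idea, the sketch below counts which words share a sentence with a target word. It is deliberately simplified: it handles single tokens only, counts each co-occurring word once per sentence, and treats the threshold as inclusive. All of these are assumptions for the example, and the function `cooc` is hypothetical rather than a reimplementation of ``getcooc()``.

```python
from collections import Counter

def cooc(sentences, target, threshold=0):
    """Count, per word, in how many sentences it appears together with target."""
    counts = Counter()
    for tokens in sentences:
        if target in tokens:
            # Each distinct word counts once per sentence
            for other in set(tokens) - {target}:
                counts[other] += 1
    return {word: c for word, c in counts.items() if c >= threshold}

toy_sentences = [
    ["the", "law", "of", "nature"],
    ["the", "law", "is", "just"],
    ["justice", "of", "nature"],
]

for word, count in sorted(cooc(toy_sentences, "law").items(), key=lambda item: (-item[1], item[0])):
    print(word, count)
```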
        
        

There are also specific methods for extracting co-occurrences left or right of the pattern: ``getleftcooc()`` and ``getrightcooc()``. Other relationships can be extracted in an identical fashion:

 * **``getleftneighbours``**``(pattern,threshold=0,category=0,size=0)`` -- returns the neighbours to the immediate left of a pattern (threshold, category and size are constraints which are set to 0 by default)
 * **``getrightneighbours``**``(pattern,threshold=0,category=0,size=0)`` -- returns the neighbours to the immediate right of a pattern
 * **``getsubchildren``**``(pattern,threshold=0,category=0,size=0)`` -- returns patterns that are a subpart of (subsumed by) the specified pattern
 * **``getsubparents``**``(pattern,threshold=0,category=0,size=0)`` -- the reverse of the above, returns patterns which subsume the specified pattern

*(TODO: this section is to be continued later with flexgrams)*

## Comparing pattern models

Pattern Models can be used in a train/test paradigm. You can create a Pattern Model on the training corpus and then generate a Pattern Model on the test corpus **constrained** by the training model. This allows you to test which patterns from the training corpus also occur in the test corpus, and how often. Statistics on these two differing counts can provide insight into how much the corpora differ.

We already saw the *coverage* metric previously, when applied to a train/test scenario it measures the number or ratio of tokens in the test corpus covered by patterns found during training. Let's perform such a comparison.

We made a Pattern Model on Plato's Republic and we have a small excerpt from Hamlet. Let's use the former as training and the latter as test.

When doing any kind of comparison, it is absolutely crucial that you make sure the training and test data are class encoded with the same classes. The best method for this is to build the class files for all data in advance. In the previous class encoding example we saw ``classencoder.build()`` which does nothing more than provide us with a shortcut to call ``classencoder.processcorpus()`` followed by ``classencoder.buildclasses()``. To process multiple corpora, we do this ourselves:

In [None]:
classfile2 = TMPDIR + "platoandhamlet.colibri.cls"

#Instantiate class encoder
classencoder2 = colibricore.ClassEncoder()

#Build classes
classencoder2.processcorpus(corpusfile_plato_plaintext)
classencoder2.processcorpus(corpusfile_plaintext)
classencoder2.buildclasses()

#Save class file
classencoder2.save(classfile2)

print("Encoded ", len(classencoder2), " classes, well done!")

It is important to realise that the Class Encoder we just built  (``classencoder2``) is now not compatible with the earlier class encoder used for previous examples! 

Often, however, you do not have all data available in advance. You may add a different test set later on, long after training. The way to make sure you have a proper class encoding is to extend your original class encoding. Rather than using the class encoder we just built, let us opt for that method, as this will keep all the classes we already had for the training data (Plato's Republic). We do this by calling the ``encodefile()`` method with two extra arguments set to True, indicating respectively that unknown words are allowed, and that unknown words are automatically added to the class encoding. If the second boolean is set to False, all unknown words are encoded by one single class reserved for unknown words.


In [None]:
print("Class encoder has ", len(classencoder), " classes prior to extension")

testcorpusfile = TMPDIR + "hamlet_test.colibri.dat" #this will be the encoded test corpus file
classencoder.encodefile(corpusfile_plaintext, testcorpusfile, True, True)

classfile_test = TMPDIR + "platoplushamlet.colibri.cls"
classencoder.save(classfile_test)

print("Class encoder has ", len(classencoder), " classes after extension")
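
The effect of the two booleans can be sketched on a plain dictionary. This is an illustration of the behaviour described above, not Colibri Core's code: the function `encode_tokens`, the reserved value `UNKNOWN = 0`, and the rule of handing out the next free integer are all assumptions made for the example.

```python
UNKNOWN = 0  # hypothetical class reserved for unknown words

def encode_tokens(tokens, classes, allow_unknown=False, auto_add=False):
    """Encode tokens with an existing class mapping, optionally extending it."""
    encoded = []
    for word in tokens:
        if word not in classes:
            if not allow_unknown:
                raise KeyError(word)
            if auto_add:
                # Extend the mapping: existing classes keep their numbers
                classes[word] = max(classes.values(), default=UNKNOWN) + 1
            else:
                encoded.append(UNKNOWN)
                continue
        encoded.append(classes[word])
    return encoded

training_classes = {"the": 1, "of": 2}
test_tokens = ["the", "sleep", "of", "sleep"]

# Unknown words collapse into the single reserved class
print(encode_tokens(test_tokens, dict(training_classes), True, False))

# Unknown words are added, so the mapping grows from 2 to 3 classes
extended = dict(training_classes)
print(encode_tokens(test_tokens, extended, True, True), len(extended))
```

Because existing classes are never renumbered, data encoded before the extension remains valid afterwards.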

Do note that this method of encoding is not optimal. Only encoding everything in one go ensures the smallest possible memory footprint.

We already created a pattern model on the training data in one of our earlier steps (called ``model``). To create our test model, we *train* a constrained model on the test set; this model is constrained by the training model we made earlier. This will result in a new pattern model. The nomenclature may be a bit confusing at first. We do all this by instantiating a new model, calling the ``train()`` method, and passing the constraining model as the last argument.

In [None]:
#Set the options
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8)

#Instantiate an empty indexed model 
testmodel = colibricore.IndexedPatternModel()

#Train it on our test corpus file (class-encoded data, not plain text)
testmodel.train(testcorpusfile, options, model)

Now we have a test model (effectively the intersection of an unconstrained model of the test corpus and the training model). We can see what patterns from the training corpus occur in the test corpus:

In [None]:
for pattern in testmodel:
    print(pattern.tostring(classdecoder))

We can inspect the differences between the counts:

In [None]:
for pattern in testmodel:
    print(pattern.tostring(classdecoder), " ---  in training: ", model.occurrencecount(pattern), ", in test: ", testmodel.occurrencecount(pattern)   )

This isn't so informative unless we apply some normalisation, so let's get the coverage instead:

In [None]:
for pattern in testmodel:
    print(pattern.tostring(classdecoder), " ---  in training: ", model.coverage(pattern), ", in test: ", testmodel.coverage(pattern)   )

The total coverage in particular may be an interesting metric for similarity across corpora, which we can compute as follows:

In [None]:
coverage = testmodel.totaltokensingroup() / testmodel.tokens()

print(coverage)
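
To see what this ratio expresses, here is a pure-Python sketch of total coverage under a constraint. It assumes that a test token counts as covered when it falls inside at least one occurrence of a pattern known from training; occurrence thresholds are ignored for simplicity, and the function `total_coverage` and the toy data are hypothetical.

```python
def total_coverage(test_sentences, training_patterns, maxlength=8):
    """Fraction of test tokens covered by at least one training pattern."""
    covered = set()
    total = 0
    for number, tokens in enumerate(test_sentences, start=1):
        total += len(tokens)
        for n in range(1, maxlength + 1):
            for start in range(len(tokens) - n + 1):
                if tuple(tokens[start:start + n]) in training_patterns:
                    # Mark every position spanned by this occurrence
                    covered.update((number, start + k) for k in range(n))
    return len(covered) / total

training_patterns = {("to",), ("the", "law")}
test_sentences = [
    ["to", "sleep"],
    ["the", "law", "of", "sleep"],
]
print(total_coverage(test_sentences, training_patterns))
```

Here three of the six test tokens are covered. A test corpus that resembles the training corpus more closely would yield a value nearer to 1.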

To get a more traditional frequency metric for a pattern, you have to be aware that the total that is used in normalisation is impacted by the fact that the model is constrained! It will not include any unseen n-grams, for that you'd need an unconstrained model.

In [None]:
sleep = classencoder.buildpattern("to sleep")

print("Frequency in training:", model.frequency(sleep))

print("Frequency in test (constrained):", testmodel.frequency(sleep) )
print("Coverage in test (constrained):", testmodel.coverage(sleep) )

fullmodel = colibricore.IndexedPatternModel()
fullmodel.train(testcorpusfile, options)
print("Frequency in test (unconstrained):", fullmodel.frequency(sleep) )
print("Coverage in test (unconstrained):", fullmodel.coverage(sleep) )
