# Friendly Python Closures in an Automatic Text Annotator

While creating some pipelines for automatic text annotation, I encountered a bug that made me realise that I didn't understand a basic building block of programming.
If you know what closures are and how to use them, you can skip this post.
Still here? Then I'll reconstruct some [SpaCy](https://spacy.io/) and [skweak](https://github.com/NorskRegnesentral/skweak) functionality for you, to give you a context in which you may also need to use closures.

## SpaCy objects

Central to the SpaCy package is the `Doc` object (short for document). It's a neat way to pack data for NLP, and if it doesn't provide what you need out of the box, you can [always extend its functionality](https://spacy.io/usage/processing-pipelines#custom-components-attributes) to match your use case.

In [1]:
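from typing import Callable, Generator, Iterable, List

class Doc:
    def __init__(self, text) -> None:
        self.text = text
        # Naive whitespace tokenisation is enough for this post
        self.tokens: List[str] = text.split()
        # Span is defined a few cells below, hence the string annotation
        self.spans: List["Span"] = []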
    def __len__(self):
        return len(self.tokens)
    def __getitem__(self, position):
        return self.tokens[position]

By implementing the `__len__` and `__getitem__` [double-underscore ("dunder") methods](https://www.geeksforgeeks.org/dunder-magic-methods-python/) we get the ability to iterate through the Doc's tokens with a simple `for` loop, as below. This is thanks to the Python data model. It's outside the scope of this post, but learning to leverage the data model will pay dividends for your effectiveness in Python. The first chapter of [Fluent Python](https://learning.oreilly.com/library/view/fluent-python/9781491946237/) introduces it in a very neat way. If you prefer video, [James Powell](https://www.youtube.com/watch?v=AmHE0kZhLIQ) has you covered.

In [2]:
doc = Doc("Today I ate garbanzo beans")
for token in doc:
    print(token)

Today
I
ate
garbanzo
beans
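Iteration is not the only thing those two methods unlock. As a small sketch (the class name `TinyDoc` is a stand-in invented for this example, mirroring the two dunder methods of the `Doc` above), indexing, slicing, membership tests and `reversed()` should all work without any further code:

```python
class TinyDoc:
    """Hypothetical stand-in for Doc: only __len__ and __getitem__ are defined."""

    def __init__(self, text: str) -> None:
        self.tokens = text.split()

    def __len__(self) -> int:
        return len(self.tokens)

    def __getitem__(self, position):
        # Delegating to the list means ints and slices are both accepted
        return self.tokens[position]


tiny = TinyDoc("Today I ate beans")

print(len(tiny))              # number of tokens
print(tiny[0], tiny[-1])      # indexing, including negative indices
print(tiny[1:3])              # slicing is handled by the underlying list
print("ate" in tiny)          # `in` falls back to iterating via __getitem__
print(list(reversed(tiny)))   # reversed() needs __len__ and __getitem__
```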


A `Span` is a slice of a `Doc`. Usually it can... span multiple tokens, but today I have a feeling that all the spans we'll look at will match exactly one token. Also, in our case the spans will always be labeled.

In [3]:
from typing import NamedTuple

class Span(NamedTuple):
    position: int
    label: str
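Two properties of `NamedTuple` matter for the rest of the post, so here is a quick sketch of them (the sample values are made up for illustration): a `Span` unpacks like a plain tuple, and spans compare field by field, which means a list of them sorts by position first.

```python
from typing import NamedTuple


class Span(NamedTuple):
    position: int
    label: str


span = Span(position=3, label="FISH")

# A NamedTuple is still a tuple, so it unpacks positionally...
position, label = span
print(position, label)

# ...and it compares field by field, so sorting orders by position first.
spans = [Span(6, "BIRD"), Span(0, "MAMMAL"), span]
sorted_spans = sorted(spans)
print(sorted_spans)
# [Span(position=0, label='MAMMAL'), Span(position=3, label='FISH'), Span(position=6, label='BIRD')]
```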


## skweak functions

Now, skweak provides us with some very interesting objects. One is a `FunctionAnnotator`. This takes a function that returns a list of spans from a document and attaches these spans to the given document. 

In [4]:
class FunctionAnnotator:
    def __init__(self, function: Callable[[Doc], Iterable[Span]]):
        self.find_spans = function
		
    def __call__(self, doc: Doc) -> Doc:
        # We start by clearing all existing annotations
        doc.spans = []

        for position, label in self.find_spans(doc):
            doc.spans.append(Span(position, label))

        return doc

In [5]:
def animal_labeling_function(doc: Doc) -> Generator[Span, None, None]:
    for position, token in enumerate(doc.tokens):
        if token.startswith('a'):
            yield Span(position, 'ANIMAL')

doc = Doc('this animal is some kind of antelope')
animal_annotator = FunctionAnnotator(animal_labeling_function)
doc = animal_annotator(doc)

print(doc.spans)

[Span(position=1, label='ANIMAL'), Span(position=6, label='ANIMAL')]


The `FunctionAnnotatorAggregator` takes multiple annotators and combines their output in some fancy way. We will implement it so that our documents have at most one label per position (if two annotators label the same token, the later one wins). We will also sort the spans by order of appearance.

In [6]:
class FunctionAnnotatorAggregator:
    def __init__(self, annotators: Iterable[FunctionAnnotator]) -> None:
        self.annotators = annotators
    
    def __call__(self, doc: Doc) -> Doc:
        spans_dict = dict()
        for annotator in self.annotators:
            for span in annotator(doc).spans:
                spans_dict[span.position] = span.label
        
        
        doc.spans = []
        for position, label in spans_dict.items():
            doc.spans.append(Span(position, label))
        doc.spans.sort()
        
        return doc

In [7]:
def verb_labeling_function(doc: Doc) -> Generator[Span, None, None]:
    for position, token in enumerate(doc.tokens):
        if token in ['is', 'has']:
            yield Span(position, 'VERB')

verb_annotator = FunctionAnnotator(verb_labeling_function)

aggregated_annotator = FunctionAnnotatorAggregator([animal_annotator, verb_annotator])

doc = aggregated_annotator(doc)

print(doc.spans)


[Span(position=1, label='ANIMAL'), Span(position=2, label='VERB'), Span(position=6, label='ANIMAL')]


## The problem

All nice and dandy, the packages are well implemented and work as expected! 
Now, since the aggregator is so fancy, we may wish to programmatically generate some labeling functions from a list of excellent heuristic parameters:
<a id='problem_cell'></a>

In [8]:
heuristic_parameters = [
    ('M', 'MAMMAL'),
    ('F', 'FISH'),
    ('B', 'BIRD')
    ]

labeling_functions = []
for strats_with, label in heuristic_parameters:
    def labeling_function(doc: Doc) -> Generator[Span, None, None]:
        for position, word in enumerate(doc.tokens):
            if word.startswith(strats_with):
                yield Span(position, label)
    labeling_functions.append(labeling_function)

strats_with, label = 'B', 'BOVINE'


<a id='problem_cell_2'></a>

In [9]:
doc = Doc("Monkeys are red. Firefish are blue; Besra is a bird and so are you")

def print_spans_from_labelers(doc, labeling_functions):
    annotators = [FunctionAnnotator(labeling_function) for labeling_function in labeling_functions]
    aggregated_annotator = FunctionAnnotatorAggregator(annotators)

    doc = aggregated_annotator(doc)

    print(doc.spans)
print_spans_from_labelers(doc, labeling_functions)


[Span(position=6, label='BOVINE')]


What happened? It seems that only the last function was applied. Let's look at the `labeling_functions`:

In [10]:
for labeling_function in labeling_functions:
    print(labeling_function)

<function labeling_function at 0x7fd21437ac10>
<function labeling_function at 0x7fd21437aca0>
<function labeling_function at 0x7fd21437ad30>


The three functions live at three different memory addresses, so each pass through the loop did create a new function object. The `def` only rebinds the name `labeling_function`, while the list keeps a reference to every earlier object. Overwriting the function is therefore not the problem.
To make sure, let's rewrite this with lambda functions, which have no name to rebind.

*Note* if you haven't worked with list comprehensions before: don't worry about it; think of the code below as a way to build a list with one new anonymous function per parameter pair.

In [18]:
labeling_functions = [
    lambda doc: (
        # This is a generator expression: the same syntax as a list
        # comprehension, but with parentheses instead of square brackets.
        Span(position, label)
        for position, word in enumerate(doc.tokens)
        if word.startswith(strats_with)
    )
    for strats_with, label in heuristic_parameters
]

for labeling_function in labeling_functions:
    print(labeling_function)

<function <listcomp>.<lambda> at 0x7fd21438d670>
<function <listcomp>.<lambda> at 0x7fd21438d700>
<function <listcomp>.<lambda> at 0x7fd21438d790>


But when we run the annotators, the problem remains.

In [13]:
print_spans_from_labelers(doc, labeling_functions)

[Span(position=6, label='BIRD')]


This is because of scoping. Neither `strats_with` nor `label` is a parameter or a local variable of the lambdas, so Python looks them up in the enclosing scope, and it does so when the lambda is *called*, not when it is created. By then the loop has finished and both names hold their last values.
If you come from other languages this might seem strange, but Python doesn't create a new scope for each iteration of a `for` body; the loop variables live in the surrounding scope and are simply rebound on every pass. This is why `strats_with, label = 'B', 'BOVINE'` in cell [8](#problem_cell) caused cell [9](#problem_cell_2) to display the label 'BOVINE', while the lambdas built inside the list comprehension (which has a scope of its own) report 'BIRD'.
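The same behaviour can be reproduced without any annotators. The snippet below is a stripped-down sketch (the names `getters` and `i` are made up for this example) showing that functions created in a loop read the variable at call time:

```python
# Three lambdas created in a loop; none of them takes `i` as a parameter.
getters = []
for i in range(3):
    getters.append(lambda: i)

# The loop is over, so `i` is 2 and every lambda sees that value.
after_loop = [get() for get in getters]
print(after_loop)  # [2, 2, 2]

# Rebinding the variable afterwards changes what all of them return.
i = 'BOVINE'
after_rebind = [get() for get in getters]
print(after_rebind)  # ['BOVINE', 'BOVINE', 'BOVINE']
```

The lambdas are three distinct objects, yet they all share one variable, which is exactly what happened to the labeling functions.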




But be not afraid! There is a solution:

In [14]:
def function_closure(strats_with, label):
    def labeling_function(doc: Doc) -> Generator[Span, None, None]:
        for position, word in enumerate(doc.tokens):
            if word.startswith(strats_with):
                yield Span(position, label)
    return labeling_function

labeling_functions = [function_closure(strats_with, label) for strats_with, label in heuristic_parameters]

Now, when we build the annotators, things go as expected.

In [15]:
print_spans_from_labelers(doc, labeling_functions)

[Span(position=0, label='MAMMAL'), Span(position=3, label='FISH'), Span(position=6, label='BIRD')]


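Again, what happened? Well, this time every call to `function_closure` created a fresh local scope in which `strats_with` and `label` are bound to that call's arguments. The inner `labeling_function` holds on to that scope even after `function_closure` has returned, so each labeling function keeps its own pair of values. A function bundled with the variables of the scope it was defined in is what we call a closure.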
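If you want to see the captured values for yourself, Python exposes them on the function object. The sketch below uses a simplified factory (`make_prefix_checker` and `check` are hypothetical names, not part of SpaCy or skweak) and reads the standard `__closure__` and `__code__.co_freevars` attributes:

```python
def make_prefix_checker(prefix, label):
    # `prefix` and `label` are local to this call; the inner function captures them.
    def check(word):
        return label if word.startswith(prefix) else None
    return check


parameters = [('M', 'MAMMAL'), ('F', 'FISH'), ('B', 'BIRD')]
checkers = [make_prefix_checker(prefix, label) for prefix, label in parameters]

captured_values = []
for check in checkers:
    # co_freevars names the captured variables; __closure__ holds their cells.
    names = check.__code__.co_freevars
    values = [cell.cell_contents for cell in check.__closure__]
    captured_values.append(dict(zip(names, values)))
    print(captured_values[-1])

# Each checker answers with its own label, or None when the prefix doesn't match.
results = [check('Monkeys') for check in checkers]
print(results)  # ['MAMMAL', None, None]
```

Each function carries its own cells, so rebinding a module-level variable later on can no longer affect them.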

