### Making a few baselines available

Okay, so this is mostly working, but there are some issues

**Note -- all the paths in the next cell, as well as any place where submission output is writte to "subdir", those paths need to be updated. The subdir paths should be empty directories where .tsv files will be output.**

Pipeline here is:
* Train each spaCy entity model separately
* Predict entities from each model and collect them 
* Work out extra Quantity components.
    * [TODO] Modifiers -- Will probably just do these as a series regex
    * Units -- Doing this one in a straight brute force matching thing -- take last longest mathing string of any unit found in the training data. 
* Relationships and alignment.
    * This is the really hard bit. 
    * Initially I tried to align based on relationships using the dependency parse trick in the example here, but it didn't quite work the way I wanted: https://spacy.io/usage/examples#entity-relations
    * Below, there are two simple versions. 
        * First is incredibly naive, and just takes each predicted span in the order they are found in the text.
        * Second is slightly more complex, matching each span to it's nearest neighbor in the text and knocking them out to prevent reuse.
        * [TODO] Third possibility will be to only rely on SpaCy predictions for the Quantities, then use nearest noun phrase chunks to approximate the other related components.


In [15]:
# A few imports and set up our paths
import itertools
import spacy
import random
import os
from spacy.util import minibatch, compounding

currentdir = os.getcwd() # ~/MeasEval/baselines
print(currentdir)
filename = os.path.join(currentdir, '../data/trial/tsv')
print(filename)

trainpaths = [os.path.join(currentdir, "../data/trial/tsv/"),
             os.path.join(currentdir, "../data/train/tsv/")]

evalpath = os.path.join(currentdir, "../data/eval/text/")

textpaths = [os.path.join(currentdir, "../data/trial/txt/"),
            os.path.join(currentdir, "../data/train/text/")]

/home/sam/MeasEval/baselines
/home/sam/MeasEval/baselines/../data/trial/tsv


In [16]:
# Set shorthands for annotation spans
typemap = {"Quantity": "QUANT",
           "MeasuredEntity": "ME", 
           "MeasuredProperty": "MP", 
           "Qualifier": "QUAL"}

In [17]:
# Collect all the ids and all the text files in both the train and trial directories
# Set our train test split for doing initial model development.
docIds = []
textset = {}
for fileset in textpaths:
    for fn in os.listdir(fileset):
        with open(fileset+fn) as textfile:
            text = textfile.read() #.splitlines()
            #print(fn[:-4])
            textset[fn[:-4]] = text
            docIds.append(fn[:-4])

random.seed(42)
random.shuffle(docIds)

trainIds = docIds[:220]
testIds = docIds[220:]

In [18]:
# Build training data from TSVs in expected format for spacy NER models...
# We have to train each model separately, because spacy doesn't let us have 
# Multiple entities that overlap, and we have this a lot (Especially in our Qualifiers)
# Unfortunately, we even have a fair bit of overlap within annotation types, 
# and end up needing to throw away a bunch of training data.

# Note that we have data split for train / test, and we also have full training data.

trainents = {"QUANT": [], "ME": [], "MP": [], "QUAL": []}
traindata = {"QUANT": [], "ME": [], "MP": [], "QUAL": []}
testents = {"QUANT": [], "ME": [], "MP": [], "QUAL": []}
testdata = {"QUANT": [], "ME": [], "MP": [], "QUAL": []}

alltrainents = {"QUANT": [], "ME": [], "MP": [], "QUAL": []}
alltraindata = {"QUANT": [], "ME": [], "MP": [], "QUAL": []}

for fileset in trainpaths:
    for fn in os.listdir(fileset):
        entities = {"QUANT": [], "ME": [], "MP": [], "QUAL": []}
        with open(fileset+fn) as annotfile:
            text = textset[fn[:-4]]
            next(annotfile)
            annots = annotfile.read().splitlines()
            for a in annots:
                annot = a.split("\t")
                atype = typemap[annot[2]]
                start = int(annot[3])
                stop = int(annot[4])
                # This is where we toss out the overlaps:
                overlap = False
                for ent in entities[atype]:
                    if ((start >= ent[0] and start <= ent[1]) or (stop >= ent[0] and stop <= ent[1]) or
                        (ent[0] >= start and ent[0] <= stop) or (ent[1] >= start and ent[1] <= stop)):
                        #print(str(start)+"-"+str(stop)+" overlaps " + str(ent))
                        overlap = True
                if overlap == False:    
                    entities[atype].append((start, stop, atype))
            if fn[:-4] in trainIds:
                traindata["QUANT"].append((text, {"entities": entities["QUANT"]}))
                traindata["ME"].append((text, {"entities": entities["ME"]}))
                traindata["MP"].append((text, {"entities": entities["MP"]}))
                traindata["QUAL"].append((text, {"entities": entities["QUAL"]}))
                trainents["QUANT"].extend(entities["QUANT"])
                trainents["ME"].extend(entities["ME"])
                trainents["MP"].extend(entities["MP"])
                trainents["QUAL"].extend(entities["QUAL"])
            else:
                testdata["QUANT"].append((text, {"entities": entities["QUANT"]}))
                testdata["ME"].append((text, {"entities": entities["ME"]}))
                testdata["MP"].append((text, {"entities": entities["MP"]}))
                testdata["QUAL"].append((text, {"entities": entities["QUAL"]}))
                testents["QUANT"].extend(entities["QUANT"])
                testents["ME"].extend(entities["ME"])
                testents["MP"].extend(entities["MP"])
                testents["QUAL"].extend(entities["QUAL"])
            alltraindata["QUANT"].append((text, {"entities": entities["QUANT"]}))
            alltraindata["ME"].append((text, {"entities": entities["ME"]}))
            alltraindata["MP"].append((text, {"entities": entities["MP"]}))
            alltraindata["QUAL"].append((text, {"entities": entities["QUAL"]}))
            alltrainents["QUANT"].extend(entities["QUANT"])
            alltrainents["ME"].extend(entities["ME"])
            alltrainents["MP"].extend(entities["MP"])
            alltrainents["QUAL"].extend(entities["QUAL"])

In [19]:
# We don't throw out _that_ many, see counts below.
print("Training:")
entcount = 0
for t in ["QUANT", "ME", "MP", "QUAL"]:
    print(t + ": " + str(len(trainents[t])))
    entcount+=len(trainents[t])
print("Total: " + str(entcount))
entcount = 0

print("\nTest:")
for t in ["QUANT", "ME", "MP", "QUAL"]:
    print(t + ": " + str(len(testents[t])))
    entcount+=len(testents[t])
print("Total: " + str(entcount))
entcount = 0

print("\nAll training:")
for t in ["QUANT", "ME", "MP", "QUAL"]:
    print(t + ": " + str(len(alltrainents[t])))
    entcount+=len(alltrainents[t])
print("Total: " + str(entcount))
# Before filtering overlaps:
# QUANT: 1164
# ME: 1148
# MP: 742
# QUAL: 309

# Only filtered the one direction:
# QUANT: 1164
# ME: 914
# MP: 651
# QUAL: 278

# From the full set
# QUANT: 1164
# ME: 911
# MP: 651
# QUAL: 278

Training:
QUANT: 817
ME: 632
MP: 472
QUAL: 193
Total: 2114

Test:
QUANT: 347
ME: 279
MP: 179
QUAL: 85
Total: 890

All training:
QUANT: 1164
ME: 911
MP: 651
QUAL: 278
Total: 3004


In [20]:
#check to make sure we're close to a 70/30 split. :)
print(804/(804+360))
print(360/(804+360))

0.6907216494845361
0.30927835051546393


In [21]:
# Simplest possible model training. I'm sure there's tons I could do to optimize here.
# Note that we lose a few more training instances here due to tokenizer mismatch issues.
# Only effects Qualifiers and MeasuredProperties...
models = {}
for entType in ["QUANT", "ME", "MP", "QUAL"]:
    print("Starting training for " + entType)
    models[entType] = spacy.blank("en")
    ner = models[entType].create_pipe("ner")
    models[entType].add_pipe(ner)
    print(models[entType].pipe_names)
    ner.add_label(entType)
    optimizer = models[entType].begin_training()

    sizes = compounding(1.0, 4.0, 1.001)
    # batch up the examples using spaCy's minibatch
    for itn in range(20):
        random.shuffle(traindata[entType])
        batches = minibatch(traindata[entType], size=sizes)
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            models[entType].update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
        print("Losses", losses)

Starting training for QUANT


ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got <spacy.pipeline.ner.EntityRecognizer object at 0x7f40dad78820> (name: 'None').

- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.

- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.

- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.

In [58]:
# And build our entity predictions for each of the four models...
ents = {}
counts = { "total": 0, "QUANT": 0, "ME": 0, "MP": 0, "QUAL": 0}
for docId in testIds:
    text = textset[docId]
    #for docid,text in evaltextset.items():
    counts["total"] += 1
    ents[docId] = {}

    for entType in ["QUANT", "ME", "MP", "QUAL"]:
        ents[docId][entType] = ()
        doc = models[entType](text)
        ents[docId][entType] = doc.ents
        if len(list(ents[docId][entType])) > 0:
            counts[entType]+=1

In [61]:
# Collect a set of unique units for use in populating the unit data...
import json
units = []

for fileset in trainpaths:
    for fn in os.listdir(fileset):
        # Let's make sure to limit the units to just the smaller train set
        if fn[:-4] in trainIds:
            with open(fileset+fn) as annotfile:
                text = textset[fn[:-4]]
                next(annotfile)
                annots = annotfile.read().splitlines()
                for a in annots:
                    annot = a.split("\t")
                    atype = typemap[annot[2]]
                    if atype == "QUANT" and annot[7] != "":
                        jsondata = json.loads(annot[7])
                        if "unit" in jsondata:
                            units.append(jsondata["unit"])
uniqunits = list(set(units))

In [62]:
print(len(units))
print(len(uniqunits))

644
140


In [118]:
# Simplest version, let's just check the lengths of everything
# Then pop them off in the order they exist.
header = "docId\tannotSet\tannotType\tstartOffset\tendOffset\tannotId\ttext\tother"
subdir = "/Users/harperco/projects/semeval/scratch/subs/baseline-simpler-split/"
count = 0
for docId, allents in ents.items():
    #if docId == "S0378112713005288-1800":
    #print(allents)
    annotSet = 1
    #print(str(len(allents['QUANT']))+"|"+str(len(allents['ME']))+"|"
    #      +str(len(allents['MP']))+"|"+str(len(allents['QUAL'])))
    sub = open(subdir+docId + ".tsv", "w")
    sub.write(header+"\n")
    for quant in allents['QUANT']:
        unitmatches = []
        for unit in uniqunits: 
            if unit in quant.text:
                unitmatches.append(unit)
        if len(unitmatches) > 0: 
            unit = max(unitmatches, key=len)
        strings = []
        meId = 0
        annotId = 1
        quantString = (docId + "\t" + str(annotSet) + "\tQuantity\t" + str(quant.start_char) + "\t" +
                        str(quant.end_char) + "\t" + str(annotId) + "\t" + quant.text+"\t{\"unit\": \"" + unit +  "\"}")
        strings.append(quantString)
        annotId+=1
        if (len(allents['ME']) > annotSet-1 and len(allents['MP']) > annotSet-1):
            mp = allents['MP'][annotSet-1]
            me = allents['ME'][annotSet-1]
            mpString = (docId + "\t" + str(annotSet) + "\tMeasuredProperty\t" + str(mp.start_char) + "\t" + 
                    str(mp.end_char) + "\t" + str(annotId) + "\t" + mp.text + "\t{\"HasQuantity\": \"" + 
                    str(annotId-1) + "\"}" )
            strings.append(mpString)
            annotId+=1

            #print(me.text)
            meString = (docId + "\t" + str(annotSet) + "\tMeasuredEntity\t" + str(me.start_char) + "\t" + 
                        str(me.end_char) + "\t" + str(annotId) + "\t" + me.text + "\t{\"HasProperty\": \"" + 
                        str(annotId-1) + "\"}" )
            strings.append(meString)
            meId = annotId
            annotId+=1
        elif (len(allents['ME']) > annotSet-1):
            me = allents['ME'][annotSet-1]
            meString = (docId + "\t" + str(annotSet) + "\tMeasuredEntity\t" + str(me.start_char) + "\t" + 
                        str(me.end_char) + "\t" + str(annotId) + "\t" + me.text + "\t{\"HasProperty\": \"" + 
                        str(annotId-1) + "\"}" )
            strings.append(meString)
            meId = annotId
            annotId+=1     
        if (len(allents['QUAL']) > annotSet-1 and meId != 0):
            qual = allents['QUAL'][annotSet-1]
            qualString = (docId + "\t" + str(annotSet) + "\tQualifier\t" + str(qual.start_char) + "\t" + 
                        str(qual.end_char) + "\t" + str(annotId) + "\t" + qual.text + "\t{\"Qualifies\": \"" + 
                        str(meId) + "\"}" )
            strings.append(qualString)
            meId = annotId
            annotId+=1                           

        #print("ENT: " + me.text)
        #print("PROP: " + mp.text)
        for s in strings:
            #print(s)
            sub.write(s+"\n")
        annotSet+=1
    sub.close()

In [22]:
for docId, allents in ents.items():
    if docId == "S0378112713005288-1800":
        print(type(allents['QUANT']))

<class 'tuple'>


In [35]:
ents['S0378112713005288-1800']['MP'][0]

height

#### Shelling out to measeval-eval.py inline.

Note, we have added another new flag to the evaluation script: -l or limit.

This was the default up until the evaluation period opened. It limits the gold data files loaded to only files that are included in the submission. This is so that you can set an arbitrary train/test split (as we've done above) and not record the training portion in the gold data used for evaluation.

Also note that the "gold/scratch" directory used for eval below is a combined copy of _all_ .tsv files in both the data/train/tsv and data/test/tsv directories in the MeasEval Github repo.

In [66]:
!python /Users/harperco/projects/semeval/MeasEval/eval/measeval-eval.py -i "/Users/harperco/projects/semeval/" -g "scratch/gold/" -s "scratch/subs/baseline-simpler-split/" -l


Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-simpler-split/S0019103511004994-1382.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-simpler-split/S2213671113001306-1286.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-simpler-split/S0960148113002048-3775.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-simpler-split/S0025322712001600-2406.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-simpler-split/S2213671113000921-994.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-simpler-split/S0165587612003680-998.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harpe

[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-simpler-split/S0019103512004009-3488.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-simpler-split/S0167819113001051-1550.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-simpler-split/S0167880913001229-1033.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-simpler-split/S0921818113002245-859.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-simpler-split/S0019103512003533-4685.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-simpler-split/S0019103512003995-3420.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/h

In [70]:
# This last, fairly unweildy chunk of code is:
# * collecting everything, 
# * Building the TSV strings
# * Attempting to identify a unit
# * matching and populating annotSet based on knockout logic, 
# * resorting, and populating TSV files.

# Configure header string and submission directory (latter needs to exist.)
header = "docId\tannotSet\tannotType\tstartOffset\tendOffset\tannotId\ttext\tother"
subdir = "/Users/harperco/projects/semeval/scratch/subs/baseline-split/"

for docId, allents in ents.items():
    # First we collect our Quantities
    # We want to get the strin version, the full set, and the "knockout" list.
    quantstrings = []
    quants = []
    knockout = []
    annotSet = 1
    for quant in allents['QUANT']:
        # Match units in the Quant, then find the longest unit 
        unitmatches = []
        for unit in uniqunits: 
            if unit in quant.text:
                unitmatches.append(unit)
        if len(unitmatches) > 0: 
            unit = max(unitmatches, key=len)
        # Build the quantity string, and also the dictionary for quant and knockout.
        quantstrings.append(docId + "\t" + str(annotSet) + "\tQuantity\t" + str(quant.start_char) + "\t" +
                          str(quant.end_char) + "\t1\t" + quant.text+"\t{\"unit\": \"" + unit +  "\"}")
        quants.append({"annotSet": annotSet, "annotId": 1, "start": quant.start_char, "end": quant.end_char, 
                       "text": quant.text, "type": "Quantity"}) 
        knockout.append({"annotSet": annotSet, "annotId": 1, "start": quant.start_char, "end": quant.end_char, 
                       "text": quant.text, "type": "Quantity"}) 
        annotSet+=1
    
    # So now we want to do the ents, as we need this queued up to do more matching with the MPs
    mestrings = []
    mestring = ""
    mes = []
    knockoutmes = []
    #annotSet = 1
    for me in allents['ME']:
        knockoutmes.append({"start": me.start_char, "end": me.end_char, "text": me.text, "type": "MeasuredEntity"}) 

    # Now we work through our measured properties.
    mpstrings = []
    mpstring = ""
    mps = []
    knockoutmps = []
    for mp in allents['MP']:
        if len(knockout) > 0 and len(knockoutmes) > 0:
            start = mp.start_char
            end = mp.end_char
            nearest = {"dist": 100000000, "set": 0, "id": 0, "index": 100000000}
            index = 0
            for q in knockout:
                dists = [abs(start-q["start"]), abs(end-q["start"]), abs(start-q["end"]), abs(end-q["end"])]
                mindist = min(dists)
                if mindist < nearest["dist"]:
                    nearest["dist"] = mindist
                    nearest["set"] = q["annotSet"]
                    nearest["id"] = q["annotId"]
                    nearest["index"] = index
                index+=1
            knockout.pop(nearest["index"])

            mpString = (docId + "\t" + str(nearest["set"]) + "\tMeasuredProperty\t" + str(mp.start_char) + "\t" + 
                        str(mp.end_char) + "\t" + str(nearest["id"]+1) + "\t" + mp.text + "\t{\"HasQuantity\": \"" + 
                        str(nearest["id"]) + "\"}" )
            mpstrings.append(mpString)
            mps.append({"annotSet": nearest["set"], "annotId": nearest["id"]+1, "start": mp.start_char, 
                        "end": mp.end_char, "text": mp.text, "type": "MeasuredProperty"})
            knockoutmps.append({"annotSet": nearest["set"], "annotId": nearest["id"]+1, "start": mp.start_char, 
                        "end": mp.end_char, "text": mp.text, "type": "MeasuredProperty"})

            nearestme = {"dist": 100000000, "index": 100000000}
            index = 0
            if len(knockoutmes) > 0:
                for me in knockoutmes:
                    dists = [abs(start-me["start"]), abs(end-me["start"]), abs(start-me["end"]), abs(end-me["end"])]
                    mindist = min(dists)
                    if mindist < nearestme["dist"]:
                        nearestme["dist"] = mindist
                        nearestme["index"] = index
                    index+=1
                meString = (docId + "\t" + str(nearest["set"]) + "\tMeasuredEntity\t" + str(me["start"]) + "\t" + 
                            str(me["end"]) + "\t" + str(nearest["id"]+2) + "\t" + me["text"] + "\t{\"HasProperty\": \"" + 
                            str(nearest["id"]+1) + "\"}" )   
                mestrings.append(meString)

                knockoutmes.pop(nearestme["index"])


    # Now we do any leftover MEs, which should go straight to a Quantity:

    for me in knockoutmes:
        start = me["start"]
        end = me["end"]
        nearest = {"dist": 100000000, "set": 0, "id": 0, "index": 100000000, "type": ""}
        index = 0                
        for q in knockout:
            dists = [abs(start-q["start"]), abs(end-q["start"]), abs(start-q["end"]), abs(end-q["end"])]
            mindist = min(dists)
            if mindist < nearest["dist"]:
                nearest["dist"] = mindist
                nearest["set"] = q["annotSet"]
                nearest["id"] = q["annotId"]
                nearest["index"] = index
                nearest["type"] = q["type"]
            index+=1
        if len(knockout) > 0:
            knockout.pop(nearest["index"])
            meString = (docId + "\t" + str(nearest["set"]) + "\tMeasuredEntity\t" + str(me["start"]) + "\t" + 
                        str(me["end"]) + "\t" + str(nearest["id"]+1) + "\t" + me["text"] + "\t{\"HasQuantity\": \"" + 
                        str(nearest["id"]) + "\"}" )   
            mestrings.append(meString)
            mes.append({"annotSet": nearest["set"], "annotId": nearest["id"]+1, "start": me["start"], 
                        "end": me["end"], "text": me["text"], "type": "MeasuredEntity"})
            
    #Finally, let's process our Qualifiers:
    kitchensink = [x for x in itertools.chain(quants, mps, mes)]
    qualstrings = []
    for qual in allents['QUAL']:
        start = qual.start_char
        end = qual.end_char
        nearest = {"dist": 100000000, "set": 0, "id": 0, "index": 100000000}
        index = 0
        for q in kitchensink:
            dists = [abs(start-q["start"]), abs(end-q["start"]), abs(start-q["end"]), abs(end-q["end"])]
            mindist = min(dists)
            if mindist < nearest["dist"]:
                nearest["dist"] = mindist
                nearest["set"] = q["annotSet"]
                nearest["id"] = q["annotId"]
                nearest["index"] = index
            index+=1
        kitchensink.pop(nearest["index"])

        qualString = (docId + "\t" + str(nearest["set"]) + "\tQualifier\t" + str(qual.start_char) + "\t" + 
                    str(qual.end_char) + "\t" + str(nearest["id"]+1) + "\t" + qual.text + "\t{\"Qualifies\": \"" + 
                    str(nearest["id"]) + "\"}" )
        qualstrings.append(qualString)

    # Finally, we collect everythign:

    import itertools
    allstrings = [x for x in itertools.chain(quantstrings, mpstrings, mestrings, qualstrings)]
    sortedstrings = {}

    sub = open(subdir+docId + ".tsv", "w")

    for string in allstrings:
        annotSet = string.split("\t")[1]
        annotId = string.split("\t")[5]
        if annotSet not in sortedstrings:
            sortedstrings[annotSet] = {}
        sortedstrings[annotSet][annotId] = string   
    sub.write(header+"\n")
    for aset, val in sortedstrings.items():
        for aid, string in val.items():
            sub.write(string+"\n")
    sub.close()

In [71]:
!python /Users/harperco/projects/semeval/MeasEval/eval/measeval-eval.py -i "/Users/harperco/projects/semeval/" -g "scratch/gold/" -s "scratch/subs/baseline-split/" -l


Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-split/S0019103511004994-1382.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-split/S2213671113001306-1286.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-split/S0960148113002048-3775.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-split/S0025322712001600-2406.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-split/S2213671113000921-994.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-split/S0165587612003680-998.tsv'))
[0;32mPassed! :)[0m

Validating Vlad(source=LocalFile('/Users/harperco/projects/semeval/scratch/subs/baseline-split

Submission directory contains: 93
Gold directory contains: 88
Gold count of Quantity: 360
Gold count of MeasuredProperty: 221
Gold count of MeasuredEntity: 352
Gold count of Qualifier: 95

Submission count of Quantity: 363
Submission count of MeasuredProperty: 57
Submission count of MeasuredEntity: 89
Submission count of Qualifier: 30

Working in mode overall
True positives (matching rows): 616
False positives (submission only): 461
False negatives (gold only): 1553

Precision: 0.5719591457753017
Recall: 0.28400184416781926
F-measure: 0.3795440542205792

Overall Score Exact Match: 0.19467680608365018
Overall Score F1 (Overlap): 0.23017754630369258


### Noteably, we see from the two cells above that the more involved matching of spans based on proximity doesn't add muchj more than .01 to the overall F1 score.

### Now we'll repeat the training above, but using the full set of training data

In [110]:
# Now we'll repeat the same set of things for the full set of training data:

models = {}
for entType in ["QUANT", "ME", "MP", "QUAL"]:
    print("Starting training for " + entType)
    models[entType] = spacy.blank("en")
    ner = models[entType].create_pipe("ner")
    models[entType].add_pipe(ner)
    print(models[entType].pipe_names)
    ner.add_label(entType)
    optimizer = models[entType].begin_training()

    sizes = compounding(1.0, 4.0, 1.001)
    # batch up the examples using spaCy's minibatch
    for itn in range(20):
        random.shuffle(alltraindata[entType])
        batches = minibatch(alltraindata[entType], size=sizes)
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            models[entType].update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
        print("Losses", losses)

Starting training for QUANT
['ner']
Losses {'ner': 3953.4779805369817}
Losses {'ner': 1610.3809841818131}
Losses {'ner': 1447.6390614423142}
Losses {'ner': 1590.713278219099}
Losses {'ner': 1035.438445837068}
Losses {'ner': 842.9937791002653}
Losses {'ner': 807.0279002391468}
Losses {'ner': 725.251484618387}
Losses {'ner': 736.8128775222854}
Losses {'ner': 510.5135251911736}
Losses {'ner': 403.54704855692324}
Losses {'ner': 440.63163169393533}
Losses {'ner': 534.9749610727363}
Losses {'ner': 403.5207589780159}
Losses {'ner': 438.44466070558275}
Losses {'ner': 375.71735788605736}
Losses {'ner': 337.82190201392393}
Losses {'ner': 318.26199420943146}
Losses {'ner': 313.2439186123273}
Losses {'ner': 279.8252344196454}
Starting training for ME
['ner']
Losses {'ner': 3567.991322077707}
Losses {'ner': 3004.681139232435}
Losses {'ner': 2878.56946327679}
Losses {'ner': 3680.0836548520483}
Losses {'ner': 3098.901827742639}
Losses {'ner': 4310.777471259318}
Losses {'ner': 4702.907995848022}
Losse

In [111]:
# Get the eval text data together
evalpath = "/Users/harperco/projects/semeval/measeval-publish-stage/eval/text/"

evaltextset = {}
for fn in os.listdir(evalpath):
    with open(evalpath+fn) as textfile:
        text = textfile.read() #.splitlines()
        #print(fn[:-4])
        evaltextset[fn[:-4]] = text

In [112]:
# And build our entity predictions for each of the four models...
ents = {}
counts = { "total": 0, "QUANT": 0, "ME": 0, "MP": 0, "QUAL": 0}
for docid,text in evaltextset.items():
    counts["total"] += 1
    ents[docid] = {}

    for entType in ["QUANT", "ME", "MP", "QUAL"]:
        ents[docid][entType] = ()
        doc = models[entType](text)
        ents[docid][entType] = doc.ents
        if len(list(ents[docid][entType])) > 0:
            counts[entType]+=1

In [113]:
# Collect a set of unique units for use in populating the unit data...
import json
units = []

for fileset in trainpaths:
    for fn in os.listdir(fileset):
        # This time we run the unit collection for all the training data
        # if fn[:-4] in trainIds:
            with open(fileset+fn) as annotfile:
                text = textset[fn[:-4]]
                next(annotfile)
                annots = annotfile.read().splitlines()
                for a in annots:
                    annot = a.split("\t")
                    atype = typemap[annot[2]]
                    if atype == "QUANT" and annot[7] != "":
                        jsondata = json.loads(annot[7])
                        if "unit" in jsondata:
                            units.append(jsondata["unit"])
uniqunits = list(set(units))

In [114]:
# Simpler version, let's just check the lenths of everything
# Then pop them off in the order they exist.
header = "docId\tannotSet\tannotType\tstartOffset\tendOffset\tannotId\ttext\tother"
subdir = "/Users/harperco/projects/semeval/scratch/subs/baseline-simpler-2/"
count = 0
for docId, allents in ents.items():
    #if docId == "S0378112713005288-1800":
    #print(allents)
    annotSet = 1
    #print(str(len(allents['QUANT']))+"|"+str(len(allents['ME']))+"|"
    #      +str(len(allents['MP']))+"|"+str(len(allents['QUAL'])))
    sub = open(subdir+docId + ".tsv", "w")
    sub.write(header+"\n")
    for quant in allents['QUANT']:
        unitmatches = []
        for unit in uniqunits: 
            if unit in quant.text:
                unitmatches.append(unit)
        if len(unitmatches) > 0: 
            unit = max(unitmatches, key=len)
        strings = []
        meId = 0
        annotId = 1
        quantString = (docId + "\t" + str(annotSet) + "\tQuantity\t" + str(quant.start_char) + "\t" +
                        str(quant.end_char) + "\t" + str(annotId) + "\t" + quant.text+"\t{\"unit\": \"" + unit +  "\"}")
        strings.append(quantString)
        annotId+=1
        if (len(allents['ME']) > annotSet-1 and len(allents['MP']) > annotSet-1):
            mp = allents['MP'][annotSet-1]
            me = allents['ME'][annotSet-1]
            mpString = (docId + "\t" + str(annotSet) + "\tMeasuredProperty\t" + str(mp.start_char) + "\t" + 
                    str(mp.end_char) + "\t" + str(annotId) + "\t" + mp.text + "\t{\"HasQuantity\": \"" + 
                    str(annotId-1) + "\"}" )
            strings.append(mpString)
            annotId+=1

            #print(me.text)
            meString = (docId + "\t" + str(annotSet) + "\tMeasuredEntity\t" + str(me.start_char) + "\t" + 
                        str(me.end_char) + "\t" + str(annotId) + "\t" + me.text + "\t{\"HasProperty\": \"" + 
                        str(annotId-1) + "\"}" )
            strings.append(meString)
            meId = annotId
            annotId+=1
        elif (len(allents['ME']) > annotSet-1):
            me = allents['ME'][annotSet-1]
            meString = (docId + "\t" + str(annotSet) + "\tMeasuredEntity\t" + str(me.start_char) + "\t" + 
                        str(me.end_char) + "\t" + str(annotId) + "\t" + me.text + "\t{\"HasProperty\": \"" + 
                        str(annotId-1) + "\"}" )
            strings.append(meString)
            meId = annotId
            annotId+=1     
        if (len(allents['QUAL']) > annotSet-1 and meId != 0):
            qual = allents['QUAL'][annotSet-1]
            qualString = (docId + "\t" + str(annotSet) + "\tQualifier\t" + str(qual.start_char) + "\t" + 
                        str(qual.end_char) + "\t" + str(annotId) + "\t" + qual.text + "\t{\"Qualifies\": \"" + 
                        str(meId) + "\"}" )
            strings.append(qualString)
            meId = annotId
            annotId+=1                           

        #print("ENT: " + me.text)
        #print("PROP: " + mp.text)
        for s in strings:
            #print(s)
            sub.write(s+"\n")
        annotSet+=1
    sub.close()

In [115]:
# This last, fairly unweildy chunk of code is:
# * collecting everything, 
# * Building the TSV strings
# * Attempting to identify a unit
# * matching and populating annotSet based on knockout logic, 
# * resorting, and populating TSV files.

# Configure header string and submission directory (latter needs to exist.)
header = "docId\tannotSet\tannotType\tstartOffset\tendOffset\tannotId\ttext\tother"
subdir = "/Users/harperco/projects/semeval/scratch/subs/baseline-2/"

for docId, allents in ents.items():
    #print(allents)
    # First we collect our Quantities
    # We want to get the strin version, the full set, and the "knockout" list.
    quantstrings = []
    quants = []
    knockout = []
    annotSet = 1
    for quant in allents['QUANT']:
        # Match units in the Quant, then find the longest unit 
        unitmatches = []
        for unit in uniqunits: 
            if unit in quant.text:
                unitmatches.append(unit)
        if len(unitmatches) > 0: 
            unit = max(unitmatches, key=len)
        # Build the quantity string, and also the dictionary for quant and knockout.
        quantstrings.append(docId + "\t" + str(annotSet) + "\tQuantity\t" + str(quant.start_char) + "\t" +
                          str(quant.end_char) + "\t1\t" + quant.text+"\t{\"unit\": \"" + unit +  "\"}")
        quants.append({"annotSet": annotSet, "annotId": 1, "start": quant.start_char, "end": quant.end_char, 
                       "text": quant.text, "type": "Quantity"}) 
        knockout.append({"annotSet": annotSet, "annotId": 1, "start": quant.start_char, "end": quant.end_char, 
                       "text": quant.text, "type": "Quantity"}) 
        annotSet+=1
    
    # So now we want to do the ents, as we need this queued up to do more matching with the MPs
    mestrings = []
    mestring = ""
    mes = []
    knockoutmes = []
    #annotSet = 1
    for me in allents['ME']:
        knockoutmes.append({"start": me.start_char, "end": me.end_char, "text": me.text, "type": "MeasuredEntity"}) 

    # Now we work through our measured properties.
    mpstrings = []
    mpstring = ""
    mps = []
    knockoutmps = []
    for mp in allents['MP']:
        if len(knockout) > 0 and len(knockoutmes) > 0:
            start = mp.start_char
            end = mp.end_char
            nearest = {"dist": 100000000, "set": 0, "id": 0, "index": 100000000}
            index = 0
            for q in knockout:
                dists = [abs(start-q["start"]), abs(end-q["start"]), abs(start-q["end"]), abs(end-q["end"])]
                mindist = min(dists)
                if mindist < nearest["dist"]:
                    nearest["dist"] = mindist
                    nearest["set"] = q["annotSet"]
                    nearest["id"] = q["annotId"]
                    nearest["index"] = index
                index+=1
            knockout.pop(nearest["index"])

            mpString = (docId + "\t" + str(nearest["set"]) + "\tMeasuredProperty\t" + str(mp.start_char) + "\t" + 
                        str(mp.end_char) + "\t" + str(nearest["id"]+1) + "\t" + mp.text + "\t{\"HasQuantity\": \"" + 
                        str(nearest["id"]) + "\"}" )
            mpstrings.append(mpString)
            mps.append({"annotSet": nearest["set"], "annotId": nearest["id"]+1, "start": mp.start_char, 
                        "end": mp.end_char, "text": mp.text, "type": "MeasuredProperty"})
            knockoutmps.append({"annotSet": nearest["set"], "annotId": nearest["id"]+1, "start": mp.start_char, 
                        "end": mp.end_char, "text": mp.text, "type": "MeasuredProperty"})

            nearestme = {"dist": 100000000, "index": 100000000}
            index = 0
            if len(knockoutmes) > 0:
                for me in knockoutmes:
                    dists = [abs(start-me["start"]), abs(end-me["start"]), abs(start-me["end"]), abs(end-me["end"])]
                    mindist = min(dists)
                    if mindist < nearestme["dist"]:
                        nearestme["dist"] = mindist
                        nearestme["index"] = index
                    index+=1
                meString = (docId + "\t" + str(nearest["set"]) + "\tMeasuredEntity\t" + str(me["start"]) + "\t" + 
                            str(me["end"]) + "\t" + str(nearest["id"]+2) + "\t" + me["text"] + "\t{\"HasProperty\": \"" + 
                            str(nearest["id"]+1) + "\"}" )   
                mestrings.append(meString)

                knockoutmes.pop(nearestme["index"])


    # Now we do any leftover MEs, which should go straight to a Quantity:

    for me in knockoutmes:
        start = me["start"]
        end = me["end"]
        nearest = {"dist": 100000000, "set": 0, "id": 0, "index": 100000000, "type": ""}
        index = 0                
        for q in knockout:
            dists = [abs(start-q["start"]), abs(end-q["start"]), abs(start-q["end"]), abs(end-q["end"])]
            mindist = min(dists)
            if mindist < nearest["dist"]:
                nearest["dist"] = mindist
                nearest["set"] = q["annotSet"]
                nearest["id"] = q["annotId"]
                nearest["index"] = index
                nearest["type"] = q["type"]
            index+=1
        if len(knockout) > 0:
            knockout.pop(nearest["index"])
            meString = (docId + "\t" + str(nearest["set"]) + "\tMeasuredEntity\t" + str(me["start"]) + "\t" + 
                        str(me["end"]) + "\t" + str(nearest["id"]+1) + "\t" + me["text"] + "\t{\"HasQuantity\": \"" + 
                        str(nearest["id"]) + "\"}" )   
            mestrings.append(meString)
            mes.append({"annotSet": nearest["set"], "annotId": nearest["id"]+1, "start": me["start"], 
                        "end": me["end"], "text": me["text"], "type": "MeasuredEntity"})
            
    #Finally, let's process our Qualifiers:
    kitchensink = [x for x in itertools.chain(quants, mps, mes)]
    qualstrings = []
    for qual in allents['QUAL']:
        start = qual.start_char
        end = qual.end_char
        nearest = {"dist": 100000000, "set": 0, "id": 0, "index": 100000000}
        index = 0
        if len(kitchensink) > 0:
            for q in kitchensink:
                dists = [abs(start-q["start"]), abs(end-q["start"]), abs(start-q["end"]), abs(end-q["end"])]
                mindist = min(dists)
                if mindist < nearest["dist"]:
                    nearest["dist"] = mindist
                    nearest["set"] = q["annotSet"]
                    nearest["id"] = q["annotId"]
                    nearest["index"] = index
                index+=1
            kitchensink.pop(nearest["index"])

            qualString = (docId + "\t" + str(nearest["set"]) + "\tQualifier\t" + str(qual.start_char) + "\t" + 
                        str(qual.end_char) + "\t" + str(nearest["id"]+1) + "\t" + qual.text + "\t{\"Qualifies\": \"" + 
                        str(nearest["id"]) + "\"}" )
            qualstrings.append(qualString)

    # Finally, we collect everythign:

    import itertools
    allstrings = [x for x in itertools.chain(quantstrings, mpstrings, mestrings, qualstrings)]
    sortedstrings = {}

    sub = open(subdir+docId + ".tsv", "w")

    for string in allstrings:
        annotSet = string.split("\t")[1]
        annotId = string.split("\t")[5]
        if annotSet not in sortedstrings:
            sortedstrings[annotSet] = {}
        sortedstrings[annotSet][annotId] = string   
    sub.write(header+"\n")
    for aset, val in sortedstrings.items():
        for aid, string in val.items():
            sub.write(string+"\n")
    sub.close()

## Results

Performance of the 2nd of these two models is currently our strongest baseline, achieveing the following scores on the evaluation data:

* Overall Score Exact Match: 0.21156036446469248 
* Overall Score F1 (Overlap): 0.23945662847323318 