### Coreference Reformat

**Note: this notebook has a known bug (likely in extract_chains) that produces erroneous spans, marked by negative indices, in the output data. None of these errors occur in the data sampled for evaluation, so the output is used as-is for now.**

This notebook loads the results produced by coref_faa.py from the data/results folder and adds a column called corefs. The column follows the format expected by the cr_eval.py script in evaluations/automatic, which makes evaluation much simpler.

The desired format is:\
\[coreference_chain, coreference_chain, ...], where coreference_chain = \[mention_span, mention_span, ...] and mention_span = \[start_word_index, end_word_index]\
For example, take the entry "PILOT LANDED ON WHAT HE THOUGHT TO BE ONE FOOT HIGH GRASS. IT TURNED OUT TO BE THREE FEET HIGH. ACFT NOSED OVER."\
Its coreference chains are: \[\["PILOT", "HE"], \["ONE FOOT HIGH GRASS", "IT"]]\
In the desired format they appear as: [[[0,0],[4,4]],[[8,11],[13,13]]]\
Both indices of a span are zero-based and inclusive. The word indices are based on the word tokenization used in data/FAA_data/faa.conll, which is the input data for ASP and s2e-coref. They keep increasing across the whole doc/entry and do not reset at sentence starts.
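
As a quick check of the format, the sketch below maps the example spans back to their words. The tokenization here is an assumption (whitespace split, with each sentence-final period as its own token); it is chosen because it reproduces the indices in the example, and the tokenization in faa.conll may differ in other cases. The helper name spans_to_text is hypothetical.

```python
# Assumed tokenization: whitespace split, periods as separate tokens.
text = ("PILOT LANDED ON WHAT HE THOUGHT TO BE ONE FOOT HIGH GRASS . "
        "IT TURNED OUT TO BE THREE FEET HIGH . ACFT NOSED OVER .")
tokens = text.split()

# The example chains in the target format: spans are [start, end], inclusive.
corefs = [[[0, 0], [4, 4]], [[8, 11], [13, 13]]]

def spans_to_text(chains, tokens):
    """Hypothetical helper: map each inclusive [start, end] span back to its words."""
    return [[' '.join(tokens[start:end + 1]) for start, end in chain]
            for chain in chains]

print(spans_to_text(corefs, tokens))
# [['PILOT', 'HE'], ['ONE FOOT HIGH GRASS', 'IT']]
```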

In [1]:
import pandas as pd
import re

In [2]:
result_df = pd.read_csv('../../data/results/ncoref/FAA_DataModel_20240103111938.csv')
result_df.head()

Unnamed: 0,c5,c119,c119_coref,c119_ner,c119_pos,c119_lemmatized,c119_dependency,c119_sentiment
0,19750315005389A,TAILWHEEL COCKED RIGHT PRIOR TO TKOF. ...,[],"[('TAILWHEEL COCKED', 'ORG')]","[('TAILWHEEL', 'PROPN'), ('COCKED', 'PROPN'), ...","['TAILWHEEL', 'COCKED', 'right', 'prior', 'to'...","[('TAILWHEEL', 'compound', 'COCKED'), ('COCKED...",0.142857
1,19750419011349A,TOW PLANE BECAME AIRBORNE THEN SETTLED.STUDENT...,[],"[('SETTLED.STUDENT THOUGHT TOW IN', 'FAC')]","[('TOW', 'NOUN'), ('PLANE', 'NOUN'), ('BECAME'...","['tow', 'plane', 'become', 'airborne', 'then',...","[('TOW', 'compound', 'PLANE'), ('PLANE', 'nsub...",-0.2
2,19751029037799A,"2ND ILS APCH,ACFT'S G/S INOP.LOM TUNED TO WRON...",[],"[('2ND ILS APCH', 'ORG'), (""ACFT'S"", 'ORG'), (...","[('2ND', 'PUNCT'), ('ILS', 'PROPN'), ('APCH', ...","['2ND', 'ILS', 'APCH', ',', 'ACFT', ""'s"", 'g',...","[('2ND', 'nummod', 'APCH'), ('ILS', 'compound'...",-0.25
3,19751209037899A,PLT NOTED SOFT R BRAKE PEDAL DRG TAXI TO TKOF....,[],"[('RTND SPRINGFIELD', 'ORG')]","[('PLT', 'PROPN'), ('NOTED', 'ADV'), ('SOFT', ...","['PLT', 'noted', 'soft', 'R', 'brake', 'PEDAL'...","[('PLT', 'nsubj', 'NOTED'), ('NOTED', 'ROOT', ...",0.127083
4,19750818025579A,TAXI OFF HARD SFC DUE TFC R MAIN GR BROKE THRO...,[],"[('TAXI OFF', 'ORG')]","[('TAXI', 'VERB'), ('OFF', 'ADP'), ('HARD', 'A...","['taxi', 'off', 'hard', 'SFC', 'due', 'TFC', '...","[('TAXI', 'ROOT', 'TAXI'), ('OFF', 'prt', 'TAX...",-0.083333


**Extract lists of coreferences from the string representation in column c119_coref**

In [26]:
chain_p = re.compile(r'\[([^\[\]]+: )?(\[[^\[\]]+\])?(, [^\[\]]+: )?(\[[^\[\]]+\])?(, [^\[\]]+: )?(\[[^\[\]]+\])?(, [^\[\]]+: )?(\[[^\[\]]+\])?\]') # can capture up to 4 coreference chains
coref_p = re.compile(r'\[(.+)(, )(.+)(, )?(.+)?(, )?(.+)?(, )?(.+)?(, )?(.+)?\]')

def extract_chains(data_in):
        
    # Match coreference chains in c119_coref output
    coref_chains = []
    for match_group in re.match(chain_p, data_in).groups():
        if match_group is not None and '[' in match_group:

            # Match coreference mentions in coreference chain and store as list
            coref_chain = [group for group in re.match(coref_p, match_group).groups() if group is not None and group != ', ']
            coref_chains.append(coref_chain)

    return coref_chains
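
One possible source of the negative spans, offered as a hypothesis rather than a confirmed diagnosis: the first `(.+)` in coref_p is greedy, so a chain with three mentions is split into only two strings, the first of which contains a comma and cannot be found in the text. The sketch below reproduces that behaviour and shows a hypothetical alternative, extract_chains_split, that collects the innermost bracket groups and splits them on `', '`. The three-mention input is made up for illustration, and the alternative assumes that no mention itself contains `', '`.

```python
import re

# coref_p copied from the cell above
coref_p = re.compile(r'\[(.+)(, )(.+)(, )?(.+)?(, )?(.+)?(, )?(.+)?(, )?(.+)?\]')

# Illustrative three-mention chain (not taken from the data)
m = coref_p.match('[PLT, HE, HIS]')
merged = [g for g in m.groups() if g is not None and g != ', ']
print(merged)  # ['PLT, HE', 'HIS']: the first two mentions are merged

# Innermost [...] groups are the mention lists; '[]' yields no match.
inner_p = re.compile(r'\[([^\[\]]+)\]')

def extract_chains_split(data_in):
    """Hypothetical alternative: handles any number of chains and mentions."""
    return [inner.split(', ') for inner in inner_p.findall(data_in)]

print(extract_chains_split('[ACFT: [ACFT, IT]]'))     # [['ACFT', 'IT']]
print(extract_chains_split('[PLT: [PLT, HE, HIS]]'))  # [['PLT', 'HE', 'HIS']]
print(extract_chains_split('[]'))                     # []
```

Because it does not rely on a fixed number of optional groups, this variant also avoids the four-chain limit of chain_p.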

In [30]:
c119_coref_chains = []
for i in range(len(result_df)):
    coref_chains = extract_chains(result_df['c119_coref'].iat[i])
    c119_coref_chains.append(coref_chains)
c119_coref_chains[2320:2330] # sample output

[[['ACFT', 'THE ACFT']], [], [], [], [], [], [], [], [], []]

**Get FAA data in format {c5_id:{0: word0, 1: word1, ..., n: wordn}} using word tokenization from faa.conll**

In [31]:
faa = {}

with open('../../data/FAA_data/faa.conll') as f:
    text = f.read()

docs = text.split('#begin document ')

for doc in docs:
    if doc[:5] == '(faa/':
        word_count = 0
        c5_id = doc.split('_')[1][:15]
        faa[c5_id] = {}
        lines = doc.split('\n')
        for line in lines[1:]:
            if 'faa' in line:
                faa[c5_id][word_count] = line.split()[3].upper()
                word_count = word_count + 1

**Get word indices for c119_coref col in result_df**

In [32]:
def get_spans(coref_chain, words):
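    ''' Input: coref_chain = ['MENTION1', 'MENTION2', ...], words = {0: word0, 1: word1, ...}
        Output: [[startidx_mention1, endidx_mention1], [startidx_mention2, endidx_mention2], ...]
        A span of [-1, -1] or [-2, -1] means the mention could not be located.
    '''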

    chain_spans = []

    resume_idx = 0
    
    for mention in coref_chain:

        mention_span = [-1, -1]
        
        if mention in ' '.join(words.values()):
            # find start of mention
            for iword, word in words.items():
                if iword >= resume_idx and mention.split()[0] == word:
                    mention_span[0] = iword

                    if words[iword + len(mention.split()) - 1] == mention.split()[-1]:
                        mention_span[1] = iword + len(mention.split()) - 1
                        # resume the search after this mention so later mentions in the chain cannot reuse the same span
                        resume_idx = iword + len(mention.split())
                        break
                    else:
                        mention_span[0] = -2
                    # else reset and continue
        
        chain_spans.append(mention_span)

    return chain_spans
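
Note that get_spans compares only the first and last word of a mention, so it can return the wrong span when an earlier phrase shares both, and the lookup `words[iword + len(mention.split()) - 1]` can raise a KeyError when a first-word match sits near the end of the entry. A minimal sketch of a stricter matcher follows; find_span is a hypothetical helper, not part of the notebook, and it assumes the keys of words run from 0 to len(words) - 1 as built from faa.conll above.

```python
def find_span(mention, words, resume_idx=0):
    """Return the inclusive [start, end] of the first full-token match
    at or after resume_idx, or None if the mention is not found."""
    mention_tokens = mention.split()
    if not mention_tokens:
        return None
    n = len(mention_tokens)
    tokens = [words[i] for i in range(len(words))]  # assumes keys 0..len-1
    # Stop early enough that the slice never runs past the last token.
    for start in range(resume_idx, len(tokens) - n + 1):
        if tokens[start:start + n] == mention_tokens:
            return [start, start + n - 1]
    return None

# Illustrative entry (not taken from the data)
words = dict(enumerate('THE RIGHT GEAR HELD BUT THE LEFT GEAR COLLAPSED .'.split()))
print(find_span('THE LEFT GEAR', words))     # [5, 7]
print(find_span('THE', words, resume_idx=1)) # [5, 5]
print(find_span('NOSE GEAR', words))         # None
```

On this entry, get_spans would return [0, 2] for 'THE LEFT GEAR', because 'THE RIGHT GEAR' shares its first and last word.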

In [36]:
# Iterate through c119_coref_chains

formatted = []

for i in range(len(result_df)):

    coref_chains = c119_coref_chains[i]

    output_chains = []
    for coref_chain in coref_chains:
        
        # Get spans of words of coref chain using get_spans
        chain_spans = get_spans(coref_chain, faa[result_df['c5'][i]])

        output_chains.append(chain_spans)

    formatted.append(output_chains)

formatted[2320:2330] # sample output

[[[[4, 4], [16, 17]]], [], [], [], [], [], [], [], [], []]

**Check for errors**

In [67]:
errs = []
for iresult, result in enumerate(formatted):
    for chain in result:
        if any(start < 0 or end < 0 for start, end in chain): # catches both [-1, -1] and [-2, *]
            errs.append(iresult)

In [70]:
sample = pd.read_csv('../../data/sampling/FAA_sample_100.csv')
errs_to_check = []
for err in errs:
    if err in sample['Unnamed: 0'].values: # `in` on a Series tests index labels, so compare against the values
        errs_to_check.append(err)

In [72]:
errs_to_check # No errors occur within the data sampled for evaluation

[]

**Add formatted column to df**

In [73]:
result_df['corefs'] = formatted

In [75]:
result_df[result_df['c5'] == '19990213001379A'] # sample output

Unnamed: 0,c5,c119,c119_coref,c119_ner,c119_pos,c119_lemmatized,c119_dependency,c119_sentiment,corefs
2318,19990213001379A,ACFT WAS TAXIING FOR TAKE OFF WHEN IT LOST CON...,"[ACFT: [ACFT, IT]]","[('ACFT', 'ORG'), ('DITCH', 'PERSON'), ('AE', ...","[('ACFT', 'PROPN'), ('WAS', 'VERB'), ('TAXIING...","['ACFT', 'be', 'taxi', 'for', 'take', 'off', '...","[('ACFT', 'nsubj', 'TAXIING'), ('WAS', 'aux', ...",-0.125,"[[[0, 0], [7, 7]]]"


In [77]:
result_df.to_csv('../../data/results/ncoref/crosslingual_coref_with_errors.csv', index=False)
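
Because the corefs column holds nested Python lists, to_csv writes each cell as its string representation, as in the "[[[0, 0], [7, 7]]]" cell shown in the sample output above. A consumer that reads the file back with pandas gets strings and needs to parse them. The sketch below shows one way to do that with ast.literal_eval; whether cr_eval.py already does its own parsing is not known here.

```python
import ast

# The corefs cell as it appears in the CSV (taken from the sample output above)
cell = "[[[0, 0], [7, 7]]]"

chains = ast.literal_eval(cell)  # safely parse the literal back into nested lists
print(chains)                    # [[[0, 0], [7, 7]]]
print(chains[0][1])              # [7, 7]

# With pandas, the same conversion applies to the whole column, e.g.:
# df['corefs'] = df['corefs'].apply(ast.literal_eval)
```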