### Coref mt5 Output Reformat

In this notebook, we take the results written by coref_mt5/main.py to the data/results folder and process them into a new CSV, coref_mt5_processed.csv, with a column called corefs. This column follows the format expected by the cr_eval.py script in evaluations/automatic, which makes evaluation much simpler.

The desired format is:\
\[coreference_chain, coreference_chain, ...], where coreference_chain = \[mention_span, mention_span, ...] and mention_span = \[start_word_index, end_word_index] (both indices inclusive).\
For example, take the entry "PILOT LANDED ON WHAT HE THOUGHT TO BE ONE FOOT HIGH GRASS. IT TURNED OUT TO BE THREE FEET HIGH. ACFT NOSED OVER."\
Its coreference chains are: \[\["PILOT", "HE"], \["ONE FOOT HIGH GRASS", "IT"]]\
These appear as: \[\[\[0,0],\[4,4]],\[\[8,11],\[13,13]]]\
The word indices are based on the word tokenization used in data/FAA_data/faa.json, which is used to create faa.jsonl in coref_mt5/data. The word indices continue to increase throughout the whole doc/entry and do not reset at sentence starts.
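To make the indexing concrete, here is a small sketch that decodes the example above. The hand-written `sentences` list and the `span_text` helper are assumptions for illustration only: the tokenization (whitespace-separated words, each period as its own token) is inferred from the example indices, and the real tokens come from faa.json.

```python
# Hand-tokenized version of the example entry (assumed tokenization:
# whitespace-separated words, with each period as its own token).
sentences = [
    "PILOT LANDED ON WHAT HE THOUGHT TO BE ONE FOOT HIGH GRASS .".split(),
    "IT TURNED OUT TO BE THREE FEET HIGH .".split(),
    "ACFT NOSED OVER .".split(),
]

# Document-level index of the first word of each sentence.
# Indices keep growing across sentences instead of resetting to 0.
offsets = []
total = 0
for sent in sentences:
    offsets.append(total)
    total += len(sent)

# Flatten the sentences so spans can be looked up by document index.
doc_tokens = [tok for sent in sentences for tok in sent]

# The corefs value for this entry, in the target format.
corefs = [[[0, 0], [4, 4]], [[8, 11], [13, 13]]]


def span_text(span):
    """Return the words covered by an inclusive [start, end] span."""
    start, end = span
    return " ".join(doc_tokens[start:end + 1])


chains_as_text = [[span_text(span) for span in chain] for chain in corefs]
print(offsets)
print(chains_as_text)
```

Here "IT" has index 13 rather than 0 because the first sentence contributes 13 tokens, including its period.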

In [22]:
import pandas as pd
import re
import ast

In [56]:
result_df = pd.read_csv('../../data/results/coref_mt5/coref_mt5_intermediate.csv')
result_df.head()

Unnamed: 0,c5_id,input,prediction_strings,results
0,19750315005389A,"{'doc_key': 'faa/0_19750315005389A', 'sentence...",{0: ''},None [+ E]
1,19750419011349A,"{'doc_key': 'faa/1_19750419011349A', 'sentence...","{0: '', 1: 'None [+ E]'}",None [+ E]
2,19750419011349A,"{'doc_key': 'faa/1_19750419011349A', 'sentence...","{0: '', 1: 'None [+ E]', 2: 'None [+ E]'}",None [+ E]
3,19750419011349A,"{'doc_key': 'faa/1_19750419011349A', 'sentence...","{0: '', 1: 'None [+ E]', 2: 'None [+ E]'}",None [+ E]
4,19751029037799A,"{'doc_key': 'faa/2_19751029037799A', 'sentence...","{0: '', 1: 'None [+ E]'}",None [+ E]


In [57]:
# example coreference
result_df[result_df['c5_id'] == '19840107019539I']

Unnamed: 0,c5_id,input,prediction_strings,results
1696,19840107019539I,"{'doc_key': 'faa/743_19840107019539I', 'senten...","{0: '', 1: 'None [+ E]'}",None [+ E]
1697,19840107019539I,"{'doc_key': 'faa/743_19840107019539I', 'senten...","{0: '', 1: 'None [+ E]', 2: 'it ## . ** _ -> c...",it ## . ** _ -> cabin door ## opened . | ;;
1698,19840107019539I,"{'doc_key': 'faa/743_19840107019539I', 'senten...","{0: '', 1: 'None [+ E]', 2: 'it ## . ** _ -> c...",his ## hand blew into -> pilot ## tried to clo...
1699,19840107019539I,"{'doc_key': 'faa/743_19840107019539I', 'senten...","{0: '', 1: 'None [+ E]', 2: 'it ## . ** _ -> c...",door ## not secured ** -> [1 ;;


In [62]:
def find_sublist(lst, sublst):
    """
    Find the start index of the first occurrence of the sublist in the list.

    Args:
    lst (list): The list to search within.
    sublst (list): The sublist to search for.

    Returns:
    int: The starting index of the first occurrence of the sublist in the list, or -1 if the sublist is not found.
    """
    n = len(sublst)
    for i in range(len(lst) - n + 1):
        if lst[i:i + n] == sublst:
            return i
    return -1
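
A quick check of how `find_sublist` behaves. The function is repeated so the snippet runs on its own, and the token list is made up for illustration.

```python
def find_sublist(lst, sublst):
    """Same logic as the cell above: index of the first match, or -1."""
    n = len(sublst)
    for i in range(len(lst) - n + 1):
        if lst[i:i + n] == sublst:
            return i
    return -1


# Made-up token list, not taken from the dataset.
tokens = ["CABIN", "DOOR", "OPENED", ".", "PILOT", "TRIED", "TO", "CLOSE", "IT", "."]

first = find_sublist(tokens, ["DOOR", "OPENED"])    # match starts at index 1
missing = find_sublist(tokens, ["door", "opened"])  # comparison is case-sensitive
empty = find_sublist(tokens, [])                    # an empty sublist matches at 0

print(first, missing, empty)
```

Because the comparison is exact, the mention text and the sentence tokens need to share casing and tokenization for a match to be found.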

In [108]:
def get_coref_chains(c5_id, result_df):

    output = result_df[result_df['c5_id'] == c5_id]
    input = ast.literal_eval(list(output['input'].unique())[0]) # get dict form of input for c5_id
    sent_idx_to_doc_idx = {sent_no : {idx:idx+sum([len(input['sentences'][isent]) for isent in range(sent_no)]) for idx in range(len(input['sentences'][sent_no]))} for sent_no in input['sentences']}
    
    coref_chains = {}
    coref_chain_no = 1
    
    for isent in input['sentences'].keys():
        result = output['results'].iat[isent]
        if result != 'None [+ E]':
            pairs = result.split(';;')
            for pair in pairs:
                if len(pair) > 0:
                    a, b = pair.split('->')
            
                    # find span of a, which will be in the sentence
                    a_coref = a.split('##')[0].split()
                    a_context_right = a.split('##')[1].split('**')[0].split()
                    a_coref_start = find_sublist(input['sentences'][isent], a_coref+a_context_right)
                    a_coref_start = sent_idx_to_doc_idx[isent][a_coref_start] # translate to doc_idx
                    a_coref_end = a_coref_start + len(a_coref) - 1 # calc end idx from start
            
                    # find b
                    mo = re.match(r'\[(\d+)', b.strip()) # check if b is a reference to a previous group, i.e., '[1' means it is a ref to group 1; \d+ also handles groups 10 and above
                    if mo:
                        ref_chain = int(mo.group(1))
                        coref_chains[ref_chain].append([a_coref_start, a_coref_end]) # add coref a to previously existing chain
            
                    # else new coref chain
                    else:
                        
                        # parse b
                        b_coref = b.split('##')[0].split()
                        b_context_right = b.split('##')[1].split('|')[0].split()
                        for ichecksent in range(isent + 1): # check this sentence and all sentences before it
                            b_coref_start = find_sublist(input['sentences'][ichecksent], b_coref+b_context_right)
                            if b_coref_start > -1:
                                b_coref_start = sent_idx_to_doc_idx[ichecksent][b_coref_start] # translate to doc_idx
                                break # stop at the first match so a later sentence cannot overwrite it
                        b_coref_end = b_coref_start + len(b_coref) - 1 # calc end idx from start
            
                        # Now save to coref_chains
                        coref_chains[coref_chain_no] = [[a_coref_start, a_coref_end], [b_coref_start, b_coref_end]]
                        coref_chain_no = coref_chain_no + 1
        
    return list(coref_chains.values())
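
The string handling inside `get_coref_chains` can be easier to follow in isolation. The sketch below uses a hypothetical `parse_pair` helper (not part of the notebook) and assumes the prediction format implied by the sample rows and the parsing code above: `mention ## right context ** ... -> antecedent ## right context | ...`, or `-> [N` when the antecedent is an existing chain.

```python
import re


def parse_pair(pair):
    """Split one 'a -> b' prediction into its parts (hypothetical helper)."""
    a, b = pair.split("->")
    a_mention, a_rest = a.split("##")
    parsed = {
        "mention": a_mention.split(),
        # Words to the right of the mention, used to disambiguate its position.
        "mention_context": a_rest.split("**")[0].split(),
    }
    ref = re.match(r"\[(\d+)", b.strip())
    if ref:
        # '[N' points back to chain number N instead of naming an antecedent.
        parsed["chain_ref"] = int(ref.group(1))
    else:
        b_mention, b_rest = b.split("##")
        parsed["antecedent"] = b_mention.split()
        parsed["antecedent_context"] = b_rest.split("|")[0].split()
    return parsed


# Strings modeled on the sample rows shown earlier (text before ';;').
new_chain = parse_pair("it ## . ** _ -> cabin door ## opened . | ")
back_ref = parse_pair("door ## not secured ** -> [1 ")
print(new_chain)
print(back_ref)
```

The mention and its right-hand context are concatenated before calling `find_sublist`, which is why both are kept as token lists here.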

In [114]:
coref_chains = []
for c5_id in result_df['c5_id'].unique():
    try:
        coref_chains.append(get_coref_chains(c5_id, result_df))
    except Exception: # entries that fail to parse get an empty list of chains
        coref_chains.append([])

In [116]:
# Save coref chains to csv

# get original data
original_df = pd.read_csv('../../data/FAA_data/Maintenance_Text_data_nona.csv')
processed_df = pd.DataFrame({'c5_id':list(original_df['c5'])[:len(coref_chains)], 'c119_input':list(original_df['c119'])[:len(coref_chains)], 'corefs':coref_chains})

In [118]:
processed_df

Unnamed: 0,c5_id,c119_input,corefs
0,19750315005389A,TAILWHEEL COCKED RIGHT PRIOR TO TKOF. ...,[]
1,19750419011349A,TOW PLANE BECAME AIRBORNE THEN SETTLED.STUDENT...,[]
2,19751029037799A,"2ND ILS APCH,ACFT'S G/S INOP.LOM TUNED TO WRON...",[]
3,19751209037899A,PLT NOTED SOFT R BRAKE PEDAL DRG TAXI TO TKOF....,[]
4,19750818025579A,TAXI OFF HARD SFC DUE TFC R MAIN GR BROKE THRO...,[]
...,...,...,...
1155,19870328007669A,THE AIRCRAFT IMPACTED IN A PARKING LOT DURING ...,[]
1156,19870211003519A,BOMB THREAT CAUSED NON SCHEDULED LANDING AND E...,[]
1157,19870410009159A,THE AIRCRAFT GROUND LOOPED ON TAKEOFF ROLL. TH...,"[[[17, 17], [13, 15]]]"
1158,19870401031839A,THE PILOT LOST CONTROL OF THE AIRCRAFT AT ROTA...,[]
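
A sketch of how the processed table could be written to coref_mt5_processed.csv and read back. The output directory is an assumption (a temporary directory is used here), and `demo_df` is a two-row stand-in for `processed_df` built from the rows shown above. CSV stores each list as text, so the corefs column has to be parsed when it is loaded.

```python
import ast
import os
import tempfile

import pandas as pd

# Stand-in for processed_df with two rows taken from the output above.
demo_df = pd.DataFrame({
    "c5_id": ["19870410009159A", "19870401031839A"],
    "corefs": [[[[17, 17], [13, 15]]], []],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Output location is an assumption; only the file name comes from the intro.
    out_path = os.path.join(tmp_dir, "coref_mt5_processed.csv")
    demo_df.to_csv(out_path, index=False)

    # Parse the stringified lists back into nested Python lists on load.
    loaded = pd.read_csv(out_path, converters={"corefs": ast.literal_eval})

round_trip_ok = loaded["corefs"].tolist() == demo_df["corefs"].tolist()
print(round_trip_ok)
```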
