### Reformat ReFinED

This notebook takes the output of refined_faa.py and converts it into a form that the evaluation scripts can readily consume. The result is stored in tool_results/refined.

In [52]:
import pandas as pd
import re

In [53]:
result_df = pd.read_csv('aida_wikipedia.csv')
result_df.head()

Unnamed: 0,c5,c119,c119_entity_linking
0,19750315005389A,TAILWHEEL COCKED RIGHT PRIOR TO TKOF. ...,[]
1,19750419011349A,TOW PLANE BECAME AIRBORNE THEN SETTLED.STUDENT...,[]
2,19751029037799A,"2ND ILS APCH,ACFT'S G/S INOP.LOM TUNED TO WRON...","[['ACFT', Entity(wikidata_entity_id=Q67935434,..."
3,19751209037899A,PLT NOTED SOFT R BRAKE PEDAL DRG TAXI TO TKOF....,[]
4,19750818025579A,TAXI OFF HARD SFC DUE TFC R MAIN GR BROKE THRO...,"[['XI', Entity not linked to a knowledge base,..."


In [54]:
result_df['c119_entity_linking'].iat[0] # example

'[]'

"[['TAILWHEEL', Entity(wikidata_entity_id=Q2874355, wikipedia_entity_title=Conventional landing gear), None], ['TKOF', Entity(wikidata_entity_id=Q7690028, wikipedia_entity_title=Taylor knock-out factor), None]]"

**Extract entities and links from c119_entity_linking**

In [55]:
out_dict = {'c5_id':[],'c119_input':[],'c119_entity_linking':[], 'mentions':[],'labels':[],'entities':[],'qids':[]}
# Raw strings keep the regex backslashes from being read as (invalid) string escapes
values_p = re.compile(r"\[?\['([^']+)', (Entity not linked to a knowledge base|Entity\([^\)]+\)), (None|[A-Z]+)\],? ?(.*)") # returns groups ent, linked_ent, label, rest
id_title_p = re.compile(r'Entity\(wikidata_entity_id=(Q[0-9]+)(, wikipedia_entity_title=)?([^\)]+)?\)') # returns groups Qid, title separator, Wikipedia title

for i in range(len(result_df)):
    
    text = result_df['c119_entity_linking'].iat[i]
    while text:
        
        mo = re.match(values_p, text)
    
        if mo:
            ent, linked_ent, label, text = mo.groups()
    
            # Extract QID and title from linked_ent,
            # leaving empty strings where there is no data
            if linked_ent == "Entity not linked to a knowledge base":
                qid = ""
                title = ""
            else:
                # groups() always returns three items: QID, separator, title
                qid, _, title = id_title_p.match(linked_ent).groups()
                title = title or ""  # title is None when no Wikipedia title is present
            if label == "None":
                label = ""
    
            out_dict['c5_id'].append(result_df['c5'].iat[i])
            out_dict['c119_input'].append(result_df['c119'].iat[i])
            out_dict['c119_entity_linking'].append(result_df['c119_entity_linking'].iat[i])
            out_dict['mentions'].append(ent)
            out_dict['labels'].append(label)
            out_dict['entities'].append(title)
            out_dict['qids'].append(qid)
    
        else:
            text = None
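
To make the behaviour of the two patterns easier to check, here is a minimal sketch that applies the same extraction to single strings. `parse_linking` is a hypothetical helper introduced only for this illustration; it is not part of refined_faa.py or ReFinED. The sample input is the example string shown earlier in this notebook.

```python
import re

# Same patterns as in the extraction cell, written here as raw strings.
values_p = re.compile(r"\[?\['([^']+)', (Entity not linked to a knowledge base|Entity\([^\)]+\)), (None|[A-Z]+)\],? ?(.*)")
id_title_p = re.compile(r"Entity\(wikidata_entity_id=(Q[0-9]+)(, wikipedia_entity_title=)?([^\)]+)?\)")

def parse_linking(text):
    """Hypothetical helper: return (mention, qid, title, label) tuples for one cell value."""
    rows = []
    while text:
        mo = values_p.match(text)
        if mo is None:
            break  # nothing parseable left, e.g. '[]' or a trailing ']'
        mention, linked_ent, label, text = mo.groups()
        qid, title = "", ""
        id_mo = id_title_p.match(linked_ent)  # None for unlinked entities
        if id_mo:
            qid, _, title = id_mo.groups()
        rows.append((mention, qid, title or "", "" if label == "None" else label))
    return rows

sample = ("[['TAILWHEEL', Entity(wikidata_entity_id=Q2874355, "
          "wikipedia_entity_title=Conventional landing gear), None], "
          "['TKOF', Entity(wikidata_entity_id=Q7690028, "
          "wikipedia_entity_title=Taylor knock-out factor), None]]")
rows = parse_linking(sample)
for row in rows:
    print(row)
```

One limitation to be aware of: both patterns stop at the first `)`, so a Wikipedia title that itself contains parentheses would not match, and the remaining entities in that string would be skipped. Whether such titles occur in this dataset has not been checked here.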

In [56]:
out_df = pd.DataFrame(out_dict)
out_df.head()

Unnamed: 0,c5_id,c119_input,c119_entity_linking,mentions,labels,entities,qids
0,19770912040629A,APRX 1745LBS OVR MAX GWT.ENG NOT FEATHERED.AUT...,"[['APRX 1745LBS', Entity not linked to a knowl...",APRX 1745LBS,,,
1,19780108002219I,FORCED LANDING AFTER ENGINE QUIT. FOUND FROZEN...,"[['UEL', Entity not linked to a knowledge base...",UEL,,,
2,19780221000179I,PILOT MADE EMERGENCY LANDING DUE TO LOW OIL PR...,"[['PILOT', Entity not linked to a knowledge ba...",PILOT,,,
3,19780327010619I,PILOT LOST FUEL IN FLIGHT DUE TO FUEL CAPS ON ...,"[['FUEL', Entity not linked to a knowledge bas...",FUEL,,,
4,19780325010349I,ENGINE STOPPED ON FINAL APPROACH DUE TO WATER ...,"[['FUEL', Entity not linked to a knowledge bas...",FUEL,,,


**Save the output DataFrame to CSV**

In [57]:
out_df.to_csv('../../tool_results/refined/refined_aida_wikipedia.csv')
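
By default `to_csv` also writes the row index, which reappears as an `Unnamed: 0` column when the file is read back, as in the input table at the top of this notebook. The sketch below shows the difference on toy data, using an in-memory buffer rather than the real output path. Whether the evaluation scripts expect that index column is not stated here, so treat `index=False` as an option rather than a required change.

```python
import io
import pandas as pd

# Toy frame standing in for out_df (values taken from the examples above)
df = pd.DataFrame({"mentions": ["TAILWHEEL", "FUEL"], "qids": ["Q2874355", ""]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False omits it, so the round trip preserves the column names
no_index = io.StringIO()
df.to_csv(no_index, index=False)
no_index.seek(0)
cols_no_index = list(pd.read_csv(no_index).columns)

print(cols_with_index)
print(cols_no_index)
```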