# Preprocessing Files
Different sources and tools represent information in different formats, and the output of one tool may not correspond directly to that of another. In this course, we will mainly (or even exclusively) work with the CoNLL format. Even within this format, files can differ in tokenization, in the class labels used, or in the number of columns. Depending on the exact difference, you may want to adapt the input files or write scripts that handle the differences during processing.

In this notebook, we prepare files containing the output of two different tools for evaluation, where the annotation schemes differ. The setup lets you first convert the files so that their labels match and then run the evaluation (covered in a different notebook). Originally, the two systems used different tokenizations, and both differed from the tokenization of the training and evaluation data. The steps that make the tokens align have already been carried out. We left in some of the basic functions used in that process (e.g. the check of whether tokens align) as examples.
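
To make these kinds of differences concrete, the sketch below parses two tiny, made-up tab-separated samples. The sample rows, the label names (`PERSON`, `B-PER`) and the helper `parse_rows` are illustrative assumptions and not the contents of the course files.

```python
from typing import List

# Hypothetical output of one tool: token, IOB prefix, entity class (3 columns)
tool_a = "John\tB\tPERSON\nlives\tO\t\nhere\tO\t\n"
# Hypothetical gold data: token, POS tag, chunk tag, IOB-prefixed label (4 columns)
gold = "John\tNNP\tB-NP\tB-PER\nlives\tVBZ\tB-VP\tO\nhere\tRB\tB-ADVP\tO\n"


def parse_rows(text: str, delimiter: str = "\t") -> List[List[str]]:
    """Split raw CoNLL-style text into rows of columns, skipping blank lines."""
    return [line.split(delimiter) for line in text.splitlines() if line]


rows_a = parse_rows(tool_a)
rows_gold = parse_rows(gold)

# Same tokens, but a different number of columns and a different label scheme
same_tokens = [r[0] for r in rows_a] == [r[0] for r in rows_gold]
print(same_tokens, len(rows_a[0]), len(rows_gold[0]))
print(rows_a[0][-1], rows_gold[0][-1])
```

In this toy case the tokens already line up, so only the labels and the column positions would need to be reconciled before evaluation.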

In [1]:
import csv
from typing import List, Dict

In [2]:
def matching_tokens(conll1: List, conll2: List) -> bool:
    '''
    Check whether the tokens of two conll files are aligned
    
    :param conll1: tokens (or full annotations) from the first conll file
    :param conll2: tokens (or full annotations) from the second conll file
    
    :returns boolean indicating whether tokens match or not
    '''
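    # files of different length cannot be aligned (and would break a row-by-row comparison)
    if len(conll1) != len(conll2):
        return False
    for row, row2 in zip(conll1, conll2):
        if row[0] != row2[0]:
            return False
    
    return True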
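
As a quick illustration of the token check, the sketch below applies the same idea to small in-memory lists. The function `tokens_align` and the toy rows are hypothetical stand-ins introduced here; treating a difference in length as a mismatch is an assumption about the desired behaviour.

```python
from typing import List


def tokens_align(rows1: List[List[str]], rows2: List[List[str]]) -> bool:
    """Return True if both inputs have the same token (column 0) on every row."""
    if len(rows1) != len(rows2):
        return False
    return all(r1[0] == r2[0] for r1, r2 in zip(rows1, rows2))


system = [["John", "B", "PERSON"], ["lives", "O", ""]]
gold = [["John", "NNP", "B-PER"], ["lives", "VBZ", "O"]]
retokenized = [["John", "NNP", "B-PER"], ["live", "VB", "O"], ["s", "VBZ", "O"]]

print(tokens_align(system, gold))         # True
print(tokens_align(system, retokenized))  # False
```

Only the first column is compared, so the remaining columns may differ freely between the two inputs.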

In [3]:
def read_in_conll_file(conll_file: str, delimiter: str = '\t'):
    '''
    Read in conll file and return structured object
    
    :param conll_file: path to conll_file
    :param delimiter: specifies how columns are separated. Tabs are standard in conll
    
    :returns: List of split rows included in conll file
    '''
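    conll_rows = []
    # column count of the last non-empty row, used to pad empty lines; defined up front in case the file starts with an empty line
    rowlen = 1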
    with open(conll_file, 'r') as my_conll:
        for line in my_conll:
            row = line.strip("\n").split(delimiter)
            if len(row) == 1:
                conll_rows.append([""]*rowlen)
            else:
                rowlen = len(row)
                conll_rows.append(row)
    return conll_rows

In [4]:
def alignment_okay(conll1: str, conll2: str) -> bool:
    '''
    Read in two conll files and see if their tokens align
    '''
    my_first_conll = read_in_conll_file(conll1)
    my_second_conll = read_in_conll_file(conll2)
    
    return matching_tokens(my_first_conll, my_second_conll)

In [5]:
def get_predefined_conversions(conversion_file: str) -> Dict:
    '''
    Read in file with predefined conversions and return structured object that maps old annotation to new annotation
    
    :param conversion_file: path to conversion file
    
    :returns object that maps old annotations to new ones
    '''
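    conversion_dict = {}
    # the with-statement makes sure the file is closed after reading
    with open(conversion_file, 'r', newline='') as my_conversions:
        conversion_reader = csv.reader(my_conversions, delimiter='\t')
        for row in conversion_reader:
            # skip empty or incomplete lines
            if len(row) >= 2:
                conversion_dict[row[0]] = row[1]
    return conversion_dict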
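
The conversion file is read as two tab-separated columns: the original label and the label it maps to. The sketch below writes a small hypothetical mapping to a temporary file and loads it in the same way. The label pairs (`PERSON` to `PER`, and so on) are assumptions for illustration and may not match the contents of `conversions.tsv`.

```python
import csv
import os
import tempfile
from typing import Dict

# Hypothetical mapping: old label <TAB> new label
sample = "PERSON\tPER\nGPE\tLOC\nORG\tORG\n"

with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, newline="") as tmp:
    tmp.write(sample)
    path = tmp.name

mapping: Dict[str, str] = {}
with open(path, "r", newline="") as infile:
    for row in csv.reader(infile, delimiter="\t"):
        if len(row) >= 2:  # ignore blank or incomplete lines
            mapping[row[0]] = row[1]

os.remove(path)  # clean up the temporary file
print(mapping)
```

Labels that do not appear in the first column are simply absent from the dictionary, so they are left unchanged by any later lookup.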

In [6]:
def create_converted_output(conll_rows: List, annotation_identifier: int, conversions: Dict, outputfilename: str, IOB_col : int, delimiter: str = '\t'):
    '''
    Check which annotations need to be converted for the output to match and convert them
    Saves converted data in the outputfile
    
    :param conll_rows: rows with conll annotations
    :param annotation_identifier: indicator of how to find the annotations in the object (index)
    :param conversions: dictionary that maps old annotations to new ones (e.g. created from a local file with get_predefined_conversions)
    :param outputfilename: path to outputfile
    :param IOB_col: integer indicating the index of the column in which IOB labels are located (None if there is no such column)
    '''
    with open(outputfilename, 'w') as outputfile:
        for n, row in enumerate(conll_rows):
            #add column headers to the very first line
            if n == 0:
                outputfile.write(delimiter.join([str(column_index) for column_index in range(len(row))])+"\n")
            #empty rows mark sentence boundaries and are written back as empty lines
            if row[0] == "":
                outputfile.write("\n")
                continue
            annotation = row[annotation_identifier]
            if annotation in conversions:
                #get new annotation label
                new_annotation = conversions[annotation]
                #if token is part of a named entity, add the IOB prefix
                if new_annotation != 'O':
                    #if an IOB column is defined, take the prefix from it, otherwise use 'B' indicating the beginning of a named entity
                    IOB = row[IOB_col] if IOB_col is not None else 'B'
                    new_annotation = f"{IOB}-{new_annotation}"
                #assign the new annotation
                row[annotation_identifier] = new_annotation
            outputfile.write(delimiter.join(row)+"\n")
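
The core of the conversion is the relabelling of a single row: look up the new class and, unless it is `O`, attach an IOB prefix. A minimal sketch of that step in isolation follows. The function `convert_label` and the example mapping are hypothetical names introduced here, and the fallback to `B` mirrors what the function above does when no IOB column is given.

```python
from typing import Dict, List, Optional


def convert_label(row: List[str], label_col: int, conversions: Dict[str, str],
                  iob_col: Optional[int] = None) -> str:
    """Return the converted label for one row, with an IOB prefix where needed."""
    label = row[label_col]
    if label not in conversions:
        return label  # unknown labels are left untouched
    new_label = conversions[label]
    if new_label == "O":
        return new_label
    prefix = row[iob_col] if iob_col is not None else "B"
    return f"{prefix}-{new_label}"


conversions = {"PERSON": "PER", "": "O"}
print(convert_label(["John", "B", "PERSON"], 2, conversions, iob_col=1))   # B-PER
print(convert_label(["Smith", "I", "PERSON"], 2, conversions, iob_col=1))  # I-PER
print(convert_label(["Smith", "PERSON"], 1, conversions))                  # B-PER
print(convert_label(["lives", "O", ""], 2, conversions, iob_col=1))        # O
```

Note that without an IOB column every entity token receives the `B` prefix, so a multi-token name would be written as several consecutive single-token entities. This is worth keeping in mind when interpreting evaluation results.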

In [7]:
def preprocess_files(conll1: str, conll2: str, column_identifiers: List, conversions: str, IOB_col: int):
    '''
    Guides the full process of preprocessing files and outputs the modified files.
    
    :param conll1: path to the first conll input file
    :param conll2: path to the second conll input file
    :param column_identifiers: list with the index of the annotation column for the first and for the second file
    :param conversions: path to a file that defines conversions
    :param IOB_col: integer indicating the index of the column in which IOB labels are located (None if there is no such column)
    '''
    if alignment_okay(conll1, conll2):
        conversions = get_predefined_conversions(conversions)
        my_first_conll = read_in_conll_file(conll1)
        my_second_conll = read_in_conll_file(conll2)
        create_converted_output(my_first_conll, column_identifiers[0], conversions, conll1.replace('.conll','-preprocessed.conll'), IOB_col)
        create_converted_output(my_second_conll, column_identifiers[1], conversions, conll2.replace('.conll','-preprocessed.conll'), IOB_col)
    else:
        print(conll1, conll2, 'do not align')

In [8]:
preprocess_files('../data/spacy_out.dev.conll', '../data/conll2003.dev.conll', [2, 3], './settings/conversions.tsv', IOB_col=1)
preprocess_files('../data/stanford_out.dev.conll', '../data/conll2003.dev.conll', [3, 3], './settings/conversions.tsv', IOB_col=None)