## ANNOTATION TOOL

### 1. Settings

Adjust the settings below to match your environment.

In [1]:
annotations_path = "C:\\PythonScripting\\test\\NLP_ANNOTATION_TOOL"
output_path = "C:\\PythonScripting\\test\\NLP_ANNOTATION_TOOL\\output"
annotator = "David"
all_kpi = [0, 1, 2]
# kpi_of_interest = all_kpi
kpi_of_interest = [1, 2]
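
The paths above use hard-coded Windows separators. As a minimal sketch of a portable alternative, the same locations can be expressed with `pathlib`. The relative folder name below is hypothetical, and the later cells concatenate strings with `"\\"`, so they would need to be adapted before adopting this.

```python
from pathlib import Path

# Hypothetical relative base folder; replace it with your own location.
base_dir = Path("NLP_ANNOTATION_TOOL")
output_dir = base_dir / "output"
annotations_file = base_dir / "annotations.xlsx"

# Path.name yields the file name on any platform, without splitting on "\\".
example_output = output_dir / "test2_predictions_kpi.csv"
print(example_output.name)
```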

In [2]:
print("All KPI's are " + ", ".join([str(x) for x in all_kpi]) + ".")
print("KPI's of interest are " + ", ".join([str(x) for x in kpi_of_interest]) + ".")

All KPI's are 0, 1, 2.
KPI's of interest are 1, 2.


### 2. Preloading

Load the packages and the existing annotations (just execute).

In [3]:
import pandas as pd
import os
import glob
import numpy as np

df_annotations = pd.read_excel(annotations_path + "\\annotations.xlsx")
df_out = df_annotations.copy()

### 3. Annotations overview

The following cell lists the existing prediction files and reports which of them still need annotations (just execute).

In [4]:
outputs = glob.glob(output_path + "\\*")
outputs = [x.rsplit("\\", 1)[1] for x in outputs]
if len(outputs) > 1:
    print(f"There are {len(outputs)} output files in the output folder:" + "\n" + "\n".join([str(x) for x in outputs]))
elif len(outputs) == 1:
    print("There is one output file in the output folder.")
else:
    print("There are no new files in the output folder.")


for output in outputs:
    print("\n Next file considered: " + output)
    df_output = pd.read_csv(output_path + "\\" + output)
    pdf_name = df_output["pdf_name"].values[0]
    df_annotations_temp = df_annotations[df_annotations["source_file"] == pdf_name]
    kpis_contained = [x for x in df_annotations_temp["kpi_id"].values if x in kpi_of_interest]
    if len(kpis_contained) > 1:
        print(
            'For pdf with name "'
            + pdf_name
            + '" in file '
            + output
            + " we have already annotations for the kpi's "
            + ", ".join([str(x) for x in kpis_contained])
            + "."
        )
    elif len(kpis_contained) == 1:
        print(
            'For pdf with name "'
            + pdf_name
            + '" in file '
            + output
            + " we have already an annotation for the kpi "
            + ", ".join([str(x) for x in kpis_contained])
            + "."
        )
    else:
        print('For pdf with name "' + pdf_name + "\" we have no annotations yet for the kpi's under investigation.")
    # Compare as sets so that order and duplicates in the annotations do not matter
    if set(kpi_of_interest).issubset(kpis_contained):
        print("DONE: All kpi's of interest have been annotated for this file.")
    else:
        print("TODO: There are open annotations for that file.")

There are 2 output files in the output folder:
test2_predictions_kpi.csv
test_predictions_kpi.csv

 Next file considered: test2_predictions_kpi.csv
For pdf with name "test2.pdf" in file test2_predictions_kpi.csv we have already annotations for the kpi's 1, 2.
DONE: All kpi's of interest have been annotated for this file.

 Next file considered: test_predictions_kpi.csv
For pdf with name "test.pdf" in file test_predictions_kpi.csv we have already annotations for the kpi's 1, 2.
DONE: All kpi's of interest have been annotated for this file.
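
The status check in the loop above can also be factored into a small function. The following is only a sketch: the name `kpi_status` and the toy data are hypothetical, and it assumes the annotations use the `source_file` and `kpi_id` columns as in the cell above. It uses sets, so the result does not depend on the order of the annotations.

```python
import pandas as pd


def kpi_status(df_annotations, pdf_name, kpi_of_interest):
    """Return (annotated, open) KPI ids of interest for one PDF, each sorted."""
    rows = df_annotations[df_annotations["source_file"] == pdf_name]
    annotated = set(rows["kpi_id"]) & set(kpi_of_interest)
    open_kpis = set(kpi_of_interest) - annotated
    return sorted(annotated), sorted(open_kpis)


# Toy data for illustration only.
df_demo = pd.DataFrame(
    {
        "source_file": ["test.pdf", "test2.pdf", "test2.pdf", "test2.pdf"],
        "kpi_id": [2, 2, 1, 2],
    }
)
annotated, open_kpis = kpi_status(df_demo, "test2.pdf", [1, 2])
print(annotated, open_kpis)
```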


### 4. Check new annotations

Please set the file you want to investigate.

In [5]:
output_file = "test2_predictions_kpi.csv"

List of open tasks (just execute)

In [6]:
df_output = pd.read_csv(output_path + "\\" + output_file)
pdf_name = df_output["pdf_name"].values[0]
df_annotations_temp = df_annotations[df_annotations["source_file"] == pdf_name]
kpis_contained = [x for x in df_annotations_temp["kpi_id"].values if x in kpi_of_interest]
open_kpis = [x for x in kpi_of_interest if x not in kpis_contained]
if len(open_kpis) > 1:
    print("The open kpi's are " + ", ".join([str(x) for x in open_kpis]) + ".")
elif len(open_kpis) == 1:
    print("The open kpi is " + ", ".join([str(x) for x in open_kpis]) + ".")
else:
    print("There are no open kpi's.")

There are no open kpi's.


### 4.1 Detailed investigation

Please set the kpi you want to investigate.

In [7]:
kpi_to_investigate = 2

Show the model's predictions for the selected kpi (just execute).

In [8]:
# Check specific KPIs
df_output = pd.read_csv(output_path + "\\" + output_file)
df_output_check = df_output[df_output["kpi_id"] == kpi_to_investigate]
df_output_check.head(4)

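| Unnamed: 0.1 | Unnamed: 0 | pdf_name | kpi | kpi_id | answer | page | paragraph | source | score | no_ans_score | no_answer_score_plus_boost |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 9 | test2.pdf | KPI 2 | 2 | test2 | [0] | [test] | Text | 0 | 0 | 0 |
| 9 | 10 | test2.pdf | KPI 2 | 2 | test2 | [0] | [test] | Text | 1 | 1 | 1 |
| 10 | 11 | test2.pdf | KPI 2 | 2 | test2 | [0] | [test] | Text | 2 | 2 | 2 |
| 11 | 12 | test2.pdf | KPI 2 | 2 | test2 | [0] | [test] | Text | 3 | 3 | 3 |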
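
When a file has many candidate rows, it may help to look at the highest-scoring ones first. The sketch below uses toy data and assumes that a higher `score` means a more confident prediction, which the notebook does not state. Sorting keeps the original index labels, so they can still be used as ids in section 4.2.

```python
import pandas as pd

# Toy predictions for illustration only.
df_demo = pd.DataFrame(
    {
        "kpi_id": [1, 2, 2, 2],
        "answer": ["a", "b", "c", "d"],
        "score": [0.9, 0.1, 0.7, 0.4],
    }
)
kpi_to_investigate = 2

# Filter to the KPI, then sort by score (assumption: higher is better).
candidates = df_demo[df_demo["kpi_id"] == kpi_to_investigate].sort_values(
    "score", ascending=False
)
print(candidates.head(2))
```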


### 4.2 Set answer

Set the row ids (index values of the table above) that contain the correct paragraph and the correct answer. If no row contains the correct paragraph or answer, set the respective id to -1 and provide the value in the corresponding "correct_*" variables.

In [9]:
id_correct_paragraph = 9
id_correct_answer = 10

# Only if paragraph is not contained
correct_paragraph = "David wears a red shirt."
correct_paragraph_page = 24
correct_paragraph_source = "TEXT"  # Either "TEXT" or "TABLE"

# Only if answer is not contained
correct_answer = "red"

company = "DUMMY"
year = 2022
sector = "OG"
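
A mistyped id only surfaces later as a `KeyError` in section 4.3. As a hedged sketch, the ids can be checked right after setting them. The helper `check_row_id` and the list of valid ids below are hypothetical; in the notebook the valid ids would be `list(df_output_check.index)`.

```python
def check_row_id(row_id, valid_ids, name):
    """Return row_id if it is -1 or one of valid_ids, otherwise raise ValueError."""
    if row_id != -1 and row_id not in valid_ids:
        raise ValueError(
            f"{name}={row_id} is neither -1 nor one of {sorted(valid_ids)}"
        )
    return row_id


# Illustrative ids; in the notebook use list(df_output_check.index).
valid_ids = [8, 9, 10, 11]
checked_paragraph = check_row_id(9, valid_ids, "id_correct_paragraph")
checked_answer = check_row_id(-1, valid_ids, "id_correct_answer")
```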

### 4.3 Generate annotation outcome

After setting the correct annotations, generate a new entry for the annotations file (just execute).

In [11]:
df_temp = df_annotations.head(0)
if id_correct_paragraph == -1:
    paragraph = "[" + correct_paragraph + "]"
    source_page = "[" + str(correct_paragraph_page) + "]"
    source = correct_paragraph_source
else:
    paragraph = df_output_check.loc[id_correct_paragraph, "paragraph"]
    source_page = df_output_check.loc[id_correct_paragraph, "page"]
    source = df_output_check.loc[id_correct_paragraph, "source"]

if id_correct_answer == -1:
    answer = correct_answer
else:
    answer = df_output_check.loc[id_correct_answer, "answer"]


new_data = [
    np.max(df_out["number"].values) + 1,
    company,
    df_output_check["pdf_name"].values[0],
    source_page,
    kpi_to_investigate,
    year,
    answer,
    source,
    paragraph,
    annotator,
    sector,
    "",
]
# DataFrame.append was removed in pandas 2.0, so build a one-row frame and use pd.concat
df_temp = pd.DataFrame([new_data], columns=df_temp.columns, index=[np.max(df_out.index) + 1])
df_out = pd.concat([df_out, df_temp])
df_out.tail(4)

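| Unnamed: 0 | number | company | source_file | source_page | kpi_id | year | answer | data_type | relevant_paragraphs | annotator | sector | issue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2 | TEST_INC | test.pdf | [0] | 2 | 2020 | test | TEXT | [test] | Tester | OG | |
| 3 | 3 | TEST_INC_2 | test2.pdf | [0] | 0 | 2020 | test2 | TEXT | [test] | Tester | OG | |
| 4 | 4 | DUMMY | test2.pdf | [0] | 1 | 2022 | te | Text | [test] | David | OG | |
| 5 | 5 | DUMMY | test2.pdf | [0] | 2 | 2022 | test2 | Text | [test] | David | OG | |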
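
Running the cell above twice adds the same annotation twice. A minimal sketch of a guard against this is shown below. The helper name `already_annotated` and the toy data are hypothetical, and the check assumes that one annotation per `source_file` and `kpi_id` is intended, which the notebook does not state explicitly.

```python
import pandas as pd


def already_annotated(df_out, pdf_name, kpi_id):
    """Return True if df_out already holds a row for this PDF and KPI."""
    mask = (df_out["source_file"] == pdf_name) & (df_out["kpi_id"] == kpi_id)
    return bool(mask.any())


# Toy annotations for illustration only.
df_demo = pd.DataFrame(
    {"source_file": ["test.pdf", "test2.pdf"], "kpi_id": [2, 0]}
)
print(already_annotated(df_demo, "test2.pdf", 0))
print(already_annotated(df_demo, "test2.pdf", 1))
```

In the notebook, such a check could run before the new row is appended to `df_out`.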


Check whether there are still open kpi's. If so, start again at section 4.1 (just execute).

In [12]:
df_out_temp = df_out[df_out["source_file"] == pdf_name]
kpis_contained = [x for x in df_out_temp["kpi_id"].values if x in kpi_of_interest]
open_kpis = [x for x in kpi_of_interest if x not in kpis_contained]
if len(open_kpis) > 1:
    print("The open kpi's are " + ", ".join([str(x) for x in open_kpis]) + ".")
elif len(open_kpis) == 1:
    print("The open kpi is " + ", ".join([str(x) for x in open_kpis]) + ".")
else:
    print("There are no open kpi's.")

There are no open kpi's.


### 4.4 Export outcome

In [15]:
df_out.to_excel(annotations_path + "\\annotations.xlsx", index=False)

Note: After exporting the new annotations, restart the Jupyter notebook from the beginning if you want to check the annotations of another file.
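
The export overwrites `annotations.xlsx` in place. As a precaution, a copy of the existing file can be kept before exporting. The following is only a sketch: the helper `backup_file` and the `.bak` suffix are hypothetical choices, demonstrated here on a temporary stand-in file.

```python
import shutil
import tempfile
from pathlib import Path


def backup_file(path):
    """Copy path to '<name>.bak' next to it; return the backup path or None if path is missing."""
    path = Path(path)
    if not path.exists():
        return None
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)
    return backup


# Demonstration on a temporary file standing in for annotations.xlsx.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "annotations.xlsx"
    demo.write_bytes(b"demo")
    backup = backup_file(demo)
    backup_name = backup.name
    backup_content = backup.read_bytes()
print(backup_name)
```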

## Notes:

* The output file does not contain year, company and sector, so these values have to be set manually in section 4.2.