# User annotation pre-processing script

This script can be used to manually clean the user annotations for the AcX system.

### Preparation:
 - Use google-cloud-sdk to download the annotations in .json format.
 - Add the .json files to a single folder (I called it 'google_annotation/json_data/', but you can of course change the name and the code).
 - The first time you clean the data, comment out the following lines:
     - clean_annotations = pd.read_csv('./clean_data/clean_annotations_nl.csv')
     - already_clean_docs = set(clean_annotations['doc_id'])
     - !! ALSO, REMOVE "already_clean_docs" from dirty_docs = total_docs - already_clean_docs !!
 - Undo all of this once you've cleaned up the first documents and have a clean_annotations_XX.csv file.
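
As an alternative to commenting lines in and out, the first-run case could be handled with a file-existence check. The sketch below is only an illustration: the helper name `load_clean_doc_ids` is hypothetical and not part of this script, and it assumes the clean file has a `doc_id` column, as the cells below do.

```python
from pathlib import Path

import pandas as pd


def load_clean_doc_ids(csv_path):
    """Return the set of already cleaned doc_ids, or an empty set on the first run."""
    path = Path(csv_path)
    if not path.exists():
        # First run: there is no clean file yet, so every document counts as dirty
        return set()
    return set(pd.read_csv(path, encoding='utf-8')['doc_id'])


# Usage with toy values: nothing is subtracted while the clean file does not exist
total_docs = {'UML', 'JavaScript'}
already_clean_docs = load_clean_doc_ids('./clean_data/does_not_exist_yet.csv')
dirty_docs = total_docs - already_clean_docs
```

With this guard the same cell would work on both the first and later runs, at the cost of silently treating a mistyped path as "nothing cleaned yet".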

### Cleaning the documents:
* [Difference Finder](#Difference-Finder): The code in these cells checks the differences between the annotators of the same document. There are two common differences: 
     1. One annotator missed an acronym
     2. The annotators wrote the acronym or expansion in two different ways.
A document is reported if one or both of these differences occur.

* [Manual cleaning](#Manual-cleaning):
Cleaning the documents is done with pandas. Please follow these steps:
    1. Enter the document of interest --> document = '....'
    2. Check the output of differences and add the row numbers in the cell below (row_num_1 & row_num_2).
    3. Use the next cell to change the values inside the data frame. You can change values with the .iloc[.., ..] statement, or add a row by filling in the extra_row variable and appending it to the data frame.
    4. All the acronyms and expansions for the document of interest are printed below for one last check.
    5. [Saving changes](#Saving-changes): Finally, you can append and save the output to clean_annotations_xx.csv. I would recommend doing this after every document.
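
The manual steps above can be tried on a toy data frame first. This is a minimal sketch, not the notebook's actual data: the rows and the `example.com` addresses are made up, and the column order (acronym at position 0, expansion at position 1) is an assumption taken from the cells further down. The extra row is added with `pd.concat`, because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Toy annotations: two annotators wrote the same expansion with different casing
annotation_df = pd.DataFrame({
    'acronym': ['г.', 'г.'],
    'expansion': ['година', 'Година'],
    'language': ['bg', 'bg'],
    'type': ['out_expansion', 'out_expansion'],
    'doc_id': ['JavaScript', 'JavaScript'],
    'annotator': ['a@example.com', 'b@example.com'],
})

# Step 2: compare the two suspicious rows (column 1 is the expansion)
row_num_1, row_num_2 = 0, 1
same_expansion = annotation_df.iloc[row_num_1, 1] == annotation_df.iloc[row_num_2, 1]

# Step 3a: overwrite the deviating value in place
annotation_df.iloc[row_num_2, 1] = 'година'

# Step 3b: add an annotation that one annotator missed
extra_row = {'acronym': 'т.е.', 'expansion': 'тоест', 'language': 'bg',
             'type': 'out_expansion', 'doc_id': 'JavaScript',
             'annotator': 'b@example.com'}
annotation_df = pd.concat([annotation_df, pd.DataFrame([extra_row])],
                          ignore_index=True)

# Step 4: one last look at the document
print(annotation_df[annotation_df['doc_id'] == 'JavaScript'].sort_values(by='acronym'))
```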

### Some other issues you might encounter 
- Mail issue: The document IDs are generated by splitting the names of the .json files on an underscore (_). However, some email addresses contain an underscore, which produces incorrect document IDs. This is solved by matching those addresses explicitly, so add every email address that contains an underscore to the "exception_mails" list.
- Documents with only one annotator: Some documents will not yet be annotated by two people. You can ignore these documents for now.
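
To make the mail issue concrete, here is a small sketch of the splitting logic. It assumes the files are named `<doc_id>_<annotator email>.json`, which is what the loading code below implies; the function name `split_annotation_filename` and the `example.com` addresses are hypothetical.

```python
def split_annotation_filename(filename, exception_mails=()):
    """Split '<doc_id>_<annotator>.json' into (doc_id, annotator)."""
    stem = filename[:-len('.json')] if filename.endswith('.json') else filename

    # Addresses that contain an underscore have to be matched explicitly
    for mail in exception_mails:
        if stem.endswith('_' + mail):
            return stem[:-(len(mail) + 1)], mail

    # Default: everything after the last underscore is the annotator
    doc_id, _, annotator = stem.rpartition('_')
    return doc_id, annotator


# An underscore inside the doc_id is harmless ...
regular = split_annotation_filename('Би_Би_Си_a@example.com.json')

# ... but an underscore inside the address breaks the split unless it is listed
broken = split_annotation_filename('UML_john_doe@example.com.json')
fixed = split_annotation_filename('UML_john_doe@example.com.json',
                                  exception_mails=['john_doe@example.com'])
```

Here `broken` ends up with the document ID `UML_john`, while `fixed` recovers `UML` because the address is in the exception list.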

In [15]:
"""
Script for processing and cleaning the google drive annotations.
The output is a .csv file with the acronyms and expansions per document.
Date: 23-05-2022
"""


In [1]:
import os
import pandas as pd
import re

## Loading the Annotations

In [2]:
rootdir = 'D:/University/Thesis/annotations/annotations'
df = pd.DataFrame(columns=['acronym', 'expansion', 'language', 'type'])

# Loading the clean annotations (only run this if you already have clean annotations)
clean_annotations = pd.read_csv('D:/University/thesis/annotations/Code/clean_data/clean_annotations_2.csv', encoding='utf-8')


In [3]:

def cleaning_raw_annotation(df, rootdir):
    # Creating the file directories
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            abs_path = os.path.join(subdir, file)
            individual_json = pd.read_json(abs_path, encoding="utf-8")

            # Extracting the annotators
            exception_mails = ['contact@iliyasgeorgiev.com']

            # Filter for mails with an underscore: default to the text after the
            # last underscore (minus '.json'), then override if a listed mail matches
            annotators = file.split('_')[-1][:-5]
            for i in exception_mails:
                # re.escape keeps the '.' in the address from acting as a wildcard
                if re.search(re.escape(i), file):
                    annotators = i
                    break

            # Extracting the document ID's (file is the bare name, so no path splitting is needed)
            doc_id = file.replace(annotators, '')[:-6]

            # Transforming the json  in a proper format
            individual_json = individual_json.transpose()
            individual_json.reset_index(inplace=True)
            individual_json = individual_json.rename(columns={'index':'acronym'})
            individual_json['doc_id'] = doc_id
            individual_json['annotator'] = annotators

            # adding everything together in a df
            df = pd.concat([df, individual_json], ignore_index=True)

            # fill the missing values in the language column
            df['language'] = df['language'].replace('', "other language")
            
            # remove additional white spaces
            df['acronym'] = df['acronym'].str.strip()
            df['expansion'] = df['expansion'].str.strip()

    return df



In [5]:
# Run the line below only at the very beginning! All clean annotations will be lost if it is executed.
# annotation_df = cleaning_raw_annotation(df, rootdir)
annotation_df = clean_annotations

# annotation_df

In [6]:
# Corpus info
print('Number of annotators: {}'.format(len(set(annotation_df['annotator']))))
print('Number of documents: {}'.format(len(set(annotation_df['doc_id']))))
print('Number of clean documents: {}'.format(len(set(clean_annotations['doc_id']))))

Number of annotators: 8
Number of documents: 102
Number of clean documents: 102


## Difference Finder <a class="anchor" id="Difference-Finder"></a>
The following code section looks for the differences between the annotators of the same document.
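
Before running the check on the real data, it may help to see the two tests on a toy example. This is a sketch with made-up rows and `example.com` annotators; only the `drop_duplicates(..., keep=False)` call and the odd-row-count test mirror the cell below.

```python
import pandas as pd

toy = pd.DataFrame({
    'acronym': ['САЩ', 'САЩ', 'г.', 'г.', 'IBM'],
    'expansion': ['Съединени американски щати', 'Съединени американски щати',
                  'година', 'Година',
                  'International Business Machines Corporation'],
    'doc_id': ['UML'] * 5,
    'annotator': ['a@example.com', 'b@example.com', 'a@example.com',
                  'b@example.com', 'a@example.com'],
})

# Check 1: an odd number of rows means one annotator has an extra acronym
has_odd_row_count = len(toy) % 2 != 0

# Check 2: keep=False removes every (acronym, expansion) pair that occurs more
# than once, so only annotations without a matching partner are left
unmatched = toy.drop_duplicates(subset=['acronym', 'expansion'], keep=False)
print(unmatched.sort_values(by='acronym'))
```

The two САЩ rows agree and disappear; the г. rows differ only in casing and both stay, together with the IBM row that only one annotator entered.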

In [24]:
# Documents that still need to be cleaned
total_docs = set(annotation_df['doc_id'])
already_clean_docs = set(clean_annotations['doc_id'])
print(len(already_clean_docs))
print('Number of documents that need to be cleaned: {}'.format(len(total_docs.difference(already_clean_docs))))
# dirty document info
dirty_docs = total_docs - already_clean_docs
print("Number of dirty documents:{}".format(len(dirty_docs)))


# Return all doc_id's with an uneven number of acronyms (which means...)
print("\nDOCUMENTS WITH AN UNEVEN NUMBER OF ACRONYMS:")
dirt_1 = []
for i in set(annotation_df['doc_id']):
    if i in dirty_docs:
        if len(annotation_df[annotation_df['doc_id'] == str(i)]) % 2 != 0:
            dirt_1.append(i)
            print("  -", i)
        
print("\nDOCUMENTS WITH DUPLICATE ROWS:")
dirt_2 = []
for i in set(annotation_df['doc_id']):
    sub_set_df = annotation_df[annotation_df['doc_id'] == i]
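    if i in dirty_docs:
        # keep=False drops every row that has a duplicate, so any row left over
        # is an annotation that the other annotator did not match
        if len(sub_set_df.drop_duplicates(subset=["acronym", "expansion"], keep=False)) != 0: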
            dirt_2.append(i)
            print("  -", i)

del sub_set_df

102
Number of documents that need to be cleaned: 0
Number of dirty documents:0

DOCUMENTS WITH AN UNEVEN NUMBER OF ACRONYMS:

DOCUMENTS WITH DUPLICATE ROWS:


## Adding clean documents to the clean data

In [25]:
all_dirty =  dirt_2 + dirt_1
doc_subs = set.union(already_clean_docs, all_dirty)


all_docs = set(annotation_df['doc_id'])
good_docs = all_docs - doc_subs
print('Clean documents: \n{}'.format(good_docs))

clean_annotations = pd.DataFrame(columns=['acronym', 'expansion', 'language', 'type'])

# adding all clean documents together
clean_doc_all = pd.DataFrame(columns=['acronym', 'expansion','language', 'type', 'doc_id','annotator'])

for i in good_docs:
    df_empty = pd.DataFrame(columns=['acronym', 'expansion','language', 'type', 'doc_id','annotator'])
    sub_df = annotation_df[annotation_df['doc_id'] == i]
    clean_doc = pd.concat([df_empty, sub_df])
    clean_doc_all = pd.concat([clean_doc_all, clean_doc])
    
if set(clean_doc_all['doc_id']) == good_docs:
    print('\nAll good')
    

clean_annotations = pd.concat([clean_annotations, clean_doc_all])

# Save the changes in the main .csv file
# clean_annotations.to_csv('./clean_data/clean_annotations_nl.csv', index=False)

Clean documents: 
set()

All good
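
The cell above comes down to set arithmetic followed by a row filter. As a hedged sketch on toy data (the rows are made up, and `isin` is swapped in here for the per-document loop used above), the same selection could be written as:

```python
import pandas as pd

annotation_df = pd.DataFrame({
    'acronym': ['OMG', 'OMG', 'ИТ', 'ИТ', 'IBM'],
    'doc_id': ['UML', 'UML', 'Интел', 'Интел', 'Берлин'],
})
already_clean_docs = {'UML'}   # cleaned in an earlier session
all_dirty = ['Берлин']         # flagged by the difference finder

# Good documents are neither cleaned already nor flagged as dirty
good_docs = set(annotation_df['doc_id']) - already_clean_docs.union(all_dirty)

# Select all rows of the good documents in one step
clean_doc_all = annotation_df[annotation_df['doc_id'].isin(good_docs)]
```

Unlike the loop, this keeps the rows in their original order instead of grouping them per document.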


## Manual cleaning<a class="anchor" id="Manual-cleaning"></a>

In [9]:
print(annotation_df.iloc[18, 1])
print(annotation_df.iloc[1, 1])


Година
Celsius


In [405]:
# show the annotations that have issues
document = "Шампионска_лига_на_УЕФА"

sub_set_df = annotation_df[annotation_df['doc_id'] == document]
sub_set_df.drop_duplicates(subset=["acronym", "expansion"], keep=False).sort_values(by='acronym')

Unnamed: 0,acronym,expansion,language,type,doc_id,annotator


In [29]:
# Fill in the row indices to see if the values are the same
row_num_1 = 12
row_num_2 = 18

# Results
print("Are the acronyms the same:")
print(annotation_df.iloc[row_num_1, 0] == annotation_df.iloc[row_num_2, 0])
print("\nAre the expansion the same:")
print(annotation_df.iloc[row_num_1, 1] == annotation_df.iloc[row_num_2, 1])
print(annotation_df.iloc[row_num_1, 1])
print(annotation_df.iloc[row_num_2, 1])

Are the acronyms the same:
True

Are the expansion the same:
True
година
година


In [404]:
pd.set_option('display.max_rows', None)
# fill in the index of the cell you want to change
# annotation_df.iloc[1296, 1] = "Български революционен централен комитет"
# annotation_df.iloc[1290, 1] = "Multichannel Multipoint Distribution Service"
# annotation_df.iloc[1284, 1] = "Съединени Американски Щати"
# annotation_df.iloc[1293, 1] = "Телевизия"

# Use the code below if you need to add a new row
# extra_row = {'acronym':'m²', 'expansion':'squared metre', 'language':'en', 'type':'out_expansion', 'doc_id':document, 'annotator':'contact@iliyasgeorgiev.com'}
extra_row = {'acronym':'SiO2', 'expansion':'Silicon dioxide', 'language':'en', 'type':'out_expansion', 'doc_id':document, 'annotator':'contact@iliyasgeorgiev.com'}
# extra_row = {'acronym':'IBM', 'expansion':'International Business Machines Corporation', 'language':'en', 'type':'in_expansion', 'doc_id':document, 'annotator':'contact@iliyasgeorgiev.com'}
# extra_row = {'acronym':'Хр.', 'expansion':'Христа', 'language':'bg', 'type':'out_expansion', 'doc_id':document, 'annotator':'contact@iliyasgeorgiev.com'}
# extra_row = {'acronym':'AG', 'expansion':'Aktiengesellschaft', 'language':'none_of_above', 'type':'out_expansion', 'doc_id':document, 'annotator':'contact@iliyasgeorgiev.com'}


# Show the results
# delete row
# annotation_df = annotation_df.drop([508])
# annotation_df = annotation_df.drop(index=508).reset_index(drop=True)
# DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
annotation_df = pd.concat([annotation_df, pd.DataFrame([extra_row])], ignore_index=True)         # <-- comment out if you do not need to add an extra row
annotation_df.to_csv('./clean_data/clean_annotations_2.csv', index=False)
results = annotation_df[annotation_df['doc_id'] == document]  
results.sort_values(by='acronym')

Unnamed: 0,acronym,expansion,language,type,doc_id,annotator
1322,SiO2,Silicon dioxide,en,out_expansion,Чип,dsgeorgiev90@yahoo.com
1506,SiO2,Silicon dioxide,en,out_expansion,Чип,contact@iliyasgeorgiev.com
1323,напр.,например,bg,out_expansion,Чип,dsgeorgiev90@yahoo.com
1324,напр.,например,bg,out_expansion,Чип,georgitidorov4508@gmail.com


In [None]:
# results.drop(565, inplace=True)

## Saving the changes<a class="anchor" id="Saving-changes"></a>

In [22]:
# Save the changes
df_empty = pd.DataFrame(columns=['acronym', 'expansion','language', 'type', 'doc_id','annotator'])
clean_annotations_sub = pd.concat([df_empty, results])
clean_annotations = pd.concat([clean_annotations, clean_annotations_sub])

# Save the changes in the main .csv file
clean_annotations.to_csv('./clean_data/clean_annotations.csv', index=False)

In [408]:
set(clean_annotations['doc_id'].sort_values())


{'Conus_eugrammatus',
 'JavaScript',
 'UML',
 'Административен_акт',
 'Акционерно_дружество',
 'Американска_психологична_асоциация',
 'Амстердамски_университет',
 'БНТ_1',
 'БНТ_2',
 'Банкова_консолидационна_компания',
 'Баркод',
 'Берлин',
 'Би_Би_Си',
 'Би_Ти_Ви',
 'Бобслей',
 'Ботаника',
 'Българска_народна_банка',
 'Българска_социалистическа_партия',
 'Васил_Левски',
 'Водноелектрическа_централа',
 'Волейбол',
 'Временно_руско_управление',
 'Гъвкав_магнитен_диск',
 'Дисниленд',
 'Драйвер',
 'Дуги_оток',
 'Държавен_вестник',
 'Държавен_съвет_на_Народна_република_България',
 'Европейска_централна_банка',
 'Железният_човек',
 'Закон_за_трудовата_поземлена_собственост',
 'Златен_глобус',
 'Иван_Славков',
 'Изобразително_изкуство',
 'Интегрална_схема',
 'Интел',
 'Ирландия',
 'Капитан_Америка',
 'Княжество_България',
 'Конституционен_съд_на_България',
 'Крис_Евънс',
 'Леброн_Джеймс',
 'Лека_атлетика',
 'Лукойл_Нефтохим_Бургас',
 'Майкъл_Фелпс',
 'Марк_Ръфало',
 'Международна_автомобилна

In [None]:
# pd.set_option('display.max_rows', None)
# clean_annotations[clean_annotations['doc_id'] =='Waterschapsverkiezingen'].sort_values(by='acronym')

In [407]:
clean_annotations.drop_duplicates(keep='last').to_csv('./clean_data/clean_annotations_without_duplicates.csv', index=False)	

In [430]:
# Build {doc_id: {'in_expansion': {expansion: acronym}, 'out_expansion': {...}}}
bulgarian_dataset = {}
for _, annotation in clean_annotations.iterrows():
    # Every type other than 'in_expansion' is stored under 'out_expansion'
    key = 'in_expansion' if annotation['type'] == 'in_expansion' else 'out_expansion'
    # setdefault creates the nested dictionaries the first time they are needed
    doc_entry = bulgarian_dataset.setdefault(annotation['doc_id'], {})
    doc_entry.setdefault(key, {})[annotation['expansion']] = annotation['acronym']


print (bulgarian_dataset)

{'Conus_eugrammatus': {'out_expansion': {'meter': 'm', 'Celsius': '°C', 'Китайската народна република': 'КНР', 'Съединени Американски Щати': 'САЩ'}}, 'JavaScript': {'out_expansion': {'ECMAScript 6': 'ES6', 'HyperText Markup Language': 'HTML', 'JavaScript Object Notation': 'JSON', 'Съединени американски щати': 'САЩ', 'година': 'г.', 'тоест': 'т.е.', 'Microsoft Visual Basic Scripting Edition': 'VBScript'}}, 'UML': {'in_expansion': {'Object Management Group': 'OMG'}, 'out_expansion': {'Object-modeling technique': 'OMT', 'Object-Oriented Software Engineering': 'OOSE', 'Unified Modeling Language': 'UML', 'Информационните технологии': 'ИТ', 'International Business Machines Corporation': 'IBM', 'Съединени американски щати': 'САЩ'}}, 'Административен_акт': {'in_expansion': {'Административнопроцесуалния кодекс': 'АПК'}, 'out_expansion': {'Държавен вестник': 'ДВ', 'алинея': 'ал.', 'брой': 'бр.', 'тоест': 'т.е.', 'член': 'чл.'}}, 'Акционерно_дружество': {'in_expansion': {'Акционерно дружество': '