# Interannotator Agreement Study

### **1. Identify an appropriate interannotator agreement measure based on the nature of your annotation, and justify your choice.**

We categorize each reading listed for a course on the MIT OpenCourseWare website as “required reading” or “optional reading”. Because two annotators each assign one of two nominal labels to every item, Scott’s π is a suitable interannotator agreement measure. Scott’s π is widely used and corrects the observed agreement for the agreement expected by chance, which it estimates from the label distribution pooled over both annotators. This correction matters when the categories are imbalanced, as they may be in our data, where "required reading" is potentially more prevalent than "optional reading": raw percent agreement would overstate reliability in that case. Scott’s π therefore gives us a chance-corrected measure of agreement, which we need in order to assess the validity and consistency of our annotations.
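
For reference, Scott’s π for two annotators can be written as below. The notation is introduced here for illustration: $n_{kk}$ is the number of items both annotators assigned to category $k$, $n_{k\cdot}$ and $n_{\cdot k}$ are the row and column totals of the confusion matrix for category $k$, and $N$ is the total number of items.

$$
\pi = \frac{A_0 - A_e}{1 - A_e}, \qquad
A_0 = \frac{1}{N}\sum_{k} n_{kk}, \qquad
A_e = \sum_{k} \left( \frac{n_{k\cdot} + n_{\cdot k}}{2N} \right)^{2}
$$

Here $A_0$ is the observed agreement and $A_e$ is the agreement expected by chance when both annotators draw labels from the same pooled distribution.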

### **2. Calculate interannotator agreement. Include a link to any code you used to calculate interannotator agreement.**

In [1]:
import pandas as pd

In [2]:
data_min=pd.read_csv('../data/dataframes/annotations_by_min.tsv',sep='\t')
data_biya=pd.read_csv('../data/dataframes/annotations_by_biya.tsv',sep='\t')
data_jessie=pd.read_csv('../data/dataframes/annotations_by_jessie.tsv',sep='\t')
data_jinhong=pd.read_csv('../data/dataframes/annotations_by_jinhong.tsv',sep='\t')
data_jae=pd.read_csv('../data/dataframes/annotations_by_jae.tsv',sep='\t')

In [3]:
data_min

| | category | author | title | type | collection | year | course |
|---|---|---|---|---|---|---|---|
| 0 | Optional | [{:family=>"Reynolds", :given=>"T.D."}, {:fami... | Unit Operations and Processes in Environmental... | | | 1996 | 58 |
| 1 | Optional | [{:family=>"Mara", :given=>"D."}] | Domestic Wastewater Treatment in Developing Co... | | | 2003 | 58 |
| 2 | Optional | [{:family=>"Viessman", :given=>"W.", :suffix=>... | Water Supply and Pollution Control | | | 2005 | 58 |
| 3 | Optional | [{:family=>"Tchobanoglous", :given=>"G."}, {:f... | Wastewater Engineering: Treatment and Reuse | | | 2003 | 58 |
| 4 | Optional | [{:family=>"Staff", :given=>"M.W.H."}] | Water Treatment: Principles and Design | | | 2005 | 58 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1710 | Required | [{:family=>"Spielman", :given=>"Daniel"}] | Chapter 16: Spectral Graph Theory | book | | 2007 | 7 |
| 1711 | Optional | [{:given=>"Battiston"}, {:others=>true}] | DebtRank: Too Central to Fail? Financial Networks | | | 2012 | 7 |
| 1712 | Optional | [{:given=>"Akbarpour"}, {:others=>true}] | Just a Few Seeds More: Value of Network Inform... | | | 2018 | 7 |
| 1713 | Optional | [{:given=>"Shah"}, {:given=>"Zaman"}] | Rumors in a Network: Who’s the Culprit? | | | 2011 | 7 |


In [17]:
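def get_agreement(df):
    """Return the Scott's pi agreement score of a confusion matrix

    Parameters
    ----------
    df : pandas dataframe
        square confusion matrix with annotator 1's labels as rows
        and annotator 2's labels as columns

    Returns
    -------
    float
        the agreement score of the dataframe
    """

    total = df.sum().sum()

    # observed agreement: share of items on the diagonal
    # (the UNK/UNK cell is always 0, so it is not added here)
    A_0 = (df.loc['Required', 'Required'] + df.loc['Optional', 'Optional']) / total

    # expected agreement: sum of squared pooled category proportions
    A_e = 0
    n_categories = df.shape[0]
    for i in range(n_categories):
        A_e += ((df.iloc[i, :].sum() + df.iloc[:, i].sum()) / (total * 2)) ** 2

    agreement = (A_0 - A_e) / (1 - A_e)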
    
    return agreement
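
As a sanity check on the formula used in `get_agreement`, the sketch below recomputes Scott's π with NumPy on two made-up confusion matrices. The function name `scott_pi` and both toy matrices are assumptions introduced for illustration, not part of our pipeline. Unlike `get_agreement`, it takes the whole diagonal as observed agreement, so it does not depend on the label names.

```python
import numpy as np
import pandas as pd


def scott_pi(table: pd.DataFrame) -> float:
    """Scott's pi for a square confusion matrix (rows: annotator 1, columns: annotator 2)."""
    counts = table.to_numpy(dtype=float)
    total = counts.sum()
    # observed agreement: share of items on the diagonal
    observed = np.trace(counts) / total
    # pooled proportion of each category over both annotators
    pooled = (counts.sum(axis=1) + counts.sum(axis=0)) / (2 * total)
    expected = float((pooled ** 2).sum())
    return float((observed - expected) / (1 - expected))


labels = ["Required", "Optional"]
# toy data: the annotators agree on every item
perfect = pd.DataFrame([[30, 0], [0, 20]], index=labels, columns=labels)
# toy data: agreement is exactly what chance would produce
chance = pd.DataFrame([[25, 25], [25, 25]], index=labels, columns=labels)

print(scott_pi(perfect))
print(scott_pi(chance))
```

Perfect agreement should give a score of 1 and chance-level agreement a score of 0, which is a quick way to confirm the implementation before applying it to real annotations.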

In [5]:
def get_dataframe_for_calculation(df_1, df_2):
    """Return the confusion matrix for agreement calculation

    Parameters
    ----------
    df_1 : pandas dataframe
        the annotations produced by annotator 1
    df_2 : pandas dataframe
        the annotations produced by annotator 2

    Returns
    -------
    pandas dataframe
        counts with annotator 1's labels as rows and annotator 2's labels as columns
    """

    # inner join: readings annotated by both annotators
    merged_df = df_1.merge(df_2, on=['title', 'course', 'author', 'type', 'year', 'collection'])
    merged_df["X_Y"] = list(zip(merged_df["category_x"], merged_df["category_y"]))

    # initialize dataframe
    df_t = pd.DataFrame({'Required': [0, 0, 0],
                         'Optional': [0, 0, 0],
                         'UNK': [0, 0, 0]},
                         index=['Required', 'Optional', 'UNK'])

    # count RR/RO/OR/OO (single .loc call avoids chained assignment)
    dic = merged_df["X_Y"].value_counts().to_dict()
    for key, val in dic.items():
        df_t.loc[key[0], key[1]] = val

    # left join for UNK: readings that only annotator 1 has
    # (note: unlike the inner join, this join does not key on 'course')
    left_df = df_1.merge(df_2.drop_duplicates(), on=['title', 'author', 'type', 'year', 'collection'], how='left', indicator=True)
    left_df = left_df[left_df['_merge'] == 'left_only']

    dic_left = left_df["category_x"].value_counts().to_dict()
    for key, val in dic_left.items():
        df_t.loc[key, 'UNK'] = val

    # right join for UNK: readings that only annotator 2 has
    right_df = df_1.merge(df_2.drop_duplicates(), on=['title', 'author', 'type', 'year', 'collection'], how='right', indicator=True)
    right_df = right_df[right_df['_merge'] == 'right_only']

    dic_right = right_df["category_y"].value_counts().to_dict()
    for key, val in dic_right.items():
        df_t.loc['UNK', key] = val
        
    return df_t
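
The same kind of table can also be built with one outer join and `pd.crosstab`. The sketch below is an alternative for illustration, not the method used above: the helper name `confusion_with_unk` and the toy frames are assumptions, and it uses a single key list for every join, so on the real data its counts may differ from those of `get_dataframe_for_calculation`.

```python
import pandas as pd

LABELS = ["Required", "Optional", "UNK"]


def confusion_with_unk(df_1, df_2, keys):
    """Cross-tabulate two annotators' labels; items missing on one side count as UNK."""
    # outer join keeps readings that only one annotator has
    merged = df_1.merge(df_2, on=keys, how="outer", suffixes=("_x", "_y"))
    # a missing label means the other annotator has no matching row
    label_cols = ["category_x", "category_y"]
    merged[label_cols] = merged[label_cols].fillna("UNK")
    table = pd.crosstab(merged["category_x"], merged["category_y"])
    # fix the row/column order and add any category that never occurred
    return table.reindex(index=LABELS, columns=LABELS, fill_value=0)


# toy annotations (made up for illustration)
ann_1 = pd.DataFrame({"title": ["A", "B", "C"], "course": [1, 1, 2],
                      "category": ["Required", "Optional", "Required"]})
ann_2 = pd.DataFrame({"title": ["A", "B", "D"], "course": [1, 1, 2],
                      "category": ["Required", "Required", "Optional"]})

table = confusion_with_unk(ann_1, ann_2, keys=["title", "course"])
print(table)
```

Passing the key columns as an argument would let one function serve both the filtered files (keyed on bibliographic fields) and the pre-filter files (keyed on `course` and `reading`).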

In [6]:
get_dataframe_for_calculation(data_jae, data_biya)

| | Required | Optional | UNK |
|---|---|---|---|
| Required | 1774 | 97 | 257 |
| Optional | 6 | 829 | 92 |
| UNK | 252 | 153 | 0 |


In [7]:
get_agreement(get_dataframe_for_calculation(data_jae, data_biya))

0.5436060176122685

In [8]:
get_dataframe_for_calculation(data_jinhong, data_min)

| | Required | Optional | UNK |
|---|---|---|---|
| Required | 528 | 19 | 28 |
| Optional | 12 | 615 | 5 |
| UNK | 322 | 228 | 0 |


In [9]:
get_agreement(get_dataframe_for_calculation(data_jinhong, data_min))

0.44040682105304646
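
The two scores above can be reproduced directly from the printed confusion matrices, without reloading the annotation files. The sketch below copies the counts from the tables; the function name `scott_pi` and the variable names are assumptions introduced for illustration.

```python
import numpy as np


def scott_pi(counts) -> float:
    """Scott's pi for a square confusion matrix given as nested lists."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    observed = np.trace(counts) / total
    # pooled category proportions over both annotators
    pooled = (counts.sum(axis=1) + counts.sum(axis=0)) / (2 * total)
    expected = float((pooled ** 2).sum())
    return float((observed - expected) / (1 - expected))


# rows and columns in the order Required, Optional, UNK,
# copied from the two tables printed above
jae_biya = [[1774, 97, 257], [6, 829, 92], [252, 153, 0]]
jinhong_min = [[528, 19, 28], [12, 615, 5], [322, 228, 0]]

print(round(scott_pi(jae_biya), 4))     # 0.5436
print(round(scott_pi(jinhong_min), 4))  # 0.4404
```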

### **3. Discuss whether you think your annotation is reliable or not. If your annotator agreement is low, please talk about why it is low, and what you might do to improve it.**

- Chance-corrected agreement scores such as Scott's π are at most 1: a score of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance.

- Biya-Jae refers to Bingyang Hou and Jae Ihn as a pair of annotators. Min-Jinhong refers to Min Zeng and Jinhong Liu as a pair of annotators.

- For the annotator pair Biya-Jae, Scott's π is 0.54, which we interpret as reasonably reliable annotation. In this field, a score of 0.8 is typically considered excellent, while a score of 0.5 is considered fairly good, so 0.54 falls within the range regarded as reliable. There is still room for improvement in the annotation process, and it would be worth continuing to monitor agreement on more datasets if we refine the project after graduation.

- For the annotator pair Min-Jinhong, Scott's π is 0.44, which indicates that the annotation is less reliable. The lower score reflects substantial disagreement between the two annotators and suggests problems in the annotation process, possibly unclear guidelines or ambiguous data. Moving forward, we should improve the process, for example by providing clearer guidelines or better training for the annotators, to address the issues behind the lower score.
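
One way to see where the disagreement comes from is to split the off-diagonal counts of each confusion matrix into label conflicts (Required vs. Optional) and unmatched readings (any cell in the UNK row or column). The sketch below does this for the two matrices reported in section 2. The function name `disagreement_breakdown` is an assumption introduced for illustration, and reading UNK cells as "unmatched readings" follows how the tables were built rather than any separate analysis.

```python
import numpy as np


def disagreement_breakdown(counts):
    """Split a Required/Optional/UNK confusion matrix into agreement,
    label conflicts, and readings present for only one annotator."""
    counts = np.asarray(counts)
    total = int(counts.sum())
    agree = int(np.trace(counts))
    # UNK is the last row/column; subtract the shared corner cell once
    unmatched = int(counts[-1, :].sum() + counts[:, -1].sum() - counts[-1, -1])
    label_conflict = total - agree - unmatched
    return {"total": total, "agree": agree,
            "label_conflict": label_conflict, "unmatched": unmatched}


# counts copied from the confusion matrices in section 2
jae_biya = [[1774, 97, 257], [6, 829, 92], [252, 153, 0]]
jinhong_min = [[528, 19, 28], [12, 615, 5], [322, 228, 0]]

for name, matrix in [("Biya-Jae", jae_biya), ("Min-Jinhong", jinhong_min)]:
    print(name, disagreement_breakdown(matrix))
```

In both matrices the unmatched count is several times larger than the label-conflict count, so checking how readings are matched across annotators may be as important as clarifying the guidelines.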

# Experimenting with Annotation Options (OPTIONAL)

In [10]:
data_prefilter_biya=pd.read_csv('../data/dataframes/pre_filter_annotations_by_biya.tsv',sep='\t')
data_prefilter_jae=pd.read_csv('../data/dataframes/pre_filter_annotations_by_jae.tsv',sep='\t')

data_prefilter_jinhong=pd.read_csv('../data/dataframes/pre_filter_annotations_by_jinhong.tsv',sep='\t')
data_prefilter_min=pd.read_csv('../data/dataframes/pre_filter_annotations_by_min.tsv',sep='\t')

In [11]:
data_prefilter_jae

| | category | reading | course |
|---|---|---|---|
| 0 | Required | Grassberger, P., and I. Procaccia. “Characteri... | 1022 |
| 1 | Required | Sauer, T., J. A. Yorke, and M. Casdagli. “Embe... | 1022 |
| 2 | Required | Ziehmann, C., L. A. Smith, and J. Kurths. “Loc... | 1022 |
| 3 | Required | Smith, L. A., C. Ziehmann, and K. Fraedrich. “... | 1022 |
| 4 | Required | Gilmour, I., L. A. Smith, and R. Buizza. “Line... | 1022 |
| ... | ... | ... | ... |
| 5298 | Optional | Boville, B., and P. Gent. “The NCAR Climate Sy... | 1013 |
| 5299 | Optional | Solomon, A., and P. H. Stone. “Equilibration i... | 1013 |
| 5300 | Optional | Gleckler, P. “Surface energy balance errors in... | 1013 |
| 5301 | Required | Pedlosky, Joseph. Waves in the Ocean and Atmos... | 1007 |


In [12]:
def get_dataframe_for_calculation_prefilter(df_1, df_2):
    """Return the confusion matrix for agreement calculation (pre-filter data)

    Parameters
    ----------
    df_1 : pandas dataframe
        the annotations produced by annotator 1
    df_2 : pandas dataframe
        the annotations produced by annotator 2

    Returns
    -------
    pandas dataframe
        counts with annotator 1's labels as rows and annotator 2's labels as columns
    """

    # inner join: readings annotated by both annotators
    merged_df = df_1.merge(df_2, on=['course', 'reading'])
    merged_df["X_Y"] = list(zip(merged_df["category_x"], merged_df["category_y"]))

    # initialize dataframe
    df_t = pd.DataFrame({'Required': [0, 0, 0],
                         'Optional': [0, 0, 0],
                         'UNK': [0, 0, 0]},
                         index=['Required', 'Optional', 'UNK'])

    # count RR/RO/OR/OO (single .loc call avoids chained assignment)
    dic = merged_df["X_Y"].value_counts().to_dict()
    for key, val in dic.items():
        df_t.loc[key[0], key[1]] = val

    # left join for UNK: readings that only annotator 1 has
    # (note: unlike the inner join, this join does not key on 'course')
    left_df = df_1.merge(df_2.drop_duplicates(), on='reading', how='left', indicator=True)
    left_df = left_df[left_df['_merge'] == 'left_only']

    dic_left = left_df["category_x"].value_counts().to_dict()
    for key, val in dic_left.items():
        df_t.loc[key, 'UNK'] = val

    # right join for UNK: readings that only annotator 2 has
    right_df = df_1.merge(df_2.drop_duplicates(), on='reading', how='right', indicator=True)
    right_df = right_df[right_df['_merge'] == 'right_only']

    dic_right = right_df["category_y"].value_counts().to_dict()
    for key, val in dic_right.items():
        df_t.loc['UNK', key] = val
        
    return df_t

In [13]:
get_dataframe_for_calculation_prefilter(data_prefilter_biya, data_prefilter_jae)

| | Required | Optional | UNK |
|---|---|---|---|
| Required | 2693 | 6 | 619 |
| Optional | 168 | 1366 | 293 |
| UNK | 790 | 276 | 0 |


In [14]:
get_agreement(get_dataframe_for_calculation_prefilter(data_prefilter_biya, data_prefilter_jae))

0.4043043569401901

In [15]:
get_dataframe_for_calculation_prefilter(data_prefilter_jinhong, data_prefilter_min)

| | Required | Optional | UNK |
|---|---|---|---|
| Required | 573 | 17 | 84 |
| Optional | 23 | 642 | 17 |
| UNK | 390 | 294 | 0 |


In [16]:
get_agreement(get_dataframe_for_calculation_prefilter(data_prefilter_jinhong, data_prefilter_min))

0.36498653473378395

We experimented with ways to improve the interannotator agreement score for our project. Our team member Jae Ihn wrote an annotation filter in Ruby, which raised the score for one pair of annotators (Biya-Jae) from `0.40` to `0.54`. The score for the other pair (Min-Jinhong) also improved, from `0.36` to `0.44`, but it remains lower, suggesting that other factors need to be addressed to improve agreement further. One potential factor is the parser's limited handling of special characters, which forced us to discard some data. To address this, we can learn more about Ruby and develop a better filter that catches more special cases in the input annotation files. Overall, experimenting with the annotation filter shows that the annotation process can be improved step by step toward higher-quality annotations.
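
As an illustration of handling special characters before matching readings, the sketch below normalizes a citation string so that differences in quote style, spacing, and letter case do not prevent a match. It is a hypothetical Python helper, not the Ruby filter described above: the name `normalize_reading`, the set of characters it maps, and the two example strings are all assumptions.

```python
import unicodedata

# map curly quotes to their straight equivalents
_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize_reading(text: str) -> str:
    """Return a matching key for a citation string."""
    # fold compatibility characters (e.g. ligatures) into plain forms
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_QUOTES)
    # collapse runs of whitespace and ignore letter case
    return " ".join(text.split()).casefold()


# two made-up renderings of the same citation
a = "Shah and Zaman. \u201cRumors in a Network: Who\u2019s the Culprit?\u201d"
b = "Shah and  Zaman. \"Rumors in a network: Who's the culprit?\""

print(normalize_reading(a) == normalize_reading(b))
```

A key like this could be added as an extra column and used in place of the raw `reading` string when merging two annotators' files, which might reduce the number of readings that end up in the UNK row or column.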