# Interannotator Agreement Study

Measure: We explored both kappa and pi, and our team agreed that pi is the better measure for this study. We ruled out S because its assumption, that all coders' annotations are uniformly distributed over the labels, is not likely true in our case. Moreover, the crowdsourcing nature of Mechanical Turk makes evaluating with kappa problematic: kappa estimates a separate label distribution for each coder, but each HIT may be annotated by a different set of workers.
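To illustrate why the uniform-distribution assumption matters, here is a minimal from-scratch sketch of S and multi-coder pi on toy data. The function names (`observed_agreement`, `s_coefficient`, `multi_pi`) and the toy annotations are assumptions made for illustration; this is not the `nltk` implementation, and it assumes every item has the same number of annotations.

```python
from collections import Counter, defaultdict
from itertools import combinations

def observed_agreement(triples):
    """Mean share of agreeing coder pairs, averaged over items."""
    by_item = defaultdict(list)
    for _coder, item, label in triples:
        by_item[item].append(label)
    per_item = []
    for labels in by_item.values():
        pairs = list(combinations(labels, 2))
        per_item.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_item) / len(per_item)

def s_coefficient(triples, n_labels=2):
    """S: expected agreement assumes labels are uniformly distributed."""
    ao = observed_agreement(triples)
    ae = 1 / n_labels
    return (ao - ae) / (1 - ae)

def multi_pi(triples):
    """pi: expected agreement uses the pooled label distribution."""
    ao = observed_agreement(triples)
    counts = Counter(label for _coder, _item, label in triples)
    total = sum(counts.values())
    ae = sum((n / total) ** 2 for n in counts.values())
    return (ao - ae) / (1 - ae)

# Hypothetical toy data: 4 items, 3 coders each, labels skewed towards 'F'.
toy_labels = {'i1': 'FFF', 'i2': 'FFT', 'i3': 'FTF', 'i4': 'FFF'}
toy = [('C{}'.format(idx), item, label)
       for item, labels in toy_labels.items()
       for idx, label in enumerate(labels)]

print('S  =', round(s_coefficient(toy), 3))
print('pi =', round(multi_pi(toy), 3))
```

On this skewed toy set, S is positive (about 0.333) while pi is negative (-0.2), because pi expects more agreement by chance when one label dominates.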

In [1]:
import pandas as pd
from collections import defaultdict
import nltk
import json
from nltk.metrics.agreement import AnnotationTask

In [2]:
def convert2annotation_dict(fname):
    # Read a Mechanical Turk results CSV and return, for each facet,
    # a nested mapping HITId -> WorkerId -> 'T' or 'F'.
    annotated=pd.read_csv(fname)
    annotated=annotated.to_dict('records')
    facets=['Content','Course delivery','Difficulty','Workload','Usefulness']
    blank_annotation={facet:defaultdict(dict) for facet in facets}
    for i in annotated:
        # Labels the worker selected for this HIT
        identified_facets=json.loads(i['Answer.taskAnswers'])[0]['category']['labels']
        for facet in blank_annotation.keys():  # default every facet to 'F' for this worker
            blank_annotation[facet][i['HITId']][i['WorkerId']]='F'
        for identified in identified_facets:  # then mark the selected facets as 'T'
            blank_annotation[identified][i['HITId']][i['WorkerId']]='T'
    return blank_annotation
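As a small illustration of the parsing step in `convert2annotation_dict`, the sketch below turns one row into per-facet 'T'/'F' flags. The row contents (`hit_a`, `worker_1`, and the selected labels) are hypothetical; only the column names and the JSON path are taken from the function above.

```python
import json

FACETS = ['Content', 'Course delivery', 'Difficulty', 'Workload', 'Usefulness']

# Hypothetical row; only the three columns the notebook reads are shown.
row = {
    'HITId': 'hit_a',
    'WorkerId': 'worker_1',
    'Answer.taskAnswers': '[{"category": {"labels": ["Content", "Workload"]}}]',
}

# Same JSON path as in convert2annotation_dict.
selected = json.loads(row['Answer.taskAnswers'])[0]['category']['labels']

# Every facet defaults to 'F'; the selected ones become 'T'.
flags = {facet: ('T' if facet in selected else 'F') for facet in FACETS}
print(flags)
```

Each worker therefore contributes one binary judgement per facet for every HIT they complete, which is why agreement is computed separately for each facet.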

In [3]:
def convert2triplets(annotation_of_HITId):
    # Build (coder, item, label) triples for AnnotationTask.
    # Coder names are positional ('C0', 'C1', ...) within each HIT, so the
    # same name can refer to different workers on different HITs.
    trips=[]
    for HITId, ann_work in annotation_of_HITId.items():
        for worker_idx, label in enumerate(ann_work.values()):
            trips.append(('C{}'.format(worker_idx),HITId,label))
    return trips

In [4]:
def find_3(trips):
    agg=defaultdict(int)
    for rec in trips:
        agg[rec[1]]+=1
    l3=[i[0] for i in agg.items() if i[1]==3]
    return l3
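The two helpers `convert2triplets` and `find_3` can be sketched together on toy data as follows. The HIT and worker IDs are hypothetical, and the logic is re-expressed compactly here rather than copied from the cells above.

```python
from collections import Counter

# Hypothetical annotations for one facet: HITId -> WorkerId -> 'T'/'F'.
facet_annotations = {
    'hit_a': {'worker_1': 'T', 'worker_2': 'F', 'worker_3': 'F'},
    'hit_b': {'worker_4': 'F', 'worker_1': 'F'},  # only two annotations
    'hit_c': {'worker_2': 'T', 'worker_5': 'T', 'worker_6': 'F'},
}

# (coder, item, label) triples; coder names are positional within each HIT.
triples = [
    ('C{}'.format(idx), hit_id, label)
    for hit_id, by_worker in facet_annotations.items()
    for idx, label in enumerate(by_worker.values())
]

# Keep only the HITs that received exactly three annotations.
per_hit = Counter(hit_id for _coder, hit_id, _label in triples)
complete = {hit_id for hit_id, n in per_hit.items() if n == 3}
kept = [t for t in triples if t[1] in complete]

for triple in kept:
    print(triple)
```

Note that 'C0' is `worker_1` on `hit_a` but `worker_2` on `hit_c`: the coder slots do not track individual workers, so a measure that pools all coders' labels is easier to interpret here than one that models each coder separately.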

In [5]:
def calculate_pi(blank_annotation):
    # Compute pi per facet, using only HITs with exactly three annotations.
    Facet_pi={}
    for Facet in blank_annotation.keys():
        trips=convert2triplets(blank_annotation[Facet])
        l3=find_3(trips)
        trips_f=[i for i in trips if i[1] in l3]
        annotation_task = AnnotationTask(trips_f)
        Facet_pi[Facet]=annotation_task.pi()
    return Facet_pi

In [6]:
def calculate_kp(blank_annotation):
    # Compute kappa per facet, using only HITs with exactly three annotations.
    Facet_kp={}
    for Facet in blank_annotation.keys():
        trips=convert2triplets(blank_annotation[Facet])
        l3=find_3(trips)
        trips_f=[i for i in trips if i[1] in l3]
        annotation_task = AnnotationTask(trips_f)
        Facet_kp[Facet]=annotation_task.kappa()
    return Facet_kp

In [7]:
pi_sampled_reviews_annotated=calculate_pi(convert2annotation_dict('sampled_reviews_annotated.csv'))
pi_sampled_review_annotated_0=calculate_pi(convert2annotation_dict('sampled_review_annotated_0.csv'))
pi_sampled_review_annotated_1=calculate_pi(convert2annotation_dict('sampled_review_annotated_1.csv'))

In [8]:
kp_sampled_reviews_annotated=calculate_kp(convert2annotation_dict('sampled_reviews_annotated.csv'))
kp_sampled_review_annotated_0=calculate_kp(convert2annotation_dict('sampled_review_annotated_0.csv'))
kp_sampled_review_annotated_1=calculate_kp(convert2annotation_dict('sampled_review_annotated_1.csv'))

In [9]:
pi_sampled_reviews_annotated

{'Content': -0.06944444444444459,
 'Course delivery': -0.14230769230769238,
 'Difficulty': 0.292857142857143,
 'Workload': -0.018518518518518705,
 'Usefulness': -0.004347826086956478}

In [10]:
pi_sampled_review_annotated_0

{'Content': -0.21008403361344544,
 'Course delivery': -0.10052910052910057,
 'Difficulty': -0.23076923076923078,
 'Workload': 0.0,
 'Usefulness': -0.22222222222222232}

In [11]:
pi_sampled_review_annotated_1

{'Content': -0.008403361344537785,
 'Course delivery': -0.11111111111111101,
 'Difficulty': -0.20000000000000007,
 'Workload': -0.14285714285714285,
 'Usefulness': -0.2187500000000001}

In [12]:
kp_sampled_reviews_annotated

{'Content': -0.06560187216615483,
 'Course delivery': -0.09308775731310952,
 'Difficulty': 0.24806201550387597,
 'Workload': 0.0013506098847759402,
 'Usefulness': 0.04323827478345696}

In [13]:
kp_sampled_review_annotated_0

{'Content': -0.20839160839160842,
 'Course delivery': -0.0900537634408602,
 'Difficulty': -0.1472261072261072,
 'Workload': 0.023833309547595254,
 'Usefulness': -0.2185909328766472}

In [14]:
kp_sampled_review_annotated_1

{'Content': 0.03857907112017762,
 'Course delivery': -0.10974739546168118,
 'Difficulty': -0.1624288865668176,
 'Workload': -0.12360931833548468,
 'Usefulness': -0.12087628200160168}

In [15]:
pi_full_reviews_annotated=calculate_pi(convert2annotation_dict('reviews_full_annotated.csv'))
kp_full_reviews_annotated=calculate_kp(convert2annotation_dict('reviews_full_annotated.csv'))

In [16]:
## Our final result
pi_full_reviews_annotated

{'Content': -0.05073439412484699,
 'Course delivery': 0.07680593889787751,
 'Difficulty': 0.08123172870785426,
 'Workload': 0.02020202020202022,
 'Usefulness': 0.05876010781671176}

In [17]:
kp_full_reviews_annotated

{'Content': -0.050426254188860764,
 'Course delivery': 0.07705805314453486,
 'Difficulty': 0.0812049889045453,
 'Workload': 0.0203497054778624,
 'Usefulness': 0.05875687152163197}

# Interannotator Study - Discussion

## How reliable is the annotation?
The annotation is not very reliable. On the full data set, pi ranges from about -0.05 (Content) to about 0.08 (Difficulty), which is close to chance-level agreement. We attribute this to the crowdsourcing approach as well as the nature of the task. Some factors that may have reduced the reliability of the annotation:
- We cannot guarantee that workers read and understood the instructions completely before annotating.
- The annotation labels are general. Although our intention was to capture the general aspects mentioned in the course reviews, this generality leaves the labels open to subjective interpretation.

## What can be done to improve?
Given enough time, the best option would be to train individual annotators on the task to ensure annotation quality.