<a href="https://colab.research.google.com/github/hsschachter/annotation_project/blob/main/Data_Splitter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Splitter Script

This script splits your data into n files (where n is the number of people in your group) and ensures that the files meet the requirements in the [AP4 Guideline](https://bcourses.berkeley.edu/courses/1542734/assignments/8862636).

As a reminder, these are the requirements for the data:

* each document should be labeled **independently by two group members**

* each group member should annotate **250 documents**

This script will ensure your data meets those criteria.
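
Taken together, the two requirements fix how many documents the split uses: n annotators labeling 250 documents each produce 250·n labels, and with two labels per document that is 125·n documents. A minimal sketch of that arithmetic follows; the function name `required_documents` is illustrative and is not part of the script in this notebook.

```python
def required_documents(num_annotators, docs_per_annotator=250, labels_per_doc=2):
    """Number of distinct documents needed so that every annotator labels
    docs_per_annotator documents and every document gets labels_per_doc labels."""
    total_labels = num_annotators * docs_per_annotator
    # The labels must divide evenly among the documents.
    assert total_labels % labels_per_doc == 0, "workload does not divide evenly"
    return total_labels // labels_per_doc

for n in (2, 3, 4):
    print(n, "annotators ->", required_documents(n), "documents")
```

For a group of four this gives 500 documents.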

To get started, make sure your dataset:

* has at least 125 documents per group member (500 for a group of four).
* is a tab-separated file with **exactly** two columns: **ID** and **text**
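
If your documents currently live in a Python list, something like the following sketch could write them in this layout. The names `write_tsv` and `my_data.tsv` are hypothetical, and replacing tabs and line breaks inside the text with single spaces is an assumption about how you want to handle them. The checker in this notebook splits each line on tabs, so a stray tab or newline inside a document would otherwise show up as an extra column or row.

```python
def write_tsv(rows, path):
    """Write (doc_id, text) pairs as one tab-separated line per document."""
    with open(path, "w", encoding="utf-8") as out:
        for doc_id, text in rows:
            # Collapse tabs, newlines, and repeated spaces so each document stays on one line.
            clean_text = " ".join(str(text).split())
            out.write("%s\t%s\n" % (doc_id, clean_text))

# Hypothetical example data; replace with your own (ID, text) pairs.
example_rows = [("doc1", "A first\tdocument."), ("doc2", "A second\ndocument.")]
write_tsv(example_rows, "my_data.tsv")
```

The sketch writes no header row, because the checker reads every line of the file as a document.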


Run this notebook to check that your data is in the proper format and to split it among your annotators. There are two things you need to do:

1. Change this path to point to your data file.

In [None]:
path_to_file="500_sampled_quotes.tsv"

2. Now execute the rest of the cells below, setting `num_annotators` in the final cell to the size of your group. If a cell raises an error or reports a failure, go back and correct your data so that it is in the proper format.

In [None]:
import random
from random import shuffle

# Fixed seed so the split is reproducible across runs.
random.seed(1)

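def check_file(filename, num_annotators=4):
    """Validate a two-column TSV file, then split it so that every document goes to two annotators."""
    # Each annotator labels 250 documents and each document gets 2 labels,
    # so the split uses 250 * num_annotators / 2 documents.
    num_docs=125*num_annotators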
    data=[]
    n_rows=0
    with open(filename, encoding="utf-8") as file:
        for idx, line in enumerate(file):
            cols=line.rstrip().split("\t")
            assert len(cols) == 2, "row %s does not have 2 columns: %s" % (idx, cols)
            assert len(cols[0]) > 0, "ID #%s# in row %s is empty" % (cols[0], idx)
            assert len(cols[1]) > 0, "text #%s# in row %s is empty" % (cols[1], idx)
            n_rows+=1
            data.append((cols[0], cols[1]))

        assert n_rows >= num_docs, "You must have at least %s documents; this file only has %s" % (num_docs, n_rows)

        print("This file looks to be in the correct format; %s data points" % n_rows)


    shuffle(data)
    data=data[:num_docs]
    # Maps each annotator index to the list of (ID, text) pairs assigned to them.
    annotators={}

    for i in range(num_annotators):
        annotators[i]=[]

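    # Every unordered pair of annotators, e.g. (0,1), (0,2), ... (2,3) for 4 annotators.
    pairs=[]
    for i in range(num_annotators):
        for j in range(i+1, num_annotators):
            pairs.append((i,j))

    # Pool of pair assignments, one per document, cycling evenly through all pairs.
    indexes=[]
    for i in range(num_docs//len(pairs)):
        for p in pairs:
            indexes.append(p)

    # With 4 annotators, 83 full cycles of the 6 pairs cover 498 of the 500 documents;
    # two disjoint pairs fill the last 2 slots so every annotator gets exactly 250.
    if num_annotators == 4:
        indexes.append(pairs[0])  # 0,1
        indexes.append(pairs[-1]) # 2,3

    # Other group sizes only balance when num_docs is a multiple of the number of pairs.
    assert len(indexes) == num_docs, "cannot build a balanced split for %s annotators" % num_annotators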

    for datum in data:
        # Draw a pair from the pool without replacement and give the document to both members.
        idx = random.choice(indexes)

        annotators[idx[0]].append(datum)
        annotators[idx[1]].append(datum)

        indexes.remove(idx)

    for ann_idx in annotators:
        print("annotator", ann_idx, len(annotators[ann_idx]), "data points to annotate")

    for i in annotators:
        with open("output_annotation_file_%s.txt" % i, "w", encoding="utf-8") as out:
            out.write("ID\tLabel\tText\n")

            for idd, text in annotators[i]:
                out.write("%s\t\t%s\n" % (idd, text))




In [None]:
check_file(path_to_file, num_annotators=4)

This file looks to be in the correct format; 1000 data points
annotator 0 250 data points to annotate
annotator 1 250 data points to annotate
annotator 2 250 data points to annotate
annotator 3 250 data points to annotate


# Output

If the cell above ran successfully, you should find `num_annotators` files named `output_annotation_file_0.txt`, `output_annotation_file_1.txt`, and so on in the notebook's working directory. Give one file to each annotator in your group. Each annotator fills in the Label column with their *individual* annotation for each text, following your guidelines, and submits that file to bCourses.
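
Before handing the files out, you may want to confirm that the split satisfies both requirements. The sketch below is one way to do that and is not part of the original script. The helper name `verify_split` is hypothetical, and it assumes the files keep the `ID`, `Label`, `Text` header row and tab-separated layout written by `check_file`.

```python
from collections import Counter

def verify_split(paths, docs_per_annotator=250, labels_per_doc=2):
    """Check that each file has the expected number of rows and that every
    document ID appears in exactly labels_per_doc files."""
    id_counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        # Skip the header row, then take the first (ID) column of each line.
        ids = [line.split("\t")[0] for line in lines[1:] if line]
        assert len(ids) == docs_per_annotator, "%s has %s rows" % (path, len(ids))
        assert len(set(ids)) == len(ids), "%s contains a repeated ID" % path
        id_counts.update(ids)
    bad = [doc_id for doc_id, count in id_counts.items() if count != labels_per_doc]
    assert not bad, "IDs not assigned to exactly %s annotators: %s" % (labels_per_doc, bad[:5])
    # Number of distinct documents across all files.
    return len(id_counts)
```

For a group of four, calling `verify_split(["output_annotation_file_%s.txt" % i for i in range(4)])` should return 500 if the split is balanced, and raise an `AssertionError` otherwise.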