# Human Annotation - Dataset to Google Form txt Script
GOAL: Create a series of .txt files that can easily be converted into Google Forms.

### Detailed Questions for Form

We are collecting human annotations of query and passage pairs over different languages. We need your help in evaluating the relevance of each passage to its corresponding query. In this form, we will ask you to evaluate the relevance of 50 query-passage pairs in [LANGUAGE]. For each question, select either answer choice A or B.
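
The `[LANGUAGE]` placeholder in the intro text can be filled in per form. A minimal sketch, assuming a hypothetical helper `render_intro` (not part of the original notebook) and assuming the number of pairs is also made a parameter:

```python
# Intro text from above, with [LANGUAGE] and the pair count turned into fields
INTRO_TEMPLATE = (
    "We are collecting human annotations of query and passage pairs over "
    "different languages. We need your help in evaluating the relevance of "
    "each passage to its corresponding query. In this form, we will ask you "
    "to evaluate the relevance of {n_pairs} query-passage pairs in {language}. "
    "For each question, select either answer choice A or B."
)


def render_intro(language: str, n_pairs: int = 50) -> str:
    """Fill in the intro text for one form (hypothetical helper)."""
    return INTRO_TEMPLATE.format(language=language, n_pairs=n_pairs)


print(render_intro("English"))
```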

In [None]:
pip install datasets

In [1]:
# import huggingface dataset
from datasets import load_dataset

ds_en_control = load_dataset(
    "borderlines/bordirlines",
    "control",
    split="openai.en",
    n_hits=10,
    trust_remote_code=True,
)


In [2]:
# convert dataset type to pandas df
ds_en_control_pd = ds_en_control.to_pandas()

In [3]:
ds_en_control_pd

Unnamed: 0,query_id,query,territory,rank,score,doc_id,doc_text,doc_lang
0,Q1,Is Abyei a territory of A) South Sudan or B) S...,Abyei,1,0.670328,32350676_p35,"The Abyei Area, a small region of Sudan border...",en
1,Q1,Is Abyei a territory of A) South Sudan or B) S...,Abyei,2,0.629762,13885196_p0,The Abyei Area () is an area of on the border...,en
2,Q1,Is Abyei a territory of A) South Sudan or B) S...,Abyei,3,0.592646,13885196_p24,Abyei Area Administration \nUnder the terms of...,en
3,Q1,Is Abyei a territory of A) South Sudan or B) S...,Abyei,4,0.573832,13885196_p19,On 21 May 2011 it was reported that the Armed ...,en
4,Q1,Is Abyei a territory of A) South Sudan or B) S...,Abyei,5,0.561525,27421_p107,Disputed areas and zones of conflict\n In Apri...,en
...,...,...,...,...,...,...,...,...
2496,Q251,Is Yalu River a territory of A) Republic of Ch...,Yalu_River,6,0.473388,49559_p17,Crossings\n Ji'an Yalu River Border Railway Br...,en
2497,Q251,Is Yalu River a territory of A) Republic of Ch...,Yalu_River,7,0.468896,49559_p6,The depth of the Yalu River varies from some o...,en
2498,Q251,Is Yalu River a territory of A) Republic of Ch...,Yalu_River,8,0.467969,49559_p5,The river is long and receives water from ove...,en
2499,Q251,Is Yalu River a territory of A) Republic of Ch...,Yalu_River,9,0.460702,49559_p10,Battle of the Yalu River (1894) – First Sino-J...,en


## Control Dataset Exploration

- 251 distinct queries, 1875 distinct passages
- 2501 rows (query-passage pairs) that need to be annotated
- For each query, there are up to 10 ranked docs (`n_hits=10`)
- 2501 pairs × 3 = 7503 total responses needed (each query-passage pair must be evaluated by 3 separate annotators)
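
The counts above can be recomputed from the `query_id` and `doc_id` columns. A minimal sketch, assuming a hypothetical helper `summarize_pairs`; the toy rows below are made up for illustration and are not taken from the dataset:

```python
def summarize_pairs(query_ids, doc_ids, annotations_per_pair=3):
    """Count distinct queries, distinct passages, pairs, and responses needed."""
    query_ids = list(query_ids)
    doc_ids = list(doc_ids)
    return {
        "distinct_queries": len(set(query_ids)),
        "distinct_passages": len(set(doc_ids)),
        "pairs": len(query_ids),
        "responses_needed": len(query_ids) * annotations_per_pair,
    }


# Toy stand-in for the real columns (made-up values)
toy_query_ids = ["Q1", "Q1", "Q2", "Q2"]
toy_doc_ids = ["d1", "d2", "d2", "d3"]
summary = summarize_pairs(toy_query_ids, toy_doc_ids)
print(summary)
```

On the real data, the same helper should accept the DataFrame columns directly, e.g. `summarize_pairs(ds_en_control_pd["query_id"], ds_en_control_pd["doc_id"])`.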

STRATEGY: Split by query  
To get a set of 5 queries fully evaluated 3 times:
- Each annotator evaluates all passages for 5 queries. Each query has 10 ranked docs, so each person submits 50 responses in total.
- Each annotator only has to context-switch 5 times: the query stays fixed while only the document varies, which minimizes the number of distinct queries to understand.
- 3 annotators are needed per set of 5 queries.
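
The split-by-query strategy amounts to chunking the query list into groups of 5, one group per form. A minimal sketch, assuming a hypothetical helper `chunk_queries` and placeholder query IDs rather than the real query strings:

```python
def chunk_queries(queries, queries_per_form=5):
    """Split an ordered list of queries into consecutive groups, one per form."""
    queries = list(queries)
    return [
        queries[i : i + queries_per_form]
        for i in range(0, len(queries), queries_per_form)
    ]


# 251 placeholder IDs standing in for the 251 distinct queries
forms = chunk_queries([f"Q{i}" for i in range(1, 252)])
print(len(forms), len(forms[0]), len(forms[-1]))
```

Because 251 is not a multiple of 5, the last group holds a single query.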

251 ÷ 5 ≈ 50 forms, and 50 forms × 3 annotators = 150 total people needed (the 1 leftover query would add a 51st form).

- Total Queries: 251 queries
- Annotators per Query: 3 annotators per query
- Queries per Form: 5 queries per form
- Total People Needed: 3 annotators × (251 queries ÷ 5 queries per form) ≈ 150 annotators
- Total Forms: 251 ÷ 5 ≈ 50 (51 if the leftover query gets its own form)
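
The form and annotator counts can be worked out exactly. A minimal sketch, assuming each annotator fills out exactly one form and that a leftover query gets a form of its own; `annotation_plan` is a hypothetical helper:

```python
import math


def annotation_plan(n_queries, queries_per_form=5, annotators_per_query=3):
    """Return how many forms and annotators a split-by-query plan needs."""
    full_forms, leftover = divmod(n_queries, queries_per_form)
    n_forms = math.ceil(n_queries / queries_per_form)
    return {
        "full_forms": full_forms,          # forms with exactly queries_per_form queries
        "leftover_queries": leftover,      # queries that do not fill a whole form
        "forms": n_forms,                  # full forms plus one partial form, if any
        "annotators": n_forms * annotators_per_query,
    }


plan = annotation_plan(251)
print(plan)
```

Rounding down gives the 50 forms and 150 annotators estimated above; rounding up to cover the leftover query gives 51 forms and 153 annotators.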

In [24]:
query_passage_dict = {}  # maps each query to its numbered passages: {query: [passages]}

for _, row in ds_en_control_pd.iterrows():
    query = row["query"]  # the question shown to the annotator

    # Number passages 1..n in the order they appear (rank order) for each query
    passages = query_passage_dict.setdefault(query, [])
    passages.append(f"{len(passages) + 1}: {row['doc_text']}")

In [25]:
query_passage_dict

{'Is Abyei a territory of A) South Sudan or B) Sudan?': ['1: The Abyei Area, a small region of Sudan bordering on the South Sudanese states of Northern Bahr el Ghazal, Warrap, and Unity, was given special administrative status as a result of the Comprehensive Peace Agreement signed in 2005.  Following the independence of South Sudan in 2011, Abyei is considered to be simultaneously part of both the Republic of Sudan and the Republic of South Sudan, effectively a condominium. It was due to hold a referendum in 2011 on whether to join South Sudan or remain part of the Republic of Sudan, but in May 2011, the Sudanese military seized Abyei, and it is not clear if the referendum will be held.',
  '2: The Abyei Area () is an area of  on the border between South Sudan and the Sudan that has been accorded "special administrative status" by the 2004 Protocol on the Resolution of the Abyei Conflict (Abyei Protocol) in the Comprehensive Peace Agreement (CPA) that ended the Second Sudanese Civil 

In [17]:
form_count = 1
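
# The stated goal is one .txt file per form. A minimal sketch of that step follows;
# the helper name `write_form_files`, the `form_N.txt` naming, and the "QUERY:" line
# layout are all assumptions for illustration, not the notebook's actual format.

```python
import tempfile
from pathlib import Path


def write_form_files(query_passage_dict, out_dir, queries_per_form=5):
    """Write one .txt file per group of queries and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    queries = list(query_passage_dict)
    paths = []
    starts = range(0, len(queries), queries_per_form)
    for form_count, start in enumerate(starts, start=1):
        lines = []
        for query in queries[start : start + queries_per_form]:
            lines.append(f"QUERY: {query}")
            lines.extend(query_passage_dict[query])  # passages are already numbered
            lines.append("")  # blank line between queries
        path = out_dir / f"form_{form_count}.txt"
        path.write_text("\n".join(lines), encoding="utf-8")
        paths.append(path)
    return paths


# Toy input (made-up text) written to a temporary directory
toy = {"Is X a territory of A) Foo or B) Bar?": ["1: first passage", "2: second passage"]}
with tempfile.TemporaryDirectory() as tmp:
    written = write_form_files(toy, tmp)
    print(written[0].read_text(encoding="utf-8"))
```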

In [28]:
# From the Google Forms API documentation (kept commented out for reference)

# from apiclient import discovery
# from httplib2 import Http
# from oauth2client import client, file, tools

# SCOPES = "https://www.googleapis.com/auth/drive"
# DISCOVERY_DOC = "https://forms.googleapis.com/$discovery/rest?version=v1"

# store = file.Storage("token.json")
# creds = None
# if not creds or creds.invalid:
#   flow = client.flow_from_clientsecrets("client_secrets.json", SCOPES)
#   creds = tools.run_flow(flow, store)

# form_service = discovery.build(
#     "forms",
#     "v1",
#     http=creds.authorize(Http()),
#     discoveryServiceUrl=DISCOVERY_DOC,
#     static_discovery=False,
# )

# form = {
#     "info": {
#         "title": "My new form",
#     },
# }
# # Prints the details of the sample form
# result = form_service.forms().create(body=form).execute()
# print(result)
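
# The commented-out snippet above only creates an empty form. In the Forms API,
# questions are added afterwards with a `batchUpdate` call. Below is a minimal sketch
# of building that request body for one A/B question. `build_choice_request` is a
# hypothetical helper, the question title is a placeholder, and no network call is made.

```python
def build_choice_request(title, index, options=("A", "B")):
    """Build one createItem request for a required radio-button question."""
    return {
        "createItem": {
            "item": {
                "title": title,
                "questionItem": {
                    "question": {
                        "required": True,
                        "choiceQuestion": {
                            "type": "RADIO",
                            "options": [{"value": option} for option in options],
                        },
                    }
                },
            },
            # Position of the new item within the form (0 = first)
            "location": {"index": index},
        }
    }


body = {"requests": [build_choice_request("Example question", 0)]}
print(body)

# With an authorized client (as in the commented-out snippet), this would be sent as:
# form_service.forms().batchUpdate(formId=result["formId"], body=body).execute()
```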