# Table of contents
- [Setup](#setup)
    - [Purpose](#purpose)
    - [Libraries](#libraries)
- [40 random research papers](#40randomresearchpapers)

<a name='setup'></a>
# 0. Setup  

<a name='purpose'></a>
## 0.1. Purpose  

This notebook prepares a manually labelled set of sentences containing URLs, which is later used to fine-tune SciBERT.

The following steps are performed:
- Randomly select 40 articles to be manually labelled
- Gather the extracted URLs and sentences (this is done in 'Code/datasets_v5.ipynb')
- The sentences are manually coded by three people
- Gather the manually coded data
    - Investigate inter-labeller agreement
    - Save most-agreed-upon label for each sentence
- Fine-tune SciBERT (this is done in 'Code/datasets_v5.ipynb')

<a name='libraries'></a>
## 0.2. Libraries 

In [1]:
import pandas as pd
import numpy as np

import json 
import os 
import re 
import io

# Random 
import random

# Groundtruth articles 
I need to get the list of groundtruth research papers to make sure I do not select them for manual labelling. 

In [2]:
# Path to the CSV file with the groundtruth URLs and sentences
groundtruth_path = os.path.join(os.pardir, 'Data/articles_groundtruth_urls_and_sentences.csv')
groundtruth_articles = pd.read_csv(groundtruth_path)

In [3]:
groundtruth_dois = groundtruth_articles['DOI'].unique()

<a name='40randomresearchpapers'></a>
# 40 random research papers 
I want to manually label the sentences containing URLs from 40 randomly selected research papers. First, I need to get the research papers. 

In [4]:
def get_random_dois(json_file_path, num_samples, dois_to_exclude=None, random_seed=42):
    """
    Get a list of random DOIs from a JSON file, skipping any DOI listed in dois_to_exclude.

    Parameters:
    json_file_path (str): The path to the JSON file containing DOI data.
    num_samples (int): The number of random DOIs to sample.
    dois_to_exclude (iterable, optional): DOIs to exclude from the random sampling. Default is None.
    random_seed (int, optional): Seed for the random number generator for reproducibility. Default is 42.

    Returns:
    list: A list of unique random DOIs, none of which is in dois_to_exclude.
    """
    # Set the random seed for reproducibility
    random.seed(random_seed)

    # Load the DOI data from the JSON file
    with open(json_file_path, 'r') as json_file:
        doi_data = json.load(json_file)
        doi_list = doi_data['DOIs']

    # Drop the excluded DOIs (uses the argument rather than the global groundtruth_dois)
    excluded = set(dois_to_exclude) if dois_to_exclude is not None else set()
    doi_list = [doi for doi in doi_list if doi not in excluded]

    # Ensure the number of requested samples does not exceed available DOIs
    num_samples = min(num_samples, len(doi_list))

    # Get a sample of DOIs
    random_dois = random.sample(doi_list, num_samples)

    return random_dois
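
The sampling logic can be checked without the JSON file. The sketch below is a minimal illustration, not the notebook's own cell: `sample_dois` is a hypothetical helper that takes the DOI list directly, the DOIs are invented placeholders, and it uses a local `random.Random` instance instead of seeding the global generator.

```python
import random

def sample_dois(doi_list, num_samples, dois_to_exclude=None, random_seed=42):
    """Sample DOIs reproducibly, skipping any DOI in dois_to_exclude."""
    excluded = set(dois_to_exclude) if dois_to_exclude is not None else set()
    candidates = [doi for doi in doi_list if doi not in excluded]
    # A local generator leaves the global random state untouched.
    rng = random.Random(random_seed)
    # Never request more samples than there are candidates.
    return rng.sample(candidates, min(num_samples, len(candidates)))

# Placeholder DOIs, invented for this illustration only.
all_dois = [f"10.0000/example.{i}" for i in range(10)]
excluded_dois = ["10.0000/example.0", "10.0000/example.1"]

first_draw = sample_dois(all_dois, 4, excluded_dois)
second_draw = sample_dois(all_dois, 4, excluded_dois)
```

With the same seed, both draws return the same four DOIs, and none of them comes from the excluded list. Asking for more samples than there are candidates returns all remaining candidates.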

In [5]:
# Path to the JSON file containing DOI values
json_file_path = '../Data/ElsevierAPI/downloadedPDFs_info.json'
samples = 40

# Get 40 random DOIs, excluding the groundtruth DOIs (uses the default random seed, 42)
random_dois = get_random_dois(json_file_path, samples, groundtruth_dois)

random_dois

['10.1016/j.neuroimage.2022.119290',
 '10.1016/j.neuroimage.2022.118980',
 '10.1016/j.neuroimage.2022.119191',
 '10.1016/j.neuroimage.2022.119140',
 '10.1016/j.neuroimage.2022.119203',
 '10.1016/j.neuroimage.2022.119494',
 '10.1016/j.neuroimage.2022.119708',
 '10.1016/j.neuroimage.2022.119245',
 '10.1016/j.neuroimage.2022.119418',
 '10.1016/j.neuroimage.2022.119616',
 '10.1016/j.neuroimage.2022.118920',
 '10.1016/j.neuroimage.2022.119122',
 '10.1016/j.neuroimage.2021.118811',
 '10.1016/j.neuroimage.2022.119353',
 '10.1016/j.neuroimage.2022.119094',
 '10.1016/j.neuroimage.2022.119626',
 '10.1016/j.neuroimage.2022.119742',
 '10.1016/j.neuroimage.2022.119135',
 '10.1016/j.neuroimage.2022.119331',
 '10.1016/j.neuroimage.2022.119278',
 '10.1016/j.neuroimage.2021.118798',
 '10.1016/j.neuroimage.2022.119055',
 '10.1016/j.neuroimage.2022.119507',
 '10.1016/j.neuroimage.2022.119642',
 '10.1016/j.neuroimage.2022.119627',
 '10.1016/j.neuroimage.2022.119668',
 '10.1016/j.neuroimage.2022.119528',
 

Next, I save the extracted URLs and sentences for the selected papers (the extraction is performed in 'Code/datasets_v5.ipynb') to a text file, so that a group of annotators can label the sentences manually. 

In [6]:
path_filtered_urls = os.path.join(os.pardir, 'Data/articles_filtered_urls.csv')
filtered_urls = pd.read_csv(path_filtered_urls)

In [7]:
filtered_urls = filtered_urls[filtered_urls['URL'].notna()]

In [8]:
random_articles = filtered_urls[filtered_urls['DOI'].isin(random_dois)]

In [12]:
len(random_articles)

129

In [9]:
path = os.path.join(os.pardir, 'Data/articles_manually_labelled.txt')

with open(path, "w") as file:
    file.write(random_articles.to_string(index=False))
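
The cell above writes the table with `to_string`, which is easy to read but awkward to parse back once labels have been added. If a machine-readable annotation sheet is preferred, one option is a CSV with an empty label column. This is only a sketch under assumptions: the `Sentence` and `label` column names and all row values are invented for illustration, and an in-memory buffer stands in for a file on disk.

```python
import io
import pandas as pd

# Toy rows standing in for random_articles; all values are invented.
toy_articles = pd.DataFrame({
    "DOI": ["10.0000/example.1", "10.0000/example.2"],
    "URL": ["https://example.org/data", "https://example.org/code"],
    "Sentence": [
        "Data are available at https://example.org/data, see the appendix.",
        "Code is hosted at https://example.org/code.",
    ],
})

# Add an empty column for annotators to fill in ("label" is a hypothetical name).
annotation_sheet = toy_articles.assign(label="")

# Write to an in-memory buffer here; pass a file path instead to write to disk.
buffer = io.StringIO()
annotation_sheet.to_csv(buffer, index=False)

# Read the sheet back; keep_default_na=False keeps empty labels as empty strings.
buffer.seek(0)
round_trip = pd.read_csv(buffer, keep_default_na=False)
```

Because `to_csv` quotes fields that contain commas, sentences survive the round trip unchanged.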

# Inter-coder agreement 

In [10]:
# Fetch the labelled data
# Measure how much the annotators agree
# Save the label most annotators voted for - if there are four different labels, pick one at random
# Save the data and use it to fine-tune SciBERT
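
The planned steps could be sketched as follows. This is a hedged illustration rather than the notebook's implementation: the function names, the label names, and the toy annotations are all invented, and raw pairwise agreement is used here as a simple stand-in for whichever agreement measure is finally chosen.

```python
import random
from collections import Counter
from itertools import combinations

def pairwise_agreement(labels_per_sentence):
    """Share of annotator pairs that agree, averaged over sentences."""
    scores = []
    for labels in labels_per_sentence:
        pairs = list(combinations(labels, 2))
        scores.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(scores) / len(scores)

def majority_label(labels, rng):
    """Most frequent label; ties are broken at random using rng."""
    counts = Counter(labels)
    top = max(counts.values())
    # Sort the tied labels so the random choice is reproducible for a given seed.
    tied = sorted(label for label, n in counts.items() if n == top)
    return rng.choice(tied)

# Invented labels from three annotators for four sentences.
toy_labels = [
    ["dataset", "dataset", "dataset"],
    ["dataset", "software", "dataset"],
    ["software", "other", "dataset"],
    ["other", "other", "software"],
]

rng = random.Random(42)
agreement = pairwise_agreement(toy_labels)
final_labels = [majority_label(labels, rng) for labels in toy_labels]
```

Raw agreement does not correct for agreement expected by chance; chance-corrected statistics such as Fleiss' kappa are a common alternative when there are more than two annotators.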