# Table of contents
- [Setup](#setup)
    - [Purpose](#purpose)
    - [Libraries](#libraries)
- [Quality assurance](#qualityassurance) 

<a name='setup'></a>
# 0. Setup  

<a name='purpose'></a>
## 0.1. Purpose  

The purpose of this notebook is to quality-check the extraction of URLs and the sentences that contain them. 

The following steps are performed: 
- Randomly select 10 articles
- Two people manually extract all URLs and the sentences containing the URLs
    - The results are in 'Data/URLs_manuallyExtracted_QC_1' and 'Data/URLs_manuallyExtracted_QC_2'
- Gather the automatically extracted URLs and sentences (this is done in 'Code/datasets_v5.ipynb')
    - The results are in 'Data/articles_filtered_urls.csv'
- Investigate
    - Inter-extractor agreement
- Report results  
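
The inter-extractor agreement step can be illustrated with a simple set overlap. This is a minimal sketch under assumptions, not necessarily the measure used later in this notebook: the helper name `jaccard_agreement` and the toy URLs are hypothetical, and the Jaccard index is only one possible agreement score.

```python
def jaccard_agreement(urls_a, urls_b):
    """Jaccard index between the URL sets found by two extractors."""
    a, b = set(urls_a), set(urls_b)
    # Assumption: if neither extractor found a URL, count it as full agreement
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy URLs for one article (illustration only, not taken from the data)
extractor_1 = ['https://example.org/data', 'www.example.com']
extractor_2 = ['https://example.org/data']

print(jaccard_agreement(extractor_1, extractor_2))  # 0.5
```

Computing the score per article and then averaging keeps articles with many URLs from dominating the result.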

<a name='libraries'></a>
## 0.2. Libraries 

In [1]:
import pandas as pd
import numpy as np

import json 
import os 
import re 
import io

# Random 
import random

# 1. Ten randomly selected articles 

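I want to manually extract the URLs and the sentences containing them from 10 randomly selected research papers. First, I need to get the research papers and make sure that none of them was used to establish the ground truth. I sample 12 papers rather than 10, because two of them are extracted jointly and the remaining ten individually (see below). 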

In [2]:
# Path to the ground-truth CSV file (URLs and sentences)
groundtruth_path = os.path.join(os.pardir, 'Data', 'articles_groundtruth_urls_and_sentences.csv')
groundtruth_articles = pd.read_csv(groundtruth_path)

In [3]:
groundtruth_dois = groundtruth_articles['DOI'].unique()

In [10]:
def get_random_dois(json_file_path, num_samples, dois_to_exclude=None, random_seed=40):
    """
    Get a list of random DOIs from a JSON file, skipping any DOIs in dois_to_exclude.

    Parameters:
    json_file_path (str): The path to the JSON file containing DOI data.
    num_samples (int): The number of random DOIs to sample.
    dois_to_exclude (iterable, optional): DOIs to exclude from the random sampling
        (e.g., the ground-truth DOIs). Default is None (nothing is excluded).
    random_seed (int, optional): Seed for the random number generator for reproducibility. Default is 40.

    Returns:
    list: Random DOIs, sampled without replacement, none of which is in dois_to_exclude.
    """
    # Set the random seed for reproducibility
    random.seed(random_seed)

    # Load the DOI data from the JSON file
    with open(json_file_path, 'r') as json_file:
        doi_data = json.load(json_file)
        doi_list = doi_data['DOIs']

    # Drop the excluded DOIs (use the argument, not the global groundtruth_dois)
    excluded = set(dois_to_exclude) if dois_to_exclude is not None else set()
    doi_list = [doi for doi in doi_list if doi not in excluded]

    # Ensure the number of requested samples does not exceed available DOIs
    num_samples = min(num_samples, len(doi_list))

    # Get a sample of DOIs
    random_dois = random.sample(doi_list, num_samples)

    return random_dois

In [11]:
# Path to the JSON file containing DOI values
json_file_path = '../Data/ElsevierAPI/downloadedPDFs_info.json'
samples = 12

# Get 12 random DOIs 
random_dois = get_random_dois(json_file_path, samples, groundtruth_dois)

random_dois

['10.1016/j.neuroimage.2022.119254',
 '10.1016/j.neuroimage.2022.119133',
 '10.1016/j.neuroimage.2022.119360',
 '10.1016/j.neuroimage.2022.119742',
 '10.1016/j.neuroimage.2022.119294',
 '10.1016/j.neuroimage.2022.118992',
 '10.1016/j.neuroimage.2022.119077',
 '10.1016/j.neuroimage.2021.118868',
 '10.1016/j.neuroimage.2022.118960',
 '10.1016/j.neuroimage.2022.119769',
 '10.1016/j.neuroimage.2022.119048',
 '10.1016/j.neuroimage.2022.119199']
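
Before the articles are shared, it can be worth confirming that the sample contains no ground-truth DOIs and no duplicates. The following is a minimal sketch; the helper `check_sample` and the toy DOIs are hypothetical and only illustrate the check.

```python
def check_sample(sampled, excluded):
    """Return (overlap, duplicates) for a list of sampled DOIs."""
    sampled = list(sampled)
    # DOIs that appear in both the sample and the exclusion list
    overlap = sorted(set(sampled) & set(excluded))
    # DOIs that occur more than once in the sample
    seen, duplicates = set(), []
    for doi in sampled:
        if doi in seen and doi not in duplicates:
            duplicates.append(doi)
        seen.add(doi)
    return overlap, duplicates


# Toy data for illustration only (not DOIs from the dataset)
toy_sample = ['10.1000/a', '10.1000/b', '10.1000/c']
toy_excluded = ['10.1000/x', '10.1000/b']

overlap, duplicates = check_sample(toy_sample, toy_excluded)
print(overlap)     # ['10.1000/b']
print(duplicates)  # []
```

In the notebook, the same check could be run as `check_sample(random_dois, groundtruth_dois)`, where both returned lists are expected to be empty.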

I will save the PDFs of all these articles and share them with the person who will act as my quality control. 
The PDFs were shared in a folder alongside an Excel sheet, which the coder and I filled out as follows: 

| Column         | Description      |
| -------------- | ---------------- |
| DOI            | The article's DOI. |
| URL            | The URL. It can start with http or www, or simply contain .com or a similar ending. It has to occur in the text, i.e., between the title and "References". If there are no URLs, leave this field and the others blank. |
| Sentence(s)    | A copy of the full sentence(s) that contain the URL. Put quotation marks (either "" or '') around each sentence. If there are multiple sentences, put a comma between them. |

---
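
The Sentence(s) format described in the table above (quoted sentences separated by commas) can be split back into individual sentences when the sheets are read in. This is a minimal sketch, assuming the sentences are wrapped in straight quotation marks; the name `split_quoted_sentences` and the example cell are hypothetical.

```python
import re

# Matches text wrapped in double quotes or in single quotes
QUOTED = re.compile(r'"([^"]*)"|\'([^\']*)\'')


def split_quoted_sentences(cell):
    """Return the quoted sentences in a Sentence(s) cell as a list."""
    # Blank cells (empty strings, or NaN when read with pandas) hold no sentences
    if not isinstance(cell, str) or not cell.strip():
        return []
    return [double or single for double, single in QUOTED.findall(cell)]


# Example cell for illustration only
example = '"Data are at https://example.org/data.", "Code is at www.example.com."'
print(split_quoted_sentences(example))
```

One limitation of this sketch: a sentence wrapped in single quotes that itself contains an apostrophe is split incorrectly, so double quotes are the safer choice when filling out the sheet.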

We will extract the URLs and sentences together from 10.1016/j.neuroimage.2022.119254 and 10.1016/j.neuroimage.2022.119133, and we will extract the URLs and sentences from the other ten articles individually. 
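
The split into jointly and individually extracted articles can be sketched as a slice of the sampled list, since the two jointly extracted DOIs are the first two in the output above. The variable names `joint_dois` and `individual_dois` are assumptions introduced for illustration.

```python
# The 12 sampled DOIs, copied from the output above
random_dois = [
    '10.1016/j.neuroimage.2022.119254',
    '10.1016/j.neuroimage.2022.119133',
    '10.1016/j.neuroimage.2022.119360',
    '10.1016/j.neuroimage.2022.119742',
    '10.1016/j.neuroimage.2022.119294',
    '10.1016/j.neuroimage.2022.118992',
    '10.1016/j.neuroimage.2022.119077',
    '10.1016/j.neuroimage.2021.118868',
    '10.1016/j.neuroimage.2022.118960',
    '10.1016/j.neuroimage.2022.119769',
    '10.1016/j.neuroimage.2022.119048',
    '10.1016/j.neuroimage.2022.119199',
]

# First two articles: extracted together; remaining ten: extracted individually
joint_dois = random_dois[:2]
individual_dois = random_dois[2:]

print(len(joint_dois), len(individual_dois))  # 2 10
```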

<a name='qualityassurance'></a>
# Quality assurance (QA)