# Table of contents 
- [Setup](#setup)
    - [Purpose](#purpose)
    - [Libraries](#libraries) 
- [References](#references)


<a name='setup'></a>
# 0. Setup 

This notebook contains the code to investigate the extent of dataset reuse, specifically the reuse of the datasets used in articles published in NeuroImage in 2022.

<a name='purpose'></a>
## 0.1. Purpose 

The purpose of this notebook is to investigate how many other articles available in OpenAlex use the same datasets as those used by the researchers who published their work in NeuroImage in 2022.
This notebook draws on the work of Théo Sourget (2023a, 2023b), who similarly investigated the usage of a select few datasets.

<a name='libraries'></a>
## 0.2. Libraries 

In [1]:
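import pandas as pd
import numpy as np
import requests
import csv
import re
import glob  # used in main() to list the downloaded PDFs
from tqdm import tqdm  # progress bars in main()
import pdf_util  # assumption: local helper module providing extract_section, as in Sourget (2023a)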

# 1. NeuroImage 

In [None]:
# Reuse of the links: how many articles use each of the links in '../Data/material_URLs.csv'?

In [None]:
# Case exploration: how many places are HCP datasets hosted? Or maybe another dataset

Some of the URLs take you to the same dataset, e.g., *adni.loni.usc.edu* and *adni.loni.usc.edu/*, so I will merge the URLs that this applies to. However, some of the URLs share a root but show you different datasets, e.g., all *github.com* and *openneuro.org* links.
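
The merge of trailing-slash variants described above could be sketched as follows. This is a minimal illustration, not the notebook's actual cleaning step: the function name `merge_trailing_slash_variants` and the example paths are hypothetical, and it assumes the URLs have already been stripped of their scheme.

```python
def merge_trailing_slash_variants(urls):
    """Return the unique URLs, treating 'x' and 'x/' as the same address."""
    seen = set()
    merged = []
    for url in urls:
        # Remove surrounding whitespace and any trailing slashes
        key = url.strip().rstrip("/")
        # Keep only the first occurrence so the input order is preserved
        if key not in seen:
            seen.add(key)
            merged.append(key)
    return merged


example_urls = [
    "adni.loni.usc.edu",
    "adni.loni.usc.edu/",
    "openneuro.org/datasets/example-a",  # hypothetical path for illustration
    "openneuro.org/datasets/example-b",  # hypothetical path for illustration
]
merged = merge_trailing_slash_variants(example_urls)
print(merged)
```

Only the two *adni.loni.usc.edu* variants collapse into one entry; the two *openneuro.org* links stay separate because their paths differ.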

In [None]:
urls = pd.read_csv('../Data/material_URLs.csv')

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('max_colwidth', None)

In [None]:
urls

In [None]:
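def find_root_and_derivatives(urls):
    """ 
    Clean the URLs and group them by root website.
    The beginning of each URL (https://, http://, or www.) is removed and all text is put into lowercase; each cleaned URL is then listed under its root, i.e., the part before the first "/".
    Parameters: 
    :param urls(list): Sorted list of unique URLs from a dataframe
    :return: Dataframe with one row per root ('Root') and the list of cleaned URLs sharing that root ('Derivatives'), sorted by root
    """    
    # Remove "https://", "http://", and "www." from the URLs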
    clean_urls = [url.replace("https://", "").replace("http://", "").replace("www.", "").lower() for url in urls]
    
    url_dict = {}
    for url in clean_urls:
        root = url.split('/')[0]
        # root = '/'.join(url.split('/'))
        if root in url_dict:
            url_dict[root].append(url)
        else:
            url_dict[root] = [url]

    grouped_df = pd.DataFrame(url_dict.items(), columns=['Root', 'Derivatives']).sort_values('Root', ascending=True)

    return grouped_df

In [None]:
test = find_root_and_derivatives(urls['Material_URL'])

In [None]:
test

Among the 315 unique and valid URLs, there are 137 root websites; some lead to the same dataset and others to different datasets.
- IDA (Image & Data Archive) features 25 studies, including ADNI, AIBL, and an HCP project as well as others (elsewhere on their website, they say they feature 151 studies)
- db.humanconnectome.org (ConnectomeDB) gives people access to the Human Connectome Project (HCP), both Lifespan and HCP-Aging
- fcon_1000.projects.nitrc.org links to 1000 Functional Connectomes Project, INDI-prospective, INDI-retrospective, and Preprocessed Connectome Project
- HCP actually stands for Human Connectome Projects: there are 20 connectome studies supported on humanconnectome.org (which calls itself the Connectome Coordination Facility (CCF)), leading to, e.g., HCP Young Adult, Aging, Development, Amish, epilepsy, and other connectomes.
- nda.nih.gov features different datasets, including a couple of HCPs 
- datalad, github, openneuro, openfmri, osf, and zenodo link to numerous different datasets 
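
One way to act on the observations above is to count links per dataset rather than per root. This is a rough sketch under stated assumptions: the set `MULTI_DATASET_HOSTS`, the helper `dataset_key`, and the example paths are hypothetical and not part of the notebook. The rule that every other root corresponds to a single dataset is a simplification, since roots such as nda.nih.gov also host several datasets.

```python
from collections import Counter

# Assumption: roots treated as general-purpose repositories hosting many datasets
MULTI_DATASET_HOSTS = {"github.com", "openneuro.org", "osf.io", "zenodo.org"}


def dataset_key(clean_url, multi_hosts=MULTI_DATASET_HOSTS):
    """Return the key under which a cleaned URL is counted."""
    url = clean_url.rstrip("/")
    root = url.split("/")[0]
    # For repositories, the full path is needed to tell datasets apart
    if root in multi_hosts:
        return url
    # Simplifying assumption: every other root refers to a single dataset
    return root


example_urls = [
    "adni.loni.usc.edu",
    "adni.loni.usc.edu/data-samples",    # hypothetical path
    "github.com/user-a/repo-1",          # hypothetical path
    "github.com/user-b/repo-2",          # hypothetical path
    "openneuro.org/datasets/example-a/", # hypothetical path
]
counts = Counter(dataset_key(url) for url in example_urls)
for key, n in sorted(counts.items()):
    print(key, n)
```

The set of repository roots would need to be reviewed by hand against the 137 roots before any counts are reported.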

In [None]:
pd.reset_option('display.max_rows')
pd.reset_option('max_colwidth')

# 2. 

In [None]:
def load_datasets_info():
    """Load dataset information from the csv at ./Resources/data/datasets.csv
    @return:
        -dictionary with dataset name as key and dictionary of information as value
    """
    datasets_info = {}
    with open('./Resources/data/datasets.csv') as ds_csv:
        ds_reader = csv.DictReader(ds_csv)
        for ds in ds_reader:
            datasets_info[ds["name"]] = {
                                            "doi":ds["doi"],
                                            "title":ds["paper_title"],
                                            "name":ds["name"],
                                            # Strip whitespace and drop empty aliases, which would otherwise match almost any text
                                            "aliases":[alias.strip() for alias in ds["aliases"].split(",") if alias.strip()],
                                            "url":ds["url"]
                                         }
    return datasets_info


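def search_dataset_in_section(paper_path,section_name,dataset_infos,field="name"):
    """Search one section of a paper for mentions of each dataset
    @param:
        -paper_path: path to the pdf of the paper
        -section_name: section to extract, e.g. "abstract", "references", or "method"
        -dataset_infos: dictionary returned by load_datasets_info
        -field: dataset field to search for first ("name" or "title"); the aliases are tried afterwards
    @return:
        -dictionary with dataset name as key and True if the dataset was found, False otherwise
    """
    res = {ds_name:False for ds_name in dataset_infos}
    try:
        text = pdf_util.extract_section(paper_path,section_name)
        if text:
            text = " ".join(text)
            for ds_name in dataset_infos:
                # Try the requested field first, then fall back to the aliases
                candidates = [dataset_infos[ds_name][field]] + dataset_infos[ds_name]["aliases"]
                for candidate in candidates:
                    # re.escape keeps characters such as "(" or "?" from being read as regex syntax
                    if candidate and re.search(f"(?<![^_\\W]){re.escape(candidate)}(?![^_\\s\\d\\.\\),'])",text):
                        res[ds_name] = True
                        break
    except Exception as error:
        print(f"Unreadable paper:{paper_path} ({error})")
    
    return res


def main():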
    #Load dataset_info
    ds_info = load_datasets_info()

    #Load list of downloaded pdf 
    papers_pdf_path = glob.glob("./Results/extraction/fulltext/*.pdf")
    data_csv = []
    
    print("Search in abstract started")
    for paper in tqdm(papers_pdf_path):
        paper_name = paper.split("/")[-1].removesuffix(".pdf")
        #Get the abstract part
        search_res = search_dataset_in_section(paper,"abstract",ds_info)
        search_res["name"] = paper_name
        data_csv.append(search_res)
    
    with open("./Results/extraction/fulltext_datasets_abstract.csv","w",newline="") as ft_ds_ref:
        fields = ["name"]
        fields += [ds_name for ds_name in ds_info]
        
        writer = csv.DictWriter(ft_ds_ref, fieldnames=fields)
        # Write the header row (column names)
        writer.writeheader()
        for paper in data_csv:
            writer.writerow(paper)
    
    data_csv = []
    print("Search in references started")
    for paper in tqdm(papers_pdf_path):
        paper_name = paper.split("/")[-1].removesuffix(".pdf")
        #Get the references part
        search_res = search_dataset_in_section(paper,"references",ds_info,"title")
        search_res["name"] = paper_name
        data_csv.append(search_res)
            
    with open("./Results/extraction/fulltext_datasets_references.csv","w",newline="") as ft_ds_ref:
        fields = ["name"]
        fields += [ds_name for ds_name in ds_info]
        
        writer = csv.DictWriter(ft_ds_ref, fieldnames=fields)
        # Write the header row (column names)
        writer.writeheader()
        for paper in data_csv:
            writer.writerow(paper)
    
    # data_csv = []
    # print("Search in results started")
    # for paper in tqdm(papers_pdf_path):
    #     paper_name = paper.split("/")[-1].removesuffix("PDF")
    #     #Get the abstract part
    #     search_res = search_dataset_in_section(paper,"results",ds_info)
    #     search_res["name"] = paper_name
    #     data_csv.append(search_res)
            
    # with open("./Results/extraction/fulltext_datasets_results.csv","w",newline="") as ft_ds_ref:
    #     fields = ["name"]
    #     fields += [ds_name for ds_name in ds_info]
        
    #     writer = csv.DictWriter(ft_ds_ref, fieldnames=fields)
    #     # Write the header row (column names)
    #     writer.writeheader()
    #     for paper in data_csv:
    #         writer.writerow(paper)

    data_csv = []
    print("Search in method started")
    for paper in tqdm(papers_pdf_path):
        paper_name = paper.split("/")[-1].removesuffix(".pdf")
        #Get the method part
        search_res = search_dataset_in_section(paper,"method",ds_info)
        search_res["name"] = paper_name
        data_csv.append(search_res)
            
    with open("./Results/extraction/fulltext_datasets_method.csv","w",newline="") as ft_ds_ref:
        fields = ["name"]
        fields += [ds_name for ds_name in ds_info]
        
        writer = csv.DictWriter(ft_ds_ref, fieldnames=fields)
        # Write the header row (column names)
        writer.writeheader()
        for paper in data_csv:
            writer.writerow(paper)

if __name__ == "__main__":
    main()
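
The behaviour of the lookbehind and lookahead used in `search_dataset_in_section` can be checked on a few toy sentences. This is a small sketch, not part of the original script: the helper name `mentions` and the example sentences are hypothetical, and `re.escape` is used here so that the searched term is always treated as literal text.

```python
import re


def mentions(term, text):
    """Return True if `term` occurs in `text` under the boundary rules below."""
    # Lookbehind: the term must not be preceded by a letter or digit.
    # Lookahead: the term must be followed by the end of the text or by an
    # underscore, whitespace, digit, ".", ")", ",", or "'".
    pattern = f"(?<![^_\\W]){re.escape(term)}(?![^_\\s\\d\\.\\),'])"
    return re.search(pattern, text) is not None


examples = [
    "We used the ADNI dataset.",      # plain mention
    "Data were taken from (ADNI).",   # followed by ")"
    "Participants came from ADNI2.",  # followed by a digit
    "The RADNI cohort was excluded.", # preceded by a letter
    "Several ADNIs were compared.",   # followed by a letter
]
results = [mentions("ADNI", sentence) for sentence in examples]
for sentence, found in zip(examples, results):
    print(found, sentence)
```

The first three sentences match and the last two do not, so a plural or hyphenated spelling of a dataset name would be missed unless it is listed as an alias.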

<a name='references'></a>
# References
- Sourget, T. (2023a). Public_Medical_Datasets_References [Jupyter Notebook]. https://github.com/TheoSourget/Public_Medical_Datasets_References
- Sourget, T. (2023b). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget