# Table of contents 
- [Setup](#setup) 
    - [Target](#target)
    - [Libraries](#libraries)
- [Ground truth URLs and sentences](#groundtruthURLsandsentences)
- [URLs and sentences](#URLsandsentences)
    - [Process URLs](#processURLs)
    - [URLs in NeuroImage 2022 articles](#URLsinNeuroImage2022articles)
- [Unsupervised text classification using word embeddings](#unsupervisedtextclassificationusingwordembeddings) 
    - [Class concepts and class labels](#classconceptsandclasslabels)
    - [Sentence classification](#sentenceclassification) 
- [Validate](#validate) 
- [Datasets](#datasets)
- [References](#references) 
    
<br>
<br>

[Old code](#oldcode)
- [Gather datasets](#gatherdatasets)
    - [Get text sections](#gettextsections)
        - [Section patterns v1](#sectionpatternsv1)
        - [Section patterns v2](#sectionpatternsv2)
        - [Section patterns v3](#sectionpatternsv3)
    - [Preprocessing text sections](#preprocessingtextsections) 
        - [Start patterns](#startpatterns)
        - [Clean text](#cleantext)
    - [Get datasets](#getdatasets)
        - ['Availability' pattern](#availabilitypattern)
        - [Other section patterns](#othersectionpatterns)
- [References](#references)

<a name='setup'></a>
# 0. Setup 

This notebook contains the code to extract the datasets used in the articles published in NeuroImage in 2022. 
<br>
<br>

<a name='target'></a> 
## 0.1. Target
The goal is to use pypdf to locate and extract the datasets used for analysis in the research articles. Based on an initial review of nine random 

Steps: 
- URLs and sentences: Get all URLs and the sentences that contain them. 
    - Preprocess URLs
- Word embedding: 
    - Class concepts and class labels 
    - Sentence classification
- Datasets: 

<br>
<br>

<a name='libraries'></a>
## 0.2. Libraries 

In [13]:
import pandas as pd
import numpy as np

import json 
import os 
import re 
import io

# Read PDFs
import pypdf 
# Extract URLs from text 
import urlextract 
# Sentence - and thus URL - classification, and related imports 
import fasttext
import fasttext.util
# Random 
import random

# 1. Ground truth URLs and sentences 
<a name = 'groundtruthURLsandsentences'></a>

Based on my exploration of ten randomly picked articles, 75% of the articles contained URLs. Of the articles that did not contain any URLs, the majority either used self-collected data or no datasets at all. This suggests that extracting the URLs should also capture most of the datasets used for analysis in the articles. 


I will test the functions using the groundtruth texts as my validation set. Manually extracting the datasets from the ten groundtruth texts should give the following datasets (NB! Currently, I have not distinguished between links that lead the reader to data and links that lead the reader to code; this will come later): 


As in the automatic processing, I perform the following steps manually: 
- For each article, save only unique URLs (i.e., if the same URL is mentioned more than once, save the first mention and its sentence). 
- Remove the following URLs: 
    - 'www.elsevier.com/locate/neuroimage'
    - URLs containing the DOI of the article
    - Creative Commons licenses
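
The removal rules above can be sketched as a small filter. This is only an illustration of the manual rules, not the notebook's own processing step: the helper names `is_boilerplate_url` and `filter_urls`, and the use of plain substring checks, are assumptions.

```python
def is_boilerplate_url(url, doi):
    """Return True for URLs that the manual rules discard.

    Hypothetical helper: the substring checks are an assumption about
    how the three rules could be implemented.
    """
    lowered = url.lower()
    return (
        'www.elsevier.com/locate/neuroimage' in lowered  # journal homepage
        or doi.lower() in lowered                        # the article's own DOI
        or 'creativecommons.org' in lowered              # Creative Commons licenses
    )


def filter_urls(urls, doi):
    """Keep the first mention of each URL and drop boilerplate URLs."""
    seen = set()
    kept = []
    for url in urls:
        if url in seen or is_boilerplate_url(url, doi):
            continue
        seen.add(url)
        kept.append(url)
    return kept


example_urls = [
    'www.elsevier.com/locate/neuroimage',
    'https://doi.org/10.1016/j.neuroimage.2021.118839',
    'http://creativecommons.org/licenses/by-nc-nd/4.0/',
    'http://neuroimage.usc.edu/brainstorm',
    'http://neuroimage.usc.edu/brainstorm',  # repeated mention
]
filtered = filter_urls(example_urls, '10.1016/j.neuroimage.2021.118839')
print(filtered)
```

Only the Brainstorm URL survives here, which matches the single ground truth entry for this article.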

In [10]:
# List of groundtruth DOI values to filter 
groundtruth_dois = [
    '10.1016/j.neuroimage.2021.118839',
    '10.1016/j.neuroimage.2021.118854',
    '10.1016/j.neuroimage.2022.119030',
    '10.1016/j.neuroimage.2022.119050',
    '10.1016/j.neuroimage.2022.119240',
    '10.1016/j.neuroimage.2022.119443',
    '10.1016/j.neuroimage.2022.119526',
    '10.1016/j.neuroimage.2022.119549',
    '10.1016/j.neuroimage.2022.119646',
    '10.1016/j.neuroimage.2022.119676',
] 

In [11]:
groundtruth_urls = [
    {
        'DOI': '10.1016/j.neuroimage.2021.118839',
        'URL': 'http://neuroimage.usc.edu/brainstorm',
        'Sentence': 'Subsequently the results were loaded in a Matlab Tool Box, Brainstorm (Tadel et al. 2011), an accredited software freely available for download online under the GNU general public license (http://neuroimage.usc.edu/brainstorm).',
        'Class': 'Code' 
    },
    {
        'DOI': '10.1016/j.neuroimage.2021.118854',
        'URL': 'https://www.humanconnectome.org/study/hcp-young-adult/data-releases',
        'Sentence': 'We applied our GFA extension to the publicly available resting-state functional MRI (rs-fMRI) and non-imaging measures (e.g., demograph-ics, psychometrics and other behavioural measures) obtained from 1003 subjects (only these had rs-fMRI data available) of the 1200-subject data release of the HCP (https://www.humanconnectome.org/study/hcp-young-adult/data-releases).',
        'Class': 'Data'
    },
    {
        'DOI': '10.1016/j.neuroimage.2021.118854',
        'URL': 'https://www.humanconnectome.org/study/hcp-young-adult/document/extensively-processed-fmri-data-documentation',
        'Sentence': 'The data used in this study was downloaded from the Human Connectome Project website (https://www.humanconnectome.org/study/hcp-young-adult/document/extensively-processed-fmri-data-documentation).',
        'Class': 'Data'
    },
    {
        'DOI': '10.1016/j.neuroimage.2021.118854',
        'URL': 'https://github.com/ferreirafabio80/gfa',
        'Sentence': 'The GFA models and experiments were implemented in Python 3.9.1 and are available here: https://github.com/ferreirafabio80/gfa.',
        'Class': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119030',
        'URL': 'marmosetbrainconnectome.org',
        'Sentence': 'To accelerate such progress, we present the Marmoset Functional Brain Connectivity Resource (marmosetbrainconnectome.org), currently consisting of over 70 h of resting-state fMRI (RS-fMRI) data acquired at 500 μm isotropic resolution from 31 fully awake marmosets in a common stereotactic space.',
        'Class': 'Data'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119030',
        'URL': 'https://www.marmosetbrainconnectome.org/download.html',
        'Sentence': '(B) The data download page (https://www.marmosetbrainconnectome.org/download.html) allows the user to download all raw (BIDS standard formated) (Gorgolewski et al., 2016) and pre-processed data.',
        'Class': 'Data'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119030',
        'URL': 'https://rii-mango.github.io/Papaya/',
        'Sentence': 'The resource makes use of the Papaya viewer (https://rii-mango.github.io/Papaya/), with several additional features (illustrated in Fig. 1C & D), including (1) calculation of surface over-lay maps on-demand based on the threshold chosen in volume space, (2) the ability to display atlas borders in surface space, (3) support for rotating the underlying volume, overlaying functional connectiv-ity map, and atlas boundaries together – such obliquing of the images can be of utility for presurgical planning, and (4) the ability to choose between group- and subject-level topologies.',
        'Class': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119030',
        'URL': 'https://gitlab.com/cfmm/marmoset',
        'Sentence': 'The development of the Marmoset Functional Connectivity Resource is described in full detail at https://gitlab.com/cfmm/marmoset.',
        'Class': 'Other' # actually 'Code', "All code for the online viewer is available at: https://gitlab.com/cfmm/marmoset"
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119030',
        'URL': 'https://gitlab.com/cfmm/marmoset-connectivity',
        'Sentence': 'Users can also download all code used to generate the functional connectivity maps from https://gitlab.com/cfmm/marmoset-connectivity.',
        'Class': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119030',
        'URL': 'marmosetbrain.org',
        'Sentence': 'Green labeled ROIs indicate FC data, whereas purple labeled ROIs show where tracer data is publicly available within area TE3 from marmosetbrain.org.',
        'Class': 'Data'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119050',
        'URL': 'http://audition.ens.fr/adc/NoiseTools/',
        'Sentence': 'EEG analysis used FieldTrip (Oostenveld et al., 2011), Noise-Tools (De Cheveigne and Parra, 2014; http://audition.ens.fr/adc/NoiseTools/), and custom-written scripts in Matlab.',
        'Class': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119050',
        'URL': 'zenodo.org',
        'Sentence': 'Raw EEG data from all healthy individuals, as well as Matlab code, are publicly available on zenodo.org (doi:10.5281/zenodo.6110595).',
        'Class': ''
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119240',
        'URL': 'www.cni.stanford.edu',
        'Sentence': 'MRI data were acquired on a 3T Discovery MR750 scanner (Gen-eral Electric Healthcare, Milwaukee, WI, USA) equipped with a 32-channel head coil (Nova Medical, Wilmington, MA, USA) at the Cen-ter for Cognitive and Neurobiological Imaging at Stanford Univer-sity (www.cni.stanford.edu).',
        'Class': 'Data'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119240',
        'URL': 'http://github.com/vistalab/vistasoft/mrDiffusion',
        'Sentence': 'The T1w images were first aligned to the canonical ac-pc orienta-tion. Diffusion weighted images were pre-processed with Vistasoft (http://github.com/vistalab/vistasoft/mrDiffusion), an open-source software package implemented in MATLAB R2012a (Mathworks, Natick, MA).',
        'Class': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119240',
        'URL': 'http://www.fil.ion.ucl.ac.uk/spm/',
        'Sentence': 'Each diffusion weighted image was registered to the mean of the b=0 images and the mean b=0 image was registered automatically to the participant’s T1w image, using a rigid body transformation (imple-mented in SPM8, http://www.fil.ion.ucl.ac.uk/spm/; no warping was applied).',
        'Class': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119240',
        'URL': 'https://github.com/mezera/mrQ',
        'Sentence': 'Quantitative T1 (relaxation time, seconds) maps were calculated us-ing mrQ, (https://github.com/mezera/mrQ), an open-source software package implemented in MATLAB R2012a (Mathworks, Natick, MA).',
        'Class': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119240',
        'URL': 'https://github.jyeatman/AFQ',
        'Sentence': 'Automated Fiber Quantification (AFQ; https://github.jyeatman/ AFQ; (Yeatman, Dougherty, Myall, et al., 2012)), a software package implemented in MATLAB R2012a (Mathworks, Natick, MA), was used to isolate and characterize white matter metrics from three dorsal tracts (Arc-L and bilateral SLF) and four ventral white matter tracts (bilat-eral ILF and bilateral UF).',
        'Class': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119443',
        'URL': 'osf.io/gazx2/',
        'Sentence': 'EEG datasets used to create the figure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/ and osf.io/bndjg/.',
        'Class': 'Data'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119443',
        'URL': 'osf.io/eucqf/',
        'Sentence': 'EEG datasets used to create the figure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/ and osf.io/bndjg/.',
        'Class': 'Data'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119443',
        'URL': 'osf.io/thsqg/',
        'Sentence': 'EEG datasets used to create the figure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/ and osf.io/bndjg/.',
        'Class': 'Data'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119443',
        'URL': 'osf.io/bndjg/',
        'Sentence': 'EEG datasets used to create the figure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/ and osf.io/bndjg/.',
        'Class': 'Data'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119443',
        'URL': 'osf.io/guwnm/',
        'Sentence': 'Code used to reproduce the plots in Fig. 1, as well as averaged ERP data, is available from osf.io/guwnm/.',
        'Class': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://db.humanconnectome.org/data/projects/HCP_1200',
        'Sentence': '200 unrelated subjects were selected from the Human Con-nectome Project (HCP) 1200 Subjects Data Release with avail-able resting (task-free) and task fMRI data from a 3T MRI scan-ner (https://db.humanconnectome.org/data/projects/HCP_1200).',
        'Class': 'Data'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://www.humanconnectome.org/study/hcp-young-adult/document/wu-minn-hcp-consortium-open-access-data-use-terms',
        'Sentence': 'This study agreed to the Open Access Data Use Terms (https://www.humanconnectome.org/study/hcp-young-adult/document/wu-minn-hcp-consortium-open-access-data-use-terms) and was exempt from the UCSF IRB because investigators could not readily ascertain the identities of the individuals to whom the data belonged.',
        'Class': 'Other'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/',
        'Sentence': 'We used FSL (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/) and AFNI (https://afni.nimh.nih.gov/) for additional fMRI preprocessing.',
        'Class': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://afni.nimh.nih.gov/',
        'Sentence': 'We used FSL (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/) and AFNI (https://afni.nimh.nih.gov/) for additional fMRI preprocessing.',
        'Class': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'http://www.brainnetome.org/',
        'Sentence': 'Maps were averaged within 273 regions of interest by combining a parcella-tion of 210 cortical regions and 36 subcortical regions from the Brainnetome atlas (Fan et al., 2016) (http://www.brainnetome.org/) and 27 cerebellar regions from the SUIT atlas (Diedrichsen, 2006) (http://www.diedrichsenlab.org/imaging/suit.htm).',
        'Class': 'Data'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'http://www.diedrichsenlab.org/imaging/suit.htm',
        'Sentence': 'Maps were averaged within 273 regions of interest by combining a parcella-tion of 210 cortical regions and 36 subcortical regions from the Brainnetome atlas (Fan et al., 2016) (http://www.brainnetome.org/) and 27 cerebellar regions from the SUIT atlas (Diedrichsen, 2006) (http://www.diedrichsenlab.org/imaging/suit.htm).',
        'Class': 'Data'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://www.fil.ion.ucl.ac.uk/spm/software/spm12/',
        'Sentence': 'Task condition block regressors were convolved with a hemo-dynamic response function using the ‘spm_get_bf’ function in SPM12 (https://www.fil.ion.ucl.ac.uk/spm/software/spm12/).',
        'Class': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://github.com/rmarkello/abagen',
        'Sentence': 'We compared each gradient map to Allen Human Brain spatial gene expression patterns using the ‘abagen’ package (https://github.com/rmarkello/abagen) (Arnatkevici ̆ūtė et al., 2019; Hawrylycz et al., 2012).',
        'Class': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://brainsmash.readthedocs.io/en/latest/',
        'Sentence': 'These surrogate gradient maps were estimated using BrainSMASH (https://brainsmash.readthedocs.io/en/latest/).',
        'Class': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://sites.google.com/site/bctnet/',
        'Sentence': 'Graph the-ory analyses were run using the Brain Connectivity Toolbox (BCT; https://sites.google.com/site/bctnet/).',
        'Class': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'http://human.brain-map.org/',
        'Sentence': 'Original data was obtained from the Human Connectome Project (1U54MH091657, PIs Van Essen and Ugurbil) and the Allen Hu-man Brain Atlas (http://human.brain-map.org/).',
        'Class': 'Data'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://github.com/jbrown81/gradients',
        'Sentence': 'All code (latent space derivation, dynamical system modeling, and gene expression corre-lation) and processed data (gradient maps/region weights, gradient timeseries, and region gene expression values) are available at https://github.com/jbrown81/gradients.',
        'Class': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119549',
        'URL': np.nan,
        'Sentence': np.nan,
        'Class': np.nan
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119646',
        'URL': np.nan,
        'Sentence': np.nan,
        'Class': np.nan
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119676',
        'URL': 'https://www.shutterstock.com',
        'Sentence': 'Both experiments employed static images (modified from Shutterstock, https://www.shutterstock.com).',
        'Class': 'Other' # Method? 
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119676',
        'URL': 'https://clippingmagic.com',
        'Sentence': 'All image transformations were done with Clipping Magic (https://clippingmagic.com), ImageMagick, GIMP, Microsoft Paint, the MATLAB SHINE toolbox, and custom MATLAB code.',
        'Class': 'Code' # Method? 
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119676',
        'URL': 'http://www.nitrc.org/projects/jip',
        'Sentence': 'Functional volumes were realigned and motion-corrected with the Statistical Parametric Mapping software (SPM12, RRID: SCR_007037), followed by non-rigid co-registration (using JIP, http://www.nitrc.org/projects/jip, RRID: SCR_009588) to the high-resolution anatomical template of the skull-stripped brain of each monkey.',
        'Class': 'Other' # Method? 
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119676',
        'URL': 'https://caffe.berkeleyvision.org/model_zoo.html',
        'Sentence': 'Another version of pre-trained AlexNet was im-ported from Caffe Model Zoo (https://caffe.berkeleyvision.org/model_zoo.html).',
        'Class': 'Unsure'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119676',
        'URL': 'https://osf.io/b8pfa/?view_only=b6dbb5dd6a044989a7eecdc99facb43c',
        'Sentence': 'Preprocessed fMRI data are available at https://osf.io/b8pfa/?view_only=b6dbb5dd6a044989a7eecdc99facb43c.',
        'Class': 'Data'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119676',
        'URL': 'https://github.com/Yozafirova/monkey-fMRI-codes',
        'Sentence': 'Codes for the fMRI data analysis at https://github.com/Yozafirova/monkey-fMRI-codes and for the CNN data analysis at https://github.com/RajaniRaman/face_body_integration.',
        'Class': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119676',
        'URL': 'https://github.com/RajaniRaman/face_body_integration',
        'Sentence': 'Codes for the fMRI data analysis at https://github.com/Yozafirova/monkey-fMRI-codes and for the CNN data analysis at https://github.com/RajaniRaman/face_body_integration.',
        'Class': 'Code'
    },
]

# Convert the list of dictionaries to a DataFrame
manual_groundtruth_urls = pd.DataFrame(groundtruth_urls)

In [12]:
# Path to the 'Data' directory
data_dir = os.path.join(os.pardir, 'Data')

# File path
file_path = os.path.join(data_dir, 'articles_groundtruth_urls_and_sentences.csv')

# Save the DataFrame to CSV, overwriting the file if it exists
manual_groundtruth_urls.to_csv(file_path, index=False, mode='w')

In [13]:
# Check for NaN values in the 'URL' column
manual_groundtruth_urls[manual_groundtruth_urls['URL'].isna()]

Unnamed: 0,DOI,URL,Sentence,Class
34,10.1016/j.neuroimage.2022.119549,,,
35,10.1016/j.neuroimage.2022.119646,,,


In [14]:
# Group by 'DOI' and count the non-NaN URLs (count() skips NaN, so DOIs without URLs get 0)
url_counts = manual_groundtruth_urls.groupby('DOI')['URL'].count()

# Sort the counts in descending order
url_counts = url_counts.sort_values(ascending=False)

url_counts

DOI
10.1016/j.neuroimage.2022.119526    12
10.1016/j.neuroimage.2022.119676     7
10.1016/j.neuroimage.2022.119030     6
10.1016/j.neuroimage.2022.119240     5
10.1016/j.neuroimage.2022.119443     5
10.1016/j.neuroimage.2021.118854     3
10.1016/j.neuroimage.2022.119050     2
10.1016/j.neuroimage.2021.118839     1
10.1016/j.neuroimage.2022.119549     0
10.1016/j.neuroimage.2022.119646     0
Name: URL, dtype: int64

<a name='URLsandsentences'></a>
# 2. URLs and sentences
I use the work of Sourget (2023) to search the PDFs for their datasets. 

I use the Python library *urlextract* by Lipovský (2022) to extract the URLs. 

I perform some initial cleaning of the text extracted from the PDFs, specifically replacing double and triple spaces with a single space, removing all \n characters, and removing spaces next to a number of special characters, incl. -, (, ), /, ., _ , and between : /.
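
To make the cleaning rules concrete, the sketch below applies the same ordered string replacements to a made-up fragment of the kind PDF extraction tends to produce. The sample string and the names `REPLACEMENTS` and `apply_replacements` are invented for illustration.

```python
# Ordered (old, new) pairs mirroring the cleaning rules described above
REPLACEMENTS = [
    ('   ', ' '), ('  ', ' '), ('\n', ''),
    ('- ', '-'), ('( ', '('), (' )', ')'),
    ('/ ', '/'), (' /', '/'), (' .', '.'), (': /', ':/'),
    (' _ ', '_'), (' _', '_'), ('_ ', '_'),
]


def apply_replacements(text):
    """Apply each replacement in turn; later rules see earlier results."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text


# Invented sample: a URL broken up by stray spaces and a line break
raw = ('data release \nof the HCP ( https: //www.humanconnectome.org/ '
       'study/hcp- young-adult ) .')
cleaned = apply_replacements(raw)
print(cleaned)
```

Because the rules are plain string replacements, they also affect ordinary prose: for example, `'- '` becomes `'-'` wherever a hyphen is followed by a space, not only inside URLs.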

The functions: 
- *get_content* is loosely adapted from Sourget (2023) using the following breadcrumb in the GitHub repository: DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures'.
- *clean_text* 
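- *get_sentences* splits the cleaned text into sentences. 
- *get_urls* uses the urlextract library (Lipovský 2022) to extract the unique URLs. 
- *get_sentences_with_urls* keeps the sentences that contain a URL and stops at the reference list. 
- *get_urls_and_sentences* calls on *get_urls* and gets both URLs and sentences containing the URL. 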
- *extract_and_transform_urls_from_dataframe* 
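
The sentence-splitting step in the list above relies on a single regular expression. The sketch below runs that pattern on a short sample (the second sentence is shortened from one of the ground truth sentences; the first is invented) to show both what it protects and what it still splits.

```python
import re

# Same pattern as in get_sentences: split on whitespace after '.', '!' or '?',
# unless the whitespace is followed by a number and another whitespace
sentence_pattern = r'(?<=[.!?])\s+(?![0-9]+\s)'

text = ('Results are shown in Fig. 1 and in the appendix. '
        'Subsequently the results were loaded in Brainstorm (Tadel et al. 2011), '
        'an accredited software.')

sentences = re.split(sentence_pattern, text)
for sentence in sentences:
    print(repr(sentence))
```

'Fig. 1 and' stays in one piece because '1' is followed by whitespace, whereas 'et al. 2011),' is split because '2011' is followed by ')'. This is consistent with the extracted sentence for the Brainstorm URL further down, which starts at '2011), an accredited software'.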
    

<br>

References: 
- Lipovský, J. (2022). urlextract: Collects and extracts URLs from given text. (1.8.0) [Python]. https://github.com/lipoja/URLExtract
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget

In [15]:
def get_content(pdf_path, alt_pdf_path):
    """Get the URLs in a PDF and the sentences that contain them. 
    This function is loosely adapted from Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget
    specifically: DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures'.
    
    Parameters: 
    :param pdf_path (str): Path to the PDF file.
    :param alt_pdf_path (str): Directory with editorial board PDFs. These files are 
        not read; the argument is kept so existing calls keep working. 
    
    Returns: 
    :return: DataFrame with the columns 'url' and 'sentences'. It holds a single NaN row 
        if the PDF is not found at pdf_path, and is empty if no URLs are found or the 
        PDF cannot be read.
    """
    try:
        # The context manager closes the file even if the extraction fails
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = pypdf.PdfReader(pdf_file)
            pdf_text = " ".join(page.extract_text() for page in pdf_reader.pages)

        # Extract sentences containing urls
        df = get_urls_and_sentences(pdf_text)
        if df is not None:  # Check if a DataFrame is returned
            return df

    except FileNotFoundError:
        # The PDF is not in the articles directory (e.g., an editorial board file).
        # No URLs are extracted; a single NaN row keeps the DOI in the results.
        return pd.DataFrame({"url": [np.nan], "sentences": [np.nan]})
    
    except Exception as e:
        print(f"Error reading PDF: {e}")
    
    # If the PDF could not be read, return an empty DataFrame
    return pd.DataFrame(columns=["url", "sentences"])


############### SENTENCES ################################################
def clean_text(text): 
    """This function performs a very simple initial cleaning of the extracted text. 
    This includes replacing double and triple spaces with a single space, removing all 
    newline characters, and removing spaces next to a number of special characters, 
    incl. -, (, ), /, ., _ , and between : / 
    """
    return (
        text.replace('   ', ' ').replace('  ', ' ').replace('\n', '')
        .replace('- ', '-').replace('( ', '(').replace(' )', ')')
        .replace('/ ', '/').replace(' /', '/').replace(' .', '.').replace(': /', ':/')
        .replace(' _ ', '_').replace(' _', '_').replace('_ ', '_')
    )

def get_sentences(text):
    """This function splits a given text into sentences based on a regular expression pattern. 
    It uses re.split() to split on whitespace that follows sentence-ending punctuation 
    (".", "!", or "?"). It does not split when that whitespace is followed by a number 
    and another whitespace, e.g., 'Fig. 1 shows'. Other abbreviations, e.g., 'et al. 2011)', 
    are still split. 
    
    Parameters: 
    :param text (str): Text to split. 
    
    Returns: 
    :return: List of sentences (str). 
    """
    sentence_pattern = r'(?<=[.!?])\s+(?![0-9]+\s)'
    sentences = re.split(sentence_pattern, text)
    return sentences


############### LINKS ################################################
def get_urls(text):
    """This function returns all unique URLs in a text, with trailing punctuation 
    ('.', ',' and ')') stripped from the end of each URL. 
    """
    # Instance of the URLExtract class
    extractor = urlextract.URLExtract()
    
    # Create a set to store unique URLs
    unique_urls = set()
    
    for url in extractor.gen_urls(text):
        # Strip any mix of trailing '.', ',' and ')' in one pass, so that 
        # endings such as '),' or ').' are removed completely
        processed_url = url.rstrip('.,)')
        unique_urls.add(processed_url)
    
    # Convert the set to a list (the order of the URLs is not guaranteed)
    unique_url_list = list(unique_urls)
    
    return unique_url_list

def get_sentences_with_urls(sentences):
    """This function returns the sentences that contain at least one URL. 
    Processing stops after the first sentence that contains the word "References", 
    so that URLs in the reference list are skipped. 
    """
    # Instance of the URLExtract class
    extractor = urlextract.URLExtract()
    # Extract all sentences with URLs
    sentences_with_urls = []

    for sentence in sentences:
        if extractor.has_urls(sentence):
            sentences_with_urls.append(sentence)
        # The sentence containing "References" is still checked above; later ones are not
        if "References" in sentence:
            break

    return sentences_with_urls
    
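def get_urls_and_sentences(text):
    """This function cleans the text, extracts the unique URLs and, for each URL, 
    collects the sentences (before the reference list) that contain it. 

    Returns a DataFrame with the columns 'url' and 'sentences', where 'sentences' 
    holds a list of sentences per URL. URLs without a matching sentence are dropped. 
    """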
    # Lists to store the extracted URLs and their corresponding sentences
    url_list = []
    sentence_list = []
    # Clean sentences 
    cleaned_text = clean_text(text)
    # Extract links
    links = get_urls(cleaned_text)
    # Extract sentences 
    sentences = get_sentences(cleaned_text)
    # Extract sentences with links 
    sentences_w_links = get_sentences_with_urls(sentences)

    # Process each URL 
    for link in links:
        sentences_for_url = [sentence for sentence in sentences_w_links if link in sentence]
        if sentences_for_url:  # Only add the URL if there are associated sentences
            url_list.append(link)
            sentence_list.append(sentences_for_url)
    
    # If no URLs were found, return an empty DataFrame
    if not url_list:
        return pd.DataFrame(columns=["url", "sentences"])

    return pd.DataFrame({"url": url_list, "sentences": sentence_list})


############### DATAFRAME WITH DOI, URL, SENTENCES ################################################
def process_groundtruth_DOIs(groundtruth_dois, articles_directory, editorialboard_directory, json_file_path):
    """This function processes a list of DOIs, extracts URLs and sentences from the PDFs, 
    and creates a DataFrame.

    Parameters:
    :param groundtruth_dois (list): List of DOIs to process.
    :param articles_directory (str): Path to the directory containing PDF articles.
    :param editorialboard_directory (str): Path to the directory with editorial board articles.
    :param json_file_path (str): Path to a JSON file. Not read here; kept so the 
        signature matches process_DOIs.

    Returns:
    :return pd.DataFrame: A DataFrame with the columns 'URL', 'Sentences' and 'DOI'.
    """
    results_list = []

    # To process just one DOI, pass a single-element list, 
    # e.g., ['10.1016/j.neuroimage.2022.119030']
    for doi in groundtruth_dois:
        # The PDF file names use '.' in place of '/' in the DOI
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(articles_directory, f"{doi_replaced}.pdf")

        # Call the get_content function for each DOI
        url_df = get_content(pdf_path, editorialboard_directory)

        # Append the DOI to the URL DataFrame
        url_df['DOI'] = doi
        results_list.append(url_df)

    # Concatenate the list of DataFrames into a single DataFrame
    results_df = pd.concat(results_list, ignore_index=True)

    # Rename the columns as needed
    results_df.rename(columns={'url': 'URL', 'sentences': 'Sentences'}, inplace=True)

    return results_df


def process_DOIs(articles_directory, editorialboard_directory, json_file_path):
    """This function reads the DOIs listed in a JSON file, extracts URLs and sentences 
    from the corresponding PDFs, and creates a DataFrame.

    Parameters:
    :param articles_directory (str): Path to the directory containing PDF articles.
    :param editorialboard_directory (str): Path to the directory with editorial board articles.
    :param json_file_path (str): Path to a JSON file with a 'DOIs' key holding the list of DOIs.

    Returns:
    :return pd.DataFrame: A DataFrame with the columns 'URL', 'Sentences' and 'DOI'.
    """
    results_list = []

    with open(json_file_path, 'r') as json_file:
        doi_data = json.load(json_file)
        for doi in doi_data['DOIs']:
            doi_replaced = doi.replace('/', '.')
            pdf_path = os.path.join(articles_directory, f"{doi_replaced}.pdf")

            # Call the get_content function for each DOI
            url_df = get_content(pdf_path, editorialboard_directory)

            # Append the DOI to the URL DataFrame
            url_df['DOI'] = doi
            results_list.append(url_df)
            
    # Concatenate the list of DataFrames into a single DataFrame
    results_df = pd.concat(results_list, ignore_index=True)

    # Rename the columns as needed
    results_df.rename(columns={'url': 'URL', 'sentences': 'Sentences'}, inplace=True)

    return results_df
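
One thing to keep in mind about *get_urls_and_sentences*: it pairs a URL with a sentence through a plain substring test (`link in sentence`), so a short URL is also paired with sentences that only contain a longer URL on the same domain. The sketch below illustrates this with two sentences shortened from the ground truth. The stricter helper `mentions_whole_url` is a hypothetical alternative, not part of the notebook, and its boundary rule is a rough assumption (it would, for instance, still accept a URL followed by a dot and further domain parts).

```python
import re


def mentions_whole_url(url, sentence):
    """Return True if `url` occurs in `sentence` without being part of a longer URL.

    Hypothetical helper: the URL must not be preceded by a word character,
    '/', '.' or '-', and must not be followed by a word character, '/' or '-'.
    """
    pattern = r'(?<![\w/.-])' + re.escape(url) + r'(?![\w/-])'
    return re.search(pattern, sentence) is not None


short_url = 'marmosetbrainconnectome.org'
long_url = 'https://www.marmosetbrainconnectome.org/download.html'
sentences = [
    'We present the Marmoset Functional Brain Connectivity Resource '
    '(marmosetbrainconnectome.org).',
    'The data download page '
    '(https://www.marmosetbrainconnectome.org/download.html) allows the user '
    'to download all data.',
]

# Plain substring test, as in get_urls_and_sentences
naive_matches = [s for s in sentences if short_url in s]
# Stricter test that rejects matches inside a longer URL
strict_matches = [s for s in sentences if mentions_whole_url(short_url, s)]

print(len(naive_matches), len(strict_matches))
```

With the substring test the short URL is paired with both sentences; with the stricter test it is paired only with the sentence that mentions it on its own.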

In [16]:
# Path to the directory containing PDFs
articles_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_articles_doi/'
editorialboard_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_editorialboard_doi/'

# Path to the JSON file containing DOI values
json_file_path = '../Data/ElsevierAPI/downloadedPDFs_info.json'

In [17]:
automatic_groundtruth_df = process_groundtruth_DOIs(groundtruth_dois, articles_directory, editorialboard_directory, json_file_path)

In [18]:
automatic_groundtruth_df

Unnamed: 0,URL,Sentences,DOI
0,http://creativecommons.org/licenses/by-nc-nd/4.0/,[This is an open access article under the CC B...,10.1016/j.neuroimage.2021.118839
1,www.elsevier.com/locate/neuroimage,[NeuroImage 248 (2022) 118839 Contents lists a...,10.1016/j.neuroimage.2021.118839
2,http://neuroimage.usc.edu/brainstorm,"[2011), an accredited software freely availabl...",10.1016/j.neuroimage.2021.118839
3,https://doi.org/10.1016/j.neuroimage.2021.118839,[Corticospinal projections has been shown also...,10.1016/j.neuroimage.2021.118839
4,https://www.humanconnectome.org/study/hcp-youn...,[Data and code availability The data used in t...,10.1016/j.neuroimage.2021.118854
...,...,...,...
65,https://github.com/RajaniRaman/face_body_integ...,[Codes for the fMRI data analysis at https://g...,10.1016/j.neuroimage.2022.119676
66,https://www.shutterstock.com,[Both experiments employed static images (modi...,10.1016/j.neuroimage.2022.119676
67,https://github.com/Yozaﬁrova/monkey-fMRI-codes,[Codes for the fMRI data analysis at https://g...,10.1016/j.neuroimage.2022.119676
68,http://creativecommons.org/licenses/by-nc-nd/4.0/,[This is an open access article under the CC B...,10.1016/j.neuroimage.2022.119676


Brief exploration of the URLs extracted from the groundtruth articles. 

In [19]:
# How many URLs are saved per DOI
doi_counts = automatic_groundtruth_df['DOI'].value_counts()

# Print the number of rows for each unique DOI
print(doi_counts)

10.1016/j.neuroimage.2022.119526    15
10.1016/j.neuroimage.2022.119676    10
10.1016/j.neuroimage.2022.119030     9
10.1016/j.neuroimage.2022.119443     8
10.1016/j.neuroimage.2022.119240     7
10.1016/j.neuroimage.2021.118854     6
10.1016/j.neuroimage.2022.119050     5
10.1016/j.neuroimage.2021.118839     4
10.1016/j.neuroimage.2022.119549     3
10.1016/j.neuroimage.2022.119646     3
Name: DOI, dtype: int64


In [20]:
automatic_groundtruth_df[automatic_groundtruth_df['DOI'] == '10.1016/j.neuroimage.2022.119526']

Unnamed: 0,URL,Sentences,DOI
39,https://github.com/jbrown81/gradients,"[All code (latent space derivation, dynamical ...",10.1016/j.neuroimage.2022.119526
40,https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/,[We used FSL (https://fsl.fmrib.ox.ac.uk/fsl/f...,10.1016/j.neuroimage.2022.119526
41,https://www.humanconnectome.org/study/hcp-youn...,[This study agreed to the Open Access Data Use...,10.1016/j.neuroimage.2022.119526
42,https://www.ﬁl.ion.ucl.ac.uk/spm/software/spm12/,[Task condition block regressors were convolve...,10.1016/j.neuroimage.2022.119526
43,www.elsevier.com/locate/neuroimage,[NeuroImage 261 (2022) 119526 Contents lists a...,10.1016/j.neuroimage.2022.119526
44,https://github.com/rmarkello/abagen,[Genetic spatial correlation We compared each ...,10.1016/j.neuroimage.2022.119526
45,http://www.diedrichsenlab.org/imaging/suit.htm,[Maps were averaged within 273 regions of inte...,10.1016/j.neuroimage.2022.119526
46,https://afni.nimh.nih.gov/,[We used FSL (https://fsl.fmrib.ox.ac.uk/fsl/f...,10.1016/j.neuroimage.2022.119526
47,http://www.brainnetome.org/,[Maps were averaged within 273 regions of inte...,10.1016/j.neuroimage.2022.119526
48,https://sites.google.com/site/bctnet/,[Graph the-ory analyses were run using the Bra...,10.1016/j.neuroimage.2022.119526


In [21]:
# Count the rows with URLs containing 'creativecommons.org'
cc_license_count = automatic_groundtruth_df['URL'].str.contains('creativecommons.org', regex=False).sum()

# Print the count
print(f"Number of rows with links to Creative Commons license: {cc_license_count}")

Number of rows with links to Creative Commons license: 10


<a name='processURLs'></a>
## 2.1. Process URLs

Before this point, I already performed a few preprocessing steps on the URLs: 
- In *get_urls*, I returned only unique URLs. 
- In *process_url*, I stripped the characters '.' and ')' from the end of each URL. 

At this point, each DOI has only unique URLs. Some of them I know do not point to datasets, so this processing step removes: 
* 'www.elsevier.com/locate/neuroimage': this link is placed outside of the article's text. 
* URLs containing the DOI of the article itself: this link is also placed outside of the article's text. 
* Creative Commons license links. 

First, I check which links are shared between articles, to see whether there are other NeuroImage- or Elsevier-specific links that can be removed. 

In [31]:
# Select the rows whose URL appears more than once, showing which DOIs share the same URL. 
# subset='URL' checks only the 'URL' column; keep=False marks all occurrences of a duplicate.
duplicate_urls = automatic_groundtruth_df[automatic_groundtruth_df.duplicated(subset='URL', keep=False)]

# Print the rows with duplicate URLs
duplicate_urls

Unnamed: 0,URL,Sentences,DOI
0,http://creativecommons.org/licenses/by-nc-nd/4.0/,[This is an open access article under the CC B...,10.1016/j.neuroimage.2021.118839
1,www.elsevier.com/locate/neuroimage,[NeuroImage 248 (2022) 118839 Contents lists a...,10.1016/j.neuroimage.2021.118839
5,www.elsevier.com/locate/neuroimage,[NeuroImage 249 (2022) 118854 Contents lists a...,10.1016/j.neuroimage.2021.118854
9,http://creativecommons.org/licenses/by/4.0/,[This is an open access article under the CC B...,10.1016/j.neuroimage.2021.118854
12,www.elsevier.com/locate/neuroimage,[NeuroImage 252 (2022) 119030 Contents lists a...,10.1016/j.neuroimage.2022.119030
18,http://creativecommons.org/licenses/by-nc-nd/4.0/,[This is an open access article under the CC B...,10.1016/j.neuroimage.2022.119030
20,www.elsevier.com/locate/neuroimage,[NeuroImage 253 (2022) 119050 Contents lists a...,10.1016/j.neuroimage.2022.119050
23,http://creativecommons.org/licenses/by-nc-nd/4.0/,[This is an open access article under the CC B...,10.1016/j.neuroimage.2022.119050
24,www.elsevier.com/locate/neuroimage,[NeuroImage 256 (2022) 119240 Contents lists a...,10.1016/j.neuroimage.2022.119240
30,http://creativecommons.org/licenses/by-nc-nd/4.0/,[This is an open access article under the CC B...,10.1016/j.neuroimage.2022.119240


Remove links: 

In [32]:
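def filter_urls(df, urls_to_remove):
    """
    Remove URLs that are known not to point to datasets.

    Drops the URLs listed in urls_to_remove, Creative Commons license links,
    and URLs containing the DOI of the article they were extracted from.
    DOIs that lose all their URLs are kept as one row with NaN values.
    """
    filtered_df = df.copy()
    # Collect unique DOIs 
    unique_dois = filtered_df['DOI'].unique()
    # Ensure 'URL' and 'DOI' columns are of string type
    filtered_df['URL'] = filtered_df['URL'].astype(str)
    filtered_df['DOI'] = filtered_df['DOI'].astype(str)
    
    # Remove URLs from list of URLs to remove 
    filtered_df = filtered_df[~filtered_df['URL'].isin(urls_to_remove)]

    # Remove URLs referring to the Creative Commons license (literal match, not a regex)
    filtered_df = filtered_df[~filtered_df['URL'].str.contains('creativecommons.org', regex=False)]
    
    # Remove URLs containing the article's own DOI (compared row by row)
    has_own_doi = pd.Series(
        [doi in url for url, doi in zip(filtered_df['URL'], filtered_df['DOI'])],
        index=filtered_df.index, dtype=bool)
    filtered_df = filtered_df[~has_own_doi]
    
    # Check if all unique DOIs are still present, if not, add them with NaN values
    # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
    missing_dois = sorted(set(unique_dois) - set(filtered_df['DOI'].unique()))
    if missing_dois:
        missing_df = pd.DataFrame({'DOI': missing_dois, 'URL': np.nan, 'Sentences': np.nan})
        filtered_df = pd.concat([filtered_df, missing_df], ignore_index=True)
    
    return filtered_df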

In [41]:
# URLs to remove
urls_to_remove = [
    'www.elsevier.com/locate/neuroimage',  # URL to remove
]

automatic_groundtruth_urls = filter_urls(automatic_groundtruth_df, urls_to_remove)



In [43]:
len(automatic_groundtruth_urls)

42

In [44]:
# How many URLs remain per DOI after filtering
# (count() skips NaN, so DOIs without any URL get 0)
doi_counts = automatic_groundtruth_urls.groupby('DOI')['URL'].count()

doi_counts = doi_counts.sort_values(ascending=False)

doi_counts

DOI
10.1016/j.neuroimage.2022.119526    12
10.1016/j.neuroimage.2022.119676     7
10.1016/j.neuroimage.2022.119030     6
10.1016/j.neuroimage.2022.119443     5
10.1016/j.neuroimage.2022.119240     4
10.1016/j.neuroimage.2021.118854     3
10.1016/j.neuroimage.2022.119050     2
10.1016/j.neuroimage.2021.118839     1
10.1016/j.neuroimage.2022.119549     0
10.1016/j.neuroimage.2022.119646     0
Name: URL, dtype: int64

In [36]:
# Calculate the average number of URLs per DOI
average_urls_per_doi = doi_counts.mean()
print("Average URLs per DOI:", average_urls_per_doi)

Average URLs per DOI: 4.0


The counts don't fully match my manual exploration: 
- 10.1016/j.neuroimage.2022.119240 has one URL fewer than my manual count. The missing link is 'https://github.jyeatman/AFQ'. 

In [37]:
manual_groundtruth_urls[manual_groundtruth_urls['DOI'] == '10.1016/j.neuroimage.2022.119240']

Unnamed: 0,DOI,URL,Sentence,Class
12,10.1016/j.neuroimage.2022.119240,www.cni.stanford.edu,MRI data were acquired on a 3T Discovery MR750...,Data
13,10.1016/j.neuroimage.2022.119240,http://github.com/vistalab/vistasoft/mrDiffusion,The T1w images were first aligned to the canon...,Code
14,10.1016/j.neuroimage.2022.119240,http://www.fil.ion.ucl.ac.uk/spm/,Each diffusion weighted image was registered t...,Code
15,10.1016/j.neuroimage.2022.119240,https://github.com/mezera/mrQ,"Quantitative T1 (relaxation time, seconds) map...",Code
16,10.1016/j.neuroimage.2022.119240,https://github.jyeatman/AFQ,Automated Fiber Quantification (AFQ; https://g...,Code


In [38]:
automatic_groundtruth_urls[automatic_groundtruth_urls['DOI'] == '10.1016/j.neuroimage.2022.119240']

Unnamed: 0,URL,Sentences,DOI
12,https://github.com/mezera/mrQ,"[Quantitative T1 (relaxation time, seconds) ma...",10.1016/j.neuroimage.2022.119240
13,http://github.com/vistalab/vistasoft/mrDiﬀusion,[Diﬀusion weighted images were pre-processed w...,10.1016/j.neuroimage.2022.119240
14,www.cni.stanford.edu,[MRI data acquisition and processing MRI data ...,10.1016/j.neuroimage.2022.119240
15,http://www.ﬁl.ion.ucl.ac.uk/spm/,[Each diﬀusion weighted image was registered t...,10.1016/j.neuroimage.2022.119240


The link is also absent from the unfiltered URLs, so it was missed during extraction rather than removed by the filter: 

In [39]:
automatic_groundtruth_df[automatic_groundtruth_df['DOI'] == '10.1016/j.neuroimage.2022.119240']

Unnamed: 0,URL,Sentences,DOI
24,www.elsevier.com/locate/neuroimage,[NeuroImage 256 (2022) 119240 Contents lists a...,10.1016/j.neuroimage.2022.119240
25,https://doi.org/10.1016/j.neuroimage.2022.119240,[The sample included FT and PT children across...,10.1016/j.neuroimage.2022.119240
26,https://github.com/mezera/mrQ,"[Quantitative T1 (relaxation time, seconds) ma...",10.1016/j.neuroimage.2022.119240
27,http://github.com/vistalab/vistasoft/mrDiﬀusion,[Diﬀusion weighted images were pre-processed w...,10.1016/j.neuroimage.2022.119240
28,www.cni.stanford.edu,[MRI data acquisition and processing MRI data ...,10.1016/j.neuroimage.2022.119240
29,http://www.ﬁl.ion.ucl.ac.uk/spm/,[Each diﬀusion weighted image was registered t...,10.1016/j.neuroimage.2022.119240
30,http://creativecommons.org/licenses/by-nc-nd/4.0/,[This is an open access article under the CC B...,10.1016/j.neuroimage.2022.119240


Compared to the manual extraction, the code picked up 39 of the 40 links. I will move on with the current code and extract the URLs from the entire corpus of NeuroImage 2022 articles. 
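
Several automatically extracted URLs contain typographic ligatures from the PDF (for example 'ﬁ' in 'www.ﬁl.ion.ucl.ac.uk'), so a direct string comparison with the manual list would report false mismatches. The sketch below shows one way to compare the two sets after Unicode NFKC normalization, which expands such ligatures. The helper name `normalize_url` and the two small sets are illustrative assumptions, not part of the pipeline above.

```python
import unicodedata

def normalize_url(url):
    """Expand compatibility characters such as the 'fi' ligature and trim whitespace."""
    return unicodedata.normalize('NFKC', url).strip()

# Small illustrative subsets of the manual and automatic URL lists.
manual = {
    'http://www.fil.ion.ucl.ac.uk/spm/',
    'https://github.com/mezera/mrQ',
    'https://github.jyeatman/AFQ',
}
automatic = {
    'http://www.\ufb01l.ion.ucl.ac.uk/spm/',  # \ufb01 is the single-character 'fi' ligature
    'https://github.com/mezera/mrQ',
}

automatic_norm = {normalize_url(u) for u in automatic}

missed = manual - automatic_norm   # found manually but not by the code
extra = automatic_norm - manual    # found by the code but not manually
print("Missed:", missed)
print("Extra:", extra)
```

Normalizing before comparison keeps the check focused on links that were truly missed, such as the AFQ link, instead of links that differ only in encoding.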

<a name='URLsinNeuroImage2022articles'></a>
## 2.2. URLs in NeuroImage 2022 articles 

Some expectations based on the ground truth sample: 
- 4 URLs per article on average after filtering (ranges from 0 to 12). 
- Two out of ten articles will not contain any URLs (after filtering). 
    - 20% of the articles seems quite high. 
    - 20% of 834 is 166.8, i.e., about 167 articles. 
- With 834 articles in total, based on the average: 
    - Before filtering: seven URLs per article on average (the average of four plus at least three URLs I filter out), i.e., 5,838 URLs. 
    - After filtering: 3,336 URLs. 
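
The arithmetic behind these expectations can be checked directly. The variable names below are only illustrative, and the inputs are the ground-truth figures reported above (834 articles, an average of four URLs after filtering, three filtered URLs per article, and two of ten articles without URLs).

```python
n_articles = 834
mean_urls_after_filtering = 4   # ground-truth average after filtering
removed_per_article = 3         # Elsevier link, article DOI link, CC license link
share_without_urls = 2 / 10     # two of the ten ground-truth articles

expected_after = n_articles * mean_urls_after_filtering
expected_before = n_articles * (mean_urls_after_filtering + removed_per_article)
expected_without_urls = round(n_articles * share_without_urls)

print("Expected URLs before filtering:", expected_before)
print("Expected URLs after filtering:", expected_after)
print("Expected articles without URLs:", expected_without_urls)
```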

In [25]:
# Path to the directory containing PDFs
articles_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_articles_doi/'
editorialboard_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_editorialboard_doi/'

# Path to the JSON file containing DOI values
json_file_path = '../Data/ElsevierAPI/downloadedPDFs_info.json'

# URLs to remove
urls_to_remove = [
    'www.elsevier.com/locate/neuroimage',  # URL to remove
]

In [None]:
df = process_DOIs(articles_directory, editorialboard_directory, json_file_path)

In [None]:
len(df['DOI'].unique())

In [None]:
len(df)

In [None]:
filtered_df = filter_urls(df, urls_to_remove)

In [None]:
filtered_df

In [None]:
len(filtered_df['DOI'].unique())

In [None]:
# URLs, including articles with NAN values in the URL column 
len(filtered_df['DOI'])

In [None]:
# URLs, excluding articles with NAN values in the URL column 
len(filtered_df[~filtered_df['URL'].isna()])

In [None]:
count_of_nan_dois = filtered_df['URL'].isna().sum()
print("Count of DOIs with NaN values in the 'URL' column:", count_of_nan_dois)

In [None]:
urls_df = filtered_df[~filtered_df['URL'].isna()]

In [None]:
# Define the path to the 'Code-git/Data' directory
data_dir = os.path.join(os.pardir, 'Data')

# Define the file path
path_all_urls = os.path.join(data_dir, 'articles_all_urls.csv')
path_filtered_urls = os.path.join(data_dir, 'articles_filtered_urls.csv')

# Save the DataFrame to CSV, overwriting the file if it exists
df.to_csv(path_all_urls, index=False, mode='w')
filtered_df.to_csv(path_filtered_urls, index=False, mode='w')

<a name='unsupervisedtextclassificationusingwordembeddings'></a>
# 3. Unsupervised text classification using word embeddings


As my overarching goal is to get the datasets, I want to use the sentences that contain the URLs to help classify the URLs. E.g., from the article with DOI 10.1016/j.neuroimage.2022.119526: 
* Data: "Subjects and data 200 unrelated subjects were selected from the Human Con-nectome Project (HCP) 1200 Subjects Data Release with avail-able resting (task-free) and task fMRI data from a 3T MRI scan-ner (https://db.humanconnectome.org/data/projects/HCP_1200)."
* Code: "We used FSL (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/) and AFNI (https://afni.nimh.nih.gov/) for additional fMRI preprocessing." 
* Data: "All code (latent space derivation, dynamical system modeling, and gene expression corre-lation) and processed data (gradient maps/region weights, gradient timeseries, and region gene expression values) are available at https://github.com/jbrown81/gradients."

I combine the approaches and steps presented by Halford (2020) and Kosar et al. (2022). 

I start from the steps M. Halford presents in their blog post, where they perform unsupervised text classification with word embeddings: 
* Import spaCy and the model (en_core_web_lg, which is the large English model). 
* Clean the text (remove punctuation marks, unnecessary whitespace, and carriage returns, and lowercase all the text). 
* Tokenize and lemmatize the text. 
* Implement the two method improvements from Kosar et al. (2022): create a **class concept** and augment the concept with **additional class label instances**. 
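
The cleaning step in the list above can be sketched with the standard library alone. This is an assumption about how the cleaning could look, not the code from either reference; `clean_text` is a hypothetical helper, and tokenization and lemmatization (which need spaCy) are left out.

```python
import re
import string

def clean_text(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    # Map every punctuation character to a space so adjacent words stay separated.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    text = text.translate(table)
    # Collapse runs of whitespace (spaces, tabs, \r, \n) into single spaces.
    return re.sub(r'\s+', ' ', text).strip()

example = "Graph theory analyses were run\r\n using the   Brain Connectivity Toolbox!"
print(clean_text(example))
```

Because punctuation is replaced wholesale, URLs inside a sentence are split into fragments by this function; it may be preferable to remove the URL from the sentence before cleaning.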


Prepping for embedding and analysis: the cleaning steps are taken from Haj-Yahia, Z., Sieg, A., & Deleris, L. A. (2019). Towards Unsupervised Text Classification Leveraging Experts and Word Embeddings. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 371–379. https://doi.org/10.18653/v1/P19-1036


---

I’m going to use spaCy for manipulating word embeddings. I’ve decided to use the en_core_web_lg embeddings, which contain Word2vec embeddings that were fitted on Common Crawl data. The embeddings can be downloaded with: `python -m spacy download en_core_web_lg`

---

I want to separate the URLs by the content of the sentences in which they appear. My separation will be very basic: 

* Data: data(-base, -set), image(s), (neuro)(map(s)), atlas, (DOI(s))
* Code: tool(-kit, -box), script(s), algorithm, software, package, plugin, function, (analysis, result(s))
* Other
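
To make the idea concrete, here is a minimal sketch of classifying a tokenized sentence by cosine similarity to a class concept, where each concept is the mean vector of its label words. It illustrates the general idea only and is not the exact procedure of Halford (2020) or Kosar et al. (2022). The 3-dimensional `toy_vectors`, the helper names, and the `threshold` value are all made up for illustration; real vectors would come from spaCy or fastText.

```python
import numpy as np

# Toy 3-d "embeddings" (hypothetical values).
toy_vectors = {
    'data': np.array([1.0, 0.1, 0.0]),
    'dataset': np.array([0.9, 0.2, 0.0]),
    'atlas': np.array([0.8, 0.0, 0.1]),
    'code': np.array([0.0, 1.0, 0.1]),
    'software': np.array([0.1, 0.9, 0.0]),
    'toolbox': np.array([0.0, 0.8, 0.2]),
}

# Class labels: a few label words per class, as in the list above.
class_labels = {
    'Data': ['data', 'dataset', 'atlas'],
    'Code': ['code', 'software', 'toolbox'],
}

def mean_vector(words):
    """Average the vectors of the known words; None if no word is known."""
    known = [toy_vectors[w] for w in words if w in toy_vectors]
    return np.mean(known, axis=0) if known else None

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Class concept = mean vector of the class's label words.
class_concepts = {name: mean_vector(words) for name, words in class_labels.items()}

def classify(tokens, threshold=0.5):
    """Return the most similar class, or 'Other' if nothing is similar enough."""
    sentence_vec = mean_vector(tokens)
    if sentence_vec is None:
        return 'Other'
    scores = {name: cosine(sentence_vec, concept)
              for name, concept in class_concepts.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else 'Other'

print(classify(['we', 'used', 'the', 'toolbox', 'software']))
print(classify(['the', 'dataset', 'is', 'available']))
print(classify(['see', 'the', 'website']))
```

The 'Other' class needs no label words in this sketch: a sentence falls into it when none of its tokens has a vector or when the best similarity stays below the threshold.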



References: 
- Halford, M. (2020, October 3). Unsupervised text classification with word embeddings. https://maxhalford.github.io/blog/unsupervised-text-classification/
- Kalai, A. T. & Brown University (Directors). (2019, April 18). An ICERM Public Lecture - Bias in bios: Fairness in a high-stakes machine-learning setting. https://www.youtube.com/watch?v=IDNXZitcQng
- Kosar, A., Pauw, G. D., & Daelemans, W. (2022). Unsupervised Text Classification with Neural Word Embeddings. _Computational Linguistics in the Netherlands Journal_, _12_, 165–181.


Hybrid approach: 
- Use FastText to create class concepts and labels 
- Use ELMo or BERT to analyze the sentences - this is a contextualized approach, which I think will yield better results. 


<a name='classconceptsandclasslabels'></a>
## 3.1. Class concepts and class labels

I use the pre-trained English vectors that were trained on Common Crawl data and Wikipedia using fastText (Grave et al. 2018), available on this link: https://fasttext.cc/docs/en/crawl-vectors.html. 

    *NB! The pre-trained vectors called 'crawl-300d-2M.vec' with 600B tokens (Mikolov et al. 2017) could not be loaded properly, because they are in the .vec format and not the .bin format that I need. These vectors are available on this link: https://fasttext.cc/docs/en/english-vectors.html*

References: 
- Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., & Joulin, A. (2017). Advances in Pre-Training Distributed Word Representations (arXiv:1712.09405). arXiv. https://doi.org/10.48550/arXiv.1712.09405
- Grave, E., Bojanowski, P., Gupta, P., Joulin, A., & Mikolov, T. (2018). Learning Word Vectors for 157 Languages (arXiv:1802.06893). arXiv. https://doi.org/10.48550/arXiv.1802.06893

In [7]:
def load_vectors(fname):
    """This function is adapted from fastText. (n.d.). English word vectors · fastText. Retrieved October 19, 2023, from https://fasttext.cc/index.html.
    The vectors were created by Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., & Joulin, A. (2017). Advances in Pre-Training Distributed Word Representations (arXiv:1712.09405). arXiv. https://doi.org/10.48550/arXiv.1712.09405.
    """
    data = {}
    # Use a context manager so the file is closed after reading
    with io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        # The first line holds the number of words (n) and the vector dimension (d)
        n, d = map(int, fin.readline().split())
        for line in fin:
            tokens = line.rstrip().split(' ')
            # Store a NumPy array; a bare map() is a one-shot iterator in Python 3
            data[tokens[0]] = np.array(tokens[1:], dtype=np.float32)
    return data

In [9]:
# Path to the FastText word vectors file
model_path = '../Data/fastText_model_folder/crawl-300d-2M.vec' 

# Load the word vectors
vectors = load_vectors(model_path)

I use these vectors: 
https://fasttext.cc/docs/en/crawl-vectors.html 

Reference: 
- Grave, E., Bojanowski, P., Gupta, P., Joulin, A., & Mikolov, T. (2018). Learning Word Vectors for 157 Languages (arXiv:1802.06893). arXiv. https://doi.org/10.48550/arXiv.1802.06893

In [14]:
fasttext.util.download_model('en', if_exists='ignore')  # English
model = fasttext.load_model('cc.en.300.bin')

Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz



 (6.17%) [===>                                               ]==>                                                ]==>                                                ]==>                                                ]==>                                                ]==>                                                ]==>                                                ]==>                                                ]==>                                                ]==>                                                ]==>                                                ]==>                                                ]==>                                                ]==>                                                ]==>                                                ]==>                                                ]==>                                                ]==>                                                ]==>                                                ]==

 (6.62%) [===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>  

 (7.03%) [===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]                                               ]                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>      

 (7.44%) [===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]==

 (7.76%) [===>                                               ]>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===> 

 (8.26%) [====>                                              ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]===>                                               ]>                                               ]===> 

 (8.52%) [====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]]====>                                              ]]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]

 (8.95%) [====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]==

 (9.29%) [====>                                              ]====>                                              ]                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]                                              ]====>                                              ]====>       

 (9.61%) [====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]                                              ]>                                              ]>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>          

 (9.78%) [====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]=

 (10.14%) [=====>                                             ]===>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]>                                              ]====>                                              ]>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>                                              ]====>          

 (10.39%) [=====>                                             ]=====>                                             ]>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>

 (10.86%) [=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]>                                             ]=====>                                             ]=====>

 (11.25%) [=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=

 (11.68%) [=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>                                             ]=====>



































































































































































































































































































































































































































































































































































In [15]:
model

<fasttext.FastText._FastText at 0x7f8b5b748fa0>

In [16]:
model.get_nearest_neighbors('data')

[(0.697354257106781, 'data.Data'),
 (0.676213264465332, 'Data'),
 (0.6645808219909668, 'data.This'),
 (0.6600629687309265, 'data.The'),
 (0.6488742232322693, 'datat'),
 (0.6467322111129761, 'data.So'),
 (0.6350638270378113, 'data.But'),
 (0.6328354477882385, 'datasets'),
 (0.626289963722229, 'data.Now'),
 (0.6201826930046082, 'data.That')]

I load the spaCy language model `en_core_web_lg`, which provides pre-trained word embeddings. It is the largest of spaCy's `sm`, `md` and `lg` English pipelines, and its word vector table covers a broader range of concepts and terms than the small and medium-sized models. 

I use the pre-trained word embeddings from this model to build a class concept, retrieve related class labels, and then classify the sentences. 

The spaCy documentation (https://spacy.io/models) describes the difference as follows: "For pipelines with default vectors, md has a reduced word vector table with 20k unique vectors for ~500k words and lg has a large word vector table with ~500k entries. For pipelines with floret vectors, md vector tables have 50k entries and lg vector tables have 200k entries."

In [None]:
# Count the alphabetic entries (lexemes) in spaCy's vocabulary.
# Case variants such as 'data' and 'Data' are separate lexemes, so each is counted.
unique_words_count = sum(1 for lexeme in nlp.vocab if lexeme.is_alpha)

print(f"Number of alphabetic lexemes in spaCy's vocabulary: {unique_words_count}")

<a name='classconceptsandclasslabels'></a> 
## 3.1. Class concepts and class labels 

I compare the resulting class labels with my ground truth sample, which I annotated manually. The formal validation step comes later. 

As suggested by Kosar et al. (2022), there are two ways to improve class label representation with pre-trained models: substituting the class label vector with a class concept vector, and augmenting the class vector with additional class label instances. I build a class concept vector from the word 'data' and its ten most similar words, and I include these ten words as additional class labels. 

References: 
- Kosar, A., Pauw, G. D., & Daelemans, W. (2022). Unsupervised Text Classification with Neural Word Embeddings. _Computational Linguistics in the Netherlands Journal_, _12_, 165–181.
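
The difference between a class label vector and a class concept vector can be sketched with a toy example. This is only an illustration under stated assumptions: the three-dimensional vectors are invented numbers, and the names `toy_vectors`, `mean_vector` and `cosine` are hypothetical helpers, not part of spaCy or of the method in Kosar et al. (2022). In the notebook, the vectors would come from the pre-trained model.

```python
import math

# Toy 3-dimensional "embeddings" (invented numbers, for illustration only)
toy_vectors = {
    "data":     [0.9, 0.1, 0.0],
    "dataset":  [0.8, 0.2, 0.1],
    "database": [0.7, 0.3, 0.0],
    "code":     [0.1, 0.9, 0.2],
}

def mean_vector(words, vectors):
    """Average the vectors of the given words, component by component."""
    rows = [vectors[w] for w in words]
    return [sum(column) / len(rows) for column in zip(*rows)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Class label vector: the single word 'data'
label_vector = toy_vectors["data"]
# Class concept vector: 'data' averaged with its related words
concept_vector = mean_vector(["data", "dataset", "database"], toy_vectors)

print("concept vector:", concept_vector)
print("similarity to 'dataset':", cosine(concept_vector, toy_vectors["dataset"]))
print("similarity to 'code':", cosine(concept_vector, toy_vectors["code"]))
```

Averaging pulls the class vector towards the region shared by the related words, which is the intuition behind using a concept rather than a single label.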

In [None]:
def create_class_concept(class_label, num_related_words=10):
    """Build a class concept from a class label and its most similar words.

    The function uses the preloaded spaCy model (`nlp`) to embed the class label,
    scores the alphabetic, in-vocabulary entries of the spaCy vocabulary by their
    similarity to the label, and filters out common (stop) words to get labels
    that are more descriptive. The class label is then combined with the top
    `num_related_words` related words to create the class concept.
    """
    # Embed the class label
    class_doc = nlp(class_label)
    stop_words = spacy.lang.en.stop_words.STOP_WORDS

    # Score candidate words by their similarity to the class label.
    # Note: recent spaCy versions fill nlp.vocab lazily, so this loop may only
    # see lexemes that have already been processed.
    related_words = []
    for lexeme in nlp.vocab:
        if lexeme.is_oov or not lexeme.is_alpha:
            continue  # skip entries without a vector and non-alphabetic entries
        if lexeme.text == class_label or lexeme.text in stop_words:
            continue  # skip the label itself (it is added below) and common words
        related_words.append((lexeme.text, class_doc.similarity(lexeme)))

    # Sort by similarity (highest first) and keep the top N related words
    related_words.sort(key=lambda pair: pair[1], reverse=True)
    top_related_words = [text for text, _ in related_words[:num_related_words]]

    # Combine the class label and related words into the class concept
    return " ".join([class_label] + top_related_words)

In [None]:
# Example usage
class_label = "data"
class_concept = create_class_concept(class_label)
class_concept

In [None]:
test = most_similar('data')

In [None]:
test

In [None]:
categories = {
    "data": ["data", "database", "dataset", "images", "neuromaps", "maps", "atlas", "DOI"],
    "code": ["code", "tool", "toolkit", "algorithm", "software", "package", "plugin", "function"],
    # "other": ["analysis", "results"]
}

<a name='sentenceclassification'></a>
## 3.2. Sentence classification

Steps: 
* Classify the sentences from the ground truth sample and see how well it performs. 
* Classify all sentences. 
* Analyse the performance using a validation set of 30 articles that are manually annotated by myself and two other people. 
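
As a rough sketch of the last step, the three manual annotations could be merged by majority vote and the classifier scored against the merged labels. Everything here is an assumption made for illustration: the annotations and predictions are invented, and `majority_label` and `precision_recall` are hypothetical helpers rather than the evaluation actually used in this notebook.

```python
from collections import Counter

def majority_label(labels):
    """Return the most common label, or None when there is no strict majority."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

def precision_recall(gold, predicted, target):
    """Precision and recall for one target class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented example: one tuple of three annotator labels per sentence
annotations = [
    ("data", "data", "code"),
    ("code", "code", "code"),
    ("data", "data", "data"),
    ("code", "data", "code"),
]
# Invented classifier output for the same four sentences
predicted = ["data", "code", "code", "code"]

gold = [majority_label(labels) for labels in annotations]
precision, recall = precision_recall(gold, predicted, target="data")
print(f"gold: {gold}, precision: {precision}, recall: {recall}")
```

Sentences without a strict majority get `None` as their gold label, so they can be inspected or excluded before scoring.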


In [None]:
# Initialize lists to store preprocessed sentences and tokenized text
preprocessed_sentences = []
tokenized_text = []

# Process each sentence
for sentence in sentences:
    # Apply spaCy's pipeline to preprocess and analyze the sentence
    doc = nlp(sentence)
    
    # Remove stop words and punctuations, and lemmatize
    cleaned_words = [word.lemma_ for word in doc if word.is_alpha and word.text.lower() not in STOP_WORDS]
    
    preprocessed_sentences.append(" ".join(cleaned_words))
    tokenized_text.append(cleaned_words)

# Convert tokenized text to sentence vectors by averaging the word embeddings
sentence_vectors = []
for tokens in tokenized_text:
    vectors = [token.vector for token in nlp(" ".join(tokens))]
    if vectors:
        sentence_vectors.append(np.mean(vectors, axis=0))
    else:
        # No tokens left after cleaning: fall back to a zero vector of the embedding size
        sentence_vectors.append(np.zeros(nlp.vocab.vectors_length))

# Build a lookup of spaCy's pre-trained word vectors, then average the vectors
# of each category's words to get one vector per category.

word_vectors = {word.text: word.vector for word in nlp.vocab if word.has_vector}

category_vectors = {
    category: np.mean([word_vectors[word] for word in category_words if word in word_vectors], axis=0)
    for category, category_words in categories.items()
}

# Fit a nearest-neighbour index on the category vectors.
# NearestNeighbors uses Euclidean distance by default; pass metric="cosine" for cosine distance.
label_names = list(category_vectors.keys())
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(list(category_vectors.values()))

# Classify each sentence by its nearest category vector
classified_sentences = []
for sentence_vector in sentence_vectors:
    closest_label = neigh.kneighbors([sentence_vector], return_distance=False)[0, 0]
    classified_sentences.append(label_names[closest_label])

# classified_sentences now holds the closest category label for each sentence

In [None]:
classified_sentences

In [None]:
# Classify a sentence by the cosine similarity between its vector and the category vectors
def classify_sentence(sentence, word_vectors, category_vectors, threshold=0.6):
    """
    Return the category whose vector is most similar to the sentence vector,
    or "Mixed/Uncertain" if the best similarity is below `threshold`.

    Cosine similarity values range from -1 (completely dissimilar) to 1 (completely similar).
    A threshold of 0.7 may be too strict, so that most sentences do not reach it.
    """
    tokens = sentence.split()
    # Look up the vector of each token, skipping unknown tokens and all-zero vectors
    token_vectors = [word_vectors[token] for token in tokens if token in word_vectors]
    token_vectors = [vector for vector in token_vectors if np.any(vector)]

    if not token_vectors:
        print("No token_vectors")
        # No valid token vectors: the sentence cannot be classified
        return "Mixed/Uncertain"

    sentence_vector = np.mean(token_vectors, axis=0)
    # cosine_similarity returns a 1x1 matrix; take the scalar value
    similarities = {
        category: cosine_similarity([sentence_vector], [category_vector])[0, 0]
        for category, category_vector in category_vectors.items()
    }
    best_category, best_similarity = max(similarities.items(), key=lambda x: x[1])

    if best_similarity >= threshold:
        return best_category
    return "Mixed/Uncertain"

# Rebuild the word-vector lookup and the category vectors (same as above),
# so that this cell can be run on its own.

word_vectors = {word.text: word.vector for word in nlp.vocab if word.has_vector}

category_vectors = {
    category: np.mean([word_vectors[word] for word in category_words if word in word_vectors], axis=0)
    for category, category_words in categories.items()
}

# Classify sentences
for sentence in sentences:
    category = classify_sentence(sentence, word_vectors, category_vectors)
    print(f"Sentence: '{sentence}' - Category: {category}")

<a name='validate'></a>
# 4. Validate

In [None]:
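import random


def get_random_dois(json_file_path, num_samples=30, random_seed=42):
    """Return a reproducible random sample of DOIs from the 'DOIs' list in a JSON file."""
    # Set the random seed for reproducibility
    random.seed(random_seed)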

    # Load the DOI data from the JSON file
    with open(json_file_path, 'r') as json_file:
        doi_data = json.load(json_file)
        doi_list = doi_data['DOIs']

    # Get a sample of DOIs
    random_dois = random.sample(doi_list, num_samples)

    return random_dois

In [None]:
# Path to the JSON file containing DOI values
json_file_path = '../Data/ElsevierAPI/downloadedPDFs_info.json'

# Get 30 random DOIs with a fixed random seed (1)
random_dois = get_random_dois(json_file_path, num_samples=30, random_seed=1)

random_dois

In [None]:
# Find overlapping DOIs using set intersection
overlapping_dois = set(groundtruth_dois) & set(random_dois)

# Convert the result to a list
overlapping_dois_list = list(overlapping_dois)

# Print the overlapping DOIs
print(overlapping_dois_list)

The 30 articles for validation are: 

['10.1016/j.neuroimage.2022.119453',
 '10.1016/j.neuroimage.2022.119221',
 '10.1016/j.neuroimage.2022.119373',
 '10.1016/j.neuroimage.2022.119496',
 '10.1016/j.neuroimage.2022.119227',
 '10.1016/j.neuroimage.2022.119408',
 '10.1016/j.neuroimage.2022.119286',
 '10.1016/j.neuroimage.2022.119709',
 '10.1016/j.neuroimage.2022.119210',
 '10.1016/j.neuroimage.2022.119660',
 '10.1016/j.neuroimage.2021.118745',
 '10.1016/j.neuroimage.2022.119515',
 '10.1016/j.neuroimage.2022.119714',
 '10.1016/j.neuroimage.2022.118963',
 '10.1016/j.neuroimage.2022.118906',
 '10.1016/j.neuroimage.2022.119554',
 '10.1016/j.neuroimage.2022.119437',
 '10.1016/j.neuroimage.2022.118972',
 '10.1016/j.neuroimage.2022.119347',
 '10.1016/j.neuroimage.2022.119087',
 '10.1016/j.neuroimage.2022.119507',
 '10.1016/j.neuroimage.2022.119619',
 '10.1016/j.neuroimage.2021.118784',
 '10.1016/j.neuroimage.2022.119584',
 '10.1016/j.neuroimage.2021.118810',
 '10.1016/j.neuroimage.2022.119500',
 '10.1016/j.neuroimage.2022.119589',
 '10.1016/j.neuroimage.2022.119502',
 '10.1016/j.neuroimage.2021.118823',
 '10.1016/j.neuroimage.2021.118820']

<a name='references'></a>
# References

- Lipovský, J. (2022). urlextract: Collects and extracts URLs from given text. (1.8.0) [Python]. https://github.com/lipoja/URLExtract
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget

---
<a name = 'oldcode'></a>
# OLD CODE 

<a name='gatherdatasets'></a>
# 1. Gather datasets 

Plan of attack to explore: 
* If there is a 'Data availability' (or similar) section, locate it and look for links. If there are multiple links, save all of them and look at the surrounding words for context. 
* Else, if there is no 'Data availability' (or similar) section: 
	* Look at the wording in section 2.1 
<br>
<br>
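
The plan above can be sketched with regular expressions. This is a minimal illustration, not the approach used in the functions below: the `HEADING` and `URL` patterns, the `urls_with_context` helper, the `window` parameter and the sample text (with placeholder example.org links) are all assumptions made for the example. The sketch treats everything after the heading as the section, since it does not use end patterns.

```python
import re

# Hypothetical heading variants and URL pattern; both would need tuning on real articles
HEADING = re.compile(r"data (?:and code )?availability(?: statement)?", re.IGNORECASE)
URL = re.compile(r"https?://[^\s)\]>]+")

def urls_with_context(text, window=60):
    """Return (url, context) pairs found after a 'Data availability' heading.

    Returns an empty list when no such heading is found, which is the cue
    to fall back to the wording in section 2.1.
    """
    match = HEADING.search(text)
    if match is None:
        return []
    section = text[match.end():]
    results = []
    for url_match in URL.finditer(section):
        # Keep up to `window` characters on each side of the URL as context
        start = max(0, url_match.start() - window)
        end = min(len(section), url_match.end() + window)
        url = url_match.group().rstrip(".,;")  # drop trailing sentence punctuation
        results.append((url, section[start:end].strip()))
    return results

# Invented sample text with placeholder links
sample = (
    "4. Discussion ... Data availability The data are available at "
    "https://example.org/dataset. Code is at https://example.org/code."
)
for url, context in urls_with_context(sample):
    print(url, "|", context)
```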

<a name='gettextsections'></a>
## 1.1. Get text sections 

I build on the work of Akkoç (2023) and Sourget (2023) to search the PDFs for their datasets. The two functions presented in this section are inspired by code from their two separate Git repositories. 
- *get_section* is loosely interpreted from Akkoç (2023), found at the following path in the GitHub repository: PublicDatasets/ArticleAnalyser.ipynb, section '2.1 Get section function'
- *get_content* is loosely interpreted from Sourget (2023), found at the following path in the GitHub repository: DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures'.

<br>

References: 
- Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022)
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget

In [None]:
def get_content(pdf_path, alt_pdf_path, section_patterns):
    """Extract a section from a PDF using the given section patterns. 
    This function is loosely interpreted from Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget
    specifically: DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures'.
    
    Parameters: 
    :param pdf_path (str): Path to the PDF file.
    :param alt_pdf_path (str): Alternative directory to look in when the PDF is not found at pdf_path.
    :param section_patterns (list of lists): Start and end patterns passed on to get_section.
    
    Returns: 
    :return: The extracted content, 'Editorial board' if the PDF is not found at pdf_path,
        or None if no section matches or the PDF cannot be read.
    """
    try:
        # The context manager closes the file even when the function returns early
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = pypdf.PdfReader(pdf_file)
            # Read the entire PDF content
            pdf_text = " ".join(page.extract_text() for page in pdf_reader.pages)

        # Extract sections using the provided section patterns
        content = get_section(pdf_text, section_patterns)
        if content:
            return content

    except FileNotFoundError:
        # The PDF is missing from the primary directory; print where it would be
        # in the alternative directory and treat the article as 'Editorial board'
        alternative_pdf_path = os.path.join(alt_pdf_path, os.path.basename(pdf_path))
        print(alternative_pdf_path)
        return 'Editorial board'

    except Exception as e:
        print(f"Error reading PDF: {e}")

    return None


def get_section(article, section_patterns):
    """Get sections from a research paper based on patterns.
    This function is loosely interpreted from Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022) with some alterations.
    specifically PublicDatasets/ArticleAnalyser.ipynb, section '2.1 Get section function'
    
    Parameters: 
    :param article (str): Text contents of the research paper.
    :param section_patterns (list of lists): A list of lists where each inner list represents the start and end patterns.
    
    Returns: 
    :return: The extracted section text.
    """
    article_lower = article.lower()  # Convert contents to lowercase

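    # Attempt to find the section based on the current patterns (case-insensitive)
    # NB: re.escape below turns every pattern into a literal string, so regex syntax such as \d+ is matched as plain text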
    for start_patterns, end_patterns in section_patterns:
        for start_pattern in start_patterns:
            start_pattern = re.compile(re.escape(start_pattern), re.IGNORECASE)
            match_start = start_pattern.search(article_lower)
            if match_start:
                idx0 = match_start.start()
                for end_pattern in end_patterns:
                    end_pattern = re.compile(re.escape(end_pattern), re.IGNORECASE)
                    match_end = end_pattern.search(article_lower[idx0:])
                    if match_end:
                        end_idx = idx0 + match_end.end()
                        section = article[idx0:end_idx]  # Extract the matched section
                        return section

    # If no match is found, return an empty string
    return ""

In [None]:
# Path to the directory containing PDFs
pdf_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_articles_doi/'
alternative_pdf_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_editorialboard_doi/'

# Path to the JSON file containing DOI values
json_file_path = '../Data/ElsevierAPI/downloadedPDFs_info.json'

<a name='sectionpatternsv1'></a>
### 1.1.1. Section patterns v1 
Before I continue working on extracting the dataset names and potential links from the sections, I am curious to see how the section patterns perform. 

I investigate the first ten DOIs in downloadedPDFs_info.json to see exactly what text sections were extracted.

In [None]:
section_patterns = [
    (["Data and Code Availability", "Data Availability"], ["3", "CRediT authorship contribution statement", "Acknowledgements", "References"]),
    (["2.1"], ["2.2"]),
    (["Resource", "3.1 'Resource'"], ["3.2"]),
    ([r"Fig.\d+", r"Fig.\d+\.?", r"Figure \d+"], [r"https?://[^\s]+"]),
    ([r"Tab.\d+", r"Table \d+\.?"], [r"https?://[^\s]+", r"[\w\s-]+\d{4}"]),
    (["Introduction", "1"], ["2"]),
    (["Abstract"], ["1", "Introduction"])
]

In [None]:
# Empty list to store individual results
results_list = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    # Get the first 10 DOIs
    first_10_dois = doi_data['DOIs'][:10]

    for doi in first_10_dois:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content function for each DOI
        section_content = get_content(pdf_path, alternative_pdf_directory, section_patterns)

        # Create a dictionary for each result and add it to the list
        results_list.append({"DOI": doi, "Section": section_content})

# Convert the list of dictionaries to a DataFrame
results_df = pd.DataFrame(results_list)

In [None]:
results_df

*In the following description, I refer to the index of the articles in results_df.*

Observations from the text sections extracted with section_patterns: 
* In 1, 2, 4, 7, and 9, **the text is cut short because there's a mention of a number 3** within the section (in a link, in a release number, etc.). 
* In 2, they call it: 'Data/code availability statement'
* In 2 and 9, the **end of the section can be 'Acknowledgements'**.
* In 3 and 6, the **end of the section can be 'Declaration of Competing Interest'**.
* In 4, 5, and 8, the **section ends with 'Credit authorship contribution statement'**.
* In 5, we see that the use of a **URL does not necessarily mean that it's pointing to data (in this case, it's code and software)**. 
* In 6, we see that **the formulation of the text is important** (as the github link both contains data and code, but that is tricky to see). 
* In 7 and 8, they **mention which dataset they used, but do not link it**. 
* In 9, it says: "The review summarizes data but does not contain new data." (this is important if I want to look into and further filter the documents for significance testing). 

<br>
From this investigation I can see that I need to edit the section patterns. Ideas: 

- Maybe the end of a section can be \n\n? 
- Section end '3' should be called '3. ' - maybe this will fix some 
- Add variations: 
    - Section starts: 
        - Data/code availability statement
    - Section ends: 
        - [data and code] Declaration of Competing Interest
        - [data and code] Acknowledgements
        - [data and code] Credit authorship contribution statement
<br>
<br>
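
The first two ideas can be checked on a toy string before touching the PDFs. This is a minimal sketch, assuming a made-up availability statement and a hypothetical helper `cut_at` (neither comes from the notebook or the cited repositories). It only illustrates why a bare '3' ends the section too early, and how '3. ' or a regex anchored to the start of a line behaves instead.

```python
import re

# Made-up availability statement followed by the heading of section 3
text = (
    "Data availability\n"
    "Data are available at https://example.org/release3/data.\n"
    "3. Results\n"
    "We found ..."
)

def cut_at(text, end_pattern, use_regex=False):
    """Return the text up to the first occurrence of end_pattern (hypothetical helper)."""
    pattern = end_pattern if use_regex else re.escape(end_pattern)
    # re.MULTILINE lets '^' match at the start of every line
    match = re.search(pattern, text, flags=re.MULTILINE)
    return text[:match.start()] if match else text

too_short = cut_at(text, "3")                        # stops inside the URL ("release3")
better = cut_at(text, "3. ")                         # stops at the heading "3. Results"
anchored = cut_at(text, r"^3\.\s", use_regex=True)   # only accepts "3." at the start of a line

print(repr(too_short))
print(repr(better))
print(repr(anchored))
```

Note that '3. ' can still occur inside running text (e.g., "version 3. The ..."), so the line-anchored variant is the stricter of the two, provided the heading really starts a line in the extracted text.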

FOR FUTURE STEPS: 
- URLs do not necessarily link to the data. 
- A git repository can contain both data and code - but not always. 
- The dataset might only be mentioned by name and not linked (so far, I've only seen the names in camelcase). 
- QUESTION: How do we treat reviews that summarize data but do not contain new data? Is the reuse of a dataset not also the same as not containing new data?

<a name='sectionpatternsv2'></a>
### 1.1.2. Section patterns v2 
Based on my exploration of the performance of the first section patterns, I can see that they need to be rewritten. For version 2, I made a few edits: 
* Add variations
    * Section starts: 
        * Data/code availability statement 
    * Section ends: 
        * '\n\n' (this could be a general way to end the section) 
        * [data and code] Declaration of Competing Interest
        * [data and code] Acknowledgements
        * [data and code] Credit authorship contribution statement
* Change pattern containing numbers (e.g., '3' is now '3. ')
<br>
I investigate the next ten DOIs in downloadedPDFs_info.json to see exactly what text sections were extracted.

In [None]:
section_patterns_v2 = [
    (["Data and Code Availability", "Data Availability", "Data/code availability"], ["3. ", "CRediT authorship contribution statement", "Acknowledgements", "References", "Declaration of Competing Interests", "Credit authorship contribution statement", "\n\n"]),
    (["2.1."], ["2.2."]),
    (["Resource", "3.1."], ["3.2."]),
    ([r"Fig.\d+", r"Fig.\d+\.?", r"Figure \d+"], [r"https?://[^\s]+"]),
    ([r"Tab.\d+", r"Table \d+\.?"], [r"https?://[^\s]+", r"[\w\s-]+\d{4}"]),
    (["Introduction", "1. "], ["2. "]),
    (["Abstract"], ["1. ", "Introduction"])
]

In [None]:
# Empty list to store individual results
results_list_v2 = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    # Get the next ten DOIs (indices 11-20; note that index 10 is skipped)
    first_10_dois = doi_data['DOIs'][11:21]

    for doi in first_10_dois:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content function for each DOI
        section_content = get_content(pdf_path, alternative_pdf_directory, section_patterns_v2)

        # Create a dictionary for each result and add it to the list
        results_list_v2.append({"DOI": doi, "Section": section_content})

# Convert the list of dictionaries to a DataFrame
results_df2 = pd.DataFrame(results_list_v2)

In [None]:
results_df2

*In the following description, I refer to the index of the articles in results_df2.*

Observations from the text sections extracted with section_patterns_v2: 
- In 0, there are links, but these are not to the dataset - they write "The used data can be shared with other researchers upon reasonable request." 
- In 0 and 7, the next section is called 'Supplementary materials' - which means that my attempt at \n\n did not work.  
- In 2, the only mention of data was picked up in section 2.1.
- In 2, the 'Declaration  of Competing  Interest' was not picked up - it looks like it's because there are double spaces between the words. 
- In 3, the 'Credit authorship  contribution  statement' was not picked up - double spaces?
- In 6, the data section is called 'Code and data availability' - but it was picked up by 'data availability'. 
- In 6, there are multiple links mentioned - one for data (an atlas), one for the code, and one for the data. 
    - NB! When copying the URL for the data, it is broken up by the formatting: https://www.humanconnectome.org/study/hcp-young-adult/ document/1200-subjects-data-release - this is also the case for the atlas. 
- In 7 and 8, there are spaces in the URL. 
- In 8, the following section 'Declaration of Competing Interest' was not picked up. 
- In 9, the introduction was picked up: but it does not look like any data is analysed in this article. 

<br>
From this investigation I can see that I need to edit the section patterns further. 

<br>
<br>
TO DO: 

- Section_patterns that do not work: 
    - Section_end: \n\n
    - Section_end: 'Declaration  of Competing  Interest' + Section_end: 'Credit authorship  contribution  statement' + Section_end: 'Supplementary materials'
        - double-spaces between words mess these up
- Section_patterns I'm worried about: 
    - Section_start: 2.1. - what if it's '2.1'?
    - Section_start: 'Code and data availability'
- Undiscovered section_patterns: 
    - Section_end: 'Ethics statement'
- Worries 
    - How to get the name of the dataset itself and the url
        - The URL can be broken up by spaces (due to line breaks in the PDF) - can I find a way to work out what the entire URL is? 
            - Are there any slashes in the text ahead? A parenthesis, dot, comma, or another symbol might end the URL. 
    - If someone uses e.g., HCP, do they use all of the data? Do I need to catch more text-sections to learn this (in relation to the discussion of significance testing - if they use different parts of the dataset, they are not testing on the same). 
        - "Due to HCP and dHCP privacy policies, the preprocessed resting-state images of human adults and neonates (with their IDs) can only be shared upon request with qualified investigators who agree to the Restricted Data Use Terms of these two datasets." (from 10.1016/j.neuroimage.2022.119339)
    - What if the article does not analyse any data? (e.g., 10.1016/j.neuroimage.2022.119295 presents a software package for the execution of RT-fMRI experiments.) 
- FUNCTION GET_CONTENT: Add a comment about looking for the "Editorial board" texts in the other directory - just so I don't get an "Error reading PDF:" 
    - Extend 'get_section' so that it returns 'Editorial board' instead of None for the section text. 
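
One of the worries above is that a URL can be broken up by a space where the PDF wraps a line. A possible heuristic, sketched here under assumptions of my own: if a URL fragment ends in '/', '-' or '_' and the next token contains a slash, the space is treated as a line-break artefact and removed. The function name `rejoin_broken_urls` and the pattern are hypothetical and have not been validated on the full corpus.

```python
import re

# A URL fragment ending in '/', '-' or '_', then whitespace,
# then a token that itself contains a slash (checked with a lookahead)
BROKEN_URL = re.compile(r"(https?://\S*[/_-])\s+(?=\S*/)")

def rejoin_broken_urls(text):
    """Heuristically remove line-break spaces inside URLs (sketch, not validated)."""
    previous = None
    # Repeat until nothing changes, in case a URL is broken in several places
    while previous != text:
        previous = text
        text = BROKEN_URL.sub(r"\1", text)
    return text

broken = "https://www.humanconnectome.org/study/hcp-young-adult/ document/1200-subjects-data-release"
print(rejoin_broken_urls(broken))
```

The heuristic leaves a URL alone when the following word has no slash (e.g., "... /thsqg/ and ..."), but it would wrongly join a URL followed by a word such as "and/or". It also does not repair breaks after a dot or a colon (as in "https: //github.com/..."), so it only covers part of the problem.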


<a name='sectionpatternsv3'></a>
### 1.1.3. Section patterns v3 

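I want to make regex patterns work, as it seems like double spaces between the words of a section title (e.g., 'Declaration  of Competing  Interest') break the literal string patterns. 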

TO DO: 
- Section_patterns that do not work: 
    - Section_end: \n\n
    - Section_end: 'Declaration  of Competing  Interest' + Section_end: 'Credit authorship  contribution  statement' + Section_end: 'Supplementary materials'
        - double-spaces between words messed these up. 
- Section_patterns I'm worried about: 
    - Section_start: '2.1.' - what if it's '2.1'?
    - Section_start: 'Code and data availability'
- Undiscovered section_patterns: 
    - Section_end: 'Ethics statement'
- Make the titles case sensitive, and it seems like most only capitalize the first word (see investigation in ../Code/articles_groundtruth.ipynb under 'Ground truth/Investigation/Section titles')
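
The double-space problem in the list above can be illustrated in isolation. This is a minimal sketch, assuming a hypothetical helper `whitespace_tolerant` that turns a plain title into a case-sensitive regex accepting any run of whitespace between its words. It is not the pattern list used further down, only a demonstration of the idea.

```python
import re

def whitespace_tolerant(title):
    """Build a case-sensitive regex for `title` that accepts any run of
    whitespace (spaces, newlines) between its words (hypothetical helper)."""
    return re.compile(r"\s+".join(re.escape(word) for word in title.split()))

# Text as pypdf might return it, with double spaces inside the heading
extracted = "for re-use. \nDeclaration  of Competing  Interest  \nThe authors declare ..."

literal_hit = "Declaration of Competing Interest" in extracted
regex_hit = whitespace_tolerant("Declaration of Competing Interest").search(extracted)

print(literal_hit)          # the literal title is not found
print(regex_hit.group(0))   # the regex finds the heading despite the double spaces
```

Because the pattern is case sensitive, a title such as 'Funding' would not match the word 'funding' in running text.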


In [None]:
def get_content_regex(pdf_path, alt_pdf_path, section_patterns):
    """Regex version of get_content.
    Returns a tuple: (section text, matched pattern, start match, end match).
    """
    try:
        # The context manager closes the file even when we return early
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = pypdf.PdfReader(pdf_file)
            # Read the entire PDF content
            pdf_text = " ".join(page.extract_text() for page in pdf_reader.pages)
        
        # Extract sections using the provided section patterns
        # (all four values are empty strings when no section is found)
        return get_section_regex(pdf_text, section_patterns)
        
    except FileNotFoundError:
        # PDFs missing from the main directory are expected to be the 'Editorial board' texts
        # stored in alt_pdf_path; they are labelled as such whether or not the file is found there
        return 'Editorial board', '', '', ''
    
    except Exception as e:
        print(f"Error reading PDF: {e}")
        # Return empty values so the caller can still unpack four items
        return '', '', '', ''


def get_section_regex(article, section_patterns):
    """This function extracts text sections from articles based on provided regex patterns.

    Parameters:
    :param article (str): The text of the article.
    :param section_patterns (list of tuple): A list of tuples containing start and end regex patterns.

    Returns:
    :returns tuple: A tuple containing the extracted section text, the matched pattern string, the start match object, and the end match object.
               If no section is found, it returns ('', '', '', '').
    """
    matched_pattern = None  # Variable to store the matched start pattern
    start_match = None      # Variable to store the specific matched start pattern
    end_match = None        # Variable to store the specific matched end pattern
    
    # Iterate through each pattern pair
    for start_pattern, end_pattern in section_patterns:
        # Find all matches of the start pattern in the article
        start_matches = re.finditer(start_pattern, article)

        # Iterate through each start match
        for match in start_matches:
            start_idx = match.start()  # Get the start position of the start match

            # Search for the end pattern from the start position of the start match onwards
            end_match = re.search(end_pattern, article[start_idx:])
            
            if end_match:
                end_idx = start_idx + end_match.start()  # Calculate the end position of the section
                section_text = article[start_idx:end_idx].strip()  # Extract the section text

                # Store the matched pattern string and the start match object
                # (end_match already holds the end match object)
                matched_pattern = start_pattern
                start_match = match

                # Return the section text and matched patterns
                return section_text, matched_pattern, start_match, end_match

    # If no match is found, return empty strings
    return '', '', '', ''

In [None]:
section_patterns_regex = [
    (r'(?<![\'"]) \s*?\n?Data\s+and\s+code\s+availability |(?<![\'"]) \s*?\n?Data\s+availability |(?<![\'"]) \s*?\n?Data/code\s+availability', r'\s*?\n\n |\s*?\n?3\. | \s*?\n?CRediT\s+authorship\s+contribution\s+statement(?:s)? | \s*?\n?Acknowledgement(?:s)? | \s*?\n?Acknowledgment(?:s)? | \s*?\n?Reference(?:s)? | \s*?\n?Declaration\s+of\s+Competing\s+Interest(?:s)? | \s*?\n?Credit\s+authorship\s+contribution\s+statement(?:s)? | \s*?\n?Funding | \s*?\n?Supplementary\s+materials | \s*?\n?Ethic(?:s)? statement(?:s)?'),
    (r'\n?2\. | \n?2\.1\.', r'\s*?\n?3\.\s*?\n?'),
    # (r'\n?Resource | \n?3\.1\.\s*?\n?', r'\n?3\.2.\s*?| \s*?\n\n '),
    (r'\n?Introduction\s*?\n? | \s*?\n?1\.\s*?\n? ', r'\s*?\n?2\.\s*?\n? | \s*?\n\n '),
    (r'\n?Abstract\s*?\n? | \s*?\n?1\.\s*?\n? ', r'\n?Introduction\s*?\n? | \s*?\n\n '),
    (r'\n?Fig\.\d+ | \n?Fig\.\d+\.? | \n?Figure \d+', r'https?://[^\s]+ | \s*?\n\n '),
    (r'\n?Tab\.\d+ | \n?Table \d+\.?', r'https?://[^\s]+ | [\w\s-]+\d{4} | \s*?\n\n ')
]

In [None]:
# Empty list to store individual results
results_list_regex = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    # Get a slice of the DOIs
    first_dois = doi_data['DOIs'][11:21] # [:10] to compare with results_df, [11:21] to compare with results_df2

    for doi in first_dois:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content_regex function for each DOI
        section_content_regex, matched_pattern, start_match, end_match = get_content_regex(pdf_path, alternative_pdf_directory, section_patterns_regex)

        # Create a dictionary for each result and add it to the list
        results_list_regex.append({"DOI": doi, "Section": section_content_regex, "Matched_pattern": matched_pattern, "Start_pattern": start_match, "End_pattern": end_match})

# Convert the list of dictionaries to a DataFrame
results_df_regex = pd.DataFrame(results_list_regex)

results_df: DOIs 0-9, i.e., [:10]

results_df2: 11-20, i.e., [11:21]

In [None]:
results_df2['Section'].loc[0]

In [None]:
results_df_regex['Section'].loc[0]

**Fixed issues**: 
- Edited the get_content_regex function to be case sensitive instead of case insensitive 
    - When searching case-insensitively (all lowercase), the section in results_df2['Section'].loc[2] is cut short
        - From 'Data and code availability  statement  \nData used in the study are available  upon direct request.  Conditions  \nfor its sharing  involve  the formalisation  of a research  agreement.  The \ndata and code sharing  adopted  by the authors  comply  with the require-  \nments of the funding  body or institute,  and with the institutional  ethics \napproval.  Parts of the data are conﬁdential  and additional  ethical ap- \nproval may be needed  for re-use. \n'
        - To: 'Data and code availability  statement  \nData used in the study are available  upon direct request.  Conditions  \nfor its sharing  involve  the formalisation  of a research  agreement.  The \ndata and code sharing  adopted  by the authors  comply  with the require-  \nments of the'
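
The cut-off above happens right before the word 'funding', presumably because a case-insensitive search lets the end pattern 'Funding' match it. This is a minimal sketch of that effect, assuming a shortened version of the quoted statement and a made-up 'Funding' heading after it.

```python
import re

# Shortened version of the statement quoted above, followed by a made-up next heading
text = (
    "Data used in the study are available upon direct request. "
    "The data and code sharing adopted by the authors comply with the "
    "requirements of the funding body or institute. \n"
    "Funding \nThis work was supported by ..."
)

insensitive = re.search(r"Funding", text, flags=re.IGNORECASE)
sensitive = re.search(r"Funding", text)

cut_short = text[:insensitive.start()]   # ends in the middle of the sentence
full = text[:sensitive.start()]          # ends right before the 'Funding' heading

print(repr(cut_short[-25:]))
print(repr(full[-25:]))
```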
        
**Persisting issues**: 
- Reading the PDF 
    - At page breaks, the page header is picked up (results_df_regex['Section'].loc[6], DOI 10.1016/j.neuroimage.2022.118986)
    - Double (or more) spaces
    - \n characters 
- Section titles 
    - There are variations of section_start titles that I have not included in my pattern, e.g., "Data Availability", which I discovered in articles_groundtruth. 
    - There are likely many more section_end titles that I have not discovered or included in my pattern. 
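
For the title variations, one option is to stay case sensitive on the first word while letting the later words start with either case. This is a sketch under that assumption: `heading_pattern` is a hypothetical helper, not part of the pattern list above, and it would still miss titles with a different wording.

```python
import re

def heading_pattern(title):
    """Regex for a heading whose first word is capitalised as given and whose
    later words may start with an upper- or lowercase letter (hypothetical helper)."""
    first, *rest = title.split()
    parts = [re.escape(first)]
    for word in rest:
        head, tail = word[0], word[1:]
        if head.isalpha():
            # Accept both cases for the first letter only
            parts.append(f"[{head.upper()}{head.lower()}]" + re.escape(tail))
        else:
            parts.append(re.escape(word))
    # Any run of whitespace is allowed between the words
    return re.compile(r"\s+".join(parts))

pattern = heading_pattern("Data availability")
print(pattern.pattern)
print(bool(pattern.search("Data Availability")))
print(bool(pattern.search("the data availability of the study")))
```

This accepts both 'Data availability' and 'Data Availability' as headings, while a lowercase mention of 'data availability' inside a sentence is not treated as a section start.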

NB! The cell below takes more than an hour to run: it started at 16:09, had not finished when I checked after 17:40, and was done when I looked again at 18:15. 

In [None]:
# Empty list to store individual results
results = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    for doi in doi_data['DOIs']:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content_regex function for each DOI 
        section_content_regex, matched_pattern, start_match, end_match = get_content_regex(pdf_path, alternative_pdf_directory, section_patterns_regex)

        # Create a dictionary for each result and add it to the list
        results.append({"DOI": doi, "Section": section_content_regex, "Matched_pattern": matched_pattern, "Start_pattern": start_match, "End_pattern": end_match})

# Convert the list of dictionaries to a DataFrame
articles_dataset_sections = pd.DataFrame(results)

NB! The code above takes between one and two hours to run. 

In [None]:
# articles_dataset_sections

In [None]:
# Define the path to the 'Code-git/Data' directory
data_dir = os.path.join(os.pardir, 'Data')

# Define the file path
file_path = os.path.join(data_dir, 'articles_dataset_sections.csv')

# Save the DataFrame to CSV, overwriting the file if it exists
articles_dataset_sections.to_csv(file_path, index=False, mode='w')  

<a name='preprocessingtextsections'></a>
## 1.2. Preprocessing text sections
Before I continue to the extraction of the datasets from the text sections, I want to clean the current data a bit. This includes: 
- Clean the matching start patterns 
- Clean the extracted text sections, including 
    - Remove characters like '\n' 
    - Remove double (or more) spaces 

In [None]:
# Path to the CSV file
csv_file_path = os.path.join(os.pardir, 'Data/articles_dataset_sections.csv') 

# Read the CSV file into a DataFrame
articles_dataset_sections = pd.read_csv(csv_file_path)

In [None]:
articles_dataset_sections

<a name='startpatterns'></a>
### 1.2.1. Start patterns 

In [None]:
def extract_matched_text(text):
    """This function extracts matched text from a string containing a regular expression 
    match object and performs data cleaning.

    Parameters:
    :param text (str): A string containing a regular expression match object (e.g., "<re.Match object; span=(start, end), match='text'>").

    Returns:
    :returns: If a match is found in the input text, the function returns the matched text after performing the following operations:
        Removing newline escape sequences (a backslash followed by 'n') left over from the match object's text representation.
        Collapsing runs of two or more spaces into a single space.
        Stripping leading and trailing spaces from the matched text.
    :returns: If no match is found or the resulting matched text is empty, the function returns NaN.
    """
    
    match = re.search(r"match='(.*?)'", str(text))
    if match:
        # Collapse space runs with a regex: chaining .replace('  ', ' ').replace('   ', ' ') leaves double spaces behind
        matched_text = re.sub(r' {2,}', ' ', match.group(1).strip().replace('\\n', '')).strip()
        if matched_text:
            return matched_text
        else:
            return np.nan
    else:
        return np.nan

In [None]:
# Apply the function to clean up the 'Start_pattern' column
articles_dataset_sections['Start_pattern_clean'] = articles_dataset_sections['Start_pattern'].apply(extract_matched_text)

Overview of how many articles match each of the section patterns. 

In [None]:
# Group by 'Matched_pattern' and count the number of rows in each group
pattern_counts = articles_dataset_sections['Matched_pattern'].value_counts()

# Count NaN values and add it to the pattern_counts Series
nan_count = articles_dataset_sections['Matched_pattern'].isna().sum()
pattern_counts['NaN'] = nan_count

# Create a DataFrame to store the results
articles_section_patterns = pd.DataFrame({
    'Matched_pattern': pattern_counts.index,
    'Count': pattern_counts.values
})

# Print the result DataFrame
print(articles_section_patterns)

# Calculate and print the total count
total_count = articles_section_patterns['Count'].sum()
print("Total Count:", total_count)

I was only expecting to see 19 articles with NaN as a matched pattern (since there are 19 editorial board papers). 

In [None]:
# Filter and count rows where 'Start_pattern_clean' is NaN
no_pattern = articles_dataset_sections[articles_dataset_sections['Start_pattern_clean'].isna()]
len(no_pattern)

In [None]:
# Filter rows where 'Section' is not 'Editorial board'
no_pattern[no_pattern['Section'] != 'Editorial board']

There should only be 19 articles where there is no pattern-match, as there are 19 'Editorial Board' articles. The articles that were not filtered properly by my code are: 
- 10.1016/j.neuroimage.2022.119560
    - This has a section called 'Data Availability'
- 10.1016/j.neuroimage.2021.118776
    - This article does not have any distinct sections. It presents all the articles in the particular volume of NeuroImage. 
- 10.1016/j.neuroimage.2022.119154
    - This article does not have any distinct sections. It is a commentary.     
- 10.1016/j.neuroimage.2022.118921
    - This article does not have any distinct sections. It is a corrigendum. 
<br>
<br>

Of the four articles that did not contain one of my start patterns, only one should have been picked up. The rest seem to have been properly filtered. 
<br>
<br>

<a name='cleantext'></a>
### 1.2.2. Clean text 
I will do a very simple initial cleaning of the extracted text sections: 
- Replace multiple spaces with a single space
    - [.replace('   ', ' ').replace('  ', ' ')]
- Remove all \n characters 
    - [.replace('\n', '')]
- Remove the spaces around the following characters: -, (, ), /, ., and _, as well as the space between : and / 
    - [.replace('- ', '-').replace('( ', '(').replace(' )', ')').replace('/ ', '/').replace(' /', '/').replace(' .', '.').replace(': /', ':/').replace(' _ ', '_').replace(' _', '_').replace('_ ', '_')] 
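
A note on the first two steps: chained `.replace()` calls only shorten each run of spaces once, so longer runs can survive as double spaces. Below is a sketch of an alternative that collapses any run of whitespace with one regular expression. `collapse_whitespace` is a hypothetical helper and not what the cell below uses. Unlike the chain, it turns a newline into a space instead of deleting it.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()

# Five spaces, then two spaces and a newline
sample = "available     upon  \nrequest"

chained = sample.replace("   ", " ").replace("  ", " ").replace("\n", "")
collapsed = collapse_whitespace(sample)

print(repr(chained))    # a double space is left after 'available'
print(repr(collapsed))
```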


In [None]:
articles_dataset_sections['Section']

In [None]:
def clean_section_text(text):
    """Apply the simple string replacements listed above, in order."""
    for old, new in [('   ', ' '), ('  ', ' '), ('\n', ''), ('- ', '-'), ('( ', '('), (' )', ')'), ('/ ', '/'), (' /', '/'), (' .', '.'), (': /', ':/'), (' _ ', '_'), (' _', '_'), ('_ ', '_')]:
        text = text.replace(old, new)
    return text

# Assign the whole column at once instead of writing cell by cell through .loc[i]
articles_dataset_sections['Section'] = articles_dataset_sections['Section'].astype(str).apply(clean_section_text)

In [None]:
articles_dataset_sections['Section']

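**NB! Problem with links**

After cleaning, the text reads "osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/and osf.io/bndjg/", but the word 'and' is not part of the link. The replacement of '/ ' with '/' has most likely glued it to the preceding URL. 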

In [None]:
row = articles_dataset_sections[articles_dataset_sections['DOI'] == '10.1016/j.neuroimage.2022.119443']

In [None]:
row['Section'].values

Additionally, I want to remove the section titles from the text, as they can cause issues with the code I will be writing for extracting the datasets. 

In [None]:
articles_dataset_sections[['Section', 'Start_pattern_clean']]

In [None]:
def remove_starting_pattern(row):
    """This function removes the matching start_pattern text from the extracted section texts. 
    E.g., if the start pattern is 'Data and code availability', and the extracted section text 
    is 'Data and code availability The data incorporat...', the returned clean_text will be 
    'The data incorporat...'. 
    """
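    section = row['Section']
    start_pattern = row['Start_pattern_clean']
    # Rows without a start pattern hold NaN; str(NaN) is 'nan', which would otherwise be deleted from words like 'resonance'
    if pd.isna(start_pattern):
        return section
    section = section.replace(str(start_pattern), '')
    
    return section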

In [None]:
# Apply the function to each row
articles_dataset_sections['Section_wo_pattern'] = articles_dataset_sections.apply(remove_starting_pattern, axis=1)

In [None]:
articles_dataset_sections[['Start_pattern_clean', 'Section', 'Section_wo_pattern']]

<a name='getdatasets'></a>
## 1.3. Get datasets
I need to extract the datasets from the text sections we extracted above. 

Based on my previous observations, I will start the extraction with the following notions in mind: 
- Not open access datasets (meaning either fully private or available upon request) 
    - Markers include words such as "request", "no data", "new data", and "not be shared", e.g., 
        - "Data and code are available upon request."
        - "Data and code availability statement All individual-level raw data used in this study cannot be shared because of the ethical code of Tokyo Metropolitan University. How-ever, the acquired metadata (e.g., group level activation maps) are available upon request. The corresponding author should be contacted by email for all data requests."
        - "No data were acquired for this study."
        - "The review summarizes data but does not contain new data."
        - "The data and code presented here are available upon request to the corresponding author."
- Open access datasets (meaning it's available to everyone with a link or title of the dataset)
    - Markers include hyperlinks and capitalized words. 
    - Words like "code", "data", or "package" typically appear in the sentences with links and indicate what the link refers to. 
    
- Issues (**code**)
    - The URL can be broken up by spaces due to line breaks in the PDF. Do we stop at the parenthesis, comma or another symbol that might end the URL? 
        - EXAMPLES 
    - Not all links point to the dataset - some are to the code, e.g., 
        - "Speciﬁcally, GES, PC and LiNGAM were implemented using the widely used R package pcalg , which is available at https://cran.r-project.org/web/packages/pcalg/. Notears method was implemented using Python available at https://github.com/xunzheng/notears . The proposed joint DAG method was implemented with Python and the code is available at https://github.com/gmeng92/joint-notears . The cohort data is accessible through the website (https://coins.trendscenter.org/) of COINS (COllaborative Infor-matics Neuroimaging Suite) database (Scott et al., 2011)."
        - "Data and code availability The data incorporated in the primary analysis were gathered from the public UK Biobank resource and will be made pub-licly available together with the code used to generate the data through the UK Biobank Returns Catalogue (https://biobank.ndph. ox.ac.uk/showcase/docs.cgi?id = 1). ABCD study data release 3.0 is available for approved researchers in NIMH Data Archive (NDA DOI:10.151.54/1,519,007). Code for conducting discovery and replication is available at https: //github.com/robloughnan/MOSTest _ generalization . Code for simu-lations is available at https://github.com/precimed/mostest/tree/master/simu."    
    
- Issues (**analysis**)
    - If someone uses e.g., HCP, do they use all of the data? Do I need to catch more text-sections to learn this (in relation to the discussion of significance testing - if they use different parts of the dataset, they are not testing on the same). 
        - "Due to HCP and dHCP privacy policies, the preprocessed resting-state images of human adults and neonates (with their IDs) can only be shared upon request with qualified investigators who agree to the Restricted Data Use Terms of these two datasets." (from 10.1016/j.neuroimage.2022.119339)
    - What if the article does not analyse any data? (e.g., 10.1016/j.neuroimage.2022.119295 presents a software package for the execution of RT-fMRI experiments.) 
    - What if there are multiple sections and the text is slightly different? (e.g., 10.1016/j.neuroimage.2022.118986)
<br>
<br>
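
The markers listed above can be turned into a first, coarse flagging step. This is a minimal sketch, assuming a hypothetical helper `flag_statement`; the marker list is copied from the notes above and the URL pattern is a simple assumption. It only reports whether a statement contains a restriction marker or a URL. As noted under the issues, a URL does not say whether the link points to data or to code.

```python
import re

# Markers taken from the notes above; matching is done on lowercased text
RESTRICTION_MARKERS = ("request", "no data", "new data", "not be shared")
# Simple assumption: a URL starts with http:// or https:// and runs until whitespace
URL_PATTERN = re.compile(r"https?://\S+")

def flag_statement(text):
    """Return coarse flags for one availability statement (hypothetical helper)."""
    lowered = text.lower()
    return {
        "restricted": any(marker in lowered for marker in RESTRICTION_MARKERS),
        "has_url": bool(URL_PATTERN.search(text)),
    }

examples = [
    "Data and code are available upon request.",
    "No data were acquired for this study.",
    "The proposed joint DAG method was implemented with Python and the code "
    "is available at https://github.com/gmeng92/joint-notears .",
]
for statement in examples:
    print(flag_statement(statement), "<-", statement[:45])
```

Statements with both flags set (a link plus a restriction) would be the ones to inspect by hand first.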

TO DO Columns: 
- (DONE) Section text 
- (DONE) Section pattern, for two reasons: 
    - 1) I can get a sense of whether the data statement is common in NeuroImage. 
    - 2) I can go back and handle potentially more difficult cases. 
- Extracted dataset 

Validation dataset: 

In [None]:
# List of groundtruth DOI values to filter 
validation_dois = [
    '10.1016/j.neuroimage.2021.118839',
    '10.1016/j.neuroimage.2021.118854',
    '10.1016/j.neuroimage.2022.119030',
    '10.1016/j.neuroimage.2022.119050',
    '10.1016/j.neuroimage.2022.119240',
    '10.1016/j.neuroimage.2022.119443',
    '10.1016/j.neuroimage.2022.119526',
    '10.1016/j.neuroimage.2022.119549',
    '10.1016/j.neuroimage.2022.119646',
    '10.1016/j.neuroimage.2022.119676',
] 

# Filter rows based on DOI values
validation_set = articles_dataset_sections[articles_dataset_sections['DOI'].isin(validation_dois)]

<a name='availabilitypattern'></a>
### 1.3.1. 'Availability' pattern 

I will start by examining and dealing with the text sections that were filtered by the first section_pattern, namely: 

    r'(?<![\'"]) \s*?\n?Data\s+and\s+code\s+availability |(?<![\'"]) \s*?\n?Data\s+availability |(?<![\'"]) \s*?\n?Data/code\s+availability' 
 
The corresponding ending pattern: 

    r'\s*?\n\n |\s*?\n?3\. | \s*?\n?CRediT\s+authorship\s+contribution\s+statement(?:s)? | \s*?\n?Acknowledgement(?:s)? | \s*?\n?Acknowledgment(?:s)? | \s*?\n?Reference(?:s)? | \s*?\n?Declaration\s+of\s+Competing\s+Interest(?:s)? | \s*?\n?Credit\s+authorship\s+contribution\s+statement(?:s)? | \s*?\n?Funding | \s*?\n?Supplementary\s+materials | \s*?\n?Ethic(?:s)? statement(?:s)?'
 

In [None]:
articles_dataset_sections['Matched_pattern'].loc[0]

In [None]:
pat_1 = articles_dataset_sections[articles_dataset_sections['Matched_pattern'] == '(?<![\\\'"]) \\s*?\\n?Data\\s+and\\s+code\\s+availability |(?<![\\\'"]) \\s*?\\n?Data\\s+availability |(?<![\\\'"]) \\s*?\\n?Data/code\\s+availability']

A total of 563 articles have a section where the title matches the pattern. 

In [None]:
pat_1[['Section_wo_pattern', 'Start_pattern_clean']]

In [None]:
############### SENTENCES ################################################
def split_text_into_sentences(text):
    """Split a text into sentences with a regular expression. 
    The text is split on whitespace that follows sentence-ending punctuation 
    (".", "!" or "?"). No split is made when that whitespace is followed by digits 
    and more whitespace, so 'Fig. 1 ,' stays within one sentence. 
    
    Parameters: 
    :param text(str): The text to split. 
    
    Returns: 
    :return: A list of sentence strings. 
    """
    sentence_pattern = r'(?<=[.!?])\s+(?![0-9]+\s)'
    sentences = re.split(sentence_pattern, text)
    return sentences


############### LINKS ################################################
def extract_links(text):
    """Extract URLs (web links) from a text using a regular expression pattern. 
    Leading and trailing whitespace is stripped from each extracted link. The 
    pattern covers links starting with "http://" or "https://", links in the 
    "doi: 10.xxxx/..." format, and domain names followed by a path, e.g., 
    'osf.io/gazx2/'.

    Parameters: 
    :param text(str): The text to search for links. 
    
    Returns: 
    :return: links (list of str): The extracted links, in order of appearance. 
    """
    # ORIGINAL 
    #url_pattern = r'''(https?://[^\s)(]+|\bdoi:\s*\d+(?:\.\d+)*(?:/[a-zA-Z0-9\./_\-]+)?|[a-z]+\.[a-z]+[a-zA-Z0-9\./_\-]*)(?:\s*(?:[),]|\.\s*[\r\n]?|,\s*|/and|$))'''
    # NEW 
    
    #url_pattern = r'(?i)(https?://[^\s)(]+(?:/[^\s)(]+)*(?:\s*\([^)]*\))?|osf\.io/[a-z0-9/]+/|doi:\s*10\.\d+/\S+|www\.[a-z0-9.-]+\.[a-z]{2,}/[^\s)(]+)(?:\s*(?:[),]|\.\s*[\r\n]?|,\s*|/and|$))'
    
    url_pattern = r'(?i)(https?://[^\s)(]+(?:/[^\s)(]+)*(?:\(\S+\))?|osf\.io/[a-z0-9/]+|(?:www\.)?[a-z0-9.-]+\.[a-z]{2,}/[^\s)(]+|doi:\s*10\.\d+/\S+)(?:\s*(?:[),]|\.\s*[\r\n]?|,\s*|/and|$))'

    matches = re.findall(url_pattern, text)
    # The pattern has a single capturing group, so findall returns a list of strings
    cleaned_links = [match.strip() for match in matches]
    return cleaned_links


############### CAPITALIZED ################################################
def extract_capitalized_words(text):
    """Extract capitalized phrases that are followed by text in parentheses. 
    The regular expression pattern matches one or more consecutive capitalized 
    words (mixed case, optional hyphens) together with the parenthesized text that 
    follows them. Capitalized words without a following parenthesis are not matched. 
    For example, "In this sentence Dataset Example (www.linktodataset.com) 
    would be extracted" returns "Dataset Example (www.linktodataset.com)". 
    
    Parameters: 
    :param text(str): The text to search. 
    
    Returns: 
    :return: A list of the matched capitalized phrases, including their parentheses. 
    """
    capitalized_pattern = r'([A-Z][a-zA-Z\-]+(?:\s+[A-Z][a-zA-Z\-]+)*(?:\s*\(.*?\)))(?=\s*\.|\s|$)'
    return re.findall(capitalized_pattern, text)


############### DATASETS ################################################
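def get_datasets(text):
    """Extract candidate datasets from a text together with the sentences that mention them. 
    A candidate is a capitalized phrase followed by parentheses, a link, or the 
    label "Request" for sentences that contain the word "request". 
    
    Parameters: 
    :param text(str): The section text to search. 
    
    Returns: 
    :return: A tuple (extracted_datasets, dataset_sentences) of equal-length lists, 
        or the string "N/A" if nothing was found. 
    """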
    # Initialize lists to store extracted datasets and their corresponding sentences
    extracted_datasets = []
    dataset_sentences = []
    
    # Split the text into sentences
    sentences = split_text_into_sentences(text)
    
    # Extract links and capitalized words
    links = extract_links(text)
    capitalized_words = extract_capitalized_words(text)
    
    for sentence in sentences:
        datasets_in_sentence = []
        
        # Check if the sentence contains any capitalized words
        for cap_word in capitalized_words:
            if cap_word in sentence:
                datasets_in_sentence.append(cap_word)
        
        # Check if the sentence contains a link
        for link in links:
            if link in sentence:
                # Skip the link if it is already part of a capitalized phrase found anywhere in the text
                if not any(link in cap_word for cap_word in capitalized_words):
                    datasets_in_sentence.append(link)
        
        # Check if the sentence contains the word "request"
        if "request" in sentence.lower():
            datasets_in_sentence.append("Request")
        
        if datasets_in_sentence:
            # If any datasets were found in the sentence, add them and the sentence itself
            extracted_datasets.extend(datasets_in_sentence)
            dataset_sentences.extend([sentence] * len(datasets_in_sentence))
    
    # If no dataset was found, return "N/A"
    if not extracted_datasets:
        return "N/A"
    
    #df = pd.DataFrame({'dataset': extracted_datasets, 'dataset_sentence': dataset_sentences})
    #return df

    return extracted_datasets, dataset_sentences


###############
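def extract_and_add_datasets(row, text_column):
    """Build one output row per dataset found in the text of a DataFrame row. 
    Each new row is a copy of the input row with two added fields, 'dataset' 
    and 'dataset_sentence'. If no dataset is found, a single row with "N/A" in 
    both fields is returned. 
    
    Parameters: 
    :param row (pd.Series): A row of the articles DataFrame. 
    :param text_column (str): Name of the column holding the text to search. 
    
    Returns: 
    :return: rows_list (list of pd.Series): The expanded rows. 
    """
    result = get_datasets(row[text_column])
    
    if isinstance(result, tuple):
        datasets, sentences = result
    else:
        # get_datasets returns the string "N/A" when nothing was found
        datasets, sentences = ["N/A"], ["N/A"]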
    
    rows_list = []
    for dataset, sentence in zip(datasets, sentences):
        new_row = row.copy()
        new_row['dataset'] = dataset
        new_row['dataset_sentence'] = sentence
        rows_list.append(new_row)
    
    return rows_list
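
To see what the lookahead in the sentence splitter does, here is a minimal sketch that applies the same pattern to two made-up sentences (the example sentences are assumptions, not taken from the articles). The split is skipped only when the digits are followed by whitespace, so 'Fig. 1 ,' survives while 'Fig. 2.' is still cut apart.

```python
import re

# Same pattern as in split_text_into_sentences
sentence_pattern = r'(?<=[.!?])\s+(?![0-9]+\s)'

# Made-up examples for illustration only
kept = "Code is shown in Fig. 1 , and data are on request. See the repository."
cut = "Results are in Fig. 2. Data are public."

kept_sentences = re.split(sentence_pattern, kept)
cut_sentences = re.split(sentence_pattern, cut)

print(kept_sentences)  # 'Fig. 1 ,' stays inside the first sentence
print(cut_sentences)   # 'Fig.' and '2.' end up as separate items
```

Abbreviations that are not followed by a number and a space (such as "et al.") are split in the same way, so the sentences returned by the splitter can be fragments.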


The URL pattern used in extract_links: 

    r'(?i)(https?://[^\s)(]+(?:/[^\s)(]+)*(?:\(\S+\))?|osf\.io/[a-z0-9/]+|(?:www\.)?[a-z0-9.-]+\.[a-z]{2,}/[^\s)(]+|doi:\s*10\.\d+/\S+)(?:\s*(?:[),]|\.\s*[\r\n]?|,\s*|/and|$))'

This pattern captures the following formats:
- URLs starting with http:// or https://, including paths, and an optional parenthesized part directly after the URL.
- osf.io/... format links.
- Domain names followed by a path, with or without a leading www.
- DOI links in the format doi: 10.xxxxx/xxxx.

The (?:\s*(?:[),]|\.\s*[\r\n]?|,\s*|/and|$)) part at the end is a non-capturing group that matches what may follow a link: a closing parenthesis, a comma, a period, the string /and, or the end of the text. 

Here's how the pattern works:

    https?://[^\s)(]+(?:/[^\s)(]+)*(?:\(\S+\))?: Captures HTTP/HTTPS links with paths and an optional parenthesized part.
    osf\.io/[a-z0-9/]+: Captures osf.io/... format links.
    (?:www\.)?[a-z0-9.-]+\.[a-z]{2,}/[^\s)(]+: Captures domain names followed by a path, with or without a leading www.
    doi:\s*10\.\d+/\S+: Captures DOI links.
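
A quick way to check this description is to run the pattern on a few sentences. The sketch below copies url_pattern from extract_links and applies it to two sentences based on the groundtruth texts (the first one is shortened). The variable names are assumptions made for this example.

```python
import re

# Pattern copied from extract_links
url_pattern = r'(?i)(https?://[^\s)(]+(?:/[^\s)(]+)*(?:\(\S+\))?|osf\.io/[a-z0-9/]+|(?:www\.)?[a-z0-9.-]+\.[a-z]{2,}/[^\s)(]+|doi:\s*10\.\d+/\S+)(?:\s*(?:[),]|\.\s*[\r\n]?|,\s*|/and|$))'

# Shortened version of a groundtruth sentence with four osf.io links
osf_sentence = ("EEG datasets are freely available at osf.io/gazx2/, "
                "osf.io/eucqf/, osf.io/thsqg/and osf.io/bndjg/.")

# Groundtruth sentence with a link inside parentheses
atlas_sentence = ("Original data was obtained from the Human Connectome Project "
                  "(1U54MH091657, PIs Van Essen and Ugurbil) and the Allen Hu-man "
                  "Brain Atlas (http://human.brain-map.org/).")

osf_links = re.findall(url_pattern, osf_sentence)
atlas_links = re.findall(url_pattern, atlas_sentence)

print(osf_links)    # ['osf.io/gazx2/', 'osf.io/eucqf/', 'osf.io/thsqg', 'osf.io/bndjg/']
print(atlas_links)  # ['http://human.brain-map.org/']
```

The third osf.io link comes back without its trailing slash, because the /and terminator consumes the slash in 'osf.io/thsqg/and'.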

<a name='testingavailabilitypattern'></a>
#### 1.3.1.1. Testing 'Availability' pattern

I will test the functions using the groundtruth texts as my validation set. 
When manually extracting the datasets from the ten groundtruth texts, we should get the following datasets (NB! I do not yet distinguish between links that lead the reader to data and links that lead the reader to code - this will come later): 
<br>
<br>

| DOI                                   | Dataset                                      | Dataset_sentence                                                                                                                                                                                            |
|---------------------------------------|----------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 10.1016/j.neuroimage.2022.119526       | Human Connectome Project (1U54MH091657, PIs Van Essen and Ugurbil)                       | Original data was obtained from the Human Connectome Project (1U54MH091657, PIs Van Essen and Ugurbil) and the Allen Hu-man Brain Atlas (http://human.brain-map.org/).                    |
|                                        | Allen Hu-man Brain Atlas (http://human.brain-map.org/)                                | Original data was obtained from the Human Connectome Project (1U54MH091657, PIs Van Essen and Ugurbil) and the Allen Hu-man Brain Atlas (http://human.brain-map.org/).                    |
|                                        | https://github.com/jbrown81/gradients                                    | All code (latent space derivation, dynamical system modeling, and gene expression corre-lation) and processed data (gradient maps/region weights, gradient timeseries, and region gene expression values) are available at https://github.com/jbrown81/gradients. |
| 10.1016/j.neuroimage.2022.119443       | osf.io/gazx2/                               | statement EEG datasets used to create the ﬁgure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/and osf.io/bndjg/.                  |
|                                        | osf.io/eucqf/                               | statement EEG datasets used to create the ﬁgure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/and osf.io/bndjg/.                  |
|                                        | osf.io/thsqg/                               | statement EEG datasets used to create the ﬁgure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/and osf.io/bndjg/.                  |
|                                        | osf.io/bndjg/                               | statement EEG datasets used to create the ﬁgure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/and osf.io/bndjg/.                  |
|                                        | osf.io/guwnm/                               | Code used to reproduce the plots in Fig. 1 , as well as averaged ERP data, is available from osf.io/guwnm/.                                      |
| 10.1016/j.neuroimage.2022.119240       | Request                                           | statement Data used in this study are available from the corresponding author upon reasonable request.                                                                         |
| 10.1016/j.neuroimage.2022.119050       | zenodo.org (doi: 10.5281/zenodo.6110595) | Raw EEG data from all healthy individuals, as well as Matlab code, are publicly available on zenodo.org (doi: 10.5281/zenodo.6110595).                         |
| 10.1016/j.neuroimage.2021.118854       | Human Connectome Project website (https://www.humanconnectome.org/study/hcp-young-adult/document/extensively-processed-fmri-data-documentation) | The data used in this study was downloaded from the Human Connectome Project website (https://www.humanconnectome.org/study/hcp-young-adult/document/extensively-processed-fmri-data-documentation). |
|                                       | https://github.com/ferreirafabio80/gfa | The GFA models and experiments were implemented in Python 3.9.1 and are available here: https://github.com/ferreirafabio80/gfa.                                          |
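
One way to turn the table above into a score, sketched here as an assumption rather than as the validation procedure of this notebook: treat the groundtruth and the extraction as sets of (DOI, dataset) pairs and compute precision and recall. The expected pairs are copied from the table; the extracted set is hypothetical and only illustrates the calculation.

```python
doi = '10.1016/j.neuroimage.2022.119443'

# Groundtruth pairs for one article, copied from the table above
expected = {
    (doi, 'osf.io/gazx2/'),
    (doi, 'osf.io/eucqf/'),
    (doi, 'osf.io/thsqg/'),
    (doi, 'osf.io/bndjg/'),
    (doi, 'osf.io/guwnm/'),
}

# Hypothetical extraction result: one link lost its slash, one link is missing
extracted = {
    (doi, 'osf.io/gazx2/'),
    (doi, 'osf.io/eucqf/'),
    (doi, 'osf.io/thsqg'),
    (doi, 'osf.io/bndjg/'),
}


def precision_recall(extracted, expected):
    """Return (precision, recall) for two sets of (DOI, dataset) pairs."""
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


precision, recall = precision_recall(extracted, expected)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Exact string matching is strict: a link that differs only by a trailing slash or period counts as a miss, so normalizing the links before comparing may be worth considering.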



In [None]:
# Filter rows based on the groundtruth DOI values
validation_set = pat_1[pat_1['DOI'].isin(validation_dois)]

In [None]:
# Initialize an empty list to store the rows
rows_list = []
# Column name to use for text extraction
text_column = 'Section_wo_pattern'

# Iterate through each row of the original DataFrame
for index, row in validation_set.iterrows():
    # Call the custom function to extract datasets and add new rows
    new_rows = extract_and_add_datasets(row, text_column)
    
    # Append the new rows to the list
    rows_list.extend(new_rows)

# Create the final DataFrame from the list of rows
validation_df = pd.DataFrame(rows_list)

In [None]:
validation_df

<a name='extractingallavailabilitydatasets'></a>
#### 1.3.1.2. Extracting all availability datasets
I will now run the code on all articles that matched with the availability pattern.  

In [None]:
pat_1.columns

In [None]:
# Initialize an empty list to store the rows
rows_list = []
# Column name to use for text extraction
text_column = 'Section_wo_pattern'

# Iterate through each row of the original DataFrame
for index, row in pat_1.iterrows():
    # Call the custom function to extract datasets and add new rows
    new_rows = extract_and_add_datasets(row, text_column)
    
    # Append the new rows to the list
    rows_list.extend(new_rows)

# Create the final DataFrame from the list of rows
articles_datasets = pd.DataFrame(rows_list)

I want to separate the code links from the data links by searching the 'dataset_sentence' column for the words data and code. 
- If the sentence contains data (with or without code), I save the row in articles_dataset. 
- If the sentence contains code but not data, I save the row in articles_code. 
- If the sentence contains neither, I save the row in articles_other. 

In [None]:
print("Links, capitalized words, and other in total: ", len(articles_datasets))

In [None]:
# Regex patterns for "data" and "code", matched as standalone words or within other words
data_pattern = r'\w*data\w*'
code_pattern = r'\w*code\w*'

# Create a mask for rows whose sentence contains a variation of "data" (case=False already ignores case)
data_mask = articles_datasets['dataset_sentence'].str.contains(data_pattern, case=False, regex=True)

# Create a mask for rows containing "code" but not "data"
code_mask = articles_datasets['dataset_sentence'].str.contains(code_pattern, case=False, regex=True) & ~data_mask

# Create a mask for rows that do not fit either of the mentioned masks
other_mask = ~data_mask & ~code_mask

# Separate rows into articles_dataset, articles_code, and articles_other, resetting the index of each
articles_dataset = articles_datasets[data_mask].reset_index(drop=True)
articles_code = articles_datasets[code_mask].reset_index(drop=True)
articles_other = articles_datasets[other_mask].reset_index(drop=True)

# Print the number of rows (extracted items, not unique articles) in each dataframe
print(f"Rows with 'data' or both 'data' and 'code': {len(articles_dataset)}")
print(f"Rows with 'code' but not 'data': {len(articles_code)}")
print(f"Rows that do not fit either mask: {len(articles_other)}")

- Data: database, dataset, image(s), (neuro)map(s), DOI(s), atlas, (freely available)
- Other: tool(kit, box), scripts, results, algorithm, software, package, plugin, function, analysis
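
The keyword lists above could be applied to the sentences along the following lines. This is a minimal sketch: the regex spellings of the keywords, the helper name label_sentence, and the choice to let data keywords take priority (mirroring the data/code split) are all assumptions.

```python
import re

# Keywords from the "Data" list; the regex spellings are an assumption
DATA_TERMS = re.compile(
    r'\b(?:databases?|datasets?|images?|(?:neuro)?maps?|dois?|atlas(?:es)?'
    r'|freely\s+available)\b',
    re.IGNORECASE)

# Keywords from the "Other" list; the regex spellings are an assumption
OTHER_TERMS = re.compile(
    r'\b(?:tool(?:s|kits?|box(?:es)?)?|scripts?|results?|algorithms?|software'
    r'|packages?|plugins?|functions?|analys[ei]s)\b',
    re.IGNORECASE)


def label_sentence(sentence):
    """Return 'data', 'other' or 'unknown' for a sentence (hypothetical helper)."""
    if DATA_TERMS.search(sentence):
        return 'data'
    if OTHER_TERMS.search(sentence):
        return 'other'
    return 'unknown'


print(label_sentence('Volumetric PET receptor images can be found on neuromaps.'))
print(label_sentence('The toolboxes used in this work are available at https://ins-amu.fr/software.'))
print(label_sentence('Available from the corresponding author upon reasonable request.'))
```

With a helper like this, the labels could be added as a new column via articles_other['dataset_sentence'].apply(label_sentence) and then inspected by hand.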

Some links are still truncated during extraction. Each pair below shows the extracted link and the sentence it came from: 
['https://osf',
        '2) are available on the Open Science Framework repository: https://osf.io/95ftn/?view_only = 9a1a085583544c3eac44d1c75870599c.'],
         ['https://www',
        'humanconnectome.org/and https://www.developingconnectome.'],
       ['https://www.developingconnectome',
        'humanconnectome.org/and https://www.developingconnectome.'],
       ['projects.nitrc.org/indi/indiPRIME.html',
        'projects.nitrc.org/indi/indiPRIME.html.'],
         ['https://www',
        ' statement MRI images can be downloaded from HCP website: https://www.'],
         ['https://github',
        'Processing and analysis scripts used in this study are available at: https://github.com/ofgulban/meso-MRI (v1.0.2 saved at https://zenodo.org/record/7210802).'],
        ['https://doi.org/10.5281/zenodo',
        'The interactive web application accompanying Fig. 2 is published at https://doi.org/10.5281/zenodo.6579997 and is hosted at https://representational-dynamics.herokuapp.com/.'],
        ['https://www.lead-dbs',
        ' statements The open source Matlab toolboxes that were used in this study can be obtained from: Lead-DBS: https://www.lead-dbs.org SPM12: http://www.ﬁl.ion.ucl.ac.uk/spm Fieldtrip: http://ﬁeldtriptoolbox.org Custom-written Matlab scripts are available for sharing upon re-quest.'],
       ['http://www.ﬁl.ion.ucl.ac',
        ' statements The open source Matlab toolboxes that were used in this study can be obtained from: Lead-DBS: https://www.lead-dbs.org SPM12: http://www.ﬁl.ion.ucl.ac.uk/spm Fieldtrip: http://ﬁeldtriptoolbox.org Custom-written Matlab scripts are available for sharing upon re-quest.'],
        ['https://github',
        'Volumetric PET receptor images can be found on neuromaps (https://netneurolab.github.io/neuromaps/(Markello et al., 2022)) and at https://github.com/netneurolab/hansen_receptors (Hansen et al., 2021).'],
        ['https://netneurolab.github',
        '(2021) and is available in neuromaps (https://netneurolab.github.io/neuromaps/) (Markello et al., 2022).'],
        ['https://github',
        'All processing was performed using the abagen toolbox (https://github.com/netneurolab/abagen (Markello et al., 2021)).'],
        ['https://github',
        'We created a surface-based representation of the parcellation on the FreeSurfer fsaverage left hemi-sphere surface, via ﬁles from the Connectome Mapper toolkit (https://github.com/LTS5/cmp).'],
        ['https://meg.univ-amu',
        'The toolboxes used in this work are available at https://meg.univ-amu.fr/wiki/Main_Page and https://ins-amu.fr/software.'],
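
The truncated links above appear to share one cause. The pattern requires every link to be followed by one of the listed terminators, so when a link is followed by something else (a space and a word, an opening parenthesis, or " = "), the regex engine backtracks to the first period inside the link and stops there. The sketch below reproduces this on one of the sentences and compares it with a hypothetical alternative that matches http(s) links up to whitespace or a parenthesis and then strips trailing punctuation. The names simple_url and extract_links_greedy are assumptions, and this is not the method used in this notebook.

```python
import re

# Pattern copied from extract_links
url_pattern = r'(?i)(https?://[^\s)(]+(?:/[^\s)(]+)*(?:\(\S+\))?|osf\.io/[a-z0-9/]+|(?:www\.)?[a-z0-9.-]+\.[a-z]{2,}/[^\s)(]+|doi:\s*10\.\d+/\S+)(?:\s*(?:[),]|\.\s*[\r\n]?|,\s*|/and|$))'

# Hypothetical alternative: http(s) links only, no terminator group
simple_url = re.compile(r'https?://[^\s)(]+', re.IGNORECASE)


def extract_links_greedy(text):
    """Return http(s) links with trailing sentence punctuation removed."""
    return [link.rstrip('.,;:') for link in simple_url.findall(text)]


sentence = ('Processing and analysis scripts used in this study are available at: '
            'https://github.com/ofgulban/meso-MRI (v1.0.2 saved at '
            'https://zenodo.org/record/7210802).')

current = re.findall(url_pattern, sentence)
alternative = extract_links_greedy(sentence)

print(current)      # ['https://github', 'https://zenodo.org/record/7210802']
print(alternative)  # ['https://github.com/ofgulban/meso-MRI', 'https://zenodo.org/record/7210802']
```

The alternative covers only links with an http(s) scheme; the osf.io and doi: formats would need their own handling. Links that were broken by whitespace during PDF extraction (e.g., 'view_only = ...') would still be cut at the first space.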

In [None]:
articles_other[['dataset', 'dataset_sentence']].values