# Table of contents 
- [Replicability](#replicability)
- [Libraries](#libraries) 
- [Ground truth](#groundtruth) 
    - [Ten random articles from ten random volumes](#tenrandomarticlesfromtenrandomvolumes) 
    - [Manual extraction](#manualextraction)
    - [Investigation](#investigation)
        - [Statistics](#statistics) 
        - [Sections](#sections)
        - [Section titles](#sectiontitles)
        - [Links](#links)
    - [Main observations](#mainobservations) 
- [References](#references)
<br>
<br>

<a name='replicability'></a>
## 0. Replicability
The numbers I obtained when I first ran this notebook cannot be reproduced, because I forgot to set a seed for the random number generator. Re-running the notebook will therefore produce different random numbers, so I have written down the numbers I originally obtained. 
<br>
<br>
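
A minimal sketch of how a fixed seed would make the draws repeatable. The seed value `42` and the helper name `draw_volumes` are illustrative assumptions and were not part of the original run; the range 246 to 264 is the one used in section 2.1.

```python
import random as rand

SEED = 42  # hypothetical seed value; any fixed integer works


def draw_volumes(seed, k=10, low=246, high=264):
    """Draw k volume numbers from a generator seeded with `seed`."""
    rng = rand.Random(seed)  # separate generator, leaves the global random state untouched
    return [rng.randint(low, high) for _ in range(k)]


first_run = draw_volumes(SEED)
second_run = draw_volumes(SEED)
print(first_run == second_run)  # True: the same seed gives the same draws
```

Using a dedicated `Random` instance rather than `rand.seed(...)` keeps the seeded draws independent of any other code that uses the global generator.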

<a name='libraries'></a>
## 1. Libraries 

In [1]:
import pandas as pd 
import numpy as np 

import csv 
import os 

import random as rand 
import re 

<a name='groundtruth'></a>
## 2. Ground truth

To establish a ground truth about the articles and where the datasets are mentioned in them, I pick ten randomly selected articles from ten randomly selected volumes. 

I have adapted and edited an annotation scheme by Sourget (2023). I will go through each of the ten articles and make note of which datasets are mentioned. For each dataset used for analysis/experimentation/etc, I will write the following: 

- [title, str] The title of the article 
- [DOI, str] The article's DOI 
- [dataset_used, str] The title of the dataset used. If there is no title, but: 
    - **Description** of patient selection, inclusion criteria, or similar, I will write 'Self-collected'. 
    - **Mentions** of author(s) whose article(s) or dataset(s) have been used for an analysis, review, or similar, I will write the name of the author(s). 
- [dataset_link, str] The link to the dataset (if there is one). If there is/are: 
    - **Multiple variations** of the dataset link, e.g., 'marmosetbrainconnectome.org' and 'marmosetbrainconnectome.org/download.html', I will include the less specific link, which, in this example, is the first. 
    - **DOI**, e.g., "Raw EEG data from all healthy individuals, as well as Matlab code, are publicly available on zenodo.org (doi:10.5281/zenodo.6110595).", I will include the DOI. 
    - **Title but no link**, I will write 'No link'
    - **Self-collected data**, I will record whether the data is 'Not shared' or 'Available upon request', or write 'N/A' if there is no information about this. 
- [reference, str] The citation of the dataset (or the article introducing the dataset), as it appears in the article's references. 'N/A' if there is no mention in the references. 
- [inline_mention, bool] True if the name of the dataset or the link to the dataset is mentioned in the article's body. 
- [footnote_mention, bool] True if the name of the dataset or the link to the dataset is in one of the article's footnotes. 
- [description_mention, bool] True if the name of the dataset or the link to the dataset is in one of the article's figure or table descriptions. 
- [dataset_section, list] List noting the title of the section(s) of the article in which the dataset is mentioned. 
- [link_section, list] List noting the title of the section(s) of the article in which the link to the dataset is mentioned. 
<br>
<br>
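
The scheme above can be checked mechanically before the records are analysed. The following is a rough sketch under the assumption that each record is a plain dictionary; the helper name `validate_record` and the constant `EXPECTED_TYPES` are hypothetical and only illustrate the field types listed in the scheme.

```python
# Expected Python type for each field of the annotation scheme
EXPECTED_TYPES = {
    'title': str,
    'DOI': str,
    'dataset_used': str,
    'dataset_link': str,
    'reference': str,
    'inline_mention': bool,
    'footnote_mention': bool,
    'description_mention': bool,
    'dataset_section': list,
    'link_section': list,
}


def validate_record(record):
    """Return a list of problems found in one annotation record (empty if none)."""
    problems = []
    for field, expected_type in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            found = type(record[field]).__name__
            problems.append(f"{field} should be {expected_type.__name__}, got {found}")
    for field in record:
        if field not in EXPECTED_TYPES:
            problems.append(f"unexpected field: {field}")
    return problems


# One record from the manual extraction in section 2.2
example = {
    'title': 'Non-invasive recording of high-frequency signals from the human spinal cord',
    'DOI': '10.1016/j.neuroimage.2022.119050',
    'dataset_used': 'Self-collected',
    'dataset_link': '10.5281/zenodo.6110595',
    'reference': 'N/A',
    'inline_mention': True,
    'footnote_mention': False,
    'description_mention': False,
    'dataset_section': ['2. Materials and methods', '2.1. Subjects', 'Data and code availability'],
    'link_section': ['2.5. Data and code availability', 'Data and code availability'],
}

print(validate_record(example))  # [] when the record follows the scheme
```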

References: 
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget
<br>
<br>

<a name='tenrandomarticlesfromtenrandomvolumes'></a>
### 2.1. Ten random articles from ten random volumes 

In [2]:
randomlist = []
for i in range(0, 10):
    n = rand.randint(246, 264)
    randomlist.append(n)
print(randomlist)

[253, 248, 255, 247, 256, 253, 253, 252, 250, 256]
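
Because `randint` draws with replacement, the same volume can come up several times, as 253 does in the output above. If ten distinct volumes are wanted, one option is to sample without replacement. This is a sketch of an alternative, not the procedure that was used here; the seed is a hypothetical value chosen only to make the sketch repeatable.

```python
import random as rand

rng = rand.Random(0)  # hypothetical seed, only to make the sketch repeatable

# Volumes 246 to 264 inclusive, the same range as in the cell above
candidate_volumes = range(246, 265)

# sample() draws without replacement, so every volume appears at most once
distinct_volumes = sorted(rng.sample(candidate_volumes, k=10))
print(distinct_volumes)
```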


For each of the selected volumes [248, 252, 253, 256, 259, 261, 262, 263, 264], I will generate a random number representing which article to use. The upper bound of each draw is the number of articles available in that volume. 

In [3]:
# The number of articles in each volume
journal_volumes = {
    248: 14,
    252: 30,
    253: 37,
    256: 43,
    259: 33,
    261: 25,
    262: 33,
    263: 77,
    264: 94
}

In [4]:
# Function to pick a random article for each volume
def pick_random_article(journal_volumes):
    """Print a randomly picked article number for each volume.

    Iterates through the volumes and uses random.randint(1, num_articles)
    to pick an article number within the range of available articles in each volume.

    :param journal_volumes: dictionary where the keys are the journal volumes, and the values are the number of articles in each volume.
    """
    for volume, num_articles in journal_volumes.items():
        random_article = rand.randint(1, num_articles)  # Generate a random number
        print(f'Volume {volume}, article {random_article}')

# Call the function to pick random articles
pick_random_article(journal_volumes)

Volume 248, article 8
Volume 252, article 6
Volume 253, article 26
Volume 256, article 28
Volume 259, article 22
Volume 261, article 3
Volume 262, article 32
Volume 263, article 32
Volume 264, article 38


On the original run, I got the following results (they differ from the output above because no seed was set): 
- Volume 248, article 5
- Volume 252, article 13
- Volume 253, article 19
- Volume 256, article 34
- Volume 259, article 30
- Volume 261, article 16
- Volume 262, article 3
- Volume 263, article 46
- Volume 264, article 7
<br>
<br>

<a name='manualextraction'></a>
### 2.2. Manual extraction 
I will now go through each of the aforementioned articles and note the title, DOI, dataset used, whether there is a link to the dataset, whether the dataset is mentioned in the references, in the text, in a footnote, or in any figure or table descriptions, and in which sections these mentions occur. 

In [5]:
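# Create an empty DataFrame with the columns of the annotation scheme
# (it is replaced by the filled DataFrame in the next cell)
articles = pd.DataFrame(columns=[
    'title',
    'DOI',
    'dataset_used',
    'dataset_link',
    'reference',
    'inline_mention',
    'footnote_mention',
    'description_mention',
    'dataset_section',
    'link_section'
])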

In [6]:
# List of dictionaries representing article data
article_data_list = [
    {
        'title': 'Motor impairment evoked by direct electrical stimulation of human parietal cortex during object manipulation',
        'DOI': '10.1016/j.neuroimage.2021.118839',
        'dataset_used': 'Self-collected',
        'dataset_link': 'N/A',
        'reference': 'N/A',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': False,
        'dataset_section': ['2. Materials and methods', '2.1. Patient selection and inclusion criteria'],
        'link_section': []
    },
    {
        'title': 'An open access resource for functional brain connectivity from fully awake marmosets',
        'DOI': '10.1016/j.neuroimage.2022.119030',
        'dataset_used': 'Marmoset Functional Brain Connectivity Resource',
        'dataset_link': 'marmosetbrainconnectome.org',
        'reference': 'N/A',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['Abstract', '1. Introduction', '2. Methods', '2.1. Animals', '3. Results', '3.1. Resource', '4. Discussion', 'Schaeffer data availability statement'],
        'link_section': ['Abstract', '1. Introduction', '2.9. 25 µm marmoset brain connectome anatomical template', 'Fig.', '3. Results', '3.1. Resource', '4. Discussion', 'Schaeffer data availability statement']
    }, 
    {
        'title': 'An open access resource for functional brain connectivity from fully awake marmosets',
        'DOI': '10.1016/j.neuroimage.2022.119030',
        'dataset_used': 'marmosetbrain.org',
        'dataset_link': 'marmosetbrain.org',
        'reference': 'P. Majka, et al. (2020). Open access resource for cellular-resolution analyses of corticocortical connectivity in the marmoset monkey',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['2. Methods', '2.8. Cross-modality comparisons', '3. Results', '3.7. Comparison with tracer-based cellular connectivity', '4. Discussion'],
        'link_section': ['2.8. Cross-modality comparisons', '3.7. Comparison with tracer-based cellular connectivity', 'Fig. ', 'Discussion']
    },
    {
        'title': 'Non-invasive recording of high-frequency signals from the human spinal cord',
        'DOI': '10.1016/j.neuroimage.2022.119050',
        'dataset_used': 'Self-collected',
        'dataset_link': '10.5281/zenodo.6110595',
        'reference': 'N/A',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': False,
        'dataset_section': ['2. Materials and methods', '2.1. Subjects', 'Data and code availability'],
        'link_section': ['2.5. Data and code availability', 'Data and code availability']
    },
    {
        'title': 'White matter properties underlying reading abilities differ in 8-year-old children born full term and preterm: A multi-modal approach',
        'DOI': '10.1016/j.neuroimage.2022.119240',
        'dataset_used': 'Self-collected',
        'dataset_link': 'Available upon reasonable request',
        'reference': 'N/A',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['Abstract', '1. Introduction', '2. Materials and methods', '2.1. Participants', 'Data availability statement'],
        'link_section': ['Data availability statement']
    },
    {
        'title': 'Bring a map when exploring the ERP data processing multiverse: A commentary on Clayson et al. 2021',
        'DOI': '10.1016/j.neuroimage.2022.119443',
        'dataset_used': 'Feuerriegel et al. (2021a)',
        'dataset_link': 'osf.io/gazx2/',
        'reference': 'D.C. Feuerriegel, et al. (2021) Electrophysiological correlates of confidence differ across correct and erroneous perceptual decisions',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['1. Accounting for bias in multiverse analyses', 'Data and code availability statement'],
        'link_section': ['Data and code availability statement']
    },
    {
        'title': 'Bring a map when exploring the ERP data processing multiverse: A commentary on Clayson et al. 2021',
        'DOI': '10.1016/j.neuroimage.2022.119443',
        'dataset_used': 'Feuerriegel et al. (2021b)',
        'dataset_link': 'osf.io/eucqf/',
        'reference': 'D. Feuerriegel et al. (2021). Tracking dynamic adjustments to decision making and performance monitoring processes in conflict tasks',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['1. Accounting for bias in multiverse analyses', 'Data and code availability statement'],
        'link_section': ['Data and code availability statement']
    },
    {
        'title': 'Bring a map when exploring the ERP data processing multiverse: A commentary on Clayson et al. 2021',
        'DOI': '10.1016/j.neuroimage.2022.119443',
        'dataset_used': 'Kappenman et al. (2021)',
        'dataset_link': 'osf.io/thsqg/',
        'reference': 'E.S. Kappenman et al. (2021) ERP CORE: An open resource for human event-related potential research',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['1. Accounting for bias in multiverse analyses', 'Data and code availability statement'],
        'link_section': ['Data and code availability statement']
    },
    {
        'title': 'Bring a map when exploring the ERP data processing multiverse: A commentary on Clayson et al. 2021',
        'DOI': '10.1016/j.neuroimage.2022.119443',
        'dataset_used': 'Bode and Stahl (2014)',
        'dataset_link': 'osf.io/bndjg/',
        'reference': 'S. Bode and J. Stahl (2014). Predicting errors from patterns of event-related potentials preceding an overt response',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['1. Accounting for bias in multiverse analyses', 'Data and code availability statement'],
        'link_section': ['Data and code availability statement']
    },
    {
        'title': 'Bring a map when exploring the ERP data processing multiverse: A commentary on Clayson et al. 2021',
        'DOI': '10.1016/j.neuroimage.2022.119443',
        'dataset_used': 'N/A',
        'dataset_link': 'osf.io/guwnm/',
        'reference': 'N/A',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': False,
        'dataset_section': ['1. Accounting for bias in multiverse analyses', 'Data and code availability statement'],
        'link_section': ['Data and code availability statement']
    },
    {
        'title': 'A dynamic gradient architecture generates brain activity states',
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'dataset_used': 'Human Connectome Project',
        'dataset_link': 'https://db.humanconnectome.org/data/projects/HCP_1200',
        'reference': 'D.M. Barch et al. (2013). Function in the human connectome: Task-fMRI and individual differences in behavior',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': False,
        'dataset_section': ['1. Introduction', '2. Materials and methods', '2.1. Subjects and data', '5. Conclusion', 'Data and code availability'],
        'link_section': ['2. Materials and methods', '2.1. Subjects and data', '2.5. Task fMRI analysis']
    },
    {
        'title': 'A dynamic gradient architecture generates brain activity states',
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'dataset_used': 'Allen Human Brain Atlas',
        'dataset_link': 'http://human.brain-map.org/',
        'reference': 'N/A',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': False,
        'dataset_section': ['2.6. Genetic spatial correlation', '3.2. Correspondence with spatial gene expression patterns', 'Data and code availability'],
        'link_section': ['Data and code availability']
    },
    {
        'title': 'A dynamic gradient architecture generates brain activity states',
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'dataset_used': 'Brainnetome atlas',
        'dataset_link': 'http://www.brainnetome.org/',
        'reference': 'L. Fan et al. (2016). The human brainnetome atlas: a new brain atlas based on connectional architecture',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': False,
        'dataset_section': ['2.3. Gradient analysis and reproducibility', '2.6. Genetic spatial correlation'],
        'link_section': ['2.3. Gradient analysis and reproducibility']
    },
    {
        'title': 'A dynamic gradient architecture generates brain activity states',
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'dataset_used': 'SUIT atlas',
        'dataset_link': 'http://www.diedrichsenlab.org/imaging/suit.htm',
        'reference': 'J. Diedrichsen (2006). A spatially unbiased atlas template of the human cerebellum',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': False,
        'dataset_section': ['2.3. Gradient analysis and reproducibility', '2.6. Genetic spatial correlation'],
        'link_section': ['2.3. Gradient analysis and reproducibility']
    },
    {
        'title': 'A dynamic gradient architecture generates brain activity states',
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'dataset_used': 'Processed data',
        'dataset_link': 'https://github.com/jbrown81/gradients',
        'reference': 'N/A',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': False,
        'dataset_section': ['Data and code availability'],
        'link_section': ['Data and code availability']
    },
    {
        'title': 'Multisensory integration of anticipated cardiac signals with visual targets affects their detection among multiple visual stimuli',
        'DOI': '10.1016/j.neuroimage.2022.119549',
        'dataset_used': 'Self-collected',
        'dataset_link': 'N/A',
        'reference': 'N/A',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': False,
        'dataset_section': ['2. Materials and methods', '2.1. Participants', 'Data availability'],
        'link_section': []
    },
    {
        'title': 'The neuroanatomy of pregnancy and postpartum',
        'DOI': '10.1016/j.neuroimage.2022.119646',
        'dataset_used': 'Carmona et al., 2019',
        'dataset_link': 'N/A',
        'reference': 'S. Carmona et al. (2019). Pregnancy and adolescence entail similar neuroanatomical adaptations: A comparative analysis of cerebral morphometric changes',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['1. Introduction', '2. Applied methods'],
        'link_section': []
    },
    {
        'title': 'The neuroanatomy of pregnancy and postpartum',
        'DOI': '10.1016/j.neuroimage.2022.119646',
        'dataset_used': 'Hoekzema et al., 2017',
        'dataset_link': 'N/A',
        'reference': 'E. Hoekzema et al. (2017). Pregnancy leads to long-lasting changes in human brain structure',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['1. Introduction', '2. Applied methods'],
        'link_section': []
    },
    {
        'title': 'The neuroanatomy of pregnancy and postpartum',
        'DOI': '10.1016/j.neuroimage.2022.119646',
        'dataset_used': 'Hoekzema et al., 2020',
        'dataset_link': 'N/A',
        'reference': 'E. Hoekzema et al. (2020). Becoming a mother entails anatomical changes in the ventral striatum of the human brain that facilitate its responsiveness to offspring cues',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['1. Introduction', '2. Applied methods'],
        'link_section': []
    },
    {
        'title': 'The neuroanatomy of pregnancy and postpartum',
        'DOI': '10.1016/j.neuroimage.2022.119646',
        'dataset_used': 'Kim et al., 2010',
        'dataset_link': 'N/A',
        'reference': 'P. Kim et al. (2010). The plasticity of human maternal brain: longitudinal changes in brain anatomy during the early postpartum period',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['1. Introduction', '2. Applied methods'],
        'link_section': []
    },
    {
        'title': 'The neuroanatomy of pregnancy and postpartum',
        'DOI': '10.1016/j.neuroimage.2022.119646',
        'dataset_used': 'Lisofsky et al., 2019',
        'dataset_link': 'N/A',
        'reference': 'N. Lisofsky et al. (2019). Postpartal neural plasticity of the maternal brain: early renormalization of pregnancy-related decreases?',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['1. Introduction', '2. Applied methods'],
        'link_section': []
    },
    {
        'title': 'The neuroanatomy of pregnancy and postpartum',
        'DOI': '10.1016/j.neuroimage.2022.119646',
        'dataset_used': 'Luders et al., 2021a',
        'dataset_link': 'N/A',
        'reference': 'E. Luders (2021). Gray matter increases within subregions of the hippocampal complex after pregnancy',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['1. Introduction', '2. Applied methods'],
        'link_section': []
    },
    {
        'title': 'The neuroanatomy of pregnancy and postpartum',
        'DOI': '10.1016/j.neuroimage.2022.119646',
        'dataset_used': 'Luders et al., 2021b',
        'dataset_link': 'N/A',
        'reference': 'E. Luders et al. (2021). Postpartum gray matter changes in the auditory cortex',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['1. Introduction', '2. Applied methods'],
        'link_section': []
    },
    {
        'title': 'The neuroanatomy of pregnancy and postpartum',
        'DOI': '10.1016/j.neuroimage.2022.119646',
        'dataset_used': 'Luders et al., 2021c',
        'dataset_link': 'N/A',
        'reference': 'E. Luders et al. (2021). Significant increases of the amygdala between immediate and late postpartum: pronounced effects within the superficial subregion',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['1. Introduction', '2. Applied methods'],
        'link_section': []
    },
    {
        'title': 'The neuroanatomy of pregnancy and postpartum',
        'DOI': '10.1016/j.neuroimage.2022.119646',
        'dataset_used': 'Luders et al., 2018',
        'dataset_link': 'N/A',
        'reference': 'E. Luders et al. (2018). Potential brain age reversal after pregnancy: younger brains at 4-6 weeks postpartum',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['1. Introduction', '2. Applied methods'],
        'link_section': []
    },
    {
        'title': 'The neuroanatomy of pregnancy and postpartum',
        'DOI': '10.1016/j.neuroimage.2022.119646',
        'dataset_used': 'Luders et al., 2020',
        'dataset_link': 'N/A',
        'reference': 'E. Luders et al. (2020). From baby brain to mommy brain: Widespread gray matter gain after giving birth',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['1. Introduction', '2. Applied methods'],
        'link_section': []
    },
    {
        'title': 'The neuroanatomy of pregnancy and postpartum',
        'DOI': '10.1016/j.neuroimage.2022.119646',
        'dataset_used': 'Martinez-Garcia et al., 2021',
        'dataset_link': 'N/A',
        'reference': 'M. Martinez-Garcia et al. (2021). Do pregnancy-induced brain changes reverse? The brain of a mother six years after parturition',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['1. Introduction', '2. Applied methods'],
        'link_section': []
    },
    {
        'title': 'The neuroanatomy of pregnancy and postpartum',
        'DOI': '10.1016/j.neuroimage.2022.119646',
        'dataset_used': 'Oatridge et al., 2002',
        'dataset_link': 'N/A',
        'reference': 'A. Oatridge et al. (2002). Change in brain size during and after pregnancy: study in healthy women and women with preeclampsia',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': True,
        'dataset_section': ['1. Introduction', '2. Applied methods'],
        'link_section': []
    },
    {
        'title': 'Keep the head in the right place: Face-body interactions in inferior temporal cortex',
        'DOI': '10.1016/j.neuroimage.2022.119676',
        'dataset_used': 'Self-collected',
        'dataset_link': 'https://osf.io/b8pfa/?view_only=b6dbb5dd6a044989a7eecdc99facb43c',
        'reference': 'N/A',
        'inline_mention': True,
        'footnote_mention': False,
        'description_mention': False,
        'dataset_section': ['2. Methods', '2.1. Subjects', 'Data Availability'],
        'link_section': ['Data Availability']
    },
]

# Create the DataFrame from the list of dictionaries
articles = pd.DataFrame(article_data_list)

In [7]:
articles

Unnamed: 0,title,DOI,dataset_used,dataset_link,reference,inline_mention,footnote_mention,description_mention,dataset_section,link_section
0,Motor impairment evoked by direct electrical s...,10.1016/j.neuroimage.2021.118839,Self-collected,,,True,False,False,"[2. Materials and methods, 2.1. Patient select...",[]
1,An open access resource for functional brain c...,10.1016/j.neuroimage.2022.119030,Marmoset Functional Brain Connectivity Resource,marmosetbrainconnectome.org,,True,False,True,"[Abstract, 1. Introduction, 2. Methods, 2.1. A...","[Abstract, 1. Introduction, 2.9. 25 µm marmose..."
2,An open access resource for functional brain c...,10.1016/j.neuroimage.2022.119030,marmosetbrain.org,marmosetbrain.org,"P. Majka, et al. (2020). Open access resource ...",True,False,True,"[2. Methods, 2.8. Cross-modality comparisons, ...","[2.8. Cross-modality comparisons, 3.7. Compari..."
3,Non-invasive recording of high-frequency signa...,10.1016/j.neuroimage.2022.119050,Self-collected,10.5281/zenodo.6110595,,True,False,False,"[2. Materials and methods, 2.1. Subjects, Data...","[2.5. Data and code availability, Data and cod..."
4,White matter properties underlying reading abi...,10.1016/j.neuroimage.2022.119240,Self-collected,Available upon reasonable request,,True,False,True,"[Abstract, 1. Introduction, 2. Materials and m...",[Data availability statement]
5,Bring a map when exploring the ERP data proces...,10.1016/j.neuroimage.2022.119443,Feuerriegel et al. (2021a),osf.io/gazx2/,"D.C. Feuerriegel, et al. (2021) Electrophysiol...",True,False,True,[1. Accounting for bias in multiverse analyses...,[Data and code availability statement]
6,Bring a map when exploring the ERP data proces...,10.1016/j.neuroimage.2022.119443,Feuerriegel et al. (2021b),osf.io/eucqf/,D. Feuerriegel et al. (2021). Tracking dynamic...,True,False,True,[1. Accounting for bias in multiverse analyses...,[Data and code availability statement]
7,Bring a map when exploring the ERP data proces...,10.1016/j.neuroimage.2022.119443,Kappenman et al. (2021),osf.io/thsqg/,E.S. Kappenman et al. (2021) ERP CORE: An open...,True,False,True,[1. Accounting for bias in multiverse analyses...,[Data and code availability statement]
8,Bring a map when exploring the ERP data proces...,10.1016/j.neuroimage.2022.119443,Bode and Stahl (2014),osf.io/bndjg/,S. Bode and J. Stahl (2014). Predicting errors...,True,False,True,[1. Accounting for bias in multiverse analyses...,[Data and code availability statement]
9,Bring a map when exploring the ERP data proces...,10.1016/j.neuroimage.2022.119443,,osf.io/guwnm/,,True,False,False,[1. Accounting for bias in multiverse analyses...,[Data and code availability statement]


#### 2.2.1. Save to csv 

In [8]:
# Path to the 'Data' directory
data_dir = os.path.join(os.pardir, 'Data')

# File path
file_path = os.path.join(data_dir, 'articles_groundtruth.csv')

# Save the DataFrame to CSV, overwriting the file if it exists
articles.to_csv(file_path, index=False, mode='w')
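
One thing to keep in mind when reading this file back: CSV has no list type, so list-valued columns such as `dataset_section` end up as text. The sketch below illustrates this with the standard-library `csv` module and an in-memory buffer instead of the real file, and restores the list with `ast.literal_eval`. It assumes the column is stored as a Python list literal; the single example row is taken from the records above.

```python
import ast
import csv
import io

row = {
    'DOI': '10.1016/j.neuroimage.2022.119676',
    'dataset_section': ['2. Methods', '2.1. Subjects', 'Data Availability'],
}

# Write one row to an in-memory CSV file; the list is stored as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: every value is now a plain string
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded['dataset_section']).__name__)  # str

# Parse the string back into a list
sections = ast.literal_eval(loaded['dataset_section'])
print(sections == row['dataset_section'])  # True
```

Without this parsing step, calls such as `explode` would treat each cell as one string rather than a list of section titles.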

<a name='investigation'></a>
### 2.3. Investigation 

The following is an investigation of where the dataset is mentioned. 
<br>
<br>

<a name='statistics'></a>
#### 2.3.1. Statistics 
I'm curious to see how many datasets are mentioned per paper, how many have a link attached, how many are listed in the references, and where the datasets are mentioned (inline, in a footnote, or in a figure or table description). 

Findings: 
- Between 1 and 12 datasets are used in each article. 
- 69 % of the datasets are listed in the references 
- 100 % of the datasets are mentioned as a part of the article's main text. 
- 0 % of the datasets are mentioned in footnotes. 
- 65.5 % of the datasets are mentioned in a figure or table description. 

In [9]:
# The number of occurrences of each DOI is equal to the number of datasets mentioned in each article
doi_counts = articles['DOI'].value_counts()
print(doi_counts)

10.1016/j.neuroimage.2022.119646    12
10.1016/j.neuroimage.2022.119443     5
10.1016/j.neuroimage.2022.119526     5
10.1016/j.neuroimage.2022.119030     2
10.1016/j.neuroimage.2021.118839     1
10.1016/j.neuroimage.2022.119050     1
10.1016/j.neuroimage.2022.119240     1
10.1016/j.neuroimage.2022.119549     1
10.1016/j.neuroimage.2022.119676     1
Name: DOI, dtype: int64
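
The range stated in the findings (between 1 and 12 datasets per article) can be read off these counts. A small sketch using only built-in functions, with the counts copied from the output above; the variable name `datasets_per_article` is introduced here for illustration.

```python
# Datasets per article, keyed by DOI, copied from the value_counts() output
datasets_per_article = {
    '10.1016/j.neuroimage.2022.119646': 12,
    '10.1016/j.neuroimage.2022.119443': 5,
    '10.1016/j.neuroimage.2022.119526': 5,
    '10.1016/j.neuroimage.2022.119030': 2,
    '10.1016/j.neuroimage.2021.118839': 1,
    '10.1016/j.neuroimage.2022.119050': 1,
    '10.1016/j.neuroimage.2022.119240': 1,
    '10.1016/j.neuroimage.2022.119549': 1,
    '10.1016/j.neuroimage.2022.119676': 1,
}

counts = list(datasets_per_article.values())
print(f"Articles: {len(counts)}")
print(f"Datasets (rows): {sum(counts)}")
print(f"Datasets per article: min {min(counts)}, max {max(counts)}")
```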


In [10]:
# The percentage of datasets (rows) mentioned in the references (reference != 'N/A'), 
# in the text (inline_mention = True), in a footnote (footnote_mention = True), 
# and in a figure or table description (description_mention = True) 

# References mention
references_mention = (articles['reference'] != 'N/A').sum()
percentage_with_references_mention = (references_mention / len(articles)) * 100

# In-text mention
intext_mention = (articles['inline_mention'] == True).sum()
percentage_with_inline_mention = (intext_mention / len(articles)) * 100

# Footnote_mention
infootnote_mention = (articles['footnote_mention'] == True).sum()
percentage_with_footnote_mention = (infootnote_mention / len(articles)) * 100

# Description_mention
indescription_mention = (articles['description_mention'] == True).sum()
percentage_with_description_mention = (indescription_mention / len(articles)) * 100

print(f"Percentage of datasets mentioned in the references: {percentage_with_references_mention:.2f}%")
print(f"Percentage of datasets mentioned in the text: {percentage_with_inline_mention:.2f}%")
print(f"Percentage of datasets mentioned in a footnote: {percentage_with_footnote_mention:.2f}%")
print(f"Percentage of datasets mentioned in a description: {percentage_with_description_mention:.2f}%")

Percentage of datasets mentioned in the references: 68.97%
Percentage of datasets mentioned in the text: 100.00%
Percentage of datasets mentioned in a footnote: 0.00%
Percentage of datasets mentioned in a description: 65.52%


<a name='sections'></a>
#### 2.3.2. Sections 
I want to get an overview of where the datasets are typically mentioned, i.e., in which sections of the articles the data is mentioned. 

In [11]:
# Explode the 'dataset_section' column to split the lists into separate rows
exploded_sections = articles.explode('dataset_section')

# Count the occurrences of each section
section_counts = exploded_sections['dataset_section'].value_counts().reset_index()
section_counts.columns = ['Section', 'Frequency']

# Print the section frequency overview
print("Frequency of dataset mentions in each section:")
print(section_counts)

Frequency of dataset mentions in each section:
                                              Section  Frequency
0                                     1. Introduction         15
1                                  2. Applied methods         12
2                            2. Materials and methods          5
3       1. Accounting for bias in multiverse analyses          5
4                Data and code availability statement          5
5                          Data and code availability          4
6                    2.6. Genetic spatial correlation          3
7                                          2. Methods          3
8                                          3. Results          2
9                                       4. Discussion          2
10                                           Abstract          2
11         2.3. Gradient analysis and reproducibility          2
12                                      2.1. Subjects          2
13                                  2.1. Pa

Based on the section names and frequencies above, I can see that some sections can be grouped together for a clearer view. This includes section names containing the words 'method', 'availability', and 'subject/participant/selection'. I define some simple regex patterns to group the section names and see if this makes the overview clearer. 

In [12]:
# Define regex patterns for keyword matching
patterns = {
    'Method': r'\bmethod',
    'Availability': r'\bavailability',
    'Subject or Participant or Selection': r'\bsubject|\bparticipant|\bselection',
    'Abstract': r'abstract',
    'Introduction': r'introduction',
    'Conclusion': r'conclusion',
    'Results': r'results',
    'Discussion': r'discussion',
    'Other': r''  # Catch-all: an empty pattern matches every string, so this entry must stay last
}

# Dictionary to store frequencies and matched sections
frequencies = {}
matched_sections = {category: set() for category in patterns.keys()}

# Iterate through the list of sections
for sections in articles['dataset_section']:
    for section in sections:
        matched = False

        # Iterate through patterns and check for matches
        for category, pattern in patterns.items():
            if re.search(pattern, section, flags=re.IGNORECASE):
                matched_sections[category].add(section)
                frequencies[category] = frequencies.get(category, 0) + 1
                matched = True
                break

        # Fallback for titles that no pattern matches (not reached while the empty 'Other' pattern is present)
        if not matched:
            matched_sections['Other'].add(section)

# Convert frequencies to a DataFrame
section_df = pd.DataFrame(list(frequencies.items()), columns=['Category', 'Frequency'])

# Append the matched sections to the DataFrame
section_df['Matched Sections'] = [', '.join(matched_sections[category]) for category in section_df['Category']]

# Order the sections by their frequency
section_df = section_df.sort_values(by='Frequency', ascending=False)

In [13]:
section_df

| | Category | Frequency | Matched Sections |
|---|---|---|---|
| 0 | Method | 20 | 2. Applied methods, 2. Methods, 2. Materials a... |
| 3 | Introduction | 15 | 1. Introduction |
| 4 | Other | 15 | 2.8. Cross-modality comparisons, 1. Accounting... |
| 7 | Availability | 13 | Data availability, Data and code availability ... |
| 1 | Subject or Participant or Selection | 6 | 2.1. Subjects and data, 2.1. Subjects, 2.1. Pa... |
| 2 | Abstract | 2 | Abstract |
| 5 | Results | 2 | 3. Results |
| 6 | Discussion | 2 | 4. Discussion |
| 8 | Conclusion | 1 | 5. Conclusion |
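
A note on the 'Other' entry: `re.search` with an empty pattern matches any string, so 'Other' acts as a catch-all only because it is the last key in `patterns`. The following is a minimal sketch of that behaviour. The helper `categorize` and the shortened dictionary `demo_patterns` are hypothetical names introduced for illustration and are not used elsewhere in this notebook.

```python
import re

# Shortened, illustrative copy of the pattern dictionary (assumption: order matters)
demo_patterns = {
    'Method': r'\bmethod',
    'Availability': r'\bavailability',
    'Other': r'',  # empty pattern: matches every string
}

def categorize(title, patterns):
    """Return the first category whose pattern matches the title (hypothetical helper)."""
    for category, pattern in patterns.items():
        if re.search(pattern, title, flags=re.IGNORECASE):
            return category
    return None  # only reachable if the dictionary has no catch-all pattern

print(categorize('2. Applied methods', demo_patterns))
print(categorize('Data Availability', demo_patterns))
print(categorize('2.1. Animals', demo_patterns))
```

If 'Other' were placed earlier in the dictionary, every title reaching it would stop there, and the more specific categories after it would never be tested.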


I'm curious to see what section names were captured with the 'Other' category: 

In [14]:
section_df['Matched Sections'].loc[4]

'2.8. Cross-modality comparisons, 1. Accounting for bias in multiverse analyses, 3.7. Comparison with tracer-based cellular connectivity, 3.2. Correspondence with spatial gene expression patterns, 2.6. Genetic spatial correlation, 2.3. Gradient analysis and reproducibility, 3.1. Resource, 2.1. Animals'

These section names are quite unique. Based on the numbering in the names alone, I can make some assumptions about what they might be related to: 
- "2.1. Animals" could be captured by the category 'subject/participant/selection', as the section describes the animals used for the imaging. 
- The sections starting with '2' could be captured by the category 'method', as the second section in almost all of the articles is the methodology section. 
- The sections starting with '3' could be captured by the category 'results', as the third section in almost all of the articles is the results section. However, this section has also been called 'analysis' or something else. 
<br>
<br> 
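
The numbering-based assumptions above could be sketched as a fallback rule. This is only an illustration: the helper `guess_category_from_number` and the mapping `top_level_guess` are hypothetical, and the mapping itself (1 = introduction, 2 = methods, 3 = results) is an assumption rather than something I have verified for every article.

```python
import re

# Assumed mapping from top-level section number to category (illustrative only)
top_level_guess = {'1': 'Introduction', '2': 'Method', '3': 'Results'}

def guess_category_from_number(title):
    """Guess a category from the leading section number, or return None."""
    match = re.match(r'(\d+)\.', title)
    if match is None:
        return None  # unnumbered title, e.g. 'Data availability'
    return top_level_guess.get(match.group(1))

for title in ['2.6. Genetic spatial correlation', '3.1. Resource', 'Data availability']:
    print(title, '->', guess_category_from_number(title))
```

Such a rule labels '1. Accounting for bias in multiverse analyses' as 'Introduction' purely from its number, so the guesses would still need manual checking.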

Instead of making assumptions about which categories the sections captured by the 'Other' category could belong to, I'm going to see if the articles contain any other sections that I'm more confident about. 

In [15]:
# Split the string into individual strings
matched_sections_list = section_df['Matched Sections'].loc[4].split(', ')

# Create a function to check if any string in the list is in the row's list
def contains_any_matched_section(row):
    return any(matched_section in row for matched_section in matched_sections_list)

# Apply the function to each row in articles['dataset_section']
matching_rows = articles[articles['dataset_section'].apply(contains_any_matched_section)]

In [16]:
matching_rows['dataset_section'].values

array([list(['Abstract', '1. Introduction', '2. Methods', '2.1. Animals', '3. Results', '3.1. Resource', '4. Discussion', 'Schaeffer data availability statement']),
       list(['2. Methods', '2.8. Cross-modality comparisons', '3. Results', '3.7. Comparison with tracer-based cellular connectivity', '4. Discussion']),
       list(['1. Accounting for bias in multiverse analyses', 'Data and code availability statement']),
       list(['1. Accounting for bias in multiverse analyses', 'Data and code availability statement']),
       list(['1. Accounting for bias in multiverse analyses', 'Data and code availability statement']),
       list(['1. Accounting for bias in multiverse analyses', 'Data and code availability statement']),
       list(['1. Accounting for bias in multiverse analyses', 'Data and code availability statement']),
       list(['2.6. Genetic spatial correlation', '3.2. Correspondence with spatial gene expression patterns', 'Data and code availability']),
       list(['2.3. 

Looking at the articles and the section-names of where the datasets are mentioned, most of the datasets are captured by the categories I defined above, with the exception of two datasets that appear to only be mentioned in sections with quite unique names ('2.3. Gradient analysis and reproducibility', '2.6. Genetic spatial correlation'). 
<br>
<br>

Based on this investigation, there are four main areas in the articles that I need to pay special attention to when attempting to extract the datasets, namely: 
* The **methods** section (typically titled '2.' and contains a variation of 'method')
    * A special subsection within the methods is **subject**/**participant**/**selection**
* The **introduction** (typically titled '1.')
* The **availability** section (typically titled without a number and contains the word 'availability') 
* Regarding the **other** category, I saw that most of the datasets mentioned in uniquely titled sections were also mentioned in sections covered by the categories above. The variability of section titles is difficult to deal with, and the drawback of using the limited list of categories above is that I will miss some datasets. 
<br>
<br>

<a name='sectiontitles'></a>
#### 2.3.3. Section titles
By reading the titles printed above, it seems like most are stylized so that only the first word is capitalized and the rest of the words are in lower case. I want to see if this is the case for all section titles. 

In [84]:
print(exploded_sections['dataset_section'].unique())

['2. Materials and methods'
 '2.1. Patient selection and inclusion criteria' 'Abstract'
 '1. Introduction' '2. Methods' '2.1. Animals' '3. Results'
 '3.1. Resource' '4. Discussion' 'Schaeffer data availability statement'
 '2.8. Cross-modality comparisons'
 '3.7. Comparison with tracer-based cellular connectivity' '2.1. Subjects'
 'Data and code availability' '2.1. Participants'
 'Data availability statement'
 '1. Accounting for bias in multiverse analyses'
 'Data and code availability statement' '2.1. Subjects and data'
 '5. Conclusion' '2.6. Genetic spatial correlation'
 '3.2. Correspondence with spatial gene expression patterns'
 '2.3. Gradient analysis and reproducibility' 'Data availability'
 '2. Applied methods' 'Data Availability']


In [85]:
unique_titles = len(exploded_sections['dataset_section'].unique())
print("Number of unique section titles, wherein a dataset is mentioned: ", unique_titles)

Number of unique section titles, wherein a dataset is mentioned:  26


In [82]:
def get_nonmatching_section_titles(text_list):
    """This function processes a list of text strings to see if, for all strings 
    - The first word is capitalized 
    - The first word is the only capitalized word 
    e.g., "Introduction" is considered to follow the pattern, as the string starts with an uppercase
    letter followed by lowercase letters. Another title that follows the pattern is "Data and code 
    availability", whereas "Data and Code Availability" does not follow the pattern. 
    The function was designed to assist in identifying section titles that deviate from the expected format. 
    
    Parameters: 
    :param text_list (list): A list of text strings to be checked for conformity to the pattern.
    
    Returns: 
    :returns non_matching_strings (list): A list of text strings from text_list that do not follow the specified pattern.
    """ 
    non_matching_strings = []  # Initialize a list to store non-matching strings
    for text in text_list:
        words = text.split('. ', 1)  # Split at the first period and space
        if len(words) == 2:
            first_word, rest_of_text = words
            first_word_parts = first_word.split('.')
            if all(part.isdigit() for part in first_word_parts):
                if rest_of_text[0].isupper() and rest_of_text[1:].islower():
                    continue  # The pattern is followed
        elif text[0].isupper() and text[1:].islower():
            continue  # The pattern is followed
        else:
            parts = text.split('.')
            if len(parts) >= 2 and all(part.isdigit() for part in parts):
                continue  # The pattern is followed
        non_matching_strings.append(text)  # Add non-matching strings to the list
    return non_matching_strings

In [83]:
# Check whether all titles follow the pattern
non_matching_strings = articles['dataset_section'].apply(get_nonmatching_section_titles)

# Flatten the list of non-matching strings
non_matching_strings = [string for sublist in non_matching_strings for string in sublist]

if len(non_matching_strings) == 0:
    print("All strings in dataset_section follow the pattern.")
else:
    print("Strings that do not follow the pattern:")
    for string in non_matching_strings:
        print(string)

Strings that do not follow the pattern:
Data Availability


Of the 26 unique section titles, only one deviated from the expected pattern of capitalizing only the first word of the title and writing the remaining words in lowercase. 
<br>
<br>
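
The same check can be sketched more compactly with a regular expression that strips an optional numeric prefix and then tests whether the remainder is in sentence case. The helper name `is_sentence_case` is hypothetical, and the rule it applies (first character uppercase, no further uppercase letters) is a simplified reading of the pattern described above, not a drop-in replacement for the function in the notebook.

```python
import re

def is_sentence_case(title):
    """Return True if only the first letter of the title (after any numbering) is uppercase."""
    # Drop a leading section number such as '2.' or '3.7.' (assumed numbering format)
    text = re.sub(r'^\d+(\.\d+)*\.\s*', '', title)
    return bool(text) and text[0].isupper() and text[1:].islower()

titles = ['2.1. Animals', 'Data and code availability', 'Data Availability']
print([t for t in titles if not is_sentence_case(t)])
```

Note that `str.islower` ignores characters without case, such as spaces, digits, and hyphens, which is why a title like '3.7. Comparison with tracer-based cellular connectivity' still passes.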

<a name='links'></a>
#### 2.3.4. Links
I'm interested to know the distribution of articles with and without links, as well as where the links are mentioned compared to any additional mentions of the dataset. 

In [None]:
# Calculate the percentage of articles with dataset links
has_links = (articles['dataset_link'] != 'N/A').sum()
percentage_with_links = (has_links / len(articles)) * 100

print(f"Percentage of articles with dataset links: {percentage_with_links:.2f}%")

I want to see if there are any overlaps between the sections where the name of the dataset (or any dataset information in general) and the link to the dataset are mentioned. 

In [None]:
# Lists to store section information
same_section = []  # sections where both dataset and link are mentioned
mention_only = []  # sections where only dataset is mentioned
link_only = []  # sections where only link is mentioned
same_count = []  # count of same_section mentions
mention_count = []  # count of only mentions
link_count = []  # count of only links

for index, row in articles.iterrows():
    dataset_section = row['dataset_section']
    link_section = row['link_section']

    # Ignore rows where link_section is empty
    if not link_section:
        continue

    # Find sections where both dataset and link are mentioned
    common_sections = set(dataset_section) & set(link_section)
    # Find sections where only dataset is mentioned
    dataset_only_sections = set(dataset_section) - set(link_section)
    # Find sections where only link is mentioned
    link_only_sections = set(link_section) - set(dataset_section)

    # Append the section information to the lists
    same_section.append(common_sections)
    mention_only.append(dataset_only_sections)
    link_only.append(link_only_sections)

    # Count the occurrences
    same_count.append(len(common_sections))
    mention_count.append(len(dataset_only_sections))
    link_count.append(len(link_only_sections))

# Create a DataFrame to store the results
mentions_vs_links = pd.DataFrame({
    'Same Section': same_section,
    'Mention Only': mention_only,
    'Link Only': link_only,
    'Same Count': same_count,
    'Mention Count': mention_count,
    'Link Count': link_count
})

In [None]:
mentions_vs_links

In [None]:
# Dictionary to store frequencies and matched sections
frequencies = {}
matched_sections = {category: set() for category in patterns.keys()}

# Iterate through the sets in the 'Same Section' column
for section_set in mentions_vs_links['Same Section']:
    for section in section_set:
        matched = False

        # Iterate through patterns and check for matches
        for category, pattern in patterns.items():
            if re.search(pattern, section, flags=re.IGNORECASE):
                matched_sections[category].add(section)
                frequencies[category] = frequencies.get(category, 0) + 1
                matched = True
                break

        # Fallback for titles that no pattern matches (not reached while the empty 'Other' pattern is present)
        if not matched:
            matched_sections['Other'].add(section)

# Convert frequencies to a DataFrame
same_sections_df = pd.DataFrame(list(frequencies.items()), columns=['Category', 'Frequency'])

# Append the matched sections to the DataFrame
same_sections_df['Matched Sections'] = [', '.join(matched_sections[category]) for category in same_sections_df['Category']]

# Order the sections by their frequency
same_sections_df = same_sections_df.sort_values(by='Frequency', ascending=False)

In [None]:
same_sections_df

When an article includes a link to the dataset, the link always appears together with the name of the dataset (or some other information about the dataset), typically in a section whose title contains the word 'availability'. 
<br>
<br>
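
The set operations behind this comparison can be illustrated on a single made-up row. The section lists below are invented for the example and are not taken from the ten articles.

```python
# Invented example row (not real data from the articles)
dataset_section = ['1. Introduction', '2. Methods', 'Data availability']
link_section = ['Data availability']

both = set(dataset_section) & set(link_section)          # dataset and link together
mention_only = set(dataset_section) - set(link_section)  # dataset mentioned without a link
link_only = set(link_section) - set(dataset_section)     # link without a dataset mention

print(both)
print(sorted(mention_only))
print(link_only)
```

An empty `link_only` set for every row is what supports the statement that a link never appears without an accompanying mention of the dataset.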

<a name='mainobservations'></a> 
## 2.4. Main observations 

Based on my investigation of ten randomly selected articles, focusing on where the datasets are mentioned and how I can go about the task of extracting them: 

* Each article utilizes between 1 and 12 datasets.
* All datasets are referenced within the main text of the articles.
    * There are four primary sections in the articles that I have to pay close attention to when extracting the datasets: 
        * **Methods**: This is typically titled '2.' and contains a variation of 'method'.
            * A special subsection within Methods is: **subject**/**participant**/**selection**. 
        * **Introduction**: Typically titled '1. Introduction'
        * **Availability**: This section is typically titled without a number and contains the word 'availability'. 
        * Other sections: Many datasets were found in sections with diverse titles, making categorization challenging and emphasizing the need for a broader approach. However, a majority of the datasets that were mentioned in sections with unique titles were also mentioned in one or more of the sections listed above. That said, the drawback of using the limited list of categories above is that I will miss some datasets. 
* None of the datasets are mentioned in footnotes.
* 65.5% of the datasets are referenced within figure or table descriptions.
* Approximately half of the mentioned datasets are linked to external sources. 
* Articles with linked datasets typically reference them within sections containing the word 'availability' in their title.
    * When an article includes a link to the dataset, the link always appears together with the name of the dataset (or some other information about the dataset), typically in a section whose title contains the word 'availability'.
* 69% of the datasets are listed in the references.

<a name='references'></a>
## 3. References
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget