# Table of contents 
- [Setup](#setup) 
    - [Target](#target)
    - [Libraries](#libraries)
- [Get datasets](#getdatasets)
    - [Get URLs](#geturls)
    
<br>
<br>

[Old code](#oldcode)
- [Gather datasets](#gatherdatasets)
    - [Get text sections](#gettextsections)
        - [Section patterns v1](#sectionpatternsv1)
        - [Section patterns v2](#sectionpatternsv2)
        - [Section patterns v3](#sectionpatternsv3)
    - [Preprocessing text sections](#preprocessingtextsections) 
        - [Start patterns](#startpatterns)
        - [Clean text](#cleantext)
    - [Get datasets](#getdatasets)
        - ['Availability' pattern](#availabilitypattern)
        - [Other section patterns](#othersectionpatterns)
- [References](#references)

<a name='setup'></a>
# 0. Setup 

This notebook contains the code to extract the datasets used in the articles published in NeuroImage in 2022. 
<br>
<br>

<a name='target'></a> 
## 0.1. Target
The goal is to use pypdf to locate and extract the datasets used for analysis in the research articles. Based on an initial review of nine random 
<br>
<br>

<a name='libraries'></a>
## 0.2. Libraries 

In [1]:
import pandas as pd
import numpy as np

import json 
import os 
import re 

# Read PDFs
import pypdf 
# Extract URLs from text 
import urlextract 

# 1. Get datasets 
<a name='getdatasets'></a>
Steps: 
- Get sentences that contain URLs. 
- Clean sentences. 

<br>

## 1.1. Get URLs 
<a name='geturls'></a>
Based on my exploration of ten randomly picked articles, 75% of the articles contained links to the dataset(s) used for analysis. The articles that did not contain links either used self-collected data or no datasets at all. 

I use the work of Sourget (2023) to search the PDFs for their datasets: 
- *get_content* is loosely interpreted from Sourget (2023), using the following path in the GitHub repository: DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures'.

I use the Python library urlextract by Lipovský (2022) to extract the URLs. 

<br>

References: 
- Lipovský, J. (2022). urlextract: Collects and extracts URLs from given text. (1.8.0) [Python]. https://github.com/lipoja/URLExtract
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget

In [34]:
def get_content(pdf_path, alt_pdf_path):
    """Get sentences that contain URLs. 
    This function is loosely interpreted from Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget
    specifically: DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures'.
    
    Parameters: 
    :param pdf_path (str): Path to the PDF file.
    :param alt_pdf_path (str): Alternative path to the PDF file. 
    
    Returns: 
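    :return: Tuple (datasets, sentences) from get_datasets, 'Editorial board' if the PDF is not in the articles directory, or None if nothing was found or the PDF could not be read.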
    """
    try:
        # The context manager closes the file even when we return early
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = pypdf.PdfReader(pdf_file)
            # Read the entire PDF content
            pdf_text = " ".join(page.extract_text() for page in pdf_reader.pages)
        
        # Extract sentences containing datasets
        content = get_datasets(pdf_text)
        if content: 
            return content 
        
    except FileNotFoundError:
        try:
            # Check whether the PDF exists in the alternative directory
            alternative_pdf_path = os.path.join(alt_pdf_path, os.path.basename(pdf_path))
            with open(alternative_pdf_path, 'rb'):
                return 'Editorial board'
        except FileNotFoundError:
            # If PDF is not found in the original or alternative directory, return 'Editorial board'
            return 'Editorial board'
        except Exception as e:
            print(f"Error reading PDF: {e}")
    
    except Exception as e:
        print(f"Error reading PDF: {e}")
        

############### SENTENCES ################################################
def split_text_into_sentences(text):
    """This function splits a given text into sentences based on a regular expression pattern. 
    It uses re.split() to identify sentence boundaries at whitespace that follows 
    sentence-ending punctuation (".", "!", or "?"). It does not split when the 
    whitespace is followed by a number and more whitespace, e.g., 'Fig. 1 shows'. 
    
    Parameters: 
    :param text (str): Text to split. 
    
    Returns: 
    :return: List of sentences (list of str). 
    """
    sentence_pattern = r'(?<=[.!?])\s+(?![0-9]+\s)'
    sentences = re.split(sentence_pattern, text)
    return sentences


############### LINKS ################################################
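def extract_links(text): 
    """Extract all URLs found in a text. 
    
    Parameters: 
    :param text (str): Text to search for URLs. 
    
    Returns: 
    :return: List of URLs (list of str) in order of appearance, duplicates included. 
    """
    # Create an instance of the URLExtract class
    extractor = urlextract.URLExtract()

    # gen_urls yields the URLs one by one; collect them in a list
    urls = list(extractor.gen_urls(text))
        
    return urls 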

############### DATASETS ################################################
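def get_datasets(text):
    """Get the links and data-request mentions in a text, together with the sentences they appear in. 
    
    Parameters: 
    :param text (str): Text to search. 
    
    Returns: 
    :return: Tuple (extracted_datasets, dataset_sentences) of two equally long lists, or None if nothing was found. 
    """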
    # Initialize lists to store extracted datasets and their corresponding sentences
    extracted_datasets = []
    dataset_sentences = []
    
    # Split the text into sentences
    sentences = split_text_into_sentences(text)
    
    # Extract all links from the full text
    links = extract_links(text)
    
    for sentence in sentences:
        datasets_in_sentence = []
        
        # Check if the sentence contains a link
        for link in links:
            if link in sentence:
                datasets_in_sentence.append(link)
        
        # Check if the sentence contains the word "request"
        if "request" in sentence.lower():
            datasets_in_sentence.append("Request")
        
        if datasets_in_sentence:
            # If any datasets were found in the sentence, add them and the sentence itself
            extracted_datasets.extend(datasets_in_sentence)
            dataset_sentences.extend([sentence] * len(datasets_in_sentence))
    
    # If no dataset was found, return None
    if not extracted_datasets:
        return None
    
    #df = pd.DataFrame({'dataset': extracted_datasets, 'dataset_sentence': dataset_sentences})
    #return df

    return extracted_datasets, dataset_sentences


###############
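def extract_and_add_datasets(row, text_column):
    """Expand one DataFrame row into one row per dataset found in its text. 
    
    Parameters: 
    :param row (pd.Series): Row that holds the text to search. 
    :param text_column (str): Name of the column in row that contains the text. 
    
    Returns: 
    :return: List of copies of row, each with a 'dataset' and a 'dataset_sentence' entry, or None if no dataset was found. 
    """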
    result = get_datasets(row[text_column])
    
    if result is None:
        return None
    
    if len(result) == 2:
        datasets, sentences = result
    else:
        # Handle the case where get_datasets didn't return the expected two values
        datasets, sentences = ["N/A"], ["N/A"]
    
    rows_list = []
    for dataset, sentence in zip(datasets, sentences):
        new_row = row.copy()
        new_row['dataset'] = dataset
        new_row['dataset_sentence'] = sentence
        rows_list.append(new_row)
    
    return rows_list
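
To see what the sentence pattern in split_text_into_sentences does in practice, here is a small standalone sketch. It reuses the same regular expression; the helper name `split_sentences` and the example strings (including the link `osf.io/abc`) are made up for illustration. The second example shows a limitation: the pattern only protects a number that is followed by whitespace, so 'Fig. 1.' at the end of a sentence is still split.

```python
import re

# Same pattern as in split_text_into_sentences
SENTENCE_PATTERN = r'(?<=[.!?])\s+(?![0-9]+\s)'

def split_sentences(text):
    """Split after '.', '!' or '?' unless a number followed by whitespace comes next."""
    return re.split(SENTENCE_PATTERN, text)

# 'Fig. 1 for' is kept together because '1' is followed by whitespace
kept = split_sentences("See Fig. 1 for details. Data are at osf.io/abc. Thanks!")
print(kept)

# 'Fig. 1.' is split because '1' is followed by a dot, not by whitespace
broken = split_sentences("Shown in Fig. 1. Next sentence.")
print(broken)
```

If this turns out to matter for the dataset sentences, the lookahead could be relaxed, but that is a design choice I have not tested on the articles.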

In [29]:
# Path to the directory containing PDFs
articles_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_articles_doi/'
editorialboard_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_editorialboard_doi/'

# Path to the JSON file containing DOI values
json_file_path = '../Data/ElsevierAPI/downloadedPDFs_info.json'

I will test the functions using the groundtruth texts as my validation set. 
When manually extracting the datasets from the ten groundtruth texts, we should get the following datasets (NB! I do not yet distinguish between links that lead the reader to data and links that lead the reader to code; this will come later): 
<br>
<br>

| DOI                                   | Dataset                                      | Dataset_sentence                                                                                                                                                                                            |
|---------------------------------------|----------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 10.1016/j.neuroimage.2022.119526       | Human Connectome Project (1U54MH091657, PIs Van Essen and Ugurbil)                       | Original data was obtained from the Human Connectome Project (1U54MH091657, PIs Van Essen and Ugurbil) and the Allen Hu-man Brain Atlas (http://human.brain-map.org/).                    |
|                                        | Allen Hu-man Brain Atlas (http://human.brain-map.org/)                                | Original data was obtained from the Human Connectome Project (1U54MH091657, PIs Van Essen and Ugurbil) and the Allen Hu-man Brain Atlas (http://human.brain-map.org/).                    |
|                                        | https://github.com/jbrown81/gradients                                    | All code (latent space derivation, dynamical system modeling, and gene expression corre-lation) and processed data (gradient maps/region weights, gradient timeseries, and region gene expression values) are available at https://github.com/jbrown81/gradients. |
| 10.1016/j.neuroimage.2022.119443       | osf.io/gazx2/                               | statement EEG datasets used to create the ﬁgure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/and osf.io/bndjg/.                  |
|                                        | osf.io/eucqf/                               | statement EEG datasets used to create the ﬁgure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/and osf.io/bndjg/.                  |
|                                        | osf.io/thsqg/                               | statement EEG datasets used to create the ﬁgure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/and osf.io/bndjg/.                  |
|                                        | osf.io/bndjg/                               | statement EEG datasets used to create the ﬁgure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/and osf.io/bndjg/.                  |
|                                        | osf.io/guwnm/                               | Code used to reproduce the plots in Fig. 1 , as well as averaged ERP data, is available from osf.io/guwnm/.                                      |
| 10.1016/j.neuroimage.2022.119240       | Request                                           | statement Data used in this study are available from the corresponding author upon reasonable request.                                                                         |
| 10.1016/j.neuroimage.2022.119050       | zenodo.org (doi: 10.5281/zenodo.6110595) | Raw EEG data from all healthy individuals, as well as Matlab code, are publicly available on zenodo.org (doi: 10.5281/zenodo.6110595).                         |
| 10.1016/j.neuroimage.2021.118854       | Human Connectome Project website (https://www.humanconnectome.org/study/hcp-young-adult/document/extensively-processed-fmri-data-documentation) | The data used in this study was downloaded from the Human Connectome Project website (https://www.humanconnectome.org/study/hcp-young-adult/document/extensively-processed-fmri-data-documentation). |
|                                       | https://github.com/ferreirafabio80/gfa | The GFA models and experiments were implemented in Python 3.9.1 and are available here: https://github.com/ferreirafabio80/gfa.                                          |
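
One possible way to compare the extracted links with the table above is to compute precision and recall per article. The sketch below is only an assumption about how such a check could look: the function `score_extraction` is hypothetical, it compares links only (not dataset names), and the `extracted` list is a small illustrative subset rather than the full output of get_datasets.

```python
def score_extraction(expected, extracted):
    """Return (precision, recall) of the extracted links against the expected ones."""
    expected_set, extracted_set = set(expected), set(extracted)
    hits = expected_set & extracted_set
    precision = len(hits) / len(extracted_set) if extracted_set else 0.0
    recall = len(hits) / len(expected_set) if expected_set else 0.0
    return precision, recall

# Expected links for 10.1016/j.neuroimage.2021.118854, taken from the table
expected = [
    'https://www.humanconnectome.org/study/hcp-young-adult/document/'
    'extensively-processed-fmri-data-documentation',
    'https://github.com/ferreirafabio80/gfa',
]

# Illustrative subset of extracted links (the second one is cut off at a line break)
extracted = [
    'www.elsevier.com/locate/neuroimage',
    'https://www.humanconnectome.org/study/hcp-',
    'https://github.com/ferreirafabio80/gfa',
    'https://statistics.berkeley.edu/tech-reports/688',
]

precision, recall = score_extraction(expected, extracted)
print(precision, recall)
```

Exact string matching is strict: a link that is cut off at a line break counts as a miss, which is probably what I want while the URL extraction is still being tuned.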

In [30]:
# List of groundtruth DOI values to filter 
validation_dois = [
    '10.1016/j.neuroimage.2021.118839',
    '10.1016/j.neuroimage.2021.118854',
    '10.1016/j.neuroimage.2022.119030',
    '10.1016/j.neuroimage.2022.119050',
    '10.1016/j.neuroimage.2022.119240',
    '10.1016/j.neuroimage.2022.119443',
    '10.1016/j.neuroimage.2022.119526',
    '10.1016/j.neuroimage.2022.119549',
    '10.1016/j.neuroimage.2022.119646',
    '10.1016/j.neuroimage.2022.119676',
] 

In [35]:
# Empty list to store individual results
results_list = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    # Get the first 10 DOIs
    #first_10_dois = doi_data['DOIs'][:10]

    for doi in validation_dois:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(articles_directory, f"{doi_replaced}.pdf")

        # Call the get_content function for each DOI
        sentence = get_content(pdf_path, editorialboard_directory)

        # Create a dictionary for each result and add it to the list
        results_list.append({"DOI": doi, "Section": sentence})

# Convert the list of dictionaries to a DataFrame
results_df = pd.DataFrame(results_list)

In [None]:
extracted_datasets, dataset_sentences

In [45]:
results_df['Section'].loc[0]

(['www.elsevier.com/locate/neuroimage',
  'https://doi.org/10.1016/j.neuroimage.2021.118839',
  'http://creativecommons.org/licenses/by-nc-nd/4.0/',
  'http://neuroimage.usc.edu/brainstorm'],
 ['NeuroImage  248 (2022) 118839 \nContents  lists available  at ScienceDirect  \nNeuroImage  \njournal  homepage:  www.elsevier.com/locate/neuroimage  \nMotor  impairment  evoked  by direct  electrical  stimulation  of human  \nparietal  cortex  during  object  manipulation  \nLuca Fornia  a , c , Marco  Rossi b , Marco  Rabuﬀetti  c , Andrea  Bellacicca  a , Luca Viganòb , \nLuciano  Simone  d , Henrietta  Howells  a , Guglielmo  Puglisi  a , Antonella  Leonetti  a , \nVincenzo  Callipo  e , Lorenzo  Bello b , Gabriella  Cerri e , ∗ \na Laboratory  of Motor Control, Department  of Medical Biotechnologies  and Translational  Medicine,  Universitàdegli  Studi di Milano, Italy \nb Neurosurgical  Oncology Unit, Department  of Oncology and Hemato-Oncology,  Universitàdegli  Studi di Milano, Italy \nc

Unnamed: 0,0,1,2,3
0,www.elsevier.com/locate/neuroimage,https://doi.org/10.1016/j.neuroimage.2021.118839,http://creativecommons.org/licenses/by-nc-nd/4.0/,http://neuroimage.usc.edu/brainstorm
1,NeuroImage 248 (2022) 118839 \nContents list...,Corticospinal projections has been shown als...,This is an open access article under the CC BY...,Subsequently the results were loaded in a Mat...


In [42]:
results_df['Section'].loc[1]

(['www.elsevier.com/locate/neuroimage',
  'https://doi.org/10.1016/j.neuroimage.2021.118854',
  'http://creativecommons.org/licenses/by/4.0/',
  'https://www.humanconnectome.org/study/hcp-',
  'https://www.humanconnectome.org/study/hcp-',
  'https://www.humanconnectome.org/study/hcp-young-',
  'https://github.com/ferreirafabio80/gfa',
  'https://statistics.berkeley.edu/tech-reports/688',
  'https://dl.acm.org/doi/book/10.5555/1162264',
  'https://dl.acm.org/doi/10.5555/3104482.3104540',
  'http://proceedings.mlr.press/v22/virtanen12.html',
  'https://dl.acm.org/doi/10.5555/2946645.3053478'],
 ['NeuroImage  249 (2022) 118854 \nContents  lists available  at ScienceDirect  \nNeuroImage  \njournal  homepage:  www.elsevier.com/locate/neuroimage  \nA hierarchical  Bayesian  model  to ﬁnd brain-behaviour  associations  in \nincomplete  data  sets \nFabio  S.',
  'Second,  the \nassociations  within data modalities,  which might explain  important  vari- \nhttps://doi.org/10.1016/j.neuroimage.

---
<a name = 'oldcode'></a>
# OLD CODE 

<a name='gatherdatasets'></a>
# 1. Gather datasets 

PLAN OF ATTACK TO EXPLORE: 
* IF there is a 'Data availability' (or similar) section: locate it and look for links. If there are multiple links, save all of them and look at the surrounding words for context. 
* ELSE (there is no 'Data availability' or similar section): 
	* Look at the wording in section 2.1 
<br>
<br>

<a name='gettextsections'></a>
## 1.1. Get text sections 

I use the work of Akkoç (2023) and Sourget (2023) to search the PDFs for their datasets. The two functions presented in this section are inspired by code from two separate git repositories. 
- *get_section* is loosely interpreted from Akkoç (2023), using the following path in the GitHub repository: PublicDatasets/ArticleAnalyser.ipynb, section '2.1 Get section function'
- *get_content* is loosely interpreted from Sourget (2023), using the following path in the GitHub repository: DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures'.

<br>

References: 
- Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022)
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget

In [30]:
def get_content(pdf_path, alt_pdf_path, section_patterns):
    """Get a PDF. 
    This function is loosely interpreted from Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget
    specifically: DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures'.
    
    Parameters: 
    :param pdf_path (str): Path to the PDF file.
    :param alt_pdf_path (str): Alternative directory to look in if the PDF is not found at pdf_path. 
    :param section_patterns (list of tuples): Start and end patterns that are passed on to get_section. 
    
    Returns: 
    :return: Extracted content or 'Editorial board' if not found.
    """
    try:
        # The context manager closes the file even when we return early
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = pypdf.PdfReader(pdf_file)
            # Read the entire PDF content
            pdf_text = " ".join(page.extract_text() for page in pdf_reader.pages)
        
        # Extract sections using the provided section patterns
        content = get_section(pdf_text, section_patterns)
        if content: 
            return content 
        
    except FileNotFoundError:
        try:
            # Check whether the PDF exists in the alternative directory
            alternative_pdf_path = os.path.join(alt_pdf_path, os.path.basename(pdf_path))
            print(alternative_pdf_path)
            with open(alternative_pdf_path, 'rb'):
                return 'Editorial board'
        except FileNotFoundError:
            # If PDF is not found in the original or alternative directory, return 'Editorial board'
            return 'Editorial board'
        except Exception as e:
            print(f"Error reading PDF: {e}")
    
    except Exception as e:
        print(f"Error reading PDF: {e}")


def get_section(article, section_patterns):
    """Get sections from a research paper based on patterns.
    This function is loosely interpreted from Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022) with some alterations.
    specifically PublicDatasets/ArticleAnalyser.ipynb, section '2.1 Get section function'
    
    Parameters: 
    :param article (str): Text contents of the research paper.
    :param section_patterns (list of lists): A list of lists where each inner list represents the start and end patterns.
    
    Returns: 
    :return: The extracted section text.
    """
    article_lower = article.lower()  # Convert contents to lowercase

    # Attempt to find the section based on the current patterns (case-insensitive)
    for start_patterns, end_patterns in section_patterns:
        for start_pattern in start_patterns:
            start_pattern = re.compile(re.escape(start_pattern), re.IGNORECASE)
            match_start = start_pattern.search(article_lower)
            if match_start:
                idx0 = match_start.start()
                for end_pattern in end_patterns:
                    end_pattern = re.compile(re.escape(end_pattern), re.IGNORECASE)
                    match_end = end_pattern.search(article_lower[idx0:])
                    if match_end:
                        end_idx = idx0 + match_end.end()
                        section = article[idx0:end_idx]  # Extract the matched section
                        return section

    # If no match is found, return an empty string
    return ""
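
One thing to keep in mind about get_section: every pattern is passed through re.escape before it is compiled, so a pattern is always matched as literal text. A pattern written as a regular expression, such as `Fig.\d+`, will therefore only match the literal characters `Fig.\d+` and not 'Fig.2'. The short sketch below illustrates the difference; the example sentence is made up.

```python
import re

text = "As shown in Fig.2 and Figure 3, the data were ..."

# What get_section does: the pattern is escaped and matched literally
literal = re.compile(re.escape(r"Fig.\d+"), re.IGNORECASE)

# The same idea compiled as an actual regular expression
as_regex = re.compile(r"Fig\.\d+", re.IGNORECASE)

literal_match = literal.search(text)
regex_match = as_regex.search(text)

print(literal_match)          # no match: the text does not contain a backslash-d
print(regex_match.group(0))   # the figure reference is found
```

Plain titles such as 'Data Availability' are unaffected by the escaping, which is why those patterns still work.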

In [31]:
# Path to the directory containing PDFs
pdf_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_articles_doi/'
alternative_pdf_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_editorialboard_doi/'

# Path to the JSON file containing DOI values
json_file_path = '../Data/ElsevierAPI/downloadedPDFs_info.json'

<a name='sectionpatternsv1'></a>
### 1.1.1. Section patterns v1 
Before I continue working on extracting the dataset names and potential links from the sections, I am curious to see how the section pattern performs. 

I investigate the first ten DOIs in downloadedPDFs_info.json to see exactly what text sections were extracted.

In [32]:
section_patterns = [
    (["Data and Code Availability", "Data Availability"], ["3", "CRediT authorship contribution statement", "Acknowledgements", "References"]),
    (["2.1"], ["2.2"]),
    (["Resource", "3.1 'Resource'"], ["3.2"]),
    ([r"Fig.\d+", r"Fig.\d+\.?", r"Figure \d+"], [r"https?://[^\s]+"]),
    ([r"Tab.\d+", r"Table \d+\.?"], [r"https?://[^\s]+", r"[\w\s-]+\d{4}"]),
    (["Introduction", "1"], ["2"]),
    (["Abstract"], ["1", "Introduction"])
]

In [None]:
# Empty list to store individual results
results_list = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    # Get the first 10 DOIs
    first_10_dois = doi_data['DOIs'][:10]

    for doi in first_10_dois:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content function for each DOI
        section_content = get_content(pdf_path, alternative_pdf_directory, section_patterns)

        # Create a dictionary for each result and add it to the list
        results_list.append({"DOI": doi, "Section": section_content})

# Convert the list of dictionaries to a DataFrame
results_df = pd.DataFrame(results_list)

In [6]:
results_df

Unnamed: 0,DOI,Section
0,10.1016/j.neuroimage.2022.119451,Data and code availability statements \nSpec...
1,10.1016/j.neuroimage.2022.119632,Data and code availability \nThe data incorpo...
2,10.1016/j.neuroimage.2022.119584,Data Availability \nData will be made availab...
3,10.1016/j.neuroimage.2022.119550,Data and code availability \nAll data used in...
4,10.1016/j.neuroimage.2022.119710,Data and code availability statement \nAll i...
5,10.1016/j.neuroimage.2022.119338,Data and code availability \nNo data were acq...
6,10.1016/j.neuroimage.2022.118986,Data and Code Availability Statement \nAll c...
7,10.1016/j.neuroimage.2022.119192,Data and code availability statement \nData ...
8,10.1016/j.neuroimage.2022.119177,Data and code availability statement \nThe E...
9,10.1016/j.neuroimage.2022.119110,Data and code availability statements : The ...


*In the following description, I refer to the index of the articles in results_df.*

Observations from the text sections extracted with section_patterns: 
* In 1, 2, 4, 7, and 9, **the text is cut short because there's a mention of a number 3** within the section (in a link, in a release number, etc.). 
* In 2, they call it: 'Data/code availability statement'
* In 2 and 9, the **end of the section can be 'Acknowledgements'**.
* In 3 and 6, the **end of the section can be 'Declaration of Competing Interest'**.
* In 4, 5, and 8, the **section ends with 'Credit authorship contribution statement'**.
* In 5, we see that the use of a **URL does not necessarily mean that it's pointing to data (in this case, it's code and software)**. 
* In 6, we see that **the formulation of the text is important** (as the GitHub link contains both data and code, but that is tricky to see). 
* In 7 and 8, they **mention which dataset they used, but do not link it**. 
* In 9, it says: "The review summarizes data but does not contain new data." (this is important if I want to look into and further filter the documents for significance testing). 
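
The first observation (text cut short by a '3') can be reproduced on a small made-up example. The sketch below mimics how get_section ends a section at the first literal occurrence of an end pattern; the helper `cut_at`, the text, and the example.org link are all hypothetical.

```python
import re

section_text = (
    "Data and code availability statement \n"
    "Code is available at https://example.org/release-v3/code. \n"
    "3. Results \n"
)

def cut_at(end_pattern, text):
    """Return text up to and including the first literal match of end_pattern."""
    match = re.search(re.escape(end_pattern), text)
    return text[:match.end()] if match else text

# A bare '3' already matches inside the link, so the section is cut short
short = cut_at("3", section_text)
print(repr(short))

# '3. ' skips the link and stops at the next section heading in this example
longer = cut_at("3. ", section_text)
print(repr(longer))
```

A more specific end pattern helps here, but it can still match too early, for example on a sentence that ends in 'version 3. '.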

<br>
From this investigation I can see that I need to edit the section patterns. Ideas: 

- Maybe the end of a section can be \n\n? 
- Section end '3' should be called '3. ' - maybe this will fix some 
- Add variations: 
    - Section starts: 
        - Data/code availability statement
    - Section ends: 
        - [data and code] Declaration of Competing Interest
        - [data and code] Acknowledgements
        - [data and code] Credit authorship contribution statement
<br>
<br>

FOR FUTURE STEPS: 
- URLs do not necessarily link to the data. 
- A git repository can contain both data and code - but not always. 
- The dataset might only be mentioned by name and not linked (so far, I've only seen the names in camelcase). 
- QUESTION: How do we treat reviews that summarize data but do not contain new data? Is the reuse of a dataset not also the same as not containing new data?

<a name='sectionpatternsv2'></a>
### 1.1.2. Section patterns v2 
Based on my exploration of the performance of the first section patterns, I can see that they need to be rewritten. For version 2, I made a few edits: 
* Add variations
    * Section starts: 
        * Data/code availability statement 
    * Section ends: 
        * '\n\n' (this could be a general way to end the section) 
        * [data and code] Declaration of Competing Interest
        * [data and code] Acknowledgements
        * [data and code] Credit authorship contribution statement
* Change pattern containing numbers (e.g., '3' is now '3. ')
<br>
I investigate the next ten DOIs in downloadedPDFs_info.json to see exactly what text sections were extracted.

In [7]:
section_patterns_v2 = [
    (["Data and Code Availability", "Data Availability", "Data/code availability"], ["3. ", "CRediT authorship contribution statement", "Acknowledgements", "References", "Declaration of Competing Interests", "Credit authorship contribution statement", "\n\n"]),
    (["2.1."], ["2.2."]),
    (["Resource", "3.1."], ["3.2."]),
    ([r"Fig.\d+", r"Fig.\d+\.?", r"Figure \d+"], [r"https?://[^\s]+"]),
    ([r"Tab.\d+", r"Table \d+\.?"], [r"https?://[^\s]+", r"[\w\s-]+\d{4}"]),
    (["Introduction", "1. "], ["2. "]),
    (["Abstract"], ["1. ", "Introduction"])
]

In [8]:
# Empty list to store individual results
results_list_v2 = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    # Get ten more DOIs (indices 11 to 20; index 10 is skipped)
    first_10_dois = doi_data['DOIs'][11:21]

    for doi in first_10_dois:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content function for each DOI
        section_content = get_content(pdf_path, alternative_pdf_directory, section_patterns_v2)

        # Create a dictionary for each result and add it to the list
        results_list_v2.append({"DOI": doi, "Section": section_content})

# Convert the list of dictionaries to a DataFrame
results_df2 = pd.DataFrame(results_list_v2)

../Data/ElsevierAPI/downloaded_pdfs/fulltext_editorialboard_doi/10.1016.S1053-8119(22)00043-X.pdf


In [9]:
results_df2

Unnamed: 0,DOI,Section
0,10.1016/j.neuroimage.2022.118931,Data availability \nThe Matlab code for the p...
1,10.1016/j.neuroimage.2022.119447,2.1. Participants \nThirty-ﬁve people (20 fe...
2,10.1016/j.neuroimage.2022.119403,Data and code availability statement \nData ...
3,10.1016/j.neuroimage.2021.118831,Data and code availability statements \nThe ...
4,10.1016/j.neuroimage.2022.119308,Data and code availability statement \nFinni...
5,10.1016/S1053-8119(22)00043-X,Editorial board
6,10.1016/j.neuroimage.2021.118792,data availability \nThe Shen 268 atlas is ava...
7,10.1016/j.neuroimage.2022.118890,data availability \nThe code used to run the ...
8,10.1016/j.neuroimage.2022.119339,Data and code availability statement \nThe h...
9,10.1016/j.neuroimage.2022.119295,Introduction \nReal-time functional magneti...


*In the following description, I refer to the index of the articles in results_df2.*

Observations from the text sections extracted with section_patterns_v2: 
- In 0, there are links, but these are not to the dataset - they write "The used data can be shared with other researchers upon reasonable request." 
- In 0 and 7, the next section is called 'Supplementary materials' - which means that my attempt at \n\n did not work.  
- In 2, the only mention of data was picked up in section 2.1.
- In 2, the 'Declaration  of Competing  Interest' was not picked up - it looks like it's because there are double spaces between the words. 
- In 3, the 'Credit authorship  contribution  statement' was not picked up - double spaces?
- In 6, the data section is called 'Code and data availability' - but it was picked up by 'data availability'. 
- In 6, there are multiple links mentioned - one for data (an atlas), one for the code, and one for the data. 
    - NB! When copying the URL for the data, it is broken up by the formatting: https://www.humanconnectome.org/study/hcp-young-adult/ document/1200-subjects-data-release - this is also the case for the atlas. 
- In 7 and 8, there are spaces in the URL. 
- In 8, the following section 'Declaration of Competing Interest' was not picked up. 
- In 9, the introduction was picked up: but it does not look like any data is analysed in this article. 

<br>
From this investigation I can see that I need to edit the section patterns further. 

<br>
<br>
TO DO: 

- Section_patterns that do not work: 
    - Section_end: \n\n
    - Section_end: 'Declaration  of Competing  Interest' + Section_end: 'Credit authorship  contribution  statement' + Section_end: 'Supplementary materials'
        - double-spaces between words mess these up
- Section_patterns I'm worried about: 
    - Section_start: 2.1. - what if it's '2.1'?
    - Section_start: 'Code and data availability'
- Undiscovered section_patterns: 
    - Section_end: 'Ethics statement'
- Worries 
    - How to get the name of the dataset itself and the url
        - The URL can be broken up by spaces (due to line changes in the pdf) - can I find a way to find out which is the entire URL? 
            - Are there any slashes in the text ahead? A parenthesis, dot, comma, or another symbol might end the URL. 
    - If someone uses e.g., HCP, do they use all of the data? Do I need to catch more text-sections to learn this (in relation to the discussion of significance testing - if they use different parts of the dataset, they are not testing on the same). 
        - "Due to HCP and dHCP privacy policies, the preprocessed resting-state images of human adults and neonates (with their IDs) can only be shared upon request with qualified investigators who agree to the Restricted Data Use Terms of these two datasets." (from 10.1016/j.neuroimage.2022.119339)
    - What if the article does not analyse any data? (e.g., 10.1016/j.neuroimage.2022.119295 presents a software package for the execution of RT-fMRI experiments.) 
- FUNCTION GET_CONTENT: Make a comment about trying the "Editorial board" texts in the other file - just so I don't get an "Error reading PDF:" 
    - Make an addition to 'get_section' where it says 'Editorial board' instead of None for the section text. 
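
The worry about URLs that are broken up by spaces could be explored with a heuristic like the one sketched below. This is only an assumption about what might work, not a tested solution: a URL that ends in '/' or '-' is assumed to continue if the next token contains a slash, and trailing punctuation is assumed to end the URL. The names `URL_RE`, `CONTINUATION_RE` and `rejoin_broken_urls` are hypothetical.

```python
import re

URL_RE = re.compile(r'https?://\S+')
# A continuation is whitespace followed by a token that contains a slash
CONTINUATION_RE = re.compile(r'\s+([\w.~%-]*/[\w./~%-]*)')

def rejoin_broken_urls(text):
    """Return the URLs in text, gluing on path fragments that follow a break."""
    urls = []
    for match in URL_RE.finditer(text):
        url, pos = match.group(0), match.end()
        # Only URLs ending in '/' or '-' are assumed to continue
        while url.endswith(('/', '-')):
            continuation = CONTINUATION_RE.match(text, pos)
            if continuation is None:
                break
            url += continuation.group(1)
            pos = continuation.end()
        # A parenthesis, dot, comma, or semicolon is assumed to end the URL
        urls.append(url.rstrip('.,;)'))
    return urls

text = ("the data: https://www.humanconnectome.org/study/hcp-young-adult/ "
        "document/1200-subjects-data-release - this is also the case")
rejoined = rejoin_broken_urls(text)
print(rejoined)
```

The heuristic will produce false positives, for example when a URL ending in '/' is followed by a word such as 'and/or', so the rejoined URLs would need to be checked against the groundtruth.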


### 1.1.3. Section patterns v3 

I want to make a regex_pattern work, as it seems like a double space after 

TO DO: 
- Section_patterns that do not work: 
    - Section_end: \n\n
    - Section_end: 'Declaration  of Competing  Interest' + Section_end: 'Credit authorship  contribution  statement' + Section_end: 'Supplementary materials'
        - double-spaces between words messed these up. 
- Section_patterns I'm worried about: 
    - Section_start: '2.1.' - what if it's '2.1'?
    - Section_start: 'Code and data availability'
- Undiscovered section_patterns: 
    - Section_end: 'Ethics statement'
- Make the titles case sensitive. Most titles seem to capitalize only the first word (see the investigation in ../Code/articles_groundtruth.ipynb under 'Ground truth/Investigation/Section titles').
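
Two of the items above (double spaces inside titles, and '2.1.' versus '2.1') can be illustrated with a minimal sketch. The names `tolerant_title` and `section_number` and the example strings are made up for illustration; they are simplified and are not the patterns used in the cells below.

```python
import re

# \s+ between the words accepts one or more spaces, so double spaces no longer break the title
tolerant_title = re.compile(r'Declaration\s+of\s+Competing\s+Interests?')

# \.? makes the trailing dot optional; the lookahead (?=\s) stops '2.1' from matching '2.10'
section_number = re.compile(r'(?m)^2\.1\.?(?=\s)')

extracted_title = 'Declaration  of Competing  Interest'

# A plain substring search fails on the double spaces, the tolerant pattern does not
print('Declaration of Competing Interest' in extracted_title)
print(bool(tolerant_title.search(extracted_title)))

for heading in ['\n2.1 Participants', '\n2.1. Participants', '\n2.10 Participants']:
    print(repr(heading), bool(section_number.search(heading)))
```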


In [10]:
def get_content_regex(pdf_path, alt_pdf_path, section_patterns):
    """Read a PDF and extract the first section that matches one of the section patterns.

    Returns a tuple: (section text, matched start pattern, start match, end match).
    alt_pdf_path is kept so that existing calls keep working.
    """
    try:
        # The context manager closes the file even if reading fails
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = pypdf.PdfReader(pdf_file)
            # Read the entire PDF content
            pdf_text = " ".join(page.extract_text() for page in pdf_reader.pages)

        # Extract sections using the provided section patterns
        return get_section_regex(pdf_text, section_patterns)

    except FileNotFoundError:
        # PDFs missing from the main directory are treated as 'Editorial board' documents,
        # whether or not they exist in the alternative directory
        return 'Editorial board', '', '', ''

    except Exception as e:
        print(f"Error reading PDF: {e}")
        # Return empty values so the caller can still unpack four items
        return '', '', '', ''


def get_section_regex(article, section_patterns):
    """This function extracts a text section from an article based on provided regex patterns.

    Parameters:
    :param article (str): The text of the article.
    :param section_patterns (list of tuple): A list of tuples containing start and end regex patterns.

    Returns:
    :returns tuple: A tuple containing the extracted section text, the start pattern that matched,
               the match object of the start pattern, and the match object of the end pattern.
               If no section is found, it returns ('', '', '', '').
    """
    # Iterate through each pattern pair, in order of priority
    for start_pattern, end_pattern in section_patterns:
        # Iterate through each match of the start pattern in the article
        for start_match in re.finditer(start_pattern, article):
            start_idx = start_match.start()  # Get the start position of the start match

            # Search for the end pattern from the start position of the start match,
            # so the span of the end match is relative to start_idx
            end_match = re.search(end_pattern, article[start_idx:])

            if end_match:
                end_idx = start_idx + end_match.start()  # Calculate the end position of the section
                section_text = article[start_idx:end_idx].strip()  # Extract the section text

                # Return the section text, the matched start pattern, and both match objects
                return section_text, start_pattern, start_match, end_match

    # If no section is found, return empty strings
    return '', '', '', ''

In [11]:
section_patterns_regex = [
    (r'(?<![\'"]) \s*?\n?Data\s+and\s+code\s+availability |(?<![\'"]) \s*?\n?Data\s+availability |(?<![\'"]) \s*?\n?Data/code\s+availability', r'\s*?\n\n |\s*?\n?3\. | \s*?\n?CRediT\s+authorship\s+contribution\s+statement(?:s)? | \s*?\n?Acknowledgement(?:s)? | \s*?\n?Acknowledgment(?:s)? | \s*?\n?Reference(?:s)? | \s*?\n?Declaration\s+of\s+Competing\s+Interest(?:s)? | \s*?\n?Credit\s+authorship\s+contribution\s+statement(?:s)? | \s*?\n?Funding | \s*?\n?Supplementary\s+materials | \s*?\n?Ethic(?:s)? statement(?:s)?'),
    (r'\n?2\. | \n?2\.1\.', r'\s*?\n?3\.\s*?\n?'),
    # (r'\n?Resource | \n?3\.1\.\s*?\n?', r'\n?3\.2.\s*?| \s*?\n\n '),
    (r'\n?Introduction\s*?\n? | \s*?\n?1\.\s*?\n? ', r'\s*?\n?2\.\s*?\n? | \s*?\n\n '),
    (r'\n?Abstract\s*?\n? | \s*?\n?1\.\s*?\n? ', r'\n?Introduction\s*?\n? | \s*?\n\n '),
    (r'\n?Fig\.\d+ | \n?Fig\.\d+\.? | \n?Figure \d+', r'https?://[^\s]+ | \s*?\n\n '),
    (r'\n?Tab\.\d+ | \n?Table \d+\.?', r'https?://[^\s]+ | [\w\s-]+\d{4} | \s*?\n\n ')
]

In [12]:
# Empty list to store individual results
results_list_regex = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    # Get the first X DOIs
    first_dois = doi_data['DOIs'][11:21] # 0:11 to compare with results_df, 11:21 to compare with results_df2

    for doi in first_dois:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content_regex function for each DOI
        section_content_regex, matched_pattern, start_match, end_match = get_content_regex(pdf_path, alternative_pdf_directory, section_patterns_regex)

        # Create a dictionary for each result and add it to the list
        results_list_regex.append({"DOI": doi, "Section": section_content_regex, "Matched_pattern": matched_pattern, "Start_pattern": start_match, "End_pattern": end_match})

# Convert the list of dictionaries to a DataFrame
results_df_regex = pd.DataFrame(results_list_regex)

results_df: 0-10, i.e., [0:11]

results_df2: 11-20, i.e., [11:21]

In [13]:
results_df2['Section'].loc[0]

'Data availability  \nThe Matlab code for the proposed  vein segmentation  algorithm  \nis available  on github: https://github.com/SinaStraub/GRE  _ vessel _ \nseg.git and example  data on Zenodo.org:  https://doi.org/10.  \n5281/zenodo.5791233  \nThe used data can be shared with other researchers  upon reasonable  \nrequest.  \nSupplementary  materials  \nSupplementary  material  associated  with this article can be found, in \nthe online version,  at doi:10.1016/j.neuroimage.2022.118931  . \nAppendices  \nAll equations  are to be understood  voxel-wise,  however,  spatial co- \nordinates  are omitted  when possible.  \nA. True susceptibility-weighted  images \nIn contrast  to susceptibility-weighted  data, true susceptibility-  \nweighted  images (tSWI) are generated  using susceptibility  masks 𝑊 in- \nstead of phase masks ( Liu et al., 2014 ), \n𝑡𝑆𝑊 𝐼 = 𝑚𝑎𝑔 ⋅𝑊 𝑛 , where 𝑊 = ⎧ \n⎪ \n⎨ \n⎪ ⎩ 1 , 𝑓𝑜𝑟 𝜒≤ 𝜒1 , \n1 − 𝜒− 𝜒1 \n𝜒2 − 𝜒1 , 𝑓𝑜𝑟 𝜒1 < 𝜒≤ 𝜒2 , \n0 , 𝑓𝑜𝑟 𝜒> χ2 , (A.1) \nwhere [ 𝜒

In [14]:
results_df_regex['Section'].loc[0]

'Data availability  \nThe Matlab code for the proposed  vein segmentation  algorithm  \nis available  on github: https://github.com/SinaStraub/GRE  _ vessel _ \nseg.git and example  data on Zenodo.org:  https://doi.org/10.  \n5281/zenodo.5791233  \nThe used data can be shared with other researchers  upon reasonable  \nrequest.'

**Fixed issues**: 
- Edited the get_content_regex function to be case sensitive instead of case insensitive 
    - With case-insensitive (all lowercase) matching, the section in results_df2['Section'].loc[2] was cut short:
        - From 'Data and code availability  statement  \nData used in the study are available  upon direct request.  Conditions  \nfor its sharing  involve  the formalisation  of a research  agreement.  The \ndata and code sharing  adopted  by the authors  comply  with the require-  \nments of the funding  body or institute,  and with the institutional  ethics \napproval.  Parts of the data are conﬁdential  and additional  ethical ap- \nproval may be needed  for re-use. \n'
        - To: 'Data and code availability  statement  \nData used in the study are available  upon direct request.  Conditions  \nfor its sharing  involve  the formalisation  of a research  agreement.  The \ndata and code sharing  adopted  by the authors  comply  with the require-  \nments of the'
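
The cut above stops right before the word 'funding', so the lowercase search presumably matched the 'Funding' end pattern in the middle of the sentence. A minimal sketch of that effect follows; the example text is shortened, single-spaced, and the placeholder after the 'Funding' title is made up.

```python
import re

# Shortened excerpt; the text after the 'Funding' title is a placeholder, not from the article
text = ("Data and code availability statement \nThe data and code sharing adopted by the "
        "authors comply with the requirements of the funding body or institute. "
        "\nFunding \n(funding statement)")

end_pattern = r'\s*?\n?Funding'

sensitive = re.search(end_pattern, text)
insensitive = re.search(end_pattern, text, flags=re.IGNORECASE)

# Case-sensitive matching keeps the whole statement
print(repr(text[:sensitive.start()]))
# Case-insensitive matching stops at the word 'funding' inside the sentence
print(repr(text[:insensitive.start()]))
```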
        
**Persisting issues**: 
- Reading the PDF 
    - At page breaks, the page header is picked up (results_df_regex['Section'].loc[6], DOI 10.1016/j.neuroimage.2022.118986)
    - Double (or more) spaces
    - \n characters 
- Section titles 
    - There are variations of section_start titles that I have not included in my pattern, e.g., "Data Availability", which I discovered in articles_groundtruth
    - There are many undiscovered section_end titles that I have not included in my pattern. 

NB! THE CELL BELOW TAKES MORE THAN AN HOUR TO RUN!
Started at 16:09. It had not finished when I checked after 17:40, and it was done when I checked again at 18:15. 

In [15]:
# Empty list to store individual results
results = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    for doi in doi_data['DOIs']:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content_regex function for each DOI 
        section_content_regex, matched_pattern, start_match, end_match = get_content_regex(pdf_path, alternative_pdf_directory, section_patterns_regex)

        # Create a dictionary for each result and add it to the list
        results.append({"DOI": doi, "Section": section_content_regex, "Matched_pattern": matched_pattern, "Start_pattern": start_match, "End_pattern": end_match})

# Convert the list of dictionaries to a DataFrame
articles_dataset_sections = pd.DataFrame(results)

NB! The code above takes between one and two hours to run. 
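
One way to avoid repeating this wait when only the regex patterns change is to cache the extracted text of each PDF on disk. The sketch below assumes that most of the run time is spent extracting text from the PDFs, which I have not measured. The helper `cached_pdf_text`, the cache directory, and the `extract_text` argument (any function that maps a PDF path to a string) are hypothetical names.

```python
import os

def cached_pdf_text(pdf_path, cache_dir, extract_text):
    """Return the text of a PDF, reusing a cached .txt copy when one exists.

    extract_text is any function that takes a PDF path and returns its text.
    """
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, os.path.basename(pdf_path) + '.txt')

    # Reuse the cached text if this PDF was read before
    if os.path.exists(cache_path):
        with open(cache_path, 'r', encoding='utf-8') as cache_file:
            return cache_file.read()

    # Otherwise extract the text once and store it for later runs
    text = extract_text(pdf_path)
    with open(cache_path, 'w', encoding='utf-8') as cache_file:
        cache_file.write(text)
    return text
```

With the text cached, a changed section pattern only needs a new regex pass over the stored strings.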

In [16]:
# articles_dataset_sections

In [17]:
# Define the path to the 'Code-git/Data' directory
data_dir = os.path.join(os.pardir, 'Data')

# Define the file path
file_path = os.path.join(data_dir, 'articles_dataset_sections.csv')

# Save the DataFrame to CSV, overwriting the file if it exists
articles_dataset_sections.to_csv(file_path, index=False, mode='w')  

<a name='preprocessingtextsections'></a>
## 1.2. Preprocessing text sections
Before I continue to the extraction of the datasets from the text sections, I want to clean the current data a bit. This includes: 
- Clean the matching start patterns 
- Clean the extracted text sections, including 
    - Remove characters like '\n' 
    - Remove double (or more) spaces 

In [18]:
# Path to the CSV file
csv_file_path = os.path.join(os.pardir, 'Data/articles_dataset_sections.csv') 

# Read the CSV file into a DataFrame
articles_dataset_sections = pd.read_csv(csv_file_path)

In [19]:
articles_dataset_sections

Unnamed: 0,DOI,Section,Matched_pattern,Start_pattern,End_pattern
0,10.1016/j.neuroimage.2022.119451,Data and code availability statements \nSpec...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(25302, 25330), match='...","<re.Match object; span=(705, 709), match=' 3. '>"
1,10.1016/j.neuroimage.2022.119632,Data and code availability \nThe data incorpo...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(39700, 39730), match='...","<re.Match object; span=(689, 727), match=' \nD..."
2,10.1016/j.neuroimage.2022.119584,Data/code availability statement \nData and...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(26185, 26209), match='...","<re.Match object; span=(80, 86), match=' \n3. '>"
3,10.1016/j.neuroimage.2022.119550,Data and code availability \nAll data used in...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(104602, 104632), match...","<re.Match object; span=(479, 518), match=' \n..."
4,10.1016/j.neuroimage.2022.119710,Data and code availability statement \nAll i...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(70621, 70650), match='...","<re.Match object; span=(596, 641), match=' \nC..."
...,...,...,...,...,...
829,10.1016/j.neuroimage.2022.118922,Data and code availability statement \nThe b...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(48329, 48358), match='...","<re.Match object; span=(328, 373), match=' \nC..."
830,10.1016/j.neuroimage.2022.119713,"Data availability \nROI time series, along wi...","(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(58431, 58452), match='...","<re.Match object; span=(468, 507), match=' \n..."
831,10.1016/j.neuroimage.2022.119688,2. \n1053-8119/©2022 The Authors. Published ...,\n?2\. | \n?2\.1\.,"<re.Match object; span=(5336, 5339), match='2. '>","<re.Match object; span=(6586, 6588), match='3.'>"
832,10.1016/j.neuroimage.2022.118939,Data and code availability \nDe-identiﬁed da...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(50360, 50390), match='...","<re.Match object; span=(125, 152), match=' \nS..."


<a name='startpatterns'></a>
### 1.2.1. Start patterns 

In [20]:
def extract_matched_text(text):
    """This function extracts the matched text from the string representation of a regular 
    expression match object and cleans it.

    Parameters:
    :param text (str): A string containing a regular expression match object (e.g., "<re.Match object; span=(start, end), match='text'>").

    Returns:
    :returns: If a match is found in the input text, the function returns the matched text after performing the following operations:
        Removing literal '\\n' (newline) sequences.
        Collapsing runs of two or more spaces into a single space.
        Stripping leading and trailing spaces.
    :returns: If no match is found or the resulting matched text is empty, the function returns NaN.
    """
    match = re.search(r"match='(.*?)'", str(text))
    if not match:
        return np.nan

    # In the string representation, a newline appears as the two characters '\' and 'n'
    matched_text = match.group(1).replace('\\n', '')
    # Chained .replace('  ', ' ') calls leave double spaces behind in longer runs of spaces
    matched_text = re.sub(r' {2,}', ' ', matched_text).strip()
    return matched_text if matched_text else np.nan

In [21]:
# Apply the function to clean up the 'Start_pattern' column
articles_dataset_sections['Start_pattern_clean'] = articles_dataset_sections['Start_pattern'].apply(extract_matched_text)

Overview of how many articles match each of the section patterns. 

In [22]:
# Group by 'Matched_pattern' and count the number of rows in each group
pattern_counts = articles_dataset_sections['Matched_pattern'].value_counts()

# Count NaN values and add it to the pattern_counts Series
nan_count = articles_dataset_sections['Matched_pattern'].isna().sum()
pattern_counts['NaN'] = nan_count

# Create a DataFrame to store the results
articles_section_patterns = pd.DataFrame({
    'Matched_pattern': pattern_counts.index,
    'Count': pattern_counts.values
})

# Print the result DataFrame
print(articles_section_patterns)

# Calculate and print the total count
total_count = articles_section_patterns['Count'].sum()
print("Total Count:", total_count)

                                     Matched_pattern  Count
0  (?<![\'"]) \s*?\n?Data\s+and\s+code\s+availabi...    563
1                                 \n?2\. | \n?2\.1\.    248
2       \n?Fig\.\d+ | \n?Fig\.\d+\.? | \n?Figure \d+      1
3        \n?Introduction\s*?\n? | \s*?\n?1\.\s*?\n?       1
4                                                NaN     21
Total Count: 834


I was only expecting to see 19 articles with NaN as a matched pattern (since there are 19 editorial board papers). 

In [23]:
# Filter and display rows where 'Start_pattern_clean' is None
no_pattern = articles_dataset_sections[articles_dataset_sections['Start_pattern_clean'].isna()]
len(no_pattern)

21

In [24]:
# Filter rows where 'Section' is not 'Editorial board'
no_pattern[no_pattern['Section'] != 'Editorial board']

Unnamed: 0,DOI,Section,Matched_pattern,Start_pattern,End_pattern,Start_pattern_clean
517,10.1016/j.neuroimage.2022.119154,,,,,
670,10.1016/j.neuroimage.2022.118921,,,,,


There should only be 19 articles where there is no pattern-match, as there are 19 'Editorial Board' articles. The articles that were not filtered properly by my code are: 
- 10.1016/j.neuroimage.2022.119560
    - This has a section called 'Data Availability'
- 10.1016/j.neuroimage.2021.118776
    - This article does not have any distinct sections. It presents all the articles in the particular volume of NeuroImage. 
- 10.1016/j.neuroimage.2022.119154
    - This article does not have any distinct sections. It is a commentary.     
- 10.1016/j.neuroimage.2022.118921
    - This article does not have any distinct sections. It is a corrigendum. 
<br>
<br>

Of the four articles that did not contain one of my start patterns, only one should have been picked up. The rest seem to have been properly filtered. 
<br>
<br>

### 1.2.2. Clean text 
I will do a very simple initial cleaning of the extracted text sections: 
- Replace multiple spaces with a single space
    - [.replace('   ', ' ').replace('  ', ' ')]
- Remove all \n characters 
    - [.replace('\n', '')]
- Remove spaces next to the following characters: -, (, ), /, ., and _, and between : and / 
    - [.replace('- ', '-').replace('( ', '(').replace(' )', ')').replace('/ ', '/').replace(' /', '/').replace(' .', '.').replace(': /', ':/').replace(' _ ', '_').replace(' _', '_').replace('_ ', '_')] 
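
One caveat with the chained replacements for multiple spaces: a run of five or more spaces is not fully collapsed. A regex-based alternative could look like the sketch below; `collapse_spaces` is a hypothetical helper and is not used in the cells that follow.

```python
import re

def collapse_spaces(text):
    """Collapse any run of two or more spaces into a single space."""
    return re.sub(r' {2,}', ' ', text)

sample = 'Data' + ' ' * 5 + 'availability'

# The chained replacements leave a double space behind
chained = sample.replace('   ', ' ').replace('  ', ' ')
print(repr(chained))

# The regex collapses the whole run in one pass
print(repr(collapse_spaces(sample)))
```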


In [25]:
articles_dataset_sections['Section']

0      Data and code availability  statements  \nSpec...
1      Data and code availability  \nThe data incorpo...
2      Data/code  availability  statement  \nData and...
3      Data and code availability  \nAll data used in...
4      Data and code availability  statement  \nAll i...
                             ...                        
829    Data and code availability  statement  \nThe b...
830    Data availability  \nROI time series, along wi...
831    2. \n1053-8119/©2022  The Authors.  Published ...
832    Data and code availability  \nDe-identiﬁed  da...
833    2. \n1053-8119/©2022  The Authors.  Published ...
Name: Section, Length: 834, dtype: object

In [26]:
# Clean every section in one pass and assign the whole column at once (avoids chained .loc assignment in a loop)
articles_dataset_sections['Section'] = articles_dataset_sections['Section'].astype(str).apply(
    lambda s: s.replace('   ', ' ').replace('  ', ' ').replace('\n', '').replace('- ', '-').replace('( ', '(').replace(' )', ')').replace('/ ', '/').replace(' /', '/').replace(' .', '.').replace(': /', ':/').replace(' _ ', '_').replace(' _', '_').replace('_ ', '_')
)

In [27]:
articles_dataset_sections['Section']

0      Data and code availability statements Speciﬁca...
1      Data and code availability The data incorporat...
2      Data/code availability statement Data and code...
3      Data and code availability All data used in th...
4      Data and code availability statement All indiv...
                             ...                        
829    Data and code availability statement The brain...
830    Data availability ROI time series, along with ...
831    2. 1053-8119/©2022 The Authors. Published by E...
832    Data and code availability De-identiﬁed data a...
833    2. 1053-8119/©2022 The Authors. Published by E...
Name: Section, Length: 834, dtype: object

# NB! PROBLEM WITH LINKS

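After cleaning, the links in one section read: osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/and osf.io/bndjg/. The word 'and' is NOT part of the link. It is most likely glued on because the cleaning step removes the space after a slash (.replace('/ ', '/')). 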

In [28]:
row = articles_dataset_sections[articles_dataset_sections['DOI'] == '10.1016/j.neuroimage.2022.119443']

In [29]:
row['Section'].values

array(['Data and code availability statement EEG datasets used to create the ﬁgure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/and osf.io/bndjg/. Code used to reproduce the plots in Fig. 1 , as well as averaged ERP data, is available from osf.io/guwnm/.'],
      dtype=object)
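
A possible way to avoid gluing 'and' onto a link is to skip the slash replacement when the next word is a conjunction. This is a minimal sketch under that assumption; `join_after_slash` is a hypothetical helper, the second example string is made up, and the rule would wrongly keep the space in a URL whose next path segment is literally 'and' or 'or'.

```python
import re

def join_after_slash(text):
    """Remove a space after '/', unless the next word is 'and' or 'or'."""
    return re.sub(r'/ (?!(?:and|or)\b)', '/', text)

# The space before 'and' is kept, so the link ends at the slash
print(join_after_slash('osf.io/thsqg/ and osf.io/bndjg/.'))

# A space inside a broken URL is still removed
print(join_after_slash('https://doi.org/ 10.5281/zenodo.5791233'))
```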

Additionally, I want to remove the section titles from the text, as they can cause issues with the code I will be writing for extracting the datasets. 

In [30]:
articles_dataset_sections[['Section', 'Start_pattern_clean']]

Unnamed: 0,Section,Start_pattern_clean
0,Data and code availability statements Speciﬁca...,Data and code availability
1,Data and code availability The data incorporat...,Data and code availability
2,Data/code availability statement Data and code...,Data/code availability
3,Data and code availability All data used in th...,Data and code availability
4,Data and code availability statement All indiv...,Data and code availability
...,...,...
829,Data and code availability statement The brain...,Data and code availability
830,"Data availability ROI time series, along with ...",Data availability
831,2. 1053-8119/©2022 The Authors. Published by E...,2.
832,Data and code availability De-identiﬁed data a...,Data and code availability


In [31]:
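def remove_starting_pattern(row):
    """This function removes the matching start_pattern text from the extracted section texts. 
    E.g., if the start pattern is 'Data and code availability', and the extracted section text 
    is 'Data and code availability The data incorporat...', the returned clean_text will be 
    'The data incorporat...'. 
    """
    section = row['Section']
    start_pattern = row['Start_pattern_clean']

    # Rows without a start pattern (NaN) are returned unchanged;
    # str(NaN) is 'nan', which would otherwise be removed from words like 'resonance'
    if pd.isna(start_pattern):
        return section

    # Remove only the first occurrence, so the same words later in the text are kept
    return section.replace(str(start_pattern), '', 1)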

In [32]:
# Apply the function to each row
articles_dataset_sections['Section_wo_pattern'] = articles_dataset_sections.apply(remove_starting_pattern, axis=1)

In [33]:
articles_dataset_sections[['Start_pattern_clean', 'Section', 'Section_wo_pattern']]

Unnamed: 0,Start_pattern_clean,Section,Section_wo_pattern
0,Data and code availability,Data and code availability statements Speciﬁca...,"statements Speciﬁcally, GES, PC and LiNGAM we..."
1,Data and code availability,Data and code availability The data incorporat...,The data incorporated in the primary analysis...
2,Data/code availability,Data/code availability statement Data and code...,statement Data and code are available upon re...
3,Data and code availability,Data and code availability All data used in th...,All data used in this project is from the Hum...
4,Data and code availability,Data and code availability statement All indiv...,statement All individual-level raw data used ...
...,...,...,...
829,Data and code availability,Data and code availability statement The brain...,statement The brain MR data was obtained from...
830,Data availability,"Data availability ROI time series, along with ...","ROI time series, along with the underlying MA..."
831,2.,2. 1053-8119/©2022 The Authors. Published by E...,1053-8119/©2022 The Authors. Published by Els...
832,Data and code availability,Data and code availability De-identiﬁed data a...,De-identiﬁed data and custom-built MATLAB cod...


<a name='getdatasets'></a>
## 1.3. Get datasets
I need to extract the datasets from the text sections we extracted above. 

Based on my previous observations, I will start the extraction with the following notions in mind: 
- Not open access datasets (meaning either fully private or available upon request) 
    - Markers include words such as "request", "no data", "new data", and "not be shared". E.g., 
        - "Data and code are available upon request."
        - "Data and code availability statement All individual-level raw data used in this study cannot be shared because of the ethical code of Tokyo Metropolitan University. How-ever, the acquired metadata (e.g., group level activation maps) are available upon request. The corresponding author should be contacted by email for all data requests."
        - "No data were acquired for this study."
        - "The review summarizes data but does not contain new data."
        - "The data and code presented here are available upon request to the corresponding author."
- Open access datasets (meaning it's available to everyone with a link or title of the dataset)
    - Markers include hyperlinks and capitalized words 
        - Hyperlink 
        - Capitalized words 
    - Words like "code", "data", or "package" are typically featured in the sentences with links, pointing to what the link refers to. 
    
- Issues (**code**)
    - The URL can be broken up by spaces due to line changes in the PDF. Do we stop at the parenthesis, comma or another symbol that might end the URL? 
        - EXAMPLES 
    - Not all links point to the dataset - some are to the code, e.g., 
        - "Speciﬁcally, GES, PC and LiNGAM were implemented using the widely used R package pcalg , which is available at https://cran.r-project.org/web/packages/pcalg/. Notears method was implemented using Python available at https://github.com/xunzheng/notears . The proposed joint DAG method was implemented with Python and the code is available at https://github.com/gmeng92/joint-notears . The cohort data is accessible through the website (https://coins.trendscenter.org/) of COINS (COllaborative Infor-matics Neuroimaging Suite) database (Scott et al., 2011)."
        - "Data and code availability The data incorporated in the primary analysis were gathered from the public UK Biobank resource and will be made pub-licly available together with the code used to generate the data through the UK Biobank Returns Catalogue (https://biobank.ndph. ox.ac.uk/showcase/docs.cgi?id = 1). ABCD study data release 3.0 is available for approved researchers in NIMH Data Archive (NDA DOI:10.151.54/1,519,007). Code for conducting discovery and replication is available at https: //github.com/robloughnan/MOSTest _ generalization . Code for simu-lations is available at https://github.com/precimed/mostest/tree/master/simu."    
    
- Issues (**analysis**)
    - If someone uses e.g., HCP, do they use all of the data? Do I need to catch more text sections to learn this? (This relates to the discussion of significance testing: if they use different parts of the dataset, they are not testing on the same data.) 
        - "Due to HCP and dHCP privacy policies, the preprocessed resting-state images of human adults and neonates (with their IDs) can only be shared upon request with qualified investigators who agree to the Restricted Data Use Terms of these two datasets." (from 10.1016/j.neuroimage.2022.119339)
    - What if the article does not analyse any data? (E.g., 10.1016/j.neuroimage.2022.119295 presents a software package for the execution of RT-fMRI experiments.) 
    - What if there are multiple sections and the text is slightly different? (E.g., 10.1016/j.neuroimage.2022.118986)
<br>
<br>
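
The markers for not open access datasets listed above can be checked with a simple substring search. This is a rough sketch, not the extraction code used below: `access_markers` is a hypothetical helper, and plain substring matching will also hit longer words such as 'requested'.

```python
def access_markers(sentence):
    """Return the 'not open access' markers found in a sentence (case-insensitive)."""
    markers = ("request", "no data", "new data", "not be shared")
    lowered = sentence.lower()
    return [marker for marker in markers if marker in lowered]

examples = [
    "Data and code are available upon request.",
    "No data were acquired for this study.",
    "The review summarizes data but does not contain new data.",
    "All individual-level raw data used in this study cannot be shared.",
    "The code is available at https://github.com/gmeng92/joint-notears .",
]

for sentence in examples:
    print(access_markers(sentence), '<-', sentence)
```

A sentence that returns an empty list has none of these markers and would have to be checked for links or dataset names instead.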

TO DO Columns: 
- (DONE) Section text 
- (DONE) Section pattern, for two reasons: 
    1. I can get a sense of whether the data statement is common in NeuroImage. 
    2. I can go back and handle potentially more difficult cases. 
- Extracted dataset 

Validation dataset: 

In [34]:
# List of groundtruth DOI values to filter 
validation_dois = [
    '10.1016/j.neuroimage.2021.118839',
    '10.1016/j.neuroimage.2021.118854',
    '10.1016/j.neuroimage.2022.119030',
    '10.1016/j.neuroimage.2022.119050',
    '10.1016/j.neuroimage.2022.119240',
    '10.1016/j.neuroimage.2022.119443',
    '10.1016/j.neuroimage.2022.119526',
    '10.1016/j.neuroimage.2022.119549',
    '10.1016/j.neuroimage.2022.119646',
    '10.1016/j.neuroimage.2022.119676',
] 

# Filter rows based on DOI values
validation_set = articles_dataset_sections[articles_dataset_sections['DOI'].isin(validation_dois)]

<a name='availabilitypattern'></a>
### 1.3.1. 'Availability' pattern 

I will start by examining and dealing with the text sections that were filtered by the first section_pattern, namely: 

    r'(?<![\'"]) \s*?\n?Data\s+and\s+code\s+availability |(?<![\'"]) \s*?\n?Data\s+availability |(?<![\'"]) \s*?\n?Data/code\s+availability' 
 
The corresponding ending pattern: 

    r'\s*?\n\n |\s*?\n?3\. | \s*?\n?CRediT\s+authorship\s+contribution\s+statement(?:s)? | \s*?\n?Acknowledgement(?:s)? | \s*?\n?Acknowledgment(?:s)? | \s*?\n?Reference(?:s)? | \s*?\n?Declaration\s+of\s+Competing\s+Interest(?:s)? | \s*?\n?Credit\s+authorship\s+contribution\s+statement(?:s)? | \s*?\n?Funding | \s*?\n?Supplementary\s+materials | \s*?\n?Ethic(?:s)? statement(?:s)?'
 

In [35]:
articles_dataset_sections['Matched_pattern'].loc[0]

'(?<![\\\'"]) \\s*?\\n?Data\\s+and\\s+code\\s+availability |(?<![\\\'"]) \\s*?\\n?Data\\s+availability |(?<![\\\'"]) \\s*?\\n?Data/code\\s+availability'

In [36]:
pat_1 = articles_dataset_sections[articles_dataset_sections['Matched_pattern'] == '(?<![\\\'"]) \\s*?\\n?Data\\s+and\\s+code\\s+availability |(?<![\\\'"]) \\s*?\\n?Data\\s+availability |(?<![\\\'"]) \\s*?\\n?Data/code\\s+availability']

A total of 563 articles have a section where the title matches the pattern. 

In [37]:
pat_1[['Section_wo_pattern', 'Start_pattern_clean']]

Unnamed: 0,Section_wo_pattern,Start_pattern_clean
0,"statements Speciﬁcally, GES, PC and LiNGAM we...",Data and code availability
1,The data incorporated in the primary analysis...,Data and code availability
2,statement Data and code are available upon re...,Data/code availability
3,All data used in this project is from the Hum...,Data and code availability
4,statement All individual-level raw data used ...,Data and code availability
...,...,...
827,The data that support the ﬁndings of this stu...,Data availability
828,statement The underlying raw data to this man...,Data availability
829,statement The brain MR data was obtained from...,Data and code availability
830,"ROI time series, along with the underlying MA...",Data availability


In [38]:
############### SENTENCES ################################################
def split_text_into_sentences(text):
    """This function splits a given text into sentences based on a regular expression pattern. 
    It uses re.split() to split on whitespace that follows sentence-ending punctuation 
    (".", "!", or "?"). It does not split when that whitespace is followed by a number and 
    another whitespace character, e.g., after 'Fig.' in 'Fig. 1 shows'. 
    
    Parameters: 
    :param text(str): The text to split. 
    
    Returns: 
    :return: A list of sentences (str). 
    """
    sentence_pattern = r'(?<=[.!?])\s+(?![0-9]+\s)'
    sentences = re.split(sentence_pattern, text)
    return sentences


############### LINKS ################################################
def extract_links(text):
    """This function identifies and extracts URLs (web links) from a given text using a 
    regular expression pattern. It also cleans and formats the extracted links by 
    removing leading and trailing spaces. The pattern accounts for various URL formats, 
    including those starting with "http://" or "https://," DOI format, and domain names 
    with specific characters, e.g., 'osf.io'.

    Parameters: 
    :param text(str): The text to search for links. 
    
    Returns: 
    :return: links, a list of the extracted links (str)  
    """
    # ORIGINAL 
    #url_pattern = r'''(https?://[^\s)(]+|\bdoi:\s*\d+(?:\.\d+)*(?:/[a-zA-Z0-9\./_\-]+)?|[a-z]+\.[a-z]+[a-zA-Z0-9\./_\-]*)(?:\s*(?:[),]|\.\s*[\r\n]?|,\s*|/and|$))'''
    # NEW 
    
    #url_pattern = r'(?i)(https?://[^\s)(]+(?:/[^\s)(]+)*(?:\s*\([^)]*\))?|osf\.io/[a-z0-9/]+/|doi:\s*10\.\d+/\S+|www\.[a-z0-9.-]+\.[a-z]{2,}/[^\s)(]+)(?:\s*(?:[),]|\.\s*[\r\n]?|,\s*|/and|$))'
    
    url_pattern = r'(?i)(https?://[^\s)(]+(?:/[^\s)(]+)*(?:\(\S+\))?|osf\.io/[a-z0-9/]+|(?:www\.)?[a-z0-9.-]+\.[a-z]{2,}/[^\s)(]+|doi:\s*10\.\d+/\S+)(?:\s*(?:[),]|\.\s*[\r\n]?|,\s*|/and|$))'

    matches = re.findall(url_pattern, text)
    cleaned_links = ["".join(match).strip() for match in matches]
    return cleaned_links


############### CAPITALIZED ################################################
def extract_capitalized_words(text):
    """This function detects and extracts capitalized names that are followed by text in 
    parentheses. The regular expression pattern captures one or more consecutive capitalized 
    words (mixed case, optional hyphens) together with the parenthesized text that follows, 
    e.g., "In this sentence Dataset Example (www.linktodataset.com) would be extracted" 
    returns "Dataset Example (www.linktodataset.com)". Capitalized words that are not followed 
    by parentheses, e.g., 'Human Connectome Project' on its own, are not matched. 
    
    Parameters: 
    :param text(str): The text to search. 
    
    Returns: 
    :return: A list of the matched capitalized names, including their parentheses.  
    """
    capitalized_pattern = r'([A-Z][a-zA-Z\-]+(?:\s+[A-Z][a-zA-Z\-]+)*(?:\s*\(.*?\)))(?=\s*\.|\s|$)'
    return re.findall(capitalized_pattern, text)


############### DATASETS ################################################
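def get_datasets(text):
    """This function collects candidate dataset mentions from a text: capitalized phrases 
    followed by parentheses, links, and sentences that mention "request". 
    
    Parameters: 
    :param text(str): text to search
    
    Returns: 
    :return: two equally long lists (datasets, sentences), or "N/A" if nothing is found
    """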
    # Initialize lists to store extracted datasets and their corresponding sentences
    extracted_datasets = []
    dataset_sentences = []
    
    # Split the text into sentences
    sentences = split_text_into_sentences(text)
    
    # Extract links and capitalized words
    links = extract_links(text)
    capitalized_words = extract_capitalized_words(text)
    
    for sentence in sentences:
        datasets_in_sentence = []
        
        # Check if the sentence contains any capitalized words
        for cap_word in capitalized_words:
            if cap_word in sentence:
                datasets_in_sentence.append(cap_word)
        
        # Check if the sentence contains a link
        for link in links:
            if link in sentence:
                # Check if the link is already captured as a capitalized word in the same sentence
                if not any(link in cap_word for cap_word in capitalized_words):
                    datasets_in_sentence.append(link)
        
        # Check if the sentence contains the word "request"
        if "request" in sentence.lower():
            datasets_in_sentence.append("Request")
        
        if datasets_in_sentence:
            # If any datasets were found in the sentence, add them and the sentence itself
            extracted_datasets.extend(datasets_in_sentence)
            dataset_sentences.extend([sentence] * len(datasets_in_sentence))
    
    # If no dataset was found, return "N/A"
    if not extracted_datasets:
        return "N/A"
    
    #df = pd.DataFrame({'dataset': extracted_datasets, 'dataset_sentence': dataset_sentences})
    #return df

    return extracted_datasets, dataset_sentences


###############
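def extract_and_add_datasets(row, text_column):
    """This function runs get_datasets on the text in one row of a DataFrame and returns 
    one copy of the row per extracted dataset, with the dataset and the sentence it was 
    found in added as new columns. 
    
    Parameters: 
    :param row: row (pandas Series) of the DataFrame 
    :param text_column: name of the column that holds the text to search 
    
    Returns: 
    :return: list of rows, one per extracted dataset 
    """
    result = get_datasets(row[text_column])
    
    if result is None:
        # Return an empty list so that the caller can safely extend() with it
        return []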
    
    if len(result) == 2:
        datasets, sentences = result
    else:
        # Handle the case where get_datasets didn't return the expected two values
        datasets, sentences = ["N/A"], ["N/A"]
    
    rows_list = []
    for dataset, sentence in zip(datasets, sentences):
        new_row = row.copy()
        new_row['dataset'] = dataset
        new_row['dataset_sentence'] = sentence
        rows_list.append(new_row)
    
    return rows_list


    pattern = r'(?i)(https?://[^\s)(]+(?:/[^\s)(]+)*(?:\s*\([^)]*\))?|osf\.io/[a-z0-9/]+/|doi:\s*10\.\d+/\S+|www\.[a-z0-9.-]+\.[a-z]{2,}/[^\s)(]+)(?:\s*(?:[),]|\.\s*[\r\n]?|,\s*|/and|$))'

This is the commented-out variant of url_pattern in extract_links. The active url_pattern differs in the osf.io, www. and parenthesis alternatives. The pattern captures the following formats:
- URLs starting with http:// or https://, including paths and an optional parenthesized part such as (dataset ...).
- osf.io/.../ format links.
- DOIs in the format doi: 10.xxxxx/xxxx.
- URLs starting with www. followed by a domain and a path.

The (?:\s*(?:[),]|\.\s*[\r\n]?|,\s*|/and|$)) part at the end is a non-capturing group. It requires each link to be followed by a closing parenthesis, a comma, a period, '/and', or the end of the text, and it keeps that ending out of the returned link.

The capturing group consists of four alternatives:

    https?://[^\s)(]+(?:/[^\s)(]+)*(?:\s*\([^)]*\))?: Captures HTTP/HTTPS links with paths and optional (dataset ...) parts.
    osf\.io/[a-z0-9/]+/: Captures osf.io/.../ format links.
    doi:\s*10\.\d+/\S+: Captures DOI links.
    www\.[a-z0-9.-]+\.[a-z]{2,}/[^\s)(]+: Captures links starting with www. and followed by domain and path.
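
To see how the alternatives and the trailing group interact, the pattern can be run on a short text. The sketch below copies the pattern verbatim. The sample text is hypothetical and only stitched together from links that occur in the validation set.

```python
import re

# Pattern copied verbatim from the cell above
pattern = r'(?i)(https?://[^\s)(]+(?:/[^\s)(]+)*(?:\s*\([^)]*\))?|osf\.io/[a-z0-9/]+/|doi:\s*10\.\d+/\S+|www\.[a-z0-9.-]+\.[a-z]{2,}/[^\s)(]+)(?:\s*(?:[),]|\.\s*[\r\n]?|,\s*|/and|$))'

# Hypothetical sample text (not taken from a single article)
sample = ("Code is available at https://github.com/jbrown81/gradients. "
          "Data are at osf.io/gazx2/, and doi: 10.5281/zenodo.6110595.")

# re.findall returns only the capturing group, i.e. the link without its ending
links = re.findall(pattern, sample)
print(links)
```

The period after the GitHub link is dropped because it is matched by the trailing group, while the period at the very end of the text stays attached to the DOI: `\S+` is greedy and `$` already satisfies the trailing group.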

<a name='testingavailabilitypattern'></a>
#### 1.3.1.1. Testing 'Availability' pattern

I will test the functions using the groundtruth texts as my validation set. 
Manually extracting the datasets from the ten groundtruth texts should give the following datasets (NB! I do not yet distinguish between links that lead the reader to data and links that lead the reader to code. This will come later): 
<br>
<br>

| DOI                                   | Dataset                                      | Dataset_sentence                                                                                                                                                                                            |
|---------------------------------------|----------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 10.1016/j.neuroimage.2022.119526       | Human Connectome Project (1U54MH091657, PIs Van Essen and Ugurbil)                       | Original data was obtained from the Human Connectome Project (1U54MH091657, PIs Van Essen and Ugurbil) and the Allen Hu-man Brain Atlas (http://human.brain-map.org/).                    |
|                                        | Allen Hu-man Brain Atlas (http://human.brain-map.org/)                                | Original data was obtained from the Human Connectome Project (1U54MH091657, PIs Van Essen and Ugurbil) and the Allen Hu-man Brain Atlas (http://human.brain-map.org/).                    |
|                                        | https://github.com/jbrown81/gradients                                    | All code (latent space derivation, dynamical system modeling, and gene expression corre-lation) and processed data (gradient maps/region weights, gradient timeseries, and region gene expression values) are available at https://github.com/jbrown81/gradients. |
| 10.1016/j.neuroimage.2022.119443       | osf.io/gazx2/                               | statement EEG datasets used to create the ﬁgure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/and osf.io/bndjg/.                  |
|                                        | osf.io/eucqf/                               | statement EEG datasets used to create the ﬁgure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/and osf.io/bndjg/.                  |
|                                        | osf.io/thsqg/                               | statement EEG datasets used to create the ﬁgure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/and osf.io/bndjg/.                  |
|                                        | osf.io/bndjg/                               | statement EEG datasets used to create the ﬁgure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/and osf.io/bndjg/.                  |
|                                        | osf.io/guwnm/                               | Code used to reproduce the plots in Fig. 1 , as well as averaged ERP data, is available from osf.io/guwnm/.                                      |
| 10.1016/j.neuroimage.2022.119240       | Request                                           | statement Data used in this study are available from the corresponding author upon reasonable request.                                                                         |
| 10.1016/j.neuroimage.2022.119050       | zenodo.org (doi: 10.5281/zenodo.6110595) | Raw EEG data from all healthy individuals, as well as Matlab code, are publicly available on zenodo.org (doi: 10.5281/zenodo.6110595).                         |
| 10.1016/j.neuroimage.2021.118854       | Human Connectome Project website (https://www.humanconnectome.org/study/hcp-young-adult/document/extensively-processed-fmri-data-documentation) | The data used in this study was downloaded from the Human Connectome Project website (https://www.humanconnectome.org/study/hcp-young-adult/document/extensively-processed-fmri-data-documentation). |
|                                       | https://github.com/ferreirafabio80/gfa | The GFA models and experiments were implemented in Python 3.9.1 and are available here: https://github.com/ferreirafabio80/gfa.                                          |
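
One way to compare the extraction output with this table is to compute precision and recall per article. The sketch below is a minimal illustration, not part of the original pipeline: the helper names `normalize` and `precision_recall` are hypothetical, and the `extracted` list is an invented example in which one of the five osf.io links is missed.

```python
def normalize(link):
    """Lower-case a link and drop trailing punctuation and slashes."""
    return link.strip().rstrip('.,;)/').lower()


def precision_recall(extracted, expected):
    """Compare two lists of dataset mentions after normalization."""
    extracted_set = {normalize(item) for item in extracted}
    expected_set = {normalize(item) for item in expected}
    true_positives = extracted_set & expected_set
    precision = len(true_positives) / len(extracted_set) if extracted_set else 0.0
    recall = len(true_positives) / len(expected_set) if expected_set else 0.0
    return precision, recall


# Groundtruth for 10.1016/j.neuroimage.2022.119443, taken from the table above
expected = ['osf.io/gazx2/', 'osf.io/eucqf/', 'osf.io/thsqg/',
            'osf.io/bndjg/', 'osf.io/guwnm/']
# Hypothetical extraction result that misses one link
extracted = ['osf.io/gazx2/', 'osf.io/eucqf/', 'osf.io/bndjg/', 'osf.io/guwnm/']

precision, recall = precision_recall(extracted, expected)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Normalizing before comparing keeps a trailing slash or period from counting as a mismatch.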



In [39]:
# Filter rows based on the groundtruth DOI values
validation_set = pat_1[pat_1['DOI'].isin(validation_dois)]

In [40]:
# Initialize an empty list to store the rows
rows_list = []
# Column name to use for text extraction
text_column = 'Section_wo_pattern'

# Iterate through each row of the original DataFrame
for index, row in validation_set.iterrows():
    # Call the custom function to extract datasets and add new rows
    new_rows = extract_and_add_datasets(row, text_column)
    
    # Append the new rows to the list
    rows_list.extend(new_rows)

# Create the final DataFrame from the list of rows
validation_df = pd.DataFrame(rows_list)

In [41]:
validation_df

Unnamed: 0,DOI,Section,Matched_pattern,Start_pattern,End_pattern,Start_pattern_clean,Section_wo_pattern,dataset,dataset_sentence
302,10.1016/j.neuroimage.2022.119526,Data and code availability Original data was o...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(73770, 73799), match='...","<re.Match object; span=(488, 526), match=' \nD...",Data and code availability,Original data was obtained from the Human Con...,"Human Connectome Project (1U54MH091657, PIs Va...",Original data was obtained from the Human Con...
302,10.1016/j.neuroimage.2022.119526,Data and code availability Original data was o...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(73770, 73799), match='...","<re.Match object; span=(488, 526), match=' \nD...",Data and code availability,Original data was obtained from the Human Con...,Allen Hu-man Brain Atlas (http://human.brain-m...,Original data was obtained from the Human Con...
302,10.1016/j.neuroimage.2022.119526,Data and code availability Original data was o...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(73770, 73799), match='...","<re.Match object; span=(488, 526), match=' \nD...",Data and code availability,Original data was obtained from the Human Con...,https://github.com/jbrown81/gradients.,"All code (latent space derivation, dynamical s..."
355,10.1016/j.neuroimage.2022.119443,Data and code availability statement EEG datas...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(22317, 22347), match='...","<re.Match object; span=(312, 331), match=' \nA...",Data and code availability,statement EEG datasets used to create the ﬁgu...,osf.io/gazx2/,statement EEG datasets used to create the ﬁgu...
355,10.1016/j.neuroimage.2022.119443,Data and code availability statement EEG datas...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(22317, 22347), match='...","<re.Match object; span=(312, 331), match=' \nA...",Data and code availability,statement EEG datasets used to create the ﬁgu...,osf.io/eucqf/,statement EEG datasets used to create the ﬁgu...
355,10.1016/j.neuroimage.2022.119443,Data and code availability statement EEG datas...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(22317, 22347), match='...","<re.Match object; span=(312, 331), match=' \nA...",Data and code availability,statement EEG datasets used to create the ﬁgu...,osf.io/thsqg,statement EEG datasets used to create the ﬁgu...
355,10.1016/j.neuroimage.2022.119443,Data and code availability statement EEG datas...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(22317, 22347), match='...","<re.Match object; span=(312, 331), match=' \nA...",Data and code availability,statement EEG datasets used to create the ﬁgu...,osf.io/bndjg/,statement EEG datasets used to create the ﬁgu...
355,10.1016/j.neuroimage.2022.119443,Data and code availability statement EEG datas...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(22317, 22347), match='...","<re.Match object; span=(312, 331), match=' \nA...",Data and code availability,statement EEG datasets used to create the ﬁgu...,osf.io/guwnm/,"Code used to reproduce the plots in Fig. 1 , a..."
437,10.1016/j.neuroimage.2022.119240,Data availability statement Data used in this ...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(59756, 59777), match='...","<re.Match object; span=(130, 169), match=' \n...",Data availability,statement Data used in this study are availab...,Request,statement Data used in this study are availab...
583,10.1016/j.neuroimage.2022.119050,Data and code availability Raw EEG data from a...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(37690, 37718), match='...","<re.Match object; span=(172, 177), match=' \n3...",Data and code availability,"Raw EEG data from all healthy individuals, as...",doi: 10.5281/zenodo.6110595).,"Raw EEG data from all healthy individuals, as..."


<a name='extractingallavailabilitydatasets'></a>
#### 1.3.1.2. Extracting all availability datasets
I will now run the code on all articles that matched with the availability pattern.  

In [42]:
pat_1.columns

Index(['DOI', 'Section', 'Matched_pattern', 'Start_pattern', 'End_pattern',
       'Start_pattern_clean', 'Section_wo_pattern'],
      dtype='object')

In [43]:
# Initialize an empty list to store the rows
rows_list = []
# Column name to use for text extraction
text_column = 'Section_wo_pattern'

# Iterate through each row of the original DataFrame
for index, row in pat_1.iterrows():
    # Call the custom function to extract datasets and add new rows
    new_rows = extract_and_add_datasets(row, text_column)
    
    # Append the new rows to the list
    rows_list.extend(new_rows)

# Create the final DataFrame from the list of rows
articles_datasets = pd.DataFrame(rows_list)

I want to separate the code links from the data links by searching each 'dataset_sentence' for the words 'data' and 'code'. 
- If the sentence contains 'data' (with or without 'code'), the row is saved in articles_dataset.
- If the sentence contains 'code' but not 'data', the row is saved in articles_code. 
- All remaining rows are saved in articles_other. 

In [44]:
print("Links, capitalized words, and other in total: ", len(articles_datasets))

Links, capitalized words, and other in total:  1266


In [45]:
# Patterns for "data" and "code"; str.contains searches for the pattern anywhere in the
# sentence, so "data" also matches "dataset", "database", etc.
data_pattern = r'data'
code_pattern = r'code'

# Create a mask for rows containing variations of "data" (case-insensitive)
data_mask = articles_datasets['dataset_sentence'].str.contains(data_pattern, case=False, regex=True)

# Create a mask for rows containing "code" but not "data"
code_mask = articles_datasets['dataset_sentence'].str.contains(code_pattern, case=False, regex=True) & ~data_mask

# Create a mask for rows that do not fit either of the mentioned masks
other_mask = ~data_mask & ~code_mask

# Separate rows into articles_dataset, articles_code, and articles_other,
# resetting the index of each dataframe
articles_dataset = articles_datasets[data_mask].reset_index(drop=True)
articles_code = articles_datasets[code_mask].reset_index(drop=True)
articles_other = articles_datasets[other_mask].reset_index(drop=True)

# Print the counts for each dataframe
print(f"Articles with 'data' or both 'data' and 'code': {len(articles_dataset)}")
print(f"Articles with 'code' but not 'data': {len(articles_code)}")
print(f"Articles that do not fit either mask: {len(articles_other)}")

Articles with 'data' or both 'data' and 'code': 688
Articles with 'code' but not 'data': 223
Articles that do not fit either mask: 355


Candidate keywords for refining the split: 
- Data: database, dataset, image(s), (neuro)(map(s)), DOI(s), atlas, (freely available)
- Other: tool(kit, box), scripts, results, algorithm, software, package, plugin, function, analysis
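
These keywords could be turned into a simple sentence classifier. The sketch below is only an illustration: the function name `classify_sentence`, the exact regular expressions, and the choice to check the data keywords first (mirroring the precedence of 'data' over 'code' in the masks) are assumptions, not part of the notebook.

```python
import re

# Keyword lists from the notes above, written as regular expressions (assumed spelling variants)
DATA_KEYWORDS = r'\b(?:databases?|datasets?|images?|(?:neuro)?maps?|dois?|atlas(?:es)?|freely available)\b'
OTHER_KEYWORDS = r'\b(?:tool(?:s|kits?|box(?:es)?)?|scripts?|results?|algorithms?|software|packages?|plugins?|functions?|analys[ie]s)\b'


def classify_sentence(sentence):
    """Label a sentence as 'data', 'other' or 'unclassified' based on keywords."""
    if re.search(DATA_KEYWORDS, sentence, flags=re.IGNORECASE):
        return 'data'
    if re.search(OTHER_KEYWORDS, sentence, flags=re.IGNORECASE):
        return 'other'
    return 'unclassified'


examples = [
    'Volumetric PET receptor images can be found on neuromaps.',
    'The toolboxes used in this work are available online.',
    'Available upon reasonable request.',
]
labels = [classify_sentence(sentence) for sentence in examples]
print(labels)
```

The word boundaries (`\b`) keep a keyword such as 'map' from matching inside an unrelated longer word, at the cost of missing compounds that are not listed explicitly.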

Some links are still cut off by the pattern (each pair shows the extracted link followed by its sentence): 
['https://osf',
        '2) are available on the Open Science Framework repository: https://osf.io/95ftn/?view_only = 9a1a085583544c3eac44d1c75870599c.'],
         ['https://www',
        'humanconnectome.org/and https://www.developingconnectome.'],
       ['https://www.developingconnectome',
        'humanconnectome.org/and https://www.developingconnectome.'],
       ['projects.nitrc.org/indi/indiPRIME.html',
        'projects.nitrc.org/indi/indiPRIME.html.'],
         ['https://www',
        ' statement MRI images can be downloaded from HCP website: https://www.'],
         ['https://github',
        'Processing and analysis scripts used in this study are available at: https://github.com/ofgulban/meso-MRI (v1.0.2 saved at https://zenodo.org/record/7210802).'],
        ['https://doi.org/10.5281/zenodo',
        'The interactive web application accompanying Fig. 2 is published at https://doi.org/10.5281/zenodo.6579997 and is hosted at https://representational-dynamics.herokuapp.com/.'],
        ['https://www.lead-dbs',
        ' statements The open source Matlab toolboxes that were used in this study can be obtained from: Lead-DBS: https://www.lead-dbs.org SPM12: http://www.ﬁl.ion.ucl.ac.uk/spm Fieldtrip: http://ﬁeldtriptoolbox.org Custom-written Matlab scripts are available for sharing upon re-quest.'],
       ['http://www.ﬁl.ion.ucl.ac',
        ' statements The open source Matlab toolboxes that were used in this study can be obtained from: Lead-DBS: https://www.lead-dbs.org SPM12: http://www.ﬁl.ion.ucl.ac.uk/spm Fieldtrip: http://ﬁeldtriptoolbox.org Custom-written Matlab scripts are available for sharing upon re-quest.'],
        ['https://github',
        'Volumetric PET receptor images can be found on neuromaps (https://netneurolab.github.io/neuromaps/(Markello et al., 2022)) and at https://github.com/netneurolab/hansen_receptors (Hansen et al., 2021).'],
        ['https://netneurolab.github',
        '(2021) and is available in neuromaps (https://netneurolab.github.io/neuromaps/) (Markello et al., 2022).'],
        ['https://github',
        'All processing was performed using the abagen toolbox (https://github.com/netneurolab/abagen (Markello et al., 2021)).'],
        ['https://github',
        'We created a surface-based representation of the parcellation on the FreeSurfer fsaverage left hemi-sphere surface, via ﬁles from the Connectome Mapper toolkit (https://github.com/LTS5/cmp).'],
        ['https://meg.univ-amu',
        'The toolboxes used in this work are available at https://meg.univ-amu.fr/wiki/Main_Page and https://ins-amu.fr/software.'],

In [46]:
articles_other[['dataset', 'dataset_sentence']].values

array([['https://cran.r-project.org/web/packages/pcalg/',
        ' statements Speciﬁcally, GES, PC and LiNGAM were implemented using the widely used R package pcalg , which is available at https://cran.r-project.org/web/packages/pcalg/.'],
       ['https://github.com/xunzheng/notears',
        'Notears method was implemented using Python available at https://github.com/xunzheng/notears.'],
       ['GitHub (https://github.com/tierneytim/OPM/blob/master/testScripts/testVSM.m)',
        'Examples and tests can also be found on GitHub (https://github.com/tierneytim/OPM/blob/master/testScripts/testVSM.m).'],
       ['https://www.ieeg.org',
        'iEEG snippets used speciﬁ-cally in this manuscript are also available, while full iEEG recordings are publicly available at https://www.ieeg.org.'],
       ['N/A', 'N/A'],
       ['https://github.com/BioMag/dbs_pd_beta_burst',
        'Scripts used to produce the results and ﬁg-ures presented in the study are published on the BioMag Gitlab page 

------------

## URL extract 

In the groundtruth exploration, 75% of the datasets were referenced with a link. Links will therefore be the basis for extraction, using the urlextract library. 


In [47]:
import urlextract 

In [48]:
############### SENTENCES ################################################
def split_text_into_sentences(text):
    """This function splits a given text into sentences based on a regular expression pattern. 
    It uses re.split() to split on whitespace that follows sentence-ending punctuation 
    (".", "!", or "?"). It does not split when that whitespace is followed by a number and 
    further whitespace, so 'Fig. 1 ' stays within one sentence. 
    
    Parameters: 
    :param text(str): text to split into sentences
    
    Returns: 
    :return: list of sentence strings
    """
    sentence_pattern = r'(?<=[.!?])\s+(?![0-9]+\s)'
    sentences = re.split(sentence_pattern, text)
    return sentences

############### CAPITALIZED ################################################
def extract_capitalized_words(text):
    """This function detects and extracts DOIs and capitalized phrases that are followed 
    by text in parentheses. Capitalized phrases without parentheses are not matched. 
    E.g., "In this sentence Dataset Example (www.linktodataset.com) would be extracted" 
    returns "Dataset Example (www.linktodataset.com)"
    
    Parameters: 
    :param text(str): text to search
    
    Returns: 
    :return: list of capitalized phrases (with their parentheticals) and DOIs
    """
    pattern = r'([A-Z][a-zA-Z\-]+(?:\s+[A-Z][a-zA-Z\-]+)*(?:\s*\(.*?\)))|(doi:\s*10\.\d+/\S+)'
    matches = re.findall(pattern, text)
    
    # Filter out empty strings from the matches
    non_empty_matches = [match for match in matches if match[0] or match[1]]
    # Extract the non-empty matches into a list
    extracted_words = [match[0] if match[0] else match[1] for match in non_empty_matches]
    
    return extracted_words

############### LINKS ################################################
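def extract_links(text): 
    """This function extracts URLs from a text with the urlextract library.

    Parameters: 
    :param text(str): text to search

    Returns: 
    :return: list of URLs in order of appearance
    """
    # Create an instance of the URLExtract class
    extractor = urlextract.URLExtract()

    # gen_urls yields the URLs one at a time; collect them in a list
    return list(extractor.gen_urls(text))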

############### DATASETS ################################################
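def get_datasets(text):
    """This function collects candidate dataset mentions from a text: capitalized phrases 
    followed by parentheses, DOIs, links, and sentences that mention "request". 
    
    Parameters: 
    :param text(str): text to search
    
    Returns: 
    :return: two equally long lists (datasets, sentences), or None if nothing is found
    """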
    # Initialize lists to store extracted datasets and their corresponding sentences
    extracted_datasets = []
    dataset_sentences = []
    
    # Split the text into sentences
    sentences = split_text_into_sentences(text)
    
    # Extract links and capitalized words
    links = extract_links(text)
    capitalized_words = extract_capitalized_words(text)    
    
    for sentence in sentences:
        datasets_in_sentence = []
        
        # Check if the sentence contains any capitalized words or DOIs 
        for cap_word in capitalized_words:
            if cap_word in sentence:
                datasets_in_sentence.append(cap_word)
        
        # Check if the sentence contains a link
        for link in links:
            if link in sentence:
                # Check if the link is already captured as a capitalized word in the same sentence
                if not any(link in cap_word for cap_word in capitalized_words):
                    datasets_in_sentence.append(link)
        
        # Check if the sentence contains the word "request"
        if "request" in sentence.lower():
            datasets_in_sentence.append("Request")
        
        if datasets_in_sentence:
            # If any datasets were found in the sentence, add them and the sentence itself
            extracted_datasets.extend(datasets_in_sentence)
            dataset_sentences.extend([sentence] * len(datasets_in_sentence))
    
    # If no dataset was found, return None
    if not extracted_datasets:
        return None
    
    #df = pd.DataFrame({'dataset': extracted_datasets, 'dataset_sentence': dataset_sentences})
    #return df

    return extracted_datasets, dataset_sentences


###############
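def extract_and_add_datasets(row, text_column):
    """This function runs get_datasets on the text in one row of a DataFrame and returns 
    one copy of the row per extracted dataset, with the dataset and the sentence it was 
    found in added as new columns. 
    
    Parameters: 
    :param row: row (pandas Series) of the DataFrame 
    :param text_column: name of the column that holds the text to search 
    
    Returns: 
    :return: list of rows, one per extracted dataset (empty if none were found) 
    """
    result = get_datasets(row[text_column])
    
    if result is None:
        # Return an empty list: the caller passes the result to list.extend(),
        # which raises a TypeError on None
        return []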
    
    if len(result) == 2:
        datasets, sentences = result
    else:
        # Handle the case where get_datasets didn't return the expected two values
        datasets, sentences = ["N/A"], ["N/A"]
    
    rows_list = []
    for dataset, sentence in zip(datasets, sentences):
        new_row = row.copy()
        new_row['dataset'] = dataset
        new_row['dataset_sentence'] = sentence
        rows_list.append(new_row)
    
    return rows_list

In [49]:
# Initialize an empty list to store the rows
rows_list = []
# Column name to use for text extraction
text_column = 'Section_wo_pattern'

# Iterate through each row of the original DataFrame
for index, row in validation_set.iterrows():
    # Call the custom function to extract datasets and add new rows
    new_rows = extract_and_add_datasets(row, text_column)
    
    # Append the new rows to the list
    rows_list.extend(new_rows)

# Create the final DataFrame from the list of rows
validation_df = pd.DataFrame(rows_list)

In [50]:
validation_df

Unnamed: 0,DOI,Section,Matched_pattern,Start_pattern,End_pattern,Start_pattern_clean,Section_wo_pattern,dataset,dataset_sentence
302,10.1016/j.neuroimage.2022.119526,Data and code availability Original data was o...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(73770, 73799), match='...","<re.Match object; span=(488, 526), match=' \nD...",Data and code availability,Original data was obtained from the Human Con...,"Human Connectome Project (1U54MH091657, PIs Va...",Original data was obtained from the Human Con...
302,10.1016/j.neuroimage.2022.119526,Data and code availability Original data was o...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(73770, 73799), match='...","<re.Match object; span=(488, 526), match=' \nD...",Data and code availability,Original data was obtained from the Human Con...,Allen Hu-man Brain Atlas (http://human.brain-m...,Original data was obtained from the Human Con...
302,10.1016/j.neuroimage.2022.119526,Data and code availability Original data was o...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(73770, 73799), match='...","<re.Match object; span=(488, 526), match=' \nD...",Data and code availability,Original data was obtained from the Human Con...,https://github.com/jbrown81/gradients.,"All code (latent space derivation, dynamical s..."
355,10.1016/j.neuroimage.2022.119443,Data and code availability statement EEG datas...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(22317, 22347), match='...","<re.Match object; span=(312, 331), match=' \nA...",Data and code availability,statement EEG datasets used to create the ﬁgu...,"osf.io/gazx2/,",statement EEG datasets used to create the ﬁgu...
355,10.1016/j.neuroimage.2022.119443,Data and code availability statement EEG datas...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(22317, 22347), match='...","<re.Match object; span=(312, 331), match=' \nA...",Data and code availability,statement EEG datasets used to create the ﬁgu...,"osf.io/eucqf/,",statement EEG datasets used to create the ﬁgu...
355,10.1016/j.neuroimage.2022.119443,Data and code availability statement EEG datas...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(22317, 22347), match='...","<re.Match object; span=(312, 331), match=' \nA...",Data and code availability,statement EEG datasets used to create the ﬁgu...,osf.io/thsqg/and,statement EEG datasets used to create the ﬁgu...
355,10.1016/j.neuroimage.2022.119443,Data and code availability statement EEG datas...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(22317, 22347), match='...","<re.Match object; span=(312, 331), match=' \nA...",Data and code availability,statement EEG datasets used to create the ﬁgu...,osf.io/bndjg/.,statement EEG datasets used to create the ﬁgu...
355,10.1016/j.neuroimage.2022.119443,Data and code availability statement EEG datas...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(22317, 22347), match='...","<re.Match object; span=(312, 331), match=' \nA...",Data and code availability,statement EEG datasets used to create the ﬁgu...,osf.io/guwnm/.,"Code used to reproduce the plots in Fig. 1 , a..."
437,10.1016/j.neuroimage.2022.119240,Data availability statement Data used in this ...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(59756, 59777), match='...","<re.Match object; span=(130, 169), match=' \n...",Data availability,statement Data used in this study are availab...,Request,statement Data used in this study are availab...
583,10.1016/j.neuroimage.2022.119050,Data and code availability Raw EEG data from a...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(37690, 37718), match='...","<re.Match object; span=(172, 177), match=' \n3...",Data and code availability,"Raw EEG data from all healthy individuals, as...",doi: 10.5281/zenodo.6110595).,"Raw EEG data from all healthy individuals, as..."


In [51]:
# Create an instance of the URLExtract class
extractor = urlextract.URLExtract()

text_column = validation_set['Section_wo_pattern']

# Iterate through each cell in the text_column
for text in text_column:
    # Use the gen_urls function to extract URLs from the text
    for url in extractor.gen_urls(text):
        print(url)

http://human.brain-map.org/
https://github.com/jbrown81/gradients.
osf.io/gazx2/,
osf.io/eucqf/,
osf.io/thsqg/and
osf.io/bndjg/.
osf.io/guwnm/.
zenodo.org
https://www.humanconnectome.org/study/hcp-young-adult/document/extensively-processed-fmri-data-documentation
https://github.com/ferreirafabio80/gfa.
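
Several of the URLs above still carry a trailing comma or period, and one has the word 'and' fused onto it. A small post-processing step could remove these. The sketch below assumes a hypothetical helper `clean_url`, which is not part of urlextract, and the `raw` list simply repeats some of the printed URLs.

```python
import re


def clean_url(url):
    """Remove a fused trailing 'and' and trailing punctuation from an extracted URL."""
    # PDF extraction can glue the next word onto the link, e.g. 'osf.io/thsqg/and'
    url = re.sub(r'/and$', '/', url)
    # Drop sentence punctuation and closing parentheses at the end of the link
    return url.rstrip('.,;:)')


raw = ['https://github.com/jbrown81/gradients.', 'osf.io/gazx2/,',
       'osf.io/thsqg/and', 'osf.io/bndjg/.', 'http://human.brain-map.org/']
cleaned = [clean_url(url) for url in raw]
print(cleaned)
```

The rule is deliberately crude: a URL whose path really ends in '/and' or in a closing parenthesis would be altered as well, so the cleaned links should be spot-checked.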


---

<a name='othersectionpatterns'></a>
### 1.3.2. Other section patterns
The other section patterns, each given as a (start pattern, end pattern) pair: 

    (r'\n?2\.1\.', r'\n?2\.2. | \n\n '),
    (r'\n?Resource | \n?3\.1\.\s*?\n?', r'\n?3\.2.\s*?| \s*?\n\n '),
    (r'\n?Introduction\s*?\n? | \s*?\n?1\.\s*?\n? ', r'\s*?\n?2\.\s*?\n? | \s*?\n\n '),
    (r'\n?Fig\.\d+ | \n?Fig\.\d+\.? | \n?Figure \d+', r'https?://[^\s]+ | \s*?\n\n '),
    (r'\n?Tab\.\d+ | \n?Table \d+\.?', r'https?://[^\s]+ | [\w\s-]+\d{4} | \s*?\n\n '),
    (r'\n?Abstract\s*?\n? | \s*?\n?1\.\s*?\n? ', r'\n?Introduction\s*?\n? | \s*?\n\n ')
    ]
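
A pair like these could be applied by searching for the start pattern and then for the end pattern in the text that follows. The sketch below is a minimal illustration under stated assumptions: `extract_section` is a hypothetical helper, the sample text is invented, and the end pattern is simplified to `\n?2\.2\.`. In the patterns above, the spaces around `|` are matched literally unless the pattern is compiled with re.VERBOSE.

```python
import re


def extract_section(text, start_pattern, end_pattern):
    """Return the text between the first start match and the next end match."""
    start = re.search(start_pattern, text)
    if start is None:
        return None
    rest = text[start.end():]
    end = re.search(end_pattern, rest)
    # Without an end match, keep everything after the start pattern
    section = rest if end is None else rest[:end.start()]
    return section.strip()


# Hypothetical article text
sample = ("1. Introduction\nSome background.\n"
          "2.1. Participants\nData came from the Human Connectome Project.\n"
          "2.2. Analysis\nWe fitted a model.")

section = extract_section(sample, r'\n?2\.1\.', r'\n?2\.2\.')
print(section)
```

Searching for the end pattern only in the text after the start match avoids picking up an earlier occurrence of the end pattern elsewhere in the article.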

### 1.3.3. Validate code 
In the notebook 'articles_groundtruth_v2.ipynb', I manually extracted the datasets from ten articles. I will now use these ten articles to validate the code I have written so far. 

In [52]:
# DOIs of the groundtruth articles. The rest of the original cell is missing, and
# only nine of the ten DOIs are listed; groundtruth_dois is a placeholder name.
groundtruth_dois = [
    '10.1016/j.neuroimage.2021.118839',
    '10.1016/j.neuroimage.2022.119030',
    '10.1016/j.neuroimage.2022.119050',
    '10.1016/j.neuroimage.2022.119240',
    '10.1016/j.neuroimage.2022.119443',
    '10.1016/j.neuroimage.2022.119526',
    '10.1016/j.neuroimage.2022.119549',
    '10.1016/j.neuroimage.2022.119646',
    '10.1016/j.neuroimage.2022.119676',
]

# Save datasets 

- Store the extracted datasets for further analysis 

# X. References

- Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022)
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget