# Table of contents 
- [Setup](#setup) 
    - [Target](#target)
    - [Libraries](#libraries)
- [Gather datasets](#gatherdatasets)
    - [Get content](#getcontent)
        - [Evaluation of section patterns (so far)](#evaluationofsectionpatterns(sofar))
- [References](#references)

<a name='setup'></a>
# 0. Setup 

This notebook contains the code to extract the datasets used in the articles published in NeuroImage in 2022. 
<br>
<br>

<a name='target'></a> 
## 0.1. Target
The goal is to use pypdf to locate and extract the datasets used for analysis in the research articles. Based on an initial review of nine random 
<br>
<br>

<a name='libraries'></a>
## 0.2. Libraries 

In [1]:
import pandas as pd
import numpy as np

import json 
import os 
import re 

import pypdf 

<a name='gatherdatasets'></a>
# 1. Gather datasets 

PLAN OF ATTACK TO EXPLORE: 
* IF there is a 'Data availability' (or similar) section: locate it and look for links. If there are multiple links, save all of them and look at the surrounding words for context. 
* ELSE (there is no 'Data availability' or similar section): 
	* Look at the wording in section 2.1 
<br>
<br>
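A minimal sketch of the first branch of the plan, assuming the section text has already been extracted as a string. The helper name `find_links_with_context`, the `window` size, and the example text (including its placeholder URL) are made up for illustration and are not part of the repositories cited below.

```python
import re

# Matches http(s) URLs up to the next whitespace character
URL_PATTERN = re.compile(r"https?://\S+")

def find_links_with_context(section_text, window=60):
    """Return every URL in section_text together with the text around it (hypothetical helper)."""
    links = []
    for match in URL_PATTERN.finditer(section_text):
        # Strip trailing punctuation that usually belongs to the sentence, not the URL
        url = match.group().rstrip(".,;)")
        start = max(0, match.start() - window)
        end = min(len(section_text), match.end() + window)
        links.append({"url": url, "context": section_text[start:end]})
    return links

# Invented example text with a placeholder URL
example = (
    "Data availability\nThe code is available at https://github.com/example/repo. "
    "The data can be shared upon reasonable request."
)
for link in find_links_with_context(example):
    print(link["url"], "|", link["context"])
```

Keeping the surrounding words next to each URL makes it possible to judge later whether a link points to data, code, or something else.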

<a name='getcontent'></a>
## 1.1. Get content 

I use the work of Akkoç (2023) and Sourget (2023) to search the PDFs for their datasets. The code from two separate git repositories serves as inspiration for the two functions presented in this section. 
- *get_section* is loosely adapted from Akkoç (2023), using the following breadcrumb in the GitHub repository: PublicDatasets/ArticleAnalyser.ipynb, section '2.1 Get section function'
- *get_content* is loosely adapted from Sourget (2023), using the following breadcrumb in the GitHub repository: DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures'.

<br>

References: 
- Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022)
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget

In [2]:
def get_content(pdf_path, section_patterns):
    """Extract the first matching section from a PDF. 
    This function is loosely adapted from Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget
    specifically: DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures'.
    
    Parameters: 
    :param pdf_path (str): Path to the PDF file.
    :param section_patterns (list of tuples): Pairs of (start patterns, end patterns), tried in order. 
    
    Returns: 
    :return: The extracted section text, or None if no section matches or the PDF cannot be read.
    """
    try:
        # The context manager closes the file even when the function returns early
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = pypdf.PdfReader(pdf_file)
            # Read the entire PDF content
            pdf_text = " ".join(page.extract_text() for page in pdf_reader.pages)
        
        # Extract sections using the provided section patterns
        content = get_section(pdf_text, section_patterns)
        if content: 
            # print(f"Section content: '{content}''")
            return content 
        
    except Exception as e:
        print(f"Error reading PDF: {e}")  
    
        """
        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_num]
            page_text = page.extract_text()
            print(page_text)
            
            # Search for the regex pattern in the page text
            #if re.search(target_text_pattern, page_text, re.IGNORECASE):
            #    print(f"Found '{target_text_pattern}' on page {page_num + 1} of {pdf_path}")

            # Extract sections using the provided section patterns
            content = get_section(page_text, section_patterns)
            if content:
                print(f"Section Content: {content}")
        """
        
def get_section(article, section_patterns):
    """Get sections from a research paper based on patterns.
    This function is loosely adapted from Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022) with some alterations.
    specifically PublicDatasets/ArticleAnalyser.ipynb, section '2.1 Get section function'
    
    Parameters: 
    :param article (str): Text contents of the research paper.
    :param section_patterns (list of tuples): Each tuple holds a list of start patterns and a list of end patterns.
        Patterns are passed through re.escape, so they are matched as literal text, not as regular expressions.
    
    Returns: 
    :return: The extracted section text (including the end pattern), or "" if nothing matches.
    """
    article_lower = article.lower()  # Convert contents to lowercase

    # Attempt to find the section based on the current patterns (case-insensitive)
    for start_patterns, end_patterns in section_patterns:
        for start_pattern in start_patterns:
            start_pattern = re.compile(re.escape(start_pattern), re.IGNORECASE)
            match_start = start_pattern.search(article_lower)
            if match_start:
                idx0 = match_start.start()
                for end_pattern in end_patterns:
                    end_pattern = re.compile(re.escape(end_pattern), re.IGNORECASE)
                    match_end = end_pattern.search(article_lower[idx0:])
                    if match_end:
                        end_idx = idx0 + match_end.end()
                        section = article[idx0:end_idx]  # Extract the matched section
                        return section

    # If no match is found, return an empty string
    return ""

In [3]:
# Path to the directory containing PDFs
pdf_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_articles_doi/'

# Path to the JSON file containing DOI values
json_file_path = '../Data/ElsevierAPI/downloadedPDFs_info.json'

section_patterns = [
    (["Data and Code Availability", "Data Availability"], ["3", "CRediT authorship contribution statement", "Acknowledgements", "References"]),
    (["2.1"], ["2.2"]),
    (["Resource", "3.1 'Resource'"], ["3.2"]),
    # Raw strings avoid invalid escape sequence warnings for \d, \s and \w
    ([r"Fig.\d+", r"Fig.\d+\.?", r"Figure \d+"], [r"https?://[^\s]+"]),
    ([r"Tab.\d+", r"Table \d+\.?"], [r"https?://[^\s]+", r"[\w\s-]+\d{4}"]),
    (["Introduction", "1"], ["2"]),
    (["Abstract"], ["1", "Introduction"])
]

<a name='evaluationofsectionpatterns(sofar)'></a>
### 1.1.1. Evaluation of section patterns (so far)
Before I continue working on extracting the dataset names and potential links from the sections, I am curious to see how the section patterns perform. 

I investigate the first ten DOIs in downloadedPDFs_info.json to see exactly which text sections were extracted.

In [8]:
# Empty list to store individual results
results_list = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    # Get the first 10 DOIs
    first_10_dois = doi_data['DOIs'][:10]

    for doi in first_10_dois:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content function for each DOI
        section_content = get_content(pdf_path, section_patterns)

        # Create a dictionary for each result and add it to the list
        results_list.append({"DOI": doi, "Section": section_content})

# Convert the list of dictionaries to a DataFrame
results_df = pd.DataFrame(results_list)

In [9]:
results_df

| | DOI | Section |
|---|---|---|
| 0 | 10.1016/j.neuroimage.2022.119451 | Data and code availability statements \nSpec... |
| 1 | 10.1016/j.neuroimage.2022.119632 | Data and code availability \nThe data incorpo... |
| 2 | 10.1016/j.neuroimage.2022.119584 | Data Availability \nData will be made availab... |
| 3 | 10.1016/j.neuroimage.2022.119550 | Data and code availability \nAll data used in... |
| 4 | 10.1016/j.neuroimage.2022.119710 | Data and code availability statement \nAll i... |
| 5 | 10.1016/j.neuroimage.2022.119338 | Data and code availability \nNo data were acq... |
| 6 | 10.1016/j.neuroimage.2022.118986 | Data and Code Availability Statement \nAll c... |
| 7 | 10.1016/j.neuroimage.2022.119192 | Data and code availability statement \nData ... |
| 8 | 10.1016/j.neuroimage.2022.119177 | Data and code availability statement \nThe E... |
| 9 | 10.1016/j.neuroimage.2022.119110 | Data and code availability statements : The ... |


From this investigation I can see that I need to edit the section patterns, because: 
* In 10.1016/j.neuroimage.2022.119632, 10.1016/j.neuroimage.2022.119584, 10.1016/j.neuroimage.2022.119710, 10.1016/j.neuroimage.2022.119192, and 10.1016/j.neuroimage.2022.119110, **the text is cut short because there's a mention of a number 3** within the section (in a link, in a release number, etc.). 
* In 10.1016/j.neuroimage.2022.119584, they call it: 'Data/code availability statement'
* In 10.1016/j.neuroimage.2022.119584 and 10.1016/j.neuroimage.2022.119110, the **end of the section can be 'Acknowledgements'**.
* In 10.1016/j.neuroimage.2022.119550 and 10.1016/j.neuroimage.2022.118986, the **end of the section can be 'Declaration of Competing Interest'**.
* In 10.1016/j.neuroimage.2022.119710, 10.1016/j.neuroimage.2022.119338, and 10.1016/j.neuroimage.2022.119177, the **section ends with 'Credit authorship contribution statement'**.
* In 10.1016/j.neuroimage.2022.119338, we see that the use of a **URL does not necessarily mean that it's pointing to data (in this case, it's code and software)**. 
* In 10.1016/j.neuroimage.2022.118986, we see that **the formulation of the text is important** (the GitHub link contains both data and code, but that is tricky to see). 
* In 10.1016/j.neuroimage.2022.119192 and 10.1016/j.neuroimage.2022.119177, they **mention which dataset they used, but do not link it**. 
* In 10.1016/j.neuroimage.2022.119110, it says: "The review summarizes data but does not contain new data." (this is important if I want to look into and further filter the documents for significance testing). 

CHANGES: 
- Maybe the end of a section can be \n\n? 
- Section end '3' should be called '3. ' - maybe this will fix some 
- Add variations: 
    - Section starts: 
        - Data/code availability statement
    - Section ends: 
        - [data and code] Declaration of Competing Interest
        - [data and code] Acknowledgements
        - [data and code] Credit authorship contribution statement
- For future steps: 
    - URLs do not necessarily link to the data. 
    - A git repository can contain both data and code - but not always. 
    - The dataset might only be mentioned by name and not linked (so far, I've only seen the names in camelcase). 
    - QUESTION: How do we treat reviews that summarize data but do not contain new data? Is the reuse of a dataset not also the same as not containing new data?
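The truncation caused by the literal end pattern '3' can be reproduced on a toy string. The sketch below compares it with a heading-aware regular expression. It assumes that numbered headings such as "3. Results" start on their own line in the extracted text, which I have not verified for every article, and the text and URL are invented. Because get_section passes every pattern through re.escape, a regex like this would only work in a variant of the function that skips the escaping.

```python
import re

# Assumption: a numbered heading starts a line, e.g. "3. Results"
NUMBERED_HEADING = re.compile(r"^\s*3\.\s+[A-Z]", re.MULTILINE)

# Invented example text: the digit 3 appears in a release number and in a URL
text = (
    "Data availability\n"
    "Data are part of release 3 and hosted at https://example.org/v3/data.\n"
    "3. Results\n"
    "We found ..."
)

literal_cut = text.find("3")                         # position of the first literal '3'
heading_cut = NUMBERED_HEADING.search(text).start()  # position of the heading '3. Results'

print(repr(text[:literal_cut]))   # stops inside the first sentence
print(repr(text[:heading_cut]))   # keeps the full data availability text
```

The literal '3. ' proposed above is a cheaper middle ground, although it could still match a sentence that happens to end in "3.".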

In [20]:
section_patterns_v2 = [
    (["Data and Code Availability", "Data Availability", "Data/code availability"], ["3. ", "CRediT authorship contribution statement", "Acknowledgements", "References", "Declaration of Competing Interests", "Credit authorship contribution statement", "\n\n"]),
    (["2.1."], ["2.2."]),
    (["Resource", "3.1."], ["3.2."]),
    # Raw strings avoid invalid escape sequence warnings for \d, \s and \w
    ([r"Fig.\d+", r"Fig.\d+\.?", r"Figure \d+"], [r"https?://[^\s]+"]),
    ([r"Tab.\d+", r"Table \d+\.?"], [r"https?://[^\s]+", r"[\w\s-]+\d{4}"]),
    (["Introduction", "1. "], ["2. "]),
    (["Abstract"], ["1. ", "Introduction"])
]

In [21]:
# Empty list to store individual results
results_list_v2 = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    # Get the next 10 DOIs (indices 11-20; note that index 10 is skipped)
    first_10_dois = doi_data['DOIs'][11:21]

    for doi in first_10_dois:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content function for each DOI
        section_content = get_content(pdf_path, section_patterns_v2)

        # Create a dictionary for each result and add it to the list
        results_list_v2.append({"DOI": doi, "Section": section_content})

# Convert the list of dictionaries to a DataFrame
results_df2 = pd.DataFrame(results_list_v2)

Error reading PDF: [Errno 2] No such file or directory: '../Data/ElsevierAPI/downloaded_pdfs/fulltext_articles_doi/10.1016.S1053-8119(22)00043-X.pdf'


In [22]:
results_df2

| | DOI | Section |
|---|---|---|
| 0 | 10.1016/j.neuroimage.2022.118931 | Data availability \nThe Matlab code for the p... |
| 1 | 10.1016/j.neuroimage.2022.119447 | 2.1. Participants \nThirty-ﬁve people (20 fe... |
| 2 | 10.1016/j.neuroimage.2022.119403 | Data and code availability statement \nData ... |
| 3 | 10.1016/j.neuroimage.2021.118831 | Data and code availability statements \nThe ... |
| 4 | 10.1016/j.neuroimage.2022.119308 | Data and code availability statement \nFinni... |
| 5 | 10.1016/S1053-8119(22)00043-X | |
| 6 | 10.1016/j.neuroimage.2021.118792 | data availability \nThe Shen 268 atlas is ava... |
| 7 | 10.1016/j.neuroimage.2022.118890 | data availability \nThe code used to run the ... |
| 8 | 10.1016/j.neuroimage.2022.119339 | Data and code availability statement \nThe h... |
| 9 | 10.1016/j.neuroimage.2022.119295 | Introduction \nReal-time functional magneti... |


In [32]:
results_df2['Section'].loc[9]

'Introduction  \nReal-time  functional  magnetic  resonance  imaging  (RT-fMRI)  is an \nemerging  technology  that holds tremendous  promise  for breakthroughs  \nin basic science  and clinical  applications.  In contrast  to traditional,  of- \nﬂine fMRI analysis,  RT-fMRI  involves  analyzing  data while participants  \nare still in the scanner,  giving experimenters  the ability to modify the \nstimuli or tasks that they present  as a function  of the participant’s  mea- \nsured neural state. RT-fMRI  can be used in neurofeedback  designs,  in \nwhich participants  are given feedback  on how well they are instantiat-  \ning a target brain state, and they use this information  to learn how to \nbetter instantiate  that state (for a historical  review of fMRI neurofeed-  \nback, see Linden et al., 2021 ; this review is part of a recent textbook  \non fMRI neurofeedback  edited by Hampson,  2021 ). In another  use of \nRT-fMRI,  stimuli are modiﬁed  as a function  of brain activation,

Notes from the second attempt: 
- In 10.1016/j.neuroimage.2022.118931 - there are links, but these are not to the dataset - they write "The used data can be shared with other researchers upon reasonable request." 
- In 10.1016/j.neuroimage.2022.118931 and 10.1016/j.neuroimage.2022.118890, the next section is called 'Supplementary materials' - which means that my attempt at \n\n did not work.  
- In 10.1016/j.neuroimage.2022.119447, the only mention of data was picked up in section 2.1.
- In 10.1016/j.neuroimage.2022.119403, the 'Declaration  of Competing  Interest' was not picked up - it looks like it's because there are double spaces between the words. 
- In 10.1016/j.neuroimage.2021.118831, the 'Credit authorship  contribution  statement' is not picked up - double spaces?
- In 10.1016/j.neuroimage.2021.118792, the data section is called 'Code and data availability' - but it was picked up by 'data availability'. 
- In 10.1016/j.neuroimage.2021.118792, there are multiple links mentioned - one for data (an atlas), one for the code, and one for the data. 
    - NB! When copying the URL for the data, it is broken up by the formatting: https://www.humanconnectome.org/study/hcp-young-adult/ document/1200-subjects-data-release - this is also the case for the atlas. 
- In 10.1016/j.neuroimage.2022.118890 and 10.1016/j.neuroimage.2022.119339, there are spaces in the URL. 
- In 10.1016/j.neuroimage.2022.119339, the following section 'Declaration of Competing Interest' was not picked up. 
- In 10.1016/j.neuroimage.2022.119295, the introduction was picked up: but it does not look like any data is analysed in this article. 
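If the double spaces are indeed the reason the end headings are missed, one possible fix is to build each pattern so that any run of whitespace is accepted between the words. The sketch below is an assumption about how that could look; `heading_regex` is a hypothetical helper and the example string only imitates the spacing seen in the extracted text. Note also that section_patterns_v2 spells the heading 'Declaration of Competing Interests', with a final 's' that the headings quoted in these notes do not have, so that entry would probably not match even with single spaces.

```python
import re

def heading_regex(heading):
    """Compile a case-insensitive regex that accepts any whitespace between the words (hypothetical helper)."""
    words = heading.split()
    return re.compile(r"\s+".join(re.escape(word) for word in words), re.IGNORECASE)

# Imitation of extracted text with double spaces inside the heading
extracted = "shared upon request. \nDeclaration  of Competing  Interest \nThe authors declare"

# A literal, single-spaced search misses the heading
print("declaration of competing interest" in extracted.lower())

# The whitespace-tolerant regex finds it
print(bool(heading_regex("Declaration of Competing Interest").search(extracted)))
```

The same helper would also cover a heading that is split across two lines, since `\s+` matches newlines as well.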

TO DO: 
- RESEARCH WHY \n\n did not work for section end 
- Section_patterns that do not work: 
    - Section_end: \n\n
    - Section_end: 'Declaration  of Competing  Interest' + Section_end: 'Credit authorship  contribution  statement' + Section_end: 'Supplementary materials'
        - double-spaces between words 
- Section_patterns I'm worried about: 
    - Section_start: 2.1. - what if it's '2.1'?
    - Section_start: 'Code and data availability'
- Undiscovered section_patterns: 
    - Section_end: 'Ethics statement'
- Worries 
    - How to get the name of the dataset itself and the url
        - The URL can be broken up by spaces (due to line changes in the pdf) - can I find a way to recover the entire URL? 
            - Are there any slashes in the text ahead? A parenthesis, dot, comma, or another symbol might end the URL. 
    - If someone uses e.g., HCP, do they use all of the data? Do I need to catch more text-sections to learn this (in relation to the discussion of significance testing - if they use different parts of the dataset, they are not testing on the same data). 
        - "Due to HCP and dHCP privacy policies, the preprocessed resting-state images of human adults and neonates (with their IDs) can only be shared upon request with qualified investigators who agree to the Restricted Data Use Terms of these two datasets." (from 10.1016/j.neuroimage.2022.119339)
    - What if the article does not analyse any data? (e.g., 10.1016/j.neuroimage.2022.119295 presents a software package for the execution of RT-fMRI experiments). 
- FUNCTION GET_CONTENT: Make a comment about trying the "Editorial board" texts in the other file - just so I don't get an "Error reading PDF:" 
    - Make an addition to 'get_section' where it says 'Editorial board' instead of None for the section text. 
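One possible heuristic for the broken URLs, sketched under the assumption that a line break only splits a URL right after a '/', '-' or '_' and that the continuation looks like a path piece. `rejoin_url` is a hypothetical helper. The URL is the one noted above for 10.1016/j.neuroimage.2021.118792, while the sentence around it is invented.

```python
import re

def rejoin_url(text, start):
    """Greedily rebuild a URL that was split by spaces, starting at index `start` (heuristic sketch)."""
    tokens = text[start:].split()
    url = tokens[0]
    for token in tokens[1:]:
        ends_mid_path = url.endswith(("/", "-", "_"))
        # Only URL-safe characters, and at least one '/', '-' or digit, so plain words are not glued on
        looks_like_path = (re.fullmatch(r"[\w\-./~%#?=&]+", token) is not None
                           and re.search(r"[/\-\d]", token) is not None)
        if ends_mid_path and looks_like_path:
            url += token
        else:
            break
    # Drop punctuation that belongs to the sentence rather than the URL
    return url.rstrip(".,;)")

sentence = ("The data is available at https://www.humanconnectome.org/study/hcp-young-adult/ "
            "document/1200-subjects-data-release - see the release notes.")
print(rejoin_url(sentence, sentence.index("https://")))
```

This will still miss a break in the middle of a word and may glue on an unrelated token that happens to contain a hyphen or digit, so the rebuilt URLs would need to be checked by hand or by requesting them.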


WHAT I WANT: 
- A column to indicate which pattern was picked up - that way I can more easily sort and handle difficult cases. 
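A sketch of how the matched patterns could be reported, assuming the same literal (re.escape) matching as get_section. The function name `get_section_with_pattern` and the dictionary keys are hypothetical, and the toy patterns and article text are invented for the example.

```python
import re

def get_section_with_pattern(article, section_patterns):
    """Variant of get_section that also reports which start and end pattern matched (sketch)."""
    for start_patterns, end_patterns in section_patterns:
        for start_pattern in start_patterns:
            match_start = re.search(re.escape(start_pattern), article, re.IGNORECASE)
            if not match_start:
                continue
            idx0 = match_start.start()
            for end_pattern in end_patterns:
                match_end = re.search(re.escape(end_pattern), article[idx0:], re.IGNORECASE)
                if match_end:
                    section = article[idx0:idx0 + match_end.end()]
                    return {"Section": section, "Start": start_pattern, "End": end_pattern}
    # Nothing matched
    return {"Section": "", "Start": None, "End": None}

# Invented toy input
toy_patterns = [(["Data Availability"], ["Acknowledgements"]), (["2.1."], ["2.2."])]
toy_article = "2.1. Participants\nThirty-five people took part.\n2.2. Procedure"
print(get_section_with_pattern(toy_article, toy_patterns))
```

Merging the returned dictionary with `{"DOI": doi}` before appending it to the results list would give the DataFrame one column per key, so difficult cases can be filtered by the pattern that produced them.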

# Save datasets 

- Store the extracted datasets for further analysis 

# X. References

- Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022)
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget

In [None]:
def get_section_v1(article, section_patterns):
    """Get a section from a research paper. 
    
    Parameters: 
    :param article (str): Text contents of the research paper.
    :param section_patterns (list): A list of strings to indicate the start and end of the dataset section.
    :return: The substring of the text between the section start and a potential section end, or "" if it is not found.
    
    This function is adapted from Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022) with some alterations.
    """
    contents_lower = article.lower()  # Convert contents to lowercase
    
    """THE CODE BELOW DOES WHAT I WANT IT TO DO"""
    #test_start = r'data and code'
    #test_end = r'availability'
    #idx0 = contents_lower.find(test_start)
    #if idx0 != -1:
        #idxend = contents_lower.find(test_end, idx0)  # Start searching for test_end from idx0
        #if idxend != -1:
            #section = article[idx0:idxend]  # "+ len(test_end)" to include the end pattern in the extracted section
            #print(section)

    # If no match is found, return an empty string
    return "" 