# Table of contents 
- [Setup](#setup) 
    - [Target](#target)
    - [Libraries](#libraries)
- [Gather datasets](#gatherdatasets)
    - [Get content](#getcontent)
        - [Section patterns v1](#sectionpatternsv1)
        - [Section patterns v2](#sectionpatternsv2)
    - [Get datasets](#getdatasets)
- [References](#references)

<a name='setup'></a>
# 0. Setup 

This notebook contains the code to extract the datasets used in the articles published in NeuroImage in 2022. 
<br>
<br>

<a name='target'></a> 
## 0.1. Target
The goal is to use pypdf to locate and extract the datasets used for analysis in the research articles. Based on an initial review of nine random 
<br>
<br>

<a name='libraries'></a>
## 0.2. Libraries 

In [1]:
import pandas as pd
import numpy as np

import json 
import os 
import re 

import pypdf 

<a name='gatherdatasets'></a>
# 1. Gather datasets 

PLAN OF ATTACK TO EXPLORE: 
* IF there is a 'Data availability' (or similar) section: locate it and look for links. If there are multiple links, save all of them and look at the surrounding words for context. 
* ELSE (there is no 'Data availability' or similar section): 
	* Look at the wording in section 2.1 
<br>
<br>
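
As a rough illustration of the first branch of the plan, the sketch below collects every URL in a section together with a few words on either side. The helper name `find_links_with_context`, the window size, and the example text with its `example.org` URLs are all made up for illustration; the URL pattern is a simple assumption and has not been tuned on the corpus.

```python
import re

# Hypothetical helper (not part of the notebook): find URLs in a section
# and keep a few words on either side of each one for context.
URL_RE = re.compile(r"https?://[^\s)\]>,;]+")

def find_links_with_context(section_text, n_words=5):
    """Return one dict per URL with the URL and the words around it."""
    results = []
    for match in URL_RE.finditer(section_text):
        before = section_text[:match.start()].split()[-n_words:]
        after = section_text[match.end():].split()[:n_words]
        results.append({
            # A trailing full stop usually belongs to the sentence, not the URL
            "url": match.group().rstrip("."),
            "before": " ".join(before),
            "after": " ".join(after),
        })
    return results

# Made-up section text with placeholder URLs
example = ("Data availability\nThe code is available at https://example.org/code. "
           "The data can be downloaded from https://example.org/data (see README).")
links = find_links_with_context(example)
for link in links:
    print(link)
```

The surrounding words are what would later help decide whether a link points to data, code, or something else.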

<a name='getcontent'></a>
## 1.1. Get content 

I use the work of Akkoç (2023) and Sourget (2023) to search the PDFs for their datasets. The two functions presented in this section are inspired by code from two separate git repositories. 
- *get_section* is loosely interpreted from Akkoç (2023), found at the following path in the GitHub repository: PublicDatasets/ArticleAnalyser.ipynb, section '2.1 Get section function'
- *get_content* is loosely interpreted from Sourget (2023), found at the following path in the GitHub repository: DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures'.

<br>

References: 
- Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022)
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget

In [2]:
def get_content(pdf_path, alt_pdf_path, section_patterns):
    """Extract the first matching section from a PDF. 
    This function is loosely interpreted from Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget
    specifically: DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures'.
    
    Parameters: 
    :param pdf_path (str): Path to the PDF file.
    :param alt_pdf_path (str): Directory with the 'Editorial board' PDFs, used when pdf_path does not exist.
    :param section_patterns (list): Pairs of (start patterns, end patterns) passed to get_section.
    
    Returns: 
    :return: Extracted section text, 'Editorial board' if the PDF is not in the main directory,
             or None if no pattern matched or the PDF could not be read.
    """
    try:
        # The context manager closes the file even though we return from inside the try block
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = pypdf.PdfReader(pdf_file)
            # Read the entire PDF content
            pdf_text = " ".join(page.extract_text() for page in pdf_reader.pages)
        
        # Extract sections using the provided section patterns
        content = get_section(pdf_text, section_patterns)
        return content if content else None
        
    except FileNotFoundError:
        # A PDF missing from the main directory is treated as an 'Editorial board' item,
        # whether or not it exists in the alternative directory
        alternative_pdf_path = os.path.join(alt_pdf_path, os.path.basename(pdf_path))
        print(alternative_pdf_path)
        return 'Editorial board'
    
    except Exception as e:
        print(f"Error reading PDF: {e}")
        return None


def get_section(article, section_patterns):
    """Get sections from a research paper based on patterns.
    This function is loosely interpreted from Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022) with some alterations.
    specifically PublicDatasets/ArticleAnalyser.ipynb, section '2.1 Get section function'
    
    Parameters: 
    :param article (str): Text contents of the research paper.
    :param section_patterns (list of lists): A list of lists where each inner list represents the start and end patterns.
    
    Returns: 
    :return: The extracted section text.
    """
    article_lower = article.lower()  # Convert contents to lowercase

    # Attempt to find the section based on the current patterns (case-insensitive)
    for start_patterns, end_patterns in section_patterns:
        for start_pattern in start_patterns:
            start_pattern = re.compile(re.escape(start_pattern), re.IGNORECASE)
            match_start = start_pattern.search(article_lower)
            if match_start:
                idx0 = match_start.start()
                for end_pattern in end_patterns:
                    end_pattern = re.compile(re.escape(end_pattern), re.IGNORECASE)
                    match_end = end_pattern.search(article_lower[idx0:])
                    if match_end:
                        end_idx = idx0 + match_end.end()
                        section = article[idx0:end_idx]  # Extract the matched section
                        return section

    # If no match is found, return an empty string
    return ""

In [3]:
# Path to the directory containing PDFs
pdf_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_articles_doi/'
alternative_pdf_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_editorialboard_doi/'

# Path to the JSON file containing DOI values
json_file_path = '../Data/ElsevierAPI/downloadedPDFs_info.json'

<a name='sectionpatternsv1'></a>
### 1.1.1. Section patterns v1 
Before I continue working on extracting the dataset names and potential links from the sections, I am curious to see how the section pattern performs. 

I investigate the first ten DOIs in downloadedPDFs_info.json to see exactly what text sections were extracted.

In [4]:
section_patterns = [
    (["Data and Code Availability", "Data Availability"], ["3", "CRediT authorship contribution statement", "Acknowledgements", "References"]),
    (["2.1"], ["2.2"]),
    (["Resource", "3.1 'Resource'"], ["3.2"]),
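    # Note: get_section passes every pattern through re.escape, so these regex-style entries are matched as literal text
    ([r"Fig.\d+", r"Fig.\d+\.?", r"Figure \d+"], [r"https?://[^\s]+"]),
    ([r"Tab.\d+", r"Table \d+\.?"], [r"https?://[^\s]+", r"[\w\s-]+\d{4}"]),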
    (["Introduction", "1"], ["2"]),
    (["Abstract"], ["1", "Introduction"])
]

In [5]:
# Empty list to store individual results
results_list = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    # Get the first 10 DOIs
    first_10_dois = doi_data['DOIs'][:10]

    for doi in first_10_dois:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content function for each DOI
        section_content = get_content(pdf_path, alternative_pdf_directory, section_patterns)

        # Create a dictionary for each result and add it to the list
        results_list.append({"DOI": doi, "Section": section_content})

# Convert the list of dictionaries to a DataFrame
results_df = pd.DataFrame(results_list)

In [6]:
results_df

Unnamed: 0,DOI,Section
0,10.1016/j.neuroimage.2022.119451,Data and code availability statements \nSpec...
1,10.1016/j.neuroimage.2022.119632,Data and code availability \nThe data incorpo...
2,10.1016/j.neuroimage.2022.119584,Data Availability \nData will be made availab...
3,10.1016/j.neuroimage.2022.119550,Data and code availability \nAll data used in...
4,10.1016/j.neuroimage.2022.119710,Data and code availability statement \nAll i...
5,10.1016/j.neuroimage.2022.119338,Data and code availability \nNo data were acq...
6,10.1016/j.neuroimage.2022.118986,Data and Code Availability Statement \nAll c...
7,10.1016/j.neuroimage.2022.119192,Data and code availability statement \nData ...
8,10.1016/j.neuroimage.2022.119177,Data and code availability statement \nThe E...
9,10.1016/j.neuroimage.2022.119110,Data and code availability statements : The ...


*In the following description, I refer to the index of the articles in results_df.*

Observations from the text sections extracted with section_patterns: 
* In 1, 2, 4, 7, and 9, **the text is cut short because there's a mention of a number 3** within the section (in a link, in a release number, etc.). 
* In 2, they call it: 'Data/code availability statement'
* In 2 and 9, the **end of the section can be 'Acknowledgements'**.
* In 3 and 6, the **end of the section can be 'Declaration of Competing Interest'**.
* In 4, 5, and 8, the **section ends with 'Credit authorship contribution statement'**.
* In 5, we see that the use of a **URL does not necessarily mean that it's pointing to data (in this case, it's code and software)**. 
* In 6, we see that **the formulation of the text is important** (the GitHub link contains both data and code, but that is tricky to see). 
* In 7 and 8, they **mention which dataset they used, but do not link it**. 
* In 9, it says: "The review summarizes data but does not contain new data." (this is important if I want to look into and further filter the documents for significance testing). 

<br>
From this investigation I can see that I need to edit the section patterns. Ideas: 

- Maybe the end of a section can be \n\n? 
- Section end '3' should be called '3. ' - maybe this will fix some 
- Add variations: 
    - Section starts: 
        - Data/code availability statement
    - Section ends: 
        - [data and code] Declaration of Competing Interest
        - [data and code] Acknowledgements
        - [data and code] Credit authorship contribution statement
<br>
<br>
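
To make the problem with the bare end marker '3' concrete, here is a small sketch on made-up text (the section text and the `example.org` URL are invented). It contrasts the escaped literal '3', as used in v1, with a stricter marker that only accepts '3. ' at the start of a line. Anchoring to the line start is an assumption that goes slightly beyond the '3. ' idea listed above.

```python
import re

# Made-up section text: a URL containing the digit 3, followed by the next heading
text = ("Data and code availability\n"
        "Data are available at https://example.org/release-v3/data .\n"
        "3. Results\n"
        "We found ...")

start = text.find("Data and code availability")

# v1 behaviour: the bare end marker '3' matches the first '3' anywhere in the section
loose_end = re.search(re.escape("3"), text[start:])
loose = text[start:start + loose_end.end()]

# Stricter idea: '3. ' only counts as a heading at the start of a line
strict_end = re.search(r"^3\. ", text[start:], flags=re.MULTILINE)
strict = text[start:start + strict_end.start()]

print(repr(loose))   # stops inside the URL
print(repr(strict))  # stops right before the '3. Results' heading
```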

FOR FUTURE STEPS: 
- URLs do not necessarily link to the data. 
- A git repository can contain both data and code - but not always. 
- The dataset might only be mentioned by name and not linked (so far, I've only seen the names in camelcase). 
- QUESTION: How do we treat reviews that summarize data but do not contain new data? Is the reuse of a dataset not also the same as not containing new data?

<a name='sectionpatternsv2'></a>
### 1.1.2. Section patterns v2 
Based on my exploration of the performance of the first section patterns, I can see that they need to be rewritten. For version 2, I made a few edits: 
* Add variations
    * Section starts: 
        * Data/code availability statement 
    * Section ends: 
        * '\n\n' (this could be a general way to end the section) 
        * [data and code] Declaration of Competing Interest
        * [data and code] Acknowledgements
        * [data and code] Credit authorship contribution statement
* Change pattern containing numbers (e.g., '3' is now '3. ')
<br>
I investigate the next ten DOIs in downloadedPDFs_info.json to see exactly what text sections were extracted.

In [7]:
section_patterns_v2 = [
    (["Data and Code Availability", "Data Availability", "Data/code availability"], ["3. ", "CRediT authorship contribution statement", "Acknowledgements", "References", "Declaration of Competing Interests", "Credit authorship contribution statement", "\n\n"]),
    (["2.1."], ["2.2."]),
    (["Resource", "3.1."], ["3.2."]),
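    # Note: get_section passes every pattern through re.escape, so these regex-style entries are matched as literal text
    ([r"Fig.\d+", r"Fig.\d+\.?", r"Figure \d+"], [r"https?://[^\s]+"]),
    ([r"Tab.\d+", r"Table \d+\.?"], [r"https?://[^\s]+", r"[\w\s-]+\d{4}"]),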
    (["Introduction", "1. "], ["2. "]),
    (["Abstract"], ["1. ", "Introduction"])
]

In [8]:
# Empty list to store individual results
results_list_v2 = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    # Get the next 10 DOIs (indices 11-20)
    first_10_dois = doi_data['DOIs'][11:21]

    for doi in first_10_dois:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content function for each DOI
        section_content = get_content(pdf_path, alternative_pdf_directory, section_patterns_v2)

        # Create a dictionary for each result and add it to the list
        results_list_v2.append({"DOI": doi, "Section": section_content})

# Convert the list of dictionaries to a DataFrame
results_df2 = pd.DataFrame(results_list_v2)

../Data/ElsevierAPI/downloaded_pdfs/fulltext_editorialboard_doi/10.1016.S1053-8119(22)00043-X.pdf


In [9]:
results_df2

Unnamed: 0,DOI,Section
0,10.1016/j.neuroimage.2022.118931,Data availability \nThe Matlab code for the p...
1,10.1016/j.neuroimage.2022.119447,2.1. Participants \nThirty-ﬁve people (20 fe...
2,10.1016/j.neuroimage.2022.119403,Data and code availability statement \nData ...
3,10.1016/j.neuroimage.2021.118831,Data and code availability statements \nThe ...
4,10.1016/j.neuroimage.2022.119308,Data and code availability statement \nFinni...
5,10.1016/S1053-8119(22)00043-X,Editorial board
6,10.1016/j.neuroimage.2021.118792,data availability \nThe Shen 268 atlas is ava...
7,10.1016/j.neuroimage.2022.118890,data availability \nThe code used to run the ...
8,10.1016/j.neuroimage.2022.119339,Data and code availability statement \nThe h...
9,10.1016/j.neuroimage.2022.119295,Introduction \nReal-time functional magneti...


*In the following description, I refer to the index of the articles in results_df2.*

Observations from the text sections extracted with section_patterns_v2: 
- In 0, there are links, but these are not to the dataset - they write "The used data can be shared with other researchers upon reasonable request." 
- In 0 and 7, the next section is called 'Supplementary materials' - which means that my attempt at \n\n did not work.  
- In 1, the only mention of data was picked up in section 2.1.
- In 2, the 'Declaration  of Competing  Interest' was not picked up - it looks like it's because there are double spaces between the words. 
- In 3, the 'Credit authorship  contribution  statement' was not picked up - double spaces?
- In 6, the data section is called 'Code and data availability' - but it was picked up by 'data availability'. 
- In 6, there are multiple links mentioned - one for data (an atlas), one for the code, and one for the data. 
    - NB! When copying the URL for the data, it is broken up by the formatting: https://www.humanconnectome.org/study/hcp-young-adult/ document/1200-subjects-data-release - this is also the case for the atlas. 
- In 7 and 8, there are spaces in the URL. 
- In 8, the following section 'Declaration of Competing Interest' was not picked up. 
- In 9, the introduction was picked up: but it does not look like any data is analysed in this article. 

<br>
From this investigation I can see that I need to edit the section patterns further. 

<br>
<br>
TO DO: 

- Section_patterns that do not work: 
    - Section_end: \n\n
    - Section_end: 'Declaration  of Competing  Interest' + Section_end: 'Credit authorship  contribution  statement' + Section_end: 'Supplementary materials'
        - double-spaces between words mess these up
- Section_patterns I'm worried about: 
    - Section_start: 2.1. - what if it's '2.1'?
    - Section_start: 'Code and data availability'
- Undiscovered section_patterns: 
    - Section_end: 'Ethics statement'
- Worries 
    - How to get the name of the dataset itself and the url
        - The URL can be broken up by spaces (due to line breaks in the PDF) - can I find a way to work out what the entire URL is? 
            - Are there any slashes in the text ahead? A parenthesis, dot, comma, or another symbol might end the URL. 
    - If someone uses e.g., HCP, do they use all of the data? Do I need to catch more text-sections to learn this (in relation to the discussion of significance testing - if they use different parts of the dataset, they are not testing on the same). 
        - "Due to HCP and dHCP privacy policies, the preprocessed resting-state images of human adults and neonates (with their IDs) can only be shared upon request with qualified investigators who agree to the Restricted Data Use Terms of these two datasets." (from 10.1016/j.neuroimage.2022.119339)
    - What if the article does not analyse any data? (e.g., 10.1016/j.neuroimage.2022.119295 presents a software package for the execution of RT-fMRI experiments.) 
- FUNCTION GET_CONTENT: Make a comment about trying the "Editorial board" texts in the other file - just so I don't get an "Error reading PDF:" 
    - Make an addition to 'get_section' so that it says 'Editorial board' instead of None for the section text. 
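
One possible way to approach the broken-URL worry is sketched below. It follows the idea of looking for slashes in the text ahead: a URL fragment is only continued when it ends in '/', '.' or '-' and the next token contains a slash. The function names `rejoin_url` and `find_urls` and the exact rule are assumptions for illustration and have not been checked against the full corpus. The sample text is stitched together from the two broken URLs noted above.

```python
import re

def rejoin_url(text, start):
    """Heuristically rebuild a URL that starts at text[start] and may be split by whitespace.

    Assumed rule: keep appending the next token while the URL so far ends in
    '/', '.' or '-' and that token contains a slash.
    """
    tokens = text[start:].split()
    url = tokens[0]
    for token in tokens[1:]:
        continues = url[-1] in "/.-" and "/" in token and not token.startswith("http")
        if not continues:
            break
        url += token
    # Trailing punctuation usually belongs to the sentence, not the URL
    return url.rstrip(".,;)")

def find_urls(text):
    """Return all (possibly rejoined) URLs in the text."""
    return [rejoin_url(text, m.start()) for m in re.finditer(r"https?://", text)]

sample = ("data on Zenodo.org:  https://doi.org/10.  \n5281/zenodo.5791233  \n"
          "The used data can be shared. See https://www.humanconnectome.org/study/"
          "hcp-young-adult/ document/1200-subjects-data-release for details.")
urls = find_urls(sample)
print(urls)
```

This rule would not repair a URL such as the GitHub link in index 0, where the break falls around underscores, so it is only a starting point.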


### 1.1.3. Section patterns v3 

I want to make a regex_pattern work, as it seems like a double space after 

TO DO: 
- Section_patterns that do not work: 
    - Section_end: \n\n
    - Section_end: 'Declaration  of Competing  Interest' + Section_end: 'Credit authorship  contribution  statement' + Section_end: 'Supplementary materials'
        - double-spaces between words messed these up. 
- Section_patterns I'm worried about: 
    - Section_start: '2.1.' - what if it's '2.1'?
    - Section_start: 'Code and data availability'
- Undiscovered section_patterns: 
    - Section_end: 'Ethics statement'
- Make the titles case sensitive, since it seems like most titles only capitalize the first word (see investigation in ../Code/articles_groundtruth.ipynb under 'Ground truth/Investigation/Section titles')
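
The double-space problem from the list above can be handled by building each title pattern from its words, with `\s+` between them. The sketch below shows the idea; the helper name `title_pattern` and the example text are made up, and matching is kept case sensitive in line with the last item of the list.

```python
import re

def title_pattern(title):
    """Build a regex that matches a section title with any whitespace between its words."""
    words = [re.escape(word) for word in title.split()]
    return r"\s+".join(words)

end_titles = [
    "Declaration of Competing Interest",
    "Credit authorship contribution statement",
    "Supplementary materials",
]
end_regex = re.compile("|".join(title_pattern(t) for t in end_titles))

# Made-up extract with the double spaces that pypdf tends to produce
extracted = ("upon reasonable  \nrequest.  \nDeclaration  of Competing  Interest  \n"
             "The authors declare ...")

literal = re.search(re.escape("Declaration of Competing Interest"), extracted)
flexible = end_regex.search(extracted)
print(literal)           # the single-space literal does not match
print(flexible.group())  # the whitespace-tolerant pattern does
```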


In [10]:
def get_content_regex(pdf_path, alt_pdf_path, section_patterns):
    """Extract the first section of a PDF that matches one of the regex pattern pairs.

    Parameters:
    :param pdf_path (str): Path to the PDF file.
    :param alt_pdf_path (str): Directory with the 'Editorial board' PDFs (kept for compatibility with get_content).
    :param section_patterns (list of tuple): Pairs of (start regex, end regex) passed to get_section_regex.

    Returns:
    :returns tuple: (section text, matched start pattern, start match, end match).
               The section text is 'Editorial board' if the PDF is not in the main directory.
               All four values are '' if nothing matched or the PDF could not be read.
    """
    try:
        # The context manager closes the file even though we return from inside the try block
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = pypdf.PdfReader(pdf_file)
            # Read the entire PDF content
            pdf_text = " ".join(page.extract_text() for page in pdf_reader.pages)
        
        # Extract sections using the provided section patterns
        return get_section_regex(pdf_text, section_patterns)
        
    except FileNotFoundError:
        # A PDF missing from the main directory is treated as an 'Editorial board' item,
        # whether or not it exists in the alternative directory
        return 'Editorial board', '', '', ''
    
    except Exception as e:
        print(f"Error reading PDF: {e}")
        # Return the same shape as a non-match so the caller can still unpack four values
        return '', '', '', ''


def get_section_regex(article, section_patterns):
    """This function extracts text sections from articles based on provided regex patterns.

    Parameters:
    :param article (str): The text of the article.
    :param section_patterns (list of tuple): A list of tuples containing start and end regex patterns.

    Returns:
    :returns tuple: A tuple containing the extracted section text, the start pattern that matched,
               the start match object, and the end match object. The end match is searched in
               article[start_idx:], so its span is relative to the start of the section.
               If no section is found, it returns ('', '', '', '').
    """
    # Iterate through each pattern pair, in order of priority
    for start_pattern, end_pattern in section_patterns:
        # Iterate through each match of the start pattern in the article
        for start_match in re.finditer(start_pattern, article):
            start_idx = start_match.start()  # Get the start position of the start match

            # Search for the end pattern from the start position of the start match onwards
            end_match = re.search(end_pattern, article[start_idx:])
            
            if end_match:
                end_idx = start_idx + end_match.start()  # Calculate the end position of the section
                section_text = article[start_idx:end_idx].strip()  # Extract the section text

                # Return the section text, the pattern that matched, and both match objects
                return section_text, start_pattern, start_match, end_match

    # If no match is found, return four empty strings
    return '', '', '', ''

In [11]:
section_patterns_regex = [
    (r'(?<![\'"]) \s*?\n?Data\s+and\s+code\s+availability |(?<![\'"]) \s*?\n?Data\s+availability |(?<![\'"]) \s*?\n?Data/code\s+availability', r'\s*?\n\n |\s*?\n?3\. | \s*?\n?CRediT\s+authorship\s+contribution\s+statement(?:s)? | \s*?\n?Acknowledgement(?:s)? | \s*?\n?Acknowledgment(?:s)? | \s*?\n?Reference(?:s)? | \s*?\n?Declaration\s+of\s+Competing\s+Interest(?:s)? | \s*?\n?Credit\s+authorship\s+contribution\s+statement(?:s)? | \s*?\n?Funding | \s*?\n?Supplementary\s+materials | \s*?\n?Ethic(?:s)? statement(?:s)?'),
    (r'\n?2\.1\.', r'\n?2\.2. | \n\n '),
    (r'\n?Resource | \n?3\.1\.\s*?\n?', r'\n?3\.2.\s*?| \s*?\n\n '),
    (r'\n?Introduction\s*?\n? | \s*?\n?1\.\s*?\n? ', r'\s*?\n?2\.\s*?\n? | \s*?\n\n '),
    (r'\n?Fig\.\d+ | \n?Fig\.\d+\.? | \n?Figure \d+', r'https?://[^\s]+ | \s*?\n\n '),
    (r'\n?Tab\.\d+ | \n?Table \d+\.?', r'https?://[^\s]+ | [\w\s-]+\d{4} | \s*?\n\n '),
    (r'\n?Abstract\s*?\n? | \s*?\n?1\.\s*?\n? ', r'\n?Introduction\s*?\n? | \s*?\n\n ')
]
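
One detail of the patterns above is easy to miss: without `re.VERBOSE`, the spaces around `|` are part of the alternatives, so each title or number must be followed (or preceded) by a literal space to match. This may well be intended, since the extracted headings often carry a trailing space, but the small sketch below, on a made-up string, shows the difference in case it is not.

```python
import re

sample = "\n2.2.\nStimuli"  # made-up text: heading number followed directly by a newline

# Normal mode: the space before '|' belongs to the first alternative,
# so '2.2.' must be followed by a space
spaced = re.search(r"\n?2\.2\. | \n\n ", sample)

# Same alternatives without the literal spaces
tight = re.search(r"\n?2\.2\.|\n\n", sample)

# With re.VERBOSE, unescaped whitespace in the pattern is ignored
verbose = re.search(r"\n?2\.2\. | \n\n ", sample, flags=re.VERBOSE)

print(spaced)
print(tight)
print(verbose)
```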

In [12]:
# Empty list to store individual results
results_list_regex = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    # Get the first X DOIs
    first_dois = doi_data['DOIs'][11:21] # 0:10 to compare with results_df, 11:21 to compare with results_df2

    for doi in first_dois:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content_regex function for each DOI
        section_content_regex, matched_pattern, start_match, end_match = get_content_regex(pdf_path, alternative_pdf_directory, section_patterns_regex)

        # Create a dictionary for each result and add it to the list
        results_list_regex.append({"DOI": doi, "Section": section_content_regex, "Matched_pattern": matched_pattern, "Start_pattern": start_match, "End_pattern": end_match})

# Convert the list of dictionaries to a DataFrame
results_df_regex = pd.DataFrame(results_list_regex)

results_df: DOIs 0-9, i.e., [0:10]

results_df2: DOIs 11-20, i.e., [11:21]

In [13]:
results_df2['Section'].loc[0]

'Data availability  \nThe Matlab code for the proposed  vein segmentation  algorithm  \nis available  on github: https://github.com/SinaStraub/GRE  _ vessel _ \nseg.git and example  data on Zenodo.org:  https://doi.org/10.  \n5281/zenodo.5791233  \nThe used data can be shared with other researchers  upon reasonable  \nrequest.  \nSupplementary  materials  \nSupplementary  material  associated  with this article can be found, in \nthe online version,  at doi:10.1016/j.neuroimage.2022.118931  . \nAppendices  \nAll equations  are to be understood  voxel-wise,  however,  spatial co- \nordinates  are omitted  when possible.  \nA. True susceptibility-weighted  images \nIn contrast  to susceptibility-weighted  data, true susceptibility-  \nweighted  images (tSWI) are generated  using susceptibility  masks 𝑊 in- \nstead of phase masks ( Liu et al., 2014 ), \n𝑡𝑆𝑊 𝐼 = 𝑚𝑎𝑔 ⋅𝑊 𝑛 , where 𝑊 = ⎧ \n⎪ \n⎨ \n⎪ ⎩ 1 , 𝑓𝑜𝑟 𝜒≤ 𝜒1 , \n1 − 𝜒− 𝜒1 \n𝜒2 − 𝜒1 , 𝑓𝑜𝑟 𝜒1 < 𝜒≤ 𝜒2 , \n0 , 𝑓𝑜𝑟 𝜒> χ2 , (A.1) \nwhere [ 𝜒

In [14]:
results_df_regex['Section'].loc[0]

'Data availability  \nThe Matlab code for the proposed  vein segmentation  algorithm  \nis available  on github: https://github.com/SinaStraub/GRE  _ vessel _ \nseg.git and example  data on Zenodo.org:  https://doi.org/10.  \n5281/zenodo.5791233  \nThe used data can be shared with other researchers  upon reasonable  \nrequest.'

**Fixed issues**: 
- Edited the get_content_regex function to be case sensitive instead of case insensitive 
    - When searching in all lowercase, results_df2['Section'].loc[2] is cut short, presumably because 'funding' in the body text then matches the 'Funding' end pattern
        - From 'Data and code availability  statement  \nData used in the study are available  upon direct request.  Conditions  \nfor its sharing  involve  the formalisation  of a research  agreement.  The \ndata and code sharing  adopted  by the authors  comply  with the require-  \nments of the funding  body or institute,  and with the institutional  ethics \napproval.  Parts of the data are conﬁdential  and additional  ethical ap- \nproval may be needed  for re-use. \n'
        - To: 'Data and code availability  statement  \nData used in the study are available  upon direct request.  Conditions  \nfor its sharing  involve  the formalisation  of a research  agreement.  The \ndata and code sharing  adopted  by the authors  comply  with the require-  \nments of the'
        
**Persisting issues**: 
- Reading the PDF 
    - At page breaks, the page header is picked up (results_df_regex['Section'].loc[6], DOI 10.1016/j.neuroimage.2022.118986)
    - Double (or more) spaces
    - \n characters 
- Section titles 
    - There are variations of section_start titles that I have not included in my pattern, e.g., "Data Availability", which I discovered in articles_groundtruth
    - There are likely many more section_end titles that I have not discovered yet and therefore have not included in my pattern. 

NB! THIS TAKES MORE THAN AN HOUR TO RUN!
One run started at 16:09, had not finished when I checked after 17:40, and was done when I looked again at 18:15. 

In [16]:
# Empty list to store individual results
results = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    for doi in doi_data['DOIs']:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content_regex function for each DOI 
        section_content_regex, matched_pattern, start_match, end_match = get_content_regex(pdf_path, alternative_pdf_directory, section_patterns_regex)

        # Create a dictionary for each result and add it to the list
        results.append({"DOI": doi, "Section": section_content_regex, "Matched_pattern": matched_pattern, "Start_pattern": start_match, "End_pattern": end_match})

# Convert the list of dictionaries to a DataFrame
articles_dataset_sections = pd.DataFrame(results)

NB! The code above takes between one and two hours to run. 
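
If most of that time is spent extracting text from the PDFs rather than in the regex search (an assumption, as this has not been measured here), caching the extracted text would make it much cheaper to re-run the patterns. The sketch below is one way to do that. The function `load_or_extract` and the cache layout are hypothetical, and the extractor is passed in as a callable so the example runs without any PDFs.

```python
import json
import os
import tempfile

def load_or_extract(pdf_path, cache_dir, extract_text):
    """Return the text of a PDF, reading it from a JSON cache when available.

    `extract_text` is any callable that takes a path and returns a string,
    e.g. a small wrapper around pypdf.PdfReader (not shown here).
    """
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, os.path.basename(pdf_path) + ".json")
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f)["text"]
    text = extract_text(pdf_path)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump({"text": text}, f)
    return text

# Demo with a stand-in extractor that records how often it is called
calls = []

def fake_extract(path):
    calls.append(path)
    return "Data availability \nExample text."

with tempfile.TemporaryDirectory() as cache_dir:
    first = load_or_extract("some_article.pdf", cache_dir, fake_extract)
    second = load_or_extract("some_article.pdf", cache_dir, fake_extract)

print(first == second, len(calls))
```

With a cache like this, only the first pass over the PDFs pays the extraction cost; later pattern experiments read the stored text.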

In [17]:
articles_dataset_sections

Unnamed: 0,DOI,Section,Matched_pattern,Start_pattern,End_pattern
0,10.1016/j.neuroimage.2022.119451,Data and code availability statements \nSpec...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(25302, 25330), match='...","<re.Match object; span=(705, 709), match=' 3. '>"
1,10.1016/j.neuroimage.2022.119632,Data and code availability \nThe data incorpo...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(39700, 39730), match='...","<re.Match object; span=(689, 727), match=' \nD..."
2,10.1016/j.neuroimage.2022.119584,Data/code availability statement \nData and...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(26185, 26209), match='...","<re.Match object; span=(80, 86), match=' \n3. '>"
3,10.1016/j.neuroimage.2022.119550,Data and code availability \nAll data used in...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(104602, 104632), match...","<re.Match object; span=(479, 518), match=' \n..."
4,10.1016/j.neuroimage.2022.119710,Data and code availability statement \nAll i...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(70621, 70650), match='...","<re.Match object; span=(596, 641), match=' \nC..."
...,...,...,...,...,...
829,10.1016/j.neuroimage.2022.118922,Data and code availability statement \nThe b...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(48329, 48358), match='...","<re.Match object; span=(328, 373), match=' \nC..."
830,10.1016/j.neuroimage.2022.119713,"Data availability \nROI time series, along wi...","(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(58431, 58452), match='...","<re.Match object; span=(468, 507), match=' \n..."
831,10.1016/j.neuroimage.2022.119688,2.1. Participants \nTwelve healthy female p...,\n?2\.1\.,"<re.Match object; span=(9729, 9734), match='\n...","<re.Match object; span=(721, 727), match='\n2...."
832,10.1016/j.neuroimage.2022.118939,Data and code availability \nDe-identiﬁed da...,"(?<![\'""]) \s*?\n?Data\s+and\s+code\s+availabi...","<re.Match object; span=(50360, 50390), match='...","<re.Match object; span=(125, 152), match=' \nS..."


In [18]:
# Define the path to the 'Code-git/Data' directory
data_dir = os.path.join(os.pardir, 'Data')

# Define the file path
file_path = os.path.join(data_dir, 'articles_dataset_sections.csv')

# Save the DataFrame to CSV, overwriting the file if it exists
articles_dataset_sections.to_csv(file_path, index=False, mode='w')  

<a name='cleantextsections'></a>
## 1.2. Clean text sections
Before I continue to the extraction of the datasets from the text sections, I want to clean the current data a bit. This includes: 
- Clean the matching start patterns 
- Clean the extracted text sections, including 
    - Remove characters like '\n' 
    - Remove double (or more) spaces 
<br>
<br>
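
A minimal sketch of the text cleaning listed above could look as follows. The function name `clean_section` is made up, and the hyphen rule is an assumption: it treats every hyphen before a line break as a soft hyphen, which is right for 'require-  \nments' but would wrongly merge genuinely hyphenated words such as 'susceptibility-  \nweighted'.

```python
import re

def clean_section(text):
    """Normalise a raw section: rejoin hyphenated line breaks, drop newlines, collapse spaces."""
    # 'require-  \nments' -> 'requirements' (assumes the hyphen is only there because of the line break)
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Remaining newlines become spaces, and runs of whitespace collapse to a single space
    text = re.sub(r"\s+", " ", text)
    return text.strip()

raw = ("The \ndata and code sharing  adopted  by the authors  comply  with the require-  \n"
       "ments of the funding  body or institute")
cleaned = clean_section(raw)
print(cleaned)
```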

<a name='startpatterns'></a>
### 1.2.1. Start patterns 

In [19]:
# Path to the CSV file
csv_file_path = os.path.join(os.pardir, 'Data/articles_dataset_sections.csv') 

# Read the CSV file into a DataFrame
articles_dataset_sections = pd.read_csv(csv_file_path)

In [20]:
def extract_matched_text(text):
    """This function extracts matched text from a string containing a regular expression 
    match object and performs data cleaning.

    Parameters:
    :param text (str): A string containing a regular expression match object (e.g., "<re.Match object; span=(start, end), match='text'>").

    Returns:
    :returns: If a match is found in the input text, the function returns the matched text after performing the following operations:
        Removing literal '\\n' (newline) sequences from the matched text.
        Collapsing runs of two or more spaces into a single space.
        Stripping leading and trailing spaces.
    :returns: If no match is found or the resulting matched text is empty, the function returns NaN.
    """
    
    match = re.search(r"match='(.*?)'", str(text))
    if match:
        # Remove literal '\n' sequences, then collapse any run of spaces and trim
        matched_text = match.group(1).replace('\\n', '')
        matched_text = re.sub(r' {2,}', ' ', matched_text).strip()
        if matched_text:
            return matched_text
    return np.nan

In [21]:
# Apply the function to clean up the 'Start_pattern' column
articles_dataset_sections['Start_pattern_clean'] = articles_dataset_sections['Start_pattern'].apply(extract_matched_text)
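
An alternative to parsing the string form of the match objects would be to store plain values at extraction time. The string form is meant for display and may be shortened for long matches, so the sketch below keeps the matched text and its positions instead. The helper name `match_to_fields` and the example text are hypothetical.

```python
import re

def match_to_fields(match):
    """Turn a re.Match (or None) into plain values that survive a CSV round trip."""
    if match is None:
        return {"text": "", "start": None, "end": None}
    # Collapse whitespace in the matched title so that variants compare equal
    text = re.sub(r"\s+", " ", match.group()).strip()
    return {"text": text, "start": match.start(), "end": match.end()}

# Made-up article text with a double space inside the title
article = "... \nData and code  availability  statement \nAll data ..."
m = re.search(r"Data\s+and\s+code\s+availability", article)
fields = match_to_fields(m)
print(fields)
print(match_to_fields(None))
```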

Overview of how many articles match each of the section patterns. 

In [27]:
# Group by 'Matched_pattern' and count the number of rows in each group
pattern_counts = articles_dataset_sections['Matched_pattern'].value_counts()

# Count NaN values and add it to the pattern_counts Series
nan_count = articles_dataset_sections['Matched_pattern'].isna().sum()
pattern_counts['NaN'] = nan_count

# Create a DataFrame to store the results
articles_section_patterns = pd.DataFrame({
    'Matched_pattern': pattern_counts.index,
    'Count': pattern_counts.values
})

# Print the result DataFrame
print(articles_section_patterns)

# Calculate and print the total count
total_count = articles_section_patterns['Count'].sum()
print("Total Count:", total_count)

                                     Matched_pattern  Count
0  (?<![\'"]) \s*?\n?Data\s+and\s+code\s+availabi...    563
1                                          \n?2\.1\.    219
2        \n?Introduction\s*?\n? | \s*?\n?1\.\s*?\n?      22
3                     \n?Resource | \n?3\.1\.\s*?\n?      5
4       \n?Fig\.\d+ | \n?Fig\.\d+\.? | \n?Figure \d+      1
5                      \n?Tab\.\d+ | \n?Table \d+\.?      1
6                                                NaN     23
Total Count: 834
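
The count above adds the NaN group by hand. A minimal sketch of an alternative, assuming a toy Series in place of `articles_dataset_sections['Matched_pattern']`: `value_counts(dropna=False)` keeps the missing values as their own row, so the total can be checked in one step.

```python
import numpy as np
import pandas as pd

# Toy stand-in for articles_dataset_sections['Matched_pattern'] (values are made up)
matched = pd.Series(["pattern_a", "pattern_a", "pattern_b", np.nan], name="Matched_pattern")

# dropna=False keeps the missing values as their own row in the counts
counts = (
    matched.value_counts(dropna=False)
    .rename_axis("Matched_pattern")
    .reset_index(name="Count")
)
total = int(counts["Count"].sum())

print(counts)
print("Total Count:", total)
```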


I was only expecting to see 19 articles with NaN as a matched pattern (since there are 19 editorial board papers). 

In [24]:
# Filter and display rows where 'Start_pattern_clean' is NaN
no_pattern = articles_dataset_sections[articles_dataset_sections['Start_pattern_clean'].isna()]
len(no_pattern)

23

In [25]:
# Filter rows where 'Section' is not 'Editorial board'
no_pattern[no_pattern['Section'] != 'Editorial board']

Unnamed: 0,DOI,Section,Matched_pattern,Start_pattern,End_pattern,Start_pattern_clean
58,10.1016/j.neuroimage.2022.119560,,,,,
131,10.1016/j.neuroimage.2021.118776,,,,,
517,10.1016/j.neuroimage.2022.119154,,,,,
670,10.1016/j.neuroimage.2022.118921,,,,,


There should only be 19 articles without a pattern match, one for each 'Editorial board' entry. The four articles that were not filtered properly by my code are: 
- 10.1016/j.neuroimage.2022.119560
    - This has a section called 'Data Availability'.
- 10.1016/j.neuroimage.2021.118776
    - This article does not have any distinct sections. It lists all the articles in that particular volume of NeuroImage. 
- 10.1016/j.neuroimage.2022.119154
    - This article does not have any distinct sections. It is a commentary.     
- 10.1016/j.neuroimage.2022.118921
    - This article does not have any distinct sections. It is a corrigendum. 
<br>
<br>

Of the four articles that did not contain one of my start patterns, only one should have been picked up. The rest seem to have been filtered properly. 
<br>
<br>
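
One possible explanation for missing 10.1016/j.neuroimage.2022.119560 is capitalisation: the first start pattern spells 'availability' in lower case, so a heading written as 'Data Availability' only matches when the search ignores case. This is an assumption, since the PDF text and the way the pattern was applied are not shown here. The sketch uses a made-up snippet. Note also that the pattern contains literal spaces, so it requires a space before the heading and after 'availability'.

```python
import re

# First start pattern, as listed for section pattern 1
start_pattern = (
    r'(?<![\'"]) \s*?\n?Data\s+and\s+code\s+availability |'
    r'(?<![\'"]) \s*?\n?Data\s+availability |'
    r'(?<![\'"]) \s*?\n?Data/code\s+availability'
)

# Made-up snippet with a capitalised heading
snippet = "in the main text. \nData Availability \nAll data are available"

case_sensitive = re.search(start_pattern, snippet)
case_insensitive = re.search(start_pattern, snippet, re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```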

### 1.2.3. Clean text 
I will do a very simple initial cleaning of the extracted text sections. 

In [45]:
articles_dataset_sections['Section']

0      Data and code availability statements \nSpeciﬁ...
1      Data and code availability \nThe data incorpor...
2      Data/code availability statement \nData and co...
3      Data and code availability \nAll data used in ...
4      Data and code availability statement \nAll ind...
                             ...                        
829    Data and code availability statement \nThe bra...
830    Data availability \nROI time series, along wit...
831    2.1. Participants \nTwelve healthy female part...
832    Data and code availability \nDe-identiﬁed data...
833    2.1. Subjects \nThe data were 156 baseline [11...
Name: Section, Length: 834, dtype: object

In [65]:
# Apply the literal replacements in order to the whole column (vectorised, no chained assignment)
replacements = [('   ', ' '), ('  ', ' '), ('\n', ''), ('- ', '-'), ('( ', '('), ('/ ', '/'), (' )', ')'), (' .', '.'), (': /', ':/'), (' _ ', '_'), (' _', '_'), ('_ ', '_')]
section = articles_dataset_sections['Section'].astype(str)
for old, new in replacements:
    section = section.str.replace(old, new, regex=False)
articles_dataset_sections['Section'] = section
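
The extracted text also contains typographic ligatures such as 'ﬁ' (visible in 'Speciﬁcally'), which the replacements above leave untouched. A minimal sketch of one way to fold them into plain letters, assuming Unicode NFKC normalisation is acceptable for this text (it also rewrites other compatibility characters); `normalise_section` is a hypothetical helper, not part of the notebook.

```python
import re
import unicodedata

def normalise_section(text):
    """Fold ligatures such as 'ﬁ' into plain letters and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "Data and code availability statements \nSpeciﬁcally, GES, PC and LiNGAM"
cleaned = normalise_section(sample)
print(cleaned)
```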


<a name='getdatasets'></a>
## 1.3. Get datasets
I need to extract the datasets from the text sections extracted above. 

Based on my previous observations, I will start the extraction with the following notions in mind: 
- Private datasets (meaning either fully private or available upon request) 
    - Markers include phrases such as "request", "no data", "new data", and "not be shared", e.g., 
        - "Data and code are available upon request."
        - "Data and code availability statement All individual-level raw data used in this study cannot be shared because of the ethical code of Tokyo Metropolitan University. How-ever, the acquired metadata (e.g., group level activation maps) are available upon request. The corresponding author should be contacted by email for all data requests."
        - "No data were acquired for this study."
        - "The review summarizes data but does not contain new data."
        - "The data and code presented here are available upon request to the corresponding author."
- Public datasets (meaning the data are available to everyone via a link or the title of the dataset)
    - Markers include hyperlinks and camel case 
        - Hyperlink 
        - Camel case 
    - Words like "code", "data", or "package" typically appear in the sentences with links, pointing to what the link refers to. 
    
- Issues (**code**)
    - The URL can be broken up by spaces due to line changes in the PDF. Do we stop at the parenthesis, comma or another symbol that might end the URL? 
        - EXAMPLES 
    - Not all links point to the dataset - some are to the code, e.g., 
        - "Speciﬁcally, GES, PC and LiNGAM were implemented using the widely used R package pcalg , which is available at https://cran.r-project.org/web/packages/pcalg/. Notears method was implemented using Python available at https://github.com/xunzheng/notears . The proposed joint DAG method was implemented with Python and the code is available at https://github.com/gmeng92/joint-notears . The cohort data is accessible through the website (https://coins.trendscenter.org/) of COINS (COllaborative Infor-matics Neuroimaging Suite) database (Scott et al., 2011)."
        - "Data and code availability The data incorporated in the primary analysis were gathered from the public UK Biobank resource and will be made pub-licly available together with the code used to generate the data through the UK Biobank Returns Catalogue (https://biobank.ndph. ox.ac.uk/showcase/docs.cgi?id = 1). ABCD study data release 3.0 is available for approved researchers in NIMH Data Archive (NDA DOI:10.151.54/1,519,007). Code for conducting discovery and replication is available at https: //github.com/robloughnan/MOSTest _ generalization . Code for simu-lations is available at https://github.com/precimed/mostest/tree/master/simu."    
    
- Issues (**analysis**)
    - If an article uses, e.g., HCP, does it use all of the data? I may need to capture more text sections to find out. This matters for the discussion of significance testing: articles that use different parts of a dataset are not testing on the same data. 
        - "Due to HCP and dHCP privacy policies, the preprocessed resting-state images of human adults and neonates (with their IDs) can only be shared upon request with qualified investigators who agree to the Restricted Data Use Terms of these two datasets." (from 10.1016/j.neuroimage.2022.119339)
    - What if the article does not analyse any data? (e.g., 10.1016/j.neuroimage.2022.119295 presents a software package for the execution of RT-fMRI experiments.) 
    - What if there are multiple sections and the text is slightly different? (e.g., 10.1016/j.neuroimage.2022.118986)
<br>
<br>
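
The observation that words like "code", "data", or "package" tend to sit next to a link can be sketched as a simple labelling step. This is only an illustration under stated assumptions: `label_links`, the 80-character window, and the simplified URL pattern are hypothetical choices, and the keyword closest before the link is taken as its label. Note that "data" also matches inside "database".

```python
import re

URL_RE = re.compile(r'(?:https?://|www\.)[^\s()]+', re.IGNORECASE)
KEYWORDS = ("code", "data", "package")

def label_links(text, window=80):
    """Pair each link with the keyword closest before it (or None)."""
    labelled = []
    for m in URL_RE.finditer(text):
        context = text[max(0, m.start() - window):m.start()].lower()
        # Position of the last occurrence of each keyword in the context window
        positions = {kw: context.rfind(kw) for kw in KEYWORDS}
        best = max(positions, key=positions.get)
        label = best if positions[best] != -1 else None
        labelled.append((m.group(0).rstrip('.,;'), label))
    return labelled

text = (
    "The proposed joint DAG method was implemented with Python and the code is "
    "available at https://github.com/gmeng92/joint-notears . The cohort data is "
    "accessible through the website (https://coins.trendscenter.org/) of COINS"
)
links = label_links(text)
for url, label in links:
    print(label, url)
```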

TO DO Columns: 
- (DONE) Section text 
- (DONE) Section pattern, for two reasons: 
    1. I can get a sense of whether the data statement is common in NeuroImage. 
    2. I can go back and handle potentially more difficult cases. 
- Extracted dataset 
<br>
<br>

### 1.3.1. Pattern 1 
I will start by examining and dealing with the text sections that were filtered by the first section_pattern, namely: 

    r'(?<![\'"]) \s*?\n?Data\s+and\s+code\s+availability |(?<![\'"]) \s*?\n?Data\s+availability |(?<![\'"]) \s*?\n?Data/code\s+availability' 
 
The corresponding ending pattern: 

    r'\s*?\n\n |\s*?\n?3\. | \s*?\n?CRediT\s+authorship\s+contribution\s+statement(?:s)? | \s*?\n?Acknowledgement(?:s)? | \s*?\n?Acknowledgment(?:s)? | \s*?\n?Reference(?:s)? | \s*?\n?Declaration\s+of\s+Competing\s+Interest(?:s)? | \s*?\n?Credit\s+authorship\s+contribution\s+statement(?:s)? | \s*?\n?Funding | \s*?\n?Supplementary\s+materials | \s*?\n?Ethic(?:s)? statement(?:s)?'
 

In [66]:
articles_dataset_sections['Matched_pattern'].loc[0]

'(?<![\\\'"]) \\s*?\\n?Data\\s+and\\s+code\\s+availability |(?<![\\\'"]) \\s*?\\n?Data\\s+availability |(?<![\\\'"]) \\s*?\\n?Data/code\\s+availability'

In [67]:
pat_1 = articles_dataset_sections[articles_dataset_sections['Matched_pattern'] == '(?<![\\\'"]) \\s*?\\n?Data\\s+and\\s+code\\s+availability |(?<![\\\'"]) \\s*?\\n?Data\\s+availability |(?<![\\\'"]) \\s*?\\n?Data/code\\s+availability'].copy()

In [82]:
# List of words that seem to indicate that the authors used a private dataset 
words_to_check = ['data\s+request', 'data\s+re-quest' 'no data', 'no?\s+new data', 'data\s+not be shared']

# Regex pattern to match any of the words
pattern = '|'.join(re.escape(word) for word in words_to_check)

# Check for the presence of the words
pat_1['contains_words'] = pat_1['Section'].str.contains(pattern, case=False, regex=True)

# Split the DataFrame into two sets 
pat_1_priv = pat_1[pat_1['contains_words']]
pat_1_pub = pat_1[~pat_1['contains_words']]

# Drop the temporary 'contains_words' column if you don't need it
#pat_1.drop(columns=['contains_words'], inplace=True)


In [83]:
print(len(pat_1_priv))
print(len(pat_1_pub))

0
563
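
The 0 / 563 split is probably an artefact of how the pattern is built rather than a property of the data: `re.escape` turns `\s+` into literal characters, and the missing comma after `'data\s+re-quest'` silently joins that string with `'no data'`. Below is a minimal sketch of a corrected check. The marker list is my own assumption based on the example sentences noted earlier, and the sample strings are short excerpts from those examples.

```python
import re
import pandas as pd

# Short excerpts from the example statements (the last one is a public-code case)
sections = pd.Series([
    "Data/code availability statement Data and code are available upon request.",
    "Data and code availability No data were acquired for this study.",
    "Data and code availability statements : The review summarizes data but does not contain new data.",
    "The codes are currently available as a form of R package at https://github.com/junjypark/CLEAN.",
])

# re.escape makes \s+ literal, so the escaped pattern cannot match ordinary text
escaped = re.escape(r"data\s+request")
escaped_matches = re.search(escaped, "data request") is not None

# Raw regex fragments: a comma after every item and no re.escape
private_markers = [
    r"upon\s+(?:reasonable\s+)?re-?\s?quest",
    r"\bno\s+data\b",
    r"\bno(?:t\s+contain)?\s+new\s+data\b",
    r"(?:cannot|can\s+not|not)\s+be\s+shared",
]
private_pattern = "|".join(f"(?:{m})" for m in private_markers)

is_private = sections.str.contains(private_pattern, case=False, regex=True)
print(is_private.tolist())
```

Only non-capturing groups are used, so `str.contains` does not warn about match groups.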


In [92]:
# Sample text containing links
text = """
'Data and code availability statements Speciﬁcally, GES, PC and LiNGAM were implemented using the widely used R package pcalg , which is available at https://cran.r-project.org/web/packages/pcalg/. Notears method was implemented using Python available at https://github.com/xunzheng/notears. The proposed joint DAG method was implemented with Python and the code is available at https://github.com/gmeng92/joint-notears. The cohort data is accessible through the website (https://coins.trendscenter.org/) of COINS (COllaborative Infor-matics Neuroimaging Suite) database (Scott et al., 2011). Further information can also be found in (Gollub et al., 2013).',
       'Data and code availability The data incorporated in the primary analysis were gathered from the public UK Biobank resource and will be made pub-licly available together with the code used to generate the data through the UK Biobank Returns Catalogue (https://biobank.ndph. ox.ac.uk/showcase/docs.cgi?id = 1). ABCD study data release 3.0 is available for approved researchers in NIMH Data Archive (NDA DOI:10.151.54/1,519,007). Code for conducting discovery and replication is available at https://github.com/robloughnan/MOSTest_generalization. Code for simu-lations is available at https://github.com/precimed/mostest/tree/master/simu.',
       'Data/code availability statement Data and code are available upon request.',
       'Data and code availability All data used in this project is from the Human Connec-tome Project (HCP) (www.humanconnectome.org). This data is publicly available to researchers who agree to the data use terms (www.humanconnectome.org/study/hcp-young-adult/data-use-terms). All HCP data may be downloaded through the ConnectomeDB (db.humanconnectome.org). ARCHI database is available upon request to Cyril Poupon at the email cyril.poupon@gmail.com.',
       'Data and code availability statement All individual-level raw data used in this study cannot be shared because of the ethical code of Tokyo Metropolitan University. How-ever, the acquired metadata (e.g., group level activation maps) are available upon request. The corresponding author should be contacted by email for all data requests. The TDT (the toolbox for MVPA) and codes used for the cross-validation approach, cross-classiﬁcation ap-proach, and RSA are freely available online (https://drive.google.com/ﬁle/d/1kl6TMf7b3gndbkGDfbP2VhKcBxK_qnio/view).',
       'Data and code availability No data were acquired for this study. The software required to gen-erate the vector spherical harmonics described in this paper is made freely available on the ﬁrst author’s GitHub page (https://github.com/tierneytim/OPM). The key function is spm_opm_vslm. Examples and tests can also be found on GitHub (https://github.com/tierneytim/OPM/blob/master/testScripts/testVSM.m).',
       'Data availability and reproducibility All code ﬁles used in this manuscript are available at https://github. com/andyrevell/revellLab. All de-identiﬁed raw and processed data (ex-cept for patient MRI imaging) are available for download by following the links on the GitHub. Raw imaging data is available upon reasonable request from Principal Investigator K.A.D. iEEG snippets used speciﬁ-cally in this manuscript are also available, while full iEEG recordings are publicly available at https://www.ieeg.org. The Python environment for the exact packages and versions used in this study in contained in the 13 A.Y. Revell, A.B. Silva, T.C. Arnold et al. NeuroImage 254 (2022) 118986 environment directory within the GitHub. The QSIPrep docker container was used for DWI preprocessing. 5.',
       'Data and code availability statement Data used in preparation of this article were obtained from the Hu-man Connectome Project, which are publicly available for download. Please refer Section 3.1 of the main manuscript. The codes for conduct-ing statistical analyses using CLEAN are currently available as a form of R package at https://github.com/junjypark/CLEAN.',
       'Data and code availability statement The EEG/MEG data used in this study are openly available datasets, and are also available from the MRC Cognition And Brain Sciences’ data repository on request. The code used to analyze the EEG/MEG data is openly available on github: https://github.com/olafhauk/EEGMEG_ResolutionAtlas.',
       'Data and code availability statements : The review summarizes data but does not contain new data.',
       'Data availability The data and code presented here are available upon request to the corresponding author.'],
"""

# Define a regular expression pattern to match links: an http(s):// or www. prefix
# followed by everything up to the next whitespace or parenthesis
pattern = r'(?:https?://|www\.)[^\s()]+'

# Find all matches in the text
matches = re.findall(pattern, text, re.IGNORECASE)

# Print the extracted links without trailing sentence punctuation
for match in matches:
    print(match.rstrip('.,;'))
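
Some links in the sample text are split by spaces from PDF line breaks, e.g. `https://biobank.ndph. ox.ac.uk/...` and `https://github. com/...`. A minimal sketch of a repair step that could run before link extraction; `repair_urls` and its two heuristics are assumptions, and they will not catch every kind of break (a space inside a path segment, for instance, is left alone).

```python
import re

def repair_urls(text):
    """Rejoin URLs that were split by a space after the scheme or after a dot."""
    # "https: //host" -> "https://host"
    text = re.sub(r'(https?):\s+//', r'\1://', text)
    # "host. tld/path" -> "host.tld/path"; only joins when a lowercase token
    # followed by "." or "/" comes next, so "... pcalg/. Notears" stays split
    text = re.sub(r'((?:https?://|www\.)\S*\.)\s+(?=[a-z0-9-]+[./])', r'\1', text)
    return text

broken = (
    "Returns Catalogue (https://biobank.ndph. ox.ac.uk/showcase/docs.cgi). "
    "Code is available at https: //github.com/robloughnan/MOSTest and at "
    "https://cran.r-project.org/web/packages/pcalg/. Notears method was used."
)
repaired = repair_urls(broken)
print(repaired)
```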

In [87]:
pat_1_pub['Section'].loc[:10].values

array(['Data and code availability statements Speciﬁcally, GES, PC and LiNGAM were implemented using the widely used R package pcalg , which is available at https://cran.r-project.org/web/packages/pcalg/. Notears method was implemented using Python available at https://github.com/xunzheng/notears. The proposed joint DAG method was implemented with Python and the code is available at https://github.com/gmeng92/joint-notears. The cohort data is accessible through the website (https://coins.trendscenter.org/) of COINS (COllaborative Infor-matics Neuroimaging Suite) database (Scott et al., 2011). Further information can also be found in (Gollub et al., 2013).',
       'Data and code availability The data incorporated in the primary analysis were gathered from the public UK Biobank resource and will be made pub-licly available together with the code used to generate the data through the UK Biobank Returns Catalogue (https://biobank.ndph. ox.ac.uk/showcase/docs.cgi?id = 1). ABCD study data r

The other section patterns: 

    (r'\n?2\.1\.', r'\n?2\.2. | \n\n '),
    (r'\n?Resource | \n?3\.1\.\s*?\n?', r'\n?3\.2.\s*?| \s*?\n\n '),
    (r'\n?Introduction\s*?\n? | \s*?\n?1\.\s*?\n? ', r'\s*?\n?2\.\s*?\n? | \s*?\n\n '),
    (r'\n?Fig\.\d+ | \n?Fig\.\d+\.? | \n?Figure \d+', r'https?://[^\s]+ | \s*?\n\n '),
    (r'\n?Tab\.\d+ | \n?Table \d+\.?', r'https?://[^\s]+ | [\w\s-]+\d{4} | \s*?\n\n '),
    (r'\n?Abstract\s*?\n? | \s*?\n?1\.\s*?\n? ', r'\n?Introduction\s*?\n? | \s*?\n\n ')
]
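
A minimal sketch of how such (start, end) pairs could be applied in order, assuming the first start pattern that matches decides the section. `extract_section` is a hypothetical stand-in and not the notebook's `get_section`; the sample text is made up.

```python
import re

def extract_section(text, section_patterns):
    """Return (start_pattern, section_text) for the first start pattern that matches."""
    for start_pattern, end_pattern in section_patterns:
        start = re.search(start_pattern, text)
        if start is None:
            continue
        # Look for the end pattern only after the start match
        end = re.search(end_pattern, text[start.end():])
        stop = start.end() + end.start() if end else len(text)
        return start_pattern, text[start.start():stop].strip()
    return None, ""

patterns = [(r'\n?2\.1\.', r'\n?2\.2. | \n\n ')]
sample = "1. Introduction \n2.1. Participants \nTwelve healthy participants. \n2.2. Imaging \nMRI"
matched_pattern, section = extract_section(sample, patterns)
print(repr(section))
```

Returning the pattern together with the text makes it possible to record which pattern matched, as in the `Matched_pattern` column.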

# Save datasets 

- Store the extracted datasets for further analysis 

# X. References

- Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022)
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget

In [None]:
def get_section_v1(article, section_patterns):
    """Get a section from a research paper. 
    
    Parameters: 
    :param article (str): Text contents of the research paper.
    :param section_patterns (list): A list of strings to indicate the start and end of the dataset section.

    Returns:
    :return: Substring of the text region between the section start and a potential section end, or "" if it is not found.
    
    This function is adapted from Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022) with some alterations.
    """
    contents_lower = article.lower()  # Convert contents to lowercase
    
    # THE CODE BELOW DOES WHAT I WANT IT TO DO
    #test_start = r'data and code'
    #test_end = r'availability'
    #idx0 = contents_lower.find(test_start)
    #if idx0 != -1:
        #idxend = contents_lower.find(test_end, idx0)  # Start searching for test_end from idx0
        #if idxend != -1:
            #section = article[idx0:idxend]  # "+ len(test_end)" to include the end pattern in the extracted section
            #print(section)

    # If no match is found, return an empty string
    return "" 


def get_content_v1(pdf_path, section_patterns):
    """Extract the dataset section from a PDF. 
    This function is loosely interpreted from Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget
    specifically: DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures'.
    
    Parameters: 
    :param pdf_path (str): Path to the PDF file.
    :param section_patterns (list): A list of strings to indicate the start and end of the dataset section.
    
    Returns: 
    :return: Extracted content, 'Editorial board' if the PDF file is not found, or None if no section is matched.
    """
    
    try:
        # Open the PDF in a context manager so the file is closed even on an early return
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = pypdf.PdfReader(pdf_file)
            # Read the PDF content page by page
            for page_num, page in enumerate(pdf_reader.pages):
                page_text = page.extract_text()
                print(page_text)
                
                # Search for the regex pattern in the page text
                #if re.search(target_text_pattern, page_text, re.IGNORECASE):
                #    print(f"Found '{target_text_pattern}' on page {page_num + 1} of {pdf_path}")

                # Extract sections using the provided section patterns
                content = get_section(page_text, section_patterns)
                if content:
                    return content
        
    except FileNotFoundError:
        try:
            # Try to open the PDF from the alternative directory
            # (relies on a module-level alternative_pdf_directory variable)
            alternative_pdf_path = os.path.join(alternative_pdf_directory, os.path.basename(pdf_path))
            with open(alternative_pdf_path, 'rb'):
                return 'Editorial board'
        except FileNotFoundError:
            # If PDF is not found in the original or alternative directory, return 'Editorial board'
            return 'Editorial board'
        except Exception as e:
            print(f"Error reading PDF: {e}")
    
    except Exception as e:
        print(f"Error reading PDF: {e}")