# Table of contents 
- [Setup](#setup) 
    - [Target](#target)
    - [Libraries](#libraries)
- [Gather datasets](#gatherdatasets)
    - [Get content](#getcontent)
        - [Section patterns v1](#sectionpatternsv1)
        - [Section patterns v2](#sectionpatternsv2)
    - [Get datasets](#getdatasets)
- [References](#references)

<a name='setup'></a>
# 0. Setup 

This notebook contains the code to extract the datasets used in the articles published in NeuroImage in 2022. 
<br>
<br>

<a name='target'></a> 
## 0.1. Target
The goal is to use pypdf to locate and extract the datasets used for analysis in the research articles. Based on an initial review of nine random 
<br>
<br>

<a name='libraries'></a>
## 0.2. Libraries 

In [None]:
import pandas as pd
import numpy as np

import json 
import os 
import re 

import pypdf 

<a name='gatherdatasets'></a>
# 1. Gather datasets 

PLAN OF ATTACK TO EXPLORE: 
* IF - Locate 'Data availability' (or similar) section and look for links - if multiple, save all of them and look at surrounding words for context 
* ELSE If there is no 'Data availability' (or similar) section 
	* Look at wording in section 2.1 
<br>
<br>
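A minimal sketch of the first branch of the plan: collect every link in a 'Data availability' section together with the words around it. The function name `find_links_with_context`, the `window` size, and the example text are assumptions made for illustration; none of them come from the referenced repositories.

```python
import re

# Assumption: a URL runs until the next whitespace character;
# trailing punctuation is stripped afterwards.
URL_PATTERN = re.compile(r"https?://\S+")


def find_links_with_context(section_text, window=40):
    """Return (url, context) pairs for every URL found in a section.

    `window` is the number of characters kept on each side of the URL.
    """
    results = []
    for match in URL_PATTERN.finditer(section_text):
        url = match.group().rstrip(".,;)")
        start = max(match.start() - window, 0)
        end = min(match.end() + window, len(section_text))
        results.append((url, section_text[start:end]))
    return results


# Hypothetical section text, for illustration only
example = (
    "Data availability. The data are hosted at https://example.org/data. "
    "The analysis code is at https://example.org/code."
)
links = find_links_with_context(example)
for url, context in links:
    print(url, "|", context)
```

Keeping the context next to each link makes it possible to judge later whether a link points to data, code, or something else.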

<a name='getcontent'></a>
## 1.1. Get content 

I build on the work of Akkoç (2023) and Sourget (2023) to search the PDFs for their datasets. Code from their two git repositories inspired the two functions presented in this section: 
- *get_section* is loosely adapted from Akkoç (2023); see PublicDatasets/ArticleAnalyser.ipynb, section '2.1 Get section function', in the GitHub repository.
- *get_content* is loosely adapted from Sourget (2023); see DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures', in the GitHub repository.

<br>

References: 
- Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022)
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget

In [None]:
def get_content(pdf_path, alt_pdf_path, section_patterns):
    """Get a PDF. 
    This function is loosely adapted from Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget
    specifically: DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures'.
    
    Parameters: 
    :param pdf_path (str): Path to the PDF file.
    :param alt_pdf_path (str): Directory holding the 'Editorial board' PDFs, checked when pdf_path does not exist.
    :param section_patterns (list): Pairs of start and end patterns, passed on to get_section.
    
    Returns: 
    :return: The extracted section text, 'Editorial board' if the PDF is not in the main directory, or None if no section matches or the PDF cannot be read.
    """
    try:
        # Use a context manager so the file is closed even on an early return
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = pypdf.PdfReader(pdf_file)
            # Read the entire PDF content
            pdf_text = " ".join(page.extract_text() for page in pdf_reader.pages)
        
        # Extract sections using the provided section patterns
        content = get_section(pdf_text, section_patterns)
        if content: 
            return content 
        
    except FileNotFoundError:
        try:
            # Check whether the PDF exists in the alternative (editorial board) directory
            alternative_pdf_path = os.path.join(alt_pdf_path, os.path.basename(pdf_path))
            print(alternative_pdf_path)
            with open(alternative_pdf_path, 'rb'):
                return 'Editorial board'
        except FileNotFoundError:
            # If PDF is not found in the original or alternative directory, return 'Editorial board'
            return 'Editorial board'
        except Exception as e:
            print(f"Error reading PDF: {e}")
    
    except Exception as e:
        print(f"Error reading PDF: {e}")


def get_section(article, section_patterns):
    """Get sections from a research paper based on patterns.
    This function is loosely adapted from Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022) with some alterations,
    specifically PublicDatasets/ArticleAnalyser.ipynb, section '2.1 Get section function'.
    
    Parameters: 
    :param article (str): Text contents of the research paper.
    :param section_patterns (list): A list of (start_patterns, end_patterns) pairs, each a list of strings. The strings are escaped with re.escape and matched literally, ignoring case.
    
    Returns: 
    :return: The extracted section text, including the matched end pattern, or an empty string if nothing matches.
    """
    article_lower = article.lower()  # Convert contents to lowercase

    # Attempt to find the section based on the current patterns (case-insensitive)
    for start_patterns, end_patterns in section_patterns:
        for start_pattern in start_patterns:
            start_pattern = re.compile(re.escape(start_pattern), re.IGNORECASE)
            match_start = start_pattern.search(article_lower)
            if match_start:
                idx0 = match_start.start()
                for end_pattern in end_patterns:
                    end_pattern = re.compile(re.escape(end_pattern), re.IGNORECASE)
                    match_end = end_pattern.search(article_lower[idx0:])
                    if match_end:
                        end_idx = idx0 + match_end.end()
                        section = article[idx0:end_idx]  # Extract the matched section
                        return section

    # If no match is found, return an empty string
    return ""

In [None]:
# Path to the directory containing PDFs
pdf_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_articles_doi/'
alternative_pdf_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_editorialboard_doi/'

# Path to the JSON file containing DOI values
json_file_path = '../Data/ElsevierAPI/downloadedPDFs_info.json'

<a name='sectionpatternsv1'></a>
### 1.1.1. Section patterns v1 
Before I continue working on extracting the dataset names and potential links from the sections, I am curious to see how the section patterns perform. 

I investigate the first ten DOIs in downloadedPDFs_info.json to see exactly which text sections were extracted.

In [None]:
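# Raw strings avoid invalid-escape warnings for sequences such as \d; the values are unchanged.
# Note that get_section escapes every pattern with re.escape, so the regex-style entries are matched literally.
section_patterns = [
    (["Data and Code Availability", "Data Availability"], ["3", "CRediT authorship contribution statement", "Acknowledgements", "References"]),
    (["2.1"], ["2.2"]),
    (["Resource", "3.1 'Resource'"], ["3.2"]),
    ([r"Fig.\d+", r"Fig.\d+\.?", r"Figure \d+"], [r"https?://[^\s]+"]),
    ([r"Tab.\d+", r"Table \d+\.?"], [r"https?://[^\s]+", r"[\w\s-]+\d{4}"]),
    (["Introduction", "1"], ["2"]),
    (["Abstract"], ["1", "Introduction"])
]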

In [None]:
# Empty list to store individual results
results_list = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    # Get the first 10 DOIs
    first_10_dois = doi_data['DOIs'][:10]

    for doi in first_10_dois:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content function for each DOI
        section_content = get_content(pdf_path, alternative_pdf_directory, section_patterns)

        # Create a dictionary for each result and add it to the list
        results_list.append({"DOI": doi, "Section": section_content})

# Convert the list of dictionaries to a DataFrame
results_df = pd.DataFrame(results_list)

In [None]:
results_df

*In the following description, I refer to the index of the articles in results_df.*

Observations from the text sections extracted with section_patterns: 
* In 1, 2, 4, 7, and 9, **the text is cut short because there's a mention of a number 3** within the section (in a link, in a release number, etc.). 
* In 2, they call it: 'Data/code availability statement'
* In 2 and 9, the **end of the section can be 'Acknowledgements'**.
* In 3 and 6, the **end of the section can be 'Declaration of Competing Interest'**.
* In 4, 5, and 8, the **section ends with 'Credit authorship contribution statement'**.
* In 5, we see that the use of a **URL does not necessarily mean that it's pointing to data (in this case, it's code and software)**. 
* In 6, we see that **the wording of the text is important** (the GitHub link contains both data and code, but that is hard to tell from the text). 
* In 7 and 8, they **mention which dataset they used, but do not link it**. 
* In 9, it says: "The review summarizes data but does not contain new data." (this is important if I want to look into and further filter the documents for significance testing). 

<br>
From this investigation I can see that I need to edit the section patterns. Ideas: 

- Maybe the end of a section can be \n\n? 
- Section end '3' should be called '3. ' - maybe this will fix some 
- Add variations: 
    - Section starts: 
        - Data/code availability statement
    - Section ends: 
        - [data and code] Declaration of Competing Interest
        - [data and code] Acknowledgements
        - [data and code] Credit authorship contribution statement
<br>
<br>
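To illustrate why the bare end marker '3' cuts sections short and why '3. ' should help, here is a small sketch. `cut_section` is a hypothetical, simplified stand-in for the literal start/end matching in get_section, and the sample text is made up.

```python
def cut_section(text, start_marker, end_marker):
    """Return text from start_marker up to and including the first end_marker.

    Matching is literal and case-insensitive, similar in spirit to get_section.
    """
    lowered = text.lower()
    start = lowered.find(start_marker.lower())
    if start == -1:
        return ""
    end = lowered.find(end_marker.lower(), start)
    if end == -1:
        return ""
    return text[start:end + len(end_marker)]


# Made-up section with a '3' inside a release number and inside a URL
sample = (
    "Data Availability\n"
    "Data are in release 1.3 at https://example.org/v3/data.\n"
    "3. Results\n"
)

short = cut_section(sample, "Data Availability", "3")
longer = cut_section(sample, "Data Availability", "3. ")

print(repr(short))   # stops at the release number
print(repr(longer))  # runs until the next numbered heading
```

The stricter marker is still only a heuristic: a phrase such as 'Table 3. ' inside the section would end it early as well.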

FOR FUTURE STEPS: 
- URLs do not necessarily link to the data. 
- A git repository can contain both data and code - but not always. 
- The dataset might only be mentioned by name and not linked (so far, I've only seen the names in camelcase). 
- QUESTION: How do we treat reviews that summarize data but do not contain new data? Is reusing a dataset not also a case of not containing new data?

<a name='sectionpatternsv2'></a>
### 1.1.2. Section patterns v2 
Based on my exploration of the performance of the first section patterns, I can see that they need to be rewritten. For version 2, I made a few edits: 
* Add variations
    * Section starts: 
        * Data/code availability statement 
    * Section ends: 
        * '\n\n' (this could be a general way to end the section) 
        * [data and code] Declaration of Competing Interest
        * [data and code] Acknowledgements
        * [data and code] Credit authorship contribution statement
* Change pattern containing numbers (e.g., '3' is now '3. ')
<br>
I investigate the next ten DOIs in downloadedPDFs_info.json to see exactly which text sections were extracted.

In [None]:
# Raw strings avoid invalid-escape warnings for sequences such as \d; the values are unchanged.
# Note: "Declaration of Competing Interests" (plural) is matched literally, so it will not match the singular heading.
section_patterns_v2 = [
    (["Data and Code Availability", "Data Availability", "Data/code availability"], ["3. ", "CRediT authorship contribution statement", "Acknowledgements", "References", "Declaration of Competing Interests", "Credit authorship contribution statement", "\n\n"]),
    (["2.1."], ["2.2."]),
    (["Resource", "3.1."], ["3.2."]),
    ([r"Fig.\d+", r"Fig.\d+\.?", r"Figure \d+"], [r"https?://[^\s]+"]),
    ([r"Tab.\d+", r"Table \d+\.?"], [r"https?://[^\s]+", r"[\w\s-]+\d{4}"]),
    (["Introduction", "1. "], ["2. "]),
    (["Abstract"], ["1. ", "Introduction"])
]

In [None]:
# Empty list to store individual results
results_list_v2 = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    # Get the next 10 DOIs (indices 11-20; note that this slice skips index 10)
    first_10_dois = doi_data['DOIs'][11:21]

    for doi in first_10_dois:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content function for each DOI
        section_content = get_content(pdf_path, alternative_pdf_directory, section_patterns_v2)

        # Create a dictionary for each result and add it to the list
        results_list_v2.append({"DOI": doi, "Section": section_content})

# Convert the list of dictionaries to a DataFrame
results_df2 = pd.DataFrame(results_list_v2)

In [None]:
results_df2

*In the following description, I refer to the index of the articles in results_df2.*

Observations from the text sections extracted with section_patterns_v2: 
- In 0, there are links, but these are not to the dataset - they write "The used data can be shared with other researchers upon reasonable request." 
- In 0 and 7, the next section is called 'Supplementary materials' - which means that my attempt at \n\n did not work.  
- In 2, the only mention of data was picked up in section 2.1.
- In 2, the 'Declaration  of Competing  Interest' was not picked up - it looks like it's because there are double spaces between the words. 
- In 3, the 'Credit authorship  contribution  statement' was not picked up - double spaces again?
- In 6, the data section is called 'Code and data availability' - but it was picked up by 'data availability'. 
- In 6, there are multiple links mentioned - one for data (an atlas), one for the code, and one for the data. 
    - NB! When copying the URL for the data, it is broken up by the formatting: https://www.humanconnectome.org/study/hcp-young-adult/ document/1200-subjects-data-release - this is also the case for the atlas. 
- In 7 and 8, there are spaces in the URL. 
- In 8, the following section 'Declaration of Competing Interest' was not picked up. 
- In 9, the introduction was picked up: but it does not look like any data is analysed in this article. 

<br>
From this investigation I can see that I need to edit the section patterns further. 

<br>
<br>
TO DO: 

- Section_patterns that do not work: 
    - Section_end: \n\n
    - Section_end: 'Declaration  of Competing  Interest' + Section_end: 'Credit authorship  contribution  statement' + Section_end: 'Supplementary materials'
        - double-spaces between words mess these up
- Section_patterns I'm worried about: 
    - Section_start: 2.1. - what if it's '2.1'?
    - Section_start: 'Code and data availability'
- Undiscovered section_patterns: 
    - Section_end: 'Ethics statement'
- Worries 
    - How to get the name of the dataset itself and the url
        - The URL can be broken up by spaces (due to line changes in the pdf) - can I find a way to find out which is the entire URL? 
            - Are there any slashes in the text ahead? A parenthesis, dot, comma, or another symbol might end the URL. 
    - If someone uses e.g., HCP, do they use all of the data? Do I need to catch more text-sections to learn this (in relation to the discussion of significance testing - if they use different parts of the dataset, they are not testing on the same). 
        - "Due to HCP and dHCP privacy policies, the preprocessed resting-state images of human adults and neonates (with their IDs) can only be shared upon request with qualified investigators who agree to the Restricted Data Use Terms of these two datasets." (from 10.1016/j.neuroimage.2022.119339)
    - What if the article does not analyse any data? (e.g., 10.1016/j.neuroimage.2022.119295 presents a software package for the execution of RT-fMRI experiments.) 
- FUNCTION GET_CONTENT: Make a comment about trying the "Editorial board" texts in the other file - just so I don't get an "Error reading PDF:" 
    - Make an addition to 'get_section' so that it says 'Editorial board' instead of None for the section text. 


### 1.1.3. Section patterns v3 

I want to make a regex_pattern work, as it seems like a double space after 


TO DO: 
- Section_patterns that do not work: 
    - Section_end: \n\n
    - Section_end: 'Declaration  of Competing  Interest' + Section_end: 'Credit authorship  contribution  statement' + Section_end: 'Supplementary materials'
        - double-spaces between words messed these up. 
- Section_patterns I'm worried about: 
    - Section_start: 2.1. - what if it's '2.1'?
    - Section_start: 'Code and data availability'
- Undiscovered section_patterns: 
    - Section_end: 'Ethics statement'
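
One possible way to handle the double spaces, sketched under the assumption that headings differ from the expected wording only in whitespace: build the regex from the plain heading text and allow any run of whitespace between words. `heading_to_regex` is a hypothetical helper. The last lines illustrate a related point about Python's `re` module: without `re.VERBOSE`, a space written around `|` is a literal character that must appear in the text.

```python
import re


def heading_to_regex(heading):
    """Build a pattern that matches `heading` with any whitespace run between words."""
    words = heading.split()
    return r"\s+".join(re.escape(word) for word in words)


pattern = heading_to_regex("Declaration of Competing Interest")

# Made-up text with double spaces, as seen in the extracted PDFs
text = "request.\nDeclaration  of Competing  Interest\nThe authors declare"
match = re.search(pattern, text)
print(match.group() if match else "no match")

# Spaces around '|' are literal unless re.VERBOSE is used:
with_spaces = re.search(r"Acknowledgements | References", "References")
without_spaces = re.search(r"Acknowledgements|References", "References")
print(with_spaces, without_spaces)
```

Because the pattern is generated from the heading text, new headings such as 'Ethics statement' can be added as plain strings.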


In [None]:
def get_content_regex(pdf_path, alt_pdf_path, section_patterns):
    try:
        # Use a context manager so the file is closed before returning
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = pypdf.PdfReader(pdf_file)
            # Read the entire PDF content
            pdf_text = " ".join(page.extract_text() for page in pdf_reader.pages)
        
        # Extract sections using the provided section patterns;
        # the content is an empty string when no pattern pair matches
        return get_section_regex(pdf_text, section_patterns)
        
    except FileNotFoundError:
        try:
            # Check whether the PDF exists in the alternative (editorial board) directory
            alternative_pdf_path = os.path.join(alt_pdf_path, os.path.basename(pdf_path))
            with open(alternative_pdf_path, 'rb'):
                return 'Editorial board', '', ''
        except FileNotFoundError:
            # If PDF is not found in the original or alternative directory, return 'Editorial board'
            return 'Editorial board', '', ''
        except Exception as e:
            print(f"Error reading PDF: {e}")
            # Return empty values so the caller can still unpack three items
            return '', '', ''
    
    except Exception as e:
        print(f"Error reading PDF: {e}")
        # Return empty values so the caller can still unpack three items
        return '', '', ''


def get_section_regex(article, section_patterns):
    matched_start_pattern = None  # Variable to store the matched start pattern
    matched_end_pattern = None    # Variable to store the matched end pattern
    
    # Iterate through each pattern pair
    for start_pattern, end_pattern in section_patterns:
        # Find all matches of the start pattern in the article
        start_matches = re.finditer(start_pattern, article, re.IGNORECASE)

        # Iterate through each start match
        for match in start_matches:
            start_idx = match.start()  # Get the start position of the start match

            # Search for the end pattern, starting from the start position of the start match
            end_match = re.search(end_pattern, article[start_idx:], re.IGNORECASE)
            
            if end_match:
                end_idx = start_idx + end_match.start()  # Calculate the end position of the section
                section_text = article[start_idx:end_idx].strip()  # Extract the section text

                # Store the start match object and the end pattern string
                matched_start_pattern = match
                matched_end_pattern = end_pattern

                # Return the section text and matched patterns
                return section_text, matched_start_pattern, matched_end_pattern

    # If no match is found, return empty strings for the section and both patterns
    return '', '', ''

In [None]:
section_patterns_regex = [
    (r'(?<![\'"]) \s*?\n?Data\s+and\s+code\s+availability |(?<![\'"]) \s*?\n?Data\s+availability |(?<![\'"]) \s*?\n?Data/code\s+availability', r'\s*?\n\n |\s*?\n?3\. | \s*?\n?CRediT\s+authorship\s+contribution\s+statement(?:s)? | \s*?\n?Acknowledgement(?:s)? | \s*?\n?Reference(?:s)? | \s*?\n?Declaration\s+of\s+Competing\s+Interest(?:s)? | \s*?\n?Credit\s+authorship\s+contribution\s+statement(?:s)? | \s*?\n?Funding | \s*?\n?Supplementary\s+materials | \s*?\n?Ethic(?:s)? statement(?:s)?'),
    (r'\n?2\.1\.', r'\n?2\.2. | \n\n '),
    (r'\n?Resource | \n?3\.1\.\s*?\n?', r'\n?3\.2.\s*?| \s*?\n\n '),
    (r'\n?Introduction\s*?\n? | \s*?\n?1\.\s*?\n? ', r'\s*?\n?2\.\s*?\n? | \s*?\n\n '),
    (r'\n?Fig\.\d+ | \n?Fig\.\d+\.? | \n?Figure \d+', r'https?://[^\s]+ | \s*?\n\n '),
    (r'\n?Tab\.\d+ | \n?Table \d+\.?', r'https?://[^\s]+ | [\w\s-]+\d{4} | \s*?\n\n '),
    (r'\n?Abstract\s*?\n? | \s*?\n?1\.\s*?\n? ', r'\n?Introduction\s*?\n? | \s*?\n\n ')
]

In [None]:
# Empty list to store individual results
results_list_regex = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    # Select the DOIs to inspect (here a single DOI, index 20)
    first_dois = doi_data['DOIs'][20:21]
    
    print(len(first_dois))

    for doi in first_dois:
        print(doi)
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content function for each DOI
        section_content_regex, matched_start_pattern, matched_end_pattern = get_content_regex(pdf_path, alternative_pdf_directory, section_patterns_regex)

        # Create a dictionary for each result and add it to the list
        results_list_regex.append({"DOI": doi, "Section": section_content_regex, "Start_pattern": matched_start_pattern, "End_pattern": matched_end_pattern})

# Convert the list of dictionaries to a DataFrame
results_df_regex = pd.DataFrame(results_list_regex)

In [None]:
# results_df2 covers indices 11-20 of the JSON, so row 9 is the DOI at index 20
results_df2.loc[9]

In [None]:
results_df2['Section'].loc[9]

In [None]:
results_df_regex['Section'].loc[0]

Persisting issues: 
- Can I change the regex patterns to camelcase? 
    - It looks like the titles mainly have the first letter capitalized - this will fix the next problem. 
- When matching case-insensitively, the section in results_df2['Section'].loc[2] is cut short:
    - From 'Data and code availability  statement  \nData used in the study are available  upon direct request.  Conditions  \nfor its sharing  involve  the formalisation  of a research  agreement.  The \ndata and code sharing  adopted  by the authors  comply  with the require-  \nments of the funding  body or institute,  and with the institutional  ethics \napproval.  Parts of the data are conﬁdential  and additional  ethical ap- \nproval may be needed  for re-use. \n'
    - To: 'Data and code availability  statement  \nData used in the study are available  upon direct request.  Conditions  \nfor its sharing  involve  the formalisation  of a research  agreement.  The \ndata and code sharing  adopted  by the authors  comply  with the require-  \nments of the'
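
The truncated text stops right before the word 'funding', which is consistent with the `Funding` end pattern matching the body text when case is ignored. A minimal sketch of one way around this, assuming section headings start a line and are capitalised: match the start heading case-insensitively, but match end headings case-sensitively and only at the beginning of a line. The sample text and the `extract` helper are made up for illustration.

```python
import re

# Made-up text: 'funding' appears in the body and 'Funding' as the next heading
text = (
    "Data and code availability  statement  \n"
    "The data sharing complies with the requirements of the funding  body.  \n"
    "Funding  \n"
    "This work was supported by a grant."
)

start_re = re.compile(r"data\s+and\s+code\s+availability", re.IGNORECASE)
# Case-sensitive; re.MULTILINE lets ^ match at the start of every line
end_re = re.compile(r"^[ \t]*(Funding|Acknowledgements?|References?)\b", re.MULTILINE)


def extract(text, start_re, end_re):
    """Return the text from the start match up to (not including) the end match."""
    start = start_re.search(text)
    if start is None:
        return ""
    end = end_re.search(text, start.end())
    stop = end.start() if end else len(text)
    return text[start.start():stop].strip()


section = extract(text, start_re, end_re)
loose = extract(text, start_re, re.compile(r"funding", re.IGNORECASE))

print(repr(section))  # keeps the sentence that mentions the funding body
print(repr(loose))    # cut short at the lowercase 'funding'
```

This only helps when pypdf keeps the heading on its own line; headings merged into running text would still need another rule.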

<a name='getdatasets'></a>
## 1.2. Get datasets
I need to extract the datasets from the text sections we extracted above. 

Based on my previous observations, I will start the extraction with the following notions in mind: 
- Private datasets (meaning either fully private or available upon request) 
    - Markers: 
        - Text 
- Public datasets (meaning it's available to everyone with a link or title of the dataset)
    - Markers: 
        - Hyperlink 
        - Camelcase 
- Issues (**code**)
    - The URL can be broken up by spaces due to line changes in the PDF. Do we stop at the parenthesis, comma or another symbol that might end the URL? 
        - EXAMPLES 
    - Identify the dataset by name; 'GitHub' can also be identified if we search for words in camelcase. 
- Issues (**analysis**)
    - If someone uses e.g., HCP, do they use all of the data? Do I need to catch more text-sections to learn this (in relation to the discussion of significance testing - if they use different parts of the dataset, they are not testing on the same). 
        - "Due to HCP and dHCP privacy policies, the preprocessed resting-state images of human adults and neonates (with their IDs) can only be shared upon request with qualified investigators who agree to the Restricted Data Use Terms of these two datasets." (from 10.1016/j.neuroimage.2022.119339)
    - What if the article does not analyse any data? (e.g., 10.1016/j.neuroimage.2022.119295 presents a software package for the execution of RT-fMRI experiments.) 
    - What if there are multiple sections and the text is slightly different (e.g., 10.1016/j.neuroimage.2022.118986)
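
A rough sketch of how a URL broken up by a line change might be rejoined. The heuristic is an assumption, not a tested rule: a following token is treated as part of the URL when the piece collected so far ends with '/' or '-', or when the token itself contains '/'. `rejoin_url` is a hypothetical helper; the first example reuses the broken HCP link observed earlier.

```python
import re


def rejoin_url(text):
    """Return the first URL in `text`, re-attaching pieces split off by spaces.

    Returns None when the text contains no URL.
    """
    match = re.search(r"https?://\S+", text)
    if match is None:
        return None
    url = match.group()
    for token in text[match.end():].split():
        # Heuristic: keep joining while the URL looks unfinished
        if url.endswith(("/", "-")) or "/" in token:
            url += token
        else:
            break
    # Trailing punctuation usually belongs to the sentence, not the URL
    return url.rstrip(".,;)")


broken = (
    "available at https://www.humanconnectome.org/study/hcp-young-adult/ "
    "document/1200-subjects-data-release for details."
)
print(rejoin_url(broken))
print(rejoin_url("Code: https://github.com/example/repo (see README)"))
```

The heuristic will produce false positives, for example when a complete URL ends with '/' and is followed by an ordinary word, so the rejoined links would still need a manual check.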







Columns: 
- Section text 
- Section pattern (multiple reasons: 1) I can get a sense of whether the data statement is common in NeuroImage, 2) I can go back and handle potential more difficult cases) 
- Extracted dataset 


# Save datasets 

- Store the extracted datasets for further analysis 

# X. References

- Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022)
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget

In [None]:
def get_section_v1(article, section_patterns):
    """Get a section from a research paper. 
    
    Parameters: 
    :param article (str): Text contents of the research paper.
    :param section_patterns (list): A list of strings to indicate the start and end of the dataset section.
    :return: The substring between the section header and a potential section end, or "" if it is not found.
    
    This function is adapted from Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022) with some alterations.
    """
    contents_lower = article.lower()  # Convert contents to lowercase
    
    """THE CODE BELOW DOES WHAT I WANT IT TO DO"""
    #test_start = r'data and code'
    #test_end = r'availability'
    #idx0 = contents_lower.find(test_start)
    #if idx0 != -1:
        #idxend = contents_lower.find(test_end, idx0)  # Start searching for test_end from idx0
        #if idxend != -1:
            #section = article[idx0:idxend]  # "+ len(test_end)" to include the end pattern in the extracted section
            #print(section)

    # If no match is found, return an empty string
    return "" 


def get_content_v1(pdf_path, section_patterns):
    """Get a PDF. 
    This function is loosely adapted from Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget
    specifically: DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures'.
    
    Parameters: 
    :param pdf_path (str): Path to the PDF file.
    :param section_patterns (list): Pairs of start and end patterns, passed on to get_section.
    The alternative directory is read from the global variable alternative_pdf_directory.
    
    Returns: 
    :return: The section text from the first page with a match, 'Editorial board' if the PDF is not in the main directory, or None otherwise.
    """
    
    try:
        pdf_file = open(pdf_path, 'rb')
        pdf_reader = pypdf.PdfReader(pdf_file)
        # ORIGINAL 
        # Read the entire PDF content        
        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_num]
            page_text = page.extract_text()
            print(page_text)
            
            # Search for the regex pattern in the page text
            #if re.search(target_text_pattern, page_text, re.IGNORECASE):
            #    print(f"Found '{target_text_pattern}' on page {page_num + 1} of {pdf_path}")

            # Extract sections using the provided section patterns
            content = get_section(page_text, section_patterns)
            if content:
                return content
        pdf_file.close()
        
    except FileNotFoundError:
        try:
            # Try to open the PDF from the alternative directory
            alternative_pdf_path = os.path.join(alternative_pdf_directory, os.path.basename(pdf_path))
            pdf_file = open(alternative_pdf_path, 'rb')
            return 'Editorial board'
        except FileNotFoundError:
            # If PDF is not found in the original or alternative directory, return 'Editorial board'
            return 'Editorial board'
        except Exception as e:
            print(f"Error reading PDF: {e}")
    
    except Exception as e:
        print(f"Error reading PDF: {e}")