# Table of contents 
- [Libraries and packages](#librariesandpackages)
- [Queries through OpenAlex](#queriesthroughopenalex)
    - [Paratext](#paratext)
    - [Functions](#functions)
        - [Build url for 'works' entities](#buildurlforworks)
        - [Get dataframe of works](#getdataframeofworks)
        - [Get PDFs](#getpdfs)
    - [Investigation](#investigation) 
- [Test-read the PDFs](#testreadthepdfs)
- [References](#references) 

# Libraries and packages 
<a name='librariesandpackages'></a> 

In [1]:
# Access, use, and request OpenAlex 
import requests 

# Handle data 
import pandas as pd 
import numpy as np
import csv 

# Filter pdfs
import pdfminer 
from pypdf import PdfReader
from pypdf.errors import PdfReadError

# Handle files, directories, and paths 
import glob
import sys 
import os

<a name='queriesthroughopenalex'></a> 
# Queries through OpenAlex 

First, I need to see if NeuroImage is among the sources available in OpenAlex's database (Priem et al. 2022). 
I search OpenAlex's 'source' entity, as the source is where works are hosted (e.g., journals). To search OpenAlex, I use both the name of the journal and the journal's ISSNs. 

References: 
- NeuroImage | Journal | ScienceDirect.com by Elsevier. (n.d.). Retrieved September 17, 2023, from https://www.sciencedirect.com/journal/neuroimage
- Priem, J., Piwowar, H., & Orr, R. (2022). OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. ArXiv. https://arxiv.org/abs/2205.01833
- OpenAlexAPI. (n.d.-c). Sources. OpenAlex API Documentation. Retrieved September 17, 2023, from https://docs.openalex.org/api-entities/sources
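
The ISSN check described above can be tried offline before calling the API. The sketch below assumes a response shaped like the OpenAlex sources endpoint (a `results` list whose items carry `id` and `issn`). The helper name `match_source_by_issn` and the sample records are hypothetical, and a missing `issn` value is treated as an empty list.

```python
def match_source_by_issn(results, target_issn):
    """Return the id of the first result whose ISSN list contains every target ISSN."""
    for result in results:
        # 'issn' may be missing or None for some sources, so fall back to an empty list
        issns = result.get('issn') or []
        if all(issn in issns for issn in target_issn):
            return result['id']
    return None


# Hypothetical sample records mimicking the 'results' list of a sources response
sample_results = [
    {'id': 'https://openalex.org/S0000000001', 'issn': None},
    {'id': 'https://openalex.org/S103225281', 'issn': ['1053-8119', '1095-9572']},
]

print(match_source_by_issn(sample_results, ['1095-9572', '1053-8119']))
```

Requiring every target ISSN to be present, rather than any one of them, keeps the match strict when several journals share a similar display name.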

In [2]:
# Variable for the name of the journal. 
journal_name = 'NeuroImage'
# Variable for the target ISSN numbers to match 
target_issn = ['1095-9572', '1053-8119']

In [3]:
def get_source_openalexid(journal_name, target_issn): 
    """Retrieve the OpenAlex source ID for a given journal based on its name and ISSN.
    This function sends a request to the OpenAlex API to search for sources with a matching
    display name (journal name) and ISSN (International Standard Serial Number). If a source
    with matching ISSN values is found, its OpenAlex source ID is returned.

    Parameters: 
    :param journal_name (str): The name of the journal for which the OpenAlex source ID is sought.
    :param target_issn (list): A list containing the ISSN values to match against in the OpenAlex results.
    
    Returns:
    :return: String (or None) of the OpenAlex source ID if a matching source is found, or None if no match is found.
    """
    
    # Request OpenAlex 
    sources = requests.get(f'https://api.openalex.org/sources?filter=display_name:{journal_name}').json()
    # Variable to store the matching OpenAlexID 
    matching_id = None
    
    # Iterate through the results to find a match
    for result in sources['results']:
        # Check if the result's ISSN values match the target ISSN ('issn' can be None)
        if all(issn in (result.get('issn') or []) for issn in target_issn):
            # If both ISSN values are found, store the ID and break the loop
            matching_id = result['id']
            break
    return matching_id 

In [4]:
# Print the matching OpenAlex ID (or None if no match was found)
neuroimage_openalexid = get_source_openalexid(journal_name, target_issn)

print("Matching ID:", neuroimage_openalexid)

Matching ID: https://openalex.org/S103225281


The following cell presents part of the metadata about NeuroImage that is available in OpenAlex's database. 

In [5]:
# This is the metadata about NeuroImage in Open Alex's database 
neuroimage_oa = requests.get(f'https://api.openalex.org/sources?filter=ids.openalex:{neuroimage_openalexid}').json()

# Extract and print selected fields
result = neuroimage_oa['results'][0]
print('id:', result['id'])
print('issn:', ', '.join(result['issn']))
print('display_name:', result['display_name'])
print('host_organization_name:', result['host_organization_name'])
print('homepage_url:', result['homepage_url'])
for year_data in result['counts_by_year']:
    if year_data['year'] == 2022:
        print('year: 2022, work_counts:', year_data['works_count'])

id: https://openalex.org/S103225281
issn: 1053-8119, 1095-9572
display_name: NeuroImage
host_organization_name: Elsevier BV
homepage_url: http://www.elsevier.com/locate/ynimg
year: 2022, work_counts: 828
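
The lookup of a single year in `counts_by_year` can be wrapped in a small helper. This is only a sketch: the function name `works_count_for_year` is hypothetical, and the sample list is made up except for the 2022 value, which is copied from the output above.

```python
def works_count_for_year(counts_by_year, year):
    """Return the works_count recorded for `year`, or None if the year is absent."""
    for entry in counts_by_year:
        if entry['year'] == year:
            return entry['works_count']
    return None


# Hypothetical sample; only the 2022 value is taken from the output above
sample_counts = [
    {'year': 2023, 'works_count': 0},
    {'year': 2022, 'works_count': 828},
]

print(works_count_for_year(sample_counts, 2022))
print(works_count_for_year(sample_counts, 1999))
```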


Going through all of the 2022 volumes manually on Elsevier's website (https://www.sciencedirect.com/journal/neuroimage/issues) and clicking 'Select all articles' for each of the 19 volumes published that year, I get the following article counts: 
<br><br>

NeuroImage, 2022 article count by volume:

| Volume: articles | Volume: articles | Volume: articles | Volume: articles |
| :--           | :--           | :--           | :--           |
|  Vol 264: 94  |  Vol 259: 33  |  Vol 254: 42  |  Vol 249: 35  |
|  Vol 263: 77  |  Vol 258: 50  |  Vol 253: 37  |  Vol 248: 14  |
|  Vol 262: 33  |  Vol 257: 62  |  Vol 252: 30  |  Vol 247: 55  |
|  Vol 261: 25  |  Vol 256: 43  |  Vol 251: 44  |  Vol 246: 25  |
|  Vol 260: 55  |  Vol 255: 41  |  Vol 250: 39  |               |

<br>
This is a total of 834 articles. The OpenAlex output above reports a works_count of 828 for 2022, which is a difference of 6 articles. 
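
The manual total can be re-checked by summing the per-volume counts. The snippet below copies the numbers from the table above; the variable names are only illustrative.

```python
# Article counts per 2022 volume, copied from the table above
volume_counts = {
    264: 94, 263: 77, 262: 33, 261: 25, 260: 55,
    259: 33, 258: 50, 257: 62, 256: 43, 255: 41,
    254: 42, 253: 37, 252: 30, 251: 44, 250: 39,
    249: 35, 248: 14, 247: 55, 246: 25,
}

manual_total = sum(volume_counts.values())
print('volumes:', len(volume_counts))
print('manual total:', manual_total)
```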
<br> 
<br> 

<a name='paratext'></a> 
## Paratext 
OpenAlex's documentation of works flags certain records as **paratext**, which it defines as: 

    In our context, paratext is stuff that's in scholarly venue (like a journal) but is about the venue rather than a scholarly work properly speaking. Some examples and nonexamples: 
    
    - yep it's paratext: front cover, back cover, table of contents, editorial board listing, issue information, masthead 
    - no, not paratext: research paper, dataset, letters to the editor, figures. 
<br> 
Looking at the different volumes of NeuroImage on Elsevier's website, each has an article titled 'Editorial board' as the first article in the journal. If these are excluded, that would leave 815 research articles (starting at 834 total, from our manual count). 

I want to see if the articles that have *is_paratext = True* are the Editorial board papers, and as such, should be excluded from my pool of papers. 

Reference: 
* OpenAlexAPI. (n.d.-d). Work object. OpenAlex API Documentation. Retrieved September 17, 2023, from https://docs.openalex.org/api-entities/works/work-object#is_paratext
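
One way to run this check is to collect the distinct titles of all works flagged as paratext and see whether only 'Editorial Board' remains. The sketch below uses plain dictionaries shaped like OpenAlex work objects; the helper name `paratext_titles` and the sample records are hypothetical.

```python
def paratext_titles(works):
    """Collect the distinct titles of all works flagged as paratext."""
    return {work.get('title') for work in works if work.get('is_paratext')}


# Hypothetical records shaped like OpenAlex work objects (only two fields shown)
sample_works = [
    {'title': 'Editorial Board', 'is_paratext': True},
    {'title': 'Connectomics of human electrophysiology', 'is_paratext': False},
    {'title': 'Editorial Board', 'is_paratext': True},
]

titles = paratext_titles(sample_works)
print(titles == {'Editorial Board'})
```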

<a name='functions'></a> 
## Functions 
<a name='buildurlforworks'></a>
### Build url for 'works' entities 

In [6]:
def build_works_url(filters):
    """Build a URL for querying works from the OpenAlex API based on specified filters.
    
    Parameters: 
    :param filters (list): A list of filter strings to be applied to the query.
    
    Returns:
    :return: String of the constructed URL for querying works with the specified filters.

    Example:
    >>> filters = ['primary_location.source.id:S103225281', 'publication_year:2022']
    >>> build_works_url(filters)
    'https://api.openalex.org/works?filter=primary_location.source.id:S103225281,publication_year:2022'
    """
    base_url = 'https://api.openalex.org/works'
    filters = ','.join(filters)
    return f'{base_url}?filter={filters}'
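
If more query parameters are needed later, the URL can also be assembled with the standard library. The sketch below is a hypothetical variant, not the function used in this notebook; it assumes the `per-page` query parameter described in OpenAlex's paging documentation.

```python
from urllib.parse import urlencode


def build_works_url_with_params(filters, per_page=200):
    """Hypothetical variant of build_works_url that also sets the page size."""
    query = urlencode(
        {'filter': ','.join(filters), 'per-page': per_page},
        safe=':,',  # keep the filter syntax readable instead of percent-encoding it
    )
    return f'https://api.openalex.org/works?{query}'


example_filters = ['primary_location.source.id:S103225281', 'publication_year:2022']
print(build_works_url_with_params(example_filters))
```

A larger page size would reduce the number of requests needed to collect a full year of articles.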

<a name='getdataframeofworks'></a>
### Get dataframe of works
I want to build a dataframe containing the metadata for all the articles available in OpenAlex. I used the code written by Théo Sourget to implement the following function. His code is in the file at the following path: code/other/download_fulltext.ipynb

References: 
- OpenAlexAPI. (n.d.-a). Filter works. OpenAlex API Documentation. Retrieved September 14, 2023, from https://docs.openalex.org/api-entities/works/filter-works
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget

In [7]:
def get_works_df(filters):
    """Fetch articles' data from OpenAlex using the specified filters. 
    This function was written using Theo Sourget's code: https://github.com/TheoSourget/DDSA_Sourget/blob/41a92e931fc095804df87241ee01ea4290b50c83/code/other/download_fulltext.ipynb#L462
    
    Parameters: 
    :param filters (list): A list of filter strings to be applied to the query.
    
    Returns:
    :return: DataFrame object containing the retrieved articles' data.

    Example:
    >>> filters = ['primary_location.source.id:S103225281', 'publication_year:2022']
    >>> get_works_df(filters)
    # Returns a DataFrame with journal articles from 2022.
    """
    # Variable to store the URL using the specified filters  
    url = build_works_url(filters)
    
    # Initialize an empty list to store all articles
    all_articles = []

    # Set the initial page number to 1
    page = 1

    # Loop until all articles are retrieved
    while True:
        # Make a request to the API with the current page number
        response = requests.get(url, params={"page": page})

        # Check if the response is successful
        if response.status_code == 200:
            data = response.json()

            # Extract articles from the current page and append to the list
            articles_on_page = data.get("results", [])
            all_articles.extend(articles_on_page)

            # Check if there are more pages to fetch
            if len(articles_on_page) == 0 or page * data['meta']['per_page'] >= data['meta']['count']:
                break

            # Increment the page number for the next request
            page += 1
        else:
            print("Error fetching data. Status code:", response.status_code)
            break

    # Create and return a DataFrame from all articles 
    return pd.DataFrame(all_articles)
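
The stop condition of the paging loop can be tested without network access. In this sketch, `fetch_page` is a hypothetical callable standing in for the request, and `fake_fetch_page` imitates a response with `results` and `meta` keys.

```python
def collect_all_pages(fetch_page):
    """Collect results page by page until the reported total has been reached.

    `fetch_page` is a hypothetical callable that takes a page number and returns
    a dict with 'results' and 'meta' keys, like a parsed OpenAlex response.
    """
    collected = []
    page = 1
    while True:
        data = fetch_page(page)
        results = data.get('results', [])
        collected.extend(results)
        meta = data['meta']
        # Stop on an empty page or once page * per_page covers the total count
        if not results or page * meta['per_page'] >= meta['count']:
            break
        page += 1
    return collected


def fake_fetch_page(page, total=55, per_page=25):
    """Stand-in for the API: serves `total` numbered items in pages of `per_page`."""
    start = (page - 1) * per_page
    items = list(range(start, min(start + per_page, total)))
    return {'results': items, 'meta': {'count': total, 'per_page': per_page}}


works = collect_all_pages(fake_fetch_page)
print(len(works))
```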

<a name='getpdfs'></a> 
### Get PDFs
I want to download the fulltext of all the articles in OpenAlex from NeuroImage. 
I used part of the code written by Théo Sourget to get the PDFs. His code is in the file at the following path: code/other/download_fulltext.ipynb

References: 
* Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget

In [8]:
def get_works_pdf(df, path_keyword, pdf_path="../OpenAlex/papers_fulltext/"): 
    """
    This function was written using code that was written mostly by Theo Sourget: https://github.com/TheoSourget/DDSA_Sourget/blob/41a92e931fc095804df87241ee01ea4290b50c83/code/other/download_fulltext.ipynb#L462
    It downloads PDFs of articles based on their DOI and saves them to the specified path.
    
    Parameters:
    :param df (pandas.DataFrame): A DataFrame object containing article information.
    :param path_keyword (str): A keyword used to create a subdirectory for saving PDFs.
    :param pdf_path (str): The base path where PDFs will be saved. 
    
    Returns:
    :return: Tuple containing two objects:
            - A list of downloaded DOIs.
            - A dictionary mapping each DOI to a list of its article titles.

    Notes:
        This function iterates through the provided DataFrame, extracts article titles and DOIs, 
        and attempts to download the corresponding PDFs using the DOIs. PDFs are saved in a 
        subdirectory named after 'path_keyword' within 'pdf_path'. It also ensures that articles 
        with the same title but different DOIs are both saved.    
    """
    # Variables to store downloaded DOIs and DOI-to-title mappings
    downloaded_doi = []
    doi_to_title = {}

    # Iterate through the articles in the DataFrame
    for index, row in df.iterrows():
        title = row['title']
        doi = row['doi']  # or row['fulltext_oa_url'], if that is available 
        fulltext_url = doi  # Assuming the DOI is the URL to the full text
        base_file_path = os.path.join(pdf_path, path_keyword)
        
        # Replace special characters in the title with underscores
        title = title.replace('/', '_').replace(':', '_').replace(' ', '_')
        file_path = f"{base_file_path}/{title}.pdf"
        suffix = 1

        # Add the title to the list of titles associated with the same DOI
        if doi in doi_to_title:
            doi_to_title[doi].append(title)
        else:
            doi_to_title[doi] = [title]

        if not fulltext_url:
            continue

        while os.path.exists(file_path):
            # If a file with the same name already exists, add a suffix to make it unique
            file_path = f"{base_file_path}/{title}_{suffix}.pdf"
            suffix += 1

        try:
            # Skip DOIs that have already been downloaded
            if doi in downloaded_doi:
                continue
            r_fulltext = requests.get(fulltext_url, allow_redirects=True, timeout=10)
            if r_fulltext.status_code != 200:
                continue
            # Save the response body to the download folder
            os.makedirs(os.path.dirname(file_path), exist_ok=True)
            with open(file_path, "wb") as pdf_file:
                pdf_file.write(r_fulltext.content)
            downloaded_doi.append(doi)
            try: 
                # Try to read the pdf (raises an error if the file is an invalid pdf)
                PdfReader(file_path, strict=True)
            except PdfReadError:
                # The pdf is invalid, so the DOI is removed from the downloaded list
                downloaded_doi.remove(doi)
                continue
        except requests.exceptions.RequestException:
            # Network errors and timeouts: skip this article
            continue
    return downloaded_doi, doi_to_title
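
The file-naming steps inside the function can be isolated and tested on their own. The sketch below mirrors the same replacements ('/', ':' and spaces become underscores) and the numeric suffix for existing files. The helper names and the 'untitled' fallback for works without a title are assumptions, not part of the original function.

```python
import os
import re


def safe_pdf_name(title):
    """Replace '/', ':' and spaces with underscores; fall back to a placeholder."""
    if not title:
        return 'untitled'  # hypothetical fallback for works without a title
    return re.sub(r'[/: ]', '_', title)


def unique_pdf_path(folder, name):
    """Return a path in `folder` that does not exist yet, adding _1, _2, ... if needed."""
    path = os.path.join(folder, f'{name}.pdf')
    suffix = 1
    while os.path.exists(path):
        path = os.path.join(folder, f'{name}_{suffix}.pdf')
        suffix += 1
    return path


print(safe_pdf_name('A unified view on beamformers for M/EEG source reconstruction'))
```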

<a name='investigation'></a>
## Investigation

In [9]:
# Common filter criteria for NeuroImage articles published in 2022
filter_neuroimage2022_wo_paratexts = [
    "primary_location.source.id:"+neuroimage_openalexid, # Filter by NeuroImage's OpenAlexID 
    "publication_year:2022",  # Filter by 2022
    "is_paratext:false"  # Exclude paratext
]

filter_neuroimage2022_paratexts = [
    "primary_location.source.id:"+neuroimage_openalexid, # Filter by NeuroImage's OpenAlexID 
    "publication_year:2022", # Filter by 2022
    "is_paratext:true" # Exclude non-paratexts
]

In [10]:
# Dataframe for all non-paratext articles 
neuroimage_nonparatext = get_works_df(filter_neuroimage2022_wo_paratexts)
neuroimage_nonparatext

Unnamed: 0,id,doi,title,display_name,publication_year,publication_date,ids,language,primary_location,type,...,grants,referenced_works_count,referenced_works,related_works,ngrams_url,abstract_inverted_index,cited_by_api_url,counts_by_year,updated_date,created_date
0,https://openalex.org/W4205798776,https://doi.org/10.1016/j.neuroimage.2021.118870,Quantitative mapping of the brain’s structural...,Quantitative mapping of the brain’s structural...,2022,2022-04-01,{'openalex': 'https://openalex.org/W4205798776...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",article,...,[],544,"[https://openalex.org/W88165185, https://opena...","[https://openalex.org/W217664020, https://open...",https://api.openalex.org/works/W4205798776/ngrams,"{'Diffusion': [0], 'magnetic': [1], 'resonance...",https://api.openalex.org/works?filter=cites:W4...,"[{'year': 2023, 'cited_by_count': 32}, {'year'...",2023-09-06T02:59:42.065054,2022-01-25
1,https://openalex.org/W4207064127,https://doi.org/10.1016/j.neuroimage.2021.118788,Connectomics of human electrophysiology,Connectomics of human electrophysiology,2022,2022-02-01,{'openalex': 'https://openalex.org/W4207064127...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",article,...,[{'funder': 'https://openalex.org/F4320310638'...,163,"[https://openalex.org/W1551715170, https://ope...","[https://openalex.org/W2009409311, https://ope...",https://api.openalex.org/works/W4207064127/ngrams,"{'We': [0, 29, 106, 120], 'present': [1], 'bot...",https://api.openalex.org/works?filter=cites:W4...,"[{'year': 2023, 'cited_by_count': 24}, {'year'...",2023-09-15T17:34:11.759500,2022-01-26
2,https://openalex.org/W4213330922,https://doi.org/10.1016/j.neuroimage.2022.119027,Triaxial detection of the neuromagnetic field ...,Triaxial detection of the neuromagnetic field ...,2022,2022-05-01,{'openalex': 'https://openalex.org/W4213330922...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",article,...,[{'funder': 'https://openalex.org/F4320332161'...,36,"[https://openalex.org/W1964386293, https://ope...","[https://openalex.org/W807649110, https://open...",https://api.openalex.org/works/W4213330922/ngrams,"{'Optically-pumped': [0], 'magnetometers': [1]...",https://api.openalex.org/works?filter=cites:W4...,"[{'year': 2023, 'cited_by_count': 23}, {'year'...",2023-09-15T17:43:53.290385,2022-02-24
3,https://openalex.org/W3216528674,https://doi.org/10.1016/j.neuroimage.2021.118774,A dynamic graph convolutional neural network f...,A dynamic graph convolutional neural network f...,2022,2022-02-01,{'openalex': 'https://openalex.org/W3216528674...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",article,...,[{'funder': 'https://openalex.org/F4320306076'...,54,"[https://openalex.org/W1560723556, https://ope...","[https://openalex.org/W1985820334, https://ope...",https://api.openalex.org/works/W3216528674/ngrams,"{'The': [0, 200], 'pathological': [1], 'mechan...",https://api.openalex.org/works?filter=cites:W3...,"[{'year': 2023, 'cited_by_count': 15}, {'year'...",2023-09-16T06:58:12.371166,2021-12-06
4,https://openalex.org/W4200583417,https://doi.org/10.1016/j.neuroimage.2021.118789,A unified view on beamformers for M/EEG source...,A unified view on beamformers for M/EEG source...,2022,2022-02-01,{'openalex': 'https://openalex.org/W4200583417...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",article,...,[{'funder': 'https://openalex.org/F4320311904'...,46,"[https://openalex.org/W1709380962, https://ope...","[https://openalex.org/W35607744, https://opena...",https://api.openalex.org/works/W4200583417/ngrams,"{'Beamforming': [0], 'is': [1, 160], 'a': [2, ...",https://api.openalex.org/works?filter=cites:W4...,"[{'year': 2023, 'cited_by_count': 21}, {'year'...",2023-09-16T19:16:48.666479,2021-12-31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
807,https://openalex.org/W4308479179,https://doi.org/10.1016/j.neuroimage.2022.119731,Behavioral and neural representation of expect...,Behavioral and neural representation of expect...,2022,2022-12-01,{'openalex': 'https://openalex.org/W4308479179...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",article,...,[],65,"[https://openalex.org/W1499836032, https://ope...","[https://openalex.org/W1972742627, https://ope...",https://api.openalex.org/works/W4308479179/ngrams,"{'When': [0], 'faced': [1], 'with': [2, 187, 2...",https://api.openalex.org/works?filter=cites:W4...,[],2023-08-30T14:14:24.484562,2022-11-12
808,https://openalex.org/W4309047218,https://doi.org/10.1016/j.neuroimage.2022.119749,Late dominance of the right hemisphere during ...,Late dominance of the right hemisphere during ...,2022,2022-12-01,{'openalex': 'https://openalex.org/W4309047218...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",article,...,[{'funder': 'https://openalex.org/F4320332161'...,137,"[https://openalex.org/W97601513, https://opena...","[https://openalex.org/W1977584810, https://ope...",https://api.openalex.org/works/W4309047218/ngrams,"{'PET': [0], 'and': [1, 98, 195, 201, 228, 259...",https://api.openalex.org/works?filter=cites:W4...,[],2023-09-15T04:09:42.507402,2022-11-21
809,https://openalex.org/W4309294048,https://doi.org/10.1016/j.neuroimage.2022.119747,Optimising the sensing volume of OPM sensors f...,Optimising the sensing volume of OPM sensors f...,2022,2022-12-01,{'openalex': 'https://openalex.org/W4309294048...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",article,...,[],48,"[https://openalex.org/W1976755501, https://ope...","[https://openalex.org/W1999107116, https://ope...",https://api.openalex.org/works/W4309294048/ngrams,"{'Magnetoencephalography': [0], '(MEG)': [1], ...",https://api.openalex.org/works?filter=cites:W4...,[],2023-09-13T06:30:04.653841,2022-11-25
810,https://openalex.org/W4309477016,https://doi.org/10.1016/j.neuroimage.2022.119752,Multistage classification identifies altered c...,Multistage classification identifies altered c...,2022,2022-12-01,{'openalex': 'https://openalex.org/W4309477016...,en,"{'is_oa': False, 'landing_page_url': 'https://...",article,...,[],68,"[https://openalex.org/W1828576284, https://ope...","[https://openalex.org/W1864544210, https://ope...",https://api.openalex.org/works/W4309477016/ngrams,"{'Distinguishing': [0], 'groups': [1, 224], 'o...",https://api.openalex.org/works?filter=cites:W4...,[],2023-09-01T07:11:31.049435,2022-11-28


In [11]:
# Dataframe for all paratext articles 
neuroimage_paratext = get_works_df(filter_neuroimage2022_paratexts)
neuroimage_paratext

Unnamed: 0,id,doi,title,display_name,publication_year,publication_date,ids,language,primary_location,type,...,grants,referenced_works_count,referenced_works,related_works,ngrams_url,abstract_inverted_index,cited_by_api_url,counts_by_year,updated_date,created_date
0,https://openalex.org/W4205101057,https://doi.org/10.1016/s1053-8119(21)01130-7,Editorial Board,Editorial Board,2022,2022-02-01,{'openalex': 'https://openalex.org/W4205101057...,,"{'is_oa': True, 'landing_page_url': 'https://d...",paratext,...,[],0,[],"[https://openalex.org/W2096946506, https://ope...",https://api.openalex.org/works/W4205101057/ngrams,,https://api.openalex.org/works?filter=cites:W4...,[],2023-09-01T07:18:26.022248,2022-01-26
1,https://openalex.org/W4206240093,https://doi.org/10.1016/s1053-8119(22)00014-3,Editorial Board,Editorial Board,2022,2022-02-01,{'openalex': 'https://openalex.org/W4206240093...,,"{'is_oa': True, 'landing_page_url': 'https://d...",paratext,...,[],0,[],"[https://openalex.org/W2049775471, https://ope...",https://api.openalex.org/works/W4206240093/ngrams,,https://api.openalex.org/works?filter=cites:W4...,[],2023-08-30T00:42:44.981219,2022-01-25
2,https://openalex.org/W4206932992,https://doi.org/10.1016/s1053-8119(22)00043-x,Editorial Board,Editorial Board,2022,2022-03-01,{'openalex': 'https://openalex.org/W4206932992...,,"{'is_oa': True, 'landing_page_url': 'https://d...",paratext,...,[],0,[],"[https://openalex.org/W40804987, https://opena...",https://api.openalex.org/works/W4206932992/ngrams,,https://api.openalex.org/works?filter=cites:W4...,[],2023-09-16T09:21:38.861146,2022-01-26
3,https://openalex.org/W4211001203,https://doi.org/10.1016/s1053-8119(22)00080-5,Editorial Board,Editorial Board,2022,2022-04-01,{'openalex': 'https://openalex.org/W4211001203...,,"{'is_oa': True, 'landing_page_url': 'https://d...",paratext,...,[],0,[],"[https://openalex.org/W1596801655, https://ope...",https://api.openalex.org/works/W4211001203/ngrams,,https://api.openalex.org/works?filter=cites:W4...,[],2023-09-15T19:20:28.836392,2022-02-13
4,https://openalex.org/W4220912835,https://doi.org/10.1016/s1053-8119(22)00127-6,Editorial Board,Editorial Board,2022,2022-04-01,{'openalex': 'https://openalex.org/W4220912835...,,"{'is_oa': True, 'landing_page_url': 'https://d...",paratext,...,[],0,[],"[https://openalex.org/W2096946506, https://ope...",https://api.openalex.org/works/W4220912835/ngrams,,https://api.openalex.org/works?filter=cites:W4...,[],2023-09-01T23:47:39.215263,2022-04-03
5,https://openalex.org/W4220998640,https://doi.org/10.1016/s1053-8119(22)00234-8,Editorial Board,Editorial Board,2022,2022-05-01,{'openalex': 'https://openalex.org/W4220998640...,,"{'is_oa': True, 'landing_page_url': 'https://d...",paratext,...,[],0,[],"[https://openalex.org/W2064050299, https://ope...",https://api.openalex.org/works/W4220998640/ngrams,,https://api.openalex.org/works?filter=cites:W4...,[],2023-09-09T01:19:40.056487,2022-04-03
6,https://openalex.org/W4221126408,https://doi.org/10.1016/s1053-8119(22)00203-8,Editorial Board,Editorial Board,2022,2022-05-01,{'openalex': 'https://openalex.org/W4221126408...,,"{'is_oa': True, 'landing_page_url': 'https://d...",paratext,...,[],0,[],"[https://openalex.org/W2096946506, https://ope...",https://api.openalex.org/works/W4221126408/ngrams,,https://api.openalex.org/works?filter=cites:W4...,[],2023-09-01T21:12:30.418656,2022-04-03
7,https://openalex.org/W4224219090,https://doi.org/10.1016/s1053-8119(22)00288-9,Editorial Board,Editorial Board,2022,2022-06-01,{'openalex': 'https://openalex.org/W4224219090...,,"{'is_oa': True, 'landing_page_url': 'https://d...",paratext,...,[],0,[],"[https://openalex.org/W1596801655, https://ope...",https://api.openalex.org/works/W4224219090/ngrams,,https://api.openalex.org/works?filter=cites:W4...,[],2023-09-16T08:20:58.985405,2022-04-26
8,https://openalex.org/W4225395090,https://doi.org/10.1016/s1053-8119(22)00358-5,Editorial Board,Editorial Board,2022,2022-07-01,{'openalex': 'https://openalex.org/W4225395090...,,"{'is_oa': True, 'landing_page_url': 'https://d...",paratext,...,[],0,[],"[https://openalex.org/W2130043461, https://ope...",https://api.openalex.org/works/W4225395090/ngrams,,https://api.openalex.org/works?filter=cites:W4...,[],2023-09-05T00:21:51.218338,2022-05-05
9,https://openalex.org/W4280566973,https://doi.org/10.1016/s1053-8119(22)00379-2,Editorial Board,Editorial Board,2022,2022-07-01,{'openalex': 'https://openalex.org/W4280566973...,,"{'is_oa': True, 'landing_page_url': 'https://d...",paratext,...,[],0,[],"[https://openalex.org/W2049775471, https://ope...",https://api.openalex.org/works/W4280566973/ngrams,,https://api.openalex.org/works?filter=cites:W4...,[],2023-09-02T23:49:01.764196,2022-05-22


There are a total of 812 articles when is_paratext is false, and 19 paratext articles. Looking at the 'display_name' and 'title' columns of the paratext dataframe, I see that they are all titled 'Editorial Board'. 
<br>
<br>
Comparing the doi to the fulltext_oa_url (extracted from the 'open_access' column in the next cell) shows that they are mostly the same. As such, I will use the doi to download the PDFs. 

In [12]:
# Compare the 'fulltext_oa_url' to the 'doi' url 
neuroimage_nonparatext['fulltext_oa_url'] = neuroimage_nonparatext['open_access'].apply(lambda x: x.get('oa_url') if isinstance(x, dict) else None)

In [13]:
neuroimage_nonparatext[['doi', 'fulltext_oa_url']]

Unnamed: 0,doi,fulltext_oa_url
0,https://doi.org/10.1016/j.neuroimage.2021.118870,https://doi.org/10.1016/j.neuroimage.2021.118870
1,https://doi.org/10.1016/j.neuroimage.2021.118788,https://doi.org/10.1016/j.neuroimage.2021.118788
2,https://doi.org/10.1016/j.neuroimage.2022.119027,https://doi.org/10.1016/j.neuroimage.2022.119027
3,https://doi.org/10.1016/j.neuroimage.2021.118774,https://doi.org/10.1016/j.neuroimage.2021.118774
4,https://doi.org/10.1016/j.neuroimage.2021.118789,https://doi.org/10.1016/j.neuroimage.2021.118789
...,...,...
807,https://doi.org/10.1016/j.neuroimage.2022.119731,https://doi.org/10.1016/j.neuroimage.2022.119731
808,https://doi.org/10.1016/j.neuroimage.2022.119749,https://doi.org/10.1016/j.neuroimage.2022.119749
809,https://doi.org/10.1016/j.neuroimage.2022.119747,https://doi.org/10.1016/j.neuroimage.2022.119747
810,https://doi.org/10.1016/j.neuroimage.2022.119752,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9...
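
To put a number on "mostly the same", the mismatches between the two URLs can be counted. This sketch works on plain dictionaries shaped like OpenAlex work objects rather than on the dataframe; the helper name `count_url_mismatches` is hypothetical, and the example.org URLs are placeholders.

```python
def count_url_mismatches(works):
    """Count works whose open-access URL is present and differs from the DOI URL."""
    mismatches = 0
    for work in works:
        open_access = work.get('open_access')
        oa_url = open_access.get('oa_url') if isinstance(open_access, dict) else None
        if oa_url is not None and oa_url != work.get('doi'):
            mismatches += 1
    return mismatches


# Hypothetical records; the example.org URLs are placeholders, not real links
sample_works = [
    {'doi': 'https://doi.org/10.1016/j.neuroimage.2021.118788',
     'open_access': {'oa_url': 'https://doi.org/10.1016/j.neuroimage.2021.118788'}},
    {'doi': 'https://example.org/doi-b',
     'open_access': {'oa_url': 'https://example.org/repository-copy-b'}},
    {'doi': 'https://example.org/doi-c', 'open_access': None},
]

print(count_url_mismatches(sample_works))
```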


<a name='testreadthepdfs'></a>
# Test-read the PDFs 
I will try to retrieve the first five non-paratext articles and attempt to read them using pypdf's PdfReader. 

In [14]:
# List the PDF files currently stored in the articles folder 
pdf_path = glob.glob("../OpenAlex/papers_fulltext/articles_pdf/*.pdf")

# List the PDF files currently stored in the paratext folder 
pdf_paratext_path = glob.glob("../OpenAlex/papers_fulltext/paratext_pdf/*.pdf")

In [15]:
test = neuroimage_nonparatext[:5] 
test_download_doi, test_doi_to_title = get_works_pdf(test, 'articles_pdf')

In [16]:
test_doi_to_title

{'https://doi.org/10.1016/j.neuroimage.2021.118870': ['Quantitative_mapping_of_the_brain’s_structural_connectivity_using_diffusion_MRI_tractography__A_review'],
 'https://doi.org/10.1016/j.neuroimage.2021.118788': ['Connectomics_of_human_electrophysiology'],
 'https://doi.org/10.1016/j.neuroimage.2022.119027': ['Triaxial_detection_of_the_neuromagnetic_field_using_optically-pumped_magnetometry__feasibility_and_application_in_children'],
 'https://doi.org/10.1016/j.neuroimage.2021.118774': ['A_dynamic_graph_convolutional_neural_network_framework_reveals_new_insights_into_connectome_dysfunctions_in_ADHD'],
 'https://doi.org/10.1016/j.neuroimage.2021.118789': ['A_unified_view_on_beamformers_for_M_EEG_source_reconstruction']}

In [17]:
key = test['doi'][0]
value = test_doi_to_title[key]
title_test = value[0]
print(title_test)

Quantitative_mapping_of_the_brain’s_structural_connectivity_using_diffusion_MRI_tractography__A_review


In [18]:
reader = PdfReader(f"../OpenAlex/papers_fulltext/articles_pdf/{title_test}.pdf")
page = reader.pages[0]
print(page.extract_text())

invalid pdf header: b'\n\n\n\n\n'
EOF marker not found


PdfStreamError: Stream has ended unexpectedly

When I attempt to open the PDFs manually, I am informed that they seem to be broken, and reading them with pypdf's PdfReader raises the error shown above. The 'invalid pdf header' message indicates that the saved file does not start with a PDF header, so the DOI link may have returned something other than the PDF itself (for example, an HTML landing page). 
<br><br>
At this point, I will attempt to use Elsevier's own API to download and read the PDFs, seeing as the files downloaded through the DOI links cannot be read with pypdf. 
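
A quick check of the first bytes can tell whether a download is a PDF at all before it is written to disk. The sketch below is a rough heuristic with hypothetical helper names; it assumes that the broken files are HTML pages, which the output above suggests but does not confirm.

```python
def looks_like_pdf(content):
    """Return True if the bytes start with the '%PDF-' marker of a PDF file."""
    return content.startswith(b'%PDF-')


def describe_payload(content):
    """Give a rough, heuristic label for downloaded bytes."""
    if looks_like_pdf(content):
        return 'pdf'
    if b'<html' in content[:1024].lower():
        return 'html'
    return 'unknown'


print(describe_payload(b'%PDF-1.7\n...'))
print(describe_payload(b'\n\n\n\n\n<!DOCTYPE html><html>...'))
```

Calling such a check on the response content before saving would avoid storing files that PdfReader cannot open.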

<a name='references'></a>
# References 

- NeuroImage | Journal | ScienceDirect.com by Elsevier. (n.d.). Retrieved September 17, 2023, from https://www.sciencedirect.com/journal/neuroimage
- OpenAlexAPI. (n.d.-a). Filter works. OpenAlex API Documentation. Retrieved September 14, 2023, from https://docs.openalex.org/api-entities/works/filter-works
- OpenAlexAPI. (n.d.-b). Search institutions. OpenAlex API Documentation. Retrieved September 14, 2023, from https://docs.openalex.org/api-entities/sources/source-object
- OpenAlexAPI. (n.d.-c). Sources. OpenAlex API Documentation. Retrieved September 17, 2023, from https://docs.openalex.org/api-entities/sources
- OpenAlexAPI. (n.d.-d). Work object. OpenAlex API Documentation. Retrieved September 17, 2023, from https://docs.openalex.org/api-entities/works/work-object#is_paratext
- Priem, J., Piwowar, H., & Orr, R. (2022). OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. ArXiv. https://arxiv.org/abs/2205.01833
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget