# Table of contents 
- [Setup](#setup) 
    - [Target issues](#targetissues) 
    - [Libraries](#libraries)
    - [Setup](#setup) 
- [Elsevier API](#elsevierAPI)
    - [NeuroImage 2022 articles](#NeuroImage2022articles)
        - [Articles overview](#articlesoverview)
        - [Metadata-database](#metadataDatabase)
    - [Fetch PDFs](#fetchPDFs)
- [References](#references)

<a name='setup'></a>
# 0. Setup 

In this notebook, I download all papers published in NeuroImage in 2022. 
<br>
<br>

<a name='targetissues'></a>
## 0.1. Target issues 

NeuroImage 2022, volumes 246-264: https://www.sciencedirect.com/journal/neuroimage/issues 
<br>
<br>

<a name='libraries'></a>
## 0.2. Libraries 

In [1]:
import pandas as pd
import numpy as np
import requests 
import json 
import os 
import re
import time
from urllib.parse import quote  # Import the quote function for URL encoding

<a name='setup'></a> 
## 0.3. Setup 
The *config.json* file contains the API Key and Institution Token; both of which are necessary to use Elsevier's API. 

Everyone can freely create an API key on https://dev.elsevier.com/. The institution token, however, is private and unique to each institution. To comply with the rules, this token cannot appear in any browser code or in the address bar and must be kept secure. As such, it will not appear in my code. This also means that anyone who wants to run this notebook needs their own *config.json* with valid credentials; without it, the papers cannot be retrieved. 
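
As a minimal sketch of the file described above: *config.json* is assumed to be a flat JSON object with the two keys this notebook reads, `apikey` and `insttoken`. The helper name `load_config` and the early check for missing keys are my own additions for illustration, not part of Elsevier's tooling.

```python
import json
from pathlib import Path

# Key names read later in this notebook; the values stay in the private file.
# Expected shape: {"apikey": "<API key>", "insttoken": "<institution token>"}
REQUIRED_KEYS = ("apikey", "insttoken")


def load_config(path="config.json"):
    """Read the credentials file and fail early if a key is missing or empty."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise KeyError(f"config file is missing: {', '.join(missing)}")
    return config
```

Failing at load time gives a clearer message than an authentication error from the first API request.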

In [2]:
## Load configuration 
with open("config.json") as con_file:
    config = json.load(con_file)

## Initialize client 
api_key = config['apikey'] 
inst_token = config['insttoken']

<a name = 'elsevierAPI'></a>
# 1. Elsevier API 

I use Elsevier's API to download the PDFs published in NeuroImage. 
Elsevier has a Python SDK called **elsapy**. Its GitHub repository contains several example scripts, which I read to understand how to use the API. 
 
References: 
- Elsapy. (2023). [Python]. Elsevier Developers. https://github.com/ElsevierDev/elsapy (Original work published 2016)


<a name='NeuroImage2022articles'></a>
## 1.1. NeuroImage 2022 articles 

Steps: 
* Formulate the correct search to get the 834 articles (815 if I exclude the 'Editorial board' articles from each volume) 
* Extract the metadata about each article 

<br>
<br>

<a name='articlesoverview'></a> 
### 1.1.1. Articles overview
The function **get_articles_info** uses Elsevier's ScienceDirect Search API to retrieve the available information on all articles matching my request, one page of results at a time.  

References: 
- Elsevier B.V. (2023d). ScienceDirect Search API Migration. Elsevier Developer Portal. https://dev.elsevier.com/tecdoc_sdsearch_migration.html

In [4]:
def get_articles_info(api_key, inst_token, search_query, publication_date, volume_range, show_per_request, page):
    """Fetches scientific articles' information from ScienceDirect based on 
    specified search criteria.
    
    Parameters: 
    :param api_key (str): Elsevier API key for authentication.
    :param inst_token (str): Elsevier institution token for access.
    :param search_query (str): The search query for articles.
    :param publication_date (str): The publication date or year for filtering articles.
    :param volume_range (str): The range of publication volumes for filtering articles.
    :param show_per_request (int): The number of articles to retrieve per request.
    :param page (int): The current page number for pagination.
    
    Returns: 
    :return: pandas.DataFrame object containing information about the fetched articles.
    """
    # Base URL for the PUT request
    base_url = 'https://api.elsevier.com/content/search/sciencedirect'
    
    # Header for the request 
    headers_param = {
        'Accept': 'application/json',
        'X-ELS-APIKey': api_key,
        'X-ELS-Insttoken': inst_token
    }
    
    # Empty list to store all the articles 
    all_articles = []
    
    while True: 
        # Increment the page number for pagination 
        page += 1
    
        # The PUT request in JSON format 
        put_body = {
            'qs': search_query,
            'date': publication_date,  
            'pub': search_query,  
            'volume': volume_range, 
            'view': 'COMPLETE',
            'filters': {
                'openAccess': 'true',
            },
            'display': {
                'offset': (page-1) * show_per_request, 
                'show': show_per_request, 
                'sortBy': 'date'
            }
        }

        # Convert the PUT body to JSON format
        put_body_json = json.dumps(put_body)

        # Send the PUT request
        response = requests.put(base_url, data=put_body_json, headers=headers_param)

        # Check if the request was successful (status code 200)
        if response.status_code == 200:
            data = response.json()
            articles_on_page = data.get("results", [])
            
            if not articles_on_page:
                break # No more articles to fetch 
            
            all_articles.extend(articles_on_page)
            
            # Rate limiting to avoid hitting an API rate limit 
            time.sleep(1) # sleep for one second between requests 
        
        else:
            # Handle error cases
            print(f"Request failed with status code: {response.status_code}")
            break
        
    # Create a dataframe from all the articles 
    df = pd.DataFrame(all_articles)
    return df 
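
The pagination in `get_articles_info` relies on the `offset` value `(page-1) * show_per_request`. A small sketch of that arithmetic, using the 834 results and the page size of 50 from this notebook; the helper name `page_offsets` is hypothetical and makes no API calls.

```python
import math


def page_offsets(total, show_per_request):
    """Return the 'offset' value for each page needed to cover `total` results."""
    n_pages = math.ceil(total / show_per_request)
    return [(page - 1) * show_per_request for page in range(1, n_pages + 1)]


offsets = page_offsets(total=834, show_per_request=50)
print(len(offsets))   # 17 requests return results
print(offsets[:3])    # [0, 50, 100]
print(offsets[-1])    # 800, the last and only partly filled page
```

Because the total is not known in advance, the function above sends one further request after the last filled page; that request comes back with no results, which ends the loop.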

In [5]:
search_query = 'NeuroImage'
publication_date = '2022'
volume_range = '246-264'
show_per_request = 50 
page = 0

articles_df = get_articles_info(api_key, inst_token, search_query, publication_date, volume_range, show_per_request, page)

In [6]:
print(f"Total articles fetched: {len(articles_df)}")

Total articles fetched: 834


In [7]:
articles_df

Unnamed: 0,doi,loadDate,openAccess,pages,pii,publicationDate,sourceTitle,title,uri,volumeIssue,authors
0,10.1016/S1053-8119(22)00918-1,2022-12-09T00:00:00.000Z,True,{'first': '119797'},S1053811922009181,2022-12-01,NeuroImage,Editorial Board,https://www.sciencedirect.com/science/article/...,Volume 264,
1,10.1016/j.neuroimage.2022.119763,2022-11-24T00:00:00.000Z,True,{'first': '119763'},S1053811922008849,2022-12-01,NeuroImage,An optimized reference tissue method for quant...,https://www.sciencedirect.com/science/article/...,Volume 264,"[{'order': 1, 'name': 'Kenji Tagai'}, {'order'..."
2,10.1016/j.neuroimage.2022.119764,2022-11-24T00:00:00.000Z,True,{'first': '119764'},S1053811922008850,2022-12-01,NeuroImage,Shared and distinct neural activity during ant...,https://www.sciencedirect.com/science/article/...,Volume 264,"[{'order': 1, 'name': 'Yu Chen'}, {'order': 2,..."
3,10.1016/j.neuroimage.2022.119768,2022-11-24T00:00:00.000Z,True,{'first': '119768'},S1053811922008898,2022-12-01,NeuroImage,Sample size requirement for achieving multisit...,https://www.sciencedirect.com/science/article/...,Volume 264,"[{'order': 1, 'name': 'Pravesh Parekh'}, {'ord..."
4,10.1016/j.neuroimage.2022.119769,2022-11-24T00:00:00.000Z,True,{'first': '119769'},S1053811922008904,2022-12-01,NeuroImage,Dissociation and hierarchy of human visual pat...,https://www.sciencedirect.com/science/article/...,Volume 264,"[{'order': 1, 'name': 'Xuetong Ding'}, {'order..."
...,...,...,...,...,...,...,...,...,...,...,...
829,10.1016/j.neuroimage.2021.118745,2021-11-19T00:00:00.000Z,True,{'first': '118745'},S105381192101017X,2022-02-01,NeuroImage,Mapping cortico-subcortical sensitivity to 4 H...,https://www.sciencedirect.com/science/article/...,Volume 246,"[{'order': 1, 'name': 'Søren A. Fuglsang'}, {'..."
830,10.1016/j.neuroimage.2021.118714,2021-11-18T00:00:00.000Z,True,{'first': '118714'},S1053811921009861,2022-02-01,NeuroImage,An MRI method for parcellating the human stria...,https://www.sciencedirect.com/science/article/...,Volume 246,"[{'order': 1, 'name': 'JL Waugh'}, {'order': 2..."
831,10.1016/j.neuroimage.2021.118738,2021-11-17T00:00:00.000Z,True,{'first': '118738'},S1053811921010107,2022-02-01,NeuroImage,Advances in spiral fMRI: A high-resolution stu...,https://www.sciencedirect.com/science/article/...,Volume 246,"[{'order': 1, 'name': 'Lars Kasper'}, {'order'..."
832,10.1016/j.neuroimage.2021.118698,2021-11-16T00:00:00.000Z,True,{'first': '118698'},S105381192100971X,2022-02-15,NeuroImage,Delta- and theta-band cortical tracking and ph...,https://www.sciencedirect.com/science/article/...,Volume 247,"[{'order': 1, 'name': 'Adam Attaheri'}, {'orde..."
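
In the table above, the `authors` column holds a list of dictionaries with `order` and `name` keys, and it is empty for the 'Editorial Board' row. A minimal sketch of turning one such cell into a plain list of names; the helper name `author_names` and the placeholder names in the example are hypothetical.

```python
def author_names(authors):
    """Return author names in their listed order; empty list when absent."""
    if not isinstance(authors, list):
        return []  # e.g. None or NaN for rows without authors
    ordered = sorted(authors, key=lambda author: author.get("order", 0))
    return [author["name"] for author in ordered if "name" in author]


# Placeholder data in the same shape as one cell of the 'authors' column
cell = [{"order": 2, "name": "B. Author"}, {"order": 1, "name": "A. Author"}]
print(author_names(cell))   # ['A. Author', 'B. Author']
print(author_names(None))   # []
```

Applied to the whole table, this would be `articles_df['authors'].apply(author_names)`.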


<a name='metadataDatabase'></a>
### 1.1.2. Metadata-database 
As I was unsure whether any additional metadata for each article would be relevant, I defined a function to fetch and store it. To get the metadata for all the articles, I use Elsevier's **Article Metadata API**. 

Steps: 
- For each DOI in articles_df['doi']
    - send a GET request to https://api.elsevier.com/content/metadata/article with 
        - the same headers as before 
        - a 'query' parameter containing the DOI, written as doi(...) 
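
A sketch of how the `query` value for one DOI ends up in the URL. It uses `urllib.parse.urlencode` from the standard library; I assume that `requests` encodes the `params` dictionary in the same way. The point it illustrates: a value that has already been percent-encoded gets encoded a second time, so the `%` signs turn into `%25`.

```python
from urllib.parse import quote, urlencode

doi = "10.1016/S1053-8119(22)00918-1"  # first DOI in the table above

# Raw DOI: each reserved character is percent-encoded exactly once.
once = urlencode({"query": f"doi({doi})"})

# DOI pre-encoded with quote(): the '%' signs are themselves encoded again.
twice = urlencode({"query": f"doi({quote(doi)})"})

print(once)
print(twice)
print("%25" in once, "%25" in twice)
```

Only DOIs containing characters such as parentheses are affected; a DOI like `10.1016/j.neuroimage.2022.119763` passes through `quote()` unchanged.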

References: 
* Elsevier B.V. (2023b). Article Metadata API. Elsevier Developer Portal. https://dev.elsevier.com/documentation/ArticleMetadataAPI.wadl
* Elsevier B.V. (2023c). ScienceDirect Article Metadata Guide. Elsevier Developer Portal. https://dev.elsevier.com/sd_article_meta_tips.html

In [8]:
def get_metadata_for_dois(api_key, inst_token, df):
    """Fetches metadata for a list of DOIs from ScienceDirect.

    Parameters: 
    :param api_key (str): Elsevier API key for authentication.
    :param inst_token (str): Elsevier institution token for access.
    :param df (pandas.DataFrame): A pandas.DataFrame object containing a 'doi' column with DOIs to fetch metadata for.
    
    Returns:
    :return: pandas.DataFrame object containing the fetched metadata for each DOI.
    """
    # Base URL for the article metadata endpoint
    metadata_base_url = 'https://api.elsevier.com/content/metadata/article'

    # Header for the request
    headers_param = {
        'Accept': 'application/json',
        'X-ELS-APIKey': api_key,
        'X-ELS-Insttoken': inst_token
    }
    
    # Query parameters for the request
    query_params = {
        'view': 'COMPLETE', 
        'count': '1'  # Set count to 1 to fetch only one result for each DOI
    }
    
    # Empty list to store the metadata for each article
    metadata_list = []
    count = 0

    for doi in df['doi']:
        # Add the DOI to the query parameters; requests percent-encodes the
        # value itself, so encoding it here with quote() would encode it twice
        query_params['query'] = f'doi({doi})'

        # GET request to fetch metadata for the article
        response = requests.get(metadata_base_url, headers=headers_param, params=query_params)

        # Check if the request was successful (status code 200)
        if response.status_code == 200:
            metadata = response.json()
            metadata_list.append(metadata)
            count += 1
            # print("Fetched", count, "articles")
        else:
            # Handle error cases
            print(f"Failed to fetch metadata for DOI: {doi}. Status code: {response.status_code}")

    # Create a DataFrame from the list of metadata
    metadata_df = pd.DataFrame(metadata_list)

    return metadata_df
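
The function above sends one request per DOI (834 here) without pausing. A minimal sketch of retrying with exponential backoff when the server answers with HTTP 429, the standard "Too Many Requests" status; whether Elsevier's API uses this status for its quotas is an assumption, and `get_with_retry` is a hypothetical helper, not part of `requests`.

```python
import time


def get_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `send()` until it returns a non-429 response or attempts run out.

    `send` is any zero-argument callable returning an object with a
    `status_code` attribute, e.g. lambda: requests.get(url, headers=headers).
    """
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits 1 s, 2 s, 4 s, ...
    return response
```

Inside the loop over DOIs, the call could then read `response = get_with_retry(lambda: requests.get(metadata_base_url, headers=headers_param, params=query_params))`.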

In [None]:
metadata_df = get_metadata_for_dois(api_key, inst_token, articles_df)

In [None]:
metadata_df['search-results'].loc[1]

<a name='fetchPDFs'></a>
## 1.2. Fetch PDFs

To get the PDFs, I define three functions. Two of them keep a record of what has already been downloaded, so that calling the download function repeatedly does not fetch the same PDF twice. The PDFs themselves come from Elsevier's Article (Full Text) Retrieval API. 
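
The record-keeping idea can be sketched without global variables: the sets are read from a JSON file at the start of a run and written back at the end. The names `load_state` and `save_state` and the explicit `path` argument are hypothetical; the JSON keys `DOIs` and `article_names` are the ones used in this notebook.

```python
import json
import os


def load_state(path):
    """Return (dois, names) as sets; both empty if the file does not exist."""
    if not os.path.exists(path):
        return set(), set()
    with open(path, "r", encoding="utf-8") as file:
        data = json.load(file)
    return set(data.get("DOIs", [])), set(data.get("article_names", []))


def save_state(path, dois, names):
    """Write both sets as sorted lists, since JSON has no set type."""
    data = {"DOIs": sorted(dois), "article_names": sorted(names)}
    with open(path, "w", encoding="utf-8") as file:
        json.dump(data, file, indent=4)
```

Sorting before saving keeps the file stable between runs, which makes changes easier to spot.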

References: 
- Elsevier B.V. (2023a). Article (Full Text) Retrieval API. Elsevier Developer Portal. https://dev.elsevier.com/documentation/ArticleRetrievalAPI.wadl#d1e52

In [25]:
def load_downloaded_info():
    """This function loads information about downloaded articles from a JSON file 
    or initializes empty sets if the file doesn't exist.
    """
    global downloaded_dois, downloaded_article_names
    info_file_path = 'downloadedPDFs_info.json'  # File path
    if os.path.exists(info_file_path):
        with open(info_file_path, 'r') as file:
            data = json.load(file)
            downloaded_dois = set(data.get('DOIs', []))
            downloaded_article_names = set(data.get('article_names', []))
    else:
        # No record file yet: start with empty sets
        downloaded_dois = set()
        downloaded_article_names = set()

def save_downloaded_info():
    """This function saves information about downloaded articles (DOIs and article names) 
    to a JSON file.
    """
    data = {
        'DOIs': list(downloaded_dois),
        'article_names': list(downloaded_article_names)
    }
    info_file_path = 'downloadedPDFs_info.json'  # File path
    with open(info_file_path, 'w') as file:
        json.dump(data, file, indent=4)  # Indent for readability     

def download_pdf(api_key, inst_token, df, doi_column='doi', title_column='title'):
    """This function downloads PDF articles based on their DOI and saves them in 
    specific directories. It checks if an article with the same DOI has been 
    downloaded before and avoids duplicate downloads.
    
    Parameters:
    :param api_key (str): Elsevier API key for authentication. 
    :param inst_token (str): Elsevier institution token for API access.
    :param df (pandas.DataFrame): The pandas.DataFrame object containing article information.
    :param doi_column (str, optional): The name of the DOI column in the DataFrame. Defaults to 'doi'.
    :param title_column (str, optional): The name of the article title column in the DataFrame. Defaults to 'title'.

    Note: This function creates directories to organize saved PDFs based on the article title. 
    PDFs with the title 'Editorial board' are saved in the 'fulltext_editorialboard_doi' 
    directory, while others are saved in the 'fulltext_articles_doi' directory. 
    Additionally, it maintains sets of downloaded DOIs and article names to 
    prevent duplicate downloads and saves this information for future runs.
    """
    global downloaded_dois, downloaded_article_names

    # Define the header for the request
    headers_param = {
        'Accept': 'application/pdf',
        'X-ELS-APIKey': api_key,
        'X-ELS-Insttoken': inst_token
    }

    # Load downloaded info from the previous run
    load_downloaded_info()

    # Create directories to save the PDFs
    fulltext_articles_dir = 'downloaded_pdfs/fulltext_articles_doi'
    fulltext_editorialboard_dir = 'downloaded_pdfs/fulltext_editorialboard_doi'
    
    for pdf_dir in [fulltext_articles_dir, fulltext_editorialboard_dir]:
        os.makedirs(pdf_dir, exist_ok=True)
    
    for index, row in df.iterrows():
        doi = row[doi_column]
        article_title = row[title_column]

        # Check if the DOI has already been downloaded
        if doi in downloaded_dois:
            print(f"Article with DOI {doi} already downloaded. Skipping.")
            continue

        # Replace slashes (/) in the DOI with dots
        doi_filename = doi.replace('/', '.')

        # Determine the PDF directory based on the article title
        if article_title.lower() == 'editorial board':
            pdf_dir = fulltext_editorialboard_dir
        else:
            pdf_dir = fulltext_articles_dir

        # Define the PDF file path with DOI as the filename
        pdf_file_path = os.path.join(pdf_dir, f'{doi_filename}.pdf')
        
        # Send the GET request to retrieve the PDF content
        article_url = f'https://api.elsevier.com/content/article/doi/{doi}?view=FULL'
        response = requests.get(article_url, headers=headers_param)

        # Check if the request was successful (HTTP status code 200)
        if response.status_code == 200:
            
            # Save the PDF content to a file
            with open(pdf_file_path, 'wb') as pdf_file:
                pdf_file.write(response.content)
            
            # Add the DOI to the downloaded set
            downloaded_dois.add(doi)
            # Add the article title to the downloaded set
            downloaded_article_names.add(article_title)
        else:
            print(f"Failed to retrieve PDF for article with DOI {doi}. Status code: {response.status_code}")

    # Save downloaded info for the next run
    save_downloaded_info()
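
Two small pieces of `download_pdf` can be looked at in isolation: how the target path is derived from the DOI and title, and a check on the response body before it is written to disk. Both helpers are hypothetical sketches. The signature check assumes that a PDF file starts with the bytes `%PDF-`, and it is an extra safeguard that the function above does not perform.

```python
import os


def pdf_path_for(doi, title, base_dir="downloaded_pdfs"):
    """Build the target path the same way download_pdf does."""
    if title.lower() == "editorial board":
        sub_dir = "fulltext_editorialboard_doi"
    else:
        sub_dir = "fulltext_articles_doi"
    # Slashes in the DOI would be read as directory separators
    return os.path.join(base_dir, sub_dir, doi.replace("/", ".") + ".pdf")


def looks_like_pdf(content):
    """True if the bytes start with the PDF signature '%PDF-'."""
    return content[:5] == b"%PDF-"


print(pdf_path_for("10.1016/j.neuroimage.2022.119763", "Some article"))
print(looks_like_pdf(b"%PDF-1.7 ..."), looks_like_pdf(b'{"error": "..."}'))
```

A status code of 200 only says that the request succeeded, so checking the first bytes is a cheap way to avoid saving an error message under a `.pdf` name.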

In [26]:
# Sets of downloaded DOIs and article names (refilled from disk by download_pdf)
downloaded_dois = set()
downloaded_article_names = set()

download_pdf(api_key, inst_token, articles_df, doi_column='doi', title_column='title')

Article with DOI 10.1016/S1053-8119(22)00918-1 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119763 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119764 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119768 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119769 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119771 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119772 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119766 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119767 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119759 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119757 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119747 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.

Article with DOI 10.1016/j.neuroimage.2022.119007 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119009 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119010 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119011 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.118992 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119002 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119003 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119004 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119005 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.119001 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.118991 already downloaded. Skipping.
Article with DOI 10.1016/j.neuroimage.2022.118990 already downloaded. Skipping.
Article with DOI 10.1016/S1053-8119(22)0

In [27]:
len(downloaded_dois)

834

In [28]:
len(downloaded_article_names)

816

In [24]:
len(articles_df['title'])

834
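
The gap between 834 DOIs and 816 distinct titles fits the numbers from section 1.1: 834 - 815 = 19 'Editorial Board' entries, which collapse into a single title, giving 815 + 1 = 816. This assumes every other title is unique. A sketch for checking which titles repeat, shown on toy data rather than the real table; `repeated_titles` is a hypothetical helper.

```python
from collections import Counter


def repeated_titles(titles):
    """Return {title: count} for every title that occurs more than once."""
    counts = Counter(titles)
    return {title: n for title, n in counts.items() if n > 1}


# Toy data, not the real table: five rows, three distinct titles
titles = ["Editorial Board", "Paper A", "Editorial Board", "Paper B", "Editorial Board"]
print(len(titles), len(set(titles)))   # 5 3
print(repeated_titles(titles))         # {'Editorial Board': 3}
```

On the real data the call would be `repeated_titles(articles_df['title'])`.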

<a name='references'></a>
# 2. References 

- Elsapy. (2023). [Python]. Elsevier Developers. https://github.com/ElsevierDev/elsapy (Original work published 2016)
- Elsevier B.V. (2023a). Article (Full Text) Retrieval API. Elsevier Developer Portal. https://dev.elsevier.com/documentation/ArticleRetrievalAPI.wadl#d1e52
- Elsevier B.V. (2023b). Article Metadata API. Elsevier Developer Portal. https://dev.elsevier.com/documentation/ArticleMetadataAPI.wadl
- Elsevier B.V. (2023c). ScienceDirect Article Metadata Guide. Elsevier Developer Portal. https://dev.elsevier.com/sd_article_meta_tips.html
- Elsevier B.V. (2023d). ScienceDirect Search API Migration. Elsevier Developer Portal. https://dev.elsevier.com/tecdoc_sdsearch_migration.html

In [None]:
# ALTERNATIVE WHERE THE TITLE IS SAVED AS THE FILE NAME 

def sanitize_file_name(name):
    """
    This function takes an input string (file name) and removes special characters 
    while replacing spaces with underscores. It is used for creating clean and 
    valid file names.
    
    Parameters:
        name (str): The input string (file name) to be sanitized.

    Returns:
        sanitized_name (str): The sanitized version of the input string.
    """
    # Remove special characters and replace spaces with underscores
    sanitized_name = re.sub(r'[<>:"/\\|?*]', '', name)
    sanitized_name = sanitized_name.replace(' ', '_')
    return sanitized_name

def download_pdf_title(api_key, inst_token, df, doi_column='doi', title_column='title'):
    """
    This function downloads PDF articles based on their DOI and saves them in 
    specific directories. It checks if an article with the same DOI has been 
    downloaded before and avoids duplicate downloads.

    Parameters:
        api_key (str): The API key for accessing the Elsevier API.
        inst_token (str): The institution token for API access.
        df (pandas DataFrame): The DataFrame containing article information.
        doi_column (str, optional): The name of the DOI column in the DataFrame. Defaults to 'doi'.
        title_column (str, optional): The name of the article title column in the DataFrame. Defaults to 'title'.

    Note: This function also creates two directories ('fulltext_articles' and 
    'fulltext_editorialboard') to organize saved PDFs based on the article title. 
    PDFs with the title 'Editorial board' are saved in the 'fulltext_editorialboard' 
    directory, while others are saved in the 'fulltext_articles' directory. 
    Additionally, it maintains sets of downloaded DOIs and article names to 
    prevent duplicate downloads and saves this information for future runs.
    """
    
    global downloaded_dois, downloaded_article_names

    # Define the header for the request
    headers_param = {
        'Accept': 'application/pdf',
        'X-ELS-APIKey': api_key,
        'X-ELS-Insttoken': inst_token
    }

    # Load downloaded info from the previous run
    load_downloaded_info()

    # Create directories to save the PDFs
    fulltext_pdf_dir = 'downloaded_pdfs/fulltext_articles'
    editorialboard_pdf_dir = 'downloaded_pdfs/fulltext_editorialboard'
    
    for pdf_dir in [fulltext_pdf_dir, editorialboard_pdf_dir]:
        os.makedirs(pdf_dir, exist_ok=True)
    
    for index, row in df.iterrows():
        doi = row[doi_column]
        article_title = row[title_column]

        # Check if the DOI has already been downloaded
        if doi in downloaded_dois:
            print(f"Article with DOI {doi} already downloaded. Skipping.")
            continue

        # Sanitize the article title for use as a file name
        sanitized_title = sanitize_file_name(article_title)

        # Define the API endpoint for fetching the PDF
        article_url = f'https://api.elsevier.com/content/article/doi/{doi}?view=FULL'

        # Determine the PDF directory based on the article title
        if sanitized_title.lower() == 'editorial_board':
            pdf_dir = editorialboard_pdf_dir
        else:
            pdf_dir = fulltext_pdf_dir

        # Check if a file with the same title already exists
        pdf_file_path = os.path.join(pdf_dir, f'{sanitized_title}.pdf')
        counter = 1
        while os.path.exists(pdf_file_path):
            # If a file with the same title exists, add a suffix to the filename
            pdf_file_path = os.path.join(pdf_dir, f'{sanitized_title}_{counter}.pdf')
            counter += 1

        # Send the GET request to retrieve the PDF content
        response = requests.get(article_url, headers=headers_param)

        # Check if the request was successful (HTTP status code 200)
        if response.status_code == 200:
            
            # Save the PDF content to a file
            with open(pdf_file_path, 'wb') as pdf_file:
                pdf_file.write(response.content)
            
            # Add the DOI to the downloaded set
            downloaded_dois.add(doi)
            # Add the sanitized article name to the downloaded set
            downloaded_article_names.add(sanitized_title)
        else:
            print(f"Failed to retrieve PDF for article with DOI {doi}. Status code: {response.status_code}")

    # Save downloaded info for the next run
    save_downloaded_info()
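
A short sketch of the two naming steps in this alternative: cleaning the title and adding a numeric suffix when the name is already taken. `sanitize_file_name` repeats the rule from the cell above; `unique_path` and its `exists` argument are hypothetical, added so the suffix logic can be tried without touching the disk.

```python
import os
import re


def sanitize_file_name(name):
    """Same rule as above: drop reserved characters, spaces become underscores."""
    return re.sub(r'[<>:"/\\|?*]', '', name).replace(' ', '_')


def unique_path(directory, stem, exists=os.path.exists):
    """Return '<stem>.pdf', or '<stem>_1.pdf', '<stem>_2.pdf', ... if taken."""
    path = os.path.join(directory, f"{stem}.pdf")
    counter = 1
    while exists(path):
        path = os.path.join(directory, f"{stem}_{counter}.pdf")
        counter += 1
    return path


print(sanitize_file_name("A/B: C?"))   # AB_C
taken = {os.path.join("pdfs", "Editorial_Board.pdf")}
print(unique_path("pdfs", "Editorial_Board", exists=taken.__contains__))
```

The suffix matters for titles that occur more than once, such as 'Editorial Board', which would otherwise overwrite each other.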