# Table of contents 
- [Setup](#setup) 
    - [Target issues](#targetissues) 
    - [Libraries](#libraries)
    - [Setup](#setup) 
- [Elsevier API](#elsevierAPI)
    - [NeuroImage 2022 Articles](#NeuroImage2022Articles)
        - [Create database](#createdatabase)
        - [Metadata-database](#metadataDatabase)
    - [Fetch PDFs](#fetchPDFs)
- [References](#references)

<a name='setup'></a>
# 0. Setup 

In this file, I download all papers published in the NeuroImage journal in 2022. 
<br>
<br>

<a name='targetissues'></a>
## 0.1. Target issues 

NeuroImage 2022: https://www.sciencedirect.com/journal/neuroimage/issues 
<br>Issues 246-264 
<br>
<br>

<a name='libraries'></a>
## 0.2. Libraries 

In [1]:
import pandas as pd
import numpy as np
import requests 
import json 
import os 
import re
import time
from urllib.parse import quote  # Import the quote function for URL encoding

import elsapy 

<a name='setup'></a> 
## 0.3. Setup 

In [2]:
## Load configuration (the context manager closes the file automatically)
with open("config.json") as con_file:
    config = json.load(con_file)

## Initialize client
api_key = config['apikey'] 
inst_token = config['insttoken']

In [3]:
# The API key and institution token are read from config.json in the previous cell.
# Do not hardcode them here: the institution token must be kept private.
# api_key = '<your API key>'
# inst_token = '<your institution token>'

In [None]:
# Define the directory to save the PDFs
#output_dir = '../ElsevierAPI/papers_fulltext/'

# Ensure the output directory exists (only create if it doesn't exist)
#if not os.path.exists(output_dir):
    #os.makedirs(output_dir)

<a name = 'elsevierAPI'></a>
# 1. Elsevier API 

I use Elsevier's API to download the PDFs of the articles published in NeuroImage. 
Elsevier has a Python SDK called **elsapy**. Its GitHub repository contains several files with code snippets, which I read to understand how to use the API. 

Anyone can create an API key at https://dev.elsevier.com/. The institution token, however, is not freely available.  

To comply with Elsevier's rules, the insttoken must not appear in any browser code or in the address bar, and I must keep it secure. This means that anyone who runs this notebook without their own institution token will not be able to download the papers. 
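
One way to keep the credentials out of the notebook is to read them from the environment and fall back to `config.json`. The sketch below is only an illustration: the environment variable names `ELSEVIER_API_KEY` and `ELSEVIER_INST_TOKEN` and the helper names are my own assumptions, while the `apikey` and `insttoken` keys and the header names are the ones used elsewhere in this notebook.

```python
import json
import os


def load_credentials(config_path="config.json"):
    """Return (api_key, inst_token), preferring environment variables.

    The variable names ELSEVIER_API_KEY / ELSEVIER_INST_TOKEN are hypothetical.
    """
    api_key = os.environ.get("ELSEVIER_API_KEY")
    inst_token = os.environ.get("ELSEVIER_INST_TOKEN")

    # Fall back to the config file for anything the environment does not provide
    if (api_key is None or inst_token is None) and os.path.exists(config_path):
        with open(config_path) as con_file:
            config = json.load(con_file)
        api_key = api_key or config.get("apikey")
        inst_token = inst_token or config.get("insttoken")

    if not api_key:
        raise RuntimeError("No API key found in the environment or in the config file")
    return api_key, inst_token


def build_headers(api_key, inst_token=None, accept="application/json"):
    """Build the request headers; the institution token is added only if present."""
    headers = {"Accept": accept, "X-ELS-APIKey": api_key}
    if inst_token:
        headers["X-ELS-Insttoken"] = inst_token
    return headers
```

With this, `headers_param = build_headers(*load_credentials())` would replace the repeated header dictionaries, and no secret needs to be stored in the notebook itself.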
 
References: 
- Elsapy. (2023). [Python]. Elsevier Developers. https://github.com/ElsevierDev/elsapy (Original work published 2016)


<a name='NeuroImage2022Articles'></a>
## 1.1. NeuroImage 2022 articles 

Steps: 
* Formulate the correct search to get the 834 articles (815 if I exclude the 'Editorial board' article from each volume) 
* Extract the metadata about each article 
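
The difference between the two counts is 834 - 815 = 19, which matches one 'Editorial board' entry for each of the 19 volumes (246 to 264). A minimal sketch of that exclusion is shown below. It assumes each search result is a dictionary with a `title` key (the same column name the download step uses) and that the entries are titled exactly 'Editorial board'. The sample records are made up.

```python
def drop_editorial_board(articles, title_key="title"):
    """Return the articles whose title is not 'Editorial board' (case-insensitive)."""
    return [
        article
        for article in articles
        if str(article.get(title_key, "")).strip().lower() != "editorial board"
    ]


# Made-up records, only to illustrate the filter
sample_articles = [
    {"title": "Editorial board", "doi": "placeholder-1"},
    {"title": "A study of cortical thickness", "doi": "placeholder-2"},
    {"title": "Editorial Board ", "doi": "placeholder-3"},
]

kept = drop_editorial_board(sample_articles)
print(len(kept))  # 1

# Sanity check on the expected counts: one entry per volume, volumes 246-264
n_volumes = 264 - 246 + 1
print(834 - n_volumes)  # 815
```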

In [None]:
def search_sciencedirect(put_body):
    """Send a search request to the ScienceDirect Search API.

    Returns the parsed JSON response, or None if the request failed.
    """
    # Define the base URL for the PUT request
    base_url = 'https://api.elsevier.com/content/search/sciencedirect'
    
    # The 'headers' parameter 
    headers_param = {
        'Accept': 'application/json',
        'X-ELS-APIKey': api_key,
        'X-ELS-Insttoken': inst_token
    }

    # Convert the PUT body to JSON format
    put_body_json = json.dumps(put_body)

    # Send the PUT request
    response = requests.put(base_url, data=put_body_json, headers=headers_param)

    # Handle error cases: report the status code and return nothing
    if response.status_code != 200:
        print(f"Request failed with status code: {response.status_code}")
        return None

    # The request was successful (status code 200): return the parsed JSON
    return response.json()

In [None]:
put_body = {
    'qs': 'NeuroImage',
    'date': '2022',  
    'pub': 'NeuroImage',  
    'volume': '246-264', 
    'view': 'COMPLETE',
    'filters': {
        'openAccess': 'true',
    }
}

# Call the function defined in the previous cell and keep the parsed response
data = search_sciencedirect(put_body)

<a name='createdatabase'></a> 
### 1.1.1. Create database 

References: 
- Elsevier B.V. (2023d). ScienceDirect Search API Migration. Elsevier Developer Portal. https://dev.elsevier.com/tecdoc_sdsearch_migration.html
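
The search in the next cell pages through the results by increasing `offset` in steps of `show` until an empty page comes back. If the total number of hits is known in advance, the offsets can also be computed up front. The sketch below shows that arithmetic for 834 results and 50 results per request. The helper name `page_offsets` is mine, and 834 is the unfiltered article count from section 1.1, so the number for the filtered search may differ.

```python
import math


def page_offsets(total_results, show_per_request=50):
    """Return the 'offset' value of every request needed to cover all results."""
    n_requests = math.ceil(total_results / show_per_request)
    return [page * show_per_request for page in range(n_requests)]


offsets = page_offsets(834, show_per_request=50)
print(len(offsets), offsets[0], offsets[-1])  # 17 0 800
```

The loop in the notebook does not need the total: it stops at the first empty page, at the cost of one extra request.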

In [None]:
headers_param = {
    'Accept': 'application/json', 
    'X-ELS-APIKey': api_key, 
    'X-ELS-Insttoken': inst_token
}

# Define the base URL for the PUT request
base_url = 'https://api.elsevier.com/content/search/sciencedirect'

# Define the search query parameters
search_query = 'NeuroImage'
publication_date = '2022'
volume_range = '246-264'
show_per_request = 50  # Number of articles to retrieve per request

# Create an empty list to store all articles
all_articles = []

# Initialize page number
page = 0

while True:
    # Increment the page number
    page += 1

    # Construct the PUT request body in JSON format with "display" element
    put_body = {
        'qs': search_query,
        'date': publication_date,
        'pub': search_query,
        'volume': volume_range,
        'view': 'COMPLETE',
        'filters': {
            'openAccess': 'true',
        },
        'display': {
            'offset': (page - 1) * show_per_request,
            'show': show_per_request,
            'sortBy': 'date'  # Sort by date (you can change this to 'relevance' if needed)
        }
    }

    # Convert the PUT body to JSON format
    put_body_json = json.dumps(put_body)

    # Send the PUT request
    response = requests.put(base_url, data=put_body_json, headers=headers_param)
    
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        data = response.json()
        #print("Results found: ", data['resultsFound'])
        articles_on_page = data.get("results", [])

        if not articles_on_page:
            break  # No more articles to fetch

        all_articles.extend(articles_on_page)

        # Implement rate limiting to avoid hitting the API rate limit
        time.sleep(1)  # Sleep for 1 second between requests

        # Print progress information
        #print(f"Fetched {len(all_articles)} articles so far...")
    else:
        # Handle error cases
        print(f"Request failed with status code: {response.status_code}")
        break

# Create a DataFrame from all articles
df = pd.DataFrame(all_articles)

# Print the final count of articles
print(f"Total articles fetched: {len(df)}")

In [None]:
df

<a name='metadataDatabase'></a>
### 1.1.2. Metadata-database 
Steps: 
- For each DOI in df['doi']
    - send a GET request to https://api.elsevier.com/content/metadata/article with 
        - the same header_params as before 
        - query_params containing a 'query' entry that wraps the DOI, i.e. doi(<DOI>) 
        
References: 
* Elsevier B.V. (2023b). Article Metadata API. Elsevier Developer Portal. https://dev.elsevier.com/documentation/ArticleMetadataAPI.wadl
* Elsevier B.V. (2023c). ScienceDirect Article Metadata Guide. Elsevier Developer Portal. https://dev.elsevier.com/sd_article_meta_tips.html
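
The query described in the steps above can be sketched as a small helper that builds the parameters for one DOI. The helper name and the DOI are made-up placeholders. `urlencode` from the standard library is used here only to show what the encoded query string looks like. Since `requests` encodes the values passed through `params` in the same way, the DOI can be handed over unencoded.

```python
from urllib.parse import urlencode


def build_metadata_params(doi, view="COMPLETE", count=1):
    """Build the query parameters for a metadata lookup of a single DOI."""
    return {
        "query": f"doi({doi})",  # the DOI is wrapped in doi(...), unencoded
        "view": view,
        "count": str(count),
    }


# Placeholder DOI, not a real article
params = build_metadata_params("10.1016/j.neuroimage.2022.000000")
print(params["query"])
print(urlencode(params))
```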

In [None]:
# Define the base URL for the article metadata endpoint
metadata_base_url = 'https://api.elsevier.com/content/metadata/article'

# Initialize an empty list to store the metadata for each article
metadata_list = []

# Define query parameters for the request
query_params = {
    'view': 'COMPLETE',  # Specify the view as COMPLETE
    'count': '1'  # Set count to 1 to fetch only one result for each DOI
}

# Define headers for the request
headers_param = {
    'Accept': 'application/json',
    'X-ELS-APIKey': api_key,
    'X-ELS-Insttoken': inst_token
}
count=0

# Iterate through each DOI in the DataFrame 'df'
for doi in df['doi']:
    # Add the DOI to the query parameters. requests URL-encodes the values in 'params',
    # so the DOI is passed as-is (encoding it first with quote() would encode it twice)
    query_params['query'] = f'doi({doi})'
    
    # Make a GET request to fetch metadata for the article
    response = requests.get(metadata_base_url, headers=headers_param, params=query_params)
    
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        metadata = response.json()
        metadata_list.append(metadata)
        count += 1
        # This takes a while, but it'll get done 
        #print("Fetched ", count, " articles")
    else:
        # Handle error cases
        print(f"Failed to fetch metadata for DOI: {doi}. Status code: {response.status_code}")

# Create a DataFrame from the list of metadata
metadata_df = pd.DataFrame(metadata_list)

# Now, 'metadata_df' contains the additional metadata for each article

In [None]:
metadata_df['search-results'].loc[1]

<a name='fetchPDFs'></a>
## 1.2. Fetch PDFs

References: 
- Elsevier B.V. (2023a). Article (Full Text) Retrieval API. Elsevier Developer Portal. https://dev.elsevier.com/documentation/ArticleRetrievalAPI.wadl#d1e52
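
The download cells in this section write the response body to disk whenever the status code is 200. As an extra safeguard, the body can be checked for the PDF file signature (`%PDF-`) before it is saved. This is a sketch with hypothetical helper names. Whether the API ever returns a non-PDF body with status 200 is an assumption I have not verified.

```python
def looks_like_pdf(content):
    """Return True if the bytes start with the PDF file signature '%PDF-'."""
    return content.startswith(b"%PDF-")


def save_if_pdf(content, pdf_file_path):
    """Write the bytes to pdf_file_path only if they look like a PDF.

    Returns True if the file was written, False otherwise.
    """
    if not looks_like_pdf(content):
        return False
    with open(pdf_file_path, "wb") as pdf_file:
        pdf_file.write(content)
    return True
```

In `download_pdf`, the DOI would then be added to `downloaded_dois` only when `save_if_pdf(response.content, pdf_file_path)` returns `True`.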

In [None]:
#filters
neuroImagedf = df 
headers_param = {
    'Accept': 'application/pdf',
    'X-ELS-APIKey': api_key,
    'X-ELS-Insttoken': inst_token
}

# Maintain a list of downloaded DOIs and article names
downloaded_dois = set()
downloaded_article_names = set()

In [None]:
def load_downloaded_info():
    global downloaded_dois, downloaded_article_names
    info_file_path = 'downloadedPDFs_info.json'  # File path
    if os.path.exists(info_file_path):
        with open(info_file_path, 'r') as file:
            data = json.load(file)
            downloaded_dois = set(data.get('DOIs', []))
            downloaded_article_names = set(data.get('article_names', []))
    else:
        # No record from a previous run: start with empty sets
        downloaded_dois = set()
        downloaded_article_names = set()

def save_downloaded_info():
    data = {
        'DOIs': list(downloaded_dois),
        'article_names': list(downloaded_article_names)
    }
    info_file_path = 'downloadedPDFs_info.json'  # File path
    with open(info_file_path, 'w') as file:
        json.dump(data, file, indent=4)  # Indent for readability     
        
def sanitize_file_name(name):
    # Remove special characters and replace spaces with underscores
    sanitized_name = re.sub(r'[<>:"/\\|?*]', '', name)
    sanitized_name = sanitized_name.replace(' ', '_')
    return sanitized_name

def download_pdf(df, headers_param, doi_column='doi', title_column='title'):
    global downloaded_dois, downloaded_article_names

    # Load downloaded info from the previous run
    load_downloaded_info()

    # Create a directory to save the PDFs if it doesn't exist
    pdf_dir = 'downloaded_pdfs'
    os.makedirs(pdf_dir, exist_ok=True)
    
    for index, row in df.iterrows():
        doi = row[doi_column]
        article_title = row[title_column]

        # Check if the DOI has already been downloaded
        if doi in downloaded_dois:
            print(f"Article with DOI {doi} already downloaded. Skipping.")
            continue

        # Sanitize the article title for use as a file name
        sanitized_title = sanitize_file_name(article_title)

        # Define the API endpoint for fetching the PDF
        article_url = f'https://api.elsevier.com/content/article/doi/{doi}?view=FULL'

        # Check if a file with the same title already exists
        pdf_file_path = os.path.join(pdf_dir, f'{sanitized_title}.pdf')
        counter = 1
        while os.path.exists(pdf_file_path):
            # If a file with the same title exists, add a suffix to the filename
            pdf_file_path = os.path.join(pdf_dir, f'{sanitized_title}_{counter}.pdf')
            counter += 1

        # Send the GET request to retrieve the PDF content
        response = requests.get(article_url, headers=headers_param)

        # Check if the request was successful (HTTP status code 200)
        if response.status_code == 200:
            
            # Save the PDF content to a file
            with open(pdf_file_path, 'wb') as pdf_file:
                pdf_file.write(response.content)
            # Add the DOI to the downloaded set
            downloaded_dois.add(doi)
            # Add the sanitized article name to the downloaded set
            downloaded_article_names.add(sanitized_title)
        else:
            print(f"Failed to retrieve PDF for article with DOI {doi}. Status code: {response.status_code}")

    # Save downloaded info for the next run
    save_downloaded_info()

In [None]:
download_pdf(df, headers_param, doi_column='doi', title_column='title')

In [None]:
# Define the API endpoint for fetching the PDF
test_doi = df['doi'].loc[0]
article_url = f'https://api.elsevier.com/content/article/doi/{test_doi}?view=FULL'

# Define headers to specify the desired content type (application/pdf)
headers_param = {
    'Accept': 'application/pdf',
    'X-ELS-APIKey': api_key,
    'X-ELS-Insttoken': inst_token
}

# Send the GET request to retrieve the PDF content
response = requests.get(article_url, headers=headers_param)

# Check if the request was successful (HTTP status code 200)
if response.status_code == 200:
    # Specify the path where you want to save the PDF file and make sure its directory exists
    pdf_file_path = '../ElsevierAPI/papers_fulltext/article.pdf'
    os.makedirs(os.path.dirname(pdf_file_path), exist_ok=True)

    # Save the PDF content to a file
    with open(pdf_file_path, 'wb') as pdf_file:
        pdf_file.write(response.content)

    print(f"PDF saved to {pdf_file_path}")
else:
    print(f"Failed to retrieve PDF. Status code: {response.status_code}")

In [None]:
headers_param={'Accept': 'application/json', 
         'X-ELS-APIKey': api_key, 
         'X-ELS-Insttoken': inst_token}

base_url = 'https://api.elsevier.com/content/search/sciencedirect'

# Define the search query to find NeuroImage articles published in 2022
issn = '1095-9572'
issnl = '1053-8119'

search_query = 'NeuroImage+AND+2022+ISSN+10959572' 

# Define additional query parameters
query_params = {
    'query': search_query,
    'count': '200',  # Increase count to retrieve more results if necessary
    'view': 'COMPLETE',
    'start': '0',
}

# Send the GET request to retrieve article metadata
response = requests.get(base_url, params=query_params, headers=headers_param).json()

Next steps: 
- Download PDFs
- Text extraction using e.g., PDFMiner - save it in a structured format 
    - NB! It seems like a lot of the articles have a section called 'Data and code availability', where this information is available 
- Dataset extraction 
    - Locate the dataset in the text 
- Store the extracted datasets for further analysis 
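
The dataset-extraction step above could start from a sketch like the following, which looks for a 'Data and code availability' heading in already extracted text and returns the paragraph that follows it. The heading pattern, the function name, and the sample text are assumptions for illustration. Real PDF text is likely to need a more tolerant pattern.

```python
import re

# Heading on its own line, optionally preceded by a section number such as "5." or "2.3"
AVAILABILITY_HEADING = re.compile(
    r"^[ \t]*(?:\d+(?:\.\d+)*\.?[ \t]+)?data and code availability[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_availability_section(text):
    """Return the paragraph after the availability heading, or None if there is none."""
    match = AVAILABILITY_HEADING.search(text)
    if match is None:
        return None
    rest = text[match.end():].lstrip("\n")
    paragraph = rest.split("\n\n", 1)[0]  # the section is assumed to end at a blank line
    return " ".join(paragraph.split())  # collapse line breaks and repeated spaces


# Made-up text standing in for the output of a PDF text extractor
sample_text = (
    "4. Discussion\nSome discussion text.\n\n"
    "Data and code availability\n"
    "The data are available at\nhttps://example.org/dataset.\n\n"
    "References\n"
)

print(extract_availability_section(sample_text))
# The data are available at https://example.org/dataset.
```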


# ORIGINAL CODE 
## Step 1.1

In [None]:
headers_param = {
    'Accept': 'application/json', 
    'X-ELS-APIKey': api_key, 
    'X-ELS-Insttoken': inst_token
}

# Define the base URL for the PUT request
base_url = 'https://api.elsevier.com/content/search/sciencedirect'

# Construct the PUT request body in JSON format
put_body = {
    'qs': 'NeuroImage',
    'date': '2022',
    'pub': 'NeuroImage',
    'volume': '246-264',
    'view': 'COMPLETE',
    'filters': {
        'openAccess': 'true',
    }
}

# Convert the PUT body to JSON format
put_body_json = json.dumps(put_body)

# Send the PUT request
response = requests.put(base_url, data=put_body_json, headers=headers_param)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    data = response.json()
else:
    # Handle error cases
    print(f"Request failed with status code: {response.status_code}")

<a name='references'></a>
# References 

- Elsapy. (2023). [Python]. Elsevier Developers. https://github.com/ElsevierDev/elsapy (Original work published 2016)
- Elsevier B.V. (2023a). Article (Full Text) Retrieval API. Elsevier Developer Portal. https://dev.elsevier.com/documentation/ArticleRetrievalAPI.wadl#d1e52
- Elsevier B.V. (2023b). Article Metadata API. Elsevier Developer Portal. https://dev.elsevier.com/documentation/ArticleMetadataAPI.wadl
- Elsevier B.V. (2023c). ScienceDirect Article Metadata Guide. Elsevier Developer Portal. https://dev.elsevier.com/sd_article_meta_tips.html
- Elsevier B.V. (2023d). ScienceDirect Search API Migration. Elsevier Developer Portal. https://dev.elsevier.com/tecdoc_sdsearch_migration.html