# 0. Setup 

In this file, I download all papers published in the NeuroImage journal in 2022. 

## 0.1 Pipeline 

1. Pick targets 
2. Find input 
3. Run downloader 
4. Read PDFs 
5. Save results 

### 0.1.1 Target issues 

NeuroImage 2022: https://www.sciencedirect.com/journal/neuroimage/issues 

In [1]:
issues = 0
# 246-264
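
A minimal sketch of how the placeholder could be filled in. It assumes that the comment `# 246-264` gives the first and last volume numbers of the 2022 issues, which is an interpretation and is not stated explicitly in this notebook; the names `FIRST_VOLUME`, `LAST_VOLUME`, and `target_volumes` are made up for illustration.

```python
# Assumption: 246 and 264 are the first and last NeuroImage volumes of 2022
FIRST_VOLUME, LAST_VOLUME = 246, 264

# Inclusive range of volume numbers to target
target_volumes = list(range(FIRST_VOLUME, LAST_VOLUME + 1))
print(target_volumes)
```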

### 0.1.2 Libraries 

In [2]:
import pandas as pd
import numpy as np
import requests 
import json 

import elsapy 

### 0.1.3 Variables 

# 1. Scrape journal links 

I have to use Elsevier's API, with the API key I generated at the beginning of this project. 

I use the Python SDK **elsapy** (github link), 
starting from the code in the file 'exampleProg.py'. 
 

In [4]:
# The API key is free and available on https://dev.elsevier.com/
# 5e101447f9a3293a580f41e255b2aba8
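
The next cell reads the key from `config.json` under the field `apikey`. As a hedged sketch, the helper below does the same but first checks an environment variable, so the key does not have to appear in the notebook itself. The function name `load_api_key` and the variable name `ELSEVIER_API_KEY` are hypothetical; only the file name and the `apikey` field come from this notebook.

```python
import json
import os
from pathlib import Path


def load_api_key(config_path="config.json", env_var="ELSEVIER_API_KEY"):
    """Return the API key from an environment variable or a JSON config file.

    The environment variable name is an assumption made for this sketch.
    """
    # Prefer the environment so the key never has to live in the notebook
    key = os.environ.get(env_var)
    if key:
        return key

    path = Path(config_path)
    if not path.exists():
        raise FileNotFoundError(f"{env_var} is not set and {config_path} was not found")

    # Fall back to the config file, which is expected to hold {"apikey": "..."}
    with path.open() as con_file:
        config = json.load(con_file)
    return config["apikey"]
```

With this in place, the client could be created as `ElsClient(load_api_key())` instead of reading the file inline.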

In [None]:
from elsapy.elsclient import ElsClient
from elsapy.elsprofile import ElsAuthor, ElsAffil
from elsapy.elsdoc import FullDoc, AbsDoc
from elsapy.elssearch import ElsSearch

## Load configuration - it contains my API key, which is free and was created using https://dev.elsevier.com/
with open("config.json") as con_file:
    config = json.load(con_file)

## Initialize client
client = ElsClient(config['apikey'])

# Define the search query to find NeuroImage articles published in 2022
search_query = 'jr(NEUROIMAGE) AND pubyear=2022'

# Initialize the document search object
doc_search = ElsSearch(search_query, 'sciencedirect')

# Execute the search
doc_search.execute(client)

# Print the number of results
print(f"Found {len(doc_search.results)} NeuroImage articles published in 2022 on ScienceDirect.")

# Print the titles of the articles
for result in doc_search.results:
    print("Title:", result['dc:title'])

In [14]:
headers = {}
headers['Accept']='application/json'
headers['X-ELS-APIKey'] = '5e101447f9a3293a580f41e255b2aba8'
headers['Content-Type'] = 'application/json'

data = {
        'qs':'("NeuroImage") AND (2022)'
       } 

Url = "https://api.elsevier.com/content/search/sciencedirect"

r = requests.put(Url, data =json.dumps(data), headers=headers)

print(r.url)
      
# Check if the request was successful
if r.status_code == 200:
    # Parse the JSON response
    result_data = r.json()

    # Extract the list of article entries
    articles = result_data.get('results', [])

    # Initialize an empty list to store the article metadata
    article_metadata = []

    # Extract article metadata
    for article in articles:
        title = article.get('dc:title', 'N/A')
        authors = ', '.join(author.get('preferred-name', {}).get('ce:indexed-name', 'N/A') for author in article.get('authors', []))
        pub_date = article.get('prism:coverDate', 'N/A')
        doi = article.get('prism:doi', 'N/A')
        source_title = article.get('prism:publicationName', 'N/A')

        article_metadata.append({
            'Title': title,
            'Authors': authors,
            'Publication Date': pub_date,
            'DOI': doi,
            'Source Title': source_title,
        })

    # Create a DataFrame from the article metadata
    df = pd.DataFrame(article_metadata)

    # Save the metadata as a CSV file
    df.to_csv('NeuroImage_2022_Metadata.csv', index=False)

    print(f"Saved {len(df)} articles' metadata to NeuroImage_2022_Metadata.csv.")
else:
    print(f"Error: {r.status_code} - {r.text}")

https://api.elsevier.com/content/search/sciencedirect
Error: 401 - {"service-error":{"status":{"statusCode":"AUTHORIZATION_ERROR","statusText":"The requestor is not authorized to access the requested view or fields of the resource"}}}
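
The error body above is JSON with a `statusCode` of `AUTHORIZATION_ERROR`, which suggests that the key was recognised but is not entitled to this resource, rather than that the request was malformed. A small sketch for turning such a body into a one-line summary follows; the helper name `describe_service_error` is hypothetical, and the sample string is the response printed above.

```python
import json


def describe_service_error(body_text):
    """Summarise a 'service-error' response body as 'CODE: text'."""
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError:
        return "Unparseable error body"
    if not isinstance(payload, dict):
        return "Unparseable error body"

    # The structure mirrors the error printed by the cell above
    status = payload.get("service-error", {}).get("status", {})
    code = status.get("statusCode", "UNKNOWN")
    text = status.get("statusText", "")
    return f"{code}: {text}"


sample = ('{"service-error":{"status":{"statusCode":"AUTHORIZATION_ERROR",'
          '"statusText":"The requestor is not authorized to access the '
          'requested view or fields of the resource"}}}')
print(describe_service_error(sample))
```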


In [10]:
# The API key is free and available on https://dev.elsevier.com/
api_key = '5e101447f9a3293a580f41e255b2aba8'

# Define the base URL for the Scopus API
base_url = 'https://api.elsevier.com/content/search/scopus'

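# Define the search query to find NeuroImage articles published in 2022
# SRCTITLE matches on the journal name; TITLE-ABS-KEY would match any paper that mentions "NeuroImage".
# Related journals whose names contain "NeuroImage" may also match and need filtering afterwards.
search_query = 'SRCTITLE("NeuroImage") AND PUBYEAR = 2022'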

# Define the headers with your API key and desired response format
headers = {
    'Accept': 'application/json',
    'X-ELS-APIKey': api_key,
}

# Define additional query parameters
params = {
    'query': search_query,
    'count': '25',  # Page size; the COMPLETE view is limited to 25 results per request
    'view': 'COMPLETE',
    'start': '0',
}

# Initialize an empty list to store the article metadata
article_metadata = []

# Fetch articles in batches until there are no more results
while True:
    response = requests.get(base_url, headers=headers, params=params)
    print(response)

    # Stop on any HTTP error (e.g. 401 when the key lacks access to this view)
    if response.status_code != 200:
        break
    data = response.json()

    # The Scopus Search API nests hits under 'search-results' -> 'entry'
    entries = data.get('search-results', {}).get('entry', [])

    # Stop when the batch is empty or holds only an error placeholder
    if not entries or 'error' in entries[0]:
        break

    # Extract article metadata from the response
    for entry in entries:
        article_metadata.append({
            'Title': entry.get('dc:title', ''),
            'Authors': ', '.join(author.get('authname', '') for author in entry.get('author', [])),
            'Publication Date': entry.get('prism:coverDate', ''),
            'DOI': entry.get('prism:doi', ''),
            'Source Title': entry.get('prism:publicationName', ''),
        })

    # Update the 'start' parameter to fetch the next batch
    params['start'] = str(int(params['start']) + len(entries))

# Create a DataFrame from the article metadata
df = pd.DataFrame(article_metadata)

# Save the metadata as a CSV file
df.to_csv('NeuroImage_2022_Metadata.csv', index=False)

print(f"Saved {len(df)} articles' metadata to NeuroImage_2022_Metadata.csv.")

<Response [401]>
Saved 0 articles' metadata to NeuroImage_2022_Metadata.csv.
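
The cell above pages through results by advancing a `start` offset by the number of entries returned. The sketch below isolates that pattern so it can be checked without network access or a valid key. `fetch_all`, `fetch_page`, and the fake records are all hypothetical stand-ins, and the page size of 25 is an arbitrary choice for the sketch.

```python
def fetch_all(fetch_page, page_size=25):
    """Collect entries from an offset-paginated source.

    `fetch_page(start, count)` is a hypothetical callable that returns a
    list of entries, or an empty list once the results are exhausted.
    """
    collected = []
    start = 0
    while True:
        batch = fetch_page(start, page_size)
        if not batch:
            break
        collected.extend(batch)
        # Advance by what was actually returned, not by the requested size
        start += len(batch)
    return collected


# Stand-in for the API: 60 made-up records served in slices
fake_records = [{"dc:title": f"Paper {i}"} for i in range(60)]


def fake_fetch(start, count):
    return fake_records[start:start + count]


print(len(fetch_all(fake_fetch)))
```

Separating the loop from the HTTP call this way makes it possible to test the stopping condition before a working key is available.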


Next steps: 
- Download PDFs
- Text extraction using, e.g., PDFMiner; save the text in a structured format 
    - NB! Many of the articles seem to have a section called 'Data and code availability', where this information is given 
- Dataset extraction 
    - Locate the dataset in the text 
- Store the extracted datasets for further analysis 
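
As a rough sketch of the dataset-extraction steps listed above, the code below works on plain text that has already been extracted from a PDF. It looks for a 'Data and code availability' heading, keeps the paragraphs up to the next heading-like line, and pulls out links and DOIs. The heading pattern, the heuristic for spotting the next heading, and the sample text (with `example.org` links) are all assumptions made for illustration; real articles will need more robust rules.

```python
import re

# Heading variants are an assumption; real articles may word it differently
HEADING = re.compile(
    r"^[ \t]*(?:\d+(?:\.\d+)*\.?[ \t]+)?data and code availability(?: statement)?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
# Crude pattern for URLs and bare DOIs that may point to datasets
LINK = re.compile(r"https?://[^\s)\]>]+|\b10\.\d{4,9}/[^\s)\]>]+")


def availability_section(text):
    """Return the paragraphs under the availability heading, or None."""
    match = HEADING.search(text)
    if match is None:
        return None
    kept = []
    for paragraph in re.split(r"\n\s*\n", text[match.end():].strip()):
        lines = paragraph.strip().splitlines()
        # Heuristic: a short single line without end punctuation is the next heading
        looks_like_heading = (
            len(lines) == 1
            and len(lines[0]) < 60
            and not lines[0].rstrip().endswith((".", ":", ";"))
        )
        if looks_like_heading:
            break
        kept.append(paragraph.strip())
    return "\n\n".join(kept)


def find_links(section):
    """Return URLs and DOIs found in the section, without trailing punctuation."""
    return [link.rstrip(".,;") for link in LINK.findall(section)]


# Made-up sample text for illustration only
sample = """4. Discussion

Some discussion text.

Data and code availability

The data are available at https://example.org/dataset-123.
Analysis code is hosted at https://example.org/code.

Acknowledgements

We thank the participants.
"""

section = availability_section(sample)
print(find_links(section))
```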