# Fetch Scopus Data

This notebook retrieves publication data from the Scopus API for analysis of SSRI-related research papers.

## Resources

- [Scopus Search API](https://dev.elsevier.com/documentation/ScopusSearchAPI.wadl)
- [Scopus Retrieval of more than 5,000 articles](https://dev.elsevier.com/support.html)
- [Interactive Scopus API](https://dev.elsevier.com/scopus.html)
- [API Settings (rate limits)](https://dev.elsevier.com/api_key_settings.html)

## Prerequisites

1. Valid Scopus API key(s)
2. Cisco VPN connection
3. Required Python packages:
   - pandas
   - requests
   - tqdm
   - urllib3

## Overview

The notebook performs the following steps:

1. Connects to the Scopus API using one or more API keys
2. Fetches publication records in batches to handle rate limits
3. Saves intermediate results to prevent data loss
4. Processes and combines results into a pandas DataFrame

Note: A valid Scopus API key and Cisco VPN connection are required.
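
Steps 2 and 3 can be sketched as a paginated loop that rewrites a checkpoint file at a fixed interval. This is a minimal illustration under stated assumptions, not the implementation inside `ScopusArticleFetcher`: the names `fetch_in_batches`, `save_checkpoint`, and `fake_fetch_page` are hypothetical, the fake page function stands in for a real API request so the sketch runs offline, and every record is assumed to have the same keys.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

Record = Dict[str, str]


def save_checkpoint(records: List[Record], path: Path) -> None:
    """Overwrite the checkpoint file with everything fetched so far."""
    if not records:
        return
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)


def fetch_in_batches(
    fetch_page: Callable[[int, int], List[Record]],
    total: int,
    page_size: int,
    saving_interval: int,
    checkpoint_path: Path,
) -> List[Record]:
    """Fetch `total` records page by page, saving every `saving_interval` records."""
    records: List[Record] = []
    last_saved = 0
    for start in range(0, total, page_size):
        count = min(page_size, total - start)
        records.extend(fetch_page(start, count))
        if len(records) - last_saved >= saving_interval:
            save_checkpoint(records, checkpoint_path)
            last_saved = len(records)
    save_checkpoint(records, checkpoint_path)  # final save covers the remainder
    return records


def fake_fetch_page(start: int, count: int) -> List[Record]:
    """Hypothetical stand-in for one API request (no network access)."""
    return [
        {"eid": f"demo-{i}", "title": f"Record {i}"}
        for i in range(start, start + count)
    ]


with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = Path(tmp_dir) / "checkpoint.csv"
    demo_records = fetch_in_batches(
        fake_fetch_page,
        total=55,
        page_size=25,
        saving_interval=50,
        checkpoint_path=checkpoint,
    )
    with checkpoint.open(encoding="utf-8") as handle:
        saved_lines = handle.read().splitlines()

print(len(demo_records), len(saved_lines))
```

Rewriting the whole checkpoint on each save keeps the sketch simple; appending only the new rows would reduce I/O for large result sets.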


## Setup and Configuration


In [1]:
import os

from dotenv import load_dotenv

from src.data_fetching.ScopusApiKeyLoader import ScopusApiKeyLoader
from src.data_fetching.ScopusArticleFetcher import ScopusArticleFetcher

# Load environment variables from the .env file
load_dotenv()

# Read project paths from the environment (each is None if the variable is unset)
python_path = os.getenv("PYTHONPATH")
data_dir = os.getenv("DATA_DIR")
src_dir = os.getenv("SRC_DIR")
output_dir = os.getenv("OUTPUT_DIR")
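
Because `os.getenv` returns `None` for an unset variable, a missing `DATA_DIR` would only surface later when the output path is built. One way to fail early is a small guard such as the sketch below. The helper name `require_dir_setting` is hypothetical and not part of this project; the optional `env` argument exists only so the helper can be exercised without touching the real environment.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def require_dir_setting(name: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the named setting as a Path, or raise if it is missing or empty."""
    source = os.environ if env is None else env
    value = source.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; check the .env file"
        )
    return Path(value)


# Example with an explicit mapping instead of the real environment
demo_data_dir = require_dir_setting("DATA_DIR", {"DATA_DIR": "../data"})
print(demo_data_dir)
```

In the notebook this could replace the bare lookup, for example `data_dir = require_dir_setting("DATA_DIR")` after `load_dotenv()` has run.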

## Load API Keys


In [3]:
# Get API keys using the loader
api_keys = ScopusApiKeyLoader.get_api_keys()
print("Available API keys:")
for name, info in api_keys.items():
    print(f"- {name}: {info['rate_limit']:,} requests per week ({info['description']})")

# List form of the key entries, in case ScopusArticleFetcher requires a list
# (not used below; the fetcher is given the api_keys dict)
api_keys_list = list(api_keys.values())

dict_keys(['api_key_A', 'api_key_B', 'api_key_deb', 'api_key_haoxin', 'comment'])
rate limits are 40,000 per week for api_key_A and 10,000 for every other key
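
With several keys and different weekly quotas, the fetcher needs some way to move to the next key once one is exhausted. The sketch below shows one simple policy: use keys in order and switch when a quota is used up. It is an illustration only; how `ScopusArticleFetcher` actually rotates keys is not shown in this notebook. The class name `KeyPool` is hypothetical, and the sketch assumes each key entry is a dict with a `rate_limit` field, as in the loop above. Entries that do not fit that shape (such as a plain-text `comment`) are skipped.

```python
from typing import Dict, Optional


class KeyPool:
    """Hand out key names in order, moving on when a key's quota is used up."""

    def __init__(self, api_keys: Dict[str, object]) -> None:
        # Keep only entries that look like key records with a rate limit
        self._limits = {
            name: int(info["rate_limit"])
            for name, info in api_keys.items()
            if isinstance(info, dict) and "rate_limit" in info
        }
        self._used = {name: 0 for name in self._limits}

    def acquire(self) -> Optional[str]:
        """Return the name of a key with quota left, or None if all are spent."""
        for name, limit in self._limits.items():
            if self._used[name] < limit:
                self._used[name] += 1
                return name
        return None

    def remaining(self, name: str) -> int:
        """Number of requests still available for the named key."""
        return self._limits[name] - self._used[name]


# Tiny quotas so the rotation is visible; real limits are far larger
demo_pool = KeyPool(
    {
        "api_key_A": {"rate_limit": 2},
        "api_key_B": {"rate_limit": 1},
        "comment": "free-text note, ignored by the pool",
    }
)
demo_order = [demo_pool.acquire() for _ in range(4)]
print(demo_order)
```

The counters here live in memory only. Since the quotas are weekly, a real run would also need to persist usage between sessions or read the remaining quota from the API response.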


In [5]:
# Custom query parameters (optional)
custom_params = {
    "date": "1982-2030",  # wide range so that every publication year is covered
}

# Initialize fetcher with multiple API keys
fetcher = ScopusArticleFetcher(
    api_keys=api_keys,
    output_dir=os.path.join(data_dir, "01-raw", "publications"),
)

# Print query parameters for verification
print("Custom Parameters:")
for key, value in custom_params.items():
    print(f"{key}: {value}")

# Fetch results with custom parameters, saving intermediate results every 250 articles
results_df = fetcher.fetch_results(saving_interval=250, query_params=custom_params)

Custom Parameters:
date: 1982-2030

Total results found: 44444


Fetching results: 100%|██████████| 44444/44444 [33:49<00:00, 21.90articles/s]



Saved final results to ../data/01-raw/scopusnew/final_scopus_results_20250326_081230.csv
Total articles fetched: 44444


In [6]:
results_df.head(5)

Unnamed: 0,@_fa,link,prism:url,dc:identifier,eid,dc:title,dc:creator,prism:publicationName,prism:issn,prism:volume,...,openaccessFlag,prism:doi,pubmed-id,prism:eIssn,freetoread,freetoreadLabel,fund-acr,fund-sponsor,prism:isbn,article-number
0,True,"[{'@_fa': 'true', '@ref': 'self', '@href': 'ht...",https://api.elsevier.com/content/abstract/scop...,SCOPUS_ID:49049137192,2-s2.0-49049137192,"The action of monoaminergic, cholinergic and g...",Lloyd K.,Advances in the Biosciences,653446,40,...,False,,,,,,,,,
1,True,"[{'@_fa': 'true', '@ref': 'self', '@href': 'ht...",https://api.elsevier.com/content/abstract/scop...,SCOPUS_ID:4243303071,2-s2.0-4243303071,Failure of exogenous serotonin to inhibit the ...,Figueroa H.R.,General Pharmacology,3063623,13,...,False,10.1016/0306-3623(82)90072-6,7095394.0,,,,,,,
2,True,"[{'@_fa': 'true', '@ref': 'self', '@href': 'ht...",https://api.elsevier.com/content/abstract/scop...,SCOPUS_ID:0020468334,2-s2.0-0020468334,Citalopram. An introduction,Hyttel J.,Progress in Neuropsychopharmacology and Biolog...,2785846,6,...,False,10.1016/S0278-5846(82)80178-4,,,,,,,,
3,True,"[{'@_fa': 'true', '@ref': 'self', '@href': 'ht...",https://api.elsevier.com/content/abstract/scop...,SCOPUS_ID:0020459436,2-s2.0-0020459436,A placebo controlled study of the cardiovascul...,Robinson J.,British Journal of Clinical Pharmacology,3065251,14,...,False,10.1111/j.1365-2125.1982.tb02040.x,6817771.0,13652125.0,"{'value': [{'$': 'all'}, {'$': 'repository'}, ...","{'value': [{'$': 'All Open Access'}, {'$': 'Gr...",,,,
4,True,"[{'@_fa': 'true', '@ref': 'self', '@href': 'ht...",https://api.elsevier.com/content/abstract/scop...,SCOPUS_ID:0020446870,2-s2.0-0020446870,"Paroxetine, a potent selective long-acting inh...",Magnussen I.,Journal of Neural Transmission,3009564,55,...,False,10.1007/BF01276577,,14351463.0,,,,,,
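
Some columns in this preview, such as `freetoread` and `freetoreadLabel`, hold nested structures of the form `{'value': [{'$': ...}, ...]}` rather than plain values. The sketch below flattens such a cell into a list of strings. It assumes the shape visible in the preview, and it accepts either a dict (as held in memory) or its string form (as read back from the saved CSV). The function name `extract_values` is hypothetical.

```python
import ast
from typing import Any, List


def extract_values(cell: Any) -> List[str]:
    """Flatten a {'value': [{'$': ...}, ...]} cell into a list of strings."""
    if isinstance(cell, str):
        # Cells read back from CSV arrive as the string form of the dict
        try:
            cell = ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            return []
    if not isinstance(cell, dict):
        return []  # covers None and NaN for rows without the field
    items = cell.get("value", [])
    if isinstance(items, dict):
        items = [items]  # tolerate a single entry that is not wrapped in a list
    return [item["$"] for item in items if isinstance(item, dict) and "$" in item]


demo_cell = {"value": [{"$": "all"}, {"$": "repository"}]}
demo_flat = extract_values(demo_cell)
demo_flat_from_text = extract_values(str(demo_cell))
print(demo_flat, demo_flat_from_text)
```

Applied to the DataFrame, this would look like `results_df["freetoread"].apply(extract_values)`.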


In [7]:
import pandas as pd

pd.to_datetime(results_df["prism:coverDate"]).dt.year.value_counts().sort_index()

prism:coverDate
1982      79
1983      97
1984     105
1985     135
1986     111
1987     134
1988     182
1989     279
1990     348
1991     437
1992     543
1993     613
1994     683
1995     725
1996     831
1997     950
1998     941
1999    1012
2000    1003
2001    1044
2002    1124
2003    1130
2004    1278
2005    1251
2006    1376
2007    1356
2008    1422
2009    1370
2010    1454
2011    1424
2012    1517
2013    1522
2014    1514
2015    1638
2016    1647
2017    1458
2018    1483
2019    1486
2020    1567
2021    1737
2022    1715
2023    1585
2024    1696
2025     442
Name: count, dtype: int64
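
The yearly counts above sum to 44,444, which matches the total reported by the fetch, so every row had a parseable `prism:coverDate` in this run. The 2025 figure covers only part of the year, since the results file is timestamped 26 March 2025. For reruns where some dates might be missing or malformed, the sketch below counts per year while also reporting how many rows could not be parsed. The function name `yearly_counts` and the sample data are illustrative assumptions.

```python
import pandas as pd


def yearly_counts(cover_dates: pd.Series):
    """Return (publications per year, number of rows whose date failed to parse)."""
    # errors="coerce" turns unparseable or missing dates into NaT instead of raising
    years = pd.to_datetime(cover_dates, errors="coerce").dt.year
    unparsed = int(years.isna().sum())
    counts = years.dropna().astype(int).value_counts().sort_index()
    return counts, unparsed


# Made-up sample with one malformed and one missing date
sample = pd.Series(["1982-01-01", "1982-06-15", "1983-03-01", "not a date", None])
counts, unparsed = yearly_counts(sample)
print(counts)
print("rows without a usable date:", unparsed)
```

Checking that `counts.sum() + unparsed` equals `len(results_df)` gives a quick consistency test after each fetch.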