# Fetch Reference Data from Scopus API

Resources:

- [Scopus Abstract Retrieval Views](https://dev.elsevier.com/sc_abstract_retrieval_views.html)
- [Scopus Retrieval API](https://dev.elsevier.com/documentation/AbstractRetrievalAPI.wadl)
- [Interactive Scopus API](https://dev.elsevier.com/scopus.html)
- [API Settings (rate limits)](https://dev.elsevier.com/api_key_settings.html)
- Remember to log in to the Cisco VPN before running this notebook.


In this notebook, we use the `ScopusReferenceFetcher` to retrieve references for a list of Scopus IDs.

Key features:
1. Automatic API key rotation when rate limits are hit
2. Saves progress after every 500 requests
3. Resumes from last saved state if interrupted
4. Comprehensive logging of progress and errors
5. Handles multiple API keys with different rate limits
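
To make the key-rotation idea concrete, here is a minimal sketch of how rotation across keys with different quotas could work. It is not the actual `ScopusReferenceFetcher` implementation: `KeyBudget`, `KeyRotator`, and their methods are hypothetical names, and the assumption that a rate-limited key should be treated as exhausted (for example after an HTTP 429 response) is ours.

```python
from dataclasses import dataclass


@dataclass
class KeyBudget:
    """Hypothetical per-key quota tracker (not part of ScopusReferenceFetcher)."""
    name: str
    limit: int
    used: int = 0

    @property
    def remaining(self) -> int:
        return self.limit - self.used


class KeyRotator:
    """Hands out the first key that still has quota and moves on when it runs out."""

    def __init__(self, budgets):
        self.budgets = list(budgets)
        self.index = 0

    def current(self) -> KeyBudget:
        # Skip past keys that have no quota left.
        while self.index < len(self.budgets) and self.budgets[self.index].remaining <= 0:
            self.index += 1
        if self.index >= len(self.budgets):
            raise RuntimeError("All API keys are exhausted")
        return self.budgets[self.index]

    def record_request(self) -> str:
        """Count one request against the active key and return its name."""
        key = self.current()
        key.used += 1
        return key.name

    def mark_rate_limited(self) -> None:
        """Assumed to be called when the API rejects the active key: treat it as exhausted."""
        key = self.current()
        key.used = key.limit


# Toy quotas so the rotation is visible after a few requests.
rotator = KeyRotator([KeyBudget("api_key_A", 2), KeyBudget("api_key_B", 1)])
used = [rotator.record_request() for _ in range(3)]
print(used)  # ['api_key_A', 'api_key_A', 'api_key_B']
```

Tracking quotas locally and also reacting to server-side rejections keeps the rotation robust when a key was already partly used outside this notebook.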

The process:
1. Loads previously processed data to avoid duplicates
2. Filters the input dataframe to only unprocessed articles
3. Processes articles in batches, rotating API keys as needed
4. Saves progress after each batch
5. Logs all operations and errors for debugging
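
The resume-and-batch steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in `ScopusReferenceFetcherUtils`: the helper names `filter_unprocessed` and `iter_batches` are hypothetical, and we assume the article identifier lives in a column called `eid`.

```python
import pandas as pd


def filter_unprocessed(df: pd.DataFrame, done_eids: set, id_col: str = "eid") -> pd.DataFrame:
    """Keep only rows whose identifier has not been fetched yet (id_col is an assumed name)."""
    return df[~df[id_col].isin(done_eids)].reset_index(drop=True)


def iter_batches(df: pd.DataFrame, batch_size: int, start_batch: int = 0):
    """Yield (batch_number, batch_df), continuing the numbering after start_batch."""
    for offset in range(0, len(df), batch_size):
        batch_number = start_batch + offset // batch_size + 1
        yield batch_number, df.iloc[offset:offset + batch_size]


# Toy data: 7 articles, 2 of which were fetched in a previous run (batch 1).
articles = pd.DataFrame({"eid": [f"2-s2.0-{i}" for i in range(7)]})
done = {"2-s2.0-0", "2-s2.0-1"}

todo = filter_unprocessed(articles, done)
batches = [(n, len(b)) for n, b in iter_batches(todo, batch_size=2, start_batch=1)]
print(batches)  # [(2, 2), (3, 2), (4, 1)]
```

Continuing the batch numbering from the last saved batch means files from a resumed run do not overwrite files from earlier runs.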

In [1]:
import pandas as pd
import json
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # import urllib3 directly; the requests.packages alias is a legacy path
from tqdm import tqdm
import logging
from math import ceil
import sys
import datetime
import os
%load_ext autoreload
%autoreload 2

from src.data.ScopusReferenceFetcher import ScopusReferenceFetcher
from src.data.ScopusReferenceFetcherUtils import (
    ScopusRefFetcherPrep,
    ScopusRefFetcherProcessor,
)

# Setup and Initialization

In [3]:
# Get all API keys
api_keys = ScopusRefFetcherPrep.get_api_keys()
print(f"Available API keys: {', '.join(api_keys.keys())}")

# Setup logging
log_directory = "../data/01-raw/references"
ScopusRefFetcherProcessor.setup_logging(log_directory, log_level=logging.INFO)


# Load the cleaned articles dataframe
df_path = "../data/02-clean/articles/scopus_cleaned_20250326_081230.pkl"
df = pd.read_pickle(df_path)
print(f"Total articles to process: {len(df)}")

Comment:  rate limits are 40,000 per week for api_key_A and 10,000 for every other key
Available API keys: api_key_A, api_key_B, api_key_deb, api_key_haoxin
Total articles to process: 38961
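
As a quick sanity check on the numbers printed above, the snippet below estimates whether the combined weekly quota covers the full run and how many batches to expect. It assumes one API request per article and that the 10,000 limit of the other keys is also per week; the batch size of 500 matches the value used later in this notebook.

```python
from math import ceil

# Weekly limits as reported in the output above (per-week for all keys is an assumption).
weekly_limits = {
    "api_key_A": 40_000,
    "api_key_B": 10_000,
    "api_key_deb": 10_000,
    "api_key_haoxin": 10_000,
}
n_articles = 38_961
batch_size = 500

total_quota = sum(weekly_limits.values())
n_batches = ceil(n_articles / batch_size)
fits_in_one_week = n_articles <= total_quota  # assumes one request per article

print(total_quota, n_batches, fits_in_one_week)  # 70000 78 True
```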


# Load Previously Processed Data

In [6]:
# Initialize prepper
prepper = ScopusRefFetcherPrep()

# Load previously processed data
refs_path = "../data/01-raw/references"
# NOTE: create this directory before the first run if it does not exist yet
eids, max_batch = prepper.load_fetched_reference_data(refs_path)

# Filter dataframe to unprocessed articles
df_filtered = prepper.load_and_filter_articles(df_path, eids)

print(f"Previously processed articles: {len(eids)}")
print(f"Remaining articles to process: {len(df_filtered)}")
print(f"Last processed batch: {max_batch}")

ValueError: max() arg is an empty sequence
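
This error is what `max()` raises when it receives an empty sequence, which is likely what happens on a first run when no reference files have been saved yet. A minimal sketch of a guard is shown below. It is not the actual `load_fetched_reference_data` code: the helper `find_max_batch` and the `batch_<number>` file-naming pattern are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def find_max_batch(refs_dir: str, pattern: str = r"batch_(\d+)") -> int:
    """Return the highest batch number found in file names, or 0 if there are none.

    The file-naming pattern is an assumption made for this illustration.
    """
    batch_numbers = [
        int(m.group(1))
        for p in Path(refs_dir).glob("*")
        if (m := re.search(pattern, p.name))
    ]
    # default=0 avoids "ValueError: max() arg is an empty sequence" on a first run.
    return max(batch_numbers, default=0)


with tempfile.TemporaryDirectory() as tmp:
    empty_result = find_max_batch(tmp)  # no files saved yet
    for n in (3, 12):
        (Path(tmp) / f"references_batch_{n}.json").touch()
    filled_result = find_max_batch(tmp)

print(empty_result, filled_result)  # 0 12
```

Returning 0 for an empty directory lets the batch numbering start at 1 on a fresh run without special-casing the caller.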

# Process Articles

In [None]:
# Initialize processor
processor = ScopusRefFetcherProcessor()

# Process batches with automatic API key rotation
processor.process_scopus_batches(
    api_keys=api_keys,
    df_to_fetch=df_filtered,
    data_path="../data/01-raw/references",
    last_processed_batch=max_batch,
    batch_size=500,
)