# Fetch Reference Data from Scopus API

Resources:

- [Scopus Abstract Retrieval Views](https://dev.elsevier.com/sc_abstract_retrieval_views.html)
- [Scopus Retrieval API](https://dev.elsevier.com/documentation/AbstractRetrievalAPI.wadl)
- [Interactive Scopus API](https://dev.elsevier.com/scopus.html)
- [API Settings (rate limits)](https://dev.elsevier.com/api_key_settings.html)
- **Reminder:** connect to the Cisco VPN before running this notebook.


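This notebook uses `ScopusRefFetcherPrep` and `ScopusRefFetcherProcessor` (imported from `src.data_fetching.ScopusProcessor`) to retrieve the reference lists of articles identified by their Scopus EIDs.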

Key features:
1. Automatic API key rotation when rate limits are hit
2. Saves progress after every 500 requests
3. Resumes from last saved state if interrupted
4. Comprehensive logging of progress and errors
5. Handles multiple API keys with different rate limits
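
The key-rotation idea from the feature list can be illustrated with a minimal sketch. It is not the code in `ScopusProcessor`: the names `KeyRotator`, `RateLimitError`, and `fake_fetch` are hypothetical, and the HTTP request is replaced by a stand-in function so the example runs offline.

```python
from collections import deque


class RateLimitError(Exception):
    """Raised by the (hypothetical) fetch function when a key's quota is used up."""


class KeyRotator:
    """Cycle through API keys, dropping each one once it hits its rate limit."""

    def __init__(self, api_keys):
        self._keys = deque(api_keys)

    def call(self, fetch, scopus_id):
        # Try the current key; on a rate-limit error, discard it and try the next.
        while self._keys:
            key = self._keys[0]
            try:
                return fetch(scopus_id, key)
            except RateLimitError:
                self._keys.popleft()
        raise RuntimeError("All API keys are exhausted")


def fake_fetch(scopus_id, key):
    # Stand-in for a real HTTP request: pretend "key-a" is already over quota.
    if key == "key-a":
        raise RateLimitError(key)
    return {"id": scopus_id, "key_used": key}


rotator = KeyRotator(["key-a", "key-b"])
result = rotator.call(fake_fetch, "85012345678")
```

Here the first key fails, is removed from the queue, and the request succeeds with the second key. A real implementation would presumably detect the limit from the HTTP response rather than from a custom exception.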

The process:
1. Loads previously processed data to avoid duplicates
2. Filters the input dataframe to only unprocessed articles
3. Processes articles in batches, rotating API keys as needed
4. Saves progress after each batch
5. Logs all operations and errors for debugging
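
As a rough illustration of the resume-and-checkpoint steps above, the sketch below writes one pickle file per batch and rebuilds the processed set from those files on restart. The `batch_<n>.pkl` naming, the helper names, and the toy `fetch` function are assumptions made for this example only; the actual file layout used by `ScopusRefFetcherPrep` may differ.

```python
import pickle
import tempfile
from pathlib import Path


def load_processed_ids(folder):
    """Collect IDs from every batch_*.pkl file and report the highest batch number."""
    processed, last_batch = set(), 0
    for path in sorted(Path(folder).glob("batch_*.pkl")):
        with path.open("rb") as fh:
            processed.update(pickle.load(fh))  # dict keys are the processed IDs
        last_batch = max(last_batch, int(path.stem.split("_")[1]))
    return processed, last_batch


def process_in_batches(ids, folder, fetch, batch_size, last_batch=0):
    """Fetch IDs batch by batch, writing one checkpoint file per finished batch."""
    for offset in range(0, len(ids), batch_size):
        batch = ids[offset:offset + batch_size]
        results = {i: fetch(i) for i in batch}
        last_batch += 1
        with (Path(folder) / f"batch_{last_batch}.pkl").open("wb") as fh:
            pickle.dump(results, fh)
    return last_batch


with tempfile.TemporaryDirectory() as tmp:
    all_ids = ["a", "b", "c", "d", "e"]
    fetch = lambda i: i.upper()  # toy stand-in for an API call
    # First run: only the first two IDs are processed before an "interruption".
    process_in_batches(all_ids[:2], tmp, fetch, batch_size=2)
    # Second run: reload state, skip what is done, continue the batch numbering.
    done, last = load_processed_ids(tmp)
    remaining = [i for i in all_ids if i not in done]
    final_batch = process_in_batches(remaining, tmp, fetch, batch_size=2, last_batch=last)
```

Because each batch is written only after it completes, an interruption loses at most one batch of requests.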

# SETUP

## Import Libraries

In [13]:
# Enable autoreload extension
%load_ext autoreload
%autoreload 2

import logging
from src.data_fetching.ScopusProcessor import ScopusRefFetcherPrep, ScopusRefFetcherProcessor



## Load the processed data, if any


In [None]:
# Set up logging in the correct directory
log_directory = "../data/01-raw/references/logs/"  # Directory for logs
data_path = "../data/01-raw/references/"  # Directory for saved batch data
article_df_path = "../data/02-clean/articles/scopus_cleaned_20250326_081230.pkl"  # DataFrame with EIDs

# Make sure log directory exists
import os
os.makedirs(log_directory, exist_ok=True)

# Properly set up logging
ScopusRefFetcherProcessor.setup_logging(log_directory, console_output=False)

# Load already processed EIDs (if any)
processed_eids = []
last_batch = 0
try:
    processed_eids, last_batch = ScopusRefFetcherPrep.load_fetched_reference_data(data_path)
except (FileNotFoundError, ValueError):
    # No previous batches found; keep the empty defaults set above
    logging.info("No previous batches found. Starting from scratch.")

logging.info(f'Already processed articles: {len(processed_eids)}')
logging.info(f'Last processed batch: {last_batch}')

# Load the articles and keep only those not yet processed
df_to_fetch = ScopusRefFetcherPrep.load_and_filter_articles(article_df_path, processed_eids)



In [None]:
# Start batch processing
ScopusRefFetcherProcessor.process_scopus_batches(
    df_to_fetch=df_to_fetch,
    data_path=data_path,
    last_processed_batch=last_batch,
    batch_size=500  # Adjust batch size as needed
)