# iPenguin - Scopus Example

## 0. Settings

In [1]:
# allows async co-routines to work inside of jupyter notebook
import nest_asyncio
nest_asyncio.apply()

In [2]:
import os
import pathlib

In [4]:
# absolute path of the directory used to cache downloaded papers
CACHING_DIR = pathlib.Path('scopus_cache').resolve()

## 1. Create the Scopus Handler Object

In [5]:
from TELF.pre_processing.iPenguin.Scopus import Scopus

In [6]:
if "SCOPUS_KEY" in os.environ:
    print("Found SCOPUS_KEY environment variable")
    API_KEY = os.environ["SCOPUS_KEY"]
else:
    print("SCOPUS_KEY is not set. Export your Scopus API key as the environment variable SCOPUS_KEY.")
    API_KEY = ""

Found SCOPUS_KEY environment variable


In [7]:
scopus = Scopus(
    keys = [API_KEY], 
    mode = 'fs',         # file system caching mode (default)
    name = CACHING_DIR,  # where to cache the files
    verbose = True
)

## 2. Create a Query

Scopus accepts complex search queries that combine field codes with boolean and proximity operators to narrow the scope of a search. When using ```iPenguin``` directly, these queries must be constructed as strings. Higher-level libraries such as `Bunny` support building queries programmatically. A syntax error in the search query causes the Scopus class to find and download no papers, and a warning that the query is invalid is issued. See the following [resource](https://service.elsevier.com/app/answers/detail/a_id/11365/supporthub/scopus/#tips) for more information on constructing Scopus queries.
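As an illustration of constructing such a query string by hand, the sketch below joins a few field-code clauses with a single boolean operator. The helper `build_query` is hypothetical and not part of ```iPenguin```. `AUTH`, `TITLE-ABS-KEY`, and `PUBYEAR` are Scopus field codes, but the search terms are chosen only for illustration.

```python
def build_query(clauses, operator="AND"):
    """Join Scopus query clauses with one boolean operator (hypothetical helper)."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "AND NOT"}:
        raise ValueError(f"unsupported boolean operator: {operator}")
    if not clauses:
        raise ValueError("at least one clause is required")
    # the same operator is placed between every pair of clauses
    return f" {operator} ".join(clauses)


# illustrative clauses; the terms are assumptions, not taken from the notebook
example_query = build_query([
    "AUTH(Boian Alexandrov)",
    "TITLE-ABS-KEY(tensor)",
    "PUBYEAR > 2015",
])
print(example_query)
```

The resulting string can be passed wherever this notebook uses `query`. Queries that mix different operators would need explicit parentheses, which this sketch does not handle.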

In [8]:
# search for 'Boian Alexandrov' in all author fields
query = 'AUTH(Boian Alexandrov)'

## 3. Execute Query

### A. Before downloading, check how many papers are available for the query

In [9]:
count = scopus.count(query)

[Scopus]: Found 42 papers in 0.84s


### B. Download the papers found by the query

The ```Scopus.search()``` function takes an optional argument ```n``` that sets an upper limit on how many papers to download. Some queries, such as "<b><i>LANGUAGE(english)</i></b>" (all English-language papers on Scopus), can return millions of papers, so the ```n``` argument is used to limit the scope of the search. Scopus ranks search results by relevance, so the top ```n``` most relevant papers are returned for a given query. By default ```n``` is set to 100, but it can be set to any positive integer. Setting ```n``` to 0 downloads all available papers for the query.
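One way to combine the count from step A with the ```n``` argument is sketched below. The helper `choose_n` and its `cap` value are hypothetical and not part of ```iPenguin```. The sketch also assumes that `scopus.count()` returns the number of available papers as an integer.

```python
def choose_n(available, cap=1000):
    """Pick a value for the `n` argument of Scopus.search() (hypothetical helper).

    Returns 0 (download everything) when the query is small enough,
    otherwise limits the download to `cap` papers.
    """
    if available < 0:
        raise ValueError("available must be non-negative")
    if cap <= 0:
        raise ValueError("cap must be a positive integer")
    return 0 if available <= cap else cap


print(choose_n(42))         # small query: n=0 downloads all available papers
print(choose_n(2_500_000))  # very large query: only the top `cap` papers
```

Under the assumption above, the result could be passed along as `scopus.search(query, n=choose_n(count))`.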

In [10]:
df, paper_ids = scopus.search(query, n=100)
df.info()

[Scopus API]: Remaining API calls: 9980
              Quota resets at:     2025-05-02 06:23:55

100%|██████████| 42/42 [00:06<00:00,  6.35it/s]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   eid               42 non-null     object
 1   doi               42 non-null     object
 2   title             42 non-null     object
 3   year              42 non-null     int64 
 4   abstract          42 non-null     object
 5   authors           42 non-null     object
 6   author_ids        42 non-null     object
 7   affiliations      42 non-null     object
 8   funding           30 non-null     object
 9   PACs              28 non-null     object
 10  publication_name  42 non-null     object
 11  subject_areas     42 non-null     object
 12  num_citations     42 non-null     int64 
dtypes: int64(2), object(11)
memory usage: 4.4+ KB


[Scopus]: Finished downloading 42 papers in 10.02s
[Parallel(n_jobs=42)]: Using backend ThreadingBackend with 42 concurrent workers.
[Parallel(n_jobs=42)]: Done   2 out of  42 | elapsed:    0.0s remaining:    0.6s
[Parallel(n_jobs=42)]: Done  42 out of  42 | elapsed:    0.1s finished
