**Note: this notebook demonstrates how to use Bunny. Bunny curates a dataset from the citation/reference network, so it requires real paper IDs or DOIs with real citations and references. Because this example uses mock data, it serves only as a demo of how to interact with Bunny.**

# Build Dataset by Citation Hops - Core Input is a *list of DOIs*

In [2]:
# Patch asyncio to allow nested event loops so coroutines can run inside a Jupyter notebook
import nest_asyncio
nest_asyncio.apply()

## 1. DOIs to Begin Hopping From

In [3]:
DOIs = [
    "10.1109/ICMLA61862.2024.00258",
    "10.1109/ISDFS60797.2024.10527237"
]
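
Because every later step depends on this seed list, it can help to sanity-check the DOIs before forming the core. The following is a minimal sketch, not part of Bunny: `clean_dois` and `DOI_PATTERN` are hypothetical names, and the regular expression is a loose illustrative check rather than an official DOI validator.

```python
import re

# Loose, illustrative pattern: "10.", a 4-9 digit registrant code, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def clean_dois(dois):
    """Strip whitespace and resolver prefixes, drop malformed entries,
    and de-duplicate case-insensitively while preserving order."""
    seen = set()
    cleaned = []
    for doi in dois:
        doi = doi.strip()
        # Remove a leading https://doi.org/ or https://dx.doi.org/ prefix if present.
        doi = re.sub(r"^https?://(dx\.)?doi\.org/", "", doi, flags=re.IGNORECASE)
        key = doi.lower()
        if DOI_PATTERN.match(doi) and key not in seen:
            seen.add(key)
            cleaned.append(doi)
    return cleaned


example = [
    "10.1109/ICMLA61862.2024.00258",
    " https://doi.org/10.1109/ISDFS60797.2024.10527237 ",
    "10.1109/icmla61862.2024.00258",  # duplicate that differs only in case
    "not-a-doi",
]
print(clean_dois(example))
```

The cleaned list could then be passed to `bunny.form_core` in place of the raw list.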

## 2. Setup Core

In [4]:
import os

if "S2_KEY" in os.environ:
    print("Found S2_KEY environment variable")
    S2_API_KEY = os.environ["S2_KEY"]
else:
    print("S2_KEY not found. Export your Semantic Scholar API key in your environment under the variable name S2_KEY.")
    S2_API_KEY = ""

Found S2_KEY environment variable
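
The same lookup pattern is repeated later in this notebook for the Scopus key, so it could be factored into a small helper. This is a sketch under assumptions: `read_api_key` is a hypothetical function, not part of TELF, and the optional `environ` argument exists only to make the helper easy to exercise without touching the real environment.

```python
import os


def read_api_key(var_name, environ=None):
    """Return the stripped value of var_name, or "" if it is unset or blank."""
    environ = os.environ if environ is None else environ
    key = environ.get(var_name, "").strip()
    if key:
        print(f"Found {var_name} environment variable")
    else:
        print(f"{var_name} not found. Export the API key under the variable name {var_name}.")
    return key


# Usage with an explicit mapping instead of the real environment:
S2_API_KEY_DEMO = read_api_key("S2_KEY", {"S2_KEY": " my-key "})
```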


In [5]:
from TELF.applications.Bunny import Bunny


OUTPUT_PATH = os.path.join('results', '01-example')
bunny = Bunny(s2_key=S2_API_KEY,
              output_dir=OUTPUT_PATH,
              verbose=True)

In [6]:
core_df = bunny.form_core(DOIs, 'paper')
core_df.info()

100%|██████████| 2/2 [00:00<00:00,  3.89it/s]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   s2id            2 non-null      object
 1   doi             2 non-null      object
 2   year            2 non-null      int64 
 3   title           2 non-null      object
 4   abstract        2 non-null      object
 5   s2_authors      2 non-null      object
 6   s2_author_ids   2 non-null      object
 7   citations       2 non-null      object
 8   references      2 non-null      object
 9   num_citations   2 non-null      int64 
 10  num_references  2 non-null      int64 
 11  type            2 non-null      int64 
dtypes: int64(4), object(8)
memory usage: 324.0+ bytes


[S2]: Finished downloading 2 papers for given query in 1.24s
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.0s finished


## 3. Showcase Some Bunny Use Cases

### A. Perform a Few Hops in S2 and Fill in Scopus Information at the End

In [7]:
df = core_df.copy()
df.type.value_counts()

type
0    2
Name: count, dtype: int64

#### Hop 1

In [8]:
df = bunny.hop(df, hops=1, modes='citations')
df.info()

[Bunny]: Downloading papers for hop 1
100%|██████████| 7/7 [00:00<00:00, 35805.03it/s]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   s2id            9 non-null      object
 1   doi             8 non-null      object
 2   year            9 non-null      int64 
 3   title           9 non-null      object
 4   abstract        9 non-null      object
 5   s2_authors      9 non-null      object
 6   s2_author_ids   9 non-null      object
 7   citations       6 non-null      object
 8   references      9 non-null      object
 9   num_citations   9 non-null      int64 
 10  num_references  9 non-null      int64 
 11  type            9 non-null      int64 
dtypes: int64(4), object(8)
memory usage: 996.0+ bytes


[S2]: Finished downloading 7 papers for given query in 0.65s
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done   2 out of   7 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=7)]: Done   7 out of   7 | elapsed:    0.0s finished


#### Hop 2

In [9]:
df = bunny.hop(df, hops=1, modes='citations')
df.info()

[Bunny]: Downloading papers for hop 2
100%|██████████| 7/7 [00:00<00:00, 66126.41it/s]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   s2id            15 non-null     object
 1   doi             12 non-null     object
 2   year            15 non-null     int64 
 3   title           15 non-null     object
 4   abstract        14 non-null     object
 5   s2_authors      15 non-null     object
 6   s2_author_ids   15 non-null     object
 7   citations       6 non-null      object
 8   references      15 non-null     object
 9   num_citations   15 non-null     int64 
 10  num_references  15 non-null     int64 
 11  type            15 non-null     int64 
dtypes: int64(4), object(8)
memory usage: 1.5+ KB


[S2]: Finished downloading 7 papers for given query in 0.64s
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done   2 out of   7 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=7)]: Done   7 out of   7 | elapsed:    0.0s finished
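
To make the idea of hopping concrete, the sketch below expands a tiny mock citation graph one hop at a time. It is an illustration only and not Bunny's implementation: `hop_once` and `MOCK_CITATIONS` are hypothetical, and it assumes that each paper is labeled with the hop at which it was first reached (the core papers above show `type` 0; treating later hops as 1, 2, and so on is an assumption).

```python
def hop_once(papers, citations, current_hop):
    """Return a copy of `papers` extended with every paper that cites
    a paper first reached at `current_hop`.

    papers:    dict mapping paper id -> hop level at which it was first reached
    citations: dict mapping paper id -> list of ids of papers that cite it
    """
    frontier = [pid for pid, level in papers.items() if level == current_hop]
    expanded = dict(papers)
    for pid in frontier:
        for citing in citations.get(pid, []):
            # Keep the earliest hop level for papers that were already seen.
            expanded.setdefault(citing, current_hop + 1)
    return expanded


# Mock graph: key is a paper, value lists the papers that cite it.
MOCK_CITATIONS = {
    "core-A": ["p1", "p2"],
    "core-B": ["p2", "p3"],
    "p1": ["p4"],
    "p3": ["p4", "core-A"],
}

papers = {"core-A": 0, "core-B": 0}
after_hop1 = hop_once(papers, MOCK_CITATIONS, current_hop=0)
after_hop2 = hop_once(after_hop1, MOCK_CITATIONS, current_hop=1)
print(after_hop1)
print(after_hop2)
```

Each hop expands only from the papers added in the previous hop, and papers that were already collected keep their original level, which is why the dataset grows by fewer rows than the number of citation links followed.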


#### Get Scopus Information 

In [10]:
if "SCOPUS_KEY" in os.environ:
    print("Found SCOPUS_KEY environment variable")
    SCOPUS_API_KEY = os.environ["SCOPUS_KEY"]
else:
    print("SCOPUS_KEY not found. Export your Scopus API key in your environment under the variable name SCOPUS_KEY.")
    SCOPUS_API_KEY = ""

Found SCOPUS_KEY environment variable


In [11]:
df = bunny.get_affiliations(df, [SCOPUS_API_KEY], filters=None)
df.info()

[Scopus API]: Remaining API calls: 9866
              Quota resets at:     2025-05-02 06:23:55

100%|██████████| 5/5 [00:00<00:00,  7.50it/s]


<class 'pandas.core.frame.DataFrame'>
Index: 15 entries, 0 to 14
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   doi               12 non-null     object 
 1   eid               5 non-null      object 
 2   s2id              15 non-null     object 
 3   title             15 non-null     object 
 4   abstract          15 non-null     object 
 5   year              15 non-null     int64  
 6   authors           5 non-null      object 
 7   author_ids        5 non-null      object 
 8   affiliations      5 non-null      object 
 9   funding           2 non-null      object 
 10  PACs              5 non-null      object 
 11  publication_name  5 non-null      object 
 12  subject_areas     5 non-null      object 
 13  s2_authors        15 non-null     object 
 14  s2_author_ids     15 non-null     object 
 15  citations         6 non-null      object 
 16  num_citations     15 non-null     int64  
 17  refe

[Scopus]: Finished downloading 5 papers in 3.98s
[Parallel(n_jobs=5)]: Using backend ThreadingBackend with 5 concurrent workers.
[Parallel(n_jobs=5)]: Done   2 out of   5 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=5)]: Done   5 out of   5 | elapsed:    0.0s finished
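
The `info()` output above shows that Scopus fields such as `eid` are populated for only some of the papers. A quick way to quantify that is to compute per-field coverage. The sketch below does this in plain Python on mock records; `coverage` and `MOCK_RECORDS` are hypothetical and the values are made up for illustration, not taken from the run above.

```python
def coverage(records, fields):
    """Return the fraction of records with a non-None value for each field."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for rec in records if rec.get(field) is not None) / total
        for field in fields
    }


# Mock records standing in for rows of the final DataFrame.
MOCK_RECORDS = [
    {"s2id": "a", "doi": "10.1000/x1", "eid": "2-s2.0-1"},
    {"s2id": "b", "doi": "10.1000/x2", "eid": None},
    {"s2id": "c", "doi": "10.1000/x3", "eid": None},
    {"s2id": "d", "doi": None, "eid": None},
]

cov = coverage(MOCK_RECORDS, ["s2id", "doi", "eid"])
print(cov)
```

With a real DataFrame, records of this shape could be obtained with `df.to_dict("records")`, although pandas represents missing values as `NaN` in some columns, so the `is not None` test would need adjusting.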
