**Note: this notebook demonstrates how to use Bunny. Bunny needs real citations, reference paper IDs, or DOIs to curate a dataset from the citation/reference network, so with mock data this example serves only as a demo of how to interact with Bunny.**

# Build Dataset by Citation Hops - Core Input is a *list of DOIs*

In [1]:
# Allow async coroutines to run inside a Jupyter notebook
import nest_asyncio
nest_asyncio.apply()
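
The reason for the patch can be reproduced with the standard library alone. A Jupyter kernel already runs an event loop, and plain `asyncio` refuses to start a second one from inside it. The sketch below simulates that situation without Jupyter; `inner` and `outer` are hypothetical names used only for illustration.

```python
import asyncio


async def inner():
    return 42


async def outer():
    # We are now inside a running event loop, as code in a notebook cell is.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested run is rejected by unpatched asyncio
    except RuntimeError as err:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(err)
    return "no error"


message = asyncio.run(outer())
print(message)
```

After `nest_asyncio.apply()`, nested calls of this kind are allowed, which is what lets libraries that drive their own event loop work in a notebook.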

## 1. DOIs to Begin the Hopping From

In [2]:
DOIs = [
    "10.1109/ICMLA61862.2024.00258",
    "10.1109/ISDFS60797.2024.10527237"
]
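
DOIs copied from web pages often carry URL prefixes, stray whitespace, or case-only duplicates, so it can help to clean the list before forming the core. The sketch below is not part of Bunny: `normalize_doi`, `clean_dois`, and `DOI_PATTERN` are hypothetical helpers, and the regular expression is a loose check for the common `10.<registrant>/<suffix>` shape rather than a full validator.

```python
import re

# Loose pattern for the common "10.<registrant>/<suffix>" DOI shape.
# This is an assumption for illustration, not an exhaustive DOI grammar.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that often appear when a DOI is copied from a web page
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Strip whitespace and common prefixes; return None if not DOI-shaped."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi if DOI_PATTERN.match(doi) else None


def clean_dois(raw_dois):
    """Normalize DOIs, dropping invalid entries and duplicates in order."""
    seen = set()
    cleaned = []
    for raw in raw_dois:
        doi = normalize_doi(raw)
        if doi is None:
            continue
        key = doi.lower()  # DOIs are case-insensitive
        if key not in seen:
            seen.add(key)
            cleaned.append(doi)
    return cleaned


example = clean_dois([
    "https://doi.org/10.1109/ICMLA61862.2024.00258",
    " 10.1109/ISDFS60797.2024.10527237 ",
    "10.1109/icmla61862.2024.00258",  # duplicate differing only in case
    "not-a-doi",
])
print(example)
```

The cleaned list could then be passed to `bunny.form_core` in place of the hand-written one.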

## 2. Set Up Core

In [5]:
import os

if "S2_KEY" in os.environ:
    print("Found S2_KEY environment variable")
    S2_API_KEY = os.environ["S2_KEY"]
else:
    print("S2_KEY not found. Export your Semantic Scholar API key as the environment variable S2_KEY.")
    S2_API_KEY = ""

Found S2_KEY environment variable
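
This notebook uses the same lookup pattern for both `S2_KEY` and `SCOPUS_KEY`, so it could be factored into one function. The following is a minimal sketch under that assumption; `read_api_key` and its `environ` parameter are hypothetical and not part of Bunny.

```python
import os


def read_api_key(var_name, environ=None):
    """Return the API key stored in `var_name`, or an empty string if unset."""
    env = os.environ if environ is None else environ
    key = env.get(var_name, "").strip()
    if key:
        print(f"Found {var_name} environment variable")
    else:
        print(f"{var_name} is not set. Export your API key under that name.")
    return key


# Demonstration with explicit mappings instead of the real environment
S2_DEMO_KEY = read_api_key("S2_KEY", environ={"S2_KEY": "demo-key"})
MISSING_KEY = read_api_key("SCOPUS_KEY", environ={})
```

Passing a mapping explicitly keeps the helper testable without touching real credentials; omitting `environ` falls back to `os.environ`.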


In [6]:
from TELF.applications.Bunny import Bunny


OUTPUT_PATH = os.path.join('results', '01-example')
bunny = Bunny(s2_key=S2_API_KEY,
              output_dir=OUTPUT_PATH,
              verbose=True)

In [7]:
core_df = bunny.form_core(DOIs, 'paper')
core_df.info()

100%|██████████| 2/2 [00:00<00:00,  2.09it/s]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   s2id            2 non-null      object
 1   doi             2 non-null      object
 2   year            2 non-null      int64 
 3   title           2 non-null      object
 4   abstract        2 non-null      object
 5   s2_authors      2 non-null      object
 6   s2_author_ids   2 non-null      object
 7   citations       2 non-null      object
 8   references      2 non-null      object
 9   num_citations   2 non-null      int64 
 10  num_references  2 non-null      int64 
 11  type            2 non-null      int64 
dtypes: int64(4), object(8)
memory usage: 324.0+ bytes


[S2]: Finished downloading 2 papers for given query in 1.89s
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.0s finished


## 3. Showcase Some Bunny Use Cases

### A. Perform a Few Hops in S2 and Fill in Scopus Information at the End

In [8]:
df = core_df.copy()
df.type.value_counts()

type
0    2
Name: count, dtype: int64

#### Hop 1

In [9]:
df = bunny.hop(df, hops=1, modes='citations')
df.info()

[Bunny]: Downloading papers for hop 1
100%|██████████| 6/6 [00:00<00:00,  6.67it/s]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   s2id            8 non-null      object
 1   doi             6 non-null      object
 2   year            8 non-null      int64 
 3   title           8 non-null      object
 4   abstract        8 non-null      object
 5   s2_authors      8 non-null      object
 6   s2_author_ids   8 non-null      object
 7   citations       5 non-null      object
 8   references      8 non-null      object
 9   num_citations   8 non-null      int64 
 10  num_references  8 non-null      int64 
 11  type            8 non-null      int64 
dtypes: int64(4), object(8)
memory usage: 900.0+ bytes


[S2]: Finished downloading 6 papers for given query in 2.05s
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done   2 out of   6 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=6)]: Done   6 out of   6 | elapsed:    0.0s finished


#### Hop 2

In [12]:
df = bunny.hop(df, hops=1, modes='citations')
df.info()

[Bunny]: Downloading papers for hop 2
100%|██████████| 4/4 [00:00<00:00,  5.78it/s]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   s2id            11 non-null     object
 1   doi             8 non-null      object
 2   year            11 non-null     int64 
 3   title           11 non-null     object
 4   abstract        10 non-null     object
 5   s2_authors      11 non-null     object
 6   s2_author_ids   11 non-null     object
 7   citations       5 non-null      object
 8   references      11 non-null     object
 9   num_citations   11 non-null     int64 
 10  num_references  11 non-null     int64 
 11  type            11 non-null     int64 
dtypes: int64(4), object(8)
memory usage: 1.2+ KB


[S2]: Finished downloading 4 papers for given query in 1.61s
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    0.0s finished
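
To make the two hops above concrete, here is a minimal sketch of citation hopping as a breadth-first expansion over a toy graph. It is not Bunny's implementation: the `CITED_BY` mapping, the paper IDs, and this `hop` function are hypothetical. Tagging each paper with the hop at which it was first reached is an assumption about what the `type` column records, based only on the core papers above all having `type` 0.

```python
# Toy "cited by" graph: paper ID -> IDs of papers that cite it (IDs are made up)
CITED_BY = {
    "core-1": ["a", "b"],
    "core-2": ["b", "c"],
    "a": ["d"],
    "b": [],
    "c": ["d", "e"],
}


def hop(levels, graph):
    """Expand the collection by one citation hop.

    `levels` maps paper ID -> hop number at which the paper was first reached.
    Papers on the current outermost level form the frontier; the papers that
    cite them are added at the next level unless they are already present.
    """
    current = max(levels.values())
    frontier = [pid for pid, level in levels.items() if level == current]
    expanded = dict(levels)  # copy so the input is left unchanged
    for pid in frontier:
        for citing in graph.get(pid, []):
            # setdefault keeps the earliest hop number for papers seen before
            expanded.setdefault(citing, current + 1)
    return expanded


core = {"core-1": 0, "core-2": 0}
after_hop_1 = hop(core, CITED_BY)
after_hop_2 = hop(after_hop_1, CITED_BY)

print(after_hop_1)
print(after_hop_2)
```

Calling the function on its own output mirrors how the notebook passes the growing `df` back into `bunny.hop` for each additional hop.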


#### Get Scopus Information 

In [16]:
if "SCOPUS_KEY" in os.environ:
    print("Found SCOPUS_KEY environment variable")
    SCOPUS_API_KEY = os.environ["SCOPUS_KEY"]
else:
    print("SCOPUS_KEY not found. Export your Scopus API key as the environment variable SCOPUS_KEY.")
    SCOPUS_API_KEY = ""

Found SCOPUS_KEY environment variable


In [18]:
df = bunny.get_affiliations(df, [SCOPUS_API_KEY], filters=None)
df.info()

[Scopus API]: Remaining API calls: 9997
              Quota resets at:     2025-03-25 05:25:34

100%|██████████| 3/3 [00:01<00:00,  2.92it/s]


<class 'pandas.core.frame.DataFrame'>
Index: 11 entries, 0 to 10
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   doi               8 non-null      object 
 1   eid               3 non-null      object 
 2   s2id              11 non-null     object 
 3   title             11 non-null     object 
 4   abstract          11 non-null     object 
 5   year              11 non-null     int64  
 6   authors           3 non-null      object 
 7   author_ids        3 non-null      object 
 8   affiliations      3 non-null      object 
 9   funding           1 non-null      object 
 10  PACs              3 non-null      object 
 11  publication_name  3 non-null      object 
 12  subject_areas     3 non-null      object 
 13  s2_authors        11 non-null     object 
 14  s2_author_ids     11 non-null     object 
 15  citations         5 non-null      object 
 16  num_citations     11 non-null     int64  
 17  refe

[Scopus]: Finished downloading 3 papers in 4.39s
[Parallel(n_jobs=3)]: Using backend ThreadingBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done   3 out of   3 | elapsed:    0.0s finished
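
The `info()` output shows that only 3 of the 11 papers received Scopus fields such as `eid` and `affiliations`. One way to quantify such gaps is the non-null fraction per column. The sketch below computes it on plain dictionaries with made-up records; `field_coverage` is a hypothetical helper, not part of Bunny. On the real DataFrame, `df.notna().mean()` gives the same per-column fraction.

```python
def field_coverage(records, fields):
    """Return the fraction of records with a non-None value for each field."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for rec in records if rec.get(field) is not None) / total
        for field in fields
    }


# Made-up records: every paper has an S2 ID, only some have Scopus data
papers = [
    {"s2id": "p1", "doi": "10.1000/a", "eid": "eid-1", "affiliations": {"x": 1}},
    {"s2id": "p2", "doi": "10.1000/b", "eid": None, "affiliations": None},
    {"s2id": "p3", "doi": None, "eid": None, "affiliations": None},
    {"s2id": "p4", "doi": "10.1000/d", "eid": "eid-4", "affiliations": {"y": 2}},
]

coverage = field_coverage(papers, ["s2id", "doi", "eid", "affiliations"])
for field, fraction in coverage.items():
    print(f"{field:<14}{fraction:.0%}")
```

Low coverage on the Scopus columns is expected whenever a paper has no DOI or is not indexed by Scopus, so downstream analysis should handle the missing values explicitly.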
