**Note: this notebook demonstrates how to use Bunny. It runs on mock data, whereas Bunny requires real citations, reference paper IDs, or DOIs to curate a dataset from the citation/reference network. This example therefore serves only as a demo of how to interact with Bunny.**

In [1]:
# allows async co-routines to work inside of jupyter notebook
import nest_asyncio
nest_asyncio.apply()

## 1. Load Data

In [3]:
import pandas as pd
import os

df = pd.read_csv(os.path.join("..", "..", "data", "sample2.csv")).head(5)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   eid               5 non-null      object 
 1   s2id              5 non-null      object 
 2   doi               5 non-null      object 
 3   title             5 non-null      object 
 4   abstract          5 non-null      object 
 5   year              5 non-null      int64  
 6   authors           5 non-null      object 
 7   author_ids        5 non-null      object 
 8   affiliations      5 non-null      object 
 9   funding           0 non-null      object 
 10  PACs              1 non-null      object 
 11  publication_name  5 non-null      object 
 12  subject_areas     5 non-null      object 
 13  s2_authors        5 non-null      object 
 14  s2_author_ids     5 non-null      object 
 15  citations         5 non-null      object 
 16  references        4 non-null      object 
 17  n

## 2. Setup AutoBunny Procedure

AutoBunny takes a list of steps to run. Each step (wrapped in an ```AutoBunnyStep``` class) consists of five arguments:

    1. Hop Mode(s): modes, list of str, str
         Which features to use for expansion of the dataset (currently 'citations', 'references',
         's2_author_ids')
    2. Max Papers: max_papers, int
         The upper bound on how many papers to return for a given hop. If not set, as many papers as possible
         are returned.
    3. Hop Priority: hop_priority, str
         How papers should be prioritized in the lookup if max_papers is defined. The options are `random` and
         `frequency`. The `random` option shuffles the items prior to search. The `frequency` option looks for
         the most common items first.
    4. Cheetah Settings: cheetah_settings, dict
         Which settings to use for filtering the search results
    5. Vulture Settings: vulture_settings, dict
         Which settings to use for text cleaning. In this example this argument is not passed
         and is left as the default implemented in the AutoBunnyStep. If this argument is passed,
         the default is overwritten.
         
For each step, the dataset is expanded and pruned automatically using the specified settings. Early termination conditions apply, such as no papers remaining after filtering or reaching the maximum number of papers allowed in the expansion.
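The expand-then-prune cycle can be pictured with a toy loop. This is only a sketch under stated assumptions, not AutoBunny's implementation: `run_steps`, `neighbors`, `keep`, the dict-based steps, and the in-memory `graph` are all hypothetical stand-ins, and the choice to re-filter the union of old and new papers after each hop is assumed.

```python
# Hypothetical citation/reference graph and abstracts (mock data).
graph = {
    "p1": {"references": ["p2", "p3"], "citations": []},
    "p2": {"references": [], "citations": ["p4"]},
}
abstracts = {
    "p1": "tensor decomposition",
    "p2": "tensor networks",
    "p3": "graph theory",
    "p4": "tensor completion",
}


def neighbors(paper_id, mode):
    """Return the papers linked to paper_id through the given hop mode."""
    return graph.get(paper_id, {}).get(mode, [])


def keep(paper_id):
    """Stand-in for the filtering stage: keep papers mentioning 'tensor'."""
    return "tensor" in abstracts[paper_id]


def run_steps(seed_ids, steps, neighbors, keep):
    """Toy expand-then-prune loop (assumed behavior, for illustration only)."""
    dataset = set(seed_ids)
    for step in steps:
        # Expansion: gather everything reachable through the step's modes.
        found = set()
        for paper_id in dataset:
            for mode in step["modes"]:
                found.update(neighbors(paper_id, mode))
        new_ids = sorted(found - dataset)
        # Optional cap on how many new papers a hop may add (0 means no cap).
        if step.get("max_papers", 0) > 0:
            new_ids = new_ids[: step["max_papers"]]
        # Pruning: filter the combined dataset.
        dataset = {p for p in dataset | set(new_ids) if keep(p)}
        if not dataset:
            break  # early termination: nothing survived the filter
    return dataset


steps = [{"modes": ["references"]}, {"modes": ["citations"]}]
result = run_steps(["p1"], steps, neighbors, keep)
print(sorted(result))
```

In this toy run the first hop adds `p2` and `p3` and then drops `p3` during pruning. The second hop reaches `p4` through the citations of `p2`.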

A single AutoBunnyStep is defined as:

```python
AutoBunnyStep(
    modes: list
    max_papers: int = 0
    hop_priority: str = 'random'  # 'random', 'frequency'
    cheetah_settings: dict = field(default_factory = lambda: {'query': None})
    vulture_settings: dict = field(default_factory = lambda: [])
)
```
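To make the two `hop_priority` options concrete, here is a minimal sketch of how `random` and `frequency` ordering could interact with `max_papers`. The `prioritize` helper and its `seed` parameter are hypothetical and are not part of Bunny; the sketch follows only the description above (shuffle before lookup versus most common items first).

```python
import random
from collections import Counter


def prioritize(ids, hop_priority="random", max_papers=0, seed=None):
    """Order candidate IDs for lookup and optionally cap them (illustrative only)."""
    if hop_priority == "frequency":
        # Most frequently occurring IDs come first.
        ordered = [paper_id for paper_id, _ in Counter(ids).most_common()]
    elif hop_priority == "random":
        # De-duplicate, then shuffle before the lookup.
        ordered = list(dict.fromkeys(ids))
        random.Random(seed).shuffle(ordered)
    else:
        raise ValueError(f"Unknown hop_priority: {hop_priority!r}")
    # max_papers == 0 is treated as "no upper bound".
    return ordered[:max_papers] if max_papers > 0 else ordered


# Reference IDs pooled from several papers; "r1" is referenced most often.
candidate_ids = ["r1", "r2", "r1", "r3", "r1", "r2"]

by_frequency = prioritize(candidate_ids, "frequency", max_papers=2)
by_random = prioritize(candidate_ids, "random", max_papers=2, seed=0)
print(by_frequency)
print(by_random)
```

With `frequency`, the cap keeps the items shared by the most papers. With `random`, every distinct item has the same chance of surviving the cap.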

In [4]:
from TELF.applications.Bunny import AutoBunnyStep

cheetah_settings = {
    "query": "tensor",
    "in_title":False, 
    "in_abstract":True,
}

steps = [
    AutoBunnyStep(['references'], cheetah_settings=cheetah_settings),
    AutoBunnyStep(['citations'], cheetah_settings=cheetah_settings),
]
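As a rough picture of what the `cheetah_settings` above ask for, the following sketch filters mock records by a substring query. It assumes that `in_title` and `in_abstract` simply select which fields are searched; the `matches` helper is hypothetical, and Cheetah's actual matching logic may differ.

```python
def matches(paper, query, in_title=False, in_abstract=True):
    """Return True if the query occurs in any of the selected fields (assumed semantics)."""
    query = query.lower()
    fields = []
    if in_title:
        fields.append(paper.get("title") or "")
    if in_abstract:
        fields.append(paper.get("abstract") or "")
    return any(query in text.lower() for text in fields)


# Mock records, not real papers.
papers = [
    {"title": "Tensor methods", "abstract": "We study matrices."},
    {"title": "Factorization", "abstract": "A tensor decomposition approach."},
]

settings = {"query": "tensor", "in_title": False, "in_abstract": True}
kept = [p["title"] for p in papers if matches(p, **settings)]
print(kept)
```

Under these assumptions only the second record is kept, because the first one mentions the query in its title alone and `in_title` is `False`.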

## 3. Use Bunny

In [5]:
if "S2_KEY" in os.environ:
    print("Found S2_KEY environment variable")
    API_KEY = os.environ["S2_KEY"]
else:
    print("Variable does not exist. Export SemanticScholar API key on your environment using the variable name S2_KEY.")
    API_KEY = ""

Found S2_KEY environment variable


In [6]:
from TELF.applications.Bunny import AutoBunny

ab = AutoBunny(df, s2_key=API_KEY, verbose=True)

In [7]:
ab_df = ab.run(steps)
ab_df.info()

[Vulture]: Cleaning 5 documents
  0%|          | 0/1 [00:00<?, ?it/s][Vulture]: Running SimpleCleaner module
[Parallel(n_jobs=5)]: Using backend MultiprocessingBackend with 5 concurrent workers.
[Parallel(n_jobs=5)]: Done   2 out of   5 | elapsed:    0.3s remaining:    0.5s
[Parallel(n_jobs=5)]: Done   5 out of   5 | elapsed:    0.4s finished
100%|██████████| 1/1 [00:00<00:00,  2.42it/s]


Overwriting existing index.
Indexing abstract


100%|██████████| 5/5 [00:00<00:00, 5786.84it/s]
100%|██████████| 398/398 [00:00<00:00, 3551772.32it/s]


Indexing years


100%|██████████| 5/5 [00:00<00:00, 127875.12it/s]


Indexing author IDs


100%|██████████| 5/5 [00:00<00:00, 135300.13it/s]
100%|██████████| 1/1 [00:00<00:00, 33825.03it/s]


Indexing affiliations and countries


100%|██████████| 5/5 [00:00<00:00, 26149.03it/s]
100%|██████████| 3/3 [00:00<00:00, 107546.26it/s]
100%|██████████| 3/3 [00:00<00:00, 112347.43it/s]
Found 3 papers in 0.0011 seconds
[Bunny]: Downloading papers for hop 1
  0%|          | 0/18 [00:00<?, ?it/s]




[S2]: Finished downloading 0 papers for given query in 0.99s
[Vulture]: Cleaning 5 documents
  0%|          | 0/1 [00:00<?, ?it/s][Vulture]: Running SimpleCleaner module
[Parallel(n_jobs=5)]: Using backend MultiprocessingBackend with 5 concurrent workers.
[Parallel(n_jobs=5)]: Done   2 out of   5 | elapsed:    0.4s remaining:    0.5s
[Parallel(n_jobs=5)]: Done   5 out of   5 | elapsed:    0.4s finished
100%|██████████| 1/1 [00:00<00:00,  2.31it/s]


Overwriting existing index.
Indexing abstract


100%|██████████| 5/5 [00:00<00:00, 6636.56it/s]
100%|██████████| 398/398 [00:00<00:00, 1570397.92it/s]


Indexing years


100%|██████████| 5/5 [00:00<00:00, 59074.70it/s]


Indexing author IDs


100%|██████████| 5/5 [00:00<00:00, 84222.97it/s]
100%|██████████| 1/1 [00:00<00:00, 20560.31it/s]


Indexing affiliations and countries


100%|██████████| 5/5 [00:00<00:00, 13609.03it/s]
100%|██████████| 3/3 [00:00<00:00, 62291.64it/s]
100%|██████████| 3/3 [00:00<00:00, 64527.75it/s]
Found 3 papers in 0.0021 seconds
[Bunny]: Downloading papers for hop 1
  0%|          | 0/70 [00:01<?, ?it/s]




[S2]: Finished downloading 0 papers for given query in 1.64s
[Vulture]: Cleaning 3 documents
  0%|          | 0/1 [00:00<?, ?it/s][Vulture]: Running SimpleCleaner module
[Parallel(n_jobs=3)]: Using backend MultiprocessingBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done   3 out of   3 | elapsed:    0.3s finished
100%|██████████| 1/1 [00:00<00:00,  2.63it/s]


Overwriting existing index.
Indexing abstract


100%|██████████| 3/3 [00:00<00:00, 2573.72it/s]
100%|██████████| 320/320 [00:00<00:00, 1446311.72it/s]


Indexing years


100%|██████████| 3/3 [00:00<00:00, 22753.91it/s]


Indexing author IDs


100%|██████████| 3/3 [00:00<00:00, 42945.09it/s]
100%|██████████| 1/1 [00:00<00:00, 21183.35it/s]


Indexing affiliations and countries


100%|██████████| 3/3 [00:00<00:00, 13603.15it/s]
100%|██████████| 3/3 [00:00<00:00, 59918.63it/s]
100%|██████████| 3/3 [00:00<00:00, 67288.30it/s]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   eid               3 non-null      object 
 1   s2id              3 non-null      object 
 2   doi               3 non-null      object 
 3   title             3 non-null      object 
 4   abstract          3 non-null      object 
 5   year              3 non-null      int64  
 6   authors           3 non-null      object 
 7   author_ids        3 non-null      object 
 8   affiliations      3 non-null      object 
 9   funding           0 non-null      object 
 10  PACs              0 non-null      object 
 11  publication_name  3 non-null      object 
 12  subject_areas     3 non-null      object 
 13  s2_authors        3 non-null      object 
 14  s2_author_ids     3 non-null      object 
 15  citations         3 non-null      object 
 16  references        2 non-null      object 
 17  n

Found 3 papers in 0.0016 seconds
