# iPenguin - Semantic Scholar Example

## 0. Settings

In [1]:
# allow async coroutines to run inside a Jupyter notebook
import nest_asyncio
nest_asyncio.apply()

In [2]:
import os
import pathlib

In [4]:
# absolute path of the cache directory (resolved relative to the current working directory)
CACHING_DIR = pathlib.Path('s2_cache').resolve()

## 1. Create the S2 Handler Object

In [5]:
from TELF.pre_processing.iPenguin.SemanticScholar import SemanticScholar

In [6]:
if "S2_KEY" in os.environ:
    print("Found S2_KEY environment variable")
    API_KEY = os.environ["S2_KEY"]
else:
    print("S2_KEY is not set. Export your Semantic Scholar API key as the environment variable S2_KEY.")
    API_KEY = ""

Found S2_KEY environment variable


In [7]:
s2 = SemanticScholar(
    key = API_KEY,
    mode = 'fs',         # file system caching mode (default)
    name = CACHING_DIR,  # where to cache the files
    verbose=True
)

## 2. Use the S2 Handler

### A. Look Up Papers by Their ID

In [8]:
PAPERS = [
    '59e4d6475c41096befeafec55ea5ad97432de527', 
    '9806df234e723cd348112f348ee0724f52bc8f73', 
]

len(PAPERS)

2

In [9]:
count = s2.count(PAPERS, mode='paper')

[S2]: Found 2 papers in 0.46s


In [10]:
df, paper_ids = s2.search(PAPERS, mode='paper')
df.info()

100%|██████████| 2/2 [00:01<00:00,  1.50it/s]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   s2id            2 non-null      object
 1   doi             2 non-null      object
 2   year            2 non-null      int64 
 3   title           2 non-null      object
 4   abstract        2 non-null      object
 5   s2_authors      2 non-null      object
 6   s2_author_ids   2 non-null      object
 7   citations       2 non-null      object
 8   references      2 non-null      object
 9   num_citations   2 non-null      int64 
 10  num_references  2 non-null      int64 
dtypes: int64(3), object(8)
memory usage: 308.0+ bytes


[S2]: Finished downloading 2 papers for given query in 1.88s
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.0s finished


In [11]:
paper_ids[:2]

['59e4d6475c41096befeafec55ea5ad97432de527',
 '9806df234e723cd348112f348ee0724f52bc8f73']

### B. Get Papers by Author ID(s)

In [12]:
AUTHORS = [
    '2025666',     # Boian Alexandrov
]

Get the **maximum** number of papers that this search can return. Semantic Scholar does not offer a quick way to find the total number of distinct papers for a group of authors, so the count is computed as the sum of each author's paper count. A paper co-authored by several of the listed authors is therefore counted once per author. Treat the output of ```SemanticScholar.count(*, mode='author')``` as an upper bound on how many papers will be returned.

In [13]:
count = s2.count(AUTHORS, mode='author')

[S2]: Found 148 papers in 0.54s


In [14]:
df, paper_ids = s2.search(AUTHORS, mode='author')
df.info()

100%|██████████| 148/148 [00:05<00:00, 27.74it/s]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148 entries, 0 to 147
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   s2id            148 non-null    object
 1   doi             135 non-null    object
 2   year            148 non-null    int64 
 3   title           148 non-null    object
 4   abstract        102 non-null    object
 5   s2_authors      148 non-null    object
 6   s2_author_ids   148 non-null    object
 7   citations       110 non-null    object
 8   references      117 non-null    object
 9   num_citations   148 non-null    int64 
 10  num_references  148 non-null    int64 
dtypes: int64(3), object(8)
memory usage: 12.8+ KB


[S2]: Finished downloading 148 papers for given query in 6.35s
[Parallel(n_jobs=148)]: Using backend ThreadingBackend with 148 concurrent workers.
[Parallel(n_jobs=148)]: Done   2 out of 148 | elapsed:    0.0s remaining:    1.9s
[Parallel(n_jobs=148)]: Done 148 out of 148 | elapsed:    0.1s finished
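
To see why the author-mode count is only an upper bound, here is a small self-contained sketch. The author IDs, paper IDs, and helper names are made up for illustration and are not part of TELF. The sketch only mimics the summing behavior described above.

```python
# Hypothetical data: author ID -> set of paper IDs (not real S2 records).
papers_by_author = {
    "author_a": {"p1", "p2", "p3"},
    "author_b": {"p3", "p4"},  # "p3" is co-authored with author_a
}

def summed_count(papers_by_author):
    """Sum of per-author paper counts; a co-authored paper is counted once per author."""
    return sum(len(papers) for papers in papers_by_author.values())

def distinct_count(papers_by_author):
    """Number of distinct papers across all authors."""
    return len(set().union(*papers_by_author.values()))

upper_bound = summed_count(papers_by_author)  # 3 + 2 = 5
actual = distinct_count(papers_by_author)     # p1, p2, p3, p4 -> 4
print(upper_bound, actual)
```

With a single author, as in the cells above, the two numbers coincide, which matches the 148 papers reported by the count and the 148 rows returned by the search.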


### C. Get Papers by Query

First, check how many papers S2 can find for a query.

In [15]:
count = s2.count('tensor decomposition', mode='query')

[S2]: Found 1,213,639 papers in 1.07s


The ```SemanticScholar.search()``` function takes an optional argument ```n``` that sets an upper limit on how many papers to download. Semantic Scholar ranks search results by relevance, so the top ```n``` most relevant papers are retrieved for a given query. By default ```n``` is 100, but it can be set to any positive integer. Set ```n``` to 0 to download all available papers for a query.

In [16]:
df, paper_ids = s2.search('tensor decomposition', mode='query', n=5)
df.info()

100%|██████████| 5/5 [00:02<00:00,  2.09it/s]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   s2id            5 non-null      object
 1   doi             4 non-null      object
 2   year            5 non-null      int64 
 3   title           5 non-null      object
 4   abstract        5 non-null      object
 5   s2_authors      5 non-null      object
 6   s2_author_ids   5 non-null      object
 7   citations       5 non-null      object
 8   references      5 non-null      object
 9   num_citations   5 non-null      int64 
 10  num_references  5 non-null      int64 
dtypes: int64(3), object(8)
memory usage: 572.0+ bytes


[S2]: Finished downloading 5 papers for given query in 3.77s
[Parallel(n_jobs=5)]: Using backend ThreadingBackend with 5 concurrent workers.
[Parallel(n_jobs=5)]: Done   2 out of   5 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=5)]: Done   5 out of   5 | elapsed:    0.0s finished


In [17]:
len(paper_ids)

5
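
The effect of `n` on download size can be sketched with a few lines of arithmetic. This assumes `n` simply caps the number of results and that 0 means "no cap", as described above. `effective_limit` is a hypothetical helper written for this illustration, not a TELF function.

```python
def effective_limit(available, n=100):
    """Papers that would be downloaded under the rule described above.

    Illustrative assumption only; this is not how TELF is implemented.
    """
    if n < 0:
        raise ValueError("n must be 0 or a positive integer")
    if n == 0:
        return available  # 0 means "download everything"
    return min(n, available)

available = 1_213_639  # count reported above for 'tensor decomposition'
print(effective_limit(available, n=5))  # 5
print(effective_limit(available))       # 100 (default n)
print(effective_limit(available, n=0))  # 1213639
```

For a broad query like this one, `n=0` would request more than a million papers, so it is worth checking the count before removing the limit.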

## 3. Get all Cached Papers and Form Single DataFrame

The ```SemanticScholar.get_df``` method reads the file system cache directly. It takes an optional argument ```targets```. If ```targets``` is None (the default), all papers in the cache directory are returned. Otherwise ```targets``` is expected to be an iterable (list, set, tuple) of Semantic Scholar IDs to fetch from the cache.

In [18]:
df, paper_ids = SemanticScholar.get_df(CACHING_DIR)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   s2id            153 non-null    object
 1   doi             139 non-null    object
 2   year            153 non-null    int64 
 3   title           153 non-null    object
 4   abstract        107 non-null    object
 5   s2_authors      153 non-null    object
 6   s2_author_ids   153 non-null    object
 7   citations       115 non-null    object
 8   references      122 non-null    object
 9   num_citations   153 non-null    int64 
 10  num_references  153 non-null    int64 
dtypes: int64(3), object(8)
memory usage: 13.3+ KB
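
The `targets` behavior described above can be illustrated with a minimal stand-in. Here a plain dict plays the role of the cache directory, and `select_cached` is a hypothetical helper, not the library's implementation. How the real method treats IDs that are missing from the cache is not shown in this notebook, so silently skipping them is an assumption.

```python
def select_cached(cache, targets=None):
    """Return cached records, optionally restricted to the given IDs.

    Illustrative only: `cache` is a dict standing in for the cache directory.
    """
    if targets is None:
        return dict(cache)  # no filter: return everything in the cache
    wanted = set(targets)   # accepts any iterable (list, set, tuple)
    return {s2id: record for s2id, record in cache.items() if s2id in wanted}

# Made-up records, keyed by a placeholder ID.
cache = {
    "id_1": {"title": "First paper"},
    "id_2": {"title": "Second paper"},
    "id_3": {"title": "Third paper"},
}

everything = select_cached(cache)
subset = select_cached(cache, targets=["id_1", "id_3", "id_9"])  # "id_9" is not cached
print(len(everything), sorted(subset))
```

In this notebook the corresponding call would presumably be `SemanticScholar.get_df(CACHING_DIR, targets=PAPERS)`, which should return only the two papers looked up in section 2A, provided they are in the cache.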
