# iPenguin - OSTI Example

## 0. Settings

In [1]:
# Allow nested event loops so that async coroutines can run inside a Jupyter notebook
import nest_asyncio
nest_asyncio.apply()
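
Why this patch is needed can be shown without Jupyter. A notebook kernel already runs an event loop, and unpatched `asyncio` refuses to start a second one from inside it. The following is a minimal standard-library sketch of that failure mode; the coroutine names `fetch` and `notebook_cell` are hypothetical stand-ins and are not part of TELF. `nest_asyncio.apply()` patches `asyncio` so that such nested calls are permitted.

```python
import asyncio


async def fetch():
    # Hypothetical stand-in for an async download
    return "paper"


async def notebook_cell():
    # Simulates code running inside an already-active event loop,
    # which is the situation inside a Jupyter kernel.
    coro = fetch()
    try:
        asyncio.run(coro)  # nested call: rejected by unpatched asyncio
    except RuntimeError as err:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(err)
    return None


error_message = asyncio.run(notebook_cell())
print(error_message)
```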

In [14]:
import pathlib

# Resolve the cache directory to an absolute path
CACHING_DIR = pathlib.Path('osti_cache').resolve()

## 1. Create the OSTI Handler Object

In [3]:
from TELF.pre_processing.iPenguin.OSTI import OSTI

In [4]:
osti = OSTI(
    mode='fs',         # file system caching mode (default)
    name=CACHING_DIR,  # where to cache the files
    verbose=True,      # print progress messages
)

## 2. Use the OSTI Handler

### A. Lookup Papers by their ID

In [5]:
PAPERS = [
    '2528082',
    '2328630',
    '2006438',
    '2246858',
]

len(PAPERS)

4

In [6]:
count = osti.count(PAPERS, mode='paper')

[OSTI]: Found 4 papers in 0.11s


In [7]:
df = osti.search(PAPERS, mode='paper')
df.info()

100%|██████████| 4/4 [00:00<00:00, 11.09it/s]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   osti_id              4 non-null      object
 1   doi                  4 non-null      object
 2   title                4 non-null      object
 3   year                 4 non-null      int64 
 4   abstract             3 non-null      object
 5   authors              4 non-null      object
 6   author_ids           4 non-null      object
 7   affiliations         1 non-null      object
 8   country_publication  4 non-null      object
 9   report_number        2 non-null      object
 10  doe_contract_number  4 non-null      object
 11  publisher            2 non-null      object
 12  language             2 non-null      object
dtypes: int64(1), object(12)
memory usage: 548.0+ bytes


[OSTI]: Finished downloading 4 papers in 0.39s
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    0.0s finished


### B. Get Papers by Query

In [9]:
query = 'Boian Alexandrov'

Establish how many papers can be found for some query on OSTI.

In [10]:
count = osti.count(query, mode='query')

[OSTI]: Found 143 papers in 1.00s


The ```OSTI.search()``` function takes an optional argument ```n```, which sets an upper limit on how many papers to download. By default ```n``` is set to 100, but it can be set to any positive integer. Set ```n``` to 0 to download all available papers for a query.
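
The effect of ```n``` can be summarized with a small stand-alone sketch. The helper `effective_limit` below is hypothetical and not part of TELF. It only mirrors the rule described above, under the assumption that a request for more papers than exist is capped at the number available.

```python
def effective_limit(n, available):
    """Return how many papers would be downloaded for a given n.

    Hypothetical helper mirroring the documented rule: n == 0 means
    "no limit", otherwise at most n papers are downloaded.
    """
    if n < 0:
        raise ValueError("n must be 0 or a positive integer")
    if n == 0:
        return available
    return min(n, available)


# 143 is the count reported for the example query
print(effective_limit(100, 143))  # default n  -> 100
print(effective_limit(0, 143))    # no limit   -> 143
```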

In [11]:
df = osti.search(query, mode='query', n=0)
df.info()

0it [00:02, ?it/s]       


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143 entries, 0 to 142
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   osti_id              143 non-null    object
 1   doi                  127 non-null    object
 2   title                143 non-null    object
 3   year                 143 non-null    int64 
 4   abstract             129 non-null    object
 5   authors              143 non-null    object
 6   author_ids           143 non-null    object
 7   affiliations         105 non-null    object
 8   country_publication  143 non-null    object
 9   report_number        103 non-null    object
 10  doe_contract_number  138 non-null    object
 11  publisher            89 non-null     object
 12  language             136 non-null    object
dtypes: int64(1), object(12)
memory usage: 14.7+ KB


[OSTI]: Finished downloading 143 papers in 3.47s
[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done   2 out of  12 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=12)]: Done  12 out of  12 | elapsed:    0.0s finished
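
The `df.info()` summary above shows that several columns are only partially populated. As a rough illustration, the number of missing values per column can be derived from the non-null counts. The figures below are copied from that output; the names `TOTAL`, `non_null`, and `missing` are introduced here for illustration only.

```python
TOTAL = 143  # number of rows reported by df.info()

# Non-null counts for the partially populated columns, copied from the output
non_null = {
    "doi": 127,
    "abstract": 129,
    "affiliations": 105,
    "report_number": 103,
    "doe_contract_number": 138,
    "publisher": 89,
    "language": 136,
}

# Missing entries per column = total rows minus non-null entries
missing = {col: TOTAL - count for col, count in non_null.items()}

# Print the columns with the most gaps first
for col, n_missing in sorted(missing.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<20} {n_missing:>3} missing ({n_missing / TOTAL:.0%})")
```

With the actual DataFrame at hand, the same per-column counts can be obtained directly with `df.isna().sum()`.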
