# Quickstart

This demo is a minimal working adaptation of the directions in the README. It involves
- creating a one-publication atlas from a BibTeX file that specifies a single reference,
- querying Semantic Scholar for this publication and obtaining its document embedding, and
- expanding the atlas outwards to at least 1,000 publications via `iterate_expand`.

In [7]:
# Create a cartographer with a Semantic Scholar librarian and a SciBERT vectorizer
from sciterra import Cartographer
from sciterra.librarians import SemanticScholarLibrarian # or ADSLibrarian
from sciterra.vectorization import SciBERTVectorizer # among others

crt = Cartographer(
    librarian=SemanticScholarLibrarian(),
    vectorizer=SciBERTVectorizer(),
)

Using device: cpu.


In [8]:
# Use the cartographer and a bib file to create an atlas
atl = crt.bibtex_to_atlas('./example.bib')

Querying Semantic Scholar for 1 total papers.


progress using call_size=10: 100%|██████████| 1/1 [00:02<00:00,  2.77s/it]


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

100%|██████████| 1/1 [00:00<00:00, 20867.18it/s]
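
The `huggingface/tokenizers` warning in the output above itself suggests setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the variable is set before any tokenizer is used (for example in the first notebook cell, before the `Cartographer` is created); this should silence the warning but is not required for the demo to work:

```python
import os

# Set this before any tokenizer is used, e.g. at the top of the notebook.
# "false" turns tokenizer parallelism off up front, which is what the
# library falls back to anyway after printing the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```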


In [9]:
# Get abstracts
docs = [atl[identifier].abstract for identifier in atl.ids]

# Embed abstracts
result = crt.vectorizer.embed_documents(docs)
embeddings = result["embeddings"]

# Depending on the vectorizer, some documents cannot be embedded (e.g. due to out-of-vocab issues)
success_indices = result["success_indices"]  # shape `(len(embeddings),)`
fail_indices = result["fail_indices"]  # shape `(len(docs) - len(embeddings),)`

success_indices, fail_indices

embedding documents: 64it [00:00, 238.48it/s]             


(array([0]), array([], dtype=int64))
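
When some documents fail to embed, the embeddings no longer line up one-to-one with `atl.ids`. The sketch below shows one way to recover the pairing. It assumes that the indices refer to positions in `docs` and that row `k` of the embeddings corresponds to `success_indices[k]`; the helper name and the sample identifiers are hypothetical and are not part of sciterra.

```python
def split_by_embedding_success(ids, success_indices, fail_indices):
    """Split identifiers into those that were embedded and those that were not.

    Assumes the indices are positions in the list of documents passed to the
    vectorizer, in the same order as `ids`.
    """
    embedded_ids = [ids[i] for i in success_indices]
    failed_ids = [ids[i] for i in fail_indices]
    return embedded_ids, failed_ids


# Hypothetical identifiers standing in for `atl.ids`
ids = ["pub-a", "pub-b", "pub-c", "pub-d"]
# Hypothetical result in which the document at position 2 could not be embedded
success_indices = [0, 1, 3]
fail_indices = [2]

embedded_ids, failed_ids = split_by_embedding_success(ids, success_indices, fail_indices)
print(embedded_ids)  # ['pub-a', 'pub-b', 'pub-d']
print(failed_ids)    # ['pub-c']
```

In the run above every document was embedded, so `fail_indices` is empty and no realignment is needed.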

In [10]:
from sciterra.mapping.tracing import iterate_expand

# Assuming the initial atlas contains just one publication
(atl.center,) = atl.publications.keys()
# Build out an atlas to contain 1,000 publications, with increasing
# dissimilarity to the initial publication, saving progress in binary files
# to the directory named "atlas".
atl = iterate_expand(
    atl=atl,
    crt=crt,
    atlas_dir="atlas",
    target_size=1000,
    center=atl.center,
)


Expansion 1
-------------------------------


embedding documents: 64it [00:00, 153.23it/s]             


Expansion will include 291 new publications.
Querying Semantic Scholar for 291 total papers.


progress using call_size=10:  17%|█▋        | 50/291 [00:42<04:30,  1.12s/it]

Had to call <function SemanticScholarLibrarian.get_publications.<locals>.get_papers at 0x2b6e88540> 2 times to get a response.


progress using call_size=10: 100%|██████████| 291/291 [02:56<00:00,  1.64it/s]


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

100%|██████████| 291/291 [00:00<00:00, 2506247.36it/s]
Overwriting existing file at atlas/publications.pkl.
Overwriting existing file at atlas/projection.pkl.
Overwriting existing file at atlas/bad_ids.pkl.
No history to save, skipping.
Overwriting existing file at atlas/center.pkl.
18 publications were filtered due to missing crucial data or incorrect field of study. There are now 18 total ids that will be excluded in the future.
Found 273 publications not contained in Atlas projection.
Embedding 273 total documents.


Atlas has 292 publications and 1 embeddings.


embedding documents: 320it [01:34,  3.38it/s]                         


Atlas has 274 publications and 274 embeddings.
Tracking atlas...
Calculating degree of convergence for all publications.
computing cosine similarity for 274 embeddings with batch size 274.


100%|██████████| 1/1 [00:00<00:00, 274.87it/s]
calculating converged kernel size: 100%|██████████| 274/274 [00:00<00:00, 35255.98it/s]
Overwriting existing file at atlas/publications.pkl.
Overwriting existing file at atlas/projection.pkl.
Overwriting existing file at atlas/bad_ids.pkl.
Overwriting existing file at atlas/history.pkl.
Overwriting existing file at atlas/center.pkl.



Expansion 2
-------------------------------
Expansion will include 4000 new publications.
Querying Semantic Scholar for 4000 total papers.


progress using call_size=10: 100%|██████████| 4000/4000 [22:14<00:00,  3.00it/s]


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

100%|██████████| 4000/4000 [00:00<00:00, 4571448.50it/s]
Overwriting existing file at atlas/publications.pkl.
Overwriting existing file at atlas/projection.pkl.
Overwriting existing file at atlas/bad_ids.pkl.
No history to save, skipping.
Overwriting existing file at atlas/center.pkl.
1921 publications were filtered due to missing crucial data or incorrect field of study. There are now 1939 total ids that will be excluded in the future.
Found 2079 publications not contained in Atlas projection.
Embedding 2079 total documents.


Atlas has 4274 publications and 274 embeddings.


embedding documents: 2112it [11:21,  3.10it/s]                          


Atlas has 2353 publications and 2353 embeddings.
Tracking atlas...
Calculating degree of convergence for all publications.
computing cosine similarity for 2353 embeddings with batch size 1000.


100%|██████████| 3/3 [00:00<00:00, 37.68it/s]
calculating converged kernel size: 100%|██████████| 2353/2353 [00:00<00:00, 4612.18it/s]
Overwriting existing file at atlas/publications.pkl.
Overwriting existing file at atlas/projection.pkl.
Overwriting existing file at atlas/bad_ids.pkl.
Overwriting existing file at atlas/history.pkl.
Overwriting existing file at atlas/center.pkl.


Exiting loop.
Tracking atlas...
Calculating degree of convergence for all publications.
computing cosine similarity for 2353 embeddings with batch size 1000.


100%|██████████| 3/3 [00:00<00:00, 53.57it/s]
calculating converged kernel size: 100%|██████████| 2353/2353 [00:00<00:00, 4600.66it/s]

Expansion loop exited with atlas size 2353 after 2 iterations meeting criteria: {'target_size': True, 'max_failed_expansions': False, 'convergence_func': False}.



Overwriting existing file at atlas/publications.pkl.
Overwriting existing file at atlas/projection.pkl.
Overwriting existing file at atlas/bad_ids.pkl.
Overwriting existing file at atlas/history.pkl.
Overwriting existing file at atlas/center.pkl.
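
The log above ends with 2,353 publications rather than exactly 1,000. One reading, consistent with the reported exit criteria, is that the size is only compared against `target_size` between expansions, so a single large expansion can overshoot the target. The sketch below illustrates that reading with the sizes taken from the log; `run_until_target` is a hypothetical stand-in, not sciterra's implementation of `iterate_expand`.

```python
def run_until_target(sizes_after_expansion, target_size, initial_size=1):
    """Simulate a loop that checks the atlas size before each new expansion.

    `sizes_after_expansion` lists the atlas size after each successive
    expansion; the loop stops as soon as the current size meets the target.
    """
    size = initial_size
    iterations = 0
    for new_size in sizes_after_expansion:
        if size >= target_size:
            break
        size = new_size
        iterations += 1
    return iterations, size


# Bookkeeping from the log: previous size + fetched - filtered
size_after_1 = 1 + 291 - 18               # 274
size_after_2 = size_after_1 + 4000 - 1921  # 2353

iterations, final_size = run_until_target([size_after_1, size_after_2], target_size=1000)
print(iterations, final_size)  # 2 2353
```

Under this reading, lowering `target_size` does not cap the final atlas size; it only determines after which expansion the loop stops.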


In [20]:
# Take a look at the abstracts of the last 10 publications
[atl[identifier].abstract for identifier in atl.ids[-10:]]

['The Astrophysics Source Code Library (ASCL), founded in 1999, is a free on-line registry for source codes of interest to astronomers and astrophysicists. The library is housed on the discussion forum for Astronomy Picture of the Day (APOD) and can be accessed at this http URL The ASCL has a comprehensive listing that covers a significant number of the astrophysics source codes used to generate results published in or submitted to refereed journals and continues to grow. The ASCL currently has entries for over 500 codes; its records are citable and are indexed by ADS. The editors of the ASCL and members of its Advisory Committee were on hand at a demonstration table in the ADASS poster room to present the ASCL, accept code submissions, show how the ASCL is starting to be used by the astrophysics community, and take questions on and suggestions for improving the resource.',
 'Observations with the Hubble Space Telescope show that halos of ionized gas are common around star-forming gala