# Quickstart

This demo is a minimal working adaptation of the directions in the README. It involves:
- creating a one-publication atlas from a BibTeX file that specifies a single reference,
- querying Semantic Scholar for this publication and obtaining its document embedding, and
- expanding the atlas outwards to 1,000 publications via `iterate_expand`.

In [1]:
# Create a cartographer with a Semantic Scholar librarian and a SciBERT vectorizer
from sciterra import Cartographer
from sciterra.librarians import SemanticScholarLibrarian # or ADSLibrarian
from sciterra.vectorization import SciBERTVectorizer # among others

crt = Cartographer(
    librarian=SemanticScholarLibrarian(),
    vectorizer=SciBERTVectorizer(),
)

IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html


Using device: cpu.


In [2]:
# Use the cartographer and a bib file to create an atlas
atl = crt.bibtex_to_atlas('./example.bib')

Querying Semantic Scholar for 1 total papers.


progress using call_size=10: 100%|██████████| 1/1 [00:02<00:00,  2.83s/it]
100%|██████████| 1/1 [00:00<00:00, 24244.53it/s]


In [3]:
# Get abstracts
docs = [atl[identifier].abstract for identifier in atl.ids]

# Embed abstracts
result = crt.vectorizer.embed_documents(docs)
embeddings = result["embeddings"]

# Depending on the vectorizer, some embeddings may be unobtainable (e.g., due to out-of-vocabulary issues)
success_indices = result["success_indices"] # shape `(len(embeddings),)`
fail_indices = result["fail_indices"] # shape `(len(docs) - len(embeddings),)`

success_indices, fail_indices

embedding documents: 64it [00:00, 251.67it/s]             


(array([0]), array([], dtype=int64))
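
Because `success_indices` records which positions in `docs` produced an embedding, the embeddings can be matched back to their source documents. The following is a minimal sketch with stand-in data rather than real abstracts. The helper name `pair_docs_with_embeddings` is hypothetical and not part of sciterra, and the sketch assumes that row `i` of `embeddings` corresponds to `docs[success_indices[i]]`, as the shape comments in the cell above suggest.

```python
def pair_docs_with_embeddings(docs, embeddings, success_indices):
    """Return (document, embedding) pairs for the documents that were embedded.

    Assumption: embeddings[i] belongs to docs[success_indices[i]].
    """
    if len(embeddings) != len(success_indices):
        raise ValueError("expected one embedding per success index")
    return [(docs[int(i)], emb) for i, emb in zip(success_indices, embeddings)]


# Stand-in data: three abstracts, the second of which failed to embed.
docs = ["abstract a", "abstract b", "abstract c"]
embeddings = [[0.1, 0.2], [0.3, 0.4]]
success_indices = [0, 2]
fail_indices = [1]

pairs = pair_docs_with_embeddings(docs, embeddings, success_indices)
failed_docs = [docs[int(i)] for i in fail_indices]

print([doc for doc, _ in pairs])  # documents that have an embedding
print(failed_docs)                # documents that could not be embedded
```

The helper only iterates and indexes, so it should work equally with plain lists or with the NumPy arrays shown in the cell output.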

In [4]:
from sciterra.mapping.tracing import iterate_expand

# Assuming the initial atlas contains just one publication
(atl.center, ) = atl.publications.keys()
# Build out the atlas to at least 1,000 publications, with increasing dissimilarity
# to the initial publication, saving progress in binary files to the directory "atlas".
atl = iterate_expand(
    atl=atl,
    crt=crt,
    atlas_dir="atlas",
    target_size=1000,
    center=atl.center,
)


Expansion 1
-------------------------------


embedding documents: 64it [00:00, 232.51it/s]             


Expansion will include 291 new publications.
Querying Semantic Scholar for 291 total papers.


progress using call_size=10:  86%|████████▌ | 250/291 [02:41<00:41,  1.02s/it]

Had to call <function SemanticScholarLibrarian.get_publications.<locals>.get_papers at 0x284c6c7c0> 2 times to get a response.


progress using call_size=10: 100%|██████████| 291/291 [03:04<00:00,  1.57it/s]


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

100%|██████████| 291/291 [00:00<00:00, 2941066.18it/s]
Recursively creating atlas data directory at atlas.
Writing to atlas/publications.pkl.
Writing to atlas/projection.pkl.
Writing to atlas/bad_ids.pkl.
No history to save, skipping.
Writing to atlas/center.pkl.
18 publications were filtered due to missing crucial data or incorrect field of study. There are now 18 total ids that will be excluded in the future.
Found 273 publications not contained in Atlas projection.
Embedding 273 total documents.


Atlas has 292 publications and 1 embeddings.


embedding documents: 320it [01:27,  3.67it/s]                         
Overwriting existing file at atlas/publications.pkl.
Overwriting existing file at atlas/projection.pkl.
Overwriting existing file at atlas/bad_ids.pkl.
Overwriting existing file at atlas/center.pkl.


Atlas has 274 publications and 274 embeddings.
Tracking atlas...
Calculating degree of convergence for all publications.
computing cosine similarity for 274 embeddings with batch size 274.


100%|██████████| 1/1 [00:00<00:00, 312.70it/s]
calculating converged kernel size: 100%|██████████| 274/274 [00:00<00:00, 34918.55it/s]
Overwriting existing file at atlas/publications.pkl.
Overwriting existing file at atlas/projection.pkl.
Overwriting existing file at atlas/bad_ids.pkl.
Writing to atlas/history.pkl.
Overwriting existing file at atlas/center.pkl.



Expansion 2
-------------------------------
Expansion will include 500 new publications.
Querying Semantic Scholar for 500 total papers.


progress using call_size=10: 100%|██████████| 500/500 [02:50<00:00,  2.94it/s]


100%|██████████| 500/500 [00:00<00:00, 3052623.00it/s]
Overwriting existing file at atlas/publications.pkl.
Overwriting existing file at atlas/projection.pkl.
Overwriting existing file at atlas/bad_ids.pkl.
No history to save, skipping.
Overwriting existing file at atlas/center.pkl.
222 publications were filtered due to missing crucial data or incorrect field of study. There are now 240 total ids that will be excluded in the future.
Found 278 publications not contained in Atlas projection.
Embedding 278 total documents.


Atlas has 774 publications and 274 embeddings.


embedding documents: 320it [01:28,  3.60it/s]                         


Atlas has 552 publications and 552 embeddings.
Tracking atlas...
Calculating degree of convergence for all publications.
computing cosine similarity for 552 embeddings with batch size 552.


100%|██████████| 1/1 [00:00<00:00, 158.11it/s]
calculating converged kernel size: 100%|██████████| 552/552 [00:00<00:00, 19278.38it/s]
Overwriting existing file at atlas/publications.pkl.
Overwriting existing file at atlas/projection.pkl.
Overwriting existing file at atlas/bad_ids.pkl.
Overwriting existing file at atlas/history.pkl.
Overwriting existing file at atlas/center.pkl.



Expansion 3
-------------------------------
Expansion will include 500 new publications.
Querying Semantic Scholar for 500 total papers.


progress using call_size=10:  88%|████████▊ | 440/500 [04:38<00:53,  1.12it/s]

Had to call <function SemanticScholarLibrarian.get_publications.<locals>.get_papers at 0x284c6e0c0> 2 times to get a response.


progress using call_size=10: 100%|██████████| 500/500 [05:09<00:00,  1.61it/s]


100%|██████████| 500/500 [00:00<00:00, 4032984.62it/s]
Overwriting existing file at atlas/publications.pkl.
Overwriting existing file at atlas/projection.pkl.
Overwriting existing file at atlas/bad_ids.pkl.
No history to save, skipping.
Overwriting existing file at atlas/center.pkl.
139 publications were filtered due to missing crucial data or incorrect field of study. There are now 379 total ids that will be excluded in the future.
Found 361 publications not contained in Atlas projection.
Embedding 361 total documents.


Atlas has 1052 publications and 552 embeddings.


embedding documents: 384it [01:51,  3.44it/s]                         


Atlas has 913 publications and 913 embeddings.
Tracking atlas...
Calculating degree of convergence for all publications.
computing cosine similarity for 913 embeddings with batch size 913.


100%|██████████| 1/1 [00:00<00:00, 83.34it/s]
calculating converged kernel size: 100%|██████████| 913/913 [00:00<00:00, 11717.44it/s]
Overwriting existing file at atlas/publications.pkl.
Overwriting existing file at atlas/projection.pkl.
Overwriting existing file at atlas/bad_ids.pkl.
Overwriting existing file at atlas/history.pkl.
Overwriting existing file at atlas/center.pkl.



Expansion 4
-------------------------------
Expansion will include 500 new publications.
Querying Semantic Scholar for 500 total papers.


progress using call_size=10:   6%|▌         | 30/500 [00:37<11:23,  1.46s/it]

Had to call <function SemanticScholarLibrarian.get_publications.<locals>.get_papers at 0x2b6e8b740> 2 times to get a response.


progress using call_size=10:  12%|█▏        | 60/500 [01:15<10:38,  1.45s/it]

Had to call <function SemanticScholarLibrarian.get_publications.<locals>.get_papers at 0x2b6e8a020> 2 times to get a response.


progress using call_size=10:  14%|█▍        | 70/500 [01:35<11:50,  1.65s/it]

Had to call <function SemanticScholarLibrarian.get_publications.<locals>.get_papers at 0x2848f84a0> 2 times to get a response.


progress using call_size=10:  18%|█▊        | 90/500 [02:04<11:00,  1.61s/it]

Had to call <function SemanticScholarLibrarian.get_publications.<locals>.get_papers at 0x284c6c7c0> 2 times to get a response.


progress using call_size=10:  40%|████      | 200/500 [03:34<05:25,  1.09s/it]

Had to call <function SemanticScholarLibrarian.get_publications.<locals>.get_papers at 0x284c6d800> 2 times to get a response.


progress using call_size=10:  58%|█████▊    | 290/500 [04:53<04:03,  1.16s/it]

Had to call <function SemanticScholarLibrarian.get_publications.<locals>.get_papers at 0x28528d440> 2 times to get a response.


progress using call_size=10:  82%|████████▏ | 410/500 [06:37<02:12,  1.47s/it]

Had to call <function SemanticScholarLibrarian.get_publications.<locals>.get_papers at 0x28528e520> 3 times to get a response.


progress using call_size=10:  94%|█████████▍| 470/500 [07:32<00:37,  1.25s/it]

Had to call <function SemanticScholarLibrarian.get_publications.<locals>.get_papers at 0x28528e840> 2 times to get a response.


progress using call_size=10:  96%|█████████▌| 480/500 [07:52<00:29,  1.47s/it]

Had to call <function SemanticScholarLibrarian.get_publications.<locals>.get_papers at 0x28528e980> 2 times to get a response.


progress using call_size=10: 100%|██████████| 500/500 [08:15<00:00,  1.01it/s]

Had to call <function SemanticScholarLibrarian.get_publications.<locals>.get_papers at 0x28528ea20> 2 times to get a response.





100%|██████████| 500/500 [00:00<00:00, 3355443.20it/s]
Overwriting existing file at atlas/publications.pkl.
Overwriting existing file at atlas/projection.pkl.
Overwriting existing file at atlas/bad_ids.pkl.
No history to save, skipping.
Overwriting existing file at atlas/center.pkl.
167 publications were filtered due to missing crucial data or incorrect field of study. There are now 546 total ids that will be excluded in the future.
Found 333 publications not contained in Atlas projection.
Embedding 333 total documents.


Atlas has 1413 publications and 913 embeddings.


embedding documents: 384it [01:34,  4.06it/s]                         


Atlas has 1246 publications and 1246 embeddings.
Tracking atlas...
Calculating degree of convergence for all publications.
computing cosine similarity for 1246 embeddings with batch size 1000.


100%|██████████| 2/2 [00:00<00:00, 55.78it/s]
calculating converged kernel size: 100%|██████████| 1246/1246 [00:00<00:00, 8390.90it/s]
Overwriting existing file at atlas/publications.pkl.
Overwriting existing file at atlas/projection.pkl.
Overwriting existing file at atlas/bad_ids.pkl.
Overwriting existing file at atlas/history.pkl.
Overwriting existing file at atlas/center.pkl.


Exiting loop.
Tracking atlas...
Calculating degree of convergence for all publications.
computing cosine similarity for 1246 embeddings with batch size 1000.


100%|██████████| 2/2 [00:00<00:00, 111.06it/s]
calculating converged kernel size: 100%|██████████| 1246/1246 [00:00<00:00, 8386.33it/s]
Overwriting existing file at atlas/publications.pkl.


Expansion loop exited with atlas size 1246 after 4 iterations meeting criteria: {'target_size': True, 'max_failed_expansions': False, 'convergence_func': False}.


Overwriting existing file at atlas/projection.pkl.
Overwriting existing file at atlas/bad_ids.pkl.
Overwriting existing file at atlas/history.pkl.
Overwriting existing file at atlas/center.pkl.


<sciterra.mapping.atlas.Atlas at 0x2c7901c50>
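
The final atlas holds 1,246 publications rather than exactly 1,000 because each expansion adds a whole batch of candidates, filters some of them out, and only then is the target size checked. The sketch below replays the counts reported in the log above. The variable names are illustrative only and do not come from sciterra, and the stopping rule (`size >= target_size`) is an assumption that happens to match the logged behaviour.

```python
# (publications queried, publications filtered out) for each expansion,
# copied from the log output above.
expansions = [(291, 18), (500, 222), (500, 139), (500, 167)]
target_size = 1000

size = 1      # the atlas starts with the single center publication
excluded = 0  # running total of ids excluded from future expansions
sizes = []
for added, filtered in expansions:
    size += added - filtered
    excluded += filtered
    sizes.append(size)
    if size >= target_size:  # assumed stopping rule
        break

print(sizes)     # [274, 552, 913, 1246]
print(excluded)  # 546
```

These totals agree with the "Atlas has N publications and N embeddings" lines and with the running count of excluded ids in the log.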

['04da6471743468b6bb1d26dd9a6eac4c03ca73ee']
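
The `huggingface/tokenizers` messages in the output above are warnings rather than errors. As the message itself suggests, they can be avoided by setting the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, assuming it is run in the first notebook cell, before the vectorizer is created; the choice of `"false"` is an assumption made here, not a recommendation from the README:

```python
import os

# Set this before any tokenizer is used, so the library does not have to
# disable parallelism itself (and warn about it) after the process forks.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting the variable to `"false"` may make tokenization somewhat slower in exchange for quieter output.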