Text Snowball
petermr edited this page Aug 18, 2021 · 7 revisions
- To create a list of terms (words and phrases) that are effective at searching the literature for a desired topic (cyclic voltammetry/batteries).
- To snowball the precision of the query (and repeat the query).
- To download and section the papers.
- To extract images.
- To triage images (NYI).
- To demo extraction of curves -> tables.
Why not arXiv?
(We
- set a relevancy metric (e.g. caption contains "cyclic voltammogram" or other precise terms, or image pixel characteristics)
- small, precise, rapid
- query pygetpapers with a relevant query (AND, OR, NOT; maybe fairly general) and TERMS (OR'ed, e.g. cyclic voltammetry) NOT generator (AG to add dictionary -> TERMS)
- download small corpus (<= 100 papers)
- rapidly inspect these for frequent relevant terms
- RAKE / YAKE (adjust phrase length) (SH report)
- human eyeballs (dictionary editor - GUI started - PMR)
- spaCy
- triage the list (human/GUI) - proto-dictionary
- create/append to list of terms
- repeat until we get enough, or give up
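One pass of the "inspect for frequent relevant terms" step could be sketched as below. This is a minimal stand-in for RAKE / YAKE, not either algorithm: it only splits text at stopwords and punctuation and counts how many papers each phrase occurs in. The function names, the tiny stopword list and the three sample sentences are assumptions made for illustration; the output still needs human triage.

```python
import re
from collections import Counter

# Deliberately tiny stopword list (an assumption; a real run would use a fuller one).
STOPWORDS = frozenset(
    "a an and are as at by for from in is of on or the to was were with".split()
)


def candidate_phrases(text, max_words=3):
    """Return runs of non-stopword words, at most max_words long (RAKE-like split)."""
    phrases = []
    # Break at anything that is not a lowercase letter, whitespace or hyphen.
    for chunk in re.split(r"[^a-z\s-]+", text.lower()):
        run = []
        for word in chunk.split() + [None]:  # None is an end-of-chunk sentinel
            if word is None or word in STOPWORDS:
                if 0 < len(run) <= max_words:
                    phrases.append(" ".join(run))
                run = []
            else:
                run.append(word)
    return phrases


def frequent_terms(documents, known_terms=(), min_docs=2, max_words=3):
    """Rank phrases by the number of documents they occur in, skipping known terms."""
    known = {t.lower() for t in known_terms}
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(candidate_phrases(text, max_words)))
    return [(p, n) for p, n in doc_freq.most_common() if n >= min_docs and p not in known]


# Made-up sentences standing in for a small downloaded corpus.
corpus = [
    "Cyclic voltammetry of the lithium anode.",
    "The lithium anode was studied by cyclic voltammetry.",
    "Galvanostatic cycling of the lithium anode.",
]
proposals = frequent_terms(corpus, known_terms=["cyclic voltammetry"])
print(proposals)
```

Phrases that survive triage are appended to the term list, and the loop repeats with the refined query.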
Now we have a list of terms
- term
- create dictionary from terms (ami3 -> ?pyami (NYI-PMR))
- section the paper (e.g. methods and figures) (PMR - ALL test)
- retrieve images (ami3 and possibly pyami/pdfminer) and image captions (section) and display (PMR)
- GUI / browser display
- resurrect old examples of extracting curves from diagrams
(emphasize generality - plants + terpenes + genes - everything is reusable) - climate ,
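The "create dictionary from terms" step might look roughly like the sketch below. It assumes an ami-style XML layout of a `dictionary` element with a `title` and one `entry` per term carrying `term` and `name` attributes; that layout should be checked against the ami3 / pyami version in use. `terms_to_dictionary` is a hypothetical helper, not part of either tool.

```python
import xml.etree.ElementTree as ET


def terms_to_dictionary(title, terms):
    """Serialize a triaged term list as a simple XML dictionary (assumed layout)."""
    root = ET.Element("dictionary", {"title": title})
    seen = set()
    for term in terms:
        # Normalize case and whitespace so near-duplicates collapse to one entry.
        key = " ".join(term.lower().split())
        if not key or key in seen:
            continue
        seen.add(key)
        ET.SubElement(root, "entry", {"term": key, "name": key})
    return ET.tostring(root, encoding="unicode")


xml_text = terms_to_dictionary(
    "cv", ["Cyclic voltammetry", "cyclic  voltammetry", "redox peak"]
)
print(xml_text)
```

Normalizing before de-duplication keeps the dictionary stable when the same phrase comes back from several snowball rounds with different capitalization.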
Set up on Google Colab
Choose voltammetry
Use EPMC
pygetpapers -q "cyclic voltammetry" -n
INFO: Final query is cyclic voltammetry
INFO: Total number of hits for the query are 30078
refine
pygetpapers -q "(cyclic voltammetry) AND (lithium ion)" -n
INFO: Final query is (cyclic voltammetry) AND (lithium ion)
INFO: Total number of hits for the query are 2675
Now snowball
- download 100 papers and examine for additional precision terms (OR)
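Folding the extra precision terms back into the query could be sketched as follows: the base clauses are AND'ed, as in the refined query above, and the snowballed terms are OR'ed into one further clause. The helper names, the two example terms and the output directory `cv_snowball_1` are hypothetical. The `-k` (limit), `-x` (XML) and `-o` (output directory) flags are given as I understand the pygetpapers options; confirm them with `pygetpapers --help` before running.

```python
import shlex


def _quote(term):
    """Quote multi-word terms so they are searched as phrases."""
    term = " ".join(term.split())
    return f'"{term}"' if " " in term else term


def build_query(base_clauses, snowball_terms=()):
    """AND the base clauses together, then AND one OR-group of snowballed terms."""
    clauses = [f"({c})" for c in base_clauses]
    if snowball_terms:
        clauses.append("(" + " OR ".join(_quote(t) for t in snowball_terms) + ")")
    return " AND ".join(clauses)


query = build_query(
    ["cyclic voltammetry", "lithium ion"],
    ["cyclic voltammogram", "redox peak"],  # illustrative terms, not from a real run
)
# Command is only assembled and printed here, not executed.
command = ["pygetpapers", "-q", query, "-k", "100", "-x", "-o", "cv_snowball_1"]
print(shlex.join(command))
```

Running the same command with `-n` first, as in the examples above, shows how far the added terms cut the hit count before anything is downloaded.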
(Open Notebook Science)
driven by real queries
- Testing with CV queries (ALL) - results on wiki
- download PDFs
- supplemental data (determine "caption")
- commandline-driven
- section (works but needs uploading)
- download sections for term extraction (SH and PMR)
- search by caption OK, but link to image (NYI)
- extract images from PDF (ami)
- ami output -> TERMS
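The caption-based relevancy metric mentioned earlier (caption contains "cyclic voltammogram" or other precise terms) could be prototyped as below. It is a sketch only: captions are assumed to be already extracted into a mapping of figure id to caption text, the three captions are made up, and linking a hit back to its image is still NYI.

```python
import re


def caption_score(caption, terms):
    """Count how many terms occur in the caption (case-insensitive, optional plural 's')."""
    text = " ".join(caption.lower().split())
    return sum(
        1
        for term in terms
        if re.search(r"\b" + re.escape(term.lower()) + r"s?\b", text)
    )


def rank_captions(captions, terms):
    """Return (figure_id, score) pairs with score > 0, best first."""
    scored = [(fig_id, caption_score(text, terms)) for fig_id, text in captions.items()]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: -s[1])


# Made-up captions for illustration.
captions = {
    "fig1": "Figure 1. Cyclic voltammograms of the LiFePO4 electrode at 0.1 mV/s.",
    "fig2": "Figure 2. SEM image of the cathode.",
    "fig3": "Figure 3. Cyclic voltammogram at a scan rate of 5 mV/s.",
}
ranked = rank_captions(captions, ["cyclic voltammogram", "scan rate"])
print(ranked)
```

Figures with a zero score drop out, which gives a first cut for image triage before any pixel-level checks.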