
Text Snowball

petermr edited this page Aug 20, 2021 · 7 revisions

Text snowball

Goal

  • To create a list of terms (words and phrases) that are effective at searching the literature for a desired topic (here cyclic voltammetry/batteries).
  • To snowball the precision (and repeat the query).
  • To download and section the papers.
  • To extract images.
  • To triage images (NYI).
  • To demo extraction of curves -> tables.

Why not arXiv?

Method

(We

  • set a relevancy metric (e.g. caption contains "cyclic voltammogram" or other precise terms; image pixel characteristics)
  • small, precise, rapid
  • query pygetpapers with a relevant query (AND, OR, NOT; maybe fairly general) and TERMS (OR'ed, e.g. cyclic voltammetry) NOT generator (AG to add dictionary -> TERMS)
  • download small corpus (<= 100 papers)
  • rapidly inspect these for frequent relevant terms
  • RAKE / YAKE (adjust phrase length) (SH report)
  • human eyeballs (dictionary editor - GUI started - PMR)
  • spaCy
  • triage the list (human/GUI) - proto-dictionary
  • create/append to list of terms
  • repeat until we get enough, or give up
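
One round of the loop above can be sketched in plain Python. This is a rough stand-in, not RAKE or YAKE: it ranks stopword-free n-grams by how many papers they occur in. The names `candidate_phrases`, `frequent_terms`, `snowball_round` and the `accept` callback (the human/GUI triage step) are hypothetical, and the stopword list is deliberately tiny.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real run needs a fuller one.
STOPWORDS = {"the", "of", "and", "a", "an", "in", "to", "for", "was", "were",
             "is", "are", "with", "by", "on", "at", "as", "that", "this", "from"}


def candidate_phrases(text, max_len=3):
    """Yield stopword-free word n-grams (1..max_len words) from one text."""
    words = re.findall(r"[a-z][a-z\-]+", text.lower())
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if not any(w in STOPWORDS for w in gram):
                yield " ".join(gram)


def frequent_terms(texts, max_len=3, min_docs=2):
    """Rank phrases by the number of documents they occur in."""
    doc_counts = Counter()
    for text in texts:
        # set(): count each phrase once per document
        doc_counts.update(set(candidate_phrases(text, max_len)))
    ranked = sorted(doc_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(phrase, n) for phrase, n in ranked if n >= min_docs]


def snowball_round(terms, texts, accept):
    """Append frequent phrases that the triage callback `accept` approves."""
    known = set(terms)
    new = [p for p, _ in frequent_terms(texts) if p not in known and accept(p)]
    return terms + new


# Made-up mini corpus; in practice these would be abstracts or sections.
corpus = [
    "The cyclic voltammogram of the cathode was recorded at 0.1 mV/s.",
    "Cyclic voltammetry of the lithium ion cathode shows two redox peaks.",
    "Redox peaks in the cyclic voltammogram indicate lithium intercalation.",
]
# Here "triage" simply keeps multi-word phrases; a human would decide instead.
terms = snowball_round(["cyclic voltammetry"], corpus, accept=lambda p: " " in p)
print(terms)
# ['cyclic voltammetry', 'cyclic voltammogram', 'redox peaks']
```

Adjusting `max_len` plays the same role as adjusting phrase length in RAKE / YAKE; repeating `snowball_round` on each newly downloaded corpus gives the snowball.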

Now we have a list of terms

  • term

  • create dictionary from terms (ami3 -> ?pyami (NYI-PMR))

  • section the paper (e.g. methods and figures) (PMR - ALL test)

  • retrieve images (ami3 and possibly pyami/pdfminer) and image captions (section) and display (PMR)

  • GUI / browser display

  • resurrect old examples of extracting curves from diagrams
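
Sectioning and caption retrieval can be illustrated without ami3 or pyami. The sketch below assumes the full text is JATS-style XML (`sec`, `title`, `fig`, `label`, `caption` elements) and flags captions that contain a precise term, which is one possible form of the relevancy metric. The function names, `PRECISE_TERMS` and the sample document are made up for illustration; this is not how pyami does it.

```python
import xml.etree.ElementTree as ET

# Precise terms used as a crude relevancy test (assumption).
PRECISE_TERMS = ("cyclic voltammogram", "cyclic voltammetry")


def element_text(elem):
    """Flatten an element's text content, collapsing whitespace."""
    return " ".join("".join(elem.itertext()).split())


def section_titles(root):
    """Return the titles of all <sec> elements in document order."""
    titles = []
    for sec in root.iter("sec"):
        title = sec.find("title")
        if title is not None:
            titles.append(element_text(title))
    return titles


def figure_captions(root):
    """Return (label, caption text, is_relevant) for each <fig>."""
    rows = []
    for fig in root.iter("fig"):
        label = fig.find("label")
        caption = fig.find("caption")
        text = element_text(caption) if caption is not None else ""
        relevant = any(term in text.lower() for term in PRECISE_TERMS)
        rows.append((element_text(label) if label is not None else "", text, relevant))
    return rows


# Made-up JATS-like fragment standing in for a downloaded paper.
SAMPLE = """<article><body>
<sec><title>Methods</title><p>Electrodes were cycled.</p>
<fig id="f1"><label>Figure 1</label>
<caption><p>Cyclic voltammogram of the cathode.</p></caption></fig>
</sec>
<sec><title>Results</title>
<fig id="f2"><label>Figure 2</label>
<caption><p>SEM image of the anode.</p></caption></fig>
</sec>
</body></article>"""

root = ET.fromstring(SAMPLE)
print(section_titles(root))
for label, text, relevant in figure_captions(root):
    print(label, "|", text, "|", relevant)
```

The fraction of captions flagged as relevant across a corpus gives a quick estimate of how precise the current query is.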

(emphasize generality - plants, terpenes, genes, climate - everything is reusable)

demo

github.com/petermr/battery_demo

(package all core dictionaries and projects) - either pyami or battery_demo
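
A dictionary of the kind packaged here could be generated from the triaged term list. The sketch below writes a simple XML dictionary with one `entry` per term. The `dictionary`/`entry` element names and the `title`, `term`, `name` attributes are an assumption modelled on ami-style dictionaries, and `terms_to_dictionary` is a hypothetical helper, not part of ami3 or pyami.

```python
import xml.etree.ElementTree as ET


def terms_to_dictionary(title, terms):
    """Build a <dictionary> element with one <entry> per unique term."""
    root = ET.Element("dictionary", {"title": title})
    seen = set()
    for term in terms:
        key = term.strip().lower()  # case-insensitive de-duplication
        if key and key not in seen:
            seen.add(key)
            ET.SubElement(root, "entry", {"term": key, "name": term.strip()})
    return root


dictionary = terms_to_dictionary(
    "voltammetry_dict",
    ["cyclic voltammetry", "Cyclic voltammetry", "cathode", "anode"],
)
xml_text = ET.tostring(dictionary, encoding="unicode")
print(xml_text)

# To save it: ET.ElementTree(dictionary).write("voltammetry_dict.xml", encoding="utf-8")
```

Because the terms are plain strings, the same helper would work unchanged for plants, terpenes or genes.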

Set up on Google Colab

Choose voltammetry

Use EPMC

pygetpapers -q "cyclic voltammetry" -n
INFO: Final query is cyclic voltammetry
INFO: Total number of hits for the query are 30078

refine

pygetpapers -q "(cyclic voltammetry) AND (lithium ion)" -n
INFO: Final query is (cyclic voltammetry) AND (lithium ion)
INFO: Total number of hits for the query are 2675

Now snowball

  • download 100 papers and examine for additional precision terms (OR)

battery cathode anode
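
The extra terms can also be folded into the query string by hand. A minimal sketch, assuming the OR'ed terms are ANDed onto the base query and that quoted phrases and AND/OR are accepted by the search (as in the EPMC queries above); `build_query` is a hypothetical helper, not a pygetpapers function.

```python
def build_query(base, extra_terms):
    """AND a base query with a group of OR'ed, double-quoted extra terms."""
    cleaned = [t.strip() for t in extra_terms if t.strip()]
    if not cleaned:
        return base
    ored = " OR ".join(f'"{t}"' for t in cleaned)
    return f"{base} AND ({ored})"


query = build_query("(cyclic voltammetry) AND (lithium ion)",
                    ["battery", "cathode", "anode"])
print(query)
# (cyclic voltammetry) AND (lithium ion) AND ("battery" OR "cathode" OR "anode")
```

The resulting string can be passed to `pygetpapers -q` with `-n` to check the hit count before downloading.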

pygetpapers -q "(cyclic voltammetry) AND (lithium ion)" --terms extra_terms.txt voltammetry_dict.xml -n

=>> docanalysis

py4ami -p ${battery_demo} --dict ${cyclic_v}

add sections and extract images and classify captions


# tools

(Open Notebook Science)

driven by real queries

## pygetpapers 

* Testing with CV queries (ALL) - results on wiki
* download PDFs
* supplemental data (determine "caption")

## pyami

* commandline-driven
* section (works but needs uploading)
* download sections for term extraction (SH and PMR) 
* search by caption OK, but link to image (NYI)
* extract images from PDF (ami)


## docanalysis

* ami output -> TERMS




