pubmedcentral_scraper.R

pubmedcentral_scraper.R is an R script that queries the PubMed Central database at the National Center for Biotechnology Information (NCBI) through the Entrez Programming Utilities interface (http://www.ncbi.nlm.nih.gov/books/NBK25499/) and retrieves the list of PubMed Central article IDs returned by the query. For each article ID, the scraper scans the caption of every figure in the online full-text article for matches to the user-specified search terms. Metadata for matching figures are stored in a SQLite database.

Also check out the blog post about the scraper at EFavDB.com.

Prerequisite to running pubmedcentral_scraper.R

The scraper requires a SQLite database that adheres to the schema specified in createSQLiteDatabase.R.

Order of steps in pubmedcentral_scraper.R

1. Define the search terms in pubmedcentral_scraper.R: one set defining the topic, another set defining the plot type to capture.
2. Retrieve PubMed Central IDs (pmcid) from a query to PubMed Central via eSearch.R.
3. Send each pmcid to scrapeArticle.R, which returns metadata for images matching the search terms.
4. If an article contains at least one matching image, also capture the article metadata (e.g., title, journal, year) via a call to eFetch.R.
5. Save the scraping results to the SQLite database.
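
The caption-matching step can be illustrated with a minimal sketch in base R. It is not the code in scrapeArticle.R: the helper names `contains_any` and `caption_matches`, the example terms, and the captions are all hypothetical, and the rule that a caption must match at least one term from each set is an assumption about how the two sets of search terms are combined.

```r
# Hypothetical helper: TRUE if the caption contains at least one of the terms
# (case-insensitive, literal substring matching).
contains_any <- function(caption, terms) {
  caption <- tolower(caption)
  any(vapply(tolower(terms),
             function(term) grepl(term, caption, fixed = TRUE),
             logical(1)))
}

# Assumption: a figure matches when its caption mentions at least one topic
# term AND at least one plot-type term.
caption_matches <- function(caption, topic_terms, plot_terms) {
  contains_any(caption, topic_terms) && contains_any(caption, plot_terms)
}

# Made-up search terms and captions, for illustration only.
topic_terms <- c("trastuzumab", "herceptin")
plot_terms  <- c("tumor growth", "tumor volume", "xenograft")

captions <- c(
  "Figure 1. Tumor volume in xenograft mice treated with trastuzumab.",
  "Figure 2. Western blot of HER2 expression.",
  "Figure 3. Kaplan-Meier survival curves for trastuzumab-treated patients."
)

matched <- vapply(captions, caption_matches, logical(1),
                  topic_terms = topic_terms, plot_terms = plot_terms,
                  USE.NAMES = FALSE)
print(captions[matched])
```

Only the first caption mentions both a topic term and a plot-type term, so it is the only one kept.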

To visualize results of scraping

markdown_and_plot.R queries the SQLite database where the scraping results are stored for the user-specified topic and plot type, then generates Rmarkdown code that is knitted to HTML displaying the matching plots with their captions and article information. For an example HTML report, go to the subdirectory example and download "scraper_TGI_plots_for_trastuzumab.html", or simply click on scraper_TGI_plots_for_trastuzumab.md on GitHub, which renders the same report. Note that the plots are displayed via links (hrefs) to their PubMed Central full-text articles and are not stored locally on the user's hard disk.
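
As a rough illustration of the report-generation idea, the sketch below turns a small table of figure records into markdown lines that link to remotely hosted images. It does not reproduce markdown_and_plot.R: the data frame, its column names (`title`, `img_url`, `caption`), the example.org URLs, and the helper `figure_to_markdown` are hypothetical stand-ins for whatever the real database query returns.

```r
# Hypothetical query result; the real column names come from the schema in
# createSQLiteDatabase.R, which is not shown in this README.
figures <- data.frame(
  title   = c("Example article A", "Example article B"),
  img_url = c("https://example.org/figA.jpg", "https://example.org/figB.jpg"),
  caption = c("Tumor volume over time.", "Tumor growth inhibition by dose."),
  stringsAsFactors = FALSE
)

# Build the markdown for one figure: a heading with the article title,
# an image link (the image itself is not downloaded), then the caption.
figure_to_markdown <- function(title, img_url, caption) {
  c(paste0("### ", title), "",
    paste0("![](", img_url, ")"), "",
    caption, "")
}

# Apply the helper row by row and flatten the result into one vector of lines.
report_lines <- unlist(Map(figure_to_markdown,
                           figures$title, figures$img_url, figures$caption),
                       use.names = FALSE)
writeLines(report_lines)
```

The resulting lines could be written to an .Rmd file with `writeLines()` and knitted with `rmarkdown::render()`; because the images are referenced by URL, nothing is saved to disk.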

Generic functions for accessing Entrez utilities

eFetch.R and eSearch.R are not limited to searches on PubMed Central (e.g., PubMed is equally valid). However, pubmedcentral_scraper.R only works with IDs returned by a query to the PubMed Central database, since those articles are guaranteed to have full-text online versions.
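
To show why the same function can serve more than one Entrez database, here is a minimal sketch of building an ESearch request URL in which the database is just a parameter. The function `build_esearch_url` and its arguments are hypothetical and are not taken from eSearch.R; the endpoint and the `db`, `term`, and `retmax` query parameters are those documented for the E-utilities ESearch service.

```r
# Hypothetical helper: build an ESearch request URL for a given Entrez database.
# db = "pmc" targets PubMed Central; db = "pubmed" targets PubMed.
build_esearch_url <- function(term, db = "pmc", retmax = 20) {
  base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
  paste0(base,
         "?db=", db,
         "&term=", URLencode(term, reserved = TRUE),  # percent-encode the query
         "&retmax=", retmax)                          # maximum number of IDs returned
}

pmc_url    <- build_esearch_url("trastuzumab xenograft")
pubmed_url <- build_esearch_url("trastuzumab xenograft", db = "pubmed")
print(pmc_url)
print(pubmed_url)
```

The sketch only constructs the URLs and sends no request. Only IDs returned with `db = "pmc"` would be suitable input for the scraper.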

mikepsinn/PubmedCentral_Scraper
