An R package to scrape meta-data from scientific articles hosted on Scielo
Latest commit 0f34250 May 15, 2017
Type Name Latest commit message Commit time
R Update get_xml_article Oct 4, 2016
man Update get_xml_article Oct 4, 2016
.Rbuildignore New commit Aug 18, 2016
.gitattributes 🎉 Added .gitattributes & .gitignore files Aug 18, 2016
.gitignore New commit Aug 18, 2016
.travis.yml New commit Aug 18, 2016
DESCRIPTION Update get_xml_article Oct 4, 2016
NAMESPACE New commit Aug 18, 2016
README.Rmd Update README Aug 18, 2016
README.md Update README.md May 14, 2017
appveyor.yml Add travis and appveyor Aug 18, 2016
cran-comments.md New commit Aug 18, 2016
rScielo.Rproj First commit Aug 18, 2016

README.md

rScielo

rScielo provides a set of functions to scrape meta-data from scientific articles hosted on the Scientific Electronic Library Online Platform (Scielo.br). The meta-data includes authors' names, article titles, and year of publication, among other fields. The package also provides additional functions to summarize the scraped data.

How does it work?

Getting a journal's ID

The rScielo package scrapes data based on a journal ID (or pid). For example, consider the link to the Brazilian Political Science Review homepage on Scielo:

http://www.scielo.br/scielo.php?script=sci_serial&pid=1981-3821&lng=en&nrm=iso

The ID is located between &pid= and &lng (i.e., 1981-3821). Most rScielo functions depend on this argument. To extract the ID from a journal's URL automatically, use the get_id_journal() function:

get_id_journal("http://www.scielo.br/scielo.php?script=sci_serial&pid=1981-3821&lng=en&nrm=iso")
#> [1] "1981-3821"
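To make the "between &pid= and &lng" rule concrete, here is a minimal base R sketch of the same idea. The helper name extract_pid() is hypothetical and is not part of rScielo; it is not how get_id_journal() is necessarily implemented, only an illustration of pulling the pid query parameter out of a URL.

```r
# Hypothetical helper (not part of rScielo): extract the `pid` query
# parameter from one or more Scielo URLs using base R regular expressions.
extract_pid <- function(url) {
  # Capture everything after "pid=" up to the next "&" (or end of string)
  m <- regmatches(url, regexec("[?&]pid=([^&]+)", url))
  # Non-matching URLs yield character(0); map those to NA
  vapply(m, function(x) if (length(x) >= 2) x[2] else NA_character_, character(1))
}

extract_pid("http://www.scielo.br/scielo.php?script=sci_serial&pid=1981-3821&lng=en&nrm=iso")
#> [1] "1981-3821"
```

The function is vectorized over URLs and returns NA for any URL that has no pid parameter.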

Scraping data

To scrape meta-data from all articles of a journal hosted on Scielo, use the get_journal() function:

df <- get_journal("1981-3821")

Then summarize the scraped data with summary():

summary(df)
#> 
#> ### JOURNAL SUMMARY: Brazilian Political Science Review (2012 - 2016)
#> 
#> 
#>  Total number of articles:  98 
#>  Total number of articles (reviews excluded):  67
#> 
#>  Mean number of authors per article:  1.61 
#>  Mean number of pages per article:  29.38
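The figures in this summary are simple aggregates over the scraped articles. The sketch below reproduces the same kind of calculation on a toy data frame. The column names (n_authors, n_pages, is_review) and the journal_stats() helper are assumptions made for illustration; the data frame returned by get_journal() may be structured differently.

```r
# Toy data standing in for scraped articles.
# Column names are hypothetical, not the actual get_journal() output.
articles <- data.frame(
  n_authors = c(1, 2, 3, 1),
  n_pages   = c(20, 35, 28, 5),
  is_review = c(FALSE, FALSE, FALSE, TRUE)
)

# Hypothetical helper: compute the aggregates a journal summary reports
journal_stats <- function(d) {
  list(
    total            = nrow(d),              # all articles
    total_no_reviews = sum(!d$is_review),    # reviews excluded
    mean_authors     = round(mean(d$n_authors), 2),
    mean_pages       = round(mean(d$n_pages), 2)
  )
}

stats <- journal_stats(articles)
str(stats)
```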

The rScielo package also provides a function to scrape meta-data from a single article:

# The article's URL on Scielo
url <- "http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1981-38212016000200201&lng=en&nrm=iso&tlng=en"

# Scrape the data
article <- get_article(url)
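The article URL above also carries a pid, and in this example it appears to embed the journal ID and the publication year (S, then 1981-3821, then 2016). The sketch below relies on that reading of the identifier, which is an assumption inferred from the example URL rather than something rScielo documents. The helper parse_article_pid() is hypothetical.

```r
# Hypothetical helper (not part of rScielo). Assumes an article pid of the
# form "S" + journal ID (NNNN-NNNN) + 4-digit year + further digits.
parse_article_pid <- function(url) {
  pattern <- "[?&]pid=S([0-9]{4}-[0-9]{3}[0-9Xx])([0-9]{4})"
  m <- regmatches(url, regexec(pattern, url))[[1]]
  # m holds the full match plus two capture groups when the pattern matches
  if (length(m) < 3) return(NULL)
  list(journal_id = m[2], year = as.integer(m[3]))
}

article_url <- "http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1981-38212016000200201&lng=en&nrm=iso&tlng=en"
parsed <- parse_article_pid(article_url)
parsed$journal_id
#> [1] "1981-3821"
parsed$year
#> [1] 2016
```

This can be handy for passing the journal ID of a known article on to get_journal() or get_journal_info().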

Finally, get_journal_info() and get_journal_list() scrape a journal's meta-information (publisher, ISSN, and mission) and a list of all journals hosted on Scielo, respectively:

# Get a journal's meta-information
meta_info <- get_journal_info("1981-3821")

# Get a list with all journals' names, URLs and IDs
journals <- get_journal_list()
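Going the other way, a journal ID can be turned back into a homepage URL by filling in the pid parameter of the URL pattern shown in "Getting a journal's ID". The helper journal_url() below is hypothetical and assumes that every journal homepage follows that same pattern.

```r
# Hypothetical helper (not part of rScielo): build a journal homepage URL
# from its ID, assuming the URL pattern used in the example above.
journal_url <- function(id) {
  sprintf(
    "http://www.scielo.br/scielo.php?script=sci_serial&pid=%s&lng=en&nrm=iso",
    id
  )
}

journal_url("1981-3821")
#> [1] "http://www.scielo.br/scielo.php?script=sci_serial&pid=1981-3821&lng=en&nrm=iso"
```

Because sprintf() is vectorized, the same call works on a whole vector of IDs.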

Scraping metrics

With rScielo, you can also scrape several publication and citation metrics for a journal hosted on Scielo:

# Gets citation metrics
cit <- get_journal_metrics("1981-3821")

# Plots the data for a quick visualization
plot(cit)

Functions

Here is a description of the rScielo functions:

  • get_id_journal(): Gets a journal's ID from its URL.
  • get_journal(): Gets meta-data from all articles published by a journal.
  • get_article(): Gets meta-data from a single article.
  • get_journal_info(): Gets a journal's description.
  • get_journal_list(): Gets a list with all journals' names, URLs and IDs.
  • get_journal_metrics(): Gets publication and citation metrics of a journal.

Installation

Install the latest stable release from CRAN via:

install.packages("rScielo")

Alternatively, install the latest pre-release version from GitHub via:

# Install devtools first if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("meirelesff/rScielo")

Author

Fernando Meireles

License

GPL (>= 2)