Authors: Fernando Meireles, Denisson Silva, and Rogerio Barbosa
rscielo offers functions to easily scrape bibliometric information from scientific journals and articles hosted on the Scientific Electronic Library Online Platform (Scielo.org). The retrieved data includes a journal’s details and citation counts; an article’s contents, footnotes, and bibliographic references; and other information commonly used in bibliometric studies. The package also provides functions to quickly summarize the scraped data.
To install the latest stable release of rscielo from CRAN, use:
install.packages("rscielo")
Alternatively, one may install the latest pre-release version from GitHub via:
if(!require("remotes")) install.packages("remotes")
remotes::install_github("meirelesff/rscielo")
At its core, rscielo is a scraper that offers a transparent and reproducible approach to gather data from the Scientific Electronic Library Online Platform (Scielo.br), one of the largest open repositories for scientific publications in the world. In particular, the package provides functions to automatically extract and parse different types of information from (1) scientific journals (functions with _journal or _journal_ in their names) and (2) articles (functions with _article or _article_ in their names).
To get data from a particular journal, such as citation counts and ISSN, rscielo relies on an ID (or pid) that uniquely identifies each journal within the Scielo repository. As an example, this is the URL of the Brazilian Political Science Review homepage on Scielo:
http://www.scielo.br/scielo.php?script=sci_serial&pid=1981-3821&lng=en&nrm=iso
The journal ID can be found between &pid= and &lng (i.e., 1981-3821). Most of rscielo’s functions that retrieve data from journals rely on this information to work. To automatically extract an ID from the URL of a journal, one may use the get_journal_id() function:
get_journal_id("http://www.scielo.br/scielo.php?script=sci_serial&pid=1981-3821&lng=en&nrm=iso")
#> [1] "1981-3821"
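To make the structure of these URLs concrete, the following is a minimal base R sketch of pulling the pid value out of a URL with a regular expression. The helper name extract_pid is hypothetical and is not part of rscielo; this is not necessarily how get_journal_id() is implemented internally.

```r
# Hypothetical helper (not part of rscielo): pull the value of the `pid`
# query parameter out of a Scielo URL using base R only.
extract_pid <- function(url) {
  # Match "pid=" preceded by "?" or "&", then capture everything up to the next "&"
  m <- regmatches(url, regexec("[?&]pid=([^&]+)", url))[[1]]
  if (length(m) < 2) {
    stop("No 'pid' parameter found in URL: ", url)
  }
  m[2]
}

journal_url <- "http://www.scielo.br/scielo.php?script=sci_serial&pid=1981-3821&lng=en&nrm=iso"
extract_pid(journal_url)
#> [1] "1981-3821"
```

In practice, get_journal_id() should be preferred; the sketch only illustrates where the ID sits in the URL.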
With a journal ID in hand, use the get_journal() function to scrape meta-data from all articles published in its last issue:
df <- get_journal("1981-3821")
This code returns a tibble in which the observations correspond to the articles that appeared in the selected journal’s latest issue. Among the returned variables are authors’ names, institutional affiliations, and home countries; articles’ abstracts, keywords, and the number of pages (run help(get_journal) for a full description of the retrieved data).
For a quick glimpse at the scraped data, one may use the summary method:
summary(df)
#>
#> ### JOURNAL: Brazilian Political Science Review
#>
#>
#> Total number of articles: 1
#> Total number of articles (reviews excluded): 1
#>
#> Mean number of authors per article: 5
#> Mean number of pages per article: Not available
get_journal() also extracts data from all articles ever published by a journal. To do that, set the argument last_issue to FALSE:
get_journal("1981-3821", last_issue = FALSE)
rscielo contains functions to scrape and report publication and citation counts of a journal:
# Gets citation metrics
cit <- get_journal_metrics("1981-3821")
# Plots the data for a quick visualization
plot(cit)
get_journal_info() and get_journal_list() scrape a journal’s meta-information (publisher, ISSN, and mission) and a list of all journals hosted on Scielo, respectively:
# Get a journal's meta-information
meta_info <- get_journal_info("1981-3821")
# Get a list with all journals names, URLs and IDs
journals <- get_journal_list()
Scientific articles stored on Scielo are also identified by a unique ID, which is formed by combining their Digital Object Identifier (DOI) with other characters. These IDs can be seen in each article’s URL (after &pid= and before &lng=):
# URL of an article
url_article <- "http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1981-38212016000200201&lng=en&nrm=iso&tlng=en"
By design, rscielo handles full articles’ URLs as inputs, but users may obtain the IDs by using the get_article_id() function:
get_article_id(url_article)
#> [1] "S1981-38212016000200201"
To scrape the content of a single scientific article, rscielo provides the get_article() function:
# Scrape the full text
article <- get_article(url_article)
As can be seen, the function returns the full text of the requested article as a character vector. Users may also pass the article’s ID to the function to achieve the same results:
article <- get_article("S1981-38212016000200201")
Or set the argument output_text to FALSE to get a tibble with the article’s DOI (which might be useful in bibliometric analysis):
article <- get_article("S1981-38212016000200201", output_text = FALSE)
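Because get_article() accepts either a full URL or a bare ID, it can help to see how such dual input might be normalized. The sketch below is an assumption for illustration only: normalize_article_input is a hypothetical helper, not an rscielo function, and the package may handle inputs differently.

```r
# Hypothetical helper (not part of rscielo): return an article ID whether
# the input is a full Scielo URL or already a bare ID.
normalize_article_input <- function(x) {
  if (grepl("^https?://", x)) {
    # Input looks like a URL: capture the value of the pid query parameter
    m <- regmatches(x, regexec("[?&]pid=([^&]+)", x))[[1]]
    if (length(m) < 2) stop("URL has no 'pid' parameter: ", x)
    return(m[2])
  }
  # Otherwise assume the input is already an ID and return it unchanged
  x
}

from_url <- normalize_article_input(
  "http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1981-38212016000200201&lng=en&nrm=iso&tlng=en"
)
from_id <- normalize_article_input("S1981-38212016000200201")
identical(from_url, from_id)
#> [1] TRUE
```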
Similar to the get_journal() function, get_article_meta() returns meta-data of a selected article hosted on Scielo:
url <- "http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1981-38212016000200201&lng=en&nrm=iso&tlng=en"
article_meta <- get_article_meta(url)
To retrieve a list of bibliographic items cited by an article, use get_article_references():
article_references <- get_article_references(url)
The function outputs a tibble in which every bibliographic item corresponds to an observation. get_article_footnotes() returns a similar object, but with footnotes in the rows:
article_foots <- get_article_footnotes(url)
For convenience, here is a description of the rscielo functions.
Functions to extract data from journals:
get_journal_id(): Get a journal’s ID from its URL.
get_journal(): Get meta-data of all articles published by a journal.
get_journal_info(): Get a journal’s description.
get_journal_list(): Get a list with all journals’ names, URLs and IDs.
get_journal_metrics(): Get publication and citation counts of a journal.
Functions to extract data from articles:
get_article_id(): Get an article’s ID from its URL.
get_article(): Get the full text of a single article.
get_article_meta(): Get meta-data of a single article.
get_article_references(): Get the list of bibliographic references cited by a single article.
get_article_footnotes(): Get the list of the footnotes of a single article.
Methods:
summary.Scielo(): Summarize the data of a tibble returned by get_journal().
plot.scielo_metrics(): Plot citation counts of a journal retrieved by get_journal_metrics().
rscielo’s functions extract data directly from the Scielo online repository. Even so, users may sometimes encounter errors or obtain incomplete information, mainly when using the _article functions to scrape articles’ full contents. This happens when journals feed invalid or wrongly formatted information into the Scielo platform. In most situations, a bit of data cleaning solves the issues, but users must be aware that the retrieved data might still be incomplete.
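When scraping many articles in one run, a single malformed record can otherwise abort the whole loop. The following is a minimal sketch of one way to guard against that with base R's tryCatch(). Both scrape_safely and fake_scraper are hypothetical names introduced here for illustration; the stand-in scraper only simulates a failure so the example runs without network access.

```r
# Hypothetical wrapper (not part of rscielo): apply a scraping function to
# several inputs, returning NULL for any input that raises an error.
scrape_safely <- function(ids, scraper) {
  results <- lapply(ids, function(id) {
    tryCatch(
      scraper(id),
      error = function(e) {
        # Report the failure and carry on with the remaining inputs
        message("Failed for ", id, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  names(results) <- ids
  results
}

# Stand-in scraper for illustration only; with rscielo one would pass
# a function such as get_article instead.
fake_scraper <- function(id) {
  if (!startsWith(id, "S")) stop("malformed article ID")
  paste("text of", id)
}

out <- scrape_safely(c("S1981-38212016000200201", "bad-id"), fake_scraper)

# Names of the inputs that could not be scraped
failed <- names(out)[vapply(out, is.null, logical(1))]
failed
#> [1] "bad-id"
```

Collecting the failed inputs separately makes it easy to retry them later or to inspect the corresponding pages by hand.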
To cite rscielo in publications, use:
citation("rscielo")
#>
#> To cite package 'rscielo' in publications use:
#>
#> Fernando Meireles, Denisson Silva and Rogerio Barbosa (2019).
#> rscielo: A Scraper for Scientific Journals Hosted on Scielo. R
#> package version 1.0.0.
#> https://CRAN.R-project.org/package=rscielo
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Manual{,
#> title = {rscielo: A Scraper for Scientific Journals Hosted on Scielo},
#> author = {Fernando Meireles and Denisson Silva and Rogerio Barbosa},
#> year = {2019},
#> note = {R package version 1.0.0},
#> url = {https://CRAN.R-project.org/package=rscielo},
#> }
We welcome comments or suggestions to improve the package. Feel free to open an issue at our GitHub repository.