The goal of scrapol is to facilitate academic study of American city council representatives whose information is publicly available on Ballotpedia. This package enables users to gather politicians’ party and district information based solely on a url. Users can either use ScrapingParty() and ScrapingDistrict() to collect party and district information respectively, or use the scrapol_city() function to gather both pieces of information simultaneously.
This package was tailored for American city council members, although the ScrapingParty() function can be used to scrape party information more generally.
The package might be updated in the future with two aims.
- To support data collection for state- and federal-level politicians.
- To improve parsing accuracy. Based on a test of 1,041 city council members, the present version of the package achieves accuracy rates of 91.65% (954/1,041) for party information and 95.49% (994/1,041) for district information. The code will likely be revised to push these rates higher.
Users interested in using this package for commercial purposes should be mindful of their legal responsibilities. According to Ballotpedia,
The use of automated systems or software to extract data from this website for commercial purposes (‘screen scraping’) is prohibited unless the third party has directly concluded a written license agreement with Ballotpedia.
In addition, there are some limitations on scraping. Per Ballotpedia’s robots file, crawlers are allowed only between 06:00 and 08:45 UTC (GMT), at a maximum rate of one page every 5 seconds. Users should set a sleep time of ten seconds or more to reduce the load on the website.
You can install the development version of scrapol
from GitHub with:
install.packages('devtools', repos='http://cran.us.r-project.org') # if not already installed
devtools::install_github("huizhou68/scrapol")
The use of scrapol
depends on the following packages:
- xml2: to request a webpage
- rvest: to parse a webpage
- stringr: to handle text
- magrittr: to utilize the pipe operator
Please make sure to install them before using the scrapol
package.
This example shows how to collect party and district information with the scrapol_city() function, which is built on a for loop that iterates over a vector of urls.
library(scrapol)
urls <- c("https://ballotpedia.org/Maryanne_Jordan",
"https://ballotpedia.org/Jeffrey_Lalloway",
"https://ballotpedia.org/Jim_Duggan")
scrapol_city(url=urls, sleep=15)
#> Progress: 3 in total, 2 remaining.
#> Progress: 3 in total, 1 remaining.
#> Progress: 3 in total, 0 remaining.
#> It's done!
#> url partisanship district
#> 1 https://ballotpedia.org/Maryanne_Jordan Democratic Party Boise City Council
#> 2 https://ballotpedia.org/Jeffrey_Lalloway <NA> Irvine City Council
#> 3 https://ballotpedia.org/Jim_Duggan Nonpartisan Plano City Council
Alternatively, you may use the ScrapingParty() and ScrapingDistrict() functions to collect party and district information respectively, but these functions accept only one url at a time; you will need to write your own loop to handle multiple urls.
library(xml2)
ScrapingParty(read_html("https://ballotpedia.org/Maryanne_Jordan"))
#> [1] "Democratic Party"
ScrapingDistrict(read_html("https://ballotpedia.org/Jim_Duggan"))
#> [1] "Plano City Council"
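If you do need to process several urls with these single-url functions, a loop along the following lines should work. This is a minimal sketch: the url vector is illustrative, and the Sys.sleep(10) call is added here to respect the rate limit described above rather than being required by the functions themselves.

```r
library(xml2)
library(scrapol)

urls <- c("https://ballotpedia.org/Maryanne_Jordan",
          "https://ballotpedia.org/Jim_Duggan")

parties <- character(length(urls))
for (i in seq_along(urls)) {
  # Request and parse each profile page, then extract the party label
  parties[i] <- ScrapingParty(read_html(urls[i]))
  Sys.sleep(10)  # pause between requests to reduce load on Ballotpedia
}
parties
```

The same pattern works for ScrapingDistrict(); only the function call inside the loop changes.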
If you encounter any bugs or have any questions about this package, please email me at hzhou13(at)uh.edu. Thanks!