Code for webscraping rugby player statistics.

- Dataset is `1.main_data.csv`. It contains the names, playing statistics and birthplaces for every person to have played test match rugby for the major nations.
- `1.webscrape` directory contains the Python code used to scrape the player names, playing statistics and some birthplaces from the web. It also contains the Python data objects sorted according to country as well as the compiled scraped data to be cleaned.
  - `1.Scrape.py` script scrapes the raw player data from the ESPN rugby website.
  - `2.Find_Country.py` script uses the GeoPy package to identify the country in which players were born (the initial webscrape only yields the city or region).
  - `3.Gather.py` script compiles the resulting data from the above two programs to be cleaned by the R scripts.
- `2.clean` directory contains the R code used to clean and process the raw data from the webscrape, as well as incorporate manual adjustments to the data from my own research and the New Zealand Herald data.
  - `4.clean_up.R` script cleans and performs the adjustments to the raw data in order to produce the final dataset.
- `3.update` directory contains code used to update the data by only scraping the most recent player data.
  - `1.newscrape.py` script scrapes raw player data for most recent players.
  - `2.merge.R` merges the newly scraped data with the original data.
- `Notes` directory contains the code used to create the blog post at http://hautahi.com/rugbywanderers
  - `blog_code.R` analyzes the scraped dataset.
  - `BlogArticle.Rmd` is the markdown script used to write the blog post.
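
The country lookup described for `2.Find_Country.py` could look roughly like the sketch below. This is not the repository's actual code: the function names `make_nominatim_lookup` and `country_of`, the `user_agent` string and the choice of GeoPy's Nominatim geocoder are all assumptions made for illustration. Only the use of GeoPy to turn a city or region into a country comes from the description above.

```python
from typing import Callable, Dict, Optional


def make_nominatim_lookup(user_agent: str = "rugby-birthplace-example") -> Callable[[str], Optional[dict]]:
    """Build a place -> address lookup backed by GeoPy's Nominatim geocoder.

    Nominatim is an assumed choice; the original script may use another geocoder.
    """
    # Imported lazily so the rest of the sketch runs without GeoPy installed.
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)

    def lookup(place: str) -> Optional[dict]:
        # addressdetails=True asks Nominatim to include a structured address.
        location = geolocator.geocode(place, addressdetails=True, language="en")
        return None if location is None else location.raw.get("address", {})

    return lookup


def country_of(place: str,
               lookup: Callable[[str], Optional[dict]],
               cache: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the country for a city or region, or None if it cannot be resolved.

    Results are cached so that each distinct birthplace is geocoded only once.
    """
    place = place.strip()
    if not place:
        return None
    if place not in cache:
        address = lookup(place)
        cache[place] = address.get("country") if address else None
    return cache[place]
```

With GeoPy installed, a call would look like `country_of("Apia", make_nominatim_lookup(), {})`. The public Nominatim service asks clients to keep to about one request per second, so a real run over every player would need throttling, for example with GeoPy's `RateLimiter` helper.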

To my knowledge, the player names and playing statistics are accurate, but the birthplace information remains a work in progress. Birthplace data is especially sparse for Canada, the USA and the Pacific Island countries. I welcome any contributions and corrections. Feel free to create a pull request or to email me (hautahikingi@gmail.com) with any help you can provide.
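
For anyone wanting to help, a quick way to see where birthplace data is missing is to compute the share of players without a birthplace for each team. The sketch below is only an illustration: the column names `country` and `birth_country` are hypothetical and may not match the headers in the actual CSV, and the sample rows are made up.

```python
import csv
import io
from collections import Counter


def missing_birthplace_share(rows, team_col="country", birth_col="birth_country"):
    """Return {team: fraction of players with an empty birthplace field}.

    The column names are assumptions; adjust them to the real CSV headers.
    """
    total = Counter()
    missing = Counter()
    for row in rows:
        team = row[team_col]
        total[team] += 1
        # Treat an absent, empty or whitespace-only value as missing.
        if not (row.get(birth_col) or "").strip():
            missing[team] += 1
    return {team: missing[team] / total[team] for team in total}


# Tiny made-up sample standing in for the real dataset.
sample = io.StringIO(
    "player,country,birth_country\n"
    "A,Canada,\n"
    "B,Canada,Canada\n"
    "C,New Zealand,Samoa\n"
)
shares = missing_birthplace_share(csv.DictReader(sample))

# List teams from most to least incomplete.
for team, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{team}: {share:.0%} missing")
```

To run it on the real file, replace `sample` with an open file handle, for example `open("1.main_data.csv", newline="", encoding="utf-8")`.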