Explore and learn from the activity patterns of some of the most liked open source projects on GitHub.
This project collects and stores data on the top 1% most-starred GitHub repositories, including their general details, languages, stargazers, contributor statistics, number of commits per hour each day, issues, issue events and issue comments.
The data points are first scraped into separate csv files, then concatenated and imported into MySQL for easier aggregation and analysis. A Shiny Dashboard was then built on this database to explore activity timelines of single repositories and patterns of different repository groups.
The code consists of four major parts:
- database: schemas of the database and toolkits to import local csv files.
- scraper: scrapes the data from the GitHub API. Supports parallel scraping with the future package.
- shiny: the shiny app.
- report: data aggregations and preliminary data analysis reports.
Generate the seed of top repositories
The data/ directory contains a list of top repositories (data/available_repos.csv) that we generated using the GHTorrent snapshot data from April 1, 2017.
To repeat this seeding process with newer data:
- Download from GHTorrent.org the latest MySQL database dumps.
- Restore the watchers tables to a local database.
- Run "database/seed.sql" on the database you restored to generate the list of popular projects.
- Export the generated popular_projects table to a csv file and save it as
Of course, you can use repository lists from other sources. Just make sure you put them in a csv file that has a repo column.
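The requirement above can be checked before a scraping run. The following is a minimal sketch in base R, not part of the project: `read_repo_list` is a hypothetical helper, and the `owner/name` values and the extra `stars` column are made up for illustration.

```r
# Hypothetical helper (not part of the project): load a seed file and
# verify that it has the `repo` column the scraper expects.
read_repo_list <- function(path) {
  repos <- read.csv(path, stringsAsFactors = FALSE)
  if (!"repo" %in% names(repos)) {
    stop("Expected a `repo` column in ", path)
  }
  # Drop blanks and duplicates so each repository appears once.
  repo_names <- trimws(repos$repo)
  unique(repo_names[!is.na(repo_names) & nzchar(repo_names)])
}

# Example with a throwaway file; the rows below are illustrative only.
seed_file <- tempfile(fileext = ".csv")
writeLines(c("repo,stars",
             "octocat/Hello-World,1500",
             "example/demo,42",
             "octocat/Hello-World,1500"), seed_file)
repo_list <- read_repo_list(seed_file)
print(repo_list)
```

Failing early on a missing column is cheaper than discovering the problem partway through a long scrape.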
Set environment variables
This application uses environment variables to connect to MySQL and to set tokens for GitHub. Make sure you have these variables set in your .Renviron, which can be put in either your home directory or the working directory of R.
    GITHUB_TOKENS=...
    GITHUB_DATA_DIR="./scrape-log"
    R_ENV=production
    MYSQL_DBNAME=ghtorrent_restore
    MYSQL_HOST=127.0.0.1
    MYSQL_PORT=3306
    MYSQL_USER=ghtorrentuser
    MYSQL_PASSWD=ghtorrentpassword
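To confirm that R actually picked up these variables, a quick check like the one below may help. It is a sketch in base R under the assumption that every variable listed above should be non-empty; `missing_env_vars` is a hypothetical helper, not a function from this project.

```r
# Variable names are copied from the .Renviron example above.
required_vars <- c("GITHUB_TOKENS", "GITHUB_DATA_DIR", "R_ENV",
                   "MYSQL_DBNAME", "MYSQL_HOST", "MYSQL_PORT",
                   "MYSQL_USER", "MYSQL_PASSWD")

# Hypothetical helper: return the names of variables that are unset or empty.
missing_env_vars <- function(vars = required_vars) {
  values <- unname(Sys.getenv(vars, unset = ""))
  vars[!nzchar(values)]
}

missing_vars <- missing_env_vars()
if (length(missing_vars) > 0) {
  message("Missing in .Renviron: ", paste(missing_vars, collapse = ", "))
}
```

Note that R reads .Renviron at startup, so the session has to be restarted after editing the file.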
Please make sure these packages are successfully installed before you run the scraper.
    install.packages(c("tidyverse", "dplyr", "lubridate", "future", "shiny", "shinyjs", "shinydashboard"))
    install.packages("devtools")
    devtools::install_github("r-pkgs/gh")
    devtools::install_github("rstats-db/DBI")
    devtools::install_github("ktmud/RMySQL@upsert")
    devtools::install_github("hadley/ggplot2")
    devtools::install_github("ropensci/plotly")
    devtools::install_github("kirillseva/cacher")
Depending on your machine, compiling some of the packages may need additional libraries. If you are using a fresh Ubuntu 17.04 server, you might want to do:
    sudo apt update
    sudo apt upgrade
    sudo apt install mysql-server r-base
    sudo apt install libmysqlclient-dev libmariadb-client-lgpl-dev
    sudo apt install libxml2-dev libssl-dev libcurl4-openssl-dev
Feel free to install RStudio Server and Shiny Server, too:
    sudo apt install gdebi-core
    wget https://download2.rstudio.org/rstudio-server-1.0.143-amd64.deb
    wget https://download3.rstudio.org/ubuntu-12.04/x86_64/shiny-server-1.5.3.838-amd64.deb
    sudo su - \
      -c "R -e \"install.packages('shiny', repos='https://cran.rstudio.com/')\""
    sudo gdebi rstudio-server-1.0.143-amd64.deb
    sudo gdebi shiny-server-1.5.3.838-amd64.deb
- Start a MySQL service, then create a database and the tables using database/schema.sql. DON'T add any indexes yet.
- Make sure all required packages are successfully installed.
- If you want parallel scraping, run scrape_cluser.R; otherwise dive into scrape.R and run the appropriate functions as needed.
- After the scraping is done, you can add MySQL indexes using database/indexes.sql. There are other scripts in the database folder as well, but they are for when you want to scrape all data into local csv files before importing them into the database (which is not necessary if you have the MySQL service up and running from the start).
- Please make sure max_allowed_packet for both the MySQL client and server is set to an appropriately high value, otherwise writing issues batches might fail. We are already splitting the scraped results into chunks.
- GitHub returns at most 60,000 items per query, so you might miss some data for very large projects, especially for the issue events and issue comments data.
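The chunking mentioned in the notes above can be pictured roughly as follows. This is a sketch in base R and not the project's actual implementation: `split_into_chunks`, the chunk size of 1000 rows, and the sample `issues` data frame are all assumptions made for illustration.

```r
# Hypothetical helper: split a data frame into pieces of at most
# `chunk_size` rows, so each piece can be written in its own statement.
split_into_chunks <- function(df, chunk_size = 1000) {
  if (nrow(df) == 0) return(list())
  # Rows 1..chunk_size get id 1, the next chunk_size rows get id 2, etc.
  chunk_id <- ceiling(seq_len(nrow(df)) / chunk_size)
  unname(split(df, chunk_id))
}

# Illustrative data only; real scraped issues would have many more columns.
issues <- data.frame(id = 1:2500, title = paste("issue", 1:2500),
                     stringsAsFactors = FALSE)
chunks <- split_into_chunks(issues, chunk_size = 1000)
print(sapply(chunks, nrow))  # chunk sizes: 1000, 1000, 500
```

Smaller chunks keep each write well below max_allowed_packet, at the cost of more round trips to the server. Rows with very long text fields, such as issue bodies, may still call for a smaller chunk size or a higher packet limit.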
scrape-log stores when the last item you scraped was updated, and GitHub allows us to specify a since parameter to fetch only more recent data points. So you can rerun the scraper at any time to collect data generated after your last scraping attempt. Already scraped data will NOT be scraped again (unless you specifically set
- Make this app a Docker container
- Split the scraper and shiny app
- More detailed analysis
- User relationship data