Visualizing NHL Time on Ice Data
This is a project to visualize shared time on ice data. A visualization of this data is available here.
When I first started watching hockey, I was consistently astounded by line changes. Even today, as an avid hockey fan, I am always eager to learn more about which defensive pairings work best, how different lines' chemistry works, and most of all, who is actually playing together on a team.
This project seeks to answer the question of who shares time on ice with whom. It also gave me a way to learn about graph databases and about server and network setup.
Scraping NHL API Data
The NHL API exposes many pages with information on teams, games and players. For this project, I use this teams page to get every team's ID. Given those IDs, I scrape each team's schedule. I pull the game IDs from each schedule and use them to fetch the shift data for every game. For more information about the positions of individual players, I scrape each player's page.
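The order of those requests can be sketched as below. This is only an outline of the flow: the three fetchers are hypothetical lambdas passed in by the caller, and none of their names or return shapes come from the NHL API or from scrape_from_nhl_api.rb. The one detail worth noting is that every game shows up in two teams' schedules, so game IDs are de-duplicated before shift data is fetched.

```ruby
# Walk teams -> schedules -> games -> shifts.
# All three fetchers are hypothetical stand-ins for real HTTP calls.
def collect_shifts(fetch_team_ids:, fetch_game_ids:, fetch_shifts:)
  game_ids = fetch_team_ids.call
                           .flat_map { |team_id| fetch_game_ids.call(team_id) }
                           .uniq # each game appears in both teams' schedules
  game_ids.to_h { |game_id| [game_id, fetch_shifts.call(game_id)] }
end

# Stubbed data instead of network access: game 100 is shared by teams 1 and 2.
schedules = { 1 => [100, 101], 2 => [100, 102] }

shifts_by_game = collect_shifts(
  fetch_team_ids: -> { schedules.keys },
  fetch_game_ids: ->(team_id) { schedules.fetch(team_id) },
  fetch_shifts:   ->(game_id) { ["shifts for game #{game_id}"] }
)
```

Passing the fetchers in keeps the traversal logic separate from the HTTP details, which also makes it easy to try out with stubbed data as above.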
This is where it gets interesting! Shift data only records when a player enters and leaves the ice for a given shift, but this project asks who shares the ice with whom. So scrape_from_nhl_api.rb implements a small algorithm that works out which players are actually on the ice together and credits each pairing with the time they shared. It does this for every pairing of players at the same position.
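One way to compute shared time from shift records is to treat each shift as an interval and sum the overlaps for every pair of players. The sketch below is not taken from scrape_from_nhl_api.rb; the hash keys (player_id, position, start_sec, end_sec) are hypothetical, and times are assumed to be seconds from the start of a single game.

```ruby
# Seconds during which two shifts overlap (0 if they never overlap).
def overlap_seconds(a, b)
  [[a[:end_sec], b[:end_sec]].min - [a[:start_sec], b[:start_sec]].max, 0].max
end

# Sum shared seconds for every pair of distinct players at the same position.
# Returns a hash keyed by a sorted [player_id, player_id] pair.
def shared_time(shifts)
  totals = Hash.new(0)
  shifts.group_by { |shift| shift[:position] }.each_value do |group|
    group.combination(2) do |a, b|
      next if a[:player_id] == b[:player_id] # two shifts by the same player
      seconds = overlap_seconds(a, b)
      totals[[a[:player_id], b[:player_id]].sort] += seconds if seconds.positive?
    end
  end
  totals
end

# Hypothetical shifts from one game.
shifts = [
  { player_id: 1, position: "D", start_sec: 0,   end_sec: 45 },
  { player_id: 2, position: "D", start_sec: 30,  end_sec: 80 },
  { player_id: 1, position: "D", start_sec: 100, end_sec: 140 },
  { player_id: 2, position: "D", start_sec: 90,  end_sec: 120 },
  { player_id: 3, position: "F", start_sec: 0,   end_sec: 60 }
]

pairs = shared_time(shifts)
# Players 1 and 2 overlap for 15 seconds, then 20 seconds: 35 in total.
```

Comparing every pair of shifts is quadratic in the number of shifts, which is usually acceptable for a single game; a sweep over sorted start and end times would scale better if that ever became a concern.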
Then, the final step of scraping the data is writing it to a CSV file (import/all-pos-on-ice-data.csv), which is used in the next step to import the data into Neo4j.
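As a rough illustration of the export step, Ruby's standard CSV library can serialize the pairings. The column names and the choice of minutes as the unit are assumptions made for this sketch; the actual header of import/all-pos-on-ice-data.csv may differ.

```ruby
require "csv"

# Serialize { [player_a, player_b] => shared_seconds } as CSV text.
# The header names are hypothetical, not the project's real columns.
def pairs_to_csv(pairs)
  CSV.generate do |csv|
    csv << %w[player_a player_b shared_minutes]
    pairs.each do |(player_a, player_b), seconds|
      csv << [player_a, player_b, (seconds / 60.0).round(1)]
    end
  end
end

# Made-up player IDs sharing 12,000 seconds (200 minutes).
csv_text = pairs_to_csv({ [101, 102] => 12_000 })

# To persist it: File.write("import/all-pos-on-ice-data.csv", csv_text)
```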
Importing the data into a Neo4j DB
Fortunately, Neo4j can import data from CSV files. scripts/load_players.cql has the details of this import. It is run with cypher-shell.
The script loads the players as nodes in a weighted graph. Each node carries a description of the player, including position and current team. The edge between each pair of players is weighted by their shared time on ice in minutes. So if two players spent 200 minutes on ice together over the course of a season, the edge connecting them would have a weight of 200. Notably, in an effort not to overcrowd the graph, only pairs of players who shared the ice for at least two hours (120 minutes) are written to the CSV.
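The shape of that graph can be illustrated in plain Ruby, independent of Neo4j. The sketch below assumes rows of the form [player_a, player_b, shared_minutes], which is a hypothetical layout, and applies the two-hour threshold as 120 minutes; the real import is done in Cypher by scripts/load_players.cql, not by this code.

```ruby
# Build an undirected weighted adjacency hash, dropping light edges.
# rows: array of [player_a, player_b, shared_minutes] (hypothetical layout).
def build_graph(rows, min_minutes: 120)
  graph = Hash.new { |hash, key| hash[key] = {} }
  rows.each do |player_a, player_b, minutes|
    next if minutes < min_minutes # below the two-hour cutoff
    graph[player_a][player_b] = minutes
    graph[player_b][player_a] = minutes
  end
  graph
end

# Made-up pairings: A-C falls under the threshold and is left out.
rows = [["A", "B", 200], ["A", "C", 45], ["B", "C", 120]]
graph = build_graph(rows)
```

Here the edge between A and B has weight 200, matching the example in the text, while B and C are kept because they meet the threshold exactly.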
Visualizing with NeoVis
Exposing the DB from a DigitalOcean Droplet
Lastly, to serve the database, I installed and configured Neo4j on a DigitalOcean droplet.
There are many potential avenues to explore this more deeply. In no particular order, some on my mind are:
- Looking into the NHL API, and scraping different data
- Automating the data scraping to happen nightly (maybe with the next season?)
- Using vis.js to have more control over tuning the data visualization
- Putting a load balancer and lambda on top of the server to direct traffic