
matteosox/nba

Status
Data Pipeline · Setup, Test, & Build · Website
Languages
Python 3.10 · Bash 5.1 · Typescript
Services
VCS: Github · CI: Github Actions · Container Repo: Docker Hub · VPN: ProtonVPN · Storage: AWS S3 · CD & Hosting: Vercel · Domain: Hover
Dependencies
Analysis environment: Jupyter · Config: Dynaconf · Web framework: Next.js · UI: React · NBA stats: pbpstats · Models: pymc3
Testing
Python style: Black · Python quality: Pylint · Bash quality: ShellCheck

General Info

NBA Stats and Analysis

Description

This repo has three main parts:

  1. pynba: a Python package of utilities, data loaders/serializers, and scripts for analyzing NBA data.
  2. notebooks: a collection of Jupyter notebooks analyzing NBA data using pynba.
  3. app: a Next.js web app hosted at nba.mattefay.com, displaying the latest stats.

User Notes

Jupyter Notebook Environment

TL;DR: To start up the notebook environment, run notebooks/run.sh, which will open up a browser tab for you.

We use a Dockerized Jupyter notebook environment for data analysis. The notebooks/run.sh bash script starts this container and opens a web browser to the Jupyter server for you, with the repo mounted to /root/nba. This allows you to edit the pynba package without restarting the container, since it is installed in editable mode. The Jupyter notebook directory is the repo's notebooks directory, which contains version-controlled notebooks.
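If you're curious what that entails, the manual equivalent looks roughly like the sketch below. The image tag, port, and Jupyter flags are assumptions; notebooks/run.sh is the source of truth.

```bash
# Approximate manual equivalent of notebooks/run.sh
# (image name, port, and Jupyter flags are assumptions)
docker run --rm -it \
    --env-file build/notebook.local.env \
    -v "$(pwd)":/root/nba \
    -p 8888:8888 \
    matteosox/nba-notebook:latest \
    jupyter notebook --ip 0.0.0.0 --no-browser --notebook-dir /root/nba/notebooks
```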

Future Work

  • Analysis
    • Travel and rest adjustments
    • Re-evaluate priors
    • Confirm reduction in home court advantage
    • Fix 2020 bubble games
    • Playoffs?!
  • App
    • Theme/style
    • Replace images with interactives
    • Improve tables (sortable, hover for definition, colorize for z-scores)

Contributing

Getting started

We use Docker for a clean environment within which to build, test, analyze, and so on. The setup.sh script in the cicd directory will build the relevant images for you. Running things natively isn't supported or maintained.

In addition to Docker, you'll need some developer secrets, in build/notebook.local.env and build/app.local.env. Those files are git ignored for obvious reasons, so you'll need to ask around to get those credentials.

Finally, while you don't need it for most workflows, it's probably a good idea to set up Docker so you have push access to the Docker Hub registry. That requires a personal access token, which you can use in combination with your Docker ID to log in using docker login.
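For example (the environment variable names below are just placeholders for your own credentials):

```bash
# Log in to Docker Hub with your Docker ID and a personal access token.
# Reading the token from stdin keeps it out of your shell history.
echo "$DOCKER_HUB_TOKEN" | docker login --username "$DOCKER_ID" --password-stdin
```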

Code Style

We use PEP8 for Python, but don't trip, just run test/black_lint.sh to get all your spaces in a row.

Pull Requests

The main branch has branch protections turned on in Github, requiring one reviewer to approve a PR before merging. We also use the code owners feature to specify who can approve certain PRs. As well, merging a PR requires status checks to complete successfully.

When naming a branch, please use the syntax firstname/branch-name-here. If you plan to collaborate with others on that branch, use team/branch-name-here.
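For example (branch names here are hypothetical):

```bash
# Personal branch: firstname/branch-name-here
git checkout -b matteo/rest-adjustment
# Shared branch: team/branch-name-here
git checkout -b team/playoff-model
```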

Developer Workflows

Updating python requirements

TL;DR: Run requirements/update_requirements_inside_docker.sh.

There are two requirements files checked into the requirements directory:

  1. requirements.in
  2. requirements.txt

The .in files are where we collect immediate dependencies, described in PyPI format (with versions pinned only as needed). The .txt files are generated by running the requirements/update_requirements_in_docker.sh script, which runs requirements/update_requirements.sh inside the notebook Docker container. We do this because pip-compile should be run from the same virtual environment as your project, so that conditional dependencies requiring a specific Python version, or other environment markers, resolve relative to your project's environment.

This gives us a flexible way to describe dependencies while still achieving reproducible builds. Inspired by this and this.
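At its core, the compile step is a pip-compile call along these lines (the exact flags are an assumption; the script is authoritative):

```bash
# Regenerate the pinned requirements.txt from requirements.in
# (exact flags are an assumption; see requirements/update_requirements.sh)
pip-compile requirements/requirements.in \
    --output-file requirements/requirements.txt \
    --upgrade
```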

Handling Config

While the constants.py module contains values that don't change with each run, the config.py module makes configuration values available in the Python runtime that DO change. This uses dynaconf to inject and load dynamic configuration from 1) settings.toml, for defaults for each environment, and 2) environment variables, prefixed with PYNBA and registered in settings.toml. The meta_config.py module provides a convenient syntax for creating a config dataclass with typed values that loads each parameter dynamically from dynaconf. You can see an example of this in the config.py module. The dynaconf environment is determined by the ENV_FOR_DYNACONF environment variable.

To pass environment variables into the Docker runtime, either for dynaconf or other purposes, you have two options:

  1. export them in your development environment, then register them in notebook.env. For example, to select a dynaconf environment other than the default, export it as ENV_FOR_DYNACONF; this variable is already registered in notebook.env to be passed in (see the sketch below).
  2. add them to your notebook.local.env file, which is in the .gitignore so it won't be committed. This is where we keep developer credentials, for example.
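A hypothetical example of option 1, selecting a non-default dynaconf environment and passing an extra override (the environment name and the setting below are made up):

```bash
# Select a non-default dynaconf environment and pass an extra PYNBA-prefixed
# override into the container. Both variables must also be registered in
# notebook.env for the script to forward them. Names below are hypothetical.
export ENV_FOR_DYNACONF=production
export PYNBA_EXAMPLE_SETTING=some_value
notebooks/run.sh
```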

Inspired by the 12-factor application guide.

Developing the NextJS App

TL;DR: Run app/run.sh.

To ease developing the NextJS web app, we use npm run dev in a Docker container with the app mounted. This starts the app in development mode, taking advantage of NextJS's fast refresh functionality, which catches exceptions and loads code updates near-instantaneously.

Updating node packages

TL;DR: Run app/run.sh npm update.

The script accepts an optional command, e.g. app/run.sh YOUR CMD HERE. So to install a new npm package, run app/run.sh npm install new-package; to update all packages, run app/run.sh npm update.

Continuous Integration

We use Github actions to run our CI pipeline on every pull request. The configuration can be found in .github/workflows/setup_test_push.yaml. That said, every step of CI can also be run locally.

Settin' up

TL;DR: To set things up, run cicd/setup.sh.

This builds the two relevant docker images: notebook and app.

We do a couple of neat caching tricks to speed things up. First off, in the Dockerfiles themselves, we use the RUN --mount=type=cache functionality of Docker BuildKit to cache Python packages stored in ~/.cache/pip. This keeps your local machine from re-downloading Python packages each time. We don't use this for OS-level packages, i.e. those installed using apt, to reduce the size of the images. I tried and failed to get this to work for npm install and the node_modules directory, with mysteriously useless results. This was inspired by this blog post.

Second, we use the new BUILDKIT_INLINE_CACHE feature to cache our images using Docker Hub. This is configured in the docker build command, and is smart enough to only download the layers you need. This was inspired by this blog post. This DOES work in Github Actions, while the prior functionality does not.
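Put together, the build command looks something like the sketch below; the exact flags and Dockerfile locations live in cicd/setup.sh, this just illustrates the inline cache mechanism.

```bash
# Build the notebook image with BuildKit, seeding the layer cache from the
# latest image on Docker Hub (a sketch; cicd/setup.sh is the source of truth)
DOCKER_BUILDKIT=1 docker build \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    --cache-from matteosox/nba-notebook:latest \
    --tag matteosox/nba-notebook:latest \
    .
```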

Testin'

TL;DR: To run tests, run cicd/test.sh.

Python Autoformatting

We use Black to check our Python code for proper formatting. If it doesn't pass, you can autoformat your code using Black by running test/black_lint.sh. The settings for this can be found in pyproject.toml.

Python Linting

We use Pylint for more general linting of our Python code. Pylint has some crazy settings and defaults, so use the pylintrc config file generously to make Pylint work in your favor, not to your detriment.

Python Unit Tests

We use the built-in Python module unittest's test discovery functionality. This requires that all test files be modules or packages importable from the root of the repo, and that they match the pattern test*.py. Our practice is to put tests for a module in a test folder in the same directory, which can also contain data and other files needed to run those tests.

The package is installed using setuptools's find_packages function. We use the exclude feature to exclude all test code, i.e. exclude=["*.tests", "*.tests.*", "tests.*", "tests"].

Thus, to run tests, we mount the root of the repo to the location in the container it's been installed. All of this is handled nicely by running test.sh, which uses the notebook container.
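Under the hood, discovery boils down to something like this, run from the repo root inside the notebook container (the exact invocation in test.sh may differ):

```bash
# Discover and run every test module matching test*.py, starting from the repo root
python -m unittest discover --start-directory . --pattern "test*.py" --verbose
```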

Bash Script Tests

We use ShellCheck to test all the bash scripts in the repo. This helps us avoid some of the many sharp corners of bash, improving quality, readability, and style. This is run in test/shellcheck.sh.
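Conceptually, that script amounts to pointing ShellCheck at every .sh file in the repo, e.g.:

```bash
# Lint every bash script in the repo (a sketch; test/shellcheck.sh is authoritative)
find . -name "*.sh" -not -path "*/node_modules/*" -exec shellcheck {} +
```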

Web App Tests

TBD

Pushin'

TL;DR: To push docker images to Docker Hub, run cicd/push.sh.

Note that you'll need to be logged in to Docker to be able to do this locally. We use the docker/login-action@v1 build action to log in for the automated CI; it uses a personal access token named github-actions from my Docker Hub account to do that, with the username and token stored as secrets.

In Docker Hub, we have one repository per image:

  • notebook -> matteosox/nba-notebook
  • app -> matteosox/nba-app

In addition to pushing each image with a tag set to the git SHA of the code producing it, we also push an image without an explicit tag (i.e. tagged as latest) when pushing from the main branch.
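In other words, the push step does roughly the following for each image (a sketch; cicd/push.sh is authoritative):

```bash
# Tag the freshly built image with the git SHA and push it;
# from the main branch, also push the implicit :latest tag
GIT_SHA="$(git rev-parse HEAD)"
docker tag matteosox/nba-notebook:latest "matteosox/nba-notebook:$GIT_SHA"
docker push "matteosox/nba-notebook:$GIT_SHA"
docker push matteosox/nba-notebook:latest
```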

Buildin'

We don't build the Next.js app in the setup_test_push.yaml Github Actions workflow because Vercel is configured to do this for us. That said, as with all other CI workflows, we support running these locally. To build the app, use app/run.sh npm run build.

Data Pipeline

We run a nightly data pipeline job using Github Actions to update the data hosted on the site. This is configured in .github/workflows/data_pipeline.yaml. Again, each step of the process can be run locally.

VPN

Unfortunately, the NBA blocks API traffic from many cloud hosting providers, e.g. AWS. In our case, Github Actions runs on Azure. To get around this issue, I've set up a free ProtonVPN account. The various secrets needed to get that to work (a base certificate, TLS authentication key, and openvpn username & password) are stored as secrets in Github and injected as environment variables in the relevant step. An openvpn config file — config.ovpn — can be found in the vpn directory, along with a connect.sh bash script used by the workflow to set things up.

Update Data

The first step of the data pipeline runs the pynba_update Python console script inside of the notebook Docker image using the cicd/etl.sh bash script. This queries for data for each league in the current year, saving any updates to the local data directory (mounted to the container). For seasons with new games found, we also calculate updated team ratings and generate updated plots.

Sync Data to S3

The second (and final) step of the data pipeline runs the pynba_sync Python console script using the same environment/mechanism as before. This local data — pbpstats files, season parquet files, incremental possessions parquet files, team ratings & plots — is then synced to s3, where it can be accessed by the site.
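Functionally, that sync amounts to something like the command below; the actual pynba_sync script may use boto3 rather than the AWS CLI, and the local directory name is an assumption.

```bash
# Mirror the local data directory to the nba-mattefay bucket
# (a sketch of the effect; pynba_sync is the real tool, and ./data is an assumed path)
aws s3 sync ./data s3://nba-mattefay
```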

Github Actions Artifacts

We store an artifact of the local data directory at the completion of each run of the data pipeline, both as a historical record and for ease of debugging.

Continuous Deployment

The Python package pynba exists strictly to factor code out of this repo's Jupyter notebook environment, so it isn't packaged up and released, e.g. to PyPI.org.

Vercel

The NextJS app is deployed to nba.mattefay.com by Vercel, the company behind NextJS. The deployment process is integrated with Github, so any commit to the main branch results in a new deploy. Conveniently, Vercel also builds and deploys a "staging" site for every commit that changes the app directory, making these available through comments on your pull request, for example.

Data and plots are stored in the nba-mattefay bucket on AWS S3. To access these files, we inject AWS credentials with environment variables. Unfortunately, Vercel reserves the usual environment variables for this, i.e. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. To get around this, we store them as the non-standard AccessKeyId and SecretAccessKey environment variables, and manually load credentials in the aws_s3.ts module, similar to this approach on Vercel's website. These credentials are from the matteosox-nba-vercel AWS IAM user, with read-only access to this one bucket.

DNS

I own the domain mattefay.com through hover.com. I host my blog there, using format.com. This repo's site is hosted at the nba.mattefay.com subdomain. Since Vercel is hosting this site, I have a CNAME DNS record in Hover to alias that subdomain to them, i.e. CNAME nba cname.vercel-dns.com.