A Video Vision Transformer (ViViT) model for detecting instances of contact between NFL players in play-by-play video footage
- Containerized project using Docker to recreate the project environment (including OpenCV, PyTorch, Hugging Face libraries, etc.)
- Uploaded the data and the Docker image to AWS S3 and ECR, respectively
- Fine-tuned a pretrained ViViT model (from Hugging Face) on a GPU-enabled AWS EC2 instance
- Model based on *ViViT: A Video Vision Transformer* (Arnab et al., 2021)
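Pretrained ViViT checkpoints consume fixed-length clips (32 frames for the ViViT-B checkpoints on Hugging Face), so each play's raw footage must be subsampled before preprocessing. A minimal sketch of uniform frame-index sampling; the helper name is hypothetical and not part of this repo:

```python
def sample_frame_indices(num_video_frames: int, clip_len: int = 32) -> list[int]:
    """Pick `clip_len` evenly spaced frame indices spanning the whole video.

    Repeats frames when the video is shorter than the clip length, so the
    output always has exactly `clip_len` valid indices.
    """
    step = num_video_frames / clip_len
    return [min(int(i * step), num_video_frames - 1) for i in range(clip_len)]
```

The selected frames would then be stacked and passed through the checkpoint's image processor before hitting the model.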
Assuming you have Docker installed:
- Pull the data from https://www.kaggle.com/competitions/nfl-player-contact-detection/data
- Store the data in the project directory:

  ├── ActionRecognition   <- Project directory
  │   ├── data            <- Data directory
  │   │   ├── ...         <- Data files
- Build the Docker image:

  $ cd path/to/ActionRecognition
  $ docker build -t action-rec:latest .
- Run the Docker container:

  $ docker run -v $(pwd)/data:/usr/src/data -v $(pwd)/models:/usr/src/models -it action-rec:latest sh
- Within the container, run inference:

  $ python inference.py
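Since the task is binary (contact vs. no contact), inference presumably reduces the classifier's two output logits to a contact probability at some point. A hedged sketch of that final step only; this is not taken from `inference.py` and the function name is hypothetical:

```python
import math

def contact_probability(logits: list[float]) -> float:
    """Softmax over [no-contact, contact] logits; returns P(contact)."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)
```

A probability threshold (e.g. 0.5) would then decide whether a frame pair is labeled as a contact event.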
├── README.md          <- Top-level README
│
├── models             <- Trained and serialized models
│   ├── ...
│
├── requirements.txt   <- Requirements file for reproducing the analysis environment
│
├── .dockerignore      <- Files to omit when building the Docker image
│
├── Dockerfile         <- Instructions for building the Docker image
│
├── config.yaml        <- Model configuration/hyperparameters
│
├── dataset.py         <- Functions and classes for building the training dataset
│
├── train.py           <- Model training
│
├── inference.py       <- Model inference/testing
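As an illustration of what `config.yaml` might contain, a hypothetical fragment is shown below. All keys and values here are assumptions, not copied from the repo; `google/vivit-b-16x2-kinetics400` is a real pretrained ViViT checkpoint on Hugging Face, but whether this project uses it is not stated above.

```yaml
# Hypothetical example -- see config.yaml for the actual settings
model:
  checkpoint: google/vivit-b-16x2-kinetics400  # pretrained ViViT from Hugging Face
  num_labels: 2                                # contact / no-contact
train:
  clip_len: 32       # frames per clip expected by the checkpoint
  batch_size: 4
  learning_rate: 5e-5
  epochs: 10
```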