A real-time employee productivity analytics system that tracks behavior: time on the floor, phone usage, and idle time. The camera feed is processed entirely on-device, and only the resulting analytics are sent to the cloud for storage and reporting. Raw video never leaves the machine.
I wanted a tool that could actually measure employee productivity in a real workspace, in real time. Not just the number of people on the floor, but how they behave: how long is someone working? How often are they picking up their phone? Are they idle?
Every existing solution I looked at was either far too invasive (facial recognition, full video streamed to a cloud server) or too shallow to be useful. Streaming raw video also burns bandwidth and compute you don't need to spend. I wanted something smarter: rich behavioral data with an approach I could actually defend.
The core insight is that you don't need to know who someone is to track how they behave. Cloud Collar gives every person an anonymous ID based on their appearance and tracks their session from there. The cloud never sees anything except numbers, which keeps the analysis at the team level. One or two people spending too much time on their phone? The data shows it.
[Camera / Raspberry Pi]
|
v
YOLOv8 detects people each frame
ByteTrack assigns short-lived track IDs
ResNet18 re-ID matches returning people to stable person IDs
Tracker accumulates per-person session data
Uploader POSTs anonymous JSON snapshots every 30s
|
| HTTPS (JSON only, never video)
v
AWS API Gateway → Lambda → DynamoDB
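The last edge step in the diagram, the upload, can be sketched roughly as follows. This is a minimal illustration and not the project's actual `uploader.py`: the field names, the `x-api-key` header, and the helper names `build_snapshot` and `build_request` are all assumptions. The point it shows is that an explicit whitelist of numeric fields keeps embeddings and pixels out of the payload.

```python
import json
import time
import urllib.request

UPLOAD_INTERVAL_S = 30  # snapshot cadence from the pipeline description

# Hypothetical field names: only these numeric counters may leave the device.
ALLOWED_FIELDS = ("person_id", "time_on_floor_s", "phone_s", "idle_s", "away_s")


def build_snapshot(persons, now=None):
    """Copy whitelisted counters into a JSON-serializable snapshot."""
    now = time.time() if now is None else now
    return {
        "timestamp": int(now),
        "persons": [
            {key: person[key] for key in ALLOWED_FIELDS if key in person}
            for person in persons
        ],
    }


def build_request(url, api_key, snapshot):
    """Prepare the HTTPS POST; the header name is an assumption."""
    return urllib.request.Request(
        url,
        data=json.dumps(snapshot).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json", "x-api-key": api_key},
    )


# Sending is left to the caller, for example once every UPLOAD_INTERVAL_S:
#   urllib.request.urlopen(build_request(url, key, snapshot), timeout=10)
```

Anything not listed in `ALLOWED_FIELDS`, such as an in-memory embedding attached to a person record, is dropped before serialization.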
Standard trackers like ByteTrack assign a track ID to each visible person. When someone steps out of frame, the track dies, and when they walk back in, they're a brand new ID. That's fine for object counting, but useless for session-level analytics.
Cloud Collar layers appearance-based re-identification on top to maintain continuity:
- Each person crop is run through a ResNet18 backbone (pretrained on ImageNet, classifier removed) to produce a 512-d L2-normalized embedding: a fingerprint of how that person looks.
- New tracks buffer a few frames before a decision is made, so one blurry frame can't cause a misidentification.
- The averaged embedding is compared against all known persons using cosine similarity. Above the match threshold, the track is merged into the existing record. Below it, a new person ID is created.
- People currently in frame are excluded from matching: one person can't be in two places at once.
- Embeddings are refreshed as a running average every 15 frames, adapting to lighting and pose changes over time.
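The matching steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `reid.py` or `tracker.py`: the threshold value `0.7` is hypothetical (the source does not state it), the function names are made up for this example, and the real system works on 512-d vectors rather than the tiny lists a reader might try it with.

```python
import math

MATCH_THRESHOLD = 0.7  # hypothetical value; the actual threshold is not stated
REFRESH_EVERY = 15     # frames between embedding refreshes, per the text


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_embedding(buffered):
    """Average the buffered per-frame embeddings, then re-normalize."""
    dim = len(buffered[0])
    avg = [sum(frame[i] for frame in buffered) / len(buffered) for i in range(dim)]
    return l2_normalize(avg)


def cosine(a, b):
    """Cosine similarity; a plain dot product because inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


def match_person(buffered, gallery, in_frame, threshold=MATCH_THRESHOLD):
    """Return the known person ID to merge into, or None for a new person."""
    query = mean_embedding(buffered)
    best_id, best_sim = None, threshold
    for person_id, embedding in gallery.items():
        if person_id in in_frame:  # already visible, so cannot be this track
            continue
        sim = cosine(query, embedding)
        if sim >= best_sim:
            best_id, best_sim = person_id, sim
    return best_id


def refresh(gallery, person_id, new_embedding, count):
    """Fold a new embedding into the running average built from `count` samples."""
    old = gallery[person_id]
    merged = [(o * count + n) / (count + 1) for o, n in zip(old, new_embedding)]
    gallery[person_id] = l2_normalize(merged)
```

A caller would invoke `match_person` once a new track has buffered enough frames, create a fresh ID when it returns `None`, and call `refresh` every `REFRESH_EVERY` frames for tracks that stay visible.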
| Metric | Description |
|---|---|
| Time on floor | Cumulative seconds the person has been visible |
| Phone usage | Number of phone sightings plus total seconds spent holding a phone |
| Idle time | Seconds spent stationary, i.e. moving less than the movement threshold |
| Away time | Seconds a known person was absent from the frame |
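One way the four counters in the table above could be accumulated per person is sketched below. This is not the project's `Person` class: the class name, the pixel threshold, and the idea of passing `holding_phone` in as a flag are assumptions, and idle time is interpreted here as frames where the person's center moved less than a small distance.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

IDLE_MOVE_PX = 5.0  # hypothetical movement threshold, in pixels per update


@dataclass
class PersonSession:
    """Per-person counters; all durations are in seconds."""
    person_id: int
    time_on_floor_s: float = 0.0
    phone_sightings: int = 0
    phone_s: float = 0.0
    idle_s: float = 0.0
    away_s: float = 0.0
    _holding_phone: bool = False
    _last_center: Optional[Tuple[float, float]] = None

    def update(self, dt, visible, center=None, holding_phone=False):
        """Advance the session by `dt` seconds using this frame's observations."""
        if not visible:
            self.away_s += dt
            self._holding_phone = False
            self._last_center = None
            return
        self.time_on_floor_s += dt
        if holding_phone:
            if not self._holding_phone:
                self.phone_sightings += 1  # count each new pickup once
            self.phone_s += dt
        self._holding_phone = holding_phone
        if center is not None and self._last_center is not None:
            dx = center[0] - self._last_center[0]
            dy = center[1] - self._last_center[1]
            if math.hypot(dx, dy) < IDLE_MOVE_PX:
                self.idle_s += dt
        self._last_center = center
```

How the phone flag is produced (for example, from a detector class overlapping the person box) is outside this sketch.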
| Metric | Description |
|---|---|
| Drinks made | Count of items prepared per session |
| Orders fulfilled | Orders completed start to finish |
| Product wasted | Instances of discarded or unused product |
These production metrics (drinks made, orders fulfilled, product wasted) push Cloud Collar beyond employee behavior into the broader cost of running a business: inventory turnover, fulfillment efficiency, and waste. The same system tracking how people work can also track what's being produced, what's being lost, and where money is going.
🎯 YOLOv8s: Fast enough for real-time inference on edge hardware without a dedicated GPU. The small variant hits the right balance of speed and accuracy for person detection.
🔁 ByteTrack: Built into Ultralytics, handles short-lived track assignment with no extra configuration. Paired with the re-ID layer, it becomes a robust full-session tracker.
🧠 ResNet18: Lightweight enough to run on a Raspberry Pi. With the classification head removed, the final feature layer produces appearance embeddings that generalize well to unseen people without any fine-tuning.
⚡ AWS Lambda: Serverless, so the backend scales to zero when the system isn't running. No instances to manage, and the cost for a single-camera deployment is negligible.
🗄️ DynamoDB: Pay-per-request and schemaless. Session records have a variable number of persons, so a fixed relational schema would just add friction.
🏗️ Terraform: One terraform apply stands up the entire backend in about 30 seconds. terraform destroy tears it down cleanly. The whole infrastructure is version-controlled and reproducible.
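For the Lambda and DynamoDB pieces, the auth, validation, and upsert flow might look roughly like the sketch below. It is not the project's `handler.py`: the `x-api-key` header, the `person_id` key, the `updated_at` attribute, and the `make_handler` factory are all assumptions made for illustration. The table is passed in so the logic can be exercised without AWS; in a real deployment it would be a boto3 DynamoDB `Table` resource.

```python
import json
from decimal import Decimal


def _response(status, message):
    """Shape a reply the way an API Gateway proxy integration expects."""
    return {"statusCode": status, "body": json.dumps({"message": message})}


def make_handler(table, api_key):
    """Bind a Lambda-style handler to a table object and a shared secret."""

    def handler(event, context):
        # Header names can arrive in any case, so compare in lowercase.
        headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
        if headers.get("x-api-key") != api_key:
            return _response(401, "unauthorized")

        try:
            # boto3's DynamoDB resource rejects Python floats, so parse as Decimal.
            payload = json.loads(event.get("body") or "", parse_float=Decimal)
        except json.JSONDecodeError:
            return _response(400, "body is not valid JSON")

        persons = payload.get("persons") if isinstance(payload, dict) else None
        if not isinstance(persons, list) or "timestamp" not in payload:
            return _response(400, "expected timestamp and persons")
        if not all(isinstance(p, dict) and "person_id" in p for p in persons):
            return _response(400, "every person needs a person_id")

        for person in persons:
            item = {
                **person,
                "person_id": str(person["person_id"]),  # hypothetical partition key
                "updated_at": payload["timestamp"],
            }
            table.put_item(Item=item)  # overwrites any item with the same key
        return _response(200, f"stored {len(persons)} records")

    return handler
```

Because `put_item` replaces the whole item, each snapshot would need to carry cumulative counters rather than deltas for this to behave as an upsert.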
- Python 3.12+
- A camera or video file
- AWS account + AWS CLI configured (aws configure)
- Terraform installed
git clone https://github.com/nickleigh05/cloud-collar.git
cd cloud-collar
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

cd infra
terraform init
terraform apply -var="api_key=pick-any-secret-string"

This creates the DynamoDB table, Lambda function, and API Gateway in about 30 seconds. It prints an api_invoke_url when done.
To tear everything down: terraform destroy -var="api_key=same-secret"
cp .env.example .env
# edit .env: paste in the api_invoke_url and the same api_key

cd edge
python main.py

Press q to quit. Set VIDEO_PATH = 0 in main.py for a live camera feed, or point it at a video file.
cloud-collar/
├── edge/
│   ├── main.py          # detection + tracking loop
│   ├── tracker.py       # Person and Tracker classes
│   ├── reid.py          # ResNet18 embedding extractor
│   └── uploader.py      # batches and POSTs session snapshots
├── cloud/
│   └── lambda/
│       └── handler.py   # Lambda: auth, validation, DynamoDB upsert
├── infra/
│   ├── main.tf          # DynamoDB, Lambda, API Gateway, IAM
│   ├── variables.tf
│   └── outputs.tf
├── tests/
│   └── test_tracker.py  # tracker + re-ID logic (no camera or GPU needed)
├── requirements.txt
├── requirements-dev.txt
└── .env.example
Privacy was a first-class design constraint, not an afterthought.
- Raw video never leaves the device. All inference (detection, tracking, re-ID) runs locally.
- No facial recognition. Embeddings describe whole-body appearance and are never uploaded.
- The cloud receives only anonymous numeric IDs, timestamps, and durations.
- Embeddings live only in memory for the duration of a run and are discarded on exit.
- Any real deployment requires informed consent and visible signage.
pip install -r requirements-dev.txt
pytest tests/ -v

Tests use fake embeddings. No camera, no GPU, no coworkers needed. CI runs on every push.