Skip to content

tools: migrate to Toolforge buildservice#564

Open
lgelauff wants to merge 27 commits into
masterfrom
tools/buildservice
Open

tools: migrate to Toolforge buildservice#564
lgelauff wants to merge 27 commits into
masterfrom
tools/buildservice

Conversation

@lgelauff
Copy link
Copy Markdown
Collaborator

@lgelauff lgelauff commented May 4, 2026

Summary

Migrates Montage from toolforge webservice python3.13 to the Toolforge buildservice (Cloud Native Buildpacks). Adds Procfile, .python-version, and a root-level package.json shim so the Python and Node buildpacks both activate — baking the Vue frontend into the container image and eliminating the separate node20 frontend-build job.

Background and design rationale: #563.

Also incorporates the Python 3.13 dependency fixes from #512 (cffi 1.17.1, setuptools<82 pin).

What changed

  • Procfile: defines the gunicorn web process; binds to $PORT (Toolforge injects PORT=8000)
  • .python-version: pins Python 3.13 for the buildpack; removed from .gitignore
  • package.json (repo root): shim to trigger Node buildpack; build script delegates to frontend/
  • montage/utils.py: get_env_name() checks MONTAGE_ENV env var first (LDAP unavailable in containers); load_env_config() reads secrets from individual MONTAGE_* env vars when MONTAGE_OAUTH_CONSUMER_TOKEN is set, falls back to YAML for local dev
  • app.py: removes urllib3.contrib.pyopenssl shim (deprecated since urllib3 2.0, unnecessary on Python 3.13)
  • requirements.in: adds gunicorn; removes pyopenssl; unpins cryptography
  • requirements.txt: cffi 1.16.0 → 1.17.1 (adds cp313 wheels, no source compilation needed); setuptools pinned <82 (setuptools 82 removed pkg_resources, needed by python-graph-core — tracked in Automate setuptools installation #421)
  • dockerfile: Python 3.9 → 3.13 to match deployment; adds gcc libffi-dev build deps
  • deployment.md: rewritten for the buildservice workflow

Secrets

All credentials are now set via toolforge envvars (Kubernetes Secrets) — no config YAML file needed at runtime. See deployment.md step 3 for the full list of env vars and the warning about shell history on the shared bastion.

Test plan

  • npm run build from repo root (package.json shim) — assets land in montage/static/
  • cd frontend && npm run dev — Vite dev server starts
  • pytest montage/tests/ — 6 passed
  • Python config loads correctly with devtest (no env vars needed)
  • load_env_config() reads from MONTAGE_* env vars when set
  • gunicorn app:app starts and serves — / → 200, /meta/ → route list
  • CI passes
  • Node buildpack triggers from root package.json shim on Toolforge (needs working kube)
  • App is reachable via Toolforge ingress on port 8000 (needs working kube)
  • MONTAGE_ENV + secret envvars correctly configure the running service (needs working kube)
  • $TOOL_DATA_DIR is NFS-mounted in the webservice pod; log paths resolve (needs working kube)

Note: PR #534 (tools/build-frontend-script) is kept open as a fallback until the buildservice is confirmed working on Toolforge.

lgelauff and others added 13 commits May 4, 2026 15:47
Replace toolforge webservice python3.13 with the Toolforge buildservice
(Cloud Native Buildpacks). Adds Procfile, .python-version, and a root-level
package.json shim so both Python and Node buildpacks activate, baking the
Vue frontend into the container image and eliminating the separate node20
frontend-build job.

Config and env detection are updated in utils.py to work inside containers:
- get_env_name() checks MONTAGE_ENV first (LDAP/getpass.getuser unavailable
  in containers per Toolforge docs)
- load_env_config() loads secrets from individual MONTAGE_* environment
  variables (stored via toolforge envvars) when MONTAGE_OAUTH_CONSUMER_TOKEN
  is set; falls back to YAML for local dev

Removes the urllib3/pyopenssl SSL shim (deprecated since urllib3 2.0, no
longer needed on Python 3.13) and its dependency from requirements.in.
Removes the cryptography version pin, which was a uwsgi+PyO3 workaround
that does not apply to gunicorn. Adds gunicorn to requirements.in.

Rewrites deployment.md for the buildservice workflow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Matches the Python version used on Toolforge (.python-version). Required
by gunicorn>=20 which dropped support for Python <3.10.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cffi 1.16.0 ships a cp39 manylinux wheel but not a cp313 one, so pip
falls back to compiling from source on Python 3.13. The slim image has
no C toolchain. gcc and libffi-dev are the standard build requirements
for cffi. The underlying fix (cffi>=1.17.1, which has cp313 wheels) is
tracked in #512.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cffi 1.17.1 adds cp313 wheels, eliminating source compilation on Python
3.13 (and the need for gcc in the CI image). setuptools is pinned <82
because setuptools 82 removed pkg_resources, which python-graph-core
depends on via namespace packages. Remove once python-graph-core is
replaced (#421).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Heroku Node.js buildpack v5 requires a package-lock.json to run
npm ci and npm run build. Without it, the buildpack installs Node.js
but skips the install/build steps, leaving montage/static/ empty.

The build script runs: cd frontend && npm ci && npm run toolforge:build
which builds the Vite frontend and copies assets to montage/static/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Toolforge buildservice startup probe checks port 8000. The earlier
change to 5000 was incorrect and caused the startup probe to fail with
"connection refused", putting pods into CrashLoopBackOff.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NFS must be mounted so Montage can write logs to /data/project/<toolname>/.
Also adds URLs to the multi-environment table.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
frontend/.env is gitignored, so the buildpack never had it. Without it,
VITE_API_ENDPOINT was undefined at build time, making every Axios call go
to /undefined/v1/... and fail with 404. .env.production sets it to empty
string so Vite produces a relative /v1/ base URL, correct for same-origin
Toolforge deployments.

Also adds build verification and debugging guidance to deployment.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
**Important for deployers**: Montage now uses OAuth 2.0. If you have an
existing OAuth 1.0a consumer registered on Meta-Wiki, you will need to
register a new OAuth 2.0 consumer at Special:OAuthConsumerRegistration.

Changes:
- Replace mwoauth Handshaker/RequestToken with direct OAuth 2.0 PKCE flow
  using requests against rest.php/oauth2/ endpoints
- New env vars: MONTAGE_OAUTH_CLIENT_ID, MONTAGE_OAUTH_CLIENT_SECRET,
  MONTAGE_OAUTH_REDIRECT_URI (replaces MONTAGE_OAUTH_CONSUMER_TOKEN /
  MONTAGE_OAUTH_SECRET_TOKEN)
- Remove mwoauth, oauthlib, requests-oauthlib, pyjwt from dependencies
- debug mode in complete_login now reads debug_username/debug_userid
  from config instead of hardcoded values

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The juror round listing was filtering on campaign status only, so
finalized and cancelled rounds inside active campaigns still appeared.
Jurors clicking them got a confusing 400 "round not active" error.

Now only active and paused rounds are shown in the active view.

Suspected cause: e3069dd (PR #350, merged 2025-10-23) added a
finalize_round endpoint that can finalize individual rounds without
advancing, creating a state the juror listing never filtered for.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lgelauff lgelauff force-pushed the tools/buildservice branch from 1ada08b to 6b68758 Compare May 21, 2026 18:41
lgelauff and others added 10 commits May 21, 2026 20:43
Automates: build → poll for completion → verify SHA + port →
warn if running SHA already matches → restart → smoke-test /meta/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- git pull --ff-only at start so the script stays current
- default ref changed to master (was tools/buildservice)
- document one-time clone in script header

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Paused rounds were still appearing in the juror round list because
the only_active filter allowed both ACTIVE_STATUS and PAUSED_STATUS.
However confirm_active() only permits ACTIVE_STATUS, so clicking a
paused round always produced a 400 "round not active" error.

Fix restricts the juror index to strictly active rounds only.

Suspected introduced by e3069dd (#350).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- _load_config_from_env now raises ValueError if MONTAGE_DEBUG=true
  in prod, beta, or devlabs; prevents a single envvar command from
  disabling auth on a live Toolforge tool
- UserMiddleware auto-login path split: devtest keeps the test fixture
  user (6024474); debug mode uses userid=0 / '__montage_debug__' which
  has no match in any real DB, producing an immediate and obvious error
  if debug mode somehow reaches a production database
- complete_login debug defaults changed to the same sentinel values so
  config.dev.yaml must explicitly set debug_userid / debug_username

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Unvalidated ?next= parameter allowed open redirects: any URL passed
to /login?next= or /logout?next= was stored in a signed cookie and
redirected to after auth, enabling phishing via crafted links.

_safe_next() accepts only same-origin paths (starts with / but not //)
and falls back to root_path for anything else. Applied at all three
sites: login (cookie store), logout (direct redirect), and complete_login
(cookie read, defence-in-depth).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
get_index() accepted an only_active parameter but hard-coded True
when calling get_all_rounds_task_counts, so get_all_campaigns()
(the archive endpoint at /juror/campaigns/all) was always returning
an empty list instead of finalized rounds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously scm_secure was only True when env_name == 'prod', so
beta and devlabs deployments (both served over HTTPS on Toolforge)
shipped session cookies without the Secure flag. A MONTAGE_ENV typo
(e.g. 'production' instead of 'prod') would silently drop it on prod.

- scm_secure is now True for all envs except dev and devtest
- app startup raises on unknown env names so typos fail loudly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tests

- LoggingMiddleware now redacts code/state/access_token/token from
  request.args before writing to api_log; these are auth material and
  should never appear on disk
- complete_login no longer swallows exceptions silently; OAuth exchange
  failures are now logged via api_log so operators can distinguish
  'user cancelled' from 'client secret is wrong' or 'MW is down'
- Four new tests cover the PKCE flow using mocked MW endpoints:
  login redirect params, state mismatch, token exchange failure,
  and successful login with cookie verification

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- app.py: refuse to start if cookie_secret is the default placeholder
  or shorter than 32 chars
- public_endpoints.py: use hmac.compare_digest for state comparison
  (constant-time, prevents timing oracle on the CSRF token)
- utils.py: print effective superuser list at startup so
  misconfigured MONTAGE_SUPERUSERS is immediately visible in logs
- deploy.sh: fail explicitly when git ls-remote cannot resolve the ref
  instead of silently using the literal branch name as the SHA
- dev.md: update OAuth keys to oauth_client_id/secret/redirect_uri
  and note that debug_userid/debug_username must be set for a useful
  local dev session

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@mahmoud mahmoud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big PR! First impressions: There are a few other things in the yaml that aren't secret per se? I wonder if that's too many env vars... and what's the deal with the huge .secrets.baseline ?

Comment thread frontend/.env.production
@@ -0,0 +1 @@
VITE_API_ENDPOINT=
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm does this need to be in here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try removing it - I recall that it keeps crashing if you do. Needs to be declared? I came across this earlier in #516


cookie['return_to_url'] = request.args.get('next', root_path)
return redirect(redirect_url)
def login(request, oauth_config, cookie, root_path):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious why switching buildpacks would require changing this?

Copy link
Copy Markdown
Collaborator Author

@lgelauff lgelauff May 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oauth is upgrading here from v1 to v2. This was because i had to register a new consumer after deleting the old credentials accidentally.
Perhaps not the most lean implementation but made life easier.
Of course, if there is some reason to go back to v1, happy to register that instead.

Comment thread montage/rdb.py
users_rounds_query = users_rounds_query.where(campaigns_t.c.status == 'active')
users_rounds_query = users_rounds_query.where(campaigns_t.c.is_archived == False)
users_rounds_query = users_rounds_query.where(
rounds_t.c.status == ACTIVE_STATUS
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, seems unrelated to buildpacks?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate bug, that I think got in the way during testing. Was easier to fix than work around.

Can cherrypick it as I can see there might be a conversation whether to hide paused (or completed) rounds now i think more about it.

Comment thread montage/utils.py
elif env_name == 'devtest':
return dict(DEVTEST_CONFIG)

if os.environ.get('MONTAGE_OAUTH_CLIENT_ID'):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like this is being used as a detector for if env vars are in use, might be worth commenting that or something?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OAuth 1.0a → 2.0 PKCE rewrite (413ea20)

Comment thread tools/deploy.sh
fi

echo ""
echo "==> Deploy complete: $TOOL_URL"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is one of those scripts where if it works, it works, and if it doesn't we'll just fix it :P if there's a way to test it, it'll be pretty involved. I wonder if it makes sense to have a skill that runs it so that some automatic diagnosis and such can happen.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, personal preference, I guess :) Especially the smoke test was helpful to me, as it is explicit about when it's really ready.

Comment thread .python-version
@@ -0,0 +1 @@
3.13
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice. i assume it passes tox and your clicktesting?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good pointer, upgraded tox, and confirmed this passes. See #573

Comment thread package.json
@@ -0,0 +1,7 @@
{
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want this at the root? not in the frontend dir?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread requirements.txt
# via python-graph-core (pkg_resources namespace; implicit dep on Python 3.13)
# Pinned <82: setuptools 82 removed pkg_resources entirely; python-graph-core
# depends on it via namespace packages. Remove this pin once python-graph-core
# no longer uses pkg_resources (tracked in hatnote/montage#421).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good to know!

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lgelauff
Copy link
Copy Markdown
Collaborator Author

lgelauff commented May 24, 2026

Big PR! First impressions: There are a few other things in the yaml that aren't secret per se? I wonder if that's too many env vars... and what's the deal with the huge .secrets.baseline ?

Thanks for the comments and questions. In the new setup, we don't use a persistent file system any longer. It now also includes environment-specific variables. Moved it to env var. If you want, we can make that distinction clearer and move them to the USER_ENV_MAP.

The secrets baseline was for a pre-commit hook that I installed just in case Claude goes overboard and wants to commit something with a secret. The baseline are all known false positives. I realize now that I look more carefully that this can easily be condensed by skipping the test data.

lgelauff and others added 2 commits May 24, 2026 09:58
test_data/wlm2015_fr_12k.csv contains 1,798 Wikimedia img_sha1 hashes
that detect-secrets flagged as high-entropy strings. Excluding the
directory shrinks the baseline from 12,745 lines to 159.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gunicorn 25.x (added in this PR) requires Python >=3.10, so Python 3.9
is no longer supported. The minimum supported version is now Python 3.11
(local dev / CI); the deployment target is Python 3.13 (buildservice).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lgelauff
Copy link
Copy Markdown
Collaborator Author

Python version note: This PR adds gunicorn 25.3.0, which requires Python >=3.10 — Python 3.9 is no longer supported as of this change. Minimum supported version is now Python 3.11 (local dev/CI); deployment target remains Python 3.13 (buildservice). Dropped py39 from the tox matrix in ec20524.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lgelauff
Copy link
Copy Markdown
Collaborator Author

Merge dependency: deployment.md references tools/migrate_prod_db.sql (line 92), which is added by #505. This PR should not be merged before #505, otherwise the upgrade deployment instructions will reference a file that doesn't exist yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants