Github Actions Workflow for Tinker and Azure #327
Pull request overview
This PR adds GitHub Actions workflows to automate CI testing for Tinker and Azure examples. Key improvements include parameterizable test configurations with --ci flags for faster verification, fixed dependency versions for reproducibility, enhanced timeout configurations, and improved error handling.
- Adds automated CI workflows for Tinker and Azure examples
- Introduces a `--ci` mode for faster test execution with smaller datasets
- Pins the crewai version and removes the git commit hash from the tinker-cookbook dependency
- Improves the LiteLLM configuration with longer timeouts and better error handling
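As a rough illustration of how such a `--ci` mode can be wired in, here is a minimal, hypothetical sketch using `argparse`; the function name and the specific dataset/batch values are illustrative, not taken from the PR's actual code.

```python
import argparse

def build_config(argv=None):
    """Return run parameters, shrunk when --ci is passed (illustrative values)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--ci", action="store_true",
                        help="Run with a smaller dataset and batch size for CI")
    args = parser.parse_args(argv)
    if args.ci:
        # Small dataset and batch size for fast CI verification
        return {"n_samples": 32, "batch_size": 4}
    # Full-size run for real training
    return {"n_samples": 2000, "batch_size": 64}
```

The same pattern applies to any of the example scripts: the flag only selects smaller parameters, so the code path under test stays identical to the full run.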
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| uv.lock | Pins crewai to 1.2.0, unpins tinker-cookbook commit, adds new dependencies (ipython, pyvis, etc.) |
| scripts/litellm_run.sh | Makes port configurable via CLI argument with proper variable substitution |
| pyproject.toml | Pins crewai to ==1.2.0, removes git rev from tinker-cookbook |
| examples/tinker/tests/test_tinker_llm.py | Adds threaded store, launch mode config, and trajectory assertions |
| examples/tinker/q20_train.py | Adds CI mode support with smaller datasets and batch sizes |
| examples/tinker/q20_evaluate.py | Adds CI mode with sample limits, timeout increases, and result validation |
| examples/tinker/hello.py | Adds CI mode with reduced dataset sizes |
| examples/tinker/agl_tinker/llm.py | Switches to tinker tokenizer utility, adds timeout config, fixes callback setting |
| examples/azure/train_capital_agent.py | Adds CLI args for iterations and cleanup, adds try-finally for cleanup |
| examples/azure/aoai_finetune.py | Updates to new Azure API format, adds cleanup_deployments method |
| .github/workflows/examples-tinker.yml | New workflow for Tinker CI with multiple test scenarios |
| .github/workflows/examples-azure.yml | New workflow for Azure OpenAI examples |
| .github/workflows/benchmark.yml | Placeholder benchmark workflow |
```python
# If not adding return token ids, we need to add the opentelemetry callback.
# Otherwise, we set it to default.
callbacks=["opentelemetry"] if not add_return_token_ids else None,
```
Copilot AI (Nov 21, 2025):
The comment logic is inverted or unclear. The code sets callbacks to ['opentelemetry'] when add_return_token_ids is False, but the comment says 'If not adding return token ids, we need to add the opentelemetry callback. Otherwise, we set it to default.' This suggests the callback should be added when NOT adding token ids, which matches the code. However, the original code at line 309 was callbacks=['opentelemetry'] if add_return_token_ids else None, which is the opposite. Either clarify the comment to explain why the logic was inverted, or verify this is the correct behavior.
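To make the disputed conditional concrete, here is a minimal sketch of the PR's current behavior in isolation; the helper name `select_callbacks` is hypothetical, while `add_return_token_ids` and the `"opentelemetry"` callback name come from the snippet above.

```python
def select_callbacks(add_return_token_ids):
    """Current PR behavior: attach the OpenTelemetry callback only when
    token IDs are NOT being returned; otherwise leave callbacks unset."""
    return ["opentelemetry"] if not add_return_token_ids else None
```

The pre-PR code had the condition without the `not`, i.e. the callback was attached when token IDs *were* returned, which is exactly the inversion the review comment flags.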
```python
method={
    "type": "supervised",
    "supervised": {
        "hyperparameters": {
            "batch_size": self.finetune_batch_size,
            "learning_rate_multiplier": self.finetune_learning_rate,
            "n_epochs": self.finetune_epochs,
        }
    },
},
```
Copilot AI (Nov 21, 2025):
The API structure change from hyperparameters as a direct parameter to nested under method.supervised.hyperparameters appears to be a breaking change to the Azure OpenAI API. Verify this is compatible with the API version specified in the workflow (AZURE_OPENAI_API_VERSION: 2025-04-01-preview) and consider documenting the minimum required API version.
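For readers comparing the two request shapes, here is a minimal sketch of a payload built with the nested `method.supervised.hyperparameters` structure from the diff above; the helper name and the argument values are hypothetical, and the actual client call is omitted.

```python
def build_finetune_request(model, training_file, batch_size, learning_rate, epochs):
    """Build a fine-tuning request body with hyperparameters nested under
    method.supervised, rather than passed as a top-level parameter."""
    return {
        "model": model,
        "training_file": training_file,
        "method": {
            "type": "supervised",
            "supervised": {
                "hyperparameters": {
                    "batch_size": batch_size,
                    "learning_rate_multiplier": learning_rate,
                    "n_epochs": epochs,
                },
            },
        },
    }
```

Under the older shape, the `hyperparameters` dict sat at the top level of the request; whether a given Azure API version accepts the nested form is exactly what the comment asks the author to verify.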
```shell
sleep 5
python q20_train.py runner --n-runners 4 &
sleep 5
python q20_train.py algo --model qwen4b --ci
sleep 5

pkill -f agl && echo "SIGTERM sent to agl" || echo "No agl process found"
while pgrep -f agl; do
  echo "Waiting for agl to finish..."
  sleep 5
done
pkill -f q20_train.py && echo "SIGTERM sent to q20_train.py" || echo "No q20_train.py process found"
while pgrep -f q20_train.py; do
  echo "Waiting for q20_train.py to finish..."
  sleep 5
done
echo "q20_train.py has finished."
```
Copilot AI (Nov 21, 2025):
The process cleanup logic uses pkill -f with broad pattern matching (agl and q20_train.py) which could inadvertently kill unrelated processes on a shared runner. Consider using more specific process identifiers, storing PIDs from the background jobs, or adding a unique identifier to the commands to ensure only the correct processes are terminated.
Suggested change (replace the `pkill`-based cleanup above with PID tracking of the background jobs):

```shell
AGL_PID=$!
sleep 5
python q20_train.py runner --n-runners 4 &
RUNNER_PID=$!
sleep 5
python q20_train.py algo --model qwen4b --ci
sleep 5
# Cleanup: terminate background processes by PID
if kill -0 "$AGL_PID" 2>/dev/null; then
  kill "$AGL_PID" && echo "SIGTERM sent to agl (PID $AGL_PID)" || echo "Failed to send SIGTERM to agl"
else
  echo "No agl process found (PID $AGL_PID not running)"
fi
while kill -0 "$AGL_PID" 2>/dev/null; do
  echo "Waiting for agl (PID $AGL_PID) to finish..."
  sleep 5
done
if kill -0 "$RUNNER_PID" 2>/dev/null; then
  kill "$RUNNER_PID" && echo "SIGTERM sent to q20_train.py runner (PID $RUNNER_PID)" || echo "Failed to send SIGTERM to q20_train.py runner"
else
  echo "No q20_train.py runner process found (PID $RUNNER_PID not running)"
fi
while kill -0 "$RUNNER_PID" 2>/dev/null; do
  echo "Waiting for q20_train.py runner (PID $RUNNER_PID) to finish..."
  sleep 5
done
echo "q20_train.py runner has finished."
```
- Supports Tinker using a personal Tinker API key ($150).
- Supports Azure via Managed Identity.