Add .gitignore file #1
Merged
kolodkin pushed a commit that referenced this pull request on May 18, 2026
Last run got past every Ray-Jobs infrastructure issue and proved the job runs. The remaining problem is pure performance: Ray Data was killed at the 800s internal deadline mid-Multi-key-group-by, having made it through 9 of 11 ops in 12 minutes.

Profile of the surviving ops:

    Column sum:          <1s    OK
    Column multiply:     96s    ds.map(per-row lambda)     <- fix #1
    Filter rows:         78s    ds.filter(per-row lambda)  <- fix #1
    Sort:                ~12s   OK
    Count distinct:      <1s    OK
    Group-by sum:        175s   inherent shuffle cost
    Group-by count:      169s   inherent shuffle cost
    Group-by multi-agg:  176s   inherent shuffle cost
    Multi-key group-by:  killed

Fix 1: replace per-row `ds.map`/`ds.filter` with `ds.map_batches` (numpy / pandas batch_format). Should drop those two ops by an order of magnitude.

Fix 2: groupby shuffles are unavoidable - Ray Data on a single 4-CPU node just isn't optimized for this. Bump the Ray-specific timeout to 1500s (15 -> 25 min) and the internal job deadline to 1400s. Other frameworks keep the 900s cap.

Net expected Ray runtime: ~5-8 min, fits comfortably under 1500s.

https://claude.ai/code/session_01Xu3bd81PR7VhGVfAdkG1ge
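A minimal sketch of what Fix 1 could look like, not the commit's actual code. The column name `value`, the factor `2.0`, the cutoff `0.5`, and the helper names are hypothetical; the batch functions take and return a dict of NumPy arrays, which is the shape Ray Data passes when `batch_format="numpy"`. The row filter is expressed as a `map_batches` call that returns a smaller batch.

```python
import numpy as np


def multiply_batch(batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Vectorized replacement for a per-row ds.map lambda (hypothetical column names)."""
    out = dict(batch)
    out["value_x2"] = batch["value"] * 2.0  # one NumPy op per batch, not per row
    return out


def filter_batch(batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Vectorized replacement for a per-row ds.filter lambda."""
    mask = batch["value"] > 0.5
    # Apply the same boolean mask to every column so rows stay aligned.
    return {name: col[mask] for name, col in batch.items()}


def apply_to_dataset(ds):
    """Wire both functions into a Ray Dataset (ds is assumed to be a ray.data.Dataset)."""
    return (
        ds.map_batches(multiply_batch, batch_format="numpy")
          .map_batches(filter_batch, batch_format="numpy")
    )


# The batch functions can be checked locally without a Ray cluster.
demo = {"value": np.array([0.1, 0.6, 0.9])}
multiplied = multiply_batch(demo)
kept = filter_batch(multiplied)
```

Because the batch functions are plain NumPy, they can be unit-tested outside Ray before being timed in the benchmark job.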
kolodkin pushed a commit that referenced this pull request on Jun 21, 2026
New stage after the bootstrap CI: fits one decision threshold per genre on the validation split (maximizing per-genre F1) and applies those thresholds to the test split, reporting fixed-0.5 vs tuned F1 plus the chosen thresholds. Pure post-processing, no retraining.
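The threshold-tuning stage described above could be sketched roughly as follows. This is an illustration under assumptions, not the commit's code: the function names, the threshold grid, and the toy arrays are all hypothetical, and probabilities and labels are assumed to be arrays of shape `(n_samples, n_genres)`.

```python
import numpy as np


def f1_binary(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """F1 for one genre from 0/1 labels and predictions; 0.0 if undefined."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def fit_thresholds(val_probs: np.ndarray, val_labels: np.ndarray, grid=None) -> np.ndarray:
    """Pick, per genre, the grid threshold that maximizes validation F1."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # hypothetical search grid
    n_genres = val_probs.shape[1]
    thresholds = np.full(n_genres, 0.5)
    for j in range(n_genres):
        scores = [
            f1_binary(val_labels[:, j], (val_probs[:, j] >= t).astype(int))
            for t in grid
        ]
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds


def apply_thresholds(probs: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Binarize probabilities with one threshold per genre (broadcast over rows)."""
    return (probs >= thresholds).astype(int)


# Toy validation data for two hypothetical genres.
val_probs = np.array([[0.22, 0.90], [0.33, 0.12], [0.44, 0.70], [0.81, 0.21]])
val_labels = np.array([[0, 1], [1, 0], [1, 1], [1, 0]])

thresholds = fit_thresholds(val_probs, val_labels)
tuned_pred = apply_thresholds(val_probs, thresholds)
fixed_pred = apply_thresholds(val_probs, np.full(2, 0.5))

fixed_f1 = f1_binary(val_labels[:, 0], fixed_pred[:, 0])
tuned_f1 = f1_binary(val_labels[:, 0], tuned_pred[:, 0])
```

In the real pipeline the thresholds would be fitted on validation predictions and then applied to the test predictions; the toy example reuses one array only to keep the sketch short.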
kolodkin pushed a commit that referenced this pull request on Jun 21, 2026
…training)

- Drop the per-class threshold-tuning stage (improvement #1).
- Improvement #3: NUM_EPOCHS 3 -> 5, LR 5e-5 -> 2e-5 (canonical BERT recipe; best val-macro-F1 checkpoint is kept, so extra epochs are low-risk).
- Add a truncation check in the tokenize cell reporting plot token-length percentiles and the share of plots exceeding MAX_LENGTH.
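The truncation check in the last item might look something like the sketch below. The function name, the chosen percentiles, the sample lengths, and the `max_length=512` value are assumptions for illustration; the commit does not state the actual MAX_LENGTH. The input is a list of per-plot token counts, assumed to be measured before any truncation is applied.

```python
import numpy as np


def truncation_report(token_lengths, max_length, percentiles=(50, 90, 95, 99)):
    """Summarize token-length percentiles and the share of texts over max_length."""
    lengths = np.asarray(token_lengths)
    report = {f"p{p}": float(np.percentile(lengths, p)) for p in percentiles}
    # Fraction of plots whose untruncated length exceeds the model's limit.
    report["share_truncated"] = float(np.mean(lengths > max_length))
    return report


# Hypothetical token counts for five plots.
lengths = [120, 300, 512, 640, 900]
report = truncation_report(lengths, max_length=512)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

A high truncated share would suggest raising MAX_LENGTH or choosing which part of each plot to keep, which is the kind of decision the check is meant to inform.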
Summary
This PR adds a comprehensive .gitignore file to the repository to prevent common files and directories from being tracked by Git.

Changes
.gitignore with exclusions for:

Details
This standard .gitignore configuration follows Python best practices and covers development tools, testing frameworks, and environment-specific files that should not be committed to version control.

https://claude.ai/code/session_01TTH8YZk58oLrTaaCq8Xu37