
Add .gitignore file #1

Merged
kolodkin merged 1 commit into main from claude/create-python-gitignore-bYiMn on Mar 20, 2026
@kolodkin (Owner)

Summary

This PR adds a comprehensive .gitignore file to the repository to prevent common files and directories from being tracked by Git.

Changes

  • Added .gitignore with exclusions for:
    • Python artifacts (bytecode, compiled files, eggs, wheels)
    • Virtual environment directories (.venv, venv, env, etc.)
    • IDE configuration files (.idea, .vscode, vim swap files)
    • Testing and coverage reports (.pytest_cache, .coverage, htmlcov, etc.)
    • OS-specific files (.DS_Store, Thumbs.db)
    • Jupyter notebook checkpoints
    • mypy cache files

Details

This standard .gitignore configuration follows Python best practices and covers development tools, testing frameworks, and environment-specific files that should not be committed to version control.

https://claude.ai/code/session_01TTH8YZk58oLrTaaCq8Xu37

@kolodkin kolodkin merged commit 6a8351c into main Mar 20, 2026
kolodkin pushed a commit that referenced this pull request May 18, 2026
Last run got past every Ray-Jobs infrastructure issue and proved the
job runs. The remaining problem is pure performance: Ray Data was
killed at the 800s internal deadline mid-Multi-key-group-by, having
made it through 9 of 11 ops in 12 minutes.

Profile of the surviving ops:
  Column sum:          <1s    OK
  Column multiply:     96s    ds.map(per-row lambda)       <- fix #1
  Filter rows:         78s    ds.filter(per-row lambda)    <- fix #1
  Sort:               ~12s    OK
  Count distinct:      <1s    OK
  Group-by sum:       175s    inherent shuffle cost
  Group-by count:     169s    inherent shuffle cost
  Group-by multi-agg: 176s    inherent shuffle cost
  Multi-key group-by: killed

Fix 1: replace per-row `ds.map`/`ds.filter` with `ds.map_batches`
(numpy / pandas batch_format). Should drop those two ops by an order
of magnitude.
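
A minimal sketch of what Fix 1 could look like, assuming a numeric column named `value` and a multiply-by-2 / keep-positive workload (both hypothetical; the commit does not show the benchmark's schema or lambdas). With `batch_format="numpy"`, Ray Data passes each batch as a dict of column name to NumPy array, so the per-row work becomes one vectorized operation per batch:

```python
import numpy as np

# Hypothetical column name and operations; the real benchmark schema is not shown.
COL = "value"


def multiply_batch(batch: dict) -> dict:
    """Vectorized stand-in for ds.map(lambda row: ...): scale one column."""
    out = dict(batch)
    out[COL] = batch[COL] * 2.0
    return out


def filter_batch(batch: dict) -> dict:
    """Vectorized stand-in for ds.filter(lambda row: row[COL] > 0)."""
    mask = batch[COL] > 0
    # Apply the same boolean mask to every column so rows stay aligned.
    return {name: column[mask] for name, column in batch.items()}


def run_with_ray(ds):
    """Apply both steps to a ray.data.Dataset (not called here; needs Ray)."""
    return (
        ds.map_batches(multiply_batch, batch_format="numpy")
          .map_batches(filter_batch, batch_format="numpy")
    )


# The batch functions are plain NumPy, so they can be checked without a cluster.
demo = {COL: np.array([-1.0, 0.5, 3.0]), "key": np.array([10, 20, 30])}
result = filter_batch(multiply_batch(demo))
```

Filtering inside `map_batches` works because a batch function may return fewer rows than it received. Whether this yields the hoped-for order-of-magnitude gain would still need to be measured on the actual job.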

Fix 2: groupby shuffles are unavoidable - Ray Data on a single 4-CPU
node just isn't optimized for this. Bump the Ray-specific timeout
to 1500s (15 -> 25 min) and the internal job deadline to 1400s. Other
frameworks keep the 900s cap.

Net expected Ray runtime: ~5-8 min, fits comfortably under 1500s.
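
The timeout change in Fix 2 might be expressed as a per-framework override table. This is only a sketch: the constant and function names are hypothetical, and only the three numbers (900s, 1500s, 1400s) come from the message above.

```python
# Hypothetical names; only the numeric values are taken from the commit message.
DEFAULT_TIMEOUT_S = 900            # cap kept by the other frameworks (15 min)
FRAMEWORK_TIMEOUT_S = {"ray": 1500}  # Ray-specific outer timeout (25 min)
RAY_INTERNAL_DEADLINE_S = 1400     # deadline enforced inside the Ray job


def timeout_for(framework: str) -> int:
    """Return the outer timeout in seconds, falling back to the shared cap."""
    return FRAMEWORK_TIMEOUT_S.get(framework, DEFAULT_TIMEOUT_S)


# The internal deadline stays below the outer timeout, presumably so the job
# can stop itself before the runner kills it.
assert RAY_INTERNAL_DEADLINE_S < timeout_for("ray")
```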

https://claude.ai/code/session_01Xu3bd81PR7VhGVfAdkG1ge
kolodkin pushed a commit that referenced this pull request Jun 21, 2026
A new stage, run after the bootstrap CI, fits one decision threshold per genre on
the validation split (maximizing per-genre F1) and applies those thresholds to the
test split, reporting fixed-0.5 vs. tuned F1 plus the chosen thresholds. Pure
post-processing, no retraining.
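
The threshold-tuning stage described above can be sketched as follows. This is an illustration under assumptions, not the notebook's code: the genre names, the 0.1-step grid, the tie-break toward 0.5, and the dict-of-lists data layout are all hypothetical.

```python
def f1_score(y_true, y_pred):
    """Binary F1 from 0/1 labels; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_thresholds(val_true, val_prob, grid):
    """Pick, per genre, the grid threshold with the best validation F1."""
    best = {}
    for genre in val_true:
        scored = [
            (f1_score(val_true[genre], [int(p >= t) for p in val_prob[genre]]), t)
            for t in grid
        ]
        top_f1 = max(f for f, _ in scored)
        # Assumed tie-break: prefer the threshold closest to the default 0.5.
        best[genre] = min((t for f, t in scored if f == top_f1),
                          key=lambda t: abs(t - 0.5))
    return best


def apply_thresholds(prob, thresholds):
    """Turn per-genre probabilities into 0/1 predictions."""
    return {g: [int(p >= thresholds[g]) for p in ps] for g, ps in prob.items()}


# Toy validation data (hypothetical genres and probabilities).
grid = [round(0.1 * i, 1) for i in range(1, 10)]
val_true = {"drama": [1, 1, 0, 0], "comedy": [1, 0, 1, 0]}
val_prob = {"drama": [0.9, 0.35, 0.25, 0.1], "comedy": [0.8, 0.6, 0.7, 0.2]}

thresholds = tune_thresholds(val_true, val_prob, grid)
test_pred = apply_thresholds({"drama": [0.4, 0.2], "comedy": [0.65, 0.75]},
                             thresholds)
```

Because the thresholds are fitted on validation data only and then frozen, the test-split F1 remains an honest estimate.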
kolodkin pushed a commit that referenced this pull request Jun 21, 2026
…training)

- Drop the per-class threshold-tuning stage (improvement #1).
- Improvement #3: NUM_EPOCHS 3 -> 5, LR 5e-5 -> 2e-5 (canonical BERT recipe;
  best val-macro-F1 checkpoint is kept, so extra epochs are low-risk).
- Add a truncation check in the tokenize cell reporting plot token-length
  percentiles and the share of plots exceeding MAX_LENGTH.
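
A rough sketch of such a truncation check, assuming whitespace splitting as a stand-in for the real tokenizer and a deliberately small `MAX_LENGTH` (both hypothetical; the notebook's tokenizer and limit are not shown here):

```python
import statistics

MAX_LENGTH = 8  # hypothetical demo value, not the notebook's setting


def token_length(text: str) -> int:
    """Whitespace token count; a stand-in for the model tokenizer's length."""
    return len(text.split())


def truncation_report(texts, max_length):
    """Return selected token-length percentiles and the share of texts over the limit."""
    lengths = sorted(token_length(t) for t in texts)
    # quantiles(n=100) yields the 99 cut points between percentiles 1..99.
    cuts = statistics.quantiles(lengths, n=100, method="inclusive")
    percentiles = {p: cuts[p - 1] for p in (50, 90, 95, 99)}
    share_over = sum(1 for n in lengths if n > max_length) / len(lengths)
    return percentiles, share_over


# Toy plots with 3, 5, 10, and 20 whitespace tokens.
plots = ["a b c", "a b c d e", "a " * 10, "a " * 20]
percentiles, share_over = truncation_report(plots, MAX_LENGTH)
print(percentiles, f"{share_over:.0%} of plots exceed MAX_LENGTH")
```

A high share over the limit would suggest raising `MAX_LENGTH` or choosing a different truncation strategy before spending more epochs on training.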