Add .gitignore file by kolodkin · Pull Request #1 · kolodkin/samples

kolodkin · 2026-03-20T16:04:09Z

Summary

This PR adds a comprehensive .gitignore file to the repository to prevent common files and directories from being tracked by Git.

Changes

Added .gitignore with exclusions for:
- Python artifacts (bytecode, compiled files, eggs, wheels)
- Virtual environment directories (.venv, venv, env, etc.)
- IDE configuration files (.idea, .vscode, vim swap files)
- Testing and coverage reports (.pytest_cache, .coverage, htmlcov, etc.)
- OS-specific files (.DS_Store, Thumbs.db)
- Jupyter notebook checkpoints
- mypy cache files

Details

This standard .gitignore configuration follows Python best practices and covers development tools, testing frameworks, and environment-specific files that should not be committed to version control.

https://claude.ai/code/session_01TTH8YZk58oLrTaaCq8Xu37

Last run got past every Ray-Jobs infrastructure issue and proved the job runs. The remaining problem is pure performance: Ray Data was killed at the 800s internal deadline mid-Multi-key-group-by, having made it through 9 of 11 ops in 12 minutes.

Profile of the surviving ops:
- Column sum: <1s (OK)
- Column multiply: 96s, ds.map(per-row lambda) <- fix #1
- Filter rows: 78s, ds.filter(per-row lambda) <- fix #1
- Sort: ~12s (OK)
- Count distinct: <1s (OK)
- Group-by sum: 175s (inherent shuffle cost)
- Group-by count: 169s (inherent shuffle cost)
- Group-by multi-agg: 176s (inherent shuffle cost)
- Multi-key group-by: killed

Fix 1: replace per-row `ds.map`/`ds.filter` with `ds.map_batches` (numpy / pandas batch_format). Should drop those two ops by an order of magnitude.

Fix 2: groupby shuffles are unavoidable - Ray Data on a single 4-CPU node just isn't optimized for this. Bump the Ray-specific timeout to 1500s (15 -> 25 min) and the internal job deadline to 1400s. Other frameworks keep the 900s cap.

Net expected Ray runtime: ~5-8 min, fits comfortably under 1500s.

https://claude.ai/code/session_01Xu3bd81PR7VhGVfAdkG1ge
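A minimal sketch of what Fix 1 could look like, assuming the "numpy" batch format, where each batch arrives as a dict of column name to NumPy array. The column names `a` and `b`, the output column `product`, and the 0.5 cutoff are hypothetical placeholders, not taken from the benchmark itself.

```python
import numpy as np


def multiply_batch(batch: dict) -> dict:
    """Vectorized replacement for a per-row multiply lambda."""
    out = dict(batch)  # shallow copy so the input batch is not mutated
    out["product"] = batch["a"] * batch["b"]  # hypothetical column names
    return out


def filter_batch(batch: dict) -> dict:
    """Vectorized replacement for a per-row filter lambda."""
    mask = batch["a"] > 0.5  # hypothetical predicate
    # Apply the same boolean mask to every column to keep rows aligned.
    return {name: col[mask] for name, col in batch.items()}


# With Ray installed, the functions would be wired in roughly like this:
#   ds = ds.map_batches(multiply_batch, batch_format="numpy")
#   ds = ds.map_batches(filter_batch, batch_format="numpy")
```

The batch functions only touch NumPy, so they can be unit-tested without starting a Ray cluster; whether the order-of-magnitude speedup materializes still has to be confirmed by a rerun.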

New stage after the bootstrap CI fits one decision threshold per genre on the validation split (maximizing per-genre F1) and applies them to test, reporting fixed-0.5 vs tuned F1 plus the chosen thresholds. Pure post-processing, no retraining.
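The threshold-fitting step can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual code: the function names, the threshold grid (0.05 to 0.95 in steps of 0.05), and the array layout (rows are examples, columns are genres) are all hypothetical.

```python
import numpy as np


def f1_binary(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """F1 for one binary label column; returns 0.0 when undefined."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else float(2 * tp / denom)


def fit_thresholds(val_probs: np.ndarray, val_labels: np.ndarray, grid=None) -> np.ndarray:
    """Pick, per genre, the grid threshold that maximizes validation F1."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # assumed search grid
    thresholds = np.full(val_probs.shape[1], 0.5)
    for g in range(val_probs.shape[1]):
        scores = [
            f1_binary(val_labels[:, g], (val_probs[:, g] >= t).astype(int))
            for t in grid
        ]
        thresholds[g] = grid[int(np.argmax(scores))]  # first best on ties
    return thresholds


def macro_f1(probs: np.ndarray, labels: np.ndarray, thresholds) -> float:
    """Macro-averaged F1 after thresholding each genre column."""
    preds = (probs >= thresholds).astype(int)  # broadcasts over genres
    return float(np.mean([f1_binary(labels[:, g], preds[:, g])
                          for g in range(labels.shape[1])]))
```

Thresholds are fitted on validation only and then passed unchanged to `macro_f1` on the test split, alongside a fixed 0.5 baseline, which keeps the comparison free of test-set leakage.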

…training)
- Drop the per-class threshold-tuning stage (improvement #1).
- Improvement #3: NUM_EPOCHS 3 -> 5, LR 5e-5 -> 2e-5 (canonical BERT recipe; best val-macro-F1 checkpoint is kept, so extra epochs are low-risk).
- Add a truncation check in the tokenize cell reporting plot token-length percentiles and the share of plots exceeding MAX_LENGTH.
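A truncation check along these lines might look like the sketch below. It assumes the per-plot token counts have already been computed by tokenizing without truncation; the `MAX_LENGTH` value of 256, the function name, and the chosen percentiles are placeholders rather than the notebook's real settings.

```python
import numpy as np

MAX_LENGTH = 256  # placeholder; the notebook's actual value is not stated here


def truncation_report(token_lengths, max_length: int = MAX_LENGTH) -> dict:
    """Summarize token-length percentiles and the share of over-long plots."""
    lengths = np.asarray(token_lengths)
    percentiles = {p: float(np.percentile(lengths, p)) for p in (50, 90, 95, 99)}
    # Fraction of plots that would lose tokens when truncated to max_length.
    share_truncated = float(np.mean(lengths > max_length))
    return {"percentiles": percentiles, "share_truncated": share_truncated}
```

A high truncated share would suggest raising MAX_LENGTH or changing the truncation strategy before drawing conclusions from the extra epochs.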

Add Python .gitignore

6460dfc

https://claude.ai/code/session_01TTH8YZk58oLrTaaCq8Xu37

kolodkin merged commit 6a8351c into main Mar 20, 2026

kolodkin merged 1 commit into main from claude/create-python-gitignore-bYiMn

2 participants
