faster 10 % #6
Conversation
Walkthrough

This PR adds comprehensive benchmarking infrastructure via a new benchmark_test.go file with 18+ benchmarks covering CSV processing, preprocessing, and validation operations. Concurrent performance optimizations reduce memory allocations through string-builder pre-allocation, map-based lookups, and state-machine-based string processing. Test coverage expands with CSV edge-case scenarios and validator boundary tests.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:
Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Code Metrics Report

|                     | main (a2c6969) | #6 (8d19df3) | +/-   |
|---------------------|----------------|--------------|-------|
| Coverage            | 88.8%          | 89.2%        | +0.3% |
| Files               | 10             | 10           | 0     |
| Lines               | 1340           | 1383         | +43   |
| Covered             | 1191           | 1234         | +43   |
| Test Execution Time | 13s            | 1s           | -12s  |

Code coverage of files in pull request scope (94.2% → 94.6%)

Reported by octocov
Actionable comments posted: 0
🧹 Nitpick comments (2)
processor.go (1)
380-405: Consider using `w.Write` directly to avoid an extra allocation.

The `io.WriteString(w, lineBuf.String())` call on line 400 creates a string copy from the builder. If the writer implements `io.StringWriter`, it's handled efficiently, but for a general `io.Writer`, this allocates. Consider writing the bytes directly:

```diff
- if _, err := io.WriteString(w, lineBuf.String()); err != nil {
+ if _, err := w.Write([]byte(lineBuf.String())); err != nil {
```

Alternatively, since `bytes.Buffer` (the typical writer here) implements `io.StringWriter`, the current approach may already be optimal. This is a minor optimization.

prep.go (1)
475-487: Minor: `Grow` underestimates for multi-byte pad characters.

The `Grow(len(value) + padCount)` calculation assumes single-byte padding, but `padChar` is a `rune` that could be multi-byte. For most practical use cases (padding with `'0'` or `' '`), this is fine and the builder will just grow as needed. For perfect pre-allocation with multi-byte pad characters:

```diff
- result.Grow(len(value) + padCount)
+ result.Grow(len(value) + padCount*utf8.RuneLen(p.padChar))
```

This is a minor optimization since the builder handles growth automatically.
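To make the rune-length point concrete, here is a minimal, self-contained sketch of a left-padding helper; `padLeft` and its signature are illustrative, not the repository's actual code:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// padLeft left-pads value with padChar until it reaches width runes.
// Growing by padCount*utf8.RuneLen(padChar) accounts for multi-byte pad
// characters, so the builder never has to regrow its buffer.
func padLeft(value string, padChar rune, width int) string {
	padCount := width - utf8.RuneCountInString(value)
	if padCount <= 0 {
		return value
	}
	var result strings.Builder
	result.Grow(len(value) + padCount*utf8.RuneLen(padChar))
	for i := 0; i < padCount; i++ {
		result.WriteRune(padChar)
	}
	result.WriteString(value)
	return result.String()
}

func main() {
	fmt.Println(padLeft("42", '0', 5)) // single-byte pad: 00042
	fmt.Println(padLeft("42", '†', 5)) // multi-byte pad: utf8.RuneLen('†') == 3
}
```

With a single-byte pad the extra multiplication is a no-op, which is why the original `Grow(len(value) + padCount)` works fine for `'0'` and `' '`.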
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- Makefile (1 hunks)
- README.md (1 hunks)
- benchmark_test.go (1 hunks)
- prep.go (8 hunks)
- processor.go (2 hunks)
- processor_test.go (1 hunks)
- validate.go (1 hunks)
- validate_test.go (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
processor_test.go (2)
- processor.go (1): NewProcessor (25-29)
- filetype.go (1): FileTypeCSV (13-13)

processor.go (1)
- errors.go (1): ProcessResult (99-110)

benchmark_test.go (3)
- cross_field_test.go (1): Password (556-559)
- processor.go (2): NewProcessor (25-29), Processor (14-16)
- filetype.go (1): FileTypeCSV (13-13)
🪛 LanguageTool
README.md
[grammar] ~398-~398: Ensure spelling is correct
Context: ...lts processing CSV files with a complex struct containing 21 columns. Each field uses ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🔇 Additional comments (20)
Makefile (1)
12-18: LGTM! Well-structured benchmark targets.

The separation of quick benchmarks (Small/Medium) from the full suite is a practical choice for development workflow. The regex pattern correctly matches the benchmark function names in `benchmark_test.go`.

processor.go (2)
143-148: Good pre-allocation strategy for error slice.

The 10% error rate assumption with a minimum of 16 is a reasonable heuristic that balances memory efficiency with avoiding excessive reallocations.
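As a hedged sketch of that heuristic (the names below are illustrative, not the actual identifiers in processor.go):

```go
package main

import "fmt"

// preallocErrs reserves capacity for roughly 10% of rows failing, with a
// floor of 16, so small inputs still avoid repeated slice growth.
func preallocErrs(numRecords int) []error {
	capHint := numRecords / 10
	if capHint < 16 {
		capHint = 16
	}
	return make([]error, 0, capHint)
}

func main() {
	fmt.Println(cap(preallocErrs(50)))   // below threshold, floor applies: 16
	fmt.Println(cap(preallocErrs(1000))) // 10% of 1000: 100
}
```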
152-156: Effective struct slice pre-allocation.

Pre-allocating the slice capacity to `len(records)` eliminates repeated slice growth during the processing loop.

README.md (1)
396-419: Comprehensive performance documentation.

The benchmark section effectively documents the test setup, results, and how to reproduce them. Including the hardware note is helpful for interpreting the numbers.
validate.go (1)
532-555: Excellent optimization for oneOf validator.

The map-based lookup provides O(1) validation instead of O(n), and pre-computing the error message avoids allocation on validation failures. Using `struct{}` as the map value is idiomatic Go for set semantics. The error message correctly preserves the original order of allowed values.

validate_test.go (3)
564-576: Good regression test for error message order preservation.

This test ensures the map-based optimization doesn't accidentally change the user-facing error message format.
578-611: Comprehensive edge case coverage for oneOf validator.

The test cases effectively cover single values, empty strings in the allowed list, whitespace handling, case sensitivity, and set position lookups. This provides good confidence in the map-based optimization.
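The pattern praised above can be sketched roughly like this; the type and constructor names are hypothetical stand-ins for the validator in validate.go:

```go
package main

import (
	"fmt"
	"strings"
)

// oneOfValidator holds a map[string]struct{} for O(1) membership checks,
// plus an error message pre-computed once in the original declaration order.
type oneOfValidator struct {
	allowed map[string]struct{}
	errMsg  string
}

func newOneOf(values ...string) *oneOfValidator {
	set := make(map[string]struct{}, len(values))
	for _, v := range values {
		set[v] = struct{}{}
	}
	return &oneOfValidator{
		allowed: set,
		// Joining the original slice preserves user-facing ordering even
		// though map iteration order is randomized.
		errMsg: "must be one of: " + strings.Join(values, ", "),
	}
}

func (o *oneOfValidator) Validate(value string) error {
	if _, ok := o.allowed[value]; ok {
		return nil
	}
	return fmt.Errorf("%s", o.errMsg)
}

func main() {
	v := newOneOf("red", "green", "blue")
	fmt.Println(v.Validate("green"))
	fmt.Println(v.Validate("purple"))
}
```

Building the message from the original slice, not the map, is exactly what the regression test at 564-576 is guarding.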
755-813: Valuable documentation of email regex behavior.

These boundary tests serve as documentation for the current regex behavior, making it clear which edge cases are intentionally allowed. The comments noting "allowed by current regex" are particularly helpful for future maintainers.
processor_test.go (4)
322-327: Well-designed edge case test struct.

The `EdgeCaseRecord` with explicit `name` tags provides clean mapping for testing various CSV scenarios.
329-479: Excellent edge case test coverage.

The test suite covers important scenarios: very long lines, many columns, uneven rows, empty files, header-only files, quoted fields with commas/newlines, unicode content, and whitespace handling. This provides confidence in the CSV parser's robustness.
525-541: Potential duplicate headers in `makeHeaders` helper.

The header generation logic may produce duplicate header names. For example, when `n=10`:

- `i=0`: `"col"` (no suffix since `i%10=0`)
- `i=10`: `"col0"` (since `10%10=0`)

However, lines 535-539 override the first three headers to `col1`, `col2`, `col3`, which resolves the issue for struct mapping. For the test purposes here (verifying column count), this is acceptable.
586-611: Good stress test for many rows.

Testing with 1000 rows verifies the processor handles bulk data without issues and correctly reports row counts.
prep.go (3)
271-287: Effective optimization with early return and builder.

The fast path check `!strings.ContainsAny(value, "\r\n")` avoids allocation entirely when no newlines are present, which is likely the common case. The builder-based approach with pre-allocation handles the remaining cases efficiently.
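A minimal sketch of this fast-path shape, assuming the preprocessor maps newlines to spaces (the name and the exact replacement policy are assumptions, not the repository's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// stripNewlines returns the input unchanged, with zero allocations, when it
// contains no CR/LF; otherwise it rewrites it through a pre-sized builder.
func stripNewlines(value string) string {
	if !strings.ContainsAny(value, "\r\n") {
		return value // fast path: the common case allocates nothing
	}
	var b strings.Builder
	b.Grow(len(value))
	for _, r := range value {
		switch r {
		case '\r':
			// assumption: carriage returns are dropped
		case '\n':
			b.WriteByte(' ') // assumption: newline becomes a single space
		default:
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(stripNewlines("no newlines here"))
	fmt.Println(stripNewlines("line1\nline2\r\nline3"))
}
```

The early `return value` is what makes the benchmark numbers move: most CSV fields never enter the builder path at all.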
295-327: Well-implemented state machine replaces regex.

The non-regex implementation is cleaner and more performant. The state tracking with `inSpace` correctly collapses consecutive whitespace characters into a single space.
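The `inSpace` state machine described above works roughly like this sketch (illustrative name, not the actual prep.go function):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// collapseSpaces replaces every run of whitespace with a single space.
// The inSpace flag remembers whether the previous rune was whitespace,
// which replaces the regex `\s+` with a single linear pass.
func collapseSpaces(value string) string {
	var b strings.Builder
	b.Grow(len(value))
	inSpace := false
	for _, r := range value {
		if unicode.IsSpace(r) {
			if !inSpace {
				b.WriteByte(' ') // emit one space per whitespace run
			}
			inSpace = true
			continue
		}
		b.WriteRune(r)
		inSpace = false
	}
	return b.String()
}

func main() {
	fmt.Println(collapseSpaces("a  b\t\tc \n d"))
}
```

Compared to `regexp.MustCompile(`\s+`).ReplaceAllString`, this avoids both the regex engine overhead and the intermediate allocations it implies.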
349-356: Good addition of `Grow` pre-allocation.

Adding `Grow(len(value))` to the character filtering preprocessors reduces allocations during string building.

benchmark_test.go (5)
10-55: Comprehensive benchmark record design.

The `BenchmarkRecord` struct effectively exercises a wide range of preprocessing and validation scenarios including text normalization, HTML stripping, numeric extraction, URL handling, and cross-field validation.
57-108: Well-designed benchmark data generator.

The `generateBenchmarkCSV` function creates realistic test data with:

- Whitespace variations requiring trim
- Case variations requiring normalization
- HTML content for stripping
- Special characters in phone/salary fields

This exercises the preprocessors meaningfully.
96-99: Generated UUIDs may not pass validation for all indices.

The UUID format `%08x-%04x-%04x-%04x-%012x` generates strings like `00000000-0000-0000-0000-000000000000` for `i=0`, which is a valid UUID format. Because each format verb zero-pads its segment, even small values of `i` produce correctly sized segments (e.g., for `i=1`: `00000001-0001-0001-0001-000000000001`).

Since these benchmarks test processing throughput rather than validation pass rates, this is acceptable for benchmarking purposes.
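The zero-padding behavior of those verbs can be checked directly; `fakeUUID` below is a hypothetical stand-in for the generator in benchmark_test.go:

```go
package main

import "fmt"

// fakeUUID formats an index as a UUID-shaped string. Each verb zero-pads its
// segment (%08x, %04x, ..., %012x), so every index yields 36 characters.
func fakeUUID(i int) string {
	return fmt.Sprintf("%08x-%04x-%04x-%04x-%012x",
		i, i%0x10000, i%0x10000, i%0x10000, i)
}

func main() {
	fmt.Println(fakeUUID(0)) // 00000000-0000-0000-0000-000000000000
	fmt.Println(fakeUUID(1)) // 00000001-0001-0001-0001-000000000001
}
```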
110-176: Well-structured benchmark suite.

The benchmarks properly:

- Generate data before the timer reset
- Use `b.ReportAllocs()` for allocation tracking
- Reuse the processor instance across iterations
- Cover multiple data sizes (100, 1K, 10K, 50K records)

This provides good coverage for performance regression detection.
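The structure listed above looks roughly like this self-contained sketch, using `strings.ToUpper` as a stand-in for the real Processor (which is not shown here):

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// runBench demonstrates the benchmark shape the review praises: build the
// input once outside the timed region, report allocations, reset the timer
// to exclude setup, and reuse one transformation across all b.N iterations.
func runBench() testing.BenchmarkResult {
	data := strings.Repeat("name,email\nalice,a@example.com\n", 1000)
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs() // track allocations per op
		b.ResetTimer()   // setup cost stays out of the measurement
		for i := 0; i < b.N; i++ {
			_ = strings.ToUpper(data)
		}
	})
}

func main() {
	r := runBench()
	fmt.Println(r.N > 0)
}
```

In the real suite the same pattern is repeated per data size (100, 1K, 10K, 50K records), which is what lets `benchstat`-style comparisons catch regressions at each scale.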
484-504: Useful isolated benchmark for CSV output.

Testing the output path separately helps identify if performance issues are in processing vs. serialization.
Summary by CodeRabbit
Documentation
Chores
- `make bench` command for running quick performance benchmarks.

Performance