
Track rotated #182

Merged
merged 1 commit into open-telemetry:main from track_rotated on Jun 16, 2021

Conversation

rockb1017
Contributor

@rockb1017 rockb1017 commented Jun 11, 2021

Continuing #168.
Fixes #85.

Instead of closing all files at the end of a poll cycle, keep them open until the next poll cycle. Only after opening all matched files do we consume the files kept open from the previous poll, and then close them. This overlap at each cycle ensures we retain a handle to files that have been moved away, so we can consume any logs that were written to them but not yet read before rotation.
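
For illustration, here is a minimal Go sketch of the overlap described above, not the actual implementation; the `poller` and `Reader` types and the `openMatchedFiles` helper are hypothetical stand-ins:

```go
package fileinput

// Reader is a hypothetical stand-in for the operator's reader type: one
// open file handle plus the read offset into it.
type Reader struct{ /* open *os.File, saved offset, splitter, ... */ }

func (r *Reader) ReadToEnd() { /* emit entries until EOF */ }
func (r *Reader) Close()     { /* close the underlying handle */ }

type poller struct {
	// Readers carried over from the previous poll cycle. If their files
	// were rotated away since then, the still-open handles let us drain
	// any unread data before closing.
	previousReaders []*Reader
}

func (p *poller) openMatchedFiles() []*Reader { return nil /* glob + open */ }

func (p *poller) poll() {
	// 1. Open readers for every file matching the glob *before* touching
	//    the readers carried over from the last cycle.
	current := p.openMatchedFiles()

	// 2. Only now consume and close the carried-over readers. A file that
	//    was moved away between cycles is still held open here, so logs
	//    written to it before rotation are not lost.
	for _, r := range p.previousReaders {
		r.ReadToEnd()
		r.Close()
	}

	// 3. Read the current cycle's files, but keep them open until the
	//    next cycle instead of closing them now.
	for _, r := range current {
		r.ReadToEnd()
	}
	p.previousReaders = current
}
```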

In addition, the "Move/Create" rotation tests were not being skipped on Windows; I marked them as skipped.
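
For reference, the standard Go pattern for such a skip checks `runtime.GOOS`; the test name and messages below are illustrative, not copied from the PR:

```go
package fileinput_test

import (
	"runtime"
	"testing"
)

func TestMoveCreateRotation(t *testing.T) {
	if runtime.GOOS == "windows" {
		// Windows refuses to rename a file while another process holds it
		// open, so move/create rotation cannot be exercised there.
		t.Skip("move/create rotation tests are not supported on Windows")
	}
	// ... rename the log file, create a fresh one in its place, and
	// assert that no entries are lost across the rotation.
}
```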

@rockb1017 rockb1017 requested a review from a team June 11, 2021 21:39
@codecov

codecov bot commented Jun 11, 2021

Codecov Report

Merging #182 (9e316c3) into main (44b6bf5) will increase coverage by 0.0%.
The diff coverage is 91.6%.

Impacted file tree graph

@@          Coverage Diff          @@
##            main    #182   +/-   ##
=====================================
  Coverage   75.6%   75.7%           
=====================================
  Files         95      95           
  Lines       4360    4371   +11     
=====================================
+ Hits        3299    3311   +12     
  Misses       738     738           
+ Partials     323     322    -1     
| Impacted Files | Coverage Δ |
|---|---|
| operator/builtin/input/file/config.go | 73.4% <0.0%> (ø) |
| operator/builtin/input/file/file.go | 73.5% <100.0%> (+2.1%) ⬆️ |
| operator/builtin/input/file/reader.go | 65.8% <100.0%> (+0.7%) ⬆️ |

@rockb1017 rockb1017 mentioned this pull request Jun 12, 2021
@rockb1017
Contributor Author

https://github.com/open-telemetry/opentelemetry-log-collection/pull/183/checks?check_run_id=2808466320
I tested that configuration on GitHub Actions. It failed on Windows.

@djaglowski
Member

> https://github.com/open-telemetry/opentelemetry-log-collection/pull/183/checks?check_run_id=2808466320
> I tested that configuration on GitHub Actions. It failed on Windows.

@rockb1017 What conclusion are you drawing from this?

@rockb1017
Contributor Author

I naively deleted a block of code that seemed inefficient and unnecessary. It was for the "copy/truncate" cases, and we have been seeing flakiness in the "copy/truncate" tests. I added the block back and the tests passed.
Thanks for the discussion we had today, @djaglowski.

Could you run it a few more times just to be sure?
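
As a rough illustration of what a copy/truncate code path has to handle: if the file on disk has become smaller than the saved read offset, the file was truncated in place and reading must restart from the beginning. This sketch is hypothetical; the function and variable names are not from the actual code:

```go
package fileinput

import "os"

// checkTruncated returns the offset to resume reading from. A hypothetical
// helper: when the file shrinks below the saved offset, it was truncated
// in place (copy/truncate rotation), so reading restarts at zero.
func checkTruncated(f *os.File, offset int64) (int64, error) {
	info, err := f.Stat()
	if err != nil {
		return offset, err
	}
	if info.Size() < offset {
		// The old offset points past EOF; resetting avoids silently
		// reading nothing from the truncated file.
		return 0, nil
	}
	return offset, nil
}
```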

@rockb1017
Contributor Author

[screenshot: benchmark results]
100% data ingestion rate.
(I am generating 100,000 + 3 messages per pod and I deployed 3 pods, so 300,009 total.)

Member

@djaglowski djaglowski left a comment


This looks good to me.

A few notes on this PR.

  1. See explanation here of what was wrong with the previous attempt.
  2. In addition to believing that we have a good handle on the issues involved, I've rerun unit tests about 10 times and have observed no failures.
  3. @rockb1017 has incorporated this branch into a build of the collector, and run it on k8s as part of a benchmark. He observed 100% delivery rate, and has posted the results here.
  4. I'm confident that this is an improvement upon the previous release because it verifiably addresses a clear problem that was identified and does not regress in any regard that is currently tested. However, it is possible that further edge cases may be identified in the future.
  5. I am drafting a design doc that explains how this operator works. I'll share this and will welcome feedback, especially as it pertains to possible improvements.
  6. The codebase for this operator could be improved through refactoring. This is out of scope for this PR, but will be an ongoing focus for this project.
  7. Similarly, the test suite for this operator needs improvement. Some tests are currently non-deterministic. Some edge cases could be tested independently. This will also be an ongoing effort, but should not hold up this PR.

@tigrannajaryan, I'd appreciate any thoughts you have on these changes, or on the suggestion that we merge this and continue to improve from there.

@tigrannajaryan
Member

@rockb1017 can you please update the PR description, squash the commits into one, and add a commit message that fully explains why the changes are made and how they fix the problem.
Thanks for spending time on this and for your patience. @djaglowski thanks a lot for your help and for the summary in your approval.

…s open till next poll cycle. Fix test cases for new implementation and skip move-create tests for Windows
@tigrannajaryan tigrannajaryan merged commit cdbb6d6 into open-telemetry:main Jun 16, 2021
@tigrannajaryan
Member

Thanks @rockb1017 and @djaglowski !

@rockb1017 rockb1017 deleted the track_rotated branch June 16, 2021 15:56
@jsirianni jsirianni mentioned this pull request Jun 25, 2021
jsirianni added a commit to observIQ/stanza that referenced this pull request Jul 8, 2021
* reorg test files only, ported from open-telemetry/opentelemetry-log-collection#165

* port otel enhanced file input benchmarking open-telemetry/opentelemetry-log-collection#166

* skip closing files that are already closed

* port otel file rotation tracking open-telemetry/opentelemetry-log-collection#182

* fix poll()
Development

Successfully merging this pull request may close these issues.

file_input - Data loss when tailing symlink file with log rotation (k8s container logs)