
Move filewatch lib here, refactor and add new features. #171

Merged

Conversation

guyboertje
Contributor

@guyboertje guyboertje commented Mar 16, 2018

NOTES for Reviewers:

There are a lot of code changes, mainly to split the monolithic procedures filewatch had before into separate classes.

BEFORE (filewatch 0.9.0):
watched_file.rb -> WatchedFile
  Responsibilities
    Holds lifecycle state
    Holds file info of physical file
    Holds last-known progress position

tail_base.rb & observing_tail.rb -> ObservingTail
  @sincedb -> 'sincedb records' hash, inode => position (int) 
  Responsibilities
    Reading and writing sincedb file
    Reacting to the block callback signals (e.g. create, unignore)
    Opening files
    Adding or removing sincedb hash entries depending on the callback signal
    Reading the opened file, calling the file input with content and updating the watched_file position and sincedb position
    
watch.rb -> Watch class
  Holds WatchedFile hash, path => WatchedFile instance 
  Responsibilities
    "Discovery"
      Discover new files in glob, create WatchedFile instance
      Discard if path is excluded or ignorable
      Add to @files.
    "Iteration" with block based callback to ObserveTail
      select and iterate over watched_files in closed state
        move it to watched state if necessary.
      select and iterate over watched_files in ignored state
        move it to watched state if necessary and signal 'unignore'
      if there is space in the active files window
        select some files and iterate over watched_files in watched state
          move to the active state
          signal 'create' or 'create_initial'
      else
        periodically warn that there are watched entries waiting to be activated.
      select and iterate over watched_files in active state
        if its not readable
          prepare to delete entry
          move to unwatched state
          signal 'delete'
        if its closable
          signal 'timeout'
          move to closed state
        if it has changed inode
          swap the inode
          signal 'delete'
          signal 'create'
          # probable bug, should be delete (old inode), swap, create perhaps
        if it has shrunk in size
          signal 'delete'
          signal 'create'
        if it has grown
          signal 'modify'
      actually delete entries
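The active-state checks in the iteration above can be sketched as a tiny state machine. This is hypothetical, simplified Ruby mirroring the outline's signal names, not the plugin's actual classes:

```ruby
# Hypothetical, simplified sketch of the active-state checks listed above.
# WatchedFileSketch and the signal symbols follow the outline only.
class WatchedFileSketch
  attr_accessor :size, :last_size

  def initialize(size = 0)
    @size = size
    @last_size = size
  end
end

# Returns the signals the iteration would emit for an active file.
def signals_for(watched_file)
  signals = []
  if watched_file.size < watched_file.last_size
    signals << :delete << :create   # shrunk: re-read from the start
  elsif watched_file.size > watched_file.last_size
    signals << :modify              # grown: read the appended bytes
  end
  watched_file.last_size = watched_file.size
  signals
end
```

The real implementation also handles inode swaps, unreadable files and closable files; the sketch covers only the shrink/grow branches.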

AFTER (filewatch in file input):
  bootstrap.rb
    Defines Constants and InodeStruct as well as Windows vs *nix specific adaptations.
    Requires dependencies

  watched_files_collection.rb -> WatchedFilesCollection
    Wraps the watched files hash, with extra methods.

  sincedb_collection.rb -> SincedbCollection
    Wraps the sincedb hash, with extra methods.
    Read and write physical sincedb file.
    Serialize and deserialize (v1, v2) records.
    Encapsulates logic for finding and associating discovered files with previously seen sincedb records.
  
  sincedb_value.rb -> SincedbValue
    Is the 'value' in the sincedb collection.
    Tracks position, last changed time and last known path.
    Holds the currently associated WatchedFile reference.
  
  settings.rb -> Settings
    The options passed from the user config + defaults
    Globally available under the OPTS constant
  
  discoverer.rb -> Discoverer
    All discovery functionality moved to this class.
    Holds references to watched_files and sincedb collections.
    Adds and associates discovered files to the collections.
  
  watch.rb -> Watch
    Provides mutex access to the discover, iterate and quit functions.
    Delegates discover to Discoverer
    Delegates processing of closed, ignored, watched and active watched_files collection value entries to the appropriate processor (Tail or Read).
  
  observing_base.rb -> ObservingBase
    Is a mixin module for ObservingTail and ObservingRead
    Provides the external entry points
      new(opts)
      quit
      watch_this(path)
  
  observing_tail.rb -> ObservingTail
    Builds TailHandlers::Processor.new
           TailHandlers::Dispatch.new
  
  observing_read.rb -> ObservingRead
    Builds ReadHandlers::Processor.new
           ReadHandlers::Dispatch.new
  
  tail_handlers/*
    Processor and Dispatch classes to adapt the WatchedFile handling for Tail.
    Handler classes (provide specific `handle_specifically(watched_file)` and `update_existing_specifically(watched_file, sincedb_value)`):
      Base (handlers super class)
      Create
      CreateInitial
      Delete
      Grow
      Shrink
      Timeout
      Unignore
  
  read_handlers/*
    Processor and Dispatch classes to adapt the WatchedFile handling for Read.
    Handler classes (provide specific `handle_specifically(watched_file)`):
      Base (handlers super class)
      ReadFile
      ReadZipFile
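The Handler hierarchy described above (a Base "handlers super class" plus per-signal subclasses supplying `handle_specifically`) can be sketched as a template-method pattern. Names are illustrative, taken from the outline rather than the plugin's source:

```ruby
# Hypothetical sketch of the Handler interface described above: Base
# provides the shared flow; concrete handlers supply handle_specifically.
class BaseHandler
  def handle(watched_file)
    # shared pre-processing would go here (logging, sincedb lookup, ...)
    handle_specifically(watched_file)
  end

  def handle_specifically(_watched_file)
    raise NotImplementedError, "subclasses must implement handle_specifically"
  end
end

class TimeoutHandler < BaseHandler
  def handle_specifically(watched_file)
    "timeout: #{watched_file}"
  end
end
```

The Processor then only needs to pick a handler for each signal and call `handle`, keeping the per-signal behaviour isolated in one small class each.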

Rakefile Outdated
end

desc "Run full check with custom Logstash path"
task :custom_ls_check, :ls_dir do |task, args|
Contributor

not sure why this is required? what I usually do is

LOGSTASH_SOURCE=1 bundle exec ...

not sure we need/want to add yet another custom rake task?

Contributor Author

I use this to run the specs against any downloaded LS as well as my LS dev repo.
I have this:

logstash-2.1.3
logstash-2.2.2
logstash-2.3.4
logstash-2.4.0
logstash-2.4.1
logstash-5.0.0
logstash-5.2.1
logstash-5.2.2
logstash-5.3.0
logstash-5.4.3
logstash-5.5.0
logstash-5.5.1
logstash-5.6.2
logstash-5.6.4
logstash-5.6.8-SNAPSHOT
logstash-6.0.0
logstash-6.1.0
logstash-6.1.1
logstash-6.1.2
logstash-6.2.1
logstash-6.2.2

Contributor

ok, that looks super useful, my problem is that this is specific to your workflow and is not really documented or used anywhere else, or is it?

project.sourceCompatibility = 1.8

dependencies {
compileOnly group: 'org.jruby', name: 'jruby-complete', version: "9.1.13.0"
Contributor

do we need this double dependencies on jruby-complete ?

Contributor Author

I'm unsure. If buildscript defines repositories and dependencies - does this mean that we can remove the root repositories and dependencies at lines 41 and 48?

Contributor Author

@guyboertje guyboertje Apr 3, 2018

removed buildscript - seems to work.

@colinsurprenant
Contributor

About rake/gradle: I see we have very different rake/gradle tasks from other "hybrid" java plugins. I am wondering if we want to or should try to be consistent with that?

# all the required constants and files
# defined in one place
module FileWatch
HOST_OS_WINDOWS = (RbConfig::CONFIG['host_os'] =~ /mswin|mingw|cygwin/) != nil
Contributor

FYI in Logstash https://github.com/elastic/logstash/blob/5e49b5d9cbb0a66482555a2ae57c0e89379f3923/lib/bootstrap/environment.rb#L45-L47 this is how we test for the windows platform. Should we be consistent?

    def windows?
      ::Gem.win_platform?
    end

Contributor Author

Thanks for pointing this out. It is a leftover from filewatch being standalone.

Will refactor this to use logstash/environment.


if HOST_OS_WINDOWS
require "winhelper"
FILEWATCH_INODE_METHOD = :win_inode
Contributor

suggestion: like we do in LS for the windows? method, instead of using a constant maybe it would be cleaner to define methods like win_inodes? and nix_inodes?

FILEWATCH_INODE_METHOD = :nix_inode
end

if defined?(JRUBY_VERSION)
Contributor

in which case would that be undefined?

Contributor Author

👍

require_relative "read_handlers/base"

# TODO [guy] make this a config option, perhaps.
CurrentSerializerClass = SincedbRecordSerializer
Contributor

since we don't support alternate serializers, I suggest we drop the CurrentSerializerClass and directly set the SincedbRecordSerializer into SincedbCollection constructor.

Contributor Author

I'm thinking that we will need to serialize to a ES based Shared State store soon - effectively JSON.

Contributor

yeah, but I am not sure this "empty" decoupling brings any actual or future value? for now it's just an "empty" abstraction.

Contributor Author

I can remove it.

@input.empty?
end
end
end
Contributor

logstash core exposes BufferedTokenizer (logstash-core/lib/logstash/util/buftok.rb) do we also need it here?

Contributor Author

Will switch. 👍

Contributor Author

This is an addition attr_reader :input. I will have to reopen the class.

Contributor

Let me see... AFAICT, attr_reader :input is only used for the FileWatch::WatchedFile.reset_buffer method?

def reset_buffer
  @buffer.input.clear
end

But this could be replaced with the BufferedTokenizer.flush method. It will be slightly less efficient but this is not a code hotspot so it should not have any impact?
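The flush-based replacement suggested above can be sketched with a minimal stand-in tokenizer. MiniTokenizer is hypothetical, standing in for LogStash's BufferedTokenizer; only the flush-instead-of-input.clear idea is the point:

```ruby
# Minimal stand-in for a line tokenizer, illustrating the suggestion
# above: use flush to discard a buffered partial line instead of
# exposing the internal input buffer via attr_reader.
class MiniTokenizer
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @input = +""
  end

  # Appends data and returns the complete lines found so far.
  def extract(data)
    @input << data
    lines = @input.split(@delimiter, -1)
    @input = lines.pop || +""  # keep the trailing partial line buffered
    lines
  end

  # Returns and clears any buffered partial content.
  def flush
    buffered = @input
    @input = +""
    buffered
  end
end

# reset_buffer via flush, no attr_reader :input needed.
def reset_buffer(buffer)
  buffer.flush
  nil
end
```

With this shape, WatchedFile never needs to reach into the tokenizer's internals, so the Logstash core BufferedTokenizer can be reused as-is.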

end

def add_path(path)
return if @watching.member?(path)
Contributor

this will likely fail if paths are not normalized, is that a consideration here?

Contributor

let me rephrase that: this will not catch duplicate paths unless they are normalized, is that a consideration here?

Contributor Author

👍

Contributor Author

I looked at this again. The path is either a file or a glob. If the glob is functionally the same as one we have seen before but not the same string, it will be added; but at discovery any already discovered files will not be added to the collection again - so only a small overhead.

I checked the filewatch repo. This has been so since Aug 30, 2011. I don't recall any issues with this.
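The normalization concern discussed above can be sketched with a hypothetical registry that expands paths before the membership check, so "./logs/a.log" and "logs/a.log" dedupe to one entry. PathRegistry and its method names are illustrative, not plugin code:

```ruby
require "set"

# Hypothetical sketch of normalizing paths before the member? check
# discussed above, so textually different but equivalent paths dedupe.
class PathRegistry
  def initialize
    @watching = Set.new
  end

  # Returns true if the path was newly added, false if already watched.
  def add_path(path)
    normalized = File.expand_path(path)  # resolve ".", "..", relative paths
    return false if @watching.member?(normalized)
    @watching << normalized
    true
  end
end
```

As noted in the thread, globs that are functionally equivalent but textually different would still slip past a string-based check; expansion only catches the simple cases.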

Contributor

ok, that seems reasonable ...

# end
#
# If your temp directory is not on the same filesystem as the file you're
# trying to write, you can provide a different temporary directory.
Contributor

should we add a check for this? or we could create the tmp file in the same directory as the target file?

Contributor Author

I don't know. It's copied verbatim from the rails code. We control the call to this method.

Contributor

ok, upon closer inspection the implementation is not sync'ed with the comments. there is no usage of a temp directory and this implementation does not match the rails link above.

Contributor Author

I'll remove the comment.

@@ -0,0 +1,68 @@
# encoding: utf-8
require_relative 'bootstrap' unless defined?(FileWatch)
Contributor

why is this required?

Contributor Author

IMO because I wanted the file input to not need to know about bootstrap.rb. So the entry point is either ObservingTail or ObservingRead.

Contributor

but can't the require order be done in a way that we don't need this kind of conditional require? in my experience these become hard to maintain.

Contributor Author

I can remove the unless.

include LogStash::Util::Loggable
include ObservingBase

def build_specific_processor(settings)
Contributor

as discussed, I think it would be useful to document somehow the interface for the ObservingXxx classes

Contributor Author

👍

# observer here is the file input
dispatcher = TailHandlers::Dispatch.new(sincedb_collection, observer, @settings)
watch.subscribe(dispatcher)
sincedb_collection.write("subscribe complete - shutting down")
Contributor

"subscribe complete - shutting down" is inconsistent with the ObservingRead class

Contributor Author

Will make the same.

end
end end

require_relative "dispatch"
Contributor

having a require at the end of the file is a bit weird. would it be better to do a require "base" at the beginning of dispatch.rb?

Contributor Author

👍

# If 'delete' is specified then the file will be deleted.
# If 'log' is specified then the full path of the file is logged to the file specified
# in the `file_completed_log_path` setting.
config :file_completed_action, :validate => ["delete", "log"], :default => "delete"
Contributor

  • could there be just a "do nothing" option? no delete, no log?
  • also, are both options mutually exclusive?

Contributor Author

"do nothing" - I suppose so. This was a last minute change - people have asked for this since forever. I was going to generate a 'special' event tagged with EOF having the path and force people to handle it in their output section - IIRC Jordan suggested to not do the event thing and give people an action to take.

Doing nothing is problematic though, because if a user is reading, say, 10000 files, the user does not know when a file is done with. The discover glob order is indeterminate and OS specific. Do you think we should sort the watched file collection before the iteration? The order would then be known and we could document that. Then again, with striped reading, files will be done with in an order determined by their size, the smaller files finishing before the larger files regardless of the glob order or the collection's sorted order.

"do both" - yeah sure.


It would be great to reorder files.
It would be even better to have an option to tell how the sort is done: by file name, or by last modified time (from the oldest to the newest).
I’d love the second option.

Contributor Author

@fbaligand
Do you see yourself using striped reading?


@guyboertje
I just looked at "striped reading"; if I understand correctly, it allows reading files in parallel, with a chunked read mechanism.
IMHO, it doesn't allow reading files in a particular order. That feature is particularly useful when start_position=beginning and, at Logstash startup, there are already numerous files. It can be important that files are read in a specific order: from the oldest to the newest, or alphabetically.

Contributor Author

@guyboertje guyboertje Apr 4, 2018

@fbaligand
Sorry, my question was a general one, not related to discovered file collection order.

I am happy to add collection ordering, with a new settings file_sort_by -> "date"|"path" and file_sort_direction -> "asc"|"desc".
Q: do these ^ settings names and values make sense? Are they useful in tail mode (or which defaults are better for existing users in tail mode, IMO path + asc)?


That sounds great !

  • Maybe replacing the "date" value with "last_modified" would be more meaningful?
  • Currently (so in tail mode), even if start_position=end, the file input continues to discover new files every 15 seconds (by default) and read them from the beginning. During these 15s, several files could arrive; that's why I find it relevant to apply ordering in tail mode.
  • IMO, I prefer the default to be date + asc, as it means that files are processed as they arrived in time.


@guyboertje
I just looked at FileBeat options for that (https://www.elastic.co/guide/en/beats/filebeat/current/configuration-filebeat-options.html#scan-sort).
They name it scan.sort and scan.order.
I like the prefix scan_, I find it meaningful.
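The file_sort_by / file_sort_direction (or Filebeat-style scan.sort / scan.order) idea discussed above can be sketched as a small sort helper. The setting names and the [path, mtime] data shape are assumptions from the thread, not the final plugin API:

```ruby
# Hypothetical sketch of the sorting settings discussed above.
# files is an array of [path, mtime_seconds] pairs.
def sort_discovered(files, sort_by: "path", direction: "asc")
  key_index = sort_by == "date" ? 1 : 0          # "date" sorts by mtime
  sorted = files.sort_by { |entry| entry[key_index] }
  direction == "desc" ? sorted.reverse : sorted
end
```

With "date" + "asc" (the default preferred in the thread for read mode), files are processed in the order they arrived in time.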

# If no changes are detected in tracked files in the last N days their sincedb
# tracking record will expire and not be persisted.
# This option protects against the well known inode recycling problem. (add reference)
config :sincedb_clean_after, :validate => :number, :default => 14 # days
Contributor

idea: could the actual file cleanup/delete be triggered by the sincedb cleanup?


I love this new option !
it could potentially fix issue #173


One other feature to fix the inode recycling problem would be to regularly remove entries in sincedb where the inode's file does not exist anymore. For example every discover_interval seconds.
I would love such a feature !
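The sincedb_clean_after behaviour described above (expire tracking records untouched for N days so recycled inodes are not mistaken for old files) can be sketched like this. The record shape is hypothetical; the real sincedb values are richer:

```ruby
# Sketch of sincedb_clean_after: drop records whose last-change time
# is older than the cutoff, before the sincedb is persisted.
SECONDS_PER_DAY = 24 * 3600

def expire_sincedb(records, clean_after_days, now = Time.now.to_i)
  cutoff = now - clean_after_days * SECONDS_PER_DAY
  records.reject { |_inode, value| value[:last_changed] < cutoff }
end
```

An inode freed by a deleted file and later reused then gets a fresh record (position 0) instead of inheriting the stale position of the old file.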

# See the option `max_open_files` for more info, but note in "read" mode, close_older is
# not required as files are closed at EOF
# The default set internally is very large (the maximum positive Ruby integer)
config :read_iterations, :validate => :number, :default => (2**(0.size * 8 - 2) - 1)
Contributor

I don't understand this option - maybe we should rephrase it.

Contributor

why not reuse the FIXNUM_MAX constant?

Contributor Author

👍 Constant.

Is the description confusing or the setting name? both?

Contributor

for me both I guess, but if we have a good description then maybe the setting name will make more sense to me 😆

# Specify the size in bytes of each chunk read in each read iteration
# See `read_iterations` to see why and when to change this from the default.
# The default set internally is 32768 (32KB)
config :file_chunk_size, :validate => :number, :default => 32768
Contributor

I see that 32768 is defined twice (also in bootstrap.rb), should it be a constant?

Contributor Author

👍


# Join by ',' to make it easy for folks to know their own sincedb
# generated path (vs, say, inspecting the @path array)
@sincedb_path = File.join(sincedb_dir, ".sincedb_" + Digest::MD5.hexdigest(@path.join(",")))
@sincedb_path = sincedb_dir.join(".sincedb_" + Digest::MD5.hexdigest(@path.join(",")))
Contributor

I notice code duplication with the sincedb filename generation, we should DRY it up
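The duplication flagged above could be DRYed up with one helper that both call sites share. This is an illustrative sketch of the idea, not the final refactor; the join-by-',' comment is preserved from the original code:

```ruby
require "digest/md5"

# Join by ',' to make it easy for folks to reproduce their own
# generated sincedb filename from their path config.
def sincedb_filename_for(paths)
  ".sincedb_" + Digest::MD5.hexdigest(paths.join(","))
end

def sincedb_path_for(sincedb_dir, paths)
  File.join(sincedb_dir, sincedb_filename_for(paths))
end
```

Both the legacy and the new sincedb path derivation then go through one function, which also makes the legacy-file rename logic easier to test.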

end

if tail_mode?
require "filewatch/observing_tail"
Contributor

do we need dynamic class loading here? I usually try to avoid that unless there is a good reason to. Sometimes, depending on test coverage, it will hide errors in one or the other file.

@colinsurprenant
Contributor

Ok, just completed a first review pass.
Note that given the complexity & amount of code here (it is basically a completely new file input plugin) it will be hard to meticulously review all the code here. We will have to trust the tests/specs and also proceed with some integration tests and/or beta tests.

Have we considered creating a whole new plugin instead of updating the existing one?

require "jruby_file_watch"
end

if HOST_OS_WINDOWS && defined?(JRUBY_VERSION)
Contributor Author

Note to self: JRuby is guaranteed.

# or allow you to specify any delimiter token you so choose, which will then
# be used by String#split to tokenize the input data

attr_reader :delimiter, :input
Contributor

I also note (following our discussion on attr_reader :input below) that this copy of BufferedTokenizer also exposes delimiter but AFAICT this is not required by our code? I will comment on where I believe this was required, but I think it is actually a bug?

end

def buffer_delimiter_byte_size
@buffer.delimiter_byte_size
Contributor

AFAICT BufferedTokenizer does not have a delimiter_byte_size method. It looks like the original intent might have been @buffer.delimiter.bytesize ? hence the need to expose attr_reader :delimiter ?
In any case, can't this be replaced with:

def buffer_delimiter_byte_size
  @settings.delimiter_byte_size
end

or

def buffer_delimiter_byte_size
  @settings.delimiter.bytesize
end

so that attr_reader delimiter is not required and we can reuse the Logstash BufferedTokenizer ?

@guyboertje
Contributor Author

CHANGES

Rakefile task :custom_ls_check. Removed

compileOnly group: 'org.jruby'. Removed

change def windows?. Done

define methods like win_inodes? and nix_inodes? I decided to define two modules to mix into the WatchedFile class depending on LogStash::Environment.windows? or not. Better?

defined?(JRUBY_VERSION). Fixed

CurrentSerializerClass = SincedbRecordSerializer. Removed CurrentSerializerClass, use SincedbRecordSerializer

use logstash core BufferedTokenizer. See below

also exposes delimiter. See below

replaced with the BufferedTokenizer.flush method? I opted for this solution, removed buftok file but not the test - making sure the LS version of BufferedTokenizer still does what we need it to do

return if @watching.member?(path). Answered inline, this code has been as such since Aug 30, 2011. I don't recall any issues with this

trying to write, you can provide a different temporary directory. should we add a check for this? Replied inline, IMO no action needed

require_relative 'bootstrap' unless defined?(FileWatch). Completely reorganised require order throughout

document somehow the interface contracts. Added documentation in the ObservingBase class. However, in preparing the documentation it became clear to me that the Processor - Dispatch separation no longer made sense, so I merged Dispatch into Processor with the necessary adjustments. Renamed ReadHandlers and TailHandlers to ReadMode and TailMode, and moved the Handler classes into a Handlers module namespace to better reflect their intention without needing to add _handler to each file and class name

sincedb_collection.write("subscribe complete - shutting down"). Unified both arguments to this call

require_relative "dispatch" - require at the end of the file is a bit weird. Completely reorganised require order throughout

WatchedFile buffer_delimiter_byte_size unused. Removed

could there be just a "do nothing" option? no delete, no log? Answered inline. Do nothing is undesirable IMO

also, are both options mutually exclusive? Refactored to do one, the other, or both via a third option, log_and_delete. QUESTION: Should we do any file-exists checks when writing? How will a user effectively rotate the file?

read_iterations - I don't understand this option. Renamed to file_chunk_count and reworded comments - better?

I see that 32768 is defined twice (also in bootstrap.rb), should it be a constant? Refactored to make both constants - needed require "filewatch/bootstrap" very early

sincedb filename generation, we should DRY it up. Refactored to handle sincedb_path determinism in one pass. Added a test to verify that a legacy sincedb file was renamed

do we need dynamic class loading here? Refactored to load both module trees but only use one

ADDITIONS

Added sorting of watched_file_collection as per Fabian's request. Added sort specs.

run_thread_proc.call
sleep 0.25
sleep 0.2
Contributor

idem

Contributor Author

👍 both

expect(subject.extract("hello\r\nworld\r\n")).to eq ["hello", "world"]
end
end
end
Contributor

👍

@colinsurprenant
Contributor

@guyboertje did another review pass! good stuff.
one observation worth mentioning: we seem to have an inconsistency problem WRT env vars... I think we need to verify this against https://github.com/elastic/logstash/blob/57e7a8a56bdbf81647d6e116e75263796d6daf47/bin/logstash.lib.sh#L44-L48

@fbaligand

@guyboertje
As requested, I just opened an issue to add "clean_removed" option : #176

sleep 0.25
first_lsof = lsof_proc.call
expect(first_lsof.scan(file_path).size).to eq(1)
wait(1).for{lsof_proc.call.scan(file_path).size}.to eq(1)
Contributor

👍

@@ -0,0 +1 @@
1.0.0
Contributor

ok! I wonder how we manage jar versions in other plugins?! this seems like a reasonable strategy. the other strategy would be to not have any version number on the jar since it is not published outside the plugin?

@colinsurprenant
Contributor

I left a few more comments but overall I am LGTM at this stage!

  • Maybe a closer look at the ensure block potential issue
  • I suggest we do a beta testing cycle since this is basically an almost complete rewrite?

@colinsurprenant
Contributor

Ah also, for the jar versioning - we should open a separate thread/issue to define a consistent strategy for jar(s) versioning across plugins.

@guyboertje
Contributor Author

guyboertje commented Apr 23, 2018

For my part, I like the jar to have a version, more so if the Java code is where the main functionality lies. Mainly for my own peace of mind, in seeing which version is inside the gem (I had to yank dissect 1.1.3 because it had the jar from 1.1.1 in it.)

I suppose, for all hybrid plugins, we need to make sure that no jars get committed and pushed to GitHub, i.e. a gem publish will always rebuild the latest jar and include it in the gem.

@guyboertje
Contributor Author

About the GZIP ensure block.
This is shamelessly copied from the S3 input.
I will move the code to a private method and wrap the method call in a rescue.

begin
closeable.close
rescue Exception # IOException can be thrown by any of the Java classes that implement the Closeable interface.
# ignore this
Contributor

👍
I'd suggest adding a log.warn here with something like "ignoring close exception blablabla" so that if something else breaks somewhere else, we can maybe see a relation with a close exception if it happened.


private

def close_java_closeable(closeable)
Contributor

or maybe close_and_ignore or something like that? because this method does 2 things, it closes but also explicitly ignores exceptions.

begin
closeable.close
rescue Exception # IOException can be thrown by any of the Java classes that implement the Closeable interface.
# ignore this
logger.warn("Ignoring an IOException when closing an instance of #{closeable.class.name}")
Contributor

suggestion: add the exception detail too?

end
sincedb_collection.unset_watched_file(watched_file)
end

private

def close_java_closeable(closeable)
def close_and_ignore_ioexception(closeable)
Contributor

👍

rescue Exception # IOException can be thrown by any of the Java classes that implement the Closeable interface.
logger.warn("Ignoring an IOException when closing an instance of #{closeable.class.name}")
rescue Exception => e # IOException can be thrown by any of the Java classes that implement the Closeable interface.
logger.warn("Ignoring an IOException when closing an instance of #{closeable.class.name}", "exception" => e.class.name, "message" => e.message)
Contributor

I believe our "normal" exception logging pattern is more like this no?

logger.warn("some problem happened", :exception => e, :backtrace => e.backtrace)

Contributor Author

I am a little bit wary of passing an object as the value in a logging statement because it simply calls to_s on the object.
I will change this case to :exception => e

We don't need the backtrace in this case because we know exactly where it is happening.

Contributor

👍

@colinsurprenant
Contributor

While looking at the exception logging throughout the plugin, I think we should follow up with a cleanup issue to have a more uniform way of logging using the following pattern:

logger.warn("some problem happened", :exception => e, :backtrace => e.backtrace)
logger.info("some info", :attribute1 => foo.value , :attribute2 => bar.value)
...

@colinsurprenant
Contributor

LGTM²

@fbaligand

Great !

@guyboertje
Contributor Author

Now for the doc changes.

@guyboertje
Contributor Author

I am going to merge this and deal with the doc changes in a new PR.

@guyboertje guyboertje merged commit c23d776 into logstash-plugins:master Apr 24, 2018
@guyboertje guyboertje deleted the feature/import-filewatch branch April 24, 2018 14:07