-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/skip unchanged files #62
Feature/skip unchanged files #62
Conversation
We store the fileSizeBytes as an offset in kafka connect and skip files whose file size has not changed
log.info("Processing records for file {}.", metadata); | ||
while (reader.hasNext()) { | ||
records.add(convert(metadata, reader.currentOffset() + 1, reader.next())); | ||
} | ||
} catch (ConnectException | IOException e) { | ||
//when an exception happens reading a file, the connector continues | ||
log.error("Error reading file [{}]. Keep going...", metadata.getPath(), e); | ||
return new ArrayList<SourceRecord>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why returning here an empty list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the comment here: #60 (comment)
If we commit offsets for a file containing the fileSizeBytes
but we only processed part way through the file due to an error we will skip the data from the rest of the file. Instead we should only commit the offsets for that file if the file was processed entirely.
Thanks @grantatspothero |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comments below
Long byteOffset = (Long)fileSizeBytesObject; | ||
if (metadata.getLen() == byteOffset){ | ||
log.info("File {} has byte length and byte offset of: {}, skipping reading the file as it is unchanged since the last execution", metadata.getPath(), byteOffset); | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer returning here a "fake" file reader object implementing the FileReader
interface with no elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a lot of methods to implement and it complicates the code a lot for a simple operation (is this thing present?).
Is using java.util.Optional
ok instead? This connector only supports java8 or higher so it seems reasonable to use.
EDIT: I guess the interaction of autocloseable and java.util.optional is not great. I can go ahead with the fake file reader 👍
@@ -78,13 +78,18 @@ public void start(Map<String, String> properties) { | |||
List<SourceRecord> totalRecords = filesToProcess().map(metadata -> { | |||
List<SourceRecord> records = new ArrayList<>(); | |||
try (FileReader reader = policy.offer(metadata, context.offsetStorageReader())) { | |||
if(reader == null){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid this validation. See comment related with the "fake" file reader.
if(fileSizeBytesObject != null) { | ||
Long byteOffset = (Long)fileSizeBytesObject; | ||
if (metadata.getLen() == byteOffset){ | ||
log.info("File {} has byte length and byte offset of: {}, skipping reading the file as it is unchanged since the last execution", metadata.getPath(), byteOffset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split the message in the log to avoid a line too large
|
||
import java.util.Map; | ||
|
||
public class EmptyFileReader extends AbstractFileReader<Void>{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test coverage dropped a lot because of this file. It doesn't really make sense to test most of these operations since they will never be called/they do nothing (ie: seekFile
is never called and is an empty method)
@mmolimar just bumping this, made changes |
@mmolimar Merged with recently updated develop, the failing test is a flaky test in the HDFS watcher so I think it is unrelated. |
.andReturn(offsetStorageReader) | ||
.times(2); | ||
|
||
// Every time the `offsetStorageReader.offset(params)` method is called we want to capture the offset params |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be a better way to use EasyMock to do this, but I could not find a way to do it in the EasyMock docs.
Thanks for the contrib! |
Breaking this PR into two: #60
This PR just contains the bits about skipping unchanged files by storing the
fileSizeBytes
in the offset.