Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sometimes the beginning of an email is not recognised in getMessagePositions() in /fortuna/mstor/data/MboxFile.java. #27

Open
jfarwer opened this issue Jan 3, 2022 · 12 comments
Assignees

Comments

@jfarwer
Copy link

jfarwer commented Jan 3, 2022

Hi,

I think there is a slight issue in getMessagePositions() in /fortuna/mstor/data/MboxFile.java.

An Mbox file is split into managable chunks of DEFAULT_BUFFER_SIZE. For each chunk a search is done for the pattern for the beginning of an email: (\\A|\\n{2}|(\\r\\n){2})^From .*$).

When creating the chunks, the last seven characters (FROM__PREFIX.length() + 2) from each chunk are added to the beginning of the next chunk. This is to avoid losing a From_ match due to splitting through the pattern when creating the chunks. Seven characters cater for any split within \n\nFrom_.

However when \r\n\r\nFrom_ is split between the 'm' and the blank then we are left with \n\r\nFrom_. The pattern \\A|\\n{2}|(\\r\\n){2})^From .*$ doesn't match this and the beginning of the email is skipped. Maybe adding \n\r\nFrom_ to the pattern would fix it? This would also cater for the case where an Mbox file contains mixed newline styles (and possibly \n\r\nFrom_ at the beginning of an email).

@benfortuna
Copy link
Member

Thanks for the feedback. I'll review the pattern and try to create a test to reproduce this issue.

@benfortuna benfortuna self-assigned this Jan 4, 2022
@jfarwer
Copy link
Author

jfarwer commented Jan 4, 2022

Thanks for the quick response. We are working with sets of a few hundred thousand emails; it would be great to have it fixed. Unfortunately, I can't give you my files for testing for privacy reasons.

@benfortuna
Copy link
Member

I had a closer look and there is already support for overlapping buffers to prevent loss of the From_ header.

However there was a bug that it wasn't accounting for enough overlap bytes to include the whitespace. I've applied a fix and released as v1.0.1.

As this is the first release in quite a while, there may be some unexpected behaviour changes, but if you can try and provide feedback I'll address as needed.

@jfarwer
Copy link
Author

jfarwer commented Jan 6, 2022

Great, that was quick. I will give feedback to you when I have done some tests.

@jfarwer
Copy link
Author

jfarwer commented Jan 7, 2022

That is looking good. In mstor-1.0.1.pom I had to change the occurrences of <scope>runtime</scope> to <scope>compile</scope>, but that is just because my code uses the dependencies itself without having them in its own pom file.

For using the MStorStore class I had to change net.fortuna.mstor.MStorStore to net.fortuna.mstor.model.MStorStore

I will let you know if I experience anything unexpected.

@jfarwer
Copy link
Author

jfarwer commented Jan 7, 2022

Hi,

Sorry for being a pain. I think the pattern (\\A|\\n{2}|(\\r\\n){2})^From .*$
looks for either the beginning of the string or two Unix newlines or two Windows newlines. I guess the beginning of the string is supposed to cater for the beginning of the Mbox file where there are no newlines in front of the From_. However, reading is done in chunks of 8192 characters, and if a chunk starts with From_, that matches the pattern. I have a large Mbox file with a From_ in a line of email text which is taken as the start of an email. I think this is not a new issue. It didn't crop up before because a different overlap was used, meaning the beginning of the chunks was different, and it didn't happen in the Mbox file I was looking at.

@benfortuna benfortuna reopened this Jan 20, 2022
@stleusc
Copy link

stleusc commented Jan 21, 2022

I am using the current version and I get this issue:
One of the message objects has null ID, null Subject, etc.
Enabled logging and found this:

[main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16633060]
[main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16644222]
[main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16644222]
[main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16650269]

see the duplicate index?! For some reason the same index is used twice and the second time the message object is broken.
The second should not exist at all...
Pretty sure related to this...

@benfortuna
Copy link
Member

Ok that is starting to make sense. I guess if two concurrent messages are less than 8192 bytes then you could potentially have the same message in consecutive buffers. I'll add a fix to ensure duplicate indexes aren't added.

@stleusc
Copy link

stleusc commented Jan 21, 2022

I just debugged it a bit....
The first time it finds the message is at the end of the buffer and the buffer ends in: \r\n\r\nFrom_
It ends with and EXACT hit. Then the next loop you add that EXACT same text to the beginning of the next buffer which results in another hit.

I believe you detect the length of msg by looking at next or prior offset from posList and take the diff. That means for one message the difference is simply 0! This should be the case for the FIRST 16644222 since the next starts at the same index the length for this object is 0.

That means, once you fix the issue of the double match the other issue cannot happen anymore.
I would not even consider it a bug seeing what I see.

So, fixing the issue with match at the very end is all I would need.

@stleusc
Copy link

stleusc commented Jan 21, 2022

I think I might even have a possible fix...

Current code

private static final Pattern FROM__LINE_PATTERN = Pattern.compile("(\\A|\\n{2}|(\\r\\n){2})^From .*$", 8);

Proposal

private static final Pattern FROM__LINE_PATTERN = Pattern.compile("(\\A|\\n{2}|(\\r\\n){2})^From .+$", 8);

Your pattern matches \r\n\r\nFrom_ with NOTHING more. But since you add that exact pattern for another round you should not match it here. This can be fixed by requiring at LEAST 1 more character. That way, you NEVER can have a double hit and the above case would NOT match at the end of the buffer but will be caught the next round.

@stleusc
Copy link

stleusc commented Jan 21, 2022

With my change:
[main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16633060]
[main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16644222]
[main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16650269]

:-)

@benfortuna
Copy link
Member

Yup I will add this change. In the meantime i pushed a crude fix to just check the last message position to avoid adding it to the index a second time. Release is just pushed should be on maven central shortly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants