Parse RTF attachments/bodies #26

Closed
2 of 3 tasks
jrideout opened this issue Nov 28, 2018 · 8 comments

Comments

@jrideout
Collaborator

jrideout commented Nov 28, 2018

@petri
Member

petri commented Nov 28, 2018

What does this mean exactly? Same as tnefparse --htmlbody but for RTF bodies? I seem to remember from a long time ago that RTF is indeed embedded/wrapped in some funky way in TNEF...

@jrideout
Collaborator Author

Exactly, we'll want to support tnefparse --rtfbody

@jrideout
Collaborator Author

The one thing I'm not certain about is if we need to decompress the rtf, or if the rtf data is valid even when compressed. https://github.com/delimitry/compressed_rtf seems to do what we need to just decompress the data without fully parsing it.
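
Whether decompression is needed can be read from the stream itself. Per Microsoft's MS-OXRTFCP specification, a compressed-RTF stream starts with a 16-byte little-endian header (compressed size, raw size, compression type, CRC), and the compression type is either `LZFu` (LZ-compressed, must be decompressed) or `MELA` (RTF stored as-is after the header). Below is a minimal sketch of that check, assuming `data` holds the raw bytes of the compressed-RTF attribute; the function names are hypothetical and are not part of tnefparse or compressed_rtf.

```python
import struct

# Compression-type magic values from the 16-byte header (little-endian).
LZFU = 0x75465A4C  # bytes b"LZFu": payload is LZ-compressed
MELA = 0x414C454D  # bytes b"MELA": payload is plain RTF, stored uncompressed

HEADER_SIZE = 16


def inspect_rtf_stream(data: bytes) -> dict:
    """Parse the header and report how the RTF payload is stored.

    Hypothetical helper for illustration only.
    """
    if len(data) < HEADER_SIZE:
        raise ValueError("too short for a 16-byte compressed-RTF header")
    comp_size, raw_size, comp_type, crc = struct.unpack_from("<IIII", data, 0)
    if comp_type == LZFU:
        kind = "compressed"
    elif comp_type == MELA:
        kind = "uncompressed"
    else:
        raise ValueError(f"unknown compression type 0x{comp_type:08X}")
    return {"kind": kind, "comp_size": comp_size, "raw_size": raw_size, "crc": crc}


def raw_rtf_if_uncompressed(data: bytes):
    """Return the RTF bytes for a MELA stream, or None if decompression is needed."""
    info = inspect_rtf_stream(data)
    if info["kind"] == "uncompressed":
        return data[HEADER_SIZE:HEADER_SIZE + info["raw_size"]]
    return None  # LZFu: hand the whole stream to a decompressor
```

If `raw_rtf_if_uncompressed` returns `None`, the stream is `LZFu` and the full byte string (header included) would be passed to a decompressor such as the one linked above.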

@petri
Member

petri commented Nov 29, 2018

Nice! That's a small dependency well worth it I'd think.

@jrideout
Collaborator Author

jrideout commented Nov 30, 2018

This does what I want for RTF parsing: https://gist.github.com/gilsondev/7c1d2d753ddb522e7bc22511cfb08676

I'd rather not add a dependency for a full rtf parser. Should we include this file in our source, or just leave it outside the scope of the project?

@petri
Member

petri commented Nov 30, 2018

Hm. I consider document format conversions to be outside the scope, but some limited use cases might fall on the borderline. If I may ask, what's the goal - just support extraction of plaintext words for indexing, or something else?

@jrideout
Collaborator Author

jrideout commented Nov 30, 2018

> what's the goal - just support extraction of plaintext words for indexing

just that

> I consider document format conversions to be outside the scope

I agree. Let's stop here. Users can do their own RTF parsing if desired.

jrideout pushed a commit to agaridata/tnefparse that referenced this issue Nov 30, 2018
…ing_from_with_multiple_addresses

BP-205: Fix header From: parsing when multiple addresses exist
@petri
Member

petri commented Dec 1, 2018

I gave this some more thought. Conversions of tnef body content in general are out of scope of tnefparse.

But I am pretty sure I remember the RTF/HTML bodies are to some extent specific to the MS TNEF implementations, with some quirks and deviations. That makes me think extraction of plaintext is something that's within the scope here.

So that can be revisited.

This issue was closed.