unified2 files that cause UnicodeDecodeError during serialization #7
I did question storing the raw binary bytes at first, thinking they might be useful at some point down the road, but I don't think they are useful for throwing into a database. This is not really a problem with idstools, but more a problem that the JSON encoder doesn't know how to handle the data. The YAML encoder, however, does: it appears to recognize the data as binary and base64 encodes it before writing out the YAML.
Personally I'd remove the fields before JSON encoding, or convert them to base64.
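A minimal sketch of the base64 approach, assuming the decoded event behaves like a plain dict and the raw fields use a `_raw` name suffix (the helper name `sanitize_for_json` is hypothetical, not part of idstools):

```python
import base64

def sanitize_for_json(event):
    """Return a copy of the event with raw binary fields base64 encoded.

    Assumes binary fields are named with a `_raw` suffix; adjust the
    check to match your actual field list (tcp_options_raw,
    ip_options_raw, ...).
    """
    clean = {}
    for key, value in event.items():
        if key.endswith("_raw") and isinstance(value, (bytes, bytearray)):
            clean[key] = base64.b64encode(bytes(value)).decode("ascii")
        else:
            clean[key] = value
    return clean
```

With that in place, `json.dumps(sanitize_for_json(event))` should no longer trip over binary data, and the original bytes remain recoverable via `base64.b64decode`.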
Sorry to be a bother again, but would you have a list of the fields that contain raw binary, so I could handle them? From looking at packet.py, my list so far is:
I'm just trying to ensure the daemon stays up and running and that it doesn't miss any events.
Personally, I would construct a new intermediary object where you assemble the data that will be serialized to JSON. This should prevent surprises like new fields being added to the decoders that may not be JSON friendly by default.
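A sketch of that intermediary-object idea, using an explicit whitelist so that any field the decoder adds later is ignored unless deliberately opted in (the field names below are illustrative, not idstools' full schema):

```python
# Whitelist of fields known to be JSON-safe; anything not listed is
# dropped, so a newly added decoder field can never break serialization.
JSON_SAFE_FIELDS = ("event-id", "signature-id", "source-ip", "destination-ip")

def to_json_record(event):
    """Build the intermediary object that actually gets serialized."""
    return {key: event[key] for key in JSON_SAFE_FIELDS if key in event}
```

The trade-off versus stripping known-bad fields is inverted: here an unknown field is silently omitted rather than silently breaking the encoder.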
Thanks for the suggestion, but I'm also trying to keep the code simple, small, and as fast as possible to keep up with Snort. I guess any field derived from an `.unpack` call within idstools is a possible problem for serialization, so I will just git clone and search for unpack.
I didn't want to open a new issue, but I'm seeing this message sometimes:
A unified2 file is made up of records, where a record can be an event, a packet, or extra data. Some unified2 files do not start with an event record, but instead start with a packet or extra data record. As we don't have an event to associate these records with, they are discarded. The process is: read an event record, read the following packet and extra data records, and associate them with the event. When a new event record is seen, flush the previous event with its associated data.

I believe this happens because Snort rolls over unified2 log files based on size, and it must be checking after each record instead of after each event. So you can end up with the event record at the end of one file and its packets and extra data in the new file.

The SpoolEventReader is the workaround for this. It's meant to be used with a spool directory that Snort is logging to, using unified2 files with a timestamp suffix. It uses a cache (much like Barnyard2, I believe) to associate records at the start of one file with an event that started in a previous file.
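The read/associate/flush process described above can be sketched roughly as follows. This is a simplification for illustration, using plain dicts rather than idstools' actual record classes:

```python
# Simplified sketch of the association loop: packet and extra-data
# records attach to the most recent event; leading orphan records
# (seen before any event) are discarded; each event is flushed when
# the next event record arrives, or at end of input.
def aggregate(records):
    events = []
    current = None
    for record in records:
        if record["type"] == "event":
            if current is not None:
                events.append(current)  # flush previous event
            current = {"event": record, "packets": [], "extra": []}
        elif current is None:
            continue  # orphan record before any event: discarded
        elif record["type"] == "packet":
            current["packets"].append(record)
        else:
            current["extra"].append(record)
    if current is not None:
        events.append(current)
    return events
```

The file-rollover problem is exactly the `current is None` branch: if the event record landed at the end of the previous file, its packets look like orphans, which is what the SpoolEventReader's cross-file cache avoids.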
The UnicodeDecodeError is caused by raw fields like:

- tcp_options_raw
- ip_options_raw
- ip6_source_raw
- ip6_destination_raw

... the error message contains:

```
UnicodeDecodeError('utf8', "\x01\x01\x08\n's\x04j\x14\xb2{", 10, 11, 'invalid start byte')
```
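The same class of failure is easy to reproduce with the standard `json` module alone; on Python 2 it surfaced as the `UnicodeDecodeError` shown above (the byte string is decoded as UTF-8), while on Python 3 it raises a `TypeError` instead:

```python
import json

# A record containing one of the raw binary fields.
record = {"tcp_options_raw": b"\x01\x01\x08\n"}

try:
    json.dumps(record)
except TypeError as exc:  # Python 3; Python 2 raised UnicodeDecodeError
    print("not serializable:", exc)
```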
This can be re-created, before indexing into Elasticsearch, using:

```python
from elasticsearch.serializer import JSONSerializer

output["packet_details"] = packet.decode_ethernet(data)
test_serializer = JSONSerializer().dumps(output["packet_details"])
```
For testing I'm using unified2-current.log and these files:
https://github.com/mephux/unified2/tree/master/example/seeds
These log files are from 2010/11, so old but not too old; I'm just trying to test against as many new and old unified2 files as I can find.
The current solution is to ignore these fields and not store them in Elasticsearch, but it might be useful to have them in some format/encoding.

Suggestions? Is there some encoding/decoding/formatting of the `*_raw` fields I should do before trying to index them?
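One low-touch option, rather than stripping fields by hand, is a `default` hook on `json.dumps` that base64 encodes any bytes value the encoder can't handle natively (a sketch; the function name and sample field are illustrative). elasticsearch-py's `JSONSerializer` can, I believe, be subclassed to plug in similar behavior:

```python
import base64
import json

def encode_bytes(obj):
    # Called by json.dumps for any object it cannot serialize natively.
    if isinstance(obj, (bytes, bytearray)):
        return base64.b64encode(bytes(obj)).decode("ascii")
    raise TypeError("not JSON serializable: %r" % type(obj))

doc = {"ip_options_raw": b"\x94\x04\x00\x00"}
print(json.dumps(doc, default=encode_bytes))
```

This keeps the raw data indexed (as base64 text) without touching the decoders, at the cost of having to decode it again when reading it back.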