unified2 files that cause UnicodeDecodeError during serialization #7

Closed
cleesmith opened this issue Sep 3, 2014 · 6 comments

Comments

@cleesmith

A UnicodeDecodeError is caused by raw fields such as:
tcp_options_raw
ip_options_raw
ip6_source_raw
ip6_destination_raw
... the error message contains:
UnicodeDecodeError('utf8', "\x01\x01\x08\n's\x04j\x14\xb2{", 10, 11, 'invalid start byte')

This can be re-created, before indexing into elasticsearch, using:

from idstools import packet
from elasticsearch.serializer import JSONSerializer

# data is the raw packet payload taken from the unified2 record
output["packet_details"] = packet.decode_ethernet(data)
test_serializer = JSONSerializer().dumps(output["packet_details"])

For testing I'm using unified2-current.log and these files:
https://github.com/mephux/unified2/tree/master/example/seeds
... these log files are from 2010/11, so old but not too old; I'm just trying to test as
many new and old unified2 files as I can find.

The current workaround is to ignore these fields and not store
them in elasticsearch, but it might be useful to have them
in some format/encoding.

Suggestions?
Is there some encoding/decoding/formatting of the "*_raw" fields I should do before
trying to index them?

@jasonish
Owner

jasonish commented Sep 3, 2014

I did question storing the raw binary bytes at first, thinking they might be useful at some point down the road, but I do not think they are useful for throwing into a database.

This is not really a problem with idstools, but rather that the JSON encoder doesn't know how to handle the data. The YAML encoder, however, does: it appears to recognize the data as binary and base64 encodes it before writing it out to YAML.

> Is there some encoding/decoding/formatting of the "*_raw" fields I should do before
> trying to index them?

Personally I'd remove the fields before JSON encoding them, or convert them to base64.
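
For example, here is a minimal sketch (not from the thread; the field names are the ones listed above, and the input is assumed to be the dict returned by packet.decode_ethernet()) that base64 encodes the raw fields before JSON encoding:

import base64

# Fields known (from packet.py) to carry raw binary bytes.
RAW_FIELDS = ("tcp_options_raw", "ip_options_raw",
              "ip6_source_raw", "ip6_destination_raw")

def encode_raw_fields(details):
    """Return a copy of the decoded packet dict that is safe to JSON encode."""
    safe = dict(details)
    for field in RAW_FIELDS:
        if safe.get(field) is not None:
            # base64 keeps the original bytes in an ASCII-safe form.
            safe[field] = base64.b64encode(safe[field]).decode("ascii")
    return safe

Swapping the base64 line for safe.pop(field, None) gives the "just remove them" variant.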

@jasonish jasonish closed this as completed Sep 3, 2014
@cleesmith
Author

Sorry to be a bother again, but would you have a list of the fields that contain raw binary so I can
ignore them? Or is there a Python way to detect binary data ... my Python skills are not great.

From looking at packet.py my list so far is:

  1. tcp_options_raw
  2. ip_options_raw
  3. ip6_source_raw
  4. ip6_destination_raw
    ... and the payload data:
  5. ["packets"]["data"] - which is saved as base64, and also saved with unprintable chars removed

I'm just trying to ensure the daemon stays up/running and that it doesn't miss any events ... missing
fields are ok.
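
On detecting binary generically: one approach (a sketch, not from the thread; it should behave the same on Python 2, where bytes is str, and Python 3) is to try decoding each byte value as UTF-8 and fall back to base64 when that fails, which avoids maintaining a hard-coded field list:

import base64

def json_safe_value(value):
    """Return a JSON-friendly version of a single decoded field value."""
    if isinstance(value, bytes):
        try:
            return value.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid text, so keep the bytes in an ASCII-safe form.
            return base64.b64encode(value).decode("ascii")
    return value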

@jasonish
Owner

jasonish commented Sep 3, 2014

Personally I would construct a new intermediary object where you assemble the data that will be serialized into JSON. This should prevent any surprises like new fields being added to the decoders that may not be JSON friendly by default.
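
A sketch of that intermediary-object idea (the field names here are made up for illustration; use whichever decoded fields you actually index):

# Copy only whitelisted fields into the document sent to elasticsearch, so a
# new decoder field can never sneak unserializable data into the index.
INDEXED_FIELDS = ("protocol", "ip_source", "ip_destination", "sport", "dport")

def build_document(details):
    """Assemble the JSON-safe document from the decoded packet dict."""
    return {field: details.get(field) for field in INDEXED_FIELDS}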

@cleesmith
Author

Thanks for the suggestion, but I'm also trying to keep the code simple, small, and as fast as possible to keep up with snort. I guess any field derived from an ".unpack" statement within idstools is a possible problem for serialization, so I will just git clone and search for unpack.

@cleesmith
Author

I didn't want to open a new issue, but I'm seeing this message sometimes:
"Discarding non-event type while not in event context."
... what does this mean?
I can't tell from looking at the add function in unified2.py.
Thanks.

@jasonish
Owner

jasonish commented Sep 4, 2014

A unified2 file is made up of records, where a record can be an event, a packet, or extra data.

Some unified2 files do not start with an event record, but instead start with a packet or extra data record. As we don't have an event to associate these records with, they are discarded.

The process is: read an event record, read the following packet and extra data records and associate them with the event, and when a new event record is seen, flush the previous event with its associated data.

I believe this happens because Snort rolls over unified2 log files based on size, and it must be checking after each record instead of after each event. So you can end up with an event record at the end of one file and its packets and extra data in the new file.

SpoolEventReader is the workaround for this. It's meant to be used with a spool directory that Snort is logging to, using unified2 files with a timestamp suffix. It uses a cache (much like Barnyard2, I believe) to associate records at the start of one file with an event that started in a previous file.
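
A minimal usage sketch (the directory and prefix are placeholders, and the follow flag is how I read the py-idstools reader API, so double-check it against the docs):

from idstools import unified2

# Read complete events from the spool directory Snort is logging to.
# SpoolEventReader handles records that roll over into the next file.
reader = unified2.SpoolEventReader(
    "/var/log/snort",   # spool directory
    "unified2.log",     # filename prefix before the timestamp suffix
    follow=True)        # keep waiting for new records, like tail -f

for event in reader:
    print(event)        # each event carries its associated packets/extra data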
