unified2 files that cause UnicodeDecodeError during serialization #7

Closed
cleesmith opened this issue Sep 3, 2014 · 6 comments

Comments

@cleesmith

A UnicodeDecodeError is caused by raw fields such as:
tcp_options_raw
ip_options_raw
ip6_source_raw
ip6_destination_raw
... the error message contains:
UnicodeDecodeError('utf8', "\x01\x01\x08\n's\x04j\x14\xb2{", 10, 11, 'invalid start byte')

This can be re-created, before indexing into elasticsearch, using:

from idstools import packet
from elasticsearch.serializer import JSONSerializer

# data is the raw packet payload taken from the unified2 record
output["packet_details"] = packet.decode_ethernet(data)
test_serializer = JSONSerializer().dumps(output["packet_details"])

For testing I'm using unified2-current.log and these files:
https://github.com/mephux/unified2/tree/master/example/seeds
... these log files are from 2010/11, so old but not too old; I'm just trying to test as
many new and old unified2 files as I can find.

The current workaround is to ignore these fields and not store
them in elasticsearch, but it might be useful to have them
in some format/encoding.

Suggestions?
Is there some encoding/decoding/formatting of the "*_raw" fields I should do before
trying to index them?

@jasonish
Owner

jasonish commented Sep 3, 2014

I did question storing the raw binary bytes at first, thinking they might be useful at some point down the road, but I do not think they are useful for throwing into a database.

This is not really a problem with idstools, but rather that the JSON encoder doesn't know how to handle the data. The YAML encoder, however, does: it appears to recognize the data as binary and base64 encodes it before writing it out to YAML.

> Is there some encoding/decoding/formatting of the "*_raw" fields I should do before
> trying to index them?

Personally I'd remove the fields before JSON encoding them, or convert them to base64.
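
For example, here is a minimal sketch (not from the thread; the field names are the ones listed above, and the input is assumed to be the dict returned by packet.decode_ethernet()) that base64 encodes the raw fields before JSON encoding:

import base64

# Fields known (from packet.py) to carry raw binary bytes.
RAW_FIELDS = ("tcp_options_raw", "ip_options_raw",
              "ip6_source_raw", "ip6_destination_raw")

def encode_raw_fields(details):
    """Return a copy of the decoded packet dict that is safe to JSON encode."""
    safe = dict(details)
    for field in RAW_FIELDS:
        if safe.get(field) is not None:
            # base64 keeps the original bytes in an ASCII-safe form.
            safe[field] = base64.b64encode(safe[field]).decode("ascii")
    return safe

Swapping the base64 line for safe.pop(field, None) gives the "just remove them" variant.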

@jasonish jasonish closed this as completed Sep 3, 2014
@cleesmith
Author

Sorry to be a bother again, but would you have a list of the fields that contain raw binary so I can
ignore them? Or is there a Python way to detect binary data ... my Python skills are not great.

From looking at packet.py my list so far is:

  1. tcp_options_raw
  2. ip_options_raw
  3. ip6_source_raw
  4. ip6_destination_raw
    ... and the payload data:
  5. ["packets"]["data"] - which is saved as base64, and also saved with unprintable chars removed

I'm just trying to ensure the daemon stays up/running and that it doesn't miss any events ... missing
fields are ok.
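
On detecting binary generically: one approach (a sketch, not from the thread; it should behave the same on Python 2, where bytes is str, and Python 3) is to try decoding each byte value as UTF-8 and fall back to base64 when that fails, which avoids maintaining a hard-coded field list:

import base64

def json_safe_value(value):
    """Return a JSON-friendly version of a single decoded field value."""
    if isinstance(value, bytes):
        try:
            return value.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid text, so keep the bytes in an ASCII-safe form.
            return base64.b64encode(value).decode("ascii")
    return value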

@jasonish
Owner

jasonish commented Sep 3, 2014

Personally I would construct a new intermediary object where you assemble the data that will be serialized into JSON. This should prevent any surprises like new fields being added to the decoders that may not be JSON friendly by default.
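
A sketch of that intermediary-object idea (the field names here are made up for illustration; use whichever decoded fields you actually index):

# Copy only whitelisted fields into the document sent to elasticsearch, so a
# new decoder field can never sneak unserializable data into the index.
INDEXED_FIELDS = ("protocol", "ip_source", "ip_destination", "sport", "dport")

def build_document(details):
    """Assemble the JSON-safe document from the decoded packet dict."""
    return {field: details.get(field) for field in INDEXED_FIELDS}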

@cleesmith
Author

Thanks for the suggestion, but I'm also trying to keep the code simple, small, and as fast as possible to keep up with snort. I guess any field derived from an ".unpack" statement within idstools is a possible problem for serialization, so I will just git clone and search for unpack.

@cleesmith
Author

I didn't want to open a new issue, but I'm seeing this message sometimes:
"Discarding non-event type while not in event context."
... what does this mean?
I can't tell from looking at the add function in unified2.py.
Thanks.

@jasonish
Owner

jasonish commented Sep 4, 2014

A unified2 file is made up of records, where a record can be an event, a packet, or extra data.

Some unified2 files do not start with an event record, but instead start with a packet or extra data record. As we don't have an event to associate these records with, they are discarded.

The process is: read an event record, read the following packet and extra data records and associate them with the event, and when a new event record is seen, flush the previous event with its associated data.

I believe this happens because Snort rolls over unified2 log files based on size, and it must be checking after each record instead of after each event. So you can end up with an event record at the end of one file and its packets and extra data in the new file.

SpoolEventReader is the workaround for this. It's meant to be used with a spool directory that Snort is logging to, using unified2 files with a timestamp suffix. It uses a cache (much like Barnyard2, I believe) to associate records at the start of one file with an event that started in a previous file.
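
A minimal usage sketch (the directory and prefix are placeholders, and the follow flag is how I read the py-idstools reader API, so double-check it against the docs):

from idstools import unified2

# Read complete events from the spool directory Snort is logging to.
# SpoolEventReader handles records that roll over into the next file.
reader = unified2.SpoolEventReader(
    "/var/log/snort",   # spool directory
    "unified2.log",     # filename prefix before the timestamp suffix
    follow=True)        # keep waiting for new records, like tail -f

for event in reader:
    print(event)        # each event carries its associated packets/extra data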
