
Non-Linear response for sequential iteration of large capture #48

Closed
chrconlo opened this issue Dec 11, 2014 · 6 comments

Comments

@chrconlo

Thanks for the great tool!

I'm attempting to use pyshark to iterate over large capture files (180K packets per file) and am seeing a non-linear response as the iteration progresses. Is there a way to tune pyshark so it behaves linearly in these circumstances? At around 3K packets it starts getting really slow.

        cap = pyshark.FileCapture(cap_file, keep_packets=False)
        cap.display_filter = ('stp')
        cap.apply_on_packets(getStpData)

bpduTracker chrconlo$ ./bpduTracker.py
Starting bpduTracker...
Capture File Found: /users/xyz/Desktop/20141202/prog/test.cap

2014-12-11 16:05:00.030286 -> Packet Count: 50
BPDU Accounting Dictionary Size: 0

2014-12-11 16:05:00.402444 -> Packet Count: 100
BPDU Accounting Dictionary Size: 0

2014-12-11 16:05:00.996808 -> Packet Count: 150
BPDU Accounting Dictionary Size: 0

2014-12-11 16:05:01.793828 -> Packet Count: 200
BPDU Accounting Dictionary Size: 0

2014-12-11 16:05:02.814992 -> Packet Count: 250
BPDU Accounting Dictionary Size: 0

2014-12-11 16:05:04.133944 -> Packet Count: 300
BPDU Accounting Dictionary Size: 0

2014-12-11 16:05:05.571902 -> Packet Count: 350
BPDU Accounting Dictionary Size: 0

2014-12-11 16:05:07.212769 -> Packet Count: 400
BPDU Accounting Dictionary Size: 0

2014-12-11 16:05:09.178611 -> Packet Count: 450
BPDU Accounting Dictionary Size: 0

2014-12-11 16:05:11.315323 -> Packet Count: 500
BPDU Accounting Dictionary Size: 0

2014-12-11 16:05:13.638662 -> Packet Count: 550
BPDU Accounting Dictionary Size: 0
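
The getStpData callback itself isn't shown above; purely to illustrate the apply_on_packets pattern being used, a callback of roughly that shape might look like the sketch below (the MAC-keyed accounting and the 50-packet progress interval are assumptions, not taken from the original script).

    import datetime

    # Hypothetical reconstruction; the real getStpData is not shown in the issue.
    bpdu_accounting = {}
    packet_count = 0

    def getStpData(packet):
        global packet_count
        packet_count += 1
        try:
            # Key the accounting dict on the sending MAC address
            # (an illustrative choice; the original keying scheme is unknown).
            src = packet.eth.src
            bpdu_accounting[src] = bpdu_accounting.get(src, 0) + 1
        except AttributeError:
            pass  # packet without an Ethernet layer
        if packet_count % 50 == 0:
            print("%s -> Packet Count: %d" % (datetime.datetime.now(), packet_count))
            print("BPDU Accounting Dictionary Size: %d" % len(bpdu_accounting))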

@chrconlo
Author

Not sure if you've had a chance to look at this, but from a look through the code and some debugging it looks like the slowdown is in asyncio; more specifically, the stream reader below. Are you seeing the same? Thanks

new_data = yield From(stream.read(self.DEFAULT_BATCH_SIZE))

@chrconlo
Author

I've backed down to using an older rev of pyshark (0.2.6), prior to asyncio, and see vastly better performance, although some other things seem broken, like frame_info, the display filter on capture, and getting certain field matches to work. I was able to process 420K packets in under 8 minutes. Thanks

-- Packet Processing Progress Report --
Runtime: 0:07:21.040105 -> Packet Count: [420000]
Processing Loop Time (Single Packet): 0:00:00.001523
BPDU Accounting Dictionary Size: [6463] entries

@KimiNewt
Owner

Yes, it seems asyncio severely reduced performance (on Unix anyway; oddly, I'm seeing good performance on Windows). I'll try optimizing it over the weekend, hopefully, and if push comes to shove I'll remove asyncio. Conceptually it should not have lower performance, but we'll see.

@KimiNewt
Owner

I've (probably) isolated the problem to: https://github.com/KimiNewt/pyshark/blob/master/src/pyshark/capture/capture.py#L147
What seems to be happening is that a large amount of data (tshark XML) sits in the subprocess stdout pipe. We read it one packet at a time, and the buffered XML grows larger and larger as time goes on.
That line copies what might be a very large string; it took ~40ms on a large cap file I tried.

The solution is probably to extract ALL the packets at once from the data received, instead of one at a time (we can't use lxml for this, as it does not support parsing partial XMLs). I'll try to find a solution for that.
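
A minimal sketch of that idea (not pyshark's actual code, and the helper name is made up): scan the buffered PDML once, slice out every complete <packet>...</packet> element, and keep only the unparsed tail, so the large buffer is copied once per read instead of once per packet.

    def extract_complete_packets(buffer, end_tag=b"</packet>"):
        """Return (list_of_complete_packet_xml, remaining_buffer) in one pass."""
        packets = []
        pos = 0
        while True:
            end = buffer.find(end_tag, pos)
            if end == -1:
                break
            end += len(end_tag)
            packets.append(buffer[pos:end])
            pos = end
        return packets, buffer[pos:]

    # Two complete packets come out; the partial third stays buffered for the next read.
    done, rest = extract_complete_packets(b"<packet>a</packet><packet>b</packet><pack")
    print(len(done), rest)  # 2 b'<pack'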

@chrconlo
Author

chrconlo commented Jan 8, 2015

Interesting. Any luck finding a solution for this? Thanks

KimiNewt closed this as completed May 9, 2015
@KimiNewt
Owner

KimiNewt commented May 9, 2015

Fixed by PR #66
