This script iterates over all the folders in its own directory and looks for pcaps. It merges the contents of 5 folders at a time, writing each merged pcapng into the directory pcaps. It then removes the duplicate packets from each merged file and puts the results in the directory cleaned_pcaps.

In [1]:
import os

from pathlib import Path
from concurrent.futures import ThreadPoolExecutor


NTHREADS = 8

Take all the separate pcaps and merge them into a handful of files.
This is an incredibly slow process; it took me 3+ hours.
Consider running it overnight.

In [None]:
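OUTPUT_DIRS = ("pcaps", "cleaned_pcaps")
Path("pcaps").mkdir(exist_ok=True)  # mergecap does not create the output directory

args = ""
i = 0
with ThreadPoolExecutor(NTHREADS) as executor:
    for folder in Path(".").iterdir():
        # Skip plain files and the output directories, so results are never merged into themselves.
        if folder.is_file() or folder.name in OUTPUT_DIRS:
            continue
        args += f"{folder.name}/* "
        i += 1
        if i % 5 == 0:
            # Every 5 folders, merge the batch in a background thread.
            command = f"mergecap -w pcaps/out{i // 5}.pcapng {args}"
            print(command)
            executor.submit(os.system, command)
            args = ""
# Merge the leftover folders (fewer than 5), if any, under the next free index.
if args:
    command = f"mergecap -w pcaps/out{i // 5 + 1}.pcapng {args}"
    print(command)
    os.system(command)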

Remove most of the duplicate packets (there's plenty).
Keep in mind that the biggest files are more than 5G, so even 4 threads at a time might be too many: my system froze with 4. If you have 16G of RAM, I suggest lowering NTHREADS to at most 2. If you have 8G, make it 1.
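The rule of thumb above can be sketched as a small helper. This is only an illustration: `suggested_workers` and `detect_ram_gb` are hypothetical names, the "one worker per 8G" extrapolation beyond 16G is an assumption that the note above does not cover, and the RAM detection relies on `os.sysconf`, which is only available on Unix-like systems.

```python
import os


def suggested_workers(total_ram_gb: float) -> int:
    """Map total RAM to a worker count: 8G -> 1, 16G -> 2.

    Assumption: roughly one worker per 8G, also above 16G.
    The 0.5 slack covers machines that report slightly less
    than their nominal RAM (e.g. 15.6 on a 16G system).
    """
    return max(1, int((total_ram_gb + 0.5) // 8))


def detect_ram_gb():
    """Return the physical RAM in GiB, or None if it cannot be determined."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None  # e.g. on Windows, where os.sysconf does not exist
    return page_size * page_count / 2**30


ram_gb = detect_ram_gb()
if ram_gb is not None:
    print(f"{ram_gb:.1f}G of RAM, suggested workers: {suggested_workers(ram_gb)}")
```

The result could then replace NTHREADS for the deduplication cell only, since the merge step is far less memory hungry.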

In [2]:
# 1m 54.5s
with ThreadPoolExecutor(NTHREADS) as executor:
    for filename in Path("pcaps").iterdir():
        cmd = f"editcap -F pcap -d {filename} cleaned_pcaps/{filename.stem}_cleaned.pcap"
        executor.submit(os.system, cmd)


2670252 packets seen, 2390693 packets skipped with duplicate window of 5 packets.
2950081 packets seen, 2643848 packets skipped with duplicate window of 5 packets.
3257364 packets seen, 2917489 packets skipped with duplicate window of 5 packets.
3550386 packets seen, 3184676 packets skipped with duplicate window of 5 packets.
3822946 packets seen, 3424800 packets skipped with duplicate window of 5 packets.
4133967 packets seen, 3703876 packets skipped with duplicate window of 5 packets.
4393184 packets seen, 3943731 packets skipped with duplicate window of 5 packets.
4596077 packets seen, 4123025 packets skipped with duplicate window of 5 packets.
4897141 packets seen, 4391344 packets skipped with duplicate window of 5 packets.
5233294 packets seen, 4691221 packets skipped with duplicate window of 5 packets.
5569366 packets seen, 4997789 packets skipped with duplicate window of 5 packets.
5936182 packets seen, 5320031 packets skipped with duplicate window of 5 packets.
6399451 packets 

Merge all the cleaned files into a single pcapng. Why not do this first? There are so many duplicates that it would produce an enormous file, which also means high RAM consumption in a step that cannot be parallelized.
I went from 131.8G to 22.2G.

In [10]:
# 1m 16.8s
cmd = "mergecap -w out_final.pcap cleaned_pcaps/*.pcap"
print(f"Executing `{cmd}`")
os.system(cmd)
print("Done merging all pcaps.")

Executing `mergecap -w out_final.pcapng new_pcaps/*.pcapng`
Done merging all pcaps.


Deduplicate the final merged pcap one last time. I went from 22.2G to 9.9G.

In [3]:
# 1m 22.9s
filename = Path("out_final.pcap")
cmd = f"editcap -F pcap -d {filename} {filename.stem}_cleaned.pcap"
print(f"Executing `{cmd}`")
os.system(cmd)


Executing `editcap -F pcap -d out_final.pcapng out_final_cleaned.pcap`


73780810 packets seen, 68460013 packets skipped with duplicate window of 5 packets.


0
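To see how much each deduplication pass removed, the summary line that editcap prints can be parsed. The sketch below assumes the summary keeps the exact wording shown in the output above; `parse_summary` is a hypothetical helper, not part of any tool.

```python
import re

# Matches the summary line printed by `editcap -d`, as seen in the output above.
SUMMARY = re.compile(r"(\d+) packets seen, (\d+) packets skipped")


def parse_summary(line: str):
    """Return (seen, skipped) from an editcap summary line, or None if it does not match."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    seen, skipped = map(int, match.groups())
    return seen, skipped


line = "73780810 packets seen, 68460013 packets skipped with duplicate window of 5 packets."
seen, skipped = parse_summary(line)
kept = seen - skipped
print(f"kept {kept} of {seen} packets ({skipped / seen:.1%} were duplicates)")
```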

Split the result into multiple pcaps of less than 1G each for CICFlowMeter, which doesn't handle large files well (see https://github.com/ahlashkari/CICFlowMeter/issues/119/).
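The notebook does not show the split itself. One possible approach, sketched here under assumptions, is editcap's `-c <packets per file>` option, which splits by packet count rather than by size. The helper names, the 10% safety margin, and the output file name are all hypothetical, and because the estimate uses the average packet size, a chunk full of unusually large packets could still exceed the limit.

```python
import math
import shlex


def packets_per_chunk(file_bytes, packet_count, max_chunk_bytes=10**9, margin=0.9):
    """Estimate how many packets fit in one chunk below max_chunk_bytes.

    Uses the average packet size, scaled by a safety margin (an assumption).
    """
    avg_packet_bytes = file_bytes / packet_count
    return max(1, math.floor(max_chunk_bytes * margin / avg_packet_bytes))


def split_command(src, dst, packets):
    """Build an editcap command that writes `packets` packets per output file."""
    return f"editcap -F pcap -c {packets} {shlex.quote(src)} {shlex.quote(dst)}"


# Figures from the final cleaning step above: about 9.9G on disk,
# 73780810 packets seen minus 68460013 skipped.
n = packets_per_chunk(9.9e9, 73780810 - 68460013)
print(split_command("out_final_cleaned.pcap", "split.pcap", n))
```

Running the printed command with `os.system`, as in the other cells, would produce a series of numbered chunk files; check their sizes before feeding them to CICFlowMeter.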

After this, use the fixed version of CICFlowMeter to compute the flow dataset from cleaned_pcaps/out_final_cleaned.pcapng
