<h1 align='center'> Pcap Payload Analysis<br> For Automatic Generation of Signatures <br>Applied to Intrusion Detection Systems (IDS) </h1>

### GOAL(s): 
1. Retrieve, as fast as possible, the longest set of strings that repeats in the packet payloads of a single (D)DoS attack.
2. Generate a signature for the best-known IDSs (Bro, Suricata, and Snort) from the most frequent string in a (D)DoS attack trace.
3. Validate which approaches (string matching and IDS) give the best results for detecting (D)DoS attacks.
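
To illustrate goal 2, here is a minimal sketch that turns a hex string into a Snort/Suricata-style rule with a hex `content` match. The helper names `hex_content` and `build_rule`, the message text, the sid value, and the sample pattern are all hypothetical, chosen only for the example. Bro uses a different signature language and is not covered here.

```python
def hex_content(hex_payload):
    """Format hex digits as a pipe-delimited content match, e.g. |00 01 08 00|."""
    raw = bytes.fromhex(hex_payload)  # spaces between hex groups are accepted
    return '|' + ' '.join(f'{b:02x}' for b in raw) + '|'


def build_rule(hex_payload, sid, msg='Possible DDoS payload pattern'):
    """Build a rule string (assumption: generic 'alert ip any any' header)."""
    return (f'alert ip any any -> any any '
            f'(msg:"{msg}"; content:"{hex_content(hex_payload)}"; '
            f'sid:{sid}; rev:1;)')


# Hypothetical pattern; in practice it would be the most frequent string found
rule = build_rule('0001 0800 0604', sid=1000001)
print(rule)
```

The generic header matches any IP traffic; a real signature would probably restrict the protocol, addresses, and ports to those seen in the attack trace.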

## Which libraries do I need to pre-process the pcap file?

In [17]:
import pandas as pd
import numpy as np

from io import StringIO
import re

## What did I use to get the payload from a pcap file?

In [244]:
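# -n: no name resolution, -tt: epoch timestamps, -x: hex dump of each packet (without the link-level header), -v: verbose
# !tcpdump -nttxv -r test.pcap > test.txt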

#### Questions? Read the manual: http://www.tcpdump.org/tcpdump_man.html and http://www.tcpdump.org/

## How to load the pcap file (converted to text, with payload)?

In [276]:
pcapfile='data/test.txt'

## What does the (text) pcap look like?

In [277]:
from itertools import islice

lines_tobe_printed = 10

with open(pcapfile) as myfile:
    # islice stops after the requested lines instead of reading the whole file
    for x in islice(myfile, lines_tobe_printed):
        print(x.strip())

1489661005.627745 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 130.89.13.20 tell 130.89.15.174, length 46
0x0000:  0001 0800 0604 0001 902b 3431 4150 8259
0x0010:  0fae 0000 0000 0000 8259 0d14 0000 0000
0x0020:  0000 0000 0000 0000 0000 0000 0000
1489661005.696902 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 32, options (RA))
130.89.14.205 > 224.0.0.251: igmp v2 report 224.0.0.251
0x0000:  46c0 0020 0000 4000 0102 71f6 8259 0ecd
0x0010:  e000 00fb 9404 0000 1600 0904 e000 00fb
1489661005.741122 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 130.89.13.49 tell 130.89.13.14, length 46
0x0000:  0001 0800 0604 0001 a4ba dbf7 a662 8259


## How to isolate only the payload (in a list)?

In [278]:
payload = ''
appended_data = []

with open(pcapfile, 'rb') as f:
    for line in f:
        line = line.decode("utf-8").strip()  # convert bytes to a UTF-8 string

        if re.match(r'0x[0-9a-f]+:', line):  # keep only the hex-dump lines
            if line.startswith('0x0000') and payload:
                # offset 0x0000 starts a new packet: store the previous one
                appended_data.append(payload.strip())
                payload = ''
            payload = payload + ' ' + line.split(':  ')[1]

# store the last packet, which no later 0x0000 line would flush
if payload:
    appended_data.append(payload.strip())

In [279]:
type(appended_data)

list

## Showing the first 4 payloads (lines)

In [280]:
appended_data[0:4]

['0001 0800 0604 0001 902b 3431 4150 8259 0fae 0000 0000 0000 8259 0d14 0000 0000 0000 0000 0000 0000 0000 0000 0000',
 '46c0 0020 0000 4000 0102 71f6 8259 0ecd e000 00fb 9404 0000 1600 0904 e000 00fb',
 '0001 0800 0604 0001 a4ba dbf7 a662 8259 0d0e 0000 0000 0000 8259 0d31 0000 0000 0000 0000 0000 0000 0000 0000 0000',
 '0001 0800 0604 0001 38c9 8617 999b 8259 0d15 0000 0000 0000 8259 0c01 0000 0000 0000 0000 0000 0000 0000 0000 0000']
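
For byte-level analysis it can help to convert these hex strings into raw bytes. The sketch below wraps the extraction step in a function that works on any iterable of lines and then calls `bytes.fromhex`. The names `extract_payloads` and `HEX_LINE` are hypothetical, and the sample text is copied from the first lines of the dump shown earlier.

```python
import re

# Matches tcpdump hex-dump lines such as "0x0010:  0fae 0000 ..."
HEX_LINE = re.compile(r'0x([0-9a-fA-F]+):\s+(.*)')


def extract_payloads(lines):
    """Return one space-separated hex string per packet."""
    payloads, current = [], []
    for line in lines:
        m = HEX_LINE.match(line.strip())
        if not m:
            continue  # header or summary line, not part of the hex dump
        offset, hex_words = m.groups()
        if int(offset, 16) == 0 and current:
            # offset 0 starts a new packet: flush the previous one
            payloads.append(' '.join(current))
            current = []
        current.append(hex_words.strip())
    if current:
        payloads.append(' '.join(current))  # do not lose the last packet
    return payloads


sample = """\
1489661005.627745 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 130.89.13.20 tell 130.89.15.174, length 46
0x0000:  0001 0800 0604 0001 902b 3431 4150 8259
0x0010:  0fae 0000 0000 0000 8259 0d14 0000 0000
0x0020:  0000 0000 0000 0000 0000 0000 0000
1489661005.696902 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 32, options (RA))
130.89.14.205 > 224.0.0.251: igmp v2 report 224.0.0.251
0x0000:  46c0 0020 0000 4000 0102 71f6 8259 0ecd
0x0010:  e000 00fb 9404 0000 1600 0904 e000 00fb
"""

payloads = extract_payloads(sample.splitlines())
raw = [bytes.fromhex(p) for p in payloads]  # fromhex ignores the spaces
print([len(r) for r in raw])
```

The byte counts agree with the `length 46` and `length 32` fields that tcpdump prints for these two packets, which is a quick sanity check on the parsing.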

## How to load the (list of) packets payload into a dataframe?

In [281]:
df = pd.DataFrame(appended_data)

## How to show the first lines of the dataframe?

In [282]:
df.head()

| | 0 |
|---|---|
| 0 | 0001 0800 0604 0001 902b 3431 4150 8259 0fae 0... |
| 1 | 46c0 0020 0000 4000 0102 71f6 8259 0ecd e000 0... |
| 2 | 0001 0800 0604 0001 a4ba dbf7 a662 8259 0d0e 0... |
| 3 | 0001 0800 0604 0001 38c9 8617 999b 8259 0d15 0... |
| 4 | 0001 0800 0604 0001 902b 3431 4150 8259 0fae 0... |
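
One possible first step towards goal 1 is to count in how many payloads each fixed-length byte sequence occurs. This is only a sketch under assumptions: `ngram_document_frequency` is a hypothetical helper, the n-gram length of 4 is arbitrary, and the three short payloads are truncated examples, not the full dataset.

```python
from collections import Counter


def ngram_document_frequency(payloads, n):
    """Count in how many payloads each n-byte sequence occurs."""
    counts = Counter()
    for hex_payload in payloads:
        data = bytes.fromhex(hex_payload)
        # A set, so each payload contributes at most once per n-gram
        grams = {data[i:i + n] for i in range(len(data) - n + 1)}
        counts.update(grams)
    return counts


payloads = [
    '0001 0800 0604 0001 902b 3431',
    '0001 0800 0604 0001 a4ba dbf7',
    '46c0 0020 0000 4000 0102 71f6',
]
freq = ngram_document_frequency(payloads, n=4)

# Sequences present in at least two payloads are signature candidates
candidates = sorted(g.hex() for g, hits in freq.items() if hits >= 2)
print(candidates)
```

Counting per payload rather than per occurrence keeps a long run of zeros inside a single packet from dominating the ranking.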


<h1 align='center'>=========================================================================<br>
String Similarity Analysis<br>
=========================================================================</h1>

**String similarity, fuzzy string matching:**
- Cosine distance
- Damerau–Levenshtein distance
- Euclidean distance
- Fuzzy matching
- Hamming distance
- Jaccard distance
- Jaro–Winkler distance
- Levenshtein distance
- Longest Common Substring distance
- q-gram distance
- Manhattan distance
- Optimal matching
- Sørensen–Dice coefficient
- String kernel
- Wagner–Fischer algorithm
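
As a rough illustration of three of the measures listed above, the sketch below compares two payloads as sequences of hex words. The function names are hypothetical, the payloads are truncated examples, and the Levenshtein distance is computed with the Wagner–Fischer dynamic programme.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Edit distance (Wagner-Fischer), keeping only the previous row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def qgrams(seq, q):
    """Set of all length-q windows of a sequence, as hashable tuples."""
    return {tuple(seq[i:i + q]) for i in range(len(seq) - q + 1)}


def jaccard_distance(a, b, q=2):
    """1 - |A and B| / |A or B| over the q-gram sets of a and b."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 0.0
    return 1 - len(ga & gb) / len(ga | gb)


p1 = '0001 0800 0604 0001 902b 3431'.split()
p2 = '0001 0800 0604 0001 a4ba dbf7'.split()

print(hamming(p1, p2), levenshtein(p1, p2), round(jaccard_distance(p1, p2), 3))
```

The same functions accept plain strings, so the comparison can also be made per hex character instead of per 16-bit word.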

 
**See also:**
- `SequenceMatcher` from difflib
- `get_matching_blocks`, a method of `SequenceMatcher` from difflib
- genome alignment (e.g., Pairwise Sequence Alignment)
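
A minimal sketch of how `SequenceMatcher` could be applied to two of the payloads shown earlier, compared word by word. Splitting on whitespace and disabling `autojunk` are choices made for this example, not something the analysis above prescribes.

```python
from difflib import SequenceMatcher

# First and third payloads from the list shown earlier, as hex words
a = ('0001 0800 0604 0001 902b 3431 4150 8259 0fae 0000 0000 0000 '
     '8259 0d14 0000 0000 0000 0000 0000 0000 0000 0000 0000').split()
b = ('0001 0800 0604 0001 a4ba dbf7 a662 8259 0d0e 0000 0000 0000 '
     '8259 0d31 0000 0000 0000 0000 0000 0000 0000 0000 0000').split()

matcher = SequenceMatcher(None, a, b, autojunk=False)

# Longest run of consecutive hex words common to both payloads
longest = matcher.find_longest_match(0, len(a), 0, len(b))
print(' '.join(a[longest.a:longest.a + longest.size]))

# All matching blocks; the last one is always a zero-length terminator
for block in matcher.get_matching_blocks():
    if block.size:
        print(block.a, block.b, ' '.join(a[block.a:block.a + block.size]))
```

Here the longest common run is the zero padding at the end of both ARP packets, which suggests that padding may need to be filtered out before a repeated string is used as a signature.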