# C2 Beaconing Detection using Statistical Analysis
### Simplified RITA beacon analyzer implementation in Jupyter Notebook.
### The RITA framework ingests Zeek logs in TSV format. This implementation ingests PCAP files.
#### Implementation based on https://github.com/Cyb3r-Monk/RITA-J.

## Introduction

For background on distinguishing user traffic from beacon traffic using statistical analysis, see https://infosecjupyterthon.com/2021/sessions/day2-5-C2_Beaconing_Detection_using_Statistical_Analysis.html.
TL;DR:
- If beacon traffic ==> **uniform** distribution and **small** Median Absolute Deviation of time deltas
- If user traffic ==> **skewed** distribution and **large** Median Absolute Deviation of time deltas
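
To make the two rules above concrete, here is a minimal sketch on synthetic data. The interval values, the jitter range, and the helper names `madm` and `bowley_skew` are assumptions made up for illustration; the 20th/50th/80th percentiles mirror the ones used later in this notebook.

```python
import numpy as np

def madm(deltas):
    """Median absolute deviation of the deltas about their median."""
    deltas = np.asarray(deltas, dtype=float)
    return float(np.median(np.abs(deltas - np.median(deltas))))

def bowley_skew(deltas, low=20, high=80):
    """Bowley (quantile) skewness using the given lower/upper percentiles."""
    q_low, q_mid, q_high = np.percentile(deltas, [low, 50, high])
    den = q_high - q_low
    return 0.0 if den == 0 else float((q_low + q_high - 2 * q_mid) / den)

rng = np.random.default_rng(0)
# Hypothetical beacon: one connection roughly every 60 s with +/-1 s of jitter.
beacon_deltas = 60 + rng.uniform(-1, 1, size=500)
# Hypothetical user: many short gaps with occasional long pauses.
user_deltas = rng.exponential(scale=120, size=500)

print("beacon:", madm(beacon_deltas), bowley_skew(beacon_deltas))
print("user:  ", madm(user_deltas), bowley_skew(user_deltas))
```

With these assumed distributions the beacon sample should show a much smaller MAD than the user sample, and the user sample should show a positive skew.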


In [215]:
import math
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

## Preparing the data

### Loading the data

To work with the data, first load it into a Pandas DataFrame. The easiest way is to convert the PCAP file to CSV with an external tool such as Wireshark:
- Open the PCAP file in Wireshark
- File -> Export Packet Dissections -> As CSV...
- Save the file in this notebook's directory as traffic1.csv (the name the loading cell below reads)
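
Before loading a real capture, it can help to confirm that the export has the expected shape. The sketch below parses a tiny hand-written sample in the same column layout that `df.info()` reports further down. The three rows are invented for illustration and are not taken from any capture, and the sketch assumes the *Time* column was exported as seconds since the start of the capture.

```python
import io
import pandas as pd

# Invented sample mimicking the layout of a Wireshark CSV export.
sample_csv = '''"No.","Time","Source","Destination","Protocol","Length","Info"
"1","0.000000","10.0.0.5","10.0.0.1","DNS","74","Standard query"
"2","0.015210","10.0.0.1","10.0.0.5","DNS","90","Standard query response"
"3","60.002113","10.0.0.5","203.0.113.7","HTTP","210","GET / HTTP/1.1"
'''

sample = pd.read_csv(io.StringIO(sample_csv))

# Check that every column used later in the notebook is present.
expected = ["No.", "Time", "Source", "Destination", "Protocol", "Length", "Info"]
missing = [c for c in expected if c not in sample.columns]
print("missing columns:", missing)
print(sample.dtypes)
```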

Example traffic pcap: 

In [216]:
df = pd.read_csv('traffic1.csv', sep=',')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67276 entries, 0 to 67275
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   No.          67276 non-null  int64  
 1   Time         67276 non-null  float64
 2   Source       67276 non-null  object 
 3   Destination  67276 non-null  object 
 4   Protocol     67276 non-null  object 
 5   Length       67276 non-null  int64  
 6   Info         67276 non-null  object 
dtypes: float64(1), int64(2), object(4)
memory usage: 3.6+ MB


## Filtering Required Columns
For this task we don't need the *Length*, *Info* and *No.* columns (Pandas maintains its own row index, so *No.* is redundant). To select columns in Pandas:
```python
# Use ":" as the row selector to keep all rows.
df.loc[first_row_index:last_row_index, ['column1', 'column3']]
```
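
As a quick illustration of the `.loc` pattern, here is a sketch on a toy frame. The frame `toy` and its values are made up for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "No.": [1, 2, 3],
    "Time": [0.0, 1.5, 3.0],
    "Source": ["10.0.0.5", "10.0.0.5", "10.0.0.9"],
    "Length": [74, 90, 60],
})

# Keep every row (":") but only the two named columns.
subset = toy.loc[:, ["Time", "Source"]]
# Label-based row slices are inclusive on both ends: rows 0 and 1 here.
first_two = toy.loc[0:1, ["Time", "Source"]]

print(subset.columns.tolist())
print(len(first_two))
```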

In [217]:
time = 'Time'
src = 'Source'
dst = 'Destination'
proto = 'Protocol'

df = df.loc[:, [time, src, dst, proto]]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67276 entries, 0 to 67275
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Time         67276 non-null  float64
 1   Source       67276 non-null  object 
 2   Destination  67276 non-null  object 
 3   Protocol     67276 non-null  object 
dtypes: float64(1), object(3)
memory usage: 2.1+ MB


## Analysing the Data

### Grouping the Connections

Now we need to group the connections between the same pair of hosts (per protocol) and aggregate their timestamps into a list.

In [218]:
df = df.groupby([src, dst, proto]).agg(list)  # TODO (translated): for now I dropped proto from the columns here; will see what comes out
df.head(30)

Source,Destination,Protocol,Time
10.12.15.101,10.12.15.15,CLDAP,"[723.65251, 996.284758, 996.629503, 2120.18524..."
10.12.15.101,10.12.15.15,DCERPC,"[996.402381, 996.404133, 996.40486, 2120.30634..."
10.12.15.101,10.12.15.15,DNS,"[591.575815, 694.549289, 723.243726, 723.33897..."
10.12.15.101,10.12.15.15,DRSUAPI,"[996.405217, 996.405568, 996.405901, 2120.3118..."
10.12.15.101,10.12.15.15,EPM,"[996.40286, 2120.3067, 2120.313517, 5720.28495..."
10.12.15.101,10.12.15.15,KRB5,"[16001.985029, 16001.986043, 16001.987356]"
10.12.15.101,10.12.15.15,LDAP,"[996.612962, 996.615569, 996.627314, 996.74676..."
10.12.15.101,10.12.15.15,LSARPC,"[10589.206797, 10589.20761, 10589.20842]"
10.12.15.101,10.12.15.15,NTP,"[1749.464654, 3797.481993, 5845.527752, 7893.5..."
10.12.15.101,10.12.15.15,RPC_NETLOGON,"[6146.607605, 12447.514515, 18747.513923]"
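
The grouping step above can be illustrated on a toy frame. The addresses and timestamps below are invented; only the `groupby(...).agg(list)` pattern comes from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Time": [0.0, 1.0, 60.0, 61.5, 120.0],
    "Source": ["10.0.0.5", "10.0.0.9", "10.0.0.5", "10.0.0.9", "10.0.0.5"],
    "Destination": ["203.0.113.7"] * 5,
    "Protocol": ["HTTP", "DNS", "HTTP", "DNS", "HTTP"],
})

# One row per (Source, Destination, Protocol); Time becomes a list of timestamps.
grouped = toy.groupby(["Source", "Destination", "Protocol"]).agg(list)

http_times = grouped.loc[("10.0.0.5", "203.0.113.7", "HTTP"), "Time"]
print(http_times)
```

The timestamps in each list keep the row order of the input, so the input should already be sorted by time, as a Wireshark export normally is.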


### Resetting the Index

In [219]:
df.reset_index(inplace=True)
df.head(30)

Unnamed: 0,Source,Destination,Protocol,Time
0,10.12.15.101,10.12.15.15,CLDAP,"[723.65251, 996.284758, 996.629503, 2120.18524..."
1,10.12.15.101,10.12.15.15,DCERPC,"[996.402381, 996.404133, 996.40486, 2120.30634..."
2,10.12.15.101,10.12.15.15,DNS,"[591.575815, 694.549289, 723.243726, 723.33897..."
3,10.12.15.101,10.12.15.15,DRSUAPI,"[996.405217, 996.405568, 996.405901, 2120.3118..."
4,10.12.15.101,10.12.15.15,EPM,"[996.40286, 2120.3067, 2120.313517, 5720.28495..."
5,10.12.15.101,10.12.15.15,KRB5,"[16001.985029, 16001.986043, 16001.987356]"
6,10.12.15.101,10.12.15.15,LDAP,"[996.612962, 996.615569, 996.627314, 996.74676..."
7,10.12.15.101,10.12.15.15,LSARPC,"[10589.206797, 10589.20761, 10589.20842]"
8,10.12.15.101,10.12.15.15,NTP,"[1749.464654, 3797.481993, 5845.527752, 7893.5..."
9,10.12.15.101,10.12.15.15,RPC_NETLOGON,"[6146.607605, 12447.514515, 18747.513923]"


### Calculating connection count

In [220]:
count = 'Count'
df[count] = df[time].apply(len)
df.head(30)

Unnamed: 0,Source,Destination,Protocol,Time,Count
0,10.12.15.101,10.12.15.15,CLDAP,"[723.65251, 996.284758, 996.629503, 2120.18524...",44
1,10.12.15.101,10.12.15.15,DCERPC,"[996.402381, 996.404133, 996.40486, 2120.30634...",41
2,10.12.15.101,10.12.15.15,DNS,"[591.575815, 694.549289, 723.243726, 723.33897...",123
3,10.12.15.101,10.12.15.15,DRSUAPI,"[996.405217, 996.405568, 996.405901, 2120.3118...",51
4,10.12.15.101,10.12.15.15,EPM,"[996.40286, 2120.3067, 2120.313517, 5720.28495...",20
5,10.12.15.101,10.12.15.15,KRB5,"[16001.985029, 16001.986043, 16001.987356]",3
6,10.12.15.101,10.12.15.15,LDAP,"[996.612962, 996.615569, 996.627314, 996.74676...",116
7,10.12.15.101,10.12.15.15,LSARPC,"[10589.206797, 10589.20761, 10589.20842]",3
8,10.12.15.101,10.12.15.15,NTP,"[1749.464654, 3797.481993, 5845.527752, 7893.5...",7
9,10.12.15.101,10.12.15.15,RPC_NETLOGON,"[6146.607605, 12447.514515, 18747.513923]",3


### Removing short sessions

Removing short sessions helps eliminate many false positives. Here we keep only groups with more than 36 connections, then reset the index again to remove the gaps left by filtering.

In [221]:
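# Keep only groups with more than 36 connections.
df = df.loc[df[count] > 36]
# drop=True discards the old index instead of adding it as a column.
df = df.reset_index(drop=True)
df = df.loc[:, [src, dst, proto, time, count]]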
df.head(30)

Unnamed: 0,Source,Destination,Protocol,Time,Count
0,10.12.15.101,10.12.15.15,CLDAP,"[723.65251, 996.284758, 996.629503, 2120.18524...",44
1,10.12.15.101,10.12.15.15,DCERPC,"[996.402381, 996.404133, 996.40486, 2120.30634...",41
2,10.12.15.101,10.12.15.15,DNS,"[591.575815, 694.549289, 723.243726, 723.33897...",123
3,10.12.15.101,10.12.15.15,DRSUAPI,"[996.405217, 996.405568, 996.405901, 2120.3118...",51
4,10.12.15.101,10.12.15.15,LDAP,"[996.612962, 996.615569, 996.627314, 996.74676...",116
5,10.12.15.101,10.12.15.15,SMB2,"[400.478648, 511.551121, 511.553073, 511.55653...",411
6,10.12.15.101,10.12.15.15,TCP,"[40.176494, 160.177513, 280.178683, 400.192905...",715
7,10.12.15.101,172.241.27.244,HTTP,"[13467.840805, 13527.940792, 13533.041096, 135...",2501
8,10.12.15.101,172.241.27.244,TCP,"[13406.400747, 13406.434748, 13406.483509, 134...",19007
9,10.12.15.101,172.241.27.244,TLSv1.2,"[13406.437224, 13406.578584, 13406.692834, 134...",3466


### Calculating time deltas

We need to create a new column containing the time deltas between consecutive connections. The scoring steps below operate on these deltas.

In [222]:
dlt = 'Deltas'
df[dlt] = df[time].apply(lambda x: pd.Series(x).diff().dropna().tolist())
df.head(30)

Unnamed: 0,Source,Destination,Protocol,Time,Count,Deltas
0,10.12.15.101,10.12.15.15,CLDAP,"[723.65251, 996.284758, 996.629503, 2120.18524...",44,"[272.632248, 0.344744999999989, 1123.555738, 3..."
1,10.12.15.101,10.12.15.15,DCERPC,"[996.402381, 996.404133, 996.40486, 2120.30634...",41,"[0.001752000000010412, 0.0007269999999834909, ..."
2,10.12.15.101,10.12.15.15,DNS,"[591.575815, 694.549289, 723.243726, 723.33897...",123,"[102.97347400000001, 28.694436999999994, 0.095..."
3,10.12.15.101,10.12.15.15,DRSUAPI,"[996.405217, 996.405568, 996.405901, 2120.3118...",51,"[0.0003510000000233049, 0.0003329999999550637,..."
4,10.12.15.101,10.12.15.15,LDAP,"[996.612962, 996.615569, 996.627314, 996.74676...",116,"[0.0026070000000117943, 0.011744999999905303, ..."
5,10.12.15.101,10.12.15.15,SMB2,"[400.478648, 511.551121, 511.553073, 511.55653...",411,"[111.072473, 0.0019519999999602078, 0.00346300..."
6,10.12.15.101,10.12.15.15,TCP,"[40.176494, 160.177513, 280.178683, 400.192905...",715,"[120.00101900000001, 120.00116999999997, 120.0..."
7,10.12.15.101,172.241.27.244,HTTP,"[13467.840805, 13527.940792, 13533.041096, 135...",2501,"[60.0999869999996, 5.100304000001415, 5.101183..."
8,10.12.15.101,172.241.27.244,TCP,"[13406.400747, 13406.434748, 13406.483509, 134...",19007,"[0.03400099999998929, 0.048761000000013155, 0...."
9,10.12.15.101,172.241.27.244,TLSv1.2,"[13406.437224, 13406.578584, 13406.692834, 134...",3466,"[0.14136000000144122, 0.11424999999871943, 0.2..."
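
As a small sketch of what the delta step does, the example below applies it to four invented timestamps and compares it with `np.diff`, which gives the same result for a plain list of floats. A list of n timestamps always yields n - 1 deltas.

```python
import numpy as np
import pandas as pd

timestamps = [40.0, 160.0, 280.5, 400.5]  # invented values

# Same expression as in the cell above, applied to a single list.
deltas_pandas = pd.Series(timestamps).diff().dropna().tolist()
# Equivalent NumPy form.
deltas_numpy = np.diff(timestamps).tolist()

print(deltas_pandas)  # [120.0, 120.5, 120.0]
```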


### Generating variables required for score calculation

In [223]:
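# 20th, 50th and 80th percentiles of the time deltas.
df['Low'] = df[dlt].apply(lambda x: np.percentile(np.array(x), 20))
df['Mid'] = df[dlt].apply(lambda x: np.percentile(np.array(x), 50))
df['High'] = df[dlt].apply(lambda x: np.percentile(np.array(x), 80))
# Bowley skewness = (Low + High - 2 * Mid) / (High - Low).
df['BowleyNum'] = df['Low'] + df['High'] - 2 * df['Mid']
df['BowleyDen'] = df['High'] - df['Low']
# Fall back to 0.0 when the numerator is zero or Mid coincides with Low or High (this also avoids dividing by zero).
df['Skew'] = df[['Low', 'Mid', 'High', 'BowleyNum', 'BowleyDen']].apply(lambda x: x['BowleyNum'] / x['BowleyDen'] if x['BowleyNum'] != 0 and x['Mid'] != x['Low'] and x['Mid'] != x['High'] else 0.0, axis = 1)
# Median absolute deviation of the deltas about their median.
df['Madm'] = df[dlt].apply(lambda x: np.median(np.absolute(np.array(x) - np.median(np.array(x)))))
# Time span between the first and last connection in the group.
df['ConnDiv'] = df[time].apply(lambda x: x[-1] - x[0])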
df.head(30)

Unnamed: 0,Source,Destination,Protocol,Time,Count,Deltas,Low,Mid,High,BowleyNum,BowleyDen,Skew,Madm,ConnDiv
0,10.12.15.101,10.12.15.15,CLDAP,"[723.65251, 996.284758, 996.629503, 2120.18524...",44,"[272.632248, 0.344744999999989, 1123.555738, 3...",0.391889,68.554068,950.212076,813.495829,949.820186,0.856474,68.433433,19317.787009
1,10.12.15.101,10.12.15.15,DCERPC,"[996.402381, 996.404133, 996.40486, 2120.30634...",41,"[0.001752000000010412, 0.0007269999999834909, ...",0.001005,0.002455,1152.898627,1152.894721,1152.897622,0.999997,0.001671,17751.112642
2,10.12.15.101,10.12.15.15,DNS,"[591.575815, 694.549289, 723.243726, 723.33897...",123,"[102.97347400000001, 28.694436999999994, 0.095...",0.488819,86.548064,330.717928,158.110618,330.229108,0.478791,85.759133,19449.862799
3,10.12.15.101,10.12.15.15,DRSUAPI,"[996.405217, 996.405568, 996.405901, 2120.3118...",51,"[0.0003510000000233049, 0.0003329999999550637,...",0.00035,0.000544,123.158353,123.157615,123.158003,0.999997,0.000208,17751.111674
4,10.12.15.101,10.12.15.15,LDAP,"[996.612962, 996.615569, 996.627314, 996.74676...",116,"[0.0026070000000117943, 0.011744999999905303, ...",0.000438,0.001417,0.012073,0.009677,0.011635,0.83168,0.001108,17751.691632
5,10.12.15.101,10.12.15.15,SMB2,"[400.478648, 511.551121, 511.553073, 511.55653...",411,"[111.072473, 0.0019519999999602078, 0.00346300...",0.0002,0.000424,10.732199,10.731551,10.731999,0.999958,0.000256,19213.157736
6,10.12.15.101,10.12.15.15,TCP,"[40.176494, 160.177513, 280.178683, 400.192905...",715,"[120.00101900000001, 120.00116999999997, 120.0...",0.000433,0.281315,60.030791,59.468592,60.030358,0.990642,0.281253,19933.531681
7,10.12.15.101,172.241.27.244,HTTP,"[13467.840805, 13527.940792, 13533.041096, 135...",2501,"[60.0999869999996, 5.100304000001415, 5.101183...",0.046394,3.179426,5.068618,-1.243841,5.022224,-0.247667,1.93166,6582.598068
8,10.12.15.101,172.241.27.244,TCP,"[13406.400747, 13406.434748, 13406.483509, 134...",19007,"[0.03400099999998929, 0.048761000000013155, 0....",0.000185,0.04237,0.110825,0.026271,0.11064,0.237446,0.041961,6646.314089
9,10.12.15.101,172.241.27.244,TLSv1.2,"[13406.437224, 13406.578584, 13406.692834, 134...",3466,"[0.14136000000144122, 0.11424999999871943, 0.2...",0.001465,0.116186,5.162748,4.931841,5.161284,0.955545,0.114731,6646.168672
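
For reference, the quantities computed in the cell above can be written out as follows. The symbols are notation chosen here for illustration: $\Delta_i$ are the time deltas of a group, $t_1, \dots, t_n$ its timestamps, and $P_{20}$, $P_{50}$, $P_{80}$ the 20th, 50th and 80th percentiles of the deltas (the `Low`, `Mid` and `High` columns).

$$
\begin{aligned}
\text{Skew} &= \frac{P_{20} + P_{80} - 2P_{50}}{P_{80} - P_{20}} \\
\text{Madm} &= \operatorname{median}_i \left| \Delta_i - \operatorname{median}_j \Delta_j \right| \\
\text{ConnDiv} &= t_n - t_1
\end{aligned}
$$

The code sets Skew to 0 when the numerator is 0 or when $P_{50}$ equals $P_{20}$ or $P_{80}$. As a check against the HTTP row (index 7) in the table above: $0.046394 + 5.068618 - 2 \times 3.179426 = -1.243840$ and $5.068618 - 0.046394 = 5.022224$, giving a skew of about $-0.2477$, which agrees with the table up to rounding of the displayed values.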


### Calculating the score

In [224]:
score = 'Score'
# Symmetric deltas (skew near 0) score close to 1.
df['SkewScore'] = 1.0 - abs(df['Skew'])
# A MADM of 30 or more maps to 0; negative values are clamped.
df['MadmScore'] = 1.0 - df['Madm']/30.0
df['MadmScore'] = df['MadmScore'].apply(lambda x: 0 if x < 0 else x)
# Connection density relative to the group's time span, capped at 1.0.
df['ConnCountScore'] = 10 * (df[count]) / df['ConnDiv']
df['ConnCountScore'] = df['ConnCountScore'].apply(lambda x: 1.0 if x > 1.0 else x)
# Final score: mean of the three sub-scores.
df[score] = (df['SkewScore'] + df['MadmScore'] + df['ConnCountScore']) / 3.0
df.sort_values(by=score, ascending=False, inplace=True, ignore_index=True)
df[[score, count, src, dst, proto, dlt]].head(5)


Unnamed: 0,Score,Count,Source,Destination,Protocol,Deltas
0,0.982519,1282,185.125.206.173,10.12.15.101,HTTP,"[60.32279100000051, 60.331709999998566, 5.3335..."
1,0.966995,1282,10.12.15.101,185.125.206.173,HTTP,"[60.326828000001115, 60.331803999999465, 5.328..."
2,0.920385,19007,10.12.15.101,172.241.27.244,TCP,"[0.03400099999998929, 0.048761000000013155, 0...."
3,0.896604,2501,172.241.27.244,10.12.15.101,HTTP,"[60.10095299999921, 5.1002430000007735, 5.0978..."
4,0.895981,2501,10.12.15.101,172.241.27.244,HTTP,"[60.0999869999996, 5.100304000001415, 5.101183..."
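
To see how the three sub-scores combine, the sketch below recomputes the score for a single row by hand. The function name `beacon_score` and the variable `example_score` are hypothetical helpers introduced here; the input values are copied from the 10.12.15.101 -> 172.241.27.244 HTTP row of the earlier table.

```python
def beacon_score(skew, madm, count, conn_div):
    """Scalar version of the scoring cell above: mean of three sub-scores."""
    skew_score = 1.0 - abs(skew)
    madm_score = max(0.0, 1.0 - madm / 30.0)            # clamp negatives to 0
    conn_count_score = min(1.0, 10 * count / conn_div)  # cap at 1.0
    return (skew_score + madm_score + conn_count_score) / 3.0

example_score = beacon_score(skew=-0.247667, madm=1.93166,
                             count=2501, conn_div=6582.598068)
print(round(example_score, 6))  # 0.895981
```

The result matches the score listed for that row in the output above. Here the connection-count term is capped at 1.0, so the score is driven by the skew and MADM terms.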


We can now drop the redundant columns.

In [225]:
df = df.loc[:, [score, count, src, dst, proto]]
df.head(30)

Unnamed: 0,Score,Count,Source,Destination,Protocol
0,0.982519,1282,185.125.206.173,10.12.15.101,HTTP
1,0.966995,1282,10.12.15.101,185.125.206.173,HTTP
2,0.920385,19007,10.12.15.101,172.241.27.244,TCP
3,0.896604,2501,172.241.27.244,10.12.15.101,HTTP
4,0.895981,2501,10.12.15.101,172.241.27.244,HTTP
5,0.7285,6293,10.12.15.101,185.125.206.173,TCP
6,0.703349,15858,172.241.27.244,10.12.15.101,TCP
7,0.685554,5350,185.125.206.173,10.12.15.101,TCP
8,0.68021,3466,10.12.15.101,172.241.27.244,TLSv1.2
9,0.671963,3925,172.241.27.244,10.12.15.101,TLSv1.2
