# Feature Extraction, Initial Data Exploration, and Cleaning
In the dataset provided, log data was stored in a text (.txt) file in the following format:

`
 2023-10-31T11:02:35.405983+05:30 172.26.5.193 logver=506141727 timestamp=1698709215 tz="UTC+5:30" devname="FGT3600C_HA" devid="FG3K6C3A15800081" vd="root" date=2023-10-31 time=05:10:15 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1698709215 srcip=106.193.78.119 srcport=3082 srcintf="LLB- Connect" srcintfrole="wan" dstip=172.26.1.176 dstport=990 dstintf="Local_LAN" dstintfrole="undefined" poluuid="ae59ebc2-1562-51e9-555a-fa3846aac163" sessionid=1943996287 proto=6 action="client-rst" policyid=63 policytype="policy" service="FTPS" dstcountry="Reserved" srccountry="India" trandisp="noop" duration=6 sentbyte=124 rcvdbyte=244 sentpkt=3 appcat="unscanned"
 `

However, in order to use this data in our analysis, we need to extract the relevant information from it. Hence, feature extraction is required.

In preliminary feature extraction, we need to identify the following information:

-   The number of unique features in each log line
-   The pattern in which each feature appears in the log lines

We then use this information to formulate regular expressions (regex) for feature extraction.
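
As a rough illustration of this preliminary step, the sketch below lists the `key=value` field names found in a line and tallies in how many lines each name occurs. The pattern `KV_PATTERN` and the helpers `field_names` and `field_frequencies` are illustrative assumptions, not part of the original notebook, and the sketch assumes quoted values never contain a double quote.

```python
import re
from collections import Counter

# Assumed pattern: a word-like key, '=', then either a quoted value
# (which may contain spaces) or a run of non-whitespace characters.
KV_PATTERN = re.compile(r'(\w+)=("[^"]*"|\S+)')

def field_names(line):
    """Return the field names of one log line, in order of appearance."""
    return [key for key, _ in KV_PATTERN.findall(line)]

def field_frequencies(lines):
    """Count in how many lines each field name occurs."""
    counts = Counter()
    for line in lines:
        counts.update(set(field_names(line)))
    return counts

sample = 'logver=506141727 tz="UTC+5:30" srcintf="LLB- Connect" srcip=106.193.78.119 proto=6'
print(field_names(sample))
```

Running `field_frequencies(lines)` on the full log would show which fields appear in every line and which appear only in some, which is the information needed to decide on the extraction patterns.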



## Feature Extraction

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


---
We read the log file from the dataset and store it in a list.


In [None]:
with open('/content/drive/MyDrive/logfiles.txt', 'r') as file:
    lines = file.readlines()
len(lines)

380837

---
Because the number of features varies from line to line, we find the longest line in the log file to estimate the maximum number of features.

In [None]:
lines = [line.strip() for line in lines]
longest_line = max(lines, key=len)
print(longest_line)

2023-10-31T04:45:05.644585+05:30 172.26.5.193 logver=506141727 timestamp=1698707640 tz="UTC+5:30" devname="FGT3600C_HA" devid="FG3K6C3A15800081" vd="root" date=2023-10-31 time=04:44:00 logid="0419016384" type="utm" subtype="ips" eventtype="signature" level="alert" eventtime=1698707640 severity="high" srcip=211.63.167.125 srccountry="Korea, Republic of" dstip=172.26.2.54 srcintf="LLB- Connect" srcintfrole="wan" dstintf="Local_LAN" dstintfrole="undefined" sessionid=1943942664 action="dropped" proto=6 service="HTTP" policyid=42 attack="HTTP.Unix.Shell.IFS.Remote.Code.Execution" srcport=43529 dstport=443 direction="outgoing" attackid=45677 profile="default" ref="http://www.fortinet.com/ids/VID45677" incidentserialno=1625842232 msg="misc: HTTP.Unix.Shell.IFS.Remote.Code.Execution," crscore=30 crlevel="high"
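
---
Character length is only a proxy for the number of features, because a line with a few long quoted values can be longer than a line with more fields. A minimal sketch that compares lines by field count instead is shown below. The helper `count_fields`, its key pattern, and the toy lines are assumptions for illustration; the pattern assumes that `name=` never appears inside a quoted value.

```python
import re

# Assumed pattern: a field name is a word at the start of the line or
# after whitespace, immediately followed by '='.
KEY_PATTERN = re.compile(r'(?:^|\s)(\w+)=')

def count_fields(line):
    """Return the number of key=value fields found in one log line."""
    return len(KEY_PATTERN.findall(line))

toy_lines = [
    'type="utm" msg="a long quoted message with many words in it"',
    'type="traffic" proto=6 duration=6 sentbyte=124',
]

longest_by_chars = max(toy_lines, key=len)
most_fields = max(toy_lines, key=count_fields)
print(count_fields(longest_by_chars), count_fields(most_fields))
```

On the toy data the longest line by characters is not the line with the most fields, so it may be worth checking both criteria on the real log.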


---
Since the longest line in the log file is an attack log, we also print the last five non-attack log lines and check their fields for inconsistencies.

In [None]:
# Keep only the lines that do not mention an attack
non_attack = [line for line in lines if 'attack' not in line]

# Print the last five non-attack lines, starting from the final one
for line in reversed(non_attack[-5:]):
    print(line)

2023-10-31T11:02:35.405983+05:30 172.26.5.193 logver=506141727 timestamp=1698709215 tz="UTC+5:30" devname="FGT3600C_HA" devid="FG3K6C3A15800081" vd="root" date=2023-10-31 time=05:10:15 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1698709215 srcip=106.193.78.119 srcport=3082 srcintf="LLB- Connect" srcintfrole="wan" dstip=172.26.1.176 dstport=990 dstintf="Local_LAN" dstintfrole="undefined" poluuid="ae59ebc2-1562-51e9-555a-fa3846aac163" sessionid=1943996287 proto=6 action="client-rst" policyid=63 policytype="policy" service="FTPS" dstcountry="Reserved" srccountry="India" trandisp="noop" duration=6 sentbyte=124 rcvdbyte=244 sentpkt=3 appcat="unscanned"
2023-10-31T11:02:35.405983+05:30 172.26.5.193 logver=506141727 timestamp=1698709214 tz="UTC+5:30" devname="FGT3600C_HA" devid="FG3K6C3A15800081" vd="root" date=2023-10-31 time=05:10:14 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1698709214 srcip=65.2.1.109 srcport=57934 srcintf


---
We create a function to parse the log lines and extract the relevant information in the form of a dictionary of features, which we then append to a list. This list is then converted into a pandas dataframe for further analysis.



In [None]:
import pandas as pd
import re

def parse_log(log):
    """Parse one log line into a dictionary of features."""
    result = {}

    # Leading ISO 8601 timestamp (the UTC offset is not captured)
    timestamp = re.search(r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+', log)
    if timestamp:
        result['timestamp'] = timestamp.group()

    # Device name and ID appear as adjacent quoted fields
    device = re.search(r'devname="(.+?)" devid="(.+?)"', log)
    if device:
        result['devname'], result['devid'] = device.groups()

    log_fields = [
        'logver', 'tz', 'vd', 'date', 'time',
        'logid', 'type', 'subtype', 'eventtype', 'level', 'eventtime', 'severity',
        'srcip', 'srccountry', 'dstip', 'srcintf', 'srcintfrole', 'dstintf',
        'dstintfrole', 'sessionid', 'action', 'proto', 'service', 'policyid',
        'attack', 'srcport', 'dstport', 'direction', 'attackid', 'profile', 'ref',
        'incidentserialno', 'msg', 'crscore', 'crlevel', 'appcat', 'duration',
        'sentbyte', 'rcvdbyte',
    ]
    for field in log_fields:
        # Unquoted value, e.g. srcport=3082. For a quoted field this matches
        # an empty string, which the quoted pattern below normally overwrites.
        pattern = re.compile(r'\s{}=([^"\s]*)'.format(field))
        value = pattern.search(log)
        if value:
            result[field] = value.group(1)
        # Quoted value, e.g. service="FTPS". The trailing \s means that a
        # quoted field at the very end of a stripped line is not matched here
        # and therefore keeps the empty string set above.
        pattern = re.compile(r'\s{}="([^"]*)"\s'.format(field))
        value = pattern.search(log)
        if value:
            result[field] = value.group(1)
    return result

testlog = ['2023-10-31T11:02:35.405983+05:30 172.26.5.193 logver=506141727 timestamp=1698709215 tz="UTC+5:30" devname="FGT3600C_HA" devid="FG3K6C3A15800081" vd="root" date=2023-10-31 time=05:10:15 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1698709215 srcip=106.193.78.119 srcport=3082 srcintf="LLB- Connect" srcintfrole="wan" dstip=172.26.1.176 dstport=990 dstintf="Local_LAN" dstintfrole="undefined" poluuid="ae59ebc2-1562-51e9-555a-fa3846aac163" sessionid=1943996287 proto=6 action="client-rst" policyid=63 policytype="policy" service="FTPS" dstcountry="Reserved" srccountry="India" trandisp="noop" duration=6 sentbyte=124 rcvdbyte=244 sentpkt=3 appcat="unscanned"', '2023-10-31T11:02:35.405983+05:30 172.26.5.193 logver=506141727 timestamp=1698709214 tz="UTC+5:30" devname="FGT3600C_HA" devid="FG3K6C3A15800081" vd="root" date=2023-10-31 time=05:10:14 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1698709214 srcip=65.2.1.109 srcport=57934 srcintf="LLB- Connect" srcintfrole="wan" dstip=172.26.2.51 dstport=443 dstintf="Local_LAN" dstintfrole="undefined" poluuid="3367bf4c-74ff-51e8-3e96-72b0684b3e81" sessionid=1943995865 proto=6 action="client-rst" policyid=49 policytype="policy" service="HTTPS" dstcountry="Reserved" srccountry="India" trandisp="noop" duration=19 sentbyte=320 rcvdbyte=2530 sentpkt=6 appcat="unscanned"', '2023-10-31T11:02:35.405983+05:30 172.26.5.193 logver=506141727 timestamp=1698709211 tz="UTC+5:30" devname="FGT3600C_HA" devid="FG3K6C3A15800081" vd="root" date=2023-10-31 time=05:10:11 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1698709211 srcip=23.22.35.162 srcport=17191 srcintf="LLB- Connect" srcintfrole="wan" dstip=172.26.2.66 dstport=443 dstintf="Local_LAN" dstintfrole="undefined" poluuid="eed00e84-e899-51e8-8443-676ac9d33e22" sessionid=1943996156 proto=6 action="client-rst" policyid=67 policytype="policy" service="HTTPS" dstcountry="Reserved" 
srccountry="United States" trandisp="noop" duration=6 sentbyte=216 rcvdbyte=248 sentpkt=4 appcat="unscanned"', '2023-10-31T11:02:35.405983+05:30 172.26.5.193 logver=506141727 timestamp=1698709215 tz="UTC+5:30" devname="FGT3600C_HA" devid="FG3K6C3A15800081" vd="root" date=2023-10-31 time=05:10:15 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1698709215 srcip=115.97.144.48 srcport=52867 srcintf="LLB- Connect" srcintfrole="wan" dstip=172.26.2.65 dstport=443 dstintf="Local_LAN" dstintfrole="undefined" poluuid="d92146b0-099c-51e9-d9b2-142b72ef82b7" sessionid=1943996365 proto=6 action="close" policyid=60 policytype="policy" service="HTTPS" dstcountry="Reserved" srccountry="India" trandisp="noop" duration=3 sentbyte=416 rcvdbyte=9096 sentpkt=9 rcvdpkt=13 appcat="unscanned"', '2023-10-31T10:52:38.585306+05:30 172.26.5.193 logver=506141727 timestamp=1698709208 tz="UTC+5:30" devname="FGT3600C_HA" devid="FG3K6C3A15800081" vd="root" date=2023-10-31 time=05:10:08 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1698709208 srcip=194.135.25.85 srcintf="LLB- Connect" srcintfrole="wan" dstip=172.26.2.51 dstintf="Local_LAN" dstintfrole="undefined" poluuid="3367bf4c-74ff-51e8-3e96-72b0684b3e81" sessionid=1943993979 proto=1 action="accept" policyid=49 policytype="policy" service="PING" dstcountry="Reserved" srccountry="United Kingdom" trandisp="noop" duration=70 sentbyte=132 rcvdbyte=172 sentpkt=3 rcvdpkt=3 appcat="unscanned"', '2023-10-31T10:52:38.585306+05:30 172.26.5.193 logver=506141727 timestamp=1698709208 tz="UTC+5:30" devname="FGT3600C_HA" devid="FG3K6C3A15800081" vd="root" date=2023-10-31 time=05:10:08 logid="0000000011" type="traffic" subtype="forward" level="warning" eventtime=1698709208 srcip=194.135.25.85 srcintf="LLB- Connect" srcintfrole="wan" dstip=172.26.2.51 dstintf="Local_LAN" dstintfrole="undefined" poluuid="3367bf4c-74ff-51e8-3e96-72b0684b3e81" sessionid=1943993979 proto=1 action="ip-conn" policyid=49 
policytype="policy" service="icmp/0/8" appcat="unscanned" crscore=5 craction=262144 crlevel="low"', '2023-10-31T10:52:38.585306+05:30 172.26.5.193 logver=506141727 timestamp=1698709197 tz="UTC+5:30" devname="FGT3600C_HA" devid="FG3K6C3A15800081" vd="root" date=2023-10-31 time=05:09:57 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1698709197 srcip=106.194.128.164 srcport=59080 srcintf="LLB- Connect" srcintfrole="wan" dstip=172.26.2.51 dstport=443 dstintf="Local_LAN" dstintfrole="undefined" poluuid="3367bf4c-74ff-51e8-3e96-72b0684b3e81" sessionid=1943994822 proto=6 action="client-rst" policyid=49 policytype="policy" service="HTTPS" dstcountry="Reserved" srccountry="India" trandisp="noop" duration=29 sentbyte=268 rcvdbyte=251 sentpkt=5 appcat="unscanned"', '2023-10-31T10:52:38.585306+05:30 172.26.5.193 logver=506141727 timestamp=1698709197 tz="UTC+5:30" devname="FGT3600C_HA" devid="FG3K6C3A15800081" vd="root" date=2023-10-31 time=05:09:57 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1698709197 srcip=3.224.220.101 srcport=4045 srcintf="LLB- Connect" srcintfrole="wan" dstip=172.26.2.66 dstport=443 dstintf="Local_LAN" dstintfrole="undefined" poluuid="eed00e84-e899-51e8-8443-676ac9d33e22" sessionid=1943995745 proto=6 action="client-rst" policyid=67 policytype="policy" service="HTTPS" dstcountry="Reserved" srccountry="United States" trandisp="noop" duration=6 sentbyte=216 rcvdbyte=248 sentpkt=4 appcat="unscanned"', '2023-10-31T10:52:38.585306+05:30 172.26.5.193 logver=506141727 timestamp=1698709195 tz="UTC+5:30" devname="FGT3600C_HA" devid="FG3K6C3A15800081" vd="root" date=2023-10-31 time=05:09:55 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1698709195 srcip=115.97.144.48 srcport=52862 srcintf="LLB- Connect" srcintfrole="wan" dstip=172.26.2.65 dstport=443 dstintf="Local_LAN" dstintfrole="undefined" poluuid="d92146b0-099c-51e9-d9b2-142b72ef82b7" sessionid=1943995685 proto=6 
action="close" policyid=60 policytype="policy" service="HTTPS" dstcountry="Reserved" srccountry="India" trandisp="noop" duration=6 sentbyte=212 rcvdbyte=464 sentpkt=5 rcvdpkt=5 appcat="unscanned"']

df = pd.DataFrame([parse_log(log) for log in lines])
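
---
One caveat of `parse_log`: its quoted-value pattern requires whitespace after the closing quote, so a quoted field that ends a stripped line (such as `appcat="unscanned"` in the sample log) is stored as an empty string. A possible alternative, sketched below under the assumption that every field has the form `key=value` or `key="value"` with no escaped quotes, parses all pairs in a single pass. `parse_pairs` and `PAIR_PATTERN` are hypothetical names. Unlike `parse_log`, this sketch keeps every field it finds, so its `timestamp` key would hold the epoch field rather than the leading ISO timestamp.

```python
import re

# Assumed pattern: key, '=', then a quoted value (group 2) or a bare value (group 3).
PAIR_PATTERN = re.compile(r'(\w+)=(?:"([^"]*)"|(\S*))')

def parse_pairs(log):
    """Return a dict of all key=value pairs in one log line."""
    result = {}
    for match in PAIR_PATTERN.finditer(log):
        key, quoted, bare = match.group(1), match.group(2), match.group(3)
        # Prefer the quoted capture; fall back to the bare one
        result[key] = quoted if quoted is not None else bare
    return result

sample = 'sentpkt=3 srcintf="LLB- Connect" appcat="unscanned"'
print(parse_pairs(sample))
```

Because the quoted alternative consumes the whole value, a trailing quoted field is captured without needing whitespace after it.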

In [None]:
df

Unnamed: 0,timestamp,devname,devid,logver,tz,vd,date,time,logid,type,...,service,policyid,srcport,dstport,appcat,duration,sentbyte,rcvdbyte,crscore,crlevel
0,2023-10-31T11:02:35.405983,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:10:15,13,traffic,...,FTPS,63,3082.0,990.0,,6.0,124.0,244.0,,
1,2023-10-31T11:02:35.405983,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:10:14,13,traffic,...,HTTPS,49,57934.0,443.0,,19.0,320.0,2530.0,,
2,2023-10-31T11:02:35.405983,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:10:11,13,traffic,...,HTTPS,67,17191.0,443.0,,6.0,216.0,248.0,,
3,2023-10-31T11:02:35.405983,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:10:15,13,traffic,...,HTTPS,60,52867.0,443.0,,3.0,416.0,9096.0,,
4,2023-10-31T10:52:38.585306,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:10:08,13,traffic,...,PING,49,,,,70.0,132.0,172.0,,
5,2023-10-31T10:52:38.585306,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:10:08,11,traffic,...,icmp/0/8,49,,,unscanned,,,,5.0,
6,2023-10-31T10:52:38.585306,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:09:57,13,traffic,...,HTTPS,49,59080.0,443.0,,29.0,268.0,251.0,,
7,2023-10-31T10:52:38.585306,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:09:57,13,traffic,...,HTTPS,67,4045.0,443.0,,6.0,216.0,248.0,,
8,2023-10-31T10:52:38.585306,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:09:55,13,traffic,...,HTTPS,60,52862.0,443.0,,6.0,212.0,464.0,,


In [None]:
df.to_csv('log_og.csv', index=False)

Now we have a dataframe with all the data we need. We can start working on some exploratory data analysis in order to identify the most interesting features.

## Exploratory Data Analysis

In [None]:
for col in df.columns:
    print(col, df[col].isna().sum())

timestamp 0
devname 0
devid 0
logver 0
tz 0
vd 0
date 0
time 0
logid 0
type 0
subtype 0
level 0
eventtime 0
srcip 2000
srccountry 2339
dstip 2000
srcintf 2000
srcintfrole 2000
dstintf 2000
dstintfrole 2000
sessionid 2000
action 1860
proto 2000
service 2000
policyid 2000
srcport 95850
dstport 95850
appcat 2003
duration 2342
sentbyte 2342
rcvdbyte 2342
crscore 364046
crlevel 364049
msg 378834
eventtype 380834
severity 380834
attack 380834
direction 380834
attackid 380834
profile 380834
ref 380834
incidentserialno 380834


---
We can see that the following columns are missing in almost every row:
- `msg`, `eventtype`, `severity`, `attack`, `direction`, `attackid`, `profile`, `ref`, `incidentserialno`

Upon further analysis, we find that all of these except `msg` are present only in the three attack logs. `msg` is present in 2,003 rows: the three attack logs and the 2,000 event logs.
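
A quick way to check where a sparse column is populated is to count its non-null values per log type. The sketch below does this on a small toy frame; the toy rows and the name `populated` are assumptions for illustration, not data from the log file.

```python
import pandas as pd

# Toy stand-in for the parsed log frame (hypothetical rows)
toy = pd.DataFrame({
    'type': ['traffic', 'traffic', 'event', 'utm'],
    'attack': [None, None, None, 'Gh0st.Rat.Botnet'],
    'msg': [None, None, 'unable to resolve FortiGuard hostname',
            'backdoor: Gh0st.Rat.Botnet,'],
})

# count() ignores missing values, so this gives non-null counts per type
populated = toy.groupby('type')[['attack', 'msg']].count()
print(populated)
```

Applied to the real `df`, the same `groupby('type')` call would show for each log type how many rows carry each attack-related column.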

In [None]:
attacks = df[~df.attack.isna()]
attacks

Unnamed: 0,timestamp,devname,devid,logver,tz,vd,date,time,logid,type,...,crlevel,msg,eventtype,severity,attack,direction,attackid,profile,ref,incidentserialno
276767,2023-10-31T03:07:04.076644,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,03:06:50,419016384,utm,...,,"backdoor: Gh0st.Rat.Botnet,",signature,critical,Gh0st.Rat.Botnet,outgoing,38503,default,http://www.fortinet.com/ids/VID38503,1525791340
292250,2023-10-31T03:25:39.092502,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,03:25:33,419016384,utm,...,,"backdoor: Gh0st.Rat.Botnet,",signature,critical,Gh0st.Rat.Botnet,outgoing,38503,default,http://www.fortinet.com/ids/VID38503,1802804642
361755,2023-10-31T04:45:05.644585,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,04:44:00,419016384,utm,...,,"misc: HTTP.Unix.Shell.IFS.Remote.Code.Execution,",signature,high,HTTP.Unix.Shell.IFS.Remote.Code.Execution,outgoing,45677,default,http://www.fortinet.com/ids/VID45677,1625842232


---
We explore these attacks in more detail:

In [None]:
for col in attacks.columns:
    print(f'{col}')
    for row in attacks[col]:
        print(f'\t{row}')

timestamp
	2023-10-31T03:07:04.076644
	2023-10-31T03:25:39.092502
	2023-10-31T04:45:05.644585
devname
	FGT3600C_HA
	FGT3600C_HA
	FGT3600C_HA
devid
	FG3K6C3A15800081
	FG3K6C3A15800081
	FG3K6C3A15800081
logver
	506141727
	506141727
	506141727
tz
	UTC+5:30
	UTC+5:30
	UTC+5:30
vd
	root
	root
	root
date
	2023-10-31
	2023-10-31
	2023-10-31
time
	03:06:50
	03:25:33
	04:44:00
logid
	0419016384
	0419016384
	0419016384
type
	utm
	utm
	utm
subtype
	ips
	ips
	ips
level
	alert
	alert
	alert
eventtime
	1698701810
	1698702933
	1698707640
srcip
	164.52.0.93
	164.52.0.93
	211.63.167.125
srccountry
	Japan
	Japan
	Korea, Republic of
dstip
	172.26.2.57
	172.26.2.62
	172.26.2.54
srcintf
	LLB- Connect
	LLB- Connect
	LLB- Connect
srcintfrole
	wan
	wan
	wan
dstintf
	Local_LAN
	Local_LAN
	Local_LAN
dstintfrole
	undefined
	undefined
	undefined
sessionid
	1943747726
	1943784658
	1943942664
action
	dropped
	dropped
	dropped
proto
	6
	6
	6
service
	HTTPS
	HTTPS
	HTTP
policyid
	34
	39
	42
srcport
	48705
	56067
	435

---
We return to the data analysis and explore the remaining columns to see whether we can identify interesting features and gain some insights.

In [None]:
df[df.appcat.isna()]

Unnamed: 0,timestamp,devname,devid,logver,tz,vd,date,time,logid,type,...,crlevel,msg,eventtype,severity,attack,direction,attackid,profile,ref,incidentserialno
364,2023-10-30T23:56:11.892189,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:55:07,0100038404,event,...,,,,,,,,,,
785,2023-10-30T23:56:11.896887,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:55:18,0100038404,event,...,,,,,,,,,,
1208,2023-10-30T23:56:11.899906,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:55:27,0100038404,event,...,,,,,,,,,,
1574,2023-10-30T23:56:11.902529,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:55:38,0100038404,event,...,,,,,,,,,,
2184,2023-10-30T23:56:11.913532,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:55:47,0100038404,event,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380357,2023-10-31T06:16:40.381953,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:04:07,0100038404,event,...,,,,,,,,,,
380459,2023-10-31T06:20:01.405501,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:04:17,0100038404,event,...,,,,,,,,,,
380559,2023-10-31T06:20:01.405501,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:04:27,0100038404,event,...,,,,,,,,,,
380639,2023-10-31T06:54:18.721303,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:05:07,0100038404,event,...,,,,,,,,,,


In [None]:
df.type.value_counts()

traffic    378834
event        2000
utm             3
Name: type, dtype: int64

In [None]:
df[df.type == 'event'].subtype.value_counts()

system    2000
Name: subtype, dtype: int64

In [None]:
df[df.type == 'utm'].subtype.value_counts()

ips    3
Name: subtype, dtype: int64

In [None]:
df[df.type == 'utm']

Unnamed: 0,timestamp,devname,devid,logver,tz,vd,date,time,logid,type,...,crlevel,msg,eventtype,severity,attack,direction,attackid,profile,ref,incidentserialno
276767,2023-10-31T03:07:04.076644,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,03:06:50,419016384,utm,...,,"backdoor: Gh0st.Rat.Botnet,",signature,critical,Gh0st.Rat.Botnet,outgoing,38503,default,http://www.fortinet.com/ids/VID38503,1525791340
292250,2023-10-31T03:25:39.092502,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,03:25:33,419016384,utm,...,,"backdoor: Gh0st.Rat.Botnet,",signature,critical,Gh0st.Rat.Botnet,outgoing,38503,default,http://www.fortinet.com/ids/VID38503,1802804642
361755,2023-10-31T04:45:05.644585,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,04:44:00,419016384,utm,...,,"misc: HTTP.Unix.Shell.IFS.Remote.Code.Execution,",signature,high,HTTP.Unix.Shell.IFS.Remote.Code.Execution,outgoing,45677,default,http://www.fortinet.com/ids/VID45677,1625842232


In [None]:
df[df.type == 'event'].msg.value_counts()

    2000
Name: msg, dtype: int64

In [None]:
df[(df.type == 'event') & df.msg.str.startswith('Disk', na=False)]

Unnamed: 0,timestamp,devname,devid,logver,tz,vd,date,time,logid,type,...,crlevel,msg,eventtype,severity,attack,direction,attackid,profile,ref,incidentserialno


In [None]:
print([line for line in lines if 'FortiGuard' in line][0])

2023-10-30T23:56:11.892189+05:30 172.26.5.193 logver=506141727 timestamp=1698690307 tz="UTC+5:30" devname="FGT3600C_HA" devid="FG3K6C3A15800081" vd="root" date=2023-10-30 time=23:55:07 logid="0100038404" type="event" subtype="system" level="error" eventtime=1698690307 logdesc="FortiGuard hostname unresolvable" hostname="service.fortiguard.net" msg="unable to resolve FortiGuard hostname"


In [None]:
df = df[~(df.type == 'event')]
df.shape

(378837, 42)

In [None]:
df.to_csv('log.csv', index=False)
!head log.csv

timestamp,devname,devid,logver,tz,vd,date,time,logid,type,subtype,level,eventtime,srcip,srccountry,dstip,srcintf,srcintfrole,dstintf,dstintfrole,sessionid,action,proto,service,policyid,srcport,dstport,appcat,duration,sentbyte,rcvdbyte,crscore,crlevel,msg,eventtype,severity,attack,direction,attackid,profile,ref,incidentserialno
2023-10-30T23:56:11.890671,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:54:54,0000000013,traffic,forward,notice,1698690294,103.81.182.133,India,172.26.2.51,LLB- Connect,wan,Local_LAN,undefined,1943252826,client-rst,6,HTTPS,49,30390,443,,13,1097,9182,,,,,,,,,,,
2023-10-30T23:56:11.890671,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:54:54,0000000013,traffic,forward,notice,1698690294,106.206.187.236,India,172.26.2.64,LLB- Connect,wan,Local_LAN,undefined,1943252720,client-rst,6,HTTPS,59,9570,443,,14,424,8589,,,,,,,,,,,
2023-10-30T23:56:11.890671,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:54:54,000000

In [None]:
df.srcip

0          103.81.182.133
1         106.206.187.236
2            172.26.1.200
3           103.42.126.80
4            172.26.1.200
               ...       
380832      194.135.25.85
380833      115.97.144.48
380834       23.22.35.162
380835         65.2.1.109
380836     106.193.78.119
Name: srcip, Length: 378837, dtype: object

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 378837 entries, 0 to 380836
Data columns (total 42 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   timestamp         378837 non-null  object
 1   devname           378837 non-null  object
 2   devid             378837 non-null  object
 3   logver            378837 non-null  object
 4   tz                378837 non-null  object
 5   vd                378837 non-null  object
 6   date              378837 non-null  object
 7   time              378837 non-null  object
 8   logid             378837 non-null  object
 9   type              378837 non-null  object
 10  subtype           378837 non-null  object
 11  level             378837 non-null  object
 12  eventtime         378837 non-null  object
 13  srcip             378837 non-null  object
 14  srccountry        378498 non-null  object
 15  dstip             378837 non-null  object
 16  srcintf           378837 non-null  obj

In [None]:
df[df.sentbyte.isna()]

Unnamed: 0,timestamp,devname,devid,logver,tz,vd,date,time,logid,type,...,crlevel,msg,eventtype,severity,attack,direction,attackid,profile,ref,incidentserialno
566,2023-10-30T23:56:11.895918,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:55:09,0000000011,traffic,...,,,,,,,,,,
3079,2023-10-30T23:56:16.486691,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:56:07,0000000011,traffic,...,,,,,,,,,,
3094,2023-10-30T23:56:16.486983,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:56:07,0000000011,traffic,...,,,,,,,,,,
3315,2023-10-30T23:56:22.458813,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:56:14,0000000011,traffic,...,,,,,,,,,,
3327,2023-10-30T23:56:22.459054,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:56:14,0000000011,traffic,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371165,2023-10-31T05:00:03.799419,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,04:54:10,0000000011,traffic,...,,,,,,,,,,
372734,2023-10-31T05:06:19.849278,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,04:55:53,0000000011,traffic,...,,,,,,,,,,
373515,2023-10-31T05:09:43.880561,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,04:56:47,0000000011,traffic,...,,,,,,,,,,
376398,2023-10-31T05:17:00.945220,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:00:01,0000000011,traffic,...,,,,,,,,,,


In [None]:
pd.set_option('display.max_columns', None)
df

Unnamed: 0,timestamp,devname,devid,logver,tz,vd,date,time,logid,type,subtype,level,eventtime,srcip,srccountry,dstip,srcintf,srcintfrole,dstintf,dstintfrole,sessionid,action,proto,service,policyid,srcport,dstport,appcat,duration,sentbyte,rcvdbyte,crscore,crlevel,msg,eventtype,severity,attack,direction,attackid,profile,ref,incidentserialno
0,2023-10-30T23:56:11.890671,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:54:54,0000000013,traffic,forward,notice,1698690294,103.81.182.133,India,172.26.2.51,LLB- Connect,wan,Local_LAN,undefined,1943252826,client-rst,6,HTTPS,49,30390,443,,13,1097,9182,,,,,,,,,,,
1,2023-10-30T23:56:11.890671,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:54:54,0000000013,traffic,forward,notice,1698690294,106.206.187.236,India,172.26.2.64,LLB- Connect,wan,Local_LAN,undefined,1943252720,client-rst,6,HTTPS,59,9570,443,,14,424,8589,,,,,,,,,,,
2,2023-10-30T23:56:11.890671,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:54:54,0000000013,traffic,forward,notice,1698690294,172.26.1.200,Reserved,164.100.230.244,Local_LAN,undefined,LLB- Connect,wan,1943253534,close,6,HTTPS,57,49642,443,unscanned,1,1728,5255,,,,,,,,,,,
3,2023-10-30T23:56:11.890671,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:54:54,0000000013,traffic,forward,notice,1698690294,103.42.126.80,India,172.26.2.51,LLB- Connect,wan,Local_LAN,undefined,1943251787,client-rst,6,HTTPS,49,1204,443,,24,356,6708,,,,,,,,,,,
4,2023-10-30T23:56:11.890671,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-30,23:54:55,0000000013,traffic,forward,notice,1698690295,172.26.1.200,Reserved,164.100.230.244,Local_LAN,undefined,LLB- Connect,wan,1943253579,close,6,HTTPS,57,49651,443,unscanned,1,1728,5175,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380832,2023-10-31T10:52:38.585306,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:10:08,0000000013,traffic,forward,notice,1698709208,194.135.25.85,United Kingdom,172.26.2.51,LLB- Connect,wan,Local_LAN,undefined,1943993979,accept,1,PING,49,,,,70,132,172,,,,,,,,,,,
380833,2023-10-31T11:02:35.405983,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:10:15,0000000013,traffic,forward,notice,1698709215,115.97.144.48,India,172.26.2.65,LLB- Connect,wan,Local_LAN,undefined,1943996365,close,6,HTTPS,60,52867,443,,3,416,9096,,,,,,,,,,,
380834,2023-10-31T11:02:35.405983,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:10:11,0000000013,traffic,forward,notice,1698709211,23.22.35.162,United States,172.26.2.66,LLB- Connect,wan,Local_LAN,undefined,1943996156,client-rst,6,HTTPS,67,17191,443,,6,216,248,,,,,,,,,,,
380835,2023-10-31T11:02:35.405983,FGT3600C_HA,FG3K6C3A15800081,506141727,UTC+5:30,root,2023-10-31,05:10:14,0000000013,traffic,forward,notice,1698709214,65.2.1.109,India,172.26.2.51,LLB- Connect,wan,Local_LAN,undefined,1943995865,client-rst,6,HTTPS,49,57934,443,,19,320,2530,,,,,,,,,,,


## Data Cleaning


We consider the following columns uninformative for this analysis and drop them:
- `logver`, `tz`, `date`, `time`, `logid`, `timestamp`, `eventtime`, `srcip`, `srccountry`, `dstip`, `srcintf`, `srcintfrole`, `dstintf`, `dstintfrole`, `sessionid`, `srcport`
- `devname`, `devid`
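
One way to support such a choice is to check which columns hold a single value across all rows, since a constant column cannot help distinguish one log from another. The helper `constant_columns` below is a hypothetical illustration on toy data, not the procedure used in this notebook.

```python
import pandas as pd

def constant_columns(frame):
    """Return the columns with at most one distinct value (NaN counts as a value)."""
    return [col for col in frame.columns
            if frame[col].nunique(dropna=False) <= 1]

# Toy stand-in for the parsed log frame (hypothetical rows)
toy = pd.DataFrame({
    'devname': ['FGT3600C_HA', 'FGT3600C_HA', 'FGT3600C_HA'],
    'service': ['HTTPS', 'FTPS', 'HTTPS'],
    'proto': ['6', '6', '1'],
})
print(constant_columns(toy))
```

This only detects constant columns; identifiers such as `sessionid` vary per row, so dropping them remains a judgment call.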

In [None]:
cols_to_drop = ['logver', 'tz', 'date', 'time', 'logid', 'timestamp', 'eventtime', 'srcip', 'srccountry', 'dstip', 'srcintf', 'srcintfrole', 'dstintf', 'dstintfrole', 'sessionid', 'srcport']
# Reassign instead of dropping in place: df is a filtered slice, and
# in-place edits on it trigger pandas' SettingWithCopyWarning
df = df.drop(columns=cols_to_drop)


In [None]:
# Reassign instead of dropping in place to avoid SettingWithCopyWarning
df = df.drop(columns=['devname', 'devid'])


In [None]:
df

Unnamed: 0,vd,type,subtype,level,action,proto,service,policyid,dstport,appcat,duration,sentbyte,rcvdbyte,crscore,crlevel,msg,eventtype,severity,attack,direction,attackid,profile,ref,incidentserialno
0,root,traffic,forward,notice,client-rst,6,HTTPS,49,443,,13,1097,9182,,,,,,,,,,,
1,root,traffic,forward,notice,client-rst,6,HTTPS,59,443,,14,424,8589,,,,,,,,,,,
2,root,traffic,forward,notice,close,6,HTTPS,57,443,unscanned,1,1728,5255,,,,,,,,,,,
3,root,traffic,forward,notice,client-rst,6,HTTPS,49,443,,24,356,6708,,,,,,,,,,,
4,root,traffic,forward,notice,close,6,HTTPS,57,443,unscanned,1,1728,5175,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380832,root,traffic,forward,notice,accept,1,PING,49,,,70,132,172,,,,,,,,,,,
380833,root,traffic,forward,notice,close,6,HTTPS,60,443,,3,416,9096,,,,,,,,,,,
380834,root,traffic,forward,notice,client-rst,6,HTTPS,67,443,,6,216,248,,,,,,,,,,,
380835,root,traffic,forward,notice,client-rst,6,HTTPS,49,443,,19,320,2530,,,,,,,,,,,


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 378837 entries, 0 to 380836
Data columns (total 24 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   vd                378837 non-null  object
 1   type              378837 non-null  object
 2   subtype           378837 non-null  object
 3   level             378837 non-null  object
 4   action            378837 non-null  object
 5   proto             378837 non-null  object
 6   service           378837 non-null  object
 7   policyid          378837 non-null  object
 8   dstport           284987 non-null  object
 9   appcat            378834 non-null  object
 10  duration          378495 non-null  object
 11  sentbyte          378495 non-null  object
 12  rcvdbyte          378495 non-null  object
 13  crscore           16791 non-null   object
 14  crlevel           16788 non-null   object
 15  msg               3 non-null       object
 16  eventtype         3 non-null       obj

In [None]:
df.crlevel.value_counts()

    16788
Name: crlevel, dtype: int64

In [None]:
df.drop('crlevel', axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop('crlevel', axis=1, inplace=True)


---
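We create a new feature, `threat`, that encodes the threat level of each log entry. Entries with a recorded `attack` signature are assigned level 3, and all other entries start at 0. This feature will later serve as the target variable for our machine learning model.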

In [None]:
import numpy as np
df['threat'] = np.where(~df['attack'].isna(), 3, 0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['threat'] = np.where(~df['attack'].isna(), 3, 0)


In [None]:
df['threat'].value_counts()

0    378834
3         3
Name: threat, dtype: int64
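
The labelling step above can be illustrated on a toy frame. This is a minimal sketch with invented rows; only the `attack` and `threat` column names come from the notebook, and `notna()` is used here as an equivalent of `~isna()`.

```python
import numpy as np
import pandas as pd

# Toy frame: the rows are invented for illustration.
toy = pd.DataFrame({
    "service": ["HTTPS", "HTTP", "PING"],
    "attack": [np.nan, "HTTP.Unix.Shell.IFS.Remote.Code.Execution", np.nan],
})

# Rows carrying an attack signature get class 3; everything else starts at 0.
toy["threat"] = np.where(toy["attack"].notna(), 3, 0)
print(toy["threat"].value_counts())
```

Starting every other row at 0 leaves room for the intermediate classes that are assigned in later cells.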

In [None]:
df.msg.value_counts()

backdoor: Gh0st.Rat.Botnet,                         2
misc: HTTP.Unix.Shell.IFS.Remote.Code.Execution,    1
Name: msg, dtype: int64

In [None]:
df.columns

Index(['vd', 'type', 'subtype', 'level', 'action', 'proto', 'service',
       'policyid', 'dstport', 'appcat', 'duration', 'sentbyte', 'rcvdbyte',
       'crscore', 'msg', 'eventtype', 'severity', 'attack', 'direction',
       'attackid', 'profile', 'ref', 'incidentserialno', 'threat'],
      dtype='object')

In [None]:
df.drop(['incidentserialno', 'ref', 'profile', 'attackid', 'direction', 'severity', 'msg'], axis=1, inplace=True)
df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(['incidentserialno', 'ref', 'profile', 'attackid', 'direction', 'severity', 'msg'], axis=1, inplace=True)


Unnamed: 0,vd,type,subtype,level,action,proto,service,policyid,dstport,appcat,duration,sentbyte,rcvdbyte,crscore,eventtype,attack,threat
0,root,traffic,forward,notice,client-rst,6,HTTPS,49,443,,13,1097,9182,,,,0
1,root,traffic,forward,notice,client-rst,6,HTTPS,59,443,,14,424,8589,,,,0
2,root,traffic,forward,notice,close,6,HTTPS,57,443,unscanned,1,1728,5255,,,,0
3,root,traffic,forward,notice,client-rst,6,HTTPS,49,443,,24,356,6708,,,,0
4,root,traffic,forward,notice,close,6,HTTPS,57,443,unscanned,1,1728,5175,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380832,root,traffic,forward,notice,accept,1,PING,49,,,70,132,172,,,,0
380833,root,traffic,forward,notice,close,6,HTTPS,60,443,,3,416,9096,,,,0
380834,root,traffic,forward,notice,client-rst,6,HTTPS,67,443,,6,216,248,,,,0
380835,root,traffic,forward,notice,client-rst,6,HTTPS,49,443,,19,320,2530,,,,0


In [None]:
df.appcat.value_counts()

             316820
unscanned     62014
Name: appcat, dtype: int64

In [None]:
df.appcat = df.appcat.apply(lambda x: 0 if x == '' else 1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.appcat = df.appcat.apply(lambda x: 0 if x == '' else 1)


In [None]:
df.appcat.value_counts()

0    316820
1     62017
Name: appcat, dtype: int64
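
The count for `1` is 62017 rather than 62014 because the three rows with a missing `appcat` are not equal to `''` and therefore fall into the `else` branch. The sketch below reproduces that behaviour on invented values and shows one hypothetical variant that treats missing values like the empty string; whether that is preferable depends on how those rows should be interpreted.

```python
import numpy as np
import pandas as pd

# Invented values: two empty strings, one category, one missing entry.
appcat = pd.Series(["", "unscanned", np.nan, ""])

# Same rule as the cell above: NaN != '' is True, so missing values map to 1.
as_in_notebook = appcat.apply(lambda x: 0 if x == "" else 1)

# Hypothetical variant: fill missing values first so they map to 0.
missing_as_zero = appcat.fillna("").ne("").astype(int)

print(as_in_notebook.tolist())
print(missing_as_zero.tolist())
```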

In [None]:
df.isna().sum()

vd                0
type              0
subtype           0
level             0
action            0
proto             0
service           0
policyid          0
dstport       93850
appcat            0
duration        342
sentbyte        342
rcvdbyte        342
crscore      362046
eventtype    378834
attack       378834
threat            0
dtype: int64

In [None]:
df.dstport.value_counts()

443      251557
990       14699
6785       4188
80         1864
53         1120
          ...  
8516          1
8092          1
11119         1
65153         1
10223         1
Name: dstport, Length: 950, dtype: int64

In [None]:
df[df.dstport == '10223']

Unnamed: 0,vd,type,subtype,level,action,proto,service,policyid,dstport,appcat,duration,sentbyte,rcvdbyte,crscore,eventtype,attack,threat
380752,root,traffic,forward,notice,deny,6,tcp/10223,89,10223,1,0,0,0,30,,,0


In [None]:
df.drop('dstport', axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop('dstport', axis=1, inplace=True)


In [None]:
df.proto.value_counts()

6     283409
1      93850
17      1566
47        12
Name: proto, dtype: int64

In [None]:
df[df.proto == '47']

Unnamed: 0,vd,type,subtype,level,action,proto,service,policyid,appcat,duration,sentbyte,rcvdbyte,crscore,eventtype,attack,threat
58795,root,traffic,forward,notice,deny,47,gre,54,1,0,0,0,30,,,0
129243,root,traffic,forward,notice,deny,47,gre,54,1,0,0,0,30,,,0
178966,root,traffic,forward,notice,deny,47,gre,54,1,0,0,0,30,,,0
218703,root,traffic,forward,notice,deny,47,gre,54,1,0,0,0,30,,,0
269202,root,traffic,forward,notice,deny,47,gre,54,1,0,0,0,30,,,0
281190,root,traffic,forward,notice,deny,47,gre,54,1,0,0,0,30,,,0
303368,root,traffic,forward,notice,deny,47,gre,54,1,0,0,0,30,,,0
308736,root,traffic,forward,notice,deny,47,gre,54,1,0,0,0,30,,,0
364402,root,traffic,forward,notice,deny,47,gre,54,1,0,0,0,30,,,0
366184,root,traffic,forward,notice,deny,47,gre,54,1,0,0,0,30,,,0


In [None]:
df.vd.value_counts()

root    378837
Name: vd, dtype: int64

In [None]:
df.drop('vd', axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop('vd', axis=1, inplace=True)


In [None]:
df.service.value_counts()

HTTPS          251556
PING            93697
FTPS            14699
NHP-FLUX         4188
ForFTP_DATA      2154
                ...  
tcp/8092            1
tcp/11119           1
tcp/28573           1
tcp/52526           1
tcp/10223           1
Name: service, Length: 935, dtype: int64

In [None]:
df.service.apply(lambda x: 'TCP' if x.startswith('tcp') else x).apply(lambda x: 'UDP' if x.startswith('udp') else x).apply(lambda x: 'ICMP' if x.startswith('icmp') else x).value_counts()

HTTPS            251556
PING              93697
FTPS              14699
TCP                8319
NHP-FLUX           4188
ForFTP_DATA        2154
HTTP               1865
DNS                1120
UDP                 364
TELNET              158
ICMP                153
RDP                  70
X-WINDOWS            68
Tomcat-Apache        57
SMTPS                40
MS-SQL               35
FTP                  35
SNMP                 29
NTP                  26
DCE-RPC              25
HQssh                24
PPTP                 24
FTP_1                24
AOL                  24
MYSQL                21
RTSP                 13
gre                  12
VDOLIVE              10
SIP                  10
IRC                   7
SOCKS                 5
KERBEROS              2
IKE                   1
test1                 1
IMAP                  1
Name: service, dtype: int64

In [None]:
df[df.threat != 3].groupby('service')['service'].transform('count')

0         251554
1         251554
2         251554
3         251554
4         251554
           ...  
380832     93697
380833    251554
380834    251554
380835    251554
380836     14699
Name: service, Length: 378834, dtype: int64

---
We create a new threat class based on the `service` column. Logs whose service occurs fewer than 1,000 times in the dataset are flagged as suspicious (class 1).

In [None]:
df.loc[df.threat != 3, 'threat'] = df[df.threat != 3].groupby('service')['service'].transform('count').apply(lambda x: 1 if x < 1000 else 0)

In [None]:
df.threat.value_counts()

0    369276
1      9558
3         3
Name: threat, dtype: int64
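
The rarity rule can be sketched on a toy frame. The service names echo the notebook, but the counts are invented, and `RARE_BELOW` is a hypothetical name for the cutoff (the notebook uses 1000 on the full data).

```python
import pandas as pd

# Toy data: five HTTPS rows, three PING rows, one gre row.
toy = pd.DataFrame({
    "service": ["HTTPS"] * 5 + ["PING"] * 3 + ["gre"],
    "threat": 0,
})

RARE_BELOW = 3  # hypothetical cutoff for this toy example

# transform('count') broadcasts each group's size back onto its own rows.
freq = toy.groupby("service")["service"].transform("count")

# Rows whose service is rarer than the cutoff become class 1 (suspicious).
toy.loc[freq < RARE_BELOW, "threat"] = 1
print(toy)
```

Because `transform` returns a Series with the same index as the input, the comparison `freq < RARE_BELOW` can be used directly as a row selector.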

In [None]:
df['service'] = df.groupby('service')['service'].transform('count').apply(lambda x: 'Other' if x < 1000 else x)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['service'] = df.groupby('service')['service'].transform('count').apply(lambda x: 'Other' if x < 1000 else x)
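
Note that the reassignment above stores each row's group count in `service`, which is why later outputs show values such as `251556` in that column. If the service names are meant to survive, one possible alternative (an assumption about intent, not the notebook's method) is `Series.where`, which keeps a value where the condition holds and substitutes `'Other'` elsewhere. The data and the `RARE_BELOW` cutoff below are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"service": ["HTTPS"] * 4 + ["PING"] * 3 + ["gre", "IKE"]})

RARE_BELOW = 3  # hypothetical cutoff; the notebook uses 1000

# Per-row frequency of the row's own service.
freq = toy.groupby("service")["service"].transform("count")

# Keep the name for frequent services and replace rare ones with 'Other'.
toy["service"] = toy["service"].where(freq >= RARE_BELOW, "Other")
print(toy["service"].value_counts())
```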


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 378837 entries, 0 to 380836
Data columns (total 15 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   type       378837 non-null  object
 1   subtype    378837 non-null  object
 2   level      378837 non-null  object
 3   action     378837 non-null  object
 4   proto      378837 non-null  object
 5   service    378837 non-null  object
 6   policyid   378837 non-null  object
 7   appcat     378837 non-null  int64 
 8   duration   378495 non-null  object
 9   sentbyte   378495 non-null  object
 10  rcvdbyte   378495 non-null  object
 11  crscore    16791 non-null   object
 12  eventtype  3 non-null       object
 13  attack     3 non-null       object
 14  threat     378837 non-null  int64 
dtypes: int64(2), object(13)
memory usage: 46.2+ MB


In [None]:
df.proto.value_counts()

6     283409
1      93850
17      1566
47        12
Name: proto, dtype: int64

In [None]:
cp = df

In [None]:
cp.threat.value_counts()

0    369276
1      9558
3         3
Name: threat, dtype: int64

---
We take a similar approach to the `proto` column. Protocols that occur fewer than 1,000 times are used to flag the log as suspicious.

In [None]:
df.loc[df.threat < 1, 'threat'] = df[df.threat != 3].groupby('proto')['proto'].transform('count').apply(lambda x: 1 if x < 1000 else 0)

In [None]:
df.threat.value_counts()

0    369276
1      9558
3         3
Name: threat, dtype: int64
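
The row selector on the left (`threat < 1`) and the filter on the right (`threat != 3`) differ, so the assignment relies on pandas aligning the right-hand Series to the selected rows by index label. The unchanged counts are consistent with the `proto` 47 rows shown earlier, which all carry the rare service `gre` and so were already in class 1. A minimal sketch of the alignment on invented data, with a hypothetical `RARE_BELOW` cutoff:

```python
import pandas as pd

toy = pd.DataFrame({
    "proto":  ["6", "6", "6", "47", "1", "1", "1"],
    "threat": [0,   1,   0,   0,    0,   3,   0],
})

RARE_BELOW = 2  # hypothetical cutoff; the notebook uses 1000

# Right-hand side: computed on all non-attack rows (index 0-4 and 6).
rhs = (
    toy[toy.threat != 3]
    .groupby("proto")["proto"].transform("count")
    .apply(lambda n: 1 if n < RARE_BELOW else 0)
)

# Left-hand side selects only rows still at 0. pandas matches rhs by index
# label, so row 1 (already 1) and row 5 (class 3) are left untouched.
toy.loc[toy.threat < 1, "threat"] = rhs
print(toy)
```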

In [None]:
df.type.value_counts()

traffic    378834
utm             3
Name: type, dtype: int64

In [None]:
df.drop('type', axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop('type', axis=1, inplace=True)


In [None]:
cp = df
df

Unnamed: 0,subtype,level,action,proto,service,policyid,appcat,duration,sentbyte,rcvdbyte,crscore,eventtype,attack,threat
0,forward,notice,client-rst,6,251556,49,0,13,1097,9182,,,,0
1,forward,notice,client-rst,6,251556,59,0,14,424,8589,,,,0
2,forward,notice,close,6,251556,57,1,1,1728,5255,,,,0
3,forward,notice,client-rst,6,251556,49,0,24,356,6708,,,,0
4,forward,notice,close,6,251556,57,1,1,1728,5175,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380832,forward,notice,accept,1,93697,49,0,70,132,172,,,,0
380833,forward,notice,close,6,251556,60,0,3,416,9096,,,,0
380834,forward,notice,client-rst,6,251556,67,0,6,216,248,,,,0
380835,forward,notice,client-rst,6,251556,49,0,19,320,2530,,,,0


In [None]:
df.drop('subtype', axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop('subtype', axis=1, inplace=True)


In [None]:
df.level.value_counts()

notice     378495
alert           3
Name: level, dtype: int64

In [None]:
df[df.level == 'warning']['threat'].value_counts()

0    186
1    153
Name: threat, dtype: int64

---
We create a new threat class based on the `level` column. Logs at the `warning` level are flagged as malicious (class 2), overriding any class 1 label they received earlier.

In [None]:
df.loc[df.level == 'warning','threat'] = 2

In [None]:
df.threat.value_counts()

0    369090
1      9405
2       339
3         3
Name: threat, dtype: int64
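
In effect, the labelling steps form a precedence order: an attack signature (3) outranks a `warning` level (2), which outranks a rare service or protocol (1). As an illustrative alternative, not the notebook's code, the same order can be written in one pass with `numpy.select`, which uses the first matching condition for each row. The rows are invented and `is_rare` is a hypothetical precomputed flag.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "attack":  [None, "Gh0st.Rat.Botnet", None, None],
    "level":   ["notice", "alert", "warning", "notice"],
    "is_rare": [False, False, True, True],  # hypothetical precomputed flag
})

conditions = [
    toy["attack"].notna(),      # class 3: attack signature present
    toy["level"] == "warning",  # class 2: warning-level log
    toy["is_rare"],             # class 1: rare service or protocol
]

# np.select takes the first condition that is True for each row.
toy["threat"] = np.select(conditions, [3, 2, 1], default=0)
print(toy)
```

Listing the conditions from most to least severe makes the override behaviour explicit instead of depending on cell execution order.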

In [None]:
df.action.value_counts()

client-rst    138955
accept        101654
close          95831
server-rst     25569
deny           10236
timeout         6250
ip-conn          230
dns              109
dropped            3
Name: action, dtype: int64

In [None]:
df.attack.fillna('', inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.attack.fillna('', inplace=True)


In [None]:
df

Unnamed: 0,level,action,proto,service,policyid,appcat,duration,sentbyte,rcvdbyte,crscore,eventtype,attack,threat
0,notice,client-rst,6,251556,49,0,13,1097,9182,,,,0
1,notice,client-rst,6,251556,59,0,14,424,8589,,,,0
2,notice,close,6,251556,57,1,1,1728,5255,,,,0
3,notice,client-rst,6,251556,49,0,24,356,6708,,,,0
4,notice,close,6,251556,57,1,1,1728,5175,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
380832,notice,accept,1,93697,49,0,70,132,172,,,,0
380833,notice,close,6,251556,60,0,3,416,9096,,,,0
380834,notice,client-rst,6,251556,67,0,6,216,248,,,,0
380835,notice,client-rst,6,251556,49,0,19,320,2530,,,,0


In [None]:
df.crscore.fillna(0, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.crscore.fillna(0, inplace=True)


In [None]:
df.crscore.value_counts()

0     362046
30     10238
5       6549
50         4
Name: crscore, dtype: int64

In [None]:
df.eventtype.value_counts()

signature    3
Name: eventtype, dtype: int64

In [None]:
df.drop('eventtype', axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop('eventtype', axis=1, inplace=True)


In [None]:
def unique(df):
    """Print the value counts of every column in `df`."""
    for col in df.columns:
        print(f'{col}')
        print()
        print(df[col].value_counts())
        print('\n')
        print('_'*100)

unique(df)

level

notice     378495
alert           3
Name: level, dtype: int64


____________________________________________________________________________________________________
action

client-rst    138955
accept        101654
close          95831
server-rst     25569
deny           10236
timeout         6250
ip-conn          230
dns              109
dropped            3
Name: action, dtype: int64


____________________________________________________________________________________________________
proto

6     283409
1      93850
17      1566
47        12
Name: proto, dtype: int64


____________________________________________________________________________________________________
service

251556    251556
93697      93697
14699      14699
Other       9558
4188        4188
2154        2154
1865        1865
1120        1120
Name: service, dtype: int64


____________________________________________________________________________________________________
policyid

49    98170
59    45273
57 

In [None]:
df['attack'] = df.attack.apply(lambda x: 1 if x != '' else 0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['attack'] = df.attack.apply(lambda x: 1 if x != '' else 0)


In [None]:
df.attack.value_counts()

0    378834
1         3
Name: attack, dtype: int64

In [None]:
df

Unnamed: 0,level,action,proto,service,policyid,appcat,duration,sentbyte,rcvdbyte,crscore,attack,threat
0,notice,client-rst,6,251556,49,0,13,1097,9182,0,0,0
1,notice,client-rst,6,251556,59,0,14,424,8589,0,0,0
2,notice,close,6,251556,57,1,1,1728,5255,0,0,0
3,notice,client-rst,6,251556,49,0,24,356,6708,0,0,0
4,notice,close,6,251556,57,1,1,1728,5175,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
380832,notice,accept,1,93697,49,0,70,132,172,0,0,0
380833,notice,close,6,251556,60,0,3,416,9096,0,0,0
380834,notice,client-rst,6,251556,67,0,6,216,248,0,0,0
380835,notice,client-rst,6,251556,49,0,19,320,2530,0,0,0


In [None]:
df.to_csv('log.csv', index=False)

In [None]:
unique(df)

level

notice     378495
alert           3
Name: level, dtype: int64


____________________________________________________________________________________________________
action

client-rst    138955
accept        101654
close          95831
server-rst     25569
deny           10236
timeout         6250
ip-conn          230
dns              109
dropped            3
Name: action, dtype: int64


____________________________________________________________________________________________________
proto

6     283409
1      93850
17      1566
47        12
Name: proto, dtype: int64


____________________________________________________________________________________________________
service

251556    251556
93697      93697
14699      14699
Other       9558
4188        4188
2154        2154
1865        1865
1120        1120
Name: service, dtype: int64


____________________________________________________________________________________________________
policyid

49    98170
59    45273
57 

In [None]:
df.drop(['attack', 'level'], axis=1, inplace=True)
cp = df.copy()  # independent copy, so binarising cp.threat below leaves the multiclass labels in df intact

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(['attack', 'level'], axis=1, inplace=True)


---
We create a new dataframe, `cp`, to train the first model, a binary classifier that predicts whether a log is malicious. Every nonzero threat level is collapsed to 1.

In [None]:
cp.threat = cp.threat.apply(lambda x: 1 if x != 0 else 0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cp.threat = cp.threat.apply(lambda x: 1 if x != 0 else 0)


In [None]:
cp.threat.value_counts()

0    369090
1      9747
Name: threat, dtype: int64

---
We create another dataframe, `lvl`, to train the second model, a multiclass classifier that predicts the threat level of a log. It contains only the rows with a nonzero threat level.

In [None]:
lvl = df[df.threat > 0]
lvl

Unnamed: 0,action,proto,service,policyid,appcat,duration,sentbyte,rcvdbyte,crscore,threat
64,deny,6,Other,72,1,0,0,0,30,1
137,deny,6,Other,89,1,0,0,0,30,1
162,deny,6,Other,72,1,0,0,0,30,1
217,deny,6,Other,72,1,0,0,0,30,1
258,deny,6,Other,74,1,0,0,0,30,1
...,...,...,...,...,...,...,...,...,...,...
380730,deny,6,Other,72,1,0,0,0,30,1
380752,deny,6,Other,89,1,0,0,0,30,1
380755,deny,6,Other,72,1,0,0,0,30,1
380760,deny,6,Other,72,1,0,0,0,30,1


In [None]:
cp.to_csv('data.csv', index=False)
lvl.to_csv('lvl.csv', index=False)
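
The binary and multiclass frames are only independent if `cp` is a true copy of `df`, because plain assignment of a DataFrame binds a second name to the same object. A minimal sketch on a toy frame (the names `df_toy`, `alias`, `binary` and `levels` are invented for illustration):

```python
import pandas as pd

df_toy = pd.DataFrame({"threat": [0, 1, 2, 3]})

alias = df_toy           # same object under a second name
binary = df_toy.copy()   # independent copy

# Collapse classes 1-3 into a single 'malicious' label on the copy only.
binary["threat"] = (binary["threat"] != 0).astype(int)

# The multiclass subset still sees the original levels.
levels = df_toy[df_toy["threat"] > 0]

print(alias is df_toy)
print(binary["threat"].tolist())
print(levels["threat"].tolist())
```

Had the relabelling been applied through `alias` instead, `levels` would contain only the value 1.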