# Anomaly Detection using web servers' logs

## Problem
We need to detect intruders and crawlers on a web server from the footprints its users leave in a log file provided to us.

This is an unsupervised problem because the anomalies are not labeled, so we have to work out what counts as an anomaly and find those cases in the server log.


In [112]:
# Import requirements
import re
import pandas as pd
import numpy as np
!pip install pandas-profiling==2.7.1
from pandas_profiling import ProfileReport

import matplotlib.pyplot as plt



## Prepare data
1. Create a `.csv` from the provided logs.
2. Extract features.
3. Clean data:
  * Fix datetimes.
  * Fix missing values.
  * Figure out categories.
  * Check correlations.
4. Prepare data to fit our model:
  * Check the balance.
  * Split train, test.


In [113]:
# !gzip --decompress drive/MyDrive/Rahnema-College/Tuning/Final-Project/output.log.gz

In [114]:
from google.colab import drive
drive.mount("/content/drive/")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


### Extract features from log files to build .csv format


In [115]:
# Regex pattern (raw string, so the backslashes reach the regex engine unchanged)
pattern = r'(?P<Client>\S+) \[(?P<Time>\S+)\] \[(?P<Method>\S+) (?P<Request>\S+)\] (?P<Status>\S+) (?P<Size>\S+) \[\[(?P<UserAgent>[\S\s]+)\]\] (?P<Duration>\S+)'
file_path = 'drive/MyDrive/Rahnema-College/Tuning/Final-Project/output.log'
columns = ["Client", "Datetime", "Method", "Request", "Status", "Length", "UserAgent", "ResponseTime"]

In [116]:
# Extract the fields of every log line that matches the pattern
def parse_data(file_path, pattern):
  """
  Parse a log file line by line with the given regex pattern.

  file_path -> Path to the log file.
  pattern -> Regex pattern whose groups are the fields to extract.

  Return parsed_lines -> a list with one list of extracted fields per matching line.
  """
  parsed_lines = []
  compiled_pattern = re.compile(pattern)

  with open(file_path) as logs:
    for line_number, line in enumerate(logs, start=1):
      match = compiled_pattern.search(line)
      if match is None:
        # Report the line that does not fit the pattern and keep going
        print(f"Line {line_number} does not match the pattern, skipping it.")
        continue
      parsed_lines.append(list(match.groups()))
  return parsed_lines

In [117]:
extracted_features = parse_data(file_path, pattern)
extracted_features[0]

['207.213.193.143',
 '2021-5-12T5:6:0.0+0430',
 'Get',
 '/cdn/profiles/1026106239',
 '304',
 '0',
 'Googlebot-Image/1.0',
 '32']
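
To see what the pattern captures, the sketch below runs it against a single line. The sample line is an assumption reconstructed from the pattern and the parsed record above, not copied from the log file; `match.groupdict()` returns the named groups as a dictionary.

```python
import re

# Same pattern as above, written as a raw string and split for readability
pattern = re.compile(
    r'(?P<Client>\S+) \[(?P<Time>\S+)\] \[(?P<Method>\S+) (?P<Request>\S+)\] '
    r'(?P<Status>\S+) (?P<Size>\S+) \[\[(?P<UserAgent>[\S\s]+)\]\] (?P<Duration>\S+)'
)

# Hypothetical log line, rebuilt from the first parsed record
sample_line = (
    "207.213.193.143 [2021-5-12T5:6:0.0+0430] [Get /cdn/profiles/1026106239] "
    "304 0 [[Googlebot-Image/1.0]] 32"
)

match = pattern.search(sample_line)
fields = match.groupdict() if match else {}
print(fields)
```

Note that the group names (`Time`, `Size`, `Duration`) differ from the column names used for the DataFrame (`Datetime`, `Length`, `ResponseTime`); the columns are assigned by position, so only the group order matters.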

In [118]:
# Build a DataFrame (tabular, .csv-like) from the extracted features
data = pd.DataFrame(extracted_features, columns=columns)
data.head()

Unnamed: 0,Client,Datetime,Method,Request,Status,Length,UserAgent,ResponseTime
0,207.213.193.143,2021-5-12T5:6:0.0+0430,Get,/cdn/profiles/1026106239,304,0,Googlebot-Image/1.0,32
1,207.213.193.143,2021-5-12T5:6:0.0+0430,Get,images/badge.png,304,0,Googlebot-Image/1.0,4
2,35.110.222.153,2021-5-12T5:6:0.0+0430,Get,/pages/630180847,200,52567,Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-...,32
3,35.108.208.99,2021-5-12T5:6:0.0+0430,Get,images/fav_icon2.ico,200,23531,Mozilla/5.0 (Linux; Android 6.0; CAM-L21) Appl...,20
4,35.110.222.153,2021-5-12T5:6:0.0+0430,Get,images/sanjagh_logo_purpule5.png,200,4680,Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-...,8


In [119]:
data.shape

(1260035, 8)

In [120]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1260035 entries, 0 to 1260034
Data columns (total 8 columns):
 #   Column        Non-Null Count    Dtype 
---  ------        --------------    ----- 
 0   Client        1260035 non-null  object
 1   Datetime      1260035 non-null  object
 2   Method        1260035 non-null  object
 3   Request       1260035 non-null  object
 4   Status        1260035 non-null  object
 5   Length        1260035 non-null  object
 6   UserAgent     1260035 non-null  object
 7   ResponseTime  1260035 non-null  object
dtypes: object(8)
memory usage: 76.9+ MB


### Fix time series

In [121]:
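# Convert datetimes to datetime format.
# The format spells out the fractional seconds and UTC offset present in the logs, e.g. 2021-5-12T5:6:0.0+0430
data["Datetime"] = pd.to_datetime(data["Datetime"], format="%Y-%m-%dT%H:%M:%S.%f%z")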

In [122]:
# Split datetimes into their components
data['Year'] = data.Datetime.dt.year
data['Month'] = data.Datetime.dt.month
data['Day'] = data.Datetime.dt.day
data['Hour'] = data.Datetime.dt.hour
data['Minute'] = data.Datetime.dt.minute
data['Second'] = data.Datetime.dt.second
data['dayOfWeek'] = data.Datetime.dt.dayofweek
data['dayOfYear'] = data.Datetime.dt.dayofyear

In [123]:
data.drop("Datetime", axis=1, inplace=True)

In [124]:
data.head()

Unnamed: 0,Client,Method,Request,Status,Length,UserAgent,ResponseTime,Year,Month,Day,Hour,Minute,Second,dayOfWeek,dayOfYear
0,207.213.193.143,Get,/cdn/profiles/1026106239,304,0,Googlebot-Image/1.0,32,2021,5,12,5,6,0,2,132
1,207.213.193.143,Get,images/badge.png,304,0,Googlebot-Image/1.0,4,2021,5,12,5,6,0,2,132
2,35.110.222.153,Get,/pages/630180847,200,52567,Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-...,32,2021,5,12,5,6,0,2,132
3,35.108.208.99,Get,images/fav_icon2.ico,200,23531,Mozilla/5.0 (Linux; Android 6.0; CAM-L21) Appl...,20,2021,5,12,5,6,0,2,132
4,35.110.222.153,Get,images/sanjagh_logo_purpule5.png,200,4680,Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-...,8,2021,5,12,5,6,0,2,132
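
As a quick check of the timestamp layout, this sketch parses one value from the log with the standard library's `datetime.strptime`. The format string, including `.%f` for the fractional seconds and `%z` for the UTC offset, is an assumption derived from the sample value `2021-5-12T5:6:0.0+0430`.

```python
from datetime import datetime

raw = "2021-5-12T5:6:0.0+0430"
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z")

# Components comparable to the columns derived above
parts = {
    "Hour": parsed.hour,
    "Minute": parsed.minute,
    "Second": parsed.second,
    "dayOfWeek": parsed.weekday(),            # Monday is 0, as in pandas
    "dayOfYear": parsed.timetuple().tm_yday,
}
print(parts)
```

The hour stays in the log's local time (UTC+04:30) because the offset is kept on the parsed value rather than converted to UTC.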


### Split Requests
* Level_1
  * Level_2

In [125]:
# Split each request path into two levels so requests are easier to categorize
path_l1 = []
path_l2 = []
paths_pattern = re.compile(r'^(\/|\w+)')
for element in data.Request.values:
  finded_part = paths_pattern.findall(element)[0]
  path_l1.append(finded_part)
  # Note: str.replace removes every occurrence of the first level, not only the leading one
  path_l2.append(element.replace(finded_part, ''))

data["Req_Path_L1"] = path_l1
data["Req_Path_L2"] = path_l2

In [126]:
data.head()

Unnamed: 0,Client,Method,Request,Status,Length,UserAgent,ResponseTime,Year,Month,Day,Hour,Minute,Second,dayOfWeek,dayOfYear,Req_Path_L1,Req_Path_L2
0,207.213.193.143,Get,/cdn/profiles/1026106239,304,0,Googlebot-Image/1.0,32,2021,5,12,5,6,0,2,132,/,cdnprofiles1026106239
1,207.213.193.143,Get,images/badge.png,304,0,Googlebot-Image/1.0,4,2021,5,12,5,6,0,2,132,images,/badge.png
2,35.110.222.153,Get,/pages/630180847,200,52567,Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-...,32,2021,5,12,5,6,0,2,132,/,pages630180847
3,35.108.208.99,Get,images/fav_icon2.ico,200,23531,Mozilla/5.0 (Linux; Android 6.0; CAM-L21) Appl...,20,2021,5,12,5,6,0,2,132,images,/fav_icon2.ico
4,35.110.222.153,Get,images/sanjagh_logo_purpule5.png,200,4680,Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-...,8,2021,5,12,5,6,0,2,132,images,/sanjagh_logo_purpule5.png


In [127]:
data["Req_Path_L1"].value_counts()

/            663910
images       298288
js           125224
fonts         98485
css           56303
templates     17825
Name: Req_Path_L1, dtype: int64

In [128]:
data.loc[10:16]

Unnamed: 0,Client,Method,Request,Status,Length,UserAgent,ResponseTime,Year,Month,Day,Hour,Minute,Second,dayOfWeek,dayOfYear,Req_Path_L1,Req_Path_L2
10,35.110.222.153,Get,images/gadgets/join_pros3.jpg,200,34053,Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-...,8,2021,5,12,5,6,0,2,132,images,/gadgets/join_pros3.jpg
11,35.110.222.153,Get,css/page.2f0fc69390da8cdff683.css,200,50880,Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-...,12,2021,5,12,5,6,0,2,132,css,/page.2f0fc69390da8cdff683.
12,35.110.222.153,Get,js/sentry.47b4061bac0b8ac89b9c.js,200,65059,Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-...,4,2021,5,12,5,6,0,2,132,js,/sentry.47b4061bac0b8ac89b9c.
13,35.110.222.153,Get,js/page.07cb314dc14eef820638.js,200,332023,Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-...,32,2021,5,12,5,6,0,2,132,js,/page.07cb314dc14eef820638.
14,36.67.23.210,Head,/877499224,200,0,Go-http-client/2.0,28,2021,5,12,5,6,0,2,132,/,877499224
15,207.213.193.143,Get,/cdn/pro_photo_gallery/1781572036,304,0,Googlebot-Image/1.0,16,2021,5,12,5,6,0,2,132,/,cdnpro_photo_gallery1781572036
16,35.110.222.153,Get,fonts/sanjagh_icon_font_5.woff,200,8644,Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-...,4,2021,5,12,5,6,0,2,132,fonts,/sanjagh_icon_font_5.woff
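
In the tables above, `Req_Path_L2` loses more than the leading level: `/cdn/profiles/1026106239` becomes `cdnprofiles1026106239` and `css/page.2f0fc69390da8cdff683.css` becomes `/page.2f0fc69390da8cdff683.`, because `str.replace` removes every occurrence of the first level. If only the leading level should be stripped, one possible alternative is to slice the string at the end of the regex match. The helper `split_request_path` is a hypothetical name, not part of the notebook.

```python
import re

paths_pattern = re.compile(r'^(\/|\w+)')

def split_request_path(request):
    """Return (level_1, level_2), removing only the leading level from the path."""
    match = paths_pattern.match(request)
    if match is None:
        # No recognizable first level: keep the whole request as level 2
        return "", request
    return match.group(1), request[match.end():]

print(split_request_path("/cdn/profiles/1026106239"))
print(split_request_path("css/page.2f0fc69390da8cdff683.css"))
```

With this variant the second level keeps its internal slashes and file extension, which may matter if `Req_Path_L2` is later used as a feature.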


### Check Dataset

In [129]:
# # Create a profile report
# profile = ProfileReport(data)
# profile.to_file("drive/MyDrive/Rahnema-College/Tuning/Final-Project/profile_report.html")

In [130]:
# Remove Constant features
constant_features = ["Year", "Month", "Day", "dayOfWeek", "dayOfYear"]
data.drop(constant_features, axis=1, inplace=True)
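
The constant columns above were presumably identified by inspection or from the profile report. A minimal sketch of finding them programmatically with `nunique`, on a small made-up frame (the values are illustrative, not taken from the log):

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative)
df = pd.DataFrame({
    "Year": [2021, 2021, 2021],
    "Day": [12, 12, 12],
    "Hour": [5, 5, 6],
})

# A column is constant when it holds a single distinct value
constant_features = [col for col in df.columns if df[col].nunique(dropna=False) == 1]
print(constant_features)
```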

* Some rows have no value for the `ResponseTime` and `Client` features.
* Take care: some rows contain a `Client` value but no `ResponseTime` value.

In [131]:
data[(data["Client"] == '-') & (data["ResponseTime"] == '-')]

Unnamed: 0,Client,Method,Request,Status,Length,UserAgent,ResponseTime,Hour,Minute,Second,Req_Path_L1,Req_Path_L2
25,-,Get,/,301,169,kube-probe/1.21,-,5,6,1,/,
85,-,Get,/,301,169,kube-probe/1.21,-,5,6,3,/,
145,-,Get,/,301,169,kube-probe/1.21,-,5,6,5,/,
175,-,Get,/,301,169,kube-probe/1.21,-,5,6,7,/,
215,-,Get,/,301,169,kube-probe/1.21,-,5,6,9,/,
...,...,...,...,...,...,...,...,...,...,...,...,...
1259779,-,Get,/,301,169,kube-probe/1.21,-,15,8,51,/,
1259833,-,Get,/,301,169,kube-probe/1.21,-,15,8,53,/,
1259902,-,Get,/,301,169,kube-probe/1.21,-,15,8,55,/,
1259951,-,Get,/,301,169,kube-probe/1.21,-,15,8,57,/,


In [132]:
data[(data["Client"] != '-') & (data["ResponseTime"] == '-')]

Unnamed: 0,Client,Method,Request,Status,Length,UserAgent,ResponseTime,Hour,Minute,Second,Req_Path_L1,Req_Path_L2
776,20.62.177.11,Get,/pros/1993352776,200,53479,Mozilla/5.0 (compatible; SemrushBot/7~bl; +htt...,-,5,6,31,/,pros1993352776
2010,20.62.177.60,Get,/pros/1797822247,200,55330,Mozilla/5.0 (compatible; SemrushBot/7~bl; +htt...,-,5,7,27,/,pros1797822247
2708,20.62.177.133,Get,/pros/763244865,200,20947,Mozilla/5.0 (compatible; SemrushBot/7~bl; +htt...,-,5,8,4,/,pros763244865
2866,207.213.193.118,Get,/pages/1939232229,301,169,Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Bu...,-,5,8,18,/,pages1939232229
3468,20.62.177.4,Get,/pros/2084824811,200,37060,Mozilla/5.0 (compatible; SemrushBot/7~bl; +htt...,-,5,8,49,/,pros2084824811
...,...,...,...,...,...,...,...,...,...,...,...,...
1257193,20.62.177.11,Get,/pros/1644096504,200,24540,Mozilla/5.0 (compatible; SemrushBot/7~bl; +htt...,-,15,7,34,/,pros1644096504
1257986,20.62.177.11,Get,/pros/743056796,200,36129,Mozilla/5.0 (compatible; SemrushBot/7~bl; +htt...,-,15,7,59,/,pros743056796
1258079,20.62.177.161,Get,/pros/1177343248,200,51334,Mozilla/5.0 (compatible; SemrushBot/7~bl; +htt...,-,15,8,2,/,pros1177343248
1258456,207.213.207.17,Get,/services/1404674245,301,169,Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Bu...,-,15,8,12,/,services1404674245


In [133]:
# Replace the '-' placeholder in `Client` with null
data.loc[data["Client"] == '-', "Client"] = np.nan

# `ResponseTime` will be cast to an integer, so use "-1" as a temporary, convertible placeholder
data.loc[data["ResponseTime"] == '-', "ResponseTime"] = "-1"

In [134]:
# Convert integers
data["Status"] = data["Status"].astype("int64")
data["Length"] = data["Length"].astype("int64")
data["ResponseTime"] = data["ResponseTime"].astype("int64")

In [135]:
# Then set -1s to null
data.loc[data["ResponseTime"] == -1, "ResponseTime"] = None
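
The three cells above go through a `-1` placeholder to turn `-` into a missing value. A shorter route, sketched here on illustrative values, is `pd.to_numeric` with `errors="coerce"`, which converts anything unparseable to `NaN` in one step. Note that this would also coerce any other non-numeric string, not just `-`.

```python
import pandas as pd

# Illustrative values: two valid response times and one '-' placeholder
response_time = pd.Series(["32", "-", "4"])

# Unparseable entries become NaN, so the column ends up as float64
converted = pd.to_numeric(response_time, errors="coerce")
print(converted.tolist())
```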

In [136]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1260035 entries, 0 to 1260034
Data columns (total 12 columns):
 #   Column        Non-Null Count    Dtype  
---  ------        --------------    -----  
 0   Client        1241945 non-null  object 
 1   Method        1260035 non-null  object 
 2   Request       1260035 non-null  object 
 3   Status        1260035 non-null  int64  
 4   Length        1260035 non-null  int64  
 5   UserAgent     1260035 non-null  object 
 6   ResponseTime  1240227 non-null  float64
 7   Hour          1260035 non-null  int64  
 8   Minute        1260035 non-null  int64  
 9   Second        1260035 non-null  int64  
 10  Req_Path_L1   1260035 non-null  object 
 11  Req_Path_L2   1260035 non-null  object 
dtypes: float64(1), int64(5), object(6)
memory usage: 115.4+ MB


### Fix Missing values

In [137]:
data.isna().sum()

Client          18090
Method              0
Request             0
Status              0
Length              0
UserAgent           0
ResponseTime    19808
Hour                0
Minute              0
Second              0
Req_Path_L1         0
Req_Path_L2         0
dtype: int64

In [138]:
# Fill missing values of `ResponseTime` with its mean value
data["ResponseTime"] = data["ResponseTime"].fillna(data["ResponseTime"].mean())
data["ResponseTime"].isna().sum()

0

In [141]:
# Drop the samples whose `Client` value is missing
data.dropna(subset=["Client"], inplace=True)

In [142]:
data.shape

(1241945, 12)

In [143]:
data.isna().sum()

Client          0
Method          0
Request         0
Status          0
Length          0
UserAgent       0
ResponseTime    0
Hour            0
Minute          0
Second          0
Req_Path_L1     0
Req_Path_L2     0
dtype: int64

In [144]:
data.head()

Unnamed: 0,Client,Method,Request,Status,Length,UserAgent,ResponseTime,Hour,Minute,Second,Req_Path_L1,Req_Path_L2
0,207.213.193.143,Get,/cdn/profiles/1026106239,304,0,Googlebot-Image/1.0,32.0,5,6,0,/,cdnprofiles1026106239
1,207.213.193.143,Get,images/badge.png,304,0,Googlebot-Image/1.0,4.0,5,6,0,images,/badge.png
2,35.110.222.153,Get,/pages/630180847,200,52567,Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-...,32.0,5,6,0,/,pages630180847
3,35.108.208.99,Get,images/fav_icon2.ico,200,23531,Mozilla/5.0 (Linux; Android 6.0; CAM-L21) Appl...,20.0,5,6,0,images,/fav_icon2.ico
4,35.110.222.153,Get,images/sanjagh_logo_purpule5.png,200,4680,Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-...,8.0,5,6,0,images,/sanjagh_logo_purpule5.png


### Fix Categorical

* Categorical features: Method, Status

In [None]:
data["Method"].value_counts()

In [None]:
data["Status"].value_counts()

In [None]:
# One-hot encode the Method feature
data["Method"] = data["Method"].astype("category")
data = pd.get_dummies(data, columns=["Method"], drop_first=True)

In [None]:
# Encode the Status feature as integer category codes starting at 1
data["Status"] = data["Status"].astype("category")
data["Status_cat"] = data["Status"].cat.codes+1
data.drop("Status", axis=1, inplace=True)
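
To make the two encodings concrete, here is a small sketch on made-up values. `get_dummies` with `drop_first=True` drops the first category (alphabetically `Get` here), and `cat.codes + 1` numbers the status codes by their sorted order, so the resulting integers are labels rather than meaningful magnitudes.

```python
import pandas as pd

# Toy frame with illustrative values, not taken from the log
toy = pd.DataFrame({
    "Method": ["Get", "Head", "Post", "Get"],
    "Status": [200, 304, 301, 200],
})

# One-hot encode Method; the first category ("Get") becomes the implicit baseline
toy = pd.get_dummies(toy, columns=["Method"], drop_first=True)

# Integer-encode Status: categories are sorted, so 200 -> 1, 301 -> 2, 304 -> 3
toy["Status_cat"] = toy["Status"].astype("category").cat.codes + 1
toy = toy.drop("Status", axis=1)
print(toy)
```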

In [None]:
data.head()