# Anomaly Detection Using Web Server Logs

## Problem
We need to detect intruders and crawlers on a web server from the footprints its users leave behind, which are provided to us as a log file.

This is an unsupervised problem because the anomalies are not labeled, so we have to identify them ourselves from the server logs.


In [8]:
# Import requirements
import re
import pandas as pd


### Prepare data
1. Create a `.csv` from the provided logs.
2. Extract features.
3. Clean data:
  * Fix missing values.
  * Figure out categories.
  * Check correlations.
4. Prepare data to fit our model:
  * Check the balance.
  * Split into train and test sets.


In [9]:
# !gzip --decompress drive/MyDrive/Rahnema-College/Tuning/Final-Project/output.log.gz

In [10]:
from google.colab import drive
drive.mount("/content/drive/")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


#### Extract features from the log file to build a `.csv`


In [11]:
# Regex Pattern
pattern = r'(?P<Client>\S+) \[(?P<Time>\S+)\] \[(?P<Method>\S+) (?P<Request>\S+)\] (?P<Status>\S+) (?P<Size>\S+) \[\[(?P<UserAgent>[\S\s]+)\]\] (?P<Duration>\S+)'
file_path = 'drive/MyDrive/Rahnema-College/Tuning/Final-Project/output.log'
columns = ["Client", "Time", "Method", "Request", "Status", "Length", "UserAgent", "ResponseTime"]
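
To see what the pattern captures, here is a minimal sketch that applies it to a single line. The sample line is an assumption: it is reconstructed from the pattern and the parsed output, not copied from the actual log file. Note that the group names `Size` and `Duration` end up in the `Length` and `ResponseTime` columns.

```python
import re

# Same pattern as above, written as a raw string
pattern = r'(?P<Client>\S+) \[(?P<Time>\S+)\] \[(?P<Method>\S+) (?P<Request>\S+)\] (?P<Status>\S+) (?P<Size>\S+) \[\[(?P<UserAgent>[\S\s]+)\]\] (?P<Duration>\S+)'

# Hypothetical log line (assumed layout, reconstructed from the pattern)
sample_line = '207.213.193.143 [2021-5-12T5:6:0.0+0430] [Get /cdn/profiles/1026106239] 304 0 [[Googlebot-Image/1.0]] 32\n'

# groupdict() maps each named group to the text it captured
match = re.search(pattern, sample_line)
fields = match.groupdict() if match else {}
for name, value in fields.items():
    print(f"{name}: {value}")
```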

In [15]:
# Parse every log line with the regex
def parse_data(file_path, pattern):
  """
  Return the fields extracted from each log line by the given pattern.
  file_path -> Path to your log file.
  pattern -> Regex pattern whose groups capture the fields you want.

  Return parsed_lines -> A list of tuples, one per matching line.
  """
  parsed_lines = []
  regex = re.compile(pattern)  # Compile once, reuse for every line

  with open(file_path) as logs:
    for line_number, line in enumerate(logs, start=1):
      match = regex.search(line)
      if match is None:
        # Report lines that don't fit the pattern and move on
        print(f"Line {line_number} does not match the pattern, skipping it.")
        continue
      parsed_lines.append(match.groups())
  return parsed_lines

In [16]:
extracted_features = parse_data(file_path, pattern)
extracted_features[:3]

[('207.213.193.143',
  '2021-5-12T5:6:0.0+0430',
  'Get',
  '/cdn/profiles/1026106239',
  '304',
  '0',
  'Googlebot-Image/1.0',
  '32'),
 ('207.213.193.143',
  '2021-5-12T5:6:0.0+0430',
  'Get',
  'images/badge.png',
  '304',
  '0',
  'Googlebot-Image/1.0',
  '4'),
 ('35.110.222.153',
  '2021-5-12T5:6:0.0+0430',
  'Get',
  '/pages/630180847',
  '200',
  '52567',
  'Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-J710GN Build/MMB29K) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/4.0 Chrome/44.0.2403.133 Mobile Safari/537.36',
  '32')]
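
Every field in the parsed tuples is still a string. Below is a minimal sketch of converting one tuple into typed values. `convert_row` is a hypothetical helper, not part of the notebook, and it assumes that `Status`, `Length`, and `ResponseTime` always hold integers, which the pattern (`\S+`) does not guarantee.

```python
from datetime import datetime

# Hypothetical helper: convert one parsed tuple of strings into typed values
def convert_row(row):
    client, time, method, request, status, length, user_agent, response_time = row
    return (
        client,
        # strptime accepts the non-zero-padded fields used in these timestamps
        datetime.strptime(time, "%Y-%m-%dT%H:%M:%S.%f%z"),
        method,
        request,
        int(status),         # Assumption: always an integer status code
        int(length),         # Assumption: always an integer
        user_agent,
        int(response_time),  # Assumption: always an integer
    )

# First row of the output above
row = ('207.213.193.143', '2021-5-12T5:6:0.0+0430', 'Get',
       '/cdn/profiles/1026106239', '304', '0', 'Googlebot-Image/1.0', '32')
typed_row = convert_row(row)
print(typed_row)
```

Rows that break the integer assumption raise a `ValueError`, so in practice the call may need to be wrapped in a `try`/`except`.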

In [17]:
# Build a DataFrame from the parsed lines
data = pd.DataFrame(extracted_features, columns=columns)
data.head()

Unnamed: 0,Client,Time,Method,Request,Status,Length,UserAgent,ResponseTime
0,207.213.193.143,2021-5-12T5:6:0.0+0430,Get,/cdn/profiles/1026106239,304,0,Googlebot-Image/1.0,32
1,207.213.193.143,2021-5-12T5:6:0.0+0430,Get,images/badge.png,304,0,Googlebot-Image/1.0,4
2,35.110.222.153,2021-5-12T5:6:0.0+0430,Get,/pages/630180847,200,52567,Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-...,32
3,35.108.208.99,2021-5-12T5:6:0.0+0430,Get,images/fav_icon2.ico,200,23531,Mozilla/5.0 (Linux; Android 6.0; CAM-L21) Appl...,20
4,35.110.222.153,2021-5-12T5:6:0.0+0430,Get,images/sanjagh_logo_purpule5.png,200,4680,Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-...,8
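
The first step of the plan is to create a `.csv` from the logs. Here is a minimal sketch of that step using only the standard library. It writes to an in-memory buffer as a stand-in for a real file, and the two rows are copied from the output above. With the DataFrame at hand, `data.to_csv(path, index=False)` is the usual pandas alternative, where `path` is whatever output location you choose.

```python
import csv
import io

columns = ["Client", "Time", "Method", "Request", "Status", "Length", "UserAgent", "ResponseTime"]

# Two parsed rows, copied from the output above
rows = [
    ('207.213.193.143', '2021-5-12T5:6:0.0+0430', 'Get',
     '/cdn/profiles/1026106239', '304', '0', 'Googlebot-Image/1.0', '32'),
    ('207.213.193.143', '2021-5-12T5:6:0.0+0430', 'Get',
     'images/badge.png', '304', '0', 'Googlebot-Image/1.0', '4'),
]

# In-memory buffer standing in for an open file (assumption for the sketch)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(columns)  # Header line
writer.writerows(rows)    # One line per parsed log entry
csv_text = buffer.getvalue()
print(csv_text)
```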
