DATA PREPROCESSING

First, a regular expression was used to match the pattern of each log line. The fields of every matching line were then appended to a dictionary one by one. Finally, the dictionary was converted to a pandas DataFrame and saved as a CSV file. Here is the code:

In [1]:
import pandas as pd
import re

pattern = re.compile(r'(\d+\.\d+\.\d+\.\d+) - - \[(.*?)\] "(.*?)" (\d+) (\d+)')

data = {"ip":[],"data_time":[],"request_method":[],"request_path":[],"protocol":[],"status_code":[],"response_size":[]}

with open("access_log.txt","r") as file:
    log_data = file.readlines()

for line in log_data:
    correct = pattern.match(line)
    if correct:
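        ip, datetime, request, status, size = correct.groups()
        parts = request.split()
        # Skip requests that do not consist of exactly method, path and protocol,
        # otherwise the unpacking below would raise a ValueError.
        if len(parts) != 3:
            continue
        method, path, protocol = parts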

        data["ip"].append(ip)
        data["data_time"].append(datetime)
        data["request_method"].append(method)
        data["request_path"].append(path)
        data["protocol"].append(protocol)
        data["status_code"].append(int(status))
        data["response_size"].append(int(size))

df = pd.DataFrame.from_dict(data)
df.to_csv("data.csv",index=False)
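
To make the role of each capture group concrete, here is a minimal sketch that applies the same pattern to a single line. The sample line is an assumption: it is reconstructed from the first parsed row of the data, so the exact formatting of the raw file may differ.

```python
import re

# Same pattern as in the cell above.
pattern = re.compile(r'(\d+\.\d+\.\d+\.\d+) - - \[(.*?)\] "(.*?)" (\d+) (\d+)')

# Assumed sample line, rebuilt from the first parsed row of the data.
sample = '10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 403 202'

match = pattern.match(sample)

# Five groups: IP, timestamp, full request string, status code, response size.
ip, date_time, request, status, size = match.groups()

# The request string is split further on whitespace.
method, path, protocol = request.split()

print(ip, date_time, method, path, protocol, status, size)
```

Note that the groups are returned as strings, which is why the cell above converts the status code and the response size with int().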

We print the head of the data:

In [2]:
import pandas as pd
dframe = pd.read_csv("data.csv")
dframe.head()

,ip,data_time,request_method,request_path,protocol,status_code,response_size
0,10.223.157.186,15/Jul/2009:14:58:59 -0700,GET,/,HTTP/1.1,403,202
1,10.223.157.186,15/Jul/2009:14:58:59 -0700,GET,/favicon.ico,HTTP/1.1,404,209
2,10.216.113.172,16/Jul/2009:02:51:29 -0700,GET,/assets/css/reset.css,HTTP/1.1,200,1014
3,10.216.113.172,16/Jul/2009:02:51:29 -0700,GET,/assets/css/960.css,HTTP/1.1,200,6206
4,10.216.113.172,16/Jul/2009:02:51:29 -0700,GET,/assets/js/the-associates.js,HTTP/1.1,200,4492


The output shows that the fields of each log line were separated correctly.

To check whether any row was separated incorrectly, we count the non-null values in each column:

In [3]:
print(dframe.notna().sum())

ip                54
data_time         54
request_method    54
request_path      54
protocol          54
status_code       54
response_size     54
dtype: int64


Every column contains the same number of non-null values (54), so no row is missing a field. The data preprocessing appears to be well done.
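
The non-null count only covers lines that matched the pattern; lines that did not match are skipped silently and never reach the DataFrame. As a complementary check, the sketch below separates matching from non-matching lines. The helper name split_matched and the two sample lines are hypothetical and only serve as illustration.

```python
import re

# Same pattern as in the first cell.
pattern = re.compile(r'(\d+\.\d+\.\d+\.\d+) - - \[(.*?)\] "(.*?)" (\d+) (\d+)')

def split_matched(lines):
    """Return (matched, unmatched) lists for the given raw log lines."""
    matched, unmatched = [], []
    for line in lines:
        if pattern.match(line):
            matched.append(line)
        else:
            unmatched.append(line)
    return matched, unmatched

# Hypothetical lines: the second one has "-" as its response size,
# which the final (\d+) group does not accept.
lines = [
    '10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 403 202',
    '10.223.157.186 - - [15/Jul/2009:15:00:00 -0700] "GET /x HTTP/1.1" 304 -',
]

matched, unmatched = split_matched(lines)
print(len(matched), "matched,", len(unmatched), "unmatched")
```

Running the same function on the lines read from access_log.txt would show whether any line was dropped during parsing.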

To get a uniform structure that does not depend on separate columns, I will convert each row into a plain sentence and store the sentences in a new column.

In [4]:
import pandas as pd

df = pd.read_csv("data.csv")
texts = []

for index, row in df.iterrows():
    ip = row["ip"]
    time = row["data_time"]
    method = row["request_method"]
    path = row["request_path"]
    protocol = row["protocol"]
    status = row["status_code"]
    size = row["response_size"]

    new_line = f"User with IP {ip}, made request of type {method}, on date and time {time}, accessed request path {path}, used {protocol} protocol, returned {status} status code, received {size} bytes of data."

    texts.append(new_line)

df["text"] = texts
df.to_csv("data.csv", index=False)

pd.set_option('display.max_colwidth', None)

print(df.head())

               ip                   data_time request_method  \
0  10.223.157.186  15/Jul/2009:14:58:59 -0700            GET   
1  10.223.157.186  15/Jul/2009:14:58:59 -0700            GET   
2  10.216.113.172  16/Jul/2009:02:51:29 -0700            GET   
3  10.216.113.172  16/Jul/2009:02:51:29 -0700            GET   
4  10.216.113.172  16/Jul/2009:02:51:29 -0700            GET   

                   request_path  protocol  status_code  response_size  \
0                             /  HTTP/1.1          403            202   
1                  /favicon.ico  HTTP/1.1          404            209   
2         /assets/css/reset.css  HTTP/1.1          200           1014   
3           /assets/css/960.css  HTTP/1.1          200           6206   
4  /assets/js/the-associates.js  HTTP/1.1          200           4492   
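
The row-to-sentence step can also be written as a small function, which makes it easy to test on a single record. This is a sketch, not a replacement for the cell above: the function name row_to_sentence is hypothetical, and the record below is taken from the first row of the data.

```python
def row_to_sentence(row):
    """Build the sentence for one log record.

    `row` can be any mapping that uses the column names of the DataFrame.
    """
    return (
        f"User with IP {row['ip']}, "
        f"made request of type {row['request_method']}, "
        f"on date and time {row['data_time']}, "
        f"accessed request path {row['request_path']}, "
        f"used {row['protocol']} protocol, "
        f"returned {row['status_code']} status code, "
        f"received {row['response_size']} bytes of data."
    )

# First row of the data, written as a plain dictionary.
record = {
    "ip": "10.223.157.186",
    "data_time": "15/Jul/2009:14:58:59 -0700",
    "request_method": "GET",
    "request_path": "/",
    "protocol": "HTTP/1.1",
    "status_code": 403,
    "response_size": 202,
}

sentence = row_to_sentence(record)
print(sentence)
```

Because a pandas row supports the same key lookup, df.apply(row_to_sentence, axis=1) should produce the same column as the explicit loop.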

Next, the sentences need to be converted into vectors and stored in a vector database.
FAISS was used for this purpose.
For the code, see the Python file called "log_vectorizer".
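
To illustrate the idea of vectorizing the sentences and searching them by similarity, here is a toy sketch in pure Python. It is not the code from "log_vectorizer" and it does not use FAISS: the word-count embedding and the brute-force L2 search are stand-ins chosen for illustration, and all names and sample texts are hypothetical.

```python
import math
from collections import Counter

def embed(text, vocabulary):
    """Toy embedding: word counts over a fixed vocabulary (not a real model)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(index, query_vector, k=1):
    """Brute-force search: the k (distance, position) pairs closest to the query."""
    scored = sorted(
        (l2_distance(vector, query_vector), position)
        for position, vector in enumerate(index)
    )
    return scored[:k]

# Hypothetical, shortened sentences standing in for the "text" column.
texts = [
    "returned 403 status code",
    "returned 404 status code",
    "returned 200 status code",
]

# The "index" is simply the list of vectors, in the same order as the texts.
vocabulary = sorted({word for text in texts for word in text.lower().split()})
index = [embed(text, vocabulary) for text in texts]

query = embed("404 status", vocabulary)
best_distance, best_position = search(index, query, k=1)[0]
print(texts[best_position])
```

A real setup would replace the word counts with embeddings from a language model and the brute-force loop with a proper vector index, but the flow stays the same: embed the texts, store the vectors, embed the query, and return the nearest entries.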