### Analysis of Parsed NGINX Access Logs

In [1]:
import numpy as np

In [2]:
import pandas as pd

In [3]:
# Load the parsed NGINX access logs into a DataFrame
access_logs = pd.read_csv('access_logs.csv')

In [4]:
access_logs.head()

Unnamed: 0,IP Address,Timestamp,HTTP Method,Request Path,Status Code,Response Size,Referrer,User Agent,Request Length,Query Parameters Count,...,Request Size Distribution,User-Agent Diversity,Time Interval Between Requests,Path Frequency,Suspicious Patterns,SQL Injection Detected,XSS Detected,Command Injection Detected,Insecure Deserialization Detected,File Inclusion Detected
0,172.18.28.9,2024-09-11T00:00:40+05:30,GET,/socket.io/?EIO=4&transport=polling&t=P7TDBHF&...,200,1,https://ss.dmrc.org/,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,274,4,...,1.0,1,0:00:00,1,No,No,No,No,No,No
1,172.18.28.9,2024-09-11T00:00:40+05:30,POST,/socket.io/?EIO=4&transport=polling&t=P7TDHOY&...,200,2,https://ss.dmrc.org/,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,275,4,...,1.5,1,0:00:00,1,No,No,No,No,No,No
2,172.18.28.9,2024-09-11T00:01:05+05:30,GET,/socket.io/?EIO=4&transport=polling&t=P7TDHOY....,200,1,https://ss.dmrc.org/,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,276,4,...,1.333333,1,0:00:12.500000,1,No,No,No,No,No,No
3,172.18.28.9,2024-09-11T00:01:06+05:30,POST,/socket.io/?EIO=4&transport=polling&t=P7TDNWZ&...,200,2,https://ss.dmrc.org/,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,275,4,...,1.5,1,0:00:08.666667,1,No,No,No,No,No,No
4,172.18.28.9,2024-09-11T00:01:31+05:30,GET,/socket.io/?EIO=4&transport=polling&t=P7TDNWa&...,200,1,https://ss.dmrc.org/,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,274,4,...,1.4,1,0:00:12.750000,1,No,No,No,No,No,No


In [5]:
# shape is (rows, columns): rows are log entries, columns are features
n_samples, n_features = access_logs.shape

In [6]:
print(f'number of features: {n_features}')
missing_values_count = access_logs.isnull().sum()
missing_values_count[0:n_features]

number of features: 28


IP Address                           0
Timestamp                            0
HTTP Method                          0
Request Path                         0
Status Code                          0
Response Size                        0
Referrer                             0
User Agent                           0
Request Length                       0
Query Parameters Count               0
Is Secure                            0
Time of Day                          0
Day of Week                          0
User-Agent Length                    0
Referrer Length                      0
Status Code Category                 0
Request Frequency                    0
Status Code Distribution             0
Request Size Distribution            0
User-Agent Diversity                 0
Time Interval Between Requests       0
Path Frequency                       0
Suspicious Patterns                  0
SQL Injection Detected               0
XSS Detected                         0
Command Injection Detecte

In [7]:
total_cells = np.prod(access_logs.shape)
total_missing = missing_values_count.sum()
percent_missing = (total_missing/total_cells) * 100
print(f'percentage missing: {percent_missing:.2f} %')

percentage missing: 0.00 %


In [8]:
# Every name yielded by access_logs.columns exists, so no membership check is needed
for feature in access_logs.columns:
    unique_count = access_logs[feature].nunique()
    print(f"Number of unique values for {feature}: {unique_count}")

Number of unique values for IP Address: 3
Number of unique values for Timestamp: 10151
Number of unique values for HTTP Method: 3
Number of unique values for Request Path: 33216
Number of unique values for Status Code: 6
Number of unique values for Response Size: 1238
Number of unique values for Referrer: 26
Number of unique values for User Agent: 270
Number of unique values for Request Length: 285
Number of unique values for Query Parameters Count: 6
Number of unique values for Is Secure: 1
Number of unique values for Time of Day: 10151
Number of unique values for Day of Week: 1
Number of unique values for User-Agent Length: 39
Number of unique values for Referrer Length: 19
Number of unique values for Status Code Category: 4
Number of unique values for Request Frequency: 37163
Number of unique values for Status Code Distribution: 37220
Number of unique values for Request Size Distribution: 37210
Number of unique values for User-Agent Diversity: 268
Number of unique values for Time In
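
The counts above show that `Is Secure` and `Day of Week` each hold a single distinct value, so they cannot help separate one request from another. A minimal sketch of how such columns might be listed in one step follows. The helper name `constant_columns` is hypothetical, and the toy frame contains made-up values for illustration only, not rows from `access_logs.csv`.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns with at most one distinct non-null value."""
    counts = df.nunique()  # distinct non-null values per column
    return counts[counts <= 1].index.tolist()


# Toy frame with made-up values (assumption: not taken from the real log file)
toy = pd.DataFrame({
    "IP Address": ["172.18.28.9", "172.18.28.9", "192.0.2.1"],
    "Is Secure": ["No", "No", "No"],
    "Day of Week": ["Wednesday", "Wednesday", "Wednesday"],
    "Status Code": [200, 404, 200],
})

print(constant_columns(toy))
```

On the real data the same call would be `constant_columns(access_logs)`. Whether to drop the columns it returns is a judgment call that depends on the later analysis.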

In [9]:
command_injection = access_logs[access_logs['Command Injection Detected'] == 'Yes']
print(command_injection)

        IP Address                  Timestamp HTTP Method  \
65     172.18.28.9  2024-09-11T00:07:04+05:30        POST   
66     172.18.28.9  2024-09-11T00:07:04+05:30         GET   
70     172.18.28.9  2024-09-11T00:07:04+05:30         GET   
72     172.18.28.9  2024-09-11T00:07:05+05:30         GET   
77     172.18.28.9  2024-09-11T00:07:28+05:30         GET   
...            ...                        ...         ...   
37123  172.18.28.9  2024-09-11T15:37:07+05:30         GET   
37124  172.18.28.9  2024-09-11T15:37:07+05:30         GET   
37138  172.18.28.9  2024-09-11T15:37:11+05:30         GET   
37150  172.18.28.9  2024-09-11T15:37:25+05:30         GET   
37209  172.18.28.9  2024-09-11T15:42:49+05:30         GET   

                                            Request Path  Status Code  \
65     /socket.io/?EIO=4&transport=polling&t=P7TElBO&...          200   
66     /socket.io/?EIO=4&transport=polling&t=P7TElBV&...          200   
70     /socket.io/?EIO=4&transport=websocket&sid

In [10]:
suspicious_patterns = access_logs[access_logs['Suspicious Patterns'] == 'Yes']
print(suspicious_patterns)

Empty DataFrame
Columns: [IP Address, Timestamp, HTTP Method, Request Path, Status Code, Response Size, Referrer, User Agent, Request Length, Query Parameters Count, Is Secure, Time of Day, Day of Week, User-Agent Length, Referrer Length, Status Code Category, Request Frequency, Status Code Distribution, Request Size Distribution, User-Agent Diversity, Time Interval Between Requests, Path Frequency, Suspicious Patterns, SQL Injection Detected, XSS Detected, Command Injection Detected, Insecure Deserialization Detected, File Inclusion Detected]
Index: []

[0 rows x 28 columns]
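
Printing each filtered frame shows individual rows but makes it hard to compare how often each detector fired. A minimal sketch of a per-flag tally follows, assuming every detection column stores the strings `'Yes'` and `'No'` as seen in the `head()` output. The helper name `count_flagged` and the toy frame are hypothetical, and the toy values are made up.

```python
import pandas as pd

# Detection columns as they appear in the DataFrame's column listing
FLAG_COLUMNS = [
    "Suspicious Patterns",
    "SQL Injection Detected",
    "XSS Detected",
    "Command Injection Detected",
    "Insecure Deserialization Detected",
    "File Inclusion Detected",
]


def count_flagged(df: pd.DataFrame, columns=FLAG_COLUMNS) -> pd.Series:
    """Count rows marked 'Yes' in each detection column present in df."""
    present = [c for c in columns if c in df.columns]  # skip absent columns
    return df[present].eq("Yes").sum()


# Toy frame with made-up values (assumption: not taken from the real log file)
toy = pd.DataFrame({
    "SQL Injection Detected": ["No", "Yes", "No"],
    "XSS Detected": ["No", "No", "No"],
    "Command Injection Detected": ["Yes", "Yes", "No"],
})

flagged = count_flagged(toy)
print(flagged)
```

Calling `count_flagged(access_logs)` would give one count per detector, which is easier to scan than six separate truncated printouts.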


In [11]:
sql_injection = access_logs[access_logs['SQL Injection Detected'] == 'Yes']
print(sql_injection)

        IP Address                  Timestamp HTTP Method  \
1710   172.18.28.9  2024-09-11T03:21:58+05:30        POST   
1711   172.18.28.9  2024-09-11T03:22:23+05:30         GET   
3504   172.18.28.9  2024-09-11T06:05:23+05:30         GET   
3953   172.18.28.9  2024-09-11T06:14:51+05:30         GET   
3987   172.18.28.9  2024-09-11T06:15:24+05:30         GET   
...            ...                        ...         ...   
36103  172.18.28.9  2024-09-11T15:21:46+05:30         GET   
36179  172.18.28.9  2024-09-11T15:23:06+05:30         GET   
36180  172.18.28.9  2024-09-11T15:23:06+05:30         GET   
37107  172.18.28.9  2024-09-11T15:37:04+05:30         GET   
37108  172.18.28.9  2024-09-11T15:37:04+05:30         GET   

                                            Request Path  Status Code  \
1710   /socket.io/?EIO=4&transport=polling&t=P7TxL--&...          200   
1711   /socket.io/?EIO=4&transport=polling&t=P7TxL--....          200   
3504   /socket.io/?EIO=4&transport=polling&t=P7U