# Log Analyzer
A set of routines for analyzing logs (currently Apache and Nginx) to help you with investigations.

---
## 1. Ingestion 🍖
You start by ingesting the data.

Logs are usually rotated and compressed, so you might need to decompress the files and join them into a single one.  Before uploading a file to this notebook, you **must** compress it with [gzip](https://www.gnu.org/software/gzip/manual/gzip.html): the upload cell expects a gzipped file, and compression makes the transfer much faster.
💡 To make things even faster, you might also want to **filter** the logs before uploading them.  Here are some useful commands to run locally:

```shell
gzip -d *gz                                                # decompresses gzipped files
cat *access.log* > apache-all.log                          # joins all `*access.log*`
grep "regex_filter" apache-all.log > apache-filtered.log   # gets only lines that match `regex_filter`
gzip apache-filtered.log                                   # compresses the file into `apache-filtered.log.gz`
```
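
If you would rather do this preparation in Python, the sketch below streams a mix of plain and gzipped files, keeps only the lines that match a regular expression, and writes the result straight to a gzipped file. The function name `filter_logs` and its parameters are hypothetical, chosen for this illustration; it is one possible approach and not part of the notebook.

```python
import gzip
import re
from pathlib import Path


def filter_logs(sources, regex_filter, destination):
    """Join plain or gzipped log files into one gzipped file, keeping matching lines.

    Hypothetical helper for illustration. Returns the number of lines written.
    """
    pattern = re.compile(regex_filter)
    kept = 0
    with gzip.open(destination, 'wt', encoding='utf-8') as out:
        for source in sources:
            source = Path(source)
            # Rotated logs are often a mix of plain and gzipped files.
            opener = gzip.open if source.suffix == '.gz' else open
            with opener(source, 'rt', encoding='utf-8', errors='replace') as f:
                for line in f:  # stream line by line to keep memory usage low
                    if pattern.search(line):
                        out.write(line)
                        kept += 1
    return kept
```

A call such as `filter_logs(sorted(Path('.').glob('*access.log*')), 'regex_filter', 'apache-filtered.log.gz')` would roughly mirror the shell commands above, without creating the intermediate files.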

In [None]:
#@title Raw Log Upload
import gzip as gz
from os import rename
from shutil import copyfileobj
from google.colab import files

RAW = 'raw.log'
PARSED = 'parsed.csv'
ERRORS = 'errors.log'

print('Select and upload the log file')
uploaded = files.upload()
rename(list(uploaded.keys())[0], f'{RAW}.gz')

try:
  with gz.open(f'{RAW}.gz','rb') as g:
    with open(RAW,'wb') as f:
      copyfileobj(g, f)
  print('🟢 Logs uploaded successfully')
except gz.BadGzipFile:
  print('🔴 File must be plain text and gzipped')

---
## 2. Parsing ✂️
After ingestion, the logs must be parsed so you can work with them.  The strategy is to match each line against a regular expression and write the captured fields to a CSV file that other tools can use later.
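
To make the idea concrete, here is a minimal sketch of how named groups in a regular expression can become CSV columns. The pattern is a simplified, hypothetical one that captures only three fields, and both sample lines are made up for this illustration.

```python
import csv
import io
import re

# Simplified, hypothetical pattern: it only captures three fields.
pattern = re.compile(
    r'^(?P<src_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] "(?P<http_method>\S+) '
)

# Made-up sample lines; the second one does not match on purpose.
lines = [
    '203.0.113.7 - - [12/Nov/2023:12:32:16 +0000] "GET /index.html HTTP/1.1" 200 512\n',
    'not a log line\n',
]

header = list(pattern.groupindex)  # the group names become the CSV header
rows, rejected = [], []
for line in lines:
    match = pattern.search(line)
    if match is None:
        rejected.append(line)      # set aside, like the errors file
    else:
        rows.append(list(match.groupdict().values()))

buffer = io.StringIO()             # stands in for the CSV file
out = csv.writer(buffer)
out.writerow(header)
out.writerows(rows)
print(buffer.getvalue())
```

Lines that do not match are collected separately instead of being dropped, so you can inspect them and adjust the pattern if too many are rejected.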

In [None]:
#@title Parser
from re import compile
from csv import writer

patterns = {
    'apache_access_pag_roxas': compile(r'^(?P<src_ip>\S+) \S+ \S+ \[(?P<timestamp>[\w:/]+\s[+\-]\d{4})\] \"(?P<http_method>\S+) (?P<resource>\S+)? (?P<http_version>\S+)?\" (?P<http_status>\d{3}|-) (?P<bytes>\d+) \"(?P<referrer>.+)\" \"(?P<user_agent>.*?)$'),
    'apache_error_pag_roxas': compile(r'^\[(?P<timestamp>[^\]]+)\] \[(?P<level>[^\]]+)\] \[pid (?P<pid>\d+)[^\]]+\] \[client (?P<src_ip>[^:]+):(?P<port>\d+)\] (?P<message>.+)$')
}

regex = 'apache_access_pag_roxas'  #@param['apache_access_pag_roxas','apache_error_pag_roxas']

pattern = patterns[regex]
header = list(pattern.groupindex.keys())
indexed = list()
unindexed = list()
success = 0
errors = 0

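with open(RAW, 'r') as f:
  for line in f:  # iterate lazily instead of loading every line with readlines()
    match = pattern.search(line)
    if match is None:
      # lines that do not match the pattern go to the errors file
      unindexed.append(line)
      errors += 1
    else:
      indexed.append(list(match.groupdict().values()))
      success += 1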

with open(PARSED, 'w', newline='') as f:  # newline='' lets the csv module control line endings
  w = writer(f)
  w.writerow(header)
  w.writerows(indexed)

if errors > 0:
  with open(ERRORS, 'w') as f:
    f.write(''.join(unindexed))

print(f'Parsing done: ✅ {success} logs written ({PARSED}), ⛔ {errors} errors found ({ERRORS})')


In [None]:
#@title Optional: File Explorer
with open(PARSED, 'r') as f:
  for line in f:
    print(line, end='')  # each line already ends with a newline

---
## 3. Normalization 🎚️
With the logs parsed, it's time to prepare the data so it is easier to work with.
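
As a small illustration of what this preparation means for a single record, the sketch below uses only the standard library to parse a timestamp in the Apache access-log format, convert it to UTC, and decode a percent-encoded resource. Both input values are made up for this example; the notebook itself applies the same ideas to whole columns with pandas.

```python
from datetime import datetime, timezone
from urllib.parse import unquote

raw_timestamp = '12/Nov/2023:14:32:16 +0200'  # made-up value in Apache access-log format
raw_resource = '/search?q=%27%20OR%201%3D1'   # made-up percent-encoded request path

# %b expects English month abbreviations under the default C locale.
parsed = datetime.strptime(raw_timestamp, '%d/%b/%Y:%H:%M:%S %z')
as_utc = parsed.astimezone(timezone.utc)      # one time zone makes records comparable

print(as_utc.isoformat())     # 2023-11-12T12:32:16+00:00
print(unquote(raw_resource))  # /search?q=' OR 1=1
```

Decoding the resource makes payloads readable and searchable, which matters later when filtering on suspicious strings.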

In [None]:
#@title Data Frame Load
import pandas as pd

pd.options.display.max_colwidth = None
pd.options.display.max_rows = None
pd.options.display.precision = 2

df = pd.read_csv(PARSED)

print(f'🟢 Loaded {len(df)} rows, {len(df.columns)} columns')

In [None]:
#@title Data Preparation
#
# EXAMPLES
#
df.head(10)
# df.tail()
# df.sample(frac=0.1, random_state=529)
# df.columns
# df.columns.values
# df.dtypes
# df.info()
# df.describe()
# df.sort_values('timestamp', ascending=False)
# df.sort_values(['timestamp', 'src_ip'], ascending=[True,False])
# df['total'] = df['attack'] + df['defense']  # creates a new field based on others


#
# APACHE DATA
#
apache_timestamp = '%d/%b/%Y:%H:%M:%S %z'     # 12/Nov/2023:12:32:16 +0000
# nginx_timestamp  = '%a %b %d %H:%M:%S.%f %Y'  # Fri Dec 01 22:19:36.814868 2023
timestamp = apache_timestamp
df['timestamp'] = pd.to_datetime(df['timestamp'], format=timestamp, utc=True)

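from urllib.parse import unquote
# decode percent-encoding; skip missing values (NaN), which unquote() cannot handle
df['resource'] = df['resource'].apply(lambda r: unquote(r) if isinstance(r, str) else r)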


#
# STATISTICS
#
# df.groupby(['src_ip']).count()
# df.groupby(['timestamp', 'src_ip']).count().sort_values('http_method', ascending=False).head()
print(f'🟢 Data is ready: {len(df)} rows, {len(df.columns)} columns')

---
## 4. Filtering 🛠
The last step before the investigation is to filter the data, keeping only the parts you will actually use.
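
The sketch below shows three common filter shapes on a tiny, made-up data frame that reuses the column names produced by the access-log pattern: an exact match, membership in a watch list, and exclusion by regular expression. The IP addresses and resources are invented for this illustration.

```python
import pandas as pd

# Made-up rows using the column names produced by the access-log pattern.
sample = pd.DataFrame({
    'src_ip': ['198.51.100.4', '203.0.113.7', '198.51.100.4', '192.0.2.10'],
    'http_method': ['GET', 'POST', 'GET', 'GET'],
    'resource': ['/index.html', '/login', '/item?id=1 AND SLEEP(5)', '/about'],
})

# Rows from a single source IP.
from_one_ip = sample[sample['src_ip'] == '198.51.100.4']

# Rows from any IP in a watch list.
watch_list = ['203.0.113.7', '192.0.2.10']
from_watch_list = sample[sample['src_ip'].isin(watch_list)]

# Rows whose resource does NOT look like a time-based SQL injection probe.
is_probe = sample['resource'].str.contains(r'sleep\(\d+\)', regex=True, case=False)
without_probes = sample[~is_probe]

print(len(from_one_ip), len(from_watch_list), len(without_probes))
```

Each filter returns a new data frame, so assign the result back to `df` only when you are sure you no longer need the excluded rows.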

In [None]:
#@title Filter 1
#
# EXAMPLES
#
# df[['src_ip', 'resource']][0:10]
# df.iloc[4:12]
# df.loc[df['src_ip'] == '176.29.111.26']
# df.loc[(df['src_ip'] == '176.29.111.26') & ~(df['http_method'] == 'GET')]
# df.reset_index(drop=True, inplace=True)
# df[~df['resource'].str.contains(r'sleep\(')].head()
# df[~df['user_agent'].str.contains(r'SLEEP\(\d+\)', regex=True, case=False)]
# df.loc[df['src_ip'] == '159.223.105.70', 'src_ip'] = 'REDACTED'
# df.loc[df['src_ip'] == '179.182.216.200', ['resource','user_agent']] = 'SUSPICIOUS'
# df.loc[df['src_ip'] == '179.182.216.200', ['resource','user_agent']] = ['SUSPICIOUS_PA','SUSPICIOUS_UA']
df = df.sort_values('timestamp')

# suspicious_ip = ['176.29.111.26', '64.227.19.165', '66.249.93.36', '189.40.75.35']
# df[df['src_ip'].isin(suspicious_ip)].head()

print('🟢 Filters applied')

---
## 5. Investigation 🔍
All data is now in place and ready to use. In this section you will run operations and visualizations to find the information you need.
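
As a starting point, the following sketch shows three typical questions answered on a small, made-up data frame: which source IPs are the most active, how requests are distributed per hour, and which status codes each IP received. The rows are invented for this illustration and only reuse the column names from the parser.

```python
import pandas as pd

# Made-up rows; timestamps are already normalized to UTC.
sample = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2023-11-12 12:05:00', '2023-11-12 12:40:00',
        '2023-11-12 13:10:00', '2023-11-12 13:15:00', '2023-11-12 13:50:00',
    ], utc=True),
    'src_ip': ['198.51.100.4', '203.0.113.7', '198.51.100.4', '198.51.100.4', '192.0.2.10'],
    'http_status': [200, 404, 200, 500, 200],
})

# Most active source IPs, in descending order.
per_ip = sample['src_ip'].value_counts()

# Requests per hour, using a formatted label as the grouping key.
hour_label = sample['timestamp'].dt.strftime('%Y-%m-%d %H:00')
per_hour = sample.groupby(hour_label).size()

# Status codes received by each source IP.
status_by_ip = pd.crosstab(sample['src_ip'], sample['http_status'])

print(per_ip, per_hour, status_by_ip, sep='\n\n')
```

A sudden spike in the hourly counts, or one IP collecting many error statuses, is often a good place to start looking more closely.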

In [None]:
print(f'First log: {df["timestamp"].iloc[0]}')
print(f'Last log.: {df["timestamp"].iloc[-1]}')

In [None]:
# df.to_csv('output.csv', index=False)

ax = df['src_ip'].value_counts().plot(kind='bar', title='Source IP Count')
ax.set_xlabel('Source IP')
ax.set_ylabel('Count')

## Pro Tips 💡

You can run shell commands in this notebook either inline with `!command args` or with a magic function that makes the whole cell act as a terminal:

```shell
%%shell
command_1 args
command_n args
```