# Demo HTTP access logs

For some rudimentary testing, let's create some mid-fidelity Apache-format HTTP access logs. Each row will look like this:
```
123.66.150.17 - - [12/Aug/2010:02:45:59 +0000] "POST /wordpress3/wp-admin/admin-ajax.php HTTP/1.1" 200 2 "http://www.example.com/wordpress3/wp-admin/post-new.php" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.25 Safari/534.3"
```

We're going to start from some existing public access logs, look for any patterns we should mimic (or distributions we should sample from), and then create our data using `mimesis` and our empirical findings.
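To make the target format concrete, here is a minimal sketch of rendering one such row from its parts using only the standard library. The function name `format_access_log_line` and its parameters are hypothetical, chosen for illustration; the two `-` placeholders stand for the remote log name and user ID fields, which are simply left empty in this sketch.

```python
from datetime import datetime, timezone


def format_access_log_line(ip, timestamp, method, path, status, length,
                           referrer="-", user_agent="-", version="HTTP/1.1"):
    """Render one row in the same layout as the sample row above.

    `timestamp` must be timezone-aware, otherwise %z renders as an empty string.
    """
    time_str = timestamp.strftime("%d/%b/%Y:%H:%M:%S %z")
    return (
        f'{ip} - - [{time_str}] "{method} {path} {version}" '
        f'{status} {length} "{referrer}" "{user_agent}"'
    )


# Rebuild the sample row from its individual fields.
line = format_access_log_line(
    ip="123.66.150.17",
    timestamp=datetime(2010, 8, 12, 2, 45, 59, tzinfo=timezone.utc),
    method="POST",
    path="/wordpress3/wp-admin/admin-ajax.php",
    status=200,
    length=2,
    referrer="http://www.example.com/wordpress3/wp-admin/post-new.php",
    user_agent=(
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) "
        "AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.25 Safari/534.3"
    ),
)
print(line)
```

Note that `%b` depends on the process locale; the sketch assumes the default English month abbreviations.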

## Example (real) access log

I have no idea what "Almhuette Raith" is, but they seem to have a large log publicly available at http://www.almhuette-raith.at/apache-log/access.log, so we'll arbitrarily start with that.

I've previously downloaded this (1.7GB) file and re-uploaded it to S3 as a much smaller (78MB) gzip file. We'll download and unzip it.

In [11]:
!wget https://hpw-public-demos.s3.us-west-2.amazonaws.com/almhuette-raith_access.log.gz
!gunzip almhuette-raith_access.log.gz

--2022-09-27 12:56:36--  https://hpw-public-demos.s3.us-west-2.amazonaws.com/almhuette-raith_access.log.gz
Resolving hpw-public-demos.s3.us-west-2.amazonaws.com (hpw-public-demos.s3.us-west-2.amazonaws.com)... 52.218.178.194
Connecting to hpw-public-demos.s3.us-west-2.amazonaws.com (hpw-public-demos.s3.us-west-2.amazonaws.com)|52.218.178.194|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 81675207 (78M) [application/x-gzip]
Saving to: ‘almhuette-raith_access.log.gz’


2022-09-27 12:57:44 (1.15 MB/s) - ‘almhuette-raith_access.log.gz’ saved [81675207/81675207]



1.7GB is small enough that we can load it into memory without doing anything special, so let's do that.

In [12]:
with open("almhuette-raith_access.log") as infile:
    access_records = infile.readlines()

len(access_records)

9595040
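If the file were too large to hold in memory, one alternative would be to stream lines straight from the gzip archive instead of unzipping and calling `readlines()`. The following is only a sketch of that alternative, not what this notebook does: the helper name `iter_log_lines` is hypothetical, and `errors="replace"` is an assumption made so that stray non-UTF-8 bytes don't abort the read.

```python
import gzip


def iter_log_lines(path):
    """Yield log lines one at a time from a gzip-compressed file."""
    # "rt" opens the archive in text mode, decompressing lazily as we iterate.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as infile:
        for line in infile:
            yield line.rstrip("\n")
```

Counting records would then look like `sum(1 for _ in iter_log_lines(path))`, with `path` pointing at a still-compressed copy of the log (`gunzip` removes the `.gz` file by default).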

There's at least one package on PyPI that can parse Apache logs (`apachelogs`), but actually using it is a bit of a pain and I'm not sure how robust it is anyway.

I'm going to write a simple parser that will work for our purposes; it's just a big regular expression, and probably not very robust.

In [14]:
import re

In [33]:
access_log_pattern = re.compile(
    r'(?P<ip>.*?) (?P<remote_log_name>.*?) (?P<userid>.*?) \[(?P<date>.*?)(?= ) (?P<timezone>.*?)\] '
    r'"(?P<request_method>\w*) (?P<path>(?:[^"\\]|\\.)*?)(?P<request_version> HTTP/.*)?" '
    r'(?P<status>\d*?) (?P<length>-|\d*) "(?P<referrer>(?:[^"\\]|\\.)*?)" "(?P<user_agent>(?:[^"\\]|\\.)*?)"'
)

access_log_pattern.match(access_records[0]).groupdict()

{'ip': '13.66.139.0',
 'remote_log_name': '-',
 'userid': '-',
 'date': '19/Dec/2020:13:57:26',
 'timezone': '+0100',
 'request_method': 'GET',
 'path': '/index.php?option=com_phocagallery&view=category&id=1:almhuette-raith&Itemid=53',
 'request_version': ' HTTP/1.1',
 'status': '200',
 'length': '32653',
 'referrer': '-',
 'user_agent': 'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)'}
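Every captured field comes back as a string. As a small sketch of turning them into typed values, the `date` and `timezone` fields can be joined and parsed with `datetime.strptime`, and `length` can be converted to an integer while treating `-` as missing. The helper names `parse_log_timestamp` and `parse_length` are hypothetical.

```python
from datetime import datetime


def parse_log_timestamp(date, tz):
    """Combine the 'date' and 'timezone' fields into a timezone-aware datetime."""
    return datetime.strptime(f"{date} {tz}", "%d/%b/%Y:%H:%M:%S %z")


def parse_length(length):
    """Convert the 'length' field to an int; '-' (no body) becomes None."""
    return None if length == "-" else int(length)


ts = parse_log_timestamp("19/Dec/2020:13:57:26", "+0100")
print(ts.isoformat())         # 2020-12-19T13:57:26+01:00
print(parse_length("32653"))  # 32653
print(parse_length("-"))      # None
```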

That seems to work on all the lines in the Almhuette log at least. I wouldn't bet money on it working on all Apache logs, but it's good enough for now.
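One way to get a feel for where the pattern holds up and where it doesn't is to run it against a few hand-written lines. The lines below are constructed for illustration and are not taken from the Almhuette log; the last one uses a bare `"-"` as the request, which is an assumption about what some servers may log when no request line was received.

```python
import re

# Same pattern as in the cell above, repeated so this check is self-contained.
access_log_pattern = re.compile(
    r'(?P<ip>.*?) (?P<remote_log_name>.*?) (?P<userid>.*?) \[(?P<date>.*?)(?= ) (?P<timezone>.*?)\] '
    r'"(?P<request_method>\w*) (?P<path>(?:[^"\\]|\\.)*?)(?P<request_version> HTTP/.*)?" '
    r'(?P<status>\d*?) (?P<length>-|\d*) "(?P<referrer>(?:[^"\\]|\\.)*?)" "(?P<user_agent>(?:[^"\\]|\\.)*?)"'
)

# Hand-written test lines (hypothetical, apart from the sample row from the intro).
cases = {
    "sample row": (
        r'123.66.150.17 - - [12/Aug/2010:02:45:59 +0000] '
        r'"POST /wordpress3/wp-admin/admin-ajax.php HTTP/1.1" 200 2 '
        r'"http://www.example.com/wordpress3/wp-admin/post-new.php" '
        r'"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) '
        r'AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.25 Safari/534.3"'
    ),
    "escaped quotes": (
        r'1.2.3.4 - - [01/Jan/2021:00:00:00 +0000] "GET / HTTP/1.1" 200 5 '
        r'"-" "agent \"quoted\" name"'
    ),
    "no HTTP version": (
        r'1.2.3.4 - - [01/Jan/2021:00:00:00 +0000] "GET /index.html" 200 10 "-" "-"'
    ),
    "empty request": (
        r'1.2.3.4 - - [01/Jan/2021:00:00:00 +0000] "-" 408 - "-" "-"'
    ),
}

results = {}
for name, line in cases.items():
    match = access_log_pattern.match(line)
    # Store the parsed fields, or None when the pattern does not match.
    results[name] = match.groupdict() if match else None
    print(f"{name}: {'matched' if match else 'NO MATCH'}")
```

The first three lines match (with `request_version` set to `None` when the version is absent), while the bare `"-"` request does not, because the pattern requires a space after the request method.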

In [34]:
from collections import defaultdict, Counter

In [35]:
field_counts = defaultdict(Counter)

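for record in access_records:
    match = access_log_pattern.match(record)
    if match:
        # Tally how often each distinct value appears in each field.
        for field, value in match.groupdict().items():
            field_counts[field][value] += 1
    else:
        # Stop at the first unparseable line so we can inspect it.
        print("Failed to match record:", record)
        break

# Report the number of distinct values seen per field.
for field, counts in field_counts.items():
    print(f"{field}: {len(counts)}")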

ip: 34521
remote_log_name: 1
userid: 3
date: 456915
timezone: 2
request_method: 16
path: 5753269
request_version: 2
status: 15
length: 50562
referrer: 2544
user_agent: 14472
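For the low-cardinality fields (for example `status` or `request_method`), one simple way to "sample from the distribution" is to draw values in proportion to their observed counts. The sketch below does this with `random.choices`; the helper name `sample_from_counts` is hypothetical, and `toy_status_counts` is a made-up stand-in, not the real tallies from the log. Fields with millions of distinct values, such as `path`, are probably better generated than sampled.

```python
import random
from collections import Counter


def sample_from_counts(counts, k, rng=random):
    """Draw k values with probability proportional to their observed counts."""
    values = list(counts)
    weights = [counts[value] for value in values]
    return rng.choices(values, weights=weights, k=k)


# Made-up counts for illustration only; substitute e.g. field_counts["status"].
toy_status_counts = Counter({"200": 90, "404": 8, "500": 2})

rng = random.Random(0)  # seeded so the demo is reproducible
sampled = sample_from_counts(toy_status_counts, k=1000, rng=rng)
print(Counter(sampled).most_common())
```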
