# Parsing Logfiles with Regex

This week, we're going to play with files.  Specifically, logfiles.

Many of the Python developers I teach (and work with) deal with logfiles of various sorts, and want to process those files in various ways.  This week, we'll see how we can turn a logfile into a list of dictionaries, making it easier to manipulate.

For the purposes of this exercise, we'll be looking at one of my favorite files, which I call "mini-access-log.txt".  It is an excerpt from the access log of the Apache HTTP server on my computer (lerner.co.il) from several years ago.  You can view and download it from here:

    https://gist.github.com/reuven/5875971

Each line of this logfile looks like the following:

    67.218.116.165 - - [30/Jan/2010:00:03:18 +0200] "GET /robots.txt HTTP/1.0" 200 99 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"

As you can see:

- The fields aren't cleanly separated by whitespace (some contain spaces themselves), but are still recognizable to a human.
- Each line starts with an IP address.
- Between square brackets, we have the timestamp -- date, time, and then the time zone.
- Following the timestamp, inside double quotes ("), we have the HTTP request.  You can assume that in this file, all of the requests start with the word GET.
- There are other fields as well, but these are the ones that interest me.
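
To see why a plain whitespace split isn't enough here, the short sketch below runs `str.split()` on the sample line shown above. It is only an illustration, and the variable names are my own.

```python
# The sample line from the logfile, copied from above
line = ('67.218.116.165 - - [30/Jan/2010:00:03:18 +0200] '
        '"GET /robots.txt HTTP/1.0" 200 99 "-" '
        '"Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"')

fields = line.split()

print(fields[0])    # the IP address comes out cleanly
print(fields[3:5])  # the timestamp is broken into two pieces
print(fields[5:8])  # the request is broken into three, quotes and all
```

The IP address survives the split, but the timestamp and the request would each have to be glued back together, which is why the solutions look for the delimiters (`[`, `]` and `"`) instead.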

The exercise is to write a function that, when given a filename, returns a list of dictionaries.  Each dict should have the following keys:

- ip_address, containing the IP address
- timestamp, containing the timestamp (not including the square brackets, but everything inside of them)
- request, containing the HTTP request (not including the double quotes, but everything inside of them)

Thus, the above line from mini-access-log.txt would look like this:
    {'ip_address': '67.218.116.165',
     'timestamp': '30/Jan/2010:00:03:18 +0200',
     'request': 'GET /robots.txt HTTP/1.0'}

We'll transform the file into a list of dicts, each of which looks like that.  There are 206 lines in the file, which means that this list will contain 206 dictionaries, each with these three key-value pairs.
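
Part of the appeal of a list of dicts is that ordinary Python tools work on it directly. As a rough sketch, here is how one might count requests per IP address and filter by request. Only the first record comes from the logfile; the other two (using the documentation address 192.0.2.1) are made up purely for illustration.

```python
from collections import Counter

records = [
    # The sample line from above
    {'ip_address': '67.218.116.165',
     'timestamp': '30/Jan/2010:00:03:18 +0200',
     'request': 'GET /robots.txt HTTP/1.0'},
    # Two made-up entries, purely for illustration
    {'ip_address': '192.0.2.1',
     'timestamp': '30/Jan/2010:00:04:00 +0200',
     'request': 'GET /index.html HTTP/1.1'},
    {'ip_address': '192.0.2.1',
     'timestamp': '30/Jan/2010:00:05:00 +0200',
     'request': 'GET /robots.txt HTTP/1.1'},
]

# Count how many requests each IP address made
requests_per_ip = Counter(record['ip_address'] for record in records)
print(requests_per_ip.most_common(1))

# Keep only the requests for robots.txt
robots_hits = [record for record in records
               if record['request'].startswith('GET /robots.txt')]
print(len(robots_hits))
```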

Using a regular expression will definitely help here -- but if you don't know regexps, then don't worry; you can still get it to work, although it'll be a bit clunky.

In [1]:
import re

In [2]:
# Read the file line by line; each log entry is plain text, not CSV
with open('./data/mini-access-log.txt', 'r') as fp:
    logs = [line.rstrip('\n') for line in fp]

result_list = []
for log in logs:
    result = {}
    # IP address: four dot-separated numbers at the start of the line
    result['ip_address'] = re.search(r"^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+", log).group()
    # Timestamp: the text between the first [ and the next ]
    result['timestamp'] = re.search(r"\[([^\]]+)\]", log).group(1)
    # Request: text starting with GET, up to the first closing double quote "
    result['request'] = re.search(r'"(GET[^"]+)"', log).group(1)
    result_list.append(result)

print(result_list[20])

{'ip_address': '65.55.106.186', 'timestamp': '30/Jan/2010:07:07:13 +0200', 'request': 'GET /robots.txt HTTP/1.1'}


### Testing regex

In [6]:
log = logs[0]
print(log)
re.search(r"^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+", log).group()

67.218.116.165 - - [30/Jan/2010:00:03:18 +0200] "GET /robots.txt HTTP/1.0" 200 99 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"


'67.218.116.165'

In [4]:
re.search(r"\[([^\]]+)\]", log).group(1)

'30/Jan/2010:00:03:18 +0200'

In [5]:
re.search(r'"(GET[^"]+)"', log).group(1)

'GET /robots.txt HTTP/1.0'
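
A quick way to gain confidence in a pattern is to try it against a trickier line. The sketch below uses a made-up log line (not taken from mini-access-log.txt) whose later quoted fields end in digits, and compares a greedy `.+` pattern with one bounded by `[^"]`. It also shows what an unescaped `.` lets through.

```python
import re

# Hypothetical line: the quoted fields after the request also end with a digit
line = ('192.0.2.1 - - [30/Jan/2010:00:04:00 +0200] '
        '"GET /index.html HTTP/1.1" 200 512 '
        '"http://example.com/page2" "ExampleBot/1.0"')

greedy = re.search(r'(GET.+[0-9])"', line).group(1)
bounded = re.search(r'"(GET[^"]+)"', line).group(1)

print(greedy)   # runs on past the request, into the later fields
print(bounded)  # stops at the first closing double quote

# An unescaped . matches any character, so this pattern accepts a letter
print(bool(re.fullmatch(r'[0-9]+.[0-9]+', '12x34')))   # True
print(bool(re.fullmatch(r'[0-9]+\.[0-9]+', '12x34')))  # False
```

Because `[^"]+` cannot cross a double quote, the bounded pattern gives the same answer no matter what follows the request.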

### Solutions

In [None]:
import re

logfilename = 'mini-access-log.txt'

def re_line_to_dict(line):
    # Raw string, so that \d, \. and \[ reach the regex engine untouched
    regexp = r'''
((?:\d{1,3}\.){3}\d{1,3})       # IP addresses contain four numbers (each with 1-3 digits)
.*                              # Junk between IP address and timestamp
\[([^\]]+)\]                    # Timestamp, defined to be anything between [ and ]
.*                              # Junk between timestamp and request
"(GET[^"]+)"                    # Request, starting with GET
'''
    m = re.search(regexp, line, re.X)

    if m:
        ip_address = m.group(1)
        timestamp = m.group(2)
        request = m.group(3)

    else:
        ip_address = 'No IP address found'
        timestamp = 'No timestamp found'
        request = 'No request found'

    output = {'ip_address': ip_address,
              'timestamp': timestamp,
              'request': request}
    return output


def line_to_dict(line):
    ip_address = line.split()[0]

    timestamp_start = line.index('[') + 1
    timestamp_end = line.index(']')
    timestamp = line[timestamp_start:timestamp_end]

    request_start = line.index('"') + 1
    request_end = line.index('"', request_start)
    request = line[request_start:request_end]

    return {'ip_address': ip_address,
            'timestamp': timestamp,
            'request': request}


def logtolist(filename):
    # Open the file in a with block, so that it is closed once we've read it
    with open(filename) as infile:
        return [line_to_dict(line)
                for line in infile]


for one_item in logtolist(logfilename):
    print(one_item)
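
As a variation on the regexp solution, named groups let the match object build the dict for us via `groupdict()`. This is only a sketch: `LOG_PATTERN` and `named_line_to_dict` are names I've made up here, the junk between fields is matched lazily (`.*?`), and a non-matching line gives `None` rather than placeholder strings.

```python
import re

# Hypothetical names (LOG_PATTERN, named_line_to_dict), not part of the solution above
LOG_PATTERN = re.compile(r'''
(?P<ip_address>(?:\d{1,3}\.){3}\d{1,3})   # Four numbers, each with 1-3 digits
.*?                                       # Junk between IP address and timestamp
\[(?P<timestamp>[^\]]+)\]                 # Anything between [ and ]
.*?                                       # Junk between timestamp and request
"(?P<request>GET[^"]+)"                   # Request, starting with GET
''', re.X)


def named_line_to_dict(line):
    m = LOG_PATTERN.search(line)
    # groupdict() maps each group name to the text it matched
    return m.groupdict() if m else None


sample = ('67.218.116.165 - - [30/Jan/2010:00:03:18 +0200] '
          '"GET /robots.txt HTTP/1.0" 200 99 "-" '
          '"Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"')
print(named_line_to_dict(sample))
```

Compiling the pattern once at module level also avoids handing the pattern string to `re.search` on every line, which may matter on larger logfiles.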
