The Apache HTTP Server is a widely used open-source web server.

It can be configured to write log files in different formats. The most common one is the Common Log Format (CLF).

Each line of a CLF log has this form:
```
<IP Address> <Client ID> <User ID> <Time> <Request> <Status> <Size>
```

For example (a hyphen stands for a missing value, here the client ID):
```
127.0.0.1 - swills [13/Nov/2019:14:43:30 -0800] "GET /assets/234 HTTP/1.0" 200 2326
```

In [1]:
# Named groups in the re module let us pull individual fields out of a log line
import re

log_line = '127.0.0.1 - swills [13/Nov/2019:14:43:30 -0800] "GET /assets/234 HTTP/1.0" 200 2326'

# Let's extract only the IP
match = re.search(r'(?P<IP>\d+\.\d+\.\d+\.\d+)', log_line)
match.group('IP')

'127.0.0.1'

In [2]:
# We can also use a named group to get the time
# [+-] accepts time zone offsets on either side of UTC
match_2 = re.search(r'\[(?P<Time>\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]', log_line)
match_2.group('Time')

'13/Nov/2019:14:43:30 -0800'
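The captured time is still plain text. As a minimal sketch, the standard library's `datetime.strptime` can turn it into a timezone-aware object, which makes comparisons by day or hour easier. The format string below is an assumption based on the example line, and `%b` expects English month abbreviations, so it relies on the default (C/English) locale. The names `time_text` and `parsed` are illustrative only.

```python
from datetime import datetime

# The string captured by the Time group in the cell above
time_text = '13/Nov/2019:14:43:30 -0800'

# %d/%b/%Y:%H:%M:%S covers the date and time, %z the UTC offset
parsed = datetime.strptime(time_text, '%d/%b/%Y:%H:%M:%S %z')

print(parsed.isoformat())  # 2019-11-13T14:43:30-08:00
print(parsed.day, parsed.month, parsed.year)  # 13 11 2019
```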

In [17]:
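# We can also grab multiple groups with one pattern
r = r'(?P<IP>\d+\.\d+\.\d+\.\d+)'
# The literal ' - ' matches the empty client ID; the group named Client captures the user ID field
r += r' - (?P<Client>\w+) '
r += r'\[(?P<Time>\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]'
# [^"]+ stops at the closing quote instead of relying on backtracking
r += r' "(?P<Request>[^"]+)"'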
r += r' (?P<Status>\d{3})'
r += r' (?P<Size>\d+)'

m = re.search(r, log_line)

In [18]:
m.group('IP')

'127.0.0.1'

In [19]:
m.group('Client')

'swills'

In [20]:
m.group('Time')

'13/Nov/2019:14:43:30 -0800'

In [21]:
m.group('Request')

'GET /assets/234 HTTP/1.0'

In [22]:
m.group('Status')

'200'

In [23]:
m.group('Size')

'2326'
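Calling `group()` once per field works, but a match object can also return every named group at once through `groupdict()`. The following is a sketch rather than a complete CLF parser: the pattern is compiled with `re.VERBOSE` so each field sits on its own line, and the group names `Ident` and `User` are illustrative choices for the second and third fields, not names used elsewhere in this notebook.

```python
import re

log_line = '127.0.0.1 - swills [13/Nov/2019:14:43:30 -0800] "GET /assets/234 HTTP/1.0" 200 2326'

# In VERBOSE mode, whitespace in the pattern is ignored, so \s matches the separators
CLF_PATTERN = re.compile(r'''
    (?P<IP>\d+\.\d+\.\d+\.\d+)\s   # client IP address
    (?P<Ident>\S+)\s               # client ID, often a hyphen
    (?P<User>\S+)\s                # user ID
    \[(?P<Time>[^\]]+)\]\s         # timestamp between square brackets
    "(?P<Request>[^"]*)"\s         # request line between double quotes
    (?P<Status>\d{3})\s            # HTTP status code
    (?P<Size>\d+|-)                # response size, or a hyphen if absent
''', re.VERBOSE)

# groupdict() returns all named groups as a dictionary of strings
fields = CLF_PATTERN.match(log_line).groupdict()
fields['Status'] = int(fields['Status'])  # convert where a number is more useful

for name, value in fields.items():
    print(name, '->', value)
```

A dictionary is convenient when the fields are passed on to other code, for example to build rows for a table or to count requests per status code.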

So far we have parsed a single log line. Now suppose we want to search an entire log file for the IP addresses of all clients that made a GET request on 8 November 2019.

In [27]:
access_log = '''127.0.0.1 - swills [13/Nov/2019:14:43:30 -0800] "GET /assets/234 HTTP/1.0" 200 2326
192.168.54.8 - dude [08/Nov/2019:15:21:34 -0800] "GET /assets/234 HTTP/1.0" 200 3627
172.3.45.1 - otherDude [08/Nov/2019:16:31:54 -0800] "GET /assets/234 HTTP/1.0" 200 3457
17.21.50.11 - anotherDude [08/Nov/2019:16:31:57 -0800] "GET /assets/234 HTTP/1.0" 200 2157
3.43.24.12 - someDude [08/Nov/2019:17:01:07 -0800] "GET /assets/234 HTTP/1.0" 200 3239'''

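r = r'(?P<IP>\d+\.\d+\.\d+\.\d+)'
r += r' - (?P<Client>\w+) '
# Hard-code the date so that only entries from 08/Nov/2019 match
r += r'\[(?P<Time>08/Nov/2019:\d{2}:\d{2}:\d{2} [+-]\d{4})\]'
# Require the request to start with GET
r += r' (?P<Request>"GET [^"]+")'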

matched = re.finditer(r, access_log)

for match in matched:
    print(match.group('IP'))

192.168.54.8
172.3.45.1
17.21.50.11
3.43.24.12
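Hard-coding the date and method into the pattern works for a one-off search. A more reusable variant, sketched here under a few assumptions, matches any line and filters afterwards. The helper name `ips_for`, the group name `Method`, and the `sample` list are hypothetical; the sample reuses the first three lines of the log above.

```python
import re
from datetime import date, datetime

# One general pattern: IP, two skipped fields, timestamp, request method
LINE_PATTERN = re.compile(
    r'(?P<IP>\d+\.\d+\.\d+\.\d+) \S+ \S+ '
    r'\[(?P<Time>[^\]]+)\] '
    r'"(?P<Method>[A-Z]+) [^"]*"'
)

def ips_for(lines, method, day):
    """Yield the IP of every line with the given method on the given date."""
    for line in lines:
        m = LINE_PATTERN.match(line)
        if m is None:
            continue  # skip lines that are not in the expected format
        when = datetime.strptime(m.group('Time'), '%d/%b/%Y:%H:%M:%S %z')
        if m.group('Method') == method and when.date() == day:
            yield m.group('IP')

sample = [
    '127.0.0.1 - swills [13/Nov/2019:14:43:30 -0800] "GET /assets/234 HTTP/1.0" 200 2326',
    '192.168.54.8 - dude [08/Nov/2019:15:21:34 -0800] "GET /assets/234 HTTP/1.0" 200 3627',
    '172.3.45.1 - otherDude [08/Nov/2019:16:31:54 -0800] "GET /assets/234 HTTP/1.0" 200 3457',
]

result = list(ips_for(sample, 'GET', date(2019, 11, 8)))
print(result)  # ['192.168.54.8', '172.3.45.1']
```

Because `ips_for` only iterates over lines, an open file object can be passed in place of the list, so a large log does not have to be loaded into memory as one string.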
