
### **Assignment**

Write a script that, given a web server log file, returns the 10 most frequently requested objects
and their cumulative bytes transferred. Only include GET requests with Successful (HTTP 2xx)
responses. Resolve ties however you’d like.
Log format:
- request date, time, and time zone
- request line from the client
- HTTP status code returned to the client
- size (in bytes) of the returned object

**Given this input data:**
```
[01/Aug/1995:00:54:59 -0400] "GET /images/opf-logo.gif HTTP/1.0" 200 32511
[01/Aug/1995:00:55:04 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:06 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 403 298
[01/Aug/1995:00:55:09 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:18 -0400] "GET /images/opf-logo.gif HTTP/1.0" 200 32511
[01/Aug/1995:00:56:52 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635
```
**The result should be:**
```
/images/ksclogosmall.gif 10905
/images/opf-logo.gif 65022
```
(You may tweak the output format as you'd like)

In [0]:
test_file_0 = """
[01/Aug/1995:00:54:59 -0400] "GET /images/opf-logo.gif HTTP/1.0" 200 32511
[01/Aug/1995:00:55:04 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:06 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 403 298
[01/Aug/1995:00:55:09 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:18 -0400] "GET /images/opf-logo.gif HTTP/1.0" 200 32511
[01/Aug/1995:00:55:52 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635
"""

test_file_1 = """
[01/Aug/1995:00:54:59 -0400] "GET /images/opf-logo.gif HTTP/1.0" 200 32511
[01/Aug/1995:00:55:04 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:06 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 403 298
[01/Aug/1995:00:55:09 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:18 -0400] "GET /images/opf-logo.gif HTTP/1.0" 200 32511
[01/Aug/1995:00:55:52 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:52 -0400] "GET /images/ksclogosmall_1.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:52 -0400] "GET /images/ksclogosmall_2.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:52 -0400] "GET /images/ksclogosmall_3.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:52 -0400] "GET /images/ksclogosmall_4.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:52 -0400] "GET /images/ksclogosmall_5.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:52 -0400] "GET /images/ksclogosmall_6.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:18 -0400] "GET /images/opf-logo.gif HTTP/1.0" 200 32511
[01/Aug/1995:00:55:52 -0400] "GET /images/ksclogosmall_7.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:52 -0400] "GET /images/ksclogosmall_8.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:52 -0400] "GET /images/ksclogosmall_a.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:52 -0400] "GET /images/ksclogosmall_9.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:52 -0400] "GET /images/ksclogosmall_0.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:52 -0400] "GET /images/ksclogosmall_a.gif HTTP/1.0" 200 3635
"""

with open("test_file_0.log", "w", encoding="utf-8") as out_file:
    out_file.write(test_file_0)

with open("test_file_1.log", "w", encoding="utf-8") as out_file:
    out_file.write(test_file_1)


In [2]:
import re
from collections import Counter, defaultdict

def n_most_frequent_requests(log_file, n=10):
    """Return the n most frequently requested objects among successful
    (HTTP 2xx) GET requests, as (object, total_bytes, count) tuples."""
    regex = r'^\[.*\]\s+"GET (\S+).*"\s+2\d\d\s+(\d+)$'
    #                        ^^^^^              ^^^^^
    #                        object             size_in_bytes

    object_counter = Counter()
    byte_sums = defaultdict(int)

    with open(log_file, "r", encoding="utf-8") as in_file:
        for line in in_file:
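            # Evaluate the regex once per line and reuse the match object
            match = re.search(regex, line)
            if match:
                # Avoid shadowing the built-in name `object`
                requested_object, size_in_bytes = match.groups()
                object_counter[requested_object] += 1
                byte_sums[requested_object] += int(size_in_bytes)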


    n_most_common = object_counter.most_common(n)

    results = []
    for key, count in n_most_common:
        results.append((key, byte_sums[key], count))

    return results

#Let's run a few tests
test_files = ["test_file_0.log", "test_file_1.log"]
for test_file in test_files:
    print(f"Testing {test_file}:")
    header_0, header_1, header_2 = "object", "bytes", "count"
    print(f"\t{header_0:<30} {header_1:>10} {header_2:>4}")
    # Use `total_bytes` rather than shadowing the built-in name `bytes`
    for key, total_bytes, count in n_most_frequent_requests(test_file):
        # I have added a third field to the output, count
        print(f"\t{key:<30} {total_bytes:>10} {count:>4}")
    print("\n")


Testing test_file_0.log:
	object                              bytes count
	/images/ksclogosmall.gif            10905    3
	/images/opf-logo.gif                65022    2


Testing test_file_1.log:
	object                              bytes count
	/images/opf-logo.gif                97533    3
	/images/ksclogosmall.gif            10905    3
	/images/ksclogosmall_a.gif           7270    2
	/images/ksclogosmall_1.gif           3635    1
	/images/ksclogosmall_2.gif           3635    1
	/images/ksclogosmall_3.gif           3635    1
	/images/ksclogosmall_4.gif           3635    1
	/images/ksclogosmall_5.gif           3635    1
	/images/ksclogosmall_6.gif           3635    1
	/images/ksclogosmall_7.gif           3635    1




### About this solution

I chose two container types from Python's standard-library *collections* module, *Counter* and *defaultdict*, as ancillary data structures to keep track of the number of successful requests and the cumulative bytes transferred per object, respectively.
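As a minimal sketch of how these two containers accumulate values, the snippet below feeds a few hand-written `(object, size)` pairs through the same update pattern. The `parsed` list and the names `request_counts` and `byte_totals` are made up for illustration and are not part of the solution above.

```python
from collections import Counter, defaultdict

# Hypothetical (object, size) pairs, as if already parsed from log lines
parsed = [
    ("/images/opf-logo.gif", 32511),
    ("/images/ksclogosmall.gif", 3635),
    ("/images/opf-logo.gif", 32511),
]

request_counts = Counter()      # a missing key counts as 0
byte_totals = defaultdict(int)  # a missing key is created as int() == 0

for name, size in parsed:
    request_counts[name] += 1
    byte_totals[name] += size

print(request_counts.most_common(1))        # [('/images/opf-logo.gif', 2)]
print(byte_totals["/images/opf-logo.gif"])  # 65022
```

Neither container needs an explicit "is this key already present?" check, which keeps the per-line update to two statements.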

### Analysis of solution
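Both *Counter* and *defaultdict* are subclasses of *dict*, so their insertions and lookups take $ O(1) $ time on average. The script reads the file once, line by line, so parsing takes $ O(L) $ time for a file of $ L $ lines. Selecting the top $ n $ objects with *most_common(n)* adds roughly $ O(m \log n) $ in CPython for $ m $ distinct objects, and $ m $ is at most $ L $. Memory grows with the number of distinct objects, $ O(m) $, rather than with the file size, because lines are processed one at a time.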
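The top-n selection step can be sketched on its own with `heapq.nlargest`, which is what CPython's `Counter.most_common(n)` uses when `n` is given. The counts and the variable names `counts`, `k`, and `top_k` below are invented for illustration.

```python
import heapq
from collections import Counter

# Hypothetical request counts for four distinct objects
counts = Counter({"/a.gif": 3, "/b.gif": 1, "/c.gif": 2, "/d.gif": 5})

k = 2
# Keep only the k largest items by count instead of sorting everything
top_k = heapq.nlargest(k, counts.items(), key=lambda item: item[1])

print(top_k)  # [('/d.gif', 5), ('/a.gif', 3)]
assert top_k == counts.most_common(k)
```

With a small, fixed k such as 10, this selection stays cheap even when the log contains many distinct objects.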