# Week 4 HW part 2:  parsing a log line

## Tuples

Let us familiarize ourselves with another standard Python datatype that you will encounter in this homework:  the **tuple**.  A tuple looks and behaves very similarly to a list, except that it is contained within parentheses `()` instead of brackets `[]`.  Elements are indexed in exactly the same way:

In [1]:
mytuple = ("zero", "one", "two")
mytuple[0]

'zero'

The only difference is that tuples have a FIXED LENGTH (once created, we cannot add or remove elements), and they are IMMUTABLE (once created, the elements cannot be changed - ignoring some technicalities).
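
To make this concrete, here is a minimal sketch (the variable names are only illustrative). Assigning to an element raises a `TypeError`, and the "technicality" is that a mutable object stored inside a tuple can still be modified in place.

```python
mytuple = ("zero", "one", "two")

# Attempting to replace an element raises TypeError
try:
    mytuple[0] = "nil"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Tuples have no append/remove methods either
has_append = hasattr(mytuple, "append")

# The technicality: the tuple itself cannot change, but a mutable
# object stored inside it (here a list) can still be modified in place
nested = ("fixed", [1, 2])
nested[1].append(3)

print(assignment_failed)  # True
print(has_append)         # False
print(nested)             # ('fixed', [1, 2, 3])
```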

Tuples are especially well-suited for situations where you want to bundle a small (e.g. 3) group of values together.  In these situations they are somewhat more memory-efficient than lists, because a tuple never needs spare room to grow.
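
One rough way to see the size difference is `sys.getsizeof`. This sketch assumes CPython; the exact byte counts vary by version and platform, so none are shown here.

```python
import sys

as_tuple = ("Sue", 32, "Windows")
as_list = ["Sue", 32, "Windows"]

# getsizeof reports the container's own footprint, not the objects it refers to
tuple_size = sys.getsizeof(as_tuple)
list_size = sys.getsizeof(as_list)

print(tuple_size, list_size)
print(tuple_size < list_size)
```

For a handful of small records the difference is negligible; it starts to matter when you hold millions of them.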

NOTE: If you want to create a tuple with only one element then you need an extra comma "," after the element.  Can you figure out why?

In [2]:
not_a_tuple = (9)
type(not_a_tuple)

int

In [3]:
is_a_tuple = (9,)
type(is_a_tuple)

tuple
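
A hint for the question above, as a small sketch: parentheses on their own only group an expression, and it is the comma that builds the tuple.

```python
grouped = (9)      # parentheses only group, as in (2 + 3) * 4
bare_comma = 9,    # the comma creates the tuple; the parentheses are optional here
empty = ()         # the empty tuple is the one case written with parentheses alone

print(type(grouped).__name__)     # int
print(type(bare_comma).__name__)  # tuple
print(len(empty))                 # 0
```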

Tuples are very commonly used when you want to return multiple values from a function:

In [4]:
# toy example of returning a tuple
def get_record(user_id):
    name = "Sue"
    age = 32
    platform = "Windows"
    return (name, age, platform)

# you COULD call it like this
record = get_record(24235432)
n = record[0]
a = record[1]
p = record[2]

# BUT Python supports a slicker syntax called "tuple unpacking"
n, a, p = get_record(24235432)

print(n)
print(a)
print(p)

Sue
32
Windows
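
Tuple unpacking also works in a `for` loop, which is handy once you have a list of records. A short sketch with made-up records (the second entry is hypothetical, in the same shape as the toy example above):

```python
records = [
    ("Sue", 32, "Windows"),
    ("Raj", 41, "Linux"),   # hypothetical record for illustration
]

# Each tuple is unpacked into three names on every iteration
summaries = []
for name, age, platform in records:
    summaries.append(f"{name} ({age}) uses {platform}")

print(summaries[0])  # Sue (32) uses Windows
print(summaries[1])  # Raj (41) uses Linux
```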


## Parsing log entries using regular expressions

*Parsing* is the act of extracting information from strings.  In this homework we will figure out how to use *regular expressions* to parse each line in the log file.

Recall that in week 2 we used regular expressions to clean up our tweets (see Python video tutorials).  There we only did simple substitutions (finding patterns and replacing with `' '`).

This week we're going to step up the difficulty.  Using a concept in regex called "groups", we will extract fields from each log line.  To get started, have a look at this tutorial:

https://www.machinelearningplus.com/python/python-regex-tutorial-examples/ 

The official Regular Expression HOWTO is longer and works better as a reference:

https://docs.python.org/3/howto/regex.html
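
As a quick illustration of groups before tackling the log line, here is a sketch on a made-up string (the pattern and the sample text are illustrative, not part of the homework data). Each pair of parentheses in the pattern captures one piece of the match.

```python
import re

sample = "order 1234 shipped to Boston"

# Each (...) is a capturing group; \d+ matches digits, \w+ matches word characters
pattern = re.compile(r"order (\d+) shipped to (\w+)")
match = pattern.search(sample)

order_id = match.group(1)   # first group
city = match.group(2)       # second group
both = match.groups()       # all groups at once, as a tuple

print(order_id)  # 1234
print(city)      # Boston
print(both)      # ('1234', 'Boston')
```

Note that `groups()` returns a tuple, so it combines naturally with tuple unpacking.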

In [5]:
import re

# Here is an example entry from the log

logentry = 'maynard.isi.uconn.edu - - [28/Jul/1995:13:32:22 -0400] "GET /images/shuttle-patch-logo.gif HTTP/1.0" 200 891'

In [48]:
# let's figure out how to extract the following fields from it (these fields were described
# in the homework description already):

# requesting_host
# user_identity
# user_local_identity
# timestamp
# requested_resource
# return_code
# bytes_transferred

# PLEASE write a regular expression that extracts the fields from this line.  Tests are in
# the next cell.

# YOUR CODE HERE
regex = re.compile(r'\s+')  # raw string, so \s reaches the regex engine unchanged
parts = regex.split(logentry)
requesting_host = parts[0]
user_identity = parts[1]
user_local_identity = parts[2]
# the timestamp spans two whitespace-separated tokens; rejoin them and strip the brackets
timestamp = re.sub(r'[\[\]]', '', parts[3] + ' ' + parts[4])
# the request is the first double-quoted span; [1:-1] drops the quotes
requested_resource = re.search(r'".*?"', logentry).group()[1:-1]
return_code = parts[8]
bytes_transferred = parts[9]

In [49]:
assert requesting_host == 'maynard.isi.uconn.edu'
assert user_identity == '-'
assert user_local_identity == '-'
assert timestamp == '28/Jul/1995:13:32:22 -0400'
assert requested_resource == 'GET /images/shuttle-patch-logo.gif HTTP/1.0'
assert return_code == '200'
assert bytes_transferred == '891'
# Note for later:  should probably test what happens when bytes_transferred is a '-'
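
An alternative worth comparing is a single pattern with named groups, which pulls out every field in one match. This is only a sketch: the name `parse_log_line` is hypothetical, and the pattern assumes every line follows exactly the layout of the example entry. It also accepts `-` in the bytes field, the case flagged in the note above.

```python
import re

# One named group per field, in the order they appear in the line
LOG_PATTERN = re.compile(
    r'^(?P<requesting_host>\S+) '
    r'(?P<user_identity>\S+) '
    r'(?P<user_local_identity>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<requested_resource>[^"]*)" '
    r'(?P<return_code>\d{3}) '
    r'(?P<bytes_transferred>\d+|-)$'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

logentry = 'maynard.isi.uconn.edu - - [28/Jul/1995:13:32:22 -0400] "GET /images/shuttle-patch-logo.gif HTTP/1.0" 200 891'
fields = parse_log_line(logentry)
print(fields["timestamp"])          # 28/Jul/1995:13:32:22 -0400
print(fields["bytes_transferred"])  # 891

# Hypothetical variant of the same entry with '-' in the bytes field
dash_entry = logentry[:-3] + "-"
print(parse_log_line(dash_entry)["bytes_transferred"])  # -
```

Returning `None` for lines that do not match lets the caller decide whether to skip or report malformed entries.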

## Parsing Timestamps

The timestamp itself has some further structure that we want to extract.  Let's try using another regex to split up the timestamp string.  It is formatted in the following way: `Day/Month/Year:Hour:Minute:Second Timezone`.

Write a regular expression that parses the timestamp string (from the example above):

In [53]:
# YOUR CODE HERE
# split on the separators '/', ':' and ' ' only, so the sign of the UTC offset is preserved
time_parts = re.split(r'[/: ]', timestamp)
day = time_parts[0]
month = time_parts[1]
year = time_parts[2]
hour = time_parts[3]
minute = time_parts[4]
second = time_parts[5]
timezone = time_parts[6]

In [54]:
assert day == '28'
assert month == 'Jul'
assert year == '1995'
assert hour == '13'
assert minute == '32'
assert second == '22'
assert timezone == '-0400'
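
The same fields can also be captured with groups, one per component. In this sketch the pattern name and `tz_offset` are illustrative; the sign of the offset is part of the last group, so `+0800` and `-0400` both survive intact.

```python
import re

timestamp = '28/Jul/1995:13:32:22 -0400'

# Day/Month/Year:Hour:Minute:Second Timezone, one capturing group each
TIMESTAMP_PATTERN = re.compile(
    r'(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2}) ([+-]\d{4})'
)

match = TIMESTAMP_PATTERN.fullmatch(timestamp)
day, month, year, hour, minute, second, tz_offset = match.groups()
print(day, month, year, hour, minute, second, tz_offset)
# 28 Jul 1995 13 32 22 -0400

# A positive offset keeps its sign as well
east = TIMESTAMP_PATTERN.fullmatch('28/Jul/1995:13:32:22 +0800').group(7)
print(east)  # +0800
```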

It turns out that parsing this timestamp using a regex is just the beginning of our difficulties with timestamps!

In order to get something useful (i.e. dates and times that you can actually analyze) you would have to translate the month string from `'Jul'` to the number `7`.  But what if somebody changes the log format to write out `'July'` instead of `'Jul'`?  Are you going to handle that case as well?  What if somebody changes the log language to French?
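
To see how quickly this gets fragile, here is a hand-rolled sketch. The helper `month_number` is hypothetical and assumes English month names only; it copes with `'Jul'` and `'July'` by looking at the first three letters, and gives up on anything else.

```python
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

def month_number(name):
    """Map an English month name or abbreviation to 1..12 (hypothetical helper)."""
    key = name.strip().lower()[:3]
    if key not in MONTHS:
        raise ValueError(f"unrecognized month: {name!r}")
    return MONTHS[key]

print(month_number("Jul"))   # 7
print(month_number("July"))  # 7

# French 'juillet' is not recognized by this English-only table
try:
    month_number("juillet")
    french_ok = True
except ValueError:
    french_ok = False
print(french_ok)  # False
```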

What about timezones?  Your head should start spinning now.

Even after you solve all parsing issues, how do you answer questions like:  how many days are there between `December 7, 1941` and `January 1, 2017`?  Did you remember leap days?  Did anybody ever tell you there are even [leap seconds](https://github.com/sstirlin/pyleapsec), and different systems measure them differently (or not at all)?  Yes, I have suffered.

Fortunately, Python has a `datetime` module that is meant to simplify life.  Instead of using regex to parse timestamps, `datetime` has a smart function `strptime` that can parse timestamps.  You will develop a love/hate relationship with it.  Here is a tutorial to get you started: https://www.guru99.com/date-time-and-datetime-classes-in-python.html
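
For instance, the day-count question above reduces to a subtraction. This is a minimal sketch using `date` objects from the same module; the arithmetic accounts for leap days, but note that `datetime` does not model leap seconds.

```python
from datetime import date

start = date(1941, 12, 7)
end = date(2017, 1, 1)

delta = end - start   # subtracting two dates gives a timedelta
print(delta.days)     # 27419
```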

One thing to keep in mind:  most systems measure time using UTC (Coordinated Universal Time).  Roughly, this is just Greenwich Mean Time (with some subtleties).  The timezone will come formatted like `-0400` (4 hours behind UTC), or `+0800` (8 hours ahead of UTC).

Your task:  create a `datetime` object that holds the date and time that you extracted above:

In [93]:
from datetime import datetime, timezone, timedelta

dt = datetime.strptime(timestamp,'%d/%b/%Y:%H:%M:%S %z')

In [94]:
assert dt == datetime(1995, 7, 28, 13, 32, 22, tzinfo=timezone(-timedelta(hours=4)))
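
Once the offset is attached, converting to UTC is one call. The sketch below repeats the parse so that it runs on its own, and it assumes an English locale, since `%b` matches locale-dependent month abbreviations.

```python
from datetime import datetime, timezone, timedelta

timestamp = '28/Jul/1995:13:32:22 -0400'
dt = datetime.strptime(timestamp, '%d/%b/%Y:%H:%M:%S %z')

# The parsed offset, in hours relative to UTC
offset_hours = dt.utcoffset().total_seconds() / 3600
print(offset_hours)        # -4.0

# Same instant, expressed in UTC
dt_utc = dt.astimezone(timezone.utc)
print(dt_utc.isoformat())  # 1995-07-28T17:32:22+00:00

# Aware datetimes compare by instant, not by wall-clock fields
print(dt == dt_utc)        # True
```

Normalizing every timestamp to UTC before comparing or grouping avoids surprises when log lines come from machines in different timezones.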