# Profiling datetime parsing

From separate profiling (not included), I found that the most time-consuming part was parsing the timestamps. I compiled a short list of the methods I tried. The output here is not necessary for running this project, but may be useful to someone who updates the project later. We will use the standard `timestamp_string` copied below, and store the timings in the list `results`.

In [209]:
timestamp_string = "01/Jul/1995:00:00:09 -0400"
results = []

### Method 1: using dateutil.parser

This is a flexible parser, but the timestamp string passed in is not recognized as-is. Enabling fuzzy parsing (`fuzzy=True`) makes the parser even more forgiving of the format. The downside is that it has to work out the format for every string. This is a lot of overhead, since every timestamp in the logfile uses the same format.

The code for `timestampParser` becomes
```{python}
def timestampParser(ts):
    return parse(ts, fuzzy=True)
```

In [210]:
# This is an additional download. The parsing is quite flexible, so it can handle a wide range of
# formats automatically
from dateutil.parser import parse

print parse(timestamp_string, fuzzy=True)

1995-07-01 00:00:09-04:00


In [211]:
# For some reason, assigning directly to results['dateutil.parser'] fails
temp = %timeit -o parse(timestamp_string, fuzzy=True)
results.append(['dateutil.parser raw',temp])

1000 loops, best of 3: 299 µs per loop


### Method 2: using dateutil.parser with manual preprocessing

The dateutil.parser was confused by two things: the timezone and the date:time format. By doing a little preprocessing on the string, we can eliminate the need for a fuzzy search. The date takes the first 11 characters, and the ` -0400` offset takes the last six. Replacing the colon at index 11 with a space separates the date from the time.

Our code is
```{python}
def timestampParser(ts):
    return parse(ts[:11] + " " + ts[12:])
```
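
To make the slicing concrete, here is a small check of which characters each slice covers for the sample timestamp. It uses only string operations, so it runs without `dateutil`. The names `date_part`, `separator`, `rest`, and `cleaned` are illustrative and do not come from the project.

```python
timestamp_string = "01/Jul/1995:00:00:09 -0400"

date_part = timestamp_string[:11]   # day/month/year, 11 characters
separator = timestamp_string[11]    # the colon joining date and time
rest = timestamp_string[12:]        # time followed by the UTC offset

# Rejoin with a space so the date and the time are clearly separated
cleaned = date_part + " " + rest
print(cleaned)
```

The string in `cleaned` is what gets handed to `parse` in place of the original timestamp.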

In [212]:
print parse(timestamp_string[:11] + " " + timestamp_string[12:])

1995-07-01 00:00:09-04:00


In [213]:
temp = %timeit -o parse(timestamp_string[:11] + " " + timestamp_string[12:])
results.append(['dateutil.parser preprocessed',temp])

1000 loops, best of 3: 283 µs per loop


### Method 3: using strptime

This method requires a format string to decipher the timestamp string. This makes it more fragile, but it does not have to determine the format on every pass. Unlike `dateutil`, which is a common but not built-in package, the `datetime` module has been standard on all Python distributions I have encountered. I also strip the timezone from the timestamp before parsing.

In [214]:
import datetime
print datetime.datetime.strptime(timestamp_string[:-6], "%d/%b/%Y:%H:%M:%S")

1995-07-01 00:00:09
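
Slicing off the last six characters discards the UTC offset, so the result is a naive `datetime`. As a hedged aside: on Python 3, `strptime` accepts the `%z` directive, which keeps the offset, while Python 2's `strptime` (the version used for these timings) does not, as far as I know. This variant was not part of the timings in this notebook, and `naive` and `aware` are illustrative names.

```python
import datetime

timestamp_string = "01/Jul/1995:00:00:09 -0400"

# Naive result: the " -0400" offset is sliced off before parsing
naive = datetime.datetime.strptime(timestamp_string[:-6], "%d/%b/%Y:%H:%M:%S")

# Python 3 only: %z parses the offset and attaches it as tzinfo
aware = datetime.datetime.strptime(timestamp_string, "%d/%b/%Y:%H:%M:%S %z")

print(naive.tzinfo)
print(aware.utcoffset())
```

Whether the offset matters depends on the logfile; if every line carries the same offset, the naive value is enough for comparing timestamps with each other.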


In [215]:
temp = %timeit -o datetime.datetime.strptime(timestamp_string[:-6], "%d/%b/%Y:%H:%M:%S")
results.append(['strptime', temp])

10000 loops, best of 3: 31.7 µs per loop
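
The `%timeit` magic only works inside IPython. For readers running a plain script, a rough equivalent can be sketched with the standard-library `timeit` module. The loop and repeat counts below are arbitrary choices, `parse_strptime` and `best_us` are hypothetical names, and absolute numbers will differ from machine to machine.

```python
import datetime
import timeit

timestamp_string = "01/Jul/1995:00:00:09 -0400"

def parse_strptime(ts):
    # Same call as in the cell above: drop the offset, then parse
    return datetime.datetime.strptime(ts[:-6], "%d/%b/%Y:%H:%M:%S")

loops = 1000
# repeat() returns one total time in seconds per run of `loops` calls
totals = timeit.repeat(lambda: parse_strptime(timestamp_string),
                       repeat=3, number=loops)

# Best per-call time in microseconds, mirroring "best of 3"
best_us = min(totals) / loops * 1e6
print("strptime: %.1f microseconds per loop" % best_us)
```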


### Method 4: manually constructing the datetime object

This is the most fragile method. I predefine a dictionary of months, split the date manually, and do the conversions by hand. When looking at the actual code of `timestampParser`, note that this fragile method is wrapped in a `try/except` block and falls back to the more robust `strptime` if an exception is raised.

In [216]:
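# Map three-letter month abbreviations to month numbers
months = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
          'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

def manualParse(ts):
    # Drop the trailing " -0400", then split "DD/Mon/YYYY:HH:MM:SS" on ':'
    date, hh, mm, ss = ts[:-6].split(':')
    D, M, Y = date.split('/')
    D, Y, hh, mm, ss = map(int, [D, Y, hh, mm, ss])
    M = months[M]
    return datetime.datetime(Y, M, D, hh, mm, ss)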

print manualParse(timestamp_string)

1995-07-01 00:00:09


In [217]:
temp = %timeit -o manualParse(timestamp_string)
results.append(['manual', temp])

100000 loops, best of 3: 8.65 µs per loop
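
The fallback described for `timestampParser` might look roughly like the sketch below. This is not the project's actual code: the helper names `manual_parse` and `strptime_parse`, the constants `MONTHS` and `FORMAT`, and the choice of which exceptions to catch are assumptions made for illustration.

```python
import datetime

MONTHS = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
          'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
FORMAT = "%d/%b/%Y:%H:%M:%S"

def manual_parse(ts):
    # Fast path: split "DD/Mon/YYYY:HH:MM:SS" by hand, ignoring the offset
    date, hh, mm, ss = ts[:-6].split(':')
    day, month, year = date.split('/')
    return datetime.datetime(int(year), MONTHS[month], int(day),
                             int(hh), int(mm), int(ss))

def strptime_parse(ts):
    # Slower path: let strptime check the string against the format
    return datetime.datetime.strptime(ts[:-6], FORMAT)

def timestampParser(ts):
    try:
        return manual_parse(ts)
    except (ValueError, KeyError):
        # Wrong field count, non-numeric field, or unknown month key
        return strptime_parse(ts)

print(timestampParser("01/Jul/1995:00:00:09 -0400"))
```

With this arrangement, well-formed lines pay only the cost of the manual path. A line such as `01/jul/1995:00:00:09 -0400` misses the month dictionary but is still handled by `strptime`, which matches month names case-insensitively.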


## Summary

In [218]:
import pandas as pd

summary_table = pd.DataFrame(results, columns = ["Method", "TimeIt"])
summary_table['time (microsec)'] = summary_table['TimeIt'].apply(lambda x: round(x.best*10**6))
summary_table['num trials'] = summary_table['TimeIt'].apply(lambda x: x.loops)

to_show = summary_table[['Method','time (microsec)','num trials']].sort_values(by='time (microsec)')
to_show

|   | Method | time (microsec) | num trials |
|---|--------|-----------------|------------|
| 3 | manual | 9.0 | 100000 |
| 2 | strptime | 32.0 | 10000 |
| 1 | dateutil.parser preprocessed | 283.0 | 1000 |
| 0 | dateutil.parser raw | 299.0 | 1000 |


In [226]:
# pandas keeps changing the css and styling options, so this is here
# for a prettier version, but the previous cell guarantees we will get
# a result
from IPython.display import display, HTML
s= """
<style>
.dataframe table{
    display:block;
    margin-left: auto;
    margin-right: auto;
}
.dataframe thead{
    background-color: #efefef;
}
.dataframe tr:nth-child(even) {
   background-color: #9df;
}

</style>
"""

display(HTML(s))
display(to_show)

|   | Method | time (microsec) | num trials |
|---|--------|-----------------|------------|
| 3 | manual | 9.0 | 100000 |
| 2 | strptime | 32.0 | 10000 |
| 1 | dateutil.parser preprocessed | 283.0 | 1000 |
| 0 | dateutil.parser raw | 299.0 | 1000 |
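
To put the results in relative terms, the short calculation below divides each reported per-loop time by the fastest one. The numbers are copied from the `%timeit` outputs earlier in this notebook; the variable names are illustrative.

```python
# Best per-loop times in microseconds, as reported by %timeit
timings_us = {
    "dateutil.parser raw": 299.0,
    "dateutil.parser preprocessed": 283.0,
    "strptime": 31.7,
    "manual": 8.65,
}

fastest = min(timings_us.values())

# Slowdown of each method relative to the fastest one
slowdown = {name: t / fastest for name, t in timings_us.items()}

for name in sorted(slowdown, key=slowdown.get):
    print("%-30s %5.1fx" % (name, slowdown[name]))
```

On these numbers, manual parsing is roughly 3.7 times faster than `strptime` and about 35 times faster than the raw `dateutil` parser.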
