# Parsing

Goals:

    - Plan a parsing strategy
    - Use basic regular expressions: match, search, sub
    - Benchmark a parser
    - Run nosetests
    - Write a simple parser
    
# Modules:




In [2]:
import re
import nose
# %timeit

# Parsing is hard...

<h2>
<i>"System Administrators spent $24.3\%$ of
 their work-life parsing files."</i>
 <br><br>
 
 Independent analysis by The GASP* Society ;) <br>
 </h2>
<h3>
 *(Grep Awk Sed Perl)
 </h3>

# ... use a strategy!

<table>
<tr><td>
<ol><li>Collect parsing samples
<li>Play in ipython and collect %history
<li>Write tests, then the parser
<li>Eventually benchmark
</ol>
</td><td>
<img src="/files/img/parsing-lifecycle.png" />
</td></tr>
</table>


# Parsing postfix logs

In [3]:
from __future__ import print_function
# Before writing the parser, collect samples of
#  the interesting lines. For now just
from course import mail_sent, mail_delivered

print("I'm going to parse the following line", mail_sent, sep="\n\n")

I'm going to parse the following line

May 31 08:00:00 test-fe1 postfix/smtp[16669]: 7CD8E730020: to=<jon@doe.it>, relay=examplemx2.doe.it[222.33.44.555]:25, delay=0.8, delays=0.17/0.01/0.43/0.19, dsn=2.0.0, status=sent(250 ok:  Message 2108406157 accepted)


In [2]:
# and %edit a simple
def test_sent():
    hour, host, to = parse_line(mail_sent)
    assert hour == '08:00:00'
    assert to == 'jon@doe.it'

In [3]:
# Play with mail_sent and start using basic strings in ipython
mail_sent.split()

[u'May',
 u'31',
 u'08:00:00',
 u'test-fe1',
 u'postfix/smtp[16669]:',
 u'7CD8E730020:',
 u'to=<jon@doe.it>,',
 u'relay=examplemx2.doe.it[222.33.44.555]:25,',
 u'delay=0.8,',
 u'delays=0.17/0.01/0.43/0.19,',
 u'dsn=2.0.0,',
 u'status=sent(250',
 u'ok:',
 u'Message',
 u'2108406157',
 u'accepted)']

In [4]:
# You can number fields with zip. 
# Remember that ipython puts the last returned value in `_`
# which is useful in interactive mode!
fields, counting = _, zip(range(20), _)
print(*counting, sep="\n")

(0, u'May')
(1, u'31')
(2, u'08:00:00')
(3, u'test-fe1')
(4, u'postfix/smtp[16669]:')
(5, u'7CD8E730020:')
(6, u'to=<jon@doe.it>,')
(7, u'relay=examplemx2.doe.it[222.33.44.555]:25,')
(8, u'delay=0.8,')
(9, u'delays=0.17/0.01/0.43/0.19,')
(10, u'dsn=2.0.0,')
(11, u'status=sent(250')
(12, u'ok:')
(13, u'Message')
(14, u'2108406157')
(15, u'accepted)')
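
The same numbering can be obtained with the built-in `enumerate`, which avoids guessing an upper bound such as `range(20)`. A minimal sketch, assuming a shortened copy of the sample line (the `fields` list here is rebuilt locally, not taken from `_`):

```python
# Shortened copy of the postfix sample line, for illustration only.
line = "May 31 08:00:00 test-fe1 postfix/smtp[16669]: 7CD8E730020: to=<jon@doe.it>,"
fields = line.split()

# enumerate yields (position, value) pairs for as many fields as there are.
numbered = list(enumerate(fields))
for position, value in numbered:
    print(position, value)
```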


In [5]:
# Now we can pick fields individually...
hour, host, dest = fields[2], fields[3], fields[6]

In [6]:
# ... or with 
from operator import itemgetter
which_returns_a_function = itemgetter(2, 3, 6)
assert (hour, host, dest) == which_returns_a_function(fields)

## Exercise I
    
    - %edit 03_parsing_test.py
    - complete the parse_line(line) function
    - %paste your solution's code into IPython and run the test functions manually



In [None]:
#
# Use this cell for Exercise I
#
%load ../03_parsing_test.py

In [None]:
#
# Run test
#
test_sent()

In [None]:
# Solution
%load course/parse_line.py
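
The course solution lives in `course/parse_line.py` and is not reproduced here. As a rough sketch of one possible approach, assuming the interesting fields always sit at positions 2, 3 and 6 as in the sample line, `split` and `itemgetter` can be combined like this (the `SAMPLE` constant and `pick_fields` helper are illustrative names, not part of the course module):

```python
from operator import itemgetter

# Sample line copied from the postfix log shown earlier.
SAMPLE = ("May 31 08:00:00 test-fe1 postfix/smtp[16669]: 7CD8E730020: "
          "to=<jon@doe.it>, relay=examplemx2.doe.it[222.33.44.555]:25, "
          "delay=0.8, delays=0.17/0.01/0.43/0.19, dsn=2.0.0, "
          "status=sent(250 ok:  Message 2108406157 accepted)")

# Hypothetical helper: picks hour, host and destination by position.
pick_fields = itemgetter(2, 3, 6)

def parse_line(line):
    """Return (hour, host, to) from a whitespace-separated postfix line."""
    hour, host, dest = pick_fields(line.split())
    # dest looks like 'to=<jon@doe.it>,': drop the trailing comma,
    # the 'to=' prefix and the angle brackets.
    dest = dest.rstrip(',')
    if dest.startswith('to='):
        dest = dest[len('to='):]
    dest = dest.strip('<>')
    return hour, host, dest

print(parse_line(SAMPLE))
```

Positional parsing is fragile: a log line with a different layout will silently return the wrong fields, which is one reason to write the tests first.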

# Python Regexp


In [7]:
# Python supports regular expressions via
import re

# We start showing a grep-reloaded function
def grep(expr, fpath):
    one = re.compile(expr) # ...has two lookup methods...
    assert ( one.match     # which searches from ^ the beginning
         and one.search )  # that searches anywhere

    with open(fpath) as fp:
        return [x for x in fp if one.search(x)]

assert not grep(r'^localhost', '/etc/hosts')
ret = grep(r'127\.0\.0\.1', '/etc/hosts')  # escape the dots: a bare . matches any character
assert ret, "ret should not be empty"
print(*ret)

127.0.0.1	localhost
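
The difference between the two lookup methods can be checked without touching `/etc/hosts`. A small sketch, assuming a hard-coded line that mimics the output above:

```python
import re

# A line shaped like the /etc/hosts entry printed above.
line = "127.0.0.1\tlocalhost"

localhost_re = re.compile(r'localhost')
address_re = re.compile(r'127\.0\.0\.1')

# match() only succeeds if the pattern matches at the start of the string.
at_start = localhost_re.match(line)
# search() scans the whole string for the first match.
anywhere = localhost_re.search(line)

print(at_start)                        # None: the line starts with the address
print(anywhere.group())                # localhost
print(address_re.match(line).group())  # 127.0.0.1
```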



### Achieve more complex splitting using regular expressions

In [4]:
from re import split # is a very nice function
import sys
from course import sh

# Let's gather some ping stats
if sys.platform.startswith('win'):
    cmd = "ping -n3 www.google.it"
else:
    cmd = "ping -c3 -w3 www.google.it"

# Split on both spaces and =
ping_output = [split("[ =]", x) for x in sh(cmd)]

print(*ping_output, sep="\n")

['PING', 'www.google.it', '(216.58.210.195):', '56', 'data', 'bytes']
['64', 'bytes', 'from', '216.58.210.195:', 'icmp_seq', '0', 'ttl', '50', 'time', '27.440', 'ms']
['64', 'bytes', 'from', '216.58.210.195:', 'icmp_seq', '1', 'ttl', '50', 'time', '25.720', 'ms']
['64', 'bytes', 'from', '216.58.210.195:', 'icmp_seq', '2', 'ttl', '50', 'time', '23.311', 'ms']
['---', 'www.google.it', 'ping', 'statistics', '---']
['3', 'packets', 'transmitted,', '3', 'packets', 'received,', '0%', 'packet', 'loss']
['round-trip', 'min/avg/max/stddev', '', '', '23.311/25.490/27.440/1.693', 'ms']
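
The empty strings in the last row come from the `" = "` sequence: each single separator character produces its own split. A short sketch of the effect, assuming a line copied from the output above, and of how a `+` quantifier collapses runs of separators:

```python
import re

# Summary line copied from the ping output above.
line = "round-trip min/avg/max/stddev = 23.311/25.490/27.440/1.693 ms"

# One split per separator character: " = " yields two empty strings.
single = re.split("[ =]", line)
# One split per run of separators: no empty strings in between.
grouped = re.split("[ =]+", line)

print(single)
print(grouped)
```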


In [5]:
# Splitting with re.findall

from re import findall  # can be misused too;

# eg for adding the ":" to a
mac = "00""24""e8""b4""33""20"

# ...using this
re_hex = "[0-9a-fA-F]{2}"
mac_address = ':'.join(findall(re_hex, mac))
print("The mac address is ", mac_address)

# Note: findall keeps only the pairs in the 0-F range and silently skips
# anything else, so this filters the input rather than strictly validating it

The mac address is  00:24:e8:b4:33:20
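
Because `findall` skips whatever does not match, malformed input can still produce a plausible-looking address. A minimal sketch of a stricter variant, assuming a MAC address is exactly 12 hex digits (`format_mac` and `MAC_RE` are illustrative names, not part of the course code):

```python
import re

re_hex = "[0-9a-fA-F]{2}"
# Exactly six hex pairs, anchored at the very end of the string.
MAC_RE = re.compile(r'(?:[0-9a-fA-F]{2}){6}\Z')

def format_mac(raw):
    """Return raw as a colon-separated MAC, or raise ValueError."""
    if not MAC_RE.match(raw):
        raise ValueError("not a 12-digit hex string: %r" % (raw,))
    return ':'.join(re.findall(re_hex, raw))

# findall alone quietly drops the invalid 'zz' pair...
lenient = ':'.join(re.findall(re_hex, "0024e8zzb43320"))
print(lenient)

# ...while the anchored pattern accepts only well-formed input.
print(format_mac("0024e8b43320"))
```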


# Benchmarking in iPython - I

  - Parsing big files needs benchmarks. IPython's %timeit magic is a good starting point.
  
We are going to measure the execution time of various tasks, using different strategies like regexp, join and split. 

In [6]:
# Run the following cell many times. 
# Do you always get the same results?
test_all_regexps = ("..", "[a-fA-F0-9]{2}")
for re_s in test_all_regexps:
    %timeit ':'.join(findall(re_s, mac))

The slowest run took 26.84 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.38 µs per loop
The slowest run took 68.04 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.51 µs per loop


In [7]:
# We can even compare compiled vs inline regexps
import re
for re_s in test_all_regexps:
    re_c = re.compile(re_s)
    %timeit ':'.join(re_c.findall(mac))

The slowest run took 5.29 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.71 µs per loop
1000000 loops, best of 3: 1.81 µs per loop


In [8]:
# Or find other methods:

# complex...
from re import sub as sed
%timeit sed(r'(..)', r'\1:', mac)

The slowest run took 23.20 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 10.4 µs per loop


In [9]:
# ...or simple
%timeit ':'.join([mac[i:i+2] for i in range(0,12,2)])
# Outside IPython, check the timeit module

# Exercise: which is the fastest method? Why?

The slowest run took 13.83 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.24 µs per loop
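
Outside IPython, a comparable measurement can be made with the standard `timeit` module. A rough sketch, assuming the same `mac` string and two of the strategies above. The `candidates` dictionary and the loop count are arbitrary choices, and the timings depend on the machine:

```python
import timeit

mac = "0024e8b43320"
loops = 10000

# Statements are passed as strings and compiled by timeit.
candidates = {
    "findall": "':'.join(re.findall('..', mac))",
    "slicing": "':'.join([mac[i:i+2] for i in range(0, 12, 2)])",
}

results = {}
for name, stmt in candidates.items():
    # repeat() returns one total time per run; taking the minimum
    # reduces the noise from other processes.
    runs = timeit.repeat(stmt, setup="import re; mac = %r" % mac,
                         repeat=3, number=loops)
    results[name] = min(runs) / loops
    print("%s: %.2f us per loop" % (name, results[name] * 1e6))
```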


## Example: generating vsan configuration snippets

In [18]:
# No need to type this VSAN configuration snippet by hand:
#  generate it from the Linux FC information in the /sys filesystem
from glob import glob
fc_id_path = "/sys/class/fc_host/host*/port_name"
for x in glob(fc_id_path):
    # ...we boldly skip an explicit close()
    pwwn = open(x).read()  # 0x500143802427e66c
    pwwn = pwwn[2:]
    # ...and even use the slower but readable
    pwwn = re.findall(r'..', pwwn)
    print("member pwwn ", ':'.join(pwwn))
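
The formatting step can be tried on machines without Fibre Channel hardware by isolating it in a function. A minimal sketch, assuming the `port_name` file contains a `0x`-prefixed hex string followed by a newline, as in the comment above (`format_pwwn` is an illustrative name):

```python
import re

def format_pwwn(raw):
    """Turn '0x500143802427e66c\\n' into a colon-separated pwwn."""
    hex_digits = raw.strip()           # drop the trailing newline
    if hex_digits.lower().startswith('0x'):
        hex_digits = hex_digits[2:]    # drop the 0x prefix
    return ':'.join(re.findall(r'..', hex_digits))

# Sample value taken from the comment in the cell above.
print("member pwwn", format_pwwn("0x500143802427e66c\n"))
```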

# Parsing: Exercise II

Now add another test, for the delivered messages:

    - %edit 03_parsing_test.py
    - %paste into IPython and run test_delivered()
    - fix parse_line to work with both tests and save


In [None]:
#
# Use this cell for Exercise II
#
test_delivered()