# PyBay 2018 -- Parsing Clinic

Copyright(c) 2018 Raymond Hettinger
All Rights Reserved

Description
------------

Know some Python basics and want to learn more about how to parse data? Please register separately for this workshop with Raymond Hettinger, seasoned Python trainer and core Python developer for 17+ years.

Abstract
----------

* Learn to harvest data in many forms
* Load a variety of CSV dialects. Learn to use the Sniffer and how to handle common problems
* Load and generate JSON data. Learn to pretty print or minify JSON.
* Learn to handle binary data inside JSON using Base64 or Latin-1 encodings.
* Use ElementTree and lxml to parse XML files. Learn to handle namespaces and how to use XPath selectors.
* Parse HTML using BeautifulSoup.
* Hand edit and read YAML files. Learn the advantages and disadvantages of this format.
* Parse binary data (such as IP packet headers) using the struct module.
* Parse column oriented text using both splitting and slicing strategies.
* Discuss security risks for pickles, XML, YAML, and even plain text.
* Time permitting, introduce Pandas for data loading and clean-up
* Time permitting, discuss strategies for building regexes to parse complex data layouts
* Time permitting, show regex strategies for parsing natural text and introduce NLTK

Who should attend?
------------------

This is a workshop for beginner to intermediate level Python programmers. You should know some Python basics already.


Materials link:  http://bit.ly/pybay2018-core

## Column-oriented text

Column-oriented text lines up in straight vertical columns because human readers benefit from that alignment.

So the alignment itself tells us this data was formatted for humans, not for programs.

**Key Learning Point:** Parsing column-oriented text intended for humans is not fast, not easy, not fun, and not reliable.

* Some fields may be missing.  This confuses str.split() followed by tuple unpacking
* Some fields may contain spaces.  This also confuses str.split() followed by tuple unpacking
* ANSI.sys color codes
* Page breaks
* Line wrapping
* Page numbers
* Subtotals
* Column alignment shifts between runs
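
The first two problems in the list are easy to reproduce. The sketch below uses three invented rows (they are not taken from the workshop data file) and a hypothetical `naive_parse()` helper to show how whitespace splitting plus four-name unpacking fails as soon as a field is empty or contains a space.

```python
# Hypothetical sample rows, invented for illustration only.
rows = [
    'Loopback0                 51.51.51.51     Up       Up',
    'GigabitEthernet 0/7/0/15  unassigned      Up       Up',    # name contains a space
    'TenGigE0/3/1/2                            Down     Down',  # address is missing
]

def naive_parse(line):
    """Split on runs of whitespace and unpack into exactly four fields."""
    interface, ipaddr, status, protocol = line.split()
    return interface, ipaddr, status, protocol

results = []
for row in rows:
    try:
        results.append(naive_parse(row))
    except ValueError as e:
        # Raised when split() yields more or fewer than four pieces
        results.append(type(e).__name__)

for result in results:
    print(result)
```

Only the first row parses. The second row splits into five pieces and the third into three, so both raise `ValueError` during unpacking.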

Strategies:

1. Figure out how a human would mentally parse the data correctly, and teach the computer to do that.

2. Anti-Nike rule:  Just don't do it.  This data was never meant to be parsed by a computer. Instead, get better data:  XML, YAML, CSV, JSON, PyProtocols, binary structs, ...

In [46]:
with open('data/ipv4_int_bri.txt') as f:
    it = iter(f)
    header = next(it)
    interface_start = header.index('Interface')
    ipaddr_start = header.index('IP-Address')
    status_start = header.index('Status')
    protocol_start = header.index('Protocol')
    for line in it:
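        # Naive approach, kept for contrast.  It breaks on missing fields
        # and on names that contain spaces:
        # line = line.rstrip()
        # interface, ipaddr, status, protocol = line.split()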
        interface = line[interface_start : ipaddr_start].rstrip()
        ipaddr = line[ipaddr_start : status_start].rstrip()
        status = line[status_start : protocol_start].rstrip()
        protocol = line[protocol_start :].rstrip()
        if status.lower() == 'up':
            print('%-15s %s' % (ipaddr, interface))

51.51.51.51     Loopback0
1.20.30.40      MgmtEth0/RSP0/CPU0/0
unassigned      MgmtEth0/RSP1/CPU0/0
unassigned      GigabitEthernet0/3/0/1
unassigned      GigabitEthernet0/3/0/3
unassigned      TenGigE0/3/1/0
unassigned      TenGigE0/3/1/3
unassigned      TenGigE0/4/0/0
111.1.1.1       TenGigE0/5/1/1
unassigned      GigabitEthernet 0/7/0/15
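
The same header-driven slicing can be wrapped in a reusable generator. This is only a sketch: `parse_columns()` is a hypothetical helper, the sample rows are invented and built in memory rather than read from the data file, and it assumes that no header name contains a space and that every row lines up with the header.

```python
def parse_columns(lines):
    """Yield one dict per row, slicing each line at the header's column starts."""
    it = iter(lines)
    header = next(it)
    names = header.split()                       # assumes header names have no spaces
    starts = [header.index(name) for name in names]
    stops = starts[1:] + [None]                  # the last column runs to end of line
    for line in it:
        yield {name: line[start:stop].strip()
               for name, start, stop in zip(names, starts, stops)}

# Invented sample data, laid out with fixed widths so the columns align.
fmt = '%-26s%-16s%-9s%s'
sample = [fmt % row for row in [
    ('Interface', 'IP-Address', 'Status', 'Protocol'),
    ('Loopback0', '51.51.51.51', 'Up', 'Up'),
    ('GigabitEthernet 0/7/0/15', 'unassigned', 'Up', 'Up'),
    ('TenGigE0/3/1/2', '', 'Down', 'Down'),
]]

records = list(parse_columns(sample))
for rec in records:
    if rec['Status'].lower() == 'up':
        print('%-15s %s' % (rec['IP-Address'], rec['Interface']))
```

Because the slices come from the header positions, the interface name with an embedded space and the row with an empty address both parse without special cases.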


In [18]:
# How strip works  (in many languages it is called trim(); Perl's chomp() is similar but removes only a trailing newline)
s = '   Hello  \t World  \t  \n'
t = s.strip()
t

'Hello  \t World'
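
For comparison, here is a short sketch of the one-sided variants and of the optional argument that names which characters to remove. The quoted name is an invented example.

```python
s = '   Hello  \t World  \t  \n'

left = s.lstrip()             # remove leading whitespace only
right = s.rstrip()            # remove trailing whitespace only
only_newline = s.rstrip('\n') # remove trailing newlines, keep other whitespace
unquoted = '"Raymond"'.strip('"')   # remove the given characters from both ends

print(repr(left))
print(repr(right))
print(repr(only_newline))
print(repr(unquoted))
```

None of these methods touch the interior of the string, which is why the tab between the two words survives every call.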

In [12]:
# How splitlines() works
s = '''The Tale of Two Cities
It was the best of times
it was the worst of times
the summer of hope
and winter of despair
'''

s.splitlines(True)

['The Tale of Two Cities\n',
 'It was the best of times\n',
 'it was the worst of times\n',
 'the summer of hope\n',
 'and winter of despair\n']
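
A short sketch contrasting `splitlines()` with and without `keepends`, and with a plain `split('\n')`. The strings are small invented samples.

```python
s = 'It was the best of times\nit was the worst of times\n'

plain = s.splitlines()               # line endings dropped
kept = s.splitlines(keepends=True)   # line endings preserved
naive = s.split('\n')                # leaves a trailing empty string

# splitlines() also recognizes Windows and old Mac line endings
mixed = 'unix\nwindows\r\nold mac\r'.splitlines()

print(plain)
print(kept)
print(naive)
print(mixed)
```

The trailing empty string from `split('\n')` is a common source of off-by-one surprises, which is one reason to prefer `splitlines()` for text that ends with a newline.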

In [22]:
x = 10
x = x + 1
x

11

In [41]:
# Technique of passing around partially consumed iterators
names = ['manny', 'mo', 'jack']
it = iter(names)
print('First person:', next(it))
for name in it:
    print(name.upper())

First person: manny
MO
JACK
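
"Passing around" becomes clearer when the iterator crosses a function boundary. In this sketch, `read_header()` and `read_body()` are hypothetical helpers and the lines are invented. Each function consumes only its own part of the shared iterator.

```python
def read_header(it):
    """Consume exactly one line and return the column names."""
    return next(it).split()

def read_body(it):
    """Consume whatever lines remain and return them as lists of fields."""
    return [line.split() for line in it]

lines = ['name qty', 'apple 3', 'pear 5']
it = iter(lines)

header = read_header(it)   # advances the iterator past the first line
body = read_body(it)       # picks up where read_header() stopped

print(header)
print(body)
```

The iterator remembers its position, so the second function never sees the header line and no index bookkeeping is needed.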


In [43]:
# How to locate text
s = 'The tale of two cities'
s.index('of')

9

In [44]:
s[9:11]

'of'
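
When the target text might be absent, the failure mode matters. A brief sketch of how `index()` and `find()` differ, plus the optional start position:

```python
s = 'The tale of two cities'

found = s.find('of')           # same position that index() reports
missing = s.find('zebra')      # find() signals "not found" with -1

try:
    s.index('zebra')           # index() raises instead
    raised = False
except ValueError:
    raised = True

later = s.index('t', 5)        # start searching at position 5

print(found, missing, raised, later)
```

Since -1 is also a valid slice index, an unchecked `find()` result can silently produce wrong slices. `index()` fails loudly, which is usually safer for header lookups like the one in the column-slicing cell.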

## CSV

In [15]:
with open('data/raisin_team.csv') as f:
    for line in f:
        line = line.rstrip()
        lastname, firstname, title, email, phone = line.split(',')
        print(firstname, lastname)

Raymond Hettinger
Mary Thomas
Harold Davis
Martin Masterson
David Jones
Luis Zapata
Fritz Gunter
Esmerela Pichon
Marilyn Blain
Blair Marks
David Jones
Harold Davis
Gertrude Schmidt


In [17]:
with open('data/raisin_team_update.csv') as f:
    for line in f:
        line = line.rstrip()
        lastname, firstname, title, email, phone = line.split(',')
        lastname = lastname.strip('"')
        print(firstname, lastname)

"Raymond" Hettinger
"Mary" Thomas
"Harold" Davis
"Martin" Masterson
"David" Jones
"Luis" Zapata
"Fritz" Gunter
"Esmerela" Pichon
"Marilyn" Blain
"Blair" Marks
"David" Jones
"Harold" Davis
"Gertrude" Schmidt

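Stripping quotes by hand does not scale, and it cannot fix a quoted field that contains a comma. The sketch below uses one invented record (it is not from the workshop files) to compare `line.split(',')` with the standard library's `csv.reader`, which accepts any iterable of strings.

```python
import csv

# Invented record: the third field contains a comma inside quotes.
line = '"Hettinger","Raymond","Trainer, Python","r@example.com","555-1212"'

naive = line.split(',')               # splits inside the quoted field too
parsed = next(csv.reader([line]))     # honors the quoting rules

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split produces six pieces with the quote characters still attached, while `csv.reader` returns the five intended fields with the quotes removed.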

In [18]:
import csv

with open('data/raisin_team_update.csv') as f:
    for lastname, firstname, title, email, phone in csv.reader(f):
        print(firstname, lastname)

Raymond Hettinger
Mary Thomas
Harold Davis
Martin Masterson
David Jones
Luis Zapata
Fritz Gunter
Esmerela Pichon
Marilyn Blain
Blair Marks
David Jones
Harold Davis
Gertrude Schmidt
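
When the dialect is not known in advance, `csv.Sniffer` can guess it from a sample. This is a sketch on invented, semicolon-delimited text held in memory. Unlike the files above, the sample is assumed to start with a header row so that `csv.DictReader` can name the fields. The Sniffer is heuristic, so restricting the candidate delimiters makes the guess more dependable.

```python
import csv
import io

# Invented sample text with a header row and semicolon delimiters.
text = (
    'lastname;firstname;title\n'
    'Hettinger;Raymond;Trainer\n'
    'Thomas;Mary;Manager\n'
)

# Let the Sniffer pick among a few likely delimiters.
dialect = csv.Sniffer().sniff(text, delimiters=';,\t')

# DictReader takes the field names from the first row.
reader = csv.DictReader(io.StringIO(text), dialect=dialect)
people = [(row['firstname'], row['lastname']) for row in reader]

print(repr(dialect.delimiter))
for firstname, lastname in people:
    print(firstname, lastname)
```

Looking fields up by name avoids depending on column order, at the cost of requiring a header row or an explicit `fieldnames` argument.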
