# Built-in Modules

Most of Python's functionality lives in modules rather than in built-in (global) functions. This keeps the language syntax simple while staying flexible enough to meet different needs.

Built-in modules ship with every Python version and distribution. However, some distributions (especially those for embedded environments) may remove modules to save space. In that case, you might have to install the missing modules manually.

In [1]:
# Global configs & imports
import gzip, csv, json, os, os.path, glob, multiprocessing, time, pickle, re, random
from pprint import pprint

PATH = {
    "data": "../data"
}

# Text Processing

Humans can only read text-based files, which is why text-based formats such as YAML, JSON, and XML have emerged in recent years. They have become common even as data exchange formats, a territory that binary formats used to dominate thanks to their compactness and efficiency.

## csv

CSV, or Comma-Separated Values, is probably the oldest format that can express structured data tables. However, the lack of a detailed standard (RFC 4180 is quite rough) makes it less suitable for non-ASCII characters.

`csv.DictReader` wraps a CSV reader. It treats the first line as the CSV header, exposes it as the `.fieldnames` attribute, and returns the remaining lines one by one as `dict` objects when iterated.

In [2]:
filename = os.path.join(PATH['data'], 'small.csv')
with open(filename) as fp:
    reader = csv.DictReader(fp)
    lines = list(reader)

print("%d lines loaded" % len(lines))

10 lines loaded


However, loading the entire file into memory at once (as the code above does) may not be efficient, especially with large datasets. In that case, iterating over the lines is usually a better idea.

Both file objects and `csv.DictReader` support iteration.

In [3]:
with open(filename) as fp:
    reader = csv.DictReader(fp)
    line_count = 0
    
    for item in reader:
        # process that item
        line_count += 1

print("%d lines loaded" % line_count)

10 lines loaded
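
As a minimal sketch of what `DictReader` hands back, the example below reads a small in-memory CSV instead of a file on disk. The sample data and variable names are made up for illustration; only `csv.DictReader` and `.fieldnames` come from the text above.

```python
import csv
import io

# Hypothetical in-memory CSV standing in for a file opened with open().
sample = io.StringIO("name,age\nJoe,3\nTom,5\n")

reader = csv.DictReader(sample)
header = reader.fieldnames            # first line, parsed as the header

# Iterating yields one mapping per data line; every value is a string.
rows = [dict(row) for row in reader]

print(header)
print(rows)
```

Note that `age` comes back as `'3'`, not `3`: CSV carries no type information, so any conversion is up to the caller.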


## JSON

As JavaScript became popular over the past decade, so did JSON (JavaScript Object Notation) as a data exchange format on the web. It is more human-readable and compact than XML, and simpler than YAML.

JSON syntax is stricter than JavaScript:

* Double quotes only
* Quotes required for Object keys
* No trailing commas allowed
* No comments allowed

Sample syntax:

In [4]:
students_json = """[
    {
        "name": "Joe",
        "age": 3,
        "siblings": [
            "Tom"
        ]
    },
    {
        "name": "Tom",
        "age": 5,
        "siblings": [
            "Joe"
        ]
    }
]"""

You might have noticed that this JSON sample happens to be valid Python as well! In most cases, though, we treat JSON data as strings and convert to and from native data structures, just like other exchange formats. To do so, we have to `import json`.

In [5]:
pprint(json.loads(students_json))

[{'age': 3, 'name': 'Joe', 'siblings': ['Tom']},
 {'age': 5, 'name': 'Tom', 'siblings': ['Joe']}]


In [6]:
d = {
    "Q1": ["January", "February", "March"],
    "Q2": ["April", "May", "June"],
    "Q3": ["July", "August", "September"],
    "Q4": ["October", "November", "December"],
}

print(json.dumps(d))

{"Q1": ["January", "February", "March"], "Q2": ["April", "May", "June"], "Q4": ["October", "November", "December"], "Q3": ["July", "August", "September"]}


Pass `indent` to `json.dumps` for human-friendly output.

In [7]:
print(json.dumps(d, indent=4))

{
    "Q1": [
        "January",
        "February",
        "March"
    ],
    "Q2": [
        "April",
        "May",
        "June"
    ],
    "Q4": [
        "October",
        "November",
        "December"
    ],
    "Q3": [
        "July",
        "August",
        "September"
    ]
}


Converting them back to native data structures should yield the same results.

In [8]:
d == json.loads(json.dumps(d))

True

### Note

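JSON supports only two collection types (`Object` and `Array`, equivalent to Python `dict` and `list`). Passing unsupported types (`set`, arbitrary objects, ...) to `json.dumps` raises `TypeError`, and some supported types change on a round trip: tuples come back as lists, and non-string keys come back as strings.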
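
The sketch below illustrates how `json` treats types that have no direct JSON equivalent. The sample values are arbitrary, and using `default=sorted` to convert a set is just one possible workaround, not the only one.

```python
import json

# A set has no JSON equivalent, so json.dumps() refuses it.
try:
    json.dumps({"tags": {"a", "b"}})
    set_error = None
except TypeError as exc:
    set_error = exc

# Tuples are written as JSON arrays and come back as lists.
tuple_round_trip = json.loads(json.dumps({"point": (1, 2)}))

# Non-string keys are converted to strings on the way out.
key_round_trip = json.loads(json.dumps({1: "one"}))

# One possible workaround: convert unsupported values via `default`.
encoded = json.dumps({"tags": {"b", "a"}}, default=sorted)

print(type(set_error).__name__)
print(tuple_round_trip, key_round_trip)
print(encoded)
```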

## pickle

Unlike `csv` or `json`, `pickle` is unique to Python and is meant for serialization and deserialization rather than human-readable data transport. As a consequence, `pickle` produces byte strings, can handle most Python objects, and keeps its formats backward compatible across Python releases.

In [9]:
print(pickle.dumps(d))

b'\x80\x03}q\x00(X\x02\x00\x00\x00Q1q\x01]q\x02(X\x07\x00\x00\x00Januaryq\x03X\x08\x00\x00\x00Februaryq\x04X\x05\x00\x00\x00Marchq\x05eX\x02\x00\x00\x00Q2q\x06]q\x07(X\x05\x00\x00\x00Aprilq\x08X\x03\x00\x00\x00Mayq\tX\x04\x00\x00\x00Juneq\neX\x02\x00\x00\x00Q4q\x0b]q\x0c(X\x07\x00\x00\x00Octoberq\rX\x08\x00\x00\x00Novemberq\x0eX\x08\x00\x00\x00Decemberq\x0feX\x02\x00\x00\x00Q3q\x10]q\x11(X\x04\x00\x00\x00Julyq\x12X\x06\x00\x00\x00Augustq\x13X\t\x00\x00\x00Septemberq\x14eu.'


In [10]:
print(pickle.loads(pickle.dumps(d)))

{'Q1': ['January', 'February', 'March'], 'Q2': ['April', 'May', 'June'], 'Q4': ['October', 'November', 'December'], 'Q3': ['July', 'August', 'September']}


In [11]:
print(pickle.loads(pickle.dumps(d)) == d)

True


`pickle` provides an API similar to that of `json`:

* `load()` to load from file object
* `loads()` from byte string
* `dump()` to write to file object
* `dumps()` to a byte string
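
The file-based half of that API can be sketched as follows. An `io.BytesIO` buffer stands in for a file opened in binary mode, and the `record` contents are made up; note that a `set` survives the round trip, which `json` would reject.

```python
import io
import pickle

record = {"Q1": ["January", "February", "March"], "ids": {1, 2, 3}}

# io.BytesIO stands in for a file opened in binary mode ("wb" / "rb").
buffer = io.BytesIO()
pickle.dump(record, buffer)

buffer.seek(0)                    # rewind before reading back
restored = pickle.load(buffer)

print(restored == record)
print(type(pickle.dumps(record)))
```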

## Regex (re)

Regular expressions are like a chain saw: powerful at a certain scale, as long as you can handle them. Regex is complex enough that entire books are devoted to it. With limited time here, we'll only cover the basics of `re` and how you should be using it.

The `re` module exposes several functions, but normally we just call `re.compile()` to turn patterns into compiled regular expression objects. That's because the most common use case is to create a few patterns and search for them in lots of strings, and reusing the compiled objects can save significant computation in practice.

Regex matching is also much slower than simple string operations. Don't use regex where `a == b`, `a in b`, or `a.startswith(b)` would do.

### Pattern Syntax

In some languages, backslash (\) is used to escape symbols in strings, so a backslash itself has to be escaped (\\) to represent one backslash. Regular expressions also use backslashes to escape symbols, so in those languages we need to type four backslashes to represent one plain backslash in a regex pattern.

Unfortunately, Python is too smart: it keeps a backslash as-is when the following symbol has no special meaning in strings, so similar-looking escapes behave differently from one symbol to the next. Recent Python versions also warn about such unrecognized escape sequences.

In [12]:
print(repr("\d"), repr("\\d"))
print(repr("\t"), repr("\\t"))

'\\d' '\\d'
'\t' '\\t'


That's why we prepend `r` to pattern strings to tell Python they are **raw** strings, which are interpreted as-is. This makes our regex easier to understand.

In [13]:
print(repr(r"\d"), repr(r"\\d"))
print(repr(r"\t"), repr(r"\\t"))

'\\d' '\\\\d'
'\\t' '\\\\t'


Like Python strings, regex has special symbols that take on new meanings when they occur escaped (after a backslash) or unescaped. Some of them are:

* **(** ... **)** parentheses: 
    * Define logical units
    * Tell the regex engine to capture this segment as output (retrieved via a `groups()` call on the `Match` object)
* **(?:**...**)**
    * Non-capturing group: groups like **(** ... **)**, but the segment is not captured as output
* **[** ... **]** square brackets:
    * Match any one of the characters within
    * Different escape rules apply to the characters within
    * A dash means a range of characters, e.g., a-z, A-Z, 0-9
* **.** dot: Matches any character (except a newline, by default)
* **\d**: Any digit
* **\w**: Any word character: letters, digits, and underscore (behavior for Unicode characters depends on flags and string type)
* **\s**: Whitespace (including tab, carriage return, line feed, spaces, ...)
* Occurrences
    * **?**: The character or logical unit before **?** must appear 0 or 1 time.
    * **\***: The character or logical unit before **\*** may appear any number of times (including 0).
    * **+**: The character or logical unit before **+** must appear 1 or more times.
    * **{** N, M **}**: The character or logical unit before **{** must appear between N and M times (inclusive).
    * **{** N **}**: The character or logical unit before **{** must appear exactly N times.
    * **{** N, **}**: The character or logical unit before **{** must appear N or more times.
* Position
    * **^**: Match to start of string
    * **$**: Match to end of string
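
A few of the symbols above in action. This is only an illustrative sketch: the sample strings are made up, and `re.findall` / `re.search` are used directly for brevity.

```python
import re

sample = "a1 b22 c333"

one_or_more = re.findall(r"\d+", sample)             # + : one or more digits
bounded = re.findall(r"\d{2,3}", sample)             # {N, M}: two or three digits
optional = re.findall(r"colou?r", "color colour")    # ? : the "u" is optional
char_class = re.findall(r"[a-c]+", "abcdef cab")     # [...] with a range

# ^ and $ pin the pattern to the whole string.
anchored_ok = re.search(r"^\d+$", "123") is not None
anchored_fail = re.search(r"^\d+$", "123abc") is None

# (?:...) groups without capturing; (...) captures.
captured = re.search(r"(?:ab)+(\d)", "ababab7").groups()

print(one_or_more, bounded, optional, char_class)
print(anchored_ok, anchored_fail, captured)
```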

### `Pattern` API

After `compile()`, we match strings against our `Pattern` object with these methods:

* `.findall()`: Searches for all non-overlapping matches and returns them as a `list` of strings (or of tuples, when the pattern has more than one group). Quantifiers match greedily.
* `.finditer()`: Like `findall()`, but returns an iterator that yields `Match` objects.
* `.search()`: Searches for the first occurrence at or after `pos` (default=0) and returns it as a `Match` object.
* `.match()`: Like `search()`, but it matches from the beginning, as if there were a "^" at the start of the pattern.
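
The four methods can be compared side by side on one compiled pattern. The pattern and text below are arbitrary examples chosen for illustration.

```python
import re

pattern = re.compile(r"[a-z]+(\d+)")   # letters followed by one captured number
text = "x1 yy22 zzz333"

# With exactly one group, findall() returns that group's text per match.
numbers = pattern.findall(text)

# finditer() yields Match objects, so the full match is still available.
whole_matches = [m.group() for m in pattern.finditer(text)]

# search() accepts a start position; here it skips past "x1".
second = pattern.search(text, 2)

# match() only succeeds at the very beginning of the string.
at_start = pattern.match("x1 tail")
not_at_start = pattern.match("  x1")

print(numbers)
print(whole_matches)
print(second.group(), second.start())
print(at_start.group(), not_at_start)
```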

### Examples

In [14]:
ip = "192.168.1.1"
ptrn_ip = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

matcher = re.compile(ptrn_ip)
result = matcher.search(ip)

print(result.group())
print(result.groups())

192.168.1.1
()


In [15]:
ptrn_ip_strict = r"^%s$" % ptrn_ip
ip_sentence = "My ip addr is %s." % ip
print(re.search(ptrn_ip, ip_sentence).group())
print(re.search(ptrn_ip_strict, ip_sentence))

192.168.1.1
None


`re.search` returns a `Match` object when the pattern is found in the string, and `None` otherwise.

In [16]:
print(re.search(r"\d+", ""))

None


Regardless of the directives used in the pattern, captured groups are always strings: `\d` yields `'192'`, not the integer `192`.

In [17]:
result = re.search(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})", ip)
print(result.group())
print(result.groups())

192.168.1.1
('192', '168', '1', '1')


In [18]:
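# Both fragments are raw strings; adjacent literals inside the
# parentheses are concatenated, so no line continuation is needed.
ptrn = re.compile(r"((?:m|i|r|g|t|c|cc|cg|hi|hs)\d)"
                  r"\.(micro|small|medium|(?:\d*x)?large)")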

def t(s):
    ret = ptrn.search(s)
    print(ret.groups() if ret else "None")

In [19]:
# Positive test cases

t(r"Launched m2.small in us-east-1")
t(r"t2.xlarge are twice as powerful than t2.large")
t(r"c4.large")
t(r"Rumor has it that t8.64xlarge is coming next Q3")

('m2', 'small')
('t2', 'xlarge')
('c4', 'large')
('t8', '64xlarge')


In [20]:
# Negative test cases

t(r"k2.small")
t(r"g2.XLARGE")

None
None


# Filesystem

## Built-in file objects

Built-in file objects are created by calling `open()`. Each one keeps a cursor pointing to the current offset, and all operations (read, write, iterate, ...) work relative to this cursor.

There are two methods for manipulating the cursor:

* `tell()`: Return offset in bytes
* `seek()`: There are 3 modes for seek (POSIX), aka Whence:
    * SEEK_SET: from start of that file
    * SEEK_CUR: relative to current offset
    * SEEK_END: from end of file

`whence` values are defined as constants in most implementations. Check the manual before using them. In Python, `seek(offset, whence=0)` recognizes 0 as SEEK_SET, 1 as SEEK_CUR, and 2 as SEEK_END (also available as `os.SEEK_SET`, `os.SEEK_CUR`, and `os.SEEK_END`). For files opened in text mode, offset must be zero when whence is 1 or 2. Calling `fp.seek(0, 2)` moves the cursor to the end of the file, which is useful when appending to a file opened with mode `"r+"`.

As a common practice, functions that take file objects as input should call `tell()` to save the cursor offset and restore it via `seek(offset)` before returning to avoid side effects, unless moving the cursor is the function's primary effect.
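
The save-and-restore practice can be sketched as below. `peek_header` is a hypothetical helper written for this example, and an `io.BytesIO` buffer stands in for a file opened in binary mode.

```python
import io

def peek_header(fp, size=4):
    """Return the first `size` bytes of fp without moving its cursor."""
    saved = fp.tell()              # remember where the caller left the cursor
    try:
        fp.seek(0, io.SEEK_SET)    # jump to the start of the stream
        return fp.read(size)
    finally:
        fp.seek(saved)             # restore, even if read() raised

fp = io.BytesIO(b"HEADbody-of-the-file")
fp.seek(0, io.SEEK_END)            # move to the end, as when appending
end_offset = fp.tell()

header = peek_header(fp)
after_offset = fp.tell()           # unchanged by peek_header()

print(header, end_offset, after_offset)
```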

## os.path

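`os.path` is a submodule of `os`: importing `os` makes it available, and it can also be imported on its own. It wraps most path-related filesystem functions, such as:

* `dirname()`: Returns the directory part of a path.
* `abspath()`: Returns the absolute, normalized form of a relative or fragmented path.
* `realpath()`: Like `abspath()`, but it also resolves symbolic links.
* `join()`: Joins fragments into a `str` representing a legitimate path. Takes care of platform-specific delimiters.
* `getatime()`, `getctime()`, `getmtime()`: Return the last access time, the ctime (metadata change time on Unix, creation time on Windows), and the last modification time.
* `getsize()`: Gets the file size; relies on the low-level call wrapped as `os.stat`.
* `isfile()`, `isdir()`, `islink()`: Test what kind of entry a path points to.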
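
A short sketch of several of these functions working together. It runs inside a temporary directory so that it leaves nothing behind; the file and folder names are made up.

```python
import os.path
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # join() inserts the platform's separator; abspath() collapses the "..".
    messy = os.path.join(tmpdir, "reports", "..", "notes.txt")
    path = os.path.abspath(messy)

    with open(path, "w") as fp:
        fp.write("hello")

    info = {
        "basename": os.path.basename(path),
        "same_dir": os.path.dirname(path) == os.path.abspath(tmpdir),
        "size": os.path.getsize(path),
        "is_file": os.path.isfile(path),
        "is_dir": os.path.isdir(tmpdir),
        "mtime_is_float": isinstance(os.path.getmtime(path), float),
    }

print(info)
```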

## os.walk

Walks a directory tree recursively from a root directory. The return value of `os.walk` is a generator that yields one `(dirpath, dirnames, filenames)` tuple per directory.

In [21]:
scanpath = os.path.join(".")
tmp = os.walk(scanpath)
print(tmp)

<generator object walk at 0x0000000004D950D8>


In [22]:
pprint(list(tmp))

[('.',
  ['.ipynb_checkpoints'],
  ['00.Common.ipynb',
   '01.Python Basics.ipynb',
   '01q.ipynb',
   '02.Builtin Modules.ipynb',
   '03.More on Python.ipynb',
   '04.3rd-Party Modules.ipynb']),
 ('.\\.ipynb_checkpoints',
  [],
  ['00.Common-checkpoint.ipynb',
   '01.Python Basics-checkpoint.ipynb',
   '01q-checkpoint.ipynb',
   '02.Builtin Modules-checkpoint.ipynb',
   '04.3rd-Party Modules-checkpoint.ipynb'])]
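
In practice the generator is usually consumed in a loop rather than converted to a list. The sketch below builds a small throwaway tree (the names are made up) and collects every file path relative to the root.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a tiny tree: root/a.txt and root/sub/b.txt
    os.makedirs(os.path.join(root, "sub"))
    for relpath in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(root, relpath), "w") as fp:
            fp.write("x")

    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, root))

found.sort()
print(found)
```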


## glob

Module `glob` handles filename globbing and is handy for listing files under a directory. There are two functions in `glob`:

* `glob.glob`: Returns filenames as a list
* `glob.iglob`: Returns a generator that iterates over filenames

When dealing with folders containing a lot of files, `iglob` can be more efficient.

In [23]:
print(glob.glob("%s/*" % scanpath))

['.\\00.Common.ipynb', '.\\01.Python Basics.ipynb', '.\\01q.ipynb', '.\\02.Builtin Modules.ipynb', '.\\03.More on Python.ipynb', '.\\04.3rd-Party Modules.ipynb']


In [24]:
print(list(glob.iglob("%s/*" % scanpath)))

['.\\00.Common.ipynb', '.\\01.Python Basics.ipynb', '.\\01q.ipynb', '.\\02.Builtin Modules.ipynb', '.\\03.More on Python.ipynb', '.\\04.3rd-Party Modules.ipynb']


In [25]:
print(glob.glob("%s/../*/*.ipynb" % scanpath))

['./..\\Common\\git.ipynb', './..\\PH 201508\\00.Common.ipynb', './..\\PH 201508\\01.Python Basics.ipynb', './..\\PH 201508\\01q.ipynb', './..\\PH 201508\\02.Builtin Modules.ipynb', './..\\PH 201508\\03.More on Python.ipynb', './..\\PH 201508\\04.3rd-Party Modules.ipynb']


# os

Wraps lots of POSIX functions, for example `nice()`, `open()` (the low-level one that returns file descriptor numbers instead of file objects), `getpid()`, `execl()` (use the standard library module `subprocess` instead), `chdir()`, `setuid()`, `umask()`, and `urandom()`.

Most of them are low-level functions, and you should consider high-level wrappers instead (unless you know what you're doing).

# random

Hosts functions for generating random values, including:

* `randint(a, b)`: Returns a random integer in range [a, b] (inclusive on both ends). Supports arbitrarily large integers.
* `random()`: Returns a float in the interval [0, 1)
* `choice(a)`: Returns a random element from sequence a
* `randrange(start, stop, step)`: Like `randint()`, but follows `range()` semantics, so `stop` is excluded.
* `sample(l, n)`: Picks n distinct elements from l. Duplicates are allowed in l, and they are treated as different entries.
* `shuffle(l)`: Shuffles the entries of list l in place.
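
The sketch below exercises each of these functions once. It uses a private `random.Random` instance with an arbitrary seed so that runs are repeatable; seeding is an addition for the sake of the example, and the module-level functions behave the same way.

```python
import random

rng = random.Random(42)                  # private generator, arbitrary seed

die = rng.randint(1, 6)                  # 1..6, both ends included
even = rng.randrange(0, 10, 2)           # one of 0, 2, 4, 6, 8; stop excluded
fraction = rng.random()                  # float in [0, 1)

colors = ["red", "green", "blue"]
pick = rng.choice(colors)                # one element of the list

deck = list(range(10))
hand = rng.sample(deck, 3)               # three distinct entries, deck untouched
rng.shuffle(deck)                        # reorders deck in place

print(die, even, fraction, pick, hand, deck)
```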

In [26]:
print("Roll a dice and you get %d" % random.randint(1,6))

Roll a dice and you get 2


# Web Requests: urllib, urllib2, urllib3

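These are successive HTTP(S) client libraries, each adding features over the previous one; the old ones are kept for compatibility reasons. urllib2 takes `Request` objects in its API, while urllib3 supports connection pooling. Note that `urllib2` exists only in Python 2 (its functionality moved to `urllib.request` in Python 3), and `urllib3` is a third-party package rather than part of the standard library.

Most of the time, you should be dealing with high-level HTTP wrappers such as Requests (which we'll talk about in a few days).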

# Compression

## gzip

The API for gzip is very straightforward because gzip has no folder structure. As a consequence, gzip files can be manipulated with `gzip.open()`, which returns an object that behaves almost like a built-in file object.

There can be some compatibility issues between Python 2 and 3 with gzip. By default, gzip opens files in binary mode, and data read from those files is `str` in Python 2 and `bytes` in Python 3. Supplying a text `mode` (t) when calling `gzip.open()` can solve the compatibility issue.

In [27]:
with gzip.open("../data/large-one-column.csv.gz", 'rt') as fp:
    print("%d lines loaded" % len(list(fp)))

1050001 lines loaded
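
A self-contained round trip shows the difference between text and binary mode. The file name and contents below are made up, and the file lives in a temporary directory.

```python
import gzip
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.csv.gz")    # hypothetical file name

    # "wt" / "rt": text mode, data is str.
    with gzip.open(path, "wt", encoding="utf-8") as fp:
        fp.write("id\n1\n2\n")

    # Count lines lazily instead of loading them all into a list.
    with gzip.open(path, "rt", encoding="utf-8") as fp:
        line_count = sum(1 for _ in fp)

    # "rb": binary mode, data is bytes.
    with gzip.open(path, "rb") as fp:
        raw = fp.read()

print(line_count, type(raw).__name__)
```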


## zipfile

Unlike gzip, the zip format handles file structure internally, so the `zipfile` module needs another tier of abstraction to handle it.
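
That extra tier is the `zipfile.ZipFile` archive object, which holds named members. The sketch below writes and reads an archive in memory; the member names and contents are made up, and `io.BytesIO` stands in for a `.zip` file on disk.

```python
import io
import zipfile

archive = io.BytesIO()    # in-memory stand-in for a .zip file on disk

# Writing: each member has a name, which may include folder segments.
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("readme.txt", "top-level member")
    zf.writestr("data/values.csv", "id\n1\n2\n")

# Reading: list the members, then open one of them like a binary file.
with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()
    with zf.open("data/values.csv") as member:
        content = member.read().decode("utf-8")

print(names)
print(repr(content))
```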

# Multiprocessing

Due to physical limits, it has become hard to raise clock frequencies beyond 4GHz. Moore's Law, on the other hand, continues: we now have 8+ cores on mobile devices, and soon there may be 64+ cores on desktops. The initial approach to using them in most languages was threads, but sharing memory between threads can cause a lot of trouble. Python's GIL (Global Interpreter Lock) sidesteps much of that trouble, but it also makes threading less attractive for CPU-bound work, because only one thread in a process can run Python code at any given time.

That's why `multiprocessing` came to the rescue in Python 2.6. It mimics the API of `threading` to make migration easier. Processes can be started from `multiprocessing` and are then distributed across CPU cores. It also includes helper classes that handle IPC (inter-process communication) and data synchronization transparently behind the scenes.

**Note**: Multiprocessing doesn't work well in IPython notebook

## Process

Python will launch a new `Process` and attach an entry point (a function) to that process. Since the child process may import the module that defines the function, it's always good practice to protect module-level statements with an `if __name__ == "__main__":` block. The main process can then call `Process.start()` to start the child process and `Process.join()` to wait until it ends.

In [28]:
# start_mp.py

from multiprocessing import Process

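def f(name):
    print('hello %s' % name)

# "and False" keeps this cell from starting a process inside the notebook
# (see the note above); remove it when running start_mp.py as a script.
if __name__ == '__main__' and False: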
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()

## Pool

`multiprocessing.Pool` offers a simple interface that mimics the built-in function `map()`. With `Pool.map()`, we can easily parallelize a function as long as it ...

1. takes one argument
2. can be imported from module scope
3. has no side effects when imported
4. takes arguments that can be pickled

In [29]:
def square_it(x):
    return x ** 2

# Create the pool after defining the function so the workers can find it.
p = multiprocessing.Pool(2)
inputs = list(range(30000))

In [30]:
# Conventional, single-process approach
results = list(map(square_it, inputs))

In [31]:
# Multiprocess approach: with 2 workers, up to twice as fast for CPU-heavy functions
if False:
    results = p.map(square_it, inputs)

To call functions that require more than one argument, we can define a wrapper that takes a single parameter, unpacks it, and calls the desired function. We then pack the arguments (for example into tuples) before passing them to `Pool.map()`.
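
A minimal sketch of that wrapper technique follows. `power`, `power_packed`, and `parallel_power` are hypothetical names invented for this example. To keep it safe to run anywhere (including notebooks), the pool is only created inside `parallel_power()`, and the wrapper is demonstrated here with the built-in `map()`.

```python
import multiprocessing

def power(base, exponent):
    """The function we actually want to call; it needs two arguments."""
    return base ** exponent

def power_packed(args):
    """One-argument wrapper: unpack the tuple, then delegate."""
    return power(*args)

def parallel_power(pairs, workers=2):
    """Run power() over (base, exponent) tuples in a process pool."""
    with multiprocessing.Pool(workers) as pool:
        return pool.map(power_packed, pairs)

# Pack the arguments into tuples, one tuple per call.
pairs = [(2, 3), (3, 2), (10, 0)]

# The same wrapper works with builtin map(), which is handy for debugging.
serial = list(map(power_packed, pairs))
print(serial)
```

In a script, `parallel_power(pairs)` would be called from inside an `if __name__ == "__main__":` block and should return the same list. Python 3.3 and later also provide `Pool.starmap()`, which unpacks argument tuples without a hand-written wrapper.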

## IPC Helpers

There are some classes in `multiprocessing` that help us handle IPC (inter-process communication). They handle locks behind the scenes to prevent race conditions.

### Queue

Exposing an API similar to `queue.Queue` (`Queue.Queue` in Python 2), `multiprocessing.Queue` handles data exchange between processes. Usually we create a `Queue` in the main process and pass the instance to two groups of processes, namely producers and consumers. Producers `Queue.put()` messages onto the `Queue`, while consumers `Queue.get()` from it.

`Queue.put()` and `Queue.get()` operations can be blocking or non-blocking. If `Queue.get()` times out, it raises `queue.Empty` (an exception defined in module `queue`, which is named `Queue` in Python 2, not in `multiprocessing`).
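
The put/get/timeout behavior can be sketched without launching any child process. In the example below, `producer` and `consumer` are hypothetical functions that would normally be `Process` targets; here they run in the same process purely to show the API, and the `None` sentinel is one common convention rather than a requirement. The module names assume Python 3.

```python
import multiprocessing
import queue    # queue.Empty is defined here, not in multiprocessing

def producer(q, items):
    """Put every item on the queue, then a None sentinel to signal the end."""
    for item in items:
        q.put(item)
    q.put(None)

def consumer(q, timeout=1.0):
    """Collect items until the sentinel arrives or the queue stays empty."""
    received = []
    while True:
        try:
            item = q.get(timeout=timeout)   # blocks for up to `timeout` seconds
        except queue.Empty:
            break
        if item is None:
            break
        received.append(item)
    return received

q = multiprocessing.Queue()
producer(q, ["a", "b", "c"])    # same process here; normally a Process target
received = consumer(q)

# Nothing is left, so a short blocking get() times out.
try:
    q.get(timeout=0.1)
    timed_out = False
except queue.Empty:
    timed_out = True

q.close()
q.join_thread()
print(received, timed_out)
```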

### Pipe

Unlike `Queue`, `Pipe`s are bi-directional. That makes them more like a low-level communication channel, and developers should define their own processing logic on top of `Pipe`.
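
As a sketch of that two-way traffic, the example below sends a request through one end of a pipe and a reply back through the other. Both ends are used from a single process only to keep the example self-contained; normally one end is handed to a child process. The message format is made up.

```python
import multiprocessing

# Pipe() returns two connection objects; by default both can send and receive.
parent_end, child_end = multiprocessing.Pipe()

# "Parent" sends a request ...
parent_end.send({"cmd": "square", "value": 7})

# ... "child" receives it and answers over the same pipe.
request = child_end.recv()
child_end.send(request["value"] ** 2)

reply = parent_end.recv()
has_more = parent_end.poll(0.1)    # is anything else waiting? (waits up to 0.1 s)

parent_end.close()
child_end.close()
print(reply, has_more)
```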

## Gotcha

### Error tracking

Programs using multiprocessing are harder to trace, because the exceptions that would help with debugging are raised in the child processes. The main process may only learn that a child failed, without the original exception and traceback at hand.

Reverting from `Pool.map()` to the built-in `map()` can help, since their APIs are nearly identical. For code using `Process`, the best approach is to write logs with `logging`, which we'll cover later.

### Dangling processes

Child processes may hang, leaving the entire program unresponsive. The most common causes are ...

1. Deadlocks in your application logic
2. Hanging requests without proper timeouts
3. Infinite loop in child process (eg., inappropriate retries)

# Logging

Like log4j for Java, `logging` is the standard logging facility in Python. You can create and configure the default (top-level) logger for your process, as well as sub-loggers for fine-grained control. Each logger can have its own Filters and Handlers, and the Handlers then format the log message and send it to the desired destination.

By default, logging outputs via `StreamHandler` to stderr, but the output destination is configurable.

Common **output handlers** are:

* `StreamHandler`: Outputs to streams such as stdout, stderr, or an in-memory string (via StringIO). It can support file streams, but we have dedicated Handlers for files.
* `FileHandler`: Outputs to a file.
* `WatchedFileHandler`: Reopens the destination if the file is changed elsewhere. Useful when working with logrotate. Don't use it on Windows.
* `RotatingFileHandler`, `TimedRotatingFileHandler`: Rotate and remove old logs, in case you don't want logrotate.
* `SysLogHandler`, `NTEventLogHandler`: Self-explanatory.

Builtin levels are:

* **CRITICAL** 50
* **ERROR** 40
* **WARNING** 30
* **INFO** 20
* **DEBUG** 10
* **NOTSET** 0
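
The interplay between logger levels and handler levels can be sketched as follows. The logger name and messages are made up, and the `StreamHandler` writes to a `StringIO` buffer only so that the result can be inspected.

```python
import io
import logging

stream = io.StringIO()                        # captures output for inspection
handler = logging.StreamHandler(stream)
handler.setLevel(logging.WARNING)             # the handler drops anything lower
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

logger = logging.getLogger("demo.handlers")   # hypothetical logger name
logger.setLevel(logging.DEBUG)                # the logger lets everything through
logger.addHandler(handler)
logger.propagate = False                      # keep records away from the root logger

logger.debug("dropped by the handler")
logger.error("kept")
logger.log(logging.WARNING, "numeric levels work too")

output = stream.getvalue()
print(output, end="")
```

A record has to pass both the logger's level and the handler's level, which is why the debug message never reaches the stream.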

In [32]:
import logging

def test_logger(logger):
    logger.debug("Some debug message")
    logger.info("Some information")
    logger.warning("Some warning message")
    logger.error("Houston, we have a problem")

In [33]:
logging.basicConfig()
test_logger(logging.getLogger("Demo-02"))

WARNING:Demo-02:Some warning message
ERROR:Demo-02:Houston, we have a problem


In [34]:
logger = logging.getLogger("Demo-02")
logger.setLevel(logging.DEBUG)
test_logger(logger)

DEBUG:Demo-02:Some debug message
INFO:Demo-02:Some information
WARNING:Demo-02:Some warning message
ERROR:Demo-02:Houston, we have a problem
