# Programming Tools I

In this lesson and the next, we look at a variety of standard library modules that generically aid in programming tasks.  Of course, in some way, all modules should do that.  But the three modules we look at in this lesson all deal with structuring data in generic ways that are not restricted to any particular domain or task.

We will structure this lesson from narrowest to broadest, in a sense.  That is, the modules `enum` and `dataclasses` each contain a single basic data structure.  The `collections` module contains an assortment of different data structures.  These are separate for historical reasons, but you can easily think of enums and dataclasses as kinds of collections.

Let us load everything we will use at the start.

In [1]:
from enum import Enum, auto
from dataclasses import dataclass, make_dataclass, asdict, astuple, is_dataclass
from collections import namedtuple, deque, ChainMap, Counter

from IPython.display import HTML
from random import randrange
from pprint import pprint

## Module: enum

Enumerations are available in a variety of programming languages.  The idea of collecting together distinct values, used for a common purpose, is widely useful.  The `enum` module joined the standard library in Python 3.4, which makes it recent compared to many other modules.  Before that, global "constants" were typically used.

In [2]:
# Define color constants
RED, GREEN, BLUE = 1, 2, 3

def show_bar(value=1, color=None):
    colors = {RED:'red', BLUE:'blue', GREEN:'green'}
    _color = colors.get(color, 'black')
    s = ("❓" if _color=="black" else '◉') * value
    return HTML(f"<font color={_color}>{s}</font>")

In [3]:
show_bar(12, GREEN)

In [4]:
show_bar(7, RED)

In [5]:
CHARTREUSE, PURPLE = 4, 5
show_bar(15, color=PURPLE)

In [6]:
show_bar(color=BLUE)

Generally this works fine, and you will probably come across existing code that looks roughly like the above.  However, we are missing a couple of things:

* We have no way to determine the list of all colors
* The domain or purpose of constants is not unified

In [7]:
# Constants for a different domain
HEIGHT, WIDTH, DEPTH = 1, 2, 3
show_bar(22, DEPTH)

We might reformulate this function slightly to use a more self-documenting and inspectable enumeration.

In [8]:
Colors = Enum("Color", "RED GREEN BLUE")

def show_bar(value=1, color=None):
    colors = {Colors.RED:'red', Colors.BLUE:'blue', Colors.GREEN:'green'}
    _color = colors.get(color, 'black')
    s = ("❓" if _color=="black" else '◉') * value
    return HTML(f"<font color={_color}>{s}</font>")

In [9]:
list(Colors)

[<Color.RED: 1>, <Color.GREEN: 2>, <Color.BLUE: 3>]

In [10]:
for color in Colors:
    print(color)

Color.RED
Color.GREEN
Color.BLUE


In [11]:
print(Colors.GREEN)
print(Colors.GREEN.value)
Colors.GREEN.value == 2

Color.GREEN
2


True

In [12]:
show_bar(15, Colors.GREEN)

In [13]:
print(f"{GREEN=}")
show_bar(12, GREEN)

GREEN=2


In [14]:
Dims = Enum("Dimension", "HEIGHT WIDTH DEPTH")
print(list(Dims))
show_bar(20, Dims.WIDTH)

[<Dimension.HEIGHT: 1>, <Dimension.WIDTH: 2>, <Dimension.DEPTH: 3>]
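The difference between the two approaches can be made explicit with a small comparison.  This is a minimal sketch that redefines the constants and enumerations locally (under the names `Color` and `Dimension`, chosen here for illustration) so that it runs on its own.

```python
from enum import Enum

# Plain integer constants from unrelated domains compare equal
RED, GREEN, BLUE = 1, 2, 3
HEIGHT, WIDTH, DEPTH = 1, 2, 3
print(BLUE == DEPTH)                               # True: both are just the int 3

Color = Enum("Color", "RED GREEN BLUE")
Dimension = Enum("Dimension", "HEIGHT WIDTH DEPTH")

# Enum members keep their domain, even when the underlying values match
print(Color.BLUE.value == Dimension.DEPTH.value)   # True
print(Color.BLUE == Dimension.DEPTH)               # False
print(Color.BLUE == 3)                             # False: not equal to a bare int
```

This is why a dictionary keyed on enumeration members does not find an entry for a member of another enumeration, nor for a plain integer.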


We can define an enumeration more verbosely to get some extra capabilities.

In [15]:
class Colors(Enum):
    RED = 1
    GREEN = 2
    BLUE = auto() # Pick value for me
    SCARLET = RED # A synonym for RED
    VERDANT = 2   # A synonym for GREEN
    
    def __str__(self):
        return f"The color {self.name} has value {self.value}"
    
    @classmethod
    def favorite(cls):
        from random import choice
        return choice(list(cls)).name

In [16]:
print(Colors.BLUE)

The color BLUE has value 3


In [17]:
print("My favorite is", Colors.favorite())
print("...wait, no it's", Colors.favorite())

My favorite is BLUE
...wait, no it's GREEN


In [18]:
print(Colors.SCARLET)

The color RED has value 1
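Synonyms behave slightly differently from ordinary members, which the following sketch tries to show.  It repeats the class definition without the custom methods so that it is self-contained; the lookups by value and by name are standard `Enum` behavior.

```python
from enum import Enum, auto

class Colors(Enum):
    RED = 1
    GREEN = 2
    BLUE = auto()
    SCARLET = RED   # alias for RED
    VERDANT = 2     # alias for GREEN

# An alias is the very same member object as the name it duplicates
print(Colors.SCARLET is Colors.RED)       # True

# Iteration skips aliases ...
print([c.name for c in Colors])           # ['RED', 'GREEN', 'BLUE']
# ... but __members__ lists every name, aliases included
print(list(Colors.__members__))           # ['RED', 'GREEN', 'BLUE', 'SCARLET', 'VERDANT']

# Lookup by value and by name
print(Colors(2) is Colors.GREEN)          # True
print(Colors["VERDANT"] is Colors.GREEN)  # True
```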


If you wish to do something with the values of enumeration elements beyond distinguishing them, they can have arbitrary values and work the same.

In [19]:
class Colors(Enum):
    RED = "Stop!"
    GREEN = "Go"
    YELLOW = "Yield"

In [20]:
list(Colors)

[<Colors.RED: 'Stop!'>, <Colors.GREEN: 'Go'>, <Colors.YELLOW: 'Yield'>]

In [21]:
class Colors(Enum):
    RED = "#AA2222"
    GREEN = "#11DD33"
    BLUE = "#2222AA"

def show_bar(value, color):
    return HTML(f"<font color={color.value}>{'◉'*value}</font>")

In [22]:
show_bar(24, Colors.BLUE)

In [23]:
show_bar(18, Colors.RED)

## Module: dataclasses

A Data Class collects together related information.  It somewhat resembles a dictionary in having fields and values, but its fields are fixed by the class definition rather than added and removed freely.  The values can be changed.

A Data Class also resembles a namedtuple, which we will look at in the `collections` module.  However, a Data Class is mutable: a program can modify the attribute values of an instance.

In [24]:
Printer = make_dataclass("Printer", ["Height", "Width", "Depth", "Weight"])

In [25]:
p1 = Printer(Width=460, Depth=493, Height=460, Weight=29)
p2 = Printer(Width=464, Depth=385, Height=145, Weight=5.1)
print(p1, p2, sep='\n')

Printer(Height=460, Width=460, Depth=493, Weight=29)
Printer(Height=145, Width=464, Depth=385, Weight=5.1)


In [26]:
# Comparisons
print("p1 and p2 have same attributes?", p1 == p2)

p3 = Printer(Width=464, Depth=385, Height=145, Weight=7.2)
print(p3)
p3.Weight = 5.1  # Update information
print(p3)

print("p2 and p3 have same attributes?", p2 == p3)

p1 and p2 have same attributes? False
Printer(Height=145, Width=464, Depth=385, Weight=7.2)
Printer(Height=145, Width=464, Depth=385, Weight=5.1)
p2 and p3 have same attributes? True


In [27]:
# Dimension comparisons are not directly meaningful.
try:
    print("Is p1 'more than' p2?", p1 > p2)
except Exception as err:
    print(err)

'>' not supported between instances of 'Printer' and 'Printer'


In [28]:
# Individual attributes might be comparable
print("Is p1 heavier than p2?", p1.Weight > p2.Weight)

Is p1 heavier than p2? True
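If ordering whole instances is meaningful for your data, the `order=True` option generates the comparison methods.  The sketch below uses a hypothetical `Parcel` class (not part of this lesson) whose first field is the weight, so that comparisons are decided by weight first.

```python
from dataclasses import make_dataclass

# order=True generates __lt__, __le__, __gt__ and __ge__, comparing
# instances as if they were tuples of their fields, in definition order
Parcel = make_dataclass("Parcel", ["Weight", "Height"], order=True)  # hypothetical class

light = Parcel(Weight=5.1, Height=145)
heavy = Parcel(Weight=29, Height=460)

print(heavy > light)               # True
print(sorted([heavy, light])[0])   # Parcel(Weight=5.1, Height=145)
```

Whether field-by-field ordering makes sense depends on the data; for the printers above, where no single dimension dominates, leaving it off is arguably the better choice.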


However, different Data Classes that happen to have the same attributes remain different.

In [29]:
Breadbox = make_dataclass("Breadbox", ["Height", "Width", "Depth", "Weight"])
bb = Breadbox(Width=464, Depth=385, Height=145, Weight=5.1)
bb

Breadbox(Height=145, Width=464, Depth=385, Weight=5.1)

In [30]:
print("Are bb and p2 equivalent?", bb == p2)

Are bb and p2 equivalent? False


The `dataclasses` module contains functions to manipulate Data Classes.  Notice that these are functions, not methods on the actual object.

In [31]:
# We can simplify a dataclass to a dictionary
print(asdict(bb))

{'Height': 145, 'Width': 464, 'Depth': 385, 'Weight': 5.1}


In [32]:
# Or as a tuple.  Order is as defined, not as initialized
print(astuple(bb))

(145, 464, 385, 5.1)


In [33]:
# Disregarding type, the fields/values are the same
asdict(bb) == asdict(p2)

True
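Two more module-level functions follow the same pattern: `fields()` describes the declared fields, and `replace()` builds a modified copy.  A minimal sketch, recreating the `Breadbox` class so that it runs by itself:

```python
from dataclasses import make_dataclass, fields, replace

Breadbox = make_dataclass("Breadbox", ["Height", "Width", "Depth", "Weight"])
bb = Breadbox(Width=464, Depth=385, Height=145, Weight=5.1)

# fields() reports the declared fields, in definition order
print([f.name for f in fields(bb)])   # ['Height', 'Width', 'Depth', 'Weight']

# replace() returns a new instance with some values changed; bb is untouched
heavier = replace(bb, Weight=7.2)
print(heavier.Weight, bb.Weight)      # 7.2 5.1
```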

An interesting thing about Data Classes is that they are dynamically created types.  There is no "dataclass" type that all Data Classes belong to.  The special function `is_dataclass()` will answer whether a class or instance is a Data Class.

In [34]:
type(bb), is_dataclass(bb), is_dataclass(Breadbox)

(types.Breadbox, True, True)

### Class definitions

As with enumerations, you can customize Data Classes by writing class definitions rather than using the class factory `make_dataclass`.  For technical reasons, this needs to be done with a decorator rather than by inheritance, in this case.

In [35]:
@dataclass
class Kitty:
    Width: int      # Integer length in mm
    Depth: int
    Weight: float   # Float weight in kg
    Height: int = 460  
    
    @staticmethod
    def density(obj):
        return obj.Weight / (obj.Height * obj.Width * obj.Depth)
    
    def denser_than(self, other):
        if not isinstance(other, (Kitty, Breadbox)):
            raise ValueError("Kitties are incommensurable with "
                             f"{other.__class__.__name__}s")
        # Simplifying assumption: cats are rectangular cylinders
        return self.density(self) > self.density(other)

In [36]:
astrophe = Kitty(Width=180, Depth=280, Weight=6.8)
kachina = Kitty(Width=185, Depth=260, Height=450, Weight=6.3)
astrophe

Kitty(Width=180, Depth=280, Weight=6.8, Height=460)

In [37]:
print("Astrophe denser than Kachina?", astrophe.denser_than(kachina))
print(f"Astrophe density kg/mm^3: {Kitty.density(astrophe):.2e}")
print(f"Kachina density kg/mm^3:  {Kitty.density(kachina):.2e}")

Astrophe denser than Kachina? True
Astrophe density kg/mm^3: 2.93e-07
Kachina density kg/mm^3:  2.91e-07


Comparing to other things depends on the method knowing how to handle them.

In [38]:
# Most things can be compared to a breadbox
print("Astrophe denser than a breadbox?", astrophe.denser_than(bb))

Astrophe denser than a breadbox? True


In [39]:
# Printers are not such a common reference
try:
    astrophe.denser_than(p1)
except Exception as err:
    print(err)

Kitties are incommensurable with Printers


## Module: collections

The module `collections` extends the built-in Python data structures of `tuple`, `list`, `set`, and `dict` in a number of useful ways.  The module `queue` also contains data structures that are thread safe, but the course on Concurrency discusses that.

This course will not discuss everything in the `collections` module.  For example, `UserDict`, `UserList`, and `UserString` are largely historical, since it was formerly not possible to subclass directly from `dict`, `list`, and `str`.  `OrderedDict` is partially superseded by `dict` maintaining insertion order in current Python.  The class `defaultdict` remains useful at times, but is skipped because of the time constraints of this course.

### namedtuple

A `namedtuple` is a subclass of `tuple` that adds names to its elements.  In some ways it is akin to an enumeration, and in other ways to a Data Class.  However, this structure is older than either of those in Python development, and remains very useful. 

For an example, let us work with RGB colors in a somewhat different way than we did with enumerations and Data Classes, but one that addresses a plausible need.

In [40]:
RGB = namedtuple("Color", "red green blue")

salmon = RGB(250, 128, 114)
lavender = RGB(230, 230, 250)
seagreen = RGB(46, 139, 87)
indianred = RGB(205, 92, 92)

seagreen

Color(red=46, green=139, blue=87)

A variation on the `show_bar()` function will utilize color tuples.

In [41]:
def show_bar(value, color):
    cstr = "#%02x%02x%02x" % color  # zero-pad so values below 16 still give two hex digits
    return HTML(f"<font color={cstr}>{'◉'*value}</font>")

In [42]:
show_bar(19, salmon)

In [43]:
show_bar(21, indianred)

In [44]:
# A plain tuple works also
show_bar(25, (220, 84, 168))

Attributes are more mnemonic than the raw positions.  Both styles of access are equivalent.

In [45]:
salmon.green, salmon[1]

(128, 128)

In [46]:
# Which color contains more of a red component?
print("Is Indian Red more 'red' than is Salmon?")
indianred.red > salmon.red

Is Indian Red more 'red' than is Salmon?


False
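The contrast with a Data Class, namely that a namedtuple is immutable, can be sketched as follows.  The helper methods `_replace()` and `_asdict()` and the attribute `_fields` are part of the namedtuple API; the leading underscore only avoids clashes with field names.

```python
from collections import namedtuple

RGB = namedtuple("Color", "red green blue")
salmon = RGB(250, 128, 114)

# Unlike a Data Class instance, a namedtuple cannot be modified in place
try:
    salmon.red = 255
except AttributeError as err:
    print("Cannot assign:", err)

# _replace() returns a new tuple; the original is unchanged
brighter = salmon._replace(red=255)
print(brighter)              # Color(red=255, green=128, blue=114)
print(salmon)                # Color(red=250, green=128, blue=114)

# Field names and a dictionary view are available for introspection
print(RGB._fields)           # ('red', 'green', 'blue')
print(salmon._asdict())

# It is still an ordinary tuple, so unpacking works
r, g, b = salmon
```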

### deque

A `deque` is a data structure that has mostly the same API as a `list` but in which it is computationally cheap to add or remove items from either end.  In a list, it is cheap to add or remove items from the end, but not from the beginning.  Let us first quickly demonstrate that `list` can be inefficient when adding or removing items at the beginning.

In [47]:
def remove_to_set(mylist, end="right"):
    myset = set()
    ndx = -1 if end=="right" else 0
    while mylist:
        myset.add(mylist.pop(ndx))
    return myset

We can time a couple of operations for comparison.  Both have the overhead of the set operations, so this is not the most extreme comparison.

In [48]:
mylist = [randrange(1, 100) for _ in range(1_000_000)]
print(len(mylist), mylist[:10])
%time myset = remove_to_set(mylist, end="right")

1000000 [33, 59, 19, 52, 59, 30, 29, 91, 32, 27]
CPU times: user 89.2 ms, sys: 0 ns, total: 89.2 ms
Wall time: 89.5 ms


In [49]:
mylist = [randrange(1, 100) for _ in range(1_000_000)]
print(len(mylist), mylist[:10])
%time myset = remove_to_set(mylist, end="left")

1000000 [49, 73, 70, 71, 20, 37, 89, 3, 90, 63]
CPU times: user 3min 37s, sys: 55.3 ms, total: 3min 37s
Wall time: 3min 38s


The example is contrived, but there are real-world scenarios where access from both ends is useful.  A `deque` offers the familiar list methods plus `.appendleft()` and `.popleft()` (with a different internal structure).  However, `deque` does **not** directly support slicing, which is an inconvenience versus a list.

In [50]:
mydeque = deque()
for _ in range(1_000_000):
    mydeque.appendleft(randrange(1, 100))
    
print(len(mydeque), [mydeque[i] for i in range(10)])

1000000 [75, 8, 59, 82, 50, 67, 76, 45, 45, 4]
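The list comprehension over indices above is one way around the lack of slicing.  Another, sketched here on a small deque, is `itertools.islice()`, which takes the same start and stop positions a slice would.

```python
from collections import deque
from itertools import islice

dq = deque(range(10))

# A slice expression is rejected outright
try:
    dq[2:5]
except TypeError as err:
    print("Slicing fails:", err)

# islice() walks the deque lazily; list() materializes the piece we want
print(list(islice(dq, 2, 5)))     # [2, 3, 4]

# Integer indexing does work, including negative indices
print(dq[0], dq[-1])              # 0 9
```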


In [51]:
def remove_to_set(mydeque, end="right"):
    myset = set()
    op = mydeque.pop if end=="right" else mydeque.popleft
    while mydeque:
        myset.add(op())
    return myset

In [52]:
dqcopy = mydeque.copy()
%time myset = remove_to_set(dqcopy, "right")

CPU times: user 156 ms, sys: 0 ns, total: 156 ms
Wall time: 154 ms


In [53]:
dqcopy = mydeque.copy()
%time myset = remove_to_set(dqcopy, "left")

CPU times: user 144 ms, sys: 0 ns, total: 144 ms
Wall time: 147 ms


There are a few more special operations on deques.  We can `.reverse()` them (as we can with lists).  We can `.rotate()` them.  We can also limit their length, with extra items simply "falling off" the opposite end.

In [54]:
dq = deque(range(10))
dq

deque([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [55]:
dq.reverse()
dq

deque([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
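Rotation can be illustrated in the same way.  In this short sketch a positive argument moves items from the right end around to the left, and a negative argument does the opposite.

```python
from collections import deque

dq = deque(range(10))

dq.rotate(3)     # positive: move three items from the right end to the left
print(dq)        # deque([7, 8, 9, 0, 1, 2, 3, 4, 5, 6])

dq.rotate(-3)    # negative: rotate back the other way
print(dq)        # deque([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
```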

A way of keeping track of only the "recent" events is to use a deque with limited size.  You can either `append()` or `appendleft()`—or whatever mixture suits your purpose—and the deque will never exceed the given size.  Items from the "far" end will simply be discarded.  Think of a stock trading algorithm that wants to remember "the last 20 trades", for example.

In [56]:
last5 = deque(maxlen=5)  # Only store a few items
for char in 'ABCDEFG':
    last5.appendleft(char)
    print(last5)

deque(['A'], maxlen=5)
deque(['B', 'A'], maxlen=5)
deque(['C', 'B', 'A'], maxlen=5)
deque(['D', 'C', 'B', 'A'], maxlen=5)
deque(['E', 'D', 'C', 'B', 'A'], maxlen=5)
deque(['F', 'E', 'D', 'C', 'B'], maxlen=5)
deque(['G', 'F', 'E', 'D', 'C'], maxlen=5)


### ChainMap

A `ChainMap` is a convenient way of combining multiple dictionaries or other mappings.  An advantage here is that it does not alter the component mappings; also it does not need to perform any copying, which can save time when using large dictionaries.

A common scenario to use `ChainMap` is when multiple configuration files are read, and they may have overlapping keys.  In `ChainMap`, the **first** matching key wins.

In [57]:
localcfg  = {'ipaddr': "192.168.1.103", 'port': 10_001}
groupcfg  = {'ipaddr': "192.168.2.100", 'protocol': 'FooBarProtocol'}
globalcfg = {'port': 10_002, 'protocol': 'BamService', 'servername': 'George'}
allcfg = ChainMap(localcfg, groupcfg, globalcfg)

In [58]:
pprint(allcfg)
print('-----')
for key in allcfg.keys():
    print(key, allcfg[key])

ChainMap({'ipaddr': '192.168.1.103', 'port': 10001},
         {'ipaddr': '192.168.2.100', 'protocol': 'FooBarProtocol'},
         {'port': 10002, 'protocol': 'BamService', 'servername': 'George'})
-----
port 10001
protocol FooBarProtocol
servername George
ipaddr 192.168.1.103
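Because nothing is copied, a `ChainMap` acts as a live view over its component mappings.  The sketch below, using a shortened version of the configuration above, shows what that implies for reads and writes; `.new_child()` is the standard way to layer temporary overrides in front.

```python
from collections import ChainMap

localcfg  = {'ipaddr': "192.168.1.103", 'port': 10_001}
globalcfg = {'port': 10_002, 'servername': 'George'}
allcfg = ChainMap(localcfg, globalcfg)

# The chain is a view: a change to a component dict is visible immediately
globalcfg['servername'] = 'Martha'
print(allcfg['servername'])               # Martha

# Assignment through the chain only ever touches the first mapping
allcfg['servername'] = 'Local'
print(localcfg['servername'])             # Local
print(globalcfg['servername'])            # Martha

# new_child() puts a new mapping in front without changing the others
override = allcfg.new_child({'port': 8080})
print(override['port'], allcfg['port'])   # 8080 10001
```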


### Counter

A `Counter` is a special kind of dictionary that maps each key to the number of times it has occurred.  In other literature it is also called a *multi-set*.  Let us look at a few simple examples, then a somewhat larger one.

In [59]:
# For each "event" add it to the count
count = Counter()
count.update(['red', 'blue', 'red', 'green', 'blue', 'blue'])
count

Counter({'red': 2, 'blue': 3, 'green': 1})

We can iterate over each thing, repeated by occurrence count.

In [60]:
for word in count.elements():
    print(word, end=' ')

red red blue blue blue green 

Or update manually how many times something should be assigned to occur.

In [61]:
count['red'] += 5
count['blue'] = 2
count

Counter({'red': 7, 'blue': 2, 'green': 1})

A handy feature is that everything that has not been added has a count of zero.

In [62]:
print("Red things:", count['red'])
print("Chartreuse:", count['chartreuse'])

Red things: 7
Chartreuse: 0


We can remove counts as well.  The `-=` augmented assignment works like `+=`, but more commonly you might "unobserve" a batch of events with `.subtract()`.  Watch out for possible negative counts; they are useful for some purposes, and easy to exclude when not wanted.  The `.elements()` method only includes positive counts.

In [63]:
count.subtract(['red', 'blue', 'red', 'chartreuse'])
count

Counter({'red': 5, 'blue': 1, 'green': 1, 'chartreuse': -1})

In [64]:
count.update(['chartreuse', 'magenta'])
count

Counter({'red': 5, 'blue': 1, 'green': 1, 'chartreuse': 0, 'magenta': 1})
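One way to exclude zero and negative counts is the unary plus operator, which returns a new `Counter` with only the positive entries.  The sketch below starts from counts similar to those above; the `other` counter is made up for illustration.

```python
from collections import Counter

count = Counter({'red': 5, 'blue': 1, 'green': 1, 'chartreuse': -1})
count['magenta'] = 0

# Unary plus returns a new Counter holding only the positive counts
positive = +count
print(sorted(positive.items()))          # [('blue', 1), ('green', 1), ('red', 5)]

# Binary - likewise discards results that are zero or negative ...
other = Counter(red=2, blue=3)           # hypothetical second tally
print(sorted((count - other).items()))   # [('green', 1), ('red', 3)]

# ... whereas .subtract() modifies in place and keeps them
count.subtract(other)
print(count['blue'])                     # -2
```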

Let us look at a larger data source.  The only book/story with the word Python in the title on Project Gutenberg seems to be a 1962 science fiction story by Frederik Pohl, _Plague of Pythons_.  We can use that for its words.

In [65]:
import re
with open('pg51804.txt') as fh:   # close the file once it is read
    pythons = fh.read()
pat = r"""[!"#$%&'()*,-./:;?_]"""
pythons = re.sub(pat, '', pythons.lower())
print(f"Words in story: {len(pythons.split()):,}")

Words in story: 42,383


In [66]:
words = Counter()
words.update(pythons.split())

The 10 most common words in this text are pretty typical of English texts in general.

In [67]:
words.most_common(10)

[('the', 2549),
 ('of', 1087),
 ('and', 1054),
 ('he', 1029),
 ('to', 1028),
 ('a', 1023),
 ('was', 920),
 ('it', 686),
 ('in', 605),
 ('had', 492)]

In [68]:
len(words)

6074

We can get the least common counts by calling `.most_common()` with no limit, and looking at the tail of the resulting list.

In [69]:
words.most_common()[-10:]

[('network', 1),
 ('volunteer', 1),
 ('necessarily', 1),
 ('edition', 1),
 ('pg', 1),
 ('httpwwwgutenbergorg', 1),
 ('includes', 1),
 ('produce', 1),
 ('subscribe', 1),
 ('newsletter', 1)]