# 01 - defaultdict

#### Lecture

The `defaultdict` is a specialized dictionary found in the `collections` module. (It is a subclass of the `dict` type).

In [8]:
from collections import defaultdict

Standard dictionaries in Python will raise an exception if we try to access a non-existent key:

In [1]:
d = {}
d['a']

KeyError: 'a'

We can get around this with the `.get` method, which returns `None` by default or a default value that we specify:

In [2]:
d.get('a', 0)

0

We have already seen that the pattern for counting with dictionaries is the following:

In [6]:
counts = {}
sentence = "able was I ere I saw elba"
for c in sentence:
    counts[c] = counts.get(c, 0) + 1

print(counts)

{'a': 4, 'b': 2, 'l': 2, 'e': 4, ' ': 6, 'w': 2, 's': 2, 'I': 2, 'r': 1}


The issue is that we have to remember to apply this pattern everywhere a missing key might be read. It would be easier to define the default value **once** per dict.

`defaultdict` solves this issue. It has the following structure: `defaultdict(<callable>, <...>)` where
- `<callable>` is called to calculate a default value and must accept zero arguments.
- `<...>` are the arguments you would pass to the regular `dict()` function, e.g. `defaultdict(<callable>, x=1, y=2)` will create a defaultdict with two keys, `x` and `y`, with values 1 and 2.

It's worth noting:
- **accessing** nonexistent keys in a `defaultdict` with `d[key]` will create those keys in the dictionary with the default value. This is in contrast to `.get`, which only reads and never writes. This is true even when we call `.get` on a `defaultdict`.
- the callable is a factory function that is called every time it is **needed**, not when the `defaultdict` is created. So whenever you access a nonexistent key, that's when the callable is called.

In [15]:
d = defaultdict(lambda: 'python', x=1, y=2)
print(d)
d['z']
print(d)

defaultdict(<function <lambda> at 0x00000210B398F5B0>, {'x': 1, 'y': 2})
defaultdict(<function <lambda> at 0x00000210B398F5B0>, {'x': 1, 'y': 2, 'z': 'python'})
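The difference between reading with `.get` and indexing can be seen in a small sketch. The `calls` list and the `make_default` function are made up here purely to record when the factory runs:

```python
from collections import defaultdict

calls = []  # records each time the factory runs (illustration only)

def make_default():
    calls.append('called')
    return 'python'

d = defaultdict(make_default, x=1)

# .get() only reads: no key is created and the factory is not called
print(d.get('z'))        # None
print('z' in d, calls)   # False []

# Indexing a missing key calls the factory and stores the result
print(d['z'])            # python
print('z' in d, calls)   # True ['called']

# The key now exists, so the factory is not called again
d['z']
print(len(calls))        # 1
```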


If we want an empty list as the default value, we can pass `list` as our callable since `list() => []`. Other built-in types also work as **factory functions** (any zero-argument callable works, not only types):
- `int: int() => 0`
- `dict: dict() => {}`
- `set: set() => set()`
- `str: str() => ''`
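As a rough illustration of these factories, the sketch below groups some made-up words by their first letter, once with `list` and once with `set` as the factory:

```python
from collections import defaultdict

words = ['apple', 'avocado', 'banana', 'blueberry', 'apple']

by_letter_list = defaultdict(list)  # missing key -> []
by_letter_set = defaultdict(set)    # missing key -> set()

for word in words:
    by_letter_list[word[0]].append(word)
    by_letter_set[word[0]].add(word)

print(dict(by_letter_list))
# {'a': ['apple', 'avocado', 'apple'], 'b': ['banana', 'blueberry']}
print(len(by_letter_set['a']))  # 2: the set drops the duplicate 'apple'
```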

Here is the better pattern for counting. Since `defaultdict` is a subclass of `dict`, we have access to all `dict` methods like `.items()`, etc.

In [14]:
counts = defaultdict(int)
sentence = "able was I ere I saw elba"

for c in sentence:
    counts[c] += 1

print(counts)

defaultdict(<class 'int'>, {'a': 4, 'b': 2, 'l': 2, 'e': 4, ' ': 6, 'w': 2, 's': 2, 'I': 2, 'r': 1})


#### Coding

##### Example 1

Let's take a look at another example of where a `defaultdict` can be useful.

Suppose we have a dictionary structure that has people's names as keys, and a dictionary for the value that contains the person's eye color. We want to create a dictionary of eye colors, with a list of the people's names that have that eye color. For the data below, we would expect:
```python
{'blue': ['john', 'jill'], 'brown': ['jack'], 'unknown': ['eric', 'michael']}
```

In [17]:
persons = {
    'john': {'age': 20, 'eye_color': 'blue'},
    'jack': {'age': 25, 'eye_color': 'brown'},
    'jill': {'age': 22, 'eye_color': 'blue'},
    'eric': {'age': 35},
    'michael': {'age': 27}
}

Using regular dictionaries and `.get`, here's one approach:

In [21]:
eye_colours = {}

for person, details in persons.items():
    colour = details.get('eye_color', 'unknown')
    if colour not in eye_colours:
        eye_colours[colour] = [person]
    else:
        eye_colours[colour].append(person)

print(eye_colours)

{'blue': ['john', 'jill'], 'brown': ['jack'], 'unknown': ['eric', 'michael']}


Here's the improved approach:

In [22]:
eye_colours = defaultdict(list)
for person, details in persons.items():
    eye_colours[details.get('eye_color', 'unknown')].append(person)

print(eye_colours)

defaultdict(<class 'list'>, {'blue': ['john', 'jill'], 'brown': ['jack'], 'unknown': ['eric', 'michael']})


To simplify further, we can move the complexity of setting 'unknown' into the `persons` dictionary itself; if a person has no specified eye colour, return 'unknown'. One way we could do this is:

In [23]:
persons = {
    'john': defaultdict(lambda: 'unknown', age=20, eye_color='blue'),
    'jack': defaultdict(lambda: 'unknown', age=25, eye_color='brown'),
    'jill': defaultdict(lambda: 'unknown', age=22, eye_color='blue'),
    'eric': defaultdict(lambda: 'unknown', age=35),
    'michael': defaultdict(lambda: 'unknown', age=27)
}

Since each details dict has the same first argument, we can create a partial function which sets this argument implicitly. Our fully improved code becomes:

In [31]:
from functools import partial

details_dict = partial(defaultdict, lambda: 'unknown')  # this returns a version of defaultdict whose first argument is lambda: 'unknown'
persons = {
    'john': details_dict(age=20, eye_color='blue'),
    'jack': details_dict(age=25, eye_color='brown'),
    'jill': details_dict(age=22, eye_color='blue'),
    'eric': details_dict(age=35),
    'michael': details_dict(age=27)
}

eye_colours = defaultdict(list)
for person, details in persons.items():
    eye_colours[details['eye_color']].append(person)

print(eye_colours)

defaultdict(<class 'list'>, {'blue': ['john', 'jill'], 'brown': ['jack'], 'unknown': ['eric', 'michael']})


##### Example 2

So far, the factory function passed to `defaultdict` has been deterministic. Since it's called whenever we hit a nonexistent key, it can also be nondeterministic: it could query a database, call an API, etc. In this example, the factory returns the current datetime, so each new key gets a different value.

In this example we want to keep track of how many times certain functions are called, as well as when they were **first** called. To do this, I want a decorator I can apply to the functions I want to track, and I want to keep a reference to the dictionary it writes to so that I can examine the results.

In [53]:
from collections import defaultdict, namedtuple
from datetime import datetime
from functools import wraps

Stats = namedtuple('Stats', 'decorator data')
d = defaultdict(lambda: {"count": 0, "first_called": datetime.utcnow()})

def decorator(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        d[fn.__name__]['count'] += 1
        return fn(*args, **kwargs)
    return wrapper

stats = Stats(decorator=decorator, data=d)        

In [54]:
@stats.decorator
def func_1():
    pass

@stats.decorator
def func_2(x, y):
    pass

We expect `stats.data` to be empty because the functions haven't been called yet, and `defaultdict` won't call the factory function until a missing key is accessed.

In [55]:
stats.data

defaultdict(<function __main__.<lambda>()>, {})

In [56]:
func_1()
func_1()
func_1()

In [57]:
stats.data

defaultdict(<function __main__.<lambda>()>,
            {'func_1': {'count': 3,
              'first_called': datetime.datetime(2024, 6, 10, 17, 6, 4, 971158)}})

Waiting some time before calling the line below...

In [58]:
func_2(10, 20)

In [59]:
stats.data

defaultdict(<function __main__.<lambda>()>,
            {'func_1': {'count': 3,
              'first_called': datetime.datetime(2024, 6, 10, 17, 6, 4, 971158)},
             'func_2': {'count': 1,
              'first_called': datetime.datetime(2024, 6, 10, 17, 6, 8, 634088)}})

# 02 - OrderedDict

Prior to Python 3.7, dictionary key order was not guaranteed. It became part of the language in 3.7, so the usefulness of `OrderedDict` is diminished. It is still necessary if you want your dictionaries to maintain key order **and** be compatible with Python versions earlier than 3.7 (technically dicts are ordered in 3.6 as well, but that was considered an implementation detail and not actually guaranteed).

With that being said, `OrderedDict` does have additional order-related functionality such as:
- reverse iteration: `reversed(d)` (note that this also works on regular dicts since Python 3.8)
- popping first or last item: `d.popitem(last=False)` or `d.popitem(last=True)`
- move item to end: `d.move_to_end('second_key')`
- move item to beginning: `d.move_to_end('second_key', last=False)`

Also, equality comparison (`==`) between two `OrderedDict` objects is `True` only if the items **and** their order are identical, unlike regular dictionaries where order doesn't matter.

All of this is straightforward to demonstrate with code, so the full walkthrough is omitted here. Refer back to the original notebook if needed.
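As a quick reference, here is a minimal sketch of the operations listed above (the keys and values are made up for illustration):

```python
from collections import OrderedDict

d = OrderedDict(a=1, b=2, c=3)

d.move_to_end('a')                 # order is now b, c, a
print(list(d))                     # ['b', 'c', 'a']

d.move_to_end('c', last=False)     # order is now c, b, a
print(list(d))                     # ['c', 'b', 'a']

print(d.popitem(last=False))       # ('c', 3): removes the first item
print(d.popitem())                 # ('a', 1): removes the last item

# Equality between two OrderedDicts is order-sensitive
print(OrderedDict(x=1, y=2) == OrderedDict(y=2, x=1))   # False
# Plain dicts ignore order
print(dict(x=1, y=2) == dict(y=2, x=1))                 # True
```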

# 03 - OrderedDict vs Plain Dict

This section shows how to achieve the same functionality as `OrderedDict` using a regular dictionary instead, constrained to **only performing mutations, not creating new dicts**. For example:

- we can get the first key with `next(iter(d2.keys()))`
- we can remove and return the last item with `d2.popitem()`, or read the last key without removing it with `next(reversed(d2))` (Python 3.8+)

Alternatively, we can convert the dict views into a list and get the first/last.

Since this is unlikely to be useful in practice, I've left the details out here too. Refer back to the original notebook if needed.
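For reference, one possible set of plain-dict equivalents is sketched below. These are assumptions about how the operations could be emulated, not necessarily the approach taken in the original notebook:

```python
d = {'a': 1, 'b': 2, 'c': 3}

first_key = next(iter(d))        # 'a'
last_key = next(reversed(d))     # 'c' (reversed() on dicts needs Python 3.8+)

# Equivalent of move_to_end('a'): pop the key and re-insert it
d['a'] = d.pop('a')
print(list(d))                   # ['b', 'c', 'a']

# Equivalent of popitem(last=False): remove the first key
first = next(iter(d))
value = d.pop(first)
print(first, value, list(d))     # b 2 ['c', 'a']

# Equivalent of popitem(last=True)
print(d.popitem())               # ('a', 1)
```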

# 04 - Counter

#### Lecture

We've already seen how we can use `defaultdict` to create a counter.

In [12]:
from collections import defaultdict

sentence = 'able was i ere i saw elba'
d = defaultdict(int)

for c in sentence:
    d[c] += 1

print(d)

defaultdict(<class 'int'>, {'a': 4, 'b': 2, 'l': 2, 'e': 4, ' ': 6, 'w': 2, 's': 2, 'i': 2, 'r': 1})


But there are certain counter operations that are tedious to do manually, for example:
- count the frequency of items in an iterable
- update one counter dictionary with another counter dictionary
- over multiple counter dictionaries, find the min/max counter value for each key that's present in all counter dictionaries.

`collections.Counter` allows us to do this.

`Counter` is a specialised dictionary with certain properties:
- it is (Python's implementation of) a multiset, which is a set that can contain repeated elements. The frequency of each element is its **multiplicity**, and the **cardinality** of the multiset is the sum of the multiplicities.
- supports the same constructor options as regular dicts, and also accepts an iterable of items to count.
- acts like a `defaultdict` with a default value of 0, although reading a missing key does not insert it.
- `update()` performs an in-place addition of counts as opposed to updating the value of a key.
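These properties can be checked with a short sketch (the input string is arbitrary):

```python
from collections import Counter

c = Counter('aab')          # the multiset {a, a, b}
print(c['a'])               # 2: the multiplicity of 'a'
print(sum(c.values()))      # 3: the cardinality of the multiset

# Missing keys read as 0 but, unlike with defaultdict, are not inserted
print(c['z'])               # 0
print('z' in c)             # False

# update() adds counts rather than overwriting them
c.update({'a': 10})
print(c['a'])               # 12
```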

#### Coding

##### Example 1: Frequency of items in an iterable

This one is identical to `defaultdict`:

In [3]:
from collections import Counter

sentence = 'able was i ere i saw elba'
counter = Counter()

for c in sentence:
    counter[c] += 1

print(counter)

Counter({' ': 6, 'a': 4, 'e': 4, 'b': 2, 'l': 2, 'w': 2, 's': 2, 'i': 2, 'r': 1})


But we can also do it this way:

In [16]:
counter = Counter('able was i ere i saw elba')
print(counter)

Counter({' ': 6, 'a': 4, 'e': 4, 'b': 2, 'l': 2, 'w': 2, 's': 2, 'i': 2, 'r': 1})


In [21]:
import random
counter = Counter([random.randint(0, 10) for _ in range(1_000)])
print(counter)

Counter({7: 114, 10: 103, 5: 100, 8: 94, 9: 90, 1: 89, 3: 88, 0: 85, 6: 84, 2: 77, 4: 76})


##### Example 2: n most common items

Let's find the `n` most common words (by frequency) in a paragraph of text. Words are considered delimited by white space or punctuation marks such as `.`, `,`, `!`, etc. - basically anything except a letter or a digit.
Doing this properly is quite difficult, so we'll use a close enough approximation that covers most cases, using a regular expression:

In [22]:
import re
sentence = '''
On the real line, there are functions to compute uniform, normal (Gaussian), lognormal, negative exponential, gamma, 
and beta distributions. For generating distributions of angles, the von Mises distribution is available. 
Almost all module functions depend on the basic function random(), which generates a random float uniformly in the semi-open range [0.0, 1.0). 
'''
words = re.split(r'\W', sentence)  # raw string avoids an invalid escape sequence warning
counter = Counter(words)
print(counter)

Counter({'': 23, 'the': 4, '0': 3, 'functions': 2, 'distributions': 2, 'random': 2, 'On': 1, 'real': 1, 'line': 1, 'there': 1, 'are': 1, 'to': 1, 'compute': 1, 'uniform': 1, 'normal': 1, 'Gaussian': 1, 'lognormal': 1, 'negative': 1, 'exponential': 1, 'gamma': 1, 'and': 1, 'beta': 1, 'For': 1, 'generating': 1, 'of': 1, 'angles': 1, 'von': 1, 'Mises': 1, 'distribution': 1, 'is': 1, 'available': 1, 'Almost': 1, 'all': 1, 'module': 1, 'depend': 1, 'on': 1, 'basic': 1, 'function': 1, 'which': 1, 'generates': 1, 'a': 1, 'float': 1, 'uniformly': 1, 'in': 1, 'semi': 1, 'open': 1, 'range': 1, '1': 1})


In [23]:
counter.most_common(5)

[('', 23), ('the', 4), ('0', 3), ('functions', 2), ('distributions', 2)]
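The empty string tops the list because `re.split` produces an empty string between every pair of adjacent delimiters. One possible way around this, sketched below on a made-up sentence, is to match the words themselves with `re.findall` instead of splitting on the delimiters. Lower-casing the text is an extra assumption added here so that `The` and `the` are counted together:

```python
import re
from collections import Counter

text = 'The cat and the hat. The end!'

# \w+ matches runs of word characters, so no empty strings are produced
words = re.findall(r'\w+', text.lower())
counter = Counter(words)

print(counter.most_common(1))  # [('the', 3)]
```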

##### Example 3: Iterating through repeated elements

As we already know, we can't have identical keys in a dictionary so when we iterate through the keys view, we are guaranteed to get unique elements. 

Iterating a `Counter` directly behaves the same way, but its `.elements()` method lets us iterate through each element repeated as many times as its count:

In [41]:
counter = Counter('abbazyyyxa')
print(counter)

for c in counter.elements():
    print(c)

Counter({'a': 3, 'y': 3, 'b': 2, 'z': 1, 'x': 1})
a
a
a
b
b
z
y
y
y
x


##### Example 4: Aggregating Counters with `.update()` and `.subtract()`

Remember that the values are frequencies. Unlike with regular dictionaries, `Counter.update()` does not replace one value with another: it adds the incoming counts to the existing ones, and `Counter.subtract()` subtracts them. Both **mutate** the original `Counter`.

In [4]:
c1 = Counter(a=1, b=2, c=3)
c2 = Counter(b=1, c=2, d=3)

c1.update(c2)
print(c1)

Counter({'c': 5, 'b': 3, 'd': 3, 'a': 1})


In [5]:
c1 = Counter(a=1, b=2, c=3)
c2 = Counter(b=1, c=2, d=3)

c1.subtract(c2)
print(c1)

Counter({'a': 1, 'b': 1, 'c': 1, 'd': -3})


`.update()` also accepts an iterable, whose elements are counted and added:

In [7]:
c1 = Counter('aabbcc')
c1.update('abcdefgh')
print(c1)

Counter({'a': 3, 'b': 3, 'c': 3, 'd': 1, 'e': 1, 'f': 1, 'g': 1, 'h': 1})


##### Example 5: Aggregating Counters with mathematical operations

These `Counter` objects also support several other mathematical operations when both operands are `Counter` objects. In all these cases the result is a new `Counter` object.

* `+`: same as `update`, but returns a new `Counter` object instead of an in-place update.
* `-`: subtracts one counter from another, but discards zero and negative values
* `&`: keeps the **minimum** of the key values
* `|`: keeps the **maximum** of the key values

Aggregating with these operators is **non-mutating**:

In [10]:
c1 = Counter('aabbcc')
c2 = Counter('abc')
c1 + c2

Counter({'a': 3, 'b': 3, 'c': 3})

In [11]:
c1 = Counter('aabbcc')
c2 = Counter('abc')
c1 - c2

Counter({'a': 1, 'b': 1, 'c': 1})

We can also get the minimum of the key values across counters with `&`:

In [12]:
c1 = Counter(a=5, b=1)
c2 = Counter(a=1, b=10)
c1 & c2

Counter({'a': 1, 'b': 1})

We can get the maximum with `|`:

In [13]:
c1 = Counter(a=5, b=1)
c2 = Counter(a=1, b=10)
c1 | c2

Counter({'b': 10, 'a': 5})

The **unary** `+` can also be used to remove any non-positive count from the Counter:

In [15]:
c1 = Counter(a=10, b=-20, c=0)
+c1

Counter({'a': 10})

The **unary** `-` changes the sign of each count, and removes any non-positive result:

In [16]:
c1 = Counter(a=10, b=-20, c=0)
-c1

Counter({'b': 20})
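The difference between the `-` operator and the `.subtract()` method is easy to miss, so here is a small sketch contrasting the two. The widget names and counts are made up for illustration:

```python
from collections import Counter

sold = Counter(cable=2, case=5)
refunded = Counter(cable=3, case=1)

# Operator: returns a new Counter and drops zero/negative results
print(sold - refunded)          # Counter({'case': 4})

# Method: mutates in place and keeps negative counts
net = Counter(sold)             # copy so that `sold` is left untouched
net.subtract(refunded)
print(net)                      # Counter({'case': 4, 'cable': -1})
```

Which one is appropriate depends on whether a negative net count is meaningful for the data at hand.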

##### Example 6: Top selling widgets

Let's assume you are working for a company that produces different kinds of widgets.
You are asked to identify the top 3 best selling widgets.

You have two separate data sources - one data source can give you a history of all widget orders (widget name, quantity), while another data source can give you a history of widget refunds (widget name, quantity refunded).

From these two data sources, you need to determine the top selling widgets (taking refunds into account of course).

Let's simulate both of these lists:

In [18]:
import random
random.seed(0)

widgets = ['battery', 'charger', 'cable', 'case', 'keyboard', 'mouse']

orders = [(random.choice(widgets), random.randint(1, 5)) for _ in range(100)]
refunds = [(random.choice(widgets), random.randint(1, 3)) for _ in range(20)]

In [19]:
for order in orders[:3]:
    print(order)

('case', 4)
('battery', 3)
('keyboard', 4)


**Solution 1**

In [20]:
sold_counter = Counter()
refund_counter = Counter()

for order in orders:
    sold_counter[order[0]] += order[1]

for refund in refunds:
    refund_counter[refund[0]] += refund[1]

In [21]:
net_counter = sold_counter - refund_counter
net_counter.most_common(3)

[('keyboard', 58), ('battery', 54), ('mouse', 39)]

**Solution 2**

We know that `Counter` can take an iterable (iterators and generators included), so we want to feed it something equivalent to:
```python
Counter(['case', 'case', 'case', 'case', 'battery', 'battery', 'battery', ...])
```
Each order `(name, quantity)` can be expanded with `itertools.repeat(name, quantity)`. The generator expression below gives us a single iterable (generator) of iterables, which can be flattened by passing it into `itertools.chain.from_iterable(<iterable>)`:
```python
(repeat(*order) for order in orders)
```

In [40]:
from itertools import chain, repeat

repeat_orders_iter = (repeat(*order) for order in orders) 
sold_counter = Counter(chain.from_iterable(repeat_orders_iter))

repeat_refunds_iter = (repeat(*refund) for refund in refunds) 
refund_counter = Counter(chain.from_iterable(repeat_refunds_iter))  # must use the refunds generator; repeat_orders_iter is already exhausted

net_counter = sold_counter - refund_counter
print(net_counter.most_common(3))

[('keyboard', 58), ('battery', 54), ('mouse', 39)]


Alternatively, we could eagerly unpack the single iterable and use `itertools.chain(<iter_1>, <iter_2>, ...)`:

In [41]:
from itertools import chain, repeat

repeat_orders_iter = (repeat(*order) for order in orders) 
sold_counter = Counter(chain(*repeat_orders_iter))

repeat_refunds_iter = (repeat(*refund) for refund in refunds) 
refund_counter = Counter(chain(*repeat_refunds_iter))  # must use the refunds generator; repeat_orders_iter is already exhausted

net_counter = sold_counter - refund_counter
print(net_counter.most_common(3))

[('keyboard', 58), ('battery', 54), ('mouse', 39)]


# 05 - ChainMap

#### Lecture

`collections.ChainMap` is the dictionary equivalent of `itertools.chain`, but it is **not** a subclass of `dict`.

With `chain`, we could "concatenate" multiple iterables and iterate through them as if they were one large iterable. The combined iterable never duplicates any data; it simply iterates through the underlying iterables in turn.

`ChainMap` is the same - we can chain multiple dictionaries without duplicating any data. 

It also behaves like a **view**. If we start mutating the original underlying dictionaries during iteration, the `ChainMap` dictionary will reflect those changes. Alternatively, if we modify the `ChainMap`, this will modify the underlying dictionaries. In fact, it will only modify one specific dictionary called the **Child** (see below).

##### Comparison to regular dicts

- `d = {**d1, **d2, **d3}` builds a new dict with its own entry for every key in `d1`, `d2` and `d3`, so it needs extra storage on top of the originals.
- Unpacking also **merges**, so if two dictionaries contain identical keys, **last "update" wins**. With `ChainMap`, **first key wins** (see below).
- Iterating a `ChainMap` does **not** follow the order in which the maps were passed: keys are collected by scanning the maps from last to first, as the output below shows.

In [56]:
from collections import ChainMap

d1 = {'a': 1, 'b': 2}
d2 = {'b': 20, 'c': 3}
d3 = {'c': 30, 'd': 4}

d = ChainMap(d1, d2, d3)
for k, v in d.items():
    print(k, v)

c 3
d 4
b 2
a 1
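To make the contrast with unpacking concrete, here is a minimal sketch showing both the "first key wins" rule and the view-like behaviour (the dictionaries are made up):

```python
from collections import ChainMap

d1 = {'a': 1, 'b': 2}
d2 = {'b': 20, 'c': 3}

merged = {**d1, **d2}       # a new dict: last value for 'b' wins
chained = ChainMap(d1, d2)  # a view: first value for 'b' wins

print(merged['b'])    # 20
print(chained['b'])   # 2

# Changes to an underlying dict show up in the ChainMap, not in the copy
d2['c'] = 300
print(merged['c'])    # 3
print(chained['c'])   # 300
```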


##### Parent-Child Relationships

In `ChainMap(d1, d2, d3)`, the first dict is called the **child** and all subsequent dictionaries are **parents**. Recalling that **first key wins**, `d1` overrides `d2` which overrides `d3`.

There are specific attributes to deal with these relationships:

- `d.parents`: a new `ChainMap` containing the **parent** maps only; equivalent to `ChainMap(d2, d3)`
- `d.new_child(d0)`: returns a new `ChainMap` with `d0` placed in front as the new child, so it overrides all others; equivalent to `ChainMap(d0, d1, d2, d3)`. `d` itself is not modified.
- `d.maps`: a **mutable** list representing the hierarchy. Any list operation performed on it applies to the `ChainMap`, which makes it the preferable way of mutating the relationships in place:
```
d = ChainMap(d1, d2, d3) --> ChainMap(d1, d2, d3)
d.maps.append(d4)        --> ChainMap(d1, d2, d3, d4)
d.maps.insert(0, d0)     --> ChainMap(d0, d1, d2, d3, d4) 
```
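A runnable sketch of these three attributes, using small made-up dictionaries:

```python
from collections import ChainMap

d1 = {'a': 1}
d2 = {'a': 2, 'b': 2}
d3 = {'c': 3}

d = ChainMap(d1, d2, d3)

print(d.parents)               # ChainMap({'a': 2, 'b': 2}, {'c': 3})
print(d.parents['a'])          # 2: d1 is skipped

child = d.new_child({'a': 0})  # returns a new ChainMap; d is unchanged
print(child['a'], d['a'])      # 0 1

d.maps.append({'z': 26})       # mutates d in place
print(d['z'])                  # 26
print(len(d.maps))             # 4
```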

##### Mutating Maps via the ChainMap

We can mutate key-value pairs through the `ChainMap` itself:
```python
d = ChainMap(d1, d2)
d[key] = value
```
**But** these mutations affect the underlying **child** map **only**.

In [58]:
d1 = {'a': 1, 'b': 2}
d2 = {'a': 20, 'c': 3}

d = ChainMap(d1, d2)

d['a'] = 100  # d1 has been mutated; d2 unaffected
print(d1)

d['c'] = 200 # d1 has been mutated EVEN THOUGH d2 contains a 'c' key; d2 unaffected.
print(d1)

{'a': 100, 'b': 2}
{'a': 100, 'b': 2, 'c': 200}


If we try to delete a key that's present in a parent but not the child, we'll get a `KeyError` exception:

In [59]:
d1 = {'a': 1, 'b': 2}
d2 = {'a': 20, 'c': 3}

d = ChainMap(d1, d2)

del d['c']

KeyError: "Key not found in the first mapping: 'c'"

#### Example: Temporarily modifying a config

Let's say we have a dictionary with some settings and we want to temporarily modify these settings, but without modifying the original dictionary.

We could certainly copy the dictionary and work with the copy, discarding the copy when we no longer need it - but again this incurs some overhead copying all the data.

Instead we can use a chain map this way, by making the first dictionary in the chain a new empty dictionary - any updates we make will be made to that dictionary only, thereby preserving the other dictionaries.

In [62]:
config = {
    'host': 'prod.deepdive.com',
    'port': 5432,
    'database': 'deepdive',
    'user_id': '$pg_user',
    'user_pwd': '$pg_pwd'
}

In [63]:
local_config = ChainMap({}, config)

Now let's mutate the child `{}` which will override the parent (config):

In [64]:
local_config['user_id'] = 'test'
local_config['user_pwd'] = 'test'

list(local_config.items())

[('host', 'prod.deepdive.com'),
 ('port', 5432),
 ('database', 'deepdive'),
 ('user_id', 'test'),
 ('user_pwd', 'test')]

The original config remains unchanged:

In [65]:
print(config)

{'host': 'prod.deepdive.com', 'port': 5432, 'database': 'deepdive', 'user_id': '$pg_user', 'user_pwd': '$pg_pwd'}


This can be extended to having configs with defined hierarchies e.g. server, dev, prod, test, global etc.
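A rough sketch of such a hierarchy is shown below. The layer names and settings are hypothetical and chosen only to illustrate the lookup order:

```python
from collections import ChainMap

# Hypothetical layers, from lowest to highest priority
global_defaults = {'host': 'localhost', 'port': 5432, 'debug': False}
prod_settings = {'host': 'prod.deepdive.com'}
test_overrides = {'debug': True}

# Earlier maps win, so list the highest-priority layer first
settings = ChainMap(test_overrides, prod_settings, global_defaults)

print(settings['host'])    # prod.deepdive.com
print(settings['port'])    # 5432
print(settings['debug'])   # True

# Dropping the first layer falls back to the remaining hierarchy
print(settings.parents['debug'])  # False
```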

# 06 - UserDict

If we wanted to create our own version of a dictionary with custom rules built in, we could **subclass `dict`** and override the relevant dunder methods to customise the behaviour.

There are some caveats to this approach however:

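Overriding `__getitem__` does change the behaviour of `d['a']`, and overriding `__setitem__` does change `d['a'] = 10`, but the overrides are not picked up by `.get()` and `.update()`. The reason is that `dict` is a **built-in type** whose methods are implemented in C and access the underlying data directly instead of calling our Python-level dunder methods. Similarly, for a built-in `str`, `len(string)` reads the length directly rather than going through a Python-level `__len__` call.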

**Alternative, better approach**

We can instead use `collections.UserDict`, which is *not* a subclass of `dict` but a **mapping type** that wraps a regular dict.

It implements:
- all key functionality that we have with dictionaries
- views
- `__getitem__` and `__setitem__` and uses them internally for `.get()` and `.update()`
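The difference can be seen in a minimal sketch. The two classes below are hypothetical examples that try to coerce every stored value to `int`:

```python
from collections import UserDict

class IntDict(dict):
    """dict subclass that tries to coerce every value to int."""
    def __setitem__(self, key, value):
        super().__setitem__(key, int(value))

class IntUserDict(UserDict):
    """Same idea, built on UserDict."""
    def __setitem__(self, key, value):
        super().__setitem__(key, int(value))

d = IntDict()
d['a'] = '10'          # goes through our __setitem__
d.update(b='20')       # dict.update bypasses it
print(d)               # {'a': 10, 'b': '20'}

u = IntUserDict()
u['a'] = '10'
u.update(b='20')       # UserDict.update calls __setitem__
print(u)               # {'a': 10, 'b': 20}
```

With the `dict` subclass, every method that writes data would need its own override to enforce the rule; with `UserDict`, overriding `__setitem__` is enough in this sketch.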