# Python Deep Dive Part 3 - Dictionaries, Sets, JSON

## Associative arrays

### Associative arrays

* In computer science, an associative array, map, symbol table, or dictionary is an abstract data type that stores a collection of (key, value) pairs, such that each possible key appears at most once in the collection.
* The name does not come from the associative property known in mathematics. Rather, it arises from the association of values with keys. It is not to be confused with associative processors.
* The dictionary problem is the classic problem of designing efficient data structures that implement associative arrays. The two major solutions to the dictionary problem are *hash tables* and *search trees*. It is sometimes also possible to solve the problem using directly addressed arrays, binary search trees, or other more specialized structures.
* In an associative array, the association between a key and a value is often known as a "mapping"; the same word may also be used to refer to the process of creating a new association.

Operations

* The operations that are usually defined for an associative array are:
  * adding / removing elements
  * looking up a value via key
  * modifying an associated value

Implementation comparison

| Underlying data structure | Lookup or Removal (average) | Lookup or Removal (worst case) | Insertion (average) | Insertion (worst case) | Ordered |
|---|---|---|---|---|---|
| Hash table | O(1) | O(n) | O(1) | O(n) | No |
| Self-balancing binary search tree | O(log n) | O(log n) | O(log n) | O(log n) | Yes |
| Unbalanced binary search tree | O(log n) | O(n) | O(log n) | O(n) | Yes |
| Sequential container of key–value pairs (e.g., association list)| O(n) | O(n) | O(1) | O(1) | No |

Ordered dictionary

* The basic definition of a dictionary does not mandate an order. To guarantee a fixed order of enumeration, ordered versions of the associative array are often used. There are two senses of an ordered dictionary:
  * The order of enumeration is always deterministic for a given set of keys by sorting. This is the case for tree-based implementations, one representative being the `std::map` container of C++.
  * The order of enumeration is key-independent and is instead based on the order of insertion. This is the case for the "ordered dictionary" in .NET Framework, the `LinkedHashMap` of Java, and the built-in `dict` of Python (guaranteed since Python 3.7).
* The latter is more common. Such ordered dictionaries can be implemented using an association list, by overlaying a doubly linked list on top of a normal dictionary, or by moving the actual data out of the sparse (unordered) array and into a dense insertion-ordered one.
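
The two senses of ordering can be seen with built-in Python features alone. This is a minimal sketch; the `prices` dictionary and its contents are made up for illustration.

```python
# Hypothetical data, inserted in a deliberately unsorted order.
prices = {}
prices['pear'] = 3
prices['apple'] = 1
prices['mango'] = 2

# Insertion order: this is how a Python dict enumerates its keys.
insertion_order = list(prices)

# Key-sorted order: in Python this has to be requested explicitly.
sorted_order = sorted(prices)

# Reassigning an existing key keeps its original position.
prices['pear'] = 30
after_update = list(prices)

print(insertion_order)  # ['pear', 'apple', 'mango']
print(sorted_order)     # ['apple', 'mango', 'pear']
print(after_update)     # ['pear', 'apple', 'mango']
```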

Dictionaries in Python

* Dictionaries are everywhere in Python
  * modules
  * classes
  * objects
  * scopes
  * sets
  * custom dictionaries
* It is one of the most important data structures in Python.

### Hash maps

* One common concrete implementation of an associative array is a hash map.
* In computing, a hash table, also known as a hash map or a hash set, is a data structure that implements an associative array, also called a dictionary, which is an abstract data type that maps keys to values.
* A hash table uses a *hash function* to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found.
* During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored.
* Ideally, the hash function will assign each key to a unique bucket, but most hash table designs employ an imperfect hash function, which might cause *hash collisions* where the hash function generates the same index for more than one key. Such collisions are typically accommodated in some way.
* Hashing is an example of a space-time tradeoff.
  * If memory is infinite, the entire key can be used directly as an index to locate its value with a single memory access.
  * If infinite time is available, values can be stored without regard for their keys, and a binary search or linear search can be used to retrieve the element.
* In many situations, hash tables turn out to be on average more efficient than search trees or any other table lookup structure. For this reason, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.
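
To make the bucket idea concrete, here is a toy hash map that resolves collisions by *separate chaining* (each bucket holds a small list of pairs). The class name `ChainedHashMap` and its methods are invented for this sketch. Python's own `dict` resolves collisions differently, by probing for another slot, as described later in these notes.

```python
class ChainedHashMap:
    """Toy hash map using separate chaining; for illustration only."""

    def __init__(self, n_buckets=8):
        # Each bucket is a list of (key, value) pairs that share an index.
        self._buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash value is reduced to a valid bucket index.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # key already present: overwrite
                return
        bucket.append((key, value))       # new key (possibly a collision)

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default


m = ChainedHashMap(n_buckets=2)  # tiny table, so collisions are certain
for i in range(5):
    m.put(i, i * i)
m.put(3, 'three')

print(m.get(3))    # three
print(m.get(4))    # 16
print(m.get(99))   # None
```

With only two buckets every lookup degenerates into a short linear scan, which is the O(n) worst case from the comparison table above.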

Hash function

* A hash function is any function that can be used to map data of arbitrary size to fixed-size values, though there are some hash functions that support variable length output.
* The values returned by a hash function are called hash values, hash codes, hash digests, digests, or simply hashes.
* The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.
* Hash functions and their associated hash tables are used in data storage and retrieval applications to access data in a small and nearly constant time per retrieval.
* Hashing is a computationally and storage space-efficient form of data access that avoids the non-constant access time of ordered and unordered lists and structured trees, and the often exponential storage requirements of direct access of state spaces of large or variable-length keys.
* A good hash function satisfies two basic properties:
  * it should be very fast to compute;
  * it should minimize duplication of output values (collisions).
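
The second property can be illustrated by comparing two hand-written hash functions. Both `poor_hash` and `better_hash` are hypothetical helpers written for this sketch, not functions from the standard library, and the word list is arbitrary.

```python
def poor_hash(s, n_buckets):
    # Uses only the string length, so many different strings collide.
    return len(s) % n_buckets


def better_hash(s, n_buckets):
    # Polynomial rolling hash: every character and its position contributes.
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % n_buckets
    return h


words = ['cat', 'dog', 'cow', 'pig', 'hen', 'ant', 'bee', 'owl']
n_buckets = 16

poor = {poor_hash(w, n_buckets) for w in words}
better = {better_hash(w, n_buckets) for w in words}

print(len(poor))    # 1: all three-letter words land in the same bucket
print(len(better))  # number of distinct buckets used by the rolling hash
```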

### Python dictionaries

* Python dictionaries are ubiquitous.
  * namespaces
  * classes
  * modules
  * functions
  * sets
  * custom dictionaries
* Dictionaries are such an important part of Python that a lot of time and effort was put into making them as efficient as possible.

PEP 412: Key-sharing dictionary

* This PEP proposes a change in the implementation of the builtin dictionary type `dict`. The new implementation allows dictionaries which are used as attribute dictionaries (the `__dict__` attribute of an object) to share keys with other attribute dictionaries of instances of the same class.

Compact dictionary

* Python 3.6 introduced a more memory-efficient implementation of the `dict` type based on [a proposal by Raymond Hettinger](https://mail.python.org/pipermail/python-dev/2012-December/123028.html).
* The memory usage of the new dict() is between 20% and 25% smaller compared to Python 3.5.

### Python's `hash()` function

* `hash(object)`
  * Return the hash value of the object (if it has one).
  * Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup.
  * Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for `1` and `1.0`).
  * For objects with custom `__hash__()` methods, note that `hash()` truncates the return value based on the bit width of the host machine (`sys.hash_info.width`).
  * Hash values for objects that compare equal remain equal during program run, but they can *change from run to run*, so you should never rely on a hash value being the same from one program run to another.

In [8]:
import sys

print(sys.hash_info.width)
print(f'{hash(1)=}')
print(f'{hash(1.0)=}')
print(f'{hash(2)=}')
print(f'{hash(2.1)=}')
print(f'{hash('1')=}')  # reusing the same quote type inside an f-string requires Python 3.12+
print(f'{hash((1, 2))=}')
print(f'{hash(frozenset((1, 2)))=}')
# print(f'{hash([1, 2])=}') # unhashable type: 'list'
# print(f'{hash((0, [1, 2]))=}') # unhashable type: 'list'

64
hash(1)=1
hash(1.0)=1
hash(2)=2
hash(2.1)=230584300921369602
hash('1')=-2520627851638319182
hash((1, 2))=-3550055125485641917
hash(frozenset((1, 2)))=-1826646154956904602


## Dictionaries

### Creating dictionaries

Dictionary elements

* Basic structure of dictionary elements: key value pairs.
  * value -> any Python object
  * key -> any hashable object, because a hash table requires the hash of a key to stay constant during the key's lifetime

Hashable objects

* int, float, complex, bytes, Decimal, Fraction -> immutable -> hashable
* string -> immutable collection -> hashable
* frozenset -> immutable collection, its elements are required to be hashable -> hashable
* tuple -> immutable collection -> hashable only if all elements are also hashable
* set, dictionary -> mutable collections -> unhashable
* list -> mutable collection -> unhashable
* function -> hashable (hashed by identity, like other objects without a custom `__eq__`)
* custom class and object -> maybe hashable

Requirements for hash functions

* If an object is hashable:
  * the hash of the object must be an integer value
  * if two objects compare equal, the hashes must be equal
* Two objects that do not compare equal may still have the same hash (hash collision).
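
A short demonstration of both rules. The collision between `-1` and `-2` is a CPython implementation detail (CPython reserves `-1` as an internal error code), so treat that part as an assumption about the interpreter in use.

```python
# Objects that compare equal have equal hashes, even across types.
print(1 == 1.0 == True)                       # True
print(hash(1) == hash(1.0) == hash(True))     # True

# Consequently they act as one and the same dictionary key.
d = {1: 'int'}
d[1.0] = 'float'   # overwrites the value, keeps the original key object
d[True] = 'bool'
print(d)           # {1: 'bool'}

# The converse does not hold: unequal objects may share a hash.
print(hash(-1) == hash(-2), -1 == -2)         # True False (CPython-specific)

# A dictionary still keeps colliding keys apart because it also checks equality.
collided = {-1: 'a', -2: 'b'}
print(collided)                               # {-1: 'a', -2: 'b'}
```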

Ways of creating dictionaries

* Use a comma-separated list of `key: value` pairs within braces: `{'jack': 4098, 'sjoerd': 4127}` or `{4098: 'jack', 4127: 'sjoerd'}`
* Use a dict comprehension: `{}`, `{x: x ** 2 for x in range(10)}`
* Use the type constructor: `dict()`, `dict([('foo', 100), ('bar', 200)])`, `dict(foo=100, bar=200)`, `dict(zip(['one', 'two', 'three'], [1, 2, 3]))`, `dict({'one': 1, 'three': 3}, two=2)`
  * `class dict(**kwargs)`
  * `class dict(mapping, **kwargs)`
  * `class dict(iterable, **kwargs)`
  * If keyword arguments are given, the keyword arguments and their values are added to the dictionary created from the positional argument.
    * If a key being added is already present, the value from the keyword argument replaces the value from the positional argument.
    * Keyword arguments only work for keys that are valid Python *identifiers*; each dictionary key is then the string of that name.
* `classmethod fromkeys(iterable[, value])`
  * Create a new dictionary with keys from `iterable` and values set to `value`.
  * `fromkeys()` is a class method that returns a new dictionary. `value` defaults to `None`. All of the values refer to just a single instance, so it generally doesn’t make sense for value to be a mutable object such as an empty list. To get distinct values, use a dict comprehension instead.
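
The shared-instance pitfall of `fromkeys()` is easy to reproduce. A minimal sketch with made-up keys:

```python
keys = ['a', 'b', 'c']

# Every key refers to the *same* list object.
shared = dict.fromkeys(keys, [])
shared['a'].append(1)
print(shared)    # {'a': [1], 'b': [1], 'c': [1]}

# A dict comprehension evaluates `[]` once per key, giving distinct lists.
distinct = {k: [] for k in keys}
distinct['a'].append(1)
print(distinct)  # {'a': [1], 'b': [], 'c': []}
```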

In [11]:
import math

d1 = {'one': 1, 'two': 2, 'three': 3}
d2 = dict(one=1, two=2, three=3)
d3 = dict((('one', 1), ('two', 2), ('three', 3)))
d4 = dict(zip(('one', 'two', 'three'), (1, 2, 3)))
d5 = dict({'one': 1, 'two': 2}, three=3)
d6 = {k: v for k, v in zip(('one', 'two', 'three'), (1, 2, 3))}
print(d1 == d2 == d3 == d4 == d5 == d6)
d7 = dict.fromkeys(i for i in range(3)) # you can omit the outer parentheses when a generator expression is used as the sole argument to a function
print(f'{d7=}')
d8 = {(x, y): math.hypot(x, y) for x in range(3) for y in range(3)}
print(f'{d8=}')

True
d7={0: None, 1: None, 2: None}
d8={(0, 0): 0.0, (0, 1): 1.0, (0, 2): 2.0, (1, 0): 1.0, (1, 1): 1.4142135623730951, (1, 2): 2.23606797749979, (2, 0): 2.0, (2, 1): 2.23606797749979, (2, 2): 2.8284271247461903}


In [12]:
d1 = {'num': 1, 'list': [1, 2, 3]}
d2 = dict(d1) # make a shallow copy of d1
print(f'{(d1 == d2)=}')
print(f'{(d1 is d2)=}')
print(f'{(d1['list'] is d2['list'])=}') # True: the copy is shallow, so the list is shared
d3 = dict(d1, extra=2)
print(f'{(d1['list'] is d3['list'])=}') # True: adding keyword arguments still copies shallowly

(d1 == d2)=True
(d1 is d2)=False
(d1['list'] is d2['list'])=True
(d1['list'] is d3['list'])=True


### Common operations

Basic operations

* `d[key] = value`
  * creates `key` if it does not exist already
  * assigns `value` to `key`
* `d[key]`
  * as an expression returns the value for specific key
  * exception `KeyError` if `key` is not found
* `d.get(key[, default])`
  * returns the value for `key` if `key` is in `d`, else `default`
  * if `default` is not given, it defaults to `None`
  * this method never raises a `KeyError`
* `key in d`, `key not in d`
  * membership testing
* `len(d)`
  * number of items in dictionary
* `d.clear()`
  * clears out all items, `d` becomes empty with its id unchanged
* `del d[key]`
  * removes element with that `key` from `d`
  * exception `KeyError` if `key` is not in `d`
* `d.pop(key[, default])`
  * if `key` is in `d`, remove it and return its value, else return `default`
  * if `default` is not given and `key` is not in `d`, a `KeyError` is raised
* `d.popitem()`
  * removes an item from `d`
  * returns a `(key, value)` pair
  * pairs are returned in LIFO (last in, first out) order
  * `KeyError` if `d` is empty
  * `popitem()` is useful to destructively iterate over a dictionary, as often used in set algorithms
  * since version 3.7, LIFO order is guaranteed, while in prior versions, `popitem()` would return an arbitrary item
* `d.setdefault(key[, default])`
  * if `key` is in the dictionary, return its value
  * if not, insert `key` with a value of `default` and return `default`
  * `default` defaults to `None`
* `d.items()`
  * returns a new view of `d`'s items (`(key, value)` pairs).
* `d.update([other])`
  * update `d` with the key/value pairs from `other`, overwriting existing keys
  * returns `None`
  * `update()` accepts either another dictionary object or an iterable of key/value pairs (as tuples or other iterables of length two)
  * if keyword arguments are specified, the dictionary is then updated with those key/value pairs: `d.update(red=1, blue=2)`
* `d | other`
  * creates a new dictionary with the merged keys and values of `d` and `other`, which must both be dictionaries
  * the values of `other` take priority when `d` and `other` share keys
* `d |= other`
  * update the dictionary `d` with keys and values from `other`, which may be either a mapping or an iterable of key/value pairs
  * the values of `other` take priority when `d` and `other` share keys
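
The operations above, applied in sequence to a small made-up `stock` dictionary. The merge operators `|` and `|=` require Python 3.9 or later.

```python
stock = {'apple': 3, 'pear': 0}

# d[key] raises KeyError for a missing key; get() returns a default instead.
print(stock.get('kiwi'))              # None
print(stock.get('kiwi', 0))           # 0

# setdefault() inserts only when the key is missing and returns the stored value.
print(stock.setdefault('kiwi', 5))    # 5
print(stock.setdefault('apple', 99))  # 3

# pop() removes a key; the default avoids a KeyError for missing keys.
print(stock.pop('pear'))              # 0
print(stock.pop('plum', None))        # None

# popitem() removes the most recently inserted item (LIFO).
print(stock.popitem())                # ('kiwi', 5)

# | builds a new dict; |= updates in place and also accepts key/value pairs.
merged = stock | {'apple': 10, 'fig': 2}
stock |= [('lime', 7)]
print(merged)                         # {'apple': 10, 'fig': 2}
print(stock)                          # {'apple': 3, 'lime': 7}
```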

In [13]:
d = dict.fromkeys(i for i in range(3))
print(f'{d=}, {id(d)=}')
d.clear()
print(f'{d=}, {id(d)=}')

d={0: None, 1: None, 2: None}, id(d)=2284074067328
d={}, id(d)=2284074067328


In [26]:
import string

# build the lookup table once, so that each call is a single dictionary lookup
# instead of a scan through several strings
CATEGORY_DICT = (dict.fromkeys(string.ascii_lowercase, 'lowercases') |
                 dict.fromkeys(string.ascii_uppercase, 'uppercases') |
                 dict.fromkeys(string.digits, 'digits') |
                 dict.fromkeys(string.punctuation, 'punctuations'))

def get_category(c):
    return CATEGORY_DICT.get(c, 'others')

text = '''What’s the Metaverse and Why Should We Care About It?
The metaverse is an immersive virtual reality environment that combines work, socializing, shopping, and gaming. Depending on who you talk to, it’s either a utopia or a dystopia. Some view it as the next version of the internet, while others see it as a better version of the online world Second Life. Major companies like Facebook, Microsoft, and Epic Games are investing heavily in its development.
The concept of the metaverse has been around for decades, but recent advancements in technology have brought it closer to reality. Imagine a digital universe where you can create an avatar, interact with others, attend virtual events, and explore endless virtual landscapes. It’s a place where physical and digital boundaries blur, and the possibilities are limited only by our imagination.
Here are some key points about the metaverse:
Virtual Worlds: The metaverse consists of interconnected virtual worlds. These worlds can be entirely fictional or mirror real-world locations. Users can navigate through these spaces using their avatars.
Social Interaction: In the metaverse, social interactions are central. You can chat with friends, attend virtual parties, collaborate on projects, or even meet new people. It’s like a 24/7 digital social gathering.
Economy and Commerce: Just like in the real world, the metaverse has its economy. People buy and sell virtual goods, services, and properties. Cryptocurrencies play a significant role here.
Gaming and Entertainment: Gaming is a major part of the metaverse. From immersive multiplayer games to virtual concerts, entertainment options are vast. Artists perform live in virtual venues, and users can participate from anywhere.
Challenges and Concerns: While the metaverse promises exciting possibilities, it also raises concerns. Privacy, security, and ethical issues need careful consideration. Who owns the virtual spaces? How do we prevent abuse and harassment?
The Future: The metaverse is still evolving, and its impact on society remains uncertain. Will it enhance our lives or lead to further isolation? Only time will tell.
In summary, the metaverse is a digital frontier where creativity, collaboration, and innovation converge. Whether you’re an early adopter or a skeptic, it’s a concept worth exploring. Read more about it here and dive into this fascinating realm!'''

char_count = {}

for c in text:
    if c.strip():
        # method 1: use match case with guard
        # match c:
        #     case c if c in string.ascii_lowercase:
        #         category = 'lowercases'
        #     case c if c in string.ascii_uppercase:
        #         category = 'uppercases'
        #     case c if c in string.punctuation:
        #         category = 'punctuations'
        #     case c if c in string.digits:
        #         category = 'digits'
        #     case _:
        #         category = 'others'

        # method 2: use if-else
        # if c in string.ascii_lowercase:
        #     category = 'lowercases'
        # elif c in string.ascii_uppercase:
        #     category = 'uppercases'
        # elif c in string.punctuation:
        #     category = 'punctuations'
        # elif c in string.digits:
        #     category = 'digits'
        # else:
        #     category = 'others'

        # method 3: use a function that wraps the above process
        category = get_category(c)

        # char_count.setdefault(category, {})
        # char_count[category][c] = char_count[category].get(c, 0) + 1
        char_count[category][c] = char_count.setdefault(category, {}).get(c, 0) + 1 # the rhs evaluates first

print(char_count)

for k, v in char_count.items():
    char_count[k] = dict(sorted(v.items()))  # replacing values (not keys) while iterating is safe
print(char_count)

{'uppercases': {'W': 8, 'M': 3, 'S': 4, 'C': 5, 'A': 2, 'I': 7, 'T': 6, 'D': 1, 'L': 1, 'F': 3, 'E': 3, 'G': 3, 'H': 2, 'V': 1, 'U': 1, 'Y': 1, 'J': 1, 'P': 2, 'O': 1, 'R': 1}, 'lowercases': {'h': 62, 'a': 170, 't': 171, 's': 127, 'e': 255, 'v': 50, 'r': 143, 'n': 141, 'd': 52, 'y': 28, 'o': 131, 'u': 44, 'l': 88, 'b': 19, 'm': 51, 'i': 162, 'c': 70, 'w': 21, 'k': 8, 'z': 1, 'g': 30, 'p': 40, 'x': 4, 'f': 17, 'j': 3}, 'others': {'’': 6}, 'punctuations': {'?': 4, ',': 29, '.': 25, ':': 7, '-': 1, '/': 1, '!': 1}, 'digits': {'2': 1, '4': 1, '7': 1}}
{'uppercases': {'A': 2, 'C': 5, 'D': 1, 'E': 3, 'F': 3, 'G': 3, 'H': 2, 'I': 7, 'J': 1, 'L': 1, 'M': 3, 'O': 1, 'P': 2, 'R': 1, 'S': 4, 'T': 6, 'U': 1, 'V': 1, 'W': 8, 'Y': 1}, 'lowercases': {'a': 170, 'b': 19, 'c': 70, 'd': 52, 'e': 255, 'f': 17, 'g': 30, 'h': 62, 'i': 162, 'j': 3, 'k': 8, 'l': 88, 'm': 51, 'n': 141, 'o': 131, 'p': 40, 'r': 143, 's': 127, 't': 171, 'u': 44, 'v': 50, 'w': 21, 'x': 4, 'y': 28, 'z': 1}, 'others': {'’': 6}, 'pun

### Dictionary views

* The objects returned by `dict.keys()`, `dict.values()` and `dict.items()` are view objects.
* They provide a *dynamic* view on the dictionary's entries, which means that when the dictionary changes, the view reflects these changes.
* Dictionary views can be iterated over to yield their respective data, and support membership tests.
* Views are read-only and not updatable.
* Keys views are set-like since their entries are unique and hashable.
* Items views also have set-like operations since the (key, value) pairs are unique and the keys are hashable. If all values in an items view are hashable as well, then the items view can interoperate with other sets.
* Values views are not treated as set-like since the entries are generally not unique.
* For set-like views, all of the operations defined for the abstract base class `collections.abc.Set` are available (for example, `==`, `<`, or `^`).
* While using set operators, set-like views accept any iterable as the other operand, unlike sets which only accept sets as the input.
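
A short sketch of the view behaviour described above, using made-up dictionaries. It shows that views are dynamic and that their set operators accept arbitrary iterables, whereas the same operator on a plain `set` does not.

```python
d = {'a': 1, 'b': 2, 'c': 3}
keys, items = d.keys(), d.items()

# Views are dynamic: they track later changes to the dictionary.
d['d'] = 4
print('d' in keys, len(items))          # True 4

# Set-like views accept any iterable with operators...
print(sorted(keys & ['a', 'd', 'z']))   # ['a', 'd']

# ...whereas a plain set requires another set for the same operator.
try:
    {'a', 'b'} & ['a']
except TypeError as exc:
    print('TypeError:', exc)

# Items views compare whole (key, value) pairs.
other = {'a': 1, 'b': 20}
print(sorted(items & other.items()))    # [('a', 1)]
```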

In [27]:
d = {'a': 1, 'b': 2}
values_view = d.values()
print(values_view)
d['a'] = 10
print(values_view) # changed, dynamically reflect the values of the dict

dict_values([1, 2])
dict_values([10, 2])


In [33]:
d1 = {'a': 1, 'b': 2, 'c': 3}
d2 = {'b': 2, 'c': 6, 'd': 4}

d1_keys = d1.keys()
d2_keys = d2.keys()

# create a dictionary with common keys and both values from dicts above
common_keys = d1_keys & d2_keys
d3 = {common_key: (d1[common_key], d2[common_key])
      for common_key in common_keys}
print(d3)

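# create a dictionary with the keys that appear in only one of the dicts above
unique_keys = d1_keys ^ d2_keys
# select by membership: `d1.get(k) or d2.get(k)` would skip falsy values such as 0
d4 = {unique_key: d1[unique_key] if unique_key in d1 else d2[unique_key] for unique_key in unique_keys}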
print(d4)

{'b': (2, 2), 'c': (3, 6)}
{'d': 4, 'a': 1}



### Updating, merging and copying

Updating & merging

* `update([other])` has three forms
  * `d1.update(d2)`
  * `d1.update(iterable)`: an iterable of key/value pairs (as tuples or other iterables of length two)
  * `d1.update(**kwargs)`: only works for keys that are valid identifiers
* Unpacking dictionaries: `d = {**d1, **d2, **d3}`
  * unlike dictionary unpacking for function arguments, keys do not need to be strings, let alone valid identifiers
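
The three forms of `update()` and the difference between the two kinds of unpacking, as a minimal sketch. The helper `show` is a hypothetical function defined only for this example.

```python
d1 = {'a': 1, 'b': 2}

d1.update({'b': 20})              # from another mapping
d1.update([('c', 3), 'xy'])       # from an iterable of length-two iterables
d1.update(d=4)                    # from keyword arguments
print(d1)   # {'a': 1, 'b': 20, 'c': 3, 'x': 'y', 'd': 4}

# Unpacking into a dict display works with any hashable keys;
# later mappings win when keys repeat.
left = {1: 'one', 'two words': 2}
right = {1: 'uno'}
merged = {**left, **right}
print(merged)   # {1: 'uno', 'two words': 2}


def show(**kwargs):
    return kwargs


# Unpacking into function arguments rejects non-string keys.
try:
    show(**left)
except TypeError as exc:
    print('TypeError:', exc)
```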

Copying

* Shallow copies
  * container object is a new object
  * the keys and values in the copy are shared references with the original object
  * `d.copy()`
  * `{**d}`
  * `dict(d)`
  * `{k: v for k, v in d.items()}`, slower than the above
* Deep copies
  * no shared references
  * sometimes requires recursion, have to be careful with circular references
  * `copy.deepcopy()`
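
The difference between the two kinds of copy only shows up once a value is itself mutable. A minimal sketch with a made-up dictionary:

```python
from copy import deepcopy

original = {'name': 'config', 'tags': ['x', 'y']}
shallow = original.copy()
deep = deepcopy(original)

# Mutating a shared value is visible through the shallow copy only.
original['tags'].append('z')
print(shallow['tags'])   # ['x', 'y', 'z']
print(deep['tags'])      # ['x', 'y']

# Rebinding a key in the original affects neither copy.
original['name'] = 'changed'
print(shallow['name'], deep['name'])   # config config

# deepcopy() copes with circular references by remembering what it has copied.
loop = {}
loop['self'] = loop
loop_copy = deepcopy(loop)
print(loop_copy['self'] is loop_copy)  # True
```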

In [37]:
d1 = {'a': 1, 'b': 2}
d2 = {'c': 3, 'b': 5}
d1.update(d2, d=4)
print(d1) # insertion order is preserved: 'b' keeps its original position even though its value was updated

{'a': 1, 'b': 5, 'c': 3, 'd': 4}


In [41]:
from timeit import timeit
from random import randrange
from copy import deepcopy

big_dict = {k: randrange(100) for k in range(1000000)}

def copy_unpacking(d):
    {**d}

def copy_constructor(d):
    dict(d)

def copy_copy(d):
    d.copy()

def copy_comprehension(d):
    {k: v for k, v in d.items()}

def copy_deep(d):
    deepcopy(d)

print('copy by shallow copy:', timeit('copy_copy(big_dict)', globals=globals(), number=10))
print('copy by unpacking:', timeit('copy_unpacking(big_dict)', globals=globals(), number=10))
print('copy by constructor:', timeit('copy_constructor(big_dict)', globals=globals(), number=10))
print('copy by comprehension:', timeit('copy_comprehension(big_dict)', globals=globals(), number=10)) # slower than other shallow copy methods
print('copy by deep copy:', timeit('copy_deep(big_dict)', globals=globals(), number=10)) # deep copy is much slower than shallow copy

copy by shallow copy: 0.23981210007332265
copy by unpacking: 0.22421070002019405
copy by constructor: 0.22063910006545484
copy by comprehension: 1.0174344999250025
copy by deep copy: 8.563632600009441


### Custom classes and hashing

How Python inserts a key-value item in a dictionary

* hash(key)
* mod dictionary size
* start index in hash table (sequence of slots)
* generate probe sequence (sequence of valid indices)
* iterate over probe sequence
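  * is the slot at that index empty?
    * yes -> store the new item there (hash, key, value)
    * no -> if the stored hash and key equal the new ones, the key already exists and its value is replaced; otherwise it is a hash collision -> continue iteration to look for an empty slot
      * the more collisions, the less efficient the insertion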

How Python finds a key in a dictionary

* hash(key)
* mod dictionary size
* start index in hash table (sequence of slots)
* generate probe sequence (sequence of valid indices)
* look over probe sequence
  * is slot empty?
    * yes -> key does not exist in dictionary
    * no -> are hashes equal and are keys equal?
      * yes -> found the key
      * no -> caused by collision upon insertion / resizing -> continue iteration to find key or empty slot
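
The insertion and lookup steps above can be sketched as a toy open-addressing table. This is a simplification under stated assumptions: the class `ProbingTable` is invented for illustration, it uses plain linear probing, and it never resizes or deletes. CPython's real probe sequence and storage layout are more elaborate.

```python
class ProbingTable:
    """Toy open-addressing table with linear probing (no resizing, no deletion)."""

    def __init__(self, size=8):
        self._slots = [None] * size   # each slot: None or (hash, key, value)

    def _probe(self, h):
        # Start at hash mod table size, then visit every other index once.
        size = len(self._slots)
        start = h % size
        for step in range(size):
            yield (start + step) % size

    def insert(self, key, value):
        h = hash(key)
        for i in self._probe(h):
            slot = self._slots[i]
            # Empty slot, or the same key already stored: write here.
            if slot is None or (slot[0] == h and slot[1] == key):
                self._slots[i] = (h, key, value)
                return
            # Otherwise the slot holds a different key: collision, keep probing.
        raise RuntimeError('table is full')

    def find(self, key):
        h = hash(key)
        for i in self._probe(h):
            slot = self._slots[i]
            if slot is None:
                break                  # empty slot: the key was never inserted
            if slot[0] == h and slot[1] == key:
                return slot[2]         # hashes and keys both match
        raise KeyError(key)


t = ProbingTable(size=8)
t.insert(1, 'one')
t.insert(9, 'nine')    # 9 % 8 == 1, so this collides with key 1
t.insert(1, 'uno')     # existing key: value is replaced in place

print(t.find(1))       # uno
print(t.find(9))       # nine
```

Comparing the stored hash before comparing keys is a cheap filter: the potentially expensive `==` on keys only runs when the hashes already match.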

Object hashes

* An object hash in Python must satisfy the following:
  * the result must be an integer
  * if `a == b`, then `hash(a) == hash(b)`
    * do not require that if `hash(a) == hash(b)` then `a == b`, i.e. two objects that are not equal, can have the same hash (hash collision)

Custom classes

* By default, instances of a custom class compare equal (`==`) only if they are the same object, i.e. have the same id.
* By default, Python automatically makes instances of custom classes without an `__eq__` method hashable.
  * it uses the id to make a hash
  * but this is not very useful
* So we need to define equality for custom classes
  * but this leads to the loss of the automatic id-based hashing of the custom class
    * because if Python kept the default id-based hash, it could cause a contradiction: `obj1 == obj2` is true but `hash(obj1) == hash(obj2)` is false
* So we need to define hash for custom classes
  * `__hash__`
    * set the `__hash__` attribute to `None` to indicate that the class is unhashable; this is what Python does automatically when only `__eq__` is defined but not `__hash__`
    * `__hash__` must return an integer
    * if `a == b`, then `hash(a) == hash(b)`

What happens when calling `hash(obj)`

* looks for `obj.__hash__`
* `None` means not hashable
* otherwise, calls `obj.__hash__()`
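* reduces the returned integer so that it fits in `sys.hash_info.width` bits (typically 64 or 32, depending on the Python build); the built-in `int` hash that many `__hash__` implementations delegate to works modulo `sys.hash_info.modulus`, so `hash(n) == n % sys.hash_info.modulus` for a non-negative integer `n`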
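
The effect of defining `__eq__` with and without `__hash__` can be checked directly. The three classes below (`Plain`, `OnlyEq`, `EqAndHash`) are hypothetical examples written for this sketch.

```python
class Plain:
    pass


class OnlyEq:
    def __init__(self, x):
        self.x = x

    def __eq__(self, other):
        if isinstance(other, OnlyEq):
            return self.x == other.x
        return NotImplemented


class EqAndHash:
    def __init__(self, x):
        self.x = x

    def __eq__(self, other):
        if isinstance(other, EqAndHash):
            return self.x == other.x
        return NotImplemented

    def __hash__(self):
        return hash(self.x)


# No __eq__: the inherited identity-based equality and hash are kept.
p = Plain()
print(hash(p) == hash(p), Plain() == Plain())   # True False

# __eq__ without __hash__: Python sets __hash__ to None.
print(OnlyEq.__hash__)                          # None
try:
    hash(OnlyEq(1))
except TypeError as exc:
    print('TypeError:', exc)

# Both defined consistently: equal objects collapse into one set element.
print(len({EqAndHash(1), EqAndHash(1), EqAndHash(2)}))   # 2
```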

In [54]:
import os

class Person:
    def __init__(self, name, id=None):
        self._id = id if id is not None else int.from_bytes(os.urandom(8), 'big')  # explicit byteorder: it has a default only since Python 3.11
        self.name = name

    @property
    def id(self):
        return self._id
    
    @property
    def name(self):
        return self._name
    
    @name.setter
    def name(self, name):
        if isinstance(name, str):
            self._name = name
        else:
            raise ValueError('name should be a string')

    def __repr__(self):
        return f'Person(id={self._id}, name={self._name})'

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self._id == other._id
        else:
            return NotImplemented
        
    def __hash__(self):
        return hash(self._id)
    
p1 = Person('tony')
p2 = Person('freer')
print(p1)
print(hash(p1))
print(p1 == p2)
print(p1 == 1)
p1.name = 'tom'
print(hash(p1))

p3 = Person('tony', 1)
p4 = Person('freer', 1)
print(p3)
print(p3 == p4)
print(hash(p3) == hash(p4))

Person(id=12372730557963379020, name=tony)
843515511894909265
False
False
843515511894909265
Person(id=1, name=tony)
True
True


## Coding exercises

### Exercise 1

Write a Python function that will create and return a dictionary from another dictionary, but sorted by value. You can assume the values are all comparable and have a natural sort order.

In [56]:
def sort_dict_by_value(d):
    sorted_items = sorted(d.items(), key=lambda item: item[1])
    return dict(sorted_items)

d = {'a': 10, 'b': 2, 'c': 0, 'd': 4}
print(sort_dict_by_value(d))

composers = {'Johann': 65, 'Ludwig': 56, 'Frederic': 39, 'Wolfgang': 35}
print(sort_dict_by_value(composers))

{'c': 0, 'b': 2, 'd': 4, 'a': 10}
{'Wolfgang': 35, 'Frederic': 39, 'Ludwig': 56, 'Johann': 65}


### Exercise 2

Given two dictionaries, `d1` and `d2`, write a function that creates a dictionary that contains only the keys common to both dictionaries, with values being a tuple containing the values from `d1` and `d2`. (Order of keys is not important).

In [72]:
def accumulate_intersection(d1, d2):
    common_keys = d1.keys() & d2.keys()
    return {common_key: (d1[common_key], d2[common_key]) for common_key in common_keys}

d1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
d2 = {'b': 20, 'c': 30, 'y': 40, 'z': 50}

print(accumulate_intersection(d1, d2))

{'b': (2, 20), 'c': (3, 30)}


### Exercise 3

You have text data spread across multiple servers. Each server is able to analyze this data and return a dictionary that contains words and their frequency.

Your job is to combine this data to create a single dictionary that contains all the words and their combined frequencies from all these data sources. Bonus points if you can make your dictionary sorted by frequency (highest to lowest).

In [74]:
# from functools import reduce

def combine_dicts(*dicts):
    # combined_keys = reduce(lambda a, b: a | b, map(lambda d: d.keys(), dicts))
    # return sort_dict_by_value({k: sum(d.get(k, 0) for d in dicts) for k in combined_keys}, True)
    combined_dict = {}
    for d in dicts:
        for k, v in d.items():
            combined_dict[k] = combined_dict.get(k, 0) + v
    return sort_dict_by_value(combined_dict, True)

def sort_dict_by_value(d, reverse=False):
    return dict(sorted(d.items(), key=lambda item: item[1], reverse=reverse))

d1 = {'python': 10, 'java': 3, 'c#': 8, 'javascript': 15}
d2 = {'java': 10, 'c++': 10, 'c#': 4, 'go': 9, 'python': 6}
d3 = {'erlang': 5, 'haskell': 2, 'python': 1, 'pascal': 1}

print(combine_dicts(d1, d2))
print(combine_dicts(d1, d2, d3))


{'python': 16, 'javascript': 15, 'java': 13, 'c#': 12, 'c++': 10, 'go': 9}
{'python': 17, 'javascript': 15, 'java': 13, 'c#': 12, 'c++': 10, 'go': 9, 'erlang': 5, 'haskell': 2, 'pascal': 1}


### Exercise 4

For this exercise suppose you have a web API load balanced across multiple nodes. This API receives various requests for resources and logs each request to some local storage. Each instance of the API is able to return a dictionary containing the resource that was accessed (the dictionary key) and the number of times it was requested (the associated value).

Your task here is to identify resources that have been requested on some, but not all the servers, so you can determine if you have an issue with your load balancer not distributing certain resource requests across all nodes.

For simplicity, we will assume that there are exactly 3 nodes in the cluster.

You should write a function that takes 3 dictionaries as arguments for node 1, node 2, and node 3, and returns a dictionary that contains only keys that are not found in **all** of the dictionaries. The value should be a list containing the number of times it was requested in each node (the node order should match the dictionary (node) order passed to your function). Use `0` if the resource was not requested from the corresponding node.

In [71]:
from functools import reduce

def analyze_balance(*dicts):
    keys_list = list(map(lambda d: d.keys(), dicts))
    common_keys = reduce(lambda a, b: a & b, keys_list)
    combined_keys = reduce(lambda a, b: a | b, keys_list)
    differentiated_keys = combined_keys - common_keys
    return {k: tuple(d.get(k, 0) for d in dicts) for k in differentiated_keys}

n1 = {'employees': 100, 'employee': 5000, 'users': 10, 'user': 100}
n2 = {'employees': 250, 'users': 23, 'user': 230}
n3 = {'employees': 150, 'users': 4, 'login': 1000}

print(analyze_balance(n1, n2, n3))

{'login': (0, 0, 1000), 'employee': (5000, 0, 0), 'user': (100, 230, 0)}


## Sets

### Basic set theory

What is a set

* A set object is an *unordered* collection of *distinct* *hashable* objects.
* Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.
  * `s1.intersection(s2)`, `s1 & s2`
  * `s1.union(s2)`, `s1 | s2`
  * `s1.difference(s2)`, `s1 - s2`
  * `s1.symmetric_difference(s2)`, `s1 ^ s2`
* Like other collections, sets support `x in set`, `len(set)`, and `for x in set`.
  * For finite sets, the cardinality of a set is the number of elements in the set.
  * An empty set contains no elements, i.e. it has cardinality 0, and can be created with `set()` (`{}` creates an empty dictionary, not a set).
  * Two sets are said to be disjoint if their intersection is the empty set, i.e. `len(s1 & s2) == 0` or, equivalently, `s1.isdisjoint(s2)` is `True`.
* Being an unordered collection, sets do not record element position or order of insertion. Accordingly, sets do not support indexing, slicing, or other sequence-like behavior.
* There are currently two built-in set types, set and frozenset.
  * The set type is mutable — the contents can be changed using methods like `add()` and `remove()`. Since it is mutable, it has no hash value and cannot be used as either a dictionary key or as an element of another set.
  * The frozenset type is immutable and hashable — its contents cannot be altered after it is created; it can therefore be used as a dictionary key or as an element of another set.
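
A brief sketch of what hashability buys in practice. The `distances` mapping and its contents are invented for illustration.

```python
# A frozenset is hashable, so it can serve as a dictionary key...
distances = {frozenset({'A', 'B'}): 5}
print(distances[frozenset({'B', 'A'})])   # 5: element order is irrelevant

# ...and as an element of another set.
nested = {frozenset({1, 2}), frozenset({2, 1}), frozenset({3})}
print(len(nested))                        # 2

# A mutable set cannot be used in either role.
try:
    {{1, 2}}
except TypeError as exc:
    print('TypeError:', exc)

# Comparisons between set and frozenset are by content, not by type.
print({1, 2} == frozenset({1, 2}))        # True
```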

Subsets and supersets

* A set `s1` is a subset of `s2` if all the elements in `s1` are in `s2`.
  * `s1 <= s2`
  * `s1.issubset(s2)`
* A set `s1` is a proper subset of `s2` if `s1` is a subset of `s2` but `s1` is not equal to `s2`.
  * `s1 < s2`
* A set `s1` is a superset of `s2` if `s2` is a subset of `s1`.
  * `s1 >= s2`
  * `s1.issuperset(s2)`
* A set `s1` is a proper superset of `s2` if `s2` is a proper subset of `s1`.
  * `s1 > s2`
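
The relationships above can be checked directly with the comparison operators. This is a minimal sketch; the sets `evens` and `small` are made up for illustration.

```python
evens = {2, 4, 6}
small = {1, 2, 3, 4, 5, 6}

print(evens <= small)             # True: every element of evens is in small
print(evens < small)              # True: subset, and the two sets differ
print(small <= small)             # True: every set is a subset of itself
print(small < small)              # False: but never a proper subset of itself
print(small >= evens)             # True: superset
print(small.issuperset(evens))    # True: method form of >=
print(evens.isdisjoint({1, 3}))   # True: no shared elements
```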

### Python sets

* `class set([iterable])`, `class frozenset([iterable])`
  * Returns a new set or frozenset object whose elements are taken from `iterable`. The elements of a set must be *hashable*. To represent sets of sets, the inner sets must be frozenset objects. If iterable is not specified, a new empty set is returned.
* Sets can be created by several means:
  * Use a comma-separated list of elements within braces: `{'jack', 'sjoerd'}`
  * Use unpacking: `{*{1, 2}, *{3, 4}}`
    * Unpacking a set into function arguments (`f(*s)`) also works, but the order of the arguments is not guaranteed.
  * Use a set comprehension: `{c for c in 'abracadabra' if c not in 'abc'}`
  * Use the type constructor: `set()`, `set('foobar')`, `set(['a', 'b', 'foo'])`

Membership testing

* Testing membership of an element in a set is extremely efficient (hash table lookup).
* Instead of writing code like `if a in (1, 2, 3):`, prefer using (as long as elements are hashable) `if a in {1, 2, 3}:` (with higher storage cost).
  * list / tuple lookup -> scan until found
  * set / dictionary -> hash table -> direct lookup (possible hash collision)

In [78]:
print(set())
print(set((1, 2, 3)))
# print(set((1, [1, 2]))) # elements must be hashable
print(set('python'))
print({c for c in 'python'}) # slower than the above way
print({*'python', *[1, 2, 'p']})

set()
{1, 2, 3}
{'p', 't', 'n', 'h', 'y', 'o'}
{'p', 't', 'n', 'h', 'y', 'o'}
{1, 'p', 't', 'n', 2, 'h', 'y', 'o'}


In [79]:
def scorer(string):
    alphabet = set('abcdefghijklmnopqrstuvwxyz')
    string_letters = set(string.lower()) & alphabet
    return len(string_letters) / len(alphabet)

print(scorer('play the world!'))
print(scorer('the quick brown fox jumps over the lazy dog'))

0.4230769230769231
1.0


### Common operations

Operations available for both instances of set and frozenset

* `len(s)`
  * Return the number of elements in set `s` (cardinality of `s`).
* `x in s`
  * Test `x` for membership in `s`.
* `x not in s`
  * Test `x` for non-membership in `s`.
* `isdisjoint(other)`
  * Return `True` if the set has no elements in common with `other`. Sets are disjoint if and only if their intersection is the empty set.
* `issubset(other)`, `set <= other`
  * Test whether every element in the set is in `other`.
* `set < other`
  * Test whether the set is a proper subset of `other`, that is, `set <= other and set != other`.
* `issuperset(other)`, `set >= other`
  * Test whether every element in `other` is in the set.
* `set > other`
  * Test whether the set is a proper superset of `other`, that is, `set >= other and set != other`.
* `union(*others)`, `set | other | ...`
  * Return a new set with elements from the set and all others.
* `intersection(*others)`, `set & other & ...`
  * Return a new set with elements common to the set and all others.
* `difference(*others)`, `set - other - ...`
  * Return a new set with elements in the set that are not in the others.
* `symmetric_difference(other)`, `set ^ other`
  * Return a new set with elements in either the set or `other` but not both.
* `copy()`
  * Return a shallow copy of the set.
* The non-operator versions of `union()`, `intersection()`, `difference()`, `symmetric_difference()`, `issubset()`, and `issuperset()` methods will accept any *iterable* as an argument.
  * In contrast, their operator based counterparts require their arguments to be sets.
  * This precludes error-prone constructions like `set('abc') & 'cbs'` in favor of the more readable `set('abc').intersection('cbs')`.
* Instances of set are compared to instances of frozenset based on their members. For example, `set('abc') == frozenset('abc')` returns `True` and so does `set('abc') in set([frozenset('abc')])`.
* Binary operations that mix set instances with frozenset return the type of the first operand. For example: `frozenset('ab') | set('bc')` returns an instance of frozenset.
* The subset and equality comparisons do not generalize to a total ordering function. For example, any two nonempty disjoint sets are not equal and are not subsets of each other, so all of the following return `False`: `a<b`, `a==b`, or `a>b`.
* Since sets only define partial ordering (subset relationships), the output of the `list.sort()` method is undefined for lists of sets.
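
A short sketch of the last few points: methods accept any iterable, operators insist on sets, mixed operations take the type of the first operand, and disjoint sets are not ordered relative to each other. The variable names are illustrative only.

```python
vowels = set('aeiou')

# The method form accepts any iterable, here a plain string.
common = vowels.intersection('python code')
print(sorted(common))                      # ['e', 'o']

# The operator form requires a set on both sides.
try:
    vowels & 'python code'
    operator_error = None
except TypeError as exc:
    operator_error = exc
    print(f'TypeError: {exc}')

# Mixed set / frozenset operations return the type of the first operand.
print(type(frozenset('ab') | set('bc')))   # <class 'frozenset'>
print(type(set('bc') | frozenset('ab')))   # <class 'set'>

# Partial ordering: two nonempty disjoint sets are neither <, == nor >.
a, b = {1, 2}, {3, 4}
print(a < b, a == b, a > b)                # False False False
```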

Operations available for set that do not apply to immutable instances of frozenset

* `update(*others)`, `set |= other | ...`
  * Update the set, adding elements from all others.
* `intersection_update(*others)`, `set &= other & ...`
  * Update the set, keeping only elements found in it and all others.
* `difference_update(*others)`, `set -= other | ...`
  * Update the set, removing elements found in others.
* `symmetric_difference_update(other)`, `set ^= other`
  * Update the set, keeping only elements found in either set, but not in both.
* `add(elem)`
  * Add element `elem` to the set.
* `remove(elem)`
  * Remove element `elem` from the set. Raises `KeyError` if `elem` is not contained in the set.
* `discard(elem)`
  * Remove element `elem` from the set if it is present.
* `pop()`
  * Remove and return an arbitrary element from the set. Raises `KeyError` if the set is empty.
* `clear()`
  * Remove all elements from the set.
* The non-operator versions of the `update()`, `intersection_update()`, `difference_update()`, and `symmetric_difference_update()` methods will accept any *iterable* as an argument.
* The `elem` argument to the `__contains__()`, `remove()`, and `discard()` methods may be a set. To support searching for an equivalent *frozenset*, a temporary one is created from `elem`.

In [8]:
s = {1, 2, 3, 4, 5, 6, frozenset((1, 6))}
print(1 in s)
print({1, 6} in s) # a temporary equivalent frozenset of {1, 6} is created to support the search
s.add(7)
print(s)
s.remove(7)
print(s)
print(s.pop())
print(s)
s.discard(7) # no error for non-existing element
s -= {1, 2} | {3, 4}
print(s)

True
True
{1, 2, 3, 4, 5, 6, 7, frozenset({1, 6})}
{1, 2, 3, 4, 5, 6, frozenset({1, 6})}
1
{2, 3, 4, 5, 6, frozenset({1, 6})}
{5, 6, frozenset({1, 6})}


In [3]:
d = {}
s = set()
l = []
print(f'{d.__sizeof__()=}')
print(f'{s.__sizeof__()=}')
print(f'{l.__sizeof__()=}')

d[1] = None
s.add(1)
l.append(1)
print(f'{d.__sizeof__()=}')
print(f'{s.__sizeof__()=}')
print(f'{l.__sizeof__()=}')

d.__sizeof__()=48
s.__sizeof__()=200
l.__sizeof__()=40
d.__sizeof__()=208
s.__sizeof__()=200
l.__sizeof__()=72


### Frozen sets

* Frozen sets are immutable sets, with the same properties and behaviors as sets except that they cannot be mutated.
* Their elements must be hashable, but the elements themselves may still be mutable objects.
* Frozenset is hashable, so that it can be used as a key in a dictionary and an element of another set.

In [4]:
from copy import deepcopy

fs1 = frozenset({1, 2, 3})
fs2 = frozenset(fs1)
print(f'{(fs1 is fs2)=}') # when making a shallow copy of a frozenset, the copy is actually the same object
fs3 = deepcopy(fs1)
print(f'{(fs1 is fs3)=}')

s1 = {4, 5, 6}
print(f'{type(s1 | fs1)=}') # the type of the result set depends on the first operand
print(f'{type(fs1 | s1)=}')

s4 = {1, 2, 3}
print(f'{(fs1 == s4)=}')
print(f'{(fs1 is s4)=}')

(fs1 is fs2)=True
(fs1 is fs3)=False
type(s1 | fs1)=<class 'set'>
type(fs1 | s1)=<class 'frozenset'>
(fs1 == s4)=True
(fs1 is s4)=False


In [5]:
class Person:
    def __init__(self, name, age):
        self._name = name
        self._age = age

    @property
    def name(self):
        return self._name
    
    @property
    def age(self):
        return self._age
    
    def __repr__(self):
        return f'Person(name={self._name}, age={self._age})'
    
    def key(self): # serves as dictionary key
        return frozenset({self._name, self._age})
    
p1 = Person('John', 75)
p2 = Person('Eric', 78)
ps = {p1.key(): p1, p2.key(): p2}
print(ps)
print(ps[Person('John', 75).key()])
print(ps[frozenset(('John', 75))])

{frozenset({75, 'John'}): Person(name=John, age=75), frozenset({'Eric', 78}): Person(name=Eric, age=78)}
Person(name=John, age=75)
Person(name=John, age=75)


In [10]:
from functools import lru_cache

@lru_cache
def my_func(*, a, b):
    print('calculating a + b')
    return a + b

print(my_func(a=1, b=2))
print(my_func(a=1, b=2)) # look up in cache for the same function call, no calculation
# one drawback of lru_cache
print(my_func(b=2, a=1)) # calculate again although a and b have the same values as before
# another drawback of lru_cache
# print(my_func(a=[1, 2], b=[3, 4])) # unhashable type: list


def memoizer(fn):
    cache = {}
    def inner(*args, **kwargs):
        key = (args, frozenset(kwargs.items()))
        if key in cache:
            return cache[key]
        else:
            result = fn(*args, **kwargs)
            cache[key] = result
            return result
    return inner
        
@memoizer
def my_func(*, a, b):
    print('calculating a + b')
    return a + b

print(my_func(a=1, b=2))
print(my_func(a=1, b=2))
print(my_func(b=2, a=1)) # although the order of a, b changed, there is no need to recalculate

calculating a + b
3
3
calculating a + b
3
calculating a + b
3
3
3


### Dictionary views

Key view

* `d.keys()` returns, instead of a list or an iterator, a lightweight object that
  * maintains a reference to the dictionary
  * implements methods such as:
    * `__iter__`: iterable protocol
    * `__contains__`: membership testing
    * `__and__`: intersection of two views
    * `__or__`: union of two views
    * `__eq__`: same keys in both views
    * `__lt__`: is one set of keys a proper subset of the other

Dictionary views

* Three ways to view the data in a dictionary
  * `d.keys()`
  * `d.values()`
  * `d.items()`
  * they are all iterables
  * some may have set properties
* The order of keys, values and items is the same.
* This order is the dictionary insertion order (an implementation detail in CPython 3.6, guaranteed by the language from Python 3.7).

Set behavior

* The `keys()` view always behaves like a (frozen) set, since elements are unique and hashable.
* The `items()` view may behave like a (frozen) set, if the values are hashable.
* The `values()` view never behaves like a set, since values are not guaranteed unique and hashable.
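
The set behavior of views can be sketched as follows. The dictionaries `d1`, `d2` and `d3` are made-up examples; the results of view operations are plain `set` objects.

```python
d1 = {'a': 1, 'b': 2, 'c': 3}
d2 = {'b': 2, 'c': 30, 'd': 4}

shared_keys = d1.keys() & d2.keys()     # keys present in both dictionaries
only_in_d1 = d1.keys() - d2.keys()      # keys present only in d1
same_pairs = d1.items() & d2.items()    # identical (key, value) pairs

print(sorted(shared_keys))              # ['b', 'c']
print(sorted(only_in_d1))               # ['a']
print(same_pairs)                       # {('b', 2)}
print(type(shared_keys))                # <class 'set'>

# values() views do not implement the set operators at all.
try:
    d1.values() & d2.values()
    values_error = None
except TypeError as exc:
    values_error = exc
    print(f'TypeError: {exc}')

# items() can only be treated as a set when every value is hashable.
d3 = {'a': [1, 2]}
try:
    set(d3.items())
    items_error = None
except TypeError as exc:
    items_error = exc
    print(f'TypeError: {exc}')
```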

Modifying the dictionary while iterating over a view

* Modifying values is usually not a problem.
* Modifying keys can lead to exceptions or worse.
  * Python does not allow modifying the size of the underlying dictionary while iterating over a view.
  * Iterating views while adding or deleting entries in the dictionary may raise a `RuntimeError` or fail to iterate over all items.

In [18]:
d = {'a': 1, 'b': 2, 'c': 3}

# for k, v in d.items(): # RuntimeError
#     print(v)
#     del d[k]

# for k in list(d.keys()):
#     # print(d[k])
#     # del d[k]
#     print(d.pop(k))

# for _ in range(len(d)):
#     print(d.popitem())

while len(d) > 0:
    print(d.popitem())

print(d)

('c', 3)
('b', 2)
('a', 1)
{}


## Project 1

Basic info

* In this project our goal is to validate one dictionary structure against a template dictionary.
* A typical example of this might be working with JSON data inputs in an API. You are trying to validate this received JSON against some kind of template to make sure the received JSON conforms to that template, i.e. all the keys, structure and data type of the values are identical.
* To keep things simple we'll assume that values can be either single values (like an integer, string, etc.) or dictionaries whose values are, recursively, single values or dictionaries. In other words, we're not going to deal with lists as possible values. We'll also assume that all keys are required, and that no extra keys are permitted.
* There are many 3rd party libraries that already exist to do this (such as `jsonschema`, `marshmallow`).
* Write a function such as this:
    ```Python
    def validate(data, template):
        # return True / False
        # in the case of False, return a string describing
        # the first error encountered
        # in the case of True, the string can be empty
        return state, error
    ```
  * Sample returns:
    * `validate(john, template)` -> `True, ''`
    * `validate(eric, template)` -> `False, 'mismatch keys: bio.birthplace.city'`
    * `validate(michael, template)` -> `False, 'bad type: bio.dob.month'`
  * It is better to raise exceptions than to return status codes and error strings.

In [19]:
def validate(data, template, prev_keys=None):
    if prev_keys is None:
        prev_keys = []
    for k, v in template.items():
        data_v = data.get(k, object()) # return default object() if k does not exist in data
        # if data_v != object(): # object.__eq__ compares identity, so this test would always be True
        if type(data_v) is not object:
            data_v_type = type(data_v)
            template_v_type = dict if isinstance(v, dict) else v
            if data_v_type is not template_v_type:
                prev_keys.append(k)
                return False, f"bad type: {'.'.join(prev_keys)}"
            else:
                if isinstance(v, dict):
                    prev_keys.append(k)
                    state, error = validate(data_v, v, prev_keys)
                    if not state:
                        return state, error
                    else:
                        prev_keys.pop() # remove the last previous key if state is True
        else:
            prev_keys.append(k)
            return False, f"mismatch keys: {'.'.join(prev_keys)}"
    return True, ''

template = {
    'user_id': int,
    'name': {
        'first': str,
        'last': str
    },
    'bio': {
        'dob': {
            'year': int,
            'month': int,
            'day': int
        },
        'birthplace': {
            'country': str,
            'city': str
        }
    }
}

john = {
    'user_id': 100,
    'name': {
        'first': 'John',
        'last': 'Cleese'
    },
    'bio': {
        'dob': {
            'year': 1939,
            'month': 11,
            'day': 27
        },
        'birthplace': {
            'country': 'United Kingdom',
            'city': 'Weston-super-Mare'
        }
    }
}

eric = {
    'user_id': 101,
    'name': {
        'first': 'Eric',
        'last': 'Idle'
    },
    'bio': {
        'dob': {
            'year': 1943,
            'month': 3,
            'day': 29
        },
        'birthplace': {
            'country': 'United Kingdom'
        }
    }
}

michael = {
    'user_id': 102,
    'name': {
        'first': 'Michael',
        'last': 'Palin'
    },
    'bio': {
        'dob': {
            'year': 1943,
            'month': 'May',
            'day': 5
        },
        'birthplace': {
            'country': 'United Kingdom',
            'city': 'Sheffield'
        }
    }
}

print(validate(john, template))
print(validate(eric, template))
print(validate(michael, template))

(True, '')
(False, 'mismatch keys: bio.birthplace.city')
(False, 'bad type: bio.dob.month')


In [20]:
class SchemaError(Exception):
    pass

class SchemaKeyMismatch(SchemaError):
    pass

class SchemaTypeMismatch(SchemaError, TypeError):
    pass

def validate(data, template, prev_keys=None):
    if prev_keys is None:
        prev_keys = []
    for k, v in template.items():
        data_v = data.get(k, object()) # return default object() if k does not exist in data
        if type(data_v) is not object:
            data_v_type = type(data_v)
            template_v_type = dict if isinstance(v, dict) else v
            if data_v_type is not template_v_type:
                prev_keys.append(k)
                raise SchemaTypeMismatch(f"bad type: {'.'.join(prev_keys)}")
            else:
                if isinstance(v, dict):
                    prev_keys.append(k)
                    validate(data_v, v, prev_keys)
                    prev_keys.pop() # remove the last previous key if no exception raises
        else:
            prev_keys.append(k)
            raise SchemaKeyMismatch(f"mismatch keys: {'.'.join(prev_keys)}")

template = {
    'user_id': int,
    'name': {
        'first': str,
        'last': str
    },
    'bio': {
        'dob': {
            'year': int,
            'month': int,
            'day': int
        },
        'birthplace': {
            'country': str,
            'city': str
        }
    }
}

john = {
    'user_id': 100,
    'name': {
        'first': 'John',
        'last': 'Cleese'
    },
    'bio': {
        'dob': {
            'year': 1939,
            'month': 11,
            'day': 27
        },
        'birthplace': {
            'country': 'United Kingdom',
            'city': 'Weston-super-Mare'
        }
    }
}

eric = {
    'user_id': 101,
    'name': {
        'first': 'Eric',
        'last': 'Idle'
    },
    'bio': {
        'dob': {
            'year': 1943,
            'month': 3,
            'day': 29
        },
        'birthplace': {
            'country': 'United Kingdom'
        }
    }
}

michael = {
    'user_id': 102,
    'name': {
        'first': 'Michael',
        'last': 'Palin'
    },
    'bio': {
        'dob': {
            'year': 1943,
            'month': 'May',
            'day': 5
        },
        'birthplace': {
            'country': 'United Kingdom',
            'city': 'Sheffield'
        }
    }
}

tony = {
    'user_id': 103,
    'name': {
        'first': 'Tony',
        'last': 'King'
    },
    'bio': None
}

persons = [(john, 'John'), (eric, 'Eric'), (michael, 'Michael'), (tony, 'Tony')]

for person, name in persons:
    try:
        validate(person, template)
        print(f'{name}: ok')
    except SchemaError as e:
        print(f'{name}: {e!r}')

John: ok
Eric: SchemaKeyMismatch('mismatch keys: bio.birthplace.city')
Michael: SchemaTypeMismatch('bad type: bio.dob.month')
Tony: SchemaTypeMismatch('bad type: bio')
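
The project statement also asks that no extra keys be permitted. One way to check missing and extra keys in a single step is the symmetric difference of the two key views. The following is only a sketch under stated assumptions: `validate_strict` is a hypothetical name, the exception classes are redefined so the block runs on its own, and keys are assumed to be strings so they can be sorted for a deterministic error message.

```python
class SchemaError(Exception):
    pass

class SchemaKeyMismatch(SchemaError):
    pass

class SchemaTypeMismatch(SchemaError, TypeError):
    pass

def validate_strict(data, template, path=()):
    """Hypothetical variant: raise on the first missing key, extra key or bad type."""
    if not isinstance(data, dict):
        raise SchemaTypeMismatch(f"bad type: {'.'.join(path)}")
    # Key views behave like sets: ^ yields keys found in exactly one of the two.
    mismatched = data.keys() ^ template.keys()
    if mismatched:
        first = sorted(mismatched)[0]  # assumes string keys
        raise SchemaKeyMismatch(f"mismatch keys: {'.'.join(path + (first,))}")
    for key, expected in template.items():
        key_path = path + (key,)
        if isinstance(expected, dict):
            validate_strict(data[key], expected, key_path)
        elif type(data[key]) is not expected:
            raise SchemaTypeMismatch(f"bad type: {'.'.join(key_path)}")

template = {'user_id': int, 'name': {'first': str, 'last': str}}
samples = {
    'ok': {'user_id': 1, 'name': {'first': 'John', 'last': 'Cleese'}},
    'missing': {'user_id': 2, 'name': {'first': 'Eric'}},
    'extra': {'user_id': 3, 'name': {'first': 'Tony', 'last': 'King', 'nick': 'T'}},
    'bad_type': {'user_id': '4', 'name': {'first': 'A', 'last': 'B'}},
}

results = {}
for label, sample in samples.items():
    try:
        validate_strict(sample, template)
        results[label] = 'ok'
    except SchemaError as exc:
        results[label] = repr(exc)
    print(f'{label}: {results[label]}')
```

Passing the path as an immutable tuple avoids the append / pop bookkeeping on a shared list.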


## Serialization and deserialization

### Introduction

Serialization and deserialization

* In computing, serialization is the process of translating a data structure or object state into a format that can be stored (e.g. files in secondary storage devices, data buffers in primary storage devices) or transmitted (e.g. data streams over computer networks) and reconstructed later (possibly in a different computer environment).
* When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward.
* Serialization of object-oriented objects does not include any of their associated methods with which they were previously linked.
* This process of serializing an object is also called marshalling an object in some situations.
* The opposite operation, extracting a data structure from a series of bytes, is deserialization (also called unserialization or unmarshalling).

Pickling and unpickling

* Pickling is the process whereby a Python object hierarchy is converted into a *byte stream*.
* Unpickling is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
* Pickling (and unpickling) is alternatively known as serialization, marshalling, or flattening.

### Pickling

* The `pickle` module implements binary protocols for serializing and de-serializing a Python object structure.
* The data format used by pickle is Python-specific.
  * This has the advantage that there are no restrictions imposed by external standards such as JSON or XDR (which can’t represent pointer sharing);
  * however it means that non-Python programs may not be able to reconstruct pickled Python objects.
* By default, the pickle data format uses a relatively compact binary representation. If you need optimal size characteristics, you can efficiently compress pickled data.
* The `pickle` module keeps track of the objects it has already serialized, so that later references to the same object won't be serialized again.
  * This has implications both for *recursive objects* and *object sharing*.
  * Recursive objects are objects that contain references to themselves.
  * Object sharing happens when there are multiple references to the same object in different places in the object hierarchy being serialized.
  * `pickle` stores such objects only once, and ensures that all other references point to the master copy. Shared objects remain shared, which can be very important for mutable objects.
* It is possible to construct malicious pickle data which will execute arbitrary code during unpickling.
  * Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
  * Consider signing data with `hmac` if you need to ensure that it has not been tampered with.
  * Safer serialization formats such as json may be more appropriate if you are processing untrusted data.
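
Object sharing and recursive objects, as described above, can be observed with a round trip through `pickle.dumps` / `pickle.loads`. A minimal sketch using only built-in containers (the data is produced locally, so unpickling it is safe):

```python
import pickle

# Object sharing: two references to the same list.
shared = [1, 2]
data = {'a': shared, 'b': shared}
restored = pickle.loads(pickle.dumps(data))

print(restored['a'] is restored['b'])   # True: sharing survives the round trip
print(restored['a'] is shared)          # False: the restored list is a new object

restored['a'].append(3)
print(restored['b'])                    # [1, 2, 3]: both names see the change

# Recursive object: a list that contains itself.
recursive = [1, 2]
recursive.append(recursive)
clone = pickle.loads(pickle.dumps(recursive))
print(clone[2] is clone)                # True: the self-reference is rebuilt
```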

Usage

* `pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)`
  * Write the pickled representation of the object `obj` to the open file object `file`.
  * This is equivalent to `Pickler(file, protocol).dump(obj)`.
* `pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)`
  * Return the pickled representation of the object `obj` as a bytes object, instead of writing it to a file.
* `pickle.load(file, *, fix_imports=True, encoding='ASCII', errors='strict', buffers=None)`
  * Read the pickled representation of an object from the open file object `file` and return the reconstituted object hierarchy specified therein.
  * This is equivalent to `Unpickler(file).load()`.
  * The protocol version of the pickle is detected automatically, so no protocol argument is needed. Bytes past the pickled representation of the object are ignored.
* `pickle.loads(data, /, *, fix_imports=True, encoding='ASCII', errors='strict', buffers=None)`
  * Return the reconstituted object hierarchy of the pickled representation `data` of an object. data must be a bytes-like object.
  * The protocol version of the pickle is detected automatically, so no protocol argument is needed.
  * Bytes past the pickled representation of the object are ignored.
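
The four functions can be sketched as follows. An in-memory `io.BytesIO` buffer stands in for an open binary file, and `record` is a made-up object.

```python
import io
import pickle

record = {'name': 'John', 'scores': [1, 2.5, 3 + 4j], 'tags': {'a', 'b'}}

# dumps / loads work with bytes objects.
payload = pickle.dumps(record)
print(type(payload))                           # <class 'bytes'>
print(pickle.loads(payload) == record)         # True

# Bytes after the end of the pickle are ignored.
print(pickle.loads(payload + b'trailing') == record)   # True

# dump / load work with binary file objects.
buffer = io.BytesIO()
pickle.dump(record, buffer, protocol=pickle.HIGHEST_PROTOCOL)
buffer.seek(0)                                 # rewind before reading
from_file = pickle.load(buffer)
print(from_file == record)                     # True
```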

What can be pickled and unpickled?

* The following types can be pickled:
  * built-in constants (`None`, `True`, `False`, `Ellipsis`, and `NotImplemented`);
  * integers, floating-point numbers, complex numbers;
  * strings, bytes, bytearrays;
  * tuples, lists, sets, and dictionaries containing only picklable objects;
  * functions (built-in and user-defined) accessible from the top level of a module (using `def`, not `lambda`);
  * classes accessible from the top level of a module;
  * instances of such classes, provided the result of calling `__getstate__()` is picklable.
* Attempts to pickle unpicklable objects will raise the `PicklingError` exception.
  * When this happens, an unspecified number of bytes may have already been written to the underlying file.
  * Trying to pickle a highly recursive data structure may exceed the maximum recursion depth; a `RecursionError` is raised in this case.
  * You can carefully raise this limit with `sys.setrecursionlimit()`.
* Functions (built-in and user-defined) are pickled by fully qualified name, not by value.
  * This means that only the function name is pickled, along with the name of the containing module and classes.
  * Neither the function's code, nor any of its function attributes are pickled.
  * Thus the defining module must be importable in the unpickling environment, and the module must contain the named object, otherwise an exception will be raised.
* Similarly, classes are pickled by fully qualified name, so the same restrictions in the unpickling environment apply.
  * None of the class's code or data is pickled.
  * These restrictions are why picklable functions and classes must be defined at the top level of a module.
  * When class instances are pickled, their class's code and data are not pickled along with them.
    * Only the instance data are pickled.
    * This is done on purpose, so you can fix bugs in a class or add methods to the class and still load objects that were created with an earlier version of the class.
    * If you plan to have long-lived objects that will see many versions of a class, it may be worthwhile to put a version number in the objects so that suitable conversions can be made by the class's `__setstate__()` method.
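
Pickling by name rather than by value can be observed with a built-in function and a lambda. This is a small sketch; depending on the Python version, the failure for the lambda may surface as `pickle.PicklingError` or `AttributeError`, so both are caught.

```python
import pickle

# A function reachable by its qualified name is pickled as a reference,
# so unpickling returns the very same function object.
restored_len = pickle.loads(pickle.dumps(len))
print(restored_len is len)        # True

# A lambda has no importable name, so it cannot be pickled.
lambda_error = None
try:
    pickle.dumps(lambda x: x + 1)
except (pickle.PicklingError, AttributeError) as exc:
    lambda_error = exc
    print(f'cannot pickle lambda: {type(exc).__name__}')
```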

Pickling class instances

* In most cases, no additional code is needed to make instances picklable.
* By default, pickle will retrieve the class and the attributes of an instance via introspection.
* When a class instance is unpickled, its `__init__()` method is usually not invoked. The default behavior first creates an uninitialized instance and then restores the saved attributes. The following code shows an implementation of this behavior:
    ```Python
    def save(obj):
        return (obj.__class__, obj.__dict__)

    def restore(cls, attributes):
        obj = cls.__new__(cls)
        obj.__dict__.update(attributes)
        return obj
    ```
* Classes can alter the default behavior by providing one or several special methods.
  * `object.__getnewargs_ex__()`
  * `object.__getnewargs__()`
  * `object.__getstate__()`
  * `object.__setstate__(state)`
  * `object.__reduce__()`
  * `object.__reduce_ex__(protocol)`
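
As an illustration of two of these hooks, the sketch below uses `__getstate__()` to leave an unpicklable attribute out of the pickle and `__setstate__()` to restore a placeholder for it. The `Connection` class and its attributes are hypothetical, and the code assumes it is run as a script so that the class is importable from the top level of `__main__`.

```python
import pickle

class Connection:
    init_calls = 0  # counts how often __init__ runs

    def __init__(self, host):
        Connection.init_calls += 1
        self.host = host
        self.handle = lambda: None   # stand-in for an unpicklable resource

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['handle']          # leave the resource out of the pickle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.handle = None           # to be reopened by the caller if needed

conn = Connection('example.org')
clone = pickle.loads(pickle.dumps(conn))

print(clone.host)                # example.org
print(clone.handle)              # None
print(Connection.init_calls)     # 1: __init__ was not called during unpickling
```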

### JSON Serialization

* JSON (JavaScript Object Notation), specified by RFC 7159 (which obsoletes RFC 4627) and by ECMA-404, is a lightweight data interchange format inspired by JavaScript object literal syntax (although it is not a strict subset of JavaScript).
  * text-based object serialization
  * open standard
  * human readable
* It's a very common format for web APIs and general data interchange between systems.
* Unlike pickling, it is considered safe, but
  * a malicious JSON string may cause the decoder to consume considerable CPU and memory resources
  * limiting the size of data to be parsed is recommended
* JSON is a natural fit for serializing and deserializing Python dictionaries.
  * Python dictionaries are objects.
  * JSON is essentially a string. 

Supported data types

* strings: delimited by double quotes, unicode
* numbers: int, float
* booleans: true, false
* arrays (lists, tuples): delimited by square brackets
* dictionaries: key-value pairs, keys must be strings, values are any supported data type
* empty value: null
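
Because JSON has fewer types than Python, a round trip through JSON is not always lossless. A short sketch with a made-up dictionary: tuples come back as lists and non-string keys come back as strings.

```python
import json

original = {1: (1, 2), 'ok': True, 'nothing': None, 'pi': 3.14}

text = json.dumps(original)
print(text)        # {"1": [1, 2], "ok": true, "nothing": null, "pi": 3.14}

restored = json.loads(text)
print(restored)    # {'1': [1, 2], 'ok': True, 'nothing': None, 'pi': 3.14}
print(restored == original)   # False: the key and the tuple changed type
```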

`json` module

* `json` exposes an API familiar to users of the standard library `marshal` and `pickle` modules.

In [14]:
import json
from pprint import pprint
# from fractions import Fraction
# from decimal import Decimal

d_json = '''
{
    "name": "John Cleese",
    "age": 82,
    "height": 1.8,
    "walkFunny": true,
    "sketches": [
        {
            "title": "Dead Parrot",
            "costars": ["Michael Palin"]
        },
        {
            "title": "Ministry of Silly Walks",
            "costars": ["Michael Palin", "Terry Jones"]
        }
    ],
    "boring": null
}
'''
d = json.loads(d_json)
pprint(d)
print(type(d['age']))

d = {'a': (1, 2, 3)}
d_json = json.dumps(d)
print(d_json)

print(json.dumps(1))
print(json.dumps(1.1))
# print(json.dumps(1+0j)) # Object of type complex is not JSON serializable
# print(json.dumps(Fraction(1, 2))) # Object of type Fraction is not JSON serializable
# print(json.dumps(Decimal('0.5'))) # Object of type Decimal is not JSON serializable

{'age': 82,
 'boring': None,
 'height': 1.8,
 'name': 'John Cleese',
 'sketches': [{'costars': ['Michael Palin'], 'title': 'Dead Parrot'},
              {'costars': ['Michael Palin', 'Terry Jones'],
               'title': 'Ministry of Silly Walks'}],
 'walkFunny': True}
<class 'int'>
{"a": [1, 2, 3]}
1
1.1


### Custom JSON encoding

Specifying a custom encoding function

* One of the arguments of the `dump` / `dumps` functions is `default`:
  * If specified, `default` should be a function that gets called for objects that can't otherwise be serialized.
  * It should return a JSON encodable version of the object or raise a `TypeError`. If not specified, `TypeError` is raised.
  * The function can include logic to differentiate between different types
    * or we can use a single dispatch generic function (using `functools.singledispatch` decorator)

In [23]:
import datetime
import json

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        self.create_at = datetime.datetime.now(datetime.UTC)

    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'
    
    def to_json(self):
        # return {
        #     'name': self.name,
        #     'age': self.age,
        #     'create_at': self.create_at # no isoformat() needed here: when the dict is dumped, the serializer function is called again for the nested datetime
        # }
        return vars(self)

def json_serializer(obj):
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    elif isinstance(obj, set):
        return list(obj)
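    elif isinstance(obj, Person):
        return obj.to_json()
    # unsupported type: raise TypeError instead of silently returning None (which would be encoded as null)
    raise TypeError(f'Object of type {type(obj).__name__} is not JSON serializable')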

current_dt = datetime.datetime.now(datetime.UTC)
print(current_dt.isoformat())
log_record = {'time': current_dt,
              'message': 'test',
              'set': {1, 2, 3},
              'person': Person('Tony', 18)}
print(log_record)
# json.dumps(log_record) # Object of type datetime is not JSON serializable

log_record_serialized = json.dumps(log_record, default=json_serializer, indent=4)
print(log_record_serialized)

2024-02-07T21:29:56.469564+00:00
{'time': datetime.datetime(2024, 2, 7, 21, 29, 56, 469564, tzinfo=datetime.timezone.utc), 'message': 'test', 'set': {1, 2, 3}, 'person': Person(name=Tony, age=18)}
{
    "time": "2024-02-07T21:29:56.469564+00:00",
    "message": "test",
    "set": [
        1,
        2,
        3
    ],
    "person": {
        "name": "Tony",
        "age": 18,
        "create_at": "2024-02-07T21:29:56.469564+00:00"
    }
}


In [6]:
from decimal import Decimal
import json
import datetime

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'
    
    def to_json(self):
        return vars(self)
    
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f'Point(x={self.x}, y={self.y})'
    
def json_serializer(obj):
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    else:
        try:
            return obj.to_json()
        except AttributeError:
            try:
                return vars(obj)
            except TypeError:
                return str(obj)
    
pt_1 = Point(0, 0)
pt_2 = Point(Decimal('1.5'), Decimal('-2.5'))
person = Person('Tony', 24)

log_record = dict(
    created_at=datetime.datetime.now(datetime.UTC),
    point_1=pt_1,
    point_2=pt_2,
    points={pt_1, pt_2},
    created_by=person
)

log_record_serialized = json.dumps(log_record, default=json_serializer, indent=4)
print(log_record_serialized)

{
    "created_at": "2024-02-08T07:55:16.221482+00:00",
    "point_1": {
        "x": 0,
        "y": 0
    },
    "point_2": {
        "x": "1.5",
        "y": "-2.5"
    },
    "points": "{Point(x=0, y=0), Point(x=1.5, y=-2.5)}",
    "created_by": {
        "name": "Tony",
        "age": 24
    }
}


In [8]:
from functools import singledispatch
import datetime
from decimal import Decimal

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'
    
    def to_json(self):
        return vars(self)
    
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f'Point(x={self.x}, y={self.y})'

@singledispatch
def json_serializer(obj):
    try:
        return obj.to_json()
    except AttributeError:
        try:
            return vars(obj)
        except TypeError:
            return str(obj)
        
@json_serializer.register(datetime.datetime)
def _(obj):
    return obj.isoformat()

pt_1 = Point(0, 0)
pt_2 = Point(Decimal('1.5'), Decimal('-2.5'))
person = Person('Tony', 24)

log_record = dict(
    created_at=datetime.datetime.now(datetime.UTC),
    point_1=pt_1,
    point_2=pt_2,
    points={pt_1, pt_2},
    created_by=person
)

log_record_serialized = json.dumps(log_record, default=json_serializer, indent=4)
print(log_record_serialized)

{
    "created_at": "2024-02-08T08:04:21.292290+00:00",
    "point_1": {
        "x": 0,
        "y": 0
    },
    "point_2": {
        "x": "1.5",
        "y": "-2.5"
    },
    "points": "{Point(x=0, y=0), Point(x=1.5, y=-2.5)}",
    "created_by": {
        "name": "Tony",
        "age": 24
    }
}


### Using `JSONEncoder`

* `class json.JSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)`
  * Extensible JSON encoder for Python data structures.
  * Supports the following objects and types by default:

    Python|JSON
    ---|---
    dict|object
    list, tuple|array
    str|string
    int, float, int- & float-derived Enums|number
    True|true
    False|false
    None|null

  * To extend this to recognize other objects, subclass `JSONEncoder` and override `default()` so that it returns a serializable object when it can; otherwise it should call the superclass implementation (which raises `TypeError`).

In [9]:
import json

default_json_encoder = json.JSONEncoder()
print(default_json_encoder.encode({'a': [1, 2], 'b': (3, 4), 'c': True, 'd': None}))

{"a": [1, 2], "b": [3, 4], "c": true, "d": null}


In [12]:
import json
import datetime

class CustomJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime.datetime):
            return obj.isoformat()
        else:
            try:
                iter(obj)
            except TypeError:
                pass
            else:
                return list(obj)
            # return json.JSONEncoder.default(self, obj)
            return super().default(obj)
        
custom_encoder = CustomJSONEncoder()
print(custom_encoder.encode({1, 2, 3}))
print(custom_encoder.encode(datetime.datetime.now()))
print(custom_encoder.encode({'a': 1, 'b': False, 'c': (1, 2)}))
# print(custom_encoder.encode(1 + 0j))

dumped = json.dumps(dict(name='Tony', time=datetime.datetime.now(), set={1, 2}), cls=CustomJSONEncoder)
print(dumped)

[1, 2, 3]
"2024-02-08T20:10:13.450292"
{"a": 1, "b": false, "c": [1, 2]}
{"name": "Tony", "time": "2024-02-08T20:10:13.450292", "set": [1, 2]}


In [15]:
print(float('nan'), float('inf'), float('infinity'), float('-inf')) # allow_nan argument controls whether these should be serialized

nan inf inf -inf
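
The effect of `allow_nan` can be sketched with the standard `json` module alone. With the default `allow_nan=True`, these floats are written as the tokens `NaN`, `Infinity`, and `-Infinity`, which are not part of the JSON specification; with `allow_nan=False`, serialization raises `ValueError` instead. The variable names below are illustrative only.

```python
import json

values = [float('nan'), float('inf'), float('-inf')]

# Default (allow_nan=True): emits tokens that strict JSON parsers may reject
lenient = json.dumps(values)
print(lenient)  # [NaN, Infinity, -Infinity]

# allow_nan=False: refuse to serialize out-of-range floats
strict_error = None
try:
    json.dumps(values, allow_nan=False)
except ValueError as exc:
    strict_error = str(exc)
    print('ValueError:', strict_error)
```

Setting `allow_nan=False` is a reasonable default when the consumer is not Python, since many other JSON parsers reject these tokens.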


In [22]:
import datetime
import json

class CustomJSONEncoder(json.JSONEncoder):
    def __init__(self, **kwargs):
        kwargs |= dict(skipkeys=True, allow_nan=False, separators=(',', ':'))
        super().__init__(**kwargs)

    def default(self, obj):
        if isinstance(obj, datetime.datetime):
            return dict(
                datatype='datetime',
                isoformat=obj.isoformat(),
                date=obj.date().isoformat(),
                time=obj.time().isoformat(),
                year=obj.year,
                month=obj.month,
                day=obj.day,
                hour=obj.hour,
                minute=obj.minute,
                second=obj.second
            )
        else:
            try:
                iter(obj)
            except TypeError:
                pass
            else:
                return list(obj)
            return super().default(obj)
        
d = {
    'created_by': 'Tony',
    'created_at': datetime.datetime.now(datetime.UTC),
    'set': {1, 2, 3},
    1.2: 'float',
    None: 'None',
    0j: 'complex'
}

print(json.dumps(d, cls=CustomJSONEncoder, indent=4))

{
    "created_by":"Tony",
    "created_at":{
        "datatype":"datetime",
        "isoformat":"2024-02-08T10:01:01.726969+00:00",
        "date":"2024-02-08",
        "time":"10:01:01.726969",
        "year":2024,
        "month":2,
        "day":8,
        "hour":10,
        "minute":1,
        "second":1
    },
    "set":[
        1,
        2,
        3
    ],
    "1.2":"float",
    "null":"None"
}


### Custom JSON decoding

* `json.load(fp, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)`
  * Deserialize `fp` (a `.read()`-supporting text file or binary file containing a JSON document) to a Python object using the JSON-to-Python conversion table listed under `JSONDecoder`.
  * `object_hook` is an optional function that will be called with the result of any object literal decoded (a `dict`). The return value of `object_hook` will be used instead of the `dict`. This feature can be used to implement custom decoders (e.g. JSON-RPC class hinting).
  * `object_pairs_hook` is an optional function that will be called with the result of any object literal decoded with an *ordered list of pairs*. The return value of `object_pairs_hook` will be used instead of the `dict`. This feature can be used to implement custom decoders. If `object_hook` is also defined, the `object_pairs_hook` takes priority.
  * `parse_float`, if specified, will be called with the string of every JSON float to be decoded. By default, this is equivalent to `float(num_str)`. This can be used to use another datatype or parser for JSON floats (e.g. `decimal.Decimal`).
  * `parse_int`, if specified, will be called with the string of every JSON int to be decoded. By default, this is equivalent to `int(num_str)`. This can be used to use another datatype or parser for JSON integers (e.g. `float`).
  * `parse_constant`, if specified, will be called with one of the following strings: `'-Infinity'`, `'Infinity'`, `'NaN'`. This can be used to raise an exception if invalid JSON numbers are encountered.
  * To use a custom JSONDecoder subclass, specify it with the `cls` kwarg; otherwise `JSONDecoder` is used. Additional keyword arguments will be passed to the constructor of the class.
  * If the data being deserialized is not a valid JSON document, a `JSONDecodeError` will be raised.
* `json.loads(s, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)`
  * Deserialize `s` (a `str`, `bytes` or `bytearray` instance containing a JSON document) to a Python object using the same conversion table as `json.load`.
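
As a small sketch of two of the hooks described above: `object_pairs_hook` receives each JSON object as a list of `(key, value)` tuples, so it can detect repeated keys that a plain `dict` would silently overwrite, and `parse_float` receives the raw numeric string. The helper name `reject_duplicate_keys` and the sample document are assumptions made for illustration.

```python
import json
from decimal import Decimal

def reject_duplicate_keys(pairs):
    """object_pairs_hook: called with a list of (key, value) tuples per JSON object."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f'duplicate key: {key!r}')
        seen[key] = value
    return seen

doc = '{"price": 10.50, "qty": 3, "price": 11.00}'

# Plain loads keeps the last value for a repeated key
print(json.loads(doc))  # {'price': 11.0, 'qty': 3}

try:
    json.loads(doc, object_pairs_hook=reject_duplicate_keys)
except ValueError as exc:
    print('rejected:', exc)

# parse_float gets the original text, so Decimal preserves the trailing zero
exact = json.loads('{"price": 10.50}', parse_float=Decimal)
print(exact)  # {'price': Decimal('10.50')}
```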

Schemas

* Deserializing custom JSON types and objects is difficult.
* In general, we need to know the structure of the JSON data in order to custom deserialize.
  * This is referred to as the schema, a pre-defined agreement on how the JSON is going to be structured or serialized.
  * The schema might be for the entire JSON, or for sub-components only.

Overriding basic type deserializations

* Notice that `object_hook` only allows us to customize deserialization of objects (dicts).
* We can override the way numbers are handled by using some extra keyword-arguments in `load` / `loads`:
  * `parse_float`
  * `parse_int`
  * `parse_constant`
* There are no overrides for strings.
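
One possible use of `parse_constant`, sketched here under the assumption that the application wants to accept only standards-compliant JSON, is to raise as soon as `NaN`, `Infinity`, or `-Infinity` appears. The function name `reject_constant` is hypothetical.

```python
import json

def reject_constant(name):
    # name is one of the strings 'NaN', 'Infinity', '-Infinity'
    raise ValueError(f'non-standard JSON constant: {name}')

# Ordinary numbers never reach parse_constant
ok = json.loads('{"ratio": 0.5}', parse_constant=reject_constant)
print(ok)

rejected = False
try:
    json.loads('{"ratio": NaN}', parse_constant=reject_constant)
except ValueError as exc:
    rejected = True
    print('rejected:', exc)
```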

In [34]:
import json
import datetime
from fractions import Fraction

serialized = '''
{
    "times": {
        "created": {
            "type": "datetime",
            "value": "2024-02-10T08:09:55"
        },
        "updated": {
            "type": "datetime",
            "value": "2024-02-11T09:10:00"
        }
    },
    "message": "time of creation and update",
    "myShare": {
        "type": "fraction",
        "numerator": 1,
        "denominator": 2
    }
}
'''
print(json.loads(serialized))

def custom_decoder(dict_obj):
    if dict_obj.get('type') == 'datetime':
        value = dict_obj.get('value')
        return datetime.datetime.fromisoformat(value)
    elif dict_obj.get('type') == 'fraction':
        return Fraction(dict_obj['numerator'], dict_obj['denominator'])
    else:
        return dict_obj
    
print(json.loads(serialized, object_hook=custom_decoder))

{'times': {'created': {'type': 'datetime', 'value': '2024-02-10T08:09:55'}, 'updated': {'type': 'datetime', 'value': '2024-02-11T09:10:00'}}, 'message': 'time of creation and update', 'myShare': {'type': 'fraction', 'numerator': 1, 'denominator': 2}}
{'times': {'created': datetime.datetime(2024, 2, 10, 8, 9, 55), 'updated': datetime.datetime(2024, 2, 11, 9, 10)}, 'message': 'time of creation and update', 'myShare': Fraction(1, 2)}


In [29]:
import json

def obj_hook(obj_dict):
    print('object hook', type(obj_dict), obj_dict)
    return obj_dict

def int_handler(obj):
    print('int handler', type(obj), obj)
    return obj

def float_handler(obj):
    print('float handler', type(obj), obj)
    return obj

def constant_handler(obj):
    print('Constant handler', type(obj), obj)
    return f'constant {obj}'

j = '''
{
    "a": [1, 2, 3, 4],
    "b": 100,
    "c": 10.5,
    "d": null,
    "e": NaN,
    "f": Infinity,
    "g": -Infinity
}
'''
print(json.loads(j))
print(json.loads(j, object_hook=obj_hook, parse_constant=constant_handler, parse_float=float_handler, parse_int=int_handler))

{'a': [1, 2, 3, 4], 'b': 100, 'c': 10.5, 'd': None, 'e': nan, 'f': inf, 'g': -inf}
int handler <class 'str'> 1
int handler <class 'str'> 2
int handler <class 'str'> 3
int handler <class 'str'> 4
int handler <class 'str'> 100
float handler <class 'str'> 10.5
Constant handler <class 'str'> NaN
Constant handler <class 'str'> Infinity
Constant handler <class 'str'> -Infinity
object hook <class 'dict'> {'a': ['1', '2', '3', '4'], 'b': '100', 'c': '10.5', 'd': None, 'e': 'constant NaN', 'f': 'constant Infinity', 'g': 'constant -Infinity'}
{'a': ['1', '2', '3', '4'], 'b': '100', 'c': '10.5', 'd': None, 'e': 'constant NaN', 'f': 'constant Infinity', 'g': 'constant -Infinity'}


### Using `JSONDecoder`

* `class json.JSONDecoder(*, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, strict=True, object_pairs_hook=None)`
  * Performs the following translations in decoding by default:

    JSON|Python
    ---|---
    object|dict
    array|list
    string|str
    number (int)|int
    number (real)|float
    true|True
    false|False
    null|None

  * It also understands `NaN`, `Infinity`, and `-Infinity` as their corresponding *float* values, which is outside the JSON spec.
  * If `strict` is false (`True` is the default), then control characters will be allowed inside strings. Control characters in this context are those with character codes in the 0–31 range, including `'\t'` (tab), `'\n'`, `'\r'` and `'\0'`.
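
The `strict` flag can be illustrated with a JSON string that contains a raw, unescaped tab character. This is a minimal sketch; the sample document is made up for the illustration.

```python
import json

# Non-raw Python literal: \t here is a real tab character inside the JSON string
raw = '{"note": "a\tb"}'

strict_rejected = False
try:
    json.loads(raw)  # strict=True is the default
except json.JSONDecodeError as exc:
    strict_rejected = True
    print('strict decoder rejected it:', exc.msg)

# strict=False is forwarded to the JSONDecoder constructor
relaxed = json.loads(raw, strict=False)
print(relaxed)

# The escaped form (backslash + t in the JSON text) is valid either way
escaped = json.loads(r'{"note": "a\tb"}')
print(escaped == relaxed)
```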

In [34]:
import json

class CustomJSONDecoder(json.JSONDecoder):
    # def object_hook(self, obj_dict): # no effect
    #     print('object hook', type(obj_dict), obj_dict)
    #     return obj_dict

    def decode(self, obj):
        print('decode', type(obj), obj)
        return super().decode(obj)
    
j = '''
{
    "a": 100,
    "b": [1, 2, 3],
    "c": NaN,
    "d": {
        "e": Infinity,
        "f": -Infinity
    },
    "g": 10.5
}
'''
deserialized = json.loads(j, cls=CustomJSONDecoder)
print(deserialized)
print(deserialized['d']['e'], type(deserialized['d']['e']))

decode <class 'str'> 
{
    "a": 100,
    "b": [1, 2, 3],
    "c": NaN,
    "d": {
        "e": Infinity,
        "f": -Infinity
    },
    "g": 10.5
}

{'a': 100, 'b': [1, 2, 3], 'c': nan, 'd': {'e': inf, 'f': -inf}, 'g': 10.5}
inf <class 'float'>


In [49]:
import json
import re
from pprint import pprint

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f'Point(x={self.x}, y={self.y})'
    
    def to_json(self):
        return vars(self)


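def serialize_point(obj):
    if isinstance(obj, Point):
        return dict(type='point', value=obj.to_json())
    # A default function must raise TypeError for unsupported objects;
    # returning obj unchanged ends in a "Circular reference detected" error
    raise TypeError(f'Object of type {type(obj).__name__} is not JSON serializable')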


class CustomJSONDecoder(json.JSONDecoder):
    def decode(self, string):
        obj = super().decode(string)
        pattern = r'"type"\s*:\s*"point"'
        if re.search(pattern, string):
            obj = self.parse_point(obj)  # keep the result so a top-level point is replaced too
        return obj
    
    # def parse_point(self, obj):
    #     if obj.get('type') == 'point':
    #         return Point(**obj['value'])
    #     else:
    #         for k, v in obj.items():
    #             if isinstance(v, list):
    #                 for i, e in enumerate(v):
    #                     if isinstance(e, dict):
    #                         v[i] = self.parse_point(e)
    #             elif isinstance(v, dict):
    #                 obj[k] = self.parse_point(v)
    #         return obj # if not point return the object unchanged

    def parse_point(self, obj):
        if isinstance(obj, dict):
            if obj.get('type') == 'point':
                obj = Point(**obj['value'])
            else:
                for k, v in obj.items():
                    obj[k] = self.parse_point(v)
        elif isinstance(obj, list):
            for i, e in enumerate(obj):
                obj[i] = self.parse_point(e)
        return obj
        

d = {'about': 'test',
     'int': 10, 'float': 10.5,
     'point': Point(10, 10),
     'point_list': [Point(0, 0), Point(1, -1)],
     'point_dict': {'p1': Point(2, 2), 'p2': Point(3, 3)},
     'dicts': [{'a': 1, 'b': 2}, {'c': 3, 'd': 4}]}

d_serialized = json.dumps(d, default=serialize_point)
pprint(d)
print(d_serialized)

d_deserialized = json.loads(d_serialized, cls=CustomJSONDecoder)
pprint(d_deserialized)  


{'about': 'test',
 'dicts': [{'a': 1, 'b': 2}, {'c': 3, 'd': 4}],
 'float': 10.5,
 'int': 10,
 'point': Point(x=10, y=10),
 'point_dict': {'p1': Point(x=2, y=2), 'p2': Point(x=3, y=3)},
 'point_list': [Point(x=0, y=0), Point(x=1, y=-1)]}
{"about": "test", "int": 10, "float": 10.5, "point": {"type": "point", "value": {"x": 10, "y": 10}}, "point_list": [{"type": "point", "value": {"x": 0, "y": 0}}, {"type": "point", "value": {"x": 1, "y": -1}}], "point_dict": {"p1": {"type": "point", "value": {"x": 2, "y": 2}}, "p2": {"type": "point", "value": {"x": 3, "y": 3}}}, "dicts": [{"a": 1, "b": 2}, {"c": 3, "d": 4}]}
{'about': 'test',
 'dicts': [{'a': 1, 'b': 2}, {'c': 3, 'd': 4}],
 'float': 10.5,
 'int': 10,
 'point': Point(x=10, y=10),
 'point_dict': {'p1': Point(x=2, y=2), 'p2': Point(x=3, y=3)},
 'point_list': [Point(x=0, y=0), Point(x=1, y=-1)]}


In [58]:
import json
import re
from decimal import Decimal
from pprint import pprint

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f'Point(x={self.x}, y={self.y})'
    
    def to_json(self):
        return vars(self)


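def serialize_point(obj):
    if isinstance(obj, Point):
        return dict(type='point', value=obj.to_json())
    # A default function must raise TypeError for unsupported objects;
    # returning obj unchanged ends in a "Circular reference detected" error
    raise TypeError(f'Object of type {type(obj).__name__} is not JSON serializable')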


class CustomJSONDecoder(json.JSONDecoder):
    def __init__(self, **kwargs):
        kwargs |= dict(parse_float=Decimal, object_hook=self.parse_point)
        super().__init__(**kwargs)

    def parse_point(self, obj): # this is much simpler than overriding decode()
        if obj.get('type') == 'point':
            return Point(**obj['value'])
        return obj
        

d = {'about': 'test',
     'int': 10, 'float': 10.5,
     'point': Point(1.1, 2.2),
     'point_list': [Point(0, 0), Point(1, -1)],
     'point_dict': {'p1': Point(2, 2), 'p2': Point(3, 3)},
     'dicts': [{'a': 1, 'b': 2}, {'c': 3, 'd': 4}]}

d_serialized = json.dumps(d, default=serialize_point)
pprint(d)
print(d_serialized)

d_deserialized = json.loads(d_serialized, cls=CustomJSONDecoder)
pprint(d_deserialized)
print(f"{type(d_deserialized['point'].x)=}")  # mixed quotes also work before Python 3.12

{'about': 'test',
 'dicts': [{'a': 1, 'b': 2}, {'c': 3, 'd': 4}],
 'float': 10.5,
 'int': 10,
 'point': Point(x=1.1, y=2.2),
 'point_dict': {'p1': Point(x=2, y=2), 'p2': Point(x=3, y=3)},
 'point_list': [Point(x=0, y=0), Point(x=1, y=-1)]}
{"about": "test", "int": 10, "float": 10.5, "point": {"type": "point", "value": {"x": 1.1, "y": 2.2}}, "point_list": [{"type": "point", "value": {"x": 0, "y": 0}}, {"type": "point", "value": {"x": 1, "y": -1}}], "point_dict": {"p1": {"type": "point", "value": {"x": 2, "y": 2}}, "p2": {"type": "point", "value": {"x": 3, "y": 3}}}, "dicts": [{"a": 1, "b": 2}, {"c": 3, "d": 4}]}
{'about': 'test',
 'dicts': [{'a': 1, 'b': 2}, {'c': 3, 'd': 4}],
 'float': Decimal('10.5'),
 'int': 10,
 'point': Point(x=1.1, y=2.2),
 'point_dict': {'p1': Point(x=2, y=2), 'p2': Point(x=3, y=3)},
 'point_list': [Point(x=0, y=0), Point(x=1, y=-1)]}
type(d_deserialized['point'].x)=<class 'decimal.Decimal'>


### JSON schema

* [JSON Schema](https://json-schema.org/) is a vocabulary that you can use to annotate and validate JSON documents.
* After you create the JSON Schema document, you can validate the example data against your schema using a validator in a language of your choice.
* `pip install jsonschema` for validation
  * https://json-schema.org/implementations#validators-python
  * https://github.com/python-jsonschema/jsonschema

In [None]:
schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://example.com/product.schema.json",
    "title": "Product",
    "description": "A product from Acme's catalog",
    "type": "object",
    "properties": {
        "productId": {
            "description": "The unique identifier for a product",
            "type": "integer"
        },
        "productName": {
            "description": "Name of the product",
            "type": "string"
        },
        "price": {
            "description": "The price of the product",
            "type": "number",
            "exclusiveMinimum": 0
        },
        "tags": {
            "description": "Tags for the product",
            "type": "array",
            "items": {
                "type": "string"
            },
            "minItems": 1,
            "uniqueItems": True  # Python literal; `true` is only valid in JSON text
        },
        "dimensions": {
            "type": "object",
            "properties": {
                "length": {
                    "type": "number"
                },
                "width": {
                    "type": "number"
                },
                "height": {
                    "type": "number"
                }
            },
            "required": ["length", "width", "height"]
        },
        "warehouseLocation": {
            "description": "Coordinates of the warehouse where the product is located.",
            "$ref": "https://example.com/geographical-location.schema.json"
        }
    },
    "required": ["productId", "productName", "price"]
}
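
A minimal validation sketch using the third-party `jsonschema` package follows. The schema is a trimmed, hypothetical variant of the Product schema (no remote `$ref`), and `is_valid` is an illustrative helper, not part of the library. The import is guarded so that the cell still runs when the package is not installed.

```python
try:
    import jsonschema  # third-party: pip install jsonschema
except ImportError:
    jsonschema = None  # without the package the helper below returns None

# Trimmed, hypothetical variant of the Product schema
product_schema = {
    "type": "object",
    "properties": {
        "productId": {"type": "integer"},
        "productName": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["productId", "productName", "price"],
}

def is_valid(instance, schema=product_schema):
    """Return True or False, or None when jsonschema is unavailable."""
    if jsonschema is None:
        return None
    try:
        jsonschema.validate(instance=instance, schema=schema)
    except jsonschema.ValidationError as exc:
        print('invalid:', exc.message)
        return False
    return True

good = {"productId": 1, "productName": "Widget", "price": 9.99}
bad = {"productId": "one", "productName": "Widget"}  # wrong type, missing price

print(is_valid(good))
print(is_valid(bad))
```

Validating before deserializing keeps custom decoders simple, because they can then assume the agreed structure.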

### Data validation and serialization libraries

* [Pydantic](https://github.com/pydantic/pydantic)
  * `pip install pydantic`
  * Data validation using Python type hints.
  * Fast and extensible.
  * Define how data should be in pure, canonical Python 3.8+; validate it with Pydantic.
* [YAML](https://yaml.org/)
  * `pip install pyyaml` or `pip install ruamel.yaml`
  * A human-friendly data serialization language for all programming languages.
  * It is commonly used for configuration files and in applications where data is being stored or transmitted.
  * YAML targets many of the same communications applications as Extensible Markup Language (XML) but has a minimal syntax that intentionally differs from Standard Generalized Markup Language (SGML).
  * It uses Python-style indentation to indicate nesting and does not require quotes around most string values (it also supports JSON-style `[...]` for lists and `{...}` for maps, which can be mixed in the same file).
  * Custom data types are allowed, but YAML natively encodes scalars (such as strings, integers, and floats), lists, and associative arrays (also known as maps, dictionaries or hashes).
* [serpy](https://serpy.readthedocs.io/en/latest/)
  * `pip install serpy`
  * A super simple object serialization framework built for speed.
  * serpy serializes complex datatypes (Django Models, custom classes, ...) to simple native types (dicts, lists, strings, ...). The native types can then be easily converted to JSON or any other format needed.

## Coding Exercises

Consider the following classes:

```Python
class Stock:
    def __init__(self, symbol, date, open_, high, low, close, volume):
        self.symbol = symbol
        self.date = date
        self.open = open_
        self.high = high
        self.low = low
        self.close = close
        self.volume = volume
        
class Trade:
    def __init__(self, symbol, timestamp, order, price, volume, commission):
        self.symbol = symbol
        self.timestamp = timestamp
        self.order = order
        self.price = price
        self.commission = commission
        self.volume = volume
```

### Exercise 1

Given the above classes, write a custom `JSONEncoder` class to **serialize** dictionaries that contain instances of these particular classes. Keep in mind that you will want to deserialize the data too, so you will need some technique to indicate the object type in your serialization.

For example you may have an object such as this one that needs to be serialized:

```Python
activity = {
    "quotes": [
        Stock('TSLA', date(2018, 11, 22), 
              Decimal('338.19'), Decimal('338.64'), Decimal('337.60'), Decimal('338.19'), 365_607),
        Stock('AAPL', date(2018, 11, 22), 
              Decimal('176.66'), Decimal('177.25'), Decimal('176.64'), Decimal('176.78'), 3_699_184),
        Stock('MSFT', date(2018, 11, 22), 
              Decimal('103.25'), Decimal('103.48'), Decimal('103.07'), Decimal('103.11'), 4_493_689)
    ],
    
    "trades": [
        Trade('TSLA', datetime(2018, 11, 22, 10, 5, 12), 'buy', Decimal('338.25'), 100, Decimal('9.99')),
        Trade('AAPL', datetime(2018, 11, 22, 10, 30, 5), 'sell', Decimal('177.01'), 20, Decimal('9.99'))
    ]
}
```

### Exercise 2

Write code to reverse the serialization you just created. Write a custom decoder that can deserialize a JSON structure containing `Stock` and `Trade` objects.

In [16]:
from datetime import datetime, date
from decimal import Decimal
import json


class Stock:
    def __init__(self, symbol, date, open, high, low, close, volume):
        self.symbol = symbol
        self.date = date
        self.open = open
        self.high = high
        self.low = low
        self.close = close
        self.volume = volume

    def __repr__(self):
        return f'Stock(symbol={self.symbol}, date={self.date}, open={self.open}, high={self.high}, low={self.low}, close={self.close}, volume={self.volume})'
    
    def to_json(self):
        return dict(type='stock', value=vars(self))


class Trade:
    def __init__(self, symbol, timestamp, order, price, volume, commission):
        self.symbol = symbol
        self.timestamp = timestamp
        self.order = order
        self.price = price
        self.commission = commission
        self.volume = volume

    def __repr__(self):
        return f'Trade(symbol={self.symbol}, timestamp={self.timestamp}, order={self.order}, price={self.price}, volume={self.volume}, commission={self.commission})'
    
    def to_json(self):
        return dict(type='trade', value=vars(self))


class CustomJSONEncoder(json.JSONEncoder):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def default(self, obj):
        if isinstance(obj, Decimal):
            # return dict(type='decimal', value=str(obj))
            return str(obj)  # simpler: emit the Decimal as a plain string
        elif isinstance(obj, datetime):  # check datetime first: it is a subclass of date
            return dict(type='datetime', value=obj.isoformat())
        elif isinstance(obj, date):
            return dict(type='date', value=obj.isoformat())
        else:
            try:
                return obj.to_json()
            except AttributeError:
                return super().default(obj)
            

class CustomJSONDecoder(json.JSONDecoder):
    def __init__(self, **kwargs):
        # parse_float only sees JSON floats; the encoder wrote Decimals as strings,
        # so parse_stock / parse_trade convert those fields back explicitly
        kwargs |= dict(object_hook=self.object_hook, parse_float=Decimal)
        super().__init__(**kwargs)

    def object_hook(self, obj):
        obj_type = obj.get('type')
        match obj_type:
            case 'date':
                return self.parse_date(obj)
            case 'datetime':
                return self.parse_datetime(obj)
            case 'stock':
                return self.parse_stock(obj)
            case 'trade':
                return self.parse_trade(obj)
            case _:
                return obj

    def parse_date(self, obj):
        return date.fromisoformat(obj['value'])

    def parse_datetime(self, obj):
        return datetime.fromisoformat(obj['value'])

    def parse_stock(self, obj):
        value = obj['value']  # nested date was already decoded by object_hook
        for field in ('open', 'high', 'low', 'close'):
            value[field] = Decimal(value[field])
        return Stock(**value)

    def parse_trade(self, obj):
        value = obj['value']  # nested datetime was already decoded by object_hook
        for field in ('price', 'commission'):
            value[field] = Decimal(value[field])
        return Trade(**value)


activity = {
    "quotes": [
        Stock('TSLA', date(2018, 11, 22), 
              Decimal('338.19'), Decimal('338.64'), Decimal('337.60'), Decimal('338.19'), 365_607),
        Stock('AAPL', date(2018, 11, 22), 
              Decimal('176.66'), Decimal('177.25'), Decimal('176.64'), Decimal('176.78'), 3_699_184),
        Stock('MSFT', date(2018, 11, 22), 
              Decimal('103.25'), Decimal('103.48'), Decimal('103.07'), Decimal('103.11'), 4_493_689)
    ],
    "trades": [
        Trade('TSLA', datetime(2018, 11, 22, 10, 5, 12), 'buy', Decimal('338.25'), 100, Decimal('9.99')),
        Trade('AAPL', datetime(2018, 11, 22, 10, 30, 5), 'sell', Decimal('177.01'), 20, Decimal('9.99'))
    ]
}

serialized = json.dumps(activity, cls=CustomJSONEncoder, indent=2)
print(serialized)

deserialized = json.loads(serialized, cls=CustomJSONDecoder)
print(deserialized)

{
  "quotes": [
    {
      "type": "stock",
      "value": {
        "symbol": "TSLA",
        "date": {
          "type": "date",
          "value": "2018-11-22"
        },
        "open": "338.19",
        "high": "338.64",
        "low": "337.60",
        "close": "338.19",
        "volume": 365607
      }
    },
    {
      "type": "stock",
      "value": {
        "symbol": "AAPL",
        "date": {
          "type": "date",
          "value": "2018-11-22"
        },
        "open": "176.66",
        "high": "177.25",
        "low": "176.64",
        "close": "176.78",
        "volume": 3699184
      }
    },
    {
      "type": "stock",
      "value": {
        "symbol": "MSFT",
        "date": {
          "type": "date",
          "value": "2018-11-22"
        },
        "open": "103.25",
        "high": "103.48",
        "low": "103.07",
        "close": "103.11",
        "volume": 4493689
      }
    }
  ],
  "trades": [
    {
      "type": "trade",
      "value": {
        "

## Specialized dictionaries

### Introduction

Container datatype|Explanation
---|---
`ChainMap`|dict-like class for creating a single view of multiple mappings
`Counter`|dict subclass for counting hashable objects
`OrderedDict`|dict subclass that remembers the order entries were added
`defaultdict`|dict subclass that calls a factory function to supply missing values
`UserDict`|wrapper around dictionary objects for easier dict subclassing

### `defaultdict`

* `class collections.defaultdict(default_factory=None, /[, ...])`
  * Return a new dictionary-like object.
  * `defaultdict` is a subclass of the built-in `dict` class. It overrides one method and adds one writable instance variable. The remaining functionality is the same as for the dict class and is not documented here.
  * The first argument provides the initial value for the `default_factory` attribute; it defaults to `None`. All remaining arguments are treated the same as if they were passed to the `dict` constructor, including keyword arguments.
  * `defaultdict` objects support the following method in addition to the standard dict operations:
    * `__missing__(key)`
      * If the `default_factory` attribute is `None`, this raises a `KeyError` exception with the `key` as argument.
      * If `default_factory` is not `None`, it is called *without arguments* to provide a default value for the given key, this value is inserted in the dictionary for the key, and returned.
      * If calling `default_factory` raises an exception this exception is propagated unchanged.
      * This method is called by the `__getitem__()` method of the `dict` class when the requested key is not found; whatever it returns or raises is then returned or raised by `__getitem__()`.
      * Note that `__missing__()` is not called for any operations besides `__getitem__()`. This means that `get()` will, like normal dictionaries, return `None` as a default rather than using `default_factory`.
  * `defaultdict` objects support the following instance variable:
    * `default_factory`
      * This attribute is used by the `__missing__()` method; it is initialized from the first argument to the constructor, if present, or to `None`, if absent.
      * If it is not `None`, it must be a callable that takes no arguments and returns the desired default value.
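
The note about `get()` is easy to miss, so here is a short sketch of the difference between `get()` and indexing on a `defaultdict`, plus the `default_factory=None` case. Variable names are illustrative.

```python
from collections import defaultdict

d = defaultdict(list)

# get() and `in` bypass __missing__, so the key is neither defaulted nor created
got = d.get('a')
present_after_get = 'a' in d

# Indexing goes through __getitem__ -> __missing__, which stores default_factory()
d['a'].append(1)
present_after_index = 'a' in d

print(got, present_after_get, present_after_index, dict(d))

# With default_factory=None, a missing key raises KeyError like a plain dict
plain = defaultdict(None, {'x': 1})
try:
    plain['y']
except KeyError as exc:
    print('KeyError for', exc)
```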

In [22]:
from collections import defaultdict

d = {}
# d['a'] # KeyError
d.get('a') # returns None; the key 'a' is not inserted into d
d.setdefault('a', 0) # returns 0 and inserts the key 'a' into d, much like a defaultdict lookup
print(d['a'])

# default_dict_int = defaultdict(lambda: 0) # same as below
default_dict_int = defaultdict(int)
print(default_dict_int['a'])

0
0


In [27]:
from collections import defaultdict

persons = {
    'john': {'age': 20, 'eye_color': 'blue'},
    'jack': {'age': 25, 'eye_color': 'brown'},
    'jill': {'age': 22, 'eye_color': 'blue'},
    'eric': {'age': 35},
    'michael': {'age': 27}
}

# method 1:
eye_colors = {}

for person, details in persons.items():
    eye_color = details.get('eye_color', 'unknown')
    eye_colors.setdefault(eye_color, [])
    eye_colors[eye_color].append(person)

print(eye_colors)

# method 2:
eye_colors = defaultdict(list)

for person, details in persons.items():
    details = defaultdict(lambda: 'unknown', details)
    eye_color = details['eye_color']
    eye_colors[eye_color].append(person)

print(eye_colors)

{'blue': ['john', 'jill'], 'brown': ['jack'], 'unknown': ['eric', 'michael']}
defaultdict(<class 'list'>, {'blue': ['john', 'jill'], 'brown': ['jack'], 'unknown': ['eric', 'michael']})


In [30]:
from collections import defaultdict, namedtuple
import datetime
from functools import wraps

def function_stats():
    data = defaultdict(lambda: {'count': 0, 'first_called_at': datetime.datetime.now(datetime.UTC)})
    Stats = namedtuple('Stats', 'decorator data') # this is actually not necessary

    def decorator(fn):
        @wraps(fn)
        def inner(*args, **kwargs):
            data[fn.__name__]['count'] += 1
            return fn(*args, **kwargs)
        return inner
    
    return Stats(decorator, data)

stats = function_stats()

@stats.decorator
def my_func_1():
    print('my_func_1')

@stats.decorator
def my_func_2():
    print('my_func_2')

my_func_1()
my_func_1()
my_func_2()

stats.data

my_func_1
my_func_1
my_func_2


defaultdict(<function __main__.function_stats.<locals>.<lambda>()>,
            {'my_func_1': {'count': 2,
              'first_called_at': datetime.datetime(2024, 2, 10, 3, 24, 29, 748864, tzinfo=datetime.timezone.utc)},
             'my_func_2': {'count': 1,
              'first_called_at': datetime.datetime(2024, 2, 10, 3, 24, 29, 749864, tzinfo=datetime.timezone.utc)}})

### `OrderedDict`

* Ordered dictionaries are just like regular dictionaries but have some extra capabilities relating to ordering operations. They have become less important now that the built-in `dict` class gained the ability to remember insertion order (this new behavior became guaranteed in Python 3.7).
* Some differences from dict still remain:
  * The regular `dict` was designed to be very good at mapping operations. Tracking insertion order was secondary.
  * The `OrderedDict` was designed to be good at reordering operations. Space efficiency, iteration speed, and the performance of update operations were secondary.
  * The `OrderedDict` algorithm can handle frequent reordering operations better than `dict`. This makes it suitable for implementing various kinds of LRU caches.
  * The equality operation for `OrderedDict` checks for matching order.
    * A regular `dict` can emulate the order sensitive equality test with `p == q and all(k1 == k2 for k1, k2 in zip(p, q))`.
  * The `popitem()` method of `OrderedDict` has a different signature. It accepts an optional argument to specify which item is popped.
    * A regular `dict` can emulate `OrderedDict`’s `od.popitem(last=True)` with `d.popitem()` which is guaranteed to pop the rightmost (last) item.
    * A regular `dict` can emulate `OrderedDict`’s `od.popitem(last=False)` with `(k := next(iter(d)), d.pop(k))` which will return and remove the leftmost (first) item if it exists.
  * `OrderedDict` has a `move_to_end()` method to efficiently reposition an element to an endpoint.
    * A regular `dict` can emulate `OrderedDict`’s `od.move_to_end(k, last=True)` with `d[k] = d.pop(k)` which will move the key and its associated value to the rightmost (last) position.
    * A regular `dict` does not have an *efficient* equivalent for `OrderedDict`’s `od.move_to_end(k, last=False)` which moves the key and its associated value to the leftmost (first) position.
  * Until Python 3.8, `dict` lacked a `__reversed__()` method.
* `class collections.OrderedDict([items])`
  * Return an instance of a `dict` subclass that has methods specialized for rearranging dictionary order.
  * `popitem(last=True)`
  * `move_to_end(key, last=True)`
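
The differences listed above can be checked side by side. The following is a minimal sketch with made-up keys; it only uses the `OrderedDict` methods and the `dict` emulations quoted in the bullets.

```python
from collections import OrderedDict

od = OrderedDict(a=1, b=2, c=3)
d = dict(a=1, b=2, c=3)

# Move a key to the right end: native method vs. pop-and-reinsert.
od.move_to_end('a')
d['a'] = d.pop('a')
print(list(od), list(d))  # ['b', 'c', 'a'] ['b', 'c', 'a']

# Move a key to the left end: only OrderedDict offers this directly.
od.move_to_end('a', last=False)
print(list(od))  # ['a', 'b', 'c']

# Pop the first item: native keyword vs. the next(iter(...)) idiom.
first_od = od.popitem(last=False)
first_d = (k := next(iter(d)), d.pop(k))
print(first_od, first_d)  # ('a', 1) ('b', 2)

# Equality: OrderedDict compares order, a regular dict does not.
p, q = {'x': 1, 'y': 2}, {'y': 2, 'x': 1}
print(p == q)  # True
print(OrderedDict(p) == OrderedDict(q))  # False
print(p == q and all(k1 == k2 for k1, k2 in zip(p, q)))  # False
```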

In [2]:
from collections import OrderedDict

d = {'a': 1, 'b': 2, 'c': 3}
d_ordered = OrderedDict(d)
print(d_ordered)
print(dict(reversed(d.items())))
print(OrderedDict(reversed(d_ordered.items())))

OrderedDict({'a': 1, 'b': 2, 'c': 3})
{'c': 3, 'b': 2, 'a': 1}
OrderedDict({'c': 3, 'b': 2, 'a': 1})
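
The notes above mention LRU caches as a typical use of `OrderedDict`. A minimal sketch of that idea follows; the `LRUCache` class, its `maxsize` parameter and the `_store` attribute are hypothetical names chosen for illustration, and the class is not thread-safe.

```python
from collections import OrderedDict

class LRUCache:
    """Hypothetical minimal LRU cache built on OrderedDict."""

    def __init__(self, maxsize=2):
        self.maxsize = maxsize
        self._store = OrderedDict()

    def get(self, key, default=None):
        if key not in self._store:
            return default
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)  # new or updated keys are most recent
        if len(self._store) > self.maxsize:
            self._store.popitem(last=False)  # evict the least recently used

cache = LRUCache(maxsize=2)
cache.put('a', 1)
cache.put('b', 2)
cache.get('a')      # 'a' becomes the most recently used key
cache.put('c', 3)   # exceeds maxsize, so 'b' is evicted
print(list(cache._store))  # ['a', 'c']
```

Both `move_to_end()` and `popitem(last=False)` reposition or remove an endpoint directly, which is why `OrderedDict` suits this pattern.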


### `Counter`

* A counter tool is provided to support convenient and rapid tallies.
* `class collections.Counter([iterable-or-mapping])`
  * A `Counter` is a `dict` subclass for counting *hashable* objects.
  * It is a collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts.
  * Elements are counted from an *iterable* or initialized from another *mapping* (or counter).
  * `Counter` objects have a dictionary interface except that they return a zero count for missing items instead of raising a `KeyError`.
  * Setting a count to zero does not remove an element from a counter. Use `del` to remove it entirely.
* Counter objects support additional methods beyond those available for all dictionaries:
  * `elements()`
    * Return an *iterator* over elements repeating each as many times as its count. Elements are returned in the order first encountered.
    * If an element's count is less than one, `elements()` will ignore it.
  * `most_common([n])`
    * Return a list of the `n` most common elements and their counts from the most common to the least. If `n` is omitted or `None`, `most_common()` returns all elements in the counter.
    * Elements with equal counts are ordered in the order first encountered.
  * `subtract([iterable-or-mapping])`
    * Elements are subtracted from an iterable or from another mapping (or counter). Like `dict.update()` but subtracts counts instead of replacing them. Both inputs and outputs may be zero or negative.
  * `total()`
    * Compute the sum of the counts.
* The usual dictionary methods are available for Counter objects except for two which work differently for counters.
  * `fromkeys(iterable)`
    * This class method is not implemented for `Counter` objects.
  * `update([iterable-or-mapping])`
    * Elements are counted from an iterable or added-in from another mapping (or counter). Like `dict.update()` but adds counts instead of replacing them. Also, the iterable is expected to be a sequence of elements (keys), not a sequence of `(key, value)` pairs.
* Counters support rich comparison operators for equality, subset, and superset relationships: `==`, `!=`, `<`, `<=`, `>`, `>=`. All of those tests treat missing elements as having zero counts so that `Counter(a=1) == Counter(a=1, b=0)` returns true.
* Several mathematical operations are provided for combining Counter objects to produce multisets (counters that have counts greater than zero).
  * Addition and subtraction combine counters by adding or subtracting the counts of corresponding elements.
  * Intersection and union return the minimum and maximum of corresponding counts.
  * Equality and inclusion compare corresponding counts.
  * Each operation can accept inputs with signed counts, but the output will exclude results with counts of zero or less.
  * Unary addition and subtraction are shortcuts for adding an empty counter or subtracting from an empty counter.
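
The rules for missing keys, zero counts, `update()` and `subtract()` can be sketched on a small example (the input string is arbitrary):

```python
from collections import Counter

c = Counter('abracadabra')
print(c['a'], c['z'])    # 5 0 (a missing key gives 0, not KeyError)
print('z' in c)          # False (the lookup did not insert 'z')

c['b'] = 0
print('b' in c)          # True (a zero count keeps the key)
del c['b']
print('b' in c)          # False

c.update('aaa')          # adds counts from an iterable of elements
c.update({'r': 10})      # adds counts from a mapping
c.subtract({'a': 100})   # counts may become negative
print(c['a'], c['r'])    # -92 12
print(c.most_common(1))  # [('r', 12)]
```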

In [4]:
from collections import Counter

print(Counter())
print(Counter('hello, python'))
print(Counter({'red': 4, 'blue': 12, 'green': 5}))
print(Counter(cats=4, dogs=8))

c = Counter(a=2, b=-2, c=3, d=0)
print(list(c.elements()))
d = Counter(a=2, b=3, c=5)
print(c - d)
print(c + d)
print(+c)
print(-c)
print(c & d)
print(c | d)
print(c == d)
print(c <= d)
print(c.most_common()[::-1])
print(c.total())

Counter()
Counter({'h': 2, 'l': 2, 'o': 2, 'e': 1, ',': 1, ' ': 1, 'p': 1, 'y': 1, 't': 1, 'n': 1})
Counter({'blue': 12, 'green': 5, 'red': 4})
Counter({'dogs': 8, 'cats': 4})
['a', 'a', 'c', 'c', 'c']
Counter()
Counter({'c': 8, 'a': 4, 'b': 1})
Counter({'c': 3, 'a': 2})
Counter({'b': 2})
Counter({'c': 3, 'a': 2})
Counter({'c': 5, 'b': 3, 'a': 2})
False
True
[('b', -2), ('d', 0), ('a', 2), ('c', 3)]
3


In [12]:
class CounterDict:
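    def __init__(self, iterable=None, /, **kwargs):
        dict1 = {}
        if iterable is not None: # CounterDict() with no positional argument starts empty
            try:
                dict1 = dict(iterable) # a mapping or an iterable of (key, count) pairs
            except (TypeError, ValueError):
                for item in iterable: # a plain iterable: count each element
                    dict1[item] = dict1.get(item, 0) + 1
        dict2 = dict(**kwargs)
        dict_combined = dict1 | dict2
        for k in dict1.keys() & dict2.keys():
            dict_combined[k] += dict1[k] # keys given both ways get the sum of both counts
        self.dict = dict_combined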

    def __repr__(self):
        return f'CounterDict({self.dict})'
    
    def __getitem__(self, key):
        return self.dict.get(key, 0) # return 0 if key does not exist, but the key will not be inserted into the counter
    
    def __setitem__(self, key, value):
        self.dict[key] = value
    
    def elements(self):
        for k, v in self.dict.items():
            for _ in range(v):
                yield k

    def total(self):
        return sum(self.dict.values())

# d = dict(['a1', 'b2', 'c3']) # this works, as each item of the list, e.g. 'a1', is iterable
# print(d) # {'a': '1', 'b': '2', 'c': '3'}

counter = CounterDict({'a': 1, 'b': 2})
counter = CounterDict({'a': 1, 'b': 2}, a=2)
counter = CounterDict(['a', 'b', 'c'], a=2)
print(counter)
print(counter['d'])
print(counter)
counter['d'] = 2
print(counter)
print(counter.total())
print(list(counter.elements()))

CounterDict({'a': 3, 'b': 1, 'c': 1})
0
CounterDict({'a': 3, 'b': 1, 'c': 1})
CounterDict({'a': 3, 'b': 1, 'c': 1, 'd': 2})
7
['a', 'a', 'a', 'b', 'c', 'd', 'd']


In [9]:
import random
from collections import Counter

random.seed(0)

widgets = ['battery', 'charger', 'cable', 'case', 'keyboard', 'mouse']
sales = [(random.choice(widgets), random.randint(1, 10)) for _ in range(100)]
refunds = [(random.choice(widgets), random.randint(1, 5)) for _ in range(20)]

count = 0
for item in sales:
    if item[0] == 'keyboard':
        count += item[1]
print('keyboard', count)

sale_counter = Counter(k for k, v in sales for _ in range(v))
refund_counter = Counter(k for k, v in refunds for _ in range(v))
print(sale_counter)
print(refund_counter)
net_sale = sale_counter - refund_counter
print(net_sale)

keyboard 120
Counter({'keyboard': 120, 'battery': 113, 'mouse': 85, 'case': 78, 'cable': 71, 'charger': 62})
Counter({'keyboard': 19, 'battery': 11, 'charger': 10, 'case': 9, 'mouse': 8, 'cable': 4})
Counter({'battery': 102, 'keyboard': 101, 'mouse': 77, 'case': 69, 'cable': 67, 'charger': 52})
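
The generator above repeats each item once per unit sold, so its cost grows with the total quantity. As an alternative sketch, the quantities can be added directly. The `tally` helper and the small fixed data set below are hypothetical stand-ins for the random data used above.

```python
from collections import Counter

# Hypothetical fixed data standing in for the random sales and refunds.
sales = [('keyboard', 3), ('mouse', 2), ('keyboard', 4)]
refunds = [('keyboard', 1), ('mouse', 2)]

def tally(pairs):
    """Sum quantities per item without expanding each unit."""
    counter = Counter()
    for item, quantity in pairs:
        counter[item] += quantity  # missing items start at 0
    return counter

net_sale = tally(sales) - tally(refunds)
print(tally(sales))  # Counter({'keyboard': 7, 'mouse': 2})
print(net_sale)      # Counter({'keyboard': 6})
```

Note that `mouse` disappears from `net_sale`: binary subtraction drops results that are zero or negative.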


### `ChainMap`

* A `ChainMap` class is provided for quickly linking a number of mappings so they can be treated as a single unit.
* It is often much faster than creating a new dictionary and running multiple `update()` calls.
* The class can be used to simulate nested scopes and is useful in templating.
* `class collections.ChainMap(*maps)`
  * A `ChainMap` groups multiple dicts or other mappings together to create a single, *updatable view*. If no maps are specified, a single empty dictionary is provided so that a new chain always has at least one mapping.
  * The underlying mappings are stored in a *list*. That list is public and can be accessed or updated using the `maps` attribute. There is no other state.
  * Lookups search the underlying mappings successively until a key is found. In contrast, writes, updates, and deletions only operate on the *first* mapping.
  * A `ChainMap` incorporates the underlying mappings *by reference*. So, if one of the underlying mappings gets updated, those changes will be reflected in the `ChainMap`.
* All of the usual dictionary methods are supported. In addition, there is a `maps` attribute, a method for creating new subcontexts, and a property for accessing all but the first mapping:
  * `maps`
  * `new_child(m=None, **kwargs)`
  * `parents`

Parent-child relationship

* `ChainMap(d1, d2, d3)`
  * `d1` is called a child, `d2` and `d3` are parents.
  * `d.parents`: returns a new `ChainMap` containing the parent elements only.
  * `d.new_child(d4)`: adds `d4` to the front of the chain, same as `ChainMap(d4, d1, d2, d3)`.
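
The parent-child vocabulary maps naturally onto nested scopes. The sketch below uses made-up variable names (`global_scope`, `inner`, `outer`) to show `new_child()` entering a scope and `parents` leaving it.

```python
from collections import ChainMap

global_scope = {'x': 1, 'y': 2}
scope = ChainMap(global_scope)

inner = scope.new_child({'x': 10})  # enter a nested scope that shadows x
inner['z'] = 3                      # writes land in the first mapping only
print(inner['x'], inner['y'], inner['z'])  # 10 2 3
print(global_scope)                 # {'x': 1, 'y': 2}

outer = inner.parents               # leave the nested scope
print(outer['x'])                   # 1
print('z' in outer)                 # False
print(inner.maps[1] is global_scope)  # True: mappings are held by reference
```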

In [9]:
from collections import ChainMap

d1 = {'a': 1, 'b': 2}
d2 = {'c': 3, 'd': 4}
d3 = {'e': 5, 'f': 6}
d4 = {'g': 7, 'h': 8}

chained = ChainMap(d1, d2, d3)
print(chained)
print(isinstance(chained, dict)) # ChainMap is not a subclass of dict
print(chained.maps) # returns a list of underlying dicts
print(chained.parents) # same as ChainMap(*chained.maps[1:])
print(chained.new_child(d4)) # same as ChainMap(d4, *chained.maps); ChainMap(d4, chained) gives the same lookups but nests the original ChainMap instead of flattening it
print(ChainMap(d4, chained))
chained_chainmap_child = ChainMap(chained, d4)
print(chained_chainmap_child)
chained_chainmap_child['c'] = 10 # only modifies the child of the child
chained_chainmap_child['g'] = 70
print(chained_chainmap_child)

chained.maps.append(d4) # this will change the ChainMap, actually all operations on ChainMap can be done by changing the underlying dicts in maps
print(chained)

chained['c'] = 30
print(chained) # the child is changed
print(d1) # the underlying dict of the child is also changed
# del chained['d'] # KeyError, as 'd' does not exist in the child although it does exist in the ChainMap

ChainMap({'a': 1, 'b': 2}, {'c': 3, 'd': 4}, {'e': 5, 'f': 6})
False
[{'a': 1, 'b': 2}, {'c': 3, 'd': 4}, {'e': 5, 'f': 6}]
ChainMap({'c': 3, 'd': 4}, {'e': 5, 'f': 6})
ChainMap({'g': 7, 'h': 8}, {'a': 1, 'b': 2}, {'c': 3, 'd': 4}, {'e': 5, 'f': 6})
ChainMap({'g': 7, 'h': 8}, ChainMap({'a': 1, 'b': 2}, {'c': 3, 'd': 4}, {'e': 5, 'f': 6}))
ChainMap(ChainMap({'a': 1, 'b': 2}, {'c': 3, 'd': 4}, {'e': 5, 'f': 6}), {'g': 7, 'h': 8})
ChainMap(ChainMap({'a': 1, 'b': 2, 'c': 10, 'g': 70}, {'c': 3, 'd': 4}, {'e': 5, 'f': 6}), {'g': 7, 'h': 8})
ChainMap({'a': 1, 'b': 2, 'c': 10, 'g': 70}, {'c': 3, 'd': 4}, {'e': 5, 'f': 6}, {'g': 7, 'h': 8})
ChainMap({'a': 1, 'b': 2, 'c': 30, 'g': 70}, {'c': 3, 'd': 4}, {'e': 5, 'f': 6}, {'g': 7, 'h': 8})
{'a': 1, 'b': 2, 'c': 30, 'g': 70}


In [10]:
from collections import ChainMap

config = {
    'host': 'prod.deepdive.com',
    'port': 2428,
    'database': 'deepdive',
    'user_id': '$pg_user',
    'user_pwd': '$pg_pwd',
}

local_config = ChainMap({}, config) # create a local config mapping by using ChainMap without copying data from config
print(local_config['host'])
local_config['user_pwd'] = 'my_pwd' # only modifies the first mapping of local_config; config stays intact
print(local_config)
print(config)

prod.deepdive.com
ChainMap({'user_pwd': 'my_pwd'}, {'host': 'prod.deepdive.com', 'port': 2428, 'database': 'deepdive', 'user_id': '$pg_user', 'user_pwd': '$pg_pwd'})
{'host': 'prod.deepdive.com', 'port': 2428, 'database': 'deepdive', 'user_id': '$pg_user', 'user_pwd': '$pg_pwd'}


### `UserDict`

* The `UserDict` class acts as a wrapper around dictionary objects. The need for this class has been partially supplanted by the ability to subclass directly from `dict`; however, this class can be easier to work with because the underlying dictionary is accessible as an attribute.
* It is not a subclass of `dict`; it is a mutable mapping type in its own right.
* In effect, it is a head start on recreating a Python dictionary from scratch, which opens up different subclassing possibilities.
* `class collections.UserDict([initialdata])`
  * Class that simulates a dictionary.
  * The instance's contents are kept in a regular dictionary, which is accessible via the `data` attribute of `UserDict` instances.
  * If `initialdata` is provided, `data` is initialized with its contents.
  * A reference to `initialdata` will not be kept, allowing it to be used for other purposes.
* In addition to supporting the methods and operations of mappings, `UserDict` instances provide the following attribute:
  * `data`
    * A real dictionary used to store the contents of the `UserDict` class.
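
One practical reason `UserDict` can be easier to subclass: in CPython, built-in `dict` methods such as `update()` do not call an overridden `__setitem__`, whereas `UserDict` routes its constructor and `update()` through it. The sketch below illustrates this with two hypothetical classes, `UpperDict` and `UpperUserDict`.

```python
from collections import UserDict

class UpperDict(dict):
    """Hypothetical dict subclass that upper-cases string values."""
    def __setitem__(self, key, value):
        super().__setitem__(key, value.upper())

class UpperUserDict(UserDict):
    """The same idea on top of UserDict."""
    def __setitem__(self, key, value):
        super().__setitem__(key, value.upper())

d = UpperDict()
d['a'] = 'x'
d.update(b='y')   # in CPython, dict.update() bypasses the override
print(d)          # {'a': 'X', 'b': 'y'}

initial = {'a': 'x'}
u = UpperUserDict(initial)  # the constructor goes through __setitem__
u.update(b='y')             # so does update()
print(u)                    # {'a': 'X', 'b': 'Y'}
print(type(u.data), isinstance(u, dict))  # <class 'dict'> False
print(u.data is initial)    # False: no reference to initialdata is kept
```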

In [25]:
from itertools import chain

class IntDict(dict):
    def __init__(self, iterable=None, /, **kwargs):
        super().__init__(IntDict.mapping_conversion(iterable, kwargs))

    def __setitem__(self, key, value):
        # super().__setitem__(key, IntDict.value_conversion(value))
        super().__setitem__(key, int(value))
        
    @classmethod
    def mapping_conversion(cls, iterable, kwargs):
        if isinstance(iterable, dict):
            iterable = iterable.items()
        elif iterable is None:
            iterable = []
        chained = chain(iterable, kwargs.items())
        converted = {}
        for k, v in chained:
            # converted[k] = IntDict.value_conversion(v)
            converted[k] = int(v)
        return converted
    
    # @classmethod
    # def value_conversion(cls, value):
    #     try:
    #         return int(value)
    #     except (TypeError, ValueError):
    #         raise ValueError('the value of each item must be a legal argument of int()')

int_d = IntDict()
print(int_d)
int_d['a'] = 1
print(int_d)
int_d['b'] = '2'
print(int_d)
# int_d['a'] = '1a' # ValueError, invalid value for int()

# int_d = IntDict({'a': '1', 'b': 1.2}, c='3.5') # '3.5' invalid value for int()
int_d = IntDict({'a': '1', 'b': 1.2}, c=3)
print(int_d)
print(list(int_d.items()))
print(int_d.get('d', 0))

{}
{'a': 1}
{'a': 1, 'b': 2}
{'a': 1, 'b': 1, 'c': 3}
[('a', 1), ('b', 1), ('c', 3)]
0


In [30]:
from collections import UserDict

class IntDict(UserDict):
    def __setitem__(self, key, value):
        int(value) # test whether value is a legal argument of int()
        super().__setitem__(key, value) # the original value remains unchanged

    def __getitem__(self, key):
        return int(super().__getitem__(key))
    

int_d = IntDict({'a': 1, 'b': 1.2}, c='1')
# int_d = IntDict({'a': 1, 'b': '1.2'}, c='1') # although __init__ is not overridden, the invalid value '1.2' is still caught, because UserDict's constructor stores items through __setitem__
print(int_d)
print(list(int_d.items()))
print(dict(int_d))
print(int_d.data)
# int_d['d'] = '1.5' # invalid value

{'a': 1, 'b': 1.2, 'c': '1'}
[('a', 1), ('b', 1), ('c', 1)]
{'a': 1, 'b': 1, 'c': 1}
{'a': 1, 'b': 1.2, 'c': '1'}


In [38]:
from collections import UserDict
from numbers import Real

class LimitedDict(UserDict):
    def __init__(self, keyset, value_range, *args, **kwargs):
        self.keyset = keyset
        self.value_range = value_range
        super().__init__(*args, **kwargs)

    def __setitem__(self, key, value):
        if key not in self.keyset:
            raise KeyError(f'{key!r} is not in keyset {self.keyset}')
        if not isinstance(value, Real):
            raise TypeError(f'{value!r} is not a real number')
        if value < self.value_range[0] or value > self.value_range[1]:
            raise ValueError(f'{value} is not in the range of {self.value_range}')
        super().__setitem__(key, value)


class ColorDict(LimitedDict):
    def __init__(self, *args, default=0, **kwargs):
        self.keyset = {'red', 'green', 'blue'}
        self.value_range = (0, 255)
        super().__init__(self.keyset, self.value_range, *args, **kwargs)
        difference = self.keyset - self.keys()
        for k in difference: # set missing color value to default
            self[k] = default



color_d = ColorDict(red=10, green=10)
print(color_d)
# color_d = ColorDict(red=10, green=10, purple=20) # invalid key 'purple'
# color_d = ColorDict(red=10, green=10, blue=2000) # invalid value 2000
print(color_d['red'])
# color_d['blue'] = 300 # value out of range
print(list(color_d.items()))

{'red': 10, 'green': 10, 'blue': 0}
10
[('red', 10), ('green', 10), ('blue', 0)]


## Coding Exercises

### Exercise 1

* You have text data spread across multiple servers. Each server is able to analyze this data and return a dictionary that contains words and their frequency.
* Your job is to combine this data to create a single dictionary that contains all the words and their combined frequencies from all these data sources.
* Bonus points if you can make your dictionary sorted by frequency (highest to lowest).
* For example, you may have three servers that each return these dictionaries:
    ```Python
    d1 = {'python': 10, 'java': 3, 'c#': 8, 'javascript': 15}
    d2 = {'java': 10, 'c++': 10, 'c#': 4, 'go': 9, 'python': 6}
    d3 = {'erlang': 5, 'haskell': 2, 'python': 1, 'pascal': 1}
    ```
  * Your resulting dictionary should look like this:
    ```Python
    d = {'python': 17,
         'javascript': 15,
         'java': 13,
         'c#': 12,
         'c++': 10,
         'go': 9,
         'erlang': 5,
         'haskell': 2,
         'pascal': 1}
    ```
* Implement two different solutions to this problem:
  * Using `defaultdict` objects
  * Using `Counter` objects

In [5]:
from collections import defaultdict, Counter

def combine_dicts_defaultdict(dicts):
    combined = defaultdict(int)
    for d in dicts:
        for k, v in d.items():
            combined[k] += v
    return combined

def combine_dicts_counter(dicts):
    # return Counter(k for d in dicts for k, v in d.items() for _ in range(v))
    combined = Counter()
    for d in dicts:
        combined.update(d)
    return Counter(dict(combined.most_common()))


d1 = {'python': 10, 'java': 3, 'c#': 8, 'javascript': 15}
d2 = {'java': 10, 'c++': 10, 'c#': 4, 'go': 9, 'python': 6}
d3 = {'erlang': 5, 'haskell': 2, 'python': 1, 'pascal': 1}

dicts = d1, d2, d3
print(combine_dicts_defaultdict(dicts))
print(combine_dicts_counter(dicts))

defaultdict(<class 'int'>, {'python': 17, 'java': 13, 'c#': 12, 'javascript': 15, 'c++': 10, 'go': 9, 'erlang': 5, 'haskell': 2, 'pascal': 1})
Counter({'python': 17, 'javascript': 15, 'java': 13, 'c#': 12, 'c++': 10, 'go': 9, 'erlang': 5, 'haskell': 2, 'pascal': 1})
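
The `defaultdict` solution above keeps first-seen order rather than frequency order. One possible way to cover the bonus requirement is to sort the totals and return a plain `dict`; the function name `combine_sorted` is a hypothetical addition.

```python
from collections import defaultdict

def combine_sorted(dicts):
    """Merge word counts and return a dict ordered by descending frequency."""
    totals = defaultdict(int)
    for d in dicts:
        for word, count in d.items():
            totals[word] += count
    # sorted() is stable, so words with equal counts keep first-seen order
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))

d1 = {'python': 10, 'java': 3, 'c#': 8, 'javascript': 15}
d2 = {'java': 10, 'c++': 10, 'c#': 4, 'go': 9, 'python': 6}
d3 = {'erlang': 5, 'haskell': 2, 'python': 1, 'pascal': 1}

result = combine_sorted((d1, d2, d3))
print(result)
```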


### Exercise 2

* Suppose you have a list of all possible eye colors:
    ```Python
    eye_colors = ("amber", "blue", "brown", "gray", "green", "hazel", "red", "violet")
    ```
* Some other collection (say recovered from a database, or an external API) contains a list of `Person` objects that have `eye_color` property.
* Your goal is to create a dictionary that contains the number of people that have the eye color as specified in `eye_colors`. The wrinkle here is that even if no one matches some eye color, say `amber`, your dictionary should still contain an entry `"amber": 0`.

In [11]:
from random import seed, choices
from collections import Counter, defaultdict

class Person:
    def __init__(self, eye_color):
        self.eye_color = eye_color

def eye_color_counter(persons, eye_colors):
    # counter = Counter(dict.fromkeys(eye_colors, 0))
    counter = Counter({eye_color: 0 for eye_color in eye_colors})
    counter.update(p.eye_color for p in persons)
    return counter

def eye_color_defaultdict(persons, eye_colors):
    counter = defaultdict(int, ((eye_color, 0) for eye_color in eye_colors))
    for p in persons:
        counter[p.eye_color] += 1
    return counter

seed(0)
eye_colors = ("amber", "blue", "brown", "gray", "green", "hazel", "red", "violet")
persons = [Person(color) for color in choices(eye_colors[2:], k = 50)]
counter_c = eye_color_counter(persons, eye_colors)
counter_d = eye_color_defaultdict(persons, eye_colors)
print(counter_c)
print(counter_d)
print(counter_c.total() == 50)
print(sum(v for v in counter_d.values()) == 50)

Counter({'violet': 12, 'gray': 10, 'red': 10, 'green': 8, 'hazel': 7, 'brown': 3, 'amber': 0, 'blue': 0})
defaultdict(<class 'int'>, {'amber': 0, 'blue': 0, 'brown': 3, 'gray': 10, 'green': 8, 'hazel': 7, 'red': 10, 'violet': 12})
True
True


### Exercise 3

* You are given three JSON files: one holds a default set of settings, and the other two hold environment-specific settings.
  * `common.json`
  * `dev.json`
  * `prod.json`
* Your goal is to write a function that has a single argument (the environment name) and returns the "combined" dictionary that merges the common settings with the settings of that environment, with the environment-specific settings overriding any common settings already defined.
* For simplicity, assume that the argument values are going to be the same as the file names, without the `.json` extension. So for example, `dev` or `prod`.
* The wrinkle: we don't want to duplicate data for the "merged" dictionary, so use `ChainMap` to implement this instead.
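
A single flat `ChainMap` handles top-level overrides, but a nested dictionary in the first mapping hides the whole nested dictionary below it. The sketch below shows this with hypothetical in-memory stand-ins for `common.json` and `dev.json` (the keys and values are made up), which is why nested settings need to be chained level by level.

```python
from collections import ChainMap

# Hypothetical in-memory stand-ins for common.json and dev.json.
common = {'logs': {'level': 'info', 'format': 'plain'}, 'port': 5432}
dev = {'logs': {'level': 'debug'}}

flat = ChainMap(dev, common)
print(flat['port'])              # 5432, falls through to common
print(flat['logs'])              # {'level': 'debug'}
print('format' in flat['logs'])  # False: common's nested settings are hidden

# Chaining the nested level as well restores the fallback.
nested_logs = ChainMap(dev['logs'], common['logs'])
print(nested_logs['level'], nested_logs['format'])  # debug plain
```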

In [19]:
from collections import ChainMap
import os
import json
from pprint import pprint

def chain_settings(environment, path, default_settings='common', file_extension='.json'):
    environment_settings_path = f'{os.path.join(*path, environment)}{file_extension}'
    default_settings_path = f'{os.path.join(*path, default_settings)}{file_extension}'
    with open(environment_settings_path) as env_file:
        with open(default_settings_path) as default_file:
            env_dict = json.load(env_file)
            default_dict = json.load(default_file)
            return chain_recursive(env_dict, default_dict)
        
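def chain_recursive(d1, d2):
    # d1 takes precedence over d2; nested dicts present in both are chained level by level
    chained = ChainMap(d1, d2)
    for k, v in d1.items():
        if isinstance(v, dict) and isinstance(d2.get(k), dict):
            chained[k] = chain_recursive(v, d2[k]) # the write lands in d1, replacing its nested dict with a ChainMap
    return chained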

dev_settings = chain_settings('dev', ('.', 'p3_exercise1'))
prod_settings = chain_settings('prod', ('.', 'p3_exercise1'))
pprint(dev_settings)
print(dev_settings['database']['pwd'])
print(dev_settings['database']['port'])
print(dev_settings['data']['numerics']['precision'])
print(prod_settings['logs']['level'])

ChainMap({'data': ChainMap({'input_root': '/dev/path/inputs',
                            'numerics': ChainMap({'type': 'float'},
                                                 {'precision': 6,
                                                  'type': 'Decimal'}),
                            'operators': {'add': '__add__'},
                            'output_root': '/dev/path/outputs'},
                           {'input_root': '/default/path/inputs',
                            'numerics': {'precision': 6, 'type': 'Decimal'},
                            'output_root': '/default/path/outputs'}),
          'database': ChainMap({'pwd': 'test', 'user': 'test'},
                               {'db_name': 'deepdive',
                                'port': 5432,
                                'schema': 'public'}),
          'logs': ChainMap({'format': '%(asctime)s: %(levelname)s: '
                                      '%(clientip)s %(user)s %(filename)s '
                              

## Extras

### `MappingProxyType`

* `class types.MappingProxyType(mapping)`
  * *Read-only* proxy of a mapping.
  * It provides a dynamic view on the mapping's entries, which means that when the mapping changes, the view reflects these changes.
  * Supported operations:
    * `key in proxy`
      * Return `True` if the underlying mapping has a key `key`, else `False`.
    * `proxy[key]`
      * Return the item of the underlying mapping with key `key`. Raises a `KeyError` if `key` is not in the underlying mapping.
    * `iter(proxy)`
      * Return an iterator over the keys of the underlying mapping. This is a shortcut for `iter(proxy.keys())`.
    * `len(proxy)`
      * Return the number of items in the underlying mapping.
    * `copy()`
      * Return a *shallow* copy of the underlying mapping.
    * `get(key[, default])`
      * Return the value for `key` if `key` is in the underlying mapping, else `default`. If `default` is not given, it defaults to `None`, so that this method never raises a `KeyError`.
    * `items()`
      * Return a new view of the underlying mapping's items (`(key, value)` pairs).
    * `keys()`
      * Return a new view of the underlying mapping's keys.
    * `values()`
      * Return a new view of the underlying mapping's values.
    * `reversed(proxy)`
      * Return a reverse iterator over the keys of the underlying mapping.
    * `hash(proxy)`
      * Return a hash of the underlying mapping.
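
The read-only and dynamic-view behaviour described above can be sketched as follows; the `settings` dictionary is a made-up example.

```python
from types import MappingProxyType

settings = {'debug': False}
proxy = MappingProxyType(settings)

try:
    proxy['debug'] = True      # item assignment is not supported
except TypeError as exc:
    print(type(exc).__name__)  # TypeError

settings['debug'] = True       # changes go through the underlying mapping
print(proxy['debug'])          # True

snapshot = proxy.copy()        # shallow copy of the underlying dict
print(type(snapshot))          # <class 'dict'>
snapshot['debug'] = False
print(proxy['debug'])          # True: the copy is independent of the proxy
```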

In [23]:
from types import MappingProxyType

class MyClass:
    a = 1

print(MyClass.__dict__)
print(type(MyClass.__dict__))
my_class = MyClass()
print(my_class.__dict__)
print(type(my_class.__dict__))

d = {'a': 1, 'b': 2}
mp = MappingProxyType(d)
print(mp)
d['c'] = 3
print(mp) # the mapping proxy also changed automatically as it dynamically reflects the underlying mapping

{'__module__': '__main__', 'a': 1, '__dict__': <attribute '__dict__' of 'MyClass' objects>, '__weakref__': <attribute '__weakref__' of 'MyClass' objects>, '__doc__': None}
<class 'mappingproxy'>
{}
<class 'dict'>
{'a': 1, 'b': 2}
{'a': 1, 'b': 2, 'c': 3}
