# `gitermap` examples (3)

This notebook looks at an advanced extension class that does some extra work to ensure only unique iterations run. It hashes the input parameters and skips any repeats it finds.

#### Version

In [104]:
import gitermap
gitermap.__version__

'0.1.0'

In [1]:
from gitermap import MapContext, UniqueMapContext, umap
import itertools as it
import time

Repeating the first basic example...

In [2]:
def f1(x):
    time.sleep(0.5)
    return x**2

In [3]:
umap(f1, range(10))

100%|██████████| 10/10 [00:05<00:00,  1.96it/s]


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

## Introducing `UniqueMapContext`

This class extends `MapContext`, so it keeps the `tqdm` progress bars, parallelisation and caching, and additionally prevents runs with duplicated parameters.

This is particularly useful for avoiding unnecessary re-runs, and is the beginning of a pseudo-indexing format where each element in a list-comprehension run essentially has its own unique ID.

Let's take the following example.

In [4]:
import warnings
from functools import reduce
import operator

Here we will create a `UniqueMapContext` object and compute a list comprehension with a few elements.

In [84]:
ctx = UniqueMapContext()
ctx.compute(f1, [1, 3, 5])

100%|██████████| 3/3 [00:01<00:00,  1.97it/s]


[1, 9, 25]

The hashes of the parameter values are stored in this set:

In [85]:
ctx._hash_cache

{-5907199003289078419, -1080395952817050184, 829895489864556342}

Under the hood, `UniqueMapContext` hashes each input parameter, so it recognises the same parameters when they are fed to it in subsequent calls.

If we re-compute here, every entry comes back as `None` because those parameters have already been computed.

In [86]:
ctx.compute(f1, [1, 3, 5])

100%|██████████| 3/3 [00:00<00:00, 2996.64it/s]


[None, None, None]

`UniqueMapContext` also handles duplicate parameters within the same call. To drop the `None` placeholders explicitly, pass `filter_none=True`:

In [88]:
UniqueMapContext(filter_none=True).compute(f1, [1, 1, 5])

100%|██████████| 3/3 [00:01<00:00,  2.96it/s]


[1, 25]
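The skip-on-repeat behaviour shown so far can be sketched in plain Python. This is only an illustration under stated assumptions, not `gitermap`'s actual implementation: the class name `SketchUniqueMap`, its `seen` attribute and the helper `square` are all hypothetical.

```python
class SketchUniqueMap:
    """Hypothetical stand-in for illustration; not gitermap's code."""

    def __init__(self, filter_none=False):
        self.filter_none = filter_none
        self.seen = set()  # hashes of parameters that have already run

    def compute(self, f, xs):
        results = []
        for x in xs:
            key = hash(x)
            if key in self.seen:
                # Repeat parameter: skip the call, keep a placeholder.
                results.append(None)
            else:
                self.seen.add(key)
                results.append(f(x))
        if self.filter_none:
            # Drop the placeholders left by skipped repeats.
            results = [r for r in results if r is not None]
        return results


def square(x):
    return x ** 2


sketch = SketchUniqueMap()
first = sketch.compute(square, [1, 3, 5])   # all new, so all computed
second = sketch.compute(square, [1, 3, 5])  # all repeats, so all skipped
within = SketchUniqueMap(filter_none=True).compute(square, [1, 1, 5])
```

Because the set lives on the object, repeats are detected both across calls and within a single call.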

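Note also that `UniqueMapContext` does not pre-compute the hash values, so it can be used with infinite or lazily generated iterables: each hash is added to the hash table just-in-time (JIT). In the example below, `1` and `3` were already computed by `ctx`, so they come back as `None`.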

In [89]:
ctx.compute(f1, it.islice(it.count(), 0, 5))

5it [00:01,  3.25it/s]                       


[0, None, 4, None, 16]
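One way to picture the just-in-time hashing is a generator that hashes each element only when it is pulled. The sketch below is an assumption about the general technique rather than the library's code; `lazy_unique_map` and `seen` are hypothetical names, and here the slice is applied to the output instead of the input to make the laziness visible.

```python
import itertools as it


def lazy_unique_map(f, iterable, seen):
    """Yield f(x) for new x and None for repeats, hashing one item at a time."""
    for x in iterable:
        key = hash(x)
        if key in seen:
            yield None
        else:
            seen.add(key)
            yield f(x)


# Pretend 1 and 3 were computed earlier, as with `ctx` above.
seen = {hash(1), hash(3)}

# it.count() never ends, but only the items actually requested get hashed.
lazy = lazy_unique_map(lambda x: x ** 2, it.count(), seen)
first_five = list(it.islice(lazy, 5))
```

Nothing is hashed up front, so the input never needs to be materialised as a list.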

We can also handle cases where several parameters are passed rather than just one. Each element is 'stringified' where possible, the strings are concatenated and the final result is hashed once, because inconsistencies can crop up if each element is hashed individually and the hashes are then added:

In [90]:
def f2(x, y):
    time.sleep(0.5)
    return x**2 + y**2

In [91]:
ctx2 = UniqueMapContext(filter_none=False)
ctx2.compute(f2, [1, 3, 5], [2, 4, 6])

100%|██████████| 3/3 [00:01<00:00,  1.96it/s]


[5, 25, 61]

Now attempting repeat computation...

In [92]:
ctx2.compute(f2, [1, 3, 5], [2, 4, 6])

100%|██████████| 3/3 [00:00<?, ?it/s]


[None, None, None]
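To see why adding per-element hashes is fragile, compare it with hashing one concatenated string. This is a minimal sketch with hypothetical helpers (`concat_repr`, `concat_key`, `summed_key`); the `"|"` separator is an assumption made here so that `(1, 23)` and `(12, 3)` do not produce the same string, and the library may join the pieces differently.

```python
def concat_repr(*params):
    # Stringify each parameter and join them; the separator is an assumption.
    return "|".join(str(p) for p in params)


def concat_key(*params):
    # Hash the concatenated string once.
    return hash(concat_repr(*params))


def summed_key(*params):
    # Hash each parameter separately and add the results.
    return sum(hash(p) for p in params)


# Addition is commutative, so argument order is lost:
order_lost = summed_key(1, 2) == summed_key(2, 1)

# The concatenated strings keep the order, so they stay distinct:
order_kept = concat_repr(1, 2) != concat_repr(2, 1)
```

With the summed version, `f2(1, 2)` and `f2(2, 1)` would be treated as the same run even though they are different parameter sets in general.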

### More complex example

This may work well for simple objects, but if we return to `example2`, will it work for more complex objects?

In [93]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
from sklearn.datasets import load_boston
boston = load_boston()
boston.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

In [94]:
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import scale
from sklearn.linear_model import Ridge

import numpy as np

In [95]:
def test_score(alpha, cv, X, y):
    # preprocess X, y
    X_new = scale(X)
    return cross_val_score(Ridge(alpha), X_new, y, cv=cv, scoring="r2")

Simple test...

In [96]:
test_score(1., 5, boston.data, boston.target)

array([ 0.64333037,  0.71683688,  0.58814318,  0.08214252, -0.22702517])

Now for finding the best $\alpha$ as before, except with a `UniqueMapContext` object.

In [100]:
alphas = np.linspace(-4., 5.5, 50)
with UniqueMapContext(filter_none=True) as ctx1:
    alpha1 = ctx1.compute(test_score, alphas, cv=5, X=boston.data, y=boston.target)

100%|██████████| 50/50 [00:00<00:00, 53.67it/s]


As we can see, a set of hash values has been created, one for each value in the `alphas` NumPy array.

In [101]:
np.asarray(ctx1._hash_cache)

array({3640796605250424320, 8582013873942379647, 4185363145425555971, -9212667650027664242, -796614618200864115, -970730158473949170, -1724854889836662003, 2769429073849820431, -5958354412401878383, 4751731169373952146, -8518525571303823591, -6986906522839265766, -1602496587683899106, -747029755900037345, -2763193489603180127, -8183156502946042078, 8225711593259415837, 6009838678572839713, 6654423652127154338, -488803356001614681, 3456741540624322599, -4608082354363308375, 1718533815024643751, -5879546307033266514, 1731702642152020013, 6604239913841015341, -8647439265574675275, 5373195172001057201, 7404460542217592623, 8967114733100054838, 579423344652318779, 7709594867032831041, -24917617203115193, 8863760601631744331, -1682203351306608178, 6669021080371333839, -6837740234565313580, 315766718079107413, 5480592956131337301, -8070814405452750115, -8211390189641821729, -1591772869852481187, -5249611260693451165, -244246109208971933, -8705009027257791642, -1757981256186294934, -5579318763

Re-running skips every computation, and because `filter_none=True` the result is an empty list:

In [102]:
ctx1.compute(test_score, alphas, cv=5, X=boston.data, y=boston.target)

100%|██████████| 50/50 [00:00<?, ?it/s]


[]

Note that only the `*args` parameters are checked for repeats. No checking is performed on fixed keyword arguments, which are assumed not to change between calls to `compute()`. In practice they could change, so any changeable parameter should always be passed through `*args` within this library, where it is factored in as an iterable part of the list comprehension.

Here we pass one `alpha` value that has not been computed yet and one that has. The new value is computed and the repeated one is ignored:
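The consequence of ignoring keyword arguments can be shown with a small stand-alone sketch. It is not the library's code: `compute_unique`, `power` and the `"|"` separator are hypothetical, and the key is built only from the positional iterables, as described above.

```python
def compute_unique(f, seen, *iterables, **kwargs):
    """Skip repeats, keying only on the zipped positional parameters."""
    out = []
    for params in zip(*iterables):
        # kwargs deliberately do not contribute to the key.
        key = hash("|".join(str(p) for p in params))
        if key in seen:
            out.append(None)
        else:
            seen.add(key)
            out.append(f(*params, **kwargs))
    return out


def power(x, p=2):
    return x ** p


# Varying parameter passed as a keyword: the change goes unnoticed.
seen_kw = set()
squares = compute_unique(power, seen_kw, [2, 3], p=2)
cubes_skipped = compute_unique(power, seen_kw, [2, 3], p=3)

# Varying parameter passed positionally: it becomes part of the key.
seen_pos = set()
squares_pos = compute_unique(power, seen_pos, [2, 3], [2, 2])
cubes_pos = compute_unique(power, seen_pos, [2, 3], [3, 3])
```

In the keyword version the second call returns only placeholders even though `p` changed, which is why anything that varies belongs in the positional iterables.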

In [103]:
ctx1.compute(test_score, [1.5, alphas[0]], cv=5, X=boston.data, y=boston.target)

100%|██████████| 2/2 [00:00<00:00, 92.09it/s]


[array([ 0.64531348,  0.7182381 ,  0.58865436,  0.08358385, -0.21505711])]

### Limitations

There are a number of limiting factors at play when it comes to using this extension:

1. Stringifying floating-point numbers can run into inherent precision problems. Slightly different string results then lead to different hashes.
2. With very large $N$, the growing hash set can become quite inefficient.
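The first limitation is easy to reproduce in plain Python. The rounding helper below, `rounded_key`, is a hypothetical mitigation and not something the library is known to do; the choice of 9 decimal places is arbitrary.

```python
a = 0.1 + 0.2
b = 0.3

# Mathematically equal, but the floats and their strings differ.
same_value = a == b
same_string = str(a) == str(b)


def rounded_key(x, ndigits=9):
    # Hypothetical workaround: round before stringifying and hashing.
    return hash(str(round(x, ndigits)))


same_rounded_key = rounded_key(a) == rounded_key(b)
```

Rounding trades one problem for another, since two genuinely different values that agree to `ndigits` places would then be treated as repeats.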