In [1]:
import re
import itertools
from collections import defaultdict
from functools import lru_cache

# How to Count Things

This notebook lists problems designed to show how to count things. Right now there are two problems.

# Student Records: Late, Absent, Present

Consider this problem:

> (1) Students at a school must meet with the guidance counselor if they have two absences, or three consecutive late days. Each student's attendance record consists of a string of 'A' for absent, 'L' for late, or 'P' for present. For example: "LAPLPA" requires a meeting (because there are two absences), and "LAPLPL" is OK (there are three late days, but they are not consecutive). Write a function that takes such a string as input and returns `True` if the student's record is OK. 

> (2) Write a function to calculate the number of attendance records of length N that are OK.

For part (1), the simplest approach is to use `re.search`:

In [2]:
def ok(record: str) -> bool: return not re.search(r'LLL|A.*A', record)

In [3]:
def test_ok():
    assert     ok("LAPLLP")
    assert not ok("LAPLLL")   # 3 Ls in a row
    assert not ok("LAPLLA")   # 2 As overall
    assert     ok("APLLPLLP")
    assert not ok("APLLPLLL") # 3 Ls in a row
    assert not ok("APLLPLLA") # 2 As overall
    return 'pass'
    
test_ok()  

'pass'
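As a cross-check on the regular expression, here is a sketch of the same rule written with plain string operations. The name `ok_plain` is my own and not part of the solution above; the brute-force loop only confirms that the two versions agree on every record of up to 6 characters.

```python
import itertools
import re

def ok(record: str) -> bool:
    "The regex version from above, repeated so this cell runs on its own."
    return not re.search(r'LLL|A.*A', record)

def ok_plain(record: str) -> bool:
    "Hypothetical alternative: fewer than two absences and no three consecutive lates."
    return record.count('A') < 2 and 'LLL' not in record

# Compare the two on every record of length 0 through 6.
records = (''.join(chars)
           for n in range(7)
           for chars in itertools.product('LAP', repeat=n))
assert all(ok(r) == ok_plain(r) for r in records)
```

The regex is more compact; the plain version spells out the two conditions separately, which can make it easier to see why a record fails.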

For part (2), I'll start with a simple (but slow) solution called `total_ok_slow` that enumerates `all_strings` (using `itertools.product`) and counts how many are `ok`. I use the `quantify` recipe ([from `itertools`](https://docs.python.org/3.6/library/itertools.html#itertools-recipes)) to count them:

In [4]:
def all_strings(alphabet, N): 
    "All length-N strings over the given alphabet."
    return map(cat, itertools.product(alphabet, repeat=N))

def total_ok_slow(N: int) -> int:
    "How many strings over 'LAP' of length N are ok?"
    return quantify(all_strings('LAP', N), ok)

def quantify(iterable, pred=bool) -> int:
    "Count how many times the predicate is true of items in iterable."
    return sum(map(pred, iterable))

cat = ''.join

In [5]:
{N: total_ok_slow(N) for N in range(11)}

{0: 1,
 1: 3,
 2: 8,
 3: 19,
 4: 43,
 5: 94,
 6: 200,
 7: 418,
 8: 861,
 9: 1753,
 10: 3536}

This looks good, but I will need a more efficient algorithm to handle large values of *N*. Here's how I think about it:

* I can't enumerate all the strings; there are too many of them, 3<sup>N</sup>. 
* Even if I only enumerate the ok strings, there are still too many, O(2<sup>N</sup>).
* Instead, I'll want to keep track of a *summary* of all the ok strings of length *N*, and use that to quickly compute a summary of the ok strings of length *N*+1. I recognize this as a *[dynamic programming](https://en.wikipedia.org/wiki/Dynamic_programming)* approach.

* What is in the summary? A list of all ok strings is too much. A count of the number of ok strings is not enough. Instead, I will group together the strings that have the same number of `'A'` characters in them, and the same number of consecutive `'L'` characters at the end of the string, and count them.  I don't need to count strings that have two or more `'A'` characters, or 3 consecutive `'L'` characters anywhere in the string. And I don't need to worry about runs of 1 or 2 `'L'` characters embedded in the middle of the string. So the summary is a mapping of the form `{(A, L): count, ...}`. 

* For *N* = 2, the summary looks like this:

      #(number_of_A_in_string, number_of_L_at_end_of_string): count
      #(A, L): c
      {(0, 0): 2, # LP, PP
       (0, 1): 1, # PL
       (0, 2): 1, # LL
       (1, 0): 3, # AP, LA, PA
       (1, 1): 1} # AL
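
The grouping described in the list above can be reproduced by brute force for small *N*. This is only a sketch for building intuition: `is_ok` and `brute_force_summary` are names I made up, and the enumeration is exponential, so it is not the efficient algorithm itself.

```python
import itertools
from collections import Counter

def is_ok(record: str) -> bool:
    "Fewer than two 'A's and no 'LLL' (the same rule that `ok` checks)."
    return record.count('A') < 2 and 'LLL' not in record

def brute_force_summary(N: int) -> dict:
    "Group the ok strings of length N by (number of 'A's, number of trailing 'L's)."
    summary = Counter()
    for chars in itertools.product('LAP', repeat=N):
        record = ''.join(chars)
        if is_ok(record):
            trailing_L = len(record) - len(record.rstrip('L'))
            summary[record.count('A'), trailing_L] += 1
    return dict(summary)

print(brute_force_summary(2))
```

For *N* = 2 it finds 8 ok strings in five groups, with AP, LA, and PA sharing the `(1, 0)` group.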
 

Here is a function to create the summary for `N+1`, given the summary for `N`:

In [6]:
def next_summary(prev_summary: dict) -> dict:
    "Given a summary of the form {(A, L): count, ...}, return summary for strings one char longer."
    summary = defaultdict(int)
    for (A, L), c in prev_summary.items():
        if A < 1: summary[A+1, 0] += c # transition with 'A'
        if L < 2: summary[A, L+1] += c # transition with 'L'
        summary[A, 0] += c             # transition with 'P'
    return summary

For `N = 0`, the summary is `{(0, 0): 1}`, because there is one string, the empty string, which has no `'A'` and no `'L'`. From there, `total_ok` proceeds in a "bottom-up" fashion, starting at `0` and working up to `N`, to compute the total number of ok strings:

In [7]:
def total_ok(N) -> int:
    "How many strings of length N are ok?"
    summary = {(0, 0): 1}
    for _ in range(N):
        summary = next_summary(summary)
    return sum(summary.values()) 

We can use this to go way beyond what we could do with `total_ok_slow`:

In [8]:
%time total_ok(300)

CPU times: user 1.28 ms, sys: 16 µs, total: 1.29 ms
Wall time: 1.32 ms


5261545087067582125179062608958232695543100705754634272071166414871321070487675367

There are over 10<sup>81</sup> ok strings of length 300, more than the commonly estimated number of atoms in the observable universe (about 10<sup>80</sup>). But it only took around a millisecond to count them.

Dynamic programming can also be done top-down (where we start at `N` and work down to `0`):

In [9]:
def total_ok(N) -> int:
    "How many strings of length N are ok?"
    return sum(summary_for(N).values())
    
def summary_for(N) -> dict: 
    "The {(A, L): count} summary for strings of length N."
    return ({(0, 0): 1} if N == 0 else next_summary(summary_for(N - 1)))

In [10]:
%time total_ok(300)

CPU times: user 1.7 ms, sys: 78 µs, total: 1.77 ms
Wall time: 1.81 ms


5261545087067582125179062608958232695543100705754634272071166414871321070487675367

We get the same answer in about the same amount of time.
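One practical difference between the two styles: the top-down version makes one nested call per character, so for *N* in the thousands it would run into CPython's recursion limit (typically 1000 by default), while a loop has no such limit. Here is a minimal self-contained sketch of the loop form; `count_ok` is a hypothetical restatement of the bottom-up `total_ok`, using a plain dict instead of `defaultdict`.

```python
def count_ok(N: int) -> int:
    "Bottom-up count of ok strings of length N (hypothetical restatement of `total_ok`)."
    summary = {(0, 0): 1}              # the empty string: no 'A', no trailing 'L'
    for _ in range(N):
        step = {}
        for (A, L), c in summary.items():
            if A < 1:                  # append 'A': one more absence, trailing run resets
                step[A + 1, 0] = step.get((A + 1, 0), 0) + c
            if L < 2:                  # append 'L': trailing run grows by one
                step[A, L + 1] = step.get((A, L + 1), 0) + c
            step[A, 0] = step.get((A, 0), 0) + c   # append 'P': trailing run resets
        summary = step
    return sum(summary.values())

print(len(str(count_ok(300))))   # prints 82, the number of digits in the count for N = 300
big = count_ok(5000)             # a plain loop, so recursion depth is not a concern
```

The summary never holds more than six entries, so each step does a constant number of big-integer additions.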

Let's verify our results against the slow, reliable `total_ok_slow`, and look at the summaries for the first few values of `N`:

In [11]:
print(' N   ok summary(N)')
print('-- ---- ----------')
for N in range(11): 
    assert total_ok(N) == total_ok_slow(N)
    print('{:2} {:4} {}'.format(N, total_ok(N), dict(summary_for(N))))

 N   ok summary(N)
-- ---- ----------
 0    1 {(0, 0): 1}
 1    3 {(0, 1): 1, (1, 0): 1, (0, 0): 1}
 2    8 {(0, 1): 1, (1, 0): 3, (0, 0): 2, (0, 2): 1, (1, 1): 1}
 3   19 {(0, 1): 2, (1, 2): 1, (0, 0): 4, (1, 0): 8, (0, 2): 1, (1, 1): 3}
 4   43 {(0, 1): 4, (1, 2): 3, (0, 0): 7, (1, 0): 19, (0, 2): 2, (1, 1): 8}
 5   94 {(0, 1): 7, (1, 2): 8, (0, 0): 13, (1, 0): 43, (0, 2): 4, (1, 1): 19}
 6  200 {(0, 1): 13, (1, 2): 19, (0, 0): 24, (1, 0): 94, (0, 2): 7, (1, 1): 43}
 7  418 {(0, 1): 24, (1, 2): 43, (0, 0): 44, (1, 0): 200, (0, 2): 13, (1, 1): 94}
 8  861 {(0, 1): 44, (1, 2): 94, (0, 0): 81, (1, 0): 418, (0, 2): 24, (1, 1): 200}
 9 1753 {(0, 1): 81, (1, 2): 200, (0, 0): 149, (1, 0): 861, (0, 2): 44, (1, 1): 418}
10 3536 {(0, 1): 149, (1, 2): 418, (0, 0): 274, (1, 0): 1753, (0, 2): 81, (1, 1): 861}


# Count Strings with Alphabetic First Occurrences

Here's another problem:

> Given an alphabet of length k, how many strings of length k can be formed such that the first occurrences of each character in the string are a prefix of the alphabet?

Let's first make sure we understand the problem. Since *k* could go well beyond 26, I will choose as my alphabet the integers, not the letters `'abc...'`. An alphabet of length *k* is `range(k)`, and a valid string of length 3 could be `[0, 1, 2]` or `[0, 0, 1]` (or other possibilities). These are valid because the first occurrences of the characters in these strings are `[0, 1, 2]` and `[0, 1]`, respectively, and these are prefixes of `range(3)`. But `[0, 0, 2]` is not valid, because the first occurrences are `[0, 2]`, and this is not a prefix (because it is missing the `1`).

I'll define four key concepts:

In [12]:
def valid(s) -> bool: 
    "A string is valid if its first occurrences are a prefix of the alphabet."
    return is_prefix(first_occurrences(s))

def is_prefix(s) -> bool: 
    "A string is a valid prefix if it is consecutive integers starting from 0."
    return s == list(range(len(s)))

def first_occurrences(s) -> list:
    "The unique elements of s, in the order they first appear." 
    firsts = []
    for x in s:
        if x not in firsts: firsts.append(x)
    return firsts 

def all_strings(k): 
    "All strings of length k over an alphabet of k ints."
    return itertools.product(range(k), repeat=k)

In [13]:
def test(): 
    assert valid([0, 1, 2]) and valid([0, 0, 1])
    assert not valid([0, 0, 2])
    assert is_prefix([0, 1, 2])
    assert first_occurrences([0, 0, 2]) == [0, 2]
    assert set(all_strings(2)) == {(0, 0), (0, 1), (1, 0), (1, 1)}
    #            s             first_occurrences(s) valid(s)
    assert test1([0, 1, 2],    [0, 1, 2],           True)  
    assert test1([0, 0, 0],    [0],                 True)      
    assert test1([1],          [1],                 False)      
    assert test1([0, 1, 3],    [0, 1, 3],           False)
    assert test1([0, 1, 3, 2], [0, 1, 3, 2],        False)
    assert test1([0, 1, 0, 1, 0, 2, 1], [0, 1, 2],  True)
    assert test1([0, 1, 0, 2, 1, 3, 1, 2, 5, 4, 3], [0, 1, 2, 3, 5, 4], False)
    assert test1([0, 1, 0, 2, 1, 3, 1, 2, 4, 5, 3], [0, 1, 2, 3, 4, 5], True)
    return 'ok'

def test1(s, firsts, is_valid):
    return first_occurrences(s) == firsts and valid(s) == is_valid
    
test()

'ok'
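The loop in `first_occurrences` checks `x not in firsts` against a list, which rescans the list for every element. As a possible alternative, here is a sketch that relies on dicts preserving insertion order (guaranteed from Python 3.7 on); `first_occurrences_dict` is my own name, not part of the notebook.

```python
import itertools

def first_occurrences(s) -> list:
    "The list-based version from above, repeated so this cell runs on its own."
    firsts = []
    for x in s:
        if x not in firsts: firsts.append(x)
    return firsts

def first_occurrences_dict(s) -> list:
    "Hypothetical variant: dict keys keep insertion order, so they are the first occurrences."
    return list(dict.fromkeys(s))

# The two versions agree on every string of length 4 over range(4).
assert all(first_occurrences(s) == first_occurrences_dict(s)
           for s in itertools.product(range(4), repeat=4))
```

For the short strings used here the difference in speed is negligible; the variant mainly shows that "unique elements in order of first appearance" is exactly what `dict.fromkeys` produces.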

First, I will solve the problem in a slow but sure way: generate all possible strings, then count the number that are valid. The complexity of this algorithm is $O(k^{k+1})$, because there are $k^k$ strings, and to validate a string requires looking at all $k$ characters.

In [14]:
def how_many_slow(k) -> int: 
    """Count the number of valid strings. (Try all possible strings.)"""
    return quantify(all_strings(k), valid)

[how_many_slow(k) for k in range(7)]

[1, 1, 2, 5, 15, 52, 203]

Now let's think about how to speed that up. I don't want to have to consider every possible string, because there are too many ($k^k$) of them. Can I group together many strings and just count the number of them, without enumerating each one? For example, if I knew there were 52 valid strings of length $k-1$ (and didn't know anything else about them), can I tell how many valid strings of length $k$ there are? I don't see a way to do this directly, because the number of ways to extend a valid string is dependent on the number of distinct characters in the string. If a string has $m$ distinct characters, then I can extend it in $m$ ways by repeating any of those $m$ characters, or I can introduce a first occurrence of character number $m+1$ in just 1 way.

So I need to keep track of the number of valid strings of length $k$ that have exactly $m$ distinct characters (those characters must be exactly `range(m)`). I'll call that number `C(k, m)`. Because I can reach a recursive call to `C(k, m)` by many paths, I will use the `lru_cache` decorator to keep track of the computations that I have already done. Then I can define `how_many(k)` as the sum over all values of `m` of `C(k, m)`:

In [15]:
@lru_cache(maxsize=None) # unbounded cache, so no (k, m) result is ever evicted
def C(k, m) -> int:
    "Count the number of valid strings of length k, that use m distinct characters."
    return (1 if k == 0 == m else
            0 if k == 0 != m else
            C(k-1, m) * m + C(k-1, m-1)) # m ways to add an old character; 1 way to add new

def how_many(k): return sum(C(k, m) for m in range(k+1))
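To see the numbers behind the recurrence, here is a bottom-up sketch that builds the table of counts one row at a time instead of recursing. The generator name `count_rows` is hypothetical; it applies the same rule as `C` but only visits entries with 0 ≤ m ≤ k.

```python
def count_rows(k: int):
    "Yield the rows [C(n, 0), ..., C(n, n)] for n = 0, 1, ..., k."
    row = [1]                                   # C(0, 0) = 1: the empty string
    yield row
    for n in range(1, k + 1):
        prev = row
        row = [(prev[m] * m if m < n else 0)    # repeat one of the m characters already used
               + (prev[m - 1] if m > 0 else 0)  # or introduce character m-1 for the first time
               for m in range(n + 1)]
        yield row

for n, row in enumerate(count_rows(4)):
    print(n, row, sum(row))
```

The row sums are the values that `how_many` returns; for example, the row for length 3 is `[0, 1, 3, 1]`, which sums to 5.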

In [16]:
how_many(100)

47585391276764833658790768841387207826363669686825611466616334637559114497892442622672724044217756306953557882560751

In [17]:
assert all(how_many(k) == how_many_slow(k) for k in range(7))
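The values 1, 1, 2, 5, 15, 52, 203 are the start of the Bell numbers, which count the ways to partition a set of $k$ elements: a valid string assigns each position to a group, with the groups labelled in order of first appearance. As an independent check that does not use `C` at all, here is a sketch that computes Bell numbers with the Bell triangle; `bell_numbers` is a hypothetical helper, not part of the notebook.

```python
def bell_numbers(n: int) -> list:
    "The first n Bell numbers, computed with the Bell triangle."
    bells, row = [], [1]
    for _ in range(n):
        bells.append(row[0])          # each row starts with the next Bell number
        new_row = [row[-1]]           # next row begins with the last entry of this row
        for above in row:
            new_row.append(new_row[-1] + above)  # left neighbour plus the entry above it
        row = new_row
    return bells

print(bell_numbers(7))   # prints [1, 1, 2, 5, 15, 52, 203]
```

Agreement between two unrelated methods is stronger evidence than the brute-force comparison alone, which only reaches small $k$.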

In [18]:
for k in itertools.chain(range(10), range(10, 121, 10)):
    print('{:3}  {:12g}'.format(k,  how_many(k)))

  0             1
  1             1
  2             2
  3             5
  4            15
  5            52
  6           203
  7           877
  8          4140
  9         21147
 10        115975
 20   5.17242e+13
 30   8.46749e+23
 40   1.57451e+35
 50   1.85724e+47
 60   9.76939e+59
 70    1.8075e+73
 80   9.91268e+86
 90   1.4158e+101
100  4.75854e+115
110  3.46846e+130
120   5.1263e+145
