# [Day 4: High Entropy Passphrases](https://adventofcode.com/2017/day/4)

## Part A

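Looks like we're taking a major step down in difficulty today. For this puzzle, all we want to do is figure out whether a row has any duplicated words in it (to be exact, given a text file, we want to count the number of lines without a duplicated word). Note that the functions below count the rows that *do* contain a duplicate, so the puzzle answer is the total number of rows minus that count.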

In [1]:
def test_row_validator(f):
    assert(f("aa bb cc") == 0)
    assert(f("aa bb cc bb") == 1)
    assert(f("aaa bb a aa") == 0)
    assert(f("red yellow green\nred red red\nblue green green") == 2)

def row_has_duplicates(row):
    words = row.split()
    return len(words) != len(set(words))

def row_validator(s):
    rows = s.split('\n')
    ct = 0
    for row in rows:
        if row_has_duplicates(row):
            ct += 1
    return ct

def row_validator_short(s):
    return sum([int(row_has_duplicates(row)) for row in s.split('\n')])
    
test_row_validator(row_validator)
test_row_validator(row_validator_short)

The only trick here is how we check if a row has duplicates. Python's `set` is a collection of unique elements, so by constructing a set from a list we automatically toss out any duplicate elements. Then, if the set and the original list have different lengths, we know there must've been at least one duplicate. While very clean and short, this solution is actually slightly inefficient since it doesn't break out early upon finding a duplicate (e.g., if the row is "a a b c d e f g h i", it'll still construct the entire set even though we know the row has a duplicate after seeing the first two a's). So we can make a version that uses the same idea but exits as soon as it finds a duplicate:

In [2]:
def row_has_duplicates_fast(row):
    words = row.split()
    words_set = set()
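    for word in words:
        if word in words_set:
            return True  # Found a repeat, so we can stop early
        words_set.add(word)
    return False  # No repeats; without this line the function would return None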

If we want to get even fancier though we can use Python's `any` function, which is equivalent to the following:

```python
def any(iterable):
    for element in iterable:
        if element:
            return True
    return False
```

In other words, `any(list)` essentially checks if anything in the list evaluates to True and handles breaking early for us. With this we can hackily shorten to the following:

In [3]:
def row_has_duplicates_fast_and_short(row):
    words = row.split()
    words_set = set()
    return any(True if word in words_set else words_set.add(word) for word in words)

Note that this relies on Python's `set.add` returning `None`, which evaluates to `False` when cast to a boolean.

In [4]:
example_set = set()
print(example_set.add(3))
print(bool(example_set.add(3)))

None
False
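
As a quick sanity check that `any` really does stop at the first truthy value, here's a small sketch that records which words get pulled from the row before it returns. The `words_with_logging` helper and the `consumed` list are made up for this illustration and aren't part of the solution.

```python
consumed = []  # Hypothetical log of every word the generator hands out

def words_with_logging(words):
    """Yield each word, recording it first so we can see how far any() got."""
    for word in words:
        consumed.append(word)
        yield word

row = "a a b c d e f g h i"
seen = set()
# Same expression as in row_has_duplicates_fast_and_short, fed by the logging generator
result = any(True if w in seen else seen.add(w) for w in words_with_logging(row.split()))

print(result)    # True
print(consumed)  # ['a', 'a']
```

Only the first two words are ever consumed, so the rest of the row is never touched.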


Now let's test the timing differences of these functions with some different types of strings.

In [5]:
import random

no_dup_row = ' '.join([str(i) for i in range(10000)])

all_dup_row = 'hello ' * 9999 + 'hello'

one_dup_row = list(range(10000))
one_dup_row[1] = 0 # Now there are two copies of 0
random.shuffle(one_dup_row)
one_dup_row = ' '.join([str(i) for i in one_dup_row])

many_dups_row = list(range(5000)) + list(range(5000))
random.shuffle(many_dups_row)
many_dups_row = ' '.join([str(i) for i in many_dups_row])

print('Timing no_dup_row:')
%timeit row_has_duplicates(no_dup_row)
%timeit row_has_duplicates_fast(no_dup_row)
%timeit row_has_duplicates_fast_and_short(no_dup_row)
print()

print('Timing all_dup_row:')
%timeit row_has_duplicates(all_dup_row)
%timeit row_has_duplicates_fast(all_dup_row)
%timeit row_has_duplicates_fast_and_short(all_dup_row)
print()

print('Timing one_dup_row:')
%timeit row_has_duplicates(one_dup_row)
%timeit row_has_duplicates_fast(one_dup_row)
%timeit row_has_duplicates_fast_and_short(one_dup_row)
print()

print('Timing many_dups_row:')
%timeit row_has_duplicates(many_dups_row)
%timeit row_has_duplicates_fast(many_dups_row)
%timeit row_has_duplicates_fast_and_short(many_dups_row)
print()

Timing no_dup_row:
903 µs ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.62 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.94 ms ± 64.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Timing all_dup_row:
835 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
329 µs ± 5.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
328 µs ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Timing one_dup_row:
937 µs ± 8.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.42 ms ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.61 ms ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Timing many_dups_row:
926 µs ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
357 µs ± 7.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
351 µs ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



So first, the expected part: it looks like our "fast" exit-early method is indeed a bit faster in the cases where we have lots of duplicates. It seems that the shorter version is about the same as the longer one (so it looks like the `any` function is pretty similar to the actual `for` loop we were using).

The slightly weird part is that on the row without duplicates, the "slower" method actually performs significantly better (similarly for the one-duplicate row; presumably the one duplicate landed relatively late in the shuffled list, which is why the early-exit versions were only a little quicker there than on the no-duplicate row). I was a bit confused by this at first because I thought that the faster version was essentially the same as the slower one but with an early exit. Then I realized that during set construction, Python probably doesn't need a separate membership check at all, since equal elements hash to the same thing and safely "overwrite" each other in a single insert. So in our "fast" method we're doing roughly twice the set work that just constructing the set does, because every word gets looked up twice (once to check if it's in the set, then again when actually adding it). On top of that, our loop runs as Python code, while in CPython `set(words)` does its looping in C.
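
The "twice the work" idea can be made concrete with a rough sketch that counts how often Python asks an element for its hash under each approach. `CountingKey` and the two `count_hashes_*` functions are hypothetical helpers invented for this check, not anything from the solution above.

```python
class CountingKey:
    """Hypothetical wrapper that counts how often its hash is requested."""
    hash_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        CountingKey.hash_calls += 1
        return hash(self.value)

    def __eq__(self, other):
        return isinstance(other, CountingKey) and self.value == other.value


def count_hashes_set_constructor(values):
    """Hash calls made when building the set in one go."""
    keys = [CountingKey(v) for v in values]
    CountingKey.hash_calls = 0
    set(keys)
    return CountingKey.hash_calls


def count_hashes_loop(values):
    """Hash calls made by the check-then-add loop with an early exit."""
    keys = [CountingKey(v) for v in values]
    CountingKey.hash_calls = 0
    seen = set()
    for key in keys:
        if key in seen:   # first lookup
            break
        seen.add(key)     # second lookup
    return CountingKey.hash_calls


values = list(range(1000))  # no duplicates, so the loop never exits early
print(count_hashes_set_constructor(values))  # 1000
print(count_hashes_loop(values))             # 2000
```

With plain strings, CPython caches each string's hash, so the extra cost there is the second table lookup rather than a second hash computation, but the counts show the doubled set work either way.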

## Part B

Part B is almost identical to Part A, except that rather than looking for rows without duplicates, we're looking for rows without anagrams. In other words, if a row contains 'abc' and 'bca', that invalidates it, since they can be rearranged to match each other. The trick here is just to sort the letters of each word in the row before doing what we did before. If two words are anagrams (contain the same letters), then after sorting they'll be identical.

In [6]:
def test_row_anagram_validator(f):
    test_row_validator(f) # Should still satisfy all the Part A conditions
    assert(f("abc cba def") == 1)
    assert(f("aaaa aaa aa a") == 0)
    assert(f("racecar carrace") == 1)

def row_has_anagrams(row):
    words = row.split()
    sorted_words = [ ''.join(sorted(word)) for word in words ]
    return len(sorted_words) != len(set(sorted_words))

def row_anagram_validator(s):
    rows = s.split('\n')
    ct = 0
    for row in rows:
        if row_has_anagrams(row):
            ct += 1
    return ct
    
test_row_anagram_validator(row_anagram_validator)
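
Since Part A and Part B differ only in how each word is normalized before comparing, one possible way to fold them together is a single early-exit checker that takes a `key` function. This is just a sketch: the names `row_has_repeats`, `sorted_letters`, `count_rows_with_repeats` and `count_valid_rows` are invented here, and skipping blank rows is an assumption about how the input file ends.

```python
def identity(word):
    """Part A: compare words exactly as written."""
    return word

def sorted_letters(word):
    """Part B: anagrams collapse to the same string once their letters are sorted."""
    return ''.join(sorted(word))

def row_has_repeats(row, key=identity):
    """Return True as soon as two words in the row share the same key."""
    seen = set()
    for word in row.split():
        k = key(word)
        if k in seen:
            return True
        seen.add(k)
    return False

def count_rows_with_repeats(s, key=identity):
    """Same quantity the validators above return: rows that contain a repeat."""
    return sum(row_has_repeats(row, key) for row in s.split('\n'))

def count_valid_rows(s, key=identity):
    """Rows without a repeat, ignoring blank rows (assumed to be padding)."""
    rows = [row for row in s.split('\n') if row.strip()]
    return sum(not row_has_repeats(row, key) for row in rows)

sample = "abc cba def\naaaa aaa aa a"
print(count_rows_with_repeats(sample))                      # 0
print(count_rows_with_repeats(sample, key=sorted_letters))  # 1
print(count_valid_rows(sample, key=sorted_letters))         # 1
```

Passing `key=sorted_letters` gives the Part B behaviour, and leaving `key` at its default gives Part A.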