# Common Python Patterns

I'm not sure how often others use these methods, but these are things I find myself doing _a lot_ in Python. Please share your own common approaches to programming problems!

1. List comprehensions
2. `enumerate()` to access the index in a loop
3. Deduping with `set`
4. Making types hashable for use with `set` (e.g. hashable dicts)

## 1. List Comprehensions

[Python Data Structures > List Comprehensions](https://docs.python.org/3/tutorial/datastructures.html#tut-listcomps)

Python's list comprehensions really stand out as a concise way to construct lists while filtering or transforming entries. They're often a lot fewer lines of code than iterative loops, though they might look unfamiliar to folks coming from other languages. In particular, the filtering condition coming _at the end_ of the expression feels unintuitive to me.

The basic structure of a list comprehension is:

```python
[expression for item in iterable if condition]
```

In [85]:
# at its most basic: random numbers over 50 out of a list of numbers in between 1 and 100
from random import choices
nums = choices(range(1, 101), k=5)
print('Numbers:', nums)
print('Numbers > 50:', [n for n in nums if n > 50])

Numbers: [7, 37, 35, 93, 75]
Numbers > 50: [93, 75]
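
To make the structure concrete, here is a small sketch of the same filter written as an explicit loop, next to a comprehension that transforms its items as well as filtering them. The fixed `nums` list and the names `over_50_loop`, `over_50`, and `doubled` are made up for illustration.

```python
nums = [7, 37, 35, 93, 75]  # fixed sample so the results are reproducible

# explicit loop equivalent to [n for n in nums if n > 50]
over_50_loop = []
for n in nums:
    if n > 50:
        over_50_loop.append(n)

over_50 = [n for n in nums if n > 50]

# the expression part can transform each item that passes the filter
doubled = [n * 2 for n in nums if n > 50]

print(over_50_loop == over_50)  # True
print(doubled)                  # [186, 150]
```

Reading the comprehension as "the loop body first, then the loop header, then the condition" may help with the condition-at-the-end ordering.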


Sometimes the data in our list won't be a simple type like a string or integer, though, and so the "expression" or "item" parts of the list comprehension are a little more complex.

In [86]:
ages = {
    'Ranganathan': 55,
    'Bates': 22,
    'Lorde': 44,
    'Berman': 33,
}
# names of people who are older than 40; dict.items() yields (key, value) tuples
names = [name for name, age in ages.items() if age > 40]
print(names)

['Ranganathan', 'Lorde']


A functional programming way of accomplishing the same thing would be to use the `filter` function to filter the `ages.items()` pairs and then the `map` function to extract the names from the filtered tuples.

In [87]:
names = list(map(lambda pair: pair[0], filter(lambda pair: pair[1] > 40, ages.items())))
print(names)

['Ranganathan', 'Lorde']


Is this really worse? I think the nested function calls are harder to read. It may be personal, but `lambda` is also one of my least favorite Python keywords; it seems like too obscure a math reference to make sense to many people.

But mainly, list comprehensions are more "pythonic": people tend to use and understand them more than the functional approaches that might be preferred in another language like JavaScript.

## 2. Accessing the Index in a Loop

I often want to know what number iteration I'm on in a loop. For instance, in scripts where I provide a cut-off limit for debugging purposes (e.g. "process the first N items only"), I need a way to know when I hit the limit, but a plain `for...in` loop in Python doesn't provide an index.

We could keep track with our own counter, but this adds a few lines of code. Let's look at two examples that print the first 3 of 10 random numbers:

In [75]:
from random import choices

# NOT what I'd recommend!
count = 0
limit = 3
items = choices(range(1000), k=10)
for item in items:
    count += 1
    if count > limit:
        break
    print(item)

167
799
511


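Now let's use the `enumerate()` function, which yields `(index, item)` tuples as it walks an iterable. Its optional second argument sets the starting index, so passing `1` makes the count begin at 1 instead of the default 0: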

In [76]:
limit = 3
items = choices(range(1000), k=10)
for index, item in enumerate(items, 1):
    if index > limit:
        break
    print(item)

922
503
101


This saves us two lines (initializing and incrementing a counter) and a generically-named variable in the scope outside our loop.
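
As a rough illustration of how the starting index changes what `enumerate()` yields, the sketch below uses a fixed list instead of random numbers so the output is predictable. The `processed` list is a hypothetical stand-in for whatever work the loop would do.

```python
items = ['a', 'b', 'c', 'd', 'e']

# default: counting starts at 0
print(list(enumerate(items))[:2])  # [(0, 'a'), (1, 'b')]

# start=1: counting starts at 1, which reads naturally for "first N items" limits
limit = 3
processed = []
for index, item in enumerate(items, start=1):
    if index > limit:
        break
    processed.append(item)

print(processed)  # ['a', 'b', 'c']
```

With the default start of 0, the equivalent check would be `if index >= limit`.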

## 3. Deduping with `set`

[Python Data Structures > Sets](https://docs.python.org/3/tutorial/datastructures.html#sets)

`set` is one of my favorite underused Python data structures. Sets come from mathematics, and they support a number of useful operations like calculating the union, intersection, and difference between sets. But I mostly use sets simply because their members are unique! If we use a set instead of a list or tuple, we get deduping for free, and we can pass a list to the `set()` constructor to dedupe it.

In [77]:
l = [1, 2, 2, 3, 4, 4, 4, 4, 5]
s = set(l)
print('Integers in list:', s)

Integers in list: {1, 2, 3, 4, 5}
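
The union, intersection, and difference operations mentioned above can be sketched with the set operators. The sets `a` and `b` are invented for this example, and the results are sorted only so that the printed order is predictable.

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

union = a | b         # members of either set
intersection = a & b  # members of both sets
difference = a - b    # members of a that are not in b

print(sorted(union))         # [1, 2, 3, 4, 5]
print(sorted(intersection))  # [3, 4]
print(sorted(difference))    # [1, 2]
```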


We can always work with lists and then convert to a set when we need to dedupe. But if we choose to work with sets directly, their methods are quite different from those of lists. We `add()` elements to a set, not `append()` them, and we can't access elements by index because sets are unordered, though we can `for...in` loop over them just like lists. We can also check for membership with the `in` operator.

In [78]:
s.add(3) # this has no effect since 3 is already in the set, and no error is raised
s.add(6)
if 6 in s:
    print('6 is in the set now')
s.discard(1) # set.remove(e) raises a KeyError if the element is not in the set, set.discard(e) does not
if 1 not in s:
    print('1 is not in the set now')


6 is in the set now
1 is not in the set now
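
One consequence of sets being unordered is that `list(set(l))` may not keep the original order of the list. If order matters, one common alternative (not a set method, just a related trick) is `dict.fromkeys()`, since dicts preserve insertion order in Python 3.7 and later. The list `visits` below is a made-up example.

```python
visits = ['stacks', 'desk', 'stacks', 'archive', 'desk']

# dict keys are unique and keep first-seen order
deduped_in_order = list(dict.fromkeys(visits))

print(deduped_in_order)  # ['stacks', 'desk', 'archive']
```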


Below, we build a set of all subject headings in a file of MARC records. See the [pymarc](./pymarc.ipynb) notebook for more on working with MARC records.

In [79]:
from pymarc import MARCReader

subjects = set()
with open('assets/100-harvard.mrc', 'rb') as fh:
    reader = MARCReader(fh)
    for record in reader:
        if record:
            subjects.update([field.format_field() for field in record.subjects])
print(f"There were {len(subjects)} unique subjects in the file")
print("First three subjects:", ', '.join(list(subjects)[:3]))

There were 116 unique subjects in the file
First three subjects: American Society of Landscape Architects., Hughes, Langston, 1902-1967., African American authors -- Biography.


## 4. Making Hashable Types for Use with `set`

Dictionaries are very common data structures in Python. What happens if we're trying to dedupe a list of dictionaries using set?

In [80]:
librarians = [
    {"name": "Ranganathan", "age": 55},
    {"name": "Bates", "age": 22},
    {"name": "Lorde", "age": 44},
    {"name": "Berman", "age": 22},
    {"name": "Ranganathan", "age": 55} # duplicate!
]

In [81]:
# this is in a second code block because it throws an error
set(librarians) # TypeError: unhashable type: 'dict'

TypeError: unhashable type: 'dict'

In order for a `set` to deduplicate its members, each member needs a hash method, but the `dict` type lacks one. Dictionaries are mutable, and Python's built-in mutable containers (lists, dicts, sets) are deliberately unhashable: an object's hash is expected to reflect its contents and stay the same for as long as the object lives, and if the contents can change, the hash would have to change too.

Consider our example: suppose dicts could be placed in a set and we then changed the `name` "Berman" to "Bates". Because their ages are also the same, we would suddenly have two identical dictionaries in our set, but a set cannot contain duplicates.

However, it's actually not hard to create a hashable version of the `dict` type for use with `set` using a few built-in Python functions. We take the dict's items, which are `(key, value)` tuples, collect them in a `frozenset` (an immutable, hashable set), and hash that. Using a `frozenset` rather than a tuple means the hash doesn't depend on the order of the keys.
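
A quick way to see which built-in values are hashable is to call `hash()` on them and catch the `TypeError`. The helper `is_hashable` below is a hypothetical convenience function written for this sketch, not something from the standard library.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable((1, 2)))               # True: tuples of hashable items
print(is_hashable(frozenset({1, 2})))    # True: frozensets are immutable
print(is_hashable([1, 2]))               # False: lists are mutable
print(is_hashable({"name": "Bates"}))    # False: dicts are mutable
print(is_hashable((1, [2])))             # False: a tuple is only hashable if its items are
```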

In [None]:
# create a subclass of dict with a __hash__ method
class Hashabledict(dict):
    def __hash__(self):
        return hash(frozenset(self.items()))

hashed_libns = librarians # use a different var so we can rerun the cell
print(f"There are {len(hashed_libns)} librarians before deduplication")
hashed_libns = {Hashabledict(d) for d in hashed_libns}
print(f"There are {len(hashed_libns)} librarians after deduplication")

There are 5 librarians before deduplication
There are 4 librarians after deduplication


This approach will mostly work if we have a bunch of dicts we need to deduplicate, but I want to point out some limitations. Feel free to skip the details.

The values in our dict need to be hashable, too. So, for instance, if we have a nested dict or a list as a value, then the nested elements need to be converted to hashable types. In general, complicated data structures may require a bit more forethought and research into what the `__hash__` and other methods (like `__eq__`, which checks whether two objects are equal) are doing.
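
Two of these limitations can be sketched in a few lines. The example repeats the `Hashabledict` class so it runs on its own; the `degrees` field and the variable names are invented for illustration. The second half shows what can go wrong if a `Hashabledict` is changed after it has been added to a set, which is the scenario that makes plain dicts unhashable in the first place.

```python
class Hashabledict(dict):
    def __hash__(self):
        return hash(frozenset(self.items()))

# Limitation 1: a list value makes the items unhashable
nested = Hashabledict(name="Bates", degrees=["MLS", "PhD"])
try:
    hash(nested)
    nested_ok = True
except TypeError:
    nested_ok = False
print("List value hashable?", nested_ok)  # False

# converting the list to a tuple is one possible fix
fixed = Hashabledict(name="Bates", degrees=("MLS", "PhD"))
print("Tuple value hashable?", isinstance(hash(fixed), int))  # True

# Limitation 2: mutating a member after it is in the set
berman = Hashabledict(name="Berman", age=22)
bates = Hashabledict(name="Bates", age=22)
people = {berman, bates}
berman["name"] = "Bates"  # the set is not told about this change

print(berman == bates)  # True: the two members are now equal...
print(len(people))      # 2: ...but the set still holds both
```

So it seems safest to treat a `Hashabledict` as read-only once it has been placed in a set.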