# Lesson 7: Comprehensions
**Teaching**: 10min<br>
**Exercises**: 10min

## List comprehensions

List comprehensions, along with set and dictionary comprehensions, are compact constructs that can considerably reduce the number of lines of code you need to collect or aggregate data.

Here's an example that constructs a list of squares:

In [1]:
z = [i**2 for i in range(5)]
print(z)

[0, 1, 4, 9, 16]


You can also add a condition, for example keeping only the squares of even numbers:

In [2]:
z = [i**2 for i in range(10) if i % 2 == 0]
print(z)

[0, 4, 16, 36, 64]


The general form is

    [ expression for var in collection if conditional ]
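
To map each part of the general form onto a concrete case, here is a minimal sketch. The `names` list is made up for illustration and is not part of the workshop data.

```python
# Hypothetical sample data, used only for illustration.
names = ['alice', 'Bob', 'carol', 'Dave']

# expression:  name.upper()
# var:         name
# collection:  names
# conditional: name.islower()
shouted = [name.upper() for name in names if name.islower()]
print(shouted)  # ['ALICE', 'CAROL']
```

The conditional is evaluated first for each item, and the expression is only computed for items that pass it.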

Let's start with some simple file parsing and get the lengths of words in the standard Unix `words` file:

In [3]:
from mp_workshop.data import words

words = [word.strip() for word in words]
counts = [len(word) for word in words]
print('there are', len(counts), 'words in the file')
print('the maximum word length is', max(counts))

there are 235886 words in the file
the maximum word length is 24


Now, let's do some simple math and calculate the average and standard deviation of the word lengths. (Note: this is for illustrative purposes. `numpy` has efficient built-in functions for calculating statistics of collections of numbers.)

In [4]:
import math

average = sum(counts) / len(counts)
variance = sum([(x - average)**2 for x in counts]) / len(counts)
stddev = math.sqrt(variance)
print(average, '+/-', stddev)

9.569126612007494 +/- 2.927336978822631
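
The formula above divides by `len(counts)`, so it is the population standard deviation. As a quick cross-check, the sketch below compares the same comprehension-based calculation against Python's standard-library `statistics` module (used here instead of `numpy` only to keep the example dependency-free). The `sample_counts` list is a made-up stand-in for `counts`.

```python
import math
import statistics

# Hypothetical word lengths standing in for `counts`.
sample_counts = [3, 5, 7, 9, 11]

average = sum(sample_counts) / len(sample_counts)
# sum() accepts a generator expression, so no intermediate list is needed.
variance = sum((x - average) ** 2 for x in sample_counts) / len(sample_counts)
stddev = math.sqrt(variance)

# statistics.pstdev is the population standard deviation (divides by N),
# which matches the hand-written formula.
print(average, statistics.mean(sample_counts))
print(stddev, statistics.pstdev(sample_counts))
```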


Finally, let's add in a conditional. Let's get a count of words that are longer than ten letters and that end in 'y':

In [5]:
# Small enough set of data, so can load all at once and keep in memory.
special_count = len([w for w in words
                     if len(w) > 10 and w.endswith('y')])
print(special_count,
      'words are longer than ten letters and end in "y"')

12446 words are longer than ten letters and end in "y"


This type of construction is more compact and often more readable than something like:

In [6]:
count = 0
for w in words:
    if len(w) > 10 and w.endswith('y'):
        count = count + 1
print(count)

12446
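
If only the count is needed, the intermediate list can be skipped by passing a generator expression to `sum`. This is a sketch of that variation, not something the lesson's code relies on, and `sample_words` is a made-up stand-in for `words`.

```python
# Hypothetical stand-in for the `words` list.
sample_words = ['extraordinary', 'happy', 'complementary',
                'dictionary', 'elephant', 'evolutionary']

# Builds a list of matching words, then measures its length.
list_count = len([w for w in sample_words
                  if len(w) > 10 and w.endswith('y')])

# Adds 1 per matching word without ever storing the matches.
gen_count = sum(1 for w in sample_words
                if len(w) > 10 and w.endswith('y'))

print(list_count, gen_count)  # 3 3
```

For a collection this small the difference is negligible; the generator form mainly helps when the matches would be too numerous to keep in memory.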


## Set and dictionary comprehensions

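Set and dictionary comprehensions use the same syntax inside curly braces, and allow us to quickly collect unique values, or keys with associated values, from a collection. The examples below use `crystals`, a list of dictionaries that each describe one crystal.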
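
As a syntax sketch before turning to the workshop data: a set comprehension has the form `{expression for var in collection}`, and a dictionary comprehension has the form `{key: value for var in collection}`. The `records` list here is made up for illustration.

```python
# Hypothetical records, used only to show the syntax.
records = [
    {'name': 'quartz', 'system': 'trigonal'},
    {'name': 'halite', 'system': 'cubic'},
    {'name': 'diamond', 'system': 'cubic'},
]

# Set comprehension: duplicates collapse, leaving the unique values.
systems = {r['system'] for r in records}

# Dictionary comprehension: one key: value pair per record.
system_by_name = {r['name']: r['system'] for r in records}

print(sorted(systems))  # ['cubic', 'trigonal']
print(system_by_name['halite'])  # cubic
```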



In [7]:
from pprint import pprint
from mp_workshop.data import crystals

pprint(crystals[0])

{'chemsys': 'Ca-Ge-Mo-O',
 'elements': ['Ca', 'Ge', 'Mo', 'O'],
 'pretty_formula': 'Ca3Ge3(MoO6)2',
 'spacegroup': {'crystal_system': 'cubic', 'number': 230}}


Let's use a set comprehension to find the unique values of 'spacegroup.crystal_system' among the crystals:

In [8]:
crystal_systems = {c['spacegroup']['crystal_system'] for c in crystals}
print(crystal_systems)

{'tetragonal', 'orthorhombic', 'cubic', 'hexagonal', 'trigonal', 'triclinic', 'monoclinic'}


Let's say we want to re-organize our collection of crystal data to facilitate efficient lookup by (formula, crystal_system). A dictionary comprehension can do this succinctly:

In [9]:
crystals_new = {(c['pretty_formula'], c['spacegroup']['crystal_system']): c
                for c in crystals}
print(crystals_new[('Ag', 'cubic')])

{'spacegroup': {'number': 225, 'crystal_system': 'cubic'}, 'elements': ['Ag'], 'chemsys': 'Ag', 'pretty_formula': 'Ag'}


In [10]:
print('Is there an entry for cubic Au?', ('Au', 'cubic') in crystals_new)

Is there an entry for cubic Au? True
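
One caveat with this kind of re-indexing: a dictionary holds a single value per key, so if two items produce the same key, the later one silently replaces the earlier one. Whether that happens for `crystals` depends on the data; the sketch below uses made-up records (including the `label` field) just to show the behavior.

```python
# Hypothetical records; two of them share the key ('C', 'cubic').
records = [
    {'pretty_formula': 'C', 'crystal_system': 'cubic', 'label': 'first'},
    {'pretty_formula': 'C', 'crystal_system': 'hexagonal', 'label': 'second'},
    {'pretty_formula': 'C', 'crystal_system': 'cubic', 'label': 'third'},
]

by_key = {(r['pretty_formula'], r['crystal_system']): r for r in records}

# Three records went in, but only two distinct keys came out.
print(len(records), len(by_key))  # 3 2
# The last record with a given key wins.
print(by_key[('C', 'cubic')]['label'])  # third
```

Comparing `len(by_key)` with `len(records)` is a cheap way to check whether any entries were overwritten.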


## Exercise: Comprehend alkali metal halide crystals

The `pymatgen` materials analysis library provides an *enumeration* of chemical elements via `Element`. `Element` is iterable, yielding one chemical element at a time, where the object representing each element has some useful properties. For example:

In [11]:
# Newer pymatgen releases no longer expose Element at the package root.
from pymatgen.core.periodic_table import Element

halogens = []
for e in Element:
    if e.is_halogen:
        halogens.append(e)
print(halogens)

[Element F, Element Cl, Element Br, Element I, Element At]


In this exercise, read over the following code to understand what it does, and then see if you can refactor it (along with the above code!) to be more succinct using list comprehensions.

In [12]:
# %load ../code/halide_crystals.py
# Newer pymatgen releases no longer expose Element at the package root.
from pymatgen.core.periodic_table import Element

halogens = []
for e in Element:
    if e.is_halogen:
        halogens.append(e)

halide_systems = []
for h in halogens:
    alkali_metals = []
    for e in Element:
        if e.is_alkali:
            alkali_metals.append(e)
    for m in alkali_metals:
        chemsys = "-".join(sorted([h.symbol, m.symbol]))
        halide_systems.append(chemsys)

halide_crystals = []
for c in crystals:
    if c['chemsys'] in halide_systems:
        halide_crystals.append(c)

formulae = []
for c in halide_crystals:
    formulae.append(c['pretty_formula'])

print(formulae)

['RbI', 'KBr', 'KI', 'RbI3', 'KF3', 'KF', 'NaBr', 'RbF2', 'CsI4', 'KF2', 'LiBr', 'LiCl', 'NaI', 'CsI3', 'CsF', 'LiF', 'RbCl', 'CsI', 'LiI', 'RbBr', 'NaCl', 'KCl', 'CsBr3', 'CsBr', 'CsCl', 'RbF', 'NaF', 'RbF3']


Note for the adventurous: you can nest `for` headers in a comprehension, e.g.

    [expression
     for var1 in collection1
     for var2 in collection2
     if conditional]

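produces the same list as the following loop, where `result` is a name introduced here for the list being built:

```python
result = []
for var1 in collection1:
    for var2 in collection2:
        if conditional:
            result.append(expression)
```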
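
Here is a small runnable sketch of the nested form next to its loop equivalent. The `suits` and `ranks` lists are made up for illustration and are unrelated to the exercise data.

```python
# Hypothetical sample data.
suits = ['hearts', 'spades']
ranks = ['J', 'Q', 'K']

# The `for` clauses read left to right, outermost loop first.
pairs = [(r, s)
         for s in suits
         for r in ranks
         if r != 'Q']

# The same result built with explicit nested loops.
loop_pairs = []
for s in suits:
    for r in ranks:
        if r != 'Q':
            loop_pairs.append((r, s))

print(pairs)
# [('J', 'hearts'), ('K', 'hearts'), ('J', 'spades'), ('K', 'spades')]
print(pairs == loop_pairs)  # True
```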