# Python Sets
Python has a data type called a set. Sets are unordered collections of unique elements. They are similar to lists and tuples, but with a few differences that may not seem helpful until you see how many kinds of problems they solve. You may not use them as often as lists or tuples, but knowing they exist can greatly simplify the cases where they excel.

Sets are part of standard Python. There is no need to import anything to use them.

To create a set, pass any iterable (a list, tuple, array, or some other container) to the set() function. All values in a set will be unique.

In [1]:
import numpy as np
numbers = [1, 2, 3, 4, 5, 4, 3]  # Notice the values repeat
a = set(numbers)  # Create set from list
b = set((4, 5, 6, 7, 4, 5))  # Create set from tuple
c = set(np.arange(6, 11))  # Create set from numpy array

print(type(numbers), numbers)
print(type(a), a)  # Notice how it prints the values in curly brackets.

<class 'list'> [1, 2, 3, 4, 5, 4, 3]
<class 'set'> {1, 2, 3, 4, 5}
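
Sets can also be written directly, without calling set(). The short sketch below shows the curly-bracket literal and a set comprehension, plus one common pitfall: empty curly brackets create a dictionary, not a set. The variable names are only for illustration.

```python
# Set literal: curly brackets around comma-separated values.
literal = {1, 2, 3, 4, 5, 4, 3}

# Set comprehension: duplicates produced by the expression collapse.
squares = {n * n for n in range(-3, 4)}

# Careful: {} is an empty dict, so use set() for an empty set.
empty_set = set()
empty_dict = {}

print(literal == set([1, 2, 3, 4, 5]))
print(sorted(squares))
print(type(empty_set).__name__, type(empty_dict).__name__)
```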


We can check for the existence of a value in a set with the same notation used for lists.

In [2]:
1 in numbers  # Does the value 1 exist in the list values?

True

In [3]:
1 in a  # Does the value 1 exist in the set a?

True

In [4]:
6 not in a  # Does the value 6 not exist in a?

True

In [5]:
len(a)  # The size of set a

5

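But here is where sets excel. Since they contain only unique values, we can pass in a list and get the unique values back. The elements can be any hashable (in practice, immutable) object, including numbers, strings, and tuples of hashable values. Mutable containers such as lists, dictionaries, and NumPy arrays cannot be stored as elements of a set.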

In [6]:
values = [1, 2, "three", "four", 2, 1, 3, "four", (1, 2)]
set(values)

{(1, 2), 1, 2, 3, 'four', 'three'}
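
What a set can hold comes down to whether a value is hashable. Here is a minimal sketch of how to check that. The helper `can_be_set_element` is a made-up name for this example, not something built into Python.

```python
def can_be_set_element(value):
    """Return True if value is hashable and so can be stored in a set.

    Hypothetical helper written for this example only.
    """
    try:
        hash(value)
    except TypeError:
        return False
    return True


print(can_be_set_element("three"))      # strings are hashable
print(can_be_set_element((1, 2)))       # so are tuples of hashable items
print(can_be_set_element([1, 2]))       # lists are mutable, so not hashable
print(can_be_set_element((1, [2, 3])))  # a tuple holding a list is not hashable

# frozenset is the immutable, hashable counterpart of set,
# so it can be stored inside another set.
nested = {frozenset([1, 2]), frozenset([2, 1]), frozenset([3])}
print(len(nested))
```

The two frozensets built from [1, 2] and [2, 1] are equal, so the outer set keeps only one of them.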

In [7]:
a = set([1, 2, 3, 4])
b = set([3, 4, 5, 6])

print('union:', a.union(b))
print('intersection:', a.intersection(b))
print('difference a - b:', a.difference(b))
print('difference b - a:', b.difference(a))
print('symmetric_difference:', a.symmetric_difference(b))

union: {1, 2, 3, 4, 5, 6}
intersection: {3, 4}
difference a - b: {1, 2}
difference b - a: {5, 6}
symmetric_difference: {1, 2, 5, 6}


We can also check if one set is contained within a different set.

In [8]:
a = set(['one', 'two', 'three', 'four'])
b = set(['two', 'three'])
b.issubset(a)

True

Or whether a set contains every element of another set.

In [9]:
a.issuperset(b)

True

In [10]:
b.issuperset(a)

False

Adding and removing values requires special methods. The methods below modify the set in place rather than returning an updated copy.

In [11]:
a.add('five')
a

{'five', 'four', 'one', 'three', 'two'}

In [12]:
a.update(['six', 'seven'])
a

{'five', 'four', 'one', 'seven', 'six', 'three', 'two'}

In [13]:
a.remove('seven')  # Will raise error if 'seven' is not in the set.
a

{'five', 'four', 'one', 'six', 'three', 'two'}

In [14]:
a.discard('seven')  # Works fine even though 'seven' is not in the set.

We can ask that the set be updated to keep only the values that are in both sets.
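
To see the difference between remove() and discard() without stopping the notebook, the error can be caught. This is a small sketch using a fresh set, so it does not depend on the state of "a" above.

```python
a = {'one', 'two', 'three'}

try:
    a.remove('seven')        # 'seven' is absent, so this raises KeyError
except KeyError as err:
    print('remove raised KeyError:', err)

result = a.discard('seven')  # an absent value is silently ignored
print(result)                # discard() works in place and returns None
print(a == {'one', 'two', 'three'})
```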

In [15]:
a = set([1, 2, 3, 4])
b = set([3, 4, 5, 6])
a.intersection_update(b)
a

{3, 4}

We can also remove specific values from a set. Notice that the values passed to the method are not a set; any iterable works, and it is handled automatically. However, "a" itself must be a set. Calling the method on a list raises an error, because lists do not have it.

In [16]:
a = set([1, 2, 3, 4])
a.difference_update([3, 4, 5, 6])
a

{1, 2}
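
The sketch below illustrates both halves of that statement. The in-place method accepts any iterable (even several at once), while a list has no such method. The name `not_a_set` is only for illustration.

```python
a = {1, 2, 3, 4}
a.difference_update([3, 4, 5, 6])          # a list argument is fine
a.difference_update((1,), range(10, 20))   # several iterables at once
print(a)

not_a_set = [1, 2, 3, 4]
try:
    not_a_set.difference_update([3, 4])    # lists do not have this method
except AttributeError as err:
    print('AttributeError:', err)
```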

Or update the set to keep only the values that appear in one of the two collections but not both.

In [17]:
a = set([1, 2, 3, 4])
b = [3, 4, 5, 6]
a.symmetric_difference_update(b)
a

{1, 2, 5, 6}

## OK neat tricks but why do I care?
Here are two examples that do the same thing, one with lists and one with sets. Notice how much cleaner the set version is.

In [18]:
a = [1, 2, 3, 4, 5]
b = [2, 3, 4]
c = []
for ii in b:
    if ii in a:
        c.append(ii)
c

[2, 3, 4]

In [19]:
a = [1, 2, 3, 4, 5]
b = [2, 3, 4]
c = list(set(a).intersection(b))
c

[2, 3, 4]
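
One caveat: sets are unordered, so the order of list(set(a).intersection(b)) is not guaranteed. If the order of "b" matters, one possible approach (a sketch, not the only way) is to use a set purely for the membership test and keep the list for ordering. The name `lookup` is made up for this example.

```python
a = [5, 1, 4, 2, 3]
b = [4, 3, 9, 2]

lookup = set(a)                       # a set only for membership tests
c = [ii for ii in b if ii in lookup]  # result keeps the order of b
print(c)
```

Unlike the pure set version, this keeps duplicates if "b" contains any.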

Here is an example where we add and remove values. Notice that we need to catch an exception when we use the list, because trying to remove a value that is not in the list raises a ValueError. That is not a problem with the set's discard() method.

In [20]:
a = [1, 2, 3, 4, 5]
a.append(7)
try:
    a.remove(6)
except ValueError:
    pass

a

[1, 2, 3, 4, 5, 7]

In [21]:
a = set([1, 2, 3, 4, 5])
a.add(7)
a.discard(6)
a = list(a)
a

[1, 2, 3, 4, 5, 7]

There is a second, operator-based syntax for the methods described above. For two sets both give the same results, so choose the way that works for you.

In [22]:
a = set([1, 2, 3, 4])
b = set([3, 4, 5, 6])

print('union:', a | b)
print('intersection:', a & b)
print('difference a - b:', a - b)
print('symmetric_difference:', a ^ b)

union: {1, 2, 3, 4, 5, 6}
intersection: {3, 4}
difference a - b: {1, 2}
symmetric_difference: {1, 2, 5, 6}
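
The two syntaxes differ in one respect: the methods accept any iterable, while the operators require a set on both sides. The sketch below shows this, along with the operator forms of the subset and superset tests.

```python
a = {1, 2, 3, 4}
b = [3, 4, 5, 6]        # a list, not a set

print(a.union(b))       # methods accept any iterable

try:
    a | b               # operators need a set on both sides
except TypeError as err:
    print('TypeError:', err)

print(a | set(b))       # converting the list first works

# Operator forms of issubset / issuperset, plus a strict-subset test.
print({3, 4} <= a, a >= {3, 4}, {3, 4} < a)
```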


Here is a typical example of how sets simplify combining groups into a unique list.

In [23]:
engineers = set(['John', 'Jane', 'Jack', 'Janice', 'Tim'])
programmers = set(['Jack', 'Sam', 'Susan', 'Janice', 'Tim'])
managers = set(['Jane', 'Jack', 'Susan', 'Zack', 'Tim'])

In [24]:
print('union:', engineers | programmers | managers)
print('intersection:', engineers & managers & programmers)
print('difference:', managers - engineers - programmers)

union: {'Sam', 'Jane', 'Susan', 'Janice', 'John', 'Jack', 'Zack', 'Tim'}
intersection: {'Jack', 'Tim'}
difference: {'Zack'}


Depending on the size of the collections being compared, searching a set may be much quicker. Checking whether a value is in a list scans the list element by element, while a set uses hashing to find it directly. One downside to using sets is that you may need a list in the end, so you have to convert to a set, do the work, and convert back to a list. The conversion is usually quick enough that you will not notice it, but you will need to remember to convert.

In [25]:
import random

a = list(range(0, 100000))
b = random.choices(a, k=100)

In [26]:
%%timeit

c = [ii for ii in b if ii in a]

45.6 ms ± 2.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [27]:
%%timeit

c = list(set(b) & set(a))

2.14 ms ± 348 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
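
The %%timeit magic only works inside IPython or Jupyter. As a rough sketch, a similar comparison can be run in plain Python with the standard timeit module. Here the set is built once, outside the timed code, so only the membership tests are measured. The timings depend on the machine, so none are shown.

```python
import random
import timeit

a = list(range(0, 100000))
b = random.choices(a, k=100)
a_set = set(a)  # convert once, outside the timed code

# Time 5 runs of each membership-based filter.
list_time = timeit.timeit(lambda: [ii for ii in b if ii in a], number=5)
set_time = timeit.timeit(lambda: [ii for ii in b if ii in a_set], number=5)

print(f'list membership: {list_time:.4f} s for 5 runs')
print(f'set membership:  {set_time:.4f} s for 5 runs')
```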


Other common functions that work on lists also work on sets.

In [28]:
a = set([20, 1, 2, 3, 4, 0])
b = set([1, 2, 3, 4, 0, 20])

len(a), max(a), min(a), sorted(a), sum(a), a == b

(6, 20, 0, [0, 1, 2, 3, 4, 20], 30, True)
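
One thing that does not carry over from lists is indexing. Because a set is unordered, it has no positions to index or slice. A minimal sketch of what happens, and one way around it:

```python
a = {20, 1, 2, 3, 4, 0}

try:
    a[0]                      # sets have no positions to index
except TypeError as err:
    print('TypeError:', err)

first_three = sorted(a)[:3]   # sort into a list first, then slice
print(first_three)
```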