# Set Routines

In your daily work as a data scientist you often need to work with sets. Although Python's built-in set class is sufficient in many cases, there are still situations where your workflow is based on NumPy arrays, and it is then more convenient to use NumPy's specialized set functions instead. In the following I will demonstrate the most central set routines in NumPy, taken from the official NumPy documentation page https://numpy.org/doc/stable/reference/routines.set.html.

In [1]:
# Import numpy
import numpy as np

In [2]:
# Create two sample arrays
x = np.array([1, 1, 2, 3, 3, 4, 5, 5, 5, 6, 7])
y = np.array([1, 1, 4, 4, 5, 5, 5, 8, 8, 9, 10])

In [3]:
# Give intersection: All elements that are in both sets
np.intersect1d(x, y)

array([1, 4, 5])
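
The result comes back sorted and without duplicates, which matches what Python's built-in set would give for the same data. As a small sketch (the names `common`, `builtin`, `x_idx`, and `y_idx` are my own), the optional `return_indices=True` argument additionally reports where each common value first occurs in the two inputs:

```python
import numpy as np

x = np.array([1, 1, 2, 3, 3, 4, 5, 5, 5, 6, 7])
y = np.array([1, 1, 4, 4, 5, 5, 5, 8, 8, 9, 10])

# Same values as the built-in set intersection, but as a sorted array
common = np.intersect1d(x, y)
builtin = sorted(set(x.tolist()) & set(y.tolist()))

# Also return the index of the first occurrence of each common value
common_again, x_idx, y_idx = np.intersect1d(x, y, return_indices=True)

print(common, builtin)
print(x_idx, y_idx)
```

Indexing the inputs with `x_idx` and `y_idx` gives back the common values, which is handy when the arrays are keys into other data.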

In [4]:
# Give union: All elements that are in either of the sets
np.union1d(x, y)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
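
`np.union1d` takes exactly two arrays. For more than two, one option is to chain the calls with `functools.reduce`. The sketch below assumes a third array `z` that is not part of the example above and is only there for illustration:

```python
from functools import reduce

import numpy as np

x = np.array([1, 1, 2, 3, 3, 4, 5, 5, 5, 6, 7])
y = np.array([1, 1, 4, 4, 5, 5, 5, 8, 8, 9, 10])
z = np.array([0, 5, 11])  # hypothetical third array, for illustration only

# reduce applies union1d pairwise from left to right:
# union1d(union1d(x, y), z)
union_all = reduce(np.union1d, (x, y, z))
print(union_all)
```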

In [5]:
# Give difference: All elements that are in x but not in y
np.setdiff1d(x, y)

array([2, 3, 6, 7])
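
Unlike intersection and union, the difference depends on the order of the arguments. A short sketch with the same sample arrays (the names `only_in_x` and `only_in_y` are my own):

```python
import numpy as np

x = np.array([1, 1, 2, 3, 3, 4, 5, 5, 5, 6, 7])
y = np.array([1, 1, 4, 4, 5, 5, 5, 8, 8, 9, 10])

# Elements of the first argument that do not appear in the second
only_in_x = np.setdiff1d(x, y)
only_in_y = np.setdiff1d(y, x)

print(only_in_x)
print(only_in_y)
```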

In [6]:
# Give symmetric difference: All elements that are in exactly one of the two sets
np.setxor1d(x, y)

array([ 2,  3,  6,  7,  8,  9, 10])
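
The symmetric difference can also be read as "the union without the intersection". The following sketch checks that identity on the sample arrays; the names `sym_diff` and `union_minus_common` are my own:

```python
import numpy as np

x = np.array([1, 1, 2, 3, 3, 4, 5, 5, 5, 6, 7])
y = np.array([1, 1, 4, 4, 5, 5, 5, 8, 8, 9, 10])

sym_diff = np.setxor1d(x, y)

# Everything in either array, minus what the two arrays share
union_minus_common = np.setdiff1d(np.union1d(x, y), np.intersect1d(x, y))

print(np.array_equal(sym_diff, union_minus_common))
```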

In [None]:
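# Performance comparison: numpy set routines vs. Python's built-in set
x = np.random.randint(0, 10000, 10000)
y = np.random.randint(0, 10000, 10000)
# Remove duplicates up front so that assume_unique=True is valid
x_unique = np.unique(x)
y_unique = np.unique(y)
# Equivalent built-in sets for comparison
x_set = set(x)
y_set = set(y)
# %timeit is an IPython magic, so this cell only runs in IPython/Jupyter
%timeit np.intersect1d(x, y)
%timeit np.intersect1d(x_unique, y_unique, assume_unique=True)
%timeit x_set.intersection(y_set)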
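
If you want to run the same comparison outside of IPython, a rough equivalent can be written with the standard-library `timeit` module. This is only a sketch: the fixed seed, the `candidates` dictionary, and the choice of 100 repetitions are my own assumptions, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
x = rng.integers(0, 10000, 10000)
y = rng.integers(0, 10000, 10000)
x_unique = np.unique(x)
y_unique = np.unique(y)
x_set = set(x)
y_set = set(y)

# The three approaches from the cell above, wrapped as callables
candidates = {
    "intersect1d": lambda: np.intersect1d(x, y),
    "intersect1d (assume_unique)": lambda: np.intersect1d(
        x_unique, y_unique, assume_unique=True
    ),
    "set.intersection": lambda: x_set.intersection(y_set),
}

# Check that all three approaches agree before timing them
reference = np.intersect1d(x, y).tolist()
unique_result = np.intersect1d(x_unique, y_unique, assume_unique=True).tolist()
set_result = sorted(x_set & y_set)
print(unique_result == reference, set_result == reference)

n_runs = 100  # arbitrary number of repetitions
for name, func in candidates.items():
    seconds = timeit.timeit(func, number=n_runs)
    print(f"{name}: {seconds / n_runs * 1e6:.1f} us per call")
```

Note that the timings exclude the cost of building `x_unique`, `y_unique`, and the sets, just as in the `%timeit` cell, so they only compare the intersection step itself.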