
# Numpy Boolean Masks

In [2]:
import numpy as np
import random

### Why are we covering this topic?

#### Historically, students have had difficulty understanding the boolean mask concept when asked to use it in homework and exam exercises. So we introduce it here, to help with that understanding. We will not be doing complex exercises here, but simply introducing the concept.

Additionally, there are many instances in real-world data analysis in which we want to return only the data elements that meet (or don't meet) some condition. So understanding what these masks are and how they are used is a necessary skill for DA/DS roles.

Finally, most (though not all) of the previous MT2 and Final Exams asked students to do some form of selection from either a pandas DataFrame or a numpy array, and applying a boolean mask to the data structure is usually the most appropriate way to meet that requirement.

# Numpy boolean masks

### What is a boolean mask?

**In pandas, as we saw earlier, a mask is used to filter and return only the rows that meet a certain condition.**

The result contains only the rows for which the masking condition is `True`.

For a good example and review, see this link, down toward the bottom, where it discusses Boolean Masks:  https://www.geeksforgeeks.org/boolean-indexing-in-pandas/
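As a quick refresher, here is a minimal sketch of the pandas behavior described above. The DataFrame, its column names, and the threshold of 70 are made up for illustration; they do not come from the course materials.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({"name": ["Ann", "Bob", "Cy", "Dee"],
                   "score": [91, 67, 78, 55]})

# The comparison produces a boolean Series: one True/False per row
mask = df["score"] >= 70

# Indexing the DataFrame with the mask keeps only the rows where it is True
passed = df[mask]
print(passed)
```

Only the rows for Ann and Cy remain, because those are the rows where the mask is `True`.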

**In numpy however, a mask creates a "truth array" of the same shape as the source array being compared.**

Each element in the "truth array" will have a value of either `True` or `False`, depending on the result of the comparison on that element in the source array.

You can then use the truth array to filter/select only the source array elements that you need in your exercise.

In [3]:
# create a 3x3 numpy array of random integers from 0 to 9
a = np.random.randint(0,10,(3,3))
display(a)

array([[4, 7, 1],
       [7, 0, 5],
       [9, 4, 2]])

In [4]:
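# the comparison builds the truth array (not displayed: a cell only shows its last expression)
a < 5
# indexing the source array with the truth array selects the matching values
a[a<5]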

array([4, 1, 0, 4, 2])

In [5]:
# using parentheses for readability
truth_less_than_five = (a < 5)
display(truth_less_than_five)
display(truth_less_than_five.dtype)

array([[ True, False,  True],
       [False,  True, False],
       [False,  True,  True]])

dtype('bool')

In [6]:
# using parentheses for readability
truth_greater_equal_five = (a >= 5)
display(truth_greater_equal_five)
display(truth_greater_equal_five.dtype)

array([[False,  True, False],
       [ True, False,  True],
       [ True, False, False]])

dtype('bool')

### So now that we have the truth array, what can we do with it?

1. We can return the values in the source array that meet (or do not meet) the truth condition. We do this by addressing the source array directly, with the truth condition itself.

    -- To select these values from the array, we simply index on the Boolean array; this is known as a masking operation.

2. We can return the (row,column) locations within the source array that meet (or do not meet) the truth condition. We do this using the numpy function `np.where()`.

#### Let's look at an example, first scenario.

In [7]:
a

array([[4, 7, 1],
       [7, 0, 5],
       [9, 4, 2]])

In [8]:
# select the values in the array that meet the condition.
a[a < 5]

array([4, 1, 0, 4, 2])

In [9]:
# select the values in the array that meet the condition.
a[truth_less_than_five]

array([4, 1, 0, 4, 2])

What is returned is a one-dimensional array filled with all the values that meet this condition; in other words, all the values in positions at which the mask array is `True`.
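The short sketch below illustrates both points: the result is one-dimensional, and the "do not meet" case can be handled by flipping the mask with `~`. It hard-codes the array values shown in the output above so that it can run on its own; the variable names `selected` and `rejected` are illustrative.

```python
import numpy as np

# Same values as the example array, fixed so the results are repeatable
a = np.array([[4, 7, 1],
              [7, 0, 5],
              [9, 4, 2]])

mask = a < 5

selected = a[mask]    # values where the mask is True
rejected = a[~mask]   # ~ flips every True/False, so this selects the rest

print(selected.shape)  # (5,)
print(selected)        # [4 1 0 4 2]
print(rejected)        # [7 7 5 9]
```

Note that the values come back in row-by-row order, and the original 3x3 shape is not preserved.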

**We can then use these values as required in the exercise. This is the key takeaway here. When we want to filter/return only the values that meet some criteria, we want to use a Boolean Mask to do so.**
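As one possible illustration of "using these values", the sketch below counts, sums, and averages the selected elements, and then uses a mask on the left side of an assignment. The array values are hard-coded to match the example output above, and the replacement value of 5 is an arbitrary choice for the demonstration.

```python
import numpy as np

a = np.array([[4, 7, 1],
              [7, 0, 5],
              [9, 4, 2]])
mask = a < 5

count = mask.sum()        # True counts as 1, so this counts the matches
total = a[mask].sum()     # sum of only the selected values
average = a[mask].mean()  # mean of only the selected values
print(count, total, average)

# A mask also works on the left side of an assignment:
# every value that is NOT less than 5 is overwritten with 5
capped = a.copy()         # copy so the original array is left untouched
capped[~mask] = 5
print(capped)
```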

#### What about the second scenario, in which we want to return the (row,column) locations of the data elements that meet (or do not meet) the condition?

We can use `np.where()` for this.

This function can take up to three arguments (a condition and two optional replacement values), and it has many uses. For our purposes, we are going to show a very simple way to use it. Students should review the links below for additional usages.

https://numpy.org/doc/2.2/reference/generated/numpy.where.html

https://www.geeksforgeeks.org/numpy-where-in-python/#using-numpywhere-with-x-and-y

#### When you use np.where() with a 2D array and only a condition (no replacement values), it returns a tuple of arrays.

#### The first array in the tuple contains the row indices where the condition is true, and the second array in the tuple contains the corresponding column indices.

In [10]:
a

array([[4, 7, 1],
       [7, 0, 5],
       [9, 4, 2]])

In [11]:
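# three-argument form: keep a's value where a<5, otherwise substitute 'silly' (the result is a string array)
np.where(a<5,a,'silly')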

array([['4', 'silly', '1'],
       ['silly', '0', 'silly'],
       ['silly', '4', '2']], dtype='<U21')
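In the output above, every element became a string because the replacement value `'silly'` is a string, and a numpy array holds a single data type. A minimal sketch of the same three-argument call with a numeric replacement is shown below; the value `-1` is an arbitrary placeholder chosen for illustration, and the array values are hard-coded to match the example.

```python
import numpy as np

a = np.array([[4, 7, 1],
              [7, 0, 5],
              [9, 4, 2]])

# Keep a's value where the condition is True, otherwise use -1
replaced = np.where(a < 5, a, -1)

print(replaced)
print(replaced.shape)  # same shape as a: (3, 3)
```

Unlike masking with `a[a < 5]`, this form keeps the original shape, and because both choices are integers the result stays an integer array.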

In [13]:
# show the tuple
return_tuple = np.where(a<5)
return_tuple

(array([0, 0, 1, 2, 2]), array([0, 2, 1, 1, 2]))

#### We know that the tuple contains the row indices and the column indices, so we can unpack them into their own variables.

In [14]:
rows, columns = np.where(a<5)
print("Row indices:", rows)
print("Column indices:", columns)

Row indices: [0 0 1 2 2]
Column indices: [0 2 1 1 2]


#### To return the values, address the array with the row and column indices.

In [15]:
selected_elements = a[rows,columns]
print("Selected elements:", selected_elements)

Selected elements: [4 1 0 4 2]
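If it is easier to read the locations as (row, column) pairs, the two index arrays can be paired up. The sketch below shows two ways this might be done: `zip` on the unpacked arrays, and `np.argwhere`, which returns one row per matching location. The array values are hard-coded to match the example, and the names `coords` and `coords_array` are illustrative.

```python
import numpy as np

a = np.array([[4, 7, 1],
              [7, 0, 5],
              [9, 4, 2]])

rows, columns = np.where(a < 5)

# Pair each row index with its column index
coords = list(zip(rows.tolist(), columns.tolist()))
print(coords)  # [(0, 0), (0, 2), (1, 1), (2, 1), (2, 2)]

# np.argwhere gives the same locations as a 2D array, one (row, column) per line
coords_array = np.argwhere(a < 5)
print(coords_array.shape)  # (5, 2)
```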


#### Using the truth array from above.

In [16]:
truth_less_than_five

array([[ True, False,  True],
       [False,  True, False],
       [False,  True,  True]])

In [17]:
rows1, columns1 = np.where(truth_less_than_five)
print("Row indices:", rows1)
print("Column indices:", columns1)

Row indices: [0 0 1 2 2]
Column indices: [0 2 1 1 2]


In [18]:
selected_elements1 = a[rows1,columns1]
print("Selected elements1:", selected_elements1)

Selected elements1: [4 1 0 4 2]


**You will see applications of these ideas in the homework notebooks and sample/practice midterm and final exam notebooks.**

The numpy documentation has an excellent reference on the logic and functions you can use when applying Boolean Masks:  https://numpy.org/doc/stable/reference/routines.logic.html
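As a small taste of what that reference covers, the sketch below combines conditions into a single mask and checks whole-array conditions. It assumes the same array values as the example above; the variable names are illustrative.

```python
import numpy as np

a = np.array([[4, 7, 1],
              [7, 0, 5],
              [9, 4, 2]])

# Elementwise AND: each comparison needs its own parentheses
between = (a > 2) & (a < 8)
print(a[between])   # [4 7 7 5 4]

# np.logical_and builds the same mask as the & operator
same = np.logical_and(a > 2, a < 8)

# Elementwise OR
extremes = (a == 0) | (a == 9)
print(a[extremes])  # [0 9]

# np.any / np.all reduce a truth array to a single True or False
print(np.any(a > 8))   # True
print(np.all(a >= 0))  # True
```

The parentheses matter here: `&` and `|` bind more tightly than the comparison operators, so leaving them out changes the meaning of the expression.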

As noted previously, VanderPlas has a good introduction to Boolean Masks in his book:  https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html

### What are your questions on Boolean Masks in Numpy?