# Filtering Data in Numpy

### Introduction

Now that we know how to work with multidimensional arrays, the next step is to learn how to filter them by some criteria.  Let's check it out.

### Broadcasting with Boolean Operators

To start, let's introduce another way of selecting entries from an array in numpy.  We'll first initialize a new array of the values zero through four.

In [0]:
import numpy as np
range_nums = np.arange(5)
range_nums

array([0, 1, 2, 3, 4])

> Note that if we just pass the number 5 to `arange` numpy still returns 5 items, starting from zero.

Ok, so if we want to select every other item from the array above, one way is to create an array of alternating `True` and `False` values...

In [0]:
bool_vals = np.array([True, False, True, False, True])
bool_vals

array([ True, False,  True, False,  True])

And then use this boolean array to select items from our `range_nums` array above.

In [0]:
range_nums[bool_vals]

array([0, 2, 4])

> Let's see that again.

In [0]:
bool_vals = np.array([True, False, True, False, True])
arr = np.array([0, 1, 2, 3, 4])

arr[bool_vals]

array([0, 2, 4])

So we provide a `True` or `False` value for each index to indicate whether to select the item at that index.

This procedure is called **boolean indexing**.
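
To make the mechanics concrete, here is a minimal sketch showing that a boolean array picks exactly the positions where it is `True`. The names `mask`, `positions` and `same_selection` are introduced here for illustration only; `np.flatnonzero` is used to list the indices of the `True` entries.

```python
import numpy as np

arr = np.arange(5)
# Illustrative mask: True marks the positions we want to keep
mask = np.array([True, False, True, False, True])

# Boolean indexing keeps the items whose mask entry is True
selected = arr[mask]

# The same selection written with explicit integer positions
positions = np.flatnonzero(mask)   # indices where mask is True
same_selection = arr[positions]

print(selected)        # [0 2 4]
print(positions)       # [0 2 4]
print(same_selection)  # [0 2 4]
```

In this sketch the mask has one entry per item in `arr`, which is the shape the lesson uses throughout.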

### Boolean Indexing Rows

Notice that we can also use **boolean indexing** to select whole rows of a two-dimensional array.  Let's first initialize some data that makes this clear.  

> We initialize a new array where every item in a row equals that row's index, so the values increase from one row to the next.  The `np.indices` function, which is not that important here, accomplishes this.

In [0]:
grid = np.indices([5,5])[0]
grid

array([[0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4]])

Then we can use our boolean array to select every other row from the above array.

In [0]:
bool_vals

array([ True, False,  True, False,  True])

In [0]:
grid[bool_vals]

array([[0, 0, 0, 0, 0],
       [2, 2, 2, 2, 2],
       [4, 4, 4, 4, 4]])
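
As a brief aside, the same boolean array can also be applied along the second axis. The sketch below is an illustration rather than part of the lesson's flow; `col_grid` is a hypothetical array, built with `np.indices([5, 5])[1]`, in which every column is filled with its own column number so that the effect on columns is visible.

```python
import numpy as np

bool_vals = np.array([True, False, True, False, True])

# np.indices([5, 5])[1] holds each item's column index,
# so every row reads 0, 1, 2, 3, 4
col_grid = np.indices([5, 5])[1]

# A boolean array in the first position filters rows
rows_kept = col_grid[bool_vals]

# A boolean array after `:` filters columns instead
cols_kept = col_grid[:, bool_vals]

print(rows_kept.shape)  # (3, 5)
print(cols_kept.shape)  # (5, 3)
print(cols_kept[0])     # [0 2 4]
```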

### Onto Filtering Data

Now, what does this have to do with filtering data, you ask?  Well, just like in ordinary Python, we can use comparison operators such as `<` and `==` on an array in numpy, and the result is an array of boolean values.

Let's see this on a grid of numbers one through twenty-five.

In [0]:
increasing_grid = np.arange(1, 26).reshape(5, 5)
increasing_grid

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])

Now let's return `True` or `False` for each row based on whether its second item is less than 10.  Notice that only the first two rows meet this criterion.

We can use broadcasting to see this.

In [0]:
new_bools = increasing_grid[:, 1] < 10
new_bools

array([ True,  True, False, False, False])

And then we can use these values to select the appropriate rows.

In [0]:
increasing_grid[new_bools]

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

And this just returns us the first two rows.
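
For comparison, the following sketch spells out what the broadcast comparison does by writing the same check as a plain Python loop over the rows. The names `loop_bools` and `broadcast_bools` are illustrative and not part of the lesson.

```python
import numpy as np

increasing_grid = np.arange(1, 26).reshape(5, 5)

# Plain Python: test the second item of each row, one row at a time
loop_bools = [bool(row[1] < 10) for row in increasing_grid]

# NumPy: the comparison is applied to every item of the column at once
broadcast_bools = increasing_grid[:, 1] < 10

print(increasing_grid[:, 1])                   # [ 2  7 12 17 22]
print(broadcast_bools)                         # [ True  True False False False]
print(broadcast_bools.tolist() == loop_bools)  # True
```

Both approaches produce the same five answers; the broadcast form simply avoids writing the loop.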

> But generally, we will see all of this in one step:

In [0]:
increasing_grid

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])

In [0]:
increasing_grid[increasing_grid[:, 1] < 10]

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

This says select the rows where the second item is less than 10.
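
When the condition is needed more than once, it may help to give it a name. The sketch below assumes a hypothetical variable `second_item_small` and also uses the `~` operator, which flips every `True` to `False` and vice versa in a boolean array, to select the rows that do not match.

```python
import numpy as np

increasing_grid = np.arange(1, 26).reshape(5, 5)

# Name the condition once so it can be reused
second_item_small = increasing_grid[:, 1] < 10

matching = increasing_grid[second_item_small]       # rows that pass
not_matching = increasing_grid[~second_item_small]  # ~ flips True and False

print(matching.shape)      # (2, 5)
print(not_matching.shape)  # (3, 5)
print(not_matching[:, 1])  # [12 17 22]
```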

* Your turn

Try to select all of the rows where the last element is 20.

In [0]:
increasing_grid[increasing_grid[:, -1] == 20]

array([[16, 17, 18, 19, 20]])

This should return only the second-to-last row.

### Filtering even or odd

Next let's review filtering by only selecting the rows that have an even number in the third column.

> First we'll give you an opportunity to try this yourself.

In [0]:
import numpy as np
increasing_grid = np.arange(1, 26).reshape(5, 5)
increasing_grid

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])

> Doing so requires knowledge of `%`, the modulo operator, which returns the *remainder* of a division.  For example, below is the remainder of dividing each item in the above array by 2.

In [0]:
increasing_grid % 2

array([[1, 0, 1, 0, 1],
       [0, 1, 0, 1, 0],
       [1, 0, 1, 0, 1],
       [0, 1, 0, 1, 0],
       [1, 0, 1, 0, 1]])
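
As a small warm-up, here is a sketch of `%` on plain Python integers and on a one-dimensional array. The array `values` is made up for illustration, and `np.mod` is shown only as the function form of the same operation.

```python
import numpy as np

# On plain Python integers, % gives the remainder after division
print(7 % 2)   # 1, so 7 is odd
print(8 % 2)   # 0, so 8 is even

# On an array, the same operator is applied to every item
values = np.array([10, 11, 12, 13])
remainders = values % 2
print(remainders)                                     # [0 1 0 1]
print(np.array_equal(remainders, np.mod(values, 2)))  # True
```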

Ok, now try using modulo to select only the rows that have an even number in the third column. 

> Embrace the struggle...we'll show you the answer below.

In [0]:
increasing_grid

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])

### The answer

1. Select the third column across all of the rows

In [0]:
increasing_grid[:, 2]

array([ 3,  8, 13, 18, 23])

2. Then use broadcasting to query each item in that column

In [0]:
increasing_grid[:, 2] % 2

array([1, 0, 1, 0, 1])

> We use the modulo operator `%`, which here returns the remainder of dividing each item by 2.  So `3 % 2` is `1`.  

> Where `% 2 == 0` is `True`, our element is even.

In [0]:
increasing_grid[:, 2] % 2 == 0

array([False,  True, False,  True, False])

3. Use boolean indexing

Now we can find all of the rows where our third item is even.

In [0]:
increasing_grid

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])

In [0]:
increasing_grid[increasing_grid[:, 2] % 2 == 0]

array([[ 6,  7,  8,  9, 10],
       [16, 17, 18, 19, 20]])
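
The three steps above can be gathered into a small helper. This is only a sketch: the function name `rows_with_even_value` and its `column` parameter are hypothetical and are not part of numpy.

```python
import numpy as np

def rows_with_even_value(grid, column):
    """Return the rows of `grid` whose item in `column` is even."""
    # Steps 1 and 2: pick the column, then test each item for evenness
    is_even = grid[:, column] % 2 == 0
    # Step 3: boolean indexing keeps the rows marked True
    return grid[is_even]

increasing_grid = np.arange(1, 26).reshape(5, 5)

print(rows_with_even_value(increasing_grid, 2))
# [[ 6  7  8  9 10]
#  [16 17 18 19 20]]

print(rows_with_even_value(increasing_grid, 0)[:, 0])  # [ 6 16]
```

Changing `== 0` to `== 1` inside the helper would select the rows with an odd value instead.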

### Summary

In this lesson we saw how to filter data in a multidimensional array.  We do so by taking advantage of boolean indexing in Numpy, which allows us to select from an array by passing an array of `True` or `False` values that marks the entries we wish to return.

In [0]:
increasing_grid

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])

In [0]:
bool_vals = np.array([ True, False,  True, False,  True])

In [0]:
increasing_grid[bool_vals]

array([[ 1,  2,  3,  4,  5],
       [11, 12, 13, 14, 15],
       [21, 22, 23, 24, 25]])

Numpy also broadcasts comparison operators, so we can check every element of a column against some criteria at once.

In [0]:
increasing_grid[:, 0] < 10

array([ True,  True, False, False, False])

And then we combine the two to filter entire rows.

In [0]:
increasing_grid[increasing_grid[:, 0] < 10]

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

### Solutions

In [0]:
increasing_grid[increasing_grid[:, -1] == 20]

### Resources

[Replace corrupted text](https://stackoverflow.com/questions/26541968/delete-every-non-utf-8-symbols-from-string)

[Numpy filtering](http://heydenberk.com/blog/posts/demystifying-pandas-numpy-filtering/)