# Comparisons, masks and Boolean logic
--------------------

Boolean masks are used to examine and manipulate values within NumPy arrays based on some criterion: for example, to count all values greater than a certain value, or to remove all outliers that are above some threshold.
In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks.

### 1. Comparison operators as ufuncs
---------------------

* NumPy implements comparison operators such as ``<``, ``>``, ``==``  and ``!=`` as element-wise ufuncs.
* The result of these operators is always an **array** with a Boolean data type.
* ``False`` is interpreted as ``0``, and ``True`` is interpreted as ``1``.
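As a minimal sketch of the last two points, the cell below builds a small array (the names `values`, `mask`, `as_int` and `count` are illustrative only), checks the data type of a comparison result, and shows how the Boolean values behave as `0` and `1` in arithmetic:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

# An element-wise comparison yields an array of the same shape with dtype bool
mask = values < 3
print(mask.dtype)

# Converting to integers makes the 0/1 interpretation explicit
as_int = mask.astype(int)
print(as_int)

# Because True counts as 1, summing a mask counts its True entries
count = mask.sum()
print(count)
```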

In [None]:
import numpy as np

In [None]:
x = np.array([1, 2, 3, 4, 5])

In [None]:
x < 3  # less than

In [None]:
x > 3  # greater than

In [None]:
x <= 3  # less than or equal

In [None]:
x >= 3  # greater than or equal

In [None]:
x != 3  # not equal

In [None]:
x == 3  # equal

It is also possible to include compound expressions:

In [None]:
(2 * x) == (x ** 2)

Each comparison operator has an equivalent ufunc; for example, ``x < 3`` is equivalent to ``np.less(x, 3)``:

| Operator	    | Equivalent ufunc    || Operator	   | Equivalent ufunc    |
|---------------|---------------------||---------------|---------------------|
|``==``         |``np.equal``         ||``!=``         |``np.not_equal``     |
|``<``          |``np.less``          ||``<=``         |``np.less_equal``    |
|``>``          |``np.greater``       ||``>=``         |``np.greater_equal`` |
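The equivalence in the table can be checked directly. The following sketch uses a throwaway array `a` (not the `x` from the cells above) and compares each operator with its ufunc:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# Pair each operator result with the result of the equivalent ufunc
pairs = [
    (a == 3, np.equal(a, 3)),
    (a != 3, np.not_equal(a, 3)),
    (a < 3, np.less(a, 3)),
    (a <= 3, np.less_equal(a, 3)),
    (a > 3, np.greater(a, 3)),
    (a >= 3, np.greater_equal(a, 3)),
]

# True only if every operator/ufunc pair gives identical arrays
all_match = all(np.array_equal(op, uf) for op, uf in pairs)
print(all_match)
```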

These work on arrays of any size and shape:

In [None]:
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x

In [None]:
x < 6

In each case, the result is a Boolean array.

### 2. Working with Boolean Arrays
--------------------

Recall ``x``, the two-dimensional array created above:

In [None]:
print(x)

#### 2.1. Counting entries

To count the number of ``True`` entries in a Boolean array:

In [None]:
# how many values less than 6?
np.count_nonzero(x < 6)

Another way to get this information is to use ``np.sum``; in this case, ``False`` is interpreted as ``0`` and ``True`` as ``1``:

In [None]:
np.sum(x < 6)

The benefit of ``np.sum`` is that, as with other NumPy aggregation functions, the summation can also be done along rows or columns.

This counts the number of values less than 6 in each row of the matrix:

In [None]:
# how many values less than 6 in each row?
np.sum(x < 6, axis=1)

In [None]:
# how many values less than 6 in each column?
np.sum(x < 6, axis=0)

If we're interested in quickly checking whether any or all the values are true, we can use ``np.any`` or ``np.all``:

In [None]:
# are there any values greater than 8?
np.any(x > 8)

In [None]:
# are there any values less than zero?
np.any(x < 0)

In [None]:
# are all values less than 10?
np.all(x < 10)

In [None]:
# are all values equal to 6?
np.all(x == 6)

``np.all`` and ``np.any`` can be used along particular axes as well. For example:

In [None]:
# are all values in each row less than 8?
np.all(x < 8, axis=1)

Warning: Python's built-in ``sum()``, ``any()``, and ``all()`` functions have a different syntax than the NumPy versions, and in particular will fail or produce unintended results when used on multidimensional arrays.
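To illustrate the warning, here is a small sketch on a hand-made two-dimensional array (`grid` and the other names are illustrative). The built-in `sum()` iterates over the rows and adds them together, while the built-in `any()` tries to take the truth value of a whole row and fails:

```python
import numpy as np

grid = np.array([[1, 7, 3],
                 [8, 2, 9]])
mask = grid < 6

# NumPy version: counts every True entry in the array
np_total = np.sum(mask)

# Built-in version: adds the rows element-wise, giving one count per column
builtin_total = sum(mask)

# Built-in any() asks for the truth value of each row, which is ambiguous
try:
    any(mask)
    builtin_any_failed = False
except ValueError:
    builtin_any_failed = True

print(np_total, builtin_total, builtin_any_failed)
```

Here the built-in `sum()` silently returns per-column counts instead of a single total, which is the kind of unintended result the warning refers to.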

#### 2.2. Boolean operators

* Python's **bitwise logic operators** are ``&``, ``|``, ``^``, and ``~``.
* NumPy overloads these as ufuncs which work element-wise on (usually Boolean) arrays.

| Operator	    | Equivalent ufunc    || Operator	    | Equivalent ufunc    |
|---------------|---------------------||---------------|---------------------|
|``&``          |``np.bitwise_and``   ||&#124;         |``np.bitwise_or``    |
|``^``          |``np.bitwise_xor``   ||``~``          |``np.bitwise_not``   |

Combining comparison operators and Boolean operators on arrays can lead to a wide range of efficient logical operations.
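A short sketch of such combinations, using a small illustrative array `data` rather than the `x` above:

```python
import numpy as np

data = np.array([0, 3, 5, 7, 9])

# AND: values strictly between 2 and 8
between = (data > 2) & (data < 8)

# OR: values at or below 2, or at or above 8
outside = (data <= 2) | (data >= 8)

# NOT: inverting `between` gives the same mask as `outside`
not_between = ~between

# XOR: True where exactly one of the two conditions holds
exactly_one = (data > 2) ^ (data > 6)

print(between)
print(np.array_equal(not_between, outside))
print(exactly_one)
```

The parentheses are needed because `&`, `|` and `^` bind more tightly than the comparison operators, so `data > 2 & data < 8` would not be parsed as two separate comparisons.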

### 3. Boolean arrays as masks
----------------------

A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of the data themselves.

Returning to our ``x`` array from before:

In [None]:
x

Suppose we want an array of all values in the array that are less than 5. We can obtain a **Boolean array** for this condition:

In [None]:
x < 5

Now to **select** these values from the array, we can simply index on this Boolean array; this is known as a **masking** operation:

In [None]:
x[x < 5]

What is returned is a **one-dimensional** array filled with all the values in positions at which the mask array is ``True``.
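The sketch below repeats the masking operation on a small hand-made array (`grid` is illustrative) to confirm the shape of the result. It also shows one possible way to handle outliers with a mask, by assigning through it. Capping values at 8 is an arbitrary choice made for this example:

```python
import numpy as np

grid = np.array([[5, 0, 3],
                 [7, 9, 3]])

# Selecting with a mask flattens the result: values come back in row-major order
small = grid[grid < 5]
print(small, small.ndim)

# Assigning through a mask changes only the positions where the mask is True.
# Work on a copy so the original array stays untouched.
capped = grid.copy()
capped[capped > 8] = 8
print(capped)
```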

### 4. Using the keywords ``and/or`` versus the operators ``&/|``
--------------------
*  Differences in use:
     * ``and`` and ``or`` gauge the truth or falsehood of the **entire object**
     * ``&`` and ``|`` refer to the **bits within each object**

* Using ``and`` or ``or`` is equivalent to asking Python to treat the object as a single Boolean entity. In Python, all **nonzero integers** will evaluate as **True**. 

In [None]:
bool(42), bool(0)

In [None]:
bool(42 and 0)

In [None]:
bool(42 or 0)

* When ``&`` and ``|`` are used on integers, the expression operates on the bits of the numbers: the corresponding bits of the two binary representations are combined with *and* or *or* to yield the result.

In [None]:
bin(42)

In [None]:
bin(59)

In [None]:
bin(42 & 59)

In [None]:
bin(42 | 59)

* An **array of Booleans** in NumPy can be thought of as a string of bits where ``1 = True`` and ``0 = False``, and ``&`` and ``|`` operate on it in the same way as above:

In [None]:
A = np.array([1, 0, 1, 0, 1, 0], dtype=bool)
B = np.array([1, 1, 1, 0, 1, 1], dtype=bool)
A | B

* Using ``or`` on these arrays will try to evaluate the truth or falsehood of the entire array object, which is **not a well-defined value**:

In [None]:
A or B

* Similarly, when doing a Boolean expression on a given array, you should use ``|`` or ``&`` rather than ``or`` or ``and``:

In [None]:
x = np.arange(10)
(x > 4) & (x < 8)

* Trying to evaluate the truth or falsehood of the entire array will give the same ``ValueError`` we saw previously:

In [None]:
(x > 4) and (x < 8)
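The following sketch catches that error so the cell runs to completion. It then shows two ways to state the intent explicitly: `np.logical_and` for an element-wise result, and `.any()` / `.all()` when a single truth value for the whole array is really what is wanted:

```python
import numpy as np

x = np.arange(10)

# `and` needs one truth value per operand, which a multi-element array cannot give
message = None
try:
    (x > 4) and (x < 8)
except ValueError as err:
    message = str(err)
print(message)

# Element-wise alternative; gives the same result as (x > 4) & (x < 8)
elementwise = np.logical_and(x > 4, x < 8)
same = np.array_equal(elementwise, (x > 4) & (x < 8))

# Reducing the mask to a single Boolean is well defined
any_in_range = elementwise.any()
all_in_range = elementwise.all()
print(same, any_in_range, all_in_range)
```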

### 5. Example: Counting Rainy Days

Consider a series of data that represents the amount of precipitation on each day of a year in a given city,
for example, Seattle in 2014:

In [None]:
import numpy as np
import csv

In [None]:
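# read the daily precipitation column from the CSV file
rainfall = []
with open('Seattle2014.csv', 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file)  # rows are keyed by the header names

    for line in csv_reader:
        # values are read as strings, so convert each one to float
        rainfall.append(float(line['PRCP']))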

In [None]:
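# convert to inches, assuming PRCP is recorded in tenths of a millimeter
# (1 inch = 25.4 mm = 254 tenths of a millimeter)
inches = np.array(rainfall) / 254.0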
print(inches)

In [None]:
inches.shape

The array contains 365 values, giving daily rainfall in inches from January 1 to December 31, 2014.

As a first quick visualization, let's look at a histogram of the daily rainfall, generated using Matplotlib:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()  # set plot styles

In [None]:
plt.hist(inches, 10);

This histogram gives us a general idea of what the data looks like: despite its reputation, the vast majority of days in Seattle saw near zero measured rainfall in 2014.
But this doesn't do a good job of conveying some information we'd like to see: for example, how many rainy days were there in the year? What is the average precipitation on those rainy days? How many days were there with more than half an inch of rain?

Examples of results we can compute when combining masking with aggregations:

In [None]:
print("Number days without rain:      ", np.sum(inches == 0))
print("Number days with rain:         ", np.sum(inches != 0))
print("Days with more than 0.5 inches:", np.sum(inches > 0.5))
print("Rainy days with < 0.2 inches  :", np.sum((inches > 0) &
                                                (inches < 0.2)))
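The question about average precipitation on rainy days can be approached the same way, by indexing with the mask before aggregating. The sketch below uses a made-up week of values (`sample_inches` is hypothetical and not taken from `Seattle2014.csv`) so that it runs without the data file:

```python
import numpy as np

# Hypothetical rainfall in inches for seven days; NOT the Seattle data
sample_inches = np.array([0.0, 0.25, 0.0, 0.75, 0.5, 0.0, 0.0])

rainy = sample_inches > 0

# Count of rainy days: True entries in the mask
n_rainy = np.sum(rainy)

# Mean precipitation restricted to rainy days: mask first, then aggregate
mean_on_rainy = sample_inches[rainy].mean()

# The mean of a mask is the fraction of True entries
fraction_rainy = np.mean(rainy)

print(n_rainy, mean_on_rainy, fraction_rainy)
```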

#### Digging into the data

* One approach to this would be to answer these questions by hand: loop through the data, incrementing a counter each time we see values in some desired range. Such an approach is very inefficient, both from the standpoint of time writing code and time computing the result.
* NumPy's ufuncs can be used to do fast element-wise arithmetic operations on arrays; in the same way, we can use other ufuncs to do element-wise **comparisons** over arrays, and we can then manipulate the results to answer the questions we have.
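The two approaches can be compared side by side. This sketch uses synthetic random values (`values` is a stand-in, not the rainfall data) and counts the entries that fall in an arbitrary range, first with an explicit loop and then with a Boolean mask:

```python
import numpy as np

# Synthetic data in [0, 1); a stand-in for real measurements
rng = np.random.RandomState(42)
values = rng.rand(1000)

# By hand: loop over the data and increment a counter
count_loop = 0
for v in values:
    if 0.2 < v < 0.5:
        count_loop += 1

# Vectorized: build a mask with element-wise comparisons, then count True entries
count_vec = np.count_nonzero((values > 0.2) & (values < 0.5))

print(count_loop, count_vec)
```

Both counts agree; the masked version is a single expression and leaves the looping to NumPy's compiled code.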

In [None]:
# construct a mask of all rainy days
rainy = (inches > 0)

# construct a mask of all summer days (June 21st is the 172nd day)
days = np.arange(365)
summer = (days > 172) & (days < 262)

print("Median precip on rainy days in 2014 (inches):   ",
      np.median(inches[rainy]))
print("Median precip on summer days in 2014 (inches):  ",
      np.median(inches[summer]))
print("Maximum precip on summer days in 2014 (inches): ",
      np.max(inches[summer]))
print("Median precip on non-summer rainy days (inches):",
      np.median(inches[rainy & ~summer]))