# Numerical Python: NumPy II


[NumPy](http://www.numpy.org/) 

# [![Numpy logo](https://numfocus.org/wp-content/uploads/2016/07/numpy-logo-300.png)](https://matplotlib.org/gallery/mplot3d/voxels_numpy_logo.html)

In our last lecture we explored the basics of the numpy module.  
We talked about why we need numpy, how to build various numpy arrays, and their fundamental attributes. We ended the class by looking at numpy's ufuncs, which is where numpy's power lies.  
Today we are going to explore some more advanced capabilities of numpy:
1. __Aggregation__
1. __Broadcasting__
1. __Fancy indexing and boolean indexing__

First things first, let's import numpy.

In [2]:
import numpy as np
# This makes sure floats print nicely
np.set_printoptions(precision=2)
np.set_printoptions(suppress=True)

*** 
# Numpy aggregation

In most data science applications we start exploring the data by querying different statistics.   
Numpy allows us to do that quickly by using aggregation functions, which summarize the values in an array (information is aggregated while iterating over the array).
Some of the most common aggregations are: 
```py
sum, mean, std, var, min, max.   
```
To view the entire aggregation list visit : [Numpy aggregation](https://jakevdp.github.io/PythonDataScienceHandbook/02.04-computation-on-arrays-aggregates.html) (There is a table in the middle of the notebook)   
Here are some examples:

In [None]:
np.random.seed(1101)
a = np.random.randint(10, 20, size=10)
a

In [None]:
a.min(), a.max()

In [None]:
a.mean(), a.std()

In [None]:
a.argmin()
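
As a small sketch of how `argmin` and `argmax` relate to `min` and `max`: they return the position of the extreme value rather than the value itself. The array values below are made up for illustration and are not the random array from the cells above.

```python
import numpy as np

# Illustrative values (an assumption, not the seeded random array above)
a = np.array([14, 11, 19, 11, 17])

i_min = a.argmin()   # index of the first occurrence of the minimum
i_max = a.argmax()   # index of the first occurrence of the maximum

print(i_min, a[i_min])  # 1 11
print(i_max, a[i_max])  # 2 19
```

Indexing the array with the result of `argmin` gives back the same value as `min`.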

### Aggregation over multi dimensional arrays
As the title suggests, we can aggregate over several dimensions at once and summarize a statistic over an entire matrix, for example:

In [None]:
M = np.random.rand(5, 5)
M.max(), M.min(), M.mean()

But in many cases you are not interested in a statistic of the entire matrix but along one of its axes, say per column. All you have to do is specify the axis over which you wish to aggregate.

<img src="http://www.elimhk.com/myblog/wp-content/uploads/2017/04/axis.png" width="">

To remember the output shape of an aggregation over an axis, I like to think about collapsing that axis. So if you have an array with shape (10, 3) and you aggregate over axis 0, you'll end up with shape (3,), since you "collapsed" the 0 axis (pass `keepdims=True` to get (1, 3) instead).
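
A quick way to check the "collapsing" picture is to print the shapes. This is a minimal sketch on an array of zeros; by default the aggregated axis is dropped entirely, while `keepdims=True` keeps it with size 1.

```python
import numpy as np

a = np.zeros((10, 3))

shape_axis0 = a.sum(axis=0).shape                 # axis 0 collapsed
shape_axis1 = a.sum(axis=1).shape                 # axis 1 collapsed
shape_keep = a.sum(axis=0, keepdims=True).shape   # axis 0 kept with size 1

print(shape_axis0)  # (3,)
print(shape_axis1)  # (10,)
print(shape_keep)   # (1, 3)
```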

In [None]:
a = np.arange(10).reshape(2, 5)
a, a.sum()

In [None]:
a.sum(axis=0) # How many values are we expecting to get?

In [None]:
a.sum(axis=1) # How many values are we expecting to get?

In [None]:
np.random.seed(109)
a = np.random.randint(low=0, high=100, size=(10, 2))
a

In [None]:
a.mean(axis=0)

***
## Exercise
***

In [4]:
np.random.seed(1111)
X = np.random.randint(low=0, high=50, size=(30, 4))
X

array([[28, 37, 17, 12],
       [34, 24, 22, 20],
       [11, 14,  8, 38],
       [12, 46, 22,  8],
       [41, 42, 12, 30],
       [14, 12,  4, 13],
       [40,  9,  9, 23],
       [18,  0, 36,  8],
       [ 5, 21, 17, 45],
       [32, 45, 11, 31],
       [29, 21, 44, 45],
       [34, 24,  0, 23],
       [29, 47, 25,  0],
       [40, 11, 47, 33],
       [41,  2,  9, 39],
       [40, 11, 38,  7],
       [ 9, 13, 17, 14],
       [27, 22,  2, 35],
       [21, 42, 23, 37],
       [10, 41,  7, 35],
       [13,  5, 33, 32],
       [48, 30, 14, 43],
       [48, 20, 29, 43],
       [13, 10, 21,  6],
       [30, 29,  8,  3],
       [ 0,  2, 25, 23],
       [ 5, 38, 39, 11],
       [21, 37, 22, 15],
       [30, 25, 26, 34],
       [44, 24,  0, 28]])

__Get the mean values of each column in X__

In [5]:
# Your code here
#array([25.57, 23.47, 19.57, 24.47])
X.mean(axis=0)

array([25.57, 23.47, 19.57, 24.47])

__Get the max value of each row in X__

In [6]:
# Your code here
# array([37, 34, 38, 46, 42, 14, 40, 36, 45, 45, 45, 34, 47, 47, 41, 40, 17,
#        35, 42, 41, 33, 48, 48, 21, 30, 25, 39, 37, 34, 44])
X.max(axis=1)

array([37, 34, 38, 46, 42, 14, 40, 36, 45, 45, 45, 34, 47, 47, 41, 40, 17,
       35, 42, 41, 33, 48, 48, 21, 30, 25, 39, 37, 34, 44])

__Get the median value of all the values of X__

In [7]:
# Your code here
#23.0
np.median(X)

23.0

__Get the variance of each column of X plus the variance of each row of Y.__

In [8]:
np.random.seed(2222)
Y = np.random.randint(low=0, high=50, size=(4, 20))
Y

array([[33, 17, 41,  6, 24, 40, 32, 36, 25,  0, 28, 24, 16, 17,  6, 16,
        28, 18, 36,  2],
       [15, 25, 41,  5, 35, 21,  2, 20, 27, 40, 19, 42, 40,  1, 45, 12,
        17, 44, 19, 22],
       [45, 16, 34, 22,  6, 29, 36, 25,  7, 26, 40, 19, 24, 13, 16, 17,
        36, 17,  8, 10],
       [10, 23, 36, 14, 30, 28, 29,  0, 35, 46, 25, 46,  2,  0, 38, 12,
        26, 16, 33, 26]])

In [9]:
# Your code goes here

# [188.78 201.52 162.85 185.05]
# [145.99 190.84 124.91 186.79]

#array([334.77, 392.36, 287.76, 371.84])
v_x = np.var(X, axis=0)
print(v_x)
v_y = np.var(Y, axis=1)
print(v_y)
v_x + v_y

[188.78 201.52 162.85 185.05]
[145.99 190.84 124.91 186.79]


array([334.77, 392.36, 287.76, 371.84])

## Wine Example
<img src="https://www.ironstonevineyards.com/wp-content/uploads/2017/06/wine-club-cheers.jpg" width="300" height="">

We are going to use the wine dataset from sklearn (THE machine learning module in python).    
In this data, each row corresponds to a specific wine. The row values are different chemical measurements of the wine (alcohol, malic acid, magnesium, etc.) alongside a label for the wine; in sklearn's wine dataset that label is the class (cultivar) of the wine. The "goal" in this data is to use the chemical properties to predict the label.  
In machine learning the "properties" are referred to as __features__ while the value we are trying to predict is referred to as the __label__ or __target__.

|  Type |Alcohol | Malic acid| ash | ... |
|---|---|---|---|---|
|Merlot Galil| 12.5 | 2.3| 4.5| ...|
|Merlot Arava| 13.2 | 3.1| 2.5| ...|
|Cabernet Negev| 14.1 | 3.3| 4.1| ...|

In [42]:
from sklearn import datasets # We are going to use sklearn to load a sample dataset.

# Load the wine dataset.
wine_dataset = datasets.load_wine()

# Extract the feature names from the dataset.
features_names = wine_dataset['feature_names']

# Extract the features matrix from the dataset.
X = wine_dataset['data']
X






array([[  14.23,    1.71,    2.43, ...,    1.04,    3.92, 1065.  ],
       [  13.2 ,    1.78,    2.14, ...,    1.05,    3.4 , 1050.  ],
       [  13.16,    2.36,    2.67, ...,    1.03,    3.17, 1185.  ],
       ...,
       [  13.27,    4.28,    2.26, ...,    0.59,    1.56,  835.  ],
       [  13.17,    2.59,    2.37, ...,    0.6 ,    1.62,  840.  ],
       [  14.13,    4.1 ,    2.74, ...,    0.61,    1.6 ,  560.  ]])

In [39]:
# Let's take a look at what kind of features we have to work with
features_names

# ['alcohol',
#  'malic_acid',
#  'ash',
#  'alcalinity_of_ash',
#  'magnesium',
#  'total_phenols',
#  'flavanoids',
#  'nonflavanoid_phenols',
#  'proanthocyanins',
#  'color_intensity',
#  'hue',
#  'od280/od315_of_diluted_wines',
#  'proline']

['alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280/od315_of_diluted_wines',
 'proline']

In [40]:
# Let's see how much data we have, and a small sanity check
X.shape, len(features_names)
#((178, 13), 13)

((178, 13), 13)

In [None]:
#!conda install scikit-learn

***
## Exercise
***
__Extract the alcohol column__

In [59]:
# Your code here.

wine_dataset.data[:,0]
X[:,0]

array([14.23, 13.2 , 13.16, 14.37, 13.24, 14.2 , 14.39, 14.06, 14.83,
       13.86, 14.1 , 14.12, 13.75, 14.75, 14.38, 13.63, 14.3 , 13.83,
       14.19, 13.64, 14.06, 12.93, 13.71, 12.85, 13.5 , 13.05, 13.39,
       13.3 , 13.87, 14.02, 13.73, 13.58, 13.68, 13.76, 13.51, 13.48,
       13.28, 13.05, 13.07, 14.22, 13.56, 13.41, 13.88, 13.24, 13.05,
       14.21, 14.38, 13.9 , 14.1 , 13.94, 13.05, 13.83, 13.82, 13.77,
       13.74, 13.56, 14.22, 13.29, 13.72, 12.37, 12.33, 12.64, 13.67,
       12.37, 12.17, 12.37, 13.11, 12.37, 13.34, 12.21, 12.29, 13.86,
       13.49, 12.99, 11.96, 11.66, 13.03, 11.84, 12.33, 12.7 , 12.  ,
       12.72, 12.08, 13.05, 11.84, 12.67, 12.16, 11.65, 11.64, 12.08,
       12.08, 12.  , 12.69, 12.29, 11.62, 12.47, 11.81, 12.29, 12.37,
       12.29, 12.08, 12.6 , 12.34, 11.82, 12.51, 12.42, 12.25, 12.72,
       12.22, 11.61, 11.46, 12.52, 11.76, 11.41, 12.08, 11.03, 11.82,
       12.42, 12.77, 12.  , 11.45, 11.56, 12.42, 13.05, 11.87, 12.07,
       12.43, 11.79,

__Find the mean, max and min values of the alcohol feature__

In [48]:
# Your code here.
#(13.00061797752809, 14.83, 11.03)
alc = wine_dataset.data[:,0]
print(alc.mean(),alc.max(),alc.min())



13.00061797752809 14.83 11.03


__Find the mean value of the flavanoids column divided by the nonflavanoid_phenols column__

In [58]:
# Your code here.
# 6.8496995472735565
nflavanoids = wine_dataset.data[:,7] 

flavanoids = wine_dataset.data[:,6]

(flavanoids/nflavanoids).mean()


6.8496995472735565

***
## Broadcasting

A very powerful mechanism of NumPy arrays is [broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).
Broadcasting is what happens when an operation is applied to two arrays of different shapes.
The rules are:

1. If the arrays differ in their number of dimensions, left-pad the shape of the array with fewer dimensions with 1s.
1. If the shapes differ in some dimension, stretch any dimension of size 1 to match the size of the other array in that dimension.
1. If the shapes still differ, raise an error.
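
The three rules can be sketched as a small function that works on shapes only. `broadcast_shape` is a hypothetical helper written for illustration (it is not part of NumPy); the last line compares it against the shape NumPy itself reports via `np.broadcast`.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: apply the three broadcasting rules to two shapes."""
    ndim = max(len(shape_a), len(shape_b))
    # Rule 1: left-pad the shorter shape with 1s.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            result.append(da)          # sizes agree, or b is stretched (rule 2)
        elif da == 1:
            result.append(db)          # a is stretched (rule 2)
        else:
            # Rule 3: sizes differ and neither is 1.
            raise ValueError(f"shapes {shape_a} and {shape_b} cannot be broadcast")
    return tuple(result)

print(broadcast_shape((3, 1), (3,)))        # (3, 3)
print(broadcast_shape((3, 3, 3), (3, 3)))   # (3, 3, 3)

# Cross-check against NumPy's own answer for one case
numpy_shape = np.broadcast(np.empty(10), np.empty((10, 1))).shape
print(broadcast_shape((10,), (10, 1)) == numpy_shape)  # True
```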

Some examples:
![broadcasting examples](http://www.astroml.org/_images/fig_broadcast_visual_1.png)

In [60]:
np.arange(3) + 5

array([5, 6, 7])

In [61]:
np.ones((3,3)) + np.arange(3)

array([[1., 2., 3.],
       [1., 2., 3.],
       [1., 2., 3.]])

In [62]:
np.arange(3).reshape((3, 1)) + np.arange(3)

array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])

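Broadcasting also works when the arrays have a different number of dimensions. Here the (3, 3) array is treated as (1, 3, 3) and stretched along the first axis: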

In [67]:
np.ones((3,3, 3)) + np.ones((3, 3))

array([[[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]],

       [[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]],

       [[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]]])
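
The addition above succeeds because the trailing dimensions match. For contrast, here is a minimal sketch of a case that does fail: the trailing dimensions are 3 and 4, and neither of them is 1, so the third rule applies.

```python
import numpy as np

failed = False
try:
    np.ones((3, 3)) + np.ones(4)   # trailing dimensions are 3 and 4
except ValueError as err:
    failed = True
    print("Broadcasting failed:", err)

print(failed)  # True
```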

As an example, we can use broadcasting to quickly build a multiplication table.

In [68]:
# Shape (10,)      shape(10, 1) -> Broadcasting to (10, 10)
np.arange(1, 11) * np.arange(1, 11).reshape(10, 1)

array([[  1,   2,   3,   4,   5,   6,   7,   8,   9,  10],
       [  2,   4,   6,   8,  10,  12,  14,  16,  18,  20],
       [  3,   6,   9,  12,  15,  18,  21,  24,  27,  30],
       [  4,   8,  12,  16,  20,  24,  28,  32,  36,  40],
       [  5,  10,  15,  20,  25,  30,  35,  40,  45,  50],
       [  6,  12,  18,  24,  30,  36,  42,  48,  54,  60],
       [  7,  14,  21,  28,  35,  42,  49,  56,  63,  70],
       [  8,  16,  24,  32,  40,  48,  56,  64,  72,  80],
       [  9,  18,  27,  36,  45,  54,  63,  72,  81,  90],
       [ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100]])

In [69]:
np.arange(1, 11).reshape(10, 1) * np.arange(1, 11)

array([[  1,   2,   3,   4,   5,   6,   7,   8,   9,  10],
       [  2,   4,   6,   8,  10,  12,  14,  16,  18,  20],
       [  3,   6,   9,  12,  15,  18,  21,  24,  27,  30],
       [  4,   8,  12,  16,  20,  24,  28,  32,  36,  40],
       [  5,  10,  15,  20,  25,  30,  35,  40,  45,  50],
       [  6,  12,  18,  24,  30,  36,  42,  48,  54,  60],
       [  7,  14,  21,  28,  35,  42,  49,  56,  63,  70],
       [  8,  16,  24,  32,  40,  48,  56,  64,  72,  80],
       [  9,  18,  27,  36,  45,  54,  63,  72,  81,  90],
       [ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100]])

In [70]:
print(np.arange(1, 11))
print(np.arange(1, 11).reshape(10, 1))

[ 1  2  3  4  5  6  7  8  9 10]
[[ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]]


***
# Exercise
***
Use `a` and `b` to produce the following output:
```py
array([[ 2,  4,  6,  8],
       [ 3,  6,  9, 12]])
```

In [122]:
# Your code here
np.arange(1, 5) * np.arange(2,4,1).reshape(2, 1)

array([[ 2,  4,  6,  8],
       [ 3,  6,  9, 12]])

__Given a 1D array `X`, calculate the differences between every pair of elements of `X` using broadcasting and save the result to the array `D`, meaning `D[i, j] = X[i] - X[j]`.__

In [123]:
X = np.linspace(1, 10, 10)

In [124]:
X

array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.])

In [131]:
# Your code here
# Loop version, for comparison: the outer loop runs over i (rows), the inner over j (columns)
D = [[i - j for j in X] for i in X]
D

# Broadcasting version: (10, 1) - (10,) -> (10, 10), so D[i, j] = X[i] - X[j]
D = X.reshape(10, 1) - X
D

array([[ 0., -1., -2., -3., -4., -5., -6., -7., -8., -9.],
       [ 1.,  0., -1., -2., -3., -4., -5., -6., -7., -8.],
       [ 2.,  1.,  0., -1., -2., -3., -4., -5., -6., -7.],
       [ 3.,  2.,  1.,  0., -1., -2., -3., -4., -5., -6.],
       [ 4.,  3.,  2.,  1.,  0., -1., -2., -3., -4., -5.],
       [ 5.,  4.,  3.,  2.,  1.,  0., -1., -2., -3., -4.],
       [ 6.,  5.,  4.,  3.,  2.,  1.,  0., -1., -2., -3.],
       [ 7.,  6.,  5.,  4.,  3.,  2.,  1.,  0., -1., -2.],
       [ 8.,  7.,  6.,  5.,  4.,  3.,  2.,  1.,  0., -1.],
       [ 9.,  8.,  7.,  6.,  5.,  4.,  3.,  2.,  1.,  0.]])

***
__Below is an array of prices of different products in shekels. Let's call the sum of these prices a basket. You would like to know the basket price in each of the following currencies:__
1. Dollar (1 shekel -> 0.28)
1. Euro   (1 shekel -> 0.26)
1. Yuan   (1 shekel -> 2.03)
1. Yen    (1 shekel -> 30.11)

__Use broadcasting and aggregation to quickly find out the price of the baskets.__

In [137]:
prices = np.array([50, 25, 80, 100, 150, 275])

# Your code here
curr = np.array([0.28,0.26,2.03,30.11])
prices_m  = curr * prices.sum()
prices_m

array([  190.4,   176.8,  1380.4, 20474.8])
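
The solution above multiplies the total by each rate. An alternative sketch that uses broadcasting more explicitly first builds a table of every product's price in every currency and then aggregates over the products. The name `per_product` is introduced here for illustration only.

```python
import numpy as np

prices = np.array([50, 25, 80, 100, 150, 275])   # shekels
rates = np.array([0.28, 0.26, 2.03, 30.11])      # dollar, euro, yuan, yen

# (6, 1) * (4,) -> (6, 4): one row per product, one column per currency
per_product = prices[:, np.newaxis] * rates

# Collapse the product axis to get one basket price per currency
baskets = per_product.sum(axis=0)

print(per_product.shape)                            # (6, 4)
print(np.allclose(baskets, rates * prices.sum()))   # True
```

Both routes give the same basket prices; the table version is useful when the per-product prices in each currency are also needed.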

***
__A very common procedure in machine learning is to normalize the data before feeding it to an algorithm.__


Use aggregation to mean-center the columns of the following X matrix (so that each column has a mean of 0).

In [139]:
np.random.seed(1111)
X = np.random.randint(low=0, high=50, size=(30, 4))
X

array([[28, 37, 17, 12],
       [34, 24, 22, 20],
       [11, 14,  8, 38],
       [12, 46, 22,  8],
       [41, 42, 12, 30],
       [14, 12,  4, 13],
       [40,  9,  9, 23],
       [18,  0, 36,  8],
       [ 5, 21, 17, 45],
       [32, 45, 11, 31],
       [29, 21, 44, 45],
       [34, 24,  0, 23],
       [29, 47, 25,  0],
       [40, 11, 47, 33],
       [41,  2,  9, 39],
       [40, 11, 38,  7],
       [ 9, 13, 17, 14],
       [27, 22,  2, 35],
       [21, 42, 23, 37],
       [10, 41,  7, 35],
       [13,  5, 33, 32],
       [48, 30, 14, 43],
       [48, 20, 29, 43],
       [13, 10, 21,  6],
       [30, 29,  8,  3],
       [ 0,  2, 25, 23],
       [ 5, 38, 39, 11],
       [21, 37, 22, 15],
       [30, 25, 26, 34],
       [44, 24,  0, 28]])

In [146]:
# Your code here
X_new = X - X.mean(axis=0)
X_new.mean(axis=0)

array([0., 0., 0., 0.])

__Use `np.isclose` to validate that all the columns in the new matrix have mean 0.__

In [153]:
# Your code here
np.isclose(X_new.mean(axis=0),0)

array([ True,  True,  True,  True])

***
__Let X, Y be two random variables. Below is the joint distribution, J, of X and Y: J[i, j] = $p(x=i, y=j)$  
Find out if X and Y are independent.__  
Reverse this string ([::-1]) for a hint: 
```py
J nevig eht ot erapmoc dna noitubirtsid tnioj eht etupmoc ,noitagergga gnisu Y dna X fo slanigram eht etupmoC
```

In [154]:
J = np.array(([
    [0.04 , 0.03 , 0.02 , 0.01 ],
    [0.075, 0.1  , 0.05 , 0.025],
    [0.075, 0.1  , 0.05 , 0.025],
    [0.12 , 0.16 , 0.08 , 0.04 ]
]))

In [158]:
# Your code here
s = "J nevig eht ot erapmoc dna noitubirtsid tnioj eht etupmoc ,noitagergga gnisu Y dna X fo slanigram eht etupmoC"
s1 = s[::-1]
s1

'Compute the marginals of X and Y using aggregation, compute the joint distribution and compare to the given J'

In [176]:
np.isclose(J ,J.sum(axis=0) * J.sum(axis=1).reshape(4,1))
#J.sum(axis=1).T.shape

array([[False, False,  True,  True],
       [False, False,  True,  True],
       [False, False,  True,  True],
       [False, False,  True,  True]])
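
To turn the element-wise comparison into a single yes/no answer, one option is `np.allclose`, which checks all entries at once. This is a sketch of that final step; the names `p_x` and `p_y` for the marginals are introduced here for illustration.

```python
import numpy as np

J = np.array([
    [0.04 , 0.03 , 0.02 , 0.01 ],
    [0.075, 0.1  , 0.05 , 0.025],
    [0.075, 0.1  , 0.05 , 0.025],
    [0.12 , 0.16 , 0.08 , 0.04 ],
])

p_x = J.sum(axis=1)   # marginal of X: sum over the values of Y
p_y = J.sum(axis=0)   # marginal of Y: sum over the values of X

# Under independence, J[i, j] == p_x[i] * p_y[j] for every i, j
product = p_x[:, np.newaxis] * p_y
independent = bool(np.allclose(J, product))

print(independent)  # False
```

Since some entries of `J` differ from the product of the marginals, X and Y are not independent.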

***
## Boolean indexing and Fancy indexing

### Boolean operations
Before we talk about boolean indexing we'll talk about boolean ufuncs.  
We saw that we can operate on numpy arrays element-wise using arithmetic functions, which apply the operation to each element. We can also work element-wise with comparison operators, which results in a boolean array indicating whether the comparison was True or False for each element.

In [3]:
np.random.seed(2611)
a = np.random.randint(low=0, high=50, size=20)
a

array([18, 18, 33,  6, 16, 27, 15, 26, 32, 30, 17,  9, 39, 38, 37, 27, 36,
       43, 21, 10])

In [4]:
a == 18

array([ True,  True, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False])

In [5]:
a > 18

array([False, False,  True, False, False,  True, False,  True,  True,
        True, False, False,  True,  True,  True,  True,  True,  True,
        True, False])

Given this boolean array we can now check different properties of our original array.  
For example, we can check how many entries in our array are equal to 18:

In [6]:
(a == 18).sum()

2
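
Summing works because `True` counts as 1 and `False` as 0. The same idea gives the fraction of matching entries through `mean`, and `np.count_nonzero` is an equivalent way to count. A small sketch using the array printed above:

```python
import numpy as np

a = np.array([18, 18, 33,  6, 16, 27, 15, 26, 32, 30, 17,  9, 39, 38, 37, 27, 36,
              43, 21, 10])

count = np.count_nonzero(a == 18)   # same count as (a == 18).sum()
fraction = (a == 18).mean()         # share of entries equal to 18

print(count)      # 2
print(fraction)   # 0.1
```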

We can check if __all__ or __any__ of the elements possess a certain attribute:

In [7]:
(a == 18).any(), (a < 50).any(), (a < 0).any()

(True, True, False)

In [8]:
(a == 18).all(), (a < 50).all(), (a < 0).all()

(False, True, False)

This works on multidimensional arrays as well:

In [9]:
np.random.seed(23)
A = np.random.randint(low=0, high=30, size=(5, 4))
A

array([[19,  6,  8,  9],
       [22,  8, 13, 12],
       [27,  7, 26, 25],
       [19,  6, 27, 13],
       [12, 17,  2, 11]])

In [10]:
A > 9

array([[ True, False, False, False],
       [ True, False,  True,  True],
       [ True, False,  True,  True],
       [ True, False,  True,  True],
       [ True,  True, False,  True]])

In [11]:
A == 6

array([[False,  True, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False,  True, False, False],
       [False, False, False, False]])

In [12]:
(A==6).sum(), (A>6).any(), (A>6).all()

(2, True, False)

We can use the axis parameter to aggregate over an axis and not the entire matrix.

In [13]:
(A == 6).sum(axis=0), (A<6).any(axis=1), (A>6).all(axis=0)

(array([0, 2, 0, 0]),
 array([False, False, False, False,  True]),
 array([ True, False, False,  True]))

### Bitwise operation
A bitwise operation is a function which takes in one or two boolean values {0, 1} and outputs a boolean value {0, 1}. A bitwise operation is defined by a truth table which holds the output for each combination of input values.  
Numpy supports 4 boolean bitwise operations:
1. & (AND)
1. | (OR)
1. ^ (XOR)
1. ~ (NOT)  
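
The truth tables of the four operators can be printed directly by applying them to two small boolean arrays that cover every input combination. The arrays `p` and `q` are made up for this sketch.

```python
import numpy as np

# All four combinations of two boolean inputs
p = np.array([False, False, True, True])
q = np.array([False, True, False, True])

and_result = p & q
or_result = p | q
xor_result = p ^ q
not_result = ~p

print("AND", and_result)   # [False False False  True]
print("OR ", or_result)    # [False  True  True  True]
print("XOR", xor_result)   # [False  True  True False]
print("NOT", not_result)   # [ True  True False False]
```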



This enables us to check multiple attributes quickly

In [14]:
np.random.seed(23)
A = np.random.randint(low=0, high=30, size=(5, 4))
A

array([[19,  6,  8,  9],
       [22,  8, 13, 12],
       [27,  7, 26, 25],
       [19,  6, 27, 13],
       [12, 17,  2, 11]])

In [15]:
(A > 8) & (A < 15) # Only values strictly between 8 and 15 evaluate to True

array([[False, False, False,  True],
       [False, False,  True,  True],
       [False, False, False, False],
       [False, False, False,  True],
       [ True, False, False,  True]])

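__Watch Out__ Bitwise operators have higher precedence than comparison operators, so the comparisons must be wrapped in parentheses. For example, the following statement is parsed as `A > (8 & A) < 15` and fails: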

In [16]:
A > 8 & A < 15

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
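
To see where the error comes from, it helps to evaluate the pieces in the order Python does. This is a sketch on a small made-up array: `8 & A` is computed first, and the remaining chained comparison then needs the truth value of a whole array.

```python
import numpy as np

A = np.array([3, 9, 12, 20])   # illustrative values

# `&` binds tighter than `>` and `<`, so `A > 8 & A < 15`
# is read as the chained comparison `A > (8 & A) < 15`.
inner = 8 & A
print(inner)                    # [0 8 8 0]

# The chained comparison implies an `and` between two arrays, hence the error.
raised = False
try:
    A > 8 & A < 15
except ValueError:
    raised = True
print(raised)                   # True

# With parentheses, each comparison is evaluated first.
mask = (A > 8) & (A < 15)
print(mask)                     # [False  True  True False]
```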

> __and__ / **or** vs __&__ / **|** The difference between `and` and `&` (or between `or` and `|`) can be confusing. The `and` and `or` keywords work on the "truthiness" of an entire object. When you try to evaluate 
```py 
(A > 8) and (A < 15)
```
The interpreter will raise an exception, since the array `A > 8` cannot be evaluated as a single boolean value, and `and` is not a ufunc in numpy. But 
```py
(A > 8) | (A < 15)
```
works, since `|` calls a numpy ufunc which operates element by element.

In [17]:
(A > 8) and (A < 15)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [18]:
(A > 8).any() and (A < 15).any()

True
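
If the operator symbols feel too easy to confuse with the keywords, NumPy also offers named functions. The sketch below checks that `np.logical_and` and `np.logical_or` give the same result as `&` and `|` on boolean arrays; the small array is made up for illustration.

```python
import numpy as np

A = np.array([[19,  6,  8,  9],
              [22,  8, 13, 12]])   # illustrative values

with_operator = (A > 8) & (A < 15)
with_function = np.logical_and(A > 8, A < 15)
same_and = np.array_equal(with_operator, with_function)

same_or = np.array_equal((A > 8) | (A < 15), np.logical_or(A > 8, A < 15))

print(with_function)
print(same_and, same_or)   # True True
```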

One of the coolest features of numpy is that you can use a boolean array for indexing. This is usually referred to as __masking__.  
In the example below, we first build a boolean array indicating whether each value is bigger than 7 or not. 
Then, we use this boolean array to extract all the values which are bigger than 7. This is a very powerful and convenient feature.

In [19]:
A[A > 7] # Get all values greater than 7

array([19,  8,  9, 22,  8, 13, 12, 27, 26, 25, 19, 27, 13, 12, 17, 11])

In [20]:
A[(A < 10) & (A != 6)] # Get all values smaller than 10 which are not 6

array([8, 9, 8, 7, 2])
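
A short sketch of what happens to shapes when masking: the mask has the same shape as the array, while the result is always one-dimensional, with one entry per `True` in the mask. The array here is a made-up `arange` so the values are easy to follow.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)   # illustrative array with values 0..11
mask = A > 7

selected = A[mask]

print(mask.shape)       # (3, 4), same shape as A
print(selected)         # [ 8  9 10 11]
print(selected.shape)   # (4,), one entry per True in the mask
print(mask.sum())       # 4
```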

***
### Exercise
***

__Get all the values from A which are between 10 and 20 but not 11__

In [28]:
np.random.seed(99)
A = np.random.randint(low=0, high=20, size=(5, 5))
A

array([[ 1,  3,  8,  9,  8],
       [18,  4,  5,  1,  3],
       [17,  1, 16,  6, 11],
       [ 2,  0, 12,  8,  8],
       [ 7, 15, 15,  5, 14]])

In [29]:
# Your code here
# A holds integers, so "between 10 and 20 but not 11" can be spelled out explicitly:
A[(A > 10) & (A < 20) & (A != 11)]
# array([18, 17, 16, 12, 15, 15, 14])

array([18, 17, 16, 12, 15, 15, 14])

__Use np.where to find the indices of all the values between 10 and 20 or 30 to 40 in B__

In [30]:
np.random.seed(2019)
B = np.random.randint(low=0, high=41, size=50)
B

array([ 8, 31, 37, 24, 24, 29, 15, 12, 16,  7, 37, 19, 12, 16, 31,  5, 24,
       28, 21, 27, 15,  1, 10, 32, 11, 18, 15, 22, 22, 33, 35, 16,  6, 33,
       24, 18,  8,  7,  7, 16,  3, 26, 27, 24, 17, 15, 38, 10, 17,  8])

In [31]:
# Your code here
idx = np.where(((B > 10) & (B< 20)) | ((B > 30) & (B< 40)))
idx

(array([ 1,  2,  6,  7,  8, 10, 11, 12, 13, 14, 20, 23, 24, 25, 26, 29, 30,
        31, 33, 35, 39, 44, 45, 46, 48], dtype=int64),)

__Use the same technique to find the indices of all the wines which have `alcohol` level of above 12.5 or `malic acid` of below 2 but not both!__

In [32]:
from sklearn import datasets # We are going to use sklearn to load a sample dataset.

# Load the wine dataset.
wine_dataset = datasets.load_wine()

# Extract the feature names from the dataset.
features_names = wine_dataset['feature_names']

# Extract the features matrix from the dataset.
X = wine_dataset['data']

In [34]:
# Your code here
features_names

['alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280/od315_of_diluted_wines',
 'proline']

In [47]:
X[:, [0, 1]]

array([[14.23,  1.71],
       [13.2 ,  1.78],
       [13.16,  2.36],
       [14.37,  1.95],
       [13.24,  2.59],
       [14.2 ,  1.76],
       [14.39,  1.87],
       [14.06,  2.15],
       [14.83,  1.64],
       [13.86,  1.35],
       [14.1 ,  2.16],
       [14.12,  1.48],
       [13.75,  1.73],
       [14.75,  1.73],
       [14.38,  1.87],
       [13.63,  1.81],
       [14.3 ,  1.92],
       [13.83,  1.57],
       [14.19,  1.59],
       [13.64,  3.1 ],
       [14.06,  1.63],
       [12.93,  3.8 ],
       [13.71,  1.86],
       [12.85,  1.6 ],
       [13.5 ,  1.81],
       [13.05,  2.05],
       [13.39,  1.77],
       [13.3 ,  1.72],
       [13.87,  1.9 ],
       [14.02,  1.68],
       [13.73,  1.5 ],
       [13.58,  1.66],
       [13.68,  1.83],
       [13.76,  1.53],
       [13.51,  1.8 ],
       [13.48,  1.81],
       [13.28,  1.64],
       [13.05,  1.65],
       [13.07,  1.5 ],
       [14.22,  3.99],
       [13.56,  1.71],
       [13.41,  3.84],
       [13.88,  1.89],
       [13.

In [44]:
alc = X[:,0]
acid = X[:, 1]
idx = np.where( ((alc > 12.5) & (acid >= 2)) | ((alc <= 12.5) & (acid < 2))) 
idx

(array([  2,   4,   7,  10,  19,  21,  25,  39,  41,  43,  45,  46,  48,
         59,  60,  63,  64,  65,  67,  69,  70,  74,  75,  78,  79,  80,
         82,  83,  84,  86,  87,  89,  90,  91,  94,  95,  97,  98, 103,
        106, 108, 109, 111, 113, 114, 115, 116, 117, 118, 123, 126, 128,
        131, 132, 133, 135, 137, 138, 139, 140, 141, 142, 143, 145, 146,
        147, 148, 149, 150, 151, 153, 155, 156, 161, 162, 163, 164, 165,
        166, 167, 168, 169, 171, 172, 173, 174, 175, 176, 177], dtype=int64),)
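
"One or the other but not both" is exactly XOR, so the same condition can be written with the `^` operator. The sketch below uses a few made-up alcohol and malic acid values (not the wine dataset) and checks that `^` matches the longer `&`/`|` form.

```python
import numpy as np

# Made-up values for illustration, not taken from the wine dataset
alc = np.array([13.0, 12.0, 13.5, 11.9])
acid = np.array([1.5, 1.8, 2.4, 3.0])

high_alc = alc > 12.5   # [ True False  True False]
low_acid = acid < 2     # [ True  True False False]

xor_mask = high_alc ^ low_acid
long_form = (high_alc & ~low_acid) | (~high_alc & low_acid)

idx = np.where(xor_mask)[0]

print(xor_mask)                             # [False  True  True False]
print(np.array_equal(xor_mask, long_form))  # True
print(idx)                                  # [1 2]
```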

### Fancy indexing
Once we have extracted the indices from the features, we can use them to get all the rows that match our criterion. Using an array of integers as an indexer is called fancy indexing.

In [49]:
alc = X[:,0]
acid = X[:, 1]
mask = ((alc > 12.5) & (acid >= 2)) | ((alc <= 12.5) & (acid < 2))

ind = np.where(mask) # Assuming you got mask right.
X[ind]

array([[  13.16,    2.36,    2.67, ...,    1.03,    3.17, 1185.  ],
       [  13.24,    2.59,    2.87, ...,    1.04,    2.93,  735.  ],
       [  14.06,    2.15,    2.61, ...,    1.06,    3.58, 1295.  ],
       ...,
       [  13.27,    4.28,    2.26, ...,    0.59,    1.56,  835.  ],
       [  13.17,    2.59,    2.37, ...,    0.6 ,    1.62,  840.  ],
       [  14.13,    4.1 ,    2.74, ...,    0.61,    1.6 ,  560.  ]])

The idea of fancy indexing is pretty simple: we can use a scalar to pick a specific element from an array, so let's use an array of indices to access multiple elements at once.

<img src="https://media.giphy.com/media/dQkcf8GANR0ps57oBH/giphy.gif" width="200">

In [51]:
a = np.arange(20)

In [52]:
a[0], a[2], a[5], a[17] # Cumbersome way of accessing 4 elements in an array.

(0, 2, 5, 17)

In [53]:
ind = [0, 2, 5, 17] # Fancy indexing!
a[ind]

array([ 0,  2,  5, 17])
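
One property worth knowing: with fancy indexing, the shape of the result follows the shape of the index array, not the shape of the array being indexed. A minimal sketch with a 2D index array:

```python
import numpy as np

a = np.arange(20)

# A 2x2 array of indices gives a 2x2 result
ind = np.array([[0, 2],
                [5, 17]])
picked = a[ind]

print(picked)
# [[ 0  2]
#  [ 5 17]]
print(picked.shape)   # (2, 2)
```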

We can also do fancy indexing on multiple axes:

In [54]:
A = np.arange(20).reshape(4, 5)
A

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [55]:
row = [0, 1, 3]
col = [1, 2, 4]
A[row, col]

array([ 1,  7, 19])

If you want to take the [1, 2, 4] column values from each of the [0, 1, 3] rows, you can use __broadcasting!__

In [58]:
row = np.array([0, 1, 3])
row
row[:, np.newaxis]

array([[0],
       [1],
       [3]])

In [59]:
row = np.array([0, 1, 3])
A[row[:, np.newaxis], col]

array([[ 1,  2,  4],
       [ 6,  7,  9],
       [16, 17, 19]])
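
NumPy also provides `np.ix_`, which builds the broadcastable index arrays for you. The sketch below checks that it selects the same rows and columns as the `np.newaxis` version.

```python
import numpy as np

A = np.arange(20).reshape(4, 5)
row = np.array([0, 1, 3])
col = [1, 2, 4]

with_newaxis = A[row[:, np.newaxis], col]   # (3, 1) and (3,) broadcast to (3, 3)
with_ix = A[np.ix_(row, col)]               # same selection via np.ix_

print(with_ix)
# [[ 1  2  4]
#  [ 6  7  9]
#  [16 17 19]]
print(np.array_equal(with_newaxis, with_ix))   # True
```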

You can also mix indexing types.

In [60]:
np.random.seed(2020)
A = np.random.randint(low=0, high=100, size=(8, 5))
A

# array([[96,  8, 67, 67, 91],
#        [ 3, 71, 56, 29, 48],
#        [32, 24, 74,  9, 51],
#        [11, 55, 62, 67, 69],
#        [48, 28, 20,  8, 38],
#        [84, 65,  1, 79, 69],
#        [74, 73, 62, 21, 29],
#        [90,  6, 38, 22, 63]])

array([[96,  8, 67, 67, 91],
       [ 3, 71, 56, 29, 48],
       [32, 24, 74,  9, 51],
       [11, 55, 62, 67, 69],
       [48, 28, 20,  8, 38],
       [84, 65,  1, 79, 69],
       [74, 73, 62, 21, 29],
       [90,  6, 38, 22, 63]])

Using slicing and fancy indexing:

In [61]:
A[1:3, [1, 2, 4]] # Grab rows 1 and 2, take columns 1, 2, 4
# array([[71, 56, 48],
#        [24, 74, 51]])

array([[71, 56, 48],
       [24, 74, 51]])

Using boolean and fancy indexing.

In [80]:
# Take the rows with index 1, 4, 5 and 6, and keep only the columns whose mean is greater than 50.
rows = np.array([1, 4, 5, 6])
print(rows)
col = A.mean(0) > 50
#print(col)

A[rows[:, np.newaxis], col]

# array([[ 3, 48],
#        [48, 38],
#        [84, 69],
#        [74, 29]])

[1 4 5 6]


array([[ 3, 48],
       [48, 38],
       [84, 69],
       [74, 29]])

In [81]:
rows[:, np.newaxis], col
#(array([[1],
#         [4],
#         [5],
#         [6]]), array([ True, False, False, False,  True]))


(array([[1],
        [4],
        [5],
        [6]]), array([ True, False, False, False,  True]))

# "Losing Your Loops": Fast Numerical Computing with NumPy 

From the PyCon 2015 conference, a [presentation](https://speakerdeck.com/jakevdp/losing-your-loops-fast-numerical-computing-with-numpy-pycon-2015) by [Jake VanderPlas](http://vanderplas.com).

Also available on [YouTube](https://www.youtube.com/watch?v=EEUXKG97YRw).

# References

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) A thorough tour into Numpy. 

- [Yoav Ram Numpy Notebook](https://github.com/yoavram/SciComPy/blob/master/notebooks/numpy.ipynb) If you want to skim through most topics in Numpy.

![Python logo](https://www.python.org/static/community_logos/python-logo.png)