<a href="https://colab.research.google.com/github/mkjubran/LearnPythonIT/blob/main/Lesson3_Numpy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<!--NAVIGATION-->



|                                                  -                                                  |                                                  -                                                  |                                                  -                                                  |
|-----------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
|               [Exercise 1 (rows and columns)](<#Exercise-1-(rows-and-columns&#41;>)               |         [Exercise 2 (row and column vectors)](<#Exercise-2-(row-and-column-vectors&#41;>)         |                        [Exercise 3 (diamond)](<#Exercise-3-(diamond&#41;>)                        |
|                 [Exercise 4 (multiplication table revisited)](<#Exercise-4-(multiplication-table-revisited&#41;>)               |         [Exercise 5 (column comparison)](<#Exercise-5-(column-comparison&#41;>)      | [Exercise 6 (first half second half)](<#Exercise-6-(first-half-second-half&#41;>)                        |
|                 [Exercise 7 (most frequent first)](<#Exercise-7-(most-frequent-first&#41;>)    |   |   |



# NumPy

[NumPy](https://docs.scipy.org/doc/numpy/) is a Python library for handling multi-dimensional arrays. It contains both the data structures needed for storing and accessing arrays, and operations and functions for computation using these arrays. Although the arrays are usually used for storing numbers, other types of data can be stored as well, such as strings. Unlike lists in core Python, NumPy's fundamental data structure, the array, must have the same data type for all its elements. This homogeneity allows highly optimized functions that use arrays as their inputs and outputs.

There are several uses for high-dimensional arrays in data analysis. For instance, they can be used to:

* store matrices, solve systems of linear equations, find eigenvalues/vectors, find matrix decompositions, and solve other problems familiar from linear algebra
* store multi-dimensional measurement data. For example, an element `a[i,j]` in a 2-dimensional array might store the temperature $t_{ij}$ measured at coordinates $i$, $j$ on a 2-dimensional surface.
* represent images and videos:

  * a gray-scale image can be represented as a two dimensional array
  * a color image can be represented as a three dimensional array, where the third dimension contains the color components red, green, and blue
  * a color video can be represented as a four dimensional array
* store tabular data. A 2-dimensional table might store a sequence of *samples*, and each sample might be divided into *features*. For example, we could measure the weather conditions once per day, and the conditions could include the temperature, direction and speed of wind, and the amount of rain. Then we would have one sample per day, and the features would be the temperature, wind, and rain. In the standard representation of this kind of tabular data, the rows correspond to samples and the columns correspond to features. We see more of this kind of data in the chapters on Pandas.

In this chapter we will go through:

* Creation of arrays
* Array types and attributes
* Accessing arrays with indexing and slicing
* Reshaping of arrays
* Combining and splitting arrays
* Fast operations on arrays
* Aggregations of arrays
* Rules of binary array operations


We start by importing the NumPy library, and we use the standard abbreviation `np` for it.

In [None]:
#1
import numpy as np

## Creation of arrays
There are several ways of creating NumPy arrays. One way is to give a (nested) list as a parameter to the `array` constructor:

In [None]:
#2
np.array([1,2,3])   

array([1, 2, 3])

Note that leaving out the brackets from the above expression, i.e. calling `np.array(1,2,3)` will result in an error.

A two dimensional array can be given by listing the rows of the array:

In [None]:
#3
np.array([[1,2,3], [4,5,6]])

array([[1, 2, 3],
       [4, 5, 6]])

Similarly, a three dimensional array can be described as a list of lists of lists:

In [None]:
#4
np.array([[[1,2], [3,4]], [[5,6], [7,8]]])

array([[[1, 2],
        [3, 4]],

       [[5, 6],
        [7, 8]]])

There are some helper functions to create common types of arrays:

In [None]:
np.zeros((3,4))

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

To specify that elements are `int`s instead of `float`s, use the parameter `dtype`:

In [None]:
np.zeros((3,4), dtype=int)

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])

Similarly `ones` initializes all elements to one, `full` initializes all elements to a specified value, and `empty` leaves the elements uninitialized:

In [None]:
np.ones((2,3))

array([[1., 1., 1.],
       [1., 1., 1.]])

In [None]:
np.full((2,3), fill_value=7)

array([[7, 7, 7],
       [7, 7, 7]])
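
The `empty` function mentioned above is not shown in the cells, so here is a minimal sketch of how it might be used. The name `scratch` is only illustrative. Since the initial contents of an `empty` array are arbitrary, the sketch inspects only the shape and type before assigning values.

```python
import numpy as np

# Allocate an array without initializing its contents.
# Whatever values it holds at this point are arbitrary.
scratch = np.empty((2, 3))
print(scratch.shape, scratch.dtype)   # (2, 3) float64

# Fill the array before reading from it.
scratch[:] = 0.5
print(scratch)
# [[0.5 0.5 0.5]
#  [0.5 0.5 0.5]]
```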

The `eye` function creates the identity matrix, that is, a matrix whose diagonal elements are set to one and whose non-diagonal elements are set to zero:

In [None]:
np.eye(5, dtype=int)

array([[1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1]])

The `arange` function works like the `range` function, but produces an array instead of a `range` object.

In [None]:
np.arange(0,10,2)

array([0, 2, 4, 6, 8])
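
As a small supplementary sketch of the comparison with `range`: `arange` accepts the same start, stop, and step arguments, and unlike `range` it also accepts non-integer values. The variable names are only illustrative.

```python
import numpy as np

# With a single argument the sequence starts from zero, as with range.
first_five = np.arange(5)
print(first_five)           # [0 1 2 3 4]

# Unlike range, arange also accepts non-integer arguments.
quarters = np.arange(0, 1, 0.25)
print(quarters)             # [0.   0.25 0.5  0.75]
```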

### Arrays with random elements

To test our programs we might use real data as input. However, real data is not always available, and it may take time to gather. We could instead generate random numbers to use as a substitute. NumPy can easily produce arrays of a wanted shape filled with random numbers sampled from several different distributions, of which we mention only a few below. Random data can simulate real data better than, for example, ranges or constant arrays. Sometimes we also need random numbers in our programs to choose a subset of real data (sampling). Below are a few examples.

In [None]:
#5
np.random.random((3,4))          # Elements are uniformly distributed from half-open interval [0.0,1.0)

array([[0.80043085, 0.89877704, 0.04152819, 0.11014208],
       [0.52079597, 0.94762113, 0.31677617, 0.0467695 ],
       [0.99369515, 0.70044009, 0.90097943, 0.4810772 ]])

In [None]:
#7
np.random.randint(-2, 10, (3,4))  # Elements are uniformly distributed integers from the half-open interval [-2,10)

array([[5, 1, 8, 5],
       [2, 4, 8, 0],
       [8, 5, 7, 5]])

Sometimes it is useful to be able to recreate exactly the same data in every run of our program. For example, if there is a bug in our program, which manifests itself only with certain input, then to debug our program it needs to behave deterministically. We can create random numbers deterministically, if we always start from the same starting point. This starting point is usually an integer, and we call it a *seed*. Example of use:

In [None]:
#8
np.random.seed(0)
print(np.random.randint(0, 100, 10))
print(np.random.normal(0, 1, 10))

[44 47 64 67 67  9 83 21 36 87]
[ 1.26611853 -0.50587654  2.54520078  1.08081191  0.48431215  0.57914048
 -0.18158257  1.41020463 -0.37447169  0.27519832]


If you run the above cell multiple times, it will always give the same numbers, unlike the earlier examples. Try rerunning them now!

The call to `np.random.seed` initializes the *global* random number generator. The calls `np.random.random`, `np.random.normal`, etc. all use this global random number generator. It is, however, possible to create new random number generators and use those to sample random numbers from a distribution. Example of usage:

In [None]:
new_generator = np.random.RandomState(seed=123)  
new_generator.randint(0, 100, 10)

array([66, 92, 98, 17, 83, 57, 86, 97, 96, 47])
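
The following sketch illustrates two properties of separate generators: two generators created with the same seed produce the same numbers, and drawing from a separate generator does not disturb the global one. The names `gen_a`, `gen_b`, `first`, and `second` are only illustrative.

```python
import numpy as np

# Two independent generators started from the same seed
gen_a = np.random.RandomState(seed=123)
gen_b = np.random.RandomState(seed=123)
sample_a = gen_a.randint(0, 100, 10)
sample_b = gen_b.randint(0, 100, 10)
print(np.array_equal(sample_a, sample_b))   # True

# Drawing from gen_a between two identical global draws
# does not change what the global generator returns.
np.random.seed(0)
first = np.random.randint(0, 100, 5)
np.random.seed(0)
gen_a.randint(0, 100, 10)                   # uses gen_a's own state
second = np.random.randint(0, 100, 5)
print(np.array_equal(first, second))        # True
```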

## Array types and attributes

An array has several attributes: `ndim` tells the number of dimensions, `shape` tells the size in each dimension, `size` tells the number of elements, and `dtype` tells the element type. Let's create a helper function to explore these attributes:

In [None]:
#9
def info(name, a):
    print(f"{name} has dim {a.ndim}, shape {a.shape}, size {a.size}, and dtype {a.dtype}:")
    print(a)

In [None]:
b=np.array([[0,1], [2,3]])
info("b", b)

b has dim 2, shape (2, 2), size 4, and dtype int64:
[[0 1]
 [2 3]]


In [None]:
c=np.array([[[0,1], [2,3]], [[4,5], [6,7]]])          # Creates a 3-dimensional array
info("c", c)

c has dim 3, shape (2, 2, 2), size 8, and dtype int64:
[[[0 1]
  [2 3]]

 [[4 5]
  [6 7]]]


You can think of axis-0 denoting which of the 2x2 “sheets” to select from. Then axis-1 specifies the row along the sheets, and axis-2 the column within the row:
   |       -- axis-2 ->
   |    |
   |  axis-1 [0, 1]
   |    |    [2, 3]
   |    V
axis-0
   |      -- axis-2 ->
   |    |
   |  axis-1 [4, 5]
   |    |    [6, 7]
   V    V
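
To make the roles of the three axes concrete, here is a minimal sketch that builds the same 2x2x2 array as in the diagram (using `arange` and `reshape`, which is just one convenient way to get these values) and selects along one axis at a time.

```python
import numpy as np

# Same values as in the diagram: two 2x2 sheets
c = np.arange(8).reshape(2, 2, 2)

sheet = c[1]          # axis-0: select the second sheet
row = c[1, 0]         # axis-1: select the first row of that sheet
element = c[1, 0, 1]  # axis-2: select the second column of that row

print(sheet)
# [[4 5]
#  [6 7]]
print(row)            # [4 5]
print(element)        # 5
```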

In [None]:
d=np.array([[1,2,3,4]])                # a row vector
info("d", d)

d has dim 2, shape (1, 4), size 4, and dtype int64:
[[1 2 3 4]]


## Indexing, slicing and reshaping

### Indexing
A one dimensional array behaves like a list in Python:

In [None]:
a=np.array([1,4,2,7,9,5])
print(a[1])
print(a[-2])

4
9


For a multi-dimensional array the index is a comma separated tuple instead of a single integer:

In [None]:
b=np.array([[1,2,3], [4,5,6]])
print(b)
print(b[1,2])    # row index 1, column index 2
print(b[0,-1])   # row index 0, column index -1

[[1 2 3]
 [4 5 6]]
6
3


In [None]:
# As with lists, modification through indexing is possible
b[0,0] = 10
print(b)

[[10  2  3]
 [ 4  5  6]]


Note that if you give only a single index to a multi-dimensional array, it indexes the first dimension of the array, that is the rows. For example:

In [None]:
print(b[0])    # First row
print(b[1])    # Second row

[10  2  3]
[4 5 6]


### Slicing
Slicing works similarly to lists, but now we can have slices in different dimensions:

In [None]:
print(a)
print(a[1:3])     # Elements at indices 1 and 2
print(a[::-1])    # Reverses the array

[1 4 2 7 9 5]
[4 2]
[5 9 7 2 4 1]


In [None]:
print(b)
print(b[:,0])     # First column
print(b[0,:])     # First row
print(b[:,1:])    # All columns except the first

[[10  2  3]
 [ 4  5  6]]
[10  4]
[10  2  3]
[[2 3]
 [5 6]]


We can even assign to a slice:

In [None]:
b[:,1:] = 7
print(b)

A common idiom is to extract rows or columns from an array:

In [None]:
print(b[:,0])    # First column
print(b[1,:])    # Second row

### Reshaping

When an array is reshaped, its number of elements stays the same, but they are reinterpreted to have a different shape. An example of this is to interpret a one dimensional array as a two dimensional array:

In [None]:
a=np.arange(9)
anew=a.reshape(3,3)
info("anew", anew)
info("a", a)

anew has dim 2, shape (3, 3), size 9, and dtype int64:
[[0 1 2]
 [3 4 5]
 [6 7 8]]
a has dim 1, shape (9,), size 9, and dtype int64:
[0 1 2 3 4 5 6 7 8]


In [None]:
d=np.arange(4)             # 1d array
dr=d.reshape(1,4)          # row vector
dc=d.reshape(4,1)          # column vector
info("d", d)
info("dr", dr)
info("dc", dc)

d has dim 1, shape (4,), size 4, and dtype int64:
[0 1 2 3]
dr has dim 2, shape (1, 4), size 4, and dtype int64:
[[0 1 2 3]]
dc has dim 2, shape (4, 1), size 4, and dtype int64:
[[0]
 [1]
 [2]
 [3]]


<div class="alert alert-warning">
Note that the 1d array and the row and column vectors, which are 2d arrays, are fundamentally different objects, even though they look similar. They behave differently when we combine or otherwise operate on arrays of different shapes, as we shall see in the next section and later in this material.
</div>
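
One simple way to see the difference is to look at how the three objects are indexed and how long they are. The sketch below reuses the names `d`, `dr`, and `dc` from the cell above.

```python
import numpy as np

d = np.arange(4)        # 1d array, shape (4,)
dr = d.reshape(1, 4)    # row vector, shape (1, 4)
dc = d.reshape(4, 1)    # column vector, shape (4, 1)

# The same element needs one index in 1d, but two indices in 2d
print(d[2], dr[0, 2], dc[2, 0])     # 2 2 2

# len gives the size of the first dimension only
print(len(d), len(dr), len(dc))     # 4 1 4

# A single index on the row vector returns its only row as a 1d array
print(np.array_equal(dr[0], d))     # True
```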

#### <div class="alert alert-info">Exercise 1 (rows and columns)</div>

Write two functions, `get_rows` and `get_columns`, that get a two dimensional array as parameter.
They should return the list of rows and columns of the array, respectively. The rows and columns should be one dimensional arrays. You may use the *transpose* operation, which flips rows to columns, in your solution. The transpose is available through the `T` attribute. Below is an array followed by its transpose:

[[0 1 9 9]
 [0 4 7 3]
 [2 7 2 0]
 [0 4 5 5]]
[[0 0 2 0]
 [1 4 7 4]
 [9 7 2 5]
 [9 3 0 5]]
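
As a small illustration of the transpose on a non-square array (this is not a solution to the exercise, and the names are only illustrative):

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
at = a.T    # T is an attribute, so no parentheses are needed

print(at)
# [[1 4]
#  [2 5]
#  [3 6]]
print(a.shape, at.shape)   # (2, 3) (3, 2)
```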


## Array concatenation, splitting and stacking

There are two ways of combining several arrays into one bigger array: `concatenate` and `stack`. The function `concatenate` takes n-dimensional arrays and returns an n-dimensional array, whereas `stack` takes n-dimensional arrays and returns an (n+1)-dimensional array. A few examples of these:

In [None]:
a=np.arange(2)
b=np.arange(2,5)
print(f"a has shape {a.shape}: {a}")
print(f"b has shape {b.shape}: {b}")
np.concatenate((a,b))  # concatenating 1d arrays

a has shape (2,): [0 1]
b has shape (3,): [2 3 4]


array([0, 1, 2, 3, 4])

In [None]:
c=np.arange(1,5).reshape(2,2)
print(f"c has shape {c.shape}:", c, sep="\n")
np.concatenate((c,c))   # concatenating 2d arrays

c has shape (2, 2):
[[1 2]
 [3 4]]


array([[1, 2],
       [3, 4],
       [1, 2],
       [3, 4]])

By default `concatenate` joins the arrays along axis 0. To join the arrays horizontally, add parameter `axis=1`:

In [None]:
np.concatenate((c,c), axis=1)

array([[1, 2, 1, 2],
       [3, 4, 3, 4]])

If you want to concatenate arrays with different dimensions, for example to add a new column to a 2d array, you must first reshape the arrays to have the same number of dimensions:

In [None]:
print("New row:")
print(np.concatenate((c,a.reshape(1,2))))
print("New column:")
print(np.concatenate((c,a.reshape(2,1)), axis=1))

New row:
[[1 2]
 [3 4]
 [0 1]]
New column:
[[1 2 0]
 [3 4 1]]


Use `stack` to create higher dimensional arrays from lower dimensional arrays:

In [None]:
print(b)
np.stack((b,b))

[2 3 4]


array([[2, 3, 4],
       [2, 3, 4]])

In [None]:
np.stack((b,b), axis=1)

array([[2, 2],
       [3, 3],
       [4, 4]])
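
The difference between the two functions is easiest to see from the shapes of the results. A minimal sketch, reusing the arrays `b` and `c` defined earlier in this section:

```python
import numpy as np

b = np.arange(2, 5)                # shape (3,)
c = np.arange(1, 5).reshape(2, 2)  # shape (2, 2)

# concatenate keeps the number of dimensions
joined = np.concatenate((b, b))
print(joined.shape)                # (6,)

# stack adds a new dimension at the position given by axis
stacked_rows = np.stack((b, b))
stacked_cols = np.stack((b, b), axis=1)
stacked_2d = np.stack((c, c))
print(stacked_rows.shape)          # (2, 3)
print(stacked_cols.shape)          # (3, 2)
print(stacked_2d.shape)            # (2, 2, 2)
```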

The inverse operation of `concatenate` is `split`. Its argument specifies either the number of equal parts the array is divided into, or the break points explicitly.

In [None]:
d=np.arange(12).reshape(6,2)
print("d:")
print(d)
d1,d2 = np.split(d, 2)
print("d1:")
print(d1)
print("d2:")
print(d2)

d:
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]]
d1:
[[0 1]
 [2 3]
 [4 5]]
d2:
[[ 6  7]
 [ 8  9]
 [10 11]]


In [None]:
d=np.arange(12).reshape(2,6)
print("d:")
print(d)
parts=np.split(d, (2,3,5), axis=1)
for i, p in enumerate(parts):
    print("part %i:" % i)
    print(p)

d:
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]]
part 0:
[[0 1]
 [6 7]]
part 1:
[[2]
 [8]]
part 2:
[[ 3  4]
 [ 9 10]]
part 3:
[[ 5]
 [11]]


#### <div class="alert alert-info">Exercise 2 (row and column vectors)</div>

Create function `get_row_vectors` that returns a list of rows from the input array of shape `(n,m)`, but this time the rows must have shape `(1,m)`. Similarly, create function `get_columns_vectors` that returns a list of columns (each having shape `(n,1)`) of the input matrix.

Example: for a 2x3 input matrix

```
 [[5 0 3]
  [3 7 9]]
```

the result should be

```
Row vectors: 
[array([[5, 0, 3]]), array([[3, 7, 9]])]
Column vectors: 
[array([[5],
        [3]]), 
 array([[0],
        [7]]), 
 array([[3],
        [9]])]
```


<hr/>

[[5 0 3]
 [7 3 9]]
Row vectors: [array([[5, 0, 3]]), array([[7, 3, 9]])]
Column vectors: [array([[5],
       [7]]), array([[0],
       [3]]), array([[3],
       [9]])]


#### <div class="alert alert-info">Exercise 3 (diamond)</div>
Create a function `diamond` that returns a two dimensional integer array where the `1`s form a diamond shape. The rest of the numbers are `0`. The function should get a parameter that tells the length of a side of the diamond. Do this using the `eye` and `concatenate` functions of NumPy and array slicing.

Example of usage:
```
print(diamond(3))
[[0 0 1 0 0]
 [0 1 0 1 0]
 [1 0 0 0 1]
 [0 1 0 1 0]
 [0 0 1 0 0]]
print(diamond(1))
[[1]]
```

[[0 0 1 0 0]
 [0 1 0 1 0]
 [1 0 0 0 1]
 [0 1 0 1 0]
 [0 0 1 0 0]]


## Fast computation using universal functions

In addition to providing a way to store and access multi-dimensional arrays, NumPy also provides several routines to perform computations on them. One of the reasons for the popularity of NumPy is that these computations can be very efficient, much more efficient than what Python can normally do. The biggest bottlenecks in efficiency are the loops, which can be iterated millions, billions, or even more times, so they should be as efficient as possible. What slows down loops in Python is the fact that Python is a dynamically typed language. That means that at each expression Python has to find out the types of the arguments of the operations. Let's consider the following loop:

In [None]:
L=[1, 5.2, "ab"]
L2=[]
for x in L:
    L2.append(x*2)
print(L2)

[2, 10.4, 'abab']


At each iteration of this loop Python has to find out the type of the variable `x`, which in this example can be an int, a float, or a string, and depending on this type call a different function to perform the "multiplication" by two. What makes NumPy efficient is the requirement that each element in an array must be of the same type. This homogeneity of arrays makes it possible to create *vectorized* operations, which don't operate on single elements, but on arrays (or subarrays). The previous example using vectorized operations of NumPy is shown below.

In [None]:
a=np.array([2.1, 5.0, 17.2])
a2=a*2
print(a2)

[ 4.2 10.  34.4]


Because each iteration uses identical operations and only the data differs, the whole loop can be executed in compiled machine code in one go, hence avoiding the overhead of Python's dynamic typing.

In addition to multiplication, several other mathematical operations are defined in vectorized form. The basic arithmetic operations are: addition `+`, subtraction `-`, negation `-`, multiplication `*`, division `/`, floor division `//`, exponentiation `**`, and remainder `%`.

These can be combined into more complicated expressions. An example:

In [None]:
b=np.array([-1, 3.2, 2.4])
print(-a**2 * b)

[   4.41   -80.    -710.016]


Several other mathematical functions are defined as well. A few examples of these can be found below.

In [None]:
print(np.abs(b))
print(np.cos(b))
print(np.exp(b))
print(np.log2(np.abs(b)))

In NumPy nomenclature these vector operations are called *ufuncs* (universal functions).
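
As a brief sketch of what this means in practice: the arithmetic operators on arrays correspond to named ufuncs such as `np.add` and `np.multiply`, and they give the same result as an explicit Python loop over the elements. The name `looped` is only illustrative.

```python
import numpy as np

a = np.array([2.1, 5.0, 17.2])
b = np.array([-1, 3.2, 2.4])

# The named functions behind the operators are ufunc objects
print(isinstance(np.add, np.ufunc))                 # True

# Operators and the corresponding ufuncs give identical results
print(np.array_equal(a + b, np.add(a, b)))          # True
print(np.array_equal(a * 2, np.multiply(a, 2)))     # True

# The same computation as an explicit (slow) Python loop
looped = np.array([x * 2 for x in a])
print(np.array_equal(looped, a * 2))                # True
```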

## Aggregations: max, min, sum, mean, standard deviation...

Aggregations allow us to condense the information in an array into just a few numbers.

In [None]:
np.random.seed(0)
a=np.random.randint(-100, 100, (4,5))
print(a)
print(f"Minimum: {a.min()}, maximum: {a.max()}")
print(f"Sum: {a.sum()}")
print(f"Mean: {a.mean()}, standard deviation: {a.std()}")

[[ 72 -53  17  92 -33]
 [ 95   3 -91 -79 -64]
 [-13 -30 -12  40 -42]
 [ 93 -61 -13  74 -12]]
Minimum: -91, maximum: 95
Sum: -17
Mean: -0.85, standard deviation: 58.39886557117355


Instead of aggregating over the whole array, we can aggregate over certain axes only as well:

In [None]:
np.random.seed(9)
b=np.random.randint(0, 10, (3,4))
print(b)
print("Column sums:", b.sum(axis=0))
print("Row sums:", b.sum(axis=1))

[[5 6 8 6]
 [1 6 4 8]
 [1 8 5 1]]
Column sums: [ 7 20 17 15]
Row sums: [25 19 15]


![aggregation](https://github.com/smabb/p/blob/master/aggregation.svg?raw=1)

<div class="alert alert-warning">
Note that most of the aggregation functions in NumPy have corresponding methods. In addition, the Python language has built-in functions `sum`, `min`, `max`, `any`, and `all` for sequences. Make sure you don't accidentally use these for arrays, since they may have slightly different semantics, and they will be significantly slower than NumPy's functions and methods.
</div>

| Python function | NumPy function | NumPy method |
| ----- | -------------- | ------------ |
| sum   | np.sum         | a.sum |
| -     | np.prod        | a.prod |
| -     | np.mean        | a.mean |
| -     | np.std         | a.std |
| -     | np.var         | a.var |
| min   | np.min         | a.min |
| max   | np.max         | a.max |
| -     | np.argmin      | a.argmin |
| -     | np.argmax      | a.argmax |
| -     | np.median      | - |
| -     | np.percentile  | - |
| any   | np.any         | a.any |
| all   | np.all         | a.all |
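
The difference in semantics shows up as soon as the array has more than one dimension. In the sketch below (variable names are only illustrative), the built-in functions treat a 2d array as a sequence of rows, whereas NumPy's functions aggregate over all elements by default.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

total_np = np.sum(a)     # aggregates over all elements
total_builtin = sum(a)   # adds the rows together, element by element
print(total_np)          # 21
print(total_builtin)     # [5 7 9]

print(np.max(a))         # 6
try:
    max(a)               # compares whole rows, which is ambiguous
except ValueError as err:
    print("max(a) failed with", type(err).__name__)
```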

 

Let's measure how much slower Python's `sum` function is compared to NumPy's equivalent when aggregating over an array:

In [None]:
a=np.arange(1000)
%timeit np.sum(a)

The slowest run took 64.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.12 µs per loop


In [None]:
%timeit sum(a)

The slowest run took 4.92 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 170 µs per loop


The speed of NumPy is partly due to the fact that its arrays must have same type for all the elements. This requirement allows some efficient optimizations.

## Broadcasting

We have seen that NumPy allows array operations that are performed element-wise. But NumPy also allows binary operations that don't require the two arrays to have the same shape. For example, we can add 4 to all elements of an array with the following expression:

In [None]:
np.arange(3) + np.array([4])

array([4, 5, 6])

In fact, because an array with only one element, say 4, can be thought of as a scalar 4, NumPy allows the following expression, which is equivalent to the above:

In [None]:
np.arange(3) + 4

array([4, 5, 6])

To get an idea of what operations are allowed, i.e. what shapes of the two arrays are *compatible*, it can be useful to think that before the binary operation is performed, NumPy tries to stretch the arrays to have the same shape. For example, in the above NumPy first stretched the array `np.array([4])` (or the scalar 4) to the array `np.array([4,4,4])` and then performed the element-wise addition. In NumPy this stretching is called *broadcasting*.

![broadcast](https://github.com/smabb/p/blob/master/broadcast.svg?raw=1)

The argument arrays can of course have higher dimensions, as the next example shows:

In [None]:
a=np.full((3,3), 5)
b=np.arange(3)
print("a:", a, sep="\n")
print("b:", b)
print("a+b:", a+b, sep="\n")

a:
[[5 5 5]
 [5 5 5]
 [5 5 5]]
b: [0 1 2]
a+b:
[[5 6 7]
 [5 6 7]
 [5 6 7]]


In this example the second argument was first broadcast to the array

In [None]:
np.array([[0, 1, 2],
       [0, 1, 2],
       [0, 1, 2]])

array([[0, 1, 2],
       [0, 1, 2],
       [0, 1, 2]])

and then the addition was performed. It may also be that both of the argument arrays need to be broadcast, as in the next example:

In [None]:
a=np.arange(3)
b=np.arange(3).reshape((3,1))
info("a", a)
info("b", b)
info("a+b", a+b)

To see what the arguments were broadcast to before the binary operation, the function `np.broadcast_arrays` can be used:

In [None]:
broadcasted_a, broadcasted_b = np.broadcast_arrays(a,b)
info("broadcasted_a", broadcasted_a)
info("broadcasted_b", broadcasted_b)
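
Since the two cells above are shown without output, here is a self-contained version of the same example with the results written as comments.

```python
import numpy as np

a = np.arange(3)                    # shape (3,)
b = np.arange(3).reshape((3, 1))    # shape (3, 1)

total = a + b                       # both arguments are broadcast to (3, 3)
print(total)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]]

broadcasted_a, broadcasted_b = np.broadcast_arrays(a, b)
print(broadcasted_a)                # a is repeated along the rows
# [[0 1 2]
#  [0 1 2]
#  [0 1 2]]
print(broadcasted_b)                # b is repeated along the columns
# [[0 0 0]
#  [1 1 1]
#  [2 2 2]]
```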

To determine if two arrays are broadcast-compatible, align the entries of their shapes such that their trailing dimensions are aligned, and then check that each pair of aligned dimensions satisfies either of the following conditions:

* the aligned dimensions have the same size
* one of the dimensions has a size of 1

The two arrays are broadcast-compatible if one of these conditions is satisfied for each pair of aligned dimensions. A dimension that is missing from the shorter shape is treated as having size 1. The examples below show pairs of shapes aligned in this way:


     array-1:         8
     array-2: 5 x 2 x 8
result-shape: 5 x 2 x 8

     array-1:     5 x 2
     array-2: 5 x 4 x 2
result-shape: INCOMPATIBLE

     array-1:     4 x 2
     array-2: 5 x 4 x 2
result-shape: 5 x 4 x 2

     array-1: 8 x 1 x 3
     array-2: 8 x 5 x 3
result-shape: 8 x 5 x 3

     array-1: 5 x 1 x 3 x 2
     array-2:     9 x 1 x 2
result-shape: 5 x 9 x 3 x 2

     array-1: 1 x 3 x 2
     array-2:     8 x 2
result-shape: INCOMPATIBLE

     array-1: 2 x 1
     array-2:     1
result-shape: 2 x 1
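
The compatibility rule can also be written out as a few lines of plain Python. The function `broadcast_shape` below is a hypothetical helper written for this illustration, not part of NumPy; it is a sketch of the rule as stated above and returns `None` for incompatible shapes.

```python
import numpy as np

def broadcast_shape(shape1, shape2):
    """Return the broadcast result shape, or None if the shapes are incompatible.

    Hypothetical helper for illustration only.
    """
    # Align the trailing dimensions by padding the shorter shape with ones
    n = max(len(shape1), len(shape2))
    s1 = (1,) * (n - len(shape1)) + tuple(shape1)
    s2 = (1,) * (n - len(shape2)) + tuple(shape2)
    result = []
    for d1, d2 in zip(s1, s2):
        if d1 == d2 or d1 == 1 or d2 == 1:
            # A dimension of size 1 is stretched to match the other one
            result.append(d2 if d1 == 1 else d1)
        else:
            return None
    return tuple(result)

print(broadcast_shape((8,), (5, 2, 8)))          # (5, 2, 8)
print(broadcast_shape((5, 2), (5, 4, 2)))        # None
print(broadcast_shape((5, 1, 3, 2), (9, 1, 2)))  # (5, 9, 3, 2)
print(broadcast_shape((1, 3, 2), (8, 2)))        # None

# Cross-check one case against what NumPy actually does
checked = (np.zeros((5, 1, 3, 2)) + np.zeros((9, 1, 2))).shape
print(checked)                                   # (5, 9, 3, 2)
```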


Finally an example of a situation where the two arrays are not compatible:

In [None]:
a=np.array([1,2,3])
b=np.array([4,5])
a+b                 # This does not work: the aligned dimensions 3 and 2 differ and neither has size 1


ValueError: operands could not be broadcast together with shapes (3,) (2,)

#### <div class="alert alert-info">Exercise 4 (multiplication table revisited)</div>
Write function `multiplication_table` that gets a positive integer `n` as parameter. The function should return an array with shape (n,n). The element at index `(i,j)` should be `i*j`. Don't use `for` loops! In your solution, rely on broadcasting, the `np.arange` function, reshaping and vectorized operators. Example of usage:
```
print(multiplication_table(4))
[[0 0 0 0]
 [0 1 2 3]
 [0 2 4 6]
 [0 3 6 9]]
```
<hr/>

[[0 0 0 0]
 [0 1 2 3]
 [0 2 4 6]
 [0 3 6 9]]


## Comparisons and masking

Just like NumPy allows element-wise arithmetic operations between arrays, for example addition of two arrays, it is also possible to compare two arrays element-wise. For example

In [None]:
a=np.array([1,3,4])
b=np.array([2,2,7])
c = a < b
print(c)

[ True False  True]


Now we can query whether all comparisons resulted in `True`, or whether some comparison resulted in `True`:

In [None]:
print(c.all())   # were all True
print(c.any())   # was some comparison True

False
True


We can also count the number of comparisons that were `True`. This solution relies on the interpretation that `True` corresponds to 1 and `False` corresponds to 0:

In [None]:
print(np.sum(c))

2


Because the broadcasting rules apply also to comparison, we can write

In [None]:
print(a > 0)

[ True  True  True]


To try these operations on real data, we download food price data for the State of Palestine, containing monthly prices from 2007 to 2019. We use the Pandas library, which we will cover later during this course, to load the data.

In [None]:
import pandas as pd
a1=pd.read_csv("http://data.humdata.org/dataset/80999ea7-6b3d-43eb-ac9b-343579548504/resource/7c3802bc-c2a5-4c24-aa9f-299661706293/download/wfp_food_prices_state-of-palestine.csv")['price'].values

In [None]:
a=a1[1:].astype(float)
a1[2:]

array(['2.0', '2.0', '2.0', ..., '2.5', '2.5', '2.5'], dtype=object)

In [None]:

print("Number of items with price below 5", np.sum(a < 5))

Number of items with price below 5 5865


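In core Python we can combine truth values using the `and`, `or`, and `not` keywords. For boolean arrays, however, we have to use the element-wise operators `&`, `|`, and `~`, respectively. Because these operators bind more tightly than comparisons, each comparison must be wrapped in parentheses. The example below counts the elements that are greater than 0 or less than 10; since every number satisfies at least one of these conditions, the result is simply the number of elements: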

In [None]:
np.sum((0 < a) | (a < 10))     

16891
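
To see how the three operators differ, here is a minimal sketch on a small made-up array. The values in `prices` are hypothetical and are not taken from the downloaded data.

```python
import numpy as np

# Hypothetical prices, chosen only for illustration
prices = np.array([0.5, 2.0, 7.5, 12.0, 30.0])

# & : both conditions must hold (prices strictly between 1 and 10)
between = np.sum((prices > 1) & (prices < 10))
# | : at least one condition must hold (prices below 1 or above 10)
outside = np.sum((prices < 1) | (prices > 10))
# ~ : negates the condition (prices that are not below 10)
not_below = np.sum(~(prices < 10))

print(between, outside, not_below)   # 2 3 2
```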

Another use of boolean arrays is that they can be used to select a subset of elements. For example

In [None]:
c = a > 2
print(a[:10])
print(c[:10])         # print only the first ten elements
print(a[c][:10])      # Select only the first ten > 2 elements 

[2.   2.   2.   2.   2.   2.   2.   2.5  2.5  2.53]
[False False False False False False False  True  True  True]
[2.5  2.5  2.53 2.58 2.6  2.56 2.56 2.61 2.67 2.78]


This operation is called *masking*. It can also be used to assign a new value. For example, the following zeroes out the non-matching elements:

In [None]:
print(a[:10])
a[~c] = 0
print(a[:10])

[2.   2.   2.   2.   2.   2.   2.   2.5  2.5  2.53]
[0.   0.   0.   0.   0.   0.   0.   2.5  2.5  2.53]


#### <div class="alert alert-info">Exercise 5 (column comparison)</div>

Write function `column_comparison` that gets a two dimensional array as parameter. The function should return a new array containing those rows from the input that have the value in the second column larger than in the second last column. You may assume that the input contains at least two columns. Don't use loops, but instead vectorized operations. 

For array

```
 [[8 9 3 8 8]
 [0 5 3 9 9]
 [5 7 6 0 4]
 [7 8 1 6 2]
 [2 1 3 5 8]]
```
the result would be
```
 [[8 9 3 8 8]
 [5 7 6 0 4]
 [7 8 1 6 2]]
```
<hr/>

[[8 9 3 8 8]
 [5 7 6 0 4]
 [7 8 1 6 2]]


#### <div class="alert alert-info">Exercise 6 (first half second half)</div>

Write function `first_half_second_half` that gets a two dimensional array of shape `(n,2*m)` as a parameter. The input array has `2*m` columns. The output from the function should be a matrix with those rows from the input that have the sum of the first `m` elements larger than the sum of the last `m` elements on the row. Your solution should call the `np.sum` function  exactly twice.

Example of usage:
```python
a = np.array([[1, 3, 4, 2],
              [2, 2, 1, 2]])
first_half_second_half(a)
array([[2, 2, 1, 2]])
```
<hr/>

array([[2, 2, 1, 2]])

## Fancy indexing

Using indexing we can get a single element from an array. If we wanted multiple (not necessarily contiguous) elements, we would have to index several times:

In [None]:
np.random.seed(0)
a=np.random.randint(0, 20,20)
a2=np.array([a[2], a[5], a[7]])
print(a)
print(a2)

[12 15  0  3  3  7  9 19 18  4  6 12  1  6  7 14 17  5 13  8]
[ 0  7 19]


That's quite verbose. *Fancy indexing* provides a concise syntax for accessing multiple elements:

In [None]:
idx=[2,5,7]           # List of indices
print(a[idx])         # In fancy indexing in place of a single index, we can provide a list of indices
print(a[[2,5,7]])     # Or directly

[ 0  7 19]
[ 0  7 19]


We can also assign to multiple elements through fancy indexing:

In [None]:
a[idx] = -1
print(a)

[12 15 -1  3  3 -1  9 -1 18  4  6 12  1  6  7 14 17  5 13  8]


Fancy indexing works also for higher dimensional arrays:

In [None]:
b=np.arange(16).reshape(4,4)
print(b)
row=np.array([0,2])
col=np.array([1,3])
print(b[row, col])

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
[ 1 11]


One can also combine normal indexing, slicing and fancy indexing:

In [None]:
b[:,[0,2]]

array([[ 0,  2],
       [ 4,  6],
       [ 8, 10],
       [12, 14]])
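
Note that `b[row, col]` above picks the elements pairwise, that is `b[0,1]` and `b[2,3]`. If the goal is instead the sub-array formed by every combination of the given rows and columns, one possible approach is to let the index arrays broadcast against each other, as in this sketch:

```python
import numpy as np

b = np.arange(16).reshape(4, 4)
row = np.array([0, 2])
col = np.array([1, 3])

# Pairwise selection: elements (0,1) and (2,3)
pairs = b[row, col]
print(pairs)          # [ 1 11]

# Reshaping row to a column vector makes the two index arrays
# broadcast to shape (2, 2), giving all row/column combinations
block = b[row.reshape(2, 1), col]
print(block)
# [[ 1  3]
#  [ 9 11]]
```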

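## Sorting arrays

The function `np.sort` returns a sorted copy of an array, whereas the method `sort` sorts the array in place: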



In [None]:
a=np.array([2,1,4,3,5])
print(np.sort(a))          # Does not modify the argument
print(a)

[1 2 3 4 5]
[2 1 4 3 5]


In [None]:
a.sort()            # Modifies the argument
print(a)

[1 2 3 4 5]


In [None]:
b=np.random.randint(0,10, (4,4))
print(b)

[[9 4 3 0]
 [3 5 0 2]
 [3 8 1 3]
 [3 3 7 0]]


In [None]:
np.sort(b, axis=0)           # sort each column

array([[3, 3, 0, 0],
       [3, 4, 1, 0],
       [3, 5, 3, 2],
       [9, 8, 7, 3]])

In [None]:
np.sort(b, axis=1)           # Sort each row

array([[0, 3, 4, 9],
       [0, 2, 3, 5],
       [1, 3, 3, 8],
       [0, 3, 3, 7]])

Note that each row or column is sorted independently.

A related operation is the `argsort` function, which doesn't sort the elements, but returns the indices of the sorted elements. An example will demonstrate this:

In [None]:
a=np.array([23,12,47,35,59])
print("Array a:", a)
idx = np.argsort(a)
print("Indices:", idx)

Array a: [23 12 47 35 59]
Indices: [1 0 3 2 4]


These indices say that the smallest element of the array is in position 1 of `a`, the second smallest element is in position 0 of `a`, the third smallest is in position 3, and so on. We can verify that these indices will indeed order the elements using fancy indexing:

In [None]:
print(a[idx])

[12 23 35 47 59]
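
A common use of `argsort` together with fancy indexing is to reorder the rows of a two dimensional array according to one of its columns. The array `table` below is made up for this illustration.

```python
import numpy as np

# Made-up data: the first column is the sort key
table = np.array([[3, 10],
                  [1, 30],
                  [2, 20]])

order = np.argsort(table[:, 0])   # row indices in increasing order of column 0
print(order)                      # [1 2 0]

ascending = table[order]          # fancy indexing reorders whole rows
print(ascending)
# [[ 1 30]
#  [ 2 20]
#  [ 3 10]]

descending = table[order[::-1]]   # reversed index array gives decreasing order
print(descending)
# [[ 3 10]
#  [ 2 20]
#  [ 1 30]]
```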


#### <div class="alert alert-info">Exercise 7 (most frequent first)</div>

**<span style="color:red">Note:</span>** This exercise is fairly difficult. 

Write function `most_frequent_first` that gets a two dimensional array and an index `c` of a column as parameters. The function should then return the array whose rows are sorted based on column `c`, in the following way. Rows are ordered so that those rows with the most frequent element in column `c` come first, then come the rows with the second most frequent element in column `c`, and so on. Therefore, the values outside column `c` don't affect the ordering in any way.

Example of usage:
```
a:
 [[5 0 3 3 7 9 3 5 2 4]
 [7 6 8 8 1 6 7 7 8 1]
 [5 9 8 9 4 3 0 3 5 0]
 [2 3 8 1 3 3 3 7 0 1]
 [9 9 0 4 7 3 2 7 2 0]
 [0 4 5 5 6 8 4 1 4 9]
 [8 1 1 7 9 9 3 6 7 2]
 [0 3 5 9 4 4 6 4 4 3]
 [4 4 8 4 3 7 5 5 0 1]
 [5 9 3 0 5 0 1 2 4 2]]
print(most_frequent_first(a, -1))
 [[4 4 8 4 3 7 5 5 0 1]
 [2 3 8 1 3 3 3 7 0 1]
 [7 6 8 8 1 6 7 7 8 1]
 [5 9 3 0 5 0 1 2 4 2]
 [8 1 1 7 9 9 3 6 7 2]
 [9 9 0 4 7 3 2 7 2 0]
 [5 9 8 9 4 3 0 3 5 0]
 [0 3 5 9 4 4 6 4 4 3]
 [0 4 5 5 6 8 4 1 4 9]
 [5 0 3 3 7 9 3 5 2 4]]
```

If we look at the last column, we see that the number 1 appears three times, then both numbers 2 and 0 appear twice, and lastly numbers 3, 9, and 4 appear only once. Note that, for example, among those rows that contain in column `c` a number that appears twice in column `c`, the order can be arbitrary.

Hint: the function `np.unique` may be useful.

<hr/>

## Summary 
* NumPy is efficient because the same operation can be applied quickly to many elements when all the elements have the same type. Such operations are called vectorized operations.
* We know how to create, reshape, access, combine, split, and aggregate arrays.
* We found that comparisons are also vectorized operations, and that the result of a comparison can be used to mask (i.e. restrict) further operations on arrays.
* We can select a list of columns using fancy indexing.
