# NumPy Introduction

![image](../../images/numpy_logo.png)

[**NumPy**](https://numpy.org/) is a Python package for large-scale data processing. It provides a high-performance multidimensional array object, called `ndarray`, and tools for working with these arrays.

**Arrays** are used very frequently in **Machine Learning** and **Data Science**. In `Python`, **lists** can play the role of arrays and work well for a single row of data, but they don't scale well. To work with large datasets, we need something much faster.

The main reason **NumPy** is so much faster than **lists** is that a **NumPy** array stores its elements in a single contiguous block of memory, which makes them much faster to access.
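To make the contiguous-memory point concrete, here is a minimal sketch that inspects how an array is laid out. The array `demo` and its `int64` dtype are illustrative choices, not part of the original notebook.

```python
import numpy as np

# Illustrative array: five 64-bit integers
demo = np.arange(5, dtype=np.int64)

# Every element occupies the same number of bytes
print(demo.itemsize)

# The whole array therefore occupies itemsize * size bytes
print(demo.nbytes)

# The stride is the number of bytes to step to reach the next element
print(demo.strides)

# Flag reporting whether the data sits in one contiguous (C-order) block
print(demo.flags["C_CONTIGUOUS"])
```

Because every element has the same size and sits right next to its neighbour, NumPy can find element `i` with simple arithmetic instead of following a pointer per element, as a list has to.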

**Fun fact**: **NumPy** is partially written in `Python`, but most of its performance-critical parts are implemented in `C`/`C++`.

# Install and Use NumPy for the first time

`numpy` should be installed automatically as part of `anaconda`. If for some reason it is missing, you can enter the following command in an `anaconda prompt` or a `terminal` to install `numpy`:

```bash
pip install numpy
```

You can verify that it is installed by entering the following code in a notebook cell:

```python
try:
    import numpy as np
    print(f"The version of numpy is: {np.__version__}")
except ImportError:
    print("numpy is not installed!")
```

It is extremely common to use `np` as the alias for `numpy` in `Python`.

```python
import numpy as np
```

`as` is used to assign an alias to a module. The alias is used to refer to the module in the rest of the code.

First, let's go through an example to see the power of `numpy`. We'll start with 2 lists having `1,000,000` random numbers each.

```python
from random import randrange

list1 = [randrange(1,1000000) for i in range(1000000)]
list2 = [randrange(1,1000000) for i in range(1000000)]

print(f"list1 contains {len(list1):,} elements")
print(f"list2 contains {len(list2):,} elements")
```

**Exercise**

1. Let's calculate the product of each pair of elements from the two lists `list1` and `list2`. In other words, given an index `i`, we want to calculate the product of `list1[i]` and `list2[i]`.
2. Save the list of products in a new list called `product_list`.

**Hint**:
- Use a `for` loop to iterate through the lists.
- You can use `zip()` to iterate through both lists at the same time.
- Try a list comprehension.

In [None]:
%%timeit
# [TODO]
# for loop ONLY approach


In [None]:
# show first 5 elements of product_list


In [None]:
%%timeit
# [TODO]
# list comprehension + zip approach


In [None]:
# show first 5 elements of product_list


`%%timeit` is a magic command that runs the code in the cell and measures the time it takes to run.
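`%%timeit` only works inside IPython or Jupyter. Outside a notebook, the standard-library `timeit` module does a similar job. The sketch below times a deliberately small, hypothetical workload (`sum_of_squares`) rather than the exercise itself.

```python
import timeit

# Hypothetical workload: sum the squares of the integers 0..999
def sum_of_squares():
    return sum(i * i for i in range(1000))

# Call the function 1,000 times and return the total elapsed time in seconds
elapsed = timeit.timeit(sum_of_squares, number=1000)

print(f"Total: {elapsed:.4f} s, per run: {elapsed / 1000 * 1e6:.2f} µs")
```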

As you can see, `zip()` and **list comprehension** give us an elegant way of calculating the pair-wise product of the two lists.
However, we **cannot multiply the two lists together directly**. For that, we need `numpy`.

```python
list1 * list2 # will throw an error
```
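To see the error without waiting on a million elements, here is a small sketch with two hypothetical three-element lists, `small1` and `small2`. It also shows what `*` does mean for a list: repetition by an integer.

```python
small1 = [1, 2, 3]
small2 = [4, 5, 6]

# Multiplying a list by an integer repeats it; it does not scale the values
print(small1 * 2)  # [1, 2, 3, 1, 2, 3]

# Multiplying a list by another list is not defined and raises TypeError
try:
    small1 * small2
except TypeError as err:
    print(f"TypeError: {err}")
```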

With `numpy`, we can easily multiply the 2 `numpy` arrays together. Let's import `numpy` and convert the 2 lists into `numpy.ndarray`.

```python
import numpy as np

np_list1 = np.array(list1)
np_list2 = np.array(list2)

print(type(np_list1), type(np_list2))
```

In [None]:
%%timeit


In [None]:
# show first 5 elements of np_product_list


**Exercise**

Let's build a summary table that looks like the one below based on the speed test we did earlier.

![image](../../images/np_speedtest_table.png)

<font size="5">[TODO] 📖</font>


**Exercise**

Why do we need `list` when we already have `numpy`?

**Hint**: Can a `numpy` array contain elements of different types?

<font size="5">[TODO] 📖</font>


# Dimensions in Arrays

[**Dimensions**](https://www.splashlearn.com/math-vocabulary/geometry/dimensions) in mathematics are the measure of the size or distance of an object or region or space in one direction. In simpler terms, it is the measurement of the length, width, and height of anything. 

We have probably all come across the **Cartesian** coordinate system. Below is an illustration of a 2-D coordinate system.

![image](../../images/1920px-Cartesian-coordinate-system.svg.png)

Types of figures based on dimensions:
- A point is a 0-dimensional object.
- A line is a 1-dimensional object.
- A plane is a 2-dimensional object.
- A volume is a 3-dimensional object.

![image](../../images/Dimensions-2.webp)

Similarly, arrays can have multiple dimensions.

**0-D Arrays**, or scalars, are the simplest arrays. For example, `np.array(1)` is a 0-D array.

```python
zero_d_array = np.array(1)

print(zero_d_array)
print(f"Shape of zero_d_array: {zero_d_array.shape}")
```

As shown in the above example, `shape` is an attribute of an `np.ndarray` object that returns the shape of the array. The **shape of an array** is a tuple giving the number of elements in each dimension; a 0-D array has no dimensions, so its shape is the empty tuple `()`. This is a very important attribute that you'll be using a lot when analysing data.

**1-D Arrays**, or vectors, are 1-dimensional arrays. For example, `np_product_list`, which we created above, is a 1-D array.

```python
print(f"Shape of np_product_list: {np_product_list.shape}")
```

We can see that `np_product_list` has 1 dimension containing `1000000` elements.

Below is another example:

![image](../../images/nparray_creation.png)

**2-D Arrays**, or matrices, are 2-dimensional arrays. For example, `np.array([[1, 2, 3], [4, 5, 6]])` is a 2-D array.

```python
# A 2-D array is built from a list of rows, so note the outer pair of brackets
matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(matrix)
print(f"Shape of matrix: {matrix.shape}")
print(f"Number of dimensions of matrix: {matrix.ndim}")
```

`matrix.shape` returns a tuple showing that the first dimension has 2 elements while the second dimension has 3 elements.

`matrix.ndim` returns the number of dimensions of the array.

**3-D Arrays** are arrays that have matrices (2-D arrays) as their elements. For example, `np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])` is a 3-D array.

```python
# Two 2 x 2 matrices stacked together give a 3-D array
three_d_array = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

print(three_d_array)
print(f"Shape of three_d_array: {three_d_array.shape}")
print(f"Number of dimensions of three_d_array: {three_d_array.ndim}")
```

**Exercise**

Without using the `ndim` attribute, can you figure out the number of dimensions of the following arrays? Run the code afterwards to check your answers.

```python
a = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
b = np.array([[1]])
c = np.array([[[1], [2], [3]], [[4], [5], [6]]])
d = np.array([[[[1], [2], [3]], [[4], [5], [6]]]])

print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
```

In [None]:
# [TODO]


**Exercise**

Without using the `ndim` attribute, write a function that returns the number of dimensions of an array.

**Hint**: As with a **list**, the length of a **tuple** can be obtained using the `len()` function.

In [None]:
# [TODO]


# Access elements in an Array

Similar to **list**, we can access elements in an array using indexing or slicing.

![image](../../images/nparray_slicing.png)

```python
arr = np.array([1, 2])

print(f"Element at index 0: {arr[0]}")
print(f"Elements from index 0 to 2 (not inclusive): {arr[0:2]}")
```

Similarly, by providing the indices of the element, we can easily access elements in a 2-D array.

![image](../../images/nparray_2d.png)

```python
two_d_array = np.array([[1, 2, 3], [4, 5, 6]])

print(f"Element at index (0, 2): {two_d_array[0, 2]}")
```
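Indexing and slicing can also be combined per dimension. The following is a small sketch on an illustrative array named `grid` (a name chosen here, not taken from the notebook):

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])  # illustrative 2 x 3 array

print(grid[1])       # entire second row: [4 5 6]
print(grid[:, 1])    # entire second column: [2 5]
print(grid[0, 1:])   # first row, from column 1 onwards: [2 3]
print(grid[-1, -1])  # last row, last column: 6
```

A `:` on its own means "everything along this dimension", and negative indices count from the end, just like with lists.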

# Modify Arrays

Similar to **list**, we can also modify the arrays by assigning new values to the elements.

```python
two_d_array = np.array([[1, 2, 3], [4, 5, 6]])

print("Before: ")
print(two_d_array)

two_d_array[0, 2] = 10
print("----------------------")
print("After: ")
print(two_d_array)
```

# Sort Arrays

We often need to sort the elements of an array. Luckily, there is a function called `np.sort()` for that purpose.

```python
arr = np.array([3, 1, 2, 3, 4, 8, 1, 4, 7, 0])

print("Before: ")
print(arr)
print("----------------------")
print("After: ")
print(np.sort(arr))
```
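One detail worth knowing: the function `np.sort()` returns a sorted copy, whereas the array method `.sort()` sorts in place and returns `None`. A minimal sketch with an illustrative array `values`:

```python
import numpy as np

values = np.array([3, 1, 2])  # illustrative array

sorted_copy = np.sort(values)  # returns a new, sorted array
print(sorted_copy)             # [1 2 3]
print(values)                  # original is unchanged: [3 1 2]

result = values.sort()         # sorts in place and returns None
print(result)                  # None
print(values)                  # [1 2 3]

# Reverse the sorted copy to get descending order
print(np.sort(values)[::-1])   # [3 2 1]
```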

Can `np.sort()` be used for 2-D arrays?

```python
two_d_array = np.array([[2, 1, 7], [6, 5, 4]])

print("Before: ")
print(two_d_array)

print("----------------------")
print("After: ")
print(np.sort(two_d_array))
```

We can see that `np.sort()` sorts along the last axis by default, so each row of a 2-D array is sorted independently in ascending order.

`np.sort()` can do a lot more than that, but we don't need to worry about it here. You can read more about `np.sort()` [here](https://numpy.org/doc/stable/reference/generated/numpy.sort.html).
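As a brief illustration of one of those extras, the `axis` parameter controls the direction of the sort. The sketch below reuses the values of the 2-D example under the illustrative name `grid`:

```python
import numpy as np

grid = np.array([[2, 1, 7], [6, 5, 4]])

by_row = np.sort(grid)           # default axis=-1: sort within each row
by_col = np.sort(grid, axis=0)   # sort within each column
flat = np.sort(grid, axis=None)  # flatten first, then sort

print(by_row)  # [[1 2 7], [4 5 6]]
print(by_col)  # [[2 1 4], [6 5 7]]
print(flat)    # [1 2 4 5 6 7]
```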

# Searching Arrays

It is also very common to search for elements in an array. We will learn how to do that using the `np.where()` function.

**NOTE**: `np.where()` is a very commonly used function among data professionals.

In the following example, let's say we have an array `heights` containing the heights (in m) of applicants for a flight attendant position. As this job has a strict height requirement (≥ 1.70 m), we need to find out which applicants qualify based on their heights. We will use `np.where()` to find the indices, within the `heights` array, of the qualified applicants.

```python
heights = np.array([1.63, 1.68, 1.71, 1.89, 1.79, 1.56, 1.7])

print(np.where(heights >= 1.7))
```

As you can see, `np.where()` returns the indices of all the elements that satisfy the condition, wrapped in a tuple with one index array per dimension.
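The result is a tuple because `np.where()` returns one index array per dimension. The sketch below unpacks it for `heights` and then for a small 2-D array, `grid`, whose values are made up for illustration.

```python
import numpy as np

heights = np.array([1.63, 1.68, 1.71, 1.89, 1.79, 1.56, 1.7])

result = np.where(heights >= 1.7)
print(type(result), len(result))  # a tuple with a single entry for a 1-D array

indices = result[0]               # unpack the index array
print(indices)                    # [2 3 4 6]

# For a 2-D array, the tuple holds row indices and column indices
grid = np.array([[1, 8], [9, 2]])  # illustrative values
rows, cols = np.where(grid > 5)
print(rows, cols)                  # [0 1] [1 0]
```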

**Exercise**

Let's find out the indices of the odd elements in the following array.

```python
numbers = np.array([10, 20, 11, 40, 1, 54, 3])
```

In [None]:
# [TODO]


**Exercise**

What else can `np.where()` do? Can we print the **docstring** of `np.where()`? 

In [None]:
# [TODO]


As you can see, `np.where()` takes a condition as an argument and can also return `x` or `y` depending on the condition.

Thus, going back to the example about flight attendant applications above, we can use `np.where()` to generate a new array containing the result of each applicant's application based on their height.

```python
np.where(heights >= 1.7, "Accepted", "Rejected")

# array(['Rejected', 'Rejected', 'Accepted', 'Accepted', 'Accepted', 'Rejected', 'Accepted'], dtype='<U8')
```

You can read more about `np.where()` [here](https://numpy.org/doc/stable/reference/generated/numpy.where.html).

What if we don't just want the indices of the elements that satisfy the condition, but the actual elements themselves?

Filtering an array is very simple. You can use the result returned by `np.where()` to index the original array and get the elements that satisfy the condition.

Let's say we want to return the heights (in m) of all the applicants that are qualified.

```python
heights[np.where(heights >= 1.7)]
```

There is an even simpler way to do the exact same thing. We can drop `np.where()` completely and write the condition inside the square brackets.

```python
heights[heights >= 1.7]
```
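This works because the comparison itself produces a boolean array (a mask), and indexing with a mask keeps only the positions that are `True`. A short sketch; the upper bound of `1.8` is a hypothetical extra rule added purely for illustration:

```python
import numpy as np

heights = np.array([1.63, 1.68, 1.71, 1.89, 1.79, 1.56, 1.7])

# The comparison yields one boolean per element
mask = heights >= 1.7
print(mask)

# Combine conditions with & (and) or | (or); each condition needs parentheses
in_range = heights[(heights >= 1.7) & (heights < 1.8)]
print(in_range)  # keeps 1.71, 1.79 and 1.7
```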

**Exercise**

How many applicants are qualified for the flight attendant position?

In [None]:
# [TODO]


# Array Arithmetic

We'll use the following 2 arrays to demonstrate some of the basic arithmetic operations that we can perform on arrays.

```python
arr1 = np.array([1, 2, 3])
arr2 = np.array([0, 5, 6])
```

**Array addition** is performed element-wise. We can also add two arrays using the `np.add()` function.

![image](../../images/nparray_addition.png)

```python
arr1 + arr2

np.add(arr1, arr2)
```

**Array subtraction** is performed element-wise. We can also subtract one array from another using the `np.subtract()` function.

![image](../../images/nparray_subtraction.png)

```python
arr1 - arr2

np.subtract(arr1, arr2)
```

**Exercise**

What will be the result if we multiply `arr1` and `arr2` together?

In [None]:
# [TODO]


In [None]:
# [TODO]


**Array multiplication** is also performed element-wise. We can also multiply two arrays using the `np.multiply()` function.

**Exercise**

What will be the result if we divide `arr1` by `arr2`?

In [None]:
# [TODO]


In [None]:
# [TODO]


Unsurprisingly, **array division** is also performed element-wise, and we can use the `np.divide()` function to perform the division.
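One caveat: `arr2` contains a `0`, and dividing by zero in NumPy emits a `RuntimeWarning` and yields `inf` rather than raising an exception. Below is a minimal sketch with hypothetical arrays `x` and `y`, using `np.errstate` to silence the warning locally.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # hypothetical numerator
y = np.array([0.0, 4.0, 6.0])  # hypothetical denominator containing a zero

# Suppress the divide-by-zero warning only inside this block
with np.errstate(divide="ignore"):
    quotient = np.divide(x, y)

print(quotient)            # the first element is inf
print(np.isinf(quotient))  # marks which results are infinite
```

Checking with `np.isinf()` afterwards is one way to spot such values before they leak into later calculations.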

# Other useful functions

There are a number of functions that we can use to aggregate the elements of an array. They are very commonly used in data science. 
- `np.sum()`
- `np.mean()`
- `np.min()`
- `np.max()`
- `np.median()`
- `np.std()`

All of these functions can be applied to arrays of different dimensions. In my experience, the most common use case is to aggregate the elements of a 1-D array.

In [None]:
arr = ...

print(f"Sum of all elements: {}")
print(f"Average of all elements: {}")
print(f"The smallest element: {}")
print(f"The largest element: {}")
print(f"The median of the array: {}")
print(f"The standard deviation of the array: {}")
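For arrays with more than one dimension, these functions accept an `axis` argument that controls the direction of the aggregation. The sketch below uses an illustrative 2 x 3 array named `scores`; note that `np.std()` computes the population standard deviation by default.

```python
import numpy as np

scores = np.array([[1, 2, 3], [4, 5, 6]])  # illustrative 2 x 3 array

print(np.sum(scores))                  # 21, over all elements
print(np.sum(scores, axis=0))          # [5 7 9], one sum per column
print(np.sum(scores, axis=1))          # [6 15], one sum per row
print(np.mean(scores))                 # 3.5
print(np.min(scores), np.max(scores))  # 1 6
print(np.median(scores))               # 3.5
print(np.std(scores))                  # population standard deviation
```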

# Transposing and Reshaping Arrays

**Transposing** is a simple operation that swaps the axes of an array: for a matrix, rows become columns and columns become rows. To transpose a matrix, we can use either the `.T` attribute or the `np.transpose()` function. It's very common to use `.T`.

![image](../../images/nparray_transpose.png)

Let's look at an example:

```python
arr = np.array([[1, 4], [2, 5], [3, 6]])

print("Before transposing:")
print(arr)
print(f"Original array has {arr.shape} shape.")
print("\n")
print("After transposing:")
print(arr.T)
print(f"Transposed array has {arr.T.shape} shape.")
```

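**Reshaping** is a slightly more involved operation: it changes the shape of an array without changing its data, so the new shape must hold exactly the same number of elements. We can use the `.reshape()` method or the `np.reshape()` function to reshape an array.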

Let's say we want to reshape a 1-D array into a 2-D array.

![image](../../images/nparray_reshape.png)

```python
arr = np.array([1, 2, 3, 4, 5, 6])

print("Before reshaping:")
print(arr)
print(f"Original array has {arr.shape} shape.")
print("\n")
print("After reshaping:")
print(arr.reshape(2, 3))
print(f"Reshaped (2, 3) array has {arr.reshape(2, 3).shape} shape.")
print("\n")
print(arr.reshape(3, 2))
print(f"Reshaped (3, 2) array has {arr.reshape(3, 2).shape} shape.")
```
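Two further behaviours of `.reshape()` are worth a quick sketch: passing `-1` lets NumPy infer one dimension, and a shape that does not match the number of elements raises a `ValueError`.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

# -1 asks NumPy to infer that dimension from the total number of elements
inferred = arr.reshape(2, -1)
print(inferred.shape)  # (2, 3)

# 6 elements cannot fill a 4 x 2 array, so this raises ValueError
try:
    arr.reshape(4, 2)
except ValueError as err:
    print(f"ValueError: {err}")
```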

# Broadcasting

**Broadcasting** describes how **NumPy** handles the case when two arrays have different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. 

Thus, with the help of broadcasting, we can perform arithmetic operations between an array and a scalar value, since scalar values are treated as 0-dimensional arrays.

![image](../../images/nparray_addition_scalar.png)

```python
arr1 + 1
```

In this case, `1` is broadcast into an array of shape `(3,)` in which every element is the scalar value `1`.

Can the following operation be performed? Remember that: `arr1 = np.array([1, 2, 3])`

```python
arr1 + np.array([2, 3])
```

In [None]:
# [TODO]


Unfortunately, this is not possible. The two arrays are not **broadcastable**.

How do we determine if arrays are **broadcastable**? They are broadcastable if at least one of the following holds:
1. The arrays all have exactly the same shape.
2. The arrays all have the same number of dimensions, and the length of each dimension is either a common length or 1.
3. Arrays that have too few dimensions can have their shapes prepended with dimensions of length 1 to satisfy property 2.
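The rules above are easier to follow with concrete shapes. The following sketch uses made-up arrays `matrix`, `row`, and `col` to show three combinations that work and one that does not:

```python
import numpy as np

matrix = np.ones((2, 3))        # shape (2, 3)
row = np.array([10, 20, 30])    # shape (3,), treated as (1, 3)
col = np.array([[100], [200]])  # shape (2, 1)

print((matrix + row).shape)  # (2, 3): row is repeated down the rows
print((matrix + col).shape)  # (2, 3): col is repeated across the columns
print((row + col).shape)     # (2, 3): both arrays are stretched

# Shape (2,) is treated as (1, 2); lengths 2 and 3 in the last dimension clash
try:
    matrix + np.array([1, 2])
except ValueError as err:
    print(f"ValueError: {err}")
```

A handy way to read the rules: line the shapes up from the right, and each pair of lengths must either match or contain a 1.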

It's **quite difficult to understand**, isn't it? 😂

I personally use the `np.broadcast_to()` function to try broadcasting one array to the shape of the other. If it raises an error, the two arrays are not **broadcastable**.
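That check can be wrapped in a small helper. The function name `is_broadcastable_to` is hypothetical (it is not part of NumPy), and the sketch only tests one direction: whether the first array can be stretched to the given shape.

```python
import numpy as np

def is_broadcastable_to(arr, shape):
    """Hypothetical helper: True if `arr` can be broadcast to `shape`."""
    try:
        np.broadcast_to(arr, shape)
        return True
    except ValueError:
        # np.broadcast_to raises ValueError for incompatible shapes
        return False

arr1 = np.array([1, 2, 3])

print(is_broadcastable_to(np.array(1), arr1.shape))       # True: a scalar always fits
print(is_broadcastable_to(np.array([2, 3]), arr1.shape))  # False: length 2 vs 3
print(np.broadcast_to(np.array(1), arr1.shape))           # [1 1 1]
```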