# Introduction to NumPy


The learning objectives of this section are:

* Understand advantages of vectorised code using NumPy (over standard python ways)
* Create NumPy arrays
    * Convert lists and tuples to NumPy arrays 
    * Create (initialise) arrays
* Inspect the structure and content of arrays
* Subset, slice, index and iterate through arrays
* Compare computation times in NumPy and standard Python lists

Documentation - https://numpy.org/doc/1.17/user/index.html

---

<center><h1> 📍 📍 Basics of NumPy 📍 📍 </h1></center>


NumPy is the fundamental package for scientific computing with Python. It contains among other things:
 
 - A powerful N-dimensional array object
 - Sophisticated (broadcasting) functions
 - Tools for integrating C/C++ and Fortran code
 - Useful linear algebra, Fourier transform, and random number capabilities

***Source:*** https://numpy.org/

---

### NumPy Basics

NumPy is a library written for scientific computing and data analysis. Its name is short for Numerical Python.

The most basic object in NumPy is the ```ndarray```, or simply an ```array```, which is an **n-dimensional, homogeneous** array. By homogeneous, we mean that all the elements in a NumPy array have to be of the **same data type**, which is commonly numeric (float or integer).
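
A minimal sketch of what homogeneity means in practice: when the input mixes types, NumPy converts every element to one common type. The variable names below are illustrative and not part of the original notebook.

```python
import numpy as np

# Mixing integers and floats: every element is upcast to float64
mixed_numeric = np.array([1, 2, 3.5])
print(mixed_numeric, mixed_numeric.dtype)

# Mixing numbers and strings: every element becomes a fixed-length string
mixed_text = np.array([1, 2, "three"])
print(mixed_text, mixed_text.dtype)

# A plain Python list keeps each element's original type
plain_list = [1, 2.5, "three"]
print([type(item).__name__ for item in plain_list])
```

This silent conversion is worth keeping in mind: a single stray string in numeric input can turn the whole array into text.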

Let's see some examples of arrays.

In [2]:
# Import the numpy library
# np is simply an alias, you may use any other alias, though np is quite standard
import numpy as np

In [6]:
# Creating a 1-D array using a list
# np.array() takes in a list or a tuple as argument, and converts into an array
array_1d = np.array([2, 4, 5, 6, 7, 9])
print(array_1d)
print(type(array_1d))

[2 4 5 6 7 9]
<class 'numpy.ndarray'>


In [7]:
# Creating a 2-D array using two lists
array_2d = np.array([[2, 3, 4], [5, 8, 7]])
print(array_2d)

[[2 3 4]
 [5 8 7]]
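
Besides converting lists, arrays can be built from tuples or initialised directly. The following is a small sketch using a few common NumPy constructors; the shapes and values are arbitrary choices for illustration.

```python
import numpy as np

# A tuple converts the same way a list does
array_from_tuple = np.array((2, 4, 6))

# Initialise arrays of a given shape without listing the values
zeros = np.zeros((2, 3))         # 2 x 3 array of 0.0 (float by default)
ones = np.ones(4, dtype=int)     # four integer 1s
evens = np.arange(0, 10, 2)      # 0, 2, 4, 6, 8 (the stop value is excluded)
grid = np.linspace(0, 1, 5)      # 5 evenly spaced points from 0 to 1 inclusive

for name, arr in [("tuple", array_from_tuple), ("zeros", zeros),
                  ("ones", ones), ("evens", evens), ("grid", grid)]:
    print(name, arr.shape, arr.dtype)
```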


In NumPy, dimensions are called **axes**. The 2-D array above has two axes: the first has length 2 (two rows) and the second has length 3 (three columns).

In NumPy terminology, for 2-D arrays:
* ```axis = 0``` runs down the rows (vertically)
* ```axis = 1``` runs across the columns (horizontally)

<img src="numpy_axes.jpg" style="width: 600px; height: 400px">
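
To make the axis convention concrete, here is a short sketch that sums the 2-D array from above along each axis. The names `column_totals` and `row_totals` are illustrative.

```python
import numpy as np

array_2d = np.array([[2, 3, 4], [5, 8, 7]])

# Summing along axis 0 collapses the rows, leaving one total per column
column_totals = array_2d.sum(axis=0)
print(column_totals)   # [ 7 11 11]

# Summing along axis 1 collapses the columns, leaving one total per row
row_totals = array_2d.sum(axis=1)
print(row_totals)      # [ 9 20]
```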

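***Difference between a NumPy array and a list***

- All elements of a NumPy array must have the same data type, whereas a list can hold elements of different types.
- **Broadcasting:** arithmetic between arrays of different but compatible shapes (or between an array and a scalar) is applied element-wise without explicit loops; lists have no equivalent. See https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html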
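
As a small illustration of broadcasting, the sketch below combines a 2-D array with a scalar and with a 1-D array. The `offsets` array is a made-up example.

```python
import numpy as np

array_2d = np.array([[2, 3, 4], [5, 8, 7]])   # shape (2, 3)
offsets = np.array([10, 20, 30])               # shape (3,)

# The scalar is applied to every element
doubled = array_2d * 2

# The 1-D array is applied to each row in turn
shifted = array_2d + offsets

print(doubled)
print(shifted)
```

Broadcasting only works when the trailing dimensions match or one of them is 1; otherwise NumPy raises a `ValueError`.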

---

### Data Types

In [5]:
np.int64       # Signed 64-bit integer type
np.float32     # Single-precision floating point
np.float64     # Standard double-precision floating point
np.complex128  # Complex numbers represented by two 64-bit floats (128 bits in total)
np.bool_       # Boolean type storing True and False values
np.object_     # Python object type
np.bytes_      # Fixed-length byte-string type (the alias np.string_ was removed in NumPy 2.0)
np.str_        # Fixed-length unicode type (the alias np.unicode_ was removed in NumPy 2.0)

numpy.str_

The aliases np.complex, np.bool and np.object were deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
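
The data types above are usually met in two situations: requesting a type when an array is created, and converting an existing array. A minimal sketch with arbitrary example values:

```python
import numpy as np

# Request a dtype explicitly when creating the array
float_array = np.array([1, 2, 3], dtype=np.float64)
print(float_array, float_array.dtype)

# astype() returns a converted copy; float -> int drops the fractional part
truncated = np.array([1.7, -2.3, 4.0]).astype(np.int64)
print(truncated, truncated.dtype)

# Smaller types use fewer bytes per element
print(np.dtype(np.float32).itemsize, np.dtype(np.float64).itemsize)   # 4 8
```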

### Inspect the Structure and Content of Arrays

It is helpful to inspect the structure of NumPy arrays, especially while working with large arrays. Some useful attributes, methods and functions are:
* ```shape```: Shape of the array, e.g. (n, m) for n rows and m columns
* ```dtype```: Data type (int, float etc.)
* ```dtype.name```: Name of the data type
* ```ndim```: Number of dimensions (or axes)
* ```itemsize```: Memory used by each array element, in bytes
* ```len(arr)```: Length of the first axis
* ```size```: Total number of array elements
* ```astype(int)```: Return a copy of the array converted to a different type
* `tolist()`: Convert the array to a (nested) Python list
* `np.info(np.eye)`: View documentation for np.eye
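
Here is a quick sketch of the items in this list that are easy to confuse, using a small made-up array so the results can be checked by eye:

```python
import numpy as np

small_array = np.array([[1.5, 2.5, 3.5], [4.5, 5.5, 6.5]])

print(small_array.dtype.name)   # float64
print(small_array.size)         # 6: total number of elements
print(len(small_array))         # 2: length of the first axis only

# astype() and tolist() both return new objects and leave the original untouched
as_int = small_array.astype(int)
as_list = small_array.tolist()
print(as_int)
print(as_list)
```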


Let's say you are working with a moderately large array of size 1000 x 300. First, you would want to wrap your head around the basic shape and size of the array. 

In [20]:
# Initialising a random 1000 x 300 array
rand_array = np.random.random((1000, 300))

# Print the second row (indexing starts at 0)
print(rand_array[1])

[0.95379865 0.69153536 0.9672269  0.82658772 0.06857876 0.27588065
 0.94088474 0.02118527 0.11433937 0.08588699 0.55268493 0.03593231
 0.62285732 0.26309438 0.62439537 0.27235462 0.69119299 0.65906702
 0.26692236 0.63444846 0.28217604 0.16554515 0.48720348 0.69545284
 0.46551563 0.45304871 0.43850198 0.0533307  0.02195423 0.25541826
 0.5815934  0.49811979 0.33218227 0.90891216 0.57537938 0.62510908
 0.89675417 0.73278707 0.55913366 0.54327231 0.88490117 0.08725671
 0.75934837 0.2450785  0.9929978  0.72969578 0.50617252 0.74244521
 0.7865385  0.28493166 0.25986354 0.73480758 0.33668559 0.71767419
 0.71177967 0.0139024  0.54062491 0.91204439 0.811006   0.12885883
 0.94603847 0.12732127 0.01200975 0.02925136 0.62104327 0.80452503
 0.17215504 0.79369298 0.4598322  0.82498396 0.69102751 0.30640054
 0.661006   0.47619005 0.03793968 0.56286448 0.937146   0.86312438
 0.53289312 0.90964941 0.78201121 0.00520719 0.82778476 0.78680527
 0.2974833  0.88405646 0.30626502 0.09430897 0.1820487  0.7362

In [21]:
# Inspecting shape, dtype, ndim and itemsize
print("Shape: {}".format(rand_array.shape))
print("dtype: {}".format(rand_array.dtype))
print("Dimensions: {}".format(rand_array.ndim))
print("Item size: {}".format(rand_array.itemsize))

Shape: (1000, 300)
dtype: float64
Dimensions: 2
Item size: 8


Printed 3-D arrays are harder to read, because output can only be laid out in two dimensions. NumPy therefore prints higher-dimensional arrays according to the following conventions:
* The last axis is printed from left to right
* The second-to-last axis is printed from top to bottom
* The remaining axes are also printed from top to bottom, with each slice separated from the next by an empty line

Let's see some examples.

In [9]:
# Creating a 3-D array
# np.arange(36) gives a 1-D array of 0..35; reshape() rearranges it into shape (3, 3, 4)
array_3d = np.arange(36).reshape(3, 3, 4)
print(array_3d)

[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]

 [[24 25 26 27]
  [28 29 30 31]
  [32 33 34 35]]]
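
The printed layout maps directly onto indexing. As a hedged sketch (the variable names are illustrative), the same array can be subset, sliced and iterated like this:

```python
import numpy as np

array_3d = np.arange(36).reshape(3, 3, 4)

# One index selects a whole block along the first axis
first_block = array_3d[0]            # shape (3, 4)

# Three indices select a single element: block 1, row 2, column 3
single_value = array_3d[1, 2, 3]     # 23

# A colon keeps an axis whole: first column of every block
first_columns = array_3d[:, :, 0]    # shape (3, 3)

# Iterating over an array walks along the first axis
for block in array_3d:
    print(block.shape)               # (3, 4) each time

print(first_block.shape, single_value, first_columns.shape)
```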


### Asking For Help

In [6]:
np.info(np.ndarray.dtype)

Data-type of the array's elements.

Parameters
----------
None

Returns
-------
d : numpy dtype object

See Also
--------
numpy.dtype

Examples
--------
>>> x
array([[0, 1],
       [2, 3]])
>>> x.dtype
dtype('int32')
>>> type(x.dtype)
<type 'numpy.dtype'>


### Advantages of NumPy 

What is the use of arrays over lists, specifically for data analysis? Put crudely, it is **convenience and speed**:<br>
1. You can write **vectorised** code on NumPy arrays, but not on lists, and such code is **concise and convenient to read and write**.
2. NumPy is **much faster** than the standard Python ways of doing computations.

Vectorised code typically contains no explicit looping or indexing (all of this happens behind the scenes, in precompiled C code), and is therefore much more concise.

Let's see an example of convenience now; we'll see one for speed later.

Say you have two lists of numbers, and want to calculate the element-wise product. The standard Python list way requires you to map a lambda function (or worse, write a ```for``` loop), whereas with NumPy, you simply multiply the arrays.

In [1]:
list_1 = [3, 6, 7, 5]
list_2 = [4, 5, 1, 7]

# the list way to do it: map a function to the two lists
product_list = list(map(lambda x, y: x*y, list_1, list_2))
print(product_list)


[12, 30, 7, 35]


In [4]:
# Lists do not support element-wise multiplication, so this raises a TypeError
print(list_1*list_2)

TypeError: can't multiply sequence by non-int of type 'list'

In [7]:
# The numpy array way to do it: simply multiply the two arrays
array_1 = np.array(list_1)
array_2 = np.array(list_2)

array_3 = array_1*array_2
print(array_3)
print(type(array_3))

[12 30  7 35]
<class 'numpy.ndarray'>


As you can see, the NumPy way is clearly more concise.

Even simple mathematical operations on lists require a loop or a list comprehension, unlike with arrays. For example, to calculate the square of every number in a list:

In [8]:
# Square a list
list_squared = [i**2 for i in list_1]

# Square a numpy array
array_squared = array_1**2

print(list_squared)
print(array_squared)

[9, 36, 49, 25]
[ 9 36 49 25]


This was with 1-D arrays. You'll often work with 2-D arrays (matrices), where the difference would be even greater. With lists, you'll have to store matrices as lists of lists and loop through them. With NumPy, you simply multiply the matrices.
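
A minimal sketch of that difference, using two made-up 2 x 2 matrices. Note that `*` on arrays is the element-wise product, while the `@` operator gives the matrix product.

```python
import numpy as np

matrix_a = [[1, 2], [3, 4]]
matrix_b = [[5, 6], [7, 8]]

# Lists of lists: the element-wise product needs nested iteration
list_product = [[a * b for a, b in zip(row_a, row_b)]
                for row_a, row_b in zip(matrix_a, matrix_b)]

array_a = np.array(matrix_a)
array_b = np.array(matrix_b)

elementwise = array_a * array_b      # element-wise product
matrix_product = array_a @ array_b   # matrix (dot) product

print(list_product)
print(elementwise)
print(matrix_product)
```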

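For reference, in the 3-D ```array_3d``` of shape (3, 3, 4) printed earlier:
* The last axis has 4 elements, and is printed from left to right
* The second-to-last axis has 3, and is printed from top to bottom
* The first axis has 3, and is printed as three blocks separated by empty lines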

### Compare Computation Times in NumPy and Standard Python Lists

We mentioned that the key advantages of numpy are convenience and speed of computation. 

You'll often work with extremely large datasets, so it is important to understand how much computation time (and memory) you can save by using NumPy instead of standard Python lists.

Let's compare the computation times of arrays and lists for a simple task of calculating the element-wise product of numbers. 

In [34]:
## Comparing time taken for computation
import time

list_1 = list(range(1000000))
list_2 = [j**2 for j in range(1000000)]

# List multiplication: record the time before and after, then take the difference
# (time.perf_counter() is a high-resolution clock intended for timing short intervals)
t0 = time.perf_counter()
product_list = list(map(lambda x, y: x*y, list_1, list_2))
t1 = time.perf_counter()
list_time = t1 - t0
print(list_time)


# NumPy array multiplication (array creation is not included in the timing)
array_1 = np.array(list_1)
array_2 = np.array(list_2)

t0 = time.perf_counter()
array_3 = array_1*array_2
t1 = time.perf_counter()
numpy_time = t1 - t0

print(numpy_time)

print("The ratio of time taken is {}".format(list_time/numpy_time))

0.13862943649291992
0.00394892692565918
The ratio of time taken is 35.10559681217171


In [1]:
import numpy as np
my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

In [2]:
%timeit my_arr2 = my_arr * 2

1.76 ms ± 62.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [3]:
%timeit my_list2 = [x * 2 for x in my_list]

80.5 ms ± 2.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
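
`%timeit` is an IPython magic and is not available in a plain Python script. As a rough equivalent, the standard-library `timeit` module can be used; the sketch below assumes a smaller array and an arbitrary repeat count of 20 to keep the run short, so the absolute numbers will differ from those above.

```python
import timeit

import numpy as np

my_arr = np.arange(100_000)
my_list = list(range(100_000))

# timeit.timeit() calls the function `number` times and returns the total seconds
numpy_seconds = timeit.timeit(lambda: my_arr * 2, number=20)
list_seconds = timeit.timeit(lambda: [x * 2 for x in my_list], number=20)

print("NumPy: {:.6f} s for 20 runs".format(numpy_seconds))
print("List:  {:.6f} s for 20 runs".format(list_seconds))
```

Timings depend on the machine and the NumPy version, so treat the ratio as indicative rather than exact.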


In these runs, NumPy is **roughly 35 to 46 times faster** than lists, i.e. more than an order of magnitude. This is with arrays of around a million elements, but you may work on much larger arrays, with sizes on the order of billions, where the absolute time saved is even larger.

Some reasons for this difference in speed are:
* NumPy's core routines are written in C, so the element-wise loop runs in compiled code behind the scenes rather than in the Python interpreter
* NumPy arrays are more compact than lists, i.e. they take much less storage space, because the values are stored in one contiguous block of a single data type
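
The storage claim can be checked roughly with the sketch below. It is an estimate under stated assumptions: `sys.getsizeof` reports the list's pointer storage and each integer object separately, and the figures vary by Python build.

```python
import sys

import numpy as np

count = 100_000
numbers_list = list(range(count))
numbers_array = np.arange(count, dtype=np.int64)

# The array stores raw 8-byte integers in one contiguous buffer
array_bytes = numbers_array.nbytes

# The list stores pointers, and each pointer refers to a separate int object
list_bytes = sys.getsizeof(numbers_list) + sum(sys.getsizeof(n) for n in numbers_list)

print("Array: {} bytes".format(array_bytes))
print("List:  {} bytes (approximate)".format(list_bytes))
```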


The following discussions cover the differences in speed between NumPy and standard Python:
1. https://stackoverflow.com/questions/8385602/why-are-numpy-arrays-so-fast
2. https://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists
