# numpy 

numpy, short for Numerical Python, is a library for advanced mathematical computation that also maintains high performance. The goal of this notebook is to give anyone who knows Python a quick crash course on the capabilities of numpy and on the topics they should read more about, depending on their use case.

In [2]:
import numpy as np

The sole reason that numpy is imported as **np** is convention. You are free to use another alias, but it's not recommended: **np** is what you will find everywhere, and it's better to stick to standards.

In [3]:
np.__version__

'1.18.1'

### nd-array

The primary reason that numpy is fast is the nd-array type that it uses to store and manipulate data.

An ndarray is a generic multidimensional container for homogeneous data. It provides vectorized arithmetic operations and sophisticated broadcasting capabilities. Every ndarray has 2 key properties: shape and dtype. shape is a tuple giving the size of the array along each dimension, and dtype gives the datatype of its elements.

The dtype can also be explicitly specified when defining the array, giving you fine-grained control over how the data is stored.

Let's create a numpy array from a Python list. This is done by passing the list as input to the *np.array* function.

In [4]:
nparray = np.array([1,2,3])
nparray

array([1, 2, 3])

A numpy array has numerous properties which give more information about it.

Datatype of the array

In [5]:
nparray.dtype

dtype('int32')

Size of the array

In [7]:
nparray.size

3

Shape of the array

In [8]:
nparray.shape

(3,)

The 2 cells below warrant a detailed explanation. The *itemsize* attribute returns the size in bytes of a single item in the array. In this case, we have an integer array in which a single item takes up 32 bits, which is equivalent to 4 bytes (1 byte = 8 bits). The following cell showcases the *nbytes* attribute, which returns the size in bytes of the entire array, here 12 bytes (4 bytes * 3 items).

To put it short: *itemsize* gives the size of a single item in the array while *nbytes* gives the size of the entire array.

In [21]:
nparray.itemsize

4

In [22]:
nparray.nbytes

12
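Since the dtype can be specified explicitly, it is worth seeing how that choice changes these two numbers. The following is a small sketch; the variable names `small_ints` and `big_ints` are made up for illustration.

```python
import numpy as np

# Same values, stored with two explicitly chosen integer widths.
small_ints = np.array([1, 2, 3], dtype=np.int8)
big_ints = np.array([1, 2, 3], dtype=np.int64)

print(small_ints.itemsize, small_ints.nbytes)  # 1 byte per item, 3 bytes in total
print(big_ints.itemsize, big_ints.nbytes)      # 8 bytes per item, 24 bytes in total
```

Picking a narrower dtype saves memory, at the cost of a smaller range of representable values.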

There is a reason that numpy's nd-array is faster than Python's native list. Let's look at this in depth in the following cells.

In [9]:
%timeit pythonList = [i for i in range(10000)]

440 µs ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [10]:
%timeit npList = np.arange(10000)

8.19 µs ± 93.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


numpy arrays are homogeneous: every element has the same type, and the data is typically stored in one contiguous block of memory, which makes it fast to process. Compare this to a Python list, where you can put anything in; every entry in a Python list is a separate Python object, and this causes overhead in computations. This is the primary reason why numpy arrays are significantly faster than traditional Python lists.

Let's now look at some of the functions which make numpy a flexible and handy library.

### Generating data with numpy

*arange* generates an array of evenly spaced numbers. With a single argument n, it returns the integers from 0 up to, but not including, n.

In [11]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

*linspace* returns a set of linearly spaced values within the interval passed as input. The starting value, the ending value and the number of values required are passed as input. Basically, it returns an array with the required number of values in the specified interval, with both endpoints included by default.

In [14]:
np.linspace(0, 10, 5)

array([ 0. ,  2.5,  5. ,  7.5, 10. ])
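To make the difference between the two generators concrete, here is a short sketch that puts them side by side. The variable names are illustrative only.

```python
import numpy as np

# arange takes start, stop (excluded) and step.
evens = np.arange(0, 10, 2)    # 0, 2, 4, 6, 8

# linspace takes start, stop (included by default) and the number of points.
points = np.linspace(0, 1, 5)  # 0.0, 0.25, 0.5, 0.75, 1.0

# endpoint=False leaves the stop value out.
open_points = np.linspace(0, 1, 5, endpoint=False)  # 0.0, 0.2, 0.4, 0.6, 0.8
```

As a rule of thumb, reach for *arange* when you know the step and for *linspace* when you know how many points you need.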

*ones* creates an array filled with ones. The parameter passed as input is the shape of the required array (a single integer gives a 1-D array of that length).

In [15]:
np.ones(5)

array([1., 1., 1., 1., 1.])

*zeros* creates an array filled with zeros. The parameter passed as input is the shape of the required array (a single integer gives a 1-D array of that length).

In [16]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

*zeros_like* creates an array with the same shape and datatype as the array passed as input. The generated array will have zeros as elements.

In [18]:
np.zeros_like(np.arange(5))

array([0, 0, 0, 0, 0])

*eye* creates an identity matrix. The integer passed as input is the number of rows (and columns) of the generated matrix.

In [19]:
np.eye(5)

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

*empty* creates an array without initializing its entries, so it holds whatever values happened to be in that memory (ones in the output below). The parameter passed as input is the shape of the required array.

In [20]:
np.empty(5)

array([1., 1., 1., 1., 1.])
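The functions above also accept a tuple as the shape, which is how you get multi-dimensional arrays out of them. Here is a minimal sketch with made-up variable names; it also shows the usual pattern for *empty*, which is to allocate first and assign values before reading them.

```python
import numpy as np

grid_of_zeros = np.zeros((2, 3))            # 2 rows, 3 columns of 0.0
int_ones = np.ones((3, 2), dtype=np.int64)  # explicit integer dtype

# empty() only allocates memory, so assign values before reading them.
buffer = np.empty(4)
buffer[:] = 7.0
```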

### Indexing

numpy follows the usual rules of Python sequences when it comes to indexing and slicing. Here are a few examples that you can play around and experiment with:

In [24]:
nparray[1]

2

In [25]:
nparray[-1]

3

#### Slicing

In [26]:
nparray[1:2]

array([2])

In [27]:
nparray[:2]

array([1, 2])

In [28]:
nparray[:]

array([1, 2, 3])

In [29]:
nparray[1:]

array([2, 3])

In [30]:
largeArray = np.arange(100)

In [32]:
largeArray[::20]

array([ 0, 20, 40, 60, 80])

In [33]:
largeArray[1:10:2]

array([1, 3, 5, 7, 9])

In [34]:
largeArray[::-20]

array([99, 79, 59, 39, 19])

In [35]:
largeArray[10:1:-2]

array([10,  8,  6,  4,  2])

An important thing to keep in mind while slicing in numpy is that slices are views (references) into the original data and hence any change that you make to the sliced data will be reflected in the parent. Let's look at an example.

In [36]:
smallArray = largeArray[:10]

In [37]:
smallArray

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [38]:
largeArray[:10]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [39]:
smallArray[0] = 666

In [44]:
smallArray

array([666,   1,   2,   3,   4,   5,   6,   7,   8,   9])

In [43]:
largeArray[:10]

array([666,   1,   2,   3,   4,   5,   6,   7,   8,   9])

The copy() method can be used to create independent copies instead of such views.
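Here is a small sketch of the difference between a view and a copy. The names `parent`, `view` and `copied` are just for illustration.

```python
import numpy as np

parent = np.arange(10)

view = parent[:3]            # slice: shares memory with parent
copied = parent[:3].copy()   # independent copy

view[0] = 100    # visible in parent
copied[1] = 200  # not visible in parent

print(parent[:3])  # [100   1   2]
print(copied)      # [  0 200   2]
```

If you are ever unsure whether two arrays share data, `np.shares_memory(parent, view)` will tell you.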

### Axes

A numpy array can be multi-dimensional. numpy also lets you change an existing array to a shape of your liking, provided that the new shape holds exactly the same number of elements.

*reshape* does exactly what the name says: it reshapes the array into the dimensions passed as input. If the array cannot be reshaped into those dimensions, the function raises an error. Say the array has 10 elements and you try to reshape it to 3x5 (15 elements): reshape will raise a ValueError.

In [47]:
np.arange(1, 7).reshape((2, 3))

array([[1, 2, 3],
       [4, 5, 6]])

In [52]:
np.arange(1, 4)

array([1, 2, 3])

In [53]:
np.arange(1, 4).reshape(1,3)

array([[1, 2, 3]])
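The failing case described above is easy to try for yourself. The sketch below also uses -1 as one of the dimensions, which asks numpy to work out that dimension from the number of elements; the variable names are made up.

```python
import numpy as np

ten = np.arange(10)

# 10 elements fit a 2x5 shape; -1 asks numpy to infer that dimension.
two_by_five = ten.reshape(2, -1)

# 10 elements cannot fill a 3x5 shape, so reshape raises ValueError.
try:
    ten.reshape(3, 5)
    reshape_failed = False
except ValueError:
    reshape_failed = True

print(two_by_five.shape, reshape_failed)
```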

*newaxis* is used to create a new axis in the data. It is commonly used in modelling work, as models often require the data to be shaped in a certain manner. As you can see below, if *newaxis* is in the first position, a row vector of shape (1, 3) is generated. If it's in the second position, a column vector of shape (3, 1) is created, with each element in its own row.

In [54]:
np.arange(1, 4)[np.newaxis, :]

array([[1, 2, 3]])

In [55]:
np.arange(1, 4)[:, np.newaxis]

array([[1],
       [2],
       [3]])
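Checking the shapes makes it clearer what *newaxis* is doing. A quick sketch, with illustrative variable names:

```python
import numpy as np

v = np.arange(1, 4)          # shape (3,)

row = v[np.newaxis, :]       # shape (1, 3)
column = v[:, np.newaxis]    # shape (3, 1)

# np.newaxis is an alias for None, and reshape gives the same result.
same_column = v.reshape(-1, 1)

print(v.shape, row.shape, column.shape)
```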

### Array Concatenation

Arrays can be concatenated in numpy using the *concatenate* function. The list of arrays to be concatenated is passed as input to the function.

In [56]:
np.concatenate([smallArray, largeArray])

array([666,   1,   2,   3,   4,   5,   6,   7,   8,   9, 666,   1,   2,
         3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,  15,
        16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,
        29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,  41,
        42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,
        55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,
        68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,
        81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,
        94,  95,  96,  97,  98,  99])

In [57]:
np.concatenate([smallArray, largeArray, [888, 999]])

array([666,   1,   2,   3,   4,   5,   6,   7,   8,   9, 666,   1,   2,
         3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,  15,
        16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,
        29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,  41,
        42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,
        55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,
        68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,
        81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,
        94,  95,  96,  97,  98,  99, 888, 999])
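For arrays with more than one dimension, *concatenate* takes an `axis` argument that decides the direction in which the arrays are joined. This is a minimal sketch with made-up 2-D arrays:

```python
import numpy as np

top = np.array([[1, 2], [3, 4]])   # shape (2, 2)
bottom = np.array([[5, 6]])        # shape (1, 2)

# axis=0 joins along rows, axis=1 joins along columns.
stacked_rows = np.concatenate([top, bottom], axis=0)  # shape (3, 2)
side_by_side = np.concatenate([top, top], axis=1)     # shape (2, 4)
```

The arrays must agree on every dimension except the one they are joined along.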

### numpy Ufuncs

Ufuncs (universal functions) operate element-wise on numpy arrays. Their primary purpose is to speed up repeated operations on the values of an array. They work both between a scalar value and an array and between 2 arrays.

In [77]:
3 * np.arange(0, 10)

array([ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27])

In [65]:
np.arange(0, 10) + np.arange(20, 30)

array([20, 22, 24, 26, 28, 30, 32, 34, 36, 38])

This is not just syntactically cleaner and more intuitive, it's also faster. Let's try that out below.

In [67]:
%timeit 3 * smallArray

1.18 µs ± 64.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [76]:
%%timeit

for i in range(len(smallArray)):
    3 * smallArray[i]

6.99 µs ± 410 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


As can be seen above, the ufunc is several times faster than the looped version of the code (roughly 6x here, on an array of only 10 elements). This difference becomes more pronounced as the array grows and as the computational logic involved gets more complex.
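The arithmetic operators on arrays are thin wrappers around named ufuncs, which you can also call directly. A short sketch, with illustrative variable names:

```python
import numpy as np

left = np.arange(0, 10)
right = np.arange(20, 30)

# The operators dispatch to ufuncs: * calls np.multiply, + calls np.add.
tripled = np.multiply(3, left)
summed = np.add(left, right)

print(type(np.add))  # <class 'numpy.ufunc'>
```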

### Aggregation

In this section, we will explore the various aggregation functions that numpy provides. numpy ships with a *sum()* function which returns the sum of all elements in the array. You may ask what the difference is between the native Python function and the numpy function. After all, they do the same thing: return the sum of the elements. Let's check it out below.

In [79]:
np.sum(smallArray)

711

In [80]:
sum(smallArray)

711

In [86]:
hugeArray = np.random.randint(100000, size=1000000)
%timeit np.sum(hugeArray)
%timeit sum(hugeArray)

527 µs ± 5.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


260 ms ± 28.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


From the above code block, it's pretty evident how much faster numpy is than the native function (roughly 500x here). This applies to most, if not all, aggregation functions available in numpy.

In [95]:
np.min(smallArray)

1

In [96]:
np.max(smallArray)

666

In [144]:
np.std(smallArray)

198.31512801599376

In [146]:
np.mean(smallArray)

71.1

In [148]:
np.median(smallArray)

5.5

One thing to keep in mind while using the aggregation functions is that they propagate NaN values: if you have a NaN value in your array, the result will be NaN as well. In such cases you can use their NaN-safe alternatives, which ignore NaN values. For example, nansum() is the NaN-safe alternative to the sum function.

In [154]:
np.nanmean(np.array([1, 2, np.nan]))

1.5

In [155]:
np.mean(np.array([1, 2, np.nan]))

nan
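The same pattern holds for the other NaN-safe functions. Here is a quick sketch using a made-up array of readings with one missing value:

```python
import numpy as np

readings = np.array([1.0, 2.0, np.nan])

print(np.sum(readings))     # nan: the NaN propagates
print(np.nansum(readings))  # 3.0: the NaN is skipped
print(np.nanmax(readings))  # 2.0

# np.isnan gives a Boolean array marking the missing entries.
missing = np.isnan(readings)
```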

### Broadcasting

We are now going to look at an operation that you might have used a lot without having a conceptual understanding of it. In broadcasting, you perform an operation between entities that have different shapes/dimensions.

A simple example is adding a scalar to a numpy array. You can think of the scalar as being duplicated to match the dimensions of the array, after which the operation is performed element by element.

There is a lot more to broadcasting, but elaborating on it is not my goal with this post.

In [101]:
a = np.arange(1, 7)
a

array([1, 2, 3, 4, 5, 6])

In [105]:
a + 1

array([2, 3, 4, 5, 6, 7])
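The same "duplicate to match" idea applies between arrays of different shapes, not just between a scalar and an array. The following is a minimal sketch with made-up arrays:

```python
import numpy as np

matrix = np.arange(1, 7).reshape(2, 3)  # shape (2, 3): [[1, 2, 3], [4, 5, 6]]
row = np.array([10, 20, 30])            # shape (3,)
column = np.array([[100], [200]])       # shape (2, 1)

plus_row = matrix + row        # row is stretched across both rows
plus_column = matrix + column  # column is stretched across all three columns

print(plus_row)     # [[11 22 33], [14 25 36]] as a 2x3 array
print(plus_column)  # [[101 102 103], [204 205 206]] as a 2x3 array
```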

### Logical Operations

There will be times when you would like to perform logical checks on a piece of data. numpy provides all the normal comparison operations that you would expect: greater than, less than and equal to checks. They return a Boolean array indicating, element by element, whether the condition was satisfied.

In [115]:
x = np.array([1, 2, 3, 4])

In [116]:
x > 2

array([False, False,  True,  True])

In [117]:
x == 2

array([False,  True, False, False])

numpy also provides useful functions such as any() and all() which are used to perform the following checks: if there is any element that fulfilled the condition or if all the elements satisfy the condition, respectively. They provide a single Boolean value as output which indicates the result

In [118]:
np.any(x == 2)

True

In [119]:
np.all(x == 2)

False

From the above 2 code blocks, the following can be understood:
* any(x == 2): It checks if any of the elements in the array met the condition passed
* all(x == 2): It checks if all the elements in the array met the condition passed

Say we want a count of the values that satisfy a particular condition. You can use the sum() function as follows; since True counts as 1 and False as 0, it counts the number of True values in the array.

In [120]:
np.sum(x == 2)

1

Conditions can also be combined to check for several of them at once. The same can be done with the *any* and *all* functions.

In [124]:
np.sum((x == 2) | (x == 3))

2

In [157]:
np.any((x == 2) | (x == 3))

True

#### Masking

You might have seen the Boolean arrays above and wondered how they are helpful, since you still need further operations to make sense of the output. Here is where masking comes in: the True/False array can be passed *into* the array as an index to return only those values which meet the condition.

In [125]:
x[x == 2]

array([2])

In [126]:
x[x > 2]

array([3, 4])
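Masks can also combine several conditions, and they work on the left-hand side of an assignment too. A short sketch with made-up values:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6])

# Combine conditions with & (and), | (or), ~ (not); the parentheses are required.
middle = values[(values > 2) & (values < 6)]  # [3, 4, 5]

# A mask on the left-hand side assigns to the selected elements only.
clipped = values.copy()
clipped[clipped > 4] = 4                      # [1, 2, 3, 4, 4, 4]
```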

### Fancy Indexing

Fancy indexing is nothing but the ability to access multiple elements of the array at once

In [131]:
x

array([1, 2, 3, 4])

In [132]:
x[[0, 2, 3]]

array([1, 3, 4])

As you might have guessed, the indices can also be stored in a variable (a list or a numpy array) and passed in.

In [133]:
indexList = [0, 2, 3]
x[indexList]

array([1, 3, 4])

The true power of fancy indexing comes out when it is combined with the likes of slicing, indexing and broadcasting
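One detail worth knowing before combining these tools: unlike a slice, fancy indexing returns a copy, yet assigning through a fancy index does modify the original. A small sketch with made-up names:

```python
import numpy as np

data = np.arange(10, 20)
picks = np.array([0, 2, 3])

selected = data[picks]  # [10, 12, 13]
selected[0] = -1        # fancy indexing returned a copy, so data is unchanged

data[picks] = 0         # assignment through a fancy index does modify data

print(selected)  # [-1 12 13]
print(data[:5])  # [ 0 11  0  0 14]
```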

### Sorting

The sorting algorithm provided by numpy is very efficient. There are 2 ways you can sort a numpy array: in place, or by calling the numpy sort function, which returns a sorted copy.

Let's shuffle our array first so that we can sort it and play around with it

In [158]:
np.random.shuffle(x)
x

array([3, 2, 1, 4])

Calling *np.sort* on the array returns a sorted copy and does not sort the array in place, as demonstrated below.

In [163]:
np.sort(x)

array([1, 2, 3, 4])

In [160]:
x

array([3, 2, 1, 4])

If you call the *sort* method on the array itself, the array is sorted in place and there's no need to save the result to another array. Both approaches do the same work; which one you use depends on the use case.

In [164]:
x.sort()

In [142]:
x

array([1, 2, 3, 4])
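A close relative that this post has not covered so far is *np.argsort*, which returns the indices that would sort the array rather than the sorted values. Here is a minimal sketch with made-up scores; it ties in nicely with fancy indexing.

```python
import numpy as np

scores = np.array([30, 10, 40, 20])

order = np.argsort(scores)          # indices that would sort the array: [1, 3, 0, 2]
ascending = scores[order]           # [10, 20, 30, 40]
descending = np.sort(scores)[::-1]  # [40, 30, 20, 10]
```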

### Concluding Notes

numpy has a lot more capabilities, especially with regard to higher-dimensional data, which it can easily work with and manipulate. I have decided not to go into such details as that would defeat the purpose of this post. If you want to read more about numpy, I would highly recommend the following resources:
* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) - The Data Science Handbook is a very popular book in the Data Science domain and its ebook version is available for free. It will let you go into more detail on numpy and, if interested, pandas, matplotlib & scikit-learn
* [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) - Written by the creator of Pandas, you can go through this book for in-depth examples of the usage of numpy
* And of course, [numpy documentation](https://numpy.org/doc/) - If you are a seasoned developer and know exactly what you are looking for, you could go directly to the numpy documentation, as it has a lot of user guides and tutorials and is not just a website that hosts reference documentation