![rmotr](https://user-images.githubusercontent.com/7065401/52071918-bda15380-2562-11e9-828c-7f95297e4a82.png)
<hr style="margin-bottom: 40px;">

<img src="https://user-images.githubusercontent.com/7065401/39118381-910eb0c2-46e9-11e8-81f1-a5b897401c23.jpeg"
    style="width:300px; float: right; margin: 0 40px 40px 40px;"></img>

# NumPy: Numeric computing library

NumPy (Numerical Python) is one of the core packages for numerical computing in Python. Pandas, Matplotlib, Statsmodels and many other scientific libraries rely on NumPy.

NumPy's major contributions are:

* Efficient numeric computation with C primitives
* Efficient collections with vectorized operations
* An integrated and natural Linear Algebra API
* A C API for connecting NumPy with libraries written in C, C++, or FORTRAN.

Let's elaborate on efficiency.

## Number and vector representation in Python vs NumPy

In Python, **everything is an object**, which means that even simple ints are objects, with all the attributes required to make an object work. We call them "boxed ints". This added weight is what makes Python easy to work with, but it means a simple integer can take 28 bytes or more on a typical 64-bit CPython build! As datasets get larger (e.g. 7 billion records), this drastically affects memory use and performance (especially when floats are involved). Further, the built-in data structures in Python, like lists and dictionaries, are not optimized for low-level computing. The items of a typical list will most likely *not* be stored in contiguous positions in memory, and each item is wrapped in an object. Hence, we cannot rely on advanced CPU instructions for processing matrices, because the items are objects rather than raw numbers.

In contrast, NumPy uses primitive numeric types (floats, ints), which makes storage and computation efficient. We can explicitly state the size of the integers we're creating, which can shrink the overall size of the dataset by a huge margin. Also, the items of a NumPy array are placed in contiguous positions in memory, take only the space explicitly allocated to them, and can make use of very efficient low-level CPU instructions for matrix calculations. This makes computations very fast, and fast array processing is especially important for machine learning work.



<img src="https://docs.google.com/drawings/d/e/2PACX-1vTkDtKYMUVdpfVb3TTpr_8rrVtpal2dOknUUEOu85wJ1RitzHHf5nsJqz1O0SnTt8BwgJjxXMYXyIqs/pub?w=726&h=396" />
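The memory layout described above can be inspected directly. The following is a minimal sketch; the variable names are illustrative, and the dtype is fixed to `int64` so that the numbers do not depend on the platform.

```python
import sys
import numpy as np

# A NumPy array stores raw 8-byte integers back to back in one buffer.
arr = np.arange(5, dtype=np.int64)

print(arr.itemsize)               # bytes per element: 8
print(arr.strides)                # bytes to step to the next element: (8,)
print(arr.flags["C_CONTIGUOUS"])  # True: elements are contiguous in memory
print(arr.nbytes)                 # 5 elements * 8 bytes = 40

# A list stores pointers to separate int objects, each with its own overhead.
boxed = [0, 1, 2, 3, 4]
print(sys.getsizeof(boxed[1]))    # size of one boxed int (platform dependent)
```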


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Hands on! 

In [306]:
import sys
import numpy as np

## Basic Numpy Arrays

They look like Python lists, but are significantly different in terms of characteristics and performance.

For NumPy arrays:
* all elements have to be of the same data type
* the size of each element is fixed


In [307]:
np.array([1, 2, 3, 4])

array([1, 2, 3, 4])

In [308]:
a = np.array([1, 2, 3, 4])

In [309]:
b = np.array([0, .5, 1, 1.5, 2])

In [310]:
a[0]

1

In [311]:
a[0], a[1]

(1, 2)

In [312]:
a[0:]

array([1, 2, 3, 4])

In [313]:
# 1st index is inclusive, 2nd index isn't
a[1:3]

array([2, 3])

In [314]:
a[1:-1]

array([2, 3])

In [315]:
# array[::n] gives every nth element starting from the 0th element
# c = np.array([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
# c[::3]
# array([ 0,  3,  6,  9, 12, 15])

a[::2]

array([1, 3])

In [316]:
b

array([0. , 0.5, 1. , 1.5, 2. ])

In [317]:
b[0], b[2], b[-1]

(0.0, 1.0, 2.0)

**Multi-indexing** is possible with NumPy: pass a list of indices, and the result is again a NumPy array, as seen below.

**REMEMBER:** the list of indices must itself be wrapped in `[]`.

In [318]:
b[[0, 2, -1]]

array([0., 1., 2.])
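As a small illustrative sketch of index-list selection (the name `picked` is an assumption made for this example), indices can be repeated or reordered, and the result is a new array rather than a window onto the original:

```python
import numpy as np

b = np.array([0, .5, 1, 1.5, 2])

# Indices may be repeated or given in any order.
picked = b[[4, 0, 0]]
print(picked)          # [2. 0. 0.]

# Index-list selection returns a copy, so editing it leaves b unchanged.
picked[0] = -1
print(b[4])            # 2.0
```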

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Array Types

In [319]:
a

array([1, 2, 3, 4])

In [320]:
a.dtype

dtype('int32')

In [321]:
b

array([0. , 0.5, 1. , 1.5, 2. ])

In [322]:
b.dtype

dtype('float64')

Explicitly asking for an item type:

In [323]:
np.array([1, 2, 3, 4], dtype=np.float)

Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  np.array([1, 2, 3, 4], dtype=np.float)


array([1., 2., 3., 4.])

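To avoid the warning above, use the built-in `float` instead. The `np.float` alias was removed entirely in NumPy 1.24, where the cell above raises an `AttributeError`: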

In [324]:
np.array([1, 2, 3, 4], dtype=float)

array([1., 2., 3., 4.])

In [325]:
# Small integer for performance enhancement
np.array([1, 2, 3, 4], dtype=np.int8)

array([1, 2, 3, 4], dtype=int8)
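To make the trade-off of small integer types concrete, here is a hedged sketch comparing the memory used by the same values stored as `int8` and `int64`. The variable names are illustrative only.

```python
import numpy as np

values = [1, 2, 3, 4]
small = np.array(values, dtype=np.int8)
large = np.array(values, dtype=np.int64)

print(small.nbytes)  # 4 elements * 1 byte  = 4
print(large.nbytes)  # 4 elements * 8 bytes = 32

# int8 only holds -128..127; casting a larger value wraps around silently.
wrapped = np.array([200], dtype=np.int64).astype(np.int8)
print(wrapped)       # [-56]
```

The smaller type saves memory, but it is only safe when every value is known to fit in its range.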

In [326]:
c = np.array(['a', 'b', 'c'])

In [327]:
# '<U1' means a Unicode string of length 1
c.dtype

dtype('<U1')

In [328]:
# No point in storing strings, objects in numpy arrays
d = np.array([{'a': 1}, sys])

In [329]:
# Type 'object'
d.dtype

dtype('O')

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Dimensions and shapes

In [330]:
# Creating a matrix, a 2D array

A = np.array([
    [1, 2, 3],
    [4, 5, 6]
])

In [331]:
# Shape: (number of rows, number of columns)

A.shape

(2, 3)

In [332]:
# Number of dimensions

A.ndim

2

In [333]:
# Total number of elements

A.size

6

In [334]:
# Creating a cube

B = np.array([
    [
        [12, 11, 10],
        [9, 8, 7],
    ],
    [
        [6, 5, 4],
        [3, 2, 1]
    ]
])

In [335]:
B

array([[[12, 11, 10],
        [ 9,  8,  7]],

       [[ 6,  5,  4],
        [ 3,  2,  1]]])

In [336]:
B.shape

(2, 2, 3)

In [337]:
B.ndim

3

In [338]:
B.size

12

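**Be careful when creating multi-dimensional arrays by hand!**

If the shape isn't consistent (different number of elements within elements), NumPy cannot build a regular n-dimensional array. Older versions fall back to an array of regular Python objects and emit a deprecation warning, as in this notebook; NumPy 1.24 and later raise a `ValueError` unless `dtype=object` is passed:

In [339]:
# Ragged input like this is deprecated since NumPy 1.19 and raises ValueError since NumPy 1.24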

C = np.array([
    [
        [12, 11, 10],
        [9, 8, 7],
    ],
    [
        [6, 5, 4]
    ]
])

  C = np.array([


In [340]:
C.dtype

dtype('O')

In [341]:
C.shape

(2,)

In [342]:
C.size

2

In [343]:
type(C[0])

list
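On recent NumPy versions, the same object array can still be built by stating the intent explicitly. This is a minimal sketch, assuming an object array is really what is wanted; the name `ragged` is illustrative.

```python
import numpy as np

ragged = [
    [[12, 11, 10], [9, 8, 7]],
    [[6, 5, 4]],
]

# Passing dtype=object states the intent explicitly.
C = np.array(ragged, dtype=object)

print(C.dtype)     # object
print(C.shape)     # (2,)
print(type(C[0]))  # <class 'list'>
```

Such an array holds two Python lists, so none of the fast numeric operations apply to it.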

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Indexing and Slicing of Matrices

In [344]:
# Creating a matrix

A = np.array([
#.   0. 1. 2.
    [1, 2, 3], # 0
    [4, 5, 6], # 1
    [7, 8, 9]  # 2
])

In [345]:
A[1]

array([4, 5, 6])

In [346]:
# Selecting a single value within the 2D array

A[1][0]

4

In [347]:
# This is done better through multidimensional selection of numpy
# A[d1, d2, d3, d4] -  d stands for dimension, and d'n' is a selector

In [348]:
A[1, 0]

4

In [349]:
A[0:2]

array([[1, 2, 3],
       [4, 5, 6]])

In [350]:
# Select all rows, and the first 2 columns of each

A[:, :2]

array([[1, 2],
       [4, 5],
       [7, 8]])

In [351]:
# Select the first 2 rows, and the first 2 columns of each

A[:2, :2]

array([[1, 2],
       [4, 5]])

In [352]:
# Select the first 2 rows, and the columns from index 2 (the 3rd column) to the end

A[:2, 2:]

array([[3],
       [6]])

In [353]:
A

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [354]:
# Modification of a matrix: replacing an entire row
# Works if the new row has a matching shape

A[1] = np.array([10, 10, 10])

In [355]:
A

array([[ 1,  2,  3],
       [10, 10, 10],
       [ 7,  8,  9]])

In [356]:
# A broadcast assignment:
# the single value is assigned to every element of the row!

A[2] = 99

In [357]:
A

array([[ 1,  2,  3],
       [10, 10, 10],
       [99, 99, 99]])
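One detail worth keeping in mind when modifying matrices: basic slices are views onto the original data. The following is an illustrative sketch (the names `col` and `safe` are assumptions of this example) that selects a column and shows the difference between a view and a copy.

```python
import numpy as np

A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

# Select a whole column: all rows, column index 1.
col = A[:, 1]
print(col)        # [2 5 8]

# Basic slices are views onto the same memory, so this also changes A.
col[0] = 20
print(A[0, 1])    # 20

# Use .copy() when an independent array is needed.
safe = A[:, 1].copy()
safe[0] = -1
print(A[0, 1])    # still 20
```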

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Summary statistics

* Built in with numpy

In [358]:
a = np.array([1, 2, 3, 4])

In [359]:
a.sum()

10

In [360]:
a.mean()

2.5

In [361]:
a.std()

1.118033988749895

In [362]:
a.var()

1.25

In [363]:
A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

In [364]:
A.sum()

45

In [365]:
A.mean()

5.0

In [366]:
A.std()

2.581988897471611

In [367]:
# Sum of each column (collapses axis 0, the rows)

A.sum(axis=0)

array([12, 15, 18])

In [368]:
# Sum of each row (collapses axis 1, the columns)

A.sum(axis=1)

array([ 6, 15, 24])

In [369]:
A.mean(axis=0)

array([4., 5., 6.])

In [370]:
A.mean(axis=1)

array([2., 5., 8.])

In [371]:
A.std(axis=0)

array([2.44948974, 2.44948974, 2.44948974])

In [372]:
A.std(axis=1)

array([0.81649658, 0.81649658, 0.81649658])

And [many more](https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.ndarray.html#array-methods)...
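A few of those other methods, sketched on the same matrix. This is only a sample chosen for illustration, not a complete list.

```python
import numpy as np

A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

print(A.min(), A.max())      # 1 9
print(A.argmax())            # 8: position of the maximum in the flattened array
print(A.max(axis=1))         # [3 6 9]: maximum of each row
print(A.cumsum(axis=0)[-1])  # [12 15 18]: running column sums, last row

# keepdims=True keeps the reduced axis with length 1
print(A.sum(axis=1, keepdims=True).shape)  # (3, 1)
```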

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Broadcasting and Vectorized operations

Vectorized operations can happen between vectors and vectors, as well as between vectors and scalars.

They are extremely fast.

In [373]:
# arange function creates a numerical range (start, stop, step), excluding stop value in the range.

a = np.arange(10)

In [374]:
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [375]:
# The operation is broadcasted to all elements individually
a + 10

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [376]:
a * 10

array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])

In [377]:
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [378]:
# An in-place operation: it modifies a itself

a += 100

In [379]:
a

array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109])

In [380]:
# A normal python list
l = [0, 1, 2, 3, 4, 5]

In [381]:
# This is a list comprehension, in which we iterate through each element.
# It gives the same result as a vectorized operation in NumPy, but runs as a Python loop.

[i * 10 for i in l]

[0, 10, 20, 30, 40, 50]

In [382]:
a = np.arange(4)

In [383]:
a

array([0, 1, 2, 3])

In [384]:
b = np.array([10, 10, 10, 10])

In [385]:
b

array([10, 10, 10, 10])

In [386]:
# For element-wise vector operations, both arrays need the same shape (or shapes that can be broadcast together)
a + b

array([10, 11, 12, 13])

In [387]:
a * b

array([ 0, 10, 20, 30])
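Broadcasting also applies between arrays of different but compatible shapes, not only between an array and a scalar. A minimal sketch with illustrative names (`row`, `col`):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)      # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,)
col = np.array([[100], [200]])      # shape (2, 1)

print(A + row)  # row is added to every row of A
# [[10 21 32]
#  [13 24 35]]

print(A + col)  # col is added to every column of A
# [[100 101 102]
#  [203 204 205]]
```

Shapes are compared from the last dimension backwards; two dimensions are compatible when they are equal or when one of them is 1.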

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Boolean arrays
_(Also called masks)_

In [388]:
a = np.arange(10)

In [389]:
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [390]:
# Typical python way of selecting an element

a[0], a[-1]

(0, 9)

In [391]:
# Selecting elements through multiindexing, specific to numpy

a[[0, -1]]

array([0, 9])

In [392]:
# Selecting elements through a Boolean array.
# But typing the mask by hand clearly isn't scalable!

a[[True, False, False, True, True, False, False, True, True, False]]

# BUT arrays such as [True, False, False, True] are the result of 
# broadcasting Boolean operations

array([0, 3, 4, 7, 8])

In [393]:
# This is a broadcasted operation where 10 is added to each element of the array

a + 10

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

### Filtering through Boolean arrays

In [394]:
# Similarly, each element is compared to 6 through the broadcast operation below
# It results in a Boolean array

a >= 6

array([False, False, False, False, False, False,  True,  True,  True,
        True])

In [395]:
# NOW, those two can be combined to return all the values from the array
# that satisfy the given condition

a[a >= 6]

array([6, 7, 8, 9])

In [396]:
# Calculating mean of the set

a.mean()

4.5

In [397]:
# Return all elements greater than the mean

a[a > a.mean()]

array([5, 6, 7, 8, 9])

In [398]:
# ~ is the Boolean NOT operator
# This expression returns all elements less than or equal to the mean

a[~(a > a.mean())]

array([0, 1, 2, 3, 4])

In [399]:
# In numpy | (pipe) is the OR operator

a[(a == 5) | (a == 8)]

array([5, 8])

In [400]:
# In numpy & (ampersand) is the AND operator

a[(a > 2) & (a % 2 == 0)]

array([4, 6, 8])
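Masks are not limited to reading values. As a hedged sketch (the name `labels` is illustrative), the same Boolean arrays can drive assignment, element-wise choice with `np.where`, and counting:

```python
import numpy as np

a = np.arange(10)

# Assign through a mask: every element greater than 6 becomes 0.
a[a > 6] = 0
print(a)                       # [0 1 2 3 4 5 6 0 0 0]

# np.where picks between two values element by element.
labels = np.where(a % 2 == 0, "even", "odd")
print(labels[:4])              # ['even' 'odd' 'even' 'odd']

# Count matches by summing the mask (True counts as 1).
print((a == 0).sum())          # 4
```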

In [401]:
A = np.random.randint(100, size=(3, 3))

In [402]:
A

array([[98, 65, 74],
       [42, 82, 76],
       [31, 51, 63]])

In [403]:
# Manually typing a Boolean array

A[np.array([
    [True, False, True],
    [False, True, False],
    [True, False, True]
])]

array([98, 74, 82, 31, 63])

In [404]:
# This comparison generates a Boolean array marking which elements satisfy the condition

A > 30

array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])

In [405]:
# Combining the two as above, gives us the filtered result

A[A > 30]

array([98, 65, 74, 42, 82, 76, 31, 51, 63])

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Linear Algebra

NumPy contains the most important operations for linear algebra, already optimized with low-level implementations, so they are extremely fast.

In [406]:
A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

In [407]:
B = np.array([
    [6, 5],
    [4, 3],
    [2, 1]
])

In [408]:
# Dot product through function use

A.dot(B)

array([[20, 14],
       [56, 41],
       [92, 68]])

In [409]:
# Dot product through operator use

A @ B

array([[20, 14],
       [56, 41],
       [92, 68]])

In [410]:
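# Cross product through function use, computed row by row
# (rows of B have only 2 components, so the missing z component is treated as 0)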

np.cross(A, B)

array([[-15,  18,  -7],
       [-18,  24,  -8],
       [ -9,  18,  -9]])

In [411]:
# Transposing a matrix

B.T

array([[6, 4, 2],
       [5, 3, 1]])

In [412]:
A

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [413]:
B.T @ A

array([[36, 48, 60],
       [24, 33, 42]])
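Beyond products and transposes, the `numpy.linalg` module covers common tasks such as solving linear systems. The small system below is made up for illustration; it is a sketch, not part of the original notebook.

```python
import numpy as np

# Solve the system  2x + y = 5,  x + 3y = 10
M = np.array([[2.0, 1.0],
              [1.0, 3.0]])
rhs = np.array([5.0, 10.0])

x = np.linalg.solve(M, rhs)
print(x)                          # approximately [1. 3.]

# Check the solution and look at two related quantities.
print(np.allclose(M @ x, rhs))    # True
print(np.linalg.det(M))           # approximately 5.0
print(np.linalg.inv(M) @ rhs)     # same solution via the inverse
```

`np.linalg.solve` is generally preferable to multiplying by the inverse, since it avoids forming the inverse explicitly.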

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Size of objects in Memory

### Int, floats

In [414]:
# An integer in Python takes more than 24 bytes (NOT bits, bytes!)

sys.getsizeof(1)

28

In [415]:
# 1 and 100000 take the same space in memory, so small values are stored very inefficiently

sys.getsizeof(100000)

28

In [416]:
# Longs are even larger

sys.getsizeof(10**100)

72

In [417]:
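# A NumPy int is much smaller (4 bytes on this platform; the default int is
# 8 bytes on most 64-bit Linux/macOS builds and in NumPy 2.x)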

np.dtype(int).itemsize

4

In [418]:
# Numpy size is much smaller

np.dtype(np.int8).itemsize

1

In [419]:
np.dtype(float).itemsize

8

### Lists are even larger

In [420]:
# A one-element list

sys.getsizeof([1])

64

In [421]:
# An array of one element in numpy

np.array([1]).nbytes

4

### And performance is also important

In [422]:
# A python list

l = list(range(10000))

In [423]:
l[-1]

9999

In [424]:
# A numpy array

a = np.arange(10000)

In [425]:
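# Caution: the result below is wrong. On this platform `a` is int32, so the sum
# overflows; the true value is 333283335000 (see the list version in the next cell).

%time np.sum(a ** 2)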

CPU times: total: 0 ns
Wall time: 0 ns


-1724114088

In [426]:
%time sum([x ** 2 for x in l])

CPU times: total: 0 ns
Wall time: 4.01 ms


333283335000
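The negative NumPy result above comes from 32-bit integer overflow, not from the algorithm. A minimal sketch of two ways to avoid it, with dtypes fixed explicitly so the behaviour does not depend on the platform (the names `a32` and `a64` are illustrative):

```python
import numpy as np

a32 = np.arange(10000, dtype=np.int32)
a64 = np.arange(10000, dtype=np.int64)

# Ask for a 64-bit accumulator explicitly, or start from 64-bit data.
print(np.sum(a32 ** 2, dtype=np.int64))  # 333283335000
print(np.sum(a64 ** 2))                  # 333283335000

# Same value as the pure-Python computation.
print(sum(x ** 2 for x in range(10000)))  # 333283335000
```

For more reliable timings than a single `%time` run, `%timeit` repeats the statement and reports an average.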

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Useful Numpy functions

### `random` 

NumPy includes a full subpackage, numpy.random, dedicated to working with random numbers. 

In [427]:
# numpy.random.random() returns an array of specified shape 
# and fills it with random floats in the half-open interval [0.0, 1.0).

np.random.random(size=2)

array([0.33798291, 0.30480404])

In [459]:
# np.random.normal() returns an array of specified shape
# and fills it with random values from the normal distribution.

normal = np.random.normal(size=10000)

In [458]:
# The standard deviation of the generated set of numbers is almost 1.

normal.std()

0.9973825431938228

In [452]:
# Create an array of the given shape 
# and populate it with random samples from a uniform distribution over [0, 1).

np.random.rand(2, 4)

array([[0.98096233, 0.31570036, 0.37759529, 0.49857129],
       [0.8920334 , 0.6697007 , 0.91204055, 0.90674142]])
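The outputs above change on every run. When reproducible numbers are needed, a seeded generator can be used. The sketch below uses `np.random.default_rng`, the generator interface available since NumPy 1.17; the seed value 42 is an arbitrary choice made for this example.

```python
import numpy as np

# Two generators built from the same seed produce the same numbers.
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

sample_a = rng_a.normal(loc=0.0, scale=1.0, size=5)
sample_b = rng_b.normal(loc=0.0, scale=1.0, size=5)
print(np.array_equal(sample_a, sample_b))  # True

# Generator equivalents of the calls above.
print(rng_a.random(size=2).shape)              # (2,)
print(rng_a.integers(100, size=(3, 3)).shape)  # (3, 3)
```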

---
### `arange`

In [430]:
# stop only (start defaults to 0)

np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [431]:
# start, stop

np.arange(5, 10)

array([5, 6, 7, 8, 9])

In [432]:
# start, stop, step

np.arange(0, 1, .1)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

---
### `reshape`

Reshaping means changing the shape of an array.

The shape of an array is the number of elements in each dimension.

By reshaping we can add or remove dimensions or change number of elements in each dimension.

**Remember:** `reshape` raises a `ValueError` if the total number of elements does not exactly fill the new shape.

In [461]:
np.arange(10).reshape(2, 5)

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

In [434]:
np.arange(10).reshape(5, 2)

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

In [462]:
np.arange(12).reshape(2, 2, 3)

array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]]])
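Two reshape behaviours are worth seeing side by side: letting NumPy infer one dimension with `-1`, and the error raised when the element count does not fit. A short sketch:

```python
import numpy as np

a = np.arange(12)

# -1 asks NumPy to infer that dimension from the total size.
print(a.reshape(3, -1).shape)   # (3, 4)
print(a.reshape(-1, 6).shape)   # (2, 6)

# 12 elements cannot fill a 5-row layout, so this raises ValueError.
try:
    a.reshape(5, -1)
except ValueError as err:
    print("reshape failed:", err)
```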

---
### `linspace`

When you’re working with numerical applications using NumPy, you often need to create an array of numbers. 

In many cases you want the _numbers to be evenly spaced_, but there are also times when you may need _non-evenly spaced numbers_. 

One of the key tools you can use in both situations is np.linspace()

np.linspace() has two required parameters, start and stop, which you can use to set the beginning and end of the range.

The function returns a closed range, that is, a range which *includes* the end value by default.

In [464]:
np.linspace(0, 1)

# This code returns an ndarray with equally spaced intervals between the start and stop values. 
# This is a vector space, also called a linear space, which is where the name linspace comes from.

array([0.        , 0.02040816, 0.04081633, 0.06122449, 0.08163265,
       0.10204082, 0.12244898, 0.14285714, 0.16326531, 0.18367347,
       0.20408163, 0.2244898 , 0.24489796, 0.26530612, 0.28571429,
       0.30612245, 0.32653061, 0.34693878, 0.36734694, 0.3877551 ,
       0.40816327, 0.42857143, 0.44897959, 0.46938776, 0.48979592,
       0.51020408, 0.53061224, 0.55102041, 0.57142857, 0.59183673,
       0.6122449 , 0.63265306, 0.65306122, 0.67346939, 0.69387755,
       0.71428571, 0.73469388, 0.75510204, 0.7755102 , 0.79591837,
       0.81632653, 0.83673469, 0.85714286, 0.87755102, 0.89795918,
       0.91836735, 0.93877551, 0.95918367, 0.97959184, 1.        ])

The array in the example above is of length 50, which is the default number. 

In most cases, you’ll want to set your own number of values in the array. Depending on the application you’re developing, you may think of num as the *sampling*, or *resolution*, of the array you’re creating.

You can do so with the optional parameter num:

In [465]:
np.linspace(0, 1, num = 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [467]:
# The num parameter can be passed as a positional argument, without explicitly mentioning its name in the function call.

np.linspace(0, 1, 20)

array([0.        , 0.05263158, 0.10526316, 0.15789474, 0.21052632,
       0.26315789, 0.31578947, 0.36842105, 0.42105263, 0.47368421,
       0.52631579, 0.57894737, 0.63157895, 0.68421053, 0.73684211,
       0.78947368, 0.84210526, 0.89473684, 0.94736842, 1.        ])

By default, np.linspace() uses a closed interval, [start, stop], in which the endpoint is included. 

This will often be your desired way of using this function. 

However, if you need to create a linear space with a **half-open interval**, [start, stop), then you can set the optional Boolean parameter endpoint to False:

In [437]:
np.linspace(0, 1, 20, endpoint=False)

array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
       0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95])

### `arange` vs `linspace`

**Here’s a good rule of thumb for deciding which of the two functions to use:**

* Use np.linspace() when the exact values for the start and end points of your range are the important attributes in your application.
* Use np.arange() when the step size between values is more important.
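The rule of thumb can be sketched as follows. The step 0.25 is chosen because it is exactly representable in binary floating point, so the comparison at the end is exact; with steps such as 0.1, rounding can make `np.arange` results less predictable.

```python
import numpy as np

# linspace: fix the endpoints and the count, let NumPy work out the step.
points, step = np.linspace(0, 1, num=5, retstep=True)
print(points)  # [0.   0.25 0.5  0.75 1.  ]
print(step)    # 0.25

# arange: fix the step; the stop value itself is excluded.
print(np.arange(0, 1, 0.25))  # [0.   0.25 0.5  0.75]

# The two agree when linspace also excludes the endpoint.
print(np.array_equal(np.arange(0, 1, 0.25),
                     np.linspace(0, 1, num=4, endpoint=False)))  # True
```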

---
### `zeros`, `ones`, `empty`

In [438]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [439]:
np.zeros((3, 3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [440]:
# np.int was deprecated in NumPy 1.20 and removed in 1.24; use the built-in int
np.zeros((3, 3), dtype=int)

array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])

In [441]:
np.ones(5)

array([1., 1., 1., 1., 1.])

In [442]:
np.ones((3, 3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [469]:
# The numpy.empty() function is used to create a new array of given shape and type, without initializing entries.
# It is typically used for large arrays when performance is critical, and the values will be filled in later.
# The values shown below are whatever happened to be in memory, so they will differ between runs.

np.empty(5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [475]:
np.empty((2, 2))

array([[0.25, 0.5 ],
       [0.75, 1.  ]])
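Because `np.empty` returns arbitrary leftover values, every element should be written before it is read. A minimal sketch of that pattern, together with two related helpers (`np.full` and `np.zeros_like`); the names `buf` and `template` are illustrative.

```python
import numpy as np

# Allocate without initializing, then overwrite every element before reading.
buf = np.empty((2, 3))
buf[:] = 7.0
print(buf)

# np.full does the same in one call.
print(np.full((2, 3), 7.0))

# *_like helpers copy the shape and dtype of an existing array.
template = np.array([[1, 2], [3, 4]])
print(np.zeros_like(template))  # integer zeros with shape (2, 2)
```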

---
### `identity` and `eye`

The main difference is that with `eye` the diagonal can be offset and the matrix need not be square, whereas **`identity` only fills the main diagonal of a square matrix**.

In [445]:
np.identity(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [446]:
np.eye(3, 3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [447]:
# k = 0 by default

np.eye(8, 4)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [483]:
# Assigning a k value not equal to 0 shifts the diagonal

np.eye(8, 4, k=1)

array([[0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [449]:
np.eye(8, 4, k=-3)

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 0.]])
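A short sketch confirming how the two functions relate, and what makes the identity matrix useful in matrix products. The matrix `A` is made up for this example.

```python
import numpy as np

# For a square matrix with k=0 the two functions agree.
print(np.array_equal(np.identity(3), np.eye(3)))  # True

# eye with an offset: ones on the first superdiagonal.
print(np.eye(3, k=1))
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [0. 0. 0.]]

# Multiplying by the identity leaves a matrix unchanged.
A = np.arange(9).reshape(3, 3)
print(np.array_equal(A @ np.identity(3), A))  # True
```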

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)