<a href="https://colab.research.google.com/github/pablocurcodev/machine_learning/blob/main/NumPy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NumPy Basics: Arrays and Vectorized Computation**

NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python. Many computational packages providing scientific functionality use NumPy’s array objects as one of the standard interface lingua francas for data exchange. Much of the knowledge about NumPy that I cover is transferable to pandas as well.

Here are some of the things you’ll find in NumPy:

- ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities
- Mathematical functions for fast operations on entire arrays of data without having to write loops
- Tools for reading/writing array data to disk and working with memory-mapped files
- Linear algebra, random number generation, and Fourier transform capabilities
- A C API for connecting NumPy with libraries written in C, C++, or FORTRAN

Source: https://learning.oreilly.com/library/view/python-for-data/9781098104023/ch04.html

For most data analysis applications, the main areas of functionality are:

- Fast array-based operations for data munging and cleaning, subsetting and filtering, transformation, and any other kind of computation
- Common array algorithms like sorting, unique, and set operations
- Efficient descriptive statistics and aggregating/summarizing data
- Data alignment and relational data manipulations for merging and joining heterogeneous datasets
- Expressing conditional logic as array expressions instead of loops with if-elif-else branches
- Group-wise data manipulations (aggregation, transformation, and function application)

In [2]:
import numpy as np

my_arr = np.arange(1_000_000)

my_list = list(range(1_000_000))

%timeit my_arr2 = my_arr * 2

%timeit my_list2 = [x * 2 for x in my_list]

# NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory.


1.3 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
94.6 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
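
The %timeit magic only works inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, assuming the standard-library timeit module and an arbitrary repeat count of 5, something like the following could be used. Absolute timings depend on the machine, so no particular numbers are implied.

```python
import timeit

import numpy as np

my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

repeats = 5  # arbitrary choice for this sketch

# Total wall-clock time for `repeats` runs of each approach.
numpy_seconds = timeit.timeit(lambda: my_arr * 2, number=repeats)
list_seconds = timeit.timeit(lambda: [x * 2 for x in my_list], number=repeats)

print(f"NumPy: {numpy_seconds / repeats * 1e3:.2f} ms per loop")
print(f"list:  {list_seconds / repeats * 1e3:.2f} ms per loop")

# Both approaches compute the same values.
same_values = (my_arr * 2).tolist() == [x * 2 for x in my_list]
print(same_values)
```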


## **The NumPy ndarray: A Multidimensional Array Object**

One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.

In [3]:
import numpy as np

data = np.array([[1.5, -0.1, 3], [0, -3, 6.5]])

print(data)

print(data * 10)

print(data + data)

[[ 1.5 -0.1  3. ]
 [ 0.  -3.   6.5]]
[[ 15.  -1.  30.]
 [  0. -30.  65.]]
[[ 3.  -0.2  6. ]
 [ 0.  -6.  13. ]]


An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array:

In [4]:
print(data.shape)
print(data.dtype)

(2, 3)
float64


The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. For example, a list is a good candidate for conversion:

In [5]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
print(arr1)

[6.  7.5 8.  0.  1. ]
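
The output above shows every element as a float even though most of the inputs were integers. A small sketch of why: when no dtype is given, numpy.array picks one type that can hold all of the values. The variable names here are only for illustration, and the default integer width depends on the platform.

```python
import numpy as np

# Mixing integers and floats yields a single floating-point dtype.
mixed = np.array([6, 7.5, 8, 0, 1])
print(mixed.dtype)

# An all-integer list yields the platform's default integer dtype.
whole = np.array([1, 2, 3])
print(whole.dtype)
```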


Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:

In [6]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)

arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

Since data2 was a list of lists, the NumPy array arr2 has two dimensions, with shape inferred from the data. We can confirm this by inspecting the ndim and shape attributes:

In [7]:
print(arr2.ndim)
print(arr2.shape)

2
(2, 4)


In addition to numpy.array, there are a number of other functions for creating new arrays. As examples, numpy.zeros and numpy.ones create arrays of 0s or 1s, respectively, with a given length or shape. numpy.empty creates an array without initializing its values to any particular value. To create a higher dimensional array with these functions, pass a tuple for the shape:

In [8]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [9]:
np.zeros((3, 6))

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [10]:
np.empty((2, 3, 2))

array([[[4.87854459e-310, 0.00000000e+000],
        [0.00000000e+000, 0.00000000e+000],
        [0.00000000e+000, 0.00000000e+000]],

       [[0.00000000e+000, 0.00000000e+000],
        [0.00000000e+000, 0.00000000e+000],
        [0.00000000e+000, 0.00000000e+000]]])

It’s not safe to assume that numpy.empty will return an array of all zeros. This function returns uninitialized memory and thus may contain nonzero “garbage” values. You should use this function only if you intend to populate the new array with data.
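
As a minimal sketch of the intended usage pattern, the following allocates with numpy.empty and then writes every element before reading any of them. The fill values are arbitrary and chosen only for illustration.

```python
import numpy as np

# Allocate without initializing; the contents are arbitrary at this point.
buffer = np.empty((2, 3))

# Populate every row before reading from the array.
for i in range(buffer.shape[0]):
    buffer[i] = np.arange(3) + 10 * i

print(buffer)
```

When a known starting value is needed, numpy.zeros or numpy.ones is the safer choice.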

numpy.arange is an array-valued version of the built-in Python range function:

In [11]:
np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
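
Like range, numpy.arange also accepts start, stop, and step arguments. The sketch below compares the two and shows one difference: arange allows a non-integer step. The specific values are arbitrary.

```python
import numpy as np

# start=2, stop=11 (exclusive), step=2, mirroring range(2, 11, 2)
evens = np.arange(2, 11, 2)
print(evens)
print(list(range(2, 11, 2)))

# Unlike range, arange accepts a floating-point step.
halves = np.arange(0, 2, 0.5)
print(halves)
```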

## **Data Types for ndarrays**

The data type, or dtype, is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data:

In [12]:
arr1 = np.array([1, 2, 3], dtype=np.float64)
arr2 = np.array([1, 2, 3], dtype=np.int32)

print(arr1.dtype)
print(arr2.dtype)

float64
int32
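
To make the "chunk of memory" idea concrete, the sketch below inspects the itemsize and nbytes attributes of two arrays holding the same three values. The dtype determines how many bytes each element occupies.

```python
import numpy as np

arr_f64 = np.array([1, 2, 3], dtype=np.float64)
arr_i32 = np.array([1, 2, 3], dtype=np.int32)

# itemsize is bytes per element; nbytes is the total for the whole array.
print(arr_f64.itemsize, arr_f64.nbytes)
print(arr_i32.itemsize, arr_i32.nbytes)
```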


In [13]:
arr = np.array([1, 2, 3, 4, 5])
print(arr.dtype)

float_arr = arr.astype(np.float64)
print(float_arr.dtype)

int64
float64


In [14]:
float_arr = arr.astype(np.float64)
print(float_arr)

[1. 2. 3. 4. 5.]


In [15]:
print(float_arr.dtype)

float64


In [16]:
# In this example, integers were cast to floating point. If I cast some floating-point numbers to be of integer data type, the decimal part will be truncated:

arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
print(arr)
print(arr.dtype)
arr = arr.astype(np.int32)
print(arr)
print(arr.dtype)


[ 3.7 -1.2 -2.6  0.5 12.9 10.1]
float64
[ 3 -1 -2  0 12 10]
int32
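
Two further properties of astype are sketched below, using made-up sample values. An array of strings that represent numbers can be converted to a numeric dtype, and with its default arguments astype returns a new array even when the requested dtype matches the old one.

```python
import numpy as np

# Strings that look like numbers can be cast to floating point.
numeric_strings = np.array(["1.25", "-9.6", "42"])
as_floats = numeric_strings.astype(np.float64)
print(as_floats)

# With default arguments, astype copies even if the dtype is unchanged.
ints = np.array([1, 2, 3])
same_dtype = ints.astype(ints.dtype)
print(np.shares_memory(ints, same_dtype))
```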


There are shorthand type code strings you can also use to refer to a dtype:

In [17]:
zeros_uint32 = np.zeros(8, dtype="u4")
zeros_uint32

array([0, 0, 0, 0, 0, 0, 0, 0], dtype=uint32)
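
The type codes combine a kind letter with a size in bytes, so "u4" is a 4-byte unsigned integer. A short sketch checking a few codes against their named equivalents:

```python
import numpy as np

# Each code names the same dtype as the corresponding NumPy type object.
print(np.dtype("u4") == np.uint32)
print(np.dtype("i2") == np.int16)
print(np.dtype("f8") == np.float64)

# The digit in the code is the element size in bytes.
print(np.dtype("u4").itemsize)
```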

## **Arithmetic with NumPy Arrays**

Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy users call this vectorization. Any arithmetic operations between equal-size arrays apply the operation element-wise:

In [18]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])

print(arr)
print(arr * arr)
print(arr - arr)
print(1 / arr)
print(arr ** 2)

[[1. 2. 3.]
 [4. 5. 6.]]
[[ 1.  4.  9.]
 [16. 25. 36.]]
[[0. 0. 0.]
 [0. 0. 0.]]
[[1.         0.5        0.33333333]
 [0.25       0.2        0.16666667]]
[[ 1.  4.  9.]
 [16. 25. 36.]]


In [19]:
# Comparisons between arrays of the same size yield Boolean arrays.
# (Evaluating operations between differently sized arrays is called broadcasting.)
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
print(arr2 > arr)

[[False  True False]
 [ True False  True]]
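
The comparison above uses two arrays of the same shape. For an operation between differently sized arrays, here is a minimal sketch of broadcasting. The row and col arrays are hypothetical values chosen for illustration.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])  # shape (2, 3)
row = np.array([10., 20., 30.])               # shape (3,)
col = np.array([[100.], [200.]])              # shape (2, 1)

# row is stretched across both rows of arr.
plus_row = arr + row
print(plus_row)

# col is stretched across all three columns of arr.
plus_col = arr + col
print(plus_col)
```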


## **Basic Indexing and Slicing**

In [20]:
arr = np.arange(10)
print(arr)
print(arr[5])
print(arr[5:8])
arr[5:8] = 12
print(arr)
arr_slice = arr[5:8]
print(arr_slice)

# Now, when I change values in arr_slice, the mutations are reflected in the original array arr:
arr_slice[1] = 12345
print(arr)

[0 1 2 3 4 5 6 7 8 9]
5
[5 6 7]
[ 0  1  2  3  4 12 12 12  8  9]
[12 12 12]
[    0     1     2     3     4    12 12345    12     8     9]


In [21]:
# The “bare” slice [:] will assign to all values in an array
arr_slice[:] = 64
print(arr)

[ 0  1  2  3  4 64 64 64  8  9]


In [22]:
# If you want a copy of a slice of an ndarray instead of a view,
# you will need to explicitly copy the array—for example,
# arr[5:8].copy(). As you will see, pandas works this way, too.
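
A short sketch of the difference between a view and an explicit copy, using numpy.shares_memory to check whether two arrays overlap in memory:

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]           # a view onto arr's data
copied = arr[5:8].copy()  # an independent copy

view[0] = -1    # changes arr
copied[1] = -2  # does not change arr

print(arr)
print(np.shares_memory(arr, view), np.shares_memory(arr, copied))
```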

With higher dimensional arrays, you have many more options. In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays:

In [23]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr2d[2])

# Thus, individual elements can be accessed recursively. But that is a bit too much work, so you can pass a comma-separated list of indices to select individual elements.
print(arr2d[0][2])
print(arr2d[0, 2])

[7 8 9]
3
3


In [24]:
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr3d)

[[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]


In [25]:
print(arr3d[0])

[[1 2 3]
 [4 5 6]]


In [26]:
print(arr3d[1, 0])

[7 8 9]


Note that in all of these cases where subsections of the array have been selected, the returned arrays are views.
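
Because these selections are views, assigning through one of them modifies the original array. The sketch below rebuilds arr3d so that it runs on its own. It saves a copy first so the original values can be restored afterwards.

```python
import numpy as np

arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

# Keep an independent copy of the first 2 x 3 block.
old_values = arr3d[0].copy()

# Writing a scalar through the view changes arr3d itself.
block = arr3d[0]
block[:] = 42
print(arr3d)

# Restore the saved values.
arr3d[0] = old_values
print(arr3d)
```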

## **Indexing with Slices**

In [27]:
arr = np.array([0, 1, 2, 3, 4, 64, 64, 64, 8, 9])
print(arr[1:6])

print(arr2d)
# select the first two rows of arr2d.
print(arr2d[:2])

[ 1  2  3  4 64]
[[1 2 3]
 [4 5 6]
 [7 8 9]]
[[1 2 3]
 [4 5 6]]


In [28]:
# select the second row but only the first two columns
lower_dim_slice = arr2d[1, :2]
lower_dim_slice

array([4, 5])

In [29]:
# select the third column but only the first two rows
arr2d[:2, 2]


array([3, 6])

In [36]:
# Note that a colon by itself means to take the entire axis, so you can slice only higher dimensional axes by doing

print(arr2d[:, :1])
print(arr2d[:, 1:2])


[[1]
 [4]
 [7]]
[[2]
 [5]
 [8]]
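
The outputs above are two-dimensional because a slice keeps its axis, while an integer index drops it. A small sketch comparing the two ways of selecting the second column:

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

by_index = arr2d[:, 1]    # integer index: the column axis is dropped
by_slice = arr2d[:, 1:2]  # length-1 slice: the column axis is kept

print(by_index.shape, by_index)
print(by_slice.shape)
```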


In [37]:
# Of course, assigning to a slice expression assigns to the whole selection:
arr2d[:2, 1:] = 0
arr2d

array([[1, 0, 0],
       [4, 0, 0],
       [7, 8, 9]])

## **Boolean Indexing**

In [49]:
names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2],
                  [-12, -4], [3, 4]])

print(names)
print(data)
print()
print(names == "Bob")

# This Boolean array can be passed when indexing the array:
print(data[names == "Bob"])

['Bob' 'Joe' 'Will' 'Bob' 'Will' 'Joe' 'Joe']
[[  4   7]
 [  0   2]
 [ -5   6]
 [  0   0]
 [  1   2]
 [-12  -4]
 [  3   4]]

[ True False False  True False False False]
[[4 7]
 [0 0]]


The Boolean array must be of the same length as the array axis it’s indexing. You can even mix and match Boolean arrays with slices or integers (or sequences of integers; more on this later).

In [40]:
data[names == "Bob", 1:]

array([[7],
       [0]])

In [41]:
data[names == "Bob", 1]

array([7, 0])
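
To illustrate the length requirement for the Boolean array, the sketch below indexes with a mask that is too short. In recent NumPy versions this raises an IndexError; short_mask is a hypothetical name used only here.

```python
import numpy as np

data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2],
                 [-12, -4], [3, 4]])

# Axis 0 of data has length 7, but this mask has only 3 entries.
short_mask = np.array([True, False, True])

try:
    data[short_mask]
    mismatch_raised = False
except IndexError as exc:
    mismatch_raised = True
    print("IndexError:", exc)

# A mask of length 2 matches axis 1 and can select columns.
first_column = data[:, np.array([True, False])]
print(first_column.shape)
```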

In [44]:
# To select everything but "Bob" you can either use != or negate the condition
# using ~:

print(names != "Bob")
print(data[~(names == "Bob")])

# The ~ operator can be useful when you want to invert a Boolean array referenced by a variable

cond = names == "Bob"
print(data[~cond])

[False  True  True False  True  True  True]
[[  0   2]
 [ -5   6]
 [  1   2]
 [-12  -4]
 [  3   4]]
[[  0   2]
 [ -5   6]
 [  1   2]
 [-12  -4]
 [  3   4]]


In [45]:
# To select two of the three names, combine multiple Boolean conditions with Boolean arithmetic operators like & (and) and | (or)

mask = (names == "Bob") | (names == "Will")
print(mask)
print(data[mask])

[ True False  True  True  True False False]
[[ 4  7]
 [-5  6]
 [ 0  0]
 [ 1  2]]


Selecting data from an array by Boolean indexing and assigning the result to a
new variable always creates a copy of the data, even if the returned array is unchanged.

The Python keywords and and or do not work with Boolean arrays. Use & (and) and | (or) instead.
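
Both points can be checked with a short sketch. The data array here is a stand-in built with np.arange so the example runs on its own.

```python
import numpy as np

names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.arange(14).reshape((7, 2))  # stand-in values for illustration

is_bob = names == "Bob"
is_will = names == "Will"

# The `and` keyword needs a single truth value, which an array cannot give.
try:
    is_bob and is_will
    keyword_failed = False
except ValueError as exc:
    keyword_failed = True
    print("ValueError:", exc)

# The element-wise operator works as intended.
mask = is_bob | is_will

# Boolean indexing returns a copy, so editing it leaves data untouched.
selected = data[mask]
selected[:] = -1
print(np.shares_memory(data, selected))
print(data[0])
```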



In [51]:
print(data)
data[data < 0] = 0
print(data)

data[names != "Joe"] = 7
print(data)

[[  4   7]
 [  0   2]
 [ -5   6]
 [  0   0]
 [  1   2]
 [-12  -4]
 [  3   4]]
[[4 7]
 [0 2]
 [0 6]
 [0 0]
 [1 2]
 [0 0]
 [3 4]]
[[7 7]
 [0 2]
 [7 7]
 [7 7]
 [7 7]
 [0 0]
 [3 4]]


## **Transposing Arrays and Swapping Axes**

In [52]:
arr = np.arange(15).reshape((3, 5))
print(arr)

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]


In [53]:
print(arr.T)

[[ 0  5 10]
 [ 1  6 11]
 [ 2  7 12]
 [ 3  8 13]
 [ 4  9 14]]


When doing matrix computations, you may transpose arrays very often, for example when computing the inner matrix product using numpy.dot:

In [54]:
arr = np.array([[0, 1, 0], [1, 2, -2], [6, 3, 2], [-1, 0, -1], [1, 0, 1]])

print(arr)
print('-----------------')
print(np.dot(arr.T, arr))

[[ 0  1  0]
 [ 1  2 -2]
 [ 6  3  2]
 [-1  0 -1]
 [ 1  0  1]]
-----------------
[[39 20 12]
 [20 14  2]
 [12  2 10]]


The @ infix operator is another way to do matrix multiplication:

In [55]:
arr.T @ arr

array([[39, 20, 12],
       [20, 14,  2],
       [12,  2, 10]])
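
As a quick check, the sketch below confirms that numpy.dot and @ give the same result for this two-dimensional case, and that the product of a matrix's transpose with the matrix itself is symmetric, as the output above suggests.

```python
import numpy as np

arr = np.array([[0, 1, 0], [1, 2, -2], [6, 3, 2], [-1, 0, -1], [1, 0, 1]])

via_dot = np.dot(arr.T, arr)  # (3, 5) times (5, 3) gives (3, 3)
via_at = arr.T @ arr

print(via_dot.shape)
print(np.array_equal(via_dot, via_at))

# The result equals its own transpose.
print(np.array_equal(via_at, via_at.T))
```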

Simple transposing with .T is a special case of swapping axes. ndarray has the method swapaxes, which takes a pair of axis numbers and switches the indicated axes to rearrange the data:

In [56]:
arr

array([[ 0,  1,  0],
       [ 1,  2, -2],
       [ 6,  3,  2],
       [-1,  0, -1],
       [ 1,  0,  1]])

In [58]:
arr.swapaxes(0, 1)

# swapaxes similarly returns a view on the data without making a copy

array([[ 0,  1,  6, -1,  1],
       [ 1,  2,  3,  0,  0],
       [ 0, -2,  2, -1,  1]])
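
With more than two dimensions, swapaxes gives finer control than .T. The sketch below uses a hypothetical 2 x 3 x 4 array to show the shape change, confirm that the result is a view, and check that .T matches swapaxes(0, 1) in the two-dimensional case.

```python
import numpy as np

cube = np.arange(24).reshape((2, 3, 4))

# Swap the last two axes; the first axis stays in place.
swapped = cube.swapaxes(1, 2)
print(swapped.shape)

# The result is a view: writing to it changes the original.
print(np.shares_memory(cube, swapped))
swapped[0, 0, 0] = -1
print(cube[0, 0, 0])

# For a two-dimensional array, .T and swapaxes(0, 1) agree.
matrix = np.arange(6).reshape((2, 3))
print(np.array_equal(matrix.T, matrix.swapaxes(0, 1)))
```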