<a href="https://colab.research.google.com/github/pablocurcodev/machine_learning/blob/main/NumPy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NumPy Basics: Arrays and Vectorized Computation**

NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python. Many computational packages that provide scientific functionality use NumPy’s array objects as a lingua franca for data exchange. Much of what I cover about NumPy transfers to pandas as well.

Here are some of the things you’ll find in NumPy:

- ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities
- Mathematical functions for fast operations on entire arrays of data without having to write loops
- Tools for reading/writing array data to disk and working with memory-mapped files
- Linear algebra, random number generation, and Fourier transform capabilities
- A C API for connecting NumPy with libraries written in C, C++, or FORTRAN

Source: https://learning.oreilly.com/library/view/python-for-data/9781098104023/ch04.html

For most data analysis applications, the main areas of functionality are:

- Fast array-based operations for data munging and cleaning, subsetting and filtering, transformation, and any other kind of computation
- Common array algorithms like sorting, unique, and set operations
- Efficient descriptive statistics and aggregating/summarizing data
- Data alignment and relational data manipulations for merging and joining heterogeneous datasets
- Expressing conditional logic as array expressions instead of loops with if-elif-else branches
- Group-wise data manipulations (aggregation, transformation, and function application)
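
A few of the areas listed above can be illustrated with a minimal sketch. The array `values` is made-up sample data, not something from the source; `np.where`, `np.sort`, and `np.unique` are standard NumPy functions.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration
values = np.array([3, -1, 4, -1, 5, -9, 2, 6])

# Conditional logic as an array expression: replace negatives with 0
clipped = np.where(values < 0, 0, values)

# Common array algorithms: sorting and unique values
sorted_values = np.sort(values)
unique_values = np.unique(values)

# Descriptive statistics over the whole array
mean_value = values.mean()

print(clipped)        # [3 0 4 0 5 0 2 6]
print(sorted_values)  # [-9 -1 -1  2  3  4  5  6]
print(unique_values)  # [-9 -1  2  3  4  5  6]
print(mean_value)     # 1.125
```

None of these steps needs an explicit Python loop; each call operates on the whole array at once.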

In [6]:
import numpy as np

my_arr = np.arange(1_000_000)

my_list = list(range(1_000_000))

%timeit my_arr2 = my_arr * 2

%timeit my_list2 = [x * 2 for x in my_list]

# NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory.


1.41 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
77.5 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
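
The `%timeit` magic only works in IPython or Jupyter. Outside a notebook, roughly the same comparison can be sketched with the standard-library `timeit` module. The names `numpy_time` and `list_time` and the choice of 10 repetitions are assumptions made for this example, and the absolute numbers will differ from machine to machine.

```python
import timeit

import numpy as np

my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

# Run each approach 10 times and record the total elapsed seconds
repeats = 10
numpy_time = timeit.timeit(lambda: my_arr * 2, number=repeats)
list_time = timeit.timeit(lambda: [x * 2 for x in my_list], number=repeats)

# Report the average time per run in milliseconds
print(f"NumPy: {numpy_time / repeats * 1e3:.2f} ms per loop")
print(f"list:  {list_time / repeats * 1e3:.2f} ms per loop")
```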


## **The NumPy ndarray: A Multidimensional Array Object**

One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.

In [8]:
import numpy as np

data = np.array([[1.5, -0.1, 3], [0, -3, 6.5]])

print(data)

print(data * 10)

print(data + data)

[[ 1.5 -0.1  3. ]
 [ 0.  -3.   6.5]]
[[ 15.  -1.  30.]
 [  0. -30.  65.]]
[[ 3.  -0.2  6. ]
 [ 0.  -6.  13. ]]


An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array.

In [9]:
print(data.shape)
print(data.dtype)

(2, 3)
float64
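
To see what "homogeneous" means in practice, here is a small sketch with made-up arrays (`mixed` and `ints` are names introduced only for this example). When integers and floats are mixed, NumPy picks a single dtype that can hold every element.

```python
import numpy as np

# Mixing ints and floats: every element is stored as a float
mixed = np.array([1, 2, 3.5])
print(mixed)
print(mixed.dtype)  # float64

# A 3 x 2 array of integers; the exact integer dtype depends on the platform
ints = np.array([[1, 2], [3, 4], [5, 6]])
print(ints.shape)   # (3, 2)
print(ints.dtype)
```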


The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. For example, a list is a good candidate for conversion:

In [11]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
print(arr1)

[6.  7.5 8.  0.  1. ]
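
The same function also accepts tuples and existing arrays. The sketch below uses names invented for illustration (`from_tuple`, `original`, `copied`) and relies on `np.array` copying its input by default, so the new array is independent of the one passed in.

```python
import numpy as np

# A tuple works just like a list
from_tuple = np.array((6, 7.5, 8, 0, 1))

# Passing an existing array produces a new array with copied data
original = np.array([1, 2, 3])
copied = np.array(original)
copied[0] = 99  # modifying the copy leaves the original untouched

print(from_tuple)
print(original)  # [1 2 3]
print(copied)    # [99  2  3]
```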


Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:

In [12]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)

arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

Since data2 was a list of lists, the NumPy array arr2 has two dimensions, with shape inferred from the data. We can confirm this by inspecting the ndim and shape attributes:

In [14]:
print(arr2.ndim)
print(arr2.shape)

2
(2, 4)
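
The same inference extends to deeper nesting. As a brief sketch with a made-up list `data3` (not part of the source), three levels of nesting yield a three-dimensional array:

```python
import numpy as np

# Three levels of nesting: 2 blocks, each with 2 rows of 3 values
data3 = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
arr3 = np.array(data3)

print(arr3.ndim)   # 3
print(arr3.shape)  # (2, 2, 3)
print(arr3.size)   # 12
```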
