# NUMPY

NumPy is a Python library that provides a simple yet powerful data structure: the n-dimensional array. This is the foundation on which almost all the power of Python’s data science toolkit is built, and learning NumPy is the first step on any Python data scientist’s journey.

https://en.wikipedia.org/wiki/Matrix_(mathematics)

Here are the top *four benefits* that NumPy can bring to your code:
- **More speed**: NumPy uses algorithms written in C that typically run far faster than equivalent pure-Python loops.
- **Fewer loops**: NumPy helps you to reduce loops and keep from getting tangled up in iteration indices.
- **Clearer code**: Without loops, your code will look more like the equations you’re trying to calculate.
- **Better quality**: There are thousands of contributors working to keep NumPy fast, friendly, and bug free.

Because of these benefits, NumPy is the de facto standard for multidimensional arrays in Python data science, and many of the most popular libraries are built on top of it. Learning NumPy is a great way to set down a solid foundation as you expand your knowledge into more specific areas of data science.

In [1]:
import numpy as np

$$Matrix =
\begin{bmatrix}
1&2&3\\
4&5&6\\
7&7&9
\end{bmatrix}$$

In [2]:
digits = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 7, 9]
])
digits

array([[1, 2, 3],
       [4, 5, 6],
       [7, 7, 9]])

### Hello NumPy: Curving Test Grades Tutorial

This first example introduces a few core concepts in NumPy that you’ll use throughout the rest of the tutorial:

* Creating arrays using numpy.array()
* Treating complete arrays like individual values to make vectorized calculations more readable
* Using built-in NumPy functions to modify and aggregate the data

These concepts are the core of using NumPy effectively.

The scenario is this: You’re a teacher who has just graded your students on a recent test. Unfortunately, you may have made the test too challenging, and most of the students did worse than expected. To help everybody out, you’re going to curve everyone’s grades.

It’ll be a relatively rudimentary curve, though. You’ll take whatever the average score is and declare that a C. Additionally, you’ll make sure that the curve doesn’t accidentally hurt your students’ grades or help so much that the student does better than 100%.

In [3]:
curve_center = 80
# Create NumPy array
grades = np.array([72, 35, 64, 88, 51, 90, 74, 12])
def curve(grades):
    # Take the mean of the NumPy array.
    average = grades.mean()
    change = curve_center - average
    new_grades = grades + change
    # Clip each value: no lower than the original grade, no higher than 100.
    return np.clip(new_grades, grades, 100)
curve(grades)

array([ 91.25,  54.25,  83.25, 100.  ,  70.25, 100.  ,  93.25,  31.25])

### new_grades = grades + change

This involves two important concepts at once:
* Vectorization
* Broadcasting

**Vectorization** is the process of performing the same operation in the same way for each element in an array. This removes for loops from your code but achieves the same result.
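To make the idea concrete, here is a minimal sketch that compares an explicit loop with the vectorized expression. It reuses the grades from the example above; the names `looped` and `vectorized` are introduced only for this illustration.

```python
import numpy as np

grades = np.array([72, 35, 64, 88, 51, 90, 74, 12])
change = 19.25  # curve_center (80) minus the mean of grades (60.75)

# Loop version: visit every element explicitly.
looped = []
for grade in grades:
    looped.append(grade + change)

# Vectorized version: one expression applied to the whole array.
vectorized = grades + change

print(np.array_equal(np.array(looped), vectorized))  # True
```

Both versions produce the same values; the vectorized one simply leaves the iteration to NumPy.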

**Broadcasting** is the process of extending two arrays of different shapes and figuring out how to perform a vectorized calculation between them. Remember, grades is an array of numbers of shape (8,) and change is a scalar, or single number, which broadcasting treats like an array holding one element. In this case, NumPy adds the scalar to each item in the array and returns a new array with the results.
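The shapes involved can be inspected directly. The sketch below is only an illustration: it reports the shape of each operand and checks that adding the scalar matches adding a one-element array. The name `same` is hypothetical.

```python
import numpy as np

grades = np.array([72, 35, 64, 88, 51, 90, 74, 12])
change = 80 - grades.mean()

print(grades.shape)             # (8,)
print(np.shape(change))         # (): a scalar has an empty shape
print((grades + change).shape)  # (8,)

# Adding a one-element array of shape (1,) gives the same values.
same = np.array_equal(grades + change, grades + np.array([change]))
print(same)  # True
```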

### np.clip(new_grades, grades, 100)

This is another example of *broadcasting*. For the second argument to clip(), you pass grades, ensuring that each newly curved grade doesn’t go lower than the original grade. But for the third argument, you pass a single value: 100. NumPy takes that value and broadcasts it against every element in new_grades, ensuring that none of the newly curved grades exceeds a perfect score.
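One way to see what the two bounds do is to write the clip as two element-wise steps. This is a sketch for illustration, not how the tutorial’s code is written; `manual` is a name introduced here.

```python
import numpy as np

grades = np.array([72, 35, 64, 88, 51, 90, 74, 12])
new_grades = grades + 19.25

clipped = np.clip(new_grades, grades, 100)

# The same bounds written as two element-wise steps:
# first raise anything below the original grade, then cap at 100.
manual = np.minimum(np.maximum(new_grades, grades), 100)

print(np.array_equal(clipped, manual))  # True
print(clipped.max())                    # 100.0
```

The lower bound is an array compared element by element, while the upper bound is a single value broadcast across the whole array.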


### Vectors

Vectors, which are one-dimensional arrays of numbers, are the least complicated to keep track of. Two dimensions aren’t too bad, either, because they’re similar to spreadsheets. But things start to get tricky at three dimensions, and visualizing four is harder still. Here is an example vector:

$$\hat V =\begin{bmatrix}
1\\
3\\
6\\
700\\
800
\end{bmatrix}$$


### Mastering Shape

**Shape** is a key concept when you’re using multidimensional arrays. At a certain point, it’s easier to forget about visualizing the shape of your data and to instead follow some mental rules and trust NumPy to tell you the correct shape.

*All arrays have a property called .shape that returns a tuple of the size in each dimension.* It’s less important which dimension is which, but it’s critical that the arrays you pass to functions are in the shape that the functions expect. A common way to confirm that your data has the proper shape is to print the data and its shape until you’re sure everything is working like you expect.

In [6]:
temperatures = np.array([
    29.3, 42.1, 18.8, 16.1, 38.0, 12.5,
    12.6, 49.9, 38.6, 31.3, 9.2, 22.2
])
print(temperatures.shape)
temperatures

(12,)


array([29.3, 42.1, 18.8, 16.1, 38. , 12.5, 12.6, 49.9, 38.6, 31.3,  9.2,
       22.2])

In [7]:
temperatures = temperatures.reshape(2, 2, 3)
print(temperatures.shape)
temperatures

(2, 2, 3)


array([[[29.3, 42.1, 18.8],
        [16.1, 38. , 12.5]],

       [[12.6, 49.9, 38.6],
        [31.3,  9.2, 22.2]]])

In [8]:
np.swapaxes(temperatures, 1, 2)

array([[[29.3, 16.1],
        [42.1, 38. ],
        [18.8, 12.5]],

       [[12.6, 31.3],
        [49.9,  9.2],
        [38.6, 22.2]]])
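Checking `.shape` before and after the swap shows what changed. The sketch below uses `np.arange(12)` as stand-in data (an assumption made so the values are easy to follow) rather than the temperatures above.

```python
import numpy as np

temperatures = np.arange(12).reshape(2, 2, 3)
swapped = np.swapaxes(temperatures, 1, 2)

print(temperatures.shape)  # (2, 2, 3)
print(swapped.shape)       # (2, 3, 2)

# The element at [i, j, k] moves to [i, k, j].
print(temperatures[1, 0, 2] == swapped[1, 2, 0])  # True
```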

### Understanding Axes

The example above shows how important it is to know not only what shape your data is in but also which data is in which axis. In NumPy arrays, axes are zero-indexed and identify which dimension is which. For example, a two-dimensional array has a vertical axis (axis 0) and a horizontal axis (axis 1). Lots of functions and commands in NumPy change their behavior based on which axis you tell them to process.

This example will show how .max() behaves by default, with no axis argument, and how it changes functionality depending on which axis you specify when you do supply an argument:

In [9]:
table = np.array([
    [5, 3, 7, 1],
    [2, 6, 7, 9],
    [1, 2, 1, 1],
    [4, 3, 2, 0]
])

table.max()

9

In [10]:
table.max(axis = 0)

array([5, 6, 7, 9])

In [11]:
table.max(axis = 1)

array([7, 9, 2, 4])

**By default, .max() returns the largest value in the entire array, no matter how many dimensions there are**. However, once you specify an axis, it performs that calculation for each set of values along that particular axis. For example, with an argument of axis=0, .max() selects the maximum value in each of the four vertical sets of values in table and returns an array that has been flattened, or aggregated into a one-dimensional array.

**In fact, many of NumPy’s functions behave this way: If no axis is specified, then they perform an operation on the entire dataset. Otherwise, they perform the operation in an axis-wise fashion.**
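As a quick illustration of the same pattern with other functions, the sketch below applies `.sum()` and `.min()` to the same table, with and without an axis argument.

```python
import numpy as np

table = np.array([
    [5, 3, 7, 1],
    [2, 6, 7, 9],
    [1, 2, 1, 1],
    [4, 3, 2, 0]
])

print(table.sum())        # 54
print(table.sum(axis=0))  # [12 14 17 11]
print(table.min(axis=1))  # [1 2 1 0]
```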

### Broadcasting

Fundamentally, broadcasting functions around one rule: arrays can be broadcast against each other if their dimensions match or if one of the arrays has a size of 1.

If the arrays match in size along an axis, then elements will be operated on element-by-element, similar to how the built-in Python function zip() works. If one of the arrays has a size of 1 in an axis, then that value will be broadcast along that axis, or duplicated as many times as necessary to match the number of elements along that axis in the other array.

Here’s a quick example. Array A has the shape (4, 1, 8), and array B has the shape (1, 6, 8). Based on the rules above, you can operate on these arrays together:

* In axis 0, A has a 4 and B has a 1, so B can be broadcast along that axis.
* In axis 1, A has a 1 and B has a 6, so A can be broadcast along that axis.
* In axis 2, the two arrays have matching sizes, so they can operate successfully.

All three axes successfully follow the rule.
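The rule can be checked without building any arrays. The sketch below assumes NumPy 1.20 or newer, where `np.broadcast_shapes()` is available; the names `result` and `mismatch_rejected` are introduced only for this illustration.

```python
import numpy as np

# Shapes from the example above.
result = np.broadcast_shapes((4, 1, 8), (1, 6, 8))
print(result)  # (4, 6, 8)

# A mismatch where neither size is 1 breaks the rule.
mismatch_rejected = False
try:
    np.broadcast_shapes((4, 1, 8), (1, 6, 7))
except ValueError as error:
    mismatch_rejected = True
    print("cannot broadcast:", error)
```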

In [15]:
# This is a good way to create an array from a range using arange()!
A = np.arange(32).reshape(4, 1, 8)
A

array([[[ 0,  1,  2,  3,  4,  5,  6,  7]],

       [[ 8,  9, 10, 11, 12, 13, 14, 15]],

       [[16, 17, 18, 19, 20, 21, 22, 23]],

       [[24, 25, 26, 27, 28, 29, 30, 31]]])

In [16]:
B = np.arange(48).reshape(1, 6, 8)
B

array([[[ 0,  1,  2,  3,  4,  5,  6,  7],
        [ 8,  9, 10, 11, 12, 13, 14, 15],
        [16, 17, 18, 19, 20, 21, 22, 23],
        [24, 25, 26, 27, 28, 29, 30, 31],
        [32, 33, 34, 35, 36, 37, 38, 39],
        [40, 41, 42, 43, 44, 45, 46, 47]]])


**A has 4 planes, each with 1 row and 8 columns. B has only 1 plane with 6 rows and 8 columns.**

The way broadcasting works is that NumPy duplicates the plane in B three times so that you have a total of four, matching the number of planes in A. It also duplicates the single row in A five times for a total of six, matching the number of rows in B. Then it adds each element in the newly expanded A array to its counterpart in the same location in B. The result of each calculation shows up in the corresponding location of the output.
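The duplication described above can be made explicit with `np.broadcast_to()`. This is a sketch for illustration; in practice NumPy does the expansion implicitly and, as far as the result is concerned, without you having to copy anything yourself.

```python
import numpy as np

A = np.arange(32).reshape(4, 1, 8)
B = np.arange(48).reshape(1, 6, 8)

# Expand both arrays to the common shape (4, 6, 8) explicitly.
A_expanded = np.broadcast_to(A, (4, 6, 8))
B_expanded = np.broadcast_to(B, (4, 6, 8))

total = A + B
print(total.shape)                                     # (4, 6, 8)
print(np.array_equal(total, A_expanded + B_expanded))  # True
```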

Once again, even though you can use words like “plane,” “row,” and “column” to describe how the shapes in this example are broadcast to create matching three-dimensional shapes, things get more complicated at higher dimensions. A lot of times, you’ll have to simply follow the broadcasting rules and do lots of print-outs to make sure things are working as planned.

Understanding broadcasting is an important part of mastering vectorized calculations, and vectorized calculations are the way to write clean, idiomatic NumPy code.

In [17]:
# A has 4 planes, each with 1 row and 8 columns. 
# B has only 1 plane with 6 rows and 8 columns.
A + B

array([[[ 0,  2,  4,  6,  8, 10, 12, 14],
        [ 8, 10, 12, 14, 16, 18, 20, 22],
        [16, 18, 20, 22, 24, 26, 28, 30],
        [24, 26, 28, 30, 32, 34, 36, 38],
        [32, 34, 36, 38, 40, 42, 44, 46],
        [40, 42, 44, 46, 48, 50, 52, 54]],

       [[ 8, 10, 12, 14, 16, 18, 20, 22],
        [16, 18, 20, 22, 24, 26, 28, 30],
        [24, 26, 28, 30, 32, 34, 36, 38],
        [32, 34, 36, 38, 40, 42, 44, 46],
        [40, 42, 44, 46, 48, 50, 52, 54],
        [48, 50, 52, 54, 56, 58, 60, 62]],

       [[16, 18, 20, 22, 24, 26, 28, 30],
        [24, 26, 28, 30, 32, 34, 36, 38],
        [32, 34, 36, 38, 40, 42, 44, 46],
        [40, 42, 44, 46, 48, 50, 52, 54],
        [48, 50, 52, 54, 56, 58, 60, 62],
        [56, 58, 60, 62, 64, 66, 68, 70]],

       [[24, 26, 28, 30, 32, 34, 36, 38],
        [32, 34, 36, 38, 40, 42, 44, 46],
        [40, 42, 44, 46, 48, 50, 52, 54],
        [48, 50, 52, 54, 56, 58, 60, 62],
        [56, 58, 60, 62, 64, 66, 68, 70],
        [64, 66, 68, 70, 72, 74, 76, 78]]])

## Data Science Operations: Filter, Order, Aggregate

### Indexing

Indexing uses many of the same idioms that normal Python code uses. You can use positive or negative indices to index from the front or back of the array. You can use a colon (:) to specify “the rest” or “all,” and you can even use two colons to skip elements as with regular Python lists.

Here’s the difference: NumPy arrays use commas between axes, so you can index multiple axes in one set of square brackets.

https://en.wikipedia.org/wiki/Magic_square#Albrecht_D%C3%BCrer's_magic_square

The number square below has some amazing properties. If you add up any of the rows, columns, or diagonals, then you’ll get the same number, 34. That’s also what you’ll get if you add up each of the four quadrants, the center four squares, the four corner squares, or the four corner squares of any of the contained 3 × 3 grids. You’re going to prove it!

In [18]:
square = np.array([
    [16, 3, 2, 13],
    [5, 10, 11, 8],
    [9, 6, 7, 12],
    [4, 15, 14, 1]
])

for i in range(4):
    assert square[:, i].sum() == 34
    assert square[i, :].sum() == 34

In [20]:
[square[:, 0].sum(), square[:, 2].sum(), 
 square[:, 3].sum(), square[:, 1].sum()]

[34, 34, 34, 34]

In [21]:
[square[0, :].sum(), square[1, :].sum(), 
 square[2, :].sum(), square[3, :].sum()]

[34, 34, 34, 34]
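The cells above cover the rows and columns. The remaining properties from the description (diagonals, quadrants, center, corners, and the corners of each 3 × 3 grid) can be checked with multi-axis slicing as well. This is one possible sketch, not the only way to write it.

```python
import numpy as np

square = np.array([
    [16, 3, 2, 13],
    [5, 10, 11, 8],
    [9, 6, 7, 12],
    [4, 15, 14, 1]
])

# Both diagonals.
assert square.diagonal().sum() == 34
assert np.fliplr(square).diagonal().sum() == 34

# The four 2 x 2 quadrants.
for rows in (slice(0, 2), slice(2, 4)):
    for cols in (slice(0, 2), slice(2, 4)):
        assert square[rows, cols].sum() == 34

# The center four squares and the four corners.
assert square[1:3, 1:3].sum() == 34
assert square[::3, ::3].sum() == 34

# The four corners of each contained 3 x 3 grid.
for r in range(2):
    for c in range(2):
        assert square[r:r + 3:2, c:c + 3:2].sum() == 34
```

The step in a slice such as `::3` picks every third element, which is what selects only the corner rows and columns.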

### Masking and Filtering

**Index-based selection is great, but what if you want to filter your data based on more complicated nonuniform or nonsequential criteria? This is where the concept of a mask comes into play.**

A mask is an array that has the exact same shape as your data, but instead of your values, it holds Boolean values: either True or False. You can use this mask array to index into your data array in nonlinear and complex ways. It will return all of the elements where the Boolean array has a True value.

**np.linspace()** generates n numbers evenly distributed between a minimum and a maximum, which is useful for evenly distributed sampling in scientific plotting.

**array.reshape() can take -1 as one of its dimension sizes**. *That signifies that NumPy should just figure out how big that particular axis needs to be based on the size of the other axes.* In this case, with 24 values and a size of 4 in axis 0, axis 1 ends up with a size of 6.
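Here is a small sketch of both features in isolation, using simpler inputs than the cell that follows. The names `samples` and `values` are introduced only for this illustration.

```python
import numpy as np

# Five evenly spaced numbers from 0 to 1, endpoints included.
samples = np.linspace(0, 1, 5)
print(samples.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]

# With 24 values, -1 lets NumPy work out the missing axis size.
values = np.arange(24)
print(values.reshape(4, -1).shape)  # (4, 6)
print(values.reshape(-1, 8).shape)  # (3, 8)
```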

In [22]:
numbers = np.linspace(5, 50, 24, dtype=int).reshape(4, -1)
numbers

array([[ 5,  6,  8, 10, 12, 14],
       [16, 18, 20, 22, 24, 26],
       [28, 30, 32, 34, 36, 38],
       [40, 42, 44, 46, 48, 50]])

In [23]:
mask = numbers % 4 == 0

In [24]:
mask

array([[False, False,  True, False,  True, False],
       [ True, False,  True, False,  True, False],
       [ True, False,  True, False,  True, False],
       [ True, False,  True, False,  True, False]])

In [25]:
numbers[mask]

array([ 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48])

In [26]:
numbers[numbers % 4 == 0]

array([ 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48])

* **mask** is created by performing a vectorized Boolean computation, taking each element and checking to see if it divides evenly by four. This returns a mask array of the same shape with the element-wise results of the computation.
* Use **mask** to index into the original numbers array. This causes the array to lose its original shape, reducing it to one dimension, but you still get the data you’re looking for.
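If losing the original shape is a problem, `np.where()` is one alternative, and conditions can be combined with `&` and `|`. The sketch below is an illustration built on the same `numbers` array; `kept` and `both` are names introduced here.

```python
import numpy as np

numbers = np.linspace(5, 50, 24, dtype=int).reshape(4, -1)
mask = numbers % 4 == 0

# Indexing with the mask flattens the result.
print(numbers[mask].shape)  # (11,)

# np.where keeps the original shape, filling non-matches with 0.
kept = np.where(mask, numbers, 0)
print(kept.shape)  # (4, 6)

# Combine conditions with & (and) or | (or), wrapping each in parentheses.
both = numbers[(numbers % 4 == 0) & (numbers > 20)]
print(both)  # [24 28 32 36 40 44 48]
```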

### Normal Distribution
https://en.wikipedia.org/wiki/Normal_distribution

https://en.wikipedia.org/wiki/Standard_deviation

The normal distribution is a probability distribution in which roughly 95.45% of values occur within two standard deviations of the mean.

In [27]:
from numpy.random import default_rng
rng = default_rng()
values = rng.standard_normal(10000)
values[:5]

array([-0.86894925,  0.68194156, -0.38301715,  0.12908635,  1.56870913])

In [29]:
std = values.std()
std

0.9918000944942111

In [30]:
filtered = values[(values > -2 * std) & (values < 2 * std)]
filtered.size

9547

In [31]:
values.size, filtered.size/values.size

(10000, 0.9547)
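Because the generator above is unseeded, the exact numbers change on every run. The sketch below fixes a seed (the value 42 is an arbitrary assumption) and compares the measured fraction with the theoretical one, which for a normal distribution is erf(2 / √2).

```python
import math
import numpy as np

# The seed value is arbitrary; it only makes the run repeatable.
rng = np.random.default_rng(seed=42)
values = rng.standard_normal(10000)

std = values.std()
fraction = np.mean(np.abs(values) < 2 * std)

# Theoretical share within two standard deviations: erf(2 / sqrt(2)).
expected = math.erf(2 / math.sqrt(2))

print(round(expected, 4))  # 0.9545
print(fraction)
```

With 10,000 samples, the measured fraction should land close to the theoretical value, though not exactly on it.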

### Transposing, Sorting, and Concatenating

In [32]:
a = np.array([
    [1, 2],
    [3, 4],
    [5, 6]
])
a

array([[1, 2],
       [3, 4],
       [5, 6]])

In [33]:
a.T

array([[1, 3, 5],
       [2, 4, 6]])

In [34]:
a.transpose()

array([[1, 3, 5],
       [2, 4, 6]])

In [35]:
data = np.array([
    [7, 1, 4],
    [8, 6, 5],
    [1, 2, 3]
])
data

array([[7, 1, 4],
       [8, 6, 5],
       [1, 2, 3]])

In [36]:
np.sort(data)

array([[1, 4, 7],
       [5, 6, 8],
       [1, 2, 3]])

In [37]:
np.sort(data, axis=None)

array([1, 1, 2, 3, 4, 5, 6, 7, 8])

In [38]:
np.sort(data, axis=0)

array([[1, 1, 3],
       [7, 2, 4],
       [8, 6, 5]])

In [40]:
c = np.array([
    [3, 5],
    [7, 2]
])
d = np.array([
    [4, 8],
    [7, 2]
])
np.hstack((c, d))

array([[3, 5, 4, 8],
       [7, 2, 7, 2]])

In [41]:
np.vstack((c, d))

array([[3, 5],
       [7, 2],
       [4, 8],
       [7, 2]])

In [42]:
np.vstack((d, c))

array([[4, 8],
       [7, 2],
       [3, 5],
       [7, 2]])

In [43]:
np.concatenate((c, d))

array([[3, 5],
       [7, 2],
       [4, 8],
       [7, 2]])

In [44]:
np.concatenate((c, d), axis=None)

array([3, 5, 7, 2, 4, 8, 7, 2])
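The stacking helpers and `np.concatenate()` are closely related. As a quick sketch using the same `c` and `d`, the `axis` argument selects which behavior you get:

```python
import numpy as np

c = np.array([[3, 5], [7, 2]])
d = np.array([[4, 8], [7, 2]])

# axis=0 (the default) stacks rows, like vstack.
rows_match = np.array_equal(np.concatenate((c, d), axis=0), np.vstack((c, d)))
print(rows_match)  # True

# axis=1 joins columns, like hstack.
cols_match = np.array_equal(np.concatenate((c, d), axis=1), np.hstack((c, d)))
print(cols_match)  # True
```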

### Aggregating
Many of the mathematical, financial, and statistical functions use aggregation to help you reduce the number of dimensions in your data.
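A short sketch of that reduction, reusing the reshaped temperatures from earlier: each axis you aggregate over removes one dimension from the result.

```python
import numpy as np

temperatures = np.array([
    29.3, 42.1, 18.8, 16.1, 38.0, 12.5,
    12.6, 49.9, 38.6, 31.3, 9.2, 22.2
]).reshape(2, 2, 3)

print(temperatures.ndim)                    # 3
print(temperatures.mean(axis=0).ndim)       # 2
print(temperatures.mean(axis=(0, 1)).ndim)  # 1
print(temperatures.max())                   # 49.9
```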

### Practical Example 1: Implementing a Maclaurin Series

Now it’s time to see a realistic use case for the skills introduced in the sections above: implementing an equation.

**One of the hardest things about converting mathematical equations to code without NumPy is that many of the visual similarities are missing, which makes it hard to tell what portion of the equation you’re looking at as you read the code. Summations are converted to more verbose for loops, and limit optimizations end up looking like while loops.**

Using NumPy allows you to keep closer to a one-to-one representation from equation to code.

In this next example, you’ll encode the Maclaurin series (https://mathworld.wolfram.com/MaclaurinSeries.html) for $e^x$. Maclaurin series are a way of approximating more complicated functions with an infinite series of summed terms centered about zero.

For $e^{x}$, the Maclaurin series is the following summation:

$$e^{x} = \sum_{n=0}^\infty\frac{x^n}{n!} = 1+x+\frac{x^2}{2}+\frac{x^3}{6}+\quad...$$

You add up terms starting at zero and going theoretically to infinity. Each nth term will be x raised to n and divided by n!, which is the notation for the factorial operation.

In [46]:
from math import e, factorial

fac = np.vectorize(factorial)

def e_x(x, terms=10):
    """Approximate e^x using a given number of terms of
    the Maclaurin series."""

    n = np.arange(terms)
    return np.sum((x**n) / fac(n))

if __name__ == "__main__":
    print(f'Actual:{e**3}')
    
    print(f'N (terms)\tMaclaurin\tError')
    
    for n in range(1, 14):
        maclaurin = e_x(3, terms=n)
        print(f"{n}\t\t{maclaurin:.03f}\t\t{e**3 - maclaurin:.03f}")

Actual:20.085536923187664
N (terms)	Maclaurin	Error
1		1.000		19.086
2		4.000		16.086
3		8.500		11.586
4		13.000		7.086
5		16.375		3.711
6		18.400		1.686
7		19.412		0.673
8		19.846		0.239
9		20.009		0.076
10		20.063		0.022
11		20.080		0.006
12		20.084		0.001
13		20.085		0.000


In [48]:
n = np.arange(10)
n

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [49]:
np.sum((2**n)/ fac(n))

7.3887125220458545
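For contrast with the loop-based style mentioned earlier, here is a sketch that places a plain-Python loop next to the vectorized form and checks both against `np.exp()`. The function names `e_x_loop` and `e_x_vectorized` are hypothetical and exist only for this comparison.

```python
import math
import numpy as np

def e_x_loop(x, terms=10):
    """Loop-based Maclaurin approximation of e^x (for comparison only)."""
    total = 0.0
    for n in range(terms):
        total += x**n / math.factorial(n)
    return total

def e_x_vectorized(x, terms=10):
    """Vectorized version, mirroring e_x() from the cell above."""
    n = np.arange(terms)
    fac = np.vectorize(math.factorial)
    return np.sum(x**n / fac(n))

print(math.isclose(e_x_loop(2), e_x_vectorized(2)))  # True
print(abs(e_x_vectorized(2) - np.exp(2)) < 1e-3)     # True
```

The vectorized body reads much like the summation itself, while the loop has to manage the running total by hand.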

### Optimizing Storage: Data Types
NumPy needs a little more detail about data types than plain Python does. NumPy uses C code under the hood to optimize performance, and it can’t do that unless all the items in an array are of the same type. That doesn’t just mean the same Python type. They have to be the same underlying C type, with the same shape and size in bits!
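A minimal sketch of what that means in practice: the `dtype` fixes how many bytes each item occupies, and mixed inputs are converted to one common type. The array names here are introduced only for illustration.

```python
import numpy as np

small = np.array([1, 2, 3], dtype=np.int8)
large = np.array([1, 2, 3], dtype=np.int64)

print(small.itemsize, small.nbytes)  # 1 3
print(large.itemsize, large.nbytes)  # 8 24

# Mixing integers and floats upcasts everything to one common type.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
```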