# Chapter 4.  NumPy Basics: Arrays and Vectorized Computation

NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python.   
Most computational packages providing scientific functionality use NumPy's array objects as the *lingua franca* for data exchange.   

Here are some of the things you’ll find in NumPy: 
* ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities. 
* Mathematical functions for fast operations on entire arrays of data without having to write loops. 
* Tools for reading/writing array data to disk and working with memory-mapped files. 
* Linear algebra, random number generation, and Fourier transform capabilities. 
* A C API for connecting NumPy with libraries written in C, C++, or Fortran. 

**Because NumPy provides an easy-to-use C API, it is straightforward to pass data to external libraries written in a low-level language and also for external libraries to return data to Python as NumPy arrays. This feature has made Python a language of choice for wrapping legacy C/C++/Fortran codebases and giving them a dynamic and easy-to-use interface.** 

For most data analysis applications, the main areas of functionality I'll focus on are: 
* Fast vectorized array operations for data munging and cleaning, subsetting and filtering, transformation, and any other kinds of computations 
* Common array algorithms like sorting, unique, and set operations 
* Efficient descriptive statistics and aggregating/summarizing data 
* Data alignment and relational data manipulations for merging and joining together heterogeneous datasets 
* Expressing conditional logic as array expressions instead of loops with if-elif-else branches 
* Group-wise data manipulations (aggregation, transformation, function application) 

While NumPy provides a computational foundation for general numerical data processing, many readers will want to use pandas as the basis for most kinds of statistics
or analytics, especially on tabular data. pandas also provides some more domain-specific functionality like time series manipulation, which is not present in NumPy.

NumPy is so important for numerical computations in Python in large part
because it is designed for efficiency on large arrays of data. There are a number of
reasons for this:
- NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy’s library of algorithms written in the C language can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.
- NumPy operations perform complex computations on entire arrays without the need for Python for loops.
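The memory claim in the first point can be checked with a rough sketch. The variable names are illustrative, and the list figure is a CPython-specific estimate (the list's pointer table plus every int object it refers to), not an exact accounting:

```python
import sys
import numpy as np

n = 1_000
arr = np.arange(n, dtype=np.int64)   # one contiguous buffer of 8-byte integers
lst = list(range(n))                 # a table of pointers to separate int objects

array_bytes = arr.nbytes             # size of the array's data buffer only
# Rough estimate: the list object itself plus every int object it points to
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)

print(arr.flags['C_CONTIGUOUS'])     # the array's data is laid out contiguously
print(array_bytes, list_bytes)
```

The exact list size varies by Python version, but it comes out well above the 8 bytes per element that the array needs.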

In [None]:
import numpy as np
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

In this chapter and throughout the book, I use the standard
NumPy convention of always using import numpy as np.

To give you an idea of the performance difference, consider a NumPy array of one million integers, and the equivalent Python list:


In [None]:
#import numpy as np
my_arr = np.arange(1000000)
my_list = list(range(1000000))

Now let’s multiply each sequence by 2:

In [None]:
%time for _ in range(10): my_arr2 = my_arr * 2

In [None]:
%time for _ in range(10): my_list2 = [x * 2 for x in my_list]

In [None]:
%%time
for _ in range(10): 
        my_arr2 = my_arr * 2

<span style="color:red">NumPy-based algorithms are generally **10 to 100 times faster (or more)** than their pure Python counterparts and use significantly less memory.

## 4.1  The NumPy ndarray: A Multidimensional Array Object

One of the key features of NumPy is its N-dimensional array object, or ndarray,
which is a fast, flexible container for large datasets in Python. Arrays enable you to
perform mathematical operations on whole blocks of data using similar syntax to the
equivalent operations between scalar elements.


To give you a flavor of how NumPy enables batch computations with similar syntax
to scalar values on built-in Python objects, I first import NumPy and generate a small
array of random data:


In [None]:
import numpy as np
# Generate some standard normal random data
data = np.random.randn(2, 3)
data

In [None]:
?np.random.randn

In [None]:
data * 10

In [None]:
data + data

### <span style="color:red">An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. </span>  
Every array has a **shape**, a tuple indicating the size of each dimension, and a **dtype**, an object describing the *data type* of the array.

In [None]:
data.shape

In [None]:
data.dtype

Whenever you see “array,” “NumPy array,” or “ndarray” in the text,
with few exceptions they all refer to the same thing: the ndarray
object.

### Creating ndarrays

The easiest way to create an array is to use the **array** function. This accepts any
sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. For example, a list is a good candidate for conversion:


In [None]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1

In [None]:
print(arr1)

In [None]:
print(type(arr1))

In [None]:
data11 = (6, 7.5, 8, 0, 1)
arr11 = np.array(data11)  # a tuple is converted just like a list
arr11

In [None]:
data11 = {6, 7.5, 8, 0, 1}
arr11 = np.array(data11)  # a set is not a sequence: the result is a 0-d object array holding the set
arr11

Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array.

In [None]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
print(arr2)
arr2

Since data2 was a list of lists, the NumPy array **arr2** has two dimensions with shape
inferred from the data. We can confirm this by inspecting the **ndim** and **shape**
attributes:

In [None]:
arr2.ndim

In [None]:
arr2.shape

Unless explicitly specified (more on this later), **np.array** tries to infer a good data
type for the array that it creates. The data type is stored in a special **dtype** metadata
object; for example, in the previous two examples we have:

In [None]:
arr1.dtype

In [None]:
arr2.dtype

In addition to **np.array**, there are a number of other functions for creating new
arrays. As examples, **zeros** and **ones** create arrays of 0s or 1s, respectively, with a
given length or shape:

In [None]:
np.zeros(10)

To create a higher dimensional array with these methods, pass a tuple for the shape.

In [None]:
np.zeros((3, 6))

**empty** creates an array without initializing its values to any particular value.  

In [None]:
np.empty((2, 3, 2))

<img style="float: left;" src="pic/pic_0_1.png">

<span style="color:red">It's not safe to assume that np.empty will return an array of all zeros.   
In some cases, it may return uninitialized “garbage” values.


**arange** is an array-valued version of the built-in Python range function.

In [None]:
np.arange(15)

<img style="float: left;" src="pic/pic_4_1.png" width="700">

### Data Types for ndarrays

The *data type* or **dtype** is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data.

In [None]:
arr1 = np.array([1, 2, 3], dtype=np.float64)

In [None]:
arr2 = np.array([1, 2, 3], dtype=np.int32)

In [None]:
arr1.dtype

In [None]:
arr2.dtype

The numerical dtypes are named the same way: a type name, like float or int, followed by a number indicating the number of bits per element. A standard double-precision floating-point value (what’s used under the hood in Python’s float object) takes up 8 bytes or 64 bits. Thus, this type is known in NumPy as float64. See Table 4-2 for a full listing of NumPy’s supported data types.
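The naming rule can be confirmed from the dtype objects themselves. A small sketch (the `sizes` dictionary is just an illustrative name): `itemsize` reports bytes per element, so multiplying by 8 gives the number in the dtype's name.

```python
import numpy as np

dtype_names = ['int8', 'int32', 'float32', 'float64']

# itemsize is measured in bytes per element
sizes = {name: np.dtype(name).itemsize for name in dtype_names}

for name, nbytes in sizes.items():
    print(name, nbytes, 'bytes =', nbytes * 8, 'bits')
```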

<img style="float: left;" src="pic/pic_0_2.png">

Don’t worry about memorizing the NumPy dtypes, especially if you’re a new user.   
It’s often only necessary to care about the general kind of data you’re dealing with, whether floating point, complex, integer, boolean, string, or general Python object.   
When you need more control over how data are stored in memory and on disk, especially large datasets, it is good to know that you have control over the storage type.

<img style="float: left;" src="pic/pic_4_2.png" width="700">

You can explicitly convert or *cast* an array from one dtype to another using ndarray’s **astype** method.

In [None]:
arr = np.array([1, 2, 3, 4, 5])

In [None]:
arr.dtype

In [None]:
float_arr = arr.astype(np.float64)

In [None]:
float_arr

In [None]:
print(float_arr)

In [None]:
float_arr.dtype

If I cast some floating-point numbers to be of integer dtype, the decimal part will be truncated.

In [None]:
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr

In [None]:
arr.astype(np.int32)
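Truncation is not the same as rounding, which matters for negative values. A short sketch comparing the cast with explicit rounding steps (the result names are illustrative):

```python
import numpy as np

arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

truncated = arr.astype(np.int32)            # drops the fractional part (toward zero)
floored = np.floor(arr).astype(np.int32)    # rounds toward negative infinity
rounded = np.round(arr).astype(np.int32)    # rounds to nearest; exact halves go to the even integer

print(truncated)
print(floored)
print(rounded)
```

If rounding is what you want, apply it explicitly before casting.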

If you have an array of strings representing numbers, you can use astype to convert them to numeric form.

In [None]:
numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.bytes_)  # np.string_ is the older alias for np.bytes_

In [None]:
numeric_strings.dtype

In [None]:
numeric_strings

Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type.

In [None]:
x = np.array(['3', '4'], dtype=np.bytes_)  # np.string_ is the older alias for np.bytes_
print(x.dtype)
x

In [None]:
y=np.array(['3','4'])
print(y.dtype)
y

In [None]:
x.astype(np.int32)

In [None]:
y.astype(np.int32)

<img style="float: left;" src="pic/pic_0_1.png">

<span style="color:red">It’s important to be cautious when using the numpy.string_ type (called numpy.bytes_ in current NumPy releases), as string data in NumPy is fixed size and may truncate input without warning. 
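The truncation is easy to trigger by accident. A minimal sketch with made-up values: the width of a string array is fixed when the array is created, and longer values written later are cut to fit.

```python
import numpy as np

names = np.array(['Bob', 'Joe'])        # inferred dtype holds at most 3 characters per element
names[0] = 'Alexander'                  # silently truncated to fit the fixed width
print(names.dtype, names)

raw = np.array(['1.25', '-9.6'], dtype='S3')   # 3-byte strings: neither value fits
print(raw)
```

Note that the truncated byte strings would still convert to numbers with astype, just not to the numbers you intended.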

<img style="float: left;" src="pic/pic_0_2.png">

Calling astype always creates a new array (a copy of the data), even if the new dtype is the same as the old dtype.
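A quick way to see the copy, assuming the default arguments to astype (the names below are illustrative):

```python
import numpy as np

arr = np.arange(5, dtype=np.float64)
same = arr.astype(np.float64)        # same dtype, but still a new array by default

same[0] = 99.0                       # modifying the copy...
print(arr[0])                        # ...leaves the original untouched
print(np.shares_memory(arr, same))   # the two arrays do not share a buffer
```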


### Arithmetic with NumPy Arrays

Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy users call this <span style="color:red"> **vectorization** <span style="color:black">. Any arithmetic operation between equal-size arrays is applied element-wise:


In [None]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])

In [None]:
arr

In [None]:
arr * arr

In [None]:
arr - arr

In [None]:
1 / arr

Floating-point values are displayed with 4 digits of precision. Why? The first cell of this notebook sets this with np.set_printoptions, repeated here:

In [None]:
np.set_printoptions(precision=4, suppress=True)

In [None]:
1 / (arr * 100000)

In [None]:
arr * 100000000000

In [None]:
x=np.array([0.00001, 50000])

In [None]:
x

In [None]:
1 / x

In [None]:
np.set_printoptions(precision=5, suppress=False)

In [None]:
np.set_printoptions?

suppress : bool, optional

If True, always print floating point numbers using fixed point notation, in which case numbers equal to zero in the current precision will print as zero. If False, then scientific notation is used when absolute value of the smallest number is < 1e-4 or the ratio of the maximum absolute value to the minimum is > 1e3. The default is False.

In [None]:
1 / (arr * 100000)

In [None]:
arr * 100000000000

In [None]:
x=np.array([0.00001, 50000])

In [None]:
x

In [None]:
1 / x

In [None]:
np.set_printoptions(precision=4, suppress=True)

In [None]:
arr ** 0.5

Comparisons between arrays of the same size yield boolean arrays.

In [None]:
arr1 =  np.array([1., 2., 3.])

In [None]:
arr2 = np.array([0., 4., 1.])

In [None]:
arr2 > arr1

The same comparison works element-wise on two-dimensional arrays of equal shape:


In [None]:
arr3 = np.array([[0., 4., 1.], [7., 2., 12.]])

In [None]:
arr3

In [None]:
arr3 > arr

Operations between differently sized arrays are called <span style="color:red"> **broadcasting** <span style="color:black"> and will be discussed in more detail in Appendix A. 
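As a small preview, here is a sketch of the two most common broadcasting cases, using illustrative arrays: a one-dimensional array combined with each row of a two-dimensional one, and a scalar combined with every element.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])   # shape (2, 3)
row = np.array([10., 20., 30.])                # shape (3,)

# The 1-D row is stretched across both rows of arr
shifted = arr + row
# A scalar is the simplest case: it is applied to every element
scaled = arr * 2

print(shifted)
print(scaled)
```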

### Basic Indexing and Slicing

NumPy array indexing is a rich topic, as there are many ways you may want to select
a subset of your data or individual elements. One-dimensional arrays are simple; on
the surface they act similarly to Python lists:

In [None]:
arr = np.arange(10)

In [None]:
arr

In [None]:
arr[5]

In [None]:
arr[5:8]

In [None]:
arr[5:8] = 12

In [None]:
arr

##### Python list

In [None]:
x=list(range(10))
x

In [None]:
x[5:8]=12  # TypeError: a list slice needs an iterable; a scalar is not broadcast

In [None]:
x[5:8]=[12,12,12]
x

In NumPy, as you can see, if you assign a scalar value to a slice, as in arr[5:8] = 12, the value is propagated (or *broadcast* henceforth) to the entire selection.

<span style="color:red">An important first distinction from Python’s built-in lists is that array slices are *views* on the original array.   
This means that the data is not copied, and any modifications to the view will be reflected in the source array. 

In [None]:
arr = np.arange(10)


In [None]:
arr_slice = arr[5:8]

In [None]:
arr_slice

In [None]:
arr_slice[1] = 12345

In [None]:
arr

In [None]:
arr_slice[:] = 64

In [None]:
arr

<img style="float: left;" src="pic/pic_0_1.png">

<span style="color:red">If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array—for example, arr[5:8].copy().
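The difference between a view and an explicit copy can be sketched as follows (the names `view` and `independent` are illustrative):

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]                 # shares memory with arr
independent = arr[5:8].copy()   # owns its own data

independent[:] = -1             # does not touch arr
view[0] = 100                   # writes through to arr[5]

print(arr)
print(np.shares_memory(arr, view), np.shares_memory(arr, independent))
```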
 

With higher dimensional arrays, you have many more options.  
In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays:


In [None]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

In [None]:
arr2d

In [None]:
arr2d[2]

Thus, individual elements can be accessed recursively. But that is a bit too much
work, so you can pass a comma-separated list of indices to select individual elements.
So these are equivalent:


In [None]:
arr2d[0][2]

In [None]:
arr2d[0, 2]

See Figure 4-1 for an illustration of indexing on a two-dimensional array. I find it
helpful to think of axis 0 as the “rows” of the array and axis 1 as the “columns.”


<img style="float: left;" src="pic/pic_4_3.png" width="700">

In multidimensional arrays, if you omit later indices, the returned object will be a
lower dimensional ndarray consisting of all the data along the higher dimensions. So
in the 2 × 2 × 3 array arr3d:

In [None]:
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
arr3d

arr3d[0] is a 2 × 3 array:

In [None]:
arr3d[0]

Both scalar values and arrays can be assigned to arr3d[0]:

In [None]:
old_values = arr3d[0].copy()

In [None]:
arr3d[0] = 42
arr3d

In [None]:
arr3d[0] = old_values
arr3d

Similarly, arr3d[1, 0] gives you all of the values whose indices start with (1, 0),
forming a 1-dimensional array:

In [None]:
arr3d[1, 0]

This expression is the same as though we had indexed in two steps:

In [None]:
x = arr3d[1]

In [None]:
x

In [None]:
x[0]

#### Indexing with slices

Like one-dimensional objects such as Python lists, ndarrays can be sliced with the
familiar syntax:

In [None]:
arr

In [None]:
arr[1:6]

Consider the two-dimensional array from before, arr2d. Slicing this array is a bit
different:


In [None]:
arr2d

In [None]:
arr2d[:2]

As you can see, it has sliced along axis 0, the first axis. A slice, therefore, selects a
range of elements along an axis. It can be helpful to read the expression arr2d[:2] as
“select the first two rows of arr2d.”

You can pass multiple slices just like you can pass multiple indexes:

In [None]:
arr2d[:2, 1:]

When slicing like this, you always obtain array views of the same number of dimensions. By mixing integer indexes and slices, you get lower dimensional slices.

For example, I can select the second row but only the first two columns like so:


In [None]:
arr2d[1, :2]

Similarly, I can select the third column but only the first two rows like so:

In [None]:
arr2d[:2, 2]

See Figure 4-2 for an illustration. Note that a colon by itself means to take the entire
axis, so you can slice only higher dimensional axes by doing:


In [None]:
arr2d[:, :1]

Of course, assigning to a slice expression assigns to the whole selection:

In [None]:
arr2d[:2, 1:] = 0
arr2d

<img style="float: left;" src="pic/pic_4_4.png" width="700">

In [None]:
arr2d

In [None]:
arr2d[2]

In [None]:
arr2d[1:]

In [None]:
arr2d[2,:]

In [None]:
arr2d[2,2]

In [None]:
arr2d[2:,:]
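The effect of mixing integer indexes and slices on dimensionality can be checked directly with ndim. A small sketch on a fresh copy of the 3 × 3 array, so the result does not depend on earlier assignments:

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(arr2d[1:].ndim)         # one slice: still 2-D
print(arr2d[:2, 1:].ndim)     # two slices: still 2-D
print(arr2d[2].ndim)          # one integer index: 1-D
print(arr2d[2, :].ndim)       # one index plus one slice: 1-D
print(arr2d[2:, :].shape)     # slices only: 2-D, here with a single row
print(np.ndim(arr2d[2, 2]))   # two integer indexes: a scalar, 0 dimensions
```

Each integer index removes one dimension; slices never do.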

<pre>
Rules for indexing expressions
original: 2-dim array  --->  an array result has at least 1 dim; an omitted axis counts as one slice

indexing expression      dims of result
no index   no slice         2 dim
1  index   no slice         1 dim
no index    1 slice         2 dim
1  index    1 slice         1 dim
2  index   no slice         not an array (scalar)
no index    2 slice         2 dim
</pre>

### Boolean Indexing

Let’s consider an example where we have some data in an array and an array of names
with duplicates. I’m going to use here the randn function in numpy.random to generate
some random normally distributed data:

In [None]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

In [None]:
data = np.random.randn(7, 4)

In [None]:
names

In [None]:
data

Suppose each name corresponds to a row in the data array and we wanted to select all the rows with corresponding name 'Bob'.  
Like arithmetic operations, comparisons (such as ==) with arrays are also **vectorized**.   
Thus, comparing names with the string 'Bob' yields a boolean array:

In [None]:
names == 'Bob'

This boolean array can be passed when indexing the array:

In [None]:
data[names == 'Bob']

The boolean array must be of the same length as the array axis it’s indexing. You can
even mix and match boolean arrays with slices or integers (or sequences of integers;
more on this later).

<img style="float: left;" src="pic/pic_0_1.png">

<span style="color:red">In older NumPy releases, boolean selection did not fail if the boolean array was not the correct length; current releases raise an IndexError instead. Either way, I recommend care when using this feature.
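How a mismatched mask behaves depends on the NumPy version, so a cheap length check before indexing is a reasonable habit. A minimal sketch, using a made-up data array and the hypothetical name names2; on recent NumPy releases the mismatched selection raises IndexError:

```python
import numpy as np

data = np.arange(28).reshape((7, 4))
# 8 names for 7 rows: one too many
names2 = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe', 'Bob'])

mask = names2 == 'Bob'
lengths_match = len(mask) == data.shape[0]   # cheap check before indexing

try:
    data[mask]
    raised = False
except IndexError:
    # Recent NumPy releases reject a mask whose length differs from the axis
    raised = True

print(lengths_match, raised)
```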

In [None]:
names2 = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe', 'Bob'])  # 8 names for 7 rows

In [None]:
names2

In [None]:
data[names2 == 'Bob']  # the mask has 8 entries but data has only 7 rows

In these examples, I select from the rows where names == 'Bob' and index the columns, too:


In [None]:
data[names == 'Bob', 2:]

In [None]:
data[names == 'Bob', 3]

To select everything but 'Bob', you can either use != or negate the condition using ~:


In [None]:
names != 'Bob'

In [None]:
data[~(names == 'Bob')]

The ~ operator can be useful when you want to invert a general condition.

In [None]:
cond = names == 'Bob'
data[~cond]

To select two of the three names by combining multiple boolean conditions, use
boolean arithmetic operators like & (and) and | (or):

In [None]:
mask = (names == 'Bob') | (names == 'Will')
mask

In [None]:
data[mask]

Selecting data from an array by boolean indexing *always* creates a copy of the data,
even if the returned array is unchanged.
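The copy semantics can be verified with a short sketch (the array and mask here are illustrative):

```python
import numpy as np

data = np.arange(12).reshape((3, 4))
mask = np.array([True, False, True])

selected = data[mask]        # boolean indexing copies the selected rows
selected[:] = 0              # so this does not change data

print(np.shares_memory(data, selected))
print(data[0])
```

Assigning through the mask, as in data[mask] = 0, is different: that modifies data in place.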


<img style="float: left;" src="pic/pic_0_1.png">

<span style="color:red">The Python keywords and and or do not work with boolean arrays. Use & (and) and | (or) instead.


The following code raises an error:

In [None]:
mask = (names == 'Bob') and (names == 'Will')
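The failure and its fix can be sketched side by side. The keyword `and` asks each array for a single truth value, which a multi-element array cannot give, while & and | combine the arrays element by element (the short names array is illustrative):

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob'])

try:
    mask = (names == 'Bob') and (names == 'Will')
    failed = False
except ValueError:
    # `and` needs one truth value, which a multi-element array does not have
    failed = True

both = (names == 'Bob') & (names == 'Will')     # element-wise and
either = (names == 'Bob') | (names == 'Will')   # element-wise or

print(failed, both, either)
```

The parentheses around each comparison are required, because & and | bind more tightly than ==.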

Setting values with boolean arrays works in a common-sense way. To set all of the
negative values in data to 0 we need only do:

In [None]:
data[data < 0] = 0
data

In [None]:
names

Setting whole rows or columns using a one-dimensional boolean array is also easy:

In [None]:
data[names != 'Joe'] = 7
data

### Fancy Indexing

*Fancy indexing* is a term adopted by NumPy to describe indexing using integer arrays.

In [None]:
arr = np.zeros((8, 4))

In [None]:
for i in range(8):
    arr[i] = i

In [None]:
arr

To select out a subset of the rows in a particular order, you can simply pass a list or ndarray of integers specifying the desired order.

In [None]:
arr[[4, 3, 0, 6]]

Hopefully this code did what you expected! Using negative indices selects rows from
the end:

In [None]:
arr[[-3, -5, -7]]

In [None]:
arr = np.arange(32).reshape((8, 4))

In [None]:
arr

Passing multiple index arrays does something slightly different; it selects a one-dimensional array of elements corresponding to each tuple of indices.

In [None]:
arr[[1, 5, 7, 2], [0, 3, 1, 2]]

Here the elements (1, 0), (5, 3), (7, 1), and (2, 2) were selected. Regardless of how many dimensions the array has (here, only 2), the result of fancy indexing with as many integer arrays as there are axes is always one-dimensional. 
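If what you want is the rectangular block formed by a subset of rows and columns, rather than one element per index pair, two approaches are sketched below. Both are illustrations, not the only options: indexing in two steps, or building an open mesh with np.ix_.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

pairs = arr[[1, 5, 7, 2], [0, 3, 1, 2]]             # elements (1,0), (5,3), (7,1), (2,2)
block = arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]          # pick rows first, then reorder columns
block_ix = arr[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])]  # same rectangle in one step

print(pairs)
print(block)
```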

### Transposing Arrays and Swapping Axes

Transposing is a special form of reshaping that similarly returns a view on the underlying data without copying anything.   
Arrays have the **transpose** method and also the special T attribute.

In [None]:
arr = np.arange(15).reshape((3, 5))

In [None]:
arr

In [None]:
arr.T

When doing matrix computations, you may do this very often—for example, when computing the inner matrix product using **np.dot**.

In [None]:
arr = np.random.randn(6, 3)

In [None]:
arr.T

In [None]:
arr

In [None]:
np.dot(arr.T, arr)
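For arrays with more than two dimensions, axes can be reordered with transpose (given a tuple of axis numbers) or exchanged pairwise with swapaxes. A small sketch on an illustrative 2 × 2 × 4 array:

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))

swapped = arr.swapaxes(1, 2)          # exchange the last two axes
permuted = arr.transpose((1, 0, 2))   # reorder axes explicitly

print(swapped.shape)                  # axes 1 and 2 have traded places
print(permuted.shape)
print(np.shares_memory(arr, swapped)) # swapaxes gives a view, not a copy
```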

## 4.2  Universal Functions: Fast Element-Wise Array Functions

In [None]:
L=[1,2,3,4]

In [None]:
import math
math.sqrt(L)  # TypeError: math.sqrt expects a single number, not a list

A universal function, or *ufunc*, is a function that performs element-wise operations on data in ndarrays.  
You can think of them as fast *vectorized wrappers* for simple functions that take one or more scalar values and produce one or more  scalar results.   

Many ufuncs are simple element-wise transformations, like **sqrt** or **exp**:

In [None]:
arr = np.arange(10)

In [None]:
arr

In [None]:
np.sqrt(arr)

In [None]:
np.exp(arr)

These are referred to as *unary* ufuncs.   
Others, such as **add** or **maximum**, take two arrays (thus, binary ufuncs) and return a single array as the result.


In [None]:
x = np.random.randn(8)

In [None]:
y = np.random.randn(8)

In [None]:
x

In [None]:
y

In [None]:
np.maximum(x, y)

Here, numpy.maximum computed the element-wise maximum of the elements in x and y. 
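A ufunc can also return more than one array. As a sketch with illustrative values, np.modf splits each element into its fractional and integral parts, and np.add is the binary ufunc behind the + operator on arrays:

```python
import numpy as np

arr = np.array([1.5, -2.25, 7.0])

# modf is a unary ufunc with two outputs: fractional and integral parts
fractional, whole = np.modf(arr)

# add is a binary ufunc; here the scalar 1.0 is combined with every element
total = np.add(arr, 1.0)

print(fractional, whole, total)
```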

<img style="float: left;" src="pic/pic_4_5.png" width="700">

<img style="float: left;" src="pic/pic_4_6.png" width="700">

<img style="float: left;" src="pic/pic_4_7.png" width="700">

## 4.3 Array-Oriented Programming with Arrays

Using NumPy arrays enables you to express many kinds of data processing tasks as <span style="color:blue">**concise array expressions**<span style="color:black"> that might otherwise require writing loops.     
This practice of replacing explicit loops with array expressions is commonly referred to as <span style="color:red">**vectorization**<span style="color:black">.   
In general, vectorized array operations will often be one or two (or more) orders of magnitude faster than their pure Python equivalents, with the biggest impact in any kind of numerical computations. 

As a simple example, suppose we wished to evaluate the function sqrt(x^2 + y^2) across a regular grid of values.   
The **np.meshgrid** function takes two 1D arrays and produces two 2D matrices corresponding to all pairs of (x, y) in the two arrays:

In [None]:
points = np.arange(-5, 5, 0.01) # 1000 equally spaced points

In [None]:
points

In [None]:
xs, ys = np.meshgrid(points, points)

In [None]:
xs

In [None]:
ys

In [None]:
z = np.sqrt(xs ** 2 + ys ** 2)
z
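On a 1000 × 1000 grid the structure of xs and ys is hard to see. A tiny grid with made-up values makes it explicit:

```python
import numpy as np

x = np.array([0, 1, 2])
y = np.array([10, 20])

xs, ys = np.meshgrid(x, y)
# xs repeats x along each row; ys repeats y down each column
print(xs)
print(ys)
```

Each position (i, j) pairs xs[i, j] with ys[i, j], so together the two arrays enumerate every (x, y) combination.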

As a preview of Chapter 9, I use matplotlib to create visualizations of this two-dimensional array:


In [None]:
import matplotlib.pyplot as plt
plt.imshow(z, cmap=plt.cm.gray); plt.colorbar()
plt.title(r"Image plot of $\sqrt{x^2 + y^2}$ for a grid of values")  # raw string keeps the backslash intact

plt.draw()

In [None]:
plt.close('all')

### Expressing Conditional Logic as Array Operations

The **numpy.where** function is a *vectorized version* of the ternary expression **x if condition else y**. 

In [None]:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

Suppose we wanted to take a value from xarr whenever the corresponding value in cond is True, and otherwise take the value from yarr.  
A list comprehension doing this might look like:

In [None]:
result = [(x if c else y)
          for x, y, c in zip(xarr, yarr, cond)]
result

This has multiple problems.   
First, it will not be very fast for large arrays (because all the work is being done in interpreted Python code).   
Second, it will not work with multidimensional arrays.   
With np.where you can write this very concisely:

In [None]:
result = np.where(cond, xarr, yarr)
result

The second and third arguments to **np.where** don’t need to be arrays; one or both of them can be scalars.   
A typical use of where in data analysis is to produce a new array of values based on another array.   
Suppose you had a matrix of randomly generated data and you wanted to replace all positive values with 2 and all negative values with –2.   
This is very easy to do with np.where:

In [None]:
arr = np.random.randn(4, 4)

In [None]:
arr

In [None]:
arr > 0

In [None]:
np.where(arr > 0, 2, -2)

You can combine scalars and arrays when using np.where. For example, I can replace
all positive values in arr with the constant 2 like so:


In [None]:
np.where(arr > 0, 2, arr) # set only positive values to 2
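Calls to np.where can be nested to express a chain of conditions, much like if / elif / else applied element by element. A sketch with illustrative thresholds and output codes:

```python
import numpy as np

arr = np.array([[-1.5, 0.2], [3.0, -0.1]])

# Nested where: 2 if above 1, else 1 if positive, else -1
codes = np.where(arr > 1, 2,
                 np.where(arr > 0, 1, -1))

print(codes)
```

This works on arrays of any dimensionality, which the list comprehension version does not.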

### Mathematical and Statistical Methods

A set of mathematical functions that compute statistics about an entire array or about the data along an axis are accessible as methods of the array class.   
You can use aggregations (often called *reductions*) like **sum**, **mean**, and **std** (standard deviation) either by calling the array instance method or using the top-level NumPy function.   

Here I generate some normally distributed random data and compute some aggregate statistics:


In [None]:
arr = np.random.randn(5, 4)

In [None]:
arr

In [None]:
arr.mean()

In [None]:
np.mean(arr)

In [None]:
arr.sum()

In [None]:
np.sum(arr)

Functions like **mean** and **sum** take an optional axis argument that computes the statistic over the given axis, resulting in an array with one fewer dimension.

For an *m* × *n* array, the result of mean(0) (and likewise sum, max, and min) is an array of size *n*, and the result of mean(1) is an array of size *m*. 

In [None]:
arr.sum(axis=0)

In [None]:
arr.mean(axis=1)

In [None]:
arr.mean(1)

Here, **arr.mean(1)** means “compute mean across the columns,” whereas **arr.sum(0)** means “compute sum down the rows.”   

Other methods like **cumsum** and **cumprod** do not aggregate, instead producing an array of the intermediate results:

In [None]:
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7])
arr.cumsum()

In multidimensional arrays, accumulation functions like cumsum return an array of
the same size, but with the partial aggregates computed along the indicated axis
according to each lower dimensional slice:


In [None]:
arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

In [None]:
arr

In [None]:
arr.cumsum(axis=0)

In [None]:
arr.cumprod(axis=1)

<img style="float: left;" src="pic/pic_4_8.png" width="700">

### Methods for Boolean Arrays

Boolean values are coerced to 1 (**True**) and 0 (**False**) in the preceding methods.   
Thus, **sum** is often used as a means of counting **True** values in a boolean array.

In [None]:
arr = np.random.randn(100)

In [None]:
arr

In [None]:
arr>0

In [None]:
(arr > 0).sum() # Number of positive values

There are two additional methods, **any** and **all**, useful especially for boolean arrays. any tests whether one or more values in an array is True, while all checks if every value is True.


In [None]:
bools = np.array([False, False, True, False])

In [None]:
bools.any()

In [None]:
bools.all()

### Sorting

Like Python’s built-in list type, NumPy arrays can be sorted in-place with the **sort**
method:

In [None]:
arr = np.random.randn(6)

In [None]:
arr

In [None]:
arr.sort()

In [None]:
arr

You can sort each one-dimensional section of values in a multidimensional array in-place along an axis by passing the axis number to sort.

In [None]:
arr = np.random.randn(5, 3)

In [None]:
arr

In [None]:
arr.sort(1)

In [None]:
arr

The top-level function **np.sort** returns a sorted copy of an array instead of modifying the array in-place. A quick-and-dirty way to compute the quantiles of an array is to sort it and select the value at a particular rank.

In [None]:
arr = np.random.randn(5)
arr

In [None]:
new=np.sort(arr)
new

In [None]:
arr

In [None]:
arr.sort()

In [None]:
arr
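The quick-and-dirty quantile mentioned above can be sketched as follows. The array size, the seed, and the 5% level are arbitrary choices for illustration:

```python
import numpy as np

np.random.seed(0)
large_arr = np.random.randn(1000)

sorted_arr = np.sort(large_arr)                  # sorted copy; large_arr is unchanged
rank = int(0.05 * len(sorted_arr))               # position of the 5% quantile
q05 = sorted_arr[rank]

print(rank, q05)
```

Exactly `rank` values lie below the selected one, which is what makes it an estimate of the 5% quantile.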

### Unique and Other Set Logic

NumPy has some basic set operations for one-dimensional ndarrays.   
A commonly used one is **np.unique**, which returns the sorted unique values in an array.


In [None]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

In [None]:
np.unique(names)

In [None]:
ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])

In [None]:
np.unique(ints)

<img style="float: left;" src="pic/pic_4_9.png" width="700">

In [None]:
values = np.array([6, 0, 0, 3, 2, 5, 6])
np.in1d(values, [2, 3, 6])  # membership test per element; np.isin is the newer spelling
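A few of the other set functions, sketched on two small illustrative arrays. Each returns sorted unique values, except the membership test, which returns one boolean per element of the first array:

```python
import numpy as np

a = np.array([6, 0, 0, 3, 2, 5, 6])
b = np.array([2, 3, 6, 7])

member = np.isin(a, b)            # element-wise membership test
common = np.intersect1d(a, b)     # sorted values present in both
combined = np.union1d(a, b)       # sorted values present in either
only_a = np.setdiff1d(a, b)       # sorted values in a but not in b

print(member, common, combined, only_a)
```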

## 4.4  File Input and Output with Arrays

NumPy is able to save and load data to and from disk either in text or binary format.
In this section I only discuss NumPy’s built-in binary format, since most users will
prefer pandas and other tools for loading text or tabular data.

np.save and np.load are the two workhorse functions for efficiently saving and loading
array data on disk. Arrays are saved by default in an uncompressed raw binary
format with file extension *.npy*; if the file path does not already end in *.npy*, the extension is appended.

In [None]:
arr = np.arange(10)

In [None]:
np.save('green', arr)

In [None]:
x=np.load('green.npy')
x
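Several arrays can be stored together in a single uncompressed archive with np.savez. A minimal sketch that writes to a temporary directory so nothing is left behind; the file name and the keys a and b are arbitrary choices:

```python
import os
import tempfile
import numpy as np

a = np.arange(10)
b = np.arange(5) * 2.0

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'array_archive.npz')   # hypothetical file name
    np.savez(path, a=a, b=b)          # keyword names become the keys
    with np.load(path) as archive:    # dict-like object; arrays are read on access
        loaded_a = archive['a']
        loaded_b = archive['b']

print(loaded_a, loaded_b)
```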

## 4.5  Linear Algebra

Linear algebra, like matrix multiplication, decompositions, determinants, and other square matrix math, is an important part of any array library.   
Multiplying two two-dimensional arrays with **\*  is an element-wise product** instead of a matrix dot product.   
Thus, there is a function **dot**, both an array method and a function in the numpy namespace, **for matrix multiplication**:


In [None]:
x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])

In [None]:
x

In [None]:
y

In [None]:
x.dot(y)

x.dot(y) is equivalent to np.dot(x, y):

In [None]:
np.dot(x, y)

##### Check: dot with a one-dimensional array

In [None]:
w=np.ones(3)

In [None]:
w

In [None]:
w.shape

In [None]:
x

In [None]:
x.shape

In [None]:
x.dot(x)  # ValueError: a (2, 3) array cannot be matrix-multiplied by another (2, 3) array

In [None]:
np.dot(x, np.ones(3))

In [None]:
np.dot(x,w)

In [None]:
x.dot(w)

In [None]:
x.dot(w).shape

The **@** symbol (as of Python 3.5) also works as an infix operator that performs matrix multiplication:

In [None]:
x @ np.ones(3)

**numpy.linalg** has a standard set of matrix decompositions and things like inverse and determinant.   
These are implemented under the hood via the same industry standard linear algebra libraries used in other languages like MATLAB and R, such as BLAS, LAPACK, or possibly (depending on your NumPy build) the proprietary Intel MKL (Math Kernel Library):


In [None]:
from numpy.linalg import inv, qr

In [None]:
X = np.random.randn(5, 5)

In [None]:
X

In [None]:
X.T

In [None]:
mat = X.T.dot(X)

In [None]:
mat

In [None]:
inv(mat)

In [None]:
mat.dot(inv(mat))
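Because of floating-point rounding, the product of a matrix and its inverse is only approximately the identity, so np.allclose is a better check than reading the printed output. The sketch below uses a small hand-picked matrix rather than random data, and also exercises the imported qr function:

```python
import numpy as np
from numpy.linalg import inv, qr

X = np.array([[2., 1.], [1., 3.]])
mat = X.T @ X                       # symmetric 2 x 2 matrix

identity = mat @ inv(mat)           # equals the identity up to rounding error
q, r = qr(mat)                      # mat == q @ r, with r upper triangular

print(np.allclose(identity, np.eye(2)))
print(np.allclose(q @ r, mat))
print(r)
```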

<img style="float: left;" src="pic/pic_4_10.png" width="700">

<img style="float: left;" src="pic/pic_4_11.png" width="700">

## 4.6 Pseudorandom Number Generation

The **numpy.random** module supplements the built-in Python **random** with functions
for efficiently generating whole arrays of sample values from many kinds of probability distributions. For example, you can get a 4 × 4 array of samples from the standard
normal distribution using **normal**:

In [None]:
samples = np.random.normal(size=(4, 4))
samples

Python’s built-in **random** module, by contrast, only samples one value at a time. As
you can see from this benchmark, **numpy.random** is well over an order of magnitude
faster for generating very large samples:


In [None]:
from random import normalvariate
N = 1000000

In [None]:
%timeit samples = [normalvariate(0, 1) for _ in range(N)]

In [None]:
%timeit samples2=np.random.normal(size=N)

We say that these are pseudorandom numbers because they are generated by an algorithm with deterministic behavior based on the *seed* of the random number generator. You can change NumPy’s random number generation seed using
**np.random.seed**:
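A short sketch of what seeding buys you; the seed value 1234 is arbitrary. Resetting the global seed replays the same sequence, and np.random.RandomState creates a generator that is isolated from the global state:

```python
import numpy as np

np.random.seed(1234)
first = np.random.randn(3)

np.random.seed(1234)               # resetting the seed replays the same sequence
second = np.random.randn(3)

rng = np.random.RandomState(1234)  # a generator isolated from the global state
third = rng.randn(3)

print(first, second, third)
```

Using a private generator avoids surprises when other code also draws from the global one.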


<img style="float: left;" src="pic/pic_4_12.png" width="700">

## 4.7  Example: Random Walks

The simulation of random walks provides an illustrative application of utilizing array operations. Let’s first consider a simple random walk starting at 0 with steps of 1 and –1 occurring with equal probability. 

Here is a pure Python way to implement a single random walk with 1,000 steps using the built-in **random** module:


In [None]:
import random
position = 0
walk = [position]
steps = 1000
for i in range(steps):
    step = 1 if random.randint(0, 1) else -1
    position += step
    walk.append(position)

In [None]:
plt.figure()

plt.plot(walk[:100])

You might make the observation that walk is simply the cumulative sum of the random steps and could be evaluated as an array expression.   
Thus, I use the **np.random** module to draw all of the coin flips at once (nsteps of them, 100 in the cells below), set these to 1 and –1, and compute the cumulative sum.


In [None]:
np.random.seed(20210301)

In [None]:
nsteps = 100

In [None]:
np.random.randint?

In [None]:
draws = np.random.randint(0, 2, size=nsteps)

In [None]:
draws

In [None]:
steps = np.where(draws > 0, 1, -1)

In [None]:
walk = steps.cumsum()

In [None]:
walk

In [None]:
walk.min()

In [None]:
walk.max()

A more complicated statistic is the first crossing time, the step at which the random walk reaches a particular value.   
Here we might want to know how long it took the random walk to get at least 5 steps away from the origin 0 in either direction.   
np.abs(walk) >= 5 gives us a boolean array indicating where the walk has reached or exceeded 5, but we want the index of the first 5 or –5.   
Turns out, we can compute this using argmax, which returns the first index of the maximum value in the boolean array (True is the maximum value).

In [None]:
(np.abs(walk) >= 5).argmax()
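One caveat: if the walk never reaches the threshold, the boolean array is all False and argmax returns 0, which looks like a crossing at the very first step. A sketch of a guard using any, on a short hand-written walk (the name first_crossing is illustrative):

```python
import numpy as np

walk = np.array([1, 2, 1, 0, -1, -2, -1])   # a short walk that never reaches 5

reached = np.abs(walk) >= 5
# argmax on an all-False array returns 0, indistinguishable from "crossed at step 0"
naive = reached.argmax()
first_crossing = reached.argmax() if reached.any() else None

print(naive, first_crossing)
```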

### Simulating Many Random Walks at Once

If your goal was to simulate many random walks, say 5,000 of them, you can generate all of the random walks with minor modifications to the preceding code.   
If passed a 2-tuple, the **numpy.random** functions will generate a two-dimensional array of draws, and we can compute the cumulative sum across the rows to compute all 5,000 random walks in one shot:


In [None]:
nwalks = 5000

In [None]:
nsteps = 1000

In [None]:
draws = np.random.randint(0, 2, size=(nwalks, nsteps)) # 0 or 1

In [None]:
steps = np.where(draws > 0, 1, -1)

In [None]:
walks = steps.cumsum(1)

In [None]:
walks

In [None]:
walks.max()

In [None]:
walks.min()

Out of these walks, let’s compute the minimum crossing time to 30 or –30.   
This is slightly tricky because not all 5,000 of them reach 30.   
We can check this using the any method:


In [None]:
hits30 = (np.abs(walks) >= 30).any(1)

In [None]:
hits30

In [None]:
hits30.sum() # Number that hit 30 or -30

In [None]:
crossing_times = (np.abs(walks[hits30]) >= 30).argmax(1)
crossing_times

In [None]:
crossing_times.mean()