# Chapter 2: Introduction to NumPy

It will help us to think of all data as arrays of numbers. No matter what the data are, the first step in making them analyzable will be to transform them into arrays of numbers. In some ways, NumPy arrays are like Python's built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size. NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python.

In [2]:
import numpy as np

Effective data-driven science and computation requires understanding how data is stored and manipulated.

The standard Python implementation is written in C. This means that every Python object is a cleverly disguised C structure. 

**A Python list is more than just a list**
- Because of Python's dynamic typing, we can create heterogeneous lists. This flexibility comes at a cost: to allow these flexible types, each item in the list must contain its own type info, reference count, and other information; that is, each item is a complete Python object. In the special case that all variables are of the same type, much of this information is redundant: it can be much more efficient to store data in a fixed-type array.

At the implementation level, the array contains a single pointer to one contiguous block of data. The Python list, on the other hand, contains a pointer to a block of pointers, each of which in turn points to a full Python object such as a Python integer. Again, the advantage of a list is flexibility: because each list element is a full structure containing both data and type information, the list can be filled with data of any desired type. Fixed-type NumPy-style arrays lack this flexibility, but are much more efficient for storing and manipulating data.
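
As a rough illustration of the storage difference described above, the sketch below compares the bytes used by a NumPy array with a CPython list holding the same integers. The variable names are hypothetical, and the list figure is only an estimate from `sys.getsizeof` that varies between Python builds.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# The array keeps its values in one contiguous buffer: n items of 8 bytes each.
print(np_array.itemsize)   # bytes per element
print(np_array.nbytes)     # bytes for the whole data buffer

# The list stores n pointers, and each pointer refers to a separate int object.
# This estimate adds the list's own size to the size of every element object.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)
print(list_bytes)
```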

**Fixed-Type Arrays in Python**
Python offers several different options for storing data in efficient, fixed-type data buffers. The built-in `array` module can be used to create dense arrays of a uniform type.

In [None]:
import array

L = list(range(10))
A = array.array('i',L)
A

Here `'i'` is a type code indicating the contents are integers. 

However, the `ndarray` object of the NumPy package is much more useful. Python's `array` object provides an efficient way to store array-based data, but NumPy adds efficient operations on that data. Remember that unlike Python lists, NumPy is constrained to arrays whose elements all have the same type. If the types do not match, NumPy will upcast if possible. If we want to explicitly set the data type of the resulting array, we can use the `dtype` keyword. Finally, unlike Python lists, NumPy arrays can be explicitly multidimensional.

In [None]:
#integer array
np.array([1,4,2,5,3])

In [None]:
#Numpy will upcast if types don't match
np.array([3.14,4,2,3])

In [None]:
np.array([1,2,3,4],dtype='float32')

In [3]:
#nested lists result in multidimensional arrays
np.array([range(i,i+3) for i in [2,4,6]])

array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])

### Creating Arrays from Scratch
Especially for larger arrays, it is more efficient to create arrays from scratch using routines built into NumPy.

In [None]:
# Create a length-10 integer array filled with zeros 
np.zeros(10,dtype=int)

# Create a 3x5 floating-point array filled with 1s
np.ones((3,5),dtype=float)

# Create a 3x5 array filled with 3.14
np.full((3,5),3.14)

# Create an array filled with a linear sequence 
# Starting at 0, ending at 20, stepping by 2
# This is similar to the built in range() function
np.arange(0,20,2)

# Create an array of five values evenly spaced between 0 and 1
np.linspace(0,1,5)

# Create a 3x3 array of uniformly distributed random values between 0 and 1
np.random.random((3,3))

# Create a 3x3 array of normally distributed random values with mean 0 and standard deviation of 1
np.random.normal(0,1,(3,3))

# Create a 3x3 array of random integers in the half-open interval [0, 10)
np.random.randint(0,10,(3,3))

# Create a 3x3 identity matrix
np.eye(3)

### The Basics of NumPy Arrays 
Data manipulation in Python is nearly synonymous with NumPy array manipulation: even newer tools like Pandas are built around the NumPy array. 

Basic array manipulation: 
- **attributes of arrays** - determining the size, shape, memory consumption, and data types of arrays
- **indexing of arrays** - getting and setting the value of individual array elements
    - in a one dimensional array, you can access the ith value (counting from zero) by specifying the desired index in square brackets, just as with Python lists
    - to index from the end of the array you can use negative indices
    - in a multidimensional array, you can access items using a comma-separated tuple of indices
- **slicing of arrays** - getting and setting smaller subarrays within a larger array
    - Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the slice notation, marked by the colon `:` character. 
    - `x[start:stop:step]`
        - if any of these are unspecified, they default to the values start=0, stop = size of dimension, step = 1
        - a potentially confusing case is when the step value is negative
    - multidimensional slices work in the same way, with multiple slices separated by commas
    - to access a single row or column of an array you can combine indexing and slicing using an empty slice marked by a `:`
    - one important thing to know about array slices is that they return views rather than copies of the array data. 
        - this default behavior is actually quite useful: it means that when we work with large datasets, we can access and process pieces of these datasets without the need to copy the underlying data buffer
            - this is one area in which NumPy array slicing differs from Python list slicing: in lists slices will be copies 
- **reshaping of arrays** - changing the shape of a given array
- **joining and splitting of arrays** - combining multiple arrays into one, and splitting one array into many
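
The attributes and the negative-step slicing case listed above can be sketched as follows. This is a minimal example on small made-up arrays; with a negative step, the defaults for `start` and `stop` are swapped, which is what makes `x[::-1]` reverse the array.

```python
import numpy as np

x3 = np.zeros((3, 4, 5), dtype=np.int64)

# Array attributes: dimensions, shape, total size, and memory use
print(x3.ndim, x3.shape, x3.size)        # 3 (3, 4, 5) 60
print(x3.dtype, x3.itemsize, x3.nbytes)  # int64 8 480

x = np.arange(10)

# Negative step: start defaults to the last element, stop to the beginning
reversed_x = x[::-1]    # all elements, reversed
from_five = x[5::-2]    # every other element, walking backwards from index 5
print(reversed_x)
print(from_five)        # [5 3 1]
```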

In [None]:
x1 = np.random.randint(10,size=6) # one dimensional array
x2 = np.random.randint(10,size=(3,4)) # two-dimensional array
x3 = np.random.randint(10, size =(3,4,5)) # three-dimensional array

In [None]:
x1[0]
x2[0,0]

You can also modify values using any of the above index notation. Keep in mind that unlike Python lists, NumPy arrays have a fixed type. This means that, for example, if you attempt to insert a floating-point value to an integer array, the value will be silently truncated. 

In [None]:
x2[0,0]=12
x1[0]=3.144444 # will be truncated
x1

In [None]:
x = np.arange(10)
x

In [None]:
x[:5] #first five elements
x[5:] #elements from index 5 onward
x[4:7] #middle subarray
x[::2] #every other element
x[1::2] #every other element starting at index 1

In [None]:
x2

In [None]:
x2[:2,:3] #two rows and three columns
x2[:3,::2] #all rows every other column

#subarray dimensions can even be reversed together
x2[::-1,::-1]

In [None]:
print(x2[:,0]) # first column of x2
print(x2[0,:]) # first row of x2

In [None]:
print(x2)

In [None]:
# Let's extract a subarray from this: 
x2_sub = x2[:2,:2]
print(x2_sub)

In [None]:
#Now if we modify this subarray we will see that the original array is changed!
x2_sub[0,0]=99
print(x2)

### Creating Copies of Arrays
Despite the advantages of array views, it is sometimes useful to explicitly copy the data within an array or a subarray. This is most easily done with the `copy()` method.

In [None]:
x2_sub_copy=x2[:2,:2].copy() #if we modify this subarray, the original array is not touched
print(x2)

### Reshaping Arrays

- `reshape()`: note that for this to work, the size of the initial array must match the size of the reshaped array. Whenever possible, the `reshape()` method will use a no-copy view of the initial array, but this is not always the case. 

- Another common reshaping pattern is the conversion of a one-dimensional array into a two dimensional row or column matrix. You can do this with the `reshape()` method or by making use of the `newaxis` keyword within a slice operation. 

In [None]:
grid = np.arange(1,10).reshape((3,3))
print(grid)

In [None]:
x=np.array([1,2,3])

# row vector via reshape
x.reshape(1,3)

# row vector via newaxis
x[np.newaxis,:]

In [None]:
# column vector via reshape
x.reshape(3,1)

# column vector via newaxis
x[:,np.newaxis]

### Array Concatenation and Splitting 
It is possible to combine multiple arrays into one and, conversely, to split a single array into multiple arrays.

Concatenation
- `np.concatenate`
- `np.vstack`
- `np.hstack`
- `np.dstack` will stack arrays along the third axis

Splitting - for each of these we can pass a list of indices giving the split points. Notice that N split points leads to N + 1 subarrays. 
- `np.split`
- `np.hsplit`
- `np.vsplit`
- `np.dsplit` will split arrays along the third axis
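
The third-axis routines in the two lists above can be sketched as a round trip, assuming two small made-up arrays `a` and `b`: `np.dstack` joins them along a new third axis, and `np.dsplit` with one split point separates them again.

```python
import numpy as np

a = np.arange(6).reshape((2, 3))
b = a * 10

# Stack along the third axis: two (2, 3) arrays become one (2, 3, 2) array
stacked = np.dstack([a, b])
print(stacked.shape)

# One split point along the third axis gives two subarrays of shape (2, 3, 1)
front, back = np.dsplit(stacked, [1])
print(front.shape, back.shape)
```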

In [None]:
x = np.array([1,2,3])
y = np.array([3,2,1])
z = np.array([99,99,99])

np.concatenate([x,y])
np.concatenate([x,y,z])

# np.concatenate can also be used for two-dimensional arrays
grid = np.array([[1,2,3],[3,4,5]])
print(grid)

In [None]:
# Concatenate along the first axis
np.concatenate([grid,grid])

In [None]:
# Concatenate along the second axis 
np.concatenate([grid,grid],axis=1)

In [None]:
# Vertically stack the arrays
np.vstack([x,grid])

# Horizontally stack the arrays 
y = np.array([[99],[99]])
np.hstack([grid,y])

In [None]:
# Splitting of arrays
x = [1,2,3,99,99,3,2,1]
x1,x2,x3 = np.split(x,[3,5])
print(x1,x2,x3)

In [None]:
grid = np.arange(16).reshape((4,4))
print(grid)

In [None]:
upper, lower = np.vsplit(grid,[2])

In [None]:
print(upper,'\n\n', lower)

In [None]:
left, right = np.hsplit(grid,[2])
print(left,'\n\n', right)

### Computation on NumPy Arrays: Universal Functions
NumPy is so important in the Python data science world because it provides a flexible interface to optimized computation with arrays. Computation on NumPy arrays can be very fast, or it can be very slow. The key to making it fast is to use *vectorized* operations, which are implemented through NumPy's *universal functions* (ufuncs). The main purpose of ufuncs is to quickly execute repeated operations on values in NumPy arrays. Ufuncs can also operate on multidimensional arrays.

#### The Slowness of Loops
Python's default implementation does some operations very slowly. This is in part due to the fact that data types are flexible so that sequences of operations cannot be compiled down to efficient machine code as in languages like C. Specifically, Python can be slow in situations where many small operations are being repeated (e.g., looping over arrays to operate on each element). The vectorized approach is designed to push the loop into the compiled layer that underlies NumPy, leading to much faster execution.

For example, imagine we have an array of values and we'd like to compute the reciprocal of each. A straightforward approach might look like this:

In [None]:
import numpy as np
np.random.seed(0)

def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output
        
values = np.random.randint(1, 10, size=5)
compute_reciprocals(values)

This implementation feels fairly natural. But if we measure the execution time of this code, we see that this operation is very slow. 

In [None]:
big_array = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(big_array)

It turns out that the bottleneck here is not the operations themselves, but the type-checking that Python must do at each cycle of the loop. 

We can instead use a vectorized approach. This is one that is designed to push the loop into the compiled layer that underlies NumPy, leading to much faster execution.

Compare the results of the following two:

In [None]:
%timeit compute_reciprocals(values)
%timeit 1.0/values

Vectorized operations in NumPy are implemented via ufuncs, whose main purpose is to quickly execute repeated operations on values in NumPy arrays. Note that ufunc operations are not limited to one-dimensional arrays; they can also act on multidimensional arrays.

Computations using vectorization through ufuncs are nearly always more efficient than their counterpart implemented through Python loops (especially as the arrays grow in size). Any time you see a loop in a Python script, you should consider whether it can be replaced with a vectorized expression. 

Ufuncs exist in two flavors: *unary ufuncs* which operate on a single input, and *binary ufuncs* which operate on two inputs. 
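
The distinction between the two flavors can be checked directly, since every ufunc exposes its number of inputs through the `nin` attribute. A minimal sketch on a small made-up array:

```python
import numpy as np

x = np.arange(1, 5)

# Number of inputs each ufunc expects
print(np.negative.nin)  # 1 (unary)
print(np.add.nin)       # 2 (binary)

neg = np.negative(x)    # same result as -x
total = np.add(x, 10)   # same result as x + 10
print(neg, total)
```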

**Array arithmetic**
NumPy's ufuncs feel very natural to use because they make use of Python's native arithmetic operators. The standard addition, subtraction, multiplication, and division can all be used. There is also a unary ufunc for negation, a `**` operator for exponentiation, and a `%` operator for modulus.

In [None]:
x = np.arange(4)
print("x     =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2)  # floor division
print("-x     = ", -x)
print("x ** 2 = ", x ** 2)
print("x % 2  = ", x % 2)

**Arithmetic operations**
- `np.add` - addition
- `np.subtract` - subtraction
- `np.negative` - unary negation
- `np.multiply` - multiplication
- `np.divide` - division
- `np.floor_divide` - floor division
- `np.power` - exponentiation
- `np.mod` - modulus/remainder
- `np.absolute`, also available under the alias `np.abs` - absolute value

**Trigonometric functions**
- `np.sin`
- `np.cos`
- `np.tan`
- `np.arcsin`
- `np.arccos`
- `np.arctan`

Input angles can be generated with `np.linspace` (an array-creation routine, not a ufunc). The values are computed to within machine precision, which is why values that should be zero do not always hit exactly zero.
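
A short sketch of the machine-precision point: the angles below are chosen so that the exact sine of the last one would be zero, yet the computed value is a tiny nonzero number. The input arrays are made up for illustration.

```python
import numpy as np

# Three angles: 0, pi/2, and pi
theta = np.linspace(0, np.pi, 3)

sin_theta = np.sin(theta)
cos_theta = np.cos(theta)
print(sin_theta)  # the last entry is tiny but not exactly zero
print(cos_theta)  # the middle entry is tiny but not exactly zero

# Inverse trigonometric functions
x = np.array([-1.0, 0.0, 1.0])
print(np.arcsin(x))
print(np.arccos(x))
print(np.arctan(x))
```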

**Exponents and logs**
- `np.exp`
- `np.exp2`
- `np.power`
- `np.log`
- `np.log2`
- `np.log10`
- `np.expm1`
- `np.log1p`
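
The last two entries, `np.expm1` and `np.log1p`, compute `exp(x) - 1` and `log(1 + x)` and are intended for very small inputs, where the naive expressions lose precision. A minimal sketch with made-up inputs:

```python
import numpy as np

x = np.array([0.0, 1e-20, 1e-10])

# Naive forms: 1 + 1e-20 rounds to exactly 1.0, so the information is lost
naive_log = np.log(1 + x)
naive_exp = np.exp(x) - 1

# Specialized forms keep precision for small x
accurate_log = np.log1p(x)
accurate_exp = np.expm1(x)

print(naive_log, accurate_log)
print(naive_exp, accurate_exp)
```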

**Specialized ufuncs**
NumPy has many more ufuncs available, including hyperbolic trig functions, bitwise arithmetic, comparison operators, rounding and remainder, and many more. 

An excellent source for more specialized and obscure ufuncs is the submodule `scipy.special`
- `special.gamma`
- `special.gammaln`
- `special.beta`
- `special.erf`
- `special.erfc`
- `special.erfinv`

**Advanced Ufunc features**
For large calculations, it is sometimes useful to be able to specify the array where the result of the calculation will be stored. Rather than creating a temporary array, this can be used to write computation results directly to the memory location where you'd like them to be. For all ufuncs, this can be done using the `out` argument. 

In [None]:
x = np.arange(5)
y = np.empty(5)

np.multiply(x,10,out=y)

If we had instead written `y[:] = np.multiply(x,10)`, this would have resulted in the creation of a temporary array to hold the results of `np.multiply(x,10)`, followed by a second operation copying those values into the `y` array. (Plain `y = np.multiply(x,10)` would simply rebind the name `y` to a newly created array.) This doesn't make much of a difference for small computations, but for very large arrays the memory savings from careful use of the `out` argument can be significant.
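
The `out` argument also accepts array views, so results can be written into part of an existing array. A small sketch, with made-up arrays, that writes powers of two into every other element of `y`:

```python
import numpy as np

x = np.arange(5)
y = np.zeros(10)

# Write 2 ** x directly into the even-indexed slots of y (a view, not a copy)
np.power(2, x, out=y[::2])
print(y)  # [ 1.  0.  2.  0.  4.  0.  8.  0. 16.  0.]
```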

**Aggregates**
For binary ufuncs, there are some interesting aggregates that can be computed directly from the object.

In [None]:
x = np.arange(1, 6)
print(np.add.reduce(x))
print(np.multiply.reduce(x))
print(np.add.accumulate(x))

**Outer products**
Any ufunc can compute the output of all pairs of two different inputs using the `outer` method.  

In [None]:
x = np.arange(1,6)
np.multiply.outer(x,x)

**Aggregations: Min, Max, and Everything In Between**
Often when faced with a large amount of data, a first step is to compute summary statistics for the data in question. 

NumPy has fast built-in aggregation functions for working on arrays
- `np.sum`
- `np.min`
- `np.max`

For `min`, `max`, `sum`, and several other NumPy aggregates, a shorter syntax is to use methods of the array object itself:

In [None]:
big_array = np.random.rand(1000000)
print(big_array.min(), big_array.max(), big_array.sum())

**Multidimensional aggregates**
One common type of aggregation operation is an aggregate along a row or column.

In [None]:
M = np.random.random((3, 4))
print(M)

By default, each NumPy aggregation function will return the aggregate over the entire array:

In [None]:
M.sum()

Aggregation functions take an additional argument specifying the *axis* along which the aggregate is computed. For example, we can find the minimum value within each column by specifying `axis=0` or we can find the minimum value for each row by specifying `axis=1`. 

In [None]:
M.min(axis=0)

In [None]:
M.min(axis=1)

Note that most aggregates have `NaN` safe counterparts that compute the result while ignoring missing values.

| **Function Name** | **NaN-safe Version** | **Description** |
| ------------- |-------------| -----|
| `np.sum` |`np.nansum` |Compute sum of elements|
|`np.prod` | `np.nanprod` | Compute the product of elements|
|`np.mean` |`np.nanmean`| Compute the mean of elements|
|`np.std` |`np.nanstd`| Compute standard deviation|
|`np.var`|`np.nanvar`| Compute variance |
|`np.min` | `np.nanmin`| Find the minimum value |
|`np.max` |`np.nanmax`| Find the maximum value |
|`np.argmin`| `np.nanargmin`| Find the index of minimum value|
|`np.argmax`| `np.nanargmax`| Find the index of maximum value|
|`np.median`| `np.nanmedian`| Compute the median of elements|
|`np.percentile`| `np.nanpercentile`| Compute rank-based statistics of elements|
|`np.any`| N/A| Evaluate whether any elements are true|
|`np.all`| N/A| Evaluate whether all elements are true|
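
The difference between a plain aggregate and its NaN-safe counterpart can be sketched on a small made-up array containing one missing value:

```python
import numpy as np

data = np.array([1.0, np.nan, 3.0, 5.0])

# A single NaN poisons the plain aggregate
plain_sum = np.sum(data)
print(plain_sum)  # nan

# The NaN-safe versions ignore the missing value
safe_sum = np.nansum(data)
safe_mean = np.nanmean(data)
safe_argmax = np.nanargmax(data)
print(safe_sum, safe_mean, safe_argmax)  # 9.0 3.0 3
```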

In [None]:
import pandas as pd
df = pd.read_csv('data/president_heights.csv')
df.head()

In [None]:
heights = np.array(df['height(cm)'])
print(heights)

Now that we have this data array, we can compute a variety of summary statistics:

In [None]:
print("Mean height:       ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height:    ", heights.min())
print("Maximum height:    ", heights.max())
print("\n")
print("25th percentile:   ", np.percentile(heights, 25))
print("Median:            ", np.median(heights))
print("75th percentile:   ", np.percentile(heights, 75))

**Computation on Arrays: Broadcasting**
We saw in the previous section how NumPy's universal functions can be used to *vectorize* operations and thereby remove slow Python loops. Another means of vectorizing operations is to use NumPy's broadcasting functionality. Broadcasting is a set of rules for applying binary ufuncs (e.g., addition, subtraction, multiplication, etc.) on arrays of different sizes.

Recall that for arrays of the same size, binary operations are performed on an element-by-element basis:

In [None]:
import numpy as np
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
a+b

Broadcasting allows these types of binary operations to be performed on arrays of different sizes. Observe the result when we add a one-dimensional array to a two-dimensional array:

In [None]:
M = np.ones((3,3))
M

In [None]:
M+a

Here the one-dimensional array `a` is stretched (i.e., broadcast) in order to match the shape of `M`. More complicated cases can involve broadcasting of both arrays. 

In [None]:
a = np.arange(3)
b = np.arange(3)[:, np.newaxis]

print(a)
print(b)

In [None]:
a + b

![Screen%20Shot%202020-10-11%20at%202.48.49%20PM.png](attachment:Screen%20Shot%202020-10-11%20at%202.48.49%20PM.png)

In this image, the light boxes represent the broadcasted values. 

**Rules of Broadcasting**
Broadcasting in NumPy follows a strict set of rules to determine the interaction between two arrays:
- Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is *padded* with ones on its leading (left) side.  
- Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
- Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised. 
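
The three rules can be traced on small made-up arrays. In this sketch the first addition succeeds through Rules 1 and 2, the second fails under Rule 3, and adding an explicit trailing axis makes the shapes compatible again.

```python
import numpy as np

a = np.arange(3)            # shape (3,)

# Rule 1 pads a to (1, 3); Rule 2 stretches it to (2, 3)
M = np.ones((2, 3))
ok = M + a
print(ok.shape)             # (2, 3)

# Rule 1 pads a to (1, 3); Rule 2 stretches it to (3, 3);
# the last dimensions are then 2 and 3, so Rule 3 raises an error
N = np.ones((3, 2))
try:
    N + a
    failed = False
except ValueError as err:
    failed = True
    print("broadcast error:", err)

# Reshaping a to (3, 1) lets Rule 2 stretch it to (3, 2)
fixed = N + a[:, np.newaxis]
print(fixed.shape)          # (3, 2)
```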

**Broadcasting in Practice**
- Centering an array
- Plotting a two-dimensional function
    - If we want to define a function z = f(x,y), broadcasting can be used to compute the function across the grid. 

In [None]:
# Centering an array

X = np.random.random((10, 3))

# compute the mean across the first dimension
Xmean = X.mean(0)

# center X by subtracting the mean
X_centered = X - Xmean

# to check that we have done this correctly, we can check that the
# centered array has near zero mean 
X_centered.mean(0)

In [None]:
# Plotting a two-dimensional function

# x and y have 50 steps from 0 to 5
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)[:, np.newaxis]

z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.imshow(z,origin = 'lower',
           extent = [0,5,0,5],
           cmap = 'viridis')
plt.colorbar();

**Comparisons, Masks, and Boolean Logic**
Masking comes up when you want to extract, modify, count, or otherwise manipulate values in an array based on some criterion. For example, you might wish to count all values greater than a certain value, or perhaps remove all outliers that are above some threshold. In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks. 

In [None]:
import numpy as np
import pandas as pd

# use pandas to extract rainfall inches as a NumPy array
rainfall = pd.read_csv('data/Seattle2014.csv')['PRCP'].values
print(rainfall[0:10])

inches = rainfall/254.0
print(inches[0:10])

inches.shape

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

In [None]:
plt.hist(inches,40);

This histogram gives us a general idea of what the data looks like. But this doesn't do a good job of conveying some other information: for example, how many rainy days were there last year? What is the average precipitation on those rainy days? How many rainy days were there with more than 1/2 inch of rain?

One approach to this would be to answer these questions by hand: loop through the data, incrementing a counter each time we see values in some desired range. This approach is very inefficient! We saw that NumPy's ufuncs can be used in place of loops to do fast element-wise arithmetic operations on arrays; in the same way, we can use other ufuncs to do element-wise comparisons over arrays. 

**Comparison Operations as ufuncs**
NumPy implements comparison operators such as `<` and `>` as element-wise ufuncs. The result of these comparison operators is always an array with a Boolean data type.

All six of the standard comparison operations are available: 
- `<`
- `>`
- `<=`
- `>=`
- `!=`
- `==`

| **Operator** | **Equivalent ufunc** |
| ------------- |-------------|
| `==` | `np.equal`|
|`<` | `np.less`|
|`>` | `np.greater`|
|`!=` | `np.not_equal`|
|`<=` | `np.less_equal`|
|`>=` | `np.greater_equal`|

In [None]:
x = np.array([1, 2, 3, 4, 5])
x < 3

It is also possible to do an element-wise comparison of two arrays, and to include compound expressions: 

In [None]:
(2*x) == (x**2)

These will work on arrays of any size and shape. 

In [None]:
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x

In [None]:
x <6

**Working with Boolean Arrays**
Given a Boolean array, there are a host of useful operations you can do. 

**Counting entries**
To count the number of `True` entries in a Boolean array, `np.count_nonzero` is useful. 

In [None]:
print(x)

# how many values are less than 6?
np.count_nonzero(x<6)

Another way to get at this information is to use `np.sum`. In this case, `False` is interpreted as `0` and `True` is interpreted as `1`. 

In [None]:
np.sum(x<6)

The benefit of `sum()` is that this summation can be done along rows or columns as well:

In [None]:
# how many values less than 6 in each row?
np.sum(x<6,axis=1)

If we're interested in quickly checking whether any or all the values are true, we can use `np.any` or `np.all`. These two can be used along particular axes as well. 

In [None]:
# are there any values greater than 8?
np.any(x>8)

# are all values less than 10?
np.all(x<10)

In [None]:
np.all(x<8,axis=1)

**Boolean operators**
What if we want to know about all days with more than 0.5 inches and less than 1 inch of rain? This is accomplished through Python's *bitwise logic operators* `&`, `|`, `^`, and `~`.

In [None]:
np.sum((inches > 0.5) & (inches < 1))

In [None]:
print("Number days without rain:      ", np.sum(inches == 0))
print("Number days with rain:         ", np.sum(inches != 0))
print("Days with more than 0.5 inches:", np.sum(inches > 0.5))
print("Rainy days with < 0.2 inches  :", np.sum((inches > 0) &
                                                (inches < 0.2)))

Suppose we want an array of all values that are less than 5. 

In [None]:
print(x)

In [None]:
x<5

Now to select these values from the array, we can simply index on this Boolean array; this is known as a *masking operation*. 

In [None]:
x[x<5]

What is returned is a one-dimensional array filled with all the values that meet this condition. In other words, all the values in positions at which the mask array is `True`. 

**Aside: using the keywords and/or versus the operators &/|**
`and` and `or` gauge the truth or falsehood of the entire object, while `&` and `|` refer to bits within each object. 

When you use `and` or `or` it's equivalent to asking Python to treat the object as a single Boolean entity. When you use `&` and `|` on integers, the expression operates on the bits of the element, applying the *and* or the *or* to the individual bits making up the number. 

When you have an array of Boolean values in NumPy, this can be thought of as a string of bits where `1 = True` and `0 = False`.
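
The difference between the keywords and the operators can be seen on plain integers. In this sketch the numbers 42 and 59 are arbitrary choices; `and`/`or` treat each number as a single truth value, while `&`/`|` combine their individual bits.

```python
# Keywords: each operand is judged as a whole (nonzero means True)
print(bool(42 and 0))   # False
print(bool(42 or 0))    # True

# Operators: applied bit by bit
print(bin(42))          # 0b101010
print(bin(59))          # 0b111011

and_bits = 42 & 59
or_bits = 42 | 59
print(bin(and_bits))    # 0b101010
print(bin(or_bits))     # 0b111011
```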

In [None]:
A = np.array([1,0,1,0,1,0],dtype=bool)
B = np.array([1,1,1,0,1,1],dtype=bool)
A|B

Using `or` on these arrays will try to evaluate the truth or falsehood of the entire array object, which is not a well-defined value. 

In [None]:
# returns an error
A or B

Similarly, when doing a Boolean expression on a given array, you should use `|` or `&` rather than `or` or `and`. 

In [None]:
x = np.arange(10)
(x > 4) & (x < 8)

Trying to evaluate the truth or falsehood on the entire array will give the same error we saw before.

In [None]:
# returns an error
(x > 4) and (x < 8)

**Fancy Indexing**
In the previous sections, we saw how to access and modify portions of arrays using simple indices (e.g., `arr[0]`), slices (e.g., `arr[:5]`), and Boolean masks (e.g., `arr[arr > 0]`). Fancy indexing is like simple indexing, but we pass arrays of indices instead of single scalars. This allows us to very quickly access and modify complicated subsets of an array's values.

In [None]:
import numpy as np
rand = np.random.RandomState(42)

x = rand.randint(100,size=10)
print(x)

Suppose we want to access three different elements. We could do it like this: 

In [None]:
[x[3],x[7],x[2]]

Alternatively, we can pass a single list or array of indices.

In [None]:
ind = [3,7,2]

x[ind]

When using fancy indexing, the shape of the result reflects the shape of the *index arrays* rather than the shape of the array being indexed. 

In [None]:
ind = np.array([[3,7],
                [4,5]])

x[ind]

Fancy indexing also works in multiple dimensions. 

In [None]:
X = np.arange(12).reshape((3,4))
X

In [None]:
row = np.array([0,1,2])
col = np.array([2,1,3])
X[row,col]
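
The pairing of row and column indices in fancy indexing follows the broadcasting rules, so the index arrays can have different shapes. A sketch reusing the same `X`, `row`, and `col` values: a column vector of rows combined with a row vector of columns yields a two-dimensional result.

```python
import numpy as np

X = np.arange(12).reshape((3, 4))
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])

# Equal-shaped index arrays are paired element by element
paired = X[row, col]
print(paired)   # [ 2  5 11]

# A (3, 1) row index broadcast against a (3,) column index gives a (3, 3) result
grid = X[row[:, np.newaxis], col]
print(grid)
```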

**Combined Indexing**
Fancy indexing can be combined with the other indexing schemes we've seen. 

In [None]:
print(X)

In [None]:
X[2,[2,0,1]]

In [None]:
X[1:,[2,0,1]]

**Examples: Selecting Random Points**
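One common use of fancy indexing is the selection of subsets of rows from a matrix. For example, we might have an N by D matrix representing N points in D dimensions, such as the following 100 points drawn from a two-dimensional normal distribution: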

In [None]:
mean = [0,0]
cov = [[1,2],
       [2,5]]
X = rand.multivariate_normal(mean,cov,100)
X.shape

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

plt.scatter(X[:,0],X[:,1]);

Let's use fancy indexing to select 20 random points. 

In [None]:
indices = np.random.choice(X.shape[0],20,replace=False)
print(indices)

In [None]:
selection = X[indices] # fancy indexing here
print(selection)
selection.shape

Now to see which points were selected, let's over-plot large circles at the locations of the selected points:

In [None]:
plt.scatter(X[:,0],X[:,1],alpha=0.3)
plt.scatter(selection[:,0],selection[:,1],
           facecolor='none',edgecolor='b',s=200);

**Modifying Values with Fancy Indexing**
Just as fancy indexing can be used to access parts of an array, it can also be used to modify parts of an array.

In [None]:
x = np.arange(10)
i = np.array([2,1,8,4])
x[i] = 99
print(x)

In [None]:
x[i]-=10
print(x)

Notice, though, that repeated indices with these operations can cause some potentially unexpected results. 

In [None]:
x = np.array([6,0,0,0,0,0,0,0,0,0])
i = [2,3,3,4,4,4]
x[i]+=1
print(x)

You might expect that `x[3]` would contain the value 2 and `x[4]` would contain the value 3, as this is how many times each index is repeated. Why is this not the case? This is because `x[i] += 1` is meant as a shorthand for `x[i] = x[i] + 1`. `x[i] + 1` is evaluated, and then the result is assigned to the indices in `x`. With this in mind, it is not the augmentation that happens multiple times, but the assignment, which leads to the rather nonintuitive results.

So if you want to repeat an operation at a certain index, you can use the `at()` method of ufuncs. Another method that is similar in spirit is the `reduceat()` method of ufuncs. 

In [None]:
i = [2,3,3,4,4,4]
x = np.zeros(10)
np.add.at(x, i, 1)
print(x)
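
The `reduceat()` method mentioned above applies a reduction over consecutive segments of an array, where each index marks the start of a segment. A minimal sketch on a made-up array:

```python
import numpy as np

x = np.arange(8)

# Sum x[0:4] and x[4:]
two_sums = np.add.reduceat(x, [0, 4])
print(two_sums)     # [ 6 22]

# Sum x[0:2], x[2:5], and x[5:]
three_sums = np.add.reduceat(x, [0, 2, 5])
print(three_sums)   # [ 1  9 18]
```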

**Example: Binning Data**

In [None]:
np.random.seed(42)
x = np.random.randn(100)

# compute a histogram by hand
bins = np.linspace(-5,5,20)
counts = np.zeros_like(bins)

# find the appropriate bin for each x
i = np.searchsorted(bins,x)

# add 1 to each of these bins
np.add.at(counts,i,1)

# plot the results
plt.plot(bins, counts, drawstyle='steps');

Interesting point: algorithmic efficiency is almost never a simple question. An algorithm efficient for large datasets will not always be the best choice for small datasets, and vice versa.

**Sorting Arrays**
There are many different sorting algorithms: *insertion sorts*, *selection sorts*, *merge sorts*, *quick sorts*, *bubble sorts*, etc. 

A *selection sort* repeatedly finds the minimum value from a list, and makes swaps until the list is sorted. 

In [None]:
import numpy as np

def selection_sort(x):
    for i in range(len(x)):
        swap = i + np.argmin(x[i:])
        (x[i],x[swap]) = (x[swap],x[i])
    return x

In [None]:
x = np.array([2,1,4,3,5])
selection_sort(x)

The selection sort is useful for its simplicity, but it is much too slow to be useful for larger arrays. Python contains built-in algorithms that are much more efficient than the selection sort. 

**Fast Sorting in NumPy: `np.sort` and `np.argsort`**
Although Python has built-in `sort` and `sorted` functions to work with lists, we won't discuss them here because NumPy's `np.sort` function turns out to be much more efficient and useful for our purposes. By default `np.sort` uses a *quicksort* algorithm, though *mergesort* and *heapsort* are also available. For most applications, the default quicksort is more than sufficient.

To return a sorted version of the array without modifying the input, you can use `np.sort`.

In [None]:
x = np.array([2, 1, 4, 3, 5])
np.sort(x)

If you prefer to sort the array in-place, you can instead use the `sort` method. 

In [None]:
x = np.array([2, 1, 4, 3, 5])
x.sort()
print(x)

A related function is `argsort` which instead returns the *indices* of the sorted elements:

In [None]:
x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x)
print(i)

**Sorting along rows or columns**
A useful feature of NumPy's sorting algorithms is the ability to sort along specific rows or columns of a multidimensional array using the `axis` argument. 

In [None]:
rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
print(X)

In [None]:
# sort each column of X
np.sort(X, axis=0)

In [None]:
# sort each row of X
np.sort(X, axis=1)

Keep in mind that this treats each row or column as an independent array, and any relationships between the row or column values will be lost!
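To make the warning concrete, here is a minimal sketch using a hypothetical table named `records`, where each row is an (id, score) pair invented for this example. Sorting along `axis=0` scrambles the pairs, while reordering whole rows with `np.argsort` keeps them together:

```python
import numpy as np

# Hypothetical table: each row is one record (id, score)
records = np.array([[3, 70],
                    [1, 90],
                    [2, 80]])

# Sorting along axis=0 sorts each column separately, so ids and
# scores no longer line up with their original rows
independent = np.sort(records, axis=0)
print(independent)

# Reordering whole rows by the first column keeps each record intact
order = np.argsort(records[:, 0])
by_id = records[order]
print(by_id)
```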

**Partial Sorts: Partitioning**
Sometimes we're not interested in sorting the entire array, but simply want to find the *k* smallest values in the array. 

`np.partition` takes an array and a number *k*; the result is a new array with the smallest *k* values to the left of the partition, and the remaining values to the right. Note that within the two partitions, the elements have an arbitrary order.

`np.argpartition` computes the indices of the partition.

In [None]:
x = np.array([7,2,3,1,6,5,4])
np.partition(x,3)

In [None]:
X = rand.randint(0, 10, (4, 6))
np.partition(X,2,axis=1)
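`np.argpartition` can be used the same way when the positions matter more than the values. The sketch below reuses the one-dimensional example; the names `smallest` and `rest` are introduced here only for illustration:

```python
import numpy as np

x = np.array([7, 2, 3, 1, 6, 5, 4])
k = 3

# Indices that would partition x; order within each side is arbitrary
idx = np.argpartition(x, k)

smallest = x[idx[:k]]   # the k smallest values, in no particular order
rest = x[idx[k:]]       # everything else

print(sorted(smallest.tolist()))      # [1, 2, 3]
print(smallest.max() <= rest.min())   # True
```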

**Example: k-Nearest Neighbors**
We can use the `argsort` function along multiple axes to find the nearest neighbors of each point in a set. 

In [None]:
X = rand.rand(10,2)
print(X)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # Plot styling
plt.scatter(X[:, 0], X[:, 1], s=100);

Now we'll compute the distance between each pair of points.

In [None]:
# for each pair of points, compute differences in their coordinates
differences = X[:, np.newaxis, :] - X[np.newaxis, :, :]

# square the coordinate differences
sq_differences = differences ** 2

# sum the coordinate differences to get the squared distance
dist_sq = sq_differences.sum(-1)
print(dist_sq)

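With the pairwise squared distances computed, we can now use `np.argsort` to sort along each row. The leftmost columns will then give the indices of the nearest neighbors. Note that the first column holds each point's own index, since every point is at distance zero from itself.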

In [None]:
nearest = np.argsort(dist_sq, axis=1)
print(nearest)

By using a full sort here, we've actually done more work than we need to. If we're simply interested in the nearest *k* neighbors, all we need is to partition each row so that the smallest *k* + 1 squared distances come first, with the larger distances filling the remaining positions of the array. We can do this with the `np.argpartition` function.

In [None]:
K = 2
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)
print(nearest_partition)
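As a sanity check, the sketch below rebuilds the same kind of data with a fresh random state (so the points differ from those printed earlier) and confirms that the first *K* + 1 columns of the partition contain the same indices as the full sort, ignoring their order. The name `candidates` is introduced here only for illustration:

```python
import numpy as np

rand = np.random.RandomState(42)
X = rand.rand(10, 2)
K = 2

# Squared pairwise distances between all points
dist_sq = ((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2).sum(-1)

# The first K + 1 columns hold each point itself plus its K nearest neighbors
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)
candidates = nearest_partition[:, :K + 1]

# Compare with the full sort, ignoring order within each row
nearest = np.argsort(dist_sq, axis=1)[:, :K + 1]
same = np.array_equal(np.sort(candidates, axis=1), np.sort(nearest, axis=1))
print(same)
```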

**Aside: Big-O Notation**
Big-O notation is a means of describing how the number of operations required for an algorithm scales as the input grows in size. 

Big-O notation, in the loose sense, tells you how much time your algorithm will take as you increase the amount of data. If you have an O[N] (read "order N") algorithm that takes 1 second to operate on a list of length N=1,000, then you should expect it to take roughly 5 seconds for a list of length N=5,000. If you have an O[N^2] (read "order N squared") algorithm that takes 1 second for N=1,000, then you should expect it to take about 25 seconds for N=5,000.

Note that for small datasets in particular, the algorithm with better scaling might not be faster. For example, in a given problem an O[N^2] algorithm might take 0.01 seconds, while a "better" O[N] algorithm might take 1 second.
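The arithmetic behind the 5-second and 25-second estimates can be written out as a short sketch. The helper `predicted_time` is hypothetical, introduced only for this example, and it assumes the running time is dominated by a single power of N:

```python
def predicted_time(t_ref, n_ref, n_new, power):
    """Scale a reference timing, assuming cost grows as N ** power."""
    return t_ref * (n_new / n_ref) ** power

# O[N]: five times the data takes five times as long
print(predicted_time(1.0, 1000, 5000, power=1))   # 5.0

# O[N^2]: five times the data takes 5 ** 2 = 25 times as long
print(predicted_time(1.0, 1000, 5000, power=2))   # 25.0
```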

**Structured Data: NumPy's Structured Arrays**
While often our data can be represented by a homogeneous array of values, sometimes this is not the case. NumPy's *structured arrays* and *record arrays* provide efficient storage for compound, heterogeneous data. While the patterns shown here are useful for simple operations, scenarios like this often lend themselves to the use of Pandas dataframes.

Imagine that we have several categories of data on a number of people and we'd like to store these values. It would be possible to store these in three separate arrays. 

In [None]:
import numpy as np

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

But there is nothing here that tells us that the three arrays are related; it would be more natural if we could use a single structure to store all this data. 

NumPy can handle this through structured arrays, which are arrays with compound data types.

In [None]:
# Use a compound data type for structured arrays
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                          'formats':('U10', 'i4', 'f8')})
print(data.dtype)
print(data)

`U10` translates to unicode string of maximum length 10. 
`i4` translates to 4-byte (i.e., 32 bit) integer. 
`f8` translates to 8-byte (i.e., 64 bit) float. 
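These sizes can be checked directly on the dtype. The following sketch assumes NumPy's default packed layout (no alignment padding) and its 4-bytes-per-character storage for unicode strings:

```python
import numpy as np

data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})

print(data.dtype.names)       # ('name', 'age', 'weight')
print(data.dtype['age'])      # int32
print(data.dtype['weight'])   # float64

# One record: 10 characters * 4 bytes + 4 bytes + 8 bytes
print(data.dtype.itemsize)    # 52
```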

In these shortened type codes, the first character gives the kind of data and the optional number after it gives the size in bytes:

| **Character** | **Description** |**Example**|
| ------------- |-------------|-------------|
|`b` | Byte | `np.dtype('b')`|
|`i` | Signed integer | `np.dtype('i4')==np.int32`|
|`u` | Unsigned integer | `np.dtype('u1')==np.uint8`|
|`f` | Floating point | `np.dtype('f8')==np.float64`|
|`c` | Complex floating point | `np.dtype('c16')==np.complex128`|
|`S`,`a` | String | `np.dtype('S5')`|
|`U` | Unicode string | `np.dtype('U')==np.str_`|
|`V` | Raw data (void) | `np.dtype('V')==np.void`|

Now that we have created an empty container array, we can fill the array with our lists of values.

In [None]:
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
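As an alternative sketch, the same array can be built in one step from a list of tuples, with the compound type written as a list of (field name, type code) pairs. The names `person_dtype` and `people` are introduced here only for illustration:

```python
import numpy as np

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

# Same compound type, written as a list of (field name, type code) pairs
person_dtype = [('name', 'U10'), ('age', 'i4'), ('weight', 'f8')]

# zip groups the i-th entries of each list into one tuple per person
people = np.array(list(zip(name, age, weight)), dtype=person_dtype)
print(people)
```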

The handy thing with structured arrays is that you can refer to values either by index or by name. 

In [None]:
# Get all names
data['name']

In [None]:
# Get the first row of data
data[0]

In [None]:
# Get the name from the last row
data[-1]['name']

Using Boolean masking, this even allows you to do some more sophisticated operations such as filtering on age. 

In [None]:
# Get names where age is under 30
data[data['age'] < 30]['name']
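Masks on different fields can also be combined, and whole records can be sorted by a field with the `order` argument of `np.sort`. The sketch below rebuilds the array so that it runs on its own; the thresholds are arbitrary and chosen only for this example:

```python
import numpy as np

data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
data['age'] = [25, 45, 37, 19]
data['weight'] = [55.0, 85.5, 68.0, 61.5]

# Combine conditions with & (each condition needs its own parentheses)
mask = (data['age'] < 40) & (data['weight'] > 60)
print(data[mask]['name'])    # ['Cathy' 'Doug']

# Sort whole records by one field
by_age = np.sort(data, order='age')
print(by_age['name'])        # ['Doug' 'Alice' 'Cathy' 'Bob']
```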

**RecordArrays: Structured Arrays with a Twist**
NumPy provides the `np.recarray` class, which is almost identical to the structured arrays just described, but with one additional feature: fields can be accessed as attributes rather than as dictionary keys.

Recall that we previously accessed the ages by writing:

In [None]:
data['age']

If we view our data as a record array instead, we can access this with slightly fewer keystrokes. 

In [None]:
data_rec = data.view(np.recarray)
data_rec.age
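The record array created this way is a view, not a copy. The following sketch rebuilds a small version of the array so that it runs on its own, then shows that both access styles still work and that a change made through the view is visible in the original:

```python
import numpy as np

data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['age'] = [25, 45, 37, 19]

data_rec = data.view(np.recarray)

# Attribute access and key access return the same values
print(np.array_equal(data_rec.age, data_rec['age']))   # True

# The view shares memory with the original array, so a change made
# through one is visible through the other
data_rec.age[0] = 26
print(data['age'][0])   # 26
```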

Structured arrays like the ones discussed here are good to know about for certain situations, especially in case you're using NumPy arrays to map onto binary data formats in C, Fortran, or another language. For day-to-day use of structured data, the Pandas package is a much better choice.