In [1]:
'''
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
'''


# Numpy


Datasets can include collections of documents, images, sound clips, numerical measurements, or really anything else. Despite this heterogeneity, it helps to think of all data fundamentally as arrays of numbers.

| Data type	    | Arrays of Numbers? |
|---------------|-------------|
|Images | Pixel brightness across different channels|
|Videos | Pixel brightness across different channels for each frame |
|Sound | Intensity over time |
|Numbers | No need for transformation |
|Tables | Mapping from strings to numbers |


Therefore, the efficient storage and manipulation of large arrays of numbers is fundamental to data science. NumPy and pandas are the libraries in the SciPy stack that specialize in handling numerical arrays and data tables.

[NumPy](http://www.numpy.org/) is short for _numerical Python_. It provides functions that are especially useful when you work with large arrays and matrices of numeric data, such as matrix multiplication.

The array class (`ndarray`) is the foundation of NumPy. NumPy arrays are like Python lists, except that every element of an array must have the same type, such as int or float. As a result, arrays provide much more efficient storage and data operations, especially as they grow larger. In most other ways NumPy arrays behave like Python's built-in list type; the main difference in behavior is vectorization, covered in the Vectorization section.
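As a small illustration of the single-type rule (the variable names are only for this sketch), mixing integers and floats in the input produces an array in which everything is stored as a float:

```python
import numpy as np

# A list can mix types freely; an array stores one dtype for every element.
mixed_list = [1, 2.5, 3]
arr = np.array(mixed_list)   # the integers are upcast to float

print(arr.dtype)             # float64

# Requesting a dtype explicitly converts every element to it.
ints = np.array([1, 2, 3], dtype=np.int16)
print(ints.dtype)            # int16
```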

### Creating arrays

In [109]:
import numpy as np

# Create array from lists:
ls = [[1,2,3,4,5],[6,7,8,9,10]]
ary = np.array(ls)
print(ary, type(ary))

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]] <class 'numpy.ndarray'>


In [3]:
#array1 = np.array([2,3,4,5])
#print(type(array1))

### Using array-generating functions

For larger arrays it is impractical to initialize the data manually using explicit Python lists. Instead we can use one of the many NumPy functions that generate arrays of different forms. Some of the more common ones are:


### zeros and ones

In [4]:
# We use these when the elements of the
# array are originally unknown but its size is known.

#np.zeros((3,4))

In [5]:
#np.ones((2,3,4), dtype = np.int16)

In [6]:
#np.ones((3,4,5))

In [7]:
# Return a new array of given shape and type, without initializing entries.
#np.empty( (2,3) )
#np.empty( (2,3), dtype=int )

In [8]:
# Create a 3x5 array filled with 3.14
#np.full((3, 5), 3.14)
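
A runnable sketch of the constant-filled constructors above, checking the shape and dtype each call produces (the variable names are illustrative):

```python
import numpy as np

zeros = np.zeros((3, 4))                 # 3x4 array of 0.0 (float64 by default)
ones = np.ones((2, 3), dtype=np.int16)   # 2x3 array of 16-bit integer ones
filled = np.full((3, 5), 3.14)           # 3x5 array where every entry is 3.14

print(zeros.shape, zeros.dtype)
print(ones.shape, ones.dtype)
print(filled.shape, filled[0, 0])
```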

### arange

In [9]:
# Large operations work too, and quickly
#np.arange(10000)

In [10]:
# prints the corners, mainly
#np.arange(100).reshape(10,10)

In [11]:
#new_array=np.arange(60).reshape(4,15)


In [12]:
#new_array.reshape(20,3)#

In [13]:
#new_array=np.arange(3,7,2)
#print(new_array)

### random data

In [14]:
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
# np.random.random((3, 3))

In [15]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
#np.random.normal(0, 1, (3, 3))

In [16]:
#array2 = np.random.normal(0,1,(4,5))
#print(array2)
#np.mean(array2)


In [17]:
# Create a 3x3 array of random integers in the interval [0, 10)
#np.random.randint(0, 10, (3, 3))
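
The cells above use the `np.random.*` functions directly. As an alternative sketch, assuming NumPy 1.17 or later, the same three kinds of arrays can be drawn from a seeded `Generator`, which makes the results reproducible. The seed value 1234 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1234)        # seeded generator for reproducibility

uniform = rng.random((3, 3))             # uniform values in [0, 1)
normal = rng.normal(0, 1, (3, 3))        # mean 0, standard deviation 1
integers = rng.integers(0, 10, (3, 3))   # integers in [0, 10)

# Re-creating the generator with the same seed repeats the same draws.
again = np.random.default_rng(1234).random((3, 3))
print(np.array_equal(uniform, again))    # True
```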

In [18]:
# Create a 3x3 identity matrix
#np.eye(3)

In [19]:
# Create an uninitialized array of three integers
# The values will be whatever happens to already exist at that memory location
#np.empty(3)

### linspace, logspace

In [20]:
# Make several equally spaced points in linear space
# linspace(start, stop, num): num is the number of points, not the step size
#np.linspace(0,np.pi,5)

In [21]:
## create 125 evenly spaced numbers from 0 to 0.5

#np.linspace(0, 0.5, 125)

In [22]:
# Return numbers spaced evenly on a log scale.
#np.logspace(0, 10, 10, base=np.e)
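
To make the difference between the range-style constructors concrete, here is a short sketch with small, hand-checkable inputs (chosen only for illustration):

```python
import numpy as np

# arange: half-open interval [start, stop) with a fixed step
a = np.arange(0, 1, 0.25)    # 0.0, 0.25, 0.5, 0.75

# linspace: a fixed number of points, endpoint included by default
b = np.linspace(0, 1, 5)     # 0.0, 0.25, 0.5, 0.75, 1.0

# logspace: points evenly spaced in the exponent, here 10**0 to 10**2
c = np.logspace(0, 2, 3)     # 1.0, 10.0, 100.0

print(a, b, c)
```

Note that `arange` takes a step and excludes the stop value, while `linspace` takes a count and includes it.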

### diag

In [23]:
# a diagonal matrix
#np.diag([1,2,3])

In [24]:
# diagonal with offset from the main diagonal
#np.diag([1,2,3], k=1)

### Vectorization

In [25]:
#lis = [1,2,3,4,5]

In [26]:
#lis + lis

In [27]:
# See the difference???

# np_array = np.array(lis)
# np_array
# np_array + np_array
# np_array * np_array
# np.sum(np_array)
# np.mean(np_array)


In [28]:
#np_array + np_array


In [29]:
# Doing the same using normal lists requires a loop!
#print([x+x for x in lis])
#print([x**2 for x in lis])

Operations like these on NumPy arrays are called **vectorized**: they apply to the whole array at once. Almost all data-intensive computing in Python uses NumPy because of this feature, and because the whole scientific and numerical Python stack is built on NumPy.

To explain it another way, in a spreadsheet you would add an entire column to another one by writing a formula in the first cell and autofilling the rest of the column. NumPy lets you do the same thing in a single expression.
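
The contrast between lists and arrays can be run directly. This sketch uses the same `lis` and `np_array` names as the cells above:

```python
import numpy as np

lis = [1, 2, 3, 4, 5]
np_array = np.array(lis)

# "+" on lists concatenates; on arrays it adds element by element.
print(lis + lis)             # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
print(np_array + np_array)   # [ 2  4  6  8 10]
print(np_array * np_array)   # [ 1  4  9 16 25]

# The list equivalent needs an explicit loop or comprehension.
doubled = [x + x for x in lis]
print(doubled)               # [2, 4, 6, 8, 10]
```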





In [30]:

# array = np.array([1, 4, 5, 8], float)
# print(array)
# print("")
# array = np.array([[1, 2, 3], [4, 5, 6]], float)  # a 2D array/Matrix
# print(array)


NumPy implements most of its core functionality in _compiled_ C code, which is much faster than the equivalent Python loop. This is only possible because all of the items in a NumPy array have the same data type. (Python is dynamically typed whereas C is statically typed; this gives Python extra flexibility and simplicity, but makes it slower.)

In [31]:

# big_array = np.random.rand(1000000)
# %timeit sum(big_array)
# %timeit np.sum(big_array)
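
The `%timeit` magic only works inside IPython or Jupyter. As a rough sketch that also runs as a plain script, the standard-library `timeit` module can make the same comparison. The array size and repetition count here are arbitrary choices, and the timings will vary by machine.

```python
import timeit
import numpy as np

big_array = np.random.rand(100_000)

# Built-in sum walks the array one Python object at a time;
# np.sum runs the loop in compiled code.
t_builtin = timeit.timeit(lambda: sum(big_array), number=5)
t_numpy = timeit.timeit(lambda: np.sum(big_array), number=5)

print(f"built-in sum: {t_builtin:.4f} s, np.sum: {t_numpy:.4f} s")
```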


You can index, slice, and iterate over a NumPy ***array*** much like you would a Python list.

Python has certain common ways of doing things. Let's call one of them "listiness": it applies to lists, dictionaries, files, and anything else that can produce an iterator.

Lists and arrays both behave in this list-like way because they both support **the iterator protocol**.
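
A minimal sketch of what this means in practice: a function written against plain iteration works on a list and on an array alike. The `total` helper is hypothetical and exists only for this example.

```python
import numpy as np

def total(iterable):
    """Sum anything that supports iteration (hypothetical helper)."""
    result = 0
    for item in iterable:
        result += item
    return result

print(total([1, 2, 3]))             # 6
print(total(np.array([1, 2, 3])))   # 6

# Iterating over a 2-D array yields its rows one at a time.
rows = [row.tolist() for row in np.array([[1, 2], [3, 4]])]
print(rows)                         # [[1, 2], [3, 4]]
```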

## Broadcasting

Broadcasting is simply a set of rules for applying binary ufuncs (e.g., addition, subtraction, multiplication, etc.) on arrays of different sizes.

In [32]:

# M = np.ones((3, 3))
# M


In [33]:
#M + 5

In [34]:
'''
a = np.arange(3)
b = np.arange(3)[:, np.newaxis]

print(a)
print(b)
'''


In [35]:
#a + b

## Rules of Broadcasting

Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays:

- **Rule 1:** If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is *padded* with ones on its leading (left) side.
- **Rule 2:** If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
- **Rule 3:** If in any dimension the sizes disagree and neither is equal to 1, an error is raised.

To make these rules clear, let's consider a few examples in detail.

In [36]:
# Rule one
# M = np.ones((2, 3))
# a = np.arange(3)
# M + a

In [37]:
# Rule two
#a = np.arange(3).reshape((3, 1))
#b = np.arange(3)
#print(a,b)

In [38]:
#a + b

In [39]:
# Rule three
#M = np.ones((3, 2))
#a = np.arange(3)
#M + a

In [40]:
# To get over the problem:
#a[:, np.newaxis].shape

In [41]:
#M + a[:, np.newaxis]

In [42]:
#np.logaddexp(M, a[:, np.newaxis])
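
The three rules can be checked in one self-contained sketch by looking at the result shapes. The names `M2` and `col` are introduced here only to keep the cases apart.

```python
import numpy as np

M = np.ones((2, 3))
a = np.arange(3)                 # shape (3,)

# Rule 1: a is treated as shape (1, 3); Rule 2: it is stretched to (2, 3).
print((M + a).shape)             # (2, 3)

# Rule 2 on both operands: (3, 1) and (3,) give (3, 3).
col = np.arange(3).reshape(3, 1)
print((col + a).shape)           # (3, 3)

# Rule 3: (3, 2) and (3,) disagree in the last dimension (2 vs 3).
M2 = np.ones((3, 2))
try:
    M2 + a
    failed = False
except ValueError as err:
    failed = True
    print("broadcast failed:", err)

# Adding an axis turns a into a column of shape (3, 1), which is compatible.
print((M2 + a[:, np.newaxis]).shape)   # (3, 2)
```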

## Manipulating arrays

### Indexing
We can index elements in an array using square brackets and indices:

In [43]:
# a vector: the argument to the array function is a Python list
#v = np.array([1,2,3,4])
#v[0]

In [44]:
#np.random.seed(1234)
#M = np.random.random([3,3])
#print(M)
# M is a matrix, or a 2 dimensional array, taking two indices
#M[1,1]

## Array Slicing: Accessing Subarrays

Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the *slice* notation, marked by the colon (``:``) character.
The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array ``x``, use this:
``` python
x[start:stop:step]
```
If any of these are unspecified, they default to the values ``start=0``, ``stop=``*``size of dimension``*, ``step=1``.
We'll take a look at accessing sub-arrays in one dimension and in multiple dimensions.

Source: _Python Data Science Handbook_
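
A short sketch of the `start:stop:step` defaults on a one-dimensional array (the array `x` is just `0..9` for easy checking):

```python
import numpy as np

x = np.arange(10)

print(x[2:7])     # [2 3 4 5 6]   start and stop given, step defaults to 1
print(x[:3])      # [0 1 2]       start defaults to 0
print(x[7:])      # [7 8 9]       stop defaults to the size of the dimension
print(x[::2])     # [0 2 4 6 8]   every second element
print(x[::-1])    # a negative step walks the array in reverse
```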

If we omit an index of a multidimensional array, it returns the whole row (or, in general, an (N-1)-dimensional array):

In [45]:
#M

In [46]:
#M[1]

The same thing can be achieved by using `:` in place of an index:

In [47]:
#M[1,:] #row 1

In [48]:
#M[:,1] #column 1

We can assign new values to elements in an array using indexing:

In [49]:
#M[0,0] = 1

In [50]:
#M

In [51]:
#show all elements of the 1st and 3rd columns

#M[:,0:3:2]

In [52]:
# also works for rows and columns
#M[1,:] = 0
#M[:,2] = -1

In [53]:
#M
# np.shape(M)

### Index Slicing
Index slicing is the technical name for the syntax M[lower:upper:step] to extract part of an array:

In [54]:
#A = np.array([1,2,3,4,5])
#A

In [55]:
#check for dimension: shape
#np.shape(A)

In [56]:
#A[1:3]

Array slices are mutable: if they are assigned a new value the original array from which the slice was extracted is modified:

In [57]:
#A[1:3] = [-2,-3]

#A
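
The reason the original array changes is that a basic slice is a view onto the same data, not a copy. A small sketch of the difference, with `view` and `independent` as illustrative names:

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5])

view = A[1:3]                  # a basic slice shares memory with A
view[:] = [-2, -3]
print(A)                       # [ 1 -2 -3  4  5]

independent = A[1:3].copy()    # copy() gives separate storage
independent[:] = 0
print(A)                       # still [ 1 -2 -3  4  5]
```

This differs from Python lists, where slicing always produces a new list.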

We can omit any of the three parameters in M[lower:upper:step]:

In [58]:
#A[::] # lower, upper, step all take the default values

In [59]:
#A[::2] # step is 2, lower and upper defaults to the beginning and end of the array

In [60]:
# first three elements
#A[:3]

In [61]:
# elements from index 3


Index slicing works exactly the same way for multidimensional arrays:


In [62]:
#import numpy as np
#A = np.array([[n+m*10 for n in range(5)] for m in range(5)])


#print(np.shape(A))
#print(A)

In [63]:
# a block from the original array
#A[1:4, 1:4]

In [64]:
# strides
#A[::2, ::2]

### Fancy indexing
Fancy indexing is the name for when an array or list is used in place of an index:

In [65]:
#row_indices = [1, 2, 3]
#A[row_indices]
#A[[1,2,3]]
#A[1:4,:]

In [66]:
#col_indices = [1, 2, -1] # remember, index -1 means the last element
#A[row_indices, col_indices]

#A[[1,2,3],[1,2,-1]]

In [67]:
#show 32, 23 and 14

#A[[3,2,1],[2,3,4]]
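
A runnable version of the fancy-indexing cells, rebuilding the same 5x5 array `A` so the selected values can be read off directly:

```python
import numpy as np

# A[m, n] == 10*m + n, so each value shows its own row and column.
A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])

# One index list selects whole rows.
print(A[[1, 2, 3]].shape)            # (3, 5)

# Two index lists are paired up: elements (3,2), (2,3) and (1,4).
picked = A[[3, 2, 1], [2, 3, 4]]
print(picked)                        # [32 23 14]
```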


We can also use index masks: if the index mask is a NumPy array of data type bool, then an element is selected (True) or not (False) depending on the value of the index mask at the position of each element:

In [68]:
#B = np.array([n for n in range(5)])
#B

In [69]:
#row_mask = np.array([True, False, True, False, False])
#B[row_mask]

This feature is very useful for conditionally selecting elements from an array, for example using comparison operators:

In [70]:
#x = np.arange(0, 10, 0.5)
#x

In [71]:
#mask = (x > 5) * (x < 7.5)

#mask

In [72]:
#x[mask]

#x[(x>5)*(x<7.5)]

In [73]:
#show x values greater than 8.0

#x[x>8]
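
The cells above combine two boolean masks with `*`, which works because multiplying booleans acts like a logical AND. A sketch of the same selection using the `&` operator, which states the intent more directly:

```python
import numpy as np

x = np.arange(0, 10, 0.5)

# Parentheses are required: & binds more tightly than the comparisons.
mask = (x > 5) & (x < 7.5)
selected = x[mask]
print(selected)              # values 5.5, 6.0, 6.5, 7.0

# The product form used above gives the same mask.
same = np.array_equal(mask, (x > 5) * (x < 7.5))
print(same)                  # True
```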

### Using arrays in conditions

When using arrays in conditions, for example ```if``` statements and other boolean expressions, one needs to use ```any``` or ```all```, which require that any or all elements in the array evaluate to ```True```:

In [74]:
#M = np.array([[ 1,  4],[ 9, 16]])
#M

In [75]:
#any
#if (M > 5).any():
#    print("at least one element in M is larger than 5")
#else:
#    print("no element in M is larger than 5")

In [76]:
#all
#if (M > 5).all():
#    print("all elements in M are larger than 5")
#else:
#    print("not all elements in M are larger than 5")
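
A short sketch of why `any` or `all` is needed: using a multi-element boolean array directly as a condition raises an error, because its truth value is ambiguous.

```python
import numpy as np

M = np.array([[1, 4], [9, 16]])

print((M > 5).any())   # True: 9 and 16 are larger than 5
print((M > 5).all())   # False: 1 and 4 are not

try:
    if M > 5:          # a 2x2 boolean array has no single truth value
        pass
    raised = False
except ValueError as err:
    raised = True
    print("ambiguous condition:", err)
```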

## Functions for extracting data from arrays and creating arrays

**where**

The index mask can be converted to position indices using the `where` function:

In [77]:
#mask = (x > 5) * (x < 7.5)

#indices = np.where(mask)

#indices


In [78]:
#x[indices] # this indexing is equivalent to the fancy indexing x[mask]
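
A runnable sketch of `np.where` on the same `x` as above. With one argument it returns a tuple holding one index array per dimension. The three-argument form, added here for comparison, picks values element by element.

```python
import numpy as np

x = np.arange(0, 10, 0.5)
mask = (x > 5) & (x < 7.5)

indices = np.where(mask)        # tuple with one index array per dimension
print(indices[0])               # [11 12 13 14]

# Indexing with the positions selects the same elements as the mask.
print(np.array_equal(x[indices], x[mask]))   # True

# Three-argument form: keep x where the mask holds, otherwise use 0.
clipped = np.where(mask, x, 0.0)
```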


## Linear algebra

Vectorizing code is the key to writing efficient numerical calculations with Python/NumPy. That means that as much of a program as possible should be formulated in terms of matrix and vector operations, like matrix-matrix multiplication.

### Scalar-array operations
We can use the usual arithmetic operators to multiply, add, subtract, and divide arrays with scalar numbers.

In [79]:
#v1 = np.arange(0, 5)
#v1

In [80]:
#v1 * 2

In [81]:
#v1 + 2

In [82]:
#A

In [83]:
#A * 2

### Element-wise array-array operations

When we add, subtract, multiply and divide arrays with each other, the default behaviour is element-wise operations:

In [84]:
#A * A # element-wise multiplication

In [85]:
#v1 * v1

If we multiply arrays with different but compatible shapes, broadcasting applies. Here each row of A is multiplied element-wise by v1:


In [86]:
#A.shape, v1.shape

In [87]:
#A * v1
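
A self-contained sketch of this row-wise product, rebuilding the 5x5 array `A` and the vector `v1` from the earlier cells:

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
v1 = np.arange(0, 5)

print(A.shape, v1.shape)     # (5, 5) (5,)

# v1 is broadcast across the rows: product[i, j] == A[i, j] * v1[j]
product = A * v1
print(product[1])            # [ 0 11 24 39 56]
```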

### Matrix algebra

What about matrix multiplication? There are two ways. We can either use the dot function, which applies a matrix-matrix, matrix-vector, or inner vector multiplication to its two arguments:

In [88]:
#np.dot(A,A)

In [89]:
#A1 = np.array([[1,2],[3,4]])
#np.shape(A1)

#B1 = np.array([[1, 2, 3],[4,5,6],[7,8,9]])
#np.shape(B1)

#np.dot(A1,B1) # raises ValueError: a (2, 2) and a (3, 3) matrix cannot be multiplied

In [90]:
#np.dot(A, v1)

In [91]:
#np.matmul(A,v1)

In [92]:
#np.dot(v1,v1)
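
A small worked sketch of these products on a 2x2 matrix, so the numbers can be verified by hand. It also shows the `@` operator, which is equivalent to `np.matmul` on plain arrays. The values of `A` and `v` are chosen only for this illustration.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
v = np.array([1, 1])

print(np.dot(A, A))   # matrix-matrix: [[7, 10], [15, 22]]
print(A @ A)          # same result with the @ operator
print(A @ v)          # matrix-vector: [3 7]
print(v @ v)          # inner product: 2

# Element-wise multiplication is a different operation.
print(A * A)          # [[1, 4], [9, 16]]
```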

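Alternatively, we can cast the array objects to the type matrix. This changes the behavior of the standard arithmetic operators +, -, * to use matrix algebra. (Note that the NumPy documentation no longer recommends `np.matrix` for new code; regular arrays with the `@` operator are preferred.)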

In [93]:
#A

In [94]:
#v1

In [95]:
#M = np.matrix(A)
#v = np.matrix(v1).T # make it a column vector

In [96]:
#M

In [97]:
#v

In [98]:
#M * M

In [99]:
#M * v

If we try to add, subtract or multiply objects with incompatible shapes we get an error:


In [100]:
#v = np.matrix([1,2,3,4,5,6])

In [101]:
#v.T

In [102]:
#np.shape(M), np.shape(v)

In [103]:
#M * v #error due to different dimension
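
The same kind of failure can be reproduced with plain arrays. In this sketch a 5x5 array is multiplied by a length-6 vector; the shapes are chosen to mirror the cells above, and the names are illustrative.

```python
import numpy as np

A = np.ones((5, 5))
v = np.arange(6)          # length 6 does not match the 5 columns of A

try:
    A @ v
    error_message = None
except ValueError as err:
    error_message = str(err)
    print("shape mismatch:", error_message)

# A vector of length 5 is compatible.
print((A @ np.arange(5)).shape)   # (5,)
```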

## NumPy Standard Data Types

NumPy arrays contain values of a single type, so it is important to have detailed knowledge of those types and their limitations.
Because NumPy is built in C, the types will be familiar to users of C, Fortran, and other related languages.

The standard NumPy data types are listed in the following table.
Note that when constructing an array, they can be specified using a string:

```python
np.zeros(10, dtype='int16')
```

Or using the associated NumPy object:

```python
np.zeros(10, dtype=np.int16)
```

| Data type	    | Description |
|---------------|-------------|
| ``bool_``     | Boolean (True or False) stored as a byte |
| ``int_``      | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``)|
| ``intc``      | Identical to C ``int`` (normally ``int32`` or ``int64``)|
| ``intp``      | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``)|
| ``int8``      | Byte (-128 to 127)|
| ``int16``     | Integer (-32768 to 32767)|
| ``int32``     | Integer (-2147483648 to 2147483647)|
| ``int64``     | Integer (-9223372036854775808 to 9223372036854775807)|
| ``uint8``     | Unsigned integer (0 to 255)|
| ``uint16``    | Unsigned integer (0 to 65535)|
| ``uint32``    | Unsigned integer (0 to 4294967295)|
| ``uint64``    | Unsigned integer (0 to 18446744073709551615)|
| ``float_``    | Shorthand for ``float64``.|
| ``float16``   | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa|
| ``float32``   | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa|
| ``float64``   | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa|
| ``complex_``  | Shorthand for ``complex128``.|
| ``complex64`` | Complex number, represented by two 32-bit floats|
| ``complex128``| Complex number, represented by two 64-bit floats|

More advanced type specification is possible, such as specifying big or little endian numbers; for more information, refer to the [NumPy documentation](http://numpy.org/).
NumPy also supports compound data types, which will be covered in [Structured Data: NumPy's Structured Arrays](02.09-Structured-Data-NumPy.ipynb).

Source: Jake VanderPlas's _Python Data Science Handbook_
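
To see why the limits in the table matter, here is a brief sketch using `np.iinfo` to query an integer type's range, followed by an addition that exceeds the `int8` range. The example values are arbitrary.

```python
import numpy as np

info = np.iinfo(np.int8)
print(info.min, info.max)        # -128 127

small = np.array([100, 120], dtype=np.int8)

# The result stays int8, so values above 127 wrap around.
wrapped = small + small
print(wrapped)                   # 200 and 240 do not fit in int8

# Converting to a wider type first avoids the overflow.
wide = small.astype(np.int16) + small
print(wide.dtype, wide)
```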

## Calculations with higher-dimensional data

When functions such as min, max, etc. are applied to a multidimensional array, it is sometimes useful to apply the calculation to the entire array, and sometimes only on a row or column basis. Using the axis argument we can specify how these functions should behave:

In [104]:
#m = np.random.rand(3,3)
#m

In [105]:
# global max
#m.max()

In [106]:
# max in each column, hint: axis = 0


In [107]:
# max in each row
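
One way to see what `axis` does is to use a small fixed array instead of random data, so each result can be checked by eye. The values below are made up for this sketch.

```python
import numpy as np

m = np.array([[1, 8, 3],
              [4, 5, 6],
              [7, 2, 9]])

print(m.max())          # 9: maximum over the whole array
print(m.max(axis=0))    # [7 8 9]: collapse the rows, one value per column
print(m.max(axis=1))    # [8 6 9]: collapse the columns, one value per row
```

The `axis` argument names the dimension that is collapsed, not the one that is kept.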


### Other aggregation functions

NumPy provides many other aggregation functions, but we won't discuss them in detail here.
Additionally, most aggregates have a ``NaN``-safe counterpart that computes the result while ignoring missing values, which are marked by the special IEEE floating-point ``NaN`` value (for a fuller discussion of missing data, see [Handling Missing Data](03.04-Missing-Values.ipynb)).
Some of these ``NaN``-safe functions were not added until NumPy 1.8, so they will not be available in older NumPy versions.

The following table provides a list of useful aggregation functions available in NumPy:

|Function Name      |   NaN-safe Version  | Description                                   |
|-------------------|---------------------|-----------------------------------------------|
| ``np.sum``        | ``np.nansum``       | Compute sum of elements                       |
| ``np.prod``       | ``np.nanprod``      | Compute product of elements                   |
| ``np.mean``       | ``np.nanmean``      | Compute mean of elements                      |
| ``np.std``        | ``np.nanstd``       | Compute standard deviation                    |
| ``np.var``        | ``np.nanvar``       | Compute variance                              |
| ``np.min``        | ``np.nanmin``       | Find minimum value                            |
| ``np.max``        | ``np.nanmax``       | Find maximum value                            |
| ``np.argmin``     | ``np.nanargmin``    | Find index of minimum value                   |
| ``np.argmax``     | ``np.nanargmax``    | Find index of maximum value                   |
| ``np.median``     | ``np.nanmedian``    | Compute median of elements                    |
| ``np.percentile`` | ``np.nanpercentile``| Compute rank-based statistics of elements     |
| ``np.any``        | N/A                 | Evaluate whether any elements are true        |
| ``np.all``        | N/A                 | Evaluate whether all elements are true        |

Source: Python Data Science Handbook
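
A minimal sketch of the difference between a regular aggregate and its `NaN`-safe counterpart, using a three-element array with one missing value:

```python
import numpy as np

data = np.array([1.0, np.nan, 3.0])

print(np.sum(data))       # nan: a single NaN propagates through the sum
print(np.nansum(data))    # 4.0: the NaN is ignored
print(np.nanmean(data))   # 2.0: mean of the two remaining values
print(np.nanmax(data))    # 3.0
```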

# Resources:  
- [numpy Quickstart Guide](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)  
- [Rahul Dave's CS109 lab1 content at Harvard](https://github.com/cs109/2015lab1)  
- [The Data Incubator](https://www.thedataincubator.com)  
- [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook)