# Chapter 4 - Data Manipulation I

## Numpy

So far, we have covered fundamental Python programming skills. This chapter and the next several discuss data manipulation with Python, using popular modules such as Numpy and Pandas.

Numpy is a core library for scientific computing with Python. It provides an N-dimensional array object and tools for integrating C/C++ and Fortran code.

### Installation

For Mac users, running _pip install numpy_ at the command prompt works just fine.

For Windows users, visit "https://www.lfd.uci.edu/~gohlke/pythonlibs/" and install the module manually from a wheel file.

1. Assuming you have 64-bit Python 3.5 on your computer, follow the steps below
  (changing "numpy‑1.14.0+mkl‑cp35‑cp35m‑win_amd64.whl" as appropriate).
  (cpXX marks your Python version; amd64 marks your Python bit version, 32 or 64.)

2. At the command prompt, type the following command:
  __pip install numpy‑1.14.0+mkl‑cp35‑cp35m‑win_amd64.whl__
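
Once either installation route has finished, a quick way to confirm that it worked is to import the module and run a tiny computation. This is a minimal sketch; the variable name `smoke` is just an illustration, and the version string printed depends on what you installed.

```python
import numpy as np

# The exact version string depends on your environment.
print(np.__version__)

# A tiny smoke test: build an array and add up its elements.
smoke = np.array([1, 2, 3])
print(smoke.sum())  # Prints "6"
```

If the import raises `ModuleNotFoundError`, the package was most likely installed for a different Python interpreter than the one you are running.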



### Import

In [2]:
import numpy as np # Now we can call numpy as np.

### Explore the package

In [2]:
np? # This shows documentation about this module

In [3]:
dir(np) # Lists the names (functions, classes, constants) that numpy defines

['ALLOW_THREADS',
 'AxisError',
 'BUFSIZE',
 'CLIP',
 'DataSource',
 'ERR_CALL',
 'ERR_DEFAULT',
 'ERR_IGNORE',
 'ERR_LOG',
 'ERR_PRINT',
 'ERR_RAISE',
 'ERR_WARN',
 'FLOATING_POINT_SUPPORT',
 'FPE_DIVIDEBYZERO',
 'FPE_INVALID',
 'FPE_OVERFLOW',
 'FPE_UNDERFLOW',
 'False_',
 'Inf',
 'Infinity',
 'MAXDIMS',
 'MAY_SHARE_BOUNDS',
 'MAY_SHARE_EXACT',
 'MachAr',
 'NAN',
 'NINF',
 'NZERO',
 'NaN',
 'PINF',
 'PZERO',
 'PackageLoader',
 'RAISE',
 'SHIFT_DIVIDEBYZERO',
 'SHIFT_INVALID',
 'SHIFT_OVERFLOW',
 'SHIFT_UNDERFLOW',
 'ScalarType',
 'Tester',
 'TooHardError',
 'True_',
 'UFUNC_BUFSIZE_DEFAULT',
 'UFUNC_PYVALS_NAME',
 'WRAP',
 '_NoValue',
 '__NUMPY_SETUP__',
 '__all__',
 '__builtins__',
 '__cached__',
 '__config__',
 '__doc__',
 '__file__',
 '__git_revision__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_distributor_init',
 '_globals',
 '_import_tools',
 '_mat',
 'abs',
 'absolute',
 'absolute_import',
 'add',
 'add_docstring',
 'add_newdoc',


### Array

Numpy arrays are similar to Python lists, but with important differences. A numpy array is __1. a grid of values, all of the same type__, and __2. indexed by a tuple of nonnegative integers__. The number of dimensions is the __rank__ of the array, while the __shape__ of an array is a tuple of integers giving the size of the array along each dimension.

Let's look at some examples for clarification.

In [4]:
list1 = [14,3,15,8,10,34,27]

a = np.array([1, 2, 3])   # Create a rank 1 array
print(type(a))            # Prints "<class 'numpy.ndarray'>"
print(a.shape)            # Prints "(3,)"
print(a[0], a[1], a[2])   # Prints "1 2 3"
a[0] = 5                  # Change an element of the array
print(a)                  # Prints "[5 2 3]"

<class 'numpy.ndarray'>
(3,)
1 2 3
[5 2 3]


In [7]:
b = np.array([[1,2,3],[4,5,6]])    # Create a rank 2 array
print(b.shape)                     # Prints "(2, 3)"
print(b[0, 0], b[0, 1], b[1, 0])   # Prints "1 2 4"
print(b[0][0], b[0][1], b[1][0])   # Prints "1 2 4"

(2, 3)
1 2 4
1 2 4


In [14]:
c = np.array([[1,2,3,4,5],[6,7,8,9,10],[11,12,13,14,15]]) # Create a rank 2 array with 3 rows and 5 columns
print(c.shape)                                            # Prints "(3, 5)"
print(c[1,4], c[2,4], c[0,3])                             # Prints "10 15 4"
print(c[1][4], c[2][4], c[0][3])                          # Prints "10 15 4"

(3, 5)
10 15 4
10 15 4
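
The rank and shape described earlier can be read directly from an array's attributes, which is a handy check when the nesting of brackets gets hard to count by eye. The sketch below uses two small illustrative arrays (`v` and `m` are names chosen here) and also shows what "all of the same type" means in practice: mixing integers and floats gives an array of floats.

```python
import numpy as np

v = np.array([1, 2, 3])               # rank 1
m = np.array([[1, 2, 3], [4, 5, 6]])  # rank 2

# ndim is the rank, shape is the size along each dimension,
# size is the total number of elements.
print(v.ndim, v.shape, v.size)  # Prints "1 (3,) 3"
print(m.ndim, m.shape, m.size)  # Prints "2 (2, 3) 6"

# All elements share one type: the integers are converted to floats.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)              # Prints "float64"
```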


In [27]:
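# The rows have different lengths, so numpy cannot build a regular grid.
# With dtype=object, d is a rank 1 array whose elements are Python lists.
# (Recent numpy versions raise an error for ragged input without dtype=object.)
d = np.array([[1,2,3],[4,5],[3,6,7,8]], dtype=object)
print(d.shape)                         # Prints "(3,)"
print(type(d))                         # Prints "<class 'numpy.ndarray'>"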

(3,)
<class 'numpy.ndarray'>


In [24]:
print(d[0][2], d[1][1], d[2][3])       # Prints "3 5 8"

3 5 8


In [26]:
print(d[1,1])               # Error: d has only one dimension, so two indices are too many

IndexError: too many indices for array
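
To see why the two-index form fails, it helps to inspect what a ragged array really is. The sketch below builds the same kind of array under the name `ragged` (an illustrative name) and passes `dtype=object` explicitly, which is an assumption made here so that the code also runs on recent Numpy versions.

```python
import numpy as np

ragged = np.array([[1, 2, 3], [4, 5], [3, 6, 7, 8]], dtype=object)

print(ragged.dtype)      # Prints "object"
print(ragged.ndim)       # Prints "1": there is only one array dimension
print(type(ragged[1]))   # Prints "<class 'list'>"

# Each element is an ordinary Python list, so the second index
# is plain list indexing rather than numpy indexing.
print(ragged[1][1])      # Prints "5"

lengths = [len(row) for row in ragged]
print(lengths)           # Prints "[3, 2, 4]"
```

Because such an array is just a container of lists, most of the fast numeric operations in this chapter do not apply to it; rows of equal length are usually the better choice.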

#### Creating New Arrays

In [43]:
x = np.zeros((2,2))   # Create an array of all zeros
print(x)              # Prints "[[0. 0.]
                      #          [0. 0.]]"
    
y = np.ones((1,2))    # Create an array of all ones
print(y)              # Prints "[[ 1.  1.]]"

z = np.full((2,2), 7)  # Create a constant array
print(z)               # Prints "[[7 7]
                       #          [7 7]]"

x1 = np.eye(2)         # Create a 2x2 identity matrix
print(x1)              # Prints "[[ 1.  0.]
                       #          [ 0.  1.]]"

y1 = np.random.random((2,2))  # Create an array filled with random values
print(y1)                     # Might print "[[ 0.91940167  0.08143941]
                              #               [ 0.68744134  0.87236687]]"


[[ 0.  0.]
 [ 0.  0.]]
[[ 1.  1.]]
[[7 7]
 [7 7]]
[[ 1.  0.]
 [ 0.  1.]]
[[ 0.41376108  0.20699168]
 [ 0.68312473  0.52910467]]
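
Notice in the output above that the array from `np.full` prints without decimal points while the others print as floats. A short sketch of why, assuming a reasonably recent Numpy: `np.zeros`, `np.ones` and `np.eye` default to floats, while `np.full` takes its type from the fill value. The names `ints`, `floats`, `first` and `second` are illustrative.

```python
import numpy as np

print(np.zeros((2, 2)).dtype)   # Prints "float64"
print(np.eye(2).dtype)          # Prints "float64"

ints = np.full((2, 2), 7)       # integer fill value gives an integer array
floats = np.full((2, 2), 7.0)   # float fill value gives a float array
print(np.issubdtype(ints.dtype, np.integer))  # Prints "True"
print(floats.dtype)                           # Prints "float64"

# Seeding the generator makes the "random" values repeatable.
np.random.seed(0)
first = np.random.random((2, 2))
np.random.seed(0)
second = np.random.random((2, 2))
print(np.array_equal(first, second))          # Prints "True"
```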


### Indexing Array

__Slicing__

In [31]:
# Create the following rank 2 array with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
#  [6 7]]
# array[row ,column]
b = a[:2, 1:3]

print(b)
# A slice of an array is a view into the same data, so modifying it
# will modify the original array.
print(a[0, 1])   # Prints "2"
b[0, 0] = 77     # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1])   # Prints "77"

[[2 3]
 [6 7]]
2
77
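
Since a slice is a view, changing it changes the original. When that is not what you want, one option is to take an explicit copy. The sketch below contrasts the two; `b_view` and `b_copy` are names chosen for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

b_view = a[:2, 1:3]         # a slice: shares memory with a
b_copy = a[:2, 1:3].copy()  # an independent copy of the same values

print(np.shares_memory(a, b_view))  # Prints "True"
print(np.shares_memory(a, b_copy))  # Prints "False"

b_copy[0, 0] = 77   # only the copy changes
print(a[0, 1])      # Prints "2"

b_view[0, 0] = 77   # the original changes too
print(a[0, 1])      # Prints "77"
```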


In [32]:
# Create the following rank 2 array with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

# Two ways of accessing the data in the middle row of the array.
# Mixing integer indexing with slices yields an array of lower rank,
# while using only slices yields an array of the same rank as the
# original array:
row_r1 = a[1, :]    # Rank 1 view of the second row of a
row_r2 = a[1:2, :]  # Rank 2 view of the second row of a
print(row_r1, row_r1.shape)  # Prints "[5 6 7 8] (4,)"
print(row_r2, row_r2.shape)  # Prints "[[5 6 7 8]] (1, 4)"

# We can make the same distinction when accessing columns of an array:
col_r1 = a[:, 1]
col_r2 = a[:, 1:2]
print(col_r1, col_r1.shape)  # Prints "[ 2  6 10] (3,)"
print(col_r2, col_r2.shape)  # Prints "[[ 2]
                             #          [ 6]
                             #          [10]] (3, 1)"

[5 6 7 8] (4,)
[[5 6 7 8]] (1, 4)
[ 2  6 10] (3,)
[[ 2]
 [ 6]
 [10]] (3, 1)


__Integer array indexing__

In [37]:
a = np.array([[1,2],[3,4],[5,6]])

#            0 1
#            | |  
#            V V
#  0 -->   [[1 2]
#  1 -->    [3 4]
#  2 -->    [5 6]]

      # Think of it as a matrix! 

      #[ [row numbers], [column numbers]]
print(a[  [0,1,2]     ,     [0,1,0]])
                         # The returned array will have shape (3,) and
                         # Prints "[1 4 5]"
print(np.array([a[0,0],a[1,1],a[2,0]])) # Same as above
        
        
print(a[[0,1,0],[1,1,1]]) # Prints "[2 4 2]"
print(np.array([a[0,1],a[1,1],a[0,1]])) # Same as above


[1 4 5]
[1 4 5]
[2 4 2]
[2 4 2]
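
Two points about integer array indexing are easy to miss: the row and column lists are paired up position by position, and the result is a new array rather than a view. A minimal sketch, with `rows`, `cols` and `picked` as illustrative names:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

rows = [0, 1, 2]
cols = [0, 1, 0]

# The index lists are paired element by element.
print(list(zip(rows, cols)))   # Prints "[(0, 0), (1, 1), (2, 0)]"

picked = a[rows, cols]
print(picked.tolist())         # Prints "[1, 4, 5]"

# Unlike a slice, the result does not share memory with a.
picked[0] = 100
print(a[0, 0])                      # Prints "1"
print(np.shares_memory(a, picked))  # Prints "False"
```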


In [44]:
# Create a new array from which we will select elements
a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])

print(a)  # Prints "[[ 1  2  3]
          #          [ 4  5  6]
          #          [ 7  8  9]
          #          [10 11 12]]"

# Create an array of indices
b = np.array([0, 2, 0, 1])

# Select one element from each row of a using the indices in b
# np.arange(4) = np.array([0,1,2,3])
print(a[np.arange(4), b])  # Prints "[ 1  6  7 11]"
print(a[[0,1,2,3],[0,2,0,1]]) # Same as above


# Mutate one element from each row of a using the indices in b
a[np.arange(4), b] += 10 # add 10 to the selected values

print(a)  # Prints "[[11  2  3]
          #          [ 4  5 16]
          #          [17  8  9]
          #          [10 21 12]]"

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
[ 1  6  7 11]
[ 1  6  7 11]
[[11  2  3]
 [ 4  5 16]
 [17  8  9]
 [10 21 12]]
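
The in-place update above works as expected because every selected position is distinct. If the same position appears more than once in the index, `+=` still changes it only once. The sketch below shows this on a small counter array and uses `np.add.at` as one way to count every occurrence; `counts`, `counts_at` and `idx` are illustrative names.

```python
import numpy as np

idx = np.array([0, 0, 2])   # position 0 appears twice

counts = np.zeros(3, dtype=np.int64)
counts[idx] += 1            # position 0 is incremented only once
print(counts.tolist())      # Prints "[1, 0, 1]"

counts_at = np.zeros(3, dtype=np.int64)
np.add.at(counts_at, idx, 1)  # applies the addition for every occurrence
print(counts_at.tolist())     # Prints "[2, 0, 1]"
```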


__Boolean array indexing__

In [48]:
a = np.array([[1,2],[3,4],[5,6]])

bool_idx = (a > 2)   # Find the elements of a that are bigger than 2;
                     # this returns a numpy array of Booleans of the same
                     # shape as a, where each slot of bool_idx tells
                     # whether that element of a is > 2.
            
print(bool_idx)      # Prints "[[False False]
                     #          [ True  True]
                     #          [ True  True]]"

# We use boolean array indexing to construct a rank 1 array
# consisting of the elements of a corresponding to the True values
# of bool_idx
print(a[bool_idx])  # Prints "[3 4 5 6]"

# We can do all of the above in a single concise statement:
print(a[a > 2])     # Prints "[3 4 5 6]"

[[False False]
 [ True  True]
 [ True  True]]
[3 4 5 6]
[3 4 5 6]
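
Boolean masks can also be combined and used on the left side of an assignment. The following sketch assumes the same array `a` as above; `mask` and `capped` are names chosen here. Note that conditions are joined with `&` and `|` (not `and` / `or`), and each condition needs its own parentheses.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

# Combine two conditions elementwise.
mask = (a > 2) & (a < 6)
print(a[mask].tolist())         # Prints "[3, 4, 5]"

# Assign through a mask (on a copy, so a itself is untouched).
capped = a.copy()
capped[capped > 4] = 0
print(capped.tolist())          # Prints "[[1, 2], [3, 4], [0, 0]]"

# Count how many elements satisfy a condition.
print(np.count_nonzero(a > 2))  # Prints "4"
```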


### Data types

In [49]:
import numpy as np

x = np.array([1, 2])   # Let numpy choose the datatype
print(x.dtype)         # Prints "int64" (may be "int32" on some platforms, e.g. Windows with older numpy)

x = np.array([1.0, 2.0])   # Let numpy choose the datatype
print(x.dtype)             # Prints "float64"

x = np.array([1, 2], dtype=np.int64)   # Force a particular datatype
print(x.dtype)                         # Prints "int64"

int64
float64
int64
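
The datatype of an existing array can be changed with `astype`, which returns a new array. A small sketch with illustrative names (`f`, `i`, `small`); note that converting floats to integers drops the fractional part rather than rounding.

```python
import numpy as np

f = np.array([1.7, -2.3, 3.9])
i = f.astype(np.int64)   # new array; f itself is unchanged

print(i.dtype)           # Prints "int64"
print(i.tolist())        # Prints "[1, -2, 3]" (truncated toward zero)
print(f.dtype)           # Prints "float64"

# The datatype also fixes how many bytes each element uses.
small = np.array([1, 2], dtype=np.float32)
print(small.itemsize)    # Prints "4"
print(f.itemsize)        # Prints "8"
```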


### Array Math

__addition, subtraction, multiplication, division, sqrt__

In [50]:
import numpy as np

x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Elementwise sum; both produce the array
# [[ 6.0  8.0]
#  [10.0 12.0]]
print(x + y)
print(np.add(x, y))

# Elementwise difference; both produce the array
# [[-4.0 -4.0]
#  [-4.0 -4.0]]
print(x - y)
print(np.subtract(x, y))

# Elementwise product; both produce the array
# [[ 5.0 12.0]
#  [21.0 32.0]]
print(x * y)
print(np.multiply(x, y))

# Elementwise division; both produce the array
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print(x / y)
print(np.divide(x, y))

# Elementwise square root; produces the array
# [[ 1.          1.41421356]
#  [ 1.73205081  2.        ]]
print(np.sqrt(x))

[[  6.   8.]
 [ 10.  12.]]
[[  6.   8.]
 [ 10.  12.]]
[[-4. -4.]
 [-4. -4.]]
[[-4. -4.]
 [-4. -4.]]
[[  5.  12.]
 [ 21.  32.]]
[[  5.  12.]
 [ 21.  32.]]
[[ 0.2         0.33333333]
 [ 0.42857143  0.5       ]]
[[ 0.2         0.33333333]
 [ 0.42857143  0.5       ]]
[[ 1.          1.41421356]
 [ 1.73205081  2.        ]]
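
The same elementwise behaviour applies when one operand is a single number: the operation is applied to every element. A brief sketch, reusing the array `x` from the cell above:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]], dtype=np.float64)

print((x * 2).tolist())    # Prints "[[2.0, 4.0], [6.0, 8.0]]"
print((x + 10).tolist())   # Prints "[[11.0, 12.0], [13.0, 14.0]]"
print((x ** 2).tolist())   # Prints "[[1.0, 4.0], [9.0, 16.0]]"

# Squaring the square root gives back x, up to floating point rounding.
print(np.allclose(np.sqrt(x) ** 2, x))  # Prints "True"
```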


__Dot product__

In [52]:
import numpy as np

x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

v = np.array([9,10])
w = np.array([11, 12])

# Inner product of vectors; both produce 219
print(v.dot(w))
print(np.dot(v, w))

# Matrix / vector product; both produce the rank 1 array [29 67]
print(x.dot(v))
print(np.dot(x, v))

# Matrix / matrix product; both produce the rank 2 array
# [[19 22]
#  [43 50]]
print(x.dot(y))
print(np.dot(x, y))

219
219
[29 67]
[29 67]
[[19 22]
 [43 50]]
[[19 22]
 [43 50]]
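
It is easy to confuse `*` with the matrix product. In Python 3.5 and later, the `@` operator is another way to write the matrix product, while `*` stays elementwise. A short sketch with the same `x` and `y` as above:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

print((x @ y).tolist())   # Matrix product; prints "[[19, 22], [43, 50]]"
print((x * y).tolist())   # Elementwise product; prints "[[5, 12], [21, 32]]"

# @ agrees with np.dot for these rank 2 arrays.
print(np.array_equal(x @ y, np.dot(x, y)))  # Prints "True"

# Unlike the elementwise product, the matrix product depends on the order.
print(np.array_equal(x @ y, y @ x))         # Prints "False"
```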


#### Sum of array elements

In [53]:
import numpy as np

x = np.array([[1,2],[3,4]])

print(np.sum(x))  # Compute sum of all elements; prints "10"
print(np.sum(x, axis=0))  # Compute sum of each column; prints "[4 6]"
print(np.sum(x, axis=1))  # Compute sum of each row; prints "[3 7]"

10
[4 6]
[3 7]
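
With a square array it is hard to tell from the result which axis was summed. A non-square example makes it clearer: `axis=0` collapses the rows (one result per column) and `axis=1` collapses the columns (one result per row). The array `t` and the result names below are illustrative, and `keepdims=True` is shown as an optional way to keep the result at rank 2.

```python
import numpy as np

t = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)

col_sums = np.sum(t, axis=0)
row_sums = np.sum(t, axis=1)
print(col_sums.tolist(), col_sums.shape)  # Prints "[5, 7, 9] (3,)"
print(row_sums.tolist(), row_sums.shape)  # Prints "[6, 15] (2,)"

# keepdims=True keeps the summed axis with length 1.
kept = np.sum(t, axis=1, keepdims=True)
print(kept.tolist(), kept.shape)          # Prints "[[6], [15]] (2, 1)"

print(t.sum())                            # Prints "21"
```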
