### What are NumPy and pandas?

- Numpy is an open source Python library used for scientific computing and provides a host of features that allow a Python programmer to work with high-performance arrays and matrices.

- pandas s a package for data manipulation that uses the DataFrame objects from R (as well as different R packages) in a Python environment.

- Both NumPy and pandas are often used together, as the pandas library relies heavily on the NumPy array for the implementation of pandas data objects and shares many of its features. In addition, pandas builds upon functionality provided by NumPy.

- Both libraries belong to what is known as the SciPy stack, a set of Python libraries used for scientific computing.

In [None]:
>>> data
>>>
array([[ 0.9526, -0.246 , -0.8856],
       [ 0.5639,  0.2379,  0.9104]])

In [None]:
>>> data_frame 
>>>
  pop  state  year
0 1.5  Ohio   2000
1 1.7  Ohio   2001
2 3.6  Ohio   2002
3 2.4  Nevada 2001
4 2.9  Nevada 2002

NumPy, short for Numerical Python, is the fundamental package required for high performance scientific computing and data analysis. 
• ndarray, a fast and space-efficient multidimensional array providing vectorized arithmetic operations and sophisticated broadcasting capabilities
• Standard mathematical functions for fast operations on entire arrays of data without having to write loops
• Tools for reading / writing array data to disk and working with memory-mapped files
• Linear algebra, random number generation, and Fourier transform capabilities
• Tools for integrating code written in C, C++, and Fortran

#### The NumPy ndarray: A Multidimensional Array Object


- One of the key features of NumPy is its N-dimensional array object, or "ndarray", which is a fast, flexible container for large data sets in Python

In [None]:
>>> data
>>>
array([[ 0.9526, -0.246 , -0.8856],
        [ 0.5639, 0.2379, 0.9104]])

In [None]:
>>> data * 10
>>>
array([[ 9.5256, -2.4601, -8.8565],
        [ 5.6385, 2.3794, 9.104 ]])


In [None]:
>>> data + data
>>>
array([[ 1.9051, -0.492 , -1.7713],
        [ 1.1277, 0.4759, 1.8208]])

In [None]:
>>> data.shape 
>>> (2, 3)
>>> data.dtype
>>> dtype('float64')

#####  Creating ndarrays

In [None]:
data1 = [6, 7.5, 8, 0, 1]

In [None]:
import numpy as np
arr1 = np.array(data1)

In [None]:
arr1

#### difference btw. array vs. list

- Arrays need to be declared. Lists don’t, since they are built into Python. In the examples above, you saw that lists are created by simply enclosing a sequence of elements into square brackets. Creating an array, on the other hand, requires a specific function from either the array module (i.e., array.array()) or NumPy package (i.e., numpy.array()). Because of this, lists are used more often than arrays.

- Arrays can store data very compactly and are more efficient for storing large amounts of data.

- Arrays are great for numerical operations; lists cannot directly handle math operations. For example, you can divide each element of an array by the same number with just one line of code. If you try the same with a list, you’ll get an error.

In [None]:
array = np.array([3, 6, 9, 12])
division = array/3
print(division)
print (type(division))

In [None]:
list = [3, 6, 9, 12]
division = list/3

- Of course, it’s possible to do a mathematical operation with a list, but it’s much less efficient:

##### When to use list or array?
- If you need to store a relatively short sequence of items and you don’t plan to do any mathematical operations with it, a list is the preferred choice. 

- If you have a very long sequence of items, consider using an array. This structure offers more efficient data storage.

- If you plan to do any numerical operations with your combination of items, use an array. Data analytics and data science rely heavily on (mostly NumPy) arrays.

In [None]:
##### difference between array vs. ndarray

- numpy.array is just a convenience function to create an ndarray; it is not a class itself.

- array is the function from numpy
- ndarray is the class generated from numpy.array

In [None]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]

In [None]:
arr2 = np.array(data2)

In [None]:
arr2

In [None]:
arr2.ndim

In [None]:
arr2.shape

In [None]:
arr2.dtype

- In addition to "np.array", there are a number of other functions for creating new arrays. 
- As examples, "zeros" and "ones" create arrays of 0’s or 1’s, respectively, with a given length or shape. 
- "empty" creates an array without initializing its values to any particular value.

In [None]:
np.zeros(10)

In [None]:
np.zeros((3, 6))

In [None]:
np.empty((2, 3, 2))

- It’s not safe to assume that np.empty will return an array of all zeros. 
- In many cases, as previously shown, it will return uninitialized garbage values.

In [None]:
# arange is an array-valued version of the built-in Python range function:
np.arange(15)

In [None]:
# range in list
range(15)

#### Data Types for ndarrays

In [None]:
arr1 = np.array([1, 2, 3], dtype=np.float64)

In [None]:
arr2 = np.array([1, 2, 3], dtype=np.int32)

##### NumPy data types
Type
 int8, uint8
 int16, uint16
 int32, uint32
 int64, uint64
 float16
 float32
 float64, float128


In [None]:
arr = np.array([1, 2, 3, 4, 5])
arr.dtype

In [None]:
float_arr = arr.astype(np.float64)
float_arr.dtype

- In this example, integers were cast to floating point. 
- If I cast some floating point num- bers to be of integer dtype, the decimal part will be truncated:

In [None]:
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr

In [None]:
arr.astype(np.int32)

In [None]:
numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.string_)
numeric_strings

In [None]:
numeric_strings.astype(float)

#### Operations between Arrays and Scalars

In [None]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr

In [None]:
arr * arr

In [None]:
1 / arr

In [None]:
arr**0.5

In [None]:
2 * arr

#### Basic Indexing and Slicing

In [38]:
arr = np.arange(10)

In [39]:
arr[5]

5

In [40]:
arr[5:8]

array([5, 6, 7])

In [52]:
arr[5:8] = 12
arr

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

In [42]:
arr_slice = arr[5:8]

In [43]:
arr_slice[1] = 12345
arr

array([    0,     1,     2,     3,     4,    12, 12345,    12,     8,     9])

In [45]:
arr_slice[:] = 64
arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

- As you can see, if you assign a scalar value to a slice, as in arr[5:8] = 12, the value is propagated (or broadcasted henceforth) to the entire selection. 
- An important first distinction from lists is that array slices are views on the original array. 
- This means that the data is not copied, and any modifications to the view will be reflected in the source array:

- If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array; for example arr[5:8].copy().

In [65]:
arr = np.arange(10)
arr


array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [66]:
arr_slice = arr[5:8].copy()
arr_slice

array([5, 6, 7])

In [67]:
arr_slice[1] = 120
arr_slice

array([  5, 120,   7])

In [68]:
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

##### 2-Dimensional slicing

In [48]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [47]:
arr2d[2]

array([7, 8, 9])

In [50]:
arr2d[:,2]

array([3, 6, 9])

In [51]:
arr2d[2,:]

array([7, 8, 9])

In [69]:
arr2d[0][2]

3

In [72]:
arr2d[0,2]

3

In [71]:
arr2d[0:2,1:3]

array([[2, 3],
       [5, 6]])

##### In multidimensional arrays, 2 X 2 X 3 array "arr3d"

In [73]:
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

In [74]:
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [75]:
arr3d[0]

array([[1, 2, 3],
       [4, 5, 6]])

In [76]:
old_values = arr3d[0].copy()

In [78]:
arr3d[0] = 42
arr3d

array([[[42, 42, 42],
        [42, 42, 42]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [79]:
arr3d[0] = old_values
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [80]:
arr3d[1, 0]

array([7, 8, 9])

#### Indexing with slices

In [82]:
## one-dimensional 
arr = np.arange(10)
arr[1:6]

array([1, 2, 3, 4, 5])

In [83]:
## two-dimensional 
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [84]:
arr2d[2]

array([7, 8, 9])

In [85]:
arr2d[:2]

array([[1, 2, 3],
       [4, 5, 6]])

In [86]:
arr2d[:2,1:]

array([[2, 3],
       [5, 6]])

In [87]:
arr2d[1, :2]

array([4, 5])

In [88]:
arr2d[2, :1]

array([7])

Let us make an example: 


![title](img/nparray.png)

##### indexing with an integer is different than indexing with a slice

In [264]:
# Make a 3d array:
import numpy as np
array = np.arange(60).reshape((3, 4, 5))


In [265]:
# Indexing with ints gives a scalar
print array[2, 3, 4] == 59
# True


True


In [266]:
# Indexing with slices gives a 3d array
print array[:2, :2, :2].shape
# (2, 2, 2)


(2, 2, 2)


In [267]:
# Indexing with a mix of slices and ints will give an array with < 3 dims
print array[0, :2, :3].shape
# (2, 3)
print array[:, 2, 0:1].shape
# (3, 1)

(2, 3)
(3, 1)


##### add one-more dimension

- arr[..., None] takes an array of dimension N and "adds" a dimension "at the end" for a resulting array of dimension N+1.

In [269]:
x = np.array([[1,2,3],[4,5,6]])
print(x.shape)          # (2, 3)

(2, 3)


In [270]:
y = x[...,None]
print(y.shape)          # (2, 3, 1)

(2, 3, 1)


In [271]:
z = x[:,:,np.newaxis]
print(z.shape)          # (2, 3, 1)

(2, 3, 1)


#### adding Elements of the array

In [93]:
sum_val = np.sum(arr2d)
sum_val

45

In [94]:
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [95]:
# sum along the rows
np.sum(arr2d,axis=1)

array([ 6, 15, 24])

In [96]:
# sum along the cols
np.sum(arr2d,axis=0)


array([12, 15, 18])

##### Saving and Loading Text Files

It will at times be useful to load data into vanilla NumPy arrays using np.loadtxt or the more specialized np.genfromtxt.

In [8]:
!cat data1.txt

0.580052,0.186730,1.040717,1.134411 
0.194163,-0.636917,-0.938659,0.124094
-0.126410,0.268607,-0.695724,0.047428 
-1.484413,0.004176,-0.744203,0.005487
2.302869,0.200131,1.670238,-1.881090
-0.193230,1.047233,0.482803,0.960334


In [97]:
import numpy as np
arr = np.loadtxt('data1.txt', delimiter=',')
arr

array([[ 0.580052,  0.18673 ,  1.040717,  1.134411],
       [ 0.194163, -0.636917, -0.938659,  0.124094],
       [-0.12641 ,  0.268607, -0.695724,  0.047428],
       [-1.484413,  0.004176, -0.744203,  0.005487],
       [ 2.302869,  0.200131,  1.670238, -1.88109 ],
       [-0.19323 ,  1.047233,  0.482803,  0.960334]])

##### Reshape array

In [104]:
arr = np.arange(8)

In [276]:
arr.reshape((4, 2))

ValueError: cannot reshape array of size 9 into shape (4,2)

In [106]:
arr.reshape((4, 2)).reshape((2, 4))

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

- One of the passed shape dimensions can be -1, in which case the value used for that dimension will be inferred from the data:

In [132]:
arr = np.arange(15).reshape((5, 3))

In [133]:
arr

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [136]:
arr = np.arange(15).reshape((5, -1))
arr


array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [137]:
arr.flatten()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [138]:
arr

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

- The "flatten" method always returns a copy of the data:

- Functions like reshape and flatten, accept an order argument indicating the order to use the data in the array. This can be 'C' or 'F' in most cases 

- For historical reasons, row and column major order are also know as C and Fortran order, respectively. In FORTRAN 77, the language of our forebears, matrices were all column major.

In [140]:
arr = np.arange(12).reshape((3, 4), order = 'c')
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

- C / row major order: traverse higher dimensions first (e.g. axis 1 before advancing on axis 0).

In [141]:
arr = np.arange(12).reshape((3, 4), order = 'F')
arr

array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])

- Fortran / column major order: traverse higher dimensions last (e.g. axis 0 before advancing on axis 1).

In [None]:
##### Concatenating and Splitting Arrays

In [256]:
arr1 = np.array([[1, 2, 3], [4, 5, 6]])

In [257]:
arr2 = np.array([[7, 8, 9], [10, 11, 12]])

In [145]:
np.concatenate([arr1, arr2], axis=0)

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [146]:
np.concatenate([arr1, arr2], axis=1)

array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])

- There are some convenience functions, like vstack and hstack

In [147]:
np.vstack((arr1, arr2))

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [148]:
np.hstack((arr1, arr2))

array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])

- "split", on the other hand, slices apart an array into multiple arrays along an axis:

In [149]:
from numpy.random import randn
arr = randn(5, 2)
arr

array([[ 0.87375887,  1.5926991 ],
       [ 0.57011234,  0.05117646],
       [-1.22960206, -0.52566717],
       [-0.28209095,  1.1228314 ],
       [-0.55939306,  0.65951081]])

In [150]:
first, second, third = np.split(arr, [1, 3])

In [151]:
first

array([[ 0.87375887,  1.5926991 ]])

In [153]:
second

array([[ 0.57011234,  0.05117646],
       [-1.22960206, -0.52566717]])

In [152]:
third

array([[-0.28209095,  1.1228314 ],
       [-0.55939306,  0.65951081]])

In [155]:
arr = randn(6, 2)
arr

array([[-0.44154458, -0.5885762 ],
       [ 1.86112074, -1.1948573 ],
       [-0.46666306, -1.70004579],
       [-0.72016586, -1.33753582],
       [ 1.33069788,  0.77176628],
       [ 1.18270368, -1.38640255]])

In [156]:
first, second, third = np.split(arr, [1, 3])

In [157]:
first

array([[-0.44154458, -0.5885762 ]])

In [158]:
second

array([[ 1.86112074, -1.1948573 ],
       [-0.46666306, -1.70004579]])

In [159]:
third

array([[-0.72016586, -1.33753582],
       [ 1.33069788,  0.77176628],
       [ 1.18270368, -1.38640255]])

##### Mathematical and Statistical Methods

- sum  $\;\;\;\;\;\;$     Sum of all the elements in the array or along an axis. 
- mean     $\;\;\;\;\;\;$          Arithmetic mean
- std, var   $\;\;\;\;\;\;$        Standard deviation and variance, respectively
- min, max      $\;\;\;\;\;\;$     Minimum and maximum.
- argmin, argmax  $\;\;\;\;\;\;$   Indices of minimum and maximum elements, respectively
- prod       $\;\;\;\;\;\;$     Product of elements starting from 0
- cumsum         $\;\;\;\;\;\;$    Cumulative sum of elements starting from 0
- cumprod       $\;\;\;\;\;\;$     Cumulative product of elements starting from 0



arr = np.random.randn(5, 4)       # normally-distributed data
arr

In [161]:
arr.mean()

0.13213996253821053

In [162]:
np.mean(arr)

0.13213996253821053

In [166]:
arr.sum()

3.159511490363581

In [179]:
arr.std()

2.5819888974716112

In [167]:
arr.mean(axis=1)

array([ 1.22466429,  0.41101143,  0.13215006, -0.07822193, -0.89972599])

In [168]:
arr.sum(0)

array([ 1.88760649,  0.80191581, -0.40801056,  0.87799975])

In [219]:
arr = np.array([[1, 2, 3], [4,
5, 6], [7, 8, 9]])
arr

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [220]:
arr.prod()

362880

In [221]:
arr.prod(0)

array([ 28,  80, 162])

In [174]:
arr.cumsum(0)

array([[ 0,  1,  2],
       [ 3,  5,  7],
       [ 9, 12, 15]])

In [171]:
arr.cumprod(1)

array([[  0,   0,   0],
       [  3,  12,  60],
       [  6,  42, 336]])

In [None]:
np.cumsum(arr)

In [175]:
np.cumsum(arr, 0)

array([[ 0,  1,  2],
       [ 3,  5,  7],
       [ 9, 12, 15]])

In [178]:
np.cumprod(arr+1,0)

array([[  1,   2,   3],
       [  4,  10,  18],
       [ 28,  80, 162]])

In [182]:
b= np.arange(4)
b

array([0, 1, 2, 3])

In [183]:
b.cumsum()

array([0, 1, 3, 6])

In [184]:
b.cumprod()

array([0, 0, 0, 0])

#### Exercise 1. 

In [187]:
arr1 = np.arange(15)
arr1

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [None]:
# reshape arr1 to a 3X5 array  as 
# array([[ 0,  3,  6,  9, 12],
#        [ 1,  4,  7, 10, 13],
#        [ 2,  5,  8, 11, 14]])

In [225]:
## calculate the sum of each column of the above array
## array([ 3, 12, 21, 30, 39])

In [227]:
# put the column summation at the bottom of the array, like
# array([[ 0,  3,  6,  9, 12],
#       [ 1,  4,  7, 10, 13],
#       [ 2,  5,  8, 11, 14],
#       [ 3, 12, 21, 30, 39]])


In [230]:
# Slice the above array by removing the top row and the ending column,   like
# array([[ 1,  4,  7, 10],
#       [ 2,  5,  8, 11],
#       [ 3, 12, 21, 30]])

In [250]:
## calculate the product of all elements in each row of the above matrix, like 
## array([[  280],
##       [  880],
##       [22680]])

In [252]:
# put the above product elements at the right side of the array, like
# array([[    1,     4,     7,    10,   280],
#        [    2,     5,     8,    11,   880],
#        [    3,    12,    21,    30, 22680]])
#