# Data Wrangling with NumPy and pandas

## NumPy

* NumPy is the primary array programming library for Python
* NumPy supports vectorized operations
* `array` (*NumPy array*) is the main NumPy data structure
    * efficiently stores and accesses multidimensional arrays
    * enables a wide variety of scientific computation
    * main concepts to understand: 
        * *data structure*
        * *indexing*
        * *vectorization*
        * *broadcasting*
        * *reduction*
* Links to resources:
    * [Official Paper](https://www.nature.com/articles/s41586-020-2649-2)
    * [Official Documentation](https://numpy.org/doc/)
    * Cheat Sheet: [DataCamp Link](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)

<img src="./images/numpy_tree.png" width="600">

In [4]:
import numpy as np # do not do `from numpy import *`

In [59]:
# arrays support matrix operations
arr_1 = np.array([2, 3, 4])
arr_2 = np.array([9, 4, 7])
arr_3 = np.array([4, 6, 3])

arr_1

array([2, 3, 4])

In [44]:
# dot product with 2 vectors
arr_1.dot(arr_2)

58

In [61]:
# matrices with numpy
mat = np.array([arr_1, arr_2, arr_3])
mat

array([[2, 3, 4],
       [9, 4, 7],
       [4, 6, 3]])
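
The list above names vectorization as a core concept. A minimal sketch of what it means, reusing `arr_1` and `arr_2` from the cells above: the names `loop_product` and `vec_product` are introduced here only for illustration.

```python
import numpy as np

arr_1 = np.array([2, 3, 4])
arr_2 = np.array([9, 4, 7])

# element-wise product written as an explicit Python loop
loop_product = [int(a * b) for a, b in zip(arr_1, arr_2)]

# the same computation, vectorized: the loop runs inside NumPy
vec_product = arr_1 * arr_2

print(loop_product)        # [18, 12, 28]
print(vec_product)         # [18 12 28]

# summing the element-wise product gives the dot product
print(vec_product.sum())   # 58
```

Both versions give the same numbers; the vectorized form is shorter and, on large arrays, usually much faster.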

### Shape of arrays 

In [66]:
# shape gives the size of the array along each dimension
print("Shape of the vector:", arr_1.shape)
print("Shape of the matrix:", mat.shape)

Shape of the vector: (3,)
Shape of the matrix: (3, 3)
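
Besides `shape`, a few related attributes describe an array's layout. The following sketch rebuilds `mat` so that it runs on its own.

```python
import numpy as np

mat = np.array([[2, 3, 4],
                [9, 4, 7],
                [4, 6, 3]])

print(mat.ndim)   # 2 -> number of dimensions (axes)
print(mat.size)   # 9 -> total number of elements
print(len(mat))   # 3 -> length of the first axis only
print(mat.dtype)  # integer type; the exact width depends on the platform
```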


### Reshaping arrays

In [81]:
# "reshaping" arrays
x = np.arange(12)
x.reshape(4, 3) # same as x.reshape((4, 3))

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [82]:
x # original array is not overwritten

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [83]:
x = x.reshape(4, 3)
x

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [84]:
# transposing arrays
x.T

array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
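
Two details about `reshape` are worth a quick sketch: one dimension can be given as `-1` so NumPy infers it, and for a contiguous array like this one the result is a view that shares memory with the original. The names `flat`, `grid`, and `independent` are illustrative only.

```python
import numpy as np

flat = np.arange(12)

# -1 asks NumPy to infer the missing dimension (12 / 3 = 4)
grid = flat.reshape(3, -1)
print(grid.shape)                    # (3, 4)

# here the reshaped array is a view on the same data
print(np.shares_memory(flat, grid))  # True
grid[0, 0] = 100
print(flat[0])                       # 100

# use .copy() when an independent array is needed
independent = flat.reshape(3, 4).copy()
independent[0, 0] = -1
print(flat[0])                       # 100
```

So "not overwritten" refers to the shape of the original array; its data can still change through a view.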

### Indexing 

In [62]:
# indexing works (almost) like lists (but more versatile)
print("mat[0]:\n\t", mat[0], end="\n\n")

mat[0]:
	 [2 3 4]



In [63]:
# mat[1][2] and mat[1, 2] are equivalent
print("mat[1][2]:", mat[1][2])
print("mat[1, 2]:", mat[1, 2])

mat[1][2]: 7
mat[1, 2]: 7


In [70]:
# boolean indexing - 1
mat >= 5

array([[False, False, False],
       [ True, False,  True],
       [False,  True, False]])

In [71]:
# boolean indexing - 2
mat[mat >= 5]

array([9, 7, 6])

In [72]:
condition = mat >= 5
mat[~ condition] # same as: mat[~(mat >= 5)]

array([2, 3, 4, 4, 4, 3])
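
As a sketch of the "more versatile" part, slices can be combined across axes, and a boolean mask can be used for assignment as well as selection. `capped` is a hypothetical name; the copy keeps `mat` itself unchanged.

```python
import numpy as np

mat = np.array([[2, 3, 4],
                [9, 4, 7],
                [4, 6, 3]])

# all rows, second column
print(mat[:, 1])
# [3 4 6]

# first two rows, last two columns
print(mat[0:2, 1:])
# [[3 4]
#  [4 7]]

# boolean mask on the left-hand side: cap every value at 5
capped = mat.copy()
capped[capped >= 5] = 5
print(capped)
# [[2 3 4]
#  [5 4 5]
#  [4 5 3]]
```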

### Broadcasting 

In [87]:
mat

array([[2, 3, 4],
       [9, 4, 7],
       [4, 6, 3]])

In [75]:
# broadcasting - 1
mat + 3 # what will happen?

array([[ 5,  6,  7],
       [12,  7, 10],
       [ 7,  9,  6]])
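
Broadcasting is not limited to scalars. A minimal sketch, assuming two illustrative arrays `row` (shape `(3,)`) and `col` (shape `(3, 1)`): NumPy stretches the smaller operand along the axis where it has length 1 or is missing.

```python
import numpy as np

mat = np.array([[2, 3, 4],
                [9, 4, 7],
                [4, 6, 3]])

row = np.array([10, 20, 30])           # shape (3,)
col = np.array([[100], [200], [300]])  # shape (3, 1)

# row is added to every row of mat
print(mat + row)
# [[12 23 34]
#  [19 24 37]
#  [14 26 33]]

# col is added to every column of mat
print(mat + col)
# [[102 103 104]
#  [209 204 207]
#  [304 306 303]]

# shapes that cannot be aligned raise a ValueError
try:
    mat + np.array([1, 2])
except ValueError as err:
    print("cannot broadcast:", err)
```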

### Reduction

In [94]:
np.sum(mat)

42

In [95]:
np.sum(mat, axis=0) # column sums: axis=0 collapses the rows

array([15, 13, 14])

In [96]:
np.sum(mat, axis=1) # row sums: axis=1 collapses the columns

array([ 9, 20, 13])

In [97]:
np.mean(mat, axis=0) # column means

array([5.        , 4.33333333, 4.66666667])
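
Other reductions follow the same `axis` convention. The sketch below also uses `keepdims=True`, which keeps the reduced axis with length 1 so the result can be broadcast back against `mat`; `row_sums` and `shares` are illustrative names.

```python
import numpy as np

mat = np.array([[2, 3, 4],
                [9, 4, 7],
                [4, 6, 3]])

print(mat.max(axis=1))     # [4 9 6] -> largest value in each row
print(mat.argmax(axis=1))  # [2 0 1] -> column index of that value

# keepdims=True returns shape (3, 1) instead of (3,)
row_sums = mat.sum(axis=1, keepdims=True)
print(row_sums.shape)      # (3, 1)

# reduction plus broadcasting: each row now sums to 1
shares = mat / row_sums
print(shares.sum(axis=1))
```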

### Multidimensional Arrays 

In [56]:
# multidimensional arrays (2x2x3)
arr = np.array([[[1, 3, 5], [6, 4, 2]] , [[9, 7, 5], [5, 9, 1]]], dtype=int)
arr.shape

(2, 2, 3)

In [57]:
arr

array([[[1, 3, 5],
        [6, 4, 2]],

       [[9, 7, 5],
        [5, 9, 1]]])

In [58]:
arr[0]

array([[1, 3, 5],
       [6, 4, 2]])
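
Indexing and reduction extend to three dimensions in the same way. A short sketch on the `arr` defined above, rebuilt here so the block runs on its own:

```python
import numpy as np

arr = np.array([[[1, 3, 5], [6, 4, 2]],
                [[9, 7, 5], [5, 9, 1]]], dtype=int)

# one index per axis: block 1, row 0, column 2
print(arr[1, 0, 2])
# 5

# first row of every block
print(arr[:, 0, :])
# [[1 3 5]
#  [9 7 5]]

# reducing over axis 0 adds the two blocks element-wise
print(arr.sum(axis=0))
# [[10 10 10]
#  [11 13  3]]

# a tuple of axes reduces over several axes at once
print(arr.sum(axis=(1, 2)))
# [21 36]
```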


## Pandas

* [Documentation](https://pandas.pydata.org/docs/)
* [Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

## Matplotlib

* [Documentation](https://matplotlib.org/3.3.3/users/index.html)
* [Cheat Sheets](https://github.com/matplotlib/cheatsheets)
* Alternative Libraries: [seaborn](https://seaborn.pydata.org/#), [ggplot**](https://github.com/yhat/ggpy), [Bokeh](https://docs.bokeh.org/en/latest/index.html), [Plotly**](https://plotly.com/python/getting-started/)

** Also available in R