# Week12: NumPy, Pandas, Visualization

# NumPy
## Description
- Ref: https://numpy.org/doc/stable/
- Ref: https://jakevdp.github.io/PythonDataScienceHandbook/index.html
- Ref: https://docs.scipy.org/doc/numpy/user/whatisnumpy.html
- Ref: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
- NumPy provides a multidimensional array object.
- Each object comes with an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
- In short, NumPy lets you create an ndarray (N-dimensional array) and manipulate it easily. It is fast because its core routines are implemented in compiled C code.
## How is it different from Python lists?
- Python lists can grow and shrink -- you can add and remove elements. NumPy arrays have a fixed size at creation; changing the size creates a new array.
- Python lists can contain different data types. NumPy arrays hold a single data type. If you put in mixed types, they are upcast to a common type (for example, integers and floats become floats; numbers and strings become strings).
- NumPy arrays come prepackaged with advanced mathematical operations. These operations are fast even on large amounts of data, and arrays use less memory than lists.
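
The differences listed above can be checked directly. This is a minimal sketch; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# A Python list keeps each element's own type.
mixed_list = [1, 2.5, "3"]
print([type(v).__name__ for v in mixed_list])

# NumPy upcasts all elements to one common dtype.
ints_and_floats = np.array([1, 2.5])         # becomes float64
numbers_and_text = np.array([1, 2.5, "3"])   # becomes a Unicode string dtype

print(ints_and_floats.dtype)
print(numbers_and_text.dtype.kind)           # 'U' marks a Unicode string dtype

# Arrays have a fixed size: np.append returns a new array instead of growing in place.
a = np.array([1, 2, 3])
b = np.append(a, 4)
print(a.size, b.size)
```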
## Why use NumPy
- Most data analysis programs use NumPy to manipulate data. They might take in data as standard Python lists, but they convert it to a NumPy array and manipulate the data using NumPy routines and output the transformed data as a NumPy array.
- The NumPy array is the main data type used by most scientific and mathematical Python packages.

## Simple example
Let's calculate the average of a vector's elements.

In [2]:
import numpy as np

In [1]:
x = list(range(20))
x

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [3]:
x_np = np.array(x)
x_np

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [7]:
# pure python with explicit looping over elements
def my_avr1(y):
    m_sum = 0
    m_count = 0
    for element in y:
        m_sum += element
        m_count += 1
    return m_sum/m_count

# pure python using built-in functions
def my_avr2(y):
    return sum(y)/len(y)

print(my_avr1(x))
print(my_avr1(x_np))
print(my_avr2(x))
print(my_avr2(x_np))
print(np.average(x))
print(np.average(x_np))


9.5
9.5
9.5
9.5
9.5
9.5


In [8]:
# python list
x = list(range(20))
# numpy array
x_np = np.array(x)

# pure python with explicit looping over elements
%timeit my_avr1(x)
# pure python using built-in functions
%timeit my_avr2(x)
# numpy averaging over python list
%timeit np.average(x)
# numpy averaging over numpy array
%timeit np.average(x_np)

1.04 µs ± 13.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
278 ns ± 10.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
7.38 µs ± 261 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
5.35 µs ± 44.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [9]:
x = list(range(100))
x_np = np.array(x)
%timeit my_avr1(x)
%timeit my_avr2(x)
%timeit np.average(x)
%timeit np.average(x_np)

5.06 µs ± 55.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
714 ns ± 11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
10.7 µs ± 56.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
5.37 µs ± 8.15 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [10]:
x = list(range(1000))
x_np = np.array(x)
%timeit my_avr1(x)
%timeit my_avr2(x)
%timeit np.average(x)
%timeit np.average(x_np)

54.2 µs ± 1.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
5.41 µs ± 26.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
51.1 µs ± 1.61 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
6.27 µs ± 69 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [11]:
x = list(range(10000))
x_np = np.array(x)
%timeit my_avr1(x)
%timeit my_avr2(x)
%timeit np.average(x)
%timeit np.average(x_np)

555 µs ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
52.2 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
462 µs ± 24.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
14.1 µs ± 71.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


> For short inputs the call overhead dominates and NumPy is not faster; its advantage shows on larger arrays, and even 10000 elements is not that big.
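
To see the gap on a larger input outside of IPython, one could use the standard `timeit` module. This is only a sketch: the size `n` is an assumption, and the timings depend on the machine.

```python
import timeit
import numpy as np

n = 1_000_000  # illustrative size, not a value from the notebook
x = list(range(n))
x_np = np.array(x)

# Time five repetitions of each approach.
t_list = timeit.timeit(lambda: sum(x) / len(x), number=5)
t_numpy = timeit.timeit(lambda: x_np.mean(), number=5)

print(f"built-in sum on a list: {t_list:.4f} s")
print(f"ndarray.mean():         {t_numpy:.4f} s")
```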

In [12]:
# Square a list using Python
squared_values = []
for number in range(10):
    squared_values.append(number*number)

print(squared_values)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


In [13]:
# Square a list using NumPy
import numpy as np

vector = np.array(range(10))
scalar = 5
print(vector * scalar)
print(vector * vector)

[ 0  5 10 15 20 25 30 35 40 45]
[ 0  1  4  9 16 25 36 49 64 81]


In [14]:
squared_values*scalar

[0,
 1,
 4,
 9,
 16,
 25,
 36,
 49,
 64,
 81,
 0,
 1,
 4,
 9,
 16,
 25,
 36,
 49,
 64,
 81,
 0,
 1,
 4,
 9,
 16,
 25,
 36,
 49,
 64,
 81,
 0,
 1,
 4,
 9,
 16,
 25,
 36,
 49,
 64,
 81,
 0,
 1,
 4,
 9,
 16,
 25,
 36,
 49,
 64,
 81]

### NumPy Basics
- NumPy arrays can be 1-D (a vector), 2-D (a matrix), or higher-dimensional

#### NumPy casting -- convert Python list to a NumPy array

In [2]:
my_list = [1, 2, 3]
my_list

[1, 2, 3]

In [3]:
import numpy as np

my_vector = np.array(my_list)
my_vector

array([1, 2, 3])

In [17]:
my_matrix1 = np.array([my_list, my_list])
my_matrix1

array([[1, 2, 3],
       [1, 2, 3]])

In [18]:
my_matrix2 = np.hstack([my_list, my_list])
my_matrix2

array([1, 2, 3, 1, 2, 3])

In [19]:
my_matrix3 = np.vstack([my_list, my_list])
my_matrix3

array([[1, 2, 3],
       [1, 2, 3]])

In [20]:
my_nested_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
my_matrix4 = np.array(my_nested_list)
my_matrix4

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [21]:
my_list1 = [[1,2], 
            [3,4]]

my_list2 = [[5,6], 
            [7,8]]
np.hstack([my_list1, my_list2])

array([[1, 2, 5, 6],
       [3, 4, 7, 8]])

In [22]:
np.vstack([my_list1, my_list2])

array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])

In [14]:
my_vector = np.array([1,2,3,np.nan])
my_vector

array([ 1.,  2.,  3., nan])

#### NumPy dtype argument
- Most array-creating functions accept a `dtype` argument
- `dtype` specifies the data type of each element in the array
  - int, int32, uint32, int64, float, float32, float64, object
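
As a small sketch of how `dtype` affects storage and conversion (the arrays here are illustrative):

```python
import numpy as np

ints = np.array([1, 2, 3], dtype="int64")
print(ints.dtype, ints.itemsize)      # itemsize is the number of bytes per element

# astype returns a converted copy; the original array is unchanged.
floats = ints.astype("float32")
print(floats.dtype, floats.itemsize)

# Converting floats to an integer dtype truncates toward zero.
truncated = np.array([1.9, -1.9]).astype("int64")
print(truncated)
```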

In [27]:
my_vector = np.array([1, 2, 3], dtype='float32')
print(my_vector.dtype)
my_vector

float32


array([1., 2., 3.], dtype=float32)

In [37]:
np.array([1, 2., "3"], dtype="object")

array([1, 2.0, '3'], dtype=object)

#### NumPy creating arrays

In [None]:
my_list = range(10)
my_list

In [31]:
## Create array using arange
my_vector = np.arange(10,dtype='float')
my_vector

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

In [34]:
np.arange(0, 10,dtype="object")

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=object)

In [30]:
np.arange(0, 10, 2)

array([0, 2, 4, 6, 8])

In [38]:
## Create array of zeros
np.zeros(3)

array([0., 0., 0.])

In [39]:
np.zeros((3,3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [40]:
## Create array of ones
np.ones(3)

array([1., 1., 1.])

In [41]:
np.ones((3,3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [43]:
np.ones((3,3))*4

array([[4., 4., 4.],
       [4., 4., 4.],
       [4., 4., 4.]])

In [44]:
np.full((3,3),4)

array([[4, 4, 4],
       [4, 4, 4],
       [4, 4, 4]])

In [45]:
## Create evenly spaced vector
### Example use case: when you have Y values for a plot but need to generate X values
### *** linspace includes both start and end
# np.arange(start, end(not included), step size)
# np.linspace(start, end(included), number_of_points)
np.linspace(0, 10, 5)

array([ 0. ,  2.5,  5. ,  7.5, 10. ])

In [47]:
np.linspace(1900, 2000, 11, dtype='int')

array([1900, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000])

## Create an identity matrix

In [48]:
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

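## Creating an empty array
- `np.empty` allocates memory without initializing it, so the values shown are arbitrary leftovers (here they happen to be zeros); assign every element before reading from it.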

In [49]:
np.empty((2,3))

array([[0., 0., 0.],
       [0., 0., 0.]])

## Creating Random Numbers
- Ref: https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html

In [50]:
# Uniform distribution
np.random.rand(3)

array([0.83105347, 0.77513139, 0.25540587])

In [15]:
np.random.rand(3,4)

array([[0.6724091 , 0.4088635 , 0.79358399, 0.60729485],
       [0.82191105, 0.48657845, 0.07587307, 0.1306164 ],
       [0.18546424, 0.10825952, 0.34140068, 0.27368214]])

In [52]:
# Normal distribution
np.random.randn(3)

array([-1.21993991,  1.94893948, -0.2268333 ])

In [53]:
np.random.randn(3,3)

array([[-0.27845378, -0.92887139, -0.0953225 ],
       [ 0.60687333,  0.0592109 , -1.67748204],
       [-1.54370418,  0.17068829,  1.29626984]])

In [54]:
# Random integers
# np.random.randint(start, end(not_included), size)
np.random.randint(1,101)

88

In [55]:
np.random.randint(1,101,5)

array([34, 75, 34, 81, 76])
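
Newer NumPy code often uses the `Generator` API, which makes results reproducible through an explicit seed. A minimal sketch, assuming NumPy 1.17 or later; the seed value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

uniform = rng.random(3)                    # uniform on [0, 1), like np.random.rand(3)
normal = rng.standard_normal((3, 3))       # standard normal, like np.random.randn(3, 3)
integers = rng.integers(1, 101, size=5)    # integers in [1, 101), like np.random.randint

# Re-creating the generator with the same seed reproduces the same draws.
rng_again = np.random.default_rng(seed=42)
same_draws = np.array_equal(uniform, rng_again.random(3))
print(same_draws)
```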

## Reshaping arrays

In [18]:
vector = np.arange(1,10)
vector

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [19]:
print(vector.reshape(3,3))

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [58]:
vector = np.arange(1,13)
print(vector.reshape(3,4))

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
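
`reshape` can also infer one dimension when it is given as `-1`, and it fails when the element count does not fit. A short sketch:

```python
import numpy as np

vector = np.arange(1, 13)

# -1 asks NumPy to compute that dimension from the array size.
three_rows = vector.reshape(3, -1)
print(three_rows.shape)

# The total number of elements must match: 12 elements cannot fill rows of 5.
try:
    vector.reshape(5, -1)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```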


## Basic array operations

In [20]:
vector = np.random.randint(1,50,25)
vector

array([11, 32, 28,  1,  1, 28, 11, 24, 28,  9, 21, 34, 49, 36, 37, 47,  9,
       32, 13, 39, 39,  6, 28, 10, 46])

In [60]:
# Min
vector.min()

1

In [61]:
# Max
vector.max()

49

In [62]:
max(vector)

49

In [21]:
for item in vector:
    print(item)

11
32
28
1
1
28
11
24
28
9
21
34
49
36
37
47
9
32
13
39
39
6
28
10
46


In [63]:
# get location of min value
index = vector.argmin()
index

3

In [64]:
# get location of max value
index = vector.argmax()
index

12

In [65]:
# get shape
vector.shape

(25,)

In [66]:
my_matrix = vector.reshape(5, 5)
my_matrix.shape

(5, 5)

## Indexing a 1-D array -- vector

In [22]:
vector = np.array(range(10))
vector
# vector[index]
# vector[start:end]
# vector[:end]
# vector[start:]
# vector[start:end:step]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [23]:
# read
vector[3]

3

In [24]:
vector[3:8]

array([3, 4, 5, 6, 7])

In [25]:
vector[:5]

array([0, 1, 2, 3, 4])

In [26]:
vector[5:]

array([5, 6, 7, 8, 9])

In [27]:
vector[3:9:2]

array([3, 5, 7])

In [28]:
vector[-1]

9

In [29]:
# update
vector[3] = 33
vector

array([ 0,  1,  2, 33,  4,  5,  6,  7,  8,  9])

In [30]:
vector[3:8] = 38
vector

array([ 0,  1,  2, 38, 38, 38, 38, 38,  8,  9])

In [31]:
vector[:5] = 55
vector

array([55, 55, 55, 55, 55, 38, 38, 38,  8,  9])

In [32]:
vector[5:] = 555
vector

array([ 55,  55,  55,  55,  55, 555, 555, 555, 555, 555])

In [33]:
vector[3:9:2] = 392
vector

array([ 55,  55,  55, 392,  55, 392, 555, 392, 555, 555])

In [34]:
vector[-1] = -1
vector

array([ 55,  55,  55, 392,  55, 392, 555, 392, 555,  -1])

## Setting multiple values at once -- Broadcasting
- There are two main features of NumPy arrays
  - Broadcasting -- a scalar or smaller array is stretched to match a larger shape, so multiple values can be set or combined at once
  - Vectorization -- no need for explicit looping -- for example, vector multiplication or squaring


In [35]:
vector[3:6] = 12
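
A short sketch of both forms of broadcasting: a scalar assigned to a slice, and a 1-D row combined with a 2-D matrix. The `matrix` and `row` arrays are illustrative.

```python
import numpy as np

vector = np.arange(10)
vector[3:6] = 12            # the scalar 12 is broadcast over three positions
print(vector)

matrix = np.arange(1, 7).reshape(2, 3)
row = np.array([10, 20, 30])

# The row of shape (3,) is stretched across both rows of the (2, 3) matrix.
shifted = matrix + row
print(shifted)
```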


## Indexing a 2-D array -- Matrix
- Remember -- Python is zero-indexed

In [36]:
matrix = np.array(range(1,10)).reshape((3,3))
matrix

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [37]:
matrix[0,0]  # single [] with comma separated indices is preferred

1

In [38]:
matrix[0][0]

1

In [39]:
matrix[2,2]

9

In [40]:
matrix[2][2]


9

In [41]:
matrix[:,2]

array([3, 6, 9])

In [42]:
matrix[1,:]

array([4, 5, 6])

In [45]:
matrix[:2]

array([[1, 2, 3],
       [4, 5, 6]])

In [46]:
matrix[:2,:]

array([[1, 2, 3],
       [4, 5, 6]])

In [47]:
matrix[:,1:] # grab all the rows, but columns starting from 1

array([[2, 3],
       [5, 6],
       [8, 9]])

## BE CAREFUL! View vs Copy
- Many operations in NumPy and pandas create a view rather than a copy
  - view: refers to the original data, seen from a different perspective
  - copy: a completely new array with copied data
- If you store a slice of an array in a new variable, changes made through the new variable are reflected in the original array.
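
One way to check whether two arrays share data is `np.shares_memory`, or the `.base` attribute of the result. A minimal sketch:

```python
import numpy as np

vector = np.arange(10)

view = vector[3:7]           # basic slicing returns a view
copy = vector[3:7].copy()    # an explicit, independent copy

print(np.shares_memory(vector, view))   # the view uses the same memory
print(np.shares_memory(vector, copy))   # the copy does not
print(view.base is vector)              # a view records the array it came from
```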

In [54]:
vector = np.array(range(10))
vector

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [57]:
my_slice = vector[3:7]
my_slice[:] = 20
my_slice

array([20, 20, 20, 20])

In [58]:
print(vector)

[ 0  1  2 20 20 20 20  7  8  9]


- Copy the array if you need a copy

In [51]:
vector = np.array(range(10))
vector

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [52]:
my_slice_copy = vector[3:7].copy()
my_slice_copy[:] = 30
my_slice_copy

array([30, 30, 30, 30])

In [53]:
vector

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

- Reshape behaves similarly: it returns a view of the original data when possible

In [60]:
vector = np.array(range(16))+100
vector

array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
       113, 114, 115])

In [61]:
matrix = vector.reshape((4,4))
matrix

array([[100, 101, 102, 103],
       [104, 105, 106, 107],
       [108, 109, 110, 111],
       [112, 113, 114, 115]])

In [62]:
matrix[1:3,1:3] = 22
matrix

array([[100, 101, 102, 103],
       [104,  22,  22, 107],
       [108,  22,  22, 111],
       [112, 113, 114, 115]])

In [63]:
vector

array([100, 101, 102, 103, 104,  22,  22, 107, 108,  22,  22, 111, 112,
       113, 114, 115])

In [64]:
submatrix = matrix[1:3,1:3]
submatrix

array([[22, 22],
       [22, 22]])

In [65]:
111*np.eye(2)

array([[111.,   0.],
       [  0., 111.]])

In [66]:
submatrix[0:2,0:2] = 111*np.eye(2)
submatrix

array([[111,   0],
       [  0, 111]])

In [67]:
matrix

array([[100, 101, 102, 103],
       [104, 111,   0, 107],
       [108,   0, 111, 111],
       [112, 113, 114, 115]])

In [68]:
vector

array([100, 101, 102, 103, 104, 111,   0, 107, 108,   0, 111, 111, 112,
       113, 114, 115])

In [69]:
# But assigning to the name rebinds it to a new array; matrix and vector stay unchanged
submatrix = 222*np.eye(2)
submatrix

array([[222.,   0.],
       [  0., 222.]])

In [70]:
matrix

array([[100, 101, 102, 103],
       [104, 111,   0, 107],
       [108,   0, 111, 111],
       [112, 113, 114, 115]])

In [71]:
vector

array([100, 101, 102, 103, 104, 111,   0, 107, 108,   0, 111, 111, 112,
       113, 114, 115])

## Conditional selection

In [72]:
vector = np.arange(10)
vector

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [73]:
gt2 = vector > 2 # create boolean (condition)
gt2

array([False, False, False,  True,  True,  True,  True,  True,  True,
        True])

In [74]:
lt8 = vector < 8 # create boolean
lt8

array([ True,  True,  True,  True,  True,  True,  True,  True, False,
       False])

In [75]:
selected_gt2 = vector[gt2] # apply condition to select
selected_gt2

array([3, 4, 5, 6, 7, 8, 9])

In [76]:
selected_lt8 = vector[lt8] # apply condition to select
selected_lt8

array([0, 1, 2, 3, 4, 5, 6, 7])

In [77]:
# no need to use variables
vector[vector>2]

array([3, 4, 5, 6, 7, 8, 9])

In [78]:
vector[vector<8]

array([0, 1, 2, 3, 4, 5, 6, 7])

In [80]:
cond = vector>2 & vector<7 # missing parentheses: & binds tighter than > and <
vector[cond]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [81]:
cond = (vector>=2) & (vector<=7)
vector[cond]


array([2, 3, 4, 5, 6, 7])
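
Conditions can also be negated with `~`, combined with `|`, and passed to `np.where`. A brief sketch with illustrative variable names:

```python
import numpy as np

vector = np.arange(10)

inside = (vector >= 2) & (vector <= 7)     # element-wise AND
outside = ~inside                          # element-wise NOT
either_end = (vector < 2) | (vector > 7)   # element-wise OR

print(vector[inside])
print(np.array_equal(outside, either_end))

# np.where picks values element by element based on a condition.
clipped = np.where(inside, vector, -1)
print(clipped)
```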

### view or copy?

In [82]:
# view or copy?
vector = np.arange(10)
selected_lt8 = vector[vector < 8]
selected_lt8 = -11  # rebinds the name to an int; vector is untouched
print(selected_lt8)
print(vector)

-11
[0 1 2 3 4 5 6 7 8 9]


In [83]:
# How about now? Boolean indexing returns a copy, so vector is still unchanged
vector = np.arange(10)
selected_lt8 = vector[vector < 8]
selected_lt8[:] = -11
print(selected_lt8)
print(vector)

[-11 -11 -11 -11 -11 -11 -11 -11]
[0 1 2 3 4 5 6 7 8 9]


In [84]:
# What about now? Assigning through the mask writes into vector itself
vector = np.arange(10)
vector[vector < 8] = -11
print(vector)

[-11 -11 -11 -11 -11 -11 -11 -11   8   9]


## Array operations -- Basic

In [85]:
vector = np.arange(10)
vector

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [86]:
vector + vector

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [87]:
vector - vector

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [88]:
vector * vector

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [89]:
vector / vector # problem: 0/0 returns `nan` and emits a RuntimeWarning


array([nan,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
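
If the `nan` is unwanted, one option is to divide only where the denominator is non-zero. This is a sketch of one possible approach, not the only one.

```python
import numpy as np

vector = np.arange(10)

# Divide only where the denominator is non-zero; other positions keep the value from `out`.
safe = np.divide(vector, vector, out=np.zeros(vector.shape), where=vector != 0)
print(safe)

# Alternatively, silence the warning locally and handle nan afterwards.
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = vector / vector
nan_count = int(np.isnan(ratio).sum())
print(nan_count)
```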

In [90]:
vector + 10

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [91]:
vector - 10

array([-10,  -9,  -8,  -7,  -6,  -5,  -4,  -3,  -2,  -1])

In [92]:
vector * 10

array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])

In [93]:
vector / 10

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

## Array operations -- Advanced
- Ref: https://docs.scipy.org/doc/numpy/reference/ufuncs.html#math-operations
- https://stackoverflow.com/questions/25773245/ambiguity-in-pandas-dataframe-numpy-array-axis-definition/43413031

In [94]:
vector = np.arange(10)
vector

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [95]:
np.max(vector)

9

In [96]:
np.min(vector)

0

In [97]:
np.sqrt(vector)

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])

In [98]:
np.log(vector) # log(0) returns `-inf` and emits a RuntimeWarning


array([      -inf, 0.        , 0.69314718, 1.09861229, 1.38629436,
       1.60943791, 1.79175947, 1.94591015, 2.07944154, 2.19722458])

In [99]:
# built-in sum iterates element by element -- not as efficient
sum(vector<5)

5

In [100]:
# np.sum is vectorized -- preferred
np.sum(vector<5)

5

In [101]:
import math
vector = np.arange(1,11) * math.pi
np.sin(vector)

array([ 1.22464680e-16, -2.44929360e-16,  3.67394040e-16, -4.89858720e-16,
        6.12323400e-16, -7.34788079e-16,  8.57252759e-16, -9.79717439e-16,
        1.10218212e-15, -1.22464680e-15])

In [102]:
vector = np.arange(0,math.pi+math.pi/4,math.pi/4)
np.sin(vector)

array([0.00000000e+00, 7.07106781e-01, 1.00000000e+00, 7.07106781e-01,
       1.22464680e-16])

In [103]:
matrix = np.random.rand(5,5)
np.floor(matrix*1000)/1000

array([[0.545, 0.168, 0.926, 0.087, 0.331],
       [0.658, 0.7  , 0.896, 0.578, 0.935],
       [0.173, 0.808, 0.046, 0.671, 0.235],
       [0.351, 0.274, 0.085, 0.296, 0.712],
       [0.576, 0.07 , 0.241, 0.902, 0.502]])

In [104]:
np.round(matrix*1000)/1000

array([[0.546, 0.168, 0.927, 0.088, 0.331],
       [0.659, 0.7  , 0.896, 0.578, 0.935],
       [0.173, 0.808, 0.047, 0.672, 0.236],
       [0.351, 0.275, 0.085, 0.297, 0.713],
       [0.577, 0.071, 0.241, 0.902, 0.502]])

In [105]:
np.ceil(matrix*1000)/1000

array([[0.546, 0.169, 0.927, 0.088, 0.332],
       [0.659, 0.701, 0.897, 0.579, 0.936],
       [0.174, 0.809, 0.047, 0.672, 0.236],
       [0.352, 0.275, 0.086, 0.297, 0.713],
       [0.577, 0.071, 0.242, 0.903, 0.503]])

In [106]:
matrix = np.arange(1,13).reshape(3,4)
matrix

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [107]:
matrix.sum(axis=1)

array([10, 26, 42])

In [108]:
matrix.sum(axis=0)

array([15, 18, 21, 24])

In [109]:
matrix.cumsum()

array([ 1,  3,  6, 10, 15, 21, 28, 36, 45, 55, 66, 78], dtype=int32)

In [110]:
matrix.cumprod()

array([        1,         2,         6,        24,       120,       720,
            5040,     40320,    362880,   3628800,  39916800, 479001600],
      dtype=int32)

In [111]:
matrix.min(axis=1)

array([1, 5, 9])

In [112]:
matrix.min(axis=0)

array([1, 2, 3, 4])

In [113]:
matrix.max(axis=1)

array([ 4,  8, 12])

In [114]:
matrix.max(axis=0)

array([ 9, 10, 11, 12])

In [115]:
matrix = np.array([1,2,3]*3).reshape(3,3)
matrix

array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3]])

In [116]:
np.unique(matrix)



array([1, 2, 3])

# Pandas
- Built on top of NumPy -- meaning the underlying data structure is the ndarray
- Pandas provides Series, which are like NumPy arrays but with associated index labels (similar to column or row labels). Elements can have different data types
- Pandas also provides DataFrames, which are like Excel sheets or database tables

### Basic examples

In [117]:
import numpy as np
import pandas as pd

header = ['chrom', 'pos', 'filter']
data = [4, 12345, 38.4]
vector = np.array(data)
print(vector)
s1 = pd.Series(data=data) # Notice the data type is float
s1

[4.0000e+00 1.2345e+04 3.8400e+01]


0        4.0
1    12345.0
2       38.4
dtype: float64

In [118]:
s2 = pd.Series(data=data, index=header)
s2

chrom         4.0
pos       12345.0
filter       38.4
dtype: float64

In [119]:
# can also do as positional arguments
s1 = pd.Series(data)
s1

0        4.0
1    12345.0
2       38.4
dtype: float64

In [120]:
s2 = pd.Series(data, header)
s2

chrom         4.0
pos       12345.0
filter       38.4
dtype: float64

In [121]:
# can use a dictionary to initialize a pandas Series
data_dict = {'chrom': 4, 'pos': 12345, 'filter': 38.4}
s3 = pd.Series(data_dict)
s3

chrom         4.0
pos       12345.0
filter       38.4
dtype: float64

In [122]:
# can hold "different" data types (everything is cast to the Python object dtype, which is no longer efficient)
data = [1, '2s', 34]
pd.Series(data)


0     1
1    2s
2    34
dtype: object

## Basic indexing

In [137]:
## Using index labels to fetch element
header = ['chrom', 'pos', 'filter', 'qual']
data = [4, 12345, 38.4, 12.3]
series = pd.Series(data=data, index=header)
series

chrom         4.0
pos       12345.0
filter       38.4
qual         12.3
dtype: float64

In [138]:
series[0]  # positional access with []; deprecated in recent pandas, prefer series.iloc[0]

4.0

In [124]:
series['chrom']

4.0

In [125]:
series['filter']

38.4

In [126]:
series['chrom':'filter']

chrom         4.0
pos       12345.0
filter       38.4
dtype: float64

In [127]:
series.loc['chrom']

4.0

In [128]:
series.loc['filter']

38.4

In [129]:
series.loc['chrom':'filter']

chrom         4.0
pos       12345.0
filter       38.4
dtype: float64

In [130]:
series.iloc[0]

4.0

In [131]:
series.iloc[2]

38.4

In [132]:
series.iloc[0:2]

chrom        4.0
pos      12345.0
dtype: float64

In [133]:
series[series>20]

pos       12345.0
filter       38.4
dtype: float64

In [134]:
series[np.array((True,False,True,False))]

chrom      4.0
filter    38.4
dtype: float64

In [139]:
## Integer index labels: [] and .loc use the labels, .iloc uses positions
header = [1,12,5,121]
data = [4, 12345, 38.4, 12.3]
series = pd.Series(data=data, index=header)
series

1          4.0
12     12345.0
5         38.4
121       12.3
dtype: float64

In [140]:
series[1]

4.0

In [141]:
series.loc[1]

4.0

In [142]:
series.iloc[1]

12345.0

In [136]:
# integer index
series = pd.Series(data)
series

0        4.0
1    12345.0
2       38.4
3       12.3
dtype: float64

In [None]:
series = pd.Series(data)
series[0]

## Basic operations

In [143]:
header1 = ['chrom', 'pos', 'filter']
data1 = [4, 12345, 38.4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = [3, 4899, 234, 89.9]

s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
print(s1,"\n\n", s2)

chrom         4.0
pos       12345.0
filter       38.4
dtype: float64 

 chrom        3.0
pos       4899.0
filter     234.0
qual        89.9
dtype: float64


In [144]:
s1+s2

chrom         7.0
filter      272.4
pos       17244.0
qual          NaN
dtype: float64

In [145]:
header1 = ['chrom', 'pos', 'filter']
data1 = [4, 12345, 38.4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = ['3', 4899, 234, 89.9]

s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
print(s1,"\n\n", s2)

chrom         4.0
pos       12345.0
filter       38.4
dtype: float64 

 chrom        3
pos       4899
filter     234
qual      89.9
dtype: object


In [146]:
s1+s2

TypeError: unsupported operand type(s) for +: 'float' and 'str'

In [147]:
data1 = [4, 12345, 38.4]
data2 = [3, 4899, 234, 89.9]

s1 = pd.Series(data1)
s2 = pd.Series(data2)
print(s1,"\n\n", s2)

0        4.0
1    12345.0
2       38.4
dtype: float64 

 0       3.0
1    4899.0
2     234.0
3      89.9
dtype: float64


In [148]:
s1 + s2

0        7.0
1    17244.0
2      272.4
3        NaN
dtype: float64

In [149]:
## IMPORTANT - with index labels -- operations are based on label
header1 = ['pos', 'filter', 'chrom']
data1 = [12345, 38.4, 4]
header2 = ['chrom', 'pos', 'filter', 'qual']
data2 = [3, 4899, 234, 89.9]

s1 = pd.Series(data1, header1)
s2 = pd.Series(data2, header2)
print(s1,"\n\n", s2)
s1+s2

pos       12345.0
filter       38.4
chrom         4.0
dtype: float64 

 chrom        3.0
pos       4899.0
filter     234.0
qual        89.9
dtype: float64


chrom         7.0
filter      272.4
pos       17244.0
qual          NaN
dtype: float64
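
When a label is missing from one Series, `Series.add` with `fill_value` avoids the `NaN`. A minimal sketch using the same data as above:

```python
import pandas as pd

s1 = pd.Series([12345, 38.4, 4], index=["pos", "filter", "chrom"])
s2 = pd.Series([3, 4899, 234, 89.9], index=["chrom", "pos", "filter", "qual"])

# Treat a label missing from one Series as 0 instead of producing NaN.
total = s1.add(s2, fill_value=0)
print(total)
```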

## Dataframes
- A DataFrame is composed of Series (one per column)
- Ref: https://pandas.pydata.org/docs/reference/api/pandas.io.formats.style.Styler.html#pandas.io.formats.style.Styler

In [150]:
import numpy as np
import pandas as pd

header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']

df = pd.DataFrame(data=data, columns=header)
df

   exam1  exam2  exam3
0     84     73     92
1     95     88     94
2     85     67     73
3     76     87     92


In [151]:
df = pd.DataFrame(data=data, index=students, columns=header)
df

          exam1  exam2  exam3
student1     84     73     92
student2     95     88     94
student3     85     67     73
student4     76     87     92


In [152]:
# referencing column
df['exam1']

student1    84
student2    95
student3    85
student4    76
Name: exam1, dtype: int32

In [153]:
df.exam1 # works, but not recommended: fails for column names that are not valid identifiers or that clash with DataFrame attributes

student1    84
student2    95
student3    85
student4    76
Name: exam1, dtype: int32

In [154]:
df['average'] = (df['exam1'] + df['exam2'] + df['exam3'])/3
df

          exam1  exam2  exam3    average
student1     84     73     92  83.000000
student2     95     88     94  92.333333
student3     85     67     73  75.000000
student4     76     87     92  85.000000


In [155]:
df.drop('average') # does not work because default for drop is to work on row labels

KeyError: "['average'] not found in axis"

In [156]:
df.drop('average', axis=1) # works on column labels

          exam1  exam2  exam3
student1     84     73     92
student2     95     88     94
student3     85     67     73
student4     76     87     92


In [157]:
## STILL NOT DROPPED from df
df

          exam1  exam2  exam3    average
student1     84     73     92  83.000000
student2     95     88     94  92.333333
student3     85     67     73  75.000000
student4     76     87     92  85.000000


In [158]:
df.drop('average', axis=1, inplace=True)
df

          exam1  exam2  exam3
student1     84     73     92
student2     95     88     94
student3     85     67     73
student4     76     87     92


In [159]:
## drop a student
df.drop('student3')

          exam1  exam2  exam3
student1     84     73     92
student2     95     88     94
student4     76     87     92


In [160]:
df.drop('student3', inplace=True)
df

          exam1  exam2  exam3
student1     84     73     92
student2     95     88     94
student4     76     87     92


In [161]:
header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student3']
df = pd.DataFrame(data=data, index=students, columns=header)
df

          exam1  exam2  exam3
student1     91     82     81
student2     92     67     85
student3     78     85     91
student3     76     70     92


In [162]:
df.drop('student3')

          exam1  exam2  exam3
student1     91     82     81
student2     92     67     85


In [163]:
df.drop('student3', inplace=True)
df

          exam1  exam2  exam3
student1     91     82     81
student2     92     67     85


- `df.shape` is (rows, columns), that is (axis=0, axis=1)
- An N x M matrix has N rows and M columns; a(i,j) is the element in row i, column j

In [None]:
## Row is referred to as axis=0
## Column is referred to as axis=1
## (R,C) == (axis=0, axis=1) df.shape
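
A small sketch of what the `axis` argument does in a reduction. The two-row frame reuses scores from the example above but is otherwise illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[84, 73, 92], [95, 88, 94]],
    index=["student1", "student2"],
    columns=["exam1", "exam2", "exam3"],
)

print(df.shape)                  # 2 rows, 3 columns

per_exam = df.mean(axis=0)       # collapses the rows: one mean per exam column
per_student = df.mean(axis=1)    # collapses the columns: one mean per student
print(per_exam)
print(per_student)
```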

## Select Dataframe rows

In [164]:
header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, index=students, columns=header)
df

          exam1  exam2  exam3
student1    100     91     99
student2     77     96     82
student3     96     69     77
student4     70     90     68


In [171]:
df['exam1']

student1    100
student2     77
student3     96
student4     70
Name: exam1, dtype: int32

In [165]:
df.loc['student1']

exam1    100
exam2     91
exam3     99
Name: student1, dtype: int32

In [166]:
df.iloc[0] ## remember that column names do not count as rows

exam1    100
exam2     91
exam3     99
Name: student1, dtype: int32

In [167]:
df.loc[['student1','student2']]

          exam1  exam2  exam3
student1    100     91     99
student2     77     96     82


In [168]:
df.iloc[[0,2]] ## rows at positions 0 and 2

          exam1  exam2  exam3
student1    100     91     99
student3     96     69     77


In [169]:
df.loc[[True,False,True,False]]

          exam1  exam2  exam3
student1    100     91     99
student3     96     69     77


In [170]:
df.iloc[[True,False,True,False]]

          exam1  exam2  exam3
student1    100     91     99
student3     96     69     77


## Select subset of data

In [172]:
df.loc['student1', 'exam1']

100

In [173]:
df.loc[['student1', 'student3'], ['exam1', 'exam3']]

          exam1  exam3
student1    100     99
student3     96     77


In [174]:
df.loc['student1':'student2', 'exam1':'exam2']

          exam1  exam2
student1    100     91
student2     77     96


In [176]:
df.iloc[0, 0]

100

In [177]:
df.iloc[[0, 2], [0, 2]]

          exam1  exam3
student1    100     99
student3     96     77


In [175]:
df.iloc[0:3, 0:3]

          exam1  exam2  exam3
student1    100     91     99
student2     77     96     82
student3     96     69     77


## Use conditions to select

In [None]:
header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, index=students, columns=header)
df

In [None]:
df>=90

In [None]:
df[df>=90]

In [None]:
df['exam1']>=85 # boolean Series: True where exam1 is at least 85

In [None]:
df[df['exam1']>=85] # gives all rows (with every column) where exam1 is at least 85

In [None]:
df[df['exam1']>=85]['exam3']

In [None]:
df[df['exam1']>=85][['exam2', 'exam3']]

In [None]:
df[(df['exam1']>=85) & (df['exam2']>=85)]

In [None]:
df[(df['exam1']>=85) & (df['exam2']>=85)]['exam3']

In [None]:
df[(df['exam1']>=85) | (df['exam2']>=85)]

In [None]:
df[(df['exam1']>=85) | (df['exam2']>=85)]['exam3']
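
The chained form `df[cond]['exam3']` can usually be written as a single `.loc` call, which also works reliably for assignment. A sketch with fixed scores (taken from an earlier example) so the result is reproducible:

```python
import pandas as pd

df = pd.DataFrame(
    {"exam1": [100, 77, 96, 70], "exam2": [91, 96, 69, 90], "exam3": [99, 82, 77, 68]},
    index=["student1", "student2", "student3", "student4"],
)

both = (df["exam1"] >= 85) & (df["exam2"] >= 85)

# One .loc call selects rows by mask and columns by label.
selected = df.loc[both, "exam3"].copy()
print(selected)

# A single .loc call is also the reliable way to assign to the selection.
df.loc[both, "exam3"] = 100
print(df)
```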

## Adding student index

In [None]:
header = ['exam1', 'exam2', 'exam3']
data = np.random.randint(65, 101, 12).reshape(4,3)
students = ['student1', 'student2', 'student3', 'student4']
df = pd.DataFrame(data=data, columns=header)
df

In [None]:
df['name'] = students
df

In [None]:
df.set_index('name', inplace=True)
df

In [None]:
df.loc['student1']

In [None]:
df.reset_index(inplace=True)
df

## Multi-index data

In [None]:
students = 'student1 student1 student1 student2 student2 student2 student3 student3 student3'
exams = 'exam1 exam2 exam3'.split()*3
classes = 'class1 class2'
index = list(zip(students.split(), exams))
index = pd.MultiIndex.from_tuples(index)
index

In [None]:
df = pd.DataFrame(np.random.randint(65, 101, 3*3*2).reshape(9,2) , index, classes.split())
df

In [None]:
df.loc['student1']

In [None]:
df.loc['student1'].loc['exam1']['class1']

In [None]:
df.index.names

In [None]:
df.index.names = ['Students', 'Exams']

In [None]:
df

In [None]:
## cross-section
df.xs('student1')

In [None]:
df.xs('exam1', level='Exams')
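
With a MultiIndex, a tuple passed to `.loc` addresses both levels at once. A minimal sketch; the frame is built with `MultiIndex.from_product` and fixed values, which is an assumption made for reproducibility.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["student1", "student2", "student3"], ["exam1", "exam2", "exam3"]],
    names=["Students", "Exams"],
)
df = pd.DataFrame(np.arange(18).reshape(9, 2), index=index, columns=["class1", "class2"])

# A tuple addresses both index levels in one lookup.
direct = df.loc[("student1", "exam1"), "class1"]

# The chained lookup reaches the same cell in three steps.
chained = df.loc["student1"].loc["exam1"]["class1"]
print(direct, chained)
```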

## Dealing with missing data


In [None]:
my_dict = {'student1': [90, 84, np.nan], 'student2': [77, np.nan, np.nan], 'student3': [88, 65, 93]}
df = pd.DataFrame(my_dict)
df

In [None]:
df.dropna()

In [None]:
df.dropna(axis=0)

In [None]:
df.dropna(axis=1)

In [None]:
df.dropna(thresh=2)

In [None]:
df.fillna(value=55)

In [None]:
df.drop(axis=0, labels=[1,2])

In [None]:
df.drop(columns=['student1'])  # `columns=` already targets the column axis
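
Before dropping or filling, it often helps to count the missing values. The sketch below also fills each column with its own mean, which is one possible policy rather than a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "student1": [90, 84, np.nan],
    "student2": [77, np.nan, np.nan],
    "student3": [88, 65, 93],
})

missing_per_column = df.isna().sum()
print(missing_per_column)

# Fill each column's gaps with that column's mean.
filled = df.fillna(df.mean())
print(filled)
```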

## Groupby

In [None]:
my_dict = {
    'Exams': 'exam1 exam1 exam1'.split() + 'exam2 exam2 exam2'.split() + 'exam3 exam3 exam3'.split(),
    'Students': 'student1 student2 student3'.split()*3,
    'Scores': np.random.randint(65,101,9)
}
df = pd.DataFrame(my_dict)
df

In [None]:
df.groupby('Students').mean(numeric_only=True)  # skip the non-numeric Exams column

In [None]:
df.groupby('Students').mean(numeric_only=True).loc['student1']

In [None]:
df.groupby('Exams').max()['Scores']

In [None]:
df.groupby('Exams').describe()

In [None]:
df.groupby('Students').describe().transpose()
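
Several aggregations can be computed in one pass with `agg`. A sketch with fixed scores instead of random ones, so the output is reproducible:

```python
import pandas as pd

df = pd.DataFrame({
    "Exams": ["exam1"] * 3 + ["exam2"] * 3 + ["exam3"] * 3,
    "Students": ["student1", "student2", "student3"] * 3,
    "Scores": [90, 80, 70, 85, 75, 65, 95, 85, 75],  # made-up values
})

# Select the numeric column first, then apply several aggregations at once.
summary = df.groupby("Students")["Scores"].agg(["mean", "min", "max"])
print(summary)
```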

## Merging -- SQL JOIN

In [None]:
departments = {
    'DepartmentId': [1, 2, 3, 4],
    'DepartmentName': ['IT', 'Physics', 'Arts', 'Math']
}

df1 = pd.DataFrame(departments)
df1

In [None]:
students = {
    'StudentId': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'StudentName': ['Michael', 'John', 'Jack', 'Sara', 'Sally', 'Jena', 'Nancy', 'Adam', 'Stevens', 'George'],
    'DepartmentId': [1, 1, 1, 2, 2, np.nan, 2, 3, 3, np.nan]
}

df2 = pd.DataFrame(students)
df2

In [None]:
marks = {
    'MarkId': [1, 2, 3, 4, 5, 6, 7, 8],
    'StudentId': [1, 2, 3, 4, 5, 6, 7, 8],
    'Mark': [18, 20, 16, 19, 14, 20, 20, 20]
}

df3 = pd.DataFrame(marks)
df3

In [None]:
pd.merge(df2, df1, how='inner', on='DepartmentId')

In [None]:
pd.merge(df1, df2, how='inner', on='DepartmentId')

In [None]:
pd.merge(df1, df2, how='outer', on='DepartmentId')

In [None]:
pd.merge(df2, df1, how='right', on='DepartmentId')

In [None]:
pd.merge(df3, pd.merge(df2, df1, how='inner', on='DepartmentId'), how='inner', on='StudentId')

In [None]:
data = pd.merge(df3, pd.merge(df2, df1, how='inner', on='DepartmentId'), how='inner', on='StudentId')
data

In [None]:
data[['StudentName', 'Mark', 'DepartmentName']]

- ref: https://stackoverflow.com/a/48411543
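
To see which rows found a match, `pd.merge` accepts `indicator=True`. A reduced sketch of the students/departments example; the smaller tables are illustrative.

```python
import numpy as np
import pandas as pd

departments = pd.DataFrame({"DepartmentId": [1, 2], "DepartmentName": ["IT", "Physics"]})
students = pd.DataFrame({
    "StudentName": ["Michael", "Sara", "Jena"],
    "DepartmentId": [1, 2, np.nan],
})

# how='left' keeps every student; indicator=True adds a `_merge` column
# that says whether each row matched ('both') or not ('left_only').
merged = pd.merge(students, departments, how="left", on="DepartmentId", indicator=True)
print(merged)
```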

## Concatenation

In [None]:
d1 = {
    'C0': ['C0R0', 'C0R1', 'C0R2'],
    'C1': ['C1R0', 'C1R1', 'C1R2'],
    'C2': ['C2R0', 'C2R1', 'C2R2'],
}

df1 = pd.DataFrame(d1)
df1

In [None]:
d2 = {
    'C0': ['C0R3', 'C0R4', 'C0R5'],
    'C1': ['C1R3', 'C1R4', 'C1R5'],
    'C2': ['C2R3', 'C2R4', 'C2R5'],
}

df2 = pd.DataFrame(d2)
df2

In [None]:
d3 = {
    'C0': ['C0R6', 'C0R7', 'C0R8'],
    'C1': ['C1R6', 'C1R7', 'C1R8'],
    'C2': ['C2R6', 'C2R7', 'C2R8'],
}

df3 = pd.DataFrame(d3)
df3

In [None]:
pd.concat([df1, df2, df3])

In [None]:
## Concatenation -- Fix index

d1 = {
    'C0': ['C0R0', 'C0R1', 'C0R2'],
    'C1': ['C1R0', 'C1R1', 'C1R2'],
    'C2': ['C2R0', 'C2R1', 'C2R2'],
}

df1 = pd.DataFrame(d1, index=[1, 2, 3])
df1

In [None]:
d2 = {
    'C0': ['C0R3', 'C0R4', 'C0R5'],
    'C1': ['C1R3', 'C1R4', 'C1R5'],
    'C2': ['C2R3', 'C2R4', 'C2R5'],
}

df2 = pd.DataFrame(d2, index=[4, 5, 6])
df2

In [None]:
d3 = {
    'C0': ['C0R6', 'C0R7', 'C0R8'],
    'C1': ['C1R6', 'C1R7', 'C1R8'],
    'C2': ['C2R6', 'C2R7', 'C2R8'],
}

df3 = pd.DataFrame(d3, index=[7, 8, 9])
df3

In [None]:
pd.concat([df1, df2, df3])
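
Instead of assigning indexes by hand, `pd.concat` can renumber the rows with `ignore_index=True`. A minimal sketch with two small frames:

```python
import pandas as pd

df1 = pd.DataFrame({"C0": ["C0R0", "C0R1"], "C1": ["C1R0", "C1R1"]})
df2 = pd.DataFrame({"C0": ["C0R2", "C0R3"], "C1": ["C1R2", "C1R3"]})

stacked = pd.concat([df1, df2])                         # keeps the original labels
renumbered = pd.concat([df1, df2], ignore_index=True)   # assigns a fresh 0..n-1 index

print(stacked.index.tolist())
print(renumbered.index.tolist())
```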

## More Pandas Operations

In [None]:
data['DepartmentName'].unique()

In [None]:
data['DepartmentName'].nunique()

In [None]:
data['DepartmentName'].value_counts()

In [None]:
data[data['Mark']>17]

## Lambda with Pandas
- Scale marks by 5

In [None]:
def times5(val):
    return val * 5

data['Mark'].apply(times5)

In [None]:
data['Mark'].apply(lambda val: val*5)

- Uppercase all department names

In [None]:
def upper(string):
    return string.upper()

data['DepartmentName'].apply(upper)

In [None]:
data['DepartmentName'].apply(lambda string: string.upper())

In [None]:
mapping = {18: 'B', 14: 'C', 19: 'A-', 20: 'A+'}
df3['Mark'].map(mapping)
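
For simple element-wise work, vectorized operations give the same result as `apply` and are usually faster. A sketch with a small illustrative frame:

```python
import pandas as pd

data = pd.DataFrame({
    "Mark": [18, 20, 16],
    "DepartmentName": ["IT", "Physics", "Arts"],
})

scaled = data["Mark"] * 5                     # same result as apply(lambda val: val*5)
upper = data["DepartmentName"].str.upper()    # same result as apply(lambda s: s.upper())

print(scaled.tolist())
print(upper.tolist())
```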

## Dropping columns

In [None]:
data.columns

In [None]:
data.drop(['StudentId', 'MarkId' , 'DepartmentId'], axis=1)

## Sorting

In [None]:
data.sort_values('Mark')

In [None]:
data.sort_values('Mark', ascending=False)

## Importing CSV, TSV

In [None]:
data = pd.read_csv('students.tsv', sep='\t', names=['lastname', 'firstname', 'username', 'exam1', 'exam2', 'exam3'])
data

In [None]:
data.sort_values('exam1', ascending=False)

In [None]:
data[['exam1', 'exam2', 'exam3']].mean()

In [None]:
data['average']= np.mean(data[['exam1', 'exam2', 'exam3']], axis=1)

In [None]:
data.sort_values('average', ascending=False)

In [None]:
data.to_csv('output.tsv', sep='\t', index=False, header=False)
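
The same read/compute/write cycle can be tried without a file on disk by using `io.StringIO`. This is a sketch: the two rows stand in for `students.tsv` and are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for students.tsv; these rows are invented.
tsv_text = "Doe\tJane\tjdoe\t90\t85\t88\nRoe\tRick\trroe\t70\t95\t81\n"
names = ["lastname", "firstname", "username", "exam1", "exam2", "exam3"]

data = pd.read_csv(io.StringIO(tsv_text), sep="\t", names=names)
data["average"] = data[["exam1", "exam2", "exam3"]].mean(axis=1)

# Write back as TSV without the index or a header row.
buffer = io.StringIO()
data.to_csv(buffer, sep="\t", index=False, header=False)
print(buffer.getvalue())
```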

## Other methods

In [None]:
data.head()

In [None]:
data.head(2)

In [None]:
data.tail()

In [None]:
data.tail(3)

In [None]:
data.shape

In [None]:
data.iloc[3]

In [None]:
data.columns

In [None]:
data.dtypes

In [None]:
data.info()

In [None]:
data.dtypes.value_counts()  # get_dtype_counts() was removed in pandas 1.0

In [None]:
data.describe()

## More data manipulation

In [None]:
data[data['exam1'].between(75, 85)]

In [None]:
# `in` is a Python keyword, so data['exam1'].in([75, 85, 95]) is a syntax error; the method is called `isin`

In [None]:
data[data['exam1'].isin([75, 85, 95])]

In [None]:
data['exam1'].unique()

In [None]:
data['exam1'].nunique()

In [None]:
np.sort(data['exam1'].unique())

## Example 5 in pandas