# Session 3: Numpy and Pandas
MSA 8010: Data Programming

### Agenda
1. Analyzing Patient Data (Numpy) with:
    1. Loops
    1. Functions
    1. Plotting
1. Pandas
1. Assignment #1

Source: https://swcarpentry.github.io/python-novice-inflammation/02-numpy/index.html

<img src="python-zero-index.svg" />

In [1]:
import numpy
data = numpy.array([["A","B","C"],["D","E","F"],["G","H","I"]])
print(data)

[['A' 'B' 'C']
 ['D' 'E' 'F']
 ['G' 'H' 'I']]


In [2]:
# The type function will only tell you that a variable is a NumPy array
print(type(data))

<class 'numpy.ndarray'>


In [3]:
# dtype gives the type of the array's items. Numeric arrays hold integers or
# floating-point numbers; this array holds one-character strings, shown as <U1.
print("type=",data.dtype)

type= <U1


In [4]:
# shape: an array's dimensions, given as a tuple.
# For example, a 5×3 array's shape is (5, 3).
print(data.shape)

(3, 3)


In [5]:
# To view a single item from the array,
# provide its row and column index in square brackets after the variable name.
print(data[0,1])

B


### Exercise #1
1. Create the following numpy array:

    $\begin{bmatrix}1 & 4 & 7\\9 & 12 & 19\end{bmatrix}$


1. Print the contents
1. Print the datatype
1. Print the shape
1. Print the item at the [second row, third column] (_The answer should be **19**_)
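
One possible solution to Exercise #1, sketched in the same style as the cells above (the variable name `m` is only an illustrative choice):

```python
import numpy as np

# Build the 2x3 array from the exercise
m = np.array([[1, 4, 7],
              [9, 12, 19]])

print(m)         # contents
print(m.dtype)   # an integer dtype (the exact name depends on the platform)
print(m.shape)   # (2, 3)
print(m[1, 2])   # second row, third column: 19
```

Indices start at zero, so the second row is index 1 and the third column is index 2.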

In [6]:
#Range and reshape
import numpy as np
# arange is analogous to Python's built-in range, but returns an array.
a = np.arange(15)
print(a)
print("shape:",a.shape)
print("the number of axes (dimensions) of the array:",a.ndim)
print("first:", a[0])
#print(a[0,0]) #IndexError

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
shape: (15,)
the number of axes (dimensions) of the array: 1
first: 0


In [7]:
a = np.arange(15).reshape(3, 5)
print(a)
print("shape:",a.shape)
print("dimensions",a.ndim)
print("first:", a[0,0]) #first
# The total number of elements of the array. 
#This is equal to the product of the elements of shape.
print("size",a.size)

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
shape: (3, 5)
dimensions 2
first: 0
size 15


In [8]:
#From: https://numpy.org/devdocs/user/quickstart.html
#Zeros
a = np.zeros((3, 4))
print(a)
print("----")
#Ones
a = np.ones((2, 3, 4), dtype=np.int16)
print(a)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
----
[[[1 1 1 1]
  [1 1 1 1]
  [1 1 1 1]]

 [[1 1 1 1]
  [1 1 1 1]
  [1 1 1 1]]]


Numpy data types: [https://www.tutorialspoint.com/numpy/numpy_data_types.htm](https://www.tutorialspoint.com/numpy/numpy_data_types.htm)
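
As a small illustration of how NumPy picks a dtype from the values it is given, the sketch below builds three arrays and converts one of them with `astype`. The variable names are illustrative only.

```python
import numpy as np

ints = np.array([1, 2, 3])           # integers give an integer dtype
floats = np.array([1.0, 2.5, 3.0])   # floats give float64
letters = np.array(["A", "B"])       # one-character strings give a Unicode dtype

print(ints.dtype.kind)      # i (signed integer)
print(floats.dtype)         # float64
print(letters.dtype.kind)   # U (Unicode string)

# astype returns a converted copy; the original array is unchanged
as_float = ints.astype(np.float64)
print(as_float)             # [1. 2. 3.]
```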

In [9]:
#Basic Operations
A = np.array([[1, 1], 
              [0, 1]])
B = np.array([[2, 0], 
              [3, 4]])
print("A=",A)
print("B=",B)  
C = A - B
print("----")
print("A-B=",C)
print("elementwise product=", A * B)
print("matrix product=", A @ B) # or A.dot(B)
D = np.random.randn(2,2)
print("D=",D)

A= [[1 1]
 [0 1]]
B= [[2 0]
 [3 4]]
----
A-B= [[-1  1]
 [-3 -3]]
elementwise product= [[2 0]
 [0 4]]
matrix product= [[5 4]
 [3 4]]
D= [[-0.18102953  0.96300237]
 [-0.97042196  0.71261168]]


### Exercise #2
Write a function, _matrix_generator_, that takes two numbers (_a_, and _b_) and returns a two dimensional matrix of _a_ rows, and _b_ columns filled with random numbers.

In [84]:
import numpy as np
def matrix_generator(a,b):
    m = np.random.randn(a,b)
    return m

print("testing 3,5")
print(matrix_generator(3,5))

testing 3,5
[[-0.61321733  0.07940577  1.57483197  0.30476885 -0.29453081]
 [-0.0922119  -1.1374506   1.47268604 -0.7688336  -0.58400464]
 [-0.89309743  1.09606599 -1.04919403  1.34089769 -0.50734703]]


### Exercise #3
1. Write a function, _matrix_props_, that takes two parameters: _matrix_ and _prop_. Based on the value of _prop_ (which can be **shape**, **dtype**, or **size**), the function should print the corresponding property of the matrix.
2. If the value of _prop_ is not one of these, print _Not recognized_.

In [11]:
def matrix_props(matrix, prop):
    if prop == "shape":
        print(matrix.shape)
    elif prop == "dtype":
        print(matrix.dtype)
    elif prop == "size":
        print(matrix.size)
    else:
        print("Not recognized")

In [12]:
# Test
a = np.random.randn(4,2)
matrix_props(a, "shape")

(4, 2)


In [13]:
matrix_props(a, "dtype")

float64


In [14]:
matrix_props(a, "size")

8


In [15]:
matrix_props(a, "hello")

Not recognized


## Transposing a Matrix

At times it is useful to transpose a matrix for conformability: in order to multiply two matrices, the number of columns of the first must match the number of rows of the second, so we may need to switch the row and column dimensions of a matrix.  Consider the matrix
$$
\begin{equation}
	A_{3 \times 2}=\begin{bmatrix}
	  a_{11} & a_{12} \\
	  a_{21} & a_{22} \\
	  a_{31} & a_{32} 	
	\end{bmatrix}_{3 \times 2}	
\end{equation}
$$
The transpose of A (denoted as $A^{\prime}$) is
$$
\begin{equation}
   A^{\prime}=\begin{bmatrix}
	  a_{11} & a_{21} & a_{31} \\
	  a_{12} & a_{22} & a_{32} \\
	\end{bmatrix}_{2 \times 3}
\end{equation}
$$

In [16]:
A = np.arange(6).reshape((3,2))
B = np.arange(8).reshape((2,4))
print ("A is")
print (A)

print ("The Transpose of A is")
print (A.T)

A is
[[0 1]
 [2 3]
 [4 5]]
The Transpose of A is
[[0 2 4]
 [1 3 5]]


Let matrix A be of dimension $N \times M$ and let B be of dimension $M \times P$.  
Then
$$
\begin{equation}
	(AB)^{\prime}=B^{\prime}A^{\prime}
\end{equation}
$$

In [17]:
# Testing the above rule
matrix1 = B.T.dot(A.T)
matrix2 = (A.dot(B)).T
print(matrix1)
print("----")
print(matrix2)

numpy.testing.assert_array_equal(matrix1, matrix2)
# More info about testing numpy arrays:
# https://numpy.org/doc/stable/reference/generated/numpy.testing.assert_array_equal.html


[[ 4 12 20]
 [ 5 17 29]
 [ 6 22 38]
 [ 7 27 47]]
----
[[ 4 12 20]
 [ 5 17 29]
 [ 6 22 38]
 [ 7 27 47]]


### Slicing

In [85]:
a = np.arange(10,20) 
s = slice(2,7,2) #slice(start index, end index, step)
# start_position: included
# end_position: excluded
# step: default is 1
print("s=",s)
print("a=",a)
print("a[s]=",a[s])
print("Similarly, a[2:7:2]=", a[2:7:2])

s= slice(2, 7, 2)
a= [10 11 12 13 14 15 16 17 18 19]
a[s]= [12 14 16]
Similarly, a[2:7:2]= [12 14 16]


In [19]:
a = np.array([[1,2,3],[3,4,5],[4,5,6]]) 
print (a)

# slice items starting from index
print ('Now we will slice the array from the index a[1:]' )
print (a[1:])
print ("slicing along 2 dimensions a[1:,2]")
print (a[1:,2])

[[1 2 3]
 [3 4 5]
 [4 5 6]]
Now we will slice the array from the index a[1:]
[[3 4 5]
 [4 5 6]]
slicing along 2 dimensions a[1:,2]
[5 6]


### Exercise #4
Considering `a = np.arange(18)`, write code that prints the transpose of the 3x3 matrix formed from the even elements of `a`.

In [87]:
a = np.arange(18)
sliced = a[0:18:2] #or, a[::2]
reshaped = sliced.reshape([3,3])
transposed = reshaped.T
print(transposed)
#In one line: print(a[0:18:2].reshape([3,3]).T)

[[ 0  6 12]
 [ 2  8 14]
 [ 4 10 16]]


### Logical Operations

In [21]:
A = np.arange(6).reshape((3,2))
print(A)
print("\nA[:,1]>4:")
print(A[:,1]>4)


[[0 1]
 [2 3]
 [4 5]]

A[:,1]>4:
[False False  True]


In [22]:
A = np.random.rand(3,3)*10
print(A, '\n')
print(A < 5)

[[3.901876   7.42884964 3.94506222]
 [7.57916109 8.14617429 4.61938188]
 [1.26145783 6.13009017 0.52385434]] 

[[ True False  True]
 [False False  True]
 [ True False  True]]


In [23]:
A = np.random.rand(3,3)*10
print(A)
print("\nItems less than 5:\n", A[A < 5])
print("\nAre all items less than 5?", np.all(A<5))
print("Make all items that are greater than or equal to 5 zero:")
A[A>=5] = 0
print(A)
print("\nAre all items less than 5 now?", np.all(A<5))

[[3.1037751  3.41368782 9.22430291]
 [3.39188366 3.11340422 6.05208562]
 [3.17484157 4.50158658 4.06086123]]

Items less than 5:
 [3.1037751  3.41368782 3.39188366 3.11340422 3.17484157 4.50158658
 4.06086123]

Are all items less than 5? False
Make all items that are greater than or equal to 5 zero:
[[3.1037751  3.41368782 0.        ]
 [3.39188366 3.11340422 0.        ]
 [3.17484157 4.50158658 4.06086123]]

Are all items less than 5 now? True
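
The assignment `A[A>=5] = 0` changes `A` in place. If the original values are still needed, `np.where` builds a new array instead. The sketch below uses a small fixed array (values chosen for illustration) so that the result is reproducible.

```python
import numpy as np

A = np.array([[3.0, 9.0],
              [6.0, 4.0]])

# Keep values below 5 and replace the rest with 0, without touching A
B = np.where(A < 5, A, 0)
print(B)                # [[3. 0.]
                        #  [0. 4.]]
print(np.any(A >= 5))   # True: A itself still holds values >= 5
print((A < 5).sum())    # 2, because True counts as 1 when summed
```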



- We are studying inflammation in patients who have been given a new treatment for arthritis.
- The data sets are stored in comma-separated values (CSV) format. 
- Each row holds the observations for just one patient.
- Each column holds the inflammation measured in a day, so we have a set of values in successive days. 

### Analyzing Patient Data

<img src="lesson-overview.svg" />

In [24]:
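# Using pandas to display the CSV file
# Before using it, install pandas: conda install pandas
# Note: this file has no header row, so read_csv uses the first patient's
# values as column names (see the output). Passing header=None avoids this.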
import pandas
df = pandas.read_csv('data/inflammation-01.csv')
df.head()

   0  0.1  1  3  1.1  2  4  7  8  3.1  ...  4.3  4.4  5.1  7.6  3.4  4.5  2.1  3.5  0.2  0.3
0  0    1  2  1    2  1  3  2  2    6  ...    3    5    4    4    5    5    1    1    0    1
1  0    1  1  3    3  2  6  2  5    9  ...   10    5    4    2    2    3    2    2    1    1
2  0    0  2  0    4  2  2  1  6    7  ...    3    5    6    3    3    4    2    3    2    1
3  0    1  1  3    3  1  3  5  2    4  ...    9    6    3    2    2    4    2    0    1    1
4  0    0  1  2    2  4  2  1  6    4  ...    8    4    7    3    5    4    4    3    2    1
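
Because `inflammation-01.csv` has no header row, `read_csv` used the first patient's values as column names in the output above. A minimal sketch of the difference, using a tiny in-memory CSV (the three rows are made up for illustration) rather than the real file:

```python
import io
import pandas as pd

csv_text = "0,0,1\n0,1,2\n0,1,1\n"   # illustrative data with no header row

# Default: the first row becomes the header, so one row of data is lost
with_default = pd.read_csv(io.StringIO(csv_text))
print(with_default.shape)        # (2, 3)

# header=None: every row is data and the columns are numbered 0, 1, 2
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(no_header.shape)           # (3, 3)
print(list(no_header.columns))   # [0, 1, 2]
```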


In [25]:
# To load data into a Numpy array, first we import numpy library
import numpy

# Here numpy.loadtxt is given two arguments: the name of the file we want to read and the delimiter that separates values on a line.
data = numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',')
print(data)
# The shape tells us that the data array contains 60 rows and 40 columns.
s = data.shape
print(s)
print(f"Patients:{s[0]}, Days:{s[1]}")

[[0. 0. 1. ... 3. 0. 0.]
 [0. 1. 2. ... 1. 0. 1.]
 [0. 1. 1. ... 2. 1. 1.]
 ...
 [0. 1. 1. ... 1. 1. 1.]
 [0. 0. 0. ... 0. 2. 0.]
 [0. 0. 1. ... 1. 1. 0.]]
(60, 40)
Patients:60, Days:40


### Analyzing Data

- **Mean**: The mean (informally, the "average") is found by adding all of the numbers together and dividing by the number of items in the set.
- **Variance**: The average of the squared differences from the mean.
- **Standard deviation**: A measure of the dispersion of the data. It is the square root of the variance.
    - The lower the standard deviation, the closer the data points tend to be to the mean (or expected value).
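
These definitions can be checked by hand on a small array. The sketch below (values chosen for illustration) computes each quantity from its definition and compares it with NumPy's result. By default `numpy.var` and `numpy.std` divide by the number of items (`ddof=0`), which matches the definition given here.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.sum() / x.size               # add everything, divide by the count
variance = ((x - mean) ** 2).mean()   # average squared difference from the mean
std = variance ** 0.5                 # square root of the variance

print(mean, variance, std)            # 5.0 4.0 2.0

# The hand-computed values agree with NumPy's functions
print(np.isclose(mean, np.mean(x)))       # True
print(np.isclose(variance, np.var(x)))    # True
print(np.isclose(std, np.std(x)))         # True
```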

In [26]:
print(numpy.mean(data))

6.14875


In [27]:
minval, variance, stdval = numpy.min(data), numpy.var(data), numpy.std(data)
maxval = numpy.max(data)
print('maximum inflammation:', maxval)
print('minimum inflammation:', minval)
print('variance:', variance)
print('standard deviation:', stdval)

maximum inflammation: 20.0
minimum inflammation: 0.0
variance: 21.287456770833334
standard deviation: 4.613833197118566


In [28]:
#Slicing usecases
patient_0 = data[0,:] # 0 on the first axis (rows), everything on the second (columns)
print('maximum inflammation for patient 0:', numpy.max(patient_0))
print('maximum inflammation for patient 6:', numpy.max(data[6, :]))

maximum inflammation for patient 0: 18.0
maximum inflammation for patient 6: 17.0


In [29]:
patient3_week1 = data[3,0:7]
print(patient3_week1)

[0. 0. 2. 0. 4. 2. 2.]


### Axes (axis=1, and axis=0)
What if we need the maximum inflammation for each patient over all days (as in the next diagram on the left)?

Or, the average for each day (as in the diagram on the right)?
<img src="python-operations-across-axes.png" />

In [30]:
print("Mean inflammation for each day across all patients:\n",numpy.mean(data, axis=0))
print("shape=",numpy.mean(data, axis=0).shape)

Mean inflammation for each day across all patients:
 [ 0.          0.45        1.11666667  1.75        2.43333333  3.15
  3.8         3.88333333  5.23333333  5.51666667  5.95        5.9
  8.35        7.73333333  8.36666667  9.5         9.58333333 10.63333333
 11.56666667 12.35       13.25       11.96666667 11.03333333 10.16666667
 10.          8.66666667  9.15        7.25        7.33333333  6.58333333
  6.06666667  5.95        5.11666667  3.6         3.3         3.56666667
  2.48333333  1.5         1.13333333  0.56666667]
shape= (40,)
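
A small array makes the two directions easier to see. In the sketch below (toy values, 2 patients by 3 days), `axis=0` collapses the rows and gives one value per day, while `axis=1` collapses the columns and gives one value per patient:

```python
import numpy as np

toy = np.array([[1.0, 2.0, 3.0],    # patient 0
                [3.0, 6.0, 9.0]])   # patient 1

per_day_mean = np.mean(toy, axis=0)     # one mean per column (day)
per_patient_max = np.max(toy, axis=1)   # one maximum per row (patient)

print(per_day_mean)      # [2. 4. 6.]
print(per_patient_max)   # [3. 9.]
```

Applied to the 60x40 `data` array, `axis=0` therefore returns 40 values and `axis=1` returns 60.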


### Numpy diff function
It calculates the difference between each pair of consecutive values in a NumPy array.

Hence, an array with n elements results in a diff array with n-1 elements.


In [31]:
fibs = numpy.array([0, 1, 1, 2, 3, 5, 8, 13, 21])
diff_fibs = numpy.diff(fibs)
print(diff_fibs)

[1 0 1 1 2 3 5 8]


In [32]:
# How is the inflammation changing for patient 3 during the first week?
numpy.diff(patient3_week1)

array([ 0.,  2., -2.,  4., -2.,  0.])
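
`numpy.diff` is equivalent to subtracting each element from the one after it, which can also be written with two slices. For a 2-D array, passing `axis=1` takes the differences along each row, which for the patient data would mean day to day for each patient. A sketch with toy values:

```python
import numpy as np

a = np.array([0, 1, 1, 2, 3, 5])
print(np.diff(a))       # [1 0 1 1 2]
print(a[1:] - a[:-1])   # same result, written with slices

toy = np.array([[0, 2, 5],
                [1, 1, 4]])
row_changes = np.diff(toy, axis=1)   # differences along each row
print(row_changes)      # [[2 3]
                        #  [0 3]]
```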

## Pandas

In [35]:
#From: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
import numpy as np
import pandas as pd
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

In [42]:
# Creating a pandas date range
dates = pd.date_range("20130101", periods=6)
print(dates)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')


In [58]:
# Creating a pandas DataFrame that uses the date range as its index
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df)
#df.loc["2013-01-01"]

                   A         B         C         D
2013-01-01  0.536146 -0.838323 -0.227044 -0.332658
2013-01-02 -0.723137  2.184546 -0.172544 -0.774822
2013-01-03  0.542791  0.130107  0.870087 -0.458507
2013-01-04 -1.071307 -0.703697  1.963952 -0.428011
2013-01-05 -1.397861 -1.139620 -0.844042  2.957127
2013-01-06  2.589445 -0.818587  0.940422 -1.163175
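
The commented-out `df.loc[...]` line hints at label-based selection. A brief sketch of what it does with a date index, using fixed values instead of random ones so that the result is predictable (the numbers are illustrative):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
df = pd.DataFrame(np.arange(6).reshape(3, 2), index=dates, columns=list("AB"))

# Selecting a single label returns that row as a Series
row = df.loc["2013-01-01"]
print(row["A"], row["B"])   # 0 1

# Label slices include the end label, unlike positional slices
subset = df.loc["2013-01-02":"2013-01-03", "B"]
print(subset.tolist())      # [3, 5]
```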


In [60]:
df = pd.DataFrame(np.random.randn(8, 3), columns=["A", "B", "C"])
df.head()

          A         B         C
0  0.797153  0.905492  0.383517
1 -1.752803  0.232148  0.504009
2 -0.581740  0.568094 -1.370996
3 -0.488783  0.438026 -0.834427
4  0.124785 -0.150557 -0.064028


In [62]:
df.tail(4)

          A         B         C
4  0.124785 -0.150557 -0.064028
5  0.274222 -1.004205  0.790755
6  0.177497 -0.033768 -1.190398
7 -2.015543  0.161858  0.152062


In [63]:
# List comprehensions in Python. First, the plain loop version:
squares = []
for x in range(10):
    squares.append(x**2)
print(squares)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


In [67]:
# The loop is equivalent to this list comprehension:
squares = [x**2 for x in range(10)]
print(squares)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


In [73]:
df = pd.DataFrame(np.random.randn(8, 3), columns=["A", "B", "C"])
print(df.head(1))

# We can use a list comprehension anywhere a list is expected
df.columns = [x.lower() for x in df.columns]
print("\n")
print(df.head(1))

          A         B         C
0 -0.335489 -0.401726  0.314076


          a         b         c
0 -0.335489 -0.401726  0.314076


### Exercise #5
Create a DataFrame with 4 rows and 10 columns of random values, with the following column names:
`A10, A20, A30, ..., A100`

In [83]:
s = pd.DataFrame(np.random.randn(4,10), columns=["A" + str((x+1)*10) for x in range(10)])
s

        A10       A20       A30       A40       A50       A60       A70       A80       A90      A100
0 -0.647961 -1.066783  1.016228  1.822994  1.166877 -1.113236  0.881922 -1.158390  0.266965 -0.863840
1  2.928020 -0.859714 -1.110631  0.506560 -0.225781  2.251711  0.155840  0.804569  1.247652  1.602970
2 -1.467685  0.253177  0.922355 -0.691940  0.568771  1.049467  0.654996  1.653866  2.696517 -0.676505
3 -0.175471 -0.456015 -0.551743 -1.641772  0.223262 -0.940628 -0.298156 -1.190457  0.207686  1.344661
