# INFLAMMATION DATA

## Scenario: A Miracle Arthritis Inflammation Cure
Our imaginary colleague “Dr. Maverick” has invented a new miracle drug that promises to cure arthritis inflammation flare-ups only 3 weeks after a patient first takes the medication! Naturally, we wish to see the clinical trial data, and after months of asking, they have finally provided us with a CSV spreadsheet containing it.

The CSV file contains the number of inflammation flare-ups per day for the 60 patients in the initial clinical trial, with the trial lasting 40 days. Each row corresponds to a patient, and each column corresponds to a day in the trial. Once a patient has their first inflammation flare-up they take the medication and wait a few weeks for it to take effect and reduce flare-ups.

To see how effective the treatment is we would like to:

1. Calculate the average inflammation per day across all patients.
2. Plot the result to discuss and share with colleagues.

# Loading data into Python

In [1]:
# NumPy = Numerical Python
import numpy as np

In [3]:
# Call np.loadtxt(fname, delimiter) to read the comma-separated file
# and store the resulting array in the variable data
data = np.loadtxt(fname='inflammation-01.csv', delimiter=',')

In [4]:
print(data)

[[0. 0. 1. ... 3. 0. 0.]
 [0. 1. 2. ... 1. 0. 1.]
 [0. 1. 1. ... 2. 1. 1.]
 ...
 [0. 1. 1. ... 1. 1. 1.]
 [0. 0. 0. ... 0. 2. 0.]
 [0. 0. 1. ... 1. 1. 0.]]
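To see what `np.loadtxt` does without needing the trial file, here is a minimal sketch that parses a tiny, made-up CSV held in memory. The sample values and the names `csv_text` and `sample` are assumptions for illustration only, not part of the trial data.

```python
import io
import numpy as np

# Hypothetical 3-patient, 4-day sample standing in for the real CSV file.
csv_text = "0,1,2,1\n0,2,3,1\n0,0,1,2\n"

# np.loadtxt accepts a file-like object as well as a filename.
sample = np.loadtxt(io.StringIO(csv_text), delimiter=',')

print(sample)
print(sample.shape)  # (3, 4): 3 rows (patients), 4 columns (days)
print(sample.dtype)  # float64: loadtxt parses values as floats by default
```

Each line of text becomes one row of the array, and the delimiter decides where one column ends and the next begins.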


In [5]:
# What type of object is the variable data?
print(type(data))

<class 'numpy.ndarray'>


In [6]:
# What type of items/elements are stored in data?
print(data.dtype)

float64


In [7]:
# Size of data
print(data.shape)

# 60 rows, 40 columns

(60, 40)


In [8]:
# Python starts counting at 0
print('first value in data:', data[0,0])

first value in data: 0.0


In [10]:
print('middle value in data:', data[30,20])

middle value in data: 13.0
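The `[row, column]` indexing used above can be tried on a small array where each value encodes its own position. This is only a sketch; the array `grid` is made up for the illustration.

```python
import numpy as np

# Hypothetical 3x4 array: the tens digit is the row, the units digit the column.
grid = np.array([[ 0,  1,  2,  3],
                 [10, 11, 12, 13],
                 [20, 21, 22, 23]])

first = grid[0, 0]    # row 0, column 0
middle = grid[1, 2]   # row 1, column 2
last = grid[-1, -1]   # negative indices count back from the end

print(first, middle, last)  # 0 12 23
```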


# Slicing Data

In [11]:
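# Slice ranges include the start index but not the end index
# Upper bound minus lower bound = number of values in the slice
# Rows: start at row 0 and go up to, but not including, row 4
# Columns: start at column 0 and go up to, but not including, column 10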

print(data[0:4, 0:10])

[[0. 0. 1. 3. 1. 2. 4. 7. 8. 3.]
 [0. 1. 2. 1. 2. 1. 3. 2. 2. 6.]
 [0. 1. 1. 3. 3. 2. 6. 2. 5. 9.]
 [0. 0. 2. 0. 4. 2. 2. 1. 6. 7.]]


In [13]:
# Bounds can be omitted: a missing lower bound defaults to 0,
# and a missing upper bound defaults to the end of the axis

small = data[:3, 36:]
print('small is:')
print(small)

small is:
[[2. 3. 0. 0.]
 [1. 1. 0. 1.]
 [2. 2. 1. 1.]]
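The slice defaults can be checked on a small array. A minimal sketch, assuming a made-up 4x5 array named `grid`:

```python
import numpy as np

grid = np.arange(20).reshape(4, 5)  # hypothetical 4x5 array holding 0..19

top_left = grid[:2, :3]    # rows 0-1, columns 0-2 (lower bounds default to 0)
right_edge = grid[:, 3:]   # all rows, columns 3 to the end

print(top_left.shape)    # (2, 3)
print(right_edge.shape)  # (4, 2)
```

In both cases the number of rows or columns selected equals the upper bound minus the lower bound, once the defaults are filled in.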


# Analyzing data

In [14]:
# Mean function takes array as argument
print(np.mean(data))

6.14875


In [15]:
# Not all functions need an argument, but the parentheses () are still required to call them
import time
print(time.ctime())

Wed Feb  1 17:00:12 2023


In [17]:
# Method: Multiple assignment
maxval, minval, stdval = np.max(data), np.min(data), np.std(data)

print('Descriptive Statistics')
print('max inflammation:', maxval)
print('min inflammation:', minval)
print('std dev:', stdval)


Descriptive Statistics
max inflammation: 20.0
min inflammation: 0.0
std dev: 4.613833197118566
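One detail worth knowing about `np.std`: by default it computes the population standard deviation (it divides by N). Passing `ddof=1` gives the sample standard deviation (dividing by N - 1). The sketch below uses a made-up array `toy` to show the difference.

```python
import numpy as np

toy = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])  # hypothetical values, not trial data

mean_val = np.mean(toy)
max_val, min_val = np.max(toy), np.min(toy)
pop_std = np.std(toy)             # default ddof=0: divide by N
sample_std = np.std(toy, ddof=1)  # ddof=1: divide by N - 1

print(mean_val, max_val, min_val)  # 3.5 6.0 1.0
print(pop_std < sample_std)        # True
```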


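# To find functions
In a notebook, type np. and press Tab to bring up a list of available functions.

For a function's help page, use either of the following (replace function with the function's name, e.g. mean):
1. np.function?
2. help(np.function)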

# Analyzing individual data

In [19]:
patient_0 = data[0,:] # 0 on the first axis (rows), everything on the second (columns)
print('maximum inflammation for patient 0:', np.max(patient_0))

maximum inflammation for patient 0: 18.0


In [21]:
print('maximum inflammation for patient 2:', np.max(data[2,:]))

maximum inflammation for patient 2: 19.0


# Analyzing across axis

In [23]:
# axis=0 runs down the rows, so the mean is taken over patients
# Result: average inflammation for each day across all patients (one value per column)
print(np.mean(data, axis=0))

[ 0.          0.45        1.11666667  1.75        2.43333333  3.15
  3.8         3.88333333  5.23333333  5.51666667  5.95        5.9
  8.35        7.73333333  8.36666667  9.5         9.58333333 10.63333333
 11.56666667 12.35       13.25       11.96666667 11.03333333 10.16666667
 10.          8.66666667  9.15        7.25        7.33333333  6.58333333
  6.06666667  5.95        5.11666667  3.6         3.3         3.56666667
  2.48333333  1.5         1.13333333  0.56666667]


In [24]:
# Check shape of array - should be 40 bc data has 40 cols
print(np.mean(data, axis=0).shape)

(40,)


In [25]:
# axis=1 runs across the columns, so the mean is taken over days
# Result: average inflammation for each patient across all days (one value per row)
print(np.mean(data, axis=1))

[5.45  5.425 6.1   5.9   5.55  6.225 5.975 6.65  6.625 6.525 6.775 5.8
 6.225 5.75  5.225 6.3   6.55  5.7   5.85  6.55  5.775 5.825 6.175 6.1
 5.8   6.425 6.05  6.025 6.175 6.55  6.175 6.35  6.725 6.125 7.075 5.725
 5.925 6.15  6.075 5.75  5.975 5.725 6.3   5.9   6.75  5.925 7.225 6.15
 5.95  6.275 5.7   6.1   6.825 5.975 6.725 5.7   6.25  6.4   7.05  5.9  ]


In [26]:
# Shape should be 60 bc data has 60 rows
print(np.mean(data, axis=1).shape)

(60,)
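The axis argument is easier to follow on an array small enough to check by hand. A minimal sketch, assuming a made-up array `toy` with 2 patients and 3 days:

```python
import numpy as np

# Hypothetical data: 2 patients (rows) by 3 days (columns).
toy = np.array([[1.0, 2.0, 3.0],
                [3.0, 4.0, 5.0]])

per_day = np.mean(toy, axis=0)      # collapses the rows: one value per column
per_patient = np.mean(toy, axis=1)  # collapses the columns: one value per row

print(per_day)      # [2. 3. 4.]
print(per_patient)  # [2. 4.]
```

A useful way to remember it: the axis you name is the one that disappears from the shape, so `axis=0` turns `(2, 3)` into `(3,)` and `axis=1` turns it into `(2,)`.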


# EXERCISES

## Slicing Strings

In [29]:
element = 'oxygen'
print('first 3 chars:', element[0:3])
print('last 3 chars:', element[3:6])

first 3 chars: oxy
last 3 chars: gen


In [30]:
print(element[:4]) # oxyg
print(element[4:]) # predicted gen - WRONG - miscounted; index 4 is 'e', so the result is en
print(element[:]) # oxygen

oxyg
en
oxygen


In [31]:
print(element[-1]) # predicted e - WRONG - negative indices start at -1 (the last char), so this is n
print(element[-2]) # predicted g - WRONG - same mistake; -2 is the second-to-last char, e

n
e


In [32]:
print(element[1:-1]) # predicted xygen - WRONG - the end index -1 is excluded, so the last char is dropped

xyge


In [33]:
print(element[-3:]) # last three chars? gen

gen


In [37]:
t1 = 'carpentry'
t2 = 'clone'
t3 = 'hi'

print(t1[-3:]) # try
print(t2[-3:]) # one
print(t3[-3:]) # hi

try
one
hi
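The results above rely on two rules: negative indices count back from the end starting at -1, and slice bounds that run past the ends of a string are clipped rather than raising an error. A short sketch contrasting slicing with plain indexing (the variable names are made up for the example):

```python
element = 'oxygen'

last_char = element[-1]    # 'n': index -1 is the last character
last_three = element[-3:]  # 'gen'
clipped = 'hi'[-3:]        # 'hi': out-of-range slice bounds are clipped

print(last_char, last_three, clipped)

# Indexing (as opposed to slicing) past the end does raise an error.
try:
    'hi'[-3]
    raised = False
except IndexError:
    raised = True

print('IndexError raised:', raised)
```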


## Thin Slices

In [42]:
# Given that element[3:3] is an empty string, what will the lines below display?
print(data[3:3, 4:4]) # [] empty array
print(data[3:3, 4:4].shape) # 0 rows, 0 cols

[]
(0, 0)


In [43]:
print(data[3:3,:]) # [] empty bc the row slice 3:3 selects zero rows
print(data[3:3,:].shape) # 0 rows, 40 cols

[]
(0, 40)
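An empty slice still keeps track of how many columns it would have had, which is why the shapes differ. A minimal sketch using a zero-filled stand-in with the same shape as `data` (the name `grid` is an assumption for the example):

```python
import numpy as np

grid = np.zeros((60, 40))  # hypothetical stand-in with the same shape as data

empty_both = grid[3:3, 4:4]  # zero rows, zero columns
empty_rows = grid[3:3, :]    # zero rows, all 40 columns
one_row = grid[3:4, :]       # widening the slice by one selects row 3

print(empty_both.shape)  # (0, 0)
print(empty_rows.shape)  # (0, 40)
print(one_row.shape)     # (1, 40)
print(empty_rows.size)   # 0 elements in total
```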


## Stacking Arrays

In [45]:
# vstack = vertical stacking
# hstack = horizontal stacking

A = np.array([[1,2,3], [4,5,6], [7, 8, 9]])
print('A = ')
print(A)

B = np.hstack([A, A])
print('B = ')
print(B)

C = np.vstack([A, A])
print('C = ')
print(C)

A = 
[[1 2 3]
 [4 5 6]
 [7 8 9]]
B = 
[[1 2 3 1 2 3]
 [4 5 6 4 5 6]
 [7 8 9 7 8 9]]
C = 
[[1 2 3]
 [4 5 6]
 [7 8 9]
 [1 2 3]
 [4 5 6]
 [7 8 9]]


### Slice the first and last columns of A and stack them into a 3x2 array

In [50]:
# Option 1

sliceA = np.hstack((A[:,:1], A[:,-1:]))
print(sliceA)

[[1 3]
 [4 6]
 [7 9]]


In [51]:
# produces a 1D array that will not stack correctly
print(A[:,0])

[1 4 7]


In [55]:
t = np.hstack((A[:,0], A[:,0]))
print(t)

[1 4 7 1 4 7]


In [56]:
# produces a 2D array with a single column (shape (3, 1)), which stacks correctly
print(A[:,:1])

[[1]
 [4]
 [7]]


In [57]:
# Option 2
# Use np.delete to remove the second column (index 1 along axis 1)

D = np.delete(A, 1, 1)
print(D)

[[1 3]
 [4 6]
 [7 9]]
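Two further ways to reach the same 3x2 result, offered as alternatives rather than as the exercise's intended answer: indexing with a list of column positions, and `np.column_stack`, which accepts 1D arrays and places them side by side as columns.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Option 3: pick columns 0 and -1 in one step with a list of indices.
ends = A[:, [0, -1]]

# Option 4: column_stack turns each 1D slice into a column before joining.
stacked = np.column_stack((A[:, 0], A[:, -1]))

print(ends)
print(ends.shape)                     # (3, 2)
print(np.array_equal(ends, stacked))  # True
```

Unlike `np.delete`, these two keep working unchanged if A has more than three columns.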


## Change In Inflammation
The patient data is longitudinal in the sense that each row represents a series of observations relating to one individual. This means that the change in inflammation over time is a meaningful concept. Let’s find out how to calculate changes in the data contained in an array with NumPy.

The numpy.diff() function takes an array and returns the differences between two successive values. Let’s use it to examine the changes each day across the first week of patient 3 from our inflammation dataset.

In [58]:
pt3_wk1 = data[3,:7]
print(pt3_wk1)

[0. 0. 2. 0. 4. 2. 2.]


In [60]:
# np.diff(pt3_wk1) is essentially doing this
# [0-0, 2-0, 0-2, 4-0, 2-4, 2-2]

np.diff(pt3_wk1)

array([ 0.,  2., -2.,  4., -2.,  0.])

Note that the array of differences is shorter by one element (length 6).
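The same differences can be written as a subtraction of two shifted slices, which also makes it clear why the result has one fewer element. A small sketch using the week of values printed above (the names `week` and `manual` are made up for the example):

```python
import numpy as np

# First-week values for patient 3, copied from the output above.
week = np.array([0., 0., 2., 0., 4., 2., 2.])

# Each element minus its predecessor: week[1:] drops the first value,
# week[:-1] drops the last, so both slices have length 6.
manual = week[1:] - week[:-1]

print(manual)
print(np.array_equal(manual, np.diff(week)))  # True
```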

When calling numpy.diff with a multi-dimensional array, an axis argument may be passed to the function to specify which axis to process. When applying numpy.diff to our 2D inflammation array data, which axis would we specify?

In [68]:
# we want difference bw columns so axis 1

diffs = np.diff(data, axis=1)
print(diffs)

[[ 0.  1.  2. ...  1. -3.  0.]
 [ 1.  1. -1. ...  0. -1.  1.]
 [ 1.  0.  2. ...  0. -1.  0.]
 ...
 [ 1.  0.  0. ... -1.  0.  0.]
 [ 0.  0.  1. ... -2.  2. -2.]
 [ 0.  1. -1. ... -2.  0. -1.]]


If the shape of an individual data file is (60, 40) (60 rows and 40 columns), what would the shape of the array be after you run the diff() function and why?

In [63]:
# (60, 39): the array is still 2D, but we lose one column, since there is one fewer difference bw columns than total # of cols
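The shape can be confirmed without the data file by running `np.diff` on a zero-filled array of the same shape. This is a sketch; the name `grid` is an assumption for the example.

```python
import numpy as np

grid = np.zeros((60, 40))  # hypothetical stand-in with the same shape as data

across_days = np.diff(grid, axis=1)      # differences between adjacent columns
across_patients = np.diff(grid, axis=0)  # differences between adjacent rows

print(across_days.shape)      # (60, 39)
print(across_patients.shape)  # (59, 40)
```

Whichever axis is differenced shrinks by one; the other axis is unchanged.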

How would you find the largest change in inflammation for each patient? Does it matter if the change in inflammation is an increase or a decrease?

In [72]:
maxinf = np.max(diffs, axis=1) # indicate axis to get info for just that axis
print(maxinf)

# decreases show up as negative values, so np.max here returns each patient's largest increase


[ 7. 12. 11. 10. 11. 13. 10.  8. 10. 10.  7.  7. 13.  7. 10. 10.  8. 10.
  9. 10. 13.  7. 12.  9. 12. 11. 10. 10.  7. 10. 11. 10.  8. 11. 12. 10.
  9. 10. 13. 10.  7.  7. 10. 13. 12.  8.  8. 10. 10.  9.  8. 13. 10.  7.
 10.  8. 12. 10.  7. 12.]


In [75]:
# if interested in magnitude of change - look at abs value
maxinfabs = np.max(np.absolute(diffs), axis=1)
print(maxinfabs)

[12. 14. 11. 13. 11. 13. 10. 12. 10. 10. 10. 12. 13. 10. 11. 10. 12. 13.
  9. 10. 13.  9. 12.  9. 12. 11. 10. 13.  9. 13. 11. 11.  8. 11. 12. 13.
  9. 10. 13. 11. 11. 13. 11. 13. 13. 10.  9. 10. 10.  9.  9. 13. 10.  9.
 10. 11. 13. 10. 10. 12.]
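To make the increase-versus-magnitude distinction concrete, here is a sketch on made-up values for two patients. It also uses `np.argmax` to report where in the differences the largest change occurs; the names `toy`, `toy_diffs`, and `when` are assumptions for the example.

```python
import numpy as np

# Hypothetical data: 2 patients by 4 days.
toy = np.array([[0., 3., 1., 2.],
                [1., 1., 6., 0.]])

toy_diffs = np.diff(toy, axis=1)  # [[3, -2, 1], [0, 5, -6]]

largest_increase = np.max(toy_diffs, axis=1)         # signed: ignores big drops
largest_change = np.max(np.abs(toy_diffs), axis=1)   # magnitude: counts drops too
when = np.argmax(np.abs(toy_diffs), axis=1)          # position of that change

print(largest_increase)  # [3. 5.]
print(largest_change)    # [3. 6.]
print(when)              # [0 2]
```

For the second patient the biggest increase is 5, but the biggest change overall is the drop of 6 between the last two days, which only the absolute-value version picks up.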


# Use numpy.mean(array), numpy.max(array), and numpy.min(array) to calculate simple statistics.

# Use numpy.mean(array, axis=0) or numpy.mean(array, axis=1) to calculate statistics across the specified axis.