### Intro to Scientific Python Ecosystem

In [1]:
import numpy as np

In [2]:
!head dummy.csv

0,0,1,3,1,2,4
0,1,2,1,2,1,3
0,1,1,3,3,2,6

Read dummy dataset with numpy

In [3]:
np.loadtxt?

Signature:
np.loadtxt(
    fname,
    dtype=<class 'float'>,
    comments='#',
    delimiter=None,
    converters=None,
    skiprows=0,
    usecols=None,
    unpack=False,
    ndmin=0,
    encoding='bytes',
    max_rows=None,

In [4]:
import os
from pathlib import Path


BASE_FOLDER = Path(os.path.abspath(os.path.curdir))
DATA_FOLDER = BASE_FOLDER / "data"

In [5]:
dummy_data = np.loadtxt("dummy.csv", delimiter=",")

Playing with the attributes of `ndarray`

In [6]:
type(dummy_data)

numpy.ndarray

In [7]:
dummy_data.shape

(3, 7)

In [8]:
dummy_data.dtype

dtype('float64')

In [9]:
dummy_data

array([[0., 0., 1., 3., 1., 2., 4.],
       [0., 1., 2., 1., 2., 1., 3.],
       [0., 1., 1., 3., 3., 2., 6.]])

In [10]:
dummy_data = np.loadtxt("dummy.csv", delimiter=",", dtype=np.int32)

In [11]:
dummy_data.dtype

dtype('int32')

In [12]:
dummy_data

array([[0, 0, 1, 3, 1, 2, 4],
       [0, 1, 2, 1, 2, 1, 3],
       [0, 1, 1, 3, 3, 2, 6]], dtype=int32)
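
The effect of the `dtype` argument can be reproduced without a file on disk. The sketch below assumes the same three rows as `dummy.csv` and feeds them to `np.loadtxt` through an in-memory `io.StringIO` buffer (the variable names are illustrative only):

```python
import io

import numpy as np

# Same three rows as in dummy.csv, held in memory so no file is needed.
csv_text = "0,0,1,3,1,2,4\n0,1,2,1,2,1,3\n0,1,1,3,3,2,6\n"

# Default dtype is float; passing dtype=np.int32 stores the values as integers.
as_float = np.loadtxt(io.StringIO(csv_text), delimiter=",")
as_int = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=np.int32)

print(as_float.dtype, as_int.dtype)
# The values are identical; only the storage type differs.
print(np.array_equal(as_float, as_int))
```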

### Let's switch to some real data

$\rightarrow$ _Adapted from_ : [**Software Carpentry: Programming with Python**]()

## Arthritis Inflammation
We are studying **inflammation in patients** who have been given a new treatment for arthritis.

There are `60` patients, who had their inflammation levels recorded for `40` days.
We want to analyze these recordings to study the effect of the new arthritis treatment.

To see how the treatment is affecting the patients in general, we would like to:

1. Process the file to extract data for each patient;
2. Calculate some statistics on each patient;
    - e.g. average inflammation over the `40` days (or `min`, `max`, and so on)
    - e.g. average statistics per week (we will assume `40` days account for `5` weeks)
    - `...` (open to ideas)
3. Calculate some statistics on the dataset.
    - e.g. min and max inflammation registered overall in the clinical study;
    - e.g. the average inflammation per day across all patients.
    - `...` (open to ideas)
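
As one possible idea for the per-week statistic in step 2, the `40` days can be grouped into `5` blocks of `8` days with a reshape. This is only a sketch: it uses placeholder values rather than the real recordings, and the names `fake_data` and `weekly_avg` are made up for illustration.

```python
import numpy as np

n_patients, n_days, n_weeks = 60, 40, 5
days_per_week = n_days // n_weeks  # 8, following the "40 days = 5 weeks" assumption

# Placeholder values standing in for the real recordings.
fake_data = np.arange(n_patients * n_days).reshape(n_patients, n_days)

# Group each patient's 40 days into 5 blocks of 8 days,
# then average within each block (the last axis).
weekly_avg = fake_data.reshape(n_patients, n_weeks, days_per_week).mean(axis=2)

print(weekly_avg.shape)  # one row per patient, one column per week
print(weekly_avg[0])     # weekly averages of the first patient
```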


![3-step flowchart shows inflammation data records for patients moving to the Analysis step
where a heat map of provided data is generated moving to the Conclusion step that asks the
question, How does the medication affect patients?](
https://raw.githubusercontent.com/swcarpentry/python-novice-inflammation/gh-pages/fig/lesson-overview.svg "Lesson Overview")


### Data Format

The data sets are stored in comma-separated values (CSV) format:

- each row holds information for a single patient,
- columns represent successive days.

The first three rows of our first file look like this:
~~~
0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
~~~

Each number represents the number of inflammation bouts that a particular patient experienced on a
given day.

For example, the value "6" at row 3, column 7 of the data set above means that the third
patient experienced inflammation six times on the seventh day of the clinical study.
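
Because NumPy indices start at zero, "row 3, column 7" corresponds to index `[2, 6]`. A small sketch, using only the first seven days of the three rows shown above (the name `sample` is illustrative):

```python
import numpy as np

# First seven days of the first three patients.
sample = np.array([
    [0, 0, 1, 3, 1, 2, 4],
    [0, 1, 2, 1, 2, 1, 3],
    [0, 1, 1, 3, 3, 2, 6],
])

# Row 3, column 7 in 1-based terms is index [2, 6] in NumPy.
value = sample[2, 6]
print(value)  # 6
```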

Our **task** is to gather as much information as possible from the dataset, and to report back to colleagues to foster future discussions.

In [13]:
if_data_01 = DATA_FOLDER / "inflammation-01.csv"

In [14]:
inf_data = np.loadtxt(if_data_01, delimiter=",", dtype=np.int32)

In [15]:
inf_data.shape

(60, 40)

In [16]:
inf_data.dtype

dtype('int32')

In [17]:
inf_data.size

2400

In [18]:
inf_data.itemsize

4
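
The `size` and `itemsize` attributes together give the memory taken by the array's elements, which NumPy also exposes as `nbytes`. The sketch below assumes a stand-in array of zeros with the same shape and dtype as `inf_data`, so it runs without the data file:

```python
import numpy as np

# Stand-in array with the same shape and dtype as inf_data.
stand_in = np.zeros((60, 40), dtype=np.int32)

print(stand_in.size)      # 2400 elements
print(stand_in.itemsize)  # 4 bytes per int32 element
print(stand_in.nbytes)    # 9600, i.e. size * itemsize
```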

#### Slicing

In [19]:
inf_data[:3]

array([[ 0,  0,  1,  3,  1,  2,  4,  7,  8,  3,  3,  3, 10,  5,  7,  4,
         7,  7, 12, 18,  6, 13, 11, 11,  7,  7,  4,  6,  8,  8,  4,  4,
         5,  7,  3,  4,  2,  3,  0,  0],
       [ 0,  1,  2,  1,  2,  1,  3,  2,  2,  6, 10, 11,  5,  9,  4,  4,
         7, 16,  8,  6, 18,  4, 12,  5, 12,  7, 11,  5, 11,  3,  3,  5,
         4,  4,  5,  5,  1,  1,  0,  1],
       [ 0,  1,  1,  3,  3,  2,  6,  2,  5,  9,  5,  7,  4,  5,  4, 15,
         5, 11,  9, 10, 19, 14, 12, 17,  7, 12, 11,  7,  4,  2, 10,  5,
         4,  2,  2,  3,  2,  2,  1,  1]], dtype=int32)

In [20]:
inf_data[:3, :7]

array([[0, 0, 1, 3, 1, 2, 4],
       [0, 1, 2, 1, 2, 1, 3],
       [0, 1, 1, 3, 3, 2, 6]], dtype=int32)

### Performance Comparison: NumPy vs Lists

In [21]:
%%timeit
matrix_lol = []
for i in range(10000):
    row = list()
    for j in range(1000):
        row.append(j)
    matrix_lol.append(row)


1.03 s ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [22]:
%%timeit
matrix_np = np.empty((10000, 1000), dtype=np.int32)
for i in range(10000):
    for j in range(1000):
        matrix_np[i, j] = j

1.98 s ± 134 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [23]:
%%timeit
matrix_np_faster = np.arange(10000*1000).reshape(10000, 1000)

19.3 ms ± 699 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
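
Note that the `np.arange(...).reshape(...)` call fills the matrix with consecutive values, whereas the two loops above store the column index `j` in every row. A sketch of a vectorised construction with the same content as the loops, assuming `np.tile` is an acceptable substitute and using small sizes so the check stays quick:

```python
import numpy as np

n_rows, n_cols = 100, 10  # small sizes so the loop-based check stays quick

# Loop-based construction, as in the cells above.
looped = np.empty((n_rows, n_cols), dtype=np.int32)
for i in range(n_rows):
    for j in range(n_cols):
        looped[i, j] = j

# Vectorised construction: repeat one row of column indices n_rows times.
tiled = np.tile(np.arange(n_cols, dtype=np.int32), (n_rows, 1))

print(np.array_equal(looped, tiled))  # True
```

The speed-up comes from moving the iteration out of the Python interpreter and into NumPy's compiled routines.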


### Let's talk about patients

1. average inflammations per day (max and min)
2. median
3. standard deviation

#### Compute patients' averages with vanilla Python

In [24]:
patients = list()
with open(if_data_01) as data_file:
    for line in data_file:
        line = line.strip()
        if not line:
            continue
        values = line.split(",")
        patient_data = list()
        for value in values:
            patient_data.append(int(value))
        patients.append(tuple(patient_data))
    

In [25]:
len(patients)

60

In [26]:
from typing import List, Tuple

def overall_average(patients: List[Tuple[int, ...]]) -> float:
    n_values = 0
    sum_values = 0
    for patient in patients:
        n_values += len(patient)
        sum_values += sum(patient)
    return sum_values / n_values

In [27]:
%timeit overall_average(patients)

32.8 µs ± 798 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


#### Average of the whole dataset with `np.mean`

In [28]:
%timeit inf_data.mean()

12.2 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [29]:
overall_average(patients)

6.14875

In [30]:
inf_data.mean()

6.14875

#### Average for each patient

In [31]:
average_per_patient = inf_data.mean(axis=1)

In [32]:
average_per_patient.shape

(60,)

In [33]:
average_per_patient

array([5.45 , 5.425, 6.1  , 5.9  , 5.55 , 6.225, 5.975, 6.65 , 6.625,
       6.525, 6.775, 5.8  , 6.225, 5.75 , 5.225, 6.3  , 6.55 , 5.7  ,
       5.85 , 6.55 , 5.775, 5.825, 6.175, 6.1  , 5.8  , 6.425, 6.05 ,
       6.025, 6.175, 6.55 , 6.175, 6.35 , 6.725, 6.125, 7.075, 5.725,
       5.925, 6.15 , 6.075, 5.75 , 5.975, 5.725, 6.3  , 5.9  , 6.75 ,
       5.925, 7.225, 6.15 , 5.95 , 6.275, 5.7  , 6.1  , 6.825, 5.975,
       6.725, 5.7  , 6.25 , 6.4  , 7.05 , 5.9  ])
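
The `axis` argument names the dimension that is collapsed by the reduction. A minimal sketch on a made-up 2×3 array (two patients, three days) shows the difference between the two choices:

```python
import numpy as np

small = np.array([
    [1, 2, 3],  # patient 0
    [4, 5, 6],  # patient 1
])

per_patient = small.mean(axis=1)  # collapse the columns (days): one value per row
per_day = small.mean(axis=0)      # collapse the rows (patients): one value per column

print(per_patient)  # [2. 5.]
print(per_day)      # [2.5 3.5 4.5]
```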

**Start Lect 4**:

Let's compare the execution time with the vanilla Python implementation

#### Exercise:

Calculate Daily Inflammation Average

In [None]:
# Vanilla Python Implementation

In [None]:
#Numpy

### Plotting with `matplotlib`

We will use `matplotlib` to add some visualisation of the data

In [None]:
from matplotlib import pyplot as plt

In [None]:
# plot a line with the computed means
plt.plot(inf_data.mean(axis=1))

# set the title
plt.title("Avg inflammations per day")

# set the label for the x-axis
plt.xlabel("Patient ID")

# set the label for the y-axis
plt.ylabel("Avg inflammations")

#### Plot with scatter points instead of a continuous line

In [None]:
plt.plot(inf_data.mean(axis=1), 'o')

plt.title("Avg inflammations per day")
plt.xlabel("Patient ID")
plt.ylabel("Avg inflammations")

#### Compute mean and standard deviation (i.e. spread) of the distribution

In [None]:
patient_means = inf_data.mean(axis=1)
patient_means.mean(), patient_means.std()

In [None]:
plt.plot(inf_data.mean(axis=1), 'o')

# plot a horizontal line at the mean
plt.axhline(patient_means.mean(), color='k')

# plot horizontal lines one standard deviation away, with a dashed style
plt.axhline(
    patient_means.mean() - patient_means.std(), 
    linestyle='--', 
    color='k'
)
plt.axhline(
    patient_means.mean() + patient_means.std(), 
    linestyle='--', 
    color='k'
)

plt.title("Avg inflammations per day")
plt.xlabel("Patient ID")
plt.ylabel("Avg inflammations")

#### Plot the line of a single patient

In [None]:
plt.plot(inf_data[0])

#### Plot a single line representing the daily averages

In [None]:
plt.plot(inf_data.mean(axis=0))

#### Try to plot all the lines together

In [None]:
# create a figure of a larger size, 20x10 inches
plt.figure(figsize=(20, 10))

# inf_data.T transposes the matrix, i.e. swaps rows and columns
# the alpha parameter adds some transparency
_ = plt.plot(inf_data.T, color='green', alpha=0.1)

#### Visualize the whole dataset with a heatmap

In [None]:
plt.imshow(inf_data, cmap='Spectral_r')
plt.colorbar()
plt.show()

---

## Solution 2 with Dictionary VS Numpy

```python
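from typing import Dict, Tuple

# Assumption: Dataset maps a patient ID to that patient's daily values.
Dataset = Dict[str, Tuple[int, ...]]


def overall_average(patients: Dataset) -> float: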
    num_values = 0
    sum_of_values = 0
    for inflammation_data in patients.values():
        num_values += len(inflammation_data)
        sum_of_values += sum(inflammation_data)
    return sum_of_values / num_values
```

Let's recall what the implementation was like

Let's now try to load the `inflammation-02.csv` file in NumPy

Looking for a better solution: introducing `pandas`

In [None]:
import pandas as pd

Slicing & `pd.Series`

Adding information on Columns

Calculate Statistics with Pandas

Plot with Pandas

**GREAT TIME FOR A BREAK NOW!** ☕️🧁🍪