# Working Notebook

Welcome to the _Programming with Python_ course! We will be using this notebook to go through the lecture materials, as well as to work _together_ on practical examples and exercises.

## First thing: let's familiarise ourselves with the environment

Let's talk about **Jupyter Notebooks** for a second.

In [1]:
# code cell

Text Cell

### A few useful Keyboard shortcuts for you to remember when working with Notebooks

**Change Cell type**

1. `Esc, M`: Switch the current cell type to a **Markdown** cell
2. `Esc, Y`: Switch the current cell type to a **Code** cell

**Add/Remove Cells**

1. `Esc, A`: Add a new cell _above_ current cell
2. `Esc, B`: Add a new cell _below_ current cell
3. `Esc, DD`: Delete the current cell

**Run current cell**

1. `Ctrl + Enter`: Run current cell (and maintain focus on current cell);
2. `Shift + Enter`: Run current cell and move focus to next cell.

---

$\rightarrow$ _Adapted from_ : [**Software Carpentries: Programming with Python**]()

## Arthritis Inflammation
We are studying **inflammation in patients** who have been given a new treatment for arthritis.

There are `60` patients, who had their inflammation levels recorded for `40` days.
We want to analyze these recordings to study the effect of the new arthritis treatment.

To see how the treatment is affecting the patients in general, we would like to:

1. Process the file to extract data for each patient;
2. Calculate some statistics on each patient;
    - e.g. average inflammation over the `40` days (or `min`, `max` .. and so on)
    - e.g. average statistics per week (we will assume `40` days account for `5` weeks)
    - `...` (open to ideas)
3. Calculate some statistics on the dataset.
    - e.g. min and max inflammation registered overall in the clinical study;
    - e.g. the average inflammation per day across all patients.
    - `...` (open to ideas)
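
The plan above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `toy_data` is a made-up miniature dataset (3 patients, 8 days), not the real file, and the variable names are hypothetical.

```python
from statistics import mean

# Hypothetical toy data: 3 patients, 8 days each (the real study has 60 x 40).
toy_data = [
    [0, 0, 1, 3, 1, 2, 4, 7],
    [0, 1, 2, 1, 2, 1, 3, 2],
    [0, 1, 1, 3, 3, 2, 6, 2],
]

# Step 2: statistics on each patient (one value per row).
mean_per_patient = [mean(patient) for patient in toy_data]
max_per_patient = [max(patient) for patient in toy_data]

# Step 3: statistics on the whole dataset.
overall_min = min(min(patient) for patient in toy_data)
overall_max = max(max(patient) for patient in toy_data)

# zip(*rows) regroups the values by day (column) instead of by patient (row).
mean_per_day = [mean(day) for day in zip(*toy_data)]

print(mean_per_patient)
print(overall_min, overall_max)
print(mean_per_day)
```

We will build up to something like this step by step, starting from reading the file.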


![3-step flowchart shows inflammation data records for patients moving to the Analysis step
where a heat map of provided data is generated moving to the Conclusion step that asks the
question, How does the medication affect patients?](
https://raw.githubusercontent.com/swcarpentry/python-novice-inflammation/gh-pages/fig/lesson-overview.svg "Lesson Overview")


### Data Format

The data sets are stored in comma-separated values (CSV) format:

- each row holds information for a single patient,
- columns represent successive days.

The first three rows of our first file look like this:
~~~
0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
~~~

Each number represents the number of inflammation bouts that a particular patient experienced on a
given day.

For example, value "6" at row 3 column 7 of the data set above means that the third
patient was experiencing inflammation six times on the seventh day of the clinical study.
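
As a quick check of that example, here is a small sketch. It assumes the three rows above have been typed in by hand as lists of integers (shortened to the first 10 days); `first_rows` is a hypothetical name. Note that Python counts positions from 0, so "row 3, column 7" becomes index `[2][6]`.

```python
# First three rows of the data file, copied from the excerpt above and
# shortened to the first 10 days for readability.
first_rows = [
    [0, 0, 1, 3, 1, 2, 4, 7, 8, 3],
    [0, 1, 2, 1, 2, 1, 3, 2, 2, 6],
    [0, 1, 1, 3, 3, 2, 6, 2, 5, 9],
]

row, column = 3, 7  # human-friendly, 1-based position

# Python indices start at 0, so we subtract 1 from each position.
value = first_rows[row - 1][column - 1]
print(value)  # 6
```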

Our **task** is to gather as much information as possible from the dataset, and to report back to colleagues to foster future discussions.

### Let's make a plan

- Problem description (step by step) in NATURAL LANGUAGE (**strict rule**) - imagine you're explaining this to someone who doesn't know **anything** about programming.
- What do we need to start?
- Where do we start?

I'll go first - let's create a dummy file to practise with, named `dummy.csv`, containing three rows of 7 values each.

1. read the file
    - read the file one line at a time
2. store the data from the file into a data structure or format

In [2]:
dummy_datafile = open("dummy.csv")

In [3]:
patients = [] # list()

for line in dummy_datafile:
    patients.append(line)

In [4]:
print(patients)

['0,0,1,3,1,2,4\n', '0,1,2,1,2,1,3\n', '0,1,1,3,3,2,6']


### Small Diversion about **Python Typing Mechanism**

1. Python typing is _dynamic_ (as opposed to _static_): each variable gets its type from the value assigned to it. There is no need to declare a type for a variable.

2. Python typing is _strong_ (as opposed to _weak_): the type of a value never changes implicitly. A variable keeps its type unless it is re-assigned, or its value is explicitly converted (cast) to another compatible type!

In [5]:
name = "valerio"

In [6]:
type(name)

str

In [7]:
name / 2

TypeError: unsupported operand type(s) for /: 'str' and 'int'

In [None]:
name * "maggio"

In [8]:
name = 2
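
To round off the diversion, here is a minimal sketch of both properties side by side. The variable `days` and its value are made up for illustration.

```python
name = "valerio"
print(type(name))  # the variable currently refers to a str

name = 2           # dynamic typing: re-assigning to an int is allowed
print(type(name))  # now the same variable refers to an int

# Strong typing: Python will not silently turn the str "40" into a number.
days = "40"
try:
    days + 2
except TypeError as error:
    print("TypeError:", error)

# An explicit conversion makes the intent clear.
total = int(days) + 2
print(total)  # 42
```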

#### Going back to our Data case

In [9]:
patients

['0,0,1,3,1,2,4\n', '0,1,2,1,2,1,3\n', '0,1,1,3,3,2,6']

In [10]:
for patient in patients:
    print("Patient info: " + patient)

Patient info: 0,0,1,3,1,2,4

Patient info: 0,1,2,1,2,1,3

Patient info: 0,1,1,3,3,2,6


### Storing the data from file in a better format:

In [11]:
dummy_datafile.close()  # close the file handle opened earlier: it has already been read to the end, so we must re-open the file to read it again
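
Why do we need to re-open the file? A small sketch, assuming a throwaway file written to a temporary location (it stands in for `dummy.csv`; the names `tmp_path`, `first_pass` and so on are hypothetical):

```python
import os
import tempfile

# Hypothetical stand-in for dummy.csv, written to a temporary location.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("0,0,1\n0,1,2\n")
    tmp_path = tmp.name

datafile = open(tmp_path)
first_pass = [line for line in datafile]   # reads every line
second_pass = [line for line in datafile]  # nothing left: we are at the end
datafile.close()

# Re-opening the file starts again from the first line.
with open(tmp_path) as datafile:
    third_pass = [line for line in datafile]

os.remove(tmp_path)  # clean up the temporary file

print(len(first_pass), len(second_pass), len(third_pass))  # 2 0 2
```

The `with` statement also closes the file for us when the block ends, which is why we use it from now on.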

In [12]:
patients = []

with open("dummy.csv") as dummy_datafile:
    for line in dummy_datafile:
        line = line.strip()
        if (len(line) == 0):
            continue
        inflammation_data = line.split(",")
        patients.append(inflammation_data)
        

In [13]:
patients

[['0', '0', '1', '3', '1', '2', '4'],
 ['0', '1', '2', '1', '2', '1', '3'],
 ['0', '1', '1', '3', '3', '2', '6']]

Play with what we have so far: iteration

In [14]:
for patient in patients:
    print(type(patient))

<class 'list'>
<class 'list'>
<class 'list'>


In [15]:
patients[0][0]

'0'

In [16]:
patients[2][:3]

['0', '1', '1']

(_fancy word_) **Slicing**

![slicing example](https://swcarpentry.github.io/python-novice-inflammation/fig/python-zero-index.svg)

Source: [Software Carpentries](https://swcarpentry.github.io/python-novice-inflammation/02-numpy/index.html)
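
A few more slices to try, as a sketch on a single row of the dummy data (the row is copied from the output above; `patient` is just a convenient name here):

```python
# One patient row from the dummy data, as a list of strings.
patient = ['0', '1', '1', '3', '3', '2', '6']

print(patient[:3])   # first three values: indices 0, 1, 2
print(patient[2:5])  # indices 2, 3, 4 (the end index is excluded)
print(patient[-2:])  # last two values
print(patient[::2])  # every second value, starting from index 0
```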

Now let's move to the _real_ data file: **how can we re-use the same algorithm?**

### TYPE HINT

#### Python 3.8

```python
# BASIC TYPES : bool, int, float, str

from typing import List
from typing import Dict

def function(param1: int, param2: str) -> List[int]:
    pass

```

#### Since Python 3.9

```python
# Built-in collection types can be parameterised directly: no typing import needed
def function(p1: int, p2: str) -> list[int]:
    pass
```




Play with type hint

In [17]:
def concat(first: str, second: str) -> str: 
    return first + second

concat("valerio", "maggio")

'valeriomaggio'

In [18]:
concat(1, 2)

3

In [19]:
concat("name", 32)

TypeError: can only concatenate str (not "int") to str
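
What the cells above show is that type hints are **not** enforced when the code runs: they are documentation for humans and for tools. If we really want a check at runtime, we have to write it ourselves. A minimal sketch, where `checked_concat` is a hypothetical variant written for this illustration:

```python
def concat(first: str, second: str) -> str:
    return first + second

# The hints are stored as metadata on the function, nothing more.
print(concat.__annotations__)

def checked_concat(first: str, second: str) -> str:
    """Hypothetical variant of concat that validates its arguments."""
    for value in (first, second):
        if not isinstance(value, str):
            raise TypeError("expected str, got {}".format(type(value).__name__))
    return first + second

print(checked_concat("valerio", "maggio"))

try:
    checked_concat(1, 2)
except TypeError as error:
    print("TypeError:", error)
```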

In [20]:
def process_inflammation_data(datafile_path: str) -> list:
    """
    Process the input CSV file (specified by the filepath) and 
    collect all the inflammatory levels recorded for each patient.
    
    The function returns a list of all patient data.
    """
    patients = []
    with open(datafile_path) as datafile:
        for line in datafile:
            line = line.strip()
            if not line:  # skip empty lines
                continue
            patient_data = []
            for value in line.split(","):
                patient_data.append(int(value))
            patients.append(patient_data)
            
    return patients

In [21]:
process_inflammation_data("dummy.csv")

[[0, 0, 1, 3, 1, 2, 4], [0, 1, 2, 1, 2, 1, 3], [0, 1, 1, 3, 3, 2, 6]]

In [22]:
from pathlib import Path

# Path.cwd() returns the absolute path of the current working directory
BASE_PATH = Path.cwd()

DATA_FOLDER = BASE_PATH / "data"

In [23]:
print(BASE_PATH)

/Users/valerio/Research/UoB/lectures/fbk-academy/2021/python-data-science


In [24]:
print(DATA_FOLDER)

/Users/valerio/Research/UoB/lectures/fbk-academy/2021/python-data-science/data


In [25]:
type(DATA_FOLDER)

pathlib.PosixPath

In [26]:
data_filepath = DATA_FOLDER / "inflammation-01.csv"
print(type(data_filepath))

<class 'pathlib.PosixPath'>


In [27]:
print(data_filepath)

/Users/valerio/Research/UoB/lectures/fbk-academy/2021/python-data-science/data/inflammation-01.csv


In [28]:
patients_inflammation = process_inflammation_data(data_filepath)

In [29]:
type(patients_inflammation)

list

_now we have 60 patients to deal with_ - how can we do that?

In [30]:
print(len(patients_inflammation))

60


In [31]:
len(patients_inflammation) == 60

True

In [32]:
assert len(patients_inflammation) == 60, "WARNING: Patients recorded were expected to be 60 but instead are {}".format(
    len(patients_inflammation))

In [33]:
# demonstration of assert statement
assert 5 < 2, "No way, go back to Primary School"

AssertionError: No way, go back to Primary School

1. Assert that we indeed have 40 values for each patient
2. Do assertive (defensive) programming in a more structured way

In [34]:
from typing import List

Dataset = List[List[int]]

In [35]:
def test_dataset_contains_60_patients(data: Dataset):
    assert len(data) == 60, "Error, Expected 60, found {}".format(len(data))

In [36]:
test_dataset_contains_60_patients(patients_inflammation)

In [37]:
def test_each_patient_has_40_recorded_values(data: Dataset):
    
    for i, patient in enumerate(data):
        assert len(patient) == 40, "Error: Patient Nr. {} has {} recorded values".format(i, len(patient))
        
test_each_patient_has_40_recorded_values(patients_inflammation)

## Diversion about side effects and MUTABLE Sequences

In [38]:
sequence = [1, 2, 3, 4, 5, 6]

def nasty_function(seq: list):
    for item in seq:
        print(item)
    seq.append("NOISE")
    
    

In [39]:
nasty_function(sequence)

1
2
3
4
5
6


In [40]:
print(sequence)

[1, 2, 3, 4, 5, 6, 'NOISE']


In [41]:
sequence = (1, 2, 3, 4, 5, 6)

type(sequence)

tuple

In [42]:
nasty_function(sequence)

1
2
3
4
5
6


AttributeError: 'tuple' object has no attribute 'append'
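
One way to avoid this kind of side effect is to never modify the argument, and to return a new object instead. A minimal sketch, where `polite_function` is a hypothetical counterpart to `nasty_function`:

```python
def polite_function(seq):
    """Hypothetical counterpart to nasty_function: returns a new list."""
    result = list(seq)  # copy, so the caller's object is left untouched
    result.append("NOISE")
    return result

original = [1, 2, 3]
extended = polite_function(original)

print(original)  # [1, 2, 3]
print(extended)  # [1, 2, 3, 'NOISE']

# Because it only reads its argument, it works with tuples too.
print(polite_function((1, 2, 3)))  # [1, 2, 3, 'NOISE']
```

The other line of defence, which we use next, is to store the data in an immutable sequence so that nobody can change it by accident.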

In [43]:
from typing import List, Tuple

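# Tuple[int, ...] means "a tuple of any length containing ints";
# Tuple[int] would mean a tuple with exactly one int.
Dataset = List[Tuple[int, ...]]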

def process_inflammation_data(datafile_path: str) -> Dataset:
    """
    Process the input CSV file (specified by the filepath) and 
    collect all the inflammatory levels recorded for each patient.
    
    The function returns a list of all patient data.
    """
    patients = []
    with open(datafile_path) as datafile:
        for line in datafile:
            line = line.strip()
            if not line:  # skip empty lines
                continue
            patient_data = []
            for value in line.split(","):
                patient_data.append(int(value))
            # Convert into an immutable sequence    
            patient_data = tuple(patient_data)
            patients.append(patient_data)
            
    return patients

In [44]:
patients_inflammation = process_inflammation_data(data_filepath)

In [45]:
test_dataset_contains_60_patients(patients_inflammation)

In [46]:
test_each_patient_has_40_recorded_values(patients_inflammation)

What if we also add in a reference ID for each patient? (see `data/inflammation-02.csv`)

In [47]:
from typing import Dict, Tuple

# Type alias, used as documentation: patient ID -> tuple of daily readings
Dataset = Dict[str, Tuple[int, ...]]

def process_inflammation_data_with_patientID(datafile_path: str) -> Dataset:
    """
    Process the input CSV file (specified by the filepath) and 
    collect all the inflammatory levels recorded for each patient.
    
    The function returns a dictionary mapping each patient ID
    to a tuple of that patient's data.
    """
    patients = {}
    with open(datafile_path) as datafile:
        for line in datafile:
            line = line.strip()
            if not line:  # skip empty lines
                continue
            
            values = line.split(",")            
            pid = values[0]
            
            patient_data = []
            for value in values[1:] :
                patient_data.append(int(value))
            # Convert into an immutable sequence    
            patient_data = tuple(patient_data)
            
            patients[pid] = patient_data
    return patients

In [48]:
inflammation_filepath_02 = DATA_FOLDER / "inflammation-02.csv"

In [49]:
patients_dict = process_inflammation_data_with_patientID(inflammation_filepath_02)

In [50]:
print(patients_dict["669f"])

(0, 0, 1, 3, 1, 2, 4, 7, 8, 3, 3, 3, 10, 5, 7, 4, 7, 7, 12, 18, 6, 13, 11, 11, 7, 7, 4, 6, 8, 8, 4, 4, 5, 7, 3, 4, 2, 3, 0, 0)


In [51]:
print(patients_dict["c760"])

(0, 1, 1, 3, 3, 1, 3, 5, 2, 4, 4, 7, 6, 5, 3, 10, 8, 10, 6, 17, 9, 14, 9, 7, 13, 9, 12, 6, 7, 7, 9, 6, 3, 2, 2, 4, 2, 0, 1, 1)


In [52]:
patients_dict.keys()

dict_keys(['669f', '2edf', '0355', '5968', 'c760', '6b51', 'dbaf', 'b3b7', '3995', 'd6ff', '2d58', 'a1d4', '71e9', '65c1', '1edd', '277b', 'fe0e', '66d3', '3ff3', '4102', '12c9', '5b04', '1fef', '01c0', '57b5', '226c', 'c653', '94fd', 'ebf2', 'fc73', 'd4a0', 'a9f2', 'dc22', 'a6e7', '3fb2', '11cc', 'c9f5', 'a73f', 'dab2', '65a1', '8bcb', '4004', 'c2af', '8037', 'cb49', '2b4b', '80a8', 'ac50', '57ef', 'cc45', '9184', '84be', '0af0', 'bf77', 'c56c', '7d0c', 'c736', 'c5c8', '050a', 'a085'])

In [53]:
patients_dict.values()

dict_values([(0, 0, 1, 3, 1, 2, 4, 7, 8, 3, 3, 3, 10, 5, 7, 4, 7, 7, 12, 18, 6, 13, 11, 11, 7, 7, 4, 6, 8, 8, 4, 4, 5, 7, 3, 4, 2, 3, 0, 0), (0, 1, 2, 1, 2, 1, 3, 2, 2, 6, 10, 11, 5, 9, 4, 4, 7, 16, 8, 6, 18, 4, 12, 5, 12, 7, 11, 5, 11, 3, 3, 5, 4, 4, 5, 5, 1, 1, 0, 1), (0, 1, 1, 3, 3, 2, 6, 2, 5, 9, 5, 7, 4, 5, 4, 15, 5, 11, 9, 10, 19, 14, 12, 17, 7, 12, 11, 7, 4, 2, 10, 5, 4, 2, 2, 3, 2, 2, 1, 1), (0, 0, 2, 0, 4, 2, 2, 1, 6, 7, 10, 7, 9, 13, 8, 8, 15, 10, 10, 7, 17, 4, 4, 7, 6, 15, 6, 4, 9, 11, 3, 5, 6, 3, 3, 4, 2, 3, 2, 1), (0, 1, 1, 3, 3, 1, 3, 5, 2, 4, 4, 7, 6, 5, 3, 10, 8, 10, 6, 17, 9, 14, 9, 7, 13, 9, 12, 6, 7, 7, 9, 6, 3, 2, 2, 4, 2, 0, 1, 1), (0, 0, 1, 2, 2, 4, 2, 1, 6, 4, 7, 6, 6, 9, 9, 15, 4, 16, 18, 12, 12, 5, 18, 9, 5, 3, 10, 3, 12, 7, 8, 4, 7, 3, 5, 4, 4, 3, 2, 1), (0, 0, 2, 2, 4, 2, 2, 5, 5, 8, 6, 5, 11, 9, 4, 13, 5, 12, 10, 6, 9, 17, 15, 8, 9, 3, 13, 7, 8, 2, 8, 8, 4, 2, 3, 5, 4, 1, 1, 1), (0, 0, 1, 2, 3, 1, 2, 3, 5, 3, 7, 8, 8, 5, 10, 9, 15, 11, 18, 19, 20, 8, 5, 13, 

In [54]:
patients_dict.items()

dict_items([('669f', (0, 0, 1, 3, 1, 2, 4, 7, 8, 3, 3, 3, 10, 5, 7, 4, 7, 7, 12, 18, 6, 13, 11, 11, 7, 7, 4, 6, 8, 8, 4, 4, 5, 7, 3, 4, 2, 3, 0, 0)), ('2edf', (0, 1, 2, 1, 2, 1, 3, 2, 2, 6, 10, 11, 5, 9, 4, 4, 7, 16, 8, 6, 18, 4, 12, 5, 12, 7, 11, 5, 11, 3, 3, 5, 4, 4, 5, 5, 1, 1, 0, 1)), ('0355', (0, 1, 1, 3, 3, 2, 6, 2, 5, 9, 5, 7, 4, 5, 4, 15, 5, 11, 9, 10, 19, 14, 12, 17, 7, 12, 11, 7, 4, 2, 10, 5, 4, 2, 2, 3, 2, 2, 1, 1)), ('5968', (0, 0, 2, 0, 4, 2, 2, 1, 6, 7, 10, 7, 9, 13, 8, 8, 15, 10, 10, 7, 17, 4, 4, 7, 6, 15, 6, 4, 9, 11, 3, 5, 6, 3, 3, 4, 2, 3, 2, 1)), ('c760', (0, 1, 1, 3, 3, 1, 3, 5, 2, 4, 4, 7, 6, 5, 3, 10, 8, 10, 6, 17, 9, 14, 9, 7, 13, 9, 12, 6, 7, 7, 9, 6, 3, 2, 2, 4, 2, 0, 1, 1)), ('6b51', (0, 0, 1, 2, 2, 4, 2, 1, 6, 4, 7, 6, 6, 9, 9, 15, 4, 16, 18, 12, 12, 5, 18, 9, 5, 3, 10, 3, 12, 7, 8, 4, 7, 3, 5, 4, 4, 3, 2, 1)), ('dbaf', (0, 0, 2, 2, 4, 2, 2, 5, 5, 8, 6, 5, 11, 9, 4, 13, 5, 12, 10, 6, 9, 17, 15, 8, 9, 3, 13, 7, 8, 2, 8, 8, 4, 2, 3, 5, 4, 1, 1, 1)), ('b3b7', (0

In [55]:
print("Iterate by Keys - explicitly")
for key in patients_dict.keys():
    print(key)
    
print("Iterate by keys - implicitly (as in default)")
for key in patients_dict:
    print(key)
    
print("Iterate by values")
for patient_data in patients_dict.values():
    print(patient_data)

Iterate by Keys - explicitly
669f
2edf
0355
5968
c760
6b51
dbaf
b3b7
3995
d6ff
2d58
a1d4
71e9
65c1
1edd
277b
fe0e
66d3
3ff3
4102
12c9
5b04
1fef
01c0
57b5
226c
c653
94fd
ebf2
fc73
d4a0
a9f2
dc22
a6e7
3fb2
11cc
c9f5
a73f
dab2
65a1
8bcb
4004
c2af
8037
cb49
2b4b
80a8
ac50
57ef
cc45
9184
84be
0af0
bf77
c56c
7d0c
c736
c5c8
050a
a085
Iterate by keys - implicitly (as in default)
669f
2edf
0355
5968
c760
6b51
dbaf
b3b7
3995
d6ff
2d58
a1d4
71e9
65c1
1edd
277b
fe0e
66d3
3ff3
4102
12c9
5b04
1fef
01c0
57b5
226c
c653
94fd
ebf2
fc73
d4a0
a9f2
dc22
a6e7
3fb2
11cc
c9f5
a73f
dab2
65a1
8bcb
4004
c2af
8037
cb49
2b4b
80a8
ac50
57ef
cc45
9184
84be
0af0
bf77
c56c
7d0c
c736
c5c8
050a
a085
Iterate by values
(0, 0, 1, 3, 1, 2, 4, 7, 8, 3, 3, 3, 10, 5, 7, 4, 7, 7, 12, 18, 6, 13, 11, 11, 7, 7, 4, 6, 8, 8, 4, 4, 5, 7, 3, 4, 2, 3, 0, 0)
(0, 1, 2, 1, 2, 1, 3, 2, 2, 6, 10, 11, 5, 9, 4, 4, 7, 16, 8, 6, 18, 4, 12, 5, 12, 7, 11, 5, 11, 3, 3, 5, 4, 4, 5, 5, 1, 1, 0, 1)
(0, 1, 1, 3, 3, 2, 6, 2, 5, 9, 5, 7, 4, 5, 4, 15, 5,

In [56]:
"669f" in patients_dict

True

In [57]:
for pid, pdata in patients_dict.items():
    print("Patient({}) -> {}".format(pid, pdata))

Patient(669f) -> (0, 0, 1, 3, 1, 2, 4, 7, 8, 3, 3, 3, 10, 5, 7, 4, 7, 7, 12, 18, 6, 13, 11, 11, 7, 7, 4, 6, 8, 8, 4, 4, 5, 7, 3, 4, 2, 3, 0, 0)
Patient(2edf) -> (0, 1, 2, 1, 2, 1, 3, 2, 2, 6, 10, 11, 5, 9, 4, 4, 7, 16, 8, 6, 18, 4, 12, 5, 12, 7, 11, 5, 11, 3, 3, 5, 4, 4, 5, 5, 1, 1, 0, 1)
Patient(0355) -> (0, 1, 1, 3, 3, 2, 6, 2, 5, 9, 5, 7, 4, 5, 4, 15, 5, 11, 9, 10, 19, 14, 12, 17, 7, 12, 11, 7, 4, 2, 10, 5, 4, 2, 2, 3, 2, 2, 1, 1)
Patient(5968) -> (0, 0, 2, 0, 4, 2, 2, 1, 6, 7, 10, 7, 9, 13, 8, 8, 15, 10, 10, 7, 17, 4, 4, 7, 6, 15, 6, 4, 9, 11, 3, 5, 6, 3, 3, 4, 2, 3, 2, 1)
Patient(c760) -> (0, 1, 1, 3, 3, 1, 3, 5, 2, 4, 4, 7, 6, 5, 3, 10, 8, 10, 6, 17, 9, 14, 9, 7, 13, 9, 12, 6, 7, 7, 9, 6, 3, 2, 2, 4, 2, 0, 1, 1)
Patient(6b51) -> (0, 0, 1, 2, 2, 4, 2, 1, 6, 4, 7, 6, 6, 9, 9, 15, 4, 16, 18, 12, 12, 5, 18, 9, 5, 3, 10, 3, 12, 7, 8, 4, 7, 3, 5, 4, 4, 3, 2, 1)
Patient(dbaf) -> (0, 0, 2, 2, 4, 2, 2, 5, 5, 8, 6, 5, 11, 9, 4, 13, 5, 12, 10, 6, 9, 17, 15, 8, 9, 3, 13, 7, 8, 2, 8, 8, 4, 2,

Let's practice with our new data structure

Let's get on with the _real deal_ : let's gather some statistics!
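
As a warm-up, here is a sketch of a few statistics computed on a dictionary shaped like `patients_dict`. It assumes a hand-made miniature (`mini_patients`): the IDs are taken from the output above, but the data is cut down to the first 7 days.

```python
from statistics import mean

# Hypothetical miniature of patients_dict: 3 patients, first 7 days only.
mini_patients = {
    "669f": (0, 0, 1, 3, 1, 2, 4),
    "2edf": (0, 1, 2, 1, 2, 1, 3),
    "0355": (0, 1, 1, 3, 3, 2, 6),
}

# Average inflammation for each patient, keyed by patient ID.
mean_by_patient = {pid: mean(data) for pid, data in mini_patients.items()}

# ID of the patient with the highest single-day inflammation.
peak_patient = max(mini_patients, key=lambda pid: max(mini_patients[pid]))

for pid, average in mean_by_patient.items():
    print("Patient({}) -> average {:.2f}".format(pid, average))
print("Highest peak:", peak_patient)
```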

--- 

Well done for reaching this point! 🎉

**GREAT TIME FOR A BREAK NOW!** ☕️🧁🍪

---


Dealing with more _realistic cases_ ❌

Putting our helmets on (_with some testing_) ⛑

Now it's time to rethink our Data (Abstractions): let's define our own **new type**!
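
There are several ways to do this in Python. One possible starting point, sketched here under the assumption that a patient is just an ID plus an immutable sequence of readings, is a frozen `dataclass`. The class name `Patient` and its methods are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Tuple

@dataclass(frozen=True)
class Patient:
    """Hypothetical record type: one patient and their daily readings."""
    pid: str
    inflammation: Tuple[int, ...]

    def average(self) -> float:
        """Return the average inflammation over all recorded days."""
        return mean(self.inflammation)

    def peak_day(self) -> int:
        """Return the 1-based day on which inflammation was highest."""
        return self.inflammation.index(max(self.inflammation)) + 1

# First 7 days of patient 669f, copied from the output above.
patient = Patient("669f", (0, 0, 1, 3, 1, 2, 4))
print(patient)
print(patient.peak_day())  # 7
```

With `frozen=True`, trying to re-assign `patient.pid` raises an error, which gives us the same protection against side effects that we got from tuples.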