## Classes, Data Classes and Abstractions

### Recap Data Problem

$\rightarrow$ _Adapted from_: [**Software Carpentry: Programming with Python**]()

## Arthritis Inflammation
We are studying **inflammation in patients** who have been given a new treatment for arthritis.

There are `60` patients, who had their inflammation levels recorded for `40` days.
We want to analyze these recordings to study the effect of the new arthritis treatment.

To see how the treatment is affecting the patients in general, we would like to:

1. Process the file to extract data for each patient;
2. Calculate some statistics on each patient;
    - e.g. average inflammation over the `40` days (or `min`, `max`, and so on)
    - e.g. average statistics per week (we will assume the `40` days account for `5` weeks)
    - `...` (open to ideas)
3. Calculate some statistics on the dataset.
    - e.g. min and max inflammation registered overall in the clinical study;
    - e.g. the average inflammation per day across all patients.
    - `...` (open to ideas)
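
The three steps above can be sketched with the standard library alone. This is only a minimal illustration: it uses the first three rows of the file (shown in the Data Format section below), and it assumes that a "week" is a block of `8` consecutive days, so that `40` days give `5` weeks. The names `SAMPLE`, `DAYS_PER_WEEK` and `weekly_means` are made up for this example.

```python
from statistics import fmean

# First three rows of the data file, copied from the "Data Format" section.
SAMPLE = """\
0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
"""

DAYS_PER_WEEK = 8  # assumption: 40 days split evenly into 5 "weeks"

# Step 1: one list of integer readings per patient.
patients = [[int(value) for value in line.split(",")] for line in SAMPLE.splitlines()]


def weekly_means(readings):
    """Average the readings over consecutive blocks of DAYS_PER_WEEK days."""
    return [
        fmean(readings[start:start + DAYS_PER_WEEK])
        for start in range(0, len(readings), DAYS_PER_WEEK)
    ]


# Step 2: statistics on each patient.
patient_means = [fmean(readings) for readings in patients]
patient_weekly = [weekly_means(readings) for readings in patients]

# Step 3: statistics on the whole dataset.
overall_min = min(min(readings) for readings in patients)
overall_max = max(max(readings) for readings in patients)
daily_means = [fmean(day) for day in zip(*patients)]  # one average per day

print(patient_means)
print(patient_weekly[0])
print(overall_min, overall_max)
```

`zip(*patients)` transposes the rows, so each `day` is the tuple of readings that all patients recorded on that day.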


![3-step flowchart shows inflammation data records for patients moving to the Analysis step
where a heat map of provided data is generated moving to the Conclusion step that asks the
question, How does the medication affect patients?](
https://raw.githubusercontent.com/swcarpentry/python-novice-inflammation/gh-pages/fig/lesson-overview.svg "Lesson Overview")


### Data Format

The data sets are stored in comma-separated values (CSV) format:

- each row holds information for a single patient,
- columns represent successive days.

The first three rows of our first file look like this:
~~~
0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
~~~

Each number represents the number of inflammation bouts that a particular patient experienced on a
given day.

For example, the value `6` at row 3, column 7 of the data set above means that the third
patient experienced six inflammation bouts on the seventh day of the clinical study.
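
The row and column positions in the example are 1-based, while Python sequences are 0-based. A small sketch of the lookup, using only the first seven columns of the three sample rows (the `rows` list is hand-copied for illustration):

```python
# First seven columns of the three sample rows shown above (truncated for brevity).
rows = [
    [0, 0, 1, 3, 1, 2, 4],
    [0, 1, 2, 1, 2, 1, 3],
    [0, 1, 1, 3, 3, 2, 6],
]

patient_number, day_number = 3, 7  # 1-based, as in the prose

# Python sequences are 0-based, so subtract one from each position.
bouts = rows[patient_number - 1][day_number - 1]
print(bouts)  # 6
```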

Our **task** is to gather as much information as possible from the dataset, and to report back to colleagues to foster future discussions.

---

Dealing with more _realistic cases_ ❌

In [1]:
import pandas as pd

Read data from `data/inflammation-04.csv`

In [2]:
import os
from pathlib import Path 

BASE_FOLDER  = Path(os.path.abspath(os.path.curdir))

DATA_FOLDER = BASE_FOLDER / "data"

DATASET_FILE = DATA_FOLDER / "inflammation-04.csv"
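
As a side note, `pathlib` can build the same path without `os`. The sketch below assumes the notebook is started from the folder that contains `data/`; the attributes shown (`name`, `suffix`, `parent`) are standard `Path` attributes.

```python
from pathlib import Path

# Path.cwd() returns the current working directory as a Path object.
BASE_FOLDER = Path.cwd()
DATASET_FILE = BASE_FOLDER / "data" / "inflammation-04.csv"

print(DATASET_FILE.name)         # file name with extension
print(DATASET_FILE.suffix)       # extension only
print(DATASET_FILE.parent.name)  # name of the containing folder
print(DATASET_FILE.exists())     # True only if the file is present on this machine
```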

In [3]:
print(DATASET_FILE)

/Users/valerio/Research/UoB/lectures/fbk-academy/2021/python-data-science/data/inflammation-04.csv


In [4]:
df = pd.read_csv(DATASET_FILE, header=0, index_col=0)

In [5]:
df.head()

| PatientID | Day1 | Day2 | Day3 | Day4 | Day5 | Day6 | Day7 | Day8 | Day9 | Day10 | ... | Day34 | Day35 | Day36 | Day37 | Day38 | Day39 | Day40 | Sex | Age | Group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 669f | 0 | 0 | 1 | 3 | 1 | 2 | 4 | 7 | 8 | 3 | ... | 7 | 3 | 4 | 2 | 3 | 0 | 0 | F | 76 | G3 |
| 2edf | 0 | 1 | 2 | 1 | 2 | 1 | 3 | 2 | 2 | 6 | ... | 4 | 5 | 5 | 1 | 1 | 0 | 1 | F | 42 | G3 |
| 0355 | 0 | 1 | 1 | 3 | 3 | 2 | 6 | 2 | 5 | 9 | ... | 2 | 2 | 3 | 2 | 2 | 1 | 1 | F | 59 | G3 |
| 5968 | 0 | 0 | 2 | 0 | 4 | 2 | 2 | 1 | 6 | 7 | ... | 3 | 3 | 4 | 2 | 3 | 2 | 1 | M | 25 | G1 |
| c760 | 0 | 1 | 1 | 3 | 3 | 1 | 3 | 5 | 2 | 4 | ... | 2 | 2 | 4 | 2 | 0 | 1 | 1 | M | 60 | G2 |


In [7]:
df.loc["669f"]

Day1      0
Day2      0
Day3      1
Day4      3
Day5      1
Day6      2
Day7      4
Day8      7
Day9      8
Day10     3
Day11     3
Day12     3
Day13    10
Day14     5
Day15     7
Day16     4
Day17     7
Day18     7
Day19    12
Day20    18
Day21     6
Day22    13
Day23    11
Day24    11
Day25     7
Day26     7
Day27     4
Day28     6
Day29     8
Day30     8
Day31     4
Day32     4
Day33     5
Day34     7
Day35     3
Day36     4
Day37     2
Day38     3
Day39     0
Day40     0
Sex       F
Age      76
Group    G3
Name: 669f, dtype: object
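
The `dtype: object` above signals that a single row mixes integer readings with string metadata. One way to separate the two is sketched below with `csv.DictReader` on an in-memory copy of the first record. The header layout (`PatientID` first, then `Day1` to `Day40`, then `Sex`, `Age`, `Group`) is an assumption based on the column names printed above, and `DAY_COLUMNS`, `HEADER` and `FIRST_RECORD` are names invented for this example.

```python
import csv
import io

DAY_COLUMNS = [f"Day{day}" for day in range(1, 41)]
# Assumed header layout: ID first, then the 40 day columns, then the metadata.
HEADER = ["PatientID", *DAY_COLUMNS, "Sex", "Age", "Group"]
FIRST_RECORD = (
    "669f,0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,"
    "7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0,F,76,G3"
)
sample_file = io.StringIO(",".join(HEADER) + "\n" + FIRST_RECORD + "\n")

record = next(csv.DictReader(sample_file))  # every value is read as a string

# Convert the readings and the age explicitly; keep the rest as strings.
readings = [int(record[day]) for day in DAY_COLUMNS]
metadata = {
    "pid": record["PatientID"],
    "sex": record["Sex"],
    "age": int(record["Age"]),
    "group": record["Group"],
}
print(metadata)
print(len(readings), max(readings))
```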

```python
from typing import List, Dict, Tuple
from numpy.typing import ArrayLike

# Progression of dataset representations (type notation only, not runnable Python):
# Dataset = List[int] ==> ArrayLike[int]
# Dataset = Dict[str, List[int]] ==> Dict[str, Tuple[int, ...]] ==> pd.DataFrame
```

```python
from typing import Any, Dict, Sequence

# Patient must be defined before Dataset refers to it
Patient = Dict[str, Any]
Dataset = Sequence[Patient]
```

In [None]:
from typing import Sequence

Dataset = Sequence[?]

```python
Patient = {
    "patientID": str,
    "sex": str,
    "age": int,
    "group": str,
    "inflammation_data": ArrayLike,  # array-like of int readings
}
```
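
To make the dictionary layout concrete, here is a minimal sketch of one patient stored this way. `typing.TypedDict` (Python 3.8+) is used only to document the expected keys for a type checker; the class name `PatientRecord` is hypothetical and the readings are truncated.

```python
from typing import List, TypedDict


class PatientRecord(TypedDict):
    """Hypothetical typed view of the dictionary layout."""
    patientID: str
    sex: str
    age: int
    group: str
    inflammation_data: List[int]


patient: PatientRecord = {
    "patientID": "669f",
    "sex": "F",
    "age": 76,
    "group": "G3",
    "inflammation_data": [0, 0, 1, 3, 1, 2, 4, 7],  # first 8 readings only
}

print(patient["age"])

# At run time this is still a plain dict: a misspelled key fails only when used.
try:
    patient["Age"]
except KeyError as error:
    print(f"missing key: {error}")
```

The annotation helps static tools, but nothing is enforced while the program runs, which is one reason to look for a stricter abstraction.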

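**SOLUTION 1:** `Patient as a Tuple`

```python
from pathlib import Path
from typing import Any, Sequence, Tuple

Patient = Tuple[Any, ...]  # a tuple of arbitrary length
Dataset = Sequence[Patient]

def read_inflammation_04(filepath: Path) -> Dataset:
    dataset = list()
    with open(filepath) as datafile:
        # NOTE: the header row and the trailing newlines are not handled yet
        for line in datafile:
            patient_info = line.split(",")  # ["669f", "0", "0", "1", ..., "F", "76", "G3"]
            patient = tuple(patient_info)  # every field is still a string
            dataset.append(patient)
    return dataset

```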
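
A short sketch of what the plain-tuple approach yields. The in-memory file below is a hypothetical stand-in for the real CSV, truncated to three day columns, with the header layout assumed from the column names shown earlier.

```python
import io

# Hypothetical in-memory stand-in for the CSV file, truncated to three days.
sample_file = io.StringIO(
    "PatientID,Day1,Day2,Day3,Sex,Age,Group\n"
    "669f,0,0,1,F,76,G3\n"
)

# Same logic as the tuple-based reader: split each line and wrap it in a tuple.
dataset = [tuple(line.split(",")) for line in sample_file]

print(dataset[0])      # the header row is stored as if it were a patient
print(dataset[1])      # every field is a string; the last one keeps its newline
print(dataset[1][-2])  # the age is reachable only by position
```

These three observations (header kept, no type conversion, positional access only) are what the next refinements address.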

**SOLUTION 1.5**: `Patient as a specialised TUPLE`

```python
import numpy as np

# Field order: (pid, sex, group, age, inflammation_data)
Patient = Tuple[str, str, str, int, Sequence[int]]

def read_inflammation_04(filepath: Path) -> Dataset:
    dataset = list()
    with open(filepath) as datafile:
        next(datafile)  # skip the header row
        for line in datafile:
            patient_info = line.strip().split(",")  # ["669f", "0", "0", "1", ..., "F", "76", "G3"]
            patient_data = [patient_info[0], patient_info[-3],
                            patient_info[-1], int(patient_info[-2]),  # age as int
                            np.asarray(patient_info[1:-3]).astype(int)]
            patient = tuple(patient_data)
            dataset.append(patient)
    return dataset

```

In [12]:
from typing import Sequence, Tuple, Union
from numpy.typing import ArrayLike

Patient = Tuple[str, str, str, int, Sequence[int]]
Dataset = Sequence[Patient]

In [20]:
def read_inflammation_04(filepath: Path) -> Dataset:
    dataset = list()
    with open(filepath) as datafile:
        for i, line in enumerate(datafile):
            if i == 0:
                continue
            line = line.strip()
            patient_info = line.split(",")  # ["669f", "0", "0", "1", ..., "F", "76", "G3"]
            patient_data = [patient_info[0], patient_info[-3], 
                            patient_info[-1], int(patient_info[-2]),  # age as int, as declared in Patient
                            np.asarray(patient_info[1:-3]).astype(int)]
            patient = tuple(patient_data)
            dataset.append(patient)
    return dataset

In [21]:
import numpy as np

In [22]:
read_inflammation_04(DATASET_FILE)

[('669f',
  'F',
  'G3',
  76,
  array([ 0,  0,  1,  3,  1,  2,  4,  7,  8,  3,  3,  3, 10,  5,  7,  4,  7,
          7, 12, 18,  6, 13, 11, 11,  7,  7,  4,  6,  8,  8,  4,  4,  5,  7,
          3,  4,  2,  3,  0,  0])),
 ('2edf',
  'F',
  'G3',
  42,
  array([ 0,  1,  2,  1,  2,  1,  3,  2,  2,  6, 10, 11,  5,  9,  4,  4,  7,
         16,  8,  6, 18,  4, 12,  5, 12,  7, 11,  5, 11,  3,  3,  5,  4,  4,
          5,  5,  1,  1,  0,  1])),
 ('0355',
  'F',
  'G3',
  59,
  array([ 0,  1,  1,  3,  3,  2,  6,  2,  5,  9,  5,  7,  4,  5,  4, 15,  5,
         11,  9, 10, 19, 14, 12, 17,  7, 12, 11,  7,  4,  2, 10,  5,  4,  2,
          2,  3,  2,  2,  1,  1])),
 ('5968',
  'M',
  'G1',
  25,
  array([ 0,  0,  2,  0,  4,  2,  2,  1,  6,  7, 10,  7,  9, 13,  8,  8, 15,
         10, 10,  7, 17,  4,  4,  7,  6, 15,  6,  4,  9, 11,  3,  5,  6,  3,
          3,  4,  2,  3,  2,  1])),
 ('c760',
  'M',
  'G2',
  60,
  array([ 0,  1,  1,  3,  3,  1,  3,  5,  2,  4,  4,  7,  6,  5,  3, 10,  8
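
Tuples like the ones returned above carry no field names, so the meaning of each position lives only in the reader's head. The sketch below uses a hand-written record (readings truncated to the first eight days) to show positional access, unpacking, and how easily two fields can be swapped without any error being raised.

```python
from statistics import fmean
from typing import Sequence, Tuple

Patient = Tuple[str, str, str, int, Sequence[int]]

# Hand-written example record; readings truncated to the first eight days.
patient: Patient = ("669f", "F", "G3", 76, [0, 0, 1, 3, 1, 2, 4, 7])

# Positional access: the reader has to remember that index 3 is the age.
print(patient[3])

# Unpacking names the fields, but only locally and only if the order is right.
pid, sex, group, age, readings = patient
print(f"{pid}: {fmean(readings):.2f} average bouts over {len(readings)} days")

# Swapping two names is not an error, just a silent bug.
pid, sex, age, group, readings = patient
print(age)  # this now holds the group label, not the age
```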

**SOLUTION 2:** Patient as a `namedtuple`

```python

# TYPE HINT
Patient = PatientTuple
Dataset = Sequence[Patient]

```

In [23]:
from collections import namedtuple

PatientTuple = namedtuple("PatientTuple", ["pid", "sex", "group", "age", "inflammation_data"])
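
Before using `PatientTuple` in the reader, it may help to see what a `namedtuple` offers on top of a plain tuple. This is a minimal sketch on a hand-written record (readings truncated); `_fields`, `_asdict()` and `_replace()` are part of the documented `namedtuple` API.

```python
from collections import namedtuple

PatientTuple = namedtuple("PatientTuple", ["pid", "sex", "group", "age", "inflammation_data"])

# Hand-written example record; readings truncated to the first four days.
patient = PatientTuple(pid="669f", sex="F", group="G3", age=76, inflammation_data=[0, 0, 1, 3])

print(patient._fields)             # field names, in positional order
print(patient.age == patient[3])   # named and positional access agree
print(patient._asdict()["group"])  # conversion to a dict

# Fields cannot be reassigned: _replace builds a modified copy instead.
try:
    patient.age = 77
except AttributeError:
    print("cannot assign to a namedtuple field")
older = patient._replace(age=77)
print(patient.age, older.age)
```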

In [24]:
def read_inflammation_04(filepath: Path) -> Dataset:
    dataset = list()
    with open(filepath) as datafile:
        for i, line in enumerate(datafile):
            if i == 0:
                continue
            line = line.strip()
            patient_info = line.split(",")  # ["669f", "0", "0", "1", ..., "F", "76", "G3"]
            patient = PatientTuple(pid=patient_info[0], 
                                   sex=patient_info[-3], group=patient_info[-1], 
                                   age=int(patient_info[-2]), 
                                   inflammation_data=np.asarray(patient_info[1:-3]).astype(int)
                                  )
            dataset.append(patient)
    return dataset

In [26]:
dataset = read_inflammation_04(DATASET_FILE)

In [27]:
type(dataset)

list

In [28]:
type(dataset[0])

__main__.PatientTuple

In [29]:
patient = dataset[1]

print(f"{patient.pid}, Age: {patient.age}, Group: {patient.group}")

2edf, Age: 42, Group: G3


In [30]:
def g3_filter_function(patient: PatientTuple) -> bool:
    return patient.group == "G3"

g3_patients = filter(g3_filter_function, dataset)

In [31]:
type(g3_patients)

filter

In [32]:
g3_patients = list(g3_patients)

In [33]:
len(g3_patients)

26
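
The same selection can be written as a list comprehension, which returns a list directly. The sketch below is self-contained: it rebuilds a tiny dataset from the five patients listed by `df.head()`, with readings truncated to four days, so the counts differ from the full file. It also shows why the `filter` object had to be wrapped in `list(...)`: it is an iterator that can be consumed only once.

```python
from collections import namedtuple

PatientTuple = namedtuple("PatientTuple", ["pid", "sex", "group", "age", "inflammation_data"])

# Metadata of the five patients listed by df.head(); readings truncated to four days.
dataset = [
    PatientTuple("669f", "F", "G3", 76, [0, 0, 1, 3]),
    PatientTuple("2edf", "F", "G3", 42, [0, 1, 2, 1]),
    PatientTuple("0355", "F", "G3", 59, [0, 1, 1, 3]),
    PatientTuple("5968", "M", "G1", 25, [0, 0, 2, 0]),
    PatientTuple("c760", "M", "G2", 60, [0, 1, 1, 3]),
]

# A list comprehension selects the same patients and returns a list directly.
g3_patients = [patient for patient in dataset if patient.group == "G3"]
print([patient.pid for patient in g3_patients])

# A filter object is a one-shot iterator: a second pass finds nothing left.
lazy_g3 = filter(lambda patient: patient.group == "G3", dataset)
first_pass = list(lazy_g3)
second_pass = list(lazy_g3)
print(len(first_pass), len(second_pass))
```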

In [34]:
g3_patients

[PatientTuple(pid='669f', sex='F', group='G3', age=76, inflammation_data=array([ 0,  0,  1,  3,  1,  2,  4,  7,  8,  3,  3,  3, 10,  5,  7,  4,  7,
         7, 12, 18,  6, 13, 11, 11,  7,  7,  4,  6,  8,  8,  4,  4,  5,  7,
         3,  4,  2,  3,  0,  0])),
 PatientTuple(pid='2edf', sex='F', group='G3', age=42, inflammation_data=array([ 0,  1,  2,  1,  2,  1,  3,  2,  2,  6, 10, 11,  5,  9,  4,  4,  7,
        16,  8,  6, 18,  4, 12,  5, 12,  7, 11,  5, 11,  3,  3,  5,  4,  4,
         5,  5,  1,  1,  0,  1])),
 PatientTuple(pid='0355', sex='F', group='G3', age=59, inflammation_data=array([ 0,  1,  1,  3,  3,  2,  6,  2,  5,  9,  5,  7,  4,  5,  4, 15,  5,
        11,  9, 10, 19, 14, 12, 17,  7, 12, 11,  7,  4,  2, 10,  5,  4,  2,
         2,  3,  2,  2,  1,  1])),
 PatientTuple(pid='6b51', sex='F', group='G3', age=65, inflammation_data=array([ 0,  0,  1,  2,  2,  4,  2,  1,  6,  4,  7,  6,  6,  9,  9, 15,  4,
        16, 18, 12, 12,  5, 18,  9,  5,  3, 10,  3, 12,  7,  8,  4,  7,  3,

```python

def read_inflammation_04(filepath: Path) -> Dataset:
    dataset = list()
    with open(filepath) as datafile:
        for i, line in enumerate(datafile):
            if i == 0:
                continue
            line = line.strip()
            patient_info = line.split(",")  # ["669f", "0", "0", "1", ..., "F", "76", "G3"]
            patient = ?
            dataset.append(patient)
    return dataset
```

Putting our helmets on (_with some testing_) ⛑

Now it is time to rethink our data abstractions: let's define our own **new type**!
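
As a first impression of what such a type could look like, here is one possible shape based on `dataclasses`. It is only a sketch: the class name `Patient`, its field names (mirroring `PatientTuple`) and the `average_inflammation` method are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from statistics import fmean
from typing import List


@dataclass
class Patient:
    """Hypothetical patient type; field names mirror the namedtuple above."""
    pid: str
    sex: str
    group: str
    age: int
    inflammation_data: List[int] = field(default_factory=list)

    def average_inflammation(self) -> float:
        """Mean number of bouts over the recorded days."""
        return fmean(self.inflammation_data)


patient = Patient(pid="669f", sex="F", group="G3", age=76,
                  inflammation_data=[0, 0, 1, 3, 1, 2, 4, 7])  # first 8 days only
print(patient.average_inflammation())
print(patient)
```

Unlike a tuple, a class can keep the data together with the operations that make sense on it, while `@dataclass` generates `__init__`, `__repr__` and `__eq__` from the field declarations.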