# Summary Statistics

This notebook displays summary statistics of patient records,
comparing multiple snapshots extracted by the `cohortextractor` action.

## Preliminaries

In [None]:
from IPython.display import display, Markdown
import itertools
import matplotlib
import pandas
from pathlib import Path

In [None]:
%matplotlib inline
matplotlib.style.use("seaborn")  # Note: Matplotlib 3.6 renamed this style to "seaborn-v0_8"

In [None]:
BASE_DIR = Path("../output")

In [None]:
def read_feather(f_path):
    return pandas.read_feather(f_path).assign(
        f_name=f_path.name,  # We need the file name because we will concatenate the feather files
    )

In [None]:
# Concatenate the feather files in `BASE_DIR`
records = pandas.concat(read_feather(x) for x in BASE_DIR.iterdir() if x.name.endswith(".feather"))

In [None]:
# Unfortunately, `pandas.concat` converts categorical columns to string (object) columns
# if their sets of categories differ. Consequently, we must cast the `f_name` column
# to a categorical column here.
records.f_name = records.f_name.astype("category")
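
To see why the cast is needed, here is a minimal sketch on toy data. The frames `left` and `right` and their `sex` values are made up for illustration; they are not taken from the patient records.

```python
import pandas

# Two toy frames whose categorical columns have different sets of categories
left = pandas.DataFrame({"sex": pandas.Categorical(["F", "M"])})
right = pandas.DataFrame({"sex": pandas.Categorical(["F", "I"])})

combined = pandas.concat([left, right], ignore_index=True)

# The concatenated column is no longer categorical (it is typically an object column)
print(combined["sex"].dtype)

# Casting restores a categorical column whose categories are the values seen
restored = combined["sex"].astype("category")
print(restored.cat.categories)
```

When every frame shares the same categories, the categorical dtype survives concatenation and no cast is needed.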

In [None]:
# Patient IDs should be unique within each file
assert records.set_index(["f_name", "patient_id"]).index.is_unique

Discretise patient ages into age groups, each spanning `age_group_width` years.

In [None]:
age_group_width = 10

In [None]:
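records["age_group"] = pandas.cut(
    records.age,
    # The extra 1 keeps the maximum age inside the last group when it falls exactly on an edge
    range(0, int(records.age.max()) + age_group_width + 1, age_group_width),
    right=False,  # Exclude the right edge, so each group is [lower, upper), i.e. lower <= x < upper
)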
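
As a quick check of how `right=False` assigns ages that sit exactly on an edge, here is a small sketch on made-up ages. The names `toy_ages`, `toy_width` and `toy_edges` are hypothetical. The `+ 1` in the edge calculation is one way to make sure the largest age still falls inside a group when it lies exactly on an edge.

```python
import pandas

toy_ages = pandas.Series([0, 9, 10, 19, 20])  # Made-up ages; 10 and 20 sit on edges
toy_width = 10

# Edges 0, 10, 20, 30: the last edge lies strictly above the maximum age
toy_edges = list(range(0, int(toy_ages.max()) + toy_width + 1, toy_width))

toy_groups = pandas.cut(toy_ages, toy_edges, right=False)

# Each age is assigned to a half-open interval [lower, upper)
print(pandas.DataFrame({"age": toy_ages, "age_group": toy_groups}))
```

An age of 10 lands in the group starting at 10, not in the group ending at 10, because the right edge is excluded.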

## Patients

How many patients are in each file?

---
**Aside**: There are several ways to count groups in Pandas. For consistency, we will:

* locate the columns of interest
* group by these columns
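* count the number of rows (strictly, `count` tallies the non-null values in the remaining column, which equals the number of rows when that column has no missing values)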

For readability, we will also put each step on a separate line.

---
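
The three steps can be sketched on a toy table as follows. The file names and patient IDs are invented for illustration, and `groupby(...).size()` is shown only as one of the alternative ways to count groups.

```python
import pandas

toy = pandas.DataFrame({
    "f_name": ["a.feather", "a.feather", "b.feather"],  # Hypothetical file names
    "patient_id": [1, 2, 1],
})

# Locate the columns of interest, group by them, then count
counts = (toy
    .loc[:, ["f_name", "patient_id"]]
    .groupby("f_name")
    .count())

# An alternative: `size` counts rows in each group, including rows with missing values
sizes = toy.groupby("f_name").size()

print(counts)
print(sizes)
```

Both approaches agree here because `patient_id` has no missing values.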

In [None]:
(records
    .loc[:, ["f_name", "patient_id"]]
    .groupby("f_name")
    .count())

How many patients are in each file, by sex?

In [None]:
(records
    .loc[:, ["f_name", "sex", "patient_id"]]
    .groupby(["f_name", "sex"])
    .count())

How many patients are in each file, by age group?

In [None]:
by_age_group = (records
    .loc[:, ["age_group", "f_name", "patient_id"]]
    .groupby(["age_group", "f_name"])
    .count())

In [None]:
by_age_group.unstack()

In [None]:
_ = (by_age_group
    .unstack()
    .plot.bar(subplots=True, figsize=(6, 6), legend=False))

How many patients in each file deregistered, by month?

For each pair of files, how many patients are in:

* both files
* the first file but not the second file
* the second file but not the first file

---
**Aside**: We compare each pair of files to future-proof our notebook, so that it still works when there are more than two files.
Because order isn't significant, we use `itertools.combinations` rather than `itertools.permutations`.

---
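
The difference between the two functions is easiest to see on a short list. This sketch uses three placeholder names, not the real file names.

```python
import itertools

toy_f_names = ["a", "b", "c"]  # Placeholder names for illustration

# Each unordered pair appears once
pairs = list(itertools.combinations(toy_f_names, 2))

# Each ordered pair appears, so ("a", "b") and ("b", "a") are both present
ordered_pairs = list(itertools.permutations(toy_f_names, 2))

print(pairs)
print(len(ordered_pairs))
```

With `combinations`, three files give three comparisons. With `permutations`, they would give six, and each comparison would be reported twice.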

In [None]:
# Let's revisit sets.
# https://docs.python.org/3.8/library/stdtypes.html#set
set_1 = {1, 2, 3}
set_2 = {3, 4, 5}
assert set_1 & set_2 == {3}  # In set 1 and in set 2 (intersection)
assert set_1 - set_2 == {1, 2}  # In set 1, but not in set 2 (difference)
assert set_2 - set_1 == {4, 5}  # In set 2, but not in set 1 (difference)
del set_1
del set_2

In [None]:
for f_name_0, f_name_1 in itertools.combinations(records.f_name.cat.categories, 2):
    patient_id_0 = set(records.patient_id[records.f_name == f_name_0])
    patient_id_1 = set(records.patient_id[records.f_name == f_name_1])

    display(Markdown(f"{len(patient_id_0 & patient_id_1)} patients are in *{f_name_0}* and in *{f_name_1}*."))
    display(Markdown(f"{len(patient_id_0 - patient_id_1)} patients are in *{f_name_0}* but not in *{f_name_1}*."))
    display(Markdown(f"{len(patient_id_1 - patient_id_0)} patients are in *{f_name_1}* but not in *{f_name_0}*."))

## Practices

How many practices are in each file?

In [None]:
(records
    .loc[:, ["f_name", "practice_pseudo_id"]]
    .drop_duplicates()  # Remember that rows are patient records.
    .groupby("f_name")
    .count())
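
To illustrate why `drop_duplicates` matters here, the sketch below uses a made-up table in which several patients share a practice. The column names match those above, but the values are invented. `nunique` is shown as an alternative that we believe gives the same answer.

```python
import pandas

toy = pandas.DataFrame({
    "f_name": ["a.feather"] * 3 + ["b.feather"] * 2,  # Hypothetical file names
    "patient_id": [1, 2, 3, 1, 2],
    "practice_pseudo_id": [10, 10, 20, 10, 10],
})

# Without dropping duplicates, we count patient records rather than practices
records_per_file = (toy
    .loc[:, ["f_name", "practice_pseudo_id"]]
    .groupby("f_name")
    .count())

# Dropping duplicate (file, practice) rows first gives the number of practices
practices_per_file = (toy
    .loc[:, ["f_name", "practice_pseudo_id"]]
    .drop_duplicates()
    .groupby("f_name")
    .count())

# An alternative: count the distinct practice IDs in each group
distinct_per_file = toy.groupby("f_name")["practice_pseudo_id"].nunique()

print(records_per_file)
print(practices_per_file)
print(distinct_per_file)
```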