# Granularity presentation

This notebook is an overview of the results of the granularity investigation. The investigation consists of the following notebooks:
- old_data_investigation.ipynb
- transplants_investigation.ipynb
- granulairty_investigation.ipynb



## Imports

In [None]:
import pandas as pd
import os
from typing import List
import re
import numpy as np
import sys
import math
import time
import logging

In [None]:
sys.path.insert(0, "../..")

from local_testing_utilities.notebook_utils.pairing_data import parse_pairing_data
from local_testing_utilities.notebook_utils.survival_data import parse_survival_data

## Load data

First, we load the patient data.

In [None]:
df_all_patients = parse_pairing_data('data/KDP-processed', 'data/patients_list_recipientID.csv', remove_single_donors=False)

In [None]:
df_survival = parse_survival_data('data/LD_kidney_survival.csv')
df_survival_summary = pd.read_pickle('data/survival_summary.pkl')

In [None]:
df_patients_with_recipient_id = pd.read_csv('data/patients_list_recipientID.csv')
df_transplanted_donors = pd.read_excel('data/transplanted_donors.xlsx', index_col=None)

### In old txm events, we found

Patient records:

In [None]:
len(df_all_patients.index)

Unique donors:

In [None]:
len(df_all_patients.groupby(['donor_name']).first().index)

Unique recipients:

In [None]:
len(df_all_patients.loc[lambda df: df.recipient_name != ''].groupby(['recipient_name']).first().index)

Each color corresponds to one recipient. The following plot shows the txm events in which each recipient appeared:

In [None]:
df_event_to_patients = pd.pivot_table(df_all_patients.assign(one=1), values='one', index=['txm_event'], columns=['recipient_id'], aggfunc=np.sum, fill_value=0)
df_event_to_patients.plot.area(figsize=(20,10), legend=False, title='Txm events in which each patient appeared')

## Ended patients

We checked in which TXM event each patient was last seen. We then searched the survival data for a matching transplant to determine whether the patient left because he or she received a transplant or for some other reason.

In [None]:
df_donors_last_event = df_all_patients.loc[
    df_all_patients.apply(
        lambda row1:
        not df_all_patients.apply(
            lambda row2:
            row1.txm_event + 1 == row2.txm_event and \
            row1.donor_name == row2.donor_name,
            axis=1
        ).any(),
        axis=1
    )
]

df_recipients_last_event = df_all_patients.loc[
    df_all_patients.apply(
        lambda row1:
        not df_all_patients.apply(
            lambda row2:
            row1.txm_event + 1 == row2.txm_event and \
            row1.recipient_name == row2.recipient_name,
            axis=1
        ).any() and \
        row1.recipient_name != '',
        axis=1
    )
]

df_transplanted_donors = pd.read_excel('data/transplanted_donors.xlsx', index_col=None)

df_donors_last_event_with_surv = df_donors_last_event\
    .join(df_transplanted_donors.set_index('donor_name')['target_recipient_id'], on='donor_name')\
    .join(df_survival_summary.set_index('RecipientID'), on='target_recipient_id', rsuffix='_surv')

df_recipients_last_event_with_surv = df_recipients_last_event.join(df_survival_summary.set_index('RecipientID'), on='recipient_id', rsuffix='_surv')

df_recipients_last_event_with_surv.groupby('txm_event').count()\
    .join(df_donors_last_event_with_surv.groupby('txm_event').count(), rsuffix='_donors')\
    .apply(lambda row: pd.Series(
    {
        'Recipients ended': row.recipient_name,
        'Recipients ended with transplant found': row.delay,
        'Recipients ended without transplant': row.recipient_name - row.delay,
        'Donors ended': row.donor_name_donors,
        'Donors ended with transplant found': row.delay_donors,
        'Donors ended without transplant': row.donor_name_donors - row.delay_donors
    }), axis=1)\
    .plot(
        style=['b-','g-','r-', 'b--','g--','r--'],
        title='Number of patients last seen in the given txm event versus those mapped to a transplant date',
        figsize=(14, 7)
    )
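
The last-event selection above compares every row with every other row through nested `apply` calls, which grows quadratically with the number of records. A minimal sketch of the same check using a set lookup follows. The helper name `last_seen_mask` and the toy data are hypothetical and are not part of the investigation notebooks.

```python
from typing import Hashable, Iterable, List


def last_seen_mask(names: Iterable[Hashable], events: Iterable[int]) -> List[bool]:
    """Return True for each (name, event) whose name is absent from event + 1."""
    pairs = list(zip(names, events))
    present = set(pairs)  # every (name, txm_event) combination in the data
    return [(name, event + 1) not in present for name, event in pairs]


# Toy data (made up for illustration): donor A appears in events 1, 2 and 4.
toy_names = ["A", "A", "B", "A"]
toy_events = [1, 2, 1, 4]
toy_mask = last_seen_mask(toy_names, toy_events)
print(toy_mask)  # [False, True, True, True]
```

As in the cell above, a patient who skips an event (donor A is missing from event 3) counts as ended in the event before the gap. Assuming the column names used in this notebook, `df_all_patients.loc[last_seen_mask(df_all_patients.donor_name, df_all_patients.txm_event)]` should select the same rows as `df_donors_last_event`; the recipient variant additionally needs the `recipient_name != ''` filter.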

## Compute matchings for TXM event with various granularity
We ran the matching algorithm for patients in TXM events with various granularity.

- granularity 1 = 3 months
- granularity 2 = 6 months
- granularity 3 = 9 months
- granularity 4 = 12 months

For a given granularity, each event contains the patients from the original event plus the patients from the $granularity - 1$ previous events who have been transplanted.
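
To make the pooling rule concrete, the sketch below lists which txm events contribute patients to a pooled event. The function name `pooled_events` is hypothetical, and the sketch only enumerates event numbers; it does not model the restriction to transplanted patients or the matching algorithm itself.

```python
from typing import List


def pooled_events(txm_event: int, granularity: int) -> List[int]:
    """Events whose patients are pooled: the event itself plus granularity - 1 predecessors."""
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    return list(range(txm_event - (granularity - 1), txm_event + 1))


print(pooled_events(20, 1))  # [20]
print(pooled_events(20, 4))  # [17, 18, 19, 20]
```

With granularity 1 an event contains only its own patients, while with granularity 4 (12 months) it also draws on the three preceding quarterly events.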

In [None]:
df_granularity_results = pd.read_csv('data/granularity_results.csv')

In [None]:
# df_granularity_results.pivot_table(index='txm_event', columns='granularity', values=['donors_count']).plot(ylabel='Donors count')
# df_granularity_results.pivot_table(index='txm_event', columns='granularity', values='recipients_count').plot(ylabel='Recipients count')
df_granularity_results.pivot_table(index='txm_event', columns='granularity', values='matching_pairs_count').plot(ylabel='matching_pairs_count')
df_granularity_results.pivot_table(index='txm_event', columns='granularity', values='matching_pairs_count_normalized').plot(ylabel='matching_pairs_count_normalized')
# df_granularity_results.pivot_table(index='txm_event', columns='granularity', values='elapsed_time').plot(ylabel='elapsed_time (s)')

Show how many patients would be transplanted in each year under the various granularities.

For granularity 2 (6 months):
- shift 0 corresponds to pairing in winter and **summer**
- shift 1 corresponds to pairing in spring and **autumn**

For granularity 4 (12 months):
- shift 0 corresponds to pairing in **winter**
- shift 1 corresponds to pairing in **spring**
- shift 2 corresponds to pairing in **summer**
- shift 3 corresponds to pairing in **autumn**
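
The shifts above follow from the position of an event within its year. The sketch below reproduces the arithmetic used in the next cell, where event 10 is treated as the first event of year 0. The function name `event_position` and the `SEASONS` list are hypothetical; the season order is taken from the granularity 4 list above.

```python
from typing import Tuple

# Season for each position within a year, in the order of the shifts listed above.
SEASONS = ["winter", "spring", "summer", "autumn"]


def event_position(txm_event: int, first_event: int = 10) -> Tuple[int, int, str]:
    """Return (year, event_in_year, season) for a quarterly txm event number."""
    year, event_in_year = divmod(txm_event - first_event, 4)
    return year, event_in_year, SEASONS[event_in_year]


print(event_position(10))  # (0, 0, 'winter')
print(event_position(15))  # (1, 1, 'spring')
```

A granularity 4 schedule with shift 1 would therefore use only the events whose `event_in_year` is 1, and a granularity 2 schedule with shift 1 would use those whose `event_in_year` is 1 or 3.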

In [None]:
df_results_per_year = df_granularity_results\
    .assign(year=lambda df: (df.txm_event-10)//4)\
    .assign(event_in_year=lambda df: (df.txm_event-10)%4)
df_results_per_year['granularity_1'] = df_results_per_year.apply(lambda s: s.matching_pairs_count if s.granularity == 1 else 0, axis=1)
df_results_per_year['granularity_2_shift_0'] = df_results_per_year.apply(lambda s: s.matching_pairs_count if s.granularity == 2 and s.event_in_year in [0, 2] else 0, axis=1)
df_results_per_year['granularity_2_shift_1'] = df_results_per_year.apply(lambda s: s.matching_pairs_count if s.granularity == 2 and s.event_in_year in [1, 3] else 0, axis=1)
df_results_per_year['granularity_4_shift_0'] = df_results_per_year.apply(lambda s: s.matching_pairs_count if s.granularity == 4 and s.event_in_year == 0 else 0, axis=1)
df_results_per_year['granularity_4_shift_1'] = df_results_per_year.apply(lambda s: s.matching_pairs_count if s.granularity == 4 and s.event_in_year == 1 else 0, axis=1)
df_results_per_year['granularity_4_shift_2'] = df_results_per_year.apply(lambda s: s.matching_pairs_count if s.granularity == 4 and s.event_in_year == 2 else 0, axis=1)
df_results_per_year['granularity_4_shift_3'] = df_results_per_year.apply(lambda s: s.matching_pairs_count if s.granularity == 4 and s.event_in_year == 3 else 0, axis=1)

df_results_per_year = df_results_per_year.groupby(['year']).sum().reset_index()
df_results_per_year = df_results_per_year[[
    'granularity_1',
    'granularity_2_shift_0',
    'granularity_2_shift_1',
    'granularity_4_shift_0',
    'granularity_4_shift_1',
    'granularity_4_shift_2',
    'granularity_4_shift_3'
]]

df_results_per_year.plot(
    figsize=(15, 7), style=['b-','g-','g--','r-','r--','r-o','r-.',],
)

#### Overall number of transplants

In [None]:
df_results_per_year.sum()

### Dependency between patient count and transplants found

We ran the matching algorithm on random samples of patients drawn from the old pairing data. The plots below show how the number of transplants found depends on the number of input patients.

In [None]:
df_ratio_results = pd.read_csv('data/ratio_results.csv')

Number of transplants found for a given patient count:

In [None]:
df_ratio_results.groupby('patients').agg({'matching_pairs_count': ['mean', 'std']}).plot(xlim=0, ylim=0)

Ratio of the number of transplants found to the number of patients, for a given patient count:

In [None]:
df_ratio_results.assign(ratio=lambda df: df.matching_pairs_count/df.patients).groupby('patients').agg({'ratio': ['mean', 'std']}).plot(xlim=0, ylim=0)