# [Chapter 5] Exploring the global organ donation trends data

[DSLC stages]: EDA


In this document, we will conduct an EDA of the organ donation data. Each section asks a question of the data and then produces several exploratory visualizations to answer it. Interesting findings are evaluated using PCS, and a few are turned into explanatory findings.

Let's load and clean/pre-process the organ donation data (recall that we developed the cleaning/pre-processing workflow in the file `01_cleaning.qmd`, and saved our cleaning function, `prepare_organ_data()`, in the `functions` module that we import below). It is often helpful to keep a copy of the original uncleaned data in your environment.


In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from functions.prepare_organ_data import prepare_organ_data
from functions.impute_feature import impute_feature

pd.set_option('display.max_columns', None)

In [2]:
# load the organs data
organs_original = pd.read_csv("../data/global-organ-donation_2018.csv")
# create the organs_clean object
organs_clean = prepare_organ_data(organs_original)
organs_clean.head()


Unnamed: 0,country,year,region,population,population_imputed,total_deceased_donors,total_deceased_donors_imputed,deceased_donors_brain_death,deceased_donors_circulatory_death,total_utilized_deceased_donors,utilized_deceased_donors_brain_death,utilized_deceased_donors_circulatory_death,deceased_kidney_tx,living_kidney_tx,total_kidney_tx,deceased_liver_tx,domino_liver_tx,living_liver_tx,total_liver_tx,total_heart_tx,living_lung_tx,LD Lung Tx,total_lung_tx,total_pancreas_tx,total_kidney_pancreas_tx,total_small_bowel_tx
0,Andorra,2000,Europe,100000.0,100000.0,,0.0,,,,,,,,,,,,,,,,,,,
1,Andorra,2001,Europe,100000.0,100000.0,,0.0,,,,,,,,,,,,,,,,,,,
2,Andorra,2002,Europe,100000.0,100000.0,,0.0,,,,,,,,,,,,,,,,,,,
3,Andorra,2003,Europe,100000.0,100000.0,,0.0,,,,,,,,,,,,,,,,,,,
4,Andorra,2004,Europe,100000.0,100000.0,,0.0,,,,,,,,,,,,,,,,,,,


Next, since many of our explorations will involve looking at the donor *rates*, let's create a version of the original and imputed donor counts *per million* (we could have included this in the `prepare_organ_data()` function, since it can be thought of as a pre-processing featurization step).


In [3]:
# add a donors_per_mil column for the donors cols
# note that we use `population_imputed + 1` in the denominator because there are some countries with a reported population of 0
organs_clean = organs_clean.assign(total_deceased_donors_per_mil = (organs_clean.total_deceased_donors / (organs_clean.population_imputed + 1)) * 1_000_000)
organs_clean = organs_clean.assign(total_deceased_donors_imputed_per_mil = (organs_clean.total_deceased_donors_imputed / (organs_clean.population_imputed + 1)) * 1_000_000)
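
To see what the `+ 1` in the denominator does, here is a minimal sketch on a made-up three-row data frame (the countries and numbers are hypothetical, not taken from the organ donation data). It suggests that the adjustment avoids an undefined rate when the population is 0, while changing the rate for realistically sized populations only negligibly.

```python
import pandas as pd

# hypothetical toy data: country "C" has a reported population of 0
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "population_imputed": [46_400_000.0, 100_000.0, 0.0],
    "total_deceased_donors_imputed": [2183.0, 3.0, 0.0],
})

# rate per million with the `+ 1` adjustment used in this document
toy["rate_plus_one"] = (
    toy["total_deceased_donors_imputed"] / (toy["population_imputed"] + 1) * 1_000_000
)
# rate per million without the adjustment (0 / 0 gives a missing value)
toy["rate_plain"] = (
    toy["total_deceased_donors_imputed"] / toy["population_imputed"] * 1_000_000
)

print(toy[["country", "rate_plus_one", "rate_plain"]])
```

For country "C", the adjusted rate is 0 whereas the unadjusted rate is missing; for the other two countries the two versions are practically indistinguishable.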

## High-level summary of the data

The first question we ask is very vague: *what do the variables in the data look like?* Before looking at specific trends, it's helpful to give a high-level summary of the variables of interest (let's focus here on just population, the donor count, and the donor rate per million). These summaries aren't necessarily supposed to tell a story about the trends in the data, but rather are just supposed to give us a sense of what the data itself looks like.


In [4]:
organs_clean.columns

Index(['country', 'year', 'region', 'population', 'population_imputed',
       'total_deceased_donors', 'total_deceased_donors_imputed',
       'deceased_donors_brain_death', 'deceased_donors_circulatory_death',
       'total_utilized_deceased_donors',
       'utilized_deceased_donors_brain_death',
       'utilized_deceased_donors_circulatory_death', 'deceased_kidney_tx',
       'living_kidney_tx', 'total_kidney_tx', 'deceased_liver_tx',
       'domino_liver_tx', 'living_liver_tx', 'total_liver_tx',
       'total_heart_tx', 'living_lung_tx', 'LD Lung Tx', 'total_lung_tx',
       'total_pancreas_tx', 'total_kidney_pancreas_tx', 'total_small_bowel_tx',
       'total_deceased_donors_per_mil',
       'total_deceased_donors_imputed_per_mil'],
      dtype='object')

In [None]:
fig = make_subplots(rows=1, cols=3)
fig.add_trace(
    go.Histogram(x=organs_clean["population"], name="Population"),
    row=1, col=1)
fig.add_trace(
    go.Histogram(x=organs_clean["total_deceased_donors"], name="Total deceased donors"),
    row=1, col=2)
fig.add_trace(
    go.Histogram(x=organs_clean["total_deceased_donors_per_mil"], name="Deceased donors per million"),
    row=1, col=3)
# add axes
fig.update_layout(
    title_text="Histograms of important variables",
    xaxis1_title_text="Population",
    xaxis2_title_text="Total deceased donors",
    xaxis3_title_text="Total deceased donors per million",
    showlegend=False,
)
fig.show()

Both the donor count and the donor count per million are heavily concentrated around 0.

## Global organ donations are increasing over time

*Are global organ donations increasing over time?*

The plot below shows the increasing trend in (imputed) organ donations across the world over time. The imputed donor counts are based on the "average" imputation method.  


In [None]:
donors_by_year = organs_clean.groupby("year")["total_deceased_donors_imputed"].sum()
px.line(donors_by_year)

In [8]:
# compute the number of organ donations in 2017
total_2017 = organs_clean.query('year == 2017')["total_deceased_donors_imputed"].sum()
# compute the number of organ donations in 2000
total_2000 = organs_clean.query('year == 2000')["total_deceased_donors_imputed"].sum()


In [9]:
total_2000

np.float64(21321.0)

In [10]:
total_2017

np.float64(36885.0)

Clearly there has been a substantial increase in organ donations over time: the (imputed) global total rose from 21,321 in 2000 to 36,885 in 2017.
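
To put a number on this increase, the two totals printed above can be turned into an absolute and a relative change. This is just a small arithmetic sketch that hard-codes the two values shown in the output above.

```python
# totals copied from the output above
total_2000 = 21321.0
total_2017 = 36885.0

# absolute and relative change between 2000 and 2017
absolute_change = total_2017 - total_2000
relative_change_pct = 100 * absolute_change / total_2000

print(f"Absolute change: {absolute_change:,.0f} donors")
print(f"Relative change: {relative_change_pct:.1f}%")
```

Keep in mind that these totals are based on the imputed donor counts, so the size of the increase depends on the imputation judgment call.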


### PCS evaluation

#### Stability to a cleaning and pre-processing judgment call

Let's check the stability of the main takeaway from this plot concerning the organ donation trends over time to the imputation judgment call that we made.

The figure below shows how the trend line changes under each of the imputation methods ("Average" imputation, "Previous" imputation, and no imputation). The "Previous" imputation method seems to yield similar results to no imputation (removing missing values), except in the last year or two. The "Average" imputation method yields higher donor counts overall. The overall trend that the number of organ donations is increasing is certainly stable, but the "Previous" imputation method and no imputation ("None") make the rate of increase seem much more rapid. However, based on our domain understanding of these missing values (and assuming that most of the missing values are more likely to be closer to the "Average" imputed value than to the previous imputed value or 0), we feel that the "Average" imputed results are likely to be a better representation of reality.

In [11]:
# add previous imputed donor count value
organs_clean["total_deceased_donors_imputed_previous"] = impute_feature(organs_clean, 
                                                                        feature="total_deceased_donors", 
                                                                        group="country", 
                                                                        impute_method="previous") 
# compute the donor counts by year for each imputation approach
unimputed_donors_by_year = organs_clean.groupby("year")["total_deceased_donors"].sum()  
imputed_average_donors_by_year = organs_clean.groupby("year")["total_deceased_donors_imputed"].sum()  
imputed_previous_donors_by_year = organs_clean.groupby("year")["total_deceased_donors_imputed_previous"].sum()  

# combine the three series (aligned on their shared year index) and convert the index to a column
imputed_donors_by_year_df = pd.DataFrame({
    "None": unimputed_donors_by_year,
    "Average": imputed_average_donors_by_year,
    "Previous": imputed_previous_donors_by_year
    }
  ).rename_axis("year").reset_index().melt(id_vars="year", var_name="imputation_method")
  
px.line(imputed_donors_by_year_df, x="year", y="value", color="imputation_method")
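
To make the difference between the imputation judgment calls concrete, here is a minimal sketch on hypothetical toy data. It does not use `impute_feature()`; instead it assumes that "previous" imputation carries the last observed value forward within each country, and that "average" imputation takes the mean of the nearest earlier and later observed values. Both definitions are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: two countries, four years, some missing donor counts
toy = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "year": [2000, 2001, 2002, 2003] * 2,
    "donors": [10.0, np.nan, 30.0, np.nan, 5.0, 7.0, np.nan, 9.0],
})

# nearest earlier / later observed value within each country
prev_value = toy.groupby("country")["donors"].ffill()
next_value = toy.groupby("country")["donors"].bfill()

# "previous" imputation: carry the last observed value forward
toy["donors_previous"] = prev_value
# "average" imputation (assumed definition): mean of the neighboring observed values,
# falling back to whichever neighbor exists at the edges of the series
toy["donors_average"] = pd.concat(
    [prev_value, next_value], axis=1, keys=["prev", "next"]
).mean(axis=1)

# yearly totals under each judgment call (sum() skips missing values, i.e., "None")
totals_by_year = toy.groupby("year")[["donors", "donors_previous", "donors_average"]].sum()
print(totals_by_year)
```

Even in this tiny example, the yearly totals differ across the three approaches, which is why it is worth checking that the overall trend does not depend on the choice.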





## The US has the *most donors*, but Spain has the *highest donor rate*


The next question we want to ask is *which country had the highest number of organ donations per million people in 2017?* To answer this question, let's first print out the donor counts for the 20 countries with the highest donor counts in 2017. In the table below, it is clear that the US has the most organ donations by far, followed by China and Brazil.

In [12]:
organs_2017 = organs_clean[organs_clean["year"] == 2017]
countries_top_20_2017 = organs_2017.set_index("country")["total_deceased_donors_imputed"] \
    .sort_values(ascending=False) \
    .head(20)
countries_top_20_2017


country
United States of America      10286.0
China                          4080.0
Brazil                         3420.0
Spain                          2183.0
France                         1933.0
Italy                          1714.0
United Kingdom                 1492.0
Iran (Islamic Republic of)      870.0
Canada                          802.0
Germany                         797.0
Argentina                       593.0
Russian Federation              572.0
Turkey                          562.0
Poland                          560.0
Australia                       510.0
Mexico                          509.0
Republic of Korea               501.0
Colombia                        437.0
India                           391.0
Portugal                        351.0
Name: total_deceased_donors_imputed, dtype: float64

We can visualize this using a bar chart. 

In [13]:
px.bar(countries_top_20_2017)

Since the populations of these countries are quite different, these counts are not really an apples-to-apples comparison. Let's instead compare the donor counts *per million* for each country.


In [14]:
countries_top_20_2017_per_mil = organs_2017.set_index("country")["total_deceased_donors_imputed_per_mil"] \
    .sort_values(ascending=False) \
    .head(20)
countries_top_20_2017_per_mil

country
Spain                       47.047413
Portugal                    34.077667
Croatia                     33.333325
United States of America    31.697997
Belgium                     30.526313
Malta                       29.999925
France                      29.738461
Italy                       28.855218
Czech Republic              25.377356
Austria                     24.482756
Belarus                     23.578945
United Kingdom              22.537764
Canada                      21.912568
Norway                      21.886788
Finland                     21.454542
Australia                   20.816326
Ireland                     20.624996
Slovenia                    20.476181
Iceland                     19.999933
Sweden                      19.393937
Name: total_deceased_donors_imputed_per_mil, dtype: float64

Again, we can visualize this using a bar chart.

In [15]:
px.bar(countries_top_20_2017_per_mil)

When viewed in the context of population size, it appears that *Spain* (not the US) is the clear world leader in organ donation *rates*. China and Brazil don't even feature this time (because their numbers of organ donations are not actually that impressive when viewed in the context of the size of their populations).
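
The link between the two tables is just a division by population, so the counts and rates printed above can be used to back out the population (in millions) that each rate implies. The sketch below hard-codes three rows from the two outputs above; the dictionary names are our own.

```python
# 2017 donor counts and rates per million, copied from the two tables above
counts = {"Spain": 2183.0, "United States of America": 10286.0, "Portugal": 351.0}
rates = {"Spain": 47.047413, "United States of America": 31.697997, "Portugal": 34.077667}

# count / (rate per million) = implied population in millions
implied_pop_millions = {country: counts[country] / rates[country] for country in counts}

for country, pop in implied_pop_millions.items():
    print(f"{country}: {counts[country]:,.0f} donors, "
          f"{rates[country]:.1f} per million, implied population {pop:.1f} million")
```

The US has several times as many donors as Spain, but its population is larger by an even bigger factor, which is why its rate is lower.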


### PCS evaluation

#### Predictability

A quick literature search revealed that it is a very well-known fact that Spain is the world leader in organ donations. While it seems that many of these reports are based on the same data as this dataset that we are using, the fact that this information seems so broadly reported feels like reasonable evidence of the predictability of this finding. 

Another way that we can demonstrate the predictability of this finding is by showing that it occurs not just in 2017, but also for 2016. The figure below reproduces the two bar charts above, but using the 2016 data. The results are very similar (although the extent to which Spain's rates are higher than Portugal and Croatia's is less extreme).

In [17]:
organs_2016 = organs_clean[organs_clean["year"] == 2016]
countries_top_20_2016 = organs_2016.set_index("country")["total_deceased_donors_imputed"] \
    .sort_values(ascending=False) \
    .head(20)
countries_top_20_2016_per_mil = organs_2016.set_index("country")["total_deceased_donors_imputed_per_mil"] \
    .sort_values(ascending=False) \
    .head(20)

fig = make_subplots(rows=1, cols=2)
fig.add_trace(
    go.Bar(x=countries_top_20_2016.index, y=countries_top_20_2016, name="Total donor counts"),
    row=1, col=1)
fig.add_trace(
    go.Bar(x=countries_top_20_2016_per_mil.index, y=countries_top_20_2016_per_mil, name="Donor counts per mil"),
    row=1, col=2)
# add axis titles
fig.update_layout(
    title_text="Bar charts of 2016 donor counts",
    xaxis1_title_text="Total donor counts",
    xaxis2_title_text="Donor counts per mil",
    showlegend=False,
)
fig.show()
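
A complementary, purely numerical way to check that a finding reappears in other years is to identify the leading country in each year. The sketch below uses a small hypothetical data frame (the countries and rates are made up) rather than `organs_clean`.

```python
import pandas as pd

# hypothetical donor rates per million for three countries over three years
toy = pd.DataFrame({
    "country": ["A", "B", "C"] * 3,
    "year": [2015] * 3 + [2016] * 3 + [2017] * 3,
    "rate": [35.0, 28.0, 30.0, 40.0, 33.0, 31.0, 47.0, 34.0, 32.0],
})

# for each year, find the row with the highest rate and keep its country
leader_rows = toy.loc[toy.groupby("year")["rate"].idxmax()]
leaders = leader_rows.set_index("year")["country"]
print(leaders)

# the finding reappears across years if the same country leads every year
print("Same leader in every year:", leaders.nunique() == 1)
```

The same idea could be applied to the real data by replacing `toy` and `rate` with `organs_clean` and the imputed per-million column.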

#### Stability to a data visualization judgment call

Since this result is unlikely to change due to data perturbations and imputation judgment calls, let's conduct a brief stability analysis evaluating whether our conclusions change if we use a different visualization technique to look at the data. 

The figure below shows a heatmap of the organ donation rate for each country for each year (the rows are arranged in order of the 2017 rate). From this figure it is still very clear that Spain is a world leader in organ donations!


In [18]:
# extract the names of the top 20 countries from 2017
countries_top_2017 = countries_top_20_2017_per_mil.index
# filter the organs_clean data (all years) to these top 20 countries
organs_top_countries = organs_clean.query('country in @countries_top_2017').copy()
# add the word "year" to the year column so that we can use it as a variable name later
organs_top_countries["year"] = "year_" + organs_top_countries["year"].astype(str)
# select just the three columns: year, country, total_deceased_donors_imputed_per_mil
organs_top_countries_per_mil = organs_top_countries[["year", "country", "total_deceased_donors_imputed_per_mil"]]
# spread the data across year
organs_top_countries_per_mil_wide = (organs_top_countries_per_mil.pivot(
    index="country", 
    columns="year", 
    values="total_deceased_donors_imputed_per_mil"
  ).sort_values("year_2017", ascending=False))
px.imshow(organs_top_countries_per_mil_wide, color_continuous_scale="Greys")
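
The reshaping step above (from one row per country-year to one row per country with one column per year) can be easier to follow on a tiny example. This is a sketch with hypothetical values; only `pivot()` and `sort_values()` are used, as in the cell above.

```python
import pandas as pd

# hypothetical long-format data: one row per country and year
long_df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": ["year_2016", "year_2017", "year_2016", "year_2017"],
    "rate": [30.0, 20.0, 25.0, 40.0],
})

# spread the years across the columns and order the rows by the 2017 rate
wide_df = long_df.pivot(index="country", columns="year", values="rate") \
    .sort_values("year_2017", ascending=False)
print(wide_df)
```

Each row of the wide data frame becomes one row of the heatmap, and each column becomes one year.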

### Creating an explanatory figure


Let's turn this 2017 donor rates per million figure into a nice explanatory figure that we can use to show people Spain's donor rate. 


All we will do is clean up the plot by removing the background, tidying the axis names, and highlighting Spain. 


In [19]:
# convert the series to a data frame
countries_top_20_2017_per_mil_df = (countries_top_20_2017_per_mil.to_frame(name = "donors_per_mil")
                                                                 .reset_index())
# add a boolean column for identifying which entry corresponds to Spain (to use for color later)
countries_top_20_2017_per_mil_df["Spain"] = countries_top_20_2017_per_mil_df["country"] == "Spain"
# create bar chart
bar_2017 = px.bar(countries_top_20_2017_per_mil_df, 
                  x="country", 
                  y="donors_per_mil", 
                  color = "Spain",
                  color_discrete_sequence=["orange", "grey"],
                  labels={
                      "donors_per_mil": "Organ donations per million",
                      "country": "",
                      },
                  title="Organ donation rates per million in 2017 (for the top 20 countries)")
# customize bar chart
bar_2017.update_layout(
    font_family="Arial",
    showlegend=False
)
bar_2017.update_xaxes(tickangle=-90)


## Visualizing the donor rates over time for each country

The heatmap that we produced in our stability analysis above gave us the idea that it might be interesting to visualize the donor rates over time using line plots.

The figure below shows the (imputed) number of donations per million for each country. We highlighted a few countries just to make it easier to tease out some interesting trends. This plot is definitely a mess, but it contains some useful information!


In [None]:
# work on a copy so that organs_clean itself is not modified
organs_highlight_countries = organs_clean.copy()
# add a column containing the country name for the countries to highlight in the plot below (and "Other" for the rest)
organs_highlight_countries["highlight"] = np.where(organs_highlight_countries["country"].isin(["Spain", "Croatia", "Belgium", "Malta", "United States of America"]),
                                                   organs_highlight_countries["country"], 
                                                   "Other")
# create a line plot (line_group draws one line per country rather than one line per color)
fig = px.line(organs_highlight_countries, 
              x="year",
              y="total_deceased_donors_per_mil",
              color="highlight",
              line_group="country",
              color_discrete_sequence=["grey", "#84ACCE", "#F6AE2D", "#589D6F", "#CEA1C3", "#E68992"])
fig.update_traces(opacity=0.4)


### PCS evaluations

Since our conclusions from this figure are related to our results above, the PCS evaluations that we conducted above are also relevant to this figure (e.g., we showed stability to a visualization judgment call by showing the same information using a heatmap, and we showed predictability by using domain literature to show that it is well known that Spain is a world leader in organ donation rates). But we can also evaluate the stability of these findings to some data perturbations and to some additional visualization judgment calls (such as our choice of which countries' lines to include in the plot).



#### Stability to data perturbations


To investigate how much our figure changes as a result of data perturbations, we create four different perturbed versions of the dataset and overlay the four perturbed trend lines (dashed lines) on the original trend lines (solid lines) in the figure below. To reduce overplotting, we filter to the countries that have at least one year with at least 500 donations.

Spain's trend lines are highlighted in purple. Fortunately, even with 30% of the organ donor counts perturbed, Spain is consistently the world leader in deceased organ donations, indicating that this conclusion is fairly stable even to these fairly extreme data perturbations.


In [21]:
# compute the SD of donor count for each country
donors_sd_by_country = organs_clean.groupby("country")["total_deceased_donors_imputed"] \
    .std() \
    .to_frame(name="sd") \
    .reset_index()
# create a version of organs_clean with the sd as a column
organs_clean_with_sd = organs_clean.merge(donors_sd_by_country)

In [None]:
def compute_perturbed_donor_count():
  perturbed_count = np.where(
    # if total_deceased_donors_imputed is not equal to 0 AND the random bernoulli number is equal to 1 (30% chance)
    (organs_clean["total_deceased_donors_imputed"] != 0) & (np.random.binomial(1, 0.3, len(organs_clean.index)) == 1),
    # compute the sum of the imputed donor count and a normal value with mean 0 and SD equal to the country's donor count SD
    organs_clean["total_deceased_donors_imputed"].add(np.random.normal(0, list(organs_clean_with_sd["sd"]), 1 * len(organs_clean.index))),
    # else just return the original imputed donor count
    organs_clean["total_deceased_donors_imputed"]
  )
  return perturbed_count
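
To check that a perturbation scheme like this behaves as intended, it can help to run it on toy data where the answer is known. The sketch below mirrors the logic of `compute_perturbed_donor_count()` but is self-contained: the data are hypothetical, and it uses a seeded `np.random.default_rng()` generator (our choice, for reproducibility) instead of the global `np.random` functions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator so the sketch is reproducible
n = 10_000

# hypothetical toy data: every fifth donor count is 0, the rest are 100, with SD 10
toy = pd.DataFrame({
    "donors": np.where(np.arange(n) % 5 == 0, 0.0, 100.0),
    "sd": 10.0,
})

# each row is selected for perturbation with probability 0.3
selected = rng.binomial(1, 0.3, size=n) == 1
# normal noise with mean 0 and SD equal to each row's SD
noise = rng.normal(0, toy["sd"].to_numpy(), size=n)

# only perturb the selected rows that have a non-zero donor count
perturb = (toy["donors"] != 0) & selected
toy["donors_perturbed"] = np.where(perturb, toy["donors"] + noise, toy["donors"])

# sanity checks: zero counts are untouched and roughly 30% of non-zero counts changed
nonzero = toy["donors"] != 0
changed = toy["donors_perturbed"] != toy["donors"]
print("Zero counts changed:", changed[~nonzero].sum())
print("Share of non-zero counts changed:", round(changed[nonzero].mean(), 3))
```

Note that, as in the function above, adding normal noise can produce negative perturbed counts when a count is small relative to its country's SD.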

In [24]:
# work on a copy so that organs_clean itself is not modified
perturbed_organs_clean = organs_clean.copy()
# compute four versions of the imputed donor count and add them as columns
perturbed_organs_clean["donors_pert1_per_mil"] = 1_000_000 * compute_perturbed_donor_count() / organs_clean["population"]
perturbed_organs_clean["donors_pert2_per_mil"] = 1_000_000 * compute_perturbed_donor_count() / organs_clean["population"]
perturbed_organs_clean["donors_pert3_per_mil"] = 1_000_000 * compute_perturbed_donor_count() / organs_clean["population"]
perturbed_organs_clean["donors_pert4_per_mil"] = 1_000_000 * compute_perturbed_donor_count() / organs_clean["population"]
# print out a random sample of 10 rows and compare the original imputed count with the perturbed versions
perturbed_organs_clean.sample(10)[["total_deceased_donors_imputed_per_mil", "donors_pert1_per_mil", "donors_pert2_per_mil", "donors_pert3_per_mil", "donors_pert4_per_mil"]]

| | total_deceased_donors_imputed_per_mil | donors_pert1_per_mil | donors_pert2_per_mil | donors_pert3_per_mil | donors_pert4_per_mil |
|---|---|---|---|---|---|
| 2297 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2976 | 1.285714 | 0.454457 | 1.285714 | 1.285714 | 1.271734 |
| 591 | 17.228914 | 17.228916 | 17.228916 | 17.228916 | 15.8319 |
| 3376 | 0.168539 | 0.168539 | -0.004893 | 0.274628 | 0.168539 |
| 3144 | 0.093458 | -0.092051 | 0.093458 | -0.143227 | 0.093458 |
| 888 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 606 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 394 | 0.87156 | NaN | NaN | NaN | NaN |
| 2164 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2897 | 0.0 | NaN | NaN | NaN | NaN |


In [None]:
# identify the countries with at least 500 total donors in a single year
countries_500_donors = perturbed_organs_clean.groupby("country")["total_deceased_donors_imputed"] \
    .max() \
    .to_frame() \
    .query('total_deceased_donors_imputed >= 500') \
    .index
# filter to the countries with at least 500 total donors in a single year
perturbed_organs_clean_min_500 = perturbed_organs_clean[perturbed_organs_clean["country"].isin(countries_500_donors)]

In [26]:
# create a melted version of the data frame with the various perturbed donor counts in a single column
perturbed_organs_clean_min_500_melted = perturbed_organs_clean_min_500.melt(value_vars=["total_deceased_donors_per_mil", 
                                                                                 "donors_pert1_per_mil", 
                                                                                 "donors_pert2_per_mil", 
                                                                                 "donors_pert3_per_mil", 
                                                                                 "donors_pert4_per_mil"], 
                                                                     id_vars=["year", "country"], 
                                                                     var_name="perturbed", 
                                                                     value_name="donor_count")
# add a column to use for highlighting which lines correspond to Spain
perturbed_organs_clean_min_500_melted["spain"] = perturbed_organs_clean_min_500_melted["country"] == "Spain"
perturbed_organs_clean_min_500_melted

| | year | country | perturbed | donor_count | spain |
|---|---|---|---|---|---|
| 0 | 2000 | Argentina | total_deceased_donors_per_mil | NaN | False |
| 1 | 2001 | Argentina | total_deceased_donors_per_mil | NaN | False |
| 2 | 2002 | Argentina | total_deceased_donors_per_mil | NaN | False |
| 3 | 2003 | Argentina | total_deceased_donors_per_mil | NaN | False |
| 4 | 2004 | Argentina | total_deceased_donors_per_mil | 10.359897 | False |
| ... | ... | ... | ... | ... | ... |
| 1615 | 2013 | United States of America | donors_pert4_per_mil | 23.534792 | False |
| 1616 | 2014 | United States of America | donors_pert4_per_mil | 26.646001 | False |
| 1617 | 2015 | United States of America | donors_pert4_per_mil | 28.213176 | False |
| 1618 | 2016 | United States of America | donors_pert4_per_mil | 30.762110 | False |


In [None]:
# create line plots for these countries with a solid line for the original trendline and dashed lines for the perturbed trendlines
# (line_group draws a separate line for each country within each color/dash combination)
fig = px.line(perturbed_organs_clean_min_500_melted,
        x="year",
        y="donor_count", 
        line_dash="perturbed",
        line_group="country",
        color="spain", 
        color_discrete_sequence=["grey", "#9C528B"],
        line_dash_sequence=["solid", "dash", "dash", "dash", "dash"])
fig.update_traces(opacity=0.4)


#### Stability to a data visualization judgment call

Next, let's investigate whether our conclusion changes when we change which countries' lines are included in our figure. Our original figure filtered to the *top 20* countries in 2017. Alternative judgment calls that we could have made include not filtering the data at all (i.e., including all countries), filtering just to the countries that had at least one year with 500 reported donations, or filtering to the countries that had at least one year with a donor rate of at least 20 donors per million.

The analysis below re-creates the figure using each of these judgment calls.

In [28]:
# create a function that we can re-use to create each plot
def plot_lines(data, 
               title, 
               highlight="spain", 
               colors=["grey", "#9C528B"], 
               legend_title="Spain"):
    # line_group draws one line per country rather than one line per color
    fig = px.line(data, 
                  x="year", 
                  y="total_deceased_donors_per_mil",
                  color=highlight, 
                  line_group="country",
                  color_discrete_sequence=colors)
    fig.update_traces(opacity=0.4)
    fig.update_layout(title=title,
                      yaxis_title="Organ donations per million",
                      legend_title=legend_title)
    return fig

In [29]:
# (a) No filter
# add a column for highlighting Spain
organs_clean_highlight_spain = organs_clean.copy()
organs_clean_highlight_spain["spain"] = organs_clean["country"] == "Spain"
# compute the line plot
plot_lines(organs_clean_highlight_spain, "(a) No filtering")


In [30]:
# (b) Top 20 countries in 2017
# filter to the top 20 countries in 2017
organs_clean_2017 = organs_clean.query('year == 2017').copy()
top_20_countries = organs_clean_2017.nlargest(n=20, columns="total_deceased_donors_imputed_per_mil")["country"]
organs_clean_top_20 = organs_clean.query('country in @top_20_countries').copy()
# add highlight for spain
organs_clean_top_20["spain"] = organs_clean_top_20["country"] == "Spain"
plot_lines(organs_clean_top_20, "(b) Top 20 countries in 2017")

In [31]:
# (c) At least 500 donors
# filter to the countries with at least 500 total donors in a single year
organs_clean_min_500 = organs_clean[organs_clean["country"].isin(countries_500_donors)].copy()
# add highlight for spain
organs_clean_min_500["spain"] = organs_clean_min_500["country"] == "Spain"
# compute the line plot
plot_lines(organs_clean_min_500, "(c) At least 500 donors")

In [32]:
# (d) At least 20 donors per million
countries_20_donors_per_mil = organs_clean.groupby("country")["total_deceased_donors_imputed_per_mil"] \
    .max() \
    .to_frame() \
    .query('total_deceased_donors_imputed_per_mil >= 20') \
    .index
# filter to the countries with at least 20 total donors per mil in a single year
organs_clean_min_20_per_mil = organs_clean.query('country in @countries_20_donors_per_mil').copy()
# add highlight for spain
organs_clean_min_20_per_mil["spain"] = organs_clean_min_20_per_mil["country"] == "Spain"
# compute the line plot
plot_lines(organs_clean_min_20_per_mil, "(d) At least 20 donors per million")


### Creating an explanatory figure

Let's just look at the top 20 countries in 2017, and highlight Spain, Croatia, and the US. From here, you could try and re-create the plots for Spain and Croatia that we created in the book!

In [None]:
# add highlight for Spain, USA, and Croatia
organs_clean_top_20["highlight"] = np.where(organs_clean_top_20["country"].isin(["Spain", "United States of America", "Croatia"]),
                                            organs_clean_top_20["country"], 
                                            "Other")

# line_group draws one line per country rather than one line per color
fig = px.line(organs_clean_top_20, 
              x="year", 
              y="total_deceased_donors_per_mil",
              color="highlight", 
              line_group="country",
              color_discrete_sequence=["grey", "#84ACCE", "#F6AE2D", "#589D6F"],
              hover_name="country",
              hover_data={"year": True, 
                          "total_deceased_donors_per_mil": ":.2f",
                          "highlight": False})
fig.update_traces(opacity=1, 
                  line=dict(width=5))

fig.update_layout(yaxis_title="Organ donations per million",
                  legend_title="Country", 
                  plot_bgcolor="rgba(0,0,0,0)",
                  showlegend=False)
# add direct country annotation
fig.add_annotation(x=2017, y=47,
            text="Spain", 
            showarrow=False, xanchor="left")
fig.add_annotation(x=2017, y=34,
            text="Croatia",
            showarrow=False, xanchor="left")
fig.add_annotation(x=2017, y=31,
            text="USA",
            showarrow=False, xanchor="left")

# change the specs for the "Other" lines only
fig.update_traces(line=dict(width=1), 
                  opacity=0.25, 
                  selector=dict(name="Other"))
fig.show()


Another way to present this data is using a grid of line plots. 



In [None]:
organs_clean_2017 = organs_clean.query('year == 2017').copy()
top_15_countries = organs_clean_2017.nlargest(n=15, columns="total_deceased_donors_imputed_per_mil")["country"]
organs_clean_top_15 = organs_clean[organs_clean["country"].isin(top_15_countries)].copy()

fig = px.line(organs_clean_top_15,
        x="year", 
        y="total_deceased_donors_per_mil", 
        facet_col="country", 
        facet_col_wrap=3,
        category_orders={"country": top_15_countries},
        hover_name="country",
        hover_data={"year": True, 
                    "country": False,
                    "total_deceased_donors_per_mil": ":.2f"})
fig.update_traces(line_color="grey") 
fig.update_layout(height=800,
                  autosize=False)

# create just a single y-axis label
fig.for_each_yaxis(lambda y: y.update(title = ''))
fig.add_annotation(x=-0.05, y=0.5,
                   text="Organ donations per million", textangle=-90,
                   xref="paper", yref="paper",
                   showarrow=False)

# create just a single x-axis label
fig.for_each_xaxis(lambda x: x.update(title = ''))
fig.add_annotation(x=0.5, y=-0.05,
                   text="Year", 
                   xref="paper", yref="paper",
                   showarrow=False)

# update facet labels to remove "country="
fig.for_each_annotation(lambda a: a.update(text=a.text.replace("country=", "")))


## The relationship between population and number of donors

Having observed that the donor rate paints a different picture from the raw number of donors, we assumed that countries with larger populations have more donors. Let's check this assumption by asking *do countries with larger populations typically have more donors?*


In [35]:
# compute correlation between imputed population and number of donors
organs_clean_2017["population_imputed"].corr(organs_clean_2017["total_deceased_donors_imputed"])

np.float64(0.4128076308434108)


A correlation of about 0.41 between the (imputed) population and the number of donors in 2017 is indicative of a possible weak positive linear relationship.


Looking at a scatterplot of the two variables does not provide too many hints about this supposed weak linear relationship, however, due to the concentration of values in the lower-left corner. 


In [36]:
fig = px.scatter(organs_clean_2017, 
                 x="population_imputed", 
                 y="total_deceased_donors_imputed", 
                 hover_name="country")
fig.add_annotation(x=1_400_000_000, y=4_100,
            text="China", 
            showarrow=False, xanchor="left")
fig.add_annotation(x=340_000_000, y=10_300,
            text="USA",
            showarrow=False, xanchor="left")
fig.add_annotation(x=220_000_000, y=3_450,
            text="Brazil",
            showarrow=False, xanchor="left")
fig.show()


Removing the outlier countries makes it a little easier to see some trends:

In [37]:
organs_clean_2017_no_outliers = organs_clean_2017.query('total_deceased_donors_imputed < 2500').copy()
organs_clean_2017_no_outliers = organs_clean_2017_no_outliers.query('population_imputed < 500000000')

px.scatter(organs_clean_2017_no_outliers, 
                 x="population_imputed", 
                 y="total_deceased_donors_imputed", hover_name="country")


Placing both axes on a log scale, however, shows that, *if we ignore the countries with zero donations* (which cannot be displayed on a log scale), there is a reasonably linear relationship between the log of population and the log of donor count (indicating that a percentage increase in population is associated with a percentage increase in donor count). However, by ignoring these countries we risk presenting a severely biased view of the data.
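To make the cost of ignoring the zero-donation countries explicit, it can help to report how many rows are dropped alongside the log-log correlation. The following is only a minimal sketch: the helper `log_log_summary()` is a hypothetical name (it is not part of this project's `functions` module), and the toy data frame is made up purely for illustration.

```python
import numpy as np
import pandas as pd

def log_log_summary(df, x, y):
    """Correlate log10(x) with log10(y) and count the rows that had to be dropped.

    Rows where either variable is zero or negative cannot be log-transformed,
    so they are excluded before computing the correlation.
    """
    positive = df[(df[x] > 0) & (df[y] > 0)]
    n_dropped = len(df) - len(positive)
    corr = np.log10(positive[x]).corr(np.log10(positive[y]))
    return corr, n_dropped

# Hypothetical toy data (NOT the organ donation data): two rows have zero donors
toy = pd.DataFrame({
    "population_imputed": [1e5, 1e6, 1e7, 1e8, 5e6, 2e7],
    "total_deceased_donors_imputed": [1, 10, 100, 1000, 0, 0],
})

corr, n_dropped = log_log_summary(toy,
                                  "population_imputed",
                                  "total_deceased_donors_imputed")
print(f"log-log correlation: {corr:.2f} (rows dropped: {n_dropped})")
```

In this notebook, the same helper could be applied to `organs_clean_2017_no_outliers` with the same two column names, so that the number of excluded countries is reported next to the correlation.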


In [38]:
px.scatter(organs_clean_2017_no_outliers, 
                 x="population_imputed", 
                 y="total_deceased_donors_imputed", 
                 hover_name="country", 
                 log_x=True, log_y=True)


This finding doesn't feel particularly informative, so we won't turn it into an explanatory finding, nor will we conduct a thorough PCS evaluation of it.

## [Exercise: to complete] Is there a difference in deceased donor type (i.e., whether the organs come from brain death or circulatory death donors) across different countries?

Conduct your own analysis to answer this question. The relevant variables in the pre-processed data (`organs_clean`) are:
- `deceased_donors_brain_death`: "Actual DBD"
- `deceased_donors_circulatory_death`: "Actual DCD"
- `country`.

Let's first look at Spain's 2017 distribution of brain death and circulatory death donors.

In [63]:
# filter the clean data to Spain's 2017 row
spain_2017_organ = organs_clean[(organs_clean['year'] == 2017) & (organs_clean['country'] == 'Spain')]

In [None]:
# create a grouped bar chart with one color-coded bar per donor type (brain death and circulatory death)
fig = px.bar(spain_2017_organ, 
             x="year", 
             y=["deceased_donors_brain_death", "deceased_donors_circulatory_death"], 
             barmode="group",
             labels={"value": "Number of donors", "variable": "Donor type"},
             title="Number of organ donors in Spain in 2017 by donor type")
fig.show()


It seems that Spain had more brain death donors than circulatory death donors in 2017.
Let's now see whether this pattern has changed over time in Spain.

In [70]:
spain_organ_donors = organs_clean[organs_clean['country'] == 'Spain']
fig = px.line(spain_organ_donors, 
              x="year", 
              y=["deceased_donors_brain_death", "deceased_donors_circulatory_death"], 
              title="Deceased donors in Spain over time by donor type")
fig.show()

Now let's focus on the top 20 countries by donors per million and see whether most of their donors come from brain death or circulatory death.

In [76]:
# extract the names of the top 20 countries from 2017
countries_top_2017 = countries_top_20_2017_per_mil.index
# filter the organs_clean data (all years) to these top 20 countries
organs_top_countries = organs_clean.query('country in @countries_top_2017').copy()
# add the word "year" to the year column so that we can use it as a variable name later
organs_top_countries["year"] = "year_" + organs_top_countries["year"].astype(str)
# add a column containing the proportion of deceased donors that are brain death donors
organs_top_countries["proportion_brain_death"] = organs_top_countries["deceased_donors_brain_death"] / organs_top_countries["total_deceased_donors_imputed"]
# select just the three columns: year, country, proportion_brain_death
brain_death_prop_top_countries = organs_top_countries[["year", "country", "proportion_brain_death"]]

# spread the data across year
brain_death_prop_top_countries_wide = (brain_death_prop_top_countries.pivot(
    index="country", 
    columns="year", 
    values="proportion_brain_death"
  ).sort_values("year_2017", ascending=False))
px.imshow(brain_death_prop_top_countries_wide, color_continuous_scale="Greys")

Among the top 20 countries by donors per million, around 14 still have a high proportion (above 90%) of brain death donors, while in the remaining 6 countries (the US, Canada, Spain, Belgium, Australia, and the UK) the proportion of brain death donors decreased between 2000 and 2017.
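A visual read of a heatmap can be backed up with a quick count. The sketch below shows one possible way to count the countries above a threshold in the most recent year and to compute the change since the first year. The data frame `toy_wide` and its values are hypothetical placeholders that only mimic the shape of `brain_death_prop_top_countries_wide` (countries as rows, `year_*` columns), and the 0.9 threshold is simply the 90% cut-off mentioned above.

```python
import pandas as pd

# Hypothetical toy values (NOT taken from the organ donation data)
toy_wide = pd.DataFrame(
    {"year_2000": [1.00, 0.98, 0.97],
     "year_2017": [0.95, 0.72, 0.80]},
    index=pd.Index(["A", "B", "C"], name="country"),
)

threshold = 0.9
# flag the countries whose 2017 brain death proportion exceeds the threshold
high_2017 = toy_wide["year_2017"] > threshold
# change in the proportion between the first and last year (negative = decrease)
change = toy_wide["year_2017"] - toy_wide["year_2000"]

summary = pd.DataFrame({"above_threshold_2017": high_2017,
                        "change_2000_2017": change})
print(summary)
print("countries above threshold in 2017:", int(high_2017.sum()))
```

Run against the real wide table, a summary like this would make it possible to replace "around 14" with an exact count, keeping in mind that countries with missing values in either year would need separate handling.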

## [Exercise: to complete] Create a dot plot comparing the organ donation rates for the US and Spain

Below you will find some code for creating the data that underlies the dot plot in Exercise 27 of Chapter 6. Use `px.scatter()` to create the dot plot. 


In [39]:
# filter the 2017 data to just the USA and Spain
organs_clean_2017_spain_usa = organs_clean_2017.query('country in ["Spain", "United States of America"]').copy()
# compute the donor rates for kidneys, livers, hearts, and lungs for each country
organs_clean_2017_spain_usa["kidney"] = organs_clean_2017_spain_usa["total_kidney_tx"] / organs_clean_2017_spain_usa["population"] * 1_000_000
organs_clean_2017_spain_usa["liver"] = organs_clean_2017_spain_usa["total_liver_tx"] / organs_clean_2017_spain_usa["population"] * 1_000_000
organs_clean_2017_spain_usa["heart"] = organs_clean_2017_spain_usa["total_heart_tx"] / organs_clean_2017_spain_usa["population"] * 1_000_000
organs_clean_2017_spain_usa["lung"] = organs_clean_2017_spain_usa["total_lung_tx"] / organs_clean_2017_spain_usa["population"] * 1_000_000
# melt the data to long-form
organs_clean_2017_spain_usa = organs_clean_2017_spain_usa.melt(id_vars="country", 
                                                               value_vars=["kidney", "liver", "heart", "lung"], 
                                                               value_name="donation_rate", 
                                                               var_name="organ")
organs_clean_2017_spain_usa

| | country | organ | donation_rate |
|---|---|---|---|
| 0 | Spain | kidney | 70.452586 |
| 1 | United States of America | kidney | 63.599384 |
| 2 | Spain | liver | 26.875 |
| 3 | United States of America | liver | 24.906009 |
| 4 | Spain | heart | 6.551724 |
| 5 | United States of America | heart | 10.086287 |
| 6 | Spain | lung | 7.823276 |
| 7 | United States of America | lung | 7.636364 |


In [55]:
fig = px.scatter(organs_clean_2017_spain_usa.sort_values("donation_rate"),
           x="donation_rate",
           y="organ",
           color="country",
           hover_name="country",
           labels={"donation_rate": "Donation rate per million",
                   "organ": "Organ",
                   "country": "Country"})
# Update the legend to be on top of the plot
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=1.02,
    xanchor="right",
    x=1
))
fig.show()

From the graph we can see that Spain has higher donation rates than the USA for kidneys, livers, and (slightly) lungs, while the USA has a higher donation rate for hearts.
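The same comparison can be checked numerically by pivoting the long-form table to one column per country and taking the difference. This is only a sketch: the values are copied by hand from the output table above, and the column name `spain_minus_usa` is introduced here for illustration.

```python
import pandas as pd

# donation rates per million, copied from the output table above
rates = pd.DataFrame({
    "country": ["Spain", "United States of America"] * 4,
    "organ": ["kidney", "kidney", "liver", "liver",
              "heart", "heart", "lung", "lung"],
    "donation_rate": [70.452586, 63.599384, 26.875, 24.906009,
                      6.551724, 10.086287, 7.823276, 7.636364],
})

# one row per organ, one column per country
rates_wide = rates.pivot(index="organ", columns="country", values="donation_rate")
# positive values mean Spain's rate is higher, negative values mean the USA's is higher
rates_wide["spain_minus_usa"] = rates_wide["Spain"] - rates_wide["United States of America"]
print(rates_wide.round(2))
```

The sign of `spain_minus_usa` shows at a glance which country has the higher rate for each organ, and its magnitude shows that the lung rates are nearly identical.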



## Additional explorations

There are many more explorations that you could include in this document if you are editing it yourself (we've only included the ones that appeared in the EDA book chapter), and if you're interested in challenging yourself we encourage you to add a few additional exploration sections to this document.

Start by thinking of a question you have about a data trend or relationship. Perhaps it is related to some of the organ-specific transplant variables that we haven't explored, or perhaps you even want to bring in some external data (such as GDP) to explore whether there is a relationship between GDP and organ donation rates. There are almost infinite avenues that you can explore!
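If you do bring in an external table, a left join on `country` with an indicator column is one way to see which countries failed to match (for example, because of differing country name spellings). The sketch below uses made-up placeholder tables: neither the values nor the `gdp_per_capita` column come from this project, and a real GDP source would likely need its country names harmonized first.

```python
import pandas as pd

# Hypothetical placeholder values: neither table contains real data
donor_rates = pd.DataFrame({
    "country": ["A", "B", "C"],
    "total_deceased_donors_per_mil": [30.0, 12.0, 2.0],
})
gdp = pd.DataFrame({
    "country": ["A", "B", "D"],
    "gdp_per_capita": [40_000, 25_000, 5_000],
})

# keep every country in the donor data and record whether a GDP match was found
merged = donor_rates.merge(gdp, on="country", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].tolist()

print(merged)
print("countries without a GDP match:", unmatched)
```

Checking the unmatched list before plotting helps avoid silently dropping countries from the comparison.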
