# Exploration of Malaria Dataset

I wanted to create some simple visualisations in Plotly. To do this, I selected a dataset from Kaggle looking at the effects of malaria interventions. The dataset contained information on the intervention, country-level data, and the estimated number of malaria deaths across a number of years.

There is much more that could be done with this data, but here I want to keep things simple. I outline some future applications of this data at the end of the notebook.

The dataset can be found here: https://www.kaggle.com/teajay/the-fight-against-malaria?select=estimated_deaths.csv

In [18]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

## Load & Tidy Data

In [2]:
# load data

amf_data = pd.read_csv("amf_distributions.csv", encoding='latin-1')
est_deaths = pd.read_csv("estimated_deaths.csv", encoding='latin-1')

In [None]:
# merge files

data = pd.merge(amf_data, est_deaths, left_on='country_code', right_on='COUNTRY (CODE)')
# keep only the columns needed here and give them descriptive names
data = data[['country', 'when', 'by_whom', 'Numeric', 'YEAR (DISPLAY)']].rename(
    columns={'when': 'intervention_dates', 'by_whom': 'partner_org', 'Numeric': 'estimated_deaths',
             'YEAR (DISPLAY)': 'year_of_estimation'})
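
The merge above joins on country code only, so each death estimate for a country is paired with every distribution recorded for that country. The toy example below is a rough sketch of that behaviour. The column names follow the cell above, but the `KEN` code and the two small frames are assumptions made up for illustration, not rows read from the real files.

```python
import pandas as pd

# Toy stand-ins for the two files (illustrative values only).
distributions = pd.DataFrame({
    "country_code": ["KEN", "KEN", "KEN"],
    "by_whom": ["Red Cross", "AMREF/Akamba", "PSI"],
})
estimates = pd.DataFrame({
    "COUNTRY (CODE)": ["KEN", "KEN"],
    "YEAR (DISPLAY)": [2000, 2005],
    "Numeric": [13000.0, 10000.0],
})

merged = pd.merge(distributions, estimates,
                  left_on="country_code", right_on="COUNTRY (CODE)")

# 3 distributions x 2 estimates: every pairing within the country is kept
print(len(merged))

# how many times each (country, year) estimate is repeated after the merge
repeats = merged.groupby(["country_code", "YEAR (DISPLAY)"]).size()
print(repeats)
```

This repetition is worth keeping in mind when reading the plots later on: the same estimate can appear once per partner organisation.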

In [6]:
# count duplicates
len(data)-len(data.drop_duplicates())

28

In [7]:
# drop duplicates
data = data.drop_duplicates()

In [5]:
# tidy up intervention_dates

# as this is a basic analysis I will just be taking the year of intervention
# take the two-digit year after the month(s), e.g. "May-Jun 06" -> "06", and prefix the century
data['intervention_year'] = pd.to_numeric("20" + data['intervention_dates'].str.split(" ").str[1].str.split("-").str[0])

# field to classify if row is post intervention
data['post_intervention_data'] = data['intervention_year'] < data['year_of_estimation']

# field to classify how many years between deaths data and intervention
data['years_since_intervention'] = data['year_of_estimation'] - data['intervention_year']
# minus number indicates years before intervention, positive value is years after intervention
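
The chained `str.split` calls above rely on the year always being the second space-separated token. As an alternative sketch, the helper below pulls out the first standalone two-digit number with a regular expression and returns `None` when there is none. The function name `intervention_year` and the `century` parameter are hypothetical, and the sketch assumes all interventions fall in the 2000s, as the cell above does.

```python
import re

def intervention_year(label, century=2000):
    """Return the first standalone two-digit year in a label such as 'May-Jun 06'."""
    match = re.search(r"\b(\d{2})\b", label)
    if match is None:
        # no two-digit year found; leave it to the caller to decide what to do
        return None
    return century + int(match.group(1))

# labels in the formats seen in the data preview
for label in ["May-Jun 06", "Nov-Dec 06", "Dec 06"]:
    print(label, "->", intervention_year(label))
```

It could be applied to the column with `data['intervention_dates'].map(intervention_year)`, which also makes unparseable labels easy to spot as missing values.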

In [47]:
data.head(10)

Unnamed: 0,country,intervention_dates,partner_org,estimated_deaths,year_of_estimation,intervention_year,post_intervention_data,years_since_intervention
0,Kenya,May-Jun 06,Red Cross,13000.0,2000,2006,False,-6
1,Kenya,May-Jun 06,Red Cross,10000.0,2005,2006,False,-1
2,Kenya,May-Jun 06,Red Cross,9000.0,2010,2006,True,4
3,Kenya,May-Jun 06,Red Cross,9900.0,2013,2006,True,7
4,Kenya,Nov-Dec 06,AMREF/Akamba,13000.0,2000,2006,False,-6
5,Kenya,Nov-Dec 06,AMREF/Akamba,10000.0,2005,2006,False,-1
6,Kenya,Nov-Dec 06,AMREF/Akamba,9000.0,2010,2006,True,4
7,Kenya,Nov-Dec 06,AMREF/Akamba,9900.0,2013,2006,True,7
8,Kenya,Dec 06,PSI,13000.0,2000,2006,False,-6
9,Kenya,Dec 06,PSI,10000.0,2005,2006,False,-1


In [43]:
data.describe(include = 'all')

Unnamed: 0,country,intervention_dates,partner_org,estimated_deaths,year_of_estimation,intervention_year,post_intervention_data,years_since_intervention
count,624,624,624,624.0,624.0,624.0,624,624.0
unique,35,112,70,,,,2,
top,Uganda,May-Jun 06,Red Cross,,,,False,
freq,124,36,88,,,,386,
mean,,,,14665.666667,2007.0,2009.160256,,-2.160256
std,,,,17307.783227,4.953718,2.90568,,5.743022
min,,,,0.0,2000.0,2006.0,,-18.0
25%,,,,5450.0,2003.75,2007.0,,-7.0
50%,,,,10000.0,2007.5,2009.0,,-2.0
75%,,,,19000.0,2010.75,2010.0,,3.0


## Plot

I'm interested in whether the number of estimated deaths changes after the intervention. To do this, I'll start with a basic plot of the estimated deaths before and after the year of intervention (where the year of intervention = 0).

In [63]:
# create a line plot
fig = px.line(data, x="years_since_intervention", y="estimated_deaths", color='country', 
              labels={
                     "estimated_deaths": "Estimated Deaths",
                     "years_since_intervention": "Years Since Intervention",
                     "country": "Country"
                 },
                title="Estimated Deaths by Country", 
             template='simple_white')

# ensure we are showing useful numbers on the x-axis
number_of_ticks = abs(data['years_since_intervention'].min() - data['years_since_intervention'].max())
fig.update_xaxes(nticks=number_of_ticks)

# add a reference line to the chart to show the date of intervention
fig.add_shape(type='line',
                yref="y",
                xref="x",
                x0=0,
                y0=0,
                x1=0,
                y1=data['estimated_deaths'].max()*1.2,
                line=dict(color='black', width=1))

fig.show()

We can't tell much from this chart. Even with Plotly's ability to filter and zoom into the data, it's pretty difficult to grasp any story from it.

It seems the issue here comes from the multiple partner organisations collecting data within the same year. To make the visualization simpler, I'll create a filter to allow the user to select between different partner organisations.

In [48]:
# subset to just a few partner orgs

# keep rows whose partner organisation is one of the three of interest
subset_data = data[data['partner_org'].isin(['Red Cross', 'AMREF/Akamba', 'PSI'])]
subset_data.head()

Unnamed: 0,country,intervention_dates,partner_org,estimated_deaths,year_of_estimation,intervention_year,post_intervention_data,years_since_intervention
0,Kenya,May-Jun 06,Red Cross,13000.0,2000,2006,False,-6
1,Kenya,May-Jun 06,Red Cross,10000.0,2005,2006,False,-1
2,Kenya,May-Jun 06,Red Cross,9000.0,2010,2006,True,4
3,Kenya,May-Jun 06,Red Cross,9900.0,2013,2006,True,7
4,Kenya,Nov-Dec 06,AMREF/Akamba,13000.0,2000,2006,False,-6


In [65]:
# create the line plot 
fig_subset = px.line(subset_data, x="years_since_intervention", y="estimated_deaths", color='country', 
              labels={
                     "estimated_deaths": "Estimated Deaths",
                     "years_since_intervention": "Years Since Intervention",
                     "country": "Country"
                 },
                title="Interventions by Red Cross", 
             template='simple_white')

# ensure we are showing useful numbers on the x-axis
number_of_ticks_subset = abs(subset_data['years_since_intervention'].min() - subset_data['years_since_intervention'].max())
fig_subset.update_xaxes(nticks=number_of_ticks_subset)

# add a dropdown filter (each "visible" list is matched to the figure's traces by position)
fig_subset.update_layout(
    updatemenus=[
        dict(active=0,
            buttons=list([
            dict(label="Red Cross",
                 method="update",
                 args=[{"visible":[True, False, False]},
                       {"title":"Interventions by Red Cross"}]),
            dict(label="AMREF/Akamba",
                 method="update",
                 args=[{"visible":[False,True,False]},
                       {"title":"Interventions by AMREF/Akamba"}]),
            dict(label="PSI",
                 method="update",
                 args=[{"visible":[False,False,True]},
                       {"title":"Interventions by PSI"}])
        ]),
        )
    ]
)

# add a reference line to the chart to show the date of intervention
fig_subset.add_shape(type='line',
                yref="y",
                xref="x",
                x0=0,
                y0=0,
                x1=0,
                y1=subset_data['estimated_deaths'].max()*1.2,
                line=dict(color='black', width=1))


fig_subset.show()
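
One thing to watch with the dropdown: each `visible` list is matched to the figure's traces by position, so it needs one entry per trace. With `color='country'`, `px.line` draws one trace per country, so three-entry lists line up with the first three countries rather than with the three partner organisations. A possible way around this is to build the lists from a record of which partner each trace belongs to. The sketch below shows only that bookkeeping in plain Python. The helper names and the four-trace example are hypothetical, and how you obtain the partner for each trace depends on how the figure is built.

```python
def visibility_masks(trace_groups):
    """Map each group label to a list of booleans, one per trace, in trace order."""
    groups = list(dict.fromkeys(trace_groups))  # unique labels in first-seen order
    return {group: [tg == group for tg in trace_groups] for group in groups}

def dropdown_buttons(trace_groups, title_prefix="Interventions by "):
    """Build button dicts in the same shape as the updatemenus cell above."""
    return [
        dict(label=group,
             method="update",
             args=[{"visible": mask}, {"title": title_prefix + group}])
        for group, mask in visibility_masks(trace_groups).items()
    ]

# Hypothetical figure with four traces, listed with the partner each one belongs to.
trace_groups = ["Red Cross", "Red Cross", "AMREF/Akamba", "PSI"]
buttons = dropdown_buttons(trace_groups)

for button in buttons:
    print(button["label"], button["args"][0]["visible"])
```

The resulting list could be passed as `buttons=` inside `updatemenus`, in place of the hand-written one.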

This is much easier to understand, but unfortunately it still doesn't tell us much.

I also want to explore how this data looks on a map over time. We may be able to clearly see a shift in trends, which would be a great thing to demo to stakeholders. People always love an interactive map!

In [92]:
# I'll just use Red Cross data here to keep it simple 

rc_data = data[data['partner_org'] == 'Red Cross']

In [91]:
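# sort the frame itself so the animation frames play in order and stay aligned with their rows
rc_sorted = rc_data.sort_values("years_since_intervention")

fig_map = px.choropleth(data_frame=rc_sorted, locations='country', locationmode='country names',
                        color='estimated_deaths',
                        color_continuous_scale="Viridis",
                        # keep one colour range across all frames, based on the data's own maximum
                        range_color=(0, rc_sorted['estimated_deaths'].max()),
                        animation_frame="years_since_intervention",
                        scope="world",
                        labels={'estimated_deaths': 'Estimated Deaths'}
                        )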
fig_map.show()

Again it seems like the dataset isn't great for this type of viz. But hopefully it sparks some ideas of the kind of thing that could be done with more time. 

## Conclusions

Although we couldn't draw much insight from this analysis, I hope it has provided a useful exercise in visualizing such data. Of course, with more data we could create more insight, but that is always the case no matter how much data we have.

To carry this forward, we could create a Dash app to allow users to explore the data on their own and discover their own insights. 