<a href="https://www.kaggle.com/code/kamtoeze/malaria-investigation-from-2000-to-2018?scriptVersionId=101900571" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

**Investigating Malaria cases across different WHO Regions from 2000 to 2018**

**Overview**

Malaria is an illness caused by Plasmodium parasites. It is transmitted when a person is bitten by an infected female Anopheles mosquito. Five parasite species cause malaria in humans:

* Plasmodium falciparum
* Plasmodium vivax
* Plasmodium ovale
* Plasmodium malariae
* Plasmodium knowlesi

The deadliest malaria parasite is Plasmodium falciparum: it is responsible for most malaria deaths and is prevalent in Africa.

Plasmodium vivax and Plasmodium ovale have the added complication of a dormant liver stage, which can be reactivated in the absence of a mosquito bite.

Plasmodium ovale and Plasmodium malariae account for only a small percentage of infections.

Plasmodium knowlesi is a species that infects primates; however, its mode of transmission remains unclear.

Sources: https://www.who.int/news-room/fact-sheets/detail/malaria#:~:text=Malaria%20is%20a%20life%2Dthreatening,at%20627%20000%20in%202020, https://www.mmv.org/malaria-medicines/five-species?gclid=EAIaIQobChMIlpmhwZH--AIV9RSLCh1ztgyyEAAYASAAEgKP3PD_BwE

**Case Study**

Malaria is a terrible illness that has taken the lives of many people. I will investigate which nations are hit hardest by malaria infections and which suffer the most deaths. I would also like to see whether malaria infections and deaths have increased or decreased over time. I will be using a dataset that covers malaria cases from 2000 to 2018 across the different WHO regions.

Data source: https://www.kaggle.com/datasets/imdevskp/malaria-dataset

Data tools used: I used Python for data cleaning, analysis and visualization.

How was the data stored? The data files were stored in a zip folder in CSV format. The data was stored in long format.
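
To illustrate what long format means here, the following is a minimal sketch using a small made-up table (the country labels and values are placeholders, not rows from the dataset). Each row holds one country-year observation, and `pivot` reshapes it into a wide table with one column per year.

```python
import pandas as pd

# Hypothetical toy data in long format: one row per (country, year) pair.
long_df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2017, 2018, 2017, 2018],
    "No. of cases": [10.0, 12.0, 0.0, 1.5],
})

# Wide format: one row per country, one column per year.
wide_df = long_df.pivot(index="Country", columns="Year", values="No. of cases")
print(wide_df)
```

Long format is convenient for `groupby()` and for plotting libraries such as plotly express, which is presumably why the files were left in that shape for this analysis.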


**Data Limitations:**

Although cases per 1000 population at risk are given, the population of each country is not, so we cannot tell whether a change in a country's reported case counts reflects a change in the infection rate or simply a change in population.
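
As a rough illustration of why the missing population matters, the sketch below uses the usual definition of incidence per 1000 population at risk (cases divided by population at risk, times 1000). The function name and the population figures are hypothetical and chosen only to show that the same incidence can correspond to very different case counts.

```python
def cases_from_incidence(incidence_per_1000: float, population_at_risk: int) -> float:
    """Convert an incidence rate per 1000 population at risk into an absolute case count."""
    return incidence_per_1000 / 1000 * population_at_risk

# Same incidence, two hypothetical populations at risk.
small = cases_from_incidence(228.91, 1_000_000)
large = cases_from_incidence(228.91, 30_000_000)
print(small, large)
```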


**Data exploration**

I will start by importing the datasets and exploring them.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.offline import plot, iplot, init_notebook_mode
import plotly.graph_objs as go
init_notebook_mode(connected=True)

In [2]:
incidence_data = pd.read_csv('../input/malaria-dataset/incidence_per_1000_pop_at_risk.csv')
reported_data = pd.read_csv('../input/malaria-dataset/reported_numbers.csv')

In [3]:
incidence_data.head()

Unnamed: 0,Country,Year,No. of cases,WHO Region
0,Afghanistan,2018,29.01,Eastern Mediterranean
1,Algeria,2018,0.0,Africa
2,Angola,2018,228.91,Africa
3,Argentina,2018,0.0,Americas
4,Armenia,2018,0.0,Europe


In [4]:
incidence_data.shape

(2033, 4)

* In this dataset, there are 2033 observations and 4 columns


In [5]:
incidence_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2033 entries, 0 to 2032
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country       2033 non-null   object 
 1   Year          2033 non-null   int64  
 2   No. of cases  2033 non-null   float64
 3   WHO Region    2033 non-null   object 
dtypes: float64(1), int64(1), object(2)
memory usage: 63.7+ KB


* The dataset contains float, integer and object columns
* No column has missing values.

In [6]:
reported_data.head()

Unnamed: 0,Country,Year,No. of cases,No. of deaths,WHO Region
0,Afghanistan,2017,161778.0,10.0,Eastern Mediterranean
1,Algeria,2017,0.0,0.0,Africa
2,Angola,2017,3874892.0,13967.0,Africa
3,Argentina,2017,0.0,1.0,Americas
4,Armenia,2017,0.0,,Europe


In [7]:
reported_data.shape

(1944, 5)

* In this dataset, there are 1944 observations and 5 columns

In [8]:
reported_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1944 entries, 0 to 1943
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Country        1944 non-null   object 
 1   Year           1944 non-null   int64  
 2   No. of cases   1710 non-null   float64
 3   No. of deaths  1675 non-null   float64
 4   WHO Region     1944 non-null   object 
dtypes: float64(2), int64(1), object(2)
memory usage: 76.1+ KB


* The dataset contains object, float and integer columns
* Three columns have no null values, but the remaining two columns (No. of cases and No. of deaths) contain null values

**DATA CLEANING**

In the data exploration stage, I observed that one of the datasets contains null values. I will drop the rows containing null values using the dropna() function.

In [9]:
reported_data.isnull().sum()

Country            0
Year               0
No. of cases     234
No. of deaths    269
WHO Region         0
dtype: int64

In [10]:
reported_data = reported_data.dropna()
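
Since `dropna()` with no arguments removes every row that has a null in any column, it can be useful to check how many rows are lost. The sketch below uses a small made-up frame (not the real data) to compare that behavior with the `subset` parameter, which only considers the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in both count columns.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "No. of cases": [100.0, np.nan, 50.0, 0.0],
    "No. of deaths": [1.0, 2.0, np.nan, 0.0],
})

dropped_any = toy.dropna()                            # drop rows with a null in any column
dropped_cases = toy.dropna(subset=["No. of cases"])   # drop only rows missing case counts

print(len(toy), len(dropped_any), len(dropped_cases))
```

Dropping on a subset keeps more rows for analyses that only need one of the two columns; this notebook drops on all columns, which keeps the case and death totals based on the same set of rows.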

**DATA MANIPULATION**

In this stage, we will rename columns for better readability. We will also convert the count columns from float to int so that the values display as whole numbers.

In [11]:
incidence_data.columns

Index(['Country', 'Year', 'No. of cases', 'WHO Region'], dtype='object')

In [12]:
reported_data.columns

Index(['Country', 'Year', 'No. of cases', 'No. of deaths', 'WHO Region'], dtype='object')

In [13]:
# Rename columns to snake_case, one mapping per DataFrame
incidence_data.rename(columns={'No. of cases': 'no_of_cases',
                               'WHO Region': 'who_region'}, inplace=True)
reported_data.rename(columns={'No. of cases': 'no_of_cases',
                              'No. of deaths': 'no_of_deaths',
                              'WHO Region': 'who_region'}, inplace=True)


In [14]:
reported_data

Unnamed: 0,Country,Year,no_of_cases,no_of_deaths,who_region
0,Afghanistan,2017,161778.0,10.0,Eastern Mediterranean
1,Algeria,2017,0.0,0.0,Africa
2,Angola,2017,3874892.0,13967.0,Africa
3,Argentina,2017,0.0,1.0,Americas
6,Bangladesh,2017,4893.0,13.0,South-East Asia
...,...,...,...,...,...
1936,United Republic of Tanzania,2000,17734.0,379.0,Africa
1937,Uzbekistan,2000,126.0,0.0,Europe
1938,Vanuatu,2000,6768.0,3.0,Western Pacific
1939,Venezuela (Bolivarian Republic of),2000,29736.0,24.0,Americas


In [15]:
reported_data.dtypes

Country          object
Year              int64
no_of_cases     float64
no_of_deaths    float64
who_region       object
dtype: object

In [16]:
reported_data.no_of_cases = reported_data.no_of_cases.astype(int)
reported_data.no_of_deaths = reported_data.no_of_deaths.astype(int)

In [17]:
incidence_data.dtypes

Country         object
Year             int64
no_of_cases    float64
who_region      object
dtype: object

In [18]:
incidence_data.no_of_cases = incidence_data.no_of_cases.astype(int) 
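
One thing to be aware of is that `astype(int)` truncates toward zero rather than rounding, so the fractional part of each incidence value is discarded. A minimal sketch of the difference, using two values from the `incidence_data.head()` output above; rounding first is shown only as an alternative, not as what this notebook does:

```python
import pandas as pd

rates = pd.Series([29.01, 228.91])

truncated = rates.astype(int)          # drops the fractional part
rounded = rates.round().astype(int)    # rounds to the nearest integer first

print(truncated.tolist(), rounded.tolist())
```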

**Data analysis and visualization**


After the data cleaning and manipulation stages, the malaria dataset is now ready for analysis. The data covers many countries over a time frame of about 18 years, so I think the best approach is to use the groupby() function to aggregate the data and then create some visualizations to gain insights from them.


In [19]:
who_region = reported_data.groupby(['who_region'])[['no_of_cases','no_of_deaths']].sum().reset_index()
who_region.head()

Unnamed: 0,who_region,no_of_cases,no_of_deaths
0,Africa,545111852,1480850
1,Americas,13433321,11039
2,Eastern Mediterranean,15841260,26764
3,Europe,112675,25
4,South-East Asia,38305249,49802


In [20]:
Year = reported_data.groupby(['Year'])[['no_of_cases','no_of_deaths']].sum().reset_index()
Year

Unnamed: 0,Year,no_of_cases,no_of_deaths
0,2000,5279182,21419
1,2001,5534764,26162
2,2002,5335247,70683
3,2003,8243454,91247
4,2004,9389638,87926
5,2005,11170319,76842
6,2006,11898896,78995
7,2007,13365529,76904
8,2008,13395349,87024
9,2009,17454477,115694


In [21]:
countries = reported_data.groupby(['Country'])[['no_of_cases','no_of_deaths']].sum().reset_index()
countries

Unnamed: 0,Country,no_of_cases,no_of_deaths
0,Afghanistan,1045271,363
1,Algeria,1044,4
2,Angola,26006152,125364
3,Argentina,2098,2
4,Armenia,355,0
...,...,...,...
100,Venezuela (Bolivarian Republic of),1039480,278
101,Viet Nam,445213,564
102,Yemen,895910,544
103,Zambia,18619166,8898


After grouping the data, some insights are becoming clear. To make them even clearer, we will plot charts. 

In [22]:
fig = px.bar(countries.sort_values("no_of_cases",ascending=False)[:10][::-1],
             x="no_of_cases",y ="Country",text="no_of_cases",
             title="Top 10 countries with the highest number of malaria cases from 2000 to 2018")
fig.show()


In [23]:
fig = px.bar(countries.sort_values('no_of_deaths',ascending = False)[:10][::-1],
            x='no_of_deaths',y='Country',text = 'no_of_deaths',
            title = 'Top 10 countries with the highest malaria deaths from 2000 to 2018')
fig.show()
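
The sort-and-slice idiom used in the two cells above can also be expressed with `DataFrame.nlargest`. The sketch below is one way to do it, using a handful of the per-country totals shown earlier rather than the full table, so the resulting ranking is only illustrative.

```python
import pandas as pd

# A few rows copied from the `countries` output above (not the full table).
sample = pd.DataFrame({
    "Country": ["Afghanistan", "Algeria", "Angola", "Yemen", "Zambia"],
    "no_of_cases": [1045271, 1044, 26006152, 895910, 18619166],
})

# Largest three first, then reversed, mirroring the [:10][::-1] slice above.
top3 = sample.nlargest(3, "no_of_cases")[::-1]
print(top3)
```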

**Insights**

From the bar charts above, we can see the ten countries with the most cases and deaths from 2000 to 2018. The Democratic Republic of the Congo has the largest number of cases and deaths associated with malaria. We can also observe that it is mostly African countries that have the most cases and deaths from malaria. Let us investigate this further by visualizing the data with pie charts.

In [24]:
fig = px.pie(who_region, values='no_of_cases',names = 'who_region', color = 'who_region',
            color_discrete_map={'Africa':'lightcyan',
                                 'South-East Asia':'cyan',
                                 'Eastern Mediterranean':'royalblue',
                                 'Americas':'darkblue',
                                 'Western Pacific':'blue',
                                 'Europe': 'red'},
            title= 'MALARIA CASES IN W.H.O REGIONS')
fig.show()

In [25]:
fig = px.pie(who_region, values='no_of_deaths',names = 'who_region', color = 'who_region',
            color_discrete_map={'Africa':'lightcyan',
                                 'South-East Asia':'cyan',
                                 'Eastern Mediterranean':'royalblue',
                                 'Americas':'darkblue',
                                 'Western Pacific':'blue',
                                 'Europe': 'red'},
            title= 'MALARIA DEATHS IN W.H.O REGIONS')
fig.show()
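
The shares that the pie charts display can also be computed directly. The following sketch uses only the five regions visible in the `who_region.head()` output above (Western Pacific is not shown there, so it is left out), which means the percentages are approximate and will differ slightly from the charts.

```python
import pandas as pd

# Totals copied from the who_region.head() output; Western Pacific is missing.
regions = pd.DataFrame({
    "who_region": ["Africa", "Americas", "Eastern Mediterranean", "Europe", "South-East Asia"],
    "no_of_cases": [545111852, 13433321, 15841260, 112675, 38305249],
    "no_of_deaths": [1480850, 11039, 26764, 25, 49802],
})

# Each region's percentage of the column total.
totals = regions[["no_of_cases", "no_of_deaths"]].sum()
regions["cases_pct"] = regions["no_of_cases"] / totals["no_of_cases"] * 100
regions["deaths_pct"] = regions["no_of_deaths"] / totals["no_of_deaths"] * 100

print(regions[["who_region", "cases_pct", "deaths_pct"]].round(2))
```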

**Insights**

From the charts above, we can see that Africa is the worst-hit region, with over 90 percent of malaria deaths from 2000 to 2018. Europe is the least-hit region, with less than 1 percent of deaths. The other regions together account for about 12 percent of malaria cases, leaving Africa with about 88 percent of cases from 2000 to 2018. Let us go deeper into our analysis to see whether malaria cases and deaths increased or decreased as the years went by.

In [26]:
fig = px.line(Year, x = 'Year',y = 'no_of_cases', title = 'Malaria Cases from 2000 to 2018')
fig.show()


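As seen in the line chart above, reported malaria cases rose over the period overall, although not in every year (for example, the total dipped slightly from 2001 to 2002).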

In [27]:
fig= px.line(Year, x= 'Year', y = 'no_of_deaths', title = 'Malaria deaths from 2000 to 2018')
fig.show()

* The chart shows that deaths from malaria did not change at a constant rate: in some years they increased, while in others they decreased.
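
One way to check these trend statements numerically is to look at year-over-year differences. The sketch below rebuilds the first ten rows of the `Year` table shown above (2000 to 2009 only, since later rows are not displayed) and lists the years in which each total fell compared with the previous year.

```python
import pandas as pd

# Rows copied from the Year output above (2000-2009).
yearly = pd.DataFrame({
    "Year": list(range(2000, 2010)),
    "no_of_cases": [5279182, 5534764, 5335247, 8243454, 9389638,
                    11170319, 11898896, 13365529, 13395349, 17454477],
    "no_of_deaths": [21419, 26162, 70683, 91247, 87926,
                     76842, 78995, 76904, 87024, 115694],
}).set_index("Year")

# diff() gives the change from the previous year; negative means a decrease.
changes = yearly.diff()
case_drops = changes.index[changes["no_of_cases"] < 0].tolist()
death_drops = changes.index[changes["no_of_deaths"] < 0].tolist()

print("Years with fewer cases than the year before:", case_drops)
print("Years with fewer deaths than the year before:", death_drops)
```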

In [28]:
fig = px.choropleth(countries,locationmode="country names",
                    locations ="Country",
                    hover_data =['no_of_cases',"no_of_deaths","Country"],
                    hover_name = 'Country',
                    color = 'Country',
                    title = 'Malaria cases across the world')

fig.show()

* To view the data at a larger scale all at once, I created a choropleth map which shows the countries and, on hover, the number of cases and deaths they have had from 2000 to 2018.

* Focusing on incidence_data, I would like to investigate Nigeria and observe whether the number of cases per 1000 population at risk has increased over the years or not.

In [29]:
incidence_data.head()

Unnamed: 0,Country,Year,no_of_cases,who_region
0,Afghanistan,2018,29,Eastern Mediterranean
1,Algeria,2018,0,Africa
2,Angola,2018,228,Africa
3,Argentina,2018,0,Americas
4,Armenia,2018,0,Europe


In [30]:
Nigeria = incidence_data[incidence_data['Country'] == 'Nigeria']

In [31]:
Nigeria.head()

Unnamed: 0,Country,Year,no_of_cases,who_region
69,Nigeria,2018,291,Africa
176,Nigeria,2017,283,Africa
283,Nigeria,2016,281,Africa
390,Nigeria,2015,296,Africa
497,Nigeria,2014,314,Africa


In [32]:
fig = px.line(Nigeria, x = 'Year',y ='no_of_cases', title = 'Malaria cases in Nigeria from 2000 to 2018')
fig.show()

* From the line chart, we observe that malaria incidence in Nigeria has decreased over the years, although the most recent values shown above rise again from 281 in 2016 to 291 in 2018.
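
To put a rough number on this trend, one option is to fit a straight line through the yearly values. The sketch below does this with `numpy.polyfit` on the five rows visible in `Nigeria.head()` (2014 to 2018 only, so it says nothing about earlier years); a negative slope indicates falling incidence over that window.

```python
import numpy as np
import pandas as pd

# Rows copied from the Nigeria.head() output above.
nigeria_sample = pd.DataFrame({
    "Year": [2018, 2017, 2016, 2015, 2014],
    "no_of_cases": [291, 283, 281, 296, 314],
}).sort_values("Year")

# Least-squares line: slope is the average change in incidence per year.
slope, intercept = np.polyfit(nigeria_sample["Year"], nigeria_sample["no_of_cases"], 1)
net_change = nigeria_sample["no_of_cases"].iloc[-1] - nigeria_sample["no_of_cases"].iloc[0]

print(f"slope per year: {slope:.1f}, net change 2014-2018: {net_change}")
```

A straight-line fit is a deliberately simple summary; it hides the dip and rebound within the window, so it complements the line chart rather than replacing it.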