# Malaria Data Science Project
by Mervin Keith Cuadera, Jacob Speigel, and Lane Fitzsimmons

## Introduction

## Data Description

**Why were these datasets created?**

The datasets on malaria incidence, cases, and deaths were created by the World Health Organization (WHO) to better understand the threat of malaria in the countries observed. Trends in the data can support the WHO's efforts to eradicate the disease. We combined these three datasets to observe trends in malaria incidence, case counts, and deaths by country and by year. These trends may reveal insights about the spread of the disease and pinpoint areas where malaria incidence has declined, which can then be investigated further.

The World Bank dataset was created to "identify effective public and private actions, set goals and targets, monitor progress and evaluate impacts."

The Earth surface temperature dataset, which we obtained from Kaggle, was created to track trends in global temperatures and look for evidence of climate change.

**Who created the datasets?**

The malaria datasets were created by the World Health Organization, an organization under the direction of the United Nations. The World Bank dataset was created by the World Bank, the United Nations Population Division, and the US Census Bureau. The Earth surface temperature dataset was compiled by Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory.

**Who funded the creation of the datasets?**

The WHO is funded largely by Member States and UN organizations. 
The World Bank is funded by capital from its member countries and organizations.
The Earth surface temperature data is funded through unrestricted charitable grants from donor organizations, as well as by the U.S. Department of Energy.

**What are the observations (rows) and the attributes (columns)?**

Each row represents malaria statistics and Gross Domestic Product per capita adjusted for Purchasing Power Parity (GDPpcPPP) for one country in one year. The attributes in the final dataset include Year, Country, Incidence (the number of cases per 1,000 people), Deaths (confirmed deaths from malaria), Confirmed Cases, Total Population, Region (using the WHO classification), Global South (Global South if the country is a developing nation, Global North if it is developed), GDPpcPPP, and AverageTemperature.
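
As a rough illustration of this one-row-per-country-and-year structure, the sketch below joins three toy tables on Country and Year. The table names and all values are hypothetical placeholders, not the project's actual cleaning code or data.

```python
import pandas as pd

# Toy stand-ins for the source tables; names and values are invented for illustration.
incidence = pd.DataFrame({"Country": ["A", "A", "B"],
                          "Year": [2012, 2013, 2013],
                          "Incidence": [11.0, 9.0, 250.0]})
deaths = pd.DataFrame({"Country": ["A", "A", "B"],
                       "Year": [2012, 2013, 2013],
                       "Deaths": [36, 24, 900]})
gdp = pd.DataFrame({"Country": ["A", "B"],
                    "Year": [2013, 2013],
                    "GDPpcPPP": [2000.0, 1500.0]})

# Outer-join the malaria tables so no (Country, Year) pair is lost,
# then left-join GDP; pairs without a GDP value get NaN.
merged = incidence.merge(deaths, on=["Country", "Year"], how="outer")
merged = merged.merge(gdp, on=["Country", "Year"], how="left")
print(merged)
```

Joining on both keys keeps each observation tied to a single country and year, and missing values surface as NaN rather than silently dropping rows.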

## Pre-registration Assignment

**Overview**

We pre-registered two analyses of our data to ensure that we were not selectively choosing analyses that reveal statistically significant results. We did not know whether these analyses would yield interesting results before we performed them.

**Analysis 1: Multivariable Regression of Temperature, GDP and Malaria Incidence**

In Phase III, we committed to fitting a multivariable linear regression that predicts malaria incidence from temperature and GDP by country for the most recent year in our dataset. However, we realized that grouping by region rather than by country may be more informative, as malaria cases are probably influenced more by geography than by national borders. Temperature and GDP by year and country were not variables in the dataset we used for Phase II, so we performed web scraping to add them to the final dataset. Our aim was to examine the resulting model coefficients and draw conclusions from them. For instance, a large positive coefficient on temperature would suggest that higher temperatures are associated with higher malaria incidence, holding the other predictors constant. From the evidence gathered, we discuss in our analysis factors that are likely to impact malaria incidence worldwide, such as changes in GDP and global warming.
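
Written out, the regional version of this model takes roughly the following form. The notation is introduced here only for illustration: $\beta_0$ is the intercept, $\mathbb{1}[\cdot]$ is an indicator for the region of observation $i$, $T_i$ is the average temperature, $G_i$ is GDPpcPPP, and $\varepsilon_i$ is the error term.

$$
\text{Incidence}_i = \beta_0 + \sum_{r \in \text{Regions}} \beta_r \, \mathbb{1}[\text{Region}_i = r] + \beta_T T_i + \beta_G G_i + \varepsilon_i
$$

Each coefficient is read with the other predictors held fixed. For example, $\beta_T$ is the expected change in incidence per one-unit increase in average temperature among observations in the same region with the same GDPpcPPP.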

**Analysis 2: K-Means Clustering**

We committed to performing a k-means clustering analysis to discover ways in which the countries in the dataset can be categorized. This involved trying clusterings with different values of k. A notable result would be that countries with high, medium, and low malaria incidence rates share common characteristics, such as region or temperature change. This analysis provided further insight into which country characteristics may put countries at a higher risk for the spread of malaria.
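
One way to compare different values of k is sketched below. It is a minimal example on synthetic data, not our actual clustering: the two features stand in for Incidence and AverageTemperature, and the choice of the silhouette score as the comparison metric is an assumption made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for (Incidence, AverageTemperature): three separated groups.
centers = np.array([[5.0, 10.0], [150.0, 25.0], [400.0, 27.0]])
X = np.vstack([c + rng.normal(scale=[5.0, 1.0], size=(30, 2)) for c in centers])

# Standardize so the larger-scale feature does not dominate the distances.
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means for several k and record the mean silhouette score of each clustering.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("k with the highest silhouette score:", best_k)
```

A higher silhouette score indicates tighter, better separated clusters, so the k that maximizes it is a reasonable starting point before inspecting what the clusters have in common.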

## Data Analysis

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import requests
from scipy import stats
import seaborn as sns
from sklearn.linear_model import LinearRegression, LogisticRegression
plt.rcParams["figure.figsize"] = (10, 5)

In [8]:
curated_df_url = "https://raw.githubusercontent.com/mcuadera/info_2950_malaria_project/main/datasets/malaria_project_curated_data.csv"
malaria_df = pd.read_csv(curated_df_url)
malaria_df = malaria_df.set_index('Year')
malaria_df.head()

| Year | Country | Incidence | Deaths | Confirmed Cases | Total Population | Region | Global South | GDPpcPPP | AverageTemperature |
|---|---|---|---|---|---|---|---|---|---|
| 2013-01-01 | Afghanistan | 9.01 | 24.0 | 39263.0 | 32269589.0 | Asia & Pacific | Global South | 2015.514962 | 16.533625 |
| 2012-01-01 | Afghanistan | 11.15 | 36.0 | 54840.0 | 31161376.0 | Asia & Pacific | Global South | 1914.774351 | 14.481583 |
| 2011-01-01 | Afghanistan | 18.87 | 40.0 | 77549.0 | 30117413.0 | Asia & Pacific | Global South | 1699.487997 | 15.518 |
| 2010-01-01 | Afghanistan | 15.11 | 22.0 | 69397.0 | 29185507.0 | Asia & Pacific | Global South | 1710.575645 | 15.828667 |
| 2009-01-01 | Afghanistan | 14.77 | 32.0 | 64880.0 | 28394813.0 | Asia & Pacific | Global South | 1519.692548 | 15.25775 |


### Analysis 1: Multivariable Regression of Temperature, GDP and Malaria Incidence

In [10]:
# Generate a dummy (indicator) column for each region, using 2013 rows only
subset_2013 = malaria_df.loc["2013-01-01"].copy()
region_dummies_2013 = pd.get_dummies(subset_2013["Region"])  # computed once, reused below
for region in ["Asia & Pacific", "Arab States", "Africa",
               "South/Latin America", "Europe", "Middle east"]:
    subset_2013["Is " + region] = region_dummies_2013[region]
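
One caveat with this encoding: if every country falls into one of these six regions, the six indicator columns sum to one in each row and are perfectly collinear with the model intercept, so the individual region coefficients are not uniquely determined. A common alternative, sketched here on made-up rows, is to drop one region as the reference level with `drop_first=True`. The remaining coefficients are then differences relative to that baseline. This is an illustration of the option, not the encoding used in our model.

```python
import pandas as pd

# Made-up example rows; the region names follow the Region column of the dataset.
toy = pd.DataFrame({"Region": ["Africa", "Europe", "Asia & Pacific", "Africa"]})

# drop_first=True omits the first category in sorted order ("Africa" here),
# which then serves as the reference level.
dummies = pd.get_dummies(toy["Region"], prefix="Is", prefix_sep=" ",
                         drop_first=True, dtype=int)
print(dummies.columns.tolist())
print(dummies)
```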

##### Our multivariable linear regression model

In [11]:
incidence_model_vars = ["Is Asia & Pacific", "Is Arab States", "Is Africa",
                        "Is South/Latin America", "Is Europe", "Is Middle east",
                        "AverageTemperature", "GDPpcPPP"]

subset_2013 = subset_2013.dropna(subset = incidence_model_vars).copy() # making sure there are no NA values

incidence_model_2013 = LinearRegression()
incidence_model_2013.fit(subset_2013[incidence_model_vars], subset_2013["Incidence"])
incidence_model_2013_coeff = incidence_model_2013.coef_[:]

In [12]:
# Print each predictor alongside its fitted coefficient
for var, coeff in zip(incidence_model_vars, incidence_model_2013_coeff):
    print('For', var, 'variable, the regression coefficient is: {:.2f}'.format(coeff))

For Is Asia & Pacific variable, the regression coefficient is: -26.35
For Is Arab States variable, the regression coefficient is: -68.27
For Is Africa variable, the regression coefficient is: 167.87
For Is South/Latin America variable, the regression coefficient is: -61.45
For Is Europe variable, the regression coefficient is: 35.38
For Is Middle east variable, the regression coefficient is: -47.18
For AverageTemperature variable, the regression coefficient is: 7.89
For GDPpcPPP variable, the regression coefficient is: -0.00
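
The GDPpcPPP coefficient printing as -0.00 most likely reflects the two-decimal format combined with the scale of the variable: GDPpcPPP takes values in the thousands, so the change in incidence per single unit is small. The sketch below uses synthetic data (all numbers are invented, not the project's) to show that re-expressing GDP in thousands multiplies the coefficient by 1,000 without changing the fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
gdp = rng.uniform(500, 20000, size=200)   # synthetic GDP per capita
temp = rng.uniform(5, 30, size=200)       # synthetic average temperature
incidence = 100 - 0.004 * gdp + 8 * temp + rng.normal(scale=5, size=200)

# Same model twice; only the units of the GDP column differ.
raw = LinearRegression().fit(np.column_stack([gdp, temp]), incidence)
scaled = LinearRegression().fit(np.column_stack([gdp / 1000, temp]), incidence)

print("per unit of GDP:      {:.2f}".format(raw.coef_[0]))   # displays as -0.00
print("per thousand of GDP:  {:.2f}".format(scaled.coef_[0]))
```

Printing with more decimal places or in scientific notation (for example `'{:.2e}'`) would reveal the same information without refitting.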


##### Using pooled data from all years rather than 2013 alone

In [14]:
malaria_df_dropna = malaria_df.copy()
# Generate a dummy (indicator) column for each region across all years
region_dummies_pooled = pd.get_dummies(malaria_df_dropna["Region"])
for region in ["Asia & Pacific", "Arab States", "Africa",
               "South/Latin America", "Europe", "Middle east"]:
    malaria_df_dropna["Is " + region] = region_dummies_pooled[region]
malaria_df_dropna = malaria_df_dropna.dropna(subset=incidence_model_vars).copy() # making sure there are no NA values

incidence_model_pooled = LinearRegression()
incidence_model_pooled.fit(malaria_df_dropna[incidence_model_vars], malaria_df_dropna["Incidence"])
incidence_model_pooled_coeff = incidence_model_pooled.coef_[:]

In [15]:
# Print each predictor alongside its fitted coefficient
for var, coeff in zip(incidence_model_vars, incidence_model_pooled_coeff):
    print('For', var, 'variable, the regression coefficient is: {:.2f}'.format(coeff))

For Is Asia & Pacific variable, the regression coefficient is: -24.84
For Is Arab States variable, the regression coefficient is: -69.34
For Is Africa variable, the regression coefficient is: 168.21
For Is South/Latin America variable, the regression coefficient is: -74.09
For Is Europe variable, the regression coefficient is: 26.72
For Is Middle east variable, the regression coefficient is: -26.67
For AverageTemperature variable, the regression coefficient is: 8.26
For GDPpcPPP variable, the regression coefficient is: -0.00


## Evaluation of Significance

## Interpretation and Conclusions

## Limitations

## Source Code

## Acknowledgements

## Appendix: Data Cleaning Description