# Malaria Data Science Project
by Mervin Keith Cuadera, Jacob Speigel, and Lane Fitzsimmons

## Introduction

## Data Description

## Pre-registration Assignment

**Overview**

We pre-registered two analyses of our data to ensure that we were not selectively choosing analyses that reveal statistically significant results. We did not know whether these analyses would yield interesting results before we performed them.

**Analysis 1: Multivariable Regression of Temperature, GDP and Malaria Incidence**

We committed to creating a multivariable linear regression that predicts the incidence of malaria from temperature and GDP by country for 2013 (the most recent year in our dataset). Temperature and GDP by year and country were not variables in the dataset we used for Phase II, so we performed web scraping to add them to the final dataset. The goal was to examine the fitted coefficients of the model and draw conclusions from them. For instance, a large positive coefficient on temperature would suggest that higher temperatures are associated with a higher incidence of malaria across countries. From the evidence gathered, we discuss in our analysis factors that are likely to impact malaria incidence worldwide, such as changes in GDP and global warming.

**Analysis 2: K-Means Clustering**

We committed to performing a k-means clustering analysis to discover ways in which the countries in the dataset can be categorized. This involved trying different clusterings with different values of k. A significant result would be that countries with high, medium and low malaria incidence rates share common characteristics, such as region or temperature change. This analysis provided further insight into which country characteristics may put countries at a higher risk for the spread of malaria.
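As a minimal sketch of what "trying different values of k" can look like, the block below fits scikit-learn's `KMeans` for several values of k and records the inertia (within-cluster sum of squares) of each fit. The data are made up for illustration: three synthetic groups of "countries" with invented incidence and temperature values. The feature choice, the standardization step, and the range of k are all assumptions, not the project's actual clustering setup.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: each row is a "country" with two made-up features
# (malaria incidence, mean annual temperature).
rng = np.random.default_rng(0)
low = rng.normal(loc=[5.0, 12.0], scale=1.0, size=(20, 2))
mid = rng.normal(loc=[80.0, 22.0], scale=1.0, size=(20, 2))
high = rng.normal(loc=[250.0, 27.0], scale=1.0, size=(20, 2))
features = np.vstack([low, mid, high])

# Standardize each column so that no single feature dominates the distances.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

# Fit one model per candidate k and keep its inertia for comparison.
inertia_by_k = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    inertia_by_k[k] = model.inertia_

# Cluster labels for one candidate value of k.
labels_k3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)

for k, inertia in inertia_by_k.items():
    print("k = {}: inertia = {:.2f}".format(k, inertia))
```

Inertia generally keeps falling as k grows, so it is usually read for the point where the improvement levels off rather than for its minimum.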

## Data Analysis

In [41]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import requests
from scipy import stats
import seaborn as sns
from sklearn.linear_model import LinearRegression, LogisticRegression
plt.rcParams["figure.figsize"] = (10, 5)

### Dataset

In [93]:
incidence_url = "https://raw.githubusercontent.com/mcuadera/info_2950_malaria_project/master/datasets/malaria_incidence.csv"
deaths_url = "https://raw.githubusercontent.com/mcuadera/info_2950_malaria_project/master/datasets/malaria_deaths.csv"
cases_url = "https://raw.githubusercontent.com/mcuadera/info_2950_malaria_project/master/datasets/malaria_confirmed_cases.csv"
country_regions_url = "https://meta.wikimedia.org/wiki/List_of_countries_by_regional_classification"
population_data_url = "https://raw.githubusercontent.com/mcuadera/info_2950_malaria_project/master/datasets/population_data.csv"
gdp_data_url = "https://raw.githubusercontent.com/mcuadera/info_2950_malaria_project/main/datasets/gdppcppp_per_country.csv"
temp_data_url = "https://raw.githubusercontent.com/mcuadera/info_2950_malaria_project/main/datasets/temp_by_country.csv"

incidence = pd.read_csv(incidence_url) # downloaded: 03/11/2021, last updated: 2020-03-27
deaths = pd.read_csv(deaths_url) #downloaded: 03/11/2021, last updated: 2018-12-20
cases = pd.read_csv(cases_url) #downloaded: 03/11/2021, last updated: 2018-12-20
population_data = pd.read_csv(population_data_url) #downloaded 03/16/2021, last updated: 2021-02-17
gdp_data = pd.read_csv(gdp_data_url) #downloaded 04/16/2021, last updated: 2021-03-19
temp_data = pd.read_csv(temp_data_url) #downloaded 04/16/2021, last updated: 2022-12-24
country_regions = requests.get(country_regions_url)
country_regions_table = pd.read_html(country_regions.text)[0] #tables of country regions

The format of the population_data dataset (which represents the total country population per year) was slightly different from that of the other datasets, so it required additional data cleaning steps. We first dropped the columns "Country Code," "Indicator Name," and "Indicator Code," as they were not relevant to our analysis. We also renamed "Country Name" to "Country" to be consistent with the other datasets.

In [94]:
population_data = population_data.drop(["Country Code", "Indicator Name", "Indicator Code"], axis=1).copy()
population_data = population_data.rename(columns={"Country Name":"Country"}).copy()

We converted the data from wide to long format so that each column represents a variable rather than a value of a variable (in this case, a year). This gave us a single Year column, which made the later analysis easier. We also limited our dataset to 2000-2017 to match the other datasets. Note, however, that our temperature data only extends to 2013.

In [95]:
incidence_long = pd.melt(incidence, id_vars=["Country"], var_name="Year", value_name="Incidence")
incidence_long["Year"] = pd.to_datetime(incidence_long["Year"], format="%Y")

deaths_long = pd.melt(deaths, id_vars=["Country"], var_name="Year", value_name="Deaths")
deaths_long["Year"] = pd.to_datetime(deaths_long["Year"], format="%Y")

cases_long = pd.melt(cases, id_vars=["Country"], var_name="Year", value_name="Confirmed Cases")
cases_long["Year"] = pd.to_datetime(cases_long["Year"], format="%Y")

gdp_long = pd.melt(gdp_data, id_vars=["Country"], var_name="Year", value_name="GDPpcPPP")
gdp_long["Year"] = pd.to_datetime(gdp_long["Year"], format="%Y")

population_data_long = pd.melt(population_data, id_vars=["Country"], var_name="Year", value_name="Total Population")
population_data_long["Year"] = pd.to_datetime(population_data_long["Year"], format="%Y")
population_data_long = population_data_long[(population_data_long["Year"] >= "2000-01-01")].copy()

temp_data["Date"] = pd.to_datetime(temp_data["Date"])
temp_data = temp_data[temp_data["Date"] >= "2000-01-01"].copy() # keep rows from 2000 onward (filter the rows, not the "Date" column)
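The same melt-then-parse pattern is repeated for each table above. As a hedged sketch, it could be wrapped in a small helper; `melt_by_year` and the toy table are hypothetical names introduced here for illustration and are not part of the project code.

```python
import pandas as pd

def melt_by_year(wide, value_name):
    """Reshape a table with one column per year into one row per (Country, Year)."""
    long = pd.melt(wide, id_vars=["Country"], var_name="Year", value_name=value_name)
    long["Year"] = pd.to_datetime(long["Year"], format="%Y")
    return long

# Hypothetical two-country table in the same wide layout as the CSV files.
toy_wide = pd.DataFrame({
    "Country": ["A", "B"],
    "2012": [1.0, 3.0],
    "2013": [2.0, 4.0],
})

toy_long = melt_by_year(toy_wide, "Incidence")
print(toy_long)
```

Each year column becomes a block of rows, so a table with 2 countries and 2 year columns yields 4 rows with the columns Country, Year and Incidence.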

Finally, we merged our separate datasets into one. For the temperature data, we calculated the annual mean temperature per country and then merged it into the malaria_stat_merged dataset.

In [96]:
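# malaria_stat_merged is not created until In [98], so the country list is taken from incidence_long here
temp_data_subset = temp_data[temp_data["Country"].isin(incidence_long["Country"].unique())] #only include WHO countries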
temp_data_subset = temp_data_subset.set_index("Date")

In [97]:
temp_data_subset = temp_data_subset.groupby([temp_data_subset.index.year, "Country"])["AverageTemperature"].mean().reset_index() #annual average temp
temp_data_subset["Year"] = pd.to_datetime(temp_data_subset["Date"], format="%Y")
temp_data_subset = temp_data_subset.drop("Date", axis=1).copy()
temp_data_subset = temp_data_subset.set_index("Year")
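To make the annual averaging step concrete, here is a small sketch on made-up monthly readings for a single hypothetical country. The column names follow the temperature dataset above; the values are invented.

```python
import pandas as pd

# Hypothetical readings: two per year for one made-up country.
toy_temp = pd.DataFrame({
    "Date": pd.to_datetime(["2012-01-01", "2012-07-01", "2013-01-01", "2013-07-01"]),
    "Country": ["A", "A", "A", "A"],
    "AverageTemperature": [10.0, 20.0, 12.0, 22.0],
})

# Group by calendar year and country, then average the readings in each group.
annual = (
    toy_temp.groupby([toy_temp["Date"].dt.year.rename("Year"), "Country"])["AverageTemperature"]
    .mean()
    .reset_index()
)

# Turn the integer year back into a timestamp so it matches the other tables.
annual["Year"] = pd.to_datetime(annual["Year"].astype(str), format="%Y")
print(annual)
```

Note that `mean()` skips missing values by default, so a year with missing months is averaged over the available months only.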

In [98]:
malaria_stat_merged = incidence_long.merge(deaths_long, on=["Country", "Year"])
malaria_stat_merged = malaria_stat_merged.merge(cases_long, on=["Country", "Year"])
malaria_stat_merged = malaria_stat_merged.merge(population_data_long, on=["Country","Year"])
malaria_stat_merged = malaria_stat_merged.merge(country_regions_table, on="Country")
malaria_stat_merged = malaria_stat_merged.merge(gdp_long, on=["Country", "Year"])
malaria_stat_merged = malaria_stat_merged.merge(temp_data_subset, on=["Country", "Year"])
malaria_stat_merged = malaria_stat_merged.set_index("Year").copy()

We then checked that the merged dataset looked the way we expected.

In [99]:
malaria_stat_merged.head()

| Year | Country | Incidence | Deaths | Confirmed Cases | Total Population | Region | Global South | GDPpcPPP | AverageTemperature |
|---|---|---|---|---|---|---|---|---|---|
| 2013-01-01 | Afghanistan | 9.01 | 24.0 | 39263.0 | 32269589.0 | Asia & Pacific | Global South | 2015.514962 | 16.533625 |
| 2012-01-01 | Afghanistan | 11.15 | 36.0 | 54840.0 | 31161376.0 | Asia & Pacific | Global South | 1914.774351 | 14.481583 |
| 2011-01-01 | Afghanistan | 18.87 | 40.0 | 77549.0 | 30117413.0 | Asia & Pacific | Global South | 1699.487997 | 15.518 |
| 2010-01-01 | Afghanistan | 15.11 | 22.0 | 69397.0 | 29185507.0 | Asia & Pacific | Global South | 1710.575645 | 15.828667 |
| 2009-01-01 | Afghanistan | 14.77 | 32.0 | 64880.0 | 28394813.0 | Asia & Pacific | Global South | 1519.692548 | 15.25775 |


The most recent year in the merged dataset is 2013. Although the incidence data goes up to 2017, the temperature data only goes up to 2013, and the merge keeps only the country-year pairs that appear in every table.
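The cut-off at 2013 follows from `DataFrame.merge` defaulting to an inner join. The sketch below reproduces the effect on made-up data and uses the `indicator=True` option of a left merge to list the rows that an inner merge would drop. The toy tables and values are hypothetical.

```python
import pandas as pd

# Hypothetical incidence table covering 2012-2014.
toy_incidence = pd.DataFrame({
    "Country": ["A", "A", "A"],
    "Year": pd.to_datetime(["2012", "2013", "2014"], format="%Y"),
    "Incidence": [11.0, 9.0, 8.0],
})

# Hypothetical temperature table that stops a year earlier.
toy_temperature = pd.DataFrame({
    "Country": ["A", "A"],
    "Year": pd.to_datetime(["2012", "2013"], format="%Y"),
    "AverageTemperature": [14.5, 16.5],
})

# Default how="inner": only (Country, Year) pairs present in both tables survive.
inner = toy_incidence.merge(toy_temperature, on=["Country", "Year"])

# A left merge with indicator=True flags the rows that have no temperature match.
audit = toy_incidence.merge(toy_temperature, on=["Country", "Year"], how="left", indicator=True)
dropped = audit[audit["_merge"] == "left_only"]

print(inner)
print(dropped[["Country", "Year"]])
```

Running this kind of audit after each merge is one way to see how many country-year rows are lost at each step.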

### Analysis 1: Multivariable Regression of Temperature, GDP and Malaria Incidence
We committed to creating a multivariable linear regression that predicts the incidence of malaria from temperature and GDP by country for 2013 (the most recent year in our dataset). We modified this question slightly to use region rather than country. Temperature and GDP by year and country were not variables in the dataset we used for Phase II, so we performed web scraping to add them to the final dataset. The goal was to examine the fitted coefficients of the model and draw conclusions from them. For instance, a large positive coefficient on temperature would suggest that higher temperatures are associated with a higher incidence of malaria. From the evidence gathered, we discuss in our analysis factors that are likely to impact malaria incidence worldwide, such as changes in GDP and global warming.

In [106]:
#Generating dummy variables for regions
subset_2013 = malaria_stat_merged.loc["2013-01-01"].copy()
subset_2013["Is Asia & Pacific"] = pd.get_dummies(subset_2013["Region"])["Asia & Pacific"]
subset_2013["Is Arab States"] = pd.get_dummies(subset_2013["Region"])["Arab States"]
subset_2013["Is Africa"] = pd.get_dummies(subset_2013["Region"])["Africa"]
subset_2013["Is South/Latin America"] = pd.get_dummies(subset_2013["Region"])["South/Latin America"]
subset_2013["Is Europe"] = pd.get_dummies(subset_2013["Region"])["Europe"]
subset_2013["Is Middle east"] = pd.get_dummies(subset_2013["Region"])["Middle east"]
subset_2013.head()

| Year | Country | Incidence | Deaths | Confirmed Cases | Total Population | Region | Global South | GDPpcPPP | AverageTemperature | Is Asia & Pacific | Is Arab States | Is Africa | Is South/Latin America | Is Europe | Is Middle east |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-01-01 | Afghanistan | 9.01 | 24.0 | 39263.0 | 32269589.0 | Asia & Pacific | Global South | 2015.514962 | 16.533625 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2013-01-01 | Algeria | 0.0 | 0.0 | 8.0 | 38140132.0 | Arab States | Global South | 13056.80362 | 25.1215 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2013-01-01 | Angola | 180.9 | 7300.0 | 1999868.0 | 26015780.0 | Africa | Global South | 7682.477158 | 22.507875 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2013-01-01 | Argentina | 0.0 | 0.0 | 0.0 | 42202935.0 | South/Latin America | Global South | 20131.68042 | 14.457125 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2013-01-01 | Armenia | 0.0 | 0.0 | 0.0 | 2897584.0 | Europe | Global North | 9835.833011 | 11.34375 | 0 | 0 | 0 | 0 | 1 | 0 |
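The indicator columns above are built one at a time, with `pd.get_dummies` called once per region. As a hedged alternative, a single call with `prefix` and `prefix_sep` produces the same "Is <Region>" column names. The toy frame below is made up, and `dtype=int` is passed because pandas 2.x returns boolean dummies by default.

```python
import pandas as pd

# Hypothetical three-country table with a Region column.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Region": ["Africa", "Europe", "Africa"],
})

# One call builds every indicator column, named "Is <Region>".
region_dummies = pd.get_dummies(toy["Region"], prefix="Is", prefix_sep=" ", dtype=int)

# Attach the indicators to the original rows (the indexes line up).
toy = toy.join(region_dummies)
print(toy)
```

This only creates columns for regions that actually occur in the data, so a region missing from a subset would have no column.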


##### Our multivariable linear regression model

In [119]:
incidence_model_vars = ["Is Asia & Pacific", "Is Arab States", "Is Africa",
                        "Is South/Latin America", "Is Europe", "Is Middle east",
                        "AverageTemperature", "GDPpcPPP"]
subset_2013 = subset_2013.dropna(subset = incidence_model_vars).copy() # making sure there are no NA values

incidence_model_2013 = LinearRegression()
incidence_model_2013.fit(subset_2013[incidence_model_vars], subset_2013["Incidence"])
incidence_model_2013_coeff = incidence_model_2013.coef_[:]

In [121]:
for i in range(len(incidence_model_2013_coeff)):
    print('For', incidence_model_vars[i], 'variable, the regression coefficient is: {:.2f}'.format(incidence_model_2013_coeff[i]))

For Is Asia & Pacific variable, the regression coefficient is: -26.35
For Is Arab States variable, the regression coefficient is: -68.27
For Is Africa variable, the regression coefficient is: 167.87
For Is South/Latin America variable, the regression coefficient is: -61.45
For Is Europe variable, the regression coefficient is: 35.38
For Is Middle east variable, the regression coefficient is: -47.18
For AverageTemperature variable, the regression coefficient is: 7.89
For GDPpcPPP variable, the regression coefficient is: -0.00
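The GDPpcPPP coefficient prints as -0.00 because two decimal places cannot show a small per-unit effect. GDPpcPPP values are in the thousands (for example, 2015.51 for Afghanistan in 2013), so the change in incidence per single unit is necessarily tiny. The sketch below shows two ways to display such a value. The number -0.0012 is a made-up placeholder, not the fitted coefficient, which is not shown in this notebook.

```python
# Hypothetical coefficient values; -0.0012 is a placeholder, NOT the fitted value.
coefficients = {"AverageTemperature": 7.89, "GDPpcPPP": -0.0012}

for name, value in coefficients.items():
    fixed = "{:.2f}".format(value)        # two decimals hide small magnitudes
    scientific = "{:.3e}".format(value)   # scientific notation keeps them visible
    print("{}: fixed = {}, scientific = {}".format(name, fixed, scientific))

# Alternatively, express the effect per 1,000 units of GDPpcPPP.
gdp_effect_per_1000 = coefficients["GDPpcPPP"] * 1000
print("Change in incidence per 1,000 units of GDPpcPPP: {:.2f}".format(gdp_effect_per_1000))
```

Rescaling the predictor before fitting (for example, dividing GDPpcPPP by 1,000) would give the same per-1,000 coefficient directly.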


##### Using pooled data rather than just the year 2013

In [125]:
malaria_stat_merged_dropna = malaria_stat_merged.copy()
malaria_stat_merged_dropna["Is Asia & Pacific"] = pd.get_dummies(malaria_stat_merged_dropna["Region"])["Asia & Pacific"]
malaria_stat_merged_dropna["Is Arab States"] = pd.get_dummies(malaria_stat_merged_dropna["Region"])["Arab States"]
malaria_stat_merged_dropna["Is Africa"] = pd.get_dummies(malaria_stat_merged_dropna["Region"])["Africa"]
malaria_stat_merged_dropna["Is South/Latin America"] = pd.get_dummies(malaria_stat_merged_dropna["Region"])["South/Latin America"]
malaria_stat_merged_dropna["Is Europe"] = pd.get_dummies(malaria_stat_merged_dropna["Region"])["Europe"]
malaria_stat_merged_dropna["Is Middle east"] = pd.get_dummies(malaria_stat_merged_dropna["Region"])["Middle east"]
malaria_stat_merged_dropna = malaria_stat_merged_dropna.dropna(subset=incidence_model_vars).copy() # making sure there are no NA values

incidence_model_pooled = LinearRegression()
incidence_model_pooled.fit(malaria_stat_merged_dropna[incidence_model_vars], malaria_stat_merged_dropna["Incidence"])
incidence_model_pooled_coeff = incidence_model_pooled.coef_[:]

In [126]:
for i in range(len(incidence_model_pooled_coeff)):
    print('For', incidence_model_vars[i], 'variable, the regression coefficient is: {:.2f}'.format(incidence_model_pooled_coeff[i]))

For Is Asia & Pacific variable, the regression coefficient is: -24.84
For Is Arab States variable, the regression coefficient is: -69.34
For Is Africa variable, the regression coefficient is: 168.21
For Is South/Latin America variable, the regression coefficient is: -74.09
For Is Europe variable, the regression coefficient is: 26.72
For Is Middle east variable, the regression coefficient is: -26.67
For AverageTemperature variable, the regression coefficient is: 8.26
For GDPpcPPP variable, the regression coefficient is: -0.00
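One detail worth keeping in mind when reading the region coefficients: in both sets of output they sum to roughly zero (for 2013, -26.35 - 68.27 + 167.87 - 61.45 + 35.38 - 47.18 = 0.00). This appears to be a consequence of fitting an intercept together with a full set of region indicators, which leaves the individual region coefficients not uniquely determined. The sketch below reproduces the pattern on made-up data and compares it with dropping one region to serve as a baseline. The toy data, the region effects, and the choice of "Europe" as baseline are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: 20 made-up "countries" in each of three regions.
rng = np.random.default_rng(0)
regions = np.array(["Africa", "Europe", "Asia & Pacific"])
toy = pd.DataFrame({
    "Region": np.tile(regions, 20),
    "AverageTemperature": rng.normal(20.0, 5.0, size=60),
})
region_effect = toy["Region"].map({"Africa": 150.0, "Europe": 0.0, "Asia & Pacific": 30.0})
toy["Incidence"] = region_effect + 2.0 * toy["AverageTemperature"] + rng.normal(0.0, 1.0, size=60)

dummies = pd.get_dummies(toy["Region"], dtype=float)

# Full set of indicators plus an intercept, as in the models above.
X_full = pd.concat([dummies, toy[["AverageTemperature"]]], axis=1)
full = LinearRegression().fit(X_full, toy["Incidence"])
dummy_coef_sum = full.coef_[: dummies.shape[1]].sum()

# Dropping one region makes it the baseline for the remaining coefficients.
X_base = pd.concat([dummies.drop(columns="Europe"), toy[["AverageTemperature"]]], axis=1)
base = LinearRegression().fit(X_base, toy["Incidence"])

print("Sum of region coefficients (full set): {:.6f}".format(dummy_coef_sum))
print("Africa relative to Europe (baseline model): {:.2f}".format(base.coef_[0]))
```

Both models produce the same fitted values. Only the interpretation of the region coefficients differs: deviations that balance around zero in the first case, differences from the baseline region in the second.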


## Evaluation of Significance

## Interpretation and Conclusions

## Limitations

## Source Code

## Acknowledgements

## Appendix: Data Cleaning Description