# Final Project - Project Proposal

## Problem Statement

For my final project, I would like to build a model that can **predict where patients will seek care for different health issues.** Every health system includes a variety of health providers that people can visit when health issues arise. Global surveys of health behavior have found that many factors influence patients' care-seeking behavior, including distance from home, cost of the provider, attitudes towards the government, and education level. Care-seeking patterns are one indicator of patient preferences for health care in a country, and they could inform public health officials about which types of health providers to use in order to meet patients where they prefer to receive care.

However, in many countries no data exist on where patients seek care, leaving public health officials to choose service delivery locations based primarily on where services are easiest to deliver, rather than on where patients would prefer to receive them.

*How the Model Can Inform Decision Making*

Using data from the Demographic and Health Surveys (DHS), I want to build a model that predicts patient care-seeking patterns, with commonly available demographic data as the primary features. With this model, public health officials could make better decisions about where best to deliver public health interventions based on patient care-seeking preferences. The model would be especially useful in settings that lack data on patient care seeking but do collect common demographic information.

## Data Source ##
The Demographic and Health Surveys (DHS) are surveys implemented nationally in many countries around the world. They provide information on key demographic and health behavior indicators. Countries often conduct a DHS every five years to allow for analysis over time and tracking of their progression against key population health indicators (e.g. immunization coverage or access to contraception). 

For this project, I will leverage the most recent surveys from 5-10 countries across the world. Different survey instruments are used for different groups of the population. The broadest and most comprehensive survey is the individual recode survey, which is given to women ages 15-49. The surveys are powered to the stratum level (every urban/rural x region combination), allowing for potential subnational variation to inform predictions. I will only use the individual recode survey for my model. 

## Target Variables ##
I will use several target variables to see whether some care-seeking patterns are more predictable than others given a set of baseline demographic information. The care-seeking target variables I plan to use are:

1. Place first sought for treatment of diarrhea (H44A)
2. Place first sought for treatment of fever (H46A)
3. Place of delivery of child (M15)
4. Place of first postnatal checkup (M73)
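
Before modeling, it is worth checking how many respondents actually have a recorded answer for each target, since these questions appear to apply only to a subset of women (for example, those with a recent birth or a recently ill child). The sketch below uses a small hypothetical DataFrame in place of the real survey file; the column names match the DHS variables listed above, but the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows standing in for the DHS individual recode file;
# the values are illustrative only, not real survey data.
toy = pd.DataFrame({
    "h44a_1": [None, None, "government health care clinic", None],
    "h46a_1": [None, "pharmacy", "government health care clinic", None],
    "m15_1": ["government hospital", "respondent's home", "government hospital", None],
    "m73_1": ["government health post", None, "government hospital", None],
})

target_cols = ["h44a_1", "h46a_1", "m15_1", "m73_1"]

# Number of respondents with a recorded answer for each target
n_answered = toy[target_cols].notna().sum()
print(n_answered)

# Class distribution of one target, ignoring missing responses
delivery_counts = toy["m15_1"].value_counts(dropna=True)
print(delivery_counts)
```

Targets with very few recorded answers may be too sparse to model on their own, which is one reason to compare several targets.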

## Prediction Features ##
The DHS provides many demographic variables for each survey respondent. I will use several of the most commonly included features so that the model can be applied in many different settings, even where data is limited. There is likely to be collinearity between many of these features, which I will need to test for and consider when building the model. Initially, the features I would like to include are:

**Geographic Features**
- Type of Place of Residence | e.g. Urban/Rural (V025)
- De Facto Place of Residence | e.g. small-town, village, capital city (V026) or (V134)
- Region (V024)
- Distance from place of residence to nearest city (will need to be engineered by including another dataset)

**Personal Features**
- Education level (V106 and V107)
- Religion (V130)
- Age (V012)

**Household Features**
- Has TV, radio, internet, electricity (V119-V121)
- Has telephone (V153)
- Source of drinking water (V113)
- Time to get water (V115)
- Has bike, scooter, motorcycle, car/truck (V123-V125)
- Wealth Index (V190)
- Number of children (V137 or V201)

In [3]:
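dhs_variables = [
    "v005", "v012", "v023", "v024", "v025", "v026", "v134",  # weight, age, stratum, geography
    "v106", "v107", "v130",  # education, religion
    "v119", "v120", "v121", "v153", "v113", "v115",  # household amenities
    "v123", "v124", "v125", "v190", "v137", "v201",  # transport, wealth, children
    "h44a_1", "h46a_1", "m15_1", "m73_1",  # target variables
]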

## Goals and Success Metrics ##
- I expect to use a classification algorithm (such as KNN) for this problem. I will therefore evaluate the model by how much better it predicts care-seeking patterns than a null model that always predicts the most common care-seeking value in the data set.
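
The comparison against the null model can be sketched as follows. This is a minimal illustration, assuming scikit-learn is available and using synthetic data in place of the DHS features; the feature matrix, the class cut-offs, and the choice of `n_neighbors=5` are all placeholders rather than decisions made for this project.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not DHS data): two numeric features
# and a three-class target driven by the first feature, with class 0 most common.
X = rng.normal(size=(300, 2))
y = np.digitize(X[:, 0], bins=[0.3, 1.2])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Null model: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Candidate model: k-nearest neighbors with an arbitrary k
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn_acc = knn.score(X_test, y_test)

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"KNN accuracy:      {knn_acc:.3f}")
```

With the real survey data, the categorical features would first need to be encoded numerically, and the gap between the two accuracies would be the headline success metric.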

## Risks or Limitations ##
- Definitions of health facilities are not standardized across countries, so it will be difficult to build an internationally generalizable model without knowing something about each country's health system and how it defines health facilities.
- The dataset only surveys women between the ages of 15 and 49. The model won't generalize to care-seeking patterns for men unless we make some strong assumptions or include more data about male care-seeking patterns.
- As mentioned above, there is likely to be collinearity between the features. I will need to determine the extent to which collinearity exists and remove features accordingly.
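
Because most of the proposed features are categorical, one possible way to gauge collinearity is Cramér's V, which rescales the chi-squared statistic of a contingency table to a 0-1 range. The sketch below is an assumption about how this check could be done, not a settled method: `cramers_v` is a hypothetical helper, the toy rows are made up, and the statistic is the uncorrected version that ignores survey weights.

```python
import numpy as np
import pandas as pd


def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Uncorrected Cramér's V between two categorical series (0 = none, 1 = perfect)."""
    observed = pd.crosstab(a, b).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: outer product of the margins over n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0


# Hypothetical toy rows: residence type, wealth index, and TV ownership
toy = pd.DataFrame({
    "v025": ["urban", "urban", "rural", "rural"],
    "v190": ["richest", "richest", "poorest", "poorest"],
    "v121": ["yes", "no", "yes", "no"],
})

v_residence_wealth = cramers_v(toy["v025"], toy["v190"])
v_residence_tv = cramers_v(toy["v025"], toy["v121"])
print(v_residence_wealth, v_residence_tv)
```

Pairs of features with values close to 1 carry largely the same information, so one of the pair would be a candidate for removal.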

## Resources ##
**DHS Resources**
Investigation of Nutritional Status of Children based on Machine Learning Techniques using Indian Demographic and Health Survey Data: https://www.sciencedirect.com/science/article/pii/S187705091731894X

Combining satellite imagery and machine learning to predict poverty: https://sustain.stanford.edu/predicting-poverty; https://github.com/nealjean/predicting-poverty

Guide to DHS Statistics: https://www.dhsprogram.com/Data/Guide-to-DHS-Statistics/index.cfm

**Python Resources**
statsmodels `DescrStatsW` documentation: http://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.DescrStatsW.html

## Initial EDA ##

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.weightstats import DescrStatsW
from pathlib import Path

In [5]:
data = Path('Datasets','NMIR61FL.DTA')
nmb_dhs = pd.read_stata(data, columns = dhs_variables)

In [6]:
# Keep only women ages 15-49. The survey notes suggest that 842 women
# age 50 and older were added in half of the households surveyed.
nmb_dhs_small = nmb_dhs.loc[nmb_dhs["v012"] < 50, :].copy()

In [7]:
# Create a survey weight variable. DHS stores the weight (v005) as an integer,
# so it has to be divided by 1,000,000 before use.
svy_weight = nmb_dhs_small["v005"] / 1_000_000
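
As a rough sketch of how the rescaled weight could be used, the example below computes a weighted mean age with `DescrStatsW` from statsmodels. The toy rows are hypothetical, and this only produces a weighted point estimate; it does not account for the clustering and stratification of the survey design.

```python
import pandas as pd
from statsmodels.stats.weightstats import DescrStatsW

# Hypothetical toy rows: v005 is the raw DHS weight, v012 is age in years
toy = pd.DataFrame({
    "v005": [500000, 1500000, 1000000, 1000000],
    "v012": [20, 30, 40, 50],
})
weights = toy["v005"] / 1_000_000

# Weighted descriptive statistics; ddof=0 treats the weights as population shares
weighted = DescrStatsW(toy["v012"], weights=weights, ddof=0)
weighted_mean_age = weighted.mean
unweighted_mean_age = toy["v012"].mean()

print(f"unweighted mean age: {unweighted_mean_age}")
print(f"weighted mean age:   {weighted_mean_age}")
```

In this toy case the weighted mean differs from the unweighted one because the respondents do not all represent the same share of the population.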

In [8]:
nmb_dhs_small.shape

(9176, 26)

In [9]:
nmb_dhs_small.loc[:,"v025"].value_counts()

urban    4843
rural    4333
Name: v025, dtype: int64
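
The counts above are unweighted, so they describe the sample rather than the population. A minimal sketch of the weighted equivalent, using hypothetical toy rows rather than the survey data, sums the rescaled weights within each category:

```python
import pandas as pd

# Hypothetical toy rows: raw DHS weight and type of place of residence
toy = pd.DataFrame({
    "v005": [2000000, 1000000, 500000, 500000],
    "v025": ["urban", "urban", "rural", "rural"],
})
toy["weight"] = toy["v005"] / 1_000_000

# Share of respondents in each category, ignoring weights
unweighted_share = toy["v025"].value_counts(normalize=True)

# Share of the weighted population in each category
weighted_share = toy.groupby("v025")["weight"].sum() / toy["weight"].sum()

print(unweighted_share)
print(weighted_share)
```

Here the sample is split evenly, but the weighted shares are not, which is the kind of difference to look for in the real urban/rural breakdown.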

In [13]:
nmb_dhs_small.head(20)

|  | v005 | v012 | v023 | v024 | v025 | v026 | v134 | v106 | v107 | v130 | ... | v123 | v124 | v125 | v190 | v137 | v201 | h44a_1 | h46a_1 | m15_1 | m73_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 531957 | 44 | erongo, urban | erongo | urban |  |  | secondary | 3.0 | protestant/anglican | ... | no | no | yes | richest | 0 | 3.0 |  |  |  |  |
| 1 | 564917 | 25 | caprivi, rural | caprivi | rural |  |  | no education |  | other | ... | yes | no | no | poorest | 1 | 2.0 |  |  | government hospital | government health post |
| 2 | 564917 | 31 | caprivi, rural | caprivi | rural |  |  | primary | 6.0 | protestant/anglican | ... | no | no | no | poorest | 1 | 4.0 |  |  | respondent's home | government health post |
| 4 | 564917 | 35 | caprivi, rural | caprivi | rural |  |  | primary | 6.0 | seventh-day adventist | ... | no | no | no | poorest | 1 | 2.0 |  | government health care clinic | government hospital |  |
| 5 | 564917 | 16 | caprivi, rural | caprivi | rural |  |  | secondary | 2.0 | seventh-day adventist | ... | no | no | no | poorest | 1 | 0.0 |  |  |  |  |
| 6 | 564917 | 41 | caprivi, rural | caprivi | rural |  |  | secondary | 2.0 | seventh-day adventist | ... | no | no | no | poorest | 2 | 7.0 |  |  | respondent's home | government health post |
| 7 | 564917 | 38 | caprivi, rural | caprivi | rural |  |  | secondary | 1.0 | protestant/anglican | ... | no | no | no | poorest | 3 | 6.0 |  | government health care clinic | government hospital |  |
| 8 | 564917 | 19 | caprivi, rural | caprivi | rural |  |  | secondary | 1.0 | protestant/anglican | ... | no | no | no | poorest | 3 | 1.0 |  |  | government hospital |  |
| 9 | 564917 | 35 | caprivi, rural | caprivi | rural |  |  | secondary | 1.0 | seventh-day adventist | ... | no | no | no | poorest | 1 | 3.0 |  |  | government health care clinic |  |
| 10 | 564917 | 30 | caprivi, rural | caprivi | rural |  |  | secondary | 5.0 | elcin | ... | no | no | yes | middle | 1 | 3.0 |  | pharmacy | government hospital |  |
10,564917,30,"caprivi, rural",caprivi,rural,,,secondary,5.0,elcin,...,no,no,yes,middle,1,3.0,,pharmacy,government hospital,


In [14]:
nmb_dhs_small.tail(20)

|  | v005 | v012 | v023 | v024 | v025 | v026 | v134 | v106 | v107 | v130 | ... | v123 | v124 | v125 | v190 | v137 | v201 | h44a_1 | h46a_1 | m15_1 | m73_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9992 | 1019991 | 18 | oshikoto, urban | oshikoto | urban |  |  | secondary | 1.0 | elcin | ... | no | no | yes | richer | 1 | 1.0 |  |  | government hospital | government health center |
| 9996 | 978455 | 31 | oshikoto, rural | oshikoto | rural |  |  | secondary | 2.0 | elcin | ... | no | yes | yes | poorer | 2 | 5.0 |  |  | government hospital | government hospital |
| 9997 | 978455 | 32 | oshikoto, rural | oshikoto | rural |  |  | secondary | 3.0 | elcin | ... | no | no | no | poorer | 2 | 3.0 | government health care clinic | government health care clinic | government hospital | government hospital |
| 9998 | 978455 | 29 | oshikoto, rural | oshikoto | rural |  |  | secondary | 3.0 | elcin | ... | no | no | no | poorest | 1 | 3.0 |  |  | respondent's home | government health post |
| 9999 | 978455 | 25 | oshikoto, rural | oshikoto | rural |  |  | secondary | 3.0 | elcin | ... | not a dejure resident | not a dejure resident | not a dejure resident | poorer | 0 | 2.0 |  | government health care clinic | government hospital | government hospital |
| 10000 | 978455 | 32 | oshikoto, rural | oshikoto | rural |  |  | no education |  | elcin | ... | no | no | no | poorest | 0 | 2.0 |  |  |  |  |
| 10001 | 978455 | 39 | oshikoto, rural | oshikoto | rural |  |  | primary | 7.0 | elcin | ... | no | no | no | poorest | 0 | 3.0 |  |  |  |  |
| 10003 | 978455 | 36 | oshikoto, rural | oshikoto | rural |  |  | secondary | 3.0 | elcin | ... | no | no | yes | richer | 0 | 4.0 |  |  |  |  |
| 10004 | 978455 | 38 | oshikoto, rural | oshikoto | rural |  |  | secondary | 3.0 | elcin | ... | no | no | yes | middle | 1 | 4.0 |  |  | government hospital | government hospital |
| 10005 | 978455 | 33 | oshikoto, rural | oshikoto | rural |  |  | secondary | 3.0 | protestant/anglican | ... | no | no | no | poorest | 4 | 7.0 |  |  | government hospital |  |
