<div align='center'>
    <h1>Predicting Subjective Assessment of General Health</h1>
    Flor Cabral, Alijah O'Connor <br>
    CSCI3022 - Data Science, Spring 2019
</div>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Extra functions module
from extra_functions.functions import visualize, kde_by_category

# I. Table of Contents

# II. Introduction

Assessing subjective feelings (e.g. general health, quality of life, etc.) has been an important task in medicine for as long as the profession has existed.  Considering that a core goal of medicine is to restore positive subjective perceptions and/or prevent negative ones in patients, the importance of accurate measurement is hard to overstate.  As with any subjective assessment, however, a number of biases can introduce a large amount of arbitrary variance into the measurements.  There have been several attempts to standardize measurements of subjective experience using survey questions; one example is the Nottingham Health Profile, an attempt to measure "quality of life" (Hunt et al., 1985).  To add to the existing list of subjective-to-objective mappings, we attempt to predict a particular subjective measurement ("General Health") from CDC health data by building statistical models with laboratory-based data, sociological-based data, and combined subsets of the two.  The general hypothesis for this project is that, given a certain set of information about a patient, we can predict how they feel about their health without explicitly asking them.

## II.a Data Description

#### NHANES Overview
The data used in this study is compiled and published by the CDC under the program name National Health and Nutrition Examination Survey (herein referred to as NHANES).  The NHANES program operates annually (though with different pools of participants) with the goal of assessing the health and nutrition of people living in the United States.  The data produced by the program is distinctive because it combines biochemical laboratory measurements, questionnaires, dietary data, external body measures, and demographics.  This abundance of data is attractive to statisticians and data scientists (or students of the fields, like us) for building models of many different outcomes from a large number of measurements (referred to as features in data science).

#### Data acquisition
The specific NHANES data used herein is a subset of files and fields from the 2015-2016 collection (https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2015).  This is the most recent release of NHANES data, as it takes years for the researchers to organize, clean, and verify the data from a particular year.
As noted previously, the NHANES data is broken up into several categories (e.g. demographics, laboratory, etc.), and each category is subsequently broken up into a number of files, each made up of a number of columns corresponding to particular measurements.  The data files within each category are released to the public in SAS transport (XPT) format, which is less convenient to work with in pandas than CSV.  To use these data files in our study, we wrote a shell script that converts the XPT files to CSV.
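
As an alternative to a shell-based conversion, pandas can read SAS transport files directly through `pandas.read_sas`.  The sketch below is not the script used in this project; the function names `csv_path_for` and `xpt_to_csv` are hypothetical, and the file path is passed in by the caller.

```python
from pathlib import Path

import pandas as pd


def csv_path_for(xpt_path):
    """Return the output path: same folder and stem, with a .csv suffix."""
    return Path(xpt_path).with_suffix(".csv")


def xpt_to_csv(xpt_path):
    """Read one SAS transport (XPT) file and write it back out as CSV."""
    frame = pd.read_sas(xpt_path, format="xport")
    out_path = csv_path_for(xpt_path)
    frame.to_csv(out_path, index=False)
    return out_path
```

Calling `xpt_to_csv` on a downloaded XPT file would write a CSV file next to it and return the new path.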

We have placed the converted files in a GitHub repository here: https://github.com/oconnoag/NHANES_Data.  This repository also contains the compiled files, which will be described shortly.  We opted for separate repositories for the data and the analyses because (1) anyone who would like to download the data in CSV format can easily clone or zip the repository, and (2) this architecture allows the CSV files to be read directly from GitHub, so there is no need to download the entire dataset to a local machine to view our analysis.

#### Selecting initial features to compile for consideration
For the compiled data (i.e. the data that we actually analyzed for incorporation in our models), we chose a subset of factors that we surmised would correlate with a subject's general health.  In the data setup files (found here:  https://github.com/oconnoag/NHANES_Data/tree/master/compiled_data), the features were divided between the authors of this study (Alijah and Flor).  Alijah would choose and explore data from the laboratory category, and Flor would choose and explore data from the questionnaire and demographics categories.  This approach serves as an initial test that targets specific factors chosen by intuition; however, once these original models are constructed, we may switch gears and generalize our approach by including many more factors.

#### General Health
The general_health field is originally coded as the "HSD010" column in the /Questionnaire/HSQ_I ("General Health Status") file.  The measurements come from subjects answering the question:  "Would you say {your/SP's} health in general is {List Options}?".  The answers are broken up into 5 levels: 1, 2, 3, 4, 5 corresponding to "Excellent", "Very Good", "Good", "Fair", and "Poor", respectively.
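
The coding above can be sketched as a small lookup that turns the numeric codes into an ordered categorical.  The names `GENERAL_HEALTH_LABELS` and `label_general_health` are hypothetical, and treating any code outside 1 to 5 as missing is an assumption (NHANES questionnaires typically reserve other values for refused or unknown responses).

```python
import pandas as pd

# Codes 1-5 and their labels, as described in the text above.
GENERAL_HEALTH_LABELS = {
    1: "Excellent",
    2: "Very Good",
    3: "Good",
    4: "Fair",
    5: "Poor",
}


def label_general_health(codes):
    """Map HSD010 codes to an ordered categorical; other codes become NaN."""
    labels = pd.Series(codes).map(GENERAL_HEALTH_LABELS)
    return pd.Categorical(
        labels,
        categories=list(GENERAL_HEALTH_LABELS.values()),
        ordered=True,
    )


# Made-up codes; 9 is not one of the five levels and is mapped to NaN.
example = label_general_health([1, 3, 5, 9])
print(list(example))
```

Keeping the categories ordered preserves the ranking from "Excellent" to "Poor", which matters if the target is later treated as ordinal rather than nominal.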

#### Laboratory Features Considered
| Filename_NHANES | Filename_Project | Feature_Name_NHANES | Feature_Name_Project | Description |
|------|------|------|------|------|
| BIOPRO_I | standard_biochem_profile | LBXSBU | Blood Urea Nitrogen | Measured from blood in mg/dL |
| BIOPRO_I | standard_biochem_profile | LBXSC3SI | Bicarbonate | Measured from serum in mmol/L |
| BIOPRO_I | standard_biochem_profile | LBXSCA | Total Calcium | Measured from serum in mg/dL |
| BIOPRO_I | standard_biochem_profile | LBXSCH | Cholesterol | Measured from serum in mg/dL |
| BIOPRO_I | standard_biochem_profile | LBXSCLSI | Chloride | Measured from serum in mmol/L |
| BIOPRO_I | standard_biochem_profile | LBXSGL | Glucose | Measured from serum in mg/dL |
| BIOPRO_I | standard_biochem_profile | LBXSIR | Iron | Measured from serum in ug/dL |
| BIOPRO_I | standard_biochem_profile | LBXSKSI | Potassium | Measured from serum in mmol/L |
| BIOPRO_I | standard_biochem_profile | LBXSNASI | Sodium | Measured from serum in mmol/L |
| BIOPRO_I | standard_biochem_profile | LBXSTP | Total Protein | Measured from serum in g/dL |
| BIOPRO_I | standard_biochem_profile | LBXSTR | Triglycerides | Measured from serum in mg/dL |
| BIOPRO_I | standard_biochem_profile | LBXSUA | Uric Acid | Measured from serum in mg/dL |
| TST_I | sex_steroid_hormone | LBXTST | Testosterone | Measured from serum in ng/dL |
| TST_I | sex_steroid_hormone | LBXEST | Estradiol | Measured from serum in pg/mL |
| TST_I | sex_steroid_hormone | LBXSHBG | Sex Hormone Binding Globulin (SHBG) | Measured from serum in nmol/L |
| GHB_I | glycohemoglobin | LBXGH | Glycohemoglobin | Measured from whole blood as a percentage (%) |
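
A minimal sketch of how columns from separate files could be joined to the target and renamed to the project feature names.  It assumes each file carries the NHANES respondent identifier `SEQN`; the tiny frames and their values are made up, and this is not the setup code from the repository.

```python
import pandas as pd

# Made-up rows standing in for BIOPRO_I, TST_I, and HSQ_I.
biochem = pd.DataFrame({"SEQN": [1, 2, 3], "LBXSGL": [90.0, 110.0, 85.0]})
hormones = pd.DataFrame({"SEQN": [1, 2, 4], "LBXTST": [420.0, 35.0, 510.0]})
health = pd.DataFrame({"SEQN": [1, 2, 3, 4], "HSD010": [2, 4, 1, 3]})

# Subset of the table above: NHANES field name -> project feature name.
RENAMES = {
    "LBXSGL": "Glucose",
    "LBXTST": "Testosterone",
    "HSD010": "general_health",
}

# Inner joins keep only respondents present in every file.
compiled = (
    biochem.merge(hormones, on="SEQN", how="inner")
    .merge(health, on="SEQN", how="inner")
    .rename(columns=RENAMES)
)
print(compiled)
```

An inner join silently drops respondents missing from any one file, so the row count after merging is worth checking against each source file.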


#### Questionnaire/Demographic Features



## II.b Analysis/Modeling Descriptions

#### Data Cleaning
To use our data for model generation (and for some visualizations, like pairplot), there cannot be NaN values in any of the columns.  This led us to remove, from the start, some initial features that didn't have enough data points.  Once we had features from files with a sufficient number of entries (>6000), we removed all entries that contained any NaN values.  For the laboratory set, this reduced the size to ~5500 complete subject profiles.  Outlier culling was not undertaken in this part of the project, because we don't yet know how particular values will affect the construction of the models.  Biology is messy, so it's difficult to determine what constitutes an outlier; however, we may pick a distance threshold (in standard deviations from the mean) if certain values seem to skew the data drastically (perhaps glucose measurements, for example).
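
The cleaning steps described above could look roughly like the following.  The helper names are hypothetical, the toy data is made up, and the default of 3 standard deviations for outliers is an assumption, since no threshold has been chosen in this project.

```python
import numpy as np
import pandas as pd

MIN_ENTRIES = 6000  # per-feature threshold mentioned in the text


def drop_sparse_then_incomplete(frame, min_entries=MIN_ENTRIES):
    """Drop columns with too few non-NaN values, then rows with any NaN."""
    counts = frame.notna().sum()
    kept = frame.loc[:, counts > min_entries]
    return kept.dropna(axis=0, how="any")


def flag_outliers(column, max_sd=3.0):
    """Boolean mask of values more than max_sd standard deviations from the mean."""
    z = (column - column.mean()) / column.std()
    return z.abs() > max_sd


# Made-up data: 'a' is nearly complete, 'b' is mostly missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [np.nan, np.nan, np.nan, 1.0, 2.0],
})
cleaned = drop_sparse_then_incomplete(toy, min_entries=3)
print(cleaned)
```

Flagging outliers with a mask, rather than deleting them immediately, leaves room to compare models fit with and without the flagged rows.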

#### Analysis

#### Possible Models
Because we are working with a multi-class classification problem (five levels of general health), several different models could be (and likely will be) constructed, including logistic regression, linear discriminant analysis (LDA), and a random forest.  As mentioned previously, we are looking to build models using laboratory data only, sociological data only, and the combined datasets.
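
A hedged sketch of how the three candidate models could be compared with scikit-learn.  The data here is synthetic (generated by `make_classification`) and only stands in for the cleaned features and the five-level target; the hyperparameters and the choice of 3-fold cross-validation are assumptions, not settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned features and the 5-level target.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "lda": LinearDiscriminantAnalysis(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 3-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=3).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

These classifiers treat the five levels as unordered categories; the same loop could be run once per feature set (laboratory, sociological, combined) to compare them on equal footing.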


# III. Exploring Laboratory-based Data

# IV. Exploring Sociological-based Data

# V. Combined Data

# VI. Conclusion, Discussion, and Future Plans

If mapping subjective experience were easy, no one would be concerned with asking these types of questions in the medical community anymore.  Just as other less reliable measures related to well-being have been stripped from the profession due to advancing techniques and technologies, so too may this subjective assessment of health.  

# VII. References

Hunt, S. M., McEwen, J., & McKenna, S. P. (1985). Measuring health status: a new tool for clinicians and epidemiologists. J R Coll Gen Pract, 35(273), 185-188.