# Summary:
Gretta Ferguson

Project Luther

10/15/18


### Problem statement and motivation:

For my project, I wanted to predict high LDL cholesterol (i.e., "bad" cholesterol) from metrics that people could measure at home, without the need for doctors or bloodwork. LDL cholesterol is found in blood, and at high levels it sticks to the walls of blood vessels. This clogs pathways, forcing the heart to work harder and causing blood clots from high pressure, which can lead to heart attacks, strokes, and other health issues. Another factor that makes high cholesterol so dangerous is that there are usually no symptoms until these secondary, often deadly, complications arise. From personal experience, my father had a 95% blockage with no symptoms.


### Data selection and collection:

I began my investigation by researching the risk factors for high cholesterol. My research showed that both lifestyle and demographic factors could inform one's risk. I collected most of my data from the CDC's National Health and Nutrition Examination Survey, which was available on Kaggle. This data combines survey responses about lifestyle with exam and lab (e.g. bloodwork) measurements for about ten thousand individuals. I also extended this data to include a BMI measurement for each individual by scraping data from the National Institutes of Health website using Beautiful Soup. The final risk factors I considered were as follows:

* Lifestyle
    * Diet
        * Continuous survey data on how many grams of **saturated fatty acids** an individual consumed. I also considered consumption of other types of fat (e.g. unsaturated and total), but including all of them would have introduced multicollinearity into my model. My research showed saturated fats were more predictive of high cholesterol risk, so that is what I kept.
        * Survey data counting how often an individual added **salt** to his/her food at the table.
        * Continuous survey data on how many grams of **alcohol** an individual consumed.
    * Exercise
        * Calculated value of days per week with significant **exercise**. There were numerous survey questions about exercise (e.g. heavy vs. moderate vs. light, at home vs. at work vs. via sports). Some of these categories overlapped, so I needed a final score that didn't double count exercise. I ended up using two questions: the first asked how many days per week one exercises to the point of sweating, and the second asked whether one's work involves "vigorous-intensity activity" like digging, lifting, etc. If the answer to the work question was yes, I counted 5 days of exercise per week and added that to the non-work days per week of exercise, with a maximum of 7 days per week.
    * Smoking
        * Count of **cigarettes smoked in the last 30 days**, which I calculated by multiplying the values of two survey questions: "how many days in the last month did you smoke" and "on days when you smoked, how many cigarettes did you smoke".
        * Count of typical **past cigarette consumption**, i.e. how many cigarettes an individual used to smoke at the peak of their smoking
    * Stress
        * **Income to poverty ratio**
        * Systolic and diastolic **blood pressure**
        * **Marriage status**, but I ended up getting rid of this metric because it had 6 different categories and it led to model overfitting
    * Obesity
        * **Waist circumference** as measured in the examination
        * **BMI** pulled from the NIH website, using Beautiful Soup to scrape a BMI table. The CDC dataset gave me height and weight for each individual. I then wrote a function which "looked" for these values in the scraped table and returned the relevant BMI. I applied this function to the height and weight arrays using numpy.vectorize to create a BMI array.
* Demographic
    * Gender
        * **Sex** as reported in the survey
    * Age
        * **Age** in years as reported in the survey
    * Race
        * **Race** category (6 in total) as reported in the survey
    * Diabetes
        * Whether or not someone has been diagnosed with **diabetes**
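
The derived features described in the list above (exercise days, cigarettes in the last 30 days, and the BMI lookup) can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the column names, the imperial units, and the 10 lb weight grid are assumptions, and the small `bmi_table` is a stand-in for the scraped NIH table, filled from the standard formula (703 × weight in lb / height in inches squared) so the example runs without a network call.

```python
import numpy as np
import pandas as pd

# Hypothetical column names and toy values; the real survey codes differ.
df = pd.DataFrame({
    "sweat_days_per_week": [0, 3, 6, 2],
    "vigorous_work": [False, True, True, False],
    "days_smoked_last_30": [0, 30, 10, 0],
    "cigs_per_smoking_day": [0, 20, 5, 0],
    "height_in": [64, 70, 68, 60],
    "weight_lb": [140, 180, 210, 100],
})

# Exercise: vigorous work counts as 5 days, added to the non-work days
# and capped at 7 days per week so nothing is double counted past a week.
work_days = np.where(df["vigorous_work"], 5, 0)
df["exercise_days"] = np.minimum(df["sweat_days_per_week"] + work_days, 7)

# Smoking: days smoked in the last month times cigarettes per smoking day.
df["cigs_last_30_days"] = df["days_smoked_last_30"] * df["cigs_per_smoking_day"]

# Stand-in for the scraped table: {(height_in, weight_lb): bmi}.
bmi_table = {
    (h, w): round(703 * w / h**2, 1)
    for h in range(58, 77)
    for w in range(90, 301, 10)
}

def lookup_bmi(height_in, weight_lb):
    """Snap weight to the assumed 10 lb grid, then look the pair up."""
    snapped = int(round(float(weight_lb) / 10) * 10)
    return bmi_table.get((int(height_in), snapped), np.nan)

# np.vectorize applies the scalar lookup element-wise over both arrays.
df["bmi"] = np.vectorize(lookup_bmi, otypes=[float])(
    df["height_in"].to_numpy(), df["weight_lb"].to_numpy()
)

print(df[["exercise_days", "cigs_last_30_days", "bmi"]])
```

Note that `np.vectorize` is a convenience wrapper around a Python loop, so it keeps the lookup readable but does not make it faster.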
        
        
I filtered out the rows which had missing data for any of these categories and was left with 2,431 distinct data points (i.e. participants).
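
As a small sketch of that filtering step (with hypothetical column names and toy values), restricting `dropna` to the columns used in the model avoids discarding rows that are only missing values in unrelated columns:

```python
import numpy as np
import pandas as pd

# Toy data; column names are illustrative only.
df = pd.DataFrame({
    "ldl": [110.0, np.nan, 95.0, 160.0],
    "age": [34, 51, np.nan, 67],
    "waist_cm": [88.0, 101.5, 79.0, 110.2],
    "unused_column": [np.nan, np.nan, np.nan, 1.0],
})

# Only the columns that feed the model decide whether a row is kept.
model_columns = ["ldl", "age", "waist_cm"]
complete = df.dropna(subset=model_columns).reset_index(drop=True)

print(f"{len(df)} rows before, {len(complete)} rows after")
```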


### Feature engineering and model selection:

I began by checking my data for correlations using a heatmap, both to get a sense of which variables were strong predictors of cholesterol and to check for multicollinearity. I used pd.get_dummies to transform my categorical variables into binary columns. I then created an interaction variable between race and gender, since my research indicated that the interaction of these factors would be important. I split my data into train and test sets and set aside the test data until I had selected my model.
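
A rough sketch of the encoding, interaction, and split steps is shown below, using synthetic data. The column names, category labels, 80/20 split, and random seed are all assumptions made for illustration; the interaction here is built as the product of the race and sex dummy columns, which is one common way to do it and may differ from the project's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real dataset has many more columns.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "ldl": rng.normal(115, 30, n),
    "age": rng.integers(20, 80, n),
    "sex": rng.choice(["female", "male"], n),
    "race": rng.choice(["group_1", "group_2", "group_3"], n),
})

# Correlation matrix of the numeric columns (the input to a heatmap).
corr = df[["ldl", "age"]].corr()

# One-hot encode; drop_first avoids a redundant column per category,
# which would otherwise be perfectly collinear with the intercept.
X = pd.get_dummies(
    df[["age", "sex", "race"]], columns=["sex", "race"],
    drop_first=True, dtype=float,
)

# Interaction terms: product of each race dummy with the sex dummy.
for col in [c for c in X.columns if c.startswith("race_")]:
    X[f"{col}:sex_male"] = X[col] * X["sex_male"]

y = df["ldl"]

# Hold out the test set before any model selection happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X.columns.tolist())
```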

I then created histograms to view the distribution of each variable. Linear regression assumes approximately normally distributed residuals, which is easier to satisfy when the variables themselves are not heavily skewed, but several of my variables were positively skewed. I transformed the LDL cholesterol and BMI data using the Box-Cox transformation. For the alcohol, smoking, and exercise data, I used log(x+1) transformations (adding 1 since the data included zeros, where log(x) is undefined). I examined the histograms as well as QQ plots, predictions vs. actuals, and residuals for all the transformed variables, and confirmed that my data satisfied the normality and homoscedasticity assumptions.
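
The two transformations can be sketched as follows on synthetic, positively skewed data (the distributions and variable names are made up for illustration). `scipy.stats.boxcox` requires strictly positive input and returns the fitted lambda, which is needed later to map predictions back to the original scale.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-ins for the real variables.
ldl = rng.lognormal(mean=4.7, sigma=0.3, size=500)
alcohol_g = np.concatenate([np.zeros(200), rng.exponential(20, 300)])

# Box-Cox: lambda is estimated by maximum likelihood when not supplied.
ldl_bc, lam = stats.boxcox(ldl)

# log(x + 1) keeps zeros at zero instead of sending them to -infinity.
alcohol_log = np.log1p(alcohol_g)

# Skewness closer to 0 indicates a more symmetric distribution.
print("LDL skew before:", stats.skew(ldl), "after:", stats.skew(ldl_bc))

# Predictions made on the transformed scale can be inverted with the same lambda.
ldl_back = inv_boxcox(ldl_bc, lam)
```

One design point worth keeping in mind: to avoid leaking information from the held-out data, lambda would typically be estimated on the training set only and then reused on the test set via `stats.boxcox(x_test, lmbda=lam)`.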



### Tools/libraries used:

* pandas
* seaborn
* statsmodels.api
* numpy 
* scipy.stats
* matplotlib.pyplot
* patsy
* scipy
* sklearn.metrics
* sklearn.preprocessing 
* sklearn.model_selection 
* sklearn.linear_model 
* sklearn.pipeline 
* requests
* bs4 (BeautifulSoup)