# Homework 6

The National Health and Nutrition Examination Survey (NHANES) is a
cross-sectional observational study run every 2-3 years by the
United States Centers for Disease Control and Prevention (CDC).  It
collects extensive demographic and health-related data on a
representative sample of the US population. We will work with a subset
of the measures that NHANES collects, focusing on basic biometric
features.


First we import the libraries that we will be using.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt

Now we load the NHANES data from a file.


In [2]:
df = pd.read_csv("./nhanes.csv.gz")

Many biological processes behave differently during development
compared to in adulthood.  For this analysis, we will focus on
people of age 18 or greater (RIDAGEYR contains each subject's age in
years).

In [3]:
df = df.loc[df["RIDAGEYR"] >= 18]

We want to treat the gender variable as categorical, so we indicate that here:

In [4]:
df["RIAGENDR"] = df["RIAGENDR"].replace([1.0, 2.0], ["Male","Female"]).astype("category")

Here are the columns from this data set:

In [5]:
df.columns

Index(['SEQN', 'RIDAGEYR', 'RIAGENDR', 'BMXWT', 'BMXHT', 'BMXBMI', 'BPXSY1',
       'BPXSY2', 'BPXSY3'],
      dtype='object')

## Question 1

### Question 1.1

The `BPXSY1`, `BPXSY2`, and `BPXSY3` columns in the data are repeated systolic blood pressure measurements (the "120" when your doctor says your blood pressure is "120/80"). Numbers higher than 120 indicate various degrees of "hypertension" (i.e., high blood pressure).

Create a new column `BPXSY` in the `df` table that is the mean of the three measurements for each person (i.e., add the three values together and divide by 3 for each person).


In [7]:
# Row-wise mean of the three readings; pandas skips missing values by default
df['BPXSY'] = df[['BPXSY1', 'BPXSY2', 'BPXSY3']].mean(axis=1)

Plot the marginal distribution of this measurement.

Comment on the **location**, **spread**, and **skew** in this plot. You only need to give broad descriptions (what is a typical value? where are most of the values? is there any skew?).



### Question 1.2

Create a plot with two boxplots that show the **conditional distribution** of `BPXSY` for the two genders ("Male" and "Female") in the `RIAGENDR` column.

Based on this plot, would you say there is much of a difference between men and women on BPXSY? Justify your answer based on both location and spread observed in the plot.




### Question 1.3

Calculate the **effect size** of the difference between men and women on BPXSY. As a reminder, you will need to compute the **pooled standard deviation**:

$$S_p = \sqrt{\frac{n_1 S_1^2 + n_2 S_2^2}{n_1 + n_2}}$$

where $S_g^2$ is the variance of group $g$ and $n_g$ is the size of group $g$.

Please note: the class notes suggest using `df.groupby("GROUP").size()` to get the number of units in each group. Because `size()` counts every row, including rows where the variable of interest is missing, this turns out to be bad advice. A better strategy is to focus on the variable of interest with `x = df.groupby("GROUP")["VAR"]` and then use `x.count()` to get the number of non-missing values in each group.
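To make the difference between `size()` and `count()` concrete, here is a minimal sketch on a made-up table (the `toy` frame and its values are invented for illustration and are not NHANES data). It also walks through a pooled standard deviation that weights each group variance by its group size and divides by the total count across both groups, then divides the difference in means by that value. It assumes the pandas default sample variance (`ddof=1`) for each group.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration); one measurement is missing in group "b".
toy = pd.DataFrame({
    "GROUP": ["a", "a", "a", "b", "b", "b"],
    "VAR": [1.0, 2.0, 3.0, 2.0, 4.0, np.nan],
})

grouped = toy.groupby("GROUP")["VAR"]
n_rows = toy.groupby("GROUP").size()  # counts rows, including the missing value
n = grouped.count()                   # counts only non-missing values
means = grouped.mean()                # missing values are skipped
variances = grouped.var()             # sample variance (ddof=1), missing values skipped

# Pooled standard deviation: group variances weighted by group sizes,
# divided by the total number of non-missing observations.
pooled_sd = np.sqrt(
    (n["a"] * variances["a"] + n["b"] * variances["b"]) / (n["a"] + n["b"])
)

# Effect size: difference in group means in units of the pooled SD.
effect_size = (means["a"] - means["b"]) / pooled_sd

print(n_rows.to_dict(), n.to_dict())
print(pooled_sd, effect_size)
```

Here `size()` reports three rows for group "b" while `count()` reports two usable values, which is the number that belongs in the formula.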

Based on the following lists of effect sizes, how would you categorize the difference in BP between men and women?

Effect size magnitude (ignore +/-)
* Very small: 0 - 0.1
* Small: 0.1 - 0.2
* Medium: 0.2 - 0.5
* Large: 0.5 - 0.8
* Very large: 0.8 - 1.2
* Huge: 1.2 - 2.0+




## Question 2

Let's investigate the relationship between body-mass-index (BMI) and blood pressure.

### Question 2.1

Make a `regplot` of these two measurements with "BMXBMI" on the x-axis. The `marker = '.'` option makes the plot a little cleaner and the `line_kws = {"color": "orange"}` makes the line easier to see.

Given what you see in this plot, how would you categorize the **direction** of the relationship between these variables?





### Question 2.2

To get a better understanding of the relationship in these data, compute **Z-scores** for both BMI and BP. Plot them again using a scatter plot (`marker = '.'`), and add vertical and horizontal lines at 0 using

```
plt.axhline(0, color = "black")
plt.axvline(0, color = "black")
```

Based on the number of observations in each quadrant, would you say that more points are in the upper-right and lower-left quadrants or in the upper-left and lower-right quadrants?
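As a rough sketch of the Z-score step, the example below standardizes two made-up columns (`x` and `y` are placeholder names, not NHANES variables) and counts how many points have Z-scores with the same sign versus opposite signs. Points whose two Z-scores share a sign fall in the upper-right or lower-left quadrant; points with opposite signs fall in the other two. It assumes the sample standard deviation that pandas uses by default.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 1.0, 4.0, 5.0]})

# Z-score: subtract each column's mean and divide by its standard deviation.
z = (toy - toy.mean()) / toy.std()

# A positive product means both Z-scores have the same sign
# (upper-right or lower-left quadrant); a negative product means opposite signs.
product = z["x"] * z["y"]
same_sign = int((product > 0).sum())
opposite_sign = int((product < 0).sum())

print(z)
print(same_sign, opposite_sign)
```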



### Question 2.3

Calculate the Pearson correlation coefficient for these two measures. You can use a built-in method for DataFrames.

What does this value tell us? Is it consistent with the graphs in the previous two parts? Why or why not?

### Question 2.4

We have observed that both BMI and blood pressure exhibit some right skew. Transform both using the reciprocal, square root, and log, and then compute the correlation matrix for all combinations of the two sets of transformed variables. For the pair with the strongest linear relationship, plot the two transformed variables and compute their correlation.
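One possible way to organize this, sketched on made-up data, is to collect every transformed version of each variable in a single table and read the cross-correlations from one call to `corr()`. The columns `a` and `b` and the suffixed names are hypothetical; the toy values are chosen so that `b` is the square of `a`, which makes the log-log pair perfectly linear.

```python
import numpy as np
import pandas as pd

# Toy positive-valued data, invented for illustration (b = a ** 2).
toy = pd.DataFrame({"a": [1.0, 2.0, 4.0, 8.0], "b": [1.0, 4.0, 16.0, 64.0]})

# One column per (variable, transformation) pair.
transformed = pd.DataFrame({
    "a_recip": 1 / toy["a"],
    "a_sqrt": np.sqrt(toy["a"]),
    "a_log": np.log(toy["a"]),
    "b_recip": 1 / toy["b"],
    "b_sqrt": np.sqrt(toy["b"]),
    "b_log": np.log(toy["b"]),
})

# Full correlation matrix, then keep only the a-versus-b block.
corr = transformed.corr()
cross = corr.loc[["a_recip", "a_sqrt", "a_log"], ["b_recip", "b_sqrt", "b_log"]]

print(cross)
```

Restricting to the cross block avoids comparing a variable with its own transformations, which are not the relationships of interest here.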

## Question 3

### Question 3.1

Compute and save into a variable the correlation of `BMXBMI` and `BPXSY`.

Compute and save into variables the means and standard deviations of these variables.

Using these values compute the slope and intercept of the regression line that relates blood pressure and BMI. For this analysis, consider BMI as the "X" variable (sometimes called the independent variable) and blood pressure as the "Y" variable (sometimes called the dependent variable).
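The standard relationship is that the slope equals the correlation times the ratio of the Y standard deviation to the X standard deviation, and the intercept is the Y mean minus the slope times the X mean. A minimal sketch on made-up data (the `toy` frame and its column names are placeholders), cross-checked against `np.polyfit`:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0], "y": [2.0, 4.0, 5.0, 4.0, 5.0]})

r = toy["x"].corr(toy["y"])                      # Pearson correlation
x_mean, y_mean = toy["x"].mean(), toy["y"].mean()
x_sd, y_sd = toy["x"].std(), toy["y"].std()      # sample standard deviations

slope = r * y_sd / x_sd
intercept = y_mean - slope * x_mean

# Cross-check against a least-squares fit of a degree-1 polynomial.
check_slope, check_intercept = np.polyfit(toy["x"], toy["y"], deg=1)

print(slope, intercept)
print(check_slope, check_intercept)
```

Both standard deviations must use the same convention (here the pandas default, `ddof=1`) so that the ratio is consistent.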

### Question 3.2

Using the slope and intercept from the previous part, compute the following:

- The average blood pressure for people with a BMI of 20.
- The average blood pressure for people with a BMI of 30.
- The difference in average blood pressures for people with BMI values of 18 and BMI values of 21.

### Question 3.3

Create a scatter plot of the data with the regression line overlaid. Do not use the `regplot` or `lmplot` functions. You may use `plt.axline` to add the regression line.


### Question 3.4

Use the `statsmodels` library to fit a linear regression model to the data using the `sm.OLS` function. Print the summary of the model and verify that your calculations from the previous parts are correct.