# Case Study: NHANES Dataset

This notebook analyzes the NHANES dataset, focusing on **univariate analysis**.

The [NHANES Dataset](https://wwwn.cdc.gov/nchs/nhanes/Default.aspx) contains the results of the National Health and Nutrition Examination Survey, run by the CDC (USA). Most of its variables are stored under short coded names (for example, `DMDEDUC2`), so the codebooks are needed to interpret them.
The codebooks for the 2015-2016 wave of NHANES can be found here:

https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2015

Direct links:

- [Demographics code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm)
- [Body measures code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm)
- [Blood pressure code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.htm)
- [Alcohol questionnaire code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/ALQ_I.htm)
- [Smoking questionnaire code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/SMQ_I.htm)

Most of the commands in this notebook were written while following a Coursera lab notebook on the same topic.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

In [2]:
da = pd.read_csv("nhanes_2015_2016.csv")

### Categorical Data: Frequency Tables

Example variables:
- [DMDEDUC2](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDEDUC2): education level in adults.
- RIAGENDR: gender.

In [15]:
# Count the number of observations in each category level
da.DMDEDUC2.value_counts()

4.0    1621
5.0    1366
3.0    1186
1.0     655
2.0     643
9.0       3
Name: DMDEDUC2, dtype: int64

In [4]:
da.DMDEDUC2.value_counts().sum()

5474

In [5]:
da.DMDEDUC2.shape

(5735,)

The two numbers differ (5474 counted values vs. 5735 rows) because `value_counts()` skips missing values by default.

In [6]:
pd.isnull(da.DMDEDUC2).sum()

261

In [8]:
# Check that the numbers add up: total - non-null - null = 0
da.DMDEDUC2.shape[0] - da.DMDEDUC2.value_counts().sum() - pd.isnull(da.DMDEDUC2).sum()

0
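The same bookkeeping can be done in one step by asking `value_counts()` to keep missing values. The following is a minimal sketch on a small made-up Series (the name `educ` and its values are hypothetical stand-ins for `da.DMDEDUC2`, not NHANES data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for da.DMDEDUC2 (not NHANES values)
educ = pd.Series([4.0, 5.0, 4.0, np.nan, 3.0, np.nan])

# dropna=False gives NaN its own row in the frequency table
counts = educ.value_counts(dropna=False)
n_missing = int(educ.isna().sum())

print(counts)
print("missing:", n_missing)
# With NaN included, the counts sum to the full length of the Series
print("counts cover all rows:", counts.sum() == len(educ))
```

With `dropna=False`, the frequency table accounts for every row, so no separate null check is needed.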

In [12]:
# Create a new variable (original name plus an "x" suffix)
# that holds human-readable labels for the category levels
da["DMDEDUC2x"] = da.DMDEDUC2.replace({1: "<9", 2: "9-11", 3: "HS/GED", 4: "Some college/AA", 5: "College",
                                       7: "Refused", 9: "Don't know"})
da.DMDEDUC2x.value_counts()

Some college/AA    1621
College            1366
HS/GED             1186
<9                  655
9-11                643
Don't know            3
Name: DMDEDUC2x, dtype: int64

In [13]:
# Similar for gender
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})

In [14]:
# Proportions
x = da.DMDEDUC2x.value_counts()
x / x.sum()

Some college/AA    0.296127
College            0.249543
HS/GED             0.216661
<9                 0.119657
9-11               0.117464
Don't know         0.000548
Name: DMDEDUC2x, dtype: float64
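Dividing the counts by their sum can also be written with the `normalize=True` argument of `value_counts()`. A small sketch, assuming a made-up Series of labels (`labels` is hypothetical, not the NHANES column):

```python
import pandas as pd

# Hypothetical labels standing in for da.DMDEDUC2x
labels = pd.Series(["College", "HS/GED", "College", "<9"])

# Manual version: counts divided by their total
counts = labels.value_counts()
manual = counts / counts.sum()

# Shortcut: let pandas return proportions directly
shortcut = labels.value_counts(normalize=True)

print(shortcut)
# Both approaches give the same proportions for every label
print((manual - shortcut).abs().max() < 1e-12)
```

Both versions exclude missing values unless `dropna=False` is also passed.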

In [17]:
# Create a category level for missing cases (null)
# Sometimes we want to do that instead of eliminating/ignoring those cases
da["DMDEDUC2x"] = da.DMDEDUC2x.fillna("Missing")
x = da.DMDEDUC2x.value_counts()
x / x.sum()

Some college/AA    0.282650
College            0.238187
HS/GED             0.206800
<9                 0.114211
9-11               0.112119
Missing            0.045510
Don't know         0.000523
Name: DMDEDUC2x, dtype: float64

### Quantitative Variables: Numerical Summaries

Example variables:
- [BMXWT](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm#BMXWT): weight (kg).

In [19]:
# describe() provides numerical summaries of quantitative data
# Note: describe() already skips missing values; dropna() only makes that explicit
da.BMXWT.dropna().describe()

count    5666.000000
mean       81.342676
std        21.764409
min        32.400000
25%        65.900000
50%        78.200000
75%        92.700000
max       198.900000
Name: BMXWT, dtype: float64
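As a quick check of how `describe()` treats missing values, the sketch below compares the summary with and without `dropna()` on a small made-up Series (`weights` is hypothetical toy data, not `da.BMXWT`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy weights in kg (not NHANES data), with one missing value
weights = pd.Series([60.0, 70.0, np.nan, 80.0, 90.0])

with_nan = weights.describe()              # NaN is excluded from the summary
without_nan = weights.dropna().describe()  # explicit removal first

print(with_nan["count"])                   # number of non-missing values
print(np.allclose(with_nan, without_nan))  # the two summaries agree
```

The `count` row reports only non-missing observations, which is why it can be smaller than the number of rows in the data frame.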

In [20]:
# Manual summaries
x = da.BMXWT.dropna() # Extract all non-missing values of BMXWT into a variable called 'x'
print(x.mean()) # Pandas method
print(np.mean(x)) # Numpy function
print(x.median())
print(np.percentile(x, 50)) # 50th percentile, same as the median
print(np.percentile(x, 75)) # 75th percentile
print(x.quantile(0.75)) # Pandas method for quantiles, equivalent to 75th percentile

81.34267560889509
81.34267560889509
78.2
78.2
92.7
92.7
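The pandas and NumPy calls above agree because the missing values were dropped first. The sketch below, again on hypothetical toy data rather than `da.BMXWT`, illustrates how the two libraries differ when `NaN` is left in, and computes the interquartile range as one more manual summary:

```python
import numpy as np
import pandas as pd

# Hypothetical toy weights in kg (not NHANES data), with one missing value
weights = pd.Series([60.0, 70.0, np.nan, 80.0, 90.0])

q75_pandas = weights.quantile(0.75)                     # pandas skips NaN
q75_nanaware = np.nanpercentile(weights.to_numpy(), 75) # NumPy, NaN-aware variant
q75_plain = np.percentile(weights.to_numpy(), 75)       # NaN propagates to the result

# Interquartile range: 75th percentile minus 25th percentile
iqr = weights.quantile(0.75) - weights.quantile(0.25)

print(q75_pandas, q75_nanaware, q75_plain, iqr)
```

This is why the cell above calls `dropna()` before handing the data to `np.percentile`.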
