In [1]:
import numpy as np
import pandas as pd


# Importing Data

## Cholesterol data
Reading in cholesterol data for 2017-2018

In [2]:
df = pd.read_sas('https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/TCHOL_J.XPT')

In [3]:
df.head()

Unnamed: 0,SEQN,LBXTC,LBDTCSI
0,93705.0,157.0,4.06
1,93706.0,148.0,3.83
2,93707.0,189.0,4.89
3,93708.0,209.0,5.4
4,93709.0,176.0,4.55


SEQN is the participant sequence number, which will be used to join tables.  LBXTC is our target variable, total cholesterol measured in mg/dL.  LBDTCSI is total cholesterol measured in mmol/L.
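
Both cholesterol columns hold the same measurement in different units, so only one is needed. As a quick sanity check, here is a minimal sketch that reproduces the mmol/L values in the preview above from the mg/dL values. The factor 0.02586 is an assumption: it is the commonly cited cholesterol conversion, not something taken from this notebook.

```python
import pandas as pd

# Assumed conversion factor for cholesterol: 1 mg/dL is about 0.02586 mmol/L.
MGDL_TO_MMOLL = 0.02586

# The five rows shown in the df.head() preview above.
preview = pd.DataFrame({
    "LBXTC": [157.0, 148.0, 189.0, 209.0, 176.0],
    "LBDTCSI": [4.06, 3.83, 4.89, 5.40, 4.55],
})

# Convert mg/dL to mmol/L and round to the two decimals used in the file.
converted = (preview["LBXTC"] * MGDL_TO_MMOLL).round(2)

# Largest disagreement between the converted and the reported values.
max_abs_diff = (converted - preview["LBDTCSI"]).abs().max()
print(max_abs_diff)
```

If the difference stays below rounding error, LBDTCSI carries no extra information and can be dropped safely.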

Dropping LBDTCSI, then dropping any rows with null values, since both remaining columns (SEQN and LBXTC) are needed.

In [4]:
df.drop('LBDTCSI',axis=1,inplace=True)
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6738 entries, 0 to 7434
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   SEQN    6738 non-null   float64
 1   LBXTC   6738 non-null   float64
dtypes: float64(2)
memory usage: 157.9 KB


With fewer than 7,000 observations, we may want to gather data from the previous survey cycle, especially since we are planning to use neural networks.  For now, we continue importing the data.
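
If more data is added later, one possible approach is to stack the tables from several cycles and tag each row with its cycle. The sketch below uses a hypothetical helper, `stack_cycles`, and toy frames in place of real NHANES rows. In the notebook the inputs would come from `pd.read_sas` on each cycle's file.

```python
import pandas as pd

def stack_cycles(frames_by_cycle):
    """Concatenate per-cycle tables, adding a 'cycle' column to each.

    Hypothetical helper for illustration, not part of the notebook.
    """
    tagged = [
        frame.assign(cycle=label)
        for label, frame in frames_by_cycle.items()
    ]
    return pd.concat(tagged, ignore_index=True)

# Toy stand-ins for two cycles (made-up values, not real NHANES data).
current = pd.DataFrame({"SEQN": [1.0, 2.0], "LBXTC": [157.0, 148.0]})
previous = pd.DataFrame({"SEQN": [3.0], "LBXTC": [201.0]})

combined = stack_cycles({"2017-2018": current, "2015-2016": previous})
print(combined)
```

Keeping the cycle label makes it possible to check later whether the two cycles differ systematically.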

## Demographic data

In [5]:
demo = pd.read_sas('https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT')

In [6]:
columns = ['SEQN','RIAGENDR','RIDAGEYR','RIDRETH3']
df = df.merge(demo[columns],on='SEQN',how='left')
df.head()

Unnamed: 0,SEQN,LBXTC,RIAGENDR,RIDAGEYR,RIDRETH3
0,93705.0,157.0,2.0,66.0,4.0
1,93706.0,148.0,1.0,18.0,6.0
2,93707.0,189.0,1.0,13.0,7.0
3,93708.0,209.0,2.0,66.0,6.0
4,93709.0,176.0,2.0,75.0,4.0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6738 entries, 0 to 6737
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   SEQN      6738 non-null   float64
 1   LBXTC     6738 non-null   float64
 2   RIAGENDR  6738 non-null   float64
 3   RIDAGEYR  6738 non-null   float64
 4   RIDRETH3  6738 non-null   float64
dtypes: float64(5)
memory usage: 315.8 KB
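
The row count is unchanged after the merge (6738), which is what a left join on a unique key should give. A duplicated SEQN in the right-hand table would add rows silently, so a guarded merge can be useful. The sketch below uses the `validate` and `indicator` options of `DataFrame.merge` on toy frames that are made up for illustration.

```python
import pandas as pd

# Toy frames (made-up values): one participant has no demographic row.
chol = pd.DataFrame({"SEQN": [1.0, 2.0, 3.0], "LBXTC": [157.0, 148.0, 189.0]})
demo_toy = pd.DataFrame({"SEQN": [1.0, 2.0], "RIDAGEYR": [66.0, 18.0]})

merged = chol.merge(
    demo_toy,
    on="SEQN",
    how="left",
    validate="one_to_one",  # raises MergeError if SEQN repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Rows from the left table that found no match on the right.
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```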


Since we only want participants aged 20 and over, we filter on age before adding more tables.

In [8]:
df = df[df.RIDAGEYR>=20].copy()

## Body Mass Index

In [9]:
body_df = pd.read_sas('https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/BMX_J.XPT')

In [10]:
columns = ['SEQN','BMXBMI']
df = df.merge(body_df[columns],on='SEQN',how='left')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4937 entries, 0 to 4936
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   SEQN      4937 non-null   float64
 1   LBXTC     4937 non-null   float64
 2   RIAGENDR  4937 non-null   float64
 3   RIDAGEYR  4937 non-null   float64
 4   RIDRETH3  4937 non-null   float64
 5   BMXBMI    4860 non-null   float64
dtypes: float64(6)
memory usage: 270.0 KB
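
The summary above shows 4860 non-null BMXBMI values out of 4937 rows, so 77 participants have no BMI measurement. Here is a small sketch of how the missing count and share could be checked for any column, using toy data rather than the real table.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with two missing BMI entries.
toy = pd.DataFrame({
    "SEQN": [1.0, 2.0, 3.0, 4.0],
    "BMXBMI": [31.7, np.nan, 38.9, np.nan],
})

# Number and fraction of rows with no BMI value.
n_missing = int(toy["BMXBMI"].isna().sum())
share_missing = toy["BMXBMI"].isna().mean()
print(n_missing, share_missing)
```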


In [11]:
df.head()

Unnamed: 0,SEQN,LBXTC,RIAGENDR,RIDAGEYR,RIDRETH3,BMXBMI
0,93705.0,157.0,2.0,66.0,4.0,31.7
1,93708.0,209.0,2.0,66.0,6.0,23.7
2,93709.0,176.0,2.0,75.0,4.0,38.9
3,93711.0,238.0,1.0,56.0,6.0,21.3
4,93713.0,184.0,1.0,67.0,3.0,23.5


Still needed: pulse, blood pressure, nutrition, exercise, smoking, and drinking.
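
Since each remaining table follows the same pattern (read, select columns, left-join on SEQN), the repeated step could be wrapped in a small function. This is only a sketch: `add_columns` is a hypothetical helper, and `PULSE_TOY` and `OTHER_TOY` are placeholder column names, not real NHANES variables.

```python
import pandas as pd

def add_columns(base, table, columns):
    """Left-join the selected columns of `table` onto `base` by SEQN.

    Hypothetical helper for illustration, not part of the notebook.
    """
    keep = ["SEQN"] + [c for c in columns if c != "SEQN"]
    return base.merge(table[keep], on="SEQN", how="left")

# Toy frames (made-up values and placeholder column names).
base = pd.DataFrame({"SEQN": [1.0, 2.0], "LBXTC": [157.0, 209.0]})
pulse_toy = pd.DataFrame({
    "SEQN": [1.0],
    "PULSE_TOY": [72.0],
    "OTHER_TOY": [0.0],
})

result = add_columns(base, pulse_toy, ["PULSE_TOY"])
print(result)
```

Participants missing from the added table keep their row and receive NaN in the new columns, matching the behaviour of the merges above.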