# NHANES EDA on Health Disparities

<p style='text-align: center;'><i><b>Unraveling Health Disparities: A Data-Driven Exploration through Social Determinants of Health (SDOH) using NHANES (Demographic & Questionnaire) data</b></i></p>
<center><img src='https://github.com/kkrusere/NHANES-EDA-on-Health-Disparities-and-Inequities/blob/main/assets/nhanes_health_disparities.png?raw=true' width=600/></center>




# Table of Contents

1. [Introduction](#introduction)
   - [Overview of Health Disparities](#overview-of-health-disparities)
   - [Social Determinants of Health (SDoH)](#social-determinants-of-health-sdoh)
   - [NHANES Dataset](#nhanes-dataset)
   - [Project Goals and Objectives](#project-goals-and-objectives)

2. [Background and Context](#background-and-context)
   - [Understanding Health Disparities](#understanding-health-disparities)
   - [Role of Social Determinants](#role-of-social-determinants)
   - [Significance of NHANES Dataset](#significance-of-nhanes-dataset)

3. [Data Collection and Preparation](#data-collection-and-preparation)
   - [Introduction to NHANES-pyTOOL-API](#introduction-to-nhanes-pytool-api)
   - [NHANES Demographic Data](#nhanes-demographic-data)
   - [NHANES Questionnaire Data](#nhanes-questionnaire-data)
   - [Joining Demographic and Questionnaire Data](#joining-demographic-and-questionnaire-data)

4. [Exploratory Data Analysis (EDA)](#exploratory-data-analysis-eda)
   - [Descriptive Statistics](#descriptive-statistics)
   - [Demographic Analysis](#demographic-analysis)
   - [Analysis of Health Behaviors](#analysis-of-health-behaviors)
   - [Access to Healthcare](#access-to-healthcare)
   - [Social Determinants of Health Analysis](#social-determinants-of-health-analysis)

5. [Statistical Analysis](#statistical-analysis)
   - [Inferential Statistics](#inferential-statistics)
   - [Multivariate Analysis](#multivariate-analysis)
   - [Comparative Analysis](#comparative-analysis)
   - [Correlation Analysis](#correlation-analysis)
   - [Hypothesis Testing](#hypothesis-testing)

6. [Interpretation and Insights](#interpretation-and-insights)
   - [Key Findings from EDA](#key-findings-from-eda)
   - [Implications for Addressing Disparities](#implications-for-addressing-disparities)

7. [Conclusion](#conclusion)
   - [Project Summary](#project_summary)
   - [Future Directions](#future_directions)

8. [Appendix](#appendix)
   - [NHANES-pyTOOL-API Documentation](#api_documentation)
   - [Technical Details](#technical_details)

9. [References](#references)




## [Introduction](#introduction)

### [Overview of Health Disparities](#overview-of-health-disparities)


### [Social Determinants of Health (SDoH)](#social-determinants-of-health-sdoh)


### [NHANES Dataset](#nhanes-dataset)


### [Project Goals and Objectives](#project-goals-and-objectives)

## [Background and Context](#background-and-context)

### [Understanding Health Disparities](#understanding-health-disparities)

### [Role of Social Determinants](#role-of-social-determinants)

### [Significance of NHANES Dataset](#significance-of-nhanes-dataset)

## [Data Collection and Preparation](#data-collection-and-preparation)

### [Introduction to NHANES-pyTOOL-API](#introduction-to-nhanes-pytool-api)

In [1]:
#importing all the libraries that we are going to need for data collection, processing and cleaning
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests


from nhanes_data.nhanes_data_api import NHANESDataAPI

# import warnings
# warnings.filterwarnings('ignore')

In [2]:
#we are going to create an instance of the NHANES api

nhanes_data_api = NHANESDataAPI()

In [3]:
#we are going to list the survey cycles that have data available for this project
cycle_list = nhanes_data_api.list_cycle_years()
cycle_list

['1999-2000',
 '2001-2002',
 '2003-2004',
 '2005-2006',
 '2007-2008',
 '2009-2010',
 '2011-2012',
 '2013-2014',
 '2015-2016',
 '2017-2018']

In [4]:
#we are going to list the different data categories that we have access to via the NHANES API
data_category_list = nhanes_data_api.list_data_categories()
data_category_list

['demographics',
 'dietary',
 'examination',
 'laboratory',
 'questionnaire',
 'limitedaccess']

Note that, as explained in the introduction, this project uses only the `demographics` and `questionnaire` data categories. The remaining data categories will be used in later projects.

### [NHANES Demographic Data](#nhanes-demographic-data)

In [5]:
#we are going to list the datafile categories within the demographic category, as we know some of these data categories have multiple data files in them
demo_files = nhanes_data_api.list_file_names(data_category='demographics', cycle_years='1999-2018')
demo_files

['Demographic Variables & Sample Weights']

The demographics data category has only one data file, which simplifies the data retrieval process.

In [6]:
#we are now going to retrieve all the demographic data from 1999 to 2018
demo_dataframe = nhanes_data_api.retrieve_data(data_category='demographics', cycle='1999-2018', filename='Demographic Variables & Sample Weights')
demo_dataframe.head(5)

Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIDEXMON,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDAGEEX,RIDRETH1,RIDRETH2,...,DMDBORN4,AIALANGA,DMDHHSZA,DMDHHSZB,DMDHHSZE,DMDHRBR4,DMDHRAGZ,DMDHREDZ,DMDHRMAZ,DMDHSEDZ
0,1.0,1.0,2.0,2.0,2.0,2.0,29.0,31.0,4.0,2.0,...,,,,,,,,,,
1,2.0,1.0,2.0,2.0,1.0,77.0,926.0,926.0,3.0,1.0,...,,,,,,,,,,
2,3.0,1.0,2.0,1.0,2.0,10.0,125.0,126.0,3.0,1.0,...,,,,,,,,,,
3,4.0,1.0,2.0,2.0,1.0,1.0,22.0,23.0,4.0,2.0,...,,,,,,,,,,
4,5.0,1.0,2.0,2.0,1.0,49.0,597.0,597.0,3.0,1.0,...,,,,,,,,,,


In [7]:
#first we change the datatype of the `SEQN` column from float to string
#(going through int first, so that 1.0 becomes '1' rather than '1.0')
demo_dataframe['SEQN'] = demo_dataframe['SEQN'].astype(int).astype(str)
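
The two-step cast matters because converting a float column straight to `str` keeps the decimal part. A minimal sketch on a toy frame (the values below are made up for illustration and are not NHANES records) shows the effect:

```python
import pandas as pd

# Toy frame standing in for demo_dataframe; values are illustrative only.
toy = pd.DataFrame({"SEQN": [1.0, 2.0, 31127.0]})

# A direct float -> str cast would keep the trailing ".0".
direct = toy["SEQN"].astype(str).tolist()

# Casting to int first yields clean identifiers that work as join keys.
clean = toy["SEQN"].astype(int).astype(str).tolist()

print(direct)
print(clean)
```

This only works when the column has no missing values; `astype(int)` raises an error on NaN, which can serve as a cheap sanity check for an identifier column.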

In [8]:
demo_dataframe.head(5)

Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIDEXMON,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDAGEEX,RIDRETH1,RIDRETH2,...,DMDBORN4,AIALANGA,DMDHHSZA,DMDHHSZB,DMDHHSZE,DMDHRBR4,DMDHRAGZ,DMDHREDZ,DMDHRMAZ,DMDHSEDZ
0,1,1.0,2.0,2.0,2.0,2.0,29.0,31.0,4.0,2.0,...,,,,,,,,,,
1,2,1.0,2.0,2.0,1.0,77.0,926.0,926.0,3.0,1.0,...,,,,,,,,,,
2,3,1.0,2.0,1.0,2.0,10.0,125.0,126.0,3.0,1.0,...,,,,,,,,,,
3,4,1.0,2.0,2.0,1.0,1.0,22.0,23.0,4.0,2.0,...,,,,,,,,,,
4,5,1.0,2.0,2.0,1.0,49.0,597.0,597.0,3.0,1.0,...,,,,,,,,,,


In [9]:
#we need to take a look at what all these columns mean
demo_list_of_variable = list(demo_dataframe.columns)
demo_list_of_variable


['SEQN',
 'SDDSRVYR',
 'RIDSTATR',
 'RIDEXMON',
 'RIAGENDR',
 'RIDAGEYR',
 'RIDAGEMN',
 'RIDAGEEX',
 'RIDRETH1',
 'RIDRETH2',
 'DMQMILIT',
 'DMDBORN',
 'DMDCITZN',
 'DMDYRSUS',
 'DMDEDUC3',
 'DMDEDUC2',
 'DMDEDUC',
 'DMDSCHOL',
 'DMDMARTL',
 'DMDHHSIZ',
 'INDHHINC',
 'INDFMINC',
 'INDFMPIR',
 'RIDEXPRG',
 'RIDPREG',
 'DMDHRGND',
 'DMDHRAGE',
 'DMDHRBRN',
 'DMDHREDU',
 'DMDHRMAR',
 'DMDHSEDU',
 'WTINT2YR',
 'WTINT4YR',
 'WTMEC2YR',
 'WTMEC4YR',
 'SDMVPSU',
 'SDMVSTRA',
 'SDJ1REPN',
 'DMAETHN',
 'DMARACE',
 'WTMREP01',
 'WTMREP02',
 'WTMREP03',
 'WTMREP04',
 'WTMREP05',
 'WTMREP06',
 'WTMREP07',
 'WTMREP08',
 'WTMREP09',
 'WTMREP10',
 'WTMREP11',
 'WTMREP12',
 'WTMREP13',
 'WTMREP14',
 'WTMREP15',
 'WTMREP16',
 'WTMREP17',
 'WTMREP18',
 'WTMREP19',
 'WTMREP20',
 'WTMREP21',
 'WTMREP22',
 'WTMREP23',
 'WTMREP24',
 'WTMREP25',
 'WTMREP26',
 'WTMREP27',
 'WTMREP28',
 'WTMREP29',
 'WTMREP30',
 'WTMREP31',
 'WTMREP32',
 'WTMREP33',
 'WTMREP34',
 'WTMREP35',
 'WTMREP36',
 'WTMREP37',
 'WTMREP3

In [27]:
#we are going to create a function that allows us to retrieve the variable documentation
def get_variable_df(url, cycle_list=cycle_list):
    """
    Read the NHANES variable list at `url` into a data frame.

    pandas.read_html() reads the tables on the page; the first one holds the variable list.
    A "Years" column matching the survey cycle labels (e.g. "2005-2006") is added,
    columns that are not needed are dropped, and the rows are filtered to the
    cycles in `cycle_list`.
    Returns the cleaned data frame of variables.
    """
    dfs = pd.read_html(url)
    df = dfs[0]  # the table of interest is at index 0

    # build the cycle label in one vectorized step; writing strings row by row
    # into an integer column is what pandas warns about
    df["Years"] = df["Begin Year"].astype(str) + "-" + df["EndYear"].astype(str)

    df = df.drop(columns=["Begin Year", "EndYear", "Component", "Use Constraints"])
    df = df.loc[df["Years"].isin(cycle_list)].reset_index(drop=True)

    return df

In [28]:
#we are going to retrieve the variable table 
#Demography variable URL
demographics_url = "https://wwwn.cdc.gov/nchs/nhanes/search/variablelist.aspx?Component=demographics"

In [29]:
#we call the get_variable_df function for the demographics URL
demographics_var_df = get_variable_df(demographics_url)
demographics_var_df.head()



Unnamed: 0,Variable Name,Variable Description,Data File Name,Data File Description,Years
0,AIALANG,Language of the MEC ACASI Interview Instrument,DEMO_D,Demographic Variables & Sample Weights,2005-2006
1,DMDBORN,In what country {were you/was SP} born?,DEMO_D,Demographic Variables & Sample Weights,2005-2006
2,DMDCITZN,{Are you/Is SP} a citizen of the United States...,DEMO_D,Demographic Variables & Sample Weights,2005-2006
3,DMDEDUC2,(SP Interview Version) What is the highest gra...,DEMO_D,Demographic Variables & Sample Weights,2005-2006
4,DMDEDUC3,(SP Interview Version) What is the highest gra...,DEMO_D,Demographic Variables & Sample Weights,2005-2006


In [31]:
Variable_Description_my_dict = dict(zip(demographics_var_df['Variable Name'], demographics_var_df['Variable Description']))
Variable_Description_my_dict

{'AIALANG': 'Language of the MEC ACASI Interview Instrument',
 'DMDBORN': 'In what country {were you/was SP} born?',
 'DMDCITZN': '{Are you/Is SP} a citizen of the United States? [Information about citizenship is being collected by the U.S. Public Health Service to perform health related research. Providing this information is voluntary and is collected under the authority of the Public Health Service Act. There will be no effect on pending immigration or citizenship petitions.]',
 'DMDEDUC2': 'What is the highest grade or level of school {you have/SP has} completed or the highest degree {you have/s/he has} received?',
 'DMDEDUC3': 'What is the highest grade or level of school {you have/SP has} completed or the highest degree {you have/s/he has} received?',
 'DMDFMSIZ': 'Total number of people in the Family',
 'DMDHHSIZ': 'Total number of people in the Household',
 'DMDHRAGE': "HH reference person's age in years",
 'DMDHRBRN': 'In what country {were you/was NON-SP Head} born?',
 'DMDHR
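
Because `dict(zip(...))` keeps only the last description seen for a variable name, the dictionary holds one entry per variable across all cycles. A small sketch of how such a dictionary can be used to label columns and flag undocumented ones follows; `describe_columns` is a hypothetical helper, and the dictionary here is a two-entry excerpt rather than the full mapping:

```python
# Two entries copied from the description dictionary shown above.
descriptions = {
    "DMDFMSIZ": "Total number of people in the Family",
    "DMDHHSIZ": "Total number of people in the Household",
}


def describe_columns(columns, descriptions):
    """Pair each column name with its description, or None when undocumented."""
    return {col: descriptions.get(col) for col in columns}


labelled = describe_columns(["SEQN", "DMDHHSIZ", "DMDFMSIZ"], descriptions)

# Columns with no entry in the excerpt; with the full dictionary this list
# would show which data frame columns still need a manual lookup.
undocumented = [col for col, text in labelled.items() if text is None]
print(undocumented)
```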

In [32]:
#we are going to get rid of the variables below, as they are of no use to our project:

variables_to_be_removed = {'SDJ1REPN': 'Jack Knife Replicate Number',
 'WTIREP01': 'Interview Weight Jack Knife Replicate 01',
 'WTIREP02': 'Interview Weight Jack Knife Replicate 02',
 'WTIREP03': 'Interview Weight Jack Knife Replicate 03',
 'WTIREP04': 'Interview Weight Jack Knife Replicate 04',
 'WTIREP05': 'Interview Weight Jack Knife Replicate 05',
 'WTIREP06': 'Interview Weight Jack Knife Replicate 06',
 'WTIREP07': 'Interview Weight Jack Knife Replicate 07',
 'WTIREP08': 'Interview Weight Jack Knife Replicate 08',
 'WTIREP09': 'Interview Weight Jack Knife Replicate 09',
 'WTIREP10': 'Interview Weight Jack Knife Replicate 10',
 'WTIREP11': 'Interview Weight Jack Knife Replicate 11',
 'WTIREP12': 'Interview Weight Jack Knife Replicate 12',
 'WTIREP13': 'Interview Weight Jack Knife Replicate 13',
 'WTIREP14': 'Interview Weight Jack Knife Replicate 14',
 'WTIREP15': 'Interview Weight Jack Knife Replicate 15',
 'WTIREP16': 'Interview Weight Jack Knife Replicate 16',
 'WTIREP17': 'Interview Weight Jack Knife Replicate 17',
 'WTIREP18': 'Interview Weight Jack Knife Replicate 18',
 'WTIREP19': 'Interview Weight Jack Knife Replicate 19',
 'WTIREP20': 'Interview Weight Jack Knife Replicate 20',
 'WTIREP21': 'Interview Weight Jack Knife Replicate 21',
 'WTIREP22': 'Interview Weight Jack Knife Replicate 22',
 'WTIREP23': 'Interview Weight Jack Knife Replicate 23',
 'WTIREP24': 'Interview Weight Jack Knife Replicate 24',
 'WTIREP25': 'Interview Weight Jack Knife Replicate 25',
 'WTIREP26': 'Interview Weight Jack Knife Replicate 26',
 'WTIREP27': 'Interview Weight Jack Knife Replicate 27',
 'WTIREP28': 'Interview Weight Jack Knife Replicate 28',
 'WTIREP29': 'Interview Weight Jack Knife Replicate 29',
 'WTIREP30': 'Interview Weight Jack Knife Replicate 30',
 'WTIREP31': 'Interview Weight Jack Knife Replicate 31',
 'WTIREP32': 'Interview Weight Jack Knife Replicate 32',
 'WTIREP33': 'Interview Weight Jack Knife Replicate 33',
 'WTIREP34': 'Interview Weight Jack Knife Replicate 34',
 'WTIREP35': 'Interview Weight Jack Knife Replicate 35',
 'WTIREP36': 'Interview Weight Jack Knife Replicate 36',
 'WTIREP37': 'Interview Weight Jack Knife Replicate 37',
 'WTIREP38': 'Interview Weight Jack Knife Replicate 38',
 'WTIREP39': 'Interview Weight Jack Knife Replicate 39',
 'WTIREP40': 'Interview Weight Jack Knife Replicate 40',
 'WTIREP41': 'Interview Weight Jack Knife Replicate 41',
 'WTIREP42': 'Interview Weight Jack Knife Replicate 42',
 'WTIREP43': 'Interview Weight Jack Knife Replicate 43',
 'WTIREP44': 'Interview Weight Jack Knife Replicate 44',
 'WTIREP45': 'Interview Weight Jack Knife Replicate 45',
 'WTIREP46': 'Interview Weight Jack Knife Replicate 46',
 'WTIREP47': 'Interview Weight Jack Knife Replicate 47',
 'WTIREP48': 'Interview Weight Jack Knife Replicate 48',
 'WTIREP49': 'Interview Weight Jack Knife Replicate 49',
 'WTIREP50': 'Interview Weight Jack Knife Replicate 50',
 'WTIREP51': 'Interview Weight Jack Knife Replicate 51',
 'WTIREP52': 'Interview Weight Jack Knife Replicate 52',
 'WTMREP01': 'MEC Exam Weight Jack Knife Replicate 01',
 'WTMREP02': 'MEC Exam Weight Jack Knife Replicate 02',
 'WTMREP03': 'MEC Exam Weight Jack Knife Replicate 03',
 'WTMREP04': 'MEC Exam Weight Jack Knife Replicate 04',
 'WTMREP05': 'MEC Exam Weight Jack Knife Replicate 05',
 'WTMREP06': 'MEC Exam Weight Jack Knife Replicate 06',
 'WTMREP07': 'MEC Exam Weight Jack Knife Replicate 07',
 'WTMREP08': 'MEC Exam Weight Jack Knife Replicate 08',
 'WTMREP09': 'MEC Exam Weight Jack Knife Replicate 09',
 'WTMREP10': 'MEC Exam Weight Jack Knife Replicate 10',
 'WTMREP11': 'MEC Exam Weight Jack Knife Replicate 11',
 'WTMREP12': 'MEC Exam Weight Jack Knife Replicate 12',
 'WTMREP13': 'MEC Exam Weight Jack Knife Replicate 13',
 'WTMREP14': 'MEC Exam Weight Jack Knife Replicate 14',
 'WTMREP15': 'MEC Exam Weight Jack Knife Replicate 15',
 'WTMREP16': 'MEC Exam Weight Jack Knife Replicate 16',
 'WTMREP17': 'MEC Exam Weight Jack Knife Replicate 17',
 'WTMREP18': 'MEC Exam Weight Jack Knife Replicate 18',
 'WTMREP19': 'MEC Exam Weight Jack Knife Replicate 19',
 'WTMREP20': 'MEC Exam Weight Jack Knife Replicate 20',
 'WTMREP21': 'MEC Exam Weight Jack Knife Replicate 21',
 'WTMREP22': 'MEC Exam Weight Jack Knife Replicate 22',
 'WTMREP23': 'MEC Exam Weight Jack Knife Replicate 23',
 'WTMREP24': 'MEC Exam Weight Jack Knife Replicate 24',
 'WTMREP25': 'MEC Exam Weight Jack Knife Replicate 25',
 'WTMREP26': 'MEC Exam Weight Jack Knife Replicate 26',
 'WTMREP27': 'MEC Exam Weight Jack Knife Replicate 27',
 'WTMREP28': 'MEC Exam Weight Jack Knife Replicate 28',
 'WTMREP29': 'MEC Exam Weight Jack Knife Replicate 29',
 'WTMREP30': 'MEC Exam Weight Jack Knife Replicate 30',
 'WTMREP31': 'MEC Exam Weight Jack Knife Replicate 31',
 'WTMREP32': 'MEC Exam Weight Jack Knife Replicate 32',
 'WTMREP33': 'MEC Exam Weight Jack Knife Replicate 33',
 'WTMREP34': 'MEC Exam Weight Jack Knife Replicate 34',
 'WTMREP35': 'MEC Exam Weight Jack Knife Replicate 35',
 'WTMREP36': 'MEC Exam Weight Jack Knife Replicate 36',
 'WTMREP37': 'MEC Exam Weight Jack Knife Replicate 37',
 'WTMREP38': 'MEC Exam Weight Jack Knife Replicate 38',
 'WTMREP39': 'MEC Exam Weight Jack Knife Replicate 39',
 'WTMREP40': 'MEC Exam Weight Jack Knife Replicate 40',
 'WTMREP41': 'MEC Exam Weight Jack Knife Replicate 41',
 'WTMREP42': 'MEC Exam Weight Jack Knife Replicate 42',
 'WTMREP43': 'MEC Exam Weight Jack Knife Replicate 43',
 'WTMREP44': 'MEC Exam Weight Jack Knife Replicate 44',
 'WTMREP45': 'MEC Exam Weight Jack Knife Replicate 45',
 'WTMREP46': 'MEC Exam Weight Jack Knife Replicate 46',
 'WTMREP47': 'MEC Exam Weight Jack Knife Replicate 47',
 'WTMREP48': 'MEC Exam Weight Jack Knife Replicate 48',
 'WTMREP49': 'MEC Exam Weight Jack Knife Replicate 49',
 'WTMREP50': 'MEC Exam Weight Jack Knife Replicate 50',
 'WTMREP51': 'MEC Exam Weight Jack Knife Replicate 51',
 'WTMREP52': 'MEC Exam Weight Jack Knife Replicate 52',
 'AIALANGA': 'Language of the MEC ACASI Interview Instrument',}

list_of_variables_to_be_removed = list(variables_to_be_removed.keys())

In [33]:
list_of_variables_to_be_removed

['SDJ1REPN',
 'WTIREP01',
 'WTIREP02',
 'WTIREP03',
 'WTIREP04',
 'WTIREP05',
 'WTIREP06',
 'WTIREP07',
 'WTIREP08',
 'WTIREP09',
 'WTIREP10',
 'WTIREP11',
 'WTIREP12',
 'WTIREP13',
 'WTIREP14',
 'WTIREP15',
 'WTIREP16',
 'WTIREP17',
 'WTIREP18',
 'WTIREP19',
 'WTIREP20',
 'WTIREP21',
 'WTIREP22',
 'WTIREP23',
 'WTIREP24',
 'WTIREP25',
 'WTIREP26',
 'WTIREP27',
 'WTIREP28',
 'WTIREP29',
 'WTIREP30',
 'WTIREP31',
 'WTIREP32',
 'WTIREP33',
 'WTIREP34',
 'WTIREP35',
 'WTIREP36',
 'WTIREP37',
 'WTIREP38',
 'WTIREP39',
 'WTIREP40',
 'WTIREP41',
 'WTIREP42',
 'WTIREP43',
 'WTIREP44',
 'WTIREP45',
 'WTIREP46',
 'WTIREP47',
 'WTIREP48',
 'WTIREP49',
 'WTIREP50',
 'WTIREP51',
 'WTIREP52',
 'WTMREP01',
 'WTMREP02',
 'WTMREP03',
 'WTMREP04',
 'WTMREP05',
 'WTMREP06',
 'WTMREP07',
 'WTMREP08',
 'WTMREP09',
 'WTMREP10',
 'WTMREP11',
 'WTMREP12',
 'WTMREP13',
 'WTMREP14',
 'WTMREP15',
 'WTMREP16',
 'WTMREP17',
 'WTMREP18',
 'WTMREP19',
 'WTMREP20',
 'WTMREP21',
 'WTMREP22',
 'WTMREP23',
 'WTMREP24',

In [35]:
len(list(demo_dataframe.columns))

175

In [36]:
#we are going to drop the columns from our main demo_dataframe
demo_dataframe = demo_dataframe.drop(columns=list_of_variables_to_be_removed)

In [37]:
len(list(demo_dataframe.columns))

69
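
The drop removed 106 columns (175 − 69), and their names follow a regular pattern: one replicate number column, 52 interview replicate weights, 52 MEC exam replicate weights, and `AIALANGA`. As a sketch, the same list could be generated rather than typed out; `columns_to_drop` is an illustrative name, not one used elsewhere in this notebook:

```python
# Jackknife replicate weights are numbered 01..52 with a zero-padded suffix.
interview_reps = [f"WTIREP{i:02d}" for i in range(1, 53)]
mec_reps = [f"WTMREP{i:02d}" for i in range(1, 53)]

# Same names, in the same order, as the hand-written dictionary keys above.
columns_to_drop = ["SDJ1REPN", *interview_reps, *mec_reps, "AIALANGA"]

print(len(columns_to_drop))
print(columns_to_drop[:3], columns_to_drop[-2:])
```

Passing `errors="ignore"` to `DataFrame.drop` would additionally tolerate names that are absent from a given data frame.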

In [39]:
variable_list = [col for col in demo_dataframe.columns if col != 'year']
variable_list

['SEQN',
 'SDDSRVYR',
 'RIDSTATR',
 'RIDEXMON',
 'RIAGENDR',
 'RIDAGEYR',
 'RIDAGEMN',
 'RIDAGEEX',
 'RIDRETH1',
 'RIDRETH2',
 'DMQMILIT',
 'DMDBORN',
 'DMDCITZN',
 'DMDYRSUS',
 'DMDEDUC3',
 'DMDEDUC2',
 'DMDEDUC',
 'DMDSCHOL',
 'DMDMARTL',
 'DMDHHSIZ',
 'INDHHINC',
 'INDFMINC',
 'INDFMPIR',
 'RIDEXPRG',
 'RIDPREG',
 'DMDHRGND',
 'DMDHRAGE',
 'DMDHRBRN',
 'DMDHREDU',
 'DMDHRMAR',
 'DMDHSEDU',
 'WTINT2YR',
 'WTINT4YR',
 'WTMEC2YR',
 'WTMEC4YR',
 'SDMVPSU',
 'SDMVSTRA',
 'DMAETHN',
 'DMARACE',
 'SIALANG',
 'SIAPROXY',
 'SIAINTRP',
 'FIALANG',
 'FIAPROXY',
 'FIAINTRP',
 'MIALANG',
 'MIAPROXY',
 'MIAINTRP',
 'AIALANG',
 'DMDFMSIZ',
 'DMDBORN2',
 'INDHHIN2',
 'INDFMIN2',
 'DMDHRBR2',
 'RIDRETH3',
 'RIDEXAGY',
 'RIDEXAGM',
 'DMQMILIZ',
 'DMQADFC',
 'DMDBORN4',
 'DMDHHSZA',
 'DMDHHSZB',
 'DMDHHSZE',
 'DMDHRBR4',
 'DMDHRAGZ',
 'DMDHREDZ',
 'DMDHRMAZ',
 'DMDHSEDZ']

In [None]:
#we are going to retrieve the demographic data documentation, that way we can figure out what the variables/columns mean


varibale_code_table = dict()
variable_sas_label = dict()
variable_English_Text = dict()


for i, cycle_year in enumerate(cycle_list):

  #the data file is named DEMO for the first cycle, then DEMO_B, DEMO_C, ... for the later cycles
  temp = ''
  letter = chr(ord('A') + i)
  if letter != 'A':
      temp = f"_{letter}"

  data_File_Name = f"DEMO{temp}"

  print(f"datafile: {data_File_Name} -----> cycle: {cycle_year}")

  url = f"https://wwwn.cdc.gov/Nchs/Nhanes/{cycle_year}/{data_File_Name}.htm"
  req = requests.get(url)
  content = req.text
  soup = BeautifulSoup(content)

  #each variable's documentation block starts at a "pagebreak" div
  mydivs = soup.find_all("div", {"class": "pagebreak"})
  for div in mydivs:
    x = div.find_all_next()
    variable = x[0]["id"]
    if variable in variable_list:
      variable_sas_label[variable] = x[5].text
      #store the text itself rather than a one-element set
      variable_English_Text[variable] = x[7].text
      if div.find("table") is not None:
        table = pd.read_html(str(div.find('table')))[0]
        varibale_code_table[variable] = table
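
The file-naming rule used in the loop above can be isolated and checked without any network access. The sketch below assumes, as the loop does, that the suffix letter advances by one per cycle; `demo_doc_url` is a hypothetical helper name:

```python
cycle_list = ['1999-2000', '2001-2002', '2003-2004', '2005-2006', '2007-2008',
              '2009-2010', '2011-2012', '2013-2014', '2015-2016', '2017-2018']


def demo_doc_url(cycle_year, index):
    """Build the documentation URL for the demographics file of one cycle.

    `index` is the position of the cycle in cycle_list (0 for 1999-2000).
    """
    letter = chr(ord("A") + index)
    suffix = "" if letter == "A" else f"_{letter}"
    return f"https://wwwn.cdc.gov/Nchs/Nhanes/{cycle_year}/DEMO{suffix}.htm"


urls = [demo_doc_url(cycle, i) for i, cycle in enumerate(cycle_list)]
for url in urls:
    print(url)
```

Building the URLs up front makes it easy to inspect them before any request is sent.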

In [48]:
!pip install html5lib

Collecting html5lib
  Downloading html5lib-1.1-py2.py3-none-any.whl (112 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 112.2/112.2 kB 5.6 MB/s eta 0:00:00
Collecting webencodings (from html5lib)
  Downloading webencodings-0.5.1-py2.py3-none-any.whl (11 kB)
Installing collected packages: webencodings, html5lib
Successfully installed html5lib-1.1 webencodings-0.5.1


In [50]:

url = f"https://wwwn.cdc.gov/Nchs/Nhanes/{cycle_year}/{data_File_Name}.htm"
req = requests.get(url)
content = req.text
soup = BeautifulSoup(content, 'html5lib') 

mydivs = soup.find_all("div", {"class": "pagebreak"})
for div in mydivs:
    x = div.find_all_next()
    variable = x[0]["id"]
    if variable in variable_list:
        variable_sas_label[variable] = x[5].text
        #store the text itself rather than a one-element set
        variable_English_Text[variable] = x[7].text
        if div.find("table") is not None:
            #use pd.read_html with flavor='bs4' to read the HTML table
            table = pd.read_html(str(div.find('table')), flavor='bs4')[0]
            varibale_code_table[variable] = table

FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?
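
`FeatureNotFound` means Beautiful Soup could not find the requested parser in the running session. One likely cause here (an assumption, since the session state is not shown) is that `html5lib` was installed after `bs4` had already been imported; `bs4` registers its parsers at import time, so a kernel restart is usually needed before a newly installed parser becomes visible. A minimal sketch of a fallback that tries several parsers in order is shown below; `make_soup` is a hypothetical helper, and the HTML fragment is a made-up stand-in for a documentation page:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, parsers=("html5lib", "lxml", "html.parser")):
    """Parse `html` with the first parser available in this session."""
    for name in parsers:
        try:
            return BeautifulSoup(html, name)
        except FeatureNotFound:
            continue  # parser not installed or not registered; try the next one
    raise RuntimeError("no HTML parser is available")


# Made-up fragment mimicking the structure the scraping loop relies on.
fragment = '<div class="pagebreak"><h3 id="SEQN">SEQN</h3></div>'
soup = make_soup(fragment)
first_id = soup.find("div", {"class": "pagebreak"}).find("h3")["id"]
print(first_id)
```

`html.parser` ships with Python, so the last entry in the tuple always succeeds. Different parsers can build slightly different trees from malformed HTML, so positional lookups such as `x[5]` and `x[7]` are worth re-checking after switching parsers.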

In [14]:
#we need to do a lot of cleanup on the data, and we are going to do some descriptive statistics on the dataframe
demo_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101316 entries, 0 to 101315
Columns: 175 entries, SEQN to DMDHSEDZ
dtypes: float64(174), object(1)
memory usage: 135.3+ MB


In [15]:
demo_dataframe.describe()

Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIDEXMON,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDAGEEX,RIDRETH1,RIDRETH2,...,DMDBORN4,AIALANGA,DMDHHSZA,DMDHHSZB,DMDHHSZE,DMDHRBR4,DMDHRAGZ,DMDHREDZ,DMDHRMAZ,DMDHSEDZ
count,101316.0,101316.0,101316.0,96766.0,101316.0,101316.0,63085.0,57874.0,101316.0,31126.0,...,39156.0,23010.0,39156.0,39156.0,39156.0,28844.0,9254.0,8764.0,9063.0,4751.0
mean,51134.397193,5.425984,1.955091,1.523407,1.507551,31.12829,339.2977,351.7031,2.895831,2.086584,...,1.241828,1.119991,0.5322301,0.9717029,0.4204975,1.412668,2.860061,2.050776,1.472691,2.110714
std,29836.19262,2.850576,0.207105,0.499454,0.499945,24.94308,287.4832,283.1834,1.251255,1.099042,...,1.745682,0.369281,0.8111773,1.160525,0.7164718,2.808172,0.810059,0.652806,0.721168,0.688517
min,1.0,1.0,1.0,1.0,1.0,5.397605e-79,5.397605e-79,5.397605e-79,1.0,1.0,...,1.0,1.0,5.397605e-79,5.397605e-79,5.397605e-79,1.0,1.0,1.0,1.0,1.0
25%,25329.75,3.0,2.0,1.0,1.0,10.0,98.0,118.25,2.0,1.0,...,1.0,1.0,5.397605e-79,5.397605e-79,5.397605e-79,1.0,2.0,2.0,1.0,2.0
50%,50658.5,5.0,2.0,2.0,2.0,24.0,233.0,248.0,3.0,2.0,...,1.0,1.0,5.397605e-79,1.0,5.397605e-79,1.0,3.0,2.0,1.0,2.0
75%,77627.25,8.0,2.0,2.0,2.0,52.0,560.0,572.0,4.0,3.0,...,1.0,1.0,1.0,2.0,1.0,2.0,4.0,2.0,2.0,3.0
max,102956.0,10.0,2.0,2.0,2.0,85.0,1019.0,1019.0,5.0,5.0,...,99.0,3.0,3.0,4.0,3.0,99.0,4.0,3.0,3.0,3.0
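
The minimum of `5.397605e-79` in several columns is not a real measurement. It appears to be the way zero values surface when SAS transport (XPT) files are read, which is how NHANES distributes its data. A minimal sketch of one way to map these values back to zero is shown below, on a toy frame with made-up values; the `1e-70` threshold is an arbitrary assumption chosen to sit far below any legitimate value:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for demo_dataframe; values are illustrative only.
toy = pd.DataFrame({
    "RIDAGEYR": [5.397605e-79, 10.0, 85.0],
    "DMDHHSZA": [5.397605e-79, 1.0, np.nan],
})

numeric = toy.select_dtypes(include="number")

# Replace vanishingly small magnitudes with an exact zero. NaN compares as
# False against the threshold, so missing values are left untouched.
toy[numeric.columns] = numeric.mask(numeric.abs() < 1e-70, 0.0)

print(toy)
```

Applying the replacement before computing summary statistics keeps the reported minimums and quartiles readable.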


### [NHANES Questionnaire Data](#nhanes-questionnaire-data)

### [Joining Demographic and Questionnaire Data](#joining-demographic-and-questionnaire-data)

## [Exploratory Data Analysis (EDA)](#exploratory-data-analysis-eda)

### [Descriptive Statistics](#descriptive-statistics)

### [Demographic Analysis](#demographic-analysis)


### [Analysis of Health Behaviors and Life-Style](#analysis-of-health-behaviors)

### [Access to Healthcare](#access-to-healthcare)

### [Social Determinants of Health Analysis](#social-determinants-of-health-analysis)

## [Statistical Analysis](#statistical-analysis)

### [Inferential Statistics](#inferential-statistics)

### [Multivariate Analysis](#multivariate-analysis)

### [Comparative Analysis](#comparative-analysis)


### [Correlation Analysis](#correlation-analysis)

### [Hypothesis Testing](#hypothesis-testing)

## [Interpretation and Insights](#interpretation-and-insights)

### [Key Findings from EDA](#key-findings-from-eda)


### [Implications for Addressing Disparities](#implications-for-addressing-disparities)

## [Conclusion](#conclusion)

### [Project Summary](#project_summary)

### [Future Directions](#future_directions)

## [Appendix](#appendix)


### [NHANES-pyTOOL-API Documentation](#api_documentation)


### [Technical Details](#technical_details)

## [References](#references)