# **Project Outline**
Problem Statement: The NHANES (National Health and Nutrition Examination Survey) data is a rich collection of demographic, socioeconomic, dietary, and health-related data. Exploratory data analysis of this data can provide insight into the US health system and the health outcomes it produces. It can also be used to investigate health disparities and inequities in the healthcare system. However, the data is distributed as SAS transport files, so extracting, analyzing, and visualizing it requires technical know-how, for example knowledge of SAS (an analytics and data management software suite). For anyone who is not a data analyst or data scientist, the data is therefore hard to use.


The solution proposed by this research project is an interactive data visualization tool (dashboard) for the NHANES data. The tool is meant to be user-friendly and usable without IT support. It would help anyone with an internet connection (doctors, nurses, other healthcare professionals and providers, health insurance providers, health department officials, members of the public, etc.) to visualize the data, gain insight from it, and make data-driven decisions about health disparities and inequities.

1. **Project Objective/Define Business Requirement** 
  > Creating an interactive data visualization (dashboard) of the NHANES data using Python and Streamlit.
2. **Data Collection**
  > Since we don't have access to the database where the NHANES data is housed, we are going to devise a plan to programmatically download the data from the NHANES website. As stated, we are going to look at selected data files that have a direct relation to the subject matter and are present in at least 9 of the 10 survey cycles from 1999 to 2018.
3. **Data Cleaning and Preparation** 
  > This involves first cleaning each data file per survey cycle, then joining the same data files from each cycle together. The data cleaning also involves determining which features/attributes/columns in the data files are of most relevance to the subject matter. Then we will create a table/dataframe that has the `"SAS Label"` linked to the data file columns. Finally, we will merge all the data files into one dataframe using the `"SEQN - Respondent sequence number"` as the reference. Since some of the columns are numerically encoded, another dataframe is going to be created that links the `"Code or Value"` of each column to its `"Value Description"`.
4. **Data Exploration and Analysis**
  > EDA is the process of investigating data to discover patterns, spot anomalies, test hypotheses, and check the assumptions we might initially have about the data, using summary statistics and graphical representations. This part of the project is for exploring the data and figuring out what we can do with it. Here, we will explore the different visualizations that we can create from the data and what insights can be drawn from these visualizations (and the data at large).
5. **Minimal Viable Model**
6. **Deployment and Enhancements** 
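
As a rough illustration of step 3, the sketch below merges two toy tables on `SEQN` and decodes one numerically encoded column. The values and the `gender_codes` lookup are made up for this example and are not taken from the NHANES files.

```python
import pandas as pd

# Toy stand-ins for two cleaned data files (values are made up for illustration).
demo = pd.DataFrame({"SEQN": ["1", "2", "3"], "RIAGENDR": [1.0, 2.0, 2.0]})
insurance = pd.DataFrame({"SEQN": ["2", "3", "4"], "HIQ011": [1.0, 2.0, 1.0]})

# An outer merge on SEQN keeps every respondent that appears in either file.
merged = pd.merge(demo, insurance, on="SEQN", how="outer")

# Hypothetical lookup linking each code to its description for one column.
gender_codes = {1.0: "Male", 2.0: "Female"}
merged["Gender"] = merged["RIAGENDR"].map(gender_codes)
print(merged)
```

Respondents missing from one of the files keep their row, with NaN in the columns that file would have supplied.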


In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')

## **Data collection**

As mentioned above, the NHANES data is collected in two-year cycles.

In [2]:
#creating a list of the survey cycles for which we are going to collect the data
cycle_list = list()
a = 1999
b = 2000
for i in range(10):
  cycle_list.append(f"{a}-{b}")
  a = b + 1
  b = b + 2

cycle_list

['1999-2000',
 '2001-2002',
 '2003-2004',
 '2005-2006',
 '2007-2008',
 '2009-2010',
 '2011-2012',
 '2013-2014',
 '2015-2016',
 '2017-2018']
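
The same list can also be built in one line. This is only an equivalent sketch; the name `cycle_list_alt` is introduced here for illustration and is not used elsewhere in the notebook.

```python
# Each cycle starts on an odd year from 1999 to 2017 and ends the following year.
cycle_list_alt = [f"{start}-{start + 1}" for start in range(1999, 2018, 2)]
print(cycle_list_alt)
```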

### **Data Variable Names**

**Demographic Variables**

In [3]:
#Demography variable URL from the NHANES
demographics_url = "https://wwwn.cdc.gov/nchs/nhanes/search/variablelist.aspx?Component=demographics"

In [4]:
def get_variable_df(url, cycle_list = cycle_list):
  """
  This function takes the NHANES URL for a variable list as input
  The pandas.read_html() function is used to read the tables on the page
  The resulting data frame is then cleaned (adding a Years column and removing columns that are not needed)
  The Years column matches the survey cycle periods
  The cycle list is used to filter the data to just the cycles of interest
  Returns the data frame of the variables
  """
  dfs = pd.read_html(url)
  df = dfs[0] #the table of interest is at index 0

  #build the Years column, e.g. "1999-2000", from the begin and end year of each row
  df["Years"] = df["Begin Year"].astype(str) + "-" + df["EndYear"].astype(str)
  df = df.drop(columns=["Begin Year", "EndYear", "Component", "Use Constraints"])
  df = df.loc[df["Years"].isin(cycle_list)]
  df = df.reset_index(drop=True)

  return df

In [5]:
#we call the get_variable_df function for the demography URL
demographics_var_df = get_variable_df(demographics_url)
demographics_var_df.head()

Unnamed: 0,Variable Name,Variable Description,Data File Name,Data File Description,Years
0,AIALANG,Language of the MEC ACASI Interview Instrument,DEMO_D,Demographic Variables & Sample Weights,2005-2006
1,DMDBORN,In what country {were you/was SP} born?,DEMO_D,Demographic Variables & Sample Weights,2005-2006
2,DMDCITZN,{Are you/Is SP} a citizen of the United States...,DEMO_D,Demographic Variables & Sample Weights,2005-2006
3,DMDEDUC2,(SP Interview Version) What is the highest gra...,DEMO_D,Demographic Variables & Sample Weights,2005-2006
4,DMDEDUC3,(SP Interview Version) What is the highest gra...,DEMO_D,Demographic Variables & Sample Weights,2005-2006


In [6]:
def check_variable(data_frame):
  """
  This function counts how many rows of the variable list each variable appears in (one row per data file and cycle)
  If the variable appears more than 5 times, it is printed and appended to a list for later use
  """
  variable_list = list()
  new_list = list(data_frame['Variable Name'].value_counts().keys())
  for name in new_list:
    temp = data_frame[data_frame["Variable Name"] == name]
    temp = temp.reset_index(drop=True)
    if temp.shape[0] > 5: #the variable has to appear more than 5 times across the cycles we are looking at
      variable_list.append(name)
      print(f"the variable name: {name}")
      print(f"the variable description is: {temp['Variable Description'][0]}")
      print(f"there are {temp.shape[0]} examples of this variable")
      print("#################################################################")
  
  return variable_list

In [7]:
#so we check the variable in the demography dataframe 
variable_list = check_variable(demographics_var_df)

the variable name: INDFMPIR
the variable description is: Poverty income ratio (PIR) - a ratio of family income to poverty threshold
there are 10 examples of this variable
#################################################################
the variable name: DMDYRSUS
the variable description is: Length of time SP has been in the US.
there are 10 examples of this variable
#################################################################
the variable name: WTINT2YR
the variable description is: Interviewed Sample Persons.
there are 10 examples of this variable
#################################################################
the variable name: SEQN
the variable description is: Respondent sequence number.
there are 10 examples of this variable
#################################################################
the variable name: SDMVSTRA
the variable description is: Masked Variance Unit Pseudo-Stratum variable for variance estimation
there are 10 examples of this variable
######################

In [8]:
#we are going to remove the variables below from the variable list, as they are of no use here
variable_list = [ele for ele in variable_list if ele not in ["RIDEXMON", "WTINT2YR", "SDMVSTRA", "RIDSTATR", "WTMEC2YR"]]

#we use the variable list to filter the demography variable dataframe
demographics_var_df =  demographics_var_df.loc[demographics_var_df["Variable Name"].isin(variable_list)]
demographics_var_df.reset_index(drop=True, inplace=True)
demographics_var_df.head()

Unnamed: 0,Variable Name,Variable Description,Data File Name,Data File Description,Years
0,DMDCITZN,{Are you/Is SP} a citizen of the United States...,DEMO_D,Demographic Variables & Sample Weights,2005-2006
1,DMDEDUC2,(SP Interview Version) What is the highest gra...,DEMO_D,Demographic Variables & Sample Weights,2005-2006
2,DMDEDUC3,(SP Interview Version) What is the highest gra...,DEMO_D,Demographic Variables & Sample Weights,2005-2006
3,DMDFMSIZ,Total number of people in the Family,DEMO_D,Demographic Variables & Sample Weights,2005-2006
4,DMDHHSIZ,Total number of people in the Household,DEMO_D,Demographic Variables & Sample Weights,2005-2006


Next we collect the variable documentation. This documentation describes what each variable is and how it is encoded in the actual data file.

In [9]:
def get_variable_documentation(data_File_Name, cycle = cycle_list[0], variable_list = variable_list):
  """
  This function reads the NHANES data documentation page for the input data_File_Name with the help of BeautifulSoup
  Three dictionaries are created, keyed by variable name: the code table, the SAS label and the English text (variable explanation)
  Only the variables in variable_list are kept
  The function then returns the three dictionaries
  """
  url = f"https://wwwn.cdc.gov/Nchs/Nhanes/{cycle}/{data_File_Name}.htm"
  variable_code_table = dict()
  variable_sas_label = dict()
  variable_English_Text = dict()
  
  req = requests.get(url)
  content = req.text
  soup = BeautifulSoup(content)

  #the code treats each <div class="pagebreak"> as the start of one variable's documentation
  mydivs = soup.find_all("div", {"class": "pagebreak"})
  for div in mydivs:
    x = div.find_all_next()
    variable = x[0]["id"]
    if variable in variable_list:
      variable_sas_label[variable] = x[5].text
      #store the English text as a plain string
      variable_English_Text[variable] = x[7].text
      if div.find("table") is not None:
        table = pd.read_html(str(div.find('table')))[0]
        variable_code_table[variable] = table

  return variable_code_table, variable_sas_label, variable_English_Text
 


In [10]:
#we call the get_variable_documentation function for the demography data file
demography_varibale_code_table, demography_variable_sas_label, demography_variable_English_Text = get_variable_documentation("DEMO")

**Questionnaire Variables**

We are first going to select the data files that are present in at least 9 of the 10 survey cycles. After selecting those, we will go through each one, selecting the variables that are of importance to this project.
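
The idea of counting the cycles in which each data file appears can be sketched on a toy table. The rows and the threshold of 3 are illustrative assumptions; the notebook applies a threshold of 9 to the real variable list.

```python
import pandas as pd

# Toy variable list: one row per (data file, cycle); values are illustrative only.
toy_var_df = pd.DataFrame({
    "Data File Description": ["Income", "Income", "Occupation", "Occupation", "Occupation"],
    "Years": ["1999-2000", "2001-2002", "1999-2000", "2001-2002", "2003-2004"],
})

# Number of distinct survey cycles in which each data file appears.
cycles_per_file = toy_var_df.groupby("Data File Description")["Years"].nunique()

# Keep the data files that reach a minimum number of cycles (3 here, 9 in the notebook).
min_cycles = 3
kept = cycles_per_file[cycles_per_file >= min_cycles].index.tolist()
print(kept)
```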

In [11]:
questionnaire_url = "https://wwwn.cdc.gov/nchs/nhanes/search/variablelist.aspx?Component=questionnaire"
questionnaire_var_df = get_variable_df(questionnaire_url)
questionnaire_var_df.drop(questionnaire_var_df[questionnaire_var_df['Data File Name'] == "OCQ_H_R"].index, inplace = True)
questionnaire_var_df.reset_index(drop=True, inplace=True)
questionnaire_var_df.head()

Unnamed: 0,Variable Name,Variable Description,Data File Name,Data File Description,Years
0,ACD010A,What language(s) {do you/does SP} usually spea...,ACQ_D,Acculturation,2005-2006
1,ACD010B,What language(s) {do you/does SP} usually spea...,ACQ_D,Acculturation,2005-2006
2,ACD010C,What language(s) {do you/does SP} usually spea...,ACQ_D,Acculturation,2005-2006
3,ACD040,Now I'm going to ask you about language use. W...,ACQ_D,Acculturation,2005-2006
4,SEQN,Respondent sequence number.,ACQ_D,Acculturation,2005-2006


In [12]:
questionnaire_data_file_list = list()
for string_key in list(questionnaire_var_df["Data File Description"].unique()):
  if string_key not in ["Dermatology",
                        'Blood Pressure & Cholesterol',
                        'Diet Behavior & Nutrition',
                        'Immunization',
                        'Kidney Conditions - Urology',
                        'Oral Health',
                        'Physical Functioning',
                        'Pesticide Use',
                        'Smoking - Household Smokers',
                        'Weight History',
                        'Respiratory Health',
                        'Sexual Behavior',
                        'Diabetes',
                        'Drug Use',
                        'Reproductive Health',
                        'Consumer Behavior',
                        'Food Security',
                        "Sexual Behavior - Youth", 
                        "Acculturation", 
                        "Alcohol Use",
                        "Audiometry", 
                        "Prescription Medications",
                        "Cardiovascular Health", 
                        "Early Childhood"]:
    a = len(questionnaire_var_df[questionnaire_var_df["Data File Description"] == string_key]["Years"].value_counts())
    if a >=9 or string_key in ["Respiratory Health","Consumer Behavior","Income"]:
      questionnaire_data_file_list.append(string_key)
questionnaire_data_file_list

['Current Health Status',
 'Medical Conditions',
 'Physical Activity',
 'Health Insurance',
 'Hospital Utilization & Access to Care',
 'Housing Characteristics',
 'Occupation',
 'Income']

Now that we have this data file list, we can go through it one data file at a time, selecting the variables that are of most interest to this project.

In [13]:
def return_temp_df(data_File_Name):
  """
  This function filters questionnaire_var_df by the input data file description and returns the resulting dataframe
  """
  print(f"The data frame is for {data_File_Name}")
  df = questionnaire_var_df[questionnaire_var_df["Data File Description"] == data_File_Name]
  df = df.reset_index(drop=True) #returns a new frame instead of modifying a slice in place
  return df

We need to create a dictionary that will house all the information about the variables. For a particular data category we will have the variable list, the SAS labels, the English text, and the code tables. We will put these four in a list, in that order, and store that list in the dictionary under the category name.
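
The layout of one entry can be sketched with toy values. Everything in `toy_documentation` below is illustrative; the real entries come from `get_variable_documentation()`.

```python
import pandas as pd

# Illustrative code table; the real ones are scraped from the NHANES documentation pages.
toy_code_table = pd.DataFrame({
    "Code or Value": ["1", "2", "."],
    "Value Description": ["Male", "Female", "Missing"],
})

toy_documentation = {
    "Demography": [
        ["SEQN", "RIAGENDR"],                          # [0] variable names kept for this category
        {"RIAGENDR": "Gender"},                        # [1] variable name -> SAS label
        {"RIAGENDR": "Illustrative English text"},     # [2] variable name -> English text
        {"RIAGENDR": toy_code_table},                  # [3] variable name -> code table
    ]
}

variables, sas_labels, english_text, code_tables = toy_documentation["Demography"]
print(sas_labels["RIAGENDR"])
print(code_tables["RIAGENDR"])
```

Keeping the four pieces in a fixed order is what lets the later cells index them with `[0]`, `[1]` and `[3]`.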

In [14]:
variable_documentation_dict = dict() #the keys will be the Data Category names
#so we will add the Demography variable information first
variable_documentation_dict["Demography"] = [variable_list, demography_variable_sas_label,demography_variable_English_Text,demography_varibale_code_table]

In [15]:
for i, the_name in enumerate(questionnaire_data_file_list):
  print("\n")
  df = return_temp_df(the_name)
  q_variable_list = check_variable(df)
  print(f"This is the variable list: {q_variable_list} \n \n")
  x,y,z = get_variable_documentation(df["Data File Name"][0], cycle=df["Years"][0], variable_list = q_variable_list)
  variable_documentation_dict[f"{the_name}"] = [q_variable_list, y,z,x]




The data frame is for Current Health Status
the variable name: HSAQUEX
the variable description is: Source of Health Status Data
there are 10 examples of this variable
#################################################################
the variable name: HSQ500
the variable description is: Did {you/SP} have a head cold or chest cold that started during those 30 days?
there are 10 examples of this variable
#################################################################
the variable name: HSQ510
the variable description is: Did {you/SP} have a stomach or intestinal illness with vomiting or diarrhea that started during those 30 days?
there are 10 examples of this variable
#################################################################
the variable name: HSQ520
the variable description is: Did {you/SP} have flu, pneumonia, or ear infections that started during those 30 days?
there are 10 examples of this variable
#################################################################
the var



The data frame is for Physical Activity
the variable name: SEQN
the variable description is: Respondent sequence number.
there are 10 examples of this variable
#################################################################
the variable name: PAAQUEX
the variable description is: Questionnaire source flag for weighting
there are 8 examples of this variable
#################################################################
the variable name: PAQ610
the variable description is: In a typical week, on how many days {do you/does SP} do vigorous-intensity activities as part of your work?
there are 6 examples of this variable
#################################################################
the variable name: PAQ670
the variable description is: In a typical week, on how many days {do you/does SP} do moderate-intensity sports, fitness or recreational activities?
there are 6 examples of this variable
#################################################################
the variable name: PAQ665
t



The data frame is for Hospital Utilization & Access to Care
the variable name: HUQ020
the variable description is: Compared with 12 months ago, would you say {your/SP's} health is now . . .
there are 10 examples of this variable
#################################################################
the variable name: HUQ030
the variable description is: Is there a place that {you/SP} usually {go/goes} when {you are/he/she is} sick or {you/s/he} need{s} advice about {your/his/her} health?
there are 10 examples of this variable
#################################################################
the variable name: HUQ090
the variable description is: During the past 12 months, that is since {DISPLAY CURRENT MONTH} of {DISPLAY LAST YEAR}, {have you/has SP} seen or talked to a mental health professional such as a psychologist, psychiatrist, psychiatric nurse or clinical social worker about {your/his/her} health?
there are 10 examples of this variable
###############################################



The data frame is for Income
the variable name: IND235
the variable description is: Monthly family income (reported as a range value in dollars).
there are 6 examples of this variable
#################################################################
the variable name: INQ060
the variable description is: Did {you/you or any family members living here} receive any disability pension [other than Social Security or Railroad Retirement] in {LAST CALENDAR YEAR}?
there are 6 examples of this variable
#################################################################
the variable name: SEQN
the variable description is: Respondent sequence number.
there are 6 examples of this variable
#################################################################
the variable name: INQ150
the variable description is: Did {you/you or any family members living here} receive income in {LAST CALENDAR YEAR} from child support, alimony, contributions from family or others, VA payments, worker's compensation, or u

In [16]:
x,y,z = get_variable_documentation(df["Data File Name"][0], cycle=df["Years"][0], variable_list = q_variable_list)
variable_documentation_dict[f"{the_name}"] = [q_variable_list, y,z,x]

### **Data Files**
*****************************

We are going to create a dictionary of dataframes, holding the demography data and the questionnaire data built from their respective data files. The data files from all 10 survey cycles will be merged and then added to this dictionary.
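
A minimal sketch of combining the same data file across cycles is shown below, using made-up values. It stacks the cycles with `pd.concat`, whereas the notebook's own function uses `pd.merge`; `data_files` is a hypothetical stand-in for `Dict_Data_files`.

```python
import pandas as pd

# Toy frames standing in for the same data file in two cycles (made-up values).
cycle_a = pd.DataFrame({"SEQN": [1.0, 2.0], "RIAGENDR": [1.0, 2.0]})
cycle_b = pd.DataFrame({"SEQN": [9966.0, 9967.0], "RIAGENDR": [2.0, 1.0], "DMDYRSUS": [3.0, None]})

# Stack the cycles; a column missing from one cycle is filled with NaN for its rows.
stacked = pd.concat([cycle_a, cycle_b], ignore_index=True)

# SEQN arrives as a float, so convert it to int and then to str.
stacked = stacked.astype({"SEQN": int}).astype({"SEQN": str})

data_files = {"Demography": stacked}  # hypothetical stand-in for Dict_Data_files
print(data_files["Demography"])
```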

In [17]:
Dict_Data_files = dict()

In [18]:
#we will create a dictionary of the data file names and the survey cycle corresponding to each name
def create_dict_cycle_and_file_name(variable_df):
  """
  This function takes the variable dataframe as input
  It returns a dictionary with the data file names as keys and their survey cycles as values
  """
  data_File_Name_Cycle_dict = dict(zip(variable_df['Data File Name'], variable_df['Years']))

  return data_File_Name_Cycle_dict



def get_data_files(Data_file_Name, Doc_File_Name_dict, COLUMN_NAMES):
  """
  This function downloads the XPT data file for every data file name/cycle pair in Doc_File_Name_dict
  and merges them into one data frame
  Only the columns listed in COLUMN_NAMES are kept
  The merged data frame is stored in Dict_Data_files under the key Data_file_Name
  """
  print(f"Starting to merge {Data_file_Name}")
  merged_df = None
  
  for key in Doc_File_Name_dict:
    print(f"We are in year: {Doc_File_Name_dict[key]} for doc {key}")
    
    url = f"https://wwwn.cdc.gov/Nchs/Nhanes/{Doc_File_Name_dict[key]}/{key}.XPT"
    df_temp = pd.read_sas(url, format='xport', encoding='utf-8')

    #the first file starts the merged data frame; later files are outer-merged on their shared columns
    if merged_df is None:
      merged_df = df_temp
    else:
      merged_df = pd.merge(merged_df, df_temp, how="outer")

  #SEQN is read as a float, so we convert it to int and then to str
  merged_df = merged_df.astype({"SEQN": int})
  merged_df = merged_df.astype({"SEQN": str})

  #keep only the requested columns that are actually present
  COLUMN_NAMES = [ele for ele in list(merged_df.columns) if ele in COLUMN_NAMES]
  merged_df = merged_df[COLUMN_NAMES]
  print(f"Done merging {Data_file_Name}!!!!!!!")

  #we add the merged data frame to the dictionary of data files
  Dict_Data_files[Data_file_Name] = merged_df
  
  

**Demography**

In [None]:
get_data_files("Demography",create_dict_cycle_and_file_name(demographics_var_df), COLUMN_NAMES= variable_documentation_dict['Demography'][0])
print(f"The Type of the data frame: {type(Dict_Data_files['Demography'])}")
demography_data = Dict_Data_files["Demography"]
print(f"The shape of the data frame: {demography_data.shape}")

demography_data.head()


Starting to merge Demography
We are in year: 2005-2006 for doc DEMO_D
We are in year: 2007-2008 for doc DEMO_E
We are in year: 2003-2004 for doc DEMO_C
We are in year: 2001-2002 for doc DEMO_B
We are in year: 1999-2000 for doc DEMO


**Questionnaire**

In [None]:
data_category_list = list(variable_documentation_dict.keys())
for i in range(1, len(data_category_list)):
  data_category_name = data_category_list[i]
  column_names = variable_documentation_dict[data_category_name][0]
  df = questionnaire_var_df[questionnaire_var_df["Data File Description"] == data_category_name]
  get_data_files(data_category_name, create_dict_cycle_and_file_name(df), COLUMN_NAMES= column_names)

### **Data Cleaning and Preparation**

For the data cleaning we are going to start with the demographic data.

In [None]:
#we need to check whether the columns in the actual dataframe are the same as the ones in the variable documentation 
#if not, we are going to add the missing documentation 

#we create a list of all the data categories that we have 
var_dict_keys = list(variable_documentation_dict.keys())

#we start with demography, which is at index 0
demo = var_dict_keys[0]
demography_df  = Dict_Data_files[demo]
#now we retrieve the dictionary of column names and their SAS labels
demo_var_dict = variable_documentation_dict[demo][1]

#now we compare the keys in 'demo_var_dict' with 'demography_df.columns' and list the columns that have no documentation 

missing_var = [x for x in list(demography_df.columns) if x not in list(demo_var_dict.keys())]
print(missing_var)

In [None]:
#we go to the web and find the description/SAS label of the variable 

missing_var_dict = {'DMDFMSIZ' : "Total number of people in the Family",   
                    'INDHHIN2' : "Annual Household Income", 
                    'INDFMIN2' : "Annual Family Income"
                  }

In [None]:
#we add the entries of missing_var_dict to the demography SAS label dictionary
variable_documentation_dict[demo][1].update(missing_var_dict)
#we are going to remove the below entry from the dictionary 
#'SDMVPSU': 'Masked Variance Pseudo-PSU',
variable_documentation_dict[demo][1].pop('SDMVPSU', None)
demo_var_dict = variable_documentation_dict[demo][1]

demo_var_dict

In [None]:
Dict_Data_files[demo]

In [None]:
import pickle

In [None]:
#note: the file is written with pickle (a binary format), despite the .json extension
with open('data.json', 'wb') as fp:
    pickle.dump(variable_documentation_dict, fp)

In [None]:
#now we filter the data frame, keeping only the variables in the dict
#.copy() gives an independent frame, so the column assignments below do not act on a slice
Dict_Data_files[demo] = Dict_Data_files[demo][list(demo_var_dict.keys())].copy()
Dict_Data_files[demo]['RIDAGEYR'] = Dict_Data_files[demo]['RIDAGEYR'].round(decimals = 2)
Dict_Data_files[demo]['INDFMPIR'] = Dict_Data_files[demo]['INDFMPIR'].round(decimals = 3)
demography_df  = Dict_Data_files[demo]

demography_df.info()


In [None]:
demography_df['RIDAGEYR'].round(decimals = 2)

In [None]:
demography_df.head()

In [None]:
#for Age in Months - Recode
#we fill in the nulls using the value of column RIDAGEYR (Age at Screening Adjudicated - Recode) multiplied by 12
#we then drop RIDAGEYR, RIDAGEEX (Exam Age in Months - Recode) and the other columns listed below, as we don't need them
demography_df['RIDAGEMN'] = demography_df['RIDAGEMN'].fillna( 12 * demography_df['RIDAGEYR'] )
demography_df = demography_df.drop(columns=["RIDAGEYR","RIDAGEEX", "DMDHRGND", "DMDHRAGE", "DMDSCHOL","DMDEDUC2","DMDEDUC3","DMDMARTL", "DMDHSEDU"])
demography_df.head()

In [None]:
demography_df.columns

In [None]:
demo_var_dict["INDFMIN2"]

In [None]:
demography_df.info()

In [None]:
#for Veteran/Military Status
demography_df["DMQMILIT"].value_counts()

In [None]:
#here we are filling in the missing values/null with 5 and updating the documentation
demography_df['DMQMILIT'] = demography_df['DMQMILIT'].fillna(5)
variable_documentation_dict[demo][3]["DMQMILIT"].loc[4, "Code or Value"] = 5

In [None]:
#for Citizenship Status
#here we are filling in the missing values/null with 5 and updating the documentation
demography_df['DMDCITZN'] = demography_df['DMDCITZN'].fillna(5)
variable_documentation_dict[demo][3]["DMDCITZN"].loc[4, "Code or Value"] = 5

In [None]:
#for Length of time in US
#here we are filling in the missing values/null with 15 and updating the documentation
demography_df['DMDYRSUS'] = demography_df['DMDYRSUS'].fillna(15)
variable_documentation_dict[demo][3]["DMDYRSUS"].loc[12, "Code or Value"] = 15

In [None]:
#for Marital Status
#here we are filling in the missing values/null with 9 and updating the documentation
demography_df['DMDHRMAR'] = demography_df['DMDHRMAR'].fillna(9)
variable_documentation_dict[demo][3]["DMDHRMAR"].loc[8, "Code or Value"] = 9

In [None]:
#Family PIR
#here we are filling in the missing values/null with 9 and updating the documentation
demography_df['INDFMPIR'] = demography_df['INDFMPIR'].fillna(9)
variable_documentation_dict[demo][3]["INDFMPIR"].loc[2, "Code or Value"] = 9

In [None]:
#Pregnancy Status at Exam - Recode
#here we are filling in the missing values/null with 9 and updating the documentation
demography_df['RIDEXPRG'] = demography_df['RIDEXPRG'].fillna(9)
variable_documentation_dict[demo][3]["RIDEXPRG"].loc[3, "Code or Value"] = 9

In [None]:
demography_df.head()

In [None]:
demography_df.info()

In [None]:
Dict_Data_files[demo] = demography_df

In [None]:
var_dict_keys

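Now we look at the first questionnaire data file. Given the category list built above, index 1 of `var_dict_keys` is Current Health Status, not Blood Pressure, so the `BP` names below refer to that category.

In [None]:
#the first questionnaire category is at index 1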
BP = var_dict_keys[1]
BP_df  = Dict_Data_files[BP]

Dict_Data_files[var_dict_keys[1]]

In [None]:
BP_df.head()

In [None]:
Dict_Data_files[var_dict_keys[1]].shape

In [None]:
#merging the demography data with the category at index 1
merged_df = pd.merge(Dict_Data_files[var_dict_keys[0]],Dict_Data_files[var_dict_keys[1]],on='SEQN',how='outer')
merged_df.info()

In [None]:
for i in range(2, len(var_dict_keys)):
  merged_df = pd.merge(merged_df, Dict_Data_files[var_dict_keys[i]], on='SEQN',how='outer')

In [None]:
merged_df.info()

In [None]:
var_dict_keys = list(variable_documentation_dict.keys())
dict_of_var_name_and_SAS_name = dict()
#now we want to replace the column names with their SAS labels 
for i in range(len(variable_documentation_dict)):
  dict_of_var_name_and_SAS_name.update(variable_documentation_dict[var_dict_keys[i]][1])

#the labels below were missing, so we add them manually
dict_of_var_name_and_SAS_name["OCD150"] = "Type of work done last week"
dict_of_var_name_and_SAS_name["OCD390G"] = "Kind of work you have done the longest"


In [None]:
dict_of_var_name_and_SAS_name

In [None]:
merged_df.rename(columns=dict_of_var_name_and_SAS_name, inplace=True)

In [None]:
for name in merged_df.columns:
  print(name)

### **Data Exploration and Analysis**

In [None]:
merged_df = pd.read_csv("merged_data.csv")

In [None]:
merged_df.head()

In [None]:
merged_df[merged_df.columns[1:20]].info()

In [None]:
#we use the code table from the variable documentation for reference 
variable_documentation_dict['Demography'][3]["RIAGENDR"]

We are first going to convert the numeric codes in the variable_documentation_dict to floats so that they match the dataframe. We are also going to replace NaN in the merged dataframe with '.' so that it matches the documentation.
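
Why both sides need the same types can be seen on a toy example. The code table and the `gender` column below are made up, and the dictionary lookup is only one possible way to decode the values, not the notebook's own helper.

```python
import numpy as np
import pandas as pd

# Toy code table as scraped from the documentation: codes arrive as strings.
code_table = pd.DataFrame({
    "Code or Value": ["1", "2", "."],
    "Value Description": ["Male", "Female", "Missing"],
})
# Toy data column: codes are floats and missing values are NaN.
gender = pd.Series([1.0, 2.0, np.nan, 2.0], name="Gender")

# Convert digit-only codes to floats so they compare equal to the data values.
code_table["Code or Value"] = [
    float(c) if isinstance(c, str) and c.isdigit() else c
    for c in code_table["Code or Value"]
]

# Replace NaN with "." so that missing values match the "." row of the code table.
gender = gender.astype(object).where(gender.notna(), ".")

# Build a code -> description lookup and decode the column.
lookup = dict(zip(code_table["Code or Value"], code_table["Value Description"]))
decoded = gender.map(lookup)
print(decoded.tolist())
```

Without the conversion, the string `"1"` in the code table would never equal the float `1.0` in the data, and every lookup would come back empty.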

In [None]:
#the below is a brute-force way of getting to the objective
#a variable is skipped for every category that has no code table for it
#we also want dictionaries that map each SAS label to its data category and to its variable name
cat_and_var_dict = dict()
sas_rev_var_name = dict()
for var_cat in var_dict_keys:
    code_tables = variable_documentation_dict[var_cat][3]
    for variable_name, sas_name in dict_of_var_name_and_SAS_name.items():
        if variable_name not in code_tables:
            continue
        df = code_tables[variable_name]
        cat_and_var_dict[sas_name] = var_cat
        sas_rev_var_name[sas_name] = variable_name
        #digit-only codes such as "1" become 1.0; ranges, "." and values that are already numeric are left as they are
        df["Code or Value"] = [float(x) if isinstance(x, str) and x.isdigit() else x
                               for x in df["Code or Value"]]



In [None]:
merged_df = merged_df.fillna(".")

In [None]:
merged_df[merged_df.columns[1:20]].info()

In [None]:
variable_documentation_dict['Demography'][3]["DMDCITZN"]

In [None]:
#first we will look at the distribution of gender
merged_df["Gender"].value_counts()

In [None]:
#we use the code table from the variable documentation for reference 
variable_documentation_dict['Demography'][3]["RIAGENDR"]

In [None]:
#we will create a helper function that maps a gender code to its value description 
def value_mapper(x):
  df = variable_documentation_dict['Demography'][3]["RIAGENDR"]
  i = df[df['Code or Value'] == x].index[0]
  return df['Value Description'][i]

In [None]:
# for i, x in enumerate(column):
#     y = df[df['Code or Value'] == str(x)].index[0]
#     column[i] = df['Value Description'][i]


In [None]:
value_mapper(2.0)

In [None]:
tempdf = pd.DataFrame(merged_df["Gender"].value_counts())
tempdf.reset_index(inplace= True)
tempdf.columns = ["Gender", "Count"]

tempdf['Gender'] = tempdf['Gender'].apply(value_mapper)
fig = px.pie(tempdf, values='Count', names='Gender', )
fig.show()


In [None]:
for name in merged_df.columns:
    print(name)

In [None]:
# #here we are going to create a value mapper
# def value_mapper_v2(column, data_cat, variable_name):
#     column = column.astype(int)
#     column = column.astype(str)
#     try:
#         df = variable_documentation_dict[data_cat][3][variable_name]
#         for i, x in enumerate(column):
#             y = df[df['Code or Value'] == str(x)].index[0]
#             column[i] = df['Value Description'][y]
#     except Exception as e:
#         print(e)
#     return 
    

In [None]:
#now we look at the distribution of gender with regard to race 
#.copy() avoids modifying a slice of merged_df when the columns are decoded below
temp_df = merged_df[["Gender", "Race/Ethnicity - Recode"]].copy()

def value_mapper(x):
  df = variable_documentation_dict['Demography'][3]["RIAGENDR"]
  i = df[df['Code or Value'] == x].index[0]
  return df['Value Description'][i]
temp_df["Gender"] = temp_df["Gender"].apply(value_mapper)


def value_mapper(x):
  df = variable_documentation_dict['Demography'][3]["RIDRETH1"]
  i = df[df['Code or Value'] == x].index[0]
  return df['Value Description'][i]

temp_df["Race/Ethnicity - Recode"]  = temp_df["Race/Ethnicity - Recode"].apply(value_mapper)

fig = px.histogram(temp_df, x="Gender", color= "Race/Ethnicity - Recode", title= "Gender per Race/Ethnicity")
fig.update_layout(bargap=0.2)
fig.show()

In [None]:
variable_documentation_dict['Demography'][3]["RIDRETH1"]

In [None]:
temp_df = merged_df[["Covered by health insurance", "Race/Ethnicity - Recode"]]

In [None]:
temp_df.head()

In [None]:
temp_df = merged_df[["Covered by health insurance", "Race/Ethnicity - Recode"]].copy()
def value_mapper(x):
    df = variable_documentation_dict['Health Insurance'][3]["HIQ011"]
    i = df[df['Code or Value'] == x].index[0]
    return df['Value Description'][i]

temp_df['Covered by health insurance'] = temp_df['Covered by health insurance'].apply(value_mapper)


def value_mapper(x):
    df = variable_documentation_dict['Demography'][3]["RIDRETH1"]
    i = df[df['Code or Value'] == x].index[0]
    return df['Value Description'][i]

temp_df["Race/Ethnicity - Recode"]  = temp_df["Race/Ethnicity - Recode"].apply(value_mapper)

fig = px.histogram(temp_df, x="Covered by health insurance")
fig.show()


In [None]:
temp_df = merged_df[["Covered by health insurance", "Race/Ethnicity - Recode"]].copy()


def value_mapper(x):
    df = variable_documentation_dict['Health Insurance'][3]["HIQ011"]
    i = df[df['Code or Value'] == x].index[0]
    return df['Value Description'][i]
temp_df["Covered by health insurance"] = temp_df["Covered by health insurance"].apply(value_mapper)


def value_mapper(x):
  df = variable_documentation_dict['Demography'][3]["RIDRETH1"]
  i = df[df['Code or Value'] == x].index[0]
  return df['Value Description'][i]

temp_df["Race/Ethnicity - Recode"]  = temp_df["Race/Ethnicity - Recode"].apply(value_mapper)

fig = px.histogram(temp_df, x="Covered by health insurance", color= "Race/Ethnicity - Recode", title= "Covered by health insurance")
fig.update_layout(bargap=0.2)
fig.show()

In [None]:
var_dict_keys

In [None]:
dict_of_var_name_and_SAS_name

In [None]:
def create_chart(x, y):
  """
  Plot a histogram of column x, coloured by column y
  Both columns are first decoded from their numeric codes to their value descriptions
  """
  temp_df = merged_df[[x, y]].copy()

  for column in (x, y):
    #look up the code table of this column through its data category and variable name
    df = variable_documentation_dict[cat_and_var_dict[column]][3][sas_rev_var_name[column]]

    def value_mapper(num, df=df):
      i = df[df['Code or Value'] == num].index[0]
      return df['Value Description'][i]
    temp_df[column] = temp_df[column].apply(value_mapper)

  fig = px.histogram(temp_df, x=x, color= y, title= f"{x}")
  fig.update_layout(bargap=0.2)
  fig.show()

In [None]:
create_chart(x = 'Family monthly poverty level category', y= 'Race/Ethnicity - Recode')

In [None]:
create_chart(x = 'Have Medicare?', y= 'Race/Ethnicity - Recode')

In [None]:
create_chart(x = 'No coverage of any type', y= 'Race/Ethnicity - Recode')

In [None]:
create_chart(x = 'General health condition', y= 'Race/Ethnicity - Recode')

In [None]:
create_chart(x = 'Type place most often go for healthcare', y= 'Race/Ethnicity - Recode')

In [None]:
create_chart(x = 'How long since last healthcare visit', y= 'Race/Ethnicity - Recode')