<a href="https://colab.research.google.com/github/laura-cramm/CIND-820-Big-Data-Analytics-Project/blob/main/CIND_820_Exploratory_Data_Analysis_(Core_Modules).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b>Conducting Exploratory Data Analysis Using YData Profiling</b>
### <b>Behavioural Risk Factor Surveillance System Standardized Questionnaire</b>

The Behavioural Risk Factor Surveillance System (BRFSS) consists of a standardized core questionnaire that all states are required to administer, together with a series of optional modules and state-added questions. The exploratory data analysis here covers the standardized core questionnaire only. The primary exposure variable of interest is <b>COVIDPOS</b>, a categorical variable recording the answer to the question <i>Has a doctor, nurse, or other health professional ever told you that you tested positive for COVID 19?</i> The primary outcome variables of interest are <b>MENTHLTH</b> and <b>ADDEPEV3</b>, which record the answers to the questions <i>Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?</i> and <i>(Ever told) (you had) a depressive disorder (including depression, major depression, dysthymia, or minor depression)?</i>, respectively.

In [30]:
!pip install ydata-profiling



In [31]:
import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport

In [32]:
#Reading the csv file containing the standardized core BRFSS data and saving
#it to a pandas data frame
brfssCore = pd.read_csv("/content/sample_data/CoreSurvey.csv")
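
Before going further, it can help to confirm that the exposure and outcome variables named above were actually loaded and to see how much of each is missing. The sketch below assumes nothing about the real file: `toy` is a hypothetical stand-in for `brfssCore` with made-up values, and `check_key_columns` is an illustrative helper, not part of pandas.

```python
import pandas as pd


def check_key_columns(frame, required):
    """Return the required column names that are absent from the frame."""
    return [name for name in required if name not in frame.columns]


# Hypothetical stand-in for brfssCore; the values are made up for illustration.
toy = pd.DataFrame({
    "COVIDPOS": [1.0, 2.0, None, 2.0],
    "MENTHLTH": [3.0, 10.0, 0.0, None],
    "ADDEPEV3": [2.0, 1.0, 2.0, 2.0],
})

key_columns = ["COVIDPOS", "MENTHLTH", "ADDEPEV3"]

# Names that were expected but not found (empty list when all are present).
absent = check_key_columns(toy, key_columns)

# Share of missing values in each key column.
missing_share = toy[key_columns].isna().mean()

print(absent)
print(missing_share)
```

On the real data, the same two checks would be run with `brfssCore` in place of `toy`.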

In [33]:
#Removing the underscore from the _STATE attribute name.
brfssCore.rename(columns={'_STATE': 'STATE'}, inplace=True)

#Sorting the attribute names alphabetically.
brfssCore = brfssCore.sort_index(axis=1)

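#Attributes that start with an underscore were calculated by CDC staff from other attributes.
#Removing the alphabetical run of these attributes from '_AIDTST4' through '_YRSSMOK'.
#The '_AGE...' attributes sort before '_AIDTST4', so they are kept and typed in the next cell.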
brfssCore = brfssCore.drop(brfssCore.loc[:, '_AIDTST4':'_YRSSMOK'].columns, axis=1)
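
The drop above relies on two behaviours: after an alphabetical column sort, names that begin with an underscore come after the upper-case names, and a label slice with `.loc` includes both endpoints. A minimal sketch of this on an empty frame follows; the column list is a small hypothetical subset chosen for illustration, not the full BRFSS layout.

```python
import pandas as pd

# Hypothetical subset of column names; the real data frame has many more.
names = ["STATE", "MENTHLTH", "_YRSSMOK", "_AGE80", "COVIDPOS",
         "_AIDTST4", "_BMI5", "ADDEPEV3", "_AGE_G"]

# An empty frame is enough to study how the columns are ordered and sliced.
toy = pd.DataFrame(columns=names).sort_index(axis=1)
ordered = list(toy.columns)

# Label slicing with .loc is inclusive of both the start and the stop label.
to_drop = list(toy.loc[:, "_AIDTST4":"_YRSSMOK"].columns)
kept = list(toy.drop(columns=to_drop).columns)

print(ordered)
print(to_drop)
print(kept)
```

In this toy ordering, `_AGE80` and `_AGE_G` fall before `_AIDTST4`, so the slice leaves them in place.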

In [34]:
#Using the .astype() method to give each categorical variable the 'category' data type
for col in ['STATE', 'FMONTH', 'IMONTH', 'IDAY', 'IYEAR', 'DISPCODE', 'CTELENM1', 'PVTRESD1', 'COLGHOUS', 'STATERE1', 'CELPHON1',
            'LADULT1', 'COLGSEX1', 'LANDSEX1', 'RESPSLCT',  'SAFETIME', 'CTELNUM1', 'CELLFON5',
            'CADULT1', 'CELLSEX1', 'PVTRESD3', 'CCLGHOUS', 'CSTATE1', 'LANDLINE', 'SEXVAR', 'GENHLTH',
            'PRIMINSR', 'PERSDOC3', 'MEDCOST1', 'CHECKUP1', 'EXERANY2',  'LASTDEN4', 'RMVTETH4',
            'CVDINFR4', 'CVDCRHD4', 'CVDSTRK3', 'ASTHMA3', 'ASTHNOW', 'CHCSCNC1', 'CHCOCNC1',
            'CHCCOPD3', 'ADDEPEV3', 'CHCKDNY2', 'HAVARTH4', 'DIABETE4', 'MARITAL', 'EDUCA',
            'RENTHOM1', 'NUMHHOL4', 'VETERAN3', 'EMPLOY1', 'INCOME3', 'PREGNANT',
            'DEAF', 'BLIND', 'DECIDE', 'DIFFWALK', 'DIFFDRES', 'DIFFALON', 'HADMAM',
            'HOWLONG', 'CERVSCRN', 'CRVCLCNC', 'CRVCLPAP', 'CRVCLHPV', 'HADHYST2',
            'HADSIGM4', 'COLNSIGM', 'COLNTES1', 'SIGMTES1', 'LASTSIG4',
            'COLNCNCR', 'VIRCOLO1', 'VCLNTES2', 'SMALSTOL', 'STOLTEST', 'STOOLDN2',
            'BLDSTFIT', 'SDNATES1', 'SMOKE100', 'SMOKDAY2', 'USENOW3', 'ECIGNOW2',
            'LCSCTSC1', 'LCSSCNCR', 'LCSCTWHN', 'FLUSHOT7', 'PNEUVAC4', 'TETANUS1', 'HIVTST7',
            'HIVRISK5', 'COVIDPOS', 'COVIDSMP', 'COVIDPRM', 'PDIABTS1', 'PREDIAB2', 'DIABTYPE', 'INSULIN1',
            'EYEEXAM1', 'DIABEYE1', 'DIABEDU1', 'FEETSORE', 'IMFVPLA3', 'HPVADVC4', 'HPVADSHT', 'SHINGLE2', 'COVIDVA1',
            'COVACGET', 'COVIDNU1', 'COVIDINT', 'COPDCOGH', 'COPDFLEM', 'COPDBRTH', 'COPDBTST', 'CNCRDIFF', 'CNCRTYP2',
            'CSRVTRT3', 'CSRVDOC1', 'CSRVSUM', 'CSRVRTRN', 'CSRVINST', 'CSRVINSR', 'CSRVDEIN', 'CSRVCLIN', 'CSRVPAIN',
            'CSRVCTL2', 'PSATEST1', 'PSATIME1', 'PCPSARS2', 'PSASUGST', 'PCSTALK1',
            'CIMEMLOS', 'CDHOUSE', 'CDASSIST', 'CDHELP', 'CDSOCIAL', 'CDDISCUS', 'CAREGIV1', 'CRGVREL4',
            'CRGVLNG1', 'CRGVHRS1', 'CRGVPRB3', 'CRGVALZD', 'CRGVPER1', 'CRGVHOU1', 'CRGVEXPT',
            'ACEDEPRS', 'ACEDRINK', 'ACEDRUGS', 'ACEPRISN', 'ACEDIVRC', 'ACEPUNCH', 'ACEHURT1',
            'ACESWEAR', 'ACETOUCH', 'ACETTHEM', 'ACEHVSEX', 'ACEADSAF', 'ACEADNED', 'LSATISFY', 'EMTSUPRT',
            'SDHISOLT', 'SDHEMPLY', 'FOODSTMP', 'SDHFOOD1', 'SDHBILLS', 'SDHUTILS', 'SDHTRNSP', 'SDHSTRE1',
            'MARJSMOK', 'MARJEAT', 'MARJVAPE', 'MARJDAB', 'MARJOTHR', 'LASTSMK2', 'STOPSMK2', 'MENTCIGS', 'MENTECIG',
            'HEATTBCO', 'ASBIALCH', 'ASBIDRNK', 'ASBIBING', 'ASBIADVC', 'ASBIRDUC', 'FIREARM5', 'GUNLOAD', 'LOADULK2',
            'RCSGEND1', 'RCSXBRTH', 'RCSRLTN2', 'CASTHDX2', 'CASTHNO2', 'BIRTHSEX', 'SOMALE', 'SOFEMALE', 'TRNSGNDR', 'HADSEX',
            'PFPPRVN4', 'TYPCNTR9', 'BRTHCNT4', 'WHEREGET', 'NOBCUSE8', 'BCPREFER', 'RRCLASS3', 'RRCOGNT2', 'RRTREAT', 'RRATWRK2',
            'RRHCARE4', 'RRPHYSM2', 'QSTVER', 'QSTLANG', 'MSCODE', 'CAGEG', '_AGEG5YR', '_AGE65YR', '_AGE80', '_AGE_G', 'DRNKANY6']:
    brfssCore[col] = brfssCore[col].astype('category')
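
As an alternative to the loop, `DataFrame.astype` also accepts a dictionary mapping column names to data types, which converts all listed columns in one call. The sketch below uses a hypothetical three-column frame with made-up values, and it first separates the names that are present from those that are not, so a misspelled name is reported instead of raising a `KeyError`. The name `NOT_IN_FILE` is a deliberate placeholder.

```python
import pandas as pd

# Hypothetical stand-in for brfssCore; the values are made up for illustration.
toy = pd.DataFrame({
    "COVIDPOS": [1, 2, 2, 1],
    "ADDEPEV3": [2, 2, 1, 1],
    "MENTHLTH": [0, 14, 3, 30],
})

# Candidate categorical columns; "NOT_IN_FILE" is a placeholder for a typo.
categorical_columns = ["COVIDPOS", "ADDEPEV3", "NOT_IN_FILE"]

present = [name for name in categorical_columns if name in toy.columns]
skipped = [name for name in categorical_columns if name not in toy.columns]

# Convert every present column in a single call.
toy = toy.astype({name: "category" for name in present})

print(toy.dtypes)
print("Not found:", skipped)
```

Columns left out of the mapping, such as `MENTHLTH` here, keep their original numeric type.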

In [35]:
#Using the ydata_profiling library to generate an exploratory data analysis report
edaCore = ProfileReport(brfssCore, title="Exploratory Data Analysis", minimal=True)
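
Some of the per-column figures that a profile report presents, such as missing and distinct counts, can be spot-checked directly in pandas. The sketch below is only a rough cross-check and not a replacement for the report. `quick_summary` is a hypothetical helper, and `toy` is a made-up stand-in for `brfssCore`.

```python
import pandas as pd


def quick_summary(frame):
    """Tabulate dtype, missing count and distinct count for each column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_distinct": frame.nunique(dropna=True),
    })


# Hypothetical stand-in for brfssCore; the values are made up for illustration.
toy = pd.DataFrame({
    "COVIDPOS": pd.Series([1, 2, None, 2]).astype("category"),
    "MENTHLTH": [3.0, None, 0.0, 3.0],
})

summary = quick_summary(toy)
print(summary)
```

Running `quick_summary(brfssCore)` would give one row per attribute, which can be compared against the corresponding entries in the generated report.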

In [36]:
#Saving this exploratory data analysis report to an .html file.
edaCore.to_file("exploratory_data_analysis_core_modules.html")


