<a href="https://colab.research.google.com/github/laura-cramm/CIND-820-Big-Data-Analytics-Project/blob/main/CIND_820_Exploratory_Data_Analysis_(Cognitive_Decline_Modules).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b>Conducting Exploratory Data Analysis Using YData Profiling</b>:
### <b>Behavioural Risk Factor Surveillance System Cognitive Decline Module</b>

A total of 18 states included the optional <i>Cognitive Decline</i> module when administering the Behavioural Risk Factor Surveillance System (BRFSS). Here, exploratory analysis is conducted on the cognitive decline module specifically. The primary exposure variable of interest is again <b>COVIDPOS</b>, a categorical variable representing the answer to the question: <i>Has a doctor, nurse, or other health professional ever told you that you tested positive for COVID 19?</i> The primary outcome variable is <b>CIMEMLOS</b>, which represents the answer to the question: <i>During the past 12 months, have you experienced confusion or memory loss that is happening more often or is getting worse?</i>
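
As a minimal sketch of how the exposure and outcome variables relate, the two columns can be cross-tabulated. The data below are invented toy values, and the response coding (1 = Yes, 2 = No, 7 = Don't know, 9 = Refused) is an assumption that should be checked against the BRFSS codebook for the survey year in use.

```python
import pandas as pd

# Toy data standing in for real survey responses (all values are invented).
# Assumed coding: 1 = Yes, 2 = No, 7 = Don't know / Not sure, 9 = Refused.
toy = pd.DataFrame({
    "COVIDPOS": [1, 1, 2, 2, 2, 1, 7, 2],
    "CIMEMLOS": [1, 2, 2, 2, 1, 1, 2, 9],
})

# Keep only respondents with a definite Yes/No answer to both questions.
yes_no = toy[toy["COVIDPOS"].isin([1, 2]) & toy["CIMEMLOS"].isin([1, 2])]

# Rows: exposure (COVIDPOS); columns: outcome (CIMEMLOS); cells: counts.
table = pd.crosstab(yes_no["COVIDPOS"], yes_no["CIMEMLOS"])
print(table)
```

The same two-line pattern (filter, then `pd.crosstab`) would apply to the real data frame once it has been loaded.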

In [19]:
!pip install ydata-profiling



In [20]:
import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport

In [21]:
#Reading the csv file containing the standardized core BRFSS data and saving
#it to a pandas data frame
brfssCore = pd.read_csv("/content/CoreSurvey.csv")

#Not all states asked questions concerning cognitive decline. Creating a
#subset of the original data frame that only includes states that asked these
#questions.
coreCogData = brfssCore[brfssCore["_STATE"].isin([12, 16, 18, 23, 32, 41, 44, 45, 49, 50, 51])]

In [22]:
#Some states conducted multiple versions of the BRFSS. Creating a data frame
#that contains the BRFSS data denoted 'version one'. Selecting a subset of this
#data frame that only includes states that asked the questions pertaining to
#cognitive decline.
brfssV1 = pd.read_csv("/content/SurveyV1.csv")
v1CogData = brfssV1[brfssV1["_STATE"].isin([4, 6, 26])]

In [23]:
#Creating a data frame that contains the BRFSS data denoted 'version two'.
#Selecting a subset of this data frame that only includes the states that
#asked the questions pertaining to cognitive decline.
brfssV2 = pd.read_csv("/content/SurveyV2.csv")
v2CogData = brfssV2[brfssV2["_STATE"].isin([9, 19, 39])]

In [24]:
#Concatenating the three data frames. All states included in this concatenated
#data frame asked survey participants to answer the cognitive decline section.
cogData = pd.concat([coreCogData, v1CogData, v2CogData], ignore_index=True, axis=0)
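
The filter-then-concatenate pattern used in the cells above can be sketched on toy data, together with two sanity checks that may be worth running on the real frames. The frames, values, and the `keep` mapping below are hypothetical stand-ins, not the actual survey files.

```python
import pandas as pd

# Toy stand-ins for the three survey files (all values are invented).
core = pd.DataFrame({"_STATE": [12, 16, 4, 9], "GENHLTH": [1, 2, 3, 4]})
v1 = pd.DataFrame({"_STATE": [4, 6, 12], "GENHLTH": [2, 2, 5]})
v2 = pd.DataFrame({"_STATE": [9, 19, 39], "GENHLTH": [1, 3, 3]})

# Hypothetical mapping: which state codes to keep from which file.
keep = {"core": [12, 16], "v1": [4, 6], "v2": [9, 19, 39]}

# Check 1: no state code is selected from more than one file.
all_codes = [code for codes in keep.values() for code in codes]
assert len(all_codes) == len(set(all_codes))

# Filter each file to its own states, then stack the rows.
parts = [
    core[core["_STATE"].isin(keep["core"])],
    v1[v1["_STATE"].isin(keep["v1"])],
    v2[v2["_STATE"].isin(keep["v2"])],
]
combined = pd.concat(parts, ignore_index=True, axis=0)

# Check 2: row count and number of distinct states in the combined frame.
print(len(combined), combined["_STATE"].nunique())
```

Comparing the number of distinct states in the combined frame against the number expected from the survey documentation is a quick way to catch a missing or duplicated state code.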

In [25]:
#Removing the underscore from the _STATE attribute name.
cogData.rename(columns={'_STATE': 'STATE'}, inplace=True)

#Sorting the attribute names alphabetically.
cogData = cogData.sort_index(axis=1)

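#Attributes that start with an underscore were calculated by CDC staff using other attributes.
#For example, _BMI5 is the computed body mass index, which was calculated using the
#height and weight variables.
#Removing the calculated attributes from _AIDTST4 through _YRSSMOK. Calculated attributes
#that sort before _AIDTST4 (such as the _AGE attributes converted below) are kept.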
cogData = cogData.drop(cogData.loc[:, '_AIDTST4':'_YRSSMOK'].columns, axis=1)

#Using the .astype() method to give categorical variables the 'category' data type
for col in ['STATE', 'FMONTH', 'IMONTH', 'IDAY', 'IYEAR', 'DISPCODE', 'CTELENM1', 'PVTRESD1', 'COLGHOUS', 'STATERE1', 'CELPHON1',
            'LADULT1', 'COLGSEX1', 'LANDSEX1', 'RESPSLCT',  'SAFETIME', 'CTELNUM1', 'CELLFON5',
            'CADULT1', 'CELLSEX1', 'PVTRESD3', 'CCLGHOUS', 'CSTATE1', 'LANDLINE', 'SEXVAR', 'GENHLTH',
            'PRIMINSR', 'PERSDOC3', 'MEDCOST1', 'CHECKUP1', 'EXERANY2',  'LASTDEN4', 'RMVTETH4',
            'CVDINFR4', 'CVDCRHD4', 'CVDSTRK3', 'ASTHMA3', 'ASTHNOW', 'CHCSCNC1', 'CHCOCNC1',
            'CHCCOPD3', 'ADDEPEV3', 'CHCKDNY2', 'HAVARTH4', 'DIABETE4', 'MARITAL', 'EDUCA',
            'RENTHOM1', 'NUMHHOL4', 'VETERAN3', 'EMPLOY1', 'INCOME3', 'PREGNANT',
            'DEAF', 'BLIND', 'DECIDE', 'DIFFWALK', 'DIFFDRES', 'DIFFALON', 'HADMAM',
            'HOWLONG', 'CERVSCRN', 'CRVCLCNC', 'CRVCLPAP', 'CRVCLHPV', 'HADHYST2',
            'HADSIGM4', 'COLNSIGM', 'COLNTES1', 'SIGMTES1', 'LASTSIG4',
            'COLNCNCR', 'VIRCOLO1', 'VCLNTES2', 'SMALSTOL', 'STOLTEST', 'STOOLDN2',
            'BLDSTFIT', 'SDNATES1', 'SMOKE100', 'SMOKDAY2', 'USENOW3', 'ECIGNOW2',
            'LCSCTSC1', 'LCSSCNCR', 'LCSCTWHN', 'FLUSHOT7', 'PNEUVAC4', 'TETANUS1', 'HIVTST7',
            'HIVRISK5', 'COVIDPOS', 'COVIDSMP', 'COVIDPRM', 'PDIABTS1', 'PREDIAB2', 'DIABTYPE', 'INSULIN1',
            'EYEEXAM1', 'DIABEYE1', 'DIABEDU1', 'FEETSORE', 'IMFVPLA3', 'HPVADVC4', 'HPVADSHT', 'SHINGLE2', 'COVIDVA1',
            'COVACGET', 'COVIDNU1', 'COVIDINT', 'COPDCOGH', 'COPDFLEM', 'COPDBRTH', 'COPDBTST', 'CNCRDIFF', 'CNCRTYP2',
            'CSRVTRT3', 'CSRVDOC1', 'CSRVSUM', 'CSRVRTRN', 'CSRVINST', 'CSRVINSR', 'CSRVDEIN', 'CSRVCLIN', 'CSRVPAIN',
            'CSRVCTL2', 'PSATEST1', 'PSATIME1', 'PCPSARS2', 'PSASUGST', 'PCSTALK1',
            'CIMEMLOS', 'CDHOUSE', 'CDASSIST', 'CDHELP', 'CDSOCIAL', 'CDDISCUS', 'CAREGIV1', 'CRGVREL4',
            'CRGVLNG1', 'CRGVHRS1', 'CRGVPRB3', 'CRGVALZD', 'CRGVPER1', 'CRGVHOU1', 'CRGVEXPT',
            'ACEDEPRS', 'ACEDRINK', 'ACEDRUGS', 'ACEPRISN', 'ACEDIVRC', 'ACEPUNCH', 'ACEHURT1',
            'ACESWEAR', 'ACETOUCH', 'ACETTHEM', 'ACEHVSEX', 'ACEADSAF', 'ACEADNED', 'LSATISFY', 'EMTSUPRT',
            'SDHISOLT', 'SDHEMPLY', 'FOODSTMP', 'SDHFOOD1', 'SDHBILLS', 'SDHUTILS', 'SDHTRNSP', 'SDHSTRE1',
            'MARJSMOK', 'MARJEAT', 'MARJVAPE', 'MARJDAB', 'MARJOTHR', 'LASTSMK2', 'STOPSMK2', 'MENTCIGS', 'MENTECIG',
            'HEATTBCO', 'ASBIALCH', 'ASBIDRNK', 'ASBIBING', 'ASBIADVC', 'ASBIRDUC', 'FIREARM5', 'GUNLOAD', 'LOADULK2',
            'RCSGEND1', 'RCSXBRTH', 'RCSRLTN2', 'CASTHDX2', 'CASTHNO2', 'BIRTHSEX', 'SOMALE', 'SOFEMALE', 'TRNSGNDR', 'HADSEX',
            'PFPPRVN4', 'TYPCNTR9', 'BRTHCNT4', 'WHEREGET', 'NOBCUSE8', 'BCPREFER', 'RRCLASS3', 'RRCOGNT2', 'RRTREAT', 'RRATWRK2',
            'RRHCARE4', 'RRPHYSM2', 'QSTVER', 'QSTLANG', 'MSCODE', 'CAGEG', '_AGEG5YR', '_AGE65YR', '_AGE80', '_AGE_G', 'DRNKANY6']:
    cogData[col] = cogData[col].astype('category')
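
The cell above relies on two pandas behaviours: alphabetical column sorting places underscore-prefixed names after upper-case names, and label-based slices with `.loc` include both endpoints. The sketch below illustrates both on an invented toy frame, then converts columns to `category` using a dictionary passed to `.astype()`. Skipping names that are absent from the frame is an assumption added here for illustration (the original loop would raise a `KeyError` instead), and `NOT_IN_DATA` is a hypothetical column name.

```python
import pandas as pd

# Toy frame with a few BRFSS-style column names (all values are invented).
toy = pd.DataFrame({
    "_STATE": [12, 16],
    "GENHLTH": [1, 3],
    "_AGE80": [45, 80],
    "_AIDTST4": [1, 2],
    "_BMI5": [2500, 3100],
    "_YRSSMOK": [0, 10],
    "WEIGHT2": [150, 200],
})

# Rename, then sort columns; '_' sorts after upper-case letters.
toy = toy.rename(columns={"_STATE": "STATE"}).sort_index(axis=1)
print(list(toy.columns))

# Label slices include both endpoints: this drops '_AIDTST4' through
# '_YRSSMOK' but keeps '_AGE80', which sorts before '_AIDTST4'.
trimmed = toy.drop(columns=toy.loc[:, "_AIDTST4":"_YRSSMOK"].columns)
print(list(trimmed.columns))

# Convert only the listed columns that are actually present.
wanted = ["STATE", "GENHLTH", "_AGE80", "NOT_IN_DATA"]  # last name is hypothetical
present = [c for c in wanted if c in trimmed.columns]
trimmed = trimmed.astype({c: "category" for c in present})
print(trimmed.dtypes)
```

Passing a dictionary to `.astype()` converts all listed columns in one call, which avoids repeated column assignment on a wide frame.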

In [26]:
#Using the ydata_profiling library to generate an exploratory data analysis report
edaCognitive = ProfileReport(cogData, title="Exploratory Data Analysis (Cognitive Decline)", minimal=True)

In [27]:
#Saving this exploratory data analysis report to an .html file.
edaCognitive.to_file("exploratory_data_analysis_cognitive_decline.html")


