## Indicators of Anxiety or Depression Based on Reported Frequency of Symptoms During Last 7 Days
Dataset: https://data.cdc.gov/NCHS/Indicators-of-Anxiety-or-Depression-Based-on-Repor/8pt5-q6wp <br>
How this dataset was used: https://www.cdc.gov/nchs/covid19/pulse/mental-health.htm

"Estimates are weighted to adjust for nonresponse and to match Census Bureau estimates of the population by age, gender, race and ethnicity, and educational attainment." The data show the percentage of adults who report symptoms of anxiety or depression that have been shown to be associated with diagnoses of generalized anxiety disorder or major depressive disorder. These symptoms generally occur more than half the days or nearly every day.

2019 Benchmarks to compare against the data collected:<br>
-8.1% of adults aged 18 and over had symptoms of anxiety disorder<br>
-6.5% had symptoms of depressive disorder<br>
-10.8% had symptoms of anxiety disorder or depressive disorder
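
One way to use these benchmarks is to subtract them from the survey estimates. The following is a minimal sketch, not part of the original analysis: `BENCHMARKS_2019` and `add_benchmark_gap` are illustrative names, the pairing of each benchmark with an `Indicator` label is an assumption, and the sample row is the first national estimate (Apr 23 - May 5, 2020) from the dataset.

```python
import pandas as pd

# 2019 benchmark percentages quoted above, keyed by the dataset's Indicator labels.
# The pairing of benchmark to label is an assumption made for illustration.
BENCHMARKS_2019 = {
    'Symptoms of Anxiety Disorder': 8.1,
    'Symptoms of Depressive Disorder': 6.5,
    'Symptoms of Anxiety Disorder or Depressive Disorder': 10.8,
}

def add_benchmark_gap(frame):
    """Return a copy with the 2019 benchmark and the gap in percentage points."""
    out = frame.copy()
    out['Benchmark 2019'] = out['Indicator'].map(BENCHMARKS_2019)
    out['Gap vs 2019'] = (out['Value'] - out['Benchmark 2019']).round(1)
    return out

# One row taken from the first national estimate in the dataset
sample = pd.DataFrame({
    'Indicator': ['Symptoms of Depressive Disorder'],
    'Value': [23.5],
})
gap = add_benchmark_gap(sample)
```

The gap is in percentage points, so it can be compared directly across indicators and time periods.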

The National Center for Health Statistics conducted this survey in 3 main phases, with Phase 3 extended through sub-phases 3.1 to 3.4. <br>

Phase 1 04/23/2020 - 07/21/2020<br>
Phase 2 08/19/2020 - 10/28/2020<br>
Phase 3 10/28/2020 - 03/29/2021<br>
Phase 3.1 04/14/2021 - 07/05/2021<br>
Phase 3.2 07/21/2021 - 10/11/2021\*<br>
Phase 3.3 12/01/2021 - 02/07/2022<br>
Phase 3.4 02/23/2022 - 05/02/2022<br>
\*Through Phase 3.1, the survey asked about symptoms over the "last 7 days". <br> Beginning in Phase 3.2, it asked about the "last two weeks".

In [6]:
#import libraries

import pandas as pd   

In [7]:
#load dataset

df = pd.read_csv('AnxietyData.csv', parse_dates=['Time Period Start Date', 'Time Period End Date'])


# EDA

In [107]:
df.head()

Unnamed: 0,Indicator,Group,State,Subgroup,Phase,Time Period,Time Period Label,Time Period Start Date,Time Period End Date,Value,Low CI,High CI,Confidence Interval,Quartile Range
0,Symptoms of Depressive Disorder,National Estimate,United States,United States,1,1,"Apr 23 - May 5, 2020",2020-04-23,2020-05-05,23.5,22.7,24.3,22.7 - 24.3,
1,Symptoms of Depressive Disorder,By Age,United States,18 - 29 years,1,1,"Apr 23 - May 5, 2020",2020-04-23,2020-05-05,32.7,30.2,35.2,30.2 - 35.2,
2,Symptoms of Depressive Disorder,By Age,United States,30 - 39 years,1,1,"Apr 23 - May 5, 2020",2020-04-23,2020-05-05,25.7,24.1,27.3,24.1 - 27.3,
3,Symptoms of Depressive Disorder,By Age,United States,40 - 49 years,1,1,"Apr 23 - May 5, 2020",2020-04-23,2020-05-05,24.8,23.3,26.2,23.3 - 26.2,
4,Symptoms of Depressive Disorder,By Age,United States,50 - 59 years,1,1,"Apr 23 - May 5, 2020",2020-04-23,2020-05-05,23.2,21.5,25.0,21.5 - 25.0,


In [108]:
df.dtypes

Indicator                         object
Group                             object
State                             object
Subgroup                          object
Phase                             object
Time Period                        int64
Time Period Label                 object
Time Period Start Date    datetime64[ns]
Time Period End Date      datetime64[ns]
Value                            float64
Low CI                           float64
High CI                          float64
Confidence Interval               object
Quartile Range                    object
dtype: object

In [109]:
#values in the Group column 
df['Group'].unique()

#national estimate in Group column seems different than rest of values

array(['National Estimate', 'By Age', 'By Sex',
       'By Race/Hispanic ethnicity', 'By Education', 'By State',
       'By Disability status', 'By Gender identity',
       'By Sexual orientation'], dtype=object)

In [153]:
df['Subgroup'].unique()
#in the article in which this dataset was used, they split up the subgroup categories into y axis categories. 
#perhaps there is no need to clean this column, but rather create new columns out of each subgroup (age, state, race, orientation)


array(['United States', '18 - 29 years', '30 - 39 years', '40 - 49 years',
       '50 - 59 years', '60 - 69 years', '70 - 79 years',
       '80 years and above', 'Male', 'Female', 'Hispanic or Latino',
       'Non-Hispanic White, single race',
       'Non-Hispanic Black, single race',
       'Non-Hispanic Asian, single race',
       'Non-Hispanic, other races and multiple races',
       'Less than a high school diploma', 'High school diploma or GED',
       "Some college/Associate's degree", "Bachelor's degree or higher",
       'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
       'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New Jersey', 'New Mexico', 'New York', 'North Carolina
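
The idea in the comment above, creating new columns out of each subgroup category, could be sketched as follows. This is an illustration under assumptions: `subgroup_columns` is a hypothetical helper, and the new columns are named after the `Group` values (for example `'By Age'`) so they do not collide with the existing `State` column.

```python
import pandas as pd

def subgroup_columns(frame):
    """Add one column per Group value holding the Subgroup label for matching rows."""
    out = frame.copy()
    for group in out['Group'].unique():
        # Rows that belong to another group get a missing value in this column
        out[group] = out['Subgroup'].where(out['Group'] == group)
    return out

# Small hand-made sample using labels that appear in the dataset
sample = pd.DataFrame({
    'Group': ['By Age', 'By Sex', 'By State'],
    'Subgroup': ['18 - 29 years', 'Female', 'Texas'],
})
wide = subgroup_columns(sample)
```

Each row ends up with exactly one populated subgroup column, which makes it easy to filter or plot one category at a time.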

In [110]:
#values in the Indicators column
df['Indicator'].unique()

array(['Symptoms of Depressive Disorder', 'Symptoms of Anxiety Disorder',
       'Symptoms of Anxiety Disorder or Depressive Disorder'],
      dtype=object)

In [111]:
#Data collection starts 04/23/2020
df['Time Period Start Date'].min()

Timestamp('2020-04-23 00:00:00')

In [112]:
#Data collection ends 01/10/2022
df['Time Period End Date'].max()

Timestamp('2022-01-10 00:00:00')

In [93]:
df['State'].unique()

array(['United States', 'Alabama', 'Alaska', 'Arizona', 'Arkansas',
       'California', 'Colorado', 'Connecticut', 'Delaware',
       'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
       'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)

In [130]:
df['Phase'].unique()

#considering replacing the values that are not 1,2,3 by using the date ranges

array(['1', '-1', '2', '3 (Oct 28 � Dec 21)', '3 (Jan 6 � Mar 29)', '3.1',
       '3.2', '3.3'], dtype=object)
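
A possible way to replace the irregular `Phase` labels by using the date ranges, as considered above. This is a sketch under assumptions: the windows are copied from the phase list at the top of this notebook, `PHASE_WINDOWS` and `phase_from_dates` are illustrative names, and a period that fits no window (such as the July 22 - Aug 18, 2020 break labelled `-1`) is returned as `None`.

```python
import pandas as pd

# Phase windows copied from the phase list at the top of this notebook
PHASE_WINDOWS = [
    ('1',   '2020-04-23', '2020-07-21'),
    ('2',   '2020-08-19', '2020-10-28'),
    ('3',   '2020-10-28', '2021-03-29'),
    ('3.1', '2021-04-14', '2021-07-05'),
    ('3.2', '2021-07-21', '2021-10-11'),
    ('3.3', '2021-12-01', '2022-02-07'),
    ('3.4', '2022-02-23', '2022-05-02'),
]

def phase_from_dates(start, end):
    """Return the first phase whose window contains [start, end], or None for a gap."""
    for label, window_start, window_end in PHASE_WINDOWS:
        if pd.Timestamp(window_start) <= start and end <= pd.Timestamp(window_end):
            return label
    return None

# Two periods from the dataset: the first Phase 1 period and the break labelled -1
sample = pd.DataFrame({
    'Time Period Start Date': pd.to_datetime(['2020-04-23', '2020-07-22']),
    'Time Period End Date': pd.to_datetime(['2020-05-05', '2020-08-18']),
})
sample['Phase clean'] = [
    phase_from_dates(start, end)
    for start, end in zip(sample['Time Period Start Date'],
                          sample['Time Period End Date'])
]
```

Working from the dates sidesteps the garbled separator in the two Phase 3 labels, since the label text is never parsed.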

In [131]:
#check for any missing values
df.isnull().sum()

Indicator                    0
Group                        0
State                        0
Subgroup                     0
Phase                        0
Time Period                  0
Time Period Label            0
Time Period Start Date       0
Time Period End Date         0
Value                      297
Low CI                     297
High CI                    297
Confidence Interval        297
Quartile Range            2862
dtype: int64

In [134]:
#missing values for the Value column
df[df['Value'].isna()].head()

Unnamed: 0,Indicator,Group,State,Subgroup,Phase,Time Period,Time Period Label,Time Period Start Date,Time Period End Date,Value,Low CI,High CI,Confidence Interval,Quartile Range
2520,Symptoms of Depressive Disorder,National Estimate,United States,United States,-1,1,"July 22 - Aug 18, 2020",2020-07-22,2020-08-18,,,,,
2521,Symptoms of Depressive Disorder,By Age,United States,18 - 29 years,-1,1,"July 22 - Aug 18, 2020",2020-07-22,2020-08-18,,,,,
2522,Symptoms of Depressive Disorder,By Age,United States,30 - 39 years,-1,1,"July 22 - Aug 18, 2020",2020-07-22,2020-08-18,,,,,
2523,Symptoms of Depressive Disorder,By Age,United States,40 - 49 years,-1,1,"July 22 - Aug 18, 2020",2020-07-22,2020-08-18,,,,,
2524,Symptoms of Depressive Disorder,By Age,United States,50 - 59 years,-1,1,"July 22 - Aug 18, 2020",2020-07-22,2020-08-18,,,,,


In [10]:
df[df['Value'].isna()]['Group'].unique()


array(['National Estimate', 'By Age', 'By Sex',
       'By Race/Hispanic ethnicity', 'By Education',
       'By Disability status'], dtype=object)

In [3]:
#function takes a value from the Group column as the parameter and returns the count of NA
def check_na_count(group):
    """Summarize how many rows of one Group category have a missing Value."""
    #relies on the global df loaded above
    groupdf = df[df['Group'] == group]
    length_of_na = int(groupdf['Value'].isna().sum())
    length_of_df = len(groupdf)

    return f"Out of {length_of_df}, there are {length_of_na} missing values"

In [8]:
#values we can pass in (must match the Group column exactly, including case):
#'National Estimate', 'By Age', 'By Sex', 'By Race/Hispanic ethnicity', 'By Education', 'By State', 'By Disability status', 'By Gender identity', 'By Sexual orientation'
check_na_count('By Age')

'Out of 966, there are 105 missing values'

In [9]:
check_na_count('By Race/Hispanic ethnicity')

'Out of 690, there are 75 missing values'

In [11]:
check_na_count('')

'Out of 0, there are 0 missing values'
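
As the empty-string call shows, a group name that matches nothing silently reports 0 of 0. A grouped summary avoids typing the names at all. The sketch below is one way to do it, assuming the same `Group` and `Value` columns; `na_summary` is a hypothetical helper and the sample frame is made up for illustration.

```python
import pandas as pd

def na_summary(frame):
    """Return total rows, rows with a missing Value, and percent missing per Group."""
    summary = (
        frame.assign(missing=frame['Value'].isna())
             .groupby('Group')['missing']
             .agg(total='count', missing='sum')
    )
    summary['percent_missing'] = (100 * summary['missing'] / summary['total']).round(1)
    return summary

# Made-up sample: one missing value in 'By Age', one in 'By Sex', none in 'By State'
sample = pd.DataFrame({
    'Group': ['By Age', 'By Age', 'By Sex', 'By State'],
    'Value': [32.7, None, None, 24.4],
})
table = na_summary(sample)
```

On the counts reported above (105 of 966 for By Age, 75 of 690 for By Race/Hispanic ethnicity), both groups work out to about 10.9% missing.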

In [1]:
# I want to turn this line into a function that outputs: "out of __, there are ___ missing values"
len(age) - len(age[age['Value'].isna()])

NameError: name 'age' is not defined

After looking through the data, some questions to consider:

Q: Did any state report a higher percentage of symptoms than the others (anxiety, depression, or both)? <br>
Q: Is there a group/subgroup affected more than others? <br>
Q: Did symptoms get worse, better, or same with each Phase? <br>
Q: Is there a Phase that stands out? Would be interesting to try to find contributors (was the state in lockdown, did mask requirements change)

Possible projects:<br>
-focus on Texas statistics only<br>
-select a few states with interesting statistics<br>
-show how state mandates have some correlation with anxiety/depression/both<br>
-add unemployment data to compare with this dataset?

# Objective:
1. Isolate Texas statistics
2. Find trends in Age subgroup.
3. Find trends in Race subgroup.
4. Find trends in percentage of anxiety, depression, both across different study periods. 


# Isolate Texas statistics

In [152]:
#create a dataframe for the state of Texas only
Texas = df[df.State == 'Texas']

Texas.head()

Unnamed: 0,Indicator,Group,State,Subgroup,Phase,Time Period,Time Period Label,Time Period Start Date,Time Period End Date,Value,Low CI,High CI,Confidence Interval,Quartile Range
62,Symptoms of Depressive Disorder,By State,Texas,Texas,1,1,"Apr 23 - May 5, 2020",2020-04-23,2020-05-05,24.4,21.1,27.9,21.1 - 27.9,24.1 - 28.7
132,Symptoms of Anxiety Disorder,By State,Texas,Texas,1,1,"Apr 23 - May 5, 2020",2020-04-23,2020-05-05,29.7,25.9,33.7,25.9 - 33.7,27.9 - 30.3
202,Symptoms of Anxiety Disorder or Depressive Dis...,By State,Texas,Texas,1,1,"Apr 23 - May 5, 2020",2020-04-23,2020-05-05,34.9,30.8,39.2,30.8 - 39.2,34.8 - 36.7
272,Symptoms of Depressive Disorder,By State,Texas,Texas,1,2,"May 7 - May 12, 2020",2020-05-07,2020-05-12,25.7,21.8,30.0,21.8 - 30.0,25.7 - 35.5
342,Symptoms of Anxiety Disorder,By State,Texas,Texas,1,2,"May 7 - May 12, 2020",2020-05-07,2020-05-12,32.4,28.2,36.9,28.2 - 36.9,31.6 - 38.3


# Trends by Age

In [188]:
#in the Texas df, look at By Age in Group column
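
Before filtering, it is worth checking which groups the Texas frame actually contains: in the rows shown earlier, the By Age rows carry `State == 'United States'` while the Texas rows carry `Group == 'By State'`, so the age breakdown may only exist at the national level. The sketch below assumes that is the case; `group_trend` is a hypothetical helper and the sample rows are taken from the `df.head()` output.

```python
import pandas as pd

def group_trend(frame, group, indicator):
    """Pivot one Group/Indicator slice: rows are period start dates, columns are subgroups."""
    subset = frame[(frame['Group'] == group) & (frame['Indicator'] == indicator)]
    return subset.pivot_table(
        index='Time Period Start Date',
        columns='Subgroup',
        values='Value',
    )

# Two By Age rows from the df.head() output above
sample = pd.DataFrame({
    'Indicator': ['Symptoms of Depressive Disorder'] * 2,
    'Group': ['By Age'] * 2,
    'State': ['United States'] * 2,
    'Subgroup': ['18 - 29 years', '30 - 39 years'],
    'Time Period Start Date': pd.to_datetime(['2020-04-23', '2020-04-23']),
    'Value': [32.7, 25.7],
})

# Which groups exist for Texas? Empty here, because the sample has only national rows
groups_in_texas = sample.loc[sample['State'] == 'Texas', 'Group'].unique()

by_age = group_trend(sample, 'By Age', 'Symptoms of Depressive Disorder')
```

The same helper can be called with `'By Race/Hispanic ethnicity'` for the race section, and the resulting table can be passed to `.plot()` to see each subgroup as a line over time.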

# Trends by Race

In [163]:
#in the Texas df, look at By Race in the Group Column

# Find trends in the 3 symptoms across the 3 phases 

This is a bigger task, so it is best to break it down into smaller parts. I will start with Phase 1 and Symptoms of Depressive Disorder, then work through the other combinations:
- Phase 2 & Depressive Disorder
- Phase 3 & Depressive Disorder
- Phase 1 & Anxiety Disorder
- Phase 2 & Anxiety Disorder
- and so forth.

I could create a function that takes the parameters (phase, symptom).

In [164]:
#in the Texas df, should I create a df of phases and df of symptoms and then use those to locate what I need? 
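
Rather than building separate frames for phases and symptoms, a single function with the parameters (phase, symptom) may be enough. The following is a minimal sketch: `select_phase_symptom` is a hypothetical name, it matches labels exactly, and the sample rows are the Texas rows shown in the `Texas.head()` output.

```python
import pandas as pd

def select_phase_symptom(frame, phase, symptom):
    """Return rows whose Phase and Indicator equal the given labels, sorted by start date."""
    # Phase is stored as text in the dataset, so compare as strings
    mask = (frame['Phase'].astype(str) == str(phase)) & (frame['Indicator'] == symptom)
    return frame[mask].sort_values('Time Period Start Date')

# Texas rows copied from the Texas.head() output above
sample = pd.DataFrame({
    'Indicator': [
        'Symptoms of Depressive Disorder',
        'Symptoms of Anxiety Disorder',
        'Symptoms of Depressive Disorder',
        'Symptoms of Anxiety Disorder',
    ],
    'Phase': ['1', '1', '1', '1'],
    'Time Period Start Date': pd.to_datetime(
        ['2020-04-23', '2020-04-23', '2020-05-07', '2020-05-07']
    ),
    'Value': [24.4, 29.7, 25.7, 32.4],
})

phase1_depression = select_phase_symptom(sample, 1, 'Symptoms of Depressive Disorder')
values = phase1_depression['Value'].tolist()
```

Because the match is exact, passing `3` returns nothing while the Phase 3 labels in the data still carry their date ranges; those labels would need to be normalized first.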

# Example of a function

In [None]:
#example
def function_name(parameters):
    # do something
    print("hello!")
    return None

In [74]:
def double(number):
    output = number+number
    return output

In [75]:
double(5)

10

In [76]:
double(15)

30