# Initial Data Exploration - CA Adult Depression Rates

* This notebook begins exploring the main data set I've found for my project: adult depression rates in CA from 2012 to 2018.
https://data.chhs.ca.gov/dataset/adult-depression-lghc-indicator-24/resource/724c6fd8-a645-4e52-b63f-32631a20db5d

The data set is a Let's Get Healthy California (LGHC) indicator that reports the proportion of adults who were ever told they had a depressive disorder. According to the source, the data come from the California Behavioral Risk Factor Surveillance Survey (BRFSS).

This indicator is based on the question: “Has a doctor, nurse or other health professional EVER told you that you have a depressive disorder (including depression, major depression, dysthymia, or minor depression)?”

I will look through the rows and columns, rename the columns so they are easier to work with, clean the data as necessary, and make some initial observations. Then I will save the cleaned data file and start my data analysis in the next folder.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import math

In [2]:
# Read in the data set:

depress_df = pd.read_csv('../data/Raw/adult_depression_CA_2012_to_2018.csv')

In [3]:
depress_df.shape

(161, 8)

* The data frame has 161 rows and 8 columns.

In [4]:
depress_df

Unnamed: 0,Year,Strata,Strata Name,Frequency,Weighted Frequency,Percent,Lower 95% CL,Upper 95% CL
0,2012,Total,Total,1920,,11.74,11.11,12.37
1,2012,Sex,Male,561,1116664.0,8.12,7.32,8.92
2,2012,Sex,Female,1359,2163108.0,15.25,14.30,16.20
3,2012,Race-Ethnicity,White,1314,1806371.0,14.57,13.67,15.46
4,2012,Race-Ethnicity,Black,97,222022.0,13.54,10.44,16.65
...,...,...,...,...,...,...,...,...
156,2018,Age,18 to 34,496,1623933.0,17.69,13.72,21.66
157,2018,Age,35 to 44,285,749615.0,14.56,10.91,18.21
158,2018,Age,45 to 54,301,1052945.0,20.06,15.60,24.52
159,2018,Age,55 to 64,432,854201.0,21.44,17.65,25.23


### Observations

* Each of the 161 rows describes one demographic or socioeconomic group in one year (2012 to 2018): the group's category, the frequency (count) of people in that group who were told at some point in their lives that they had a depressive disorder, and the corresponding percentage with its 95% confidence limits.
* I will be most interested in the frequency and percent columns. Because the Weighted Frequency values are in the millions while Frequency is in the hundreds or low thousands, Frequency appears to be the unweighted number of survey respondents rather than a population estimate. The demographic breakdown will also be useful for discovering differences in depression rates among groups.

* To get a better look at the data, I will inspect it in the next several cells:

In [5]:
depress_df.head()

Unnamed: 0,Year,Strata,Strata Name,Frequency,Weighted Frequency,Percent,Lower 95% CL,Upper 95% CL
0,2012,Total,Total,1920,,11.74,11.11,12.37
1,2012,Sex,Male,561,1116664.0,8.12,7.32,8.92
2,2012,Sex,Female,1359,2163108.0,15.25,14.3,16.2
3,2012,Race-Ethnicity,White,1314,1806371.0,14.57,13.67,15.46
4,2012,Race-Ethnicity,Black,97,222022.0,13.54,10.44,16.65


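* One thing to note is that we have "Total" rows (where both Strata and Strata Name are "Total"), which give the overall figures for a year across all groups. For example, in 2012 the Total row has a frequency of 1,920, meaning 1,920 people in the data reported ever being told they had a depressive disorder. The Weighted Frequency is missing (NaN) in the Total rows shown.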
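To work with the yearly totals on their own, the Total rows can be filtered out of the frame. The following is a minimal sketch, assuming a small toy frame (`toy_df`, a hypothetical name) built only from rows displayed in this notebook; on the real data the same filter would be applied to `depress_df`.

```python
import pandas as pd

# Toy frame built from rows shown in this notebook (not the full 161 rows).
toy_df = pd.DataFrame({
    "Year": [2012, 2012, 2012, 2015],
    "Strata": ["Total", "Sex", "Sex", "Total"],
    "Strata Name": ["Total", "Male", "Female", "Total"],
    "Frequency": [1920, 561, 1359, 1848],
    "Percent": [11.74, 8.12, 15.25, 12.92],
})

# Keep only the overall row for each year.
totals = toy_df[toy_df["Strata"] == "Total"]

# Index by year so a single year's total is easy to look up.
total_by_year = totals.set_index("Year")["Frequency"]
print(total_by_year)
```

Separating the Total rows early also makes it easier to avoid mixing them with the per-group rows in later summaries.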

In [6]:
# Check the data type:

depress_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161 entries, 0 to 160
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Year                161 non-null    int64  
 1   Strata              161 non-null    object 
 2   Strata Name         161 non-null    object 
 3   Frequency           161 non-null    int64  
 4   Weighted Frequency  154 non-null    float64
 5   Percent             161 non-null    float64
 6   Lower 95% CL        161 non-null    float64
 7   Upper 95% CL        161 non-null    float64
dtypes: float64(4), int64(2), object(2)
memory usage: 10.2+ KB


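The summary shows two integer columns (Year and Frequency), four float columns, and two object (text) columns. The numeric columns are the ones that will help most in the initial quantitative exploration. It also shows that Weighted Frequency has only 154 non-null values out of 161, so 7 rows are missing that value.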
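The missing Weighted Frequency values can be counted and located directly. This is a minimal sketch on a toy frame (`toy_df`, a hypothetical name) built from three rows displayed above. Whether all 7 missing values in the real data fall on Total rows is an assumption that running the same lines on `depress_df` would confirm or refute.

```python
import numpy as np
import pandas as pd

# Toy frame using values displayed in this notebook.
toy_df = pd.DataFrame({
    "Strata": ["Total", "Sex", "Sex"],
    "Weighted Frequency": [np.nan, 1116664.0, 2163108.0],
})

# Number of missing values in each column.
missing_per_column = toy_df.isna().sum()

# The rows where the weighted frequency is missing.
rows_missing_weight = toy_df[toy_df["Weighted Frequency"].isna()]

print(missing_per_column)
print(rows_missing_weight)
```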

In [7]:
depress_df.sample(15)

Unnamed: 0,Year,Strata,Strata Name,Frequency,Weighted Frequency,Percent,Lower 95% CL,Upper 95% CL
36,2013,Income,"$20,000 - $34,999",283,560739.0,14.57,12.5,16.64
88,2015,Age,35 to 44,306,733597.0,14.21,11.55,16.87
11,2012,Education,College Graduate or Post Grad,717,1040822.0,10.0,9.11,10.9
110,2016,Age,18 to 34,357,1086345.0,11.84,9.25,14.43
49,2014,Race-Ethnicity,White,832,2050639.0,16.99,15.49,18.49
74,2015,Race-Ethnicity,Hispanic,439,1004173.0,11.09,9.27,12.91
28,2013,Race-Ethnicity,Hispanic,403,1011594.0,10.96,9.72,12.2
3,2012,Race-Ethnicity,White,1314,1806371.0,14.57,13.67,15.46
20,2012,Age,45 to 54,409,770238.0,14.67,13.14,16.2
69,2015,Total,Total,1848,,12.92,11.98,13.87


* The ".sample" method is useful because it returns a random selection of rows from the whole data frame. In this case the rows shown span several years and several categories, including income, age, education, and race-ethnicity, which gives a better sense of the variety of groups the data cover.
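Because ".sample" draws rows at random, each run of the cell shows different rows. Passing `random_state` makes the draw repeatable. The sketch below uses a placeholder frame with 161 rows (`placeholder_df`, a hypothetical name) rather than the real data, and the seed value 42 is arbitrary.

```python
import pandas as pd

# Placeholder frame with the same number of rows as the real data.
placeholder_df = pd.DataFrame({"row_id": range(161)})

# The same seed gives the same 15 rows on every run.
first_draw = placeholder_df.sample(15, random_state=42)
second_draw = placeholder_df.sample(15, random_state=42)

print(first_draw.equals(second_draw))
```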

In [8]:
# Look through the columns:

depress_df.columns

Index(['Year', 'Strata', 'Strata Name', 'Frequency', 'Weighted Frequency',
       'Percent', 'Lower 95% CL', 'Upper 95% CL'],
      dtype='object')

* The columns are listed above. The most important ones are 'Year', 'Strata', 'Strata Name', 'Frequency', and 'Percent', because they give the year, the specific demographic category, and the count and percentage of people who have been told they had a depressive disorder.

## How can I understand the columns better?

Because I want to make the column names easier to understand and work with, I will now use a dictionary to rename some of the columns we have in our data frame!

* For convenience when coding, I like lowercase column names, so 'year' instead of 'Year', for example.
* Additionally:

 1) I prefer the name 'category' instead of 'Strata' for each demographic (and, accordingly, 'category_name' instead of 'Strata Name');

 2) I prefer 'count' over 'Frequency', and likewise 'weighted_count' over 'Weighted Frequency';

 3) For the 95% CL columns, I know the level is 95%, so I drop that number and make the names lowercase like the rest.

In [9]:
# Renaming the columns to fit my preferences and for ease when coding:

cname_dict = {
    'Year' : 'year',
    'Strata' : 'category',
    'Strata Name' : 'category_name',
    'Frequency' : 'count',
    'Weighted Frequency' : 'weighted_count',
    'Percent' : 'percent',
    'Lower 95% CL' : 'lower_cl',
    'Upper 95% CL' : 'upper_cl'
}
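One thing to be aware of: by default, `rename` silently ignores dictionary keys that do not match any column, so a typo in a key would go unnoticed. A minimal sketch of a safeguard, assuming a header-only frame (`header_only_df`, a hypothetical name) with the column names listed earlier, is to pass `errors="raise"`:

```python
import pandas as pd

original_columns = [
    "Year", "Strata", "Strata Name", "Frequency",
    "Weighted Frequency", "Percent", "Lower 95% CL", "Upper 95% CL",
]

cname_dict = {
    "Year": "year",
    "Strata": "category",
    "Strata Name": "category_name",
    "Frequency": "count",
    "Weighted Frequency": "weighted_count",
    "Percent": "percent",
    "Lower 95% CL": "lower_cl",
    "Upper 95% CL": "upper_cl",
}

# Header-only frame: enough to check the mapping without loading the data.
header_only_df = pd.DataFrame(columns=original_columns)

# errors="raise" turns a mistyped key into a KeyError instead of a silent no-op.
renamed = header_only_df.rename(columns=cname_dict, errors="raise")

# Keys in the dictionary that match no column (empty if the mapping is complete).
unmatched = set(cname_dict) - set(original_columns)

print(list(renamed.columns))
```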

In [10]:
depress_df.rename(columns=cname_dict)

Unnamed: 0,year,category,category_name,count,weighted_count,percent,lower_cl,upper_cl
0,2012,Total,Total,1920,,11.74,11.11,12.37
1,2012,Sex,Male,561,1116664.0,8.12,7.32,8.92
2,2012,Sex,Female,1359,2163108.0,15.25,14.30,16.20
3,2012,Race-Ethnicity,White,1314,1806371.0,14.57,13.67,15.46
4,2012,Race-Ethnicity,Black,97,222022.0,13.54,10.44,16.65
...,...,...,...,...,...,...,...,...
156,2018,Age,18 to 34,496,1623933.0,17.69,13.72,21.66
157,2018,Age,35 to 44,285,749615.0,14.56,10.91,18.21
158,2018,Age,45 to 54,301,1052945.0,20.06,15.60,24.52
159,2018,Age,55 to 64,432,854201.0,21.44,17.65,25.23


In [11]:
depress_df = depress_df.rename(columns=cname_dict)

In [12]:
depress_df

Unnamed: 0,year,category,category_name,count,weighted_count,percent,lower_cl,upper_cl
0,2012,Total,Total,1920,,11.74,11.11,12.37
1,2012,Sex,Male,561,1116664.0,8.12,7.32,8.92
2,2012,Sex,Female,1359,2163108.0,15.25,14.30,16.20
3,2012,Race-Ethnicity,White,1314,1806371.0,14.57,13.67,15.46
4,2012,Race-Ethnicity,Black,97,222022.0,13.54,10.44,16.65
...,...,...,...,...,...,...,...,...
156,2018,Age,18 to 34,496,1623933.0,17.69,13.72,21.66
157,2018,Age,35 to 44,285,749615.0,14.56,10.91,18.21
158,2018,Age,45 to 54,301,1052945.0,20.06,15.60,24.52
159,2018,Age,55 to 64,432,854201.0,21.44,17.65,25.23


In [13]:
depress_df.sample(15)

Unnamed: 0,year,category,category_name,count,weighted_count,percent,lower_cl,upper_cl
10,2012,Education,Some College or Tech School,563,947473.0,13.25,11.95,14.55
117,2017,Sex,Female,1025,3301418.0,23.33,20.65,26.01
22,2012,Age,65+ years,541,535838.0,12.63,11.38,13.87
17,2012,Income,"$100,000+",270,454444.0,8.31,7.17,9.46
158,2018,Age,45 to 54,301,1052945.0,20.06,15.6,24.52
157,2018,Age,35 to 44,285,749615.0,14.56,10.91,18.21
48,2014,Sex,Female,840,2268864.0,16.03,14.57,17.5
85,2015,Income,"$75,000 - $99,999",150,297855.0,12.24,9.37,15.12
72,2015,Race-Ethnicity,White,1143,1992472.0,16.54,15.13,17.95
91,2015,Age,65+ years,426,585258.0,13.85,12.23,15.48


* The columns now have the names I prefer. Before I examine the distribution of my data, I'll save this out as a new data set:

In [14]:
# Saving cleaned data to my folder

depress_df.to_csv('../data/Cleaned/depress_CLEANED.csv', index=False)
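A quick way to confirm that a saved file matches the frame in memory is to read it back and compare. The sketch below is an illustration only: it uses a toy frame (`toy_df`, a hypothetical name) built from rows displayed in this notebook and writes to an in-memory buffer instead of the real path, so it has no side effects.

```python
import io

import pandas as pd

# Toy frame using rows displayed earlier, with the renamed columns.
toy_df = pd.DataFrame({
    "year": [2012, 2012, 2012],
    "category": ["Total", "Sex", "Sex"],
    "category_name": ["Total", "Male", "Female"],
    "count": [1920, 561, 1359],
    "weighted_count": [float("nan"), 1116664.0, 2163108.0],
    "percent": [11.74, 8.12, 15.25],
})

# Write to an in-memory text buffer rather than to disk.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back in.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises an AssertionError if columns, dtypes, or values differ.
pd.testing.assert_frame_equal(toy_df, reloaded)
```

Using `index=False` when saving, as in the cell above, is what keeps the row index from reappearing as an extra unnamed column when the file is read back.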

In [15]:
depress_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161 entries, 0 to 160
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   year            161 non-null    int64  
 1   category        161 non-null    object 
 2   category_name   161 non-null    object 
 3   count           161 non-null    int64  
 4   weighted_count  154 non-null    float64
 5   percent         161 non-null    float64
 6   lower_cl        161 non-null    float64
 7   upper_cl        161 non-null    float64
dtypes: float64(4), int64(2), object(2)
memory usage: 10.2+ KB


In [16]:
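# Mean of each numeric column. numeric_only=True skips the text columns,
# which pandas 2.0 and later would otherwise raise a TypeError on:

depress_df.mean(numeric_only=True)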

year                2015.000000
count                429.776398
weighted_count    889891.681818
percent               14.789627
lower_cl              11.955280
upper_cl              17.624224
dtype: float64

In [17]:
# To find out some initial info on the counts of people with depression
# throughout the years in CA:

depress_df['count'].describe()

count     161.000000
mean      429.776398
std       390.297867
min        28.000000
25%       186.000000
50%       314.000000
75%       511.000000
max      1964.000000
Name: count, dtype: float64

Some thoughts on this initial data observation:

* The mean of the percent column across all rows is about 14.8%. This is an unweighted average over every demographic row in every year from 2012 to 2018 (including the Total rows), so it is not the same as the overall statewide rate.
* The mean count of people who were told they had a depressive disorder, across all the demographic rows, is about 430 people per row.
* Other interesting points: the smallest count for any row was 28 people, and the largest was 1,964 people. Quite a big range!
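To illustrate how an average over every row differs from the overall figure, here is a minimal sketch on a toy frame (`toy_df`, a hypothetical name) made from the five 2012 rows displayed earlier. The numbers it prints describe only these five rows, not the full data set.

```python
import pandas as pd

# The five 2012 rows displayed earlier in this notebook.
toy_df = pd.DataFrame({
    "category": ["Total", "Sex", "Sex", "Race-Ethnicity", "Race-Ethnicity"],
    "category_name": ["Total", "Male", "Female", "White", "Black"],
    "percent": [11.74, 8.12, 15.25, 14.57, 13.54],
})

# Unweighted mean over every row, mixing the Total row with the group rows.
all_rows_mean = toy_df["percent"].mean()

# The overall figure comes from the Total row alone.
total_row_percent = toy_df.loc[toy_df["category"] == "Total", "percent"].mean()

# Mean percent within each category, which keeps the groups separate.
mean_by_category = toy_df.groupby("category")["percent"].mean()

print(all_rows_mean, total_row_percent)
print(mean_by_category)
```

Grouping by category in this way is one option for the later analysis, since it compares like with like instead of pooling all rows.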

Since I have organized the data, renamed the columns, and saved out this cleaned data file, I will now move into analysis in the data_analysis folder! 

I hope to gain more insights about overall health and wellness patterns among demographic categories in CA (specifically interested in age groups), and then I will look into how I can best narrow down other factors to connect back to mental health and characterize these depression rate trends.

Thank you!