# Initial Data Exploration - CA Adult Depression Rates

* This notebook begins exploring the raw data file uploaded to the data folder. Source:
https://data.chhs.ca.gov/dataset/adult-depression-lghc-indicator-24/resource/724c6fd8-a645-4e52-b63f-32631a20db5d

## Write more and introduce the data file here:


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import math

In [2]:
depress_df = pd.read_csv('../data/Raw/adult_depression_CA_2012_to_2018.csv')

In [3]:
depress_df.shape

(161, 8)

* The data frame has 161 rows and 8 columns.

In [4]:
depress_df

Unnamed: 0,Year,Strata,Strata Name,Frequency,Weighted Frequency,Percent,Lower 95% CL,Upper 95% CL
0,2012,Total,Total,1920,,11.74,11.11,12.37
1,2012,Sex,Male,561,1116664.0,8.12,7.32,8.92
2,2012,Sex,Female,1359,2163108.0,15.25,14.30,16.20
3,2012,Race-Ethnicity,White,1314,1806371.0,14.57,13.67,15.46
4,2012,Race-Ethnicity,Black,97,222022.0,13.54,10.44,16.65
...,...,...,...,...,...,...,...,...
156,2018,Age,18 to 34,496,1623933.0,17.69,13.72,21.66
157,2018,Age,35 to 44,285,749615.0,14.56,10.91,18.21
158,2018,Age,45 to 54,301,1052945.0,20.06,15.60,24.52
159,2018,Age,55 to 64,432,854201.0,21.44,17.65,25.23


### Observations

* The data frame has 161 rows and 8 columns. Each row gives, for one year from 2012 to 2018 and one demographic/socioeconomic category, the frequency (count) of people who were told at some point in their lives that they had a depressive disorder, along with the percent and its 95% confidence limits.
* I am most interested in the frequency column, which gives the count for each category in each year, and the percent column. The frequency values are far smaller than the weighted frequency values (561 versus 1,116,664 for males in 2012), so frequency is presumably a count of survey respondents rather than of the whole population. The demographic breakdown will be useful for discovering differences in depression rates among groups.

* To get a better look at the data, I inspect it in the next several cells:

In [5]:
depress_df.head()

Unnamed: 0,Year,Strata,Strata Name,Frequency,Weighted Frequency,Percent,Lower 95% CL,Upper 95% CL
0,2012,Total,Total,1920,,11.74,11.11,12.37
1,2012,Sex,Male,561,1116664.0,8.12,7.32,8.92
2,2012,Sex,Female,1359,2163108.0,15.25,14.3,16.2
3,2012,Race-Ethnicity,White,1314,1806371.0,14.57,13.67,15.46
4,2012,Race-Ethnicity,Black,97,222022.0,13.54,10.44,16.65


* One thing to note is that we have a "Total" row for each year (where Strata is "Total"), which gives the overall count of people that year who reported having been told they had a depressive disorder. For example, in 2012 the total count was 1,920. Weighted Frequency is missing (NaN) for this row.
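The yearly totals are rows whose Strata value is 'Total', so a boolean filter can pull them out. This is a minimal sketch: `sample_df` is a hypothetical stand-in built from a few rows copied from the outputs in this notebook, not the full `depress_df`.

```python
import pandas as pd

# Stand-in data: a few rows copied from the notebook outputs (not the full file)
sample_df = pd.DataFrame({
    'Year': [2012, 2012, 2012, 2016, 2017],
    'Strata': ['Total', 'Sex', 'Sex', 'Total', 'Total'],
    'Strata Name': ['Total', 'Male', 'Female', 'Total', 'Total'],
    'Frequency': [1920, 561, 1359, 1645, 1550],
    'Percent': [11.74, 8.12, 15.25, 13.77, 19.04],
})

# Keep only the yearly "Total" rows and index them by year
totals = sample_df[sample_df['Strata'] == 'Total'].set_index('Year')
print(totals[['Frequency', 'Percent']])
```

The same filter should work on `depress_df` itself, assuming the Strata values are spelled exactly 'Total' throughout the file.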

In [6]:
depress_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161 entries, 0 to 160
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Year                161 non-null    int64  
 1   Strata              161 non-null    object 
 2   Strata Name         161 non-null    object 
 3   Frequency           161 non-null    int64  
 4   Weighted Frequency  154 non-null    float64
 5   Percent             161 non-null    float64
 6   Lower 95% CL        161 non-null    float64
 7   Upper 95% CL        161 non-null    float64
dtypes: float64(4), int64(2), object(2)
memory usage: 10.2+ KB


Now we know the data types we can work with and which columns will help most in an initial quantitative exploration. Note that Weighted Frequency has only 154 non-null values out of 161, so 7 rows are missing that value.
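The `.info()` output reports fewer non-null values for Weighted Frequency than for the other columns. One way to see where the gaps are is sketched below, again on a hypothetical `sample_df` built from rows shown in this notebook. In those rows the missing values all fall on "Total" rows, and 7 missing values would match one "Total" row per year from 2012 to 2018, but that is an assumption to confirm on the full data.

```python
import numpy as np
import pandas as pd

# Stand-in data: a few rows copied from the notebook outputs (not the full file)
sample_df = pd.DataFrame({
    'Year': [2012, 2012, 2012, 2016, 2017],
    'Strata': ['Total', 'Sex', 'Sex', 'Total', 'Total'],
    'Strata Name': ['Total', 'Male', 'Female', 'Total', 'Total'],
    'Weighted Frequency': [np.nan, 1116664.0, 2163108.0, np.nan, np.nan],
})

# Number of missing values in each column
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Which Strata the rows with a missing Weighted Frequency belong to
is_missing = sample_df['Weighted Frequency'].isna()
missing_by_strata = sample_df.loc[is_missing, 'Strata'].value_counts()
print(missing_by_strata)
```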

In [7]:
depress_df.sample(15)

Unnamed: 0,Year,Strata,Strata Name,Frequency,Weighted Frequency,Percent,Lower 95% CL,Upper 95% CL
132,2017,Income,"$100,000+",296,1060698.0,14.84,11.74,17.95
92,2016,Total,Total,1645,,13.77,12.53,15.0
130,2017,Income,"$50,000 - $74,999",192,595222.0,19.97,15.68,24.25
139,2018,Sex,Male,758,1835759.0,13.44,11.34,15.55
115,2017,Total,Total,1550,,19.04,17.1,20.98
40,2013,Income,"$100,000+",213,541245.0,9.89,8.34,11.44
129,2017,Income,"$35,000 - $49,999",155,457803.0,17.54,11.28,23.8
145,2018,Race-Ethnicity,Other,118,390013.0,26.04,16.38,35.7
25,2013,Sex,Female,1150,2337817.0,16.52,15.42,17.62
157,2018,Age,35 to 44,285,749615.0,14.56,10.91,18.21


* The `.sample()` method is useful because it shows a varied set of rows from the whole data frame. In this case the sample spans several categories, including race-ethnicity, income, sex, and education. This gives a better sense of the variety of groups we are looking at.
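Because `.sample()` draws rows at random, the rows shown will differ each time the cell runs. If a repeatable sample is wanted, passing `random_state` fixes the draw. This is a small sketch on a hypothetical `sample_df` built from rows in this notebook; the seed value 42 is an arbitrary choice.

```python
import pandas as pd

# Stand-in data: a few rows copied from the notebook outputs (not the full file)
sample_df = pd.DataFrame({
    'Year': [2012, 2012, 2012, 2017, 2018, 2018],
    'Strata': ['Total', 'Sex', 'Sex', 'Total', 'Sex', 'Age'],
    'Strata Name': ['Total', 'Male', 'Female', 'Total', 'Male', '35 to 44'],
    'Frequency': [1920, 561, 1359, 1550, 758, 285],
})

# The same seed gives the same rows on every run
first_draw = sample_df.sample(n=3, random_state=42)
second_draw = sample_df.sample(n=3, random_state=42)
print(first_draw)
```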

In [8]:
depress_df.columns

Index(['Year', 'Strata', 'Strata Name', 'Frequency', 'Weighted Frequency',
       'Percent', 'Lower 95% CL', 'Upper 95% CL'],
      dtype='object')

* The columns are listed above. The most important ones are 'Year', 'Strata', 'Strata Name', 'Frequency', and 'Percent', because they give the year, the specific demographic category, and the count and percentage of people who have been told they had a depressive disorder.
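To see how 'Strata' and 'Strata Name' relate, it can help to list the subgroup labels found under each category. The sketch below uses a hypothetical `sample_df` built from rows shown in this notebook, so it only lists the labels present in those rows.

```python
import pandas as pd

# Stand-in data: a few rows copied from the notebook outputs (not the full file)
sample_df = pd.DataFrame({
    'Year': [2012, 2012, 2012, 2018, 2018],
    'Strata': ['Total', 'Sex', 'Sex', 'Age', 'Age'],
    'Strata Name': ['Total', 'Male', 'Female', '18 to 34', '35 to 44'],
    'Frequency': [1920, 561, 1359, 496, 285],
    'Percent': [11.74, 8.12, 15.25, 17.69, 14.56],
})

# Distinct subgroup labels within each demographic category
strata_levels = sample_df.groupby('Strata')['Strata Name'].unique()
print(strata_levels)
```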

## How can I understand the columns better?

Because I want the column names to be easier to understand and work with, I will now use a dictionary to rename the columns in our data frame.

* For convenience when coding, I like lowercase column names, so 'year' instead of 'Year', for example.
* Additionally:

 1) I prefer the name 'category' to 'Strata' for each demographic (and, accordingly, 'category_name' to 'Strata Name');

 2) I prefer 'count' to 'Frequency', and likewise 'weighted_count' to 'Weighted Frequency';

 3) For the confidence limits, I know the level is 95%, so I drop that number and make the names lowercase like the rest.

In [9]:
# Renaming the columns to fit my preferences and for ease when coding:

cname_dict = {
    'Year' : 'year',
    'Strata' : 'category',
    'Strata Name' : 'category_name',
    'Frequency' : 'count',
    'Weighted Frequency' : 'weighted_count',
    'Percent' : 'percent',
    'Lower 95% CL' : 'lower_cl',
    'Upper 95% CL' : 'upper_cl'
}

In [10]:
depress_df.rename(columns=cname_dict)

Unnamed: 0,year,category,category_name,count,weighted_count,percent,lower_cl,upper_cl
0,2012,Total,Total,1920,,11.74,11.11,12.37
1,2012,Sex,Male,561,1116664.0,8.12,7.32,8.92
2,2012,Sex,Female,1359,2163108.0,15.25,14.30,16.20
3,2012,Race-Ethnicity,White,1314,1806371.0,14.57,13.67,15.46
4,2012,Race-Ethnicity,Black,97,222022.0,13.54,10.44,16.65
...,...,...,...,...,...,...,...,...
156,2018,Age,18 to 34,496,1623933.0,17.69,13.72,21.66
157,2018,Age,35 to 44,285,749615.0,14.56,10.91,18.21
158,2018,Age,45 to 54,301,1052945.0,20.06,15.60,24.52
159,2018,Age,55 to 64,432,854201.0,21.44,17.65,25.23


In [11]:
depress_df = depress_df.rename(columns=cname_dict)
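By default, `rename` silently ignores dictionary keys that do not match any column, so a typo in a key would go unnoticed. A possible safeguard is `errors='raise'`, sketched below on a hypothetical one-row `sample_df` with a shortened copy of the dictionary (`cname_subset` is an illustrative name, not part of the notebook).

```python
import pandas as pd

# Stand-in data: one row copied from the notebook outputs
sample_df = pd.DataFrame({
    'Year': [2012],
    'Strata': ['Sex'],
    'Strata Name': ['Male'],
    'Frequency': [561],
})

# Shortened version of the renaming dictionary
cname_subset = {
    'Year': 'year',
    'Strata': 'category',
    'Strata Name': 'category_name',
    'Frequency': 'count',
}

# errors='raise' makes rename fail loudly if a key is not an existing column
renamed = sample_df.rename(columns=cname_subset, errors='raise')
print(list(renamed.columns))

# A misspelled key now raises a KeyError instead of being ignored
try:
    sample_df.rename(columns={'Frequencey': 'count'}, errors='raise')
    typo_caught = False
except KeyError:
    typo_caught = True
print(typo_caught)
```

One side effect of the name 'count': it is also the name of a DataFrame method, so the column has to be accessed with brackets (`depress_df['count']`) rather than as an attribute.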

In [12]:
depress_df

Unnamed: 0,year,category,category_name,count,weighted_count,percent,lower_cl,upper_cl
0,2012,Total,Total,1920,,11.74,11.11,12.37
1,2012,Sex,Male,561,1116664.0,8.12,7.32,8.92
2,2012,Sex,Female,1359,2163108.0,15.25,14.30,16.20
3,2012,Race-Ethnicity,White,1314,1806371.0,14.57,13.67,15.46
4,2012,Race-Ethnicity,Black,97,222022.0,13.54,10.44,16.65
...,...,...,...,...,...,...,...,...
156,2018,Age,18 to 34,496,1623933.0,17.69,13.72,21.66
157,2018,Age,35 to 44,285,749615.0,14.56,10.91,18.21
158,2018,Age,45 to 54,301,1052945.0,20.06,15.60,24.52
159,2018,Age,55 to 64,432,854201.0,21.44,17.65,25.23


In [13]:
depress_df.sample(15)

Unnamed: 0,year,category,category_name,count,weighted_count,percent,lower_cl,upper_cl
140,2018,Sex,Female,1206,3106910.0,21.96,19.08,24.84
151,2018,Income,"$20,000 - $34,999",262,714071.0,20.61,14.6,26.63
43,2013,Age,45 to 54,347,880554.0,16.81,14.85,18.77
146,2018,Education,No High School Diploma,241,465426.0,14.03,9.65,18.42
121,2017,Race-Ethnicity,Asian/Pacific Islander,67,227004.0,6.76,4.41,9.1
160,2018,Age,65+ years,450,661974.0,15.6,13.42,17.78
59,2014,Income,"$20,000 - $34,999",177,452329.0,12.87,9.98,15.76
52,2014,Race-Ethnicity,Asian/Pacific Islander,32,152340.0,4.49,2.31,6.66
31,2013,Education,No High School Diploma,222,510371.0,12.84,10.87,14.81
125,2017,Education,Some College or Tech School,483,1643032.0,22.78,18.64,26.92


* The columns are now renamed the way I prefer. Before examining the distribution of the data, I'll save this as a new data set:

In [17]:
# Saving cleaned data to my folder

depress_df.to_csv('../data/Cleaned/depress_CLEANED.csv', index=False)
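A quick way to confirm a save like this worked is to read the file back and compare it with the data frame in memory. The sketch below does this with a hypothetical two-row `sample_df` and a temporary folder, so it does not touch the real `../data/Cleaned/` path.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in data: two rows copied from the notebook outputs, with the renamed columns
sample_df = pd.DataFrame({
    'year': [2012, 2012],
    'category': ['Total', 'Sex'],
    'category_name': ['Total', 'Male'],
    'count': [1920, 561],
    'weighted_count': [np.nan, 1116664.0],
    'percent': [11.74, 8.12],
})

# Write to a temporary folder, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'depress_CLEANED.csv')
    sample_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises an AssertionError if the reloaded data differs from the original
pd.testing.assert_frame_equal(sample_df, reloaded)
print(reloaded)
```

Because the file is written with `index=False`, the reloaded data has no extra index column.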

In [18]:
depress_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161 entries, 0 to 160
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   year            161 non-null    int64  
 1   category        161 non-null    object 
 2   category_name   161 non-null    object 
 3   count           161 non-null    int64  
 4   weighted_count  154 non-null    float64
 5   percent         161 non-null    float64
 6   lower_cl        161 non-null    float64
 7   upper_cl        161 non-null    float64
dtypes: float64(4), int64(2), object(2)
memory usage: 10.2+ KB


In [15]:
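# Averages of all the numeric columns
# (numeric_only=True is needed in pandas 2.0+, where the text columns
# would otherwise raise a TypeError):

depress_df.mean(numeric_only=True)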

year                2015.000000
count                429.776398
weighted_count    889891.681818
percent               14.789627
lower_cl              11.955280
upper_cl              17.624224
dtype: float64

In [16]:
# To find out some initial info on the counts of people with depression
# throughout the years in CA:

depress_df['count'].describe()

count     161.000000
mean      429.776398
std       390.297867
min        28.000000
25%       186.000000
50%       314.000000
75%       511.000000
max      1964.000000
Name: count, dtype: float64

Some thoughts on this initial look at the data:

* The average of the percent column from 2012 to 2018 is about 14.8%. This is a simple average over all 161 rows, mixing the yearly totals with every demographic category, so it should not be read as a population-wide depression rate.
* The average count per row, across all years and demographic categories, is about 430 people who were told they had a depressive disorder.
* Other interesting points: the smallest count for any category was 28 people, and the largest count in the data frame was 1,964 people. Quite a big range! The largest values likely come from the "Total" rows, which cover everyone in a given year.
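Since these summary numbers mix the yearly "Total" rows with the demographic rows, a possible next step is to drop the totals and summarize each group separately. This is a minimal sketch on a hypothetical `sample_df` built from rows shown in this notebook, so the numbers it prints describe only those few rows.

```python
import pandas as pd

# Stand-in data: a few rows copied from the notebook outputs (not the full file)
sample_df = pd.DataFrame({
    'year': [2012, 2012, 2012, 2012, 2012, 2018, 2018],
    'category': ['Total', 'Sex', 'Sex', 'Race-Ethnicity', 'Race-Ethnicity', 'Sex', 'Sex'],
    'category_name': ['Total', 'Male', 'Female', 'White', 'Black', 'Male', 'Female'],
    'count': [1920, 561, 1359, 1314, 97, 758, 1206],
    'percent': [11.74, 8.12, 15.25, 14.57, 13.54, 13.44, 21.96],
})

# Exclude the yearly totals so they do not skew the per-group summary
strata_only = sample_df[sample_df['category'] != 'Total']

# Mean, minimum, and maximum percent for each subgroup across years
summary = (
    strata_only
    .groupby(['category', 'category_name'])['percent']
    .agg(['mean', 'min', 'max'])
)
print(summary)
```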

# Close out this notebook and lead to the next analysis