# Showing Physical Activity Levels Among Adults in CA

https://data.chhs.ca.gov/dataset/adults-meeting-physical-activity-guidelines-lghc-indicator-16/resource/d824b0e4-b325-4935-82be-1936b0546128

This notebook walks through my initial data exploration (cleaning and inspecting the data set) of a table that reports the percentage of adults meeting physical activity guidelines, as tracked by the Let's Get Healthy California indicator at https://letsgethealthy.ca.gov/.

* According to the website above, "This table displays the percentage of adults meeting Aerobic Physical Activity guidelines in California. 
* The data are from the California Behavioral Risk Factor Surveillance Survey (BRFSS). The California BRFSS is an annual cross-sectional health-related telephone survey that collects data about California residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services."

After cleaning this data set to make it easier to work with, I will make some initial observations, then move on to the analysis of the cleaned data in a separate notebook within the data_analysis folder. I will focus mainly on how the percentage of adults meeting physical activity guidelines differs among age groups.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import math

In [2]:
# Read in data set:

activity_df = pd.read_csv('../data/Raw/adults_meeting_physical_activity_guidelines.csv')

In [3]:
activity_df

Unnamed: 0,Geography,Year,Strata,Strata Name,Percent,Lower 95% CL,Upper 95% CL,Standard Error
0,California,2013,Total population,Total population,69.0782,67.6683,70.4881,0.7192
1,California,2013,Race-Ethnicity,White,75.8986,74.3168,77.4804,0.8069
2,California,2013,Race-Ethnicity,African-American,63.4425,56.9960,69.8891,3.2885
3,California,2013,Race-Ethnicity,Asian/Pacific Islander,61.3377,58.6958,63.9795,1.3477
4,California,2013,Race-Ethnicity,Hispanic,65.7066,60.1734,71.2398,2.8226
...,...,...,...,...,...,...,...,...
64,California,2017,Income,"$50,000 to $74,999",73.3338,66.5754,80.0922,3.4473
65,California,2017,Income,"$75,000 to $99,999",77.9233,70.4109,85.4357,3.8319
66,California,2017,Income,"$100,000 and above",85.7647,82.3292,89.2002,1.7523
67,California,2017,Sex,Male,71.7643,67.5703,75.9583,2.1393


Observations:

* It looks like we have 69 rows and 8 columns. 
* Similar to the other data sets collected through the Let's Get Healthy California indicator, this data set displays familiar columns I've worked with in the previous exploration folders. 
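The row and column counts noted above can also be read directly from the `shape` attribute rather than from the rendered table. The sketch below uses three rows copied from the output above as a stand-in for the full CSV; `sample_csv` and `sample_df` are illustrative names that do not appear in the notebook.

```python
import io
import pandas as pd

# Three rows copied from the displayed table, standing in for the full file.
sample_csv = io.StringIO(
    "Geography,Year,Strata,Strata Name,Percent,Lower 95% CL,Upper 95% CL,Standard Error\n"
    "California,2013,Total population,Total population,69.0782,67.6683,70.4881,0.7192\n"
    "California,2013,Race-Ethnicity,White,75.8986,74.3168,77.4804,0.8069\n"
    "California,2013,Race-Ethnicity,African-American,63.4425,56.9960,69.8891,3.2885\n"
)
sample_df = pd.read_csv(sample_csv)

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = sample_df.shape
print(f"{n_rows} rows, {n_cols} columns")
```

Run against the real `activity_df`, the same two lines should report the 69 rows and 8 columns listed in the observations.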

In [4]:
activity_df.sample(15)

Unnamed: 0,Geography,Year,Strata,Strata Name,Percent,Lower 95% CL,Upper 95% CL,Standard Error
66,California,2017,Income,"$100,000 and above",85.7647,82.3292,89.2002,1.7523
57,California,2017,Education,Less than high school,49.5693,42.6958,56.4427,3.5061
21,California,2013,Sex,Male,70.0242,67.8993,72.1491,1.0839
54,California,2017,Age,45 to 54 years,69.956,63.6617,76.2504,3.2107
36,California,2015,Education,Some college,71.9967,68.4909,75.5026,1.7884
11,California,2013,Education,Less than high school,50.9941,47.1521,54.836,1.9599
30,California,2015,Age,35 to 44 years,69.6839,65.0538,74.314,2.362
23,California,2015,Total population,Total population,71.62,69.8336,73.4065,0.9113
31,California,2015,Age,45 to 54 years,69.2704,65.0923,73.4485,2.1314
2,California,2013,Race-Ethnicity,African-American,63.4425,56.996,69.8891,3.2885


In [5]:
# Getting quick information about the data we're working with:

activity_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Geography       69 non-null     object 
 1   Year            69 non-null     int64  
 2   Strata          69 non-null     object 
 3   Strata Name     69 non-null     object 
 4   Percent         69 non-null     float64
 5   Lower 95% CL    69 non-null     float64
 6   Upper 95% CL    69 non-null     float64
 7   Standard Error  69 non-null     float64
dtypes: float64(4), int64(1), object(3)
memory usage: 4.4+ KB
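The `info()` output above shows 69 non-null values in every column, which suggests there are no missing entries. A more explicit check, sketched here on a few rows copied from the table (the `sample_df` name is only for illustration), counts missing values and fully duplicated rows directly:

```python
import pandas as pd

# A handful of rows copied from the displayed output, not the full data set.
sample_df = pd.DataFrame({
    "Geography": ["California", "California", "California"],
    "Year": [2013, 2013, 2017],
    "Strata": ["Total population", "Race-Ethnicity", "Sex"],
    "Strata Name": ["Total population", "White", "Male"],
    "Percent": [69.0782, 75.8986, 71.7643],
})

# isna() marks missing cells; summing twice gives one total for the whole frame.
total_missing = int(sample_df.isna().sum().sum())

# duplicated() flags rows that repeat an earlier row exactly.
duplicate_rows = int(sample_df.duplicated().sum())

print(f"missing values: {total_missing}, duplicate rows: {duplicate_rows}")
```

The same two expressions can be applied to `activity_df` as a quick sanity check before cleaning.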


In [6]:
# Look at columns:

activity_df.columns

Index(['Geography', 'Year', 'Strata', 'Strata Name', 'Percent', 'Lower 95% CL',
       'Upper 95% CL', 'Standard Error'],
      dtype='object')

Since the columns are the same as the ones in the adult depression rates in CA data set, I will rename them to match what I've done previously:

In [7]:
# Renaming the columns to fit my preferences and for ease when coding:

cname_dict = {
    'Geography' : 'geo',
    'Year' : 'year',
    'Strata' : 'category',
    'Strata Name' : 'category_name',
    'Percent' : 'percent',
    'Lower 95% CL' : 'lower_cl',
    'Upper 95% CL' : 'upper_cl'
}

In [8]:
activity_df.rename(columns=cname_dict)

Unnamed: 0,geo,year,category,category_name,percent,lower_cl,upper_cl,Standard Error
0,California,2013,Total population,Total population,69.0782,67.6683,70.4881,0.7192
1,California,2013,Race-Ethnicity,White,75.8986,74.3168,77.4804,0.8069
2,California,2013,Race-Ethnicity,African-American,63.4425,56.9960,69.8891,3.2885
3,California,2013,Race-Ethnicity,Asian/Pacific Islander,61.3377,58.6958,63.9795,1.3477
4,California,2013,Race-Ethnicity,Hispanic,65.7066,60.1734,71.2398,2.8226
...,...,...,...,...,...,...,...,...
64,California,2017,Income,"$50,000 to $74,999",73.3338,66.5754,80.0922,3.4473
65,California,2017,Income,"$75,000 to $99,999",77.9233,70.4109,85.4357,3.8319
66,California,2017,Income,"$100,000 and above",85.7647,82.3292,89.2002,1.7523
67,California,2017,Sex,Male,71.7643,67.5703,75.9583,2.1393


In [9]:
activity_df = activity_df.rename(columns=cname_dict)
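The two cells above differ in one important way: `rename` returns a new DataFrame and leaves the original untouched unless the result is assigned back. Columns that are not keys in the mapping, such as `Standard Error` here, keep their original names. A minimal sketch of that behavior, using a one-row stand-in (`sample_df` is an illustrative name):

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Geography": ["California"],
    "Year": [2013],
    "Percent": [69.0782],
    "Standard Error": [0.7192],
})
cname_dict = {"Geography": "geo", "Year": "year", "Percent": "percent"}

# Calling rename without assignment only produces a renamed copy.
renamed_preview = sample_df.rename(columns=cname_dict)
original_cols = list(sample_df.columns)  # still the old names

# Assigning the result back is what makes the change stick.
sample_df = sample_df.rename(columns=cname_dict)
new_cols = list(sample_df.columns)

print(original_cols)
print(new_cols)
```

If `Standard Error` should also follow the snake_case convention, an entry for it would need to be added to the mapping; the notebook as written leaves it unchanged.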

In [11]:
activity_df.head()

Unnamed: 0,geo,year,category,category_name,percent,lower_cl,upper_cl,Standard Error
0,California,2013,Total population,Total population,69.0782,67.6683,70.4881,0.7192
1,California,2013,Race-Ethnicity,White,75.8986,74.3168,77.4804,0.8069
2,California,2013,Race-Ethnicity,African-American,63.4425,56.996,69.8891,3.2885
3,California,2013,Race-Ethnicity,Asian/Pacific Islander,61.3377,58.6958,63.9795,1.3477
4,California,2013,Race-Ethnicity,Hispanic,65.7066,60.1734,71.2398,2.8226


In [12]:
activity_df.tail()

Unnamed: 0,geo,year,category,category_name,percent,lower_cl,upper_cl,Standard Error
64,California,2017,Income,"$50,000 to $74,999",73.3338,66.5754,80.0922,3.4473
65,California,2017,Income,"$75,000 to $99,999",77.9233,70.4109,85.4357,3.8319
66,California,2017,Income,"$100,000 and above",85.7647,82.3292,89.2002,1.7523
67,California,2017,Sex,Male,71.7643,67.5703,75.9583,2.1393
68,California,2017,Sex,Female,69.4562,65.9808,72.9316,1.7728


In [13]:
# Look through the different demographic categories:

activity_df['category_name'].unique()

array(['Total population', 'White', 'African-American',
       'Asian/Pacific Islander', 'Hispanic', 'Other', '18 to 34 years',
       '35 to 44 years', '45 to 54 years', '55 to 64 years',
       '65 years and above', 'Less than high school',
       'High school graduate', 'Some college', 'College graduate',
       'Less than $20,000', '$20,000 to $34,999', '$35,000 to $49,999',
       '$50,000 to $74,999', '$75,000 to $99,999', '$100,000 and above',
       'Male', 'Female'], dtype=object)

Since I am now working with the column names I want, and I know that I'm interested in differences among age groups, with a specific focus on the 55 to 64 age group, I will move on to my analysis of this data set in the other notebook I mentioned.
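As a preview of that age-group focus, the age rows can be isolated with a boolean mask on the `category` column. This is only a sketch on four rows copied from the outputs above, and it assumes the age strata are labeled `Age` in `category`, as the sampled rows suggest; `sample_df` and `age_df` are illustrative names.

```python
import pandas as pd

# Rows 30, 31, 54 and 67 from the outputs above, with the renamed columns.
sample_df = pd.DataFrame({
    "year": [2015, 2015, 2017, 2017],
    "category": ["Age", "Age", "Age", "Sex"],
    "category_name": ["35 to 44 years", "45 to 54 years", "45 to 54 years", "Male"],
    "percent": [69.6839, 69.2704, 69.9560, 71.7643],
})

# Keep only the rows whose category is "Age".
age_df = sample_df[sample_df["category"] == "Age"]

# Pivot so each age group becomes a column and each year a row.
age_by_year = age_df.pivot(index="year", columns="category_name", values="percent")
print(age_by_year)
```

Age groups with no row for a given year in this small sample show up as `NaN` in the pivoted table.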

Before that, I am going to make sure I save this cleaned data set in my data folder so that I can easily reference it in my analysis notebook:

In [14]:
# Saving the cleaned data to my folder:

activity_df.to_csv('../data/Cleaned/physical_activity_CLEANED.csv', index=False)
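Passing `index=False` keeps the row index out of the saved file, so reading it back yields the same columns without an extra index column. A quick round-trip check might look like the following; it writes to a temporary directory rather than the project's data folder, and the file name is made up for the example.

```python
import os
import tempfile
import pandas as pd

# One row (row 68 from the tail output) as a stand-in for the cleaned data.
sample_df = pd.DataFrame({
    "geo": ["California"],
    "year": [2017],
    "category_name": ["Female"],
    "percent": [69.4562],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only inside the temporary directory.
    out_path = os.path.join(tmp_dir, "physical_activity_sample.csv")
    sample_df.to_csv(out_path, index=False)
    reloaded_df = pd.read_csv(out_path)

# With index=False the column lists should match exactly.
same_columns = list(reloaded_df.columns) == list(sample_df.columns)
print(same_columns, reloaded_df.shape)
```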

With that, I will go into making deeper observations through data analysis and hopefully arriving at some interesting insights in the other notebook!

Thank you.