# Adults with Diabetes in CA, per 100 people 
Source: https://data.ca.gov/dataset/adults-with-diabetes-per-100-lghc-indicator/resource/aa9555ee-5e60-4b42-a58d-2dec14f4ce8f

This data set is another Let's Get Healthy California (LGHC) indicator and was collected through the California Behavioral Risk Factor Surveillance System (BRFSS). It reports the number of adults with diabetes per 100 people, based on the survey question: "Has a doctor, or nurse or other health professional ever told you that you have diabetes?"

I will look through the data set and organize or clean the columns wherever that makes them easier to work with in my analysis. Then I will make some initial observations.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import math

In [2]:
# Read in the data set:

diabetes_df = pd.read_csv('../data/Raw/adult_diabetes.csv')

In [3]:
diabetes_df

Unnamed: 0,Geography,Year,Strata,Strata Name,Percent,Lower 95% CL,Upper 95% CL,Standard Error
0,California,2018,Total population,Total population,10.4,8.9,11.9,0.8
1,California,2018,Race-Ethnicity,White,8.4,6.9,9.9,0.8
2,California,2018,Race-Ethnicity,African-American,12.3,6.0,18.6,3.2
3,California,2018,Race-Ethnicity,Asian,8.5,3.2,13.9,2.7
4,California,2018,Race-Ethnicity,Hispanic,12.1,9.0,15.1,1.6
...,...,...,...,...,...,...,...,...
142,California,2012,Income,"$25,000 to $34,999",11.9,9.8,14.1,1.1
143,California,2012,Income,"$35,000 to $49,999",9.5,7.9,11.0,0.8
144,California,2012,Income,"$50,000 and above",6.3,5.6,7.0,0.4
145,California,2012,Sex,Male,9.7,8.9,10.5,0.4


In [4]:
# Look at columns:

diabetes_df.columns

Index(['Geography', 'Year', 'Strata', 'Strata Name', 'Percent', 'Lower 95% CL',
       'Upper 95% CL', 'Standard Error'],
      dtype='object')

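* The "Percent" column gives the number of adults with diabetes per 100 people (the prevalence, as a percentage) within each stratum. "Lower 95% CL" and "Upper 95% CL" are the bounds of the 95% confidence interval around that estimate.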
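One quick way to sanity-check that reading is to confirm that each estimate falls inside its own confidence limits. The following is a minimal sketch that uses only the five 2018 rows shown in the output above, not the full file; the names `sample_df`, `inside_limits`, and `ci_width` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# The five 2018 rows displayed above, with the original column names.
sample_df = pd.DataFrame({
    'Strata Name': ['Total population', 'White', 'African-American',
                    'Asian', 'Hispanic'],
    'Percent': [10.4, 8.4, 12.3, 8.5, 12.1],
    'Lower 95% CL': [8.9, 6.9, 6.0, 3.2, 9.0],
    'Upper 95% CL': [11.9, 9.9, 18.6, 13.9, 15.1],
})

# True where the point estimate lies between its lower and upper limits.
inside_limits = (
    (sample_df['Lower 95% CL'] <= sample_df['Percent'])
    & (sample_df['Percent'] <= sample_df['Upper 95% CL'])
)

# Width of each interval: a wider interval means a less precise estimate.
sample_df['ci_width'] = sample_df['Upper 95% CL'] - sample_df['Lower 95% CL']

print(inside_limits.all())
print(sample_df[['Strata Name', 'Percent', 'ci_width']])
```

The same comparison should carry over to the full `diabetes_df` as long as the column names match. Strata with wide intervals are worth treating with more caution in later comparisons.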

In [5]:
# Renaming columns:

col_dict = {
    'Year' : 'year',
    'Strata' : 'category',
    'Strata Name' : 'category_name',
    'Percent' : 'percent',
    'Lower 95% CL' : 'lower_cl',
    'Upper 95% CL' : 'upper_cl'
}

diabetes_df = diabetes_df.rename(columns=col_dict)
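The dictionary above leaves `Geography` and `Standard Error` under their original names. If fully consistent snake_case names were wanted, one optional approach is to normalize every remaining name in a single step. This is only a sketch on a one-row frame built from row 0 of the output above; `demo_df` is an illustrative name, and the notebook itself does not apply this step.

```python
import pandas as pd

# Row 0 from the output above, using the column names left after the rename.
demo_df = pd.DataFrame(
    [['California', 2018, 'Total population', 'Total population',
      10.4, 8.9, 11.9, 0.8]],
    columns=['Geography', 'year', 'category', 'category_name',
             'percent', 'lower_cl', 'upper_cl', 'Standard Error'],
)

# Lowercase every name and replace spaces with underscores.
# Names that are already snake_case pass through unchanged.
demo_df.columns = demo_df.columns.str.lower().str.replace(' ', '_')

print(list(demo_df.columns))
```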

In [6]:
diabetes_df

Unnamed: 0,Geography,year,category,category_name,percent,lower_cl,upper_cl,Standard Error
0,California,2018,Total population,Total population,10.4,8.9,11.9,0.8
1,California,2018,Race-Ethnicity,White,8.4,6.9,9.9,0.8
2,California,2018,Race-Ethnicity,African-American,12.3,6.0,18.6,3.2
3,California,2018,Race-Ethnicity,Asian,8.5,3.2,13.9,2.7
4,California,2018,Race-Ethnicity,Hispanic,12.1,9.0,15.1,1.6
...,...,...,...,...,...,...,...,...
142,California,2012,Income,"$25,000 to $34,999",11.9,9.8,14.1,1.1
143,California,2012,Income,"$35,000 to $49,999",9.5,7.9,11.0,0.8
144,California,2012,Income,"$50,000 and above",6.3,5.6,7.0,0.4
145,California,2012,Sex,Male,9.7,8.9,10.5,0.4


In [7]:
diabetes_df.sample(15)

Unnamed: 0,Geography,year,category,category_name,percent,lower_cl,upper_cl,Standard Error
49,California,2016,Age,45 to 54 years,10.9,9.6,12.1,0.2
24,California,2017,Race-Ethnicity,Asian,4.4,2.4,6.3,1.0
62,California,2016,Sex,Female,9.9,9.0,10.7,0.3
90,California,2014,Age,35 to 44 years,5.4,3.3,7.5,1.1
98,California,2014,Income,"Less than $15,000",13.2,10.8,15.6,1.2
112,California,2013,Age,45 to 54 years,12.0,10.2,13.9,0.9
64,California,2015,Race-Ethnicity,White,7.4,6.4,8.3,0.5
68,California,2015,Age,18 to 34 years,1.6,0.7,2.4,0.4
37,California,2017,Income,"$25,000 to $34,999",11.0,7.3,14.7,1.9
143,California,2012,Income,"$35,000 to $49,999",9.5,7.9,11.0,0.8


In [8]:
# Saving cleaned data to my folder:

diabetes_df.to_csv('../data/Cleaned/diabetes_CLEANED.csv', index=False)
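The `index=False` argument matters here: without it, pandas writes the row index as an extra, unnamed first column, and reading the file back produces a column labelled `Unnamed: 0`. The sketch below illustrates the difference with an in-memory buffer instead of a file on disk; `demo_df` and the buffer names are illustrative, and the two rows are taken from the output above.

```python
import io
import pandas as pd

# Two rows from the 2018 output above, trimmed to three columns.
demo_df = pd.DataFrame({
    'year': [2018, 2018],
    'category_name': ['White', 'Asian'],
    'percent': [8.4, 8.5],
})

# With index=False, only the real columns are written.
buf = io.StringIO()
demo_df.to_csv(buf, index=False)
buf.seek(0)
clean_cols = list(pd.read_csv(buf).columns)

# With the default index=True, the row index is written as an unnamed
# first column, which read_csv labels 'Unnamed: 0' when reading it back.
buf_with_index = io.StringIO()
demo_df.to_csv(buf_with_index)
buf_with_index.seek(0)
indexed_cols = list(pd.read_csv(buf_with_index).columns)

print(clean_cols)
print(indexed_cols)
```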

After looking through the initial data set, I renamed the columns to fit my needs. The category names appear to match those in the adult depression data, which should make the analysis smoother.

Next, I will look more closely at differences among age groups and at historical trends in diabetes prevalence, then decide whether this is a good factor to include in my data story.
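One possible starting point for that comparison is to filter the cleaned data to the `Age` category and pivot it so that each age group becomes a column and each year a row. The sketch below is an assumption about how that step could look, not the analysis itself; it uses only a handful of rows from the sample output above, so most cells come out empty. The names `subset_df`, `age_df`, and `age_trend` are illustrative.

```python
import pandas as pd

# Rows taken from the sample output above (a small subset, not the full file).
rows = [
    (2016, 'Age', '45 to 54 years', 10.9),
    (2014, 'Age', '35 to 44 years', 5.4),
    (2013, 'Age', '45 to 54 years', 12.0),
    (2015, 'Age', '18 to 34 years', 1.6),
    (2016, 'Sex', 'Female', 9.9),
    (2015, 'Race-Ethnicity', 'White', 7.4),
]
subset_df = pd.DataFrame(
    rows, columns=['year', 'category', 'category_name', 'percent']
)

# Keep only the age strata.
age_df = subset_df[subset_df['category'] == 'Age']

# One row per year, one column per age group; combinations that are
# missing from this subset show up as NaN.
age_trend = age_df.pivot(
    index='year', columns='category_name', values='percent'
).sort_index()

print(age_trend)
```

On the full cleaned file, a table shaped like `age_trend` could be passed to `age_trend.plot()` to draw one line per age group across the years.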

Thank you, let's move on now!