# Proportion of Adults in CA Who Are Current Smokers
https://data.ca.gov/dataset/proportion-of-adults-who-are-current-smokers-lghc-indicator1/resource/cd8baf9d-40ff-49b0-9282-550313f76a2a

This data set gives us information on the percentage of people who are current smokers within each surveyed category. It is taken from the same LGHC indicator that I used for the adult depression rates in CA, so the category and category names (Strata and Strata Name) are all similar to the ones I've already worked with in the depression_cleaned notebook. 

Since my research and data story focus solely on differences among age groups, I will clean this data set as needed, make some initial observations, and then proceed to the analysis in a separate notebook within the data_analysis folder.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import math

In [2]:
# Read in the data set:

smoke_df = pd.read_csv('../data/Raw/adult_current_smokers.csv')

In [3]:
smoke_df

Unnamed: 0,Geography,Year,Strata,Strata Name,Percent,Standard Error,Lower 95% CL,Upper 95% CL
0,California,2012,Total Population,Total Population,12.7,0.281,12.2,13.3
1,California,2012,Race-Ethnicity,Hispanic,11.7,0.539,10.7,12.8
2,California,2012,Race-Ethnicity,African-American,15.7,1.374,13,18.4
3,California,2012,Race-Ethnicity,Asian/Pacific Islander,10,0.986,8,11.9
4,California,2012,Race-Ethnicity,White,13.3,0.365,12.6,14
...,...,...,...,...,...,...,...,...
135,California,2018,Education,College graduate,6.5,0.955,4.6,8.4
136,California,2018,Health Insurance,Insured,9.2,0.663,7.9,10.5
137,California,2018,Health Insurance,Uninsured,13.3,1.99,9.4,17.2
138,California,2018,Sex,Male,12.1,0.91,10.3,13.9


### Observations

* We have 140 rows and 8 columns
* The years are the same as the ones in my data set on adult depression rates. The strata also look similar, apart from an additional "Health Insurance" stratum, but overall I'm going to focus on Age.
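
The observations above can also be checked in code rather than by eye. This is a minimal sketch on a hand-built stand-in (`sample_df` is a hypothetical name, and its four rows are copied from the output above), not the full 140-row data set:

```python
import pandas as pd

# Stand-in for smoke_df: four rows copied from the printed output above.
# The real data set has 140 rows and 8 columns.
sample_df = pd.DataFrame({
    'Geography': ['California'] * 4,
    'Year': [2012, 2012, 2018, 2018],
    'Strata': ['Total Population', 'Race-Ethnicity', 'Health Insurance', 'Sex'],
    'Strata Name': ['Total Population', 'Hispanic', 'Insured', 'Male'],
    'Percent': ['12.7', '11.7', '9.2', '12.1'],
})

# .shape gives (rows, columns); .unique() lists the distinct values in a column.
n_rows, n_cols = sample_df.shape
strata = sorted(sample_df['Strata'].unique())
years = sorted(sample_df['Year'].unique())

print(n_rows, n_cols)
print(strata)
print(years)
```

Running the same three calls on `smoke_df` would list every stratum and year in the real file.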

Now I'm going to quickly look at the columns, then rename them so they match the column names in my other notebooks:

In [4]:
# Look at the columns:

smoke_df.columns

Index(['Geography', 'Year', 'Strata', 'Strata Name', 'Percent',
       'Standard Error', 'Lower 95% CL', 'Upper 95% CL'],
      dtype='object')

In [5]:
# Renaming the columns:

cname_dict = {
    'Year' : 'year',
    'Strata' : 'category',
    'Strata Name' : 'category_name',
    'Percent' : 'percent',
    'Lower 95% CL' : 'lower_cl',
    'Upper 95% CL' : 'upper_cl'
}

smoke_df = smoke_df.rename(columns=cname_dict)
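
One thing to keep in mind: by default, `rename` silently skips any dictionary key that is not an actual column name, so a typo in `cname_dict` would go unnoticed. A small sketch of a stricter variant, using a toy frame (`toy_df`, `good_map`, and `typo_map` are made-up names for illustration) and pandas' `errors='raise'` option:

```python
import pandas as pd

# Toy frame with a few of the original column names (no data needed).
toy_df = pd.DataFrame(columns=['Year', 'Strata', 'Percent'])

good_map = {'Year': 'year', 'Strata': 'category', 'Percent': 'percent'}
typo_map = {'Yaer': 'year'}  # deliberate typo for the demonstration

# With errors='raise', every key must match an existing column.
renamed = toy_df.rename(columns=good_map, errors='raise')

try:
    toy_df.rename(columns=typo_map, errors='raise')
    typo_caught = False
except KeyError:
    # pandas raises KeyError when a key is missing from the columns.
    typo_caught = True

print(list(renamed.columns))
print(typo_caught)
```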

In [6]:
smoke_df

Unnamed: 0,Geography,year,category,category_name,percent,Standard Error,lower_cl,upper_cl
0,California,2012,Total Population,Total Population,12.7,0.281,12.2,13.3
1,California,2012,Race-Ethnicity,Hispanic,11.7,0.539,10.7,12.8
2,California,2012,Race-Ethnicity,African-American,15.7,1.374,13,18.4
3,California,2012,Race-Ethnicity,Asian/Pacific Islander,10,0.986,8,11.9
4,California,2012,Race-Ethnicity,White,13.3,0.365,12.6,14
...,...,...,...,...,...,...,...,...
135,California,2018,Education,College graduate,6.5,0.955,4.6,8.4
136,California,2018,Health Insurance,Insured,9.2,0.663,7.9,10.5
137,California,2018,Health Insurance,Uninsured,13.3,1.99,9.4,17.2
138,California,2018,Sex,Male,12.1,0.91,10.3,13.9


The columns I'm interested in (year, category, category_name, and percent) have all been renamed successfully.

I'm going to take a look at a sample of my data set:

In [7]:
smoke_df.sample(15)

Unnamed: 0,Geography,year,category,category_name,percent,Standard Error,lower_cl,upper_cl
31,California,2013,Age,60 years and above,8.0,0.432,7.2,8.9
138,California,2018,Sex,Male,12.1,0.91,10.3,13.9
86,California,2016,Age,Less than 20 years,9.5,3.685,2.3,16.7
69,California,2015,Age,40 to 49 years,10.7,1.213,8.3,13.0
94,California,2016,Education,Some college,14.7,1.436,11.9,17.5
91,California,2016,Age,60 years and above,7.6,0.645,6.3,8.8
131,California,2018,Age,60 years and above,5.8,0.602,4.6,7.0
114,California,2017,Education,Some college,12.8,1.601,9.7,16.0
26,California,2013,Age,Less than 20 years,10.5,2.166,6.2,14.8
130,California,2018,Age,50 to 59 years,12.9,1.608,9.8,16.1


To make sure that the percent column is a float data type (so I can do calculations and take averages), I will check the column types with `.info()`:

In [8]:
# Checking the data type:

smoke_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140 entries, 0 to 139
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Geography       140 non-null    object
 1   year            140 non-null    int64 
 2   category        140 non-null    object
 3   category_name   140 non-null    object
 4   percent         140 non-null    object
 5   Standard Error  140 non-null    object
 6   lower_cl        140 non-null    object
 7   upper_cl        140 non-null    object
dtypes: int64(1), object(7)
memory usage: 8.9+ KB


Oops! It looks like the data type for the 'percent' column is listed as an object! I will have to change that to numeric using the pd.to_numeric function. Passing errors='coerce' means that any value that cannot be parsed as a number is replaced with NaN instead of raising an error:

In [9]:
# Changing the 'percent' data type to numeric:

smoke_df['percent'] = pd.to_numeric(smoke_df['percent'], errors='coerce')

In [10]:
smoke_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140 entries, 0 to 139
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Geography       140 non-null    object 
 1   year            140 non-null    int64  
 2   category        140 non-null    object 
 3   category_name   140 non-null    object 
 4   percent         123 non-null    float64
 5   Standard Error  140 non-null    object 
 6   lower_cl        140 non-null    object 
 7   upper_cl        140 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 8.9+ KB
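
Comparing the two `.info()` outputs, the non-null count for 'percent' went from 140 to 123, so 17 entries were turned into NaN by the coercion. A minimal sketch of how to see which raw strings failed to parse; the values below are hypothetical placeholders, since the actual non-numeric entries in the file are not shown in this notebook:

```python
import pandas as pd

# Hypothetical raw values; 'N/A' and '--' stand in for whatever
# non-numeric entries the real 'percent' column contains.
raw = pd.Series(['12.7', '11.7', 'N/A', '10', '--'], name='percent')

# errors='coerce' replaces unparseable values with NaN.
converted = pd.to_numeric(raw, errors='coerce')

# A value "failed" if it was present in the raw data but is NaN afterwards.
failed_mask = converted.isna() & raw.notna()
failed_values = raw[failed_mask].tolist()
n_failed = int(failed_mask.sum())

print(converted.dtype)
print(failed_values, n_failed)
```

On the real data, this would mean keeping a copy of the original column before the conversion in cell 9, then applying the same mask to it.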


So now I have the data set that I want to work with! I've renamed the columns and made sure the 'percent' column is a float data type. Note that its non-null count dropped from 140 to 123, so 17 values could not be parsed and are now NaN. Before I move into analysis with this data in the analysis folder, I will save this out as a new, cleaned data file:

In [11]:
# Saving cleaned data to my folder:

smoke_df.to_csv('../data/Cleaned/smoke_CLEANED.csv', index=False)
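
The `index=False` argument matters here: without it, the row index is written as an extra unnamed column, which shows up as 'Unnamed: 0' when the file is read back in. A short sketch of the difference, using an in-memory buffer and a two-row toy frame (`toy_df` is a made-up name) instead of the real file:

```python
import io
import pandas as pd

toy_df = pd.DataFrame({'year': [2012, 2013], 'percent': [12.7, None]})

# Default behaviour (index=True): the index becomes an extra column.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the real columns are written.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)
cols_without_index = list(reloaded.columns)

print(cols_with_index)
print(cols_without_index)
# Missing values are written as empty fields and come back as NaN.
print(int(reloaded['percent'].isna().sum()))
```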

I have saved the cleaned data set to my data folder! This process is a lot easier now that I've worked with similar data sets, since I know how to clean the data and can move more smoothly into the analysis.

Now, please refer to the data_analysis folder, specifically the "current_smokers_analysis" notebook, to see how I carried out my analysis to break down the percentages of current smokers by age group. Thank you!