## Python Data Analysis Project

This project demonstrates data analysis using Python.

Obesity is a common, serious, and costly disease. Obesity-related conditions include heart disease, stroke, type 2 diabetes, and certain types of cancer. These are among the leading causes of preventable, premature death. The estimated annual medical cost of obesity in the United States was nearly $173 billion in 2019 dollars. Medical costs for adults who had obesity were $1,861 higher than medical costs for people with healthy weight.

In this project we will explore the Nutrition, Physical Activity, and Obesity data from the Behavioral Risk Factor Surveillance System. We will also explore whether obesity is strongly correlated with age, gender, race, and income, and how these trends vary across the United States.

**About Data**:
This dataset includes data on adults' diet, physical activity, and weight status from the Behavioral Risk Factor Surveillance System. The data is used for DNPAO's Data, Trends, and Maps database, which provides national and state-specific data on obesity, nutrition, physical activity, and breastfeeding.
https://chronicdata.cdc.gov/Nutrition-Physical-Activity-and-Obesity/Nutrition-Physical-Activity-and-Obesity-Behavioral/hn4x-zwk7

**Updated**: December 7, 2021

**Data Provided by**: Centers for Disease Control and Prevention (CDC), National Center for Chronic Disease Prevention and Health Promotion, Division of Nutrition, Physical Activity, and Obesity

### Research Questions
Which state has the highest percentage of obese adults?

In [2]:
# Import key libraries for analysis
import numpy as np
import pandas as pd

Read the data from a CSV file. The file can be downloaded from the CDC website linked in the introduction.

In [3]:
df = pd.read_csv('Nutrition_Physical_Activity_and_Obesity.csv')

Ensure data is loaded correctly and total row count is as expected.

In [4]:
original_row_count = len(df.index)
df.head()

| | YearStart | YearEnd | LocationAbbr | LocationDesc | Datasource | Class | Topic | Question | Data_Value_Unit | Data_Value_Type | ... | GeoLocation | ClassID | TopicID | QuestionID | DataValueTypeID | LocationID | StratificationCategory1 | Stratification1 | StratificationCategoryId1 | StratificationID1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | 2014 | GU | Guam | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | NaN | Value | ... | (13.444304, 144.793731) | OWS | OWS1 | Q036 | VALUE | 66 | Education | High school graduate | EDU | EDUHSGRAD |
| 1 | 2013 | 2013 | US | National | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | NaN | Value | ... | NaN | OWS | OWS1 | Q036 | VALUE | 59 | Income | $50,000 - $74,999 | INC | INC5075 |
| 2 | 2013 | 2013 | US | National | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | NaN | Value | ... | NaN | OWS | OWS1 | Q037 | VALUE | 59 | Income | Data not reported | INC | INCNR |
| 3 | 2015 | 2015 | US | National | Behavioral Risk Factor Surveillance System | Physical Activity | Physical Activity - Behavior | Percent of adults who achieve at least 300 min... | NaN | Value | ... | NaN | PA | PA1 | Q045 | VALUE | 59 | Income | Less than $15,000 | INC | INCLESS15 |
| 4 | 2015 | 2015 | GU | Guam | Behavioral Risk Factor Surveillance System | Physical Activity | Physical Activity - Behavior | Percent of adults who achieve at least 150 min... | NaN | Value | ... | (13.444304, 144.793731) | PA | PA1 | Q044 | VALUE | 66 | Race/Ethnicity | Hispanic | RACE | RACEHIS |


In [5]:
df.describe(include='all').transpose()

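| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| YearStart | 80929.0 | NaN | NaN | NaN | 2015.536717 | 2.841623 | 2011.0 | 2013.0 | 2016.0 | 2018.0 | 2020.0 |
| YearEnd | 80929.0 | NaN | NaN | NaN | 2015.536717 | 2.841623 | 2011.0 | 2013.0 | 2016.0 | 2018.0 | 2020.0 |
| LocationAbbr | 80929.0 | 55.0 | CO | 1512.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| LocationDesc | 80929.0 | 55.0 | Colorado | 1512.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Datasource | 80929.0 | 1.0 | Behavioral Risk Factor Surveillance System | 80929.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Class | 80929.0 | 3.0 | Physical Activity | 44805.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Topic | 80929.0 | 3.0 | Physical Activity - Behavior | 44805.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Question | 80929.0 | 9.0 | Percent of adults aged 18 years and older who ... | 15037.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Data_Value_Unit | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Data_Value_Type | 80929.0 | 1.0 | Value | 80929.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |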


#### Data Cleaning
The data cleaning strategy includes the following tasks:
- Understand data quality by exploring unique and null values
- Remove columns that are redundant or not needed
- Assess why data is missing and handle the missing data
- Identify columns that hold only one value
- Rename columns to be more intuitive
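
The first tasks in this list can be sketched with a small profiling helper. This is a minimal illustration on a toy frame, not the notebook's own code: `profile_columns` and the toy values are hypothetical names and data introduced here.

```python
import numpy as np
import pandas as pd


def profile_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarise null counts and distinct values per column (hypothetical helper)."""
    return pd.DataFrame({
        "nulls": frame.isna().sum(),
        # dropna=True counts distinct non-null values only
        "unique": frame.nunique(dropna=True),
    })


# Toy stand-in for the CDC data; the values are made up for illustration.
toy = pd.DataFrame({
    "Data_Value_Unit": [np.nan, np.nan, np.nan],
    "Datasource": ["BRFSS", "BRFSS", "BRFSS"],
    "Data_Value": [30.1, np.nan, 28.4],
})

profile = profile_columns(toy)
# Columns with at most one distinct non-null value carry no information.
constant_cols = profile.index[profile["unique"] <= 1].tolist()
print(profile)
print(constant_cols)
```

Columns that end up in `constant_cols` are candidates for removal, since they cannot distinguish one row from another.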

In [6]:
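# Data_Value and Data_Value_Alt look identical. The comparison below still reports
# 7964 mismatches, which equals the number of rows later removed for missing values
# (80929 - 72965); this is consistent with NaN != NaN evaluating to True.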
sum(df.Data_Value_Alt != df.Data_Value)

7964
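
Element-wise `!=` treats NaN as unequal to itself, so rows where both columns are missing are counted as mismatches. A NaN-aware comparison might look like the sketch below, which uses a toy frame with made-up values rather than the real data.

```python
import numpy as np
import pandas as pd

# Toy columns; the values are made up for illustration.
toy = pd.DataFrame({
    "Data_Value": [30.1, np.nan, 28.4],
    "Data_Value_Alt": [30.1, np.nan, 28.4],
})

# Element-wise != counts the row where both values are NaN as a mismatch.
naive_mismatches = int((toy["Data_Value"] != toy["Data_Value_Alt"]).sum())

# Series.equals treats NaNs in the same position as equal.
same_with_nan = toy["Data_Value"].equals(toy["Data_Value_Alt"])

# Alternative: ignore rows where both sides are missing.
both_missing = toy["Data_Value"].isna() & toy["Data_Value_Alt"].isna()
real_mismatches = int(
    ((toy["Data_Value"] != toy["Data_Value_Alt"]) & ~both_missing).sum()
)

print(naive_mismatches, same_with_nan, real_mismatches)
```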

In [7]:
# values that look redundant 
print(df.Data_Value_Unit.unique())
print(df.Data_Value_Type.unique())
print(df.Data_Value_Footnote_Symbol.unique())
print(df.Total.unique())
print(df.DataValueTypeID.unique())
print(df.Datasource.unique())

[nan]
['Value']
[nan '~']
[nan 'Total']
['VALUE']
['Behavioral Risk Factor Surveillance System']


In [8]:
df.drop(['Data_Value_Unit', 'Data_Value_Type', 
         'Data_Value_Alt', 'Data_Value_Footnote_Symbol',
         'Total','DataValueTypeID','Datasource'], axis=1, inplace = True)

In [9]:
len(df.index)

80929

In [10]:
print(df.Data_Value_Footnote.unique())

[nan 'Data not available because sample size is insufficient.']


In [11]:
df.drop(df[df['Data_Value_Footnote'] == 
           'Data not available because sample size is insufficient.'].index , inplace = True)

In [19]:
pct_deleted = round((original_row_count - len(df.index)) / original_row_count * 100, 2)
print(f'{pct_deleted} percent of rows were deleted due to insufficient sample size')

9.84 percent of rows were deleted due to insufficient sample size
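
The same filtering can be expressed with a boolean mask, which avoids in-place mutation and gives the dropped share directly. This is a sketch on a toy frame with made-up values; `FOOTNOTE`, `insufficient`, and `kept` are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

FOOTNOTE = "Data not available because sample size is insufficient."

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Data_Value": [30.1, np.nan, 28.4, np.nan],
    "Data_Value_Footnote": [np.nan, FOOTNOTE, np.nan, FOOTNOTE],
})

# True for rows flagged as having an insufficient sample size.
insufficient = toy["Data_Value_Footnote"] == FOOTNOTE

# The mean of a boolean mask is the fraction of True values.
share_dropped = round(insufficient.mean() * 100, 2)

# Keep only the rows that are not flagged.
kept = toy.loc[~insufficient].copy()
print(f"{share_dropped} percent of rows flagged; {len(kept)} rows kept")
```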


In [20]:
len(df.index)

72965

In [21]:
# The Data_Value_Footnote column is no longer needed; it was only used to identify rows with missing values
df.drop(['Data_Value_Footnote'], axis=1, inplace = True)

In [22]:
# YearStart and YearEnd hold the same values, likely because each data collection period fell within a single year.
sum(df.YearStart != df.YearEnd)

0

In [23]:
# We do not need both columns, so we will keep Year End
df.drop(['YearStart'], axis=1, inplace = True)

In [24]:
df.describe(include='all').transpose()

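| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| YearEnd | 72965.0 | NaN | NaN | NaN | 2015.522415 | 2.849482 | 2011.0 | 2013.0 | 2016.0 | 2018.0 | 2020.0 |
| LocationAbbr | 72965.0 | 55.0 | US | 1512.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| LocationDesc | 72965.0 | 55.0 | National | 1512.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Class | 72965.0 | 3.0 | Physical Activity | 40396.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Topic | 72965.0 | 3.0 | Physical Activity - Behavior | 40396.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Question | 72965.0 | 9.0 | Percent of adults who engage in no leisure-tim... | 13596.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Data_Value | 72965.0 | NaN | NaN | NaN | 31.238672 | 10.156464 | 0.9 | 24.4 | 31.1 | 36.9 | 77.6 |
| Low_Confidence_Limit | 72965.0 | NaN | NaN | NaN | 26.933101 | 9.960837 | 0.3 | 20.1 | 26.7 | 32.8 | 70.2 |
| High_Confidence_Limit | 72965.0 | NaN | NaN | NaN | 36.109866 | 11.084842 | 3.0 | 28.6 | 35.8 | 42.1 | 87.7 |
| Sample_Size | 72965.0 | 9269.0 | 54 | 140.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |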


#### Questions provide the context for each data value
Each question defines what the data value measures, for example:
- What percentage of adults have obesity?
- What percentage of adults exercise?

In [25]:
print(len(df.Question.unique()))
df.Question.unique()

9


array(['Percent of adults aged 18 years and older who have obesity',
       'Percent of adults aged 18 years and older who have an overweight classification',
       'Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)',
       'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week',
       'Percent of adults who engage in no leisure-time physical activity',
       'Percent of adults who engage in muscle-strengthening activities on 2 or more days a week',
       'Percent of adults who report consuming fruit less than one time daily',
       'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 mi

### We use a function to create a summary column that represents questions
This data transformation step should help with readability and with plotting the data.

In [26]:
# Map each survey question to a short status label
QUESTION_STATUS = {
    'Percent of adults aged 18 years and older who have obesity': 'obese',
    'Percent of adults aged 18 years and older who have an overweight classification': 'overweight',
    'Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)': 'very active',
    'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week': 'active and physical training',
    'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)': 'active',
    'Percent of adults who engage in no leisure-time physical activity': 'inactive',
    'Percent of adults who engage in muscle-strengthening activities on 2 or more days a week': 'physical training',
    'Percent of adults who report consuming fruit less than one time daily': 'fruits deficient',
    'Percent of adults who report consuming vegetables less than one time daily': 'veggies deficient',
}

def qf_update(row):
    # Return the short label for the row's question, or 'no data' if it is not in the mapping
    return QUESTION_STATUS.get(row['Question'], 'no data')

In [27]:
# Run the function, which will create a new column called "status"
df['status'] = df.apply(qf_update, axis=1)
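
Row-wise `apply` calls the Python function once per row. A dictionary lookup with `Series.map` is a common vectorised alternative that gives the same labels. The sketch below assumes a shortened, hypothetical mapping named `question_to_status` and a toy frame; a full version would list all nine questions.

```python
import pandas as pd

# Hypothetical, shortened mapping; a full version would cover all nine questions.
question_to_status = {
    "Percent of adults aged 18 years and older who have obesity": "obese",
    "Percent of adults who engage in no leisure-time physical activity": "inactive",
}

toy = pd.DataFrame({
    "Question": [
        "Percent of adults aged 18 years and older who have obesity",
        "Percent of adults who engage in no leisure-time physical activity",
        "Some question that is not in the mapping",
    ]
})

# map() looks up each value; unmatched questions become NaN, then 'no data'.
toy["status"] = toy["Question"].map(question_to_status).fillna("no data")
print(toy["status"].tolist())
```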

### The data may be imbalanced due to the independence of questions and timing of surveys

In [28]:
df.status.value_counts()

inactive                        13596
obese                           13579
overweight                      13579
physical training                6712
active                           6700
very active                      6696
active and physical training     6692
fruits deficient                 2708
veggies deficient                2703
Name: status, dtype: int64
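
One way to check whether survey timing explains the imbalance is to cross-tabulate status against year. The sketch below uses a toy frame whose counts are made up and do not come from the dataset.

```python
import pandas as pd

# Toy rows, made up to mimic questions that are asked in different survey years.
toy = pd.DataFrame({
    "status": ["obese", "obese", "obese", "active", "active", "fruits deficient"],
    "YearEnd": [2016, 2017, 2017, 2017, 2017, 2017],
})

# Rows are statuses, columns are years, cells are record counts.
counts = pd.crosstab(toy["status"], toy["YearEnd"])
print(counts)
```

A zero in a cell would indicate that the question behind that status has no records for that year.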

In [29]:
# The following columns are not needed for analysis 
df.drop(['Class', 'Topic', 'Question', 'ClassID', 'TopicID', 'QuestionID'], axis=1, inplace=True)

In [30]:
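# Rename columns to shorter names for ease of use in later code.
# Note: the 'High_Confidence_Limit ' key keeps a trailing space. The rename takes effect
# (HighCI appears in later output), which suggests the source header carries that space.
df.rename(columns={'Data_Value': 'Percent_Adults',
                   'Low_Confidence_Limit': 'LowCI',
                   'High_Confidence_Limit ': 'HighCI'}, inplace=True)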
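
The key `'High_Confidence_Limit '` in the rename ends with a space. If the source header really does carry stray whitespace, one option is to normalise all column names before renaming. This is a sketch on a toy frame, assuming the trailing space is present in the header.

```python
import pandas as pd

# Toy frame whose last column name carries a trailing space (assumed, mirroring the rename key).
toy = pd.DataFrame({"Low_Confidence_Limit": [20.1], "High_Confidence_Limit ": [28.6]})

# Normalise header whitespace first, then rename with clean keys.
toy.columns = toy.columns.str.strip()
toy = toy.rename(columns={"Low_Confidence_Limit": "LowCI",
                          "High_Confidence_Limit": "HighCI"})
print(toy.columns.tolist())
```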




### The categories and subcategories are key dimensions that are required to answer research questions
The category and subcategory give further context for the data. For example, age is a category, and an age range such as 18 to 24 years is a subcategory of age. We will use these dimensions to compare obesity rates within each category, such as across age groups or between males and females.

In [34]:
df.groupby(['StratificationCategory1','Stratification1'])['status'].count()

StratificationCategory1  Stratification1                 
Age (years)              18 - 24                             2878
                         25 - 34                             2878
                         35 - 44                             2878
                         45 - 54                             2878
                         55 - 64                             2878
                         65 or older                         2878
Education                College graduate                    2878
                         High school graduate                2878
                         Less than high school               2878
                         Some college or technical school    2878
Gender                   Female                              2878
                         Male                                2878
Income                   $15,000 - $24,999                   2878
                         $25,000 - $34,999                   2878
                  

In [35]:
print(df.StratificationCategory1.value_counts())
print(df.StratificationCategoryId1.value_counts())
print(df.Stratification1.value_counts())
print(df.StratificationID1.value_counts())

Income            20146
Age (years)       17268
Race/Ethnicity    15405
Education         11512
Gender             5756
Total              2878
Name: StratificationCategory1, dtype: int64
INC      20146
AGEYR    17268
RACE     15405
EDU      11512
GEN       5756
OVR       2878
Name: StratificationCategoryId1, dtype: int64
High school graduate                2878
Female                              2878
65 or older                         2878
$75,000 or greater                  2878
$35,000 - $49,999                   2878
College graduate                    2878
Male                                2878
35 - 44                             2878
Total                               2878
$15,000 - $24,999                   2878
$50,000 - $74,999                   2878
18 - 24                             2878
45 - 54                             2878
Some college or technical school    2878
55 - 64                             2878
25 - 34                             2878
$25,000 - $34,999   

In [36]:
# The ID fields do not provide additional information, so we will work with category and subcategories 
df.drop(['StratificationCategoryId1', 'StratificationID1'], axis=1, inplace=True)

In [37]:
# renaming the columns for ease of use
df.rename(columns={'StratificationCategory1': 'Category', 
                   'Stratification1': 'Sub_Category'}, inplace=True)

### The columns Age(years), Education, Gender, Income, and Race/Ethnicity have null values
These columns are populated only on rows where they are relevant. For example, Gender is filled only on rows stratified by gender: its 5,756 non-null values (72,965 rows minus 67,209 nulls) match the 5,756 rows in the Gender category.

In [38]:
df.isnull().sum(axis = 0)

YearEnd               0
LocationAbbr          0
LocationDesc          0
Percent_Adults        0
LowCI                 0
HighCI                0
Sample_Size           0
Age(years)        55697
Education         61453
Gender            67209
Income            52819
Race/Ethnicity    57560
GeoLocation        1512
LocationID            0
Category              0
Sub_Category          0
status                0
dtype: int64
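
The null pattern can be checked by counting non-null values of each demographic column per category. The sketch below uses a toy frame with the same shape and made-up values. It also assumes that a filled demographic column repeats the value in `Sub_Category`, which is not confirmed by the output above.

```python
import numpy as np
import pandas as pd

# Toy frame in the same shape: one demographic column per category,
# filled only on rows that belong to that category. Values are made up.
toy = pd.DataFrame({
    "Category": ["Gender", "Gender", "Education", "Total"],
    "Sub_Category": ["Female", "Male", "College graduate", "Total"],
    "Gender": ["Female", "Male", np.nan, np.nan],
    "Education": [np.nan, np.nan, "College graduate", np.nan],
})

# For each category, count the non-null values in each demographic column.
filled = toy.groupby("Category")[["Gender", "Education"]].count()
print(filled)

# Assumption: where Gender is filled, it repeats the Sub_Category value.
gender_rows = toy["Category"] == "Gender"
matches = (toy.loc[gender_rows, "Gender"] == toy.loc[gender_rows, "Sub_Category"]).all()
print(matches)
```

If the pattern holds on the real data, the demographic columns duplicate `Sub_Category` and could be dropped without losing information.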

### Let's explore location data

In [39]:
# the data here is represented at state and national level
df.LocationDesc.unique()

array(['Guam', 'National', 'Wyoming', 'District of Columbia',
       'Puerto Rico', 'Alabama', 'Rhode Island', 'New Jersey',
       'Washington', 'Michigan', 'Virginia', 'California', 'Utah',
       'New York', 'Massachusetts', 'Delaware', 'Arkansas', 'Illinois',
       'New Hampshire', 'New Mexico', 'Maryland', 'Hawaii', 'Louisiana',
       'Texas', 'South Dakota', 'Colorado', 'Oklahoma', 'Mississippi',
       'Oregon', 'West Virginia', 'Wisconsin', 'Kansas', 'Florida',
       'Idaho', 'Arizona', 'Virgin Islands', 'Montana', 'Minnesota',
       'Georgia', 'North Carolina', 'Pennsylvania', 'Kentucky',
       'North Dakota', 'South Carolina', 'Nebraska', 'Missouri', 'Nevada',
       'Iowa', 'Indiana', 'Ohio', 'Vermont', 'Tennessee', 'Connecticut',
       'Alaska', 'Maine'], dtype=object)

In [40]:
usa_df = df.loc[df['LocationDesc'] == 'National']
state_df = df.loc[df['LocationDesc'] != 'National']

In [41]:
print('National data: ',len(usa_df))
print('State data: ',len(state_df))
print('Total data: ',len(usa_df)+len(state_df))

National data:  1512
State data:  71453
Total data:  72965
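
With state rows separated from national rows, the research question could be approached by filtering to overall obesity records for the latest year and sorting. The sketch below uses a toy frame with made-up states and numbers. Restricting to `Category == 'Total'` is an assumption about how the overall rate is stored.

```python
import pandas as pd

# Toy state-level rows using the renamed columns; the numbers are made up.
toy_state = pd.DataFrame({
    "YearEnd": [2020, 2020, 2020, 2019],
    "LocationDesc": ["State A", "State B", "State A", "State B"],
    "Category": ["Total", "Total", "Gender", "Total"],
    "status": ["obese", "obese", "obese", "obese"],
    "Percent_Adults": [31.0, 35.5, 29.0, 34.0],
})

latest_year = toy_state["YearEnd"].max()

# Keep overall (not stratified) obesity rows for the latest year only.
overall = toy_state[
    (toy_state["status"] == "obese")
    & (toy_state["Category"] == "Total")
    & (toy_state["YearEnd"] == latest_year)
]

# Highest percentage first.
ranking = overall.sort_values("Percent_Adults", ascending=False)
top_state = ranking.iloc[0]["LocationDesc"]
print(top_state)
```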


In [42]:
df.describe(include='all').transpose()

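| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| YearEnd | 72965.0 | NaN | NaN | NaN | 2015.522415 | 2.849482 | 2011.0 | 2013.0 | 2016.0 | 2018.0 | 2020.0 |
| LocationAbbr | 72965.0 | 55.0 | US | 1512.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| LocationDesc | 72965.0 | 55.0 | National | 1512.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Percent_Adults | 72965.0 | NaN | NaN | NaN | 31.238672 | 10.156464 | 0.9 | 24.4 | 31.1 | 36.9 | 77.6 |
| LowCI | 72965.0 | NaN | NaN | NaN | 26.933101 | 9.960837 | 0.3 | 20.1 | 26.7 | 32.8 | 70.2 |
| HighCI | 72965.0 | NaN | NaN | NaN | 36.109866 | 11.084842 | 3.0 | 28.6 | 35.8 | 42.1 | 87.7 |
| Sample_Size | 72965.0 | 9269.0 | 54 | 140.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age(years) | 17268.0 | 6.0 | 25 - 34 | 2878.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Education | 11512.0 | 4.0 | High school graduate | 2878.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gender | 5756.0 | 2.0 | Female | 2878.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |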


### Let's explore how the data is structured by year

In [43]:
print(df.YearEnd.value_counts())

2017    12329
2019    12051
2015     9497
2011     9307
2013     9257
2016     4177
2020     4118
2018     4115
2014     4114
2012     4000
Name: YearEnd, dtype: int64
