## Evolution of obesity rates in the United States over the last decade

This is a project to demonstrate data analysis using Python. 

Obesity is a common, serious, and costly disease. Obesity-related conditions include heart disease, stroke, type 2 diabetes and certain types of cancer. These are among the leading causes of preventable, premature death. The estimated annual medical cost of obesity in the United States was nearly 173 billion dollars (in 2019 dollars). Medical costs for adults who had obesity were $1,861 higher than medical costs for people with a healthy weight.

In this project we will explore the Nutrition, Physical Activity, and Obesity - Behavioral Risk Factor Surveillance System data. We will also explore whether there is a strong correlation between obesity and age, gender, race, and income, and how these trends vary across the United States.

**About Data**:
This dataset includes data on adults' diet, physical activity, and weight status from the Behavioral Risk Factor Surveillance System. This data is used for DNPAO's Data, Trends, and Maps database, which provides national and state-specific data on obesity, nutrition, physical activity, and breastfeeding.
https://chronicdata.cdc.gov/Nutrition-Physical-Activity-and-Obesity/Nutrition-Physical-Activity-and-Obesity-Behavioral/hn4x-zwk7

**Updated**: December 7, 2021

**Data Provided by**: Centers for Disease Control and Prevention (CDC), National Center for Chronic Disease Prevention and Health Promotion, Division of Nutrition, Physical Activity, and Obesity

### Research Questions
1. Which state has the highest percentage of obese adults?
2. Are there any state and national trends related to obesity in adults? 
3. Are there differences in obesity rates across Gender, Age, Race, Income and Education?
4. Do low levels of exercise and poor nutrition correlate with higher levels of obesity?

### Conclusions

1. For the last 2 years (2019, 2020) Mississippi had the highest obesity rate in the United States. We also see West Virginia and Louisiana appear on the list a few times.


2. Over the last decade (2011-2020), the USA obesity rates have gone up from 27.4% to 31.9%
 * Over the last decade (2011-2020), the state level obesity rates have gone up as well. In 2020, the state level chart confirms that the highest obesity rate was in Mississippi, with West Virginia a close second, followed by Alabama. Aggregating the data to the regional level shows that the East South Central region has the highest obesity rates, followed by the West South Central region. The East South Central region also showed the largest relative change (~20%) in obesity rates since 2011, from 31.7% to 38%.
 * The high obesity rates are concentrated in the south central region. The lowest obesity rates appear to be in Colorado and Massachusetts.


3. There are significant differences across Gender, Race, Age, Income and Education levels:
 * Male obesity rates were higher through 2017; female obesity rates then accelerated and are now higher than male obesity rates. In 2020, the female obesity rate was 32.1% compared to a male obesity rate of 31.7%. This may require further research.
 * In 2020, obesity rates were highest in the 45 to 54 age group: 38.1%. The lowest obesity rate (19.5%) was in the 18 to 24 age group. However, this younger group grew from 15.2% in 2011 to 19.5% in 2020. This is a concerning trend.
 * The Black population showed the highest obesity rate in 2020, 41.6%, which is more than 3.5x the Asian population's obesity rate of 11.8%.
 * Lower income levels correlate with higher obesity rates. In 2020, the population earning more than 75,000 dollars showed the lowest obesity rate, at 29% vs 37.9% for the population earning less than 15,000 dollars.
 * Lower levels of education correlate with higher obesity rates. In 2020, the population with less than a high school diploma showed an obesity rate of 38.8% compared to 25% for college graduates.


4. A higher percentage of inactive population correlates with higher obesity rates, and poor nutrition correlates with higher obesity rates:
 * Mississippi showed a higher percentage of inactive population. We see some noise in the data; for example, in Puerto Rico higher rates of inactivity do not correlate strongly with higher obesity rates.
 * Higher levels of fruit deficiency show a stronger correlation with higher obesity rates. However, the relationship is weaker for vegetable deficiency. Also, some data points such as Puerto Rico introduce noise in the data; for example, removing the data from Puerto Rico shows that obesity rates are not correlated with vegetable deficiency rates.

**GitHub:** https://github.com/nsharma73/python_data_analysis


In [43]:
# Import key libraries for analysis
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import plotly.express as px
from IPython.display import display

Read the clean/processed data from csv. This data was processed using the Data_Clean_Up notebook.

In [44]:
us_df = pd.read_csv('us_df.csv')
st_df = pd.read_csv('st_df.csv')

In [45]:
print(len(us_df))
print(len(st_df))

1512
71453


Ensure data is loaded correctly and total row count is as expected.

### <span style="color:red"> Question #1: Which state has the highest obesity rate? </span>
We create a weighted average function to aggregate the percentage of adults; the weights are based on sample size.

In [81]:
def weighted_average(df, values, weights):
    return sum(df[weights] * df[values]) / df[weights].sum()
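
To make the effect of the weighting concrete, here is a minimal sketch on two made-up rows (the numbers are illustrative assumptions, not values from the dataset). It repeats the helper so that it runs on its own, and cross-checks the result against `np.average`.

```python
import numpy as np
import pandas as pd


def weighted_average(df, values, weights):
    # Same formula as the notebook helper: sum(w * x) / sum(w)
    return sum(df[weights] * df[values]) / df[weights].sum()


# Two hypothetical survey rows; the second has three times the sample size
toy = pd.DataFrame({
    "Percent_Adults": [30.0, 40.0],
    "Sample_Size": [1000, 3000],
})

wavg = weighted_average(toy, "Percent_Adults", "Sample_Size")
simple = toy["Percent_Adults"].mean()
check = np.average(toy["Percent_Adults"], weights=toy["Sample_Size"])

print(wavg, simple, check)
```

The weighted result (37.5) sits closer to the row with the larger sample than the plain mean (35.0) does, which is the behaviour we want when survey rows differ in sample size.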


We filter the state dataframe to the 'Total' category and the obese population, then aggregate the percentage of adults to find the state with the maximum rate in each year.

In [82]:
st_ob_df = st_df[(st_df['status'] == 'Obese') & (st_df['Category'] == 'Total')]
st_ob_df = st_ob_df[['Year', 'State','status','Category','Sub_Category',
                     'Age','Education','Gender','Income','Race',
                     'Percent_Adults','Sample_Size']]


In [83]:
grp_tmp = pd.DataFrame(st_ob_df.groupby(['Year',
                                         'State']).apply(weighted_average, 
                                                        'Percent_Adults', 
                                                         'Sample_Size')).reset_index()
grp_tmp.columns = ['Year', 'State', 'Percent_Adults']

In [84]:
# Weighted obesity rate per year and state (same aggregation as grp_tmp in the previous cell)
tmp1 = pd.DataFrame(st_ob_df.groupby(['Year',
                                         'State']).apply(weighted_average, 
                                                        'Percent_Adults', 
                                                         'Sample_Size')).reset_index()
tmp1.columns = ['Year', 'State', 'Percent_Adults']

In [85]:
tmp1['max_state_level'] = tmp1.groupby(['Year']).Percent_Adults.transform('max')

In [86]:
tmp1[tmp1['Percent_Adults']==tmp1['max_state_level']].sort_values(['Year','max_state_level'])

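|  | Year | State | Percent_Adults | max_state_level |
| --- | --- | --- | --- | --- |
| 25 | 2011 | MS | 34.9 | 34.9 |
| 69 | 2012 | LA | 34.7 | 34.7 |
| 127 | 2013 | MS | 35.1 | 35.1 |
| 152 | 2013 | WV | 35.1 | 35.1 |
| 156 | 2014 | AR | 35.9 | 35.9 |
| 226 | 2015 | LA | 36.2 | 36.2 |
| 312 | 2016 | WV | 37.7 | 37.7 |
| 365 | 2017 | WV | 38.1 | 38.1 |
| 393 | 2018 | MS | 39.5 | 39.5 |
| 418 | 2018 | WV | 39.5 | 39.5 |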
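
The table lists two states for 2013 and for 2018 because the filter keeps every row that equals the yearly maximum. The sketch below shows this on made-up data (the state codes `AA`, `BB`, `CC` and all values are hypothetical) and contrasts it with `idxmax`, which returns only one row per year.

```python
import pandas as pd

# Hypothetical state-level rates; AA and BB tie in 2012
toy = pd.DataFrame({
    "Year": [2011, 2011, 2012, 2012, 2012],
    "State": ["AA", "BB", "AA", "BB", "CC"],
    "Percent_Adults": [30.0, 34.0, 35.0, 35.0, 31.0],
})

# transform('max') broadcasts each year's maximum back onto every row of that year
toy["max_state_level"] = toy.groupby("Year")["Percent_Adults"].transform("max")
top_with_ties = toy[toy["Percent_Adults"] == toy["max_state_level"]]

# idxmax gives the first row label of the maximum in each year, so ties are dropped
top_single = toy.loc[toy.groupby("Year")["Percent_Adults"].idxmax()]

print(top_with_ties[["Year", "State"]])
print(top_single[["Year", "State"]])
```

Keeping ties, as the notebook does, is the safer choice here, because it does not silently pick one of two states with the same rate.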


### <span style="color:blue">Conclusion 1: For the last 2 years (2019, 2020) Mississippi had the highest obesity rate in the United States. We also see West Virginia and Louisiana appear on the list a few times.
</span>

### <span style="color:red"> Question 2: Are there any state and national trends related to obesity/overweight/inactive adults? </span>

We look at the national (USA) level data to see what the overall obesity trends are

In [87]:
df_us_obesity = us_df[(us_df['status']=='Obese') & (us_df['Category']=='Total')]

In [88]:
tmp2 = pd.DataFrame(df_us_obesity.groupby(['Year']).apply(weighted_average, 
                                                        'Percent_Adults', 
                                                         'Sample_Size')).reset_index()
tmp2.columns = ['Year','Percent_Adults']

In [89]:
fig = px.line(tmp2, x="Year", y="Percent_Adults", title='US Obesity Rate',
             labels={
                     "Year": "Survey Year",
                     "Percent_Adults": "Percentage of Adults with Obesity"
                 })
fig.show()

### <span style="color:blue">Conclusion 2a: Over the last decade (2011-2020), the USA obesity rates have gone up from 27.4% to 31.9%.
</span>

We look at state level data next

In [90]:
df_state_obesity = st_df[(st_df['status']=='Obese') & (st_df['Category']=='Total')]

In [91]:
df_state_obesity

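|  | Year | State | State_Name | Percent_Adults | LowCI | HighCI | Sample_Size | Age | Education | Gender | Income | Race | Category | Sub_Category | status | Region |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 17 | 2015 | PR | Puerto Rico | 29.5 | 28.0 | 31.1 | 5154.0 | NaN | NaN | NaN | NaN | NaN | Total | Total | Obese | Other |
| 35 | 2016 | NM | New Mexico | 28.3 | 26.6 | 30.1 | 5531.0 | NaN | NaN | NaN | NaN | NaN | Total | Total | Obese | Mountain |
| 56 | 2016 | NH | New Hampshire | 26.6 | 25.0 | 28.2 | 5888.0 | NaN | NaN | NaN | NaN | NaN | Total | Total | Obese | New England |
| 113 | 2016 | MT | Montana | 25.5 | 23.9 | 27.2 | 5483.0 | NaN | NaN | NaN | NaN | NaN | Total | Total | Obese | Mountain |
| 297 | 2018 | AK | Alaska | 29.5 | 27.0 | 32.2 | 2600.0 | NaN | NaN | NaN | NaN | NaN | Total | Total | Obese | Pacific |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 69129 | 2019 | WV | West Virginia | 39.7 | 38.0 | 41.5 | 4988.0 | NaN | NaN | NaN | NaN | NaN | Total | Total | Obese | South Atlantic |
| 69376 | 2019 | ME | Maine | 31.7 | 30.3 | 33.1 | 10455.0 | NaN | NaN | NaN | NaN | NaN | Total | Total | Obese | New England |
| 69394 | 2019 | MS | Mississippi | 40.8 | 39.0 | 42.7 | 4737.0 | NaN | NaN | NaN | NaN | NaN | Total | Total | Obese | East South Central |
| 69431 | 2019 | OR | Oregon | 29.0 | 27.5 | 30.4 | 5548.0 | NaN | NaN | NaN | NaN | NaN | Total | Total | Obese | Pacific |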


In [92]:
tmp3 = pd.DataFrame(df_state_obesity.groupby(['Year','State']).apply(weighted_average, 
                                                        'Percent_Adults', 
                                                         'Sample_Size')).reset_index()
tmp3.columns = ['Year', 'State', 'Percent_Adults']

In [93]:
fig = px.line(tmp3, x="Year", y="Percent_Adults", color='State', title='Adult Obesity Rates by State',
             labels={
                     "Year": "Survey Year",
                     "Percent_Adults": "Percentage of Adults with Obesity"
                 })
fig.show()

In [94]:
tmp_region = pd.DataFrame(df_state_obesity.groupby(['Year','Region']).apply(weighted_average, 
                                                        'Percent_Adults', 
                                                         'Sample_Size')).reset_index()
tmp_region.columns = ['Year', 'Region', 'Percent_Adults']

In [115]:
tmp_region
fig = px.line(tmp_region, x="Year", y="Percent_Adults", color='Region', title='Adult Obesity Rates by Region',
             labels={
                     "Year": "Survey Year",
                     "Percent_Adults": "Percentage of Adults with Obesity"
                 })
fig.show()

In [96]:
pvt_tmp_region=tmp_region.pivot(index='Region', columns='Year', values='Percent_Adults')
pvt_tmp_region=pvt_tmp_region.drop('Other')
heat_pvt_tmp_region=pvt_tmp_region.pct_change(periods = 9, axis=1)

heat_pvt_tmp_region.style.background_gradient(cmap ='Blues')\
        .set_properties(**{'font-size': '10px'})


All-NaN slice encountered



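| Region | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| East North Central | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.177757 |
| East South Central | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.196315 |
| Middle Atlantic | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.087126 |
| Mountain | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.189822 |
| New England | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.149145 |
| Pacific | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.121932 |
| South Atlantic | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.131279 |
| West North Central | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.187976 |
| West South Central | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.159315 |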
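
Only the 2020 column holds values because `pct_change(periods=9, axis=1)` compares each column with the one nine positions to its left, and only 2020 has such a partner (2011). The sketch below reproduces the idea on a made-up three-year table with `periods=2`; the region names and rates are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one row per region, one column per year
wide = pd.DataFrame(
    {2011: [20.0, 30.0], 2012: [21.0, 30.0], 2013: [25.0, 33.0]},
    index=pd.Index(["Region A", "Region B"], name="Region"),
)

# Relative change against the column two positions to the left
change = wide.pct_change(periods=2, axis=1)

# The same quantity written out explicitly for the last year
explicit = wide[2013] / wide[2011] - 1

print(change)
print(explicit)
```

If only the start-to-end change is of interest, the explicit form avoids the all-NaN columns altogether; that is a design choice, not something the notebook requires.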


### <span style="color:blue">Conclusion 2b: Over the last decade (2011-2020), the state level obesity rates have gone up as well. In 2020, the state level chart confirms the highest obesity rate for Mississippi, with West Virginia a close second, followed by Alabama. Looking at higher-level regional data, the East South Central region shows the highest obesity rates, followed by the West South Central region. The East South Central region also showed the largest relative change (~20%) in obesity rates since 2011, from 31.7% to 38%.
</span>

We also look at the map view to see how obesity rates have evolved over the last decade.

In [97]:
fig = px.choropleth(tmp3,
                    locations='State',
                    color='Percent_Adults',
                    color_continuous_scale='spectral_r',
                    hover_name='State',
                    locationmode='USA-states',
                    labels={'Percent_Adults':' Obesity Rate'},
                    scope='usa',
                    animation_frame='Year', 
                    animation_group='State')
fig.show()

### <span style="color:blue">Conclusion 2c: The high obesity rates are concentrated in the south central region. The lowest obesity rates appear to be in Colorado and Massachusetts.
</span>

### <span style="color:red"> Question 3: Are there differences in obesity rates across Gender, Age, Race, Income and Education?</span>

#### This question requires us to create multiple views or charts that aggregate the data by various categories. To streamline the analysis we will create a function that takes two arguments, status and category, and displays a chart as well as a table.

In [98]:
def plot_data(statusv, colnamev, exclude=('Other', '2 or more races', 'Data not reported')):
    """
    This function needs two arguments: a status variable and a column name variable.
    Status can be Obese, Overweight etc.
    Column Name refers to Age, Race, Income etc. in the dataframe.
    Optional argument: exclude, which can be used to exclude certain values
    such as 'Other' and 'Data not reported'.
    """
    # Keep rows for the requested status that have a usable value in the category column
    df0 = us_df[(us_df['status']==statusv) & (us_df[colnamev].notnull()) & (~us_df[colnamev].isin(exclude))]
    # Sample-size-weighted average per year and category value
    df1 = pd.DataFrame(df0.groupby(['Year',colnamev]).apply(weighted_average, 
                                                        'Percent_Adults', 
                                                         'Sample_Size')).reset_index()
    df1.columns = ['Year', colnamev, 'Percent_Adults']
    # Wide table (one column per category value) shown under the chart
    dfp = df1.pivot(index='Year', columns=colnamev, values='Percent_Adults')
    fig = px.bar(df1, x='Year', y='Percent_Adults', color=colnamev, barmode='group')
    
    # fig.show() and display() both return None, so the cell output ends with (None, None)
    return fig.show(), display(dfp) 
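
The aggregation and pivot steps inside `plot_data` can be tried without plotting. The sketch below is not the notebook's implementation: it swaps `groupby(...).apply(weighted_average, ...)` for an equivalent vectorized form (sum of weight times value divided by sum of weights per group), and the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical survey rows: several rows per (Year, Gender) with different sample sizes
toy = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2020, 2020, 2020],
    "Gender": ["Female", "Female", "Male", "Female", "Male", "Male"],
    "Percent_Adults": [30.0, 34.0, 31.0, 32.0, 30.0, 36.0],
    "Sample_Size": [100, 300, 200, 500, 100, 200],
})

# Weighted mean per group: sum(weight * value) / sum(weight)
weighted = toy.assign(wx=toy["Percent_Adults"] * toy["Sample_Size"])
sums = weighted.groupby(["Year", "Gender"])[["wx", "Sample_Size"]].sum()
long = (sums["wx"] / sums["Sample_Size"]).rename("Percent_Adults").reset_index()

# One row per year, one column per category value, as in the tables under each chart
wide = long.pivot(index="Year", columns="Gender", values="Percent_Adults")

print(wide)
```

The long frame is what a grouped bar chart consumes, while the wide frame is easier to read as a table, which is why the function builds both.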

#### Part A: Obesity Rates by Gender

In [99]:
plot_data('Obese','Gender')

| Year | Female | Male |
| --- | --- | --- |
| 2011 | 27.1 | 27.8 |
| 2012 | 27.4 | 28.0 |
| 2013 | 28.3 | 28.3 |
| 2014 | 28.8 | 29.0 |
| 2015 | 28.6 | 29.1 |
| 2016 | 29.5 | 29.6 |
| 2017 | 30.0 | 30.2 |
| 2018 | 31.3 | 30.6 |
| 2019 | 32.1 | 30.6 |
| 2020 | 32.1 | 31.7 |


(None, None)

#### Part B: Obesity Rates by Age

In [100]:
plot_data('Obese','Age')

| Year | 18 - 24 | 25 - 34 | 35 - 44 | 45 - 54 | 55 - 64 | 65 or older |
| --- | --- | --- | --- | --- | --- | --- |
| 2011 | 15.2 | 25.9 | 29.9 | 32.6 | 32.6 | 25.3 |
| 2012 | 15.0 | 25.6 | 31.3 | 32.4 | 33.3 | 25.8 |
| 2013 | 15.4 | 26.4 | 31.7 | 33.3 | 33.5 | 26.5 |
| 2014 | 15.9 | 27.0 | 32.1 | 33.7 | 34.2 | 27.5 |
| 2015 | 16.7 | 26.7 | 32.1 | 34.0 | 33.4 | 27.6 |
| 2016 | 17.3 | 27.2 | 33.1 | 35.1 | 34.2 | 28.0 |
| 2017 | 16.5 | 28.2 | 33.0 | 35.9 | 35.4 | 28.5 |
| 2018 | 18.1 | 29.5 | 34.5 | 36.9 | 35.1 | 28.9 |
| 2019 | 18.9 | 29.5 | 34.6 | 37.6 | 36.0 | 29.3 |
| 2020 | 19.5 | 30.9 | 35.5 | 38.1 | 36.3 | 29.3 |


(None, None)

#### Part C: Obesity Rates by Race

In [101]:
plot_data('Obese','Race')

| Year | American Indian/Alaska Native | Asian | Hawaiian/Pacific Islander | Hispanic | Non-Hispanic Black | Non-Hispanic White |
| --- | --- | --- | --- | --- | --- | --- |
| 2011 | 35.4 | 8.7 | 25.0 | 30.1 | 37.3 | 26.2 |
| 2012 | 33.7 | 9.7 | 25.8 | 30.3 | 37.7 | 26.5 |
| 2013 | 34.9 | 9.6 | 30.1 | 31.5 | 37.8 | 27.1 |
| 2014 | 33.4 | 9.4 | 29.1 | 32.2 | 38.9 | 27.8 |
| 2015 | 36.0 | 10.2 | 32.6 | 32.2 | 37.7 | 27.9 |
| 2016 | 38.1 | 9.8 | 30.6 | 33.1 | 38.3 | 28.6 |
| 2017 | 38.7 | 11.2 | 32.5 | 32.4 | 39.0 | 29.3 |
| 2018 | 39.1 | 11.5 | 35.2 | 34.2 | 39.9 | 29.9 |
| 2019 | 37.9 | 11.4 | 42.8 | 34.7 | 40.7 | 30.4 |
| 2020 | 38.8 | 11.8 | 38.5 | 36.6 | 41.6 | 30.7 |


(None, None)

#### Part D: Obesity Rates by Income

In [102]:
plot_data('Obese','Income')

| Year | $15,000 - $24,999 | $25,000 - $34,999 | $35,000 - $49,999 | $50,000 - $74,999 | $75,000 or greater | Less than $15,000 |
| --- | --- | --- | --- | --- | --- | --- |
| 2011 | 31.3 | 29.5 | 28.7 | 28.4 | 22.8 | 32.3 |
| 2012 | 31.6 | 29.2 | 28.5 | 28.9 | 23.4 | 33.1 |
| 2013 | 32.3 | 29.8 | 30.0 | 28.8 | 23.9 | 33.6 |
| 2014 | 32.2 | 31.1 | 30.7 | 29.2 | 25.1 | 35.2 |
| 2015 | 33.2 | 32.1 | 30.6 | 29.9 | 24.6 | 34.8 |
| 2016 | 33.4 | 31.9 | 32.0 | 31.1 | 25.4 | 35.4 |
| 2017 | 34.5 | 32.8 | 31.9 | 31.3 | 26.0 | 36.2 |
| 2018 | 35.6 | 33.2 | 32.7 | 33.7 | 27.4 | 35.4 |
| 2019 | 36.3 | 34.0 | 33.8 | 32.3 | 28.3 | 36.1 |
| 2020 | 36.1 | 35.1 | 34.3 | 34.0 | 29.0 | 37.9 |


(None, None)

#### Part E: Obesity Rates by Education

In [103]:
plot_data('Obese','Education')

| Year | College graduate | High school graduate | Less than high school | Some college or technical school |
| --- | --- | --- | --- | --- |
| 2011 | 20.6 | 29.7 | 32.4 | 28.6 |
| 2012 | 21.2 | 29.9 | 32.9 | 28.7 |
| 2013 | 20.9 | 30.7 | 34.1 | 29.6 |
| 2014 | 21.8 | 31.4 | 34.4 | 30.2 |
| 2015 | 21.7 | 31.7 | 34.0 | 30.2 |
| 2016 | 22.2 | 32.3 | 35.5 | 31.0 |
| 2017 | 22.7 | 32.9 | 35.6 | 31.9 |
| 2018 | 24.7 | 33.1 | 35.0 | 33.0 |
| 2019 | 25.0 | 34.3 | 36.2 | 32.8 |
| 2020 | 25.0 | 34.0 | 38.8 | 34.1 |


(None, None)

### <span style="color:blue"> Conclusion 3 
#### A. Male obesity rates were higher through 2017; female obesity rates then accelerated and are now higher than male obesity rates. This may require further research.
#### B. Obesity rates are highest in the 45 to 54 age group
#### C. The Black population shows the highest obesity rates
#### D. Lower income levels correlate with higher obesity rates
#### E. Lower levels of education correlate with higher obesity rates
</span>

### <span style="color:red"> Question 4: Do low levels of exercise and poor nutrition correlate with higher levels of obesity? </span>

#### This part of the analysis looks at the percentage of inactive adults and the percentage of obese adults by state. The idea is to plot the two metrics and visually inspect for any correlation.

In [104]:
# we will use the existing table with percent obese adults by state and rename the column for convenience 
tmp3.rename(columns = {'Percent_Adults':'Percent_Obese_Adults'}, inplace = True)

In [105]:
df_state_inactive = st_df[(st_df['status']=='Inactive') & (st_df['Category']=='Total')]

In [106]:
tmp_inactive = pd.DataFrame(df_state_inactive.groupby(['Year','State']).apply(weighted_average, 
                                                        'Percent_Adults', 
                                                         'Sample_Size')).reset_index()
tmp_inactive.columns = ['Year', 'State', 'Percent_Inactive_Adults']

In [107]:
mdf_st=tmp3.merge(tmp_inactive, how='inner', on=['Year','State'])
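
An inner merge keeps only the (Year, State) pairs present in both tables, so any state-year missing from one side silently drops out of the scatter plot. The sketch below, on made-up frames with hypothetical state codes, uses an outer merge with `indicator=True` to list what an inner merge would discard.

```python
import pandas as pd

# Hypothetical aggregated tables with partially overlapping keys
obese = pd.DataFrame({
    "Year": [2019, 2019, 2020],
    "State": ["AA", "BB", "AA"],
    "Percent_Obese_Adults": [30.0, 35.0, 31.0],
})
inactive = pd.DataFrame({
    "Year": [2019, 2020, 2020],
    "State": ["AA", "AA", "BB"],
    "Percent_Inactive_Adults": [22.0, 23.0, 28.0],
})

# Same kind of merge as in the notebook: only keys found on both sides survive
inner = obese.merge(inactive, how="inner", on=["Year", "State"])

# Audit: the _merge column says which side each key came from
audit = obese.merge(inactive, how="outer", on=["Year", "State"], indicator=True)
dropped = audit[audit["_merge"] != "both"]

print(inner)
print(dropped[["Year", "State", "_merge"]])
```

Running a check like this once is a cheap way to confirm that the inner merge is not discarding more state-years than expected.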

In [108]:
fig = px.scatter(x=mdf_st['Percent_Inactive_Adults'], 
                 y=mdf_st['Percent_Obese_Adults'], 
                 color=mdf_st['State'],
                 labels={
                     'x': 'Percent Inactive Adults',
                     'y': 'Percent Obese Adults'
                 },
                title='Relationship between Inactive and Obese Population')
fig.show()

### <span style="color:blue">Conclusion A: Higher inactivity rates correlate positively with higher obesity rates. We do see some noise in the data; for example, in Puerto Rico higher inactivity rates do not correlate strongly with higher obesity rates.</span> 
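
The conclusion above rests on visual inspection of the scatter plot. A Pearson correlation coefficient can put a number on it. The sketch below is an optional addition, not part of the original analysis; it uses a made-up frame shaped like `mdf_st`, and the state codes and values are hypothetical.

```python
import pandas as pd

# Hypothetical merged frame with the same column names as mdf_st
toy = pd.DataFrame({
    "Year": [2019] * 4 + [2020] * 4,
    "State": ["AA", "BB", "CC", "DD"] * 2,
    "Percent_Inactive_Adults": [20.0, 24.0, 28.0, 32.0, 21.0, 25.0, 29.0, 33.0],
    "Percent_Obese_Adults": [27.0, 30.0, 33.0, 36.0, 28.0, 30.0, 34.0, 36.0],
})

x, y = "Percent_Inactive_Adults", "Percent_Obese_Adults"

# Pearson r over all state-years pooled together
overall_r = toy[x].corr(toy[y])

# Pearson r within each year, to check that the pooled value is not driven by one year
by_year = toy.groupby("Year")[[x, y]].apply(lambda g: g[x].corr(g[y]))

print(round(overall_r, 3))
print(by_year.round(3))
```

A coefficient near 1 indicates a strong positive linear relationship; it says nothing about causation, so it supports the wording "correlate" but no more than that.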

In [109]:
df_state_nfv = st_df[(st_df['status']=='Veggies Deficient') & (st_df['Category']=='Total')]
df_state_nff = st_df[(st_df['status']=='Fruits Deficient') & (st_df['Category']=='Total')]

In [110]:
tmp_nfv = pd.DataFrame(df_state_nfv.groupby(['Year','State']).apply(weighted_average, 
                                                        'Percent_Adults', 
                                                         'Sample_Size')).reset_index()
tmp_nfv.columns = ['Year', 'State', 'Percent_Adults_Deficient_Veggies']

tmp_nff = pd.DataFrame(df_state_nff.groupby(['Year','State']).apply(weighted_average, 
                                                        'Percent_Adults', 
                                                         'Sample_Size')).reset_index()
tmp_nff.columns = ['Year', 'State', 'Percent_Adults_Deficient_Fruits']

In [111]:
ndf_v=tmp3.merge(tmp_nfv, how='inner', on=['Year','State'])

In [112]:
ndf_f=tmp3.merge(tmp_nff, how='inner', on=['Year','State'])

In [113]:
fig = px.scatter(x=ndf_f['Percent_Obese_Adults'], 
                 y=ndf_f['Percent_Adults_Deficient_Fruits'], 
                 color=ndf_f['State'],
                 labels={
                     'x': 'Percent Adults Obese',
                     'y': 'Percent Adults Deficient in Fruits'
                 },
                title='Relationship between Obesity and Fruit Deficiency among Population')
fig.show()

fig = px.scatter(x=ndf_v['Percent_Obese_Adults'], 
                 y=ndf_v['Percent_Adults_Deficient_Veggies'], 
                 color=ndf_v['State'],
                 labels={
                     'x': 'Percent Adults Obese',
                     'y': 'Percent Adults Deficient in Veggies'
                 },
                title='Relationship between Obesity and Vegetable Deficiency among Population')
fig.show()

### <span style="color:blue">Conclusion B: Higher levels of fruit deficiency show a stronger correlation with higher obesity rates. However, the relationship is weaker for vegetable deficiency. Also, some data points such as Puerto Rico introduce noise in the data; for example, removing the data from Puerto Rico shows that obesity rates are not correlated with vegetable deficiency rates.</span> 
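
One way to check how much a single state influences the result is to compute the correlation with and without it. The helper `corr_without` below is hypothetical (it is not in the notebook), and the data is made up so that one state, `ZZ`, is a clear outlier. In this toy case the outlier hides a relationship; the same check can just as well reveal that an apparent relationship disappears, as described for Puerto Rico above.

```python
import pandas as pd


def corr_without(df, x, y, drop_states=()):
    """Pearson r between columns x and y after dropping the listed states."""
    kept = df[~df["State"].isin(list(drop_states))]
    return kept[x].corr(kept[y])


# Hypothetical state-level values; ZZ has an unusually high deficiency rate
toy = pd.DataFrame({
    "State": ["AA", "BB", "CC", "DD", "ZZ"],
    "Percent_Obese_Adults": [26.0, 29.0, 32.0, 35.0, 30.0],
    "Percent_Adults_Deficient_Veggies": [18.0, 20.0, 22.0, 24.0, 45.0],
})

x, y = "Percent_Obese_Adults", "Percent_Adults_Deficient_Veggies"
r_all = corr_without(toy, x, y)
r_trim = corr_without(toy, x, y, drop_states=["ZZ"])

print(round(r_all, 3), round(r_trim, 3))
```

Reporting both values side by side makes it clear to the reader how sensitive the conclusion is to one data point.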

In [114]:
print('End of Analysis')

End of Analysis
