# Obesity in the US: Initial Exploration
- This dataset covers obesity rates for adults aged 18 years and older across the United States.
- It breaks down national obesity rates by age, income, education, gender, and race/ethnicity.
- In this notebook, I explore the dataset as groundwork for deeper analysis in other notebooks.
  



# Setup

In [1]:
import sys

import matplotlib
import matplotlib.pyplot as plt
import numpy as np  # the earlier `import numpy.random as np` was shadowed by this import, so it is dropped
import pandas as pd
import seaborn as sns


%matplotlib inline

# Load & Inspect Data

In [2]:
obesity_in_US_df = pd.read_csv('../data/Raw Data/Nutrition_Physical_Activity_and_Obesity_Behavioral_Risk_Factor_Surveillance_System.csv')
obesity_in_US_df.shape

(53392, 33)

There are 53,392 rows and 33 columns in this DataFrame.

In [3]:
obesity_in_US_df.sample(20)

Unnamed: 0,YearStart,YearEnd,LocationAbbr,LocationDesc,Datasource,Class,Topic,Question,Data_Value_Unit,Data_Value_Type,...,GeoLocation,ClassID,TopicID,QuestionID,DataValueTypeID,LocationID,StratificationCategory1,Stratification1,StratificationCategoryId1,StratificationID1
14852,2014,2014,MA,Massachusetts,Behavioral Risk Factor Surveillance System,Physical Activity,Physical Activity - Behavior,Percent of adults who engage in no leisure-tim...,,Value,...,"(42.27687047000046, -72.08269067499964)",PA,PA1,Q047,VALUE,25,Gender,Male,GEN,MALE
31511,2013,2013,VT,Vermont,Behavioral Risk Factor Surveillance System,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,,Value,...,"(43.62538123900049, -72.51764079099962)",OWS,OWS1,Q037,VALUE,50,Age (years),18 - 24,AGEYR,AGEYR1824
10800,2014,2014,IA,Iowa,Behavioral Risk Factor Surveillance System,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,,Value,...,"(42.46940091300047, -93.81649055599968)",OWS,OWS1,Q037,VALUE,19,Age (years),55 - 64,AGEYR,AGEYR5564
33809,2014,2014,WV,West Virginia,Behavioral Risk Factor Surveillance System,Physical Activity,Physical Activity - Behavior,Percent of adults who engage in no leisure-tim...,,Value,...,"(38.66551020200046, -80.71264013499967)",PA,PA1,Q047,VALUE,54,Gender,Female,GEN,FEMALE
21606,2011,2011,NM,New Mexico,Behavioral Risk Factor Surveillance System,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,,Value,...,"(34.52088095200048, -106.24058098499967)",OWS,OWS1,Q036,VALUE,35,Income,"$25,000 - $34,999",INC,INC2535
13008,2011,2011,ME,Maine,Behavioral Risk Factor Surveillance System,Physical Activity,Physical Activity - Behavior,Percent of adults who achieve at least 150 min...,,Value,...,"(45.254228894000505, -68.98503133599962)",PA,PA1,Q044,VALUE,23,Education,College graduate,EDU,EDUCOGRAD
1180,2013,2013,AK,Alaska,Behavioral Risk Factor Surveillance System,Physical Activity,Physical Activity - Behavior,Percent of adults who achieve at least 150 min...,,Value,...,"(64.84507995700051, -147.72205903599973)",PA,PA1,Q044,VALUE,2,Race/Ethnicity,Non-Hispanic White,RACE,RACEWHT
9761,2012,2012,IN,Indiana,Behavioral Risk Factor Surveillance System,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,,Value,...,"(39.766910452000445, -86.14996019399968)",OWS,OWS1,Q036,VALUE,18,Education,Some college or technical school,EDU,EDUCOTEC
7417,2014,2014,GU,Guam,Behavioral Risk Factor Surveillance System,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,,Value,...,"(13.444304, 144.793731)",OWS,OWS1,Q036,VALUE,66,Income,"$35,000 - $49,999",INC,INC3550
45192,2015,2015,WI,Wisconsin,Behavioral Risk Factor Surveillance System,Physical Activity,Physical Activity - Behavior,Percent of adults who engage in no leisure-tim...,,Value,...,"(44.39319117400049, -89.81637074199966)",PA,PA1,Q047,VALUE,55,Age (years),55 - 64,AGEYR,AGEYR5564


* YearStart/YearEnd - range from 2011 to 2016 (I will focus only on 2014)

* LocationAbbr - location abbreviation (i.e., state abbreviation)

* LocationDesc - full state name

* Class/Topic - only focusing on Obesity / Weight Status

* Question - only want
       'Percent of adults aged 18 years and older who have obesity',
       'Percent of adults aged 18 years and older who have an overweight classification'

* Topic - focusing only on the percent of adults aged 18 years and older who are obese

* StratificationCategory1 - shows the category: Total, Age (years), Income, Education, Gender, or Race/Ethnicity

* Stratification1 - shows the specific group within that category (e.g., Male, or 18 - 24)
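
The filtering plan in the notes above (2014 only, the Obesity / Weight Status class, and the two questions of interest) could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual cleaning step: `sample_df` and its rows are made up, and the placeholder question text is hypothetical. In the notebook the same mask would be applied to `obesity_in_US_df`.

```python
import pandas as pd

# Hypothetical stand-in rows; the real notebook would filter obesity_in_US_df.
sample_df = pd.DataFrame({
    "YearStart": [2014, 2014, 2013, 2014],
    "Class": [
        "Obesity / Weight Status",
        "Physical Activity",
        "Obesity / Weight Status",
        "Obesity / Weight Status",
    ],
    "Question": [
        "Percent of adults aged 18 years and older who have obesity",
        "Some other question (placeholder)",
        "Percent of adults aged 18 years and older who have obesity",
        "Percent of adults aged 18 years and older who have an overweight classification",
    ],
})

# The two question strings listed in the notes above.
wanted_questions = [
    "Percent of adults aged 18 years and older who have obesity",
    "Percent of adults aged 18 years and older who have an overweight classification",
]

# Combine the three conditions into one boolean mask.
mask = (
    (sample_df["YearStart"] == 2014)
    & (sample_df["Class"] == "Obesity / Weight Status")
    & (sample_df["Question"].isin(wanted_questions))
)
filtered_df = sample_df[mask]
print(filtered_df)
```

Building the mask separately keeps each condition readable and makes it easy to add or drop a condition later.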

### Exploring the columns

In [4]:
#look at columns
obesity_in_US_df.columns

Index(['YearStart', 'YearEnd', 'LocationAbbr', 'LocationDesc', 'Datasource',
       'Class', 'Topic', 'Question', 'Data_Value_Unit', 'Data_Value_Type',
       'Data_Value', 'Data_Value_Alt', 'Data_Value_Footnote_Symbol',
       'Data_Value_Footnote', 'Low_Confidence_Limit', 'High_Confidence_Limit ',
       'Sample_Size', 'Total', 'Age(years)', 'Education', 'Gender', 'Income',
       'Race/Ethnicity', 'GeoLocation', 'ClassID', 'TopicID', 'QuestionID',
       'DataValueTypeID', 'LocationID', 'StratificationCategory1',
       'Stratification1', 'StratificationCategoryId1', 'StratificationID1'],
      dtype='object')
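
Note that `'High_Confidence_Limit '` in the output above carries a trailing space, so `df['High_Confidence_Limit']` would raise a `KeyError`. One possible way to guard against this, assuming no column name relies on surrounding whitespace, is to strip the column labels once after loading. The small `demo_df` below is a hypothetical stand-in for `obesity_in_US_df`.

```python
import pandas as pd

# Hypothetical two-column frame mimicking the trailing space seen in the column list.
demo_df = pd.DataFrame({
    "Low_Confidence_Limit": [25.0],
    "High_Confidence_Limit ": [29.0],
})

# Strip leading/trailing whitespace from every column label.
demo_df.columns = demo_df.columns.str.strip()

print(list(demo_df.columns))
```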

In [5]:
# find the years in the data
obesity_in_US_df['YearStart'].unique()

array([2011, 2012, 2014, 2013, 2015, 2016])

In [6]:
# find what states are in the data
obesity_in_US_df['LocationAbbr'].unique()

array(['AL', 'US', 'AK', 'AZ', 'AR', 'CA', 'CT', 'CO', 'DE', 'FL', 'DC',
       'GA', 'GU', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME',
       'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
       'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'PR', 'RI', 'SC',
       'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WV', 'WA', 'WI', 'WY', 'VI'],
      dtype=object)

- Eventually, I will drop GU, LA, and VI.

In [7]:
obesity_in_US_df['LocationDesc'].unique()

array(['Alabama', 'National', 'Alaska', 'Arizona', 'Arkansas',
       'California', 'Connecticut', 'Colorado', 'Delaware', 'Florida',
       'District of Columbia', 'Georgia', 'Guam', 'Hawaii', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
       'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina',
       'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont',
       'Virginia', 'West Virginia', 'Washington', 'Wisconsin', 'Wyoming',
       'Virgin Islands'], dtype=object)

In [8]:
# explore the ages column
obesity_in_US_df['Age(years)'].unique()

array([nan, '18 - 24', '25 - 34', '35 - 44', '45 - 54', '55 - 64',
       '65 or older'], dtype=object)

- There are 6 age groups (NaN marks rows stratified by a different category).

In [9]:
#explore the education column
obesity_in_US_df['Education'].unique()

array([nan, 'Less than high school', 'High school graduate',
       'Some college or technical school', 'College graduate'],
      dtype=object)

- there are 4 education levels 

In [10]:
# explore the gender column
obesity_in_US_df['Gender'].unique()

array([nan, 'Male', 'Female'], dtype=object)

In [11]:
# explore the income column
obesity_in_US_df['Income'].unique()

array([nan, 'Less than $15,000', '$15,000 - $24,999', '$25,000 - $34,999',
       '$35,000 - $49,999', '$50,000 - $74,999', '$75,000 or greater',
       'Data not reported'], dtype=object)

- There are 6 income groups, plus a 'Data not reported' category.

In [12]:
# explore the race/ethnicity column
obesity_in_US_df['Race/Ethnicity'].unique()

array([nan, 'Non-Hispanic White', 'Non-Hispanic Black', 'Hispanic',
       'Asian', 'Hawaiian/Pacific Islander',
       'American Indian/Alaska Native', '2 or more races', 'Other'],
      dtype=object)

- There are 8 race/ethnicity groups (including 'Other').
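
Counting groups by eye from a `unique()` printout is easy to get wrong, especially with `nan` mixed in. As a rough sketch, `nunique()` counts distinct values per column and ignores `NaN` by default. The `demo_df` below is a hypothetical stand-in that reuses the age and gender values shown in the outputs above; in the notebook the same call would run on `obesity_in_US_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in using the category values printed above.
demo_df = pd.DataFrame({
    "Age(years)": [np.nan, "18 - 24", "25 - 34", "35 - 44",
                   "45 - 54", "55 - 64", "65 or older", np.nan],
    "Gender": [np.nan, "Male", "Female", "Male",
               "Female", np.nan, "Male", "Female"],
})

# nunique() drops NaN by default, so only real groups are counted.
group_counts = demo_df[["Age(years)", "Gender"]].nunique()
print(group_counts)
```

Passing `dropna=False` would count `NaN` as its own value, which can help when checking how missing entries are encoded.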

In [13]:
# explore the stratification category 
obesity_in_US_df['StratificationCategory1'].unique()

array(['Total', 'Gender', 'Education', 'Age (years)', 'Income',
       'Race/Ethnicity'], dtype=object)

In [14]:
# look at what a row consists of
obesity_in_US_df.iloc[12]

YearStart                                                                  2011
YearEnd                                                                    2011
LocationAbbr                                                                 AL
LocationDesc                                                            Alabama
Datasource                           Behavioral Risk Factor Surveillance System
Class                                                   Obesity / Weight Status
Topic                                                   Obesity / Weight Status
Question                      Percent of adults aged 18 years and older who ...
Data_Value_Unit                                                             NaN
Data_Value_Type                                                           Value
Data_Value                                                                 27.1
Data_Value_Alt                                                             27.1
Data_Value_Footnote_Symbol              