## Data Processing Notebook
### Evolution of obesity rates in the United States over the last decade

This project demonstrates data analysis using Python.

Obesity is a common, serious, and costly disease. Obesity-related conditions include heart disease, stroke, type 2 diabetes and certain types of cancer. These are among the leading causes of preventable, premature death. The estimated annual medical cost of obesity in the United States was nearly $173 billion in 2019 dollars. Medical costs for adults who had obesity were $1,861 higher than medical costs for people with a healthy weight.

In this project we will explore the Nutrition, Physical Activity, and Obesity - Behavioral Risk Factor Surveillance System data. We will also explore whether obesity correlates strongly with age, gender, race, and income, and how these trends vary across the United States.

**About Data**:
This dataset includes data on adults' diet, physical activity, and weight status from the Behavioral Risk Factor Surveillance System. This data is used for DNPAO's Data, Trends, and Maps database, which provides national and state-specific data on obesity, nutrition, physical activity, and breastfeeding.
https://chronicdata.cdc.gov/Nutrition-Physical-Activity-and-Obesity/Nutrition-Physical-Activity-and-Obesity-Behavioral/hn4x-zwk7

**Updated**: December 7, 2021

**Data Provided by**: Centers for Disease Control and Prevention (CDC), National Center for Chronic Disease Prevention and Health Promotion, Division of Nutrition, Physical Activity, and Obesity

Please see the Project-Obesity-Analysis notebook for the conclusions.

### Research Questions
1. Which state has the highest percentage of obese adults?
2. Are there any state and national trends related to obesity/overweight/inactive adults? 
3. Are there differences in obesity rates across Gender, Age, Race, Income and Education?
4. Do low levels of exercise and poor nutrition correlate with higher levels of obesity?

**GitHub:** https://github.com/nsharma73/python_data_analysis


In [None]:
# Import key libraries for analysis
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import plotly.express as px
from IPython.display import display

Read the data from a CSV file. The data can be downloaded from the CDC website, as discussed in the intro.

In [None]:
df = pd.read_csv('Nutrition_Physical_Activity_and_Obesity.csv')

Ensure data is loaded correctly and total row count is as expected.

In [None]:
original_row_count = len(df.index)
df.head()

In [None]:
df.describe(include='all').transpose()
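
A quick load check could look like the sketch below. It uses a tiny made-up frame in place of the real CSV, and `expected_columns` is an assumed subset of the columns used later in this notebook, not an official schema.

```python
import pandas as pd

# Hypothetical stand-in for the loaded CSV (values are made up);
# in the notebook the frame comes from pd.read_csv
sample = pd.DataFrame({
    'YearStart': [2011, 2012],
    'LocationAbbr': ['AL', 'US'],
    'Question': ['Percent of adults aged 18 years and older who have obesity'] * 2,
    'Data_Value': [32.0, 27.4],
})

# Columns the later steps rely on (an assumed subset)
expected_columns = ['YearStart', 'LocationAbbr', 'Question', 'Data_Value']
missing_columns = [c for c in expected_columns if c not in sample.columns]

print('rows:', len(sample.index))
print('missing columns:', missing_columns)
```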

### Data Exploration 
This is survey data, so we will start by looking at the questions.

In [None]:
qs = set(df['Question'])

In [None]:
qs

#### Explore non-numeric data to better understand the categorical variables
We will create a list of the non-numeric columns and then loop over it,
building a set of unique values for each attribute

In [None]:
col_dtype_n = pd.DataFrame(df.describe().transpose()).index
col_names = pd.DataFrame(df.columns)
print(col_dtype_n)
print(col_names)

In [None]:
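# Keep only the non-numeric column names: blank out every name that appears
# in the numeric index, then drop the resulting NaN rows
str_col = col_names[pd.isna(col_names[col_names.isin(col_dtype_n)])].dropna(axis=0)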

In [None]:
i_cols_array = np.array(str_col)
# Let's test whether the data are appropriately organized in the set with unique values 
# we are testing the code with questions set
set(np.array(df[['Question']])[:,0])

In [None]:
for i in i_cols_array:
    print(i)
    print(set(np.array(df[i])[:,0]))
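
As an alternative sketch, pandas can select the non-numeric columns by dtype directly. The toy frame below stands in for the survey data and its values are made up; `text_columns` and `unique_values` are illustrative names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy frame standing in for the survey data (values are made up)
toy = pd.DataFrame({
    'LocationAbbr': ['AL', 'AK', 'AL'],
    'Question': ['Q1', 'Q2', 'Q1'],
    'Data_Value': [30.1, 28.5, 31.0],
})

# Select the non-numeric columns directly by dtype
text_columns = toy.select_dtypes(exclude='number').columns

# Build a set of unique values for each non-numeric column
unique_values = {col: set(toy[col].dropna()) for col in text_columns}
for col, values in unique_values.items():
    print(col, values)
```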

In [None]:
df[['LocationID','LocationDesc']].head(15)

### Key Takeaways
1. LocationAbbr and LocationDesc are state codes and state names, so we can drop LocationID
2. Datasource has only one value "Behavioral Risk Factor Surveillance System" and can be deleted
3. Class and Topic work together: both are important and provide data about the question category
    'Fruits and Vegetables', 'Obesity / Weight Status', 'Physical Activity' for class, and
    'Physical Activity - Behavior', 'Fruits and Vegetables - Behavior', 'Obesity / Weight Status' for topic
4. ClassID and TopicID may not be needed
5. There are 9 questions and these are key to this analysis and we may not need QuestionID
6. Data_Value_Type and DataValueTypeID are not needed
7. Data_Value_Footnote_Symbol is not needed
8. Data_Value_Footnote should be used to delete rows where sample size is not sufficient 
9. Sample_Size should be converted to numeric data type
10. Total column can be used to get totals without any stratification, we will use StratificationCategory1 instead
11. Age, Education, Gender, Income, Race/Ethnicity are key dimensions for this analysis
12. We will not need GeoLocation, and we will work with State data
13. Stratification1 is more granular than StratificationCategory1; together they form a hierarchy
14. StratificationCategoryId1 and StratificationID1 are not needed


In [None]:
df.info()

### Data Cleaning
The strategy to clean the data includes the following tasks:
- Understand the data quality by exploring unique and null values
- Remove columns that are redundant or not needed
- Assess missing data rationale and handle missing data
- Identify columns that appear to have only one value
- Rename columns to be more intuitive 

In [None]:
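# The data value and alternative data value appear to be the same.
# NaN != NaN evaluates to True, so rows where both values are missing are excluded from the count.
both_missing = df.Data_Value_Alt.isna() & df.Data_Value.isna()
((df.Data_Value_Alt != df.Data_Value) & ~both_missing).sum()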

In [None]:
# Double check the data values before dropping the columns 
print(df.Data_Value_Unit.unique())
print(df.Data_Value_Type.unique())
print(df.Data_Value_Footnote_Symbol.unique())
print(df.Total.unique())
print(df.DataValueTypeID.unique())
print(df.Datasource.unique())
print(df.Data_Value_Footnote.unique())

In [None]:
df.drop(['Data_Value_Unit', 'Data_Value_Type', 'LocationID', 'GeoLocation',
         'Data_Value_Alt', 'Data_Value_Footnote_Symbol','Total',
         'DataValueTypeID','Datasource'], axis=1, inplace = True)

In [None]:
len(df.index)

#### Delete records flagged 'Data not available because sample size is insufficient.'

In [None]:
df.drop(df[df['Data_Value_Footnote'] == 
           'Data not available because sample size is insufficient.'].index , inplace = True)

In [None]:
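# Share of the original rows removed by the insufficient-sample-size filter
pct_deleted = round((original_row_count - len(df.index)) / original_row_count * 100, 2)
print(f'{pct_deleted} percent of rows were deleted due to insufficient sample size')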
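
The footnote filter and the share of rows it removes can also be sketched with a boolean mask, which avoids modifying the frame in place. The toy frame and its values are made up for illustration; `kept` and `pct_removed` are hypothetical names.

```python
import pandas as pd

note = 'Data not available because sample size is insufficient.'

# Toy rows (made up): the flagged row has no data value
toy = pd.DataFrame({
    'Data_Value': [30.1, None, 28.4, 33.3],
    'Data_Value_Footnote': [None, note, None, None],
})

# Keep every row whose footnote is not the insufficient-sample-size note;
# rows with a missing footnote compare as "not equal" and are kept
kept = toy[toy['Data_Value_Footnote'] != note]

pct_removed = round((len(toy) - len(kept)) / len(toy) * 100, 2)
print(f'{pct_removed} percent of rows removed')
```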

In [None]:
len(df.index)

#### The Data_Value_Footnote column can now be deleted, and we can simplify the Year dimension

In [None]:
print(df.Data_Value_Footnote.unique())
# Data_Value_Footnote column is no longer needed, we used this data to delete missing values
df.drop(['Data_Value_Footnote'], axis=1, inplace = True)

In [None]:
# YearStart and YearEnd hold the same values, likely because each data collection period fell within a single year.
print(sum(df.YearStart != df.YearEnd))
# We do not need both columns, so we will keep YearEnd
df.drop(['YearStart'], axis=1, inplace = True)

#### The sample size will be used for weighted averages, so we convert it to a numeric type

In [None]:
df['Sample_Size'] = df['Sample_Size'].str.replace(',', '').astype(float)
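
A more defensive variant of this conversion, shown here on a made-up series rather than the real column, is to strip the separators and then let `pd.to_numeric` coerce anything unparseable to NaN. Whether the real column contains such entries is not known from this notebook, so treat this as a sketch.

```python
import pandas as pd

# Made-up sample sizes stored as text with thousands separators
sizes = pd.Series(['1,234', '567', None, '12,000'])

# Strip the separators, then convert; errors='coerce' turns
# unparseable entries into NaN instead of raising an error
sizes_numeric = pd.to_numeric(sizes.str.replace(',', '', regex=False), errors='coerce')
print(sizes_numeric.tolist())
```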

In [None]:
df.describe(include='all').transpose()

#### Questions provide the context for each data value. There are 9 unique questions
The questions appear to quantify data value in terms of:
- What percentage of adults have obesity? or
- What percentage of adults exercise?

In [None]:
print(len(df.Question.unique()))
df.Question.unique()

### We use a function to create a "status" column that represents questions
This step is data transformation, which may help us with readability and plotting the data

In [None]:
def qf_update(row):
    if row['Question'] == 'Percent of adults aged 18 years and older who have obesity':
        val = 'Obese'
    elif row['Question'] == 'Percent of adults aged 18 years and older who have an overweight classification':
        val = 'Overweight'
    elif row['Question'] == 'Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)':
        val = 'Very Active'
    elif row['Question'] == 'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week':
        val = 'Active and Physical Training' 
    elif row['Question'] == 'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)':
        val = 'Active'     
    elif row['Question'] == 'Percent of adults who engage in no leisure-time physical activity':
        val = 'Inactive'  
    elif row['Question'] == 'Percent of adults who engage in muscle-strengthening activities on 2 or more days a week':
        val = 'Physical Training'
    elif row['Question'] == 'Percent of adults who report consuming fruit less than one time daily':
        val = 'Fruits Deficient'
    elif row['Question'] == 'Percent of adults who report consuming vegetables less than one time daily':
        val = 'Veggies Deficient'
    else:
        val = 'no data available'
    return val

In [None]:
# Run the function, which will create a new column called "status"
df['status'] = df.apply(qf_update, axis=1)
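
The same labelling could be done without a row-wise `apply` by mapping the question text through a dictionary. This is only a sketch: `status_by_question` is a hypothetical, shortened mapping with two of the nine questions, and the third entry in `questions` is an invented string that triggers the fallback.

```python
import pandas as pd

# Shortened, hypothetical mapping; a full version would list all nine questions
status_by_question = {
    'Percent of adults aged 18 years and older who have obesity': 'Obese',
    'Percent of adults who engage in no leisure-time physical activity': 'Inactive',
}

questions = pd.Series([
    'Percent of adults aged 18 years and older who have obesity',
    'Percent of adults who engage in no leisure-time physical activity',
    'Some unrecognised question',
])

# map() looks up each question; unmatched rows become NaN
# and are then filled with the fallback label
status = questions.map(status_by_question).fillna('no data available')
print(status.tolist())
```

On a large frame, `Series.map` with a dictionary is usually faster than `DataFrame.apply` with `axis=1`, because it avoids calling a Python function once per row.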

### The data may be imbalanced due to the independence of questions and timing of surveys

In [None]:
df.status.value_counts()

In [None]:
# The following columns are not needed for analysis 
df.drop(['Class', 'Topic', 'Question','ClassID','TopicID','QuestionID',], axis=1, inplace=True)

In [None]:
# Rename columns so they are easier to use in the rest of the code
df.rename(columns={'Data_Value': 'Percent_Adults', 
                   'Low_Confidence_Limit': 'LowCI', 
                   'High_Confidence_Limit ': 'HighCI' }, inplace=True)


### The categories and subcategories are key dimensions that are required to answer research questions
The category and subcategory give further context to the data. For example, age is a category, and an age range such as 18 to 24 years is a subcategory of age. We will use these dimensions to compare obesity rates within each category, such as across age groups or between males and females.

In [None]:
df.groupby(['StratificationCategory1','Stratification1'])['status'].count()

In [None]:
print(df.StratificationCategory1.value_counts())
print(df.StratificationCategoryId1.value_counts())
print(df.Stratification1.value_counts())
print(df.StratificationID1.value_counts())

In [None]:
# The ID fields do not provide additional information, so we will work with category and subcategories 
df.drop(['StratificationCategoryId1', 'StratificationID1'], axis=1, inplace=True)

In [None]:
# renaming the columns for ease of use
df.rename(columns={'StratificationCategory1': 'Category', 
                   'Stratification1': 'Sub_Category',
                   'YearEnd':'Year',
                   'LocationAbbr':'State',
                   'LocationDesc':'State_Name',
                   'Age(years)':'Age',
                   'Race/Ethnicity':'Race'
                  }, inplace=True)
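
Comparing rates within a category, weighted by sample size, might be sketched as follows. The frame uses the renamed columns, all numbers are made up, and `weighted_mean` is a hypothetical helper. Weighting by `Sample_Size` follows the intent stated earlier in this notebook and is not necessarily the survey's official weighting method.

```python
import numpy as np
import pandas as pd

# Toy rows using the renamed columns; all numbers are made up
toy = pd.DataFrame({
    'Category': ['Gender', 'Gender', 'Gender', 'Gender'],
    'Sub_Category': ['Male', 'Male', 'Female', 'Female'],
    'Percent_Adults': [30.0, 34.0, 28.0, 32.0],
    'Sample_Size': [100.0, 300.0, 200.0, 200.0],
})

def weighted_mean(group):
    # Weight each row's percentage by its sample size
    return np.average(group['Percent_Adults'], weights=group['Sample_Size'])

# One weighted average per subcategory
weighted = {
    name: weighted_mean(group)
    for name, group in toy.groupby('Sub_Category')
}
print(weighted)
```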

### The columns Age, Education, Gender, Income, Race/Ethnicity have null values
These columns are only populated when they are relevant to the row. For example, the Gender column is empty when the row describes a particular age group or race.

In [None]:
df.isnull().sum(axis = 0)
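
One way to confirm this pattern is to count non-null values of each demographic column per category. The toy frame below only mimics the assumed layout: the category labels and values are illustrative and not taken from the real data.

```python
import pandas as pd

# Toy rows mimicking the layout: only the column matching
# the row's Category is filled (values are made up)
toy = pd.DataFrame({
    'Category': ['Gender', 'Age (years)', 'Total'],
    'Sub_Category': ['Female', '18 - 24', 'Total'],
    'Gender': ['Female', None, None],
    'Age': [None, '18 - 24', None],
})

# count() ignores nulls, so each cell shows how many rows of that
# category have the demographic column populated
filled = toy.groupby('Category')[['Gender', 'Age']].count()
print(filled)
```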

### Let's explore location data

In [None]:
# the data here is represented at state and national level
df.State.unique()

In [None]:
st_lookup = pd.read_csv('states.csv')

In [None]:
st_lookup_dict = dict(zip(st_lookup.State, st_lookup.Region))

In [None]:
len(st_lookup_dict)

In [None]:
# Assign the territories (GU, PR, VI) and the national aggregate (US) to the 'Other' region
st_lookup_dict.update({'GU': 'Other', 'PR': 'Other', 'US': 'Other', 'VI': 'Other'})

In [None]:
df['Region'] = df['State'].apply(lambda x : st_lookup_dict[x])
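
A lookup through `Series.map` is a possible alternative that also makes it easy to spot codes missing from the dictionary. In this sketch `region_by_state` is a hypothetical lookup with invented region labels (the real one is built from states.csv), and `'ZZ'` is an invented code used to show the unmapped case.

```python
import pandas as pd

# Hypothetical lookup; the real one is built from states.csv
region_by_state = {'AL': 'South', 'AK': 'West', 'US': 'Other'}
states = pd.Series(['AL', 'AK', 'US', 'ZZ'])

# map() returns NaN for codes missing from the lookup
# instead of raising a KeyError
regions = states.map(region_by_state)

# List the codes that did not receive a region
unmapped = sorted(states[regions.isna()].unique())
print('unmapped codes:', unmapped)
```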

In [None]:
us_df = df.loc[df['State_Name'] == 'National']
st_df = df.loc[df['State_Name'] != 'National']

In [None]:
print('National data: ',len(us_df))
print('State data: ',len(st_df))
print('Total data: ',len(us_df)+len(st_df))

In [None]:
df.describe(include='all').transpose()

### Let's explore how the data is structured by year

In [None]:
print(df.Year.value_counts())
print(len(us_df))
print(len(st_df))

#### The data is ready for analysis
The clean data is split into national (USA) and state-level subsets and written to two CSV files.

In [None]:
us_df.to_csv('us_df.csv', sep=',', index=False, encoding='utf-8')
st_df.to_csv('st_df.csv', sep=',', index=False, encoding='utf-8')

In [None]:
print('End of Data Clean Up')