<a href="https://colab.research.google.com/github/makayla-ma/Obesity_US/blob/main/Obesity_Subset_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating the Young Adult Obesity Subset
## Overview
The instructions below will guide you through the steps required to create the Young Adult Obesity Subset from the public dataset [Nutrition, Physical Activity, and Obesity - Behavioral Risk Factor Surveillance System](https://catalog.data.gov/dataset/nutrition-physical-activity-and-obesity-behavioral-risk-factor-surveillance-system) by the U.S. Department of Health & Human Services.
- The final subset will be a `.csv` file that contains the percentages of adults in the U.S. between ages 18 and 24 who had obesity in 2013 and 2023.
- The subset will be created with Python in [Google Colab Notebooks](https://colab.research.google.com/) and will require the user to have access to a **Google Account** and a **Google Drive**.
- This tutorial will assume little to no experience with data analysis and programming software.

**The overall process is outlined here**:
1. Create a folder in your **Google Drive** to house the materials for this project.
2. Create a new **Colab Notebook** and import the required **packages**.
3. Create a **DataFrame** from our public dataset.
4. **Filter** the dataframe to create a subset of only our desired data values.
5. **Rename** the subset and **export** as a `.csv` file.

### Getting Started
1. In your Google Drive, create a folder to house the materials for this project.
> You can call the folder whatever you want, but ideally, name it something that you will be able to find easily. For this tutorial, this folder will be referred to as "Obesity Subset."
2. Download the [Nutrition, Physical Activity, and Obesity - Behavioral Risk Factor Surveillance System](https://catalog.data.gov/dataset/nutrition-physical-activity-and-obesity-behavioral-risk-factor-surveillance-system) public dataset as a `.csv` file to the Obesity Subset folder (or whatever you named yours).
3. Create a new Colab Notebook in your folder.
4. Run the following code in your Colab Notebook to mount your Google Drive and follow the prompted instructions.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


5. Import the *pandas* and *numpy* packages in Python with the following code. This will provide additional tools for data analysis.

In [None]:
import numpy as np
import pandas as pd

6. Create a dataframe from the public dataset with the following code, ensuring that everything is spelled correctly. I have named the dataframe "df" for simplicity, but you can name it whatever you'd like. Use the name of your folder instead of `Obesity Subset` in the line of code below.

In [None]:
df=pd.read_csv('drive/My Drive/Obesity Subset/Nutrition__Physical_Activity__and_Obesity_-_Behavioral_Risk_Factor_Surveillance_System.csv')

If you don't get an error message, you have successfully created a dataframe from the downloaded `.csv` file. If you do get an error message, the most likely cause is a misspelled folder or file name in the path, so compare it character by character against the names in your Google Drive.
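
If the error is a `FileNotFoundError`, it can help to check which part of the path is missing. The sketch below is only an illustration: the helper name `explain_missing_path` is hypothetical (it is not part of pandas or Colab), and it relies only on Python's built-in `os` module.

```python
import os

def explain_missing_path(path):
    """Return a short message saying which part of `path` does not exist."""
    # Hypothetical helper for illustration; not part of any library.
    folder = os.path.dirname(path) or "."
    if os.path.isfile(path):
        return "File found."
    if not os.path.isdir(folder):
        return f"Folder not found: {folder}"
    # The folder exists, so list its contents to compare against your spelling.
    return f"File not found. Files in folder: {sorted(os.listdir(folder))}"
```

To use it, pass the same path you gave to `pd.read_csv` in step 6, for example `print(explain_missing_path('drive/My Drive/Obesity Subset/your_file_name.csv'))`, where `your_file_name.csv` stands in for the actual file name.
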
### Filtering the Dataframe
This dataset contains much more information than we need, so we will keep only the rows that match all of the following criteria:
- The percent of adults aged 18 years and older who have obesity
- The age group of young adults from 18 - 24
- The values from the years 2013 and 2023
7. Use the following line of code to create a subset of only the values related to obesity (replacing `df` with whatever you named your dataframe). I have chosen to name this subset `Obesity_subset` for clarity, but you can name it whatever makes sense to you.

In [None]:
Obesity_subset=df[df["Question"]=="Percent of adults aged 18 years and older who have obesity"].copy()
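
To see what this kind of filter does, here is a minimal sketch on a made-up three-row table. The table name `toy`, the placeholder text "Some other question", and the numbers are invented for illustration and do not come from the real dataset.

```python
import pandas as pd

OBESITY = "Percent of adults aged 18 years and older who have obesity"

# Made-up table for illustration only.
toy = pd.DataFrame({
    "Question": [OBESITY, "Some other question", OBESITY],
    "Data_Value": [10.0, 20.0, 30.0],
})

# The comparison produces one True/False value per row.
mask = toy["Question"] == OBESITY
# Indexing with the mask keeps only the rows marked True.
toy_subset = toy[mask].copy()

print(mask.tolist())    # [True, False, True]
print(len(toy_subset))  # 2
```

The `.copy()` at the end makes the subset an independent table, so later changes to it do not affect the original.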

8. Use the following line of code to create a subset of the `Obesity_subset` that contains only the values for ages 18 through 24, again replacing the names with your own.

In [None]:
YoungAdult_Obesity_subset=Obesity_subset[Obesity_subset["Stratification1"]=="18 - 24"].copy()
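
Because the filter matches text exactly, the spaces around the hyphen in `"18 - 24"` matter. The sketch below uses a made-up column (the name `toy` and its contents are assumptions for illustration) to show how to list the labels that actually appear and how a slightly different spelling matches nothing.

```python
import pandas as pd

# Made-up column for illustration only.
toy = pd.DataFrame({"Stratification1": ["18 - 24", "25 - 34", "18 - 24"]})

# List the distinct labels so you can copy the exact spelling.
print(toy["Stratification1"].unique())

# Count how many rows each spelling matches.
exact = (toy["Stratification1"] == "18 - 24").sum()
no_spaces = (toy["Stratification1"] == "18-24").sum()
print(exact, no_spaces)  # 2 0
```

If your own subset comes back empty, running `.unique()` on the real column is a quick way to check the spelling.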

9. Use the following line of code to create a subset of the `YoungAdult_Obesity_subset` that contains only the values from 2013 and 2023.

In [None]:
Decade_YoungAdult_Obesity_subset=YoungAdult_Obesity_subset[(YoungAdult_Obesity_subset["YearStart"]==2013) | (YoungAdult_Obesity_subset["YearStart"]==2023)].copy()
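
The `|` symbol means "or," so the line above keeps rows where the year is 2013 or 2023. As an optional alternative, pandas also offers `isin`, which takes a list of accepted values. The sketch below uses a made-up table (the name `toy` and its numbers are invented for illustration) to show that both approaches select the same rows.

```python
import pandas as pd

# Made-up table for illustration only.
toy = pd.DataFrame({
    "YearStart": [2012, 2013, 2020, 2023],
    "Data_Value": [1.0, 2.0, 3.0, 4.0],
})

# Approach used in this tutorial: two comparisons joined with "or".
either = toy[(toy["YearStart"] == 2013) | (toy["YearStart"] == 2023)].copy()
# Alternative: one call listing the accepted years.
with_isin = toy[toy["YearStart"].isin([2013, 2023])].copy()

print(either.equals(with_isin))  # True
```

Either form is fine for two years; `isin` tends to be easier to read if you later want to add more years.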

10. The public dataset also contains a lot of columns that we don't need. Use the line of code below to keep only the columns showing the year, the state, and the percentage of young adults who had obesity. I have chosen to name the final subset `USA_Decade_YoungAdult_Obesity_subset`, but you can name it whatever makes sense to you.

In [None]:
USA_Decade_YoungAdult_Obesity_subset=Decade_YoungAdult_Obesity_subset[["YearStart","LocationDesc","Data_Value"]].copy()
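
Steps 7 through 10 can also be written as a single selection, which some readers may find easier to follow once the individual steps make sense. This is an optional sketch, not a replacement for the steps above. It uses a made-up table named `toy` with invented state names and values; only the column names and filter values are taken from this tutorial.

```python
import pandas as pd

OBESITY = "Percent of adults aged 18 years and older who have obesity"

# Made-up table for illustration only.
toy = pd.DataFrame({
    "YearStart": [2013, 2013, 2023, 2018],
    "LocationDesc": ["State A", "State A", "State B", "State B"],
    "Question": [OBESITY, OBESITY, OBESITY, OBESITY],
    "Stratification1": ["18 - 24", "25 - 34", "18 - 24", "18 - 24"],
    "Data_Value": [11.0, 22.0, 33.0, 44.0],
})

# Combine the three row conditions with "&" (and).
rows = (
    (toy["Question"] == OBESITY)
    & (toy["Stratification1"] == "18 - 24")
    & ((toy["YearStart"] == 2013) | (toy["YearStart"] == 2023))
)

# .loc selects the matching rows and the three wanted columns at once.
result = toy.loc[rows, ["YearStart", "LocationDesc", "Data_Value"]].copy()
print(result)
```

Here only the first and third rows pass all three conditions, so `result` has two rows and three columns.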

Our subset is now complete and should contain only the values we want. To view it, run a line of code containing just the name of your subset:

In [None]:
USA_Decade_YoungAdult_Obesity_subset

| | YearStart | LocationDesc | Data_Value |
|---|---|---|---|
| 14703 | 2013 | Alaska | 12.6 |
| 14846 | 2013 | Alabama | 19.0 |
| 15130 | 2013 | Arkansas | 29.6 |
| 15156 | 2013 | Arizona | 13.4 |
| 15399 | 2013 | California | 12.2 |
| ... | ... | ... | ... |
| 103396 | 2023 | Vermont | 16.0 |
| 103620 | 2023 | Washington | 18.9 |
| 103729 | 2023 | Wisconsin | 20.0 |
| 104019 | 2023 | West Virginia | 26.7 |
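
As an optional sanity check, you can count how many rows the subset has for each year. The sketch below runs on a made-up table (the name `toy` and its values are invented for illustration); with your own data, you would replace `toy` with the name of your subset.

```python
import pandas as pd

# Made-up table for illustration only.
toy = pd.DataFrame({
    "YearStart": [2013, 2013, 2023],
    "LocationDesc": ["State A", "State B", "State A"],
    "Data_Value": [11.0, 22.0, 33.0],
})

# Count the rows for each year, listed in year order.
counts = toy["YearStart"].value_counts().sort_index()
print(counts)
```

If a year you expected is missing from the counts, or the counts differ a lot between years, it is worth rechecking the filter steps.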


### Exporting the Subset
11. Our final step is to export the subset as a `.csv` file. To do so, run the following line of code:

In [None]:
USA_Decade_YoungAdult_Obesity_subset.to_csv("USA_Decade_YoungAdult_Obesity_subset.csv", index=False)

This saves the file to the notebook's session storage rather than to your Google Drive folder. To download the subset to your local computer, click on the folder icon in the left sidebar, then click on the three dots next to the subset file name and click "Download."
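
To confirm that an export worked, one option is to read the `.csv` file back in and compare it with the table you saved. The sketch below is a self-contained illustration: it uses a made-up table named `toy` and writes to a temporary folder (both are assumptions for the example) so that it does not touch your real files.

```python
import os
import tempfile

import pandas as pd

# Made-up table for illustration only.
toy = pd.DataFrame({
    "YearStart": [2013, 2023],
    "LocationDesc": ["State A", "State B"],
    "Data_Value": [11.0, 33.0],
})

# Write to a temporary folder so the example leaves your files alone.
path = os.path.join(tempfile.mkdtemp(), "toy_subset.csv")
toy.to_csv(path, index=False)

# Read the file back and compare its shape with the original table.
check = pd.read_csv(path)
print(list(check.columns))
print(len(check) == len(toy))
```

Because `index=False` was used when exporting, the file contains only the three data columns and no extra column of row numbers.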