# Final Project Phase 3 Summary
This Jupyter Notebook (.ipynb) will serve as the skeleton file for your submission for Phase 3 of the Final Project. Complete all sections below as specified in the instructions for the project, covering all necessary details. We will use this to grade your individual code (Do this whether you are in a group or not). Good luck! <br><br>

Note: To edit a Markdown cell, double-click on its text.

## Jupyter Notebook Quick Tips
Here are some quick formatting tips to get you started with Jupyter Notebooks. This is by no means exhaustive, and there are plenty of articles to highlight other things that can be done. We recommend using HTML syntax for Markdown but there is also Markdown syntax that is more streamlined and might be preferable. 
<a href = "https://towardsdatascience.com/markdown-cells-jupyter-notebook-d3bea8416671">Here's an article</a> that goes into more detail. (Double-click on cell to see syntax)

# Heading 1
## Heading 2
### Heading 3
#### Heading 4
<br>
<b>BoldText</b> or <i>ItalicText</i>
<br> <br>
Math Formulas: $x^2 + y^2 = 1$
<br> <br>
Line Breaks are done using br enclosed in < >.
<br><br>
Hyperlinks are done with: <a> https://www.google.com </a> or 
<a href="http://www.google.com">Google</a><br>

# Data Collection and Cleaning


Transfer/update the data collection and cleaning you created for Phase II below. You may include additional cleaning functions if you have extra datasets. If no changes are necessary, simply copy and paste your phase II parsing/cleaning functions.


In [127]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
import re
from sklearn.impute import KNNImputer
import numpy as np
import plotly.express as px

## Downloaded Dataset Requirement



In [65]:
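def data_parser():
    # Read the raw CSV rows; the first row holds the column names
    with open("CDC_Nutrition__Physical_Activity__and_Obesity_-_Legislation.csv", encoding='utf-8') as f:
        reader = csv.reader(f, delimiter=",")
        reader_list = list(reader)
    df = pd.DataFrame(reader_list)
    # Keep only the columns of interest and promote the first row to the header
    keep_columns = [0, 2, 3, 4, 5, 7, 8, 9]
    df2 = df.iloc[1:, keep_columns]
    df2.columns = df.iloc[0, keep_columns]
    # Keep enacted legislation only, then drop the now-redundant Status column
    df3 = df2[df2["Status"] == "Enacted"]
    final_df = df3.iloc[:,[0, 1, 2, 3, 4, 5, 6]].reset_index(drop=True)
    return final_df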

############ Function Call ############
data_parser()

Unnamed: 0,Year,LocationAbbr,LocationDesc,HealthTopic,PolicyTopic,Setting,Title
0,2009,CO,Colorado,Nutrition,Appropriations,School/After School,Beverage Policy
1,2010,KY,Kentucky,Obesity,Food Restrictions,Community,Honor Farmers/Food Checkout Week
2,2010,KY,Kentucky,Physical Activity,Bicycling,Community,Bicycle and Bikeway Program
3,2010,TX,Texas,Obesity,Sugar Sweetened Beverages,Early Care and Education,Nutrition And Food Service
4,2009,CO,Colorado,Nutrition,Agriculture and Farming,Community,Long Appropriations Bill
...,...,...,...,...,...,...,...
11709,2011,MO,Missouri,Nutrition,Task Forces/Councils,School/After School,An Act Relating to Farming
11710,2008,NC,North Carolina,Physical Activity,Initiatives and Programs,School/After School,An act to provide for studies by the legislati...
11711,2010,MA,Massachusetts,Nutrition,Liability and Indemnity,Medical and/or Hospital,Fiscal Year 2011 Budget
11712,2011,AR,Arkansas,Nutrition,Access to Healthy Foods,Restaurant/Retail,Cottage Food and Farmers Market Permit Exemptions


## Web Collection Requirement \#1


In [67]:

def web_parser1():
    r = requests.get("https://en.wikipedia.org/wiki/Obesity_in_the_United_States")
    soup = BeautifulSoup(r.text, 'html.parser')
    tr_list = soup.find_all('tr')
    obesity_rank_dict = {}
    for tr in tr_list:
        each_td_list = tr.find_all('td')
        # Only rows whose first cell carries a flag icon describe a state/territory
        if len(each_td_list) > 1 and each_td_list[0].find("span", {"class": "flagicon"}):
            state = each_td_list[0].text[1:]
            values = []
            for data in each_td_list[1:]:
                text = data.text.strip()
                if re.search(r"%", text) is None:
                    # Rank column or a missing-value placeholder: keep the raw text
                    values.append(text)
                else:
                    # Percentage cell: keep only the number before the % sign
                    values.append(re.findall(r"([\d.]+)%", text)[0])
            obesity_rank_dict[state] = values
    df = pd.DataFrame(dict(obesity_rank_dict)).T
    df.columns = ["Obesity Rank", "Obese adults(mid-2000s) [Percentage %]", "Obese adults (2020) [Percentage %]", "Overweight(incl. obese) adults (mid-2000s) [Percentage %]", "Obese children and adolescents (mid-2000s) [Percentage %]"]
    for column in df.columns:
        df[column] = pd.to_numeric(df[column], errors='coerce')
    
    imputer = KNNImputer(n_neighbors=3)
    df_to_impute = df.iloc[:, 1:]
    df_imputed = pd.DataFrame(imputer.fit_transform(df_to_impute), columns=df_to_impute.columns)
    df.iloc[:,1:] = df_imputed.round(2)
    
    df = df.sort_values(by=df.columns[2], ascending=False)
    rank = 0
    current_obesity = float('inf')
    for index, row in df.iterrows():
        if row[df.columns[2]] < current_obesity:
            rank += 1
            current_obesity = row[df.columns[2]]
        df.loc[index, df.columns[0]] = rank
    df[df.columns[0]] = df[df.columns[0]].astype(int)
    return df

############ Function Call ############
web_parser1()

Unnamed: 0,Obesity Rank,Obese adults(mid-2000s) [Percentage %],Obese adults (2020) [Percentage %],Overweight(incl. obese) adults (mid-2000s) [Percentage %],Obese children and adolescents (mid-2000s) [Percentage %]
American Samoa,1,31.13,75.0,95.0,35.0
West Virginia,2,30.6,38.1,66.8,20.9
Mississippi,3,34.4,37.3,67.4,17.8
Oklahoma,4,28.1,36.5,64.2,15.4
Iowa,5,26.3,36.4,63.4,12.5
Alabama,6,30.1,36.3,65.4,16.7
Louisiana,7,29.5,36.2,64.2,17.2
Arkansas,8,28.1,35.0,64.7,16.4
Kentucky,9,28.4,34.3,66.8,20.6
Alaska,10,27.3,34.2,64.5,11.1


## Web Collection Requirement #2

In [68]:
def web_parser2():
    r = requests.get("https://data.cdc.gov/api/views/hn4x-zwk7/rows.json?accessType=DOWNLOAD")
    data = r.json()["data"]
    df = pd.DataFrame(data)
    df.loc[df.iloc[:,20] == "~", df.columns[21]] = None
    df.loc[df.iloc[:,20] == "~", df.columns[20]] = None
    df.loc[df.iloc[:,29] == "Data not reported", df.columns[29]] = "$35,000 - $49,999"

    imputer = KNNImputer(n_neighbors=2)
    
    df1_to_impute = df.iloc[:, 18:20]
    df1_imputed = pd.DataFrame(imputer.fit_transform(df1_to_impute), columns=df1_to_impute.columns)
    df.iloc[:,18:20] = df1_imputed.round(2)
    
    df2_to_impute = df.iloc[:, 22:25]
    df2_imputed = pd.DataFrame(imputer.fit_transform(df2_to_impute), columns=df2_to_impute.columns)
    df.iloc[:,22:25] = df2_imputed.round(2)
    
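    # NOTE: sort_values returns a new DataFrame; the result is not assigned here, so df keeps its original row order
    df.sort_values(by=df.columns[8]).reset_index(drop=True)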
    final_columns = [8,9,10,11,13,14,15,18,19,20,21,22,23,24,26,27,28,29,30]
    final_df = df.iloc[:, final_columns]
    final_df.columns = range(final_df.shape[1])
    return final_df

############ Function Call ############
web_parser2()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,2020,2020,US,National,Physical Activity,Physical Activity - Behavior,Percent of adults who engage in no leisure-tim...,30.6,30.6,,,29.4,31.8,31255.0,,,,,Hispanic
1,2014,2014,GU,Guam,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,29.3,29.3,,,25.7,33.3,842.0,,High school graduate,,,
2,2013,2013,US,National,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,28.8,28.8,,,28.1,29.5,62562.0,,,,"$50,000 - $74,999",
3,2013,2013,US,National,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,32.7,32.7,,,31.9,33.5,60069.0,,,,"$35,000 - $49,999",
4,2015,2015,US,National,Physical Activity,Physical Activity - Behavior,Percent of adults who achieve at least 300 min...,26.6,26.6,,,25.6,27.6,30904.0,,,,"Less than $15,000",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93244,2022,2022,WY,Wyoming,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,24.5,24.5,,,16.3,35.3,111.0,,,,"Less than $15,000",
93245,2022,2022,WY,Wyoming,Physical Activity,Physical Activity - Behavior,Percent of adults who engage in no leisure-tim...,36.0,36.0,,,27.9,45.0,159.0,,Less than high school,,,
93246,2022,2022,WY,Wyoming,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,35.2,35.2,,,30.6,40.0,450.0,35 - 44,,,,
93247,2022,2022,WY,Wyoming,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,35.3,35.3,,,30.2,40.7,512.0,,,,"$35,000 - $49,999",


# Inconsistency Revisions
 **If you were requested to revise your inconsistency section from Phase II, enter your responses here. Otherwise, ignore this section.**

For each inconsistency (NaN, null, duplicate values, empty strings, etc.) you discover in your datasets, write at least 2 sentences stating the significance, how you identified it, and how you handled it.

1. For web_parser1, many values for the U.S. territories were missing and listed as "-". I turned these into NaNs, converted the percentage strings into numerical values, and then used the KNN imputer, which takes the 3 nearest neighbors (based on the distances between the data rows) and imputes the mean of their values for each NaN. After these values were imputed for the NaNs in the percentage columns, the obesity rankings were recomputed to include the previously missing territories/districts. These rankings were based on the values of the third column.

2. For web_parser2, the data value column sometimes read "The data was not available due to the sample size." In those rows, the data in columns 18, 19, and 22 to 24 were all empty. I again used the KNN imputer to fill in the empty values, this time based on 2 neighbors.

3. For web_parser2, some of the data in the income column said "Data not reported", so I replaced it with the income range (in accordance with the format of the income column) "$35,000 - $49,999", which is the income range that includes the median US personal income for all workers (around $41,000).

4. (if applicable)

5. (if applicable)
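
The nearest-neighbor imputation described in items 1 and 2 can be illustrated by hand. The sketch below is a simplified, pure-Python version of the idea (mean of the k closest rows that have the missing column), not scikit-learn's `KNNImputer` implementation; the function name `knn_impute` and the toy numbers are hypothetical and are not taken from the datasets.

```python
import math

def knn_impute(rows, k=3):
    """Fill None entries with the mean of that column over the k nearest rows.

    Distance is Euclidean over the columns both rows have. Rows that lack
    the target column cannot act as donors. Simplified illustration only.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        # Rescale so rows compared on fewer columns are not favoured
        scale = len(a) / len(shared)
        return math.sqrt(scale * sum((x - y) ** 2 for x, y in shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            donors = [other for j, other in enumerate(rows)
                      if j != i and other[col] is not None]
            donors.sort(key=lambda other: distance(row, other))
            nearest = donors[:k]
            if nearest:
                filled[i][col] = round(sum(d[col] for d in nearest) / len(nearest), 2)
    return filled

# Made-up values: [obese adults mid-2000s (%), obese adults 2020 (%)]
toy = [
    [30.0, 38.0],
    [28.0, 36.0],
    [26.0, 34.0],
    [22.0, 25.0],
    [None, 37.0],  # a location with a missing mid-2000s value
]
result = knn_impute(toy, k=3)
print(result[-1])  # [28.0, 37.0]
```

The last row is closest to the first three rows on its known column, so its missing value becomes the mean of 30.0, 28.0, and 26.0.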


## Data Sources

Include sources (as links) to your datasets. If any of these are different from your sources used in Phase II, please <b>clearly</b> specify.

*   Downloaded Dataset Source: https://healthdata.gov/dataset/CDC-Nutrition-Physical-Activity-and-Obesity-Legisl/mmuf-mryu/about_data

*   Web Collection #1 Source: https://en.wikipedia.org/wiki/Obesity_in_the_United_States

*   Web Collection #2 Source: https://data.cdc.gov/api/views/hn4x-zwk7/rows.json?accessType=DOWNLOAD




# Data Analysis
For the Data Analysis section, you are required to utilize your data to complete the following:

*   Create at least 5 insights
*   Generate at least 3 data visualizations
*   Export aggregated data to at least 1 summary file 

Create a function for each of the following sections mentioned above. Do not forget to fill out the explanation section for each function. 

Make sure your data analysis is not too simple. Performing complex aggregation and using modules not taught in class shows effort, which will increase the chance of receiving full credit. 

# Graphical User Interface (GUI) Implementation
If you decide to create a GUI for Phase II, please create a separate Python file (.py) to build your GUI. You must submit both the completed PhaseII.ipynb and your Python GUI file.

## Insights

In [137]:
def insight1():
    df = data_parser()
        
    healthTopic = "Physical Activity"
    filtered_df = df[df["HealthTopic"] == healthTopic]
    pd1 = pd.Series(filtered_df.groupby("LocationDesc").size())

    df = web_parser2().iloc[:, [3,6,7]]
    df.columns = ["LocationDesc", "Question","Data Value"]
    
    
    question1 = "Percent of adults who engage in muscle-strengthening activities on 2 or more days a week"
    filtered_df1 = df[df["Question"] == question1]
    grouped_mean1 = filtered_df1.groupby(["LocationDesc"])["Data Value"].mean().round(2)
    
    question2 = "Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)"
    filtered_df2 = df[df["Question"] == question2]
    grouped_mean2 = filtered_df2.groupby(["LocationDesc"])["Data Value"].mean().round(2)
    
    question3 = "Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)"
    filtered_df3 = df[df["Question"] == question3]
    grouped_mean3 = filtered_df3.groupby(["LocationDesc"])["Data Value"].mean().round(2)
    
    insight_df = pd.concat([pd1, grouped_mean1, grouped_mean2, grouped_mean3], axis=1)
    insight_df.sort_values(by=insight_df.columns[0], inplace=True)
    insight_df.columns = ["Number of Physical Activity Legislations", "Muscle Strengthening", "150 moderate/75 vigorous aerobic", "300 moderate/150 vigorous aerobic"]

    return insight_df

############ Function Call ############
insight1()

LocationDesc,Number of Physical Activity Legislations,Muscle Strengthening,150 moderate/75 vigorous aerobic,300 moderate/150 vigorous aerobic
South Dakota,2.0,29.608571,46.86,30.54
Montana,7.0,32.772643,53.379786,37.842643
Nebraska,8.0,30.492643,47.1305,28.749071
Kansas,8.0,28.995286,46.683857,28.768143
Wyoming,9.0,32.145214,49.845929,34.133786
Indiana,10.0,29.368571,43.625214,27.4945
North Dakota,11.0,29.620429,43.427571,27.196857
South Carolina,15.0,30.279071,46.845214,30.340429
Wisconsin,15.0,31.325143,52.003214,33.8125
Alaska,18.0,34.365571,52.983429,36.089857


### Insight 1 Explanation

I wanted to see whether the number of physical activity legislations enacted in a location was directly proportional to the percentage of adults exercising there, that is, whether these legislations spurred more exercise. There appeared to be no correlation between the number of physical activity legislations passed and the amount of physical activity done in that location, as seen in Visualization 3 below.

In [92]:
def insight2():
    web_data = web_parser1()
    pd1 = pd.Series(web_data[web_data.columns[0]])
    df = data_parser()
    pd2 = pd.Series(df.groupby("LocationDesc").size()).sort_values()    

    insight_df = pd.concat([pd1,pd2],axis=1)
    insight_df = insight_df.sort_values(by="Obesity Rank")
    
    second_col = insight_df.columns[1]
    placeholder = -1
    insight_df[second_col] = insight_df[second_col].fillna(placeholder)
    insight_df[second_col] = insight_df[second_col].astype("Int64")
    insight_df[second_col] = insight_df[second_col].replace(placeholder, np.nan)

    insight_df.rename(columns={second_col:'Number of Legislations'}, inplace=True)
    return insight_df

############ Function Call ############
insight2()

Unnamed: 0,Obesity Rank,Number of Legislations
American Samoa,1,
West Virginia,2,151.0
Mississippi,3,244.0
Oklahoma,4,242.0
Iowa,5,187.0
Alabama,6,165.0
Louisiana,7,372.0
Arkansas,8,339.0
Kentucky,9,172.0
Alaska,10,46.0


### Insight 2 Explanation

Based on the table, the logical conclusion is that the number of legislations (with policy topics related to obesity, such as nutrition, obesity, physical activity, etc.) enacted by a US state/territory/district is not correlated with its obesity level. The five states/districts/territories with the lowest obesity levels were Montana, California, Hawaii, District of Columbia, and Colorado, in descending order. California had one of the highest numbers of legislations enacted at 633, but Montana, right before it, had only 91, on the lower side. Additionally, Texas had 525 legislations enacted and was the 15th most obese on the list. Illinois had the most legislations, 637, and ranked only 27th. Wyoming had the lowest number of legislations enacted at 35 and had a relatively low obesity ranking of 35 on the list. All in all, there was no visible overarching relationship or trend between the number of obesity-related legislations enacted in a US location and its obesity ranking.

In [98]:
def insight3():

    df = data_parser()
        
    setting1 = "Early Care and Education"
    setting2 = "School/After School"
    filtered_df = df[(df["Setting"] == setting1) | (df["Setting"] == setting2)]
    pd1 = pd.Series(filtered_df.groupby("LocationDesc").size())

    web_data = web_parser1()
    pd2 = pd.Series(web_data[web_data.columns[4]])
    
    insight_df = pd.concat([pd1,pd2],axis=1)
    
    insight_df.rename(columns={insight_df.columns[0]:'Number of Children-Affecting Legislations'}, inplace=True)
    insight_df.sort_values(by="Number of Children-Affecting Legislations", inplace=True)

    return insight_df
    

############ Function Call ############
insight3()

Unnamed: 0,Number of Children-Affecting Legislations,Obese children and adolescents (mid-2000s) [Percentage %]
Wyoming,4.0,8.7
New Hampshire,6.0,12.9
Alaska,9.0,11.1
Minnesota,11.0,10.1
Kansas,11.0,14.0
Wisconsin,14.0,13.5
Nebraska,17.0,11.9
Idaho,18.0,10.1
Indiana,19.0,15.6
North Dakota,21.0,12.1


### Insight 3 Explanation
This table was built to show any relationship between the number of legislations enacted by each state that affected children (those set in school/after school or early care and education) and the percentage of obesity in the child and adolescent population. From the table, the two appear to have no correlation. The table is sorted in ascending order of legislation count, so if there were a correlation, I expected the obesity percentages to decrease farther down the table (as more legislations were enacted). However, this was not the case. The obesity percentages were scattered and generally did not have an inverse relationship with the number of child-affecting legislations.


In [99]:
def insight4():
    web_data = web_parser1()
    pd1 = pd.Series(web_data[web_data.columns[0]])
    
    df = web_parser2().iloc[:, [3,6,7]]
    df.columns = ["LocationDesc", "Question","Data Value"]
    question1 = "Percent of adults who report consuming vegetables less than one time daily"
    filtered_df1 = df[df["Question"] == question1]
    grouped_mean1 = filtered_df1.groupby(["LocationDesc"])["Data Value"].mean().round(2)
    
    question2 = "Percent of adults who report consuming fruit less than one time daily"
    filtered_df2 = df[df["Question"] == question2]
    grouped_mean2 = filtered_df2.groupby(["LocationDesc"])["Data Value"].mean().round(2)

    insight_df = pd.concat([pd1,grouped_mean1,grouped_mean2],axis=1)
    
    first_col = insight_df.columns[0]
    placeholder = -1
    insight_df[first_col] = insight_df[first_col].fillna(placeholder)
    insight_df[first_col] = insight_df[first_col].astype("Int64")
    insight_df[first_col] = insight_df[first_col].replace(placeholder, np.nan)

    insight_df.columns = ["Obesity Rank", "Vegetables", "Fruit"] 
    return insight_df

############ Function Call ############
insight4()

Unnamed: 0,Obesity Rank,Vegetables,Fruit
American Samoa,1.0,,
West Virginia,2.0,21.389524,43.740714
Mississippi,3.0,24.224762,44.237857
Oklahoma,4.0,22.146071,46.370714
Iowa,5.0,24.150476,38.618333
Alabama,6.0,22.622619,43.579762
Louisiana,7.0,26.282143,44.963095
Arkansas,8.0,22.673452,44.181429
Kentucky,9.0,21.1075,43.059524
Alaska,10.0,21.671071,40.713929


### Insight 4 Explanation

I wanted to see if fruit and vegetable consumption were related to obesity. My prediction was that the locations with higher percentages of people who ate vegetables less than once daily and people who ate fruit less than once daily would have higher levels of obesity. However, that was not the case, and neither was the opposite true. These percentages seemed to have no correlation with obesity levels. The vegetable percentage fluctuated in the twenties irrespective of the obesity rank. The fruit percentage behaved similarly, fluctuating between the mid thirties and mid forties. There was no pattern or trend in the fruit percentages either.

In [86]:
def insight5():
    web_data = web_parser1()
    pd1 = pd.Series(web_data[web_data.columns[0]])
    
    df = web_parser2().iloc[:, [3,6,7]]
    df.columns = ["LocationDesc", "Question","Data Value"]
    question = "Percent of adults who engage in no leisure-time physical activity"
    filtered_df = df[df["Question"] == question]
    grouped_mean = filtered_df.groupby(["LocationDesc"])["Data Value"].mean().round(2)
    
    
    insight_df = pd.concat([pd1,grouped_mean],axis=1)
    
    first_col = insight_df.columns[0]
    placeholder = -1
    insight_df[first_col] = insight_df[first_col].fillna(placeholder)
    insight_df[first_col] = insight_df[first_col].astype("Int64")
    insight_df[first_col] = insight_df[first_col].replace(placeholder, np.nan)
    
    return insight_df


############ Function Call ############
insight5()

Unnamed: 0,Obesity Rank,Data Value
American Samoa,1.0,
West Virginia,2.0,30.209792
Mississippi,3.0,32.863363
Oklahoma,4.0,30.51997
Iowa,5.0,26.924702
Alabama,6.0,30.460952
Louisiana,7.0,30.764554
Arkansas,8.0,31.618512
Kentucky,9.0,30.74994
Alaska,10.0,23.169048


### Insight 5 Explanation

I picked this question as the one to compare because it seemed like the most arbitrary of the exercise questions. It asked whether no exercise was done for leisure, whereas the others asked about a variety of combinations of aerobic and muscular exercises. Running this function already took a long time because web_parser2() calls the KNN imputer, so adding more questions/columns would have been even less efficient. Regardless, as seen here, there was a pattern between the amount of exercise done in a state and its obesity rank. The question measured how much of the population did no physical activity for leisure. The data showed that the states ranked as more obese had percentages around 30% or in the high twenties. On the other hand, the states at the less obese end of the ranking had percentages averaging around the mid twenties.
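
The running time mentioned above comes from re-running the parser (and its imputation) on every call. One possible remedy, sketched here under assumptions, is to memoize the parser with `functools.lru_cache` so the expensive work happens once. `cached_parser` and its placeholder return value are hypothetical stand-ins, not the notebook's code.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_parser():
    """Stand-in for an expensive parser such as web_parser2()."""
    # In the notebook, the download and KNN imputation would run here.
    # Hypothetical placeholder result; a tuple is returned because cached
    # objects are shared between callers and should not be mutated.
    return (("Alabama", 30.5), ("Alaska", 23.2))

first = cached_parser()   # computed on the first call
second = cached_parser()  # served from the cache
info = cached_parser.cache_info()
print(info.misses, info.hits)  # 1 1
```

If the cached function returned a DataFrame instead, each caller would want to work on a `.copy()` so that one insight cannot alter the data seen by another.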

## Data Visualizations

In [133]:
def visual1():
    df = insight2()
    fig = px.scatter(df, 
                     x = df.columns[1],
                     y = df.columns[0],
                    title = 'Number of Legislations vs. Obesity Rank')
    fig.show()
    
############ Function Call ############
visual1()

### Visualization 1 Explanation

This visualization has a loose linear relationship. If a line of best fit were to be drawn on the graph it would be a positively-sloped line. However, this does not indicate that the two variables are related. The data points would be too distant and spread out from this line of best fit. For example, a state that passed over 500 legislations was in the top 20 of the obesity rankings. And on the other hand, there were states/districts/territories that passed just over 100 legislations, and were almost at the end of the obesity rankings. While there is a loose trend, a definite claim cannot be made that the number of legislations in the state/territory/district is directly proportional to its respective obesity rank.

In [132]:
def visual2():
    df = insight5()
    fig = px.scatter(df, 
                     x = df.columns[1],
                     y = df.columns[0],
                     labels = {df.columns[1]: "Adults who engage in no leisure-time physical activity (%)"},
                    title = "Adults who engage in no leisure-time physical activity (%) vs. Obesity Rank")
    fig.show()

############ Function Call ############
visual2()

### Visualization 2 Explanation

This visualization clearly shows the relationship between obesity rank and the percentage of adults who engage in no leisure-time physical activity: they are inversely related, with only one outlier. The lower the percentage, the numerically higher (less obese) the rank, and vice versa. The locations with the highest levels of obesity have fewer people who engage in physical activity for leisure, and locations where more people engage in physical activity for leisure have lower levels of obesity.
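
A quick numeric companion to the scatter plot is to compare the average inactivity percentage of the more obese half of the ranking with that of the less obese half. The values below are hypothetical and only show the mechanics; rank 1 is taken to be the most obese location, as in the tables above.

```python
from statistics import mean

# Hypothetical (obesity rank, % of adults with no leisure-time activity) pairs
rows = [(1, 31.0), (2, 30.0), (3, 29.0), (4, 27.0), (5, 25.0), (6, 24.0)]

rows.sort(key=lambda pair: pair[0])  # rank 1 = most obese
half = len(rows) // 2
most_obese = mean(pct for _, pct in rows[:half])
least_obese = mean(pct for _, pct in rows[half:])

print(f"most obese half: {most_obese:.1f}%, least obese half: {least_obese:.1f}%")
# most obese half: 30.0%, least obese half: 25.3%
```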

In [141]:
def visual3():
    df = insight1()
    df_melted = df.melt(id_vars=[df.columns[0]], value_vars=df.columns[1:4], var_name="Exercise Type", value_name="Percentage of Adults (%)")
    fig = px.scatter(df_melted, x=df.columns[0],
                     y="Percentage of Adults (%)",
                     color="Exercise Type",
                     title = "Number of Physical Activity Legislations vs. Percentage of Physical Activity") 
    fig.update_traces(mode='lines+markers')
    fig.show()

############ Function Call ############
visual3()





### Visualization 3 Explanation

As shown, there is no correlation between the number of physical activity legislations and the amount of physical activity done. None of the three types of physical activity showed a trend. The legislations may be related to increases in physical activity in the respective location rather than to the current level of physical activity.

## Summary Files

In [121]:
def summary1():
    df = web_parser2()
    question = "Percent of adults aged 18 years and older who have obesity"
    filtered_df = df[df[df.columns[6]] == question]
    grouped_mean = filtered_df.groupby([df.columns[3]])[df.columns[7]].mean().round(2).reset_index()
    grouped_mean.columns = ["LocationDesc", "Adult Obesity (%)"]
    grouped_mean.to_csv('summary1.csv', index=False, header=True)
    return grouped_mean

############ Function Call ############
summary1()

Unnamed: 0,LocationDesc,Adult Obesity (%)
0,Alabama,35.309851
1,Alaska,30.56497
2,Arizona,29.810863
3,Arkansas,35.356399
4,California,26.807679
5,Colorado,23.544494
6,Connecticut,28.289048
7,Delaware,31.88753
8,District of Columbia,25.538155
9,Florida,27.987262


# Cited Sources

If you used any additional sources to complete your Data Analysis section, list them here:

KNNImputer - scikit-learn 1.5.1 documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html

6.4 Imputation of missing values - scikit-learn 1.5.1 documentation:
https://scikit-learn.org/stable/modules/impute.html

pandas.DataFrame.melt - pandas 2.2.2 documentation: 
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html

# Video Presentation

If you uploaded your Video Presentation to Bluejeans, YouTube, or any other streaming services, please provide the link here:


*   Video Presentation Link : https://youtu.be/5t7yn_4F5vE


Make sure the video sharing permissions are accessible for anyone with the provided link.

# Submission

Prior to submitting your notebook to Gradescope, be sure to <b>run all functions within this file</b>. We will not run your functions ourselves, so we must see your outputs within this file in order to receive full credit.
