## Python Version of Chapter 5: 800 Children and Teens, part 2

We will use pandas and plotly.express to recreate the analysis from [Chapter 5](https://codap.xyz/awash/children-and-teens-2.html) of _Awash in Data_.


In [None]:
import pandas as pd  # tell Python that we will be using pandas, under its usual short name pd

# the following statement would read the csv file from your computer's storage (hard disk, etc.)
# df = pd.read_csv(r"800_children_and_teens.csv")  

# this statement reads the csv file from a website
df = pd.read_csv(r"https://raw.githubusercontent.com/jcrumpton/binder_test/main/800_children_and_teens.csv")

# display the first five rows of the DataFrame, along with the column names
df.head()


## Grouping

In a moment, we will __group__ our data by age. First, we can approximate grouping by using Age as the x-axis in a scatterplot, so that everyone of the same age lines up in one vertical column of dots.

In [None]:
# Uncomment the next two lines if graphs do not display in Visual Studio Code
# import plotly.io as pio
# pio.renderers.default = "notebook"

import plotly.express as px

fig = px.scatter(df, x='Age', y='Height')
fig.show()

We can't tell which dots correspond to males and which to females. Pass the color argument when creating the scatterplot to color the dots by Gender. This is similar to dragging an attribute onto a graph in CODAP.

In [None]:
fig = px.scatter(df, x='Age', y='Height', color='Gender')
fig.show()

This is still a little muddled. Let's actually group by age and gender so we can calculate the mean for each subgroup. First, keep only the columns that we need.

In [None]:
subset = df[['Age', 'Gender', 'Height']]
subset.head()
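
The double brackets matter in the cell above: a list of column names returns a DataFrame, while a single name returns a Series. Here is a minimal sketch of the difference, using a small made-up table (the name `toy` and all of its values are invented for illustration and are not from the 800 children and teens data).

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real data set
toy = pd.DataFrame({
    "Age": [5, 5, 6, 6],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Height": [110.0, 109.0, 116.0, 115.0],
    "Weight": [19.0, 18.5, 21.0, 20.5],
})

# A list of column names (double brackets) keeps only those columns
toy_subset = toy[["Age", "Gender", "Height"]]

# A single column name (single brackets) returns a Series instead
toy_heights = toy["Height"]

print(type(toy_subset).__name__)
print(type(toy_heights).__name__)
print(list(toy_subset.columns))
```

The original `toy` table is left unchanged; `toy_subset` is a new, narrower table.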

We use the "groupby" method to divide the data frame (data set) into groups based on attribute values. Here we group by Age and then count how many people there are at each age.

In [None]:
groups = subset.groupby('Age')
print(groups.size())
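
To see what a grouped object holds, it can help to try it on a table small enough to check by eye. The sketch below uses made-up rows (the name `toy` and its values are assumptions for illustration, not the real data) and shows that `size()` simply counts the rows that share each Age.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Age": [5, 5, 5, 6, 6, 7],
    "Height": [108.0, 110.0, 112.0, 115.0, 117.0, 122.0],
})

# One count per distinct Age value
counts = toy.groupby("Age").size()
print(counts)

# Looping over the grouped object yields each Age with its own rows
for age, rows in toy.groupby("Age"):
    print(age, len(rows))
```

The result of `size()` is a Series whose index is the grouping column, so `counts.loc[5]` looks up the number of 5-year-olds in the toy table.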

We can group on both Age and Gender to separate the data into 5-year-old boys, 5-year-old girls, and so on.

In [None]:
groups = subset.groupby(['Age', 'Gender'])
print(groups.size())
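
Grouping on two columns produces a result indexed by both of them. As a rough sketch on made-up rows (the name `toy` and its values are invented for illustration), the counts can be spread into an Age-by-Gender table with `unstack`, which is sometimes easier to read than the long listing.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Age": [5, 5, 5, 6, 6, 6],
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
})

# Counts indexed by (Age, Gender) pairs
counts = toy.groupby(["Age", "Gender"]).size()
print(counts)

# Move the Gender level into columns: one row per Age, one column per Gender
table = counts.unstack("Gender")
print(table)
```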

Finally, we can calculate the mean Height for each subgroup.

In [None]:
# select the Height column explicitly, then average it within each Age/Gender subgroup
heights = subset.groupby(['Age', 'Gender'])['Height'].mean().reset_index()
heights.head()
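
The call to `reset_index()` is what turns the group labels back into ordinary columns, which is the shape plotly.express expects. A minimal sketch on made-up rows (the name `toy` and its values are assumptions for illustration only) shows the result before and after that step.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Age": [5, 5, 5, 6, 6],
    "Gender": ["Male", "Male", "Female", "Male", "Female"],
    "Height": [110.0, 112.0, 109.0, 116.0, 115.0],
})

# Before reset_index: a Series whose index holds the (Age, Gender) labels
means = toy.groupby(["Age", "Gender"])["Height"].mean()
print(means)

# After reset_index: a flat table with Age, Gender and Height columns
flat = means.reset_index()
print(flat)
```

In the toy table the two 5-year-old boys are 110 and 112 tall, so their row in `flat` shows their mean.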


Now we have the average height for 5-year-old boys, 5-year-old girls, 6-year-old boys, and so on. Let's plot one point per subgroup.

In [None]:
fig = px.scatter(heights, x='Age', y='Height', color='Gender')
fig.show()

We can add [trendlines](https://plotly.com/python-api-reference/generated/plotly.express.trendline_functions.html) to help interpret the data. The trendline argument accepts the following values:
* ols: Ordinary Least Squares
* lowess: LOcally WEighted Scatterplot Smoothing
* expanding
* ewm: Exponentially Weighted Moment
* rolling

Hover over a trendline to see its formula (for ols) or its predicted value at that point.

In [None]:
fig = px.scatter(heights, x='Age', y='Height', color='Gender', trendline='ols')
fig.show()
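
To get a feel for the kind of formula an ols trendline reports, here is a rough sketch that fits one straight line per gender with `numpy.polyfit`. This is a stand-in for illustration, not what plotly runs internally (plotly relies on the statsmodels package for ols trendlines), and the mean heights in `toy_means` are made up.

```python
import numpy as np
import pandas as pd

# Made-up mean heights for illustration only
toy_means = pd.DataFrame({
    "Age": [5, 6, 7, 8, 5, 6, 7, 8],
    "Gender": ["Male"] * 4 + ["Female"] * 4,
    "Height": [110.0, 116.0, 122.0, 128.0, 109.0, 115.0, 121.0, 127.0],
})

# Fit Height = slope * Age + intercept separately for each gender
fits = {}
for gender, rows in toy_means.groupby("Gender"):
    slope, intercept = np.polyfit(rows["Age"], rows["Height"], deg=1)
    fits[gender] = (slope, intercept)
    print(f"{gender}: Height = {slope:.2f} * Age + {intercept:.2f}")
```

The slope is the estimated change in mean height per year of age, which is the number to compare between the two lines.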

In [None]:
fig = px.scatter(heights, x='Age', y='Height', color='Gender', trendline='lowess')
fig.show()