## Python Version of Chapter 3: 800 Children and Teens, part 1

We will use pandas, an open-source data analysis and manipulation tool, to recreate the analysis from [Chapter 3](https://codap.xyz/awash/children-and-teens-1.html) of _Awash in Data_.
If you would like to learn more about pandas:
* https://pandas.pydata.org/
* https://www.w3schools.com/python/pandas/

### DataFrames

In pandas, a dataframe is a 2-dimensional data structure. You can think of it as a table with rows and columns.
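To make the table idea concrete, here is a minimal sketch that builds a tiny dataframe by hand. The column names match the ones used later in this notebook, but the values and the name `small_df` are made up for illustration and are not taken from the real data set.

```python
import pandas as pd

# Hypothetical values for illustration only; not taken from the real data set.
small_df = pd.DataFrame(
    {
        "Age": [10, 10, 12],
        "Gender": ["Male", "Female", "Female"],
        "Height": [140.0, 145.5, 152.3],
    }
)

print(small_df.shape)          # (number of rows, number of columns)
print(list(small_df.columns))  # the column names
print(small_df["Height"])      # a single column is a pandas Series
```

Each dictionary key becomes a column and each position in the lists becomes a row.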

Before we start analyzing data, we must read the data into a dataframe. You can read a CSV (comma-separated values) file from your disk or from a website.

In [None]:
import pandas as pd  # tell python that we will be using pandas in our code

# the following statement would read the csv file from your computer's storage (hard disk, etc.)
# df = pd.read_csv(r"800_children_and_teens.csv")  

# this statement reads the csv file from a website
df = pd.read_csv(r"https://raw.githubusercontent.com/jcrumpton/binder_test/main/800_children_and_teens.csv")

# display the first several rows of the dataframe, also shows the column names
df.head()


### 3.2 A Specific Question: Who is Taller?

Who is taller, males or females?

We will use plotly.express (a Python library for creating graphs) to plot Height vs. Gender.

You can learn more about plotly.express at:
* https://plotly.com/python/plotly-express/
* https://www.geeksforgeeks.org/python-plotly-tutorial/

In [None]:
import plotly.express as px

fig = px.histogram(df, y='Height', color='Gender')
fig.show()

This histogram shows how many males and females there are at each height. Unlike the graph from CODAP, it is not separated into male and female halves; instead, colors differentiate the genders. We can use [facet plots](https://plotly.com/python/facet-plots/), figures made up of multiple subplots, to show the gender histograms side by side.

In [None]:
fig = px.histogram(df, y='Height', color='Gender', facet_col='Gender')
fig.show()

To get the mean height of each gender, we must first __filter__ the data by Gender. We will learn to do this in the next section.

### 3.3 Making the Question More Specific

__Filter__ the data set to consider only 10-year-olds.

In [None]:
age_10 = df[df['Age'] == 10]  # create a new dataframe that only contains rows of df where Age==10
age_10.head()
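
The filter above works in two steps: the comparison produces a column of True/False values (a boolean mask), and indexing the dataframe with that mask keeps only the True rows. A minimal sketch on a small made-up table (the names `toy`, `mask`, and `toy_age_10` and all values are hypothetical):

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({"Age": [9, 10, 10, 11], "Height": [135.0, 140.0, 145.5, 150.2]})

mask = toy["Age"] == 10   # boolean Series: True where Age is 10
print(mask.tolist())      # [False, True, True, False]

toy_age_10 = toy[mask]    # keep only the rows where mask is True
print(len(toy_age_10))    # 2
```

Writing `toy[toy["Age"] == 10]` on one line, as in the cell above, does the same thing without naming the mask.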

In [None]:
fig = px.histogram(age_10, y='Height', color='Gender', facet_col='Gender')
fig.show()

We can change the number of "bins" to get more detail.

In [None]:
fig = px.histogram(age_10, y='Height', color='Gender', facet_col='Gender', nbins=35)
fig.show()
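
To see what a bin is, the sketch below counts a few made-up heights in 2 and then 4 equal-width intervals using `pd.cut`. This is only an analogy: plotly chooses its own bin edges, so the counts in the figure will not necessarily match a `pd.cut` of the same data. The values and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical heights in cm, for illustration only.
heights = pd.Series([130.0, 134.0, 138.0, 142.0, 146.0, 150.0])

# pd.cut assigns each value to one of `bins` equal-width intervals;
# value_counts then counts how many values fall in each interval.
few_bins = pd.cut(heights, bins=2).value_counts(sort=False)
more_bins = pd.cut(heights, bins=4).value_counts(sort=False)

print(few_bins)   # two wide intervals
print(more_bins)  # four narrower intervals
```

More bins means narrower intervals, so each bar covers fewer heights and the shape of the distribution shows more detail.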

We can __filter__ the age_10 data set into males and females if we want to look at the average heights.

In [None]:
age_10_males = age_10[age_10['Gender']=='Male']
age_10_males.describe()  # mean height is 143.17

In [None]:
age_10_females = age_10[age_10['Gender']=='Female']
age_10_females.describe()  # mean height is 146.83
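
If only the mean is needed, it can be computed directly from the Height column instead of being read out of the `describe()` table. A minimal sketch on made-up data (the values and the names `toy_10`, `toy_males`, and `toy_females` are hypothetical, so the results are not the real means quoted above):

```python
import pandas as pd

# Hypothetical 10-year-olds, for illustration only.
toy_10 = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Female", "Female"],
        "Height": [140.0, 144.0, 143.0, 149.0],
    }
)

toy_males = toy_10[toy_10["Gender"] == "Male"]
toy_females = toy_10[toy_10["Gender"] == "Female"]

print(toy_males["Height"].mean())    # 142.0
print(toy_females["Height"].mean())  # 146.0
```

The same pattern applied to the notebook's own dataframes would be `age_10_males['Height'].mean()` and `age_10_females['Height'].mean()`.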

We could now create data subsets for 11-, 12-, 15-, and 17-year-olds in the same way. However, we will learn to _group_ data in Chapter 5 (800 Children and Teens, part two), so we will not have to do this manually.
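
For comparison, the manual approach could be sketched as a loop that applies the same filter once per age. The table below is made up, and `toy` and `subsets` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame(
    {
        "Age": [11, 11, 12, 15, 17, 17],
        "Height": [148.0, 151.0, 155.0, 165.0, 170.0, 176.0],
    }
)

# Build one filtered dataframe per age, stored in a dictionary keyed by age.
subsets = {age: toy[toy["Age"] == age] for age in [11, 12, 15, 17]}

for age, subset in subsets.items():
    print(age, len(subset), subset["Height"].mean())
```

This works, but every new age or category needs another filter, which is the repetition that grouping avoids.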