# Computational notebooks demo

## Literate programming

One nice affordance of computational notebooks is that we can interleave blocks of text (like this one) written in [Markdown](https://www.markdownguide.org/getting-started/) with chunks of `code`, written in Python in our case. This style of coding is called **"literate programming"**. It facilitates process documentation, reproducible analyses, and reflection on analysis choices.

Some recommended practices for literate programming are:

1. Split up your code into distinct high-level operations (e.g., load data).
2. Say why you are doing things. Why did you transform the data?
3. Interpret your charts in text. What patterns do you see? What do they mean for your analysis?
4. Comment your code if the syntax isn't obvious.

The idea is that somebody unfamiliar with your work should be able to read this documentation and repeat your analysis for themselves, including your thought process. This kind of reproducibility is critical to doing data science.
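
As a minimal sketch of practices 1, 2, and 4, the cell below splits a tiny analysis into named steps and comments on the why rather than the what. The inline data and the function names `load_data` and `mean_by_species` are made up for illustration and are not part of this demo's dataset or code.

```python
import csv
import io
from statistics import mean

# Made-up inline data standing in for a CSV file on disk.
RAW = """species,petal_length
setosa,1.4
setosa,1.3
virginica,6.0
virginica,5.1
"""


def load_data(text):
    """Step 1, load: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def mean_by_species(rows):
    """Step 2, summarize: average per species, because we want to
    compare groups rather than individual flowers."""
    groups = {}
    for row in rows:
        # The csv module yields strings, so convert before averaging.
        groups.setdefault(row["species"], []).append(float(row["petal_length"]))
    return {species: mean(values) for species, values in groups.items()}


rows = load_data(RAW)
summary = mean_by_species(rows)
print(summary)
```

Each function maps to one high-level operation, so a reader can follow the analysis by reading the function names and docstrings alone.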

## Demonstration

There are some steps we need to take before we can visualize this data. We want to load the data and get it into a long format.
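
To make "long format" concrete, here is a small sketch on a two-row toy table. The values and the names `wide` and `long_df` are illustrative assumptions; the call to `pd.wide_to_long` mirrors the one used later in this demonstration.

```python
import pandas as pd

# Toy wide-format table: one row per flower, one column per measurement.
wide = pd.DataFrame({
    "id": [0, 1],
    "species": ["setosa", "virginica"],
    "petal_length": [1.4, 6.0],
    "petal_width": [0.2, 2.5],
    "sepal_length": [5.1, 6.3],
    "sepal_width": [3.5, 3.3],
})

# Columns named "<stub>_<suffix>" are split: each stub ("petal", "sepal")
# becomes a value column, and the suffix is stored in "dimension".
long_df = pd.wide_to_long(
    wide,
    stubnames=["petal", "sepal"],
    sep="_",
    i="id",
    j="dimension",
    suffix=r"\w+",
).reset_index()

print(long_df)
```

Each flower now takes two rows, one per dimension, with `petal` and `sepal` as the value columns.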

In [1]:
import pandas as pd

df = pd.read_csv("../data/iris.csv")
df.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [11]:
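# wide_to_long needs a unique row identifier, so use the row index.
df['id'] = df.index
# Split columns like "petal_length" into a stub ("petal") and a suffix ("length"):
# each stub becomes a value column and the suffix goes into a new "dimension" column.
df = pd.wide_to_long(df,
                stubnames=["petal", "sepal"],
                sep="_",
                i="id",
                j='dimension',
                suffix=r'\w+').reset_index()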

df.head()

,id,dimension,species,petal,sepal
0,0,length,setosa,1.4,5.1
1,1,length,setosa,1.4,4.9
2,2,length,setosa,1.3,4.7
3,3,length,setosa,1.5,4.6
4,4,length,setosa,1.4,5.0


In [13]:
import altair as alt

alt.Chart(df).mark_circle().encode(
    x="sepal:Q",
    y="petal:Q"
)

Note the clustering of points. Let's see if we can tease this apart in terms of our categorical variables.

It probably doesn't make sense to look at length and width together. Let's separate them. I can do this in one line of code because I've converted to a long format.

In [15]:
alt.Chart(df).mark_circle().encode(
    x="sepal:Q",
    y="petal:Q",
    column="dimension:N"
)

Now let's separate our species by color to see if they form clusters.

In [16]:
alt.Chart(df).mark_circle().encode(
    x="sepal:Q",
    y="petal:Q",
    color="species:N",
    column="dimension:N"
)

Most data won't be this clean, but it's helpful for the purpose of our demo. 

Note how I authored my visualizations in steps, choosing an initial visual representation and gradually adding information to it. Sometimes I'll get to a point where the charts I'm constructing stop working for my analysis, and I might need to consider alternative ways of looking at the data. It's almost always a good idea to view data in more than one way.

In [22]:
# Reload the original wide-format data, since df was reshaped to long format above.
df = pd.read_csv("../data/iris.csv")
df.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [24]:
sepal = alt.Chart(df).mark_circle().encode(
    x="sepal_width:Q",
    y="sepal_length:Q",
    color="species:N"
)
petal = alt.Chart(df).mark_circle().encode(
    x="petal_width:Q",
    y="petal_length:Q",
    color="species:N"
)

sepal | petal

In [25]:
widths = alt.Chart(df).mark_circle().encode(
    x="sepal_width:Q",
    y="petal_width:Q",
    color="species:N"
)
lengths = alt.Chart(df).mark_circle().encode(
    x="sepal_length:Q",
    y="petal_length:Q",
    color="species:N"
)

widths | lengths