# Tutorial 2: exercise

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

This document was prepared at [Caltech](http://www.caltech.edu) with financial support from the [Donna and Benjamin M. Rosen Bioengineering Center](http://rosen.caltech.edu).

<img src="caltech_rosen.png">

*This tutorial exercise was generated from a Jupyter notebook. You can download the notebook [here](t2_exercise.ipynb). Use this downloaded Jupyter notebook to fill out your responses.*

### Exercise 1

The [Anderson-Fisher iris data set](https://en.wikipedia.org/wiki/Iris_flower_data_set) is a classic data set used in statistical and machine learning applications. Edgar Anderson carefully measured the lengths and widths of the petals and sepals of 50 irises in each of three species, *I. setosa*, *I. versicolor*, and *I. virginica*. Ronald Fisher then used this data set to distinguish the three species from each other.

**a)** Load the data set, which you can download [here](../data/anderson-fisher-iris.csv), into a Pandas `DataFrame` called `df`. Be sure to check out the structure of the data set before loading. You will need to use the `header=[0,1]` kwarg of `pd.read_csv()` to load the data set in properly.

In [1]:
import itertools

# Our numerical workhorses
import numpy as np
import pandas as pd
import scipy.integrate

# Import Altair for high level plotting
import altair as alt

# Import Bokeh modules for interactive plotting
import bokeh.io
import bokeh.plotting

# Set up Bokeh for inline viewing
bokeh.io.output_notebook()

In [2]:
# Use pd.read_csv() to read in the data and store in a DataFrame
df = pd.read_csv('/Users/maria/Desktop/Caltech/BiBe103/04-bebi103-2018/tutorial_exercises/data/anderson-fisher-iris.csv', comment='#')
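The `header=[0,1]` kwarg mentioned in part (a) can be tried on a small in-memory sample. This is only a minimal sketch, assuming the file carries the species names on its first header row and the measured quantities on its second; `csv_text` and `df_example` are illustrative names, and the values are the first three data rows that appear in this notebook's `df.head()` output.

```python
import io

import pandas as pd

# Assumed layout: species on header row 0, measured quantity on header row 1.
# The three data rows are copied from the df.head() output in this notebook.
csv_text = """setosa,setosa,setosa,setosa,versicolor,versicolor,versicolor,versicolor,virginica,virginica,virginica,virginica
sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
5.1,3.5,1.4,0.2,7.0,3.2,4.7,1.4,6.3,3.3,6.0,2.5
4.9,3.0,1.4,0.2,6.4,3.2,4.5,1.5,5.8,2.7,5.1,1.9
4.7,3.2,1.3,0.2,6.9,3.1,4.9,1.5,7.1,3.0,5.9,2.1
"""

# Passing a list to `header` makes Pandas build two-level column labels
df_example = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Each column is now addressed by a (species, measurement) pair
print(df_example.columns.nlevels)
print(df_example[("versicolor", "sepal length (cm)")].tolist())
```

With two-level column labels, the measurement names stay out of the data rows, so the numeric columns keep a numeric dtype.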

**b)** Take a look at `df`. Is it tidy? Why or why not?

In [3]:
df.head()

Unnamed: 0,setosa,setosa.1,setosa.2,setosa.3,versicolor,versicolor.1,versicolor.2,versicolor.3,virginica,virginica.1,virginica.2,virginica.3
0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
1,,,,,,,,,,,,
2,5.1,3.5,1.4,0.2,7.0,3.2,4.7,1.4,6.3,3.3,6.0,2.5
3,4.9,3.0,1.4,0.2,6.4,3.2,4.5,1.5,5.8,2.7,5.1,1.9
4,4.7,3.2,1.3,0.2,6.9,3.1,4.9,1.5,7.1,3.0,5.9,2.1


No, it is not tidy. The species is a variable, but it is stored in the column headers instead of in a column of its own, so each row mixes measurements from flowers of three different species.

**c)** Perform the following operations to make a new `DataFrame` from the original one you loaded in part (a). Do these operations one-by-one and explain what you are doing to the `DataFrame` in each one. The Pandas documentation might help.

In [4]:
df_tidy = df.stack(level=0)

`stack` moves the specified level of the column labels into the row index. Each value that sat in its own column becomes its own row, so the `DataFrame` gets longer and narrower.
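The effect of `stack` is easiest to see on a very small frame. The sketch below is an illustration under assumptions, not the notebook's actual data frame: `df_example` is a hypothetical two-species, two-measurement frame with two-level column labels, filled with values taken from the `df.head()` output above.

```python
import pandas as pd

# Hypothetical small frame: species on column level 0, measurement on level 1
columns = pd.MultiIndex.from_product(
    [["setosa", "versicolor"], ["sepal length (cm)", "petal length (cm)"]]
)
df_example = pd.DataFrame(
    [[5.1, 1.4, 7.0, 4.7], [4.9, 1.4, 6.4, 4.5]], columns=columns
)

# Move column level 0 (species) into the row index
stacked = df_example.stack(level=0)

# The row index now has two levels: original row number and species
print(stacked)
```

Each original row has been split into one row per species, and only the measurement names remain as columns.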

In [5]:
df_tidy = df_tidy.sort_index(level=1)

`sort_index` sorts the rows by their index labels, using the specified level (here level 1, the species) as the primary sort key.

In [6]:
df_tidy = df_tidy.reset_index(level=1)

`reset_index` with `level=1` moves that index level (the species labels) out of the index and into an ordinary column. Because the level has no name, Pandas calls the new column `level_1`. The remaining index level is left in place.

In [7]:
df_tidy = df_tidy.rename(columns={'level_1': 'species'})

This renames the `level_1` column to the more descriptive `species`.
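The sorting, index reset, and renaming steps can be followed on the same kind of toy frame. This is a sketch under assumptions: `df_example` is a hypothetical two-species frame with two-level column labels, which is not necessarily what `df` looks like in this notebook.

```python
import pandas as pd

# Hypothetical small frame: species on column level 0, measurement on level 1
columns = pd.MultiIndex.from_product(
    [["setosa", "versicolor"], ["sepal length (cm)", "petal length (cm)"]]
)
df_example = pd.DataFrame(
    [[5.1, 1.4, 7.0, 4.7], [4.9, 1.4, 6.4, 4.5]], columns=columns
)

tidy = df_example.stack(level=0)          # species moves into the row index
tidy = tidy.sort_index(level=1)           # group rows by species
tidy = tidy.reset_index(level=1)          # species becomes a column, 'level_1'
tidy = tidy.rename(columns={"level_1": "species"})

print(tidy)
```

After these steps, each row describes one flower: a `species` entry plus one column per measurement.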

**d)** Is the resulting `DataFrame` tidy? Why or why not?

In [8]:
df_tidy.head()

Unnamed: 0,species,0
0,setosa,sepal length (cm)
2,setosa,5.1
3,setosa,4.9
4,setosa,4.7
5,setosa,4.6


Yes, the resulting dataframe is tidy. It fits the three requirements:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a separate table.

**e)** Using `df_tidy`, slice out all of the sepal lengths for *I. versicolor* as a NumPy array.

<br />

In [15]:
df2 = df_tidy[df_tidy[0] != 'sepal length (cm)']
df2 = df2[df2['species'] =='versicolor']
df2.head()

Unnamed: 0,species,0
2,versicolor,7.0
3,versicolor,6.4
4,versicolor,6.9
5,versicolor,5.5
6,versicolor,6.5
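For comparison, here is a minimal sketch of how the slice in part (e) could look when the tidy frame has a `species` column and one column per measurement. The frame `df_example` is hypothetical, built from a few values shown in this notebook's outputs, and `sepal_lengths` is an illustrative name.

```python
import pandas as pd

# Hypothetical tidy excerpt: one row per flower
df_example = pd.DataFrame(
    {
        "species": ["setosa", "setosa", "setosa",
                    "versicolor", "versicolor", "versicolor"],
        "sepal length (cm)": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9],
    }
)

# Boolean mask selects the rows; the column label selects the variable
is_versicolor = df_example["species"] == "versicolor"
sepal_lengths = df_example.loc[is_versicolor, "sepal length (cm)"].to_numpy()

print(type(sepal_lengths), sepal_lengths)
```

Calling `.to_numpy()` (or using the `.values` attribute) on the selected column converts the Pandas `Series` to a NumPy array.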


### Exercise 2

**a)** Make a scatter plot of sepal width versus petal length with the glyphs colored by species.

I'm confused. I think I did something wrong since my data only has two columns. 
Not sure what I could have messed up since I just ran the commands already in the notebook.

To make a scatterplot, I would run:

In [16]:
alt.Chart(df2).mark_point().encode(
        x="petal length (cm)",
        y="sepal width (cm)").interactive()

TypeError: '<' not supported between instances of 'int' and 'str'

Chart({
  data:        species    0
  2   versicolor  7.0
  3   versicolor  6.4
  4   versicolor  6.9
  ...

**b)** Make a plot comparing the petal widths of the respective species. Comment on why you chose the plot you chose.

I think a color-coded scatter plot would make it the most clear. It allows you to see the petal widths of all three species in one plot and is not overly complicated.