## Describe data: A dive into visualizing data distributions

### Learning goals

* read data files from disk
* practice importing libraries 
* practice exploring data 
  * use data frames 
  * plot data frames
* visualize data using various representations (plots)

### Introduction

In this tutorial, we are going to play around with loading some existing data from a file on disk and then visualizing the distribution of values in the data. Data can be complex and interrelated but, ultimately, a data set is composed of *variables* that can take on *values*.

An important first step in analyzing data is often to see how those values are *distributed* – that is, to answer questions like

* what is the range of values (and does that range make sense)?
* are some values more common than others?
* is there a "typical" value? Or is there more than one typical value?
* are all the data near the "typical" value, or are they all very different?
* can we describe the data succinctly using a known *distribution*?
  - do the data come from an approximately *Gaussian* (or *Normal*) distribution?
  - if not, do they come from some other known distribution (or are they just crazy)?
* Are two (or more) distributions the...
  - same for all practical intents and purposes?
  - different looking enough for further investigation?
 
The above are all great questions, and it is important to try to answer them as soon as possible after being handed a new dataset. As soon as we start working with new data, it is best practice to spend a good amount of time *describing the data*, that is, characterizing the basic properties of the distribution of the data. This will ultimately save us time down the road in a project.

There are various fancy ways to do this, but the first step, and often the only necessary step, is to just *look at the data!* So, let's do that now!

The distribution plots we'll play with today are:

* [Histograms](https://en.wikipedia.org/wiki/Histogram)
* [Kernel Density Estimate (KDEs)](https://en.wikipedia.org/wiki/Kernel_density_estimation) plots
* [Empirical Cumulative Distribution Function (ECDF)](https://en.wikipedia.org/wiki/Empirical_distribution_function) plots
* [Categorical or Strip plots](https://en.wikipedia.org/wiki/Dot_plot_(statistics))
* [Violin Plots](https://en.wikipedia.org/wiki/Violin_plot)
* [Boxplots](https://en.wikipedia.org/wiki/Box_plot)

Don't worry, we'll unpack these in turn. But they all have something in common; they all attempt to communicate the same thing – *how are the data values distributed* or *what do the distributions of data values look like?* – but do so in ways that emphasize different features and show different levels of detail.

### Preliminaries

#### As always, we need to import some "data science-y" libraries

As experienced in Tutorial 0001, Python requires importing the *libraries* needed for the work. This is a bit tricky at first, because beginners do not yet know what each library contains and does. So all of this can appear a bit confusing at the beginning, but as your experience grows during the semester, the role of each library in your work will become clearer and simpler.

This class will cover the four fundamental libraries for data science in Python:

* [numpy](https://numpy.org/) to generate and manipulate numbers
* [pandas](https://pandas.pydata.org/) for reading, plotting and storing advanced data objects. Hereafter, we will use it to read data saved in a file on your computer's hard drive. The operations to read and write files from disk are generally referred to as `i/o` (i.e., `i`nput/`o`utput file operations)
* [matplotlib](https://matplotlib.org/) used to make simple plots 
* [seaborn](https://seaborn.pydata.org/) to plot more complex datasets

#### Import `numpy`
The first library we import is called `numpy`. We import it using the short name `np`. Python programmers often import libraries using nicknames. This helps make the code shorter wherever a library is used. For example, to use the command `arange()` available in `numpy` we would need to write `numpy.arange()`; using the nickname `np`, the code shortens to `np.arange()`. Nicknames are standardized by convention in Python: each library is generally imported with a specific nickname. Ok, let's import `numpy`:

In [None]:
import numpy as np 
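To see that a nickname is only a shorter name for the same library, here is a small sketch (the variable names `long_form` and `short_form` are made up for illustration) that calls `arange()` both ways and compares the results:

```python
import numpy          # full name
import numpy as np    # conventional nickname

# Both names point at the same library, so both calls do the same thing.
long_form = numpy.arange(5)
short_form = np.arange(5)

print(long_form)    # [0 1 2 3 4]
print(short_form)   # [0 1 2 3 4]
print(np is numpy)  # True
```

In practice only the second import line is needed; the first is there purely for the comparison.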

#### Import `pandas`

After `numpy`, we import another major library commonly used in data science applications, `pandas`:

In [None]:
import pandas as pd

#### Import `seaborn`

`seaborn` is one of the most used libraries for data visualization:

In [None]:
import seaborn as sns

#### Import `matplotlib`'s `pyplot`

After `seaborn`, we import another major library commonly used in data science applications, `pyplot`. Note that `pyplot` is part of the larger library called `matplotlib`. So here we are importing a sub-module, a smaller library that is part of a larger library. The syntax goes as follows:

In [None]:
import matplotlib.pyplot as plt

$\color{blue}{\text{Answer the following question:}}$

(1) What is the nickname used for `pyplot`? [Enter answer here]

#### Read a dataset saved in a comma separated values dataset (`.csv`)

The code, hereafter, uses `pandas` (`pd`) to directly read a `.csv` (comma separated value file). `Pandas` comes with a `csv` file reader. The reader can be called directly, as all the functions of `python`'s libraries are called, using the `<library_name><DOT><function_name>` notation:

In [None]:
myDataFromFile = pd.read_csv("datasets/007DataFile.csv")

$\color{blue}{\text{Answer the following questions:}}$

In the line of code above, what is the:

 - name of the library used to load the file?  [Enter answer here]
 - name of the `pandas` function we use to read the data file?  [Enter answer here]
 - data file name?  [Enter answer here]
 - name of the variable used to store the file?  [Enter answer here]
 - name of the folder containing the data file?  [Enter answer here]

#### Now let's make sure we read something that looks okay

We can use the function `display`, which Jupyter makes available automatically, to take a look at the data. What is inside this file?

In [None]:
display(myDataFromFile)

Note that this data frame has two columns, but one of them is not numeric. It is a *grouping variable* that is stored as a text *string* rather than as a number. This type of data is referred to as ['*tidy*'](https://en.wikipedia.org/wiki/Tidy_data).

The same data could have been stored in two columns, an "A" column and a "B" column. This kind of data set is 'untidy', and we have already encountered such data (the freezer data). Untidy data is not necessarily evil or anything; for the freezer data, the same variable (temperature) was measured over time, so it actually made sense to have rows represent time points, and the freezers in the columns. 

But a big advantage of tidy data is that it lets us more easily segregate data by the grouping variable(s). In fact, a big part of data science is actually "*data wrangling*", much of which involves making messy data into a tidy form for analysis.
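As a small illustration of the difference, the sketch below builds a made-up 'untidy' table with an "A" column and a "B" column and reshapes it into the tidy layout with `DataFrame.melt()`. The numbers are invented for the example and are not the values in our data file; the column names `Group` and `Value` are chosen to match the ones used in this tutorial.

```python
import pandas as pd

# Hypothetical 'untidy' (wide) data: one column per group.
wide = pd.DataFrame({"A": [0.1, -0.4, 0.7],
                     "B": [1.2, 0.8, 1.5]})

# Reshape to tidy form: one row per observation,
# with the group stored in its own column.
tidy = wide.melt(var_name="Group", value_name="Value")
print(tidy)

# With tidy data, selecting one group is a simple filter.
only_a = tidy[tidy["Group"] == "A"]
print(only_a["Value"].mean())
```

The tidy table has six rows and two columns: every measurement gets its own row, and the grouping variable tells us which group it came from.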

$\color{blue}{\text{Answer the following question:}}$

 - What are the dimensions (the size) of the data?

### Data visualization

And now for the fun way of looking at data...
as visuals!

The seaborn library (which we imported as `sns`) is an advanced plotting library (we will learn more about it later on!). `Seaborn` is advanced because it takes care of many of the plotting details for us, and most of its commands make pretty good looking plots off the shelf (without much work).

So let's take seaborn out for a ride!

#### Histograms

In [None]:
sns.displot(myDataFromFile, x="Value", hue="Group", kind="hist", alpha=0.3)

Note what seaborn did here in a simple call. 

Here, we made a single call to `seaborn.displot()`, which is short for *distribution plot*. In the first three arguments to the call, we 

 1. told `seaborn.displot()` what data frame to use,
 2. mapped the *Value* variable to the `x axis` of the plot and,  
 3. mapped the *Group* variable to the color of the bars. 
 
The remaining two arguments asked for a *histogram* (`kind="hist"`, which is also what `displot()` draws by default) and made the bars partly transparent (`alpha=0.3`). Beyond that, `seaborn.displot()` did a lot of stuff automatically for us: it picked the width of the bins for the histograms and the specific colors to use (orange & blue – go Gators!), labeled the `x` and `y` axes appropriately, and even made a legend for us! We can customize all of these things of course, but it's nice to have a command like `displot()` that makes a plot with decent defaults.

#### How can we learn more about how displot works?

In Jupyter (IPython), you can add a question mark, `?`, at the end of a command to display its documentation. Below is the call that shows the documentation for `seaborn`'s `displot` (be ready, it is pretty exhaustive! Also, in some versions of Jupyter the help returned by `?` is displayed in a new frame that needs to be closed by clicking the [x]):

In [None]:
sns.displot?

$\color{blue}{\text{Answer the following question:}}$

 - Write in the cell below the command to show the functionality of `pandas` `read_csv`

#### Kernel Density Estimate (KDE) plots

Histograms, by definition, are *discrete*: they divide the range of data values into discrete intervals called *bins*, count the number of observations in each bin, and then map these counts to the `y` axis. 
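The counting step can be sketched with `np.histogram()`, which returns the bin edges and the number of observations in each bin without drawing anything. The values below are made up for illustration.

```python
import numpy as np

# Hypothetical data values (not from our data file).
values = np.array([0.1, 0.4, 0.5, 1.2, 1.9, 2.3, 2.4, 2.8])

# Split the range 0..3 into 3 equal-width bins and count the values in each.
counts, edges = np.histogram(values, bins=3, range=(0.0, 3.0))

print(edges)   # [0. 1. 2. 3.]
print(counts)  # [3 2 3]
```

A histogram plot is just these counts drawn as bars, one bar per bin.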

Data, however, are often *continuous*, with no actual sharp transitions across (arbitrary) bin boundaries. So it would be nice to represent the data in a way that reflects the underlying smoothness. One such plot is called a *Kernel Density Estimate* or *KDE* plot. We won't go too deeply under the hood of a KDE plot (calculus required!) but it essentially takes a histogram and blurs it to yield a continuous function (just like if you blur your eyes a sharp point becomes a continuous blob).

All we have to do to make a KDE plot (without calculus!) is to tell `seaborn.displot()` that that's what we want:

In [None]:
sns.displot(myDataFromFile, x="Value", hue="Group", kind="kde")

As we can see, this plot conveys the data distributions in a perhaps cleaner and more visually appealing way.

We can also play around with the appearance of the plot by adding optional arguments to `seaborn.displot()`. For example, we can fill in the areas under the curves, and make the fill transparent so we can see one curve through the other:

In [None]:
sns.displot(myDataFromFile, x="Value", hue="Group", kind="kde", fill=True, alpha=0.2)

The argument `fill` is self-explanatory, and `alpha`, for whatever reason, is the universal variable for "transparency" in computer graphics. It always ranges from 0 to 1, with 1 being opaque and 0 being invisible.

This is the part where you Google and play around and see what other ways you can change the appearance of our plot!

#### So which is better, the histogram or the KDE plot?

Well, one answer is "Why choose when you can have both?"

In [None]:
sns.displot(myDataFromFile, x="Value", hue="Group", kind="hist", kde=True)

But the better answer is that it depends. Both types of plot can be misleading. 

For example, a histogram can have too many bins:

In [None]:
sns.displot(myDataFromFile, x="Value", hue="Group", kind="hist", bins=100)

Or too few:

In [None]:
sns.displot(myDataFromFile, x="Value", hue="Group", kind="hist", bins=3)

---
#### Pro tip! 
If you have a sense of what represents a meaningful change in your data values, it can be more intuitive to adjust `binwidth` instead of `bins`. Try it!

---

Similarly, a KDE plot can be too smooth:

In [None]:
sns.displot(myDataFromFile, x="Value", hue="Group", kind="kde", bw_adjust=3)

Note how, in the above cell, we set the parameter `bw_adjust` to 3. Roughly speaking, this parameter scales how wide a neighborhood of values the KDE averages over when deciding where to draw the curve for the distribution. I feel like this plot is lying about the B distribution, making it look like the population is a perfect normal distribution when it may not be.

They can also be too bumpy, defeating the very point of the KDE plot:

In [None]:
sns.displot(myDataFromFile, x="Value", hue="Group", kind="kde", bw_adjust=0.5)

Assuming the data are roughly normal (Gaussian), and it's reasonable to assume the population is normally distributed, and all else being equal, I tend to prefer KDE plots over histograms. I think they're prettier. That being said, I would never try to sell someone a KDE plot without 1) looking at histograms myself to inspect the data and 2) having a histogram loaded into the chamber in case anybody asks. 

---
##### Note on KDE plots:
The *kernel density estimate* uses a mathematical function called the kernel - often a normal distribution - to "blur" the data via an operation akin to *convolution*. The 'bw_adjust' argument adjusts the *bandwidth* of the kernel. The larger the bandwidth, the smoother the plot will appear; the smaller, the more wiggly.

---
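For the curious, the "blurring" in the note above can be sketched in a few lines of `numpy`. This is a minimal hand-rolled version under stated assumptions, not seaborn's actual implementation: the helper name `gaussian_kde_sketch` is made up, the data are randomly generated, and the bandwidth is set directly in data units (seaborn instead picks a default bandwidth and multiplies it by `bw_adjust`).

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point.

    Hypothetical helper for illustration only.
    """
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Distance of every grid point from every sample, in bandwidth units.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # Gaussian kernel, normalized so each bump has area 1.
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    # The density estimate is the average of all the bumps.
    return bumps.mean(axis=1)

# Made-up sample: 200 draws from a standard normal distribution.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=200)

grid = np.linspace(-8.0, 8.0, 801)
dx = grid[1] - grid[0]

wiggly = gaussian_kde_sketch(samples, grid, bandwidth=0.1)  # small bandwidth
smooth = gaussian_kde_sketch(samples, grid, bandwidth=1.0)  # large bandwidth

# Both curves are densities, so the area under each is close to 1.
print(wiggly.sum() * dx, smooth.sum() * dx)
```

Plotting `wiggly` and `smooth` against `grid` (for example with `plt.plot`) should show the same trade-off as the `bw_adjust` examples above: the small bandwidth follows every cluster of points, while the large one irons them out.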


#### Empirical Cumulative Distribution Function (ECDF) plots

Another useful way to look at distributions is with the *Empirical Cumulative Distribution Function* plot or *ECDF*. It plots, for each value on the x axis, the proportion of the data that fall to the left of that value. It hence goes from 0.0 on the left (just below the very smallest data value) to 1.0 on the right (just above the very highest data value). 

In [None]:
sns.displot(myDataFromFile, x="Value", hue="Group", kind="ecdf")

Visually, you can see that - calculus alert! - the *ECDF* is essentially the integral of the *KDE*. In fact, let's actually plot the integral of our first KDE plot from above.

In [None]:
sns.displot(myDataFromFile, x="Value", hue="Group", kind="kde", cumulative=True)

Sweet!

In all the "goldilocks" plots, we can clearly see the relative shift in the two distributions (median or mean) in addition to the different widths (standard deviations).

 The difference in medians or means is shown by the relative shift of the two distributions in all the plots. In the *ECDF*, the value on the x axis corresponding to midpoint on the y axis – at 0.5 – is by definition the median.
 
 The difference in standard deviations in the *KDE* and histogram is shown by the relative widths or "fatnesses" of the two distributions, whereas in the *ECDF*, it is given by the steepnesses (slopes) of the curve.
 
 Which type of plot is better is both situational and a matter of taste; I like the *histogram* and *KDE* for appreciating the "vibe" of distributions, but the *ECDF* can be better at revealing small but systematic shifts in the mean.
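The link between the *ECDF* and the median can be checked by hand. The sketch below uses a made-up helper called `ecdf` and five invented data values (an odd number, so the median is one of the data points); it is an illustration of the idea, not what seaborn runs internally.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the proportion of data at or below each one.

    Hypothetical helper for illustration only.
    """
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Made-up data values.
values = np.array([2.0, -1.0, 0.5, 3.0, 1.0])
x, y = ecdf(values)

print(x)  # [-1.   0.5  1.   2.   3. ]
print(y)  # [0.2 0.4 0.6 0.8 1. ]

# The first x value at which the ECDF reaches 0.5 is the median.
median_from_ecdf = x[np.searchsorted(y, 0.5)]
print(median_from_ecdf, np.median(values))  # 1.0 1.0
```

Reading across from 0.5 on the y axis and down to the x axis in the ECDF plot is the graphical version of the last two lines.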

Seaborn also provides some built-in themes to change the overall appearance of plots:

In [None]:
sns.set_style("darkgrid")
sns.displot(myDataFromFile, x="Value", hue="Group", kind="ecdf")

That's nice! But, personally, I'd like more ticks/gridlines on the y axis and, since 0.5 corresponds to the median, it would be nice to have a gridline exactly there. To do this, we'll delve into the "lower level" `matplotlib` functions. In this particular case, we'll use the `matplotlib.pyplot.yticks()` function to make gridlines where we want them, and we'll use `np.arange()` to make the exact values for the gridlines. And remember, we imported `matplotlib.pyplot` as `plt`, so that will save us a little typing!

In [None]:
sns.displot(myDataFromFile, x="Value", hue="Group", kind="ecdf")
myTickMarks = np.arange(0, 1, 0.1)
plt.yticks(myTickMarks);

Here, we put the tick values in a new variable we created, and then passed that variable to `plt.yticks`. We could have done this all in one go if we wanted:
`plt.yticks(np.arange(0, 1, 0.1))`

I always want to read `arange()` as "arrange", but it's really "a range", as in "a range of values". The three arguments to `np.arange()` are the start, stop, and step size of the range; the stop value itself is not included, so the ticks above run from 0 to 0.9. Now we can literally see that the two medians are about 0 and 1. Since the distributions are "normalish" or roughly Gaussian, these correspond to the means as well. 

(obviously, there is a corresponding call to set the x ticks...)
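One detail of `np.arange()` worth seeing in action: the stop value is left out of the result. The short sketch below (variable names are made up) compares it with `np.linspace()`, which takes the number of points instead of a step size and does include the endpoint.

```python
import numpy as np

# Stop value (1) is excluded: ten ticks, from 0 up to 0.9.
ticks_arange = np.arange(0, 1, 0.1)

# linspace asks for the number of points and includes both ends:
# eleven ticks, from 0 up to 1.
ticks_linspace = np.linspace(0, 1, 11)

print(len(ticks_arange), ticks_arange[-1])
print(len(ticks_linspace), ticks_linspace[-1])
```

So if a tick at exactly 1.0 is wanted on the y axis, passing `np.linspace(0, 1, 11)` to `plt.yticks()` is one way to get it.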

#### Categorical or strip plots

One simple thing we can do is to directly look at all the data points, split by category:

In [None]:
sns.catplot(data=myDataFromFile, x="Group", y="Value")

The command `seaborn.catplot()` makes a *categorical plot*, i.e. a plot with a categorical x axis and a numerical y axis. Notice that `seaborn.catplot()` jitters the data points horizontally so you can see more of them without occlusion. This type of plot is also called a *strip plot*.

Also notice that our theme is still being applied to every plot. Earlier in this tutorial, in the ECDF section, we called `sns.set_style("darkgrid")`. This sets up a specific style for the plot, and the style is kept for all plots that follow the `set_style` call.

---

##### Quick quiz! 
Can you make the data points transparent so that data points on top of one another appear as darker clusters?

---

#### Violin plots

It's worth reiterating and emphasizing that, in the categorical plot, we changed the mapping of the variables. In the previous plots, both distributions were plotted on a single coordinate system, and the categorical variable was mapped to a color. This is great for one and sometimes two distributions. But with multiple distributions, i.e. multiple values of a categorical grouping variable, plots can get busy and hard to read. Mapping the grouping variable to position on the x axis is a great solution, as it pulls the distributions apart so they can be visually compared more easily.

Another kind of plot that maps a grouping variable to the x axis is a *violin plot*. Here's one:

In [None]:
sns.violinplot(data=myDataFromFile, x="Group", y="Value")

The violin plot is essentially a KDE plot in which the distributions are flipped on their sides, separated along the x axis, and plotted along with a mirror image. Sometimes the actual data are plotted as well. As with all plots, there are various things we can tinker with (Google is your friend!). For example, we can plot the data values as "sticks" instead of points:

In [None]:
sns.violinplot(data=myDataFromFile, x="Group", y="Value", inner="stick")

#### Boxplots

Notice that both histograms and categorical (strip) plots attempt to show the data directly, whereas KDE and violin plots abstract the data a little by trying to estimate the smooth distribution underlying the data. We can take the abstraction a step further by plotting some summary numbers instead of the data themselves using a *box plot*. 

In [None]:
sns.boxplot(data=myDataFromFile, x="Group", y="Value")

A boxplot shows 5 summary numbers. The *median* is shown by a horizontal line inside the box. The lower and upper edges of the box mark the first and third quartiles, so the height of the box is the *interquartile range* or *IQR*. Finally, the *whiskers* extend from the box to the most extreme data points that lie within 1.5x the IQR of the box edges (boxplots are sometimes called box-and-whiskers plots). Any data points falling outside the whiskers are plotted individually as potential outliers.
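The five numbers can be computed by hand with `numpy`. The sketch below uses ten made-up values, one of which is deliberately extreme, and follows the usual 1.5x IQR rule for the whiskers.

```python
import numpy as np

# Made-up data with one suspiciously large value.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

# Box: first quartile, median, third quartile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual point.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)             # 3.25 5.5 7.75
print(whisker_low, whisker_high)  # 1.0 9.0
print(outliers)                   # [30.]
```

Note that the upper whisker ends at 9, the largest data point inside the fence, rather than at the fence value itself (14.5).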

Sometimes it is helpful to combine plots to show both the data and some summary numbers:

In [None]:
sns.boxplot(data=myDataFromFile, x="Group", y="Value")
sns.stripplot(data=myDataFromFile, x="Group", y="Value", alpha = 0.42)

We can "check our work" by summarizing the data and comparing the percentiles with what's shown by our boxes.

In [None]:
myDataFromFile.groupby("Group").describe()

Looks good!

But – wait! – let's unpack the call above a little bit. As we've already seen, data frames in pandas "know" how to do things. We saw last time that they know how to make a boxplot of themselves, for example.

In the call `myDataFromFile.groupby("Group").describe()`, the `myDataFromFile.groupby("Group")` part tells the data frame to group itself by the "Group" variable. And then the `.describe()` tells it to describe itself for us. 
If you've used R and the tidyverse, then this is roughly equivalent to 

```
myDataFromFile %>%
    group_by(Group) %>%
    summarize(mean = mean(Value), sd = sd(Value))
```
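To see the group-then-describe pattern on something small enough to check by eye, here is a sketch with a made-up six-row data frame (the name `toy` and its values are invented; only the column names match our data):

```python
import pandas as pd

# Hypothetical tidy data: three values in each of two groups.
toy = pd.DataFrame({
    "Group": ["A", "A", "A", "B", "B", "B"],
    "Value": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Group by the grouping variable, then describe the numeric column.
summary = toy.groupby("Group")["Value"].describe()
print(summary)

# Any single statistic can be requested the same way.
group_means = toy.groupby("Group")["Value"].mean()
print(group_means["A"], group_means["B"])  # 2.0 20.0
```

Each row of `summary` describes one group, with the count, mean, standard deviation, minimum, quartiles, and maximum in the columns.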


If we have our current data frame make a boxplot of itself...

In [None]:
myDataFromFile.boxplot()

... it's not super useful because it looks at the data, sees only one numeric variable, and makes a boxplot of that variable (from both groups). But since it turns out that data frames know how to group themselves, maybe we can group and then boxplot, just like we grouped and then described above. Let's try:

In [None]:
myDataFromFile.groupby("Group").boxplot()

Nice! I prefer the seaborn version, but this is a nice tool to have in our toolbelt.

Okay, so now we know how to make a number of *distribution* plots in python, and know a little bit about how to work with data that have a grouping variable. Sam would be proud!

$\color{blue}{\text{Final Report for this Tutorial.}}$

 - create a new repository under your user account on [github.com/yourUserName](https://github.com/yourUserName).
 - clone the repository locally on your computer (pick a smart folder to do so, say, `~/git` or `~/code`)
 - move the Jupyter notebook file for this tutorial into the folder of the repository cloned above. This will require using a combination of commands such as `mv`, `cp`, `cd`, and `ls` (you can also do this with your mouse, but make sure you are copying into the proper folder).
 - use `git add [file name]` to add the file you edited for this tutorial to the repository
 - use `git commit -am "Your message here"` to commit the file to the repository
 - push the local repository with the changes and the newly added file to the cloud using `git push [repository name]`
 - Submit the URL to the github repository on Canvas
 
 Note. From now on, the above is going to be required for every tutorial. We will always ask you to report the URL of the tutorial, with all the operations you performed and the answers to the questions visible.