In [1]:
library(tidyverse)
library(ggplot2)
options(repr.plot.width = 10, repr.plot.height = 6, jupyter.plot_mimetypes = "image/png")
theme_set(theme_classic())

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


# STATS 504
## Week 1: Exploratory data analysis

To follow along in today's lecture you'll need to load `tidyverse` and also install the `nycflights13` package:
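A minimal setup sketch, assuming packages come from CRAN (the mirror URL here is an assumption; any mirror works). The guard installs `nycflights13` only when it is missing:

```r
# Install nycflights13 only if it is not already available.
# The CRAN mirror is an assumption; substitute your preferred one.
if (!requireNamespace("nycflights13", quietly = TRUE)) {
  install.packages("nycflights13", repos = "https://cloud.r-project.org")
}
```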

## What is exploratory data analysis?

<img src="https://upload.wikimedia.org/wikipedia/en/e/e9/John_Tukey.jpg" style="margin: 0 0 0 20px; float: right" />

> Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.



## EDA (concrete version)

1. Generate questions about your data.
2. Search for answers by visualizing, transforming, and modelling your data.
3. Use what you learn to refine your questions and/or generate new questions.
4. (Return to #1).

Two types of questions are always useful for making discoveries within your data:

- What type of variation occurs within my variables?
- What type of covariation occurs between my variables?

## Variation
- Real-life variables change from measurement to measurement. 
- This is often true even if you measure the same thing twice!
- Each measurement has a small amount of error ("noise").
- The noise is different each time you take a measurement.

## Covariation
- Real-life variables tend to change together in a related way.
- The best way to spot covariation is to visualize the relationship between two or more variables.

## Continuous vs. discrete
The appropriate visualization will depend on whether the data are:
- *Continuous*: they take on an infinite number of ordered values.
- *Discrete*: they take on one of a small number of values.

## 🤔 Manufacturers

Continuous or discrete?: `mpg$manufacturer`

<ol style="list-style-type: upper-alpha;">
    <li>Continuous</li>
    <li>Discrete</li>
    <li>Could be either</li>
</ol>

## 🤔 Prices

Continuous or discrete?: `diamonds$price`

<ol style="list-style-type: upper-alpha;">
    <li>Continuous</li>
    <li>Discrete</li>
    <li>Could be either</li>
</ol>

## 🤔 Ages

Continuous or discrete?: `mil$age`

<ol style="list-style-type: upper-alpha;">
    <li>Continuous</li>
    <li>Discrete</li>
    <li>Could be either</li>
</ol>

## One continuous variable

First we will focus on understanding the distribution of one continuous variable.

`morley` is a built-in dataset measuring the speed of light:
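A minimal sketch of a first look at the data, using only base R (`morley` ships with the `datasets` package). According to its help page, `Speed` is coded in km/s with 299,000 subtracted:

```r
# morley: Michelson's 1879 speed-of-light measurements (datasets package).
str(morley)             # 100 rows; columns Expt, Run, Speed
head(morley)            # first few runs of experiment 1
summary(morley$Speed)   # five-number summary plus the mean
```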

## 🤔 Types of variables
How would you describe the types `Expt`/`Run`/`Speed` in these data?

<ol style="list-style-type: upper-alpha;">
    <li>Continuous / Continuous / Continuous </li>
    <li>Discrete / Discrete / Discrete</li>
    <li>Continuous / Discrete / Continuous</li>
    <li>Discrete / Discrete / Continuous</li> 
</ol>

Here we're measuring the speed of light, an absolute, unchanging, physical constant:

$$c = 299{,}792{,}458 \ \mathrm{m/s}.$$

But we get a different value with every experiment. Why?

## Visualizing variation in our data

In order to understand how accurately we measured the speed of light, we first need to assess its *variation*. Since the measurement is continuous, we have several options:
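As a sketch, three common choices for one continuous variable are a histogram, a kernel density estimate, and a box plot. The object names and the choice of `bins = 20` are illustrative assumptions:

```r
library(ggplot2)

# Histogram: counts of Speed values falling in each bin.
p_hist <- ggplot(morley, aes(x = Speed)) +
  geom_histogram(bins = 20)

# Kernel density estimate: a smoothed analogue of the histogram.
p_dens <- ggplot(morley, aes(x = Speed)) +
  geom_density()

# Box plot: median, quartiles, and points flagged as unusually far out.
p_box <- ggplot(morley, aes(y = Speed)) +
  geom_boxplot()
```

Evaluating `p_hist`, `p_dens`, or `p_box` in a notebook cell draws the corresponding plot.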

### Adjusting a histogram
Any dataset can be plotted using multiple different histograms. For example:
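A sketch of the same `morley` speeds drawn with three different bin counts; the specific values 5, 15, and 50 are arbitrary choices for illustration:

```r
library(ggplot2)

# Same data and aesthetics; only the `bins` argument changes.
base <- ggplot(morley, aes(x = Speed))

p_coarse <- base + geom_histogram(bins = 5)    # very smooth, little detail
p_medium <- base + geom_histogram(bins = 15)
p_fine   <- base + geom_histogram(bins = 50)   # detailed, but jagged
```

Each plot accounts for all 100 measurements; only the grouping into bars differs.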

There is no single right answer to "how many bins?"; different values tell different stories about your variable:
- Larger values of `bins` show more detail but have higher *variance*.
- Smaller values are smoother but have higher *bias*.

## Follow-up questions
Now that we can see variation in our data, what sort of follow-up questions should we ask?

- Which values are the most common? Why?

- Which values are rare? Why? Does that match your expectations?

- Can you see any unusual patterns? What might explain them?

## The diamonds dataset
Let's look at a different dataset, `diamonds`, which ships with `ggplot2`:

This is a dataset of price, quality, and other characteristics for 54k diamonds.
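A quick way to inspect the columns, sketched with `dplyr::glimpse()`:

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# One line per column: name, type, and the first few values.
glimpse(diamonds)   # 53,940 rows and 10 columns
```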

## 🤔 Question

What can be said about the distribution of `carat` in the `diamonds` dataset? (Check all that apply.)

<ol style="list-style-type: upper-alpha;">
    <li>Almost all diamonds are &lt; 3 carats.</li>
    <li>Missing values are encoded as <code>carat = -1</code>.</li>
    <li>Diamond makers appear to prefer diamonds that are rounded to the nearest .1 or .5 of a carat.</li>
    <li>There are more diamonds between 0 and 1 carats than &gt;1 carats.</li>
    <li>There are more diamonds that measure 2.0 carats than there are that measure between 1.8 and 2.0 carats.</li>
</ol>

(Hint: plot a histogram, and try out different values for `bins`, `breaks`, or `binwidth`.)
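One possible starting point for the hint, as a sketch; the bin width of 0.01 carat and the zoom range are arbitrary choices to experiment with:

```r
library(ggplot2)

# A narrow bin width exposes fine structure that wide bins smooth away.
p_carat <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

# coord_cartesian() zooms the x-axis without dropping any rows.
p_carat_zoom <- p_carat + coord_cartesian(xlim = c(0, 3))
```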

## Unusual values (outliers)
Outliers are "unusual" observations. 
- Sometimes they are due to data entry errors.
- Sometimes they are important for other reasons. 
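A sketch of how outliers can be made visible and then inspected. The choice of the `y` column of `diamonds` (one of its dimension measurements) and the cutoffs 3 and 20 are assumptions for illustration:

```r
library(ggplot2)
library(dplyr)

# With ~54k rows, a handful of extreme values produce bars too short to see.
# Zooming the count axis with coord_cartesian() makes them visible.
p_y <- ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

# Pull out the unusual rows for a closer look (cutoffs are illustrative).
unusual <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  select(price, x, y, z) %>%
  arrange(y)
```

Looking at the other columns of `unusual` can help decide whether a value is a data entry error or a genuinely extreme observation.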

## Visualizing the distribution of a discrete variable
For a discrete variable, generally the only thing we're interested in is the count of each different value that the variable can assume. For this, something like a bar plot is often used:
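As a sketch, using the `cut` column of `diamonds` as the discrete variable (an illustrative choice), the counts can be tabulated with `count()` and drawn with `geom_bar()`:

```r
library(ggplot2)
library(dplyr)

# Tabulate: one row per distinct value of cut.
cut_counts <- count(diamonds, cut)

# geom_bar() performs the same counting internally.
p_cut <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()
```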

## Discrete variables with many values
Sometimes a discrete variable can take on a lot of values, such that it's not practical to plot its entire distribution. For example:

In this case we can reduce the data in some way, for example, only plotting the most common airports:
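A sketch of this reduction, assuming the variable in question is the destination airport `dest` in `nycflights13::flights` and that "most common" means the ten most frequent values:

```r
library(ggplot2)
library(dplyr)
library(nycflights13)

# dest takes over 100 distinct values: too many for a readable bar plot.
n_distinct(flights$dest)

# Keep only the ten most frequent destinations.
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)

# reorder() sorts the bars by frequency instead of alphabetically.
p_dest <- ggplot(top_dest, aes(x = reorder(dest, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "dest", y = "number of flights")
```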

## Covariation
**Covariation** is when multiple variables vary together in a similar way. Covariation is everywhere, e.g.:
- Height and weight
- Political preference and religion
- Income this year vs. income last year
- Etc.

One of the best ways to spot covariation is to visualize the "joint distribution" of both variables.

When studying covariation between two variables, there are three possibilities, depending on whether the variables are both continuous, both discrete, or one of each.

## Continuous and discrete 
With one continuous and one discrete variable, there are several choices:
- Box-and-whisker plot
- Multiple/colored histograms

Let's return to the `morley` dataset and consider covariation between `Expt` (experiment) and `Speed`:
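A box-and-whisker sketch of `Speed` by experiment. `Expt` is stored as an integer, so it is wrapped in `factor()` to make `ggplot2` treat it as discrete:

```r
library(ggplot2)

# One box per experiment; factor() turns the integer Expt into a discrete axis.
p_expt <- ggplot(morley, aes(x = factor(Expt), y = Speed)) +
  geom_boxplot() +
  labs(x = "Expt")
```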

Let's study covariation of `cut` and `price` in the `diamonds` data set.
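Two sketches for `cut` and `price`: a box plot, and overlaid frequency polygons coloured by `cut`. The bin width of 500 is an arbitrary choice, and `after_stat(density)` rescales each curve so that groups of different sizes are comparable:

```r
library(ggplot2)

# Box plot: one box of price per level of cut.
p_cut_box <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()

# Frequency polygons: one curve per cut, each normalized to unit area.
p_cut_poly <- ggplot(diamonds,
                     aes(x = price, y = after_stat(density), colour = cut)) +
  geom_freqpoly(binwidth = 500)
```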

## Two discrete variables
To study covariation between two discrete variables, we can count the number of observations for each combination of values:
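A sketch using `cut` and `color` from `diamonds` (an illustrative pair of discrete variables): `count()` gives the table, and `geom_count()` draws a point whose size reflects each combination's count.

```r
library(ggplot2)
library(dplyr)

# One row per (cut, color) combination, with its count in column n.
cut_color <- count(diamonds, cut, color)

# Point size encodes the number of observations at each combination.
p_count <- ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()
```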

Another type of plot you will see often (especially in bio) is a heat map:
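A minimal heat map sketch, again assuming `cut` and `color` from `diamonds` as the two discrete variables. Each tile's fill encodes the count for that combination:

```r
library(ggplot2)
library(dplyr)

# Count each combination, then map the count to the tile fill colour.
p_heat <- diamonds %>%
  count(cut, color) %>%
  ggplot(aes(x = cut, y = color, fill = n)) +
  geom_tile()
```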

## Two continuous variables
Finally, if we're studying the covariation between two continuous variables, we have several options:
- Scatter plot (`geom_point`)
- Binning (`geom_bin2d`/`geom_hex`)
- Contour/bivariate density (`geom_density_2d`)
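A sketch of the three options for `carat` and `price` in `diamonds` (an illustrative pair). The `alpha` and `bins` values are arbitrary, and `geom_density_2d()` relies on the `MASS` package being installed:

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = carat, y = price))

# Scatter plot: transparency reduces overplotting with ~54k points.
p_point <- base + geom_point(alpha = 0.05)

# 2D binning: fill colour shows how many points fall in each rectangle.
p_bin <- base + geom_bin2d(bins = 50)

# Contours of a bivariate kernel density estimate.
p_contour <- base + geom_density_2d()
```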

## Three or more variables
Sometimes we even want to study the covariation among three or more variables. Visualizing data with more than two dimensions is, in general, challenging. The best solution tends to depend on the problem at hand.

## Speed of light
Let's consider covariation between all three variables in `morley`:
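Two sketches, assuming `Run` goes on the x-axis and `Speed` on the y-axis: the third variable, `Expt`, is shown either as colour or as separate panels.

```r
library(ggplot2)

# Third variable as colour: one line per experiment.
p_colour <- ggplot(morley, aes(x = Run, y = Speed, colour = factor(Expt))) +
  geom_line() +
  geom_point() +
  labs(colour = "Expt")

# Third variable as facets: one panel per experiment.
p_facet <- ggplot(morley, aes(x = Run, y = Speed)) +
  geom_line() +
  facet_wrap(~ Expt)
```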

## Millennials
Let's use these techniques to explore a data set released by the [Pew Research Center](https://www.pewsocialtrends.org/2010/02/24/millennials-confident-connected-open-to-change/) on ... millennials!

![millennials](https://images2.minutemediacdn.com/image/upload/c_crop,h_1189,w_2119,x_0,y_225/f_auto,q_auto,w_1100/v1561494201/shape/mentalfloss/586493-istock-862201574.jpg)

Each column of the data corresponds to one of the questions asked during the survey. You can find the full script [here](https://docs.google.com/file/d/14U2-rS_ljS7kH97PMFqmNMKDwefSL5AS/edit?usp=docslist_api&filetype=msword).

## 🤔 Ages

What would be a good way to visualize `mil$age`?

<ol style="list-style-type: upper-alpha;">
    <li>Histogram</li>
    <li>Scatter plot</li>
    <li>Line plot</li>
    <li>Density plot</li>
    <li>Something else</li>
</ol>

## 🤔 Geography

What would be a good way to visualize `mil$state`?

<ol style="list-style-type: upper-alpha;">
    <li>Histogram</li>
    <li>Bar plot</li>
    <li>Scatter plot</li>
    <li>Line plot</li>
    <li>Something else</li>
</ol>

## Life goals
Columns `q8a-q8h` ask the respondents to rate the importance of:

    a.  Being successful in a high-paying career or profession
    b.  Having a successful marriage
    c.  Living a very religious life
    d.  Being a good parent
    e.  Having lots of free time to relax or do things you want to do
    f.  Becoming famous
    g.  Helping other people who are in need
    h.  Owning your own home
    
The response scale is:

    1 One of the most important things
    2 Very important but not the most
    3 Somewhat important
    4 Not important
    9 Don’t know/Refused (VOL.)
    
What is a good way to summarize these responses?
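One possible approach, sketched on simulated data because the real `mil` data frame is not shown here: `mil_toy` is a hypothetical stand-in with random responses, and only the column names `q8a` to `q8h` and the response codes come from the survey description. The idea is to reshape to one row per respondent and question, then compare the share of each response across questions.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for `mil`: 200 simulated respondents with random answers.
set.seed(1)
q_names <- paste0("q8", letters[1:8])
mil_toy <- as.data.frame(
  matrix(sample(c(1:4, 9), 200 * 8, replace = TRUE),
         ncol = 8, dimnames = list(NULL, q_names))
)

# Long format: one row per (respondent, question) pair.
q8_long <- mil_toy %>%
  pivot_longer(all_of(q_names), names_to = "question", values_to = "response") %>%
  mutate(response = factor(response, levels = c(1:4, 9)))

# Stacked bars scaled to 1, so questions are compared by share of responses.
p_q8 <- ggplot(q8_long, aes(x = question, fill = response)) +
  geom_bar(position = "fill") +
  labs(y = "share of respondents")
```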

## Social networking

    Q.20	How often do you visit the social networking site you use most often… several times a day, about once a day, every few days, once a week or less often?
    
What sort of variable is this (`mil$q20`)? How should we visualize it?
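One reading is that the answers are discrete with a natural order. A sketch under that assumption, on simulated data: `q20_toy` is hypothetical, and the five labels are a guess at the categories based on the question wording, not the survey's actual coding.

```r
library(ggplot2)

# Assumed categories, listed from most to least frequent use.
freq_levels <- c("Several times a day", "About once a day",
                 "Every few days", "Once a week", "Less often")

# Hypothetical stand-in for mil$q20 with random answers.
set.seed(2)
q20_toy <- data.frame(
  q20 = factor(sample(freq_levels, 200, replace = TRUE),
               levels = freq_levels, ordered = TRUE)
)

# Bars appear in the order of the factor levels, not alphabetically.
p_q20 <- ggplot(q20_toy, aes(x = q20)) +
  geom_bar()
```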

## Beyond plotting
Some other EDA tools that are useful:
- Dimensionality reduction (e.g. PCA)
- Missing value analysis

## Flights data
- In the next slide, we will analyze a dataset of information about flights.
- Here we will analyze a pre-formatted version.

## Dimensionality reduction
- For each flight we have a lot of data. 
- How to make sense of it all?
- Idea: embed each data point in a lower-dimensional space.
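A PCA sketch using base R's `prcomp()`. The pre-formatted flights data is not shown here, so this assumes the raw `nycflights13::flights` table and an arbitrary selection of five numeric columns; rows with missing values are dropped first.

```r
library(dplyr)
library(nycflights13)

# Assumed subset of numeric columns; drop rows with any missing value.
flights_num <- flights %>%
  select(dep_delay, arr_delay, air_time, distance, hour) %>%
  na.omit()

# Center and scale so that no column dominates because of its units.
pca <- prcomp(flights_num, center = TRUE, scale. = TRUE)
summary(pca)   # proportion of variance explained by each component

# Two-dimensional embedding: each flight's scores on the first two components.
scores <- as.data.frame(pca$x[, 1:2])
```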

## Missing data analysis
- Missing data may indicate something important. 
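A sketch of counting missing values per column, assuming the raw `nycflights13::flights` table; the breakdown by month is one illustrative way to check whether missingness is spread evenly.

```r
library(dplyr)
library(nycflights13)

# Number of NA values in every column.
na_counts <- flights %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Share of flights with no recorded departure time, by month.
na_by_month <- flights %>%
  group_by(month) %>%
  summarise(prop_missing_dep = mean(is.na(dep_time)))
```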

What (if anything) do these missing values mean?