# Lecture 1.2: Working with Data

## Outline

* How to collect data?
* How to design an experiment?
* How to summarize data?
* Correlation vs. causation

## What is Data?

> **Data** is a set of values of qualitative or quantitative variables; restated, pieces of data are individual pieces of information. Data is measured, collected and reported, and analyzed, whereupon it can be visualized using graphs or images. ([Wikipedia](https://en.wikipedia.org/wiki/Data))

## Statistical Data Discovery in General

* Start with a question/hypothesis
* Collect data
* Analyze
* Check the results
* Repeat? Redesign?
* Communicate the results / visualization

<center><img src="images/sample_inference.png"></center>

## Example: What % of Earth is Covered by Trees?

<img src="images/earth.png" width="350">

* What is the research question?
* How should we collect data to answer this question?
* How to analyze the data?
* Does our estimate make sense?
* What are the limitations in this study? What can we do to improve our estimate?
* How should we present the result?

## Getting Good Data

In statistics, our questions are about a population. The **population** is the complete set of items about which information is desired. In our previous example, the population is the entire surface area of Earth.

In many cases, it is infeasible, too costly, or too time-consuming to collect data from the entire population. Instead, we obtain a **sample**, which is a subset of the population.

Since statistical studies are driven by data, the way we collect data is at least as important as the data analysis/modeling.

* A sample should be representative of the population (junk in = junk out)

<img src="images/population_sample.png" width="500">

* Random sampling is often the best way to achieve this
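
As a rough illustration of why a random sample can stand in for the population, the sketch below estimates a proportion from a simple random sample. The population is simulated, and the 30% share is an arbitrary value picked for the demo, not a real estimate of tree cover.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 surface cells, 1 = tree-covered, 0 = not.
# The 30% share is an arbitrary illustration value, not real data.
population = [1] * 30_000 + [0] * 70_000

# Simple random sample of 1,000 cells, drawn without replacement.
sample = random.sample(population, k=1_000)

true_share = sum(population) / len(population)
estimated_share = sum(sample) / len(sample)

print(f"population share: {true_share:.3f}")
print(f"sample estimate:  {estimated_share:.3f}")
```

With only 1% of the cells inspected, the sample share should land close to the population share; how close depends on the sample size.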

## Sampling Methods

* Simple random sampling (SRS)
    * The simplest and most widespread form of sampling
    * Each subject has an equal chance of being in the sample
* Other common sampling methods:
    * Systematic sampling
    * Stratified sampling
    * Cluster sampling
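
The four methods listed above could be sketched in Python roughly as follows. The population of 100 subject IDs, the two strata, the ten clusters, and all function names are hypothetical choices made for this illustration.

```python
import random

random.seed(1)

# Hypothetical population of 100 subject IDs (assumption for the demo).
population = list(range(100))
strata = {"A": population[:80], "B": population[80:]}          # two strata
clusters = [population[i:i + 10] for i in range(0, 100, 10)]   # ten clusters

def simple_random_sample(items, k):
    """Every subject has the same chance of being selected."""
    return random.sample(items, k)

def systematic_sample(items, k):
    """Pick a random start, then take every step-th subject."""
    step = len(items) // k
    start = random.randrange(step)
    return items[start::step][:k]

def stratified_sample(groups, fraction):
    """Draw a simple random sample of the same fraction from each stratum."""
    picked = []
    for members in groups.values():
        picked.extend(random.sample(members, round(len(members) * fraction)))
    return picked

def cluster_sample(groups, n_clusters):
    """Randomly choose whole clusters and keep every subject in them."""
    chosen = random.sample(groups, n_clusters)
    return [subject for cluster in chosen for subject in cluster]

srs = simple_random_sample(population, 10)
systematic = systematic_sample(population, 10)
stratified = stratified_sample(strata, 0.1)
clustered = cluster_sample(clusters, 2)

print("SRS:       ", sorted(srs))
print("Systematic:", systematic)
print("Stratified:", sorted(stratified))
print("Cluster:   ", sorted(clustered))
```

Note that the stratified sample always contains 8 subjects from stratum A and 2 from stratum B, while a simple random sample only matches those proportions on average.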

## Sampling Bias

A sampling method is biased if the items in the population do not have the same chance of being included in the sample, i.e., the sample is not representative of the population.

### Common Types of Sampling Bias

* **Selection bias**: occurs when some groups in the population are under- or over-represented 
* **Nonresponse bias**: occurs when some of the sampled subjects refuse to participate, or cannot be reached
* **Response bias**: occurs when the subject gives an incorrect response or when the interviewer asks misleading or confusing questions. E.g., the leading question "Do you think we should have stricter gun control laws to reduce gun-related crimes?"
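
A small simulation can make selection bias concrete. In the sketch below, the two groups, their sizes, and their values are all made up for illustration; the point is only that under-representing one group shifts the sample mean away from the population mean.

```python
import random
import statistics

random.seed(2)

# Hypothetical population: two equally sized groups whose measured value
# differs by group (all numbers are invented for the demo).
group_a = [random.gauss(20, 5) for _ in range(5_000)]
group_b = [random.gauss(40, 5) for _ in range(5_000)]
population = group_a + group_b

# Representative: simple random sample from the whole population.
srs = random.sample(population, 500)

# Biased: group B is heavily under-represented (450 from A, 50 from B).
biased = random.sample(group_a, 450) + random.sample(group_b, 50)

pop_mean = statistics.mean(population)
srs_mean = statistics.mean(srs)
biased_mean = statistics.mean(biased)

print(f"population mean:    {pop_mean:.1f}")
print(f"random sample mean: {srs_mean:.1f}")
print(f"biased sample mean: {biased_mean:.1f}")
```

Both samples have the same size, so collecting more data with the same biased procedure would not fix the gap.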

## Data Collection Procedures

* Observational study
    * Cross-sectional study
    * Retrospective (or case-control) study
    * Prospective (or longitudinal) study
* Experiment

### Observational Study

* Records data on individuals without attempting to influence the variables
    * A survey is a type of observational study
* There can be lurking (confounding) variables affecting the results
    * A **lurking (confounding) variable** is a variable that is not included in the study but has an effect on the variables being studied. E.g., drinking bottled water seems to be associated with having healthier babies. Are there any lurking variables?

* **Cannot show causation**
    * We need randomized experiments to show a cause-and-effect relationship

### Types of Observational Study

* Cross-sectional study: data are collected at one point in time
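* Retrospective (or case-control) study: data about past events are collected by looking back in time, e.g., from existing records
* Prospective (or longitudinal) study: subjects are followed forward in time and data are collected as events occur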

### Design of Experiments (DOE)

A simple design case:

* Randomly assign the experimental subjects to two treatment groups, e.g., one group of patients receives a new medical treatment and the other group receives an old treatment
* Record and compare the responses

A randomized controlled experiment can provide evidence for causation, although such experiments can be unethical or too expensive to perform.
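
The random assignment step in the simple design above might look like the following sketch. The subject IDs and group names are placeholders; in a real study they would come from the enrolled participants.

```python
import random

random.seed(3)

# Hypothetical subject IDs (placeholders for enrolled patients).
subjects = [f"patient_{i:02d}" for i in range(20)]

# Shuffle a copy, then split it in half, so that chance alone decides
# which treatment each subject receives.
shuffled = subjects[:]
random.shuffle(shuffled)
half = len(shuffled) // 2
new_treatment = shuffled[:half]
old_treatment = shuffled[half:]

print("new treatment:", sorted(new_treatment))
print("old treatment:", sorted(old_treatment))
```

Because assignment ignores every characteristic of the subjects, known and unknown confounders tend to be balanced across the two groups.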

## Summary Statistics

Summary statistics are used for summarizing a sample. The most commonly used summary statistics describe the following characteristics of the data.

* Measure of location/center
* Measure of spread
* Measure of shape
* Measure of association/dependence (bivariate)

### Location

* **Mean**: the arithmetic average of the data values
$$ \bar{x} = \frac{\sum_{i=1}^n x_i}{n} = \frac{x_1 + x_2 + \ldots + x_n}{n} $$
    where $n$ is the sample size.
    
    * The most common measure of center
    * Can be affected by extreme data values (outliers)

<img src="images/mean.png" width="600">

* **Median**: the middle number when the data values are put in order
    * If n is odd, the median is exactly the middle number
    * If n is even, the median is the average of the two middle numbers
    * Not affected by extreme values (outliers)
    
<img src="images/median.png" width="600">

* **Mode**: the most frequently occurring value
    * There may be no mode or several modes
    * Not affected by extreme values (outliers)
    
<img src="images/mode.png" width="600">

* **Percentile**: the $p^{th}$ percentile is the value such that $p\%$ of the values in the data are less than or equal to it ($0 \leq p \leq 100$)

* **Quartile**: 
    * $1^{st}$ quartile = $25^{th}$ percentile
    * $2^{nd}$ quartile = $50^{th}$ percentile = median
    * $3^{rd}$ quartile = $75^{th}$ percentile
    
<img src="images/quartile.png" width="500">
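
The location measures above can be computed with Python's standard `statistics` module, as in this sketch on a small made-up sample. Quartile conventions differ between textbooks and software; `method="inclusive"` is just one of them and is an assumption here.

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 9]     # made-up sample, n = 7
with_outlier = data + [100]      # same sample plus one extreme value

mean = statistics.mean(data)             # arithmetic average
median = statistics.median(data)         # middle value (n is odd)
modes = statistics.multimode(data)       # list of most frequent values

# The outlier pulls the mean far more than the median.
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)   # n is even: average of 5 and 7

# Quartiles: n=4 asks for the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(f"mean={mean:.3f}, median={median}, modes={modes}")
print(f"with outlier: mean={mean_out:.3f}, median={median_out}")
print(f"Q1={q1}, Q2={q2}, Q3={q3}")
```

Adding the single value 100 moves the mean from about 5.3 to about 17.1, while the median only moves from 5 to 6.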

### Spread

* **Variance**

$$ s^2_x = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 $$

* **Standard deviation**

$$ s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2} $$

* **Range** $= x_{maximum} - x_{minimum}$

* **Inter-quartile range (IQR)** $= Q_3 - Q_1$
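
As a check on the formulas above, the sketch below computes each spread measure by hand on a made-up sample and compares the variance and standard deviation with the `statistics` module. The quartile method used for the IQR is one convention among several.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up sample, n = 8
n = len(data)
x_bar = sum(data) / n

# Sample variance: squared deviations summed and divided by n - 1.
variance = sum((x - x_bar) ** 2 for x in data) / (n - 1)
std_dev = math.sqrt(variance)

data_range = max(data) - min(data)

# IQR = Q3 - Q1, using the "inclusive" quartile convention (an assumption).
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(f"variance={variance:.3f}, std dev={std_dev:.3f}")
print(f"range={data_range}, IQR={iqr}")
```

`statistics.variance` and `statistics.stdev` use the same $n-1$ divisor, so they agree with the hand computation.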

### Shape

* **Skewness**: a measure of the asymmetry of a distribution

<img src="images/skewness.png" width="600">
<img src="images/shape.png" width="600">

* **Kurtosis**: a measure of the "peakedness" of a distribution, or more precisely of how heavy its tails are

<img src="images/kurtosis.png" width="500">
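
The notes do not give formulas for skewness and kurtosis, so the sketch below assumes one common moment-based definition: skewness $= m_3 / m_2^{3/2}$ and kurtosis $= m_4 / m_2^2$, where $m_k$ is the $k$-th central moment with divisor $n$. Other software may apply bias corrections or report excess kurtosis (kurtosis minus 3).

```python
def central_moment(values, k):
    """k-th central moment with divisor n."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Moment-based skewness: m3 / m2^(3/2)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Moment-based kurtosis: m4 / m2^2 (about 3 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]           # made-up symmetric sample
right_skewed = [1, 1, 2, 2, 3, 10]    # made-up sample with a long right tail

print(f"symmetric:    skewness={skewness(symmetric):.3f}, "
      f"kurtosis={kurtosis(symmetric):.3f}")
print(f"right-skewed: skewness={skewness(right_skewed):.3f}, "
      f"kurtosis={kurtosis(right_skewed):.3f}")
```

A symmetric sample has skewness 0, and a long right tail gives positive skewness.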

### Dependence

* **Covariance**: 

$$ s_{xy} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) $$

* **Correlation**

$$ r_{xy} = \frac{s_{xy}}{s_x s_y} $$

* Covariance and correlation summarize the **linear relationship** between two variables

* Covariance only shows the direction of the relationship, because its magnitude depends on the units of $X$ and $Y$
    * $s_{xy} > 0$, large $X$ $\Leftrightarrow$ large $Y$, positive relationship
    * $s_{xy} < 0$, large $X$ $\Leftrightarrow$ small $Y$, negative relationship

* Correlation shows both the direction and the strength of the relationship
    * $r_{xy} > 0$, positive relationship; $r_{xy} < 0$, negative relationship
    * $-1 \leq r_{xy} \leq 1$; the larger $|r_{xy}|$ is, the stronger the linear relationship is
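
The covariance and correlation formulas can be checked with a short hand computation. The data below are made up, and the rescaling step (multiplying $X$ by 100, as in a change of units) is included to show why covariance alone does not measure strength.

```python
import math

def covariance(xs, ys):
    """Sample covariance s_xy with divisor n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Sample correlation r_xy = s_xy / (s_x * s_y)."""
    s_x = math.sqrt(covariance(xs, xs))   # s_xx is the variance of x
    s_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (s_x * s_y)

x = [1, 2, 3, 4, 5]          # made-up data
y = [2, 4, 5, 4, 5]
x_rescaled = [100 * v for v in x]   # same variable in different units

s_xy = covariance(x, y)
r_xy = correlation(x, y)
s_xy_rescaled = covariance(x_rescaled, y)
r_xy_rescaled = correlation(x_rescaled, y)

print(f"original units: s_xy={s_xy:.2f},  r_xy={r_xy:.4f}")
print(f"rescaled x:     s_xy={s_xy_rescaled:.2f}, r_xy={r_xy_rescaled:.4f}")
```

Rescaling $X$ multiplies the covariance by 100 but leaves the correlation unchanged, which is why $r_{xy}$ can be compared across data sets.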

## Correlation and Causation

* **Correlation does not imply causation**
* To make causal conclusions, we need to conduct a randomized experiment so that all the potential confounding effects are averaged out by random group assignment.
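
The simulation below is a minimal sketch of this idea under made-up assumptions: a hidden variable `z` drives both `x` and `y`, and `x` has no effect on `y` at all. In the observational setting the two are still correlated; when `x` is assigned at random, the correlation disappears.

```python
import random
import statistics

random.seed(4)

def correlation(xs, ys):
    """Sample correlation r_xy = s_xy / (s_x * s_y)."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return s_xy / (statistics.stdev(xs) * statistics.stdev(ys))

n = 2_000
z = [random.gauss(0, 1) for _ in range(n)]   # hidden confounder

# Observational setting: z influences both x and y; x does not affect y.
x_observed = [zi + random.gauss(0, 1) for zi in z]
y = [zi + random.gauss(0, 1) for zi in z]

# Randomized setting: x is assigned by chance, so it is unrelated to z.
x_randomized = [random.gauss(0, 1) for _ in range(n)]

r_observed = correlation(x_observed, y)
r_randomized = correlation(x_randomized, y)

print(f"observational r: {r_observed:.2f}")
print(f"randomized r:    {r_randomized:.2f}")
```

The observational correlation here comes entirely from the confounder, which is exactly the situation randomization is designed to rule out.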

## Lab

1. How to issue a pull request (Stephanie)
2. Lab Walkthrough