<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Statistics Fundamentals

_Authors: Alexander Egorenkov (DC), Amy Roberts (NYC), etc._

---

<a id="learning-objectives"></a>
## Learning Objectives
- Compute dot products, matrix multiplications, and vector norms by hand and using NumPy.
- Code summary statistics using NumPy and Pandas: variance, standard deviation, and correlation.
- Create basic data visualizations, including scatterplots, box plots, and histograms.
- Describe characteristics and trends in a data set using visualizations.
- Describe the bias and variance of statistical estimators.
- Identify a normal distribution within a data set using summary statistics and data visualizations.

### Lesson Guide
- [Linear Algebra Review](#review)
    - [Scalars, Vectors, and Matrices](#scalars)
	- [Basic Matrix Algebra](#basic)
	- [Dot Product](#dot)
	- [Matrix Multiplication](#multiplication)
	- [N-Dimensional Space](#n)
	- [Vector Norm](#vector)
- [Linear Algebra Applications to Machine Learning](#linear)
	- [Distance Between Actual Values and Predicted Values](#distance)
	- [Mean Squared Error](#mean)
	- [Least Squares](#least)
- [Independent Exercise: Examining the Titanic Data Set](#exercise)
- [Descriptive Statistics Continued](#descriptive)
	- [Measures of Dispersion: Standard Deviation and Variance](#dispersion)
- [Our First Model](#model)
- [A Short Introduction to Model Bias and Variance](#short)
	- [Bias-Variance Decomposition](#bias)
	- [Example Using Bessel's Correction](#bessels)
- [Correlation and Association](#correlation)
	- [Independent Exercise: Correlation in Pandas](#exercise2)
- [The Normal Distribution](#normal)
	- [What is the Normal Distribution?](#what)
	- [Skewness](#skewness)
	- [Kurtosis](#kurtosis)
- [Independent Exercise: Determining the Distribution of Your Data](#exercise3)
- [Hypothesis Testing Continued](#hypothesis)
    - [Confidence Intervals](#confidence)
    - [Error Types](#types)
- [Lesson Review](#topic-review)

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
from ipywidgets import interact
plt.style.use('fivethirtyeight')

# This makes sure that graphs render in your notebook
%matplotlib inline

<a id="review"></a>
## Linear Algebra Review
---
**Objective:** Compute dot products, matrix multiplications, and vector norms by hand and using NumPy.

<a id="why"></a>
### Why Use Linear Algebra in Data Science?

Linear models are *efficient* and *well understood*. 
- They can often closely approximate nonlinear solutions.
- They scale to high dimensions without difficulty.

Because of these desirable properties, linear algebra is a *need-to-know subject* for machine learning. 
> In fact, it **forms the basis** of foundational models such as linear regression, logistic regression, and principal component analysis (PCA).

Unsurprisingly, advanced models such as *neural networks* and *support vector machines* rely on linear algebra as their "trick" for impressive speedups. 
> Modern-day **GPUs** are essentially linear algebra *supercomputers*. 
> - To utilize a GPU, models must often be carefully formulated in terms of vectors and matrices.

Beyond that, today's advanced models *build upon the simpler foundational models*. 
- Each neuron in a neural net is essentially a logistic regressor! 
- Support vector machines utilize a kernel trick to craftily make problems linear that would not otherwise appear to be.

> Although we do not have time in this course to comprehensively discuss linear algebra, we highly recommend you become fluent!

<a id="scalars"></a>
### Scalars, Vectors, and Matrices

A **scalar** is a single number. Here, symbols that are lowercase single letters refer to scalars. For example, the symbols $a$ and $v$ are scalars that might refer to arbitrary numbers such as $5.328$ or $7$. An example scalar would be:

$$a$$

A **vector** is an ordered sequence of numbers. Here, symbols that are lowercase single letters with an arrow — such as $\vec{u}$ — refer to vectors. An example vector would be:

$$\vec{u} = \left[ \begin{array}{ccc}
1&3&7
\end{array} \right]$$

> Take a moment to appreciate the use of LaTeX in the equation above ([more info](https://nbviewer.jupyter.org/github/jupyter/notebook/blob/master/docs/source/examples/Notebook/Working%20With%20Markdown%20Cells.ipynb#LaTeX-equations)).

In [None]:
# create a vector using np.array
foo = np.array([1,3,7])
print(foo)

An $m$ x $n$ **matrix** is a rectangular array of numbers with $m$ rows and $n$ columns.
- Each number in the matrix is an entry.
- Entries can be denoted $a_{ij}$, where $i$ denotes the row number and $j$ denotes the column number.

Note that, because each entry $a_{ij}$ is a **lowercase** single letter, a matrix is an array of *scalars*:

$$\mathbf{A}= \left[ \begin{array}{cccc}
a_{11} & a_{12} & \cdots & a_{1n}  \\
a_{21} & a_{22} & \cdots & a_{2n}  \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{array} \right]$$

Matrices are referred to using bold **uppercase** letters, such as $\mathbf{A}$. A bold font face is used to distinguish matrices from sets.

In [None]:
# create a matrix using np.array
bar = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(bar)

Note that we built this matrix by passing a list of lists to `np.array`!
- The outermost list is a list of rows.

<a id="basic"></a>
### Basic Matrix Algebra


#### Addition and Subtraction
Vector **addition** is straightforward: two vectors of equal dimensions are added entry by entry. Suppose we have the following vectors (shown here as column vectors for convenience only):

$\vec{v} = \left[ \begin{array}{c}
1 \\
3 \\
7
\end{array} \right],  \vec{w} = \left[ \begin{array}{c}
1 \\
0 \\
1
\end{array} \right]$

In [None]:
v = np.array([1, 3, 7])
w = np.array([1, 0, 1])

$\vec{v} + \vec{w} =
\left[ \begin{array}{c}
1 \\
3 \\
7
\end{array} \right] + \left[ \begin{array}{c}
1 \\
0 \\
1
\end{array} \right] = 
\left[ \begin{array}{c}
1+1 \\
3+0 \\
7+1
\end{array} \right] = 
\left[ \begin{array}{c}
2 \\
3 \\
8
\end{array} \right]
$

(Subtraction is similar.)

In [None]:
# add the vectors together with the '+' operator


#### Scalar Multiplication
We scale a vector with **scalar multiplication**, multiplying a vector by a scalar (single quantity):

$ 2 \cdot \vec{v} = 2\left[ \begin{array}{c}
1 \\
3 \\
7
\end{array} \right] = 
 \left[ \begin{array}{c}
2 \cdot 1 \\
2 \cdot 3 \\
2 \cdot 7
\end{array} \right] = 
 \left[ \begin{array}{c}
2 \\
6 \\
14
\end{array} \right]$ 
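
The element-wise operations above can be sketched in NumPy as follows. The vectors `a` and `b` are hypothetical values chosen for illustration (they are not the lesson's `v` and `w`), so the exercise cells can still be completed independently.

```python
import numpy as np

# Hypothetical example vectors (assumed values, not from the lesson)
a = np.array([2, 4, 6])
b = np.array([1, 1, 2])

total = a + b        # addition is applied entry by entry
difference = a - b   # subtraction works the same way
scaled = 3 * a       # scalar multiplication scales every entry

print(total)       # [3 5 8]
print(difference)  # [1 3 4]
print(scaled)      # [ 6 12 18]
```

Both vectors must have the same number of entries for the addition and subtraction to match the definitions above.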

In [None]:
# multiply vector v by 2


<a id="dot"></a>
### Dot Product
The **dot product** of two _n_-dimensional vectors is:

$ \vec{v} \cdot \vec{w} =\sum _{i=1}^{n}v_{i}w_{i}=v_{1}w_{1}+v_{2}w_{2}+\cdots +v_{n}w_{n} $

So, if:

$\vec{v} = \left[ \begin{array}{c}
1 \\
3 \\
7
\end{array} \right], \vec{w} = \left[ \begin{array}{c}
1 \\
0 \\
1
\end{array} \right]$

$ \vec{v} \cdot \vec{w} = 1 \cdot 1 + 3 \cdot 0 + 7 \cdot 1 = 8 $
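
As a rough check of the definition, the sketch below computes a dot product three ways: as an explicit sum of products, with `np.dot`, and with the `@` operator. The vectors `a` and `b` are hypothetical values used only for illustration.

```python
import numpy as np

# Hypothetical example vectors (assumed values, not from the lesson)
a = np.array([2, 0, 5])
b = np.array([1, 4, 3])

# Sum of entry-wise products, following the formula term by term
by_formula = sum(a_i * b_i for a_i, b_i in zip(a, b))

# The same quantity using NumPy
with_np_dot = np.dot(a, b)
with_operator = a @ b

print(by_formula, with_np_dot, with_operator)  # 17 17 17
```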

In [None]:
v = np.array([1, 3, 7])
w = np.array([1, 0, 1])

# calculate the dot product of v and w using np.dot


<a id="multiplication"></a>
### Matrix Multiplication
**Matrix multiplication**, $\mathbf{A}_{mn}$ x $\mathbf{B}_{ij}$, is valid when the left matrix has the same number of columns as the right matrix has rows ($n = i$). Each entry is the dot product of corresponding row and column vectors.

<img src='./assets/matrix-mult.gif' style="width: 400px">

(Image [source](https://www.mathsisfun.com/))

The dot product illustrated above is: $1 \cdot 7 + 2 \cdot 9 + 3 \cdot 11 = 58$. 
> Can you compute the rest of the dot products by hand?

<img src='./assets/matrix-mult.gif' style="width: 400px">

If the product is the $2$ x $2$ matrix $\mathbf{C}_{mj}$, then:

+ Matrix entry $c_{12}$ (its FIRST row and SECOND column) is the dot product of the FIRST row of $\mathbf{A}$ and the SECOND column of $\mathbf{B}$.

+ Matrix entry $c_{21}$ (its SECOND row and FIRST column) is the dot product of the SECOND row of $\mathbf{A}$ and the FIRST column of $\mathbf{B}$.

***Note*** that if the first matrix is $m$ x $n$ ($m$ rows and $n$ columns) and the second is  $i$ x $j$ (where $n = i$), then the final matrix will be $m$ x $j$. 
- For example, below we have $2$ x $3$ multiplied by $3$ x $2$, which results in a $2$ x $2$ matrix. 
> Can you see why?

In [None]:
A = np.array([[1, 2, 3], [4, 5, 6]])
B = np.array([[7, 8], [9, 10], [11, 12]])

A.dot(B)

Make sure you can compute this by hand!
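
One way to check your hand computation is to build the product entry by entry, taking the dot product of each row of `A` with each column of `B`. This is only an illustrative sketch of the rule described above; in practice you would call `A.dot(B)` or `A @ B` directly.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])        # 2 x 3
B = np.array([[7, 8], [9, 10], [11, 12]])   # 3 x 2

m, n = A.shape
i, j = B.shape
assert n == i, "columns of A must equal rows of B"

# Entry (row, col) of C is the dot product of row `row` of A and column `col` of B
C = np.zeros((m, j), dtype=int)
for row in range(m):
    for col in range(j):
        C[row, col] = np.dot(A[row, :], B[:, col])

print(C)
# [[ 58  64]
#  [139 154]]
print(np.array_equal(C, A @ B))  # True
```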

<a id="n"></a>
### N-Dimensional Space

We often refer to vectors as elements of an $n$-dimensional space. The symbol $\mathbb{R}$ refers to the set of all real numbers (written in uppercase "blackboard bold" font). Because this contains all reals, $3$ and $\pi$ are **contained in** $\mathbb{R}$. We often write this symbolically as $3 \in \mathbb{R}$ and $\pi \in \mathbb{R}$.

To get the set of all *pairs* of real numbers, we would essentially take the product of this set with itself (called the Cartesian product) — $\mathbb{R}$ x $\mathbb{R}$, abbreviated as $\mathbb{R}^2$. This set — $\mathbb{R}^2$ — contains all pairs of real numbers, so $(1, 3)$ is **contained in** this set. We write this symbolically as $(1, 3) \in \mathbb{R}^2$.

+ In 2-D space ($\mathbb{R}^2$), a point is uniquely referred to using two coordinates: $(1, 3) \in \mathbb{R}^2$.
+ In 3-D space ($\mathbb{R}^3$), a point is uniquely referred to using three coordinates: $(8, 2, -3) \in \mathbb{R}^3$.
+ In $n$-dimensional space ($\mathbb{R}^n$), a point is uniquely referred to using $n$ coordinates.

> Note that these coordinates of course are *isomorphic* to our vectors! 
> - Coordinates and vectors are both defined as ordered sequences of numbers. 
> So, especially in machine learning, we often visualize vectors of length $n$ as points in $n$-dimensional space.

<a id="vector"></a>
### Vector Norm

The **magnitude** of a vector, $\vec{v} \in \mathbb{R}^{n}$, can be interpreted as its length in $n$-dimensional space, so we can calculate it as the Euclidean distance from the origin. If:

$\vec{v} = \left[ \begin{array}{c}
v_{1} \\
v_{2} \\
\vdots \\
v_{n}
\end{array} \right]$

then $\| \vec{v} \| = \sqrt{v_{1}^{2} + v_{2}^{2} + \cdots + v_{n}^{2}} = \sqrt{\vec{v}^T\vec{v}}$

E.g. if $\vec{v} = 
\left[ \begin{array}{c}
3 \\
4
\end{array} \right]$, then $\| \vec{v} \| = \sqrt{3^{2} + 4^{2}} = 5$

This is also called the vector **norm**. You will often see this used in machine learning.
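
The identity $\| \vec{v} \| = \sqrt{\vec{v}^T\vec{v}}$ can be checked numerically. The sketch below uses a hypothetical three-dimensional vector `u` (an assumed value, chosen so the norm is a whole number) and compares the formula with `np.linalg.norm`.

```python
import numpy as np

# Hypothetical example vector (assumed value, not from the lesson)
u = np.array([1.0, 2.0, 2.0])

norm_from_formula = np.sqrt(u @ u)    # square root of the dot product of u with itself
norm_from_numpy = np.linalg.norm(u)   # NumPy's built-in Euclidean norm

print(norm_from_formula, norm_from_numpy)  # 3.0 3.0
```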

In [None]:
x = np.array([3,4])

# calculate the norm of the vector x with np.linalg.norm


<a id="linear"></a>
## Linear Algebra Applications to Machine Learning
---

<a id="distance"></a>
### Distance Between Actual Values and Predicted Values
We often need to know the difference between predicted values and actual values. In 2-D space, we compute this as:
$$\| \vec{actual} - \vec{predicted} \| =\sqrt{(actual_1 - predicted_1)^2 + (actual_2 - predicted_2)^2}$$

> Note that this is just the straight-line distance between the actual point and the predicted point.

<a id="mean"></a>
### Mean Squared Error
Often, it's easier to look at the mean of the squared errors, where $\hat{y}(\mathbf{X})$ is a vector of predicted values (a function of the data matrix $\mathbf{X}$) and $\vec{y}$ is the vector of actual values:

$$MSE = \frac{1} {n} \| \hat{y}(\mathbf{X}) - \vec{y} \|^2$$
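
To connect the distance and MSE formulas, the sketch below computes both for a small set of hypothetical actual and predicted values (assumed numbers, for illustration only). It shows that the squared norm of the error vector divided by $n$ equals the mean of the squared errors.

```python
import numpy as np

# Hypothetical actual and predicted values (assumed for illustration)
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.0, 5.0, 8.0, 11.0])

errors = predicted - actual
distance = np.linalg.norm(errors)          # straight-line distance between the two vectors
mse_from_norm = distance ** 2 / len(actual)
mse_direct = np.mean(np.square(errors))    # mean of the squared errors

print(distance)                   # about 2.449
print(mse_from_norm, mse_direct)  # both about 1.5
```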

<a id="least"></a>
### Least Squares
Many machine learning models are based on the following form:

$$\min \| \hat{y}(\mathbf{X}) - \vec{y} \|$$

The goal is to minimize the distance between model predictions and actual data.

Learn more in the scikit-learn [docs](http://scikit-learn.org/stable/modules/linear_model.html).
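
As a minimal sketch of this idea, the code below fits a straight line to a handful of hypothetical points using `np.linalg.lstsq`, which finds the coefficients minimizing $\| \mathbf{X}\vec{\beta} - \vec{y} \|$. The data values and the choice of a linear model $\hat{y} = \beta_0 + \beta_1 x$ are assumptions made for illustration; the lesson itself does not prescribe them.

```python
import numpy as np

# Hypothetical data that roughly follow y = 2x + 1 (assumed values)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Data matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Solve the least squares problem: minimize ||X @ coef - y||
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

y_hat = X @ coef
print(intercept, slope)            # fitted coefficients
print(np.linalg.norm(y_hat - y))   # the minimized distance
```

Any other choice of intercept and slope would give a distance at least as large as the one printed here.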

<a id="exercise"></a>
## Independent Exercise: Examining the Titanic Data Set (~10 mins)

---

**Objective:** Read in the Titanic data and look at a few summary statistics.

In [None]:
titanic = pd.read_csv('./datasets/titanic.csv')

#### 1. Print out the column names of the DataFrame using the `.columns` attribute:

In [None]:
# preview columns


#### 2. Print out the dimensions of the DataFrame using the `.shape` attribute:

In [None]:
# preview data dimensions


#### 3. Print out the data types of the columns using the `.dtypes` attribute:

In [None]:
# what are the column data types?


#### 4. Print out the first five rows of the data using the built-in `.head()` function:

In [None]:
# look at the first 5 rows


#### 5. Use the built-in  `.value_counts()` function to count the values of each type in the `pclass` column:

In [None]:
# can we preview the pclass variable?


#### 6. Pull up descriptive statistics for each variable using the built-in `.describe(include='all')` function:

In [None]:
# pull up descriptive statistics for each variable


In [None]:
# uh oh, we have some missing values, but we won't do anything with them for now

### Diagnosing Data Problems

- Whenever you get a new data set, the fastest way to find mistakes and inconsistencies is to look at the descriptive statistics.
  - If anything looks too high or too low relative to your experience, there may be issues with the data collection.

Your data may contain a lot of missing values and may need to be cleaned meticulously before they can be combined with other data.
  - You can take a quick average or moving average to ***smooth out the data*** and use that to preview your results before you embark on your much longer data-cleaning journey.
  - Sometimes ***filling in missing values*** with their means or medians will be the best solution for dealing with missing data. 
  - Other times, you may want to ***drop*** the offending rows or do real imputation.

<a id="descriptive"></a>
## Descriptive Statistics Continued
---

- **Objective:** Code summary statistics using NumPy and Pandas: variance, standard deviation, and correlation.

<a id="dispersion"></a>
### Measures of Dispersion: Standard Deviation and Variance

Standard deviation (SD, $σ$ for population standard deviation, or $s$ for sample standard deviation) is a measure that is used to quantify the amount of variation or dispersion from the mean of a set of data values. A low standard deviation means that most of the numbers are close to the average. A high standard deviation means that the numbers are spread out.

Standard deviation is the square root of variance:

$$variance = \frac {\sum{(x_i - \bar{X})^2}} {n-1}$$

$$s = \sqrt{\frac {\sum{(x_i - \bar{X})^2}} {n-1}}$$

> **Standard deviation** is often used because it is in the same units as the original data! By glancing at the standard deviation, we can immediately estimate how "typical" a data point might be by how many standard deviations it is from the mean. Furthermore, standard deviation is the only value that makes sense to visually draw alongside the original data.
>
> **Variance** is often used for efficiency in computations. The square root is a monotonically increasing function, so whatever minimizes the variance also minimizes the SD. Removing the square root can therefore simplify calculations (e.g., taking derivatives), particularly if we are using the variance for tasks such as optimization.

That can be a lot to take in, so let's break it down in a Python demo.

#### Assign the first 5 rows of titanic age data to a variable:

In [None]:
# Take the first 5 rows of titanic age data
first_five = titanic.age[:5]

print(first_five)

#### Calculate the mean manually:

In [None]:
# Calculate mean by hand
mean = (22 + 38 + 26 + 35 + 35) / 5.0

#### Calculate the variance manually:

In [None]:
# Calculate variance by hand
(np.square(22 - mean) +
np.square(38 - mean) +
np.square(26 - mean) +
np.square(35 - mean) +
np.square(35 - mean)) / 4.0

#### Calculate the variance and the standard deviation using Pandas:

In [None]:
# Verify with Pandas
print(first_five.var())
print(first_five.std())
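
One detail worth checking when you compare libraries: NumPy's `np.var` and `np.std` divide by $n$ by default (`ddof=0`), while the Pandas `.var()` and `.std()` methods divide by $n-1$ by default (`ddof=1`). The sketch below reuses the five ages from the by-hand calculation to show the difference.

```python
import numpy as np
import pandas as pd

# The five ages used in the by-hand calculation
ages = pd.Series([22, 38, 26, 35, 35])
values = ages.to_numpy()

print(np.var(values))           # divides by n (ddof=0): about 37.36
print(np.var(values, ddof=1))   # divides by n-1: about 46.7
print(ages.var())               # Pandas default is ddof=1: about 46.7
print(ages.std())               # square root of the n-1 variance: about 6.83
```

Passing `ddof` explicitly is a simple way to make sure both libraries compute the same quantity.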

<a id="model"></a>
## Our First Model
---

In this section, we will make a **mathematical model** of data. When we say **model**, we mean it in the same sense that a toy car is a **model** of a real car. If we mainly care about appearance, the toy car model is an excellent model. However, the toy car fails to accurately represent other aspects of the car. For example, we cannot use a toy car to test how the actual car would perform in a collision.

<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/6/61/Norev_4cv.jpg/250px-Norev_4cv.jpg' style="width: 200px">



In data science, we might take a rich, complex person and model that person solely as a two-dimensional vector: _(age, smokes cigarettes)_. For example: $(90, 1)$, $(28, 0)$, and $(52, 1)$, where $1$ indicates "smokes cigarettes." This model of a complex person obviously fails to account for many things. However, if we primarily care about modeling health risk, it might provide valuable insight.

Now that we have superficially modeled a complex person, we might determine a formula that evaluates risk. For example, an older person tends to have worse health, as does a person who smokes. So, we might deem someone as having risk should `age + 50*smokes > 100`.

This is a **mathematical model**, as we use math to assess risk. It could be mostly accurate. However, there are surely elderly people who smoke who are in excellent health.

---

Let's make our first model from scratch. We'll use it to predict the `fare` column in the Titanic data. So what data will we use? Actually, none.

The simplest model we can build is an estimation of the mean, median, or most common value. If we have no feature matrix and only an outcome, this is the best approach to make a prediction using only empirical data. 

This seems silly, but we'll actually use it all the time to create a baseline of how well we do with no data and determine whether or not our more sophisticated models make an improvement.

> You can find out more about dummy estimators [here](http://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators).

#### Get the `fare` column from the Titanic data and store it in variable `y`:

In [None]:
# Get the Fare column from the titanic data and store as y
y = titanic['fare']

#### Create predictions `y_pred` (in this case just the mean of `y`):

In [None]:
# Stored predictions in y_pred
y_pred = y.mean()

#### Find the average squared distance between each prediction and its actual value:

This is known as the mean squared error (MSE).

In [None]:
# Individual squared errors are hard to read, so let's look at the mean squared error
np.mean(np.square(y-y_pred))

#### Calculate the root mean squared error (RMSE), the square root of the MSE:

In [None]:
np.sqrt(np.mean(np.square(y-y_pred)))
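
To see why the mean is a natural baseline for squared error, the sketch below compares a mean prediction with a median prediction on a small hypothetical outcome vector (assumed values, not Titanic fares). Among constant predictions, the mean gives the smallest MSE.

```python
import numpy as np

# Hypothetical outcomes with one large value (assumed for illustration)
y_toy = np.array([10.0, 20.0, 30.0, 40.0, 100.0])

def mse(y_true, prediction):
    """Mean squared error of a constant (or vector) prediction."""
    return np.mean(np.square(y_true - prediction))

mse_mean = mse(y_toy, y_toy.mean())        # baseline: predict the mean (40)
mse_median = mse(y_toy, np.median(y_toy))  # baseline: predict the median (30)
rmse_mean = np.sqrt(mse_mean)

print(mse_mean, mse_median)  # 1000.0 1100.0
print(rmse_mean)             # about 31.62
```

The RMSE is in the same units as the outcome, which makes it easier to interpret than the MSE.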

<a id="short"></a>
## A Short Introduction to Model Bias and Variance 

---

- **Objective:** Describe the bias and variance of statistical estimators.

In simple terms, ***bias*** shows how accurate a model is in its predictions. (It has **low bias** if it hits the bullseye!)

***Variance*** shows how reliable a model is in its performance. (It has **low variance** if the points are predicted consistently!)

> These characteristics have important interactions, but we will save that for later.

<img src='./assets/bias-vs-variance.png' style="width: 600px">

Remember how we just calculated mean squared error to determine the accuracy of our prediction? 
- It turns out we can do this for any statistical estimator, including mean, variance, and machine learning models.

We can even decompose mean squared error to identify the sources of error.

<a id="bias"></a>
### Bias-Variance Decomposition

In the following notation, $f$ refers to a perfect model, while $\hat{f}$ refers to our model.

**Bias**

Error caused by bias is calculated as the difference between the expected prediction of our model and the correct value we are trying to predict:

$$Bias = E[\hat{f}(x)] - f(x)$$

**Variance**

Error caused by variance is taken as the variability of a model prediction for a given point:

$$Variance = E[(\hat{f}(x) - E[\hat{f}(x)])^2]$$

**Mean Squared Error**
$$MSE(\hat{f}(x)) = Var(\hat{f}(x)) + Bias(\hat{f}(x),f(x))^2$$

> When we predict noisy outcomes, the expected squared prediction error actually has three sources: the **variance**, the squared **bias**, and some **irreducible error** that no model can remove given the available features.

This topic will come up again, but for now it's enough to know that we can decompose MSE into the bias of the estimator and the variance of the estimator.
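
The decomposition can be checked with a small simulation. The sketch below uses a deliberately biased, hypothetical estimator (0.8 times the sample mean) of a known true mean; the estimator, the true mean of 10, and the sample size of 5 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 10.0

# 10,000 simulated samples of size 5 from a normal distribution
samples = rng.normal(loc=true_mean, scale=2.0, size=(10_000, 5))

# A deliberately biased estimator: shrink the sample mean toward zero
estimates = 0.8 * samples.mean(axis=1)

bias = estimates.mean() - true_mean          # E[estimate] - true value
variance = estimates.var()                   # E[(estimate - E[estimate])^2]
mse = np.mean((estimates - true_mean) ** 2)  # E[(estimate - true value)^2]

print(bias, variance, mse)
print(np.isclose(mse, variance + bias ** 2))  # True
```

The bias comes out close to $-2$ because the estimator targets $0.8 \cdot 10 = 8$ instead of $10$.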

<a id="bessels"></a>
### Example Using Bessel's Correction

It's rarely practical to measure every single item in a population to gather a statistic. 
> - Instead, we usually sample a few items and use those to infer a population value.

For example, we can take a class of **200 students** and **measure their height**, but rather than measuring everyone, we *select students at random* to estimate the average height in the class and the variance of the height in the class.

We know we can estimate the mean using the sample average:

$$E[X] \approx \bar{X} =\frac 1n\sum_{i=1}^nx_i$$

> What about the variance?

Intuitively and by definition, population variance looks like this (the average squared distance from the mean):

$$\frac {\sum{(x_i - \bar{X})^2}} {n}$$

It's actually better to use the following for a sample (why?):

$$\frac {\sum{(x_i - \bar{X})^2}} {n-1}$$

In some cases, we may even use:

$$\frac {\sum{(x_i - \bar{X})^2}} {n+1}$$

> Detailed explanations can be found here:
> - [Bessel correction](https://en.wikipedia.org/wiki/Bessel%27s_correction)
> - [Mean Squared Error](https://en.wikipedia.org/wiki/Mean_squared_error)
> 
> *Better* explanation can be found here:
> - [Why we divide by n - 1 in variance](https://www.khanacademy.org/math/ap-statistics/summarizing-quantitative-data-ap/more-standard-deviation/v/review-and-intuition-why-we-divide-by-n-1-for-the-unbiased-sample-variance)

Let's show an example of computing the variance by hand.

Suppose we have the following data:

$$X = [1, 2, 3, 4, 4, 10]$$

First, we compute its mean: 

$$\bar{X} = (1/6)(1 + 2 + 3 + 4 + 4 + 10) = 4$$

Because this is a sample of data rather than the full population, we'll use the second formula. Let's first "mean center" the data:

$$X_{centered} = X - \bar{X} = [-3, -2, -1, 0, 0, 6]$$

Now, we'll just find the *average squared distance* of each point from the mean:

$$variance = \frac {\sum{(x_i - \bar{X})^2}} {n-1} = \frac {(-3)^2 + (-2)^2 + (-1)^2 + 0^2 + 0^2 + 6^2}{6-1} = \frac{14 + 36}{5} = 10$$

#### So, the variance of $X$ is $10$. 

However, we can't compare this directly to the original units because it's in the original units squared.
- So, we will use the **standard deviation of $X$**, $\sqrt{10} \approx 3.16$, to see that the data point $10$ lies $6$ units from the mean of $4$, which is nearly two standard deviations away.
- So, we conclude that this point is somewhat far from most of the other points (more on what this really might mean later).
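
The hand calculation above can be reproduced in NumPy. This sketch follows the same steps: compute the mean, center the data, and divide the sum of squared deviations by $n-1$.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 4, 10])

x_bar = X.mean()                 # 4.0
X_centered = X - x_bar           # [-3. -2. -1.  0.  0.  6.]

# Sample variance: sum of squared deviations divided by n - 1
variance = np.sum(X_centered ** 2) / (len(X) - 1)
std = np.sqrt(variance)

print(variance)                  # 10.0
print(std)                       # about 3.16
print(np.var(X, ddof=1))         # same result using NumPy directly

# How many standard deviations is the point 10 from the mean?
print((10 - x_bar) / std)        # about 1.9
```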

---

A variance of zero means there is no spread.
- If we take instead $X = [1, 1, 1, 1]$, then clearly the mean $\bar{X} = 1$. 
- So, $X_{centered} = [0, 0, 0, 0]$, which directly leads to a variance of 0.

> (Make sure you understand why! Remember that variance is the average squared distance of each point from the mean.)

In [None]:
# make some random data
heights = np.random.rand(200) + 6.5

In [None]:
# function for handy interactive plot of sample means
def plot_means(sample_size):
    true_mean = np.mean(heights)

    mean_heights = []
    for n in range(5,sample_size):
        for j in range(30):
            mean_height = np.mean(np.random.choice(heights, n, replace=False))
            mean_heights.append((n, mean_height))
    
    sample_height = pd.DataFrame(mean_heights, columns=['sample_size', 'height'])

    sample_height.plot.scatter(x='sample_size', y='height', figsize=(14, 4), alpha=0.5)
    plt.axhline(y=true_mean, c='r')
    plt.title("The Bias and Variance of the Mean Estimator")
    plt.show()

In [None]:
# function for handy interactive plot of sample variances
def plot_variances(sample_size):
    true_variance = np.var(heights)

    var_heights = []
    for n in range(5,sample_size):
        for j in range(30):
            var_height1 = np.var(np.random.choice(heights, n, replace=False), ddof=0)
            var_height2 = np.var(np.random.choice(heights, n, replace=False), ddof=1)
            var_height3 = np.var(np.random.choice(heights, n, replace=False), ddof=-1)
            var_heights.append((n, var_height1, var_height2, var_height3))
    
    sample_var = pd.DataFrame(var_heights, columns=['sample_size', 'variance1', 'variance2', 'variance3'])
    sample_var.plot.scatter(x='sample_size', y='variance1', figsize=(14, 3), alpha=0.5)
    plt.axhline(y=true_variance, c='r')
    plt.title("The Bias and Variance of the Population Variance Estimator (n)")
    
    sample_var.plot.scatter(x='sample_size', y='variance3', figsize=(14, 3), alpha=0.5)
    plt.axhline(y=true_variance, c='r')
    plt.title("The Bias and Variance of the Biased Sample Variance Estimator (n+1)")
    
    sample_var.plot.scatter(x='sample_size', y='variance2', figsize=(14, 3), alpha=0.5)
    plt.axhline(y=true_variance, c='r')
    plt.title("The Bias and Variance of the Sample Variance Estimator (n-1)")
    plt.show()

In [None]:
interact(plot_means, sample_size=(5,200));

- The red line in the chart above is the true average height, but because we don't want to ask 200 people about their height, we take samples.

- The blue dots show the estimate of the average height after taking a sample. To give us an idea of how sampling works, we simulate taking multiple samples.

- The $X$ axis shows the sample size we take, while the blue dots show the likely average heights we'll conclude for a given sample size.

- Even though the true average height is around 7 feet, a small sample may lead us to think that it's actually 6.7 or 7.3 feet. 

- Notice that the red line is in the center of our estimates. On average, we are correct and have no bias.

- If we take a larger sample size, we get a better estimate. This means that the variance of our estimate gets smaller with larger sample sizes.

In [None]:
interact(plot_variances, sample_size=(5,200));

- Not all estimators are created equal.

- The red lines in the charts above show the true variance of height.

- The top graph uses the population variance estimator (dividing by $n$), the middle graph divides by $n+1$, and the bottom graph uses the sample variance estimator (dividing by $n-1$).

- It's subtle, but notice that the population variance estimator is not centered on the red line. It's actually biased and consistently underestimates the true variance, especially at low sample sizes.

- You may also notice that the scatter of the population variance estimator is smaller. That means the variance of the population variance estimator is smaller. Essentially, it's the variability of the estimator. 

- Play around with the sliders to get a good view of the graphs.
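
If the interactive widgets are not available, a non-interactive sketch can show the same pattern numerically. The code below is an assumption-laden simplification: it draws samples of size 5 from a standard normal distribution (true variance 1) rather than from the `heights` array, and averages each estimator over many repetitions.

```python
import numpy as np

rng = np.random.default_rng(42)

# 20,000 samples of size 5 from a normal distribution with true variance 1
samples = rng.normal(loc=0.0, scale=1.0, size=(20_000, 5))

# Average value of each estimator across all simulated samples
mean_n = samples.var(axis=1, ddof=0).mean()           # divide by n
mean_n_minus_1 = samples.var(axis=1, ddof=1).mean()   # divide by n - 1 (Bessel's correction)
mean_n_plus_1 = samples.var(axis=1, ddof=-1).mean()   # divide by n + 1

print(mean_n)          # noticeably below 1: biased low
print(mean_n_minus_1)  # close to 1: approximately unbiased
print(mean_n_plus_1)   # even further below 1
```

Only the $n-1$ version averages out near the true variance, which is the point of Bessel's correction.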

<a id="correlation"></a>
## Correlation and Association
---

- **Objective:** Describe characteristics and trends in a data set using visualizations.

Correlation measures how variables are related to each other.

Typically, we talk about the Pearson correlation coefficient — a measure of **linear** association.

We refer to perfect correlation as **collinearity**.

The following are a few correlation coefficients. Note that if both variables trend upward, the coefficient is positive. If one trends opposite the other, it is negative.

It is important that you always look at your data visually — the coefficient by itself can be misleading:

<img src='./assets/correlation-examples.png' style="width: 800px">
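
A small numerical sketch shows how the coefficient can mislead. The data below are hypothetical: one variable depends on `x` linearly and another depends on `x` through a perfect but nonlinear (quadratic) relationship.

```python
import numpy as np

# Hypothetical data, symmetric around zero (assumed for illustration)
x = np.arange(-5, 6)
y_linear = 2 * x + 1     # perfect linear relationship
y_quadratic = x ** 2     # perfect relationship, but not linear

# np.corrcoef returns a 2 x 2 matrix; the off-diagonal entry is Pearson's r
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(r_linear)      # 1 (up to floating-point rounding)
print(r_quadratic)   # 0 (up to floating-point rounding)
```

The quadratic variable is completely determined by `x`, yet its Pearson coefficient is zero because the coefficient only measures linear association.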

<a id="exercise2"></a>
## Independent Exercise 2: Correlation in Pandas (~10 mins)

---

**Objective:** Explore options for measuring and visualizing correlation in Pandas.

#### 1. Display the correlation matrix for all Titanic variables
- Generate a correlation matrix from the Titanic data using the `.corr()` method.
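
Depending on your Pandas version, calling `.corr()` on a DataFrame that contains text columns may raise an error (in Pandas 2.0 and later, non-numeric columns are no longer dropped automatically). One way to sidestep this is to select the numeric columns first. The sketch below uses a tiny hypothetical DataFrame named `toy` with made-up values, not the Titanic data.

```python
import pandas as pd

# Hypothetical toy data with one text column (assumed values)
toy = pd.DataFrame({
    "fare": [10.0, 80.0, 15.0, 60.0],
    "survived": [0, 1, 0, 1],
    "name": ["A", "B", "C", "D"],
})

# Keep only numeric columns before computing pairwise Pearson correlations
numeric = toy.select_dtypes(include="number")
corr_matrix = numeric.corr()

print(corr_matrix)
```

The resulting matrix is square, with one row and one column per numeric variable and ones on the diagonal.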

#### 2. Use Seaborn to plot a heat map of the correlation matrix:

> The `sns.heatmap()` function will accomplish this.
> - Generate a correlation matrix from the Titanic data using the `.corr()` method.
> - Pass the correlation matrix into `sns.heatmap()` as its only parameter.

In [None]:
# use Seaborn to plot a correlation heatmap


#### 3. Take a closer look at survived and fare using a scatter plot with the `.plot.scatter()` method.
- You'll need to specify `x='fare', y='survived'`.
- Is correlation a good way to inspect the association of fare and survival?

In [None]:
# take a closer look at survived and fare using a scatter plot


<a id="normal"></a>
## The Normal Distribution
---

- **Objective:** Identify a normal distribution within a data set using summary statistics and data visualizations.

###  Math Review
- What is an event space?
  - A listing of all possible occurrences.
- What is a probability distribution?
  - A function that describes how events occur in an event space.
- What are general properties of probability distributions?
  - The probability of any event is between 0 and 1.
  - The probabilities of all possible events sum to 1; that is, it is certain that something occurs.
  

<a id="what"></a>
### What is the normal distribution?
- A normal distribution is often a key assumption to many models.
  - In practice, if the normal distribution assumption is not met, it's not the end of the world. Your model is just less efficient in most cases.

- The normal distribution depends on the mean and the standard deviation.

- The mean determines the center of the distribution. The standard deviation determines the height and width of the distribution.

- Normal distributions are symmetric, bell-shaped curves.

- When the standard deviation is large, the curve is short and wide.

- When the standard deviation is small, the curve is tall and narrow.

<img src='./assets/normal.png' style="width: 600px">

#### Why do we care about normal distributions?

- They often show up in nature.
- Aggregated processes tend to distribute normally, regardless of their underlying distribution — provided that the processes are uncorrelated or weakly correlated (central limit theorem).
- They offer effective simplification that makes it easy to make approximations.

#### Plot a histogram of 1,000 samples from a random normal distribution:

The `np.random.randn(numsamples)` function will draw from a random normal distribution with a mean of 0 and a standard deviation of 1.

- To plot a histogram, pass a NumPy array with 1000 samples as the only parameter to `plt.hist()`.
- Change the number of bins using the keyword argument `bins`, e.g. `plt.hist(mydata, bins=50)`

In [None]:
# plot a histogram of several random normal samples from NumPy


<a id="skewness"></a>
###  Skewness
- Skewness is a measure of the asymmetry of the distribution of a random variable about its mean.
- Skewness can be positive or negative, or even undefined.
- Notice that the mean, median, and mode are the same when there is no skew.

<img src='./assets/skewness.jpg' style="width: 700px">

#### Plot a lognormal distribution generated with NumPy.

A **lognormal distribution** is what you get when you plug a normal distribution into an exponential function.
- It tends to look like a normal distribution that's been squeezed to the left, such that it drops off quickly on one side and much more slowly on the other.
- This makes it useful for modeling data that may take on very large values.

- Lognormal distributions are extremely useful when analyzing things like stock prices. Normal distributions cannot be used to model stock prices because they have a negative side and stock prices cannot fall below zero.
- Conversely, the normal distribution works better when calculating things like returns. The reason being that the return can contain both positive and negative values, and a lognormal distribution will fail to capture the negative aspects.

Take 1,000 samples using `np.random.lognormal(size=numsamples)` and plot them on a histogram.

In [None]:
# plot a lognormal distribution generated with NumPy
plt.hist(np.random.lognormal(size=1000), bins=50);

#####  Real World Application - When mindfulness beats complexity
- Skewness is surprisingly important.
- Most algorithms implicitly use the mean by default when making approximations.
- If you know your data is heavily skewed, you may have to either transform your data or set your algorithms to work with the median.
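
As a rough illustration of the points above, the sketch below draws a right-skewed lognormal sample, compares its mean and median, and shows how a log transform removes most of the skew. The helper `sample_skewness`, the sample size, and the seed are illustrative assumptions rather than part of the lesson's code. The helper computes skewness as the average cubed z-score.

```python
import numpy as np

def sample_skewness(x):
    """Average cubed z-score: about 0 for symmetric data, > 0 for a long right tail."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

rng = np.random.default_rng(0)          # fixed seed (arbitrary) for repeatability
skewed = rng.lognormal(size=10_000)     # right-skewed sample
logged = np.log(skewed)                 # log transform recovers the underlying normal

print("mean vs. median:", skewed.mean(), np.median(skewed))
print("skewness before log transform:", sample_skewness(skewed))
print("skewness after log transform: ", sample_skewness(logged))
```

The mean of the skewed sample sits well above its median, which is why a mean-based approximation can mislead on data like this.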

<a id="kurtosis"></a>
### Kurtosis
- Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution (it is often described informally as peaked versus flat).
- Data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. 

<img src='./assets/kurtosis.jpg' style="width: 500px">
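
To put a number on this, the sketch below computes excess kurtosis (the average fourth-power z-score minus 3, so that a normal distribution scores about 0) for a normal sample and for a heavier-tailed Laplace sample, whose theoretical excess kurtosis is 3. The helper name `excess_kurtosis`, the choice of the Laplace distribution, and the seed are assumptions made for illustration.

```python
import numpy as np

def excess_kurtosis(x):
    """Average fourth-power z-score minus 3, so a normal distribution scores about 0."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(0)             # fixed seed (arbitrary) for repeatability
normal_sample = rng.normal(size=100_000)
heavy_tailed = rng.laplace(size=100_000)   # Laplace: sharper peak, heavier tails

print("normal: ", excess_kurtosis(normal_sample))
print("Laplace:", excess_kurtosis(heavy_tailed))
```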

####  Real-World Application: Risk Analysis
- Long-tailed distributions with high kurtosis elude intuition; we naturally think the event is too improbable to pay attention to.
- It's often the case that there is a large cost associated with a low-probability event, as is the case with hurricane damage.
- It's unlikely you will get hit by a Category 5 hurricane, but when you do, the damage will be catastrophic.
- Pay attention to what happens at the tails and whether this influences the problem at hand.
- In these cases, understanding the costs may be more important than understanding the risks.

### Testing for Normality
In general, there are two types of normality tests.

#### 1. Statistical Tests:
These are methods that calculate statistics on the data and quantify how likely it is that the data was drawn from a Gaussian distribution.
#### 2. Graphical Methods:
These are methods for plotting the data and qualitatively evaluating whether the data looks Gaussian.

In [None]:
## first, let's generate some data to work with
import numpy as np
from scipy import stats

def normalize_raw_data(raw_data):
    # standardize the sorted data to z-scores, then evaluate the standard normal PDF at each point
    data_sorted = np.sort(raw_data)  # sorted so the PDF plots as a smooth line
    data_mean = np.mean(raw_data)
    data_std = np.std(raw_data)
    data_norm = (data_sorted - data_mean) / data_std
    data_pdf = stats.norm.pdf(data_norm)
    return data_norm, data_pdf

# generate data
dat_raw = np.random.randn(1000)
dat_norm, dat_pdf = normalize_raw_data(dat_raw)

# add some positive skew (called "bias" here) for the next example
bias = np.random.randn(1000)
bias[bias < 0] = 0
dat_bias_raw = np.random.randn(1000) + bias * 5
dat_bias_norm, dat_bias_pdf = normalize_raw_data(dat_bias_raw)

#### Statistical Tests
There are many statistical tests that we can use to quantify whether a sample of data looks as though it was drawn from a Gaussian distribution.

Each test makes different assumptions and considers different aspects of the data. To interpret them, we use the standard hypothesis testing approach that we've seen previously.
>These tests assume that the sample was drawn from a Gaussian distribution (i.e., the null hypothesis, or $H_0$). We choose a threshold beforehand (called alpha or $\alpha$, typically 0.05) that is used to interpret the p-value as follows:
>
>- p <= alpha: reject H0; the sample does not look normal.
>- p > alpha: fail to reject H0; the sample is consistent with a normal distribution.

To state that another way: **larger p-values mean the sample is consistent with having been drawn from a Gaussian distribution.** They do not prove that it was.

In [None]:
# let's test how normal it is using statistics
dat_raw = np.random.randn(1000)

# Shapiro-Wilk test
from scipy.stats import shapiro

def normality_test(data):
    stat, p = shapiro(data)
    print('Statistics=%.3f, p=%.3f' % (stat, p))
    # interpret
    alpha = 0.05
    if p > alpha:
        print('Sample looks Gaussian (fail to reject H0)')
    else:
        print('Sample does not look Gaussian (reject H0)')
        
normality_test(dat_raw)
normality_test(dat_bias_raw)
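
Shapiro-Wilk is only one option. As a sketch of how another test slots into the same p-value logic, the block below uses SciPy's `normaltest` (the D'Agostino-Pearson test, which combines skewness and kurtosis). The sample names and the seed are made up for illustration and are separate from the variables used in the cells above.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)               # fixed seed (arbitrary)
gaussian_sample = rng.normal(size=1000)
skewed_sample = rng.lognormal(size=1000)     # clearly non-Gaussian

alpha = 0.05
results = {}
for name, sample in [("gaussian", gaussian_sample), ("skewed", skewed_sample)]:
    stat, p = normaltest(sample)
    results[name] = p
    verdict = "fail to reject H0" if p > alpha else "reject H0"
    print(f"{name}: statistic={stat:.3f}, p={p:.3g} ({verdict})")
```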

#### Graphical Methods
The informal approach to testing normality is to compare a histogram of the sample data to a normal probability curve. The empirical distribution of the data (the histogram) should be bell-shaped and resemble the normal distribution.

In [None]:
# plot data
f,ax1 = plt.subplots()
ax1.hist(dat_norm, bins=20, density=True) # raw data first
ax1.plot(dat_norm , dat_pdf, lw=3, c='r') # then overlay with normal distribution to compare
ax1.set_xlim([-5,5])
plt.show()

In [None]:
# plot biased data
f,ax1 = plt.subplots()
ax1.hist(dat_bias_norm, bins=20, density=True) # raw data first
ax1.plot(dat_bias_norm, dat_bias_pdf, lw=3, c='r') # then overlay with normal distribution to compare
ax1.set_xlim([-5,5])
plt.show()

# let's talk more about visualizing your data in the next section...
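
A quantile-quantile (Q-Q) comparison is another common graphical check, not used in the cells above. It pairs the sorted sample values with the quantiles expected under a normal distribution, and points that fall close to a straight line suggest normality. The sketch below uses `scipy.stats.probplot` and only reads off the correlation `r` of that line; passing `plot=plt` would draw it. The sample names and the seed are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)             # fixed seed (arbitrary)
gaussian_sample = rng.normal(size=1000)
skewed_sample = rng.lognormal(size=1000)

# probplot returns ((theoretical quantiles, ordered data), (slope, intercept, r))
(_, _), (_, _, r_gauss) = stats.probplot(gaussian_sample, dist="norm")
(_, _), (_, _, r_skew) = stats.probplot(skewed_sample, dist="norm")

print("Q-Q correlation, Gaussian sample:", r_gauss)
print("Q-Q correlation, skewed sample:  ", r_skew)
```

A correlation very close to 1 means the Q-Q points lie nearly on a line; the skewed sample should score noticeably lower.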

<a id="exercise3"></a>
## Independent Exercise 3: Determining the Distribution of Your Data (~10 mins)
---

**Objective:** Create basic data visualizations, including scatterplots, box plots, and histograms.

#### 1. Use the `.hist()` method of your Titanic DataFrame to plot histograms of all the variables in your data.

- The function `plt.hist(data)` calls the Matplotlib library directly.
- However, each DataFrame has its own `hist()` method that by default plots one histogram per column. 
- Given a DataFrame `my_df`, it can be called like this: `my_df.hist()`. 

In [None]:
# plot all variables in titanic using histograms


#### 2. Use the built-in `.plot.box()` function of your Titanic DataFrame to plot box plots of your variables.

- Given a DataFrame, a box plot can be made where each column is one tick on the x axis.
- To do this, it can be called like this: `my_df.plot.box()`.
- Try using the keyword argument `showfliers`, e.g. `showfliers=False`.

In [None]:
# plotting all histograms can be unwieldy; box plots can be more concise


#### 3. Look at the Titanic data variables and ask:
- Are any of them *normal*?
- Are any *skewed*?

In [None]:
# are any of the titanic variables normal or skewed?


<a id="hypothesis"></a>
## Hypothesis Testing Continued
---

**Objective**: Test a hypothesis within a sample case study.

You'll remember that we've worked previously on descriptive statistics such as mean and variance. How would we tell if there is a difference between our groups? How would we know if this difference was real or if our finding is simply the result of chance?

For example, if we are working on sales data, how would we know if there was a difference between the buying patterns of men and women at Acme, Inc.? Hypothesis testing!

> **Note:** In this course, hypothesis testing is primarily used to assess foundational models such as linear and logistic regression.

### Hypothesis Testing Steps

Generally speaking, we start with a **null hypothesis** and an **alternative hypothesis**, which is the opposite of the null. Then, we check whether the data give us enough evidence to reject the null hypothesis; if not, we fail to reject it.

For example:

- Null hypothesis: There is no relationship between gender and sales.
- Alternative hypothesis: There is a relationship between gender and sales.

> ***Note:*** 
> *"Failing to reject"* the null hypothesis is **not the same** as *"accepting"* it. Your alternative hypothesis may indeed be true, but you don't necessarily have enough data to show that yet.
> 
> This distinction is important for helping you avoid overstating your findings. You should only state what your data and analysis can truly represent.

#### How Do We Tell if the Association We Observed is Statistically Significant?

A result is **statistically significant** when it would be unlikely to arise from random chance alone if the null hypothesis were true. Statistical hypothesis testing is traditionally employed to determine whether or not a result is statistically significant.

We might ask: **How likely is it that we would observe an effect at least this large, assuming the null hypothesis is true?** If that probability is less than 5 percent, then we reject the null hypothesis. Note that the 5 percent value is in many ways arbitrary; many statisticians require stricter thresholds (for example, 1 percent).

The probability of seeing observations at least as extreme as ours by chance, given the null hypothesis, is the **p-value** ($p$).

---

**Example:** Suppose you flip a coin three times and get three heads in a row. These three flips are our observations.

+ We want to know whether or not the coin is fair. So, we select the **null hypothesis: The coin is fair.**
+ Now, let's suppose the null hypothesis is true. Three heads in a row occurs with a chance of $1/2^3 = 12.5\%$.
+ Because there is a reasonable ($> 5\%$) chance of three heads occurring naturally, we do not reject the null hypothesis.
+ So, **we conclude** that we do not have enough data to tell whether or not the coin is fair ($p = 0.125$).
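
The arithmetic in this example can be checked directly. The sketch below computes the probability of getting all heads in a given number of flips of a fair coin (a one-sided p-value, as in the example above) and then finds the shortest run of heads that would fall below the 5 percent threshold. The function name `p_all_heads` is a hypothetical helper.

```python
def p_all_heads(n_flips, p_heads=0.5):
    """Probability of n_flips heads in a row if each flip lands heads with p_heads."""
    return p_heads ** n_flips

alpha = 0.05
print(p_all_heads(3))            # three heads in a row under a fair coin: 0.125

# Smallest run of heads whose probability under H0 drops below alpha
n = 1
while p_all_heads(n) >= alpha:
    n += 1
print(n, p_all_heads(n))
```

Under these assumptions, five heads in a row (probability 1/32, about 3.1 percent) is the first run unlikely enough to reject the null hypothesis at the 5 percent level.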

---

In other words, we say that something is statistically significant if there is a *less than **5 percent** chance of seeing a finding at least this extreme by chance alone* (assuming the null hypothesis is true).

<a id="confidence"></a>
### Confidence Intervals

A closely related concept is **confidence intervals**. A 95 percent confidence interval can be interpreted like so: if we repeatedly drew new samples from the population and built an interval from each one, we would expect about 95 percent of those intervals to contain the true value of the parameter we are estimating.

Keep in mind that we only have a **single sample of data** and not the **entire population of the data**. The "true" effect/difference is either within this interval or it is not. We have no firm knowledge, however, of whether our single estimate of the "true" effect/difference is close to the "truth." The confidence interval around our estimate gives, for a given sample size and level of confidence, a range of values for the true effect/difference that is plausible given our data.

Note that using 95 percent confidence intervals is just a convention. You can create 90 percent confidence intervals (which will be more liberal), 99 percent confidence intervals (which will be more conservative), or whatever intervals you prefer.
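
The repeated-sampling idea behind a confidence interval can be simulated. The sketch below draws many samples from a population with a known mean, builds a 95 percent interval for the mean from each sample using the normal approximation (mean plus or minus 1.96 standard errors), and counts how often the interval contains the true mean. The population parameters, sample size, number of repeats, and seed are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed (arbitrary)
true_mean, true_std = 10.0, 2.0     # hypothetical population parameters
n, n_repeats = 50, 2000
z = 1.96                            # normal critical value for 95 percent

covered = 0
for _ in range(n_repeats):
    sample = rng.normal(true_mean, true_std, size=n)
    half_width = z * sample.std(ddof=1) / np.sqrt(n)
    lower, upper = sample.mean() - half_width, sample.mean() + half_width
    covered += int(lower <= true_mean <= upper)

coverage = covered / n_repeats
print("fraction of intervals containing the true mean:", coverage)
```

The printed fraction should land near 0.95. Any single interval either contains the true mean or does not; the 95 percent describes the procedure across many samples.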

---

<a id="types"></a>
### Error Types

Statisticians often classify errors not just as errors but as one of two specific types of errors — type I and type II.

+ **Type I errors** are false positives.
    - Machine learning: Our model falsely predicts "positive." (The prediction is incorrect.)
    - Statistics: Incorrect rejection of a true null hypothesis.

+ **Type II errors** are false negatives.
    - Machine learning: Our model falsely predicts "negative." (The prediction is incorrect.)
    - Statistics: Incorrectly retaining a false null hypothesis.

Understanding these errors can be especially beneficial when designing models. For example, we might decide that type I errors are OK but type II errors are not. We can then optimize our model appropriately.
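
To make the two error types concrete, the sketch below counts them for a small set of made-up labels and predictions, where 1 means "positive" and 0 means "negative." The arrays are hypothetical and not drawn from any data set in this lesson.

```python
import numpy as np

# Hypothetical labels: 1 = "positive", 0 = "negative"
actual    = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])
predicted = np.array([0, 1, 1, 0, 0, 1, 1, 0, 1, 0])

type_1 = int(np.sum((predicted == 1) & (actual == 0)))   # false positives
type_2 = int(np.sum((predicted == 0) & (actual == 1)))   # false negatives

print("Type I errors (false positives): ", type_1)
print("Type II errors (false negatives):", type_2)
```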

> **Example:** Suppose we make a model for airline security in which we predict whether or not a weapon is present ("positive"). In this case, we would much rather have type I errors (falsely predict a weapon) than type II errors (falsely predict no weapon).

> **Example:** Suppose we make a model for the criminal justice system in which we predict whether or not a defendant is guilty ("positive"). In this case, we would much rather have type II errors (falsely predict innocent) than type I errors (falsely predict guilty).

#### Can you phrase these examples in terms of null hypotheses?

<a id="topic-review"></a>
## Lesson Review
---

- We covered several different types of summary statistics. What are they?
- We covered three different types of visualizations. Which ones?
- Describe bias and variance and why they are important.
- What are some important characteristics of distributions?

**Any further questions?**