<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Statistics Fundamentals

_Instructors:_ Alexander Egorenkov (DC), Amy Roberts (NYC), Tim Book (General Assembly DC)

---

<a id="learning-objectives"></a>
## Learning Objectives
- **Linear algebra:** Dot products, matrix multiplications, and vector norms by hand and using NumPy.
- **Summary statistics:** Using NumPy and Pandas: mean, median, mode, max, min, quartile, inter-quartile range, variance, standard deviation, and correlation.
- **Discover trends:** Using basic summary statistics and viz.
- **Bias/variance tradeoff:** Describe the bias and variance of statistical estimators.
- **Identify a normal distribution** within a data set using summary statistics and data visualizations.

### Lesson Guide
- [Linear Algebra Review](#linear-algebra-review)
    - [Scalars, Vectors, and Matrices](#scalars-vectors-and-matrices)
    - [Basic Matrix Algebra](#basic-matrix-algebra)
    - [Dot Product](#dot-product)
    - [Matrix Multiplication](#matrix-multiplication)
    - [Vector Norm](#vector-norm)
- [Linear Algebra Applications to Machine Learning](#linear-algebra-applications-to-machine-learning)
    - [Code-Along: Examining the Cars Data Set](#codealong-examining-the-cars-dataset)
- [Descriptive Statistics Fundamentals](#descriptive-statistics-fundamentals)
	- [Measures of Central Tendency](#measures-of-central-tendency)
	- [Math Review](#math-review)
	- [Measures of Dispersion: Standard Deviation and Variance](#measures-of-dispersion-standard-deviation-and-variance)
- [Our First Model](#our-first-model)
- [A Short Introduction to Model Bias and Variance](#a-short-introduction-to-model-bias-and-variance)
- [Correlation and Association](#correlation-and-association)
- [The Normal Distribution](#the-normal-distribution)
	- [What is the Normal Distribution?](#what-is-the-normal-distribution)
	- [Skewness](#skewness)
	- [Kurtosis](#kurtosis)
- [Determining the Distribution of Your Data](#determining-the-distribution-of-your-data)
- [Lesson Review](#topic-review)

#### Import Libraries for Lesson

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact # Did you catch that this is new? It'll allow us to 'interact' with a visual later.
plt.style.use('fivethirtyeight')

# This makes sure that graphs render in your notebook.
%matplotlib inline

<a id="linear-algebra-review"></a>
## Linear Algebra Review
---
**Objective:** Compute dot products, matrix multiplications, and vector norms by hand and using NumPy.

<a id="why-linear-algebra"></a>
### Why Use Linear Algebra in Data Science?

- Linear models are efficient and well understood. They can often closely approximate nonlinear solutions, and they scale to high dimensions without difficulty.


- **Linear models are all based on linear algebra**, so we should know that too.


- Furthermore, even the complicated models rely on the basic models, which in turn rely heavily on linear algebra.


- Although we do not have time in this course to comprehensively discuss linear algebra, you may want to take time to understand it better.

<a id="scalars-vectors-and-matrices"></a>
### Scalars, Vectors, and Matrices

<img src="assets/images/scalars-vectors-matrices.png">


A **scalar** is a single number. 
- Symbols that are lowercase single letters refer to scalars. For example, the symbols $a$ and $v$ are scalars that might refer to arbitrary numbers such as $5.328$ or $7$. 

- An example scalar would be: $a$
<br>
<br>

A **vector** is an ordered sequence of numbers, **like a list**. 
- Unlike a Python list, a vector can only contain numbers. It can be written as a row or a column.
- It is usually convenient to treat a vector as either a $1 \times n$ "row" vector or an $n \times 1$ "column" vector, whichever suits the calculation.
- Here, symbols that are lowercase single letters with an arrow, such as $\vec{u}$, refer to vectors. An example vector would be:

$$\vec{u} = \left[ \begin{array}{ccc}
1&3&7
\end{array} \right]$$

In [2]:
# Create a vector using np.array.
u = np.array([1, 3, 7])
print(u)
#print(np.sum(u))
#print(u[2])

[1 3 7]


An $m \times n$ **matrix** is a rectangular array of numbers with $m$ rows and $n$ columns. Each number in the matrix is an entry. Entries can be denoted $a_{ij}$, where $i$ denotes the row number and $j$ denotes the column number. Note that, because each entry $a_{ij}$ is a lowercase single letter, a matrix is an array of scalars:

$$\mathbf{A}= \left[ \begin{array}{cccc}
a_{11} & a_{12} & \cdots & a_{1n}  \\
a_{21} & a_{22} & \cdots & a_{2n}  \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{array} \right]$$

Matrices are referred to using bold uppercase letters, such as $\mathbf{A}$. The bold face is often (though not always) used to distinguish matrices from sets.

In [3]:
# Create a matrix using np.array.
m = np.array([[1, 3, 7], [4, 6, 3], [2, 5, 6]])
m

array([[1, 3, 7],
       [4, 6, 3],
       [2, 5, 6]])

Note that in Python, a matrix is just a list of lists converted to a NumPy array (equivalently, a stack of vectors). **In fact, a vector is also a matrix** with a single row or a single column.

#### Arrays are More Efficient than Pandas Series

In [4]:
s = pd.Series(u)
print(s) 

0    1
1    3
2    7
dtype: int64


In [5]:
#%timeit -n 1000000 u[1]

In [6]:
#%timeit -n 1000000 s[1]

<a id="basic-matrix-algebra"></a>
### Basic Matrix Algebra


#### Addition and Subtraction
Vector **addition** is straightforward: if two vectors have equal dimensions, add their corresponding entries. (The vectors are shown here as column vectors for convenience only.)

$\vec{v} = \left[ \begin{array}{c}
1 \\
3 \\
7
\end{array} \right],  \vec{w} = \left[ \begin{array}{c}
1 \\
0 \\
1
\end{array} \right]$

In [7]:
v = np.array([1, 3, 7])
w = np.array([1, 0, 1])

$\vec{v} + \vec{w} =
\left[ \begin{array}{c}
1 \\
3 \\
7
\end{array} \right] + \left[ \begin{array}{c}
1 \\
0 \\
1
\end{array} \right] = 
\left[ \begin{array}{c}
1+1 \\
3+0 \\
7+1
\end{array} \right] = 
\left[ \begin{array}{c}
2 \\
3 \\
8
\end{array} \right]
$

(Subtraction is similar.)

In [8]:
# Add the vectors together with +.
v+w

array([2, 3, 8])

In [9]:
# Using numpy
np.sum([v,w], axis=0)

array([2, 3, 8])

**Classroom Question**: What happens when **axis=1**?

**Classroom Exercise**:
Subtract the vectors.  Write it out by hand and then allow Python to do the work.

In [10]:
# Subtract the vectors using the similar methods that were used for addition


#### Scalar Multiplication
We scale a vector with **scalar multiplication**, multiplying a vector by a scalar (single quantity):

$ 2 \cdot \vec{v} = 2\left[ \begin{array}{c}
1 \\
3 \\
7
\end{array} \right] = 
 \left[ \begin{array}{c}
2 \cdot 1 \\
2 \cdot 3 \\
2 \cdot 7
\end{array} \right] = 
 \left[ \begin{array}{c}
2 \\
6 \\
14
\end{array} \right]$ 

In [11]:
# Multiply v by 2.

In [12]:
# Multiply w and v

<a id="dot-product"></a>
### Dot Product
The **dot product** of two _n_-dimensional vectors is:

$ \vec{v} \cdot \vec{w} =\sum _{i=1}^{n}v_{i}w_{i}=v_{1}w_{1}+v_{2}w_{2}+\cdots +v_{n}w_{n} $

So, if:

$\vec{v} = \left[ \begin{array}{c}
1 \\
3 \\
7
\end{array} \right], \vec{w} = \left[ \begin{array}{c}
1 \\
0 \\
1
\end{array} \right]$

$ \vec{v} \cdot \vec{w} = 1 \cdot 1 + 3 \cdot 0 + 7 \cdot 1 = 8 $

_Tim Note:_ When considering vectors as "column vectors", you will often see a dot product written as $\mathbf{v}^T\mathbf{w}$. In more pure-math based literature, you might even see $\langle v, w \rangle$.
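As a quick check of the formula, the sketch below computes the same dot product three ways: with an explicit sum, with `np.dot`, and with the `@` operator. The variable names `manual`, `with_dot`, and `with_at` are illustrative only.

```python
import numpy as np

v = np.array([1, 3, 7])
w = np.array([1, 0, 1])

# Explicit sum of element-wise products, mirroring the formula.
manual = sum(v_i * w_i for v_i, w_i in zip(v, w))

# NumPy equivalents.
with_dot = np.dot(v, w)
with_at = v @ w

print(manual, with_dot, with_at)  # 8 8 8
```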

In [13]:
# Calculate the dot product of v and w using np.dot.

<a id="matrix-multiplication"></a>
### Matrix Multiplication
**Matrix multiplication**, $\mathbf{AB}$, is valid when the left matrix has the same number of columns as the right matrix has rows. Each entry is the dot product of corresponding row and column vectors.

![](assets/images/matrix-multiply-a.gif)
(Image: mathisfun.com)


![](assets/images/matrix-multiplication-song.png)

The dot product illustrated above is: $1 \cdot 7 + 2 \cdot 9 + 3 \cdot 11 = 58$. **Can you compute the rest of the dot products by hand?**

If the product is the $2$ x $2$ matrix $\mathbf{C}$, then:

+ Matrix entry $c_{12}$ (its FIRST row and SECOND column) is the dot product of the FIRST row of $\mathbf{A}$ and the SECOND column of $\mathbf{B}$.

+ Matrix entry $c_{21}$ (its SECOND row and FIRST column) is the dot product of the SECOND row of $\mathbf{A}$ and the FIRST column of $\mathbf{B}$.
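To make the row-by-column rule concrete, here is a small sketch with two made-up matrices (`P` and `Q` are hypothetical and are not the matrices from the lesson images). It checks that an entry of the product equals the dot product of the matching row and column.

```python
import numpy as np

# Hypothetical matrices chosen only for illustration.
P = np.array([[1, 0, 2],
              [3, 1, 0]])      # 2 x 3
Q = np.array([[1, 2],
              [0, 1],
              [4, 0]])         # 3 x 2

R = P @ Q                      # 2 x 2 product

# Entry in row 1, column 2 (index [0, 1]): row 1 of P dotted with column 2 of Q.
r_12 = np.dot(P[0, :], Q[:, 1])
# Entry in row 2, column 1 (index [1, 0]): row 2 of P dotted with column 1 of Q.
r_21 = np.dot(P[1, :], Q[:, 0])

print(R)
print(r_12 == R[0, 1], r_21 == R[1, 0])  # True True
```

Note that NumPy indexes from 0, so the mathematical entry $c_{12}$ lives at index `[0, 1]`.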

**Let's compute the example above: a $2 \times 3$ matrix multiplied by a $3 \times 2$ matrix, which results in a $2 \times 2$ matrix. Can you see why?**

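In [14]:
# Multiply the two matrices above.
A = np.array([[1, 2, 3], [4, 5, 6]])

# The first column (7, 9, 11) comes from the worked dot product above;
# the second column (8, 10, 12) is assumed from the image.
B = np.array([[7, 8], [9, 10], [11, 12]])

# Each entry of C is the dot product of a row of A and a column of B.
C = np.dot(A, B)
C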

In [None]:
# Subset C to show the value in the first row and second column

**Classroom Exercise:** Calculate the dot product of the matrix m and vector w by hand and then using numpy.

<a id="vector-norm"></a>
### Vector Norm

The **magnitude** of a vector, $\vec{v} \in \mathbb{R}^{n}$, can be interpreted as its length in $n$-dimensional space. Therefore it is calculable via the Euclidean distance from the origin:

$\vec{v} = \left[ \begin{array}{c}
v_{1} \\
v_{2} \\
\vdots \\
v_{n}
\end{array} \right]$

then $\| \vec{v} \| = \sqrt{v_{1}^{2} + v_{2}^{2} + \cdots + v_{n}^{2}} = \sqrt{\vec{v}^T\vec{v}}$

E.g. if $\vec{v} = 
\left[ \begin{array}{c}
3 \\
4
\end{array} \right]$, then $\| \vec{v} \| = \sqrt{3^{2} + 4^{2}} = 5$

This is also called the vector **norm**. You will often see this used in machine learning to calculate distances.
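The sketch below checks the worked example numerically, computing the norm both from the definition $\sqrt{\vec{v}^T\vec{v}}$ and with `np.linalg.norm`. The name `x` is used here only to avoid overwriting the lesson's `v`.

```python
import numpy as np

x = np.array([3, 4])

# Norm from the definition: square root of the dot product of x with itself.
from_definition = np.sqrt(x @ x)

# NumPy's built-in Euclidean (L2) norm for a 1-D array.
from_numpy = np.linalg.norm(x)

print(from_definition, from_numpy)  # 5.0 5.0
```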

In [16]:
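# Calculate a norm with np.linalg.norm.
# Note: passing [v, w] stacks the two vectors into a 2 x 3 matrix, so this returns
# the Frobenius norm sqrt(61), not the norm of one vector. Use np.linalg.norm(v) for v alone.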
np.linalg.norm([v,w])

7.810249675906654

<a id="linear-algebra-applications-to-machine-learning"></a>
## Linear Algebra Applications to Machine Learning
---

Linear algebra will give you better intuition for machine learning algorithms and help you see beyond the "black box." Models have parameters and hyperparameters that you can tune, and understanding the inner workings can help you refine your models.

You can also code algorithms from scratch, if you choose to become more advanced.

<a id="distance-between-actual-values-and-predicted-values"></a>
### Distance Between Actual Values and Predicted Values
We often need to know the difference between predicted values and actual values. 

![](assets/images/vector-norms.png)


#### L² Norm
Most commonly, we use the **L²** norm, which is the square root of the sum of the squared differences. For $n$-dimensional vectors, we compute this as:
$$ L^2 \text{ norm} = \|\vec{actual} - \vec{predicted} \|_2 = \sqrt{(actual_1 - predicted_1)^2 + (actual_2 - predicted_2)^2 + \cdots + (actual_n - predicted_n)^2}$$

Note that this is just the **straight-line distance** or **as-the-crow-flies distance** between the actual point and the predicted point.


#### L¹ Norm
A less commonly used alternative is the **L¹** norm, also known as the **taxicab distance** because it describes the number of city blocks you would travel along a street grid to reach the destination.

$$ L^1 \text{ norm} = \|\vec{actual} - \vec{predicted} \|_1 = |actual_1 - predicted_1| + |actual_2 - predicted_2| + \cdots + |actual_n - predicted_n| $$
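A small sketch of both norms, using made-up two-element vectors (`actual` and `predicted` here are hypothetical values, not lesson data). `np.linalg.norm` returns the L² norm by default for a 1-D array and the L¹ norm when `ord=1`.

```python
import numpy as np

actual = np.array([3.0, 5.0])
predicted = np.array([0.0, 1.0])
diff = actual - predicted              # [3., 4.]

l2 = np.linalg.norm(diff)              # straight-line distance
l1 = np.linalg.norm(diff, ord=1)       # taxicab distance: sum of absolute differences

print(l2, l1)  # 5.0 7.0
```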

<br>
<br>
<br>


### Mean Absolute Error
MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. All individual differences have equal weight.

$$MAE = \frac{1} {n} \| \hat{y}(\mathbf{X}) - \vec{y} \|_1$$


<a id="mean-squared-error"></a>
### Mean Squared Error
Another method for measuring distance, or error between predicted and actual, is the mean of the squared errors.  **This is often used to measure the quality of regression models.** Where $\hat{y}(\mathbf{X})$ is a vector of predicted values (a function of the data matrix $\mathbf{X}$) and $\vec{y}$ is the actual values:

$$MSE = \frac{1} {n} \| \hat{y}(\mathbf{X}) - \vec{y} \|^2$$



### Root Mean Squared Error
Another similar method for measuring distance, or error between predicted and actual, is the square root of the mean of the squared errors.  **This is another common method to measure the quality of regression models.** Where $\hat{y}(\mathbf{X})$ is a vector of predicted values (a function of the data matrix $\mathbf{X}$) and $\vec{y}$ is the actual values:

$$RMSE = \sqrt{\frac{1} {n} \| \hat{y}(\mathbf{X}) - \vec{y} \|^2}$$


Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This makes the RMSE more useful than MAE when large errors are particularly undesirable. Unlike MSE, it is also expressed in the same units as the values being predicted.
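The three metrics can be sketched directly from the norm formulas. The arrays below are made up for illustration, with one deliberately large error to show how squaring changes the picture.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([11.0, 11.0, 15.0, 26.0])   # the last prediction is off by 10
errors = y_hat - y_true
n = len(errors)

mae = np.linalg.norm(errors, ord=1) / n       # mean absolute error
mse = np.linalg.norm(errors) ** 2 / n         # mean squared error
rmse = np.sqrt(mse)                           # root mean squared error

print(mae, mse, rmse)
```

Here the MAE is 3.25 while the RMSE is a little over 5: the single large error dominates the squared metrics.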

<a id="least-squares"></a>
### Least Squares
Regression models use least squares to optimize the fit of the model, and are based on the following form:

$$\min \| \hat{y}(\mathbf{X}) - \vec{y} \|^2$$

The goal is to minimize the distance between model predictions and actual data.
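As a minimal sketch of this minimization, NumPy's `np.linalg.lstsq` solves $\min \| \mathbf{X}b - \vec{y} \|^2$ for a coefficient vector $b$. The data here are hypothetical and lie exactly on the line $y = 1 + 2x$, so the recovered coefficients should be close to 1 and 2.

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Data matrix X: a column of ones (intercept) and a column of x values.
X = np.column_stack([np.ones_like(x), x])

# Solve min ||X b - y||^2 for the coefficient vector b.
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ coef
print(coef)                              # approximately [1. 2.]
print(np.linalg.norm(y_hat - y) ** 2)    # approximately 0
```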

Let's see this in [scikit-learn](http://scikit-learn.org/stable/modules/linear_model.html).


<a id="codealong-examining-the-cars-dataset"></a>
### Code-Along: Examining the Cars dataset
---

Read in the Motor Trend Cars data. 

**Data Source**: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html

**Description**
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

**Format**
A data frame with 32 observations on 11 (numeric) variables.

- **mpg**	Miles/(US) gallon
- **cyl**	Number of cylinders
- **disp**	Displacement (cu.in.)
- **hp**	Gross horsepower
- **drat**	Rear axle ratio
- **wt**	Weight (1000 lbs)
- **qsec**	1/4 mile time
- **vs**	Engine (0 = V-shaped, 1 = straight)
- **am**	Transmission (0 = automatic, 1 = manual)
- **gear**	Number of forward gears
- **carb**	Number of carburetors

In [None]:
mtcars = pd.read_csv('data/mtcars.csv')

Imagine we were trying to predict the mpg of a car.  Let's create 2 random columns, `predicted_mpg` and `predicted_mpg_2`, that we will assume were predicted with 2 different machine learning models.

In [None]:
np.random.seed(0)
mtcars['predicted_mpg'] = np.random.randint(15, 30, mtcars.shape[0])
mtcars['predicted_mpg_2'] = np.random.randint(18, 26, mtcars.shape[0])

#### Print out the dimensions of the DataFrame using the `.shape` attribute:

In [None]:
# Preview data dimensions.
mtcars.shape

#### Print out the data types of the columns using the `.dtypes` attribute:

In [None]:
# What are the column data types?
mtcars.dtypes

#### Pull up descriptive statistics for each variable using the built-in `.describe()` function:

In [None]:
# Pull up descriptive statistics for mpg, predicted_mpg and predicted_mpg_2
mtcars[['mpg','predicted_mpg','predicted_mpg_2']].describe()

Calculate the Euclidean distance between each predicted column and the actual column:

In [None]:
#L2 Norm aka Euclidean Distance aka Straight-Line Distance
print('Model 1 L2 Norm:', np.linalg.norm(mtcars.mpg-mtcars.predicted_mpg))

print('Model 2 L2 Norm:', np.linalg.norm(mtcars.mpg-mtcars.predicted_mpg_2))

### Exercises:

#### 1. Going the Distance

Calculate the L1 Norm for each prediction.  Look at the help for np.linalg.norm, and specifically the **ord** parameter. (hint:L**1**)


In [None]:
print('Model 1 L1 Norm:', np.linalg.norm(mtcars.mpg-mtcars.predicted_mpg, ord=1))

print('Model 2 L1 Norm:', np.linalg.norm(mtcars.mpg-mtcars.predicted_mpg_2, ord=1))

Calculate the MAE using numpy.  (hint: nest np.abs inside np.mean)

In [None]:
print('Model 1 MAE:',np.mean(np.abs(mtcars.mpg-mtcars.predicted_mpg)))

print('Model 2 MAE:',np.mean(np.abs(mtcars.mpg-mtcars.predicted_mpg_2)))

Calculate the MSE using numpy.  (hint: use np.mean and np.square)

In [None]:
print('Model 1 MSE:',np.mean(np.square(mtcars.mpg-mtcars.predicted_mpg)))

print('Model 2 MSE:',np.mean(np.square(mtcars.mpg-mtcars.predicted_mpg_2)))

Calculate the RMSE using numpy.  (hint: use the MSE calculation and take the square root)

In [None]:
print('Model 1 RMSE:',np.sqrt(np.mean(np.square(mtcars.mpg-mtcars.predicted_mpg))))

print('Model 2 RMSE:',np.sqrt(np.mean(np.square(mtcars.mpg-mtcars.predicted_mpg_2))))

Based on these metrics, which of these 2 simple models is better at explaining the behavior?

<a id="descriptive-statistics-fundamentals"></a>
## Descriptive Statistics Fundamentals
---

- **Objective:** Code summary statistics using NumPy and Pandas: mean, median, mode, max, min, quartile, inter-quartile range, variance, standard deviation, and correlation.

### Statistics

Statistics is essentially the study of distributions. We leverage distributions to tie the frequency of a value to the actual value observed. Our goal is to understand how to pull meaning out of the distributions of various datasets. First, the formal definition of statistics:


>**Statistics** is a branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data.

That said, there is a lot of nuance within statistics. For this class you won't need to understand statistics intimately, but it will come up more and more often as you progress through your data science career. The usual *litmus test* is that a data scientist is better at statistics than a programmer, and an in-depth review will take you much further.

Statistical References:
* A great start: [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf)
* [Bayesian Data Analysis, by Andrew Gelman](http://www.stat.columbia.edu/~gelman/book/)
* [Machine Learning: a Probabilistic Perspective](https://www.cs.ubc.ca/~murphyk/MLbook/)
* [Pattern Recognition and Machine Learning](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf)


### A Quick Review of Notation

The sum of a constant, $k$, $n$ times:
$$\sum_{i=1}^nk$$

In [None]:
# k + k + k + k + ... + k
# For i from 1 up to and including n, add k to the sum.

> It is often helpful to think of these sums as `for` loops. For example, the equation can be compactly computed like so:

```
total = 0

# For i from 1 up to and including n, add k to the sum.
for i in range(1, n+1):
    total += k
```

> Or, even more succinctly (using a generator comprehension):

```
total = sum(k for i in range(1, n+1))
```

In [None]:
k=5
n=10
total=0
for i in range(1, n+1):
    total += k
    print(total)

The sum of all numbers from 1 up to and including $n$:
$$\sum_{i=1}^ni$$

In [None]:
# 1 + 2 + 3 + ... + n

```
total = sum(i for i in range(1, n+1))
```

In [None]:
n=10
total=0
total = sum(i for i in range(1, n+1))
print(total)

The sum of all $x$ from the first $x$ entry to the $n$th $x$ entry:
$$\sum_{i=1}^nx_i$$

In [None]:
# x_1 + x_2 + x_3 + ... + x_n

```
total = sum(xi for xi in x)      # or just sum(x)
```

#### Code-Along

_Optional: Write down the mathematical notation for the following questions:_

In [None]:
# Compute the sum of seven 4s using base Python.
sum([4, 4, 4, 4, 4, 4, 4])

$$\sum_{i=1}^{7}{4}$$

In [None]:
# Compute the sum of seven 4s using NumPy np.sum and np.multiply.
print(np.sum([4, 4, 4, 4, 4, 4, 4]))
print(np.multiply(4,7))

$$\sum_{i=1}^{7}{4}$$

In [None]:
# Compute the sum of 1 through 10 using base Python.
sum(i+1 for i in range(10))

$$\sum_{i=1}^{10}{i}$$

In [None]:
# Using the titanic.fare column, compute the total fare paid by passengers.
titanic = pd.read_csv('data/titanic.csv')
print(np.sum(titanic.fare))

<a id="measures-of-central-tendency"></a>
### Measures of Central Tendency

- Mean
- Median
- Mode

#### Mean
The mean — also known as the average or expected value — is defined as:
$$E[X] = \bar{X} =\frac 1n\sum_{i=1}^nx_i$$

It is determined by summing all data points in a population and then dividing the total by the number of points. The resulting number is known as the mean or the average.

* The mean can be highly affected by outliers.

#### Median
The median is the midpoint of a sorted series of numbers. Notice that the median is barely affected by outliers, so it better represents the "typical" value in a set.

$$ 0,1,2,[3],5,5,1004 $$

$$ 1,3,4,[4,5],5,5,7 $$

**To find the median:**

- Arrange the numbers in order from smallest to largest.
    - If there is an odd number of values, the middle value is the median.
    - If there is an even number of values, the average of the middle two values is the median.
<br>
<br>
- Although the median has many useful properties, the mean is easier to use in optimization algorithms. 
- The median is more often used in analysis than in machine learning algorithms.

#### Mode
The mode of a set of values is the value that occurs most often.
A set of values may have more than one mode, or no mode at all.

$$1,0,1,5,7,8,9,3,4,1$$ 

$1$ is the mode, as it occurs the most often (three times).
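The three example sequences above can be checked with a short sketch. It uses NumPy for the mean and median and `collections.Counter` for the mode; the variable names are illustrative only.

```python
import numpy as np
from collections import Counter

with_outlier = [0, 1, 2, 3, 5, 5, 1004]
even_count = [1, 3, 4, 4, 5, 5, 5, 7]
mode_example = [1, 0, 1, 5, 7, 8, 9, 3, 4, 1]

print(np.mean(with_outlier))    # about 145.7, dragged up by 1004
print(np.median(with_outlier))  # 3.0
print(np.median(even_count))    # 4.5, the average of the middle values 4 and 5

# Count occurrences and keep every value tied for the highest count.
counts = Counter(mode_example)
top = max(counts.values())
modes = [value for value, count in counts.items() if count == top]
print(modes, top)               # [1] 3
```

Collecting every value tied for the top count is what allows a data set to report more than one mode.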

### Exercises
#### 2. Sinking into Central Tendency

In [None]:
# Find the mean of the titanic.fare series using base Python:
sum(titanic['fare'])/float(len(titanic['fare']))

In [None]:
# Find the mean of the titanic.fare series using NumPy:
np.mean(titanic.fare)

In [None]:
# Find the mean of the titanic.fare series using Pandas:
titanic.fare.mean()

In [None]:
# What was the median fare paid (using Pandas)?
titanic.fare.median()

In [None]:
# What was the median fare paid (using numpy)?
np.median(titanic.fare)

In [None]:
# The mean and median are not the same, does this tell you anything about the fares?

#A: Typically this indicates that the distribution is skewed rather than symmetric, so it is not normal.

In [None]:
# Use Pandas to find the most common fare paid on the Titanic:
titanic.fare.mode()
# Notice that this returns a Series instead of a single number. Why?
#A: A data set can have more than one mode (a tie), so .mode() always returns a Series.

<a id="measures-of-dispersion-standard-deviation-and-variance"></a>
### Measures of Dispersion: Standard Deviation and Variance

Standard deviation (SD, $σ$ for population standard deviation, or $s$ for sample standard deviation) is a measure that is used to quantify the amount of variation or dispersion from the mean of a set of data values. A low standard deviation means that most of the numbers are close to the average. A high standard deviation means that the numbers are spread out.

Standard deviation is the square root of variance:

$$\text{variance} = s^2 = \frac {\sum{(x_i - \bar{x})^2}} {n-1}$$

$$s = \sqrt{\frac {\sum{(x_i - \bar{x})^2}} {n-1}}$$

> **Standard deviation** is often used because it is in the same units as the original data. By glancing at the standard deviation, we can immediately estimate how "typical" a data point might be by how many standard deviations it is from the mean.

> **Variance** is often used for efficiency in computations. The square root is a monotonically increasing function, so whatever minimizes the variance also minimizes the SD. Dropping the square root can therefore simplify calculations (e.g., taking derivatives), particularly if we are using the variance for tasks such as optimization.

### Class Exercise

**Assign the first 5 rows of titanic age data to a variable:**

In [None]:
# Take the first five rows of titanic age data.
first_five = titanic.age[:5]
print(first_five)

**Calculate the mean "by hand":**

In [None]:
# Calculate mean by hand and assign to variable 'mean'.
mean = (22 + 38 + 26 + 35 + 35) / 5.0
print(mean)

#### Calculate the variance and standard deviation "by hand" using numpy and the mean variable you just created:

In [None]:
# Calculate variance by hand
(np.square(22 - mean) +
np.square(38 - mean) +
np.square(26 - mean) +
np.square(35 - mean) +
np.square(35 - mean)) / 4.0

In [None]:
np.sqrt((np.square(22 - mean) +
np.square(38 - mean) +
np.square(26 - mean) +
np.square(35 - mean) +
np.square(35 - mean)) / 4.0)

#### Calculate the variance and the standard deviation using Pandas:

In [None]:
# Verify with Pandas
print(first_five.var())
print(first_five.std())
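One caveat when verifying by machine: pandas' `.var()` and `.std()` default to the sample formulas (dividing by $n-1$), while NumPy's `np.var` and `np.std` default to the population formulas (dividing by $n$). The sketch below uses the five values from the by-hand calculation to show the difference and how the `ddof` argument reconciles the two.

```python
import numpy as np
import pandas as pd

ages = [22, 38, 26, 35, 35]          # the five values used in the by-hand calculation

sample_var = pd.Series(ages).var()   # pandas default: ddof=1, divides by n - 1
population_var = np.var(ages)        # NumPy default: ddof=0, divides by n
matched_var = np.var(ages, ddof=1)   # NumPy with ddof=1 matches pandas

print(sample_var, population_var, matched_var)
```

The sample variance comes out near 46.7 and the population variance near 37.36, so it is worth checking which one a library returns before comparing results.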

A **quartile** is a type of **quantile**. Quartiles in statistics are values that divide your data into quarters. The **first quartile (Q1)** is defined as the middle number between the smallest number and the median of the data set. The **second quartile (Q2)** is the median of the data. The **third quartile (Q3)** is the middle value between the median and the highest value of the data set. 

**Quartiles** mark the value below which 25% of the data falls (Q1) and the value above which 25% of the data falls (Q3).

The four quarters that divide a data set into quartiles are:

1. The lowest 25% of numbers.
2. The next lowest 25% of numbers (up to the median).
3. The second highest 25% of numbers (above the median).
4. The highest 25% of numbers.

**Use the titanic passenger ages to calculate the first and third quartiles**

In [None]:
# Using the pd.qcut() method from pandas to find the 1st (Q1) and 3rd (Q3) quartiles
pd.qcut(titanic.age,4)

In [None]:
titanic.age.quantile([0.25,0.5,0.75])

The **interquartile range (IQR)** is the difference between the upper (Q3) and lower (Q1) quartiles, and describes the middle 50% of values when ordered from lowest to highest. The IQR is often seen as a better measure of spread than the range as it is not affected by outliers.

In [None]:
# Passing a scalar to .quantile() returns a plain number, so no float() conversion is needed.
titanic.age.quantile(0.75) - titanic.age.quantile(0.25)
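To see why the IQR resists outliers, here is a small sketch on made-up data using `np.percentile` (with NumPy's default linear interpolation). Replacing the largest value with an extreme one changes the range dramatically but leaves the IQR untouched.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
with_outlier = np.array([1, 2, 3, 4, 5, 6, 7, 8, 900])   # same data, one extreme value

def iqr(data):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, q3 = np.percentile(data, [25, 75])
    return q3 - q1

print(iqr(values), iqr(with_outlier))             # 4.0 4.0
print(np.ptp(values), np.ptp(with_outlier))       # 8 899
```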

<a id="our-first-model"></a>
## Our First Model
---

In this section, we will make a **mathematical model** of data. When we say **model**, we mean it in the same sense that a map is a **model** of the real world. Google Maps can get us to that restaurant without getting lost, but it can't tell us where each individual pothole is. This is good enough.

As another example, a toy car is a **model** of a real car. If we mainly care about appearance, the toy car is an excellent model. However, the toy car fails to accurately represent other aspects of the car. For example, we cannot use a toy car to test how the actual car would perform in a collision.

<img src="http://www.azquotes.com/picture-quotes/quote-all-models-are-wrong-but-some-are-useful-george-e-p-box-53-42-27.jpg">

In data science, we might take a rich, complex person and model that person solely as a two-dimensional vector: _(age, smokes cigarettes)_. For example: $(90, 1)$, $(28, 0)$, and $(52, 1)$, where $1$ indicates "smokes cigarettes." This model of a complex person obviously fails to account for many things. However, if we primarily care about modeling health risk, it might provide valuable insight.

Now that we have superficially modeled a complex person, we might determine a formula that evaluates risk. For example, an older person tends to have worse health, as does a person who smokes. So, we might deem someone as having risk should `age + 50*smokes > 100`. 

This is a **mathematical model**, as we use math to assess risk. It could be mostly accurate. However, there are surely elderly people who smoke who are in excellent health.
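The rule above is simple enough to write as a function. This is only a sketch of the toy rule from the text; the function name `is_high_risk` is made up for illustration.

```python
def is_high_risk(age, smokes):
    """Toy rule from the text: flag risk when age + 50 * smokes exceeds 100."""
    return age + 50 * smokes > 100

people = [(90, 1), (28, 0), (52, 1)]   # (age, smokes) pairs from the example above
for age, smokes in people:
    print((age, smokes), is_high_risk(age, smokes))
```

Under this rule the 90-year-old smoker and the 52-year-old smoker are flagged, and the 28-year-old non-smoker is not.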


---

Let's make our first model from scratch. We'll use it to predict the `fare` column in the Titanic data. So what data will we use? Actually, none.

The simplest model we can build is an estimation of the mean, median, or most common value. If we have no feature matrix and only an outcome, this is the best approach to make a prediction using only empirical data. 

This seems silly, but we'll actually use it all the time to create a baseline of how well we do with no data and determine whether or not our more sophisticated models make an improvement.
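Which constant makes the best baseline depends on the error metric. The sketch below uses a small made-up outcome vector with one outlier: predicting the mean gives the lower MSE, while predicting the median gives the lower MAE.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # hypothetical outcomes with one outlier

mean_pred = y.mean()        # 22.0
median_pred = np.median(y)  # 3.0

mse_mean = np.mean(np.square(y - mean_pred))
mse_median = np.mean(np.square(y - median_pred))
mae_mean = np.mean(np.abs(y - mean_pred))
mae_median = np.mean(np.abs(y - median_pred))

print('MSE  mean vs. median baseline:', mse_mean, mse_median)
print('MAE  mean vs. median baseline:', mae_mean, mae_median)
```

This is a general property rather than a quirk of the data: the mean is the constant that minimizes squared error, and the median is a constant that minimizes absolute error.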

#### Get the `fare` column from the Titanic data and store it in variable `y`:

In [None]:
# Get the fare column from the Titanic data and store it as y:
y=titanic['fare']

#### Create predictions `y_pred` (in this case just the mean of `y`):

In [None]:
# Store the predictions: the mean in y_pred and the median in y_pred2.
y_pred = y.mean()
y_pred2 = y.median()

### Exercises:

#### 3. Baseline Comparisons

#### Find the average squared distance between each prediction and its actual value:

This is known as the mean squared error (MSE).

The mean squared error (MSE) is a measure of how close a fitted line is to the data points. For every data point, you take the vertical distance from the point to the corresponding y value on the fitted curve (the error) and square it. Then you add up those values for all data points and divide by the number of points. (Some texts divide by the number of points minus two for a linear fit; the code below uses `np.mean`, which divides by the number of points.) The squaring is done so negative values do not cancel positive values. The smaller the MSE, the closer the fit is to the data. The MSE has the squared units of whatever is plotted on the vertical axis.

In [None]:
# Compare the mean baseline vs. the median baseline
print(np.mean(np.square(y-y_pred)))
print(np.mean(np.square(y-y_pred2)))

The **Root Mean Squared error (RMSE)** is just the square root of the mean square error. That is probably the most easily interpreted statistic, since it has the same units as the quantity plotted on the vertical axis.

> Key point: The RMSE is thus the distance, on average, of a data point from the fitted line, measured along a vertical line.

*The RMSE is directly interpretable in terms of measurement units* making it a better measure of goodness of fit than a correlation coefficient. One can compare the RMSE to observed variation in measurements of a typical point. The two should be similar for a reasonable fit.

#### Calculate the root mean squared error (RMSE), the square root of the MSE:

In [None]:
# Compare the mean baseline vs. the median baseline
print(np.sqrt(np.mean(np.square(y-y_pred))))
print(np.sqrt(np.mean(np.square(y-y_pred2))))

**Which of the baseline models is better and why?  In your own words, describe the purpose of a baseline model?**

#### 4. Baseline Car Model
With the mtcars dataset, generate the baseline comparisons for mean and median mpg vs. actual mpg.  Explain which baseline is a better model and why.

In [None]:
y=mtcars['mpg']
y_pred=y.mean()
y_pred2=y.median()

In [None]:
#MSE
print(np.mean(np.square(y-y_pred)))
print(np.mean(np.square(y-y_pred2)))

#RMSE
print(np.sqrt(np.mean(np.square(y-y_pred))))
print(np.sqrt(np.mean(np.square(y-y_pred2))))

<a id="a-preface-on-modeling"></a>
### A Preface on Modeling
---
As we venture down the path of modeling, it can be difficult to determine which choices are "correct" or "incorrect".  A primary challenge is to understand how different models will perform in different circumstances and different types of data. It's essential to practice modeling on a variety of data.

As a beginner it is essential to learn which metrics are important for evaluating your models and what they mean. The metrics we evaluate our models with inform our actions.  

*Exploring datasets on your own with the skills and tools you learn in class is highly recommended!*

<a id="a-short-introduction-to-model-bias-and-variance"></a>
## A Short Introduction to Model Bias and Variance 

---

- **Objective:** Describe the bias and variance of statistical estimators.

In simple terms, **bias** shows how accurate a model is in its predictions. (It has **low bias** if it hits the bullseye!)

**Variance** shows how reliable a model is in its performance. (It has **low variance** if the points are predicted consistently!)

![Bias and Variance](assets/images/biasVsVarianceImage.png)

Remember how we just calculated mean squared error to determine the accuracy of our prediction? It turns out we can do this for any statistical estimator, including mean, variance, and machine learning models.

We can even decompose mean squared error to identify the source of error - reducible error & irreducible error.

* Irreducible error or inherent uncertainty is associated with a natural variability in a system. 
* Reducible error is something we not only can address but should address to maximize accuracy. Given what we're talking about, it shouldn't surprise you to learn that its components are **error due to squared bias** and **error due to variance**.
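Bias and variance of an estimator can be seen in a small simulation. The sketch below is an illustration under stated assumptions (normally distributed data with true variance 1, samples of size 5, a fixed random seed): it compares the variance estimator that divides by $n$ with the one that divides by $n-1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 20000

# Draw many small samples from a distribution whose true variance is 1.
samples = rng.normal(loc=0.0, scale=1.0, size=(trials, n))

var_ddof0 = samples.var(axis=1, ddof=0)   # divides by n
var_ddof1 = samples.var(axis=1, ddof=1)   # divides by n - 1

print(var_ddof0.mean())   # near (n - 1) / n = 0.8: biased low
print(var_ddof1.mean())   # near 1: unbiased
print(var_ddof0.std(), var_ddof1.std())   # spread of each estimator across samples
```

The divide-by-$n$ estimator is biased but varies less from sample to sample, which is a small-scale version of the tradeoff discussed in this section.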

### Primer on Variance/Bias Tradeoff
Models that exhibit low variance and high bias *underfit* the target. Models that exhibit high variance and low bias *overfit* the target. Both prevent us from making strong predictions.

![Over/underfit](assets/images/underoverfit.png)

The **“tradeoff”** between bias and variance can be viewed in this manner: a learning algorithm with low bias must be “flexible” so that it can fit the data well. But if the learning algorithm is too flexible (for instance, a very high-degree polynomial), it will fit each training data set differently and hence have high variance. A key characteristic of many supervised learning methods is a built-in way to control the bias-variance tradeoff, either automatically or through a special parameter that the data scientist can adjust.
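
The tradeoff can be sketched with a small experiment. Everything in it is an assumption made for illustration (the "true" function `sin(pi * x)`, the noise level, the polynomial degrees, and the evaluation point `x0`); none of it comes from the lesson's data. We refit polynomials of increasing degree to many noisy training sets and look at how the prediction at one point behaves.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    """Hypothetical 'truth' that the models try to recover."""
    return np.sin(np.pi * x)

x_train = np.linspace(-1, 1, 20)   # same inputs for every training set
x0 = 0.9                           # point where we inspect the predictions
noise_sd = 0.3
n_datasets = 300

summary = {}
for degree in (1, 3, 9):
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        # New noisy training set, same underlying truth.
        y_train = true_f(x_train) + rng.normal(scale=noise_sd, size=x_train.size)
        coefs = np.polyfit(x_train, y_train, deg=degree)
        preds[i] = np.polyval(coefs, x0)

    bias_sq = (preds.mean() - true_f(x0)) ** 2   # systematic miss
    variance = preds.var()                       # sensitivity to the training set
    summary[degree] = (bias_sq, variance)
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

The straight line (degree 1) is too rigid, so its squared bias is large but its predictions barely change between training sets. The degree-9 polynomial has very little bias but much higher variance.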

<a id="correlation-and-association"></a>
## Correlation and Association
---

Correlation measures how variables are related to each other.

Typically, we talk about the Pearson correlation coefficient — a measure of **linear** association.

We refer to perfect correlation as **collinearity**.

The following are a few correlation coefficients. Note that if both variables trend upward, the coefficient is positive. If one trends opposite the other, it is negative. 

It is important that you always look at your data visually — the coefficient by itself can be misleading:

![Example correlation values](./assets/images/correlation_examples.png)
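
As a rough illustration of why the coefficient alone can mislead, the sketch below computes the Pearson coefficient by hand (covariance divided by the product of the standard deviations) and applies it to three made-up relationships. The helper name `pearson_r` and the example data are assumptions for this sketch, not part of the lesson.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered @ y_centered) / np.sqrt(
        (x_centered @ x_centered) * (y_centered @ y_centered)
    )

x = np.linspace(-1, 1, 101)

r_line_up = pearson_r(x, 2 * x + 1)   # perfect increasing line
r_line_down = pearson_r(x, -3 * x)    # perfect decreasing line
r_parabola = pearson_r(x, x ** 2)     # perfect, but nonlinear, relationship

print(f"increasing line: {r_line_up:+.3f}")
print(f"decreasing line: {r_line_down:+.3f}")
print(f"parabola:        {r_parabola:+.3f}")

# The hand-rolled version agrees with NumPy's built-in function.
print(np.isclose(r_line_up, np.corrcoef(x, 2 * x + 1)[0, 1]))
```

The parabola is completely determined by `x`, yet its coefficient is essentially zero because the relationship is not linear. A scatter plot would reveal the pattern immediately.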

<a id="codealong-correlation-in-pandas"></a>
### Classroom Exercise: Correlation in Pandas

**Objective:** Explore options for measuring and visualizing correlation in Pandas.

#### Display the correlation matrix for all Titanic variables:

In [None]:
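# numeric_only=True (pandas 1.5+) skips text columns, which raise an error in pandas 2.0+
titanic.corr(numeric_only=True)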

#### Use Seaborn to plot a heat map of the correlation matrix:

The `sns.heatmap()` function will accomplish this.

- Generate a correlation matrix from the Titanic data using the `.corr()` method.
- Pass the correlation matrix into `sns.heatmap()` as its only parameter.

In [None]:
# Use Seaborn to plot a correlation heat map (numeric columns only)
sns.heatmap(titanic.corr(numeric_only=True))

In [None]:
# Take a closer look at the survived and fare variables using a scatter plot
titanic.plot(kind='scatter', x='fare', y='survived')

Is correlation a good way to inspect the association of fare and survival?

### Exercises:
#### 5. Car Correlation

In [None]:
# Check the correlation of all numeric variables
mtcars.corr(numeric_only=True)

In [None]:
# Use Seaborn to plot a correlation heat map
sns.heatmap(mtcars.corr(numeric_only=True))

In [None]:
# Create a cluster map using Seaborn
sns.clustermap(mtcars.corr(numeric_only=True))

**a.** In your own words, describe what the visualization helps to explain.

**b.** Which variable is most closely associated with mpg? (I.e., if you could only create a model with one variable, which one would you use?) How do you know from this visualization?

**c.** Which variable is most closely associated with displacement?

**d.** If there were 2 clusters or segments of variables, what would each consist of?

**e.** If there were 3 clusters of variables, what would each consist of?

<a id="the-normal-distribution"></a>
## The Normal Distribution
---

- **Objective:** Identify a normal distribution within a data set using summary statistics and data visualizations.

- What is an event space?
  - A listing of all possible occurrences.
<br>
<br>
- What is a probability distribution?
  - A function that assigns a probability to each event in an event space.
<br>
<br>
- What are general properties of probability distributions?
  - The probability of each event is between 0 and 1.
  - All events in the event space combined have probability 1.
<br>
<br>  
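
As a small illustration of these definitions, the sketch below builds the event space for rolling two fair dice and the probability distribution of their total. The dice example is an assumption chosen for illustration; it does not appear elsewhere in the lesson.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Event space: every (die 1, die 2) outcome, all equally likely.
outcomes = list(product(range(1, 7), repeat=2))

# Probability distribution of the total shown by the two dice.
counts = Counter(a + b for a, b in outcomes)
distribution = {
    total: Fraction(count, len(outcomes))
    for total, count in sorted(counts.items())
}

for total, prob in distribution.items():
    print(f"P(total = {total:2d}) = {prob}")

# Each probability lies between 0 and 1, and together they sum to 1.
print("Sum of all probabilities:", sum(distribution.values()))
```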

<a id="what-is-the-normal-distribution"></a>
### What is the Normal Distribution?
- A normal distribution is often a key assumption of many models.
  - In practice, if the normal distribution assumption is not met, it's not the end of the world. Your model is just less efficient in most cases.

- The normal distribution is **completely summarized by its mean and standard deviation**.

- The **mean** controls its **center**.

- The **standard deviation** controls how **spread out** it is.

- Normal distributions are **symmetric, bell-shaped curves**.

![normal distribution](assets/images/normal.png)
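
Because the mean and standard deviation fully describe a normal distribution, the share of values within a given number of standard deviations of the mean is the same for every normal distribution. The sketch below computes those shares with the error function from Python's standard library and then checks one of them by simulation. The mean of 100 and standard deviation of 15 are arbitrary values assumed for the example.

```python
import math

import numpy as np

def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {prob_within(k):.4f}")

# Simulated check with an arbitrary (assumed) mean and standard deviation.
rng = np.random.default_rng(0)
mean, sd = 100.0, 15.0
samples = rng.normal(loc=mean, scale=sd, size=100_000)
frac_within_1sd = np.mean(np.abs(samples - mean) <= sd)
print(f"Simulated share within 1 standard deviation: {frac_within_1sd:.4f}")
```

These shares are the familiar "68-95-99.7" rule of thumb.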


#### Why do we care about normal distributions?

- They often show up in nature.
- Aggregated processes tend to distribute normally, regardless of their underlying distribution (**Central Limit Theorem**).
    - The **Central Limit Theorem** states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, no matter what the shape of the population distribution (as long as its variance is finite). ([More Info](https://www.analyticsvidhya.com/blog/2019/05/statistics-101-introduction-central-limit-theorem/))
    
- They offer effective simplification that makes it easy to make approximations.
- They can improve our machine learning algorithms.
<br>
Machine learning algorithms are usually designed to handle whatever distributions are present in the features. Even when transforming those distributions isn't necessary for an algorithm to work properly, it can still be beneficial for these reasons:


* To help the cost function minimize prediction error more effectively
* To help the algorithm converge properly and faster

We'll discuss various ways to transform or rescale data (e.g., normalization and standardization) later in the course.
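
Returning to the Central Limit Theorem mentioned above, a quick simulation can make it tangible. This is only a sketch under assumed settings: the exponential population, the sample size of 50, and the seed are all chosen for illustration. The population is strongly right-skewed, yet the means of repeated samples behave much like a normal distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
sample_size = 50
n_samples = 5_000

# Exponential population with scale 1: mean 1, standard deviation 1, right-skewed.
draws = rng.exponential(scale=1.0, size=(n_samples, sample_size))
sample_means = draws.mean(axis=1)

print(f"Mean of sample means: {sample_means.mean():.3f} (population mean is 1)")
print(f"Std of sample means:  {sample_means.std():.3f} "
      f"(theory: 1 / sqrt({sample_size}) = {1 / np.sqrt(sample_size):.3f})")

# Share of values within one standard deviation of the mean.
# A normal distribution puts about 68% there; this exponential puts about 86%.
population = draws.ravel()
frac_pop = np.mean(np.abs(population - population.mean()) <= population.std())
frac_means = np.mean(np.abs(sample_means - sample_means.mean()) <= sample_means.std())
print(f"Within 1 std, raw values:   {frac_pop:.3f}")
print(f"Within 1 std, sample means: {frac_means:.3f}")
```

Plotting `sample_means` with `plt.hist()` should show the familiar bell shape, while a histogram of the raw draws stays heavily skewed.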

#### Plot a histogram of 1,000 samples from a random normal distribution:

The `np.random.randn(numsamples)` function draws samples from the standard normal distribution, which has a mean of 0 and a standard deviation of 1.

- To plot a histogram, pass a NumPy array with 1000 samples as the only parameter to `plt.hist()`.
- Change the number of bins using the keyword argument `bins`, e.g. `plt.hist(mydata, bins=50)`

In [None]:
# Plot a histogram of 1,000 random normal samples from NumPy.
samples = np.random.randn(1000)
plt.hist(samples, bins=100);

<a id="skewness"></a>
### Skewness
- Skewness is a measure of the asymmetry of the distribution of a random variable about its mean.
- Skewness can be positive or negative, or even undefined.
- Notice that the mean, median, and mode are the same when there is no skew.

![skewness](assets/images/skewness---mean-median-mode.jpg)

#### Plot a lognormal distribution generated with NumPy.

Take 1,000 samples using `np.random.lognormal(size=numsamples)` and plot them on a histogram.

In [None]:
# Plot a lognormal distribution generated with NumPy
plt.hist(np.random.lognormal(size=1000), bins=10);

#### Real-World Application: When Mindfulness Beats Complexity
- Skewness is surprisingly important.
- Most algorithms implicitly use the mean by default when making approximations.
- If you know your data is heavily skewed, you may have to either transform your data or set your algorithms to work with the median.
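
The sketch below illustrates both remedies on simulated data. The lognormal sample, the variable names, and the hand-written `skewness` helper are assumptions for illustration only. It shows how skew pulls the mean away from the median and how a log transform can bring the data back toward symmetry.

```python
import numpy as np

def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

rng = np.random.default_rng(7)
# Hypothetical right-skewed feature (think of prices or incomes).
raw_values = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

raw_mean, raw_median = raw_values.mean(), np.median(raw_values)
print(f"Raw data: mean={raw_mean:.2f}, median={raw_median:.2f}, "
      f"skewness={skewness(raw_values):.2f}")

# Option 1: transform the data. The log of lognormal data is normal.
log_values = np.log(raw_values)
print(f"Log data: mean={log_values.mean():.2f}, median={np.median(log_values):.2f}, "
      f"skewness={skewness(log_values):.2f}")

# Option 2: summarize with the median, which a long right tail barely moves.
```

In the raw data the long right tail drags the mean well above the median. After the log transform, the two nearly coincide and the skewness is close to zero.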

<a id="kurtosis"></a>
### Kurtosis
- Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. (It is often described, less precisely, as how peaked or flat the data are.)
- Data sets with high kurtosis tend to have heavy tails and outliers, often together with a distinct peak near the mean.

![kurtosis](assets/images/kurtosis.jpg)

[Wikipedia](https://en.wikipedia.org/wiki/Kurtosis) includes additional pictures and explanations that may help drive this concept home.
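
For a numeric feel, the sketch below estimates excess kurtosis (kurtosis minus 3, so that a normal distribution scores about 0) for three simulated distributions. The helper `excess_kurtosis`, the choice of distributions, and the seed are assumptions for illustration.

```python
import numpy as np

def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3 (the normal's value)."""
    centered = values - values.mean()
    return np.mean(centered ** 4) / np.mean(centered ** 2) ** 2 - 3.0

rng = np.random.default_rng(3)
n = 200_000

kurt = {
    "uniform": excess_kurtosis(rng.uniform(-1, 1, size=n)),   # light tails
    "normal": excess_kurtosis(rng.normal(size=n)),            # reference
    "laplace": excess_kurtosis(rng.laplace(size=n)),          # heavy tails
}

for name, value in kurt.items():
    print(f"{name:8s} excess kurtosis: {value:+.2f}")
```

The uniform distribution should come out clearly negative, the normal near zero, and the Laplace distribution clearly positive.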

#### Real-World Application: Risk Analysis
- Long-tailed distributions with high kurtosis elude intuition; we naturally think the event is too improbable to pay attention to.
- It's often the case that a large cost is associated with a low-probability event, as with hurricane damage. (That is, it's unlikely you will be hit by a Category 5 hurricane, but if you are, the damage will be catastrophic.)

<a id="determining-the-distribution-of-your-data"></a>
## Determining the Distribution of Your Data
---

**Objective:** Create basic data visualizations, including scatterplots, box plots, and histograms.

![](./assets/images/distributions.png)

#### Use the `.hist()` method of your Titanic DataFrame to plot histograms of all the variables in your data.

- The function `plt.hist(data)` calls the Matplotlib library directly.
- However, each DataFrame has its own `hist()` method that by default plots one histogram per column. 
- Given a DataFrame `my_df`, it can be called like this: `my_df.hist()`. 

In [None]:
# Plot all variables in the Titanic data set using histograms:
titanic.hist();

#### Use the built-in `.plot.box()` function of your Titanic DataFrame to plot box plots of your variables.

- Given a DataFrame, a box plot can be made where each column is one tick on the x axis.
- To do this, it can be called like this: `my_df.plot.box()`.
- Try using the keyword argument `showfliers`, e.g. `showfliers=False`.

In [None]:
# Plotting all histograms can be unwieldy; box plots can be more concise:
titanic.plot.box();
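
Alongside histograms and box plots, summary statistics offer a quick numeric check on the shape of each column. The sketch below uses a made-up DataFrame (the column names `bell_shaped` and `right_skewed` are hypothetical, not Titanic columns) together with the Pandas `.skew()` and `.kurt()` methods. Treat it as one possible approach rather than the lesson's prescribed method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)

# Hypothetical data: one roughly normal column and one right-skewed column.
df = pd.DataFrame({
    "bell_shaped": rng.normal(loc=50, scale=10, size=2_000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=2_000),
})

summary = pd.DataFrame({
    "mean": df.mean(),
    "median": df.median(),
    "skew": df.skew(),        # roughly 0 for symmetric data
    "kurtosis": df.kurt(),    # excess kurtosis: roughly 0 for normal data
})
print(summary.round(2))
```

A column whose mean and median are close and whose skew and excess kurtosis are both near zero is a candidate for being approximately normal. A histogram is still worth checking before relying on that.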

<a id="exercise"></a>
### Classroom Discussion

1. Look at the Titanic data variables.
- Are any of them normal?
- Are any skewed?
- How might this affect our modeling?

<a id="topic-review"></a>
## Lesson Review
---

- We covered several different types of summary statistics. What are they?
- We covered three different types of visualizations. Which ones?
- Describe bias and variance and why they are important.
- What are some important characteristics of distributions?