<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Statistics Fundamentals

_Instructors:_ Alexander Egorenkov (DC), Amy Roberts (NYC), Tim Book, General Assembly DC

---

<a id="learning-objectives"></a>
## Learning Objectives
- **Linear algebra:** Dot products, matrix multiplications, and vector norms by hand and using NumPy.
- **Summary statistics:** Using NumPy and Pandas: mean, median, mode, max, min, quartile, inter-quartile range, variance, standard deviation, and correlation.
- **Discover trends:** Using basic summary statistics and viz.
- **Bias/variance tradeoff:** Describe the bias and variance of statistical estimators.
- **Identify a normal distribution** within a data set using summary statistics and data visualizations.

### Lesson Guide
- [Linear Algebra Review](#linear-algebra-review)
    - [Scalars, Vectors, and Matrices](#scalars-vectors-and-matrices)
    - [Basic Matrix Algebra](#basic-matrix-algebra)
    - [Dot Product](#dot-product)
    - [Matrix Multiplication](#matrix-multiplication)
    - [Vector Norm](#vector-norm)
- [Linear Algebra Applications to Machine Learning](#linear-algebra-applications-to-machine-learning)
    - [Code-Along: Examining the Cars Data Set](#codealong-examining-the-cars-dataset)


In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics # Did you catch this is new? 
plt.style.use('fivethirtyeight')

# This makes sure that graphs render in your notebook.
%matplotlib inline

<a id="where-are-we-in-the-data-science-workflow"></a>
## Where Are We in the Data Science Workflow?

![Data Science Workflow](./assets/images/data-science-workflow.png)

<a id="linear-algebra-review"></a>
## Linear Algebra Review
---
**Objective:** Compute dot products, matrix multiplications, and vector norms by hand and using NumPy.

<a id="why-linear-algebra"></a>
### Why Use Linear Algebra in Data Science?

- Linear models are efficient and well understood. They can often closely approximate nonlinear solutions, and they scale to high dimensions without difficulty.


- **Linear models are all based on linear algebra**, so we should know that too.


- Furthermore, even the complicated models rely on the basic models, which in turn rely heavily on linear algebra.


- Although we do not have time in this course to comprehensively discuss linear algebra, you may want to take time to understand it better.

<a id="scalars-vectors-and-matrices"></a>
### Scalars, Vectors, and Matrices

<img src="assets/images/scalars-vectors-matrices.png">


A **scalar** is a single number. 
- Symbols that are lowercase single letters refer to scalars. For example, the symbols $a$ and $v$ are scalars that might refer to arbitrary numbers such as $5.328$ or $7$. 

- An example scalar would be: $a$
<br>
<br>

A **vector** is an ordered sequence of numbers, **like a list**. 
- Unlike a Python list, a vector can only be numeric.
- Where convenient, we can treat a vector as either a $1 \times n$ "row" vector or an $n \times 1$ "column" vector.
- Here, symbols that are lowercase single letters with an arrow, such as $\vec{u}$, refer to vectors. An example vector would be:

$$\vec{u} = \left[ \begin{array}{ccc}
1&3&7
\end{array} \right]$$

In [None]:
# Create a vector using np.array. A NumPy array is indexed like a Python list,
# but NumPy arithmetic on it is applied element by element.

u = np.array([1, 3, 7])
print(u)
print(np.sum(u))
print(u[0])
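
To make the list-versus-array difference concrete, here is a minimal sketch. The names `values` and `u_demo` are made up for this illustration and are not part of the lesson's notebook.

```python
import numpy as np

values = [1, 3, 7]           # plain Python list (hypothetical name)
u_demo = np.array(values)    # NumPy array holding the same numbers

list_times_two = values * 2    # a list is repeated: [1, 3, 7, 1, 3, 7]
array_times_two = u_demo * 2   # an array is scaled entry by entry: [2, 6, 14]

print(list_times_two)
print(array_times_two)
```

The same `*` operator means repetition for a list and arithmetic for an array, which is why we use arrays for vectors.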

An $m \times n$ **matrix** is a rectangular array of numbers with $m$ rows and $n$ columns. Each number in the matrix is an entry. Entries can be denoted $a_{ij}$, where $i$ denotes the row number and $j$ denotes the column number. Note that, because each entry $a_{ij}$ is a lowercase single letter, a matrix is an array of scalars:

$$\mathbf{A}= \left[ \begin{array}{cccc}
a_{11} & a_{12} & \cdots & a_{1n}  \\
a_{21} & a_{22} & \cdots & a_{2n}  \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{array} \right]$$

Matrices are referred to using bold uppercase letters, such as $\mathbf{A}$. The bold font face is often (though not always) used to distinguish matrices from sets.

In [None]:
# Create a matrix using np.array.
m = np.array([[1, 3, 7], [4, 6, 3], [2, 5, 6]])
m

Note that in NumPy, a matrix is just a list of lists converted to a NumPy array (or, equivalently, a group of vectors)! **In fact, a vector is also a matrix!**
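
One way to see a vector as a matrix is to look at array shapes. This is only a sketch, and the names `u_demo`, `row_vector`, and `col_vector` are hypothetical. Note that a 1-D NumPy array is neither a row nor a column until we reshape it.

```python
import numpy as np

u_demo = np.array([1, 3, 7])          # 1-D array with shape (3,)
row_vector = u_demo.reshape(1, -1)    # 1 x 3 matrix ("row" vector)
col_vector = u_demo.reshape(-1, 1)    # 3 x 1 matrix ("column" vector)

print(u_demo.shape, row_vector.shape, col_vector.shape)
```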

#### Arrays are More Efficient than Pandas Series

In [None]:
s = pd.Series(u)
print(s) 

In [None]:
%timeit -n 10000 u[1]

In [None]:
%timeit -n 10000 s[1]

<a id="basic-matrix-algebra"></a>
### Basic Matrix Algebra


#### Addition and Subtraction
Vector **addition** is straightforward. If two vectors have the same dimension, we add them entry by entry (the vectors are shown here as column vectors for convenience only):

$\vec{v} = \left[ \begin{array}{c}
1 \\
3 \\
7
\end{array} \right],  \vec{w} = \left[ \begin{array}{c}
1 \\
0 \\
1
\end{array} \right]$

In [None]:
v = np.array([1, 3, 7])
w = np.array([1, 0, 1])

$\vec{v} + \vec{w} =
\left[ \begin{array}{c}
1 \\
3 \\
7
\end{array} \right] + \left[ \begin{array}{c}
1 \\
0 \\
1
\end{array} \right] = 
\left[ \begin{array}{c}
1+1 \\
3+0 \\
7+1
\end{array} \right] = 
\left[ \begin{array}{c}
2 \\
3 \\
8
\end{array} \right]
$

(Subtraction is similar.)

In [None]:
# Add the vectors together with +.
v + w

In [None]:
# now using numpy (np.sum)
np.sum([v,w], axis=0)

**Classroom Question**: What happens when **axis=1**?

**Classroom Exercise**:
Subtract the vectors.  Write it out by hand and then allow Python to do the work.

Use `np.subtract()` from the `numpy` module. If needed, use `shift-tab` to see how it works.

In [None]:
# Subtract the vectors using np.subtract.
np.subtract(v,w)

In [None]:
# Use an alternative method to perform vector subtraction.
v-w

#### Scalar Multiplication
We scale a vector with **scalar multiplication**, multiplying a vector by a scalar (single quantity):

$ 2 \cdot \vec{v} = 2\left[ \begin{array}{c}
1 \\
3 \\
7
\end{array} \right] = 
 \left[ \begin{array}{c}
2 \cdot 1 \\
2 \cdot 3 \\
2 \cdot 7
\end{array} \right] = 
 \left[ \begin{array}{c}
2 \\
6 \\
14
\end{array} \right]$ 

In [None]:
# Multiply v by 2.
2*v

In [None]:
# Multiply w and v element-wise (note: this is not the dot product).
w*v

<a id="dot-product"></a>
### Dot Product
The **dot product** of two _n_-dimensional vectors is:

$ \vec{v} \cdot \vec{w} =\sum _{i=1}^{n}v_{i}w_{i}=v_{1}w_{1}+v_{2}w_{2}+\cdots +v_{n}w_{n} $

So, if:

$\vec{v} = \left[ \begin{array}{c}
1 \\
3 \\
7
\end{array} \right], \vec{w} = \left[ \begin{array}{c}
1 \\
0 \\
1
\end{array} \right]$

$ \vec{v} \cdot \vec{w} = 1 \cdot 1 + 3 \cdot 0 + 7 \cdot 1 = 8 $

_Tim Note:_ When considering vectors as "column vectors", you will often see a dot product written as $\mathbf{v}^T\mathbf{w}$. In more pure-math based literature, you might even see $\langle v, w \rangle$.

In [None]:
# Calculate the dot product of v and w using np.dot.
np.dot(v,w)

In [None]:
# try the .dot() method on v to do the same
# How could you have found this option?
v.dot(w)
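
To connect the summation formula to the code, the sketch below computes the same dot product "by hand" as the sum of element-wise products. The names `by_hand` and `with_operator` are introduced only for this illustration.

```python
import numpy as np

v = np.array([1, 3, 7])
w = np.array([1, 0, 1])

by_hand = np.sum(v * w)    # element-wise products 1*1, 3*0, 7*1, then summed
with_operator = v @ w      # for 1-D arrays, the @ operator is also the dot product

print(by_hand, with_operator)
```

Both values should agree with `np.dot(v, w)`.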

<a id="matrix-multiplication"></a>
### Matrix Multiplication
**Matrix multiplication**, $\mathbf{AB}$, is valid when the left matrix has the same number of columns as the right matrix has rows. Each entry is the dot product of corresponding row and column vectors.

![](assets/images/matrix-multiply-a.gif)
(Image: mathisfun.com)


![](assets/images/matrix-multiplication-song.png)

The dot product illustrated above is: $1 \cdot 7 + 2 \cdot 9 + 3 \cdot 11 = 58$. **Can you compute the rest of the dot products by hand?**

If the product is the $2 \times 2$ matrix $\mathbf{C}$, then:

+ Matrix entry $c_{12}$ (its FIRST row and SECOND column) is the dot product of the FIRST row of $\mathbf{A}$ and the SECOND column of $\mathbf{B}$.

+ Matrix entry $c_{21}$ (its SECOND row and FIRST column) is the dot product of the SECOND row of $\mathbf{A}$ and the FIRST column of $\mathbf{B}$.

**Let's compute the example above, with the $2 \times 3$ matrix multiplied by the $3 \times 2$ matrix, which results in a $2 \times 2$ matrix. Can you see why?**

In [None]:
# Multiply the two matrices above.

# Create them first.
A = np.array([[1,2,3],[4,5,6]])

B = np.array([[7,8],[9,10],[11,12]])

# Now multiply!
# What goes here to multiply these two matrices? (Order matters.)

C = np.dot(A,B)
print(C)

In [None]:
# Subset C to show the value in the first row and second column (upper right value!)
C[0][1] 

# or

C[0,1]
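
As a check on the row-times-column rule, the sketch below recomputes entry $c_{12}$ directly from the first row of $\mathbf{A}$ and the second column of $\mathbf{B}$, and shows that reversing the order of multiplication gives a matrix of a different shape. The name `c_12` is introduced here for illustration.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])         # 2 x 3
B = np.array([[7, 8], [9, 10], [11, 12]])    # 3 x 2

C = A @ B                            # for 2-D arrays, same result as np.dot(A, B)
c_12 = np.dot(A[0, :], B[:, 1])      # first row of A dotted with second column of B

print(C)
print(c_12 == C[0, 1])
print((B @ A).shape)                 # reversed order: a 3 x 3 matrix, not 2 x 2
```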

<a id="n-dimensional-space"></a>
### N-Dimensional Space

We often refer to vectors as elements of an $n$-dimensional space. The symbol $\mathbb{R}$ refers to the set of all real numbers (written in uppercase "blackboard bold" font). Because this contains all reals, $3$ and $\pi$ are **contained in** $\mathbb{R}$. We often write this symbolically as $3 \in \mathbb{R}$ and $\pi \in \mathbb{R}$.

To get the set of all pairs of real numbers, we take the product of this set with itself (called the Cartesian product), $\mathbb{R} \times \mathbb{R}$, abbreviated as $\mathbb{R}^2$. This set contains all pairs of real numbers, so $(1, 3)$ is **contained in** it. We write this symbolically as $(1, 3) \in \mathbb{R}^2$.

+ In 2-D space ($\mathbb{R}^2$), a point is uniquely referred to using two coordinates: $(1, 3) \in \mathbb{R}^2$.
+ In 3-D space ($\mathbb{R}^3$), a point is uniquely referred to using three coordinates: $(8, 2, -3) \in \mathbb{R}^3$.
+ In $n$-dimensional space ($\mathbb{R}^n$), a point is uniquely referred to using $n$ coordinates.

Note that these coordinates of course are isomorphic to our vectors! After all, coordinates are ordered sequences of numbers, just as we define vectors to be ordered sequences of numbers. So, especially in machine learning, we often visualize vectors of length $n$ as points in $n$-dimensional space.

<a id="vector-norm"></a>
### Vector Norm 

The **magnitude** of a vector, $\vec{v} \in \mathbb{R}^{n}$, can be interpreted as its length in $n$-dimensional space. Therefore it is calculable via the Euclidean distance from the origin:

$\vec{v} = \left[ \begin{array}{c}
v_{1} \\
v_{2} \\
\vdots \\
v_{n}
\end{array} \right]$

then $\| \vec{v} \| = \sqrt{v_{1}^{2} + v_{2}^{2} + \cdots + v_{n}^{2}} = \sqrt{\vec{v}^T\vec{v}}$

E.g. if $\vec{v} = 
\left[ \begin{array}{c}
3 \\
4
\end{array} \right]$, then $\| \vec{v} \| = \sqrt{3^{2} + 4^{2}} = 5$

This is also called the vector **norm**. You will often see this used in machine learning.

In [None]:
# Calculate the norm of the vector x with np.linalg.norm.
x = np.array([3,4])

np.linalg.norm(x)
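
The identity $\| \vec{v} \| = \sqrt{\vec{v}^T\vec{v}}$ can be checked directly. This is a small sketch, and the names `norm_by_hand` and `norm_numpy` are made up for the comparison.

```python
import numpy as np

x = np.array([3, 4])

norm_by_hand = np.sqrt(np.dot(x, x))   # sqrt(3^2 + 4^2)
norm_numpy = np.linalg.norm(x)         # NumPy's default is the Euclidean norm

print(norm_by_hand, norm_numpy)
```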

<a id="linear-algebra-applications-to-machine-learning"></a>
## Linear Algebra Applications to Machine Learning
---

Linear algebra will give you better intuition for machine learning algorithms and help you see them as more than "black boxes".  Models have parameters and hyperparameters that you can tune, and understanding the inner workings can help you refine your models.

You can also code algorithms from scratch, if you choose to become more advanced.

<a id="distance-between-actual-values-and-predicted-values"></a>
### Distance Between Actual Values and Predicted Values
We often need to know the difference between predicted values and actual values. 

![](assets/images/vector-norms.png)


#### L² Norm (Least Squares)
Most commonly, we use the **L²** norm, which is the square root of the sum of the squared differences.  For vectors with $n$ entries, we compute this as:
$$ L^2 \text{ norm} = \|\vec{actual} - \vec{predicted} \|_2 = \sqrt{(actual_1 - predicted_1)^2 + (actual_2 - predicted_2)^2 + \cdots + (actual_n - predicted_n)^2}$$

Note that this is just the **straight-line distance** or **as-the-crow-flies distance** between the actual point and the predicted point.


#### L¹ Norm (Least Absolute Deviations)
Another, less commonly used, measure is the **L¹** norm, also known as **taxicab distance** because it describes the number of city blocks you would travel to reach the destination.

$$ L^1 \text{ norm} = \|\vec{actual} - \vec{predicted} \|_1 = |actual_1 - predicted_1| + |actual_2 - predicted_2| + \cdots + |actual_n - predicted_n| $$
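
The two distances can be compared on a tiny made-up example. The values in `actual` and `predicted` below are hypothetical, chosen so the difference vector is $(3, 4)$.

```python
import numpy as np

actual = np.array([3.0, 5.0])       # hypothetical actual values
predicted = np.array([0.0, 1.0])    # hypothetical predicted values
diff = actual - predicted           # the difference vector [3, 4]

l2 = np.linalg.norm(diff)           # straight-line distance: sqrt(3^2 + 4^2)
l1 = np.linalg.norm(diff, ord=1)    # taxicab distance: |3| + |4|

print(l2, l1)
```

The straight-line distance is never longer than the taxicab distance between the same two points.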


![](assets/images/L1-vs-L2-properties1.png)
<br>
<br>


### Mean Absolute Error
MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. All individual differences have equal weight.

$$MAE = \frac{1} {n} \| \hat{y}(\mathbf{X}) - \vec{y} \|_1 = \frac{1} {n} \sum_{i=1}^{n} | \hat{y}_i - y_i |$$


<a id="mean-squared-error"></a>
### Mean Squared Error
Another method for measuring distance, or error between predicted and actual, is the mean of the squared errors.  **This is often used to measure the quality of regression models.** Here, $\hat{y}(\mathbf{X})$ is a vector of predicted values (a function of the data matrix $\mathbf{X}$) and $\vec{y}$ is the vector of actual values:

$$MSE = \frac{1} {n} \| \hat{y}(\mathbf{X}) - \vec{y} \|_2^2 = \frac{1} {n} \sum_{i=1}^{n} ( \hat{y}_i - y_i )^2$$



### Root Mean Squared Error
Another similar method for measuring distance, or error between predicted and actual, is the square root of the mean of the squared errors.  **This is another common method to measure the quality of regression models.** Here, $\hat{y}(\mathbf{X})$ is a vector of predicted values (a function of the data matrix $\mathbf{X}$) and $\vec{y}$ is the vector of actual values:

$$RMSE = \sqrt{\frac{1} {n} \| \hat{y}(\mathbf{X}) - \vec{y} \|_2^2} = \sqrt{\frac{1} {n} \sum_{i=1}^{n} ( \hat{y}_i - y_i )^2}$$


Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means RMSE, like MSE, penalizes large errors more heavily than MAE does, so it is more useful when large errors are particularly undesirable.
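
To illustrate how squaring changes the weight of large errors, here is a minimal sketch with made-up numbers. The helper `error_metrics` and the arrays `y_true`, `pred_even`, and `pred_outlier` are hypothetical. Both sets of predictions have the same total absolute error, but one concentrates it in a single large miss.

```python
import numpy as np

def error_metrics(actual, predicted):
    """Return (MAE, MSE, RMSE) for two equal-length arrays."""
    errors = predicted - actual
    mae = np.mean(np.abs(errors))
    mse = np.mean(np.square(errors))
    rmse = np.sqrt(mse)
    return mae, mse, rmse

y_true = np.array([10.0, 10.0, 10.0, 10.0])
pred_even = np.array([12.0, 8.0, 12.0, 8.0])         # every prediction is off by 2
pred_outlier = np.array([10.0, 10.0, 10.0, 18.0])    # one prediction is off by 8

even_metrics = error_metrics(y_true, pred_even)
outlier_metrics = error_metrics(y_true, pred_outlier)

print("even errors   (MAE, MSE, RMSE):", even_metrics)
print("single outlier (MAE, MSE, RMSE):", outlier_metrics)
```

The MAE is identical for both sets of predictions, while the MSE and RMSE are larger for the set with the single large error.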

<a id="least-squares"></a>
### Least Squares
Linear regression models are typically fit with least squares, which chooses the model parameters that solve a problem of the following form:

$$\min \| \hat{y}(\mathbf{X}) - \vec{y} \|^2$$

The goal is to minimize the distance between model predictions and actual data.
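
As a rough sketch of this minimization, the example below fits a straight line with NumPy's `np.linalg.lstsq` (used here as a stand-in for a regression library). The data in `x_demo` and `y_demo` are made up and lie exactly on $y = 1 + 2x$, so the squared distance at the minimum should be essentially zero.

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 1 + 2x.
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])

# Data matrix X: a column of ones (intercept) next to the feature column.
X = np.column_stack([np.ones_like(x_demo), x_demo])

# lstsq returns the coefficients that minimize ||X @ coef - y||^2.
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y_demo, rcond=None)

y_hat = X @ coef                                      # model predictions
squared_distance = np.linalg.norm(y_hat - y_demo) ** 2

print(coef)
print(squared_distance)
```

With noisy real data the minimum would not be zero, but the fitted coefficients would still be the ones that make this squared distance as small as possible.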

Let's see this in [scikit-learn](http://scikit-learn.org/stable/modules/linear_model.html).


<a id="codealong-examining-the-cars-dataset"></a>
### Follow-Along: Examining the Cars dataset
---

This is a follow-along rather than a code-along, for the sake of time, so we can get back to statistics.

Read in the Motor Trend Cars data. 

**Data Source**: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html

**Description**
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

**Format**
A data frame with 32 observations on 11 (numeric) variables.

- **mpg**	Miles/(US) gallon
- **cyl**	Number of cylinders
- **disp**	Displacement (cu.in.)
- **hp**	Gross horsepower
- **drat**	Rear axle ratio
- **wt**	Weight (1000 lbs)
- **qsec**	1/4 mile time
- **vs**	Engine (0 = V-shaped, 1 = straight)
- **am**	Transmission (0 = automatic, 1 = manual)
- **gear**	Number of forward gears
- **carb**	Number of carburetors

In [None]:
mtcars = pd.read_csv('datasets/mtcars.csv')

Imagine we were trying to predict the mpg of a car.  Let's create 2 random columns, *predicted_mpg* and *predicted_mpg_2*, that we will assume were predicted with 2 different machine learning models.

In [None]:
mtcars.head()

#### Create synthetic values for each prediction (demo purpose)

In [None]:
np.random.seed(99)  # this ensures repeatability of results - our results should be the same.
mtcars['predicted_mpg'] = np.random.randint(15, 30, mtcars.shape[0])
mtcars['predicted_mpg_2'] = np.random.randint(18, 26, mtcars.shape[0])

In [None]:
mtcars.head(10)

#### Print out the dimensions of the DataFrame using the `.shape` attribute:

In [None]:
# Preview data dimensions.
mtcars.shape

#### Print out the data types of the columns using the `.dtypes` attribute:

In [None]:
# What are the column data types?
mtcars.dtypes

#### Pull up descriptive statistics for each variable using the built-in `.describe()` method:

In [None]:
# Pull up descriptive statistics for mpg, predicted_mpg and predicted_mpg_2
mtcars[['mpg','predicted_mpg','predicted_mpg_2']].describe()

#### Calculate Euclidean distance between the predicted columns and actual column

In [None]:
#L2 Norm aka Euclidean Distance aka Straight-Line Distance
print('Model 1 L2 Norm:', np.linalg.norm(mtcars['mpg']-mtcars['predicted_mpg']))

print('Model 2 L2 Norm:', np.linalg.norm(mtcars['mpg']-mtcars['predicted_mpg_2']))

### Exercises:

#### 1. Going the Distance

Calculate the L1 Norm for each prediction.  Look at the help for np.linalg.norm, and specifically the **ord** parameter. (hint:L**1**)


In [None]:
print('Model 1 L1 Norm:', np.linalg.norm(mtcars['mpg']-mtcars['predicted_mpg'], ord=1))

In [None]:
print('Model 2 L1 Norm:', np.linalg.norm(mtcars['mpg']-mtcars['predicted_mpg_2'], ord=1))

#### 2. Calculate the MAE using numpy.  (hint: nest np.abs inside np.mean)

In [None]:
print('Model 1 MAE:', np.mean(np.abs(mtcars['mpg']-mtcars['predicted_mpg'])))
print('Model 2 MAE:', np.mean(np.abs(mtcars['mpg']-mtcars['predicted_mpg_2'])))

#### 3. Calculate the MSE using numpy.  (hint: use np.mean and np.square)

In [None]:
print('Model 1 MSE:', np.mean(np.square(mtcars.mpg-mtcars.predicted_mpg)))
print('Model 2 MSE:', np.mean(np.square(mtcars.mpg-mtcars.predicted_mpg_2)))

**Question: What are the units of this error?**

#### 4. Calculate the RMSE using numpy.  (hint: use the MSE calculation and take the square root)

In [None]:
print('Model 1 RMSE:', np.sqrt(np.mean(np.square(mtcars.mpg-mtcars.predicted_mpg))))
print('Model 2 RMSE:', np.sqrt(np.mean(np.square(mtcars.mpg-mtcars.predicted_mpg_2))))

**Question: What are the units of this error?**

#### 5. Based on these metrics, which of these 2 simple models is better at explaining the behavior?