<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Statistics Fundamentals

_Instructors:_ Alexander Egorenkov (DC), Amy Roberts (NYC), Tim Book, General Assembly DC

---

<a id="learning-objectives"></a>
## Learning Objectives
- **Summary statistics:** Using NumPy and Pandas: mean, median, mode, max, min, quartile, inter-quartile range, variance, standard deviation, and correlation.
- **Discover trends:** Using basic summary statistics and viz.
- **Bias/variance tradeoff:** Describe the bias and variance of statistical estimators.
- **Identify a normal distribution** within a data set using summary statistics and data visualizations.

### Lesson Guide
- [Descriptive Statistics Fundamentals](#descriptive-statistics-fundamentals)
	- [Measures of Central Tendency](#measures-of-central-tendency)
	- [Math Review](#math-review)
	- [Measures of Dispersion: Standard Deviation and Variance](#measures-of-dispersion-standard-deviation-and-variance)
- [Our First Model](#our-first-model)
- [A Short Introduction to Model Bias and Variance](#a-short-introduction-to-model-bias-and-variance)
- [Correlation and Association](#correlation-and-association)
- [The Normal Distribution](#the-normal-distribution)
	- [What is the Normal Distribution?](#what-is-the-normal-distribution)
	- [Skewness](#skewness)
	- [Kurtosis](#kurtosis)
- [Determining the Distribution of Your Data](#determining-the-distribution-of-your-data)
- [Lesson Review](#topic-review)
- [Extra Lab - Feature Reduction](#lab-stats-dim)

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics # Did you catch this is new? 

plt.style.use('fivethirtyeight')

# This makes sure that graphs render in your notebook.
%matplotlib inline

<a id="descriptive-statistics-fundamentals"></a>
## Descriptive Statistics Fundamentals
---

- **Objective:** Code summary statistics using NumPy and Pandas: mean, median, mode, max, min, quartile, inter-quartile range, variance, standard deviation, and correlation.

### Statistics

Statistics is essentially the study of distributions. We use distributions to tie each observed value to how frequently it occurs. Our goal is to learn how to pull meaning out of the distributions of various datasets. First, the formal definition of statistics:


>**Statistics** is a branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data.

That said, there is a lot of nuance within statistics. For this class you won't need an intimate understanding of statistics, but it will come up more and more often as you progress through your data science career. The usual *litmus test* is that a data scientist is better at statistics than a programmer, but an in-depth review will take you much further.

Statistical References:
* A great start [Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf)
* [Bayesian Data Analysis, by Andrew Gelman](http://www.stat.columbia.edu/~gelman/book/)
* [Machine Learning: a Probabilistic Perspective](https://www.cs.ubc.ca/~murphyk/MLbook/)
* [Pattern Recognition and Machine Learning](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf)
* And of course my personal [favorite](http://mtvernon.wsu.edu/wp-content/uploads/2016/12/Statistics_for_Terrified_Biologists.pdf)


### A Quick Review of Statistical Methods in Python and Pandas

In [None]:
titanic = pd.read_csv('datasets/titanic.csv')
titanic.head()

In [None]:
type(titanic)

**Using the `titanic['fare']` column, compute the total fare paid by passengers.**


In [None]:
# Using the titanic.fare column, compute the total fare paid by passengers.
np.sum(titanic['fare'])

In [None]:
# base Python
sum(titanic['fare'])

In [None]:
# Pandas
titanic['fare'].sum()

What type of data can we pull out of the titanic data using basic statistics?


<a id="measures-of-central-tendency"></a>
### Measures of Central Tendency

- Mean
- Median
- Mode

#### Mean
The mean is defined as:
$$\bar{x} =\frac 1n\sum_{i=1}^nx_i$$

It is determined by summing all data points in a population and then dividing the total by the number of points. The resulting number is known as the mean or the average.

Be careful — the mean can be highly affected by outliers. For example, the mean of a very large number and some small numbers will be much larger than the "typical" small numbers. Earlier, we saw that the mean squared error (MSE) was used to optimize linear regression. Because this mean is highly affected by outliers, the resulting linear regression model is, too.

We say the mean is **sensitive** to outliers.

#### Median
The median refers to the midpoint in a series of numbers. Notice that the median is barely affected by outliers, so it better represents the "typical" value in a set.

$$ 0,1,2,[3],5,5,1004 $$

$$ 1,3,4,[4,5],5,5,7 $$

To find the median:

- Arrange the numbers in order from smallest to largest.
    - If there is an odd number of values, the middle value is the median.
    - If there is an even number of values, the average of the middle two values is the median.

Although the median has many useful properties, the mean is easier to use in optimization algorithms. The median is more often used in analysis than in machine learning algorithms.

The median isn't really affected by a few outliers. We say the median is **resistant** to outliers.
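
To make the contrast between a sensitive mean and a resistant median concrete, here is a minimal sketch that reuses the seven-value example from above. Treating any value of 1000 or more as "the outlier" is an assumption made only for this illustration.

```python
import numpy as np

# The odd-length example from above: six ordinary values and one extreme outlier.
values = np.array([0, 1, 2, 3, 5, 5, 1004])

mean_with_outlier = values.mean()        # pulled far above the "typical" values
median_with_outlier = np.median(values)  # still the middle value

# Drop the outlier and recompute both statistics.
trimmed = values[values < 1000]
mean_without_outlier = trimmed.mean()
median_without_outlier = np.median(trimmed)

print(f"mean:   {mean_with_outlier:.2f} -> {mean_without_outlier:.2f}")
print(f"median: {median_with_outlier:.2f} -> {median_without_outlier:.2f}")
```

Removing a single point moves the mean by more than a hundred units, while the median shifts by only half a unit.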

#### Mode
The mode of a set of values is the value that occurs most often.
A set of values may have more than one mode, or no mode at all.

$$1,0,1,5,7,8,9,3,4,1$$ 

$1$ is the mode, as it occurs the most often (three times).

#### Code-Along

#### Find the mean of the `titanic.fare` series using base Python:

#### Find the mean of the `titanic.fare` series using NumPy:

#### Find the mean of the `titanic.fare` series using Pandas:

#### What is the median fare paid (using Pandas)?

#### The mean and median are not the same. What does this tell you about the fares?


A: 

#### Use Pandas to find the most common fare paid on the Titanic:

In [None]:
titanic['fare'].mode()

#### Notice that this returns a series instead of a single number. Why?

In [None]:
a = [1,2,3,4,5,5,5,6,6,6,7,7,7]

sr = pd.Series(a)
sr

In [None]:
sr.mode()

#### Use the built-in  `.value_counts()` function to count the values of each type in the `pclass` column:

In [None]:
titanic['pclass'].value_counts()

In [None]:
# Normalize to see value counts as proportions (fractions of the total).
titanic['pclass'].value_counts(normalize=True)

#### Pull up descriptive statistics for each variable using the built-in `.describe()` function:

In [None]:
titanic.describe()

In [None]:
titanic.info()

Use these tools to answer questions about the Titanic dataset.

For example: how does the average fare of survivors compare with that of passengers who did not survive?

In [None]:
is_survived = titanic['survived'] == 1

# summary stats on survivors
titanic[is_survived].describe()

In [None]:
# summary stats on those who did not survive
titanic[~is_survived].describe()

### Diagnosing Data Problems

- Whenever you get a new data set, one of the fastest ways to find mistakes and inconsistencies is to look at the **descriptive statistics**.
  - If anything looks too high or too low relative to your experience, there may be issues with the data collection.
- Your data may contain a lot of **missing values** and may need to be cleaned meticulously before they can be combined with other data.
  - You can take a quick average or moving average to smooth out the data and combine that to preview your results before you embark on your much longer data-cleaning journey.
  - Sometimes filling in missing values with their means or medians will be the best solution for dealing with missing data. Other times, you may want to drop the offending rows or do real imputation.
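
As a rough sketch of the options listed above for missing values, the snippet below counts them, drops them, and fills them with the mean or median. The `ages` series is made up for illustration and is not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries (not the Titanic data).
ages = pd.Series([22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 2.0])

n_missing = ages.isna().sum()                # how many values are missing?
dropped = ages.dropna()                      # option 1: drop the offending rows
filled_mean = ages.fillna(ages.mean())       # option 2: fill with the mean
filled_median = ages.fillna(ages.median())   # option 3: fill with the median

print(n_missing, len(dropped))
print(filled_mean.tolist())
print(filled_median.tolist())
```

Note that `.mean()` and `.median()` skip missing values by default, so the fill values are computed from the five observed ages only.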

<a id="math-review"></a>
### Math Review

#### How Do We Measure Distance?

One method is to take the difference between two points:

$$X_2 - X_1$$

However, this can be inconvenient because of negative numbers.

We often use the following square root trick to deal with negative numbers. Note this is equivalent to the absolute value (if the points are 1-D):

$$\sqrt{(X_2-X_1)^2} = | X_2 - X_1 |$$

#### What About Distance in Multiple Dimensions?

We can turn to the Pythagorean theorem.

$$a^2 + b^2 = c^2$$

To find the distance along a diagonal, it is sufficient to measure one dimension at a time:

$$\sqrt{a^2 + b^2} = c$$

More generally, we can write this as the norm (You'll see this in machine learning papers):

$$\|X\|_2 = \sqrt{\sum{x_i^2}} = c$$

What if we want to work with points rather than distances? For points $\vec{x}: (x_1, x_2)$ and $\vec{y}: (y_1, y_2)$ we can write:

$$\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2} = c$$
or
$$\sqrt{\sum{(x_i - y_i)^2}} = c$$
or
$$\| \vec{x} - \vec{y} \| = c$$

> You may be more familiar with defining points as $(x, y)$ rather than $(x_1, x_2)$. However, in machine learning it is much more convenient to define each coordinate using the same base letter with a different subscript. This allows us to easily represent a 100-dimensional point, e.g., $(x_1, x_2, ..., x_{100})$. If we use the grade school method, we would soon run out of letters!
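
The distance formulas above translate almost directly into NumPy. This is a minimal sketch: the two 2-D points are hypothetical (chosen to form a 3-4-5 triangle), and the 100-dimensional points are random values generated only to show that the same code scales.

```python
import numpy as np

# Two hypothetical points in 2-D; the differences are 3 and 4.
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# Spell out the formula: square each coordinate difference, sum, take the root.
dist_by_formula = np.sqrt(np.sum((x - y) ** 2))

# np.linalg.norm computes the same L2 norm of the difference vector.
dist_by_norm = np.linalg.norm(x - y)

# The same call works unchanged for 100-dimensional points.
rng = np.random.default_rng(0)
p, q = rng.normal(size=100), rng.normal(size=100)
dist_100d = np.linalg.norm(p - q)

print(dist_by_formula, dist_by_norm, dist_100d)
```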

<a id="measures-of-dispersion-standard-deviation-and-variance"></a>
### Measures of Dispersion: Standard Deviation and Variance

Standard deviation (SD, $σ$ for population standard deviation, or $s$ for sample standard deviation) is a measure that is used to quantify the amount of variation or dispersion from the mean of a set of data values. A low standard deviation means that most of the numbers are close to the average. A high standard deviation means that the numbers are spread out.

Standard deviation is the square root of variance:

$$\text{variance} = s^2 = \frac {\sum{(x_i - \bar{x})^2}} {n-1}$$

$$s = \sqrt{\frac {\sum{(x_i - \bar{x})^2}} {n-1}}$$

> **Standard deviation** is often used because it is in the same units as the original data! By glancing at the standard deviation, we can immediately estimate how "typical" a data point might be by how many standard deviations it is from the mean. Furthermore, standard deviation is the only value that makes sense to visually draw alongside the original data.

> **Variance** is often used for efficiency in computations. The square root is a monotonically increasing function, so whatever minimizes the variance also minimizes the standard deviation. Dropping the square root can therefore simplify calculations (e.g., taking derivatives), particularly if we are using the variance for tasks such as optimization.

We can also write the population variance (dividing by $n$ rather than $n-1$) in standard mathematical notation:

$$\sigma^2 = \frac {\sum{(x_i - \bar{X})^2}} {n}$$

Or:

$$Var(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]$$

Or:

$$Var(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$$

The final equation is obtained by expanding the squared quantity in the previous equation and then simplifying the resulting terms using the linearity of expectation.
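
As a quick numerical check that these forms agree, the sketch below computes the population variance of a small array in three ways. The values are illustrative, and the array is treated as a complete population, so every version divides by $n$.

```python
import numpy as np

# Illustrative values, treated here as a complete population.
x = np.array([22.0, 38.0, 26.0, 35.0, 35.0])

mean = x.mean()
var_definition = np.mean((x - mean) ** 2)     # E[(X - E[X])^2]
var_shortcut = np.mean(x ** 2) - mean ** 2    # E[X^2] - E[X]^2
var_numpy = np.var(x)                         # ddof=0 by default: divides by n

# The sample variance divides by n - 1 instead, so it is slightly larger.
var_sample = np.var(x, ddof=1)

print(var_definition, var_shortcut, var_numpy, var_sample)
```

One thing to watch for: `np.var` divides by $n$ by default, while the pandas `.var()` method divides by $n-1$ by default.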

**That can be a lot to take in, so let's break it down in Python.**

#### Assign the first 5 rows of titanic age data to a variable:

In [None]:
# Take the first five rows of titanic age data.
first_five = titanic['age'].head()

#or
#first_five = titanic['age'][:5]

first_five

#### Calculate the mean by hand:

In [None]:
# Calculate mean by hand.
mean = (22+38+26+35+35)/5
#mean = sum(first_five)/5
mean

#### Calculate the variance by hand:

In [None]:
# Calculate the sample variance by hand (divide by n - 1 = 4)
(np.square(22 - mean) +
np.square(38 - mean) +
np.square(26 - mean) +
np.square(35 - mean) +
np.square(35 - mean)) / 4.0


#### Calculate the variance and the standard deviation using Pandas:

In [None]:
# Verify with Pandas
first_five.var()

In [None]:
# Standard deviation of age, in the same units as the data (years)
first_five.std()

In [None]:
np.sqrt(first_five.var())

A **quartile** is a type of **quantile**. Quartiles in statistics are values that divide your data into quarters. The **first quartile (Q1)** is defined as the middle number between the smallest number and the median of the data set. The **second quartile (Q2)** is the median of the data. The **third quartile (Q3)** is the middle value between the median and the highest value of the data set. 

**Quartiles** mark the value below which 25% of the data falls (Q1) and the value above which 25% of the data falls (Q3).

The four quarters that divide a data set into quartiles are:

1. The lowest 25% of numbers.
2. The next lowest 25% of numbers (up to the median).
3. The second highest 25% of numbers (above the median).
4. The highest 25% of numbers.

**Use the titanic passenger ages to calculate the first and third quartiles**

In [None]:
# Use pd.qcut() or the Series.quantile() method from pandas to find the 1st (Q1) and 3rd (Q3) quartiles

set(pd.qcut(titanic['age'], 4))

In [None]:
titanic['age'].quantile([.25,.5,.75, 1])

The **interquartile range (IQR)** is the difference between the upper (Q3) and lower (Q1) quartiles, and describes the middle 50% of values when ordered from lowest to highest. The IQR is often seen as a better measure of spread than the range because it is far less affected by outliers.

In [None]:
# describe() reports the 25% and 75% quantiles, the two values needed for the interquartile range
titanic['age'].describe()

In [None]:
# Interquartile range: Q3 - Q1
titanic['age'].quantile(.75) - titanic['age'].quantile(.25)
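
To illustrate why the IQR described above resists outliers while the range does not, here is a small sketch on made-up data. It uses `np.percentile` with NumPy's default linear interpolation, which is one of several quartile conventions.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
data_range = data.max() - data.min()

# Replace the largest value with an extreme outlier and recompute.
with_outlier = data.copy()
with_outlier[-1] = 900
q1_o, q3_o = np.percentile(with_outlier, [25, 75])
iqr_o = q3_o - q1_o
range_o = with_outlier.max() - with_outlier.min()

print(f"IQR:   {iqr} -> {iqr_o}")
print(f"range: {data_range} -> {range_o}")
```

The outlier changes the range dramatically but leaves the IQR where it was, because the quartiles depend only on the middle of the ordered data.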

<a id="our-first-model"></a>
## Our First Model
---

#### Mathematical models are tools to help us understand the world around us. They:
 1. Help to explain a system
 2. Allow us to study the effects of different components
 3. Grant the ability to make predictions about behaviour
 4. Give us an experimental tool for testing theories and assessing quantitative conjectures
 5. Provide us a process where their formulation clarifies assumptions, variables, and parameters
 
While all of that is helpful, it cannot be overstated that **models are not reality**; they are an extreme simplification of reality.

In this section, we will make a **mathematical model** of data. When we say **model**, we mean it in the same sense that a map is a **model** of the real world. Google Maps can get us to that restaurant without getting lost, but it can't tell us where each individual pothole is. This is good enough.

As another example, a toy car is a **model** of a real car. If we mainly care about appearance, the toy car is an excellent model. However, the toy car fails to accurately represent other aspects of the car. For example, we cannot use a toy car to test how the actual car would perform in a collision.

<img src="http://www.azquotes.com/picture-quotes/quote-all-models-are-wrong-but-some-are-useful-george-e-p-box-53-42-27.jpg">

### Example of a model
In data science, we might take a rich, complex person and model that person solely as a two-dimensional vector: _(age, smokes cigarettes)_. For example: $(90, 1)$, $(28, 0)$, and $(52, 1)$, where $1$ indicates "smokes cigarettes." This model of a complex person obviously fails to account for many things. However, if we primarily care about modeling health risk, it might provide valuable insight.

Now that we have superficially modeled a complex person, we might determine a formula that evaluates risk. For example, an older person tends to have worse health, as does a person who smokes. So, we might deem someone as having risk should `age + 50*smokes > 100`. 

This is a **mathematical model**, as we use math to assess risk. It could be mostly accurate. However, there are surely elderly people who smoke who are in excellent health.
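
The toy risk rule above is small enough to write out directly. In this sketch the function name `is_at_risk` is a hypothetical label; the rule and the three example people come from the text.

```python
def is_at_risk(age, smokes):
    """Apply the toy rule from the text: at risk when age + 50*smokes > 100."""
    return age + 50 * smokes > 100

# Each person is modeled as (age, smokes), where 1 means "smokes cigarettes".
people = [(90, 1), (28, 0), (52, 1)]
flags = [is_at_risk(age, smokes) for age, smokes in people]

for (age, smokes), flag in zip(people, flags):
    print(f"age={age}, smokes={smokes} -> at risk: {flag}")
```

Because the model sees only two numbers per person, it flags every 52-year-old smoker and no 95-year-old non-smoker, whatever their actual health. This is exactly the kind of simplification the text warns about.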


---

Let's make our first model from scratch. We'll use it to predict the `fare` column in the Titanic data. So what data will we use? Actually, none.

The simplest model we can build is an estimation of the mean, median, or most common value. If we have no feature matrix and only an outcome, this is the best approach to make a prediction using only empirical data. 

This seems silly, but we'll actually use it all the time to create a baseline of how well we do with no data and determine whether or not our more sophisticated models make an improvement.

You can find out more about dummy estimators [here](http://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators).

#### Get the `fare` column from the Titanic data and store it in variable `y`:

In [None]:
# Get the fare column from the Titanic data and store it as y:
y = titanic['fare']

#### Create predictions `y_pred` (in this case just the mean of `y`):

In [None]:
# Store predictions in y_pred:
y_pred = y.mean()
#y_pred = y.median()

In [None]:
y_pred

### Exercises:

#### 3. Baseline Comparisons

#### Find the average squared distance between each prediction and its actual value:

This is known as the mean squared error (MSE).

The **Mean Squared Error (MSE)** is a measure of how close a fitted line is to the data points. For every data point, you take the vertical distance from the point to the corresponding y value on the fitted curve (the error) and square it. The MSE is the average of those squared errors.

In [None]:
# Start with the raw errors: the prediction minus each actual value.
err = y_pred - y

err

In [None]:
# Ans: the average of the squared errors

mse = np.mean(np.square(err))

mse

The **Root Mean Squared Error (RMSE)** is just the square root of the mean squared error. That is probably the most easily interpreted statistic, since it has the same units as the quantity plotted on the vertical axis.

> Key point: The RMSE is thus roughly the typical distance of a data point from the fitted line, measured along a vertical line.

*The RMSE is directly interpretable in terms of measurement units*, making it an easier measure of goodness of fit to interpret than a correlation coefficient. One can compare the RMSE to the observed variation in measurements of a typical point. The two should be similar for a reasonable fit.

#### Calculate the root mean squared error (RMSE), the square root of the MSE:

In [None]:
# print(np.sqrt(np.mean(np.square(y-y_pred))))
np.sqrt(mse)
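
As a sketch of how a constant baseline behaves, the snippet below compares the mean and the median as predictions on a small, made-up set of right-skewed values (not the Titanic fares). The helper name `rmse` is an assumption for illustration.

```python
import numpy as np

# Hypothetical, right-skewed "fares" (not the Titanic data).
y = np.array([7.0, 8.0, 8.0, 10.0, 13.0, 26.0, 30.0, 80.0, 250.0])

def rmse(y_true, y_pred):
    """Root mean squared error between actual values and predictions."""
    return np.sqrt(np.mean(np.square(y_true - y_pred)))

rmse_mean = rmse(y, y.mean())         # baseline 1: always predict the mean
rmse_median = rmse(y, np.median(y))   # baseline 2: always predict the median

print(f"RMSE predicting the mean:   {rmse_mean:.2f}")
print(f"RMSE predicting the median: {rmse_median:.2f}")
print(f"Standard deviation of y:    {np.std(y):.2f}")
```

Among all constant predictions, the mean gives the lowest MSE, and its RMSE equals the population standard deviation of `y` (`np.std` with its default `ddof=0`). Any more sophisticated model should beat this number to be worth its added complexity.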

<a id="a-preface-on-modeling"></a>
### A Preface on Modeling
---
As we venture down the path of modeling, it can be difficult to determine which choices are "correct" or "incorrect".  A primary challenge is to understand how different models will perform in different circumstances and different types of data. It's essential to practice modeling on a variety of data.

As a beginner it is essential to learn which metrics are important for evaluating your models and what they mean. The metrics we evaluate our models with inform our actions.  

*Exploring datasets on your own with the skills and tools you learn in class is highly recommended!*

<a id='documentation'></a>

## Digging into Documentation

---

Get familiar with looking things up in documentation. As the class progresses, it will be impossible to cover even 50% of what the libraries we use can do. Two sets of documentation matter in particular: `sklearn` and `statsmodels`. You are going to be reading documentation a lot over the course of the class and beyond.

[The statsmodels documentation can be found here.](http://statsmodels.sourceforge.net/devel/) Many recommend using the bleeding-edge version of statsmodels. [For that you can reference the code on github.](https://github.com/statsmodels/statsmodels/)

[The sklearn documentation can be found here.](http://scikit-learn.org/stable/documentation.html)

The packages have fairly different approaches and syntax for constructing models. Below are examples for linear regression in each package:
* [Linear regression in statsmodels](http://statsmodels.sourceforge.net/devel/examples/#regression)
* [Linear regression in scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

If you haven't yet, familiarize yourself with the format of the documentation.

<a id="a-short-introduction-to-model-bias-and-variance"></a>
## A Short Introduction to Model Bias and Variance 

---

- **Objective:** Describe the bias and variance of statistical estimators.

In simple terms, **bias** shows how accurate a model is in its predictions. (It has **low bias** if it hits the bullseye!)

**Variance** shows how reliable a model is in its performance. (It has **low variance** if the points are predicted consistently!)

These characteristics have important interactions, but we will save that for later.

![Bias and Variance](assets/images/biasVsVarianceImage.png)

Remember how we just calculated mean squared error to determine the accuracy of our prediction? It turns out we can do this for any statistical estimator, including mean, variance, and machine learning models.

We can even decompose mean squared error to identify the sources of error: reducible error and irreducible error.

* Irreducible error, or inherent uncertainty, is associated with natural variability in a system.
* Reducible error is error that we can, and should, address to maximize accuracy. Given what we're talking about, it shouldn't surprise you to learn that its components are **error due to squared bias** and **error due to variance**.

### Primer on Variance/Bias Tradeoff
Models that exhibit small variance and high bias *underfit* the target. Models that exhibit high variance and low bias *overfit* the target. Both prevent us from making strong predictions.

![Over/underfit](assets/images/underoverfit.png)

The **“tradeoff”** between bias and variance can be viewed in this manner: a learning algorithm with low bias must be “flexible” so that it can fit the data well. But if the learning algorithm is too flexible (for instance, a very high-degree polynomial), it will fit each training data set differently, and hence have high variance. A key characteristic of many supervised learning methods is a built-in way to control the bias-variance tradeoff either automatically or by providing a special parameter that the data scientist can adjust.


Note that if your target truth is highly nonlinear, and you select a linear model to approximate it, then you’re introducing a bias resulting from the linear model’s inability to capture nonlinearity. In fact, your linear model is underfitting the nonlinear target function over the training set. Likewise, if your target truth is linear, and you select a nonlinear model to approximate it, then you’re introducing a bias resulting from the nonlinear model’s inability to be linear where it needs to be. In fact, the nonlinear model is overfitting the linear target function over the training set.

In the figure below, we see a plot of the model’s performance using prediction capability on the vertical axis as a function of model complexity on the horizontal axis. Here, we depict the case where we use a number of different orders of polynomial functions to approximate the target function. Shown in the figure are the calculated square bias, variance, and error on the test set for each of the estimator functions.

We see that as the model complexity increases, the variance slowly increases and the squared bias decreases. This points to the tradeoff between bias and variance due to model complexity, i.e. models that are too complex tend to have high variance and low bias, while models that are too simple will tend to have high bias and low variance. The best model will have both low bias and low variance. 
![bias_variance_tradeoff](assets/images/Bia_variance_tradeoff_fig.jpg)


[Primer Sourced from: The Clever Machine reference](https://theclevermachine.wordpress.com/2013/04/21/model-selection-underfitting-overfitting-and-the-bias-variance-tradeoff/)
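
The tradeoff described in this primer can be reproduced with a small simulation. This is a sketch under stated assumptions, not the setup behind the figure above: the "truth" is taken to be a sine curve, the noise level is 0.3, and polynomial degrees 1, 3, and 8 stand in for simple, moderate, and complex models.

```python
import numpy as np

rng = np.random.default_rng(42)

def true_f(x):
    """Hypothetical nonlinear 'truth' used only for this simulation."""
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)
x_test = np.linspace(0.05, 0.95, 50)
n_datasets, noise_sd = 200, 0.3

def bias2_and_variance(degree):
    """Fit one polynomial per simulated training set and summarize test predictions."""
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        y_train = true_f(x_train) + rng.normal(0, noise_sd, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[i] = np.polyval(coefs, x_test)
    avg_pred = preds.mean(axis=0)
    bias2 = np.mean((avg_pred - true_f(x_test)) ** 2)  # squared bias, averaged over x_test
    variance = np.mean(preds.var(axis=0))              # prediction variance, averaged over x_test
    return bias2, variance

results = {degree: bias2_and_variance(degree) for degree in (1, 3, 8)}
for degree, (bias2, variance) in results.items():
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```

The straight line (degree 1) should show the largest squared bias and the smallest variance, while the degree-8 polynomial should show the reverse.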

<a id="bias-variance-decomposition"></a>
### Bias-Variance Decomposition

In the following notation, $f$ refers to a perfect model, while $\hat{f}$ refers to our model.

**Bias**

Error caused by bias is calculated as the difference between the expected prediction of our model and the correct value we are trying to predict:

$$Bias = E[\text{our estimate}] - (\text{the truth})$$

**Variance**

Error caused by variance is taken as the variability of a model prediction for a given point:

$$Variance = E[\left((\text{our estimate}) - (\text{average estimate})\right)^2]$$

**Mean Squared Error**
$$ MSE = Variance + Bias^2 + \text{irreducible error}$$

> The MSE is actually composed of three sources of error: the **variance**, the **bias**, and some **irreducible error** that the model can never remove given the available features.

This topic will come up again, but for now it's enough to know that we can decompose MSE into the bias of the estimator and the variance of the estimator.

<a id="example-using-bessels-correction"></a>
### Discussion of Bessel's Correction

It's rarely practical to measure every single item in a population to gather a statistic. We will usually sample a few items and use those to infer a population value.

For example, to learn about the heights in a class of 200 students, rather than measuring everyone, we select students at random and use them to estimate the average height in the class and the variance of the height in the class.

We know we can take the mean as follows:

$$E[X] = \bar{X} =\frac 1n\sum_{i=1}^nx_i$$

What about the variance?

Intuitively and by definition, population variance looks like this (the average squared distance from the mean):

$$\frac {\sum{(x_i - \bar{X})^2}} {n}$$

It's actually better to use the following for a sample (why?):

$$\frac {\sum{(x_i - \bar{X})^2}} {n-1}$$

In some cases, we may even use:

$$\frac {\sum{(x_i - \bar{X})^2}} {n+1}$$

Detailed explanations can be found here:

- [Bessel correction](https://en.wikipedia.org/wiki/Bessel%27s_correction).
- [Mean Squared Error](https://en.wikipedia.org/wiki/Mean_squared_error).
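
A simulation can hint at the answer to the "why?" above. The sketch below draws many small samples from a normal population with a known variance and compares the three divisors. The population (normal, variance 4), the sample size (5), and the number of trials are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0              # known population variance (standard deviation 2)
n, n_trials = 5, 200_000    # many small samples of size n

samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(n_trials, n))

# Sum of squared deviations from each sample's own mean.
ss = np.sum((samples - samples.mean(axis=1, keepdims=True)) ** 2, axis=1)

bias, mse = {}, {}
for name, divisor in [("n - 1", n - 1), ("n", n), ("n + 1", n + 1)]:
    estimates = ss / divisor
    bias[name] = estimates.mean() - true_var           # average error of the estimator
    mse[name] = np.mean((estimates - true_var) ** 2)   # average squared error
    print(f"divide by {name}: bias = {bias[name]:+.3f}, MSE = {mse[name]:.3f}")
```

Dividing by $n-1$ gives an estimate that is right on average (bias near zero), while dividing by $n$ systematically underestimates the variance. For normally distributed data, dividing by $n+1$ is biased but has the lowest mean squared error of the three, which is one reason it is sometimes used.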

Let's show an example of computing the variance by hand.

Suppose we have the following data:

$$X = [1, 2, 3, 4, 4, 10]$$

First, we compute its mean: 

$$\bar{X} = (1/6)(1 + 2 + 3 + 4 + 4 + 10) = 4$$

Because this is a sample of data rather than the full population, we'll use the second formula. Let's first "mean center" the data:

$$X_{centered} = X - \bar{X} = [-3, -2, -1, 0, 0, 6]$$

Now, we'll just find the average squared distance of each point from the mean:

$$variance = \frac {\sum{(x_i - \bar{X})^2}} {n-1} = \frac {(-3)^2 + (-2)^2 + (-1)^2 + 0^2 + 0^2 + 6^2}{6-1} = \frac{14 + 36}{5} = 10$$

So, the **variance of $X$** is $10$. However, we cannot compare this directly to the original data because it is in the original units squared. Instead, we use the **standard deviation of $X$**, $\sqrt{10} \approx 3.16$, to see that the data point $10$ lies $6$ units from the mean of $4$, which is more than one standard deviation. So, we can conclude it is somewhat far from most of the points (more on what it really might mean later).

---

A variance of zero means there is no spread. If we take instead $X = [1, 1, 1, 1]$, then clearly the mean $\bar{X} = 1$. So, $X_{centered} = [0, 0, 0, 0]$, which directly leads to a variance of 0. (Make sure you understand why! Remember that variance is the average squared distance of each point from the mean.)

<a id="correlation-and-association"></a>
## Correlation and Association
---

- **Objective:** Describe characteristics and trends in a data set using visualizations.

Correlation measures how variables are related to each other.

Typically, we talk about the Pearson correlation coefficient — a measure of **linear** association.

We refer to perfect correlation as **collinearity**.

The following are a few correlation coefficients. Note that if both variables trend upward, the coefficient is positive. If one trends opposite the other, it is negative. 

It is important that you always look at your data visually — the coefficient by itself can be misleading:

![Example correlation values](./assets/images/correlation_examples.png)
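
To see why the coefficient alone can mislead, here is a small sketch using `np.corrcoef` on made-up data: two perfectly linear relationships and one perfectly dependent but curved relationship.

```python
import numpy as np

x = np.linspace(-1, 1, 21)

y_linear = 2 * x + 1     # trends upward with x
y_negative = -3 * x      # trends opposite to x
y_curved = x ** 2        # completely determined by x, but not linearly

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_negative = np.corrcoef(x, y_negative)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]

print(f"linear:   r = {r_linear:+.3f}")
print(f"negative: r = {r_negative:+.3f}")
print(f"curved:   r = {r_curved:+.3f}")
```

The curved relationship has a correlation of essentially zero even though `y_curved` is fully determined by `x`. The Pearson coefficient measures only linear association, which is why plotting the data matters.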

<a id="codealong-correlation-in-pandas"></a>
### Code-Along: Correlation in Pandas

**Objective:** Explore options for measuring and visualizing correlation in Pandas.

#### Display the correlation matrix for all Titanic variables:

In [None]:
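# A: numeric_only=True skips text columns, which raise an error in pandas 2.0+ otherwise.
titanic.corr(numeric_only=True)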

#### Use Seaborn to plot a heat map of the correlation matrix:

The `sns.heatmap()` function will accomplish this.

- Generate a correlation matrix from the Titanic data using the `.corr()` method.
- Pass the correlation matrix into `sns.heatmap()` as its only parameter.

In [None]:
sns.heatmap(titanic.corr(numeric_only=True))

In [None]:
# Take a closer look at the survived and fare variables using a scatter plot
titanic.plot(kind='scatter', x='survived', y = 'fare');

# Is correlation a good way to inspect the association of fare and survival?

<a id="the-normal-distribution"></a>
## The Normal Distribution
---

- **Objective:** Identify a normal distribution within a data set using summary statistics and data visualizations.

- What is an event space?
  - A listing of all possible occurrences.
- What is a probability distribution?
  - A function that describes how likely each event in an event space is.
- What are general properties of probability distributions?
  - The probability of every event is between 0 and 1.
  - All events in the event space combined have probability 1.
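
These properties can be checked on a tiny example. The sketch below assumes a fair six-sided die as the event space and uses exact fractions so that the probabilities sum to exactly 1.

```python
from fractions import Fraction

# Event space for one roll of a fair six-sided die, with each outcome's probability.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Property 1: every probability lies between 0 and 1.
all_between_0_and_1 = all(0 <= p <= 1 for p in die.values())

# Property 2: the probabilities of all outcomes together sum to 1.
total_probability = sum(die.values())

# The probability of a compound event is the sum over the outcomes it contains.
p_even = sum(p for face, p in die.items() if face % 2 == 0)

print(all_between_0_and_1, total_probability, p_even)
```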
  

<a id="what-is-the-normal-distribution"></a>
### What is the Normal Distribution?
- A normal distribution is often a key assumption of many models.
  - In practice, if the normal distribution assumption is not met, it's not the end of the world. Your model is just less efficient in most cases.

- The normal distribution is **completely summarized by its mean and standard deviation**.

- The **mean** controls its **center**.

- The **standard deviation** controls how **spread out** it is.

- Normal distributions are **symmetric, bell-shaped curves**.

![normal distribution](assets/images/normal.png)


#### Why do we care about normal distributions?

- They often show up in nature.
- Aggregated processes tend to distribute normally, regardless of their underlying distribution (**Central Limit Theorem**)
    - The **Central Limit Theorem** states that the sampling distribution of the sample means approaches a normal distribution as the sample size gets larger — no matter what the shape of the population distribution. ([More Info](https://www.analyticsvidhya.com/blog/2019/05/statistics-101-introduction-central-limit-theorem/))
    
- They offer effective simplification that makes it easy to make approximations.
- They can improve our machine learning algorithms.
<br>
Machine learning algorithms are usually designed to cope with whatever distributions are present in the features. Even when transforming those distributions isn't necessary for an algorithm to work properly, it can still be beneficial for these reasons:


* To help the cost function do a better job of minimizing the prediction error
* To help the algorithm converge properly and faster

We'll discuss various ways to transform or rescale data (e.g., normalization, standardization) later in the course.
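
To see the Central Limit Theorem described above in action, the sketch below draws from a strongly right-skewed population and looks at the distribution of sample means. The exponential population, the sample size of 50, and the hand-written `skewness` helper are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def skewness(a):
    """Third standardized moment: about 0 for a symmetric distribution."""
    a = np.asarray(a, dtype=float)
    z = (a - a.mean()) / a.std()
    return np.mean(z ** 3)

# A right-skewed population: exponential with mean 1 and standard deviation 1.
raw = rng.exponential(scale=1.0, size=100_000)

# 10,000 sample means, each computed from n = 50 draws.
n = 50
sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

print(f"skewness of raw draws:    {skewness(raw):.2f}")
print(f"skewness of sample means: {skewness(sample_means):.2f}")
print(f"std of sample means:      {sample_means.std():.3f}")
print(f"sigma / sqrt(n):          {1 / np.sqrt(n):.3f}")
```

The raw draws are heavily skewed, but the sample means are much closer to symmetric, and their spread is close to the population standard deviation divided by $\sqrt{n}$.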

#### Plot a histogram of 1,000 samples from a random normal distribution:

The `np.random.randn(numsamples)` function will draw from a random normal distribution with a mean of 0 and a standard deviation of 1.

- To plot a histogram, pass a NumPy array with 1000 samples as the only parameter to `plt.hist()`.
- Change the number of bins using the keyword argument `bins`, e.g. `plt.hist(mydata, bins=50)`

In [None]:
# Plot a histogram of 1,000 random normal samples from NumPy.
np.random.seed(10)
samples = np.random.randn(1000)

In [None]:
plt.hist(samples, bins=50);

<a id="skewness"></a>
###  Skewness
- Skewness is a measure of the asymmetry of the distribution of a random variable about its mean.
- Skewness can be positive or negative, or even undefined.
- Notice that the mean, median, and mode are the same when there is no skew.

![skewness](assets/images/skewness---mean-median-mode.jpg)

#### Plot a lognormal distribution generated with NumPy.

Take 1,000 samples using `np.random.lognormal(size=numsamples)` and plot them on a histogram.

In [None]:
# Plot a lognormal distribution generated with NumPy

log_samples = np.random.lognormal(size=1000)
plt.hist(log_samples, bins=100);

In [None]:
titanic['fare'].skew()

In [None]:
titanic['fare'].hist(bins=20)

In [None]:
df = pd.DataFrame(samples)
df.skew()

#### Real-World Application: When Mindfulness Beats Complexity
- Skewness is surprisingly important.
- Most algorithms implicitly use the mean by default when making approximations.
- If you know your data is heavily skewed, you may have to either transform your data or set your algorithms to work with the median.
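
One common transformation for right-skewed, strictly positive data is the logarithm. The sketch below applies it to synthetic lognormal data and compares skewness before and after with pandas' `.skew()`. The data are made up for illustration, and the log transform is only one of several options, not necessarily the right choice for every data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, strictly positive, right-skewed data (assumed for illustration).
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

# np.log requires strictly positive values; np.log1p is a common
# alternative when the data contain zeros.
transformed = np.log(raw)

print("Skew before transform:", round(raw.skew(), 3))
print("Skew after transform: ", round(transformed.skew(), 3))
print("Mean vs. median before:", round(raw.mean(), 3), round(raw.median(), 3))
```

After the transform the skewness should be close to zero, so mean-based methods become more representative of a typical value.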

<a id="kurtosis"></a>
### Kurtosis
- Kurtosis is a measure of how heavy the tails of a distribution are, relative to a normal distribution.
- Data sets with high kurtosis tend to have heavy tails (more extreme values, or outliers); they often also show a distinct peak near the mean that declines rather rapidly.

![kurtosis](assets/images/kurtosis.jpg)

[Wikipedia](https://en.wikipedia.org/wiki/Kurtosis) includes additional pictures and explanations that may help drive this concept home.

In [None]:
titanic['fare'].kurtosis()
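
When reading the number above, it helps to know that pandas' `.kurtosis()` reports *excess* kurtosis (Fisher's definition), so a normal distribution scores about 0. The following sketch compares three synthetic samples; the choice of distributions and the sample size are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000

normal = pd.Series(rng.normal(size=n))            # reference: excess kurtosis near 0
laplace = pd.Series(rng.laplace(size=n))          # heavier tails than the normal
uniform = pd.Series(rng.uniform(-1, 1, size=n))   # lighter tails than the normal

for name, data in [("normal", normal), ("laplace", laplace), ("uniform", uniform)]:
    print(f"{name}: excess kurtosis = {data.kurtosis():.3f}")
```

A clearly positive value therefore signals heavier tails than a normal distribution, and a negative value signals lighter tails.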

####  Real-World Application: Risk Analysis
- Long-tailed distributions with high kurtosis elude intuition; we naturally think the event is too improbable to pay attention to.
- It's often the case that there is a large cost associated with a low-probability event, as is the case with hurricane damage.
- It's unlikely you will get hit by a Category 5 hurricane, but when you do, the damage will be catastrophic.
- Pay attention to what happens at the tails and whether this influences the problem at hand.
- In these cases, understanding the costs may be more important than understanding the risks.
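
A small arithmetic sketch can make the last point concrete: multiplying a probability by a cost gives an expected cost, and a rare event can dominate once its cost is large enough. Every number below is invented purely for illustration and does not describe real hurricane or flood risk.

```python
# Hypothetical events with made-up annual probabilities and costs.
events = {
    "minor flooding": {"annual_probability": 0.20, "cost": 10_000},
    "category 5 hurricane": {"annual_probability": 0.002, "cost": 5_000_000},
}

# Expected annual cost = probability of the event * cost if it happens.
expected_costs = {
    name: event["annual_probability"] * event["cost"]
    for name, event in events.items()
}

for name, expected in expected_costs.items():
    print(f"{name}: expected annual cost = ${expected:,.0f}")
```

In this made-up scenario the hurricane is 100 times less likely than the flooding, yet its expected annual cost is higher because the damage is so much larger.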

<a id="determining-the-distribution-of-your-data"></a>
## Determining the Distribution of Your Data
---

**Objective:** Create basic data visualizations, including scatterplots, box plots, and histograms.

![](./assets/images/distributions.png)

#### Use the `.hist()` function of your Titanic DataFrame to plot histograms of all the variables in your data.

- The function `plt.hist(data)` calls the Matplotlib library directly.
- However, each DataFrame has its own `hist()` method that by default plots one histogram per numeric column. 
- Given a DataFrame `my_df`, it can be called like this: `my_df.hist()`. 

In [None]:
# Plot all numeric variables in the Titanic data set using histograms:
titanic.hist(figsize=(12, 8));

In [None]:
# make observations on the distributions for each column

#### Use the built-in `.plot.box()` function of your Titanic DataFrame to plot box plots of your variables.

- Given a DataFrame, a box plot can be made where each column is one tick on the x axis.
- To do this, it can be called like this: `my_df.plot.box()`.
- Try using the keyword argument `showfliers`, e.g. `showfliers=False`.

In [None]:
# Plotting all histograms can be unwieldy; box plots can be more concise:
titanic[['age', 'fare']].plot.box(showfliers=False)
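
The points hidden by `showfliers=False` are the ones that fall beyond the whiskers, which Matplotlib by default limits to 1.5 times the inter-quartile range (IQR) beyond the quartiles. As a minimal sketch of that rule, the code below computes the fences by hand on a tiny made-up Series (the values are hypothetical, not Titanic data).

```python
import pandas as pd

# Made-up values with one obvious extreme point.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are drawn as fliers in a default box plot.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
fliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print("Fliers:", fliers.tolist())
```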

<a id="exercise"></a>
### Exercise

1. Look at the Titanic data variables.
    - Are any of them normal?
    - Are any skewed?
    - How might this affect our modeling?

In [None]:
# Work on your answers here!


![](./assets/images/visualization_flow_chart.jpg)

<a id="topic-review"></a>
## Lesson Review
---

- We covered several different types of summary statistics. What are they?
- We covered three different types of visualizations. Which ones?
- Describe bias and variance and why they are important.
- What are some important characteristics of distributions?

**Any further questions?**

<a id="lab-stats-dim"></a>
## Extra Lab: Using Stats for Feature Reduction
---

- In this section, we will apply four of the techniques from a PyData DC 2016 talk, ["A Practical Guide to Dimensionality Reduction"](https://pyvideo.org/pydata-dc-2016/a-practical-guide-to-dimensionality-reduction-techniques.html).

- Your solutions do not have to be fully automated!

In [None]:
chicago_df = pd.read_csv('./datasets/chicago.csv')

# Display the first 3 rows of the chicago dataframe
chicago_df.head(3)


### 1. Percent missing values

The presenter suggests dropping features when more than 95% of the values are missing.

#### 1a. For each column, what percentage of values is missing? 
- For this exercise, suppose only `np.nan` indicates missing.

In [None]:
# Percentage of missing values per column (isna().mean() is the fraction missing).
chicago_df.isna().mean() * 100


#### 1b. For each column with missing values, create an indicator column that is `True` if missing and `False` otherwise. 

- Make the column name the original followed by `_Missing`. For example, `Address` would become `Address_Missing`.
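
One possible way to build such indicator columns is sketched below on a tiny made-up DataFrame rather than on `chicago_df`. The values are invented, and the approach shown is just one option among several.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the values are made up.
toy_df = pd.DataFrame({
    "Address": ["1 Main St", np.nan, "3 Oak Ave"],
    "Price": [250000.0, 310000.0, np.nan],
    "Age": [10, 25, 40],
})

# Only columns that actually contain missing values get an indicator.
cols_with_missing = [col for col in toy_df.columns if toy_df[col].isna().any()]

for col in cols_with_missing:
    toy_df[f"{col}_Missing"] = toy_df[col].isna()

print(toy_df)
```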

### 2. Amount of variation

#### 2a. What is the variance of each numeric column?

- Drop any columns that have zero variance. 
- Are there any non-numeric columns with zero variance?
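
As a hedged sketch of both questions, the code below works on a small made-up DataFrame. The `Floors` and `City` columns are hypothetical, and treating a non-numeric column with a single distinct value as "zero variance" is an assumption of this example.

```python
import pandas as pd

# Toy data for illustration only.
toy_df = pd.DataFrame({
    "HouseSizeSqft": [1200, 1500, 1800, 2100],
    "Floors": [2, 2, 2, 2],       # constant numeric column
    "City": ["Chicago"] * 4,      # constant non-numeric column
})

# Variance of each numeric column.
variances = toy_df.var(numeric_only=True)
zero_var_numeric = variances[variances == 0].index.tolist()

# Non-numeric columns have no variance, but a single distinct value
# carries just as little information.
non_numeric = toy_df.select_dtypes(exclude="number")
constant_non_numeric = [
    col for col in non_numeric.columns
    if non_numeric[col].nunique(dropna=False) <= 1
]

reduced = toy_df.drop(columns=zero_var_numeric)

print(variances)
print("Zero-variance numeric columns:", zero_var_numeric)
print("Constant non-numeric columns:", constant_non_numeric)
```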

### 3. Pairwise correlation

#### 3a. Which pairs of features are highly correlated?
- For this exercise, treat an absolute correlation of 0.65 or higher as highly correlated.
- Keep in mind that -0.8 and 0.8 both indicate high correlation.

In [None]:
# Age/LotSizeSqft
# HouseSizeSqft/LotSizeSqft
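
If you would rather not read the pairs off a correlation matrix by eye, one way to list them is sketched here on synthetic data. The column names `a` through `d` and the way the data are generated are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)

# Synthetic features: "b" and "c" are built from "a"; "d" is unrelated.
toy_df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=n),
    "c": -a + rng.normal(scale=0.1, size=n),   # strong negative correlation
    "d": rng.normal(size=n),
})

# Absolute correlations, so -0.8 and 0.8 are treated alike.
corr = toy_df.corr().abs()

# Keep only the upper triangle so each pair appears once and the
# diagonal (every feature with itself) is ignored.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
high_pairs = pairs[pairs >= 0.65]

print(high_pairs)
```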


#### 3b. For each pair, drop the feature that is less correlated with the target (Price).

In [None]:
# Age/LotSizeSqft -
# HouseSizeSqft/LotSizeSqft -



### 4. Correlation with the target

#### 4a. Which features are weakly correlated with the target (Price)?

- For this exercise, treat an absolute correlation below 0.25 as weak.

#### 4b. Plot the absolute values of the correlations in descending order using a line plot. 

- Is there an "elbow" in the curve where the correlations flatten out or do not drop as steeply?
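
A minimal sketch of steps 4a and 4b on synthetic data is shown below. The feature names, the `target` column, and the coefficients used to generate the data are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic features with different strengths of relationship to the target.
toy_df = pd.DataFrame({
    "strong": rng.normal(size=n),
    "moderate": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy_df["target"] = 2 * toy_df["strong"] + toy_df["moderate"] + rng.normal(size=n)

# Absolute correlation of each feature with the target, largest first.
abs_corr = (
    toy_df.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)

weak_features = abs_corr[abs_corr < 0.25].index.tolist()

print(abs_corr)
print("Weakly correlated with the target:", weak_features)
```

Calling `abs_corr.plot()` on the sorted Series would then draw the values as a line, which is where an elbow, if there is one, should become visible.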