
#### Copyright 2019 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Statistical Analysis of Data

**Statistics** are numbers that can be used to describe or summarize data. Strictly speaking, a statistic is computed from a sample (for example, the sample mean), while the corresponding quantity for a whole population or distribution (for example, its expected value) is called a parameter. Being able to perform sound statistical analysis is crucial for a data scientist, and in this lesson we will outline a few key statistical concepts:

* Statistical Sampling
  * Sampling vs. Population
  * Simple Random Sample (SRS)
  * Sample Bias
* Variables and Measurements
* Measures of Center
  * Mean
  * Median
  * Mode
* Measures of Spread
  * Variance and Standard Deviation
  * Standard Error
* Distributions
* Coefficient of Determination ($R^2$)
* Correlation Coefficient (Pearson's $r$)
* Hypothesis Testing


### Load Packages

In [None]:
%matplotlib inline
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

## Statistical Sampling

### Sampling vs. Population

*What is the difference between a sample and a population?*  

You can think of a sample as its own population, which is just a subset of the global population.  You could imagine a biologist tagging some sample of birds, tracking their movements with GPS, and using that data to make inferences about the patterns of the global population of a species.  

An important assumption in statistics is that an unbiased sample is drawn from the same distribution as the population it represents. If the population is approximately normal, we can check whether a sample is consistent with a known population mean using a one-sample t-test, a statistical method that compares a sample mean to a population mean.

### Simple Random Sample

A **simple random sample (SRS)** is one of the most common statistical sampling techniques. To get an SRS, we take a random subset of the data, without replacement. An SRS is unbiased because every member of the population has an equal opportunity to be selected. True randomness does not exist computationally, so we must use pseudorandom functions which, for most common statistical applications, will suffice as a statistically random method. 
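
As a minimal sketch of the idea above, the snippet below draws an SRS from a synthetic population using pandas. The population, its column names (`member_id`, `value`), and the seed are assumptions made up for illustration; only `DataFrame.sample` with `replace=False` is the actual technique.

```python
import numpy as np
import pandas as pd

# Hypothetical population of 1,000 members with one measured value each
rng = np.random.default_rng(seed=42)
population = pd.DataFrame({
    "member_id": np.arange(1000),
    "value": rng.normal(loc=50, scale=10, size=1000),
})

# Simple random sample: 100 rows, each row equally likely, no replacement
srs = population.sample(n=100, replace=False, random_state=42)

# Without replacement, no member can appear twice in the sample
print("all sampled ids unique:", srs["member_id"].is_unique)
print("population mean:", population["value"].mean())
print("sample mean:", srs["value"].mean())
```

Fixing `random_state` makes the pseudorandom draw reproducible, which is usually what you want when sharing an analysis.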

### Sample Bias

Bias, as with a weighted coin that falls on heads more often, can be present in many stages of an experiment or data analysis. Some biases, like **selection bias**, are easy to detect. For example, a sample obtained from the Census Bureau in 2010 collected information on residents across the United States. Surely not every resident responded to their requests, so the ones who did are assumed to be a representative sample. This experiment has some selection bias, however, since those who respond to the census tend to be at home during the day, which means they are more likely to be either very young or very old. Another example is political polling by phone; those at home ready to answer the phone tend to be older, yielding a biased sample of voters.

**Confirmation bias**, a form of cognitive bias, can affect online and offline behavior. Those who believe that the earth is flat are more likely to share misinformation that supports their flat-earth theory rather than facts which dispel the claim. Picking up on this preference, YouTube's algorithm surfaces more flat earth video suggestions to those who've watched at least one. Such video suggestions then feed back into the users' confirmation bias.

There are other types of bias which may further confound an experiment or a data collection strategy. These biases are beyond the scope of this course but should be noted. Here's an [exhaustive list of cognitive biases](https://en.wikipedia.org/wiki/List_of_cognitive_biases). Data scientists of all skill levels can experience pitfalls in their design and implementation strategies if they are not aware of the source of some bias in their experiment design or error in their data sources or collection strategies.






## Variables and Measurements

We have already learned about programming data types, like string, integer, and float. These data types make up variable types that are categorized according to their measurement scales. We can start to think about variables divided into two groups: **numerical**, and **categorical**. 

### Numerical Variables
Numerical data can be represented by both numbers and strings, and it can be further subdivided into discrete and continuous variables. 

**Discrete** data is anything that can be counted, like the number of user signups for a web app or the number of waves you caught surfing.

Conversely, **continuous** data cannot be counted and must instead be measured. For example, the finish times of a 100m sprint and the waiting time for a train are continuous variables.

### Categorical Variables
Categorical data can take the form of either strings or integers. However, these integers have no numerical value; they are purely a minimal labeling convention.

***Nominal*** data is labeled without any specific order. In machine learning, these categories would be called classes, or levels. A feature can be binary (containing only two classes) or multicategory (containing more than two classes). In the case of coin flip data, each flip is either heads or tails, because the coin cannot land on both or neither. An example of multicategory data is the six kingdoms of life (animals, plants, fungi, protists, archaebacteria, and eubacteria).

***Ordinal*** data is categorical data where the order has significance, or is ranked. This could be Uber driver ratings on a scale of 1 to 5 stars, or gold, silver, and bronze Olympic medals. It should be noted that the differences between adjacent levels of ordinal data are often assumed to be equivalent, when in reality they may not be. For example, the perceived difference between bronze and silver may be different from the difference between silver and gold.

## Measures of Center

Central tendency is the point around which most of the data in a dataset is clustered. Some measures of central tendency include the mean (sometimes referred to as the **average**), the median, and the mode.

Note that the mean and median only apply to numerical data.

The **mean** is easy to calculate; it is the sum of a sequence of numbers divided by the number of samples. The mean is not robust to outliers (that is, it can be strongly affected by a few data points that are out of the ordinary), and if your data is heavily skewed then the mean may not be a good measure of central tendency.

The **median** is the middle data point in a sorted series. If your set contains four samples, the median is halfway between the 2nd and 3rd data points. If your set contains five samples, the median is the 3rd data point. The median can often be close to the mean, but it is more robust to outliers.

The **mode** is the most commonly occurring data point in a series. The mode is especially useful for categorical data and is rarely meaningful for continuous data, where exact values seldom repeat. Sometimes there is no mode, which indicates that all of the data points are unique. Sometimes a sample can be multimodal, meaning several values are tied for the most frequent. The mode shows which values occur most often, which can also point to a possible source of error.
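
To make the robustness claims concrete, here is a small sketch comparing the three measures on a made-up series, with and without an outlier. The numbers are invented for illustration.

```python
import pandas as pd

values = pd.Series([2, 3, 3, 4, 5])
with_outlier = pd.Series([2, 3, 3, 4, 50])  # last point replaced by an outlier

for name, series in [("original", values), ("with outlier", with_outlier)]:
    print(name,
          "| mean:", series.mean(),
          "| median:", series.median(),
          "| mode:", series.mode().tolist())
```

The mean moves from 3.4 to 12.4 when the outlier is introduced, while the median and the mode both stay at 3.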


## Measures of Spread

### Variance and Standard Deviation
The population variance, $\sigma^{2}$, is defined as follows:

$$ \sigma^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2$$

where:

- $N$ is the population size
- The $x_i$ are the population values
- $\mu$ is the population mean

The population standard deviation $\sigma$ is the square root of $\sigma^2$.

Data scientists typically talk about variance in the context of variability: how far, on average, the points in a dataset lie from their mean.

The sample variance $s^2$ is as follows:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}$$

where:

- $n$ is the sample size
- The $x_i$ are the sample values
- $\bar{x}$ is the sample mean

The sample standard deviation $s$ is the square root of $s^2$.
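
The two definitions differ only in the divisor ($N$ versus $n-1$). As a sketch on made-up data, NumPy exposes this choice through the `ddof` ("delta degrees of freedom") argument:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # illustrative values, mean is 5

# Population formulas divide by N (ddof=0 is NumPy's default)
population_variance = np.var(data)
population_std = np.std(data)

# Sample formulas divide by n - 1 (ddof=1)
sample_variance = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)

print("population variance:", population_variance)  # 4.0
print("population std:", population_std)            # 2.0
print("sample variance:", sample_variance)
print("sample std:", sample_std)
```

Note that pandas makes the opposite default choice: `Series.std()` and `Series.var()` use `ddof=1`, so they return the sample versions unless told otherwise.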

### Standard Error
Data scientists work with real-life datasets, which are almost always samples, so we are mainly concerned with sample variance. We use the sample standard deviation to estimate the standard deviation of the population. The standard error (SE) is the standard deviation of the sampling distribution of a statistic. For the sample mean, it is estimated as:

$$SE =\frac{s}{\sqrt{n}} $$
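
As a quick check of this formula, the sketch below computes the standard error of the mean by hand and compares it with `scipy.stats.sem`. The sample is synthetic and its parameters are arbitrary.

```python
import numpy as np
from scipy import stats

# Synthetic sample of 50 measurements (illustrative only)
rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=100, scale=15, size=50)

# SE = s / sqrt(n), using the sample standard deviation (ddof=1)
s = np.std(sample, ddof=1)
standard_error = s / np.sqrt(len(sample))

# scipy.stats.sem computes the same quantity (it also defaults to ddof=1)
library_se = stats.sem(sample)

print("manual SE:", standard_error)
print("scipy SE:", library_se)
```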

When running a test to statistically measure whether the means of two distributions $i$ and $j$ are the same, the standard error of the difference between the two sample means is:

$$ SE = \sqrt{\frac{s_{i}^{2}}{n_{i}}+\frac{s_{j}^{2}}{n_{j}}} $$

where:

- $s_i, s_j$ are the sample standard deviations for samples $i$ and $j$ respectively
- $n_i, n_j$ are the sample sizes for samples $i$ and $j$ respectively

## Distributions

Now that we have a handle on the different variable data types and their respective measurement scales, we can begin to understand the different categories of distributions that these variable types come from. For humans to understand distributions, we generally visualize data on a measurement scale.

### Normal
Many natural phenomena are approximately normally distributed, from human height distribution to light wave interference. A normally distributed variable is one whose data points are drawn from a normal distribution, also referred to as the [Gaussian](https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss) distribution. If $X$ follows a normal distribution with mean $\mu$ and variance $\sigma^2$, we write $X \sim N(\mu, \sigma^2)$.

Below is a plot of a normal distribution's [probability density function](https://en.wikipedia.org/wiki/Probability_density_function).


In [None]:
x = np.arange(-5, 5, 0.1)
y = stats.norm.pdf(x)

plt.plot(x, y)
plt.axvline(0, color='red')
plt.show()

### Bernoulli
Bernoulli distributions model an event which only has two possible outcomes (i.e., success and failure) which occur with probability $p$ and $1-p$.

If $X$ is a Bernoulli random variable with likelihood of success $p$, we write $X \sim \text{Bernoulli}(p)$.

We have actually seen an example of a Bernoulli distribution when considering a coin flip. That is, there are exactly two outcomes, and a heads occurs with probability $p = \frac{1}{2}$ and a tails with probability $1-p = \frac{1}{2}$.

### Binomial 
Binomial distributions model a discrete random variable that counts the number of successes in $n$ independent Bernoulli trials.

If $X$ is a Binomial random variable over $n$ trials with probability $p$ of success, we write $X \sim \text{Binom}(n, p)$. Under this distribution, the probability of $k$ successes is $P(X = k) = {n \choose k}p^k(1-p)^{n-k}$.
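
The probability formula above can be evaluated directly and compared against `scipy.stats.binom`. This is a small sketch; the choice of 10 fair coin flips and 3 heads is arbitrary.

```python
from math import comb
from scipy import stats

n, p, k = 10, 0.5, 3  # 10 fair coin flips, exactly 3 heads

# Direct evaluation of (n choose k) * p^k * (1 - p)^(n - k)
manual = comb(n, k) * p**k * (1 - p)**(n - k)

# The same probability from SciPy's binomial distribution
library = stats.binom.pmf(k, n, p)

print("manual:", manual)   # 0.1171875
print("scipy:", library)
```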

### Poisson
A Poisson distribution can be used to model the discrete events that happen over a time interval. An example could be an expected count of customers arriving at a restaurant during each hour.
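
As an illustrative sketch of the restaurant example, suppose customers arrive at an average rate of 4 per hour (a made-up figure). The standard Poisson probability of seeing exactly $k$ events is $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$, which `scipy.stats.poisson` also implements:

```python
from math import exp, factorial
from scipy import stats

rate = 4  # hypothetical average number of customers per hour
k = 2     # probability of exactly 2 arrivals in an hour

# Direct evaluation of lambda^k * e^(-lambda) / k!
manual = exp(-rate) * rate**k / factorial(k)

# The same probability from SciPy's Poisson distribution
library = stats.poisson.pmf(k, rate)

print("manual:", manual)
print("scipy:", library)
```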

### Gamma
The Gamma distribution is related to the Poisson distribution, but instead of counting discrete events it is a continuous distribution that models the waiting time until a given number of events has occurred. For example, it could describe the departure times of employees from a central office: employees begin leaving at 3pm, and by 8pm most have left.

### Others
See [here](https://en.wikipedia.org/wiki/List_of_probability_distributions) for the most commonly used probability distributions.

## Coefficient of Determination ($R^2$)

Most datasets come with many variables to unpack. Looking at $R^{2}$ can tell us how strong the linear relationship between two variables is. In seaborn's `tips` dataset, for example, tips tend to increase linearly with the total bill. The coefficient of determination, $R^{2}$, tells us what proportion of the variance in one variable is explained by a best-fit regression line through the data. $R^{2}=1$ indicates that the line fits the data perfectly (which, on real data, can be a sign that the fit is too good to be true), and $R^{2}=0$ indicates that the line explains none of the variance.
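
The "variance explained" reading of $R^2$ can be sketched on synthetic data. The data below only imitates a bill-versus-tip relationship; the slope, noise level, and seed are assumptions chosen for illustration, and `scipy.stats.linregress` is one of several ways to fit the line.

```python
import numpy as np
from scipy import stats

# Synthetic data: y depends linearly on x, plus noise
rng = np.random.default_rng(seed=0)
x = rng.uniform(5, 50, size=200)
y = 0.15 * x + rng.normal(0, 1.0, size=200)

# Least-squares regression line
fit = stats.linregress(x, y)
predicted = fit.intercept + fit.slope * x

# R^2 = 1 - (unexplained variation / total variation)
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("R^2 from residuals:", r_squared)
print("R^2 from rvalue**2:", fit.rvalue ** 2)
```

For a regression with a single predictor, the two printed values agree: $R^2$ is the square of Pearson's $r$.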

## Correlation Coefficient (Pearson's $r$)

Correlations can tell data scientists whether there is a relationship between two or more variables in a dataset. Although a correlation can suggest a causal relationship, data scientists must note that [correlation is not causation](https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation). The Pearson correlation coefficient is on a scale from -1 to 1, where 0 implies no linear correlation, -1 is a perfect negative linear correlation, and 1 is a perfect positive linear correlation.
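
The two endpoints of the scale can be reproduced with a tiny made-up example, using `scipy.stats.pearsonr` on variables that are exact linear functions of each other:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# y rises exactly linearly with x: r is 1 (up to floating-point rounding)
r_pos, _ = stats.pearsonr(x, 2 * x + 1)

# y falls exactly linearly with x: r is -1 (up to floating-point rounding)
r_neg, _ = stats.pearsonr(x, -3 * x)

print("increasing line:", r_pos)
print("decreasing line:", r_neg)
```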

In [None]:
df = sns.load_dataset('mpg')

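# Keep only numeric columns; recent pandas versions raise an error on string columns
sns.heatmap(df.select_dtypes(include='number').corr(), cmap='Blues', annot=True)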
plt.show()

## Hypothesis Testing

Designing an experiment involves separating some of the ideas that you might have had before, during, and after the experiment. Let's say you are selling popcorn at a movie theater, and you have three sizes: small, medium, and large, for \$3.00, \$4.50, and \$6.50, respectively. At the end of the week, the total sales are as follows: \$200, \$100, and \$50 for small, medium, and large, respectively. You currently have an advertisement posted for your medium popcorn, but you think that if you were to sell more large sizes, you may make more money. So you decide to post an ad for large size popcorn. Your hypothesis is as follows: I will get more weekly sales on average with a large ad compared to a medium ad.

### A-A Testing
To test this hypothesis, we should first look at some historical data so that we can validate that our control is what we think it is. Our hypothesis for this case is that there is no difference in week-to-week sales using the ad for medium popcorn. To test this hypothesis, we can use a one-sample t-test to compare a set of weeks against the historical mean, or an F-test to compare some week in the past to all other weeks.
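
A one-sample t-test for the A-A check might look like the sketch below. The weekly sales figures are synthetic and the historical mean of \$350 is a hypothetical value, not data from the popcorn example.

```python
import numpy as np
from scipy import stats

historical_mean = 350.0  # hypothetical long-run weekly sales with the medium ad

# Synthetic sales for 8 recent weeks under the same (medium) ad
rng = np.random.default_rng(seed=7)
recent_weeks = rng.normal(loc=350, scale=20, size=8)

# Null hypothesis: the recent weeks have the same mean as the historical data
t_stat, p_value = stats.ttest_1samp(recent_weeks, popmean=historical_mean)

# The t-statistic is the difference in means divided by the standard error
standard_error = recent_weeks.std(ddof=1) / np.sqrt(len(recent_weeks))
manual_t = (recent_weeks.mean() - historical_mean) / standard_error

print("t-statistic:", t_stat)
print("manual t:", manual_t)
print("p-value:", p_value)
```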

### A-B Testing
Assuming we have validated the historical data using an A-A test for the old ad for medium popcorn, we can now test against the new ad for the large popcorn. If we then collect data for the new ad for several weeks, we can use a two-sample t-test to compare the two groups. In this experiment we will collect data for several weeks or months using the control (the medium ad) and repeat for the experiment group (the large ad).

The most important aspect of hypothesis testing is the assumptions you make about the control and the test groups. The null hypothesis in all cases would be the inverse of your hypothesis. In A-A testing, the null hypothesis is that there is no difference among samples, and in the case of A-B testing, the null states that there is no difference between a control sample and a test sample. A successful A-A test is one in which you fail to reject the null. In other words, there are no differences inside your control group; it is more or less homogeneous. A successful A-B test is one where you reject the null hypothesis, observing a significant difference.

### Evaluating an Experiment
Using a t-test or another statistical test like an F-test, ANOVA, or Tukey HSD, we can measure the results of our experiment with two statistics. The t-statistic tells us how large the observed difference between samples is relative to the variability in the data, and the p-value tells us how likely it would be to observe a difference at least this large if the null hypothesis were true (that is, through random chance or noise alone). Most statisticians and data scientists use 0.05 as a threshold, so a p-value less than 0.05 indicates that the observed difference is unlikely to be due to random chance alone.
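
To illustrate how these two numbers come out of a test, here is a sketch of an A-B comparison on synthetic weekly sales. The group means, spread, and sample sizes are invented. The sketch uses Welch's variant of the two-sample t-test (`equal_var=False`), which is one reasonable choice rather than the only one.

```python
import numpy as np
from scipy import stats

# Synthetic weekly sales (in dollars) for 12 weeks under each ad
rng = np.random.default_rng(seed=3)
medium_ad = rng.normal(loc=350, scale=20, size=12)  # control
large_ad = rng.normal(loc=400, scale=20, size=12)   # experiment

# Welch's two-sample t-test; the null hypothesis is equal means
t_stat, p_value = stats.ttest_ind(large_ad, medium_ad, equal_var=False)

# The same t-statistic by hand: difference in means over its standard error
se = np.sqrt(large_ad.var(ddof=1) / len(large_ad)
             + medium_ad.var(ddof=1) / len(medium_ad))
manual_t = (large_ad.mean() - medium_ad.mean()) / se

print("t-statistic:", t_stat)
print("manual t:", manual_t)
print("p-value:", p_value)
print("reject null at 0.05:", p_value < 0.05)
```

By default `ttest_ind` runs a two-sided test, so it detects a difference in either direction.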

# Resources

[seaborn datasets](https://github.com/mwaskom/seaborn-data)



# Exercises

## Exercise 1

Find a dataset from the list below and plot the distributions of all the numeric columns. In each distribution, you should also plot the median, $-1\sigma$, and $+1\sigma$.


Here's a full list of [Seaborn built-in datasets](https://github.com/mwaskom/seaborn-data).

### Student Solution

In [None]:
# Your answer goes here

---

### Answer Key

**Solution**

In [None]:
#@title #### Change the variable to examine standard deviation

# Load tips dataset from Seaborn

tips = sns.load_dataset('tips')
var="size" #@param ['size', "tip", "total_bill"]
df = tips

df[var].hist(color='lightblue')
plt.grid(False)

# Label each line directly so the legend entries match the right lines
median = df[var].median()
std = df[var].std()
plt.axvline(median - std, color='purple', label='-1σ')
plt.axvline(median + std, color='orangered', label='+1σ')
plt.axvline(median, color='teal', ls=':', lw=5.0, label='median')

plt.title(var+" distribution")
plt.legend()
plt.show()

print("-1σ:", median - std)
print("median:", median)
print("+1σ:", median + std)

---

## Exercise 2

Load a dataset and take a simple random sample. Then return a dataframe with the standard error and standard deviation for each column.

### Student Solution

In [None]:
# Your answer goes here

---

### Answer Key

**Solution**

In [None]:
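df = sns.load_dataset('mpg')

# Standard deviation is only defined for the numeric columns
numeric = df.select_dtypes(include='number')

# Simple random sample of 25 rows, drawn without replacement
sample = numeric.sample(25)

# count() ignores missing values, matching what std() does
e = pd.DataFrame()
e["Population SE"] = numeric.std()/(numeric.count()**.5)
e["Sample SE"] = sample.std()/(sample.count()**.5)
e["population σ"] = numeric.std()
e["sample s"] = sample.std()
display(e)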

As we can see, the sample standard deviations tend to be close to those of the full dataset, but the sample has a much higher standard error because it contains far fewer observations.

---

## Exercise 3


Using a dataset that you found already, or a new one, create two visualizations that share the same figure using `plt`, as well as their mean and median lines. The first visualization should show the frequency, and the second should show the probability.

### Student Solution

In [None]:
# Your answer goes here

---

### Answer Key

**Solution**

In [None]:
df = sns.load_dataset('mpg')

fig = plt.figure()

# Left panel: frequency (counts per bin)
plt.subplot(1,2,1)
df['acceleration'].hist()
plt.axvline(df['acceleration'].mean(), color='yellow')
plt.axvline(df['acceleration'].median(),
            color='cyan', ls=':', lw=5.0)
plt.title("Frequency")

# Right panel: probability density with a KDE curve
# (histplot replaces the deprecated sns.distplot)
plt.subplot(1,2,2)
sns.histplot(df['acceleration'], stat='density', kde=True)
plt.xlabel('')
plt.title("Probability density")

plt.axvline(df['acceleration'].mean(), color='orangered', label='mean')
plt.axvline(df['acceleration'].median(),
            color='darkblue', ls=':', lw=5.0, label='median')

plt.legend()
plt.suptitle("Acceleration", y=-.01)
plt.tight_layout()
plt.show()

print("mean:",df['acceleration'].mean(),
        "\nmedian:",df['acceleration'].median())

---

## Exercise 4


Plot two variables against each other, and calculate the $R^{2}$ and p-value for a regression line that fits the data.

### Student Solution

In [None]:
# Your answer goes here

---

### Answer Key

**Solution**

In [None]:
from scipy import stats

# Load tips data set and plot total bill vs tip
tips = sns.load_dataset('tips')

def getR(data,x,y):
    # pearsonr returns Pearson's r and its p-value
    result = stats.pearsonr(data[x],data[y])
    # For a regression with one predictor, R^2 is the square of Pearson's r
    r2 = round(result[0]**2, 5)
    p = np.format_float_scientific(result[1], precision=4)

    g = sns.jointplot(x=x, y=y, data=data, kind='reg')
    # Annotate the joint (scatter) axes explicitly
    g.ax_joint.text(0,10,r'$R^{2}=%s$'%r2)
    g.ax_joint.text(0,9,r'$p_{value}=%s$'%p)
    plt.show()

getR(tips,'total_bill','tip')

---