#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Statistical Analysis of Data


Using probabilistic functions, we can generate samples and calculate statistics of variables. Statistics are numbers that describe or summarize the data points of a variable, and distributions describe how those data points are spread across their possible values. Data scientists use many methods to perform statistical analyses, and it is important to understand how concepts like the normal distribution and the central limit theorem are used to model the universe and make inferences about events that happen in both time and space.


**Statistical Analysis**
* Sampling vs. Population
>* Simple Random Sample (SRS)
>*  Bias in Sampling
* Variables and Measurements
*   Measuring Central Tendency
>* Mean, Median, and Mode
*   Measuring Variance
>* Standard Deviation
>* Standard Error

**Normality**
* Distribution Types
* Coefficient of Determination (R-squared)
* Correlation Coefficient (Pearson's r)

**Experimental Design**
* Hypothesis Testing
>* t-test, t-stat, and p-value


## Overview

### Learning Objectives

* Understand basic statistical sampling, and generate a representative sample from a dataset.
* Be able to sniff out bias in an experiment.
* Discern a good measure of central tendency for a set of data points.
* Calculate standard deviation, standard error, and percentile.
* Describe what a skewed distribution is, and how it is a deviation from normality.
* Generate a hypothesis, and compare the means of 2 sample distributions with a t-test, and report on the results.

### Prerequisites

* Data Visualization
* Intermediate Pandas
* Probability

### Estimated Duration

90 minutes

### Grading Criteria

Each exercise is worth 3 points. The rubric for calculating those points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at exercise |
| 1      | Attempted exercise, but code does not run |
| 2      | Attempted exercise, code runs, but produces incorrect answer |
| 3      | Exercise completed successfully |

There are 4 exercises in this Colab so there are 12 points available. The grading scale will be 12 points.

### Load Packages

In [0]:
%matplotlib inline
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import ttest_1samp, pearsonr

## Statistical Analysis

### Sampling vs. Population

What is the difference between a sample and a population? A sample is a subset of a larger population, although you can also treat a sample as a small population in its own right. Imagine a biologist tagging a sample of birds, tracking their movements with GPS, and using that data to make inferences about the patterns of the larger population of the species.

Stating assumptions explicitly lets scientists test theories. The first assumption is that an unbiased sample comes from the same distribution as the population, and that this distribution is normal. We can test this hypothesis using a one-sample t-test, a statistical method that compares a sample mean to the population mean.

### Simple Random Sample

A simple random sample (SRS) is one of the most common statistical sampling techniques, and it involves the random selection of a subset of data. An SRS is unbiased because every member of the population has an equal opportunity to be selected. True randomness cannot be proven statistically, so we use pseudorandom functions, which are random enough for most common statistical applications.
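As a minimal sketch of an SRS in pandas, the example below draws 100 rows without replacement from a made-up population. The `population` DataFrame and its column names are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical population: 1,000 members with one measured value each.
rng = np.random.default_rng(seed=42)
population = pd.DataFrame({
    "member_id": np.arange(1000),
    "value": rng.normal(loc=50, scale=10, size=1000),
})

# Draw a simple random sample of 100 rows without replacement.
# random_state seeds the pseudorandom generator so the draw is repeatable.
srs = population.sample(n=100, replace=False, random_state=0)

print(len(srs))
print(srs["member_id"].is_unique)
```

Fixing `random_state` makes the pseudorandom draw reproducible, which is useful when you want others to be able to rerun your analysis on the same sample.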

### Bias in Sampling

Bias, as with a weighted coin that falls on heads more often, can be present in many stages of an experiment or data analysis. Some bias, like selection bias, is easy to detect. For example, a sample obtained from the Census Bureau in 2010 collected information on residents across the U.S. Surely not every resident responded to their requests, so the ones that did are assumed to be a representative sample. This survey has some selection bias, as the folks who tend to be at home during the day are either very young or very old. Another example would be political polling: those at home ready to answer the phone tend to be older, yielding a biased sample of voters.

Confirmation bias, a form of cognitive bias, can affect online and offline behavior. Those who believe that the earth is flat are more likely to spread misinformation that supports a flat earth than facts that dispel the claim, perpetuating the falsehood. YouTube's algorithm is then biased to show more flat-earth video suggestions to those who've watched at least one. Those video suggestions then feed back into the users' confirmation bias.

There are other types of bias that may further confound an experiment or data collection strategy. These biases are beyond the scope of this course, but should be noted. An exhaustive list of cognitive biases can be found [here](https://en.wikipedia.org/wiki/List_of_cognitive_biases). Data scientists of all skill levels can run into pitfalls in their design and implementation strategies if they are not aware of the sources of bias in experiment design, or of error in data sources or collection strategies.
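The polling example can be simulated with a small sketch. The age range and the response model below are invented assumptions and are not taken from any real survey. They only show how a response rate that rises with age shifts the sample mean away from the population mean.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population of adult ages, uniformly spread from 18 to 90.
ages = rng.integers(low=18, high=91, size=100_000)

# Assumed response model: the chance of answering the phone grows with age,
# from 5% at age 18 up to 60% at age 90.
response_prob = 0.05 + 0.55 * (ages - 18) / (90 - 18)
responded = rng.random(ages.size) < response_prob

population_mean_age = ages.mean()
respondent_mean_age = ages[responded].mean()

print(f"population mean age: {population_mean_age:.1f}")
print(f"respondent mean age: {respondent_mean_age:.1f}")
```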






### Variables and Measurements

We have already learned about data types like string, integer, and float. These data types make up variable types that are categorized according to their measurement scales. We can start to think about variables as divided into two groups: numerical and categorical variables.

#### Numerical Variables
Numerical data can be represented by both numbers and strings, and can be further subdivided into discrete and continuous variables. 

***Discrete*** data is anything that can be counted, like number of user signups for your web app, or the number of waves you caught while you were surfing.

Conversely, ***continuous*** data cannot be counted and must be measured. Continuous data comes on 2 measurement scales:
1. Interval data is ordered, and the difference between adjacent values is the same everywhere on the scale, but the zero point is arbitrary. An example of interval data is the Celsius temperature scale: 0 °C does not mean the absence of temperature, so 20 °C is not "twice as hot" as 10 °C.
2. Ratio data, on the other hand, is ordered, evenly spaced, and has a true zero, so ratios of values are meaningful. The Kelvin temperature scale is an example, because 0 K is absolute zero. Another example could be the proportion of employees at company G over the age of 65. Most people retire at 65, so it is possible to measure a zero ratio in that case.

#### Categorical Variables
Categorical data can take the form of either strings or integers. However, those integers have no numerical value; they are purely a minimal labeling convention.

***Nominal*** means that the data is labeled without any specific order. In machine learning, these categories would be called classes, or levels. A feature can be binary (containing only two classes) or multiclass (containing more than two classes). In the case of coin flip data, you have either heads or tails, because the coin cannot land on both or neither. For an example of a multiclass feature, life can be divided into 6 kingdoms (animals, plants, fungi, protists, archaebacteria, and eubacteria).

***Ordinal data*** is categorical data where the order has significance, or is otherwise ranked. This could be Uber driver ratings on a scale of 1 to 5 stars, or Gold, Silver, and Bronze Olympic medals. It should be noted that the differences between levels of ordinal data are often treated as equal, when in reality they may not be. For example, the difference between a 3 and 4-star rating may not be the same as the difference between a 4 and a 5-star rating.
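One way to represent the nominal and ordinal cases in pandas is with categorical dtypes. This is a sketch using made-up values. The variable names `coin`, `medal_type`, and `medals` are illustrative only.

```python
import pandas as pd

# Nominal: labels with no inherent order.
coin = pd.Series(["heads", "tails", "heads"], dtype="category")

# Ordinal: labels with a meaningful order, declared explicitly.
medal_type = pd.CategoricalDtype(["Bronze", "Silver", "Gold"], ordered=True)
medals = pd.Series(["Gold", "Bronze", "Silver", "Gold"], dtype=medal_type)

print(coin.cat.ordered)
print(medals.cat.ordered)
print(medals.min(), medals.max())
print((medals > "Bronze").tolist())
```

Declaring the order lets comparisons and `min`/`max` work on ordinal data, while pandas does not define such ordering for an unordered (nominal) categorical.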



### Measuring Central Tendency

Central Tendency is the point around which most of the data in a dataset is clustered.  Some measures of central tendency include the mean, often referred to as the average, the median, and the mode.

Mean is easy to calculate: it is just the sum of a series divided by the number of samples. Mean is not robust to outliers, and if your data is skewed or otherwise not normally distributed, then the mean may not be a good measure of central tendency.

Median is the middlemost data point in an ordered series. If your set contains 4 samples, the median is equal to the mean of the 2nd and 3rd data points. Median can often be close to the mean, but is more robust to outliers. If you have evidence that your sample comes from a non-normal distribution, median may be a better estimate of central tendency than the mean.

Mode is the most frequent data point in a series. Sometimes there is no mode, which indicates that all of the data points are unique. Other times a series can be multimodal, meaning that two or more values are tied for the highest frequency. Mode gives insight into a distribution's frequency, including some possible sources of error.
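The three measures can be compared on a small made-up series. The values below are hypothetical and include one deliberate outlier to show how differently the mean and median react to it.

```python
import pandas as pd

# Hypothetical measurements; the last value is an outlier.
values = pd.Series([2, 3, 3, 4, 5, 100])

mean_value = values.mean()           # pulled upward by the outlier
median_value = values.median()       # mean of the two middle values
mode_values = values.mode().tolist() # mode() returns every most-frequent value

print("mean:  ", mean_value)
print("median:", median_value)
print("mode:  ", mode_values)
```

Here the mean is 19.5 while the median is 3.5, so the median is much closer to where most of the points sit.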




### Measuring Variance

#### Standard Deviation
Population variance, $\sigma^{2}$, is the population standard deviation squared. Data scientists typically talk about variance in the context of variability, or how far the points in a sample lie from their central tendency. The sample standard deviation is equal to the square root of the sum of the squared differences between each point and the sample mean, divided by the sample size minus one. The formula for the sample standard deviation is:

$$s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}$$
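The formula can be checked term by term against NumPy. This is a sketch on a small made-up array. Note that `np.std` divides by $n$ unless `ddof=1` is passed.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size

# Sample standard deviation, following the formula term by term.
s_manual = np.sqrt(np.sum((x - x.mean()) ** 2) / (n - 1))

# NumPy defaults to the population form (ddof=0); ddof=1 gives the sample form.
s_numpy = np.std(x, ddof=1)
sigma_numpy = np.std(x)  # divides by n instead of n - 1

print(s_manual, s_numpy, sigma_numpy)
```

pandas differs from NumPy here: `Series.std` and `DataFrame.std` use `ddof=1` by default, so they return the sample standard deviation.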


#### Standard Error
Data scientists work with datasets, so we are mainly concerned with sample variance, and therefore use the sample standard deviation to estimate the standard deviation of a population. Standard error, a.k.a. Standard Error of the Mean (SEM), is the standard deviation of the sampling distribution of the mean. The formula is as follows:

$$SE=\frac{s}{\sqrt{n}}$$

A related statistic measures the uncertainty in the difference between the means of two samples, $i$ and $j$:

$$SE_{\bar{x}_{i}-\bar{x}_{j}}=\sqrt{\frac{s_{i}^{2}}{n_{i}}+\frac{s_{j}^{2}}{n_{j}}}$$
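A short sketch of the standard error of the mean, $s/\sqrt{n}$, computed by hand and with `scipy.stats.sem`. The sample below is randomly generated and purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=100, scale=15, size=25)  # hypothetical measurements

s = sample.std(ddof=1)                # sample standard deviation
se_manual = s / np.sqrt(sample.size)  # standard error of the mean
se_scipy = stats.sem(sample)          # uses ddof=1 by default

print(s, se_manual, se_scipy)
```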


## Normality
What does it mean to be normal? In statistics, normal describes the shape of a distribution of data. Many natural phenomena are approximately normally distributed, from human height to light wave interference. A normally distributed variable in a dataset is one whose data points come from a normal, or approximately normal, distribution. The normal distribution is also referred to as the Gaussian distribution. Student's t-distribution is a related bell-shaped distribution with heavier tails, and it approaches the normal distribution as the sample size grows.

To understand the probability of an event, a probability density function can be used to plot the expected relative frequency of that event. The formula for the Probability Density Function (PDF) of the normal distribution is as follows, where $x$ is a sample data point, $\mu$ is the population mean, $\sigma$ is the population standard deviation, and $e$ is Euler's number. This function produces a curve whose total area is equal to 1.

\begin{equation}
f(x|\mu,\sigma^2)=\frac1{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{equation}


The plots above show two histograms, one of which estimates the probability using a Kernel Density Estimate (KDE) for acceleration in the mpg dataset. The y-axis on the left plot shows frequency, and the y-axis on the right plot shows probability density. The distribution for this variable appears normal, and the median and mean are nearly equal.
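The normal density formula given earlier in this section can be evaluated directly. The sketch below uses illustrative values for $\mu$ and $\sigma$ (they are not computed from the mpg dataset) and approximates the area under the curve with a simple Riemann sum.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Evaluate the normal probability density at x."""
    coefficient = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * np.exp(exponent)

mu, sigma = 15.5, 2.75   # illustrative values, not taken from a dataset
x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 10_001)
density = normal_pdf(x, mu, sigma)

# Riemann-sum approximation of the area under the curve.
area = np.sum(density) * (x[1] - x[0])
print(area)
```

The density peaks at $x=\mu$, and the approximate area should come out very close to 1.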

### Distribution Types

Now that we have a handle on the different variable data types and their respective measurement scales, we can begin to understand the different categories of distributions that these variable types come from. For humans to understand distributions, we generally visualize data on a measurement scale.

#### Binomial 
We have already seen the binomial distribution in our coin flip experiment, where we plotted the heads frequency distribution for 1000 iterations of a 100 coin flip experiment. This distribution models a discrete random variable: the number of successes, $k$, in $n$ independent trials. In the coin flip experiment, it describes the number of heads in each set of 100 flips.

#### Conditional
A conditional distribution contains only the values that satisfy some condition, such as being less than, greater than, or equal to a set value. An example is the distribution of heads results in a 100 flip, 1000 repetition experiment where the number of heads is greater than 50.

#### Bernoulli
We have also seen the Bernoulli distribution in our coin flip experiment. It models the outcome of exactly one trial, such as a single flip that lands on either heads or tails.

#### Poisson
A Poisson distribution can be used to model the number of discrete events that happen over a time interval. An example could be the count of customers arriving at a restaurant during each hour. Some hours the total will be higher, and some hours the total will be lower.

#### Gamma
The gamma distribution is related to the Poisson distribution in that it deals with events in time, except that it is continuous and models the waiting time until an event. This could be the departure times of employees from a central office. Employees depart from a central office beginning at 3pm, and by 8pm most have left.

#### Others
There are other distributions that are mixed and can take on more exotic shapes. A full list can be found [here](https://en.wikipedia.org/wiki/List_of_probability_distributions).
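The distributions described in this section can be sampled with NumPy's random generator. This is a sketch only: the rate of 12 customers per hour and the other parameters are made-up assumptions, and the conditional example keeps experiments with more than 50 heads out of 100 flips.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Bernoulli: one flip of a fair coin (a binomial with n=1); 1 = heads.
single_flip = rng.binomial(n=1, p=0.5)

# Binomial: number of heads in 100 flips, repeated for 1000 experiments.
heads = rng.binomial(n=100, p=0.5, size=1000)

# Conditional: keep only the experiments with more than 50 heads.
heads_over_50 = heads[heads > 50]

# Poisson: customers arriving per hour, assuming an average of 12 per hour.
arrivals = rng.poisson(lam=12, size=1000)

# Gamma: waiting time (hours) until the 3rd arrival at 12 arrivals per hour.
waits = rng.gamma(shape=3, scale=1 / 12, size=1000)

print(heads.mean(), arrivals.mean(), waits.mean())
```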

### Relationships Between Variables

Most datasets come with many variables to unpack. Looking at $R^{2}$ can inform us of the strength of the linear relationship between two variables. In the tips dataset, the tips tend to increase linearly with the total bill. The coefficient of determination, $R^{2}$, tells us how much of the variance is explained by a best-fit regression line through the data. An $R^{2}=1$ would indicate a perfect fit, and $R^{2}=0$ would indicate no fit. Additionally, a low $p_{value}$ tells us that a fit this strong would be unlikely to arise from random chance alone.
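As a sketch of how $R^{2}$ and the p-value can be obtained for a simple linear fit, the example below uses `scipy.stats.linregress` on synthetic data. The bill and tip values are generated, not loaded from the tips dataset, and the 15% tip rate is an assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)

# Hypothetical bills and tips: tip is about 15% of the bill plus noise.
total_bill = rng.uniform(low=5, high=50, size=200)
tip = 0.15 * total_bill + rng.normal(loc=0, scale=1.0, size=200)

fit = stats.linregress(total_bill, tip)
r_squared = fit.rvalue ** 2   # coefficient of determination for a simple linear fit

print(f"slope={fit.slope:.3f}  R^2={r_squared:.3f}  p-value={fit.pvalue:.3g}")
```

For a regression with a single predictor, $R^{2}$ is the square of Pearson's $r$, which is why squaring `rvalue` is sufficient here.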

#### Correlation

Correlations can inform data scientists that there is a statistically significant relationship between two or more variables in a dataset. Although a correlation can suggest a causal relationship, data scientists must note that correlation is not causation. The Pearson correlation coefficient is on a scale from -1 to 1, where 0 implies no linear correlation, -1 is a perfect negative correlation, and 1 is a perfect positive correlation.

In [0]:
df = sns.load_dataset('mpg')

# Correlate only the numeric columns; 'origin' and 'name' contain strings.
sns.heatmap(df.corr(numeric_only=True), cmap='Blues', annot=True)
plt.show()

### Hypothesis Testing

Devising an experiment involves separating the ideas that you had before, during, and after the experiment. Let's say you are selling popcorn at a movie theater, and you have three sizes, small, medium, and jumbo, for `$`3.00, `$`4.50, and `$`6.50. At the end of the week, the total sales are 200, 100, and 50 for small, medium, and jumbo respectively. You currently run an advertisement for medium popcorn at `$`4.50, and you think that you will make more money if you sell more jumbo sizes, so you decide to post an ad for jumbo popcorn. Your hypothesis is as follows: I will get more weekly sales on average with a jumbo ad than with a medium ad.

#### A-A Testing
To test this hypothesis, we first must look at some historical data so that we can validate that our control is what we think it is. Our hypothesis for this case is that there is no difference in week-to-week sales using the ad for medium popcorn. To test it, we can use an F-test to compare some week in the past to all other weeks, or a 1-sample t-test to compare against the population mean.
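A minimal sketch of the 1-sample t-test for an A-A check, using `ttest_1samp` from SciPy. The weekly sales figures are randomly generated and the historical mean of 350 is an assumption based on the 200 + 100 + 50 weekly totals in the popcorn example.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(seed=11)

# Hypothetical weekly popcorn sales under the medium-popcorn ad (the control).
historical_mean = 350   # assumed long-run average weekly sales
recent_weeks = rng.normal(loc=350, scale=20, size=12)

# Null hypothesis: the recent weeks come from a distribution with the historical mean.
t_stat, p_value = ttest_1samp(recent_weeks, popmean=historical_mean)

# The same t-statistic computed by hand: difference in means over the standard error.
standard_error = recent_weeks.std(ddof=1) / np.sqrt(recent_weeks.size)
t_manual = (recent_weeks.mean() - historical_mean) / standard_error

print(f"t={t_stat:.3f}  p={p_value:.3f}")
```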

#### A-B Testing
Assuming that we have validated the historical data using an A-A test for the old ad for medium popcorn, we can now test against the new ad for jumbo popcorn. If we then collect data for the new ad for several weeks, we can use a 2-sample t-test to compare. In this experiment, we will collect data for several weeks or months using the control (the medium ad), and repeat for the test (the jumbo ad).

The most important aspect of hypothesis testing is the assumptions that you make about the control and the test groups. The null hypothesis in all cases would be the inverse of your hypothesis. In A-A testing, the null hypothesis is that there is no difference amongst samples, and in the case of A-B testing, the null states that there is no difference between a control and a test sample. A successful A-A test is one in which you fail to reject the null. In other words, there are no differences inside your control group, and it is more or less homogeneous. A successful A-B test is one where you reject the null hypothesis, observing a significant difference.

#### Evaluating an Experiment
Using a t-test or another statistical test like an F-test, ANOVA, or Tukey HSD, we can measure the results of our experiment with two statistics. The t-statistic tells us the size of the observed difference between samples relative to the variability in the data, and the p-value tells us how likely a difference at least that large would be if the null hypothesis were true, that is, if only random chance or noise were at work. Most statisticians and data scientists use 0.05 as an upper limit for the p-value, so a test that results in a p-value less than 0.05 is taken to indicate that the observed difference is unlikely to be due to random chance alone.
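A sketch of evaluating an A-B experiment with a two-sample t-test, using `scipy.stats.ttest_ind`. The weekly revenue figures are invented for illustration, and Welch's variant (`equal_var=False`) is a choice made here, not something the text prescribes.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=5)

# Hypothetical weekly revenue (dollars) under each ad; the numbers are invented.
control = rng.normal(loc=1375, scale=60, size=12)   # medium-popcorn ad
test = rng.normal(loc=1600, scale=60, size=12)      # jumbo-popcorn ad

# Welch's t-test does not assume the two groups share the same variance.
t_stat, p_value = ttest_ind(test, control, equal_var=False)

alpha = 0.05
reject_null = p_value < alpha

print(f"t={t_stat:.3f}  p={p_value:.3g}  reject null: {reject_null}")
```

With a real experiment, the decision threshold `alpha` should be chosen before the data is collected.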


# Resources

[seaborn datasets](https://github.com/mwaskom/seaborn-data)



# Exercises

## Exercise 1

Find a dataset from the list below and plot the distributions of all the numeric columns.  In each distribution, you must also plot the median, $-1\sigma$, and $+1\sigma$


Full list of [Seaborn built-in datasets](https://github.com/mwaskom/seaborn-data)

### Student Solution

In [0]:
# Your answer goes here

### Answer Key

**Solution**

In [0]:
#@title #### Change the variable to examine standard deviation

# Load tips dataset from Seaborn

tips = sns.load_dataset('tips')
var="total_bill" #@param ['size', "tip", "total_bill"]
df = tips

df[var].hist(color='lightblue')
plt.grid(False)
plt.axvline(df[var].median()-df[var].std(), color='purple')
plt.axvline(df[var].median()+df[var].std(), color='orangered')
plt.axvline(df[var].median(), color='teal', ls=':', lw=5.0)

plt.title(var+" distribution")
plt.legend(['-1σ','+1σ','median'])
plt.show()

print("-1σ:",df[var].median()-df[var].std())
print("median:",df[var].median())
print("+1σ:",df[var].median()+df[var].std())

**Validation**

In [0]:
# TODO(b/132249958)

## Exercise 2

Load a dataset, take a simple random sample, and return a DataFrame with the standard error and standard deviation for each column.

### Student Solution

In [0]:
# Your answer goes here

### Answer Key

**Solution**

In [0]:
df = sns.load_dataset('mpg')

sample = df.sample(25)

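# Restrict to numeric columns; std() is not defined for the string columns.
e = pd.DataFrame()
e["Population SE"] = df.std(numeric_only=True)/(df.shape[0]**.5)
e["Sample SE"] = sample.std(numeric_only=True)/(sample.shape[0]**.5)
e["population σ"] = df.std(numeric_only=True)
e["sample s"] = sample.std(numeric_only=True)
display(e)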

As we can see, the sample standard deviation is an estimate of the population standard deviation and is usually close to it. The standard error of the sample is higher, however, because it is calculated from far fewer observations.

**Validation**

In [0]:
#TODO(b/132249958)

## Exercise 3


Using a dataset that you found already, or a new one, create 2 visualizations that share the same figure using `plt`, as well as their mean and median lines.  The first visualization should show the frequency, and the second should show the probability.

### Student Solution

In [0]:
# Your answer goes here

### Answer Key

**Solution**

In [0]:
df = sns.load_dataset('mpg')

fig = plt.figure()

plt.subplot(1,2,1)
ax3 = df['acceleration'].hist()
plt.axvline(df['acceleration'].mean(), color='yellow')
plt.axvline(df['acceleration'].median(), 
            color='cyan', ls=':', lw=5.0)
plt.title("Frequency")

plt.subplot(1,2,2)
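# distplot is deprecated in recent seaborn releases; histplot with kde=True
# and stat='density' draws the same histogram plus KDE curve.
sns.histplot(df['acceleration'], kde=True, stat='density')
plt.xlabel('')
plt.title("KDE")

plt.axvline(df['acceleration'].mean(), color='orangered', label='mean')
plt.axvline(df['acceleration'].median(), 
            color='darkblue', ls=':', lw=5.0, label='median')

# Label the lines explicitly so the legend does not pick up the KDE curve.
plt.legend()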
plt.suptitle("Acceleration", y=-.01)
plt.tight_layout()
plt.show()

print("mean:",df['acceleration'].mean(),
        "\nmedian:",df['acceleration'].median())

**Validation**

In [0]:
# TODO(b/132249958)

## Exercise 4


Plot 2 variables against each other, and calculate the $R^{2}$ and $p_{value}$ for a regression line that fits the data.

### Student Solution

In [0]:
# Your answer goes here

### Answer Key

**Solution**

In [0]:
# Load tips data set and plot total bill vs tip
tips = sns.load_dataset('tips')

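def getR(data,x,y):
    # pearsonr returns Pearson's r and its p-value; square r to get R^2.
    result = pearsonr(data[x],data[y])
    r_squared = round(result[0]**2, 5)
    p = np.format_float_scientific(result[1], precision=4)

    g = sns.jointplot(x=x, y=y, data=data, kind='reg')
    # Annotate the joint (scatter) axes explicitly.
    g.ax_joint.text(0,10,r'$R^{2}=%s$'%r_squared)
    g.ax_joint.text(0,9,r'$p_{value}=%s$'%p)
    plt.show()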

getR(tips,'total_bill','tip')

**Validation**

In [0]:
# TODO(b/132249958)