### 1. What Are Summary Statistics?

* Definition: Summary statistics are _numbers that summarize_ and provide information about your data.
* A statistic is a numerical value that is computed from a sample of data.
  * It's used to infer or estimate characteristics of the larger population from which the sample is drawn.
* Summary statistics typically include measures of central tendency, measures of variability, and quantiles.


### Measures of Central Tendency
* These statistics represent the center point or typical value of a dataset.
  * They act as 'average' values that summarize where most data points lie.
* The three main measures of central tendency are:
  * Mean: The sum of the values divided by the number of values. This is what most people think of as the average.
    * For example, if your data is `[2, 3, 5]`, the mean is $(2+3+5)/3 \approx 3.33$.
  * Median: The middle value in your data when it is arranged in ascending order. With an even number of values, it is the average of the two middle values.
    * For example, in the data `[2, 3, 5, 7]`, the median is $(3+5)/2 = 4$.
  * Mode: The value that appears most frequently in your data.
    * For example, in `[1, 2, 2, 3]`, the mode is `2`.
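
The three examples above can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen here for illustration only.

```python
import statistics

# The three small datasets used in the examples above
mean_value = statistics.mean([2, 3, 5])         # (2 + 3 + 5) / 3
median_value = statistics.median([2, 3, 5, 7])  # average of the two middle values
mode_value = statistics.mode([1, 2, 2, 3])      # most frequent value

print(f"mean = {mean_value:.2f}, median = {median_value}, mode = {mode_value}")
```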

### Mean
* The formula to compute the mean for a set of $n$ values $x_1, x_2, \ldots, x_n$ is:
$$ \text{Mean} = \bar{x} = \frac{\sum_{i=1}^{n}x_i}{n} $$

* The symbol $\bar{x}$ (pronounced “x-bar”) represents the mean of a sample from a population.

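* The sample mean is also an estimate of the population mean.
  * Why is it an estimate? Because it is computed from a sample rather than from the whole population, so a different sample would generally give a slightly different value.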
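
One way to see the sample mean acting as an estimate is to compare it with the mean of a population that is fully known. The sketch below assumes a hypothetical population (the integers 1 to 100) and a sample size of 10; both are illustrative choices, not part of the original material.

```python
import random
import statistics

# Hypothetical population: the integers 1..100 (its mean is 50.5)
population = list(range(1, 101))
population_mean = statistics.mean(population)

# Draw a sample of 10 values; the seed is fixed only to make the run repeatable
rng = random.Random(0)
sample = rng.sample(population, 10)
sample_mean = statistics.mean(sample)

print(f"population mean = {population_mean}, sample mean = {sample_mean}")
```

Changing the seed draws a different sample, and the sample mean will usually move with it while the population mean stays fixed.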


### Median

* The median is referred to as a robust estimate of location because it is much less influenced by extreme values (outliers) that could skew the results.
  * An outlier is any value that is very distant from the other values in a dataset.
    * It does not look like the other values in the data.
    * Outliers may be erroneous or rare data.
* Outliers can pull the mean far away from the typical values,
  * while the median stays close to the center of the data.
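
The effect of an outlier on the two measures can be sketched as follows. The dataset and the extreme value `1000` are made up for illustration.

```python
import statistics

values = [2, 3, 5, 7]            # small illustrative dataset
with_outlier = values + [1000]   # same data plus one extreme value

mean_before = statistics.mean(values)            # 17 / 4 = 4.25
mean_after = statistics.mean(with_outlier)       # 1017 / 5 = 203.4
median_before = statistics.median(values)        # (3 + 5) / 2 = 4.0
median_after = statistics.median(with_outlier)   # middle value = 5

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

A single extreme value moves the mean far from every other data point, while the median shifts only to the next value in the sorted data.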


### Measures of Variability

* Location is just one dimension in summarizing a feature.
* Variability, also referred to as dispersion, measures whether the data values are tightly grouped or spread out.
  * Variability is a very important concept in statistics, especially in hypothesis testing and machine learning.
* These statistics describe the spread or dispersion of your data, showing how much the data points differ from each other and from the central tendency measures.

* Range: difference between the highest and lowest values in your dataset.
  * For example, in `[1, 2, 3, 10]`, the range is $10 - 1 = 9$.

* Variance: how much the data points differ from the mean.
  * It is the average of the squared differences from the mean. For a sample, the sum of squared differences is divided by $n-1$ rather than $n$.

* Standard Deviation: the square root of the variance.
  * It's widely used because it is in the same units as the data, making it easier to interpret.
  * A larger standard deviation means the data is more spread out.

### Formulas for Measures of Variability
$$
\begin{align*}
\text{Mean absolute deviation} &= \frac{\sum_{i=1}^{n}|x_i - \bar{x}|}{n}\\  
\text{Variance} &= s^2 =  \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}\\
\text{Standard deviation} &= s =  \sqrt{\text{Variance}} \\
\text{Median absolute deviation} &= \text{Median}\left(|x_1 - \text{median}|, |x_2 - \text{median}|, \ldots, |x_n - \text{median}|\right)
\end{align*}
$$

* Of the above, only the median absolute deviation is robust to outliers and extreme values.
  * $s$ and $s^2$ are especially sensitive to outliers because they are based on squared deviations.
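
The formulas above can be written out directly in Python. This is a minimal sketch that reuses the small dataset `[1, 2, 3, 10]` from the range example; the variable names are illustrative.

```python
import statistics

data = [1, 2, 3, 10]  # the dataset from the range example
n = len(data)
mean = statistics.mean(data)      # 4
median = statistics.median(data)  # 2.5

# Range: largest value minus smallest value
data_range = max(data) - min(data)

# Mean absolute deviation: average distance from the mean
mean_abs_dev = sum(abs(x - mean) for x in data) / n

# Sample variance and standard deviation (divide by n - 1)
variance = sum((x - mean) ** 2 for x in data) / (n - 1)
std_dev = variance ** 0.5

# Median absolute deviation: median distance from the median
median_abs_dev = statistics.median(abs(x - median) for x in data)

print(f"range = {data_range}")
print(f"mean absolute deviation = {mean_abs_dev}")
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
print(f"median absolute deviation = {median_abs_dev}")
```

The hand-written variance should agree with `statistics.variance(data)`, which also divides by $n-1$.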


### Order Statistics

* A different approach to estimating dispersion is based on looking at the spread of the sorted data.
  * Statistics based on sorted (ranked) data are referred to as order statistics.
* The most basic order statistic is the range: the difference between the largest and smallest values.





### Quantiles

* Quantiles are values that divide your sorted data into equal-sized parts.

* The most common quantiles are:
* Quartiles: These split the data into four equal parts.
  * The first quartile (Q1) is the median of the first half of the data.
  * The second quartile (Q2) is the median of the whole dataset.
  * The third quartile (Q3) is the median of the second half of the data.

* Percentiles: These divide the data into 100 equal parts.
  * The $P$th percentile is a value such that at least $P$ percent of the values take on this value or less and at least $(100 - P)$ percent of the values take on this value or more.
  * For example, the 25th percentile is the point below which 25% of the sorted data lies.
  * The median is the same thing as the 50th percentile.
  * To find the 80th percentile:
    - sort the data
    - starting with the smallest value, proceed 80 percent of the way to the largest value
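
Percentiles can be computed with NumPy's `np.percentile`. The sketch below uses an illustrative dataset and should be read with one caveat: by default `np.percentile` interpolates linearly between neighboring sorted values, so its results can differ slightly from hand methods such as taking the median of each half.

```python
import numpy as np

# Illustrative dataset (np.percentile sorts internally)
data = np.array([5, 10, 15, 20, 25, 30, 10, 35, 40, 45, 50])

# 25th, 50th and 80th percentiles in one call
p25, p50, p80 = np.percentile(data, [25, 50, 80])

print(f"25th percentile = {p25}")
print(f"50th percentile = {p50} (median = {np.median(data)})")
print(f"80th percentile = {p80}")
```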



### The Interquartile Range (IQR)

* The IQR is the difference between the 75th percentile and the 25th percentile.

* Example: given the dataset `{1, 2, 3, 3, 5, 6, 7, 9}`:
  * The 25th percentile is 2.5 (the median of the lower half).
  * The 75th percentile is 6.5 (the median of the upper half).
  * So the interquartile range is $6.5 - 2.5 = 4$.
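
The worked example can be reproduced with the "median of each half" method described for quartiles. This is a sketch under that assumption; library functions such as `np.percentile` use interpolation by default and may return slightly different quartiles for the same data.

```python
import statistics

data = sorted([1, 2, 3, 3, 5, 6, 7, 9])
half = len(data) // 2

# Split the sorted data into a lower and an upper half
# (for an odd number of values this leaves out the middle value)
lower_half = data[:half]    # [1, 2, 3, 3]
upper_half = data[-half:]   # [5, 6, 7, 9]

q1 = statistics.median(lower_half)  # 25th percentile
q3 = statistics.median(upper_half)  # 75th percentile
iqr = q3 - q1

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```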

### Why Do We Need Summary Statistics?

* Overview: They provide a quick snapshot of the data.
* Understanding: They help in understanding the distribution, tendency, and spread of data.
* Comparison: They make it easier to compare different datasets.


In [1]:
import numpy as np
import pandas as pd

# Sample data
data = np.array([5, 10, 15, 20, 25, 30, 10, 35, 40, 45, 50])

# Creating a DataFrame
df = pd.DataFrame(data, columns=['SomeCol'])

# Calculating summary statistics
mean_value = df['SomeCol'].mean()

median_value = df['SomeCol'].median()
mode_value = df['SomeCol'].mode()[0]
std_dev = df['SomeCol'].std()
variance = df['SomeCol'].var()


In [2]:
print(f"The mean is {mean_value}")

The mean is 25.90909090909091


In [3]:
print(f"The median is {median_value}")

The median is 25.0


In [4]:
print(f"The mode is {mode_value}")

The mode is 10


In [5]:
print(f"The standard deviation is {std_dev}")

The standard deviation is 15.300029708824395


In [6]:
print(f"The variance is {variance:.2f}")

The variance is 234.09


# A/B Testing

Suppose you have two versions of a webpage: a percentage of users is sent to version A and a percentage to version B.

Analyze the results to see which version led to the better outcome, for example:
- online revenue
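
As a rough sketch of the comparison step, the summary statistics from earlier can be computed per version. The revenue figures below are hypothetical values invented for illustration, not real results, and the variable names are assumptions.

```python
import statistics

# Hypothetical revenue per visitor (in dollars) for each version of the page
revenue_a = [0.0, 12.5, 0.0, 30.0, 7.5]
revenue_b = [0.0, 20.0, 15.0, 0.0, 25.0]

mean_a = statistics.mean(revenue_a)
mean_b = statistics.mean(revenue_b)

# Spread matters too: a small difference in means can be noise
std_a = statistics.stdev(revenue_a)
std_b = statistics.stdev(revenue_b)

print(f"A: mean = {mean_a:.2f}, std = {std_a:.2f}")
print(f"B: mean = {mean_b:.2f}, std = {std_b:.2f}")
```

A higher mean for one version does not by itself show that the version is better; deciding that usually calls for a hypothesis test that takes the variability and sample size into account.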
