** Descriptive Versus Inferential Statistics **
- Descriptive statistics is the branch of statistics that deals with collecting, organizing, and presenting data in order to describe and summarize it.
- Inferential statistics is the branch of statistics that deals with drawing conclusions or making predictions about a population based on a sample of data.

** Populations, Samples, and Bias **
- A population is the entire group of individuals or items that we are interested in studying.
- A sample is a subset of the population that we collect data from.
- Bias is a systematic tendency for a sample to differ from the population, for example when some members are more likely to be selected than others.
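
A small simulation can make the idea of bias concrete. The sketch below is illustrative only: the population values, the sample size of 500, and the "above the median" selection rule are all assumptions made up for this example.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 values (illustrative numbers only)
population = [random.gauss(50, 10) for _ in range(10_000)]

# Unbiased: every member has the same chance of being selected
random_sample = random.sample(population, 500)

# Biased: only members above the population median can be selected
cutoff = sorted(population)[len(population) // 2]
biased_sample = random.sample([v for v in population if v > cutoff], 500)

print(mean(population))
print(mean(random_sample))  # close to the population mean
print(mean(biased_sample))  # systematically higher than the population mean
```

The random sample misses the population mean only by chance, while the biased sample misses it in the same direction every time, no matter how large it is.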


** Descriptive Statistics **
- Descriptive statistics are used to summarize and describe the main features of a dataset.
- Common descriptive statistics include measures of central tendency (mean, median, mode), measures of dispersion (range, variance, standard deviation), and measures of shape (skewness, kurtosis).
- These summaries help reveal patterns and trends in the data, such as where values cluster and how widely they spread.
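
As a quick overview before the hand-written versions, the measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch; the skewness line uses a simple moment-based formula chosen for illustration, not a library function.

```python
from statistics import mean, median, multimode, pvariance, pstdev

# Number of pets each person owns
sample = [1, 3, 2, 5, 7, 0, 2, 3]

center = mean(sample)
middle = median(sample)
modes = multimode(sample)                 # every value tied for most frequent
value_range = max(sample) - min(sample)
spread = pstdev(sample)                   # population standard deviation

# Moment-based skewness: average cubed z-score (positive = longer right tail)
skewness = sum(((x - center) / spread) ** 3 for x in sample) / len(sample)

print(center, middle, sorted(modes), value_range)  # 2.875 2.5 [2, 3] 7
print(pvariance(sample), spread, skewness)
```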

** Mean and Weighted Mean **
- The mean is the average of a set of numbers. It is calculated by adding up all the numbers in the set and dividing by how many numbers there are.
- The weighted mean is a type of mean that takes into account the weights of the numbers in the set. It is calculated by multiplying each number by its weight, adding up the weighted numbers, and dividing by the total weight.

In [None]:
# Number of pets each person owns
sample = [1, 3, 2, 5, 7, 0, 2, 3]
mean = sum(sample) / len(sample)
print(mean) # prints 2.875

In [None]:
# Calculating a weighted mean in Python
# Three exams of .20 weight each and final exam of .40 weight
sample = [90, 80, 63, 87]
weights = [.20, .20, .20, .40]
weighted_mean = sum(s * w for s,w in zip(sample, weights)) / sum(weights)
print(weighted_mean) # prints 81.4

In [None]:
# Weights do not need to sum to 1: three exams of weight 1.0 each and a final exam of weight 2.0
sample = [90, 80, 63, 87]
weights = [1.0, 1.0, 1.0, 2.0]
weighted_mean = sum(s * w for s,w in zip(sample, weights)) / sum(weights)
print(weighted_mean) # prints 81.4

** Median **
- The median is the middle value of a set of numbers when they are arranged in order. If there is an even number of values, the median is the average of the two middle values.

In [None]:
# Number of pets each person owns
sample = [0, 1, 5, 7, 9, 10, 14]

def median(values):
    ordered = sorted(values)
    n = len(ordered)
    # For an even count, mid is the lower of the two middle indices
    mid = n // 2 - 1 if n % 2 == 0 else n // 2
    if n % 2 == 0:
        # Average the two middle values
        return (ordered[mid] + ordered[mid + 1]) / 2.0
    else:
        return ordered[mid]

print(median(sample)) # prints 7

** Mode **
- The mode is the value that appears most frequently in a set of numbers.


In [None]:
# Number of pets each person owns
from collections import defaultdict
sample = [1, 3, 2, 5, 7, 0, 2, 3]

def mode(values):
    counts = defaultdict(lambda: 0)
    for s in values:
        counts[s] += 1
    max_count = max(counts.values())
    modes = [v for v in set(values) if counts[v] == max_count]
    return modes
print(mode(sample)) # [2, 3]

** Variance and Standard Deviation **
- Variance is a measure of how spread out the numbers in a dataset are. It is calculated by taking the average of the squared differences between each number and the mean. For a whole population the sum is divided by n; when estimating from a sample it is divided by n - 1 to correct for the sample's tendency to underestimate the spread.
- Standard deviation is the square root of the variance, which puts the spread back in the same units as the data. A larger standard deviation indicates a greater spread.


In [None]:
# Number of pets each person owns
data = [0, 1, 5, 7, 9, 10, 14]
def variance(values):
    mean = sum(values) / len(values)
    _variance = sum((v - mean) ** 2 for v in values) / len(values)
    return _variance

print(variance(data)) # prints 21.387755102040813

In [None]:
from math import sqrt
# Number of pets each person owns
data = [0, 1, 5, 7, 9, 10, 14]
def variance(values):
    mean = sum(values) / len(values)
    _variance = sum((v - mean) ** 2 for v in values) / len(values)
    return _variance

def std_dev(values):
    return sqrt(variance(values))

print(std_dev(data)) # prints 4.624689730353898

In [None]:
from math import sqrt
# Number of pets each person owns
data = [0, 1, 5, 7, 9, 10, 14]
def variance(values, is_sample: bool = False):
    mean = sum(values) / len(values)
    # Divide by n - 1 for a sample (Bessel's correction), by n for a population
    _variance = (sum((v - mean) ** 2 for v in values)
                 / (len(values) - (1 if is_sample else 0)))
    return _variance

def std_dev(values, is_sample: bool = False):
    return sqrt(variance(values, is_sample))

print("VARIANCE = {}".format(variance(data, is_sample=True))) # 24.95238095238095
print("STD DEV = {}".format(std_dev(data, is_sample=True))) # 4.99523582550223
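
To see why the sample version divides by n - 1, the following sketch repeatedly draws small samples from a population whose variance is known and averages both estimates. The standard normal population, the sample size of 5, and the trial count are assumptions chosen for illustration.

```python
import random

random.seed(1)

def variance(values, is_sample=False):
    mean = sum(values) / len(values)
    return (sum((v - mean) ** 2 for v in values)
            / (len(values) - (1 if is_sample else 0)))

# Assumed population: standard normal, so the true variance is 1.0
trials, n = 20_000, 5
biased_total = 0.0
unbiased_total = 0.0
for _ in range(trials):
    draw = [random.gauss(0.0, 1.0) for _ in range(n)]
    biased_total += variance(draw)                    # divides by n
    unbiased_total += variance(draw, is_sample=True)  # divides by n - 1

avg_biased = biased_total / trials
avg_unbiased = unbiased_total / trials
print(avg_biased)    # tends toward (n - 1) / n = 0.8
print(avg_unbiased)  # tends toward the true variance, 1.0
```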

** The Normal Distribution **
- The normal distribution is a bell-shaped distribution that is symmetrical around the mean. It is characterized by two parameters: the mean and the standard deviation.
- The normal distribution is important in statistics because many natural phenomena approximately follow it, and it underlies many statistical tests and models.


In [None]:
import math

def normal_pdf(x: float, mean: float, std_dev: float) -> float:
    return (1.0 / (2.0 * math.pi * std_dev ** 2) ** 0.5) * \
        math.exp(-1.0 * ((x - mean) ** 2 / (2.0 * std_dev ** 2)))
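
One consequence of the bell shape is the familiar 68-95-99.7 rule: the share of values within 1, 2, and 3 standard deviations of the mean. The sketch below checks it with the standard library's `statistics.NormalDist` (used here instead of SciPy to keep the example dependency-free); the helper name `area_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def area_within(k: float) -> float:
    """Probability of landing within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```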

** The Cumulative Distribution Function (CDF) **
- The cumulative distribution function (CDF) is a function that gives the probability that a random variable takes on a value less than or equal to a given value.
- For a continuous random variable, the probability of falling within a range is the difference between the CDF values at the range's two endpoints.


In [None]:
from scipy.stats import norm
mean = 64.43
std_dev = 2.99
x = norm.cdf(64.43, mean, std_dev)
print(x) # prints 0.5

** The Inverse CDF **
- The inverse cumulative distribution function (inverse CDF) reverses the CDF: given a probability, it returns the value below which that share of the distribution lies. SciPy calls it the percent point function (`ppf`).
- The inverse CDF is used to calculate quantiles for continuous random variables, and to generate random values from a distribution by feeding it uniform random probabilities.

In [None]:
from scipy.stats import norm
x = norm.ppf(.95, loc=64.43, scale=2.99)
print(x) # 69.3481123445849
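
To show that the two functions really are inverses, this sketch makes a round trip from probability to value and back. It uses `statistics.NormalDist` from the standard library rather than SciPy, with the same mean and standard deviation as the cell above.

```python
from statistics import NormalDist

dist = NormalDist(mu=64.43, sigma=2.99)

p = 0.95
x = dist.inv_cdf(p)    # value with 95% of the area below it
p_back = dist.cdf(x)   # feeding x to the CDF recovers the probability

print(x)
print(p_back)
```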

In [None]:
import random
from scipy.stats import norm
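# Generate 1000 random values from the normal distribution
# by feeding uniform random probabilities into the inverse CDF
for i in range(0, 1000):
    random_p = random.uniform(0.0, 1.0)
    random_weight = norm.ppf(random_p, loc=64.43, scale=2.99)
    print(random_weight)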

** Z-Scores **
- A z-score is a measure of how many standard deviations a data point is from the mean of a dataset.
- Z-scores are used to standardize data and compare data points from different datasets.

In [None]:
def z_score(x, mean, std):
    return (x - mean) / std

def z_to_x(z, mean, std):
    return (z * std) + mean

mean = 140000
std_dev = 3000
x = 150000
# Convert to Z-score and then back to X
z = z_score(x, mean, std_dev)
back_to_x = z_to_x(z, mean, std_dev)
print("Z-Score: {}".format(z)) # Z-Score: 3.333
print("Back to X: {}".format(back_to_x)) # Back to X: 150000.0
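
Z-scores are most useful when comparing values measured on different scales. The sketch below compares two houses against their own neighborhoods; the second neighborhood's mean, standard deviation, and house price are hypothetical numbers invented for this example.

```python
def z_score(x, mean, std):
    return (x - mean) / std

# House A: neighborhood from the cell above
z_a = z_score(150_000, mean=140_000, std=3_000)

# House B: hypothetical pricier neighborhood with more variation
z_b = z_score(815_000, mean=800_000, std=10_000)

print(z_a)
print(z_b)        # 1.5
print(z_a > z_b)  # True
```

House B costs far more in absolute terms, but house A is the more unusual price relative to its own neighborhood.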

** Inferential Statistics **
- Inferential statistics use a sample of data to draw conclusions about the population it came from.
- Common inferential techniques include hypothesis testing, confidence intervals, and regression analysis.
- Because a sample only partly represents the population, every inference carries uncertainty; these methods quantify it and support predictions about future outcomes.


** The Central Limit Theorem **
- The central limit theorem states that the sampling distribution of the sample mean will be approximately normal, regardless of the shape of the population distribution, as long as the observations are independent, the population has a finite variance, and the sample size is large enough (a common rule of thumb is more than 30).
- The central limit theorem is important in statistics because it allows us to make inferences about a population mean based on a sample of data, even if the population distribution is not normal.


In [None]:
import random
import plotly.express as px
sample_size = 31
sample_count = 1000
# Central limit theorem, 1000 samples each with 31
# random numbers between 0.0 and 1.0
x_values = [
    sum(random.uniform(0.0, 1.0) for _ in range(sample_size)) / sample_size
    for _ in range(sample_count)
]
y_values = [1 for _ in range(sample_count)]
px.histogram(x=x_values, y=y_values, nbins=20).show()
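
The same experiment can be checked numerically without a plot. The theorem also predicts the spread of the sample means: the population standard deviation divided by the square root of the sample size. This sketch assumes the same uniform(0, 1) population; the seed is arbitrary.

```python
import random
from statistics import mean, stdev

random.seed(7)
sample_size = 31
sample_count = 1000

sample_means = [
    mean(random.uniform(0.0, 1.0) for _ in range(sample_size))
    for _ in range(sample_count)
]

# uniform(0, 1) has mean 0.5 and standard deviation sqrt(1/12)
expected_spread = (1 / 12) ** 0.5 / sample_size ** 0.5

print(mean(sample_means))   # close to 0.5
print(stdev(sample_means))  # close to expected_spread
print(expected_spread)
```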

** Confidence Intervals **
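- A confidence interval is a range of values, computed from a sample, that is likely to contain the true value of a population parameter. The confidence level (for example, 95%) is the proportion of such intervals that would contain the true value if the sampling were repeated many times.
- Confidence intervals are used to express the precision of a sample estimate: a narrower interval means a more precise estimate.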


In [2]:
from scipy.stats import norm
def critical_z_value(p):
    norm_dist = norm(loc=0.0, scale=1.0)
    left_tail_area = (1.0 - p) / 2.0
    upper_area = 1.0 - ((1.0 - p) / 2.0)
    return norm_dist.ppf(left_tail_area), norm_dist.ppf(upper_area)
print(critical_z_value(p=.95))
# (-1.959963984540054, 1.959963984540054)

(np.float64(-1.959963984540054), np.float64(1.959963984540054))


In [4]:
from math import sqrt
from scipy.stats import norm
def critical_z_value(p):
    norm_dist = norm(loc=0.0, scale=1.0)
    left_tail_area = (1.0 - p) / 2.0
    upper_area = 1.0 - ((1.0 - p) / 2.0)
    return norm_dist.ppf(left_tail_area), norm_dist.ppf(upper_area)

In [5]:
def confidence_interval(p, sample_mean, sample_std, n):
    # Sample size must be greater than 30
    lower, upper = critical_z_value(p)
    lower_ci = lower * (sample_std / sqrt(n))
    upper_ci = upper * (sample_std / sqrt(n))
    return sample_mean + lower_ci, sample_mean + upper_ci
print(confidence_interval(p=.95, sample_mean=64.408, sample_std=2.05, n=31))
# (63.68635915701992, 65.12964084298008)

(np.float64(63.68635915701992), np.float64(65.12964084298008))
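
What "95% confidence" means can be checked by simulation: build many intervals from independent samples and count how often they contain the true mean. This is a sketch under stated assumptions: the "true" population parameters are borrowed from the earlier examples, and the trial count and seed are arbitrary. It uses the standard library's `NormalDist` for the critical z-value instead of SciPy.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(42)

# Assumed "true" population, which is unknown in practice
true_mean, true_std = 64.43, 2.99
n, trials, p = 31, 2000, 0.95

z = NormalDist().inv_cdf(1.0 - (1.0 - p) / 2.0)  # about 1.96

hits = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, true_std) for _ in range(n)]
    sample_mean = mean(sample)
    margin = z * stdev(sample) / sqrt(n)
    if sample_mean - margin <= true_mean <= sample_mean + margin:
        hits += 1

coverage = hits / trials
print(coverage)  # typically close to 0.95
```

Any single interval either contains the true mean or it does not; the 95% describes the long-run success rate of the procedure.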


** Understanding P-Values **
When we say something is statistically significant, what do we mean by that? The term is used loosely and frequently, but mathematically it comes down to the p-value, a concept many people find hard to grasp. The p-value is the probability of observing a result at least as extreme as the one obtained, assuming the effect being tested does not exist (the null hypothesis). A small p-value means the result would be unlikely to occur by chance alone.

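** Hypothesis Testing **
Past studies have shown that the mean recovery time for a cold is 18 days, with a standard deviation of 1.5 days, and that recovery time follows a normal distribution. You have a new drug that you think will reduce the recovery time for a cold. The null hypothesis is that the drug has no effect, so recovery times under treatment would still follow this distribution.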

In [7]:
from scipy.stats import norm
# Cold has 18 day mean recovery, 1.5 std dev
mean = 18
std_dev = 1.5
# About 95% probability (roughly 95.4%) that recovery takes between 15 and 21 days
x = norm.cdf(21, mean, std_dev) - norm.cdf(15, mean, std_dev)
print(x) # 0.9544997361036416

0.9544997361036416


In [8]:
from scipy.stats import norm
# Cold has 18 day mean recovery, 1.5 std dev
mean = 18
std_dev = 1.5
# Probability of 16 or fewer days
p_value = norm.cdf(16, mean, std_dev)
print(p_value) # 0.09121121972586788

0.09121121972586788
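
For a one-tailed test, the p-value is compared against a significance threshold, or equivalently the observed value is compared against a critical value. The sketch below assumes the conventional threshold of 0.05, which the cells above do not state, and uses `statistics.NormalDist` in place of SciPy.

```python
from statistics import NormalDist

recovery = NormalDist(mu=18, sigma=1.5)
alpha = 0.05  # conventional significance threshold (an assumption here)

# Recovery time with 5% of the area below it
critical_x = recovery.inv_cdf(alpha)

observed = 16  # the 16-day figure tested in the cell above
p_value = recovery.cdf(observed)

print(critical_x)          # about 15.53
print(p_value < alpha)     # False
```

Because 16 days is above the critical value, the result is not statistically significant at this threshold.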


** Two-Tailed Test **
- A two-tailed test is a statistical test in which the null hypothesis is rejected if the test statistic falls in either tail of the distribution, that is, below the lower critical value or above the upper critical value.
- A two-tailed test is used when the alternative hypothesis is that the population parameter is not equal to a specified value.

In [9]:
from scipy.stats import norm
# Cold has 18 day mean recovery, 1.5 std dev
mean = 18
std_dev = 1.5
# What x-value has 2.5% of area behind it?
x1 = norm.ppf(.025, mean, std_dev)
# What x-value has 97.5% of area behind it
x2 = norm.ppf(.975, mean, std_dev)
print(x1) # 15.060054023189918
print(x2) # 20.93994597681008

15.060054023189918
20.93994597681008


In [10]:
from scipy.stats import norm
# Cold has 18 day mean recovery, 1.5 std dev
mean = 18
std_dev = 1.5
# Probability of 16 or fewer days
p1 = norm.cdf(16, mean, std_dev)
# Probability of 20 or more days
p2 = 1.0 - norm.cdf(20, mean, std_dev)
# P-value of both tails
p_value = p1 + p2
print(p_value) # 0.18242243945173575

0.18242243945173575
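
Because the normal distribution is symmetric, the two-tailed p-value is simply twice the one-tailed value. The sketch below wraps that in a small helper and applies a decision rule; the helper name `two_tailed_p` and the 0.05 threshold are assumptions for illustration, and `statistics.NormalDist` stands in for SciPy.

```python
from statistics import NormalDist

recovery = NormalDist(mu=18, sigma=1.5)
alpha = 0.05  # assumed significance threshold

def two_tailed_p(observed: float) -> float:
    """Area in both tails at least as far from the mean as observed."""
    distance = abs(observed - recovery.mean)
    lower_tail = recovery.cdf(recovery.mean - distance)
    return 2.0 * lower_tail  # the normal distribution is symmetric

p_value = two_tailed_p(16)
print(p_value)
print("reject" if p_value < alpha else "fail to reject")  # fail to reject
```

The two-tailed test is the stricter of the two: the same observation yields double the p-value, so it is harder to reach significance.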


In [11]:
from scipy.stats import t
# get critical value range for 95% confidence
# with a sample size of 25
n = 25
lower = t.ppf(.025, df=n-1)
upper = t.ppf(.975, df=n-1)
print(lower, upper)
# -2.063898561628021 2.0638985616280205

-2.063898561628021 2.0638985616280205


** Big Data Considerations and the Texas Sharpshooter Fallacy **
- The Texas sharpshooter fallacy is a logical fallacy in which a person cherry-picks data after the fact to suit their argument or hypothesis.
- The Texas sharpshooter fallacy is a common mistake in data analysis and can lead to incorrect conclusions and biased results.
- To avoid the Texas sharpshooter fallacy, it is important to define the hypothesis or research question before collecting and analyzing data and to use appropriate statistical methods to test the hypothesis.
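
The danger is easy to demonstrate: test enough unrelated things and some will look significant by chance. The sketch below runs many experiments on pure noise, where no real effect exists, and counts how many pass a 0.05 threshold. The experiment count, sample size, and seed are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(3)
standard_normal = NormalDist()
alpha = 0.05
experiments, n = 1000, 50

false_positives = 0
for _ in range(experiments):
    # Pure noise: the true mean is 0, so the null hypothesis holds
    noise = [random.gauss(0.0, 1.0) for _ in range(n)]
    z = mean(noise) * sqrt(n)  # z-statistic with known std dev of 1
    p_value = 2.0 * (1.0 - standard_normal.cdf(abs(z)))
    if p_value < alpha:
        false_positives += 1

print(false_positives)  # around 50 of 1000, despite no real effect
```

Reporting only the experiments that passed, after the fact, would be drawing the target around the bullet holes.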
