# Basic Statistics

Some of the contents are from stat.yale.edu<br>
Revised by Junyao Yang

### Statistics theory is widely applied in data science, including but not limited to:
    1. Descriptive Statistics
    2. Probability Theory and Distributions (Normal, Student's t, and Binomial)
    3. Inferential Statistics
    4. Correlation and Regression
    5. Statistical Testing
    6. Bayesian Statistics
    
### Here are some fundamental statistical concepts that are particularly relevant to business students:<br>
    1. Descriptive Statistics
        a. Mean, median and mode
        b. Range and interquartile range
        c. Standard deviation
    2. Probability
        a. Basic probability
        b. Bayesian theory
    3. Sampling and Sampling distributions
        a. Sampling techniques
        b. Sampling distribution
    4. Inferential statistics
        a. Confidence intervals
        b. Statistical tests
        c. Regression Analysis
    5. Financial Statistics
        a. Return analysis
        b. Risk analysis
    6. Business Forecasting
        a. Time and non-time series
        b. Trend analysis

<span style='color:blue'>Descriptive Statistics, Sampling and Sampling distributions, Statistical inference</span> will be briefly covered this week. 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
fname = "../../data/kc_house_data.csv"
df = pd.read_csv(fname)
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [3]:
### Descriptive Statistics is very simple. 
df.describe()

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0
mean,4580302000.0,540182.2,3.370842,2.114757,2079.899736,15106.97,1.494309,0.007542,0.234303,3.40943,7.656873,1788.390691,291.509045,1971.005136,84.402258,98077.939805,47.560053,-122.213896,1986.552492,12768.455652
std,2876566000.0,367362.2,0.930062,0.770163,918.440897,41420.51,0.539989,0.086517,0.766318,0.650743,1.175459,828.090978,442.575043,29.373411,401.67924,53.505026,0.138564,0.140828,685.391304,27304.179631
min,1000102.0,75000.0,0.0,0.0,290.0,520.0,1.0,0.0,0.0,1.0,1.0,290.0,0.0,1900.0,0.0,98001.0,47.1559,-122.519,399.0,651.0
25%,2123049000.0,321950.0,3.0,1.75,1427.0,5040.0,1.0,0.0,0.0,3.0,7.0,1190.0,0.0,1951.0,0.0,98033.0,47.471,-122.328,1490.0,5100.0
50%,3904930000.0,450000.0,3.0,2.25,1910.0,7618.0,1.5,0.0,0.0,3.0,7.0,1560.0,0.0,1975.0,0.0,98065.0,47.5718,-122.23,1840.0,7620.0
75%,7308900000.0,645000.0,4.0,2.5,2550.0,10688.0,2.0,0.0,0.0,4.0,8.0,2210.0,560.0,1997.0,0.0,98118.0,47.678,-122.125,2360.0,10083.0
max,9900000000.0,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,1.0,4.0,5.0,13.0,9410.0,4820.0,2015.0,2015.0,98199.0,47.7776,-121.315,6210.0,871200.0


In [8]:
### Mean: The average of the given numbers, calculated by dividing their sum by how many numbers there are.
### Median: The middle number, found by ordering all data points and picking out the one in the middle.
### Mode: The most frequent number, that is, the number that occurs the highest number of times.
df_sub = df.drop('date', axis = 1)
df_sub.median(axis = 0), df_sub.mode(axis = 0)

(id               3.904930e+09
 price            4.500000e+05
 bedrooms         3.000000e+00
 bathrooms        2.250000e+00
 sqft_living      1.910000e+03
 sqft_lot         7.618000e+03
 floors           1.500000e+00
 waterfront       0.000000e+00
 view             0.000000e+00
 condition        3.000000e+00
 grade            7.000000e+00
 sqft_above       1.560000e+03
 sqft_basement    0.000000e+00
 yr_built         1.975000e+03
 yr_renovated     0.000000e+00
 zipcode          9.806500e+04
 lat              4.757180e+01
 long            -1.222300e+02
 sqft_living15    1.840000e+03
 sqft_lot15       7.620000e+03
 dtype: float64,
             id     price  bedrooms  bathrooms  sqft_living  sqft_lot  floors  \
 0  795000620.0  350000.0       3.0        2.5       1300.0    5000.0     1.0   
 1          NaN  450000.0       NaN        NaN          NaN       NaN     NaN   
 2          NaN       NaN       NaN        NaN          NaN       NaN     NaN   
 3          NaN       NaN       NaN    
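
The measures listed in the outline (mean, median, mode, range, interquartile range, and standard deviation) can each be computed directly with pandas. The following is a minimal sketch on a small made-up list of prices; the values are hypothetical and are not taken from `kc_house_data.csv`.

```python
import pandas as pd

# Hypothetical toy data, chosen only to make the arithmetic easy to follow
prices = pd.Series([200, 250, 250, 300, 400, 1000])

mean = prices.mean()                      # sum divided by count
median = prices.median()                  # middle value of the sorted data
mode = prices.mode().tolist()             # most frequent value(s)
value_range = prices.max() - prices.min() # largest minus smallest value
iqr = prices.quantile(0.75) - prices.quantile(0.25)  # spread of the middle 50%
std = prices.std()                        # sample standard deviation (ddof=1)

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", value_range, "IQR:", iqr, "std:", round(std, 2))
```

In this toy data the single large value pulls the mean well above the median, which is why the median and the interquartile range are often reported alongside the mean and standard deviation for skewed data such as prices.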

# Sampling 
### 1. Allows researchers to study a subset of a population
### 2. Makes it possible to draw inferences about the entire population

Different sampling techniques are employed based on the research objectives, population characteristics, and available resources.<br> 
Here are some common sampling techniques:

&emsp;1.<span style='color:blue'>Simple Random Sampling:</span> Every individual in the population has an equal chance of being selected<br>
&emsp;2.<span style='color:blue'>Stratified Random Sampling:</span> The population is divided into subgroups (strata) based on certain characteristics<br>
&emsp;3.<span style='color:blue'>Systematic Sampling:</span> A fixed interval is selected, and every nth individual is included in the sample<br>
&emsp;4.<span style='color:blue'>Cluster Sampling:</span> The population is divided into clusters, and a random sample of clusters is selected.<br>
        &emsp;&emsp;This method is useful when it is more practical to sample groups rather than individuals.<br>
&emsp;5.<span style='color:blue'>Convenience Sampling:</span> Participants are chosen based on their availability and willingness to participate.<br>
        &emsp;&emsp;This method is quick and convenient but may lead to a biased sample.<br>
&emsp;6.<span style='color:blue'>Purposive Sampling: </span> Researchers intentionally choose individuals who meet specific criteria.<br>
        &emsp;&emsp;This method is subjective and relies on the researcher's judgment.<br>
&emsp;7.<span style='color:blue'>Probability Proportional to Size Sampling:</span> Each unit is selected with a probability proportional to a measure of its size, so larger units have a higher probability of being included in the sample.<br>
        &emsp;&emsp;Useful when dealing with heterogeneous populations, <br>
        &emsp;&emsp;that is, groups that consist of diverse and dissimilar individuals with variations in characteristics, attributes, or traits.<br>
&emsp;8.<span style='color:blue'>Multistage Sampling:</span> Involves multiple stages of sampling, often combining different methods.<br>

The choice of a sampling technique depends on the nature of the research, the characteristics of the population, and the resources available.<br>
It's important to select a method that minimizes bias and allows for generalization of findings to the broader population.
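
Three of the techniques above (simple random, stratified, and systematic sampling) can be sketched with pandas. This is only an illustration under stated assumptions: the `population` DataFrame, its `unit_id` and `stratum` columns, and the 80/20 split are all made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical population of 100 units split into two strata (illustrative only)
population = pd.DataFrame({
    "unit_id": np.arange(100),
    "stratum": ["A"] * 80 + ["B"] * 20,
})

# 1. Simple random sampling: every unit has the same chance of selection
simple = population.sample(n=10, random_state=0)

# 2. Stratified random sampling: draw 10% independently within each stratum
stratified = population.groupby("stratum").sample(frac=0.1, random_state=0)

# 3. Systematic sampling: random start, then every k-th unit
k = 10
rng = np.random.default_rng(0)
start = int(rng.integers(0, k))
systematic = population.iloc[start::k]

print(simple["stratum"].value_counts())
print(stratified["stratum"].value_counts())
print(systematic["unit_id"].tolist())
```

The stratified sample always reproduces the 80/20 proportions of the strata, whereas the simple random sample only does so on average.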

# Inferential statistics

Inferential statistics is a branch of statistics that involves making inferences, predictions, or generalizations about a population based on a sample of data from that population.

### 1. Confidence Interval
In statistical inference, one wishes to estimate population parameters using observed sample data.<br>
A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. (Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)<br>

The common notation for the parameter in question is $\theta$. Often, this parameter is the population mean $\mu$, which is estimated through the sample mean $\bar{x}$.

The level C of a confidence interval gives the probability that the interval produced by the method employed includes the true value of the parameter $\theta$.

1. <span style='color:blue'>Point Estimate (x̄ or p̂):</span>
The sample statistic that serves as the best estimate of the population parameter.

2. <span style='color:blue'>Margin of Error (E): </span>
The maximum expected distance between the point estimate and the true population parameter at the chosen level of confidence, i.e. the half-width of the interval.<br>
Calculated based on the standard error of the point estimate and the chosen level of confidence.<br>

Confidence Interval = Point Estimate ± Margin of Error

Based on the Central Limit Theorem (CLT), a common rule of thumb is to use a sample size of at least 30 before relying on the normal approximation for the sample mean. 
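
A small simulation can illustrate the Central Limit Theorem. The sketch below assumes a skewed Exponential(1) population (an arbitrary choice for illustration) and draws many samples of size 30. The sample means should cluster around the population mean, with a spread close to $\frac{\sigma}{\sqrt{n}}$.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 30              # size of each sample
n_samples = 10_000  # number of repeated samples

# Exponential(1) population: skewed, with mean 1 and standard deviation 1
samples = rng.exponential(scale=1.0, size=(n_samples, n))
sample_means = samples.mean(axis=1)

print("mean of sample means:", sample_means.mean())
print("std of sample means: ", sample_means.std(ddof=1))
print("sigma / sqrt(n):     ", 1.0 / np.sqrt(n))
```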

##### Example
Suppose a student measuring the boiling temperature of a certain liquid observes the readings (in degrees Celsius) 102.5, 101.7, 103.1, 100.9, 100.5, and 102.2 on 6 different samples of the liquid. He calculates the sample mean to be 101.82. If he knows that the standard deviation for this procedure is 1.2 degrees, what is the confidence interval for the population mean at a 95% confidence level?

In other words, the student wishes to estimate the true mean boiling temperature of the liquid using the results of his measurements. If the measurements follow a normal distribution, then the sample mean will have the distribution $N(\mu, \frac{\sigma}{\sqrt{n}})$. Since the sample size is 6, the standard deviation of the sample mean is equal to 1.2/sqrt(6) = 0.49.

The selection of a confidence level for an interval determines the probability that the confidence interval produced will contain the true parameter value. Common choices for the confidence level C are 0.90, 0.95, and 0.99. These levels correspond to percentages of the area of the normal density curve. For example, a 95% confidence interval covers 95% of the normal curve -- the probability of observing a value outside of this area is 0.05. Because the normal curve is symmetric, half of this remaining area is in the left tail of the curve, and the other half is in the right tail of the curve. For a confidence interval with level C, the area in each tail of the curve is therefore equal to (1-C)/2. For a 95% confidence interval, the area in each tail is equal to 0.05/2 = 0.025.

The value $z^{*}$ representing the point on the standard normal density curve such that the probability of observing a value greater than $z^{*}$ is equal to $p$ is known as the upper $p$ critical value of the standard normal distribution. For example, if $p$ = 0.025, the value $z^{*}$ such that $P(Z > z^{*}) = 0.025$, or $P(Z < z^{*}) = 0.975$, is equal to 1.96. For a confidence interval with level C, the value p is equal to (1-C)/2. A 95% confidence interval for the standard normal distribution, then, is the interval (-1.96, 1.96), since 95% of the area under the curve falls within this interval.
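
The critical values for the common confidence levels can be looked up with `scipy.stats.norm.ppf`, the inverse of the standard normal CDF. A short sketch (the dictionary name `z_stars` is just a label chosen for this example):

```python
from scipy.stats import norm

# Upper (1 - C)/2 critical value z* for each confidence level C
z_stars = {}
for C in (0.90, 0.95, 0.99):
    p = (1 - C) / 2                # area in each tail
    z_stars[C] = norm.ppf(1 - p)   # point with area p to its right
    print(f"C = {C:.2f}: tail area p = {p:.3f}, z* = {z_stars[C]:.3f}")
```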



### Confidence Intervals for Unknown Mean and Known Standard Deviation

For a population with unknown mean $\mu$ and known standard deviation $\sigma$, a confidence interval for the population mean, based on a simple random sample (SRS) of size n, is $\bar{x} \pm z^{*}\frac{\sigma}{\sqrt{n}}$, where $z^{*}$ is the upper (1-C)/2 critical value for the standard normal distribution.

Note: This interval is only exact when the population distribution is normal. For large samples from other population distributions, the interval is approximately correct by the Central Limit Theorem.

##### Example:<br>
In the example above, the student calculated the sample mean of the boiling temperatures to be 101.82, with standard deviation 0.49. The critical value for a 95% confidence interval is 1.96, where (1-0.95)/2 = 0.025. A 95% confidence interval for the unknown mean $\mu$ is ((101.82 - (1.96*0.49)), (101.82 + (1.96*0.49))) = (101.82 - 0.96, 101.82 + 0.96) = (100.86, 102.78).

As the level of confidence decreases, the size of the corresponding interval will decrease. Suppose the student was interested in a 90% confidence interval for the boiling temperature. In this case, C = 0.90, and (1-C)/2 = 0.05. The critical value $z^{*}$ for this level is equal to 1.645, so the 90% confidence interval is ((101.82 - (1.645*0.49)), (101.82 + (1.645*0.49))) = (101.82 - 0.81, 101.82 + 0.81) = (101.01, 102.63)

I am 90% confident that the true boiling temperature of this liquid is in the range of (101.01, 102.63). 
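
The boiling-temperature example can be reproduced in a few lines. This sketch uses the six readings and the known standard deviation of 1.2 from the text; the helper name `z_interval` is made up for the example. Because the code does not round intermediate values, the bounds may differ from the hand calculation in the last digit.

```python
import numpy as np
from scipy.stats import norm

readings = np.array([102.5, 101.7, 103.1, 100.9, 100.5, 102.2])
sigma = 1.2  # known standard deviation of the measurement procedure

x_bar = readings.mean()
std_error = sigma / np.sqrt(len(readings))  # standard deviation of the sample mean

def z_interval(C):
    """Return (lower, upper) for confidence level C with known sigma."""
    z_star = norm.ppf(1 - (1 - C) / 2)
    margin = z_star * std_error
    return x_bar - margin, x_bar + margin

ci_95 = z_interval(0.95)
ci_90 = z_interval(0.90)
print("95% CI:", ci_95)
print("90% CI:", ci_90)
```

The 90% interval is narrower than the 95% interval, as described above.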


In [10]:
from scipy.stats import norm
def confidence_interval_known_std(data, alpha=0.05, std_dev=None):
    """
    Construct a confidence interval for the population mean with a known standard deviation.

    Parameters:
    - data: NumPy array or list containing the sample data.
    - alpha: Significance level, i.e. 1 - confidence level (default 0.05 gives a 95% confidence interval).
    - std_dev: Known population standard deviation.

    Returns:
    - Confidence interval (lower bound, upper bound).
    """
    sample_mean = np.mean(data)
    z_critical = norm.ppf(1 - alpha / 2)  # Z-score for a two-tailed test

    margin_of_error = z_critical * (std_dev / np.sqrt(len(data)))

    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error

    return lower_bound, upper_bound

# Example usage:
sample_data = np.array([22, 25, 28, 18, 32, 27, 23, 20, 29, 30])
known_std_dev = 5.0  # Replace with your known standard deviation

confidence_interval = confidence_interval_known_std(sample_data, std_dev=known_std_dev)
print("Confidence Interval:", confidence_interval)

Confidence Interval: (22.30102483847719, 28.498975161522807)


### Confidence Intervals for Unknown Mean and Unknown Standard Deviation

In most practical research, the standard deviation for the population of interest is not known. In this case, the standard deviation $\sigma$ is replaced by the sample standard deviation s, and the quantity $\frac{s}{\sqrt{n}}$ is known as the standard error of the sample mean. Since s is only an estimate of the true standard deviation, the standardized sample mean $\frac{\bar{x} - \mu}{s/\sqrt{n}}$ no longer follows the standard normal distribution. Instead, it follows the t distribution, which is described by its degrees of freedom. For a sample of size n, the t distribution will have n-1 degrees of freedom. The notation for a t distribution with k degrees of freedom is $t(k)$. As the sample size n increases, the t distribution becomes closer to the normal distribution, since s approaches the true standard deviation $\sigma$ for large n.

For a population with unknown mean $\mu$ and unknown standard deviation, a confidence interval for the population mean, based on a simple random sample (SRS) of size n, is $\bar{x} \pm t^{*}\frac{s}{\sqrt{n}}$, where $t^{*}$ is the upper (1-C)/2 critical value for the t distribution with n-1 degrees of freedom, t(n-1).

In [14]:
from scipy.stats import t
def confidence_interval_unknown_std(data, alpha=0.05):
    """
    Construct a confidence interval for the population mean with an unknown standard deviation.

    Parameters:
    - data: NumPy array or list containing the sample data.
    - alpha: Significance level, i.e. 1 - confidence level (default 0.05 gives a 95% confidence interval).

    Returns:
    - Confidence interval (lower bound, upper bound).
    """
    sample_mean = np.mean(data)
    sample_std = np.std(data, ddof=1)  # Use sample standard deviation (ddof=1 for unbiased estimate)
    n = len(data)

    t_critical = t.ppf(1 - alpha / 2, df = n - 1)  # T-score for a two-tailed test

    margin_of_error = t_critical * (sample_std / np.sqrt(n))

    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error

    return lower_bound, upper_bound

# Example usage:
sample_data = np.array([22, 25, 28, 18, 32, 27, 23, 20, 29, 30])

confidence_interval = confidence_interval_unknown_std(sample_data)
print("Confidence Interval:", confidence_interval)

Confidence Interval: (22.127030421684026, 28.67296957831597)


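For comparison, the interval computed earlier with a known standard deviation (the z interval) was (22.30102483847719, 28.498975161522807).

The difference comes from the distributional assumption, z versus t. The t distribution has heavier tails than the normal distribution, so here it produces a wider interval.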
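
To see how the two critical values compare, the sketch below prints $t^{*}$ for a few degrees of freedom next to $z^{*}$ at the 95% level. The chosen degrees of freedom are arbitrary examples.

```python
from scipy.stats import norm, t

z_star = norm.ppf(0.975)  # 95% critical value of the standard normal

# 95% critical values of the t distribution for increasing degrees of freedom
t_stars = {df: t.ppf(0.975, df=df) for df in (5, 9, 29, 99)}

print(f"z* = {z_star:.3f}")
for df, t_star in t_stars.items():
    print(f"t*({df}) = {t_star:.3f}")
```

Every $t^{*}$ is larger than $z^{*}$, and the gap shrinks as the degrees of freedom grow, which matches the statement that the t distribution approaches the normal distribution for large n.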

### 2. Tests of Significance

Once sample data has been gathered through an observational study or experiment, statistical inference allows analysts to assess evidence in favor of some claim about the population from which the sample has been drawn. The methods of inference used to support or reject claims based on sample data are known as tests of significance.

Every test of significance begins with a null hypothesis $H_{0}$. $H_{0}$ represents a theory that has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug. We would write $H_{0}$: there is no difference between the two drugs on average.

The alternative hypothesis, $H_{a}$, is a statement of what a statistical hypothesis test is set up to establish. For example, in a clinical trial of a new drug, the alternative hypothesis might be that the new drug has a different effect, on average, compared to that of the current drug. We would write $H_{a}$: the two drugs have different effects, on average. The alternative hypothesis might also be that the new drug is better, on average, than the current drug. In this case we would write $H_{a}$: the new drug is better than the current drug, on average.

The final conclusion once the test has been carried out is always given in terms of the null hypothesis. We either "reject $H_{0}$ in favor of $H_{a}$" or "do not reject $H_{0}$"; we never conclude "reject $H_{a}$", or even "accept $H_{a}$".

If we conclude "do not reject $H_{0}$", this does not necessarily mean that the null hypothesis is true; it only suggests that there is not sufficient evidence against $H_{0}$ in favor of $H_{a}$. Rejecting the null hypothesis, then, suggests that the alternative hypothesis may be true.

(Definitions taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)

Hypotheses are always stated in terms of a population parameter, such as the mean $\mu$. An alternative hypothesis may be one-sided or two-sided. A one-sided hypothesis claims that a parameter is either larger or smaller than the value given by the null hypothesis. A two-sided hypothesis claims that a parameter is simply not equal to the value given by the null hypothesis -- the direction does not matter.

Hypotheses for a one-sided test for a population mean take the following form:<br>
$H_{0}: \mu = k$,
$H_{a}: \mu > k$<br>
or<br>
$H_{0}: \mu = k$,
$H_{a}: \mu < k$

Hypotheses for a two-sided test for a population mean take the following form:<br>
$H_{0}: \mu = k$,
$H_{a}: \mu \neq k$

##### Example
Suppose a test has been given to all high school students in a certain state. The mean test score for the entire state is 70, with standard deviation equal to 10. Members of the school board suspect that female students have a higher mean score on the test than male students, because the mean score $\bar{x}$ from a random sample of 64 female students is equal to 73. Does this provide strong evidence that the overall mean for female students is higher?

The null hypothesis $H_{0}$ claims that there is no difference between the mean score for female students and the mean for the entire population, so that $H_{0}: \mu = 70$. The alternative hypothesis claims that the mean for female students is higher than the entire student population mean, so that $H_{a}: \mu > 70$.

### Significance Tests for Population Mean with known Population Standard Deviation

Once null and alternative hypotheses have been formulated for a particular claim, the next step is to compute a test statistic. For claims about a population mean from a population with a normal distribution or for any sample with large sample size n (for which the sample mean will follow a normal distribution by the Central Limit Theorem), if the standard deviation $\sigma$ is known, the appropriate significance test is known as the z-test, where the test statistic is defined as:
$$z = \frac{\bar{x} - \mu_{0}}{\frac{\sigma}{\sqrt{n}}}$$

The test statistic follows the standard normal distribution (with mean = 0 and standard deviation = 1). The test statistic z is used to compute the P-value for the standard normal distribution, the probability that a value at least as extreme as the test statistic would be observed under the null hypothesis. Given the null hypothesis that the population mean $\mu$ is equal to a given value $\mu_{0}$, the P-values for testing $H_{0}$ against each of the possible alternative hypotheses are:

$P(Z > z)$ for $H_{a}: \mu > \mu_{0}$ <br>
$P(Z < z)$ for $H_{a}: \mu < \mu_{0}$ <br>
$2P(Z>|z|)$ for $H_{a}: \mu \neq \mu_{0}$. <br>

The probability is doubled for the two-sided test, since the two-sided alternative hypothesis considers the possibility of observing extreme values on either tail of the normal distribution.
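
The three P-value rules can be written as one small helper. This is a sketch, not a library function: the name `z_test_p_value` and the `alternative` labels are assumptions made for this example. It uses `norm.sf` (the survival function, $1 - \text{CDF}$) for upper-tail areas.

```python
from scipy.stats import norm

def z_test_p_value(z, alternative="two-sided"):
    """P-value of a z statistic for the given alternative hypothesis."""
    if alternative == "greater":    # Ha: mu > mu_0  ->  P(Z > z)
        return norm.sf(z)
    if alternative == "less":       # Ha: mu < mu_0  ->  P(Z < z)
        return norm.cdf(z)
    if alternative == "two-sided":  # Ha: mu != mu_0 ->  2 P(Z > |z|)
        return 2 * norm.sf(abs(z))
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")

z = 1.5  # arbitrary example value
for alt in ("greater", "less", "two-sided"):
    print(alt, z_test_p_value(z, alt))
```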

##### Example:

In the test score example above, where the sample mean equals 73 and the population standard deviation is equal to 10, the test statistic is computed as follows:
$z = (73 - 70)/(\frac{10}{\sqrt{64}}) = 3/1.25 = 2.4$. Since this is a one-sided test, the P-value is equal to the probability of observing a value greater than 2.4 in the standard normal distribution, or P(Z > 2.4) = 1 - P(Z < 2.4) = 1 - 0.9918 = 0.0082. The P-value is less than 0.01, indicating that it is highly unlikely that these results would be observed under the null hypothesis. The school board can confidently reject $H_{0}$ given this result, although they cannot conclude any additional information about the mean of the distribution.
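
The same test-score calculation in Python, as a quick check of the arithmetic. The variable names are chosen for this example only.

```python
import numpy as np
from scipy.stats import norm

mu_0 = 70    # mean under the null hypothesis
sigma = 10   # known population standard deviation
n = 64       # sample size
x_bar = 73   # observed sample mean

z = (x_bar - mu_0) / (sigma / np.sqrt(n))
p_value = norm.sf(z)  # one-sided: P(Z > z)

print("z =", z)
print("P-value =", round(p_value, 4))
```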

In [12]:
from scipy.stats import norm

# H0: the true population mean equals pop_mean
# Ha: the true population mean does not equal pop_mean

# Generate a sample dataset (replace this with your actual data)
np.random.seed(100)
sample_data = np.random.normal(loc = 15, scale = 5, size = 30)  # Mean = 15, Standard Deviation = 5, Sample Size = 30

# Population parameters (replace this with your known or hypothesized population mean)
pop_mean = 15  # the population mean you want to test
pop_std_dev = 5  # Known population standard deviation

# Calculate the z-statistic
z_statistic = (np.mean(sample_data) - pop_mean) / (pop_std_dev / np.sqrt(len(sample_data)))

# Set the significance level alpha = 1 - confidence level (0.95)
alpha = 0.05

# Calculate the critical z-value for a two-tailed test
critical_z_value = norm.ppf(1 - alpha / 2)

# Perform the hypothesis test
p_value = 2 * (1 - norm.cdf(np.abs(z_statistic)))

# Print the results
print("Z-Statistic:", z_statistic)
print("Critical Z-Value:", critical_z_value)
print("P-Value:", p_value)

# Compare the p-value with the significance level
if p_value < alpha:
    print("Reject the null hypothesis. There is significant evidence to suggest a difference.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence to suggest a difference.")
    
# Here H0: the true population mean equals 15; Ha: the true population mean is not equal to 15.
# To test another value, e.g. 20, set pop_mean = 20: H0: true population mean = 20, Ha: true population mean != 20.

Z-Statistic: 0.9012336827794607
Critical Z-Value: 1.959963984540054
P-Value: 0.3674640856208802
Fail to reject the null hypothesis. There is not enough evidence to suggest a difference.


### Significance Tests for Population Mean with Unknown Population Standard Deviation

In most practical research, the standard deviation for the population of interest is not known. In this case, the standard deviation $\sigma$ is replaced by the sample standard deviation s, and the quantity $\frac{s}{\sqrt{n}}$ is known as the standard error of the sample mean. Since s is only an estimate of the true standard deviation, the standardized sample mean $\frac{\bar{x} - \mu}{s/\sqrt{n}}$ no longer follows the standard normal distribution. Instead, it follows the t distribution, which is described by its degrees of freedom. For a sample of size n, the t distribution will have n-1 degrees of freedom. The notation for a t distribution with k degrees of freedom is t(k). As the sample size n increases, the t distribution becomes closer to the normal distribution, since s approaches the true standard deviation $\sigma$ for large n.

For claims about a population mean from a population with a normal distribution or for any sample with large sample size n (for which the sample mean will follow a normal distribution by the Central Limit Theorem) with unknown standard deviation, the appropriate significance test is known as the t-test, where the test statistic is defined as:
$$t = \frac{\bar{x} - \mu_{0}}{\frac{s}{\sqrt{n}}}$$

The test statistic follows the t distribution with n-1 degrees of freedom. The test statistic t is used to compute the P-value for the t distribution, the probability that a value at least as extreme as the test statistic would be observed under the null hypothesis.
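
The formula can be applied by hand and checked against `scipy.stats.ttest_1samp`. The sketch below uses simulated data; the generator seed, the sample, and the hypothesized mean `mu_0` are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

rng = np.random.default_rng(42)
sample = rng.normal(loc=15, scale=5, size=30)  # simulated data
mu_0 = 14                                      # hypothesized population mean

n = len(sample)
s = sample.std(ddof=1)                         # sample standard deviation
t_stat = (sample.mean() - mu_0) / (s / np.sqrt(n))
p_value = 2 * t.sf(abs(t_stat), df=n - 1)      # two-sided P-value

result = ttest_1samp(sample, mu_0)             # SciPy's built-in one-sample t-test
print("manual:", t_stat, p_value)
print("scipy: ", result.statistic, result.pvalue)
```

Both approaches should print the same statistic and P-value, which shows that `ttest_1samp` performs a two-sided test by default.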

In [17]:
from scipy.stats import ttest_1samp

# Generate a sample dataset (replace this with your actual data)
np.random.seed(42)
sample_data = np.random.normal(loc=15, scale=5, size=30)  # Mean = 15, Standard Deviation = 5, Sample Size = 30

# Population parameter (replace this with your known or hypothesized population mean)
pop_mean = 14  # the population mean you want to test

# Perform the one-sample t-test
t_statistic, p_value = ttest_1samp(sample_data, pop_mean)

# Set the significance level (alpha)
alpha = 0.05

# Print the results
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

# Compare the p-value with the significance level
if p_value < alpha:
    print("Reject the null hypothesis. There is significant evidence to suggest a difference.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence to suggest a difference.")

T-Statistic: 0.07213517949624637
P-Value: 0.9429895415189937
Fail to reject the null hypothesis. There is not enough evidence to suggest a difference.


### The Sign Test
Another method of analysis for matched pairs data is a distribution-free test known as the sign test. This test does not require any normality assumptions about the data, and simply involves counting the number of positive differences between the matched pairs and relating these to a binomial distribution. The concept behind the sign test reasons that if there is no true difference, then the probability of observing an increase in each pair is equal to the probability of observing a decrease in each pair: p = 1/2. Assuming each pair is independent, under the null hypothesis the number of positive differences follows the distribution B(n,1/2), where n is the number of pairs where some difference is observed.

To perform a sign test on matched pairs data, take the difference between the two measurements in each pair and count the number of non-zero differences n. Of these, count the number of positive differences X. Determine the probability of observing a result at least as extreme as X positive differences for a $B(n,1/2)$ distribution, and use this probability as a P-value for the null hypothesis.

##### Example<br>
In the "Helium Football" example, 2 of the 39 trials recorded no difference between kicks for the air-filled and helium-filled balls. Of the remaining 37 trials, 20 recorded a positive difference between the two kicks. Under the null hypothesis, p = 1/2, the number of positive differences would follow the B(37,1/2) distribution. The probability of observing 20 or more positive differences is $P(X \geq 20) = 1 - P(X \leq 19) = 1 - 0.6286 = 0.3714$. This value indicates that there is not strong evidence against the null hypothesis, in agreement with the t-test on the same data.
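
The binomial tail probability in this example can be checked with `scipy.stats.binom`. The counts (37 non-zero differences, 20 of them positive) come from the text; the variable names are chosen for this sketch.

```python
from scipy.stats import binom

n_pairs = 37      # trials with a non-zero difference
n_positive = 20   # trials with a positive difference

# P(X >= 20) = 1 - P(X <= 19) for X ~ B(37, 1/2); sf(k) gives P(X > k)
p_value = binom.sf(n_positive - 1, n_pairs, 0.5)

print("P(X <= 19) =", round(binom.cdf(n_positive - 1, n_pairs, 0.5), 4))
print("P(X >= 20) =", round(p_value, 4))
```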

In [5]:
from scipy.stats import binomtest  # replaces binom_test, which was deprecated and later removed from SciPy

# Generate a sample dataset (replace this with your actual data)
np.random.seed(42)
sample_data = np.random.normal(loc=15, scale=5, size=30)  # Sample data

# Hypothesized median (replace this with your hypothesized value)
hypothesized_median = 14

# Perform the one-sample sign test
signs = np.sign(sample_data - hypothesized_median)
num_positive = np.sum(signs > 0)
num_negative = np.sum(signs < 0)

# Use a two-sided binomial test (p = 0.5 under H0) to check if the number of positive signs
# is significantly different from the number of negative signs
n_nonzero = int(num_positive + num_negative)
p_value = binomtest(int(min(num_positive, num_negative)), n=n_nonzero, p=0.5).pvalue

# Set the significance level (alpha)
alpha = 0.05

# Print the results
print("Number of positive signs:", num_positive)
print("Number of negative signs:", num_negative)
print("P-Value:", p_value)

# Compare the p-value with the significance level
if p_value < alpha:
    print("Reject the null hypothesis. There is significant evidence to suggest a difference in medians.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence to suggest a difference in medians.")

Number of positive signs: 13
Number of negative signs: 17
P-Value: 0.5846647117286922
Fail to reject the null hypothesis. There is not enough evidence to suggest a difference in medians.
