# Intermediary Statistics

# Normal Distribution
**Normal Distribution** is the most common or normal form of distribution of Random Variables, hence the name *"normal distribution."* It is also called **Gaussian Distribution** in Statistics or Probability. We use this distribution to represent a large number of random variables. It serves as a foundation for statistics and probability theory.

We observe that the curve traced by the upper values of the Normal Distribution is in the shape of a Bell, hence Normal Distribution is also called the **"Bell Curve".**

# Features 
- **Symmetry**: The normal distribution is symmetric around its mean. This means the left side of the distribution mirrors the right side.
- **Mean, Median, and Mode**: In a normal distribution, the mean, median, and mode are all equal and located at the center of the distribution.
- **Bell-shaped Curve**: The curve is bell-shaped, indicating that most of the observations cluster around the central peak, and the probabilities for values further away from the mean taper off equally in both directions.
- **Standard Deviation**: The spread of the distribution is determined by the standard deviation. About 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.

# Examples:
- Distribution of Height of People.
- Distribution of Errors in any Measurement.
- Distribution of Blood Pressure of any Patient, etc.

# Normal Distribution Standard Deviation
For *smaller values of the standard deviation*, the values in the graph come closer and the *graph becomes narrower*. While for *higher values of the standard deviation* the values in the graph are dispersed more and the *graph becomes wider*.

![](https://miro.medium.com/v2/resize:fit:958/1*iKBF2jUuunveJw7f1v7csA.png)
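
The effect of the standard deviation on the shape of the curve can be checked numerically. The sketch below is only an illustration using Python's standard library; the helper names `peak_height` and `central_width` and the chosen sigma values are assumptions, not part of the notes above.

```python
from statistics import NormalDist


def peak_height(sigma: float, mu: float = 0.0) -> float:
    """Height of the normal curve at its mean."""
    return NormalDist(mu, sigma).pdf(mu)


def central_width(sigma: float, coverage: float = 0.95, mu: float = 0.0) -> float:
    """Width of the central interval that holds `coverage` of the probability."""
    dist = NormalDist(mu, sigma)
    tail = (1 - coverage) / 2
    return dist.inv_cdf(1 - tail) - dist.inv_cdf(tail)


# Same mean, three illustrative standard deviations.
for sigma in (0.5, 1.0, 2.0):
    print(f"sigma={sigma}: peak={peak_height(sigma):.3f}, "
          f"95% width={central_width(sigma):.2f}")
```

A smaller sigma gives a taller peak and a narrower central interval; a larger sigma gives a flatter, wider curve.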

# Empirical Rule of Standard Deviation
The standard deviation of a normal distribution is always positive, and it divides the area under the normal curve into regions, each of which holds a fixed percentage of the data. This is called the Empirical Rule of Standard Deviation in Normal Distribution.

#### Empirical Rule states that,

- Approximately 68% of the data fall within one standard deviation of the mean, i.e. between {Mean - One Standard Deviation, and Mean + One Standard Deviation}
- Approximately 95% of the data fall within two standard deviations of the mean, i.e. between {Mean - Two Standard Deviations, and Mean + Two Standard Deviations}
- Approximately 99.7% of the data fall within three standard deviations of the mean, i.e. between {Mean - Three Standard Deviations, and Mean + Three Standard Deviations}

![](https://media.geeksforgeeks.org/wp-content/uploads/20230901155813/Probability-Distribution-Curve.png "Distribution Curve")

**Using Empirical Rule we distribute data broadly in three parts. And thus, empirical rule is also called "68 - 95 - 99.7" rule.**
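
The 68 - 95 - 99.7 percentages can be reproduced from the standard normal cumulative distribution function. This is a small sketch using `statistics.NormalDist` from the Python standard library; the function name `coverage` is an illustrative choice.

```python
from statistics import NormalDist


def coverage(k: float) -> float:
    """Probability that a normal value lies within k standard deviations of the mean."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {coverage(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```

Because every normal distribution is a shifted and rescaled standard normal, the same percentages hold for any mean and standard deviation.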


**Key Idea**: A normal distribution is a pattern, and patterns allow us to categorize data with more confidence.

# p-value
The p-value indicates the probability that the observed result or an even more extreme result will occur if the null hypothesis is true.

The p-value is used to decide whether the null hypothesis is rejected or not rejected. If the p-value is smaller than the defined significance level (often 5%), the null hypothesis is rejected, otherwise not.

**My understanding**: Say the null hypothesis is that the test group is no different from the control group. After running the test, if the test group looks much like the control group, the p-value is high (> 0.05), which means we fail to reject the null hypothesis. If the test group differs enough that the p-value is significant (< 0.05), we can reject the null hypothesis, i.e. a difference at least this large would show up less than 5% of the time if the null hypothesis were true. This is not the same as saying the null hypothesis is true less than 5% of the time.

**Significance level:**

The significance level is determined before the test. If the calculated p-value is below this value, the null hypothesis is rejected, otherwise it is not rejected. As a rule, a significance level of 5 % is chosen.

- `p < 0.01 : very significant result.`
- `p < 0.05 : significant result.`
- `p > 0.05 : not significant result.`

The significance level thus indicates the probability of a Type I error (1st type error), i.e. rejecting a null hypothesis that is actually true. What does this mean? If the significance level is 5% and the null hypothesis is true, there is a 5% probability that the test rejects it by mistake. This is not the probability that the null hypothesis is valid. If the significance level is reduced to 1%, the probability of this error is accordingly only 1%, but it is also more difficult to confirm the alternative hypothesis.

Note that the p-value indicates whether the test subjects differ, **not** how much they differ from each other.
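
As a rough illustration of how a p-value is computed and compared with the significance level, the sketch below assumes a two-sided z-test with a known population standard deviation. The function name `two_sided_p_value` and all the numbers (sample mean 103, null mean 100, sigma 15, n = 100) are made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(sample_mean: float, null_mean: float,
                      sigma: float, n: int) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) for a z-test with known sigma."""
    z = (sample_mean - null_mean) / (sigma / sqrt(n))
    # Probability of a result at least this extreme, in either direction, under H0.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


alpha = 0.05  # significance level, fixed before looking at the data
z, p = two_sided_p_value(sample_mean=103.0, null_mean=100.0, sigma=15.0, n=100)
print(f"z = {z:.2f}, p = {p:.4f}")
print("reject H0" if p < alpha else "fail to reject H0")
```

With these hypothetical numbers the statistic is z = 2 and the p-value is just under 0.05, so the null hypothesis would be rejected at the 5% level but not at the 1% level.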

# Test for normality

# Types of Normality Tests
Test | Notes
--------------------------------------|--------------------------------------
**Shapiro-Wilk Test** | Very common, powerful for small to medium samples (n < 2000)
**Kolmogorov-Smirnov Test** | Compares sample distribution to a normal distribution (less powerful)
**Lilliefors Test (adjusted KS Test)** | Corrects the KS test for when mean and variance are estimated
**Anderson-Darling Test** | Modification of KS test, more sensitive to tails
**Jarque-Bera Test** | Based on skewness and kurtosis (used a lot in econometrics)
**D'Agostino's K-squared Test** | Also based on skewness and kurtosis

## How to interpret results
$H_0$: The distribution is Gaussian.

- If p-value > 0.05: fail to reject $H_0$; the data are consistent with a normal distribution.

- If p-value < 0.05: reject $H_0$; the distribution does not look Gaussian.


## Different tests for normality

### Shapiro-Wilk Test

**Description:**  
The Shapiro-Wilk test checks whether a sample comes from a normally distributed population. It is based on the correlation between the data and the corresponding normal scores.

**When to use:**  
Use it for small to medium sample sizes (typically n < 2000). It's very powerful and often the first choice for normality testing.

---

### Kolmogorov-Smirnov Test

**Description:**  
The Kolmogorov-Smirnov (KS) test compares the empirical distribution function of the sample with the cumulative distribution function of a normal distribution.

**When to use:**  
Use it when you want a general goodness-of-fit test, but note that it is less powerful for specifically testing normality, especially with small samples.

---

### Lilliefors Test

**Description:**  
The Lilliefors test is an adaptation of the KS test that accounts for the estimation of the mean and variance from the sample.

**When to use:**  
Use it when you apply the KS test but the population parameters (mean and variance) are unknown and must be estimated from the sample.

---

### Anderson-Darling Test

**Description:**  
The Anderson-Darling test is a refinement of the KS test that gives more weight to the tails of the distribution when assessing normality.

**When to use:**  
Use it when you are particularly concerned about how well the distribution fits in the tails (extremes of the data).

---

### Jarque-Bera Test

**Description:**  
The Jarque-Bera test measures departure from normality based on skewness and kurtosis.

**When to use:**  
Use it in large-sample contexts, particularly in econometrics or finance, where skewness and kurtosis are critical.
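
The Jarque-Bera statistic is simple enough to compute by hand, which makes the role of skewness and kurtosis concrete. The sketch below uses the usual large-sample form JB = n/6 · (S² + (K − 3)²/4) and the fact that the chi-square distribution with 2 degrees of freedom has survival function exp(−x/2). The function name and the simulated samples are assumptions for illustration; in practice a statistics library would normally be used.

```python
import random
from math import exp


def jarque_bera(data: list[float]) -> tuple[float, float]:
    """Return (JB statistic, asymptotic p-value) for a sample."""
    n = len(data)
    mean = sum(data) / n
    # Central moments (divided by n, not n - 1).
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2          # equals 3 for a normal distribution
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    p_value = exp(-jb / 2)           # chi-square survival function, 2 df
    return jb, p_value


rng = random.Random(42)              # fixed seed so the run is repeatable
normal_sample = [rng.gauss(0, 1) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]

for name, sample in (("normal", normal_sample), ("skewed", skewed_sample)):
    jb, p = jarque_bera(sample)
    print(f"{name}: JB = {jb:.2f}, p = {p:.4f}")
```

The exponential sample is strongly right-skewed, so its statistic is very large and its p-value is effectively zero. The p-value is only an asymptotic approximation, which is one reason the test is recommended for large samples.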

---

### D'Agostino's K-squared Test

**Description:**  
D'Agostino's K-squared test combines measures of skewness and kurtosis to test for normality.

**When to use:**  
Use it for medium to large sample sizes when you want a more analytical approach to normality based on both skewness and kurtosis. (Kurtosis is a measure of the tailedness of a distribution, i.e. how often outliers occur.)

---

## What to do when the data is not normal:

1) **Use Non-Parametric Tests**: These tests don't assume normality

Situation | Normal Test | Non-Parametric Alternative
--------|-----------------|-----------------------------------
Compare two groups | t-test | Mann-Whitney U test
Compare more than two groups | ANOVA | Kruskal-Wallis test
Paired samples | Paired t-test | Wilcoxon signed-rank test
Correlation | Pearson | Spearman/Kendall
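
To show why these alternatives do not need normality, here is a minimal sketch of the Mann-Whitney U statistic, which depends only on how the values of one group rank against the other. It computes the statistic only, not a p-value; the function name and the two small groups are hypothetical.

```python
def mann_whitney_u(group_a: list[float], group_b: list[float]) -> tuple[float, float]:
    """Return (U_a, U_b): U_a counts pairs where a > b, with ties counting 0.5."""
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    # The two U values always add up to the number of (a, b) pairs.
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b


control = [1.0, 2.0, 3.0]      # hypothetical measurements
treatment = [4.0, 5.0, 6.0]
print(mann_whitney_u(control, treatment))  # (0.0, 9.0)
```

A U value close to 0 or to the total number of pairs means one group almost always ranks below the other; values near half the number of pairs suggest no systematic difference.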


2) **Transform Your Data**

Apply a transformation to make the data more normal-like:

- Log transformation (good for right-skewed data)

- Square root transformation (good for count data)

- Box-Cox transformation (automatically finds the best transformation)

- Reciprocal transformation (use with caution; can be aggressive)
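
A quick way to see what a log transformation does is to compare the skewness before and after. The sketch below uses a made-up right-skewed list and a hand-written `skewness` helper; both are assumptions for illustration.

```python
from math import log


def skewness(data: list[float]) -> float:
    """Sample skewness: third central moment divided by variance ** 1.5."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed data (all values must be positive for the log).
raw = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 8.0, 20.0, 100.0]
logged = [log(x) for x in raw]

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after log: {skewness(logged):.2f}")
```

The transformed data is still somewhat right-skewed here, but much less so. It is worth re-running a normality check after transforming rather than assuming the transformation worked.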


3) **Use Robust Statistical Methods**

Some modern methods are robust to non-normality:

- Robust regression

- Bootstrapping (resampling techniques to estimate p-values, confidence intervals)
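
Bootstrapping can be sketched in a few lines: resample the data with replacement many times, compute the statistic each time, and read the confidence interval from the percentiles. This is a minimal percentile-bootstrap illustration; the function name, the number of resamples, the seed, and the sample values are all assumptions.

```python
import random
from statistics import fmean


def bootstrap_ci(data: list[float], n_resamples: int = 2000,
                 confidence: float = 0.95, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)        # fixed seed so results are repeatable
    n = len(data)
    # Mean of each resample, drawn with replacement from the original data.
    means = sorted(fmean(rng.choices(data, k=n)) for _ in range(n_resamples))
    lower_idx = int((1 - confidence) / 2 * n_resamples)
    upper_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]


sample = [2.0, 3.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 9.0, 30.0]  # hypothetical, skewed
low, high = bootstrap_ci(sample)
print(f"95% bootstrap CI for the mean: [{low:.2f}, {high:.2f}]")
```

No normality assumption is used anywhere: the interval comes entirely from the resampled means.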

4) **Acknowledge and Move On**

If the sample size is large (n > 30-50), the Central Limit Theorem says the sampling distribution of the mean is approximately normal, even if the data itself isn't.
- In this case, you might be safe to continue with normal-based tests.

---

# Z-score
Z-Score in statistics is a measurement of how many standard deviations away a data point is from the mean of a distribution. A z-score of 0 indicates that the data point's score is the same as the mean score. A positive z-score indicates that the data point is above average, while a negative z-score indicates that the data point is below average.

For example, a Z-score of 2 indicates the value is 2 standard deviations away from the mean. To use a z-score, we need to know the population mean ($\mu$) and also the population standard deviation ($\sigma$).

*Z-score is a statistical measure that describes a value's position relative to the mean of a group of values. It is expressed in terms of standard deviations from the mean. The Z-score indicates how many standard deviations an element is from the mean.*

$$ z = \frac{(X-\mu)}{\sigma}$$

$$z = \text{Z-Score}$$
$$X = \text{Value of Element}$$
$$\mu = \text{Population Mean}$$
$$\sigma = \text{Population Standard Deviation}$$
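
The formula translates directly into code. In this small sketch the exam score of 85, the population mean of 70, and the standard deviation of 10 are invented values.

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


print(z_score(85, 70, 10))   # 1.5  (above the mean)
print(z_score(70, 70, 10))   # 0.0  (exactly at the mean)
print(z_score(50, 70, 10))   # -2.0 (below the mean)
```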

---

## How to interpret Z - score
- **Z-Score = 0**: A Z-score of 0 indicates that the data point is exactly at the mean of the distribution.
- **Positive Z-Score**: A positive Z-score indicates that the data point is above the mean. For example, a Z-score of 1.5 means the data point is 1.5 standard deviations above the mean.
- **Negative Z-Score**: A negative Z-score indicates that the data point is below the mean. For example, a Z-score of -2 means the data point is 2 standard deviations below the mean.
- **Magnitude of Z-Score**: The magnitude of the Z-score shows how far away the data point is from the mean. A larger absolute value of the Z-score indicates that the data point is farther from the mean, while a smaller absolute value indicates it is closer.
- **Common Thresholds**:
    - Z-Score > 2 or < -2: Often considered unusual or significant, indicating the data point is more than 2 standard deviations away from the mean.
    - Z-Score > 3 or < -3: Typically considered an outlier, suggesting the data point is extremely far from the mean.

---

## Properties of Z-Score
- The magnitude of the Z-score reflects how far a data point is from the mean in terms of standard deviations.
- An element having a z-score of less than 0 represents that the element is less than the mean.
- Z-scores allow for the comparison of data points from different distributions.
- An element having a z-score greater than 0 represents that the element is greater than the mean.
- An element having a z-score equal to 0 represents that the element is equal to the mean.
- An element having a z-score equal to 1 represents that the element is 1 standard deviation greater than the mean; a z-score equal to 2, 2 standard deviations greater than the mean, and so on.
- An element having a z-score equal to -1 represents that the element is 1 standard deviation less than the mean; a z-score equal to -2, 2 standard deviations less than the mean, and so on.
- If the data are approximately normally distributed and the number of elements is large, then about 68% of the elements have a z-score between -1 and 1; about 95% have a z-score between -2 and 2; about 99.7% have a z-score between -3 and 3. This is known as the Empirical Rule, and it states the percentage of data within certain standard deviations from the mean in a normal distribution as demonstrated in the image below.

---

## Outliers Using the Z-Score Value
- After calculating the z-score, we will determine the cutoff value for the z-score after which the data point could be considered as an outlier. This cutoff value is a hyper-parameter that we decide depending on our project.
- A data point whose absolute z-score is greater than 3 lies outside the central 99.73% of a normally distributed dataset.
- Any data point whose absolute z-score is greater than our chosen cutoff value will be considered an outlier.
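
The procedure above can be sketched as follows. The function name `z_score_outliers`, the default cutoff of 3, and the sensor readings are illustrative assumptions; the mean and standard deviation are computed from the data itself.

```python
from statistics import mean, pstdev


def z_score_outliers(data: list[float], cutoff: float = 3.0) -> list[float]:
    """Return the values whose absolute z-score exceeds the cutoff."""
    mu = mean(data)
    sigma = pstdev(data)             # population standard deviation of the data
    return [x for x in data if abs((x - mu) / sigma) > cutoff]


# Hypothetical readings: 19 values near 10 and one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
            11, 9, 10, 12, 10, 9, 11, 10, 10, 100]
print(z_score_outliers(readings))              # [100]
print(z_score_outliers(readings, cutoff=5.0))  # []
```

One caveat: an extreme value inflates the mean and standard deviation used to score it, so in small samples a z-score can never get very large. With very few data points a cutoff of 3 may flag nothing at all.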

---

## Application of Z-Score
- Z-scores are often used for feature scaling to bring different features to a common scale. Normalizing features ensures that they have zero mean and unit variance, which can be beneficial for certain machine learning algorithms, especially those that rely on distance measures.
- Z-scores can be used to identify outliers in a dataset. Data points with Z-scores beyond a certain threshold (usually 3 standard deviations from the mean) may be considered outliers.
- Z-scores can be used in anomaly detection algorithms to identify instances that deviate significantly from the expected behavior.
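- Z-scores recenter and rescale a variable but do not change the shape of its distribution, so skewed data stays skewed after standardizing; apply a transformation such as log or Box-Cox first if a more normal shape is needed.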
- When working with regression models, Z-scores of residuals can be analyzed to check for homoscedasticity (constant variance of residuals).
- Z-scores can be used in feature scaling by looking at their standard deviations from the mean.
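
As a sketch of z-score feature scaling, the example below standardizes two features measured in very different units. The feature names and values are invented, and `standardize` is an illustrative helper rather than a library function.

```python
from statistics import mean, pstdev


def standardize(values: list[float]) -> list[float]:
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Two hypothetical features on very different scales.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
incomes = [20000.0, 35000.0, 50000.0, 65000.0, 80000.0]

scaled_heights = standardize(heights_cm)
scaled_incomes = standardize(incomes)
print([round(z, 3) for z in scaled_heights])
print([round(z, 3) for z in scaled_incomes])
```

After scaling, both features are on the same unitless scale, so a distance-based algorithm no longer lets the income column dominate simply because its raw numbers are larger.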

---

|Z-Score|Standard Deviation|
|---|---|
|Transform raw data into a standardized scale.|Measures the amount of variation or dispersion in a set of values.|
|Makes it easier to compare values from different datasets because they take away the original units of measurement.|Standard Deviation retains the original units of measurement, making it less suitable for direct comparisons between datasets with different units.|
|Indicate how far a data point is from the mean in terms of standard deviations, providing a measure of the data point’s relative position within the distribution.|Expressed in the same units as the original data, providing an absolute measure of how spread out the values are around the mean.|

---

# Confidence Interval
A Confidence Interval is a range of values, calculated from sample data, that is likely to contain the true value of a population parameter. The confidence level selected for an interval determines how often intervals constructed this way will contain the true parameter value. This range of values is generally used to deal with population-based data, extracting specific, valuable information with a certain amount of confidence, hence the term 'Confidence Interval'.

Example: Suppose we randomly select a sample of 50 students, calculate their average height to be 165 cm, and compute a 95% confidence interval for the population's average height of 160 to 170 cm. This suggests that if we were to take multiple samples and create confidence intervals in the same manner, approximately 95% of those intervals would contain the population's true average height.

$$CI = \bar{x} \pm \text{(t or z)}\left(\frac{\sigma}{\sqrt{n}}\right)$$
$$\bar{x} = \text{sample mean}$$
$$\text{t or z} = \text{chosen t-value/z-value from the table}$$
$$\sigma = \text{the standard deviation}$$
$$n = \text{number of observations}$$
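
The interval formula can be sketched with a z-value taken from the standard normal distribution instead of a printed table. The sample mean of 165 cm and n = 50 follow the height example; the standard deviation of 18 cm is an assumed value chosen for illustration, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean: float, sigma: float, n: int,
                          confidence: float = 0.95) -> tuple[float, float]:
    """Confidence interval for a mean using a z critical value."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


low, high = z_confidence_interval(sample_mean=165.0, sigma=18.0, n=50)
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```

With these inputs the interval comes out at roughly 160 to 170 cm. For small samples where the standard deviation is estimated from the data, a t critical value would replace the z-value.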

# Confidence Level
The confidence level describes the uncertainty associated with a sampling method. 

Suppose we used the same sampling method (say sample mean) to compute a different interval estimate for each sample. Some interval estimates would include the true population parameter, and some would not. 

A 90% confidence level means that we would expect 90% of the interval estimates to include the population parameter. A 95% confidence level means that 95% of the intervals would include the population parameter.

## Confidence interval vs Level

|Aspects|Confidence Interval|Confidence Level|
|---|---|---|
|Definition|A confidence interval is a range of values calculated from sample data that is likely to include the true unknown parameter of a population.|The confidence level represents the degree of confidence that the true parameter falls within the calculated confidence interval.|
|Representation|Numerical range (e.g., [Lower Bound, Upper Bound])|Typically expressed as a percentage (e.g., 95%)|
|Interpretation|The range within which the true parameter is expected to fall with a certain level of confidence.|The level of confidence in the estimation being made|
|Example|A 95% confidence interval for the mean height is [65, 70]|We are 95% confident that the true mean height falls within the interval|

## Factors influencing Confidence Interval
The width of a confidence interval is primarily influenced by three factors:

- **Sample Size (n)**: Larger sample sizes tend to result in narrower confidence intervals because they provide more precise estimates of the population parameter.
- **Variability in the Data (Standard Deviation or Standard Error)**: Greater variability in the data leads to wider confidence intervals, as there is more uncertainty in the estimates.
- **Confidence Level (CL)**: Higher confidence levels, such as 95% or 99%, result in wider intervals. This reflects the trade-off between precision and confidence - higher confidence requires a wider range.
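
The three factors can be compared numerically with a small helper that returns the interval width. This sketch assumes a z-based interval; the function name `ci_width` and the values plugged in are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def ci_width(sigma: float, n: int, confidence: float) -> float:
    """Total width of a z-based confidence interval for the mean."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return 2 * z * sigma / sqrt(n)


baseline = ci_width(sigma=10.0, n=100, confidence=0.95)
print(f"baseline:          {baseline:.2f}")
print(f"4x sample size:    {ci_width(10.0, 400, 0.95):.2f}")   # narrower
print(f"2x variability:    {ci_width(20.0, 100, 0.95):.2f}")   # wider
print(f"99% confidence:    {ci_width(10.0, 100, 0.99):.2f}")   # wider
```

Because the width scales with 1/√n, quadrupling the sample size only halves the interval.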

## When to use 

- These intervals give important information about the dependability of study findings and provide a measure of uncertainty around a point estimate.
- Confidence intervals help researchers express the accuracy of their estimates and make stronger conclusions when working with sample data.
- Confidence intervals, in essence, give a reasonable range for the true population parameter while acknowledging the inherent variability in data.
- Confidence intervals are frequently used by researchers in hypothesis testing so they can determine whether a given value falls within the interval and thus influence conclusions about the statistical significance of the data.
- Whether in medical research, social sciences, or business analytics, the judicious use of confidence intervals enhances the credibility and depth of statistical inferences, fostering a nuanced understanding of the underlying phenomena.

# T-test
The t-test is used to compare the averages of two groups to see if they are significantly different from each other.

## Assumptions in T-test
- **Independence**: The observations within each group must be independent of each other, meaning that the value of one observation should not influence the value of another observation.
- **Normality**: The data within each group should be approximately normally distributed i.e, the data within each group being compared should resemble a normal bell-shaped distribution.
- **Homogeneity of Variances**: The variances of the two groups being compared should be equal. This assumption ensures that the groups have a similar spread of values.
- **Absence of Outliers**: There should be no outliers in the data as outliers can influence the results especially when sample sizes are small.

## Prerequisites

1. **Hypothesis Testing**: Hypothesis testing is a statistical method used to make inferences about a population based on a sample of data.

2. **P-value**: The p-value is the probability of observing a test statistic at least as extreme as the one actually observed, given that the null hypothesis is true.

     - A small p-value usually less than 0.05 means the results are unlikely to be due to random chance so we reject the null hypothesis.
     - A large p-value means the results could easily happen by chance so we don't reject the null hypothesis.

3. **Degree of freedom (df)**: The degrees of freedom are the number of independent values that are free to vary when calculating an estimate from the sample data.

    - For a single sample, the degrees of freedom are the sample size minus 1, i.e. $df = n_s - 1$, where $n_s$ is the number of observations in the sample. Suppose we have 2 samples A and B. The df would be calculated as $df = (n_A - 1) + (n_B - 1)$.

4. **Significance Level**: The significance level is the predetermined threshold that is used to decide whether to reject the null hypothesis. Commonly used significance levels are 0.05, 0.01, or 0.10.

5. **T-statistic**: The t-statistic is a measure of the difference between the means of two groups. It is calculated as the difference between the sample means divided by the standard error of the difference. It is also known as the t-value or t-score.

    - If the t-value is large => the two groups likely come from different populations.
    - If the t-value is small => the two groups likely come from the same population.
  
6. **T-Distribution**: The t-distribution commonly known as the Student's t-distribution is a probability distribution with tails that are thicker than those of the normal distribution.

7. **Statistical Significance**: Statistical significance is determined by comparing the p-value to the chosen significance level.

    - If the p-value is less than or equal to the significance level the result is considered statistically significant and the null hypothesis is rejected.
    - If the p-value is greater than the significance level the result is not statistically significant and there is insufficient evidence to reject the null hypothesis.
    - In the context of a t-test these concepts are applied to compare means between two groups. It checks whether the means are significantly different from each other. The p-value from the t-test is then compared to the significance level to make a decision about the null hypothesis.

## Types of T-tests
There are three types of t-tests and they are categorized as dependent and independent t-tests.

![](https://statisticseasily.com/wp-content/uploads/2024/02/paired-t-test-2-1024x576.jpg "Types of t-tests")

### One sample T-test
One sample t-test is used to compare the sample mean of the data to a particular given value. We can use this when the sample size is small (under 30), the data is collected randomly, and it is approximately normally distributed.

$$t = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}$$

$$t = \text{t statistic}$$
$$\bar{x} = \text{mean of the sample}$$
$$\mu = \text{mean of the population}$$
$$\sigma = \text{standard deviation of the sample}$$
$$n = \text{sample size}$$
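
The one-sample formula can be sketched with the standard library as follows. The function name, the measurements, and the hypothesized mean of 5.0 are illustrative; the sample standard deviation is computed with the n − 1 denominator.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample: list[float], mu0: float) -> tuple[float, int]:
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)   # stdev uses the n - 1 denominator
    t = (mean(sample) - mu0) / standard_error
    return t, n - 1


measurements = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.2, 5.5]  # hypothetical data
t, df = one_sample_t(measurements, mu0=5.0)
print(f"t = {t:.3f}, df = {df}")
```

The resulting t is then compared with the critical value of the t-distribution for `df` degrees of freedom, or converted to a p-value with a statistics library.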

### Independent sample T-test
An Independent sample t-test, commonly known as an unpaired sample t-test, is used to find out if the differences found between two groups are actually significant or just a random occurrence. We can use this when:

- the population mean or standard deviation is unknown. (information about the population is unknown)
- the two samples are separate/independent, e.g. boys and girls (the two are independent of each other)

$$t = \frac{\bar{x_1} - \bar{x_2}}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$

$$t = \text{t statistic}$$
$$\bar{x_1} = \text{mean of the sample 1}$$
$$\bar{x_2} = \text{mean of the sample 2}$$

$$\sigma_1 = \text{standard deviation of the sample 1}$$
$$\sigma_2 = \text{standard deviation of the sample 2}$$

$$n_1 = \text{sample size of 1st sample}$$
$$n_2 = \text{sample size of 2nd sample}$$
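
A minimal sketch of the two-sample statistic, following the formula above with each group's own sample variance. The function name and the two small groups are made up so that the arithmetic can be checked by hand.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(sample_1: list[float], sample_2: list[float]) -> float:
    """t statistic for two independent samples (separate variances)."""
    n1, n2 = len(sample_1), len(sample_2)
    # Standard error of the difference between the two sample means.
    standard_error = sqrt(variance(sample_1) / n1 + variance(sample_2) / n2)
    return (mean(sample_1) - mean(sample_2)) / standard_error


group_1 = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical scores
group_2 = [3.0, 4.0, 5.0, 6.0, 7.0]
print(independent_t(group_1, group_2))
```

Here both groups have variance 2.5, so the standard error is 1 and the statistic equals the difference in means, -2. The sign only reflects which group is listed first.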

### Paired Two-sample T-test
Paired sample t-test also known as dependent sample t-test is used to find out if the difference in the mean of two samples is 0. The test is done on dependent samples usually focusing on a particular group of people or things. In this each entity is measured twice resulting in a pair of observations. 

We can use this when:

- Two similar samples are given. [Eg, Scores obtained in English and Math (both subjects)]
- The dependent variable data is continuous.
- The observations are independent of one another.
- The dependent variable is approximately normally distributed.

$$t = \frac{\bar{x_d}}{\frac{\sigma}{\sqrt{n}}}$$

$$t = \text{t statistic}$$
$$\bar{x_d} = \text{mean of the paired differences}$$

$$\sigma = \text{standard deviation of the paired differences}$$
$$n = \text{number of pairs}$$
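
For the paired case the test works on the per-pair differences. This is a small sketch with invented before/after scores; `paired_t` is an illustrative name.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first: list[float], second: list[float]) -> tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired measurements."""
    # One difference per entity that was measured twice.
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


before = [70.0, 65.0, 80.0, 75.0]   # hypothetical scores, same four people
after = [72.0, 70.0, 83.0, 76.0]
t, df = paired_t(before, after)
print(f"t = {t:.3f}, df = {df}")
```

The differences here are 2, 5, 3 and 1, giving a mean difference of 2.75 and a t statistic of about 3.22 on 3 degrees of freedom.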

### t-statistic Interpretation
- **Large positive t (e.g., t = 5, 10, 20)**: Group 1 mean is much larger than Group 2
- **Large negative t (e.g., t = -5, -10, -20)**: Group 1 mean is much smaller than Group 2
- **t ≈ 0**: Means are almost the same, no significant difference

# Chi-Square Test
The Chi-squared test, or $\chi^2$ test, checks whether there is a relationship between two categorical variables. For example, suppose we record people's favorite colors and their preferred ice cream flavors. The test tells us whether these two variables are associated with each other. For instance, it is possible that individuals who prefer the color blue also tend to favor chocolate ice cream.

The test checks whether the observed data fit the counts that would be expected if there were no association at all; a large deviation between the two is evidence of an association.

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

- $\sum$: The symbol means sum, so the term must be computed for each cell of your contingency table and the results added up.
- $O_i$: The observed count, i.e. the actual number of observations in a given cell of the contingency table.
- $E_i$: The expected count, i.e. the number of observations you would expect in that cell under the hypothesis of no association ($H_0$).
- $(O_i - E_i)$: The difference between the observed and expected frequencies.
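
The calculation can be sketched for a small contingency table of favorite color against ice cream flavor. The counts are invented, `chi_square` is an illustrative helper, and the expected count for each cell is taken as row total × column total / grand total. The closing p-value line relies on the chi-square distribution with exactly 2 degrees of freedom, whose survival function is exp(−x/2); for other table sizes a statistics library would be needed.

```python
from math import exp


def chi_square(observed: list[list[int]]) -> tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if rows and columns were independent.
            e = row_totals[i] * col_totals[j] / total
            chi2 += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


# Hypothetical counts. Rows: blue, red. Columns: chocolate, vanilla, strawberry.
table = [[30, 10, 10],
         [10, 20, 20]]
chi2, df = chi_square(table)
p_value = exp(-chi2 / 2)   # valid only because df == 2 for this 2 x 3 table
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p_value:.5f}")
```

Every expected count in this table is 15 or 20, so the usual minimum-count guideline is met, and the very small p-value points to an association between color and flavor in this made-up data.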

## Why Chi-Square Tests Matter
Chi-square tests are valuable in many fields of study, such as marketing, biology, medicine, and the social sciences, for the following reasons:

- **Revealing Associations**: Chi-square tests help researchers identify significant relationships between different categories, which can guide further investigation and prediction.
- **Validating Assumptions**: Chi-square tests check if your observed data matches what you expected. This helps you know if your ideas are on track or if you need to reconsider them.
- **Data-Driven Decisions**: Chi-square tests validate our beliefs based on empirical evidence and boost confidence in our inferences.

## Addressing Assumptions and Considerations
- Chi-square tests assume that the observations are independent of one another.
- Each cell in the table should have an expected count of at least five for reliable results. Otherwise, consider Fisher's exact test as an alternative.
- Chi-square tests do not indicate a causal relationship; they only identify an association between variables.

## What are Categorical Variables?
- Categorical variables are like sorting things into different groups. But instead of using numbers, we're talking about categories or labels. For example, colors, types of fruit, or types of cars are all categorical variables.
- They are called "categorical" because they sort things into separate groups such as "red," "green" or "blue". Unlike height or weight, which are measured on a continuous scale, categorical data has a fixed set of options with no numerical order between them. For example, asking whether someone prefers apples or oranges yields categorical data.

## Characteristics of Categorical Variables
- **Distinct Groups**: Categorical variables put things into different groups that don't overlap. For example, when we talk about hair color, someone can be a redhead, black-haired, blonde, or brunette. Each person falls into just one of these groups.
- **Non-Numerical**: Categories are just names with no inherent order or magnitude, so it makes no sense to rank them. For example, blonde is not "greater than" brunette; the two are merely different.
- **Limited Options**: Categorical variables are characterized by a fixed number of possibilities. One may have such choices as red, blonde, brown, black hair color. The number of categories may fluctuate, but they all remain distinct and bounded in scope.