```
From: https://github.com/ksatola
Version: 0.0.1

TODOs
1. Distributions: 
    - https://mathworld.wolfram.com/PoissonDistribution.html
    - https://mathworld.wolfram.com/BinomialDistribution.html
    - https://mathworld.wolfram.com/UniformDistribution.html

```

# Statistics - Basics

## Correlation vs. Regression
Correlation and regression are far from the same concept. So, let’s see what the relationship is between correlation analysis and regression analysis.

There is a single expression that sums it up nicely: **correlation does not imply causation**!

### The Relationship between Variables
**Correlation** measures the degree of relationship between two variables. **Regression analysis** is about how one variable affects another or what changes it triggers in the other.

### Causality
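**Correlation** doesn’t capture causality but the degree of interrelation between the two variables. **Regression** is usually set up with a cause-and-effect direction in mind: it describes how a dependent variable changes with an explanatory one rather than just the degree of connection. A fitted regression does not by itself prove causation, though.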

### Are X and Y Interchangeable?
A property of **correlation** is that the correlation between x and y is the same as between y and x. You can easily spot that from the formula, which is symmetrical. **Regressions** of y on x and x on y yield different results. Think about income and education. Predicting income based on education makes sense, but the opposite does not.
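
The symmetry of correlation and the asymmetry of regression can be checked numerically. The following is a minimal sketch in plain Python; the `education` and `income` values are hypothetical illustration data, and the helper names `pearson_r` and `ols_slope` are introduced here only for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ols_slope(x, y):
    """Slope of the least-squares line that predicts y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov / var_x


# Hypothetical data: years of education and income in thousands of dollars.
education = [10, 12, 14, 16, 18]
income = [30, 38, 41, 60, 66]

r_xy = pearson_r(education, income)
r_yx = pearson_r(income, education)
slope_income_on_education = ols_slope(education, income)
slope_education_on_income = ols_slope(income, education)

print(f"corr(x, y) = {r_xy:.4f}, corr(y, x) = {r_yx:.4f}")
print(f"slope of y on x = {slope_income_on_education:.4f}")
print(f"slope of x on y = {slope_education_on_income:.4f}")
```

Swapping the arguments leaves the correlation unchanged, while the two regression slopes differ.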

### Graphical Representation of Correlation and Regression Analysis
The two methods have very different graphical representations. **Linear regression analysis** is known for the best-fitting line that goes through the data points and minimizes the distance between the points and the line, whereas **correlation** is a single number.

<img src="images/stats_correlation_and_regression.jpg" alt="" style="width: 600px;"/>

## Probability Distribution

When we use the term distribution in statistics, we usually mean a `probability distribution`. Good examples are the Normal distribution, the Binomial distribution, and the Uniform distribution. **A distribution in statistics is a function that shows the possible values for a variable and how often they occur**.

<img src="images/stats_probability_distributions.png" alt="" style="width: 600px;"/>

The distribution of an event consists not only of the input values that can be observed, but is made up of all possible values.

<img src="images/stats_probability_rolling_a_die.png" alt="" style="width: 600px;"/>

So, the distribution of the event, rolling a die, will be given by the following table. The probability of getting 1 is 0.17 (1/6), the probability of getting 2 is 0.17, and so on. You can be sure that you have exhausted all possible values when the sum of the probabilities is equal to 1, or 100%. For all other values, the probability of occurrence is 0.

<img src="images/stats_probability_rolling_a_die2.png" alt="" style="width: 600px;"/>

Each probability distribution is associated with a graph describing the likelihood of occurrence of every event. Here’s the graph for our example. This type of distribution is called a `uniform distribution`.

<img src="images/stats_probability_rolling_a_die3.png" alt="" style="width: 600px;"/>

It is crucial to understand that the distribution in statistics is defined by the underlying probabilities and not the graph. The graph is just a visual representation. 

Now think about rolling two dice. What are the possibilities? One and one, two and one, one and two, and so on. 

<img src="images/stats_probability_rolling_a_die4.png" alt="" style="width: 600px;"/>

Here’s a table with all the possible combinations. 

<img src="images/stats_probability_rolling_a_die5.png" alt="" style="width: 600px;"/>

We are interested in the sum of the two dice. So, what’s the probability of getting a sum of 1? It’s 0, as this event is impossible. What’s the probability of getting a sum of 2? There is only one combination that would give us a sum of 2 – when both dice are equal to 1. So, 1 out of 36 total outcomes, or 0.03. Similarly, the probability of getting a sum of 3 is given by the number of combinations that give a sum of three divided by 36. Therefore, 2 divided by 36, or 0.06. We continue this way until we have the full probability distribution.

Let’s see the graph associated with it. So, looking at it we understand that when rolling two dice, the probability of getting a 7 is the highest. We can also compare different outcomes such as: the probability of getting a 10 and the probability of getting a 5. It’s evident that it’s less likely that we’ll get a 10.

<img src="images/stats_probability_rolling_a_die6.png" alt="" style="width: 600px;"/>
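
The distribution of the sum of two dice can also be built by enumerating all 36 outcomes. This is a small sketch using only the Python standard library; the variable names are chosen for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Count how many of the 36 ordered outcomes produce each sum.
counts = Counter(a + b for a, b in product(faces, repeat=2))
total_outcomes = sum(counts.values())

# Probability of each sum as an exact fraction.
distribution = {s: Fraction(c, total_outcomes) for s, c in sorted(counts.items())}

for s, p in distribution.items():
    print(f"sum = {s:2d}  probability = {p} = {float(p):.3f}")
```

A sum of 1 never appears in the table (probability 0), a sum of 7 has the largest probability, and a sum of 10 is less likely than a sum of 5.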

### The Normal Distribution
The normal distribution is essential when it comes to statistics. Not only does it approximate a wide variety of variables, but decisions based on its insights have a great track record. Also, distributions of sample means with large enough sample sizes can be approximated by a normal distribution (even if the original distributions from which the samples were drawn are not normal).

The statistical term for it is **Gaussian distribution**, though many people call it the **Bell Curve** because it is shaped like a bell. It is symmetrical and its mean, median and mode are equal. It has no skew. It is perfectly centred around its mean.

<img src="images/stats_normal_distribution.jpg" alt="" style="width: 600px;"/>

On the plane, you can notice that the highest point is located at the mean. This is because it coincides with the mode. The spread of the graph is determined by the standard deviation, as it is shown below.

<img src="images/stats_normal_distribution2.jpg" alt="" style="width: 600px;"/>

## Hypothesis Testing

There are four hypothesis testing steps in data-driven decision-making:
1. Formulate a hypothesis.
2. Find the right test for your hypothesis.
3. Execute the test.
4. Make a decision based on the result.

### A Hypothesis
A hypothesis is an idea that can be tested (compared with something else).

So, if I tell you that apples in New York are expensive, this is an idea, or a statement, but it is not testable until I have something to compare it with. For instance, if I define expensive as any price higher than $1.75 per pound, then it immediately becomes a hypothesis.

### An Example

#### Two-sided or two-tailed test

Here’s a simple topic that can be tested.

According to Glassdoor (the popular salary information website), the mean data scientist salary in the US is 113,000 dollars. So, we want to test if their estimate is correct.

There are two hypotheses that are made: the null hypothesis, denoted H0, and the alternative hypothesis, denoted H1 or HA. The null hypothesis is the one to be tested and the alternative is everything else. In our example,

The null hypothesis would be: The mean data scientist salary is 113,000 dollars,

While the alternative: The mean data scientist salary is not 113,000 dollars.

Now, you would want to check whether 113,000 is close enough to the true mean, as estimated from our sample. If it is, you would accept the null hypothesis (more precisely, fail to reject it). Otherwise, you would reject the null hypothesis.

The concept of the null hypothesis is similar to: innocent until proven guilty. We assume that the mean salary is 113,000 dollars and we try to prove otherwise.
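
As an illustration of steps 3 and 4, here is a rough sketch of a two-tailed z-test for the salary example. Only the hypothesized mean of 113,000 comes from the text; the sample mean, standard deviation, sample size and significance level are hypothetical values chosen for this sketch, and the standard deviation is assumed to be known.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 113_000         # mean stated by the null hypothesis (from the example)
sample_mean = 100_200  # hypothetical sample mean
sigma = 11_000         # hypothetical standard deviation, treated as known
n = 30                 # hypothetical sample size
alpha = 0.05           # hypothetical significance level

# Standardize the distance between the sample mean and the hypothesized mean.
standard_error = sigma / sqrt(n)
z = (sample_mean - mu_0) / standard_error

# Two-tailed p-value: probability of a result at least this extreme on either side.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_null = p_value < alpha
print(f"z = {z:.2f}, p-value = {p_value:.4g}, reject H0: {reject_null}")
```

With these made-up numbers the sample mean lies many standard errors below 113,000, so the null hypothesis would be rejected; different sample values could lead to the opposite decision.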

#### One-sided or one-tailed test

This was an example of a two-sided or two-tailed test. You can also form one-sided or one-tailed tests. Say your friend, Paul, told you that he thinks data scientists earn more than 125,000 dollars per year. You doubt him so you design a test to see who’s right.

The null hypothesis of this test would be: The mean data scientist salary is more than 125,000 dollars.

The alternative will cover everything else, thus: The mean data scientist salary is less than or equal to 125,000 dollars.

It is important to note that outcomes of tests refer to the population parameter rather than the sample statistic! As such, the result that we get is for the population.

Another crucial consideration is that, generally, the researcher is trying to reject the null hypothesis. Think about the null hypothesis as the status quo and the alternative as the change or innovation that challenges that status quo. In our example, Paul was representing the status quo, which we were challenging.

Alright. We showed you the four hypothesis testing steps.

## Type I and Type II Errors
In general, we can have two types of errors – `type I error` and `type II error`.

**Type I error** is when you reject a true null hypothesis and is the more serious error. It is also called `a false positive`. The probability of making this error is `alpha` – the `level of significance`. Since you, the researcher, choose the alpha, the responsibility for making this error lies solely on you.

**Type II error** is when you accept a false null hypothesis. The probability of making this error is denoted by `beta`. Beta depends mainly on sample size and population variance. So, if your topic is difficult to test due to hard sampling or has high variability, you are more likely to make this type of error. As you can imagine, if the data set is hard to test, it is not your fault, so Type II error is considered a smaller problem.

We should also mention that the `probability of rejecting a false null hypothesis` is equal to 1 minus beta. This is the researcher’s goal – to reject a false null hypothesis. Therefore, 1 minus beta is called `the power of the test`. Generally, researchers increase the power of a test by increasing the sample size.

### An Example

You are in love with this girl from the other class, but are unsure if she likes you.

There are two errors you can make.
First, if she likes you back and you don’t invite her out, you are making the type I error.

The null hypothesis in this situation is: she likes you back. It turns out that she really did like you back. Unfortunately, you did not invite her out, because after testing the situation, you wrongly thought the null hypothesis was false. In other words, you made a type I error – you rejected a true null hypothesis and lost your chance. It is a very serious problem, because you could have been made for each other, but you didn’t even try.

Now imagine another situation. She doesn’t like you back, but you go and invite her out. The null hypothesis is still: she likes you back, but this time it is false; in reality, she doesn’t like you back. However, after testing, you accept the null hypothesis and wrongly go and invite her out. She tells you she has a boyfriend who is much older, smarter and better at statistics than you, and turns her back.

You made a type II error – accepted a false null hypothesis. However, it is no big deal, as you go back to your normal life without her and soon forget about this awkward situation.

# Statistics 1

## Mean, Median, Mode, Weighted Mean, Harmonic Mean

### Example 1
```
Provide the mean and median for this data set: 4, 6, 8, 10, 17
mean = 45/5 = 9
median = 8
```
### Example 2
```
Provide the mode for the dataset: 3,5,7,10,3,3,9,2,5,10,9.
3
```
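
Both examples can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen for this example.

```python
from statistics import mean, median, mode

data_1 = [4, 6, 8, 10, 17]
data_2 = [3, 5, 7, 10, 3, 3, 9, 2, 5, 10, 9]

mean_1 = mean(data_1)      # sum of values divided by their count: 45 / 5
median_1 = median(data_1)  # middle value of the sorted data
mode_2 = mode(data_2)      # most frequent value

print(mean_1, median_1, mode_2)
```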


## Min, Max, Range, Variance, Standard Deviation

## z-score
How many standard deviations does our data point lie from the mean?

<img src="images/z-score.png" alt="" style="width: 200px;"/>

Simply put, a `z-score` (also called a `standard score`) gives you an idea of how far from the mean a data point is. But more technically it’s a measure of how many standard deviations below or above the population mean a raw score is.

A z-score can be placed on a normal distribution curve. Z-scores typically range from -3 standard deviations (which would fall to the far left of the normal distribution curve) up to +3 standard deviations (which would fall to the far right of the normal distribution curve); values beyond that are possible but rare. In order to use a z-score, you need to know the mean μ and also the population standard deviation σ.

Z-scores are a way to compare results to a “normal” population. Results from tests or surveys have thousands of possible results and units; those results can often seem meaningless. For example, knowing that someone’s weight is 150 pounds might be good information, but if you want to compare it to the “average” person’s weight, looking at a vast table of data can be overwhelming (especially if some weights are recorded in kilograms). A z-score can tell you where that person’s weight is compared to the average population’s mean weight.

see: https://www.statisticshowto.com/probability-and-statistics/z-score/

z-score, standard score example:

<img src="images/z-score-iq-example.png" alt="" style="width: 400px;"/>


<img src="images/z-score-empirical-rule.png" alt="" style="width: 400px;"/>

### Example 1
```
For a certain data set we have a mean of 100, a median of 95, and a standard deviation of 25. What is the z-score for the data point 138?
Answer: z-score = (138-100)/25 = 1.52
```
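
The same calculation as a small Python helper. The function name `z_score` is introduced here for illustration; note that the median given in the exercise is not needed.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that `value` lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


z = z_score(138, mean=100, std_dev=25)
print(round(z, 2))
```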

## Empirical Rule (68-95-99.7) - Three Sigma Rule
For a normal distribution (also known as a Gaussian distribution):

- About 68% of values fall within one standard deviation of the mean.
- About 95% of the values fall within two standard deviations from the mean.
- Almost all of the values—about 99.7%—fall within three standard deviations from the mean.

These facts are the `68 95 99.7 rule`. It is sometimes called the `Empirical Rule` because the rule originally came from observations (empirical means “based on observation”).

The Normal/Gaussian distribution is the most common type of data distribution. All of the measurements are computed as distances from the mean and are reported in standard deviations.

The Gaussian curve is a symmetric distribution, so the middle 68.2% can be divided in two. Zero to 1 standard deviations from the mean has 34.1% of the data. The opposite side is the same (0 to -1 standard deviations). Together, this area adds up to about 68% of the data.

<img src="images/Empirical-rule-FINAL.jpg" alt="" style="width: 600px;"/>
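
The three percentages can be checked with the error function from Python's `math` module, using the standard identity that the share of a normal distribution within k standard deviations of the mean equals erf(k / √2). The helper name `share_within` is an assumption of this sketch.

```python
from math import erf, sqrt


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")

# Half of the middle band lies on each side of the mean (0 to +1 standard deviations).
one_side = share_within(1) / 2
```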

## Percentile Rank

<img src="images/percentile_rank.png" alt="" style="width: 600px;"/>

Example of PR calculation for score of 85

<img src="images/percentile_rank_85.png" alt="" style="width: 600px;"/>

<img src="images/PR_and_NCE.gif" alt="" style="width: 600px;"/>



See: https://en.wikipedia.org/wiki/Percentile_rank

### Example 1
```
Six students earn the following test scores: 60, 70, 80, 90, 95, 100. The student that scored 95 is in the _____ percentile.
Answer: (4 + 0.5) / 6 * 100 = 75
```
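
A short sketch of the same percentile rank calculation: count the scores below the given score, add half of the scores equal to it, and divide by the number of scores. The function name `percentile_rank` is chosen for this example.

```python
def percentile_rank(scores, score):
    """Percentile rank: (count below + 0.5 * count equal) / total * 100."""
    below = sum(1 for s in scores if s < score)
    equal = sum(1 for s in scores if s == score)
    return (below + 0.5 * equal) / len(scores) * 100


scores = [60, 70, 80, 90, 95, 100]
rank_95 = percentile_rank(scores, 95)
print(rank_95)
```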

## Probability
Always consider which type of probability you encounter and ask how it was calculated or inferred, to assess how reliable it might be:
- **Objective** probabilities (based on calculations)
    - **Classical** (coin flip, roll a die) - we know all possible outcomes and they are equally likely to occur
        - How: `number of wins / all possible outcomes = % of probability`
    - **Empirical** (number of successful shots by a football player) 
        - possible outcomes are not equally likely to occur, each attempt is different and can be influenced by many factors. 
        - based on past data - we can only use historical data to infer; the more data we have, the more trust we can place in the probability 
        - not perfect but can be done if we have some data in repeating situations
        - gives a nice idea of what to expect
        - How: `number of successful shots / all shots by the player = % of probability (ratio)`
- **Subjective** (based on experience)
    - Uses people's opinions and experience and perhaps some related data that influence the statement about probability
    - A guess
    
### Example 1
```
When people attend a fundraiser, 40% of them donate money. If three people attend this month’s fundraiser -- Jane, Kate, and Liza -- what are the exact chances that just one of them will donate money?
Answer:
Probability of donating Amount = 40/100 = 0.4
Probability of not donating Amount = 1 - 0.4 = 0.6
Three possibilities
- Jane donates but Kate & Liza do not donate
- Kate donates but Jane & Liza do not donate
- Liza donates but Jane & Kate do not donate

p = (0.4)(0.6)(0.6) + (0.4)(0.6)(0.6) + (0.4)(0.6)(0.6) = 0.432 = 43.2%
```
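
The fundraiser example is a binomial probability: exactly 1 success in 3 independent trials with a success chance of 0.4. A minimal sketch, where `binomial_pmf` is a helper name introduced for this example:

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success chance p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_exactly_one = binomial_pmf(1, 3, 0.4)
print(f"{p_exactly_one:.3f}")
```

Here `comb(3, 1) = 3` counts the three possibilities listed in the answer (Jane, Kate or Liza being the only donor).
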
### Example 2
```
On an exam, 60 of 100 people got a passing score. 50 of 100 people studied for the exam. 45 of the people that studied got a passing score. Studying for the exam and passing the exam are _____.
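Answer: dependent events. P(pass | studied) = 45/50 = 0.9, which differs from P(pass) = 60/100 = 0.6; for independent events the two would be equal.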

```

## Permutations: the order of things
The number of ways in which objects can be arranged (order matters)
- `n!` (n factorial)
- for 5 objects: 5! = 5x4x3x2x1 = 120

How many permutations do we have when selecting x objects out of n?
- `n! / (n-x)!`
- if we have 8 players, how many permutations can we have on the podium (3 places)?
- 8! / (8-3)! = 8x7x6 = 336

### Example 1
```
There are 5 total candidates available for 2 different jobs. How many permutations are there for those two jobs from the pool of 5 candidates?
Answer: 5! / (5-2)! = 5x4 = 20
```
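
The permutation counts above can be verified with `math.factorial` and `math.perm` (available in Python 3.8 and later):

```python
from math import factorial, perm

arrangements_of_5 = factorial(5)  # 5! ways to order 5 objects
podiums = perm(8, 3)              # 8! / (8-3)! ordered podiums from 8 players
job_assignments = perm(5, 2)      # 5! / (5-2)! ways to fill 2 different jobs from 5 candidates

print(arrangements_of_5, podiums, job_assignments)
```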

## Combinations: permutations without regard for order
The number of ways in which objects can be chosen (order not important)
- `n! / [(n-x)! * x!]` where n is total number of objects and x is number of objects chosen at one time
- if we have 10 students in a class, how many combinations of 4 person team we could randomly choose?
- 10! / [(10-4)! * 4!] = 10x9x8x7 / 4x3x2x1 = 5040 / 24 = 210 possible teams of four

- what is the probability that 2 specific students (Tom and Kate) end up in the same team?
- how many ways can the other students fill the last free spots (the first 2 spots are already taken by our 2 students, 10 were in total, so we have 8 other students left for 2 spots)
- 8! / [(8-2)! * 2!] = 28

- so, we have 210 outcomes and 28 desired outcomes
- the probability that 2 specific students end up in the same team is 28/210 ≈ 13.3%

Eight adults are carpooling to an event. At random 2 will be chosen as the drivers for the rest of the group. How many combinations of 2 drivers are there among this group of 8?
- To solve, divide 8! by the product of 6! and 2!: `(8*7*6*5*4*3*2*1) / [(6*5*4*3*2*1)*(2*1)]`, which reduces to `(8*7)/2 = 28`.
- 8! / [(8-2)! * 2!] = 28

You have 10 employees, and 4 of them are chosen at random to form a team. You want Li and Raoul to be on the team. What will the formula 8!/[(8-2)! 2!] provide you?
- The number of ways to fill the two remaining spots on the team from the eight employees other than Li and Raoul.
- Dividing this number by the total number of possible teams gives the probability of Li and Raoul both being on the team.
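
The combination counts in this section can be checked with `math.comb` (Python 3.8 and later). The sketch below also recomputes the probability that two specific people both land on a randomly chosen team of 4 out of 10.

```python
from fractions import Fraction
from math import comb

teams_of_four = comb(10, 4)     # all possible teams of 4 from 10 people
teams_with_both = comb(8, 2)    # teams containing 2 fixed people: choose the other 2 from 8
driver_pairs = comb(8, 2)       # pairs of drivers among 8 adults

probability_both = Fraction(teams_with_both, teams_of_four)

print(teams_of_four, teams_with_both, driver_pairs)
print(f"{probability_both} = {float(probability_both):.1%}")
```

The exact probability is 28/210 = 2/15, or about 13.3%.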


## Random Experiment and Random Variable
`Random experiments` are opportunities to observe the outcome of a chance event. If we were rolling dice, the `random experiment` is observing and recording the outcome, which brings us to a random variable. A `random variable` is the numerical outcome of a random experiment. If we rolled a two and a three, our `random variable` would be five. 

## Random Variables
As the result of the outcome is unknown (random), we call the result from an experiment a random variable
- Discrete - experimental results are typically whole numbers (no decimals), and there is a limited number of possible outcomes
- Continuous - there is an infinite number of possible outcomes

The distinction between the two is important because you will calculate probabilities differently in each type of situation.

### Discrete Probability Distribution

Probability distribution of drinks orders during a party:

<img src="images/probability_distribution_discrete.png" alt="" style="width: 400px;"/>

<img src="images/probability_distribution_discrete2.png" alt="" style="width: 400px;"/>

Relative frequency:

<img src="images/probability_distribution_discrete_relative_frequency.png" alt="" style="width: 500px;"/>

Mean of discrete probability distribution:

<img src="images/probability_distribution_discrete_relative_frequency_mean.png" alt="" style="width: 600px;"/>

An average consumer ordered 1.46 drinks during the party.

Standard deviation of discrete probability distribution:

<img src="images/probability_distribution_discrete_relative_frequency_std.png" alt="" style="width: 650px;"/>

3.02 (sum of squared weights) - 2.13 (mean squared, 1.46²) = 0.89 (variance); the square root of the variance, 0.94, is the standard deviation (sigma)

#### Expected Value
Total of the weighted payoffs associated with the decision. `Expected monetary value (EMV)` is a variation of the mean for a discrete probability distribution that includes subtracting the cost of the investment.
```
You hold a lottery ticket. There is a 10% chance you win $500, a 40% chance you win $25, and a 50% chance you win $0. What is the expected monetary value of your lottery ticket?
Answer: For each outcome multiply the possible winnings times the percentage, then add up all three products. (0.1*500)+(0.4*25)+(0.5*0) = $60
```
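
The lottery calculation as a short sketch: the expected value is the sum of each payoff multiplied by its probability. The `ticket_cost` parameter is a hypothetical addition showing where the cost of the investment would be subtracted; the example above does not state a ticket price, so it defaults to 0.

```python
def expected_monetary_value(outcomes, ticket_cost=0.0):
    """Sum of probability * payoff over all outcomes, minus the (assumed) cost."""
    return sum(probability * payoff for probability, payoff in outcomes) - ticket_cost


# (probability, payoff in dollars) pairs from the lottery example.
lottery = [(0.10, 500), (0.40, 25), (0.50, 0)]

emv = expected_monetary_value(lottery)
print(f"${emv:.2f}")
```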

#### Binomial random variable
A binomial experiment has only two possible outcomes per trial (success or failure). With a binomial random variable, you can use n, the number of trials, and p, the chance of success on each trial, to predict a result.

Binomial vs. Normal distribution

<img src="images/CompareBinomialAndNormalDistribution.png" alt="" style="width: 400px;"/>

### Example 1
```
10 customers ordered pizza. Five ordered 1 pizza, two ordered 2 pizzas, two ordered 3 pizzas, and one ordered 4 pizzas. What's the mean number of pizzas ordered for this discrete distribution?

    pizzas ordered    frequency    relative freq.    weight
        1                5            5/10 = 0.5        1 * 0.5 = 0.5
        2                2            2/10 = 0.2        2 * 0.2 = 0.4
        3                2            2/10 = 0.2        3 * 0.2 = 0.6
        4                1            1/10 = 0.1        4 * 0.1 = 0.4

    Mean (sum of weights) = 0.5 + 0.4 + 0.6 + 0.4 = 1.9

```
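
The pizza example can be recomputed from the raw frequencies. This sketch also derives the variance and standard deviation with the same "mean of squares minus squared mean" approach used for the drinks example; those two values are not part of the original exercise.

```python
from math import sqrt

# Number of pizzas ordered -> number of customers (from the example).
frequencies = {1: 5, 2: 2, 3: 2, 4: 1}
total_customers = sum(frequencies.values())

# Relative frequency of each outcome.
probabilities = {x: count / total_customers for x, count in frequencies.items()}

# Mean: sum of value * relative frequency.
mean_pizzas = sum(x * p for x, p in probabilities.items())

# Variance: sum of value^2 * relative frequency, minus the squared mean.
variance = sum(x**2 * p for x, p in probabilities.items()) - mean_pizzas**2
std_dev = sqrt(variance)

print(f"mean = {mean_pizzas:.2f}, variance = {variance:.2f}, std = {std_dev:.2f}")
```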

### Continuous Probability Distribution

<img src="images/probability_distribution_continuous.png" alt="" style="width: 400px;"/>

Probability density curves can be used to show the distribution of outcomes. The area under the curve represents the probability of outcomes. The probability of A is X, and the probability of B is Y. In reality, the probability at any single point is equal to 0, so we usually check the probability over a range of values of the random variable. The total area under the curve is equal to 1 (or 100%).

<img src="images/probability_distribution_continuous_density_distribution_curve.png" alt="" style="width: 400px;"/>

<img src="images/probability_distribution_continuous_density_distribution_curve2.png" alt="" style="width: 400px;"/>



### Normal distribution & Central Limit Theorem

The `"fuzzy" central limit theorem` says that data which are influenced by many small and unrelated random effects are approximately normally distributed.

<img src="images/probability_distribution_continuous_normal.png" alt="" style="width: 800px;"/>

## Z-transformations

### Example 1
What percentage of men weigh more than 211 pounds? The weight is normally distributed with a mean of 150 pounds and a standard deviation of 25 pounds.

First, calculate the `z-score`

<img src="images/z-score-example.png" alt="" style="width: 600px;"/>

It would appear here far to the right

<img src="images/z-score-example1b.png" alt="" style="width: 600px;"/>

Then, find the percentage of men who weigh more than 211 pounds by using the `standard normal distribution / z-score table` to look up the probability value for a z-score of 2.44.

<img src="images/z-score-example1c.png" alt="" style="width: 600px;"/>

According to our mean and standard deviation, 99.27% of all men weigh 211 pounds or less, which means that the percentage of men who weigh more than 211 pounds is 1 - 0.9927 = 0.0073, or 0.73%.
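
Instead of a printed z-table, the same lookup can be sketched with `statistics.NormalDist` from the Python standard library (Python 3.8 and later). The mean of 150 and standard deviation of 25 come from the example; the variable names are chosen for this sketch.

```python
from statistics import NormalDist

weights = NormalDist(mu=150, sigma=25)

z = (211 - weights.mean) / weights.stdev  # z-score of 211 pounds
share_at_or_below = weights.cdf(211)      # what the z-table reports
share_above = 1 - share_at_or_below       # upper tail

print(f"z = {z:.2f}")
print(f"at or below 211 lb: {share_at_or_below:.4f}")
print(f"above 211 lb: {share_above:.4f}")
```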

### Example 2
What is the probability that a man weighs between 140 and 170 pounds?

First, we need two z-scores: one for 140, another for 170. Next, check the standard normal distribution table (chart value) values for the scores. 

<img src="images/z-score-example2b.png" alt="" style="width: 600px;"/>

How do we find a value for a negative z-score? Because the bell curve is symmetrical, we can subtract the number found for the positive z-score from 1.0.

<img src="images/z-score-example2d.png" alt="" style="width: 400px;"/>

<img src="images/z-score-example2e.png" alt="" style="width: 400px;"/>

So, the result is:

<img src="images/z-score-example2f.png" alt="" style="width: 400px;"/>
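
The two-sided lookup can be sketched with `statistics.NormalDist`, again assuming the mean of 150 pounds and standard deviation of 25 pounds from Example 1. The probability of falling between two values is the difference of the two cumulative probabilities.

```python
from statistics import NormalDist

weights = NormalDist(mu=150, sigma=25)
standard_normal = NormalDist()

z_low = (140 - 150) / 25   # z-score for 140 pounds
z_high = (170 - 150) / 25  # z-score for 170 pounds

# Symmetry of the bell curve: the table value for -z equals 1 minus the value for +z.
cdf_low = 1 - standard_normal.cdf(abs(z_low))
cdf_high = standard_normal.cdf(z_high)

p_between = cdf_high - cdf_low
print(f"z-scores: {z_low:.1f} and {z_high:.1f}")
print(f"P(140 <= weight <= 170) = {p_between:.4f}")
```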


### Example 3
Comparing distributions

<img src="images/example_ztransform.png" alt="" style="width: 600px;"/>

See: http://www.statistics4u.info/fundstat_eng/ee_ztransform.html

### Example 4
```
A student scores 1510 on a standardized test, for a z-score of 2.17. The z-score table shows that the value for a z-score of 2.17 is 0.9850. Therefore, the probability that someone scored >1510 on this test is 1 - 0.9850 = 0.015, or 1.5%.
Answer: 0.9850 indicates that this student's score was equal to or greater than 98.50% of the other students' scores. Thus 1.5% are likely to score higher.
```

### Example 5
```
The average height for men is 5'9". Michael's height is 6'7". If the data is normally distributed and you calculated the Z-score, how can you find the percentage of men who are 6'7" or taller?
Answer: The standard normal distribution table associates Z-scores with the percentage of values at or below that Z-score, so it gives you the percentage of men shorter than 6'7". Subtracting that percentage from 100% leaves the percentage of men who are 6'7" or taller.
```

# Statistics 2

## Inferential Statistics
Inferential statistics draws conclusions about a whole population based on samples of data.

## Sampling
The challenge is getting the right answers, especially when the world, or even your small slice of it, is very big. 

Measuring everything is just way 
- too expensive, 
- too time consuming and 
- in some cases, it's just impossible. 

Political operatives can't poll every voter. Cell phone companies can't measure the quality level of every single item they produce. A farmer can't measure the actual size of every tomato grown. Scientists can't track the health of every single person in the country. Instead of measuring everything, they just measure a small group or subset of the total population. That small subset of measurements is a `sample`. And under the right circumstances, this `sample can act as a representative of the entire population`. The best samples are chosen at random.

### Random Sample
The most dependable type of data comes from what we call a `simple random sample`. This means that 
- the sample is chosen such that each individual in the population has the same probability of being chosen at any stage during the sampling process, 
- and each subset of k individuals has the same probability of being chosen for the sample as any other subset of k individuals.

The simple random sample can be rather elusive. Eliminating bias and maintaining data independence is quite challenging. As a result, `alternatives to the simple random sample` are sometimes utilized (but the simple random sample is still the only way to get dependable statistical outcomes). These alternative methods are simpler to organize, easier to carry out, and often, they seem both logical and sound:
- **Systematic sample** - Choose one unit and then every k-th unit thereafter. So if we're measuring customer satisfaction at a store, we might ask the first person to come out of the store for their opinion, and then ask every tenth customer after them.
- **Sources of bias** - The sampling time and sampling location as well as the presence of a sampler, may introduce bias or inhibit independence.
- **Opportunity sample** - the sampler simply takes the first n number of units that come along.
- **Stratified sample** - one where the total population is broken up into homogeneous groups. Let's say we're trying to figure out the average amount of sugar in a single cookie, regardless of the type of cookie. We could break up the population into many different cookie types: chocolate chip, peanut butter, oatmeal, sugar, ginger, snickerdoodle, oatmeal raisin. From there, we might take a sample of 30 cookies from each category. But perhaps chocolate chip cookies make up 50% of all cookies and ginger cookies make up only 3% of all the cookies. Our very fair-looking system might actually be biased against the most popular cookies.
- **Cluster sample** - similar to a stratified sample in that we are breaking things up into groups. What's the difference? In a stratified sample, the members of each group are alike. In clusters, the groups are likely to have a mix of characteristics; they're heterogeneous. Suppose we are testing a new product. We might ask samples of people in 20 major cities what they think about the new product. While the people in a single sample might all be from the same city, each sample might contain men and women, people of different races, politics and socio-economic backgrounds.

The `simple random sample` will always be the gold standard, but these `alternative sampling methods` should not be completely dismissed.
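
Three of the sampling schemes above can be sketched with Python's `random` module. The population of 100 numbered units, the two strata and their 80/20 split are hypothetical; the stratified variant shown here allocates the sample in proportion to stratum size, which is one way to avoid the bias described in the cookie example.

```python
import random

rng = random.Random(42)          # fixed seed so the sketch is repeatable
population = list(range(1, 101)) # hypothetical population of 100 unit IDs
sample_size = 10

# Simple random sample: every subset of 10 units is equally likely.
simple = rng.sample(population, sample_size)

# Systematic sample: random starting unit, then every k-th unit thereafter.
k = len(population) // sample_size
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sample with proportional allocation (hypothetical strata).
strata = {"chocolate chip": population[:80], "ginger": population[80:]}
stratified = []
for name, group in strata.items():
    quota = round(sample_size * len(group) / len(population))
    stratified.extend(rng.sample(group, quota))

print("simple:    ", sorted(simple))
print("systematic:", systematic)
print("stratified:", sorted(stratified))
```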

### Sample Size
A `sample` is a group of units drawn from a population, and the `sample size` is the number of units drawn and measured for that particular sample. The total population itself may be very large or, perhaps, immeasurable, so a `sample` is just looking at a slice of the population in the hopes of providing us a representative picture of the entire population. The larger the sample size, the more accurate our measurement or, at least, the more confidence we have that our sample is actually providing us a glimpse of the whole population. 
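
The effect of sample size can be illustrated with a small simulation. Everything here is hypothetical: a made-up population of 10,000 values drawn from a normal distribution with mean 100 and standard deviation 15, from which repeated simple random samples of two different sizes are taken. The spread of the sample means shows how far a single sample's mean typically lands from the population mean.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 measurements.
population = [rng.gauss(100, 15) for _ in range(10_000)]


def spread_of_sample_means(sample_size, repeats=200):
    """Standard deviation of the means of `repeats` simple random samples."""
    means = [mean(rng.sample(population, sample_size)) for _ in range(repeats)]
    return pstdev(means)


spread_small = spread_of_sample_means(10)
spread_large = spread_of_sample_means(1000)

print(f"spread of sample means, n = 10:   {spread_small:.2f}")
print(f"spread of sample means, n = 1000: {spread_large:.2f}")
```

The means of the larger samples cluster much more tightly around the population mean than the means of the smaller samples.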