# Some common distributions - Exercises

## Exercise #1

> Your friend claims that changing the font to comic sans will result in more ad revenue on your web sites. When presented in random order, 9 pages out of 10 had more revenue when the font was set to comic sans. If it was really a coin flip for these 10 sites, what’s the probability of getting 9 or 10 out of 10 with more revenue for the new font?

To solve this problem, we can use the binomial distribution formula, which calculates the probability of obtaining a certain number of successes (in this case, having more revenue with Comic Sans) out of a fixed number of trials (10 trials, representing the 10 websites).

The formula for the binomial distribution is:

$$
P(X=k) = \binom{n}{k} p^{k}(1-p)^{n-k}
$$

Where

- $P(X=k)$ is the probability of getting exactly $k$ successes
- $n$ is the number of trials (in this case, 10 websites),
- $k$ is the number of successes (in this case, 9 or 10 websites)
- $p$ is the probability of success on each trial (0.5, since it's like flipping a fair coin).

We need to calculate the probability of getting $9$ or $10$ successes out of $10$ trials. In Python, we can compute this probability in two ways with the `scipy` package.

The first is to evaluate the formula above directly for $k = 9$ and $k = 10$ and add the two probabilities:

$$
\binom{10}{9} \cdot 0.5^{9} \cdot (1-0.5)^{1} + \binom{10}{10} \cdot 0.5^{10} \cdot (1-0.5)^{0}
$$

using the [`scipy.special.binom`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.binom.html) function from the `scipy` package, which computes the binomial coefficient $\binom{n}{k}$.

In [1]:
import scipy.special

scipy.special.binom(10, 9)*0.5**9*(1-0.5) + scipy.special.binom(10, 10)*0.5**10

0.0107421875

The other approach is to use the binomial distribution (`scipy.stats.binom`) and compute 1 minus the probability of obtaining at most 8 successes.

In [2]:
import numpy as np
from scipy.stats import binom

1-binom.cdf(8, n = 10, p = 0.5)

0.0107421875
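
As a cross-check, here is a minimal sketch of two further ways to obtain the same tail probability with `scipy.stats.binom`: summing the probability mass at $k = 9$ and $k = 10$, and using the survival function `sf`, which returns $P(X > k)$. The variable names `p_pmf` and `p_sf` are illustrative only.

```python
from scipy.stats import binom

n, p = 10, 0.5

# Add the probability mass at k = 9 and k = 10
p_pmf = binom.pmf(9, n, p) + binom.pmf(10, n, p)

# Survival function: P(X > 8) is the same event as P(X >= 9)
p_sf = binom.sf(8, n, p)

print(p_pmf, p_sf)
```

Both values should agree with $11/1024$, since 10 of the 1024 equally likely outcomes have exactly 9 successes and 1 has all 10.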

## Exercise #2

> A software company is doing an analysis of documentation errors of their products. They sampled their very large codebase in chunks and found that the number of errors per chunk was approximately normally distributed with a mean of 11 errors and a standard deviation of 2. When randomly selecting a chunk from their codebase, what's the probability of fewer than 5 documentation errors?

To find the probability of fewer than 5 documentation errors per chunk, we need to calculate the cumulative probability up to 5 errors based on the normal distribution with a mean $\mu$ of 11 errors and a standard deviation $\sigma$ of 2 errors.

We can use the Z-score formula to standardize the value of $5$ and then look up the cumulative probability from the standard normal distribution table or use a calculator that provides this functionality. 

$$
Z = \frac{x-\mu}{\sigma}
$$

Where:

- $x$ is the value we're interested in (in this case, 5)
- $\mu$ is the mean (11)
- $\sigma$ is the standard deviation (2).

In Python, we can compute the value using the `scipy.stats.norm.cdf` function, which evaluates the cumulative distribution function of a normal distribution.

In [3]:
from scipy.stats import norm

print(norm.cdf(x = 5, loc = 11, scale = 2))

# equivalently, using the Z-score (5 - 11)/2 = -3 with the standard normal
print(norm.cdf(x = -3))

0.0013498980316300933
0.0013498980316300933
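
The Z-score route can also be written out by hand. The sketch below is not part of the original solution: it computes $Z$ from the formula and then evaluates the standard normal CDF through the identity $\Phi(z) = \tfrac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right)$, using only the Python standard library.

```python
import math

x, mu, sigma = 5, 11, 2

# Standardize the value of interest
z = (x - mu) / sigma

# Standard normal CDF expressed through the error function
p = 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(z, p)
```

The result should match the `norm.cdf` output to within floating-point error.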


## Exercise #3

> The number of search entries entered at a web site is Poisson at a rate of 9 searches per minute. The site is monitored for 5 minutes. What is the probability of 40 or fewer searches in that time frame?

To solve this problem, we'll use the Poisson distribution formula, which gives the probability of a given number of events occurring in a fixed interval of time or space, given the average rate of occurrence.

The Poisson distribution formula is:

$$
P(X = k; \lambda) = \frac{\lambda^{k}e^{-\lambda}}{k!}
$$

Where:

- $P(X=k)$ is the probability of observing $k$ events,
- $\lambda$ is the average rate of occurrence (in this case, the average number of searches per minute),
- $k$ is the number of events observed.

Given that the rate of searches is 9 searches per minute, we have $\lambda = 9$ for a one-minute interval.

We are interested in the probability of observing 40 or fewer searches in 5 minutes. Since the average rate is given per minute, we need to adjust the rate for the 5-minute interval. So, the new average rate for 5 minutes is $\lambda' = \lambda \times 5 = 9 \times 5 = 45$.

In Python, we can use the `cdf` method of the `scipy.stats.poisson` distribution to compute the probability of observing 40 or fewer searches in a 5-minute interval.

In [4]:
from scipy.stats import poisson

# Average rate of searches per minute
lambda_ = 9

# Average rate for 5 minutes
lambda_5 = lambda_ * 5

# Probability of observing 40 or fewer searches in 5 minutes
probability = poisson.cdf(40, lambda_5)

print("Probability of 40 or fewer searches in 5 minutes:", np.round(probability,4))

Probability of 40 or fewer searches in 5 minutes: 0.2555
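
To connect the result back to the formula, the following sketch sums the Poisson probability mass term by term for $k = 0, \dots, 40$ with $\lambda' = 45$, using only the standard library. It is meant as a sanity check under the same assumptions as above, not as a replacement for `poisson.cdf`.

```python
import math

lam = 9 * 5  # expected number of searches in 5 minutes

# P(X <= 40) as the sum of lambda^k * exp(-lambda) / k! for k = 0..40
p = sum(lam**k * math.exp(-lam) / math.factorial(k) for k in range(41))

print(round(p, 4))
```

For much larger rates this naive sum can overflow or lose precision, which is one reason to prefer the library routine in practice.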


## Exercise #4

> Suppose that the number of web hits to a particular site are approximately normally distributed with a mean of 100 hits per day and a standard deviation of 10 hits per day. What’s the probability that a given day has fewer than 93 hits per day expressed as a percentage to the nearest percentage point?

Like Exercise #2, to find the probability that a given day has fewer than $93$ hits per day, we'll use the Z-score formula to standardize the value of $93$ and then find the corresponding probability from the standard normal distribution table or using a calculator.

The Z-score formula is:

$$
Z = \frac{x-\mu}{\sigma}
$$

Where:

- $x$ is the value we're interested in (in this case, 93 hits per day)
- $\mu$ is the mean (100 hits per day)
- $\sigma$ is the standard deviation (10 hits per day).

Using Python:

In [5]:
x = 93
mu = 100
sigma = 10

Z = (x-mu)/sigma

print(Z)
np.round(norm.cdf(Z), 2)*100

-0.7


24.0

## Exercise #5

> Suppose that the number of web hits to a particular site are approximately normally distributed with a mean of 100 hits per day and a standard deviation of 10 hits per day. What number of web hits per day represents the number so that only 5% of days have more hits?

To find the number of web hits per day such that only 5% of days have more hits, we need to find the value $X$ such that the cumulative probability up to $X$ is $95\%$.

Since the distribution is normal, we first find the Z-score corresponding to a cumulative probability of $95\%$ using the inverse of the standard normal cumulative distribution function (in Python, the `norm.ppf` method), and then convert that Z-score back to a value of $X$.

The Z-score corresponding to a cumulative probability of $95\%$, denoted $Z_{0.95}$, is:

$$
Z_{0.95} \approx 1.645
$$

Next, we'll use the Z-score formula to find the value of $X$ corresponding to this Z-score:

$$
Z_{0.95} = \frac{X-100}{10}
$$

and so:

$$
X = 1.645 \times 10 + 100 = 116.45
$$

In [6]:
mu = 100
sigma = 10

Z_95 = norm.ppf(q = 0.95)
x = Z_95*sigma + mu
np.round(x,3)

116.449
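
One way to check this result is to ask for the upper tail directly. The sketch below uses `norm.isf` (the inverse survival function) to find the point with 5% of the mass above it, then confirms the tail probability with `norm.sf`; the variable names are illustrative.

```python
from scipy.stats import norm

mu, sigma = 100, 10

# Point with 5% of the probability mass above it
x = norm.isf(0.05, loc=mu, scale=sigma)

# Upper-tail probability at that point, expected to be 0.05
tail = norm.sf(x, loc=mu, scale=sigma)

print(round(x, 3), tail)
```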

## Exercise #6

> Suppose that the number of web hits to a particular site are approximately normally distributed with a mean of 100 hits per day and a standard deviation of 10 hits per day. Imagine taking a random sample of 50 days. What number of web hits would be the point so that only 5% of averages of 50 days of web traffic have more hits?

To find the number of web hits per day such that only 5% of averages of 50 days of web traffic have more hits, we'll need to consider the distribution of the sample means. The distribution of sample means follows a normal distribution with a mean equal to the population mean ($\mu$) and a standard deviation equal to the population standard deviation divided by the square root of the sample size ($\sigma/\sqrt{n}$).

Given:
- Population mean $\mu = 100 \text{ hits per day}$
- Population standard deviation $\sigma = 10 \text{ hits per day}$
- Sample size $n = 50 \text{ days}$

The standard deviation of the sample mean ($\sigma_{\bar{x}}$) is calculated as:

$$
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{50}} \approx 1.414
$$

We want to find the value of $\bar{x}$ (the sample mean) that corresponds to the 95th percentile of the sampling distribution, which corresponds to a Z-score of approximately 1.645:

$$
\bar{x} = 1.645 \times 1.414 + 100 \approx 102.33
$$

In [7]:
mu = 100
sigma = 10
n_of_samples = 50

sigma_mean = sigma/np.sqrt(n_of_samples)

Z_95 = norm.ppf(q = 0.95)
x = Z_95*sigma_mean + mu
np.round(x,3)

102.326

In [8]:
mu = 100
sigma = 10
n_samples = 50

x = norm.ppf(
    q = 0.95,
    loc = mu,
    scale = sigma/np.sqrt(n_samples)
)

np.round(x, 3)

102.326
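
The analytical answer can also be checked empirically. The following sketch is a Monte Carlo simulation that is not part of the original solution: it draws many samples of 50 days from the assumed normal distribution, averages each sample, and takes the empirical 95th percentile of those averages. The seed and the number of replications are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

mu, sigma, n_days, n_reps = 100, 10, 50, 100_000

# Each row is one simulated sample of 50 days; average across each row
means = rng.normal(mu, sigma, size=(n_reps, n_days)).mean(axis=1)

# Empirical point with 5% of the sample means above it
empirical = np.quantile(means, 0.95)

print(np.round(empirical, 2))
```

The simulated value should land close to 102.33, with small differences due to sampling noise.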

## Exercise #7

> You don’t believe that your friend can discern good wine from cheap. Assuming that you’re right, in a blind test where you randomize 6 paired varieties (Merlot, Chianti, ...) of cheap and expensive wines, what is the chance that she gets 5 or 6 right?

This exercise is very similar to Exercise #1 and we can use the binomial distribution formula, which calculates the probability of obtaining a certain number of successes (in this case, correctly identifying expensive wines) out of a fixed number of trials (the number of pairs of wines). Since your friend doesn't have the ability to discern good wine from cheap, the probability of guessing correctly on each trial (identifying the expensive wine) is 0.5, as it's like flipping a fair coin. We need to calculate the probability of getting 5 or 6 successes out of 6 trials.

In [9]:
import scipy.special

p = scipy.special.binom(6, 5)*0.5**5*(1-0.5) + scipy.special.binom(6, 6)*0.5**6
np.round(p*100,1)

10.9

In [10]:
from scipy.stats import binom

p= 1-binom.cdf(4, n = 6, p = 0.5)
np.round(p*100,1)

10.9
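
Because there are only $2^6 = 64$ equally likely guess patterns, the same probability can be obtained by brute-force enumeration. The sketch below encodes each pair as 1 (correct) or 0 (incorrect) and counts the patterns with at least 5 correct guesses; this encoding is an illustrative assumption.

```python
from itertools import product

# All possible right/wrong patterns over the 6 pairs
outcomes = list(product([0, 1], repeat=6))

# Patterns in which at least 5 pairs are identified correctly
favourable = sum(1 for outcome in outcomes if sum(outcome) >= 5)

p = favourable / len(outcomes)
print(favourable, len(outcomes), round(p * 100, 1))
```

There are 6 patterns with exactly 5 correct and 1 with all 6 correct, so the ratio is $7/64$.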

## Exercise #8

> The number of web hits to a site is Poisson with mean 16.5 per day. What is the probability of getting 20 or fewer in 2 days?

To solve this exercise, we use the same approach as in Exercise #3: we are interested in the probability of observing 20 or fewer web hits in 2 days. Since the average rate is given per day, we need to adjust the rate for the 2-day interval, giving $\lambda' = 16.5 \times 2 = 33$.

In [11]:
from scipy.stats import poisson

# Average rate of hits per day
lambda_ = 16.5

# Average rate for 2 days
lambda_2 = lambda_ * 2

# Probability of observing 20 or fewer hits in 2 days
probability = poisson.cdf(20, lambda_2)

print("Probability of 20 or fewer hits in 2 days (%):", np.round(100*probability,1))

Probability of 20 or fewer hits in 2 days (%): 1.0
