# Exploring Distributions with Scipy

In [None]:
import itertools
from functools import partial

import numpy as np
import pandas as pd

from scipy import stats
from scipy.special import gamma

In [None]:
from matplotlib import pyplot as plt
from IPython.core.pylabtools import figsize
import seaborn as sns

In [None]:
# some colours
LIGHT_BLUE = '#348ABD'
PURPLE = '#A60628'
DARK_GREEN = '#467821'
colours = [LIGHT_BLUE, PURPLE, DARK_GREEN]

sns.set_theme()

In [None]:
figsize(11, 6)

In [None]:
# rounding helpers; note that r4 rounds to 3 decimal places despite its name
r2 = partial(np.round, decimals=2)
r4 = partial(np.round, decimals=3)

## Poisson

The Poisson distribution represents the number of events occurring over a specific time period, given that several conditions hold:

- An event either occurs or does not, that is, there are no half-events.
- The average number of events per the time period being considered is known.
- An event should have the same probability of occurring at any point in the time period. For example, it should not be more likely to occur at the start of a day than at the end of a day.
- Each event is independent.

Any number of events could occur during the time period being considered.

These conditions need not hold perfectly for the Poisson distribution to be a good model, but they need to hold at least approximately. For instance, while it is technically true that the number of people doing some task over a given time period does not satisfy the last condition (as there are a finite number of people on the planet), for all practical purposes the condition holds.

The Poisson distribution is described by a single parameter $\lambda$. This value represents:

- the average number of events over a given time period
- the variance for number of events over a given time period

The probability mass function (pmf) is defined as:

$$
P(k) = \frac{\lambda^k e^{-\lambda}}{k!}
$$

$P(k)$ describes the probability of observing _k_ events within a given time period, where &lambda; is the average number of events that occur in that same time period.

&lambda; is called a parameter of the distribution, and it controls the distribution's shape. For the Poisson distribution, &lambda; can be any positive number. By increasing &lambda;, we add more probability to larger values, and conversely by decreasing &lambda; we add more probability to smaller values. One can describe &lambda; as the intensity of the Poisson distribution.

One useful property of the Poisson distribution is that its expected value is equal to its parameter: $E[Z|\lambda] = \lambda$
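
As a quick check of the formula above, the sketch below evaluates the pmf by hand and compares it with `stats.poisson.pmf`. The value of `lam` and the helper name `poisson_pmf` are illustrative choices, not part of the original notebook.

```python
import math

import numpy as np
from scipy import stats

lam = 4.25  # illustrative value of lambda


def poisson_pmf(k: int, lam: float) -> float:
    """Evaluate lambda^k * exp(-lambda) / k! directly."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


ks = np.arange(16)
by_hand = np.array([poisson_pmf(int(k), lam) for k in ks])
from_scipy = stats.poisson.pmf(ks, lam)

# the hand-rolled formula and scipy should agree
print(np.allclose(by_hand, from_scipy))
# the mean and the variance are both equal to lambda
print(stats.poisson(lam).mean(), stats.poisson(lam).var())
```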

The list of applications of the Poisson distribution is very long, but here are just a few:

- The number of customers who call to complain about a service problem per month
- The number of visitors to a website per minute
- The number of arrivals at a car wash per hour


In [None]:
# samples - counts between 0 and 15
x = np.arange(16)
# contrasting values for lambda
params = [1.5, 4.25, 10]
samples = [stats.poisson.pmf(x, l) for l in params]

In [None]:
colnames = [f'p={p}' for p in params]
colnames

In [None]:
df = pd.DataFrame(
    dict(itertools.chain([('x', x)], (zip(colnames, samples))))
).melt(
    id_vars = 'x',
    value_vars = colnames,
    var_name = 'lambda',
    value_name = 'k'
)
df.head()

In [None]:
p = sns.barplot(
    data = df,
    x = 'x',
    y = 'k',
    hue = 'lambda',
    dodge = False,
    saturation = 1
);
p.set(
    xlabel = '$k$',
    ylabel = '$p(k)$',
    title = r'Probability mass function of a Poisson random variable; differing $\lambda$ values'
);
p.get_legend().set_title(r'$\lambda$');

In [None]:
# plot
x = np.arange(16)
# three contrasting values for lambda
params = [1.5, 4.25, 10]

for param, colour in zip(params, colours):
    plt.bar(
        x,
        stats.poisson.pmf(x, param),
        color=colour,
        label=rf"$\lambda = {param}$",
        alpha=0.60,
        edgecolor=colour,
        lw=1
    )
# labels and titles
plt.xticks(x)
plt.legend()
plt.ylabel("probability of $k$")
plt.xlabel("$k$")
plt.title(r"Probability mass function of a Poisson random variable; differing $\lambda$ values");

### Examples

1. Which of the following situations is described well by a Poisson distribution?

- The number of cars traveling along a road in a given day
- The number of games one team wins against another team in a single day
- The number of cars traveling along a road in a given hour
- The number of players on a baseball team that get sick in a given week

Correct answer: The number of cars traveling along a road in a given hour

The number of cars traveling along a road in a given hour is well described by a Poisson distribution, as

- the number of cars is practically unbounded,
- the probability of a car driving along the road at the start of the hour is approximately the same as the probability of a car driving along the road at the end of the hour, and
- each car has a roughly independent choice of whether to drive along the road.

The number of cars traveling along the road in a given day is not Poisson distributed, as certain times ("rush hours") are likely to be more populated than others. The number of players on a baseball team that get sick in a given week is bounded above by the number of players on the team, so it cannot be Poisson distributed. Additionally, the event of getting sick is likely not independent for each player, since teammates can infect each other. Finally, the number of games one team wins against another in a single day is not Poisson distributed, as the games are not independent of each other -- the outcome of one can affect the games that follow.

2. A person knows that on average, they receive 9 letters in the mail per week. What is the probability (rounded to the nearest hundredth) that, in a given week, the person receives exactly 9 letters?

The number of letters that this person receives in a week meets the requirements for a Poisson distribution.

- The event of getting a letter either occurs or does not.
- We know that the average number of letters in the mail per week is 9.
- The probability of receiving a letter is the same throughout the week (since mail is not delivered on Sundays, we will not consider Sunday as a part of the week).
- Each piece of mail received is independent of the others. For example, getting a letter from a pen pal and a credit card bill are not related.
- There is no limit to the amount of mail that can be received in a week.

Since the number of letters follows a Poisson distribution, we can use the given formula, with $k=9$ and $\lambda=9$

In [None]:
# k = 9, lambda = 9
p_dist = stats.poisson(9)
r2(p_dist.pmf(9))

In [None]:
x = np.arange(0, 20)
plt.bar(
    x,
    p_dist.pmf(x),
    align='center',
    label=r"$\lambda = 9$",
    alpha=0.6
);
plt.xticks(x);
plt.ylabel("probability of $k$");
plt.xlabel("$k$");
plt.title(r'$pois(\lambda=9)$');

3. A person knows that they average 9 letters in the mail per week. What is the standard deviation in the number of letters the person receives in a given week?

The average number of letters per week is 9, so $\lambda=9$

The variance for the Poisson distribution is also &lambda;. Thus, the standard deviation in the number of letters per week is $\sqrt{9} = 3$

In [None]:
p_dist.std()

4. A person keeps track of the number of letters in the mail they receive per week for a year.

In [None]:
letters = np.array([3, 4, 5, 6, 7, 8, 9, 10, 11])
frequency = np.array([3, 5, 7, 8, 8, 6, 6, 5, 4])

In [None]:
pd.DataFrame(dict(letters=letters, frequency=frequency))

What is the (approximate) probability that they receive more than 10 letters in a given week the following year, assuming it follows a Poisson distribution?

In [None]:
# total letters divided by the number of weeks observed gives the mean letters per week
total_letters = letters.dot(frequency)
n_weeks = frequency.sum()
lambda_ = total_letters / n_weeks
print(f'Total: {total_letters}, weeks: {n_weeks}, lambda: {lambda_}')

We want to find the probability that this person receives more than 10 letters in a week. This is equal to:

$$
P_x(11) + P_x(12) + P_x(13) + \ldots
$$

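giving the infinite sum

$$
\sum_{k=11}^{\infty}{P_x(k)}
$$

Given the following power series

$$
e^x = \sum_{k=0}^{\infty}{\frac{x^k}{k!}}
$$

the probabilities $P_x(k)$ sum to 1 over all $k \ge 0$, so the infinite sum can be replaced by a finite one:

$$
\sum_{k=11}^{\infty}{P_x(k)} = 1 - \sum_{k=0}^{10}{P_x(k)}
$$

There is about a 10% chance that this person receives more than 10 letters in a given week.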

In [None]:
p_dist = stats.poisson(lambda_)
# everything to the right of 0-10
r4(1 - p_dist.cdf(10))
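
The tail probability can be computed in several equivalent ways. The sketch below assumes $\lambda = 7$ (364 letters over 52 weeks, from the table above) and compares the complement of the CDF, scipy's survival function `sf`, and the finite sum over $k = 0, \ldots, 10$.

```python
from scipy import stats

lam = 7  # mean letters per week: 364 letters over 52 weeks
dist = stats.poisson(lam)

tail_from_cdf = 1 - dist.cdf(10)                          # complement of P(X <= 10)
tail_from_sf = dist.sf(10)                                # survival function, P(X > 10)
tail_from_sum = 1 - sum(dist.pmf(k) for k in range(11))   # finite sum over k = 0..10

print(tail_from_cdf, tail_from_sf, tail_from_sum)
```

For probabilities far out in the tail, `sf` is usually the safer choice, since subtracting a CDF value very close to 1 loses precision.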

5. A batch of cookie dough makes 100 cookies. What is the smallest number of chocolate chips that should be added to the dough so that, when thoroughly mixed and sliced into 100 cookies, the probability that a cookie contains zero chocolate chips is less than 1%?

Since the chocolate chips are thoroughly mixed into the dough, the number of chocolate chips in each cookie is approximately Poisson distributed. Let &lambda; be the average number of chocolate chips in each cookie after we add the chocolate chips.

We want &lambda; such that $P_x(0) \lt 0.01$ and since

$$
P_x(0) = \frac{\lambda^0 e^{-\lambda}}{0!} = e^{-\lambda}
$$

then we want $e^{-\lambda} \lt 0.01$

Since $e^{-x}$ is a decreasing function, we have $e^{-\lambda} < 0.01$ when $\lambda > -\ln(0.01) = \ln(100) \approx 4.605$

This batch makes 100 cookies, so we need to add at least $100 \cdot 4.605 = 460.5$ chocolate chips. And because we add whole chocolate chips, we must round up to 461 to ensure the probability of a cookie with zero chocolate chips is less than 1%.
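
The arithmetic above can be checked numerically. This is a small sketch; the variable names are illustrative, and it assumes the Poisson approximation described in the text.

```python
import math

from scipy import stats

n_cookies = 100
target = 0.01  # required upper bound on P(no chips in a cookie)

min_lambda = -math.log(target)                 # e^{-lambda} < 0.01  <=>  lambda > ln(100)
min_chips = math.ceil(n_cookies * min_lambda)  # chips come in whole numbers

print(min_lambda, min_chips)
# zero-chip probability with the chosen number of chips, and with one chip fewer
print(stats.poisson(min_chips / n_cookies).pmf(0))
print(stats.poisson((min_chips - 1) / n_cookies).pmf(0))
```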

6. An emergency room averages 3 patients per hour. Assuming each patient takes the full hour to treat, what is the fewest number of doctors that should be on call so there is at least a 90% chance of not having more patients than doctors during a given hour?

If there are $X=k$ patients in an hour, the emergency room needs at least k doctors. So we want to find the number of patients k so that $P(X \le k) \ge 0.9$

In [None]:
p_dist = stats.poisson(3)
p_dist.ppf(0.9)

So if the emergency room has 5 doctors on call, there is at least a 90% chance of having at least as many doctors as patients.
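
To see why 5 is the smallest number that works, we can tabulate the cumulative probabilities around the threshold. This is a sketch under the same Poisson assumption as above.

```python
from scipy import stats

dist = stats.poisson(3)  # 3 patients per hour on average

# P(X <= k) for a few candidate numbers of doctors
for doctors in range(3, 7):
    print(doctors, round(dist.cdf(doctors), 4))

# smallest k with P(X <= k) >= 0.9
min_doctors = int(dist.ppf(0.9))
print(min_doctors)
```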

## Binomial

The binomial distribution describes a sequence of identical, independent Bernoulli trials. That is, each trial has the same probability of success, and the results of one trial do not affect any of the following trials.

For instance, if a coin is flipped 100 times, the number of heads can be described by a binomial distribution, as each coin flip can be represented by the same Bernoulli distribution. This holds even if the coin is unfair.
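
The following sketch illustrates the link between Bernoulli trials and the binomial distribution. It uses the standard binomial pmf $\binom{n}{k} p^k (1-p)^{n-k}$ (the same expression that appears in the widget example later on); the values of `n` and `p` and the helper name `binom_pmf` are illustrative assumptions.

```python
import math

import numpy as np
from scipy import stats

n, p = 100, 0.3  # illustrative values: 100 flips of an unfair coin


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


ks = np.arange(n + 1)
by_hand = np.array([binom_pmf(int(k), n, p) for k in ks])
from_scipy = stats.binom.pmf(ks, n, p)
print(np.allclose(by_hand, from_scipy))

# simulate: each row is one experiment of n Bernoulli trials
rng = np.random.default_rng(0)
heads = (rng.random((10_000, n)) < p).sum(axis=1)
# the simulated average number of heads should be close to n * p
print(heads.mean(), n * p)
```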

### Examples

1. A fair coin is flipped 10 times. What is the probability that an even number of heads appear?

In [None]:
# 0 - 10
n = 10
p = 0.5
x = np.arange(0, n+1)
# keep only the even counts (0, 2, ..., 10)
even_numbers = x[x % 2 == 0]
b_dist = stats.binom(n, p)
r2(b_dist.pmf(even_numbers).sum())

The probability is exactly $\frac{1}{2}$.

2. A manufacturer knows that 20% of the widgets he produces are defective. If he needs to sell at least 10 (non-defective) widgets, how many times does he need to produce a (possibly defective) widget so that he has at least a 90% chance of meeting his quota?

We need to find n so that

$$
\sum_{i=10}^{n} \binom{n}{i} (0.8)^i (0.2)^{n-i} \ge 0.9
$$

In [None]:
def px(n: int, p: float) -> float:
    dist = stats.binom(n, p)
    return sum([dist.pmf(i) for i in np.arange(10, n+1)])

In [None]:
# 10 - 15
n_vals = np.arange(10, 16)
p = 0.8
p_vals = [px(n, p) for n in n_vals]
pd.DataFrame(dict(n=n_vals, probability=r4(p_vals)))

The sum is first greater than 0.9 when $n=15$
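
The same answer can be found without a table by searching for the smallest $n$ directly. The sketch below uses the survival function `stats.binom.sf`, where `sf(k, n, p)` is $P(X > k)$; the variable names are illustrative.

```python
from scipy import stats

quota, p_good, target = 10, 0.8, 0.9

n = quota
# P(at least `quota` good widgets out of n) = P(X > quota - 1)
while stats.binom.sf(quota - 1, n, p_good) < target:
    n += 1

print(n, stats.binom.sf(quota - 1, n, p_good))
```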

3. A weighted coin has probability $p=0.10$ of showing heads. What is the expected number of heads resulting from flipping the coin 100 times?

In [None]:
stats.binom(100, 0.1).mean()

4. A coin is flipped 20 times, and 16 heads are observed. Is this enough to conclude that the coin is biased?

If the coin is actually fair, then the probability of heads is $p=0.5$

Assuming this is true, we calculate the probability of getting at least 16 heads in 20 flips, $P(X \ge 16) = \sum_{i=16}^{20} \binom{20}{i} (0.5)^{20}$:

In [None]:
b_dist = stats.binom(20, 0.5)
r4(b_dist.pmf(np.arange(16, 21)).sum())

Because this probability (about 0.6%) is less than 1%, we can reject the assumption that the coin is fair at the 1% significance level (one-sided) and conclude that the coin is biased towards heads.
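
The tail probability can also be computed exactly. This sketch counts the favourable outcomes with `math.comb` and compares the result with scipy's survival function; it assumes the same one-sided question as above (at least 16 heads).

```python
from fractions import Fraction
from math import comb

from scipy import stats

n_flips, n_heads = 20, 16

# exact tail probability under a fair coin: sum of C(20, i) / 2^20 for i = 16..20
exact = Fraction(sum(comb(n_flips, i) for i in range(n_heads, n_flips + 1)), 2 ** n_flips)

# the same tail from scipy: sf(k) is P(X > k), so use k = 15
p_value = stats.binom(n_flips, 0.5).sf(n_heads - 1)

print(exact, float(exact), p_value)
```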

## Exponential

The exponential distribution describes the time between events in a Poisson process. For the number of events to be Poisson distributed, the following conditions must be met (at least approximately):

- The average number of events per time period being considered is known.
- An event should have the same probability of occurring at any point in the time period. For example, it should not be more likely to occur at the start of a day than at the end of a day.
- Each event is independent.
- Any number of events could occur during the time period being considered.

The probability density function of an exponential random variable $z$ is

$$
P(z|\lambda) = \lambda e^{-\lambda z}
$$

Like a Poisson random variable, an exponential random variable can take on only non-negative values. But unlike a Poisson variable, the exponential can take on any non-negative values, including non-integral values such as 4.25 or 5.612401

This property makes it a poor choice for count data, which must be an integer, but a great choice for time data, temperature data (measured in Kelvins, of course), or any other precise and positive variable. The graph below shows two probability density functions with different &lambda; values.

Given a specific &lambda;, the expected value of an exponential random variable is equal to the inverse of &lambda;, that is:

$$
E[z|\lambda]=\frac{1}{\lambda}
$$
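
One practical detail: `scipy.stats.expon` is parameterised by `scale`, which corresponds to $1/\lambda$ rather than $\lambda$ itself. The sketch below checks the density and the expected value against the formulas above, using an illustrative value of `lam`.

```python
import numpy as np
from scipy import stats

lam = 0.5  # illustrative rate parameter
dist = stats.expon(scale=1 / lam)  # scipy parameterises by scale = 1 / lambda

z = np.linspace(0, 4, 9)
by_hand = lam * np.exp(-lam * z)  # lambda * e^{-lambda z}

print(np.allclose(dist.pdf(z), by_hand))
# expected value should equal 1 / lambda
print(dist.mean(), 1 / lam)
```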

In [None]:
# candidate values of z
x = np.linspace(0, 4, 100)
# the exponential distribution
expo = stats.expon
# sample lambda values
lambda_ = [0.5, 1]

for l_param, colour in zip(lambda_, colours[:2]):
    y = expo.pdf(x, scale=1/l_param)
    plt.plot(
        x, y,
        lw=3,
        color=colour,
        label=rf"$\lambda = {l_param:0.3f}$"
    )
    plt.fill_between(x, y, color=colour, alpha=.25)

plt.legend()
plt.ylabel("PDF at $z$")
plt.xlabel("$z$")
plt.ylim(0,1.2)
plt.title(r"Probability density function of an Exponential random variable; differing $\lambda$");

### Examples

1. Which of the following situations is described well by an exponential distribution?

- Rolling a fair die
- The heights of students in a school
- The number of hairs on a person's head
- The time between two flashes of lightning during a storm

Correct answer: The time between two flashes of lightning during a storm

The time between two flashes of lightning describes the time between events in a Poisson process, as the flashes of lightning are well-described by a Poisson distribution. Thus the time between flashes of lightning is exponentially distributed.

Heights are the classical example of normally distributed data, so they are not exponentially distributed. Rolling a fair die results in a discrete uniform distribution. Finally, the number of hairs on a person's head is unrelated to a Poisson process, and thus is not described by an exponential distribution (and probably not by any particularly nice distribution in general).

2. The lifetime of a part follows an exponential distribution, and each part lasts an average of 1 year. A manufacturer offers to replace any part that breaks within 3 months. What is the (approximate) probability the manufacturer will have to replace a given part?

3 months is a quarter of a year. With $\lambda=1$, the probability of a part breaking within the first quarter-year is the integral of the PDF from $t = 0$ to $t = 0.25$

In [None]:
r2(stats.expon(scale=1).cdf(0.25))

3. At a call center, calls come in every 20 minutes on average. What is the (approximate) probability that no calls will come in for a 30 minute period?

The time between calls can be represented by an exponential distribution with $\lambda = 3$, since one call every 20 minutes is 3 calls per hour. The probability that at least half an hour passes between calls is

$$
\begin{aligned} \int_{0.5}^{\infty}f(t)\,dt &= \int_{0.5}^{\infty}3e^{-3t}\,dt \\ &= -e^{-3t} \bigg\rvert_{0.5}^{\infty} \\ &= \lim_{b \to \infty} \left( -e^{-3b} + e^{-3(0.5)} \right) \\ &= 0 + e^{-1.5} \\ &\approx 0.223 \end{aligned}
$$


In [None]:
# everything to the right
r2(1-stats.expon(scale=1/3).cdf(0.5))
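
The integral reduces to the closed form $P(T > t) = e^{-\lambda t}$, which is what scipy's survival function `sf` returns. A minimal sketch comparing the two:

```python
import math

from scipy import stats

lam = 3  # calls per hour
t = 0.5  # half an hour

closed_form = math.exp(-lam * t)               # P(T > t) = e^{-lambda t}
from_scipy = stats.expon(scale=1 / lam).sf(t)  # survival function

print(closed_form, from_scipy)
```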

4. A machine takes an average of 10 minutes to produce a part. How long (approximately, in minutes) should the operator wait to be at least 95% sure that the machine has produced the part?

The manufacturing time can be represented by an exponential distribution with $\lambda = 6$, the number of parts produced per hour. The probability that at most $n$ hours pass before a part is produced is

$$
\int_{0}^{n}f(t)\,dt = \int_{0}^{n}\big(6e^{-6t}\big)\,dt = 1 - e^{-6n}
$$

We need to find when this is at least 0.95. So we solve:

$$
\begin{aligned} 1 - e^{-6n} &\ge 0.95 \\ e^{-6n} &\le 0.05 \\ n &\ge \frac{\ln 20}{6} \approx 0.499 \end{aligned}
$$

The operator should wait at least $0.499(60) = 29.94 \approx 30$ minutes to be 95% sure that the machine has produced the part.

In [None]:
np.ceil(stats.expon(scale=1/6).ppf(0.95) * 60)
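
The quantile also has a closed form, obtained by solving $1 - e^{-\lambda n} = 0.95$ for $n$. The sketch below evaluates it and compares it with `ppf`; the variable names are illustrative.

```python
import math

from scipy import stats

lam = 6  # parts per hour
confidence = 0.95

hours = -math.log(1 - confidence) / lam  # solve 1 - e^{-lambda n} = 0.95 for n
minutes = 60 * hours

print(hours, minutes)
# the same quantile from scipy, converted to minutes
print(stats.expon(scale=1 / lam).ppf(confidence) * 60)
```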

5. A lottery is hit every 4 months on average. Given that 3 months have already passed since the last jackpot was awarded, what is the (approximate) probability that the jackpot will be awarded within the next 3 months?

The exponential distribution is memoryless, so the fact that the last 3 months have passed without an awarded jackpot is irrelevant. The distribution has $\lambda = 3$ jackpots per year, so the probability that a jackpot is awarded within the next 3 months, or quarter-year, is given by

$$
\begin{aligned} \int_0^{0.25} f(t) \,dt &= \int_0^{0.25}3e^{-3t} \,dt \\ &= -e^{-3t} \bigg\rvert_0^{0.25} \\ &= 1 - e^{-0.75} \\ &\approx 0.528 \end{aligned}
$$


In [None]:
r4(stats.expon(scale=1/3).cdf(0.25))
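
The memoryless property itself can be verified numerically: the conditional probability of a jackpot in the next 3 months, given that 3 months have already passed, should match the unconditional probability. A small sketch, with `s` and `t` as illustrative names for the elapsed and remaining time:

```python
from scipy import stats

dist = stats.expon(scale=1 / 3)  # 3 jackpots per year
s = 0.25  # time already elapsed (3 months)
t = 0.25  # further waiting time of interest (3 months)

# P(T <= s + t | T > s) = P(s < T <= s + t) / P(T > s)
conditional = (dist.cdf(s + t) - dist.cdf(s)) / dist.sf(s)
unconditional = dist.cdf(t)

print(conditional, unconditional)
```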

## Gamma

The gamma distribution is used to model the length of time before an event occurs. This is in contrast to the exponential distribution from the previous section, which models the time between events.

For instance, the gamma distribution is often used to model waiting times. In particular, an insurance company may use it to model a lifespan, where the "event" is death.

The gamma distribution is described by two parameters _k_ (the shape parameter) and $\theta$ (the scale parameter), with the probability density function

$$
\large f_X (x)= \frac{1}{\Gamma(k)\theta^k}x^{k-1}e^{-\frac{x}{\theta}}
$$

 
Here, $\Gamma(x)$ is the gamma function, which is defined as

$$
\Gamma (x)=\int_0^{\infty} t^{x-1} e^{-t}\, dt
$$

It can be calculated that $E[X] = k\theta$ and $\text{Var}[X] = k\theta^2$ 

When $k = \theta = 1$ the resulting gamma distribution is in fact the standard exponential distribution defined by $f_X(x) = e^{-x}$ since $\Gamma(1) = 1$.

The gamma distribution has been used to model the size of insurance claims and the amount of rainfall into a reservoir among other things.
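
In scipy, the shape parameter $k$ is passed as the first argument of `stats.gamma` and $\theta$ as `scale`. The sketch below checks the mean, the variance, and the reduction to the exponential distribution, using illustrative parameter values.

```python
import numpy as np
from scipy import stats

k, theta = 2.5, 1.5  # illustrative shape and scale
dist = stats.gamma(k, scale=theta)  # scipy's shape argument `a` plays the role of k

print(dist.mean(), k * theta)
print(dist.var(), k * theta ** 2)

# k = theta = 1 reduces to the standard exponential distribution
x = np.linspace(0, 5, 11)
print(np.allclose(stats.gamma(1, scale=1).pdf(x), stats.expon().pdf(x)))
```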

### Examples

1. Which of the following situations is described well by a gamma distribution?

- Rolling a fair die
- The heights of students in a school
- The number of hairs on a person's head
- The size of raindrops

Correct answer: The size of raindrops

The gamma distribution is used to model the time until an event occurs, which can be applied to the size of raindrops (by modelling the time until the drop falls).

Heights are the classical example of normally distributed data, so they are not described by a gamma distribution. Rolling a fair die results in a discrete uniform distribution. Finally, the number of hairs on a person's head is unrelated to a Poisson process, and thus is not described by a gamma distribution (and probably not by any particularly nice distribution in general).

2. Suppose X follows a gamma distribution where $\theta = k = 1$. What is $P(X > 0.5)$?

In [None]:
# scale is 1 by default
r4(1 - stats.gamma(1, scale=1).cdf(0.5))

## Normal

1. The ages of the members of an organization containing 10,000 people are normally distributed with mean 27 and standard deviation 7. Approximately how many members of the organization are teenagers (people older than 13 and not yet 20 years old)?

The 68-95-99.7 rule gives us the following approximations:

In [None]:
n_dist = stats.norm(loc=27, scale=7)

In [None]:
def ci(dist: stats.rv_continuous, percentage: float):
    dx = (1 - percentage/100) / 2
    return dist.ppf((dx, 1-dx))

In [None]:
ci(n_dist, 68)

In [None]:
ci(n_dist, 95)

In [None]:
ci(n_dist, 99.7)

- 68% of the members are between the ages of 20 and 34
- 95% are between 13 and 41
- 99.7% are between 6 and 48

Also, the symmetry of the normal distribution tells us that about half of the data is above the mean, and half below. This symmetry extends to the approximations given above. That is,

- 34% of the members are between the ages of 20 and 27, and 34% are between the ages of 27 and 34
- 47.5% are between 13 and 27, 47.5% are between 27 and 41
- 49.85% are between 6 and 27, 49.85% are between 27 and 48


We want to know approximately how many are between 13 and 20, which would be those who are between 13 and 27, but not between 20 and 27. That is, $47.5 - 34 =13.5\%$ of the members.

13.5% of 10000 is 1350 members.

In [None]:
int(10000 * (n_dist.cdf(20) - n_dist.cdf(13)))

2. A stock portfolio averages a 15% return with a 30% standard deviation. What is the approximate probability that the portfolio loses money in a given year, assuming the returns are normally distributed?

The mean of this distribution is 0.15, and the standard deviation is 0.30. The portfolio will lose money if the return is negative, which corresponds to a z-score of -0.5

In [None]:
mu = 0.15
s = 0.3
z = (0 - mu)/s

In [None]:
r4(stats.norm.cdf(z))

In [None]:
n_dist = stats.norm(mu, s)

In [None]:
x = np.linspace(n_dist.ppf(0.01), n_dist.ppf(0.99), 100)

In [None]:
plt.plot(
    x,
    n_dist.pdf(x)
);
plt.ylabel("PDF at $x$")
plt.xlabel("$x$");
plt.title(rf'$N(\mu={mu}, \sigma={s})$');

In [None]:
r4(n_dist.cdf(0))

so there is approximately a 31% chance of the portfolio losing money.

3. A tail risk is defined as an investment that moves more than three standard deviations from the mean of a normal distribution of investment returns. What is the probability a tail risk occurs?

Note: In this case, tail risk also includes a good outcome.

By definition, a tail risk occurs when the z-score is not between -3 and 3. By the empirical rule, roughly 99.7% of the data is between z=−3 and z=3, so roughly 0.3% of the data falls above a z-score of 3 or below a z-score of -3.

In [None]:
ci(stats.norm(), 99.7)
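
The interval above confirms that 99.7% of the mass lies within roughly three standard deviations. The tail probability itself can be computed directly, as in this short sketch:

```python
from scipy import stats

z = stats.norm()  # standard normal

# probability of landing more than 3 standard deviations from the mean, on either side
p_tail = z.cdf(-3) + z.sf(3)

print(p_tail)
```

The result is about 0.27%, slightly below the 0.3% given by the empirical rule.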

4. The SAT is designed to have approximately normally distributed scores, with a mean of 500 and a standard deviation of 100. What (approximate) score is necessary to be at the 70th percentile of scorers?

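For a score to be at the 70th percentile, 70% of the scores must be smaller than it. The z-score with 70% of the standard normal distribution below it is approximately 0.524, so the required score is

$$
\mu + 0.524 \sigma = 500 + 0.524(100) \approx 552.4
$$

which rounds up to a score of 553.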

In [None]:
z_score = stats.norm().ppf(0.7)

In [None]:
mu = 500
s = 100
np.ceil(stats.norm(mu, s).ppf(0.7))

In [None]:
np.ceil(mu + z_score * s)

5. A portfolio consists of 9 independent stocks, each of which is normally distributed with an average return of &pound;0.15 and a standard deviation of &pound;0.40. What is the average return and standard deviation of the entire portfolio?

Note: When combining normal distributions, means and variances are additive.

There are 9 stocks, each with a mean return of &pound;0.15, so the mean return of the entire portfolio is

In [None]:
n_stocks = 9
mean_return = 0.15
total_return = n_stocks * mean_return
r2(total_return)

Each stock's return has a standard deviation of &pound;0.40, so the variance of each return is $0.4^2 = 0.16$. Variances of independent returns add, so the variance of the portfolio is:

In [None]:
return_std = 0.4
total_variance = n_stocks * return_std ** 2
r2(total_variance)

and the standard deviation of the portfolio is therefore

In [None]:
r2(np.sqrt(total_variance))

6. A stock currently sells for &pound;30 per share, and the historical daily returns are known to be normally distributed with a mean of 3% and a standard deviation of 1.5%. An investor will exercise a call option on this stock if its price rises above &pound;31. What is the approximate probability that this occurs the next day?

To increase from &pound;30 to &pound;31 requires at least a 3.33% increase since $\frac{31-30}{30}=0.03\bar{3}$

Thus we want the probability that a normal variable with mean 3 and standard deviation 1.5 will be greater than 3.33.

This is often done by converting to a z-score:

In [None]:
z_score = (3.33 - 3) / 1.5
# greater than
r2(1 - stats.norm.cdf(z_score))

7. An investor wishes to invest &pound;750. He can either invest all &pound;750 into a single stock that has a mean return of 3% and a standard deviation of x%, or he can invest &pound;30 into each of 25 separate stocks that have a mean return of 3% and a standard deviation of 1.5%. For what value of x are these two options equivalent?

Assume that the stock returns are independent.

The answer is 0.3%. Let's see why.


When adding together random variables, means and variances are additive. This can be written as:

$$
E(X+Y)=E(X)+E(Y)
$$

$$
\sigma^{2}_{X+Y} = \sigma^{2}_{X}+\sigma^{2}_{Y}
$$

In this scenario, we are adding together the distributions of 25 stocks, and each one has a mean of 3 and a standard deviation of 1.5. So the variance of each stock is $1.5^{2}=2.25$

$$
E(X_1 + X_2 + \dots + X_{25}) = 3 \cdot 25 = 75
$$

$$
\sigma^{2}_{X_1 + X_2 + \dots + X_{25}} = 2.25 \cdot 25 = 56.25
$$

$$
\sigma_{X_1 + X_2 + \dots + X_{25}} = \sqrt{56.25} = 7.5
$$

So the portfolio of 25 stocks follows the distribution of $N(75\%, 7.5\%)$.

Hmmm, but that doesn't seem to make sense. The stocks now have a mean return of 75%!


This is a special case where the mean and standard deviation are percentage returns, not a total value. Let's say we have $n$ stocks that each have a return of 3%. If we add them together, the return of the total investment is still 3%; we just have a bigger investment.

Each stock is $\frac{1}{25}$ of the portfolio, so if we divide both the mean and the standard deviation by 25, we'll have the correct percentage return.

$$
N(\frac{75\%}{25}, \frac{7.5\%}{25}) = N(3\%, 0.3\%)
$$

So if the single &pound;750 stock has a mean return of 3% and a standard deviation of 0.3%, the portfolios are equivalent.
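
The result can be sanity-checked by treating the portfolio return as the average of 25 independent returns. The sketch below computes the standard deviation directly and with a small Monte Carlo simulation; the simulation assumes independent normal returns, as the question does, and the seed and sample size are arbitrary choices.

```python
import numpy as np

n_stocks = 25
mean_pct, sd_pct = 3.0, 1.5  # per-stock return, in percent

# equal weights: the portfolio return is the average of the 25 independent returns
portfolio_mean = mean_pct
portfolio_sd = sd_pct / np.sqrt(n_stocks)
print(portfolio_mean, portfolio_sd)

# Monte Carlo check, assuming independent normal returns
rng = np.random.default_rng(0)
returns = rng.normal(mean_pct, sd_pct, size=(100_000, n_stocks))
simulated = returns.mean(axis=1)
print(simulated.mean(), simulated.std())
```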

8. An investor is choosing between two stocks that are both currently trading at the same share price. The daily returns of the first stock are historically normally distributed with a mean of 3% and a standard deviation of 1.5%. The daily returns of the second stock are historically normally distributed with a mean of 4% and a standard deviation of 2%. What is the approximate probability that, after one day, the second stock outperforms the first stock?

The return from the first stock minus the return from the second stock is normally distributed with mean -1% and standard deviation $\sqrt{1.5^2 + 2^2} = 2.5\%$

The probability that the second stock outperforms the first is the probability that this random variable is negative:

In [None]:
r2(stats.norm(-1, 2.5).cdf(0))

9. An investor wishes to invest &pound;700.

There are two independent stocks the investor can choose to invest in, both of which are currently trading at the same share price. The daily returns of the first stock are historically normally distributed with a mean of 3% and a standard deviation of 1.5%. The daily returns of the second stock are historically normally distributed with a mean of 4% and a standard deviation of 2%.

How much should the investor choose to invest (in pounds) in the first stock to maximize the probability of having a positive profit over the course of a day?

Hint: Find the equation for the z-score of the combined portfolio. Then, find its maximum.

Suppose the investor puts x pounds into the first stock and y pounds into the second stock. The combined mean is $.03 x + .04 y$ and the combined variance is $(.015 x)^2 + (.02 y)^2$. This gives a z-score formula of

$$
\frac{.03 x + .04 y}{\sqrt{(.015 x)^2 + (.02 y)^2}}
$$

Next we turn it into a function of one variable by writing y in terms of x:

$$
\frac{.03 x + .04 (700 - x)}{\sqrt{(.015 x)^2 + (.02 (700 - x))^2}}
$$

In [None]:
def fx(x):
    n_z = .03 * x + .04 * (700 - x)
    d_z = (0.015 * x) ** 2 + (0.02 * (700 - x)) ** 2
    return n_z / np.sqrt(d_z)

In [None]:
x_vals = np.arange(0, 701)
# fx is built from vectorised NumPy operations, so it can be applied to the whole array
y_vals = fx(x_vals)

In [None]:
p = sns.lineplot(
    x = x_vals,
    y = y_vals
);
p.set(
    ylim=(0, 4)
);

It looks like 400, but to be more precise we can use `scipy.optimize.minimize`.

In [None]:
from scipy.optimize import minimize

In [None]:
# we want the maximum, so take the negative
# 0 is our initial guess
res = minimize(lambda x: -fx(x), 0)

In [None]:
r2(res.x.item())

The investor should put &pound;400 into the first stock and &pound;300 into the second stock to maximize the probability of profit.
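
Since the amount invested in the first stock must lie between 0 and 700, a bounded one-dimensional optimiser is a natural alternative. The sketch below uses `scipy.optimize.minimize_scalar` with `method="bounded"`; the function name `z_score` and its `total` argument are illustrative, and the formula is the z-score derived above.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def z_score(x: float, total: float = 700.0) -> float:
    """z-score of the portfolio with x pounds in stock 1 and the rest in stock 2."""
    y = total - x
    mean = 0.03 * x + 0.04 * y
    sd = np.sqrt((0.015 * x) ** 2 + (0.02 * y) ** 2)
    return mean / sd


# maximise the z-score by minimising its negative over the feasible range [0, 700]
res = minimize_scalar(lambda x: -z_score(x), bounds=(0, 700), method="bounded")
best_x = res.x
print(best_x, z_score(best_x))
```

The maximum occurs where the two positions contribute equal risk, that is, where $0.015x = 0.02(700 - x)$, which gives $x = 400$.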

10. An investor buys an equal number of shares in two stocks whose returns are both normally distributed with mean 3% and standard deviation 1.5%. What is the approximate probability that the investor makes a profit?

Assume that the stock prices are equal, and that the stocks have independent return distributions.

In [None]:
mu = 3
sd = 1.5

n_dist = stats.norm(2 * mu, np.sqrt(2 * sd**2))

The sum of the returns from the two stocks is normally distributed with mean 6% and standard deviation 2.12%. The probability that this return is positive is nearly 100%.

In [None]:
n_dist.mean()

In [None]:
n_dist.std()

In [None]:
r4(1 - n_dist.cdf(0))

## Lognormal

The log-normal distribution is what you might expect from the name: it describes a random variable whose logarithm is normally distributed. In this section, we'll see exactly how they're related when it comes to computations.

The log-normal distribution is described by two parameters $\mu$ and $\sigma$, the mean and standard deviation (respectively) of the distribution's logarithm. Thus a log-normal variable X can be written as

$$
X = e^{\mu + \sigma Z}
$$

where Z is a standard normal variable (with mean 0 and standard deviation 1).

From the mean and standard deviation of $\ln X$, we can calculate the expected value and variance of X:

$$
\begin{aligned} E[X] &= e^{\mu + \frac{\sigma^2}{2} } \\ \text{Var}[X] &= ( e^{\sigma^2}-1 )e^{2\mu + \sigma^2} \end{aligned}
$$

Note that we can also write $\text{Var}[X] = (e^{\sigma^2}-1 )\left( E[X] \right)^2$
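
These formulas can be checked against scipy. Note that `stats.lognorm` takes $\sigma$ as its shape parameter and $e^{\mu}$ as `scale`; the parameter values below are illustrative.

```python
import numpy as np
from scipy import stats

mu, sigma = 1.0, 2.0  # mean and standard deviation of ln X
# scipy's lognorm takes sigma as the shape parameter and exp(mu) as the scale
dist = stats.lognorm(sigma, scale=np.exp(mu))

mean_formula = np.exp(mu + sigma ** 2 / 2)
var_formula = (np.exp(sigma ** 2) - 1) * np.exp(2 * mu + sigma ** 2)

print(dist.mean(), mean_formula)
print(dist.var(), var_formula)
```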

The log-normal distribution is helpful in modeling stock prices.

It appears in a wide variety of natural phenomena because of its relationship with the normal distribution, and has even been used to model the length of chess games.


### Examples

1. Which of the following situations is described well by a log-normal distribution?

- The time between two flashes of lightning during a storm
- The income level of a random group
- The number of hairs on a person's head
- The heights of students in a school

Correct answer: The income level of a random group

Incomes are classically log-normally distributed.

Heights are the classical example of normally distributed data, so they are not log-normally distributed. The time between two flashes of lightning describes the time between events in a Poisson process, as the flashes of lightning are well-described by a Poisson distribution. Thus the time between flashes of lightning is exponentially distributed. Finally, the number of hairs on a person's head is unrelated to a normal distribution, and thus is not described by a log-normal distribution (and probably not by any particularly nice distribution in general).

2. If X is a variable such that $\ln X$ is normally distributed with mean 1 and standard deviation 2, what is the mean of X?

In [None]:
mu = 1
sigma = 2
l_dist = stats.lognorm(sigma, scale=np.exp(mu))
r2(l_dist.mean())

3. If X is a variable such that $\ln X$ is normally distributed with mean 1 and standard deviation 2, what is the variance of X?

In [None]:
r2(l_dist.var())

In [None]:
x = np.linspace(l_dist.ppf(0.00), l_dist.ppf(0.99), 100)
plt.plot(
    x, l_dist.pdf(x)
);

In [None]:
def plot_dist_pdf(dist, lower=0.01, upper=0.99, num_points=100, xlabel=None):
    x = np.linspace(dist.ppf(lower), dist.ppf(upper), num_points)
    plt.plot(
        x, dist.pdf(x)
    );
    if xlabel:
        plt.xlabel(xlabel);
    

In [None]:
plot_dist_pdf(l_dist, lower=0, upper=0.8)

In [None]:
r2(l_dist.median())

In [None]:
l_samples = np.log(l_dist.rvs(1000))

In [None]:
print(f'Mean: {l_samples.mean():0.2f}, Standard Deviation: {l_samples.std():0.2f}')

4. A stock price is currently &pound;50, and the factor by which the price is multiplied after a year follows a log-normal distribution with $\mu = 0.1,\space \sigma = 0.3$. What is the (approximate) probability that the stock price will be below &pound;50 a year later?

In [None]:
u = 0.1
s = 0.3
l_dist = stats.lognorm(s, scale=np.exp(u))

In [None]:
plot_dist_pdf(l_dist, lower=0, xlabel='price multiplication factor')

In [None]:
# it has to be less than 1 to reduce the price
r2(l_dist.cdf(1))

Since the factor is log-normal, we can express it as $e^{0.1+0.3 Z}$, where $Z$ is a standard normal variable.

Then if the stock price one year later is S, we have

$$
S = 50 \cdot e^{0.1+0.3 Z}
$$

and we need to find $P(S < 50)$. This is equivalent to finding $P(e^{0.1+0.3 Z} < 1)$ since we need the multiplying factor to *decrease* the price.

Taking logs, this is in turn equivalent to finding $P(0.1 + 0.3 Z < 0)$ or $P(Z < -1/3)$

In [None]:
r2(stats.norm().cdf(-1/3))

so the probability that the stock falls below &pound;50 is about 37%.

5. A stock price is currently &pound;50, and the factor by which the price is multiplied after a year follows a log-normal distribution with $\mu = 0.1,\space \sigma = 0.3$. What is the (approximate) expected value of the stock after one year?

In [None]:
r2(50 * l_dist.mean())

6. A stock price is currently &pound;50, and the factor by which the price is multiplied after a year follows a log-normal distribution with $\mu = 0.1,\space \sigma = 0.3$. What is the (approximate) expected value of the stock after 25 years?

We found in the last question that the expected multiplying factor for one year later is

$$
e^{0.1+\frac{0.3^2}{2}}
$$

To find the expected price after 25 years, we would multiply by this factor 25 times, or simply multiply the current price by

$$
(e^{0.1+\frac{0.3^2}{2}})^{25}
$$

In [None]:
r2(50 * l_dist.mean() ** 25)

7. A stock price is currently &pound;50, and the factor by which the price is multiplied after a year follows a log-normal distribution with $\mu = 0.1,\space \sigma = 0.3$. After 25 years, what is the probability that the stock price is at least &pound;610?

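If the stock price 25 years later is S, we have

$$
S = 50 \cdot (e^{0.1+0.3 \cdot Z})^{25}
$$

We need to find $P(S \ge 610)$, or equivalently $P\left((e^{0.1+0.3 Z})^{25} \ge 12.2\right)$. Taking logs gives $25(0.1 + 0.3 Z) \ge \ln 12.2 \approx 2.5014$, that is, $Z \ge 0.0002$ (approximately).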

In [None]:
# so z needs to be greater than 0.0002
z = 0.0002
r2(1-stats.norm().cdf(z))

Since the standard normal distribution is symmetrical about 0, we know $P(Z\geq 0) = P(Z \leq 0) = 0.5$.

So we can conclude that the probability is about 50%
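
The threshold on $Z$ can be derived in code rather than typed in by hand. The sketch below follows the formula for $S$ exactly as written above (the one-year factor raised to the 25th power); the variable names are illustrative.

```python
import math

from scipy import stats

price, target, years = 50, 610, 25
mu, sigma = 0.1, 0.3

# S >= target  <=>  years * (mu + sigma * Z) >= ln(target / price)
z_threshold = (math.log(target / price) / years - mu) / sigma
probability = stats.norm().sf(z_threshold)  # P(Z >= z_threshold)

print(z_threshold, probability)
```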