**(?)** Why does bagging leave an out-of-bag ratio of about $37\%$? Where does the number $37$ come from?

Before trying to answer this mathematically, let's try to verify it programmatically.

In [1]:
import random

### Sampling With Replacement

`random.choices` and repeated calls to `random.choice` both sample with replacement.

In [8]:
random.choices(range(10), k=10)

[3, 4, 4, 3, 7, 6, 2, 7, 2, 8]

In [13]:
[random.choice(range(10)) for _ in range(10)]

[0, 8, 8, 8, 1, 0, 6, 9, 4, 0]
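
For contrast, here is a small sketch of what "with replacement" changes. It assumes `random.sample` as the without-replacement counterpart (it is not used elsewhere in this notebook); the variable names are only illustrative.

```python
import random

population = range(10)

# With replacement: duplicates are possible, so fewer than 10 distinct values is typical.
with_replacement = random.choices(population, k=10)

# Without replacement: random.sample returns 10 distinct values, i.e. a permutation.
without_replacement = random.sample(population, k=10)

print(len(set(with_replacement)), len(set(without_replacement)))
```

The second count is always 10, while the first varies from run to run.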

In [12]:
n_integers = 10_000
S = range(n_integers)
for _ in range(20):
    bagging = set(random.choices(S, k=n_integers))
    print(f"len(bagging) / len(S) = {len(bagging) / len(S):.2f}")

len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.64
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.64
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.64
len(bagging) / len(S) = 0.63


This suggests a way to formulate the question.  
Take a concrete example: an ice cream shop has $n$ flavours and a weird rule for
ordering ice cream: each customer must order $n$ scoops, each scoop being chosen
uniformly at random with replacement.

Result: Each customer gets around $0.63 n$ distinct flavours. This result gets more and more precise as $n$ increases.

So far, my only idea is to let $X$ be the random variable counting the final number of distinct flavours,
and to compute the expected value of $X$ as
$$
  E(X) = \sum_{k=1}^{n} k p_{k}\,,
$$
where $p_{k}$ denotes the probability of obtaining exactly $k$ flavours.
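
For small $n$ this sum can be evaluated exactly. The sketch below is one way to do it, assuming all $n^n$ ordered draw sequences are equally likely; the counting of sequences that use exactly $k$ flavours (by inclusion-exclusion) and the helper names `distinct_count_pmf` and `expected_distinct` are illustrative choices, not something from the cells above.

```python
from fractions import Fraction
from math import comb

def distinct_count_pmf(n):
    """Return {k: P(X = k)} for n uniform draws with replacement from n flavours."""
    total = n ** n  # number of equally likely ordered draw sequences
    pmf = {}
    for k in range(1, n + 1):
        # Sequences of n draws that use every one of k fixed flavours (inclusion-exclusion).
        onto = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
        # Multiply by the number of ways to pick which k flavours appear.
        pmf[k] = Fraction(comb(n, k) * onto, total)
    return pmf

def expected_distinct(n):
    """E(X) as the sum of k * p_k, computed exactly with fractions."""
    return sum(k * p for k, p in distinct_count_pmf(n).items())

for n in (10, 100):
    print(f"n = {n:>3}: E(X) / n = {float(expected_distinct(n) / n):.4f}")
```

This only checks the mean numerically; it does not by itself explain why single runs land so close to it.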

However, even if we later switch to statistics and say that the confidence interval for the mean of $X$ lies tightly around $0.63n$, the whole story still does not seem convincing to me, because
> what we have seen in the cell above is the value of $X$ itself being about $0.63n$, not its mean.

**(?)** There must be a better formulation for this $0.63n$ phenomenon.
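
One possible formulation, offered as a sketch rather than a definitive answer: write $X$ as a sum of indicators. Let $I_i = 1$ if flavour $i$ is chosen at least once among the $n$ draws, and $I_i = 0$ otherwise (the symbols $I_i$ and $\varepsilon$ are introduced here). A single draw misses flavour $i$ with probability $1 - 1/n$, and the draws are independent, so by linearity of expectation

$$
  E(X) = \sum_{i=1}^{n} P(I_i = 1) = n\left(1 - \left(1 - \frac{1}{n}\right)^{n}\right)\,,
  \qquad
  \frac{E(X)}{n} \xrightarrow[n \to \infty]{} 1 - e^{-1} \approx 0.632\,.
$$

The out-of-bag fraction is the complement, $(1 - 1/n)^n \to e^{-1} \approx 0.368$, which would be where the $37\%$ comes from.

As for $X$ itself rather than its mean: the indicators are negatively correlated, because $P(I_i = 0,\, I_j = 0) = (1 - 2/n)^n \le (1 - 1/n)^{2n}$ for $i \ne j$. Hence $\operatorname{Var}(X) \le \sum_i \operatorname{Var}(I_i) \le n/4$, and Chebyshev's inequality gives, for any $\varepsilon > 0$,

$$
  P\left(\left|\frac{X}{n} - \frac{E(X)}{n}\right| \ge \varepsilon\right) \le \frac{\operatorname{Var}(X)}{n^{2}\varepsilon^{2}} \le \frac{1}{4 n \varepsilon^{2}}\,.
$$

So the ratio $X/n$ of a single run concentrates around $1 - e^{-1}$ as $n$ grows, which is consistent with every run at $n = 10\,000$ printing `0.63` or `0.64`.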

Let's modify the number of samples (`k`):

In [15]:
n_integers = 10_000
S = range(n_integers)
for _ in range(20):
    bagging = set(random.choices(S, k=n_integers * 2))
    print(f"len(bagging) / len(S) = {len(bagging) / len(S):.2f}")

len(bagging) / len(S) = 0.87
len(bagging) / len(S) = 0.86
len(bagging) / len(S) = 0.86
len(bagging) / len(S) = 0.86
len(bagging) / len(S) = 0.87
len(bagging) / len(S) = 0.87
len(bagging) / len(S) = 0.87
len(bagging) / len(S) = 0.87
len(bagging) / len(S) = 0.87
len(bagging) / len(S) = 0.86
len(bagging) / len(S) = 0.87
len(bagging) / len(S) = 0.86
len(bagging) / len(S) = 0.86
len(bagging) / len(S) = 0.86
len(bagging) / len(S) = 0.87
len(bagging) / len(S) = 0.87
len(bagging) / len(S) = 0.87
len(bagging) / len(S) = 0.87
len(bagging) / len(S) = 0.86
len(bagging) / len(S) = 0.86


In [16]:
n_integers = 10_000
S = range(n_integers)
for _ in range(20):
    bagging = set(random.choices(S, k=n_integers // 2))
    print(f"len(bagging) / len(S) = {len(bagging) / len(S):.2f}")

len(bagging) / len(S) = 0.40
len(bagging) / len(S) = 0.40
len(bagging) / len(S) = 0.39
len(bagging) / len(S) = 0.40
len(bagging) / len(S) = 0.39
len(bagging) / len(S) = 0.39
len(bagging) / len(S) = 0.40
len(bagging) / len(S) = 0.39
len(bagging) / len(S) = 0.39
len(bagging) / len(S) = 0.40
len(bagging) / len(S) = 0.39
len(bagging) / len(S) = 0.39
len(bagging) / len(S) = 0.40
len(bagging) / len(S) = 0.40
len(bagging) / len(S) = 0.39
len(bagging) / len(S) = 0.39
len(bagging) / len(S) = 0.40
len(bagging) / len(S) = 0.40
len(bagging) / len(S) = 0.40
len(bagging) / len(S) = 0.39
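
These ratios can be compared with a closed form. Assuming each draw misses a given item with probability $1 - 1/n$, independently of the other draws, the expected fraction of items seen after $k$ draws is $1 - (1 - 1/n)^k$, which tends to $1 - e^{-c}$ when $k = cn$. The sketch below evaluates both; the function name `expected_distinct_fraction` is hypothetical.

```python
import math

def expected_distinct_fraction(n, k):
    """Expected fraction of n items seen at least once in k uniform draws with replacement."""
    return 1 - (1 - 1 / n) ** k

n = 10_000
for c in (0.5, 1, 2):
    k = int(c * n)
    exact = expected_distinct_fraction(n, k)
    limit = 1 - math.exp(-c)  # large-n limit when k = c * n
    print(f"k = {c} * n: exact = {exact:.4f}, limit = {limit:.4f}")
```

The three values should sit close to the simulated ratios for `k = n // 2`, `k = n`, and `k = 2 * n` above.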


Now let's decrease `n_integers`. The larger `n_integers` is, the more tightly the results cluster around `0.63`; for small values the ratio fluctuates noticeably from run to run.

In [17]:
n_integers = 10
S = range(n_integers)
for _ in range(20):
    bagging = set(random.choices(S, k=n_integers))
    print(f"len(bagging) / len(S) = {len(bagging) / len(S):.2f}")

len(bagging) / len(S) = 0.50
len(bagging) / len(S) = 0.60
len(bagging) / len(S) = 0.60
len(bagging) / len(S) = 0.60
len(bagging) / len(S) = 0.70
len(bagging) / len(S) = 0.60
len(bagging) / len(S) = 0.80
len(bagging) / len(S) = 0.60
len(bagging) / len(S) = 0.70
len(bagging) / len(S) = 0.80
len(bagging) / len(S) = 0.70
len(bagging) / len(S) = 0.50
len(bagging) / len(S) = 0.70
len(bagging) / len(S) = 0.40
len(bagging) / len(S) = 0.70
len(bagging) / len(S) = 0.70
len(bagging) / len(S) = 0.60
len(bagging) / len(S) = 0.70
len(bagging) / len(S) = 0.60
len(bagging) / len(S) = 0.60


In [18]:
n_integers = 100
S = range(n_integers)
for _ in range(20):
    bagging = set(random.choices(S, k=n_integers))
    print(f"len(bagging) / len(S) = {len(bagging) / len(S):.2f}")

len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.68
len(bagging) / len(S) = 0.59
len(bagging) / len(S) = 0.67
len(bagging) / len(S) = 0.61
len(bagging) / len(S) = 0.58
len(bagging) / len(S) = 0.64
len(bagging) / len(S) = 0.68
len(bagging) / len(S) = 0.66
len(bagging) / len(S) = 0.65
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.61
len(bagging) / len(S) = 0.64
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.62
len(bagging) / len(S) = 0.62
len(bagging) / len(S) = 0.64
len(bagging) / len(S) = 0.68
len(bagging) / len(S) = 0.58


In [19]:
n_integers = 1000
S = range(n_integers)
for _ in range(20):
    bagging = set(random.choices(S, k=n_integers))
    print(f"len(bagging) / len(S) = {len(bagging) / len(S):.2f}")

len(bagging) / len(S) = 0.65
len(bagging) / len(S) = 0.62
len(bagging) / len(S) = 0.64
len(bagging) / len(S) = 0.64
len(bagging) / len(S) = 0.62
len(bagging) / len(S) = 0.61
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.61
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.65
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.64
len(bagging) / len(S) = 0.63
len(bagging) / len(S) = 0.62
len(bagging) / len(S) = 0.64
len(bagging) / len(S) = 0.62
len(bagging) / len(S) = 0.62
len(bagging) / len(S) = 0.62
len(bagging) / len(S) = 0.63
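
To quantify how the scatter shrinks with `n_integers`, one could measure the standard deviation of the ratio over many runs. The following is a minimal sketch; the helper `distinct_ratios`, the seed, and the choice of 200 trials are arbitrary assumptions made for illustration.

```python
import random
import statistics

def distinct_ratios(n, trials, rng):
    """Simulate `trials` samples of size n with replacement and return the distinct-item ratios."""
    population = range(n)
    return [len(set(rng.choices(population, k=n))) / n for _ in range(trials)]

rng = random.Random(0)  # fixed seed so the run is reproducible
spreads = {}
for n in (10, 100, 1_000, 10_000):
    ratios = distinct_ratios(n, trials=200, rng=rng)
    spreads[n] = statistics.pstdev(ratios)
    print(f"n = {n:>6}: mean = {statistics.mean(ratios):.3f}, std = {spreads[n]:.4f}")
```

If the spread falls roughly by a factor of $\sqrt{10}$ each time $n$ grows tenfold, that would match a standard deviation of order $1/\sqrt{n}$ for the ratio.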
