# Introduction: Poisson Process and Poisson Distribution

In this notebook, we'll look at a Poisson process and model both the probability of a given number of events in an interval and the waiting time between events. Poisson processes occur frequently in real life (or many phenomena can be approximated by a Poisson process) and provide a relatively simple distribution to explore.

# Poisson Process: Observing shooting stars

We'll work through the following Poisson process:

The average time between shooting stars is 12 minutes (5 meteors per hour).

In [1]:
# Standard data science
import pandas as pd
import numpy as np

np.random.seed(42)

# Display all cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# Visualizations
from chart_studio import plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot

# Cufflinks for dataframes
import cufflinks as cf
cf.go_offline()
cf.set_config_file(world_readable=True, theme='pearl')

In [2]:
from scipy.special import factorial

The only parameter of the Poisson distribution is $\lambda$, the rate parameter. It represents the expected number of events in an interval. If we have a rate in events per unit time, we get the expected number of events by multiplying the rate by the length of the interval.

We find the probability of a number of events, $k$, in an interval using the Poisson probability mass function (the count of events is discrete, so this is a mass function rather than a density):

$$P(k{\text{ events in interval}})=e^{-\lambda }{\frac {\lambda ^{k}}{k!}}$$

Let's work through an example.

### Poisson Probabilities

In [3]:
events_per_minute = 1/12
minutes = 60

# Expected events
lam = events_per_minute * minutes

k = 3
p_k = np.exp(-lam) * np.power(lam, k) / factorial(k)
print(f'The probability of {k} meteors in {minutes} minutes is {100*p_k:.2f}%.')

The probability of 3 meteors in 60 minutes is 14.04%.


We can do the same calculation by simulating 10,000 hours.

In [4]:
x = np.random.poisson(lam, 10000)
(x == 3).mean()

0.1377
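
The simulated share (0.1377) differs slightly from the exact value (0.1404) because 10,000 draws is a finite sample. As a rough check, and assuming the simulated hours are independent, the standard error of a simulated proportion is $\sqrt{p(1-p)/N}$. The sketch below uses only the standard library; `poisson_pmf` is an illustrative helper, not a function defined in this notebook.

```python
import math

def poisson_pmf(k, lam):
    """Exact Poisson probability of k events when lam events are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 5             # expected meteors in one hour (1/12 per minute * 60 minutes)
n_sim = 10_000      # number of simulated hours, as in the cell above
simulated = 0.1377  # proportion reported by the simulation above

p_exact = poisson_pmf(3, lam)
std_error = math.sqrt(p_exact * (1 - p_exact) / n_sim)
z = (simulated - p_exact) / std_error

print(f'exact = {p_exact:.4f}, standard error = {std_error:.4f}, z = {z:.2f}')
```

The simulated value lies within one standard error of the exact probability, which is what we would expect from sampling noise alone.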

Let's write a quick function to calculate probabilities for a number of events. 

In [5]:
def calc_prob(events_per_minute, minutes, k):
    # Calculate probability of k events in specified number of minutes
    lam = events_per_minute * minutes
    return np.exp(-lam) * np.power(lam, k) / factorial(k)
    
calc_prob(events_per_minute, minutes, 3)

0.14037389581428056

We can use this function to generate a distribution of probabilities for different numbers of events.

In [6]:
# Different numbers
ns = np.arange(12)
p_n = calc_prob(events_per_minute, minutes, ns)

print(f'The most likely value is {np.argmax(p_n)} with probability {np.max(p_n):.4f}')

The most likely value is 4 with probability 0.1755


In this situation, 4 and 5 events have exactly the same probability, because our rate parameter is a whole number.

In [7]:
p_n[4:6]

array([0.17546737, 0.17546737])
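
The tie follows from the ratio of consecutive probabilities in the formula above:

$$\frac{P(k)}{P(k-1)} = \frac{e^{-\lambda}\lambda^{k}/k!}{e^{-\lambda}\lambda^{k-1}/(k-1)!} = \frac{\lambda}{k}$$

The probabilities rise while $k < \lambda$ and fall once $k > \lambda$. The ratio equals 1 at $k = \lambda$, so with $\lambda = 5$ we get $P(5) = P(4)$. This is consistent with `np.argmax` reporting 4, since it returns the first position of the maximum.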

## Poisson Distribution

To show the distribution, we plot the probability on the y-axis versus the number of events on the x-axis. This is the probability mass function of the Poisson distribution.

In [8]:
def plot_pdf(x, p_x, title=''):
    # Plot PDF of Poisson distribution
    df = pd.DataFrame({'x': x, 'y': p_x})
    print(f'The most likely value is {np.argmax(p_x)} with probability {np.max(p_x):.4f}')
    annotations = [dict(x=x, y=y+0.01, text=f'{y:.2f}', 
                        showarrow=False, textangle=0) for x, y in zip(df['x'], df['y'])]
    df.iplot(kind='scatter', mode='markers+lines',
             x='x', y='y', xTitle='Number of Events',
             yTitle='Probability', annotations=annotations,
             title=title)

In [9]:
plot_pdf(ns, p_n, title='Probability of Number of Meteors in One Hour')

The most likely value is 4 with probability 0.1755


### Distribution with Differing Rates

Let's plot the probability mass function for different numbers of meteors per hour.

In [10]:
def plot_different_rates(events_per_minute, minutes, ns, title=''):
    df = pd.DataFrame()
    annotations=[]
    colors = ['orange', 'green', 'red', 'blue', 'purple', 'brown']
    for i, events in enumerate(events_per_minute):
        probs = calc_prob(events, minutes, ns)
        annotations.append(dict(x=np.argmax(probs)+1, y=np.max(probs)+0.025, 
                                text=f'{int(events * minutes)} MPH<br>Meteors = {np.argmax(probs) + 1}<br>P = {np.max(probs):.2f}',
                                color=colors[i],
                               showarrow=False, textangle=0))
        df[f'Meteors per Hour = {int(events * minutes)}'] = probs
    df.index = ns
    df.iplot(kind='scatter', mode='markers+lines', colors=colors, size=8, annotations=annotations,
             xTitle='Events', yTitle='Probability', title=title)
    return df

In [11]:
df = plot_different_rates(events_per_minute=np.array([1/5, 1/12, 1/10, 1/15, 1/20, 1/30]),
                          minutes=60,
                          ns=list(range(15)), 
                          title='Probability of Meteors in 1 Hour at Different Rates')

We can also keep the rate the same, but try different lengths of time.

In [12]:
def plot_different_times(events_per_minute, minutes, ns, title=''):
    df = pd.DataFrame()
    annotations = []
    colors = ['orange', 'green', 'red', 'blue', 'purple', 'brown']
    for i, minute in enumerate(minutes):
        probs =  calc_prob(events_per_minute, minute, ns)
        annotations.append(dict(x=np.argmax(probs), y=np.max(probs)+0.025, 
                                color=colors[i],
                                text=f'{minute} Minutes<br>Meteors = {np.argmax(probs)}<br>P = {np.max(probs):.2f}',
                               showarrow=False, textangle=0))
        df[f'Minutes = {minute}'] = probs
    df.index = ns
    df.iplot(kind='scatter', mode='markers+lines', colors=colors,
             size=8, annotations=annotations,
             xTitle='Events', yTitle='Probability', title=title)
    return df

In [13]:
df = plot_different_times(events_per_minute=1/12, minutes=np.array([30, 60, 90, 120]),
                         ns=list(range(15)), title='Probability of Meteors in Time Intervals')

## Simulation of Observations

We can use `np.random.poisson` to simulate 10,000 hours of observation and then make a histogram of observations. We expect to see a peak at 4 or 5 meteors since that is the most likely value.

In [14]:
def plot_hist(x, title='',summary=True):
    df = pd.DataFrame(x)
    df.iplot(kind='hist', xTitle='Events', 
             yTitle='Count', title=title)
    if summary:
        print(df.describe())

In [15]:
N = 10000
counts = np.random.poisson(lam, size=N)
plot_hist(counts, title=f'Distribution of Number of Meteors in 1 Hour Simulated {N} Times')

                  0
count  10000.000000
mean       4.996400
std        2.229638
min        0.000000
25%        3.000000
50%        5.000000
75%        6.000000
max       20.000000


In [16]:
counts = np.random.poisson(lam * 3, size=N)
plot_hist(counts, title=f'Distribution of Number of Meteors in 3 Hours Simulated {N} Times')

                  0
count  10000.000000
mean      15.003700
std        3.844498
min        4.000000
25%       12.000000
50%       15.000000
75%       17.000000
max       31.000000
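
Both summaries are consistent with a standard property of the Poisson distribution: its mean and variance both equal $\lambda$, so the standard deviation is $\sqrt{\lambda}$. The sketch below checks this from the probabilities themselves using only the standard library. The helper names and the truncation at 100 terms are choices made for this sketch, not part of the notebook.

```python
import math

def poisson_pmf(k, lam):
    """Exact Poisson probability of k events when lam events are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def mean_and_std(lam, k_max=100):
    """Mean and standard deviation computed from the first k_max probabilities."""
    ks = range(k_max)
    mean = sum(k * poisson_pmf(k, lam) for k in ks)
    var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
    return mean, math.sqrt(var)

for lam in (5, 15):  # one hour and three hours at 5 meteors per hour
    mean, std = mean_and_std(lam)
    print(f'lambda = {lam}: mean = {mean:.4f}, std = {std:.4f}, '
          f'sqrt(lambda) = {math.sqrt(lam):.4f}')
```

The simulated standard deviations above (2.23 for one hour and 3.84 for three hours) are close to $\sqrt{5} \approx 2.24$ and $\sqrt{15} \approx 3.87$.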


# Probability of Different Numbers of Events

Now let's take a look at the probability of seeing different numbers of meteors. We find the probability of seeing at most a given number of meteors by summing the individual probabilities up to that number; the probability of seeing more than that number is the complement.

In [17]:
def pr_less_than_or_equal(events_per_minute, minutes, n_query, quiet=False):
    # Sum P(k) for k = 0..n_query, normalized by the total over the first 100 counts
    p_n = calc_prob(events_per_minute, minutes, np.arange(100))
    p = p_n[:n_query+1].sum() / p_n.sum()
    if not quiet:
        print(f'{int(events_per_minute*60)} Meteors Per Hour. Probability of {n_query} or fewer meteors in {int(minutes/60)} hour: {100*p:.2f}%.')
    return p

def pr_greater_than(events_per_minute, minutes, n_query, quiet=False):
    # Pass quiet through so that quiet=True also silences the helper's print
    p = 1 - pr_less_than_or_equal(events_per_minute, minutes, n_query, quiet=quiet)
    if not quiet:
        print(f'{int(events_per_minute*60)} Meteors Per Hour. Probability of more than {n_query} meteors in {int(minutes/60)} hour: {100*p:.2f}%.')
    return p

# Compare floats with a tolerance rather than exact equality
assert np.isclose(pr_less_than_or_equal(events_per_minute, minutes, 10, True) + pr_greater_than(events_per_minute, minutes, 10, True), 1)


In [18]:
_ = pr_greater_than(events_per_minute=1/12, minutes=60, n_query=10)

5 Meteors Per Hour. Probability of 10 or fewer meteors in 1 hour: 98.63%.
5 Meteors Per Hour. Probability of more than 10 meteors in 1 hour: 1.37%.


In [19]:
_ = pr_greater_than(events_per_minute=1/12, minutes=60, n_query=4)

5 Meteors Per Hour. Probability of 4 or fewer meteors in 1 hour: 44.05%.
5 Meteors Per Hour. Probability of more than 4 meteors in 1 hour: 55.95%.


In [20]:
_ = pr_greater_than(events_per_minute=1/12, minutes=60, n_query=5)

5 Meteors Per Hour. Probability of 5 or fewer meteors in 1 hour: 61.60%.
5 Meteors Per Hour. Probability of more than 5 meteors in 1 hour: 38.40%.


In [21]:
_ = pr_greater_than(events_per_minute=1/12, minutes=120, n_query=3)

5 Meteors Per Hour. Probability of 3 or fewer meteors in 2 hour: 1.03%.
5 Meteors Per Hour. Probability of more than 3 meteors in 2 hour: 98.97%.
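
The cumulative probabilities printed above can be reproduced without NumPy by accumulating terms with the recurrence $P(k) = P(k-1)\,\lambda/k$, which avoids computing large factorials. This is a sketch for cross-checking; `poisson_cdf` is a hypothetical helper, not a function defined in this notebook.

```python
import math

def poisson_cdf(n, lam):
    """P(X <= n) for a Poisson variable with expected count lam."""
    term = math.exp(-lam)  # P(0)
    total = term
    for k in range(1, n + 1):
        term *= lam / k    # P(k) = P(k-1) * lam / k
        total += term
    return total

rate = 1 / 12  # meteors per minute
for minutes, n in [(60, 10), (60, 4), (60, 5), (120, 3)]:
    p = poisson_cdf(n, rate * minutes)
    print(f'{minutes} min: P(X <= {n}) = {100 * p:.2f}%, '
          f'P(X > {n}) = {100 * (1 - p):.2f}%')
```

Because the sum stops at `n`, no truncation or renormalization is needed here, unlike the 100-term sum used in the cell above.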


# Waiting Time

Next, let's look at the waiting time between events in a Poisson process. The probability of waiting longer than $t$ minutes for the next event decays exponentially with $t$:

$$P(T > t) = e^{-\text{events per minute} \times t}$$
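
This expression is the Poisson formula for zero events: waiting longer than $t$ minutes for the next meteor means that no meteor arrives during those $t$ minutes. Writing $r$ for events per minute (a symbol introduced here only for brevity), the expected count in that window is $\lambda = rt$, so

$$P(T > t) = P(0 \text{ events in } t \text{ minutes}) = e^{-rt}\frac{(rt)^{0}}{0!} = e^{-rt}$$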

In [22]:
def waiting_time_more_than(events_per_minute, t, quiet=False):
    p = np.exp(-events_per_minute * t)
    if not quiet:
        print(f'{int(events_per_minute*60)} Meteors per hour. Probability of waiting more than {t} minutes: {100*p:.2f}%.')
    return p
    
def waiting_time_less_than_or_equal(events_per_minute, t, quiet=False):
    p = 1 - waiting_time_more_than(events_per_minute, t, quiet=quiet)
    if not quiet:
        print(f'{int(events_per_minute*60)} Meteors per hour. Probability of waiting at most {t} minutes: {100*p:.2f}%.')
    return p

def waiting_time_between(events_per_minute, t1, t2):
    p1 = waiting_time_less_than_or_equal(events_per_minute, t1, True)
    p2 = waiting_time_less_than_or_equal(events_per_minute, t2, True)
    p = p2-p1
    print(f'Probability of waiting between {t1} and {t2} minutes: {100*p:.2f}%.')
    return p

# Compare floats with a tolerance rather than exact equality
assert np.isclose(waiting_time_more_than(events_per_minute, 15, True) + waiting_time_less_than_or_equal(events_per_minute, 15, True), 1)

In [23]:
_ = waiting_time_less_than_or_equal(events_per_minute, 12)

5 Meteors per hour. Probability of waiting more than 12 minutes: 36.79%.
5 Meteors per hour. Probability of waiting at most 12 minutes: 63.21%.


In [24]:
_ = waiting_time_less_than_or_equal(events_per_minute, 6)

5 Meteors per hour. Probability of waiting more than 6 minutes: 60.65%.
5 Meteors per hour. Probability of waiting at most 6 minutes: 39.35%.


In [25]:
_ = waiting_time_less_than_or_equal(events_per_minute, 30)

5 Meteors per hour. Probability of waiting more than 30 minutes: 8.21%.
5 Meteors per hour. Probability of waiting at most 30 minutes: 91.79%.


In [26]:
_ = waiting_time_less_than_or_equal(events_per_minute=1/2, t=5)

30 Meteors per hour. Probability of waiting more than 5 minutes: 8.21%.
30 Meteors per hour. Probability of waiting at most 5 minutes: 91.79%.


In [27]:
_ = waiting_time_between(events_per_minute, 5, 15)

Probability of waiting between 5 and 15 minutes: 37.27%.


In [28]:
_ = waiting_time_between(events_per_minute, 5, 30)

Probability of waiting between 5 and 30 minutes: 57.72%.
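
The interval probabilities can also be written directly as a difference of two exponentials, $P(t_1 < T \le t_2) = e^{-r t_1} - e^{-r t_2}$, where $r$ is events per minute. The following standard-library sketch reproduces the two results above; `prob_wait_between` is an illustrative name, not a function from this notebook.

```python
import math

def prob_wait_between(rate_per_minute, t1, t2):
    """P(t1 < T <= t2) for an exponentially distributed waiting time T."""
    return math.exp(-rate_per_minute * t1) - math.exp(-rate_per_minute * t2)

rate = 1 / 12  # meteors per minute
for t1, t2 in [(5, 15), (5, 30)]:
    p = prob_wait_between(rate, t1, t2)
    print(f'Probability of waiting between {t1} and {t2} minutes: {100 * p:.2f}%.')
```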


In [29]:
def plot_waiting_time(events_per_minute, ts, title=''):
    p_t = waiting_time_more_than(events_per_minute, ts, quiet=True)
    
    df = pd.DataFrame({'x': ts, 'y': p_t})
    df.iplot(kind='scatter', mode='markers+lines', size=8,
             x='x', y='y', xTitle='Waiting Time',
             yTitle='Probability', 
             title=title)
    
    return p_t

In [30]:
p_t = plot_waiting_time(events_per_minute, np.arange(100), title='Probability (T > t)')

## Average Waiting Time

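The average waiting time is simply 1 / events per minute, which is 12 minutes here. We can illustrate this by simulating 100,000 minutes of observation, treating each minute as a trial in which a meteor appears with probability 1/12.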

In [31]:
np.random.seed(42)

events = np.random.choice([0, 1], size = 100000, replace=True, 
                          p=[1-events_per_minute, events_per_minute])

success_times = np.where(events==1)[0]
waiting_times = np.diff(success_times)
waiting_times[:10]

array([10, 22,  1, 16,  2,  3, 14, 43, 22,  5])

In [32]:
np.mean(waiting_times)

12.229818982387476
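
The simulation above works in whole minutes, so every gap is a whole number of minutes. An alternative sketch, assuming continuous time, draws waiting times directly from an exponential distribution with the standard library's `random.expovariate`. The seed and sample size are arbitrary choices for this example.

```python
import random

random.seed(42)

rate = 1 / 12        # meteors per minute
n_samples = 100_000  # arbitrary sample size for this sketch

# Each draw is one waiting time, in minutes, until the next meteor
waits = [random.expovariate(rate) for _ in range(n_samples)]

mean_wait = sum(waits) / n_samples
share_over_12 = sum(w > 12 for w in waits) / n_samples

print(f'mean wait = {mean_wait:.2f} minutes')
print(f'share of waits longer than 12 minutes = {share_over_12:.3f}')
```

The sample mean should land close to 12 minutes, and the share of waits longer than 12 minutes should be close to the 36.79% computed earlier.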

In [33]:
def plot_hist_waiting_time(x, title=''):
    df = pd.DataFrame(x)
    df.iplot(kind='hist', xTitle='Waiting Time between Events', bins=(0, 100, 1),
             yTitle='Count', title=title)

In [34]:
plot_hist_waiting_time(waiting_times, title='Waiting Time Distribution')

In [35]:
avg = []
for i in range(10000):
    # Simulate 100,000 minutes and record the mean gap between meteors
    events = np.random.choice([0, 1], size=100000, replace=True,
                              p=[1-events_per_minute, events_per_minute])
    avg.append(np.mean(np.diff(np.where(events == 1)[0])))

In [36]:
plot_hist(avg)

                  0
count  10000.000000
mean      12.001552
std        0.125612
min       11.564770
25%       11.917357
50%       12.001200
75%       12.084970
max       12.494751


In [37]:
avg = np.array(avg)
np.mean(avg)

12.001551985736299

## Visualizing Successes

Finally, let's look at 1 hour of observations and when we actually see the meteors. 

In [38]:
np.random.seed(6)

events = np.random.choice([0, 1], size = minutes, replace=True, 
                          p=[1-events_per_minute, events_per_minute])

success_times = np.where(events==1)[0]
waiting_times = np.diff(success_times)
success_times

array([15, 27, 31, 36, 43])

In [39]:
annotations = [go.layout.Annotation(x=x, y=1, text=f'Time: {x}', ax=0, ay=250) for x in success_times]

figure = go.Figure(data=[go.Scatter(x=success_times, 
                                    y=np.ones(shape=len(success_times)), 
                                    mode='markers')], 
                   
                   layout=go.Layout(annotations=annotations, yaxis=dict(range=(0, 1.1)), 
                                   xaxis=dict(title="Minutes", range=(0, 60)), title='Meteors over One Hour'))
iplot(figure)

# Binomial Versus Poisson Distribution

In [40]:
N = 30

trials = np.random.binomial(minutes, events_per_minute, size=N)
trials.mean()

trials_poisson = np.random.poisson(lam, size=N)
trials_poisson.mean()

5.133333333333334

4.333333333333333

In [41]:
plot_hist(trials, title=f'Binomial Distribution with N = {N}')
plot_hist(trials_poisson, title=f'Poisson Distribution with N = {N}')

               0
count  30.000000
mean    5.133333
std     2.129163
min     2.000000
25%     3.000000
50%     5.000000
75%     6.000000
max    10.000000


               0
count  30.000000
mean    4.333333
std     2.202402
min     1.000000
25%     2.250000
50%     4.000000
75%     6.000000
max     9.000000


In [42]:
N = 10000

trials = np.random.binomial(minutes, events_per_minute, size=N)
trials.mean()

trials_poisson = np.random.poisson(lam, size=N)
trials_poisson.mean()

5.0365

4.944

In [43]:
plot_hist(trials, title=f'Binomial Distribution with N = {N}')
plot_hist(trials_poisson, title=f'Poisson Distribution with N = {N}')

                  0
count  10000.000000
mean       5.036500
std        2.149612
min        0.000000
25%        4.000000
50%        5.000000
75%        6.000000
max       15.000000


                  0
count  10000.000000
mean       4.944000
std        2.218683
min        0.000000
25%        3.000000
50%        5.000000
75%        6.000000
max       15.000000


The Poisson and Binomial distributions are nearly identical when the number of trials is large and the probability of success is relatively small. The Poisson distribution can be thought of as the limiting case of the Binomial in which the number of trials goes to infinity while the expected number of successes stays fixed.
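
To see this numerically rather than through sampling, the sketch below compares the exact Binomial and Poisson probabilities while holding the expected count at 5 and increasing the number of trials. It uses only the standard library (`math.comb` requires Python 3.8 or later). The helper names and the choice to compare counts 0 through 30 are assumptions made for this example.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of k successes in n independent trials with success chance p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Probability of k events when lam events are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 5  # expected meteors per hour
max_diffs = []
for n in (60, 600, 6000):  # 60 matches one trial per minute, as above
    p = lam / n            # keep n * p fixed at lam
    diff = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(31))
    max_diffs.append(diff)
    print(f'n = {n:>4}, p = {p:.5f}: largest gap in probability = {diff:.6f}')
```

The largest gap between the two sets of probabilities shrinks as the number of trials grows and the per-trial probability falls.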

# Conclusions

In this notebook, we briefly outlined the basics of a Poisson process and Poisson distribution. We also walked through an example that you can adapt to different situations. 
To summarize, the Poisson distribution gives the probability of a number of events in an interval when the events are generated by a Poisson process. The Poisson distribution approximates the Binomial distribution in situations where the number of trials is large and the probability of success on each trial is small. As with all distributions, the Poisson gives us a range of possible outcomes in addition to the one that is most likely.