# Statistical thinking - EDA, probabilistic thinking

# Statistical Inference - EDA
EDA = Exploratory data analysis = organize, plot, summarize data set

Contents
- Graphical EDA
- Quantitative EDA
- Thinking probabilistically (discrete variables)
- Thinking probabilistically (continuous variables)

You can use NumPy arrays and pandas DataFrame columns interchangeably for most of the plots below

"Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone." - John Tukey

## Graphical EDA

### 1. Histogram

In [None]:
# plot a histogram
import matplotlib.pyplot as plt

# passed a DataFrame, but can do Numpy array
# '_' is a dummy variable commonly used in Python to prevent unnecessary output
# note: you can use ';' after each statement in Jupyter Notebook
_ = plt.hist(df_swing['dem_share'])
# always label your axes
_ = plt.xlabel('percent of vote for Obama')
_ = plt.ylabel('number of counties')
plt.show()

In [None]:
# histograms with different binning
# "square root rule": number of bins = np.sqrt(number of samples)

# specify bins arg with edges
bin_edges = [0,10,20,30,40,50,60,70,80,90,100]
_ = plt.hist(df_swing['dem_share'], bins = bin_edges)
plt.show()

# specify number of bins
_ = plt.hist(df_swing['dem_share'], bins = 20)
plt.show()

In [None]:
# Example of "square root rule" for bin determining

# Import numpy
import numpy as np

# Compute number of data points: n_data
n_data = len(versicolor_petal_length)

# Number of bins is the square root of number of data points: n_bins
n_bins = np.sqrt(n_data)

# Convert number of bins to integer: n_bins
n_bins = int(n_bins)

# Plot the histogram
_ = plt.hist(versicolor_petal_length, bins = n_bins)

# Label axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('count')

# Show histogram
plt.show()

#### Set styling with Seaborn

In [None]:
import seaborn as sns

# seaborn default has nicer styling than matplotlib
sns.set()
_ = plt.hist(df_swing['dem_share'])
_ = plt.xlabel('percent of vote for Obama')
_ = plt.ylabel('number of counties')
plt.show()

## Beware binning bias with Histograms
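
Binning bias means the same data can look different depending on how the bins are chosen. The sketch below is only an illustration: it uses synthetic Normal data (an assumption, not the election data) and counts the points per bin with np.histogram() for three bin choices.

```python
import numpy as np

# Synthetic data, assumed for illustration only (not the election data)
np.random.seed(42)
data = np.random.normal(50, 10, size=200)

# Same data, three different numbers of bins
for n_bins in (5, 14, 50):
    counts, edges = np.histogram(data, bins=n_bins)
    print(n_bins, 'bins -> tallest bar holds', counts.max(), 'points')
```

The data never change, but the height and apparent shape of the bars do, which is why the plots in the next sections show every data point instead.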

### 2. plot all the data: Bee Swarm plots
- alternative to Histogram with binning bias
- needs to be a Pandas dataframe


In [None]:
# example: bee swarm plot
_ = sns.swarmplot(x='state',y='dem_share',data=df_swing)
_ = plt.xlabel('state')
_ = plt.ylabel('percent of vote for Obama')
plt.show()

### 3. Empirical cumulative distribution functions (ECDF)
- plot all the data
- alternative to bee swarm plot when there are so many points that they overlap or run off the edges
- among the most important plots in stats analysis

In [None]:
# x-axis is sorted data
import numpy as np
x = np.sort(df_swing['dem_share'])
y = np.arange(1, len(x)+1) / len(x) # note: len(x) = n sample size

_ = plt.plot(x, y, marker='.',linestyle='none')
_ = plt.xlabel('percent of vote for Obama')
_ = plt.ylabel('ECDF')
plt.margins(0.02) # keeps data off plot edges with 2% buffer
plt.show()

In [None]:
# example 1: ecdf function

# write a function to compute the ECDF
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n

    return x, y

# Compute ECDF for versicolor data: x_vers, y_vers
x_vers, y_vers = ecdf(versicolor_petal_length)

# Generate plot
_ = plt.plot(x_vers, y_vers, marker='.', linestyle='none')

# Label the axes
_ = plt.xlabel('versicolor petal length')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()

In [None]:
# Compare multiple ECDFs in 1 plot

# example 2: continuation of example 1

# Compute ECDFs
x_set, y_set = ecdf(setosa_petal_length)
x_vers, y_vers = ecdf(versicolor_petal_length)
x_virg, y_virg = ecdf(virginica_petal_length)

# Plot all ECDFs on the same plot
plt.plot(x_set, y_set, marker='.', linestyle='none')
plt.plot(x_vers, y_vers, marker='.', linestyle='none')
plt.plot(x_virg, y_virg, marker='.', linestyle='none')

# Annotate the plot
plt.legend(('setosa', 'versicolor', 'virginica'), loc = 'lower right')
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()

## Quantitative EDA - summary stats
- compute summary stats to describe features of a data set

### - compute mean and median

In [None]:
# calculate mean
import numpy as np
np.mean(dem_share_PA)

# calculate Median = 50%ile on an ECDF
# mean affected by outliers in data, so use Median
# Median = middle value of data set, so immune to extremes
np.median(dem_share_UT)
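
To make the outlier point concrete, here is a small sketch with made-up vote shares (the `shares` array is hypothetical, not taken from the data set). One extreme value pulls the mean upward while the median stays at the middle value.

```python
import numpy as np

# Hypothetical vote shares (percent); the last value is an extreme outlier
shares = np.array([40.0, 42.0, 45.0, 47.0, 95.0])

print('mean:  ', np.mean(shares))    # pulled upward by the outlier
print('median:', np.median(shares))  # middle value, unaffected by the outlier
```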

In [None]:
# example: mean
# Compute the mean: mean_length_vers
mean_length_vers = np.mean(versicolor_petal_length)
# Print the result with some nice formatting
print('I. versicolor:', mean_length_vers, 'cm')

###  - compute percentiles

In [None]:
np.percentile(df_swing['dem_share'], [25,50,75])
# output: gives values that match the percentiles

In [None]:
# example
# Specify array of percentiles: percentiles 2.5, 25, 50, 75, 97.5th
percentiles = np.array([2.5, 25, 50, 75, 97.5])

# Compute percentiles: ptiles_vers
ptiles_vers = np.percentile(versicolor_petal_length, percentiles)

# Print the result
print(ptiles_vers)

### - compare percentiles to ECDF

In [None]:
# example: compare percentiles to ECDF

# Plot the ECDF
_ = plt.plot(x_vers, y_vers, '.')
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Overlay percentiles as red diamonds.
# x = ptiles_vers, y = percentiles/100; marker='D' draws diamonds
# percentiles/100 to rescale percentiles and keep ECDF between 0 and 1
_ = plt.plot(ptiles_vers, percentiles/100, marker='D', color='red', linestyle = 'none')

# Show the plot
plt.show()

### - box and whisker plot aka 'box plot'
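- note: there is no single definition of an outlier; a common criterion is more than 2 IQRs (interquartile ranges) from the median, while seaborn's box plot by default draws points beyond 1.5 IQRs from the box edges individually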
- box plots are a great alternative to bee swarm plots with a lot of data
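
The quantities a box plot draws can also be computed by hand. This is a minimal sketch on hypothetical data, assuming the 1.5 x IQR whisker convention that matplotlib and seaborn use by default; other outlier criteria would change the limits.

```python
import numpy as np

# Hypothetical data with one extreme value
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

# The box spans the 25th to the 75th percentile
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Whisker limits under the assumed 1.5 * IQR convention
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the limits are the ones drawn individually
outliers = data[(data < lower) | (data > upper)]
print('IQR:', iqr, 'outliers:', outliers)
```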

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
_ = sns.boxplot(x='east_west', y='dem_share', data=df_all_states)
_ = plt.xlabel('region')
_ = plt.ylabel('percent of vote for Obama')
plt.show()

In [None]:
# example: Make a box plot of the iris petal lengths. 
# df is a pandas DataFrame that includes petal length

# Create box plot with Seaborn's default settings
_ = sns.boxplot(x='species', y='petal length (cm)',data=df)

# Label the axes
_ = plt.xlabel('species')
_ = plt.ylabel('petal length (cm)')

# Show the plot
plt.show()

### - compute variance
- variance = mean squared distance of the data from their mean, aka spread

In [None]:
np.var(dem_share_FL)

### - std dev and variance - np.std(), np.var()
- std dev = sqrt(variance)

In [None]:
np.std(dem_share_FL)

# or the long way
np.sqrt(np.var(dem_share_FL))

In [None]:
# explicitly compute the variance and compare to np.var()

# Array of differences to mean: differences
differences = versicolor_petal_length - np.mean(versicolor_petal_length)

# Square the differences: diff_sq
diff_sq = differences ** 2

# Compute the mean square difference: variance_explicit
variance_explicit = np.mean(diff_sq)

# Compute the variance using NumPy: variance_np
variance_np = np.var(versicolor_petal_length)

# Print the results
print(variance_explicit, variance_np)

### - scatter plots

In [None]:
# generate a scatter plot: marker='.', linestyle='none'
_ = plt.plot(total_votes/1000, dem_share, marker='.', linestyle='none')
_ = plt.xlabel('total votes (thousands)')
_ = plt.ylabel('percent of vote for Obama')


### - covariance and Pearson correlation coefficient
ρ = Pearson correlation coefficient = covariance / ((std of x)(std of y))
 = variability due to codependence / independent variability
- dimensionless
- ranges from -1 (complete anticorrelation) to 1 (complete positive correlation)
- 0 means no linear correlation
- this is a good metric for correlation b/n 2 variables
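
The definition above can be checked numerically. The sketch below uses synthetic data (an assumption for illustration), computes the covariance as the mean product of deviations, scales it by both standard deviations, and compares the result with np.corrcoef().

```python
import numpy as np

# Synthetic, positively related data (assumed for illustration)
np.random.seed(42)
x = np.random.normal(size=500)
y = 2 * x + np.random.normal(size=500)

# Covariance: mean of the products of deviations from the means
cov_xy = np.mean((x - np.mean(x)) * (y - np.mean(y)))

# Pearson correlation: covariance scaled by both standard deviations
rho = cov_xy / (np.std(x) * np.std(y))

# The two numbers should agree up to floating-point error
print(rho, np.corrcoef(x, y)[0, 1])
```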

In [None]:
# example: scatter plot to inspect the data before computing the covariance
# Make a scatter plot of petal length and width
_ = plt.plot(versicolor_petal_length, versicolor_petal_width, marker='.', linestyle='none')

# Label the axes
_ = plt.xlabel('petal length')
_ = plt.ylabel('petal width')

# Show the result
plt.show()

### - compute covariance np.cov()
The covariance may be computed with the NumPy function np.cov(). For two data sets x and y, np.cov(x, y) returns a 2D array in which entries [0,1] and [1,0] are the covariance of x and y. Entry [0,0] is the variance of the data in x, and entry [1,1] is the variance of the data in y. This 2D output array is called the covariance matrix, since it organizes the variances and the covariances.
Note that by symmetry, entry [1,0] is the same as entry [0,1].
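
One detail worth knowing when comparing np.cov() with np.var(): by default np.cov() divides by n - 1, while np.var() divides by n. A small sketch with made-up arrays (hypothetical values) shows how to make the two agree.

```python
import numpy as np

# Hypothetical measurements
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov = np.cov(x, y)

# Default np.cov() matches the n - 1 (sample) variance
print(cov[0, 0], np.var(x, ddof=1))

# Passing ddof=0 makes np.cov() match the default np.var()
print(np.cov(x, y, ddof=0)[0, 0], np.var(x))
```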

In [None]:
# Compute the covariance matrix: covariance_matrix
covariance_matrix = np.cov(versicolor_petal_length, versicolor_petal_width)

# Print covariance matrix
print(covariance_matrix)

# Extract covariance of length and width of petals: petal_cov
petal_cov = covariance_matrix[0,1]

# Print the length/width covariance
print(petal_cov)

### - compute Pearson correlation coefficient, Pearson r - np.corrcoef()
The np.corrcoef() function, like np.cov(), takes two arrays as arguments and returns a 2D array. Entries [0,0] and [1,1] are necessarily equal to 1 (can you think about why?), and the value we are after is entry [0,1].

In [None]:
# write a function that calculates Pearson r
def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    # Compute correlation matrix: corr_mat
    corr_mat = np.corrcoef(x,y)

    # Return entry [0,1]
    return corr_mat[0,1]

# Compute Pearson correlation coefficient for I. versicolor: r
r = pearson_r(versicolor_petal_length,versicolor_petal_width)

# Print the result
print(r)

## Thinking probabilistically (discrete variables)
- describes uncertainty

### - Random number generators and Hacker statistics
- hacker stats = uses simulated repeated measurements to compute probabilities (aka simulations)
#### np.random.random()
- draw a number b/n 0 and 1
- Bernoulli trial - 2 outcomes like coin flip
#### np.random.seed() used to create reproducible code
- random number seed - integer for random number generating algorithm


In [None]:
# Simulate 4 coin flips
import numpy as np
np.random.seed(42)
random_numbers = np.random.random(size=4)
# view random_numbers
random_numbers
# give boolean values for heads and tails
heads = random_numbers < 0.5
heads
np.sum(heads)

# initialize number of 4-heads trials
n_all_heads = 0
# 10000 simulations of 4 heads trials
for _ in range(10000):
    heads = np.random.random(size=4) < 0.5
    n_heads = np.sum(heads)
    if n_heads == 4:
        n_all_heads += 1

n_all_heads/10000
# out: 0.0621

### - Hacker stats probabilities
- determine how to simulate data
- simulate many many times
- probability is approximately fraction of trials with the outcome of interest
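
The three steps above can be sketched without a Python loop. This is one possible vectorized version of the four-coin-flip simulation (the array shape and the number of trials are choices made here for illustration), compared against the exact answer 0.5 ** 4.

```python
import numpy as np

np.random.seed(42)
n_trials = 100000

# One row per trial, four flips per row; True means heads
flips = np.random.random(size=(n_trials, 4)) < 0.5

# A trial counts only if all four flips in its row are heads
p_all_heads = np.mean(np.all(flips, axis=1))

# Simulated fraction next to the exact probability
print(p_all_heads, 0.5 ** 4)
```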

In [None]:
# example: generate random numbers b/n 0 and 1

# Seed the random number generator
np.random.seed(42)

# Initialize random numbers: random_numbers
random_numbers = np.empty(100000)

# Generate random numbers by looping over range(100000)
# note: using arg: size in np.random.random() would be better than for loop
for i in range(100000):
    random_numbers[i] = np.random.random()

# Plot a histogram
_ = plt.hist(random_numbers)

# Show the plot
plt.show()

In [None]:
# example: np.random with Bernoulli trials (biased coin flips)

def perform_bernoulli_trials(n, p):
    """Perform n Bernoulli trials with success probability p
    and return number of successes."""
    # Initialize number of successes: n_success
    n_success = 0


    # Perform trials
    for i in range(n):
        # Choose random number between zero and one: random_number
        random_number = np.random.random()

        # If less than p, it's a success so add one to n_success
        if random_number < p:
            n_success += 1

    return n_success

In [None]:
# example 1: Calculate bank loan defaults using function above
# 100 Bernoulli trials (ie. 100 loans), p = 0.05 (chance of default), 1000 simulations
# Seed random number generator
np.random.seed(42)

# Initialize the number of defaults: n_defaults
n_defaults = np.empty(1000)

# Compute the number of defaults
for i in range(1000):
    n_defaults[i] = perform_bernoulli_trials(100,0.05)


# Plot the histogram with default number of bins; label your axes
_ = plt.hist(n_defaults, density=True)
_ = plt.xlabel('number of defaults out of 100 loans')
_ = plt.ylabel('probability')

# Show the plot
plt.show()

In [None]:
# example 2: Will the bank fail?
# If interest rates are such that the bank will lose money if 10 or more 
# of its loans are defaulted upon, what is the probability that the bank 
# will lose money?

# Compute ECDF: x, y
x, y = ecdf(n_defaults)

# Plot the ECDF with labeled axes
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('number of defaults out of 100 loans')
_ = plt.ylabel('probability')

# Show the plot
plt.show()

# Compute the number of 100-loan simulations with 10 or more defaults: n_lose_money
n_lose_money = np.sum(n_defaults >= 10)

# Compute and print probability of losing money
print('Probability of losing money =', n_lose_money / len(n_defaults))
# 0.022, so 2% chance of getting 10 or more defaults out of 100 loans

### Probability distributions - Probability mass function (PMF)
= set of probabilities of discrete outcomes
- Probability distribution = math description of outcomes
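
For the Binomial distribution the PMF has a closed form, P(k) = C(n, k) p^k (1 - p)^(n - k). As a rough sketch, the exact PMF for 4 fair coin flips can be compared with the fractions obtained by sampling; the sample size is an arbitrary choice.

```python
import math
import numpy as np

n, p = 4, 0.5

# Exact Binomial PMF for k = 0..n successes
pmf = np.array([math.comb(n, k) * p**k * (1 - p)**(n - k)
                for k in range(n + 1)])

# Simulated PMF: fraction of samples landing on each outcome 0..n
np.random.seed(42)
samples = np.random.binomial(n, p, size=100000)
simulated = np.bincount(samples, minlength=n + 1) / len(samples)

print(pmf)
print(simulated)
```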

In [None]:
# Sampling from the Binomial distribution
# 4 coin flips, p = 0.5
np.random.binomial(4, 0.5)
# out: 2, so 2 heads out of 4

np.random.binomial(4, 0.5, size=10)

In [None]:
# plot Binomial PMF
samples = np.random.binomial(60, 0.1, size=10000)

In [None]:
# plot Binomial CDF
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
x, y = ecdf(samples)
_ = plt.plot(x, y, marker='.', linestyle = 'none')
plt.margins(0.02)
_ = plt.xlabel('number of successes')
_ = plt.ylabel('CDF')
plt.show()

In [None]:
# Sampling out of Binomial distribution and plot CDF

# Compute the probability mass function for the number of defaults 
# we would expect for 100 loans as in the last section, but instead 
# of simulating all of the Bernoulli trials, perform the sampling using 
# np.random.binomial(). 

# More efficient than custom-written perform_bernoulli_trials() function
# Take 10000 samples
np.random.seed(42)

# Take 10,000 samples out of the binomial distribution: n_defaults
n_defaults = np.random.binomial(n=100, p=0.05, size=10000)

# Compute CDF: x, y
x, y = ecdf(n_defaults)

# Plot the CDF with axis labels
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('number of defaults')
_ = plt.ylabel('CDF')

# Show the plot
plt.show()

In [None]:
# plot the Binomial PMF as a histogram
# The trick is setting up the edges of the bins to pass to plt.hist() 
# via the bins keyword argument. 
# We want the bins centered on the integers. So, the edges of the bins 
# should be -0.5, 0.5, 1.5, 2.5, ... up to max(n_defaults) + 0.5. 
# Compute bin edges: bins
bins = np.arange(0, max(n_defaults) + 2) - 0.5

# Generate histogram
_ = plt.hist(n_defaults, density=True, bins=bins)

# Label axes
_ = plt.xlabel('loan defaults')
_ = plt.ylabel('probabilities')

# Show the plot
plt.show()

### - Poisson processes and distribution - Rare events - large n, small p
- timing of the next event is completely independent of when the previous event happened
- examples: natural births in a given hospital, hits on a website during a given hour
- Poisson distribution: the number r of arrivals (e.g. hits on a website) in a given time interval (e.g. one hour), given the average rate of arrivals per interval (e.g. 6 hits per hour)
- Note: the Poisson distribution is the limit of the Binomial distribution for a low probability of success and a large number of trials, i.e. RARE EVENTS
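
The limit in the last bullet can be seen directly from the two PMF formulas. In this sketch the helper names `binomial_pmf` and `poisson_pmf` are made up for illustration; n * p is held at 6 while n grows, and the largest gap between the two PMFs is printed.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of k successes in n trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Probability of k arrivals when the mean number of arrivals is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 6  # mean arrivals per interval, as in the 6 hits per hour example
max_diffs = []
for n in (20, 100, 10000):
    p = lam / n  # keep n * p fixed at lam
    diff = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(21))
    max_diffs.append(diff)
    print('n =', n, 'largest PMF gap:', diff)
```

The gap shrinks as n grows and p falls, which is the rare-event regime.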

In [None]:
# Poisson CDF
samples = np.random.poisson(6, size=10000)
x, y = ecdf(samples)
_ = plt.plot(x, y, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('number of successes')
_ = plt.ylabel('CDF')
plt.show()
# looks like a Binomial CDF

In [None]:
# example: relationship b/n Binomial and Poisson distributions
# You will compute the mean and standard deviation of samples from a
# Poisson distribution with an arrival rate of 10. Then, you will 
# compute the mean and standard deviation of samples from a Binomial 
# distribution with parameters n and p such that np=10.

# Draw 10,000 samples out of Poisson distribution: samples_poisson
samples_poisson = np.random.poisson(10, size=10000)

# Print the mean and standard deviation
print('Poisson:     ', np.mean(samples_poisson),
                       np.std(samples_poisson))

# Specify values of n and p to consider for Binomial: n, p
# np = 10
n = [20, 100, 1000]
p = [0.5, 0.1, 0.01]

# Draw 10,000 samples for each n,p pair: samples_binomial
for i in range(3):
    samples_binomial = np.random.binomial(n=n[i], p=p[i], size=10000)

    # Print results
    print('n =', n[i], 'Binom:', np.mean(samples_binomial),
                                 np.std(samples_binomial))

# output:
# Poisson:      9.9549 3.140997610632647
# n = 20 Binom: 10.0235 2.2304590895149814
# n = 100 Binom: 9.9836 2.9554916748317868
# n = 1000 Binom: 10.023 3.1199152232071947

# Means are all about the same. The std dev of the Binomial distribution gets
# closer to that of the Poisson distribution as the probability p gets lower.

1990 and 2015 featured the most no-hitters of any season of baseball (there were seven). Given that there are on average 251/115 no-hitters per season, what is the probability of having seven or more in a season?

In [None]:
# Draw 10,000 samples out of Poisson distribution: n_nohitters
n_nohitters = np.random.poisson(251/115, size=10000)

# Compute number of samples that are seven or greater: n_large
n_large = np.sum(n_nohitters >= 7)

# Compute probability of getting seven or more: p_large = n_large / 10000
p_large = n_large / 10000

# Print the result
print('Probability of seven or more no-hitters:', p_large)
# output:
# Probability of seven or more no-hitters: 0.0063

## Thinking probabilistically (continuous variables)

### Probability density function (PDF)
- continuous analog to PMF
- math description of relative likelihood of observing a value of a continuous variable
- *** Note: probability is AUC (area under curve), not the value of the PDF
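
To illustrate the area-versus-height distinction, the sketch below integrates a Normal PDF numerically with a simple midpoint sum. The mean and standard deviation are hypothetical values, and `normal_pdf` is a helper written here for illustration.

```python
import numpy as np

mu, sigma = 20.0, 3.0  # hypothetical parameters

def normal_pdf(x, mu, sigma):
    """Height of the Normal PDF at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Midpoint-rule area under the PDF between mu - sigma and mu + sigma
edges = np.linspace(mu - sigma, mu + sigma, 10001)
midpoints = (edges[:-1] + edges[1:]) / 2
dx = edges[1] - edges[0]
area = np.sum(normal_pdf(midpoints, mu, sigma)) * dx

print('area (a probability):', area)
print('PDF height at the mean (not a probability):', normal_pdf(mu, mu, sigma))
```

The area within one standard deviation of the mean comes out near 0.68, while the PDF height depends on sigma and is not itself a probability.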

### Normal (distribution) CDF
- gives the probability that the variable is less than or equal to a specified x-axis value


example:
    Using the CDF, what's the probability that x > 10?
    - If value of CDF at x = 10 is 0.75, then probability that x < 10 is 0.75.
    - So probability that x > 10 = 1- 0.75 = 0.25
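
The same reasoning can be written in code. This sketch assumes a Normal distribution with hypothetical parameters (mean 8, standard deviation 3, chosen so that the CDF at 10 is close to 0.75) and evaluates the CDF with the error function from the standard library.

```python
import math

def normal_cdf(x, mu, sigma):
    """Probability that a Normal(mu, sigma) variable is at most x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 8.0, 3.0  # hypothetical parameters

p_below = normal_cdf(10, mu, sigma)  # P(x <= 10), read off the CDF
p_above = 1 - p_below                # P(x > 10), the complement

print(p_below, p_above)
```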

In [None]:
# example: check normality of data
import numpy as np
mean = np.mean(michelson_speed_of_light)
std = np.std(michelson_speed_of_light)
samples = np.random.normal(mean, std, size=10000)
x, y = ecdf(michelson_speed_of_light)
x_theor, y_theor = ecdf(samples)

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.',linestyle='none')
_ = plt.xlabel('speed of light (km/s)')
_ = plt.ylabel('CDF')
plt.show()

In [None]:
# Exercise: explore the Normal PDF shape and plot a PDF of a known 
# distribution using hacker statistics. Specifically, you will 
# plot a Normal PDF for various values of the variance.

# Draw 100,000 samples from a Normal distribution that has a mean of 20
# and a standard deviation of 1. Do the same for Normal distributions 
# with standard deviations of 3 and 10, each still with a mean of 20.

# Draw 100000 samples from Normal distribution with stds of interest: samples_std1, samples_std3, samples_std10
samples_std1 = np.random.normal(20,1,size=100000)
samples_std3 = np.random.normal(20,3,size=100000)
samples_std10 = np.random.normal(20,10,size=100000)

# Make histograms
# histtype='step' argument makes the plot look like the smooth 
# theoretical PDF. 
_ = plt.hist(samples_std1,bins=100, density=True, histtype='step')
_ = plt.hist(samples_std3,bins=100, density=True, histtype='step')
_ = plt.hist(samples_std10,bins=100, density=True, histtype='step')

# Make a legend, set limits and show plot
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'))
plt.ylim(-0.01, 0.42)
plt.show()

In [None]:
# Exercise: consider the Normal CDF shape

# Using the samples generated above (samples_std1, samples_std3, and 
# samples_std10), generate and plot the CDFs.

# Generate CDFs
x_std1, y_std1 = ecdf(samples_std1)
x_std3, y_std3 = ecdf(samples_std3)
x_std10, y_std10 = ecdf(samples_std10)

# Plot CDFs
_ = plt.plot(x_std1, y_std1, marker='.',linestyle='none')
_ = plt.plot(x_std3, y_std3, marker='.',linestyle='none')
_ = plt.plot(x_std10, y_std10, marker='.',linestyle='none')

# Make a legend and show the plot
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'), loc='lower right')
plt.show()

### Normal distribution (Gaussian): properties and warnings
- many things you think are normally distributed may not be
- normal distributions have light tails, so values far from the mean (outliers) are extremely unlikely under a Normal model

In [None]:
# example: comparing Belmont Stakes with outliers removed to 
# Normal distribution

# Sample out of a Normal distribution and plot a CDF. 
# Overlay the ECDF from the winning Belmont times.
# Are these close to Normally distributed?

# Compute mean and standard deviation: mu, sigma
mu = np.mean(belmont_no_outliers)
sigma = np.std(belmont_no_outliers)

# Sample out of a normal distribution with this mu and sigma: samples
samples = np.random.normal(mu, sigma, size=10000)

# Get the CDF of the samples and of the data
x, y = ecdf(belmont_no_outliers)
x_theor, y_theor = ecdf(samples)

# Plot the CDFs and show the plot
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('Belmont winning time (sec.)')
_ = plt.ylabel('CDF')
plt.show()
# Yes, it looks normally distributed

In [None]:
# example: 
# What are the chances of a horse matching or beating Secretariat's record?
# Assume Belmont winners' times are Normally distributed (with outliers removed).

# Take a million samples out of the Normal distribution: samples
samples = np.random.normal(mu, sigma, size=1000000)

# Compute the fraction that are at or under 144 seconds: prob
prob = np.sum(samples <= 144) / len(samples)

# Print the result
print('Probability of besting Secretariat:', prob)
# Probability of besting Secretariat: 0.000675

### Exponential Distribution
- describes the waiting times between arrivals of a Poisson process (rare events)
- looks like: max at x=0 and decays to the right
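
A quick sketch of these properties, using a hypothetical mean waiting time `tau`: samples from np.random.exponential() should average close to tau, and the fraction of waits longer than t should be close to the exact value exp(-t / tau).

```python
import numpy as np

np.random.seed(42)
tau = 5.0  # hypothetical mean waiting time
times = np.random.exponential(tau, size=100000)

t = 10.0
# Sample mean should be close to tau
print(np.mean(times))
# Simulated vs. exact probability of waiting longer than t
print(np.mean(times > t), np.exp(-t / tau))
```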

In [None]:
# example: exponential distribution as an example of continuous distributions
mean = np.mean(inter_times)
samples = np.random.exponential(mean, size=10000)
x, y = ecdf(inter_times)
x_theor, y_theor = ecdf(samples)
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.',linestyle='none')
_ = plt.xlabel('time (days)')
_ = plt.ylabel('CDF')
plt.show()

Sometimes, the story describing our probability distribution does not have a named distribution to go along with it. In these cases, we can always simulate it.

In earlier exercises, we looked at the rare event of no-hitters in Major League Baseball. Hitting the cycle is another rare baseball event. When a batter hits the cycle, he gets all four kinds of hits (a single, double, triple, and home run) in a single game. Like no-hitters, this can be modeled as a Poisson process, so the time between hits of the cycle is also Exponentially distributed.

How long must we wait to see both a no-hitter and then a batter hit the cycle? The idea is that we have to wait some time for the no-hitter, and then after the no-hitter, we have to wait for hitting the cycle. Stated another way, what is the total waiting time for the arrival of two different Poisson processes? The total waiting time is the time waited for the no-hitter plus the time waited for hitting the cycle.

Now, you will write a function to sample out of the distribution described by this story.

In [None]:
# example: simulate a probability distribution with no named form: the total
# waiting time for 2 successive Poisson processes.

def successive_poisson(tau1, tau2, size=1):
    """Compute time for arrival of 2 successive Poisson processes."""
    # Draw samples out of first exponential distribution: t1
    t1 = np.random.exponential(tau1, size)

    # Draw samples out of second exponential distribution: t2
    t2 = np.random.exponential(tau2, size)

    return t1 + t2

# Distribution of no-hitters and hit cycles
# The mean waiting time for a no-hitter is 764 games, and the 
# mean waiting time for hitting the cycle is 715 games.

# Draw samples of waiting times: waiting_times
waiting_times = successive_poisson(764,715,size=100000)

# Make the PDF histogram
_ = plt.hist(waiting_times, bins=100, density=True, histtype='step')

# Label axes
_ = plt.xlabel('waiting time (games)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

In [None]:
# plot the CDF
x_theor, y_theor = ecdf(waiting_times)
_ = plt.plot(x_theor, y_theor)
_ = plt.xlabel('waiting time (games)')
_ = plt.ylabel('CDF')
plt.show()
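
As a sanity check on the simulation, the mean of a sum of two waiting times is the sum of the means, and for independent waits the variances add. The sketch below redraws the samples itself so that it is self-contained, using the same mean waiting times of 764 and 715 games.

```python
import numpy as np

np.random.seed(42)
tau1, tau2 = 764, 715  # mean waiting times in games, from the example above

# Total wait = wait for the no-hitter + wait for hitting the cycle
t_total = (np.random.exponential(tau1, size=100000)
           + np.random.exponential(tau2, size=100000))

# Simulated mean vs. tau1 + tau2
print(np.mean(t_total), tau1 + tau2)
# Simulated std vs. sqrt(tau1**2 + tau2**2)
print(np.std(t_total), np.sqrt(tau1**2 + tau2**2))
```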