# Multivariate Hypergeometric Sample Tool

Code in file: src/multivariate_hypergeom_sample.py 

This distribution allows for sampling from a group of elements of various types (more than 2) without replacement. For example, we might want to sample from a bag containing 10 red, 20 white, and 5 blue balls. If we take a random sample of 5 balls, how many will be red, white, and blue, respectively? Simulating exercises such as this can be accomplished using the following Python class. 
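As a quick illustration of the ball example, the draw can be simulated with only the standard library. This is a minimal sketch and not the class documented below; the names `bag` and `draw_counts` are made up for illustration.

```python
import random
from collections import Counter

# Bag from the example: 10 red, 20 white, and 5 blue balls.
bag = ["red"] * 10 + ["white"] * 20 + ["blue"] * 5


def draw_counts(population, size, rng=random):
    """Draw `size` items without replacement and count each type."""
    drawn = rng.sample(population, size)
    return Counter(drawn)


# A seeded generator keeps the illustration repeatable.
counts = draw_counts(bag, 5, random.Random(0))
print(dict(counts))  # counts per colour; they always sum to 5
```

Expanding the bag into one label per ball is only practical for small populations, which is one reason to work with per-type counts instead.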


*Run the following lines once to run examples below:*

In [None]:
cd ..

In [None]:
from src.multivariate_hypergeom_sample import multivhyper

### multivhyper

Python class to create distribution objects to obtain samples. 

Methods:
- `sample(int size)`: Get a sample of given size from the distribution.
- `cdf()`: Calculate current CDF of distribution as list of floats.
- `print_status()`: Print current status of distribution. Current size, element type counts, and current element proportions. *For formatting purposes, element proportions are rounded to 3 decimal places when printed.*


*Note: Any arrays associated with the element types of the distribution will keep the same order as the list/dict passed into the constructor.*


#### Creating an object

To create the distribution object, either pass a list of integer counts of each element or a dictionary with element type keys and integer values into the constructor.

This code creates a distribution object for the ball example using a list and a dictionary. We call `print_status()` to demonstrate the creation of each object.

In [None]:
ball_dist1 = multivhyper([10, 20, 5])
ball_dist1.print_status()

ball_dist2 = multivhyper({'Red': 10, 'White': 20, 'Blue': 5})
ball_dist2.print_status()
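One way a constructor could accept both input forms while preserving element order is sketched here. This is an assumption about the approach, not the code in `src/multivariate_hypergeom_sample.py`; `normalize_counts` is a hypothetical helper name.

```python
def normalize_counts(counts):
    """Return (labels, values) from a list of counts or a dict of type -> count."""
    if isinstance(counts, dict):
        labels = list(counts.keys())    # dicts keep insertion order (Python 3.7+)
        values = list(counts.values())
    else:
        labels = None                   # a plain list carries no type names
        values = list(counts)
    if any((not isinstance(v, int)) or v < 0 for v in values):
        raise ValueError("counts must be non-negative integers")
    return labels, values


print(normalize_counts([10, 20, 5]))
print(normalize_counts({"Red": 10, "White": 20, "Blue": 5}))
```

Keeping labels and values as parallel lists means any array derived from the counts lines up with the order passed in.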

#### Obtaining a Sample

To sample from a distribution, simply call the `sample()` method and pass in the desired sample size. The method returns an array of counts of each element type in the sample. The optional `print_sample` parameter allows you to print the sample (along with element types if applicable).

*Sample size must be an integer no greater than the size (number of elements) left in the distribution.*



The following pseudocode demonstrates how the sample is obtained:
``` 
sample(size):
    sample size = 0
    
    while sample size < size:
        rand = generate random number between 0 and 1
        cdf = get cdf of dist
        
        for each type of element in cdf:
            if rand < current cdf value and count of element type > 0:
                add 1 element of current type to sample
                remove element from dist
                increment sample size
                decrement dist size
                stop checking the remaining types
    
    return sample

```
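The pseudocode above can be sketched as runnable Python. This is an illustrative version under stated assumptions, not the class's actual implementation: the function name `sample_without_replacement` is hypothetical, the distribution is a plain list of counts, and the CDF is taken to be the running total of counts divided by the number of elements left.

```python
import random


def sample_without_replacement(counts, size, rng=random):
    """Draw `size` elements one at a time; return per-type counts of the draw.

    `counts` is modified in place, mirroring sampling without replacement.
    """
    total = sum(counts)
    if not 0 <= size <= total:
        raise ValueError("size must be between 0 and the number of elements left")
    drawn = [0] * len(counts)
    for _ in range(size):
        rand = rng.random()                 # uniform in [0, 1)
        running = 0
        for i, count in enumerate(counts):
            running += count
            # running / total is the CDF value for this element type.
            if count > 0 and rand < running / total:
                drawn[i] += 1
                counts[i] -= 1
                total -= 1
                break                       # exactly one element per draw
    return drawn


remaining = [10, 20, 5]
drawn = sample_without_replacement(remaining, 10, random.Random(1))
print(drawn, remaining)
```

The `break` ensures each random number selects a single element, and recomputing the running totals on every draw is what makes the later draws depend on the earlier ones.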

The first example takes a sample of size 10 from the first ball distribution and the second takes a sample of 5 from the second with `print_sample=True`.

In [None]:
ball_dist1.sample(10)

In [None]:
ball_dist2.sample(5, True)

After taking the sample, obtain the status of the distribution again to observe how sampling without replacement has affected the distribution.

In [None]:
ball_dist1.print_status()
ball_dist2.print_status()

#### Calculating CDF

The `cdf()` method is primarily used internally to obtain the sample, but it can also be called directly without sampling. To do so, simply call the `cdf()` method on a distribution object to obtain a list of floats representing the CDF value at each element type.

For the ball example:

In [None]:
ball_dist1.cdf()

In [None]:
ball_dist2.cdf()
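For reference, a CDF over element types can be computed as cumulative proportions of the remaining counts. The sketch below assumes that is what `cdf()` returns; `cdf_from_counts` is a hypothetical stand-alone function, not part of the class.

```python
from itertools import accumulate


def cdf_from_counts(counts):
    """Return the cumulative proportion of elements up to each type."""
    total = sum(counts)
    if total == 0:
        raise ValueError("distribution is empty")
    return [running / total for running in accumulate(counts)]


# Ball example: 10 red, 20 white, 5 blue out of 35 in total.
print(cdf_from_counts([10, 20, 5]))  # roughly [0.286, 0.857, 1.0]
```

The last value is always 1.0, and a type with no elements left repeats the value of the type before it.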

## Use in Bayesian-RLA

The `multivhyper` class was created to simulate audits on elections with invalid votes or more than two candidates. Although Python has useful libraries such as NumPy and SciPy, neither offered a way to sample from a multivariate hypergeometric distribution when this class was written, as needed to simulate these audits on various types of elections (NumPy 1.18 and later provide `Generator.multivariate_hypergeometric`). Given code to perform an audit, it is possible to simulate an audit on any type of election using `multivhyper`.

For example, when testing audits on elections with invalid votes, I created a tied election with 20% of the votes being invalid. To run the audit, I repeatedly sampled from the distribution (which represents the ballots in an election) as the audit progressed.
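The repeated-sampling loop might look like the sketch below. It uses only the standard library rather than `multivhyper`, and the ballot totals, round size, number of rounds, and the helper `draw_round` are all assumptions chosen for illustration.

```python
import random


def draw_round(remaining, size, rng):
    """Remove `size` ballots at random from `remaining`; return the tally drawn."""
    # Expand the counts into one label per ballot, then sample without replacement.
    pool = [name for name, count in remaining.items() for _ in range(count)]
    tally = {name: 0 for name in remaining}
    for name in rng.sample(pool, size):
        tally[name] += 1
        remaining[name] -= 1
    return tally


# Hypothetical tied contest: 1000 ballots, 20% invalid, the rest split evenly.
ballots = {"A": 400, "B": 400, "Invalid": 200}
rng = random.Random(2024)
totals = {name: 0 for name in ballots}

for round_number in range(1, 4):            # three rounds of 50 ballots each
    tally = draw_round(ballots, 50, rng)
    for name, count in tally.items():
        totals[name] += count
    # A real audit would evaluate its stopping rule on `totals` here.
    print(round_number, totals)
```

Because each round removes ballots from `ballots`, later rounds draw only from what is left, which is the behavior the audit simulation relies on.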

