### Calculate the probability of a model ensemble that uses simple majority voting making an incorrect prediction in the following scenarios. (Hint: Understanding how to use the binomial distribution will be useful in answering this question.)

In [219]:
import numpy as np
from scipy.stats import binom
from scipy.special import comb as choose # for nCr combination calculations

The binomial distribution gives the probability of seeing a given number of "successes" when a trial with only two possible outcomes is repeated a fixed number of times.

In the case of this problem, the two possible outcomes are a correct or an incorrect prediction from the model ensemble, which is why it is a great use case for this type of distribution!

Binomial distributions must also meet the following criteria:
1. The number of observations or trials is fixed. Here, the number of independent models in each ensemble is known and does not change during the problem, so this criterion is met!

2. Each observation or trial is independent. The problem statements specify that the ensembles contain independent models, so this criterion is met as well.

3. The probability of success is exactly the same from one trial to another. Each problem gives a single error rate shared by all models, which we take as p (so the probability that a model is correct is 1-p). This rate does not change, so this criterion is also met.

Therefore, since all three criteria are satisfied, we can use the binomial distribution to find the probability that a model ensemble makes an incorrect prediction!
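Written out, the quantity being calculated can be sketched as follows. The symbol $X$ (the number of models that are wrong on a given prediction) is introduced here only for illustration; $n$ is the number of models and $p$ is the error rate of a single model. With simple majority voting, the ensemble is wrong when more than half of the models are wrong:

$$
P(\text{ensemble wrong}) = P\left(X \ge \lfloor n/2 \rfloor + 1\right) = \sum_{k=\lfloor n/2 \rfloor + 1}^{n} \binom{n}{k}\, p^{k} (1-p)^{n-k}
$$

For $n = 11$ the sum runs over $k = 6, \dots, 11$, and for $n = 21$ it runs over $k = 11, \dots, 21$.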

The function below takes the total number of independent models and the "success" probability p as input parameters. Because the problem focuses on incorrect predictions, a "success" here means that a single model makes a prediction error, so the given error rate is used as p.

To find the value of "k", the function computes the smallest number of models that forms a strict majority: the total number of models divided by two, rounded down, plus one. For an odd number of models this is the same as rounding num_models/2 up (6 of 11, 11 of 21).

With simple majority voting, the ensemble is wrong whenever k or more models are wrong, not only when exactly k are wrong. binom.pmf would return only the probability of exactly k errors, so the function uses the scipy survival function binom.sf instead. binom.sf(x, n, p) returns P(X > x), so evaluating it at k - 1 gives P(X >= k).

This probability is then returned by the function as output.

In [220]:
def binom_incorrect_prob(num_models, success_prob):

    # smallest number of models that forms a strict majority (over 50% of the votes)
    # this is the value of "k"
    majority = np.floor(num_models / 2) + 1
    print("Number of models needed to account for over 50% of predictions: ",majority)

    # the ensemble is wrong when at least `majority` models are wrong
    # binom.sf(x, n, p) returns P(X > x), so x = majority - 1 gives P(X >= majority)
    # n = num_models
    # p = success_prob (the error rate of a single model)
    total_prob_incorrect = binom.sf(majority - 1, num_models, success_prob)

    return total_prob_incorrect
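As a cross-check, the probability that a strict majority of models is wrong can also be summed term by term and compared with scipy's survival function. This is a minimal sketch: the helper name `majority_error_by_sum` is hypothetical and not part of the original notebook, and the example values are those of the first scenario (11 models, error rate 0.2).

```python
from scipy.special import comb
from scipy.stats import binom


def majority_error_by_sum(num_models, error_rate):
    """Sum P(exactly k models wrong) over every k that forms a strict majority."""
    first_majority = num_models // 2 + 1  # smallest k with k > num_models / 2
    return sum(
        comb(num_models, k, exact=True)
        * error_rate**k
        * (1 - error_rate) ** (num_models - k)
        for k in range(first_majority, num_models + 1)
    )


manual = majority_error_by_sum(11, 0.2)
# binom.sf(x, n, p) is P(X > x); x = 5 gives P(X >= 6)
tail = binom.sf(11 // 2, 11, 0.2)
print(f"manual sum: {manual:.4f}, binom.sf: {tail:.4f}")
```

The two values should agree up to floating-point rounding, since the survival function is the same tail sum evaluated by scipy.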

### 1. The ensemble contains 11 independent models, all of which have an error rate of 0.2. 

In [221]:
error_prob1 = 0.2

In [222]:
num_models1 = 11

In [223]:
prob_incorrect1 = binom_incorrect_prob(num_models1,error_prob1)

Number of models needed to account for over 50% of predictions:  6.0


In [224]:
print(f"Probability of this model ensemble making an incorrect prediction: {prob_incorrect1:.4f}")

Probability of this model ensemble making an incorrect prediction: 0.0117


In [225]:
print(f"Percentage of the time that this model ensemble makes an incorrect prediction: {prob_incorrect1:.2%}")

Percentage of the time that this model ensemble makes an incorrect prediction: 1.17%


This result makes sense: each model is wrong only 20% of the time, so it is rare for 6 or more of the 11 independent models to be wrong on the same prediction. Summing the binomial probabilities for 6 through 11 wrong models gives about 0.0117, far below the 0.2 error rate of a single model.
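A quick simulation offers a rough sanity check on this scenario. The sketch below draws many independent rounds of 11 models, each wrong with probability 0.2, and counts how often at least 6 are wrong. The seed, the number of trials, and the variable names are arbitrary choices made for this illustration, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (arbitrary) so the run is repeatable
num_trials = 200_000            # assumed sample size for the illustration

# wrong[i, j] is True when model j makes an error on trial i
wrong = rng.random((num_trials, 11)) < 0.2

# the majority vote is wrong when at least 6 of the 11 models are wrong
ensemble_wrong = wrong.sum(axis=1) >= 6
estimate = ensemble_wrong.mean()
print(f"Simulated ensemble error rate: {estimate:.4f}")
```

The estimate should land close to the exact binomial tail probability of about 0.0117, with small differences due to sampling noise.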

### 2. The ensemble contains 11 independent models, all of which have an error rate of 0.49.

In [226]:
error_prob2 = 0.49

In [227]:
num_models2 = 11

In [228]:
prob_incorrect2 = binom_incorrect_prob(num_models2,error_prob2)

Number of models needed to account for over 50% of predictions:  6.0


In [229]:
print(f"Probability of this model ensemble making an incorrect prediction: {prob_incorrect2:.4f}")

Probability of this model ensemble making an incorrect prediction: 0.4729


In [230]:
print(f"Percentage of the time that this model ensemble makes an incorrect prediction: {prob_incorrect2:.2%}")

Percentage of the time that this model ensemble makes an incorrect prediction: 47.29%


Holding the number of independent models constant at 11 but raising the error rate to 0.49, the probability of an incorrect ensemble prediction rises to about 0.4729. This makes sense: each model is now only slightly better than a coin flip, so a wrong majority is almost as likely as a right one. The ensemble is still slightly better than a single model (0.4729 versus 0.49).

### 3. The ensemble contains 21 independent models, all of which have an error rate of 0.49.

In [231]:
error_prob3 = 0.49

In [232]:
num_models3 = 21

In [233]:
prob_incorrect3 = binom_incorrect_prob(num_models3,error_prob3)

Number of models needed to account for over 50% of predictions:  11.0


In [234]:
print(f"Probability of this model ensemble making an incorrect prediction: {prob_incorrect3:.4f}")

Probability of this model ensemble making an incorrect prediction: 0.4630


In [235]:
print(f"Percentage of the time that this model ensemble makes an incorrect prediction: {prob_incorrect3:.2%}")

Percentage of the time that this model ensemble makes an incorrect prediction: 46.30%


The error rate stays at 0.49, but the number of independent models increases from 11 to 21. Because each model is still right more often than it is wrong, adding more independent votes makes a wrong majority less likely, so the ensemble error drops slightly compared to the scenario above, from about 0.4729 to about 0.4630.
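To see how the ensemble size affects the result, the same tail probability can be evaluated for several odd ensemble sizes at the 0.49 error rate. This is a sketch under stated assumptions: the sizes beyond 11 and 21 are chosen only for illustration, and `ensemble_error` is a hypothetical helper name.

```python
from scipy.stats import binom


def ensemble_error(num_models, error_rate):
    """P(more than half of the models are wrong) under simple majority voting."""
    # binom.sf(x, n, p) is P(X > x); x = num_models // 2 is the largest non-majority count
    return binom.sf(num_models // 2, num_models, error_rate)


sizes = [1, 11, 21, 101, 1001]  # odd sizes, so ties cannot occur
errors = {n: ensemble_error(n, 0.49) for n in sizes}
for n, err in errors.items():
    print(f"{n:>5} models -> ensemble error {err:.4f}")
```

With an error rate just below 0.5, the ensemble error shrinks as independent models are added, but slowly, because each individual model is only slightly better than chance.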