Now that you have some code to create your own populations, sample them, and compare the samples to the populations, it's time to experiment. Using your own Jupyter notebook, or a copy of the notebook from the previous assignment, reproduce the pop1 and pop2 populations and samples using NumPy's binomial function. Specifically, create two binomially distributed populations with n equal to 10 and size equal to 10000. The probability p should be 0.2 for pop1 and 0.5 for pop2 (this is the success probability of each trial, not the p-value of a hypothesis test). Using a sample size of 100, calculate the means and standard deviations of your samples.

For each of the following tasks, first write what you expect will happen, then code the changes and observe what does happen. Discuss the results with your mentor.

- Increase the size of your samples from 100 to 1000, then calculate the means and standard deviations for your new samples and create histograms for each. Repeat this again, decreasing the size of your samples to 20. What values change, and what remain the same?

- Change the probability value (p in the NumPy documentation) for pop1 to 0.3, then take new samples and compute the t-statistic and p-value. Then change the probability value p for pop1 to 0.4, and do it again. What changes, and why?

- Change the distribution of your populations from binomial to a distribution of your choice. Do the sample mean values still accurately represent the population values?


In [1]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
%matplotlib inline

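# Two binomial populations: 10 trials each, 10000 values, success probability p = 0.2 and 0.5
pop1 = np.random.binomial(10, 0.2, 10000)
pop2 = np.random.binomial(10, 0.5, 10000)

# Draw 100 values from each population, with replacement
sample1 = np.random.choice(pop1, 100, replace=True)
sample2 = np.random.choice(pop2, 100, replace=True)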

print("Sample Pop 1 mean: ", sample1.mean())
print("Sample Pop 2 mean: ", sample2.mean())
print("Sample Pop 1 standard deviation: ", sample1.std())
print("Sample Pop 2 standard deviation: ", sample2.std())

Sample Pop 1 mean:  2.0
Sample Pop 2 mean:  5.14
Sample Pop 1 standard deviation:  1.2083045973594573
Sample Pop 2 standard deviation:  1.549322432549145
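
To judge how close these sample values are, it helps to know what they should be close to. A binomial population has mean n·p and standard deviation sqrt(n·p·(1 − p)), so pop1 should sit near 2.0 and 1.26 and pop2 near 5.0 and 1.58. The sketch below puts the theoretical, population, and sample values side by side. The seeded `default_rng` generator and the `results` dictionary are choices made for this illustration, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only so the run is repeatable

n, size = 10, 10000
results = {}
for name, p in [("pop1", 0.2), ("pop2", 0.5)]:
    pop = rng.binomial(n, p, size)
    sample = rng.choice(pop, 100, replace=True)
    results[name] = {
        "theory_mean": n * p,                    # binomial mean
        "theory_sd": np.sqrt(n * p * (1 - p)),   # binomial standard deviation
        "pop_mean": pop.mean(),
        "pop_sd": pop.std(),
        "sample_mean": sample.mean(),
        "sample_sd": sample.std(ddof=1),         # ddof=1: estimate from a sample
    }
    print(name, {k: round(float(v), 3) for k, v in results[name].items()})
```

The population values should match the theory closely because the population has 10000 members, while the sample of 100 will wander further from both.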


Increase the size of your samples from 100 to 1000, then calculate the means and standard deviations for your new samples and create histograms for each. Repeat this again, decreasing the size of your samples to 20. What values change, and what remain the same?

**Prediction:** When the sample size increases to 1000, I predict the values for Population 1 will stay about the same. However, the values for Population 2 may differ slightly because it has a higher probability p. When the sample size decreases to 20, I predict the values for both samples will differ more because the samples contain too few observations.

In [2]:
sample1 = np.random.choice(pop1, 1000, replace=True)
sample2 = np.random.choice(pop2, 1000, replace=True)

print("Sample Pop 1 mean: ", sample1.mean())
print("Sample Pop 2 mean: ", sample2.mean())
print("Sample Pop 1 standard deviation: ", sample1.std())
print("Sample Pop 2 standard deviation: ", sample2.std())

Sample Pop 1 mean:  1.987
Sample Pop 2 mean:  4.945
Sample Pop 1 standard deviation:  1.2848466834607153
Sample Pop 2 standard deviation:  1.5033213229379807
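
The task also asks for a histogram of each sample, which these cells do not draw. A minimal sketch of one way to do it follows, assuming matplotlib is available as in the first cell. It rebuilds the populations with a seeded generator so that it runs on its own; the bin edges, colors, and layout are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # arbitrary seed for repeatability
pop1 = rng.binomial(10, 0.2, 10000)
pop2 = rng.binomial(10, 0.5, 10000)

sample_sizes = [20, 100, 1000]
bins = np.arange(-0.5, 11.5, 1)  # one bin centered on each integer 0..10

fig, axes = plt.subplots(1, len(sample_sizes), figsize=(12, 3), sharex=True)
for ax, k in zip(axes, sample_sizes):
    s1 = rng.choice(pop1, k, replace=True)
    s2 = rng.choice(pop2, k, replace=True)
    # density=True puts samples of different sizes on the same vertical scale
    ax.hist(s1, bins=bins, alpha=0.5, density=True, label="sample 1")
    ax.hist(s2, bins=bins, alpha=0.5, density=True, label="sample 2")
    ax.set_title(f"sample size {k}")
    ax.set_xlabel("successes out of 10")
axes[0].legend()
plt.tight_layout()
plt.show()
```

With 20 observations the bars tend to look ragged, and with 1000 they should settle close to the shape of the population.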


In [3]:
sample1 = np.random.choice(pop1, 20, replace=True)
sample2 = np.random.choice(pop2, 20, replace=True)

print("Sample Pop 1 mean: ", sample1.mean())
print("Sample Pop 2 mean: ", sample2.mean())
print("Sample Pop 1 standard deviation: ", sample1.std())
print("Sample Pop 2 standard deviation: ", sample2.std())

Sample Pop 1 mean:  1.95
Sample Pop 2 mean:  4.95
Sample Pop 1 standard deviation:  1.16081867662439
Sample Pop 2 standard deviation:  1.627114009527298
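
A single sample at each size says little about which values change and which stay the same, because one draw can land close to the population mean by chance (as the sample of 20 above did). The sketch below repeats the draw many times at each size. It assumes 2000 repetitions and a seeded generator, both arbitrary choices, and compares the spread of the sample means with the standard error sigma / sqrt(k).

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for repeatability
pop2 = rng.binomial(10, 0.5, 10000)
sigma = pop2.std()  # population standard deviation

summary = {}
for k in (20, 100, 1000):
    # 2000 independent samples of size k, one per row
    draws = rng.choice(pop2, size=(2000, k), replace=True)
    means = draws.mean(axis=1)
    sds = draws.std(axis=1, ddof=1)
    # (observed spread of the sample means, predicted standard error, average sample SD)
    summary[k] = (means.std(), sigma / np.sqrt(k), sds.mean())
    print(f"k={k:4d}  spread of means={summary[k][0]:.3f}  "
          f"sigma/sqrt(k)={summary[k][1]:.3f}  average sample SD={summary[k][2]:.3f}")
```

The average sample standard deviation should stay near the population value at every size, while the spread of the sample means should shrink roughly in proportion to 1 / sqrt(k).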


Change the probability value (p in the NumPy documentation) for pop1 to 0.3, then take new samples and compute the t-statistic and p-value. Then change the probability value p for pop1 to 0.4, and do it again. What changes, and why?

In [18]:
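from scipy.stats import ttest_ind

# Rebuild pop1 with success probability p = 0.3
pop1 = np.random.binomial(10, 0.3, 10000)
sample1 = np.random.choice(pop1, 100, replace=True)
print("Sample mean: ", sample1.mean())
print("Sample standard deviation: ", sample1.std())

# Note: this second sample is also drawn from pop1, not pop2,
# so the test below compares two samples from the same population.
sample2 = np.random.choice(pop1, 100, replace=True)
print("Sample 2 mean: ", sample2.mean())
print("Sample 2 standard deviation: ", sample2.std())

# Manual t-value: difference in means divided by its standard error.
# .std() defaults to ddof=0, so this differs slightly from ttest_ind, which uses ddof=1.
diff = sample2.mean() - sample1.mean()
size = np.array([len(sample1), len(sample2)])
sd = np.array([sample1.std(), sample2.std()])
diff_se = (sum(sd ** 2 / size)) ** 0.5
print("t-value: ", diff / diff_se)

# Welch's t-test (unequal variances) from SciPy
print(ttest_ind(sample2, sample1, equal_var=False))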

Sample mean:  3.09
Sample standard deviation:  1.3863260799682016
Sample 2 mean:  2.97
Sample 2 standard deviation:  1.431467778191322
t-value:  -0.6021868984708987
Ttest_indResult(statistic=-0.599168398768744, pvalue=0.5497462357534663)
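
The hand-computed t-value (-0.6022) and the SciPy statistic (-0.5992) differ slightly. The likely reason is that NumPy's `.std()` divides by n by default (`ddof=0`), while `ttest_ind` uses the sample variance, which divides by n − 1. With two samples of 100, that scales the statistic by sqrt(99/100) ≈ 0.995, which matches the two printed values. The sketch below checks this on fresh data; `welch_t` is a helper written for this illustration, and the seed is arbitrary.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)  # arbitrary seed for repeatability
pop = rng.binomial(10, 0.3, 10000)
a = rng.choice(pop, 100, replace=True)
b = rng.choice(pop, 100, replace=True)

def welch_t(x, y, ddof):
    """t-statistic for mean(x) - mean(y) using unpooled variances."""
    se = np.sqrt(x.var(ddof=ddof) / len(x) + y.var(ddof=ddof) / len(y))
    return (x.mean() - y.mean()) / se

t_ddof0 = welch_t(b, a, ddof=0)  # variances divided by n, as .std() does by default
t_ddof1 = welch_t(b, a, ddof=1)  # variances divided by n - 1
t_scipy = ttest_ind(b, a, equal_var=False).statistic

print("manual, ddof=0:", t_ddof0)
print("manual, ddof=1:", t_ddof1)
print("scipy         :", t_scipy)
```

Passing `ddof=1` to `.std()` in the notebook's own calculation should therefore bring the manual t-value in line with SciPy's.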


In [19]:
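from scipy.stats import ttest_ind

# Rebuild pop1 with success probability p = 0.4
pop1 = np.random.binomial(10, 0.4, 10000)
sample1 = np.random.choice(pop1, 100, replace=True)
print("Sample Pop 1 mean: ", sample1.mean())
print("Sample Pop 1 standard deviation: ", sample1.std())

# Note: this second sample is also drawn from pop1, not pop2,
# so the test below compares two samples from the same population.
sample2 = np.random.choice(pop1, 100, replace=True)
print("Sample 2 mean: ", sample2.mean())
print("Sample 2 standard deviation: ", sample2.std())

# Manual t-value: difference in means divided by its standard error.
# .std() defaults to ddof=0, so this differs slightly from ttest_ind, which uses ddof=1.
diff = sample2.mean() - sample1.mean()
size = np.array([len(sample1), len(sample2)])
sd = np.array([sample1.std(), sample2.std()])
diff_se = (sum(sd ** 2 / size)) ** 0.5
print("t-value: ", diff / diff_se)

# Welch's t-test (unequal variances) from SciPy
print(ttest_ind(sample2, sample1, equal_var=False))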

Sample Pop 1 mean:  3.97
Sample Pop 1 standard deviation:  1.5326773959317077
Sample 2 mean:  4.05
Sample 2 standard deviation:  1.55161206491829
t-value:  0.3668104259367122
Ttest_indResult(statistic=0.364971765606757, pvalue=0.715522079823731)


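The means and standard deviations of the two samples from the first population, with probability p = 0.3, differ less from each other each time a random sample is drawn. According to the Central Limit Theorem, this means the changes in [...] In comparison, the second population has probability p = 0.4, and each time samples are drawn, the means and standard deviations differ more. In both cells the two samples come from the same population, so the t-statistics (about -0.60 and 0.36) and p-values (about 0.55 and 0.72) reflect sampling variability alone, and neither indicates a significant difference.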
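
The two cells above draw both samples from pop1. If the task is read as comparing a sample from pop1 with a sample from pop2 (p = 0.5), the comparison could be sketched as follows. This is one interpretation of the task, not the notebook's own result; the seed and the `tests` dictionary are assumptions made for the example.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)  # arbitrary seed for repeatability
pop2 = rng.binomial(10, 0.5, 10000)
sample2 = rng.choice(pop2, 100, replace=True)

tests = {}
for p in (0.2, 0.3, 0.4):
    pop1 = rng.binomial(10, p, 10000)
    sample1 = rng.choice(pop1, 100, replace=True)
    # Welch's t-test: does not assume the two populations share a variance
    res = ttest_ind(sample2, sample1, equal_var=False)
    tests[p] = (res.statistic, res.pvalue)
    print(f"pop1 p={p}: t={res.statistic:.2f}, p-value={res.pvalue:.3g}")
```

As p for pop1 moves toward 0.5, the population means (n·p) move closer together, so the t-statistic should shrink and the p-value should grow.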

Change the distribution of your populations from binomial to a distribution of your choice. Do the sample mean values still accurately represent the population values?
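
One way to approach this task is sketched below, assuming exponential populations with scales 2 and 5. The distribution, its parameters, the seed, and the `gaps` dictionary are all choices made for this illustration, and any other distribution from `numpy.random` could be substituted.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed for repeatability

# Skewed, continuous populations; an exponential's mean equals its scale
pops = {
    "pop1": rng.exponential(scale=2.0, size=10000),
    "pop2": rng.exponential(scale=5.0, size=10000),
}

gaps = {}
for name, pop in pops.items():
    for k in (20, 100, 1000):
        sample = rng.choice(pop, k, replace=True)
        # Gap between the sample mean and the mean of the population it came from
        gaps[(name, k)] = sample.mean() - pop.mean()
        print(f"{name}, sample size {k:4d}: population mean={pop.mean():.3f}, "
              f"sample mean={sample.mean():.3f}, gap={gaps[(name, k)]:+.3f}")
```

Even though these populations are far from symmetric, the gaps should generally shrink as the sample size grows, which is the behavior the Central Limit Theorem describes for sample means.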