# Multiple comparisons problem
The multiple comparisons problem arises when a researcher repeatedly tests different variables or samples against one another for statistical significance. With enough tests, some results will cross the significance threshold by random chance alone, even when no real relationship exists.
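
To put a rough number on this, the sketch below computes the chance of seeing at least one false positive across several tests. It assumes the tests are independent and that every null hypothesis is true; the helper name `familywise_error_rate` is made up for illustration.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive across n_tests
    independent tests when every null hypothesis is true."""
    # Each test avoids a false positive with probability (1 - alpha),
    # so all n_tests avoid one with probability (1 - alpha) ** n_tests.
    return 1.0 - (1.0 - alpha) ** n_tests


for m in (1, 10, 100, 1000):
    rate = familywise_error_rate(0.05, m)
    print(f"{m:>4} tests: P(at least one false positive) = {rate:.4f}")
```

Under these assumptions, the chance of at least one spurious result is already about 40% with only 10 tests at alpha = 0.05.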

In this exercise you'll work with salary data for employees of the City of Austin, TX. You will compare their salaries against randomly generated data and count how often the random data appears "significant" in explaining those salaries. Any such "significance" is spurious, since random numbers can't explain anything about salaries!

In [23]:
import pandas as pd
import numpy as np
import scipy.stats as stats

In [24]:
df = pd.read_csv('austin_salaries_asset.csv')
df.head()

Unnamed: 0,Title,Gender,Ethnicity,Annual Salary,Years of Employment
0,Administrative Specialist,F,White,51542.4,6
1,Administrative Specialist,M,Black or African American,48235.2,11
2,Administrative Specialist,F,Hispanic or Latino,51542.4,14
3,Administrative Specialist,F,Hispanic or Latino,48235.2,3
4,"MuniProg, Paraprofessional",F,White,50668.8,2


In [25]:
df_police_officer = df[df['Title']=='Police Officer']

In [26]:
n_rows = df_police_officer.shape[0]
n_significant = 0
p_values = []

# Correlate salaries with fresh random noise 1000 times and record each p-value.
for _ in range(1000):
    random_nums = np.random.uniform(size=n_rows)
    r, p_value = stats.pearsonr(df_police_officer['Annual Salary'], random_nums)
    p_values.append(p_value)

    # Any rejection here is a false positive: the noise is unrelated to salary.
    if p_value < 0.05:
        n_significant += 1

print(f'The number of tests that reject the null hypothesis: {n_significant}')

The number of tests that reject the null hypothesis: 55
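
A count of 55 is close to what chance alone predicts. As a back-of-the-envelope check, the sketch below treats the 1000 tests as independent, each with a 5% false-positive probability, so the number of rejections is approximately binomial. That independence and the exact 5% rate are simplifying assumptions, and the variable names are illustrative.

```python
import math

n_tests = 1000   # number of random columns tested in the loop above
alpha = 0.05     # per-test significance threshold
observed = 55    # rejections reported in the output above

# Mean and standard deviation of a Binomial(n_tests, alpha) count.
expected_false_positives = n_tests * alpha
std_dev = math.sqrt(n_tests * alpha * (1 - alpha))

# How many standard deviations the observed count sits from the expectation.
z = (observed - expected_false_positives) / std_dev

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"Standard deviation:       {std_dev:.2f}")
print(f"Observed count z-score:   {z:.2f}")
```

The observed count falls within one standard deviation of the expected 50, which is consistent with every one of these "significant" results being a false positive.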


## Bonferroni correction
You've seen that comparing many different datasets, even randomly generated ones, can produce "statistically significant" relationships that are anything but! One way around this is to tighten the significance level (alpha) used for each test: the Bonferroni correction divides alpha by the number of tests performed. In this exercise you'll explore why you should apply this correction and how to do so.

In [27]:
# Apply the Bonferroni correction: divide alpha by the number of tests
alpha = 0.05 / len(p_values)
p_values = np.array(p_values)

print('The number of tests that reject the null hypothesis after applying the Bonferroni correction:', np.sum(p_values < alpha))

The number of tests that reject the null hypothesis after applying the Bonferroni correction: 0
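
The cell above applies the plain Bonferroni correction, which compares every p-value with the same threshold. Holm's step-down variant (often called Holm-Bonferroni) controls the same error rate but is never less powerful. Below is a minimal pure-Python sketch of it; the function name `holm_reject` and the example p-values are made up for illustration and do not come from the salary data.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans, True where Holm's step-down
    procedure rejects the corresponding null hypothesis."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared with alpha / m, the next with
        # alpha / (m - 1), and so on: the threshold loosens at each step.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # first failure: retain this and all larger p-values
    return reject


example = [0.001, 0.04, 0.03, 0.015]  # hypothetical p-values
bonferroni = [p <= 0.05 / len(example) for p in example]

print("Bonferroni:", bonferroni)
print("Holm:      ", holm_reject(example))
```

On these made-up values Holm rejects two hypotheses where plain Bonferroni rejects one. On the 1000 random-noise tests above, both would be expected to reject almost nothing.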


Beyond knowing about statistical tools, it's important to know exactly when to apply them. The role of the Bonferroni correction is to limit the chance that spurious correlations are deemed statistically significant when many tests are run on the same data. With that in mind, you know exactly when and where to use it!