I'm interested in finding out if there's a relationship between having a programming background and having taken statistics. First, though, I'll need to read in my data.

In [1]:
# import our libraries
import scipy.stats # statistics
import pandas as pd # dataframe

# read in our data
surveyData = pd.read_csv("../input/anonymous-survey-responses.csv")

Now let's do a chi-square test! The chisquare function from scipy.stats will only do a one-way comparison, so let's start with that.

In [2]:
# first let's do a one-way chi-squared test for stats background
scipy.stats.chisquare(surveyData["Have you ever taken a course in statistics?"].value_counts())

Power_divergenceResult(statistic=108.50120096076861, pvalue=2.7495623442639547e-24)

Statistic here is the chi-square value (larger = more difference from a uniform distribution) and pvalue is the p-value, which is very low here.
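To make "difference from a uniform distribution" concrete, here is a minimal sketch that computes the chi-square statistic by hand and compares it with what `scipy.stats.chisquare` returns. The counts in `observed` are made up for illustration and are not taken from the survey data.

```python
import numpy as np
import scipy.stats

# hypothetical answer counts (NOT from the survey data)
observed = np.array([50, 30, 20])

# with no f_exp argument, chisquare assumes every category is equally likely,
# so each expected count is the total divided by the number of categories
expected = np.full(len(observed), observed.sum() / len(observed))

# chi-square statistic: sum of (observed - expected)^2 / expected
manual_statistic = ((observed - expected) ** 2 / expected).sum()

# the same test done by scipy
result = scipy.stats.chisquare(observed)

print(manual_statistic, result.statistic, result.pvalue)
```

The two statistics should match. The further the observed counts are from equal shares, the larger the statistic and the smaller the p-value.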

In [3]:
# next let's do a one-way chi-squared test for programming background
scipy.stats.chisquare(surveyData["Do you have any previous experience with programming?"].value_counts())

Power_divergenceResult(statistic=906.20016012810243, pvalue=7.5559148788603605e-195)

And, again, our p-value is very low. This means that it is very unlikely, for both these questions, that the people who answered them are drawn from a pool of people who are equally likely to have chosen each answer.

Now let's do a two-way comparison. Is there a relationship between having a programming background and having taken statistics?

In [4]:
# now let's do a two-way chi-square test. Is there a relationship between programming background 
# and stats background?

contingencyTable = pd.crosstab(surveyData["Do you have any previous experience with programming?"],
                              surveyData["Have you ever taken a course in statistics?"])

scipy.stats.chi2_contingency(contingencyTable)

(16.827631021435366,
 0.03195483698199162,
 8,
 array([[  94.48839071,  204.47878303,  162.03282626],
        [   0.40992794,    0.88710969,    0.70296237],
        [  43.45236189,   94.0336269 ,   74.51401121],
        [ 108.22097678,  234.19695757,  185.58206565],
        [   9.42834267,   20.40352282,   16.16813451]]))

Here, the first value (16.827) is the $\chi^2$ value, the second value (0.032) is the p-value and the third value (8) is the degrees of freedom. Since our p-value is under our alpha of 0.05, we can say that it seems unlikely that there *isn't* a connection between these two things, right?
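As a rough sketch of how those values can be pulled apart in code, the result of `chi2_contingency` can be unpacked into four names. The table below is hypothetical (it only shares the 5 x 3 shape of the table above), and the variable names are my own.

```python
import numpy as np
import scipy.stats

# hypothetical 5 x 3 table of counts (NOT the survey data)
table = np.array([[20, 30, 25],
                  [15, 25, 20],
                  [30, 20, 10],
                  [10, 15, 25],
                  [25, 10, 20]])

# unpack: chi-square value, p-value, degrees of freedom, expected counts
chi2, p_value, dof, expected = scipy.stats.chi2_contingency(table)

# degrees of freedom for a two-way table: (rows - 1) * (columns - 1)
n_rows, n_cols = table.shape
print(dof, (n_rows - 1) * (n_cols - 1))

# compare the p-value against an alpha of 0.05
alpha = 0.05
print(p_value < alpha)
```

For a table with 5 rows and 3 columns, the degrees of freedom come out as (5 - 1) * (3 - 1) = 8, which matches the third value in the output above.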

BUT! Because we have performed three tests of statistical significance, we need to correct for the fact that the probability that we're going to get a significant effect just by chance increases with each test. (If you set your alpha to 0.05, you'll get a false positive just by chance about 1 time in 20 when there is no real effect, so if you perform 20 tests you're very likely to get the wrong answer on at least one of them & you need to correct for that.) We can do this by dividing our alpha by x, where x is the number of tests we have performed (this is known as the Bonferroni correction). So in this case, our p-value would have to be below 0.05 / 3 ≈ 0.0167 to have an overall alpha of 0.05.
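The arithmetic behind dividing alpha by the number of tests (commonly called a Bonferroni correction) can be sketched as follows. The p-value is copied from the two-way test output above. The variable names are my own, and the 20-test figure assumes the tests are independent.

```python
alpha = 0.05   # overall significance level we want
n_tests = 3    # two one-way tests plus one two-way test

# corrected per-test threshold: alpha divided by the number of tests
corrected_alpha = alpha / n_tests

# p-value reported by chi2_contingency in the output above
p_value = 0.03195483698199162

# significant only if the p-value is below the corrected threshold
is_significant = p_value < corrected_alpha
print(corrected_alpha, is_significant)

# chance of at least one false positive across 20 independent tests
# when each test uses an uncorrected alpha of 0.05
familywise_20 = 1 - (1 - alpha) ** 20
print(familywise_20)
```

With 20 uncorrected tests, the chance of at least one false positive is well over half, which is why the threshold for each individual test has to be tightened.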

TL;DR: because we did three tests, this final result (p = 0.032) is not significant at an overall alpha of 0.05.