2018-10-10 In-class notes - Data in Social Context

**What does statistical thinking look like and how do I interpret statistical output in Python?**

*The goal of today's class is to better understand the difference between simplistic data-driven thinking and statistical thinking. In a nutshell, the type of statistics I mean here are meant to help answer the questions "Is it real?" or "Does it matter?"*

*Suppose the top two students in a class have a 98.1% semester grade and a 98.2% semester grade respectively. Which one is the better student? Simplistic data-driven thinking says Student 2 is better because they have a higher grade. But any number of factors could have produced the difference, from the timing of assignments for other classes to the relative weighting of skills associated with each student's strengths. Few people would argue that Student 2 is conclusively a stronger student.*

*Statistics provide a mathematically-informed way to make such decisions. In today's lesson, we'll estimate the likely range of the population mean score, if other similar students had completed the same assignment, and test whether students who like N Sync are actually more successful in their assignments than those who prefer Backstreet Boys.*

In [2]:
#import pandas
import pandas as pd

In [3]:
#read data in from github (no download)
indata = pd.read_excel('https://github.com/ndporter/pythonDiSC/raw/master/Foundations_grades_F18_anon.xlsx')

In [4]:
#look at what you've got
indata.head()

Unnamed: 0,classNum,ID,nsLong,bsbLong,compLong,songList,noXmas,for,if,elif,...,time,bonus,total,minSpent,bestBand,hot,qual,prod,members,other
0,class1,1,2.0,5.0,3.0,2.0,0.0,1.0,1.0,1.0,...,5.0,0.0,35.0,300.0,,,,,,
1,class1,2,5.0,5.0,5.0,5.0,3.0,3.0,5.0,5.0,...,5.0,5.0,101.0,60.0,B,1.0,0.0,0.0,0.0,0.0
2,class1,3,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,101.0,60.0,N,0.0,1.0,0.0,0.0,0.0
3,class1,4,5.0,5.0,5.0,5.0,5.0,5.0,5.0,3.0,...,5.0,5.0,103.0,30.0,B,0.0,0.0,1.0,0.0,0.0
4,class1,5,4.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,...,5.0,5.0,102.0,120.0,B,0.0,1.0,0.0,0.0,0.0


In [5]:
#remove missing data - but not all, only cases that don't have an overall score
valid = indata.dropna(subset=['total'])
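
To see why the `subset` argument matters, here is a minimal sketch on a made-up four-row table (the `toy` data is hypothetical, not taken from the class spreadsheet). Plain `dropna()` throws away every row with any missing value, while `dropna(subset=['total'])` only throws away rows that lack a total score.

```python
import pandas as pd

# Hypothetical toy data: one row is missing bestBand, one is missing total
toy = pd.DataFrame({
    "total": [35.0, 101.0, None, 99.0],
    "bestBand": [None, "B", "N", "N"],
})

# Drops any row with a missing value in ANY column
all_complete = toy.dropna()

# Drops only rows where 'total' is missing; a missing bestBand is tolerated
has_total = toy.dropna(subset=["total"])

print(len(all_complete), len(has_total))
```

The student with a score but no band preference stays in `has_total`, which is what we want when the question is only about scores.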

**A good first question: what's the average score in the class?**

To find the mean, describe the relevant column (after dealing with missing data). The other lines mean the following:
- *count* number of cases/rows included in analysis
- *mean* mean value (sum of values divided by count)
- *std* standard deviation (a measure of how widely dispersed values are; a low std means most people had similar scores; 21 is quite high for assignment grades)
- *min* lowest value in data - check for unusual values (like negative scores)
- *25/50/75%* quartiles - 25% of students have scores below 77.75, 50% below 95.5, and 75% below 101
- *max* highest value in data - check for unusual values (like scores more than the total possible = 105)

In [6]:
valid['total'].describe()

count     44.000000
mean      85.159091
std       21.341022
min       35.000000
25%       77.750000
50%       95.500000
75%      101.000000
max      105.000000
Name: total, dtype: float64
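
If you want to see where these numbers come from, the sketch below recomputes the main ones with Python's built-in `statistics` module on a short made-up list of scores (the `scores` values are hypothetical). It assumes the pandas defaults: a sample standard deviation and linearly interpolated quartiles, which `method="inclusive"` reproduces.

```python
import statistics

# Hypothetical scores, already sorted for readability
scores = [35, 60, 80, 95, 100, 105]

count = len(scores)
mean = statistics.mean(scores)       # sum of values divided by count
std = statistics.stdev(scores)       # sample standard deviation (divides by count - 1)

# Quartiles with linear interpolation between neighboring values
q25, q50, q75 = statistics.quantiles(scores, n=4, method="inclusive")

print(count, round(mean, 2), round(std, 2))
print(min(scores), q25, q50, q75, max(scores))
```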

**Ok, now we know the average score we saw, but what might the average score be for the same assignment in another similar class?**

*For this question, we need to construct what's called a confidence interval, or CI. A CI gives a range of plausible values for something in the population (all potential cases), estimated from a sample (the group of actual cases we observed); in this case that something is the mean overall score. The population here might be VT students who registered for any section of Data in Social Context, and the sample is students in this section who completed the assignment.*

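*Most scientists use a standard of 95% CIs, that is, a range that we're 95% certain the population mean falls into based on the data. This means we accept a 5% probability that the mean is outside our calculated confidence interval. (More precisely, if we repeated the sampling many times, about 95% of the intervals built this way would contain the population mean.)*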

In [16]:
#We'll need some statistical tools here
#Notice that some modules have multiple layers
#If we just imported statsmodels, we'd have to type statsmodels.stats.weightstats.COMMAND every time
import statsmodels.stats.weightstats as stats

In [9]:
#The zconfint command returns the lower and higher bounds of the CI (and defaults to 95% CI).
stats.zconfint(valid['total'])

(78.853337763333158, 91.464844054848655)

To interpret the values above in words:

*We are 95% certain that the true or population mean score, were every DiSC student to have completed the assignment, would fall between 78.85 and 91.46.*
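
The interval can also be rebuilt by hand from the `describe()` output, which helps show what goes into it. This is a sketch that assumes the usual normal-approximation formula, mean plus or minus z times the standard error; the count, mean, and std are copied from the output earlier in these notes.

```python
import math
from statistics import NormalDist

# Values copied from valid['total'].describe()
n, mean, std = 44, 85.159091, 21.341022

# Standard error: how much the sample mean is expected to vary between samples
se = std / math.sqrt(n)

# z value that leaves 2.5% in each tail of the normal curve (about 1.96)
z = NormalDist().inv_cdf(0.975)

lower, upper = mean - z * se, mean + z * se
print(round(lower, 2), round(upper, 2))  # 78.85 91.46
```

Notice that `n` sits in the denominator of the standard error, so a larger class would give a narrower interval for the same spread of scores.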

**Hey, I wonder if people who think N Sync are better than Backstreet Boys are smarter (and do better work)? Or vice versa?**

*For this question, we need not only to find the means, but to statistically compare them to see if any difference is likely to be due to chance. There will almost always be a difference between the means, even if it's very small, like a fraction of a point. So what's important is to know whether the difference is meaningful.*

*In tests like this, most scientists use a significance threshold of p<0.05, meaning we are willing to accept a 5% or 1 in 20 probability of finding a statistical difference when there isn't a true difference in the population. This limits the number of false positives and is generally a conservative way of reporting findings.*

In [17]:
#First we need to make subsets of the groups we're comparing (people who think N Sync are better or Backstreet Boys are better)
#The first expression can be read "valid where the value of the bestBand variable in valid is the string 'N'."
nsLove=valid[valid['bestBand']=='N']
#We do the same thing for Backstreet Boys. We're ignoring the "Other" and missing answers for now.
bsbLove=valid[valid['bestBand']=='B']
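
The filtering expression does two things at once, which is easier to see when they are split apart. Here is a minimal sketch on made-up data (the `toy` table and the names `is_ns` and `ns_rows` are hypothetical): the inner comparison builds a column of True/False values, and the outer brackets keep only the rows marked True.

```python
import pandas as pd

# Hypothetical toy data; the last student did not answer the band question
toy = pd.DataFrame({
    "total": [101.0, 99.0, 102.0, 35.0],
    "bestBand": ["N", "N", "B", None],
})

# Step 1: compare every value of bestBand to 'N', giving True or False per row
is_ns = toy["bestBand"] == "N"

# Step 2: keep only the rows where the comparison was True
ns_rows = toy[is_ns]

print(is_ns.tolist())
print(ns_rows["total"].tolist())
```

A missing answer is not equal to 'N', so that row is left out of the subset automatically.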

In [11]:
#Scroll over to 'bestBand' and check that it worked
nsLove.head()

Unnamed: 0,classNum,ID,nsLong,bsbLong,compLong,songList,noXmas,for,if,elif,...,time,bonus,total,minSpent,bestBand,hot,qual,prod,members,other
2,class1,3,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,101.0,60.0,N,0.0,1.0,0.0,0.0,0.0
5,class1,6,5.0,5.0,5.0,5.0,5.0,5.0,5.0,3.0,...,5.0,5.0,99.0,60.0,N,0.0,0.0,1.0,0.0,0.0
10,class2,11,5.0,5.0,3.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,103.0,300.0,N,0.0,1.0,0.0,0.0,0.0
12,class2,13,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,98.0,40.0,N,0.0,1.0,0.0,1.0,0.0
19,class2,20,4.0,5.0,5.0,5.0,3.0,5.0,3.0,3.0,...,5.0,5.0,92.0,210.0,N,0.0,0.0,0.0,1.0,0.0


In [12]:
#Let's see the means
nsLove['total'].describe()

count     13.000000
mean      92.923077
std       15.332218
min       48.000000
25%       91.000000
50%       99.000000
75%      101.000000
max      105.000000
Name: total, dtype: float64

In [13]:
bsbLove['total'].describe()

count     25.000000
mean      86.600000
std       19.050372
min       44.000000
25%       78.000000
50%       94.000000
75%      102.000000
max      105.000000
Name: total, dtype: float64

**92.9 certainly looks larger than 86.6, but is it meaningful or the result of random differences in samples?**

*Let's use a T-test (again a procedure in statsmodels) to find out.*

In [14]:
#For reference, this is how you would calculate the difference automatically
nsLove['total'].mean()-bsbLove['total'].mean()


6.3230769230769255

In [15]:
#Test the difference of means using a T-test
import statsmodels.stats.weightstats as stats
stats.ttest_ind(nsLove['total'],bsbLove['total'])

(1.0332313152236081, 0.30838715138763295, 36.0)

To interpret the above:
- The first value is the *T-Statistic*, which is used to conduct the test. Values further from zero (in either direction) are associated with more probability of a meaningful difference.
- The **second value, the p-value of the test,** is the key here. If p is less than 0.05, we say there is a statistically significant difference between the groups; i.e. they're unlikely to be random samples drawn from the same population. Here p is about 0.31, well above 0.05, so we cannot conclude that N Sync fans and Backstreet Boys fans really differ in their scores. Remember that with a 0.05 threshold there is still a 1 in 20 probability of finding a difference when none exists; if we lower the threshold to reduce that risk, we increase the probability of not finding a difference even when there is a true one.
- The third value is the degrees of freedom for the test, which is based on the number of cases. More cases provide more degrees of freedom, which allows for more statistical confidence even with the same differences in means. This is why you can't just ask the three people next to you which band is better and assume that more people in the US (or even the class) prefer whichever two of the three of them say.
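
To make the first and third values less mysterious, the sketch below recomputes them from the two `describe()` outputs. It assumes a pooled-variance (equal variance) t-test, which is consistent with the 36 degrees of freedom reported; the counts, means, and stds are copied from the outputs earlier in these notes. The p-value needs the t distribution, so it is left to `ttest_ind`.

```python
import math

# Values copied from nsLove['total'].describe() and bsbLove['total'].describe()
n1, mean1, std1 = 13, 92.923077, 15.332218   # N Sync fans
n2, mean2, std2 = 25, 86.600000, 19.050372   # Backstreet Boys fans

# Degrees of freedom: total cases minus one for each group mean estimated
df = n1 + n2 - 2

# Pooled variance: a weighted average of the two groups' variances
pooled_var = ((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / df

# Standard error of the difference between the two means
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# T-statistic: the difference in means measured in standard errors
t_stat = (mean1 - mean2) / se_diff
print(round(t_stat, 3), df)  # 1.033 36
```

The difference of about 6.3 points is only about one standard error wide, which is why the test does not treat it as convincing.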

**That's it for today. On Friday, we'll talk about other statistical tests to work with more variables or different kinds of variables. Thanks!**