# WISER SIMS Data Extraction Paired T-Test Analysis

This notebook takes a standard SIMS Data Extraction file and runs a paired T-Test on the pre and post quiz scores.
- Author  : John Lutz <lutzjw@upmc.edu>
- Created : 2020-02-18
- Edited : 2020-02-20

## Some Assumptions
- You are using an unaltered SIMS data extraction file
- This is a Paired T-Test, so each student needs to have just 2 tests, one pre test and one post test. 
    - If a student has more than two tests (e.g. June 1 Pre-test, June 7 Pre-test, and June 14 Post-test), the program uses the **last** test of each type (in this example, the June 7 Pre-test and the June 14 Post-test)
- Any answer that contains a 0, or that is -999, is counted as incorrect. This includes multiple choice answers, so an answer of "1,1,0,1" is counted as incorrect
- You have an `Analysis/T-Analysis` directory in the folder where you're running this Jupyter notebook. This is where the analysis files will go.
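
The scoring rule described above can be sketched in a few lines. This is a minimal illustration and not the notebook's own code: the names `is_missing`, `is_correct`, and `quiz_score` are hypothetical, and the sample answers are made up.

```python
import math

def is_missing(cell):
    """True for empty cells (None or NaN), which are left out of the score."""
    return cell is None or (isinstance(cell, float) and math.isnan(cell))

def is_correct(cell):
    """True only when the answer contains no 0 and is not the -999 marker."""
    # Excel often returns 1.0 instead of 1, so drop the trailing ".0" first.
    if isinstance(cell, float):
        cell = int(cell)
    text = str(cell)
    return "0" not in text and text != "-999"

def quiz_score(cells):
    """Fraction of answered cells that are correct, or None if none were answered."""
    answered = [c for c in cells if not is_missing(c)]
    if not answered:
        return None
    return sum(is_correct(c) for c in answered) / len(answered)

# Made-up row: two correct answers, one wrong, one partly wrong multiple choice.
example = quiz_score([1.0, 0.0, "1,1,0,1", float("nan"), 1])
```

Note that a -999 answer still counts toward the denominator; only empty cells are skipped.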

## For John or Kim
- Here is an SQL statement that finds courses with Quiz Data: 

        select c.site_id, c.ABBRV, q.QUIZ_NAME, count(m.QUIZ_MAIN_ID)
          from courses c, CLASSES l, ID0_QUIZ_TYPE q, QUIZ_MAIN m, users u
          where c.COURSE_ID = l.course_id
            and c.SITE_ID = 1
            and c.COURSE_ID = q.COURSE_ID
            and q.QUIZ_TYPE_ID = m.QUIZ_TYPE_ID
            and l.CLASS_ID     = m.CLASS_ID
            and u.USER_ID      = m.USER_ID
            --and l.class_date > sysdate-100
            and l.CLASS_DATE between to_date('2018-01-01', 'YYYY-MM-DD')
                                 and to_date('2020-01-01', 'YYYY-MM-DD')
          group by c.site_id, c.ABBRV, q.QUIZ_NAME
          order by 2,3 DESC

## Instructions

- Change the variables in the section below. The ones you will always need to change are:
    - `file`
        - Get the Data Extraction file from SIMS
            - This is the Excel file straight from SIMS Data Extraction. It's downloaded as User.xlsx. Rename it to something that makes sense. 
            - You need to have selected "Correctness" for the "Quiz Responses" when you generate the file
            - Drag it from your computer and drop it into the file browser in Jupyter (just to the left here)
            - Right click on the file and select "Copy Path" and paste it into the `file` variable below. Remember to enclose it in double quotes.
    - `preQuizFirst`, `preQuizLast`, `postQuizFirst`, and `postQuizLast`
        - These are the first and last Question IDs (e.g. Q23) for the quizzes you want to analyze
        - Make sure there are double quotes around the IDs: "Q23"
        - Don't forget the asterisks, if applicable, for retired questions (e.g. "*Q24")
    - There are a few other variables you can change, but for the most part you will not need to.


- Once you have made your changes, hit the **SHIFT + RETURN** keys together, or the "Play" button in the toolbar to run the analysis
- Scroll to the bottom to see the results
- The results file is `Analysis/T-Analysis/(filename)-T-Test.txt`, where (filename) is the name of the `file` workbook without its extension. If the variable `createScoresFile` is set to True, the notebook also creates `Analysis/T-Analysis/(filename)-Scores.tsv`, a tab separated value file with each student's scores. This is the data that the T-Test was run on.
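
To illustrate how the first and last Question IDs pick out a quiz, here is a minimal sketch. It assumes the Question Dictionary lists the IDs in order; the helper name `question_range` and the sample IDs are hypothetical. It is a simplified slice, so it can behave differently from the notebook's loop when a boundary ID is itself retired and `removeRetired` is True.

```python
def question_range(question_ids, first, last, remove_retired=False):
    """Return the IDs from `first` through `last`, inclusive, in sheet order."""
    start = question_ids.index(first)   # raises ValueError if the ID is misspelled
    end = question_ids.index(last)
    selected = question_ids[start:end + 1]
    if remove_retired:
        # Retired questions are marked with an asterisk, e.g. "*Q24".
        selected = [q for q in selected if "*" not in q]
    return selected

# Made-up Question Dictionary order.
ids = ["Q1", "*Q2", "Q3", "Q4", "Q5"]
pre_cols = question_range(ids, "*Q2", "Q4")
```

Because the lookup is an exact string match, leaving the asterisk off a retired boundary ID means the ID is not found.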

In [1]:
#SIMS Data Extraction file (Copy Path)
file = "WISER/data/NUR1121/NUR-1121-AY-2019.xlsx"
#PRE
preQuizFirst = "*Q120"
preQuizLast  = "Q170"
#POST
postQuizFirst = "*Q69"
postQuizLast  = "Q119"

#Set this to True (Capitalized) if you want to print all of the data at the end to the screen
printGoodScores = False

# Set to True if you want to remove the retired columns from the data (normally you set this to False)
removeRetired = False 

#Set to true to show the records where we made the students dumber
showDumber = True

#Set to true if you want all of the scores dumped to a tab separated file
createScoresFile = True
####################################################################
#  This is the end of the section where you can change variables   #
####################################################################

import pandas as pd
from scipy import stats
#from statistics import stdev
import math
import os

#Add my home directory to the path and get the base file name
file = "~/" + file
base = os.path.basename(file)
fileName=os.path.splitext(base)[0]

#This is to check for multiple choice quizzes with 1,1,0,1 etc. in the cell.
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False
    
#Set up the pandas DataFrame we will use for the analysis....
score_columns = ['pre_date', 'pre_score', 'post_date', 'post_score']
scores = pd.DataFrame(columns=score_columns, dtype=float)
#....and the lists that will hold the Pre and Post quiz columns
preQuizCols  = []
postQuizCols = []

# Open up the Excel File
xl = pd.ExcelFile(file)

#Get the list of Pre Quiz IDs
questionSheet = xl.parse('Question Dictionary')
inQRows = False
#Collect every Q# from preQuizFirst through preQuizLast (inclusive)
for index, row in questionSheet.iterrows() :
    if (removeRetired and row['Q#'].find("*") != -1):
        #ignore this
        print ("Ignoring " +row['Q#'])
    else:
        if (row['Q#'] == preQuizFirst):
            inQRows = True
        elif (row['Q#'] == preQuizLast):
            inQRows = False
            preQuizCols.append(row['Q#'])  #Get the last one
        if (inQRows):
            preQuizCols.append(row['Q#'])  #Collect each question inside the range
        
#Get the list of Post Quiz IDs
inQRows = False
#Collect every Q# from postQuizFirst through postQuizLast (inclusive)
for index, row in questionSheet.iterrows() :
    if (removeRetired and row['Q#'].find("*") != -1):
        #ignore this
        print ("Ignoring " +row['Q#'])
    else: 
        if (row['Q#'] == postQuizFirst):
            inQRows = True
        elif (row['Q#'] == postQuizLast):
            inQRows = False
            postQuizCols.append(row['Q#'])  #Get the last one
        if (inQRows):
            postQuizCols.append(row['Q#'])  #Collect each question inside the range

dataSheet = xl.parse('User') # The User sheet holds the data by default.
for row in dataSheet.iterrows():
    # Get the User's Research ID
    userClass = row[1]['UserID'].split('_') #the data is USERID_CLASSID...
    userID    = userClass[0]                #...so we just need the first part.
    theDate   = row[1]['ClsDate']
    
    # Reset the counters before reading the Pre Quiz data
    preQuizTotal = 0
    dataCnt = 0
    for col in preQuizCols : #Go through all of the Pre Quiz cells for this row.
        cellValue = row[1][col]
        if (not(pd.isnull(cellValue))):     #there is something in there
            dataCnt += 1                    #Add to the score denominator
            if (is_number(cellValue)):      #if it is number
                cellValue = int(cellValue)  #convert a 1.0 to an int to get rid of the trailing zero
            strCellValue = str(cellValue)
            if (strCellValue.find("0")== -1 and strCellValue != "-999"): # Find any 0s or no data (-999) in the Cell value (multiple choice could be "1,1,0,1,1")
                preQuizTotal += 1

    if (dataCnt > 0):
        scores.loc[userID,'pre_date']  = theDate
        scores.loc[userID,'pre_score'] = preQuizTotal/dataCnt

    # Reset the counters before reading the Post Quiz data
    postQuizTotal = 0
    dataCnt = 0
    for col in postQuizCols : #Go through all of the Post Quiz cells for this row.
        cellValue = row[1][col]
        if (not(pd.isnull(cellValue))):     #there is something in there
            dataCnt += 1                    #Add to the score denominator
            if (is_number(cellValue)):      #if it is number
                cellValue = int(cellValue)  #convert a 1.0 to an int to get rid of the trailing zero
            strCellValue = str(cellValue)            
            if (strCellValue.find("0")== -1 and strCellValue != "-999"): # Find any 0s or no data (-999) in the Cell value (multiple choice could be "1,1,0,1,1")
                postQuizTotal += 1
    #if we have data add it to the scores dataFrame
    if (dataCnt > 0):
        scores.loc[userID,'post_date']  = theDate
        scores.loc[userID,'post_score'] = postQuizTotal/dataCnt

#Get rid of incomplete data for each user. 
#This will drop the users that don't have both Pre AND Post scores
#(.copy() so that adding the diff column below does not trigger a SettingWithCopyWarning)
goodScores = scores.dropna().copy()

# Calculate the differences for all of the remaining users.
goodScores['diff'] = goodScores['post_score'] - goodScores['pre_score']

# Run the Paired T-Test
tTest = stats.ttest_rel(goodScores['post_score'], goodScores['pre_score'])

# Print everything out to the file
with open('Analysis/T-Analysis/'+fileName+'-T-Test.txt', 'w') as f:
    print ("Source File : "+file, file=f)
    print ("Removed "+ str(len(scores)-len(goodScores)) +" students that had null data for the pre or post test.", file=f)
    print ("Number of Students : " +str(len(goodScores)), file=f)
    print ("Mean Pre    = {0:6.3f}".format(goodScores['pre_score'].mean()), file=f)
    print ("StdDev Pre  = {0:6.3f}".format(goodScores['pre_score'].std()), file=f)
    print ("Mean Post   = {0:6.3f}".format(goodScores['post_score'].mean()), file=f)
    print ("StdDev Post = {0:6.3f}".format(goodScores['post_score'].std()), file=f)
    print ("Mean Δ      = {0:6.3f}".format(goodScores['diff'].mean()), file=f)
    print ("T-Test      = {0:9.6f}".format(tTest.statistic), file=f)
    print ("P Value     = {0:9.6f}".format(tTest.pvalue), file=f)
    if (tTest.pvalue < 0.00001): #Print the raw data if we have a tiny P value
        print (tTest, file=f)

# Also print everything out to the screen
print ("Output File : Analysis/T-Analysis/"+fileName+"-T-Test.txt")
print ("Source File : " +file)
print ("Removed "+ str(len(scores)-len(goodScores)) +" students that had null data for the pre or post test.")
print ("Number of Students : " +str(len(goodScores)))
print ("Mean Pre    = {0:6.3f}".format(goodScores['pre_score'].mean()))
print ("StdDev Pre  = {0:6.3f}".format(goodScores['pre_score'].std()))
print ("Mean Post   = {0:6.3f}".format(goodScores['post_score'].mean()))
print ("StdDev Post = {0:6.3f}".format(goodScores['post_score'].std()))
print ("Mean Δ      = {0:6.3f}".format(goodScores['diff'].mean()))
print ("T-Test      = {0:9.6f}".format(tTest.statistic))
print ("P Value     = {0:9.6f}".format(tTest.pvalue))
if (tTest.pvalue < 0.00001): #Print the raw data if we have a tiny P value
    print (tTest)

if (printGoodScores):
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
        print(goodScores)

if (showDumber) :
    cnt = 0
    print ("We made some students dumber:")
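    for x in goodScores.iterrows() :
        # Look the values up by column name; positional integer lookups on a labeled row are deprecated in newer pandas
        if (x[1]['diff'] < 0) :
            print ("Pre : %5.2f\tPost : %5.2f\t Diff : %5.2f" % (x[1]['pre_score'], x[1]['post_score'], x[1]['diff']))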
            cnt += 1
    print (str(cnt) +" total.")

if (createScoresFile):
    goodScores.to_csv('Analysis/T-Analysis/' +fileName+'-Scores.tsv', sep='\t')

Output File : Analysis/T-Analysis/NUR-1121-AY-2019-T-Test.txt
Source File : ~/WISER/data/NUR1121/NUR-1121-AY-2019.xlsx
Removed 45 students that had null data for the pre or post test.
Number of Students : 178
Mean Pre    =  0.709
StdDev Pre  =  0.162
Mean Post   =  0.895
StdDev Post =  0.102
Mean Δ      =  0.186
T-Test      = 17.609859
P Value     =  0.000000
Ttest_relResult(statistic=17.60985943336383, pvalue=9.223330817579075e-41)
We made some students dumber:
Pre :  0.95	Post :  0.90	 Diff : -0.05
Pre :  1.00	Post :  0.95	 Diff : -0.05
Pre :  0.50	Post :  0.45	 Diff : -0.05
Pre :  0.80	Post :  0.70	 Diff : -0.10
Pre :  0.85	Post :  0.80	 Diff : -0.05
Pre :  0.85	Post :  0.75	 Diff : -0.10
6 total.
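
For reference, the `T-Test` value that `stats.ttest_rel` reports is the standard paired t statistic: the mean of the per-student differences divided by the standard error of that mean. The sketch below shows the textbook formula using only the Python standard library. The function name `paired_t` is hypothetical and the scores are made up; it is an illustration, not a replacement for the scipy call in the notebook.

```python
import math
from statistics import mean, stdev

def paired_t(pre, post):
    """Paired t statistic: mean(diff) / (sample stdev(diff) / sqrt(n))."""
    diffs = [after - before for before, after in zip(pre, post)]
    n = len(diffs)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error

# Made-up scores for four students.
pre_scores = [0.5, 0.6, 0.7, 0.8]
post_scores = [0.7, 0.7, 0.9, 0.9]
t_value = paired_t(pre_scores, post_scores)
```

A positive value means the post scores were higher on average. The P value additionally depends on the degrees of freedom (n - 1), which this sketch does not compute.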
