# WISER SIMS Data Extraction Question Analysis

This notebook takes a standard SIMS Data Extraction file and, for each quiz question, reports the total number of responses, the numbers of correct and incorrect responses, the percentage correct, and the first and last dates the question was answered.
- Author  : John Lutz <lutzjw@upmc.edu>
- Created : 2020-02-19
- Edited : 2020-02-20

## Some Assumptions
- You are using an unaltered SIMS data extraction file
- Any 0 or -999 in an answer is treated as incorrect. This includes multiple choice, so an answer of "1,1,0,1" is counted as incorrect
- You have an `Analysis/Q-Analysis` directory in the folder where you are running this Jupyter notebook. The analysis files are written there.
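
The correctness rule above can be sketched as a small helper. This is a minimal illustration under the stated assumptions, not the notebook's own code; the name `is_correct` is hypothetical.

```python
def is_correct(cell):
    """Return True only if the answer contains no 0 and is not the no-data marker -999.

    Hypothetical helper for illustration; `cell` may be a number or a string
    such as "1,1,0,1" for a multiple choice answer.
    """
    text = str(cell).strip()
    try:
        # Normalize numeric cells so that 1.0 becomes "1" (no trailing zero).
        text = str(int(float(text)))
    except (ValueError, OverflowError):
        # Not a single number, e.g. "1,1,0,1"; keep the text as-is.
        pass
    return text != "-999" and "0" not in text


print(is_correct(1.0))        # True
print(is_correct("1,1,0,1"))  # False
print(is_correct(-999))       # False
```

Normalizing the number first matters because a float cell such as `1.0` would otherwise be rejected for its trailing zero.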

## Instructions

- Change the variables in the section below. The ones you will need to always change are:
    - `file`
        - Get the Data Extraction file from SIMS
            - This is the Excel file straight from SIMS Data Extraction
            - You need to have selected "Correctness" for the "Quiz Responses" when you generate the file
            - Drag it from your computer and drop it into the file browser in Jupyter (just to the left here)
            - Right-click the data file you want, select "Copy path", and paste the path into the `file` variable below
    - `qFirst` and `qLast`
        - These are the first and last Question IDs (e.g. Q23) for the questions you want to analyze
        - Make sure there are double quotes around the IDs: "Q23"
        - Don't forget the asterisk, if applicable, for retired questions (e.g. "*Q24")
    - `fileLabel`
        - This label is inserted into the name of the tab-separated analysis output file to help you identify it. The file is written to the `Analysis/Q-Analysis` folder.
        - The output file name is derived from the input file name, so if you have `file="myData/folder/data-extraction.xls"` and `fileLabel="-foo-"`, your output file is `"Analysis/Q-Analysis/data-extraction-foo-Q-Analysis.tsv"`

- Once you have made your changes, press **SHIFT-RETURN** to run the analysis
- Scroll to the bottom to see the results
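
As a rough sketch of how `qFirst`, `qLast`, and `fileLabel` are used, assuming the behavior of the analysis cell in this notebook: the question range is inclusive and matched on the exact ID string, and the output name is built from the input file's base name. The helper names `select_questions` and `output_path` are hypothetical and exist only for this illustration.

```python
import os


def select_questions(question_ids, q_first, q_last):
    """Return the IDs from q_first through q_last, inclusive, in sheet order."""
    selected = []
    in_range = False
    for qid in question_ids:
        if qid == q_first:
            in_range = True
        if in_range:
            selected.append(qid)
        if qid == q_last:
            in_range = False
    return selected


def output_path(file, file_label):
    """Build the TSV path from the input file's base name and the label."""
    name = os.path.splitext(os.path.basename(file))[0]
    return "Analysis/Q-Analysis/" + name + file_label + "Q-Analysis.tsv"


ids = ["Q22", "*Q23", "Q24", "Q25"]
print(select_questions(ids, "*Q23", "Q24"))  # ['*Q23', 'Q24']
print(select_questions(ids, "Q23", "Q24"))   # [] because the '*' is missing
print(output_path("myData/folder/data-extraction.xls", "-foo-"))
# Analysis/Q-Analysis/data-extraction-foo-Q-Analysis.tsv
```

The second call shows why the asterisk matters: a retired question listed as `*Q23` is never matched by `"Q23"`, so nothing is selected.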

In [2]:
file = "WISER/data/NUR1121/NUR-1121-AY-2019.xlsx"
#Put the first and last Question IDs
qFirst = "*Q120"  #Don't forget the '*' if needed!
qLast  = "Q170"

#Put the label you want appended to the TSV filename
#It's good to surround it with dashes (e.g. -preQuiz-) so you can read it easily
fileLabel = "-PreQuiz-"

#Set this to True (Capitalized) if you want to print all of the data at the end
printDF = True

# Set to True if you want to remove the retired columns from the data (normally you do not)
removeRetired = False

####################################################################
#  This is the end of the section where you can change variables   #
####################################################################
%matplotlib inline
import pandas as pd
from scipy import stats
import math
import os
from dateutil import parser

#This is to check for multiple choice quizzes with 1,1,0,1 etc. in the cell.
def is_number(s):
    try:
        float(s)
        return True
    except (TypeError, ValueError):
        return False

#Add my home directory to the path and get the base file name
file = os.path.expanduser("~/" + file)
base = os.path.basename(file)
fileName=os.path.splitext(base)[0]

quiz_columns = ['text']
quizCols = pd.DataFrame(columns=quiz_columns)
#set up the pandas DataFrame we will use for the analysis
score_columns = ['q_text', 'responses', 'num_correct', 'num_incorrect', 'first_score', 'last_score']
scores = pd.DataFrame(columns=score_columns, dtype=int)

# Open up the Excel File
xl = pd.ExcelFile(file)
#Use execfile to include this file.

#Get the list of Question headers we want.
questionSheet = xl.parse('Question Dictionary')
inQRows = False
#initialize the scores dataframe with one row per question from qFirst through qLast (inclusive)
for index, row in questionSheet.iterrows() :
    if (row['Q#'] == qFirst):
        inQRows = True

    if (inQRows):
        scores.loc[row['Q#'], 'q_text'] = row['Text']
        scores.loc[row['Q#'], 'responses']     = 0
        scores.loc[row['Q#'], 'num_correct']   = 0
        scores.loc[row['Q#'], 'num_incorrect'] = 0
        #start with sentinel dates so the first real answer date replaces them
        scores.loc[row['Q#'], 'first_score']   = parser.parse('01/01/2100')
        scores.loc[row['Q#'], 'last_score']    = parser.parse('01/01/1900')

    #check qLast after adding the row so the last question is included,
    #even when qFirst and qLast are the same question
    if (row['Q#'] == qLast):
        inQRows = False

dataSheet = xl.parse('User') # The User sheet holds the data by default.
for _, row in dataSheet.iterrows():
    for index, col in scores.iterrows() :
        cellValue = row[index]
        if (not(pd.isnull(cellValue))):     #there is something in there
            if (is_number(cellValue)):      #if it is a number
                cellValue = int(float(cellValue))  #convert 1.0 (or "1.0") to an int to get rid of the trailing zero
            #The date of this row is in the second column; parse it once
            answered = parser.parse(row.iloc[1])
            #Update the first and last dates this question was answered
            if scores.loc[index, 'first_score'] > answered: scores.loc[index, 'first_score'] = answered
            if scores.loc[index, 'last_score']  < answered: scores.loc[index, 'last_score']  = answered
            #count all of the responses for this question
            scores.loc[index, 'responses']   += 1
            #figure out if it is a correct or incorrect answer
            strCellValue = str(cellValue)
            #Any 0, or the no-data marker -999, makes the answer incorrect (multiple choice could be "1,1,0,1,1")
            if ("0" not in strCellValue and strCellValue != "-999"):
                scores.loc[index, 'num_correct'] += 1
            else:
                scores.loc[index, 'num_incorrect'] += 1

#calculate the percentage correct for each question
scores['perc_correct'] = scores['num_correct'] / scores['responses']

print ('Question output File : Analysis/' +fileName+fileLabel+'Q-Analysis.tsv')

if (printDF):
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
         print(scores.sort_values('perc_correct', ascending=True))

print ('Scores output file : Analysis/Q-Analysis/'+fileName+fileLabel+'Q-Analysis.tsv')
scores.to_csv('Analysis/Q-Analysis/' +fileName+fileLabel+'Q-Analysis.tsv', sep='\t')

Question output File : Analysis/NUR-1121-AY-2019-PreQuiz-Q-Analysis.tsv
                                                  q_text  responses  \
*Q131  What condition (ABG) will result in the patien...       76.0   
Q167   If the patient&rsquo;s heart rate increases to...      178.0   
Q139   If you patient&rsquo;s heart rate fails to res...      178.0   
Q130   According to AHA CPR guidelines what is the ra...      102.0   
Q138   Amiodarone is the first line drug for which ca...      102.0   
Q160   An appropriate treatment for the rhythm in #13...      178.0   
Q157   What is the first line treatment for the rhyth...      178.0   
Q165                          Identify the rhythm below.      178.0   
Q170   An appropriate treatment for the rhythm in #19 is      178.0   
*Q137  Amiodarone is the first line drug for ventricu...       76.0   
Q151                       Transcutaneous pacing is used      178.0   
Q164   An appropriate treatment for the rhythm in #15...      178.0   
Q127 