# Load libraries

In [125]:
## For some reason, when loading mlogit, the notebook can't find the package 'statmod', so I specify its location
library(statmod, lib.loc='D:\\Applications\\Anaconda2\\pkgs\\r-statmod-1.4.30-r3.4.1_0\\lib\\R\\library\\')
require(mlogit)
require(ggplot2)
require(reshape2)
require(lme4)
require(compiler)
require(parallel)
require(car)
require(boot)
require(dplyr)
require(sjstats)
require(broom)

# Load data and set factors

In [126]:
mydata <- read.csv("C:\\Users\\Sarah\\Documents\\Personal Content\\Lab_study_data\\all_massaged_data\\dataframe_all_factors_for_analysis.txt",sep = '\t')
# sid is the student number
mydata$sid <- factor(mydata$sid)
mydata$sim_index <- factor(mydata$sim_index)
mydata$lab_experience <- factor(mydata$lab_experience)
mydata$similar_sim <- factor(mydata$similar_sim)
mydata$cvs_graph <- factor(mydata$cvs_graph)
mydata$cvs_table <- factor(mydata$cvs_table)
mydata$main <- factor(mydata$main)
mydata$pre <- factor(mydata$pre)

In [127]:
head(mydata)
colnames(mydata)

sid,sim,variable,pre,main,cvs_graph,cvs_table,qual_score,quant_score,activity_order,...,pre_with_ident,main_with_ident,CVS_context,use_table,use_graph,use_concentration,use_width,use_area,use_separation,use_all_vars
10127163,L,Concentration,0,2,1,1,1,1,LC,...,1,3,2,1,1,1,1,1,1,4
10127163,L,Width,0,2,1,1,1,1,LC,...,1,3,2,1,1,1,1,1,1,4
10127163,C,Area,2,2,1,1,1,1,LC,...,3,3,2,1,1,1,1,1,1,4
10127163,C,Separation,2,2,1,1,1,1,LC,...,3,3,2,1,1,1,1,1,1,4
10232160,L,Concentration,0,0,1,1,1,1,LC,...,1,1,2,1,1,1,1,1,1,4
10232160,L,Width,0,0,0,0,1,1,LC,...,1,1,0,1,1,1,1,1,1,4


# Stat model 1: Predicting main model scores

We try to predict the HIGHEST type of model in the main worksheet (0, 1, or 2 for neither, qualitative, or quantitative). In other words:
* score = 2 if they have a correct quantitative model
* score = 1 if they have a correct qualitative model but an incorrect quantitative model, OR if their quantitative model is incorrect but qualitatively correct
* score = 0 otherwise (i.e. all models incorrect, or the variable was only identified)
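
The scoring rule above can be sketched as a small function. This is only an illustration: the helper name and the two logical flags are hypothetical and are not columns in mydata. Here `qual_correct` is assumed to be TRUE whenever the student's model is at least qualitatively correct.

```r
# Hypothetical helper: highest correct model type from two logical flags.
# The argument names are assumptions for illustration, not columns in mydata.
highest_model_score <- function(qual_correct, quant_correct) {
  # A correct quantitative model wins; otherwise fall back to qualitative; else 0
  ifelse(quant_correct, 2L, ifelse(qual_correct, 1L, 0L))
}

# Toy input: three students with different outcomes
toy <- data.frame(
  qual_correct  = c(TRUE, TRUE, FALSE),
  quant_correct = c(TRUE, FALSE, FALSE)
)
toy$main <- highest_model_score(toy$qual_correct, toy$quant_correct)
print(toy)
```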

We have 8 independent variables. All are categorical except for prior_number_virtual_labs:
* sim_index (1 or 2, whether it was the student's 1st or 2nd activity)
* variable (each variable belongs to one sim, so sim is not included separately)
* cvs_graph (0 or 1)
* cvs_table (0 or 1)
* pre (0,1,2) a categorical variable
* student attributes (lab_experience, similar_sim, prior_number_virtual_labs)

We ignore attitude components.

First we transform the data into the long format that the mlogit function expects (shape = 'wide' describes the input, which has one row per choice).
Now every student has a row for each variable times each type of model (0,1,2).
The "alt" is the model type (0,1,2) and "main" is True if that was the model type they got correct (and the others are always False for that variable).
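
To make the reshaping concrete, here is a minimal base-R sketch of the same idea on toy data: each observation is repeated once per alternative, and the choice column becomes a logical. The toy values and the use of base R instead of mlogit.data are assumptions for illustration only; the real call is in the next cell.

```r
# Toy data: one row per student-variable observation; 'main' is the observed score
toy <- data.frame(
  sid      = c("s1", "s1", "s2"),
  variable = c("Width", "Area", "Width"),
  main     = c(2, 0, 1)
)
alternatives <- c(0, 1, 2)

# Repeat every observation once per alternative
long <- toy[rep(seq_len(nrow(toy)), each = length(alternatives)), ]
long$chid <- rep(seq_len(nrow(toy)), each = length(alternatives))  # choice situation id
long$alt  <- rep(alternatives, times = nrow(toy))                  # candidate model type

# TRUE only on the row whose alternative matches the observed score
long$main <- long$main == long$alt
rownames(long) <- paste(long$chid, long$alt, sep = ".")
print(long)
```

Each choice situation (chid) ends up with exactly one TRUE row, which mirrors the pattern visible in the head() output of the real data.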

In [128]:
wide_mydata <- mlogit.data(mydata, shape = 'wide', choice = "main", id.var = "sid")
head(wide_mydata, 5)

Unnamed: 0,sid,sim,variable,pre,main,cvs_graph,cvs_table,qual_score,quant_score,activity_order,...,CVS_context,use_table,use_graph,use_concentration,use_width,use_area,use_separation,use_all_vars,chid,alt
1.0,10127163,L,Concentration,0,False,1,1,1,1,LC,...,2,1,1,1,1,1,1,4,1,0
1.1,10127163,L,Concentration,0,False,1,1,1,1,LC,...,2,1,1,1,1,1,1,4,1,1
1.2,10127163,L,Concentration,0,True,1,1,1,1,LC,...,2,1,1,1,1,1,1,4,1,2
2.0,10127163,L,Width,0,False,1,1,1,1,LC,...,2,1,1,1,1,1,1,4,2,0
2.1,10127163,L,Width,0,False,1,1,1,1,LC,...,2,1,1,1,1,1,1,4,2,1


Then we run the mlogit model.

See the following: https://cran.r-project.org/web/packages/mlogit/vignettes/mlogit.pdf

Note that "mixed" in this document does NOT mean "with repeated measures". The "1 | " in the formula below tells it that there are no alternative-specific variables (only an intercept) and that the variables after the bar are individual-specific.
I followed the examples that use the "Train" dataset. See pages 3-7 for how to structure the data and pages 22-23 for an example of running mlogit.

## The model for all variables

In [129]:
ml.mydata <- mlogit(main
    ~ 1 | cvs_table + cvs_graph + variable + sim_index + pre
    + lab_experience + similar_sim + prior_number_virtual_labs, wide_mydata)
summary(ml.mydata)


Call:
mlogit(formula = main ~ 1 | cvs_table + cvs_graph + variable + 
    sim_index + pre + lab_experience + similar_sim + prior_number_virtual_labs, 
    data = wide_mydata, method = "nr", print.level = 0)

Frequencies of alternatives:
       0        1        2 
0.095238 0.486395 0.418367 

nr method
6 iterations, 0h:0m:0s 
g'(-H)^-1g = 8.22E-06 
successive function values within tolerance limits 

Coefficients :
                              Estimate Std. Error z-value Pr(>|z|)   
1:(intercept)                1.6611897  0.5679551  2.9249 0.003446 **
2:(intercept)                0.1104790  0.6206179  0.1780 0.858712   
1:cvs_table1                -0.1951430  0.4023796 -0.4850 0.627696   
2:cvs_table1                 0.4026985  0.4434384  0.9081 0.363811   
1:cvs_graph1                 0.0098364  0.4395978  0.0224 0.982148   
2:cvs_graph1                 1.2499559  0.4550555  2.7468 0.006018 **
1:variableConcentration     -0.1756190  0.4629949 -0.3793 0.704457   
2:variableConcentrat
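
As a rough guide to reading these coefficients: each one is a log-odds relative to alternative 0, so the predicted probabilities follow from exponentiating and normalising. The sketch below uses only the intercepts and the cvs_graph1 coefficients printed above. It assumes every other factor is at its reference level and prior_number_virtual_labs is 0, so the numbers are illustrative and not a summary of the sample.

```r
# Intercepts and cvs_graph1 slopes copied from the summary output above.
# Alternative 0 is the reference, so its linear predictor is fixed at 0.
eta_base <- c("0" = 0, "1" = 1.6611897, "2" = 0.1104790)
b_graph  <- c("0" = 0, "1" = 0.0098364, "2" = 1.2499559)

# Multinomial logit probabilities: exp(eta_j) / sum_k exp(eta_k)
mnl_probs <- function(eta) exp(eta) / sum(exp(eta))

p_no_cvs   <- mnl_probs(eta_base)            # cvs_graph = 0
p_with_cvs <- mnl_probs(eta_base + b_graph)  # cvs_graph = 1
print(round(rbind(p_no_cvs, p_with_cvs), 3))

# Odds of a quantitative model (2) versus none (0), with versus without CVS in the graph
print(exp(b_graph["2"]))
```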

In [130]:
# for(var in list('Width','Separation','Area','Concentration')){
#     print(var)
#     var_data = wide_mydata[wide_mydata$variable == var,]

#     ml.var_data <- mlogit(main
#         ~ 1 | cvs_table + cvs_graph + sim_index + pre
#         + lab_experience + similar_sim + prior_number_virtual_labs, var_data)
#     print(summary(ml.var_data))
#     }

# Stat model 2: Predicting transfer data

## Excluding student main worksheet score

In [131]:
mixed1 <- glmer(
    quant_score
    ~ cvs_table + cvs_graph + variable + sim_index + pre
    + lab_experience + similar_sim + prior_number_virtual_labs + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: quant_score ~ cvs_table + cvs_graph + variable + sim_index +  
    pre + lab_experience + similar_sim + prior_number_virtual_labs +  
    (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   658.8    715.7   -316.4    632.8      575 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.6646 -0.4526  0.2772  0.4765  1.4940 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 3.571    1.89    
Number of obs: 588, groups:  sid, 147

Fixed effects:
                          Estimate Std. Error z value Pr(>|z|)  
(Intercept)                0.63441    0.69197   0.917   0.3592  
cvs_table1                -0.26984    0.42060  -0.642   0.5212  
cvs_graph1                -0.11581    0.43409  -0.267   0.7896  
variableConcentration     -0.09122 

As expected, CVS (cvs_table, cvs_graph) doesn't predict quant transfer scores; only variable does.

## Including student main worksheet score
as a categorical variable

In [132]:
mydata$main <- factor(mydata$main)

In [133]:
mixed1 <- glmer(
    quant_score
    ~ cvs_table + cvs_graph + variable + sim_index + pre + main
    + lab_experience + similar_sim + prior_number_virtual_labs + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)

summary(mixed1)


Correlation matrix not shown by default, as p = 14 > 12.
Use print(obj, correlation=TRUE)  or
	 vcov(obj)	 if you need it



Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: quant_score ~ cvs_table + cvs_graph + variable + sim_index +  
    pre + main + lab_experience + similar_sim + prior_number_virtual_labs +  
    (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   633.8    699.4   -301.9    603.8      573 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.7180 -0.3917  0.2569  0.4453  2.3367 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 4.008    2.002   
Number of obs: 588, groups:  sid, 147

Fixed effects:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)                -0.9516     0.8517  -1.117  0.26389    
cvs_table1                 -0.3024     0.4463  -0.678  0.49803    
cvs_graph1                 -0.5763     0.4783  -1.205  0.22823    
variableConcentratio
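
One way to check whether adding main improves the fit is a likelihood ratio test between the two nested models. Both fits are stored under the same name (mixed1), so anova() cannot compare them directly without refitting under distinct names. The sketch below instead uses the log-likelihoods and residual degrees of freedom printed in the two summaries; since those are rounded to one decimal, the result is approximate.

```r
# Values copied from the two glmer summaries above (rounded as printed)
loglik_without_main <- -316.4
loglik_with_main    <- -301.9
df_resid_without    <- 575
df_resid_with       <- 573

# Likelihood ratio statistic and the number of extra parameters (main has 3 levels)
lr_stat <- 2 * (loglik_with_main - loglik_without_main)
lr_df   <- df_resid_without - df_resid_with

# Upper-tail chi-squared probability
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p = p_value))
```

The drop in AIC from 658.8 to 633.8 points in the same direction.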

# Stat model 3: Predicting the use of CVS

## For cvs_table

In [134]:
# mydata$variable <- relevel(mydata$variable, "Width")
mixed <- glmer(
    cvs_table
    ~ variable + sim_index + pre
    + lab_experience + similar_sim + prior_number_virtual_labs + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)

summary(mixed)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: 
cvs_table ~ variable + sim_index + pre + lab_experience + similar_sim +  
    prior_number_virtual_labs + (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   639.6    687.8   -308.8    617.6      577 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.2907 -0.4368  0.2094  0.3609  2.4240 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 6.361    2.522   
Number of obs: 588, groups:  sid, 147

Fixed effects:
                          Estimate Std. Error z value Pr(>|z|)   
(Intercept)               -2.38742    0.89177  -2.677  0.00742 **
variableConcentration      0.52474    0.36349   1.444  0.14885   
variableSeparation        -0.24145    0.33966  -0.711  0.47716   
variableWidth              0.27596    0.35960   0.767  0.44284

## For cvs_graph

In [135]:
mixed <- glmer(
    cvs_graph
    ~ variable + sim_index + pre
    + lab_experience + similar_sim + prior_number_virtual_labs + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)

summary(mixed)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: 
cvs_graph ~ variable + sim_index + pre + lab_experience + similar_sim +  
    prior_number_virtual_labs + (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   604.1    652.2   -291.1    582.1      577 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.18574 -0.32928 -0.09205  0.33075  2.80758 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 11.1     3.332   
Number of obs: 588, groups:  sid, 147

Fixed effects:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)                -5.8095     1.3601  -4.271 1.94e-05 ***
variableConcentration       0.7505     0.4089   1.835  0.06644 .  
variableSeparation         -0.1380     0.3756  -0.367  0.71327    
variableWidth               0.2763     0.4022   

# ANALYSES THAT WE NO LONGER CARE ABOUT
but keep here just in case

## Predict main as a continuous variable using an ANOVA

Some resources:
* On SS Types: https://mcfromnz.wordpress.com/2011/03/02/anova-type-iiiiii-ss-explained/
* On the drop1() function to do type 3: https://www.statmethods.net/stats/anova.html
* On repeated measures: http://psych.wisc.edu/moore/Rpdf/610-R8_OneWayWithin.pdf, https://datascienceplus.com/two-way-anova-with-repeated-measures/
* On the car package: https://cran.r-project.org/web/packages/car/car.pdf

In [136]:
mydata <- read.csv("C:\\Users\\Sarah\\Documents\\Personal Content\\Lab_study_data\\all_massaged_data\\dataframe_all_factors_for_analysis.txt",sep = '\t')
# sid is the student number
mydata$sid <- factor(mydata$sid)
mydata$sim_index <- factor(mydata$sim_index)
mydata$lab_experience <- factor(mydata$lab_experience)
mydata$similar_sim <- factor(mydata$similar_sim)
mydata$cvs_graph <- factor(mydata$cvs_graph)
mydata$cvs_table <- factor(mydata$cvs_table)
# mydata$main <- factor(mydata$main)
# mydata$pre <- factor(mydata$pre)

In [137]:
lm1 = lm(main
        ~  cvs_table + cvs_graph + variable + sim_index + pre + sid
         + lab_experience + similar_sim + prior_number_virtual_labs,
         data=mydata)
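# Type II sums of squares from car::Anova
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
# Partial eta squared: SS_effect / (SS_effect + SS_residual); Residuals is the last row
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[nrow(results_table1)])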
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table,0.002948199,1,0.012941292,0.9094809,2.988662e-05
cvs_graph,2.847763,1,12.500420043,0.0004507331,0.02805928
variable,2.053713,3,3.004964024,0.03020047,0.020395
sim_index,2.747183,1,12.05891698,0.0005672906,0.0270951
pre,0.5046346,1,2.215122837,0.1373921,0.005089719
sid,100.2275,146,3.013387963,8.909223e-19,0.5039834
lab_experience,,0,,,
similar_sim,0.001078127,1,0.004732502,0.945186,1.092945e-05
prior_number_virtual_labs,,0,,,
Residuals,98.64319,433,,,0.5


We see that, in order of significance, cvs_graph, sim_index, and variable matter (sid, the per-student term, is also highly significant).
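
The eta column in the table is a partial eta squared, i.e. SS_effect / (SS_effect + SS_residual). As a quick sanity check, the sketch below recomputes it from the sums of squares printed in the table; the function name is hypothetical.

```r
# Hypothetical helper: partial eta squared from effect and residual sums of squares
partial_eta_sq <- function(ss_effect, ss_residual) {
  ss_effect / (ss_effect + ss_residual)
}

# Sums of squares copied from the ANOVA table above
ss <- c(cvs_graph = 2.847763, variable = 2.053713,
        sim_index = 2.747183, sid = 100.2275)
ss_residual <- 98.64319

eta <- partial_eta_sq(ss, ss_residual)
print(round(eta, 4))
```

By this formula the Residuals row always gets 0.5, which is why that value appears in the last row of the table and carries no meaning.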