Load libraries

In [2]:
## For some reason, when loading mlogit, the notebook can't find package 'statmod', so I specify its location
library(statmod, lib.loc='D:\\Applications\\Anaconda2\\pkgs\\r-statmod-1.4.30-r3.4.1_0\\lib\\R\\library\\')
require(mlogit)
require(ggplot2)
require(reshape2)
require(lme4)
require(compiler)
require(parallel)
require(car)
require(boot)
require(dplyr)
require(sjstats)
require(broom)
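
Because `require()` only warns when a package is missing, a failed load can go unnoticed until a later cell errors. A minimal sketch of an up-front check follows; the package names are copied from the cell above, and `pkgs`, `installed`, and `missing_pkgs` are illustrative names rather than part of the original notebook.

```r
# Packages the notebook loads (names taken from the cell above)
pkgs <- c("statmod", "mlogit", "ggplot2", "reshape2", "lme4", "compiler",
          "parallel", "car", "boot", "dplyr", "sjstats", "broom")

# requireNamespace() returns FALSE (instead of stopping) when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```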

Loading required package: mlogit
Loading required package: Formula
"package 'Formula' was built under R version 3.4.4"
Loading required package: maxLik
"package 'maxLik' was built under R version 3.4.4"
Loading required package: miscTools

Please cite the 'maxLik' package as:
Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.

If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
https://r-forge.r-project.org/projects/maxlik/
Loading required package: ggplot2
"package 'ggplot2' was built under R version 3.4.4"
Loading required package: reshape2
"package 'reshape2' was built under R version 3.4.4"
Loading required package: lme4
"package 'lme4' was built under R version 3.4.4"
Loading required package: Matrix
Loading required package: compiler
Loading required package: parallel
Loading requir

# Load data and set factors

In [3]:
all_mydata <- read.csv("C:\\Users\\Sarah\\Documents\\Personal Content\\Lab_study_data\\all_massaged_data\\dataframe_all_factors_for_analysis.txt",sep = '\t')
# sid is the student number
# we use the "factor()" option to make sure R treats them as categorical
all_mydata$sid <- factor(all_mydata$sid)
all_mydata$sim_index <- factor(all_mydata$sim_index)
all_mydata$lab_experience <- factor(all_mydata$lab_experience)
all_mydata$lab_experience_chem <- factor(all_mydata$lab_experience_chem)
all_mydata$lab_experience_phys <- factor(all_mydata$lab_experience_phys)
all_mydata$similar_sim <- factor(all_mydata$similar_sim)
all_mydata$cvs_graph <- factor(all_mydata$cvs_graph)
all_mydata$cvs_graph_inverse <- factor(all_mydata$cvs_graph_inverse)
# all_mydata$cvs_graph_NOT_inverse <- factor(all_mydata$cvs_graph_NOT_inverse)
all_mydata$cvs_table_only <- factor(all_mydata$cvs_table_only)
all_mydata$quant_score <- factor(all_mydata$quant_score)
# all_mydata$main <- factor(all_mydata$main)
# all_mydata$pre <- factor(all_mydata$pre)
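
The repeated `factor()` calls above could also be written as one assignment over a vector of column names. This is only a sketch on a made-up data frame: `toy` and `factor_cols` are hypothetical names, and the values are not the study data.

```r
# Toy stand-in for all_mydata (hypothetical values)
toy <- data.frame(sid = c(101, 101, 102),
                  sim_index = c(1, 2, 1),
                  cvs_graph = c(0, 1, 1),
                  pre = c(0, 2, 1))

# Columns to treat as categorical; pre is left numeric on purpose
factor_cols <- c("sid", "sim_index", "cvs_graph")
toy[factor_cols] <- lapply(toy[factor_cols], factor)

str(toy)
```

Keeping the column list in one place makes it easy to see which variables are categorical and which (here `pre`) stay continuous.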

Here is what our data looks like:

In [4]:
head(all_mydata)
# colnames(mydata)

sid,sim,variable,pre,main,cvs_graph,cvs_table,cvs_table_only,cvs_graph_inverse,cvs_graph_axes,...,use_graph,use_concentration,use_width,use_area,use_separation,use_all_vars,use_graph_beers,use_table_beers,use_table_capacitor,use_graph_capacitor
10127163,L,Concentration,0,2,1,1,0,0,1,...,1,1,1,1,1,4,1,1,1,1
10127163,L,Width,0,2,1,1,0,0,1,...,1,1,1,1,1,4,1,1,1,1
10127163,C,Area,2,2,1,1,0,0,1,...,1,1,1,1,1,4,1,1,1,1
10127163,C,Separation,2,2,1,1,0,1,2,...,1,1,1,1,1,4,1,1,1,1
10232160,L,Concentration,0,0,1,1,0,1,2,...,1,1,1,1,1,4,1,1,1,1
10232160,L,Width,0,0,0,0,0,0,0,...,1,1,1,1,1,4,1,1,1,1


We have the following factors that change per variable:
* main (0,1,2), treated as a continuous variable
* pre (0,1,2), treated as a continuous variable
* quant_score (0 or 1)
* CVS_graph (0 or 1)
* CVS_table (0 or 1)

We have the following independent factors:
* sim_index (1 or 2, whether it was the student's 1st or 2nd activity)
* variable (each variable belongs to one sim, so sim is not included as a separate factor)
* student attributes:
   * lab_experience (0 or 1 if students have prior undergraduate physics or chemistry lab experience)
   * similar_sim (0 or 1 if they have used a similar simulation)
   * prior_number_virtual_labs (levels from 0 to 3 depending on the number of virtual labs they have done in the past)

We ignore attitude components.

For main and pre score:
* score = 2 if they describe the correct relationship, i.e. a correct quantitative model
* score = 1 if they describe the correct direction of the relationship, i.e. a correct qualitative model without a correct quantitative model (this includes an incorrect quantitative model that is qualitatively correct)
* score = 0 otherwise (i.e. all incorrect or only identified)
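
The rubric above can be sketched as a small function. The two logical inputs (`quant_correct`, `direction_correct`) and the name `score_model` are hypothetical; the data file already contains the coded scores, so this only illustrates the rule.

```r
# Hypothetical helper: the inputs are logical flags, not columns of the data file
score_model <- function(quant_correct, direction_correct) {
  # 2 = correct quantitative model, 1 = correct direction only, 0 = otherwise
  ifelse(quant_correct, 2L, ifelse(direction_correct, 1L, 0L))
}

# Three example responses: fully correct, direction only, incorrect
scores <- score_model(c(TRUE, FALSE, FALSE), c(TRUE, TRUE, FALSE))
print(scores)
```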

# Use this code to remove perfect pre per variable instance

In [5]:
# mydata <- (all_mydata %>% filter(pre < 2))
mydata <- all_mydata

In [6]:
# print(dim(mydata));print(dim(all_mydata));
# print(dim(unique(mydata['sid'])));print(dim(unique(all_mydata['sid'])));

We could remove 39 instances of perfect pre. If we do, all 147 students remain in the study (i.e. no student got a perfect pre on all variables).
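
The check described here (how many rows are dropped, and whether any student disappears entirely) can be sketched on a toy data frame. The values below are made up so that one student is lost, unlike in the real data; `subset()` is used as a base-R equivalent of the `filter(pre < 2)` call in the commented-out line.

```r
# Toy data (made up): student 2 has a perfect pre on every row
toy <- data.frame(sid = c(1, 1, 2, 2, 3, 3),
                  pre = c(2, 0, 2, 2, 1, 0))

kept <- subset(toy, pre < 2)   # base-R equivalent of filter(pre < 2)

n_removed         <- nrow(toy) - nrow(kept)
n_students_before <- length(unique(toy$sid))
n_students_after  <- length(unique(kept$sid))

cat("rows removed:", n_removed, "\n")
cat("students before/after:", n_students_before, "/", n_students_after, "\n")
```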

# Stat model 1: Predicting main model score as a continuous variable

Some resources:
* On SS Types: https://mcfromnz.wordpress.com/2011/03/02/anova-type-iiiiii-ss-explained/
* On the drop1() function to do type 3: https://www.statmethods.net/stats/anova.html
* On repeated measures: http://psych.wisc.edu/moore/Rpdf/610-R8_OneWayWithin.pdf, https://datascienceplus.com/two-way-anova-with-repeated-measures/
* the car package: https://cran.r-project.org/web/packages/car/car.pdf

## Complete model with interactions

Our model (without student factors) is:

    main  ~  cvs_table_only*variable + cvs_graph*variable
             + cvs_table_only*pre + cvs_graph*pre
             + sim_index + sid
             
We run a type III Anova:

In [7]:
lm1 = lm(main
        ~  cvs_table_only*variable + cvs_graph*variable + cvs_table_only*pre + cvs_graph*pre + sim_index + sid,
         data=mydata)
results1 = Anova(lm1, type=3)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

"package 'bindrcpp' was built under R version 3.4.4"

term,sumsq,df,statistic,p.value,eta
(Intercept),7.805831,1,34.747383469,7.642655e-09,0.07541526
cvs_table_only,0.02646537,1,0.117809663,0.7315917,0.000276472
variable,0.2962562,3,0.439591352,0.7247968,0.00308616
cvs_graph,2.251265,1,10.021427952,0.001658389,0.0229838
pre,0.0009773337,1,0.004350567,0.9474415,1.021249e-05
sim_index,2.740467,1,12.19909339,0.0005281336,0.02783916
sid,98.8187,146,3.012931588,1.176573e-18,0.5080195
cvs_table_only:variable,1.212752,3,1.799508506,0.1465681,0.01251401
variable:cvs_graph,2.400688,3,3.562193174,0.01431263,0.02447197
cvs_table_only:pre,0.001981195,1,0.008819222,0.9252242,2.070197e-05
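
In the model formula used above, `a*b` is shorthand for `a + b + a:b`, which is why the table lists main effects first and `:` interaction terms afterwards. A small sketch that expands the formula without needing any data (`f` and `term_labels` are illustrative names):

```r
# Same right-hand side as the complete model above
f <- main ~ cvs_table_only*variable + cvs_graph*variable +
  cvs_table_only*pre + cvs_graph*pre + sim_index + sid

# terms() expands the * shorthand into main effects and interactions
term_labels <- attr(terms(f), "term.labels")
print(term_labels)
```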


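Among the interaction terms shown, only variable:cvs_graph reaches p < 0.05 (p = 0.014); the others are far from significant. We move to a simpler model, and fit each variable separately further below.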

## Simple model without interaction

Our model (without student factors) is:

    main  ~  cvs_table_only + cvs_graph + variable
             + pre + sim_index + sid
             
We run a type II Anova:

In [8]:
lm1 = lm(main
        ~  cvs_table_only + cvs_graph + variable + pre + sim_index + sid,
         data=mydata)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,0.002875905,1,0.01265297,0.9104905,2.915345e-05
cvs_graph,3.533661,1,15.54686102,9.38115e-05,0.0345834
variable,2.34179,3,3.43434961,0.01698431,0.02318924
pre,0.5063759,1,2.22787537,0.1362667,0.005107137
sim_index,2.769224,1,12.18360845,0.000531456,0.02730627
sid,100.3316,146,3.02345002,6.702067999999999e-19,0.5042401
Residuals,98.64427,434,,,0.5


We see that, in order of significance and eta^2: cvs_graph, sim_index, and variable matter.
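
The `eta` column in these tables is computed as SS_effect / (SS_effect + SS_residual), i.e. a partial eta squared, with the residual sum of squares taken from the last row. The sketch below reproduces the column from the sums of squares printed above; `partial_eta_sq` is an illustrative helper name, not a function from any package.

```r
# Partial eta squared, assuming the residual SS is the last element
partial_eta_sq <- function(sumsq) {
  ss_resid <- sumsq[length(sumsq)]
  sumsq / (sumsq + ss_resid)
}

# Sums of squares copied from the type II table above
sumsq <- c(cvs_table_only = 0.002875905, cvs_graph = 3.533661,
           variable = 2.34179, pre = 0.5063759, sim_index = 2.769224,
           sid = 100.3316, Residuals = 98.64427)

eta <- partial_eta_sq(sumsq)
print(round(eta, 4))
```

The value 0.5 in the Residuals row is a by-product of the formula (SS_residual divided by twice itself) and carries no meaning.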

## Model by variable

### For Width without interaction

In [9]:
test <- subset(mydata, variable == "Width")
lm1 = lm(main
        ~  cvs_table_only + cvs_graph + pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,2.0083545,1,5.6774156,0.01850865,0.038444711
cvs_graph,10.775617,1,30.4615826,1.569396e-07,0.176628222
pre,0.2706068,1,0.7649782,0.383252,0.005358304
sim_index,1.5010825,1,4.243409,0.04122893,0.02901607
Residuals,50.2317177,142,,,0.5


### For Area without interaction

In [10]:
test <- subset(mydata, variable == "Area")
lm1 = lm(main
        ~  cvs_table_only + cvs_graph + pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,0.30640133,1,0.99824432,0.3194352,0.0069808152
cvs_graph,6.448698455,1,21.00962353,9.927121e-06,0.1288857865
pre,2.643065816,1,8.61101169,0.003897894,0.057173852
sim_index,0.009813715,1,0.03197273,0.8583424,0.0002251094
Residuals,43.58551115,142,,,0.5


### For Concentration without interaction

In [11]:
test <- subset(mydata, variable == "Concentration")
lm1 = lm(main
        ~  cvs_table_only + cvs_graph + pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,0.2787357,1,0.8475172,0.358817,0.00593302
cvs_graph,8.3179422,1,25.2913382,1.464157e-06,0.1511814
pre,1.7380768,1,5.2847552,0.0229722,0.03588121
sim_index,2.2443126,1,6.8240039,0.009961487,0.04585284
Residuals,46.7016722,142,,,0.5


### For Separation without interaction

In [12]:
test <- subset(mydata, variable == "Separation")
lm1 = lm(main
        ~  cvs_table_only + cvs_graph + pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,0.4788152,1,1.302363,0.255703,0.009088217
cvs_graph,0.7928125,1,2.156426,0.1441863,0.014958931
pre,2.5078567,1,6.821295,0.0099759,0.045835475
sim_index,0.3905307,1,1.062232,0.304459,0.007424964
Residuals,52.2064594,142,,,0.5
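
The four per-variable cells above differ only in the subset, so they could be driven by one loop. The sketch below is an assumption-laden illustration: it runs on synthetic data (not the study data) and uses base-R `drop1()` in place of `car::Anova(type=2)`, which gives the same sums of squares for a model with main effects only. The names `toy`, `fit_one`, and `results` are hypothetical.

```r
set.seed(1)

# Synthetic stand-in for mydata (made-up values, NOT the study data)
n <- 160
toy <- data.frame(
  variable       = rep(c("Width", "Area", "Concentration", "Separation"), each = n / 4),
  cvs_table_only = factor(rbinom(n, 1, 0.3)),
  cvs_graph      = factor(rbinom(n, 1, 0.5)),
  pre            = sample(0:2, n, replace = TRUE),
  sim_index      = factor(sample(1:2, n, replace = TRUE))
)
toy$main <- 0.5 * (toy$cvs_graph == "1") + 0.3 * toy$pre + rnorm(n, sd = 0.5)

# Fit the main-effects model on one subset and return a tidy table
fit_one <- function(d) {
  fit <- lm(main ~ cvs_table_only + cvs_graph + pre + sim_index, data = d)
  tab <- drop1(fit, test = "F")            # row "<none>" carries the residual SS
  ss_resid <- tab["<none>", "RSS"]
  effects <- tab[rownames(tab) != "<none>", ]
  data.frame(
    term    = rownames(effects),
    sumsq   = effects[["Sum of Sq"]],
    p.value = effects[["Pr(>F)"]],
    eta     = effects[["Sum of Sq"]] / (effects[["Sum of Sq"]] + ss_resid)
  )
}

# One table per variable
results <- lapply(split(toy, toy$variable), fit_one)
print(results$Width)
```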


### For Width with interaction

In [13]:
test <- subset(mydata, variable == "Width")
lm1 = lm(main
        ~  cvs_table_only*pre + cvs_graph*pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,1.90468176,1,5.3424028,0.02227267,0.036757358
pre,0.27060681,1,0.7590195,0.3851271,0.005392333
cvs_graph,11.01025113,1,30.8824273,1.339132e-07,0.180723248
sim_index,1.59026245,1,4.4604945,0.03646344,0.030876916
cvs_table_only:pre,0.30620735,1,0.8588747,0.3556487,0.006097413
pre:cvs_graph,0.08090914,1,0.2269404,0.6345445,0.001618379
Residuals,49.91301822,140,,,0.5


### For Area with interaction

In [14]:
test <- subset(mydata, variable == "Area")
lm1 = lm(main
        ~  cvs_table_only*pre + cvs_graph*pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,0.3179735,1,1.029414,0.3120465,0.007299289
pre,2.643066,1,8.556719,0.004017353,0.057599
cvs_graph,6.322457,1,20.46846,1.282505e-05,0.1275544
sim_index,0.01266573,1,0.0410043,0.8398236,0.0002928021
cvs_table_only:pre,4.89071e-07,1,1.583329e-06,0.9989978,1.130949e-08
pre:cvs_graph,0.2652803,1,0.8588243,0.3556628,0.006097057
Residuals,43.24429,140,,,0.5


### For Concentration with interaction

In [15]:
test <- subset(mydata, variable == "Concentration")
lm1 = lm(main
        ~  cvs_table_only*pre + cvs_graph*pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,0.273831,1,0.8536326,0.3571164,0.006060423
pre,1.7380768,1,5.4182291,0.02136072,0.037259628
cvs_graph,8.5774119,1,26.7389698,7.891198e-07,0.16036425
sim_index,1.9978595,1,6.2280681,0.01373516,0.042591468
cvs_table_only:pre,1.7900473,1,5.5802404,0.01954085,0.038331029
pre:cvs_graph,0.4967508,1,1.5485563,0.215428,0.010940107
Residuals,44.9096459,140,,,0.5


### For Separation with interaction

In [16]:
test <- subset(mydata, variable == "Separation")
lm1 = lm(main
        ~  cvs_table_only*pre + cvs_graph*pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,0.46376144,1,1.26392161,0.262834021,0.0089472357
pre,2.50785668,1,6.83483785,0.009918134,0.0465477945
cvs_graph,0.79434797,1,2.1648923,0.143438726,0.0152280374
sim_index,0.2619667,1,0.71395625,0.399575237,0.0050738126
cvs_table_only:pre,0.8062434,1,2.19731175,0.140499124,0.0154525548
pre:cvs_graph,0.02104242,1,0.05734839,0.811087341,0.0004094636
Residuals,51.36916812,140,,,0.5


In [17]:
colMeans(test["main"])

## Model for Separation with inverse cvs_graph

In [18]:
test <- subset(mydata, variable == "Separation")
lm1 = lm(main
        ~  cvs_table_only + cvs_graph_inverse + pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,0.3542486,1,1.006645,0.317414573,0.007039149
cvs_graph_inverse,3.0280361,1,8.604573,0.003910897,0.057133542
pre,2.7717625,1,7.876337,0.005712494,0.052552236
sim_index,0.5366358,1,1.524923,0.218915467,0.010624795
Residuals,49.9712358,142,,,0.5


# Stat model 2: Predicting transfer data

## Excluding student main worksheet score

### Complete model with interactions

Our model is:

    quant_score  ~  cvs_table_only*variable + cvs_graph*variable
             + cvs_table_only*pre + cvs_graph*pre
             + sim_index + sid
             
We run a mixed-effects logistic regression, with sid entering as a random intercept:

In [19]:
mixed1 <- glmer(
    quant_score
    ~ cvs_table_only*variable + cvs_graph*variable + cvs_table_only*pre + cvs_graph*pre + sim_index + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)


Correlation matrix not shown by default, as p = 16 > 12.
Use print(obj, correlation=TRUE)  or
	 vcov(obj)	 if you need it



Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: quant_score ~ cvs_table_only * variable + cvs_graph * variable +  
    cvs_table_only * pre + cvs_graph * pre + sim_index + (1 |      sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   670.0    744.4   -318.0    636.0      571 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.2052 -0.4659  0.2881  0.4786  1.2799 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 3.735    1.932   
Number of obs: 588, groups:  sid, 147

Fixed effects:
                                      Estimate Std. Error z value Pr(>|z|)  
(Intercept)                            1.16394    0.46704   2.492   0.0127 *
cvs_table_only1                        0.28969    0.78196   0.370   0.7110  
variableConcentration                 -0.19343    0.55514  -0.348   0.727

**(non log) Odds ratio with confidence intervals**

In [20]:
cc <- confint(mixed1,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed1),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                                        est  2.5 % 97.5 %
(Intercept)                           3.203 1.2822   8.00
cvs_table_only1                       1.336 0.2885   6.19
variableConcentration                 0.824 0.2776   2.45
variableSeparation                    0.346 0.1259   0.95
variableWidth                         1.597 0.5387   4.73
cvs_graph1                            0.673 0.2022   2.24
pre                                   2.657 1.1025   6.40
sim_index2                            1.300 0.8226   2.05
cvs_table_only1:variableConcentration 0.678 0.0865   5.32
cvs_table_only1:variableSeparation    0.971 0.1345   7.01
cvs_table_only1:variableWidth         0.474 0.0591   3.80
variableConcentration:cvs_graph1      1.873 0.4176   8.40
variableSeparation:cvs_graph1         1.721 0.4319   6.86
variableWidth:cvs_graph1              0.571 0.1309   2.49
cvs_table_only1:pre                   0.460 0.1186   1.79
cvs_graph1:pre                        0.723 0.2436   2.15


Again interactions are not significant, so we stick to a simpler model.

### Simple model without interactions

Our model is:

    quant_score  ~  cvs_table_only + cvs_graph + variable
                     + pre + sim_index + sid
             
We run a logistic regression:

In [21]:
mixed1 <- glmer(
    quant_score
    ~ cvs_table_only + cvs_graph + variable + sim_index + pre + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: quant_score ~ cvs_table_only + cvs_graph + variable + sim_index +  
    pre + (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   658.8    698.2   -320.4    640.8      579 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.6604 -0.4832  0.2932  0.4892  1.3620 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 3.461    1.86    
Number of obs: 588, groups:  sid, 147

Fixed effects:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)            1.21488    0.36514   3.327 0.000877 ***
cvs_table_only1       -0.23617    0.41210  -0.573 0.566580    
cvs_graph1            -0.28717    0.34211  -0.839 0.401247    
variableConcentration  0.03205    0.32643   0.098 0.921794    
variableSeparation    -0.81255    0.3167
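
To get a feel for the size of the student random intercept, the logit-scale estimates can be converted to probabilities. The sketch below uses the intercept (1.21488) and the `sid` standard deviation (1.86) from the summary above, and evaluates a student one SD below average, an average student, and one SD above, all at the reference levels of the predictors (assumed here to be Area, no CVS, first activity, pre = 0). The variable names are illustrative.

```r
intercept <- 1.21488   # fixed intercept from the summary above (logit scale)
sd_sid    <- 1.86      # SD of the student random intercept

# Student-level offsets: one SD below average, average, one SD above
offsets <- c(low = -1, average = 0, high = 1) * sd_sid

# plogis() is the inverse logit
p <- plogis(intercept + offsets)
print(round(p, 2))
```

The spread between the low and high student is large compared with most fixed effects, which is in line with `sid` dominating the earlier ANOVA tables.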

**(non log) Odds ratio with confidence intervals**

In [22]:
cc <- confint(mixed1,parm="beta_",method="Wald",level=0.95)
ctab <- cbind(est=fixef(mixed1),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                        est 2.5 % 97.5 %
(Intercept)           3.370 1.647  6.893
cvs_table_only1       0.790 0.352  1.771
cvs_graph1            0.750 0.384  1.467
variableConcentration 1.033 0.545  1.958
variableSeparation    0.444 0.238  0.826
variableWidth         1.072 0.567  2.024
sim_index2            1.275 0.813  2.001
pre                   1.831 1.158  2.894
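
The odds ratios and Wald intervals in this table are the exponentiated estimate and estimate ± z × SE. A short sketch that reproduces the intercept row from the estimate and standard error printed in the summary (`wald_or` is an illustrative helper, not an lme4 function):

```r
# Odds ratio with a Wald confidence interval from a logit-scale estimate
wald_or <- function(est, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  exp(c(est = est, lower = est - z * se, upper = est + z * se))
}

# Intercept of the simple model: Estimate 1.21488, Std. Error 0.36514
or_intercept <- wald_or(1.21488, 0.36514)
print(round(or_intercept, 3))
```

An interval that excludes 1 corresponds to a Wald test with p < 0.05 for that coefficient.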


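As expected, CVS doesn't predict quant transfer scores. Among the other predictors, only variable (Separation vs. the Area baseline) and pre have confidence intervals that exclude 1.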

## Including student main worksheet score
as a continuous variable

### Complete model with interactions

Our model is:

    quant_score  ~  main + cvs_table_only*variable + cvs_graph*variable
                    + cvs_table_only*pre + cvs_graph*pre
                    + sim_index + sid
             
We run a logistic regression:

In [23]:
mixed1 <- glmer(
    quant_score
    ~ main + cvs_table_only*variable + cvs_graph*variable + cvs_table_only*pre + cvs_graph*pre + sim_index + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)


Correlation matrix not shown by default, as p = 17 > 12.
Use print(obj, correlation=TRUE)  or
	 vcov(obj)	 if you need it



Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: quant_score ~ main + cvs_table_only * variable + cvs_graph *  
    variable + cvs_table_only * pre + cvs_graph * pre + sim_index +  
    (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   639.2    718.0   -301.6    603.2      570 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.8868 -0.3810  0.2352  0.4439  2.6622 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 4.377    2.092   
Number of obs: 588, groups:  sid, 147

Fixed effects:
                                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)                           -0.22567    0.55203  -0.409   0.6827    
main                                   1.46100    0.28806   5.072 3.94e-07 ***
cvs_table_only1                        0.21502    0.82363  

**(non log) Odds ratio with confidence intervals**

In [24]:
cc <- confint(mixed1,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed1),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                                        est  2.5 % 97.5 %
(Intercept)                           0.798 0.2705  2.354
main                                  4.310 2.4508  7.581
cvs_table_only1                       1.240 0.2468  6.230
variableConcentration                 0.733 0.2350  2.287
variableSeparation                    0.285 0.0987  0.823
variableWidth                         1.684 0.5345  5.303
cvs_graph1                            0.342 0.0925  1.267
pre                                   2.581 1.0100  6.594
sim_index2                            1.098 0.6741  1.789
cvs_table_only1:variableConcentration 0.653 0.0747  5.706
cvs_table_only1:variableSeparation    1.569 0.1983 12.421
cvs_table_only1:variableWidth         0.352 0.0394  3.136
variableConcentration:cvs_graph1      2.056 0.4222 10.011
variableSeparation:cvs_graph1         2.971 0.6787 13.005
variableWidth:cvs_graph1              0.498 0.1054  2.357
cvs_table_only1:pre                   0.445 0.1004  1.971
cvs_graph1:pre

Again interactions are not significant, so we stick to a simpler model.

### Simple model without interactions

Our model is:

    quant_score  ~  main + cvs_table_only + cvs_graph + variable
                     + pre + sim_index + sid
             
We run a logistic regression:

In [25]:
mixed1 <- glmer(
    quant_score
    ~ main + cvs_table_only + cvs_graph + variable + sim_index + pre + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: quant_score ~ main + cvs_table_only + cvs_graph + variable +  
    sim_index + pre + (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   630.9    674.7   -305.5    610.9      578 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.8731 -0.4163  0.2662  0.4557  2.0194 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 3.965    1.991   
Number of obs: 588, groups:  sid, 147

Fixed effects:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -0.107638   0.452838  -0.238   0.8121    
main                   1.340418   0.271828   4.931 8.18e-07 ***
cvs_table_only1       -0.269439   0.439074  -0.614   0.5394    
cvs_graph1            -0.808983   0.376226  -2.150   0.0315 *  
variableConcentration -0.046

**(non log) Odds ratio with confidence intervals**

In [26]:
cc <- confint(mixed1,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed1),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                        est 2.5 % 97.5 %
(Intercept)           0.898 0.370  2.181
main                  3.821 2.243  6.509
cvs_table_only1       0.764 0.323  1.806
cvs_graph1            0.445 0.213  0.931
variableConcentration 0.954 0.488  1.865
variableSeparation    0.489 0.256  0.937
variableWidth         1.009 0.518  1.967
sim_index2            1.097 0.681  1.765
pre                   1.668 1.020  2.728
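
One way to back up dropping the interactions is a likelihood-ratio test between each complete model and its simple counterpart. The sketch below works only from the rounded deviances and residual degrees of freedom printed in the four summaries above, so the p-values are approximate; on the fitted objects themselves, `anova(simple, complete)` would be the usual route. `lrt` is an illustrative helper name.

```r
# Likelihood-ratio test from two deviances and the difference in parameters
lrt <- function(dev_simple, dev_complete, df_diff) {
  stat <- dev_simple - dev_complete
  c(chisq = stat, df = df_diff,
    p = pchisq(stat, df_diff, lower.tail = FALSE))
}

# Without main: deviance 640.8 (df.resid 579) vs 636.0 (df.resid 571)
without_main <- lrt(640.8, 636.0, 579 - 571)

# With main: deviance 610.9 (df.resid 578) vs 603.2 (df.resid 570)
with_main <- lrt(610.9, 603.2, 578 - 570)

print(round(rbind(without_main, with_main), 3))
```

In both cases the eight interaction parameters do not improve the fit significantly, which agrees with the lower AIC and BIC of the simple models.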


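## Discussion of all 4 models (with/without interactions, with/without main)
What we notice:
* cvs_graph does not matter, except in the simple model that includes main (odds ratio 0.445, 95% CI 0.213 to 0.931)
* main matters
* pre has a positive effect whose confidence interval excludes 1 in every model, although the lower bound is close to 1 once main is included
* variable matters (Separation has lower odds than the baseline)
* sim_index doesn't matter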

## Redo model split by activity order

In [27]:
mydata_first <- (mydata %>% filter(sim_index==1))
mydata_second <- (mydata %>% filter(sim_index ==2))

### First

In [28]:
mixed1 <- glmer(
    quant_score
    ~ cvs_table_only*variable + cvs_graph*variable + cvs_table_only*pre + cvs_graph*pre + (1 | sid),
           data = mydata_first, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)


Correlation matrix not shown by default, as p = 15 > 12.
Use print(obj, correlation=TRUE)  or
	 vcov(obj)	 if you need it



Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: quant_score ~ cvs_table_only * variable + cvs_graph * variable +  
    cvs_table_only * pre + cvs_graph * pre + (1 | sid)
   Data: mydata_first
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   344.4    403.3   -156.2    312.4      278 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.8335 -0.2994  0.1422  0.2748  1.3792 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 15.75    3.968   
Number of obs: 294, groups:  sid, 147

Fixed effects:
                                      Estimate Std. Error z value Pr(>|z|)  
(Intercept)                             3.0281     1.1959   2.532   0.0113 *
cvs_table_only1                         1.2203     2.0009   0.610   0.5419  
variableConcentration                  -0.3191     1.3937  -0.229   0.8189  
variabl

### Second

In [29]:
mixed1 <- glmer(
    quant_score
    ~ cvs_table_only*variable + cvs_graph*variable + cvs_table_only*pre + cvs_graph*pre + (1 | sid),
           data = mydata_second, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)


Correlation matrix not shown by default, as p = 15 > 12.
Use print(obj, correlation=TRUE)  or
	 vcov(obj)	 if you need it



Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: quant_score ~ cvs_table_only * variable + cvs_graph * variable +  
    cvs_table_only * pre + cvs_graph * pre + (1 | sid)
   Data: mydata_second
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   339.2    398.1   -153.6    307.2      278 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.2223 -0.3546  0.2339  0.3828  1.4242 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 4.384    2.094   
Number of obs: 294, groups:  sid, 147

Fixed effects:
                                      Estimate Std. Error z value Pr(>|z|)  
(Intercept)                            0.63183    0.72463   0.872    0.383  
cvs_table_only1                       -1.56832    1.42510  -1.100    0.271  
variableConcentration                  0.56315    0.98408   0.572    0.567  
variab

# Stat model 3: Predicting the use of CVS

## Predicting the use of CVS_graph

Our model is:

    cvs_graph  ~ variable + pre + sim_index + sid
                 + lab_experience + similar_sim + prior_number_virtual_labs
             
We run a logistic regression:

In [31]:
mixed <- glmer(
    cvs_graph
    ~ variable + sim_index + pre
#     + lab_experience_chem + lab_experience_phys 
    +lab_experience + similar_sim + prior_number_virtual_labs + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)

summary(mixed)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: 
cvs_graph ~ variable + sim_index + pre + lab_experience + similar_sim +  
    prior_number_virtual_labs + (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   603.7    647.5   -291.9    583.7      578 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.0406 -0.3283 -0.1003  0.3351  2.8259 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 10.79    3.284   
Number of obs: 588, groups:  sid, 147

Fixed effects:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)                -5.6625     1.3361  -4.238 2.25e-05 ***
variableConcentration       0.7366     0.4057   1.816  0.06943 .  
variableSeparation         -0.1014     0.3731  -0.272  0.78572    
variableWidth               0.2744     0.3993   0.687  0.4

**(non log) Odds ratio with confidence intervals**

In [32]:
cc <- confint(mixed,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                               est    2.5 %   97.5 %
(Intercept)                0.00347 0.000253   0.0476
variableConcentration      2.08883 0.943120   4.6263
variableSeparation         0.90354 0.434863   1.8773
variableWidth              1.31570 0.601511   2.8779
sim_index2                 3.64962 2.062665   6.4575
pre                        1.64787 0.980161   2.7704
lab_experience1           52.79573 3.391274 821.9297
similar_sim1               0.86094 0.364845   2.0316
prior_number_virtual_labs  1.38074 0.645833   2.9519


## Predicting use of table_only

In [33]:
mixed <- glmer(
    cvs_table_only
    ~ variable + sim_index + pre
    + lab_experience + similar_sim + prior_number_virtual_labs + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)

summary(mixed)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: cvs_table_only ~ variable + sim_index + pre + lab_experience +  
    similar_sim + prior_number_virtual_labs + (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   462.5    506.3   -221.3    442.5      578 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.5048 -0.2833 -0.2009 -0.1461  2.9175 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 3.204    1.79    
Number of obs: 588, groups:  sid, 147

Fixed effects:
                          Estimate Std. Error z value Pr(>|z|)  
(Intercept)               -1.79523    0.74638  -2.405   0.0162 *
variableConcentration     -0.12782    0.41083  -0.311   0.7557  
variableSeparation        -0.27673    0.40099  -0.690   0.4901  
variableWidth              0.04624    0.40370   0.115   0.9088

In [34]:
cc <- confint(mixed,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                            est  2.5 % 97.5 %
(Intercept)               0.166 0.0385  0.717
variableConcentration     0.880 0.3934  1.969
variableSeparation        0.758 0.3455  1.664
variableWidth             1.047 0.4747  2.311
sim_index2                0.530 0.3043  0.924
pre                       1.305 0.7719  2.206
lab_experience1           1.527 0.3262  7.146
similar_sim1              1.314 0.5383  3.210
prior_number_virtual_labs 0.567 0.3298  0.975


## Predicting use of graph with inverse axis for Separation only

In [42]:
mydata_separation <- (all_mydata %>% filter(variable == 'Separation'))

In [43]:
mixed <- glm(
    cvs_graph_inverse
    ~ sim_index + pre
    + lab_experience + similar_sim + prior_number_virtual_labs,
           data = mydata_separation, family = binomial)

summary(mixed)


Call:
glm(formula = cvs_graph_inverse ~ sim_index + pre + lab_experience + 
    similar_sim + prior_number_virtual_labs, family = binomial, 
    data = mydata_separation)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1576  -0.9107  -0.7372   1.2840   1.9212  

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)  
(Intercept)                -17.88426 1008.19682  -0.018    0.986  
sim_index2                   0.17107    0.38515   0.444    0.657  
pre                         -0.07983    0.31624  -0.252    0.801  
lab_experience1             16.21065 1008.19682   0.016    0.987  
similar_sim1                 0.37141    0.55565   0.668    0.504  
prior_number_virtual_labs    0.41843    0.23744   1.762    0.078 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 177.69  on 146  degrees of freedom
Residual deviance: 162.27  on 141  degrees of freedom
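
The very large estimates and standard errors for (Intercept) and lab_experience1 (about 16 to 18 in size, with SE around 1008) look like what glm reports under quasi-complete separation, i.e. when one level of a factor never shows the outcome. That is an assumption about this data, not something the notebook confirms. The sketch below reproduces the pattern on hypothetical toy data (toy, grp and y are made-up names).

```r
# Toy data (hypothetical): nobody in group 0 shows the outcome
toy <- data.frame(
  grp = factor(rep(c(0, 1), times = c(10, 40))),
  y   = c(rep(0, 10), rep(c(1, 0), times = c(15, 25)))
)

# The cross-tab has an empty cell for grp = 0, y = 1
print(table(toy$grp, toy$y))

# glm still returns a fit, but the affected coefficients diverge
# and their standard errors blow up
fit <- suppressWarnings(glm(y ~ grp, data = toy, family = binomial))
print(round(coef(summary(fit)), 3))
```

On the real data, the equivalent check would be table(mydata_separation$lab_experience, mydata_separation$cvs_graph_inverse). If one cell is empty, the lab_experience1 odds ratio from this model cannot be interpreted.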

In [44]:
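# "mixed" is a plain glm here, so the lme4-specific fixef() and parm="beta_" do not apply.
# The original call, confint(mixed,parm="beta_",method="Wald"), failed with
# "Error in if (!nonA[i]) next: argument is of length zero".
# coef() returns the estimates and confint.default() gives Wald intervals for a glm.
cc <- confint.default(mixed)
ctab <- cbind(est=coef(mixed),cc)
rtab <- exp(ctab)
print(rtab,digits=3)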


## Predicting use of table + graph

In [45]:
mixed <- glmer(
    cvs_table
    ~ variable + sim_index + pre
    + lab_experience + similar_sim + prior_number_virtual_labs + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)

summary(mixed)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: 
cvs_table ~ variable + sim_index + pre + lab_experience + similar_sim +  
    prior_number_virtual_labs + (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   638.0    681.8   -309.0    618.0      578 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.4461 -0.4324  0.2107  0.3639  2.4347 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 6.385    2.527   
Number of obs: 588, groups:  sid, 147

Fixed effects:
                          Estimate Std. Error z value Pr(>|z|)   
(Intercept)               -2.42649    0.89129  -2.722  0.00648 **
variableConcentration      0.53062    0.36336   1.460  0.14420   
variableSeparation        -0.25288    0.33892  -0.746  0.45560   
variableWidth              0.27573    0.35989   0.766  0.44358

In [46]:
cc <- confint(mixed,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                              est  2.5 %  97.5 %
(Intercept)                0.0883 0.0154   0.507
variableConcentration      1.7000 0.8340   3.465
variableSeparation         0.7766 0.3997   1.509
variableWidth              1.3175 0.6507   2.667
sim_index2                 1.7883 1.0978   2.913
pre                        2.1300 1.2831   3.536
lab_experience1           19.8732 2.9063 135.895
similar_sim1               1.0519 0.4900   2.258
prior_number_virtual_labs  0.9267 0.5080   1.691
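
The random-intercept variances reported for sid can be turned into an intraclass correlation on the latent (logit) scale, which shows how much of the variation sits between students. This is a minimal sketch using the common pi^2/3 residual variance for logistic models. The two variances are copied from the summary outputs above, and the names var_sid, var_resid and icc are illustrative.

```r
# Random-intercept variances copied from the two summary() outputs above
var_sid <- c(cvs_table_only = 3.204, cvs_table = 6.385)

# Residual variance of the standard logistic distribution (latent scale)
var_resid <- pi^2 / 3

# Share of latent-scale variance attributable to differences between students
icc <- var_sid / (var_sid + var_resid)
print(round(icc, 3))
```

With a fitted glmer object, the variance could be read with as.numeric(VarCorr(mixed)$sid) instead of being typed in by hand.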









____________________________________________________________________________





# OTHER VERSIONS OF ANALYSES - kept for historical purposes
We decided not to include these analyses or to run them this way, but we keep the code here in case it is needed later.

First we reload the data, in case some factors were changed from continuous to categorical variables.

In [None]:
# mydata <- read.csv("C:\\Users\\Sarah\\Documents\\Personal Content\\Lab_study_data\\all_massaged_data\\dataframe_all_factors_for_analysis.txt",sep = '\t')
# # sid is the student number
# mydata$sid <- factor(mydata$sid)
# mydata$sim_index <- factor(mydata$sim_index)
# mydata$lab_experience <- factor(mydata$lab_experience)
# mydata$similar_sim <- factor(mydata$similar_sim)
# mydata$cvs_graph <- factor(mydata$cvs_graph)
# mydata$cvs_table_only <- factor(mydata$cvs_table_only)
# # mydata$main <- factor(mydata$main)
# # mydata$pre <- factor(mydata$pre)

## Stat model 1: Predicting main model scores as a categorical variable

First we reshape the data into the long format that the mlogit function expects (the input data frame is in "wide" shape, hence shape = 'wide' in the call below).
Now every student has a row for each variable times type of model (0,1,2).
The "alt" column is the model type (0,1,2) and "main" is TRUE if that was the model type they got correct (and FALSE for the other model types of that variable).
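
To make the target shape concrete, here is a minimal base-R sketch of the same expansion on hypothetical toy data. It only illustrates the layout described above, not how mlogit.data is implemented, and the values in wide are made up.

```r
# Toy wide data (hypothetical values): one row per student x variable
wide <- data.frame(
  sid      = c(1, 1, 2),
  variable = c("Width", "Separation", "Width"),
  main     = c(2, 0, 1)   # model type the student got correct
)

alts <- c(0, 1, 2)

# Repeat every row once per alternative, then flag the chosen alternative
long <- wide[rep(seq_len(nrow(wide)), each = length(alts)), ]
long$alt  <- rep(alts, times = nrow(wide))
long$main <- long$main == long$alt
rownames(long) <- NULL
print(long)
```

Each original row becomes three rows, exactly one of which has main equal to TRUE.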

In [38]:
# mydata$main <- factor(mydata$main)
# mydata$pre <- factor(mydata$pre)

In [39]:
# wide_mydata <- mlogit.data(mydata, shape = 'wide', choice = "main", id.var = "sid")
# head(wide_mydata, 5)

Then we run the mlogit model.

See the following: https://cran.r-project.org/web/packages/mlogit/vignettes/mlogit.pdf

Specifically, "mixed" in this document DOESN'T mean with repeated measures. In the formula below, the variables listed after the "|" are individual-specific, i.e. they take the same value for every alternative.
I followed the examples that use the "Train" dataset. See pages 3-7 for how to structure the data and pages 22-23 for an example of running mlogit.

In [40]:
# ml.mydata <- mlogit(main
#     ~ 1 | cvs_table_only + cvs_graph + variable + sim_index + pre
#     + lab_experience + similar_sim + prior_number_virtual_labs, wide_mydata)
# summary(ml.mydata)