Load libraries

In [1]:
## For some reason, when loading mlogit, the notebook can't find package 'statmod', so I specify its location
library(statmod, lib.loc='D:\\Applications\\Anaconda2\\pkgs\\r-statmod-1.4.30-r3.4.1_0\\lib\\R\\library\\')
require(mlogit)
require(ggplot2)
require(reshape2)
require(lme4)
require(compiler)
require(parallel)
require(car)
require(boot)
require(dplyr)
require(sjstats)
require(broom)

Loading required package: mlogit
Loading required package: Formula
"package 'Formula' was built under R version 3.4.4"
Loading required package: maxLik
"package 'maxLik' was built under R version 3.4.4"
Loading required package: miscTools

Please cite the 'maxLik' package as:
Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.

If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
https://r-forge.r-project.org/projects/maxlik/
Loading required package: ggplot2
"package 'ggplot2' was built under R version 3.4.4"
Loading required package: reshape2
"package 'reshape2' was built under R version 3.4.4"
Loading required package: lme4
"package 'lme4' was built under R version 3.4.4"
Loading required package: Matrix
Loading required package: compiler
Loading required package: parallel
Loading requir

# Load data and set factors

In [2]:
all_mydata <- read.csv("C:\\Users\\Sarah\\Documents\\Personal Content\\Lab_study_data\\all_massaged_data\\dataframe_all_factors_for_analysis.txt",sep = '\t')
# sid is the student number
# we use factor() so that R treats these columns as categorical
all_mydata$sid <- factor(all_mydata$sid)
all_mydata$sim_index <- factor(all_mydata$sim_index)
all_mydata$lab_experience <- factor(all_mydata$lab_experience)
all_mydata$similar_sim <- factor(all_mydata$similar_sim)
all_mydata$cvs_graph <- factor(all_mydata$cvs_graph)
all_mydata$cvs_table_only <- factor(all_mydata$cvs_table_only)
all_mydata$quant_score <- factor(all_mydata$quant_score)
# all_mydata$main <- factor(all_mydata$main)
# all_mydata$pre <- factor(all_mydata$pre)
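
Whether a column is a factor or a number changes how it enters a model, which is why main and pre are left numeric here. A minimal sketch on made-up scores (the vector `score` is hypothetical, not taken from the study data): a numeric 0/1/2 column contributes a single slope, while the same column as a factor contributes one indicator column per non-reference level.

```r
# Hypothetical scores on the same 0/1/2 scale as main and pre
score <- c(0, 1, 2, 1, 0, 2)

# Numeric: intercept + one slope column
X_numeric <- model.matrix(~ score)

# Factor: intercept + one indicator column per non-reference level
X_factor <- model.matrix(~ factor(score))

colnames(X_numeric)
colnames(X_factor)
```

Leaving main and pre numeric therefore assumes the step from 0 to 1 has the same effect as the step from 1 to 2.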

Here is what our data looks like:

In [3]:
head(all_mydata)
# colnames(mydata)

sid,sim,variable,pre,main,cvs_graph,cvs_table,cvs_table_only,qual_score,quant_score,...,use_graph,use_concentration,use_width,use_area,use_separation,use_all_vars,use_graph_beers,use_table_beers,use_table_capacitor,use_graph_capacitor
10127163,L,Concentration,0,2,1,1,0,1,1,...,1,1,1,1,1,4,1,1,1,1
10127163,L,Width,0,2,1,1,0,1,1,...,1,1,1,1,1,4,1,1,1,1
10232160,L,Concentration,0,0,1,1,0,1,1,...,1,1,1,1,1,4,1,1,1,1
10232160,L,Width,0,0,0,0,0,1,1,...,1,1,1,1,1,4,1,1,1,1
10232160,C,Area,0,2,1,1,0,1,1,...,1,1,1,1,1,4,1,1,1,1
10232160,C,Separation,0,2,1,1,0,1,1,...,1,1,1,1,1,4,1,1,1,1


We have the following factors that change per variable:
* main (0, 1, 2), treated as a continuous variable
* pre (0, 1, 2), treated as a continuous variable
* quant_score (0 or 1)
* cvs_graph (0 or 1)
* cvs_table (0 or 1)

We have the following independent factors:
* sim_index (1 or 2, whether it was the student's 1st or 2nd activity)
* variable (each variable belongs to a single sim, so we don't include sim as a separate factor)
* student attributes:
   * lab_experience (0 or 1, whether students have prior undergraduate physics or chemistry lab experience)
   * similar_sim (0 or 1, whether they have used a similar simulation)
   * prior_number_virtual_labs (levels from 0 to 3 depending on the number of virtual labs they have done in the past)

We ignore attitude components.

For main and pre score:
* score = 2 if they describe the correct relationship, i.e. a correct quantitative model
* score = 1 if they describe only the correct direction of the relationship, i.e. their qualitative model is correct but their quantitative model is not, or their quantitative model is incorrect but points in the correct direction
* score = 0 otherwise (i.e. all incorrect or only identified)
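
The rubric above can be sketched as a small function. This is only an illustration: `score_model` and its two logical inputs are hypothetical names, not columns of the data file, and the actual coding was done before this notebook.

```r
# Hypothetical helper: the logical inputs are assumptions, not columns of the data file
score_model <- function(quant_correct, direction_correct) {
  # 2 = correct quantitative model, 1 = only the direction is correct, 0 = otherwise
  ifelse(quant_correct, 2L, ifelse(direction_correct, 1L, 0L))
}

example_scores <- score_model(
  quant_correct     = c(TRUE, FALSE, FALSE),
  direction_correct = c(TRUE, TRUE, FALSE)
)
example_scores
```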

# First we remove perfect pre scores, per variable instance

In [4]:
mydata <- (all_mydata %>% filter(pre < 2))

"package 'bindrcpp' was built under R version 3.4.4"

In [5]:
print(dim(mydata));print(dim(all_mydata));
print(dim(unique(mydata['sid'])));print(dim(unique(all_mydata['sid'])));

[1] 549  32
[1] 549  32
[1] 147   1
[1] 147   1


We removed 39 instances of perfect pre. All 147 students remain in the study (i.e. no student got a perfect pre on all variables).
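
One way to check such counts is to compare row and student counts before and after the filter. A minimal sketch on a made-up data frame (`toy_all` and its values are hypothetical stand-ins for all_mydata):

```r
# Toy data frame standing in for all_mydata (values are made up)
toy_all <- data.frame(
  sid = c(1, 1, 2, 2, 3, 3),
  pre = c(0, 2, 1, 0, 2, 1)
)

# Same rule as the filter above: keep rows with pre < 2
toy_kept <- toy_all[toy_all$pre < 2, ]

# Rows dropped by the filter
n_removed <- nrow(toy_all) - nrow(toy_kept)

# Number of distinct students before and after the filter
n_students <- c(before = length(unique(toy_all$sid)),
                after  = length(unique(toy_kept$sid)))
n_removed
n_students
```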

# Stat model 1: Predicting main model score as a continuous variable

Some resources:
* On SS Types: https://mcfromnz.wordpress.com/2011/03/02/anova-type-iiiiii-ss-explained/
* On the drop1() function to do type III: https://www.statmethods.net/stats/anova.html
* On repeated measures: http://psych.wisc.edu/moore/Rpdf/610-R8_OneWayWithin.pdf, https://datascienceplus.com/two-way-anova-with-repeated-measures/
* The car package: https://cran.r-project.org/web/packages/car/car.pdf

## Complete model with interactions

Our model (without student factors) is:

    main  ~  cvs_table_only*variable + cvs_graph*variable
             + cvs_table_only*pre + cvs_graph*pre
             + sim_index + sid
             
We run a type III Anova:

In [6]:
lm1 = lm(main
        ~  cvs_table_only*variable + cvs_graph*variable + cvs_table_only*pre + cvs_graph*pre + sim_index + sid,
         data=mydata)
results1 = Anova(lm1, type=3)
results_table1 = tidy(results1)
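# Partial eta squared: SS_effect / (SS_effect + SS_residual); the last row of the table holds the residuals
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])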
results_table1

term,sumsq,df,statistic,p.value,eta
(Intercept),4.51614308,1,20.94491355,6.377037e-06,0.051342504
cvs_table_only,0.01658873,1,0.07693502,0.7816429,0.000198759
variable,0.44384032,3,0.68614575,0.560941,0.005290818
cvs_graph,2.42460628,1,11.24480955,0.0008772326,0.028235922
pre,0.61836397,1,2.86784091,0.0911707,0.007355931
sim_index,2.58685638,1,11.99729111,0.0005923491,0.030068603
sid,100.64902924,146,3.19718387,8.159092e-20,0.546726334
cvs_table_only:variable,0.19078764,3,0.29494421,0.829055,0.002281174
variable:cvs_graph,2.00869631,3,3.10530245,0.02653575,0.023506267
cvs_table_only:pre,0.13767781,1,0.6385205,0.4247372,0.001647206


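Of the interaction terms listed above, only variable:cvs_graph reaches p < 0.05 (p = 0.027), and its effect size is small (eta = 0.024). We therefore move to a simpler model without interactions and look at each variable separately in the "Model by variable" section.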

## Simple model without interaction

Our model (without student factors) is:

    main  ~  cvs_table_only + cvs_graph + variable
             + pre + sim_index + sid
             
We run a type II Anova:

In [7]:
lm1 = lm(main
        ~  cvs_table_only + cvs_graph + variable + pre + sim_index + sid,
         data=mydata)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,0.04977712,1,0.2269466,0.6340606,0.0005742185
cvs_graph,4.34472215,1,19.8086981,1.114922e-05,0.0477538157
variable,1.74754763,3,2.6558387,0.04816351,0.0197721051
pre,0.10980713,1,0.5006388,0.4796368,0.0012658356
sim_index,2.92872682,1,13.3528137,0.0002928603,0.0326992083
sid,101.63002321,146,3.1736797,9.587964e-20,0.5398186406
Residuals,86.63695306,395,,,0.5


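Apart from the student factor sid, we see that, in order of significance and partial eta^2, cvs_graph (p < 0.001), sim_index (p < 0.001), and variable (p = 0.048) matter. cvs_table_only and pre do not.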
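
The eta column in these tables is a partial eta squared, SS_effect / (SS_effect + SS_residual). A minimal sketch that recomputes it from the sums of squares printed in the type II table above (`partial_eta_sq` is a hypothetical helper name, not part of the notebook):

```r
# Sums of squares copied from the type II table above
sumsq <- c(cvs_table_only = 0.04977712, cvs_graph = 4.34472215,
           variable = 1.74754763, pre = 0.10980713,
           sim_index = 2.92872682, sid = 101.63002321,
           Residuals = 86.63695306)

# Partial eta squared: SS_effect / (SS_effect + SS_residual)
partial_eta_sq <- function(ss, ss_resid) ss / (ss + ss_resid)

# Apply it to every term except the residual row itself
eta <- partial_eta_sq(sumsq[names(sumsq) != "Residuals"], sumsq[["Residuals"]])
round(eta, 4)
```

Applied to the Residuals row, the same formula divides the residual sum of squares by twice itself, which is why that row always shows 0.5 and carries no information.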

## Model by variable

### For Width without interaction

In [8]:
test <- subset(mydata, variable == "Width")
lm1 = lm(main
        ~  cvs_table_only + cvs_graph + pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,1.792362334,1,4.97924188,0.0273023,0.03557129
cvs_graph,10.940237359,1,30.392341444,1.731737e-07,0.1837591
pre,0.002772954,1,0.007703358,0.9301905,5.705865e-05
sim_index,1.914251343,1,5.317853583,0.02263155,0.03789862
Residuals,48.595533392,135,,,0.5


### For Area without interaction

In [9]:
test <- subset(mydata, variable == "Area")
lm1 = lm(main
        ~  cvs_table_only + cvs_graph + pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,0.13427005,1,0.421359,0.5173961,0.0032061687
cvs_graph,6.43611111,1,20.1974579,1.519191e-05,0.1335833167
pre,0.86726624,1,2.7216083,0.101394,0.0203527936
sim_index,0.03383323,1,0.1061736,0.7450631,0.0008098293
Residuals,41.74438974,131,,,0.5


### For Concentration without interaction

In [10]:
test <- subset(mydata, variable == "Concentration")
lm1 = lm(main
        ~  cvs_table_only + cvs_graph + pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,0.001731068,1,0.005176852,0.942752,3.982035e-05
cvs_graph,6.757247108,1,20.20791055,1.520275e-05,0.1345329
pre,0.50700202,1,1.516216782,0.2204158,0.01152874
sim_index,2.048674243,1,6.1266704,0.01460262,0.04500713
Residuals,43.470210433,130,,,0.5


### For Separation without interaction

In [11]:
test <- subset(mydata, variable == "Separation")
lm1 = lm(main
        ~  cvs_table_only + cvs_graph + pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,0.08258145,1,0.2260558,0.6352439,0.001696784
cvs_graph,0.90022731,1,2.4642535,0.1188391,0.018191172
pre,0.76145678,1,2.0843875,0.1511633,0.015430262
sim_index,0.35528444,1,0.9725443,0.3258371,0.00725928
Residuals,48.58681664,133,,,0.5
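
The four per-variable cells above repeat the same fit on different subsets. A minimal sketch of doing this in one pass with split() and lapply(), on a made-up data frame (`toy` uses the same column names as mydata, but its values are random and only there to make the sketch runnable):

```r
set.seed(1)  # made-up data, only to make the sketch reproducible

# Toy stand-in for mydata with the same column names (values are random)
toy <- data.frame(
  variable       = rep(c("Width", "Area", "Concentration", "Separation"), each = 30),
  main           = sample(0:2, 120, replace = TRUE),
  pre            = sample(0:1, 120, replace = TRUE),
  cvs_graph      = factor(sample(0:1, 120, replace = TRUE)),
  cvs_table_only = factor(sample(0:1, 120, replace = TRUE)),
  sim_index      = factor(sample(1:2, 120, replace = TRUE))
)

# Fit the same formula once per variable and keep the fits in a named list
fits <- lapply(split(toy, toy$variable), function(d) {
  lm(main ~ cvs_table_only + cvs_graph + pre + sim_index, data = d)
})

# One column of coefficients per variable
coef_table <- sapply(fits, coef)
coef_table
```

Each element of `fits` could then be passed to Anova(..., type=2) exactly as in the cells above.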


### For Width with interaction

In [12]:
test <- subset(mydata, variable == "Width")
lm1 = lm(main
        ~  cvs_table_only*pre + cvs_graph*pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,1.784553299,1,4.891344806,0.02869904,0.03547246
pre,0.002772954,1,0.007600487,0.9306589,5.714325e-05
cvs_graph,10.96542797,1,30.055526608,2.036985e-07,0.1843269
sim_index,1.937525446,1,5.310631537,0.02274507,0.03839641
cvs_table_only:pre,0.070479509,1,0.193179763,0.6609965,0.001450373
pre:cvs_graph,0.029027102,1,0.079561403,0.7783311,0.0005978484
Residuals,48.523585665,133,,,0.5


### For Area  with interaction

In [13]:
test <- subset(mydata, variable == "Area")
lm1 = lm(main
        ~  cvs_table_only*pre + cvs_graph*pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,0.0908844,1,0.29793427,0.586123028,0.0023042462
pre,0.86726624,1,2.84304387,0.094187396,0.0215638519
cvs_graph,6.20267166,1,20.33339573,1.4445e-05,0.1361610752
sim_index,0.01571997,1,0.05153269,0.82077678,0.0003993187
cvs_table_only:pre,0.80404013,1,2.63577811,0.106922041,0.0200232653
pre:cvs_graph,0.77389531,1,2.53695835,0.113656256,0.0192870383
Residuals,39.35125516,129,,,0.5


### For Concentration with interaction

In [14]:
test <- subset(mydata, variable == "Concentration")
lm1 = lm(main
        ~  cvs_table_only*pre + cvs_graph*pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,0.001378535,1,0.004061585,0.9492842,3.173012e-05
pre,0.507002,1,1.493783,0.2238763,0.01153556
cvs_graph,6.766789,1,19.93703,1.737392e-05,0.134767
sim_index,1.990416,1,5.864376,0.0168519,0.04380834
cvs_table_only:pre,0.02372703,1,0.06990708,0.7918974,0.0005458509
pre:cvs_graph,8.594135e-05,1,0.0002532095,0.9873289,1.978195e-06
Residuals,43.44423,128,,,0.5


### For Separation with interaction

In [15]:
test <- subset(mydata, variable == "Separation")
lm1 = lm(main
        ~  cvs_table_only*pre + cvs_graph*pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table_only,0.078825521,1,0.21259614,0.6455047,0.0016202419
pre,0.761456782,1,2.05368479,0.1542205,0.0154350087
cvs_graph,0.898950741,1,2.42451247,0.121863,0.0181714171
sim_index,0.327733253,1,0.88391201,0.3488628,0.0067021974
cvs_table_only:pre,0.005427385,1,0.01463791,0.9038865,0.0001117273
pre:cvs_graph,0.013800705,1,0.03722115,0.8473143,0.0002840502
Residuals,48.571640181,131,,,0.5


In [16]:
colMeans(test["main"])

# Stat model 2: Predicting transfer data

## Excluding student main worksheet score

### Complete model with interactions

Our model is:

    quant_score  ~  cvs_table_only*variable + cvs_graph*variable
             + cvs_table_only*pre + cvs_graph*pre
             + sim_index + sid
             
We run a mixed-effects logistic regression (random intercept for sid):

In [17]:
mixed1 <- glmer(
    quant_score
    ~ cvs_table_only*variable + cvs_graph*variable + cvs_table_only*pre + cvs_graph*pre + sim_index + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)


Correlation matrix not shown by default, as p = 16 > 12.
Use print(obj, correlation=TRUE)  or
	 vcov(obj)	 if you need it



Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: quant_score ~ cvs_table_only * variable + cvs_graph * variable +  
    cvs_table_only * pre + cvs_graph * pre + sim_index + (1 |      sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   644.3    717.6   -305.2    610.3      532 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.7498 -0.5006  0.2947  0.4849  1.2801 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 3.612    1.9     
Number of obs: 549, groups:  sid, 147

Fixed effects:
                                      Estimate Std. Error z value Pr(>|z|)  
(Intercept)                            1.15239    0.46688   2.468   0.0136 *
cvs_table_only1                        0.29186    0.79386   0.368   0.7131  
variableConcentration                 -0.15172    0.55501  -0.273   0.784

**Odds ratios (exponentiated coefficients) with 95% Wald confidence intervals**

In [18]:
cc <- confint(mixed1,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed1),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                                        est  2.5 % 97.5 %
(Intercept)                           3.166 1.2679   7.90
cvs_table_only1                       1.339 0.2825   6.35
variableConcentration                 0.859 0.2895   2.55
variableSeparation                    0.355 0.1298   0.97
variableWidth                         1.643 0.5554   4.86
cvs_graph1                            0.712 0.2112   2.40
pre                                   1.942 0.7222   5.22
sim_index2                            1.298 0.8108   2.08
cvs_table_only1:variableConcentration 0.533 0.0627   4.53
cvs_table_only1:variableSeparation    1.775 0.2276  13.85
cvs_table_only1:variableWidth         0.565 0.0688   4.64
variableConcentration:cvs_graph1      2.009 0.4354   9.27
variableSeparation:cvs_graph1         1.503 0.3697   6.11
variableWidth:cvs_graph1              0.515 0.1150   2.31
cvs_table_only1:pre                   0.225 0.0338   1.50
cvs_graph1:pre                        1.024 0.2652   3.95


Again interactions are not significant, so we stick to a simpler model.

### Simple model without interactions

Our model is:

    quant_score  ~  cvs_table_only + cvs_graph + variable
                     + pre + sim_index + sid
             
We run a mixed-effects logistic regression (random intercept for sid):

In [19]:
mixed1 <- glmer(
    quant_score
    ~ cvs_table_only + cvs_graph + variable + sim_index + pre + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: quant_score ~ cvs_table_only + cvs_graph + variable + sim_index +  
    pre + (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   635.8    674.6   -308.9    617.8      540 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.4121 -0.5315  0.3070  0.4982  1.4331 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 3.167    1.78    
Number of obs: 549, groups:  sid, 147

Fixed effects:
                      Estimate Std. Error z value Pr(>|z|)   
(Intercept)            1.15680    0.36448   3.174   0.0015 **
cvs_table_only1       -0.23523    0.41552  -0.566   0.5713   
cvs_graph1            -0.17789    0.33941  -0.524   0.6002   
variableConcentration  0.08048    0.33327   0.241   0.8092   
variableSeparation    -0.79314    0.32207  -2

**Odds ratios (exponentiated coefficients) with 95% Wald confidence intervals**

In [20]:
cc <- confint(mixed1,parm="beta_",method="Wald",level=0.95)
ctab <- cbind(est=fixef(mixed1),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                        est 2.5 % 97.5 %
(Intercept)           3.180 1.556  6.496
cvs_table_only1       0.790 0.350  1.785
cvs_graph1            0.837 0.430  1.628
variableConcentration 1.084 0.564  2.083
variableSeparation    0.452 0.241  0.851
variableWidth         1.088 0.571  2.072
sim_index2            1.278 0.807  2.024
pre                   1.517 0.826  2.787


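As expected, CVS doesn't predict quant transfer scores: the confidence intervals for cvs_table_only and cvs_graph both include 1. Only variable does, with Separation having lower odds of a correct quant score than the reference variable, Area (odds ratio 0.452, 95% CI 0.241 to 0.851).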
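
For reference, the simple model fitted above can be written out as a random-intercept logistic regression. The notation is ours and not from the lme4 output: $i$ indexes students, $j$ indexes a student's variable instances, and $\beta_{\text{variable}(ij)}$ is the coefficient of the variable concerned (zero for the reference level).

$$
\operatorname{logit} P(\text{quant\_score}_{ij} = 1) = \beta_0 + \beta_1\,\text{cvs\_table\_only}_{ij} + \beta_2\,\text{cvs\_graph}_{ij} + \beta_{\text{variable}(ij)} + \beta_3\,\text{sim\_index}_{ij} + \beta_4\,\text{pre}_{ij} + u_i, \qquad u_i \sim \mathcal{N}(0, \sigma_u^2)
$$

The term $u_i$ is what `(1 | sid)` adds to the formula. Its estimated variance $\sigma_u^2$ is the 3.167 reported under "Random effects".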

## Including student main worksheet score
The main worksheet score is included as a continuous variable.

### Complete model with interactions

Our model is:

    quant_score  ~  main + cvs_table_only*variable + cvs_graph*variable
                    + cvs_table_only*pre + cvs_graph*pre
                    + sim_index + sid
             
We run a mixed-effects logistic regression (random intercept for sid):

In [21]:
mixed1 <- glmer(
    quant_score
    ~ main + cvs_table_only*variable + cvs_graph*variable + cvs_table_only*pre + cvs_graph*pre + sim_index + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)


Correlation matrix not shown by default, as p = 17 > 12.
Use print(obj, correlation=TRUE)  or
	 vcov(obj)	 if you need it



Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: quant_score ~ main + cvs_table_only * variable + cvs_graph *  
    variable + cvs_table_only * pre + cvs_graph * pre + sim_index +  
    (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   617.8    695.3   -290.9    581.8      531 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.4260 -0.4262  0.2495  0.4576  2.5404 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 4.232    2.057   
Number of obs: 549, groups:  sid, 147

Fixed effects:
                                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)                            -0.1673     0.5519  -0.303    0.762    
main                                    1.3980     0.2939   4.757 1.97e-06 ***
cvs_table_only1                         0.2459     0.8320  

**Odds ratios (exponentiated coefficients) with 95% Wald confidence intervals**

In [22]:
cc <- confint(mixed1,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed1),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                                        est  2.5 % 97.5 %
(Intercept)                           0.846 0.2868  2.495
main                                  4.047 2.2750  7.200
cvs_table_only1                       1.279 0.2504  6.531
variableConcentration                 0.734 0.2353  2.290
variableSeparation                    0.295 0.1028  0.845
variableWidth                         1.706 0.5448  5.344
cvs_graph1                            0.366 0.0978  1.371
pre                                   2.053 0.7300  5.771
sim_index2                            1.110 0.6730  1.831
cvs_table_only1:variableConcentration 0.560 0.0599  5.234
cvs_table_only1:variableSeparation    2.249 0.2691 18.803
cvs_table_only1:variableWidth         0.429 0.0475  3.869
variableConcentration:cvs_graph1      2.326 0.4626 11.699
variableSeparation:cvs_graph1         2.495 0.5623 11.073
variableWidth:cvs_graph1              0.455 0.0939  2.202
cvs_table_only1:pre                   0.239 0.0336  1.704
cvs_graph1:pre

Again interactions are not significant, so we stick to a simpler model.

### Simple model without interactions

Our model is:

    quant_score  ~  main + cvs_table_only + cvs_graph + variable
                     + pre + sim_index + sid
             
We run a mixed-effects logistic regression (random intercept for sid):

In [23]:
mixed1 <- glmer(
    quant_score
    ~ main + cvs_table_only + cvs_graph + variable + sim_index + pre + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: quant_score ~ main + cvs_table_only + cvs_graph + variable +  
    sim_index + pre + (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   611.1    654.2   -295.5    591.1      539 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.8527 -0.4704  0.2833  0.4741  1.9656 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 3.648    1.91    
Number of obs: 549, groups:  sid, 147

Fixed effects:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -0.095498   0.452358  -0.211   0.8328    
main                   1.285815   0.275273   4.671    3e-06 ***
cvs_table_only1       -0.271983   0.439773  -0.618   0.5363    
cvs_graph1            -0.698200   0.375859  -1.858   0.0632 .  
variableConcentration -0.007

**Odds ratios (exponentiated coefficients) with 95% Wald confidence intervals**

In [24]:
cc <- confint(mixed1,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed1),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                        est 2.5 % 97.5 %
(Intercept)           0.909 0.375  2.206
main                  3.618 2.109  6.205
cvs_table_only1       0.762 0.322  1.804
cvs_graph1            0.497 0.238  1.039
variableConcentration 0.992 0.501  1.965
variableSeparation    0.482 0.250  0.931
variableWidth         1.021 0.520  2.004
sim_index2            1.108 0.682  1.799
pre                   1.457 0.771  2.753


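## Discussion of all 4 models (with and without interactions, with and without main)
What we notice:
* cvs_graph never matters (its confidence interval includes 1 in all four models)
* main matters (odds ratio of about 3.6 to 4.0 per point)
* pre doesn't matter
* variable matters (Separation has lower odds than the reference variable in all four models)
* sim_index doesn't matter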
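
Because main enters the models as a continuous variable, its odds ratio applies per one-point increase, and a two-point difference multiplies the odds twice. A minimal sketch of that arithmetic, using the odds ratio from the simple model that includes main (the variable names are illustrative):

```r
# Odds ratio for main, copied from the simple model that includes main
or_main <- 3.618

# Effects add on the log-odds scale, so they multiply on the odds scale:
# a 2-point difference in main corresponds to or_main squared
or_two_points <- or_main^2
round(or_two_points, 2)
```

This relies on the linearity assumed by treating main as continuous, and it describes odds conditional on the student's random intercept.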

# Stat model 3: Predicting the use of CVS

## Predicting the use of CVS_graph

Our model is:

    cvs_graph  ~ variable + pre + sim_index + sid
                 + lab_experience + similar_sim + prior_number_virtual_labs
             
We run a mixed-effects logistic regression (random intercept for sid):

In [25]:
mixed <- glmer(
    cvs_graph
    ~ variable + sim_index + pre
    + lab_experience + similar_sim + prior_number_virtual_labs + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)

summary(mixed)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: 
cvs_graph ~ variable + sim_index + pre + lab_experience + similar_sim +  
    prior_number_virtual_labs + (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   561.0    604.1   -270.5    541.0      539 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.47112 -0.30915 -0.07747  0.32556  2.83633 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 11.45    3.383   
Number of obs: 549, groups:  sid, 147

Fixed effects:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)                -6.2905     1.5039  -4.183 2.88e-05 ***
variableConcentration       0.8132     0.4288   1.897 0.057890 .  
variableSeparation         -0.1421     0.3948  -0.360 0.719011    
variableWidth               0.1669     0.4183   

**Odds ratios (exponentiated coefficients) with 95% Wald confidence intervals**

In [26]:
cc <- confint(mixed,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                               est    2.5 %   97.5 %
(Intercept)                0.00185 9.73e-05 3.53e-02
variableConcentration      2.25501 9.73e-01 5.23e+00
variableSeparation         0.86758 4.00e-01 1.88e+00
variableWidth              1.18167 5.20e-01 2.68e+00
sim_index2                 2.93582 1.62e+00 5.32e+00
pre                        2.53046 1.16e+00 5.52e+00
lab_experience1           91.27049 4.38e+00 1.90e+03
similar_sim1               0.88842 3.63e-01 2.17e+00
prior_number_virtual_labs  1.48409 6.73e-01 3.27e+00


## Predicting use of table_only

In [27]:
mixed <- glmer(
    cvs_table_only
    ~ variable + sim_index + pre
    + lab_experience + similar_sim + prior_number_virtual_labs + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)

summary(mixed)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: cvs_table_only ~ variable + sim_index + pre + lab_experience +  
    similar_sim + prior_number_virtual_labs + (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   427.4    470.5   -203.7    407.4      539 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.5401 -0.2913 -0.2034 -0.1509  2.9021 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 2.946    1.716   
Number of obs: 549, groups:  sid, 147

Fixed effects:
                          Estimate Std. Error z value Pr(>|z|)  
(Intercept)                -1.6050     0.7344  -2.185   0.0289 *
variableConcentration      -0.2349     0.4357  -0.539   0.5898  
variableSeparation         -0.1197     0.4152  -0.288   0.7730  
variableWidth               0.1100     0.4162   0.264   0.7915

In [28]:
cc <- confint(mixed,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                            est  2.5 % 97.5 %
(Intercept)               0.201 0.0476  0.847
variableConcentration     0.791 0.3366  1.857
variableSeparation        0.887 0.3932  2.002
variableWidth             1.116 0.4937  2.524
sim_index2                0.540 0.3022  0.966
pre                       0.805 0.3765  1.722
lab_experience1           1.424 0.3159  6.421
similar_sim1              1.257 0.5040  3.137
prior_number_virtual_labs 0.562 0.3284  0.962


## Predicting use of table + graph

In [29]:
mixed <- glmer(
    cvs_table
    ~ variable + sim_index + pre
    + lab_experience + similar_sim + prior_number_virtual_labs + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)

summary(mixed)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: 
cvs_table ~ variable + sim_index + pre + lab_experience + similar_sim +  
    prior_number_virtual_labs + (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   610.1    653.2   -295.1    590.1      539 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.3867 -0.4387  0.2258  0.3497  2.3340 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 6.418    2.533   
Number of obs: 549, groups:  sid, 147

Fixed effects:
                          Estimate Std. Error z value Pr(>|z|)   
(Intercept)               -2.47889    0.91531  -2.708  0.00676 **
variableConcentration      0.50799    0.37462   1.356  0.17509   
variableSeparation        -0.16126    0.34963  -0.461  0.64462   
variableWidth              0.19536    0.36655   0.533  0.59405

In [30]:
cc <- confint(mixed,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                              est  2.5 %  97.5 %
(Intercept)                0.0838 0.0139   0.504
variableConcentration      1.6620 0.7975   3.463
variableSeparation         0.8511 0.4289   1.689
variableWidth              1.2158 0.5927   2.494
sim_index2                 1.5617 0.9445   2.582
pre                        1.8937 0.9596   3.737
lab_experience1           21.0913 2.9661 149.979
similar_sim1               1.1356 0.5193   2.483
prior_number_virtual_labs  0.9589 0.5223   1.760









____________________________________________________________________________





# OTHER VERSION OF ANALYSES - keep for historical purposes
Even though we decided not to include them or do analyses this way, we keep the code to run them here just in case.

First we reload the data, in case some factors have changed from continuous to categorical variables.

In [31]:
# mydata <- read.csv("C:\\Users\\Sarah\\Documents\\Personal Content\\Lab_study_data\\all_massaged_data\\dataframe_all_factors_for_analysis.txt",sep = '\t')
# # sid is the student number
# mydata$sid <- factor(mydata$sid)
# mydata$sim_index <- factor(mydata$sim_index)
# mydata$lab_experience <- factor(mydata$lab_experience)
# mydata$similar_sim <- factor(mydata$similar_sim)
# mydata$cvs_graph <- factor(mydata$cvs_graph)
# mydata$cvs_table_only <- factor(mydata$cvs_table_only)
# # mydata$main <- factor(mydata$main)
# # mydata$pre <- factor(mydata$pre)

## Stat model 1: Predicting main model scores as a categorical variable

First we reshape the data for the mlogit function: mlogit.data takes our wide-shaped data and expands it to one row per alternative.
Now every student has a row for each variable times type of model (0,1,2).
The "alt" is the model type (0,1,2) and "main" is TRUE for the model type the student achieved (and FALSE for the other two rows of that variable).
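
The shape described above can be illustrated in base R without mlogit. This is only a sketch of the row layout on a made-up data frame (`toy`, `long`, and the `choice` column are hypothetical names), not a reproduction of what mlogit.data returns:

```r
# Hypothetical mini data set: one row per student and variable, main is the score obtained
toy <- data.frame(sid = c(1, 1, 2),
                  variable = c("Width", "Area", "Width"),
                  main = c(2, 0, 1))

# Cartesian product with the three alternatives: one row per (student, variable, alt)
long <- merge(toy, data.frame(alt = 0:2), by = NULL)

# choice is TRUE only on the row whose alternative equals the score obtained
long$choice <- long$main == long$alt
long[order(long$sid, long$variable, long$alt), ]
```

Each original row becomes three rows, exactly one of which is marked as chosen.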

In [32]:
# mydata$main <- factor(mydata$main)
# mydata$pre <- factor(mydata$pre)

In [33]:
# wide_mydata <- mlogit.data(mydata, shape = 'wide', choice = "main", id.var = "sid")
# head(wide_mydata, 5)

Then we run the mlogit model.

See the following: https://cran.r-project.org/web/packages/mlogit/vignettes/mlogit.pdf

Specifically, "mixed" in this document doesn't mean "with repeated measures". In the formula below, the part before the "|" holds the alternative-specific variables (here only the intercept, "1"), and the part after it holds the individual-specific variables.
The examples using the "Train" dataset are what I followed. See pages 3-7 for how to structure the data and pages 22-23 for an example of running mlogit.

In [34]:
# ml.mydata <- mlogit(main
#     ~ 1 | cvs_table_only + cvs_graph + variable + sim_index + pre
#     + lab_experience + similar_sim + prior_number_virtual_labs, wide_mydata)
# summary(ml.mydata)