# Load libraries

In [91]:
## For some reason, when loading mlogit, the notebook can't find the package 'statmod', so I specify its location
library(statmod, lib.loc='D:\\Applications\\Anaconda2\\pkgs\\r-statmod-1.4.30-r3.4.1_0\\lib\\R\\library\\')
require(mlogit)
require(ggplot2)
require(reshape2)
require(lme4)
require(compiler)
require(parallel)
require(car)
require(boot)
require(dplyr)
require(sjstats)
require(broom)
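
The hard-coded lib.loc above is tied to one machine. A more portable pattern, offered here as a sketch and not as what this notebook actually ran, is to check which packages are installed before attaching them. The names pkgs and installed are illustrative.

```r
# Packages used in this notebook (statmod is the one mlogit failed to find above).
pkgs <- c("statmod", "mlogit", "ggplot2", "reshape2", "lme4", "compiler",
          "parallel", "car", "boot", "dplyr", "sjstats", "broom")

# requireNamespace() returns FALSE instead of raising an error when a package is absent.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

if (any(!installed)) {
  message("Missing packages: ", paste(pkgs[!installed], collapse = ", "))
}

# Attach only what is available; library() stops with an error on failure,
# whereas require() only returns FALSE with a warning.
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```

Any package reported as missing can then be installed into the default library, which avoids pointing at a specific Anaconda directory.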

# Load data and set factors

In [93]:
mydata <- read.csv("C:\\Users\\Sarah\\Documents\\Personal Content\\Lab_study_data\\all_massaged_data\\dataframe_all_factors_for_analysis.txt",sep = '\t')
# sid is the student number
mydata$sid <- factor(mydata$sid)
mydata$sim_index <- factor(mydata$sim_index)
mydata$lab_experience <- factor(mydata$lab_experience)
mydata$similar_sim <- factor(mydata$similar_sim)
mydata$cvs_graph <- factor(mydata$cvs_graph)
mydata$cvs_table <- factor(mydata$cvs_table)
mydata$quant_score <- factor(mydata$quant_score)
# mydata$main <- factor(mydata$main)
# mydata$pre <- factor(mydata$pre)
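
The repeated factor() calls above can also be written as a single step. The sketch below uses a small made-up data frame (toy) in place of the real file, and the column list factor_cols is an assumption chosen for illustration.

```r
# Toy stand-in for mydata; the real data file is not reproduced here.
toy <- data.frame(
  sid       = c(10127163, 10127163, 10232160),
  sim_index = c(1, 2, 1),
  cvs_graph = c(1, 0, 1),
  pre       = c(0, 2, 1)
)

# Columns to treat as categorical; pre is left numeric (continuous).
factor_cols <- c("sid", "sim_index", "cvs_graph")
toy[factor_cols] <- lapply(toy[factor_cols], factor)

# Inspect the resulting column types.
str(toy)
```

Keeping the list of categorical columns in one vector makes it easier to see at a glance which predictors are treated as continuous.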

Here is what our data looks like:

In [94]:
head(mydata)
# colnames(mydata)

sid,sim,variable,pre,main,cvs_graph,cvs_table,cvs_table_only,qual_score,quant_score,...,pre_with_ident,main_with_ident,CVS_context,use_table,use_graph,use_concentration,use_width,use_area,use_separation,use_all_vars
10127163,L,Concentration,0,2,1,1,0,1,1,...,1,3,2,1,1,1,1,1,1,4
10127163,L,Width,0,2,1,1,0,1,1,...,1,3,2,1,1,1,1,1,1,4
10127163,C,Area,2,2,1,1,0,1,1,...,3,3,2,1,1,1,1,1,1,4
10127163,C,Separation,2,2,1,1,0,1,1,...,3,3,2,1,1,1,1,1,1,4
10232160,L,Concentration,0,0,1,1,0,1,1,...,1,1,2,1,1,1,1,1,1,4
10232160,L,Width,0,0,0,0,0,1,1,...,1,1,0,1,1,1,1,1,1,4


We have the following factors that change per variable:
* main (0,1,2), treated as a continuous variable
* pre (0,1,2), treated as a continuous variable
* quant_score (0 or 1)
* cvs_graph (0 or 1)
* cvs_table (0 or 1)

We have the following independent factors:
* sim_index (1 or 2, whether it was the student's 1st or 2nd activity)
* variable (this already identifies the sim, so we don't include sim as a factor)
* student attributes:
   * lab_experience (0 or 1 if students have prior undergraduate physics or chemistry lab experience)
   * similar_sim (0 or 1 if they have used a similar simulation)
   * prior_number_virtual_labs (levels from 0 to 3 depending on the number of virtual labs they have done in the past)

We ignore attitude components.

For main and pre score:
* score = 2 if they describe the correct relationship, i.e. a correct quantitative model
* score = 1 if they describe the correct direction of the relationship, i.e. their model is qualitatively correct but quantitatively incorrect
* score = 0 otherwise (i.e. all incorrect or only identified)
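
The scoring rule above can be written as a small function. This is only an illustration of the rubric: the helper score_model and its two logical arguments are hypothetical and are not part of the original coding pipeline.

```r
# Hypothetical helper: maps two logical judgments to the 0/1/2 score.
score_model <- function(quant_correct, qual_correct) {
  if (quant_correct) {
    2L   # correct quantitative model
  } else if (qual_correct) {
    1L   # correct direction of the relationship only
  } else {
    0L   # incorrect, or variable only identified
  }
}

score_model(TRUE, TRUE)    # 2
score_model(FALSE, TRUE)   # 1
score_model(FALSE, FALSE)  # 0
```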

# Stat model 1: Predicting main model score as a continuous variable

Some resources:
* On SS Types: https://mcfromnz.wordpress.com/2011/03/02/anova-type-iiiiii-ss-explained/
* On the drop1() function to do type 3: https://www.statmethods.net/stats/anova.html
* On repeated measures: http://psych.wisc.edu/moore/Rpdf/610-R8_OneWayWithin.pdf, https://datascienceplus.com/two-way-anova-with-repeated-measures/
* the car package: https://cran.r-project.org/web/packages/car/car.pdf

## Complete model with interactions

Our model (without student factors) is:

    main  ~  cvs_table*variable + cvs_graph*variable
             + cvs_table*pre + cvs_graph*pre
             + sim_index + sid
             
We run a type III Anova:

In [95]:
lm1 = lm(main
        ~  cvs_table*variable + cvs_graph*variable + cvs_table*pre + cvs_graph*pre + sim_index + sid,
         data=mydata)
results1 = Anova(lm1, type=3)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
(Intercept),7.805831,1,34.747383469,7.642655e-09,0.07541526
cvs_table,0.02646537,1,0.117809663,0.7315917,0.000276472
variable,0.2962562,3,0.439591352,0.7247968,0.00308616
cvs_graph,1.019934,1,4.540199846,0.03367984,0.01054536
pre,0.0009773337,1,0.004350567,0.9474415,1.021249e-05
sim_index,2.740467,1,12.19909339,0.0005281336,0.02783916
sid,98.8187,146,3.012931588,1.176573e-18,0.5080195
cvs_table:variable,1.212752,3,1.799508506,0.1465681,0.01251401
variable:cvs_graph,0.04663895,3,0.069203884,0.9763197,0.0004871139
cvs_table:pre,0.001981195,1,0.008819222,0.9252242,2.070197e-05


None of the interactions are significant so let's move to a simpler model.

## Simple model without interactions

Our model (without student factors) is:

    main  ~  cvs_table + cvs_graph + variable
             + pre + sim_index + sid
             
We run a type II Anova:

In [96]:
lm1 = lm(main
        ~  cvs_table + cvs_graph + variable + pre + sim_index + sid,
         data=mydata)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

term,sumsq,df,statistic,p.value,eta
cvs_table,0.002875905,1,0.01265297,0.9104905,2.915345e-05
cvs_graph,2.846821,1,12.52501086,0.0004449018,0.02804996
variable,2.34179,3,3.43434961,0.01698431,0.02318924
pre,0.5063759,1,2.22787537,0.1362667,0.005107137
sim_index,2.769224,1,12.18360845,0.000531456,0.02730627
sid,100.3316,146,3.02345002,6.702067999999999e-19,0.5042401
Residuals,98.64427,434,,,0.5


We see that, in order of significance and partial eta^2, cvs_graph, sim_index, and variable matter; cvs_table and pre do not.
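
The eta column in the table is a partial eta squared: each term's sum of squares divided by that sum of squares plus the residual sum of squares. As a check, the sketch below recomputes it from the values printed in the type II table. The vector names are illustrative.

```r
# Sums of squares copied from the type II table above.
sumsq <- c(cvs_table = 0.002875905, cvs_graph = 2.846821, variable = 2.34179,
           pre = 0.5063759, sim_index = 2.769224, sid = 100.3316)
ss_resid <- 98.64427

# Partial eta squared: effect SS over (effect SS + residual SS).
partial_eta_sq <- sumsq / (sumsq + ss_resid)
round(partial_eta_sq, 4)
```

The same formula applied to the Residuals row itself gives 98.64427 / (98.64427 + 98.64427) = 0.5, so the 0.5 in that row is a by-product of the calculation and not an effect size.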

# Stat model 2: Predicting transfer data

## Excluding student main worksheet score

### Complete model with interactions

Our model is:

    quant_score  ~  cvs_table*variable + cvs_graph*variable
             + cvs_table*pre + cvs_graph*pre
             + sim_index + sid
             
We run a mixed-effects logistic regression, with sid entering as a random intercept, (1 | sid):

In [97]:
mixed1 <- glmer(
    quant_score
    ~ cvs_table*variable + cvs_graph*variable + cvs_table*pre + cvs_graph*pre + sim_index + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)


Correlation matrix not shown by default, as p = 16 > 12.
Use print(obj, correlation=TRUE)  or
	 vcov(obj)	 if you need it



Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: 
quant_score ~ cvs_table * variable + cvs_graph * variable + cvs_table *  
    pre + cvs_graph * pre + sim_index + (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   670.0    744.4   -318.0    636.0      571 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.2052 -0.4659  0.2881  0.4786  1.2799 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 3.735    1.932   
Number of obs: 588, groups:  sid, 147

Fixed effects:
                                 Estimate Std. Error z value Pr(>|z|)  
(Intercept)                       1.16394    0.46704   2.492   0.0127 *
cvs_table1                        0.28970    0.78194   0.370   0.7110  
variableConcentration            -0.19343    0.55513  -0.348   0.7275  
variableSeparation            

**(non log) Odds ratio with confidence intervals**

In [98]:
cc <- confint(mixed1,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed1),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                                   est  2.5 % 97.5 %
(Intercept)                      3.203 1.2822   8.00
cvs_table1                       1.336 0.2886   6.19
variableConcentration            0.824 0.2776   2.45
variableSeparation               0.346 0.1259   0.95
variableWidth                    1.597 0.5387   4.73
cvs_graph1                       0.503 0.1060   2.39
pre                              2.657 1.1025   6.40
sim_index2                       1.300 0.8226   2.05
cvs_table1:variableConcentration 0.678 0.0865   5.32
cvs_table1:variableSeparation    0.971 0.1345   7.01
cvs_table1:variableWidth         0.474 0.0591   3.80
variableConcentration:cvs_graph1 2.760 0.3685  20.68
variableSeparation:cvs_graph1    1.772 0.2571  12.21
variableWidth:cvs_graph1         1.205 0.1635   8.88
cvs_table1:pre                   0.460 0.1186   1.79
cvs_graph1:pre                   1.572 0.4528   5.46


Again interactions are not significant, so we stick to a simpler model.

### Simple model without interactions

Our model is:

    quant_score  ~  cvs_table + cvs_graph + variable
                     + pre + sim_index + sid
             
We run a mixed-effects logistic regression, with sid entering as a random intercept, (1 | sid):

In [99]:
mixed1 <- glmer(
    quant_score
    ~ cvs_table + cvs_graph + variable + sim_index + pre + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: quant_score ~ cvs_table + cvs_graph + variable + sim_index +  
    pre + (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   658.8    698.2   -320.4    640.8      579 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.6604 -0.4832  0.2932  0.4892  1.3620 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 3.461    1.86    
Number of obs: 588, groups:  sid, 147

Fixed effects:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)            1.21488    0.36514   3.327 0.000877 ***
cvs_table1            -0.23617    0.41210  -0.573 0.566580    
cvs_graph1            -0.05100    0.42562  -0.120 0.904629    
variableConcentration  0.03205    0.32643   0.098 0.921793    
variableSeparation    -0.81255    0.31677  -2

**(non log) Odds ratio with confidence intervals**

In [100]:
cc <- confint(mixed1,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed1),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                        est 2.5 % 97.5 %
(Intercept)           3.370 1.647  6.893
cvs_table1            0.790 0.352  1.771
cvs_graph1            0.950 0.413  2.188
variableConcentration 1.033 0.545  1.958
variableSeparation    0.444 0.238  0.826
variableWidth         1.072 0.567  2.024
sim_index2            1.275 0.813  2.001
pre                   1.831 1.158  2.894


As expected, CVS doesn't predict quant transfer scores; only variable (Separation) and pre do, since theirs are the only confidence intervals that exclude 1.

## Including student main worksheet score (as a continuous variable)

### Complete model with interactions

Our model is:

    quant_score  ~  main + cvs_table*variable + cvs_graph*variable
                    + cvs_table*pre + cvs_graph*pre
                    + sim_index + sid
             
We run a mixed-effects logistic regression, with sid entering as a random intercept, (1 | sid):

In [101]:
mixed1 <- glmer(
    quant_score
    ~ main + cvs_table*variable + cvs_graph*variable + cvs_table*pre + cvs_graph*pre + sim_index + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)


Correlation matrix not shown by default, as p = 17 > 12.
Use print(obj, correlation=TRUE)  or
	 vcov(obj)	 if you need it



Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: quant_score ~ main + cvs_table * variable + cvs_graph * variable +  
    cvs_table * pre + cvs_graph * pre + sim_index + (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   639.2    718.0   -301.6    603.2      570 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.8868 -0.3810  0.2352  0.4439  2.6622 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 4.377    2.092   
Number of obs: 588, groups:  sid, 147

Fixed effects:
                                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)                      -0.22567    0.55203  -0.409   0.6827    
main                              1.46100    0.28806   5.072 3.94e-07 ***
cvs_table1                        0.21502    0.82361   0.261   0.7940    
variableConcentr

**(non log) Odds ratio with confidence intervals**

In [102]:
cc <- confint(mixed1,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed1),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                                   est  2.5 % 97.5 %
(Intercept)                      0.798 0.2705  2.354
main                             4.310 2.4508  7.581
cvs_table1                       1.240 0.2468  6.229
variableConcentration            0.733 0.2350  2.287
variableSeparation               0.285 0.0987  0.823
variableWidth                    1.684 0.5346  5.303
cvs_graph1                       0.276 0.0516  1.478
pre                              2.581 1.0100  6.594
sim_index2                       1.098 0.6741  1.789
cvs_table1:variableConcentration 0.653 0.0747  5.706
cvs_table1:variableSeparation    1.569 0.1983 12.420
cvs_table1:variableWidth         0.352 0.0394  3.136
variableConcentration:cvs_graph1 3.150 0.3724 26.636
variableSeparation:cvs_graph1    1.893 0.2504 14.313
variableWidth:cvs_graph1         1.417 0.1742 11.524
cvs_table1:pre                   0.445 0.1004  1.971
cvs_graph1:pre                   1.467 0.3715  5.794


Again interactions are not significant, so we stick to a simpler model.

### Simple model without interactions

Our model is:

    quant_score  ~  main + cvs_table + cvs_graph + variable
                     + pre + sim_index + sid
             
We run a mixed-effects logistic regression, with sid entering as a random intercept, (1 | sid):

In [103]:
mixed1 <- glmer(
    quant_score
    ~ main + cvs_table + cvs_graph + variable + sim_index + pre + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: quant_score ~ main + cvs_table + cvs_graph + variable + sim_index +  
    pre + (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   630.9    674.7   -305.5    610.9      578 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.8731 -0.4163  0.2662  0.4557  2.0194 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 3.965    1.991   
Number of obs: 588, groups:  sid, 147

Fixed effects:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -0.107637   0.452839  -0.238   0.8121    
main                   1.340418   0.271828   4.931 8.18e-07 ***
cvs_table1            -0.269440   0.439074  -0.614   0.5394    
cvs_graph1            -0.539544   0.468049  -1.153   0.2490    
variableConcentration -0.046691  

**(non log) Odds ratio with confidence intervals**

In [104]:
cc <- confint(mixed1,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed1),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

                        est 2.5 % 97.5 %
(Intercept)           0.898 0.370  2.181
main                  3.821 2.243  6.509
cvs_table1            0.764 0.323  1.806
cvs_graph1            0.583 0.233  1.459
variableConcentration 0.954 0.488  1.865
variableSeparation    0.489 0.256  0.937
variableWidth         1.009 0.518  1.967
sim_index2            1.097 0.681  1.765
pre                   1.668 1.020  2.728


## Discussion of all 4 models (with and without interactions, with and without main)
What we notice:
* cvs_graph never matters
* main matters
* pre matters
* variable matters
* sim_index doesn't matter...
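
The choice to drop the interaction terms can also be cross-checked with a likelihood-ratio test, since each simple model is nested in its interaction counterpart. The sketch below uses the deviance and df.resid values printed in the four summaries. Those values are rounded, so the p-values are approximate, and the helper lrt is hypothetical.

```r
# Likelihood-ratio test from printed deviances: the deviance difference is
# compared to a chi-square with df equal to the number of parameters dropped.
lrt <- function(dev_simple, dev_full, df_simple, df_full) {
  stat <- dev_simple - dev_full
  df   <- df_simple - df_full
  c(chisq = stat, df = df, p = pchisq(stat, df = df, lower.tail = FALSE))
}

# Models without main: simple (640.8, 579) vs. interactions (636.0, 571).
without_main <- lrt(640.8, 636.0, 579, 571)

# Models with main: simple (610.9, 578) vs. interactions (603.2, 570).
with_main <- lrt(610.9, 603.2, 578, 570)

round(rbind(without_main, with_main), 3)
```

Both p-values come out well above 0.05 (roughly 0.78 and 0.46), which is consistent with the simpler models fitting about as well. The lower AIC of the simple models in the printed summaries points the same way.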

# Stat model 3: Predicting the use of CVS graph

Our model is:

    cvs_graph  ~ variable + pre + sim_index + sid
                 + lab_experience + similar_sim + prior_number_virtual_labs
             
We run a mixed-effects logistic regression, with sid entering as a random intercept, (1 | sid):

In [105]:
mixed <- glmer(
    cvs_graph
    ~ variable + sim_index + pre
    + lab_experience + similar_sim + prior_number_virtual_labs + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)

summary(mixed)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
 Family: binomial  ( logit )
Formula: 
cvs_graph ~ variable + sim_index + pre + lab_experience + similar_sim +  
    prior_number_virtual_labs + (1 | sid)
   Data: mydata
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   603.7    647.5   -291.9    583.7      578 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.0406 -0.3283 -0.1003  0.3351  2.8259 

Random effects:
 Groups Name        Variance Std.Dev.
 sid    (Intercept) 10.79    3.284   
Number of obs: 588, groups:  sid, 147

Fixed effects:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)                -5.6625     1.3361  -4.238 2.25e-05 ***
variableConcentration       0.7366     0.4057   1.816  0.06943 .  
variableSeparation         -0.1014     0.3731  -0.272  0.78572    
variableWidth               0.2744     0.3993   0.687  0.4

**(non log) Odds ratio with confidence intervals**

In [106]:
# Use the cvs_graph model fitted above (mixed), not the earlier mixed1
cc <- confint(mixed,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed),cc)
rtab <- exp(ctab)
print(rtab,digits=3)









____________________________________________________________________________





# OTHER VERSION OF ANALYSES - keep for historical purposes
Even though we decided not to include them or do analyses this way, we keep the code to run them here just in case.

First we reload the data, in case some factors have changed from continuous to categorical variables

In [107]:
mydata <- read.csv("C:\\Users\\Sarah\\Documents\\Personal Content\\Lab_study_data\\all_massaged_data\\dataframe_all_factors_for_analysis.txt",sep = '\t')
# sid is the student number
mydata$sid <- factor(mydata$sid)
mydata$sim_index <- factor(mydata$sim_index)
mydata$lab_experience <- factor(mydata$lab_experience)
mydata$similar_sim <- factor(mydata$similar_sim)
mydata$cvs_graph <- factor(mydata$cvs_graph)
mydata$cvs_table <- factor(mydata$cvs_table)
# mydata$main <- factor(mydata$main)
# mydata$pre <- factor(mydata$pre)

## Stat model 1: Predicting main model scores as a categorical variable

First we transform the data into the long format that the mlogit function expects (shape = 'wide' describes the input, which has one row per observation).
Now every student has a row for each variable times type of model (0,1,2).
The "alt" is the model type (0,1,2) and "main" is True if that was the model type the student obtained for that variable (and False for the other two model types).

In [108]:
mydata$main <- factor(mydata$main)
mydata$pre <- factor(mydata$pre)

In [109]:
wide_mydata <- mlogit.data(mydata, shape = 'wide', choice = "main", id.var = "sid")
head(wide_mydata, 5)

Unnamed: 0,sid,sim,variable,pre,main,cvs_graph,cvs_table,cvs_table_only,qual_score,quant_score,...,CVS_context,use_table,use_graph,use_concentration,use_width,use_area,use_separation,use_all_vars,chid,alt
1.0,10127163,L,Concentration,0,False,1,1,0,1,1,...,2,1,1,1,1,1,1,4,1,0
1.1,10127163,L,Concentration,0,False,1,1,0,1,1,...,2,1,1,1,1,1,1,4,1,1
1.2,10127163,L,Concentration,0,True,1,1,0,1,1,...,2,1,1,1,1,1,1,4,1,2
2.0,10127163,L,Width,0,False,1,1,0,1,1,...,2,1,1,1,1,1,1,4,2,0
2.1,10127163,L,Width,0,False,1,1,0,1,1,...,2,1,1,1,1,1,1,4,2,1
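
To make the reshaping concrete, the sketch below mimics it in base R on a small made-up data frame. It is not how mlogit.data is implemented, only an illustration of the resulting layout; the toy values are invented, and chid and alt are named after the columns in the output above.

```r
# Toy stand-in: one row per student and variable, with the 0/1/2 main score.
toy <- data.frame(
  sid      = c("a", "a", "b"),
  variable = c("Width", "Area", "Width"),
  main     = c(2, 0, 1)
)
alts <- 0:2

# Repeat every row once per alternative (model type 0, 1, 2).
long <- toy[rep(seq_len(nrow(toy)), each = length(alts)), ]
long$chid <- rep(seq_len(nrow(toy)), each = length(alts))  # choice situation id
long$alt  <- rep(alts, times = nrow(toy))                  # alternative

# main becomes TRUE only on the row whose alternative equals the observed score.
long$main <- long$main == long$alt
rownames(long) <- paste(long$chid, long$alt, sep = ".")

long
```

Each choice situation (chid) ends up with exactly one TRUE in main, which is the row structure visible in the head() output.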


Then we run the mlogit model.

See the following: https://cran.r-project.org/web/packages/mlogit/vignettes/mlogit.pdf

Specifically, "mixed" in this document does NOT mean with repeated measures. The "1 | " in the formula below tells mlogit that the variables after the bar are individual-specific (each gets one coefficient per alternative).
The examples using the "Train" dataset are what I followed. See pages 3-7 for how to structure the data and pages 22-23 for an example of running mlogit.

In [110]:
ml.mydata <- mlogit(main
    ~ 1 | cvs_table + cvs_graph + variable + sim_index + pre
    + lab_experience + similar_sim + prior_number_virtual_labs, wide_mydata)
summary(ml.mydata)


Call:
mlogit(formula = main ~ 1 | cvs_table + cvs_graph + variable + 
    sim_index + pre + lab_experience + similar_sim + prior_number_virtual_labs, 
    data = wide_mydata, method = "nr", print.level = 0)

Frequencies of alternatives:
       0        1        2 
0.095238 0.486395 0.418367 

nr method
6 iterations, 0h:0m:0s 
g'(-H)^-1g = 8.22E-06 
successive function values within tolerance limits 

Coefficients :
                              Estimate Std. Error z-value Pr(>|z|)   
1:(intercept)                1.6611897  0.5679551  2.9249 0.003446 **
2:(intercept)                0.1104790  0.6206179  0.1780 0.858712   
1:cvs_table1                -0.1951430  0.4023796 -0.4850 0.627696   
2:cvs_table1                 0.4026985  0.4434384  0.9081 0.363811   
1:cvs_graph1                 0.0098364  0.4395978  0.0224 0.982148   
2:cvs_graph1                 1.2499559  0.4550555  2.7468 0.006018 **
1:variableConcentration     -0.1756190  0.4629949 -0.3793 0.704457   
2:variableConcentrat