Load libraries

In [8]:
## For some reason, when loading mlogit, the notebook can't find package 'statmod' so I specify it's location
library(statmod, lib.loc='D:\\Applications\\Anaconda2\\pkgs\\r-statmod-1.4.30-r3.4.1_0\\lib\\R\\library\\')
require(mlogit)
require(ggplot2)
require(reshape2)
require(lme4)
require(compiler)
require(parallel)
require(car)
require(boot)
require(dplyr)
require(sjstats)
require(broom)

# Load data and set factors

In [9]:
all_mydata <- read.csv("C:\\Users\\Sarah\\Documents\\Personal Content\\Lab_study_data\\all_massaged_data\\dataframe_all_factors_for_analysis.txt",sep = '\t')
# sid is the student number
# we use the "factor()" option to make sure R treats them as categorical
all_mydata$sid <- factor(all_mydata$sid)
all_mydata$sim_index <- factor(all_mydata$sim_index)
all_mydata$lab_experience <- factor(all_mydata$lab_experience)
all_mydata$similar_sim <- factor(all_mydata$similar_sim)
all_mydata$cvs_graph <- factor(all_mydata$cvs_graph)
all_mydata$cvs_table <- factor(all_mydata$cvs_table)
all_mydata$quant_score <- factor(all_mydata$quant_score)
# all_mydata$main <- factor(all_mydata$main)
# all_mydata$pre <- factor(all_mydata$pre)

Here is what our data looks like:

In [10]:
head(all_mydata)
# colnames(mydata)

sid,sim,variable,pre,main,cvs_graph,cvs_table,cvs_table_only,qual_score,quant_score,...,use_graph,use_concentration,use_width,use_area,use_separation,use_all_vars,use_graph_beers,use_table_beers,use_table_capacitor,use_graph_capacitor
10127163,L,Concentration,0,2,1,1,0,1,1,...,1,1,1,1,1,4,1,1,1,1
10127163,L,Width,0,2,1,1,0,1,1,...,1,1,1,1,1,4,1,1,1,1
10232160,L,Concentration,0,0,1,1,0,1,1,...,1,1,1,1,1,4,1,1,1,1
10232160,L,Width,0,0,0,0,0,1,1,...,1,1,1,1,1,4,1,1,1,1
10232160,C,Area,0,2,1,1,0,1,1,...,1,1,1,1,1,4,1,1,1,1
10232160,C,Separation,0,2,1,1,0,1,1,...,1,1,1,1,1,4,1,1,1,1


We have the following factors that change per variable:
* main (0,1,2), treated as a continuous variable
* pre (0,1,2), treated as a continuous variable
* quant_score (0 or 1)
* CVS_graph (0 or 1)
* CVS_table (0 or 1)

We have the following independant factors:
* sim_index (1 or 2, wither it was student's 1st or 2nd activity)
* variable (thus don't include sim as a variable)
* student attibutes:
   * lab_experience (0 or 1 if students have prior undergraduate physics or chemistry lab experience)
   * similar_sim (0 or 1 if they have used a similar simulation)
   * prior_number_virtual_labs (levels from 0 to 3 depending on the number of virtual labs they have done in the past)

We ignore attitude components.

For main and pre score:
* score = 2 if they describe the correct relationship, ie. a correct quantitative model
* score = 1 if they describe the correct direction of the relationship, ie. they have a correct qual model but incorrect quant model OR if their quant model is incorrect but qualitatively correct
* score = 0 otherwise (i.e. all incorrect or only identified)

# Use this code to remove perfect pre per variable instance

In [11]:
# mydata <- (all_mydata %>% filter(pre < 2))

In [12]:
# print(dim(mydata));print(dim(all_mydata));
# print(dim(unique(mydata['sid'])));print(dim(unique(all_mydata['sid'])));

We removed 39 instances of perfect pre. All 147 students remain in the study (i.e. no student got a prefect pre on all variables).

# Stat model 1: Prediction main model score as a continuous variable

Some resources:
* On SS Types: https://mcfromnz.wordpress.com/2011/03/02/anova-type-iiiiii-ss-explained/
* on drop() function to do type 3: https://www.statmethods.net/stats/anova.html
* On repeated measures: http://psych.wisc.edu/moore/Rpdf/610-R8_OneWayWithin.pdf, https://datascienceplus.com/two-way-anova-with-repeated-measures/
* the car package: https://cran.r-project.org/web/packages/car/car.pdf

## Complete model with interactions

Our model (without student factors) is:

    main  ~  cvs_table*variable + cvs_graph*variable
             + cvs_table*pre + cvs_graph*pre
             + sim_index + sid
             
We run a type III Anova:

In [13]:
lm1 = lm(main
        ~  cvs_table*variable + cvs_graph*variable + cvs_table*pre + cvs_graph*pre + sim_index + sid,
         data=mydata)
results1 = Anova(lm1, type=3)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

ERROR: Error in is.data.frame(data): object 'mydata' not found


None of the interactions are significant so let's move to a simpler model.

## Simple model without interaction

Our model (without student factors) is:

    main  ~  cvs_table + cvs_graph + variable
             + pre + sim_index + sid
             
We run a type II Anova:

In [None]:
lm1 = lm(main
        ~  cvs_table + cvs_graph + variable + pre + sim_index + sid,
         data=mydata)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

We see that, in order of significance and eta^2: cvs_graph, sim_index, and variable matter.

## Model by variable

### For Width without interaction

In [None]:
test <- subset(mydata, variable == "Width")
lm1 = lm(main
        ~  cvs_table + cvs_graph + pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

### For Area without interaction

In [None]:
test <- subset(mydata, variable == "Area")
lm1 = lm(main
        ~  cvs_table + cvs_graph + pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

### For Concentration without interaction

In [None]:
test <- subset(mydata, variable == "Concentration")
lm1 = lm(main
        ~  cvs_table + cvs_graph + pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

### For Separation without interaction

In [None]:
test <- subset(mydata, variable == "Separation")
lm1 = lm(main
        ~  cvs_table + cvs_graph + pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

### For Width with interaction

In [None]:
test <- subset(mydata, variable == "Width")
lm1 = lm(main
        ~  cvs_table*pre + cvs_graph*pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

### For Area  with interaction

In [None]:
test <- subset(mydata, variable == "Area")
lm1 = lm(main
        ~  cvs_table*pre + cvs_graph*pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

### For Concentration with interaction

In [None]:
test <- subset(mydata, variable == "Concentration")
lm1 = lm(main
        ~  cvs_table*pre + cvs_graph*pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

### For Separation with interaction

In [None]:
test <- subset(mydata, variable == "Separation")
lm1 = lm(main
        ~  cvs_table*pre + cvs_graph*pre + sim_index,
         data=test)
results1 = Anova(lm1, type=2)
results_table1 = tidy(results1)
results_table1$eta <- results_table1$sumsq/(results_table1$sumsq + results_table1$sumsq[dim(results_table1)[1]])
results_table1

In [None]:
colMeans(test["main"])

# Stat model 2: Predicting transfer data

## Excluding student main worksheet score

### Complete model with interactions

Our model is:

    quant_score  ~  cvs_table*variable + cvs_graph*variable
             + cvs_table*pre + cvs_graph*pre
             + sim_index + sid
             
We run a logistic regression:

In [None]:
mixed1 <- glmer(
    quant_score
    ~ cvs_table*variable + cvs_graph*variable + cvs_table*pre + cvs_graph*pre + sim_index + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)

**(non log) Odds ratio with confidence intervals**

In [None]:
cc <- confint(mixed1,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed1),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

Again interactions are not significant, so we stick to a simpler model.

### Simple model without interactions

Our model is:

    quant_score  ~  cvs_table + cvs_graph + variable
                     + pre + sim_index + sid
             
We run a logistic regression:

In [None]:
mixed1 <- glmer(
    quant_score
    ~ cvs_table + cvs_graph + variable + sim_index + pre + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)

**(non log) Odds ratio with confidence intervals**

In [None]:
cc <- confint(mixed1,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed1),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

As expected, CVS doesn't predict quant transfer scores, only variable does.

## Including student main worksheet score
as a continuous variable

### Complete model with interactions

Our model is:

    quant_score  ~  main + cvs_table*variable + cvs_graph*variable
                    + cvs_table*pre + cvs_graph*pre
                    + sim_index + sid
             
We run a logistic regression:

In [None]:
mixed1 <- glmer(
    quant_score
    ~ main + cvs_table*variable + cvs_graph*variable + cvs_table*pre + cvs_graph*pre + sim_index + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)

**(non log) Odds ratio with confidence intervals**

In [None]:
cc <- confint(mixed1,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed1),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

Again interactions are not significant, so we stick to a simpler model.

### Simple model without interactions

Our model is:

    quant_score  ~  main + cvs_table + cvs_graph + variable
                     + pre + sim_index + sid
             
We run a logistic regression:

In [None]:
mixed1 <- glmer(
    quant_score
    ~ main + cvs_table + cvs_graph + variable + sim_index + pre + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)
summary(mixed1)

**(non log) Odds ratio with confidence intervals**

In [None]:
cc <- confint(mixed1,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed1),cc)
rtab <- exp(ctab)
print(rtab,digits=3)

## Discussion on all 4 models (w. w/o interaction and w. w/o main)
What we notice:
* cvs_graph never matters
* main matters
* pre doesn't matter
* variable matters
* sim_index doesn't matter...

# Stat model 3: Predicting the use of CVS graph

Our model is:

    cvs_graph  ~ variable + pre + sim_index + sid
                 + lab_experience + similar_sim + prior_number_virtual_labs
             
We run a logistic regression:

In [None]:
mixed <- glmer(
    cvs_graph
    ~ variable + sim_index + pre
    + lab_experience + similar_sim + prior_number_virtual_labs + (1 | sid),
           data = mydata, family = binomial, 
           control = glmerControl(optimizer = "bobyqa"), nAGQ = 10)

summary(mixed)

**(non log) Odds ratio with confidence intervals**

In [None]:
cc <- confint(mixed,parm="beta_",method="Wald")
ctab <- cbind(est=fixef(mixed),cc)
rtab <- exp(ctab)
print(rtab,digits=3)








____________________________________________________________________________





# OTHER VERSION OF ANALYSES - keep for historical purposes
Even though we decided not to include them or do analyses this way, we keep the code to run them here just in case.

First we reload the data, in case some factors have changed from continuous to categorical variables

In [None]:
# mydata <- read.csv("C:\\Users\\Sarah\\Documents\\Personal Content\\Lab_study_data\\all_massaged_data\\dataframe_all_factors_for_analysis.txt",sep = '\t')
# # sid is the student number
# mydata$sid <- factor(mydata$sid)
# mydata$sim_index <- factor(mydata$sim_index)
# mydata$lab_experience <- factor(mydata$lab_experience)
# mydata$similar_sim <- factor(mydata$similar_sim)
# mydata$cvs_graph <- factor(mydata$cvs_graph)
# mydata$cvs_table <- factor(mydata$cvs_table)
# # mydata$main <- factor(mydata$main)
# # mydata$pre <- factor(mydata$pre)

## Stat model 1: Predicting main model scores as a categorical variable

First we transform the data in an extra wide format for the mlogit function.
Now every student has a row for each variable times type of model (0,1,2).
The "alt" is the model type (0,1,2) and "main" is True if that was the model type they got correct (and the others are always False for that variable).

In [None]:
# mydata$main <- factor(mydata$main)
# mydata$pre <- factor(mydata$pre)

In [None]:
# wide_mydata <- mlogit.data(mydata, shape = 'wide', choice = "main", id.var = "sid")
# head(wide_mydata, 5)

Then we run the mlogit model.

See the following: https://cran.r-project.org/web/packages/mlogit/vignettes/mlogit.pdf

Specifically, mixed in this document DOESN't mean with repeated measures. The "1 | " in the formula below tells it that some of the variables are individual specific.
The examples using the "Train" dataset is what I followed. See pages 3-7 for how to structure data and 22,23 for example of running mlogit.

In [None]:
# ml.mydata <- mlogit(main
#     ~ 1 | cvs_table + cvs_graph + variable + sim_index + pre
#     + lab_experience + similar_sim + prior_number_virtual_labs, wide_mydata)
# summary(ml.mydata)