# Statistical analysis and model checking with R and Julia
# Part II: Implementation in Julia

In the first part, data from three previously completed experiments were analyzed in R. As a point of comparison and a way to check models, the data are now analyzed in Julia.  

The next step is to use the Julia language and its `MixedModels` and `GLM` packages to see whether we get similar results. First we need to know which packages to use, how to import data, and how model calls differ between R and Julia.

## Packages
There are three packages of primary interest:
- `DataFrames`
- `MixedModels`
- `GLM`  

These will be used to import data and to implement the LMMs and GLMs, respectively. They also load other useful packages, such as `Distributions`.

## 'Circlipses' (audiovisual integration)
The first step will be to import data and see what the structure is.

In [1]:
using DataFrames

# I suggest cloning the repository, since DataFrames will not read the file directly from a URL.
# If you clone, you may also want to update the corresponding path in Part I.

dataset = "~/GitHub/model-comparison-r-julia/data/circlipse-pwr-data-final-2015.csv"

circlipse_pwr_data = readtable(expanduser(dataset), header = true);
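
Before loading a file it can help to confirm which columns it contains. The following is a minimal sketch in plain Julia (assuming Julia 1.x and a simple comma-separated header without quoted fields). The helper `peek_header` and the throwaway demo file are hypothetical and are not part of `DataFrames`.

```julia
# Hypothetical helper (not part of DataFrames): return the column names
# found on the first line of a comma-separated file.
function peek_header(path::AbstractString)
    open(expanduser(path)) do io
        # read the first line, split on commas, and trim whitespace
        [String(strip(name)) for name in split(readline(io), ',')]
    end
end

# Demonstration on a throwaway file that mimics the columns of the data
demo_path = tempname() * ".csv"
write(demo_path, "Subject,Hem,Harm,Cond,Area,LinPwr,dBPwr\n" *
                 "r1532,1-lh,1-Fm,1-Uni,1-AntTem,4.31e15,156.3464854\n")
columns = peek_header(demo_path)
println(columns)
rm(demo_path)  # clean up the temporary file
```

As in the cell above, `expanduser` resolves a leading `~`, so the same helper could be pointed at the cloned data file.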

In [2]:
typeof(circlipse_pwr_data)

DataFrame (constructor with 11 methods)

In [3]:
head(circlipse_pwr_data)

| Row | Subject | Hem | Harm | Cond | Area | LinPwr | dBPwr |
|---|---|---|---|---|---|---|---|
| 1 | r1532 | 1-lh | 1-Fm | 1-Uni | 1-AntTem | 4.31e+15 | 156.3464854 |
| 2 | r1532 | 1-lh | 1-Fm | 1-Uni | 2-PosTem | 2.89e+15 | 154.6139347 |
| 3 | r1532 | 1-lh | 1-Fm | 1-Uni | 3-Occ | 2.97e+18 | 184.7342857 |
| 4 | r1532 | 1-lh | 1-Fm | 2-Zero | 1-AntTem | 4.37e+16 | 166.4075961 |
| 5 | r1532 | 1-lh | 1-Fm | 2-Zero | 2-PosTem | 2.06e+17 | 173.1441501 |
| 6 | r1532 | 1-lh | 1-Fm | 2-Zero | 3-Occ | 1.61e+18 | 182.0758051 |


Let's look more carefully at the structure of the data:

In [4]:
dump(circlipse_pwr_data)

DataFrame  672 observations of 7 variables
  Subject: DataArray{UTF8String,1}(672) UTF8String["r1532","r1532","r1532","r1532"]
  Hem: DataArray{UTF8String,1}(672) UTF8String["1-lh","1-lh","1-lh","1-lh"]
  Harm: DataArray{UTF8String,1}(672) UTF8String["1-Fm","1-Fm","1-Fm","1-Fm"]
  Cond: DataArray{UTF8String,1}(672) UTF8String["1-Uni","1-Uni","1-Uni","2-Zero"]
  Area: DataArray{UTF8String,1}(672) UTF8String["1-AntTem","2-PosTem","3-Occ","1-AntTem"]
  LinPwr: DataArray{Float64,1}(672) [4.31e15,2.89e15,2.97e18,4.37e16]
  dBPwr: DataArray{Float64,1}(672) [156.346,154.614,184.734,166.408]


From the previous analyses, we know that the columns `Subject`, `Hem`, `Harm`, `Cond` and `Area` are all character strings. In order to fit the LMM, we need to change these from the class `DataArray` to `PooledDataArray`.

In [5]:
# need to do this (in-place) so as to be able to fit LMMs
pool!(circlipse_pwr_data, [:Subject, :Hem, :Harm, :Cond, :Area])

In [6]:
dump(circlipse_pwr_data)

DataFrame  672 observations of 7 variables
  Subject: PooledDataArray{UTF8String,Uint8,1}(672) UTF8String["r1532","r1532","r1532","r1532"]
  Hem: PooledDataArray{UTF8String,Uint8,1}(672) UTF8String["1-lh","1-lh","1-lh","1-lh"]
  Harm: PooledDataArray{UTF8String,Uint8,1}(672) UTF8String["1-Fm","1-Fm","1-Fm","1-Fm"]
  Cond: PooledDataArray{UTF8String,Uint8,1}(672) UTF8String["1-Uni","1-Uni","1-Uni","2-Zero"]
  Area: PooledDataArray{UTF8String,Uint8,1}(672) UTF8String["1-AntTem","2-PosTem","3-Occ","1-AntTem"]
  LinPwr: DataArray{Float64,1}(672) [4.31e15,2.89e15,2.97e18,4.37e16]
  dBPwr: DataArray{Float64,1}(672) [156.346,154.614,184.734,166.408]
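
To give a rough idea of what pooling does, here is a minimal sketch in plain Julia (assuming Julia 1.x, where the integer type is spelled `UInt8` rather than `Uint8`). The helper `pool_labels` is hypothetical and only mimics the idea of storing each distinct label once and keeping a small integer reference per row. The actual `PooledDataArray` implementation may differ in its details.

```julia
# Hypothetical stand-in for pooling: store each distinct label once
# and keep a compact integer reference for every observation.
function pool_labels(labels::Vector{String})
    levels = sort(unique(labels))   # distinct labels, in sorted order
    index = Dict(level => UInt8(i) for (i, level) in enumerate(levels))
    refs = [index[label] for label in labels]   # one small integer per row
    return levels, refs
end

# First four Cond values, as shown in the dump above
cond = ["1-Uni", "1-Uni", "1-Uni", "2-Zero"]
levels, refs = pool_labels(cond)
println(levels)
println(refs)
```

Because the references in this sketch are `UInt8`, it can only represent up to 255 distinct levels, which is more than enough for the factors used here.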


The model syntax is slightly different for the `MixedModels` package. The most parsimonious model in R was this:  

`lmer(dBPwr ~ Hem + Harm + Cond + Area + Hem:Area + Harm:Area + Cond:Area + (Hem + Harm + Cond | Subject), data = circlipse.pwr.data, REML = FALSE)`  

In Julia, the interaction operator '`:`' is replaced by '`&`', and the model is fit by maximum likelihood by default, which matches `REML = FALSE` in the R call.

In [7]:
using MixedModels

interact_model_parsim = fit(lmm(dBPwr ~ Hem + Harm + Cond + Area + Hem & Area + Harm & Area + Cond & Area +
(Hem + Harm + Cond | Subject), circlipse_pwr_data))

Linear mixed model fit by maximum likelihood
Formula: dBPwr ~ Hem + Harm + Cond + Area + Hem & Area + Harm & Area + Cond & Area + ((Hem + Harm + Cond) | Subject)

 logLik: -2318.589913, deviance: 4637.179826

Variance components:
           Variance   Std.Dev.   Corr.
 Subject  19.5799879 4.4249280
           6.6874648 2.5860133 -0.22
          10.9537050 3.3096382 -0.74 -0.74
           4.0285954 2.0071361  0.35  0.35  0.35
           2.5462567 1.5956994 -0.57 -0.57 -0.57 -0.57
           6.1240922 2.4746903 -0.15 -0.15 -0.15 -0.15 -0.15
 Residual 52.3303999 7.2339754
 Number of obs: 672; levels of grouping factors: 14

  Fixed-effects parameters:
                                Estimate Std.Error   z value
(Intercept)                      156.936    1.6734   93.7823
Hem2-rh                         0.340135   1.18834  0.286228
Harm2-SecondHarm                -9.30368    1.3103  -7.10044
Cond2-Zero                       -1.1611   1.46857 -0.790631
Cond3-HalfPi                    -2.256

Aside from precision differences, the results are identical.
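
The `&` terms in the formula above denote interactions. As a rough illustration, and assuming the usual treatment (dummy) coding of factors, an interaction column in the design matrix is the elementwise product of the indicator columns of the two factor levels involved. The four rows below are made up for the example (only the level names come from the data), and the helper `indicator` is hypothetical.

```julia
# Illustrative rows only: two hemispheres crossed with two areas
hem  = ["1-lh", "1-lh", "2-rh", "2-rh"]
area = ["1-AntTem", "2-PosTem", "1-AntTem", "2-PosTem"]

# Hypothetical helper: 1.0 where the column equals the given level, else 0.0
indicator(column, level) = Float64[v == level ? 1.0 : 0.0 for v in column]

hem_rh   = indicator(hem, "2-rh")        # dummy column for Hem2-rh
area_pos = indicator(area, "2-PosTem")   # dummy column for Area2-PosTem

# The interaction column is nonzero only where both indicators are 1
hem_rh_and_area_pos = hem_rh .* area_pos
println(hem_rh_and_area_pos)
```

Only the last row, which is both right hemisphere and posterior temporal, receives a 1 in the interaction column.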

## M100 analysis in 16p11.2 deletion and duplication
For the M100 analysis, we will follow the same procedure: import the data, convert the columns to a form `MixedModels` can understand, and compare the output. I cheated a bit: I exported the subsetted data from R and removed the observations with a missing dependent variable.

In [8]:
dataset = "~/GitHub/model-comparison-r-julia/data/simons-child-data-experimental-na-removed.csv"

child_data = readtable(expanduser(dataset), header = true);

In [9]:
head(child_data)

Row,Subject,Site,Exclusions,DOS,DOB,Age,Age_Calc,AgeGroup,Gender,Handedness,Dx,Chromosome,Case,Copies,ASD,NVIQ,VIQ,CELF_4,SRS_parent,SRS_adult,CTOPP,ICV2,cmICV,Hem,Cond,M50Lat,M100Lat,M50LatCorr,M100LatCorr,dB_SL
1,3002-102,UCSF,no,11/4/11,1/24/01,10,10.784,child,male,right,proband,16p,duplication,duplication,False,77,86,75,145,,6,1422313.912,1422.313912,2-RH,1-200,,162,,134,25.0
2,3002-102,UCSF,no,11/4/11,1/24/01,10,10.784,child,male,right,proband,16p,duplication,duplication,False,77,86,75,145,,6,1422313.912,1422.313912,2-RH,2-300,,148,,120,25.0
3,3002-102,UCSF,no,11/4/11,1/24/01,10,10.784,child,male,right,proband,16p,duplication,duplication,False,77,86,75,145,,6,1422313.912,1422.313912,2-RH,3-500,,146,,118,25.0
4,3002-102,UCSF,no,11/4/11,1/24/01,10,10.784,child,male,right,proband,16p,duplication,duplication,False,77,86,75,145,,6,1422313.912,1422.313912,2-RH,4-1000,116.0,166,88.0,138,25.0
5,3003-101,UCSF,no,11/18/11,3/3/99,12,12.721,child,male,right,proband,16p,deletion,deletion,False,102,106,88,6,,3,1717376.378,1717.376378,1-LH,1-200,,180,,152,24.0
6,3003-101,UCSF,no,11/18/11,3/3/99,12,12.721,child,male,right,proband,16p,deletion,deletion,False,102,106,88,6,,3,1717376.378,1717.376378,1-LH,2-300,86.0,176,58.0,148,24.0


In [10]:
dump(child_data)

DataFrame  534 observations of 30 variables
  Subject: DataArray{UTF8String,1}(534) UTF8String["3002-102","3002-102","3002-102","3002-102"]
  Site: DataArray{UTF8String,1}(534) UTF8String["UCSF","UCSF","UCSF","UCSF"]
  Exclusions: DataArray{UTF8String,1}(534) UTF8String["no","no","no","no"]
  DOS: DataArray{UTF8String,1}(534) UTF8String["11/4/11","11/4/11","11/4/11","11/4/11"]
  DOB: DataArray{UTF8String,1}(534) UTF8String["1/24/01","1/24/01","1/24/01","1/24/01"]
  Age: DataArray{Int64,1}(534) [10,10,10,10]
  Age_Calc: DataArray{Float64,1}(534) [10.784,10.784,10.784,10.784]
  AgeGroup: DataArray{UTF8String,1}(534) UTF8String["child","child","child","child"]
  Gender: DataArray{UTF8String,1}(534) UTF8String["male","male","male","male"]
  Handedness: DataArray{UTF8String,1}(534) UTF8String["right","right","right","right"]
  Dx: DataArray{UTF8String,1}(534) UTF8String["proband","proband","proband","proband"]
  Chromosome: DataArray{UTF8String,1}(534) UTF8String["16p","16p","16p","16p"]
  

Once again, convert the factor columns to class `PooledDataArray`:

In [11]:
pool!(child_data, [:Subject, :Hem, :Cond, :Case, :Site])
dump(child_data)

DataFrame  534 observations of 30 variables
  Subject: PooledDataArray{UTF8String,Uint8,1}(534) UTF8String["3002-102","3002-102","3002-102","3002-102"]
  Site: PooledDataArray{UTF8String,Uint8,1}(534) UTF8String["UCSF","UCSF","UCSF","UCSF"]
  Exclusions: DataArray{UTF8String,1}(534) UTF8String["no","no","no","no"]
  DOS: DataArray{UTF8String,1}(534) UTF8String["11/4/11","11/4/11","11/4/11","11/4/11"]
  DOB: DataArray{UTF8String,1}(534) UTF8String["1/24/01","1/24/01","1/24/01","1/24/01"]
  Age: DataArray{Int64,1}(534) [10,10,10,10]
  Age_Calc: DataArray{Float64,1}(534) [10.784,10.784,10.784,10.784]
  AgeGroup: DataArray{UTF8String,1}(534) UTF8String["child","child","child","child"]
  Gender: DataArray{UTF8String,1}(534) UTF8String["male","male","male","male"]
  Handedness: DataArray{UTF8String,1}(534) UTF8String["right","right","right","right"]
  Dx: DataArray{UTF8String,1}(534) UTF8String["proband","proband","proband","proband"]
  Chromosome: DataArray{UTF8String,1}(534) UTF8String["16

Fit the LMM:

In [12]:
child_addmodel = fit(lmm(M100LatCorr ~ Hem + Cond + Case + Site + Age_Calc + (Cond + Hem | Subject), child_data));

In [13]:
child_addmodel

Linear mixed model fit by maximum likelihood
Formula: M100LatCorr ~ Hem + Cond + Case + Site + Age_Calc + ((Cond + Hem) | Subject)

 logLik: -2101.511696, deviance: 4203.023392

Variance components:
           Variance   Std.Dev.    Corr.
 Subject  479.229768 21.8913172
           13.274600  3.6434325  0.32
           41.982576  6.4793963 -0.09 -0.09
           75.085061  8.6651637 -0.31 -0.31 -0.31
          364.617204 19.0949523 -0.51 -0.51 -0.51 -0.51
 Residual  49.670257  7.0477129
 Number of obs: 534; levels of grouping factors: 96

  Fixed-effects parameters:
                 Estimate Std.Error   z value
(Intercept)       192.371   10.4107   18.4782
Hem2-RH          -14.5908   2.29338  -6.36214
Cond2-300        -4.47203   1.02708  -4.35413
Cond3-500        -8.78877    1.1908  -7.38058
Cond4-1000       -16.5421   1.32584  -12.4767
Casedeletion      20.9439   4.34312   4.82231
Caseduplication  -5.44841    5.5698 -0.978206
SiteUCSF         -2.45386   3.80511 -0.644885
Age_Calc      

Comparing this with the R output from Part I, the results are once again identical aside from precision differences.
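
Two quick sanity checks can be run on the printed Julia output itself, which also helps when lining it up against the R summary. This is a small sketch in plain Julia with the numbers copied by hand from the output above: the reported deviance should equal minus twice the log-likelihood, and each z value should equal the estimate divided by its standard error. The tolerances are assumptions chosen to allow for the rounding in the printed values.

```julia
# Values copied from the child_addmodel output above
loglik            = -2101.511696
deviance_reported = 4203.023392

# The deviance should be -2 times the log-likelihood
deviance_check = -2 * loglik
println(isapprox(deviance_check, deviance_reported; atol = 1e-6))

# Hem2-RH row: estimate, standard error, and reported z value
estimate, std_error, z_reported = -14.5908, 2.29338, -6.36214

# The z value should be the estimate divided by its standard error
z_check = estimate / std_error
println(isapprox(z_check, z_reported; atol = 1e-3))
```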

## Proportion correct analysis in AX task
Now we revisit the proportion correct analysis. The model call to `glm` in Julia requires specifying both the distribution and the link function explicitly. In R, when `binomial` was called, it used a logit link by default.

In [14]:
# used a version of the file with the ItemPair column appended
dataset = "~/GitHub/model-comparison-r-julia/data/psycho-data-april-2015.csv"
psycho_data = readtable(expanduser(dataset), header = true);

In [15]:
head(psycho_data)

Row,SubjectAssgn,GenID,ExpAssgn,Language,Training,TrainType,Sex,Instrument,InstYears,InstPlay,EarTrain,EarTrainYear,Vowel,VowelHeight,VowelPosition,F1,F2,F3,F1_F2,F1_F3,F2_F3,BarkF2_F1,BarkF3_F1,BarkF3_F2,Item,ItemType,SCGDiff,logSCGDiff,RT,RespNum,RespNum2,absRT,logAbsRT,ItemPair
1,a-i-subject-1,Subject1,a-i,english,yes,formal,female,piano,12,current,no,,a,low,back,768,1333,2522,0.576,0.305,0.529,3.306,7.539,4.233,a-2-2,same,0.0,0.0,684.71,1,100,684.71,2.835506671,2-2
2,a-i-subject-1,Subject1,a-i,english,yes,formal,female,piano,12,current,no,,a,low,back,768,1333,2522,0.576,0.305,0.529,3.306,7.539,4.233,a-2-2,same,0.0,0.0,707.26,1,100,707.26,2.849579097,2-2
3,a-i-subject-1,Subject1,a-i,english,yes,formal,female,piano,12,current,no,,a,low,back,768,1333,2522,0.576,0.305,0.529,3.306,7.539,4.233,a-2-2,same,0.0,0.0,443.42,1,100,443.42,2.646815278,2-2
4,a-i-subject-1,Subject1,a-i,english,yes,formal,female,piano,12,current,no,,a,low,back,768,1333,2522,0.576,0.305,0.529,3.306,7.539,4.233,a-2-2,same,0.0,0.0,561.19,1,100,561.19,2.749109924,2-2
5,a-i-subject-1,Subject1,a-i,english,yes,formal,female,piano,12,current,no,,a,low,back,768,1333,2522,0.576,0.305,0.529,3.306,7.539,4.233,a-2-2,same,0.0,0.0,534.67,1,100,534.67,2.728085817,2-2
6,a-i-subject-1,Subject1,a-i,english,yes,formal,female,piano,12,current,no,,a,low,back,768,1333,2522,0.576,0.305,0.529,3.306,7.539,4.233,a-2-2,same,0.0,0.0,514.24,1,100,514.24,2.711165855,2-2


In [16]:
dump(psycho_data)

DataFrame  12800 observations of 34 variables
  SubjectAssgn: DataArray{UTF8String,1}(12800) UTF8String["a-i-subject-1","a-i-subject-1","a-i-subject-1","a-i-subject-1"]
  GenID: DataArray{UTF8String,1}(12800) UTF8String["Subject1","Subject1","Subject1","Subject1"]
  ExpAssgn: DataArray{UTF8String,1}(12800) UTF8String["a-i","a-i","a-i","a-i"]
  Language: DataArray{UTF8String,1}(12800) UTF8String["english","english","english","english"]
  Training: DataArray{UTF8String,1}(12800) UTF8String["yes","yes","yes","yes"]
  TrainType: DataArray{UTF8String,1}(12800) UTF8String["formal","formal","formal","formal"]
  Sex: DataArray{UTF8String,1}(12800) UTF8String["female","female","female","female"]
  Instrument: DataArray{UTF8String,1}(12800) UTF8String["piano","piano","piano","piano"]
  InstYears: DataArray{UTF8String,1}(12800) UTF8String["12","12","12","12"]
  InstPlay: DataArray{UTF8String,1}(12800) UTF8String["current","current","current","current"]
  EarTrain: DataArray{UTF8String,1}(12800) U

In [17]:
using GLM

In [18]:
pool!(psycho_data, [:Vowel, :ItemPair, :ItemType, :Item, :SubjectAssgn])
dump(psycho_data)

DataFrame  12800 observations of 34 variables
  SubjectAssgn: PooledDataArray{UTF8String,Uint8,1}(12800) UTF8String["a-i-subject-1","a-i-subject-1","a-i-subject-1","a-i-subject-1"]
  GenID: DataArray{UTF8String,1}(12800) UTF8String["Subject1","Subject1","Subject1","Subject1"]
  ExpAssgn: DataArray{UTF8String,1}(12800) UTF8String["a-i","a-i","a-i","a-i"]
  Language: DataArray{UTF8String,1}(12800) UTF8String["english","english","english","english"]
  Training: DataArray{UTF8String,1}(12800) UTF8String["yes","yes","yes","yes"]
  TrainType: DataArray{UTF8String,1}(12800) UTF8String["formal","formal","formal","formal"]
  Sex: DataArray{UTF8String,1}(12800) UTF8String["female","female","female","female"]
  Instrument: DataArray{UTF8String,1}(12800) UTF8String["piano","piano","piano","piano"]
  InstYears: DataArray{UTF8String,1}(12800) UTF8String["12","12","12","12"]
  InstPlay: DataArray{UTF8String,1}(12800) UTF8String["current","current","current","current"]
  EarTrain: DataArray{UTF8String

In [19]:
psycho_vowel_glm = glm(RespNum ~ Vowel, psycho_data, Binomial(), LogitLink());
psycho_item_pair_glm = glm(RespNum ~ ItemPair, psycho_data, Binomial(), LogitLink());
psycho_vowel_item_glm = glm(RespNum ~ Vowel * ItemType, psycho_data, Binomial(), LogitLink());
psycho_Item_glm = glm(RespNum ~ Item, psycho_data, Binomial(), LogitLink());

Let's see what the results are in Julia, for comparison with the R output from Part I:

In [20]:
psycho_vowel_glm

DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Binomial,LogitLink},DensePredChol{Float64,Cholesky{Float64}}},Float64}:

Coefficients:
              Estimate Std.Error z value Pr(>|z|)
(Intercept)  0.0668999  0.025014  2.6745   0.0075
Vowel - i      1.33509 0.0509469 26.2056   <1e-99
Vowel - u     0.849825 0.0464197 18.3074   <1e-74
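
Because the model uses a logit link, the coefficients above are on the log-odds scale. As a minimal sketch in plain Julia (assuming Julia 1.x), they can be mapped back to predicted proportions with the inverse logit. The coefficients are copied by hand from the output above. The reading of the intercept as the reference vowel (presumably `a`) and of `RespNum` as 1 for a correct response are assumptions.

```julia
# Inverse of the logit link: maps a log-odds value to a probability
invlogit(eta) = 1 / (1 + exp(-eta))

# Coefficients copied from the psycho_vowel_glm output above
intercept = 0.0668999   # log-odds for the reference vowel
coef_i    = 1.33509     # change in log-odds for vowel i
coef_u    = 0.849825    # change in log-odds for vowel u

# Predicted proportions for each vowel
p_ref = invlogit(intercept)
p_i   = invlogit(intercept + coef_i)
p_u   = invlogit(intercept + coef_u)
println(round.([p_ref, p_i, p_u], digits = 3))
```

The same transformation applies to the other GLMs in this section, since they all use `Binomial()` with `LogitLink()`.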


In [21]:
psycho_item_pair_glm

DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Binomial,LogitLink},DensePredChol{Float64,Cholesky{Float64}}},Float64}:

Coefficients:
                           Estimate Std.Error   z value Pr(>|z|)
(Intercept)                 2.32514 0.0981064   23.7002   <1e-99
ItemPair - 2-square        -4.58466  0.136929   -33.482   <1e-99
ItemPair - 23-2            -1.88373  0.113598  -16.5824   <1e-61
ItemPair - 23-23            -0.1192  0.135516 -0.879603   0.3791
ItemPair - 23-4            -2.97523  0.114419  -26.0028   <1e-99
ItemPair - 23-square        -1.8043  0.113871  -15.8451   <1e-55
ItemPair - 4-2             -2.13708  0.113038  -18.9059   <1e-78
ItemPair - 4-4            -0.178943  0.134012  -1.33527   0.1818
ItemPair - 4-square        -2.21565  0.112957   -19.615   <1e-84
ItemPair - square-square   0.132165  0.142691  0.926232   0.3543


In [22]:
psycho_vowel_item_glm

DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Binomial,LogitLink},DensePredChol{Float64,Cholesky{Float64}}},Float64}:

Coefficients:
                             Estimate Std.Error  z value Pr(>|z|)
(Intercept)                  -1.06146  0.036928 -28.7439   <1e-99
Vowel - i                      2.0704 0.0634328  32.6392   <1e-99
Vowel - u                     1.40436 0.0592356  23.7081   <1e-99
ItemType - same               3.33014 0.0772007  43.1361   <1e-99
Vowel - i & ItemType - same  -2.06117  0.133722 -15.4138   <1e-52
Vowel - u & ItemType - same  -1.37646  0.132319 -10.4026   <1e-24


In [23]:
psycho_Item_glm

DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Binomial,LogitLink},DensePredChol{Float64,Cholesky{Float64}}},Float64}:

Coefficients:
                            Estimate Std.Error      z value Pr(>|z|)
(Intercept)                  2.25043  0.134595        16.72   <1e-62
Item - a-2-square           -5.19475  0.224813      -23.107   <1e-99
Item - a-23-2               -2.69511  0.157098     -17.1556   <1e-65
Item - a-23-23            -0.0874539  0.187103    -0.467411   0.6402
Item - a-23-4                -4.2107  0.180434     -23.3365   <1e-99
Item - a-23-square          -2.67546   0.15701       -17.04   <1e-64
Item - a-4-2                -3.04617    0.1594     -19.1102   <1e-80
Item - a-4-4            -1.99244e-14  0.190346 -1.04675e-13   1.0000
Item - a-4-square           -3.23556  0.161272     -20.0627   <1e-88
Item - a-square-square      0.175054   0.19755     0.886124   0.3756
Item - i-2-2                0.133907   0.24194     0.553471   0.5799
Item - i-2-s

So the results are similar, and we can reasonably conclude that the effects and effect sizes are consistent between models and languages (and statistical engines).

## Conclusion
We have analyzed three datasets, once in R and once in Julia. For each dataset, the two languages gave identical results, with the only differences being those of precision.

In [24]:
versioninfo()

Julia Version 0.3.8
Commit 79599ad (2015-04-30 23:40 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
