# Exercise 9: Mixed effects

This homework assignment is designed to give you practice fitting and interpreting mixed effects models. 

We will be using the **LexicalData.csv** and **Items.csv** files from the *Homework/lexDat* folder in the class GitHub repository again. 

This data is a subset of the [English Lexicon Project database](https://elexicon.wustl.edu/). It provides the reaction times (in milliseconds) of many subjects as they are presented with letter strings and asked to decide, as quickly and as accurately as possible, whether the letter string is a word or not. The **Items.csv** file provides characteristics of the words used, namely frequency (how common is this word?) and length (how many letters?). Unlike in the previous homework, there isn't any missing data in the **LexicalData.csv** file. 

*Data courtesy of Balota, D.A., Yap, M.J., Cortese, M.J., Hutchison, K.A., Kessler, B., Loftis, B., Neely, J.H., Nelson, D.L., Simpson, G.B., & Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39, 445-459.*

---
## 1. Loading and formatting the data (1 point)

Load in data from the **LexicalData.csv** and **Items.csv** files. As in the previous homeworks, remove the commas from the reaction times and convert them from strings to numbers. Use `left_join` to add word characteristics `Length` and `Log_Freq_HAL` from **Items** to **LexicalData**. 

*Note: the `Freq_HAL` variable in **Items.csv** has a similar formatting issue, using string values with commas. We're not going to worry about fixing this since we're only using `Log_Freq_HAL`, which is the natural log transformation of `Freq_HAL`, in this homework.*
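
The relationship described in the note can be checked on a few rows. This is a minimal sketch in base R: the three words and their `Log_Freq_HAL` values are copied from the **Items.csv** preview in this notebook, while the `toy_items` name and the comma placement in `"6,742"` are assumptions made for illustration.

```r
# Toy copy of three rows of Items.csv (values taken from the preview output).
# The comma in "6,742" is an assumed example of the string formatting issue.
toy_items <- data.frame(
  Word         = c("synergistic", "synonymous", "synthesis"),
  Freq_HAL     = c("284", "951", "6,742"),
  Log_Freq_HAL = c(5.649, 6.858, 8.816)
)

# Strip the thousands separators, then convert the strings to numbers
freq_numeric <- as.numeric(gsub(",", "", toy_items$Freq_HAL))

# Natural log of the cleaned counts, rounded to three decimals
round(log(freq_numeric), 3)

# Stored values from the file, for comparison
toy_items$Log_Freq_HAL
```

The two printed vectors should match, which is why the raw `Freq_HAL` column can be left untouched here.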

In [None]:
# Load packages up front (filter, left_join, and the %>% pipe)
library(tidyverse)

# Read in the original data frames
lexical <- read.csv('LexicalData.csv')
head(lexical)
item <- read.csv('Items.csv')
head(item)

# Remove the commas from RT and convert to numeric
lexical$D_RT <- as.numeric(gsub(",", "", lexical$D_RT))

# Drop rows with no RT; there shouldn't be any missing data
lexical_clean <- lexical %>% filter(!is.na(D_RT))
head(lexical_clean)

# Make sure no RTs are negative
sum(lexical_clean$D_RT < 0) # the number of negatives should be 0

# Add Length and Log_Freq_HAL from item, matching D_Word to Word
lexical_final <- lexical_clean %>% 
  left_join(dplyr::select(item, Word, Length, Log_Freq_HAL), by = c("D_Word" = "Word"))
head(lexical_final)

| | Sub_ID | Trial | Type | D_RT | D_Word | Outlier | D_Zscore |
|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` |
| 1 | 157 | 1 | 1 | 710 | browse | False | -0.437 |
| 2 | 67 | 1 | 1 | 1094 | refrigerant | False | 0.825 |
| 3 | 120 | 1 | 1 | 587 | gaining | False | -0.645 |
| 4 | 21 | 1 | 1 | 984 | cheerless | False | 0.025 |
| 5 | 236 | 1 | 1 | 577 | pattered | False | -0.763 |
| 6 | 236 | 2 | 1 | 715 | conjures | False | -0.364 |


| | Occurrences | Word | Length | Freq_HAL | Log_Freq_HAL |
|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<int>` | `<chr>` | `<dbl>` |
| 1 | 1 | synergistic | 11 | 284 | 5.649 |
| 2 | 1 | synonymous | 10 | 951 | 6.858 |
| 3 | 1 | syntactical | 11 | 114 | 4.736 |
| 4 | 1 | synthesis | 9 | 6742 | 8.816 |
| 5 | 1 | synthesized | 11 | 2709 | 7.904 |
| 6 | 1 | synthesizer | 11 | 1390 | 7.237 |


| | Sub_ID | Trial | Type | D_RT | D_Word | Outlier | D_Zscore |
|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| 1 | 157 | 1 | 1 | 710 | browse | False | -0.437 |
| 2 | 67 | 1 | 1 | 1094 | refrigerant | False | 0.825 |
| 3 | 120 | 1 | 1 | 587 | gaining | False | -0.645 |
| 4 | 21 | 1 | 1 | 984 | cheerless | False | 0.025 |
| 5 | 236 | 1 | 1 | 577 | pattered | False | -0.763 |
| 6 | 236 | 2 | 1 | 715 | conjures | False | -0.364 |


| | Sub_ID | Trial | Type | D_RT | D_Word | Outlier | D_Zscore | Length | Log_Freq_HAL |
|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<int>` | `<dbl>` |
| 1 | 157 | 1 | 1 | 710 | browse | False | -0.437 | 6 | 8.856 |
| 2 | 67 | 1 | 1 | 1094 | refrigerant | False | 0.825 | 11 | 4.644 |
| 3 | 120 | 1 | 1 | 587 | gaining | False | -0.645 | 7 | 8.304 |
| 4 | 21 | 1 | 1 | 984 | cheerless | False | 0.025 | 9 | 2.639 |
| 5 | 236 | 1 | 1 | 577 | pattered | False | -0.763 | 8 | 1.386 |
| 6 | 236 | 2 | 1 | 715 | conjures | False | -0.364 | 8 | 5.268 |


---
## 2. Model fitting (4 points)

First, fit a linear model with `Log_Freq_HAL` and `Length` as predictors, and `D_RT` as the output. Include an interaction term. Use `summary()` to look at the model output. 

In [None]:
# model1 [include interaction and main effects with "*"]
model1 <- lm(D_RT ~ Log_Freq_HAL*Length, data = lexical_final)
summary(model1)


Call:
lm(formula = D_RT ~ Log_Freq_HAL * Length, data = lexical_final)

Residuals:
     Min       1Q   Median       3Q      Max 
-1118.01  -205.23   -86.95    90.77  3147.07 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         610.1903    14.6678  41.601  < 2e-16 ***
Log_Freq_HAL         -6.0239     1.9678  -3.061  0.00221 ** 
Length               47.7531     1.6368  29.175  < 2e-16 ***
Log_Freq_HAL:Length  -2.9421     0.2348 -12.528  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 359.1 on 62606 degrees of freedom
Multiple R-squared:  0.09473,	Adjusted R-squared:  0.09469 
F-statistic:  2184 on 3 and 62606 DF,  p-value: < 2.2e-16
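
Because the model includes an interaction, the `Length` coefficient on its own is the effect of length only when `Log_Freq_HAL` is zero. The sketch below, which uses base R and the coefficients copied from the output above, first shows how the `*` formula operator expands and then computes the estimated length effect at a few log frequencies. The helper name `length_slope` and the frequencies 0, 5, and 10 are chosen for illustration only.

```r
# `a * b` in a formula is shorthand for `a + b + a:b`
attr(terms(D_RT ~ Log_Freq_HAL * Length), "term.labels")

# Coefficients copied from the summary(model1) output
b_length      <- 47.7531
b_interaction <- -2.9421

# Estimated change in RT (ms) per additional letter at a given log frequency
length_slope <- function(log_freq) b_length + b_interaction * log_freq

# Length effect for rare, mid-frequency, and common words
length_slope(c(0, 5, 10))
```

Under these estimates, each extra letter costs roughly 48 ms for the rarest words but only about 18 ms at a log frequency of 10, which is what the negative interaction term expresses.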


Now, install `lme4` using `install.packages()` and then load the library. 

In [None]:
install.packages("lme4")
library(lme4)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘minqa’, ‘nloptr’, ‘Rcpp’, ‘RcppEigen’


Loading required package: Matrix


Attaching package: ‘Matrix’


The following objects are masked from ‘package:tidyr’:

    expand, pack, unpack




Now fit a mixed effects model that includes the same predictors as the linear model above, as well as random intercepts for `Sub_ID` (i.e., each subject gets their own baseline shift in mean RT). Use `summary()` to look at the model output. 

In [None]:
# model2 [add in random intercept]
model2 <- lmer(D_RT ~ Log_Freq_HAL*Length + (1|Sub_ID), data = lexical_final)
summary(model2)

Linear mixed model fit by REML ['lmerMod']
Formula: D_RT ~ Log_Freq_HAL * Length + (1 | Sub_ID)
   Data: lexical_final

REML criterion at convergence: 888235.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.5058 -0.5472 -0.1568  0.3103 10.7381 

Random effects:
 Groups   Name        Variance Std.Dev.
 Sub_ID   (Intercept) 46333    215.3   
 Residual             82978    288.1   
Number of obs: 62610, groups:  Sub_ID, 299

Fixed effects:
                    Estimate Std. Error t value
(Intercept)         616.8445    17.1522  35.963
Log_Freq_HAL         -7.4374     1.5830  -4.698
Length               47.7477     1.3162  36.277
Log_Freq_HAL:Length  -2.8778     0.1888 -15.239

Correlation of Fixed Effects:
            (Intr) Lg_F_HAL Length
Log_Frq_HAL -0.645                
Length      -0.656  0.917         
Lg_Fr_HAL:L  0.582 -0.942   -0.923
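
For reference, the random-intercept model fitted above can be written out as follows. The notation is an assumption introduced here rather than something given in the assignment: $i$ indexes trials, $j$ indexes subjects, $u_j$ is the subject-specific intercept shift, and $\varepsilon_{ij}$ is the residual.

$$
\begin{aligned}
\mathrm{RT}_{ij} &= \beta_0 + \beta_1\,\mathrm{LogFreq}_{ij} + \beta_2\,\mathrm{Length}_{ij} + \beta_3\,\mathrm{LogFreq}_{ij}\,\mathrm{Length}_{ij} + u_j + \varepsilon_{ij} \\
u_j &\sim \mathcal{N}(0, \sigma_u^2), \qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\end{aligned}
$$

In the output, the `Sub_ID (Intercept)` variance of 46333 is the estimate of $\sigma_u^2$, the `Residual` variance of 82978 is the estimate of $\sigma^2$, and the four rows under `Fixed effects` are the estimates of $\beta_0$ through $\beta_3$.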

---
## 3. Model assessment (4 points)

Compare the t-values of the three predictors between the linear (fixed effects only) model and the mixed effects model. How do they differ, and why? 

> *The t-values in the linear model (model1) and the mixed-effects model (model2) have the same signs, and the coefficient estimates themselves barely change. What changes is their size: all three predictor t-values are farther from zero in model2 (Log_Freq_HAL: -3.061 vs. -4.698; Length: 29.175 vs. 36.277; Log_Freq_HAL:Length: -12.528 vs. -15.239), while the intercept's t-value is smaller in model2 (41.601 vs. 35.963). The reason is the standard errors. The random intercept (1|Sub_ID) absorbs between-subject differences in overall speed (variance 46333) that model1 treats as residual error, so the residual standard deviation drops from 359.1 to 288.1 and the standard errors of the three predictors shrink. The intercept's standard error instead grows (14.67 to 17.15), because the intercept is now the mean of a population of subject baselines and its uncertainty includes the between-subject variability. In other words, the random effect accounts for variation that would otherwise be lumped into the error term, so the fixed effects are estimated more precisely and the model fits better.* 
> 
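
The change in t-values can be traced back to the standard errors with a little arithmetic. The following base R sketch uses only numbers copied from the two `summary()` outputs above; the variable names are made up for illustration.

```r
# Standard errors copied from summary(model1) and summary(model2)
se_lm   <- c(Intercept = 14.6678, Log_Freq_HAL = 1.9678,
             Length = 1.6368, Interaction = 0.2348)
se_lmer <- c(Intercept = 17.1522, Log_Freq_HAL = 1.5830,
             Length = 1.3162, Interaction = 0.1888)

# Ratio of standard errors, mixed model over linear model
se_ratio <- se_lmer / se_lm
round(se_ratio, 2)

# Ratio of residual standard deviations (288.1 in model2, 359.1 in model1)
resid_ratio <- 288.1 / 359.1
round(resid_ratio, 2)

# Share of the random variance attributable to subjects (intraclass correlation)
var_subject  <- 46333
var_residual <- 82978
icc <- var_subject / (var_subject + var_residual)
round(icc, 3)
```

The three predictor standard errors shrink by about the same factor as the residual standard deviation (roughly 0.80), while the intercept's standard error grows by a factor of about 1.17. The intraclass correlation of about 0.36 indicates that over a third of the variance not explained by the fixed effects is due to stable differences between subjects.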

Use the Akaike Information Criterion (AIC) to compare these two models. Which one is better? 

In [None]:
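# Compare the two models by AIC (lower is better)
AIC(model1, model2)

# Store the table so the difference can be computed
ic <- AIC(model1, model2)
ic # same output as AIC(model1, model2)
diff(ic$AIC) # AIC(model2) - AIC(model1); a negative value favors model2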

| | df | AIC |
|---|---|---|
| | `<dbl>` | `<dbl>` |
| model1 | 5 | 914436.4 |
| model2 | 6 | 888247.6 |


| | df | AIC |
|---|---|---|
| | `<dbl>` | `<dbl>` |
| model1 | 5 | 914436.4 |
| model2 | 6 | 888247.6 |


> *The preferable model is the one with the lower AIC. The mixed-effects model (model2) has an AIC of 888247.6 compared to model1's 914436.4, so model2 is the better model. The difference, computed as model2 minus model1, is about -26188.8.* 
> 
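
Two details sit behind the AIC table: the `df` column counts every estimated parameter (four coefficients plus the residual variance for model1, and one additional subject variance for model2), and `lmer()` fits by REML by default whereas `lm()` reports an ordinary maximum-likelihood log-likelihood. A stricter comparison would refit the mixed model with `REML = FALSE`. The sketch below illustrates both points on the `sleepstudy` data that ships with `lme4`, not on the homework data; the helper `aic_by_hand` is a made-up name.

```r
library(lme4)  # also provides the sleepstudy example data

# Linear model versus random-intercept model, both fitted by maximum likelihood
fit_lm  <- lm(Reaction ~ Days, data = sleepstudy)
fit_lme <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = FALSE)

# AIC by hand: 2 * (number of estimated parameters) - 2 * log-likelihood
aic_by_hand <- function(fit) {
  ll <- logLik(fit)
  2 * attr(ll, "df") - 2 * as.numeric(ll)
}

# Hand calculation for both models
c(lm = aic_by_hand(fit_lm), lmer = aic_by_hand(fit_lme))

# Built-in table; the AIC column should agree with the hand calculation
AIC(fit_lm, fit_lme)
```

With an AIC gap as large as the one in this homework, refitting by maximum likelihood would be very unlikely to change which model is preferred, but it keeps the two log-likelihoods on the same footing.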

---
## 4. Reflection (1 point)

What other random effects could be controlled for in this data set? 

> *The fixed effects in model2 are Log_Freq_HAL, Length, and their interaction, and the DV is D_RT. That leaves the following variables in the lexical_final dataset: Sub_ID, Trial, Type, D_Word, Outlier, and D_Zscore. Sub_ID is already a random effect in model2, but Trial, Type, D_Word, Outlier, and D_Zscore are also possible random effects we could include. We do not want to include too many random effects, because the model could overfit, become less generalizable, or fail to converge. It is also important to consider whether random effects overlap and account for the same variance, since several random effects capturing the same thing add little. With mixed-effects models, there is also an element of subjectivity in deciding which random effects to include (e.g., to what degree is adding a given random effect hypothesis-driven?). Based on my own judgment, Trial and D_Word would be the best random intercepts to add, given what model2 is examining.*
> 
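
As an illustration of what adding a word-level random intercept looks like in `lme4` syntax, here is a minimal sketch on simulated data. Everything about the toy data set (its size, the effect sizes, and the names `toy` and `fit_crossed`) is an assumption made for the example; only the column names mirror the homework data.

```r
library(lme4)

set.seed(1)
n_sub  <- 20
n_word <- 30

# Every simulated subject responds to every simulated word (crossed design)
toy <- expand.grid(Sub_ID = factor(1:n_sub), D_Word = factor(1:n_word))

# Subject-level and word-level baseline shifts, plus a word-level predictor
sub_shift  <- rnorm(n_sub, sd = 200)
word_shift <- rnorm(n_word, sd = 50)
word_len   <- sample(3:12, n_word, replace = TRUE)

toy$Length <- word_len[as.integer(toy$D_Word)]
toy$D_RT   <- 700 + 40 * toy$Length +
  sub_shift[as.integer(toy$Sub_ID)] +
  word_shift[as.integer(toy$D_Word)] +
  rnorm(nrow(toy), sd = 100)

# Crossed random intercepts: one per subject and one per word
fit_crossed <- lmer(D_RT ~ Length + (1 | Sub_ID) + (1 | D_Word), data = toy)

# Estimated variance components for subjects, words, and residual
VarCorr(fit_crossed)
```

On the homework data the analogous call would add `+ (1 | D_Word)` to the `model2` formula, and the AIC comparison from the previous section could then be used to check whether the extra term is worthwhile.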

**DUE:** 5pm EST, March 15, 2023

**IMPORTANT** Did you collaborate with anyone on this assignment? If so, list their names here. 
> *Someone's Name*