# Discussion on linear mixed-effects models, with R-based examples

Linear mixed models are widely applicable for statistical modeling of experimental and observational data used in science. Many traditional analyses that have evolved in different fields (e.g., t-tests, regression analyses, ANOVA, ANCOVA, hierarchical linear modeling, structural equation modeling, growth curve modeling) can all be understood as special cases of (or are closely related to) linear mixed models. Thus, understanding linear mixed models can:
1. help provide overarching conceptual clarity to what might otherwise appear as a zoo of techniques, and
2. provide a framework and library of methods that is likely directly applicable to your scientific question.

Linear mixed models can also be generalized in many ways to help model a wider range of datasets (e.g., binary/categorical responses, counts, non-normal distributions, etc.).


> For a definitive treatment of Bayesian inference, including linear mixed models, see:<br>
**Box, G. E., & Tiao, G. C. (2011)**. Bayesian inference in statistical analysis (Vol. 40). John Wiley & Sons. Originally published in 1973.

## Data preparation

For our discussion we will use the R language along with the package ```lme4```. To make working with R and ```lme4``` easy, we will organize all datasets into tables (stored as ```.csv``` files) in the so-called "long" format. The long (as opposed to "wide") format mirrors the thinking that underlies the statistical modeling, and so helps with conceptual clarity too, in addition to making it easy to work with in software.

In the long format, each row of the table contains a single observation of some "response" or "output" variable, along with the values of all other experimentally manipulated or co-observed variables that represent the conditions under which that particular response observation was made.

For example, consider an experiment where a certain brain response ("resp", summarized as a single number) is measured under three different stimulus manipulations (say "low", "medium", "high") from many subjects belonging to two different groups (say children in the "ASD" and "TD" groups). Let's also say that we suspect participant age to influence the response, and we happen to have a reasonably wide age range such that it is useful to "take that into account" in our modeling. The table of data for this experiment might look something like the one below:

| resp    |  stim  |  group | subject |  age  |
|---------|--------|--------|---------|-------|
| 2.5     |  low   |  ASD   |  s1     |  7    |
| 1.8     | medium |  ASD   |  s1     |  7    |
| 4.0     |  high  |  ASD   |  s1     |  7    |
| 3.8     |  low   |   TD   |  s2     |  13   |
| 4.1     | medium |   TD   |  s2     |  13   |
| 5.4     |  high  |   TD   |  s2     |  13   |
| 2.9     |  low   |  ASD   |  s3     |  12   |
| 4.0     | medium |  ASD   |  s3     |  12   |
| 5.0     |  high  |  ASD   |  s3     |  12   |
| ...     |  ...   |  ...   |  ...    |  ...  |
 

This long-format table makes it explicit that we are interested in capturing how the response goes up or down as the level of each experimental or co-observed variable changes.
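
If your data currently sit in a "wide" table (one row per subject, one column per stimulus level), a conversion along the following lines may help. This is only a minimal sketch in base R: the wide layout, the object names `wide` and `long`, and the output file name are assumptions made for illustration, and the numbers are simply the three example subjects from the table above.

```r
# Hypothetical "wide" table: one row per subject, one column per stimulus level.
# Values are copied from the example table above.
wide <- data.frame(
  subject = c("s1", "s2", "s3"),
  group   = c("ASD", "TD", "ASD"),
  age     = c(7, 13, 12),
  low     = c(2.5, 3.8, 2.9),
  medium  = c(1.8, 4.1, 4.0),
  high    = c(4.0, 5.4, 5.0)
)

# Stack the three stimulus columns into a single "resp" column,
# recording which column each value came from in "stim".
long <- reshape(
  wide,
  direction = "long",
  varying   = c("low", "medium", "high"),
  v.names   = "resp",
  timevar   = "stim",
  times     = c("low", "medium", "high"),
  idvar     = "subject"
)

# Keep the stimulus levels in their natural order rather than alphabetical.
long$stim <- factor(long$stim, levels = c("low", "medium", "high"))

# Sort so that each subject's rows sit together, and drop the row names.
long <- long[order(long$subject, long$stim), ]
rownames(long) <- NULL

# Write the long table to a .csv file (the path here is just a placeholder).
csv_path <- file.path(tempdir(), "example_long.csv")
write.csv(long, csv_path, row.names = FALSE)

print(long)
```

Each subject now contributes three rows, one per stimulus level, with `group` and `age` repeated on every row.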



### If you would like us to use your dataset as one of the examples discussed during our meeting ...

Please prepare your dataset in this long format as a ```.csv``` file. Once ready, please email the ```.csv``` file to Hari with a brief description of the data and an associated statistics question of interest in the *body of the e-mail*. For example, in the above experiment, the question could be something like, "Is the rate of growth of the brain response with stimulus level steeper in the TD group compared to the ASD group?". 
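
To give a flavor of how such a question might map onto a model, here is a rough sketch using ```lme4```. Several things in it are assumptions rather than recommendations: the data are simulated purely for illustration, the stimulus levels are coded as equally spaced numbers (`stim_num`, a hypothetical helper column), and each subject gets only a random intercept. The model we actually settle on for a given dataset may well differ.

```r
# Simulated data for illustration only; these are not real measurements.
set.seed(1)
n_subj     <- 20
subject_id <- rep(seq_len(n_subj), each = 3)

d <- data.frame(
  subject = factor(paste0("s", subject_id)),
  group   = factor(ifelse(subject_id <= n_subj / 2, "ASD", "TD"),
                   levels = c("ASD", "TD")),
  stim    = factor(rep(c("low", "medium", "high"), times = n_subj),
                   levels = c("low", "medium", "high")),
  age     = rep(sample(6:14, n_subj, replace = TRUE), each = 3)
)

# Assumption: treat low/medium/high as equally spaced (0, 1, 2).
d$stim_num <- as.integer(d$stim) - 1

# Simulate a steeper growth with stimulus level in the TD group,
# plus a subject-specific offset and measurement noise.
subj_offset <- rep(rnorm(n_subj, sd = 0.5), each = 3)
slope       <- ifelse(d$group == "TD", 1.0, 0.6)
d$resp      <- 2 + slope * d$stim_num + 0.1 * d$age +
               subj_offset + rnorm(nrow(d), sd = 0.3)

# Fit only if lme4 is installed, so the sketch does not fail without it.
if (requireNamespace("lme4", quietly = TRUE)) {
  fit <- lme4::lmer(resp ~ stim_num * group + age + (1 | subject), data = d)
  # The interaction coefficient estimates how much steeper the
  # stimulus slope is in the TD group than in the ASD group.
  print(lme4::fixef(fit)["stim_num:groupTD"])
}
```

In this formulation, the question about a steeper rate of growth in the TD group corresponds to the `stim_num:groupTD` interaction term. Allowing the slope to vary by subject, for example with `(1 + stim_num | subject)`, is another option, though with only three observations per subject such a model may not be well supported by the data.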

> **_NOTE_**:
If you have any missing values for a co-observed variable that make your dataset incomplete (e.g., in the above experiment you don't know the age of one of your subjects), you could just leave the cell blank. If what's missing is the main response of interest (e.g., you didn't measure the brain response in the "low" stimulus condition for a particular subject), then delete that entire row. Also, please avoid spaces in the names of columns/conditions etc., and keep names relatively short.
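
If you would like to check your file before sending it, something like the following base R sketch may be useful. The inline text stands in for a ```.csv``` file and its values are made up for illustration. It includes one row with a missing response (which should be dropped) and one subject with an unknown age (which is fine to leave blank).

```r
# Made-up stand-in for the contents of a .csv file.
csv_text <- "resp,stim,group,subject,age
2.5,low,ASD,s1,7
,medium,ASD,s1,7
4.0,high,ASD,s1,7
3.8,low,TD,s2,
4.1,medium,TD,s2,
"

# Read blank cells as NA; check.names = FALSE keeps column names exactly
# as written so that stray spaces in them can be detected.
d <- read.csv(text = csv_text, na.strings = c("", "NA"),
              strip.white = TRUE, check.names = FALSE)

# Column names should not contain spaces.
print(any(grepl("\\s", names(d))))

# Drop rows where the main response is missing.
d_clean <- d[!is.na(d$resp), ]

# Missing values in co-observed variables (here, age) are kept as NA.
print(colSums(is.na(d_clean)))
```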

In case we end up not having the time to use your data during the meeting, we can still go through it offline shortly thereafter.
