# Data Science in Psychology & Neuroscience (DSPN): 

## Lecture 19. Some Advanced Data Modeling Stuff.

### Date: November 9, 2023

### To-Dos From Last Class:

* Assignment #5
    
### Today:

* Robust variants of GLM tests
* Linear Mixed-Effects Models
    * AKA multi-level models, mixed models, random effect models, etc.

### Homework

* Assignment #5

# plot theme stuff
# Many palettes available online, can customize
# these are from: https://colorbrewer2.org/#type=qualitative&scheme=Set1&n=9
my_palette <- c("#e41a1c","#377eb8","#4daf4a","#984ea3","#ff7f00")

# This is the basic function I use for all the ggplots I create. 
# Modified from this black themed ggplot function: https://gist.github.com/jslefche/eff85ef06b4705e6efbc
my_theme = function(base_size = 24, base_family = "") {
  
  theme_grey(base_size = base_size, base_family = base_family) %+replace%
    
    theme(
      # Specify axis options
      axis.line = element_blank(),  
      axis.text.x = element_text(size = base_size*0.8, color = "black", lineheight = 0.9),  
      axis.text.y = element_text(size = base_size*0.8, color = "black", lineheight = 0.9),  
      axis.ticks = element_line(color = "black", linewidth = 0.2),  # linewidth replaces size for lines in ggplot2 >= 3.4.0
      axis.title.x = element_text(size = base_size, color = "black", margin = margin(10, 0, 0, 0)),
      axis.title.y = element_text(size = base_size, color = "black", angle = 90, margin = margin(0, 10, 0, 0)),  
      axis.ticks.length = unit(0.3, "lines"),   
      # Specify legend options
      legend.background = element_rect(color = NA, fill = "#ffffff"),  
      legend.key = element_rect(color = "black",  fill = "#ffffff"),  
      legend.key.size = unit(2, "lines"),  
      legend.key.height = NULL,  
      legend.key.width = NULL,      
      legend.text = element_text(size = base_size*0.8, color = "black"),  
      legend.title = element_text(size = base_size*0.8, face = "bold", hjust = 0, color = "black"),
      legend.position = "right",  
      legend.text.align = NULL,  
      legend.title.align = NULL,  
      legend.direction = "vertical",  
      legend.box = NULL, 
      # Specify panel options
      panel.background = element_rect(fill = "#ffffff", color  =  NA),  
      panel.border = element_rect(fill = NA, color = "black"),  
      panel.grid.major = element_line(color = "#ffffff"),  
      panel.grid.minor = element_line(color = "#ffffff"),  
      panel.spacing = unit(2, "lines"),
      # Specify facetting options
      strip.background = element_rect(fill = "grey30", color = "grey10"),  
      strip.text.x = element_text(size = base_size*0.8, color = "black"),  
      strip.text.y = element_text(size = base_size*0.8, color = "black",angle = -90),  
      # Specify plot options
      plot.background = element_rect(color = "#ffffff", fill = "#ffffff"),  
      plot.title = element_text(size = base_size*1.2, color = "black"),  
      plot.margin = unit(rep(1, 4), "lines")
    ) 
}

# Loading tidyverse and creating example df
library(tidyverse)
df <- tibble(pid  = c(1,2,3,4,5,6,7,8,9,10,11),
             age = c(10,25,26,25,30,34,40,40,40,25,80),
             tv_news = c(4.0,5.0,5.0,4.5,6.0,7.0,5.5,6.0,7.0,8.5,9.0),
             experience = as.factor(c(0,0,0,0,0,1,0,1,1,1,1)),
             crime_seriousness = c(21,28,27,26,33,36,31,35,41,80,95)) %>%
    mutate(age_ord = ifelse(age<=25,1,
                            ifelse(age<40,2,3)))

# So you want to model some data...

<img src='img/decision_tree.png' width='500'>

## Table 1: Summarizing the pros / cons

| Type of Model | Distribution Assumptions | Characteristics | Sensitivity to violations |
| --- | --- | --- | --- |
| Parametric | Specific, inflexible | Optimal when assumptions are met | High |
| Robust | Parametric, allowing some flexibility | Good performance in many situations | Moderate |
| Nonparametric | No specific distributional form assumed | Sub-optimal, but acceptable across almost any distribution | Very low |

* Data science in practice: 
    * If assumptions are not violated, use standard parametric models. 
    * If they are violated a bit (often the case in psych and neuro), use robust variants.
    * If your data are truly wacky, use nonparametric models.
        * In class, the only nonparametric test I'll cover is Spearman's _rho_.
   
For a thoughtful discussion of these issues, see <a href="https://www.sciencedirect.com/science/article/pii/S0005796717301067">Field & Wilcox (2017)</a>.
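
As a minimal sketch of the nonparametric option named above, Spearman's _rho_ can be computed with base R's `cor.test()`. The two vectors are copied from the example `df` only so the snippet runs on its own, and `exact = FALSE` is my choice here to avoid the warning that the tied `age` values would otherwise trigger.

```r
# Rebuild the two example columns so this snippet is self-contained
age <- c(10, 25, 26, 25, 30, 34, 40, 40, 40, 25, 80)
crime_seriousness <- c(21, 28, 27, 26, 33, 36, 31, 35, 41, 80, 95)

# Pearson's r works on the raw values, so extreme values can dominate it
pearson <- cor.test(age, crime_seriousness, method = "pearson")

# Spearman's rho is Pearson's r computed on the ranks
spearman <- cor.test(age, crime_seriousness, method = "spearman", exact = FALSE)

# Same thing by hand: correlate the (tie-averaged) ranks
rho_by_hand <- cor(rank(age), rank(crime_seriousness))

print(pearson$estimate)
print(spearman$estimate)
print(rho_by_hand)
```

Because only the ordering of the values matters, rescaling or stretching the largest ratings would leave _rho_ unchanged as long as the ranks stay the same.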

# Let's show you some examples of "robust" models!

## Example #1: modeling crime seriousness ratings __as a function of__ age

- Standard: Linear regression with ordinary least squares (OLS) estimation
    - But we already saw that the distribution of residuals is not normal and that there are some outliers, so the OLS estimates may be distorted.
- Robust: Robust linear regression with MM-type estimators
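
A rough sketch of how the two fits in Example #1 might be compared. It assumes `MASS::rlm()` with `method = "MM"` as the robust fitter (MASS ships with R); the course code may well use a different package, so treat the function choice as a stand-in. The data are copied from the example `df`.

```r
library(MASS)  # assumption: rlm() stands in for whichever MM-estimator the course uses

d <- data.frame(
  age = c(10, 25, 26, 25, 30, 34, 40, 40, 40, 25, 80),
  crime_seriousness = c(21, 28, 27, 26, 33, 36, 31, 35, 41, 80, 95)
)

# Standard: ordinary least squares
ols_fit <- lm(crime_seriousness ~ age, data = d)

# Robust: MM-type estimator (the initial S-estimate may use random subsampling,
# so a seed keeps the result reproducible)
set.seed(19)
mm_fit <- rlm(crime_seriousness ~ age, data = d, method = "MM")

print(coef(ols_fit))
print(coef(mm_fit))

# Final weights from the robust fit: values near 0 mark down-weighted observations
print(round(mm_fit$w, 2))
```

Comparing the two sets of coefficients, and checking which rows received small weights, is a quick way to see how much the outlying ratings drive the OLS line.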

## Example #2: modeling crime seriousness ratings between people with and without prior experience (i.e., difference between two levels of a single factor)

- Standard: Student's (if equal var assumed) or Welch's (if not assumed) t-test
- Robust: Yuen's t-test (compares trimmed means; functions from the WRS2 package)
    - yuen / yuenbt (bootstrap-t version): independent samples
    - yuend: paired samples
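
To make the idea behind Example #2 concrete, here is a base-R sketch of Yuen's test for two independent groups next to Welch's t-test. `yuen_by_hand()` is a hypothetical helper written for illustration (20% trimming assumed), not the packaged implementation; the ratings are copied from the example `df`, split by `experience`.

```r
# Example ratings split by prior experience (values from the example df)
no_exp  <- c(21, 28, 27, 26, 33, 31)
has_exp <- c(36, 35, 41, 80, 95)

# Standard: Welch's t-test (var.equal = TRUE would give Student's instead)
welch <- t.test(no_exp, has_exp, var.equal = FALSE)

# Hypothetical helper: Yuen's test on trimmed means, independent groups
yuen_by_hand <- function(x, y, tr = 0.2) {
  piece <- function(v) {
    n <- length(v)
    g <- floor(tr * n)   # observations trimmed from each tail
    h <- n - 2 * g       # effective sample size after trimming
    s <- sort(v)
    # Winsorize: pull the g most extreme values in to the nearest kept value
    w <- pmin(pmax(s, s[g + 1]), s[n - g])
    list(tmean = mean(s[(g + 1):(n - g)]),
         d = (n - 1) * var(w) / (h * (h - 1)),  # squared standard error term
         h = h)
  }
  a <- piece(x)
  b <- piece(y)
  t_stat <- (a$tmean - b$tmean) / sqrt(a$d + b$d)
  df <- (a$d + b$d)^2 / (a$d^2 / (a$h - 1) + b$d^2 / (b$h - 1))
  list(trimmed_means = c(a$tmean, b$tmean), t = t_stat, df = df,
       p = 2 * pt(-abs(t_stat), df))
}

res <- yuen_by_hand(no_exp, has_exp)
print(welch$statistic)
print(res)
```

The trimmed mean drops the most extreme value from each tail of each group before comparing, and the Winsorized variance supplies a matching standard error; that is the whole difference from Welch's test.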

## Example #3: modeling crime seriousness ratings as a function of the age ordinal variable (i.e., low, medium, or high age; _k_ levels of a single factor)

- Standard: Parametric One-way ANOVA & post-hoc test
- Robust: Robust One-way ANOVA & post-hoc test
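
A possible sketch for Example #3. The standard part uses base R (`aov()` and `TukeyHSD()`); the robust part assumes the WRS2 package's `t1way()` and `lincon()` and only runs if WRS2 is installed. The data and the age binning are copied from the example `df`; the level labels are added for readability.

```r
d <- data.frame(
  age = c(10, 25, 26, 25, 30, 34, 40, 40, 40, 25, 80),
  crime_seriousness = c(21, 28, 27, 26, 33, 36, 31, 35, 41, 80, 95)
)

# Same binning as the example df: age <= 25, 25 < age < 40, age >= 40
d$age_ord <- factor(ifelse(d$age <= 25, 1, ifelse(d$age < 40, 2, 3)),
                    labels = c("low", "medium", "high"))

# Standard: one-way ANOVA, then Tukey's HSD for the pairwise comparisons
aov_fit <- aov(crime_seriousness ~ age_ord, data = d)
print(summary(aov_fit))
tukey <- TukeyHSD(aov_fit)
print(tukey)

# Robust: trimmed-means one-way ANOVA and post-hoc tests (needs WRS2)
# Note: with groups this small, 20% trimming removes nothing (floor(0.2 * 4) is 0)
if (requireNamespace("WRS2", quietly = TRUE)) {
  print(WRS2::t1way(crime_seriousness ~ age_ord, data = d, tr = 0.2))
  print(WRS2::lincon(crime_seriousness ~ age_ord, data = d, tr = 0.2))
}
```

With only three or four observations per group, this toy dataset is useful for seeing the syntax, not for drawing conclusions.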

# Final option we'll code...

<img src='img/LMMs.png' width='500'>

Wonderful (relatively brief) intro lecture: https://www.youtube.com/watch?v=QCqF-2E86r0

## Things that often muck up regression and rm-ANOVA in psych/neuro
* Nested factors that create non-independent observations, e.g.:
    * Student:classroom.
    * Rats:litter.
    * For a good review of this issue in psychology, see Grawitch & Munz (2004), <a href="https://www.researchgate.net/profile/Matthew_Grawitch/publication/250890649_Are_Your_Data_Nonindependent_A_Practical_Guide_to_Evaluating_Nonindependence_and_Within-Group_Agreement/links/546ba7670cf20dedafd535fe.pdf">_Understanding Statistics_</a>.
* Missing data or unbalanced designs, e.g.:
    * Repeated-measures dataset with subjects missing observations.
* Repeats as a continuous factor, e.g.:
    * Time in longitudinal designs.
    
## LMMs are more flexible as they accommodate both fixed and random effects. 
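* Fixed: The effect you want to estimate and test (e.g., a treatment or predictor).
* Random: A grouping factor whose levels are a sample from a larger population; including it controls for non-independence from nested or hierarchical structure.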
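
A minimal sketch of what that split looks like in code, on made-up data. It assumes `nlme::lme()` as the fitter (nlme ships with R); the course may use a different LMM package, and every number in the simulation is invented for illustration.

```r
library(nlme)  # assumption: a stand-in for whichever LMM package the course uses

set.seed(19)
n_subj <- 20
d <- expand.grid(subject = factor(1:n_subj), time = 0:4)

# Simulated truth (all values made up): each subject has their own baseline,
# and scores drop 1.5 points per unit of time
baseline <- rnorm(n_subj, mean = 0, sd = 3)
d$score <- 50 - 1.5 * d$time + baseline[as.integer(d$subject)] +
  rnorm(nrow(d), sd = 1)

# Drop 10 observations at random to mimic an unbalanced repeated-measures design
d <- d[-sample(nrow(d), 10), ]

# Fixed effect: time (the slope we want to test)
# Random effect: an intercept per subject (accounts for within-subject dependence)
fit <- lme(score ~ time, random = ~ 1 | subject, data = d)

print(fixef(fit))    # population-level intercept and slope
print(VarCorr(fit))  # between-subject vs. residual variance
```

Note that the model uses all 90 remaining rows: subjects with missing time points are kept, and time enters as a continuous predictor.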


## Cautionary note:
* Too much flexibility in the statistical toolkit = a lack of standardized 'best practices'
    * <a href='https://www.sciencedirect.com/science/article/pii/S0749596X20300061'>Meteyard & Davies. (2020), _J Memory Lang_.</a>
* If the design is simple and balanced, with no missing data, an LMM and rm-ANOVA will typically give essentially identical results. 
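
The last point can be checked on simulated data. The sketch below assumes a balanced one-factor within-subject design with a random intercept per subject, uses `nlme::lme()` as the LMM fitter (an assumption about tooling), and all simulated values are made up.

```r
library(nlme)

set.seed(19)
n_subj <- 12
d <- expand.grid(subject = factor(1:n_subj), cond = factor(c("a", "b", "c")))

# Made-up truth: condition means differ, and each subject has their own offset
subj_shift <- rnorm(n_subj, sd = 3)
cond_shift <- c(a = 0, b = 1, c = 2)
d$y <- 10 + unname(cond_shift[as.character(d$cond)]) +
  subj_shift[as.integer(d$subject)] + rnorm(nrow(d), sd = 1)

# Repeated-measures ANOVA: subject defines the error stratum
rm_fit <- aov(y ~ cond + Error(subject), data = d)
f_rm <- summary(rm_fit)[["Error: Within"]][[1]][["F value"]][1]

# LMM with a random intercept per subject (default REML fit)
lmm_fit <- lme(y ~ cond, random = ~ 1 | subject, data = d)
f_lmm <- anova(lmm_fit)["cond", "F-value"]

# The two F statistics for cond should agree closely in this balanced case
print(c(rm_anova = f_rm, lmm = f_lmm))
```

Deleting a few rows from `d` before fitting is an easy way to see where the two approaches start to part ways.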


## Fixed vs. Random Effects -- Examples

* Influence of __dopamine agonists__ on __lever pulls__ in rats reared in __different cages__. 
    * _What is fixed and what is random?_
* Impact of __trauma__ exposure on __amygdalar reactivity__ to threat stimuli in a large __multi-site__ study.
    * _What is fixed and what is random?_
* __Age__-related decline in __cognitive flexibility__, measured longitudinally within __subjects__.
    * _What is fixed and what is random?_