## <div align="center"> <h1 align="center"> PUPILLOMETRY BASICS </h1> </div>
## <div align="center"> <h1 align="center"> FOR LINGUISTICS </h1> </div>

## <div align="center"> <h1 align="center"> PART II: Creating a GAMMs model </h1> </div>

## 1. Data, data, data

Download and read in the .csv file with the data. It can be downloaded directly from [Kaggle](https://www.kaggle.com/datasets/priscilalpezbeltrn/pupillometry-sample) (DOI: 10.34740/kaggle/ds/2021248), and it is also available on GitHub via [Git Large File Storage](https://github.com/prislb/Pupillometry_Basics/tree/data).

**Note:** the data file is fairly large (83.82 MB), so it may take a while to download depending on your computer specs and internet connection.

In [None]:
data <- read.csv("../input/pupillometry-sample/data_pup.csv")
head(data)

We will also subset the dataset to include only the conditions we are currently interested in (i.e., Non-variable Subjunctive, NVS, and Non-variable Indicative, NVI).

In [None]:
target_NV <- droplevels(data[(data$condition == "NVS") | (data$condition == "NVI"), ])
head(target_NV)
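The same subsetting step can be tried on a tiny made-up data frame to see what `droplevels()` does. This is only a sketch: the values and the extra condition labels `VS` and `VI` are invented for illustration, and `%in%` is used here as an alternative to chaining `==` comparisons with `|`.

```r
# Toy data frame standing in for the real dataset (all values are made up;
# the "VS" and "VI" labels are hypothetical extra conditions).
toy <- data.frame(
  condition = factor(c("NVS", "NVI", "VS", "VI", "NVS", "NVI")),
  corrected_pupil_size = c(1.2, 0.8, 0.5, 0.9, 1.1, 0.7)
)

# Keep only the two non-variable conditions, then drop the unused levels.
toy_NV <- droplevels(toy[toy$condition %in% c("NVS", "NVI"), ])

levels(toy$condition)     # all four levels are still present here
levels(toy_NV$condition)  # only "NVI" and "NVS" remain
nrow(toy_NV)              # 4 rows are kept
```

Without `droplevels()`, the subset would still carry the unused factor levels, which can lead to empty groups in later summaries and models.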

## 2. Call in the necessary package libraries
The main package we will be using is [mgcv](https://cran.r-project.org/web/packages/mgcv/index.html), which provides the `bam()` function we need to fit the model.
We will also use the [tidyverse](https://tidyverse.tidyverse.org/) due to its versatility for data wrangling.

In [None]:
library(mgcv)
library(tidyverse)

## 3. Prep the data for modeling

After we load in the data and prepare our environment, we must make sure that *all* categorical variables are converted into factors; otherwise the model won't run.

In [None]:
target_NV$participant <- as.factor(target_NV$participant)
target_NV$session <- as.factor(target_NV$session)
target_NV$condition <- as.factor(target_NV$condition)
target_NV$item <- as.factor(target_NV$item)
target_NV$regularity <- as.factor(target_NV$regularity)

# sanity check: this should return "factor"
class(target_NV$participant)
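When there are several columns to convert, the conversion and the sanity check can also be done in one pass. The sketch below uses a small invented data frame (the column names mirror the ones above, but the values are made up) and base R's `lapply()`.

```r
# Toy data frame with categorical columns stored as numbers or strings
# (values are made up for illustration).
toy <- data.frame(
  participant = c(1, 1, 2, 2),
  item = c("a", "b", "a", "b"),
  condition = c("NVS", "NVI", "NVS", "NVI"),
  bin = c(1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Convert every categorical column to a factor in one step.
factor_cols <- c("participant", "item", "condition")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Sanity check for all converted columns at once.
sapply(toy[factor_cols], is.factor)

# The numeric time variable is left untouched.
str(toy)
```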

## 4. Creating our first model

We will use the **bam()** function to create our model. 

The code below shows what a basic model with placeholder variables would look like:

model <- bam(dependent variable ~ independent variable **-> fixed effects structure** <br>
                     + s(time, by = independent variable, k = 20) **-> smooth for time by independent variable** <br>
                     + s(gaze_x, gaze_y) **-> smooth for gaze position** <br>
                     + s(time, participant, bs = 'fs', m = 1, k = 10) **-> factor smooth for time by participant / random smooth** <br>
                     + s(time, item, bs = 'fs', m = 1, k = 10) **-> factor smooth for time by item / random smooth** <br>
                     , family = "scat" **-> scaled t-distribution, which is more robust to heavy-tailed residuals than the normal distribution** <br>
                     , data = target_NV **-> dataset** <br>
                     , method = "fREML" **-> smoothing parameter estimation method: fast REML (required when discrete = TRUE)** <br>
                     , discrete = TRUE) **-> discretizes the covariates to speed up fitting**

**Note:** For more details on model design, see Wieling, M. (2018). Analyzing dynamic phonetic data using generalized additive mixed modeling: a tutorial focusing on articulatory differences between L1 and L2 speakers of English. *Journal of Phonetics, 70*, 86-116. https://doi.org/10.1016/j.wocn.2018.03.002   [Download](https://pure.rug.nl/ws/files/63442271/1_s2.0_S0095447017301377_main.pdf)
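To get a feel for the model structure without downloading the full dataset, the template can be tried on a small simulated dataset. The sketch below is an illustration only: the data, the variable names (`pupil`, `time`, and the conditions `A` and `B`) and the small `k` values are all made up, and the default Gaussian family is used instead of `"scat"` to keep the toy example quick. The gaze and item smooths are left out because the simulated data have no such columns.

```r
library(mgcv)

set.seed(1)

# Simulated stand-in data: 50 time points x 6 participants x 2 conditions.
# All names and values are invented for illustration.
sim <- expand.grid(
  time = seq(0, 1, length.out = 50),
  participant = factor(paste0("p", 1:6)),
  condition = factor(c("A", "B"))
)

# Each participant gets a random offset; condition B has a larger response.
offsets <- rnorm(6, sd = 0.3)
sim$pupil <- sin(2 * pi * sim$time) * ifelse(sim$condition == "B", 1.5, 1) +
  offsets[as.integer(sim$participant)] +
  rnorm(nrow(sim), sd = 0.2)

# Same structure as the template: a fixed effect of condition, one smooth
# over time per condition, and a by-participant factor smooth over time.
toy_model <- bam(
  pupil ~ condition
  + s(time, by = condition, k = 10)
  + s(time, participant, bs = "fs", m = 1, k = 5),
  data = sim,
  method = "fREML",
  discrete = TRUE
)

summary(toy_model)
```

The summary of this toy model has the same sections as the output of the real model, so it can serve as a quick reference while reading the rest of this part.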

In [None]:
# GAMMs tend to take a while to converge.
# My computer's specs are:
# MacBook Pro (16-inch, 2019), Processor is 2.6 GHz 6-Core Intel Core i7, Memory is 16 GB 2667 MHz DDR4
# Model 1 took approximately 5 minutes to converge

model1 <- bam(corrected_pupil_size ~ condition 
             + s(bin, by = condition, k = 20) # these data were binned into 20 ms time bins (bin = time)
             + s(gaze_x, gaze_y) 
             + s(bin, participant, bs = 'fs', m = 1, k = 10) 
             + s(bin, item, bs = 'fs', m = 1, k = 10)
             , family = "scat"
             , data = target_NV
             , method = "fREML"
             , discrete = TRUE)

summary(model1)

## Let's dissect this scary output

#### Parametric coefficients

These are our fixed effects. As usual in regression, the intercept is the value of the dependent variable when all numerical predictors are equal to 0 and nominal variables are at their reference level. In this case, the intercept represents the value of the DV for condition Non-Variable Indicative (NVI). We observe that for condition NVS, pupillary dilation is estimated to be 7.13 units smaller than for NVI (coefficient = -7.13), which supports our hypothesis.
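The way the intercept and the condition coefficient relate to the group means can be checked on a minimal example. This sketch uses an ordinary linear model on four invented observations (not the pupillometry data), relying on R's default treatment coding, in which the alphabetically first level (`NVI`) is the reference.

```r
# Four made-up observations, two per condition.
toy <- data.frame(
  condition = factor(c("NVI", "NVI", "NVS", "NVS")),
  pupil = c(10, 12, 3, 5)
)

# The first level is the reference level under treatment coding.
levels(toy$condition)[1]

fit <- lm(pupil ~ condition, data = toy)

# (Intercept) is the mean of the reference level NVI: (10 + 12) / 2 = 11.
# conditionNVS is the difference between the NVS and NVI means: 4 - 11 = -7.
coef(fit)

# Group means for comparison.
tapply(toy$pupil, toy$condition, mean)
```

If a different reference level is needed, `relevel(toy$condition, ref = "NVS")` changes it before fitting.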

#### Smooth terms

**1. edf**

The edf value is indicative of the amount of non-linearity of the smooth. If the edf value for a certain smooth is (close to) 1, this means that the pattern is (close to) linear, while a value greater than 1 indicates that the pattern is more complex (i.e. non-linear).

**2. Ref.df and F**

The Ref.df value is the reference number of degrees of freedom used for hypothesis testing, which is carried out on the basis of the associated F-value.

**3. p-value**

The p-value associated with each smooth indicates whether the smooth is significantly different from 0. In this case, all smooths are highly significantly different from 0. If we focus on the first two lines of the table, which are those of interest to us, we see that the F-value for NVI is much higher than for NVS, indicating stronger evidence that the NVI smooth differs from 0.
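The columns described above can also be pulled out of the summary as a table. The sketch below fits a small `gam()` to simulated data (all names and values are invented) with one predictor that has a linear effect and one with a wiggly effect, so the difference in edf can be seen directly. It assumes only that `summary()` of an mgcv model stores the smooth-term table in its `s.table` element.

```r
library(mgcv)

set.seed(2)

# Simulated data: y depends linearly on x_lin and non-linearly on x_wiggly.
n <- 300
sim <- data.frame(x_lin = runif(n), x_wiggly = runif(n))
sim$y <- 2 * sim$x_lin + sin(2 * pi * sim$x_wiggly) + rnorm(n, sd = 0.3)

fit <- gam(y ~ s(x_lin) + s(x_wiggly), data = sim, method = "REML")

# One row per smooth term, with the edf, Ref.df, F and p-value columns.
smooth_table <- summary(fit)$s.table
smooth_table

# The edf of the linear effect should be close to 1,
# while the edf of the wiggly effect should be clearly larger.
smooth_table[, "edf"]
```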

#### Goodness-of-fit measures

The adjusted r2 represents the amount of variance explained by the regression, and the deviance explained is a generalization of r2 that will be very similar to the actual r2. Both of them are pretty good in this model. Keep in mind that with time series data, especially when we try to account for as much variability as possible, it is *very common* to see deviance explained values that are quite low. There is nothing inherently wrong with this; it is just a by-product of the type of data and models we are using.
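Both measures can be read directly from the summary object instead of from the printed output. This is a minimal sketch on simulated data (invented for illustration), assuming the `r.sq` and `dev.expl` elements that mgcv stores in the result of `summary()`.

```r
library(mgcv)

set.seed(3)

# Simulated data with a single non-linear effect plus noise.
n <- 200
sim <- data.frame(x = runif(n))
sim$y <- sin(2 * pi * sim$x) + rnorm(n, sd = 0.3)

fit <- gam(y ~ s(x), data = sim, method = "REML")
fit_summary <- summary(fit)

# Adjusted r-squared and proportion of deviance explained (both on a 0-1 scale).
fit_summary$r.sq
fit_summary$dev.expl
```

For a Gaussian model like this toy example the two values are expected to be close to each other, in line with the description above.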

## In Part III, we will learn to visualize GAMMs results.

## To be continued!