# Code and scripts overview

An overview of the most important scripts needed to reproduce the results.

## data_cleaning_pkl.py

- Token normalization  
- Typo correction (`check_typo.py`)
- Time-aligned filtering
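
As a rough illustration of how normalization and time-aligned filtering can fit together, the sketch below lowercases tokens, strips surrounding punctuation, and drops any token that ends up empty together with its timestamp. The function names and the normalization rule are hypothetical; the actual rules in `data_cleaning_pkl.py` may differ.

```python
import re

def normalize_token(token: str) -> str:
    """Lowercase and strip leading/trailing non-word characters (assumed rule)."""
    return re.sub(r"^\W+|\W+$", "", token.lower())

def clean_story(words, times):
    """Normalize tokens; drop empty ones along with their timestamps."""
    cleaned_words, cleaned_times = [], []
    for word, time in zip(words, times):
        word = normalize_token(word)
        if word:  # keep words and timestamps in sync
            cleaned_words.append(word)
            cleaned_times.append(time)
    return cleaned_words, cleaned_times
```

Filtering words and timestamps in the same loop keeps the two lists the same length, which later alignment to TRs relies on.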

## EDA.py

- Text EDA: word counts and unique-word distributions per story (histograms)
- Lexical frequency analysis: top 30 most frequent words across all stories (bar plots)
- fMRI EDA: number of TRs per story (histogram) and voxel intensity distributions (Q–Q plots)
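
The text statistics above can be computed with the standard library alone. This is a minimal sketch that assumes each story is available as a list of tokens in a dict; `story_word_stats` and that layout are illustrative, not taken from `EDA.py`.

```python
from collections import Counter

def story_word_stats(stories, top_n=30):
    """stories: dict mapping story name -> list of tokens (assumed layout)."""
    # (total word count, unique word count) per story
    per_story = {name: (len(toks), len(set(toks))) for name, toks in stories.items()}
    # word frequencies pooled across all stories
    overall = Counter(tok for toks in stories.values() for tok in toks)
    return per_story, overall.most_common(top_n)
```

The per-story pairs feed the histograms, and the `most_common` list feeds the bar plot.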

## EDA_fMRI.py

- Loads fMRI time series data for Subject 2 and Subject 3  
- Visualizes voxel–voxel correlation structure (first 5,000 voxels)  
- Visualizes raw fMRI signal matrices (TR × voxel heatmaps)  
- Saves correlation and signal overview figures for basic fMRI quality checks
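
A voxel–voxel correlation matrix of this kind can be sketched with NumPy as follows, assuming the BOLD data is a TR × voxel array. The function name and default are illustrative only.

```python
import numpy as np

def voxel_correlation(bold: np.ndarray, n_voxels: int = 5000) -> np.ndarray:
    """Pearson correlation between the first n_voxels columns of a TR x voxel array."""
    subset = bold[:, :n_voxels]
    # rowvar=False treats each column (voxel) as a variable
    return np.corrcoef(subset, rowvar=False)
```

Restricting to a voxel subset keeps the matrix small enough to plot; 5,000 voxels gives a 5,000 × 5,000 matrix.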

## word2vec.py

- Cleans narrative text data using a custom preprocessing pipeline
- Trains a Word2Vec skip-gram model on the cleaned story corpus
- Generates and saves a t-SNE visualization of learned word embeddings
- Builds word-level embedding time series aligned with word and TR timestamps
- Downsamples word embeddings to fMRI TR resolution
- Aligns stimulus features with subject-specific fMRI BOLD data
- Applies fixed temporal trimming and time-delay embedding
- Saves per-subject, per-story design matrices for voxel-wise encoding models
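
The downsampling and time-delay steps can be sketched as below. This is an assumption-laden illustration: per-TR averaging stands in for whatever interpolation `word2vec.py` actually uses, and the function names, the `tr` parameter, and the default delays are hypothetical.

```python
import numpy as np

def downsample_to_trs(word_vecs, word_times, tr_onsets, tr):
    """Average word vectors whose timestamps fall in [onset, onset + tr)."""
    out = np.zeros((len(tr_onsets), word_vecs.shape[1]))
    for k, onset in enumerate(tr_onsets):
        mask = (word_times >= onset) & (word_times < onset + tr)
        if mask.any():  # TRs without words stay all-zero
            out[k] = word_vecs[mask].mean(axis=0)
    return out

def make_delayed(features, delays=(1, 2, 3, 4)):
    """Stack copies of the features shifted by each (non-negative) delay in TRs."""
    n_trs = features.shape[0]
    blocks = []
    for d in delays:
        shifted = np.zeros_like(features)
        shifted[d:] = features[:n_trs - d]  # row t holds features from TR t - d
        blocks.append(shifted)
    return np.hstack(blocks)
```

The delayed copies let a linear model capture the lag between a stimulus and the BOLD response; the output has `len(delays)` times as many columns as the input.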

## word2vec_uncleaned.py
- Runs the same pipeline as `word2vec.py`, but without the data-cleaning step

## word2vec_tsne.py
- Loads and cleans the story transcripts from `raw_text.pkl`
- Uses an existing Word2Vec embedding space (precomputed t-SNE coordinates)
- Selects a curated list of example words to highlight
- Finds the corresponding positions of these words in the embedding space
- Plots all word embeddings as a background scatter plot
- Annotates selected example words to illustrate semantic relationships
- Saves the resulting t-SNE visualization to `figures/word2vec_tsne_grid.png`

## run_regression.py

- Runs voxel-wise ridge regression encoding models for fMRI data
- Uses precomputed embedding features (e.g. Word2Vec, GloVe, BERT) stored in `embeddings_<runid>/`
- Processes two subjects (Subject 2 and Subject 3)
- For each subject:
  - Loads per-story delayed embedding matrices and BOLD data
  - Splits data into train (80%) / test (20%)
  - Trains ridge regression models with multiple regularization strengths (ALPHAS)
  - Uses K-fold cross-validation (K=3) to select the best alpha
  - Computes performance metrics (e.g. voxel-wise correlations)
  - Saves learned regression weights and metrics to disk
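
The alpha selection and the correlation metric can be illustrated with a small NumPy sketch. It assumes contiguous (unshuffled) folds and a single alpha chosen by mean validation correlation across voxels; `voxelwise_corr` and `select_alpha` are hypothetical names, and the script may select alpha differently.

```python
import numpy as np

def voxelwise_corr(y_true, y_pred):
    """Pearson correlation between true and predicted responses, per voxel (column)."""
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    # constant columns get a correlation of 0 instead of NaN
    return (yt * yp).sum(axis=0) / np.where(denom == 0, np.inf, denom)

def select_alpha(X, Y, alphas, k=3):
    """Pick the alpha with the best mean validation correlation over k folds."""
    indices = np.arange(X.shape[0])
    folds = np.array_split(indices, k)  # contiguous folds, no shuffling
    mean_scores = []
    for alpha in alphas:
        fold_scores = []
        for val_idx in folds:
            train_idx = np.setdiff1d(indices, val_idx)
            Xtr, Ytr = X[train_idx], Y[train_idx]
            # closed-form ridge solution on the training folds
            W = np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(X.shape[1]), Xtr.T @ Ytr)
            fold_scores.append(voxelwise_corr(Y[val_idx], X[val_idx] @ W).mean())
        mean_scores.append(np.mean(fold_scores))
    return alphas[int(np.argmax(mean_scores))]
```

Contiguous folds are used here because neighbouring TRs are correlated, so shuffled folds would leak information between training and validation data.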

The core computation is handled by `run_streaming_exact_ridge_for_subject(...)`,
which performs memory-efficient, exact ridge regression suitable for
high-dimensional fMRI data.
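
One common way to make exact ridge regression memory-efficient is to accumulate the Gram matrices story by story and solve the normal equations once. The sketch below shows that idea under the assumption that this is what "streaming" means here; it is not the body of `run_streaming_exact_ridge_for_subject`, and the helper names are hypothetical.

```python
import numpy as np

def accumulate_gram(chunks):
    """Sum X^T X and X^T Y over (X, Y) chunks without stacking them in memory."""
    xtx, xty = None, None
    for X, Y in chunks:
        if xtx is None:
            xtx = np.zeros((X.shape[1], X.shape[1]))
            xty = np.zeros((X.shape[1], Y.shape[1]))
        xtx += X.T @ X
        xty += X.T @ Y
    return xtx, xty

def solve_ridge(xtx, xty, alpha):
    """Exact ridge weights from the accumulated Gram matrices."""
    return np.linalg.solve(xtx + alpha * np.eye(xtx.shape[0]), xty)
```

Because the sums are exact, the result matches a ridge fit on the fully stacked data, while only one story's matrices need to be in memory at a time.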

## detailed_cc_analysis.py

- Loads the saved ridge-regression results
- Extracts voxel-wise test correlations
- Produces a two-panel diagnostic figure:
  - voxel-wise correlation (CC) vs. voxel index for both subjects
  - Fourier analysis (FFT) of the CC vector
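
The Fourier step can be sketched as follows, treating the CC vector as a signal sampled once per voxel index. The function name is illustrative, and removing the mean before the transform is an assumption made so that the zero-frequency term does not dominate the plot.

```python
import numpy as np

def cc_spectrum(cc):
    """Amplitude spectrum of a voxel-wise correlation vector."""
    cc = np.asarray(cc, dtype=float)
    centered = cc - cc.mean()  # remove the DC component
    amplitude = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(cc.size, d=1.0)  # cycles per voxel index
    return freqs, amplitude
```

A peak at frequency `f` indicates structure in the correlations that repeats roughly every `1 / f` voxel indices.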