# Tutorial on Encoding Models with Word Embeddings
originally for NeuroHackademy 2020, by Alex Huth


In [None]:
# Load some basic stuff we'll need later
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Part 3: Tikhonov Regression

## How does changing the features before regression affect the result?

Now that you've learned about what ridge regression is, and (more or less) how it works, we can play around with it a bit and see what neat things we can make happen.

Let's start with something simple, working from an example with toy data. Let's modify our $x(t)$ by dividing the value of the first feature by 10. (Why? Don't worry about that yet. Just see where this goes.) This means we are replacing our original $x(t)$ by something new, we'll call it $x'(t)$, and define it like this (remember that $x(t)$ is a vector that contains $p$ different features):

$$x'(t) = \begin{bmatrix} \frac{x_1(t)}{10} & x_2(t) & \dots & x_p(t) \end{bmatrix} $$

What would the result of this change be for OLS regression?

More specifically: if we did OLS regression with the responses $y(t)$ and modified stimulus features $x'(t)$, we'd obtain a new set of weights $\beta'_{OLS}$. What do you think the relationship is between $\beta'_{OLS}$ and $\beta_{OLS}$, the weights we would've gotten with the original $x(t)$? Let's find out!

In [None]:
# As we did earlier, let's create some fake data so we can test things out
T_train = 100
T_test = 25
p = 5
noise_size = 10.0 # the standard deviation of the noise, epsilon

X_train = np.random.randn(T_train, p)
X_test = np.random.randn(T_test, p)

beta_true = np.random.randn(p)

Y_train = X_train.dot(beta_true) + noise_size * np.random.randn(T_train)
Y_test = X_test.dot(beta_true) + noise_size * np.random.randn(T_test)

# And let's estimate the weights using the original features, X_train
beta_estimate_orig = np.linalg.lstsq(X_train, Y_train, rcond=None)[0] # rcond=None avoids a FutureWarning in newer numpy

In [None]:
# Now let's create our modified X
X_train_mod = X_train.copy()
X_train_mod[:,0] /= 10.0 # divide the first feature by 10

# And re-estimate the weights using this one
beta_estimate_mod = np.linalg.lstsq(X_train_mod, Y_train, rcond=None)[0]

In [None]:
# And let's compare the estimated weights!
print("orig beta:", beta_estimate_orig)
print("mod  beta:", beta_estimate_mod)

Ok! What you should've found (spoiler alert) is that, when you're using OLS, making one of the features smaller by a factor of 10 just makes the corresponding weight value _bigger_ by a factor of 10. This makes the predictions of this new model _exactly the same_ as the predictions of the old model:

$$ x'(t)\beta' = x(t) \beta $$

Which... makes sense, right? If you're finding the $\beta$ (well, $\beta'$, in this case) that minimizes the error perfectly, then your regression method shouldn't really care about silly little things like dividing one of your features by 10.
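
If you'd like to check this claim directly rather than eyeballing the printed weights, here is a minimal sketch. It regenerates its own toy data (the seed and variable names are my choices here, not part of the cells above) and tests the three things we just said: the first weight grows 10x, the other weights don't move, and the predictions are identical.

```python
import numpy as np

# Toy data in the same style as above (fixed seed is an assumption, for repeatability)
rng = np.random.default_rng(0)
T, p = 100, 5
X = rng.standard_normal((T, p))
Y = X @ rng.standard_normal(p) + 10.0 * rng.standard_normal(T)

# Divide the first feature by 10
X_mod = X.copy()
X_mod[:, 0] /= 10.0

# OLS on the original and the modified features
beta = np.linalg.lstsq(X, Y, rcond=None)[0]
beta_mod = np.linalg.lstsq(X_mod, Y, rcond=None)[0]

print("first weight 10x bigger:", np.allclose(beta_mod[0], 10.0 * beta[0]))
print("other weights unchanged:", np.allclose(beta_mod[1:], beta[1:]))
print("same predictions:       ", np.allclose(X_mod @ beta_mod, X @ beta))
```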

So what about ridge? If we did ridge regression with $y(t)$ and $x'(t)$, obtaining new weights $\beta'_{ridge}$, how would those weights be related to the weights $\beta_{ridge}$ that you'd get from using the original $x(t)$?

In [None]:
from ridge import ridge

beta_est_ridge_orig = ridge(X_train, Y_train[:,None], alpha=1.0)
beta_est_ridge_mod = ridge(X_train_mod, Y_train[:,None], alpha=1.0)

In [None]:
# And let's compare the estimated weights!
print("ridge orig beta:", beta_est_ridge_orig.ravel())
print("ridge mod  beta:", beta_est_ridge_mod.ravel())

Alright, the result here is _really different_. Not only does the weight on the first feature not increase by a factor of 10, the weights on the other features have changed as well. Unlike the OLS case, this model is _not_ equivalent to the original ridge model! What's going on here?

When we tested OLS, the regression procedure was able to correct for our modification by changing the weights. In particular, it made the weight on the feature that we modified 10x bigger. But in ridge regression, it's _costly_ to make the weights big. Remember that the ridge loss includes a penalty proportional to $\beta^2$. So in order to make the weight 10x bigger, the penalty (at least for that one parameter) needs to increase by _100x_.

The result is that the weight is _not_ simply increased by 10x, it's only increased by about 5x. But there's more than that going on! The _other_ weights have also changed. Why did that happen? Setting the first weight to a smaller value than it should have been creates _new errors_ in the prediction of $y(t)$. To account for these errors, the model will change the values of the other weights in $\beta$ as well.
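
Here is a small sketch of the same experiment using the closed-form ridge equation instead of the `ridge` module. The helper `ridge_closed_form` is a hypothetical stand-in (the real `ridge` function may parameterize `alpha` differently), and I've fixed `beta_true` and lowered the noise so that the first weight is clearly non-zero and the ratio is easy to read.

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """Hypothetical helper: solve (X^T X + lam * I) beta = X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(0)
T, p = 100, 5
X = rng.standard_normal((T, p))
beta_true = np.array([2.0, -1.0, 0.5, 1.5, -0.5])  # fixed weights (assumption, for readability)
Y = X @ beta_true + rng.standard_normal(T)         # noise std 1.0, smaller than the 10.0 used above

X_mod = X.copy()
X_mod[:, 0] /= 10.0  # divide the first feature by 10

beta_orig = ridge_closed_form(X, Y, lam=1.0)
beta_mod = ridge_closed_form(X_mod, Y, lam=1.0)

# How much did the first weight grow? (10x would mean "fully corrected", as in OLS)
ratio = beta_mod[0] / beta_orig[0]
print("first-weight ratio:", ratio)

# Unlike OLS, the two ridge models do not make the same predictions
print("same predictions:", np.allclose(X @ beta_orig, X_mod @ beta_mod))
```

The ratio lands well short of 10 because the penalty pushes back on the large weight, and the predictions of the two models no longer agree.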

### _WHAT HAVE WE DONE?_

We just did an extremely simple thing: divided one of our feature values by 10. And it changed our entire model! How does this fit into any of the mathematical formalisms that we were dealing with earlier?

Let's redefine what we've done here more formally. This will help us discover what it is that we've managed to accomplish with this little stunt.

The only thing we did was divide one of the features by 10. Let's represent that as a matrix multiplication: $X A$. Remember that $X$ is a $T \times p$ matrix ($T$ rows, one for each "timepoint" in our dataset, and $p$ columns, one for each feature). Let's define $A$ as a $p \times p$ matrix that looks like this:

$$ A = \begin{bmatrix} 
0.1 & 0 & 0 & \dots \\ 
0 & 1 & 0 & \dots \\
0 & 0 & 1 & \dots \\
\vdots & \vdots & \vdots & \ddots
\end{bmatrix} $$

$A$ is a diagonal matrix (meaning it only has non-zero values on the main diagonal), and all of the values on the diagonal are 1 except the first, which we have set to 0.1 (i.e. dividing by 10). Multiplying this matrix on the right side of $X$ will do exactly what we did by hand before: scale the first feature down by a factor of 10.

So we can write the new model that we're trying to fit like this:

$$ Y = X A \beta' + \epsilon $$

And we know that when we did this using OLS, we found that the first weight had _increased_ by a factor of 10. That's like multiplying $\beta$ by the inverse of $A$, giving you:

$$ \beta'_{OLS} = A^{-1} \beta_{OLS} $$

If we try to combine these two things and make predictions using this new scaled model, we see that everything cancels out nicely:

$$ X A \beta'_{OLS} = X A A^{-1} \beta_{OLS} = X \beta_{OLS} $$

Why did this work out so nicely? It's because the OLS equation for $\beta$:

$$ \beta_{OLS} = (X^\top X)^{-1} X^\top Y $$
becomes
$$ \beta'_{OLS} = (A^\top X^\top X A)^{-1} A^\top X^\top Y $$
where we can [pop the $A$ and $A^\top$ out of the inverse](https://en.wikipedia.org/wiki/Invertible_matrix#Other_properties), giving:
$$ \begin{eqnarray}
\beta'_{OLS} &=& A^{-1}(X^\top X)^{-1} (A^\top)^{-1} A^\top X^\top Y \\
&=& A^{-1}(X^\top X)^{-1} X^\top Y \\
&=& A^{-1}\beta_{OLS}\\
\end{eqnarray} $$
confirming what we found empirically! Nice.
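
The derivation never used the fact that $A$ is diagonal, only that it is invertible. Here is a quick numerical sketch of that, using the normal equations directly. The dense matrix `A_dense` is something I made up for illustration (identity plus a small random perturbation, so it stays comfortably invertible).

```python
import numpy as np

rng = np.random.default_rng(1)
T, p = 100, 5
X = rng.standard_normal((T, p))
Y = rng.standard_normal(T)

def ols(X, Y):
    """OLS via the normal equations: (X^T X)^{-1} X^T Y."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

beta_ols = ols(X, Y)

# Case 1: the diagonal A from the text (first feature divided by 10)
A_diag = np.diag([0.1, 1.0, 1.0, 1.0, 1.0])
beta_mod_diag = ols(X @ A_diag, Y)
print(np.allclose(beta_mod_diag, np.linalg.inv(A_diag) @ beta_ols))

# Case 2: an arbitrary invertible A (hypothetical example)
A_dense = np.eye(p) + 0.1 * rng.standard_normal((p, p))
beta_mod_dense = ols(X @ A_dense, Y)
print(np.allclose(beta_mod_dense, np.linalg.inv(A_dense) @ beta_ols))
```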

### _WHAT HAVE WE DONE? (BUT FOR RIDGE)_

So what happened in the ridge case? Let's try to do the same trick with our ridge equation, $\beta_{ridge} = (X^\top X + \lambda I)^{-1} X^\top Y$:

$$ \beta'_{ridge} = (A^\top X^\top X A + \lambda I)^{-1} A^\top X^\top Y $$
We can't just pop the $A$ and $A^\top$ out of the inverse like we did before because there's a sum inside there.

But we can pop the outer $A^\top$ _into_ the inverse from the right side (its inverse ends up on the left side of each term), giving:

$$ \begin{eqnarray} \beta'_{ridge} &=& ((A^\top)^{-1} A^\top X^\top X A + \lambda (A^\top)^{-1})^{-1} X^\top Y \\
&=& (X^\top X A + \lambda (A^\top)^{-1})^{-1} X^\top Y \end{eqnarray}$$

Then we can use the old trick of multiplying by 1 (well, $I$) in the form of $A^{-1} A$, then pop the $A$ inside the inverse, giving:

$$\begin{eqnarray} \beta'_{ridge} &=& (A^{-1} A) (X^\top X A + \lambda (A^\top)^{-1})^{-1} X^\top Y \\
&=& A^{-1} (X^\top X A A^{-1} + \lambda (A^\top)^{-1} A^{-1})^{-1} X^\top Y \\
&=& A^{-1} (X^\top X + \lambda (A A^\top)^{-1})^{-1} X^\top Y 
\end{eqnarray} $$

Here we can see why the ridge solution came out just _different_. Similar to the OLS solution, it has this leading factor of $A^{-1}$, which is effectively multiplying the first weight by 10 here. But unlike the OLS solution, we still have this weird stuff sitting inside our big matrix inverse: the penalty factor is now $\lambda (A A^\top)^{-1}$ instead of just $\lambda I$. This is why all the other weights changed and not just the weight on the first feature.
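
If the matrix shuffling above felt a bit like sleight of hand, here is a sketch that checks the final expression numerically. It fits ridge on the scaled features $XA$ using the closed-form equation, then computes the derived form with the $\lambda (A A^\top)^{-1}$ penalty, and compares the two. The toy data and seed are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(2)
T, p, lam = 100, 5, 1.0
X = rng.standard_normal((T, p))
Y = rng.standard_normal(T)

A = np.diag([0.1, 1.0, 1.0, 1.0, 1.0])
XA = X @ A

# Ridge on the scaled features: (A^T X^T X A + lam I)^{-1} A^T X^T Y
beta_ridge_mod = np.linalg.solve(XA.T @ XA + lam * np.eye(p), XA.T @ Y)

# Derived form: A^{-1} (X^T X + lam (A A^T)^{-1})^{-1} X^T Y
penalty = lam * np.linalg.inv(A @ A.T)
beta_derived = np.linalg.inv(A) @ np.linalg.solve(X.T @ X + penalty, X.T @ Y)

print("forms agree:", np.allclose(beta_ridge_mod, beta_derived))
print("penalty diagonal:", np.diag(penalty))
```

Looking at the penalty diagonal, the first weight is being penalized far more heavily than the rest.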

To understand what this $(A A^\top)^{-1}$ factor is doing, let's go back to our Bayesian formulation, and in particular the prior on $\beta$. A [multivariate Gaussian distribution](https://en.wikipedia.org/wiki/Multivariate_normal_distribution) on $\beta$ with mean zero and covariance $\Sigma$ has the following form (I'm dropping the constant in front for convenience):
$$ P(\beta) \propto e^{-\frac{1}{2} \beta^{\top} \Sigma^{-1} \beta} $$

Originally we had set $\Sigma = \lambda^{-1} I$, making $\Sigma^{-1} = \lambda I$, which was the original factor in the ridge regression equation. Here, instead of $\lambda I$, we have $\lambda (A A^\top)^{-1}$. This suggests that we can interpret what we've done here—dividing the first feature in $X$ by 10—as _choosing a different prior for $\beta$_.

Specifically, we seem to have chosen $\Sigma = \lambda^{-1} A A^\top$. This prior says that each of the weights in $\beta$ has a prior variance of $\lambda^{-1}$ except the first, which has a prior variance of $\lambda^{-1} / 100$. Thus, the model is different because we pretty much told it that we expect the first weight to be much smaller than the others!
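
As a tiny worked example of that prior, here is the implied covariance computed for one arbitrary choice of $\lambda$ (the value 2.0 is just an assumption for illustration):

```python
import numpy as np

lam = 2.0  # arbitrary example value
A = np.diag([0.1, 1.0, 1.0, 1.0, 1.0])

# Implied prior covariance: Sigma = lam^{-1} A A^T
Sigma = (A @ A.T) / lam
prior_var = np.diag(Sigma)

print("prior variances:", prior_var)
print("ratio of other variances to the first:", prior_var[1] / prior_var[0])
```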

What we've rediscovered here, through this very simple manipulation, is an advanced regression technique called [Tikhonov regularization](https://en.wikipedia.org/wiki/Tikhonov_regularization#Tikhonov_regularization).

## Implications for data preprocessing before regularized regression

One of the key things you might want to take away from this is that inconsequential-seeming things, like scaling features appropriately, can have huge effects on regression models.

# Formal definition of Tikhonov regression

When we introduced the Bayesian interpretation of ridge regression, we created a prior distribution on $\beta$ that, more or less, suggested to the regression model that the weights should be small. We defined this prior as a multivariate Gaussian distribution with mean zero and covariance matrix $\lambda^{-1} I$, i.e. equal variance (of $\lambda^{-1}$) on each weight with zero covariance between weights.

In Tikhonov regression we are simply generalizing this idea. Instead of assuming that the prior covariance is a scaled identity matrix, we can assume _any_ covariance matrix we want!

And what's more, as we've already seen, we can do Tikhonov regression using the standard ridge regression tools that are already available to us. Let's run through how to do that:

### Tikhonov regression via ridge regression

1. Suppose we know our responses $Y$ and stimulus features $X$. We want to fit a Tikhonov regression model of the form $Y = X\beta + \epsilon$, with the prior $P(\beta) = \mathcal{N}(0, \lambda^{-1} \Sigma)$, for some covariance $\Sigma$.
2. We use some technique (there are many) to take a matrix square root of $\Sigma$, giving $\sqrt{\Sigma} = A^\top$, so that $A A^\top = \Sigma$.
3. We are going to use this new matrix to transform our stimulus features, and then fit a ridge regression model for $Y = (XA)\beta' + \epsilon$.
4. The resulting ridge weights are $\beta' = A^{-1} (X^\top X + \lambda (A A^\top)^{-1})^{-1} X^\top Y$.
5. Finally, we multiply these weights by $A$ to get the Tikhonov weights (this corrects for the factor of $A^{-1}$, giving us weights that can be applied to the original $X$): $\beta = A \beta'$.
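
The steps above can be sketched in a few lines of numpy. This is a minimal illustration under some assumptions, not the code used later in this tutorial: `ridge_closed_form` and `tikhonov_via_ridge` are hypothetical helpers, the Cholesky factorization is just one of the many possible matrix square roots, and the example $\Sigma$ is made up. The result is compared against the direct Tikhonov solution $(X^\top X + \lambda \Sigma^{-1})^{-1} X^\top Y$.

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """Hypothetical helper: plain ridge, (X^T X + lam I)^{-1} X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def tikhonov_via_ridge(X, Y, Sigma, lam):
    """Hypothetical helper following steps 2-5 of the recipe."""
    A = np.linalg.cholesky(Sigma)                  # step 2: lower-triangular A with A @ A.T == Sigma
    beta_prime = ridge_closed_form(X @ A, Y, lam)  # steps 3-4: ridge on the transformed features
    return A @ beta_prime                          # step 5: map back to weights on the original X

rng = np.random.default_rng(3)
T, p, lam = 100, 5, 2.0
X = rng.standard_normal((T, p))
Y = rng.standard_normal(T)

# Example prior covariance (assumption): neighboring weights are expected to be similar
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])

beta_tik = tikhonov_via_ridge(X, Y, Sigma, lam)
beta_direct = np.linalg.solve(X.T @ X + lam * np.linalg.inv(Sigma), X.T @ Y)

print("ridge trick matches direct solution:", np.allclose(beta_tik, beta_direct))
```

Note that this sketch needs $\Sigma$ to be positive definite so that the Cholesky factor exists and is invertible.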

# Word embeddings for Tikhonov regression models

Now let's get back to our fMRI experiment. We had subjects listen to stories and then we tried to predict the response of each voxel using a regression model where each word was a feature.

It turns out this didn't work terribly well using our OLS or ridge models. We also had this problem where some words might appear in our test set but not the training set, so we couldn't estimate weights for them at all.

Let's try to fix this model using Tikhonov regression! This will require one new concept: a **word embedding space**.

## Word embeddings

[Word embedding spaces](https://en.wikipedia.org/wiki/Word_embedding) are a tool for quantitatively modeling something related to the meaning of words. Famous examples of word embedding spaces are [word2vec](https://en.wikipedia.org/wiki/Word2vec) and [GloVe](https://en.wikipedia.org/wiki/GloVe_(machine_learning%29). The core idea of word embeddings is that they represent each word as a vector of numbers, where these vectors are specifically selected so that words with similar (or related) meanings have similar vectors.

For this exercise we're going to use a word embedding space of my own design called `english1000`. Let's load that space here and play with it a bit.

In [None]:
# Load semantic model
# The SemanticModel class is something I wrote to make it easy to deal with word embedding spaces
from SemanticModel import SemanticModel
eng1000 = SemanticModel.load("data/we_word_embeddings/small_english1000sm.hdf5")

### Visualizing a word
First let's plot the length-985 vector for one word to see what it looks like.

In [None]:
plot_word = "finger"

f = plt.figure(figsize=(15,5))
ax = f.add_subplot(1,1,1)
ax.plot(eng1000[plot_word], 'k')
ax.axis("tight")
ax.set_title("English1000 representation for %s" % plot_word)
ax.set_xlabel("Feature number")
ax.set_ylabel("Feature value");

### Visualizing more than one word
Next let's plot the vectors for three words: "finger", "hand", and "language". Here you will see that "finger" (in black) and "hand" (in red) look pretty similar, but "language" (in blue) looks very different. Neat.

In [None]:
plot_words = ["finger", "hand", "language"]
colors = ["k", "r", "b"]

f = plt.figure(figsize=(15,5))
ax = f.add_subplot(1,1,1)
wordlines = []

for ii, (word, color) in enumerate(zip(plot_words, colors)):
    wordlines.append(ax.plot(eng1000[word] - 8*ii, color)[0])

ax.axis("tight")
ax.set_title("English1000 representations for some words")
ax.set_xlabel("Feature number")
ax.legend(wordlines, plot_words);

### Semantic smoothness
One nice test of a vector-space semantic model is whether it results in a "semantically smooth" representation of the words. That is, do nearby words in the space have intuitively similar meanings? Here you can test that using the method `find_words_like_word`. 

Give any word (that the model knows about), and it will print out the 10 closest words (that it knows about) and their cosine similarities (or correlations, same thing in this case). This includes the word you supplied.

You can put different words in here and see what the model comes up with. 

*(Be warned: the model knows some dirty words. It was trained using the internet, after all.)*
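
About that "cosine similarities (or correlations, same thing)" remark: the two measures coincide whenever each vector has zero mean across its features, which is presumably the situation here (that is my assumption; nothing above states it). A quick sketch with made-up vectors, not the `eng1000` embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Two made-up, partially related vectors (hypothetical data)
rng = np.random.default_rng(4)
u = rng.standard_normal(50)
v = 0.7 * u + rng.standard_normal(50)

# Remove the mean of each vector, then compare the two measures
u_c = u - u.mean()
v_c = v - v.mean()

cos_centered = cosine_similarity(u_c, v_c)
corr = np.corrcoef(u, v)[0, 1]

print("cosine of centered vectors:", cos_centered)
print("Pearson correlation:       ", corr)
```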

In [None]:
# Test semantic model
eng1000.find_words_like_word("finger")

Here is another example, this time with an abstract noun, "language". Again the model does a pretty good job of finding related words.

In [None]:
eng1000.find_words_like_word("language")

A little more generally, we can grab the vectors for a set of words and then look at how related each pair of vectors is.

In [None]:
from covplot import covplot

sel_words = ['woman', 'girl', 'boy', 'man', 'street', 'park', 'alley', 'house']
sel_word_vectors = np.vstack([eng1000[w] for w in sel_words])
print(sel_word_vectors.shape)

sel_word_products = sel_word_vectors.dot(sel_word_vectors.T) / sel_word_vectors.shape[1]
covplot(sel_word_products)

plt.gca().xaxis.tick_top()
plt.xticks(range(len(sel_words)), sel_words, fontsize=15, rotation=90)
plt.yticks(range(len(sel_words)), sel_words, fontsize=15)
plt.colorbar();

## Using a word embedding space for Tikhonov regression

We're going to use these word embeddings to do Tikhonov regression for our fMRI experiment. Let's call the (number of embedding features $\times$ number of words) matrix of word embeddings $E$. We're going to choose the prior covariance for our regression weights to be proportional to $E^\top E$, i.e.

$$P(\beta) = \mathcal{N}(0, \lambda^{-1} E^\top E) $$

__This means that we expect (a priori) the regression weights on two words to be similar if those words have similar embedding vectors.__

For example, the words "woman" and "man" have very similar embedding vectors, according to the plot we created above. If we use the embedding vectors to create our Tikhonov prior, then we would be suggesting to our model that, if a voxel responds a lot to the word "woman", it probably also responds a lot to the word "man", and vice versa.

So how do we do this? We can partially follow the recipe from above, but we're actually going to have an easier time here than we would in the generic case since we don't need to take a matrix square root. We've already defined our prior covariance as $E^\top E$, so all we have to do is say $A = E^\top$. Let's give it a shot!
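
To make the shapes concrete, here is a toy sketch with a made-up embedding of 4 words in 3 dimensions (hypothetical numbers, nothing to do with `eng1000`). It shows that multiplying the word-indicator features by $A = E^\top$ replaces each timepoint's words with the sum of their embedding vectors, and that $A A^\top = E^\top E$ is a word-by-word matrix that is large for words with similar embeddings.

```python
import numpy as np

# Hypothetical embedding: 3 embedding features x 4 words.
# Words 0 and 1 are similar to each other, as are words 2 and 3.
E = np.array([[1.0, 0.9, 0.0, 0.1],
              [0.0, 0.1, 1.0, 0.9],
              [0.5, 0.4, 0.5, 0.6]])

# Indicator stimulus: one row per timepoint, one column per word
X = np.array([[1, 0, 0, 0],    # word 0 was spoken
              [0, 0, 1, 0],    # word 2 was spoken
              [1, 1, 0, 0]],   # words 0 and 1 were spoken
             dtype=float)

A = E.T          # words x embedding features
XA = X @ A       # timepoints x embedding features

print(XA.shape)
print(np.allclose(XA[2], E[:, 0] + E[:, 1]))  # a row is the sum of its words' embeddings

Sigma = A @ A.T  # = E.T @ E, the (unscaled) prior covariance between word weights
print(Sigma[0, 1], Sigma[0, 2])  # similar pair vs. dissimilar pair
```

Ridge regression on `XA` then estimates one weight per embedding feature, and multiplying by $A$ (step 5 of the recipe) turns those into one weight per word.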

In [None]:
# again, let's load up the feature matrices
# these were stored as "sparse" matrices in order to save space
# but we'll convert them back to normal matrices in order to use them in our regression
from scipy import sparse
training_features = sparse.load_npz('data/we_word_embeddings/indicator_Rstim.npz').toarray()
test_features = sparse.load_npz('data/we_word_embeddings/indicator_Pstim.npz').toarray()

# and the brain responses
import tables
response_tf = tables.open_file('data/we_word_embeddings/small-fmri-responses.hdf5')
training_resp = response_tf.root.zRresp.read()
test_resp = response_tf.root.zPresp.read()
brain_mask = response_tf.root.mask.read()
response_tf.close()

In [None]:
# now we'll apply the word embedding, multiplying it by both the training and test feature matrices

emb_training_features = training_features.dot(eng1000.data.T)
emb_test_features = test_features.dot(eng1000.data.T)

In [None]:
# as before, to accurately predict BOLD responses we need to account for hemodynamic delays
# we'll do that here by creating multiple time-shifted versions of the same stimulus
# this is called a finite impulse response or FIR model

from util import make_delayed
delays = [1,2,3,4]

del_training_features = make_delayed(emb_training_features, delays)
del_test_features = make_delayed(emb_test_features, delays)
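
For intuition about what a delayed (FIR) feature matrix looks like, here is a minimal sketch. The function `delay_features_sketch` is a hypothetical stand-in that I wrote for illustration; the real `make_delayed` in `util` may differ in details such as argument options or how it pads the edges. The idea is just to stack copies of the stimulus, each shifted down in time by one of the delays, with zeros filling the gap at the start.

```python
import numpy as np

def delay_features_sketch(stim, delays):
    """Hypothetical stand-in: hstack copies of `stim` shifted down by each (non-negative) delay."""
    n_t, n_f = stim.shape
    shifted = []
    for d in delays:
        out = np.zeros((n_t, n_f), dtype=stim.dtype)
        if d < n_t:
            out[d:] = stim[:n_t - d]  # row t of `out` holds the stimulus from time t - d
        shifted.append(out)
    return np.hstack(shifted)

# Tiny example: 5 timepoints, 2 features
stim = np.arange(10, dtype=float).reshape(5, 2)
delayed = delay_features_sketch(stim, [1, 2])

print(delayed.shape)
print(delayed)
```

Each row now contains the features from 1 and 2 timepoints earlier, so the regression can learn a separate weight for each feature at each delay.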

In [None]:
# to fit this ridge model we're going to use some code I wrote instead of the simple equation above
# this code is part of a package that does the really hard part of ridge regression,
# which is choosing the best lambda (called alpha here, apologies)
# here we are skipping that step, and just using a value that I know works pretty well
# if you want to see how the more complicated procedure works, 
# check out the `bootstrap_ridge` function in ridge.py

from ridge import ridge
beta_tik = ridge(del_training_features, training_resp, alpha=464.)

In [None]:
# now let's test our regression models on the held-out test data
pred_test_resp = del_test_features.dot(beta_tik)

import npp # a set of convenience functions I think are missing from numpy :)

test_correlations = npp.mcorr(test_resp, pred_test_resp)
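
In case you're wondering what the correlation step computes: for each voxel (column), it is the Pearson correlation between the actual and predicted response timecourses. Here is a minimal sketch of that computation on made-up data. The function `column_correlations` is a hypothetical stand-in and may differ from `npp.mcorr` in details (for example, how constant columns are handled).

```python
import numpy as np

def column_correlations(a, b):
    """Pearson correlation between matching columns of a and b (both time x voxels)."""
    az = (a - a.mean(0)) / a.std(0)  # z-score each column
    bz = (b - b.mean(0)) / b.std(0)
    return (az * bz).mean(0)         # mean product of z-scores = correlation

# Made-up "responses" and "predictions": 30 timepoints, 4 voxels
rng = np.random.default_rng(5)
actual = rng.standard_normal((30, 4))
predicted = 0.5 * actual + rng.standard_normal((30, 4))

corrs = column_correlations(actual, predicted)
print(corrs)
```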

In [None]:
# let's look at the histogram of correlations!
plt.hist(test_correlations, 50)
plt.xlim(-1, 1)
plt.xlabel("Linear Correlation")
plt.ylabel("Num. Voxels");

Now _that's_ a lot better! In fact, let's compare it to the OLS and ridge models.

In [None]:
ols_correlations = np.load('data/we_word_embeddings/ols_correlations.npy')
ridge_correlations = np.load('data/we_word_embeddings/ridge_correlations.npy')

plt.hist(ols_correlations, 50, label='OLS', histtype='step', lw=2)
plt.hist(ridge_correlations, 50, label='Ridge', histtype='step', lw=2)
plt.hist(test_correlations, 50, label='Tikhonov', histtype='step', lw=2)
plt.xlim(-1, 1)
plt.legend()
plt.xlabel("Linear Correlation")
plt.ylabel("Num. Voxels");

In [None]:
# let's also look at a brain map of the correlations!

import cortex

corr_volume = cortex.Volume(test_correlations, 'S1', 'fullhead', mask=brain_mask, vmin=-0.3, vmax=0.3, cmap='RdBu_r')
cortex.quickshow(corr_volume, with_curvature=True);

In [None]:
# you can also look at it in 3D!

cortex.webshow(corr_volume, open_browser=False)

## Conclusions

We've gone through three different types of regression models: ordinary least squares (OLS), ridge, and Tikhonov (with a prior based on word embeddings).

All three models used _exactly the same features_. They only differed in their prior beliefs about the weights:
* the OLS model made no assumptions, seeing all possible weights as equally reasonable,
* the ridge model assumed that the weights were _small_, i.e. close to zero, and
* the Tikhonov model assumed that the weights had a specific structure that matched the word embeddings we used.

These different prior beliefs led to _huge_ differences in prediction performance between models. We can think of the different priors as different **hypotheses** about how words might influence BOLD responses in cortex. Comparing the prediction performance of these models is then _testing_ these hypotheses. Our results suggest that the third hypothesis—that voxels respond similarly to words with similar meanings—is by far the most likely, given this brain data.

If you want to learn more about the science we can do with these models, again, please check out these papers:
* Huth, de Heer, Griffiths, Theunissen, & Gallant (2016) "Natural speech reveals the semantic maps that tile human cerebral cortex" [(Free Link)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4852309/)
* Nunez-Elizalde, Huth, & Gallant (2019) "Voxelwise encoding models with non-spherical multivariate normal priors" [(Journal Link)](https://www.sciencedirect.com/science/article/pii/S1053811919302988) [(Preprint Link)](https://www.biorxiv.org/content/biorxiv/early/2018/08/09/386318.full.pdf)