A recipe to learn linear transformation between different word embeddings


This document describes how to learn a linear transformation between different word embeddings (e.g. counting-based and prediction-based embeddings). For more details, see our paper:

Bollegala, Hayashi, Kawarabayashi. Learning Linear Transformations between Counting-based and Prediction-based Word Embeddings. PLoS ONE 12(9): e0184544, 2017.

Unfortunately, the original code is messy, so I decided to show only the core recipe of our learning algorithm here.


Let u_i be the m-dimensional embedding vector and v_i be the n-dimensional embedding vector for word i. The core idea is to learn C, an m-by-n matrix that maps v_i to u_i so that u_i ~= C v_i. To this end, we define the objective function over p words as \sum_{i=1}^p ||u_i - C v_i||^2 = ||U - V C^T||^2_F, where U (p by m) and V (p by n) stack the embeddings u_i and v_i of the p words as rows, and ||.||_F denotes the Frobenius norm.
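As a quick sanity check of the objective, here is a minimal NumPy sketch. It assumes that U (p by m) and V (p by n) hold one embedding per row, in which case the matrix form of the objective is ||U - V C^T||^2_F. The sizes, the random data, and the function name `objective` are made up for illustration and are not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, n = 50, 4, 6            # hypothetical sizes: p words, target dim m, source dim n

V = rng.normal(size=(p, n))   # row i is v_i (source embedding)
C_true = rng.normal(size=(m, n))
U = V @ C_true.T              # row i is u_i = C v_i (noise-free toy targets)

def objective(C, U, V):
    """Squared Frobenius norm ||U - V C^T||_F^2."""
    return np.linalg.norm(U - V @ C.T, ord="fro") ** 2

# Compare against the per-word sum  sum_i ||u_i - C v_i||^2  for an arbitrary C.
C = rng.normal(size=(m, n))
per_word = sum(np.sum((U[i] - C @ V[i]) ** 2) for i in range(p))

print(np.isclose(per_word, objective(C, U, V)))  # the two forms should agree
print(objective(C_true, U, V))                   # should be (numerically) zero
```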

We use stochastic gradient descent (SGD) to learn C. For SGD, Vowpal Wabbit (VW) is helpful because it scales efficiently to large data sets.

Note that the problem is equivalent to m-variate linear regression. However, because VW cannot handle multidimensional output, we split the problem into m scalar-output linear regression problems. For each output dimension j=1,...,m, we need to create a file in the VW input format. In the VW format, each line corresponds to one training sample, and the entire file looks something like this:

u_1j | 1:v_11 2:v_12 ... n:v_1n
u_2j | 1:v_21 2:v_22 ... n:v_2n
...
u_pj | 1:v_p1 2:v_p2 ... n:v_pn
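
The file layout above could be produced with something like the following sketch. It assumes U and V are NumPy arrays with one word per row. The helper names `vw_line` and `write_vw_files`, the `dim<j>.vw` file naming, and the 6-significant-digit number formatting are my own choices for illustration, not part of the original code.

```python
import numpy as np
from pathlib import Path

def vw_line(target, features):
    """Format one training sample as 'label | 1:x1 2:x2 ... n:xn'."""
    feats = " ".join(f"{k}:{float(x):.6g}" for k, x in enumerate(features, start=1))
    return f"{float(target):.6g} | {feats}"

def write_vw_files(U, V, out_dir, prefix="dim"):
    """Write one file per output dimension j; return the list of file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for j in range(U.shape[1]):
        path = out_dir / f"{prefix}{j + 1}.vw"   # hypothetical naming scheme
        with path.open("w") as f:
            for u_i, v_i in zip(U, V):
                # label is the j-th coordinate of u_i, features are all of v_i
                f.write(vw_line(u_i[j], v_i) + "\n")
        paths.append(path)
    return paths
```

For example, `vw_line(0.5, [1.0, -2.0])` gives the string `0.5 | 1:1 2:-2`, and `write_vw_files(U, V, "vw_data")` writes m files, one for each output dimension.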

Running VW on the file for each j=1,...,m yields the weight vector c_j, which is the j-th row of the transformation C=[c_1;...;c_m].
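
To illustrate how the m separate regressions combine into C, here is a small sketch that uses a plain NumPy SGD loop as a stand-in for VW. It is not VW itself and not our original code. The function `sgd_regress`, the learning rate, the number of epochs, and the noise-free toy data are all assumptions for illustration, and the sketch has no intercept term.

```python
import numpy as np

def sgd_regress(X, y, lr=0.01, epochs=50, seed=0):
    """Plain SGD on squared loss for one scalar-output regression (no bias term)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):   # visit samples in random order
            err = X[i] @ w - y[i]           # prediction error on sample i
            w -= lr * err * X[i]            # gradient step for 0.5 * err^2
    return w

rng = np.random.default_rng(1)
p, m, n = 200, 3, 5                         # hypothetical toy sizes
V = rng.normal(size=(p, n))                 # row i is v_i
C_true = rng.normal(size=(m, n))
U = V @ C_true.T                            # row i is u_i = C_true v_i

# One regression per output dimension j; c_j becomes the j-th row of C.
C_hat = np.vstack([sgd_regress(V, U[:, j]) for j in range(m)])

print(C_hat.shape)                          # (m, n)
print(np.abs(C_hat - C_true).max())         # small on this noise-free toy data
```

Because the m problems share the inputs V and differ only in the target column U[:, j], they are independent and can be run in parallel.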
