The Linear Information Sieve
Python HTML
Switch branches/tags
Nothing to show
Latest commit d3e4b3a Aug 7, 2017 @gregversteeg committed on GitHub Update README.md
Permalink
Failed to load latest commit information.
tests Change filenames to not conflict with discrete sieve (which is sieve.py) May 30, 2016
LICENSE More clean-up May 25, 2016
README.md Update README.md Aug 7, 2017
linearsieve.py More clean-up May 30, 2016
vis_sieve.py

README.md

Extract common information with the linear sieve

This code is based on the following reference:

This work in turn builds on the theoretical results from Greg Ver Steeg and Aram Galstyan. "The Information Sieve", ICML 2016.

In turn, these results are largely superseded by an approach that learns latent factors jointly (instead of incrementally here).

While this version of the sieve makes some strong assumptions: latent factors are linear functions of the data and we assume that the form of marginal distributions are Gaussian, the algorithm is super-fast, stable, and scalable. As discussed in the paper, you are free to gaussianize the marginals of each observed variable yourself ahead of time, perhaps using this code.

The weights for each latent factor are found incrementally by optimizing a nonlinear objective, max TC(X;Y).
Therefore, we now specify just one number: the maximum number of latent factors to use, n_hidden, and the structure is learned automatically. As we add new latent factors, more and more common information is extracted until eventually TC(X|Y) = 0. In other words, the data is independent conditioned on Y.

Dependencies

Linear sieve requires numpy. If you use OS X, I recommend installing the Scipy Superpack.

The visualization capabilities in vis_sieve.py require other packages:

  • matplotlib - Already in scipy superpack.
  • networkx - A network manipulation library.
  • sklearn - Already in scipy superpack and only required for visualizations.
  • graphviz (Optional, for compiling produced .dot files into pretty graphs. The command line tools are called from vis_sieve. Graphviz should be compiled with the triangulation library for best visual results).

Install

To install, download using this link or clone the project by executing this command in your target directory:

git clone https://github.com/gregversteeg/LinearSieve.git

This research code will probably not be further developed. See Linear CorEx for ongoing developments. (And eventually look for tf_corex, for a more flexible framework implemented in TensorFlow.)

Example tests and command line interface

We generally try to follow the sklearn coding style using .fit(), .transform(), and .predict() methods.

Try the following tests from the project directory:

python tests/test_weak_correlations.py
python tests/test_faces.py
python vis_sieve.py tests/data/test_big5.csv --n_hidden=5 -v --no_row_names -o big5
python vis_sieve.py tests/data/adni_blood.csv --n_hidden=30 --missing=-1e6 -v -o adni

Vis_sieve.py automatically generates different plots. In the relationships folder, for each latent factor you will see pairwise plots of observed variables related to that factor. The color of the dot corresponds to the latent factor. The graphs folder will show the hierarchical relationships. The file tree_sfdp_w_splines.pdf is usually the best.

Note that missing values are imputed in vis_sieve beforehand. Using the command line API, entries that are np.nan are treated as missing.

You can do in-painting experiments by training a model, then mask missing values in a test set with np.nan values, then generate y's for these test examples. Finally, use the .predict() method to predict the values of all x's from the y's. Here's an example of how the python API looks.

import linearsieve as sieve
ns, nv = x_train.shape  # x_train is an array with rows for samples and columns for variables.
out = sieve.Sieve(n_hidden=3).fit(x_train)  
y = out.transform(x_test)  # missing values in x_test are set to np.nan
x_predict = out.predict(y)  # all values are predicted, missing and not missing
print out.ws[:,:nv]  # These are the weights for how each latent factor depends on each variable.