This notebook walks through the workflow for the functions we have implemented. It also shows how to use the `make` function in `hash.py` to return pickled objects, either by creating them or by loading them from disk. The example uses both the PMI features and the PCA-reduced features, but the explanation focuses on the latter.

In [1]:
import pickle
import warnings

from utils.hash import make
from utils.calculate_pmi_features import *
from utils.reduce_dimensions import run_PCA

warnings.filterwarnings('ignore')

Let's load the PMI features.

In [3]:
X = make(run_PMI, 'calculate_pmi_features.py', {'p' : 'datamatrix.pkl'}) 

datamatrix.pkl does not exist; creating it


OSError: [Errno 9] Bad file descriptor

The PMI features are an input for `run_PCA`.

`make`, the function that controls whether the object is loaded or created, takes three parameters: `fn`, `s`, and `kwargs`. The first is a pickle-making function, such as `run_PCA`. The second is the name of the Python script where that function lives; in our example, it's `reduce_dimensions.py`. The third, `kwargs`, is a `dict` that contains the arguments that `fn` requires. `run_PCA`, for example, is defined as follows:

```python
def run_PCA(datamatrix, p, ppath, n_components = 50, force_update = False):
    ...
```

(`p` used to be `filename`.)

Note that `run_PCA` has been updated slightly to work with `make` and the associated functions in `hash.py`. One other important note, as you'll see below, is that `ppath` doesn't need to be specified in `kwargs`; that is taken care of inside `make`.

Because `n_components` and `force_update` have default values in `run_PCA` (and because `ppath` is added by `make`), we only need to specify two keys and values in `kwargs`: `p` and `datamatrix`. (For `run_PMI`, all that's needed in `kwargs` is `p`.)
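To make the `kwargs` handling concrete, here is a minimal sketch of how a `make`-like wrapper could add `ppath` before calling `fn`. It is not the code in `hash.py`: `make_sketch`, `run_PCA_stub`, and the `ppath` default of `'pickles/'` are assumptions made for illustration, and the stub only mirrors the `run_PCA` signature.

```python
def make_sketch(fn, s, kwargs, ppath='pickles/'):
    """Hypothetical stand-in for `make`: add `ppath`, then call `fn`.

    `s` (the script name) is accepted to mirror the real signature but is
    not used in this sketch.
    """
    call_kwargs = dict(kwargs)      # copy so the caller's dict is not mutated
    call_kwargs['ppath'] = ppath    # assumption: the wrapper supplies the pickle directory
    return fn(**call_kwargs)


def run_PCA_stub(datamatrix, p, ppath, n_components=50, force_update=False):
    """Stub with the same signature as `run_PCA`; it only echoes its arguments."""
    return {'p': p, 'ppath': ppath, 'n_components': n_components,
            'force_update': force_update, 'n_rows': len(datamatrix)}


# Only `p` and `datamatrix` are supplied; the defaults and `ppath` fill in the rest.
result = make_sketch(run_PCA_stub, 'reduce_dimensions.py',
                     {'p': 'pca_dict_50.pkl', 'datamatrix': [[0.0], [1.0]]})
print(result)
```

Copying `kwargs` before adding `ppath` keeps the caller's dictionary unchanged, so the same `kwargs` can be reused across calls.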

In [3]:
make(run_PCA, 'reduce_dimensions.py',
     {'p' : 'pca_dict_50.pkl', 'datamatrix' : X})

pca_dict_50.pkl exists
the hash for pca_dict_50.pkl matches what's in hashes.json
checking whether the script has been updated...
it hasn't; loading pca_dict_50.pkl


{'X_reduced': array([[ 0.18461784, -0.59392098,  0.05070021, ..., -0.41726548,
         -0.44980572,  1.12606931],
        [-0.50393825, -0.35784087,  0.16208498, ...,  1.11070428,
         -1.01771069,  1.09786131],
        [ 0.41400084, -0.35436631,  0.02021982, ...,  0.12590572,
          0.38861386, -0.3308449 ],
        ..., 
        [ 0.17358788, -0.32249302,  0.06982146, ...,  0.4090696 ,
         -0.17619902, -0.48538154],
        [-0.53240028, -0.7816336 ,  0.14151083, ...,  0.08645779,
          0.84967482,  0.16384582],
        [ 0.08520294,  0.73222126,  0.01892021, ...,  0.72291067,
         -0.47952472, -0.81623336]]),
 'pca': PCA(copy=True, n_components=50, whiten=True)}

`make` prints information to let the user know what's happening. In the case above, the pickled object exists and its hash matches what's in `hashes.json`. Because `reduce_dimensions.py` hasn't been updated (that is, it isn't newer than `pca_dict_50.pkl`), `make` simply returns the loaded pickle file.
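The sequence of checks described above can be sketched as follows. This is a rough illustration, not the implementation in `hash.py`: the helper names `file_digest` and `needs_rebuild`, the use of SHA-256, and the assumption that `hashes.json` maps file names to digests are all hypothetical.

```python
import hashlib
import json
import os
import pickle
import tempfile


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes (hash choice is an assumption)."""
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()


def needs_rebuild(pkl_path, script_path, hashes_path, force_update=False):
    """Return (rebuild, reason), checking in the order the printed messages suggest."""
    name = os.path.basename(pkl_path)
    if not os.path.exists(pkl_path):
        return True, name + ' does not exist'
    if force_update:
        return True, 'user forcing update of ' + name
    stored = {}
    if os.path.exists(hashes_path):
        with open(hashes_path) as f:
            stored = json.load(f)
    if stored.get(name) != file_digest(pkl_path):
        return True, 'the hash for ' + name + ' does not match'
    if os.path.getmtime(script_path) > os.path.getmtime(pkl_path):
        return True, 'the script is newer than the pickle'
    return False, 'loading ' + name


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, 'demo_script.py')
    pkl = os.path.join(tmp, 'demo.pkl')
    hashes = os.path.join(tmp, 'hashes.json')
    with open(script, 'w') as f:
        f.write('# placeholder script\n')
    with open(pkl, 'wb') as f:
        pickle.dump({'answer': 42}, f)
    with open(hashes, 'w') as f:
        json.dump({'demo.pkl': file_digest(pkl)}, f)

    os.utime(script, (1000, 1000))   # script older than the pickle
    os.utime(pkl, (2000, 2000))
    fresh = needs_rebuild(pkl, script, hashes)

    os.utime(script, (3000, 3000))   # script now newer than the pickle
    stale = needs_rebuild(pkl, script, hashes)

print(fresh)
print(stale)
```

Setting the modification times explicitly with `os.utime` keeps the demonstration independent of how quickly the files are written.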

In the example below, we tell `make` to force an update, that is, to recreate `pca_dict_50.pkl`. This also updates the hash value in `hashes.json`. Because we have not changed anything about how the object is created, the object is the same, and so the hash value is unchanged.

In [4]:
make(run_PCA, 'reduce_dimensions.py',
     {'p' : 'pca_dict_50.pkl', 'datamatrix' : X, 'force_update' : True})

pca_dict_50.pkl exists
user forcing update of pca_dict_50.pkl
variance explained with 50 components: 0.4505119125117733


(PCA(copy=True, n_components=50, whiten=True),
 array([[ 0.18461784, -0.59392098,  0.05070021, ..., -0.41726548,
         -0.44980572,  1.12606931],
        [-0.50393825, -0.35784087,  0.16208498, ...,  1.11070428,
         -1.01771069,  1.09786131],
        [ 0.41400084, -0.35436631,  0.02021982, ...,  0.12590572,
          0.38861386, -0.3308449 ],
        ..., 
        [ 0.17358788, -0.32249302,  0.06982146, ...,  0.4090696 ,
         -0.17619902, -0.48538154],
        [-0.53240028, -0.7816336 ,  0.14151083, ...,  0.08645779,
          0.84967482,  0.16384582],
        [ 0.08520294,  0.73222126,  0.01892021, ...,  0.72291067,
         -0.47952472, -0.81623336]]))
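The point that recreating an unchanged object leaves its hash unchanged can be checked in isolation. The sketch below assumes the hash is a digest of the pickled bytes; the notebook does not show which hash function `hash.py` uses, so SHA-256 and the helper name `digest_of` are assumptions.

```python
import hashlib
import pickle


def digest_of(obj):
    """Pickle `obj` with a fixed protocol and return the SHA-256 hex digest."""
    return hashlib.sha256(pickle.dumps(obj, protocol=4)).hexdigest()


settings = {'n_components': 50, 'whiten': True}

first = digest_of(settings)                                   # original object
second = digest_of(settings)                                  # "recreated" with nothing changed
changed = digest_of({'n_components': 10, 'whiten': True})     # different contents

print(first == second)
print(first == changed)
```

A digest is only stable if the pickled bytes are stable, so an object built by a non-deterministic procedure could hash differently after a forced update.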