# Self-Governing Neural Networks (SGNN): the Projection Layer

> An SGNN word-projection preprocessing pipeline in scikit-learn

In this notebook, we'll use T=80 random hashing projection functions, each of dimensionality d=14, for a total of 80 × 14 = 1120 features per projected word in the projection function P.

Next, we'll need feedforward (dense) neural network layers on top of the projection, as in the paper, to re-encode it into a more useful representation. That is not done in this notebook: it is left to you to implement in your own neural network, so that the dense layers are trained jointly with a learning objective. The SGNN projection built here is therefore only a text preprocessing step that projects words into the hashing space, yielding sparse 1120-dimensional word features computed dynamically. Only the CountVectorizer needs to be fitted, because it computes the char n-gram term frequencies that feed the hasher. It could also be computed dynamically without any fit: the [power set](https://en.wikipedia.org/wiki/Power_set) of the possible n-grams could serve as sparse indices computed on the fly, as (indices, count_value) tuples.
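
As a rough illustration of a fit-free char n-gram count, scikit-learn's `HashingVectorizer` needs no learned vocabulary: each n-gram is hashed straight to a column index. This is only a sketch and not what the rest of this notebook does. It swaps the power-set indexing for a hash function, and the `n_features=2 ** 18` table size and the two example words are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless char n-gram counter: no vocabulary is learned, so no fit is needed.
hashing_char_tf = HashingVectorizer(
    analyzer='char',
    ngram_range=(1, 4),
    lowercase=False,
    n_features=2 ** 18,    # assumed hash table size, not taken from the notebook
    alternate_sign=False,  # keep plain non-negative counts
    norm=None,             # raw term frequencies, no normalization
)

# Example words wrapped in begin/end markers, as the word tokenizer does later on.
words = ["<easier>", "<#Jupyter>"]
word_features = hashing_char_tf.transform(words)
print(word_features.shape)  # (2, 262144)
```

The trade-off is that distinct n-grams can collide in the same column, which a fitted vocabulary avoids.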

In [293]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.random_projection import SparseRandomProjection
from sklearn.base import BaseEstimator, TransformerMixin

from collections import Counter

## Preparing dummy data for demonstration:

In [282]:
class SentenceTokenizer(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return [
            [r.strip() for r in some_paragraph_or_text_block.split(".")]
            for some_paragraph_or_text_block in X
        ]

test_str_tokenized = SentenceTokenizer().fit_transform([
    "It's easier to start a project by using #Jupyter #Notebooks, " +
    "but you must move quickly to grow #SoftwareArchitecture out of the notebook before it gets too big. " +
    "Keeping #CleanCode 's rules of thumbs in mind *does help*. " +
    "But prioritize #FirstPrinciples over rules of thumb. " +
    "https://twitter.com/guillaume_che/status/1075891355866550274"
])[0]

test_str_tokenized

["It's easier to start a project by using #Jupyter #Notebooks, but you must move quickly to grow #SoftwareArchitecture out of the notebook before it gets too big",
 "Keeping #CleanCode 's rules of thumbs in mind *does help*",
 'But prioritize #FirstPrinciples over rules of thumb',
 'https://twitter',
 'com/guillaume_che/status/1075891355866550274']
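
Note that splitting on every "." also cuts the URL in two, as the last two items above show. That is harmless for this demonstration, but if it matters for your data, one possible variation is to split only on a period followed by whitespace. This is an assumption for illustration, not what the notebook uses, and the helper name `split_sentences` is hypothetical.

```python
import re

def split_sentences(text):
    # Hypothetical helper: split on a period only when whitespace follows it,
    # so periods inside URLs or numbers stay intact.
    return [s.strip() for s in re.split(r"\.\s+", text) if s.strip()]

sentences = split_sentences(
    "Keeping #CleanCode 's rules of thumbs in mind *does help*. "
    "But prioritize #FirstPrinciples over rules of thumb. "
    "https://twitter.com/guillaume_che/status/1075891355866550274"
)
print(sentences)
```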

## Creating the classes of the SGNN preprocessing pipeline

In [283]:
class WordTokenizer(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        begin_of_word = "<"
        end_of_word = ">"
        out = [
            [
                begin_of_word + word + end_of_word
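                # Insert a space before "/" and "-" so URL parts and hyphenated parts become separate tokens.
                for word in sentence.replace("//", " /").replace("/", " /").replace("-", " -").replace("  ", " ").split(" ")
                if word  # skip empty strings left over by the split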
            ]
            for sentence in X
        ]
        return out


In [284]:
char_ngram_range = (1, 4)

char_term_frequency_params = {
    'char_term_frequency__analyzer': 'char',
    'char_term_frequency__lowercase': False,
    'char_term_frequency__ngram_range': char_ngram_range,
    'char_term_frequency__strip_accents': None,
    'char_term_frequency__min_df': 2,
    'char_term_frequency__max_df': 0.99,
    'char_term_frequency__max_features': int(1e7),
}

class CountVectorizer3D(CountVectorizer):

    def fit(self, X, y=None):
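        # Flatten the list of sentences (each a list of words) into one flat list of words.
        X_flattened_2D = sum(X.copy(), [])
        # Call the parent's fit_transform rather than fit: CountVectorizer.fit delegates to
        # self.fit_transform, which is overridden below and calls fit again.
        super(CountVectorizer3D, self).fit_transform(X_flattened_2D, y)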
        return self

    def transform(self, X):
        return [
            super(CountVectorizer3D, self).transform(x_2D)
            for x_2D in X
        ]
    
    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
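
To make the char n-gram step more concrete, the sketch below lists the n-grams that a plain `CountVectorizer` with the same `analyzer`, `lowercase` and `ngram_range` settings extracts from a single wrapped word. The word `<hi>` is just an illustrative input, and the document-frequency filters are left out because they only apply at fit time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same analyzer settings as char_term_frequency_params, without the df filters.
analyzer = CountVectorizer(
    analyzer='char', lowercase=False, ngram_range=(1, 4)
).build_analyzer()

ngrams = analyzer("<hi>")
print(ngrams)
# 4 unigrams + 3 bigrams + 2 trigrams + 1 four-gram = 10 char n-grams
print(len(ngrams))
```

Because the begin and end markers are part of the string, n-grams such as `<h` and `i>` encode whether a character sequence sits at the start or the end of a word.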


In [285]:
import scipy.sparse as sp

T = 80
d = 14

hashing_feature_union_params = {
    # T=80 projections, each of dimension d=14: 80 * 14 = 1120-dimensional word projections.
    **{'union__sparse_random_projection_hasher_{}__n_components'.format(t): d
       for t in range(T)
    },
    **{'union__sparse_random_projection_hasher_{}__dense_output'.format(t): False  # keep the output sparse; densify only AFTER hashing.
       for t in range(T)
    }
}

class FeatureUnion3D(FeatureUnion):
    
    def fit(self, X, y=None):
        X_flattened_2D = sp.vstack(X, format='csr')
        super(FeatureUnion3D, self).fit(X_flattened_2D, y)
        return self
    
    def transform(self, X): 
        return [
            super(FeatureUnion3D, self).transform(x_2D)
            for x_2D in X
        ]
    
    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
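
As a quick sanity check of the arithmetic, here is a sketch that unions T=80 sparse random projections of d=14 components each on a small random count matrix and confirms the 1120-column output. The toy matrix (5 rows, 200 features), the per-hasher `random_state` values, and the use of a plain `FeatureUnion` instead of `FeatureUnion3D` are simplifications for illustration.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.pipeline import FeatureUnion
from sklearn.random_projection import SparseRandomProjection

T, d = 80, 14

# Toy stand-in for one sentence: 5 "words" with counts over 200 char n-grams.
rng = np.random.RandomState(0)
toy_counts = sp.csr_matrix(rng.poisson(0.3, size=(5, 200)))

union = FeatureUnion([
    ('sparse_random_projection_hasher_{}'.format(t),
     SparseRandomProjection(n_components=d, dense_output=False, random_state=t))
    for t in range(T)
])
projected = union.fit_transform(toy_counts)
print(projected.shape)  # (5, 1120)
```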


## Fitting the pipeline 

Note: at fit time, the only things done are discarding some unused char n-grams and instantiating the random hash. The whole pipeline could be independent of the data, but because we discard n-grams here, we need to "fit" the data. Fitting could therefore be avoided altogether; we fit here only because it keeps the scikit-learn implementation simple.
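
The claim that the random hash itself does not depend on the data can be checked directly. In the sketch below, two `SparseRandomProjection` instances with the same `random_state` are fitted on different toy matrices that share the same number of columns, and they end up with identical projection matrices. The fixed `random_state=42` and the toy shapes are assumptions for this demonstration; the pipeline in this notebook leaves `random_state=None`.

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

rng = np.random.RandomState(0)
data_a = rng.poisson(0.3, size=(10, 134)).astype(float)
data_b = rng.poisson(2.0, size=(25, 134)).astype(float)  # different values and row count

hasher_a = SparseRandomProjection(n_components=14, random_state=42).fit(data_a)
hasher_b = SparseRandomProjection(n_components=14, random_state=42).fit(data_b)

# Only the number of input features and the seed determine the matrix.
same_matrix = np.array_equal(
    hasher_a.components_.toarray(), hasher_b.components_.toarray()
)
print(same_matrix)
```

So "fitting" the hashers only records the input width; the dependence on the data comes from the n-gram vocabulary selected by the CountVectorizer.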

In [286]:
params = dict()
params.update(char_term_frequency_params)
params.update(hashing_feature_union_params)

pipeline = Pipeline([
    ("word_tokenizer", WordTokenizer()),
    ("char_term_frequency", CountVectorizer3D()),
    ('union', FeatureUnion3D([
        ('sparse_random_projection_hasher_{}'.format(t), SparseRandomProjection())
        for t in range(T)
    ]))
])
pipeline.set_params(**params)

Pipeline(memory=None,
     steps=[('word_tokenizer', WordTokenizer()), ('char_term_frequency', CountVectorizer3D(analyzer='char', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=False, max_df=0.99, max_features=10000000, min_df=2,
         ngram_...to', eps=0.1,
            n_components=14, random_state=None))],
        transformer_weights=None))])

In [287]:
result = pipeline.fit_transform(test_str_tokenized)

result

[<27x1120 sparse matrix of type '<class 'numpy.float64'>'
 	with 14003 stored elements in Compressed Sparse Row format>,
 <10x1120 sparse matrix of type '<class 'numpy.float64'>'
 	with 5269 stored elements in Compressed Sparse Row format>,
 <7x1120 sparse matrix of type '<class 'numpy.float64'>'
 	with 4436 stored elements in Compressed Sparse Row format>,
 <2x1120 sparse matrix of type '<class 'numpy.float64'>'
 	with 1159 stored elements in Compressed Sparse Row format>,
 <4x1120 sparse matrix of type '<class 'numpy.float64'>'
 	with 1832 stored elements in Compressed Sparse Row format>]

## Let's see some statistics of the output. 

In [307]:
print(result[0].toarray().shape)
print(result[0].toarray()[0].tolist())
print("")

# The whole thing is quite discrete:
print(set(result[0].toarray()[0].tolist()))

# We could optimize by storing integers instead of floats, since only a few distinct values occur. Count the occurrences of each value:
Counter(result[0].toarray()[0].tolist())

(27, 1120)
[0.0, 0.0, 0.0, 0.0, -0.9093104492176721, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9093104492176721, 0.0, -0.9093104492176721, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9093104492176721, 0.0, 0.0, 0.0, -0.9093104492176721, 0.9093104492176721, 0.0, 0.9093104492176721, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9093104492176721, 0.9093104492176721, 0.0, 0.0, 0.0, 0.0, 0.0, -0.9093104492176721, 0.0, 0.0, -0.9093104492176721, 0.0, 0.9093104492176721, 0.0, 0.0, 0.9093104492176721, 0.0, 0.9093104492176721, 0.0, -0.9093104492176721, 0.0, -0.9093104492176721, 0.0, 0.0, 0.9093104492176721, 0.0, 0.9093104492176721, -0.9093104492176721, -1.8186208984353442, 0.0, 0.0, 0.0, 0.9093104492176721, -0.9093104492176721, 0.0, 0.0, -1.8186208984353442, 0.0, 0.0, 0.9093104492176721, 0.0, 0.0, 0.0, 0.9093104492176721, 0.0, -0.9093104492176721, 0.0, 0.9093104492176721, 0.0, 0.0, 0.0, 0.9093104492176721, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.9093104492176721, -0.9093104492176721, -0.9093104492176721, 0.9093104492176721

Counter({0.0: 716,
         -0.9093104492176721: 178,
         0.9093104492176721: 177,
         -1.8186208984353442: 25,
         1.8186208984353442: 22,
         -2.7279313476530165: 2})
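
The values above are all integer multiples of a single constant. According to scikit-learn's documentation, the nonzero entries of a `SparseRandomProjection` matrix are ±sqrt(s)/sqrt(n_components) with s = 1/density, and `density='auto'` sets the density to 1/sqrt(n_features), so projecting integer counts can only produce integer multiples of that scale. The sketch below checks this on toy data and inverts the formula for the value printed above. The toy matrix and its seed are assumptions, and the inferred vocabulary size is an estimate derived from the printed output rather than read from the fitted pipeline.

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

n_components = 14
rng = np.random.RandomState(0)
toy_counts = rng.poisson(0.3, size=(27, 134)).astype(float)  # integer-valued counts

hasher = SparseRandomProjection(n_components=n_components, random_state=0)
projected = hasher.fit_transform(toy_counts)

# Nonzero matrix entries are +/- sqrt(1 / density) / sqrt(n_components).
scale = np.sqrt(1.0 / hasher.density_) / np.sqrt(n_components)
as_integers = projected / scale
print(np.allclose(as_integers, np.round(as_integers)))

# Invert the formula for the value printed in this notebook's output.
observed = 0.9093104492176721
estimated_n_features = round((observed * np.sqrt(n_components)) ** 4)
print(scale, estimated_n_features)
```

Dividing the projection by this scale and casting to a small integer type would be one way to store the features more compactly.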

## Next up

We have now created the sentence preprocessing pipeline and the sparse projection (random hashing) function. The next step is to add a few feedforward layers on top of it.
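
As a starting point, here is a minimal NumPy sketch of what such layers could look like at inference time: two dense layers with a ReLU applied to each word's 1120-dimensional projection, followed by mean pooling over the words of a sentence. Everything in it is an assumption for illustration: the layer sizes, the pooling, the random untrained weights, and the random stand-in input. In practice you would build and train these layers in a deep learning framework.

```python
import numpy as np

rng = np.random.RandomState(0)
projection_dim, hidden_dim, output_dim = 1120, 256, 2  # hidden/output sizes are assumed

# Random stand-in for one projected sentence: 7 words x 1120 features.
sentence_projection = rng.choice([-1.0, 0.0, 1.0], size=(7, projection_dim))

# Untrained, randomly initialized weights (hypothetical; training is not shown).
W1 = rng.normal(scale=0.01, size=(projection_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.01, size=(hidden_dim, output_dim))
b2 = np.zeros(output_dim)

hidden = np.maximum(0.0, sentence_projection @ W1 + b1)  # dense layer + ReLU
word_logits = hidden @ W2 + b2                            # second dense layer
sentence_logits = word_logits.mean(axis=0)                # mean-pool over words
print(sentence_logits.shape)  # (2,)
```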