## Basic Feature Engineering

Analysis by Jeremy Mann

These notes build a preliminary pipeline for extracting statistical features from the wrangled images.

More specifically, there are four features: the first four cumulants of the pixel values. These are estimated with the unbiased estimators (k-statistics) computed by `scipy.stats.kstat`:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstat.html
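
As a small illustration of what `kstat` returns, the sketch below computes the first four k-statistics of a toy sample. The values in `v` are made up for illustration and are not taken from the image data. It also checks that the first two k-statistics coincide with the sample mean and the unbiased sample variance.

```python
import numpy as np
from scipy.stats import kstat

# Toy sample of "pixel values" (hypothetical data, for illustration only).
v = np.array([0.0, 1.0, 1.0, 2.0, 5.0, 7.0])

# k-statistics of order 1 through 4 (kstat supports n = 1, 2, 3, 4).
k1, k2, k3, k4 = (kstat(v, n=k) for k in range(1, 5))

# The first k-statistic is the sample mean; the second is the
# unbiased sample variance (denominator N - 1).
assert np.isclose(k1, v.mean())
assert np.isclose(k2, v.var(ddof=1))
```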

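Note that these features depend only on the distribution of pixel values, so they essentially ignore the geometric aspects (the spatial arrangement of pixels) of the images.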
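
One way to see this is to shuffle the pixels of an image: the features should not change. The following is a minimal sketch on a random 8x8 array standing in for an image; `first_four_kstats` is a hypothetical helper introduced only for this example.

```python
import numpy as np
from scipy.stats import kstat

rng = np.random.default_rng(0)

# A hypothetical 8x8 "image" and a copy with its pixels shuffled.
image = rng.random((8, 8))
shuffled = rng.permutation(image.ravel()).reshape(8, 8)

def first_four_kstats(img):
    # Flatten the image and estimate cumulants 1 through 4.
    flat = img.ravel()
    return np.array([kstat(flat, n=k) for k in range(1, 5)])

# Same pixel values, different spatial layout: the features agree
# up to floating-point rounding.
assert np.allclose(first_four_kstats(image), first_four_kstats(shuffled))
```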


In [2]:
import xarray as xr
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline 
from scipy.stats import kstat

In [3]:
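def load_netcdf(filepath):
    '''
    loads a NetCDF DataArray and flattens each image into a
    1-d vector of pixel values, giving shape (n_images, n_pixels)
    '''
    X = xr.open_dataarray(filepath).values
    X = X.reshape(X.shape[0], -1)
    return X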

In [32]:
filepath = '../../data/clean_data/train_data/X_64_L_clean_train.nc'
X = load_netcdf(filepath)
X.shape

In [36]:
class cumulants_extractor(BaseEstimator, TransformerMixin):
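    '''
    returns a numpy array of shape (n_samples, highest_cumulant) whose
    k-th column is the k-statistic (unbiased estimate of the k-th
    cumulant) of each row, for k = 1, ..., highest_cumulant
    (which must be at most 4, the highest order kstat supports)
    '''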
    def __init__(self, highest_cumulant):
        self.highest_cumulant = highest_cumulant
    def fit(self, X, y = None):
        return self
    
    def get_cumulants(self, v):
        kstats = np.array([kstat(data = v, n = k) 
                          for k in range(1, self.highest_cumulant + 1)])
        return kstats
        
    def transform(self, X):
        cumulants = np.apply_along_axis(func1d = self.get_cumulants,
                                       axis = 1, 
                                       arr = X,
                                       )
        return cumulants

In [39]:
c_extractor = cumulants_extractor(highest_cumulant = 4)
features = c_extractor.transform(X)
features.shape

(3750, 4)
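
The `Pipeline` import above is not used in this section. As a rough sketch of how the extraction step might be chained with later steps, the block below wraps a row-wise k-statistic function in `FunctionTransformer` and follows it with `StandardScaler`. The scaler, the helper `row_kstats`, and the synthetic input are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
from scipy.stats import kstat
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def row_kstats(X, highest_cumulant=4):
    # One row of k-statistics (orders 1..highest_cumulant) per input row.
    return np.array([[kstat(row, n=k) for k in range(1, highest_cumulant + 1)]
                     for row in X])

# Hypothetical pipeline: cumulant features followed by standardization.
pipeline = Pipeline([
    ('cumulants', FunctionTransformer(row_kstats)),
    ('scale', StandardScaler()),
])

# Synthetic stand-in for flattened images: 20 rows of 50 pixel values.
rng = np.random.default_rng(0)
X_demo = rng.random((20, 50))

scaled = pipeline.fit_transform(X_demo)
assert scaled.shape == (20, 4)
```

Standardizing is one plausible follow-up because the four cumulants live on very different scales; whether it helps depends on the downstream model.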

## Quick Test

In [22]:
def cumulants_normal_test(extractor):
    '''
    draws two samples of size 10**4 from the standard normal
    distribution, whose first four cumulants are 0, 1, 0, 0,
    and returns the extractor's estimates for comparison
    '''
    X = np.random.normal(0, 1, (2, 10**4))
    return extractor.transform(X)

In [23]:
cumulants_normal_test(c_extractor)

array([[ 0.00281287,  1.00377942,  0.01788164, -0.09940695],
       [ 0.00975741,  1.00658412,  0.01221571,  0.14060718]])
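
The eyeball check above could be turned into a repeatable assertion. The sketch below is one possible version: it fixes a seed, uses a larger sample, and compares against the expected cumulants with a loose absolute tolerance. The seed, sample size, and tolerance are arbitrary choices for illustration, and the k-statistics are computed directly with `kstat` so that the block is self-contained.

```python
import numpy as np
from scipy.stats import kstat

# Seeded generator so the check is reproducible.
rng = np.random.default_rng(42)
X_normal = rng.normal(0, 1, (2, 10**5))

# Row-wise k-statistics of order 1 through 4.
estimates = np.array([[kstat(row, n=k) for k in range(1, 5)]
                      for row in X_normal])

# Cumulants of the standard normal distribution.
expected = np.array([0.0, 1.0, 0.0, 0.0])

# Loose tolerance: the higher-order estimates are noticeably noisier.
assert np.allclose(estimates, expected, atol=0.1)
```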