In [None]:
# General Image Encoder

## Summary

The idea is that non-image problems can be turned into image problems: we encode features as pixel channels, then train an image classifier or regressor on the generated images.
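As a rough illustration of the encoding idea, the sketch below reshapes one example's flat feature vector into a small multi-channel image. The function name `features_to_image` and the square layout with side $2^n$ are assumptions made for this sketch (chosen to match the $c\cdot4^n$ condition in Part 2); the project may arrange pixels differently.

```python
import numpy as np

def features_to_image(row, n, c):
    """Reshape a flat vector of c * 4**n features into a (2**n, 2**n, c) array.

    Hypothetical helper: the square layout is an assumption, not the project's API.
    """
    side = 2 ** n
    expected = side * side * c
    row = np.asarray(row, dtype=float)
    if row.size != expected:
        raise ValueError(f"expected {expected} features, got {row.size}")
    # Consecutive groups of c features become the channels of one pixel.
    return row.reshape(side, side, c)

row = np.arange(48, dtype=float)          # 48 = 3 * 4**2 features for one example
image = features_to_image(row, n=2, c=3)
print(image.shape)                        # (4, 4, 3)
```

With this layout, which features end up as neighbouring pixels depends entirely on the order of the vector, which is why Part 2 spends effort on grouping similar features.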

## Part 0: Getting Data (handled by user)

### Summary

Get tabular data; user should ensure that all data is clean.

## Part 1: Feature Engineering

In this part, we have some tabular data (say, in a dataframe). Since this step will be automated, we brute-force a large number of features that hopefully describe the data better than the raw input. We do this using the `FeatureSynthesis` class in this project.
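To give a feel for what brute-force feature creation can look like, here is a minimal sketch that adds pairwise sums and products of the numeric columns. It is not the `FeatureSynthesis` class itself; the function `synthesize_pairwise` and the choice of operations are assumptions for illustration only.

```python
from itertools import combinations

import pandas as pd

def synthesize_pairwise(df):
    """Return the numeric columns of df plus their pairwise sums and products.

    Hypothetical stand-in for automated feature synthesis.
    """
    numeric = df.select_dtypes(include="number")
    out = numeric.copy()
    for a, b in combinations(numeric.columns, 2):
        out[f"{a}+{b}"] = numeric[a] + numeric[b]
        out[f"{a}*{b}"] = numeric[a] * numeric[b]
    return out

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0], "z": [5.0, 6.0]})
features = synthesize_pairwise(df)
print(features.shape)  # (2, 9): 3 original columns + 3 pairs * 2 operations
```

The number of generated columns grows quadratically with the number of inputs, which is one way to reach the feature counts that Part 2 requires.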

## Part 2: Clustering Features and Assigning to Channels

### Steps
1. Ensure that the number of features, $m$, is divisible by $c\cdot4^n$ for some integer $n$, where $c$ is the number of channels per pixel. If not, go back to the __Feature Engineering__ step and create more features or remove features in order to satisfy this condition.
2. Let the data table with the features (in our case, a ```pandas.DataFrame```) be called ```features```. We convert this to a numpy array, called $X$, in which each column holds the values of one feature and each row represents one entry/example. In traditional cluster analysis, examples are clustered together by minimizing a distance metric computed from the differences between their feature values. In this case, however, we want to group features by their example values. Put another way, we want to group the synthetic features that are closest together in value, and we measure their similarity by comparing their values across the examples. Hence, in order to cluster the features, we perform cluster analysis on $X^T$. When $X$ contains a large number of examples, $X^T$ has a large number of columns, so cluster analysis on $X^T$ can fall victim to the curse of dimensionality. Dimensionality reduction techniques may then help reduce the number of columns (i.e. the number of example dimensions used to compare features). One option is to apply PCA to $X^T$ and keep only the leading principal components in place of the original columns. Keep in mind that each feature is represented as a row of $X^T$, so dimensionality reduction on $X^T$ will not decrease the number of generated features.
3. Run a function inside the ImageCreation class to create a transformation pipeline. This pipeline will have the following functionality:
    * Fit
        - Uses ```populate_image_with_feature_names``` on input synthetic features.
        - Flattens the feature name array
        - (IF fit_and_transform): Re-shuffles the input synthetic features
    * Transform
        - Takes in a dataframe of only original features, nothing more and nothing less; anything more will be dropped, and if required columns are missing, an exception will be raised.
        - Iteratively looks at the lowest-level features in feature_names_image, performs and stores the transformations, then goes on to the next higher level of features until there are no more features left. Stores all features in a dataframe ```df```.
        - Rearranges the columns of ```df``` to fit the order of the columns in the image, and returns the DataFrame.
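Steps 1 and 2 above can be sketched with plain numpy. The first helper reads the divisibility condition literally and reports the largest usable $n$; the second applies PCA (via an SVD) to $X^T$ so that each feature is described by $k$ components instead of one value per example. Both function names, and the choice of SVD-based PCA, are assumptions for this sketch rather than the project's implementation.

```python
import numpy as np

def largest_n(m, c):
    """Largest n >= 0 such that c * 4**n divides m, or None if c does not divide m."""
    if m <= 0 or c <= 0 or m % c != 0:
        return None
    n = 0
    while m % (c * 4 ** (n + 1)) == 0:
        n += 1
    return n

def reduce_examples(X, k):
    """PCA on X^T: keep k components per feature; the number of rows (features) is unchanged."""
    Xt = np.asarray(X, dtype=float).T            # shape (m features, n examples)
    centered = Xt - Xt.mean(axis=0)              # center each column of X^T
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :k] * S[:k]                      # shape (m features, k components)

rng = np.random.default_rng(0)
X = rng.random((100, 48))                        # 100 examples, m = 48 features
n = largest_n(X.shape[1], c=3)
reduced = reduce_examples(X, k=5)
print(n, reduced.shape)                          # 2 (48, 5)
```

The reduced array still has one row per feature, so it can be passed to a clustering routine in place of $X^T$.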
    

### Idea for feature selection/dimensionality reduction (in general):
- Transpose of regular feature array ($X$ -> $X^T$)
- K clusters
- See what examples of $X^T$ are in what groups (i.e. what columns of $X$ are most similar)
- Perform some sort of averaging for each group
- Proceed with only K features
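A minimal sketch of this idea follows, assuming a plain k-means (Lloyd's algorithm) on the rows of $X^T$ and a simple mean as the "averaging" step. The function `cluster_and_average` and its parameters are hypothetical; a library clusterer could be swapped in.

```python
import numpy as np

def cluster_and_average(X, k, n_iter=50, seed=0):
    """Cluster the columns of X (rows of X^T) into k groups and average each group.

    Returns (labels, reduced): labels[i] is the group of feature i, and
    reduced has shape (n_examples, k). Hypothetical helper for illustration.
    """
    Xt = np.asarray(X, dtype=float).T                      # (m features, n examples)
    rng = np.random.default_rng(seed)
    centers = Xt[rng.choice(Xt.shape[0], size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every feature to every center.
        dists = ((Xt[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # New center = mean of member features; empty groups keep their old center.
        new_centers = np.array([
            Xt[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    return labels, centers.T

base = np.linspace(0.0, 1.0, 20)
# Six features: three near `base`, three near `base + 10`.
X = np.column_stack([base, base + 0.01, base + 0.02,
                     base + 10.0, base + 10.01, base + 10.02])
labels, reduced = cluster_and_average(X, k=2)
print(labels, reduced.shape)
```

Here the six input features collapse to two averaged features, one per group, while the number of examples (rows) stays the same.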

## Part 3: Wrapping into an API

## Part 4: Make Predictions! (handled by user)

In [11]:
import random as rd

import numpy as np  # needed for np.random in the next cell
import pandas as pd

In [12]:
n_rows = 100

categories_1 = ('a', 'b', 'c', 'd', 'e')
categories_2 = ('InstaMed', 'is', 'a', 'cool', 'company', 'check', 'it', 'out', 'sometime')
rand_cats_1 = [rd.choice(categories_1) for _ in range(n_rows)]
rand_cats_2 = [rd.choice(categories_2) for _ in range(n_rows)]

example_df = pd.DataFrame({
    'example_numerical_col_1': np.random.rand(n_rows) * 50,
    'example_numerical_col_2': np.random.rand(n_rows) * 20,
    'example_categorical_col_1': rand_cats_1,
    'example_categorical_col_2': rand_cats_2,
    'example_boolean_col_1': np.random.randint(low=0, high=2, size=n_rows),
    'example_boolean_col_2': np.random.randint(low=0, high=2, size=n_rows)
})

In [58]:
te = example_df.transpose(); te

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
example_numerical_col_1,34.6687,2.99665,1.86134,2.05842,45.623,34.6985,10.6805,27.5274,3.30944,9.30971,...,39.9348,44.077,26.8351,22.82,29.0365,43.8795,41.7606,2.50871,38.018,0.42209
example_numerical_col_2,1.60882,3.21653,2.34413,19.4251,13.6199,6.92757,17.5574,1.4964,18.9592,15.7999,...,11.0552,12.3288,16.1861,15.5355,6.93201,10.4071,3.59924,15.3227,0.543366,12.5284
example_categorical_col_1,d,b,d,c,e,a,a,a,a,b,...,d,a,a,a,d,a,c,e,d,d
example_categorical_col_2,check,check,is,check,out,is,company,is,InstaMed,cool,...,sometime,it,check,check,is,it,cool,out,a,is
example_boolean_col_1,1,1,0,0,1,0,0,1,0,1,...,1,0,0,1,1,0,0,0,1,1
example_boolean_col_2,0,0,1,0,0,1,0,0,0,0,...,0,1,0,1,0,0,1,1,0,0
