In [None]:
#| include: false
import pandas as pd
from nbdev.showdoc import *

This example notebook covers ways to generate synthetic data using `numerblox` components. Synthetic data can improve model performance by providing more data to train on. We cover ways to generate both synthetic target variables and synthetic features.

## 0. Download and load

In [None]:
from numerblox.download import NumeraiClassicDownloader
from numerblox.numerframe import create_numerframe, NumerFrame

In [None]:
dl = NumeraiClassicDownloader(directory_path="synth_test")
dl.download_training_data(version="4.1")

2023-01-04 16:49:42,423 INFO numerapi.utils: starting download
synth_test/train.parquet: 1.45GB [00:52, 27.3MB/s]


2023-01-04 16:50:35,987 INFO numerapi.utils: starting download
synth_test/validation.parquet: 1.50GB [00:50, 29.6MB/s]                            


In [None]:
dataf = create_numerframe("synth_test/train.parquet")

In [None]:
dataf.head(2)

id,era,data_type,feature_honoured_observational_balaamite,feature_polaroid_vadose_quinze,feature_untidy_withdrawn_bargeman,feature_genuine_kyphotic_trehala,feature_unenthralled_sportful_schoolhouse,feature_divulsive_explanatory_ideologue,feature_ichthyotic_roofed_yeshiva,feature_waggly_outlandish_carbonisation,...,target_paul_v4_20,target_paul_v4_60,target_george_v4_20,target_george_v4_60,target_william_v4_20,target_william_v4_60,target_arthur_v4_20,target_arthur_v4_60,target_thomas_v4_20,target_thomas_v4_60
n003bba8a98662e4,1,train,1.0,0.5,1.0,1.0,0.0,0.0,1.0,1.0,...,0.5,0.25,0.25,0.0,0.333333,0.0,0.5,0.5,0.166667,0.0
n003bee128c2fcfc,1,train,0.5,1.0,0.25,0.75,0.0,0.75,0.5,0.75,...,0.75,1.0,1.0,1.0,0.666667,0.666667,0.833333,0.666667,0.833333,0.666667


## 1. Synthetic target (Bayesian GMM)

First, we tackle the problem of creating a synthetic target column to improve model performance. `BayesianGMMTargetProcessor` generates a new target variable based on a given target. The preprocessor samples the target from a [Bayesian Gaussian Mixture model](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html), which is fitted on coefficients from a [regularized linear model (Ridge regression)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html).

This implementation is based on a [GitHub Gist by Michael Oliver (mdo)](https://gist.github.com/the-moliver/dcdd2862dc2c78dda600f1b449071c93).
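
To make the idea concrete, here is a minimal sketch of the general approach using only scikit-learn: fit a Ridge regression per era, fit a Bayesian Gaussian Mixture on the resulting coefficient vectors, sample new coefficients, and use them to build a synthetic target. This is not the `numerblox` implementation or the Gist itself. The toy data, feature names, `alpha`, and the number of components are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Hypothetical toy data: 20 eras x 50 rows, 3 uniform features, noisy linear target.
n_eras, rows_per_era = 20, 50
features = ["feature_a", "feature_b", "feature_c"]
toy = pd.DataFrame(rng.uniform(size=(n_eras * rows_per_era, 3)), columns=features)
toy["era"] = np.repeat(np.arange(n_eras), rows_per_era)
toy["target"] = (
    toy[features].to_numpy() @ np.array([0.5, -0.2, 0.1])
    + rng.normal(scale=0.1, size=len(toy))
)

# Step 1: one Ridge fit per era; each era contributes one coefficient vector.
coefs = np.vstack([
    Ridge(alpha=1.0).fit(era_df[features], era_df["target"]).coef_
    for _, era_df in toy.groupby("era")
])

# Step 2: model the distribution of coefficient vectors with a Bayesian GMM.
bgmm = BayesianGaussianMixture(n_components=3, random_state=0).fit(coefs)

# Step 3: draw one coefficient vector per era. The draws are shuffled because
# `sample` returns them grouped by mixture component.
sampled, _ = bgmm.sample(n_eras)
sampled = rng.permutation(sampled)

# Step 4: synthetic target = era features times that era's sampled coefficients.
toy["target_fake"] = np.nan
for i, (_, era_df) in enumerate(toy.groupby("era")):
    toy.loc[era_df.index, "target_fake"] = era_df[features].to_numpy() @ sampled[i]

print(toy[["era", "target", "target_fake"]].head())
```

The sketch leaves the synthetic target as a raw linear score. Any further post-processing, such as ranking or binning the values, is left out here.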

In [None]:
from numerblox.preprocessing import BayesianGMMTargetProcessor


In [None]:
show_doc(BayesianGMMTargetProcessor)

---

[source](https://github.com/crowdcent/numerblox/tree/master/blob/master/numerblox/preprocessing.py#LNone){target="_blank" style="float:right; font-size:smaller"}

### BayesianGMMTargetProcessor

>      BayesianGMMTargetProcessor (target_col:str='target',
>                                  feature_names:list=None, n_components:int=6)

Generate synthetic (fake) target using a Bayesian Gaussian Mixture model. 

Based on Michael Oliver's GitHub Gist implementation: 

https://gist.github.com/the-moliver/dcdd2862dc2c78dda600f1b449071c93

:param target_col: Column from which to create fake target. 

:param feature_names: Selection of features used for Bayesian GMM. All features by default.
:param n_components: Number of components for fitting Bayesian Gaussian Mixture Model.

In [None]:
dataf.head()

id,era,data_type,feature_honoured_observational_balaamite,feature_polaroid_vadose_quinze,feature_untidy_withdrawn_bargeman,feature_genuine_kyphotic_trehala,feature_unenthralled_sportful_schoolhouse,feature_divulsive_explanatory_ideologue,feature_ichthyotic_roofed_yeshiva,feature_waggly_outlandish_carbonisation,...,target_paul_v4_20,target_paul_v4_60,target_george_v4_20,target_george_v4_60,target_william_v4_20,target_william_v4_60,target_arthur_v4_20,target_arthur_v4_60,target_thomas_v4_20,target_thomas_v4_60
n003bba8a98662e4,1,train,1.0,0.5,1.0,1.0,0.0,0.0,1.0,1.0,...,0.5,0.25,0.25,0.0,0.333333,0.0,0.5,0.5,0.166667,0.0
n003bee128c2fcfc,1,train,0.5,1.0,0.25,0.75,0.0,0.75,0.5,0.75,...,0.75,1.0,1.0,1.0,0.666667,0.666667,0.833333,0.666667,0.833333,0.666667
n0048ac83aff7194,1,train,0.5,0.25,0.75,0.0,0.75,0.0,0.75,0.75,...,0.5,0.25,0.25,0.25,0.5,0.333333,0.333333,0.5,0.5,0.333333
n00691bec80d3e02,1,train,1.0,0.5,0.5,0.75,0.0,1.0,0.25,1.0,...,0.5,0.5,0.5,0.5,0.666667,0.5,0.5,0.5,0.666667,0.5
n00b8720a2fdc4f2,1,train,1.0,0.75,1.0,1.0,0.0,0.0,1.0,0.5,...,0.5,0.5,0.5,0.5,0.666667,0.5,0.666667,0.5,0.666667,0.5


In [None]:
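bgmm = BayesianGMMTargetProcessor(target_col="target")
# Keep a small subset of columns so the example runs quickly.
test_columns = ["era", "data_type",
                "feature_polaroid_vadose_quinze", "feature_genuine_kyphotic_trehala",
                "target"]
# Sample 100 rows and fill missing values with 0.5.
sample_dataf = NumerFrame(dataf[test_columns].sample(100).fillna(0.5))
# Calling the processor adds a `target_fake` column.
fake_dataf = bgmm(sample_dataf)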

Generating fake target:   0%|          | 0/93 [00:00<?, ?it/s]

In [None]:
sample_dataf.head()

id,era,data_type,feature_polaroid_vadose_quinze,feature_genuine_kyphotic_trehala,target,target_fake
nc8e9c7b9fc085a4,408,train,1.0,1.0,0.5,0.5
nd4f0261b44c914e,37,train,0.5,1.0,0.5,0.5
nd0f06c2be2d501c,357,train,1.0,0.75,0.5,0.5
nd0ae8c0e299e660,454,train,0.5,0.0,0.25,0.5
n5eac28a1bcce5d9,481,train,0.25,1.0,0.5,0.5


The new target is suffixed with `_fake` to distinguish it from the original targets.

In [None]:
fake_dataf.get_target_data.head(2)

id,target,target_fake
nc8e9c7b9fc085a4,0.5,0.5
nd4f0261b44c914e,0.5,0.5
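
Because synthetic targets share the `_fake` suffix, they can be selected by name and compared with the original target. The sketch below uses plain pandas on a hypothetical toy frame (the data and the noisy `target_fake` column are made up for illustration) and computes a per-era rank correlation between the two columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy frame: 3 eras x 50 rows with a five-level target.
toy = pd.DataFrame({
    "era": np.repeat([1, 2, 3], 50),
    "target": rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=150),
})
# Stand-in for a generated column: the original target plus noise.
toy["target_fake"] = (toy["target"] + rng.normal(scale=0.3, size=150)).clip(0, 1)

# Select synthetic targets by their suffix.
fake_cols = [col for col in toy.columns if col.endswith("_fake")]

def era_rank_corr(era_df: pd.DataFrame, col: str) -> float:
    """Pearson correlation of percentile ranks, i.e. a rank correlation."""
    return era_df["target"].rank(pct=True).corr(era_df[col].rank(pct=True))

# One correlation value per era for the first synthetic target.
per_era = pd.Series({
    era: era_rank_corr(era_df, fake_cols[0])
    for era, era_df in toy.groupby("era")
})
print(per_era)
```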


Note that you can easily generate multiple fake targets in a loop.

In [None]:
for target_col in sample_dataf.target_cols:
    bgmm = BayesianGMMTargetProcessor(target_col=target_col)
    sample_dataf = bgmm(sample_dataf)
sample_dataf.get_target_data.head(2)

Generating fake target:   0%|          | 0/93 [00:00<?, ?it/s]

id,target,target_fake
nc8e9c7b9fc085a4,0.5,0.5
nd4f0261b44c914e,0.5,0.5


## 2. UMAPFeatureGenerator

UMAP is a dimensionality reduction technique that can be used to generate synthetic features. In other words, we create new representations of the existing features and add them to our dataset.

Here we run UMAP on a small test dataset and inspect the generated features.

In [None]:
from numerblox.preprocessing import UMAPFeatureGenerator

`n_components` denotes the number of additional features we are generating.

In [None]:
n_components = 3
umap_gen = UMAPFeatureGenerator(n_components=n_components, n_neighbors=9)

In [None]:
test_data = create_numerframe("../test_assets/mini_numerai_version_2_data.parquet")

In [None]:
test_data = umap_gen(test_data)

The new features follow the naming convention `f"feature_umap_{i}"`. All new components are scaled between 0 and 1.
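
The reduce, scale, and append pattern can be sketched with scikit-learn alone. Note that this sketch swaps in PCA as a stand-in reducer so that it runs without `umap-learn`. It does not reproduce UMAP's embedding or the internals of `UMAPFeatureGenerator`. The toy data and the `feature_reduced_{i}` column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Hypothetical toy frame with 5 features and 200 rows.
feature_cols = [f"feature_{i}" for i in range(5)]
toy = pd.DataFrame(rng.uniform(size=(200, 5)), columns=feature_cols)

n_components = 3

# Reduce the feature matrix to `n_components` dimensions.
# PCA is used here as a stand-in for UMAP.
embedding = PCA(n_components=n_components, random_state=0).fit_transform(toy[feature_cols])

# Scale every new component to the [0, 1] range.
scaled = MinMaxScaler().fit_transform(embedding)

# Append the components as new feature columns.
new_cols = [f"feature_reduced_{i}" for i in range(n_components)]
for i, col in enumerate(new_cols):
    toy[col] = scaled[:, i]

print(toy[new_cols].describe().loc[["min", "max"]])
```

Min-max scaling puts the new columns in the same 0 to 1 range as the original features, which matches the minimum of 0.0 visible in the UMAP output.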

In [None]:
umap_features = [f"feature_umap_{i}" for i in range(n_components)]
test_data[umap_features].head(3)

id,feature_umap_0,feature_umap_1,feature_umap_2
n559bd06a8861222,0.97219,0.257233,0.272728
n9d39dea58c9e3cf,0.804406,0.667506,0.0
nb64f06d3a9fc9f1,0.279315,0.0,0.731333


Contrast this with the deep dream results.

After you're done, all downloaded files can be removed with `.remove_base_directory()`.

In [None]:
# Clean up environment
dl.remove_base_directory()