<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

In [None]:
#| include: false
from nbdev.showdoc import *

## 0. Base

These objects will provide a base for all pre- and post-processing functionality and log relevant information.

## 0.1. BaseProcessor

[`BaseProcessor`](https://crowdcent.github.io/numerblox/preprocessing.html#baseprocessor) defines common functionality for `preprocessing` and `postprocessing` (Section 5).

Every Preprocessor should inherit from [`BaseProcessor`](https://crowdcent.github.io/numerblox/preprocessing.html#baseprocessor) and implement the `.transform` method.
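As a rough illustration of that contract, the sketch below uses a stand-in `BaseProcessorSketch` class and a made-up `ClipProcessor` subclass. Both names are hypothetical and are not part of numerblox. The sketch assumes that calling a processor forwards to `.transform`, which is how processors are invoked elsewhere in this section (for example `umap_gen(dataf)`).

```python
from abc import ABC, abstractmethod

import pandas as pd


class BaseProcessorSketch(ABC):
    """Stand-in for a processor base class (hypothetical, for illustration only)."""

    @abstractmethod
    def transform(self, dataf: pd.DataFrame) -> pd.DataFrame:
        ...

    def __call__(self, dataf: pd.DataFrame) -> pd.DataFrame:
        # Calling the processor forwards to .transform.
        return self.transform(dataf)


class ClipProcessor(BaseProcessorSketch):
    """Hypothetical processor that clips all feature columns to [0, 1]."""

    def transform(self, dataf: pd.DataFrame) -> pd.DataFrame:
        # Work on a copy so the input DataFrame is left untouched.
        dataf = dataf.copy()
        for col in dataf.columns:
            if col.startswith("feature"):
                dataf[col] = dataf[col].clip(lower=0.0, upper=1.0)
        return dataf


example = pd.DataFrame({"feature_a": [-0.5, 0.25, 1.5], "target": [0.0, 0.5, 1.0]})
clipped = ClipProcessor()(example)
```

Only `transform` has to be written per processor; everything shared lives in the base class.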

In [1]:
#| echo: false
#| output: asis
show_doc(BaseProcessor)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/preprocessing.py#L32){target="_blank" style="float:right; font-size:smaller"}

### BaseProcessor

>      BaseProcessor ()

Common functionality for preprocessors and postprocessors.

## 0.2. Logging

We would like to keep an overview of which steps are done in a data pipeline and where processing bottlenecks occur.
The decorator below will display for a given function/method:
1. When it has finished.
2. What the output shape of the data is.
3. How long it took to finish.

To use this functionality, simply add `@display_processor_info` as a decorator to the function/method you want to track.

We will use this decorator throughout the pipeline (`preprocessing`, `model` and `postprocessing`).

Inspiration for this decorator: [Calmcode Pandas Pipe Logs](https://calmcode.io/pandas-pipe/logs.html)
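To make the idea concrete, here is a minimal sketch of a decorator that reports the three items listed above. It is not the implementation of `display_processor_info`; the name `log_step_info` and the plain `print` output are assumptions chosen for illustration.

```python
import time
from functools import wraps


def log_step_info(func):
    """Hypothetical decorator: report name, output shape and duration of a step."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # Objects without a .shape attribute (e.g. plain lists) are reported as None.
        shape = getattr(result, "shape", None)
        print(f"Finished '{func.__name__}'. Output shape: {shape}. Time taken: {elapsed:.3f}s")
        return result

    return wrapper


@log_step_info
def passthrough(rows):
    return rows


output = passthrough([[1, 2], [3, 4]])
```

Because the wrapper returns the original result unchanged, a decorator like this can be added to or removed from a pipeline step without affecting the data flowing through it.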

In [2]:
#| echo: false
#| output: asis
show_doc(display_processor_info)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/preprocessing.py#L50){target="_blank" style="float:right; font-size:smaller"}

### display_processor_info

>      display_processor_info (func)

Fancy console output for data processing.

In [None]:
#| echo: false
class TestDisplay:
    """
    Small test for logging.
    Output should mention 'TestDisplay',
    Return output shape of (10, 314) and
    time taken for step should be close to 2 seconds.
    """

    def __init__(self, dataf: NumerFrame):
        self.dataf = dataf

    @display_processor_info
    def test(self) -> NumerFrame:
        time.sleep(2)
        return self.dataf


dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
TestDisplay(dataf).test()

id,era,data_type,feature_dichasial_hammier_spawner,feature_rheumy_epistemic_prancer,feature_pert_performative_hormuz,feature_hillier_unpitied_theobromine,feature_perigean_bewitching_thruster,feature_renegade_undomestic_milord,feature_koranic_rude_corf,feature_demisable_expiring_millepede,...,target_paul_20,target_paul_60,target_george_20,target_george_60,target_william_20,target_william_60,target_arthur_20,target_arthur_60,target_thomas_20,target_thomas_60
n559bd06a8861222,297,train,0.25,0.75,0.25,0.75,0.25,0.5,1.0,0.25,...,0.0,0.5,0.25,0.5,0.0,0.5,0.166667,0.5,0.333333,0.5
n9d39dea58c9e3cf,3,train,0.75,0.5,0.75,1.0,0.5,0.25,0.5,0.0,...,0.5,0.75,0.5,0.5,0.666667,0.666667,0.5,0.666667,0.5,0.666667
nb64f06d3a9fc9f1,472,train,1.0,1.0,1.0,0.5,0.0,1.0,0.25,0.5,...,0.0,0.25,0.5,0.5,0.333333,0.333333,0.333333,0.333333,0.333333,0.333333
n1927b4862500882,265,train,0.0,0.0,0.25,0.0,1.0,0.0,0.0,0.0,...,0.75,0.75,0.5,0.75,0.833333,0.833333,0.666667,0.833333,0.666667,0.666667
nc3234b6eeacd6b7,299,train,0.75,0.25,0.0,0.75,1.0,0.25,0.0,0.0,...,0.25,0.5,0.5,0.5,0.166667,0.666667,0.333333,0.5,0.5,0.666667
n1b41d583e12f051,9,train,0.0,0.5,0.5,0.25,0.25,0.5,0.5,1.0,...,0.5,0.25,0.5,0.0,0.5,0.333333,0.5,0.333333,0.5,0.333333
n116898cdc07d4e2,13,train,0.5,1.0,1.0,0.75,0.0,1.0,0.5,0.75,...,0.5,0.75,0.5,0.5,0.5,0.666667,0.5,0.666667,0.5,0.666667
nb0a7aef640025dc,232,train,0.25,0.25,0.5,0.0,1.0,0.0,0.5,0.0,...,0.5,0.25,0.5,0.0,0.5,0.166667,0.5,0.0,0.666667,0.166667
n12466a161ab0a24,92,train,0.5,0.75,1.0,0.5,0.25,0.25,0.25,0.75,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
n40132f4765f9185,270,train,0.5,0.5,0.0,0.5,0.75,0.5,0.0,0.25,...,0.25,0.5,0.0,0.5,0.333333,0.333333,0.333333,0.333333,0.333333,0.5


## 1. Common preprocessing steps

This section implements commonly used preprocessing for Numerai. We invite the Numerai community to develop new preprocessors.

## 1.0 Tournament agnostic

Preprocessors that can be applied to both Numerai Classic and Numerai Signals.

### 1.0.1. CopyPreProcessor

The first and most obvious preprocessor copies the data. It is applied by default in [`ModelPipeline`](https://crowdcent.github.io/numerblox/modelpipeline.html#modelpipeline) (Section 4) so that the original DataFrame or [`NumerFrame`](https://crowdcent.github.io/numerblox/numerframe.html#numerframe) you load is not modified.

In [3]:
#| echo: false
#| output: asis
show_doc(CopyPreProcessor)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/preprocessing.py#L68){target="_blank" style="float:right; font-size:smaller"}

### CopyPreProcessor

>      CopyPreProcessor ()

Copy DataFrame to avoid manipulation of original DataFrame.

In [None]:
dataset = create_numerframe(
    "test_assets/mini_numerai_version_1_data.csv"
)
copied_dataset = CopyPreProcessor().transform(dataset)
assert np.array_equal(copied_dataset.values, dataset.values)
assert dataset.meta == copied_dataset.meta

### 1.0.2. FeatureSelectionPreProcessor

[`FeatureSelectionPreProcessor`](https://crowdcent.github.io/numerblox/preprocessing.html#featureselectionpreprocessor) keeps the features that you pass, plus all columns that are not features.

In [4]:
#| echo: false
#| output: asis
show_doc(FeatureSelectionPreProcessor)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/preprocessing.py#L80){target="_blank" style="float:right; font-size:smaller"}

### FeatureSelectionPreProcessor

>      FeatureSelectionPreProcessor (feature_cols:Union[str,list])

Keep only features given + all target, predictions and aux columns.

In [None]:
selected_dataset = FeatureSelectionPreProcessor(
    feature_cols=["feature_wisdom1"]
).transform(dataset)

assert selected_dataset.get_feature_data.shape[1] == 1
assert dataset.meta == selected_dataset.meta

In [None]:
selected_dataset.head(2)

Unnamed: 0,feature_wisdom1,target,id,era,data_type
0,0.25,0.5,n000315175b67977,era1,train
1,0.5,0.25,n0014af834a96cdd,era1,train


### 1.0.3. TargetSelectionPreProcessor

[`TargetSelectionPreProcessor`](https://crowdcent.github.io/numerblox/preprocessing.html#targetselectionpreprocessor) will keep all targets that you pass + all other columns that are not targets.

Not relevant for an inference pipeline, but especially convenient for Numerai Classic training if you train on a subset of the available targets. Can also be applied to Signals if you are using engineered targets in your pipeline.

In [5]:
#| echo: false
#| output: asis
show_doc(TargetSelectionPreProcessor)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/preprocessing.py#L102){target="_blank" style="float:right; font-size:smaller"}

### TargetSelectionPreProcessor

>      TargetSelectionPreProcessor (target_cols:Union[str,list])

Keep only targets given + all feature, predictions and aux columns.

In [None]:
dataset = create_numerframe(
    "test_assets/mini_numerai_version_2_data.parquet"
)
target_cols = ["target", "target_nomi_20", "target_nomi_60"]
selected_dataset = TargetSelectionPreProcessor(target_cols=target_cols).transform(
    dataset
)
assert selected_dataset.get_target_data.shape[1] == len(target_cols)
selected_dataset.head(2)

id,target,target_nomi_20,target_nomi_60,feature_dichasial_hammier_spawner,feature_rheumy_epistemic_prancer,feature_pert_performative_hormuz,feature_hillier_unpitied_theobromine,feature_perigean_bewitching_thruster,feature_renegade_undomestic_milord,feature_koranic_rude_corf,...,feature_drawable_exhortative_dispersant,feature_metabolic_minded_armorist,feature_investigatory_inerasable_circumvallation,feature_centroclinal_incentive_lancelet,feature_unemotional_quietistic_chirper,feature_behaviorist_microbiological_farina,feature_lofty_acceptable_challenge,feature_coactive_prefatorial_lucy,era,data_type
n559bd06a8861222,0.25,0.25,0.5,0.25,0.75,0.25,0.75,0.25,0.5,1.0,...,1.0,0.0,0.0,0.25,0.0,0.0,1.0,0.25,297,train
n9d39dea58c9e3cf,0.5,0.5,0.75,0.75,0.5,0.75,1.0,0.5,0.25,0.5,...,0.25,0.5,0.0,0.25,0.75,1.0,0.75,1.0,3,train


### 1.0.4. ReduceMemoryProcessor

Numerai datasets can take up a lot of RAM and may put a strain on your compute environment.

For Numerai Classic, many of the feature and target columns can be downscaled to `float16`, or to `int8` if you are using the Numerai int8 datasets. For Signals, it depends on the features you are generating.

[`ReduceMemoryProcessor`](https://crowdcent.github.io/numerblox/preprocessing.html#reducememoryprocessor) downscales the type of your numeric columns to reduce the memory footprint as much as possible.
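The sketch below shows the basic idea of downcasting with plain pandas. It is a simplified illustration rather than the processor's actual logic: the helper name `downcast_floats` is hypothetical, and it only handles float columns (no `int8` handling).

```python
import numpy as np
import pandas as pd


def downcast_floats(dataf: pd.DataFrame) -> pd.DataFrame:
    """Cast float32/float64 columns to float16; leave all other columns untouched."""
    float_cols = dataf.select_dtypes(include=["float32", "float64"]).columns
    # astype with a column -> dtype mapping returns a new DataFrame.
    return dataf.astype({col: np.float16 for col in float_cols})


example = pd.DataFrame(
    {
        "feature_a": [0.0, 0.25, 0.5, 0.75, 1.0],
        "target": [0.5, 0.25, 0.75, 0.0, 1.0],
        "era": ["era1"] * 5,
    }
)
reduced = downcast_floats(example)
before = example.memory_usage(deep=True).sum()
after = reduced.memory_usage(deep=True).sum()
```

Values such as 0.25 or 0.5 are represented exactly in `float16`, but thirds and sixths are not. This is why a target value of `0.166667` in the raw data shows up as `0.166626` in the transformed output in this section.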

In [6]:
#| echo: false
#| output: asis
show_doc(ReduceMemoryProcessor)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/preprocessing.py#L123){target="_blank" style="float:right; font-size:smaller"}

### ReduceMemoryProcessor

>      ReduceMemoryProcessor (deep_mem_inspect=False)

Reduce memory usage as much as possible.

Credits to kainsama and others for writing about memory usage reduction for Numerai data:
https://forum.numer.ai/t/reducing-memory/313

:param deep_mem_inspect: Introspect the data deeply by interrogating object dtypes.
Yields a more accurate representation of memory usage if you have complex object columns.

In [None]:
dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
rmp = ReduceMemoryProcessor()
dataf = rmp.transform(dataf)

In [None]:
#| include: false
dataf.head(2)

id,era,data_type,feature_dichasial_hammier_spawner,feature_rheumy_epistemic_prancer,feature_pert_performative_hormuz,feature_hillier_unpitied_theobromine,feature_perigean_bewitching_thruster,feature_renegade_undomestic_milord,feature_koranic_rude_corf,feature_demisable_expiring_millepede,...,target_paul_20,target_paul_60,target_george_20,target_george_60,target_william_20,target_william_60,target_arthur_20,target_arthur_60,target_thomas_20,target_thomas_60
n559bd06a8861222,297,train,0.25,0.75,0.25,0.75,0.25,0.5,1.0,0.25,...,0.0,0.5,0.25,0.5,0.0,0.5,0.166626,0.5,0.333252,0.5
n9d39dea58c9e3cf,3,train,0.75,0.5,0.75,1.0,0.5,0.25,0.5,0.0,...,0.5,0.75,0.5,0.5,0.666504,0.666504,0.5,0.666504,0.5,0.666504


### 1.0.5. UMAPFeatureGenerator

Uniform Manifold Approximation and Projection (UMAP) is a dimensionality reduction technique that we can utilize to generate new Numerai features. This processor uses [umap-learn](https://pypi.org/project/umap-learn) under the hood to model the manifold. The dimension of the input data will be reduced to `n_components` number of features.

In [7]:
#| echo: false
#| output: asis
show_doc(UMAPFeatureGenerator)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/preprocessing.py#L210){target="_blank" style="float:right; font-size:smaller"}

### UMAPFeatureGenerator

>      UMAPFeatureGenerator (n_components:int=5, n_neighbors:int=15,
>                            min_dist:float=0.0, metric:str='correlation',
>                            feature_names:list=None, *args, **kwargs)

Generate new Numerai features using UMAP. Uses umap-learn under the hood: 

https://pypi.org/project/umap-learn/
:param n_components: How many new features to generate.
:param n_neighbors: Number of neighboring points used in local approximations of manifold structure.
:param min_dist: How tightly the embedding is allowed to compress points together.
:param metric: Metric to measure distance in input space. Correlation by default.
:param feature_names: Selection of features used to perform UMAP on. All features by default.
*args, **kwargs will be passed to initialization of UMAP.

In [None]:
n_components = 3
umap_gen = UMAPFeatureGenerator(n_components=n_components, n_neighbors=9)
dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
dataf = umap_gen(dataf)

The new features will be named with the convention `f"feature_umap_{i}"`.

In [None]:
umap_features = [f"feature_umap_{i}" for i in range(n_components)]
dataf[umap_features].head(3)

id,feature_umap_0,feature_umap_1,feature_umap_2
n559bd06a8861222,0.063341,0.853068,0.183456
n9d39dea58c9e3cf,0.0,0.459346,0.47207
nb64f06d3a9fc9f1,0.616552,0.668318,1.0


## 1.1. Numerai Classic

The Numerai Classic dataset has a certain structure that you may not encounter in the Numerai Signals tournament.
Therefore, this section has all preprocessors that can only be applied to Numerai Classic.

### 1.1.0 Numerai Classic: Version agnostic

Preprocessors that work for all Numerai Classic versions.

#### 1.1.0.1. BayesianGMMTargetProcessor

In [8]:
#| echo: false
#| output: asis
show_doc(BayesianGMMTargetProcessor)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/preprocessing.py#L256){target="_blank" style="float:right; font-size:smaller"}

### BayesianGMMTargetProcessor

>      BayesianGMMTargetProcessor (target_col:str='target',
>                                  feature_names:list=None, n_components:int=6)

Generate synthetic (fake) target using a Bayesian Gaussian Mixture model. 

Based on Michael Oliver's GitHub Gist implementation: 

https://gist.github.com/the-moliver/dcdd2862dc2c78dda600f1b449071c93

:param target_col: Column from which to create fake target. 

:param feature_names: Selection of features used for Bayesian GMM. All features by default.
:param n_components: Number of components for fitting Bayesian Gaussian Mixture Model.

### 1.1.1. Numerai Classic: Version 1 specific

Preprocessors that only work for version 1 (legacy data). 

If you are a new user, we recommend starting with the version 2 data and avoiding version 1. The preprocessors below exist only for legacy and compatibility reasons.

#### 1.1.1.1. GroupStatsPreProcessor

The version 1 legacy data has 6 groups of features which allows us to calculate aggregate features.
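A minimal sketch of such aggregate features with plain pandas is shown below. The helper `add_group_stats` is hypothetical, and selecting group columns by name prefix is an assumption; the `_mean`, `_std` and `_skew` suffixes follow the column names in the sample output of this section.

```python
import pandas as pd


def add_group_stats(dataf: pd.DataFrame, groups: list) -> pd.DataFrame:
    """Add row-wise mean, std and skew for each feature group."""
    dataf = dataf.copy()
    for group in groups:
        # Collect the group's columns before adding any aggregate columns.
        cols = [col for col in dataf.columns if col.startswith(f"feature_{group}")]
        dataf[f"feature_{group}_mean"] = dataf[cols].mean(axis=1)
        dataf[f"feature_{group}_std"] = dataf[cols].std(axis=1)
        dataf[f"feature_{group}_skew"] = dataf[cols].skew(axis=1)
    return dataf


example = pd.DataFrame(
    {
        "feature_wisdom1": [0.0, 0.5],
        "feature_wisdom2": [0.5, 0.5],
        "feature_wisdom3": [1.0, 0.5],
        "feature_charisma1": [0.25, 1.0],
        "feature_charisma2": [0.75, 0.0],
        "feature_charisma3": [0.5, 0.5],
    }
)
stats = add_group_stats(example, groups=["wisdom", "charisma"])
```

Each group contributes three new columns, so the 6 groups in the version 1 data yield 18 aggregate features.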

In [9]:
#| echo: false
#| output: asis
show_doc(GroupStatsPreProcessor)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/preprocessing.py#L339){target="_blank" style="float:right; font-size:smaller"}

### GroupStatsPreProcessor

>      GroupStatsPreProcessor (groups:list=None)

WARNING: Only supported for Version 1 (legacy) data. 

Calculate group statistics for all data groups. 

:param groups: Groups to create features for. All groups by default.

In [None]:
dataf = create_numerframe(
    "test_assets/mini_numerai_version_1_data.csv"
)
group_features_dataf = GroupStatsPreProcessor().transform(dataf)
group_features_dataf.head(2)

Unnamed: 0,id,era,data_type,feature_intelligence1,feature_intelligence2,feature_intelligence3,feature_intelligence4,feature_intelligence5,feature_intelligence6,feature_intelligence7,...,feature_charisma_skew,feature_dexterity_mean,feature_dexterity_std,feature_dexterity_skew,feature_strength_mean,feature_strength_std,feature_strength_skew,feature_constitution_mean,feature_constitution_std,feature_constitution_skew
0,n000315175b67977,era1,train,0.0,0.5,0.25,0.0,0.5,0.25,0.25,...,-0.004783,0.696429,0.200446,-0.60762,0.480263,0.292829,-0.372064,0.427632,0.27572,0.276155
1,n0014af834a96cdd,era1,train,0.0,0.0,0.0,0.25,0.5,0.0,0.0,...,-0.021737,0.267857,0.249312,0.382267,0.407895,0.309866,0.220625,0.644737,0.33408,-0.794938


In [None]:
#| include: false
new_cols = [
    "feature_intelligence_mean",
    "feature_intelligence_std",
    "feature_intelligence_skew",
    "feature_wisdom_mean",
    "feature_wisdom_std",
    "feature_wisdom_skew",
    "feature_charisma_mean",
    "feature_charisma_std",
    "feature_charisma_skew",
    "feature_dexterity_mean",
    "feature_dexterity_std",
    "feature_dexterity_skew",
    "feature_strength_mean",
    "feature_strength_std",
    "feature_strength_skew",
    "feature_constitution_mean",
    "feature_constitution_std",
    "feature_constitution_skew",
]
# Check that every expected group statistic column is present.
assert set(new_cols).issubset(group_features_dataf.columns)
group_features_dataf.get_feature_data[new_cols].head(2)

Unnamed: 0,feature_intelligence_mean,feature_intelligence_std,feature_intelligence_skew,feature_wisdom_mean,feature_wisdom_std,feature_wisdom_skew,feature_charisma_mean,feature_charisma_std,feature_charisma_skew,feature_dexterity_mean,feature_dexterity_std,feature_dexterity_skew,feature_strength_mean,feature_strength_std,feature_strength_skew,feature_constitution_mean,feature_constitution_std,feature_constitution_skew
0,0.333333,0.246183,0.558528,0.668478,0.236022,-0.115082,0.438953,0.25991,-0.004783,0.696429,0.200446,-0.60762,0.480263,0.292829,-0.372064,0.427632,0.27572,0.276155
1,0.208333,0.234359,0.382554,0.559783,0.358177,-0.062362,0.485465,0.252501,-0.021737,0.267857,0.249312,0.382267,0.407895,0.309866,0.220625,0.644737,0.33408,-0.794938


## 1.2. Numerai Signals

Preprocessors that are specific to Numerai Signals.

### 1.2.1. KatsuFeatureGenerator

[Katsu1110](https://www.kaggle.com/code1110) provides an excellent and fast feature engineering scheme in his [Kaggle notebook on starting with Numerai Signals](https://www.kaggle.com/code1110/numeraisignals-starter-for-beginners). It is surprisingly effective, fast and works well for modeling. This preprocessor is based on his feature engineering setup in that notebook.

Features generated:
1. MACD and MACD signal
2. RSI
3. Percentage rate of return
4. Volatility
5. MA (moving average) gap
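The window-based features in the list above (items 3 to 5) could be computed along the following lines. This is a sketch under assumptions, not the generator's actual code: the helper `add_window_features` is hypothetical, volatility is taken to be the rolling standard deviation of one-row returns, and the MA gap is taken to be the close divided by its rolling mean. MACD and RSI are left out.

```python
import pandas as pd


def add_window_features(dataf: pd.DataFrame, windows: list, close_col: str = "close") -> pd.DataFrame:
    """Add rate of change, volatility and moving average gap per ticker."""
    dataf = dataf.sort_values(["ticker", "date"]).copy()
    grouped = dataf.groupby("ticker")[close_col]
    for window in windows:
        # Percentage rate of change over `window` rows.
        dataf[f"feature_{close_col}_ROCP_{window}"] = grouped.transform(
            lambda s: s.pct_change(window)
        )
        # Rolling standard deviation of one-row returns.
        dataf[f"feature_{close_col}_VOL_{window}"] = grouped.transform(
            lambda s: s.pct_change().rolling(window).std()
        )
        # Ratio of the price to its rolling mean.
        dataf[f"feature_{close_col}_MA_gap_{window}"] = grouped.transform(
            lambda s: s / s.rolling(window).mean()
        )
    return dataf


prices = pd.DataFrame(
    {
        "ticker": ["ABC"] * 4 + ["DEF"] * 4,
        "date": list(pd.date_range("2020-01-01", periods=4)) * 2,
        "close": [10.0, 11.0, 12.1, 13.31, 20.0, 20.0, 20.0, 20.0],
    }
)
features = add_window_features(prices, windows=[2])
```

Grouping by ticker keeps each window inside a single ticker, so the first rows of `DEF` are `NaN` instead of borrowing prices from `ABC`.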

In [10]:
#| echo: false
#| output: asis
show_doc(KatsuFeatureGenerator)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/preprocessing.py#L374){target="_blank" style="float:right; font-size:smaller"}

### KatsuFeatureGenerator

>      KatsuFeatureGenerator (windows:list, ticker_col:str='ticker',
>                             close_col:str='close', num_cores:int=None)

Effective feature engineering setup based on Katsu's starter notebook.
Based on source by Katsu1110: https://www.kaggle.com/code1110/numeraisignals-starter-for-beginners

:param windows: Time interval to apply for window features: 

1. Percentage Rate of change 

2. Volatility 

3. Moving Average gap 

:param ticker_col: Columns with tickers to iterate over. 

:param close_col: Column name where you have closing price stored.

Let's create a simple synthetic dataset to test preprocessors on. Many preprocessors require at least `ticker`, `date` and `close` columns. More advanced feature engineering preprocessors should also have `open`, `high`, `low` and `volume` columns.

In [None]:
instances = []
tickers = ["ABC.US", "DEF.US", "GHI.US"]
for ticker in tickers:
    price = np.random.randint(10, 100)
    for i in range(100):
        price += np.random.uniform(-1, 1)
        instances.append(
            {
                "ticker": ticker,
                "date": pd.Timestamp("2020-01-01") + pd.Timedelta(days=i),
                "open": price - 0.05,
                "high": price + 0.02,
                "low": price - 0.01,
                "close": price,
                "volume": np.random.randint(1000, 10000),
            }
        )
dummy_df = NumerFrame(instances)

In [None]:
dummy_df.head(2)

Unnamed: 0,ticker,date,open,high,low,close,volume
0,ABC.US,2020-01-01,77.92798,77.99798,77.96798,77.97798,5959
1,ABC.US,2020-01-02,78.082728,78.152728,78.122728,78.132728,2232


In [None]:
dataf = NumerFrame(dummy_df)
dataf.loc[:, "friday_date"] = dataf["date"]

In [None]:
kfpp = KatsuFeatureGenerator(windows=[20, 40, 60], num_cores=8)
new_dataf = kfpp.transform(dataf)


12 features are generated in this test (3 windows x 3 window features + 3 non-window features).

In [None]:
new_dataf.sort_values(["ticker", "date"]).get_feature_data.tail(2)

Unnamed: 0,feature_close_ROCP_20,feature_close_VOL_20,feature_close_MA_gap_20,feature_close_ROCP_40,feature_close_VOL_40,feature_close_MA_gap_40,feature_close_ROCP_60,feature_close_VOL_60,feature_close_MA_gap_60,feature_RSI,feature_MACD,feature_MACD_signal
298,0.004565,0.001877,0.993699,0.02646,0.00194,1.003772,0.047572,0.001852,1.014498,49.761904,0.200531,0.443523
299,-0.005576,0.001893,0.987824,0.02516,0.001945,0.99695,0.043803,0.001861,1.00751,46.763659,0.114464,0.377711


### 1.2.2. EraQuantileProcessor

Numerai Signals' objective is predicting a ranking of equities. Therefore, we can benefit from creating rankings out of the features. Doing this reduces noise and works as a normalization mechanism for your features. [`EraQuantileProcessor`](https://crowdcent.github.io/numerblox/preprocessing.html#eraquantileprocessor) bins features in a given number of quantiles for each era in the dataset.
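The sketch below illustrates per-era binning with plain pandas ranks. It swaps in a simpler technique than the processor itself, whose parameters indicate that it relies on scikit-learn's `QuantileTransformer`. The helper `quantile_bin_per_era` is hypothetical; the `_quantile{n}` suffix mirrors the column names in the sample output of this section.

```python
import pandas as pd


def quantile_bin_per_era(
    dataf: pd.DataFrame, features: list, era_col: str = "era", num_quantiles: int = 5
) -> pd.DataFrame:
    """Bin each feature into equally sized buckets within every era, scaled to [0, 1]."""
    dataf = dataf.copy()
    for feature in features:
        grouped = dataf.groupby(era_col)[feature]
        rank = grouped.rank(method="first")  # 1..n within each era
        count = grouped.transform("count")  # n for each era
        # Integer bucket ids 0..num_quantiles-1.
        bucket = (rank - 1) * num_quantiles // count
        dataf[f"{feature}_quantile{num_quantiles}"] = bucket / (num_quantiles - 1)
    return dataf


example = pd.DataFrame(
    {
        "era": ["era1"] * 5 + ["era2"] * 5,
        "feature_x": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0, 300.0, 200.0, 500.0, 400.0],
    }
)
binned = quantile_bin_per_era(example, features=["feature_x"], num_quantiles=5)
```

Although the raw values in `era2` are two orders of magnitude larger than in `era1`, both eras end up on the same `[0, 1]` scale, which is the normalization effect described above.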

In [11]:
#| echo: false
#| output: asis
show_doc(EraQuantileProcessor)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/preprocessing.py#L486){target="_blank" style="float:right; font-size:smaller"}

### EraQuantileProcessor

>      EraQuantileProcessor (num_quantiles:int=50, era_col:str='friday_date',
>                            features:list=None, num_cores:int=None,
>                            random_state:int=0)

Transform features into quantiles on a per-era basis

:param num_quantiles: Number of buckets to split data into. 

:param era_col: Era column name in the dataframe to perform each transformation. 

:param features: All features that you want quantized. All feature cols by default. 

:param num_cores: CPU cores to allocate for quantile transforming. All available cores by default. 

:param random_state: Seed for QuantileTransformer.

In [None]:
era_quantiler = EraQuantileProcessor(num_quantiles=50)
era_dataf = era_quantiler.transform(new_dataf)

In [None]:
era_dataf.get_feature_data.tail(2)

Unnamed: 0,feature_close_ROCP_20,feature_close_VOL_20,feature_close_MA_gap_20,feature_close_ROCP_40,feature_close_VOL_40,feature_close_MA_gap_40,feature_close_ROCP_60,feature_close_VOL_60,feature_close_MA_gap_60,feature_RSI,...,feature_close_MA_gap_20_quantile50,feature_close_ROCP_40_quantile50,feature_close_VOL_40_quantile50,feature_close_MA_gap_40_quantile50,feature_close_ROCP_60_quantile50,feature_close_VOL_60_quantile50,feature_close_MA_gap_60_quantile50,feature_RSI_quantile50,feature_MACD_quantile50,feature_MACD_signal_quantile50
298,0.004565,0.001877,0.993699,0.02646,0.00194,1.003772,0.047572,0.001852,1.014498,49.761904,...,0.5,1.0,0.5,0.5,1.0,0.5,1.0,0.5,0.5,0.5
299,-0.005576,0.001893,0.987824,0.02516,0.001945,0.99695,0.043803,0.001861,1.00751,46.763659,...,0.5,1.0,0.5,0.5,1.0,0.5,0.5,0.5,0.5,0.5


### 1.2.3. TickerMapper

Numerai Signals data APIs may work with different ticker formats. Our goal with [`TickerMapper`](https://crowdcent.github.io/numerblox/preprocessing.html#tickermapper) is to map `ticker_col` to `target_ticker_format`.

In [12]:
#| echo: false
#| output: asis
show_doc(TickerMapper)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/preprocessing.py#L550){target="_blank" style="float:right; font-size:smaller"}

### TickerMapper

>      TickerMapper (ticker_col:str='ticker',
>                    target_ticker_format:str='bloomberg_ticker',
>                    mapper_path:str='https://numerai-signals-public-data.s3-us-
>                    west-2.amazonaws.com/signals_ticker_map_w_bbg.csv')

Map ticker from one format to another. 

:param ticker_col: Column used for mapping. Must already be present in the input data. 

:param target_ticker_format: Format to map tickers to. Must be present in the ticker map. 

For default mapper supported ticker formats are: ['ticker', 'bloomberg_ticker', 'yahoo'] 

:param mapper_path: Path to CSV file containing at least ticker_col and target_ticker_format columns. 

Can be either a web link or local path. Numerai Signals mapping by default.

Use default signals mapping to convert between Numerai ticker, Bloomberg ticker and Yahoo ticker formats.

In [None]:
test_dataf = pd.DataFrame(["AAPL", "MSFT"], columns=["ticker"])
mapper = TickerMapper()
mapper.transform(test_dataf)

Unnamed: 0,ticker,bloomberg_ticker
0,AAPL,AAPL US
1,MSFT,MSFT US


You can also use a CSV file for mapping. For example, you can use the mapping that Numerai user degerhan provides in [dsignals](https://github.com/degerhan/dsignals) for EOD data.

In [None]:
test_dataf = pd.DataFrame(["LLB SW", "DRAK NA", "SWB MK", "ELEKTRA* MF", "NOT_A_TICKER"], columns=["bloomberg_ticker"])
mapper = TickerMapper(ticker_col="bloomberg_ticker", target_ticker_format="signals_ticker",
                      mapper_path="test_assets/eodhd-map.csv")
mapper.transform(test_dataf)

Unnamed: 0,bloomberg_ticker,signals_ticker
0,LLB SW,LLB.SW
1,DRAK NA,DRAK.AS
2,SWB MK,5211.KLSE
3,ELEKTRA* MF,ELEKTRA.MX
4,NOT_A_TICKER,


### 1.2.4. SignalsTargetProcessor

Numerai provides [targets for 5000 stocks](https://docs.numer.ai/numerai-signals/signals-overview#universe) that are neutralized against all sorts of factors. However, it can be helpful to experiment with creating your own targets. You might want to explore different windows, different target binning and/or neutralization. [`SignalsTargetProcessor`](https://crowdcent.github.io/numerblox/preprocessing.html#signalstargetprocessor) engineers 3 different targets for every given window:
- `_raw`: Raw return based on price movements.
- `_rank`: Ranks of raw return.
- `_group`: Binned returns based on rank.

Note that Numerai provides targets based on 4-day returns and 20-day returns. While you can explore any window you like, it makes sense to start with `windows` close to these timeframes.

Many options are also possible for the `bins` argument. The following binnings are commonly used:
- Nomi bins: `[0, 0.05, 0.25, 0.75, 0.95, 1]`
- Uniform bins: `[0, 0.20, 0.40, 0.60, 0.80, 1]`
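The three target types can be sketched with pandas as follows. This is an illustration under assumptions, not the processor's implementation: the helper `engineer_targets` is hypothetical, the raw target is assumed to be the forward return over `window` rows per ticker, and the label values `[0, 0.25, 0.5, 0.75, 1]` are an assumed scaling with one label per bin.

```python
import pandas as pd

NOMI_BINS = [0, 0.05, 0.25, 0.75, 0.95, 1]
LABELS = [0, 0.25, 0.5, 0.75, 1]  # assumed label scaling, one label per bin


def engineer_targets(
    dataf: pd.DataFrame,
    window: int,
    price_col: str = "close",
    ticker_col: str = "ticker",
    era_col: str = "date",
) -> pd.DataFrame:
    """Add raw, rank and group targets for a single window."""
    dataf = dataf.sort_values([ticker_col, era_col]).copy()
    prefix = f"target_{window}d"
    # Forward return over `window` rows for each ticker.
    future_price = dataf.groupby(ticker_col)[price_col].shift(-window)
    dataf[f"{prefix}_raw"] = future_price / dataf[price_col] - 1
    # Percentile rank of the raw return within each era.
    dataf[f"{prefix}_rank"] = dataf.groupby(era_col)[f"{prefix}_raw"].rank(
        pct=True, method="first"
    )
    # Bin the ranks; pd.cut intervals are right-inclusive.
    dataf[f"{prefix}_group"] = pd.cut(
        dataf[f"{prefix}_rank"], bins=NOMI_BINS, labels=LABELS, include_lowest=True
    ).astype(float)
    return dataf


prices = pd.DataFrame(
    {
        "ticker": ["A", "A", "B", "B", "C", "C", "D", "D"],
        "date": ["2020-01-01", "2020-01-02"] * 4,
        "close": [10.0, 11.0, 10.0, 10.5, 10.0, 9.0, 10.0, 10.0],
    }
)
targets = engineer_targets(prices, window=1)
```

With Nomi bins, only ranks above 0.95 or at most 0.05 reach the extreme labels, so most rows fall in the middle bucket. Rows without a future price (the last `window` rows per ticker) get `NaN` targets.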

In [13]:
#| echo: false
#| output: asis
show_doc(SignalsTargetProcessor)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/preprocessing.py#L590){target="_blank" style="float:right; font-size:smaller"}

### SignalsTargetProcessor

>      SignalsTargetProcessor (price_col:str='close', windows:list=None,
>                              bins:list=None, labels:list=None)

Engineer targets for Numerai Signals. 

More information on how Numerai Signals targets are implemented: 

https://forum.numer.ai/t/decoding-the-signals-target/2501

:param price_col: Column from which target will be derived. 

:param windows: Timeframes to use for engineering targets. 10 and 20-day by default. 

:param bins: Binning used to create group targets. Nomi binning by default. 

:param labels: Scaling for binned target. Must be same length as resulting bins (bins-1). Numerai labels by default.

In [None]:
stp = SignalsTargetProcessor()
era_dataf.meta.era_col = "date"
new_target_dataf = stp.transform(era_dataf)
new_target_dataf.get_target_data.head(2)

Unnamed: 0,target_10d_raw,target_10d_rank,target_10d_group,target_20d_raw,target_20d_rank,target_20d_group
0,0.032878,1.0,1.0,0.042264,1.0,1.0
1,0.024986,1.0,1.0,0.049748,1.0,1.0


### 1.2.5. LagPreProcessor

Many models, like Gradient Boosting Machines (GBMs), don't learn time-series patterns by themselves. However, if we create lags of our features, the models can pick up on time dependencies between features. [`LagPreProcessor`](https://crowdcent.github.io/numerblox/preprocessing.html#lagpreprocessor) creates lag features for given features and windows.
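Conceptually, a lag feature is a per-ticker shift of a column. The sketch below shows this with pandas; the helper `add_lags` is hypothetical and assumes rows are already ordered by date within each ticker. The `{feature}_lag{window}` naming follows the sample output in this section.

```python
import pandas as pd


def add_lags(
    dataf: pd.DataFrame, feature_names: list, windows: list, ticker_col: str = "ticker"
) -> pd.DataFrame:
    """Add lagged copies of each feature, shifted within each ticker."""
    dataf = dataf.copy()
    for feature in feature_names:
        for window in windows:
            # Shifting per ticker prevents values leaking from one ticker to the next.
            dataf[f"{feature}_lag{window}"] = dataf.groupby(ticker_col)[feature].shift(window)
    return dataf


example = pd.DataFrame(
    {
        "ticker": ["A", "A", "A", "B", "B", "B"],
        "close": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    }
)
lagged = add_lags(example, feature_names=["close"], windows=[1, 2])
```

The first `window` rows of every ticker have no history to look back on, so their lag values are `NaN`.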

In [14]:
#| echo: false
#| output: asis
show_doc(LagPreProcessor)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/preprocessing.py#L636){target="_blank" style="float:right; font-size:smaller"}

### LagPreProcessor

>      LagPreProcessor (windows:list=None, ticker_col:str='bloomberg_ticker',
>                       feature_names:list=None)

Add lag features based on given windows.

:param windows: All lag windows to process for all features. 

[5, 10, 15, 20] by default (4 weeks lookback) 

:param ticker_col: Column name for grouping by tickers. 

:param feature_names: All features for which you want to create lags. All features by default.

In [None]:
lpp = LagPreProcessor(ticker_col="ticker", feature_names=["close", "volume"])
dataf = lpp(dataf)

All lag features will contain `lag` in the column name.

In [None]:
dataf.get_pattern_data("lag").tail(2)

Unnamed: 0,close_lag5,close_lag10,close_lag15,close_lag20,volume_lag5,volume_lag10,volume_lag15,volume_lag20
298,79.967452,80.110232,81.077144,79.644114,5740.0,7375.0,5833.0,3470.0
299,80.589557,79.963355,80.895863,79.958396,4791.0,2066.0,7200.0,4146.0


### 1.2.6. DifferencePreProcessor

After creating lags with the [`LagPreProcessor`](https://crowdcent.github.io/numerblox/preprocessing.html#lagpreprocessor), it may be useful to create new features that calculate the difference between those lags. Through this process in [`DifferencePreProcessor`](https://crowdcent.github.io/numerblox/preprocessing.html#differencepreprocessor), we can provide models with more time-series related patterns.
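A minimal sketch of the difference calculation is given below. The helper `add_differences` is hypothetical. The percentage variant `value / lag - 1` is an assumption, but it reproduces the sample output in this section: the example rows reuse the `close`, `volume` and 5-row lag values shown for rows 298 and 299.

```python
import pandas as pd


def add_differences(
    dataf: pd.DataFrame, feature_names: list, windows: list, pct_diff: bool = False
) -> pd.DataFrame:
    """Add differences between each feature and its existing lag columns."""
    dataf = dataf.copy()
    for feature in feature_names:
        for window in windows:
            lag = dataf[f"{feature}_lag{window}"]
            if pct_diff:
                # Relative change with respect to the lagged value.
                diff = dataf[feature] / lag - 1
            else:
                # Simple difference.
                diff = dataf[feature] - lag
            dataf[f"{feature}_diff{window}"] = diff
    return dataf


# Values copied from the sample outputs in this section (rows 298 and 299).
example = pd.DataFrame(
    {
        "close": [80.007655, 79.512572],
        "close_lag5": [79.967452, 80.589557],
        "volume": [1508, 9657],
        "volume_lag5": [5740.0, 4791.0],
    }
)
diffs = add_differences(example, feature_names=["close", "volume"], windows=[5], pct_diff=True)
```

Since the calculation reads the `_lag` columns directly, the lag features must already exist in the DataFrame.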

In [15]:
#| echo: false
#| output: asis
show_doc(DifferencePreProcessor)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/preprocessing.py#L669){target="_blank" style="float:right; font-size:smaller"}

### DifferencePreProcessor

>      DifferencePreProcessor (windows:list=None, feature_names:list=None,
>                              pct_diff:bool=False, abs_diff:bool=False)

Add difference features based on given windows. Run LagPreProcessor first.

:param windows: All lag windows to process for all features. 

:param feature_names: All features for which you want to create differences. All features that also have lags by default. 

:param pct_diff: Method to calculate differences. If True, will calculate differences with a percentage change. Otherwise calculates a simple difference. Defaults to False. 

:param abs_diff: Whether to also calculate the absolute value of all differences. Defaults to False.

In [None]:
dpp = DifferencePreProcessor(
    feature_names=["close", "volume"], windows=[5, 10, 15, 20], pct_diff=True
)
dataf = dpp.transform(dataf)

All difference features will contain `diff` in the column name.

In [None]:
dataf.get_pattern_data("diff").tail(2)

|     | close_diff5 | close_diff10 | close_diff15 | close_diff20 | volume_diff5 | volume_diff10 | volume_diff15 | volume_diff20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 298 | 0.000503 | -0.00128 | -0.013191 | 0.004565 | -0.737282 | -0.795525 | -0.741471 | -0.565418 |
| 299 | -0.013364 | -0.005637 | -0.0171 | -0.005576 | 1.015654 | 3.67425 | 0.34125 | 1.329233 |
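
As a rough illustration of what `pct_diff=True` computes, the sketch below reproduces `close_diff5` for the last two rows in plain pandas. It assumes the percentage difference is the current value divided by its lag, minus one. That assumption is consistent with the outputs shown in this section, but it is not taken from the library source. The helper `add_pct_diff` is hypothetical and not part of the library.

```python
import pandas as pd

# Values copied from the `close` and `close_lag5` outputs shown in this section.
toy = pd.DataFrame(
    {"close": [80.007655, 79.512572], "close_lag5": [79.967452, 80.589557]},
    index=[298, 299],
)


def add_pct_diff(dataf: pd.DataFrame, feature: str, windows: list) -> pd.DataFrame:
    """Hypothetical helper: add `<feature>_diff<window>` columns as percentage differences."""
    out = dataf.copy()
    for window in windows:
        # Assumption: percentage difference = current value / lagged value - 1.
        out[f"{feature}_diff{window}"] = out[feature] / out[f"{feature}_lag{window}"] - 1
    return out


result = add_pct_diff(toy, feature="close", windows=[5])
print(result["close_diff5"].round(6))
```

Rounded to six decimals, the result agrees with the `close_diff5` column above.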


### 1.2.7. PandasTaFeatureGenerator

This generator takes a [pandas-ta](https://github.com/twopirllc/pandas-ta) strategy and computes it for each ticker on multiple cores. A simple default strategy is available with RSI features over 14 and 60 rows.

To learn more about defining pandas-ta strategies, check [this section of the pandas-ta README](https://github.com/twopirllc/pandas-ta#pandas-ta-strategies).
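
To make the default RSI features more concrete, here is a minimal single-process sketch of a per-ticker RSI in plain pandas. It is not pandas-ta's implementation and does not use multiple cores. It assumes Wilder-style exponential smoothing, so values may differ in detail from the library output. The functions `rsi` and `add_rsi_features` are hypothetical names used only for illustration.

```python
import pandas as pd


def rsi(close: pd.Series, length: int = 14) -> pd.Series:
    """Illustrative Relative Strength Index with Wilder-style smoothing."""
    delta = close.diff()
    gains = delta.clip(lower=0)    # upward moves, 0 otherwise
    losses = -delta.clip(upper=0)  # downward moves as positive numbers
    avg_gain = gains.ewm(alpha=1 / length, min_periods=length).mean()
    avg_loss = losses.ewm(alpha=1 / length, min_periods=length).mean()
    return 100 * avg_gain / (avg_gain + avg_loss)


def add_rsi_features(dataf: pd.DataFrame, lengths=(14, 60), ticker_col: str = "ticker") -> pd.DataFrame:
    """Hypothetical helper: compute RSI separately for every ticker."""
    out = dataf.copy()
    for length in lengths:
        out[f"feature_RSI_{length}"] = out.groupby(ticker_col)["close"].transform(
            lambda s: rsi(s, length)
        )
    return out


# Toy data: one steadily rising ticker and one steadily falling ticker.
prices = pd.DataFrame(
    {
        "ticker": ["ABC"] * 6 + ["XYZ"] * 6,
        "close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    }
)
rsi_df = add_rsi_features(prices, lengths=(3,))
print(rsi_df)
```

Grouping by ticker before computing the indicator keeps one ticker's prices from leaking into another's rolling statistics, which is also why the generator needs a `ticker_col`.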

In [16]:
#| echo: false
#| output: asis
show_doc(PandasTaFeatureGenerator)

---

### PandasTaFeatureGenerator

>      PandasTaFeatureGenerator (strategy:pandas_ta.core.Strategy=None,
>                                ticker_col:str='ticker', num_cores:int=None)

Generate features with pandas-ta.
https://github.com/twopirllc/pandas-ta

:param strategy: Valid Pandas Ta strategy. 

For more information on creating a strategy, see: 

https://github.com/twopirllc/pandas-ta#pandas-ta-strategy 

By default, a strategy with RSI(14) and RSI(60) is used. 

:param ticker_col: Column name for grouping by tickers. 

:param num_cores: Number of cores to use for multiprocessing. 

By default, all available cores are used.

In [None]:
pta = PandasTaFeatureGenerator()
new_pta_df = pta.transform(dummy_df)
new_pta_df.tail(2)

|     | ticker | date | open | high | low | close | volume | friday_date | feature_RSI_14 | feature_RSI_60 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 298 | GHI.US | 2020-04-08 | 79.957655 | 80.027655 | 79.997655 | 80.007655 | 1508 | 2020-04-08 | 49.761904 | 51.914054 |
| 299 | GHI.US | 2020-04-09 | 79.462572 | 79.532572 | 79.502572 | 79.512572 | 9657 | 2020-04-09 | 46.763659 | 50.963993 |


The feature data can be selected directly through the [`NumerFrame`](https://crowdcent.github.io/numerblox/numerframe.html#numerframe) convenience property `.get_feature_data`.

In [None]:
new_pta_df.get_feature_data.tail(2)

|     | feature_RSI_14 | feature_RSI_60 |
| --- | --- | --- |
| 298 | 49.761904 | 51.914054 |
| 299 | 46.763659 | 50.963993 |
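
In plain pandas terms, this selection is roughly equivalent to filtering columns by name. The sketch below assumes that feature columns are identified by the `feature` prefix. It uses an ordinary `DataFrame` rather than a `NumerFrame`, with values copied from the output above.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "ticker": ["GHI.US", "GHI.US"],
        "close": [80.007655, 79.512572],
        "feature_RSI_14": [49.761904, 46.763659],
        "feature_RSI_60": [51.914054, 50.963993],
    },
    index=[298, 299],
)

# Assumption: feature columns are the ones whose name starts with "feature".
feature_cols = [col for col in frame.columns if col.startswith("feature")]
feature_data = frame[feature_cols]
print(feature_data)
```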


A custom `pandas-ta` strategy can be defined as follows. Check the [pandas-ta docs](https://github.com/twopirllc/pandas-ta#indicators-by-category) for more information on available indicators and arguments.

`ta` takes in a list of dictionaries defining indicators and optional additional arguments. We use `col_names` for convenience so features are prefixed by `feature_` and can be easily retrieved within a [`NumerFrame`](https://crowdcent.github.io/numerblox/numerframe.html#numerframe).

In [None]:
import pandas_ta as ta

strategy = ta.Strategy(name="mystrategy",
                       ta=[{"kind": "cmo", "col_names": "feature_CMO"}, # Chande Momentum Oscillator
                           {"kind": "rsi", "length": 60, "col_names": "feature_RSI_60"} # Relative Strength Index
                           ])

In [None]:
pta = PandasTaFeatureGenerator(strategy=strategy)
new_pta_df = pta.transform(dummy_df)
new_pta_df.get_feature_data.tail(5)

|     | feature_CMO | feature_RSI_60 |
| --- | --- | --- |
| 295 | 9.158636 | 53.543562 |
| 296 | 0.287006 | 52.256402 |
| 297 | -11.012262 | 50.466545 |
| 298 | -0.476191 | 51.914054 |
| 299 | -6.472682 | 50.963993 |


## 2. Custom preprocessors

There is an almost unlimited number of ways to preprocess data (feature selection, engineering and manipulation). The preprocessors currently implemented only scratch the surface. We invite the Numerai community to develop additional Numerai Classic and Numerai Signals preprocessors.

A new preprocessor should inherit from [`BaseProcessor`](https://crowdcent.github.io/numerblox/preprocessing.html#baseprocessor) and implement a `transform` method. For an efficient implementation, we recommend using [`NumerFrame`](https://crowdcent.github.io/numerblox/numerframe.html#numerframe) functionality for preprocessing. You can also support Pandas DataFrame input, as long as the `transform` method returns a [`NumerFrame`](https://crowdcent.github.io/numerblox/numerframe.html#numerframe). This ensures that the preprocessor still works within a full `numerblox` pipeline. A template for new preprocessors is given below.

To enable fancy logging output, add the `@display_processor_info` decorator to the `transform` method.

In [17]:
#| echo: false
#| output: asis
show_doc(AwesomePreProcessor)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/preprocessing.py#L716){target="_blank" style="float:right; font-size:smaller"}

### AwesomePreProcessor

>      AwesomePreProcessor ()

TEMPLATE - Do some awesome preprocessing.
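
The template above could be fleshed out along the following lines. This is a sketch under stated assumptions, not the library's own code: `BaseProcessorStandIn` is a hypothetical stand-in for `BaseProcessor` so that the example runs without `numerblox` installed, `RollingMeanPreProcessor` is an invented example, and a plain pandas `DataFrame` is used where a real preprocessor would return a `NumerFrame`.

```python
from abc import ABC, abstractmethod

import pandas as pd


class BaseProcessorStandIn(ABC):
    """Hypothetical stand-in for numerblox's BaseProcessor, so the sketch runs on its own."""

    @abstractmethod
    def transform(self, dataf: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
        ...


class RollingMeanPreProcessor(BaseProcessorStandIn):
    """Hypothetical example: add a per-ticker rolling mean for each given column."""

    def __init__(self, feature_names: list, window: int = 2, ticker_col: str = "ticker"):
        self.feature_names = feature_names
        self.window = window
        self.ticker_col = ticker_col

    # In a real pipeline, `@display_processor_info` would decorate this method.
    def transform(self, dataf: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
        out = dataf.copy()
        for feature in self.feature_names:
            # The `feature_` prefix keeps the new column easy to select later.
            out[f"feature_{feature}_mean{self.window}"] = out.groupby(self.ticker_col)[
                feature
            ].transform(lambda s: s.rolling(self.window).mean())
        # A real numerblox preprocessor would return a NumerFrame here.
        return out


prices = pd.DataFrame(
    {"ticker": ["ABC"] * 3 + ["XYZ"] * 3, "close": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]}
)
processed = RollingMeanPreProcessor(feature_names=["close"]).transform(prices)
print(processed)
```

Keeping all configuration in `__init__` and all data handling in `transform` mirrors the pattern used by the preprocessors in this section, such as `DifferencePreProcessor`.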

-------------------------------------------