# Predicting with Tabular Data

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 20/01/25   | Martin | Create  | Started working on ch7: tabular data | 

# Content

* [Introduction](#introduction)

# Introduction

Most data is stored in tables. This type of data is known as tabular data.

Common challenges when using deep neural networks (DNNs) with tabular data:

1. Mixed feature data types (e.g. string, float, int, text, ...)
2. Data in sparse format (more zeros than non-zeros) - DNNs have difficulty converting these into meaningful gradients for backpropagation
3. No state-of-the-art architecture exists to resolve it
4. Less data is available
5. Less interpretable
6. Alternative models (e.g. XGBoost, LightGBM, CatBoost) often perform better

# Processing Numerical Data

Types of numerical data:

* Data expressed as a floating-point number
* Integers with a limited number of unique values that carry an order (ordinal)
* Integers that do not represent a class or label (unlike a standard categorical feature)

Potential issues:

* Missing data
* Constant values - slow down computation and interfere with the bias in each neuron
* Skewed distributions
* Non-standardised data (extreme values)
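
As a rough sketch, each of the issues above can be checked with pandas before building any pipeline. The toy DataFrame and its column names are hypothetical and only serve to exhibit one issue per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one column per issue listed above
df = pd.DataFrame({
    "with_missing": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
    "constant": [5.0] * 6,
    "skewed": [1.0, 1.0, 2.0, 2.0, 3.0, 500.0],
    "well_behaved": [-1.0, -0.5, 0.0, 0.5, 1.0, 0.0],
})

# Missing data: fraction of NaN values per column
missing_share = df.isna().mean()

# Constant values: columns with at most one distinct non-missing value
constant_cols = df.columns[df.nunique(dropna=True) <= 1].tolist()

# Skewed distributions: sample skewness per column (0 means symmetric)
skewness = df.skew()

# Non-standardised data: a very wide range hints at extreme values
value_range = df.max() - df.min()

print(missing_share, constant_cols, skewness, value_range, sep="\n\n")
```

Which thresholds count as "too skewed" or "too wide" depends on the dataset, so the printed values are best read as a starting point for inspection.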

🚨 __CRITICAL: Deal with these issues before passing data to the neural network, otherwise it will return errors__

Build a scikit-learn pipeline with the following functions:

* Minimum acceptable variance for a feature to be kept - otherwise there will be unwanted constants in the network
* Imputer for missing values
  - 📜 __NOTE:__ More sophisticated imputation methods can use information from other variables to perform imputation.
  - Imputation is critical because missing values sometimes represent a relationship that is hidden within the dataset but is not captured numerically
  - Potentially use binary feature encoding for these missing values
  - Whether to add binary features to denote which values were missing (the `add_indicator` option)
* Whether to transform the distribution of variables to resemble a more symmetric/expected distribution
* Rescale variables or outputs based on their statistical distribution (Standardisation, Normalisation, etc.)
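
To make the steps above concrete, here is a minimal sketch that chains them on a tiny array. The data is made up, and the step names `transformer` and `scaler` as well as the choice of `n_quantiles=5` are assumptions for illustration; only scikit-learn classes already imported in these notes are used.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer, StandardScaler

# Hypothetical toy data: column 1 is constant, column 2 has a NaN and an outlier
X_toy = np.array([
    [1.0, 7.0, 0.1],
    [2.0, 7.0, np.nan],
    [3.0, 7.0, 0.4],
    [4.0, 7.0, 12.0],
    [5.0, 7.0, 0.2],
])

pipe = Pipeline([
    # Drop features whose variance is not above the threshold (removes the constant)
    ("var_filter", VarianceThreshold(threshold=0.0)),
    # Fill NaN with the column mean and append a binary "was missing" column
    ("imputer", SimpleImputer(strategy="mean", add_indicator=True)),
    # Map each column towards a normal distribution to reduce skew
    ("transformer", QuantileTransformer(
        n_quantiles=5, output_distribution="normal", random_state=0)),
    # Rescale every column to zero mean and unit variance
    ("scaler", StandardScaler()),
])

# Inspect the result of the first two steps only (filter + imputer)
imputed = pipe[:2].fit_transform(X_toy)
print(imputed)

# Run the full pipeline
X_out = pipe.fit_transform(X_toy)
print(X_out.shape)
```

After the first two steps the constant column is gone, the NaN is replaced by the mean of the observed values, and an extra column flags the row that was imputed. Note that in this sketch the indicator column also passes through the transformer and scaler, which may or may not be desirable.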

In [4]:
from catboost.datasets import amazon

X, Xt = amazon()

y = X['ACTION'].apply(lambda x: 1 if x == 1 else 0).values
X.drop(['ACTION'], axis=1, inplace=True)

In [5]:
X.head()

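|   | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | ROLE_FAMILY | ROLE_CODE |
| - | -------- | ------ | ------------- | ------------- | ------------- | ---------- | ---------------- | ----------- | --------- |
| 0 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 | 117906 | 290919 | 117908 |
| 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 | 118536 | 308574 | 118539 |
| 2 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 | 267952 | 19721 | 117880 |
| 3 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 | 240983 | 290919 | 118322 |
| 4 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 | 123932 | 19793 | 119325 |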


In [7]:
import numpy as np
import pandas as pd

# IterativeImputer is experimental: this import must come before importing it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.preprocessing import StandardScaler, QuantileTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline

In [None]:
def assemble_numeric_pipeline(
  variance_threshold=0.0,
  imputer='mean',
  multivariate_imputer=False,
  add_indicator=True,
  quantile_transformer='normal',
  scaler=True
):
  numeric_pipeline = []
  
  # Variance Threshold - Removes all low-variance features based on threshold specified
  if variance_threshold is not None:
    if isinstance(variance_threshold, float):
      numeric_pipeline.append(('var_filter', VarianceThreshold(threshold=variance_threshold)))
    else:
      numeric_pipeline.append(('var_filter', VarianceThreshold()))
  
  # Imputer - Replaces NaN values with specified logic
  ## 2 types of imputers
  ##   1. Multivariate Imputer - builds a regression model from other features and existing data
  ##                             and fills NaN values with predictions of model
  ##   2. Simple Imputer - Use summary statistics from existing values to fill missing data
  if imputer is not None:
    if multivariate_imputer:
      numeric_pipeline.append(('imputer', IterativeImputer(
        estimator=ExtraTreesRegressor(n_estimators=100, n_jobs=-2),
        initial_strategy=imputer,
        add_indicator=add_indicator
      )))
    else:
      numeric_pipeline.append(('imputer', SimpleImputer(
        strategy=imputer,
        add_indicator=add_indicator
      )))
