# Windmark Training Pipeline Demonstration

In [11]:
import polars as pl

import windmark as wm

In [12]:
ledger = "/home/grantham/windmark/data/quarter_ledger.parquet"

In [13]:
(
    pl.read_parquet(ledger)
    .filter(pl.col("customer_id") == pl.col("customer_id").first())
    .sort(pl.col("order_id"))
    .tail(10)
)

| use_chip | merchant_state | merchant_city | mcc | card | timestamp | has_bad_cvv | has_bad_pin | has_bad_expiration | has_bad_zipcode | has_bad_card_number | has_technical_glitch | has_insufficient_balance | amount | merchant_name | target | transaction_id | customer_id | order_id | timedelta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | i64 | str | datetime[μs] | bool | bool | bool | bool | bool | bool | bool | f64 | str | str | str | str | u32 | i64 |
| "Chip Transacti…" | "CA" | "Watsonville" | 7538 | "502025075610" | 2020-02-27 09:10:00 | False | False | False | False | False | False | False | 27.77 | "Higgins, Bond …" | "No" | "2020-02-27 09:…" | "Kenneth Chapma…" | 82346 | 60 |
| "Chip Transacti…" | "CA" | "Watsonville" | 7538 | "34653790880949…" | 2020-02-27 10:48:00 | False | False | False | False | False | False | False | 25.97 | "Wilkerson-Harr…" | "No" | "2020-02-27 10:…" | "Kenneth Chapma…" | 82347 | 0 |
| "Chip Transacti…" | "CA" | "Watsonville" | 5812 | "35774201853993…" | 2020-02-27 11:33:00 | False | False | False | False | False | False | False | 12.14 | "Gibbs, Dixon a…" | "No" | "2020-02-27 11:…" | "Kenneth Chapma…" | 82348 | 0 |
| "Swipe Transact…" | "CA" | "San Jose" | 4829 | "35412309978311…" | 2020-02-27 16:55:00 | False | False | False | False | False | False | False | 140.0 | "Warren, Valdez…" | "No" | "2020-02-27 16:…" | "Kenneth Chapma…" | 82349 | 0 |
| "Swipe Transact…" | "CA" | "San Jose" | 4829 | "35412309978311…" | 2020-02-27 18:12:00 | False | False | False | False | False | False | False | 100.0 | "Warren, Valdez…" | "No" | "2020-02-27 18:…" | "Kenneth Chapma…" | 82350 | 0 |
| "Swipe Transact…" | "CA" | "San Jose" | 4829 | "34653790880949…" | 2020-02-27 18:51:00 | False | False | False | False | False | False | False | 140.0 | "Warren, Valdez…" | "No" | "2020-02-27 18:…" | "Kenneth Chapma…" | 82351 | 0 |
| "Chip Transacti…" | "CA" | "Watsonville" | 5411 | "35412309978311…" | 2020-02-27 22:03:00 | False | False | False | False | False | False | False | 10.39 | "Bentley-Romero…" | "No" | "2020-02-27 22:…" | "Kenneth Chapma…" | 82352 | 0 |
| "Swipe Transact…" | "CA" | "Merced" | 5719 | "35412309978311…" | 2020-02-28 09:48:00 | False | False | False | False | False | False | False | 23.39 | "Walters, Moore…" | "No" | "2020-02-28 09:…" | "Kenneth Chapma…" | 82353 | 0 |
| "Chip Transacti…" | "CA" | "Watsonville" | 7538 | "502025075610" | 2020-02-28 14:02:00 | False | False | False | False | False | False | False | 22.0 | "Higgins, Bond …" | "No" | "2020-02-28 14:…" | "Kenneth Chapma…" | 82354 | 0 |
| "Swipe Transact…" | "CA" | "San Jose" | 4829 | "35412309978311…" | 2020-02-28 16:46:00 | False | False | False | False | False | False | False | 120.0 | "Warren, Valdez…" | "No" | "2020-02-28 16:…" | "Kenneth Chapma…" | 82355 | 0 |


### Modeling this complex data with traditional ML techniques is extremely challenging

Transforming this sequence of transactions into a tabular format can require months of exploration and tens of thousands of dollars in compute.
Tabular feature engineering means exploring thousands of candidate aggregate window functions to thoroughly represent the behavior exhibited in previous transactions.
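
To make the cost concrete, here is a minimal, library-free sketch of what a handful of window features look like for a single customer. The `window_features` helper, the chosen windows, and the three aggregates are illustrative assumptions, not part of Windmark; the (timestamp, amount) pairs are taken from the ledger output above.

```python
from datetime import datetime, timedelta

# A few (timestamp, amount) pairs for one customer, taken from the ledger output.
transactions = [
    (datetime(2020, 2, 27, 9, 10), 27.77),
    (datetime(2020, 2, 27, 10, 48), 25.97),
    (datetime(2020, 2, 27, 16, 55), 140.0),
    (datetime(2020, 2, 28, 9, 48), 23.39),
]


def window_features(rows, window):
    """For each row, aggregate the amounts of strictly earlier rows within `window`."""
    features = []
    for i, (ts, _) in enumerate(rows):
        past = [amount for t, amount in rows[:i] if ts - t <= window]
        features.append(
            {
                "count": len(past),
                "total": sum(past),
                "largest": max(past, default=0.0),
            }
        )
    return features


# Every (window, aggregate) pair becomes one tabular column.
tabular = {
    f"{hours}h": window_features(transactions, timedelta(hours=hours))
    for hours in (1, 6, 24)
}
```

Three windows and three aggregates over a single field already yield nine columns; sweeping more fields, windows, and aggregates is what drives the search into the thousands.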

<style> td, th {border: none!important;} </style>

| ![tabular](docs/diagrams/tabular.drawio.svg) |
|:--:| 
| *Model development requirements for modeling complex sequential data* |

### There is a better way to model this complex sequential data in its natural form

Instead of transforming this complex sequential data to fit the constraints of a tabular model, we can build a custom model architecture that learns from the data as it already exists.

The complex sequences of customer activity are sampled to create training observations.
An event within a sequence is sampled, and the previous events' fields are collated to create a 2D training observation.
Thousands of such observations are created every second to feed a custom model architecture.
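
The sampling step can be sketched in plain Python as follows. This is an illustration only, not Windmark's implementation: the `collate` and `sample_observation` helpers, the three-field subset, the toy `"chip"` / `"swipe"` labels, and the left-padding with `None` are all assumptions made for the example.

```python
import random

FIELDS = ["use_chip", "mcc", "amount"]  # illustrative subset of the fields

# Toy sequence for one customer, ordered oldest to newest.
events = [
    {"use_chip": "chip", "mcc": 7538, "amount": 27.77},
    {"use_chip": "chip", "mcc": 7538, "amount": 25.97},
    {"use_chip": "chip", "mcc": 5812, "amount": 12.14},
    {"use_chip": "swipe", "mcc": 4829, "amount": None},
]


def collate(events, index, n_context):
    """Build an n_context x len(FIELDS) grid from the events before `index`.

    Rows are left-padded with None when fewer than n_context events exist.
    """
    context = events[max(0, index - n_context):index]
    padding = [[None] * len(FIELDS) for _ in range(n_context - len(context))]
    return padding + [[event.get(field) for field in FIELDS] for event in context]


def sample_observation(events, n_context, rng):
    """Pick one event at random and collate the events that precede it."""
    index = rng.randrange(len(events))
    return index, collate(events, index, n_context)


index, observation = sample_observation(events, n_context=3, rng=random.Random(0))
```

Each observation has one row per context event and one column per field, which is the 2D shape described above. Whether the sampled event itself belongs in the grid is a design choice this sketch leaves out.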

<style> td, th {border: none!important;} </style>

| ![context](docs/diagrams/context.drawio.svg) |
|:--:|
| *Example process of sampling a sequence's events and fields into an observation* |

In [14]:
# create a custom "schema" that represents the data to be modeled
schema = wm.Schema.create(
    # index of each unique "sequence"
    sequence_id="customer_id",
    # index of each unique "event" within a "sequence"
    event_id="transaction_id",
    # how to sort the events within a sequence
    order_by="order_id",
    # classification or regression target during fine-tuning
    target_id="target",
    # the desired "fields" available within each event
    use_chip="discrete",
    merchant_state="discrete",
    merchant_city="discrete",
    merchant_name="entity",
    has_bad_pin="discrete",
    has_bad_zipcode="discrete",
    has_bad_card_number="discrete",
    has_insufficient_balance="discrete",
    has_bad_expiration="discrete",
    has_technical_glitch="discrete",
    has_bad_cvv="discrete",
    card="entity",
    mcc="discrete",
    amount="continuous",
    timedelta="continuous",
    timestamp="temporal",
)

In [15]:
# model architecture, pre-training, and fine-tuning configuration
params = wm.Hyperparameters(
    n_steps=30,
    batch_size=128,
    max_epochs=16,
    d_field=48,
)

# how to perform a stratified split among the sequences to create training, validation, and test data
split = wm.SequenceSplitter(train=0.70, validate=0.15, test=0.15)
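
As a rough illustration of what a stratified split over sequences means, the sketch below assigns whole sequences (never individual events) to a partition while preserving the label mix in each one. The `split_sequences` function and the per-sequence label are hypothetical; the internals of `wm.SequenceSplitter` are not shown in this notebook and may differ.

```python
import random
from collections import Counter, defaultdict


def split_sequences(labels, train=0.70, validate=0.15, test=0.15, seed=0):
    """Assign each sequence ID to a partition, stratifying on a per-sequence label.

    `labels` maps sequence ID -> label, for example whether the sequence
    ever contains a positive target.
    """
    assert abs(train + validate + test - 1.0) < 1e-9
    by_label = defaultdict(list)
    for sequence_id, label in labels.items():
        by_label[label].append(sequence_id)

    rng = random.Random(seed)
    assignment = {}
    for label in sorted(by_label, key=repr):
        ids = sorted(by_label[label])
        rng.shuffle(ids)
        n_train = round(len(ids) * train)
        n_validate = round(len(ids) * validate)
        for i, sequence_id in enumerate(ids):
            if i < n_train:
                assignment[sequence_id] = "train"
            elif i < n_train + n_validate:
                assignment[sequence_id] = "validate"
            else:
                assignment[sequence_id] = "test"
    return assignment


# Toy input: 100 customers, 20 of which carry a positive label.
labels = {f"customer_{i}": i % 5 == 0 for i in range(100)}
assignment = split_sequences(labels)
counts = Counter(assignment.values())
```

Splitting by sequence rather than by event keeps all of a customer's history in a single partition, so evaluation is not contaminated by events from customers seen during training.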

The model architecture supports a wide variety of complex, **nullable** input field types, including:
- `discrete`: Categorical values with a finite number of levels (transaction type, merchant category code, boolean flags, cities / states)
- `continuous`: Numerical values of any distribution (transaction amount, time differences, current account balance)
- `entity`: Categorical values with an unbounded number of levels (card numbers, merchant names, IP addresses, device IDs)
- `temporal`: Timestamp values (current transaction date, customer sign up date)

We anticipate providing support for the following additional field types:
- `textual`: Raw unstructured text (comments, tweet contents, user feedback, call transcripts)
- `geospatial`: Geographic coordinates (rideshare pickup location, merchant address, tweet location)

Any of these field values can be `null`.
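
One way to picture nullable categorical inputs is to reserve an index for the missing state. The sketch below does that for a `discrete` field with a fixed vocabulary, and uses the hashing trick for an `entity` field whose levels cannot be enumerated in advance. Both encoders, the reserved index `0`, and the bucket count are assumptions made for illustration; this notebook does not state how Windmark encodes these types.

```python
import hashlib

NULL = 0  # index reserved for missing values (an assumption of this sketch)


def build_vocabulary(values):
    """Map each distinct non-null level to an index, leaving 0 for null."""
    levels = sorted({v for v in values if v is not None}, key=str)
    return {level: i + 1 for i, level in enumerate(levels)}


def encode_discrete(value, vocabulary):
    """Look up a finite-level value; null maps to the reserved index."""
    return NULL if value is None else vocabulary[value]


def encode_entity(value, n_buckets=1024):
    """Hash an unbounded identifier into one of the slots 1..n_buckets."""
    if value is None:
        return NULL
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return 1 + int.from_bytes(digest[:8], "big") % n_buckets


mcc_vocabulary = build_vocabulary([5411, None, 4829, 5411])
mcc_index = encode_discrete(5411, mcc_vocabulary)
card_index = encode_entity("502025075610")
```

Hashing trades occasional collisions for a fixed-size input space, which is why it suits identifiers such as card numbers or device IDs.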

### The two-dimensional observation may be modeled with a custom model architecture that is dynamically defined by the sequence structure

The following model architecture was dynamically constructed for four fields:
- transaction type (`discrete`)
- transaction amounts (`continuous`)
- current account balance (`continuous`)
- transaction timestamp (`temporal`)

Gray fields indicate a non-valued `null` state.

<style> td, th { border: none!important; } </style>

| ![architecture](docs/diagrams/architecture.drawio.svg) | 
|:--:| 
| *example model architecture* |

### The pipeline is able to automate the entire training process from end-to-end

The model architecture is instantiated from the requested schema.
The pipeline then pre-trains the model in a manner similar to how Large Language Models (LLMs) learn from unlabeled text.
Lastly, the pipeline fine-tunes the pre-trained model to solve a classification problem.
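
The comparison to LLM pre-training suggests a self-supervised objective. One common choice, assumed here purely for illustration, is to hide a random subset of field values and ask the model to reconstruct them. The `mask_fields` helper, the `MASK` token, and the masking rate are hypothetical and not taken from Windmark.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a hidden value


def mask_fields(observation, rate, rng):
    """Hide a random subset of non-null cells.

    Returns the masked grid and a dict of (row, column) -> hidden value,
    which serves as the reconstruction target.
    """
    masked, targets = [], {}
    for r, row in enumerate(observation):
        new_row = []
        for c, value in enumerate(row):
            if value is not None and rng.random() < rate:
                targets[(r, c)] = value
                new_row.append(MASK)
            else:
                new_row.append(value)
        masked.append(new_row)
    return masked, targets


# Toy 2D observation: rows are events, columns are fields.
observation = [
    ["chip", 7538, 27.77],
    ["chip", 5812, None],
    ["swipe", 4829, 140.0],
]
masked, targets = mask_fields(observation, rate=0.5, rng=random.Random(0))
```

No labels are needed for this step: the targets come from the data itself, which is what lets pre-training use every sequence in the ledger.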

<style> td, th { border: none!important; } </style>

| ![pipeline](docs/diagrams/pipeline.drawio.svg) |
|:--:| 
| *pipeline implementation overview* |

In [16]:
wm.train(datapath=ledger, schema=schema, params=params, split=split)

Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Output()

`Trainer.fit` stopped: `max_epochs=16` reached.


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Output()

Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Output()

`Trainer.fit` stopped: `max_epochs=16` reached.


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Output()

Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Output()

### Features currently in development:
- Data quality checks and data exploration visualizations
- Larger-than-memory data preprocessing (Spark)
- Support for alternative supervised learning tasks (Regression, Survival)
- Automatically generated report of model performance
- Model deployment pipeline