# Notes

- All numeric values (integer, decimal, timestamp) and boolean values should be stored as floats that can be used by the model directly.
- String values might be categorical, in which case we can create a vocabulary, map each value to an integer, and use the integers as numeric values.
- String values might be free text, in which case we can tokenize them and create a vocabulary over the tokens.
- The model should accept a primary key or row number.
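
The vocabulary idea in the notes above can be sketched in plain Python. This is only an illustration under assumptions: `build_vocab` and `encode` are hypothetical helper names, the sample values are made up, id 0 is reserved for unseen values, and free text is tokenized with a naive whitespace split.

```python
from collections import Counter

def build_vocab(values, min_count=1, unknown="<unk>"):
    """Map each distinct string to an integer id; id 0 is reserved for unknowns."""
    counts = Counter(values)
    vocab = {unknown: 0}
    # Sort so that id assignment is deterministic across runs.
    for value in sorted(counts):
        if counts[value] >= min_count:
            vocab[value] = len(vocab)
    return vocab

def encode(values, vocab, unknown="<unk>"):
    """Convert strings to floats via the vocabulary, falling back to the unknown id."""
    return [float(vocab.get(v, vocab[unknown])) for v in values]

# Categorical column (made-up values): one id per distinct value.
colors = ["red", "green", "red", "blue"]
color_vocab = build_vocab(colors)
color_ids = encode(colors, color_vocab)

# Free-text column (made-up values): tokenize first, then build the vocabulary over tokens.
texts = ["the sky is blue", "the grass is green"]
tokens = [text.lower().split() for text in texts]
token_vocab = build_vocab(tok for row in tokens for tok in row)
token_ids = [encode(row, token_vocab) for row in tokens]
```

Reserving an id for unknown values keeps encoding well defined for strings that were not present when the vocabulary was built.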

# Assumptions

- <1 million rows, single machine
    - Will eventually remove size limitation and support distributed training
    - Will also need better way to store embeddings (partitions?)

# ID Column

## Initialization

In [13]:
from summon.torch import Numeric

model = Numeric(columns=1)

## Data

In [3]:
import torch
import pandas as pd
import numpy as np
from pathlib import Path

data_dir = Path("/tmp/data/")

df = pd.read_parquet(str(data_dir / "fever.snappy.parquet"))

X = torch.arange(len(df))
Y = torch.tensor(df["id"].to_numpy(dtype=np.int32), dtype=torch.int32)

X.shape, Y.shape

(torch.Size([426559]), torch.Size([426559]))

## Standard Scaling

Standard scaling can be used for row numbers. Row numbers are sequential, contiguous, and without any outliers. Min/max scaling would map them to steps of about 1/N between adjacent rows, which are very small numbers that can require double precision.

In [4]:
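mean = X.double().mean()
var = X.double().var()
# NOTE: this divides by the variance. Conventional standard scaling divides by
# the standard deviation (var.sqrt()), which would keep values near unit scale.
Xnorm = (X.double() - mean) / var

# Likely cause of the earlier False result: .long() truncates toward zero, so a
# round-trip value such as 4.999... becomes 4. Round before casting.
torch.eq(X, ((Xnorm * var) + mean).round().long()).all()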

X[0:10], ((Xnorm[0:10] * var) + mean).long(), Xnorm[0:10]

(tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 tensor([-1.4066e-05, -1.4066e-05, -1.4066e-05, -1.4066e-05, -1.4066e-05,
         -1.4066e-05, -1.4066e-05, -1.4066e-05, -1.4065e-05, -1.4065e-05],
        dtype=torch.float64))
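
For comparison, here is a minimal plain-Python sketch of the same round trip that divides by the standard deviation and rounds instead of truncating. It is an illustration, not the notebook's method: `standard_scale` and `inverse_standard_scale` are hypothetical names, and it uses the population standard deviation, whereas `torch.var` defaults to the unbiased estimate.

```python
import math

def standard_scale(values):
    """Return (scaled, mean, std) using the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values], mean, std

def inverse_standard_scale(scaled, mean, std):
    """Undo the scaling, rounding to the nearest integer instead of truncating."""
    return [round(s * std + mean) for s in scaled]

rows = list(range(426_559))  # same length as X above
scaled, mean, std = standard_scale(rows)
restored = inverse_standard_scale(scaled, mean, std)

# int() truncates toward zero, like .long(); count how many rows that breaks.
truncated = [int(s * std + mean) for s in scaled]
mismatches = sum(1 for a, b in zip(rows, truncated) if a != b)
```

With this divisor the scaled row numbers span roughly -1.73 to 1.73 (that is, plus or minus the square root of 3 for evenly spaced values), rather than values on the order of 1e-05.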

## Min/Max Scaling

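Min/max scaling ensures data is in the range 0-1. This is useful for data that is not sequential or contiguous, or that is not normally distributed. Note that min/max scaling is sensitive to outliers: a single extreme value compresses the remaining values into a narrow part of the range.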

In [5]:
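# min/max scaling (named y_min/y_max to avoid shadowing the built-in min/max)
y_min = Y.min()
y_max = Y.max()
Ynorm = (Y - y_min) / (y_max - y_min)

# Likely cause of the earlier False result: .long() truncates toward zero, and
# Ynorm is float32, so the round trip can land just below the original integer.
# Round before casting.
torch.eq(Y, (Ynorm * (y_max - y_min) + y_min).round().long()).all()

Y[0:10], (Ynorm[0:10] * (y_max - y_min) + y_min).long(), Ynorm[0:10]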

(tensor([ 75397,  75397, 150448, 150448, 214861, 156709,  83235, 129629, 129629,
         149579], dtype=torch.int32),
 tensor([ 75397,  75397, 150448, 150448, 214861, 156709,  83235, 129629, 129629,
         149579]),
 tensor([0.3286, 0.3286, 0.6557, 0.6557, 0.9364, 0.6830, 0.3628, 0.5650, 0.5650,
         0.6519]))
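
The min/max round trip can be sketched the same way in plain Python. This is a minimal sketch under assumptions: the helper names are hypothetical, the guard for a constant column is an added design choice, and the minimum and maximum are taken over the ten sample ids shown above rather than the full column, so the scaled values differ from the notebook output.

```python
def min_max_scale(values):
    """Return (scaled, lo, hi) with scaled values in the range 0-1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant column: avoid division by zero and map everything to 0.
        return [0.0 for _ in values], lo, hi
    return [(v - lo) / span for v in values], lo, hi

def inverse_min_max_scale(scaled, lo, hi):
    """Undo the scaling, rounding to the nearest integer instead of truncating."""
    return [round(s * (hi - lo) + lo) for s in scaled]

# The first ten ids from the output above.
ids = [75397, 75397, 150448, 150448, 214861, 156709, 83235, 129629, 129629, 149579]
scaled, lo, hi = min_max_scale(ids)
restored = inverse_min_max_scale(scaled, lo, hi)
```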

# Training

In [14]:
import torch
from torch.optim import SGD
from torch.nn import L1Loss

g = torch.Generator().manual_seed(2147483647)

model.train()

optimizer = SGD(model.parameters(), lr=1)
mae_loss = L1Loss()

batch_size = 32

In [16]:
iterations = 100_000

for i in range(iterations):
    optimizer.zero_grad()

    # mini-batch
    ix = torch.randint(0, len(X), (batch_size, ), generator=g)
    uX, uY = Xnorm[ix], Ynorm[ix]

    # forward pass
    x = uX.view(-1, 1).float()
    x = model(x)

    # loss
    loss = mae_loss(x.view(-1), uY)

    # optimize
    loss.backward()
    optimizer.step()

    # track stats
    if i % 20_000 == 0:
        print(f"{i} / {iterations}: {loss.item():.3f}")

0 / 100000: 0.675
20000 / 100000: 0.599
40000 / 100000: 0.425
60000 / 100000: 0.363
80000 / 100000: 0.461
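
Each logged value above is the loss of a single 32-row mini-batch, so it is noisy and can rise between checkpoints even when the trend is downward. One common way to read such a curve is an exponential moving average. The sketch below is an assumption-laden illustration: `ema` is a hypothetical helper, `beta=0.9` is an arbitrary choice, and in practice it would be fed the loss from every step rather than only the five logged values.

```python
def ema(values, beta=0.9):
    """Exponential moving average; the first value seeds the average."""
    smoothed = []
    avg = None
    for v in values:
        avg = v if avg is None else beta * avg + (1 - beta) * v
        smoothed.append(avg)
    return smoothed

# The five per-batch losses printed above.
logged = [0.675, 0.599, 0.425, 0.363, 0.461]
smoothed = ema(logged)
```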


In [18]:
@torch.no_grad()
def total_loss() -> float:
    """Estimate the MAE on a random sample of 100,000 rows (not the full dataset)."""
    # mini-batch
    ix = torch.randint(0, len(Xnorm), (100_000, ), generator=g)
    uX, uY = Xnorm[ix], Ynorm[ix]

    # forward pass
    x = uX.view(-1, 1).float()
    x = model(x)

    # loss
    loss = mae_loss(x.view(-1), uY)

    return loss.item()

f"loss: {total_loss()}"

'loss: 0.5209963321685791'
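
The loss above is in normalized units, where the targets span 0 to 1. Two checks help interpret it. Because min/max scaling is an affine map, multiplying the normalized MAE by (max - min) gives the MAE in id units. And for targets in the range 0-1, a constant prediction of 0.5 can never have an MAE above 0.5, which makes it a simple baseline. The sketch below illustrates both with made-up numbers: `lo`, `hi`, `targets`, and `preds` are hypothetical, not taken from the dataset.

```python
def mae(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

def denormalize(values, lo, hi):
    """Invert min/max scaling back to the original units."""
    return [v * (hi - lo) + lo for v in values]

lo, hi = 0, 200_000            # hypothetical id range
targets = [0.1, 0.4, 0.5, 0.9]  # hypothetical normalized targets
preds = [0.2, 0.3, 0.7, 0.6]    # hypothetical model outputs

mae_norm = mae(preds, targets)
mae_ids = mae(denormalize(preds, lo, hi), denormalize(targets, lo, hi))

# Baseline: always predict the midpoint of the normalized range.
baseline = mae([0.5] * len(targets), targets)
```

Read this way, a sampled loss above 0.5 suggests the model has not yet beaten the constant baseline on that sample.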